## Tokenizing .loom or .h5ad single cell RNA-seq data to rank value encoding .dataset format

#### Input data is a directory with .loom or .h5ad files containing raw counts from single cell RNA-seq data, including all genes detected in the transcriptome without feature selection. The input file type is specified by the file_format argument of the tokenize_data function.

#### The discussion below references the .loom file format, but the analogous labels are required for .h5ad files. Because the two file types are transposed relative to each other, a loom row attribute corresponds to an h5ad column attribute and vice versa.

#### Genes should be labeled with Ensembl IDs (loom row attribute "ensembl_id"), which provide a unique identifier for conversion to tokens. Other forms of gene annotations (e.g. gene names) can be converted to Ensembl IDs via Ensembl BioMart. Cells should be labeled with the total read count in the cell (loom column attribute "n_counts") to be used for normalization.
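
#### As an illustration of how these two labels could be used, the sketch below normalizes one cell's raw counts by its total count, ranks the detected genes from highest to lowest normalized value, and maps their Ensembl IDs to integer tokens. It is a simplified sketch and not the tokenizer's actual implementation: the function name `rank_value_encode`, the contents of `token_dict`, the `gene_medians` scaling factors, and the target sum of 10,000 are all assumptions made for this example.

```python
def rank_value_encode(counts, n_counts, token_dict, gene_medians, target_sum=10_000):
    """Return token ids for one cell, ordered by decreasing normalized expression.

    counts       -- {ensembl_id: raw count} for one cell
    n_counts     -- total read count of the cell (the "n_counts" attribute)
    token_dict   -- hypothetical {ensembl_id: integer token}
    gene_medians -- hypothetical {ensembl_id: per-gene scaling factor}
    """
    scored = []
    for gene, count in counts.items():
        # Skip undetected genes and genes that have no token.
        if count <= 0 or gene not in token_dict:
            continue
        normalized = count / n_counts * target_sum / gene_medians.get(gene, 1.0)
        scored.append((normalized, gene))
    # Highest normalized value first; ties are broken by Ensembl ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token_dict[gene] for _, gene in scored]


# Toy inputs (made-up values, for illustration only).
token_dict = {"ENSG00000187608": 5, "ENSG00000188290": 6, "ENSG00000177757": 7}
gene_medians = {"ENSG00000187608": 2.0, "ENSG00000188290": 1.0, "ENSG00000177757": 1.0}
cell_counts = {"ENSG00000187608": 8, "ENSG00000188290": 6, "ENSG00000177757": 0}

tokens = rank_value_encode(cell_counts, n_counts=14,
                           token_dict=token_dict, gene_medians=gene_medians)
print(tokens)  # [6, 5]
```

#### In this toy cell the gene with the larger raw count ends up ranked second because its assumed scaling factor is larger, and the undetected gene receives no token.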

#### No cell metadata is required, but custom cell attributes may be passed on to the tokenized dataset by providing a dictionary of custom attributes to be added, formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes "cell_type" and "organ_major" and one would like to retain these attributes as labels in the tokenized dataset with the new names "cell_type" and "organ", respectively, the following custom attribute dictionary should be provided: {"cell_type": "cell_type", "organ_major": "organ"}.
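
#### The effect of such a dictionary can be sketched in plain Python. The helper `rename_cell_attributes` and the toy attribute values below are hypothetical and only illustrate the mapping from input attribute names to output column names; the tokenizer performs the equivalent step internally.

```python
def rename_cell_attributes(cell_attrs, custom_attr_name_dict):
    """Keep only the requested cell attributes, stored under their new names."""
    missing = [name for name in custom_attr_name_dict if name not in cell_attrs]
    if missing:
        raise KeyError(f"attributes not found in the input data: {missing}")
    return {new: list(cell_attrs[old]) for old, new in custom_attr_name_dict.items()}


# Toy per-cell attributes (made-up values, two cells).
cell_attrs = {
    "cell_type": ["CD4 TCM", "NK"],
    "organ_major": ["blood", "blood"],
    "n_counts": [1200, 950],
}
custom_attr_name_dict = {"cell_type": "cell_type", "organ_major": "organ"}

labels = rename_cell_attributes(cell_attrs, custom_attr_name_dict)
print(labels)  # {'cell_type': ['CD4 TCM', 'NK'], 'organ': ['blood', 'blood']}
```

#### Attributes that are not listed in the dictionary (here "n_counts") are not carried over as labels.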

#### Additionally, if the original .loom file contains a cell column attribute called "filter_pass", this column will be used as a binary indicator of whether to include these cells in the tokenized data. All cells with "1" in this attribute will be tokenized, whereas the others will be excluded. One may use this column to indicate QC filtering or other criteria for selection for inclusion in the final tokenized dataset.
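
#### A minimal sketch of this selection rule follows. The QC thresholds (500 to 20,000 total counts) and the helper name `cells_to_tokenize` are assumptions chosen for the example; any criterion that yields a 0/1 value per cell could be used to fill "filter_pass".

```python
def cells_to_tokenize(n_cells, filter_pass=None):
    """Return the indices of the cells that would be tokenized."""
    if filter_pass is None:
        # No "filter_pass" attribute: all cells are tokenized.
        return list(range(n_cells))
    if len(filter_pass) != n_cells:
        raise ValueError("filter_pass must have one entry per cell")
    return [i for i, flag in enumerate(filter_pass) if flag == 1]


# Toy total counts per cell and a hypothetical QC rule.
n_counts = [1200, 80, 950, 40000]
filter_pass = [1 if 500 <= n <= 20000 else 0 for n in n_counts]

kept = cells_to_tokenize(len(n_counts), filter_pass)
print(filter_pass)  # [1, 0, 1, 0]
print(kept)         # [0, 2]
```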

#### If one's data is in a format other than .loom or .h5ad, one can use the relevant tools (such as AnnData tools) to convert the file to .loom or .h5ad format prior to running the transcriptome tokenizer.

In [4]:
from geneformer import TranscriptomeTokenizer

In [5]:
!ls

cell_classification.ipynb
cell_classification.py
example_input_files
extract_and_plot_cell_embeddings.ipynb
gene_classification.ipynb
hyperparam_optimiz_for_disease_classifier.py
in_silico_perturbation.ipynb
pretraining_new_model
token
tokenizing_scRNAseq_data.ipynb


In [6]:
!head -n 8 /mnt/nas/user/yixuan/Geneformer/examples/token/Homo_sapiens.GRCh38.111.gtf

#!genome-build GRCh38.p14
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession GCA_000001405.29
#!genebuild-last-updated 2023-07
1	havana	gene	182696	184174	.	+	.	gene_id "ENSG00000279928"; gene_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
1	havana	transcript	182696	184174	.	+	.	gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1	havana	exon	182696	182746	.	+	.	gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; exon_number "1"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havan

In [7]:
with open('/mnt/nas/user/yixuan/Geneformer/examples/token/Homo_sapiens.GRCh38.111.gtf') as f:
    # Keep annotation lines only; header lines start with '#'
    gtf = [x for x in f if not x.startswith('#')]
len(gtf)

3424897

In [8]:
gtf = [x for x in gtf if 'gene_id "' in x and 'gene_name "' in x]
len(gtf)

3262884

In [9]:
# Extract an (Ensembl gene ID, gene name) pair from each line's attribute field
gtf = [(x.split('gene_id "')[1].split('"')[0], x.split('gene_name "')[1].split('"')[0]) for x in gtf]

In [10]:
gtf=list(set(gtf))

In [11]:
len(gtf)

42640
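
#### The cells above hold every GTF line in memory and deduplicate afterwards. As an alternative sketch, the same kind of Ensembl ID to gene name mapping could be built in one streaming pass that only looks at lines whose feature column is "gene". The function name `gene_id_to_name` is hypothetical, and the two sample lines are modeled on the GTF excerpt shown earlier (the transcript line is shortened).

```python
import io
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gene_id_to_name(lines):
    """Map Ensembl gene IDs to gene names using only 'gene' feature lines."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID_RE.search(fields[8])
        gene_name = GENE_NAME_RE.search(fields[8])
        if gene_id and gene_name:  # genes without a name are skipped
            mapping[gene_id.group(1)] = gene_name.group(1)
    return mapping


sample = io.StringIO(
    '#!genome-build GRCh38.p14\n'
    '1\thavana\tgene\t182696\t184174\t.\t+\t.\tgene_id "ENSG00000279928"; '
    'gene_version "2"; gene_name "DDX11L17"; gene_source "havana"; '
    'gene_biotype "unprocessed_pseudogene";\n'
    '1\thavana\ttranscript\t182696\t184174\t.\t+\t.\tgene_id "ENSG00000279928"; '
    'gene_version "2"; transcript_id "ENST00000624431"; gene_name "DDX11L17";\n'
)
mapping = gene_id_to_name(sample)
print(mapping)  # {'ENSG00000279928': 'DDX11L17'}
```

#### With a real file, the function could be called on the open file handle, e.g. `with open(gtf_path) as f: mapping = gene_id_to_name(f)`, where `gtf_path` stands for the path used above.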

In [12]:
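import scanpy as sc

# Load the h5ad file
adata = sc.read_h5ad('/mnt/nas/user/yixuan/Geneformer/examples/token/10x-Multiome-Pbmc10k-small-RNA.h5ad')
# Convert the list of (Ensembl ID, gene name) pairs to a dict
gtf = dict(gtf)
# Store the Ensembl IDs in the "ensembl_id" attribute expected by the tokenizer
adata.var['ensembl_id'] = adata.var['gene_ids']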

In [25]:
gtf

{'ENSG00000249065': 'PCNAP1',
 'ENSG00000258969': 'LINC02307',
 'ENSG00000204442': 'NALF1',
 'ENSG00000163406': 'SLC15A2',
 'ENSG00000206697': 'RNY1P8',
 'ENSG00000123815': 'COQ8B',
 'ENSG00000286676': 'ACTBP4',
 'ENSG00000153246': 'PLA2R1',
 'ENSG00000154059': 'IMPACT',
 'ENSG00000200817': 'RNU6-899P',
 'ENSG00000200755': 'RNA5SP68',
 'ENSG00000197364': 'S100A7L2',
 'ENSG00000163029': 'SMC6',
 'ENSG00000233702': 'RPL37P13',
 'ENSG00000226867': 'SSX21P',
 'ENSG00000212597': 'RNU6-876P',
 'ENSG00000101230': 'ISM1',
 'ENSG00000235032': 'BMP7-AS1',
 'ENSG00000273555': 'MIR6812',
 'ENSG00000131236': 'CAP1',
 'ENSG00000087494': 'PTHLH',
 'ENSG00000225999': 'NDUFA12P1',
 'ENSG00000096433': 'ITPR3',
 'ENSG00000228466': 'TUBB4AP1',
 'ENSG00000283842': 'MIR4751',
 'ENSG00000243779': 'RPL36AP49',
 'ENSG00000126218': 'F10',
 'ENSG00000165178': 'NCF1C',
 'ENSG00000125991': 'ERGIC3',
 'ENSG00000130520': 'LSM4',
 'ENSG00000198815': 'FOXJ3',
 'ENSG00000207771': 'MIR550A1',
 'ENSG00000249142': 'TMEM18

In [18]:
act=sc.read_h5ad('/mnt/nas/user/yixuan/Multiomics-benchmark-main/data/download/10x-Multiome-Pbmc10k-small/10x-Multiome-Pbmc10k-small-ACTIVE.h5ad')

In [28]:
act.var

gene_ids,colnames.activity.
,A1BG
,A1BG-AS1
,A1CF
,A2M
,A2M-AS1
...,...
,ZYG11A
,ZYG11B
,ZYX
,ZZEF1


In [29]:
# Adding a new column using the dictionary
inverted_dict = {v: k for k, v in gtf.items()}

act.var['gene_ids'] = act.var['colnames.activity.'].map(inverted_dict)

# Setting the index to be the Ensembl IDs column
act.var.set_index('gene_ids', inplace=True)

In [32]:
act.var

gene_ids,colnames.activity.
ENSG00000121410,A1BG
ENSG00000268895,A1BG-AS1
ENSG00000148584,A1CF
ENSG00000175899,A2M
ENSG00000245105,A2M-AS1
...,...
ENSG00000203995,ZYG11A
ENSG00000162378,ZYG11B
ENSG00000159840,ZYX
ENSG00000074755,ZZEF1


In [31]:
act.write_h5ad('/mnt/nas/user/yixuan/Multiomics-benchmark-main/data/download/10x-Multiome-Pbmc10k-small/10x-Multiome-Pbmc10k-small-ACTIVE-id.h5ad')

TypeError: Can't implicitly convert non-string objects to strings
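
#### A plausible cause of this error, assuming some gene names in "colnames.activity." have no entry in the GTF-derived dictionary, is that `.map()` returns NaN for those names, so the new index mixes strings with non-string values that cannot be written to .h5ad. The sketch below reproduces the situation on a small pandas DataFrame standing in for `act.var` and keeps only the genes that received an Ensembl ID. The toy gene list and variable names are assumptions; note also that inverting the dictionary keeps a single Ensembl ID for any gene name that occurs more than once.

```python
import pandas as pd

# Toy stand-ins for the GTF dictionary and for act.var.
gtf = {"ENSG00000121410": "A1BG", "ENSG00000175899": "A2M"}
name_to_id = {name: gene_id for gene_id, name in gtf.items()}
var = pd.DataFrame({"colnames.activity.": ["A1BG", "A2M", "NOT_IN_GTF"]})

# Names missing from the dictionary map to NaN.
mapped = var["colnames.activity."].map(name_to_id)
has_id = mapped.notna().to_numpy()
print(f"{(~has_id).sum()} gene name(s) without an Ensembl ID")

# Keep only mapped genes and use their (string) Ensembl IDs as the index.
var_mapped = var.loc[has_id].copy()
var_mapped.index = pd.Index(mapped[has_id].astype(str), name="gene_ids")
print(var_mapped)
```

#### For the AnnData object itself, the same boolean mask could be used to subset the genes first (for example `act = act[:, has_id].copy()`) before assigning the Ensembl IDs and writing the file.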

In [13]:
adata.obs['cell_type']

cells
AAACAGCCAATCCCTT-1      CD4 TCM
AAACAGCCAATGCGCT-1    CD4 Naive
AAACAGCCACCAACCG-1    CD8 Naive
AAACAGCCAGGATAAC-1    CD4 Naive
AAACAGCCAGTTTACG-1      CD4 TCM
                        ...    
TTTGTTGGTGACATGC-1    CD8 Naive
TTTGTTGGTGTTAAAC-1    CD8 Naive
TTTGTTGGTTAGGATT-1           NK
TTTGTTGGTTGGTTAG-1      CD4 TCM
TTTGTTGGTTTGCAGA-1    CD8 TEM_1
Name: cell_type, Length: 9631, dtype: category
Categories (19, object): ['CD4 Naive', 'CD4 TCM', 'CD4 TEM', 'CD8 Naive', ..., 'Treg', 'cDC', 'gdT', 'pDC']

In [14]:
# Preview var indexed by Ensembl ID (returns a new DataFrame; adata.var itself is unchanged)
adata.var.set_index('ensembl_id')

ensembl_id,gene_ids,feature_types,genome,chrom,chromStart,chromEnd,name,score,strand,thickStart,...,gene_name,hgnc_id,havana_gene,tag,n_counts,highly_variable,highly_variable_rank,means,variances,variances_norm
ENSG00000177757,ENSG00000177757,Gene Expression,GRCh38,chr1,817370,819837,ENSG00000177757,.,+,.,...,FAM87B,HGNC:32236,OTTHUMG00000002471.2,,16.0,True,6452.0,0.001661,0.001866,1.025814
ENSG00000188290,ENSG00000188290,Gene Expression,GRCh38,chr1,998961,1000172,ENSG00000188290,.,-,.,...,HES4,HGNC:24149,OTTHUMG00000040758.2,,776.0,True,580.0,0.080573,0.184992,1.798308
ENSG00000187608,ENSG00000187608,Gene Expression,GRCh38,chr1,1001137,1014540,ENSG00000187608,.,+,.,...,ISG15,HGNC:4053,OTTHUMG00000040777.4,,5597.0,True,1456.0,0.581144,1.431603,1.415658
ENSG00000224969,ENSG00000224969,Gene Expression,GRCh38,chr1,1011996,1013193,ENSG00000224969,.,-,.,...,AL645608.1,,OTTHUMG00000040779.1,,27.0,True,6846.0,0.002803,0.003211,1.015311
ENSG00000131591,ENSG00000131591,Gene Expression,GRCh38,chr1,1081817,1116361,ENSG00000131591,.,-,.,...,C1orf159,HGNC:26062,OTTHUMG00000000745.9,,868.0,True,5244.0,0.090126,0.123964,1.070259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000012817,ENSG00000012817,Gene Expression,GRCh38,chrY,19703864,19744939,ENSG00000012817,.,-,.,...,KDM5D,HGNC:11115,OTTHUMG00000036508.3,,6.0,True,900.0,0.000623,0.001038,1.593556
ENSG00000198692,ENSG00000198692,Gene Expression,GRCh38,chrY,20575775,20593154,ENSG00000198692,.,+,.,...,EIF1AY,HGNC:3252,OTTHUMG00000036544.3,,6.0,True,2255.0,0.000623,0.000830,1.274726
ENSG00000198804,ENSG00000198804,Gene Expression,GRCh38,chrM,5903,7445,ENSG00000198804,.,+,.,...,MT-CO1,HGNC:7419,,,492715.0,True,1167.0,51.159277,1739.724368,1.493188
ENSG00000198712,ENSG00000198712,Gene Expression,GRCh38,chrM,7585,8269,ENSG00000198712,.,+,.,...,MT-CO2,HGNC:7421,,,687834.0,True,5648.0,71.418752,2244.862739,1.054136


In [15]:
adata.obs['n_counts']=adata.obs['nCount_RNA']
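
#### Here the per-cell totals are copied from an existing "nCount_RNA" column. If a dataset has no such column, the totals could instead be computed from the raw count matrix, as in the sketch below. The toy matrix is made up; with an AnnData object one would apply the same row sum to the raw counts and assign the result to `adata.obs['n_counts']`.

```python
import numpy as np

# Toy raw count matrix: rows are cells, columns are genes (AnnData orientation).
X = np.array([[0, 3, 5],
              [2, 0, 0],
              [1, 1, 1]])

# Sum over genes for each cell. np.asarray(...).ravel() also flattens the
# 2-D result that scipy sparse matrices return from .sum(axis=1).
n_counts = np.asarray(X.sum(axis=1)).ravel()
print(n_counts)  # [8 2 3]
```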

In [16]:
adata.write_h5ad('/mnt/nas/user/yixuan/Geneformer/examples/token/in/10x.h5ad')

In [17]:
tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=16)
tk.tokenize_data("/mnt/nas/user/yixuan/Geneformer/examples/token/in", 
                 "/mnt/nas/user/yixuan/Geneformer/examples/token/out", 
                 "tk", 
                 file_format="h5ad")

Tokenizing /mnt/nas/user/yixuan/Geneformer/examples/token/in/10x.h5ad
/mnt/nas/user/yixuan/Geneformer/examples/token/in/10x.h5ad has no column attribute 'filter_pass'; tokenizing all cells.


Creating dataset.


Map (num_proc=16):   0%|          | 0/9631 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/9631 [00:00<?, ? examples/s]

In [None]:
tk

In [15]:
# tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ"}, nproc=16)
# tk.tokenize_data("loom_data_directory", 
#                  "output_directory", 
#                  "output_prefix", 
#                  file_format="loom")