# Step 1 - Prepare transcript2gene map

## Introduction
The developmental forebrain dataset contains 16 samples. In our data preparation step, we used salmon to quantify transcript abundance, and the results are stored in 16 separate directories, one per sample. Salmon reports its results at the transcript level, but the downstream DEG analysis works at the gene level, so we need to merge the transcript-level results into gene-level results.

## Things I do below
1. I check the content of one sample to understand the structure of the data.
2. I check the content of our GENCODE vM24 GTF file, the same gene annotation file used to build the salmon index. We will keep using this file throughout the analysis. In a separate notebook, I will show more ways to manipulate the GTF and extract information from it.
3. I create a transcript-to-gene map CSV file, which is needed in the next notebook.

## Load Package

In [3]:
import pandas as pd
import pathlib  # deal with file paths

## Check out one sample first

### Load one sample quant table

In [3]:
# use a relative path to navigate to the salmon quant dir
# if this does not work, change it to wherever you saved the salmon quant dirs
data_dir = '../../../data/DevFB/quant/'

data_dir = pathlib.Path(data_dir)
data_dir  # now data_dir is a Path object, which has convenient methods for path management

PosixPath('../../../data/DevFB/quant')
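Here is a small sketch of a few of the `Path` conveniences mentioned above. The directory tree is created in a temporary folder purely for illustration, and the sample names are made up; they are not the real forebrain samples.

```python
import pathlib
import tempfile

# Build a throwaway directory tree that mimics salmon's output layout.
# The sample names here are hypothetical, for the demo only.
tmp_root = pathlib.Path(tempfile.mkdtemp())
for name in ['sample_A.quant', 'sample_B.quant']:
    sample_dir = tmp_root / name          # "/" joins path components
    sample_dir.mkdir()
    (sample_dir / 'quant.sf').touch()     # create an empty placeholder file

# glob() finds every quant.sf exactly one level below the root
quant_paths = sorted(tmp_root.glob('*/quant.sf'))
for path in quant_paths:
    # .parent.name is the sample directory name, .exists() checks the file
    print(path.parent.name, path.name, path.exists())
```

The same `glob('*/quant.sf')` pattern should work on `data_dir` if every sample directory contains a `quant.sf` file.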

In [8]:
# Let's check out this sample first
sample_id = 'forebrain_P0_1.quant'

sample_quant_path = data_dir / f'{sample_id}/quant.sf'

In [13]:
quant_df = pd.read_csv(sample_quant_path, sep='\t', 
                       index_col='Name')
# read_csv() can read tsv files too, just use sep='\t'
# index_col tells the function which column to use as row names.
# If index_col=None, the index will be the default integer index; you can try it yourself

print(quant_df.shape)  # this table has (N_rows, N_columns)
quant_df.head()

(140948, 4)


Name,Length,EffectiveLength,TPM,NumReads
ENSMUST00000193812.1,1070,821.0,0.150564,5.0
ENSMUST00000082908.1,110,4.749,0.0,0.0
ENSMUST00000162897.1,4153,3904.0,0.748282,118.163
ENSMUST00000159265.1,2989,2740.0,0.583525,64.672
ENSMUST00000070533.4,3634,3385.0,2.060818,282.165
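To see what `index_col` changes without touching the real file, here is a minimal sketch that reads the first two rows of the table above from an in-memory string. The variable names (`by_name`, `by_position`) are only for this demo.

```python
import io
import pandas as pd

# Two rows copied from the table above, stored as tab-separated text
tsv_text = (
    'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
    'ENSMUST00000193812.1\t1070\t821.0\t0.150564\t5.0\n'
    'ENSMUST00000082908.1\t110\t4.749\t0.0\t0.0\n'
)

# index_col='Name': transcript IDs become the row labels
by_name = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col='Name')

# index_col=None (the default): rows are labelled 0, 1, ... and Name stays a column
by_position = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col=None)

print(by_name.shape, by_position.shape)
print(by_name.loc['ENSMUST00000193812.1', 'NumReads'])   # lookup by transcript ID
print(by_position.loc[0, 'Name'])                         # lookup by integer label
```

With the transcript IDs as the index, later lookups and joins by transcript ID become straightforward.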


### Stop and read salmon's documentation

Now that we have salmon's output table, we can see several columns. Although the names are fairly self-explanatory, it's always good to fully understand what they mean before we proceed.

[Here](https://salmon.readthedocs.io/en/latest/file_formats.html) is salmon's documentation about its output format. I copied the column explanations here:
- Name — This is the name of the target transcript provided in the input transcript database (FASTA file).
- Length — This is the length of the target transcript in nucleotides.
- EffectiveLength — This is the computed effective length of the target transcript. It takes into account all factors being modeled that will effect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and gc-fragment bias (if they are being modeled).
- TPM — This is salmon’s estimate of the relative abundance of this transcript in units of Transcripts Per Million (TPM). TPM is the recommended relative abundance measure to use for downstream analysis.
- NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

Note: **TPM is an important term in RNA-seq**, and everyone needs to know how it is calculated. If you don't, [check out this page and the video](https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/).
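For readers who prefer to see the arithmetic, here is a minimal sketch of the standard TPM calculation from read counts and effective lengths. The three transcripts and their numbers are toy values, not taken from the forebrain data, and this is not salmon's own code.

```python
import pandas as pd

# Toy numbers for illustration only
toy = pd.DataFrame(
    {'EffectiveLength': [1000.0, 2000.0, 500.0],
     'NumReads': [100.0, 100.0, 50.0]},
    index=['tx_a', 'tx_b', 'tx_c'])

# Step 1: reads per base of effective length
rate = toy['NumReads'] / toy['EffectiveLength']
# Step 2: rescale so that all transcripts together sum to one million
toy['TPM'] = rate / rate.sum() * 1e6

print(toy)
print(toy['TPM'].sum())
```

Note that `tx_a` and `tx_b` have the same number of reads, but `tx_b` is twice as long, so its TPM is half as large.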


## Now let's focus on the index

### Important Note
Salmon generates transcript-level quantification, so the index of the quant dataframe above consists of mouse transcript IDs. We need to create a dict that maps each transcript ID to its corresponding gene ID. To do so, we need the GTF file we used when creating the salmon index.
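To make the goal concrete, here is a minimal sketch of how such a dict can collapse transcript-level values into gene-level values. The four transcripts and their `NumReads` are copied from the table above, and their gene IDs come from the GTF. Summing `NumReads` per gene is just one simple choice and an assumption here; the next notebook may do this differently.

```python
import pandas as pd

# NumReads for four transcripts, copied from the quant table above
num_reads = pd.Series({
    'ENSMUST00000193812.1': 5.0,
    'ENSMUST00000162897.1': 118.163,
    'ENSMUST00000159265.1': 64.672,
    'ENSMUST00000070533.4': 282.165,
})

# transcript -> gene mapping for the same four transcripts
tx_to_gene = {
    'ENSMUST00000193812.1': 'ENSMUSG00000102693.1',
    'ENSMUST00000162897.1': 'ENSMUSG00000051951.5',
    'ENSMUST00000159265.1': 'ENSMUSG00000051951.5',
    'ENSMUST00000070533.4': 'ENSMUSG00000051951.5',
}

# groupby() accepts a dict: each index label is looked up in the dict
# to find its group, then values are summed within each gene
gene_reads = num_reads.groupby(tx_to_gene).sum()
print(gene_reads)
```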

### More about IDs

In databases like GENCODE/Ensembl or NCBI, genes, transcripts, and proteins all have different IDs.

GENCODE and Ensembl share the same ID pattern, which is also the recommended one for RNA-seq analysis in human and mouse. But you will also see NCBI RefSeq IDs a lot.

Using mouse Ensembl/GENCODE as an example:
- ENSMUST... is a transcript ID
- ENSMUSG... is a gene ID

For human Ensembl/GENCODE:
- ENST... is transcript ID
- ENSG... is gene ID
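The prefix rules above can be sketched as a small helper. `describe_ensembl_id` is a hypothetical function written for this note, not part of any library, and it only recognises human and mouse IDs. The trailing `.N` on GENCODE IDs is a version number, which the helper splits off.

```python
import re

# ENS + optional MUS (mouse) + feature letter + digits + optional ".version"
ENSEMBL_ID = re.compile(r'^ENS(MUS)?([GTP])(\d+)(?:\.(\d+))?$')
FEATURE_NAMES = {'G': 'gene', 'T': 'transcript', 'P': 'protein'}

def describe_ensembl_id(identifier):
    """Return a dict describing a human or mouse Ensembl ID, or None if unrecognised."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    mouse, letter, digits, version = match.groups()
    return {
        'species': 'mouse' if mouse else 'human',
        'feature': FEATURE_NAMES[letter],
        'stable_id': f"ENS{mouse or ''}{letter}{digits}",  # ID without the version
        'version': version,                                # None if no ".N" suffix
    }

print(describe_ensembl_id('ENSMUST00000193812.1'))
print(describe_ensembl_id('ENSMUSG00000102693.1'))
print(describe_ensembl_id('NM_000000'))  # made-up RefSeq-style string, not recognised
```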

## Create transcript2gene dict using GTF file

In [18]:
# the gtf table has a special format, so we need to set more parameters when reading it in;
# here I just give you the answer, you can search the pandas.read_csv() documentation for more information
gtf = pd.read_csv(
    '../../../data/ref/GENCODEvM24/gencode.vM24.annotation.gtf.gz',
    comment='#',
    sep='\t',
    header=None,
    names=[
        'chrom', 'source', 'feature', 'start', 'end', 'na1', 'strand', 'na2',
        'annotation'
    ])
gtf.head()

Unnamed: 0,chrom,source,feature,start,end,na1,strand,na2,annotation
0,chr1,HAVANA,gene,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; gene_type ""TEC..."
1,chr1,HAVANA,transcript,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; transcript_id ..."
2,chr1,HAVANA,exon,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; transcript_id ..."
3,chr1,ENSEMBL,gene,3102016,3102125,.,+,.,"gene_id ""ENSMUSG00000064842.1""; gene_type ""snR..."
4,chr1,ENSEMBL,transcript,3102016,3102125,.,+,.,"gene_id ""ENSMUSG00000064842.1""; transcript_id ..."


In [19]:
# the gtf contains different types of features
gtf['feature'].value_counts()

exon              842873
CDS               528267
UTR               185941
transcript        142552
start_codon        59973
stop_codon         55713
gene               55385
Selenocysteine        65
Name: feature, dtype: int64

In [39]:
# here we only need the transcript rows, so we filter for them
# transcript_gtf has 142552 rows, slightly more than the 140948 rows in our salmon quant table,
# likely because salmon drops some transcripts when building its index (e.g. duplicate sequences).
transcript_gtf = gtf[gtf['feature'] == 'transcript'].copy()
print(transcript_gtf.shape)
transcript_gtf.head()

(142552, 9)


Unnamed: 0,chrom,source,feature,start,end,na1,strand,na2,annotation
1,chr1,HAVANA,transcript,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; transcript_id ..."
4,chr1,ENSEMBL,transcript,3102016,3102125,.,+,.,"gene_id ""ENSMUSG00000064842.1""; transcript_id ..."
7,chr1,HAVANA,transcript,3205901,3216344,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ..."
10,chr1,HAVANA,transcript,3206523,3215632,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ..."
13,chr1,HAVANA,transcript,3214482,3671498,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ..."


### Extract gene_id from annotation column

In [40]:
# the last column contains a long annotation string, in which you can see the gene_id
example_annotation = transcript_gtf.iloc[0, -1]
example_annotation

'gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";'

In [49]:
# we need to extract the gene_id from this string; here is how I do it with a function

def extract_gene_id(annotation):
    gene_id = None  # returned as-is if the annotation has no gene_id field
    kv_pairs = annotation.split(';')  # split into key-value pairs
    for kv_pair in kv_pairs:  # iterate over the key-value pairs
        kv_pair = kv_pair.strip(' ')  # remove the surrounding spaces
        if kv_pair.startswith('gene_id '):
            _, gene_id = kv_pair.split(' ', 1)
            gene_id = gene_id.strip('"')  # remove the surrounding quotes
            break
    return gene_id

extract_gene_id(example_annotation)

'ENSMUSG00000102693.1'
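If you need more than one field, the same split-based idea can be generalised to parse the whole annotation string into a dict. This is a minimal sketch: `parse_gtf_attributes` is a hypothetical helper, it keeps only the first value of a repeated key, and it assumes no value contains a `;` inside its quotes.

```python
def parse_gtf_attributes(annotation):
    """Parse a GTF annotation string into a {key: value} dict."""
    attributes = {}
    for field in annotation.split(';'):
        field = field.strip()
        if not field:            # skip the empty piece after the last ';'
            continue
        key, _, value = field.partition(' ')   # split at the first space only
        # setdefault keeps the first value if a key appears more than once
        attributes.setdefault(key, value.strip().strip('"'))
    return attributes

# the example annotation string shown above
annotation = (
    'gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; '
    'gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; '
    'transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; '
    'mgi_id "MGI:1918292"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; '
    'havana_transcript "OTTMUST00000127109.1";'
)
attrs = parse_gtf_attributes(annotation)
print(attrs['gene_id'], attrs['transcript_id'], attrs['gene_name'], attrs['level'])
```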

In [42]:
# now we need to apply this function to each row of transcript_gtf
# there are two ways to do this; I first show you the fast (i.e. correct) way:
gene_ids = transcript_gtf['annotation'].apply(extract_gene_id)
# it takes ~250 ms on my computer
# search for pandas.Series.apply() to learn more about this method
gene_ids

1          ENSMUSG00000102693.1
4          ENSMUSG00000064842.1
7          ENSMUSG00000051951.5
10         ENSMUSG00000051951.5
13         ENSMUSG00000051951.5
                   ...         
1870749    ENSMUSG00000064368.1
1870756    ENSMUSG00000064369.1
1870759    ENSMUSG00000064370.1
1870764    ENSMUSG00000064371.1
1870767    ENSMUSG00000064372.1
Name: annotation, Length: 142552, dtype: object
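As an aside, pandas string methods offer another way to pull out the IDs, without writing a Python function at all. This is a sketch on two shortened annotation strings (only the first two fields of the real rows are kept); the `_regex` variable names are just for this demo, and the pattern assumes the ID is always wrapped in double quotes.

```python
import pandas as pd

# Shortened versions of the first two transcript annotations
annotations = pd.Series([
    'gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1";',
    'gene_id "ENSMUSG00000064842.1"; transcript_id "ENSMUST00000082908.1";',
])

# str.extract() returns the text captured by the (...) group for every row;
# expand=False gives back a Series, and rows without a match become NaN
gene_ids_regex = annotations.str.extract(r'gene_id "([^"]+)"', expand=False)
transcript_ids_regex = annotations.str.extract(r'transcript_id "([^"]+)"', expand=False)

print(gene_ids_regex.tolist())
print(transcript_ids_regex.tolist())
```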

In [36]:
# this is the slow way to achieve the same result,
# I just want to demo that using the right pandas functions/methods can simplify your code,
# and sometimes even make it faster
# DO NOT USE THIS, USE THE ONE ABOVE

# iterate over the annotation column and collect the gene_id of each row
_gene_ids = []
for annotation in transcript_gtf['annotation']:
    _gene_ids.append(extract_gene_id(annotation))
_gene_ids = pd.Series(_gene_ids, index=transcript_gtf.index)

_gene_ids

# the result is the same, but here we need to write more code

1          ENSMUSG00000102693.1
4          ENSMUSG00000064842.1
7          ENSMUSG00000051951.5
10         ENSMUSG00000051951.5
13         ENSMUSG00000051951.5
                   ...         
1870749    ENSMUSG00000064368.1
1870756    ENSMUSG00000064369.1
1870759    ENSMUSG00000064370.1
1870764    ENSMUSG00000064371.1
1870767    ENSMUSG00000064372.1
Length: 142552, dtype: object

In [45]:
# Now let's add gene_ids back to transcript_gtf
transcript_gtf['gene_ids'] = gene_ids
print(transcript_gtf.shape)
transcript_gtf.head()

(142552, 10)


Unnamed: 0,chrom,source,feature,start,end,na1,strand,na2,annotation,gene_ids
1,chr1,HAVANA,transcript,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; transcript_id ...",ENSMUSG00000102693.1
4,chr1,ENSEMBL,transcript,3102016,3102125,.,+,.,"gene_id ""ENSMUSG00000064842.1""; transcript_id ...",ENSMUSG00000064842.1
7,chr1,HAVANA,transcript,3205901,3216344,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5
10,chr1,HAVANA,transcript,3206523,3215632,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5
13,chr1,HAVANA,transcript,3214482,3671498,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5


### Extract transcript_id from annotation column

In [52]:
# similarly, we need to extract transcript_id from annotation too
# this time I will write a new function

def extract_transcript_id(annotation):
    transcript_id = None  # returned as-is if the annotation has no transcript_id field
    kv_pairs = annotation.split(';')  # split into key-value pairs
    for kv_pair in kv_pairs:  # iterate over the key-value pairs
        kv_pair = kv_pair.strip(' ')  # remove the surrounding spaces
        if kv_pair.startswith('transcript_id '):
            _, transcript_id = kv_pair.split(' ', 1)
            transcript_id = transcript_id.strip('"')  # remove the surrounding quotes
            break
    return transcript_id

extract_transcript_id(example_annotation)

'ENSMUST00000193812.1'

In [53]:
transcript_ids = transcript_gtf['annotation'].apply(extract_transcript_id)
transcript_gtf['transcript_id'] = transcript_ids
print(transcript_gtf.shape)
transcript_gtf.head()

(142552, 11)


Unnamed: 0,chrom,source,feature,start,end,na1,strand,na2,annotation,gene_ids,transcript_id
1,chr1,HAVANA,transcript,3073253,3074322,.,+,.,"gene_id ""ENSMUSG00000102693.1""; transcript_id ...",ENSMUSG00000102693.1,ENSMUST00000193812.1
4,chr1,ENSEMBL,transcript,3102016,3102125,.,+,.,"gene_id ""ENSMUSG00000064842.1""; transcript_id ...",ENSMUSG00000064842.1,ENSMUST00000082908.1
7,chr1,HAVANA,transcript,3205901,3216344,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5,ENSMUST00000162897.1
10,chr1,HAVANA,transcript,3206523,3215632,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5,ENSMUST00000159265.1
13,chr1,HAVANA,transcript,3214482,3671498,.,-,.,"gene_id ""ENSMUSG00000051951.5""; transcript_id ...",ENSMUSG00000051951.5,ENSMUST00000070533.4
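With the transcript IDs extracted, one way to see exactly which annotated transcripts are absent from a quant table is a set difference between the two sets of IDs. This is a sketch on toy IDs; with the real data the two inputs would be `transcript_gtf['transcript_id']` and `quant_df.index`.

```python
import pandas as pd

# Toy stand-ins for the GTF transcript IDs and the quant table index
gtf_transcripts = pd.Index(['tx1', 'tx2', 'tx3', 'tx4'])
quant_transcripts = pd.Index(['tx1', 'tx2', 'tx4'])

# IDs annotated in the GTF but absent from the quant table
missing_from_quant = gtf_transcripts.difference(quant_transcripts)
# IDs in the quant table that the GTF does not contain (ideally none)
missing_from_gtf = quant_transcripts.difference(gtf_transcripts)

print(len(missing_from_quant), list(missing_from_quant))
print(len(missing_from_gtf))
```

If the second difference is empty, every quantified transcript can be assigned a gene from this GTF.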


### Save transcript2gene.csv

In [55]:
# subsetting the table gives us the transcript-to-gene map needed in the next notebook
tx2gene = transcript_gtf[['transcript_id', 'gene_ids']]
tx2gene

Unnamed: 0,transcript_id,gene_ids
1,ENSMUST00000193812.1,ENSMUSG00000102693.1
4,ENSMUST00000082908.1,ENSMUSG00000064842.1
7,ENSMUST00000162897.1,ENSMUSG00000051951.5
10,ENSMUST00000159265.1,ENSMUSG00000051951.5
13,ENSMUST00000070533.4,ENSMUSG00000051951.5
...,...,...
1870749,ENSMUST00000082419.1,ENSMUSG00000064368.1
1870756,ENSMUST00000082420.1,ENSMUSG00000064369.1
1870759,ENSMUST00000082421.1,ENSMUSG00000064370.1
1870764,ENSMUST00000082422.1,ENSMUSG00000064371.1
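Before saving, a few quick sanity checks on the map can catch problems early. This sketch uses only the first five rows shown above, in a demo frame named `tx2gene_demo`; the same calls could be run on the full `tx2gene` table.

```python
import pandas as pd

# First five rows of the map, copied from the table above
tx2gene_demo = pd.DataFrame({
    'transcript_id': ['ENSMUST00000193812.1', 'ENSMUST00000082908.1',
                      'ENSMUST00000162897.1', 'ENSMUST00000159265.1',
                      'ENSMUST00000070533.4'],
    'gene_ids': ['ENSMUSG00000102693.1', 'ENSMUSG00000064842.1',
                 'ENSMUSG00000051951.5', 'ENSMUSG00000051951.5',
                 'ENSMUSG00000051951.5'],
})

# each transcript should appear exactly once
transcripts_unique = tx2gene_demo['transcript_id'].is_unique
# no ID should be missing (None/NaN) in either column
no_missing = bool(tx2gene_demo.notna().all().all())
# number of distinct genes, and number of transcripts per gene
n_genes = tx2gene_demo['gene_ids'].nunique()
transcripts_per_gene = tx2gene_demo['gene_ids'].value_counts()

print(transcripts_unique, no_missing, n_genes)
print(transcripts_per_gene)
```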


In [56]:
tx2gene.to_csv('tx2gene.csv', index=False)
# set index=False so the index is not exported to the csv; the output file then has only 2 columns

In [57]:
!head tx2gene.csv

transcript_id,gene_ids
ENSMUST00000193812.1,ENSMUSG00000102693.1
ENSMUST00000082908.1,ENSMUSG00000064842.1
ENSMUST00000162897.1,ENSMUSG00000051951.5
ENSMUST00000159265.1,ENSMUSG00000051951.5
ENSMUST00000070533.4,ENSMUSG00000051951.5
ENSMUST00000192857.1,ENSMUSG00000102851.1
ENSMUST00000195335.1,ENSMUSG00000103377.1
ENSMUST00000192336.1,ENSMUSG00000104017.1
ENSMUST00000194099.1,ENSMUSG00000103025.1
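For reference, here is a minimal sketch of how the saved file could be read back and turned into a plain Python dict. It reads the first few lines shown above from an in-memory string; with the real file you would pass `'tx2gene.csv'` to `pd.read_csv()` instead. The variable names are only for this demo.

```python
import io
import pandas as pd

# First lines of tx2gene.csv, copied from the output above
csv_text = (
    'transcript_id,gene_ids\n'
    'ENSMUST00000193812.1,ENSMUSG00000102693.1\n'
    'ENSMUST00000082908.1,ENSMUSG00000064842.1\n'
    'ENSMUST00000162897.1,ENSMUSG00000051951.5\n'
    'ENSMUST00000159265.1,ENSMUSG00000051951.5\n'
    'ENSMUST00000070533.4,ENSMUSG00000051951.5\n'
)

# Use transcript_id as the index, then convert the gene column to a dict
tx2gene_table = pd.read_csv(io.StringIO(csv_text), index_col='transcript_id')
tx2gene_dict = tx2gene_table['gene_ids'].to_dict()

print(len(tx2gene_dict))
print(tx2gene_dict['ENSMUST00000070533.4'])
```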
