# Step 2 - Transcript Quant Into Gene Quant

## Introduction


## Things I do below
1. Use tximport to aggregate transcript-level quantification into gene-level quantification.
2. Combine the gene-level quantification from all 16 samples into a single large table.
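
To make the first step concrete, here is a minimal pandas sketch of what summarizing transcripts to genes means: abundances of transcripts that map to the same gene are summed. The IDs and values are made up, and this only illustrates the idea; it is not tximport's implementation, which also handles counts and effective lengths.

```python
import pandas as pd

# Toy transcript-level abundances (hypothetical IDs and values)
tx_quant = pd.DataFrame({
    'transcript_id': ['tx1', 'tx2', 'tx3'],
    'TPM': [10.0, 5.0, 2.5],
})

# Toy transcript-to-gene map: tx1 and tx2 belong to the same gene
toy_tx2gene = pd.DataFrame({
    'transcript_id': ['tx1', 'tx2', 'tx3'],
    'gene_id': ['geneA', 'geneA', 'geneB'],
})

# Attach the gene ID to each transcript, then sum TPM within each gene
merged = tx_quant.merge(toy_tx2gene, on='transcript_id', how='inner')
gene_tpm = merged.groupby('gene_id')['TPM'].sum()
print(gene_tpm)
```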


## Use R and Python in the same notebook

To do this, we use a package called rpy2, which lets Python call R. Again, we don't need to understand all of it! The minimal example here is enough to use most out-of-the-box R packages in our later analysis.

## Install rpy2 first if you haven't

```shell

conda install -n genome_book rpy2==3.3.2

```


## Import Packages

### Import python package

In [2]:
import pandas as pd
import pathlib

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

%load_ext rpy2.ipython

### Install tximport if you haven't

In [3]:
%%R
# install the tximport package
# this is R code! We are running R and python in the same notebook!
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("tximport")

R[write to console]: Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)

R[write to console]: Installing package(s) 'tximport'

R[write to console]: trying URL 'https://bioconductor.org/packages/3.10/bioc/bin/macosx/el-capitan/contrib/3.6/tximport_1.14.2.tgz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 338954 bytes (331 KB)



The downloaded binary packages are in
	/var/folders/cz/7q7963t101755yktmcz_v3_h0000gn/T//RtmpxvVJE4/downloaded_packages


### Import R package

In [4]:
# import the tximport
importr('tximport')

rpy2.robjects.packages.Package as a <module 'tximport'>

## Calculate one sample as an example

In [5]:
# read the transcript-to-gene table with pandas
tx2gene = pd.read_csv('tx2gene.csv')
tx2gene.head()

Unnamed: 0,transcript_id,gene_ids
0,ENSMUST00000193812.1,ENSMUSG00000102693.1
1,ENSMUST00000082908.1,ENSMUSG00000064842.1
2,ENSMUST00000162897.1,ENSMUSG00000051951.5
3,ENSMUST00000159265.1,ENSMUSG00000051951.5
4,ENSMUST00000070533.4,ENSMUSG00000051951.5
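
Before handing the mapping to tximport, it can be worth checking that it is well formed. The sketch below uses a toy table with the same column names as `tx2gene` (the IDs are hypothetical) and checks that each transcript appears once and that nothing is missing. These checks are my suggestion, not something tximport requires you to run.

```python
import pandas as pd

# Toy mapping with the same column names as tx2gene (hypothetical IDs)
toy_tx2gene = pd.DataFrame({
    'transcript_id': ['tx1', 'tx2', 'tx3'],
    'gene_ids': ['geneA', 'geneA', 'geneB'],
})

# Each transcript should map to exactly one gene, so IDs must be unique
assert toy_tx2gene['transcript_id'].is_unique

# No missing transcript or gene IDs
assert not toy_tx2gene.isna().any().any()

# Number of transcripts per gene, useful as a quick overview
transcripts_per_gene = toy_tx2gene.groupby('gene_ids').size()
print(transcripts_per_gene)
```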


In [6]:
# see rpy2 documentation https://rpy2.github.io/doc/latest/html/pandas.html

with localconverter(ro.default_converter + pandas2ri.converter):
    tx2gene_in_r = ro.conversion.py2rpy(tx2gene)
    gene_quant = ro.r['tximport'](
        '../../../data/DevFB/quant/forebrain_P0_1.quant/quant.sf',
        type='salmon',
        tx2gene=tx2gene_in_r,
        countsFromAbundance='lengthScaledTPM')
    

R[write to console]: reading in files with read_tsv

R[write to console]: 1 
R[write to console]: 

R[write to console]: summarizing abundance

R[write to console]: summarizing counts

R[write to console]: summarizing length
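
According to the tximport documentation, `countsFromAbundance='lengthScaledTPM'` generates counts from the abundance estimates, scaled by the average transcript length over samples and then by the library size. The sketch below is a simplified single-sample illustration of that idea in pandas, with made-up values. With one sample, the average length is just that sample's length. Rescaling at the transcript level before summing per gene is an assumption of this sketch, not a statement about tximport's internals.

```python
import pandas as pd

# Toy transcript-level table for one sample (hypothetical values)
tx = pd.DataFrame({
    'gene_id': ['geneA', 'geneA', 'geneB'],
    'TPM': [10.0, 5.0, 2.5],
    'length': [1000.0, 2000.0, 400.0],   # effective transcript length
    'counts': [100.0, 100.0, 10.0],      # estimated read counts
}, index=['tx1', 'tx2', 'tx3'])

# Library size: total estimated counts in this sample
library_size = tx['counts'].sum()

# Scale abundance by length, then rescale so the total matches the library size
length_scaled = tx['TPM'] * tx['length']
tx['scaled_counts'] = length_scaled * library_size / length_scaled.sum()

# Sum the rescaled counts within each gene
gene_counts = tx.groupby('gene_id')['scaled_counts'].sum()
print(gene_counts)
```

The rescaled counts always add up to the original library size, which is what the second scaling step is for.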



In [7]:
# this is a list with four items; we need the counts for the downstream DEG analysis
gene_quant

0,1
abundance,[RTYPES.REALSXP]
counts,[RTYPES.REALSXP]
length,[RTYPES.REALSXP]
countsFromAbundance,[RTYPES.STRSXP]


In [8]:
# gene_quant[1] is the counts
gene_count = pd.Series(gene_quant[1], index=gene_quant[1].rownames)

In [9]:
gene_count

ENSMUSG00000000001.4     8660.999877
ENSMUSG00000000003.15       0.000000
ENSMUSG00000000028.15     230.000066
ENSMUSG00000000031.16     578.000018
ENSMUSG00000000037.17     181.000134
                            ...     
ENSMUSG00000118636.1        3.000045
ENSMUSG00000118637.1        0.000000
ENSMUSG00000118638.1        0.000000
ENSMUSG00000118639.1        0.000000
ENSMUSG00000118640.1        0.000000
Length: 54331, dtype: float64

In [10]:
# it doesn't hurt to keep the other quantities (TPM and length) as well
def gene_quant_to_gene_df(gene_quant):
    gene_df = pd.DataFrame({
        'TPM': pd.Series(gene_quant[0], index=gene_quant[0].rownames),
        'Counts': pd.Series(gene_quant[1], index=gene_quant[1].rownames),
        'Length': pd.Series(gene_quant[2], index=gene_quant[2].rownames)
    })
    return gene_df

In [11]:
gene_quant_to_gene_df(gene_quant)

Unnamed: 0,TPM,Counts,Length
ENSMUSG00000000001.4,71.066276,8660.999877,3013.000000
ENSMUSG00000000003.15,0.000000,0.000000,550.500000
ENSMUSG00000000028.15,3.437285,230.000066,1654.272593
ENSMUSG00000000031.16,9.216219,578.000018,1550.492691
ENSMUSG00000000037.17,1.118736,181.000134,3999.870583
...,...,...,...
ENSMUSG00000118636.1,4.342193,3.000045,17.081000
ENSMUSG00000118637.1,0.000000,0.000000,7.349000
ENSMUSG00000118638.1,0.000000,0.000000,1768.000000
ENSMUSG00000118639.1,0.000000,0.000000,6.040000


## Calculate All Samples and Save into one Table

In [12]:
salmon_dir = '../../../data/DevFB/quant/'
salmon_dir = pathlib.Path(salmon_dir)

In [13]:
quant_list = list(salmon_dir.glob('*/quant.sf'))
print(len(quant_list), 'quantification tables')
quant_list[0]

16 quantification tables


PosixPath('../../../data/DevFB/quant/forebrain_E13.5_2.quant/quant.sf')
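
Each path carries the sample name in its parent directory (`forebrain_E13.5_2.quant` here). A minimal sketch of pulling it out is shown below. The helper `sample_name_from_quant_path` is a hypothetical name, and it assumes every quantification directory ends in `.quant`, failing loudly if one does not.

```python
import pathlib


def sample_name_from_quant_path(quant_path, suffix='.quant'):
    """Hypothetical helper: derive the sample name from a quant.sf path."""
    dir_name = pathlib.PurePath(quant_path).parent.name
    if not dir_name.endswith(suffix):
        raise ValueError(f'unexpected directory name: {dir_name}')
    # Drop the trailing '.quant' to keep only the sample name
    return dir_name[:-len(suffix)]


example_path = '../../../data/DevFB/quant/forebrain_E13.5_2.quant/quant.sf'
print(sample_name_from_quant_path(example_path))
```

Slicing by `len(suffix)` rather than a fixed number keeps the helper correct if the directory suffix ever changes.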

In [14]:
def transcript_to_gene_quant(input_path, tx2gene_df):
    with localconverter(ro.default_converter + pandas2ri.converter):
        tx2gene_in_r = ro.conversion.py2rpy(tx2gene_df)
        gene_quant = ro.r['tximport'](
            str(input_path),
            type='salmon',
            tx2gene=tx2gene_in_r,
            countsFromAbundance='lengthScaledTPM')
        
    gene_df = gene_quant_to_gene_df(gene_quant)
    return gene_df

In [15]:
gene_quant_df = transcript_to_gene_quant(quant_list[0], tx2gene)
gene_quant_df.head()

R[write to console]: reading in files with read_tsv

R[write to console]: 1 
R[write to console]: 

R[write to console]: summarizing abundance

R[write to console]: summarizing counts

R[write to console]: summarizing length



Unnamed: 0,TPM,Counts,Length
ENSMUSG00000000001.4,98.051257,11963.999549,3013.0
ENSMUSG00000000003.15,0.0,0.0,550.5
ENSMUSG00000000028.15,28.238953,2043.214448,1786.657347
ENSMUSG00000000031.16,15.520992,985.99996,1568.676287
ENSMUSG00000000037.17,1.683454,235.000077,3447.010925


## Generate a single large table
- Gene-level quant from all 16 samples in one file

In [16]:
total_quant_list = []
for path in quant_list:
    sample_name = path.parent.name[:-6]  # get sample name from the path
    print(sample_name)
    
    gene_quant_df = transcript_to_gene_quant(path, tx2gene)
    gene_quant_df.columns = sample_name + '.' + gene_quant_df.columns  # add sample name into column names
    
    total_quant_list.append(gene_quant_df)

forebrain_E13.5_2


R[write to console]: reading in files with read_tsv

R[write to console]: 1 
R[write to console]: 

R[write to console]: summarizing abundance

R[write to console]: summarizing counts

R[write to console]: summarizing length



forebrain_P0_2
forebrain_E12.5_1
forebrain_E14.5_1
forebrain_E15.5_2
forebrain_E12.5_2
forebrain_P0_1
forebrain_E13.5_1
forebrain_E15.5_1
forebrain_E14.5_2
forebrain_E10.5_1
forebrain_E11.5_2
forebrain_E16.5_1
forebrain_E11.5_1
forebrain_E10.5_2
forebrain_E16.5_2

In [20]:
total_quant = pd.concat(total_quant_list, axis=1)
total_quant

Unnamed: 0,forebrain_E13.5_2.TPM,forebrain_E13.5_2.Counts,forebrain_E13.5_2.Length,forebrain_P0_2.TPM,forebrain_P0_2.Counts,forebrain_P0_2.Length,forebrain_E12.5_1.TPM,forebrain_E12.5_1.Counts,forebrain_E12.5_1.Length,forebrain_E14.5_1.TPM,...,forebrain_E16.5_1.Length,forebrain_E11.5_1.TPM,forebrain_E11.5_1.Counts,forebrain_E11.5_1.Length,forebrain_E10.5_2.TPM,forebrain_E10.5_2.Counts,forebrain_E10.5_2.Length,forebrain_E16.5_2.TPM,forebrain_E16.5_2.Counts,forebrain_E16.5_2.Length
ENSMUSG00000000001.4,98.051257,11963.999549,3013.000000,56.339107,8311.999614,3013.000000,134.231784,17287.999636,3013.000000,91.061106,...,3013.000000,139.674917,8477.999884,3013.000000,137.580286,14367.999857,3013.000000,118.966145,8364.999786,3013.000000
ENSMUSG00000000003.15,0.000000,0.000000,550.500000,0.000000,0.000000,550.500000,0.000000,0.000000,550.500000,0.000000,...,550.500000,0.000000,0.000000,550.500000,0.000000,0.000000,550.500000,0.000000,0.000000,550.500000
ENSMUSG00000000028.15,28.238953,2043.214448,1786.657347,3.526304,283.999985,1644.758518,39.650689,3024.005822,1784.192189,20.505089,...,1819.274836,55.126178,1944.011126,1750.511670,59.116165,3665.999943,1789.143710,15.232014,620.083455,1744.413344
ENSMUSG00000000031.16,15.520992,985.999960,1568.676287,7.819190,610.999967,1595.816669,36.678962,2379.999942,1517.992615,18.633953,...,1451.871210,67.601876,1920.999955,1410.563864,76.426513,4204.999975,1587.380266,23.593628,865.999992,1572.822620
ENSMUSG00000000037.17,1.683454,235.000077,3447.010925,0.818803,142.000137,3541.709324,3.798637,483.000019,2974.603626,2.180449,...,3216.806343,3.374957,212.000003,3118.112431,2.200729,282.999968,3710.042154,1.877517,173.999935,3971.200310
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSMUSG00000118636.1,2.891339,2.000029,17.081000,1.195630,1.000015,17.081000,1.369629,1.000015,17.081000,2.988177,...,17.081000,2.906145,1.000015,17.081000,5.067263,3.000045,17.081000,5.017412,2.000030,17.081000
ENSMUSG00000118637.1,0.000000,0.000000,7.349000,0.000000,0.000000,7.349000,0.000000,0.000000,7.349000,0.000000,...,7.349000,0.000000,0.000000,7.349000,0.000000,0.000000,7.349000,0.000000,0.000000,7.349000
ENSMUSG00000118638.1,0.000000,0.000000,1768.000000,0.000000,0.000000,1768.000000,0.000000,0.000000,1768.000000,0.000000,...,1768.000000,0.000000,0.000000,1768.000000,0.016318,0.999977,1768.000000,0.000000,0.000000,1768.000000
ENSMUSG00000118639.1,0.000000,0.000000,6.040000,0.000000,0.000000,6.040000,0.000000,0.000000,6.040000,0.000000,...,6.040000,0.000000,0.000000,6.040000,0.000000,0.000000,6.040000,0.000000,0.000000,6.040000
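
`pd.concat(..., axis=1)` places the per-sample tables side by side and aligns rows by index label, so genes are matched by ID rather than by position. A small sketch with toy tables (hypothetical gene and sample names) whose rows are deliberately in different orders:

```python
import pandas as pd

# Two toy per-sample tables with the same genes in a different row order
sample_a = pd.DataFrame({'TPM': [1.0, 2.0]}, index=['gene1', 'gene2'])
sample_b = pd.DataFrame({'TPM': [4.0, 3.0]}, index=['gene2', 'gene1'])

# Prefix the column names with the sample name, as in the loop above
sample_a.columns = 'sampleA.' + sample_a.columns
sample_b.columns = 'sampleB.' + sample_b.columns

# Rows are aligned on the gene ID, not on their position
combined = pd.concat([sample_a, sample_b], axis=1)
print(combined)
```

If a gene were missing from one sample, its cells would be filled with NaN, so checking `combined.isna().any().any()` is a cheap way to confirm that all samples share the same genes.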


In [21]:
total_quant.to_csv('DevFB.16sample.gene_level_quant.csv.gz')

Now let's make two additional files for the DEG analysis: a counts table and a sample table.

### Make a table of all the counts columns

In [22]:
total_quant.columns.str.endswith('Counts')

array([False,  True, False, False,  True, False, False,  True, False,
       False,  True, False, False,  True, False, False,  True, False,
       False,  True, False, False,  True, False, False,  True, False,
       False,  True, False, False,  True, False, False,  True, False,
       False,  True, False, False,  True, False, False,  True, False,
       False,  True, False])

In [25]:
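# copy the selection so that renaming its columns does not touch total_quant
counts_df = total_quant.loc[:, total_quant.columns.str.endswith('Counts')].copy()
counts_df.columns = counts_df.columns.str[:-7]  # strip the '.Counts' suffix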
counts_df

Unnamed: 0,forebrain_E13.5_2,forebrain_P0_2,forebrain_E12.5_1,forebrain_E14.5_1,forebrain_E15.5_2,forebrain_E12.5_2,forebrain_P0_1,forebrain_E13.5_1,forebrain_E15.5_1,forebrain_E14.5_2,forebrain_E10.5_1,forebrain_E11.5_2,forebrain_E16.5_1,forebrain_E11.5_1,forebrain_E10.5_2,forebrain_E16.5_2
ENSMUSG00000000001.4,11963.999549,8311.999614,17287.999636,10750.999694,9958.999556,17480.999687,8660.999877,11523.999548,11127.999598,9458.999830,16155.999859,12384.999835,7995.999807,8477.999884,14367.999857,8364.999786
ENSMUSG00000000003.15,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ENSMUSG00000000028.15,2043.214448,283.999985,3024.005822,1428.142436,974.013277,3026.258085,230.000066,2011.011893,969.055625,1038.034104,3997.351217,2662.011545,653.020094,1944.011126,3665.999943,620.083455
ENSMUSG00000000031.16,985.999960,610.999967,2379.999942,1058.999957,931.999994,3216.999929,578.000018,919.999919,903.999928,908.999972,4060.999927,2563.000001,653.999991,1920.999955,4204.999975,865.999992
ENSMUSG00000000037.17,235.000077,142.000137,483.000019,289.000014,211.000080,464.999994,181.000134,266.000021,283.999942,224.999992,355.000000,405.999957,188.000010,212.000003,282.999968,173.999935
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSMUSG00000118636.1,2.000029,1.000015,1.000015,2.000030,5.000074,3.000044,3.000045,1.000015,3.000044,0.000000,2.000030,4.000060,3.000044,1.000015,3.000045,2.000030
ENSMUSG00000118637.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
ENSMUSG00000118638.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000032,0.000000,0.000000,0.000000,0.000000,0.000000,0.999977,0.000000
ENSMUSG00000118639.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
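
The same selection can be tried on a toy table to see what each step does. The sample names `s1` and `s2` and the values are made up; the suffix length is computed from the string instead of being hard-coded.

```python
import pandas as pd

# Toy combined table with three columns per sample (hypothetical values)
toy_quant = pd.DataFrame(
    [[1.0, 10.0, 100.0, 2.0, 20.0, 100.0]],
    index=['gene1'],
    columns=['s1.TPM', 's1.Counts', 's1.Length',
             's2.TPM', 's2.Counts', 's2.Length'])

suffix = '.Counts'

# Boolean mask marking the counts columns
is_counts = toy_quant.columns.str.endswith(suffix)

# Keep only those columns, then strip the suffix so names match the samples
toy_counts = toy_quant.loc[:, is_counts].copy()
toy_counts.columns = toy_counts.columns.str[:-len(suffix)]
print(toy_counts)
```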


In [26]:
counts_df.to_csv('DevFB.16sample.counts_for_DEG.csv.gz')

### Make a sample table for experiment design

In [31]:
sample_table = pd.DataFrame([i.split('_') for i in counts_df.columns],
                            index=counts_df.columns,
                            columns=['Tissue', 'Time', 'Replicate'])
sample_table

Unnamed: 0,Tissue,Time,Replicate
forebrain_E13.5_2,forebrain,E13.5,2
forebrain_P0_2,forebrain,P0,2
forebrain_E12.5_1,forebrain,E12.5,1
forebrain_E14.5_1,forebrain,E14.5,1
forebrain_E15.5_2,forebrain,E15.5,2
forebrain_E12.5_2,forebrain,E12.5,2
forebrain_P0_1,forebrain,P0,1
forebrain_E13.5_1,forebrain,E13.5,1
forebrain_E15.5_1,forebrain,E15.5,1
forebrain_E14.5_2,forebrain,E14.5,2
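
Splitting on `_` works because the tissue, time point, and replicate contain no underscores themselves. A minimal sketch of the same parsing on four of the sample names, plus a sanity check that every time point has two replicates (the check is my addition):

```python
import pandas as pd

# A few of the sample names from the counts table
sample_names = pd.Index(['forebrain_E10.5_1', 'forebrain_E10.5_2',
                         'forebrain_P0_1', 'forebrain_P0_2'])

# Split each name into its three parts, keeping the name as the row label
design = sample_names.to_series().str.split('_', expand=True)
design.columns = ['Tissue', 'Time', 'Replicate']

# Sanity check: each time point should have exactly two replicates
replicates_per_time = design.groupby('Time')['Replicate'].nunique()
assert (replicates_per_time == 2).all()
print(design)
```

Note that `Replicate` is stored as a string after splitting, which is usually what a design table wants, since it is a label rather than a number.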


In [32]:
sample_table.to_csv('DevFB.16sample.design.csv.gz')
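
When these files are read back later, the row labels are stored as the first column of the CSV, so passing `index_col=0` to `pd.read_csv` restores them as the index. A round-trip sketch with a toy table written to a temporary directory (the file name and values are hypothetical):

```python
import pathlib
import tempfile

import pandas as pd

# Toy counts table with gene IDs as the index (hypothetical values)
toy_table = pd.DataFrame({'s1': [1.5, 0.0]}, index=['gene1', 'gene2'])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = pathlib.Path(tmp_dir) / 'toy.csv.gz'
    # gzip compression is inferred from the '.gz' extension
    toy_table.to_csv(out_path)
    # index_col=0 turns the first column back into the row labels
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded)
```

Without `index_col=0`, the gene IDs would come back as an ordinary column named `Unnamed: 0`.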