In [3]:
# %load ~/ipyhead
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import seaborn as sns


We're going to load transcript-level quantification data from kallisto, which was run against the index `Homo_sapiens.GRCh38.cdna.all.kallisto.idx`.

Per the tximport paper, http://f1000research.com/articles/4-1521/v2 (code: https://github.com/mikelove/tximport), we should use TPM rather than estimated counts, which are sensitive to transcript length, and aggregate from transcript level to gene level.
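To make the length sensitivity concrete, here is a minimal sketch of how TPM relates to estimated counts and effective length. The transcript names and numbers are made up for illustration; only the column names follow kallisto's `abundance.tsv`.

```python
import pandas as pd

# Toy transcripts (hypothetical values): equal counts, different effective lengths.
toy = pd.DataFrame({
    'target_id': ['tx_short', 'tx_long'],
    'eff_length': [500.0, 2000.0],
    'est_counts': [100.0, 100.0],
})

# Reads per base of effective length, rescaled so the sample sums to one million.
rate = toy['est_counts'] / toy['eff_length']
toy['tpm'] = rate / rate.sum() * 1e6
print(toy)
```

Both toy transcripts have the same estimated counts, but the shorter one ends up with the higher TPM because its reads are spread over fewer bases.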

Here is how to annotate with gene names in Python: https://github.com/hammerlab/cohorts/blob/master/cohorts/load.py#L797-L801

We built the index from http://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz , the same file suggested on the kallisto website, so the annotation should also come from Ensembl release 79.

Note, though, that the newest release is 85, according to ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/README_ensembl:

```
#tax_id	ncbi_release	ncbi_assembly	ensembl_release	ensembl_assembly	date_compared
9606	Homo sapiens Annotation Release 108	GRCh38.p7	85	GRCh38.p7	20160801
```

And http://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/cdna/ has different files.

In [6]:
df = pd.read_csv('output/ERR431617/abundance.tsv', sep='\t')
df.head()

         target_id  length  eff_length  est_counts      tpm
0  ENST00000415118       8     5.08333         0.0      0.0
1  ENST00000448914      13     8.66667         0.0      0.0
2  ENST00000434970       9     5.69231         0.0      0.0
3  ENST00000390577      37     21.3714         1.0  6.44565
4  ENST00000437320      19     11.9048         0.0      0.0


In [5]:
!pip install pyensembl

Collecting pyensembl
  Downloading pyensembl-0.9.5.tar.gz (60kB)
Collecting datacache>=0.4.16 (from pyensembl)
  Downloading datacache-0.4.17.tar.gz
Collecting memoized-property>=1.0.2 (from pyensembl)
  Downloading memoized-property-1.0.2.tar.gz
Collecting gtfparse>=0.0.3 (from pyensembl)
  Downloading gtfparse-0.0.6.tar.gz
Collecting serializable (from pyensembl)
  Downloading serializable-0.0.6.tar.gz
Collecting progressbar33>=2.4 (from datacache>=0.4.16->pyensembl)
  Downloading progressbar33-2.4.tar.gz
Collecting biopython>=1.65 (from datacache>=0.4.16->pyensembl)
  Downloading biopython-1.67.tar.gz (14.3MB)
Collecting simplejson (from serializable->pyensembl)
  Downloading simplejson-3.8.2.tar.gz (76kB)
Building wheels for collected packages: pyensembl, datacache, memoized-property, gtfparse, 

In [7]:
!pyensembl install --release 79 --species human

-- Running 'install' for EnsemblRelease(release=79, species='homo_sapiens')
INFO:root:Fetching /home/maxim/.cache/pyensembl/GRCh38/ensembl79/Homo_sapiens.GRCh38.79.gtf.gz from URL ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz
Downloading ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz to /home/maxim/.cache/pyensembl/GRCh38/ensembl79/Homo_sapiens.GRCh38.79.gtf.gz
INFO:root:Fetching /home/maxim/.cache/pyensembl/GRCh38/ensembl79/Homo_sapiens.GRCh38.cdna.all.fa.gz from URL ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
Downloading ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz to /home/maxim/.cache/pyensembl/GRCh38/ensembl79/Homo_sapiens.GRCh38.cdna.all.fa.gz
INFO:root:Fetching /home/maxim/.cache/pyensembl/GRCh38/ensembl79/Homo_sapiens.GRCh38.pep.all.fa.gz from URL ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/

In [None]:
from pyensembl import cached_release

In [None]:
ensembl_release = cached_release(79)
df['sample_id'] = 1
df['gene_name'] = df['target_id'].map(lambda t: ensembl_release.gene_name_of_transcript_id(t))
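The lookup above may fail for transcript IDs that are in the kallisto index but missing from the annotation. Below is a minimal sketch of a wrapper that returns NaN instead of aborting the whole `map`. It assumes the lookup signals an unknown ID by raising `ValueError`, which should be checked against the installed pyensembl version; `safe_gene_name` and `fake_lookup` are hypothetical names used only for illustration.

```python
import numpy as np

def safe_gene_name(transcript_id, lookup):
    """Return lookup(transcript_id), or NaN when the lookup raises ValueError."""
    try:
        return lookup(transcript_id)
    except ValueError:
        return np.nan

# Stand-in for ensembl_release.gene_name_of_transcript_id (hypothetical data).
_fake_annotation = {'ENST_A': 'GENE1'}

def fake_lookup(transcript_id):
    if transcript_id not in _fake_annotation:
        raise ValueError('Transcript not found: %s' % transcript_id)
    return _fake_annotation[transcript_id]

print(safe_gene_name('ENST_A', fake_lookup))
print(safe_gene_name('ENST_MISSING', fake_lookup))
```

With the real annotation this could be used as `df['target_id'].map(lambda t: safe_gene_name(t, ensembl_release.gene_name_of_transcript_id))`.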

In [None]:
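# Sum transcript-level values within each gene. TPM is additive across a
# gene's transcripts (tximport aggregates abundance the same way), so we
# sum it alongside the estimated counts, per the note on TPM above.
df = df.groupby(['sample_id', 'gene_name'])[['est_counts', 'tpm']].sum().reset_index()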
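As a sanity check on gene-level aggregation, here is a toy sketch showing that summing TPM within genes preserves the per-sample total of one million. The gene names and values are hypothetical.

```python
import pandas as pd

# Three toy transcripts from two genes; TPM sums to one million by construction.
toy = pd.DataFrame({
    'sample_id': [1, 1, 1],
    'gene_name': ['GENE1', 'GENE1', 'GENE2'],
    'est_counts': [10.0, 30.0, 60.0],
    'tpm': [200000.0, 300000.0, 500000.0],
})

# Sum transcript-level values within each (sample, gene) pair.
gene_level = (
    toy.groupby(['sample_id', 'gene_name'])[['est_counts', 'tpm']]
    .sum()
    .reset_index()
)
print(gene_level)
```

The same check on real data is only expected to hold if every transcript received a gene name, since pandas drops rows whose group key is NaN by default.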