# Notebook 1: loading & processing data
author: Jake Siegel

date: 11.28.2017

This notebook loads the single-cell RNA-seq data from asynchronous cells published in the Karlsson et al. 2017 JMB paper.  The paper is in /references.

The data set is available from ArrayExpress under accession E-MTAB-6142:

https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6142/

In [2]:
%matplotlib inline

import pandas as pd

# Raw count table: one row per gene, one column per cell (S1..S96)
data_src = '../data/raw/counts.txt'
df = pd.read_csv(data_src, sep='\t')
print(df.shape)
df.head()

(59838, 97)


Unnamed: 0,Gene_ID,S1,S2,S3,S4,S5,S6,S7,S8,S9,...,S87,S88,S89,S90,S91,S92,S93,S94,S95,S96
0,ENSG00000000003.10,360,5,437,136,328,253,1101,39,157,...,391,429,148,397,424,317,403,280,470,725
1,ENSG00000000005.5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ENSG00000000419.8,111,421,70,179,93,57,60,35,174,...,63,228,173,115,87,308,92,179,164,104
3,ENSG00000000457.8,2,5,0,0,1,1,3,0,5,...,3,0,2,1,2,0,0,0,38,1
4,ENSG00000000460.12,179,448,0,0,135,47,0,0,0,...,151,68,147,84,151,11,0,0,169,77


In [68]:
# Cool.  Now let's load the info file to get the cell cycle stage

# SDRF sample sheet from ArrayExpress (tab separated)
info_src = '../data/raw/E-MTAB-6142.sdrf.txt'
df_info = pd.read_csv(info_src, sep='\t')
print(df_info.shape)
df_info.head()

(192, 39)


Unnamed: 0,Source Name,Comment[ENA_SAMPLE],Comment[BioSD_SAMPLE],Characteristics[organism],Characteristics[cell line],Characteristics[disease],Characteristics[organism part],Characteristics[cell cycle phase],Material Type,Protocol REF,...,Comment[ENA_RUN],Comment[FASTQ_URI],Comment[SPOT_LENGTH],Comment[READ_INDEX_1_BASE_COORD],Protocol REF.5,Protocol REF.6,Derived Array Data File,Comment [Derived ArrayExpress FTP file],Factor Value[cell cycle phase],Factor Value[single cell identifier]
0,S73,ERS1979261,SAMEA104354283,Homo sapiens,MLS 1765-92,myxoid liposarcoma,adipose tissue,G2/M,cell,P-MTAB-68727,...,ERR2177244,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/004/...,300,151,P-MTAB-68731,P-MTAB-68732,counts.txt,ftp://ftp.ebi.ac.uk/pub/databases/microarray/d...,G2/M,S73
1,S73,ERS1979261,SAMEA104354283,Homo sapiens,MLS 1765-92,myxoid liposarcoma,adipose tissue,G2/M,cell,P-MTAB-68727,...,ERR2177244,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/004/...,74,38,P-MTAB-68731,P-MTAB-68732,counts.txt,ftp://ftp.ebi.ac.uk/pub/databases/microarray/d...,G2/M,S73
2,S1,ERS1979262,SAMEA104354284,Homo sapiens,MLS 1765-92,myxoid liposarcoma,adipose tissue,G1,cell,P-MTAB-68727,...,ERR2177245,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/005/...,302,152,P-MTAB-68731,P-MTAB-68732,counts.txt,ftp://ftp.ebi.ac.uk/pub/databases/microarray/d...,G1,S1
3,S1,ERS1979262,SAMEA104354284,Homo sapiens,MLS 1765-92,myxoid liposarcoma,adipose tissue,G1,cell,P-MTAB-68727,...,ERR2177245,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/005/...,64,33,P-MTAB-68731,P-MTAB-68732,counts.txt,ftp://ftp.ebi.ac.uk/pub/databases/microarray/d...,G1,S1
4,S9,ERS1979263,SAMEA104354285,Homo sapiens,MLS 1765-92,myxoid liposarcoma,adipose tissue,G1,cell,P-MTAB-68727,...,ERR2177246,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/006/...,302,152,P-MTAB-68731,P-MTAB-68732,counts.txt,ftp://ftp.ebi.ac.uk/pub/databases/microarray/d...,G1,S9


In [69]:
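# Keep the last two columns (cell cycle phase, cell identifier), one row per cell,
# then transpose so that each cell becomes a column
df_info = df_info.iloc[:, -2:].drop_duplicates().transpose().reset_index(drop=True)
# Row 1 holds the cell identifiers: use it as the header, then drop it
df_info.columns = df_info.iloc[1]
df_info = df_info.drop(1)
df_info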

1,S73,S1,S9,S17,S57,S33,S41,S49,S65,S74,...,S72,S81,S82,S91,S11,S25,S26,S34,S64,S90
0,G2/M,G1,G1,G1,S,S,S,S,G2/M,G2/M,...,G2/M,G2/M,G2/M,G2/M,G1,G1,G1,S,S,G2/M


In [83]:
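# concat aligns on column names, so the phase labels land in one extra row
# at the bottom of the count table (Gene_ID is NaN in that row)
test = pd.concat([df, df_info]).reset_index(drop=True)
test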

Unnamed: 0,Gene_ID,S1,S10,S11,S12,S13,S14,S15,S16,S17,...,S88,S89,S9,S90,S91,S92,S93,S94,S95,S96
0,ENSG00000000003.10,360,253,432,276,587,255,178,256,495,...,429,148,157,397,424,317,403,280,470,725
1,ENSG00000000005.5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ENSG00000000419.8,111,91,135,154,247,149,196,389,269,...,228,173,174,115,87,308,92,179,164,104
3,ENSG00000000457.8,2,1,1,1,105,0,0,126,0,...,0,2,5,1,2,0,0,0,38,1
4,ENSG00000000460.12,179,159,48,31,52,0,0,1,212,...,68,147,0,84,151,11,0,0,169,77
5,ENSG00000000938.8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ENSG00000000971.11,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,34,0,0,0
7,ENSG00000001036.8,293,205,120,0,84,89,0,203,1,...,110,375,112,46,55,291,283,224,293,93
8,ENSG00000001084.6,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0
9,ENSG00000001167.10,0,0,0,0,258,0,37,0,106,...,0,0,0,98,108,149,60,125,46,2


In [None]:
# OK, so there are 59838 genes in the count table.  First, how many are never observed?
test = df.iloc[:, 1:].sum(axis=1)  # total counts per gene across all cells
print('number of genes: {}'.format(test.shape[0]))

# vectorized count of the genes whose total is zero
unobserved = int((test == 0).sum())

print('number of unobserved genes: {}'.format(unobserved))
print('number of observed genes: {}'.format(test.shape[0] - unobserved))

# this drops all the all-zero rows:
df_small = df.loc[~(df.select_dtypes(include=['number']) == 0).all(axis=1), :]
print(df_small.shape)
df_small.head()

# How to normalize the data
Reads per million (RPM) is what they mostly use in the paper.
RPKM, which also normalizes by transcript length, is pretty standard and would be better to have.
TPM does the length scaling before the per-million scaling and might be a better measure of the number
of transcripts.
Transcript length normalization is the better option because of the tagmentation step in the Smart-seq2 protocol that they use.  This step breaks transcripts into roughly equal-sized fragments to improve the sequencing.  The downside is that each fragment of a transcript produces its own read, so a longer transcript collects more reads than a shorter one at the same abundance.
It's paired-end sequencing, so it's technically FPKM rather than RPKM, but that's unimportant.
Because I eventually want to look for signals that go across cells, I'm going to use TPM to normalize the number of transcripts in each cell.
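
For reference, the three normalizations compared here can be written out as follows. The notation is my own shorthand, not the paper's: $c_{g,j}$ is the read count of gene $g$ in cell $j$, and $\ell_g$ is the length of gene $g$ in bases.

$$\mathrm{RPM}_{g,j} = 10^{6}\,\frac{c_{g,j}}{\sum_{h} c_{h,j}}$$

$$\mathrm{RPKM}_{g,j} = 10^{9}\,\frac{c_{g,j}}{\ell_g \sum_{h} c_{h,j}}$$

$$\mathrm{TPM}_{g,j} = 10^{6}\,\frac{c_{g,j}/\ell_g}{\sum_{h} c_{h,j}/\ell_h}$$

TPM values sum to $10^{6}$ in every cell by construction, while the RPKM total depends on which genes are expressed in that cell. That is the reason TPM is easier to compare across cells.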

TPM first scales by transcript length and then by sequencing depth.
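
A minimal sketch of that two-step scaling in pandas, assuming a genes x cells count table indexed by gene ID and a per-gene length table in bases. The function name `counts_to_tpm`, the toy gene names and the toy lengths are all hypothetical; the real lengths still have to be built from the GTF.

```python
import pandas as pd


def counts_to_tpm(counts, lengths):
    """Convert raw counts (genes x cells) to TPM.

    counts  : DataFrame indexed by gene ID, one column per cell
    lengths : Series of gene lengths in bases, indexed by gene ID
    """
    lengths = lengths.reindex(counts.index)
    if lengths.isnull().any() or (lengths <= 0).any():
        raise ValueError('every gene needs a positive length')
    # Step 1: scale by length (reads per base)
    rate = counts.div(lengths, axis=0)
    # Step 2: scale by depth so that each cell sums to one million
    # (a cell with no reads at all would come out as NaN)
    return rate.div(rate.sum(axis=0), axis=1) * 1e6


# Toy data, not taken from the real count table
toy_counts = pd.DataFrame(
    {'S1': [10, 20, 30], 'S2': [0, 5, 5]},
    index=['geneA', 'geneB', 'geneC'])
toy_lengths = pd.Series([1000.0, 2000.0, 3000.0],
                        index=['geneA', 'geneB', 'geneC'])

tpm = counts_to_tpm(toy_counts, toy_lengths)
print(tpm)
```

In S1 the three genes have the same reads-per-base rate, so they get equal TPM even though their raw counts differ by a factor of three. With the notebook's data the call would presumably be made on something like `df_small.set_index('Gene_ID')`, once a length table exists.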

So for the length... yeah, I need to get that data.  Smart-seq2 is a poly-A selection protocol, so the length can be obtained by summing over the exons in the mature transcript.  The paper used the hg19 reference genome for mapping and GENCODE v17 for the splice junctions.  I downloaded the GENCODE v17 GTF and can use it to sum over the exons and build a reference table.  That code will be in the next notebook.
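
As a rough preview of what summing over the exons could look like, here is a sketch that is not the code from the next notebook. It makes one assumption worth flagging: because the count table is keyed by gene ID, it takes the union of all exons of a gene (overlapping exons from different isoforms are counted once) rather than the exons of a single transcript. It also assumes all exons of a gene sit on one chromosome. The names `exon_union_lengths` and `toy_row` are hypothetical, and the toy lines stand in for the real GTF.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def exon_union_lengths(gtf_lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith('#'):
            continue  # header / comment line
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'exon':
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # overlapping or adjacent exon: extend the current block
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


def toy_row(feature, start, end, gene_id):
    """Build one tab-separated GTF-style line for the demo."""
    attrs = 'gene_id "{}"; transcript_id "T";'.format(gene_id)
    return '\t'.join(['chr1', 'toy', feature, str(start), str(end),
                      '.', '+', '.', attrs])


demo_lengths = exon_union_lengths([
    '##toy annotation',
    toy_row('gene', 100, 500, 'G1'),
    toy_row('exon', 100, 200, 'G1'),
    toy_row('exon', 150, 250, 'G1'),  # overlaps the previous exon
    toy_row('exon', 400, 500, 'G1'),
    toy_row('exon', 10, 19, 'G2'),
])
print(demo_lengths)  # G1 covers 252 bases, G2 covers 10
```

On the real annotation, the function can be fed an open file handle for the downloaded GTF, since it only needs an iterable of lines.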