# LC-MS/MS to mRNA-Seq comparison

#### Overview of LC-MS/MS data
    
    * When you do LC-MS/MS:
        1. The sample is separated by liquid chromatography (LC), then ionized and sent into the mass spectrometer, which measures the mass-to-charge (m/z) ratio of each ion. In the second MS stage, ions that look interesting are diverted and broken up into their constituent parts, which are also measured for m/z.
        2. The output is a total ion chromatogram (TIC) plot: peaks for all the ions, shown as intensity over the time at which they are emitted from the LC. Some of these peaks have an associated MS/MS spectrum giving the m/z of the fragments of those ions.
        3. The program Proteome Discoverer uses a reference protein database to convert these chromatograms and MS/MS spectra into an msf file, which contains the filtered peptide spectral matches between our sample and the reference database.
        
        "The Proteins page lists all the identified proteins and the associated peptides found in a sample during a search. You can examine the search results in terms of protein identification, as well as access more details about the peptide identifications and corresponding information from the search input. The Proteins page gives you detailed tabular information, a shortcut menu, and access to the peptide information."
        
        
#### Our data:
    
    * 42,840 proteins detected
    
    
#### Comparison Steps:
    1. Nomenclature of genes must be the same across datasets
        match genes to proteins using fbgn_NAseq_Uniprot_fb_2015_04.tsv
    2. Compare total # of proteins to total # of transcripts
    3. Compare # of DE proteins to # of DE transcripts
    4. Are highly expressed proteins also highly expressed mRNAs? 
        * remember this is regardless of DE between conditions
    5. Spearman correlation for highly expressed
        * are outliers similar to Palmblad findings?
            (ρ = 0.14; outliers: ribosomal, histones, vitellogenins)
    6. Does ρ improve within significant groups like cuticular?
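
A minimal sketch of steps 4 and 5 on toy data, to make the planned statistic concrete. The gene names, abundance values, column names and the "top half by protein abundance" cutoff are all invented for illustration; they are not from our dataset. Spearman's ρ is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy abundances for five hypothetical genes (values invented for illustration).
abundance = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4", "g5"],
    "protein_abundance": [10.0, 200.0, 35.0, 80.0, 5.0],
    "mrna_abundance": [12.0, 150.0, 90.0, 60.0, 3.0],
})


def spearman_rho(df, x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    return df[x].rank().corr(df[y].rank())


# Step 5 on all genes.
rho_all = spearman_rho(abundance, "mrna_abundance", "protein_abundance")

# Step 4/5 on "highly expressed" proteins only. The median cutoff is an
# arbitrary assumption; any other threshold could be substituted.
cutoff = abundance["protein_abundance"].median()
high = abundance[abundance["protein_abundance"] >= cutoff]
rho_high = spearman_rho(high, "mrna_abundance", "protein_abundance")

print(rho_all, rho_high)
```

Because ρ only uses ranks, it is insensitive to the very different scales of protein and mRNA measurements, which is presumably why it is the statistic of choice here.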

In [21]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#read in protein-gene match file (fbgn_NAseq_Uniprot_fb_2015_04.tsv.gz)
pg_matches = pd.read_table("fbgn_NAseq_Uniprot_fb_2015_04.tsv")
pg_matches = pg_matches.rename(columns = {"UniprotKB/Swiss-Prot/TrEMBL_accession":"protein", "gene_symbol":"gene"})
#read in sig protein file (pval <0.1)
proteins = pd.read_csv("sig_proteins_0.1.csv")
#read in gene expression file (pval < 0.05)
#transcripts = pd.read_csv("../DESeq2/deseq2_sig.05_3G1G.csv")
#read in all detected transcripts
transcripts = pd.read_csv("../DESeq2/deseq2_degenes_3G1G.csv")
#check that files look correct
#print(pg_matches.head())
print(pg_matches.shape)
print(proteins.head())
print(proteins.shape)
print(transcripts.head())
print(transcripts.shape) 

(1149082, 5)
    Assession      X-mu        mu         X       std    zscore    pvalue  \
0  'P05661-3' -0.227329  0.902985  0.675656  0.052633 -4.319124  0.000008   
1    'Q9VHS2' -0.210521  0.882477  0.671956  0.070654 -2.979597  0.001443   
2    'A1Z9M2' -0.199726  0.862572  0.662846  0.070010 -2.852821  0.002167   
3    'O44081' -0.164959  0.887575  0.722616  0.068446 -2.410070  0.007975   
4    'Q0KHY3' -0.207974  0.925934  0.717959  0.087573 -2.374881  0.008777   

   Fold Change  log pvalue  log Fold Change  log pvalue.1  
0     0.748247   16.962097        -0.418413     16.962097  
1     0.761443    9.436573        -0.393192      9.436573  
2     0.768453    8.850318        -0.379971      8.850318  
3     0.814146    6.970348        -0.296640      6.970348  
4     0.775390    6.832005        -0.367007      6.832005  
(111, 11)
      gene       baseMean  log2FoldChange    lfcMLE     lfcSE       stat  \
0   CG4151  173781.895300        9.546780  9.940192  0.229635  41.573766   
1 

In [4]:
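#clean up the significant proteins dataframe
#strip the literal quote characters around each accession ('Assession' is the column name as spelled in the input file)
proteins['Assession'] = proteins['Assession'].str.strip("'")
proteins = proteins.rename(columns = {'Assession':'protein'})
#drop X-mu, mu, X, std and zscore (columns 1-5)
proteins.drop(proteins.columns[[1,2,3,4,5]], axis=1, inplace=True)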

In [5]:
pg_matches.drop(pg_matches.columns[[1,2,3]], axis=1, inplace=True)

In [6]:
#match gene names to sig. proteins
protein_data = pd.merge(proteins, pg_matches, how="left", on="protein")

In [7]:
print(protein_data.shape)
protein_data.head()

(112, 7)


Unnamed: 0,protein,pvalue,Fold Change,log pvalue,log Fold Change,log pvalue.1,gene
0,P05661-3,8e-06,0.748247,16.962097,-0.418413,16.962097,
1,Q9VHS2,0.001443,0.761443,9.436573,-0.393192,9.436573,COX7A
2,A1Z9M2,0.002167,0.768453,8.850318,-0.379971,8.850318,CG8331
3,O44081,0.007975,0.814146,6.970348,-0.29664,6.970348,Nop60B
4,Q0KHY3,0.008777,0.77539,6.832005,-0.367007,6.832005,mesh
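
The merged table has 112 rows although the protein table had 111, which suggests that one accession matched more than one row of the mapping file (a left merge repeats the left row once per match). A small sketch of how both cases, extra matches and missing matches, could be spotted. The tables below are hypothetical stand-ins, not our files.

```python
import pandas as pd

# Hypothetical stand-ins for the protein table and the accession-to-gene map.
proteins_toy = pd.DataFrame({"protein": ["P1", "P2", "P3"],
                             "pvalue": [0.01, 0.02, 0.03]})
mapping_toy = pd.DataFrame({"protein": ["P1", "P2", "P2"],
                            "gene": ["geneA", "geneB", "geneB-alt"]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(proteins_toy, mapping_toy, how="left", on="protein",
                  indicator=True)

# Accessions that matched more than one mapping row (these add rows).
multi = merged[merged.duplicated("protein", keep=False)]
# Accessions with no mapping row at all (gene is NaN).
unmatched = merged[merged["_merge"] == "left_only"]

print(len(proteins_toy), len(merged))
print(multi["protein"].unique(), unmatched["protein"].tolist())
```

Passing `validate="many_to_one"` to `pd.merge` would instead raise an error as soon as the mapping contains a repeated accession, which may be preferable when duplicates are not expected.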


In [10]:
#find proteins missing genes
#(this cell was run after the manual fixes in In [9], so the result is empty)
dfnulls = protein_data[protein_data['gene'].isnull()]
print(dfnulls)

Empty DataFrame
Columns: [protein, pvalue, Fold Change, log pvalue, log Fold Change, log pvalue.1, gene]
Index: []


In [9]:
#add missing gene names and check locations
protein_data.loc[109, 'gene']="fau"
print(protein_data[109:110])
protein_data.loc[0, 'gene'] = "Mhc"
protein_data[0:1]

      protein    pvalue  Fold Change  log pvalue  log Fold Change  \
109  Q9VGX3-3  0.097916      0.77478    3.352314        -0.368142   

     log pvalue.1 gene  
109      3.352314  fau  


Unnamed: 0,protein,pvalue,Fold Change,log pvalue,log Fold Change,log pvalue.1,gene
0,P05661-3,8e-06,0.748247,16.962097,-0.418413,16.962097,Mhc
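
Both accessions that needed a manual gene name (P05661-3 and Q9VGX3-3) carry an isoform suffix. A possible alternative to hand-editing, sketched under the assumption that the mapping file lists only the canonical accession without the `-<n>` suffix (this was not checked against the real file). The toy mapping below is hypothetical apart from the two accession/gene pairs taken from the output above.

```python
import pandas as pd

proteins_toy = pd.DataFrame({"protein": ["P05661-3", "Q9VHS2"]})
# Hypothetical mapping keyed by canonical accessions only (an assumption).
mapping_toy = pd.DataFrame({"protein": ["P05661", "Q9VHS2"],
                            "gene": ["Mhc", "COX7A"]})

# Drop a trailing "-<digits>" isoform suffix to get the canonical accession.
proteins_toy["canonical"] = proteins_toy["protein"].str.replace(
    r"-\d+$", "", regex=True)

# Merge on the canonical accession but keep the original isoform label.
matched = pd.merge(proteins_toy,
                   mapping_toy.rename(columns={"protein": "canonical"}),
                   how="left", on="canonical")
print(matched)
```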


In [11]:
protein_data.columns

Index(['protein', 'pvalue', 'Fold Change', 'log pvalue', 'log Fold Change', 'log pvalue.1', 'gene'], dtype='object')

In [12]:
transcripts.drop(transcripts.columns[[1,5]], axis=1, inplace=True)
transcripts.columns

Index(['gene', 'log2FoldChange', 'lfcMLE', 'lfcSE', 'pvalue', 'padj'], dtype='object')

In [13]:
#add gene expression info
sig_pandg = pd.merge(protein_data, transcripts, how='left', on='gene')

In [None]:
sig_pandg

In [15]:
print(sig_pandg.shape)
#check for rows still missing a gene name after the merge
dfnulls = sig_pandg[sig_pandg['gene'].isnull()]
dfnulls

(112, 12)


Unnamed: 0,protein,pvalue_x,Fold Change,log pvalue,log Fold Change,log pvalue.1,gene,log2FoldChange,lfcMLE,lfcSE,pvalue_y,padj
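
The `pvalue_x` / `pvalue_y` names above are the default suffixes pandas adds when both tables share a column name. A short sketch of how more readable names could be requested at merge time; the toy values and the suffix strings are illustrative choices, not what was used in this notebook.

```python
import pandas as pd

# Toy protein and RNA tables that both have a "pvalue" column.
protein_toy = pd.DataFrame({"gene": ["Mhc", "mesh"],
                            "pvalue": [0.001, 0.009]})
rna_toy = pd.DataFrame({"gene": ["Mhc", "mesh"],
                        "pvalue": [0.2, 0.04],
                        "padj": [0.4, 0.08]})

# suffixes only applies to overlapping column names ("pvalue" here).
combined = pd.merge(protein_toy, rna_toy, how="left", on="gene",
                    suffixes=("_protein", "_RNA"))
print(combined.columns.tolist())
```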


In [16]:
#add functions from panther/ravi
functions = pd.read_csv("../pathway_analysis/panther/all_functions.csv")

In [None]:
#functions

In [18]:
df = pd.merge(sig_pandg, functions, how="left", on="gene")

In [19]:
df.columns

Index(['protein', 'pvalue_x', 'Fold Change', 'log pvalue', 'log Fold Change', 'log pvalue.1', 'gene', 'log2FoldChange', 'lfcMLE_x', 'lfcSE_x', 'pvalue_y', 'padj_x', 'category', 'baseMean', 'log2FoldChange_y', 'lfcMLE_y', 'lfcSE_y', 'stat', 'pvalue', 'padj_y', 'Panther gene ID', 'Mapped ID_y', 'Gene Symbol', 'Panther family/subfamily', 'Panther protein class', 'GO database BP Complete_y', 'GO database MF Complete_y'], dtype='object')

In [23]:
#Load Panther annotation for sig proteins. Merge with comparison dataframe.
panther = pd.read_table('pantherGeneList.txt', header=None)
panther.columns = ['gene_id','gene','gene_name','Panther_family','Panther_protein_class','PANTHER_GO-Slim_Molecular_Function','PANTHER_GO-Slim_Biological_Process','GO_database_MF_complete','GO_database_BP_complete']
panther.head()
protein_gene_functions = pd.merge(df, panther, how='left', on='gene')

In [39]:
#protein_gene_functions.to_csv("sigpro_allgene_comparison........csv")
#after this I manually added genes with different names, so don't overwrite!

In [None]:
#Looking at correlation between protein and gene abundances (DE)
new_cats = pd.read_csv("categories_only.csv")
#new_cats.head()

In [None]:
#read in the necessary files, merge, remove extraneous columns
pro_gene_comp = pd.read_csv("sigpro_allgene_comparison.csv")

In [4]:
pg_comp = pd.merge(new_cats, pro_gene_comp, how="outer", on="gene")
pg_comp.drop(pg_comp.columns[[12]], axis=1, inplace=True)
pg_comp = pg_comp.rename(columns = {'category_x':'category'})

In [7]:
#pg_comp.head()
#pg_comp.shape
#pg_comp.describe()
pg_comp.columns
#count_pgcomp = pd.Series(pg_comp["category"])
#cat_count = count_pgcomp.value_counts()
#print(cat_count)

Index([u'protein_x', u'gene', u'category', u'protein_y', u'protein_pvalue', u'protein_fold_change', u'protein_log2FoldChange', u'RNA_log2FoldChange', u'RNA_lfcMLE', u'RNA_lfcSE', u'RNA_pvalue', u'RNA_padj', u'gene_id', u'gene_name', u'Panther_family', u'Panther_protein_class', u'PANTHER_GO-Slim_Molecular_Function', u'PANTHER_GO-Slim_Biological_Process', u'GO_database_MF_complete', u'GO_database_BP_complete'], dtype='object')

In [8]:
binding = pg_comp[(pg_comp['category'] == 'binding') & (pg_comp['RNA_padj'] < 0.1)]
oxidoreductase = pg_comp[(pg_comp['category'] == 'oxidoreductase') & (pg_comp['RNA_padj'] < 0.1)]
transport = pg_comp[(pg_comp['category'] == 'transport') & (pg_comp['RNA_padj'] < 0.1)]
endopeptidase = pg_comp[(pg_comp['category'] == 'endopeptidase') & (pg_comp['RNA_padj'] < 0.1)]
transferase = pg_comp[(pg_comp['category'] == 'transferase') & (pg_comp['RNA_padj'] < 0.1)]
defense = pg_comp[(pg_comp['category'] == 'defense') & (pg_comp['RNA_padj'] < 0.1)]
cuticle = pg_comp[(pg_comp['category'] == 'cuticule') & (pg_comp['RNA_padj'] < 0.1)]  #the data spells this label 'cuticule' (see the category keys printed by the next cell)
myosin = pg_comp[(pg_comp['category'] == 'myosin') & (pg_comp['RNA_padj'] < 0.1)]
significant = pg_comp[pg_comp['RNA_padj'] < 0.1]
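
# A sketch of an alternative to one hand-written filter per category: let
# groupby split the significant rows by whatever labels the data contains.
# This avoids silently empty subsets when a label is spelled differently
# in the file. The toy table below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene": ["a", "b", "c", "d"],
    "category": ["binding", "binding", "cuticule", "transport"],
    "RNA_padj": [0.01, 0.5, 0.03, 0.09],
})

significant_toy = toy[toy["RNA_padj"] < 0.1]

# One DataFrame per category label actually present in the data.
by_category = {name: group
               for name, group in significant_toy.groupby("category")}
print(sorted(by_category))
```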

In [9]:
significant.groupby('category').groups.keys()

['endopeptidase',
 'binding',
 'myosin',
 'transferase',
 'defense',
 'oxidoreductase',
 'cuticule',
 'transport']

In [None]:
significant[significant['category'] == 'defense']

In [None]:
#significant.groupby('category').agg(['mean','count','std'])
significant.groupby('category')[['protein_log2FoldChange','RNA_log2FoldChange']].corr(method='spearman')

In [78]:
spearman = pg_comp.groupby('category')[['protein_log2FoldChange','RNA_log2FoldChange']].corr(method='spearman')

In [79]:
pearson = pg_comp.groupby('category')[['protein_log2FoldChange','RNA_log2FoldChange']].corr(method='pearson')

In [83]:
spearman.to_csv('spearman.csv')
pearson.to_csv('pearson.csv')
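
The grouped `.corr()` calls above return a full 2x2 matrix per category, so the saved files contain each coefficient twice plus the diagonal of ones. A sketch of how a single ρ per category, together with the group size, could be extracted instead. The helper name `rho_by_category` and the toy values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["binding"] * 4 + ["transport"] * 4,
    "protein_log2FoldChange": [-0.4, -0.2, 0.1, 0.3, 0.5, 0.1, -0.1, -0.3],
    "RNA_log2FoldChange": [-1.0, -0.5, 0.2, 0.9, -0.8, -0.2, 0.4, 1.1],
})


def rho_by_category(df, x, y, by="category"):
    """Return one Spearman coefficient and the group size per category."""
    rows = []
    for name, group in df.groupby(by):
        # Off-diagonal element of the 2x2 Spearman correlation matrix.
        rho = group[[x, y]].corr(method="spearman").loc[x, y]
        rows.append({by: name, "n": len(group), "spearman_rho": rho})
    return pd.DataFrame(rows).set_index(by)


summary = rho_by_category(toy, "RNA_log2FoldChange", "protein_log2FoldChange")
print(summary)
```

Keeping `n` next to each coefficient matters here because some categories contain only a handful of proteins, and ρ from very small groups is not very informative.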

In [None]:
significant.head()

In [54]:
spearsig = significant[['protein_log2FoldChange','RNA_log2FoldChange']].corr(method='spearman')

In [None]:
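#correlation matrix of the numeric columns only (text columns such as gene names cannot be correlated)
plt.matshow(significant.select_dtypes(include='number').corr())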

In [None]:
fold_change_plot = significant.plot(kind='scatter', x='RNA_log2FoldChange', y='protein_log2FoldChange', title="Protein-RNA Differential Expression, p < 0.1")
fold_change_plot.set_xlabel("RNA log2 fold change")
fold_change_plot.set_ylabel("Protein log2 fold change")