### <p style="text-align: right;"> &#9989; **Group 1 (GoGreen)** </p>
#### <p style="text-align: right;"> &#9989; Beth, Zhongjie, McKenna, Erik</p>

# Module 1



## How do SNPs relate to gene transcription?

We will work with maize. 

By default, Jupyter converts tabs into spaces, which keeps the indentation consistent.

First, import the usual libraries:
- `math`: basic math operations
- `os`: file and path manipulation through the operating system
- `sys`: access to interpreter settings and command-line arguments
- `glob`: find files whose names match a wildcard pattern
- `matplotlib.pyplot`: default plotter (I personally like ggplot waaaaay better. E)
    - `inline`: so that plots are shown in the notebook
- `seaborn`: nicer plots
- `numpy`: all number crunching done here
- `pandas`: data wrangling

In [1]:
import math
import os
import sys
import glob
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd


Define some functions. Let's see how long we can keep the code tidy and commented.

In [2]:
def dot_to_int(char_in, substitute='.', value=0.0):
    """
    Substitute a string for a given value if such string is of interest
    
    Parameters
    ----------
    char_in : string
         possible char to be substituted
    
    substitute : string
         if the given input matches this object, substitute
    
    value : int
         if applicable, substitute the input with this value
    
    Returns
    -------
    value : if input matches substitute, else returns unchanged input
    """
    if(char_in == substitute):
        return value
    else:
        return char_in
    
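def numMissing(row, start=10, missing='.'):
    """Count the entries equal to `missing` from column `start` onward."""
    foo = row[start:] == missing
    return foo.sum()


def numData(row, data_cols=['Total Calls', 'Missing Data']):
    """Add up the values of the columns listed in `data_cols`."""
    return row[data_cols].sum()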

We can print whatever text was written between the `"""` above.

In [5]:
print(dot_to_int.__doc__)


    Substitute a string for a given value if such string is of interest
    
    Parameters
    ----------
    char_in : string
         possible char to be substituted
    
    substitute : string
         if the given input matches this object, substitute
    
    value : int
         if applicable, substitute the input with this value
    
    Returns
    -------
    value : if input matches substitute, else returns unchanged input
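
The printed text above is shifted to the right because, in the Python version used here, `__doc__` keeps the indentation of the source. A minimal sketch of how the standard-library `inspect.getdoc` returns the same text with that indentation removed. The `scale` function is a hypothetical stand-in, not part of this notebook:

```python
import inspect

def scale(x, factor=2.0):
    """
    Multiply a number by a factor.

    Parameters
    ----------
    x : float
        number to be scaled
    """
    return x * factor

# getdoc() drops the surrounding blank lines and the common indentation
clean = inspect.getdoc(scale)
print(clean)
```

`help(scale)` shows the same cleaned-up text together with the function signature.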
    


In [3]:
# Set the paths to the raw data
src = '/home/ejam/documents/css893/raw_data/'
dst = '/home/ejam/documents/css893/output/'
file_snp = 'B73_plus_RTAs_snp_matrix_995785.txt'
file_gen = '942_FPKM_B73_genes_w_feature.txt'

In [4]:
# Apply dot_to_int to the four percentage columns while the file is being read
ATCG = ['Percent A', 'Percent T', 'Percent C', 'Percent G']
converters = {col: dot_to_int for col in ATCG}
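
A converter receives each cell as raw text, so `dot_to_int` only changes the cells that are exactly `'.'` and leaves every other cell as a string. The sketch below shows this on a hypothetical three-row table (the values are made up) and compares it with `na_values='.'`, an alternative that this notebook does not use. Note that the alternative marks the placeholder as missing (`NaN`) instead of treating it as 0.

```python
import io
import pandas as pd

def dot_to_int(char_in, substitute='.', value=0.0):
    # Same logic as the function defined earlier in this notebook
    return value if char_in == substitute else char_in

# Hypothetical toy table in the same tab-separated layout
toy = "position\tPercent A\tPercent T\n10\t.\t100\n20\t12.5\t87.5\n30\t0\t.\n"

with_converter = pd.read_csv(io.StringIO(toy), sep='\t',
                             converters={'Percent A': dot_to_int,
                                         'Percent T': dot_to_int})
with_na = pd.read_csv(io.StringIO(toy), sep='\t', na_values='.')

# Converter: a mix of the float 0.0 and untouched strings
print(with_converter['Percent A'].tolist())
# na_values: a numeric column with NaN where the '.' was
print(with_na['Percent A'].tolist())
```

If the percentages are needed as numbers after using the converter, `pd.to_numeric` on those columns is one way to get there.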

In [6]:
# BEWARE: the line below uses about 24 GB of RAM at its peak.
snp_matrix = pd.read_table(src + file_snp, dtype={'chromosome': str}, converters=converters)

# For a quick test, read only the first 1000 rows instead:
#snp_matrix = pd.read_table(src + file_snp, dtype={'chromosome': str}, converters=converters, nrows=1000)
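
When only part of the file is needed, reading it in pieces is one way to keep the peak memory down. This is a sketch under assumptions, not what was run here: the toy table and its values are hypothetical, and the chunk size of 2 is only for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the SNP file: tab-separated, with a 'chromosome' column
toy = ("chromosome\tposition\tOh43\n"
       "B73V4_ctg10\t292327\tG\n"
       "1\t100085818\tC\n"
       "B73V4_ctg104\t5120\t.\n"
       "10\t99998392\tC\n")

wanted = {'B73V4_ctg10', 'B73V4_ctg104'}
pieces = []

# With chunksize, read_csv yields DataFrames of at most that many rows,
# so only one chunk plus the rows kept so far are in memory at a time
for chunk in pd.read_csv(io.StringIO(toy), sep='\t',
                         dtype={'chromosome': str}, chunksize=2):
    pieces.append(chunk[chunk['chromosome'].isin(wanted)])

subset = pd.concat(pieces)
print(subset)
```

For the real file a much larger `chunksize` would be more sensible; the right value depends on the available RAM.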

In [9]:
snp_matrix

Unnamed: 0,chromosome,position,reference_allele,Allele Counts,Missing Data,Total Calls,Percent A,Percent T,Percent C,Percent G,...,NC328,NC326,PHV53,DKIBC2,A641,WIL900,Va22,E8501,PHP85,Oh43
0,B73V4_ctg102,36353,A,A(0)T(0)C(0)G(264),695,264,0,0,0,100,...,.,.,G,G,G,.,.,G,G,G
1,B73V4_ctg10,292327,T,A(0)T(744)C(0)G(105),110,849,0,87.6325088339223,0,12.3674911660777,...,T,T,.,T,T,.,G,T,T,G
2,B73V4_ctg10,292384,G,A(2)T(0)C(0)G(906),51,908,0.220264317180617,0,0,99.7797356828194,...,G,G,G,G,G,G,G,G,G,G
3,B73V4_ctg10,292386,G,A(107)T(0)C(0)G(794),58,901,11.8756936736959,0,0,88.1243063263041,...,G,G,G,G,G,G,A,G,G,A
4,B73V4_ctg10,292402,A,A(912)T(1)C(0)G(0),46,913,99.8904709748083,0.109529025191676,0,0,...,A,A,A,A,A,A,A,A,A,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995780,10,99998392,C,A(0)T(5)C(954)G(0),0,959,0,0.521376433785193,99.4786235662148,0,...,C,C,C,C,C,C,C,C,C,C
995781,10,99998427,A,A(855)T(0)C(30)G(0),74,885,96.6101694915254,0,3.38983050847458,0,...,A,A,A,A,.,A,A,A,A,A
995782,10,99998428,G,A(7)T(0)C(0)G(857),95,864,0.810185185185185,0,0,99.1898148148148,...,G,G,G,G,.,G,G,G,G,G
995783,10,99998430,C,A(0)T(0)C(843)G(5),111,848,0,0,99.4103773584906,0.589622641509434,...,C,C,C,C,.,C,C,C,C,C


We have 995,785 rows × 952 columns. I need roughly 16 GB just to load the matrix, and way more memory to do any actual data wrangling 😿

The first 10 columns are metadata. The rest correspond to cultivars.

The column names for the cultivars are found in `B73_plus_RTAs_snp_matrix_imputed/widiv_942g_979873SNPs_imputed_filteredGenos_withRTA_AGPv4.hmp.txt`
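
The cultivar columns hold single-character calls (`A`, `T`, `C`, `G` or `.`), and storing each of them as a Python string is expensive. The sketch below measures a hypothetical random block of calls with `memory_usage(deep=True)` and compares it with the same block stored as `category`. It has not been tried on the real matrix, so the savings there are an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical genotype block: 10,000 SNPs x 5 cultivars (names are made up)
letters = rng.choice(['A', 'T', 'C', 'G', '.'], size=(10_000, 5))
calls = pd.DataFrame(letters, columns=['cv1', 'cv2', 'cv3', 'cv4', 'cv5']).astype(object)

# category stores each distinct call once plus a small integer code per cell
as_category = calls.astype('category')

bytes_object = calls.memory_usage(deep=True).sum()
bytes_category = as_category.memory_usage(deep=True).sum()
print('object  :', bytes_object, 'bytes')
print('category:', bytes_category, 'bytes')
```

The same two `memory_usage` calls on `snp_matrix` would show where the memory actually goes.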

In [31]:
#print(snp_matrix.dtypes.values)
chromosomes = np.unique(snp_matrix['chromosome'].values.astype(str))

In [48]:
print('There are', len(chromosomes), 'different chromosomes.')
print('\n', chromosomes[:106], '\n\n', chromosomes[5000:5020])

There are 9735 different chromosomes.

 ['1' '10' '2' '3' '4' '5' '6' '7' '8' '9' 'B73V4_ctg10' 'B73V4_ctg102'
 'B73V4_ctg103' 'B73V4_ctg104' 'B73V4_ctg105' 'B73V4_ctg107' 'B73V4_ctg11'
 'B73V4_ctg115' 'B73V4_ctg123' 'B73V4_ctg129' 'B73V4_ctg134' 'B73V4_ctg14'
 'B73V4_ctg140' 'B73V4_ctg141' 'B73V4_ctg142' 'B73V4_ctg147'
 'B73V4_ctg150' 'B73V4_ctg152' 'B73V4_ctg153' 'B73V4_ctg154'
 'B73V4_ctg165' 'B73V4_ctg166' 'B73V4_ctg167' 'B73V4_ctg168'
 'B73V4_ctg170' 'B73V4_ctg171' 'B73V4_ctg173' 'B73V4_ctg178'
 'B73V4_ctg181' 'B73V4_ctg182' 'B73V4_ctg188' 'B73V4_ctg189'
 'B73V4_ctg190' 'B73V4_ctg193' 'B73V4_ctg2' 'B73V4_ctg203' 'B73V4_ctg205'
 'B73V4_ctg207' 'B73V4_ctg23' 'B73V4_ctg24' 'B73V4_ctg247' 'B73V4_ctg248'
 'B73V4_ctg26' 'B73V4_ctg28' 'B73V4_ctg3' 'B73V4_ctg31' 'B73V4_ctg35'
 'B73V4_ctg4' 'B73V4_ctg41' 'B73V4_ctg44' 'B73V4_ctg48' 'B73V4_ctg5'
 'B73V4_ctg50' 'B73V4_ctg51' 'B73V4_ctg52' 'B73V4_ctg53' 'B73V4_ctg56'
 'B73V4_ctg58' 'B73V4_ctg6' 'B73V4_ctg61' 'B73V4_ctg65' 'B73V4_ctg66'
 'B73V

In [10]:
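# Collect the row labels of the ten numbered chromosomes, leaving the contigs out.
# np.array([]) is float64, so the concatenated labels come out as floats.
downIndex = np.array([])
for i in range(1,11):
    chromo = snp_matrix[snp_matrix['chromosome'] == '{}'.format(i)]
    downIndex = np.concatenate((downIndex, chromo.index.values))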
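
The same selection can be sketched without a loop by using `isin`, which also keeps the row labels as integers. The toy table is hypothetical. One difference to keep in mind: `isin` returns the rows in file order, while the loop groups them chromosome by chromosome.

```python
import pandas as pd

# Hypothetical toy table with contigs and numbered chromosomes mixed together
toy = pd.DataFrame({'chromosome': ['B73V4_ctg10', '1', '2', 'B73V4_ctg104', '10'],
                    'position': [292327, 100085818, 5000, 7000, 99998392]})

numbered = [str(i) for i in range(1, 11)]   # '1' ... '10'
mask = toy['chromosome'].isin(numbered)     # True for the numbered chromosomes
down_index = toy.index[mask.to_numpy()]     # integer row labels, no float upcast

print(down_index.dtype)
print(toy.loc[down_index])
```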


In [13]:
snp_matrix.loc[downIndex]

Unnamed: 0,chromosome,position,reference_allele,Allele Counts,Missing Data,Total Calls,Percent A,Percent T,Percent C,Percent G,...,NC328,NC326,PHV53,DKIBC2,A641,WIL900,Va22,E8501,PHP85,Oh43
82131.0,1,100085818,C,A(0)T(21)C(508)G(0),430,529,0,3.96975425330813,96.0302457466919,0,...,.,C,C,C,.,C,.,C,.,C
82132.0,1,100086117,G,A(30)T(0)C(0)G(275),654,305,9.83606557377049,0,0,90.1639344262295,...,.,G,G,A,G,G,G,G,.,G
82133.0,1,100086147,C,A(0)T(0)C(299)G(33),627,332,0,0,90.0602409638554,9.93975903614458,...,C,C,C,G,C,C,C,C,.,C
82134.0,1,100086204,G,A(53)T(0)C(0)G(304),602,357,14.8459383753501,0,0,85.1540616246499,...,G,G,G,G,A,G,G,G,.,G
82135.0,1,100086286,C,A(0)T(0)C(434)G(54),471,488,0,0,88.9344262295082,11.0655737704918,...,C,C,C,G,C,C,C,C,.,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995780.0,10,99998392,C,A(0)T(5)C(954)G(0),0,959,0,0.521376433785193,99.4786235662148,0,...,C,C,C,C,C,C,C,C,C,C
995781.0,10,99998427,A,A(855)T(0)C(30)G(0),74,885,96.6101694915254,0,3.38983050847458,0,...,A,A,A,A,.,A,A,A,A,A
995782.0,10,99998428,G,A(7)T(0)C(0)G(857),95,864,0.810185185185185,0,0,99.1898148148148,...,G,G,G,G,.,G,G,G,G,G
995783.0,10,99998430,C,A(0)T(0)C(843)G(5),111,848,0,0,99.4103773584906,0.589622641509434,...,C,C,C,C,.,C,C,C,C,C


In [15]:
995784 - 82130

913654

In [39]:
# Downsample: keep only the SNPs of four small contigs
b73v4_ctg10 = snp_matrix[snp_matrix['chromosome'] == 'B73V4_ctg10']
b73v4_ctg107 = snp_matrix[snp_matrix['chromosome'] == 'B73V4_ctg107']
b73v4_ctg105 = snp_matrix[snp_matrix['chromosome'] == 'B73V4_ctg105']
b73v4_ctg104 = snp_matrix[snp_matrix['chromosome'] == 'B73V4_ctg104']
downIndex = np.concatenate((b73v4_ctg10.index.values, b73v4_ctg104.index.values, b73v4_ctg105.index.values, b73v4_ctg107.index.values))

In [40]:
print('We have\t', len(b73v4_ctg10), '\tSNPs in chromosome b73v4_ctg10')
print('We have\t', len(b73v4_ctg104), '\tSNPs in chromosome b73v4_ctg104')
print('We have\t', len(b73v4_ctg105), '\tSNPs in chromosome b73v4_ctg105')
print('We have\t', len(b73v4_ctg107), '\tSNPs in chromosome b73v4_ctg107')
print('TOTAL\t', len(downIndex))
print('Between', downIndex.min(), 'and', downIndex.max())

We have	 184 	SNPs in chromosome b73v4_ctg10
We have	 9 	SNPs in chromosome b73v4_ctg104
We have	 51 	SNPs in chromosome b73v4_ctg105
We have	 277 	SNPs in chromosome b73v4_ctg107
TOTAL	 521
Between 1 and 635


In [41]:
downsample = snp_matrix.loc[downIndex]
downsample.to_csv(dst+'snp_matrix_downsample.csv', index=False)

In [42]:
downsample = pd.read_csv(dst+'snp_matrix_downsample.csv')

In [25]:
missing_data = downsample.apply(numMissing, axis=1).values
discrepancy_missing = downsample['Missing Data'].values - missing_data
# Sum the boolean mask to count the mismatches; len() would just return the number of rows
print('A total of', (discrepancy_missing != 0).sum(), 'SNPs have different Missing_Data as reported.')
print('Largest discrepancy: ', discrepancy_missing.max(), '[ at row', discrepancy_missing.argmax(),']' )

A total of 512 SNPs have different Missing_Data as reported.
Largest discrepancy:  17 [ at row 223 ]
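
A minimal sketch of the same check on a hypothetical toy table, done column-wise instead of with `apply`. It also shows why the count of mismatches has to be the sum of the boolean mask: `len` of the mask is the number of rows, whatever the mask contains.

```python
import pandas as pd

# Hypothetical toy table: one metadata column followed by three cultivars
toy = pd.DataFrame({'Missing Data': [1, 3, 0],
                    'cv1': ['A', '.', 'G'],
                    'cv2': ['.', '.', 'G'],
                    'cv3': ['A', 'T', 'G']})
start = 1   # the cultivar columns begin here (10 in the real matrix)

# Count the '.' entries per row over all cultivar columns at once
recount = (toy.iloc[:, start:] == '.').sum(axis=1)
discrepancy = toy['Missing Data'] - recount

n_different = int((discrepancy != 0).sum())   # rows where reported and recounted differ
n_rows = len(discrepancy != 0)                # always the number of rows

print(recount.tolist(), n_different, n_rows)
```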


In [27]:
total_data = downsample.apply(numData, axis=1).values
print('Mean total:\t', total_data.mean())
print('Var total:\t', total_data.var())

Mean total:	 959.0
Var total:	 0.0


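Every row adds up to **959** calls (`Total Calls` + `Missing Data`), but we only have **942** cultivar columns. The summary columns therefore seem to count 17 samples that are not in the matrix, which matches the largest discrepancy found above. Thus the `Missing Data` column has **wrong** values for the cultivars we actually have.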
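
The arithmetic can be checked by hand on one row. The sketch below parses the `Allele Counts` string of row 1 of the SNP matrix shown earlier and compares it with the `Total Calls`, `Missing Data` and `Percent T` values of that row. The helper `parse_allele_counts` is hypothetical and assumes the `A(n)T(n)C(n)G(n)` layout seen in the table.

```python
import re

def parse_allele_counts(text):
    """Turn a string such as 'A(0)T(744)C(0)G(105)' into {'A': 0, 'T': 744, ...}."""
    return {base: int(n) for base, n in re.findall(r'([ATCG])\((\d+)\)', text)}

# Values copied from row 1 of the SNP matrix displayed earlier
counts = parse_allele_counts('A(0)T(744)C(0)G(105)')
total_calls = sum(counts.values())           # reported Total Calls: 849
missing_reported = 110                       # reported Missing Data
percent_t = 100 * counts['T'] / total_calls  # reported Percent T: 87.6325088339223

print(counts)
print(total_calls, total_calls + missing_reported, percent_t)
```

The allele counts add up to the reported `Total Calls`, and calls plus missing give 959, so the summary columns are at least consistent with each other.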

Now let's explore the expression data!!

In [28]:
genes_w_features = pd.read_table(src+file_gen)
genes_w_features



Unnamed: 0,gene,chromosome,feature_type,position_left,position_right,LH128,DKMBZA,CQ806,DKF274,Ill.Hy,...,NC328,NC326,PHV53,DKIBC2,A641,WIL900,Va22,E8501,PHP85,Oh43
0,Zm00001d012719,8,gene,179164454,179168169,5.825000,4.845710,6.323400,6.999010,6.661100,...,15.154700,12.870700,8.81507,13.77870,8.31790,12.78320,4.94950,9.39900,10.41090,15.840500
1,Zm00001d024742,10,gene,85863323,85863746,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.000000
2,Zm00001d007137,2,tRNA_gene,223153553,223153627,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.000000
3,Zm00001d025653,10,gene,125107973,125113328,9.350780,9.954800,7.735490,10.360900,9.072890,...,4.685600,3.870230,5.27849,5.17829,3.80291,6.44675,5.61858,3.98055,4.17545,4.500850
4,Zm00001d036003,6,gene,65322472,65323171,8.955580,0.333921,5.445820,0.332438,1.232330,...,0.232934,0.721774,2.95424,8.44716,12.33680,1.19629,1.71452,7.01768,11.75760,0.793672
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44254,Zm00001d028653,1,gene,41833737,41838204,368.777000,301.349000,412.319000,310.820000,278.737000,...,172.293000,186.552000,147.63300,165.60800,147.58300,244.82100,171.15400,219.07400,147.01700,165.612000
44255,Zm00001d009511,8,gene,68418436,68421217,6.003590,5.297160,10.421700,6.304130,4.856300,...,4.120700,4.670860,3.47752,4.92093,6.26904,4.95873,7.25625,5.66864,4.28185,5.765310
44256,Zm00001d032828,1,gene,239455289,239464371,151.479000,46.479600,65.075600,70.169900,59.555600,...,100.249000,102.226000,100.29500,64.90400,91.17440,101.80300,108.58200,63.72140,95.96760,56.404900
44257,Zm00001d002031,2,gene,4673452,4674247,0.237774,0.386886,0.179111,0.415566,0.192793,...,0.000000,0.000000,0.00000,0.00000,0.00000,1.51552,0.00000,0.00000,0.00000,2.555730


In [38]:
bar = np.unique(genes_w_features['chromosome'].values.astype(str))
print('There are', len(bar), 'different chromosomes.')

print(bar)

There are 106 different chromosomes.
['1' '10' '2' '3' '4' '5' '6' '7' '8' '9' 'B73V4_ctg10' 'B73V4_ctg100'
 'B73V4_ctg102' 'B73V4_ctg103' 'B73V4_ctg104' 'B73V4_ctg105'
 'B73V4_ctg107' 'B73V4_ctg11' 'B73V4_ctg110' 'B73V4_ctg115' 'B73V4_ctg117'
 'B73V4_ctg119' 'B73V4_ctg12' 'B73V4_ctg120' 'B73V4_ctg121' 'B73V4_ctg123'
 'B73V4_ctg125' 'B73V4_ctg129' 'B73V4_ctg13' 'B73V4_ctg133' 'B73V4_ctg134'
 'B73V4_ctg14' 'B73V4_ctg140' 'B73V4_ctg141' 'B73V4_ctg142' 'B73V4_ctg144'
 'B73V4_ctg147' 'B73V4_ctg150' 'B73V4_ctg151' 'B73V4_ctg170'
 'B73V4_ctg173' 'B73V4_ctg178' 'B73V4_ctg18' 'B73V4_ctg180' 'B73V4_ctg181'
 'B73V4_ctg182' 'B73V4_ctg188' 'B73V4_ctg189' 'B73V4_ctg190'
 'B73V4_ctg193' 'B73V4_ctg2' 'B73V4_ctg20' 'B73V4_ctg205' 'B73V4_ctg206'
 'B73V4_ctg208' 'B73V4_ctg219' 'B73V4_ctg23' 'B73V4_ctg24' 'B73V4_ctg242'
 'B73V4_ctg245' 'B73V4_ctg247' 'B73V4_ctg248' 'B73V4_ctg250'
 'B73V4_ctg253' 'B73V4_ctg26' 'B73V4_ctg28' 'B73V4_ctg29' 'B73V4_ctg3'
 'B73V4_ctg31' 'B73V4_ctg35' 'B73V4_ctg4' 'B73V4_ctg40'

We have 44259 rows × 947 columns. The sets of chromosome labels in the two data files **are different!!!** **Neither** is a subset of the other: `'B73V4_ctg100'` shows up only in `genes_w_features`, while `'B73V4_ctg152'` shows up only in `snp_matrix`. Both data sets contain 942 cultivars.
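
A minimal sketch of how the two label collections can be compared with Python sets. Only a handful of labels, copied from the two outputs above, are used here; with the full data the sets would be built from the two `chromosome` columns.

```python
# A few labels taken from the two printed lists above
snp_labels = {'1', '10', 'B73V4_ctg10', 'B73V4_ctg102', 'B73V4_ctg152'}
gene_labels = {'1', '10', 'B73V4_ctg10', 'B73V4_ctg100', 'B73V4_ctg102'}

only_in_snp = snp_labels - gene_labels    # labels missing from the expression data
only_in_gene = gene_labels - snp_labels   # labels missing from the SNP matrix
shared = snp_labels & gene_labels         # labels usable in both tables

print('Only in SNP matrix:', sorted(only_in_snp))
print('Only in genes     :', sorted(only_in_gene))
print('Shared            :', sorted(shared))
print('SNP subset of genes?', snp_labels <= gene_labels)
print('Genes subset of SNP?', gene_labels <= snp_labels)
```

Restricting both tables to the shared labels is one option before relating SNPs to expression.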

In [49]:
# Same downsampling as for the SNPs, this time on the expression data
b73v4_ctg10 = genes_w_features[genes_w_features['chromosome'] == 'B73V4_ctg10']
b73v4_ctg107 = genes_w_features[genes_w_features['chromosome'] == 'B73V4_ctg107']
b73v4_ctg105 = genes_w_features[genes_w_features['chromosome'] == 'B73V4_ctg105']
b73v4_ctg104 = genes_w_features[genes_w_features['chromosome'] == 'B73V4_ctg104']
downIndex = np.concatenate((b73v4_ctg10.index.values, b73v4_ctg104.index.values, b73v4_ctg105.index.values, b73v4_ctg107.index.values))

print('We have\t', len(b73v4_ctg10), '\tgenes in chromosome b73v4_ctg10')
print('We have\t', len(b73v4_ctg104), '\tgenes in chromosome b73v4_ctg104')
print('We have\t', len(b73v4_ctg105), '\tgenes in chromosome b73v4_ctg105')
print('We have\t', len(b73v4_ctg107), '\tgenes in chromosome b73v4_ctg107')
print('TOTAL\t', len(downIndex))
print('Between', downIndex.min(), 'and', downIndex.max())

downsample = genes_w_features.loc[downIndex]
downsample.to_csv(dst+'genes_w_features_downsample.csv', index=False)

We have	 12 	genes in chromosome b73v4_ctg10
We have	 3 	genes in chromosome b73v4_ctg104
We have	 1 	genes in chromosome b73v4_ctg105
We have	 8 	genes in chromosome b73v4_ctg107
TOTAL	 24
Between 459 and 41134
