# Capstone 2 – Project Proposal – Somatic Mutation

# 1 Objective: 

The goal is to segment the different types of somatic and germline mutations in human genes associated with inherited and acquired diseases. The result will be a comprehensive, one-stop-shop collection of mutation data (segments) for easy discovery in the era of personalized medicine. As part of this project I would like to find:

a.	Somatic and germline segmentation

b.	The acquired disease with the largest number of somatic mutations

c.	The inherited disease with the largest number of somatic mutations

### Outcome: A comprehensive somatic mutation database, an invaluable resource for all scientists.

# 2 Client

The client for this project is Dr. Sona Vasudevan, Professor, Department of Biochemistry and Molecular
Biology, Georgetown University Medical Center. Disease is a complex phenomenon. It was expected that
the completion of the Human Genome Project in 2003 would give us a better understanding of the
underlying causes of diseases. While we have come a long way in identifying the genes implicated in
many diseases, we are still a long way from better diagnosis, prevention and cure.

The era of “BIG DATA” is a blessing, but it has left us with a huge number of separate resources for
the different -omics data types. The goal of this project is to build a one-stop-shop for the various
data types so we can start discovering the relationships between mutations, variants and diseases.

The resource once completed and published will be made available to the scientific community.

### Contact Details:

Dr. Sona Vasudevan, Ph.D.,
Professor, Medical Education,
Director, MD/MS Dual Degree Program in Systems Medicine,
Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center,
3300, Whitehaven Street, NW, Suite 1200
Washington DC 20007.

Phone: 202-687-2242
http://systemsmedicine.georgetown.edu/

# 3 Data source and Credits


a) Uniprot: http://www.uniprot.org/
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45: D158-D169 (2017)


b) OMIM: https://www.omim.org/
Citing OMIM as a whole: Online Mendelian Inheritance in Man, OMIM (TM). McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD). Retrieved from http://www.ncbi.nlm.nih.gov/omim/ 



c) Clinvar: https://www.ncbi.nlm.nih.gov/clinvar/ 
Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Hoover J, Jang W, Katz K, Ovetsky M, Riley G, Sethi A, Tully R, Villamarin-Salomon R, Rubinstein W, Maglott DR. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2015 Nov 17. PubMed PMID: 26582918.


d) GWAS: https://www.ebi.ac.uk/gwas/ 
Burdett T (EBI), Hall PN (NHGRI), Hastings E (EBI), Hindorff LA (NHGRI), Junkins HA (NHGRI), Klemm AK (NHGRI), MacArthur J (EBI), Manolio TA (NHGRI), Morales J (EBI), Parkinson H (EBI) and Welter D (EBI). The NHGRI-EBI Catalog of published genome-wide association studies.


e) COSMIC DATABASE: http://cancer.sanger.ac.uk/cosmic 
Simon A. Forbes, David Beare, Harry Boutselakis, Sally Bamford, Nidhi Bindal, John Tate, Charlotte G. Cole, Sari Ward, Elisabeth Dawson, Laura Ponting, et al. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777–D783, https://doi.org/10.1093/nar/gkw1121


# 4. Solution Approach

The solution plans to use two machine learning techniques, principal component analysis (PCA) and non-negative matrix factorization (NMF), for dimensionality reduction and segmentation of somatic mutations. It is specifically designed to handle large, biologically diverse datasets with quality issues such as redundancy, missing values and wrong labels. The solution is divided into three phases, listed below.





### a) Data Assembly - Phase I:
This phase gathers the data and performs basic cleanup such as joining, merging, and adding or updating attributes.

### b) Explore and Preprocessing – Phase II:

This phase validates and explores the dataset for all the problems listed in the “Problem” section of this proposal.

### c) Modelling and Evaluation – Phase III:

This phase focuses on exploring various machine learning algorithms and tuning their hyperparameters to find the model that best clusters the somatic mutations.
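
As a rough illustration of the Phase III idea, the sketch below reduces a small count matrix with PCA and NMF and assigns each row to its dominant NMF component. It is not the project's model. The matrix values are made up, and both techniques are written directly in NumPy (SVD for PCA, multiplicative updates for NMF) as an assumption made to keep the example small; the names `pca`, `nmf`, `counts` and `segments` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 6 genes (rows) x 4 mutation-type counts (columns).
# The values are random and only stand in for real data.
counts = rng.poisson(lam=3.0, size=(6, 4)).astype(float)


def pca(X, k):
    """Project the rows of X onto the first k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # centre each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def nmf(V, k, n_iter=200, eps=1e-9):
    """Factor a non-negative V into W @ H using multiplicative updates."""
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


scores = pca(counts, 2)          # 2-D coordinates for each gene
W, H = nmf(counts, 2)            # non-negative loadings and components
segments = W.argmax(axis=1)      # dominant NMF component per gene
print(scores.shape, segments)
```

In practice the number of components is one of the hyperparameters to tune, and a library implementation would normally replace these hand-written functions.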

# Proof of Principle

To demonstrate the usefulness of a one-stop-shop approach to data mining, as opposed to querying one resource
at a time, we use the following genes.
1. TP53
2. EGFR
3. NOD2
4. MDR1
5. IL10

We will explore the following questions.
1. What diseases are these genes implicated in?
2. How many polymorphisms are found?
3. Are any of these SNPs pathogenic?
4. What pathways are these implicated in?
5. Come up with a Diseasome and explain the Systems Approach of looking at these diseases.
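
Once the combined table exists, the first three questions reduce to a filter and a group-by. The sketch below shows one possible shape for that query. The rows are invented placeholders, and the column names simply mirror the ones used later in this notebook (`GeneName`, `Disease_U`, `ClinicalSignificance_CV`); `toy` and `summary` are hypothetical names.

```python
import pandas as pd

genes = ["TP53", "EGFR", "NOD2", "MDR1", "IL10"]

# Hypothetical rows shaped like the combined table; all values are invented.
toy = pd.DataFrame({
    "GeneName": ["TP53", "TP53", "EGFR", "NOD2", "BRCA1"],
    "VariantID": ["v1", "v2", "v3", "v4", "v5"],
    "Disease_U": ["disease A", "disease B", "disease C", None, "disease D"],
    "ClinicalSignificance_CV": ["Pathogenic", "Benign", None, "Pathogenic", "Pathogenic"],
})

# Keep only the proof-of-principle genes, then summarise per gene.
subset = toy[toy["GeneName"].isin(genes)]
summary = subset.groupby("GeneName").agg(
    n_variants=("VariantID", "count"),
    n_diseases=("Disease_U", "nunique"),
    n_pathogenic=(
        "ClinicalSignificance_CV",
        lambda s: s.str.contains("Pathogenic", na=False).sum(),
    ),
)
print(summary)
```

Genes with no rows in the table (here MDR1 and IL10) do not appear in the summary, which is itself a useful check on gene-symbol mismatches between sources.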

### Based on the results we will get to the next step.

## 4.1 Data Assembly - Phase I:
In this phase we collect and assemble the data required to create the one-stop-shop for somatic mutations.

# 4.1.2 Import Uniprot Data.

In [1]:
import pandas as pd
from google.cloud import storage

import datetime as dt
from datetime import datetime
from pytz import timezone
pd.options.display.max_colwidth = 10000
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import uuid

client = storage.Client()
bucket=client.get_bucket('somatic_germline_mutations')
blob = storage.Blob('uniprot-all.tab',bucket)
with open('uniprot-all.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)

# Read the Uniprot export; every column is kept as text (object).
df_U=pd.read_csv('uniprot-all.tab',sep='\t', header=0,
               names=['Entry_U','GeneName','ProteinName_U','Organism_U','Entryname_U',
                      'NaturalVariant_U','InvolmentInDisease_U'],
               dtype={'Entry_U':object,'Entryname_U':object,'ProteinName_U':object,
                      'GeneName':object,'Organism_U':object,'NaturalVariant_U':object,
                      'InvolmentInDisease_U':object})

df_U=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U',\
           'NaturalVariant_U','InvolmentInDisease_U']]
df_U.shape

(20244, 7)

### Number of entries in the Uniprot list above: 20,244 with 7 columns.
We start with only 7 columns so that the Uniprot list stays manageable.

In [2]:
df_U[df_U.Entry_U=='Q93074'] #df_U[df_U.GeneName=='MED12']
#GeneName can contain more than one name; we use only the first one as the GeneName

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,NaturalVariant_U,InvolmentInDisease_U
9640,MED12 ARC240 CAGH45 HOPA KIAA0192 TNRC11 TRAP230,Q93074,Mediator of RNA polymerase II transcription subunit 12 (Activator-recruited cofactor 240 kDa component) (ARC240) (CAG repeat protein 45) (Mediator complex subunit 12) (OPA-containing protein) (Thyroid hormone receptor-associated protein complex 230 kDa component) (Trap230) (Trinucleotide repeat-containing gene 11 protein),Homo sapiens (Human),MED12_HUMAN,"VARIANT 961 961 R -> W (in OKS; dbSNP:rs80338758). {ECO:0000269|PubMed:17334363}. /FTId=VAR_033112.; VARIANT 1007 1007 N -> S (in LUJFRYS; dbSNP:rs80338759). {ECO:0000269|PubMed:17369503}. /FTId=VAR_037534.; VARIANT 1148 1148 R -> H (in OHDOX; dbSNP:rs387907360). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069770.; VARIANT 1165 1165 S -> P (in OHDOX; dbSNP:rs387907361). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069771.; VARIANT 1392 1392 Q -> R (in dbSNP:rs1139013). {ECO:0000269|PubMed:10198638, ECO:0000269|PubMed:8724849}. /FTId=VAR_046672.; VARIANT 1729 1729 H -> N (in OHDOX; dbSNP:rs387907362). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069772.; VARIANT 1974 1974 Q -> H (found in a family with X-linked intellectual disability; unknown pathological significance; dbSNP:rs879255528). {ECO:0000269|PubMed:26273451}. /FTId=VAR_074018.","DISEASE: Opitz-Kaveggia syndrome (OKS) [MIM:305450]: X-linked disorder characterized by mental retardation, relative macrocephaly, hypotonia and constipation. {ECO:0000269|PubMed:17334363}. Note=The disease is caused by mutations affecting the gene represented in this entry.; DISEASE: Lujan-Fryns syndrome (LUJFRYS) [MIM:309520]: Clinically, Lujan-Fryns syndrome can be distinguished from Opitz-Kaveggia syndrome by tall stature, hypernasal voice, hyperextensible digits and high nasal root. {ECO:0000269|PubMed:17369503}. 
Note=The disease is caused by mutations affecting the gene represented in this entry.; DISEASE: Ohdo syndrome, X-linked (OHDOX) [MIM:300895]: A syndrome characterized by mental retardation, feeding problems, and distinctive facial appearance with coarse facial features, severe blepharophimosis, ptosis, a bulbous nose, micrognathia and a small mouth. Dental hypoplasia and deafness can be considered as common manifestations of the syndrome. Male patients show cryptorchidism and scrotal hypoplasia. {ECO:0000269|PubMed:23395478}. Note=The disease is caused by mutations affecting the gene represented in this entry."


# Observations:

1. Multiple gene names: we need to use only the first gene name as the key.
2. Natural Variant: this field also holds multiple variants. We need to extract a few data points from this text.

    a) VARIANT ID
    
    b) From To
    
    c) Position 
    
    d) dbSNP
    
    e) PubMed id
3. Involvement in Disease: multiple diseases are stacked into this field. We need to extract a few data points:
    
    a) Disease name 
    
    b) MIM ID 
    
    c) Pub Med
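
The extraction described in these observations can be sketched with a few regular expressions. This is a compact illustration rather than the functions used later in the notebook: the sample strings are shortened from the MED12 (Q93074) row above, and `parse_variants`, `parse_diseases` and `VARIANT_RE` are hypothetical names.

```python
import re

# Shortened from the MED12 (Q93074) row shown above.
natural_variant = (
    "VARIANT 961 961 R -> W (in OKS; dbSNP:rs80338758). "
    "{ECO:0000269|PubMed:17334363}. /FTId=VAR_033112.; "
    "VARIANT 1392 1392 Q -> R (in dbSNP:rs1139013). "
    "{ECO:0000269|PubMed:10198638, ECO:0000269|PubMed:8724849}. /FTId=VAR_046672."
)
disease_text = (
    "DISEASE: Opitz-Kaveggia syndrome (OKS) [MIM:305450]: X-linked disorder. "
    "{ECO:0000269|PubMed:17334363}. Note=The disease is caused by mutations "
    "affecting the gene represented in this entry."
)

VARIANT_RE = re.compile(
    r"VARIANT\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<from_aa>[A-Z]+)\s*->\s*(?P<to_aa>[A-Z]+)"
)


def parse_variants(text):
    """Return one dict per 'X -> Y' variant found in a Natural Variant field."""
    records = []
    for chunk in text.split(".;"):            # entries are separated by '.;'
        m = VARIANT_RE.search(chunk)
        if m is None:
            continue                          # layouts other than 'X -> Y' are skipped
        ftid = re.search(r"/FTId=(VAR_\d+)", chunk)
        dbsnp = re.search(r"dbSNP:(rs\d+)", chunk)
        records.append({
            "variant_id": ftid.group(1) if ftid else "",
            "position": int(m.group("start")),
            "from_to": f"{m.group('from_aa')}->{m.group('to_aa')}",
            "dbsnp": dbsnp.group(1) if dbsnp else "",
            "pubmed": re.findall(r"PubMed:(\d+)", chunk),
        })
    return records


def parse_diseases(text):
    """Return one dict per 'DISEASE: name [MIM:id]' entry."""
    records = []
    for chunk in text.split(".;"):
        m = re.search(r"DISEASE:\s*(?P<name>.*?)\s*\[MIM:(?P<mim>\d+)\]", chunk)
        if m is None:
            continue                          # e.g. 'DISEASE: Note=...' entries
        records.append({
            "disease": m.group("name"),
            "mim": m.group("mim"),
            "pubmed": re.findall(r"PubMed:(\d+)", chunk),
        })
    return records


variants = parse_variants(natural_variant)
diseases = parse_diseases(disease_text)
print(variants)
print(diseases)
```

Keeping the PubMed IDs as a list, rather than one joined string, makes it easier to match variants to diseases on a single shared ID later.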

In [3]:
#df_U[df_U.GeneName.isnull()].to_csv('GeneNull.csv') #viewing entries from Uniprot file without a gene name.
df_U=df_U[df_U.GeneName.notnull()].copy() #Removing entries from Uniprot file without a gene name; copy so the next assignment works on an independent frame.
df_U['GeneName']=df_U['GeneName'].apply(lambda x:x.split(' ')[0]) 
df_U.shape

(20087, 7)

### Number of entries in the Uniprot list above: 20,087 with 7 columns.

### Extracting Natural Variant,dbSNP, Variant Position, PubMed,FromTo,MIM and Disease information

In [4]:
df_U_variant=df_U[['Entry_U','NaturalVariant_U']][df_U.NaturalVariant_U.notnull()]
df_U_disease=df_U[['Entry_U','InvolmentInDisease_U']][df_U.InvolmentInDisease_U.notnull()]

In [6]:
# This function extracts disease information from the involvement-in-disease column.
import re
lst__D=[]
def findDisease(row):
    for x in row['InvolmentInDisease_U'].split('.;'):
        #print(x)
        #print('++++++++++++++++++++++')
        PubMed=''
        mim_=''
        disease_=''
        
        aa=[]
        search_pattern=r'(?s)DISEASE:(.*?)(?=\[)'
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            disease_=aa[0].strip()
        else:
            #DISEASE: Note=Defects in ATM contribute to B-cell non-Hodgkin lymphomas (BNHL), 
            #including mantle cell lymphoma (MCL).
            #DISEASE: Note=A chromosomal aberration involving BRAF is 
            #found in pilocytic astrocytomas. 
            #A tandem duplication of 2 Mb at 7q34 leads to the expression of a 
            #KIAA1549-BRAF fusion protein with a 
            #constitutive kinase activity and inducing cell transformation. 
            #{ECO:0000269|PubMed:18974108}
                
            search_pattern=r'(?:^|\W\()[A-Z]*(?:$|\))'
            aa=re.findall(search_pattern,x)
            disease_=' &'.join(aa).strip().replace('(','').replace(')','')
            
        aa=[]
        search_pattern="(?s)MIM:(.*?)(?=])"
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            mim_=aa[0]
            
        aa=[]
        search_pattern='(?s){(.*?)(?=})'
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            for i in aa[0].split(','):
                PubMed=''
                yy=[]
                if '|' in i:                    
                    #print(row['Entry_U'],i,i.split('|')[1][7:])
                    PubMed=i.split('|')[1][7:]
                    if (len(PubMed)!=0) and (len(disease_)==0): #handling boundary cases                                                 
                        search_pattern=r'(?s)Note=(.*?)(?=\.)'
                        note1=re.findall(search_pattern,x)                                  
                        if len(note1)!=0:
                            search_pattern='[A-Z]{3,}'
                            yy=re.findall(search_pattern,note1[0])
                            disease_=','.join(yy).strip()                        
                    lst__D.append([row['Entry_U'],disease_,mim_,PubMed])
                else:
                    PubMed=i
                    lst__D.append([row['Entry_U'],disease_,mim_,PubMed])                    
    return(1)

In [7]:
#this function extracts variant information from Natural Variant column
import re
lst__V=[]
def findVariant(row):        
    for x in row['NaturalVariant_U'].split('.;'):
        dbSNP=''
        FromTo=''
        Variant=''
        PubMed=''
        aa=[]
        part1=''
        part2=''
        Muatation=''
        part1=x.strip().split('/')[0].strip()
        if len(x.strip().split('/'))==2:
            part2=x.strip().split('/')[1][5:].replace('.','')
        
        search_pattern=r'(?s)dbSNP:(.*?)(?=\).)'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:
            dbSNP=aa[0]
        
        aa=[]
        search_pattern=r'(?s)VARIANT(.*?)(?=\()'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:            
            if len(aa[0].strip().split(' '))==5:
                #print('---->'.join(aa))
                FromTo=aa[0].strip().split(' ')[0]+'-'+aa[0].strip().split(' ')[1]
                Variant=aa[0].strip().split(' ')[2]+aa[0].strip().split(' ')[3]+aa[0].strip().split(' ')[4]
                #print(FromTo,Variant)
            else:
                FromTo=aa[0].strip().split(' ')[0]+'-'+aa[0].strip().split(' ')[1]
                Variant=aa[0].strip().split(' ')[2] #Missing variant information??
            Muatation=aa[0].strip()
                            
        search_pattern='(?s){(.*?)(?=})'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:
            PubMed=[]
            for i in aa[0].split(','):                
                if '|' in i:
                    PubMed.append(i.split('|')[1][7:])
                else:
                    PubMed.append(i)
            lst__V.append([row['Entry_U'],part2,'Uniprot','Substitution-Missense',dbSNP, \
                           Muatation,','.join(PubMed)])
            
    return (1)

In [8]:
#applying the functions
_=df_U_variant.apply(lambda row: findVariant(row),axis=1)
_=df_U_disease.apply(lambda row: findDisease(row),axis=1)

In [9]:
#converting the list objects to dataframes
_df_V = pd.DataFrame(data=lst__V,columns=['Entry_U','VariantID','VariantSource',\
                                          'MutationDescription','dbSNP_U','Mutation','PubMed'])
_df_D = pd.DataFrame(data=lst__D,columns=['Entry_U','Disease_U','MIM_U','PubMed'])

### cheat sheets for checking data

In [10]:
_df_V[_df_V.Entry_U=='Q93074']
#df_U_disease[df_U_disease.Entry_U=='Q03164']
#df_U_disease[df_U_disease.InvolmentInDisease_U.isnull()]
#df_U_disease[df_U_disease.Entry_U=='P15056']
#__=df_U_disease[df_U_disease.Entry_U=='P15056'].apply(lambda row: findDisease(row),axis=1)
#lst__D
#_df_D[_df_D.Entry_U=='Q9UKU0']
#_df_D[_df_D.duplicated(keep=False)]  #checking duplicate records
#df_merged[df_merged.MIMNumber_O.isnull()].head(5) #checking genes without any disease id 
#df_merged[df_merged['GeneName'] =='HSD3B7'] #checking a specific gene
#df_merged=df_merged.dropna(subset=['MIMNumber_O']) # 
#df_merged.shape

Unnamed: 0,Entry_U,VariantID,VariantSource,MutationDescription,dbSNP_U,Mutation,PubMed
28663,Q93074,VAR_033112,Uniprot,Substitution-Missense,rs80338758,961 961 R -> W,17334363
28664,Q93074,VAR_037534,Uniprot,Substitution-Missense,rs80338759,1007 1007 N -> S,17369503
28665,Q93074,VAR_069770,Uniprot,Substitution-Missense,rs387907360,1148 1148 R -> H,23395478
28666,Q93074,VAR_069771,Uniprot,Substitution-Missense,rs387907361,1165 1165 S -> P,23395478
28667,Q93074,VAR_046672,Uniprot,Substitution-Missense,rs1139013,1392 1392 Q -> R,101986388724849
28668,Q93074,VAR_069772,Uniprot,Substitution-Missense,rs387907362,1729 1729 H -> N,23395478
28669,Q93074,VAR_074018,Uniprot,Substitution-Missense,rs879255528,1974 1974 Q -> H,26273451


## Removing all duplicate rows from Disease

In [11]:
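# keep=False drops every copy of a duplicated row, not just the extra copies;
# use keep='first' instead if one copy of each duplicated row should be retained.
_df_D_Final=_df_D.drop_duplicates(keep=False)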

In [12]:
# checking if we have any duplicate data
_df_D_Final[_df_D_Final.duplicated(keep=False)]

Unnamed: 0,Entry_U,Disease_U,MIM_U,PubMed
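
The choice of `keep` matters for the row counts that follow. A minimal sketch on invented rows (the frame `toy` is hypothetical) shows the difference between the two settings:

```python
import pandas as pd

# Invented rows: the first two are exact duplicates of each other.
toy = pd.DataFrame({
    "Entry_U": ["A", "A", "B"],
    "Disease_U": ["d1", "d1", "d2"],
})

dropped_all = toy.drop_duplicates(keep=False)    # removes both copies of the duplicated row
kept_first = toy.drop_duplicates(keep="first")   # keeps one copy of the duplicated row

print(len(dropped_all), len(kept_first))
```

With `keep=False` the disease recorded in the duplicated rows disappears from the table entirely, so it can no longer be matched to any variant.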


In [13]:
print(_df_V.shape,_df_D_Final.shape)

(54697, 7) (17770, 4)


In [14]:
_df_VD=pd.merge(_df_V,_df_D_Final,how='left', on=['Entry_U','PubMed'])
_df_VD.shape

(57852, 9)

In [15]:
_df_VD[_df_VD.Entry_U=='Q93074']

Unnamed: 0,Entry_U,VariantID,VariantSource,MutationDescription,dbSNP_U,Mutation,PubMed,Disease_U,MIM_U
30604,Q93074,VAR_033112,Uniprot,Substitution-Missense,rs80338758,961 961 R -> W,17334363,Opitz-Kaveggia syndrome (OKS),305450.0
30605,Q93074,VAR_037534,Uniprot,Substitution-Missense,rs80338759,1007 1007 N -> S,17369503,Lujan-Fryns syndrome (LUJFRYS),309520.0
30606,Q93074,VAR_069770,Uniprot,Substitution-Missense,rs387907360,1148 1148 R -> H,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30607,Q93074,VAR_069771,Uniprot,Substitution-Missense,rs387907361,1165 1165 S -> P,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30608,Q93074,VAR_046672,Uniprot,Substitution-Missense,rs1139013,1392 1392 Q -> R,101986388724849,,
30609,Q93074,VAR_069772,Uniprot,Substitution-Missense,rs387907362,1729 1729 H -> N,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30610,Q93074,VAR_074018,Uniprot,Substitution-Missense,rs879255528,1974 1974 Q -> H,26273451,,
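
Rows whose `PubMed` value holds several IDs cannot match a disease row on an exact string comparison, which may explain some of the empty `Disease_U` cells above. One possible workaround, sketched here on placeholder data, is to split the IDs into one row each before merging. The accession, IDs and disease names below are invented, and `long_variants` is a hypothetical name.

```python
import pandas as pd

# Placeholder rows; none of these identifiers are real.
variants = pd.DataFrame({
    "Entry_U": ["X00001", "X00001"],
    "VariantID": ["VAR_A", "VAR_B"],
    "PubMed": ["111", "222,333"],
})
diseases = pd.DataFrame({
    "Entry_U": ["X00001", "X00001"],
    "Disease_U": ["disease A", "disease B"],
    "PubMed": ["111", "333"],
})

# Direct merge: "222,333" matches neither "222" nor "333".
direct = variants.merge(diseases, how="left", on=["Entry_U", "PubMed"])

# Split the joined IDs and give each its own row, then merge.
long_variants = variants.assign(
    PubMed=variants["PubMed"].str.split(",")
).explode("PubMed")
exploded = long_variants.merge(diseases, how="left", on=["Entry_U", "PubMed"])

print(direct)
print(exploded)
```

The exploded table has one row per variant and PubMed ID, so it would need to be collapsed again if one row per variant is required downstream.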


In [16]:
df_U_mod=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U']] #[df_U.Entry_U=='Q93074']

# Merging Variant and Disease dataset

In [17]:
df_merged_U=pd.merge(_df_VD,df_U_mod,how='left', on=['Entry_U'])
df_merged_U.shape

(57852, 13)

In [18]:
df_merged_U.sample(7) #[df_merged_U.Entry_U=='P0CW18']

Unnamed: 0,Entry_U,VariantID,VariantSource,MutationDescription,dbSNP_U,Mutation,PubMed,Disease_U,MIM_U,GeneName,ProteinName_U,Organism_U,Entryname_U
7323,O76090,VAR_043493,Uniprot,Substitution-Missense,,152 152 P -> A,1817988121330666,,,BEST1,Bestrophin-1 (TU15B) (Vitelliform macular dystrophy protein 2),Homo sapiens (Human),BEST1_HUMAN
51844,Q8WZ42,VAR_040124,Uniprot,Substitution-Missense,,3482 3482 E -> K,17344846,,,TTN,Titin (EC 2.7.11.1) (Connectin) (Rhabdomyosarcoma antigen MU-RMS-40.14),Homo sapiens (Human),TITIN_HUMAN
44352,P07949,VAR_009461,Uniprot,Substitution-Missense,,157 157 C -> Y,,Hirschsprung disease 1 (HSCR1),142623.0,RET,Proto-oncogene tyrosine-protein kinase receptor Ret (EC 2.7.10.1) (Cadherin family member 12) (Proto-oncogene c-Ret) [Cleaved into: Soluble RET kinase fragment; Extracellular cell-membrane anchored RET cadherin 120 kDa fragment],Homo sapiens (Human),RET_HUMAN
54229,Q8N9V7,VAR_039225,Uniprot,Substitution-Missense,rs7645375,88 88 P -> Q,14702039,,,TOPAZ1,Testis- and ovary-specific PAZ domain-containing protein 1,Homo sapiens (Human),TOPZ1_HUMAN
32165,Q00266,VAR_006939,Uniprot,Substitution-Missense,rs118204001,322 322 I -> M,106772947560086,,,MAT1A,S-adenosylmethionine synthase isoform type-1 (AdoMet synthase 1) (EC 2.5.1.6) (Methionine adenosyltransferase 1) (MAT 1) (Methionine adenosyltransferase I/III) (MAT-I/III),Homo sapiens (Human),METK1_HUMAN
56768,Q9BQB6,VAR_065790,Uniprot,Substitution-Missense,,59 59 W -> C,20946155,Coumarin resistance (CMRES),122700.0,VKORC1,"Vitamin K epoxide reductase complex subunit 1 (EC 1.17.4.4) (Vitamin K1 2,3-epoxide reductase subunit 1)",Homo sapiens (Human),VKOR1_HUMAN
3330,P10275,VAR_004732,Uniprot,Substitution-Missense,rs137852578,878 878 T -> A,1036396310569618156253916129672173119142260966818706882744098827083,,,AR,Androgen receptor (Dihydrotestosterone receptor) (Nuclear receptor subfamily 3 group C member 4),Homo sapiens (Human),ANDR_HUMAN


### Number of rows in the merged Uniprot variant and disease table above: 57,852 with 13 columns.

# 4.1.4 Import COSMIC

### Note:
Import the mutation dataset from the COSMIC database. COSMIC provides a tab-separated table of the complete curated dataset (targeted screens) from the current release. It includes all coding point mutations and the negative data set. The file is named 'CosmicCompleteTargetedScreensMutantExport.tsv.gz'.

We will only consider those entries that have mutation IDs.

In [None]:
#blob = storage.Blob('CosmicCompleteTargetedScreensMutantExport.tsv',bucket)
#with open('CosmicCompleteTargetedScreensMutantExport.tsv', 'wb') as file_obj:
#    blob.download_to_file(file_obj)

In [19]:
import pandas as pd
chunksize = 10 ** 6
cosmic_C=pd.DataFrame()
colnames=["GeneName","AccessionNumber_C","GeneCDSlength","HGNCid","SampleName","SampleId", \
          "IdTumour", "PrimarySite","SiteSubtype1","SiteSubtype2","SiteSubtype3", \
          "PrimaryHistology","HistologySubtype1", "HistologySubtype2","HistologySubtype3", \
          "GenomeWideScreen","MutationId_C","MutationCDS","MutationAA_C",\
          "MutationDescription_C","MutationZygosity","LOH","GRCh",\
          "MutationGenomePosition_C","MutationStrand","SNP","ResistanceMutation",\
          "FATHMMPrediction","FATHMMScore","MutationSomaticStatus","Pubmed_PMID_C",\
          "IdStudy","SampleSource","TumourOrigin","Age"]

# All columns are read as text. Note that cosmic_C is reassigned on every pass,
# so after the loop it holds only the filtered rows of the final chunk.
for chunk in pd.read_csv('CosmicCompleteTargetedScreensMutantExport.tsv',sep='\t',header=0,
                         names=colnames,low_memory=False,dtype=object,
                         chunksize=chunksize):
    #selecting only rows with a mutation id from the Cosmic file
    cosmic_C=chunk[['GeneName','MutationId_C','MutationDescription_C',
                    'MutationAA_C','Pubmed_PMID_C']].loc[chunk.MutationId_C.notnull()]

In [20]:
cosmic_C.shape

(13155, 5)
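
To keep the filtered rows from every chunk rather than only the last one, the usual pattern is to collect the pieces in a list and concatenate them once at the end. The sketch below illustrates this on a tiny in-memory table; the rows are invented and `toy_tsv`, `kept` and `all_rows` are hypothetical names.

```python
import io

import pandas as pd

# Invented four-row TSV; the TP53 row has no mutation id.
toy_tsv = (
    "GeneName\tMutationId_C\n"
    "EGFR\tCOSM1\n"
    "TP53\t\n"
    "EGFR\tCOSM2\n"
    "KRAS\tCOSM3\n"
)

kept = []
for chunk in pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=object, chunksize=2):
    # Filter each chunk, then remember the result instead of overwriting it.
    kept.append(chunk.loc[chunk["MutationId_C"].notnull()])

all_rows = pd.concat(kept, ignore_index=True)
print(all_rows)
```

Applied to the full COSMIC export, this would change the row counts reported in the rest of this section, since those were produced from the final chunk only.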

In [21]:
cosmic_C.columns=['GeneName','VariantID', 'MutationDescription', 'Mutation','PubMed']

In [None]:
import numpy as np

In [22]:
cosmic_C['VariantSource']='COSMIC'

In [None]:
#s = ['p.L858R','p.C229fs*10','p.?_?ins?','p.E604_F605ins15','p.EAC746_E749delELRE','p.D770_N771insSVD','p.?fs*?']
#a=[re.findall(r"[^\W\d_]+|\d+",i[1:]) for i in s]
#a

In [None]:
#cosmic_C_mod=cosmic_C[cosmic_C.MutationDescription=='Substitution - Missense'].copy()
#cosmic_C_mod=cosmic_C_mod.reset_index(drop=True)
#cosmic_C_mod.head(4)

In [None]:
def FromTo(s_row):
    a=[]
    fromTo=''
    VariantPos=''
    a=re.findall(r"[^\W\d_]+|\d+",s_row['MutationAA_C'][1:])
    if len(a)==3:
        fromTo=a[0]+'->'+a[2]
        VariantPos=a[1]
    elif len(a)==2:
        fromTo=a[0]+'->?'
        VariantPos=a[1]
    return(fromTo)

In [None]:
def VariantPos(s_row):
    a=[]
    fromTo=''
    VariantPos=''
    a=re.findall(r"[^\W\d_]+|\d+",s_row['MutationAA_C'][1:])
    if len(a)==3:        
        VariantPos=a[1]
    elif len(a)==2:
        VariantPos=a[1]
    return(VariantPos)
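
The two helpers above can also be expressed as one stricter parser that returns both fields at once and leaves everything other than a simple substitution unparsed. This is only a sketch of that alternative: `parse_substitution` and `SUBSTITUTION_RE` are hypothetical names, and the sample strings are taken from the COSMIC rows and commented examples in this notebook.

```python
import re

# Matches simple substitutions such as "p.L858R" or "p.T263T".
# Deletions, insertions, frameshifts and "p.?" forms do not match.
SUBSTITUTION_RE = re.compile(r"p\.([A-Z])(\d+)([A-Z*])")


def parse_substitution(mutation_aa):
    """Return from/to and position for a simple substitution, else None."""
    m = SUBSTITUTION_RE.fullmatch(mutation_aa)
    if m is None:
        return None
    ref, pos, alt = m.groups()
    return {"from_to": f"{ref}->{alt}", "position": int(pos)}


examples = ["p.L858R", "p.T790M", "p.?del", "p.C229fs*10"]
parsed = {s: parse_substitution(s) for s in examples}
print(parsed)
```

Returning `None` for unsupported notations makes it explicit which rows still need their own handling, instead of silently producing empty strings.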

In [None]:
#cosmic_C_mod.VaiantPos=cosmic_C_mod.apply(VariantPos,axis=1)
#cosmic_C_mod.Variant=cosmic_C_mod.apply(FromTo,axis=1)

In [23]:
cosmic_C[cosmic_C.GeneName=='EGFR']

Unnamed: 0,GeneName,VariantID,MutationDescription,Mutation,PubMed,VariantSource
6000185,EGFR,COSM12979,Substitution - Missense,p.L858R,22523351,COSMIC
6000186,EGFR,COSM13243,Deletion - In frame,p.?del,21129809,COSMIC
6000718,EGFR,COSM13243,Deletion - In frame,p.?del,22695392,COSMIC
6001346,EGFR,COSM12979,Substitution - Missense,p.L858R,23915069,COSMIC
6001657,EGFR,COSM18486,Substitution - Missense,p.S768I,26862733,COSMIC
6002334,EGFR,COSM36937,Substitution - Missense,p.P741H,19776290,COSMIC
6002557,EGFR,COSM21943,Substitution - Missense,p.T790M,22215752,COSMIC
6002829,EGFR,COSM6224,Substitution - Missense,p.L858R,15118125,COSMIC
6002831,EGFR,COSM13243,Deletion - In frame,p.?del,19517135,COSMIC
6004558,EGFR,COSM12979,Substitution - Missense,p.L858R,23014527,COSMIC


In [None]:
#c=cosmic_C_mod.columns.difference(['PubMed']).tolist()
#a=cosmic_C_mod[cosmic_C_mod.PubMed.notnull()].to_dict()
#df1=pd.DataFrame(a)

In [25]:
c = cosmic_C.columns.difference(['PubMed']).tolist()
cosmic_C_final=cosmic_C.astype(str).groupby(c, as_index=False).PubMed.agg(','.join)
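
The cell above collapses rows that differ only in their PubMed ID into one row per mutation. If the same ID occurs more than once for a mutation, a plain `','.join` repeats it; the sketch below shows one way to join only the distinct IDs. The rows are a toy subset (the repeated ID is added for illustration), and `toy`, `keys` and `collapsed` are hypothetical names.

```python
import pandas as pd

# Toy rows: COSM12979 is reported three times, twice with the same PubMed ID.
toy = pd.DataFrame({
    "GeneName": ["EGFR", "EGFR", "EGFR", "EGFR"],
    "VariantID": ["COSM12979", "COSM12979", "COSM12979", "COSM21943"],
    "Mutation": ["p.L858R", "p.L858R", "p.L858R", "p.T790M"],
    "PubMed": ["22523351", "23915069", "22523351", "22215752"],
})

# Group on every column except PubMed, then join the distinct IDs.
keys = toy.columns.difference(["PubMed"]).tolist()
collapsed = toy.groupby(keys, as_index=False).agg(
    {"PubMed": lambda ids: ",".join(sorted(set(ids)))}
)
print(collapsed)
```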

In [39]:
import numpy as np
cosmic_C_final['Entry_U']= np.nan
cosmic_C_final['ProteinName_U']=np.nan
cosmic_C_final['Organism_U']= np.nan
cosmic_C_final['Entryname_U']=np.nan

In [47]:
#df_merged_U[df_merged_U.GeneName=='ABCA12_ENST00000389661']
#df_merged_U[df_merged_U.VariantID=='VAR_019297']

In [42]:
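# Caution: pandas aligns the right-hand side on the row index, not on GeneName,
# so each selected COSMIC row receives the Uniprot values stored under the same index label.
cosmic_C_final.loc[cosmic_C_final.GeneName.isin(df_merged_U.GeneName),\
                   ['Entry_U','ProteinName_U', 'Organism_U','Entryname_U']]=\
df_merged_U[['Entry_U','ProteinName_U', 'Organism_U','Entryname_U']]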
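
A lookup keyed on the gene name avoids depending on how the two tables happen to be indexed. One way to do that, sketched here on toy frames, is a left merge against a one-row-per-gene lookup table. The frames are invented apart from identifiers that already appear in this notebook or are well known (EGFR is Uniprot entry P00533); `lookup` and `annotated` are hypothetical names.

```python
import pandas as pd

# Toy Uniprot-side table: several rows per gene, as in the variant table.
uniprot = pd.DataFrame({
    "GeneName": ["ZNF592", "EGFR", "EGFR"],
    "Entry_U": ["Q92610", "P00533", "P00533"],
    "Entryname_U": ["ZN592_HUMAN", "EGFR_HUMAN", "EGFR_HUMAN"],
})

# Toy COSMIC-side table: its row order is unrelated to the Uniprot table.
cosmic = pd.DataFrame({
    "GeneName": ["EGFR", "AASS"],
    "VariantID": ["COSM12979", "COSM308738"],
})

# One row per gene, then attach the annotation by gene name.
lookup = uniprot.drop_duplicates(subset="GeneName")[
    ["GeneName", "Entry_U", "Entryname_U"]
]
annotated = cosmic.merge(lookup, how="left", on="GeneName")
print(annotated)
```

Genes that are absent from the lookup table (AASS in this toy example) keep empty annotation columns, which matches the intent of the null check in the next cell.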

In [46]:
cosmic_C_final[cosmic_C_final.Entry_U.isnull()]

Unnamed: 0,GeneName,Mutation,MutationDescription,VariantID,VariantSource,PubMed,Entry_U,ProteinName_U,Organism_U,Entryname_U
0,AASS,p.T263T,Substitution - coding silent,COSM308738,COSMIC,22941188,,,,
2,ABCA12_ENST00000389661,p.E523K,Substitution - Missense,COSM4604878,COSMIC,25056374,,,,
5,ABCA13_ENST00000435803,p.R1733P,Substitution - Missense,COSM5049106,COSMIC,24686850,,,,
7,ABCC4_ENST00000376887,p.E675K,Substitution - Missense,COSM4990391,COSMIC,25589618,,,,
8,ABCC4_ENST00000376887,p.S323S,Substitution - coding silent,COSM147705,COSMIC,25589003,,,,
10,ABCG2_ENST00000237612,p.Q141K,Substitution - Missense,COSM3760823,COSMIC,25589003,,,,
39,ABL1_ENST00000318560,p.P1108P,Substitution - coding silent,COSM5019121,COSMIC,25589003,,,,
40,ABL1_ENST00000318560,p.P914P,Substitution - coding silent,COSM5607711,COSMIC,26343386,,,,
41,ABL1_ENST00000372348,p.P1127P,Substitution - coding silent,COSM5019120,COSMIC,2558900325589003,,,,
42,ABL1_ENST00000372348,p.P803P,Substitution - coding silent,COSM3720675,COSMIC,25589003,,,,


In [99]:
cosmic_C_final.columns

Index(['GeneName', 'Mutation', 'MutationDescription', 'VariantID', 'VariantSource', 'PubMed', 'Entry_U', 'ProteinName_U', 'Organism_U', 'Entryname_U'], dtype='object')

# Merging COSMIC to Uniprot

In [48]:
df__=pd.concat([df_merged_U, cosmic_C_final], ignore_index=True)

In [52]:
df__.shape

(62662, 13)

## Number of entries in the combined Uniprot and COSMIC list: 62,662 with 13 columns.

In [100]:
df__.columns

Index(['Disease_U', 'Entry_U', 'Entryname_U', 'GeneName', 'MIM_U', 'Mutation', 'MutationDescription', 'Organism_U', 'ProteinName_U', 'PubMed', 'VariantID', 'VariantSource', 'dbSNP_U'], dtype='object')

## Processing Clinvar

In [54]:
blob = storage.Blob('variant_summary.txt.gz',bucket)
with open('variant_summary.txt.gz', 'wb') as file_obj:
    blob.download_to_file(file_obj)

In [95]:
import pandas as pd
chunksize = 10 ** 6
Clinvar_V=pd.DataFrame()
colnames=['#AlleleID', 'Type', 'Name', 'GeneID', 'GeneSymbol',\
          'HGNC_ID', 'ClinicalSignificance','ClinSigSimple',\
          'LastEvaluated', 'RS# (dbSNP)', 'nsv/esv (dbVar)',\
          'RCVaccession', 'PhenotypeIDS','PhenotypeList',\
          'Origin','OriginSimple','Assembly','ChromosomeAccession',\
          'Chromosome','Start','Stop','ReferenceAllele','AlternateAllele',\
          'Cytogenetic','ReviewStatus','NumberSubmitters',
          'Guidelines','TestedInGTR','OtherIDs','SubmitterCategories']

for chunk in pd.read_csv('variant_summary.txt.gz',sep='\t',header=0,\
                         low_memory=False,chunksize=chunksize):
    Clinvar_V=chunk[['#AlleleID', 'Type', 'Name', 'GeneSymbol','GeneID',\
                     'ClinicalSignificance', 'RS# (dbSNP)','Origin',\
                     'Assembly']].loc[chunk.Assembly=='GRCh38']

In [96]:
Clinvar_V.shape

(356597, 9)

In [112]:
Clinvar_V['VariantSource']='Clinvar'
Clinvar_V['Entry_U']=np.nan
Clinvar_V['Entryname_U']=np.nan
Clinvar_V['Organism_U']=np.nan
Clinvar_V['ProteinName_U']=np.nan

In [113]:
Clinvar_V.columns=['VariantID', 'MutationDescription', 'Mutation', 'GeneName','GeneID_CV',\
                   'ClinicalSignificance_CV', 'dbSNP_U','Origin_CV', 'Assembly_CV',\
                   'VariantSource','Entry_U','Entryname_U','Organism_U','ProteinName_U']

In [116]:
Clinvar_V.head()

Unnamed: 0,VariantID,MutationDescription,Mutation,GeneName,GeneID_CV,ClinicalSignificance_CV,dbSNP_U,Origin_CV,Assembly_CV,VariantSource,Entry_U,Entryname_U,Organism_U,ProteinName_U
1,15041,indel,NM_014855.2(AP5Z1):c.80_83delGGATinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ala362delinsLeuLeuTer),AP5Z1,9907,Pathogenic,397704705,germline,GRCh38,Clinvar,,,,
3,15042,deletion,NM_014855.2(AP5Z1):c.1413_1426delGGACCTGCCCTGCT (p.Leu473Glyfs),AP5Z1,9907,Pathogenic,397704709,germline,GRCh38,Clinvar,,,,
5,15043,single nucleotide variant,NM_014630.2(ZNF592):c.3136G>A (p.Gly1046Arg),ZNF592,9640,Uncertain significance,150829393,germline,GRCh38,Clinvar,P30443,1A01_HUMAN,Homo sapiens (Human),"HLA class I histocompatibility antigen, A-1 alpha chain (MHC class I antigen A*1)"
7,15044,single nucleotide variant,NM_017547.3(FOXRED1):c.694C>T (p.Gln232Ter),FOXRED1,55572,Pathogenic,267606829,germline,GRCh38,Clinvar,P30443,1A01_HUMAN,Homo sapiens (Human),"HLA class I histocompatibility antigen, A-1 alpha chain (MHC class I antigen A*1)"
9,15045,single nucleotide variant,NM_017547.3(FOXRED1):c.1289A>G (p.Asn430Ser),FOXRED1,55572,Pathogenic,267606830,germline,GRCh38,Clinvar,P30443,1A01_HUMAN,Homo sapiens (Human),"HLA class I histocompatibility antigen, A-1 alpha chain (MHC class I antigen A*1)"


In [106]:
df__[df__.GeneName=='ZNF592']

Unnamed: 0,Disease_U,Entry_U,Entryname_U,GeneName,MIM_U,Mutation,MutationDescription,Organism_U,ProteinName_U,PubMed,VariantID,VariantSource,dbSNP_U
57537,,Q92610,ZN592_HUMAN,ZNF592,,926 926 S -> N,Substitution-Missense,Homo sapiens (Human),Zinc finger protein 592,154893349039502,VAR_047033,Uniprot,rs8182086
57538,,Q92610,ZN592_HUMAN,ZNF592,,1046 1046 G -> R,Substitution-Missense,Homo sapiens (Human),Zinc finger protein 592,20531441,VAR_064583,Uniprot,rs150829393


In [115]:
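# Caution: this assignment aligns on the row index rather than on GeneName, which is why
# ZNF592 (Q92610 in the Uniprot rows) shows P30443 in the Clinvar_V.head() output.
Clinvar_V.loc[Clinvar_V.GeneName.isin(df__.GeneName),\
              ['Entry_U','ProteinName_U', 'Organism_U', \
               'Entryname_U']] = df__[['Entry_U','ProteinName_U', 'Organism_U','Entryname_U']]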

In [117]:
df__=pd.concat([df__, Clinvar_V], ignore_index=True)

In [119]:
df__.sample(3)

Unnamed: 0,Assembly_CV,ClinicalSignificance_CV,Disease_U,Entry_U,Entryname_U,GeneID_CV,GeneName,MIM_U,Mutation,MutationDescription,Organism_U,Origin_CV,ProteinName_U,PubMed,VariantID,VariantSource,dbSNP_U
317946,GRCh38,Pathogenic/Likely pathogenic,,,,157680.0,VPS13B,,NM_017890.4(VPS13B):c.11595delA (p.Arg3865Serfs),deletion,,germline;unknown,,,357751,Clinvar,747217399.0
171651,GRCh38,Benign/Likely benign,,,,10577.0,NPC2,,NM_006432.3(NPC2):c.442-4A>C,single nucleotide variant,,germline,,,194949,Clinvar,114950106.0
31,,,Congenital bile acid synthesis defect 1 (CBAS1),Q9H2F3,3BHS7_HUMAN,,HSD3B7,607765.0,19 19 G -> S,Substitution-Missense,Homo sapiens (Human),,"3 beta-hydroxysteroid dehydrogenase type 7 (3 beta-hydroxysteroid dehydrogenase type VII) (3-beta-HSD VII) (3-beta-hydroxy-Delta(5)-C27 steroid oxidoreductase) (C(27) 3-beta-HSD) (EC 1.1.1.-) (Cholest-5-ene-3-beta,7-alpha-diol 3-beta-dehydrogenase) (EC 1.1.1.181)",12679481.0,VAR_054775,Uniprot,


## Processing GWAS

In [59]:
blob = storage.Blob('gwas_catalog_v1.0.1-associations_e91_r2018-01-16.tsv',bucket)
with open('gwas_catalog_v1.0.1-associations_e91_r2018-01-16.tsv', 'wb') as file_obj:
    blob.download_to_file(file_obj)

In [120]:
chunksize = 10 ** 6
gwas_cols = ['STUDY ACCESSION','PUBMEDID','DISEASE/TRAIT',  'CHR_ID',
             'CHR_POS', 'REPORTED GENE(S)',
             'MAPPED_GENE', 'STRONGEST SNP-RISK ALLELE','CONTEXT',
             'P-VALUE','OR or BETA', 'MAPPED_TRAIT']
# Keep the needed columns from every chunk; plain assignment in the loop would retain only the last chunk
chunks = []
for chunkGWAS in pd.read_csv('gwas_catalog_v1.0.1-associations_e91_r2018-01-16.tsv',sep='\t', \
                         header=0,low_memory=False,chunksize=chunksize):
    chunks.append(chunkGWAS[gwas_cols])
GWAS = pd.concat(chunks, ignore_index=True)

In [127]:
GWAS.columns

Index(['STUDY ACCESSION', 'PUBMEDID', 'DISEASE/TRAIT', 'CHR_ID', 'CHR_POS', 'REPORTED GENE(S)', 'MAPPED_GENE', 'STRONGEST SNP-RISK ALLELE', 'CONTEXT', 'P-VALUE', 'OR or BETA', 'MAPPED_TRAIT', 'VariantSource', 'Entry_U', 'Entryname_U', 'Organism_U', 'ProteinName_U'], dtype='object')

In [126]:
GWAS['VariantSource']='GWAS'
GWAS['Entry_U']=np.nan
GWAS['Entryname_U']=np.NAN
GWAS['Organism_U']=np.nan
GWAS['ProteinName_U']=np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
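
pandas raises `SettingWithCopyWarning` when columns are added to a frame that was itself created by selecting from another DataFrame, because it flags that frame as a possible copy of a slice. A minimal sketch of one way to avoid it, taking an explicit `.copy()` before adding columns (`chunk` is a hypothetical stand-in for one chunk of the GWAS file):

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical data) for one chunk of the GWAS catalog
chunk = pd.DataFrame({
    'PUBMEDID': [26192919, 28346443],
    'MAPPED_GENE': ['GPR65', 'RTEL1'],
    'CONTEXT': ['intron_variant', 'intron_variant'],
})

# .copy() makes the column selection an independent DataFrame
gwas = chunk[['PUBMEDID', 'MAPPED_GENE']].copy()

# assign() adds all placeholder columns in one step and returns a new frame
gwas = gwas.assign(VariantSource='GWAS', Entry_U=np.nan, Entryname_U=np.nan)

print(gwas.columns.tolist())
```

The original `chunk` is left untouched, so later cells can still rely on its columns.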

In [131]:
GWAS.columns=['VariantID', 'PubMed', 'Disease_U', 'CHR_ID',\
 'CHR_POS', 'GeneName', 'Mapped_Gene', 'dbSNP_U',\
 'Mutation', 'PValue', 'OR_or_BETA',\
 'Mapped_Trait','VariantSource','Entry_U',\
 'Entryname_U','Organism_U','ProteinName_U']
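
Assigning a full list to `GWAS.columns` only works while the column order stays exactly as selected. A minimal sketch of the same renaming with an explicit old-to-new mapping, shown on an empty toy frame (`gwas` is a hypothetical stand-in; the mapping is the one implied by the GWAS headers and the new names used in this notebook):

```python
import pandas as pd

# Empty toy frame (hypothetical) carrying the original GWAS catalog headers
gwas = pd.DataFrame(columns=[
    'STUDY ACCESSION', 'PUBMEDID', 'DISEASE/TRAIT', 'CHR_ID', 'CHR_POS',
    'REPORTED GENE(S)', 'MAPPED_GENE', 'STRONGEST SNP-RISK ALLELE',
    'CONTEXT', 'P-VALUE', 'OR or BETA', 'MAPPED_TRAIT',
])

# Explicit old -> new names; columns not listed (CHR_ID, CHR_POS) keep their name
rename_map = {
    'STUDY ACCESSION': 'VariantID',
    'PUBMEDID': 'PubMed',
    'DISEASE/TRAIT': 'Disease_U',
    'REPORTED GENE(S)': 'GeneName',
    'MAPPED_GENE': 'Mapped_Gene',
    'STRONGEST SNP-RISK ALLELE': 'dbSNP_U',
    'CONTEXT': 'Mutation',
    'P-VALUE': 'PValue',
    'OR or BETA': 'OR_or_BETA',
    'MAPPED_TRAIT': 'Mapped_Trait',
}
gwas = gwas.rename(columns=rename_map)

print(gwas.columns.tolist())
```

With a mapping, a reordered or extra column in the input does not silently shift the names.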

In [135]:
GWAS.sample(31)

Unnamed: 0,VariantID,PubMed,Disease_U,CHR_ID,CHR_POS,GeneName,Mapped_Gene,dbSNP_U,Mutation,PValue,OR_or_BETA,Mapped_Trait,VariantSource,Entry_U,Entryname_U,Organism_U,ProteinName_U
32368,GCST003045,26192919,Ulcerative colitis,14.0,88006251.0,NR,GPR65,rs8005161-A,intron_variant,3e-09,1.137108,ulcerative colitis,GWAS,,,,
51737,GCST004347,28346443,Glioma,20.0,63680946.0,RTEL1,"RTEL1-TNFRSF6B, RTEL1",rs2297440-C,intron_variant,1.9999999999999998e-42,1.36,"central nervous system cancer, glioma",GWAS,P13726,TF_HUMAN,Homo sapiens (Human),Tissue factor (TF) (Coagulation factor III) (Thromboplastin) (CD antigen CD142)
6735,GCST000667,20418888,Smoking behavior,19.0,40852719.0,"CYP2A6, RAB4D",CYP2A6 - CYP2A7,rs4105144-C,intron_variant,2e-12,0.39,smoking behavior,GWAS,,,,
23459,GCST004624,27863252,Sum eosinophil basophil counts,21.0,38482555.0,ERG,ERG,rs58030288-T,intron_variant,2e-09,0.110141,"basophil count, eosinophil count",GWAS,,,,
2150,GCST001207,21862451,Butyrylcholinesterase levels,2.0,203384676.0,ABI2,ABI2,rs11675251-A,intron_variant,4e-18,0.15,butyrylcholinesterase measurement,GWAS,,,,
21376,GCST004617,27863252,Eosinophil percentage of granulocytes,19.0,3179519.0,S1PR4,S1PR4,rs61731111-T,missense_variant,1e-18,0.150202,eosinophil percentage of granulocytes,GWAS,,,,
2889,GCST000760,20686565,"Cholesterol, total",5.0,156963286.0,"TIMD4, HAVCR1",TIMD4 - HAVCR1,rs6882076-T,upstream_gene_variant,7e-28,1.98,total cholesterol measurement,GWAS,,,,
47763,GCST004313,28441456,"Facial morphology (factor 9, facial height related to vertical position of nasion)",5.0,154697614.0,NR,LARP1,rs6580110-G,intergenic_variant,4e-06,0.1528,facial height measurement,GWAS,,,,
49576,GCST004134,28081215,Multiple keratinocyte cancers,,,intergenic,,chr3:79848880-A,,2e-06,4.739,"squamous cell carcinoma, mulitple keratinocyte carcinoma susceptibility measurement, basal cell carcinoma",GWAS,,,,
26054,GCST004616,27863252,Platelet distribution width,10.0,32380143.0,EPC1,EPC1 - RNU6-1244P,rs10559647-C,upstream_gene_variant,2e-09,0.027269,platelet distribution width,GWAS,Q12809,KCNH2_HUMAN,Homo sapiens (Human),Potassium voltage-gated channel subfamily H member 2 (Eag homolog) (Ether-a-go-go-related gene potassium channel 1) (ERG-1) (Eag-related protein 1) (Ether-a-go-go-related protein 1) (H-ERG) (hERG-1) (hERG1) (Voltage-gated potassium channel subunit Kv11.1)


In [141]:
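# Match annotations by GeneName; assigning a DataFrame through .loc would align on row index instead
_cols = ['Entry_U','ProteinName_U', 'Organism_U','Entryname_U']
_lookup = df__.dropna(subset=['GeneName','Entry_U']).drop_duplicates('GeneName').set_index('GeneName')[_cols]
for _c in _cols:
    GWAS[_c] = GWAS.GeneName.map(_lookup[_c])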

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [136]:
df__=pd.concat([df__, GWAS], ignore_index=True)

In [140]:
df__.shape

(483498, 23)
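
The column count grows with each source because `pd.concat` keeps every column that appears in any input and fills the cells a source does not provide with `NaN`. A minimal sketch with two toy frames (hypothetical data), plus a per-source row count as a quick sanity check:

```python
import pandas as pd

# Toy stand-ins (hypothetical data) for two of the merged sources
uniprot = pd.DataFrame({
    'GeneName': ['ZNF592'],
    'Entry_U': ['Q92610'],
    'VariantSource': ['Uniprot'],
})
gwas = pd.DataFrame({
    'GeneName': ['ERG', 'ABI2'],
    'PValue': [2e-09, 4e-18],
    'VariantSource': ['GWAS', 'GWAS'],
})

# Rows are stacked; columns are the union of both inputs
merged = pd.concat([uniprot, gwas], ignore_index=True, sort=False)

print(merged.shape)
print(merged['VariantSource'].value_counts())
```

Running `value_counts()` on `VariantSource` in the real `df__` would show how many of the rows come from each source.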

# Number of entries in the merged list after including the COSMIC, UniProt, ClinVar and GWAS datasets: 483,498, with 23 columns.

# 4.2 Analysis

In [None]:
df_merged.columns

In [None]:
df_merged.columns=['GeneName', 'Entry_U', 'ProteinName_U', 'Organism_U', 'Entryname_U',
       'EnsemblGeneID_U', 'MIMNumber_O', 'MIMEntryType_O',
       'EntrezGeneID_NCBI_O', 'EnsemblGeneID_O', 'AccessionNumber_C',
       'MutationId_C', 'MutationDescription_C', 'MutationGenomePosition_C', 'UniprotID',
       'NaturalVariant_U', 'Mutagenesis_U']

In [None]:
df_merged=df_merged[['UniprotID','GeneName', 'Entry_U', 'ProteinName_U','EnsemblGeneID_U', \
                     'EnsemblGeneID_O','EntrezGeneID_NCBI_O','MIMNumber_O', 'MIMEntryType_O', \
                     'NaturalVariant_U','Mutagenesis_U','AccessionNumber_C','MutationId_C', \
                     'MutationDescription_C', 'MutationGenomePosition_C']]

In [None]:
df_merged[df_merged['GeneName'] =='TP53']

In [None]:
df_merged.columns

In [None]:
from datetime import datetime
from pytz import timezone

import uuid

tz = timezone('EST') # time zone for the entry timestamp
df_merged['Entrydate'] = datetime.now(tz) # time-zone-aware timestamp of this load

# adding a unique identifier per row as the first column
df_merged.insert(0, 'Id', [uuid.uuid4() for _ in range(len(df_merged))])

In [None]:
df_merged['EntrezGeneID_NCBI_O']=df_merged.EntrezGeneID_NCBI_O.astype(str)
df_merged['MIMNumber_O']=df_merged.MIMNumber_O.astype(str)

In [None]:
df_merged.head(200).to_csv('somaticFinal.csv')

# Inserting Somatic Data into Cassandra Database
This step stores the merged data in a Cassandra database cluster for persistence and reliability, among other benefits.

In [None]:
import itertools
from multiprocessing import Pool
import sys
import time
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.query import tuple_factory
from cassandra.auth import PlainTextAuthProvider

In [None]:
df_con=pd.read_csv('~/connection_point.csv',header=0) # connection details are read from a local file as a basic level of security.
# Please note that this file is not uploaded. It is only present on the Jupyter server.

In [None]:
def _insertData(params):
    # Each worker process opens its own connection to the cluster
    cluster = Cluster(contact_points=[df_con.ip[0]], auth_provider = \
                      PlainTextAuthProvider(username=df_con.user[0], \
                                            password=df_con.token[0]))
    session = cluster.connect()
    session.set_keyspace('somatic')
    session.row_factory = tuple_factory
    # CQL syntax is INSERT INTO <table>, without the TABLE keyword
    prepared=session.prepare("INSERT INTO somatic.somaticMerged \
                             (id,UniprotID,GeneName,Entry_U,ProteinName_U, \
                             EnsemblGeneID_U,EnsemblGeneID_O,EntrezGeneID_NCBI_O, \
                             MIMNumber_O,MIMEntryType_O,NaturalVariant_U,Mutagenesis_U, \
                             AccessionNumber_C,MutationId_C,MutationDescription_C, \
                             MutationGenomePosition_C,entrydate) \
                             VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")            
    
    # the DataStax driver executes the prepared statement concurrently for this batch of rows
    execute_concurrent_with_args(session, prepared, params, concurrency=50) 
    cluster.shutdown() # release the connection once the batch is written
    return None

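def multiprocess(params):
    # Split the rows into batches of 100 and spread the batches over 2 worker processes
    batches = [params[n:n+100] for n in range(0, len(params),100)]
    with Pool(processes=2) as pool:
        results = pool.map(_insertData, batches)
    return results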

if __name__ == "__main__":
    # One parameter tuple per row, in the same column order as the INSERT statement
    parameters = [tuple(row) for row in df_merged.values]
    a = multiprocess(parameters)
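
Rows built from `df_merged.values` still carry `NaN` for missing cells, which the Cassandra driver may reject for text columns. A minimal sketch, assuming missing cells should be written as nulls, that converts them to `None` while building the parameter tuples (`frame` is a hypothetical stand-in for `df_merged`, with made-up contents):

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical data) for df_merged
frame = pd.DataFrame({
    'GeneName': ['HSD3B7', 'ZNF592'],
    'MIMNumber_O': ['607765', np.nan],
})

# Replace every missing cell with None so it is bound as a null value
parameters = [
    tuple(None if pd.isna(value) else value for value in row)
    for row in frame.itertuples(index=False, name=None)
]

print(parameters)
```

The resulting list has the same shape as `parameters` in the cell above and could be passed to `multiprocess` in its place.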

In [None]:
# Planned next steps: Graph, ML
