# Capstone 2 – Project Proposal – Somatic Mutation

# 1 Objective: 

The goal is to segment the different types of somatic and germline mutations in human genes associated with inherited and acquired diseases. The result is intended to be a one-stop, comprehensive collection of mutation data (segments) for easy discovery in the era of personalized medicine. As part of this project I would like to find:

a.	Somatic Germline segmentation

b.	Acquired disease with the maximum number of somatic mutations

c.	Inherited disease with the maximum number of somatic mutations

### Outcome: A comprehensive somatic mutation database, an invaluable resource for scientists.

# 2 Client

#### Will be provided by Sona
Sample: The client for this project is Georgetown University (www.georgetown.edu) and its Bioinformatics Department. The purpose is to find an ML model that can correctly cluster somatic mutations.

# 3 Data source and Credits


Will be provided by Sona.

# 4. Solution Approach

The solution plans to use the PCA and NMF machine learning techniques for dimensionality reduction and segmentation of somatic mutations. It is specifically designed to handle large datasets with biodiversity and quality issues such as redundancy, missing values and wrong labels. The solution is sub-divided into three phases, as listed below.





### a)Data Assembly - Phase I: 
This phase of the project gathers the data and performs basic cleanup such as joining, merging, and adding or updating attributes.

### b)Explore and Preprocessing – Phase II: 

This phase of the project is designed to validate and explore the dataset for all the problems listed in the “Problem” section of this proposal.

### c)Modelling and Evaluation – Phase III: 

This phase of the project focuses on exploring various machine learning algorithms and tuning their hyperparameters to find the best ML model for clustering the somatic mutations.
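
For reference, the two decompositions named in the solution approach can be stated compactly. The notation here is an assumption made for illustration and does not come from the project data: $X \in \mathbb{R}^{n \times p}$ is a feature matrix (rows are genes or samples, columns are mutation features), $X_c$ is its column-centered copy, and $k$ is the chosen number of components.

$$
X_c = U \Sigma V^{\top}, \qquad Z = X_c V_k \quad \text{(PCA: project onto the top } k \text{ right singular vectors)}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2, \qquad W \in \mathbb{R}^{n \times k},\; H \in \mathbb{R}^{k \times p} \quad \text{(NMF)}
$$

NMF requires every entry of $X$ to be non-negative, which suits count-like mutation features. The Frobenius-norm objective shown is a common default rather than the only choice.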

## 4.1 Data Assembly - Phase I:

OMIM Data identifier is O

UNIPROT data identifier is U

COSMIC data identifier is C

# 4.1.2 Import Uniprot Data.

In [1]:
import pandas as pd
from google.cloud import storage

import datetime as dt
from datetime import datetime
from pytz import timezone
pd.options.display.max_colwidth = 10000
import uuid

client = storage.Client()
bucket=client.get_bucket('somatic_germline_mutations')
blob = storage.Blob('uniprot-all-2.tab',bucket)
with open('uniprot-all-2.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)

df_U=pd.read_csv('uniprot-all-2.tab',sep='\t', header=0, \
               names=['Entry_U','GeneName','ProteinName_U','Organism_U','Entryname_U', \
                      'NaturalVariant_U','InvolmentInDisease_U'], \
               dtype={'Entry_U':object,'Entryname_U':object,'ProteinName_U':object,'GeneName':object, \
                      'Organism_U':object,'NaturalVariant_U':object,'Mutagenesis_U':object})

df_U=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U','NaturalVariant_U','InvolmentInDisease_U']]
df_U.shape

(20244, 7)

### Number of entries in the UniProt list above: 20,244 with 7 columns.
We started with only 7 columns so that the UniProt list stays manageable. 

In [2]:
df_U[df_U.Entry_U=='Q93074'] #df_U[df_U.GeneName=='MED12']
#GeneName can hold more than one name; only the first name is used as the GeneName

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,NaturalVariant_U,InvolmentInDisease_U
9640,MED12 ARC240 CAGH45 HOPA KIAA0192 TNRC11 TRAP230,Q93074,Mediator of RNA polymerase II transcription subunit 12 (Activator-recruited cofactor 240 kDa component) (ARC240) (CAG repeat protein 45) (Mediator complex subunit 12) (OPA-containing protein) (Thyroid hormone receptor-associated protein complex 230 kDa component) (Trap230) (Trinucleotide repeat-containing gene 11 protein),Homo sapiens (Human),MED12_HUMAN,"VARIANT 961 961 R -> W (in OKS; dbSNP:rs80338758). {ECO:0000269|PubMed:17334363}. /FTId=VAR_033112.; VARIANT 1007 1007 N -> S (in LUJFRYS; dbSNP:rs80338759). {ECO:0000269|PubMed:17369503}. /FTId=VAR_037534.; VARIANT 1148 1148 R -> H (in OHDOX; dbSNP:rs387907360). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069770.; VARIANT 1165 1165 S -> P (in OHDOX; dbSNP:rs387907361). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069771.; VARIANT 1392 1392 Q -> R (in dbSNP:rs1139013). {ECO:0000269|PubMed:10198638, ECO:0000269|PubMed:8724849}. /FTId=VAR_046672.; VARIANT 1729 1729 H -> N (in OHDOX; dbSNP:rs387907362). {ECO:0000269|PubMed:23395478}. /FTId=VAR_069772.; VARIANT 1974 1974 Q -> H (found in a family with X-linked intellectual disability; unknown pathological significance; dbSNP:rs879255528). {ECO:0000269|PubMed:26273451}. /FTId=VAR_074018.","DISEASE: Opitz-Kaveggia syndrome (OKS) [MIM:305450]: X-linked disorder characterized by mental retardation, relative macrocephaly, hypotonia and constipation. {ECO:0000269|PubMed:17334363}. Note=The disease is caused by mutations affecting the gene represented in this entry.; DISEASE: Lujan-Fryns syndrome (LUJFRYS) [MIM:309520]: Clinically, Lujan-Fryns syndrome can be distinguished from Opitz-Kaveggia syndrome by tall stature, hypernasal voice, hyperextensible digits and high nasal root. {ECO:0000269|PubMed:17369503}. 
Note=The disease is caused by mutations affecting the gene represented in this entry.; DISEASE: Ohdo syndrome, X-linked (OHDOX) [MIM:300895]: A syndrome characterized by mental retardation, feeding problems, and distinctive facial appearance with coarse facial features, severe blepharophimosis, ptosis, a bulbous nose, micrognathia and a small mouth. Dental hypoplasia and deafness can be considered as common manifestations of the syndrome. Male patients show cryptorchidism and scrotal hypoplasia. {ECO:0000269|PubMed:23395478}. Note=The disease is caused by mutations affecting the gene represented in this entry."


# Observations:

1. Multiple gene names: only the first gene name is used as the key.
2. Natural Variant: this field also holds multiple variants. We need to extract a few data points from this text:

    a) VARIANT ID
    
    b) From To
    
    c) Position 
    
    d) dbSNP
    
    e) PubMed id
3. Involvement in Disease: multiple diseases are stacked into this field. We need to extract a few data points:
    
    a) Disease name 
    
    b) MIM ID 
    
    c) Pub Med
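
The variant data points listed in observation 2 can be pulled out with a few regular expressions. The following is a minimal sketch, assuming each annotation follows the `VARIANT start end REF -> ALT (...)` layout seen in the sample row above; the pattern names and `parse_variant` are hypothetical helpers, not part of the notebook, and annotations in other layouts (for example deletions) simply return `None`.

```python
import re

# Hypothetical patterns for one UniProt "VARIANT" annotation in the layout shown above.
VARIANT_RE = re.compile(
    r"VARIANT\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<ref>[A-Z]+)\s*->\s*(?P<alt>[A-Z]+)"
)
DBSNP_RE = re.compile(r"dbSNP:(rs\d+)")
PUBMED_RE = re.compile(r"PubMed:(\d+)")
FTID_RE = re.compile(r"/FTId=(VAR_\d+)")


def parse_variant(text):
    """Return a dict of fields from one VARIANT annotation, or None if it does not match."""
    m = VARIANT_RE.search(text)
    if m is None:
        return None
    dbsnp = DBSNP_RE.search(text)
    ftid = FTID_RE.search(text)
    return {
        "variant_id": ftid.group(1) if ftid else "",
        "from_to": f"{m['start']}-{m['end']}",
        "change": f"{m['ref']}->{m['alt']}",
        "dbsnp": dbsnp.group(1) if dbsnp else "",
        "pubmed": PUBMED_RE.findall(text),  # every PubMed id, kept as a list
    }


# One annotation copied from the MED12 (Q93074) sample row.
sample = ("VARIANT 1392 1392 Q -> R (in dbSNP:rs1139013). "
          "{ECO:0000269|PubMed:10198638, ECO:0000269|PubMed:8724849}. /FTId=VAR_046672.")
record = parse_variant(sample)
print(record)
```

Keeping the PubMed ids as a list, rather than one joined string, makes it easier to match each id against other tables later.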

In [3]:
#df_U[df_U.GeneName.isnull()].to_csv('GeneNull.csv') #viewing entries from Uniprot file without a gene name.

df_U=df_U[df_U.GeneName.notnull()] #Removing entries from Uniprot file without a gene name.
df_U['GeneName']=df_U['GeneName'].apply(lambda x:x.split(' ')[0]) 
df_U.shape

(20087, 7)

# Number of entries in the UniProt list above: 20,087 with 7 columns.

### Extracting Natural Variant, dbSNP, Variant Position, PubMed, FromTo, MIM and Disease information
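
As a compact illustration of the disease part of this extraction, the sketch below parses the disease name, MIM id and PubMed ids from one field. It assumes the `DISEASE: name (ABBR) [MIM:id]: ...` layout of the MED12 sample row; `parse_diseases` is a hypothetical helper, the descriptions in the sample are shortened, and blocks without a MIM id (such as `DISEASE: Note=...`) are skipped rather than handled.

```python
import re

# Hypothetical pattern: "DISEASE: <name> (<ABBR>) [MIM:<id>]"
DISEASE_RE = re.compile(r"DISEASE:\s*(?P<name>.+?)\s*\[MIM:(?P<mim>\d+)\]", re.S)
PUBMED_RE = re.compile(r"PubMed:(\d+)")


def parse_diseases(field):
    """Split an involvement-in-disease field and return one dict per DISEASE block."""
    records = []
    for block in field.split(".;"):
        m = DISEASE_RE.search(block)
        if m is None:
            continue  # block without a "[MIM:...]" tag; not handled in this sketch
        records.append({
            "disease": m.group("name"),
            "mim": m.group("mim"),
            "pubmed": PUBMED_RE.findall(block),
        })
    return records


# Two blocks from the MED12 sample row, with the free-text descriptions shortened.
field = ("DISEASE: Opitz-Kaveggia syndrome (OKS) [MIM:305450]: X-linked disorder. "
         "{ECO:0000269|PubMed:17334363}. Note=The disease is caused by mutations "
         "affecting the gene represented in this entry.; "
         "DISEASE: Lujan-Fryns syndrome (LUJFRYS) [MIM:309520]: Clinically distinct. "
         "{ECO:0000269|PubMed:17369503}.")
diseases = parse_diseases(field)
for d in diseases:
    print(d)
```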

In [4]:
df_U_variant=df_U[['Entry_U','NaturalVariant_U']][df_U.NaturalVariant_U.notnull()]
df_U_disease=df_U[['Entry_U','InvolmentInDisease_U']][df_U.InvolmentInDisease_U.notnull()]

In [5]:
# This function extracts disease information from the InvolmentInDisease_U column.
import re
lst__D=[]
def findDisease(row):
    for x in row['InvolmentInDisease_U'].split('.;'):
        #print(x)
        #print('++++++++++++++++++++++')
        PubMed=''
        mim_=''
        disease_=''
        
        aa=[]
        search_pattern=r'(?s)DISEASE:(.*?)(?=\[)'
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            disease_=aa[0].strip()
        else:
            #DISEASE: Note=Defects in ATM contribute to B-cell non-Hodgkin lymphomas (BNHL), 
            #including mantle cell lymphoma (MCL).
            #DISEASE: Note=A chromosomal aberration involving BRAF is found in pilocytic astrocytomas. 
            #A tandem duplication of 2 Mb at 7q34 leads to the expression of a KIAA1549-BRAF fusion protein with a 
            #constitutive kinase activity and inducing cell transformation. {ECO:0000269|PubMed:18974108}
                
            search_pattern=r'(?:^|\W\()[A-Z]*(?:$|\))'
            aa=re.findall(search_pattern,x)
            disease_=' &'.join(aa).strip().replace('(','').replace(')','')
            
        aa=[]
        search_pattern="(?s)MIM:(.*?)(?=])"
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            mim_=aa[0]
            
        aa=[]
        search_pattern='(?s){(.*?)(?=})'
        aa=re.findall(search_pattern,x)            
        if len(aa)!=0:
            for i in aa[0].split(','):
                PubMed=''
                yy=[]
                if '|' in i:                    
                    #print(row['Entry_U'],i,i.split('|')[1][7:])
                    PubMed=i.split('|')[1][7:]
                    if (len(PubMed)!=0) and (len(disease_)==0): #handling boundary cases                                                 
                        search_pattern=r'(?s)Note=(.*?)(?=\.)'
                        note1=re.findall(search_pattern,x)                                  
                        if len(note1)!=0:
                            search_pattern='[A-Z]{3,}'
                            yy=re.findall(search_pattern,note1[0])
                            disease_=','.join(yy).strip()
                        
                    lst__D.append([row['Entry_U'],disease_,mim_,PubMed])
                else:
                    PubMed=i
                    lst__D.append([row['Entry_U'],disease_,mim_,PubMed])                    
    return(1)

In [6]:
#this function extracts variant information from Natural Variant column
import re
lst__V=[]
def findVariant(row):        
    for x in row['NaturalVariant_U'].split('.;'):
        dbSNP=''
        FromTo=''
        Variant=''
        PubMed=''
        aa=[]
        part1=''
        part2=''
        part1=x.strip().split('/')[0].strip()
        if len(x.strip().split('/'))==2:
            part2=x.strip().split('/')[1][5:].replace('.','')
        
        search_pattern=r'(?s)dbSNP:(.*?)(?=\).)'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:
            dbSNP=aa[0]
        
        aa=[]
        search_pattern=r'(?s)VARIANT(.*?)(?=\()'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:
            if len(aa[0].strip().split(' '))==5:                
                FromTo=aa[0].strip().split(' ')[0]+'-'+aa[0].strip().split(' ')[1]
                Variant=aa[0].strip().split(' ')[2]+aa[0].strip().split(' ')[3]+aa[0].strip().split(' ')[4]
                #print(FromTo,Variant)
            else:
                FromTo=aa[0].strip().split(' ')[0]+'-'+aa[0].strip().split(' ')[1]
                Variant=aa[0].strip().split(' ')[2] #Missing variant information??
                            
        search_pattern='(?s){(.*?)(?=})'
        aa=re.findall(search_pattern,part1)            
        if len(aa)!=0:
            PubMed=[]
            for i in aa[0].split(','):                
                if '|' in i:
                    PubMed.append(i.split('|')[1][7:])
                else:
                    PubMed.append(i)
                    
            lst__V.append([row['Entry_U'],part1,part2,'Uniprot',dbSNP,FromTo,Variant,','.join(PubMed)])
            
    return (1)

In [7]:
#applying the functions
_=df_U_variant.apply(lambda row: findVariant(row),axis=1)
_=df_U_disease.apply(lambda row: findDisease(row),axis=1)

In [8]:
#converting the list objects to dataframes
_df_V = pd.DataFrame(data=lst__V,columns=['Entry_U','Variant_U','VariantID_U','VariantSource','dbSNP_U', \
                                        'FromTo_U','VariantPos_U','PubMed_U'])
_df_D = pd.DataFrame(data=lst__D,columns=['Entry_U','Disease_U','MIM_U','PubMed_U'])

### cheat sheets for checking data

In [None]:
_df_V[_df_V.Entry_U=='Q93074']
df_U_disease[df_U_disease.Entry_U=='Q03164']
#df_U_disease[df_U_disease.InvolmentInDisease_U.isnull()]
#df_U_disease[df_U_disease.Entry_U=='P15056']
#__=df_U_disease[df_U_disease.Entry_U=='P15056'].apply(lambda row: findDisease(row),axis=1)
#lst__D
#_df_D[_df_D.Entry_U=='Q9UKU0']
#_df_D[_df_D.duplicated(keep=False)]  #checking duplicate records

## Removing all duplicate rows from Disease

In [9]:
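# Note: keep=False drops every copy of a duplicated row; keep='first' would retain one copy of each.
_df_D_Final=_df_D.drop_duplicates(keep=False)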

In [11]:
# checking if we have any duplicate data
_df_D_Final[_df_D_Final.duplicated(keep=False)]

Unnamed: 0,Entry_U,Disease_U,MIM_U,PubMed_U
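
In pandas, `drop_duplicates(keep=False)` drops every row that has a duplicate, including its first occurrence, whereas `keep='first'` retains one copy of each distinct row. The sketch below is a plain-Python analogue of the two behaviours on made-up rows (the values are illustrative, not taken from `_df_D`), which may help when deciding which behaviour is wanted here.

```python
from collections import Counter

# Illustrative (Entry_U, Disease_U, MIM_U, PubMed_U) rows; the second repeats the first.
rows = [
    ("Q93074", "OKS", "305450", "17334363"),
    ("Q93074", "OKS", "305450", "17334363"),
    ("Q93074", "LUJFRYS", "309520", "17369503"),
]

counts = Counter(rows)

# Analogue of drop_duplicates(keep=False): discard every row that occurs more than once.
keep_false = [r for r in rows if counts[r] == 1]

# Analogue of drop_duplicates(keep='first'): keep one copy of each distinct row, in order.
keep_first = list(dict.fromkeys(rows))

print(len(keep_false), len(keep_first))
```

With `keep=False` the OKS record disappears entirely, while `keep='first'` still retains it once.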


In [12]:
print(_df_V.shape,_df_D_Final.shape)

(54697, 8) (17770, 4)


In [13]:
_df_VD=pd.merge(_df_V,_df_D_Final,how='left', on=['Entry_U','PubMed_U'])
_df_VD.shape

(57852, 10)
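
The row count grows from 54,697 to 57,852 even though this is a left join. That happens whenever an (`Entry_U`, `PubMed_U`) key on the left matches more than one row on the right, for example when one publication is linked to several diseases of the same entry. A minimal pure-Python sketch of the effect, using hypothetical ids and a hypothetical `left_join` helper rather than `pd.merge`:

```python
from collections import defaultdict


def left_join(left, right, key):
    """Left-join two lists of dicts on `key` (a tuple of field names)."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in key)].append(r)
    out = []
    for l in left:
        matches = index.get(tuple(l[k] for k in key))
        if not matches:
            out.append(dict(l))       # unmatched left row is kept exactly once
        else:
            for r in matches:         # one output row per matching right row
                out.append({**l, **r})
    return out


# Hypothetical data: entry X1 / PubMed 111 is linked to two diseases.
variants = [
    {"Entry_U": "X1", "PubMed_U": "111", "Variant": "A->B"},
    {"Entry_U": "X2", "PubMed_U": "222", "Variant": "C->D"},
]
diseases = [
    {"Entry_U": "X1", "PubMed_U": "111", "Disease_U": "D1"},
    {"Entry_U": "X1", "PubMed_U": "111", "Disease_U": "D2"},
]

joined = left_join(variants, diseases, key=("Entry_U", "PubMed_U"))
print(len(variants), "->", len(joined))
```

Two left rows become three output rows, because the first variant is repeated once per matching disease.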

In [14]:
_df_VD[_df_VD.Entry_U=='Q93074']

Unnamed: 0,Entry_U,Variant_U,VariantID_U,VariantSource,dbSNP_U,FromTo_U,VariantPos_U,PubMed_U,Disease_U,MIM_U
30604,Q93074,VARIANT 961 961 R -> W (in OKS; dbSNP:rs80338758). {ECO:0000269|PubMed:17334363}.,VAR_033112,Uniprot,rs80338758,961-961,R->W,17334363,Opitz-Kaveggia syndrome (OKS),305450.0
30605,Q93074,VARIANT 1007 1007 N -> S (in LUJFRYS; dbSNP:rs80338759). {ECO:0000269|PubMed:17369503}.,VAR_037534,Uniprot,rs80338759,1007-1007,N->S,17369503,Lujan-Fryns syndrome (LUJFRYS),309520.0
30606,Q93074,VARIANT 1148 1148 R -> H (in OHDOX; dbSNP:rs387907360). {ECO:0000269|PubMed:23395478}.,VAR_069770,Uniprot,rs387907360,1148-1148,R->H,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30607,Q93074,VARIANT 1165 1165 S -> P (in OHDOX; dbSNP:rs387907361). {ECO:0000269|PubMed:23395478}.,VAR_069771,Uniprot,rs387907361,1165-1165,S->P,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30608,Q93074,"VARIANT 1392 1392 Q -> R (in dbSNP:rs1139013). {ECO:0000269|PubMed:10198638, ECO:0000269|PubMed:8724849}.",VAR_046672,Uniprot,rs1139013,1392-1392,Q->R,101986388724849,,
30609,Q93074,VARIANT 1729 1729 H -> N (in OHDOX; dbSNP:rs387907362). {ECO:0000269|PubMed:23395478}.,VAR_069772,Uniprot,rs387907362,1729-1729,H->N,23395478,"Ohdo syndrome, X-linked (OHDOX)",300895.0
30610,Q93074,VARIANT 1974 1974 Q -> H (found in a family with X-linked intellectual disability; unknown pathological significance; dbSNP:rs879255528). {ECO:0000269|PubMed:26273451}.,VAR_074018,Uniprot,rs879255528,1974-1974,Q->H,26273451,,


In [16]:
df_U_mod=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U']] #[df_U.Entry_U=='Q93074']

In [17]:
df_merged_U=pd.merge(_df_VD,df_U_mod,how='left', on=['Entry_U'])
df_merged_U.shape

(57852, 14)

In [18]:
df_merged_U[df_merged_U.Entry_U=='P0CW18']

Unnamed: 0,Entry_U,Variant_U,VariantID_U,VariantSource,dbSNP_U,FromTo_U,VariantPos_U,PubMed_U,Disease_U,MIM_U,GeneName,ProteinName_U,Organism_U,Entryname_U
43460,P0CW18,VARIANT 176 176 R -> G (in MCOP6; dbSNP:rs387907096). {ECO:0000269|PubMed:21397065}.,VAR_065076,Uniprot,rs387907096,176-176,R->G,21397065,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43461,P0CW18,VARIANT 237 237 G -> R (in MCOP6; dbSNP:rs730882160). {ECO:0000269|PubMed:21850159}.,VAR_069226,Uniprot,rs730882160,237-237,G->R,21850159,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43462,P0CW18,VARIANT 302 302 V -> F (in MCOP6; dbSNP:rs74703359). {ECO:0000269|PubMed:21850159}.,VAR_069227,Uniprot,rs74703359,302-302,V->F,21850159,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43463,P0CW18,VARIANT 309 309 W -> S (in MCOP6; dbSNP:rs387907095). {ECO:0000269|PubMed:21397065}.,VAR_065077,Uniprot,rs387907095,309-309,W->S,21397065,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43464,P0CW18,VARIANT 320 320 G -> R (in MCOP6; dbSNP:rs730882158). {ECO:0000269|PubMed:21850159}.,VAR_069228,Uniprot,rs730882158,320-320,G->R,21850159,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43465,P0CW18,VARIANT 395 395 C -> R (in MCOP6; dbSNP:rs730882161). {ECO:0000269|PubMed:21850159}.,VAR_069229,Uniprot,rs730882161,395-395,C->R,21850159,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN
43466,P0CW18,VARIANT 599 599 P -> A (in MCOP6; dbSNP:rs61744404). {ECO:0000269|PubMed:21532570}.,VAR_069230,Uniprot,rs61744404,599-599,P->A,21532570,"Microphthalmia, isolated, 6 (MCOP6)",613517,PRSS56,Serine protease 56 (EC 3.4.21.-),Homo sapiens (Human),PRS56_HUMAN


### Number of entries in the merged UniProt variant and disease list above: 57,852 with 14 columns.

# 4.1.3 Import OMIM

In [19]:
blob = storage.Blob('uniprot-yourlist3AM20180120A7434721E10EE6586998A056CCD0537E3100E40.tab',bucket)
with open('uniprot-yourlist3AM20180120A7434721E10EE6586998A056CCD0537E3100E40.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)
df_O_X=pd.read_csv('uniprot-yourlist3AM20180120A7434721E10EE6586998A056CCD0537E3100E40.tab',sep='\t', \
                   skiprows=1,header=None,)
                #names=['MIMNumber_O','MIMEntryType_O','EntrezGeneID_NCBI_O','GeneName','EnsemblGeneID_O'])

In [20]:
df_O_X.columns=['MIMNumber_O', 1, 'Entry_U', 3, 4, 5, 'GeneName', 7, 8, 9]

In [21]:
df_O_X[df_O_X.GeneName=='PRSS56']

Unnamed: 0,MIMNumber_O,1,Entry_U,3,4,5,GeneName,7,8,9
11974,613517613858,",",P0CW18,PRS56_HUMAN,reviewed,Serine protease 56 (EC 3.4.21.-),PRSS56,Homo sapiens (Human),603,613517;613858;


In [22]:
df_O=df_O_X[['MIMNumber_O','Entry_U','GeneName']].copy()

In [23]:
df_O[df_O.duplicated(subset='GeneName',keep=False)]

Unnamed: 0,MIMNumber_O,Entry_U,GeneName
167,114130,P01258,CALCA CALC1
168,114130,P06881,CALCA CALC1
210,116896,P39880,CUX1 CUTL1
211,116896,Q13948,CUX1 CUTL1
398,134690,P35544,FAU
399,134690,P62861,FAU
438,137220,P01358,
567,142800,P01891,HLA-A HLAA
568,142800,P01892,HLA-A HLAA
570,142800,P04439,HLA-A HLAA


In [24]:
df_O[df_O.GeneName.isnull()]

Unnamed: 0,MIMNumber_O,Entry_U,GeneName
438,137220,P01358,
14268,616891,P0DN84,


In [25]:
df_O=df_O[df_O.GeneName.notnull()]
df_O['GeneName']=df_O['GeneName'].apply(lambda x:x.split(' ')[0])

In [26]:
df_O.sample(21)

Unnamed: 0,MIMNumber_O,Entry_U,GeneName
6285,605949,Q9NQ84,GPRC5C
8187,144700171300193300263400608537,P40337,VHL
3153,601514,Q92914,FGF11
5037,182601604277,Q9UBP0,SPAST
5929,605482,P78417,GSTO1
14555,617325,A0PJX4,SHISA3
11475,607439613150613156613158,Q9UKY4,POMT2
1716,300149,Q99966,CITED1
396,134651,P05413,FABP3
8343,608784,Q9ULC8,ZDHHC8


In [27]:
df_O.shape

(14859, 3)

## Number of entries in the OMIM list above: 14,859 with 3 columns.

We are focusing on those entries which are associated with disease.

### Number of entries in the OMIM list above after removing entries without a gene name: 16,021 with 5 columns.

## Testing for Gene involved in multiple disorder

# Merging OMIM to Uniprot

In [28]:
df_merged=pd.merge(df_merged_U,df_O,how='left', on=['GeneName','Entry_U'])
df_merged.shape

(57852, 15)

In [30]:
df_merged.sample(2)

Unnamed: 0,Entry_U,Variant_U,VariantID_U,VariantSource,dbSNP_U,FromTo_U,VariantPos_U,PubMed_U,Disease_U,MIM_U,GeneName,ProteinName_U,Organism_U,Entryname_U,MIMNumber_O
43775,P01112,VARIANT 13 13 G -> R (in SFM; somatic mutation; shows constitutive activation of the MAPK and PI3K-AKT signaling pathways; dbSNP:rs104894228). {ECO:0000269|PubMed:22683711}.,VAR_068817,Uniprot,rs104894228,13-13,G->R,22683711,Schimmelpenning-Feuerstein-Mims syndrome (SFM),163200,HRAS,"GTPase HRas (H-Ras-1) (Ha-Ras) (Transforming protein p21) (c-H-ras) (p21ras) [Cleaved into: GTPase HRas, N-terminally processed]",Homo sapiens (Human),RASH_HUMAN,109800163200188470190020218040
22615,Q53GS7,VARIANT 684 684 I -> T (in LAAHD; dbSNP:rs121434409). {ECO:0000269|PubMed:18204449}.,VAR_043877,Uniprot,rs121434409,684-684,I->T,18204449,Lethal arthrogryposis with anterior horn cell disease (LAAHD),611890,GLE1,Nucleoporin GLE1 (hGLE1) (GLE1-like protein),Homo sapiens (Human),GLE1_HUMAN,253310603371611890


### Number of entries after merging the OMIM and UniProt lists: 57,852 with 15 columns.

We are now focusing on genes with known MIM IDs (genes known to be involved in disease). We have filtered out the entries without any MIM ID.

In [None]:
#df_merged[df_merged.MIMNumber_O.isnull()].head(5) #checking genes without any disease id 
#df_merged[df_merged['GeneName'] =='HSD3B7'] #checking a specific gene
#df_merged=df_merged.dropna(subset=['MIMNumber_O']) # 
#df_merged.shape

# Number of entries in the merged list after removing entries without MIM IDs: 15,034 with 11 columns.

In [None]:
df_merged[df_merged['GeneName'] =='TP53']#.to_csv(<<gene_name>>.csv') # checking one sample gene

### <font color='red'>Assumption and Considerations:</font>
Please note that Uniprot is considered as the master list and merged with the OMIM list.

We can see that the number of entries for TP53 is reduced to only 1. Possible reasons (need investigation):
1) A GeneName in the OMIM list may not be present in the UniProt list.
2) We may have dropped IDs without proper validation.
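
One way to start investigating the first reason is to compare the sets of primary gene symbols on each side. This is a minimal sketch on a handful of gene fields copied from the outputs above; `first_symbol` is a hypothetical helper that mirrors the "first name only" rule used earlier, and the two small lists merely stand in for the real `GeneName` columns.

```python
def first_symbol(gene_field):
    """Return the primary gene symbol (first whitespace-separated token), or None."""
    tokens = (gene_field or "").split()
    return tokens[0] if tokens else None


# Stand-ins for the GeneName columns of the two sources (illustrative subset only).
uniprot_fields = ["MED12 ARC240 CAGH45", "PRSS56", "HLA-A HLAA"]
omim_fields = ["CALCA CALC1", "PRSS56", "HLA-A HLAA", None]

uniprot_genes = {first_symbol(g) for g in uniprot_fields} - {None}
omim_genes = {first_symbol(g) for g in omim_fields} - {None}

only_in_omim = sorted(omim_genes - uniprot_genes)
only_in_uniprot = sorted(uniprot_genes - omim_genes)
print("only in OMIM:", only_in_omim)
print("only in UniProt:", only_in_uniprot)
```

On the real dataframes, a similar check could be made with `pd.merge(..., how='outer', indicator=True)`, which labels each row as `left_only`, `right_only` or `both`.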

# 4.1.4 Import COSMIC

### Note:
Import mutation dataset from COSMIC database. It provides a tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set. The name of the file is 'CosmicCompleteTargetedScreensMutantExport.tsv.gz'

We will only consider those entries which have Mutation IDs.

In [None]:
blob = storage.Blob('CosmicCompleteTargetedScreensMutantExport.tsv',bucket)
with open('CosmicCompleteTargetedScreensMutantExport.tsv', 'wb') as file_obj:
    blob.download_to_file(file_obj)

In [None]:
import pandas as pd
chunksize = 10 ** 6
colnames=["GeneName","AccessionNumber_C","GeneCDSlength","HGNCid","SampleName","SampleId", \
          "IdTumour", "PrimarySite","SiteSubtype1","SiteSubtype2","SiteSubtype3", \
          "PrimaryHistology","HistologySubtype1", "HistologySubtype2","HistologySubtype3", \
          "GenomeWideScreen","MutationId_C","MutationCDS","MutationAA_C", "MutationDescription_C", \
          "MutationZygosity","LOH","GRCh","MutationGenomePosition_C","MutationStrand", 
          "SNP","ResistanceMutation","FATHMMPrediction","FATHMMScore","MutationSomaticStatus", \
          "Pubmed_PMID", "IdStudy","SampleSource","TumourOrigin","Age"]
keep_cols=['GeneName','AccessionNumber_C','MutationId_C','MutationDescription_C', \
           'MutationGenomePosition_C','MutationAA_C']

chunks=[]
# read every column as object (string) so that ids are not converted to numbers
for chunk in pd.read_csv('CosmicCompleteTargetedScreensMutantExport.tsv',sep='\t',header=0, \
                         names=colnames,dtype=object,chunksize=chunksize):
    # selecting only rows with a mutation id; collect each chunk instead of overwriting the previous one
    chunks.append(chunk.loc[chunk.MutationId_C.notnull(),keep_cols])

# combine the filtered chunks into a single dataframe
cosmic_C=pd.concat(chunks,ignore_index=True)

In [None]:
cosmic_C.shape

In [None]:
cosmic_C.sample(31)

## Number of entries in the COSMIC list after removing entries without mutation IDs: 13,155 with 5 columns.

# Merging COSMIC to Uniprot

In [None]:
df_merged=pd.merge(df_merged,cosmic_C,how='left', on='GeneName')
df_merged.shape

## Number of entries in the merged list after including the COSMIC dataset: 26,934 with 16 columns.

In [None]:
df_merged[df_merged['GeneName'] =='TP53'] # checking one sample gene. We can see lot of duplicate entries in the list.
#df_merged.NaturalVariant_U[5]

# 4.2 Analysis

In [None]:
#Removing entries from omim file without a gene name.
#df_snp_u=df_snp_u[(df_snp_u.NaturalVariant.notnull()) | (df_snp_u.Mutagenesis.notnull())] 
#df_snp_u.shape
df_merged.shape
df_merged.sample(5)

# Number of entries in the merged list after including the OMIM, COSMIC and UniProt datasets: 26,934 with 15 columns.

In [None]:
df_merged.columns

In [None]:
df_merged.columns=['GeneName', 'Entry_U', 'ProteinName_U', 'Organism_U', 'Entryname_U',
       'EnsemblGeneID_U', 'MIMNumber_O', 'MIMEntryType_O',
       'EntrezGeneID_NCBI_O', 'EnsemblGeneID_O', 'AccessionNumber_C',
       'MutationId_C', 'MutationDescription_C', 'MutationGenomePosition_C', 'UniprotID',
       'NaturalVariant_U', 'Mutagenesis_U']

In [None]:
df_merged=df_merged[['UniprotID','GeneName', 'Entry_U', 'ProteinName_U','EnsemblGeneID_U', \
                     'EnsemblGeneID_O','EntrezGeneID_NCBI_O','MIMNumber_O', 'MIMEntryType_O', \
                     'NaturalVariant_U','Mutagenesis_U','AccessionNumber_C','MutationId_C', \
                     'MutationDescription_C', 'MutationGenomePosition_C']]

In [None]:
df_merged[df_merged['GeneName'] =='TP53']

In [None]:
df_merged.columns

In [None]:
from datetime import datetime
from pytz import timezone

import uuid

tz = timezone('EST') # time zone used for the entry timestamp
df_merged['Entrydate'] = datetime.now(tz) # timezone-aware timestamp, same value for every row

# adding a unique identifier per row as the first column
df_merged.insert(0,'Id',[uuid.uuid4() for _ in range(len(df_merged))])

In [None]:
df_merged['EntrezGeneID_NCBI_O']=df_merged.EntrezGeneID_NCBI_O.apply(lambda x: str(x))
df_merged['MIMNumber_O']=df_merged.MIMNumber_O.apply(lambda x: str(x))

In [None]:
df_merged.head(200).to_csv('somaticFinal.csv')

# Inserting Somatic Data into Cassandra Database
This step stores the data in a database cluster for persistence and reliability, among other benefits.
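
The insert cells below send rows to the database in batches of 100. As a small, database-free sketch of that batching step, the generator below slices a list of row tuples into fixed-size batches; `batches` and the placeholder rows are hypothetical and only illustrate the slicing.

```python
def batches(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Placeholder rows standing in for the tuples built from df_merged.
rows = [(i, f"gene{i}") for i in range(250)]
sizes = [len(b) for b in batches(rows, 100)]
print(sizes)  # [100, 100, 50]
```

Every row lands in exactly one batch, and only the last batch can be shorter than the requested size.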

In [None]:
import itertools
from multiprocessing import Pool
import sys
import time
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.query import tuple_factory
from cassandra.auth import PlainTextAuthProvider

In [None]:
df_con=pd.read_csv('~/connection_point.csv',header=0) # connection details are read from a local file to add a basic level of security.
#Please note that this file is not uploaded. It is only present on the Jupyter server. 

In [None]:
def _insertData(params):
    # each worker opens its own connection, inserts one batch of rows, then closes the connection
    cluster = Cluster(contact_points=[df_con.ip[0]], auth_provider = \
                      PlainTextAuthProvider(username=df_con.user[0], \
                                            password=df_con.token[0]))
    session = cluster.connect()
    session.set_keyspace('somatic')
    session.row_factory = tuple_factory
    prepared=session.prepare("INSERT INTO somatic.somaticMerged \
                             (id,UniprotID,GeneName,Entry_U,ProteinName_U, \
                             EnsemblGeneID_U,EnsemblGeneID_O,EntrezGeneID_NCBI_O, \
                             MIMNumber_O,MIMEntryType_O,NaturalVariant_U,Mutagenesis_U, \
                             AccessionNumber_C,MutationId_C,MutationDescription_C, \
                             MutationGenomePosition_C,entrydate) \
                             VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")            
    
    #using the datastax driver to run the inserts concurrently
    execute_concurrent_with_args(session, prepared, params, concurrency=50) 
    cluster.shutdown()
    return None

def multiprocess(params):
    # split the rows into batches of 100 and spread the batches over 2 worker processes
    batches = [params[n:n+100] for n in range(0, len(params),100)]
    with Pool(processes=2) as pool:
        results = pool.map(_insertData, batches)
    return results

if __name__ == "__main__":
    # one tuple per row, in the same column order as the INSERT statement (17 values)
    parameters=[tuple(row) for row in df_merged.values]
    a = multiprocess(parameters)

In [None]:
# TODO: Graph
# TODO: ML
