# Capstone 2 – Project Proposal – Somatic Mutation

# 1 Objective: 

The goal is to segment different types of Somatic Germline Mutations in human genes associated with inherited and acquired diseases. It will be a one-stop-shop comprehensive collection of mutation data(Segments) for easy discovery in the era of personalized medicine. As part of this project I would like to find: 

a.	Somatic Germline segmentation

b.	Acquired Disease with maximum number of Somatic Mutation 

c.	Inherited Disease with maximum number of Somatic Mutation

### Outcome: Comprehensive Somatic Mutation Database an invaluable resource for all scientists. 

# 2 Client

#### Will be provided by Sona
Sample: The client for this project is Georgetown University (www.gwu.edu ) and the Bioinformatics Department. The purpose is to find an ML model which can be used to correctly cluster somatic mutation.

# 3 Data source and Credits


We will provided by Sona.

# 4. Solution Approach

The solution plans to use PCA and NMF ML techniques to help with dimension reduction and segmentation of Somatic Mutation. It is specifically designed to address large dataset with biodiversity and quality issues like redundancy, missing, wrong label etc. The solution is sub-divided into three phases as listed below.





### a)Data Assembly - Phase I: 
This phase of the project is designed to gather and do basic cleanup like join, merge, add or update attributes.

### b)Explore and Preprocessing – Phase II: 

This phase of the project is designed to validate and explore the dataset for all the problems listed in the “Problem” section of this proposal. 
### c)Modelling and Evaluation Phase III: 

In this phase of the project will focus on exploring various machine learning algorithms and finding the right hyperparameters to find the best ML model to cluster the Somatic Mutation.  

## 4.1 Data Assembly - Phase I:

OMIM Data identifier is O

UNIPROT data identifier is U

COSMIC data identifier is C
#import warnings
#warnings.simplefilter("ignore", DeprecationWarning)

In [1]:
import pandas as pd
from google.cloud import storage
client = storage.Client()
bucket=client.get_bucket('somatic_germline_mutations')
blob = storage.Blob('mim2gene.txt',bucket)
with open('mim2gene.txt', 'wb') as file_obj:
    blob.download_to_file(file_obj)
df_O=pd.read_csv('mim2gene.txt',sep='\t',skiprows=5, header=None, \
                names=['MIMNumber_O','MIMEntryType_O','EntrezGeneID_NCBI_O','GeneName','EnsemblGeneID_O'])
df_O=df_O[['GeneName','MIMNumber_O','MIMEntryType_O','EntrezGeneID_NCBI_O','EnsemblGeneID_O']] # reordering the columns
df_O.shape

(25471, 5)

### Number of entires in above OMIM list: 25,471 with 5 columns.

We are focusing on those entries which are associted with disease.

In [2]:
df_O=df_O[df_O.GeneName.notnull()] #Removing entries from omim file without a gene name.
df_O.shape

(16021, 5)

### Number of entires in above OMIM list after removing entries without gene name: 16,021 with 5 columns.

## Testing for Gene involved in multiple disorder

In [3]:
df_O[df_O.duplicated(subset='GeneName',keep=False)]

Unnamed: 0,GeneName,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O
2092,IGH,146910,gene,3492.0,
2107,IGH,147010,gene,3492.0,
2114,IGH,147070,gene,3492.0,
6280,ASMT,300015,gene,438.0,ENSG00000196433
6297,ATRX,300032,gene,546.0,ENSG00000085224
6416,SLC25A6,300151,gene,293.0,ENSG00000169100
6427,ASMTL,300162,gene,8623.0,ENSG00000169093
6554,XAGE1E,300289,gene,653067.0,ENSG00000204382
6622,CRLF2,300357,gene,64109.0,ENSG00000205755
6769,ATRX,300504,phenotype,546.0,ENSG00000085224


### Import Uniprot Data.

In [4]:
blob = storage.Blob('uniprot-organismHomosapiens9606.tab',bucket)
with open('uniprot-organismHomosapiens9606.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)

df_U=pd.read_csv('uniprot-organismHomosapiens9606.tab',sep='\t', header=0, \
               names=['Entry_U','ProteinName_U','GeneName','Organism_U','Entryname_U','EnsemblGeneID_U'])

df_U=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U','EnsemblGeneID_U']]
df_U.shape

(161521, 6)

In [5]:
df_U.head() 
#we can see that GeneName is comprising of more than one names. We are only considering the 1st Name as the GeneName

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U
0,TP53 P53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...
1,APP A4 AD1,P05067,Amyloid-beta A4 protein (ABPP) (APPI) (APP) (A...,Homo sapiens (Human),A4_HUMAN,ENST00000346798 [P05067-1];ENST00000348990 [P0...
2,SCN5A,Q14524,Sodium channel protein type 5 subunit alpha (H...,Homo sapiens (Human),SCN5A_HUMAN,ENST00000333535 [Q14524-1];ENST00000423572 [Q1...
3,FBN1 FBN,P35555,Fibrillin-1 [Cleaved into: Asprosin],Homo sapiens (Human),FBN1_HUMAN,ENST00000316623;
4,EGFR ERBB ERBB1 HER1,P00533,Epidermal growth factor receptor (EC 2.7.10.1)...,Homo sapiens (Human),EGFR_HUMAN,ENST00000275493 [P00533-1];ENST00000342916 [P0...


### Number of entires in above Uniprot list: 161,521 with 6 columns.
We only started with 6 columns so that the uniprot list is managable. 

In [6]:
df_U=df_U[df_U.GeneName.notnull()] #Removing entries from Uniprot file without a gene name.
df_U['GeneName']=df_U['GeneName'].apply(lambda x:x.split(' ')[0])
df_U.shape

(139590, 6)

### Number of entires in above Uniprot list after removing entries without gene name: 139,590 with 6 columns.

In [7]:
df_merged=pd.merge(df_U,df_O,how='left', on='GeneName')
df_merged.shape

(139654, 10)

In [8]:
df_merged.head()

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O
0,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510
1,APP,P05067,Amyloid-beta A4 protein (ABPP) (APPI) (APP) (A...,Homo sapiens (Human),A4_HUMAN,ENST00000346798 [P05067-1];ENST00000348990 [P0...,104760.0,gene,351.0,ENSG00000142192
2,SCN5A,Q14524,Sodium channel protein type 5 subunit alpha (H...,Homo sapiens (Human),SCN5A_HUMAN,ENST00000333535 [Q14524-1];ENST00000423572 [Q1...,600163.0,gene,6331.0,ENSG00000183873
3,FBN1,P35555,Fibrillin-1 [Cleaved into: Asprosin],Homo sapiens (Human),FBN1_HUMAN,ENST00000316623;,134797.0,gene,2200.0,ENSG00000166147
4,EGFR,P00533,Epidermal growth factor receptor (EC 2.7.10.1)...,Homo sapiens (Human),EGFR_HUMAN,ENST00000275493 [P00533-1];ENST00000342916 [P0...,131550.0,gene,1956.0,ENSG00000146648


### Number of entires after merging OMIM and Uniprot list : 139,654 with 10 columns.

We are now focusing on Genes having known MIM IDs(Genes known to have involment in disease). We have filtered out the entires without any MIM ID.

In [9]:
df_merged=df_merged.dropna(subset=['MIMNumber_O'])
df_merged.shape
#df_merged[df_merged.MIMNumber_O.notnull()]
#df_O=df_O.dropna(subset=['GeneName'])
#df_merged[df_merged['GeneName'] =='IGH']

(106097, 10)

# Number of entires in the merged list after removing entries without MIM ids: 106,097 with 10 columns.

In [10]:
df_merged[df_merged['GeneName'] =='TP53']#.to_csv('xx.csv') # checking one sample gene

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O
0,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510
21697,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
23178,TP53,K7PPA8,Cellular tumor antigen p53,Homo sapiens (Human),K7PPA8_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
23835,TP53,A0A1B1PFD4,p53 (Fragment),Homo sapiens (Human),A0A1B1PFD4_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
24022,TP53,A2I9Y7,Tumor protein p53 (p53) (Fragment),Homo sapiens (Human),A2I9Y7_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
24079,TP53,S5LJ61,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),S5LJ61_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
24225,TP53,A4GW67,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A4GW67_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
29721,TP53,Q1HGV3,Mutant p53 tumor suppressor (Fragment),Homo sapiens (Human),Q1HGV3_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
29855,TP53,L0EQE2,Tumor suppressor p53 (Fragment),Homo sapiens (Human),L0EQE2_HUMAN,,191170.0,gene,7157.0,ENSG00000141510
30088,TP53,L0ES54,Tumor suppressor p53 (Fragment),Homo sapiens (Human),L0ES54_HUMAN,,191170.0,gene,7157.0,ENSG00000141510


### Assumption and Consideration:
Please note that Uniprot is considered as the master list and merge with the OMIM list.

We can see that the numner of entries increased from the original Uniprot. Possible reasons(need investigation):

1) We may have GeneName in OMIM not present Uniprot list

2) There can be duplicate entries

In [None]:
blob = storage.Blob('CosmicCompleteTargetedScreensMutantExport.tsv',bucket)
with open('CosmicCompleteTargetedScreensMutantExport.tsv', 'wb') as file_obj:
    blob.download_to_file(file_obj)

## Note:
Import mutation dataset from COSMIC database. It provides a tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set. The name of the file is 'CosmicCompleteTargetedScreensMutantExport.tsv.gz'

We will only consider those entires which have Mutation IDs.

In [11]:
import pandas as pd
chunksize = 10 ** 6
cosmic_C=pd.DataFrame()
colnames=["GeneName","AccessionNumber","GeneCDSlength","HGNCid","SampleName","SampleId", \
          "IdTumour", "PrimarySite","SiteSubtype1","SiteSubtype2","SiteSubtype3", \
          "PrimaryHistology","HistologySubtype1", "HistologySubtype2","HistologySubtype3", \
          "GenomeWideScreen","MutationId","MutationCDS","MutationAA", "MutationDescription", \
          "MutationZygosity","LOH","GRCh","MutationGenomePosition","MutationStrand", 
          "SNP","ResistanceMutation","FATHMMPrediction","FATHMMScore","MutationSomaticStatus", \
          "Pubmed_PMID", "IdStudy","SampleSource","TumourOrigin","Age"]

for chunk in pd.read_csv('CosmicCompleteTargetedScreensMutantExport.tsv',sep='\t',header=0, \
                         names=colnames,low_memory=False, \
                 dtype={"GeneName":object,"AccessionNumber":object,"GeneCDSlength":object, \
                        "HGNCid":object,"SampleName":object,"SampleId":object,"IdTumour":object, \
                        "PrimarySite":object,"SiteSubtype1":object,"SiteSubtype2":object, \
                        "SiteSubtype3":object,"PrimaryHistology":object,"HistologySubtype1":object, \
                        "HistologySubtype2":object,"HistologySubtype3":object,"GenomeWideScreen":object, \
                        "MutationId":object,"MutationCDS":object,"MutationAA":object, \
                        "MutationDescription":object, "MutationZygosity":object,"LOH":object,"GRCh":object, \
                        "MutationGenomePosition":object, "MutationStrand":object,"SNP":object, \
                        "ResistanceMutation":object,"FATHMMPrediction":object, 
                        "FATHMMScore":object,"MutationSomaticStatus":object,"Pubmed_PMID":object, \
                        "IdStudy":object,"SampleSource":object,"TumourOrigin":object,"Age":object}, \
                 chunksize=chunksize
                ):
    cosmic_C=chunk[['GeneName','AccessionNumber','MutationId','MutationDescription', \
                  'MutationGenomePosition']].loc[chunk.MutationId.notnull()]
                  #selecting only with mutation id from Cosmic file

In [12]:
cosmic_C.shape

(13155, 5)

In [13]:
cosmic_C.head()

Unnamed: 0,GeneName,AccessionNumber,MutationId,MutationDescription,MutationGenomePosition
6000185,EGFR,ENST00000275493,COSM12979,Substitution - Missense,
6000186,EGFR,ENST00000275493,COSM13243,Deletion - In frame,
6000187,PIK3CA,NM_006218.1,COSM775,Substitution - Missense,3:179234297-179234297
6000188,FGFR3,ENST00000440486,COSM35896,Substitution - Missense,
6000189,IDH2,ENST00000330062,COSM1685352,Unknown,


## Number of entires in the COSMIC list after removing entires without mutation IDs: 13,155 with 5 columns.

In [14]:
df_merged=pd.merge(df_merged,cosmic_C,how='left', on='GeneName')
df_merged.shape

(503106, 14)

### Number of entires in the merged list after including COSMIC dataset: 503,106 with 14 columns.

In [15]:
df_merged[df_merged['GeneName'] =='TP53'] # checking one sample gene. We can see lot of duplicate entries in the list.

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O,AccessionNumber,MutationId,MutationDescription,MutationGenomePosition
0,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43699,Substitution - Missense,17:7674194-7674194
1,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM44555,Substitution - Missense,17:7675176-7675176
2,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM11307,Substitution - Missense,17:7674888-7674888
3,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43751,Unknown,17:7674291-7674291
4,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10991,Substitution - Missense,17:7675216-7675216
5,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM11354,Substitution - Nonsense,17:7673537-7673537
6,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10648,Substitution - Missense,17:7675088-7675088
7,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10911,Substitution - Missense,17:7673773-7673773
8,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10722,Substitution - Missense,17:7673767-7673767
9,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM13120,Deletion - Frameshift,17:7674904-7674905


In [16]:
df_merged.head()

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O,AccessionNumber,MutationId,MutationDescription,MutationGenomePosition
0,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43699,Substitution - Missense,17:7674194-7674194
1,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM44555,Substitution - Missense,17:7675176-7675176
2,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM11307,Substitution - Missense,17:7674888-7674888
3,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43751,Unknown,17:7674291-7674291
4,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,Homo sapiens (Human),P53_HUMAN,ENST00000269305 [P04637-1];ENST00000420246 [P0...,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10991,Substitution - Missense,17:7675216-7675216


# SNP and Mutation from Uniprot

In [17]:
blob = storage.Blob('uniprotOrganismHomosapiens9606.tab',bucket)
with open('uniprotOrganismHomosapiens9606.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)
import pandas as pd
df_snp_u=pd.read_csv('uniprotOrganismHomosapiens9606.tab',sep='\t', header=0, \
               names=['Entry','GeneName','NaturalVariant','Mutagenesis'],
                    dtype={'Entry':object,'GeneName':object,'NaturalVariant':object,'Mutagenesis':object}
                    )   

In [18]:
df_snp_u.shape

(161521, 4)

In [19]:
df_snp_u=df_snp_u[df_snp_u.GeneName.notnull()] #Removing entries from Uniprot file without a gene name.
df_snp_u['GeneName']=df_snp_u['GeneName'].apply(lambda x:x.split(' ')[0])
df_snp_u.head()

Unnamed: 0,Entry,GeneName,NaturalVariant,Mutagenesis
0,P04637,TP53,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...
1,P05067,APP,VARIANT 501 501 E -> K (in dbSNP:rs45588932). ...,MUTAGEN 99 102 KRGR->NQGG: Reduced heparin-bin...
2,Q14524,SCN5A,VARIANT 9 9 G -> V (in LQT3; dbSNP:rs199473043...,MUTAGEN 1476 1476 Q->K: Induces accelerated re...
3,P35555,FBN1,VARIANT 20 20 Y -> C (in MFS). {ECO:0000269|Pu...,MUTAGEN 1542 1542 G->D: Loss of integrin-media...
4,P00533,EGFR,VARIANT 30 297 Missing (variant EGFR vIII; fou...,MUTAGEN 275 275 Y->A: Strongly reduced autopho...


In [20]:
#Removing entries from omim file without a gene name.
df_snp_u=df_snp_u[(df_snp_u.NaturalVariant.notnull()) | (df_snp_u.Mutagenesis.notnull())] 
#df_snp_u.shape
df_snp_u.sample(17)

Unnamed: 0,Entry,GeneName,NaturalVariant,Mutagenesis
1293,Q9H0Y0,ATG10,VARIANT 62 62 S -> P (in dbSNP:rs3734114). /FT...,
3138,Q86Y33,CDC20B,VARIANT 8 8 T -> P (in dbSNP:rs173042). /FTId=...,
15591,P48061,CXCL12,,MUTAGEN 22 23 Missing: Abolished CXCR4 activat...
15318,Q8N2Y8,RUSC2,VARIANT 73 73 T -> A (in dbSNP:rs1535422). {EC...,
18539,Q13488,TCIRG1,VARIANT 56 56 R -> W (in dbSNP:rs36027301). /F...,
10638,A6NI61,MYMK,VARIANT 91 91 P -> T (in CFZS; unknown patholo...,
1870,Q8NEM8,AGBL3,VARIANT 45 45 F -> Y (in dbSNP:rs2348049). /FT...,
11394,Q9BZQ8,FAM129A,VARIANT 633 633 S -> L (in dbSNP:rs12750174). ...,
13093,A1L4L8,PLAC8L1,VARIANT 11 11 C -> S (in dbSNP:rs12187913). /F...,
15375,Q9HBV2,SPACA1,VARIANT 237 237 L -> S (in dbSNP:rs2276089). /...,


In [21]:
df_snp_u[df_snp_u['GeneName'] =='TP53']

Unnamed: 0,Entry,GeneName,NaturalVariant,Mutagenesis
0,P04637,TP53,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...


In [22]:
df_merged=pd.merge(df_merged,df_snp_u,how='left', on='GeneName')
df_merged.shape

(4464294, 17)

## Number of entires in the merged list after including SNP and Mutations from Uniprot dataset: 4,464,294 with 17 columns.¶

In [23]:
df_merged.columns=['GeneName', 'Entry_U', 'ProteinName_U', 'Organism_U', 'Entryname_U',
       'EnsemblGeneID_U', 'MIMNumber_O', 'MIMEntryType_O',
       'EntrezGeneID_NCBI_O', 'EnsemblGeneID_O', 'AccessionNumber_C',
       'MutationId_C', 'MutationDescription_C', 'MutationGenomePosition_C', 'UniprotID',
       'NaturalVariant_U', 'Mutagenesis_U']

In [24]:
df_merged=df_merged[['UniprotID','GeneName', 'Entry_U', 'ProteinName_U','EnsemblGeneID_U', \
                     'EnsemblGeneID_O','EntrezGeneID_NCBI_O','MIMNumber_O', 'MIMEntryType_O', \
                     'NaturalVariant_U','Mutagenesis_U','AccessionNumber_C','MutationId_C', \
                     'MutationDescription_C', 'MutationGenomePosition_C']]

In [25]:
df_merged[df_merged['GeneName'] =='TP53']

Unnamed: 0,UniprotID,GeneName,Entry_U,ProteinName_U,EnsemblGeneID_U,EnsemblGeneID_O,EntrezGeneID_NCBI_O,MIMNumber_O,MIMEntryType_O,NaturalVariant_U,Mutagenesis_U,AccessionNumber_C,MutationId_C,MutationDescription_C,MutationGenomePosition_C
0,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM43699,Substitution - Missense,17:7674194-7674194
1,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM44555,Substitution - Missense,17:7675176-7675176
2,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM11307,Substitution - Missense,17:7674888-7674888
3,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM43751,Unknown,17:7674291-7674291
4,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM10991,Substitution - Missense,17:7675216-7675216
5,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM11354,Substitution - Nonsense,17:7673537-7673537
6,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM10648,Substitution - Missense,17:7675088-7675088
7,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM10911,Substitution - Missense,17:7673773-7673773
8,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM10722,Substitution - Missense,17:7673767-7673767
9,P04637,TP53,P04637,Cellular tumor antigen p53 (Antigen NY-CO-13) ...,ENST00000269305 [P04637-1];ENST00000420246 [P0...,ENSG00000141510,7157.0,191170.0,gene,VARIANT 5 5 Q -> H (in a sporadic cancer; soma...,MUTAGEN 15 15 S->A: Loss of interaction with P...,ENST00000269305,COSM13120,Deletion - Frameshift,17:7674904-7674905


In [26]:
df_merged.columns

Index(['UniprotID', 'GeneName', 'Entry_U', 'ProteinName_U', 'EnsemblGeneID_U',
       'EnsemblGeneID_O', 'EntrezGeneID_NCBI_O', 'MIMNumber_O',
       'MIMEntryType_O', 'NaturalVariant_U', 'Mutagenesis_U',
       'AccessionNumber_C', 'MutationId_C', 'MutationDescription_C',
       'MutationGenomePosition_C'],
      dtype='object')

In [27]:
import datetime as dt
from datetime import datetime
from pytz import timezone

import uuid

tz = timezone('EST') # adding time zone info
datetime.now(tz) 
df_merged['Entrydate'] = dt.datetime.now()

df_merged.insert(0,'Id',uuid.uuid4()) 
df_merged.Id= df_merged.Id.apply(lambda x: uuid.uuid4()) # adding unique identifier

In [28]:
df_merged['EntrezGeneID_NCBI_O']=df_merged.EntrezGeneID_NCBI_O.apply(lambda x: str(x))
df_merged['MIMNumber_O']=df_merged.MIMNumber_O.apply(lambda x: str(x))

In [31]:
df_merged.head(200).to_csv('somaticFinal.csv')

# Inserting Somatic Data into Cassandra Database
The step is done for persistence and reliability among other benefits using a database cluster.

In [None]:
import itertools
from multiprocessing import Pool
import sys
import time
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.query import tuple_factory
from cassandra.auth import PlainTextAuthProvider

In [None]:
df_con=pd.read_csv('~/connection_point.csv',header=0) # this is done to make add basic level of security. 
#Please note that this file is not uploaded. It is only present in the Jupyter server. 

In [None]:
def _insertData(params):
    cluster = Cluster(contact_points=[df_con.ip[0]], auth_provider = \
                      PlainTextAuthProvider(username=df_con.user[0], \
                                            password=df_con.token[0]))
    session = cluster.connect()
    session.set_keyspace('somatic')
    session.row_factory = tuple_factory
    prepared=session.prepare("INSERT INTO TABLE somatic.somaticMerged \
                             (id,UniprotID,GeneName,Entry_U,ProteinName_U, \
                             EnsemblGeneID_U,EnsemblGeneID_O,EntrezGeneID_NCBI_O, \
                             MIMNumber_O,MIMEntryType_O,NaturalVariant_U,Mutagenesis_U, \
                             AccessionNumber_C,MutationId_C,MutationDescription_C, \
                             MutationGenomePosition_C,entrydate) \
                             VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")            
    
    #using datastax driver for multiprocessing 
    execute_concurrent_with_args(session, prepared, params, concurrency=50) 
    return None

def multiprocess(params):
    pool = Pool(processes=2)
    results = [pool.map(_insertData, (params[n:n+100],)) for n in range(0, len(params),100)]
    return results

if __name__ == "__main__":
    parameters=[]
    for index, row in enumerate(df_merged.values):        
        (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q) = row
        row1=(a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q)
        parameters.append(row1)           
    a = multiprocess(parameters)