# Capstone 2 – Project Proposal – Somatic Mutation

# 1 Objective: 

The goal is to segment different types of Somatic Germline Mutations in human genes associated with inherited and acquired diseases. It will be a one-stop-shop comprehensive collection of mutation data(Segments) for easy discovery in the era of personalized medicine. As part of this project I would like to find: 

a.	Somatic Germline segmentation

b.	Acquired Disease with maximum number of Somatic Mutation 

c.	Inherited Disease with maximum number of Somatic Mutation

### Outcome: Comprehensive Somatic Mutation Database an invaluable resource for all scientists. 

# 2 Client

#### Will be provided by Sona
Sample: The client for this project is Georgetown University (www.gwu.edu ) and the Bioinformatics Department. The purpose is to find an ML model which can be used to correctly cluster somatic mutation.

# 3 Data source and Credits


We will provided by Sona.

# 4. Solution Approach

The solution plans to use PCA and NMF ML techniques to help with dimension reduction and segmentation of Somatic Mutation. It is specifically designed to address large dataset with biodiversity and quality issues like redundancy, missing, wrong label etc. The solution is sub-divided into three phases as listed below.





### a)Data Assembly - Phase I: 
This phase of the project is designed to gather and do basic cleanup like join, merge, add or update attributes.

### b)Explore and Preprocessing – Phase II: 

This phase of the project is designed to validate and explore the dataset for all the problems listed in the “Problem” section of this proposal. 
### c)Modelling and Evaluation Phase III: 

In this phase of the project will focus on exploring various machine learning algorithms and finding the right hyperparameters to find the best ML model to cluster the Somatic Mutation.  

## 4.1 Data Assembly - Phase I:

OMIM Data identifier is O

UNIPROT data identifier is U

COSMIC data identifier is C
#import warnings
#warnings.simplefilter("ignore", DeprecationWarning)

In [2]:
import pandas as pd
from google.cloud import storage
client = storage.Client()
bucket=client.get_bucket('somatic_germline_mutations')
blob = storage.Blob('mim2gene.txt',bucket)
with open('mim2gene.txt', 'wb') as file_obj:
    blob.download_to_file(file_obj)
df_O=pd.read_csv('mim2gene.txt',sep='\t',skiprows=5, header=None, \
                names=['MIMNumber_O','MIMEntryType_O','EntrezGeneID_NCBI_O','GeneName','EnsemblGeneID_O'])
df_O=df_O[['GeneName','MIMNumber_O','MIMEntryType_O','EntrezGeneID_NCBI_O','EnsemblGeneID_O']] # reordering the columns
df_O.shape

(25471, 5)

### Number of entires in above OMIM list: 25,471 with 5 columns.

We are focusing on those entries which are associted with disease.

In [3]:
df_O=df_O[df_O.GeneName.notnull()] #Removing entries from omim file without a gene name.
df_O.shape

(16021, 5)

### Number of entires in above OMIM list after removing entries without gene name: 16,021 with 5 columns.

## Testing for Gene involved in multiple disorder

In [4]:
df_O[df_O.duplicated(subset='GeneName',keep=False)]

Unnamed: 0,GeneName,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O
2092,IGH,146910,gene,3492.0,
2107,IGH,147010,gene,3492.0,
2114,IGH,147070,gene,3492.0,
6280,ASMT,300015,gene,438.0,ENSG00000196433
6297,ATRX,300032,gene,546.0,ENSG00000085224
6416,SLC25A6,300151,gene,293.0,ENSG00000169100
6427,ASMTL,300162,gene,8623.0,ENSG00000169093
6554,XAGE1E,300289,gene,653067.0,ENSG00000204382
6622,CRLF2,300357,gene,64109.0,ENSG00000205755
6769,ATRX,300504,phenotype,546.0,ENSG00000085224


### Import Uniprot Data.

In [5]:
blob = storage.Blob('uniprot-organismHomosapiens9606.tab',bucket)
with open('uniprot-organismHomosapiens9606.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)

df_U=pd.read_csv('uniprot-organismHomosapiens9606.tab',sep='\t', header=0, \
               names=['Entry_U','ProteinName_U','GeneName','Organism_U','Entryname_U','EnsemblGeneID_U'])

df_U=df_U[['GeneName','Entry_U','ProteinName_U','Organism_U','Entryname_U','EnsemblGeneID_U']]
df_U.shape

(161521, 6)

### Number of entires in above Uniprot list: 161,521 with 6 columns.
We only started with 6 columns so that the uniprot list is managable. 

In [6]:
df_U=df_U[df_U.GeneName.notnull()] #Removing entries from Uniprot file without a gene name.
df_U.shape

(139590, 6)

### Number of entires in above Uniprot list after removing entries without gene name: 139,590 with 6 columns.

In [7]:
df_merged=pd.merge(df_U,df_O,how='left', on='GeneName')
df_merged.shape

(139636, 10)

### Number of entires after merging OMIM and Uniprot list : 139,636 with 10 columns.¶¶

We are now focusing on Genes having known MIM IDs(Genes known to have involment in disease). We have filtered out the entires without any MIM ID.

In [8]:
df_merged=df_merged.dropna(subset=['MIMNumber_O'])
df_merged.shape

#df_merged[df_merged.MIMNumber_O.notnull()]
#df_O=df_O.dropna(subset=['GeneName'])
#df_merged[df_merged['GeneName'] =='IGH']

(89331, 10)

### Number of entires in the merged list after removing entries without MIM ids: 89,331 with 10 columns.¶

In [15]:
df_merged[df_merged['GeneName'] =='TP53'].to_csv('xx.csv') # checking one sample gene

### Assumption and Consideration:
Please note that Uniprot is considered as the master list and merge with the OMIM list.

We can see that the numner of entries increased from the original Uniprot. Possible reasons(need investigation):

1) We may have GeneName in OMIM not present Uniprot list

2) There can be duplicate entries

In [None]:
blob = storage.Blob('CosmicCompleteTargetedScreensMutantExport.tsv',bucket)
with open('CosmicCompleteTargetedScreensMutantExport.tsv', 'wb') as file_obj:
    blob.download_to_file(file_obj)

## Note:
Import mutation dataset from COSMIC database. It provides a tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set. The name of the file is 'CosmicCompleteTargetedScreensMutantExport.tsv.gz'

We will only consider those entires which have Mutation IDs.

In [10]:
import pandas as pd
chunksize = 10 ** 6
cosmic_C=pd.DataFrame()
cols=[0,1,16,19,23]
colnames=["GeneName","AccessionNumber","GeneCDSlength","HGNCid","SampleName","SampleId", \
          "IdTumour", "PrimarySite","SiteSubtype1","SiteSubtype2","SiteSubtype3", \
          "PrimaryHistology","HistologySubtype1", "HistologySubtype2","HistologySubtype3", \
          "GenomeWideScreen","MutationId","MutationCDS","MutationAA", "MutationDescription", \
          "MutationZygosity","LOH","GRCh","MutationGenomePosition","MutationStrand", 
          "SNP","ResistanceMutation","FATHMMPrediction","FATHMMScore","MutationSomaticStatus", \
          "Pubmed_PMID", "IdStudy","SampleSource","TumourOrigin","Age"]

for chunk in pd.read_csv('CosmicCompleteTargetedScreensMutantExport.tsv',sep='\t',header=0, \
                         names=colnames,low_memory=False, \
                 dtype={"GeneName":object,"AccessionNumber":object,"GeneCDSlength":object, \
                        "HGNCid":object,"SampleName":object,"SampleId":object,"IdTumour":object, \
                        "PrimarySite":object,"SiteSubtype1":object,"SiteSubtype2":object, \
                        "SiteSubtype3":object,"PrimaryHistology":object,"HistologySubtype1":object, \
                        "HistologySubtype2":object,"HistologySubtype3":object,"GenomeWideScreen":object, \
                        "MutationId":object,"MutationCDS":object,"MutationAA":object, \
                        "MutationDescription":object, "MutationZygosity":object,"LOH":object,"GRCh":object, \
                        "MutationGenomePosition":object, "MutationStrand":object,"SNP":object, \
                        "ResistanceMutation":object,"FATHMMPrediction":object, 
                        "FATHMMScore":object,"MutationSomaticStatus":object,"Pubmed_PMID":object, \
                        "IdStudy":object,"SampleSource":object,"TumourOrigin":object,"Age":object}, \
                 chunksize=chunksize
                ):
    cosmic_C=chunk[['GeneName','AccessionNumber','MutationId','MutationDescription', \
                  'MutationGenomePosition']].loc[chunk.MutationId.notnull()]
                  #selecting only with mutation id from Cosmic file

In [11]:
cosmic_C.shape

(13155, 5)

## Number of entires in the COSMIC list after removing entires without mutation IDs: 13,155 with 5 columns.

In [12]:
df_merged=pd.merge(df_merged,cosmic_C,how='left', on='GeneName')
df_merged.shape

(468073, 14)

### Number of entires in the merged list after including COSMIC dataset: 468,073 with 14 columns.

In [14]:
df_merged[df_merged['GeneName'] =='TP53'] # checking one sample gene. We can see lot of duplicate entries in the list.

Unnamed: 0,GeneName,Entry_U,ProteinName_U,Organism_U,Entryname_U,EnsemblGeneID_U,MIMNumber_O,MIMEntryType_O,EntrezGeneID_NCBI_O,EnsemblGeneID_O,AccessionNumber,MutationId,MutationDescription,MutationGenomePosition
7925,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43699,Substitution - Missense,17:7674194-7674194
7926,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM44555,Substitution - Missense,17:7675176-7675176
7927,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM11307,Substitution - Missense,17:7674888-7674888
7928,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM43751,Unknown,17:7674291-7674291
7929,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10991,Substitution - Missense,17:7675216-7675216
7930,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM11354,Substitution - Nonsense,17:7673537-7673537
7931,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10648,Substitution - Missense,17:7675088-7675088
7932,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10911,Substitution - Missense,17:7673773-7673773
7933,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM10722,Substitution - Missense,17:7673767-7673767
7934,TP53,A0A218MJD5,Cellular tumor antigen p53 (Fragment),Homo sapiens (Human),A0A218MJD5_HUMAN,,191170.0,gene,7157.0,ENSG00000141510,ENST00000269305,COSM13120,Deletion - Frameshift,17:7674904-7674905


We can see that TP53 does not have an OMIM ID in the list provided by Sona. Howevere, when we check the OMIM database we found that TP53 has OMIM ID. We need to now double check the source of the OMIM list. 

In [None]:
blob = storage.Blob('uniprot-organismHomoSapiens9606_snp_mut.tab',bucket)
with open('uniprot-organismHomoSapiens9606_snp_mut.tab', 'wb') as file_obj:
    blob.download_to_file(file_obj)

df_snp_u=pd.read_csv('uniprot-organismHomoSapiens9606_snp_mut.tab',sep='\t', header=0, \
               names=['Entry','EntryName','ProteinName','GeneName','NaturalVariant','Mutagenesis'],
                    dtype={'Entry':object,'EntryName':object,'ProteinName':object, \
                        'GeneName':object,'NaturalVariant':object,'Mutagenesis':object}
                    )   