Human gene research is moving fast. How do you summarize all of these discoveries objectively so they can be used in nutrition (nutrigenomics) or healthcare? How do you gather enough evidence to point the way toward needed medical research?

Using gene definitions and abstracts from NCBI, I'd like to make a gene calculator that knows the language of gene research and can tell you which genes are most related to a health disorder. There are many conditions for which the medical field has no answers, which makes diet and nutrition an important part of improving quality of life. Gene mutations often have very specific roles in metabolism, and understanding these roles can lead the way to better health.


Using technical gene definitions and abstracts from the NCBI website that contain both gene and medical information, I'm going to use Word2Vec similarity to match gene abbreviations with medical disorders. First I'll create a dataframe with tokenized strings. I'll use lemmas but I'll keep stop words. Then I'll train the Word2Vec model and see if I can match gene acronyms to diseases that have varying numbers of mentions in the dataset. For each disorder I'll look at the top-scoring matches (the 8 highest) and check whether the disease keyword is mentioned in the matching gene's text. I'll optimize the model parameters by retraining the model many times on random splits made with np.random and adjusting the parameters to find the best matches.

In [60]:
import re
import distance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import AffinityPropagation as Aff
%matplotlib inline

First I'll import and clean my datasets.

In [61]:
# Import, clean and label datasets
abstracts = pd.read_csv('genesdf')
abstracts.head()

Unnamed: 0.1,Unnamed: 0,symbol,blurb,match
0,0,TP53,This gene encodes a tumor suppressor protein c...,0
1,1,EGFR,The protein encoded by this gene is a transmem...,0
2,2,TNF,This gene encodes a multifunctional proinflamm...,0
3,3,APOE,The protein encoded by this gene is a major ap...,0
4,4,VEGFA,This gene is a member of the PDGF/VEGF growth ...,0


In [62]:
abstracts['abstract'] = 1

In [63]:
genes = pd.read_csv('genesdf2')
genes['med']=0
genes = genes.drop_duplicates(keep='first')
genes.head()

Unnamed: 0.1,Unnamed: 0,acros,blurb2,match,med
0,0,TP53,Official Symbol- TP53 and Name: tumor protein ...,0,0
1,1,EGFR,Official Symbol- EGFR and Name: epidermal grow...,0,0
2,2,TNF,Official Symbol- TNF and Name: tumor necrosis ...,0,0
3,3,APOE,Official Symbol- APOE and Name: apolipoprotein...,0,0
4,4,VEGFA,Official Symbol- VEGFA and Name: vascular endo...,0,0


In [64]:
genes['symbols'] = genes['acros'].apply(lambda x: (str(x).upper()).strip())
genes = genes.drop(columns =['acros'])

In [65]:
abstracts['med']=0
abstracts['symbols'] = abstracts['symbol'].apply(lambda x: (str(x).upper()).strip())
abstracts = abstracts.drop(columns =['symbol'])

In [66]:
#Add acronym to abstracts text
abstracts['blurbs'] = abstracts['symbols']+','+abstracts['blurb']
abstracts.head()

Unnamed: 0.1,Unnamed: 0,blurb,match,abstract,med,symbols,blurbs
0,0,This gene encodes a tumor suppressor protein c...,0,1,0,TP53,"TP53,This gene encodes a tumor suppressor prot..."
1,1,The protein encoded by this gene is a transmem...,0,1,0,EGFR,"EGFR,The protein encoded by this gene is a tra..."
2,2,This gene encodes a multifunctional proinflamm...,0,1,0,TNF,"TNF,This gene encodes a multifunctional proinf..."
3,3,The protein encoded by this gene is a major ap...,0,1,0,APOE,"APOE,The protein encoded by this gene is a maj..."
4,4,This gene is a member of the PDGF/VEGF growth ...,0,1,0,VEGFA,"VEGFA,This gene is a member of the PDGF/VEGF g..."


In [67]:
abstracts['blurbs2'] = abstracts['blurbs'].apply(lambda x: str(x).split(',',1))
abstracts.head()

Unnamed: 0.1,Unnamed: 0,blurb,match,abstract,med,symbols,blurbs,blurbs2
0,0,This gene encodes a tumor suppressor protein c...,0,1,0,TP53,"TP53,This gene encodes a tumor suppressor prot...","[TP53, This gene encodes a tumor suppressor pr..."
1,1,The protein encoded by this gene is a transmem...,0,1,0,EGFR,"EGFR,The protein encoded by this gene is a tra...","[EGFR, The protein encoded by this gene is a t..."
2,2,This gene encodes a multifunctional proinflamm...,0,1,0,TNF,"TNF,This gene encodes a multifunctional proinf...","[TNF, This gene encodes a multifunctional proi..."
3,3,The protein encoded by this gene is a major ap...,0,1,0,APOE,"APOE,The protein encoded by this gene is a maj...","[APOE, The protein encoded by this gene is a m..."
4,4,This gene is a member of the PDGF/VEGF growth ...,0,1,0,VEGFA,"VEGFA,This gene is a member of the PDGF/VEGF g...","[VEGFA, This gene is a member of the PDGF/VEGF..."


In [68]:
def add_symbol(words_list,replace,texts):
    # Replace every phrase in words_list with the replacement string.
    for words in words_list:
        texts = texts.replace(words,replace)
    return texts

In [69]:
abstracts['blurbs3'] = abstracts['blurbs2'].apply(lambda x: add_symbol(['This gene','this gene'],x[0]+' gene',x[1]) if len(x)==2 else np.nan)
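
To make the substitution concrete, here is a small standalone sketch of the same replacement applied to one (symbol, text) pair. The sample sentence is made up for illustration and is not taken from the dataset.

```python
def add_symbol(words_list, replace, texts):
    """Replace every phrase in words_list with `replace` inside texts."""
    for words in words_list:
        texts = texts.replace(words, replace)
    return texts

# Illustrative input shaped like one entry of abstracts['blurbs2'].
pair = ['TP53', 'This gene encodes a protein. Mutations in this gene are common.']

result = add_symbol(['This gene', 'this gene'], pair[0] + ' gene', pair[1])
print(result)
```

Putting the symbol into the sentence gives Word2Vec a chance to see the gene acronym next to the words that describe it.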

In [70]:
abstracts.head()

Unnamed: 0.1,Unnamed: 0,blurb,match,abstract,med,symbols,blurbs,blurbs2,blurbs3
0,0,This gene encodes a tumor suppressor protein c...,0,1,0,TP53,"TP53,This gene encodes a tumor suppressor prot...","[TP53, This gene encodes a tumor suppressor pr...",TP53 gene encodes a tumor suppressor protein c...
1,1,The protein encoded by this gene is a transmem...,0,1,0,EGFR,"EGFR,The protein encoded by this gene is a tra...","[EGFR, The protein encoded by this gene is a t...",The protein encoded by EGFR gene is a transmem...
2,2,This gene encodes a multifunctional proinflamm...,0,1,0,TNF,"TNF,This gene encodes a multifunctional proinf...","[TNF, This gene encodes a multifunctional proi...",TNF gene encodes a multifunctional proinflamma...
3,3,The protein encoded by this gene is a major ap...,0,1,0,APOE,"APOE,The protein encoded by this gene is a maj...","[APOE, The protein encoded by this gene is a m...",The protein encoded by APOE gene is a major ap...
4,4,This gene is a member of the PDGF/VEGF growth ...,0,1,0,VEGFA,"VEGFA,This gene is a member of the PDGF/VEGF g...","[VEGFA, This gene is a member of the PDGF/VEGF...",VEGFA gene is a member of the PDGF/VEGF growth...


In [71]:
#prepare to combine datasets
abstracts = abstracts.drop(columns = ['blurb','blurbs','blurbs2'])
abstracts.head()

Unnamed: 0.1,Unnamed: 0,match,abstract,med,symbols,blurbs3
0,0,0,1,0,TP53,TP53 gene encodes a tumor suppressor protein c...
1,1,0,1,0,EGFR,The protein encoded by EGFR gene is a transmem...
2,2,0,1,0,TNF,TNF gene encodes a multifunctional proinflamma...
3,3,0,1,0,APOE,The protein encoded by APOE gene is a major ap...
4,4,0,1,0,VEGFA,VEGFA gene is a member of the PDGF/VEGF growth...


In [72]:
abstracts = abstracts.rename(columns ={'blurbs3':'blurb2'})
abstracts.head()

Unnamed: 0.1,Unnamed: 0,match,abstract,med,symbols,blurb2
0,0,0,1,0,TP53,TP53 gene encodes a tumor suppressor protein c...
1,1,0,1,0,EGFR,The protein encoded by EGFR gene is a transmem...
2,2,0,1,0,TNF,TNF gene encodes a multifunctional proinflamma...
3,3,0,1,0,APOE,The protein encoded by APOE gene is a major ap...
4,4,0,1,0,VEGFA,VEGFA gene is a member of the PDGF/VEGF growth...


Next I'll combine the gene abstracts and gene definitions into one dataframe.

In [73]:
#Combine datasets and clean the data more
genes = genes.append(abstracts,ignore_index = True)
genes = genes.drop(columns = ['Unnamed: 0'])
genes.head()

Unnamed: 0,abstract,blurb2,match,med,symbols
0,,Official Symbol- TP53 and Name: tumor protein ...,0,0,TP53
1,,Official Symbol- EGFR and Name: epidermal grow...,0,0,EGFR
2,,Official Symbol- TNF and Name: tumor necrosis ...,0,0,TNF
3,,Official Symbol- APOE and Name: apolipoprotein...,0,0,APOE
4,,Official Symbol- VEGFA and Name: vascular endo...,0,0,VEGFA


In [74]:
genes['blurb3'] = genes['blurb2'].apply(lambda x: str(x).replace('Official Symbol-',''))

In [75]:
genes['blurb4'] = genes['blurb3'].apply(lambda x: str(x).replace('and Name:','is'))

In [76]:
genes['blurb5'] = genes['blurb4'].apply(lambda x: str(x).replace('Other Aliases:','also'))

In [77]:
genes['blurb6'] = genes['blurb5'].apply(lambda x: str(x).replace('Other Designations:','It is'))

In [78]:
genes['blurb7'] = genes['blurb6'].apply(lambda x: str(x).replace('[Homo sapiens (human)]','(human)'))

In [79]:
genes['blurb8'] = genes['blurb7'].apply(lambda x: str(x).replace('Other Aliases-','and'))

In [80]:
genes = genes.drop(columns = ['blurb2','blurb3','blurb4','blurb5','blurb6','blurb7'])
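
The six replace steps above could also be written as one ordered list of substitutions, which avoids the intermediate columns. This is only a sketch: the name clean_definition and the sample record are hypothetical, and the record is a made-up string in the general shape of the definitions shown earlier.

```python
# Ordered (old, new) pairs mirroring the six replace steps above.
REPLACEMENTS = [
    ('Official Symbol-', ''),
    ('and Name:', 'is'),
    ('Other Aliases:', 'also'),
    ('Other Designations:', 'It is'),
    ('[Homo sapiens (human)]', '(human)'),
    ('Other Aliases-', 'and'),
]

def clean_definition(text):
    """Apply every replacement in order and return the cleaned string."""
    text = str(text)
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

# Made-up record, not copied from the dataset.
sample = ('Official Symbol- TP53 and Name: tumor protein p53 '
          '[Homo sapiens (human)],Other Aliases: BCC7, LFS1,'
          'Other Designations: cellular tumor antigen p53')
cleaned = clean_definition(sample)
print(cleaned)
```

Under these assumptions, a single genes['blurb2'].apply(clean_definition) would produce the same text as the blurb8 column.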

In [81]:
genes.head()

Unnamed: 0,abstract,match,med,symbols,blurb8
0,,0,0,TP53,"TP53 is tumor protein p53 (human),also BCC7, ..."
1,,0,0,EGFR,EGFR is epidermal growth factor receptor (hum...
2,,0,0,TNF,"TNF is tumor necrosis factor (human),also DIF..."
3,,0,0,APOE,"APOE is apolipoprotein E (human),also AD2, AP..."
4,,0,0,VEGFA,VEGFA is vascular endothelial growth factor A...


Next I'll tokenize the data and make lemmas.

In [82]:
#Tokenize data and make lemmas
import spacy
nlp = spacy.load('en',parser=False, entity=False,tagger=False,textcat=False,ner=False)


In [83]:
genes['tokens'] = genes['blurb8'].apply(lambda x: nlp(x))

In [84]:
genes['lemmas'] = genes['tokens'].apply(lambda x: [token.lemma_ for token in x])

In [85]:
genes.head()

Unnamed: 0,abstract,match,med,symbols,blurb8,tokens,lemmas
0,,0,0,TP53,"TP53 is tumor protein p53 (human),also BCC7, ...","( , TP53, is, tumor, protein, p53, (, human),a...","[ , tp53, be, tumor, protein, p53, (, human),a..."
1,,0,0,EGFR,EGFR is epidermal growth factor receptor (hum...,"( , EGFR, is, epidermal, growth, factor, recep...","[ , egfr, be, epidermal, growth, factor, recep..."
2,,0,0,TNF,"TNF is tumor necrosis factor (human),also DIF...","( , TNF, is, tumor, necrosis, factor, (, human...","[ , tnf, be, tumor, necrosis, factor, (, human..."
3,,0,0,APOE,"APOE is apolipoprotein E (human),also AD2, AP...","( , APOE, is, apolipoprotein, E, (, human),als...","[ , apoe, be, apolipoprotein, e, (, human),als..."
4,,0,0,VEGFA,VEGFA is vascular endothelial growth factor A...,"( , VEGFA, is, vascular, endothelial, growth, ...","[ , vegfa, be, vascular, endothelial, growth, ..."


In [86]:
genes_match = genes[genes['match']==1]

In [87]:
genes_no = genes[genes['match']==0]

I'll prepare the training and test sets in a way that makes sure both sets have matching definitions and abstracts.

In [88]:
#Prepare test and training sets
msk = np.random.rand(len(genes_match)) < 0.8
trainM = genes_match[msk]
testM = genes_match[~msk]

In [89]:
msk = np.random.rand(len(genes_no)) < 0.8
train = genes_no[msk]
test = genes_no[~msk]
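
The random mask above gives a different split on every run. When a particular split needs to be reproduced, a seeded generator can be used instead. The sketch below uses only the standard library and a hypothetical split_rows helper; with NumPy, calling np.random.seed before building the mask would play the same role.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for dataframe rows
train_rows, test_rows = split_rows(rows, seed=42)
print(len(train_rows), len(test_rows))
```

As with the mask, the split is only approximately 80/20 because each row is assigned independently.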

In [93]:
train = train.append(trainM,ignore_index = True)

train.shape

(4106, 7)

In [94]:
train.shape

(4106, 7)

In [95]:
test = test.append(testM,ignore_index = True)

test.shape

(928, 7)

Because the training and test sets are drawn with np.random, each run produces a different split. I optimized the model parameters by retraining on many such splits and comparing how well each version matched genes to disorders. My final Word2Vec model uses CBOW with a window of 28 and a word vector length of 95.
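
The parameter search described above was done by hand. A minimal sketch of how it could be automated is shown here. The grid_search helper is hypothetical, and toy_evaluate is an artificial stand-in whose peak is placed at the reported settings by construction; a real evaluate function would train Word2Vec with the given window and size and return a measure of how many disorders were matched correctly.

```python
from itertools import product

def grid_search(evaluate, windows, sizes):
    """Return (best_score, best_window, best_size) over all combinations."""
    best = None
    for window, size in product(windows, sizes):
        score = evaluate(window, size)
        if best is None or score > best[0]:
            best = (score, window, size)
    return best

# Toy stand-in for "train a model and count correct top matches".
# It is rigged to peak at window=28, size=95 purely for the demo.
def toy_evaluate(window, size):
    return -abs(window - 28) - abs(size - 95)

best_score, best_window, best_size = grid_search(
    toy_evaluate, windows=[20, 27, 28], sizes=[95, 100, 110])
print(best_window, best_size)
```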

In [381]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    np.asarray(train['lemmas']),
    workers=4,     # Number of threads to run in parallel 
    min_count=1,  # Minimum word count threshold.
    window=28,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=95,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!


In [382]:
# List of words in model.
vocab = model.wv.vocab.keys()

I tested the model by looking at model.wv.similarity between gene abbreviations and disorders. Results are below.

In [383]:
# Cosine similarity: 1 is a perfect match and 0 is no similarity
print(model.wv.similarity('disease', 'syndrome'))

 

0.790815382317325
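
The score printed above is a cosine similarity between two word vectors. As a reminder of what that number means, here is a plain-Python sketch with toy two-dimensional vectors (the vectors are invented and have nothing to do with the trained model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1, so negative values are possible too.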


The model found a strong similarity between 'disease' and 'syndrome', as expected, so I moved on to testing the similarities between genes and disorders.

In [384]:
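# Flag rows whose text mentions one of the disorder keywords.
# str.contains is case-sensitive, so capitalized forms such as 'Parkinson' are not caught by this pattern.
train['disorder'] = np.where((train['blurb8'].str.contains('alzheimer|parkinson|dystrophy|ehlers|cancer|leukemia|diabetes')),1,0)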


In [385]:
train['disorder'].value_counts()

0    3763
1     343
Name: disorder, dtype: int64

In [386]:
abbrev = pd.DataFrame(train[train['disorder']==1])
abbrev.head()

Unnamed: 0,abstract,match,med,symbols,blurb8,tokens,lemmas,disorder
1,,0.0,0,EGFR,EGFR is epidermal growth factor receptor (hum...,"( , EGFR, is, epidermal, growth, factor, recep...","[ , egfr, be, epidermal, growth, factor, recep...",1
6,,0.0,0,ERBB2,ERBB2 is erb-b2 receptor tyrosine kinase 2 (h...,"( , ERBB2, is, erb, -, b2, receptor, tyrosine,...","[ , erbb2, be, erb, -, b2, receptor, tyrosine,...",1
10,,0.0,0,APP,APP is amyloid beta precursor protein (human)...,"( , APP, is, amyloid, beta, precursor, protein...","[ , app, be, amyloid, beta, precursor, protein...",1
12,,0.0,0,BRCA1,"BRCA1 is BRCA1, DNA repair associated (human)...","( , BRCA1, is, BRCA1, ,, DNA, repair, associat...","[ , brca1, be, brca1, ,, dna, repair, associat...",1
27,,0.0,0,PTEN,PTEN is phosphatase and tensin homolog (human...,"( , PTEN, is, phosphatase, and, tensin, homolo...","[ , pten, be, phosphatase, and, tensin, homolo...",1


In [402]:
# Word2Vec vocab only contains lowercase strings.
abbrev['symbols'] = abbrev['symbols'].apply(lambda x: str(x).lower())
# Compare acronyms with a disorder keyword and record the similarity score.
score = []
for i in abbrev['symbols']:
    if i in vocab:
        score.append([i, model.wv.similarity(i, 'dystrophy')])
# Sort scores from highest to lowest and keep the top 8.
scoredf = pd.DataFrame(score)
scoredf = scoredf.sort_values(1, ascending=False).iloc[0:8,:]


Top Gene Matches for Dystrophy

In [403]:
# Parameter-tuning notes:
# window=20, size=100 works for dystrophy
# window=28, size=95 works for parkinson and all of the others
# window=27 works for both, plus alzheimer and cancer
# raising size to 110 seems to improve results (not confirmed)
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
291,dmd,0.674957
258,dmd,0.674957
244,dmd,0.674957
305,dmd,0.674957
141,dmd,0.674957
26,stmn1,0.613236
248,stmn1,0.613236
295,stmn1,0.613236
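
The table above repeats dmd and stmn1 because the same symbol appears in several rows of abbrev. A minimal sketch of a ranking helper that scores each symbol only once is shown below. The top_matches function is hypothetical, and the toy similarity table stands in for model.wv.similarity; its numbers are illustrative.

```python
def top_matches(symbols, keyword, similarity, vocab, n=8):
    """Rank unique in-vocabulary symbols by similarity to keyword."""
    seen = set()
    scored = []
    for symbol in symbols:
        symbol = str(symbol).lower()
        if symbol in vocab and symbol not in seen:
            seen.add(symbol)
            scored.append((symbol, similarity(symbol, keyword)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy similarity table standing in for model.wv.similarity.
toy_scores = {('dmd', 'dystrophy'): 0.67, ('stmn1', 'dystrophy'): 0.61,
              ('egfr', 'dystrophy'): 0.10}
toy_vocab = {'dmd', 'stmn1', 'egfr', 'dystrophy'}
ranked = top_matches(['DMD', 'dmd', 'STMN1', 'EGFR', 'XYZ'], 'dystrophy',
                     lambda a, b: toy_scores[(a, b)], toy_vocab, n=2)
print(ranked)
```

With duplicates removed, the eight slots of the ranking would hold eight different genes instead of a few repeated ones.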


In [404]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][1]

In [405]:
train.iloc[i,4]

' Dmd is dystrophin, muscular dystrophy [Mus musculus (house mouse)],also DXSmh7, DXSmh9, Dp427, Dp71, dys, mdx, pke,It is dystrophin; X-linked muscular dystrophy; dystrophin Dp71 delta110 isoform; dystrophin Dp71c isoform; dystrophin Dp71d delta71,74 isoform; dystrophin Dp71d delta74 isoform; dystrophin Dp71d(delta71,73-74); dystrophin Dp71f delta74 isoform,Chromosome: X; Location: X 38.38 cM,Annotation: Chromosome X NC_000086.7 (82814664..85205050)'

Top Gene Matches for Cancer

In [406]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'cancer'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]


In [407]:
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
299,brca2,0.609682
328,brca2,0.609682
281,brca2,0.609682
101,brca2,0.609682
5,brca2,0.609682
252,brca2,0.609682
279,brca1,0.577704
326,brca1,0.577704


In [408]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

' BRCA2 is BRCA2, DNA repair associated (human),also BRCC2, BROVCA2, FACD, FAD, FAD1, FANCD, FANCD1, GLM3, PNCA2, XRCC11,It is breast cancer type 2 susceptibility protein; BRCA1/BRCA2-containing complex, subunit 2; Fanconi anemia group D1 protein; breast and ovarian cancer susceptibility gene, early onset; breast and ovarian cancer susceptibility protein 2; breast cancer 2 tumor suppressor; breast cancer 2, early onset; mutant BRCA2; truncated breast cancer 2,Chromosome: 13; Location: 13q13.1,Annotation: Chromosome 13 NC_000013.11 (32315480..32399672),MIM: 600185'

Top Gene Matches for Alzheimer

In [411]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'alzheimer'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]


In [412]:
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
2,app,0.564204
247,app,0.564204
294,app,0.564204
278,shbg,0.525006
185,shbg,0.525006
325,shbg,0.525006
163,cebpa,0.431128
122,ep300,0.429253


In [413]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

' APP is amyloid beta precursor protein (human),also AAA, ABETA, ABPP, AD1I, CTFgamma, CVAP, PN-II, PN2, preA4, APP,It is amyloid-beta A4 protein; alzheimer disease amyloid protein; amyloid beta (A4) precursor protein; amyloid beta A4 protein; amyloid precursor protein; beta-amyloid peptide; beta-amyloid peptide(1-40); beta-amyloid peptide(1-42); beta-amyloid precursor protein; cerebral vascular amyloid peptide; peptidase nexin-II; protease nexin-II; testicular tissue protein Li 2,Chromosome: 21; Location: 21q21.3,Annotation: Chromosome 21 NC_000021.9 (25880550..26171128, complement),MIM: 104760'

Top Gene Matches for Parkinson

In [416]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'parkinson'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
10,prkn,0.759337
17,park7,0.526299
227,bap1,0.506078
215,tgfb2,0.478677
31,best1,0.452795
116,cav1,0.407586
80,kdm5b,0.402326
88,mthfr,0.380639


In [417]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

' PRKN is parkin RBR E3 ubiquitin protein ligase (human),also AR-JP, LPRS2, PARK2, PDJ,It is E3 ubiquitin-protein ligase parkin; Parkinson disease (autosomal recessive, juvenile) 2, parkin; parkinson juvenile disease protein 2; parkinson protein 2 E3 ubiquitin protein ligase; parkinson protein 2, E3 ubiquitin protein ligase (parkin),Chromosome: 6; Location: 6q26,Annotation: Chromosome 6 NC_000006.12 (161347417..162727802, complement),MIM: 602544'

Top Gene Matches for Diabetes

In [422]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'diabetes'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
329,stim1,0.511418
282,stim1,0.511418
98,lep,0.459461
114,mlh1,0.436029
7,mlh1,0.436029
244,dmd,0.432583
141,dmd,0.432583
258,dmd,0.432583


In [426]:
scores=scoredf.iloc[2,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

'LEP gene encodes a protein that is secreted by white adipocytes into the circulation and plays a major role in the regulation of energy homeostasis. Circulating leptin binds to the leptin receptor in the brain, which activates downstream signaling pathways that inhibit feeding and promote energy expenditure. This protein also has several endocrine functions, and is involved in the regulation of immune and inflammatory responses, hematopoiesis, angiogenesis, reproduction, bone formation and wound healing. Mutations in LEP gene and its regulatory regions cause severe obesity and morbid obesity with hypogonadism in human patients. A mutation in LEP gene has also been linked to type 2 diabetes mellitus development. [provided by RefSeq, Aug 2017]'

Top Gene Matches for Leukemia

In [428]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'leukemia'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
39,set,0.656145
324,pml,0.628591
297,pml,0.628591
250,pml,0.628591
277,pml,0.628591
14,pml,0.628591
97,myc,0.567546
68,tet1,0.522146


In [429]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

' SET is SET nuclear proto-oncogene (human),also 2PP2A, I2PP2A, IGAAD, IPP2A2, PHAPII, TAF-I, TAF-IBETA,It is protein SET; HLA-DR-associated protein II; SET nuclear oncogene; SET translocation (myeloid leukemia-associated); Template-Activating Factor-I, chromatin remodelling factor; inhibitor of granzyme A-activated DNase; inhibitor-2 of protein phosphatase-2A; phosphatase 2A inhibitor I2PP2A; protein phosphatase type 2A inhibitor,Chromosome: 9; Location: 9q34.11,Annotation: Chromosome 9 NC_000009.12 (128683432..128696396),MIM: 600960'

Top Gene Matches for Ehlers (Ehlers Danlos)

In [430]:
score=[]
for i in abbrev['symbols']:
        if i in vocab:
            score.append([i,(model.wv.similarity(i, 'ehlers'))])
#Sort score highest to lowest
scoredf = pd.DataFrame(score)
scoredf=scoredf.sort_values(1,ascending=False).iloc[0:8,:]
scoredf.iloc[0:10,:]

Unnamed: 0,0,1
191,postn,0.65963
228,aire,0.604809
28,aire,0.604809
78,ndc80,0.587843
205,abca4,0.563907
46,runx1t1,0.560244
144,gnas,0.555814
154,itgav,0.533829


In [433]:
scores=scoredf.iloc[0,0].upper()
i=np.where(train['symbols']==scores)[0][0]
train.iloc[i,4]

'POSTN gene encodes a secreted extracellular matrix protein that functions in tissue development and regeneration, including wound healing, and ventricular remodeling following myocardial infarction. The encoded protein binds to integrins to support adhesion and migration of epithelial cells. This protein plays a role in cancer stem cell maintenance and metastasis. Mice lacking POSTN gene exhibit cardiac valve disease, and skeletal and dental defects. Alternative splicing results in multiple transcript variants encoding different isoforms. [provided by RefSeq, Sep 2015]'

For 5 of the 7 health disorders I searched for, the top gene match contained the disorder keyword in its text. One more, diabetes, is related to its top gene match and is linked by keyword to the second gene on the list (LEP). The model didn't succeed with Ehlers-Danlos, probably because it is rarer, so there is less research on it and far fewer mentions in the dataset (see the counts below). However, the model did put an entry on wound healing at the top of the list for Ehlers-Danlos, and Ehlers-Danlos is a collagen defect disorder that results in poor wound healing. So perhaps research into the top gene match will help those with Ehlers-Danlos.
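
The keyword check behind this tally can be written down explicitly. The sketch below uses short excerpts copied from the top-match descriptions printed above. Diabetes is left out because the text of its top match (stim1) is not printed in this notebook, so the sketch covers six disorders rather than seven. The keyword_hit helper is hypothetical.

```python
def keyword_hit(keyword, description):
    """True if the keyword occurs in the description, ignoring case."""
    return keyword.lower() in description.lower()

# Short excerpts from the top-match descriptions printed above.
top_match_text = {
    'dystrophy': 'Dmd is dystrophin, muscular dystrophy',
    'cancer': 'breast cancer type 2 susceptibility protein',
    'alzheimer': 'alzheimer disease amyloid protein',
    'parkinson': 'parkinson juvenile disease protein 2',
    'leukemia': 'SET translocation (myeloid leukemia-associated)',
    'ehlers': 'secreted extracellular matrix protein',
}
hits = [k for k, text in top_match_text.items() if keyword_hit(k, text)]
print(len(hits), 'of', len(top_match_text))
```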

In [133]:
len(np.where(train['blurb8'].str.contains('Ehlers'))[0])


3

In [134]:
len(np.where(train['blurb8'].str.contains('Alzheimer\'s'))[0])

23

In [122]:
len(np.where(train['blurb8'].str.contains('Parkinson'))[0])

13

In [123]:
len(np.where(train['blurb8'].str.contains('dystrophy'))[0])

24

In [124]:
len(np.where(train['blurb8'].str.contains('diabetes'))[0])

48

In [125]:
len(np.where(train['blurb8'].str.contains('cancer'))[0])

195

In [126]:
len(np.where(train['blurb8'].str.contains('leukemia'))[0])

81

There are 3 mentions of Ehlers-Danlos in the training set, while the other disorders I searched for had 13 to 195 mentions. This gene calculator is able to identify important genes relating to health disorders that have as few as 13 mentions in the dataset.
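
The separate counting cells above could be folded into one helper that counts every keyword in a single call and ignores case. This is a sketch with a hypothetical count_mentions function and made-up texts; in the notebook the texts would come from train['blurb8'].

```python
import re

def count_mentions(texts, keywords):
    """Count how many texts mention each keyword, ignoring case."""
    counts = {}
    for keyword in keywords:
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        counts[keyword] = sum(1 for text in texts if pattern.search(text))
    return counts

# Made-up texts for illustration only.
texts = ['Parkinson disease protein', 'parkinson juvenile disease',
         'breast cancer 2, early onset', 'no disorder mentioned']
counts = count_mentions(texts, ['parkinson', 'cancer', 'ehlers'])
print(counts)
```

Ignoring case matters here because the source text mixes forms such as 'Parkinson' and 'parkinson'.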

An important aspect of this approach is that the search runs only over genes whose text mentions one of the disorders of interest. In a dataset that also includes entries with no medical disorder, there would be no way to verify whether a gene is actually related to the queried disease or whether its definition was just general enough to match easily.

Now I'll train the same model configuration on the test data and repeat the comparisons.

In [434]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    np.asarray(test['lemmas']),
    workers=4,     # Number of threads to run in parallel 
    min_count=1,  # Minimum word count threshold.
    window=28,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=95,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!


In [435]:
# 1 is a perfect match and 0 is no similarity
print(model.wv.similarity('disease', 'syndrome'))

0.8069515292726308


In [436]:
test['disorder'] = np.where((test['blurb8'].str.contains('alzheimer|parkinson|dystrophy|cancer|leukemia|diabetes')),1,0)


In [437]:
test['disorder'].value_counts()

0    858
1     70
Name: disorder, dtype: int64

In [438]:
abbrevt = pd.DataFrame(test[test['disorder']==1])
abbrevt.head()

Unnamed: 0,abstract,match,med,symbols,blurb8,tokens,lemmas,disorder
44,,0,0,ABL1,"ABL1 is ABL proto-oncogene 1, non-receptor ty...","( , ABL1, is, ABL, proto, -, oncogene, 1, ,, n...","[ , abl1, be, abl, proto, -, oncogene, 1, ,, n...",1
46,,0,0,MSH2,"MSH2 is mutS homolog 2 (human),also COCA1, FC...","( , MSH2, is, mutS, homolog, 2, (, human),also...","[ , msh2, be, muts, homolog, 2, (, human),also...",1
51,,0,0,RUNX1,RUNX1 is runt related transcription factor 1 ...,"( , RUNX1, is, runt, related, transcription, f...","[ , runx1, be, runt, relate, transcription, fa...",1
55,,0,0,MCL1,"MCL1 is MCL1, BCL2 family apoptosis regulator...","( , MCL1, is, MCL1, ,, BCL2, family, apoptosis...","[ , mcl1, be, mcl1, ,, bcl2, family, apoptosis...",1
87,,0,0,ERBB3,ERBB3 is erb-b2 receptor tyrosine kinase 3 (h...,"( , ERBB3, is, erb, -, b2, receptor, tyrosine,...","[ , erbb3, be, erb, -, b2, receptor, tyrosine,...",1


In [439]:
# List of words in model.
vocabt = model.wv.vocab.keys()
vocabt

dict_keys([' ', 'tnf', 'be', 'tumor', 'necrosis', 'factor', '(', 'human),also', 'dif', '-', 'alpha', ',', 'tnfa', 'tnfsf2', 'tnlg1f', '-PRON-', ';', 'apc1', 'protein', 'macrophage', 'derive', 'monocyte', 'a', 'cachectin', 'ligand', '1f', 'superfamily', 'member', '2', 'chromosome', ':', '6', 'location', '6p21.33,annotation', 'nc_000006.12', '31575567', '..', '31578336),mim', '191160', 'vegfa', 'vascular', 'endothelial', 'growth', 'mvcd1', 'vegf', 'vpf', 'a121', 'a165', 'permeability', '6p21.1,annotation', '43770209', '43786487),mim', '192240', 'solute', 'carrier', 'family', '4', 'gene', 'promoter', 'human),other', 'designations-', '5-htt', '5-httlpr', 'polymorphism', 'region', 'slc6a4', 'serotonin', 'transporter', '17', '17q11.2,annotation', 'nc_000017.11', '30235481', '30237521', ')', 'adipoq', 'adiponectin', 'c1q', 'and', 'collagen', 'domain', 'contain', 'acdc', 'acrp30', 'adipqtl1', 'adpn', 'apm-1', 'apm1', 'gbp28,it', '30', 'kda', 'adipocyte', 'complement', 'relate', 'adipose', 'mos

In [440]:
#Word2vec vocab only contains lower case strings.
abbrevt['symbols']=abbrevt['symbols'].apply(lambda x: str(x).lower())

In [482]:
# Compare acronyms with a disorder keyword and record the similarity score.
scoret = []
for i in abbrevt['symbols']:
    if i in vocabt:
        scoret.append([i, model.wv.similarity(i, 'alzheimer')])

In [483]:
# Sort scores from highest to lowest and keep the top 8.
scoretdf = pd.DataFrame(scoret)
scoretdf = scoretdf.sort_values(1, ascending=False).iloc[0:8,:]

In [484]:
scoretdf.iloc[0:10,:]

Unnamed: 0,0,1
28,hspb1,0.880104
35,kcnj11,0.764149
29,bmi1,0.700393
26,plk1,0.689233
33,dkk1,0.633706
50,hmga1,0.617731
47,irf1,0.609436
54,ndrg1,0.599029


In [491]:
scorest=scoretdf.iloc[0,0].upper()

In [492]:
i=np.where(test['symbols']==scorest)[0][0]

In [493]:
test.iloc[i,4]

'HSPB1 gene encodes a member of the small heat shock protein (HSP20) family of proteins. In response to environmental stress, the encoded protein translocates from the cytoplasm to the nucleus and functions as a molecular chaperone that promotes the correct folding of other proteins. This protein plays an important role in the differentiation of a wide variety of cell types. Expression of HSPB1 gene is correlated with poor clinical outcome in multiple human cancers, and the encoded protein may promote cancer cell proliferation and metastasis, while protecting cancer cells from apoptosis. Mutations in HSPB1 gene have been identified in human patients with Charcot-Marie-Tooth disease and distal hereditary motor neuropathy. [provided by RefSeq, Aug 2017]'

In [494]:
scoret=[]
for i in abbrevt['symbols']:
        if i in vocabt:
            scoret.append([i,(model.wv.similarity(i, 'cancer'))])
#Sort score highest to lowest
scoretdf = pd.DataFrame(scoret)
scoretdf=scoretdf.sort_values(1,ascending=False).iloc[0:8,:]
scoretdf.iloc[0:10,:]

Unnamed: 0,0,1
52,ctcf,0.879913
53,ercc5,0.857221
35,kcnj11,0.797364
23,lmna,0.788733
36,erg,0.784335
20,vdr,0.757245
69,cdh1,0.723604
44,fbxw7,0.677511


In [495]:
scorest=scoretdf.iloc[0,0].upper()
i=np.where(test['symbols']==scorest)[0][0]
test.iloc[i,4]

"CTCF gene is a member of the BORIS + CTCF gene family and encodes a transcriptional regulator protein with 11 highly conserved zinc finger (ZF) domains. This nuclear protein is able to use different combinations of the ZF domains to bind different DNA target sequences and proteins. Depending upon the context of the site, the protein can bind a histone acetyltransferase (HAT)-containing complex and function as a transcriptional activator or bind a histone deacetylase (HDAC)-containing complex and function as a transcriptional repressor. If the protein is bound to a transcriptional insulator element, it can block communication between enhancers and upstream promoters, thereby regulating imprinted expression. Mutations in CTCF gene have been associated with invasive breast cancers, prostate cancers, and Wilms' tumors. Alternatively spliced transcript variants encoding different isoforms have been found for CTCF gene. [provided by RefSeq, Jul 2010]"

In [496]:
scoret=[]
for i in abbrevt['symbols']:
        if i in vocabt:
            scoret.append([i,(model.wv.similarity(i, 'dystrophy'))])
#Sort score highest to lowest
scoretdf = pd.DataFrame(scoret)
scoretdf=scoretdf.sort_values(1,ascending=False).iloc[0:8,:]
scoretdf.iloc[0:10,:]

Unnamed: 0,0,1
15,mcm3,0.823161
23,lmna,0.811645
45,pappa,0.795664
25,nat2,0.763765
52,ctcf,0.687483
39,avp,0.663865
3,mcl1,0.6611
69,cdh1,0.646263


In [497]:
scorest=scoretdf.iloc[1,0].upper()
i=np.where(test['symbols']==scorest)[0][0]
test.iloc[i,4]

'The nuclear lamina consists of a two-dimensional matrix of proteins located next to the inner nuclear membrane. The lamin family of proteins make up the matrix and are highly conserved in evolution. During mitosis, the lamina matrix is reversibly disassembled as the lamin proteins are phosphorylated. Lamin proteins are thought to be involved in nuclear stability, chromatin structure and gene expression. Vertebrate lamins consist of two types, A and B. Alternative splicing results in multiple transcript variants. Mutations in LMNA gene lead to several diseases: Emery-Dreifuss muscular dystrophy, familial partial lipodystrophy, limb girdle muscular dystrophy, dilated cardiomyopathy, Charcot-Marie-Tooth disease, and Hutchinson-Gilford progeria syndrome. [provided by RefSeq, Apr 2012]'

In [498]:
scoret=[]
for i in abbrevt['symbols']:
        if i in vocabt:
            scoret.append([i,(model.wv.similarity(i, 'leukemia'))])
#Sort score highest to lowest
scoretdf = pd.DataFrame(scoret)
scoretdf=scoretdf.sort_values(1,ascending=False).iloc[0:8,:]
scoretdf.iloc[0:10,:]

Unnamed: 0,0,1
0,abl1,0.787283
56,prf1,0.74094
48,epor,0.734225
14,ptprj,0.683358
34,xrcc3,0.671951
21,pparg,0.632348
3,mcl1,0.606683
27,tyms,0.599458


In [499]:
scorest=scoretdf.iloc[0,0].upper()
i=np.where(test['symbols']==scorest)[0][0]
test.iloc[i,4]

' ABL1 is ABL proto-oncogene 1, non-receptor tyrosine kinase (human),also ABL, CHDSKM, JTK7, bcr/abl, c-ABL, c-ABL1, p150, v-abl,It is tyrosine-protein kinase ABL1; Abelson tyrosine-protein kinase 1; bcr/c-abl oncogene protein; c-abl oncogene 1, receptor tyrosine kinase; proto-oncogene c-Abl; proto-oncogene tyrosine-protein kinase ABL1; v-abl Abelson murine leukemia viral oncogene homolog 1,Chromosome: 9; Location: 9q34.12,Annotation: Chromosome 9 NC_000009.12 (130713881..130887675),MIM: 189980'

In [500]:
# Score each test-set gene symbol found in the vocabulary against 'diabetes'
scoret = []
for i in abbrevt['symbols']:
    if i in vocabt:
        scoret.append([i, model.wv.similarity(i, 'diabetes')])
# Sort scores highest to lowest and keep the top 8
scoretdf = pd.DataFrame(scoret)
scoretdf = scoretdf.sort_values(1, ascending=False).iloc[0:8, :]
scoretdf

Unnamed: 0,0,1
57,dcn,0.741716
15,mcm3,0.696583
23,lmna,0.693009
25,nat2,0.587872
50,hmga1,0.570986
45,pappa,0.532165
65,pten,0.528039
61,pten,0.528039


In [501]:
scorest=scoretdf.iloc[0,0].upper()
i=np.where(test['symbols']==scorest)[0][0]
test.iloc[i,4]

'DCN gene encodes a member of the small leucine-rich proteoglycan family of proteins. Alternative splicing results in multiple transcript variants, at least one of which encodes a preproprotein that is proteolytically processed to generate the mature protein. This protein plays a role in collagen fibril assembly. Binding of this protein to multiple cell surface receptors mediates its role in tumor suppression, including a stimulatory effect on autophagy and inflammation and an inhibitory effect on angiogenesis and tumorigenesis. DCN gene and the related gene biglycan are thought to be the result of a gene duplication. Mutations in DCN gene are associated with congenital stromal corneal dystrophy in human patients. [provided by RefSeq, Nov 2015]'

In [502]:
# Score each test-set gene symbol found in the vocabulary against 'parkinson'
scoret = []
for i in abbrevt['symbols']:
    if i in vocabt:
        scoret.append([i, model.wv.similarity(i, 'parkinson')])
# Sort scores highest to lowest and keep the top 8
scoretdf = pd.DataFrame(scoret)
scoretdf = scoretdf.sort_values(1, ascending=False).iloc[0:8, :]
scoretdf

Unnamed: 0,0,1
52,ctcf,0.883455
21,pparg,0.759933
53,ercc5,0.750906
25,nat2,0.717534
8,bcar1,0.715035
38,stim1,0.700112
69,cdh1,0.687375
64,klk3,0.679251


In [506]:
scorest=scoretdf.iloc[0,0].upper()
i=np.where(test['symbols']==scorest)[0][0]
test.iloc[i,4]

"CTCF gene is a member of the BORIS + CTCF gene family and encodes a transcriptional regulator protein with 11 highly conserved zinc finger (ZF) domains. This nuclear protein is able to use different combinations of the ZF domains to bind different DNA target sequences and proteins. Depending upon the context of the site, the protein can bind a histone acetyltransferase (HAT)-containing complex and function as a transcriptional activator or bind a histone deacetylase (HDAC)-containing complex and function as a transcriptional repressor. If the protein is bound to a transcriptional insulator element, it can block communication between enhancers and upstream promoters, thereby regulating imprinted expression. Mutations in CTCF gene have been associated with invasive breast cancers, prostate cancers, and Wilms' tumors. Alternatively spliced transcript variants encoding different isoforms have been found for CTCF gene. [provided by RefSeq, Jul 2010]"
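
The cells above repeat the same scoring loop for each disease keyword. As a rough sketch, that loop could be wrapped in one helper. The name `rank_genes` and its parameters are my own for illustration and are not part of the notebook. The similarity function is passed in so the helper does not depend on a particular model object, and dropping repeated symbols (such as the duplicate pten row in the diabetes results) is an added assumption, not something the original cells do.

```python
def rank_genes(symbols, vocab, similarity, keyword, top_n=8):
    """Return the top_n (symbol, score) pairs most similar to keyword.

    `similarity` is any callable taking (symbol, keyword) and returning a
    number. Hypothetical helper, not part of the original notebook.
    """
    seen = set()
    scores = []
    for symbol in symbols:
        # Skip symbols the model never saw, and repeats of the same symbol.
        if symbol in vocab and symbol not in seen:
            seen.add(symbol)
            scores.append((symbol, similarity(symbol, keyword)))
    # Sort scores highest to lowest.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]
```

With the notebook's own objects, a call would look something like `rank_genes(abbrevt['symbols'], vocabt, model.wv.similarity, 'leukemia')`, which returns a list of pairs instead of a dataframe.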

It's interesting that many of the top matching genes for the test set did not match the specific disease but instead matched the protein families affecting the disease: heat shock proteins for Alzheimer, zinc finger proteins for Parkinson, and the small leucine-rich proteoglycan family for diabetes. The test set may have been too small to differentiate individual genes, but not too small to differentiate protein families. The test set model was successful at finding a gene match for dystrophy, and, not surprisingly, it was successful at finding cancer and leukemia genes. Cancer and leukemia are very general terms, and most of the genes in the gene database seem to be factors in cancer, so there's plenty of data for matching.

The results of this test set suggest that the training model was not overfit, but improvements can still be made. The model needs more gene abstracts to be more robust. Searches can return multiple matches, but I think matches should not be returned unless the abstract of the first match contains the disease name. Protein matches are useful, but looking for the disease name in the top matching gene abstract is a good way to check that the model has been successful. It's also important to support two-word search terms, like 'muscular dystrophy' or 'breast cancer'.
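
The two improvements proposed above could be sketched roughly as follows. Both function names, the `phrases` set, and the `abstracts` dictionary are hypothetical and not part of the notebook. `join_phrases` would be applied to the token lists before training so that a term like 'muscular dystrophy' becomes the single token `muscular_dystrophy`. `confirmed_matches` returns results only when the top gene's abstract mentions the disease. This is one simple way to do it under those assumptions. gensim's `Phrases` model is an alternative that learns word pairs from the corpus instead of using a fixed list.

```python
def join_phrases(tokens, phrases):
    """Merge known two-word terms into single tokens joined by '_'.

    `phrases` is a set of (word1, word2) tuples chosen by hand (assumption).
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2  # both words were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def confirmed_matches(ranked, abstracts, keyword):
    """Return ranked matches only if the top gene's abstract names the disease.

    `ranked` is a list of (symbol, score) pairs sorted highest first;
    `abstracts` maps a symbol to its abstract text (assumption).
    """
    if not ranked:
        return []
    top_symbol = ranked[0][0]
    text = abstracts.get(top_symbol, "").lower()
    # Search for the phrase as it would appear in running text.
    needle = keyword.replace("_", " ").lower()
    return ranked if needle in text else []
```

Applied to the results above, a check like this would keep the leukemia search, whose top abstract (ABL1) mentions leukemia, and would return nothing for the diabetes search, whose top abstract (DCN) does not mention diabetes.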

As for what I've learned from this model about gene research, I'd say more research needs to be done on diabetes. Although there are plenty of gene abstracts containing the keyword 'diabetes', it was rare that any of the top gene matches were directly linked to diabetes. They were related to diabetes, but not yet considered factors.