# Initial Data Cleaning of SemMedDB

Two main problems stand out on a first look at SemMedDB:

1. Several rows contain multiple subjects or objects, separated by a pipe `|` character.
2. The database is not entirely in CUI space; some concepts are given Entrez Gene IDs instead.

There is a third, minor problem of data corruption. It affects less than 0.001% of the data, so corrupted rows are simply removed as they are identified.

In [1]:
import os
import pickle
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
sem_df = pd.read_csv('../data/semmedVER30_A.csv')



In [3]:
print('Rows: {:,}'.format(sem_df.shape[0]))
print('Cols: {}'.format(sem_df.shape[1]))

Rows: 89,173,359
Cols: 12


In [4]:
sem_df.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,61,128,14420371,ISA,C0026879,Mutagens,hops,1,C0220806,Chemicals,chem,1
1,62,116,5659339,PART_OF,C0017725,Glucose,bacs,1,C0022378,jejunum,bpoc,1
2,63,146,12255310,PROCESS_OF,C0006147,Breast Feeding,orgf,1,C0020114,Human,humn,1
3,64,170,12305488,TREATS,C0279494,Oestrogen therapy,topp,1,C0043210,Woman,popg,1
4,65,116,5659339,PROCESS_OF,C0232338,Blood flow,orgf,1,C0012984,Canis familiaris,mamm,1


# Expanding the pipes in subjects and objects

One of the first things noticed when looking at the data in SemMedDB was that some subjects and objects of extracted statements contain the pipe character `|` as an indicator of multiple concepts in the sentence.

## Examining Pipes in Subject/Object IDs

The first thing to do is examine some of these rows and look at their corresponding sentences in the database, to see whether they do in fact correspond to two concepts.

In [5]:
multi_subject = sem_df[sem_df['SUBJECT_CUI'].str.contains('|', regex=False)]
print("There are {:,} lines that contain a pipe in the subject".format(multi_subject.shape[0]))

sentence_ids = multi_subject['SENTENCE_ID'].values

There are 3,412,125 lines that contain a pipe in the subject


In [6]:
multi_subject.iloc[:10]

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
224,285,854,14782737,ASSOCIATED_WITH,C0001655|5443,Corticotropin|POMC,horm,1,C0021311,Infection,dsyn,0
251,312,850,13384314,INTERACTS_WITH,C1366490|156,ADRBK1 gene|ADRBK1,gngm,1,C0040615,Antipsychotic Agents,phsu,1
525,586,1760,4229045,PART_OF,C0032140|5340,Plasminogen|PLG,aapp,1,C0020114,Human,humn,1
883,944,2748,14880803,TREATS,C0001655|5443,Corticotropin|POMC,horm,1,C0003873,Rheumatoid Arthritis,dsyn,1
920,981,2806,18933908,PART_OF,C0001924|213,Albumins|ALB,aapp,1,C0020114,Human,humn,1
952,1013,2955,13680884,PART_OF,C0017796|2744,Glutaminase|GLS,aapp,1,C0014792,Erythrocytes,cell,1
1043,1104,3185,14770469,TREATS,C0001655|5443,Corticotropin|POMC,aapp,1,C0023418,leukemia,neop,1
1107,1168,3344,6065080,PART_OF,C0001457|100,ADENOSINE DEAMINASE|ADA,aapp,1,C0013303,Duodenum,bpoc,1
1453,1514,4181,4165881,PART_OF,C0034833|5241,"Receptors, Progesterone|PGR",aapp,1,C0027567,African race,humn,0
1924,1985,5834,13680882,PART_OF,C0001457|100,ADENOSINE DEAMINASE|ADA,aapp,1,C0014792,Erythrocytes,cell,1


Before we go any further, let's drop any rows that contain NaN values, as these are corrupted rows with no usable data.

In [7]:
# Remove any NaN values
print('Rows before NaN removal {:,}'.format(sem_df.shape[0]))
sem_df = sem_df.dropna()
print('Rows after NaN removal {:,}'.format(sem_df.shape[0]))

Rows before NaN removal 89,173,359
Rows after NaN removal 89,173,358


In [8]:
multi_start = sem_df['SUBJECT_CUI'].str.contains('|', regex=False)
multi_end = sem_df['OBJECT_CUI'].str.contains('|', regex=False)

In [9]:
pipe_lines = sem_df[multi_start | multi_end]
good_lines = sem_df[~multi_start & ~multi_end]
print('Rows with multiple subjects or objects {:,}'.format(len(pipe_lines)))
print('Rows with only 1 subject AND only 1 object {:,}'.format(len(good_lines)))

Rows with multiple subjects or objects 6,075,927
Rows with only 1 subject AND only 1 object 83,097,431


Lines with a pipe in the subject XOR a pipe in the object can be dealt with in a rather straightforward manner.

Those with a pipe in both the subject AND the object require a slightly different algorithm, so we'll separate those out.
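
As a minimal sketch of this three-way separation, in plain Python rather than the boolean pandas masks used in the cells that follow: the function name `pipe_category` and its return labels are illustrative and do not appear in the notebook.

```python
def pipe_category(subject_cui, object_cui, sep='|'):
    """Classify a row by where the separator occurs (hypothetical helper)."""
    has_subject = sep in subject_cui
    has_object = sep in object_cui
    if has_subject and has_object:
        return 'both'          # needs every subject/object combination
    if has_subject:
        return 'subject_only'  # simple split of the subject field
    if has_object:
        return 'object_only'   # simple split of the object field
    return 'none'              # already one subject and one object


categories = [
    pipe_category('C0001655|5443', 'C0021311'),
    pipe_category('C0006104', 'C0001655|5443'),
    pipe_category('C0001473|1769', 'C1366832|51761'),
    pipe_category('C0026879', 'C0220806'),
]
print(categories)
```

The `subject_only` and `object_only` cases together make up the XOR group described above.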

In [10]:
# Get indices for rows with only a multi start, only a multi end, and both a multi start and a multi end
multi_start_subset = multi_start[multi_start | multi_end]
multi_end_subset = multi_end[multi_start | multi_end]
multi_both_subset = multi_start_subset & multi_end_subset

In [11]:
start_only_subset = multi_start_subset & ~multi_end_subset
end_only_subset = multi_end_subset & ~multi_start_subset

### Splitting the IDs of the Subjects OR Objects

To split the IDs, each ID and name field is split into `n+1` rows, where `n` is the number of pipes `|` in the field; the data in the remaining columns is then duplicated across these new rows.
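
To make the expansion concrete, here is a minimal sketch on a single row, written in plain Python. The helper `expand_field` and the toy row are illustrative only; the notebook performs the same operation column-wise with pandas and `itertools.chain`.

```python
def expand_field(row, id_col, name_col, sep='|'):
    """Split one pipe-delimited ID/name pair into one row per concept.

    Hypothetical helper for illustration, not part of the notebook.
    """
    ids = row[id_col].split(sep)
    names = row[name_col].split(sep)
    if len(ids) != len(names):
        raise ValueError('ID and name fields split into different lengths')
    # Every other column is copied unchanged into each new row
    return [dict(row, **{id_col: i, name_col: n}) for i, n in zip(ids, names)]


row = {'PREDICATE': 'TREATS',
       'SUBJECT_CUI': 'C0001655|5443',
       'SUBJECT_NAME': 'Corticotropin|POMC',
       'OBJECT_CUI': 'C0003873'}

expanded = expand_field(row, 'SUBJECT_CUI', 'SUBJECT_NAME')
for r in expanded:
    print(r['SUBJECT_CUI'], r['SUBJECT_NAME'], r['PREDICATE'], r['OBJECT_CUI'])
```

One pipe in the subject yields two rows, each carrying the same predicate and object.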

In [12]:
from itertools import chain

In [13]:
# Split the IDs and Names
start_id_split = pipe_lines.loc[start_only_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[start_only_subset, 'SUBJECT_NAME'].str.split('|')

In [14]:
# Get the number of items after splitting
start_lens = start_id_split.apply(len)

In [15]:
# Need the column names for duplicating the data
all_cols = list(pipe_lines.columns)

# Copy the columns and only keep those where the data will be duped
start_cols = all_cols[:]
start_cols.remove('SUBJECT_CUI')
start_cols.remove('SUBJECT_NAME')

In [16]:
# Retaining the same order, repeat each row's data once for every new row created by the split
new_starts = dict()
for c in start_cols:
    tmp = pipe_lines.loc[start_only_subset, c].apply(lambda x: [x]) * start_lens
    new_starts[c] = [x for x in chain(*tmp.values)]

In [17]:
# Now we have the expanded rows with everything except the subject CUIs and Names
fixed_starts = pd.DataFrame(new_starts)
fixed_starts.head(10)

Unnamed: 0,OBJECT_CUI,OBJECT_NAME,OBJECT_NOVELTY,OBJECT_SEMTYPE,PMID,PREDICATE,PREDICATION_ID,SENTENCE_ID,SUBJECT_NOVELTY,SUBJECT_SEMTYPE
0,C0021311,Infection,0,dsyn,14782737,ASSOCIATED_WITH,285,854,1,horm
1,C0021311,Infection,0,dsyn,14782737,ASSOCIATED_WITH,285,854,1,horm
2,C0040615,Antipsychotic Agents,1,phsu,13384314,INTERACTS_WITH,312,850,1,gngm
3,C0040615,Antipsychotic Agents,1,phsu,13384314,INTERACTS_WITH,312,850,1,gngm
4,C0020114,Human,1,humn,4229045,PART_OF,586,1760,1,aapp
5,C0020114,Human,1,humn,4229045,PART_OF,586,1760,1,aapp
6,C0003873,Rheumatoid Arthritis,1,dsyn,14880803,TREATS,944,2748,1,horm
7,C0003873,Rheumatoid Arthritis,1,dsyn,14880803,TREATS,944,2748,1,horm
8,C0020114,Human,1,humn,18933908,PART_OF,981,2806,1,aapp
9,C0020114,Human,1,humn,18933908,PART_OF,981,2806,1,aapp


In [18]:
# Add in the subject CUIs and Names
fixed_starts['SUBJECT_CUI'] = [x for x in chain(*start_id_split.values)]
fixed_starts['SUBJECT_NAME'] = [x for x in chain(*start_name_split.values)]

fixed_starts = fixed_starts[all_cols]

In [19]:
fixed_starts.head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,285,854,14782737,ASSOCIATED_WITH,C0001655,Corticotropin,horm,1,C0021311,Infection,dsyn,0
1,285,854,14782737,ASSOCIATED_WITH,5443,POMC,horm,1,C0021311,Infection,dsyn,0
2,312,850,13384314,INTERACTS_WITH,C1366490,ADRBK1 gene,gngm,1,C0040615,Antipsychotic Agents,phsu,1
3,312,850,13384314,INTERACTS_WITH,156,ADRBK1,gngm,1,C0040615,Antipsychotic Agents,phsu,1
4,586,1760,4229045,PART_OF,C0032140,Plasminogen,aapp,1,C0020114,Human,humn,1
5,586,1760,4229045,PART_OF,5340,PLG,aapp,1,C0020114,Human,humn,1
6,944,2748,14880803,TREATS,C0001655,Corticotropin,horm,1,C0003873,Rheumatoid Arthritis,dsyn,1
7,944,2748,14880803,TREATS,5443,POMC,horm,1,C0003873,Rheumatoid Arthritis,dsyn,1
8,981,2806,18933908,PART_OF,C0001924,Albumins,aapp,1,C0020114,Human,humn,1
9,981,2806,18933908,PART_OF,213,ALB,aapp,1,C0020114,Human,humn,1


#### Fixing the lines where the Objects contain pipes

In [20]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')

When examining the data, we can see that some of the lines were not parsed correctly. This must have happened before the data was downloaded, because MySQL shows the same issues when the dump is loaded and queried.

These lines will be dropped, since there are only a few of them and their contents are not usable.
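
The check used to find these lines can be sketched on single values: a row is treated as mis-parsed when its ID field and its name field split into a different number of parts. The helper name `is_misparsed` is illustrative; the second example pair is taken from one of the corrupted rows shown in the output further down.

```python
def is_misparsed(id_field, name_field, sep='|'):
    """True when the ID and name fields split into different numbers of parts."""
    return len(id_field.split(sep)) != len(name_field.split(sep))


examples = [
    ('C0001655|5443', 'Corticotropin|POMC'),  # well-formed: 2 IDs, 2 names
    ('1|anim', 'C0599779'),                   # shifted columns: 2 parts vs 1
]
flags = [is_misparsed(ids, names) for ids, names in examples]
print(flags)
```

This only catches corruption that changes the number of parts, so it is a heuristic rather than a full validity check.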

In [21]:
end_lens = end_id_split.apply(len)
end_lens1 = end_name_split.apply(len)

print('There are {} lines with data corrupted in this manner'.format(sum(end_lens != end_lens1)))

pipe_lines.loc[end_only_subset][(end_lens != end_lens1)]

There are 11 lines with data corrupted in this manner


Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
39799424,39799487,68755828,15221490,PREP,C0334168,New bone formation,ortf,1,1|anim,C0599779,Animal Model,1
56658105,56658168,100064629,19333722,SPEC,C0600251,Interleukin-1 alpha,aapp,1,0|aapp,C0021764,Interleukins,1
64899747,64911399,116113695,23005300,PREP,C0040210,Tidal Volume,ortf,1,1|npop,C0337014,Avalanche,1
67617161,67731202,121681036,23023178,PREP,C0019291,"Hernia, Hiatal",dsyn,1,1|humn,C0030705,Patients,1
71691690,71932030,130086309,26191840,PREP,1756,DMD,gngm,1,1|medd,C0175723,Bands,1
71691694,71932034,130086309,26191840,PREP,1756,DMD,gngm,1,1|medd,C0175723,Bands,1
72146082,72386934,131008143,10002167,258,C0369286,H NOS ANTIBODY,aapp,1,1|inch|irda,1|inch,C0647605,1
73015930,73268879,132949913,25452567,PREP,C1552160,Supernumerary mandibular right primary canine,bpoc,1,2|sosy,C1457887,Symptoms,1
74711222,74985331,136601123,11116804,394,C0024141,"Lupus Erythematosus, Systemic",dsyn,1,388|Patients,1|podg|humn,1,1
78562982,78924976,136605633,11116842,240,C0949149,Superior glenoid labrum lesion,dsyn,1,643|athlete,1|prog|humn,1,1


In [22]:
# get the index for the bad lines
bad_lines = pipe_lines.loc[end_only_subset][(end_lens != end_lens1)].index

# Remove them from the main dataframe
pipe_lines = pipe_lines.drop(bad_lines)

# Remove them from the indices that still need to be used as well...
end_only_subset = end_only_subset.drop(bad_lines)
multi_both_subset = multi_both_subset.drop(bad_lines)

The splitting algorithm is now identical to the one used for the subject lines.

In [24]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [25]:
end_cols = all_cols[:]
end_cols.remove('OBJECT_CUI')
end_cols.remove('OBJECT_NAME')

In [26]:
new_ends = dict()
for c in end_cols:
    tmp = pipe_lines.loc[end_only_subset, c].apply(lambda x: [x]) * end_lens
    new_ends[c] = [x for x in chain(*tmp.values)]

In [27]:
fixed_ends = pd.DataFrame(new_ends)
fixed_ends['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_ends['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_ends = fixed_ends[all_cols]

In [28]:
fixed_ends.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,650,1840,4291298,LOCATION_OF,C0006104,Brain,bpoc,1,C0001655,Corticotropin,aapp,1
1,650,1840,4291298,LOCATION_OF,C0006104,Brain,bpoc,1,5443,POMC,aapp,1
2,960,2680,14770466,USES,C0087111,Therapeutic procedure,topp,0,C0001655,Corticotropin,aapp,1
3,960,2680,14770466,USES,C0087111,Therapeutic procedure,topp,0,5443,POMC,aapp,1
4,1046,3021,13738601,LOCATION_OF,C0022646,Kidney,bpoc,1,708,C1QBP,aapp,1


### Splitting of lines where both the subject and object contain pipes

These differ slightly in the way they have to be treated. First, the number of pipes in the subject and in the object can be different. The total number of new lines to be created is `(n+1) * (m+1)`, where `n` is the number of pipes in the subject and `m` is the number of pipes in the object.

Second, every possible combination of subject and object needs to be made. Given subjects `A` and `B`, objects `X` and `Y`, and predicate `p`, you need rows for the combinations `ApX`, `ApY`, `BpX`, and `BpY`.
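
The combinations themselves can be sketched with `itertools.product` from the standard library. This is an alternative way to view the problem, not the method used in the cells below, which repeat and sort lists instead. Pairing each ID with its name before taking the product keeps the two aligned; the variable names are illustrative.

```python
from itertools import product

# Each concept is an (ID, name) pair so the two stay together
subjects = [('A', 'name A'), ('B', 'name B')]
objects = [('X', 'name X'), ('Y', 'name Y')]
predicate = 'p'

# product() yields every subject paired with every object, in input order
triples = [(s_id, predicate, o_id)
           for (s_id, _s_name), (o_id, _o_name) in product(subjects, objects)]

print(triples)
print(len(triples) == len(subjects) * len(objects))
```

With one pipe on each side (`n = m = 1`), this gives `(1+1) * (1+1) = 4` rows, matching the formula above.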

In [29]:
start_id_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_NAME'].str.split('|')
start_lens = start_id_split.apply(len)

end_id_split = pipe_lines.loc[multi_both_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[multi_both_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [30]:
# Multiply the start splits by the end length, so you get end_len*start_len total rows
start_id_split = start_id_split * end_lens
start_name_split = start_name_split * end_lens

end_id_split = end_id_split * start_lens
end_name_split = end_name_split * start_lens

In [31]:
# Only sort the starts, so that you get all possible combinations.
# For example, right now we have start = [A, B, C, A, B, C] and end = [X, Y, X, Y, X, Y]
# By sorting the start we will have start = [A, A, B, B, C, C] and end = [X, Y, X, Y, X, Y]
# Therefore, when combined element-wise, all possible combinations will arise

sorting_df = pd.DataFrame()
sorting_df['ID'] = start_id_split
sorting_df['NAME'] = start_name_split

sorted_start_id_split = sorting_df['ID'].apply(lambda x: sorted(x))
# Need to sort the names based on IDs so that the same name still corresponds to the same ID
sorted_start_name_split = sorting_df.apply(lambda row: [x for y,x in sorted(zip(row['ID'], row['NAME']))], axis = 1)

In [32]:
sorting_df.head()

Unnamed: 0,ID,NAME
26944,"[C0001473, 1769, C0001473, 1769]","[ATP phosphohydrolase, DNAH8, ATP phosphohydro..."
43906,"[C0001655, 5443, C0001655, 5443]","[Corticotropin, POMC, Corticotropin, POMC]"
57216,"[C0001924, 213, C0001924, 213, C0001924, 213]","[Albumins, ALB, Albumins, ALB, Albumins, ALB]"
57841,"[C0755813, 325, 4068, C0755813, 325, 4068, C07...","[SKAP55-related protein, APCS, SH2D1A, SKAP55-..."
57842,"[C0755813, 325, 4068, C0755813, 325, 4068, C07...","[SKAP55-related protein, APCS, SH2D1A, SKAP55-..."


In [33]:
sorted_start_id_split.head()

26944                     [1769, 1769, C0001473, C0001473]
43906                     [5443, 5443, C0001655, C0001655]
57216        [213, 213, 213, C0001924, C0001924, C0001924]
57841    [325, 325, 325, 4068, 4068, 4068, C0755813, C0...
57842    [325, 325, 325, 4068, 4068, 4068, C0755813, C0...
Name: ID, dtype: object

In [34]:
sorted_start_name_split.head()

26944    [DNAH8, DNAH8, ATP phosphohydrolase, ATP phosp...
43906           [POMC, POMC, Corticotropin, Corticotropin]
57216        [ALB, ALB, ALB, Albumins, Albumins, Albumins]
57841    [APCS, APCS, APCS, SH2D1A, SH2D1A, SH2D1A, SKA...
57842    [APCS, APCS, APCS, SH2D1A, SH2D1A, SH2D1A, SKA...
dtype: object

The algorithm now continues in a similar manner to the subject-only and object-only corrections.

In [35]:
both_cols = all_cols[:]
both_cols.remove('SUBJECT_CUI')
both_cols.remove('SUBJECT_NAME')
both_cols.remove('OBJECT_CUI')
both_cols.remove('OBJECT_NAME')

In [36]:
new_both = dict()
for c in both_cols:
    tmp = pipe_lines.loc[multi_both_subset, c].apply(lambda x: [x]) * (start_lens * end_lens)
    new_both[c] = [x for x in chain(*tmp.values)]

In [37]:
fixed_both = pd.DataFrame(new_both)

fixed_both['SUBJECT_CUI'] = [x for x in chain(*sorted_start_id_split.values)]
fixed_both['SUBJECT_NAME'] = [x for x in chain(*sorted_start_name_split.values)]

fixed_both['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_both['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_both = fixed_both[all_cols]

In [38]:
fixed_both.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,27005,75178,4291334,INTERACTS_WITH,1769,DNAH8,enzy,1,C1366832,"ATPase, Aminophospholipid Transporter-Like, Cl...",aapp,1
1,27005,75178,4291334,INTERACTS_WITH,1769,DNAH8,enzy,1,51761,ATP8A2,aapp,1
2,27005,75178,4291334,INTERACTS_WITH,C0001473,ATP phosphohydrolase,enzy,1,C1366832,"ATPase, Aminophospholipid Transporter-Like, Cl...",aapp,1
3,27005,75178,4291334,INTERACTS_WITH,C0001473,ATP phosphohydrolase,enzy,1,51761,ATP8A2,aapp,1
4,43967,120050,14892061,compared_with,5443,POMC,aapp,1,C0001655,Corticotropin,aapp,1


Recombine these lines into the new dataframe.

In [39]:
sem_df = pd.concat([good_lines, fixed_starts, fixed_ends, fixed_both]).reset_index(drop=True)

In [40]:
print('The data now contains {:,} rows'.format(sem_df.shape[0]))

The data now contains 97,215,230 rows


# Normalizing Gene IDs to CUIs

Sometimes genes appear with a CUI as an identifier, other times they have an entrez gene id.

In [41]:
# POMC Gene is 5443 and has CUI C1337111
sem_df.query('SUBJECT_CUI == "C1337111"').head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
83097527,8950,25694,13029321,TREATS,C1337111,POMC gene,aapp,1,C0023449,"Leukemia, Lymphocytic, Acute",neop,1
83098366,82302,226266,13584735,NEG_AFFECTS,C1337111,POMC gene,gngm,1,C0232804,Renal function,ortf,1
83104999,714303,2002540,13176585,ISA,C1337111,POMC gene,aapp,1,C0087111,Therapeutic procedure,topp,0
83105001,714310,2002540,13176585,TREATS,C1337111,POMC gene,aapp,1,C0023467,"Leukemia, Myelocytic, Acute",neop,1
83105247,735910,2060779,13954719,PREVENTS,C1337111,POMC gene,aapp,1,C0023418,leukemia,neop,1


In [42]:
sem_df.query('SUBJECT_CUI == "5443"').head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
587286,592510,1671553,13841870,INHIBITS,5443,POMC,aapp,1,C0001655,Corticotropin,aapp,1
2582924,2625339,6220779,4366303,INTERACTS_WITH,5443,POMC,aapp,1,C0001655,Corticotropin,aapp,1
2935328,2986147,6952043,203610,NEG_AUGMENTS,5443,POMC,gngm,1,C0871828,Neophobia,menp,1
2975065,3026797,7031738,195674,PART_OF,5443,POMC,gngm,1,C0034693,Rattus norvegicus,mamm,1
2975213,3026949,7032037,195674,COEXISTS_WITH,5443,POMC,gngm,1,C0030956,Peptides,aapp,1


## Strategy for normalizing to CUI

Finding a direct map from Entrez to CUI is not straightforward. The UMLS API has a way to map from HGNC ID to CUI, but not directly from Entrez ID (or at least I couldn't figure out how).

We will therefore map to HGNC ID first, then to CUI.
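
The two-step mapping amounts to composing two lookups. The sketch below shows the idea with plain dictionaries; `compose_maps` is a hypothetical helper, and the HGNC numbers are placeholders chosen for illustration, not real HGNC IDs. Only the Entrez ID 5443 and the CUI C1337111 for the POMC gene come from this notebook.

```python
def compose_maps(first, second):
    """Chain two dict lookups, dropping keys that fail at the second step."""
    return {key: second[mid] for key, mid in first.items() if mid in second}


# Placeholder HGNC values (1001, 1002) are NOT real identifiers
entrez_to_hgnc = {5443: 1001, 156: 1002}
hgnc_to_cui = {1001: 'C1337111'}  # no entry for 1002: that gene stays unmapped

entrez_to_cui_demo = compose_maps(entrez_to_hgnc, hgnc_to_cui)
print(entrez_to_cui_demo)
```

Any gene missing from either step simply drops out of the composed map, which is why some Entrez IDs remain unmapped at the end.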

In [43]:
# Map Created from WIKI DATA see README.txt for more info
e_to_h = pd.read_csv('../entrez_to_hgnc.csv')

In [44]:
e_to_h = e_to_h.set_index('entrez_id')['hgnc_id'].to_dict()

In [45]:
gene_lines = ~sem_df['SUBJECT_CUI'].str.startswith('C')
genes_entrez = set(sem_df.loc[gene_lines, 'SUBJECT_CUI'])

gene_lines1 = ~sem_df['OBJECT_CUI'].str.startswith('C')
genes_entrez.update(set(sem_df.loc[gene_lines1, 'OBJECT_CUI']))

genes_need_fixing = gene_lines | gene_lines1

In [47]:
print("{} genes appear with Entrez IDs that will need to be mapped".format(len(genes_entrez)))

18934 genes appear with Entrez IDs that will need to be mapped


### Generate Entrez to CUI map

In [48]:
import requests
from tqdm import tqdm
from pyquery import PyQuery as pq

An API key is needed in order to use the UMLS API (which is quite limited in its capabilities). The key can be obtained by creating an account at https://uts.nlm.nih.gov//license.html, then logging into the account and going to the 'My Profile' page.

In [49]:
with open('../data/api.key', 'r') as fin: 
    TICKET = fin.read().rstrip()

In [50]:
def handshake():
    # UMLS API authentication instructions:
    # https://documentation.uts.nlm.nih.gov/rest/authentication.html
    r = requests.post("https://utslogin.nlm.nih.gov/cas/v1/api-key", data={'apikey': TICKET})
    d = pq(r.text)
    tgt = d.find('form').attr('action')
    return tgt

In [51]:
def hgnc_to_umls(hgnc_id, tgt):
    """
    given an hgnc_id, get the umls cui
    """
    data = {'service': 'http://umlsks.nlm.nih.gov'}
    r = requests.post(tgt, data=data)
    st = r.text
    url = "https://uts-ws.nlm.nih.gov/rest/content/current/source/HGNC/HGNC:{}/atoms?ticket={}"
    d = requests.get(url.format(hgnc_id, st)).json()
    for res in d['result']:
        if res['code'].split('/')[-1] == 'HGNC:{}'.format(hgnc_id) and 'concept' in res.keys():
            return res['concept'].split('/')[-1]
    
    print('no match')
    return float('NaN')

In [52]:
def cui_to_name(cui, tgt):
    """
    given a umls cui, get its name
    """
    data = {'service': 'http://umlsks.nlm.nih.gov'}
    r = requests.post(tgt, data=data)
    st = r.text
    url = "https://uts-ws.nlm.nih.gov/rest/content/current/CUI/{}?ticket={}"
    d = requests.get(url.format(cui, st)).json()
    return d['result']['name']

In [56]:
def get_entrez_to_cui(ids, tgt):
    mapper = dict()
    for eid in tqdm(ids):
        try:
            hid = e_to_h.get(int(eid), None)
            if not hid:
                continue
            else:
                cui = hgnc_to_umls(hid, tgt)
                mapper[eid] = cui
        except Exception:
            continue

    return mapper

def get_cui_to_name(cuis, tgt):
    mapper = dict()
    for cui in tqdm(cuis):
        name = cui_to_name(cui, tgt)
        mapper[cui] = name
    return mapper

In [57]:
tgt = handshake()

In [53]:
# Because the API takes so long to run (handshake process, no easy way to query many IDs at once),
# we save the results to a pickle so that future runs only query IDs that are still missing
if os.path.exists("../data/entrez_to_cui.pkl"):
    with open("../data/entrez_to_cui.pkl", "rb") as fin:
        e_to_cui = pickle.load(fin)
else:
    e_to_cui = dict()

to_query = set(genes_entrez) - set(e_to_cui.keys())

In [54]:
print("{} of the original {} still do not have a map and will be queried.".format(len(to_query), len(genes_entrez)))

568 of the original 18934 still do not have a map and will be queried.
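
The caching pattern used here (load what exists, query only what is missing, save the result) can be sketched in isolation. Everything below is illustrative: `slow_lookup` is a stand-in for the API call and returns made-up values, and the cache is written to a temporary directory rather than `../data/`.

```python
import os
import pickle
import tempfile


def load_cache(path):
    """Return the cached dict, or an empty dict if no cache file exists yet."""
    if os.path.exists(path):
        with open(path, 'rb') as fin:
            return pickle.load(fin)
    return {}


def save_cache(cache, path):
    with open(path, 'wb') as fout:
        pickle.dump(cache, fout)


def slow_lookup(key):
    # Stand-in for a slow remote query; the returned value is made up
    return 'C{:07d}'.format(int(key))


cache_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.pkl')
wanted = {'1', '2', '3'}

# First run: nothing cached, so everything is queried
cache = load_cache(cache_path)
missing = wanted - set(cache)
cache.update({key: slow_lookup(key) for key in missing})
save_cache(cache, cache_path)

# Second run: the cache covers every key, so nothing is queried
cache_again = load_cache(cache_path)
missing_after = wanted - set(cache_again)
print(len(missing), len(missing_after))
```

Note that keys whose lookup fails are never stored, so they are retried on every run; that matches the 568 IDs being queried again here.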


In [58]:
# The typical response rate is about 1 query per second
# That's about 5 hours for the whole data set

# However, if the ID cannot be found, the next query is instant
# So the 568 which seem not to be mappable are queried very quickly
if to_query:
    query_result = get_entrez_to_cui(to_query, tgt)
    e_to_cui.update(query_result)

100%|██████████| 568/568 [00:19<00:00, 29.56it/s]


In [59]:
got_cuis = set(e_to_cui.keys())

no_cuis = genes_entrez - got_cuis

print("Unable to map {} of the original {} Entrez genes.".format(len(no_cuis), len(genes_entrez)))

Unable to map 568 of the original 18934 Entrez genes.


In [60]:
pickle.dump(e_to_cui, open( "../data/entrez_to_cui.pkl", "wb" ) )

In [61]:
# Quick test to see if all the CUIs we queried for are in fact CUIs
total = 0
for cui in e_to_cui.values():
    if not cui.startswith('C'):
        total += 1
print(total)  # should be 0

0


### Generate a CUI to name map


Once the Entrez IDs are changed to CUIs, the existing names (gene symbols such as `POMC`) will no longer be the names that UMLS associates with those CUIs. We will use the UMLS API to get the correct name for each CUI.
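
A minimal sketch of the intended replacement on a few toy rows: the ID is remapped with a fallback to itself when no mapping exists, and the name is then looked up from the new ID. The dictionaries below are tiny stand-ins, not the real maps. One assumption differs from the notebook: this sketch falls back to the row's existing name, whereas the cells below fall back to the identifier.

```python
def remap(identifier, id_map):
    """Return the mapped ID, or the original ID when no mapping exists."""
    return id_map.get(identifier, identifier)


# Tiny stand-ins for the real maps built in this notebook
e_to_cui_demo = {'5443': 'C1337111'}
names_demo = {'C1337111': 'POMC gene'}

rows = [('5443', 'POMC'), ('8066', '8066'), ('C0001655', 'Corticotropin')]

fixed = []
for identifier, name in rows:
    new_id = remap(identifier, e_to_cui_demo)
    new_name = names_demo.get(new_id, name)  # keep the old name if unknown
    fixed.append((new_id, new_name))

print(fixed)
```

Rows whose Entrez ID has no mapping, such as `8066` here, pass through unchanged and can be picked out afterwards.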


In [62]:
# First get the names from semmed for everything that already has a CUI
d = sem_df[sem_df['SUBJECT_CUI'].str.startswith('C')].set_index('SUBJECT_CUI')['SUBJECT_NAME'].to_dict()
od = sem_df[sem_df['OBJECT_CUI'].str.startswith('C')].set_index('OBJECT_CUI')['OBJECT_NAME'].to_dict()

c_to_name_dict = {**d, **od}

In [63]:
len(c_to_name_dict)

259334

In [64]:
if os.path.exists("../data/cui_to_name.pkl"):
    c_to_name_dict.update(pickle.load(open( "../data/cui_to_name.pkl", "rb" )))

to_query = set(e_to_cui.values()) - set(c_to_name_dict.keys())

In [65]:
len(to_query)

0

In [66]:
if to_query:
    tgt = handshake()
    query_result = get_cui_to_name(to_query, tgt)
    c_to_name_dict.update(query_result)

In [67]:
len(c_to_name_dict)

268923

In [68]:
pickle.dump(c_to_name_dict, open( "../data/cui_to_name.pkl", "wb" ) )

In [69]:
# Check that the mapper produces the correct name when given the CUI for POMC gene
c_to_name_dict[e_to_cui['5443']]

'POMC gene'

### Apply the Changes

In [70]:
genes_need_fixing = gene_lines | gene_lines1

sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
2941,3031,8734,4892304,PART_OF,820,CAMP,gngm,1,C0949876,Nazis,humn,1
3446,3540,10247,20342456,LOCATION_OF,C0003062,Animals,anim,0,55578,SUPT20H,aapp,1
4854,4971,14349,13930654,PART_OF,C0036825,Serum Proteins,bacs,1,567,B2M,aapp,1
4989,5108,14715,13738792,LOCATION_OF,C0459385,Brain tissue,bpoc,1,23038,WDTC1,aapp,1
6379,6511,18747,13310143,PART_OF,115825,WDFY2,gngm,1,C0402112,Scientist,humn,0
7683,7822,22294,4221862,PART_OF,23038,WDTC1,gngm,1,C0034652,Rana esculenta,amph,1
7686,7825,22294,4221862,PART_OF,51761,ATP8A2,gngm,1,C0034652,Rana esculenta,amph,1
7692,7831,22294,4221862,PART_OF,51761,ATP8A2,gngm,1,C0260307,Triturus cristatus,amph,1
7696,7835,22294,4221862,PART_OF,23038,WDTC1,gngm,1,C0260307,Triturus cristatus,amph,1
9330,9480,27045,4872999,AFFECTS,1652,DDT,gngm,1,C0018270,Growth,orgf,1


In [71]:
sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))

In [72]:
sem_df.loc[genes_need_fixing, 'SUBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))

In [73]:
sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
2941,3031,8734,4892304,PART_OF,C1413106,CAMP gene,gngm,1,C0949876,Nazis,humn,1
3446,3540,10247,20342456,LOCATION_OF,C0003062,Animals,anim,0,C1539449,SUPT20H gene,aapp,1
4854,4971,14349,13930654,PART_OF,C0036825,Serum Proteins,bacs,1,C1412709,B2M gene,aapp,1
4989,5108,14715,13738792,LOCATION_OF,C0459385,Brain tissue,bpoc,1,C1428806,WDTC1 gene,aapp,1
6379,6511,18747,13310143,PART_OF,C1426966,WDFY2 gene,gngm,1,C0402112,Scientist,humn,0
7683,7822,22294,4221862,PART_OF,C1428806,WDTC1 gene,gngm,1,C0034652,Rana esculenta,amph,1
7686,7825,22294,4221862,PART_OF,C1366628,ATP8A2 gene,gngm,1,C0034652,Rana esculenta,amph,1
7692,7831,22294,4221862,PART_OF,C1366628,ATP8A2 gene,gngm,1,C0260307,Triturus cristatus,amph,1
7696,7835,22294,4221862,PART_OF,C1428806,WDTC1 gene,gngm,1,C0260307,Triturus cristatus,amph,1
9330,9480,27045,4872999,AFFECTS,C1413950,DDT gene,gngm,1,C0018270,Growth,orgf,1


### The lines that got no Query Result

Some of the IDs produced no query result. Those rows should still have a SUBJECT_CUI or OBJECT_CUI consisting only of a number. We'll examine them to see whether they offer any insight.
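
The cell below finds these rows with `str.startswith('C')`. A stricter check is possible because a UMLS CUI is the letter `C` followed by seven digits; the sketch below uses a regular expression for that. The names `CUI_PATTERN` and `is_cui` are illustrative and are not used elsewhere in the notebook.

```python
import re

# A CUI is the letter C followed by exactly seven digits
CUI_PATTERN = re.compile(r'C\d{7}')


def is_cui(identifier):
    """True only when the whole string has the CUI shape."""
    return CUI_PATTERN.fullmatch(identifier) is not None


ids = ['C0234222', '8066', '100188776', 'C123']
unmapped = [i for i in ids if not is_cui(i)]
print(unmapped)
```

Unlike the prefix test, this also flags malformed values that merely start with `C`, such as the made-up `C123` above.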

In [76]:
gene_lines2 = ~sem_df['SUBJECT_CUI'].str.startswith('C')
gene_lines3 = ~sem_df['OBJECT_CUI'].str.startswith('C')

genes_need_fixing1 = gene_lines2 | gene_lines3

sem_df[genes_need_fixing1]

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
56437,57007,154785,5699417,AFFECTS,8066,8066,gngm,1,C0234222,Baresthesia,ortf,1
72761,73517,201119,13032397,USES,C0033573,Prostatectomy,topp,1,100188776,100188776,aapp,1
141945,143333,394354,13388624,PART_OF,C0870883,Metabolites,bacs,1,619511,619511,aapp,1
162966,164557,454063,4961715,PART_OF,57306,57306,gngm,1,C0162547,Pseudomonas Phages,virs,1
300098,302807,851071,5217476,PART_OF,C0035668,RNA,bacs,1,100271694,100271694,aapp,1
308319,311085,874939,5637040,DISRUPTS,100188784,100188784,gngm,1,C1328948,RNA Synthesis,moft,1
330446,333401,940922,5723528,PART_OF,474256,474256,gngm,1,C0004651,Bacteriophages,virs,1
331439,334401,944070,14192648,PART_OF,C0035668,RNA,bacs,1,100271694,100271694,aapp,1
428707,432512,1230148,14114111,ASSOCIATED_WITH,100188864,100188864,gngm,1,C0027708,Nephroblastoma,neop,1
434966,438830,1248132,5638139,PART_OF,544326,544326,gngm,1,C0007125,"Carcinoma, Ehrlich Tumor",neop,1


For now we will save these to their own file and remove them from the 'cleaned' data.

In [77]:
sem_df[genes_need_fixing1].to_csv('../data/semmedVER30_A_no_CUI.csv', index=False)

In [78]:
sem_df = sem_df.drop(sem_df[genes_need_fixing1].index)
sem_df.to_csv('../data/semmedVER30_A_clean.csv', index=False)