# Initial Data Cleaning of SemmedDB

This notebook goes through the SemmedDB database and cleans out most of its errors so that it can be
processed into a hetnet.


There are two main problems found when first looking into SemmedDB:

1. There are several rows with multiple subjects or objects, separated by a pipe `|` character.
2. The database is not entirely in CUI space; some concepts are given Entrez Gene IDs.


There are also two minor problems that we will help clear up:

1. A small number of rows are corrupted. This affects less than 0.001% of the data, so these rows will simply be removed when identified.
2. Some of the CUIs contained in SemmedDB are either deprecated or have been merged with other CUIs. These will be resolved.

In [1]:
import os
import pickle
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import sys
sys.path.append('../tools')
import load_umls

In [2]:
sem_df = pd.read_csv('../data/semmedVER31_R.csv')


In [3]:
print('Rows: {:,}'.format(sem_df.shape[0]))
print('Cols: {}'.format(sem_df.shape[1]))

Rows: 96,363,098
Cols: 12


In [4]:
sem_df.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,38999,87169,4958141,PART_OF,C0040291,Tissue Extracts,bacs,1,C0085979,Cavia,mamm,1
1,39000,87167,13997680,LOCATION_OF,C0005767,Blood,tisu,1,C0007061,Carboxyhemoglobin,aapp,1
2,39001,87171,12254865,PROCESS_OF,C0702166,Acne,dsyn,1,C0043210,Woman,humn,1
3,39002,87175,14847741,PART_OF,C0027809,Neurilemmoma,neop,1,C0038351,Stomach,bpoc,1
4,39003,87165,11396507,PART_OF,C0221921,Stratum corneum,tisu,1,C0020114,Human,humn,0


In [5]:
# Get all the pmids and save them to a file
pmids = set(sem_df['PMID'])
out = []
print('{:,}'.format(len(pmids)))
for pmid in pmids:
    try:
        # PMIDs should be convertible to int; if not, the row is probably corrupted, so skip it
        out.append(int(pmid))
    except (TypeError, ValueError):
        pass
print('{:,}'.format(len(out)))
with open('../data/pmid_list_ver31.txt', 'w') as out_file:
    for pmid in out:
        out_file.write(str(pmid)+'\n')
print('Done!')

17,899,155
17,898,897
Done!


# Expanding the pipes in subjects and objects

One of the first things noticed upon looking at the data in SemmedDB was that some subjects and objects of extracted statements contain the pipe character `|` as an indicator of multiple concepts in the sentence.

## Examining Pipes in Subject/Object IDs

The first thing to do is examine some of these pipes and look at their corresponding sentences in the database, to see whether they do in fact correspond to two concepts.

In [6]:
multi_subject = sem_df[sem_df['SUBJECT_CUI'].str.contains('|', regex=False)]
print("There are {:,} lines that contain a pipe in the subject".format(multi_subject.shape[0]))

sentence_ids = multi_subject['SENTENCE_ID'].values

There are 3,645,614 lines that contain a pipe in the subject


In [7]:
multi_subject.iloc[:10]

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
213,39212,87854,13384314,INTERACTS_WITH,C1366490|156,ADRBK1 gene|ADRBK1,gngm,1,C0040615,Antipsychotic Agents,phsu,1
290,39289,88096,14782737,ASSOCIATED_WITH,C0001655|5443,Corticotropin|POMC,horm,1,C0021311,Infection,dsyn,1
354,39353,88296,4229045,PART_OF,C0032140|5340,Plasminogen|PLG,aapp,1,C0020114,Human,humn,0
785,39784,89525,18933908,PART_OF,C0001924|213,Albumins|ALB,aapp,1,C0020114,Human,humn,0
818,39817,89640,6065080,PART_OF,C0001457|100,ADENOSINE DEAMINASE|ADA,aapp,1,C0013303,Duodenum,bpoc,1
1045,40044,90238,4165881,PART_OF,C0034833|5241,"Receptors, Progesterone|PGR",aapp,1,C0027567,African race,humn,1
1050,40049,90257,13680884,PART_OF,C0017796|2744,Glutaminase|GLS,aapp,1,C0014792,Erythrocytes,cell,1
1122,40121,90444,14770469,TREATS,C0001655|5443,Corticotropin|POMC,aapp,1,C0023418,leukemia,neop,1
1236,40235,90801,14880803,TREATS,C0001655|5443,Corticotropin|POMC,horm,1,C0003873,Rheumatoid Arthritis,dsyn,1
1743,40742,92151,13680882,PART_OF,C0001457|100,ADENOSINE DEAMINASE|ADA,aapp,1,C0014792,Erythrocytes,cell,1


Before we go any further, let's drop any rows that contain NaN values, as these are corrupted rows that have no good data.

In [8]:
# Remove any NaN values
print('Rows before NaN removal {:,}'.format(sem_df.shape[0]))
sem_df = sem_df.dropna()
print('Rows after NaN removal {:,}'.format(sem_df.shape[0]))

Rows before NaN removal 96,363,098
Rows after NaN removal 96,363,098


In [9]:
multi_start = sem_df['SUBJECT_CUI'].str.contains('|', regex=False)
multi_end = sem_df['OBJECT_CUI'].str.contains('|', regex=False)

In [10]:
pipe_lines = sem_df[multi_start | multi_end]
good_lines = sem_df[~multi_start & ~multi_end]
print('Rows with multiple subjects or objects {:,}'.format(len(pipe_lines)))
print('Rows with only 1 subject AND only 1 object {:,}'.format(len(good_lines)))

Rows with multiple subjects or objects 6,476,260
Rows with only 1 subject AND only 1 object 89,886,838


Lines with a pipe in the subject XOR a pipe in the object can be dealt with in a rather straightforward manner.

Those with a pipe in both the subject AND the object will require a slightly different algorithm, so we'll separate those out.

In [11]:
# Get indices for rows with only a multi start, only a multi end, and both a multi start and a multi end
multi_start_subset = multi_start[multi_start | multi_end]
multi_end_subset = multi_end[multi_start | multi_end]
multi_both_subset = multi_start_subset & multi_end_subset

In [12]:
start_only_subset = multi_start_subset & ~multi_end_subset
end_only_subset = multi_end_subset & ~multi_start_subset

### Splitting the IDs of the Subjects OR Objects

To split the IDs, the IDs and names will be split into `n+1` rows where `n` is the number of pipes `|`, then the data from the rest of the columns will be duplicated across these new rows.
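The splitting idea described above can be sketched on a tiny, made-up frame. This is only an illustration under stated assumptions: the `toy` DataFrame and its four columns are hypothetical stand-ins for the real data, and `Index.repeat` is used here as a compact way to duplicate rows rather than the list-multiplication approach used in the cells that follow.

```python
from itertools import chain

import pandas as pd

# Hypothetical miniature of the SemmedDB table (values borrowed from the rows shown above)
toy = pd.DataFrame({
    'PREDICATE': ['TREATS', 'PART_OF'],
    'SUBJECT_CUI': ['C0001655|5443', 'C0032140'],
    'SUBJECT_NAME': ['Corticotropin|POMC', 'Plasminogen'],
    'OBJECT_CUI': ['C0023418', 'C0020114'],
})

# Split IDs and names on the pipe; each cell becomes a list of n+1 items
ids = toy['SUBJECT_CUI'].str.split('|')
names = toy['SUBJECT_NAME'].str.split('|')
counts = ids.apply(len)

# Repeat every row once per item, preserving the original order
expanded = toy.loc[toy.index.repeat(counts)].reset_index(drop=True)

# Flatten the split lists so that they line up with the repeated rows
expanded['SUBJECT_CUI'] = list(chain.from_iterable(ids))
expanded['SUBJECT_NAME'] = list(chain.from_iterable(names))
```

The first toy row has one pipe and so becomes two rows, while the second row has no pipe and is kept as a single row.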

In [13]:
from itertools import chain

In [14]:
# Split the IDs and Names
start_id_split = pipe_lines.loc[start_only_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[start_only_subset, 'SUBJECT_NAME'].str.split('|')

In [15]:
# Get the number of items after splitting
start_lens = start_id_split.apply(len)

In [16]:
# Need the column names for duplicating the data
all_cols = list(pipe_lines.columns)

# Copy the columns and only keep those where the data will be duped
start_cols = all_cols[:]
start_cols.remove('SUBJECT_CUI')
start_cols.remove('SUBJECT_NAME')

In [17]:
# Retaining the same order, repeat each row's data once per new row produced by the split
new_starts = dict()
for c in start_cols:
    tmp = pipe_lines.loc[start_only_subset, c].apply(lambda x: [x]) * start_lens
    new_starts[c] = [x for x in chain(*tmp.values)]

In [18]:
# Now we have the expanded rows with everything except the subject CUIs and Names
fixed_starts = pd.DataFrame(new_starts)
fixed_starts.head(10)

Unnamed: 0,OBJECT_CUI,OBJECT_NAME,OBJECT_NOVELTY,OBJECT_SEMTYPE,PMID,PREDICATE,PREDICATION_ID,SENTENCE_ID,SUBJECT_NOVELTY,SUBJECT_SEMTYPE
0,C0040615,Antipsychotic Agents,1,phsu,13384314,INTERACTS_WITH,39212,87854,1,gngm
1,C0040615,Antipsychotic Agents,1,phsu,13384314,INTERACTS_WITH,39212,87854,1,gngm
2,C0021311,Infection,1,dsyn,14782737,ASSOCIATED_WITH,39289,88096,1,horm
3,C0021311,Infection,1,dsyn,14782737,ASSOCIATED_WITH,39289,88096,1,horm
4,C0020114,Human,0,humn,4229045,PART_OF,39353,88296,1,aapp
5,C0020114,Human,0,humn,4229045,PART_OF,39353,88296,1,aapp
6,C0020114,Human,0,humn,18933908,PART_OF,39784,89525,1,aapp
7,C0020114,Human,0,humn,18933908,PART_OF,39784,89525,1,aapp
8,C0013303,Duodenum,1,bpoc,6065080,PART_OF,39817,89640,1,aapp
9,C0013303,Duodenum,1,bpoc,6065080,PART_OF,39817,89640,1,aapp


In [19]:
# Add in the subject CUIs and Names
fixed_starts['SUBJECT_CUI'] = [x for x in chain(*start_id_split.values)]
fixed_starts['SUBJECT_NAME'] = [x for x in chain(*start_name_split.values)]

fixed_starts = fixed_starts[all_cols]

In [20]:
fixed_starts.head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,39212,87854,13384314,INTERACTS_WITH,C1366490,ADRBK1 gene,gngm,1,C0040615,Antipsychotic Agents,phsu,1
1,39212,87854,13384314,INTERACTS_WITH,156,ADRBK1,gngm,1,C0040615,Antipsychotic Agents,phsu,1
2,39289,88096,14782737,ASSOCIATED_WITH,C0001655,Corticotropin,horm,1,C0021311,Infection,dsyn,1
3,39289,88096,14782737,ASSOCIATED_WITH,5443,POMC,horm,1,C0021311,Infection,dsyn,1
4,39353,88296,4229045,PART_OF,C0032140,Plasminogen,aapp,1,C0020114,Human,humn,0
5,39353,88296,4229045,PART_OF,5340,PLG,aapp,1,C0020114,Human,humn,0
6,39784,89525,18933908,PART_OF,C0001924,Albumins,aapp,1,C0020114,Human,humn,0
7,39784,89525,18933908,PART_OF,213,ALB,aapp,1,C0020114,Human,humn,0
8,39817,89640,6065080,PART_OF,C0001457,ADENOSINE DEAMINASE,aapp,1,C0013303,Duodenum,bpoc,1
9,39817,89640,6065080,PART_OF,100,ADA,aapp,1,C0013303,Duodenum,bpoc,1


#### Fixing the lines where the Objects contain pipes

In [21]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')

When examining the data, we can see that some of the lines were not parsed correctly. This must have happened before the data was downloaded, because MySQL shows the same issues when the dump is loaded and queried.

These lines will be dropped, since there are few of them and they contain no usable data.
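As a rough illustration of how such mis-parsed lines can be detected, the sketch below compares the number of pipe-separated IDs with the number of pipe-separated names. The `toy` frame is hypothetical, and its second row is deliberately broken so that the check has something to find.

```python
import pandas as pd

# Hypothetical rows; the second one has two IDs but only one name
toy = pd.DataFrame({
    'OBJECT_CUI': ['C0001655|5443', 'C0001655|5443'],
    'OBJECT_NAME': ['Corticotropin|POMC', 'Corticotropin'],
})

# Count the items on each side after splitting on the pipe
id_counts = toy['OBJECT_CUI'].str.split('|').apply(len)
name_counts = toy['OBJECT_NAME'].str.split('|').apply(len)

# A row is suspect when the two counts disagree
mismatched = id_counts != name_counts
suspect_rows = toy[mismatched]
```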

In [22]:
end_lens = end_id_split.apply(len)
end_lens1 = end_name_split.apply(len)

print('There are {} lines with data corrupted in this manner'.format(sum(end_lens != end_lens1)))

pipe_lines.loc[end_only_subset][(end_lens != end_lens1)]

There are 0 lines with data corrupted in this manner


Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY


In [23]:
# get the index for the bad lines
bad_lines = pipe_lines.loc[end_only_subset][(end_lens != end_lens1)].index

# Remove them from the main dataframe
pipe_lines = pipe_lines.drop(bad_lines)

# Remove from the indices that still need to be used as well...
end_only_subset = end_only_subset.drop(bad_lines)
multi_both_subset = multi_both_subset.drop(bad_lines)

Now the splitting algorithm is identical to that used for the Subject lines.

In [24]:
end_id_split = pipe_lines.loc[end_only_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[end_only_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [25]:
end_cols = all_cols[:]
end_cols.remove('OBJECT_CUI')
end_cols.remove('OBJECT_NAME')

In [26]:
new_ends = dict()
for c in end_cols:
    tmp = pipe_lines.loc[end_only_subset, c].apply(lambda x: [x]) * end_lens
    new_ends[c] = [x for x in chain(*tmp.values)]

In [27]:
fixed_ends = pd.DataFrame(new_ends)
fixed_ends['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_ends['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_ends = fixed_ends[all_cols]

In [28]:
fixed_ends.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,39356,88174,4291298,LOCATION_OF,C0006104,Brain,bpoc,1,C0001655,Corticotropin,aapp,1
1,39356,88174,4291298,LOCATION_OF,C0006104,Brain,bpoc,1,5443,POMC,aapp,1
2,39870,89730,14770466,USES,C0087111,Therapeutic procedure,topp,0,C0001655,Corticotropin,aapp,1
3,39870,89730,14770466,USES,C0087111,Therapeutic procedure,topp,0,5443,POMC,aapp,1
4,40015,90197,13738601,LOCATION_OF,C0022646,Kidney,bpoc,1,708,C1QBP,aapp,1


### Splitting of lines where both the subject and object contain pipes

These differ slightly in the way that they will have to be treated. First, the number of pipes in the subject and object can be different. The total number of new lines to be created is `(n+1) * (m+1)` where `n` is the number of pipes in the subject and `m` is the number of pipes in the object.

Secondly, every possible combination of subject and object will need to be made. Given subjects `A` and `B`, objects `X` and `Y`, and predicate `p`, you will need rows for the following combinations: `ApX`, `ApY`, `BpX`, `BpY`.
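The combinations described above amount to a Cartesian product of the split subjects and split objects. A minimal sketch using `itertools.product` follows; the helper `expand_both` is a hypothetical name introduced for illustration, and this is an alternative formulation, not the sort-based approach used in the cells below.

```python
from itertools import product


def expand_both(subject_ids, object_ids, predicate):
    """Return every (subject, predicate, object) combination for piped IDs."""
    subjects = subject_ids.split('|')
    objects = object_ids.split('|')
    return [(s, predicate, o) for s, o in product(subjects, objects)]


# One pipe in the subject (n=1) and two in the object (m=2): (1+1) * (2+1) = 6 rows
triples = expand_both('C0001924|213', 'C1412332|213|85302', 'PART_OF')
```

On the full table the row-wise duplication used in this notebook is likely to scale better than a Python-level loop, so the helper is best read as a statement of the intended result.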

In [29]:
start_id_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_CUI'].str.split('|')
start_name_split = pipe_lines.loc[multi_both_subset, 'SUBJECT_NAME'].str.split('|')
start_lens = start_id_split.apply(len)

end_id_split = pipe_lines.loc[multi_both_subset, 'OBJECT_CUI'].str.split('|')
end_name_split = pipe_lines.loc[multi_both_subset, 'OBJECT_NAME'].str.split('|')
end_lens = end_id_split.apply(len)

In [30]:
# Multiply the start splits by the end length, so you get end_len*start_len total rows
start_id_split = start_id_split * end_lens
start_name_split = start_name_split * end_lens

end_id_split = end_id_split * start_lens
end_name_split = end_name_split * start_lens

In [31]:
# Only sort the starts so that you get all possible combinations.
# For example, right now we have start = [A, B, C, A, B, C] and end = [X, Y, X, Y, X, Y]
# By sorting the start we will have start = [A, A, B, B, C, C] and end = [X, Y, X, Y, X, Y]
# Therefore, when combined element-wise, all possible combinations will arise

sorting_df = pd.DataFrame()
sorting_df['ID'] = start_id_split
sorting_df['NAME'] = start_name_split

sorted_start_id_split = sorting_df['ID'].apply(lambda x: sorted(x))
# Need to sort the names based on IDs so that the same name still corresponds to the same ID
sorted_start_name_split = sorting_df.apply(lambda row: [x for y,x in sorted(zip(row['ID'], row['NAME']))], axis = 1)

In [32]:
sorting_df.head()

Unnamed: 0,ID,NAME
45788,"[C0001924, 213, C0001924, 213, C0001924, 213]","[Albumins, ALB, Albumins, ALB, Albumins, ALB]"
47644,"[C0001655, 5443, C0001655, 5443]","[Corticotropin, POMC, Corticotropin, POMC]"
51488,"[C0755813, 325, 4068, C0755813, 325, 4068, C07...","[SKAP55-related protein, APCS, SH2D1A, SKAP55-..."
51517,"[C0755813, 325, 4068, C0755813, 325, 4068, C07...","[SKAP55-related protein, APCS, SH2D1A, SKAP55-..."
51548,"[C0755813, 325, 4068, C0755813, 325, 4068]","[SKAP55-related protein, APCS, SH2D1A, SKAP55-..."


In [33]:
sorted_start_id_split.head()

45788        [213, 213, 213, C0001924, C0001924, C0001924]
47644                     [5443, 5443, C0001655, C0001655]
51488    [325, 325, 325, 4068, 4068, 4068, C0755813, C0...
51517    [325, 325, 325, 4068, 4068, 4068, C0755813, C0...
51548           [325, 325, 4068, 4068, C0755813, C0755813]
Name: ID, dtype: object

In [34]:
sorted_start_name_split.head()

45788        [ALB, ALB, ALB, Albumins, Albumins, Albumins]
47644           [POMC, POMC, Corticotropin, Corticotropin]
51488    [APCS, APCS, APCS, SH2D1A, SH2D1A, SH2D1A, SKA...
51517    [APCS, APCS, APCS, SH2D1A, SH2D1A, SH2D1A, SKA...
51548    [APCS, APCS, SH2D1A, SH2D1A, SKAP55-related pr...
dtype: object

Now the algorithm continues in a similar manner to that of the subject-only or object-only corrections.

In [35]:
both_cols = all_cols[:]
both_cols.remove('SUBJECT_CUI')
both_cols.remove('SUBJECT_NAME')
both_cols.remove('OBJECT_CUI')
both_cols.remove('OBJECT_NAME')

In [36]:
new_both = dict()
for c in both_cols:
    tmp = pipe_lines.loc[multi_both_subset, c].apply(lambda x: [x]) * (start_lens * end_lens)
    new_both[c] = [x for x in chain(*tmp.values)]

In [37]:
fixed_both = pd.DataFrame(new_both)

fixed_both['SUBJECT_CUI'] = [x for x in chain(*sorted_start_id_split.values)]
fixed_both['SUBJECT_NAME'] = [x for x in chain(*sorted_start_name_split.values)]

fixed_both['OBJECT_CUI'] = [x for x in chain(*end_id_split.values)]
fixed_both['OBJECT_NAME'] = [x for x in chain(*end_name_split.values)]

fixed_both = fixed_both[all_cols]

In [38]:
fixed_both.head()

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
0,84787,211961,5634523,PART_OF,213,ALB,bacs,1,C1412332,ALB gene,aapp,1
1,84787,211961,5634523,PART_OF,213,ALB,bacs,1,213,ALB,aapp,1
2,84787,211961,5634523,PART_OF,213,ALB,bacs,1,85302,FBF1,aapp,1
3,84787,211961,5634523,PART_OF,C0001924,Albumins,bacs,1,C1412332,ALB gene,aapp,1
4,84787,211961,5634523,PART_OF,C0001924,Albumins,bacs,1,213,ALB,aapp,1


Recombine these lines into the new dataframe.

In [39]:
sem_df = pd.concat([good_lines, fixed_starts, fixed_ends, fixed_both]).reset_index(drop=True)

In [40]:
print('The data now contains {:,} rows'.format(sem_df.shape[0]))

The data now contains 104,929,678 rows


# Normalizing Gene IDs to CUIs

Sometimes genes appear with a CUI as an identifier; other times they have an Entrez Gene ID.

In [41]:
# POMC Gene is 5443 and has CUI C1337111
sem_df.query('SUBJECT_CUI == "C1337111"').head(2)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
89886940,48911,115433,13029321,TREATS,C1337111,POMC gene,aapp,1,C0023449,"Leukemia, Lymphocytic, Acute",neop,1
89887794,123811,320122,13584735,NEG_AFFECTS,C1337111,POMC gene,gngm,1,C0232804,Renal function,ortf,1


In [42]:
sem_df.query('SUBJECT_CUI == "5443"').head(2)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
589719,633940,1759696,13841870,INHIBITS,5443,POMC,aapp,1,C0001655,Corticotropin,aapp,1
2707630,2792160,6541453,4366303,INTERACTS_WITH,5443,POMC,aapp,1,C0001655,Corticotropin,aapp,1


## Strategy for normalizing to CUI

mygene.info now includes UMLS data, so we will use it as a reliable, up-to-date source for mapping. For IDs that cannot be mapped through mygene.info, we will fall back on HGNC mappings, since UMLS contains HGNC IDs.

In [43]:
import mygene
mg = mygene.MyGeneInfo()

In [44]:
gene_lines = ~sem_df['SUBJECT_CUI'].str.startswith('C')
genes_entrez = set(sem_df.loc[gene_lines, 'SUBJECT_CUI'])

gene_lines1 = ~sem_df['OBJECT_CUI'].str.startswith('C')
genes_entrez.update(set(sem_df.loc[gene_lines1, 'OBJECT_CUI']))

genes_need_fixing = gene_lines | gene_lines1

In [45]:
print("{} genes appear with Entrez IDs that will need to be mapped".format(len(genes_entrez)))

19503 genes appear with Entrez IDs that will need to be mapped


In [46]:
# Query Mygene.info and make the result a DataFrame
mg_result = mg.getgenes(list(genes_entrez), fields='symbol,name,umls,HGNC', dotfield=True)
mg_result = pd.DataFrame(mg_result)
mg_result.columns = [c.replace('.', '_') for c in mg_result.columns]

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-19503...done.


In [47]:
e_to_cui = mg_result.dropna(subset=['umls_cui']).set_index('query')['umls_cui'].to_dict()
print("{} out of {} Entrez IDs can be mapped to CUI via mygene.info".format(len(e_to_cui), len(genes_entrez)))

18913 out of 19503 Entrez IDs can be mapped to CUI via mygene.info


### Get some more mappings from umls

Although UMLS does not directly map Entrez Gene IDs to UMLS CUIs, it does contain HGNC IDs. Some Entrez-to-HGNC mappings were picked up from mygene, so they will be used to further increase the map's size.
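The two-stage lookup (direct mapping first, HGNC as a fallback) can be sketched with plain dictionaries. The function name `build_entrez_to_cui` and all IDs other than the POMC pair are made up for illustration; only `5443` to `C1337111` comes from this notebook.

```python
def build_entrez_to_cui(entrez_ids, direct_map, entrez_to_hgnc, hgnc_to_cui):
    """Map Entrez IDs to CUIs, preferring the direct map over the HGNC route."""
    mapping = {}
    for entrez in entrez_ids:
        if entrez in direct_map:
            # Stage 1: a direct Entrez -> CUI mapping is available
            mapping[entrez] = direct_map[entrez]
            continue
        # Stage 2: go Entrez -> HGNC -> CUI when both hops exist
        hgnc = entrez_to_hgnc.get(entrez)
        if hgnc in hgnc_to_cui:
            mapping[entrez] = hgnc_to_cui[hgnc]
    return mapping


direct = {'5443': 'C1337111'}          # POMC, as shown earlier in this notebook
to_hgnc = {'900001': 'HGNC:900001'}    # hypothetical IDs
to_cui = {'HGNC:900001': 'C9000001'}   # hypothetical IDs

result = build_entrez_to_cui(['5443', '900001', '900002'], direct, to_hgnc, to_cui)
```

IDs with no route to a CUI, such as the made-up `900002`, are simply absent from the result, which mirrors how unmapped IDs are left unchanged and separated out later.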

In [48]:
# Fix mygene result so it has HGNC: at start of HGNC ids
mg_result['HGNC'] = 'HGNC:' + mg_result['HGNC']

In [49]:
need_map = genes_entrez - set(e_to_cui.keys())
e_to_hgnc = mg_result.query('query in @need_map').dropna(subset=['HGNC']).set_index('query')['HGNC'].to_dict()

hgnc_ids = list(e_to_hgnc.values())

len(hgnc_ids)

45

In [50]:
# Get the values from the UMLS metathesaurus
conso = load_umls.open_mrconso()
q_res = conso.query('SCUI in @hgnc_ids and TTY == "MTH_ACR"')
len(q_res)

33

In [51]:
hgnc_to_cui = q_res.set_index('SCUI')['CUI'].to_dict()
e_to_cui_1 = {k: hgnc_to_cui[v] for k, v in e_to_hgnc.items() if v in hgnc_to_cui.keys()}

e_to_cui = {**e_to_cui_1, **e_to_cui}

In [52]:
print("{} out of {} Entrez IDs now mapped".format(len(e_to_cui), len(genes_entrez)))

18946 out of 19503 Entrez IDs now mapped


### Generate a CUI to name map


Once the Entrez IDs are changed to CUIs, the existing names will no longer be the true names associated with those CUIs. We will use the UMLS Metathesaurus to look up the correct names, which ensures that the correct UMLS name appears with each CUI.
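The name map in the following cells is built by layering several dictionaries, so precedence matters. A small sketch of how `{**a, **b}` resolves conflicts is given here; the dictionary names and the alternative label are hypothetical.

```python
# Names already present in SemmedDB for concepts that have a CUI
from_semmed = {'C1337111': 'POMC gene'}

# Names from a second source; one key overlaps with the first dictionary
from_second_source = {
    'C1337111': 'hypothetical alternative label',
    'C9000001': 'hypothetical gene',
}

# Later unpacking wins, so entries from from_semmed take precedence on conflicts
merged = {**from_second_source, **from_semmed}
```

Placing the existing map last therefore keeps the names already in use and only adds entries for CUIs that had no name.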


In [53]:
# First get the names from semmed for everything that already has a CUI
d = sem_df[sem_df['SUBJECT_CUI'].str.startswith('C')].set_index('SUBJECT_CUI')['SUBJECT_NAME'].to_dict()
od = sem_df[sem_df['OBJECT_CUI'].str.startswith('C')].set_index('OBJECT_CUI')['OBJECT_NAME'].to_dict()

c_to_name_dict = {**d, **od}

In [54]:
len(c_to_name_dict)

261315

In [55]:
need_name = set(e_to_cui.values()) - set(c_to_name_dict.keys())
len(need_name)

9823

In [56]:
# Get as many names as possible directly from UMLS
# ISPREF == "Y" keeps only the atoms flagged as preferred names
c_to_name_1 = conso.query('CUI in @need_name and ISPREF == "Y"').set_index('CUI')['STR'].to_dict()
len(c_to_name_1)

9808

In [57]:
c_to_name_dict = {**c_to_name_1, **c_to_name_dict}
to_query = set(e_to_cui.values()) - set(c_to_name_dict.keys())
len(to_query)

15

In [58]:
# Most names are the Gene symbol + 'gene' so we'll use that for the remainder
name_from_mygene = mg_result.query('umls_cui in @to_query').set_index('umls_cui')['symbol'].to_dict()

# Make sure those mapped from mygene via HGNC have names
to_query_1 = [k for k, v in hgnc_to_cui.items() if v in to_query]
hgnc_to_name = mg_result.query('HGNC in @to_query_1').set_index('HGNC')['symbol'].to_dict()

name_from_mygene.update({hgnc_to_cui[k]: v for k, v in hgnc_to_name.items()})

name_from_mygene = {k: v+' gene' for k, v in name_from_mygene.items()}

In [59]:
# Ensure that all mappable genes now have a mappable name
c_to_name_dict = {**name_from_mygene, **c_to_name_dict}
to_query = set(e_to_cui.values()) - set(c_to_name_dict.keys())
len(to_query)

0

In [60]:
len(c_to_name_dict)

271138

In [61]:
with open('../data/cui_to_name.pkl', 'wb') as out_file:
    pickle.dump(c_to_name_dict, out_file)
with open('../data/entrez_to_cui.pkl', 'wb') as out_file:
    pickle.dump(e_to_cui, out_file)

In [62]:
# Check that the mapper produces the correct name when given the CUI for POMC gene
c_to_name_dict[e_to_cui['5443']]

'POMC gene'

### Apply the Changes

In [63]:
genes_need_fixing = gene_lines | gene_lines1

sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
2815,41841,95436,4892304,PART_OF,820,CAMP,gngm,1,C0949876,Nazis,humn,1
3344,42376,96958,20342456,LOCATION_OF,C0003062,Animals,anim,0,55578,SUPT20H,aapp,1
5126,44178,102317,13930654,PART_OF,C0036825,Serum Proteins,bacs,1,567,B2M,aapp,1
5205,44257,102517,13738792,LOCATION_OF,C0459385,Brain tissue,bpoc,1,23038,WDTC1,aapp,1
6278,45345,105754,13310143,PART_OF,115825,WDFY2,gngm,1,C0402112,Scientist,humn,1
6486,45556,106326,4872999,AFFECTS,1652,DDT,gngm,1,C0018270,Growth,orgf,1
7581,46657,109086,4221862,PART_OF,23038,WDTC1,gngm,1,C0034652,Rana esculenta,amph,1
7616,46692,109086,4221862,PART_OF,51761,ATP8A2,gngm,1,C0034652,Rana esculenta,amph,1
7660,46736,109086,4221862,PART_OF,51761,ATP8A2,gngm,1,C0260307,Triturus cristatus,amph,1
7700,46776,109086,4221862,PART_OF,23038,WDTC1,gngm,1,C0260307,Triturus cristatus,amph,1


In [64]:
sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_CUI'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: e_to_cui.get(e,e))

In [65]:
sem_df.loc[genes_need_fixing, 'SUBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'SUBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))
sem_df.loc[genes_need_fixing, 'OBJECT_NAME'] = sem_df.loc[genes_need_fixing, 'OBJECT_CUI'].apply(lambda e: c_to_name_dict.get(e,e))

In [66]:
sem_df[genes_need_fixing].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
2815,41841,95436,4892304,PART_OF,C1413106,CAMP gene,gngm,1,C0949876,Nazis,humn,1
3344,42376,96958,20342456,LOCATION_OF,C0003062,Animals,anim,0,C1539449,C13ORF19,aapp,1
5126,44178,102317,13930654,PART_OF,C0036825,Serum Proteins,bacs,1,C1412709,B2M gene,aapp,1
5205,44257,102517,13738792,LOCATION_OF,C0459385,Brain tissue,bpoc,1,C1428806,DCAF9,aapp,1
6278,45345,105754,13310143,PART_OF,C1426966,WD REPEAT- AND FYVE DOMAIN-CONTAINING PROTEIN 2,gngm,1,C0402112,Scientist,humn,1
6486,45556,106326,4872999,AFFECTS,C1413950,DDT gene,gngm,1,C0018270,Growth,orgf,1
7581,46657,109086,4221862,PART_OF,C1428806,DCAF9,gngm,1,C0034652,Rana esculenta,amph,1
7616,46692,109086,4221862,PART_OF,C1366628,ATP8A2 gene,gngm,1,C0034652,Rana esculenta,amph,1
7660,46736,109086,4221862,PART_OF,C1366628,ATP8A2 gene,gngm,1,C0260307,Triturus cristatus,amph,1
7700,46776,109086,4221862,PART_OF,C1428806,DCAF9,gngm,1,C0260307,Triturus cristatus,amph,1


### The lines that got no Query Result

Some of the IDs produced no query result. They should still have a SUBJECT_CUI or OBJECT_CUI containing only a number. We'll examine those to see if they provide any insight.
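The cell below identifies these rows by checking whether the ID starts with `C`. A stricter check, sketched here as an optional variant, tests for the standard CUI shape of `C` followed by seven digits. The sample values are taken from the tables in this notebook, apart from `C123`, which is a made-up malformed ID.

```python
import pandas as pd

ids = pd.Series(['C0234222', '8066', '100188776', 'C123'])

# A CUI is the letter C followed by exactly seven digits
is_cui = ids.str.match(r'^C\d{7}$')

# Everything else still needs a mapping or should be set aside
unmapped = ids[~is_cui]
```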

In [67]:
gene_lines2 = ~sem_df['SUBJECT_CUI'].str.startswith('C')
gene_lines3 = ~sem_df['OBJECT_CUI'].str.startswith('C')

genes_need_fixing1 = gene_lines2 | gene_lines3

sem_df[genes_need_fixing1].head(10)

Unnamed: 0,PREDICATION_ID,SENTENCE_ID,PMID,PREDICATE,SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY,OBJECT_CUI,OBJECT_NAME,OBJECT_SEMTYPE,OBJECT_NOVELTY
54563,94072,236814,5699417,AFFECTS,8066,8066,gngm,1,C0234222,Baresthesia,ortf,1
78984,118725,305854,13032397,USES,C0033573,Prostatectomy,topp,1,100188776,100188776,aapp,1
144137,184485,486340,13388624,PART_OF,C0870883,Metabolites,bacs,0,619511,619511,aapp,1
146503,186864,493343,4961715,PART_OF,57306,57306,gngm,1,C0162547,Pseudomonas Phages,virs,1
286382,327941,894450,5637040,DISRUPTS,100188784,100188784,gngm,1,C1328948,RNA Synthesis,moft,1
288201,329778,899989,5217476,PART_OF,C0035668,RNA,bacs,0,100271694,100271694,aapp,1
315546,357320,980127,5723528,PART_OF,474256,474256,gngm,1,C0004651,Bacteriophages,virs,1
327550,369417,1015074,14192648,PART_OF,C0035668,RNA,bacs,0,100271694,100271694,aapp,1
406805,449380,1248536,5638139,PART_OF,544326,544326,gngm,1,C0007125,"Carcinoma, Ehrlich Tumor",neop,1
421409,464105,1291919,14114111,ASSOCIATED_WITH,100188864,100188864,gngm,1,C0027708,Nephroblastoma,neop,1


In [68]:
len(sem_df[genes_need_fixing1])

96693

For now we will save these to their own file and remove them from the 'cleaned' data.

In [69]:
sem_df[genes_need_fixing1].to_csv('../data/semmedVER31_R_no_CUI.csv', index=False)

In [70]:
sem_df = sem_df.drop(sem_df[genes_need_fixing1].index)
sem_df.to_csv('../data/semmedVER31_R_clean.csv', index=False)

## Remove Deprecated CUIs

Some CUIs in the database are deprecated. They may have newer versions to which they have not yet been mapped. However, UMLS keeps a record of these deprecated values, which can be used to map old CUIs to their current replacements.
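The remapping step can be sketched on a small Series. The `C0000002` to `C0007404` pair appears in the retired-CUI table shown below; `C9999999` is a hypothetical retired CUI with no replacement, included to show why a `dropna` is needed afterwards.

```python
import pandas as pd

# Old CUI -> new CUI; None marks a concept that was removed without a replacement
cui_map = {'C0000002': 'C0007404', 'C9999999': None}

cuis = pd.Series(['C0000002', 'C0001655', 'C9999999'])

# Replace retired CUIs and leave everything else untouched
mapped = cuis.apply(lambda c: cui_map.get(c, c))

# Rows whose CUI was removed outright end up missing and are dropped
kept = mapped.dropna()
```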

In [71]:
# Get the map from old CUIs to new CUIs
retired_cui = load_umls.open_mrcui()
retired_cui.head(2)


Unnamed: 0,CUI1,VER,REL,RELA,MAPREASON,CUI2,MAPIN
0,C0000002,2000AC,SY,,,C0007404,Y
1,C0000003,1999AA,SY,,,C0010504,Y


In [72]:
# Make a mapper from the old to the new
cui_map = retired_cui.set_index('CUI1')['CUI2'].to_dict()

# Ensure we have names for all the new values
no_name = set(cui_map.values()) - set(c_to_name_dict.keys())
# Default to an empty result so the count below is defined even when no names are missing
query_result = {}
if len(no_name) > 0:
    query_result = conso.query('CUI in @no_name and ISPREF == "Y"').set_index('CUI')['STR'].to_dict()
    c_to_name_dict.update(query_result)
    with open('../data/cui_to_name.pkl', 'wb') as out_file:
        pickle.dump(c_to_name_dict, out_file)
print('{} concept identifiers could not be mapped to a name'.format(len(no_name) - len(query_result)))

94 concept identifiers could not be mapped to a name


In [73]:
# How many unique S-P-O triples before de-deprecation?
'{:,} Unique S-P-O triples before de-deprecation'.format(len(sem_df.drop_duplicates(subset=['SUBJECT_CUI', 'PREDICATE', 'OBJECT_CUI'])))

'21,626,024 Unique S-P-O triples before de-deprecation'

In [74]:
# Map the deprecated values to their new CUIs
sem_df['SUBJECT_CUI'] = sem_df['SUBJECT_CUI'].apply(lambda c: cui_map.get(c, c))
sem_df['OBJECT_CUI'] = sem_df['OBJECT_CUI'].apply(lambda c: cui_map.get(c, c))

# Any removed CUIs should be taken out
sem_df = sem_df.dropna(subset=['SUBJECT_CUI', 'OBJECT_CUI'])

# Ensure the names are now corrected
sem_df['SUBJECT_NAME'] = sem_df['SUBJECT_CUI'].apply(lambda c: c_to_name_dict.get(c, c))
sem_df['OBJECT_NAME'] = sem_df['OBJECT_CUI'].apply(lambda c: c_to_name_dict.get(c, c))

# How many unique spo triples after the corrections?
'{:,} Unique S-P-O triples after de-deprecation'.format(len(sem_df.drop_duplicates(subset=['SUBJECT_CUI', 'PREDICATE', 'OBJECT_CUI'])))

'21,416,739 Unique S-P-O triples after de-deprecation'

In [75]:
sem_df.to_csv('../data/semmedVER31_R_clean_de-depricate.csv', index=False)