# Condensing Edge Semantics

The computational complexity of the Rephetio algorithm, which will be used for machine learning, depends heavily on the number of potential metapaths between the source and target metanodes of the edge to be predicted.
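
As a rough illustration of why the number of metaedges matters, the sketch below counts the metapaths along one fixed chain of metanodes. The metanode names and the metaedge counts are made up for illustration and are not taken from the data used in this notebook.

```python
def count_metapaths(metaedge_counts, path):
    """Multiply the number of metaedge types available at each hop of a metanode path."""
    total = 1
    for a, b in zip(path, path[1:]):
        total *= metaedge_counts[frozenset((a, b))]
    return total

# Hypothetical numbers of metaedge types between pairs of metanodes
before = {frozenset(('Compound', 'Gene')): 10, frozenset(('Gene', 'Disease')): 8}
after = {frozenset(('Compound', 'Gene')): 3, frozenset(('Gene', 'Disease')): 3}

path = ['Compound', 'Gene', 'Disease']
n_before = count_metapaths(before, path)
n_after = count_metapaths(after, path)
print(n_before, n_after)
```

Because the counts multiply at every hop, trimming the metaedges between each pair of metanodes shrinks the number of metapaths much faster than linearly.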

For this reason, we will try to reduce the number of different semantic edge types that connect any two metanodes. We will attempt to condense to at most 3 metaedges connecting any given pair of metanodes: one reflecting a positive association (increases, augments, causes, upregulates, etc.), one a negative association (decreases, disrupts, mitigates, downregulates, etc.), and one a neutral association (associated with, affects, method of, etc.). Some semantic types are also not useful in this context and will be removed (any negating concepts, higher than, compared with, etc.).

[This Map File](https://github.com/mmayers12/semmed/blob/semmed_ver31/data/edge_condense_map.csv) details all of the mappings applied during the semantic condensation.
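
A minimal sketch of the idea, assuming a hand-written lookup in place of the real map file. The predicate names and their assignments here are hypothetical examples and do not reproduce the contents of `edge_condense_map.csv`.

```python
# Hypothetical predicate-to-class lookup, for illustration only
polarity = {
    'INCREASES': 'positive',
    'CAUSES': 'positive',
    'DECREASES': 'negative',
    'DISRUPTS': 'negative',
    'ASSOCIATED_WITH': 'neutral',
    'AFFECTS': 'neutral',
    'COMPARED_WITH': None,  # treated as not useful here, so it is dropped
}

predicates = ['INCREASES', 'CAUSES', 'DISRUPTS', 'AFFECTS', 'COMPARED_WITH']

# Keep only the predicates that map to one of the three classes
condensed = [polarity[p] for p in predicates if polarity[p] is not None]
print(condensed)
print(sorted(set(condensed)))
```

Five distinct predicates collapse to at most three classes, and the dropped predicate disappears entirely.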

In [1]:
import os
import pickle
%matplotlib inline
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from collections import defaultdict

import sys
sys.path.append('../../hetnet-ml/src')
import graph_tools as gt

In [2]:
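from ast import literal_eval
nodes = gt.remove_colons(pd.read_csv('../data/nodes_VER31_R_nodes_consolidated.csv'))
# literal_eval parses the stored set literals without executing arbitrary code
edges = gt.remove_colons(pd.read_csv('../data/edges_VER31_R_nodes_consolidated.csv', converters={'pmids': literal_eval}))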

In [3]:
start_edge_num = len(edges)

In [4]:
nodes.head()

Unnamed: 0,id,name,label,id_source
0,C0418940,Change of employment,Activities & Behaviors,UMLS
1,C0871147,Professional Development,Activities & Behaviors,UMLS
2,D013221,State Health Plans,Activities & Behaviors,MeSH
3,C0336931,Waterskiing,Activities & Behaviors,UMLS
4,C0678998,literacy skills,Activities & Behaviors,UMLS


In [5]:
edges.head(10)

Unnamed: 0,start_id,end_id,type,pmids,n_pmids
0,C0556656,D010820,ADMINISTERED_TO_ABatLB,"{12657109, 11588447}",2
1,C0441648,C0680063,ADMINISTERED_TO_ABatLB,{9329121},1
2,C0441648,C0555052,ADMINISTERED_TO_ABatLB,{5579492},1
3,D012648,D017028,ADMINISTERED_TO_ABatLB,"{20584280, 15332425, 18192629, 26294569}",4
4,C0556656,D000072142,ADMINISTERED_TO_ABatLB,{8427562},1
5,C0018581,D009726,ADMINISTERED_TO_ABatLB,{21612613},1
6,D012648,D017741,ADMINISTERED_TO_ABatLB,"{25412401, 2816907, 16263886, 570679}",4
7,C0556656,D000368,ADMINISTERED_TO_ABatLB,"{22835737, 12450902}",2
8,D012648,D018576,ADMINISTERED_TO_ABatLB,{29169933},1
9,C0556656,D002648,ADMINISTERED_TO_ABatLB,"{15626941, 1780381, 25885095}",3


In [6]:
def sanitize(x):
    """Some pmids look like '2015332 [3]' for some reason; keep only the leading ID."""
    if isinstance(x, str) and ' ' in x:
        x = x.split(' ')[0]
    return x

# Some pmids appear as strings, e.g. row 6. They should all be int
edges['pmids'] = edges['pmids'].apply(lambda ids: {int(sanitize(x)) for x in ids})

In [7]:
edge_map = pd.read_csv('../data/edge_condense_map.csv')

In [8]:
edge_map.head(2)

Unnamed: 0,original_edge,condensed_to,relationship,reverse,node_semtypes
0,AFFECTS_ABafAB,AFFECTS_ABafAB,neutral,False,Activities & Behaviors --- Activities & Behaviors
1,PREDISPOSES_ABpsAB,AFFECTS_ABafAB,neutral,False,Activities & Behaviors --- Activities & Behaviors


In [9]:
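def change_edge_type(from_type, to_type, swap=False):
    """Relabel every edge of `from_type` as `to_type` in the global `edges`, optionally swapping start and end."""
    idx = edges.query('type == @from_type').index
    edges.loc[idx, 'type'] = to_type
    if swap:
        # Copy first so the swap never reads back already-overwritten values
        tmp = edges.loc[idx, 'start_id'].copy()
        edges.loc[idx, 'start_id'] = edges.loc[idx, 'end_id']
        edges.loc[idx, 'end_id'] = tmp

def merge_edge_types(from_list, to_type, swap=False):
    """Relabel every edge type in `from_list` as `to_type`."""
    for from_type in from_list:
        change_edge_type(from_type, to_type, swap=swap)

def drop_edges_from_list(drop_edges):
    """Remove all edges whose type is in `drop_edges` from the global `edges`."""
    idx = edges.query('type in @drop_edges').index
    edges.drop(idx, inplace=True)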

In [10]:
# Order is important here
# Previous iterations of this pipeline had multiple rounds of edge condensation
# so some edges will be changed multiple times, and going through the .csv in row order
# ensures that these changes are all applied correctly.
for row in tqdm(edge_map.itertuples(), total=len(edge_map)):
    change_edge_type(row.original_edge, row.condensed_to, swap=row.reverse)
edges = edges.dropna(subset=['type']).reset_index(drop=True)

100%|██████████| 292/292 [05:27<00:00,  1.05s/it]
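
To see why row order matters, here is a small sketch that applies mapping rows one at a time to a list of edge types. The helper `apply_in_order`, the type names, and the mapping rows are all hypothetical and only mimic the row-by-row loop above.

```python
def apply_in_order(types, mapping_rows):
    """Apply (original, condensed) pairs one row at a time."""
    types = list(types)
    for original, condensed in mapping_rows:
        types = [condensed if t == original else t for t in types]
    return types

# Hypothetical edge types and mapping rows
types = ['PREDISPOSES', 'AFFECTS', 'TREATS']
rows = [('PREDISPOSES', 'AFFECTS'), ('AFFECTS', 'ASSOCIATED_WITH')]

in_order = apply_in_order(types, rows)
reversed_order = apply_in_order(types, rows[::-1])
print(in_order)
print(reversed_order)
```

In file order the first type is carried through both rounds of condensation. With the rows reversed it stops after one round, leaving an intermediate type behind.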


In [11]:
edges['type'].nunique()

2687

## Fix Potential problems of duplicated undirected edges

Similar to the issue at the end of notebook `01-building-the-hetnet`, swapping the direction of some semantics may have produced edges whose start and end metanodes are reversed relative to other edges of the same metaedge type.
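
The check can be sketched on a single toy edge: rebuild the abbreviation from the start label, the semantic, and the end label, then compare it with the abbreviation stored in the edge type. The labels, the abbreviation dictionary, and the helper `expected_abbrev` below are hypothetical stand-ins for what `gt.get_abbrev_dict_and_edge_tuples` returns.

```python
# Hypothetical abbreviations, for illustration only
abv = {'Chemicals & Drugs': 'CD', 'Disorders': 'DO', 'TREATS': 't'}

def expected_abbrev(start_label, sem, end_label, directed=False):
    """Build the metaedge abbreviation implied by an edge's node labels."""
    arrow = '>' if directed else ''
    return abv[start_label] + abv[sem] + arrow + abv[end_label]

edge_type = 'TREATS_CDtDO'
sem, abbrev = edge_type.rsplit('_', 1)

# Start and end in the order that the edge type expects
ok = expected_abbrev('Chemicals & Drugs', sem, 'Disorders') == abbrev
# Start and end accidentally reversed
flipped = expected_abbrev('Disorders', sem, 'Chemicals & Drugs') == abbrev
print(ok, flipped)
```

An edge whose nodes are reversed yields `DOtCD` instead of `CDtDO`, so comparing the two abbreviations flags it.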

In [12]:
abv, met = gt.get_abbrev_dict_and_edge_tuples(gt.add_colons(nodes), gt.add_colons(edges))

In [13]:
id_to_label = nodes.set_index('id')['label'].to_dict()

In [14]:
edges['start_label'] = edges['start_id'].map(lambda c: id_to_label[c])
edges['end_label'] = edges['end_id'].map(lambda c: id_to_label[c])
edges['sem'] = edges['type'].map(lambda e: '_'.join(e.split('_')[:-1]))

edges['abbrev'] = edges['type'].map(lambda e: e.split('_')[-1])

proper_abbrevs = []
for e in tqdm(edges.itertuples(), total=len(edges)):
    if '>' in e.abbrev:
        abbrev = abv[e.start_label] + abv[e.sem] + '>' + abv[e.end_label]
    else:
        abbrev = abv[e.start_label] + abv[e.sem] + abv[e.end_label]
    proper_abbrevs.append(abbrev)
    
edges['calc_abbrev'] = proper_abbrevs

100%|██████████| 14042303/14042303 [00:34<00:00, 411293.42it/s]


In [15]:
edges.head(2)

Unnamed: 0,start_id,end_id,type,pmids,n_pmids,start_label,end_label,sem,abbrev,calc_abbrev
0,C0556656,D010820,ADMINISTERED_TO_ABatLB,"{12657109, 11588447}",2,Activities & Behaviors,Living Beings,ADMINISTERED_TO,ABatLB,ABatLB
1,C0441648,C0680063,ADMINISTERED_TO_ABatLB,{9329121},1,Activities & Behaviors,Living Beings,ADMINISTERED_TO,ABatLB,ABatLB


In [16]:
idx = edges['calc_abbrev'] != edges['abbrev']
idx.sum()  # This should be Zero! If so then we're good to GO on this potential issue!

0

### Undirected Edges between two nodes of the same type should have only 1 instance

`Compound_1 -- Compound_2` is the same as `Compound_2 -- Compound_1`, so look for these types of duplications and eliminate them.
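
One way to detect these duplicates is to use the sorted pair of IDs as a canonical key, so both orientations land in the same bucket. A minimal sketch on made-up IDs and pmids:

```python
from collections import defaultdict

# Hypothetical undirected edges of a single type: (start_id, end_id, pmids)
toy_edges = [
    ('C2', 'C1', {101}),
    ('C1', 'C2', {102, 103}),
    ('C1', 'C3', {104}),
]

merged = defaultdict(set)
for start_id, end_id, pmids in toy_edges:
    # Sorting makes ('C2', 'C1') and ('C1', 'C2') the same key
    key = tuple(sorted((start_id, end_id)))
    merged[key] |= pmids

print(dict(merged))
```

The two orientations of the `C1`/`C2` edge collapse into a single entry that carries the union of their pmids.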

In [17]:
# Get the edges that are un-directed, between same type
idx = edges['start_label'] == edges['end_label']

self_refferential_types = edges.loc[idx, 'type'].unique()
self_refferential_types = [e for e in self_refferential_types if '>' not in e]

In [18]:
# Get a sorted CUI Map

edge_map = {}

for kind in tqdm(self_refferential_types):
    pmid_map = defaultdict(set)
    subedges = edges.query('type == @kind')
    
    for row in subedges.itertuples():
        # Sort the IDs so that (A, B) and (B, A) share one key
        edge_id = tuple(sorted([row.start_id, row.end_id]))
        pmid_map[edge_id] = pmid_map[edge_id].union(row.pmids)

    # Store the finished map once per edge type
    edge_map[kind] = pmid_map

100%|██████████| 287/287 [01:56<00:00,  2.76it/s]


In [19]:
# Convert back to a DataFrame
kinds = []
start_ids = []
end_ids = []
pmids = []

for kind, e_dict in edge_map.items():
    for (s_id, e_id), pms in e_dict.items():
        kinds.append(kind)
        start_ids.append(s_id)
        end_ids.append(e_id)
        pmids.append(pms)
        
fixed_edges = pd.DataFrame({'start_id': start_ids, 'end_id': end_ids, 'type': kinds, 'pmids': pmids})

In [20]:
print('Before De-duplication: {:,} Edges between nodes of the same type'.format(len(edges.loc[idx])))
print('After De-duplication: {:,} Edges between nodes of the same type'.format(len(fixed_edges)))

Before De-duplication: 3,280,678 Edges between nodes of the same type
After De-duplication: 1,952,887 Edges between nodes of the same type


In [21]:
# Remove all edges between nodes of the same type (the potential duplicates)
# Note: idx also covers any directed ('>') same-type edges, which are not in fixed_edges
print('Total Edges: {:,}'.format(len(edges)))
edges.drop(idx[idx].index, inplace=True)
print('Edges between two different Metanodes: {:,}'.format(len(edges)))

# Then add back in all de-duplicated edges
edges = pd.concat([edges, fixed_edges], sort=False)
print('Total edges with De-duped edges added back: {:,}'.format(len(edges)))

Total Edges: 14,042,303
Edges between two different Metanodes: 10,761,625
Total edges with De-duped edges added back: 12,714,512


In [22]:
edges = edges[['start_id', 'end_id', 'type', 'pmids']]

In [23]:
edges.head()

Unnamed: 0,start_id,end_id,type,pmids
0,C0556656,D010820,ADMINISTERED_TO_ABatLB,"{12657109, 11588447}"
1,C0441648,C0680063,ADMINISTERED_TO_ABatLB,{9329121}
2,C0441648,C0555052,ADMINISTERED_TO_ABatLB,{5579492}
3,D012648,D017028,ADMINISTERED_TO_ABatLB,"{20584280, 15332425, 18192629, 26294569}"
4,C0556656,D000072142,ADMINISTERED_TO_ABatLB,{8427562}


Finish de-duplication by merging the pmids of any edges that are now duplicated.
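
Condensation can leave several rows with the same start, end, and type, for example when two original semantics between the same pair of nodes were mapped to one type. The sketch below shows the intended result on made-up rows, using a plain dictionary instead of the pandas `groupby` used in this notebook.

```python
from collections import defaultdict

# Hypothetical rows after condensation: (start_id, end_id, type, pmids)
rows = [
    ('C1', 'D1', 'AFFECTS_CDafDO', {1, 2}),
    ('C1', 'D1', 'AFFECTS_CDafDO', {2, 3}),  # same edge, reached from another semantic
    ('C1', 'D2', 'AFFECTS_CDafDO', {4}),
]

merged = defaultdict(set)
for start_id, end_id, edge_type, pmids in rows:
    merged[(start_id, end_id, edge_type)] |= pmids

# Re-count the pmids once the sets have been merged
deduped = [(s, e, t, pm, len(pm)) for (s, e, t), pm in merged.items()]
print(deduped)
```

A pmid shared by both duplicate rows is counted once, which is why the counts are recomputed after merging and not summed.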

In [24]:
%%time

before_dedup = len(edges)

# Some edges now duplicated, de-duplicate and combine pmids
grpd = edges.groupby(['start_id', 'end_id', 'type'])
edges = grpd['pmids'].apply(lambda pmid_sets: set.union(*pmid_sets.values)).reset_index()

# re-count the pmid numbers
edges['n_pmids'] = edges['pmids'].apply(len)

after_dedup = len(edges)

CPU times: user 19min 56s, sys: 21.2 s, total: 20min 17s
Wall time: 20min 17s


In [25]:
print('Edges before final Deduplication: {:,}'.format(before_dedup))
print('Edges after final Deduplication: {:,}'.format(after_dedup))

Edges before final Deduplication: 12,714,512
Edges after final Deduplication: 11,504,961


In [26]:
# Sort values before writing to disk
nodes = nodes.sort_values('label')
edges = edges.sort_values('type')

# Add in colons required by neo4j
nodes = gt.add_colons(nodes)
edges = gt.add_colons(edges)

nodes.to_csv('../data/nodes_VER31_R_consolidated_condensed.csv', index=False)
edges.to_csv('../data/edges_VER31_R_consolidated_condensed.csv', index=False)