Merging Nodes Connected by Edges Labeled "I_CODE", "N_Name", "R_exactMatch", "R_equivalentClass", and "PAYLOAD"

The merged node list produced by this code can be found here, and the modified edge list can be found here. The node list now contains 1,558 nodes, and the edge list contains 107,515 edges. A general overview of the steps taken to modify these lists is below.
 
By iterating through the edge list, we build a 2D Python list of nodes to be merged (tbm_list). For each edge with the label "I_CODE", "N_Name", "R_exactMatch", "R_equivalentClass", or "PAYLOAD", the Source and Target nodes are recorded. If only the Source node is already in tbm_list, the Target node is appended to the Source node's entry (and vice versa). If both the Source and Target nodes are already in tbm_list, their respective entries are merged, unless they already share an entry. If neither the Source nor the Target node is in tbm_list, a new entry is created containing them both. Thus, each entry in tbm_list is a list of the Id numbers of nodes that must all be merged with each other, and no node appears in more than one entry. For example, if tbm_list = [[A, B], [C, D, E]], then nodes A and B must be merged into one node, and nodes C, D, and E must be merged into another.
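
The grouping rules above can be sketched in plain Python, without pandas. This is only an illustrative sketch, not the notebook's own code: the function name build_merge_groups, the (source, target, label) tuple layout, and the toy Ids are all assumptions made for the example. It keeps a dictionary from node Id to group index, so membership is a lookup rather than a scan of every entry.

```python
def build_merge_groups(edges, merge_labels):
    """Group node Ids linked, directly or transitively, by merge edges.

    edges: iterable of (source, target, label) tuples (hypothetical layout).
    Returns a list of sorted lists, one per group, in the spirit of tbm_list.
    """
    groups = []     # list of sets; None marks a group absorbed by another
    group_of = {}   # node Id -> index of its group in `groups`

    for source, target, label in edges:
        if label not in merge_labels:
            continue
        s = group_of.get(source)
        t = group_of.get(target)
        if s is None and t is None:
            # neither node seen yet: start a new group with both
            groups.append({source, target})
            group_of[source] = group_of[target] = len(groups) - 1
        elif t is None:
            # only Source is known: add Target to Source's group
            groups[s].add(target)
            group_of[target] = s
        elif s is None:
            # only Target is known: add Source to Target's group
            groups[t].add(source)
            group_of[source] = t
        elif s != t:
            # both known but in different groups: fold Target's group into Source's
            for node in groups[t]:
                group_of[node] = s
            groups[s] |= groups[t]
            groups[t] = None
    return [sorted(g) for g in groups if g is not None]


MERGE_LABELS = {"I_CODE", "N_Name", "R_exactMatch", "R_equivalentClass", "PAYLOAD"}
example_edges = [
    (1, 2, "I_CODE"),
    (3, 4, "N_Name"),
    (4, 5, "PAYLOAD"),
    (2, 6, "OTHER"),              # not a merge label, so it is ignored
    (7, 8, "R_exactMatch"),
    (8, 1, "R_equivalentClass"),  # joins the group {7, 8} with {1, 2}
]
example_groups = build_merge_groups(example_edges, MERGE_LABELS)
print(example_groups)
```

With the toy edges above, nodes 3, 4, and 5 end up in one group and nodes 1, 2, 7, and 8 in another, while node 6 is never grouped because its only edge does not carry a merge label.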

Next, we iterate through tbm_list. For each entry in tbm_list, a new node is created and appended to the end of the node list. The Id value of this new node is the minimum of the Id values of the original nodes in the entry (the nodes being merged). Each remaining column of the new node holds the comma-separated concatenation of the corresponding column values from the original nodes. At this stage, an entry is also added to the dictionary fix_edge_dict for each original node, with the original node's Id as the key and the new merged node's Id as the value. This dictionary is later used to reconnect the remaining valid edges to the newly merged nodes. All of the original (pre-merge) nodes are then deleted from the node list.
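
As a rough illustration of this step, the sketch below merges node records held in a plain dictionary rather than a DataFrame. The function name merge_nodes, the dictionary layout, and the column names "Label" and "Origin" are assumptions for the example, not names from the node list itself.

```python
def merge_nodes(nodes, tbm_list):
    """Collapse each group of nodes into a single record.

    nodes: dict mapping node Id -> dict of column values (hypothetical layout).
    Returns (merged_nodes, fix_edge_dict).
    """
    merged_nodes = dict(nodes)  # shallow copy, so the input dict is not modified
    fix_edge_dict = {}

    for group in tbm_list:
        new_id = min(group)  # the merged node takes the smallest Id in its group
        columns = nodes[group[0]].keys()
        # concatenate each column across the group, in group order
        merged_record = {
            col: ", ".join(str(nodes[node_id][col]) for node_id in group)
            for col in columns
        }
        for node_id in group:
            fix_edge_dict[node_id] = new_id  # old Id -> merged Id
            del merged_nodes[node_id]        # drop the pre-merge node
        merged_nodes[new_id] = merged_record
    return merged_nodes, fix_edge_dict


example_nodes = {
    1: {"Label": "alpha", "Origin": "A"},
    2: {"Label": "beta", "Origin": "B"},
    3: {"Label": "gamma", "Origin": "A"},
}
merged_nodes, fix_edge_dict = merge_nodes(example_nodes, [[1, 2]])
print(merged_nodes)
print(fix_edge_dict)
```

Here nodes 1 and 2 collapse into a single node with Id 1 whose "Label" is "alpha, beta", node 3 is left untouched, and fix_edge_dict maps both 1 and 2 to 1. The sketch assumes each Id appears only once per group; a repeated Id would raise a KeyError on the second deletion.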

Finally, we iterate through the edge list again. If an edge has a Source value and a Target value that share an entry in tbm_list, the edge is deleted. If only one of the two values (either Source or Target) is present in tbm_list, that value is updated to the new merged node Id using fix_edge_dict. If the Source and Target values are both present in tbm_list but in different entries, both are updated to their new merged node Ids using fix_edge_dict. Edges whose Source and Target are both absent from tbm_list are left unchanged.
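
The edge cases above can be condensed into a short sketch that works from fix_edge_dict alone. It relies on the fact that no node appears in more than one entry, so two nodes share an entry exactly when they map to the same merged Id. The function name rewire_edges and the (source, target, label) tuple layout are assumptions for illustration.

```python
def rewire_edges(edges, fix_edge_dict):
    """Point edges at merged nodes and drop edges internal to a merged group.

    edges: iterable of (source, target, label) tuples (hypothetical layout).
    fix_edge_dict: maps each pre-merge node Id to its merged node Id.
    """
    kept = []
    for source, target, label in edges:
        # nodes that were not merged keep their original Id
        new_source = fix_edge_dict.get(source, source)
        new_target = fix_edge_dict.get(target, target)
        both_merged = source in fix_edge_dict and target in fix_edge_dict
        if both_merged and new_source == new_target:
            continue  # both ends were merged into the same node: delete the edge
        kept.append((new_source, new_target, label))
    return kept


example_fix = {1: 1, 2: 1, 3: 3, 4: 3}  # groups [1, 2] and [3, 4]
example_edges = [
    (1, 2, "I_CODE"),  # both ends in the same group: deleted
    (2, 5, "X"),       # only Source merged
    (5, 4, "Y"),       # only Target merged
    (2, 4, "Z"),       # both merged, different groups
    (5, 6, "W"),       # neither merged: unchanged
]
rewired = rewire_edges(example_edges, example_fix)
print(rewired)
```

The both_merged check means that a self-loop on a node that was never merged is kept, since only edges inside a merged group are deleted in the steps described above.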



In [None]:
from google.colab import files
import json
import pandas as pd
import os
import numpy as np
import networkx as nx
import csv

MERGING I_CODE, N_NAME, R_EXACTMATCH, R_EQUIVALENTCLASS, AND PAYLOAD DUPLICATES

In [None]:
#read in edge list
edge_list_df = pd.read_csv('/content/edge_list.csv') 

#read in node list 
node_list_df = pd.read_csv('/content/node_list.csv') 

In [None]:
#create empty to-be-merged list and Source/Target checks
tbm_list = []
Source = False
Target = False

#iterate through edge list
for index, row in edge_list_df.iterrows():
  #if an edge has one of the five merge labels in the Neo4j "type" column,
  if row['type'] in ('I_CODE', 'N_Name', 'PAYLOAD', 'R_equivalentClass', 'R_exactMatch'):
    #see if its "Source" value A is in the to-be-merged list already
    if any(row['Source'] in subl for subl in tbm_list): 
      Source = True
    else:
      Source = False
    #see if its "Target" value B is in the to-be-merged list already
    if any(row['Target'] in subl for subl in tbm_list):
      Target = True
    else:
      Target = False
    #if A is in the list but not B, add B to A's entry (and vice versa)
    if Source == True and Target == False:
      for i in range(len(tbm_list)):
        if row['Source'] in tbm_list[i]:
          tbm_list[i].append(row['Target'])
    elif Source == False and Target == True:
      for i in range(len(tbm_list)):
        if row['Target'] in tbm_list[i]:
          tbm_list[i].append(row['Source'])
    #if both are in the list, merge those two entries together (if they aren't already the same)
    elif Source == True and Target == True:
      for i in range(len(tbm_list)):
        if row['Source'] in tbm_list[i]:
          sourcelist = tbm_list[i]
          sourceIndex = i
        if row['Target'] in tbm_list[i]:
          targetlist = tbm_list[i]
          targetIndex = i
      if sourceIndex != targetIndex:
        mergelist = list(set(sourcelist + targetlist))
        tbm_list.append(mergelist)
        tbm_list.pop(sourceIndex)
        if sourceIndex < targetIndex:
          tbm_list.pop(targetIndex-1)
        if sourceIndex > targetIndex:
          tbm_list.pop(targetIndex)
    #if neither is in the list, create a new entry with A and B
    elif Source == False and Target == False:
      tbm_list.append([row['Source'], row['Target']])
    Source = False
    Target = False

In [None]:
#make Id numbers the indices of the node list
node_list_df_id_index = node_list_df.set_index("Id", drop = False)

#create fix edge dict and merge node dfs dataframe
fix_edge_dict = {}
merge_node_dfs = pd.DataFrame()
min_id_tracker = 0 

#iterate through to-be-merged list
for i in range(len(tbm_list)):
  merge_node_data = {}
  #find min id value for new merged node id
  for j in range(len(tbm_list[i])):
    if j == 0:
      min_id_tracker = tbm_list[i][j]
    else:
      min_id_tracker = min(min_id_tracker, tbm_list[i][j])
    #create a new node (append to end of node list) with the minimum identity value as its identity, append the rest of the columns into one list
    for c in list(node_list_df_id_index.columns):
      if c != "Id":
        if c in merge_node_data:
          merge_node_data[c] = merge_node_data[c] + ", " + str(node_list_df_id_index.at[tbm_list[i][j], c])
        else:
          merge_node_data[c] = str(node_list_df_id_index.at[tbm_list[i][j], c])
  merge_node_data["Id"] = min_id_tracker
  merge_node_df = pd.DataFrame.from_dict(merge_node_data, orient='index').T
  #pd.concat replaces DataFrame.append, which was removed in pandas 2.0
  merge_node_dfs = pd.concat([merge_node_dfs, merge_node_df])

  #add an entry to the fix edge dict for each original node, with the original node's original id as the key and the new merged node's id as the value
  for k in range(len(tbm_list[i])):
    fix_edge_dict[tbm_list[i][k]] = min_id_tracker
    #delete all the original nodes from the node list 
    node_list_df_id_index.drop(labels = [tbm_list[i][k]], axis = 0, inplace = True)

#append all merged nodes to node list
node_list_df_id_index = pd.concat([node_list_df_id_index, merge_node_dfs])

In [None]:
#iterate through edge list
#If an edge has a Source value and a Target value that share an entry in tbm_list, the edge is deleted. 
#If an edge has only one or the other (either a Source value present in tbm_list or a Target value present in tbm_list), 
#then that value is updated to reflect the new merged node Id using fix_edge_dict. If an edge has a Source value and a Target 
#value that are both present in tbm_list, but in different entries, then both the Source and Target values are updated to reflect 
#the new merged node Id numbers using fix_edge_dict. 

for index, row in edge_list_df.iterrows():
  #if statements eliminate case 1 (neither in tbm_list)
  #case 2 (only source in tbm_list)
  if any(row['Source'] in subl for subl in tbm_list) and not any(row['Target'] in subl for subl in tbm_list):
    edge_list_df.at[index, "Source"] = fix_edge_dict[row['Source']]
  #case 3 (only target in tbm_list)
  if any(row['Target'] in subl for subl in tbm_list) and not any(row['Source'] in subl for subl in tbm_list):
    edge_list_df.at[index, "Target"] = fix_edge_dict[row['Target']]
  #case 4 and 5 (both in tbm_list)
  if any(row['Source'] in subl for subl in tbm_list) and any(row['Target'] in subl for subl in tbm_list):
    for i in range(len(tbm_list)):
      #case 4 (both in same entry)
      if row['Source'] in tbm_list[i] and row['Target'] in tbm_list[i]:
        edge_list_df.drop([index], inplace = True)
        break
      #case 5 (both in different entries)
      if row['Source'] in tbm_list[i] and row['Target'] not in tbm_list[i]:
        edge_list_df.at[index, "Source"] = fix_edge_dict[row['Source']]
      if row['Target'] in tbm_list[i] and row['Source'] not in tbm_list[i]:
        edge_list_df.at[index, "Target"] = fix_edge_dict[row['Target']]