# Generate Graphs

This notebook (and associated data) can be used to generate Biomarker-Biomarker networks as described in [Identifying and Validating Networks of Oncology Biomarkers Mined From the Scientific Literature](https://journals.sagepub.com/doi/full/10.1177/11769351221086441).

In [None]:
!pip install pandas
!pip install networkx
!pip install igraph
!pip install leidenalg

## Table of Contents

* [Global Imports](#globalimports)
* [Necessary Functions](#functions)
* [Run everything](#run)
* [Wrap up](#wrapup)

## Global Imports <a class='anchor' id='globalimports'></a>

Generally speaking, we import modules alongside the functions that use them.

However, a small number of modules are used so frequently that we import them here, up front.

In [None]:
import pandas as pd

## Necessary Functions <a class='anchor' id='functions'></a>

#### LoadRawDataFrames

This function loads the three input files described below and lightly cleans each one.

**NCI_Biomarkers.csv** - Contains basic additional metadata (associated organ, biomarker type) for each biomarker.

**BiomarkerCancer_Cooccurrences.csv** - Contains data establishing the appearance of a biomarker within 20 tokens of a phrase indicative of a specific target cancer. Two columns bear further explanation:
- Unique Publication Count - Biomarker&Slop: Number of publications in which the focal biomarker appears within 20 tokens of a phrase indicative of the target cancer.
- Unique Publication Count - Biomarker: Number of publications in which the focal biomarker appears.

**BiomarkerBiomarker_Cooccurrences_Slop20.csv** - Contains publication identifiers for publications in which the specific pairs of biomarkers appear within 20 tokens of each other.

In [None]:
from ast import literal_eval
def LoadRawDataFrames():
    df_NCIBiomarkerInfo = pd.read_csv('NCI_Biomarkers.csv')
    # Change the formatting of the Organ(s) list.
    df_NCIBiomarkerInfo.loc[:,'Organ(s)'] = df_NCIBiomarkerInfo['Organ(s)'].str.split(',').apply(lambda lst:';'.join(sorted(lst)))

    df_BiomarkerCancer = pd.read_csv('BiomarkerCancer_Cooccurrences.csv')
    # Need to do a little bit of work to turn the Publication IDs string into a list.
    df_BiomarkerCancer.loc[:,'Publication IDs'] = df_BiomarkerCancer['Publication IDs'].apply(literal_eval)

    # Merge the NCI_Biomarkers metadata into it.
    df_BiomarkerCancer = df_BiomarkerCancer.merge(df_NCIBiomarkerInfo, left_on='Biomarker', right_on='Biomarker', how='left')

    df_BiomarkerBiomarker = pd.read_csv('BiomarkerBiomarker_Cooccurrences_Slop20.csv')
    # Again, do a little bit of work to turn the Publication IDs string into a list.
    df_BiomarkerBiomarker.loc[:,'Publication IDs'] = df_BiomarkerBiomarker['Publication IDs'].apply(literal_eval)

    return df_NCIBiomarkerInfo, df_BiomarkerCancer, df_BiomarkerBiomarker
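As a rough illustration of the two clean-up steps used above, the sketch below applies them to made-up values. The ID strings and organ names are hypothetical and are not taken from the data files.

```python
from ast import literal_eval

# Hypothetical raw cell values, as they might look straight after pd.read_csv.
raw_publication_ids = "['pub.001', 'pub.002', 'pub.002']"
raw_organs = 'Lung,Breast'

# literal_eval safely turns the stringified list back into a Python list.
publication_ids = literal_eval(raw_publication_ids)

# Sorting before joining gives each organ combination one canonical spelling.
organs = ';'.join(sorted(raw_organs.split(',')))

print(publication_ids)
print(organs)
```

Note that repeated IDs survive the parsing step; later functions rely on those repeats to count multiple co-occurrences within one publication.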

#### AddAvgPubAgeAndSlope

Here we have two functions that calculate the average publication year, and the publication slope, for each Biomarker: one across all cancers, and the other for each cancer separately.

In [None]:
# This little function takes a list of lists, merges the sublists, then returns the unique elements. 
import itertools
def uniques(series):
    lists = series.tolist()
    result = [id for id in itertools.chain(*lists)]
    return list(set(result))

from collections import Counter
import numpy as np
import warnings

# Across all cancers
def AddAvgPubAgeAndSlope_AllCancers(df,PubIDsColumn):

    df_PubYear = pd.read_csv('PublicationYear.csv').set_index('Publication ID')

    df_tmp = df.groupby('Biomarker').agg(idset=(PubIDsColumn, uniques)).reset_index()
    df_tmp = df_tmp.explode('idset')
    df_tmp = df_tmp[df_tmp['idset'].notna()]
    df_tmp = df_tmp.merge(df_PubYear, left_on='idset', right_index=True)
    df_tmp = df_tmp[df_tmp['Publication Year'].notna()]

    df_tmp_MeanPubYear = df_tmp.groupby(['Biomarker'])['Publication Year'].mean().to_frame('Avg Pub Year (all cancers)')

    df = df.merge(df_tmp_MeanPubYear, left_on='Biomarker', right_index=True, how='left')

    series_tmp = df_tmp.groupby(['Biomarker'])['Publication Year'].apply(list)
    
    warnings.filterwarnings('ignore', message='Polyfit may be poorly conditioned') # Polyfit emits many repetitive warnings about not trusting the fit when there are few data points.
    Biomarker_Slope = {}
    for Biomarker,PubYears in series_tmp.items():
        X,Y = list(Counter(sorted(PubYears)).keys()), list(Counter(sorted(PubYears)).values())
        ScaledSlope = 100.0*np.polyfit(X,Y,1)[0]/np.mean(Y) # The 100.0 is just to produce a reasonable scale.
        Biomarker_Slope[Biomarker] = ScaledSlope
    warnings.resetwarnings()

    df_tmp_Biomarker_Slope = pd.DataFrame.from_dict(Biomarker_Slope,orient='index').rename(mapper={0:'Pubs slope (all cancers)'},axis=1)

    df = df.merge(df_tmp_Biomarker_Slope, left_on='Biomarker', right_index=True, how='left')
    
    return df

# For the specific cancer. This is essentially the same as above, just with the groupby being executed on 'Cancer' as well.
def AddAvgPubAgeAndSlope_CancerSpecific(df,PubIDsColumn):

    df_PubYear = pd.read_csv('PublicationYear.csv').set_index('Publication ID')

    df_tmp = df.groupby(['Biomarker','Cancer']).agg(idset=(PubIDsColumn, uniques)).reset_index()
    df_tmp = df_tmp.explode('idset')
    df_tmp = df_tmp[df_tmp['idset'].notna()]
    df_tmp = df_tmp.merge(df_PubYear, left_on='idset', right_index=True)
    df_tmp = df_tmp[df_tmp['Publication Year'].notna()]

    df_tmp_MeanPubYear = df_tmp.groupby(['Biomarker','Cancer'])['Publication Year'].mean().to_frame('Avg Pub Year (this cancer)').reset_index()

    df = df.merge(df_tmp_MeanPubYear, left_on=['Biomarker','Cancer'], right_on=['Biomarker','Cancer'], how='left')

    series_tmp = df_tmp.groupby(['Biomarker','Cancer'])['Publication Year'].apply(list)

    warnings.filterwarnings('ignore', message='Polyfit may be poorly conditioned') # Polyfit emits many repetitive warnings about not trusting the fit when there are few data points.
    Biomarker_Slope = {}
    for (Biomarker,Cancer),PubYears in series_tmp.items():
        X,Y = list(Counter(sorted(PubYears)).keys()), list(Counter(sorted(PubYears)).values())
        ScaledSlope = 100.0*np.polyfit(X,Y,1)[0]/np.mean(Y) # The 100.0 is just to produce a reasonable scale.
        Biomarker_Slope[(Biomarker,Cancer)] = ScaledSlope
    warnings.resetwarnings()

    df_tmp_Biomarker_Slope = pd.DataFrame(data={'Pubs slope (this cancer)': list(Biomarker_Slope.values())}, index=pd.MultiIndex.from_tuples(tuples=Biomarker_Slope.keys(), names=['Biomarker','Cancer'])).reset_index()

    df = df.merge(df_tmp_Biomarker_Slope, left_on=['Biomarker','Cancer'], right_on=['Biomarker','Cancer'], how='left')

    return df
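The scaled slope used in both functions can be illustrated in isolation. The sketch below is a standard-library stand-in for `np.polyfit(X,Y,1)[0]` (an ordinary least-squares line through the per-year publication counts), not the notebook's own code path. The function name `scaled_slope` and the sample years are hypothetical, and returning `None` when only one year is present is an assumption made here for clarity.

```python
from collections import Counter

def scaled_slope(pub_years):
    """Least-squares slope of publications per year, as a percentage of the mean yearly count."""
    counts = Counter(pub_years)
    xs = sorted(counts)                # distinct publication years
    ys = [counts[x] for x in xs]       # number of publications in each year
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None                    # assumption: a single year gives no usable slope
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return 100.0 * (sxy / sxx) / mean_y

# Hypothetical biomarker: 1 paper in 2018, 2 in 2019, 3 in 2020.
example_years = [2018, 2019, 2019, 2020, 2020, 2020]
example_slope = scaled_slope(example_years)
print(example_slope)
```

Here the fitted line gains one publication per year against a mean of two per year, so the scaled slope is 50.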

#### Get_CancerBiomarkerPublications

This function takes the biomarker-cancer data and keeps only the rows that meet the following criteria:
- Slop 20
- Publication count between 5 and 1000.

For the "overall" case (*i.e.* summing across all cancers) this function handles the necessary merging.

In [None]:
# This just formats nicely the Publication IDs into a link to them in the Dimensions web application.
def IDsListToURL(listofIDs,template='https://app.dimensions.ai/discover/publication?search_mode=content&search_text={}',separator='%20OR%20',pattern='{}'):
        if template is not None and isinstance(listofIDs, list):
            if len(listofIDs)>0:
                ids = separator.join([pattern.format(id) for id in listofIDs])
                url = template.format(ids)
                return url

def Get_CancerBiomarkerPublications(df_biomarkercancer,cancer,slop=20,minmax=[5,1000]):
    
    # The overall network needs to be treated differently.
    if cancer=="Overall" or cancer is None:
        df_biomarkercancer_local = df_biomarkercancer.loc[(df_biomarkercancer['Unique Publication Count - Biomarker']>=minmax[0]) \
                                                          & (df_biomarkercancer['Unique Publication Count - Biomarker']<=minmax[1]) \
                                                          & (df_biomarkercancer['Slop']==slop) 
                                                          & (df_biomarkercancer['Unique Publication Count - Biomarker&Slop']>0) \
                                                          , ['Biomarker', 'Publication IDs']
                                                         ]
        # This agg does the merging of the Publication IDs.
        df_biomarkercancer_local = df_biomarkercancer_local.groupby('Biomarker').agg(ids=('Publication IDs', uniques)).reset_index().rename(mapper={'ids':'Publication IDs'},axis=1)

        # Unfortunately the groupby discards the other columns of data, but we merge it back in below.
        df_biomarkercancer_extradata = df_biomarkercancer[['Biomarker', 'Type', 'Organ(s)', 'Avg Pub Year (all cancers)', 'Pubs slope (all cancers)','Unique Publication Count - Biomarker']].drop_duplicates()
        df_biomarkercancer_local = df_biomarkercancer_local.merge(df_biomarkercancer_extradata, right_on='Biomarker', left_on='Biomarker', how='left')
        df_biomarkercancer_local = df_biomarkercancer_local.rename(mapper={'Unique Publication Count - Biomarker':'# Pubs (all cancers)'},axis=1)

    
    else:
        df_biomarkercancer_local = df_biomarkercancer.loc[(df_biomarkercancer['Unique Publication Count - Biomarker']>=minmax[0]) \
                                                          & (df_biomarkercancer['Unique Publication Count - Biomarker']<=minmax[1]) \
                                                          & (df_biomarkercancer['Slop']==slop) \
                                                          & (df_biomarkercancer['Unique Publication Count - Biomarker&Slop']>0) \
                                                          & (df_biomarkercancer['Cancer']==cancer) \
                                                          , ['Biomarker', 'Type', 'Organ(s)', 'Avg Pub Year (all cancers)', 'Pubs slope (all cancers)', 'Avg Pub Year (this cancer)', 'Pubs slope (this cancer)', 'Publication IDs','Unique Publication Count - Biomarker','Unique Publication Count - Biomarker&Slop']
                                                         ]
        df_biomarkercancer_local = df_biomarkercancer_local.rename(mapper={'Unique Publication Count - Biomarker&Slop':'# Pubs (this cancer)', 'Unique Publication Count - Biomarker':'# Pubs (all cancers)'},axis=1)

        # Does the final 5,1000 filtering.
        df_biomarkercancer_local = df_biomarkercancer_local.loc[(df_biomarkercancer_local['# Pubs (this cancer)']>=minmax[0]) \
                                                                & (df_biomarkercancer_local['# Pubs (this cancer)']<=minmax[1]) \
                                                               ]
    
    df_biomarkercancer_local['Dimensions link'] = df_biomarkercancer_local['Publication IDs'].apply(IDsListToURL)
    
    return df_biomarkercancer_local
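To make the `Dimensions link` column concrete, here is a condensed sketch of the URL construction performed by `IDsListToURL`. The helper name `ids_to_url` and the publication IDs are hypothetical; the template and separator are copied from the function above.

```python
TEMPLATE = 'https://app.dimensions.ai/discover/publication?search_mode=content&search_text={}'
SEPARATOR = '%20OR%20'  # URL-encoded " OR "

def ids_to_url(ids):
    """Return a search URL matching any of the given IDs, or None for an empty or non-list input."""
    if isinstance(ids, list) and ids:
        return TEMPLATE.format(SEPARATOR.join(str(i) for i in ids))
    return None

example_url = ids_to_url(['pub.001', 'pub.002'])  # hypothetical IDs
print(example_url)
```

As in the original function, an empty list yields `None` rather than a link with an empty search.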

#### GetCooccurrenceDataFrame

This function takes the biomarker-biomarker co-occurrence dataframe and:
1. Filters the Publication IDs, leaving only those also identified as relevant to the target cancer.
2. Prepares a count of unique publications in which the biomarker-biomarker pair occurs *at least twice*.
3. Keeps only biomarker-biomarker pairs that satisfy the minimum unique publications, and total publications, thresholds.

In [None]:
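# This is a little trick that removes one occurrence of each distinct element, thus keeping only entries that appear two or more times.
# Note: NewList is an alias of List (not a copy), so the list passed in by the caller is modified in place.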
def remove_single_count_items(List):
    if len(List)>0:
        NewList = List
        for item in list(set(List)):
            NewList.remove(item)
        return NewList
    else:
        return List

def GetCooccurrenceDataFrame(df_biomarkerbiomarker,validpublicationset,mincooc_UniquePapers,mincooc_TotalCounts):
    # This keeps only Publication IDs that are in the set of publications valid for the target cancer.
    df_biomarkerbiomarker.loc[:,'Publication IDs'] = df_biomarkerbiomarker['Publication IDs'].apply(lambda ids: [id for id in ids if id in validpublicationset])

    df_biomarkerbiomarker.loc[:,'Total Cooccurrences'] = df_biomarkerbiomarker['Publication IDs'].apply(len)
    df_biomarkerbiomarker.loc[:,'Unique Publication Cooccurrences'] = df_biomarkerbiomarker['Publication IDs'].apply(set).apply(len)
    
    # Captures the number of publications that have at least two co-occurrences for the given biomarker-biomarker pair.
    df_biomarkerbiomarker.loc[:,'Unique Cooccurrences > 1'] = df_biomarkerbiomarker['Publication IDs'].apply(remove_single_count_items).apply(set).apply(len)


    # Filter on the number of publications w/ at least mincooc_UniquePapers co-occurrences > 1, and at least mincooc_TotalCounts in total.
    df_biomarkerbiomarker = df_biomarkerbiomarker.loc[(df_biomarkerbiomarker['Unique Cooccurrences > 1']>=mincooc_UniquePapers) \
                                                      & (df_biomarkerbiomarker['Total Cooccurrences']>=mincooc_TotalCounts)
                                                     ]

    return df_biomarkerbiomarker
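The three counts computed above can be checked by hand on a small list. This sketch uses hypothetical publication IDs and works on a copy, so the input list is left untouched; the `Counter` line is an equivalent way to arrive at the same "more than one co-occurrence" count.

```python
from collections import Counter

# Hypothetical IDs for one biomarker pair: p1 twice, p2 once, p3 three times.
ids = ['p1', 'p1', 'p2', 'p3', 'p3', 'p3']

total_cooccurrences = len(ids)        # every co-mention counts
unique_publications = len(set(ids))   # each publication counts once

# The "remove one occurrence of each distinct ID" trick, applied to a copy.
leftover = list(ids)
for item in set(ids):
    leftover.remove(item)
repeated_publications = len(set(leftover))

# Equivalent formulation: publications whose ID occurs at least twice.
repeated_via_counter = sum(1 for c in Counter(ids).values() if c >= 2)

print(total_cooccurrences, unique_publications, repeated_publications)
```

With the thresholds used in the run cell (at least 2 publications with repeated co-mentions and at least 4 co-mentions in total), this hypothetical pair would be kept.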

#### GetGraphDataFrame

This function takes the co-occurrence dataframes (biomarker-cancer, biomarker-biomarker) and produces a dataframe in which each row is, essentially, a link in the network that will eventually be constructed.

It generally proceeds as:
1. Joins the biomarker-cancer data for each biomarker onto each biomarker column in the biomarker-biomarker data.
2. Calculates edge weights.
3. Ranks edge weights.

In [None]:
# This little function takes a list of Publication IDs and returns the average age of those publications.
PublicationYearDict = pd.read_csv('PublicationYear.csv').set_index('Publication ID').to_dict()['Publication Year']
def AvgPubAge_fr_PubIDList(PublicationIDsList):
    PubYearList = [PublicationYearDict[PubID] for PubID in PublicationIDsList if PubID in PublicationYearDict]
    if len(PubYearList)>0:
        return np.mean(PubYearList)

def GetGraphDataFrame(df_biomarkerbiomarker,df_biomarkercancer,targetcancer):
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
    
        if targetcancer == 'Overall':
            df_biomarkercancer['Avg Pub Year (this cancer)'] = df_biomarkercancer['Avg Pub Year (all cancers)']
            df_biomarkercancer['Pubs slope (this cancer)'] = df_biomarkercancer['Pubs slope (all cancers)']
            df_biomarkercancer['# Pubs (this cancer)'] = df_biomarkercancer['# Pubs (all cancers)']
    
        df_biomarkercancer = df_biomarkercancer[['Biomarker','# Pubs (this cancer)', 'Avg Pub Year (this cancer)']]
        df_biomarkercancer.loc[:,'Biomarker'] = df_biomarkercancer['Biomarker'].str.lower() # have to lower to make it joinable.

        # Merge the biomarker-cancer data in for Biomarker 1
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker.merge(df_biomarkercancer, left_on='Biomarker 1', right_on='Biomarker').drop(columns=['Biomarker'])
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.rename(mapper={'# Pubs (this cancer)':'tmp - # Biomarker 1', 'Avg Pub Year (this cancer)':'tmp - age Biomarker 1'},axis=1)
        # Merge the biomarker-cancer data in for Biomarker 2
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.merge(df_biomarkercancer, left_on='Biomarker 2', right_on='Biomarker').drop(columns=['Biomarker'])
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.rename(mapper={'# Pubs (this cancer)':'tmp - # Biomarker 2', 'Avg Pub Year (this cancer)':'tmp - age Biomarker 2'},axis=1)

        # Makes a column that is [Cancer-Biomarker Count Biomarker 1, Cancer-Biomarker Count Biomarker 2]
        df_biomarkerbiomarker_graph.loc[:,'Unique Publication Counts - Cancer'] = df_biomarkerbiomarker_graph.apply(lambda row: [row['tmp - # Biomarker 1'], row['tmp - # Biomarker 2']], axis=1)
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.drop(['tmp - # Biomarker 1','tmp - # Biomarker 2'],axis=1)

        # Makes a column that is [Avg Pub Year Biomarker 1, Avg Pub Year Biomarker 2]
        df_biomarkerbiomarker_graph.loc[:,'Avg Pub Year of endpoints'] = df_biomarkerbiomarker_graph.apply(lambda row: [row['tmp - age Biomarker 1'], row['tmp - age Biomarker 2']], axis=1)
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.drop(['tmp - age Biomarker 1','tmp - age Biomarker 2'],axis=1)

        # Weights are calculated as the number of unique publication co-occurrences of the biomarker-biomarker pair,
        # divided by the Unique Publication Counts - Cancer (i.e. fraction of publications)
        df_biomarkerbiomarker_graph.loc[:,'Weights'] = df_biomarkerbiomarker_graph.apply(lambda row: [row['Unique Publication Cooccurrences'] / row['Unique Publication Counts - Cancer'][0], row['Unique Publication Cooccurrences'] / row['Unique Publication Counts - Cancer'][1]], axis=1)

        # Get the average age of the publications in the link.
        df_biomarkerbiomarker_graph['Avg Pub Year (this cancer)'] = df_biomarkerbiomarker_graph['Publication IDs'].apply(AvgPubAge_fr_PubIDList)
        
        # This gets the Rank of the edge among all of Biomarker 1's edges.
        Rank = df_biomarkerbiomarker_graph.groupby('Biomarker 1')['Unique Publication Cooccurrences'].rank(method='min', ascending=False)
        Rank.name = 'Ranks - Biomarker 1'
        df_biomarkerbiomarker_graph = pd.concat([df_biomarkerbiomarker_graph, Rank], axis = 1)

        # This gets the Rank of the edge among all of Biomarker 2's edges.
        Rank = df_biomarkerbiomarker_graph.groupby('Biomarker 2')['Unique Publication Cooccurrences'].rank(method='min', ascending=False)
        Rank.name = 'Ranks - Biomarker 2'
        df_biomarkerbiomarker_graph = pd.concat([df_biomarkerbiomarker_graph, Rank], axis = 1)

        df_biomarkerbiomarker_graph.loc[:,'Ranks'] = df_biomarkerbiomarker_graph.apply(lambda row: [int(row['Ranks - Biomarker 1']), int(row['Ranks - Biomarker 2'])], axis=1)
        df_biomarkerbiomarker_graph.loc[:,'Best Rank'] = df_biomarkerbiomarker_graph['Ranks'].apply(min) # Will keep only links that are in the top N of at least one biomarker.
        df_biomarkerbiomarker_graph = df_biomarkerbiomarker_graph.drop(['Ranks - Biomarker 1','Ranks - Biomarker 2'],axis=1)

        return df_biomarkerbiomarker_graph
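A small worked example may help with the weight and rank columns. The sketch below reproduces the arithmetic in plain Python with hypothetical counts; `min_rank_descending` is an illustrative helper that mimics `rank(method='min', ascending=False)` and is not part of the notebook.

```python
def min_rank_descending(values):
    """Rank 1 is the largest value; tied values share the lowest (best) rank."""
    return [1 + sum(other > v for other in values) for v in values]

# Hypothetical edge: 4 shared publications; the endpoints have 20 and 8 cancer publications.
shared_publications = 4
endpoint_pub_counts = [20, 8]
weights = [shared_publications / n for n in endpoint_pub_counts]

# Hypothetical co-occurrence counts for all edges touching one biomarker.
edge_counts = [10, 7, 7, 3]
ranks = min_rank_descending(edge_counts)

# An edge has one rank per endpoint; 'Best Rank' is the smaller of the two.
best_rank = min([2, 4])

print(weights, ranks, best_rank)
```

The two weights differ because the same shared publications make up a larger fraction of the smaller biomarker's literature.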

#### CreateGraph

This is a very large function that takes the graph dataframe created previously and actually creates the network, adds metadata, *etc*.

Beyond generating the network and adding node and edge data, it also runs the Leiden clustering algorithm to determine the cluster/community membership of each node.

In [None]:
import networkx as nx
import igraph as ig
import leidenalg as la
import warnings

def CreateGraph(df_graph,df_biomarkercancer,maxrank=None):

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # Filter out edges that are outside the Top N of both nodes.
        if maxrank is not None:
            df_graph = df_graph.loc[df_graph['Best Rank'].apply(lambda r: r<maxrank),:]

        # Set a single valued weight
        df_graph.loc[:,'Weight'] = df_graph['Weights'].apply(lambda w: w[0])

        # Generate the Dimensions links.
        df_graph.loc[:,'Dimensions Link'] = df_graph['Publication IDs'].apply(IDsListToURL)

        # To make it easier to align naming for final figures. 
        EdgeMetadataColumns = {'Weight':'Weight','Best Rank':'Best Rank','Unique Cooccurrences > 1':'# Pubs w/ >1 co-mention',
                               'Total Cooccurrences':'# co-mentions','Dimensions Link':'Dimensions link', 'Publication IDs': 'Publication IDs', 
                               'Avg Pub Year of endpoints':'Avg Pub Year of endpoints', 'Avg Pub Year (this cancer)':'Avg Pub Year (this cancer)'
                              }

        df_graph = df_graph.rename(mapper=EdgeMetadataColumns,axis=1)

        # Generate the networkx Graph object.
        G = nx.from_pandas_edgelist(df_graph,'Biomarker 1','Biomarker 2', edge_attr=list(EdgeMetadataColumns.values()))

        # The next several lines are for adding the node metadata.
        AllBiomarkersPresent = set(df_graph['Biomarker 1'].unique()) | set(df_graph['Biomarker 2'].unique())
        df_AllBiomarkersPresent = pd.DataFrame(list(AllBiomarkersPresent),columns=['Biomarker'])
        df_AllBiomarkersPresent.loc[:,'Biomarker'] = df_AllBiomarkersPresent.loc[:,'Biomarker'].str.lower()
        df_AllBiomarkersPresent = df_AllBiomarkersPresent.set_index('Biomarker')

        df_biomarkercancer.loc[:,'Biomarker'] = df_biomarkercancer.loc[:,'Biomarker'].str.lower()
        df_biomarkercancer = df_biomarkercancer.set_index('Biomarker')

        df_NodeMetadata = df_AllBiomarkersPresent.merge(df_biomarkercancer,left_index=True,right_index=True,how='left')
        df_NodeMetadata = df_NodeMetadata.drop(['Publication IDs'],axis=1).reset_index().drop_duplicates(keep='first').set_index('Biomarker')

        nx.set_node_attributes(G, values=df_NodeMetadata.to_dict(orient='index'))

        # An igraph version of the network is required for leidenalg.
        G_igraph = ig.Graph.TupleList([(e[0], e[1], e[2]['Weight']) for e in G.edges.data()], directed=False, weights=True)
        Partition = la.find_partition(G_igraph, la.ModularityVertexPartition, seed=411966)

        Partition_Optimiser = la.Optimiser()
        Partition_Optimiser.set_rng_seed(17041967)
        Partition_Result = Partition_Optimiser.optimise_partition(Partition, n_iterations=1000)

        # Add the partition result to the network nodes.
        df_NodeClusters = pd.DataFrame([{'Biomarker':node['name'],'ClusterID':index} for index, subgraph in enumerate(Partition.subgraphs()) for node in subgraph.vs()])
        NodeClusters = df_NodeClusters.set_index('Biomarker').astype('str').to_dict()['ClusterID']
        nx.set_node_attributes(G, NodeClusters, 'ClusterID')

        # Add betweenness centrality too.
        BetweennessCentrality = nx.betweenness_centrality(G)
        nx.set_node_attributes(G, BetweennessCentrality, 'Betweenness')

        # If both ends of an edge are in the same cluster, label the edge as internal.
        for edge in G.edges(data=True):
            G[edge[0]][edge[1]].update(internal_cluster_edge=(NodeClusters[edge[0]]==NodeClusters[edge[1]]))
        for edge in G.edges:
            G.edges[edge]['Within Cluster'] = G.edges[edge]['internal_cluster_edge']
            del G.edges[edge]['internal_cluster_edge']

        return G
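Two of the simpler steps in `CreateGraph`, the rank filter and the `Within Cluster` edge label, can be sketched without any graph library. Everything below (edge records, cluster labels, variable names) is hypothetical. Note that the comparison is strict, as in the function above, so a maximum rank of 5 keeps edges whose best rank is 1 to 4.

```python
# Hypothetical edges with a precomputed best rank.
edges = [
    {'source': 'a', 'target': 'b', 'best_rank': 1},
    {'source': 'a', 'target': 'c', 'best_rank': 4},
    {'source': 'b', 'target': 'd', 'best_rank': 5},
]
max_rank = 5

# Strict inequality: the rank-5 edge is dropped.
kept = [e for e in edges if e['best_rank'] < max_rank]

# Hypothetical cluster labels, standing in for the Leiden partition result.
node_clusters = {'a': '0', 'b': '0', 'c': '1', 'd': '1'}

# An edge is internal when both endpoints carry the same cluster label.
for e in kept:
    e['Within Cluster'] = node_clusters[e['source']] == node_clusters[e['target']]

print(kept)
```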

#### AddAnnotationDataToGraph

This function adds annotations to the nodes and edges of the graph.

Specifically, it adds to each node:
* The LC pathway networks in which the node is present.
* The interaction networks in which the node is present.
* A link to potential GeneCard page(s) for the biomarker.
* A link to potential Uniprot page(s) for the biomarker.

And for each edge:
* Whether or not the edge appears in at least one LC pathway network.
* The interaction networks in which the edge appears.

In [None]:
import zipfile

# This just unzips InteractionNetworks_edgeData.jsonl (which is too large for GitHub in its raw form).
with zipfile.ZipFile('InteractionNetworks_edgeData.zip', 'r') as archive:
    for file in archive.namelist():
        unzipped = open('./InteractionNetworks_edgeData.jsonl', 'w')
        unzipped.write(archive.read(file).decode('utf-8'))
        unzipped.close()

import json
def LoadAnnotationData():
    with open('LCpathways_nodeData.json','r') as infile:
        lcpathways_nodedata = json.loads(infile.read())

    lcpathways_edgedata = {}
    with open('LCpathways_edgeData.jsonl','r') as infile:
        for jsonl_line in infile.readlines():
            entry = json.loads(jsonl_line)
            key = frozenset(entry['InteractionTuple'])
            data = entry['InteractionData']
            lcpathways_edgedata[key] = data

    with open('InteractionNetworks_nodeData.json','r') as infile:
        interactionnetworks_nodedata = json.loads(infile.read())

    interactionnetworks_edgedata = {}
    with open('InteractionNetworks_edgeData.jsonl','r') as infile:
        for jsonl_line in infile.readlines():
            entry = json.loads(jsonl_line)
            key = frozenset(entry['InteractionTuple'])
            data = entry['InteractionData']
            interactionnetworks_edgedata[key] = data
    
    return lcpathways_nodedata, lcpathways_edgedata, interactionnetworks_nodedata, interactionnetworks_edgedata
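The edge files are read line by line, and each interaction is keyed by a `frozenset` so that lookups do not depend on the order of the two endpoints. The sketch below shows the idea on an in-memory stand-in for a JSONL file. The field names match those read above, while the biomarker names and the `ExampleNetwork` label are hypothetical.

```python
import io
import json

# Hypothetical two-line JSONL content, one interaction per line.
jsonl_text = (
    '{"InteractionTuple": ["aaa", "bbb"], "InteractionData": {"PresentIn": ["ExampleNetwork"]}}\n'
    '{"InteractionTuple": ["bbb", "ccc"], "InteractionData": {"PresentIn": ["ExampleNetwork"]}}\n'
)

edge_data = {}
for line in io.StringIO(jsonl_text):
    entry = json.loads(line)
    # frozenset is hashable and unordered, so (a, b) and (b, a) share one key.
    edge_data[frozenset(entry['InteractionTuple'])] = entry['InteractionData']

# The lookup succeeds regardless of endpoint order.
found = frozenset(('bbb', 'aaa')) in edge_data
missing = frozenset(('aaa', 'ccc')) in edge_data
print(found, missing)
```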

from urllib.parse import urlencode
def AddAnnotationDataToGraph(Graph):
    
    LCpathways_nodeData, LCpathways_edgeData, InteractionNetworks_nodeData, InteractionNetworks_edgeData = LoadAnnotationData()
    
    # Note that these are done in an undirected manner.
    for edge in list(Graph.edges):
        if edge[0] in LCpathways_nodeData and edge[1] in LCpathways_nodeData:
            if frozenset(edge) in LCpathways_edgeData:
                Graph.edges[edge]['Present in'] = ['LCpathways']
        if edge[0] in InteractionNetworks_nodeData and edge[1] in InteractionNetworks_nodeData:
            if frozenset(edge) in InteractionNetworks_edgeData:
                try:
                    Graph.edges[edge]['Present in'].extend(InteractionNetworks_edgeData[frozenset(edge)]['PresentIn'])
                except KeyError: # No 'Present in' attribute has been set on this edge yet.
                    Graph.edges[edge]['Present in'] = InteractionNetworks_edgeData[frozenset(edge)]['PresentIn']

    for nodeName in list(Graph.nodes):
        if nodeName in LCpathways_nodeData:
            Graph.nodes[nodeName]['Present in'] = ['LCpathways']
        if nodeName in InteractionNetworks_nodeData:
            try:
                Graph.nodes[nodeName]['Present in'].extend(InteractionNetworks_nodeData[nodeName]['PresentIn'])
            except KeyError: # No 'Present in' attribute has been set on this node yet.
                Graph.nodes[nodeName]['Present in'] = InteractionNetworks_nodeData[nodeName]['PresentIn']

        # Note that these are just links to search results, because not every Biomarker will appear in GeneCard and/or Uniprot.
        GeneCardQuery = {'queryString':nodeName}
        Graph.nodes[nodeName]['GeneCard Results'] = f'https://www.genecards.org/Search/Keyword?{urlencode(GeneCardQuery)}'
        Graph.nodes[nodeName]['Uniprot Results'] = f'https://www.uniprot.org/uniprot/?query=gene:%22{"+".join(nodeName.split())}%22+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22&sort=score'
    
    return Graph
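For reference, this is what the two search links look like for a biomarker name containing a space. The name `her2 neu` is a hypothetical example, and only the query portion of the Uniprot link is shown.

```python
from urllib.parse import urlencode

node_name = 'her2 neu'  # hypothetical biomarker name

# urlencode escapes the query value; a space becomes '+'.
genecard_url = 'https://www.genecards.org/Search/Keyword?' + urlencode({'queryString': node_name})

# The Uniprot link joins whitespace-separated tokens with '+' by hand.
uniprot_gene_term = '+'.join(node_name.split())

print(genecard_url)
print(uniprot_gene_term)
```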

## Run everything <a class='anchor' id='run'></a>

This runs all of the previous pieces in order, for each cancer type.

Specifically:
1. Load the data from the CSV files.
1. Process the biomarker-cancer co-occurrence data.
    * Then extract the valid publication IDs for the target cancer.
1. Process the biomarker-biomarker co-occurrence data.
1. Transform the processed biomarker-biomarker co-occurrence data into an "edge-list" dataframe.
1. Build the network from the "edge-list" dataframe.
    * Also calculate network-specific metrics (*e.g.* cluster, centrality).
1. Add the annotation data to the nodes and edges.

In [None]:
Slop = 20 # Value used in publication. Note that this does not get actively used, but rather slop is fixed via the input file containing only data for slop 20.
Cancer_Slop = 20 # Value used in publication.
MinimumUniquePaperCooccurrence = 2 # Value used in publication.
MinimumTotalCooccurrence = 4 # Value used in publication.
EdgeMaxRank=5 # Value used in publication.

CancersList = ['Bladder','Breast','Colorectal','Lung','Prostate','Renal'] # The 6 target cancer types
CancersList.append('Overall') # Plus all 6 together.

Graphs = {} # Dict that will hold all of the graphs.

for TargetCancer in CancersList:
    df_NCIBiomarkerInfo, df_BiomarkerCancer, df_BiomarkerBiomarker = LoadRawDataFrames() # Load the necessary data from csvs

    df_BiomarkerCancer = AddAvgPubAgeAndSlope_AllCancers(df_BiomarkerCancer,'Publication IDs')
    if TargetCancer != 'Overall': # For overall, doing these numbers cancer specific is redundant.
        df_BiomarkerCancer = AddAvgPubAgeAndSlope_CancerSpecific(df_BiomarkerCancer,'Publication IDs')

    df_BiomarkerCancer_TargetCancer = Get_CancerBiomarkerPublications(df_BiomarkerCancer,TargetCancer,Cancer_Slop,[5,1000]) # Process the biomarker-cancer data for the TargetCancer.
    ValidPublicationSet_TargetCancer = df_BiomarkerCancer_TargetCancer['Publication IDs'].aggregate(uniques) # Get the valid publication ids for the TargetCancer.
    
    # Put together the biomarker-biomarker co-occurrence data.
    df_Cooccurrences_TargetCancer = GetCooccurrenceDataFrame(df_BiomarkerBiomarker,ValidPublicationSet_TargetCancer,MinimumUniquePaperCooccurrence,MinimumTotalCooccurrence)

    # Get the graph dataframe.
    df_Graph_TargetCancer = GetGraphDataFrame(df_Cooccurrences_TargetCancer,df_BiomarkerCancer_TargetCancer,TargetCancer)

    # Create the graph.
    G = CreateGraph(df_Graph_TargetCancer,df_BiomarkerCancer_TargetCancer,maxrank=EdgeMaxRank)

    # Add annotation data to the graph.
    G = AddAnnotationDataToGraph(G)
    
    Graphs[TargetCancer] = G
    
    print('{}\t{} nodes, {} edges'.format(TargetCancer,G.number_of_nodes(),G.number_of_edges())) # Print the graph summary information (nx.info was removed in networkx 3.0).

## Wrap up <a class='anchor' id='wrapup'></a>

At this point the Graphs dict contains the graph for each cancer, plus overall. These can, for example, be written to disk using the appropriate [networkx write](https://networkx.org/documentation/stable/reference/readwrite/index.html) command for the format in which you want the graphs saved.