# **CS 586: Final Project**
COVID-19: Understanding the range of incubation periods and how long individuals are contagious after recovery.
-----------------------------------------------------------------------
The purpose of this notebook is to take a .csv file of relevant papers and use NetworkX to run the PageRank algorithm on their citation graph. The results are then saved to a separate .tsv file.

## **Installing Dependencies**
This installs the modules that are not already provided and mounts Google Drive if using Colab. 

In [9]:
# Install dependencies
!pip install rdflib
!apt-get -y install python-dev graphviz libgraphviz-dev pkg-config
!pip install pygraphviz

# Mounting Google drive if using Colab 
from google.colab import drive
drive.mount('/content/drive')

Reading package lists... Done
Building dependency tree       
Reading state information... Done
pkg-config is already the newest version (0.29.1-0ubuntu2).
python-dev is already the newest version (2.7.15~rc1-1).
graphviz is already the newest version (2.40.1-2).
libgraphviz-dev is already the newest version (2.40.1-2).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Preparing the Knowledge Graph**

In order to utilize the open-source knowledge graph, we'll parse the N-Triples file provided using `rdflib`.
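For orientation, each line of an N-Triples file holds one subject, predicate, object statement terminated by a period. The sketch below is a simplified, standard-library illustration of that layout for lines made only of URIs. It is not how `rdflib` parses the file, the helper name `parse_uri_triple` is hypothetical, and the DOIs in the sample are placeholders.

```python
import re

# Simplified pattern: three <URI> terms followed by a period.
# Literals and blank nodes, which real N-Triples files may contain, are not handled.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_uri_triple(line):
    """Return (subject, predicate, object) for a URI-only line, else None."""
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None

# Hypothetical sample line; the DOIs are placeholders, not real papers.
sample = ('<http://dx.doi.org/10.1000/example.a> '
          '<http://purl.org/spar/cito/cites> '
          '<http://dx.doi.org/10.1000/example.b> .')

subject, predicate, obj = parse_uri_triple(sample)
print(subject, 'cites', obj)
```

In the notebook itself, `rdflib` does this parsing and also copes with the malformed URIs that this toy pattern would simply reject.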

In [10]:
import rdflib
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm

In [11]:
# Load the knowledge graph (this will take a bit).
# Note that the invalid-URI warnings are due to errors in the open-source dataset.
# Graph.parse() is used because Graph.load() was removed in newer rdflib releases.
kg = rdflib.Graph()
kg.parse('/content/drive/Shared drives/CS 586: Data & Web Semantics/Final Project/covid19-literature-knowledge-graph/covid-kg4.nt', format='nt')

http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.

In [12]:
# Check that we were able to load the knowledge graph properly
# Number of triples (len() of an rdflib Graph is its triple count)
print(len(kg))

# Number of distinct predicates
print(len(set(kg.predicates())))

5151961
9


## **Prepare the Dataset**

Below we load the .csv file containing the relevant papers and keep only those that are cited by at least one other paper in the knowledge graph.
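The filtering idea can be sketched on a toy example without `rdflib`: treat the knowledge graph as a list of (subject, predicate, object) tuples and, for each candidate DOI, collect the citation triples in which it appears as the object. Everything here (the DOIs, the `cited_by` index, the variable names) is illustrative and not taken from the dataset.

```python
from collections import defaultdict

CITES = 'http://purl.org/spar/cito/cites'
PREFIX = 'http://dx.doi.org/'

# Toy citation triples: the subject cites the object. All DOIs are placeholders.
triples = [
    (PREFIX + '10.1000/a', CITES, PREFIX + '10.1000/b'),
    (PREFIX + '10.1000/c', CITES, PREFIX + '10.1000/b'),
    (PREFIX + '10.1000/b', CITES, PREFIX + '10.1000/d'),
]

# Papers of interest, as they might appear in the 'doi' column of the .csv.
candidate_dois = ['10.1000/a', '10.1000/b', '10.1000/d']

# Index the citation triples by the paper being cited.
cited_by = defaultdict(list)
for s, p, o in triples:
    if p == CITES:
        cited_by[o].append((s, p, o))

# Keep the triples that cite one of our candidates; uncited candidates add nothing.
kept = []
for doi in candidate_dois:
    kept.extend(cited_by.get(PREFIX + doi, []))

print(len(kept))
```

Here `10.1000/a` is never cited, so it contributes no triples, while `10.1000/b` and `10.1000/d` contribute the triples that point at them.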

In [13]:
# Load the csv of papers
import pandas as pd 
data = pd.read_csv('/content/drive/Shareddrives/CS 586: Data & Web Semantics/Final Project/final_ib.csv')
title_doi = list(data[['title', 'doi']].values)

# Creating look up table for title and doi
title_lookup = {}
for value in tqdm(title_doi):
  title_lookup[value[1]] = value[0] 

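# Extract the DOI values from the dataset and convert them to the URI form used in the graph.
# A regex removes any resolver prefix; str.strip() removes individual characters
# from both ends, not a literal prefix, so it can damage the DOI itself.
import re
doi_list = ['http://dx.doi.org/' + re.sub(r'^(https?://)?(dx\.)?doi\.org/', '', doi.strip())
            for doi in data['doi'].dropna()]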

# Keep only the citation triples whose object (the cited paper) is in our list
cites = rdflib.URIRef('http://purl.org/spar/cito/cites')
papers = []
for doi in tqdm(doi_list):
  papers.extend(kg.triples((None, cites, rdflib.URIRef(doi))))


HBox(children=(FloatProgress(value=0.0, max=8357.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=7719.0), HTML(value='')))
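A note on DOI normalization: `str.strip` with an argument removes any of the listed characters from both ends of the string, so it is not a reliable way to drop a resolver prefix. The sketch below shows the pitfall and one prefix-aware alternative. The helper name `to_doi_uri`, the regular expression, and the placeholder DOI are assumptions made for illustration.

```python
import re

DOI_PREFIX = 'http://dx.doi.org/'

# Matches an optional scheme followed by doi.org/ or dx.doi.org/ at the start only.
RESOLVER_RE = re.compile(r'^(?:https?://)?(?:dx\.)?doi\.org/', re.IGNORECASE)

def to_doi_uri(raw):
    """Return the DOI as an http://dx.doi.org/ URI, whatever prefix it arrived with."""
    bare = RESOLVER_RE.sub('', raw.strip())
    return DOI_PREFIX + bare

# Pitfall: the trailing '.x' is made of characters that also occur in the prefix,
# so strip() removes it along with nothing useful on the left.
damaged = '10.1000/example.x'.strip('http://dx.doi.org/')
print(damaged)  # -> 10.1000/example

# The prefix-aware helper leaves the DOI itself intact.
print(to_doi_uri('doi.org/10.1000/example.x'))
print(to_doi_uri('http://dx.doi.org/10.1000/example.x'))
```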




## **Generate the Subgraph**

Since we don't need the entire knowledge graph, we'll create a subgraph that contains the papers from the previous step.

In [14]:
# Generate subgraph from the papers 
subgraph = rdflib.Graph()
for paper in tqdm(papers):
  subgraph.add(paper)

# Check for successful subgraph creation below
# Number of triples
print(len(subgraph))

# Number of distinct predicates
print(len(set(subgraph.predicates())))

# List the predicates
for pr in set(subgraph.predicates()):
  print(pr)

HBox(children=(FloatProgress(value=0.0, max=51518.0), HTML(value='')))


15295
1
http://purl.org/spar/cito/cites


## **Convert to NetworkX Graph and run PageRank**

We'll now convert the subgraph from the previous step to a directed NetworkX graph with weighted edges. With `calc_weights=True`, each edge gets a `weight` attribute equal to the number of triples between its two nodes, and `nx.pagerank` reads that attribute by default when it distributes a paper's score among the papers it cites. Once PageRank has finished, we save the rankings to a .tsv file. Note: we use a .tsv file because some titles contain commas.
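To make the ranking step concrete, here is a minimal pure-Python sketch of weighted PageRank by power iteration, followed by writing the scores as tab-separated rows. It is a stand-in for illustration, not NetworkX's implementation. The function name `pagerank_sketch`, the toy edges, and the tolerance are assumptions, and the damping value 0.85 mirrors the default `alpha` of `nx.pagerank`.

```python
import csv
import io

def pagerank_sketch(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank; edges maps (source, target) to a positive weight."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Score held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its score along its out-edges in proportion to edge weight.
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Toy citation graph: an edge (X, Y) means "X cites Y"; weights are made up.
edges = {('A', 'B'): 1.0, ('A', 'C'): 3.0, ('B', 'C'): 1.0, ('C', 'A'): 1.0}
ranks = pagerank_sketch(edges)

# Write node and score as tab-separated rows, highest score first.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    writer.writerow([node, score])
print(buffer.getvalue())
```

In this toy graph, C ends up ranked above B because it receives the larger share of A's score (weight 3 versus 1) plus all of B's. Using `csv.writer` with a tab delimiter also takes care of quoting if a field ever contains a tab or newline.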

In [15]:
# Conversion to networkx graph for analysis
networkGraph = rdflib_to_networkx_digraph(subgraph, calc_weights=True)

# Check that this was successful
print("Success! NetworkX Graph has length", len(networkGraph))

Success! NetworkX Graph has length 10314


In [16]:
# Generate the PageRank of the graph
page_rankings = nx.pagerank(networkGraph)

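# Save the rankings to a .tsv file
prefix = 'http://dx.doi.org/'
with open('final_ib_page_rankings_title.tsv', 'w') as file:
  for key, score in page_rankings.items():
    # Remove the literal prefix; str.strip() would also eat matching trailing characters
    doi = str(key)[len(prefix):] if str(key).startswith(prefix) else str(key)
    if doi in title_lookup:
      file.write("%s\t%s\t%s\n" % (title_lookup[doi], doi, score))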