# **CS 586: Final Project**
COVID-19: Understanding the range of incubation periods and how long individuals are contagious after recovery.
-----------------------------------------------------------------------
This purpose of this notebook is to take a .csv file of relevant papers and use NetworkX to run the PageRank algorithm. The results are then saved to a separate .csv file. 

## **Installing Dependencies**
This installs the modules that are not already provided and mounts Google Drive if using Colab. 

In [3]:
# Installs dependencies and 
!pip install rdflib
!apt-get -y install python-dev graphviz libgraphviz-dev pkg-config
!pip install pygraphviz

# Mounting Google drive if using Colab 
from google.colab import drive
drive.mount('/content/drive')

Collecting rdflib
[?25l  Downloading https://files.pythonhosted.org/packages/d0/6b/6454aa1db753c0f8bc265a5bd5c10b5721a4bb24160fb4faf758cf6be8a1/rdflib-5.0.0-py3-none-any.whl (231kB)
[K     |████████████████████████████████| 235kB 7.9MB/s 
Collecting isodate
[?25l  Downloading https://files.pythonhosted.org/packages/9b/9f/b36f7774ff5ea8e428fdcfc4bb332c39ee5b9362ddd3d40d9516a55221b2/isodate-0.6.0-py2.py3-none-any.whl (45kB)
[K     |████████████████████████████████| 51kB 7.4MB/s 
[?25hInstalling collected packages: isodate, rdflib
Successfully installed isodate-0.6.0 rdflib-5.0.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
pkg-config is already the newest version (0.29.1-0ubuntu2).
python-dev is already the newest version (2.7.15~rc1-1).
graphviz is already the newest version (2.40.1-2).
The following additional packages will be installed:
  libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common
  libgvc6-plugins-gtk li

## **Preparing the Knowledge Graph**

In order to utilize the open-source knowledge graph, we'll parse the N-Triples file provided using `rdflib`.

In [4]:
import rdflib
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm

In [5]:
# Load the knowledge-graph (this will take a bit)
# Note that the invalid URIs warnings are unfortunately due to the 
# errors in the open source dataset
kg = rdflib.Graph()
kg.load('/content/drive/Shared drives/CS 586: Data & Web Semantics/Final Project/covid19-literature-knowledge-graph/covid-kg4.nt', format='nt')

http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.
http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize 

In [6]:
# Check that we were able to load the knowledge graph properly
# Number of triples
print(len(list(kg.triples((None, None, None)))))

#Predicates
print(len(set(kg.predicates())))

# Number of subjects
print(len(set(kg.subjects())))

# Predicates
for pr in set(kg.predicates()):
   print(pr)

5151961
9
1362044
http://dbpedia.org/ontology/city
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://purl.org/spar/cito/cites
http://dbpedia.org/ontology/country
http://xmlns.com/foaf/0.1/surname
http://xmlns.com/foaf/0.1/firstName
http://purl.org/dc/terms/identifier
http://purl.org/spar/pro/creator
http://www.w3.org/ns/org#memberOf


## **Prepare the Dataset**

Below we take the .csv file containing the relevant papers and only keep the papers that have been cited in other papers.

In [7]:
# Load the csv of papers
import pandas as pd 
data = pd.read_csv('/content/drive/Shareddrives/CS 586: Data & Web Semantics/Final Project/final_ib.csv')

# Extract the doi values from the dataset and convert to the URIRef format
doi_list = list(data['doi'].dropna().apply(lambda x: 'http://dx.doi.org/' + x.strip('doi.org').strip('http://dx.doi.org/')).values)

# Only keep the papers that are cited by other papers in our list
papers = []
for doi in tqdm(doi_list):
  triples_list = list(kg.triples((None, rdflib.URIRef('http://purl.org/spar/cito/cites'), rdflib.URIRef(str(doi)))))
  if len(triples_list) > 0:
    papers += triples_list


HBox(children=(FloatProgress(value=0.0, max=7719.0), HTML(value='')))




## **Generate the Subgraph**

Since we don't need the entire knowledge graph, we'll create a subgraph that contains the papers from the previous step.

In [8]:
# Generate subgraph from the papers 
subgraph = rdflib.Graph()
for paper in tqdm(papers):
  subgraph.add(paper)

# Check for successful subgraph creation below
# Number of triples
print(len(list(subgraph.triples((None, None, None)))))

#Predicates
print(len(set(subgraph.predicates())))

# Number of subjects
print(len(set(subgraph.subjects())))

# Predicate
for pr in set(subgraph.predicates()):
   print(pr)

HBox(children=(FloatProgress(value=0.0, max=51518.0), HTML(value='')))


15295
1
9432
http://purl.org/spar/cito/cites


## **Convert to NetworkX Graph and run PageRank**

We'll now need to convert our subgraph from the previous step to a directed NetworkX graph with weighted edges. The weighted edges ensure that the PageRank algorithm will converge. Once the PageRank is completed, we can save the rankings to a .csv file.

In [9]:
# Conversion to networkx graph for analysis
networkGraph = rdflib_to_networkx_digraph(subgraph, calc_weights=True)

# Check that this was successful
print("Success! NetworkX Graph has length", len(networkGraph))
print("Number of Nodes:", networkGraph.number_of_nodes())
print("Number of Edges:", networkGraph.number_of_edges())

Success! NetworkX Graph has length 10314
Number of Nodes: 10314
Number of Edges: 15295


In [10]:
# Generate the PageRank of the graph
page_rankings = nx.pagerank(networkGraph)

# Convert to .csv
with open('page_ranking.csv', 'w') as file:
  for key in page_rankings.keys():
    file.write("%s,%s\n"%(key,page_rankings[key]))