# **CS 586: Final Project**
COVID-19: Understanding the range of incubation periods and how long individuals are contagious after recovery.
-----------------------------------------------------------------------
The purpose of this notebook is to take a .csv file of relevant papers, use NetworkX to run the PageRank algorithm on their citation graph, and save the results to a separate .csv file.

## **Installing Dependencies**
This installs the modules that are not already provided. 

In [1]:
# Installs dependencies
!pip install rdflib
!pip install networkx
!pip install tqdm

Collecting rdflib
  Downloading rdflib-5.0.0-py3-none-any.whl (231 kB)
Collecting isodate
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.0 rdflib-5.0.0


## **Preparing the Knowledge Graph**

To use the open-source knowledge graph, we parse the provided N-Triples file with `rdflib`.

In [13]:
import rdflib
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
import networkx as nx
from tqdm import tqdm

In [8]:
# Load the knowledge graph (this will take a bit).
# The invalid-URI warnings below come from malformed DOIs in the
# open-source dataset itself.
# Graph.parse also works on newer rdflib releases, where Graph.load
# is no longer available.
kg = rdflib.Graph()
kg.parse('data/covid-kg.nt', format='nt')

http://dx.doi.org/10.1137/040604947\end{doi does not look like a valid URI, trying to serialize this will break.

In [9]:
# Check that we were able to load the knowledge graph properly
# Number of triples
print(len(kg))

# Number of distinct predicates
predicates = set(kg.predicates())
print(len(predicates))

# Number of distinct subjects
print(len(set(kg.subjects())))

# List the distinct predicates
for pr in predicates:
   print(pr)

5151961
9
1362044
http://purl.org/spar/cito/cites
http://xmlns.com/foaf/0.1/surname
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://purl.org/spar/pro/creator
http://dbpedia.org/ontology/country
http://www.w3.org/ns/org#memberOf
http://dbpedia.org/ontology/city
http://purl.org/dc/terms/identifier
http://xmlns.com/foaf/0.1/firstName


## **Prepare the Dataset**

Below we load the .csv file of relevant papers and keep only the papers that are cited by at least one other paper in the knowledge graph.

In [14]:
# Load the csv of papers
import re
import pandas as pd
data = pd.read_csv('csvs/final_ib.csv')

# Extract the doi values from the dataset and convert to the URIRef format.
# str.strip() removes a set of characters rather than a prefix and can eat
# into the DOI itself, so the optional URL prefix is removed with a regex.
doi_prefix = re.compile(r'^(?:https?://)?(?:dx\.)?doi\.org/')
doi_list = list(data['doi'].dropna().apply(lambda x: 'http://dx.doi.org/' + doi_prefix.sub('', x.strip())).values)

# Only keep the papers that are cited by at least one paper in the knowledge graph
cites = rdflib.URIRef('http://purl.org/spar/cito/cites')
papers = []
for doi in tqdm(doi_list):
  triples_list = list(kg.triples((None, cites, rdflib.URIRef(str(doi)))))
  if len(triples_list) > 0:
    papers += triples_list


100%|██████████| 7719/7719 [00:00<00:00, 21133.64it/s]
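
The DOI-to-URI conversion can be checked in isolation on a few input shapes. The sketch below is an illustration only: the helper name `to_doi_uri` and the sample DOIs are hypothetical, and it assumes DOIs arrive either bare or prefixed with some form of `doi.org/`. It also shows why `str.strip` is unsuitable for this job, since it removes any of the given characters from both ends rather than a literal prefix.

```python
import re

# Hypothetical helper: optional scheme, optional "dx.", then "doi.org/"
_DOI_PREFIX = re.compile(r'^(?:https?://)?(?:dx\.)?doi\.org/', re.IGNORECASE)

def to_doi_uri(raw):
    """Return the http://dx.doi.org/ URI for a bare DOI or a DOI URL."""
    bare = _DOI_PREFIX.sub('', raw.strip())
    return 'http://dx.doi.org/' + bare

# Made-up DOIs in the shapes a spreadsheet column might contain
examples = [
    '10.1000/xyz123',
    'doi.org/10.1000/xyz123',
    'http://dx.doi.org/10.1000/xyz123',
    ' https://doi.org/10.1000/abc.d ',
]
uris = [to_doi_uri(e) for e in examples]

# str.strip treats its argument as a character set, so it eats the
# trailing ".d" of this DOI even though no prefix is present
stripped = '10.1000/abc.d'.strip('doi.org')

for uri in uris:
    print(uri)
print(stripped)
```

The first three inputs map to the same URI, which is what the lookup against the knowledge graph needs.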


## **Generate the Subgraph**

Since we don't need the entire knowledge graph, we'll create a subgraph that contains the papers from the previous step.

In [15]:
# Generate subgraph from the papers 
subgraph = rdflib.Graph()
for paper in tqdm(papers):
  subgraph.add(paper)

# Check for successful subgraph creation below
# Number of triples
print(len(subgraph))

# Number of distinct predicates
sub_predicates = set(subgraph.predicates())
print(len(sub_predicates))

# Number of distinct subjects
print(len(set(subgraph.subjects())))

# List the distinct predicates
for pr in sub_predicates:
   print(pr)

100%|██████████| 51518/51518 [00:00<00:00, 115693.07it/s]


15295
1
9432
http://purl.org/spar/cito/cites


## **Convert to NetworkX Graph and run PageRank**

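We now convert the subgraph from the previous step to a directed NetworkX graph with weighted edges. With `calc_weights=True`, each edge's weight counts the triples between its two nodes, and `nx.pagerank` reads that `weight` attribute by default; since the subgraph holds a single predicate, every weight here is 1. Convergence is governed by PageRank's damping factor (0.85 by default in NetworkX) and iteration tolerance, not by the edge weights. Once PageRank completes, we save the rankings to a .csv file.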

In [16]:
# Conversion to networkx graph for analysis
networkGraph = rdflib_to_networkx_digraph(subgraph, calc_weights=True)

# Check that this was successful
print("Success! NetworkX Graph has length", len(networkGraph))
print("Number of Nodes:", networkGraph.number_of_nodes())
print("Number of Edges:", networkGraph.number_of_edges())

Success! NetworkX Graph has length 10314
Number of Nodes: 10314
Number of Edges: 15295
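
To make the PageRank step less of a black box, here is a simplified power-iteration sketch in plain Python. It is not NetworkX's implementation: the function name `pagerank_sketch`, its parameters, and the toy graph are hypothetical, and it ignores edge weights (which is harmless when every weight is 1). It does show the two ingredients that matter: the damping factor `alpha` and the redistribution of rank from nodes with no outgoing citations.

```python
def pagerank_sketch(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Simplified PageRank by power iteration on an unweighted directed graph."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        # Each node passes a damped, equal share of its rank to the nodes it cites
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += alpha * rank[src] / len(out_links[src])
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Toy citation graph: an edge (x, y) means "x cites y"
toy_edges = [('A', 'C'), ('B', 'C'), ('D', 'C'), ('D', 'A')]
ranks = pagerank_sketch(toy_edges)
top = max(ranks, key=ranks.get)
total = sum(ranks.values())
print(top, round(total, 6))
```

The scores form a probability distribution over the nodes, so only their relative order is meaningful when comparing papers.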


In [17]:
# Generate the PageRank of the graph
page_rankings = nx.pagerank(networkGraph)

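# Save the rankings as (node URI, score) rows. csv.writer quotes any
# URI that contains a comma, which plain string formatting would not.
import csv
with open('csvs/page_ranking.csv', 'w', newline='') as file:
  writer = csv.writer(file, lineterminator='\n')
  for key, score in page_rankings.items():
    writer.writerow([key, score])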
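
Once the rankings are on disk, a natural next step is to read them back and list the highest-ranked papers. The sketch below assumes the two-column, header-free layout written above and URIs without stray commas; the helper `top_ranked` is hypothetical, and the URIs and scores are made up and held in an in-memory string rather than read from `csvs/page_ranking.csv`.

```python
import csv
import io

def top_ranked(csv_file, k=3):
    """Return the k highest-scoring (uri, score) rows from a two-column CSV."""
    rows = [(uri, float(score)) for uri, score in csv.reader(csv_file)]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]

# Made-up URIs and scores standing in for the real ranking file
sample = io.StringIO(
    "http://example.org/paperA,0.12\n"
    "http://example.org/paperB,0.40\n"
    "http://example.org/paperC,0.05\n"
)
best = top_ranked(sample, k=2)
print(best)
```

With the real file, the same function could be called on `open('csvs/page_ranking.csv', newline='')`.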