# Final assignment Programming 3
- Programme: Data Science for Life Sciences
- Institute: Hanze University of Applied Sciences
- Lecturer: Martijn Herber
- Author: Jan Rombouts

# Introduction
This project investigates the scientific literature using all published papers available on NCBI PubMed in XML format. It combines parallel processing and graph theory to examine the structure of publishing in the scientific world. Specifically, it addresses the following questions:

- How large a group of co-authors does the average publication have?
- Do authors mostly publish with the same group of co-authors?
- Do authors mainly reference papers written by people with whom they have co-authored papers (including themselves)?
- What is the distribution in time for citations of papers in general, and for papers with the highest number of citations? Do they differ?
- Is there a correlation between citations and the number of keywords that papers share? That is, do papers that share a subject cite each other more often?
- For the most-cited papers (define your own cutoff), does the correlation in shared keywords between them and the papers that cite them differ from the correlation found in the previous question?
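
As a starting point for the first two questions, the sketch below computes a mean co-author group size and counts how often each pair of authors shares a paper. It is only one possible choice of metric. The `papers` mapping (PMID to list of author names) and both function names are assumptions made for illustration, not part of the assignment.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def mean_group_size(papers):
    """Mean number of distinct authors per paper, ignoring papers without authors."""
    sizes = [len(set(authors)) for authors in papers.values() if authors]
    return mean(sizes) if sizes else 0.0


def coauthor_edge_weights(papers):
    """Count how often each unordered pair of authors appears on the same paper."""
    weights = defaultdict(int)
    for authors in papers.values():
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy data: PMID -> author names.
papers = {
    "1": ["Doe J", "Roe R"],
    "2": ["Doe J", "Roe R", "Poe E"],
    "3": ["Poe E"],
}
group_size = mean_group_size(papers)
weights = coauthor_edge_weights(papers)
```

The pair counts can serve as edge weights in a co-authorship graph: a pair with a high weight relative to each author's total output suggests a recurring group.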

Note that it probably helps to parse the actual NCBI XML data only once and then save it in another, smaller format suitable for answering the questions (e.g. a NetworkX data structure).
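
A minimal sketch of such a one-time parsing step is shown below, using only the standard library. It assumes the usual PubMed element names (`PubmedArticle`, `MedlineCitation/PMID`, `AuthorList/Author`, `KeywordList/Keyword`, and `ReferenceList` entries with `ArticleId IdType="pubmed"`). The record layout, the function names, and the JSON Lines output are illustrative choices; the records could equally be loaded into a NetworkX graph afterwards.

```python
import io
import json
import xml.etree.ElementTree as ET


def extract_records(xml_source):
    """Yield one compact dict per <PubmedArticle>.

    xml_source may be a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        names = (
            f"{a.findtext('LastName', '')} {a.findtext('Initials', '')}".strip()
            for a in elem.iterfind(".//AuthorList/Author")
        )
        year = elem.findtext(".//JournalIssue/PubDate/Year")
        ref_path = ".//ReferenceList/Reference/ArticleIdList/ArticleId[@IdType='pubmed']"
        record = {
            "pmid": elem.findtext("MedlineCitation/PMID"),
            "year": int(year) if year and year.isdigit() else None,
            "authors": [n for n in names if n],
            "keywords": [k.text.strip().lower()
                         for k in elem.iterfind(".//KeywordList/Keyword") if k.text],
            "references": [r.text for r in elem.iterfind(ref_path) if r.text],
        }
        elem.clear()  # drop the article's subtree once its fields are copied out
        yield record


def save_records(records, stream):
    """Write records as JSON Lines: one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


# Tiny hand-written sample in the assumed PubMed layout (not real data).
SAMPLE_XML = b"""<PubmedArticleSet>
<PubmedArticle>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2001</Year></PubDate></JournalIssue></Journal>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
      </AuthorList>
    </Article>
    <KeywordList><Keyword>Graphs</Keyword></KeywordList>
  </MedlineCitation>
  <PubmedData>
    <ReferenceList>
      <Reference><ArticleIdList><ArticleId IdType="pubmed">2</ArticleId></ArticleIdList></Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
</PubmedArticleSet>"""

records = list(extract_records(io.BytesIO(SAMPLE_XML)))
buffer = io.StringIO()
save_records(records, buffer)
```

Keeping only the PMID, year, authors, keywords, and cited PMIDs leaves out abstracts and journal metadata, which none of the questions above require.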

Your script should take as input the directory of PubMed abstracts in data/datasets/NCBI/Pubmed and produce nothing but a CSV listing each question asked above alongside the number that answers it. Please note that the numbers will differ depending on what you choose to use as metrics, and there is no "right" answer! It's about how you motivate the answers that those numbers represent in your report, not the numbers themselves.
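
The command-line shape described above could look roughly like the skeleton below. Everything here is an assumption for illustration: the shortened question labels, the `compute_answers` placeholder (which returns NaN instead of real results), and the `*.xml*` file pattern. Only the CSV is written to standard output.

```python
import argparse
import csv
import math
import sys
from pathlib import Path

# Shortened, illustrative labels for the six questions.
QUESTION_LABELS = [
    "Average co-author group size",
    "Tendency to publish with the same co-authors",
    "Tendency to cite own co-authors",
    "Citation distribution over time",
    "Correlation between citations and shared keywords",
    "Same correlation for most-cited papers",
]


def compute_answers(xml_files):
    """Placeholder: the real metrics would be computed from xml_files here."""
    return [(label, math.nan) for label in QUESTION_LABELS]


def write_answers(answers, stream):
    """Write one 'question,number' row per answer."""
    writer = csv.writer(stream)
    for question, value in answers:
        writer.writerow([question, value])


def main(argv=None):
    parser = argparse.ArgumentParser(description="Answer the PubMed questions as CSV.")
    parser.add_argument("input_dir", type=Path, help="directory with PubMed XML files")
    args = parser.parse_args(argv)
    xml_files = sorted(args.input_dir.glob("*.xml*"))  # assumed file naming
    write_answers(compute_answers(xml_files), sys.stdout)
```

Calling `main()` from an `if __name__ == "__main__":` guard turns this into a script that accepts the data directory as its only argument.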

In [None]:
# data_dir = 'data/datasets/NCBI/Pubmed'  # input directory with the PubMed XML files

In [None]:
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Assumes the GraphFrames package is available to this Spark installation.
spark = SparkSession.builder.appName("graphframes-example").getOrCreate()

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
# The result is printed explicitly; a bare expression mid-cell is not displayed.
follow_count = g.edges.filter("relationship = 'follow'").count()
print(follow_count)

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()