# Citation Graph




## Background

Every research project starts with an overview of the existing literature. Researchers search for papers and books through search engines such as Google Scholar, or follow the citations in the literature they already have. This process is not always easy, because it does not reliably surface all important publications in a given field of research.

## Goal of the project

The goal of the project is to develop a tool that helps researchers systematically explore the existing literature on a topic and that can be adapted to their needs. Starting from a selection of relevant literature and suitable keywords, a graph is built from the references. The graph can then be tailored to the researcher's specific needs interactively and with the help of NLP methods.

## Problem statement

* *User interface:* For creating, visualizing, navigating and interactively adjusting the graph. The Microsoft Academic Graph API, for example, can be used to build the graph.

* *Graph analysis:* Applying various graph analysis methods to suggest further literature.

* *User interaction:* To refine the search in the graph and better adapt it to the researcher's specific situation.

* *Topic modelling:* Applying clustering algorithms and topic modelling to the abstracts to identify topics in the literature.

## Technologies / Focus areas / References

* Data analysis with Python, Pandas, iGraph and NLTK
* Use of the Microsoft Academic Graph API
* Simple web application, for example with Flask (Python), HTML, CSS and JavaScript


The downloader builds a database of papers that are connected to a list of "seed papers". It downloads the metadata of all papers that are cited in the *seed papers*, as well as the metadata of all papers that cite the seed papers.

The resulting "database" is stored as JSON and looks like this:

~~~json
{
  "21202130": {
    "AA": [
      {
        "S": 1,
        "AuId": 2146551188,
        "AfN": "ohio state university",
        "AfId": 52357470,
        "AuN": "james moody"
      },
      …
    ],
    "DN": "Dynamic network visualization",
    "V": 110,
    "ECC": 449,
    "D": "2005-01-01",
    "F": [
      {
        "FId": 62886766,
        "FN": "organizational network analysis"
      },
      …
    ],
    "CC": 261,
    "LP": 1241,
    "J": {
      "JN": "amer j sociol",
      "JId": 122471516
    },
    "L": "en",
    "FP": 1206,
    "IA": {
      "IndexLength": 129,
      "InvertedIndex": {
        "“movies.”": [
          23
        ],
        "particularly": [
          89
        ],
        "over": [
          70
        ],
        …
      }
    },
    "S": [
      {
        "U": "http://www.jstor.org/stable/10.1086/421509",
        "Ty": 3
      },
      …
    ],
    "VFN": "American Journal of Sociology",
    "W": [
      "dynamic",
      "network",
      "visualization"
    ],
    "Ti": "dynamic network visualization",
    "Y": 2005,
    "RId": [
      2112090702,
      2148606196,
      …
    ],
    "logprob": -17.963,
    "Id": 21202130,
    "I": 4
  },
  …
}
~~~
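
The abstract is not stored as plain text. The `IA` field holds an inverted index that maps each word to the positions where it occurs, so the text has to be rebuilt before it can be used for topic modelling. A minimal sketch, assuming `IndexLength` is the number of word positions; the function name `abstract_from_inverted_index` is hypothetical and not part of the original code:

```python
def abstract_from_inverted_index(ia):
    """Rebuild the abstract text from an "IA" entry of the database."""
    words = [""] * ia["IndexLength"]
    for word, positions in ia["InvertedIndex"].items():
        for position in positions:
            # Ignore positions outside the declared length instead of failing
            if 0 <= position < len(words):
                words[position] = word
    # Positions without a word (for example in a truncated index) are skipped
    return " ".join(word for word in words if word)


# Tiny hand-made example, not taken from the real database
example = {
    "IndexLength": 3,
    "InvertedIndex": {"network": [1], "dynamic": [0], "visualization": [2]},
}
print(abstract_from_inverted_index(example))  # dynamic network visualization
```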

The downloader uses the Microsoft Academic Graph API to find papers by title or by their Id like this:

~~~
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=Ti='iterating between tools to create and edit visualizations'&model=latest&count=10&offset=0&attributes=Id,Ti,L,Y,D,CC,ECC,AA.AuN,AA.AuId,AA.AfN,AA.AfId,AA.S,F.FN,F.FId,J.JN,C.CId,RId,W,E.DN,E.S,E.S.Ty,E.DOI
~~~

To query the API, you need to create your own key for [Microsoft Cognitive](https://labs.cognitive.microsoft.com/en-us/project-academic-knowledge) and add it to `config.py` as `api_key`.
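
The code in this document reads the key as `config.api_key`. One way to provide it without committing the key to the repository is to read it from an environment variable. A minimal sketch of `config.py`; the variable name `MAG_API_KEY` is an assumption and can be chosen freely:

```python
# config.py
import os

# Subscription key sent in the "Ocp-Apim-Subscription-Key" header.
# MAG_API_KEY is a hypothetical variable name; an empty string means "not set".
api_key = os.environ.get("MAG_API_KEY", "")
```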

## Open Academic Graph

In the long term, it would be better not to rely on a rate-limited API graciously provided by Microsoft. Fortunately, they are not evil nowadays and provide everything as [Open Data](https://www.openacademic.ai/oag/). We would need to load that into a graph database, et voilà.

## Setup 

In [1]:
import requests
import json
import time
import config
from copy import deepcopy
import re

In [2]:
# Load the paper database
db = {}
with open('db.json', 'r') as db_json:
    db = json.load(db_json)

In [3]:
def normalize(name):
    """Lowercase a title and strip the punctuation the API does not accept."""
    name = name.lower()
    name = name.replace("-", " ")       # hyphens become spaces
    name = re.sub(r"[.?:,]", "", name)  # drop the remaining punctuation
    name = re.sub(r"\s\s+", " ", name)  # collapse runs of whitespace
    return name

In [4]:
def id_for_name(name):
    """Look up a paper by its title and return its Id as a string (None if not found)."""
    normalized = normalize(name)
    url = 'https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate'
    params = {
        'expr': "Ti='" + normalized + "'",
        'model': 'latest',
        'count': 10,
        'offset': 0,
        'attributes': 'Id,Ti,L,Y,D,CC,ECC,AA.AuN,AA.AuId,AA.AfN,AA.AfId,AA.S,F.FN,F.FId,J.JN,J.JId,C.CN,C.CId,RId,W,E.DN,E.S,E.VFN,E.VSN,E.V,E.I,E.FP,E.LP,E.DOI,E.CC,E.IA'
    }
    headers = {'Ocp-Apim-Subscription-Key': config.api_key}
    r = requests.get(url, params=params, headers=headers)

    result = r.json()

    # 'entities' may be missing (error response) or empty (no match)
    if result.get('entities'):
        return str(result['entities'][0]['Id'])
    print('No match for: ' + name)
    return None

In [5]:
# Define some useful functions to query the Microsoft Academic Graph API
def download_for_id(paper_id):
    """Download the metadata of the paper with the given Id (None if not found)."""
    url = 'https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate'
    params = {
        'expr': 'Id=' + str(paper_id),
        'model': 'latest',
        'count': 10,
        'offset': 0,
        'attributes': 'Id,Ti,L,Y,D,CC,ECC,AA.AuN,AA.AuId,AA.AfN,AA.AfId,AA.S,F.FN,F.FId,J.JN,J.JId,C.CN,C.CId,RId,W,E.DN,E.S,E.VFN,E.VSN,E.V,E.I,E.FP,E.LP,E.DOI,E.CC,E.IA'
    }
    headers = {'Ocp-Apim-Subscription-Key': config.api_key}

    r = requests.get(url, params=params, headers=headers)
    result = r.json()

    # 'entities' may be missing (error response) or empty (no match)
    if result.get('entities'):
        return result['entities'][0]
    print('Not found: ' + str(paper_id))
    return None


In [6]:
def find_all_refs(paper_id):
    """Return all papers that cite the given paper (empty list if there are none)."""
    url = 'https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate'
    params = {
        'expr': 'RId=' + str(paper_id),  # papers whose reference list contains paper_id
        'model': 'latest',
        'count': 10000,
        'offset': 0,
        'attributes': 'Id,Ti,L,Y,D,CC,ECC,AA.AuN,AA.AuId,AA.AfN,AA.AfId,AA.S,F.FN,F.FId,J.JN,J.JId,C.CN,C.CId,RId,W,E.DN,E.S,E.VFN,E.VSN,E.V,E.I,E.FP,E.LP,E.DOI,E.CC,E.IA'
    }
    headers = {'Ocp-Apim-Subscription-Key': config.api_key}

    r = requests.get(url, params=params, headers=headers)
    result = r.json()

    if 'entities' in result:
        return result['entities']
    print('No result for: ' + str(paper_id))
    return []  # callers iterate over the result, so never return None

## 1. Identify seed papers

First we define a list of seed papers. We found the IDs manually through the Academic Graph API like this:

~~~
https://westus.api.cognitive.microsoft.com/academic/v1.0/evaluate?expr=Ti='iterating between tools to create and edit visualizations'
~~~

Some considerations:
* You first need to register for Microsoft Azure Cognitive Services and get an "Ocp-Apim-Subscription-Key". This key must be sent as a header with every request. Microsoft limits the quota that is available for free.

* The paper title in the "Ti" field needs to be all lowercase and stripped of special characters like ":" or "–". Otherwise the response will be empty.
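
The title rules above can be sketched as a small function. This is a hedged variant of the `normalize` helper from the setup, not a replacement for it: the name `normalize_title` is hypothetical, and it additionally treats the en dash like a hyphen, which is an assumption based on the note above.

```python
import re


def normalize_title(title):
    """Bring a paper title into the form expected by the "Ti" field."""
    title = title.lower()
    title = re.sub(r"[-–]", " ", title)   # hyphen and en dash become spaces
    title = re.sub(r"[.?:,]", "", title)  # drop the remaining punctuation
    return re.sub(r"\s+", " ", title).strip()


print(normalize_title("Animation - can it facilitate?"))
# animation can it facilitate
```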

In [7]:
seed_papers = [
    id_for_name("Animated transitions in statistical data graphics"),
    id_for_name("A deeper understanding of sequence in narrative visualization"),
    id_for_name("Toward a Deeper Understanding of the Role of Interaction in Information Visualization"),
    id_for_name("Understanding data videos: Looking at narrative visualization through the cinematography lens"),
    id_for_name("Animations 25 Years Later: New Roles and Opportunities"),
    id_for_name("Authoring Narrative Visualizations with Ellipsis"),
    id_for_name("The Not-so-Staggering Effect of Staggered Animated Transitions on Visual Tracking"),
    id_for_name("Animation - can it facilitate?"),
    id_for_name("Display of Key Pictures from Animation: Effects on Learning"),
]

In [8]:
seed_papers

['2125215841',
 '1980209219',
 '2161133721',
 '2143880496',
 '2408736664',
 '1530372876',
 '1989510471',
 '2108075477',
 '53303969']

## 2. Download the information of the seed papers

Download the metadata (including the citations) for the seed papers.

In [30]:
# 2. Download the information of the seed papers
papers = {}
for paper_id in seed_papers:
    if not (paper_id in db): # Only do a request if paper is not already in DB
        db[str(paper_id)] = download_for_id(str(paper_id))
    else:
        print('Already in DB')
    papers[str(paper_id)] = db[str(paper_id)]

Already in DB
Already in DB
Already in DB
Already in DB
Already in DB
Already in DB
Already in DB
Already in DB
Already in DB


## 3. Save a copy of the database to JSON

Saving the result to disk avoids repeating requests later and helps to stay within the quota.

In [31]:
with open('db.json', 'w') as outfile:
    json.dump(db, outfile)

## 4. Retrieve info of all papers cited in seed papers

This is the longest step and can take quite a few API requests. We avoid requesting the same paper twice and throttle the requests to stay friendly.

In [32]:
papers_clone = deepcopy(papers)
for paper in papers:
    # Only follow the references of seed papers. Otherwise we would also
    # download the references of papers downloaded previously.
    if paper not in seed_papers or 'RId' not in papers[paper]:
        continue
    for citation in papers[paper]['RId']:
        if str(citation) in db:
            print('In DB')
        else:
            db[str(citation)] = download_for_id(str(citation))
            print('downloaded: ' + str(citation))
            time.sleep(0.4)  # rate limit
        papers_clone[str(citation)] = db[str(citation)]

papers = papers_clone

In DB
In DB
In DB
…

In [33]:
# Save to DB again (dump db, not papers, so cached entries are not lost)
with open('db.json', 'w') as outfile:
    json.dump(db, outfile)

In [34]:
len(papers)

319
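
Steps 2 and 4 follow the same pattern: look in `db` first, download only on a miss, and pause after each request. A minimal sketch of that pattern as a reusable helper. The name `fetch_cached` and its parameters are hypothetical; `download` stands for a function such as `download_for_id`, and unlike the cells above this sketch does not cache failed lookups.

```python
import time


def fetch_cached(db, paper_id, download, delay=0.4, sleep=time.sleep):
    """Return the record for paper_id, downloading it only if it is not cached."""
    key = str(paper_id)
    if key in db:
        return db[key]
    record = download(key)
    sleep(delay)  # throttle: pause after every real request
    if record is not None:
        db[key] = record  # failed lookups are not cached
    return record


# Usage with a stand-in download function (no network access needed)
def fake_download(key):
    return {"Id": int(key)}


cache = {"1": {"Id": 1}}
print(fetch_cached(cache, 1, fake_download, delay=0))  # served from the cache
print(fetch_cached(cache, 2, fake_download, delay=0))  # "downloaded" and stored
print(sorted(cache))
```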

## 5. Retrieve all papers which cite the seed papers

In [35]:
for paper in seed_papers:
    # We always run this query: it only costs one request per seed paper,
    # so it is inexpensive
    referring = find_all_refs(paper) or []  # guard against a missing result
    for ref in referring:
        db[str(ref['Id'])] = ref
        papers[str(ref['Id'])] = ref

In [36]:
with open('db.json', 'w') as outfile:
    json.dump(db, outfile)

In [37]:
print(len(db), len(papers))

1979 1979
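
`find_all_refs` asks for up to 10000 entities in a single request. If smaller pages are preferred, the `count` and `offset` parameters shown in the query could be used for paging. A hedged sketch of such a loop: `fetch_all` and `fetch_page` are hypothetical names, and `fetch_page(offset, count)` stands for a function that performs one request and returns the list of entities.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect results page by page until a page comes back shorter than requested."""
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            return results
        offset += page_size


# Usage with an in-memory stand-in for the API
data = list(range(25))


def fake_page(offset, count):
    return data[offset:offset + count]


print(len(fetch_all(fake_page, page_size=10)))  # 25
```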


## 6. Create graph

In [38]:
import json
import igraph
import pandas as pd

In [39]:
anti_seeds = ['565', '188', '513', '747', '831']

In [40]:
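# Papers to include as vertices (everything except the anti-seeds)
paperli = [paper for paper in papers if paper not in anti_seeds]

G = igraph.Graph(directed=True)
G.add_vertices(len(paperli))

# Map paper Id -> vertex index so each reference is resolved with one lookup
index_of = {str(paper): i for i, paper in enumerate(paperli)}

# One edge per reference that points to a paper inside the graph
edges = []
for i, paper in enumerate(paperli):
    for citation in papers[paper].get('RId', []):
        j = index_of.get(str(citation))
        if j is not None:
            edges.append((i, j))
G.add_edges(edges)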

In [41]:
for i, paper in enumerate(paperli):
    G.vs[i]["id"] = papers[paper]['Id']
    G.vs[i]["citations"] = papers[paper]['CC']
    G.vs[i]["author"] = papers[paper]['AA'][0]["AuN"]
    G.vs[i]["year"] = papers[paper]['Y']
    G.vs[i]["name"] = papers[paper]['DN']

## 7. PageRank

In [42]:
# Weight each edge by the citation count of the citing paper (source)
# relative to the citation count of the cited paper (target)
G.es["weight"] = [
    G.vs[edge.source]["citations"] / G.vs[edge.target]["citations"]
    for edge in G.es
]

G.es[0]["weight"]

0.043478260869565216

In [43]:
G.vs["pagerank"] = G.pagerank(weights="weight")

In [44]:
pd.set_option('display.max_colwidth', 200)
pr = pd.DataFrame(list(zip(G.vs["id"], G.vs["author"], G.vs["name"], G.vs["pagerank"], G.vs["citations"])), columns=["id", "author", "name", "pr", "citations"])
pr.sort_values("pr", ascending=False).to_csv('pagerank.csv')
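
To make the role of the edge weights more concrete, the following is a small power-iteration sketch of weighted PageRank in plain Python. It is an illustration under stated assumptions, not igraph's implementation: the function name `weighted_pagerank`, the damping factor of 0.85, the fixed number of iterations and the uniform handling of vertices without outgoing edges are all choices made for this example.

```python
def weighted_pagerank(n, edges, damping=0.85, iterations=100):
    """edges is a list of (source, target, weight) tuples with weight > 0."""
    # Total outgoing weight per vertex, used to normalize the edge weights
    out_weight = [0.0] * n
    for source, _, weight in edges:
        out_weight[source] += weight

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly
        dangling = sum(rank[v] for v in range(n) if out_weight[v] == 0)
        new_rank = [(1 - damping) / n + damping * dangling / n] * n
        # Each vertex passes its rank along its edges, proportional to weight
        for source, target, weight in edges:
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Papers 0 and 1 both cite paper 2, so paper 2 ends up with the highest rank
ranks = weighted_pagerank(3, [(0, 2, 1.0), (1, 2, 3.0)])
print(ranks)
```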

## 8. Relative importance

In [45]:
# Collect per-paper metrics; "deg" counts the citations from within the graph
arr = []
for v in G.vs:
    arr.append([v["author"], v["year"], v["name"], v.indegree(), v['citations'], v["pagerank"]])
inner_importance = pd.DataFrame(arr, columns=("author", "year", "name", "deg", "citations", "pagerank"))
# Share of a paper's citations that come from within the graph, scaled by recency
inner_importance["rating"] = (inner_importance["deg"] / inner_importance["citations"]) * (inner_importance["year"]-1900)
# Keep papers with more than 10 citations overall and more than 1 within the graph
inner_importance = inner_importance[inner_importance['citations']>10]
inner_importance = inner_importance[inner_importance['deg']>1]
inner_importance = inner_importance.sort_values("rating", ascending=False)

inner_importance.to_csv('reading.csv')
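
The rating used above can be read as the share of a paper's citations that come from inside the graph, multiplied by the number of years since 1900 so that newer papers score higher. A worked example with made-up numbers; the standalone function `rating` is hypothetical and only mirrors the column formula:

```python
def rating(deg, citations, year):
    """Share of citations that come from inside the graph, scaled by recency."""
    return (deg / citations) * (year - 1900)


# A paper from 2000 with 40 citations overall, 10 of them from papers in the graph
print(rating(10, 40, 2000))  # 25.0
# The same numbers for a paper from 1980 give a lower rating
print(rating(10, 40, 1980))  # 20.0
```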

## 9. Visualizing the graph

In [2]:
G.write_graphml('papers_graph.graphml')

[Gephi](https://gephi.org/) can be used for a first visualization of the graph. It's interesting to try and define clusters:

![](gephi.png)

There are also clustering options in igraph. Clusters can also be visualized and compared as a hive plot:

![](hiveplot.png)

## 10. Interesting projects

* [Headstart](https://github.com/OpenKnowledgeMaps/Headstart/blob/master/README.md)
* [PivotPaths](https://github.com/ShreenathIyer/pivot-paths)