# Exploratory analysis of S2ORC-Influence-IR corpus

This notebook explores the dataset from a network analysis perspective.

In [1]:
import json
import pathlib
from typing import Iterator, Dict, Any

import networkx as nx

In [2]:
# path
ROOT = pathlib.Path("./").resolve().parent
RAW = ROOT / "data" / "raw"

# type aliases
# they don't change the behaviour of the code,
# but help to understand what a particular function
# takes as input or returns as output
Publication = Dict[str, Any]

In [3]:
def read_jsonl(p: pathlib.Path) -> Iterator[Publication]:
    """Yield .jsonl file's contents line by line."""
    # pathlib.Path (rather than PosixPath) keeps the hint valid on any OS
    with open(p, encoding="utf-8") as lines:
        for line in lines:
            yield json.loads(line)
            
            
def retrive_texts(data: Publication, field: str = "body_text") -> str:
    """Parse 'body_text' or 'abstract' fields extracting raw texts."""
    return " ".join(section["text"] for section in data[field])
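
A minimal sketch of how the two helpers behave on a single record. The sample record and the file name `sample.jsonl` are hypothetical; only the `paper_id` key and the list-of-sections shape of `body_text` are taken from the code above.

```python
import json
import pathlib
import tempfile
from typing import Any, Dict, Iterator

Publication = Dict[str, Any]


def read_jsonl(p: pathlib.Path) -> Iterator[Publication]:
    """Yield .jsonl file's contents line by line."""
    with open(p, encoding="utf-8") as lines:
        for line in lines:
            yield json.loads(line)


def retrive_texts(data: Publication, field: str = "body_text") -> str:
    """Join the 'text' of every section in the given field."""
    return " ".join(section["text"] for section in data[field])


# Hypothetical record: 'body_text' is assumed to be a list of sections,
# each carrying its raw text under the 'text' key.
sample = {
    "paper_id": "p1",
    "body_text": [{"text": "First section."}, {"text": "Second section."}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    # Consume the generator while the temporary file still exists.
    records = list(read_jsonl(path))

joined = retrive_texts(records[0])
print(joined)  # First section. Second section.
```

Because `read_jsonl` is a generator, the file is read lazily, so it has to be consumed before the file is closed or removed.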

## Based on `texts.jsonl`

In [4]:
fulltexts = {}
citation_context = {}
for publication in read_jsonl(RAW / "texts.jsonl"):
    _id = publication["paper_id"]
    
    fulltexts[_id] = publication
    fulltexts[_id]["body_text"] = retrive_texts(publication)
    
    citation_context[_id] = [
        citation["link"]
        for citation in publication["bib_entries"].values()
        if citation["link"] is not None
    ]

In [5]:
metadata = {}
for publication in read_jsonl(RAW / "metadata.jsonl"):
    _id = publication["paper_id"]
    if _id in citation_context:
        metadata[_id] = publication

In [6]:
len(citation_context)

9411

## Measuring centrality

We assume that the more central a publication is within our network, the more important it is.
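
For reference, the degree centrality that `nx.degree_centrality` computes for a node $v$ in a graph with $n$ nodes is its degree normalised by the largest possible degree in a simple graph:

$$
C_D(v) = \frac{\deg(v)}{n - 1}
$$

In an undirected graph, $\deg(v)$ counts every edge touching $v$, so incoming and outgoing citations both contribute.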

In [7]:
G = nx.from_dict_of_lists(citation_context)

In [8]:
len(G.nodes())

109242

In [9]:
len(G.edges())

124163
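
The node count is much larger than the 9411 keys of `citation_context` because every cited id becomes a node as well, even when its full text is not in the corpus. A plain-Python sketch of that bookkeeping on hypothetical toy data (the ids `a` to `d` are made up, and this only mimics how an undirected graph is built from an adjacency dict, it is not the networkx implementation):

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy adjacency dict: paper id -> ids it cites.
toy_context: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["c", "a"],  # "b" cites "a" back: same undirected edge as a-b
    "d": [],          # no resolved citations, still a node
}

nodes: Set[str] = set(toy_context)
edges: Set[FrozenSet[str]] = set()
for source, targets in toy_context.items():
    for target in targets:
        nodes.add(target)                       # cited-only ids become nodes
        edges.add(frozenset((source, target)))  # direction is dropped

print(len(toy_context), len(nodes), len(edges))  # 3 4 3
```

Here `c` appears only as a citation target, yet it is counted as a node, and the mutual citation between `a` and `b` collapses into one edge.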

In [10]:
# calculate 'degree_centrality'
dc = nx.degree_centrality(G)
# sort
centrality = dict(sorted(dc.items(), key=lambda item: item[1], reverse=True))
# select top 10 papers by 'degree_centrality'
top_centrality = list(centrality.keys())[:10]

In [12]:
metadata.get(top_centrality[0]).keys()

dict_keys(['paper_id', 'title', 'authors', 'abstract', 'year', 'arxiv_id', 'acl_id', 'pmc_id', 'pubmed_id', 'doi', 'venue', 'journal', 'has_pdf_body_text', 'mag_id', 'mag_field_of_study', 'outbound_citations', 'inbound_citations', 'has_outbound_citations', 'has_inbound_citations', 'has_pdf_parse', 'has_pdf_parsed_abstract', 'has_pdf_parsed_body_text', 'has_pdf_parsed_bib_entries', 'has_pdf_parsed_ref_entries', 's2_url'])

In [13]:
for paper_id in top_centrality:
    print(metadata[paper_id]["title"], metadata[paper_id]["year"])

The Economics of International Student and Scholar Mobility : Directions for Research 2019
Peace building after Civil War: A Critical Survey of the Literature and Avenues for Future Research 2017
Political Corporate Social Responsibility: Reviewing Theories and Setting New Agendas 2015
Notes on the Determinants of Innovation: A Multi-Perspective Analysis 2004
Reinventing foreign aid for inclusive and sustainable development: a survey 2014
The disclosure of concealable stigmas: Analysis anchored in trust 2016
On the Tension between Sex Equality and Religious Freedom 2007
Theories of War in an Era of Leading-Power Peace Presidential Address, American Political Science Association, 2001 2002
Military Tribunals and Legal Culture: What a Difference Sixty Years Makes 2002
Cognitive Approaches to Foreign Policy Analysis 2017


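I don't think the result looks very impressive: several of the top papers are surveys or reviews, which likely score high because the undirected graph also counts their long reference lists. Still, it is a good start.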
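
One possible follow-up, not part of the analysis above, is to count only incoming citations. In networkx this would presumably mean building the graph with `create_using=nx.DiGraph` and calling `nx.in_degree_centrality`. The plain-Python sketch below illustrates the difference on hypothetical toy data; all ids are made up.

```python
from collections import Counter

# Hypothetical toy data: "survey" cites many papers,
# while "classic" is cited by many.
toy_context = {
    "survey": ["classic", "x", "y", "z"],
    "p1": ["classic"],
    "p2": ["classic", "x"],
}

nodes = set(toy_context)
in_degree = Counter()  # incoming citations only
degree = Counter()     # undirected degree: incoming + outgoing
for source, targets in toy_context.items():
    for target in set(targets):  # ignore duplicate links
        nodes.add(target)
        in_degree[target] += 1
        # Valid as an undirected degree because the toy data
        # contains no mutual citations.
        degree[source] += 1
        degree[target] += 1

n = len(nodes)
# Same normalisation as degree centrality: divide by n - 1.
in_centrality = {node: in_degree[node] / (n - 1) for node in nodes}
undirected_centrality = {node: degree[node] / (n - 1) for node in nodes}

top_undirected = max(undirected_centrality, key=undirected_centrality.get)
top_in = max(in_centrality, key=in_centrality.get)
print(top_undirected, top_in)  # survey classic
```

With direction ignored, the paper with the longest reference list wins; with incoming citations only, the most cited paper does.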

---