# papeles package - institutions network analysis example

In this notebook, all institutions from Neurips papers are extracted, and an institutions network is then built from the co-occurrence of institutions on the same research paper. Data was obtained using the [neurips_crawler](https://github.com/glhuilli/neurips_crawler).

The `papeles` package is used to extract a clean version of the institutions by processing the front page of the research papers, identifying from there which institutions were involved in the research. More details about which institutions are extracted and how these are cleaned up can be found in the `papeles` package. 



In [1]:
import os
import json
from collections import defaultdict

from tqdm.notebook import tqdm


from papeles.paper.neurips import get_key


## Loading the data

Data is already available in `/var/data/neurips_analysis`. To run this notebook, create this folder and download the data from the GitHub repository `xxx`.

These files have already been converted from `pdf` to `txt`, keeping only the `header` of each paper (everything from the beginning of the document up to the abstract) using the `papeles` package. For more details, see the IPython notebook `xxx`. There are also two files with keywords already extracted from the entire corpus, also using the `papeles` package. For more details, see the IPython notebook `xxx`.
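
Before loading, it can help to check that the expected folders are in place. The sketch below assumes the layout used by the loading cell (`files_headers` and `files_metadata` under the data path); the helper name `missing_data_folders` is hypothetical and not part of `papeles`.

```python
import os

# Subfolders read by the loading cell in this notebook.
EXPECTED_SUBFOLDERS = ('files_headers', 'files_metadata')


def missing_data_folders(root, expected=EXPECTED_SUBFOLDERS):
    """Return the expected subfolders that do not exist under `root`."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(root, name))]
```

Calling `missing_data_folders('/var/data/neurips_analysis')` returns an empty list when both folders exist, and the names of the missing ones otherwise.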

In [2]:
# These files have encoding issues and were not parsed correctly by the pdf_parser
SKIP_FILES = [
    '5049-nonparametric-multi-group-membership-model-for-dynamic-networks.pdf_headers.txt',
    '4984-cluster-trees-on-manifolds.pdf_headers.txt',
    '5820-alternating-minimization-for-regression-problems-with-vector-valued-outputs.pdf_headers.txt',
    '9065-visualizing-and-measuring-the-geometry-of-bert.pdf_headers.txt',
    '4130-implicit-encoding-of-prior-probabilities-in-optimal-neural-populations.pdf_headers.txt',
    '7118-local-aggregative-games.pdf_headers.txt'
]

NEURIPS_ANALYSIS_DATA_PATH = '/var/data/neurips_analysis'

file_lines = defaultdict(list)
for filename in tqdm(os.listdir(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_headers/')), 'loading files'):
    if filename in SKIP_FILES:
        continue
    with open(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, './files_headers/', filename), 'r') as f:
        for line in f.readlines():
            file_lines[get_key(filename)].append(line.strip())
            
metadata_path = os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_metadata/')

metadata = {}
for filename in tqdm(os.listdir(metadata_path), 'loading metadata'):
    with open(os.path.join(metadata_path, filename), 'r') as f: # open in readonly mode
        for line in f.readlines():
            data = json.loads(line)
            metadata[get_key(data['pdf_name'])] = data



In [3]:
from papeles.paper.neurips import institutions

from collections import Counter 
import itertools


In [4]:
inst_counter = Counter()
for file, lines in list(file_lines.items()):
    file_institutions = institutions.get_file_institutions(lines)
    unique_file_institutions = list(set(file_institutions))
    inst_counter.update(unique_file_institutions)

# Keep only non-empty institution names that were seen at least once
counts = [count for name, count in inst_counter.items() if name and count > 0]
cleaned = len(counts)
total = sum(counts)

print(f'current institutions: {cleaned}')
print(f'total raw institutions: {total}')
print(f'clean-up fraction: {1 - cleaned / total:0.2f}')


current institutions: 2989
total raw institutions: 9245
clean-up fraction: 0.68


## Institutions Interactions Graph 

To build the graph, we'll use both `networkx` and the `papeles` package, in particular the `build_institutions_graph` function from the `institutions_graph` module, which is tailored to Neurips data. I recommend reading its implementation, but the overall idea is this: for each file in the corpus (`file_lines`), optionally restricted to a particular `year`, it finds all institutions (from `inst_counter`) that appear together on that paper and links them. Only institutions with at least a minimum frequency (`freq`) in `inst_counter` are considered.
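
To make the co-occurrence idea concrete, here is a minimal plain-Python sketch. It is not the `institutions_graph` implementation; the function name `cooccurrence_edges`, its `min_freq` parameter, and the toy input are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paper_institutions, min_freq=1):
    """Count how often each pair of institutions appears on the same paper.

    paper_institutions: dict mapping a paper key to a list of institution names.
    min_freq: keep only institutions that appear on at least this many papers.
    """
    # Count each institution once per paper.
    freq = Counter()
    for names in paper_institutions.values():
        freq.update(set(names))

    edges = Counter()
    for names in paper_institutions.values():
        kept = sorted(n for n in set(names) if freq[n] >= min_freq)
        # Every unordered pair on the same paper becomes (or reinforces) an edge.
        edges.update(combinations(kept, 2))
    return edges


# Toy input (hypothetical papers, not taken from the corpus).
papers = {
    'p1': ['mit', 'inria'],
    'p2': ['mit', 'inria', 'nicta'],
    'p3': ['mit'],
}
edges = cooccurrence_edges(papers, min_freq=2)
```

With `min_freq=2`, `nicta` is dropped because it appears on a single paper, and the only remaining edge links `inria` and `mit`. The pairs could then be loaded into `networkx` with `g = nx.Graph()` followed by `g.add_edges_from(edges)`.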

The same module provides `dump_to_d3js_heb`, which is used below to generate the output needed for a second part of this analysis.



In [5]:
import networkx as nx

from papeles.paper.neurips import institutions_graph

In [6]:
g_all_n2, g_all_n2_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=2) 
print(nx.info(g_all_n2))


Name: 
Type: Graph
Number of nodes: 312
Number of edges: 1828
Average degree:  11.7179


This means that the graph of all institutions that appear at least twice across all Neurips papers since 2009 has 312 nodes, and it's highly connected (average degree 11.72). I'll get back to the graph analysis later, but let's see what the graph looks like if we only consider institutions with at least 5 papers in Neurips (`freq=5`), and then at least 20 (`freq=20`).
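
The average degree reported above follows directly from the node and edge counts: in an undirected graph every edge contributes to the degree of two nodes, so the average degree is 2E / N. A small sketch (the helper name `average_degree` is hypothetical, not part of `networkx` or `papeles`):

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * n_edges / n_nodes


# Counts printed above for freq=2.
avg_freq2 = average_degree(312, 1828)
print(f'{avg_freq2:.4f}')  # 11.7179
```

For a live graph, `average_degree(g.number_of_nodes(), g.number_of_edges())` gives the same figure, which helps on networkx 3.x, where `nx.info` is no longer available.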


In [7]:
g_all_n5, g_all_n5_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=5) 
print(nx.info(g_all_n5))

Name: 
Type: Graph
Number of nodes: 172
Number of edges: 1392
Average degree:  16.1860


In [8]:
g_all_n20, g_all_n20_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=20) 
print(nx.info(g_all_n20))

Name: 
Type: Graph
Number of nodes: 63
Number of edges: 724
Average degree:  22.9841


In [9]:
institutions_graph.dump_to_d3js_heb(g_all_n20, os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'n20_graph_all_years.json'))

Going from `freq=2` to `freq=5`, the graph loses about 45% of its nodes (312 to 172) but only about 24% of its edges (1828 to 1392). In other words, almost half of the nodes contributed few connections, which is expected since they had fewer papers. Meanwhile, the average degree increased by about 38% (11.72 to 16.19), so the remaining graph is denser and richer to analyze. Raising the threshold to `freq=20` continues the trend: 63 nodes, 724 edges, and an average degree of 22.98.
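
The relative changes between two thresholds can be recomputed directly from the node and edge counts printed above. This is only arithmetic on the printed summaries; the helper name `relative_change` is hypothetical.

```python
def relative_change(before, after):
    """Relative change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before


# (nodes, edges, average degree) as printed above.
freq2 = (312, 1828, 11.7179)
freq5 = (172, 1392, 16.1860)

changes = [relative_change(b, a) for b, a in zip(freq2, freq5)]
for label, change in zip(('nodes', 'edges', 'average degree'), changes):
    print(f'{label}: {change:+.1%}')
```

Swapping in the `freq=20` figures (63, 724, 22.9841) gives the same comparison for the stricter threshold.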

If we build one graph per year from 2009 to 2019 at the same frequency (`freq=5`), it's possible to see how Neurips took a serious turn around 2013, when it became well known and more researchers started publishing there. In the output below, the number of nodes grows from 99 in 2013 to 141 in 2018, and the average degree from 3.31 to 5.74.

In [10]:
graphs = {}
graph_files = {}
for year in range(2009, 2020):
    graphs[year], graph_files[year] = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=5, year=year)
    print(f'\nyear: {year}')
    print(nx.info(graphs[year]))



year: 2009
Name: 
Type: Graph
Number of nodes: 81
Number of edges: 127
Average degree:   3.1358

year: 2010
Name: 
Type: Graph
Number of nodes: 84
Number of edges: 146
Average degree:   3.4762

year: 2011
Name: 
Type: Graph
Number of nodes: 79
Number of edges: 150
Average degree:   3.7975

year: 2012
Name: 
Type: Graph
Number of nodes: 95
Number of edges: 174
Average degree:   3.6632

year: 2013
Name: 
Type: Graph
Number of nodes: 99
Number of edges: 164
Average degree:   3.3131

year: 2014
Name: 
Type: Graph
Number of nodes: 104
Number of edges: 192
Average degree:   3.6923

year: 2015
Name: 
Type: Graph
Number of nodes: 113
Number of edges: 238
Average degree:   4.2124

year: 2016
Name: 
Type: Graph
Number of nodes: 120
Number of edges: 271
Average degree:   4.5167

year: 2017
Name: 
Type: Graph
Number of nodes: 135
Number of edges: 349
Average degree:   5.1704

year: 2018
Name: 
Type: Graph
Number of nodes: 141
Number of edges: 405
Average degree:   5.7447

year: 2019
Name: 
Type: 

## Communities of Institutions

Using the `community` package ([python-louvain](https://python-louvain.readthedocs.io/en/latest/api.html)), a simple implementation of the Louvain algorithm for community detection, it's possible to see how different communities have evolved over the last 10 years.



In [11]:
import community
from pprint import pprint


for year in range(2009, 2020):
    print(f'===============\nyear: {year}')
    print(nx.info(graphs[year]))
    partition_y = community.best_partition(graphs[year])
    institution_clusters_y = defaultdict(list)
    for k, p in partition_y.items():
        institution_clusters_y[p].append(k)
    pprint(institution_clusters_y)


year: 2009
Name: 
Type: Graph
Number of nodes: 81
Number of edges: 127
Average degree:   3.1358
defaultdict(<class 'list'>,
            {0: ['universite de montreal'],
             1: ['stanford university',
                 'university of california berkeley',
                 'brown university',
                 'university of texas at austin'],
             2: ['boston university'],
             3: ['university of copenhagen'],
             4: ['university of alberta',
                 'nicta',
                 'indian institute of science',
                 'mcgill university'],
             5: ['duke university',
                 'princeton university',
                 'university of maryland',
                 'facebook'],
             6: ['university of oxford',
                 'carnegie mellon university',
                 'mit',
                 'university of toronto',
                 'max planck institute',
                 'inria',
                 'intel labs',
        

                 'tencent ai lab',
                 'mit',
                 'google brain',
                 'northwestern university',
                 'ecole normale superieure',
                 'google research',
                 'georgia institute of technology',
                 'tel aviv university',
                 'hebrew university',
                 'university of minnesota',
                 'shanghai jiao tong university',
                 'max planck institute for intelligent systems',
                 'tel-aviv university'],
             6: ['peking university'],
             7: ['university of colorado'],
             8: ['rutgers university'],
             9: ['iowa state university'],
             10: ['intel labs', 'dartmouth college'],
             11: ['nec corporation',
                  'university of california san diego',
                  'national institute of informatics',
                  'kth',
                  'riken aip',
                  'cornell un

There's some interesting analysis that can be done on how clusters change over time. I'll expand on this in future work, but in the meantime I invite you to explore the results above (e.g. note how some institutions jump from cluster to cluster every year, while others stay closely paired).
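
One way to start on that analysis, sketched here under assumptions, is to count in how many years two institutions land in the same community. The helper `co_clustering_counts` and the toy partitions are hypothetical; in the notebook, the input would be a dict that stores each year's `partition_y` (the loop above does not keep them).

```python
from collections import Counter
from itertools import combinations


def co_clustering_counts(partitions_by_year):
    """Count, for each pair of institutions, the years they share a community.

    partitions_by_year: dict mapping year -> {institution: community id},
    the same shape as the `partition_y` dicts built in the loop above.
    """
    counts = Counter()
    for partition in partitions_by_year.values():
        # Group institutions by community id for this year.
        clusters = {}
        for institution, cluster_id in partition.items():
            clusters.setdefault(cluster_id, []).append(institution)
        for members in clusters.values():
            counts.update(combinations(sorted(members), 2))
    return counts


# Toy partitions (hypothetical, not taken from the results above).
toy = {
    2009: {'mit': 0, 'inria': 0, 'nicta': 1},
    2010: {'mit': 2, 'inria': 2, 'nicta': 2},
}
pairs = co_clustering_counts(toy)
```

Counting pairs avoids comparing community ids across years, which are arbitrary labels. `pairs.most_common(10)` then lists the institutions that are most consistently grouped together.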

In [12]:
partition = community.best_partition(g_all_n5)
institution_clusters = defaultdict(list)
for k, p in partition.items():
    institution_clusters[p].append(k)
institution_clusters
    

defaultdict(list,
            {0: ['university of oxford',
              'deepmind',
              'universite de montreal',
              'university of cambridge',
              'google brain',
              'university of edinburgh',
              'university of british columbia',
              'university of california los angeles',
              'alan turing institute',
              'max planck institute for intelligent systems',
              'city university of hong kong',
              'university of texas at arlington',
              'university college london',
              'university of bristol',
              'uber',
              'technical university of denmark',
              'imperial college london',
              'university of sydney',
              'technische universitat berlin',
              'university of southampton',
              'university of freiburg',
              'ecole polytechnique de montreal',
              'ghent university',
              'univ

In [13]:
institution_clusters[0]

['university of oxford',
 'deepmind',
 'universite de montreal',
 'university of cambridge',
 'google brain',
 'university of edinburgh',
 'university of british columbia',
 'university of california los angeles',
 'alan turing institute',
 'max planck institute for intelligent systems',
 'city university of hong kong',
 'university of texas at arlington',
 'university college london',
 'university of bristol',
 'uber',
 'technical university of denmark',
 'imperial college london',
 'university of sydney',
 'technische universitat berlin',
 'university of southampton',
 'university of freiburg',
 'ecole polytechnique de montreal',
 'ghent university',
 'university of warwick']

Looking at the first cluster, for example, it's interesting to see that it is made up mainly of European institutions, with a few exceptions (in the output above, for instance, Uber, University of Sydney, and City University of Hong Kong). More in-depth analysis could explain why these exceptions are clustered together with European institutions and not with other similar institutions (e.g. other US institutions in the case of Uber, or other regional institutions in the case of University of Sydney or City University of Hong Kong).

## Rankings of Institutions 

Using networkx, it's possible to rank institutions using different approaches.

In [14]:
eigen_centrality = nx.eigenvector_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in eigen_centrality.items()], key=lambda x: x[1], reverse=True)[:10]


[('microsoft research', 0.2357),
 ('university of california berkeley', 0.2319),
 ('carnegie mellon university', 0.228),
 ('mit', 0.2234),
 ('stanford university', 0.1953),
 ('princeton university', 0.1876),
 ('google', 0.1867),
 ('google research', 0.1836),
 ('university of texas at austin', 0.1633),
 ('harvard university', 0.1571)]

In [15]:
katz_centrality = nx.katz_centrality_numpy(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in katz_centrality.items()], key=lambda x: x[1], reverse=True)[:10]


[('university of oxford', 0.2606),
 ('university of cambridge', 0.2546),
 ('deepmind', 0.2467),
 ('university college london', 0.2121),
 ('eth zürich', 0.1967),
 ('alan turing institute', 0.1933),
 ('google brain', 0.1916),
 ('university of toronto', 0.172),
 ('max planck institute for intelligent systems', 0.1682),
 ('inria', 0.165)]

In [16]:
closeness_centrality = nx.closeness_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in closeness_centrality.items()], key=lambda x: x[1], reverse=True)[:10]


[('microsoft research', 0.6297),
 ('carnegie mellon university', 0.6249),
 ('mit', 0.6225),
 ('university of california berkeley', 0.6178),
 ('stanford university', 0.5846),
 ('princeton university', 0.5723),
 ('google', 0.5683),
 ('google research', 0.5547),
 ('university of texas at austin', 0.5473),
 ('university of oxford', 0.54)]

In [17]:
betweenness_centrality = nx.betweenness_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in betweenness_centrality.items()], key=lambda x: x[1], reverse=True)[:10]


[('microsoft research', 0.122),
 ('carnegie mellon university', 0.1057),
 ('mit', 0.1031),
 ('university of california berkeley', 0.0876),
 ('stanford university', 0.0551),
 ('google', 0.0448),
 ('university college london', 0.0409),
 ('princeton university', 0.0368),
 ('inria', 0.0341),
 ('deepmind', 0.0313)]

In [18]:
hubs, authorities = nx.hits(g_all_n5)

In [19]:
sorted([(v, float('{:0.4f}'.format(c))) for v, c in hubs.items()], key=lambda x: x[1], reverse=True)[:10]

[('microsoft research', 0.055),
 ('mit', 0.0453),
 ('university of california berkeley', 0.0431),
 ('carnegie mellon university', 0.0426),
 ('princeton university', 0.0383),
 ('stanford university', 0.0332),
 ('university of texas at austin', 0.0309),
 ('columbia university', 0.0266),
 ('google research', 0.0266),
 ('harvard university', 0.0246)]

In [20]:
sorted([(v, float('{:0.4f}'.format(c))) for v, c in authorities.items()], key=lambda x: x[1], reverse=True)[:10]

[('microsoft research', 0.055),
 ('mit', 0.0453),
 ('university of california berkeley', 0.0431),
 ('carnegie mellon university', 0.0426),
 ('princeton university', 0.0383),
 ('stanford university', 0.0332),
 ('university of texas at austin', 0.0309),
 ('columbia university', 0.0266),
 ('google research', 0.0266),
 ('harvard university', 0.0246)]

Note how the top institutions are basically the same in most of these rankings, with Katz centrality as the exception: there, the top institutions are mostly European. The hub and authority scores are identical because the graph is undirected. Further analysis of the structure of the graph could explain the Katz result, which I'll leave as future work, or as something to explore for anyone reading this.
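
One possible factor behind the Katz result, offered as a hypothesis rather than something verified on this data: Katz centrality is only defined as a convergent series when the attenuation factor `alpha` is smaller than 1 / λ_max, where λ_max is the largest eigenvalue of the adjacency matrix, and `nx.katz_centrality_numpy` uses `alpha=0.1` by default. Since λ_max is at least the average degree, the counts printed for `g_all_n5` give a quick upper bound on the admissible `alpha`. The helper name below is hypothetical.

```python
def katz_alpha_upper_bound(n_nodes, n_edges):
    """Upper bound on 1 / lambda_max for an undirected, unweighted graph.

    lambda_max is at least the average degree 2E / N, so 1 / lambda_max
    can be no larger than N / (2E).
    """
    return n_nodes / (2 * n_edges)


# Counts printed above for g_all_n5 (freq=5).
bound = katz_alpha_upper_bound(172, 1392)
print(f'alpha must be below {bound:.4f}; the default is 0.1')
```

If this is the cause, rerunning with a smaller value, for example `nx.katz_centrality_numpy(g_all_n5, alpha=0.01)`, may give a ranking closer to the other measures. This has not been checked here.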


With the following method, you can export the clusters and the respective centrality measures for each node in the graph. The output is designed to work with the D3.js treemap described in the Neurips analysis post on glhuilli.github.io.

In [21]:
institutions_graph.dump_to_treemap_d3js(g_all_n5, os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'n5_graph_all_years_clusters.json'))