# papeles package - institutions network analysis example

This notebook extracts every institution that appears in NeurIPS papers and builds an institution network from the co-occurrence of institutions on the same paper. Data was obtained using the [neurips_crawler](https://github.com/glhuilli/neurips_crawler).

The `papeles` package extracts a clean list of institutions by processing the front page of each paper and identifying which institutions were involved in the research. Results show that...



In [1]:
import os
import json
from collections import defaultdict

from tqdm.notebook import tqdm


from papeles.paper.neurips import get_key


## Loading the data

The data is expected in `/var/data/neurips_analysis`. To run this notebook, create that folder and download the data from the GitHub repository `xxx`.

These files were already converted from `pdf` to `txt` and then trimmed to the `header` of each paper (everything from the beginning of the document up to the abstract) using the `papeles` package. For more details, see the IPython notebook `xxx`. There are also two files with keywords extracted from the entire corpus, again using the `papeles` package. For more details, see the IPython notebook `xxx`.

In [2]:
# These files have encoding issues and were not parsed correctly by the pdf_parser
SKIP_FILES = [
    '5049-nonparametric-multi-group-membership-model-for-dynamic-networks.pdf_headers.txt',
    '4984-cluster-trees-on-manifolds.pdf_headers.txt',
    '5820-alternating-minimization-for-regression-problems-with-vector-valued-outputs.pdf_headers.txt',
    '9065-visualizing-and-measuring-the-geometry-of-bert.pdf_headers.txt',
    '4130-implicit-encoding-of-prior-probabilities-in-optimal-neural-populations.pdf_headers.txt',
    '7118-local-aggregative-games.pdf_headers.txt'
]

NEURIPS_ANALYSIS_DATA_PATH = '/var/data/neurips_analysis'

headers_path = os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_headers/')

file_lines = defaultdict(list)
for filename in tqdm(os.listdir(headers_path), 'loading files'):
    if filename in SKIP_FILES:
        continue
    with open(os.path.join(headers_path, filename), 'r') as f:
        for line in f:
            file_lines[get_key(filename)].append(line.strip())
            
metadata_path = os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_metadata/')

metadata = {}
for filename in tqdm(os.listdir(metadata_path), 'loading metadata'):
    with open(os.path.join(metadata_path, filename), 'r') as f: # open in readonly mode
        for line in f.readlines():
            data = json.loads(line)
            metadata[get_key(data['pdf_name'])] = data


with open(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'year_keywords_counter_n2.json'), 'r') as f:
    keywords_n2 = json.load(f)
    

with open(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'year_keywords_counter_n3.json'), 'r') as f:
    keywords_n3 = json.load(f)

print(f'keywords loaded -- 2-grams: {sum([len(v) for k, v in keywords_n2.items()])}, 3-grams: {sum([len(v) for k, v in keywords_n3.items()])}')




keywords loaded -- 2-grams: 118478, 3-grams: 120341


In [3]:
from papeles.paper.neurips import institutions

from collections import Counter 
import itertools


In [4]:
inst_counter = Counter()
for file, lines in list(file_lines.items()):
    file_institutions = institutions.get_file_institutions(lines)
    unique_file_institutions = list(set(file_institutions))
    inst_counter.update(unique_file_institutions)

# Keep only non-empty institution names that were actually counted
nonempty = [(name, count) for name, count in inst_counter.items() if count > 0 and name]
cleaned = len(nonempty)
total = sum(count for _, count in nonempty)

print(f'current institutions: {cleaned}')
print(f'total raw institutions: {total}')
print(f'clean-up fraction: {1 - cleaned / total:0.2f}')


current institutions: 2989
total raw institutions: 9245
clean-up fraction: 0.68
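
As a minimal sketch of what the cell above counts, the toy example below (institution names and paper keys are made up for illustration) shows why each paper's institutions are de-duplicated with `set` before updating the `Counter`: an institution is counted once per paper, no matter how many authors list it. The last line recomputes the clean-up fraction from the two totals printed above.

```python
from collections import Counter

# Hypothetical per-paper institution lists (illustrative only, not real data)
toy_papers = {
    'paper-a': ['mit', 'mit', 'stanford university'],
    'paper-b': ['mit', 'inria'],
    'paper-c': ['inria', 'inria'],
}

toy_counter = Counter()
for paper, names in toy_papers.items():
    # set() keeps one entry per institution per paper
    toy_counter.update(set(names))

distinct = len(toy_counter)            # number of distinct institutions
mentions = sum(toy_counter.values())   # per-paper mentions after de-duplication
print(toy_counter)
print(f'distinct: {distinct}, mentions: {mentions}')

# Clean-up fraction recomputed from the totals printed above
print(f'{1 - 2989 / 9245:0.2f}')
```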


## Institutions Interactions Graph 

To build the graph, we'll use both `networkx` and the `papeles` package. In particular, we'll use `build_institutions_graph` from the `institutions_graph` module, which is tailored for NeurIPS data. I recommend looking into the details of this function, but the overall idea is that for each file in the corpus (`file_lines`), optionally restricted to a particular `year`, it tries to find all institutions (from `inst_counter`) that appear together on that paper. Only institutions with at least a minimum frequency (`freq`) in `inst_counter` are considered.

The same module provides the `dump_to_d3js` function, which generates the output needed for a second part of this analysis.
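
The idea behind the graph construction can be sketched with the standard library alone. This is not the `papeles` implementation: `build_cooccurrence_edges`, its parameters, and the toy data are hypothetical names introduced here, and the real `build_institutions_graph` may filter and normalize differently. The sketch keeps institutions that appear in at least `freq` papers and adds one edge for every pair that shares a paper, using the number of shared papers as the edge count.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(paper_institutions, inst_counter, freq=1):
    """Return a Counter mapping (inst_a, inst_b) pairs to shared-paper counts.

    Hypothetical helper for illustration; not part of papeles.
    """
    edges = Counter()
    for names in paper_institutions.values():
        # Keep only institutions that reach the minimum frequency
        kept = sorted({n for n in names if inst_counter[n] >= freq})
        # Every pair of institutions on the same paper becomes an edge
        for pair in combinations(kept, 2):
            edges[pair] += 1
    return edges


# Toy data (illustrative only)
toy_papers = {
    'p1': ['mit', 'stanford university'],
    'p2': ['mit', 'stanford university', 'inria'],
    'p3': ['inria', 'eth zürich'],
}
toy_counter = Counter(n for names in toy_papers.values() for n in set(names))

edges = build_cooccurrence_edges(toy_papers, toy_counter, freq=2)
print(edges)
```

The resulting pairs could then be added to a `networkx.Graph` with `add_edge`, storing the count as a `weight` attribute.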



In [5]:
import networkx as nx

from papeles.paper.neurips import institutions_graph

In [6]:
g_all_n2, g_all_n2_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=2) 
print(nx.info(g_all_n2))


Name: 
Type: Graph
Number of nodes: 312
Number of edges: 1828
Average degree:  11.7179
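
The average degree in this summary is twice the number of edges divided by the number of nodes, since every edge adds one to the degree of two nodes. The short check below reproduces the printed value from the node and edge counts (the helper name is introduced here for illustration). Note that `nx.info` has been removed from recent NetworkX releases (3.x, as far as I know); there, `print(G)` gives a similar one-line summary.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge adds 1 to two nodes."""
    return 2 * num_edges / num_nodes


# Counts taken from the summary printed above (freq=2 graph)
avg = average_degree(312, 1828)
print(f'Average degree: {avg:.4f}')
```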


This means that the graph built from all institutions that appear in at least two NeurIPS papers since 2009 has 312 nodes, and it's highly connected (average degree 11.72). I'll get back to the graph analysis later, but let's see what this graph looks like if we only consider institutions with at least 5 NeurIPS papers.


In [7]:
g_all_n5, g_all_n5_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=5) 
print(nx.info(g_all_n5))

Name: 
Type: Graph
Number of nodes: 172
Number of edges: 1392
Average degree:  16.1860


In [29]:
g_all_n20, g_all_n20_files = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=20) 
print(nx.info(g_all_n20))

Name: 
Type: Graph
Number of nodes: 63
Number of edges: 724
Average degree:  22.9841


In [30]:
institutions_graph.dump_to_d3js(g_all_n20, os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'n20_graph_all_years.json'))

In [8]:
institutions_graph.dump_to_d3js(g_all_n5, os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'n5_graph_all_years.json'))


Compared with the `freq=2` graph, the `freq=5` graph has about 45% fewer nodes (172 vs. 312), but only about 24% fewer edges (1392 vs. 1828). One conclusion is that the removed nodes contributed few connections, which is expected as they had fewer papers. Meanwhile, the average degree increased by about 38% (from 11.72 to 16.19), which again tells us that it's a much richer graph to analyze.
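
The relative changes between the `freq=2` and `freq=5` graphs can be recomputed directly from the two summaries printed above. A small sketch (the helper name `relative_change` is introduced here for illustration):

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Values as printed by nx.info above
freq2 = {'nodes': 312, 'edges': 1828, 'avg_degree': 11.7179}
freq5 = {'nodes': 172, 'edges': 1392, 'avg_degree': 16.1860}

changes = {key: relative_change(freq2[key], freq5[key]) for key in freq2}
for key, value in changes.items():
    print(f'{key}: {value:+.1%}')
```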

If we build the graphs from the papers of each year from 2009 to 2019, at the same frequency (`freq=5`), it's possible to see how NeurIPS took a serious turn around 2013, when it became widely known and more researchers started publishing there.

In [9]:
graphs = {}
graph_files = {}
for year in range(2009, 2020):
    graphs[year], graph_files[year] = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=5, year=year)
    print(f'\nyear: {year}')
    print(nx.info(graphs[year]))



year: 2009
Name: 
Type: Graph
Number of nodes: 81
Number of edges: 127
Average degree:   3.1358

year: 2010
Name: 
Type: Graph
Number of nodes: 84
Number of edges: 146
Average degree:   3.4762

year: 2011
Name: 
Type: Graph
Number of nodes: 79
Number of edges: 150
Average degree:   3.7975

year: 2012
Name: 
Type: Graph
Number of nodes: 95
Number of edges: 174
Average degree:   3.6632

year: 2013
Name: 
Type: Graph
Number of nodes: 99
Number of edges: 164
Average degree:   3.3131

year: 2014
Name: 
Type: Graph
Number of nodes: 104
Number of edges: 192
Average degree:   3.6923

year: 2015
Name: 
Type: Graph
Number of nodes: 113
Number of edges: 238
Average degree:   4.2124

year: 2016
Name: 
Type: Graph
Number of nodes: 120
Number of edges: 271
Average degree:   4.5167

year: 2017
Name: 
Type: Graph
Number of nodes: 135
Number of edges: 349
Average degree:   5.1704

year: 2018
Name: 
Type: Graph
Number of nodes: 141
Number of edges: 405
Average degree:   5.7447

year: 2019
Name: 
Type: 

## Communities of Institutions

Using the `community` package (installed as `python-louvain`), a simple package implementing the Louvain algorithm for community detection (add link xxx), it's possible to see how different communities have evolved over the last 10 years.

TODO: add more analysis. 
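
Louvain searches for a partition that maximizes modularity, which compares the fraction of edges inside each community with the fraction expected if edges were placed at random while keeping node degrees fixed. As a minimal, unweighted sketch of that quantity (the `modularity` helper and the toy graph are illustrative and not part of the `community` package):

```python
from collections import defaultdict


def modularity(edges, partition):
    """Modularity Q = sum_c [ L_c / m - (d_c / (2 m))^2 ] for an unweighted graph.

    L_c: edges with both ends in community c; d_c: total degree of c; m: edge count.
    """
    m = len(edges)
    inner = defaultdict(int)    # L_c
    degree = defaultdict(int)   # d_c
    for u, v in edges:
        degree[partition[u]] += 1
        degree[partition[v]] += 1
        if partition[u] == partition[v]:
            inner[partition[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge (toy graph)
toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
             ('d', 'e'), ('e', 'f'), ('d', 'f'),
             ('c', 'd')]
two_groups = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(toy_edges, two_groups), 4))
print(round(modularity(toy_edges, one_group), 4))
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the kind of comparison Louvain makes while merging nodes.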


In [10]:
import community
from pprint import pprint


for year in range(2009, 2020):
    print(f'\nyear: {year}')
    print(nx.info(graphs[year]))
    partition_y = community.best_partition(graphs[year])
    institution_clusters_y = defaultdict(list)
    for k, p in partition_y.items():
        institution_clusters_y[p].append(k)
    pprint(institution_clusters_y)
    


year: 2009
Name: 
Type: Graph
Number of nodes: 81
Number of edges: 127
Average degree:   3.1358
defaultdict(<class 'list'>,
            {0: ['universite de montreal'],
             1: ['stanford university',
                 'university of california berkeley',
                 'brown university',
                 'university of texas at austin'],
             2: ['boston university'],
             3: ['university of copenhagen'],
             4: ['university of alberta',
                 'nicta',
                 'indian institute of science',
                 'mcgill university'],
             5: ['duke university',
                 'princeton university',
                 'facebook',
                 'university of maryland'],
             6: ['university of oxford',
                 'carnegie mellon university',
                 'mit',
                 'university of toronto',
                 'max planck institute',
                 'inria',
                 'intel labs',
       

                 'universita degli studi di milano',
                 'canadian institute for advanced research',
                 'twitter',
                 'eth zürich',
                 'university of sydney',
                 'australian national university',
                 'toyota technological institute at chicago'],
             7: ['peking university'],
             8: ['university of colorado'],
             9: ['rutgers university'],
             10: ['iowa state university'],
             11: ['dartmouth college', 'intel labs'],
             12: ['university of california san diego',
                  'nec corporation',
                  'national institute of informatics',
                  'kth',
                  'riken aip',
                  'cornell university',
                  'ntt communication science laboratories',
                  'jst presto'],
             13: ['stony brook university'],
             14: ['ohio state university'],
             15: ['univer

In [11]:
dendrogram = community.generate_dendrogram(g_all_n5)
print(dendrogram)
# Level 0 is the finest partition (smallest communities)
partition_1 = community.partition_at_level(dendrogram, 0)
institution_clusters_1 = defaultdict(list)
for k, p in partition_1.items():
    institution_clusters_1[p].append(k)
institution_clusters_1


[{'university of oxford': 0, 'cornell university': 1, 'university of alberta': 2, 'nicta': 3, 'australian national university': 3, 'iowa state university': 4, 'carnegie mellon university': 5, 'deepmind': 6, 'university of washington': 5, 'northeastern university': 7, 'columbia university': 8, 'stanford university': 5, 'university of pittsburgh': 9, 'university of amsterdam': 10, 'google research': 10, 'university of california irvine': 11, 'university of massachusetts amherst': 5, 'microsoft research': 5, 'hong kong university of science and technology': 9, 'inria': 12, 'university of copenhagen': 12, 'university of toronto': 4, 'universite de montreal': 6, 'university of cambridge': 0, 'princeton university': 8, 'university of california berkeley': 5, 'google brain': 6, 'university of edinburgh': 0, 'university of british columbia': 6, 'mcgill university': 2, 'university of waterloo': 7, 'national taiwan university': 5, 'universite paris-saclay': 12, 'politecnico di milano': 12, 'univ

defaultdict(list,
            {0: ['university of oxford',
              'university of cambridge',
              'university of edinburgh',
              'university of california los angeles',
              'alan turing institute',
              'max planck institute for intelligent systems',
              'city university of hong kong',
              'university of texas at arlington',
              'university college london',
              'technical university of munich',
              'university of bristol',
              'imperial college london',
              'university of sydney',
              'technische universitat berlin',
              'university of southampton',
              'ghent university',
              'university of warwick'],
             1: ['cornell university',
              'rutgers university',
              'university of utah',
              'ntt communication science laboratories'],
             2: ['university of alberta',
              'mcgill uni

In [12]:
partition = community.best_partition(g_all_n5)
institution_clusters = defaultdict(list)
for k, p in partition.items():
    institution_clusters[p].append(k)
institution_clusters
    

defaultdict(list,
            {0: ['university of oxford',
              'deepmind',
              'universite de montreal',
              'university of cambridge',
              'google brain',
              'university of edinburgh',
              'university of british columbia',
              'university of california los angeles',
              'alan turing institute',
              'max planck institute for intelligent systems',
              'city university of hong kong',
              'university of texas at arlington',
              'university college london',
              'technical university of munich',
              'university of bristol',
              'uber',
              'technical university of denmark',
              'imperial college london',
              'university of sydney',
              'technische universitat berlin',
              'university of southampton',
              'university of freiburg',
              'ecole polytechnique de montreal',
     

In [13]:
partition = community.best_partition(g_all_n5)
institution_clusters = defaultdict(list)
for k, p in partition.items():
    institution_clusters[p].append(k)
institution_clusters
    

defaultdict(list,
            {0: ['university of oxford',
              'deepmind',
              'university of copenhagen',
              'universite de montreal',
              'university of cambridge',
              'google brain',
              'university of edinburgh',
              'university of british columbia',
              'university of california los angeles',
              'alan turing institute',
              'max planck institute for intelligent systems',
              'city university of hong kong',
              'university of texas at arlington',
              'university college london',
              'technical university of munich',
              'university of bristol',
              'uber',
              'technical university of denmark',
              'psl research university',
              'imperial college london',
              'university of sydney',
              'technische universitat berlin',
              'university of southampton',
           

In [14]:
nx.triangles(g_all_n5)

{'university of oxford': 243,
 'cornell university': 151,
 'university of alberta': 40,
 'nicta': 4,
 'australian national university': 5,
 'iowa state university': 3,
 'carnegie mellon university': 584,
 'deepmind': 196,
 'university of washington': 240,
 'northeastern university': 64,
 'columbia university': 276,
 'stanford university': 435,
 'university of pittsburgh': 2,
 'university of amsterdam': 6,
 'google research': 398,
 'university of california irvine': 23,
 'university of massachusetts amherst': 19,
 'microsoft research': 633,
 'hong kong university of science and technology': 0,
 'inria': 120,
 'university of copenhagen': 12,
 'university of toronto': 165,
 'universite de montreal': 33,
 'university of cambridge': 167,
 'princeton university': 431,
 'university of california berkeley': 624,
 'google brain': 299,
 'university of edinburgh': 67,
 'university of british columbia': 33,
 'mcgill university': 6,
 'university of waterloo': 36,
 'national taiwan university': 1,
 

In [15]:
all_cliques = nx.enumerate_all_cliques(g_all_n5)
triad_cliques = [x for x in all_cliques if len(x)==3 ]
len(triad_cliques)

4013
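
`nx.triangles` reports, for each node, how many triangles it belongs to, so every triangle is counted three times across that dictionary, while the list of 3-cliques counts each triangle once. A stdlib-only sketch of that relationship on a toy graph (names are illustrative):

```python
from itertools import combinations

# Toy undirected graph as an adjacency mapping (illustrative only)
adjacency = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b', 'd'},
    'd': {'a', 'c'},
}

# Each triangle listed once: all 3-node subsets that are fully connected
triads = [
    trio for trio in combinations(sorted(adjacency), 3)
    if all(v in adjacency[u] for u, v in combinations(trio, 2))
]

# Per-node triangle counts, analogous to what nx.triangles returns
per_node = {node: sum(node in trio for trio in triads) for node in adjacency}

print(triads)
print(per_node)
# Summing the per-node counts counts every triangle three times
print(sum(per_node.values()) // 3 == len(triads))
```

Applied to the `freq=5` graph, `sum(nx.triangles(g_all_n5).values()) // 3` should therefore match the number of triads found above.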

## Rankings of Institutions 

Based on the graph properties, institutions can be ranked in several different ways... (finish xxx)
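
Most of these rankings rest on simple definitions. As one example, eigenvector centrality scores a node highly when its neighbors score highly, and it can be approximated by power iteration. The sketch below is a simplified, unweighted version written for illustration (the function name, the toy graph, and the fixed iteration count are assumptions, and it is not NetworkX's implementation):

```python
import math


def eigenvector_centrality_sketch(adjacency, iterations=200):
    """Approximate eigenvector centrality by power iteration on (A + I)."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # New score = own score + sum of neighbor scores; the +I shift
        # avoids oscillation on bipartite graphs
        updated = {
            node: scores[node] + sum(scores[nb] for nb in neighbors)
            for node, neighbors in adjacency.items()
        }
        norm = math.sqrt(sum(value ** 2 for value in updated.values()))
        scores = {node: value / norm for node, value in updated.items()}
    return scores


# Triangle a-b-c with an extra node d attached to a (toy graph)
toy_graph = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'a'},
}
ranking = sorted(eigenvector_centrality_sketch(toy_graph).items(),
                 key=lambda item: item[1], reverse=True)
for node, score in ranking:
    print(node, round(score, 4))
```

In the toy graph, `a` ranks first because it is connected to every other node, and `d` ranks last because its only neighbor is `a`.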

In [16]:
centrality = nx.eigenvector_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


[('microsoft research', 0.2357),
 ('university of california berkeley', 0.2319),
 ('carnegie mellon university', 0.228),
 ('mit', 0.2234),
 ('stanford university', 0.1953),
 ('princeton university', 0.1876),
 ('google', 0.1867),
 ('google research', 0.1836),
 ('university of texas at austin', 0.1633),
 ('harvard university', 0.1571),
 ('google brain', 0.1568),
 ('university of pennsylvania', 0.1533),
 ('columbia university', 0.1527),
 ('university of washington', 0.1423),
 ('ibm research', 0.1423),
 ('university of oxford', 0.1387),
 ('university of california san diego', 0.1373),
 ('university of southern california', 0.1334),
 ('new york university', 0.1333),
 ('georgia institute of technology', 0.1326),
 ('university college london', 0.1284),
 ('deepmind', 0.1263),
 ('university of michigan', 0.1241),
 ('university of illinois at urbana-champaign', 0.1235),
 ('duke university', 0.1199),
 ('toyota technological institute at chicago', 0.1191),
 ('university of toronto', 0.119),
 ('tsi

In [17]:
centrality = nx.katz_centrality_numpy(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


[('university of oxford', 0.2606),
 ('university of cambridge', 0.2546),
 ('deepmind', 0.2467),
 ('university college london', 0.2121),
 ('eth zürich', 0.1967),
 ('alan turing institute', 0.1933),
 ('google brain', 0.1916),
 ('university of toronto', 0.172),
 ('max planck institute for intelligent systems', 0.1682),
 ('inria', 0.165),
 ('university of edinburgh', 0.1482),
 ('imperial college london', 0.1422),
 ('max planck institute', 0.1416),
 ('new york university', 0.1306),
 ('facebook ai research', 0.1303),
 ('university of california berkeley', 0.1222),
 ('ecole normale superieure', 0.1161),
 ('harvard university', 0.0953),
 ('uber', 0.0939),
 ('university of warwick', 0.0877),
 ('mit', 0.0856),
 ('northeastern university', 0.0851),
 ('university of copenhagen', 0.0837),
 ('universite de montreal', 0.0714),
 ('epfl', 0.0636),
 ('twitter', 0.0624),
 ('openai', 0.0596),
 ('university of southampton', 0.055),
 ('ecole polytechnique', 0.0543),
 ('university of sydney', 0.0532),
 ('sor
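
The Katz ranking differs sharply from the other measures, and the attenuation factor is a likely reason. Katz centrality is only well defined when `alpha` is below the reciprocal of the largest adjacency eigenvalue, and that eigenvalue is never smaller than the average degree. With the default `alpha=0.1` of `nx.katz_centrality_numpy`, the quick check below, which uses only the `freq=5` summary printed earlier, suggests the default falls outside that range for this graph, so these scores should be read with caution. This is an inference from the printed summary, not something verified against the data.

```python
# Summary of the freq=5 graph as printed earlier
num_nodes, num_edges = 172, 1392

# The largest adjacency eigenvalue is at least the average degree 2E/N
lambda_lower_bound = 2 * num_edges / num_nodes

# Katz centrality needs alpha < 1 / lambda_max, so 1 / (2E/N) bounds alpha from above
alpha_upper_bound = 1 / lambda_lower_bound

default_alpha = 0.1  # default of nx.katz_centrality_numpy
print(f'upper bound on admissible alpha: {alpha_upper_bound:.4f}')
print(default_alpha < alpha_upper_bound)
```

One option is to pick `alpha` slightly below `1 / lambda_max`, where `lambda_max` can be obtained from `nx.adjacency_spectrum(g_all_n5)`.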

In [18]:
centrality = nx.closeness_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


[('microsoft research', 0.6297),
 ('carnegie mellon university', 0.6249),
 ('mit', 0.6225),
 ('university of california berkeley', 0.6178),
 ('stanford university', 0.5846),
 ('princeton university', 0.5723),
 ('google', 0.5683),
 ('google research', 0.5547),
 ('university of texas at austin', 0.5473),
 ('university of oxford', 0.54),
 ('google brain', 0.5383),
 ('university of pennsylvania', 0.5365),
 ('ibm research', 0.5365),
 ('university college london', 0.5365),
 ('deepmind', 0.533),
 ('harvard university', 0.533),
 ('georgia institute of technology', 0.5312),
 ('new york university', 0.5295),
 ('university of illinois at urbana-champaign', 0.5295),
 ('columbia university', 0.5227),
 ('university of california san diego', 0.5227),
 ('university of washington', 0.5211),
 ('inria', 0.5129),
 ('university of toronto', 0.5129),
 ('duke university', 0.5129),
 ('university of michigan', 0.5113),
 ('university of southern california', 0.5097),
 ('cornell university', 0.5081),
 ('universi
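
Closeness centrality is the reciprocal of a node's average shortest-path distance to the other nodes it can reach. A minimal BFS-based sketch for a connected, unweighted toy graph (names are illustrative; NetworkX additionally rescales the score when a graph is disconnected):

```python
from collections import deque


def closeness_sketch(adjacency, source):
    """(reachable nodes) / (sum of shortest-path lengths) from `source`, via BFS."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Path graph a - b - c - d (toy example)
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}}
closeness = {node: closeness_sketch(path, node) for node in path}
print(closeness)
```

The two middle nodes of the path score higher than the end nodes, because their average distance to the rest of the graph is smaller.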

In [19]:
# centrality = nx.information_centrality(g_all_n5)
# sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)

In [20]:
centrality = nx.betweenness_centrality(g_all_n5)
sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


[('microsoft research', 0.122),
 ('carnegie mellon university', 0.1057),
 ('mit', 0.1031),
 ('university of california berkeley', 0.0876),
 ('stanford university', 0.0551),
 ('google', 0.0448),
 ('university college london', 0.0409),
 ('princeton university', 0.0368),
 ('inria', 0.0341),
 ('deepmind', 0.0313),
 ('university of illinois at urbana-champaign', 0.0294),
 ('university of texas at austin', 0.0273),
 ('university of oxford', 0.0239),
 ('google research', 0.0234),
 ('university of pennsylvania', 0.0234),
 ('ibm research', 0.0214),
 ('georgia institute of technology', 0.0205),
 ('eth zürich', 0.0193),
 ('epfl', 0.0178),
 ('google brain', 0.0172),
 ('cornell university', 0.0159),
 ('adobe research', 0.0146),
 ('university of sydney', 0.0143),
 ('new york university', 0.013),
 ('university of southern california', 0.0119),
 ('harvard university', 0.0107),
 ('columbia university', 0.0104),
 ('max planck institute', 0.0104),
 ('tsinghua university', 0.0099),
 ('ntt communication sc

In [21]:
# centrality = nx.current_flow_betweenness_centrality(g_all_n5)
# sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


In [22]:
# centrality = nx.communicability_betweenness_centrality(g_all_n5)
# sorted([(v, float('{:0.4f}'.format(c))) for v, c in centrality.items()], key=lambda x: x[1], reverse=True)


In [23]:
hubs, authorities = nx.hits(g_all_n5)

In [24]:
sorted([(v, float('{:0.4f}'.format(c))) for v, c in hubs.items()], key=lambda x: x[1], reverse=True)

[('microsoft research', 0.055),
 ('mit', 0.0453),
 ('university of california berkeley', 0.0431),
 ('carnegie mellon university', 0.0426),
 ('princeton university', 0.0383),
 ('stanford university', 0.0332),
 ('university of texas at austin', 0.0309),
 ('columbia university', 0.0266),
 ('google research', 0.0266),
 ('harvard university', 0.0246),
 ('google brain', 0.0233),
 ('google', 0.0219),
 ('university of pennsylvania', 0.0198),
 ('university of washington', 0.0188),
 ('university of toronto', 0.018),
 ('university of california san diego', 0.017),
 ('new york university', 0.0167),
 ('university of cambridge', 0.0159),
 ('university of oxford', 0.0156),
 ('deepmind', 0.0147),
 ('georgia institute of technology', 0.014),
 ('cornell university', 0.0134),
 ('duke university', 0.0125),
 ('ibm research', 0.0114),
 ('university of southern california', 0.0111),
 ('university college london', 0.011),
 ('university of michigan', 0.0106),
 ('toyota technological institute at chicago', 0.01

In [25]:
sorted([(v, float('{:0.4f}'.format(c))) for v, c in authorities.items()], key=lambda x: x[1], reverse=True)

[('microsoft research', 0.055),
 ('mit', 0.0453),
 ('university of california berkeley', 0.0431),
 ('carnegie mellon university', 0.0426),
 ('princeton university', 0.0383),
 ('stanford university', 0.0332),
 ('university of texas at austin', 0.0309),
 ('columbia university', 0.0266),
 ('google research', 0.0266),
 ('harvard university', 0.0246),
 ('google brain', 0.0233),
 ('google', 0.0219),
 ('university of pennsylvania', 0.0198),
 ('university of washington', 0.0188),
 ('university of toronto', 0.018),
 ('university of california san diego', 0.017),
 ('new york university', 0.0167),
 ('university of cambridge', 0.0159),
 ('university of oxford', 0.0156),
 ('deepmind', 0.0147),
 ('georgia institute of technology', 0.014),
 ('cornell university', 0.0134),
 ('duke university', 0.0125),
 ('ibm research', 0.0114),
 ('university of southern california', 0.0111),
 ('university college london', 0.011),
 ('university of michigan', 0.0106),
 ('toyota technological institute at chicago', 0.01
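
The hub and authority lists are identical, which is expected: HITS separates hubs from authorities through edge direction, and in an undirected graph the adjacency matrix is symmetric, so for a connected graph that contains at least one odd cycle both scores converge to the same leading eigenvector. A small power-iteration sketch on a toy undirected graph illustrates this (simplified and unweighted; the function name and graph are illustrative, not NetworkX's implementation):

```python
def hits_sketch(adjacency, iterations=100):
    """Plain HITS iteration; both score sets are normalized to sum to 1."""
    hubs = {node: 1.0 for node in adjacency}
    authorities = dict(hubs)
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes linking in
        authorities = {n: sum(hubs[m] for m in adjacency[n]) for n in adjacency}
        # Hub score: sum of authority scores of the nodes linked to
        hubs = {n: sum(authorities[m] for m in adjacency[n]) for n in adjacency}
        for scores in (hubs, authorities):
            total = sum(scores.values())
            for node in scores:
                scores[node] /= total
    return hubs, authorities


# Triangle a-b-c with an extra node d attached to a (toy undirected graph)
toy_graph = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'a'},
}
hubs, authorities = hits_sketch(toy_graph)
print(all(abs(hubs[n] - authorities[n]) < 1e-9 for n in toy_graph))
```

For this analysis, that means one of the two HITS lists is enough; they would only differ if the institution graph were directed.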