# papeles package - institutions networks per topic

In this notebook, we'll build institutions networks for specific topics, using the topics extracted in [this script](https://github.com/glhuilli/papeles/blob/master/scripts/papeles%20-%20keywords%20topics%20analysis.ipynb).



In [1]:
import os
import json
import itertools
from collections import Counter, defaultdict

from tqdm.notebook import tqdm
import networkx as nx

from papeles.paper.neurips import get_key, institutions, institutions_graph
from papeles.utils.topics import TopicPredictor


In [2]:
# These are files with encoding issues that were not parsed correctly by the pdf_parser
SKIP_FILES = [
    '5049-nonparametric-multi-group-membership-model-for-dynamic-networks.pdf_headers.txt',
    '4984-cluster-trees-on-manifolds.pdf_headers.txt',
    '5820-alternating-minimization-for-regression-problems-with-vector-valued-outputs.pdf_headers.txt',
    '9065-visualizing-and-measuring-the-geometry-of-bert.pdf_headers.txt',
    '4130-implicit-encoding-of-prior-probabilities-in-optimal-neural-populations.pdf_headers.txt',
    '7118-local-aggregative-games.pdf_headers.txt'
]

NEURIPS_ANALYSIS_DATA_PATH = '/var/data/neurips_analysis'

file_lines = defaultdict(list)
for filename in tqdm(os.listdir(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_headers/')), 'loading files'):
    if filename in SKIP_FILES:
        continue
    with open(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, './files_headers/', filename), 'r') as f:
        for line in f.readlines():
            file_lines[get_key(filename)].append(line.strip())
            
metadata_path = os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_metadata/')

metadata = {}
for filename in tqdm(os.listdir(metadata_path), 'loading metadata'):
    with open(os.path.join(metadata_path, filename), 'r') as f: # open in readonly mode
        for line in f.readlines():
            data = json.loads(line)
            metadata[get_key(data['pdf_name'])] = data






## Loading and predicting topics

In this analysis, I'll use only the 3-gram topics. The 2-gram and 1-gram topics needed further manual post-processing (e.g. removing topics that were mostly about writing style rather than research topics).

For this analysis, I'll compute only the 3 topics most frequently predicted from paper abstracts in each year, from 2009 to 2019.

Note that the topics loaded here were generated in a different script. Once loaded, they can be used to predict topics in new documents with the `TopicPredictor` object, as the following cells show. The "top topics" of a year are then determined by how frequently each topic was predicted for the papers of that year.
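The per-year counting described above can be sketched on toy data. The `toy_papers` list is invented for illustration and only stands in for the output of `TopicPredictor.predict_topics`, which is assumed here to map topic ids to non-negative scores. The helper names are hypothetical and not part of `papeles`.

```python
from collections import Counter, defaultdict

# Invented (year, prediction) pairs; each prediction mimics the assumed
# shape of TopicPredictor.predict_topics output: {topic_id: score}.
toy_papers = [
    (2018, {'topic_2': 3, 'topic_52': 1, 'topic_9': 0}),
    (2018, {'topic_2': 1, 'topic_19': 2}),
    (2018, {'topic_52': 4}),
    (2019, {'topic_9': 0}),  # nothing matched, so nothing is counted
]


def count_topics_per_year(papers, per_paper=5):
    """Count, per year, how often a topic is among a paper's best-scoring topics."""
    counts = defaultdict(Counter)
    for year, prediction in papers:
        ranked = sorted(prediction.items(), key=lambda kv: kv[1], reverse=True)
        matched = [topic for topic, score in ranked if score > 0]
        counts[year].update(matched[:per_paper])
    return counts


def top_n_per_year(counts, n=3):
    """Keep the n most frequently predicted topics of each year."""
    return {year: [t for t, _ in c.most_common(n)] for year, c in counts.items()}


toy_counts = count_topics_per_year(toy_papers)
toy_top = top_n_per_year(toy_counts)
print(toy_top)
```

A topic is counted once per paper regardless of its score, so the ranking reflects how many papers mention a topic rather than how strongly any single paper matches it.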

In [3]:
# load topics 
with open(os.path.join(NEURIPS_ANALYSIS_DATA_PATH, '3grams_topics.json'), 'r') as f:
    topics = json.load(f)


In [4]:
topic_predictor = TopicPredictor(topics)

topics_per_year = {}

year_topic_files = defaultdict(lambda: defaultdict(list))

for key, data in metadata.items():
    year = data['year']
    if year not in topics_per_year:
        topics_per_year[year] = Counter()
    
    topic_prediction = topic_predictor.predict_topics(data['abstract'])

    # topics with a positive score, ranked from highest to lowest score
    ranked_topics = [x[0] for x in sorted(topic_prediction.items(), key=lambda x: x[1], reverse=True) if x[1] > 0]
    year_topic_files[year][key] = ranked_topics

    # count at most the 5 best-scoring topics of each paper
    if sum(topic_prediction.values()) > 0:
        topics_per_year[year].update(ranked_topics[:5])

        
top3_topics_per_year = defaultdict(list)
for year, topic_counter in topics_per_year.items():
    top3_topics_per_year[year] = [x[0] for x in sorted(topic_counter.items(), key=lambda x: x[1], reverse=True)][:3]


In [5]:
top_topics = set()
for year, top3_topics in top3_topics_per_year.items():
    top_topics.update(top3_topics)
sorted_top_topics = sorted(top_topics, key=lambda x: int(x.split('_')[-1]), reverse=False)  # mini hack to sort by topic number

for t in sorted_top_topics:
    print(f'------\n{t}: {topics[t]}')


------
topic_2: ['loss_functions_deep', 'functions_deep_neural', 'optimized_stochastic_gradient', 'modern_deep_networks', 'stochastic_gradient_descent', 'gradient_descent_sgd', 'deep_neural_networks', 'high_dimensional_datasets', 'low_dimensional_structures', 'iterative_algorithm_based']
------
topic_8: ['sequential_monte_carlo', 'partially_observable_markov', 'observable_markov_decision', 'principled_framework_planning', 'provide_principled_framework', 'partially_observable_stochastic', 'markov_decision_processes', 'superposition-structured_dirty_statistical', 'problem_learning_control', 'high_dimensional_datasets']
------
topic_9: ['gaussian_graphical_models', 'paper_address_problem', 'linear_regression_models', 'dirichlet_allocation_lda', 'maximum_posteriori_map', 'address_problem_learning', 'problem_learning_structure', 'posteriori_map_assignment', 'study_problem_finding', 'directed_graphical_models']
------
topic_10: ['probabilistic_graphical_model', 'support_vector_machines', 'in

## Naming topics

Visually reviewing the 3-gram lists for each topic, most of them are easy to associate with a particular line of research (e.g. `Topic 2` pairs well with optimization methods for deep learning), though other topics are a little harder (e.g. `Topic 10` has graphical models, SVMs, neural networks, and NLP in it). Given that the terms are ranked within each topic, I used the top 5 terms to decide on a name for the hard cases (e.g. the top 5 terms of `Topic 10` are most associated with `Probabilistic Graphical Models`, so that's the name I used).
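As a small aid for this manual naming step, the top-ranked terms of a topic can be printed in a readable form. The sketch below is hypothetical: `toy_topics` copies a few terms from the output above and assumes the same shape as `topics` (topic id mapped to a ranked list of 3-grams), and `top_terms` is not a `papeles` function.

```python
# A few terms copied from the printed topics above, in their original order.
toy_topics = {
    'topic_2': ['loss_functions_deep', 'functions_deep_neural',
                'optimized_stochastic_gradient', 'modern_deep_networks',
                'stochastic_gradient_descent', 'gradient_descent_sgd'],
    'topic_9': ['gaussian_graphical_models', 'paper_address_problem',
                'linear_regression_models', 'dirichlet_allocation_lda',
                'maximum_posteriori_map', 'address_problem_learning'],
}


def top_terms(topics, topic_id, n=5):
    """Return the n highest-ranked terms of a topic, with underscores replaced by spaces."""
    return [term.replace('_', ' ') for term in topics[topic_id][:n]]


for topic_id in toy_topics:
    print(topic_id, '->', top_terms(toy_topics, topic_id))
```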


In [6]:
topic_mapping = {
    'topic_2': 'deep learning (optimization)',
    'topic_8': 'markov decision processes',
    'topic_9': 'probabilistic graphical models',
    'topic_10': 'probabilistic graphical models (inference)',
    'topic_15': 'reinforcement learning',
    'topic_18': 'matrix decomposition',
    'topic_19': 'deep reinforcement learning',
    'topic_22': 'ML optimization problems (gradients)',
    'topic_23': 'bayesian inference algorithms',
    'topic_35': 'neural networks',
    'topic_45': 'bayesian methods',
    'topic_52': 'deep learning (models)'
}
topic_mapping_snake = {
    'topic_2': 'deep_learning_optimization',
    'topic_8': 'markov_decision_processes',
    'topic_9': 'probabilistic_graphical_models',
    'topic_10': 'probabilistic_graphical_models_inference',
    'topic_15': 'reinforcement_learning',
    'topic_18': 'matrix_decomposition',
    'topic_19': 'deep_reinforcement_learning',
    'topic_22': 'ML_optimization_problems_gradients',
    'topic_23': 'bayesian_inference_algorithms',
    'topic_35': 'neural_networks',
    'topic_45': 'bayesian_methods',
    'topic_52': 'deep_learning_models'
}

for year, top_topics in sorted(top3_topics_per_year.items(), key=lambda x: x[0]):
    print(f'{year} --> {[topic_mapping.get(t) for t in top_topics]}')


2009 --> ['probabilistic graphical models', 'bayesian methods', 'probabilistic graphical models (inference)']
2010 --> ['probabilistic graphical models', 'reinforcement learning', 'neural networks']
2011 --> ['probabilistic graphical models (inference)', 'neural networks', 'ML optimization problems (gradients)']
2012 --> ['bayesian inference algorithms', 'probabilistic graphical models', 'markov decision processes']
2013 --> ['markov decision processes', 'matrix decomposition', 'reinforcement learning']
2014 --> ['probabilistic graphical models (inference)', 'neural networks', 'matrix decomposition']
2015 --> ['deep learning (optimization)', 'deep learning (models)', 'neural networks']
2016 --> ['deep learning (optimization)', 'deep learning (models)', 'deep reinforcement learning']
2017 --> ['deep learning (optimization)', 'deep learning (models)', 'deep reinforcement learning']
2018 --> ['deep learning (optimization)', 'deep learning (models)', 'deep reinforcement learning']
2019 -->

## Institutions networks per topic

In this section, an institutions network is built from the papers associated with each of the top 3 topics of each year.

Note that graphs are created as `directed` in this example (unlike the other institutions network example) so that the Hierarchical Edge 
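To make the idea of an institutions network concrete, here is a minimal sketch that derives weighted directed edges from the institutions listed on each paper. It is not the implementation of `institutions_graph.build_institutions_graph`: the input data is invented, and the edge orientation (from the institution listed earlier to the one listed later) is an arbitrary assumption made only for this illustration.

```python
from collections import Counter
from itertools import combinations

# Invented input: paper key -> institutions in the order they appear in the header.
toy_paper_institutions = {
    'paper_a': ['mit', 'microsoft research'],
    'paper_b': ['mit', 'microsoft research', 'princeton university', 'mit'],
    'paper_c': ['deepmind'],  # a single institution produces no edge
}


def directed_edges(paper_institutions):
    """Count (source, target) pairs, pointing from earlier- to later-listed institutions."""
    edges = Counter()
    for listed in paper_institutions.values():
        unique = list(dict.fromkeys(listed))  # de-duplicate while keeping order
        edges.update(combinations(unique, 2))
    return edges


toy_edges = directed_edges(toy_paper_institutions)
for (source, target), weight in toy_edges.items():
    print(f'{source} -> {target} (papers: {weight})')
```

The resulting pairs could be loaded into a `networkx.DiGraph` with `add_edges_from`, which accepts an iterable of `(source, target)` tuples.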

In [7]:
inst_counter = Counter()
for file, lines in list(file_lines.items()):
    file_institutions = institutions.get_file_institutions(lines)
    unique_file_institutions = list(set(file_institutions))
    inst_counter.update(unique_file_institutions)


In [8]:
def subset_data(year, topic, year_topic_files):
    topic_files = year_topic_files[year]
    keys = []
    for key, topics_prediction in topic_files.items():
        if topic in topics_prediction:
            keys.append(key) 
    return keys

            
graphs_per_topic = {}
for year, topics_per_year in tqdm(sorted(top3_topics_per_year.items(), key=lambda x: x[0])):
    print(f'------\nyear: {year}\n------')    
    if year not in graphs_per_topic:
        graphs_per_topic[year] = {}
    for idx, topic in enumerate(topics_per_year):
        keys_filter = subset_data(year, topic, year_topic_files)
        graphs_per_topic[year][topic], _ = institutions_graph.build_institutions_graph(file_lines, metadata, inst_counter, freq=5, year=year, keys_filter=keys_filter, directed=True)
        print(f'\nTopic {idx+1}: {topic_mapping[topic]}')
        print(nx.info(graphs_per_topic[year][topic]))



------
year: 2009
------

Topic 1: probabilistic graphical models
Name: 
Type: DiGraph
Number of nodes: 9
Number of edges: 13
Average in degree:   1.4444
Average out degree:   1.4444

Topic 2: bayesian methods
Name: 
Type: DiGraph
Number of nodes: 11
Number of edges: 17
Average in degree:   1.5455
Average out degree:   1.5455

Topic 3: probabilistic graphical models (inference)
Name: 
Type: DiGraph
Number of nodes: 3
Number of edges: 3
Average in degree:   1.0000
Average out degree:   1.0000
------
year: 2010
------

Topic 1: probabilistic graphical models
Name: 
Type: DiGraph
Number of nodes: 7
Number of edges: 7
Average in degree:   1.0000
Average out degree:   1.0000

Topic 2: reinforcement learning
Name: 
Type: DiGraph
Number of nodes: 3
Number of edges: 4
Average in degree:   1.3333
Average out degree:   1.3333

Topic 3: neural networks
Name: 
Type: DiGraph
Number of nodes: 4
Number of edges: 4
Average in degree:   1.0000
Average out degree:   1.0000
------
year: 2011
------

Topi

In [9]:
folder = 'heb_files'

for year, topics_per_year in tqdm(sorted(top3_topics_per_year.items(), key=lambda x: x[0])):
    for idx, topic in enumerate(topics_per_year):
        file_name = f'{year}-topic_{idx+1}-{topic_mapping_snake[topic]}_graph.json'
        institutions_graph.dump_to_d3js_heb(
            graphs_per_topic[year][topic], os.path.join(NEURIPS_ANALYSIS_DATA_PATH, folder, file_name))






## Network analysis 

With these graphs, a wide range of questions can be explored.

For example, which pairs of institutions co-authored papers on the top 3 topics in more than one year?

In [10]:
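# For every pair of distinct institutions, collect the years in which they share an edge
edges_per_year = defaultdict(set)

for year, topics_per_year in top3_topics_per_year.items():
    for topic in topics_per_year:
        for edge in graphs_per_topic[year][topic].edges():
            if len(set(edge)) > 1:  # ignore self-loops
                e = f'{edge[0]}-{edge[1]}'
                e_r = f'{edge[1]}-{edge[0]}'
                # the pair is already tracked under the reversed key, so this edge is skipped
                if e_r in edges_per_year:
                    continue
                edges_per_year[e].add(year)

# keep only the pairs that appear in more than one year, most years first
[x for x in sorted(edges_per_year.items(), key=lambda x: len(x[1]), reverse=True) if len(x[1]) > 1]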

[('university of washington-microsoft research', {2015, 2016, 2019}),
 ('university of texas at austin-microsoft research', {2013, 2019}),
 ('carnegie mellon university-bosch center for artiﬁcial intelligence',
  {2018, 2019}),
 ('university of cambridge-deepmind', {2017, 2019}),
 ('princeton university-mit', {2014, 2017}),
 ('mit-microsoft research', {2009, 2017}),
 ('google research-google brain', {2016, 2018}),
 ('university of california berkeley-university of texas at austin',
  {2009, 2014}),
 ('georgia institute of technology-carnegie mellon university', {2011, 2014}),
 ('purdue university-microsoft research', {2009, 2010}),
 ('microsoft research-kaist', {2009, 2011})]
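In the cell above, an edge whose reversed key is already present is skipped, so its year is not recorded for that pair. A possible variation, sketched below on invented data, keys each pair by a sorted tuple so that both directions of an edge contribute to the same entry. Tuples also avoid ambiguity if an institution name contains a hyphen. The function and variable names are hypothetical, and applying this to the real graphs may give results that differ from the output shown above.

```python
from collections import defaultdict

# Invented directed edges per year, shaped like the pairs returned by graph.edges().
toy_edges_by_year = {
    2015: [('university of washington', 'microsoft research')],
    2016: [('microsoft research', 'university of washington'), ('mit', 'mit')],
    2017: [('mit', 'microsoft research')],
}


def years_per_pair(edges_by_year):
    """Map each unordered institution pair to the set of years in which it appears."""
    pair_years = defaultdict(set)
    for year, edges in edges_by_year.items():
        for source, target in edges:
            if source == target:
                continue  # ignore self-loops
            pair = tuple(sorted((source, target)))  # same key for both directions
            pair_years[pair].add(year)
    return pair_years


toy_pair_years = years_per_pair(toy_edges_by_year)
recurring = {pair: years for pair, years in toy_pair_years.items() if len(years) > 1}
print(recurring)
```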