# Introduction

In this notebook we build on our results from Part I and Part II, where we quantified scientific impact via network analysis. In particular, we built a directed network whose nodes are the publications in High Energy Physics (HEP) from roughly the last 55 years, made available to us by InspireHEP.


Here we focus on a practical application. Parts I and II gave a broad characterization of the impact of authors: the method picked out very influential scientists as the ones with the highest impact. It's certainly neat that it proves very effective at picking out Nobel Prize winners, for instance.

In practice, citations and other evaluations of impact probably matter most for hiring and promotion decisions. Here we attempt to recommend specific candidates based on their PageRank (PR) ranking.

# Preliminary Stuff

In [1]:
import networkx as nx
%matplotlib inline
import matplotlib
matplotlib.style.use('fivethirtyeight')
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'
from collections import defaultdict
from my_utils import *

In [2]:
# let's read the data, clean it a bit, and build network

# key is recid, value is info
# info = {authors: [], citations: int, pub_date: datetime}
recid_info = defaultdict(dict)
json_path = '/Users/ederizaguirre/Research/FunProjects/InspireHEPNetworks/'
DG = nx.DiGraph() # directed graph. Edges are (paper that cites, paper that's cited)
with open(json_path + 'hep_records.json') as f:
    for line in f:  # each line of the file is passed to compute_graph in turn
        compute_graph(DG, recid_info, line)
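
The helper `compute_graph` lives in `my_utils` and is not shown here. As a rough illustration of the edge convention used above (citing paper pointing to cited paper), here is a minimal sketch that does not use networkx. The function name `add_record` and the JSON field names `recid`, `authors` and `references` are assumptions made for this example, not the confirmed schema of `hep_records.json`.

```python
import json
from collections import defaultdict


def add_record(edges, recid_info, line):
    """Parse one JSON record and register its outgoing citation edges.

    The field names used here are assumptions for this sketch.
    """
    record = json.loads(line)
    recid = record['recid']
    recid_info[recid]['authors'] = record.get('authors', [])
    for cited in record.get('references', []):
        # edge direction: (paper that cites) -> (paper that's cited)
        edges[recid].add(cited)


edges = defaultdict(set)       # key is citing recid; value is set of cited recids
recid_info = defaultdict(dict)

# three made-up records, one JSON object per line
toy_lines = [
    '{"recid": 1, "authors": ["A"], "references": []}',
    '{"recid": 2, "authors": ["B", "C"], "references": [1]}',
    '{"recid": 3, "authors": ["A", "C"], "references": [1, 2]}',
]
for line in toy_lines:
    add_record(edges, recid_info, line)
```

In this toy data, paper 3 cites papers 1 and 2, so `edges[3]` holds both; paper 1 cites nothing and has no outgoing edges.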

In [3]:
len(recid_info) # number of entries

644925

In [4]:
recid_pr = nx.pagerank(DG) # key is recid; value is PR
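
To make it a bit more concrete what `nx.pagerank` computes, here is a small power-iteration sketch in plain Python on a toy citation graph. It is an illustration under simple assumptions (fixed number of iterations, papers with no references spread their rank uniformly), not networkx's actual implementation. The damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def toy_pagerank(edges, nodes, damping=0.85, n_iter=100):
    """Plain power iteration for PageRank on a small graph.

    edges maps a citing paper to the list of papers it cites.
    Papers that cite nothing (dangling nodes) spread their rank uniformly.
    """
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # total rank currently sitting on papers without outgoing edges
        dangling = sum(rank[u] for u in nodes if not edges.get(u))
        new_rank = {}
        for node in nodes:
            new_rank[node] = (1.0 - damping) / n + damping * dangling / n
        for u in nodes:
            targets = edges.get(u, ())
            for v in targets:
                # a citing paper splits its rank evenly among the papers it cites
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


# paper 2 cites paper 1; paper 3 cites papers 1 and 2
toy_edges = {2: [1], 3: [1, 2]}
toy_rank = toy_pagerank(toy_edges, nodes=[1, 2, 3])
```

Paper 1 is cited by both other papers and ends up with the highest rank, while paper 3, which nobody cites, ends up with the lowest.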

In [5]:
# let's sort the entries
recid_pr_sorted = sorted(recid_pr.items(), 
                            key=lambda x: x[1], 
                            reverse=True
                           )

In [4]:
# alternatively, we can load the PageRank values from a file we created in Part I
recid_pr_sorted = []
recid_pr = defaultdict(float)
with open('../data/recid_pr_damping0p85.dat') as f:
    for line in f:
        line = line.split()
        recid = int(line[0])
        pr = float(line[1])
        recid_pr_sorted.append((recid, pr))
        recid_pr[recid] = pr

In [5]:
# update pr key in recid_info dict
for recid in recid_info:  # iterating over a dict yields its keys (works in Python 2 and 3)
    recid_info[recid]['pr'] = recid_pr[recid]

Ok, so now we have read and processed the metadata from the publications.
We'd like to focus on authors, so let's compute each author's PR.

In [6]:
# build author_pr dict
# default to 1/N penalty
author_pr_sorted = author_metric_sorted(recid_info,
                                       norm = 'len(authors)'
                                      )
author_pr = {au: pr for au, pr in author_pr_sorted} # key is author; value is PR
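
The function `author_metric_sorted` comes from `my_utils` and is not shown. A minimal sketch of what a 1/N penalty could look like is below. It assumes an author's score is the sum over their papers of the paper's score divided by its number of authors; the name `author_scores` and the `share_equally` flag are made up for this example.

```python
from collections import defaultdict


def author_scores(recid_info, metric='pr', share_equally=True):
    """Sum a per-paper metric over each author's papers.

    With share_equally=True each paper's score is divided by its number
    of authors (the 1/N penalty); otherwise every author gets full credit.
    """
    scores = defaultdict(float)
    for info in recid_info.values():
        authors = info.get('authors', [])
        if not authors:
            continue  # nothing to attribute
        weight = 1.0 / len(authors) if share_equally else 1.0
        for author in authors:
            scores[author] += weight * info.get(metric, 0.0)
    # list of (author, score) pairs, highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# made-up papers with made-up PR values
toy_info = {
    1: {'authors': ['A'], 'pr': 0.6},
    2: {'authors': ['B', 'C'], 'pr': 0.3},
    3: {'authors': ['A', 'C'], 'pr': 0.1},
}
toy_author_pr = author_scores(toy_info)
```

Here author A collects 0.6 from the single-author paper plus half of 0.1, which puts A first, followed by C and then B.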

Ok, so how do we make a recommendation for a faculty hire in a given year?
Here's a simple solution.
Normally, everyone hired as an Assistant Professor in HEP has completed one or two 3-year postdoctoral fellowships after their PhD. So if we're hiring in, say, 1999, we'd be hiring people who got their PhDs sometime between 1993 and 1996. The InspireHEP metadata does not give us authors' PhD dates, however. A good substitute is the date of the author's first publication: most people in HEP publish their first paper in the 2nd or 3rd year of their PhD, and the majority graduate in 5 years, although a small fraction finish in 4.

So if we're hiring in 1999 and the candidate has two postdocs behind them (PhD in 1993), that candidate would have written their first paper 2 or 3 years before graduating, back in 1990 or 1991.
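
The arithmetic above can be written down as a tiny helper. This is just a sketch of the rule of thumb: the function name and its parameters are made up here, and it assumes 3-year postdocs and a first paper 2 or 3 years before the PhD.

```python
def first_pub_years(hire_year, n_postdocs, postdoc_years=3,
                    years_before_phd=(2, 3)):
    """Estimate first-publication years for candidates hired in hire_year.

    Assumes n_postdocs postdocs of postdoc_years each after the PhD, and
    a first paper published 2 or 3 years before graduating.
    """
    phd_year = hire_year - n_postdocs * postdoc_years
    return sorted(phd_year - gap for gap in years_before_phd)


two_postdocs = first_pub_years(1999, n_postdocs=2)  # -> [1990, 1991]
one_postdoc = first_pub_years(1999, n_postdocs=1)   # -> [1993, 1994]
```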

In [7]:
# organize authors by the year of their first publication

# key is first pub's year; value is authors whose first pub in that year
first_pub_year_authors = get_first_pub_year_authors(recid_info)
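
Since `get_first_pub_year_authors` is defined in `my_utils` and not shown, here is a minimal sketch of how such a grouping could be done. It assumes each entry of `recid_info` carries an `authors` list and a `pub_date` datetime, as described in the comments earlier; the function name `first_pub_year_by_author` is made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def first_pub_year_by_author(recid_info):
    """Group authors by the year of their earliest recorded publication."""
    first_year = {}
    for info in recid_info.values():
        pub_date = info.get('pub_date')
        if pub_date is None:
            continue  # skip records without a usable date
        for author in info.get('authors', []):
            if author not in first_year or pub_date.year < first_year[author]:
                first_year[author] = pub_date.year
    # invert: key is year; value is list of authors who first published then
    year_authors = defaultdict(list)
    for author, year in first_year.items():
        year_authors[year].append(author)
    return year_authors


# made-up records
toy_info = {
    1: {'authors': ['A'], 'pub_date': datetime(2010, 5, 1)},
    2: {'authors': ['A', 'B'], 'pub_date': datetime(2011, 3, 1)},
    3: {'authors': ['B', 'C'], 'pub_date': datetime(2012, 7, 1)},
}
toy_years = first_pub_year_by_author(toy_info)
```

In the toy data, A first appears in 2010, B in 2011 and C in 2012, even though A and B also appear on later papers.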

So, if we were to hire in 2017, and we wanted someone fresh out of their *first* postdoctoral fellowship (so someone with a lot of potential), we'd look for a candidate with a first publication from 2011 or 2012. Someone whose first paper appeared in 2011 would have been only a 2nd-year PhD student at the time.

In [127]:
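top_authors_2017 = []
for year in range(2010, 2012):  # range(2010, 2012) covers first-publication years 2010 and 2011
    top_authors_2017 += first_pub_year_authors[year]

# rank the pooled candidates by author PR, highest first
top_authors_2017.sort(key=lambda author: author_pr[author], reverse=True)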
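
The pooling and ranking step can also be wrapped in a small reusable function, which makes it easy to try other years or other metrics. This is only a sketch: the name `top_candidates` is made up, and treating authors without a score as having score zero is an assumption of this example.

```python
def top_candidates(year_authors, scores, years, k=3):
    """Pool authors from the given first-publication years and rank them."""
    pool = []
    for year in years:
        pool.extend(year_authors.get(year, []))
    # authors missing from `scores` rank last instead of raising a KeyError
    return sorted(pool, key=lambda a: scores.get(a, 0.0), reverse=True)[:k]


# made-up authors and scores
toy_year_authors = {2010: ['A', 'B'], 2011: ['C'], 2012: ['D']}
toy_scores = {'A': 0.2, 'B': 0.5, 'C': 0.3}
shortlist = top_candidates(toy_year_authors, toy_scores,
                           years=[2010, 2011], k=2)  # -> ['B', 'C']
```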

In [128]:
top_authors_2017[:3] 

[u'Stanford, Douglas', u'Linden, Tim', u'Upadhyay, Sudhaker']

So there we go.
D. Stanford did his PhD from 2010 to 2014. He is now a postdoc at the Institute for Advanced Study in Princeton, and is starting a faculty job this Fall at Stanford (curiously enough, given his last name)!
http://inspirehep.net/author/profile/D.Stanford.1

Linden did his PhD from 2008-2013 at Santa Cruz and is now on his 2nd postdoc at Ohio State after a 2-year postdoc at Chicago. So perhaps he's due soon!
http://inspirehep.net/author/profile/T.Linden.1


The third, Upadhyay, did his PhD in India from 2008 to 2013. I see he's already written 60+ papers (!).
http://inspirehep.net/author/profile/S.Upadhyay.1

What if we use citations instead?

In [129]:
# we need to build the author_num_cites dict

# key is author; value is num_cites
author_num_cites_sorted = author_metric_sorted(recid_info,
                                               metric = 'num_citations',
                                               norm = None
                                              )
author_num_cites = {au: n for au, n in author_num_cites_sorted}

In [130]:
# sort by citations
top_authors_2017.sort(key = lambda author: author_num_cites[author], reverse = True)

In [133]:
top_authors_2017[:3] # top 3 authors by number of citations

[u'Fisher, Matthew', u'Khakzad, Mohsen', u'Liu, Yanwen']

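Fisher and Khakzad are both from CMS, one of the LHC experiments.
CMS has some 3000+ members, and as a rule every member puts their name on every paper CMS puts out.
Liu is in ATLAS it seems, the other general-purpose LHC experiment.
Since we passed norm = None here, the citation counts are presumably not divided by the number of authors, so every member of a large collaboration gets full credit for every collaboration paper and rises to the top.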
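
To see why raw citation counts favor members of large collaborations, here is a toy comparison of full credit versus credit shared equally among authors. All numbers and names below are made up for illustration, and `author_totals` is a hypothetical helper, not the function from `my_utils`.

```python
def author_totals(papers, share_equally):
    """Total citations per author, with or without dividing by author count."""
    totals = {}
    for authors, citations in papers:
        weight = 1.0 / len(authors) if share_equally else 1.0
        for author in authors:
            totals[author] = totals.get(author, 0.0) + weight * citations
    return totals


# one paper signed by a 3000-member collaboration, one single-author paper
collaboration = ['member_%d' % i for i in range(3000)]
papers = [(collaboration, 5000), (['theorist'], 400)]

raw = author_totals(papers, share_equally=False)
shared = author_totals(papers, share_equally=True)
```

With full credit, each collaboration member is far ahead of the single author (5000 versus 400). With shared credit the order flips, since each member only gets 5000/3000 of a citation.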

Ok, this is hardly a systematic study.
In particular, not every author follows the same path to a faculty job.
However, it's interesting that this very basic application of PageRank as a recommendation system for a faculty hiring committee for the year 2017 yielded a name who was *actually* hired by one of the top places, and other names who may be on a fast track to a faculty job.

In principle I could apply this method to previous years, but for now I will leave it at that. For previous years, we'd want to compute the PageRank of papers and authors by taking a snapshot of InspireHEP *at that time* so as to not let future papers and the citations they confer affect the results from that particular year.
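
A snapshot of this kind could be approximated by dropping every paper published on or after a cutoff date before recomputing PageRank. The sketch below does this for a plain dictionary of citation edges. The name `snapshot_edges` and the toy data are made up, and it assumes every paper has a reliable publication date.

```python
from datetime import datetime


def snapshot_edges(edges, pub_dates, cutoff):
    """Keep only citations between papers published before the cutoff."""
    kept = {}
    for citing, cited_list in edges.items():
        citing_date = pub_dates.get(citing)
        if citing_date is None or citing_date >= cutoff:
            continue  # the citing paper did not exist yet (or has no date)
        kept[citing] = [c for c in cited_list
                        if pub_dates.get(c) is not None
                        and pub_dates[c] < cutoff]
    return kept


# made-up papers: key is citing recid; value is list of cited recids
toy_edges = {2: [1], 3: [1, 2], 4: [1, 3]}
toy_dates = {
    1: datetime(1995, 1, 1),
    2: datetime(1997, 6, 1),
    3: datetime(1998, 9, 1),
    4: datetime(2003, 2, 1),
}
edges_1999 = snapshot_edges(toy_edges, toy_dates,
                            cutoff=datetime(1999, 1, 1))
```

Paper 4 was published in 2003, so its citations disappear from the 1999 snapshot and cannot boost the rank of papers 1 and 3.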