### This notebook wrangles the 'track' column for NIPS-2018 data to match it to NIPS-2019 tracks

The tracks in the 2019 'track' column have been aggregated into 9 'main_tracks'.
This notebook maps the 2018 tracks onto the 2019 tracks so that both years share the same labels.
- we will read in the track labels from the 2019 data
- we will split the 2019 'track_original' labels on the " -- " delimiter to build a lookup from track to main_track
- we will then pass the 2018 tracks through that lookup and compare the two years by grouping the data
- we will reconcile tracks with minor labeling differences across 2018 and 2019
- the labels that can't be reconciled will be grouped under a 'Not Found/NF' main_track label for now

In [1]:
import pandas as pd
import re

In [2]:
#read in NIPS data
nips = pd.read_csv("../data/nips.csv")

#read in 2019 data with tracks information
nips19 = pd.read_csv("../data/nips_with_track_cleaned.csv")
nips19 = nips19[nips19['year'] == 2019]

In [3]:
nips19.head()

Unnamed: 0,title,abstract,pdf_link,year,track,track_original,main_track
0,A Game Theoretic Approach to Class-wise Select...,Selection of input features such as relevant p...,http://papers.nips.cc/paper/by-source-2019-5315,2019,Adversarial Learning,Algorithms -- Adversarial Learning,Algorithms
1,A Little Is Enough: Circumventing Defenses For...,Distributed learning is central for large-scal...,http://papers.nips.cc/paper/by-source-2019-4657,2019,Adversarial Learning,Algorithms -- Adversarial Learning,Algorithms
2,A New Defense Against Adversarial Images: Turn...,Natural images are virtually surrounded by low...,http://papers.nips.cc/paper/by-source-2019-926,2019,Adversarial Learning,Algorithms -- Adversarial Learning,Algorithms
3,Tight Certificates of Adversarial Robustness f...,Strong theoretical guarantees of robustness ca...,http://papers.nips.cc/paper/by-source-2019-2720,2019,Adversarial Learning,Algorithms -- Adversarial Learning,Algorithms
4,Adversarial training for free!,"Adversarial training, in which a network is tr...",http://papers.nips.cc/paper/by-source-2019-1853,2019,Adversarial Learning,Algorithms -- Adversarial Learning,Algorithms


In [4]:
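#create a dict to map track to main_track (canonical label)

tracks19 = nips19.track_original.unique().tolist()
mt19 = nips19.main_track.unique().tolist()

doc = {}

#"Algorithms -- Adversarial Learning" becomes doc["Adversarial Learning"] = "Algorithms"
for t in tracks19:
    parts = t.split(" -- ")
    doc[parts[1]] = parts[0]

#each main track also maps to itself
for t in mt19:
    doc[t] = t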
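The loop above assumes every 2019 label is a string containing the " -- " delimiter. A minimal sketch of a more defensive version follows; the function name `build_track_map` and the toy labels are hypothetical, and it keeps everything after the first delimiter as the track name, which differs from the cell above if a label contains the delimiter twice.

```python
def build_track_map(track_originals, main_tracks, sep=" -- "):
    """Map each sub-track (and each main track) to its main track."""
    mapping = {}
    for label in track_originals:
        # Skip missing values (NaN is a float) and labels without the delimiter.
        if not isinstance(label, str) or sep not in label:
            continue
        main, sub = label.split(sep, 1)
        mapping[sub.strip()] = main.strip()
    for main in main_tracks:
        # Main tracks map to themselves, as in the cell above.
        if isinstance(main, str):
            mapping[main] = main
    return mapping


# Toy input: one well-formed label, one without a delimiter, one missing value.
example = build_track_map(
    ["Algorithms -- Adversarial Learning", "Theory", float("nan")],
    ["Algorithms", "Theory"],
)
print(example)
```

Whether the real 2019 data contains such irregular labels is not known from this notebook; the guard only avoids an `IndexError` or `AttributeError` if it does.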

In [5]:
#subset nips2018 data

nips18 = nips[nips['year'] == 2018].copy()

In [6]:
nips18.shape

(2846, 5)

In [7]:
#create a column to record the original track info

nips18['track_original'] = nips18['track']

In [8]:
#reconcile rows with minor track variations

nips18.loc[nips18['track'] == 'Data, Competitions, Implementations, and Software', 'track'] =  "Data, Challenges, Implementations, and Software"
nips18.loc[nips18['track'] == 'Few-Shot Learning Approaches', 'track'] =  'Few-Shot Learning' 
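As more label variations turn up, the one-line-per-label pattern above could be collected into a single rename table. The sketch below uses `Series.replace` with a dict on a toy frame; the name `renames` and the toy data are illustrative only, while the two label pairs are the ones reconciled in the cell above.

```python
import pandas as pd

# 2018 label -> 2019 label, for tracks that differ only in wording.
renames = {
    "Data, Competitions, Implementations, and Software":
        "Data, Challenges, Implementations, and Software",
    "Few-Shot Learning Approaches": "Few-Shot Learning",
}

# Toy stand-in for nips18; only the 'track' column matters here.
toy = pd.DataFrame({"track": ["Few-Shot Learning Approaches", "Motor Control"]})

# Exact-match replacement; labels not listed in the dict are left unchanged.
toy["track"] = toy["track"].replace(renames)
print(toy["track"].tolist())
```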

In [9]:
#pass the track column through the dict to create the main_track column

tracks18 =  nips18.track.tolist()

mt18 = []

for t in tracks18:    
    if t in doc:
        mt18.append(doc[t])
    else:
        mt18.append('NF')
        
nips18['main_track'] = mt18
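The same lookup can be written without an explicit loop. This is a sketch of an equivalent vectorized form on toy data: `Series.map` returns NaN for labels missing from the dict, and `fillna` turns those into 'NF'. The small `doc` dict here is a stand-in for the full one built earlier.

```python
import pandas as pd

# Stand-in for the track -> main_track dict built from the 2019 data.
doc = {"Adversarial Learning": "Algorithms", "Theory": "Theory"}

# Toy stand-in for nips18.
toy = pd.DataFrame({"track": ["Adversarial Learning", "Motor Control", "Theory"]})

# Labels absent from doc map to NaN, which is then replaced by 'NF'.
toy["main_track"] = toy["track"].map(doc).fillna("NF")
print(toy["main_track"].tolist())
```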

In [10]:
nips18.groupby("main_track").size()

main_track
Algorithms                                         685
Applications                                       473
Data, Challenges, Implementations, and Software     17
Deep Learning                                      572
NF                                                  59
Neuroscience and Cognitive Science                  91
Optimization                                       205
Probabilistic Methods                              247
Reinforcement Learning and Planning                228
Theory                                             269
dtype: int64

In [11]:
#list of tracks that fall under 'Not Found/NF' main_track column with individual notes

nips18[nips18['main_track'] == 'NF'].track.unique().tolist()

['Bayesian Theory',
 'Brain Segmentation',
 'Competitive Analysis',
 'Large Margin Methods',
 'Motor Control',
 'Music Modeling and Analysis',
 'Plasticity and Adaptation',
 'Program Induction',
 'Source Separation',
 'Speech Recognition',
 'Spike Train Generation',
 'Systems Biology',
 'Text Analysis',
 'Video Segmentation',
 'Visual Features']

Notes:

1. 'Bayesian Theory' --> 'Bayesian Nonparametrics' in tracks_2019? (same track in '17)
2. 'Brain Segmentation' --> 'Brain Imaging', 'Brain Mapping', or 'Brain--Computer Interfaces and Neural Prostheses' in tracks_2019?
3. 'Competitive Analysis' --> no similar track (same track in '17)
4. 'Large Margin Methods' --> 'Large Scale Learning' or 'Large Deviations and Asymptotic Analysis' in tracks_2019?
5. 'Motor Control' --> no similar track (same track in '17)
6. 'Music Modeling and Analysis' --> no similar track (same track in '17)
7. 'Plasticity and Adaptation' --> no similar track
8. 'Program Induction' --> 'Program Understanding and Generation' in tracks_2019? (same track in '17)
9. 'Source Separation' --> no similar track (same track in '17)
10. 'Speech Recognition' --> 'Audio and Speech Processing' in tracks_2019? (same track in '17)
11. 'Spike Train Generation' --> no similar track (same track in '17)
12. 'Systems Biology' --> no similar track (same track in '17)
13. 'Text Analysis' --> 'Natural Language Processing' in tracks_2019? (same track in '17)
14. 'Video Segmentation' --> 'Image Segmentation' in tracks_2019?
15. 'Visual Features' --> many options: 'Tracking and Motion in Video', 'Visualization or Exposition Techniques for Deep Networks', 'Visual Scene Analysis and Interpretation', 'Visual Perception' in tracks_2019 (same track in '17)
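Candidate matches like the ones in these notes could be shortlisted automatically before manual review. The sketch below uses the standard library's `difflib.get_close_matches`; the helper name `suggest_tracks`, the cutoff of 0.6, and the short label lists are illustrative assumptions, with the labels taken from the notes above rather than from the full 2019 data.

```python
import difflib


def suggest_tracks(unmatched, candidates, n=3, cutoff=0.6):
    """Return up to n closest candidate labels for each unmatched label."""
    return {
        label: difflib.get_close_matches(label, candidates, n=n, cutoff=cutoff)
        for label in unmatched
    }


# Illustrative subset of labels taken from the notes above.
unmatched_2018 = ["Video Segmentation", "Speech Recognition", "Motor Control"]
candidates_2019 = [
    "Image Segmentation",
    "Audio and Speech Processing",
    "Natural Language Processing",
]

suggestions = suggest_tracks(unmatched_2018, candidates_2019)
for label, matches in suggestions.items():
    print(label, "->", matches)
```

String similarity only picks up surface overlap, so pairs that are related in meaning but not in spelling (for example 'Text Analysis' and 'Natural Language Processing') would still need to be matched by hand.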

In [12]:
#append the 2018 rows to the cleaned file; re-running this cell appends them again
nips18.to_csv("../data/nips_with_track_cleaned.csv", mode='a', index=False, header=False)
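Appending with `header=False` relies on the frame's columns lining up with the columns already in the file. The 2019 preview earlier shows an `Unnamed: 0` column, which suggests the existing file may carry a saved index that `nips18` lacks, so a check before appending seems worthwhile. The sketch below is one possible guard, demonstrated on a temporary file; the helper name `append_aligned` and the toy frames are hypothetical.

```python
import os
import tempfile

import pandas as pd


def append_aligned(df, path):
    """Append df to an existing CSV, reordered to match the file's header."""
    # Read only the header row of the existing file.
    header = pd.read_csv(path, nrows=0).columns.tolist()
    missing = [c for c in header if c not in df.columns]
    if missing:
        raise ValueError(f"columns missing from frame: {missing}")
    # Write the columns in the file's order, without index or header.
    df[header].to_csv(path, mode="a", index=False, header=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # Existing file with columns in the order (title, year).
    pd.DataFrame({"title": ["a"], "year": [2019]}).to_csv(path, index=False)
    # New rows arrive with the columns in a different order.
    append_aligned(pd.DataFrame({"year": [2018], "title": ["b"]}), path)
    combined = pd.read_csv(path)

print(combined)
```

With this guard, a frame that lacks one of the file's columns raises an error instead of silently writing shifted rows.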

In [13]:
#record each unique (track, track_original, main_track, year) combination for 2018
document = nips18[['track', 'track_original', 'main_track', 'year']]

document = document.drop_duplicates()

document.to_csv("../data/nips_yearwise_trackinfo.csv", mode='a', index=False, header=False)

In [14]:
nips18.shape

(2846, 7)