### This notebook wrangles the 'track' column of the NIPS 2017 data to match the NIPS 2019 tracks

The tracks in the 2019 'track' column have been aggregated into 9 'main_tracks'. 
This notebook attempts to map the 2017 tracks onto the 2019 tracks so that both years share one set of labels.   
- we will read in the track labels from the 2019 data
- we will split the track column on the " -- " delimiter to extract track information for 2017
- we will then compare the two by grouping the data
- we will reconcile tracks with minor labeling differences between 2017 and 2019
- labels that can't be reconciled will be grouped under a 'Not Found/NF' main_track label for now
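The steps above can be sketched on toy data. This is a minimal illustration, assuming the 2019 labels have the form "main track -- sub-track"; the helper name `build_lookup` and the sample labels are hypothetical and not taken from the dataset.

```python
def build_lookup(labels_2019, sep=" -- "):
    """Map every sub-track, and every main track, to its main track."""
    lookup = {}
    for label in labels_2019:
        main, found, sub = label.partition(sep)
        lookup[main] = main          # a main track maps to itself
        if found:                    # only add a sub-track when the delimiter is present
            lookup[sub] = main
    return lookup

# toy 2019 labels (illustrative only)
labels_2019 = ["Algorithms -- Clustering", "Theory -- Learning Theory"]
lookup = build_lookup(labels_2019)

# toy 2017 tracks; anything missing from the lookup is labelled 'NF'
tracks_2017 = ["Clustering", "Learning Theory", "Motor Control"]
main_tracks = [lookup.get(t, "NF") for t in tracks_2017]
print(main_tracks)
```

Using `dict.get` with a default keeps the 'NF' fallback in one place, which is the same idea the notebook applies to the real 2017 column further down.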

In [1]:
import pandas as pd
import re

In [4]:
#read in NIPS data
nips = pd.read_csv("../data/nips.csv")

#read in 2019 data with tracks information
nips19 = pd.read_csv("../data/nips_with_track_cleaned.csv")
nips19 = nips19[nips19['year'] == 2019]

In [6]:
#create a dict to map track to main_track(canonical label)

tracks19 = nips19.track_original.unique().tolist()

mt19 = nips19.main_track.unique().tolist()

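doc = {}

#2019 labels look like "main_track -- track": key on the track, store its main_track
for t in tracks19:
    parts = t.split(" -- ")
    doc[parts[1]] = parts[0]

#main tracks map to themselves
for t in mt19:
    doc[t] = t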
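Because a plain dict silently overwrites earlier entries, a track name that appears under two different main tracks would keep only the last one. The sketch below is one possible sanity check, not part of the original notebook; `find_conflicts` is a hypothetical helper and the labels are toy values.

```python
from collections import defaultdict

def find_conflicts(labels, sep=" -- "):
    """Return sub-tracks that are listed under more than one main track."""
    seen = defaultdict(set)
    for label in labels:
        parts = label.split(sep)
        if len(parts) < 2:
            continue                      # no delimiter, nothing to check
        seen[parts[1]].add(parts[0])
    return {sub: sorted(mains) for sub, mains in seen.items() if len(mains) > 1}

# toy labels (illustrative only): 'Clustering' appears under two main tracks
toy = [
    "Algorithms -- Clustering",
    "Applications -- Clustering",
    "Theory -- Learning Theory",
]
conflicts = find_conflicts(toy)
print(conflicts)
```

Running something like `find_conflicts(tracks19)` before building `doc` would show whether any 2019 track is ambiguous; an empty result means the mapping is one-to-one per track.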

In [7]:
#subset 2017 data

nips17 = nips[nips['year'] == 2017].copy()

In [8]:
#create a column to record the original track info

nips17['track_original'] = nips17['track']

In [9]:
#clean specific tracks column to match 2019 tracks

renames = {
    'Data, Competitions, Implementations, and Software': "Data, Challenges, Implementations, and Software",
    'Dialog- and/or Communication-Based Learning': "Dialog- or Communication-Based Learning",
    'Neuroscience and cognitive science': "Neuroscience and Cognitive Science",
    'Video, Motion and Tracking': "Tracking and Motion in Video",
}
nips17['track'] = nips17['track'].replace(renames)

In [10]:
#pass the track column through the dict to create the main_track column

#look up each track in the dict; tracks with no match are labelled 'NF' (Not Found)
nips17['main_track'] = nips17['track'].map(doc).fillna('NF')

In [11]:
nips17.groupby("main_track").size()

main_track
Algorithms                                         512
Applications                                       252
Data, Challenges, Implementations, and Software      6
Deep Learning                                      349
NF                                                  53
Neuroscience and Cognitive Science                  88
Optimization                                       155
Probabilistic Methods                              197
Reinforcement Learning and Planning                116
Theory                                             218
dtype: int64
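To put the 'NF' group in context, the counts printed above can be turned into a share of all 2017 rows. This is a small arithmetic sketch that copies the numbers from the output; the variable names are illustrative.

```python
# counts copied from the groupby output above
counts = {
    "Algorithms": 512,
    "Applications": 252,
    "Data, Challenges, Implementations, and Software": 6,
    "Deep Learning": 349,
    "NF": 53,
    "Neuroscience and Cognitive Science": 88,
    "Optimization": 155,
    "Probabilistic Methods": 197,
    "Reinforcement Learning and Planning": 116,
    "Theory": 218,
}

total = sum(counts.values())          # all 2017 rows
nf_share = counts["NF"] / total       # fraction that could not be matched
print(f"{counts['NF']} of {total} rows are NF ({nf_share:.1%})")
```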

In [12]:
#list of tracks that fall under 'Not Found/NF' main_track column with individual notes

nips17[nips17['main_track'] == 'NF'].track.unique().tolist()

['Auditory Perception and Modeling',
 'Bayesian Theory',
 'Competitions or Challenges',
 'Competitive Analysis',
 'Hyperparameter Selection',
 'Large Margin Methods',
 'Learning to Learn',
 'Motor Control',
 'Music Modeling and Analysis',
 'Natural Scene Statistics',
 'Neural Abstract Machines',
 'None of the above',
 'One-Shot/Low-Shot Learning Approaches',
 'Program Induction',
 'Source Separation',
 'Speech Recognition',
 'Spike Train Generation',
 'Systems Biology',
 'Text Analysis',
 'Visual Features',
 'Visualization/Expository Techniques for Deep Networks']

Notes:

1. 'Bayesian Theory' --> 'Bayesian Nonparametrics' in tracks_2019?
2. 'Auditory Perception and Modeling' --> 'Audio and Speech Processing' in tracks_2019?
3. 'Competitions or Challenges' --> no similar track
4. 'Competitive Analysis' --> no similar track
5. 'Hyperparameter Selection' --> maybe 'Model Selection and Structure Learning' in tracks_2019?
6. 'Large Margin Methods' --> 'Large Scale Learning' and 'Large Deviations and Asymptotic Analysis' in tracks_2019
7. 'Learning to Learn' --> many options
8. 'Motor Control' --> no similar track
9. 'Music Modeling and Analysis' --> no similar track
10. 'Natural Scene Statistics' --> no similar track
11. 'Neural Abstract Machines' --> no similar track
12. 'None of the above' --> ??
13. 'One-Shot/Low-Shot Learning Approaches' --> 'Few-Shot Learning' in tracks_2019?
14. 'Program Induction' --> 'Program Understanding and Generation' in tracks_2019?
15. 'Source Separation' --> no similar track
16. 'Speech Recognition' --> 'Audio and Speech Processing' in tracks_2019?
17. 'Spike Train Generation' --> no similar track
18. 'Systems Biology' --> no similar track
19. 'Text Analysis' --> 'Natural Language Processing' in tracks_2019?
20. 'Visual Features' --> many options: 'Tracking and Motion in Video', 'Visualization or Exposition Techniques for Deep Networks', 'Visual Scene Analysis and Interpretation', 'Visual Perception' in tracks_2019
21. 'Visualization/Expository Techniques for Deep Networks' --> same as above
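The manual comparison in the notes could be given a head start with fuzzy string matching. The sketch below uses the standard library's `difflib.get_close_matches`; it is only a suggestion aid under the assumption that similar spellings indicate similar tracks, and the candidate list contains just the 2019 track names mentioned in the notes, not the full 2019 list.

```python
import difflib

# 2019 track names mentioned in the notes (a subset, not the full 2019 list)
candidates_2019 = [
    "Audio and Speech Processing",
    "Bayesian Nonparametrics",
    "Few-Shot Learning",
    "Large Deviations and Asymptotic Analysis",
    "Large Scale Learning",
    "Model Selection and Structure Learning",
    "Natural Language Processing",
    "Program Understanding and Generation",
    "Tracking and Motion in Video",
    "Visual Perception",
    "Visual Scene Analysis and Interpretation",
    "Visualization or Exposition Techniques for Deep Networks",
]

# a few of the unmatched 2017 labels
unmatched_2017 = [
    "Visualization/Expository Techniques for Deep Networks",
    "Speech Recognition",
    "Motor Control",
]

# up to three suggestions per label; an empty list means nothing scored above the cutoff
suggestions = {
    label: difflib.get_close_matches(label, candidates_2019, n=3, cutoff=0.6)
    for label in unmatched_2017
}

for label, matches in suggestions.items():
    print(label, "-->", matches)
```

String similarity only catches relabelings that keep most of the wording. Conceptual matches such as 'Text Analysis' and 'Natural Language Processing' still need the manual review recorded in the notes.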

In [13]:
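#append the 2017 rows to the cleaned file; no header is written, so the column order must match the file's
#note: re-running this cell appends the same rows again
nips17.to_csv("../data/nips_with_track_cleaned.csv", mode='a', index=False, header=False)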
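Appending without a header only works if the new rows have the same columns, in the same order, as the existing file. The sketch below shows one way to enforce that before writing; `align_to_columns` is a hypothetical helper, and the column names in the toy frame are assumptions rather than the real schema.

```python
import pandas as pd

def align_to_columns(df, target_columns):
    """Reorder df to match target_columns, failing loudly if any are missing."""
    missing = [c for c in target_columns if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df.reindex(columns=list(target_columns))

# assumed header of the existing file (illustrative only)
existing_header = ["title", "year", "track", "track_original", "main_track"]

# toy rows whose columns are in a different order
new_rows = pd.DataFrame({
    "year": [2017],
    "title": ["A paper"],
    "main_track": ["NF"],
    "track": ["Motor Control"],
    "track_original": ["Motor Control"],
})

aligned = align_to_columns(new_rows, existing_header)
print(list(aligned.columns))
```

In the notebook, the header could be read from the existing file with `pd.read_csv(path, nrows=0).columns` and passed as `target_columns` before calling `to_csv` with `mode='a'`.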

In [14]:
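#keep one row per distinct (track, track_original, main_track, year) combination
document = nips17[['track', 'track_original', 'main_track', 'year']]

document = document.drop_duplicates()

#append the 2017 track mapping to the year-wise track info file
document.to_csv("../data/nips_yearwise_trackinfo.csv", mode='a', index=False, header=False)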