# Notes/Ideas

### Current task:

Match past grants to faculty profiles by merging dataframes

- This gives us a training set for classifying grants into expertise fields

Define a list of expertise fields to use for faculty and for grants


### Need data!

Data streams (for a live system):
- Grants
    - grants.gov
    - Pivot has more grants but has a no-data-mining policy (a problem if this system is used by Berkeley)
- Faculty Profiles
    - vcresearch.berkeley.edu
    - internal source of data that is not public (past research/grants given)

### Data Representation

Faculty Profile creation (goal: try to come up with a list of categories/hidden traits about grants/research that the faculty LIKE and DISLIKE):
- Field they are in/Department/Specialities
    - Do people in the field generally like certain things?
        - Government as their sponsor, wanting lots of grant money, cross-field research, etc.
- Past research
    - Analyze content of research news/name of research/categories of research
    - Do they do cross-field research?
    - Cost of grants of past research
    - Change of research concentration over time (e.g. first bio, then biotech, then tech)
    - Who they worked with, and those collaborators' fields of research
- Years as a faculty member/researcher (might make them more set on a certain field, like Abbeel)

Grants classification (goal: beyond a list of categories, find hidden traits):
- Category of funding
- Grant money
- Who is sponsoring (government, non-profit, etc.); can be inferred from the sponsor name and possibly the contact email address (.gov)
- Eligibility check
- Get sentiment/hidden categories from opportunity title/description


### Model

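K-NN for matching grants to the nearest faculty profiles (note: k-NN is a nearest-neighbour method, not a clustering algorithm; k-means would be the clustering counterpart)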

The feedback system just updates the faculty profile (the profile should become more resistant to change over time)
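
One way to read the two notes above, as a rough sketch and not a settled design: represent each faculty profile and each grant as a vector over the same expertise fields, rank grants by cosine similarity (nearest neighbours), and apply feedback as a small step toward or away from the rated grant so that a single rating cannot swing the profile. The field list, the toy vectors, and the `alpha` step size are all made up for illustration.

```python
import numpy as np

# Hypothetical expertise fields; the real list is still to be defined.
FIELDS = ["robotics", "machine learning", "public health"]

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_grants(profile, grant_vectors, k=2):
    """Return the indices of the k grants most similar to the profile."""
    sims = [cosine_similarity(profile, g) for g in grant_vectors]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

def update_profile(profile, grant, liked, alpha=0.1):
    """Move the profile slightly toward a liked grant, or away from a disliked one."""
    direction = 1.0 if liked else -1.0
    return np.clip(profile + direction * alpha * (grant - profile), 0.0, 1.0)

profile = np.array([1.0, 1.0, 0.0])      # faculty member in robotics + ML
grants = np.array([
    [1.0, 0.0, 0.0],                     # robotics grant
    [0.0, 0.0, 1.0],                     # public health grant
    [1.0, 1.0, 0.0],                     # robotics + ML grant
])

top = rank_grants(profile, grants, k=2)
new_profile = update_profile(profile, grants[1], liked=True)
```

A smaller `alpha` makes the profile harder to move, which is one way to get the "resistant to change" behaviour; shrinking `alpha` as feedback accumulates would be another.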


### Manual Search

Come up with clear keywords/categories for grants

In [83]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
import string

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, MultiLabelBinarizer

from sklearn.naive_bayes import BernoulliNB

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adagrad


plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('fivethirtyeight')
plt.style.use('ggplot')

# Analysis of Faculty Profiles data

In [2]:
df_faculty = pd.read_csv('scraped_data/faculty_profiles.csv', sep='~')

In [3]:
df_faculty.head()

Unnamed: 0,faculty_name,faculty_profile_url,l_expertise,department,title_name,faculty_site_url,lab_url,faculty_email,description,description_links,...,link_to_news_3,description_teaser_3,article_date_4,title_of_news_4,link_to_news_4,description_teaser_4,article_date_5,title_of_news_5,link_to_news_5,description_teaser_5
0,David A. Aaker,/faculty/david-aaker,"business,marketing,branding",Haas School of Business,Professor of Marketing and Public Policy,http://www.haas.berkeley.edu/faculty/aaker.html,http://groups.haas.berkeley.edu/marketing/,aaker@haas.berkeley.edu,,,...,,,,,,,,,,
1,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,/news/new-deep-learning-technique-enables-robo...,UC Berkeley researchers have developed algor...,"December 17, 2012",Big NSF grant funds research into training ro...,/news/big-nsf-grant-funds-research-training-ro...,"What if robots and humans, working together,...","August 23, 2011",UC Berkeley robotics expert named among world...,/news/uc-berkeley-robotics-expert-named-among-...,"Pieter Abbeel, a UC Berkeley, professor know..."
2,Elizabeth Abel,/faculty/elizabeth-abel,"feminist theory,psychoanalysis,Virginia Woolf,...",Department of English,Professor of English,http://english.berkeley.edu/profiles/5,,eabel@uclink.berkeley.edu,Elizabeth Abel's general research interest is...,,...,,,,,,,,,,
3,Dor Abrahamson,/faculty/dor-abrahamson,"mathematical cognition,design-based research,m...",Graduate School of Education,Associate Professor of Cognition and Development,http://gse.berkeley.edu/people/dor-abrahamson,http://edrl.berkeley.edu/,dor@berkeley.edu,Dor Abrahamson studies the process of mathema...,,...,,,,,,,,,,
4,Norman Abrahamson,/faculty/norman-abrahamson,"civil and environmental engineering,earthquake...",Department of Civil and Environmental Engineering,Adjunct Professor of Civil and Environmental E...,http://www.ce.berkeley.edu/faculty/faculty.php...,,naa3@earthlink.net,,,...,,,,,,,,,,


In [50]:
df_faculty.columns.values

array(['faculty_name', 'faculty_profile_url', 'l_expertise', 'department',
       'title_name', 'faculty_site_url', 'lab_url', 'faculty_email',
       'description', 'description_links', 'article_date_1',
       'title_of_news_1', 'link_to_news_1', 'description_teaser_1',
       'article_date_2', 'title_of_news_2', 'link_to_news_2',
       'description_teaser_2', 'article_date_3', 'title_of_news_3',
       'link_to_news_3', 'description_teaser_3', 'article_date_4',
       'title_of_news_4', 'link_to_news_4', 'description_teaser_4',
       'article_date_5', 'title_of_news_5', 'link_to_news_5',
       'description_teaser_5'], dtype=object)

In [58]:
def split_first_last_name(s):
    """
    Take the two longest 'words' in the name and treat them,
    in order of appearance, as the first and last name.
    """
    l = s.lower().split(' ')
    lengths = [len(w) for w in l]
    max_index = max(range(len(l)), key=lambda i:lengths[i])
    max_word = l[max_index]
    lengths[max_index] = -1
    
    max_index2 = max(range(len(l)), key=lambda i:lengths[i])
    max_word2 = l[max_index2]
    
    if max_index < max_index2:
        return max_word, max_word2
    else:
        return max_word2, max_word
    
    

first_last_names = df_faculty['faculty_name'].apply(split_first_last_name)
first_names = [t[0] for t in first_last_names]
last_names = [t[1] for t in first_last_names]
df_faculty['first_name'] = first_names
df_faculty['last_name'] = last_names
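
The two-longest-words heuristic is worth sanity-checking before relying on it for the merge. The sketch below is a compact restatement of the same idea (not the cell's exact code; it assumes names with at least two tokens) and shows one case where it works and one where it silently drops the surname. The second name is a made-up example.

```python
def two_longest_tokens(name):
    """Keep the two longest tokens of a name, in their original order."""
    tokens = name.lower().split()
    # Indices of the two longest tokens; ties keep the earlier token first.
    longest = sorted(range(len(tokens)), key=lambda i: len(tokens[i]), reverse=True)[:2]
    first, second = sorted(longest)
    return tokens[first], tokens[second]

ok = two_longest_tokens("David A. Aaker")       # middle initial is ignored
bad = two_longest_tokens("Mary Elizabeth Lee")  # hypothetical name: short surname is lost
```

Short surnames such as 'Li' or 'Sun' are the likely failure cases, so they are good candidates for a manual corrections table.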

## Read in grant history

In [4]:
df_grants = pd.read_csv("scraped_data/research_grant_history_cleaned.csv")

In [5]:
df_grants.columns.values

array(['Activity Type', 'Amount', 'Sponsor Class', 'Sponsor', 'Division',
       'Department', 'Fund', 'UCB Award Number', 'PI Name',
       'Project Begin Date', 'Project End Date', 'Title'], dtype=object)

In [10]:
for r in ["'{}':{}".format(s, s.split('  ')[::-1]) for s in ['Hongbin  Sun',
 'Sarah  Song',
 'Richard H  Bamler',
 'Kris  Gutierrez',
 'Chen  Li',
 'Liana  Lareau',
 'Larry  Conrad']]:
    print(r)

'Hongbin  Sun':['Sun', 'Hongbin']
'Sarah  Song':['Song', 'Sarah']
'Richard H  Bamler':['Bamler', 'Richard H']
'Kris  Gutierrez':['Gutierrez', 'Kris']
'Chen  Li':['Li', 'Chen']
'Liana  Lareau':['Lareau', 'Liana']
'Larry  Conrad':['Conrad', 'Larry']


In [40]:
# Manual overrides for PI names that the splitting heuristic gets wrong
first_last_name_corrections = {
    'Brown Cheryl M': ['Brown', 'Cheryl'],
    'Alan  Hammond': ['Hammond', 'Alan'],
    'Hongbin  Sun': ['Sun', 'Hongbin'],
    'Sarah  Song': ['Song', 'Sarah'],
    'Richard H  Bamler': ['Bamler', 'Richard'],
    'Kris  Gutierrez': ['Gutierrez', 'Kris'],
    'Chen  Li': ['Li', 'Chen'],
    'Liana  Lareau': ['Lareau', 'Liana'],
    'Larry  Conrad': ['Conrad', 'Larry'],
}

def split_first_last_name(s):
    """
    Strip commas, then take the two longest 'words' in order of appearance.
    PI names are stored as 'Last, First', so the result is (last, first).
    """
    l = s.replace(',', '').lower().split(' ')
    lengths = [len(w) for w in l]
    max_index = max(range(len(l)), key=lambda i:lengths[i])
    max_word = l[max_index]
    lengths[max_index] = -1
    
    max_index2 = max(range(len(l)), key=lambda i:lengths[i])
    max_word2 = l[max_index2]
    
    if max_index < max_index2:
        return max_word, max_word2
    else:
        return max_word2, max_word



first_last_names = df_grants['PI Name'].apply(split_first_last_name)
first_names = [t[1] for t in first_last_names]
last_names = [t[0] for t in first_last_names]

In [41]:
df_grants['first_name'] = first_names
df_grants['last_name'] = last_names

In [42]:
df_grants.head()

Unnamed: 0,Activity Type,Amount,Sponsor Class,Sponsor,Division,Department,Fund,UCB Award Number,PI Name,Project Begin Date,Project End Date,Title,first_name,last_name
0,Applied research,"$179,032",State of California,California Department of Health Care Services,School of Public Health,,15952.0,021331-002,"Colford Jr, John M",7/1/2006,9/30/2006,DNS AIDS Training,john,colford
1,Basic research,"$154,578",State of California,California Department of Social Services,School of Social Welfare,,15959.0,021362-002,"Needell, Barbara",7/1/2006,9/30/2006,Performance Indicators/California Children's S...,barbara,needell
2,Instruction,"$225,000",State of California,California Department of Social Services,School of Social Welfare,Social Welfare,15960.0,021363-002,"Midgley, James",7/1/2006,9/30/2006,Title IV-E Social Work Training Program,james,midgley
3,Instruction,"$47,138",State of California,California Department of Social Services,School of Social Welfare,Social Welfare,15960.0,021363-002,"Midgley, James",7/1/2006,9/30/2006,Title IV-E Social Work Training Program,james,midgley
4,Basic research,"$65,000",Federal,NIH National Institutes of Health - Miscellaneous,VC Res Other Research Units,The California Institute for Quantitative Bios...,78561.0,021425-002,"Keasling, Jay",7/1/2006,12/31/2006,Model-Driven Strain Engineering for Isoprenoid...,jay,keasling


## Merge the dataframes together

In [64]:
merged_df = df_faculty.merge(df_grants, on=['last_name', 'first_name'])

In [65]:
len(merged_df)

33837
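
Before trusting the row count above, it may help to see how many rows on each side actually found a partner. The sketch below uses toy stand-ins for `df_faculty` and `df_grants` (all values are made up) and pandas' `indicator=True` option, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. Dropping exact duplicate grant rows first is an assumption about what is wanted, not something the data confirms.

```python
import pandas as pd

# Toy stand-ins for df_faculty and df_grants (values are made up).
faculty = pd.DataFrame({
    "first_name": ["pieter", "elizabeth"],
    "last_name": ["abbeel", "abel"],
    "l_expertise": ["robotics,machine learning", "feminist theory"],
})
grants = pd.DataFrame({
    "first_name": ["pieter", "pieter", "jay"],
    "last_name": ["abbeel", "abbeel", "keasling"],
    "Title": ["Grant A", "Grant A", "Grant B"],
})

# Drop exact duplicate grant rows so they are not counted twice after the merge.
grants = grants.drop_duplicates()

# An outer merge keeps unmatched rows from both sides; `_merge` says where each came from.
check = faculty.merge(grants, on=["last_name", "first_name"],
                      how="outer", indicator=True)
counts = check["_merge"].value_counts()
matched = check[check["_merge"] == "both"]
```

Rows marked `left_only` are faculty with no grant history found, and `right_only` rows are grants whose PI name did not match any profile, which is where name-splitting mistakes would show up.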

In [66]:
merged_df

Unnamed: 0,faculty_name,faculty_profile_url,l_expertise,department,title_name,faculty_site_url,lab_url,faculty_email,description,description_links,...,Sponsor Class,Sponsor,Division,Department,Fund,UCB Award Number,PI Name,Project Begin Date,Project End Date,Title
0,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,NSF National Science Foundation,College of Engineering,ERSO Engineering Research Support Organization,29798.0,027673-002,"Abbeel, Pieter",9/1/2009,8/31/2012,CPS: Medium: Learning for Control of Synthetic...
1,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,DOD ONR Office of Naval Research,College of Engineering,ERSO Engineering Research Support Organization,94789.0,028827-002,"Abbeel, Pieter",1/1/2010,8/31/2010,Autonomous Control for a Computer Simulation o...
2,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Not for Profit,"Willow Garage, Inc.",College of Engineering,ERSO Engineering Research Support Organization,,029662-002,"Abbeel, Pieter",5/5/2010,5/4/2011,The UC Berkeley Personal Robotics Project
3,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,DOD ONR Office of Naval Research,College of Engineering,ERSO Engineering Research Support Organization,94789.0,028827-002,"Abbeel, Pieter",1/1/2010,6/30/2011,Autonomous Control for a Computer Simulation o...
4,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,NSF National Science Foundation,College of Engineering,ERSO Engineering Research Support Organization,29798.0,027673-002,"Abbeel, Pieter",9/1/2009,8/31/2012,CPS: Medium: Learning for Control of Synthetic...
5,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,NSF National Science Foundation,College of Engineering,ERSO Engineering Research Support Organization,30903.0,031849-002,"Abbeel, Pieter",9/1/2011,8/31/2014,RI: Small: Large Scale Machine Learning for Co...
6,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Not for Profit,Alfred P. Sloan Foundation,College of Engineering,ERSO Engineering Research Support Organization,95387.0,031219-002,"Abbeel, Pieter",9/15/2011,9/15/2013,Alfred P. Sloan Research Fellowship for Dr. Pi...
7,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,NSF National Science Foundation,College of Engineering,ERSO Engineering Research Support Organization,29798.0,027673-002,"Abbeel, Pieter",9/1/2009,8/31/2013,CPS: Medium: Learning for Control of Synthetic...
8,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,DOD ONR Office of Naval Research,College of Engineering,ERSO Engineering Research Support Organization,31362.0,033142-002,"Abbeel, Pieter",6/1/2012,5/31/2015,Large-Scale Machine Learning for High-Accuracy...
9,Pieter Abbeel,/faculty/pieter-abbeel,"robotics,machine learning",Division of Computer Science/EECS,Professor,http://www.cs.berkeley.edu/~pabbeel,,pabbeel@cs.berkeley.edu,Robotics and Machine Learning.,,...,Federal,DAF AFOSR Air Force Office of Scientific Research,College of Engineering,ERSO Engineering Research Support Organization,31422.0,033376-002,"Abbeel, Pieter",7/1/2012,6/30/2015,Apprenticeship Learning for Robotic Control


In [67]:
df = merged_df

## Problem:

Multi-label classification problem

http://stats.stackexchange.com/questions/21851/predicting-multiple-targets-or-classes

http://scikit-learn.org/stable/modules/multiclass.html

## Was wondering how predictive the title is of the funding category

So I threw together a model that:

- Takes in the title of the grant

- Outputs a prediction of the category/label of the grant.

For instance as below, it'll take in this title and try to predict the category:

- 'Cooperative Ecosystem Studies Unit, Piedmont South Atlantic Coast CESU'

- 'ST' which corresponds to 'Science and Technology and other Research and Development'

I tried Naive Bayes first, then neural nets for a better result. Overall the model is weak ($78\%$ test accuracy) but can likely be improved by:

- Tuning the neural network

- Building real features (beyond whether a word appears in the title)
    - Position of words in the title, parts of speech used (noun/verb), combinations of words used together
    - Using word2vec to map words into a vector space that captures semantic similarity (I'm not too familiar with this and need to read more)
    
- Feature selection, because the gap between $97\%$ training accuracy and $78\%$ test accuracy suggests overfitting.
    
- Incorporating other variables besides just the title such as description, grant floor and ceiling


This isn't directly related to the main recommendation problem, but it could be used to label grants that are unlabeled, which would help fill in missing data. Some grants may also carry one label while sitting on the border of another, which could point them towards a different range of faculty members.

In [70]:
df['Title'][1]

'Autonomous Control for a Computer Simulation of the Mono Tiltrotor'

In [72]:
df['l_expertise'][1]

'robotics,machine learning'

In [102]:
bool_filter = ~df['l_expertise'].isnull()
df = df[bool_filter].copy()  # copy so the assignment below does not write to a view of merged_df
df['l_expertise'] = df['l_expertise'].str.lower()

In [103]:
word_bag = [w.lower() for l in df['l_expertise'].apply(lambda s:s.split(',')) for w in l]

In [104]:
freq_expertise = nltk.FreqDist(word_bag)

In [105]:
freq_expertise.most_common(2000)[-10:]

[('birdsong', 27),
 ('lipid mediators', 27),
 ('biomimicry', 27),
 ('speech perception', 27),
 ('formation of structure in the universe', 27),
 ('leukocytes', 27),
 ('axonal pathfinding', 27),
 ('theoretical neuroscience', 27),
 ('soft x-rays', 27),
 ('fish oils', 27)]

In [74]:
len(df)

33837

In [75]:
train_set_size = 25000

In [76]:
remove_words = set(stopwords.words('english') + list(string.punctuation))

word_mass = [w for w in nltk.word_tokenize(' '.join(df['Title'].values[:train_set_size]).lower()) 
             if w not in remove_words]
title_word_freq = nltk.FreqDist(word_mass)

word_features = [t[0] for t in title_word_freq.most_common(5000)]

In [77]:
word_features[:10]

['research',
 'services',
 'agreement',
 'program',
 'center',
 'supplies',
 'transfer',
 'material',
 'california',
 'collaborative']

In [79]:
def generate_features(title):
    title_tokens = set([w for w in nltk.word_tokenize(title.lower()) if w not in remove_words])
    features = [(word in title_tokens) * 1 for word in word_features]
    return features
    
#only take those with 1 label
#bool_filter = df['CategoryOfFundingActivity'].apply(lambda s:',' not in s)

rows = df['Title'].apply(generate_features)
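
For comparison, scikit-learn's `CountVectorizer` can build a similar presence/absence matrix in one step. This is a sketch on two titles taken from the tables above, not a drop-in replacement: its tokenizer and built-in English stop-word list differ from the NLTK ones used in this notebook, so the resulting vocabulary will not be identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "Apprenticeship Learning for Robotic Control",
    "Title IV-E Social Work Training Program",
]

# binary=True records presence/absence instead of counts, like generate_features.
# max_features keeps only the most frequent terms across the corpus.
vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=5000)
X = vectorizer.fit_transform(titles)   # sparse matrix, one row per title
vocab = sorted(vectorizer.vocabulary_) # learned terms, alphabetical
```

The sparse output also avoids materialising a dense 5000-column DataFrame; `X.toarray()` converts it if a dense array is needed.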

In [80]:
model_df = pd.DataFrame(list(rows), columns=word_features)
target_df = df['l_expertise'].apply(lambda l:l.split(','))

train_X = model_df[:train_set_size]
train_Y = target_df[:train_set_size]
test_X = model_df[train_set_size:]
test_Y = target_df[train_set_size:]

In [81]:
len(train_X), len(test_X)

(25000, 8686)

## Neural Nets

In [85]:
# label_encoder = LabelEncoder()
# label_encoder.fit(train_Y)
# labeled_train_Y = label_encoder.transform(train_Y)
# labeled_test_Y = label_encoder.transform(test_Y)
# # one_hot = OneHotEncoder()
# # one_hot.fit(labeled_train_Y)
# # one_hot_train_Y = one_hot.transform(labeled_train_Y)
# # one_hot_test_Y = one_hot.transform(labeled_test_Y)
# one_hot_train_Y = pd.get_dummies(labeled_train_Y)
# one_hot_test_Y = pd.get_dummies(labeled_test_Y)

lb = MultiLabelBinarizer()
lb.fit(np.concatenate([train_Y, test_Y]))

one_hot_train_Y = lb.transform(train_Y)
one_hot_test_Y = lb.transform(test_Y)
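
As a quick illustration of what `MultiLabelBinarizer` produces, here is a toy example with three label lists (the labels are borrowed from the expertise values seen earlier; the rows themselves are made up). Each distinct label becomes one column, and a row has a 1 in every column whose label it carries.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [
    ["robotics", "machine learning"],
    ["machine learning"],
    ["birdsong"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Columns follow the sorted label order stored in mlb.classes_.
classes = list(mlb.classes_)
# inverse_transform maps a 0/1 indicator matrix back to tuples of labels.
recovered = mlb.inverse_transform(Y)
```

With 4713 distinct expertise strings, almost every column is 0 for any given row, which is part of why this target is hard to fit.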

In [87]:
one_hot_train_Y.shape

(25000, 4713)

In [90]:
np.random.seed(27)

nn = Sequential([
    Dense(5000, input_dim=len(train_X.columns.values), 
          init='normal', activation='relu'),
    Dense(5000, init='normal', activation='relu'),
    Dense(one_hot_train_Y.shape[1], init='normal')
])

nn.compile(optimizer='rmsprop',
              loss='mse')


nn.fit(train_X.values, one_hot_train_Y, 
       nb_epoch=4, batch_size=100, verbose=True)

4 epochs

In [89]:
train_preds = nn.predict(train_X.values)
print('Training Accuracy: {}'.format(sum(lb.inverse_transform(train_preds) == train_Y)/len(train_Y)))

test_preds = nn.predict(test_X.values)
print('Testing Accuracy: {}'.format(sum(lb.inverse_transform(test_preds) == test_Y)/len(test_Y)))

Training Accuracy: 0.96348
Testing Accuracy: 0.7802842318971351


5 epochs

In [82]:
train_preds = nn.predict(train_X.values)
print('Training Accuracy: {}'.format(sum(lb.inverse_transform(train_preds) == train_Y)/len(train_Y)))

test_preds = nn.predict(test_X.values)
print('Testing Accuracy: {}'.format(sum(lb.inverse_transform(test_preds) == test_Y)/len(test_Y)))

Training Accuracy: 0.97116
Testing Accuracy: 0.78287841191067
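
With a multi-label target the network outputs a real-valued score per label, and as far as I know `MultiLabelBinarizer.inverse_transform` expects a 0/1 indicator matrix, so the scores need to be thresholded before they can be compared with the true labels. The sketch below shows one way to do that and to compute two common multi-label metrics with plain NumPy. The 0.5 threshold and the toy arrays are assumptions for illustration, not values from this notebook.

```python
import numpy as np

def binarize(scores, threshold=0.5):
    """Turn real-valued scores into a 0/1 indicator matrix."""
    return (np.asarray(scores) >= threshold).astype(int)

def exact_match_accuracy(y_true, y_pred):
    """Fraction of rows where every label is predicted correctly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def micro_f1(y_true, y_pred):
    """F1 computed over all (row, label) decisions pooled together."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy data: 3 grants, 3 labels.
y_true = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 1]])
scores = np.array([[0.9, 0.7, 0.1], [0.2, 0.8, 0.6], [0.1, 0.3, 0.4]])

y_pred = binarize(scores)
acc = exact_match_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
```

Exact-match accuracy is strict, since one wrong label fails the whole row; micro-F1 gives partial credit and is usually more informative when there are thousands of sparse labels.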


In [None]:
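def split_first_last_name(s):
    """
    Split a 'Last, First [Middle]' PI name into (last, first).
    Names listed in first_last_name_corrections are looked up directly.
    """
    if s in first_last_name_corrections:
        return first_last_name_corrections[s]
    
    # Assumes exactly one ', ' separator; other formats belong in the corrections dict
    l, f = s.split(', ')
    if ' ' in l:
        # Drop suffixes such as 'Jr' ('Colford Jr' -> 'Colford')
        l = l.split(' ')[0]
    if ' ' in f:
        # Keep the longer of the first two given-name tokens ('John M' -> 'John')
        l_c = f.split(' ')
        if len(l_c[0]) < len(l_c[1]):
            return l, l_c[1]
        else:
            return l, l_c[0]
    return l, f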