Berkeley Institute for Data Science: Data Science for Social Good Spring 2017 Project
Our team has developed a webapp for the Berkeley Research Development Office that automatically and effectively matches research grants with Berkeley researchers and faculty.
Each research grant comes with a description that varies in length and specificity from grant to grant.
Here's the type of research grant data we're dealing with:
Here are the first two sentences of the description of Innovations in Biological Imaging and Visualization:
The IBIV activity supports the development of novel approaches to the analysis of biological research images through the innovative "Ideas Lab" project development and review process. The analysis and visual representation of complex biological images present daunting challenges across all scales of investigation, from multispectral analysis of foliage or algal bloom patterns in satellite images, to automated specimen classification, and tomographic reconstructions in structural biology.
Each faculty member has a bag of words we've scraped from VCResearch or from their personal webpage. Some faculty members write little in their descriptions or don't keep their webpages up to date.
Here's John DeNero's bag of words from his personal webpage:
['teaching', 'artificial', 'intelligence', 'education', 'educ', 'centers', 'artificial', 'intelligence', 'bair', 'teaching', 'schedule', 'foundations', 'data', 'science', 'mowefr', 'pimentel', 'completion', 'computer', 'science', 'decal', 'anova', 'teaching', 'computer', 'science', 'youth', 'soda', 'foundations', 'data', 'science', 'mowefr', 'completion', 'computer', 'science', 'eecs', 'natural', 'language', 'processing', 'tasks', 'related', 'statistical', 'machine', 'translation', 'cross-lingual', 'alignment', 'translation', 'model', 'estimation', 'translation', 'inference', 'lexicon', 'acquisition', 'unsupervised', 'grammar', 'induction', 'prior', 'spent', 'four', 'scientist', 'google', 'primarily', 'google', 'translate', 'serves', 'billion', 'translation', 'requests', 'refereed', 'naacl', 'acl', 'emnlp', 'conferences', 'author', 'composing', 'textbook', 'programming', 'computer', 'science', 'masters', 'philosophy', 'eecs']
Each faculty member also has a list of research grant titles that they were awarded (if any).
We used two methods to match research grants to faculty and vice versa.
The first method uses TF-IDF (Term Frequency–Inverse Document Frequency) together with past research grant awards. We created two vector spaces, one containing faculty website words and the other containing grants, each vectorized with TF-IDF over its respective corpus. Our rationale for two separate vector spaces is that grants tend to use a different vocabulary than faculty data (such as website text). The vocabulary and IDF weights are learned from our entire grants database, including grants from grants.gov and the NSF. We also track each professor's past grant history, which we use to find matches. To match a new grant to faculty, we vectorize the grant in the grant space, find its k nearest neighbors among the grants faculty have been awarded in the past, and recommend the new grant to the corresponding faculty. The advantage of this method is that it recommends grants similar to those a faculty member has taken on before, which may better represent their grant interests, and the TF-IDF vectors model those similarities well. Open questions and possible improvements include how to incorporate department data and how to handle the cold-start problem (faculty with little or no grant history).
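The steps above can be sketched with scikit-learn's `TfidfVectorizer` and `NearestNeighbors`. The grant texts and awardee names below are made-up stand-ins for our grants database and award history:

```python
# Sketch of TF-IDF + past-awards matching. Grant texts and the awardee
# list are hypothetical examples, not real data from our database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

past_grants = [
    "statistical machine translation for low-resource languages",
    "computer science education and curriculum development",
    "tomographic reconstruction in structural biology",
]
# Which faculty member was awarded each past grant (parallel to past_grants)
awardees = ["denero", "denero", "nogales"]

# Learn vocabulary and IDF weights over the grants corpus
vectorizer = TfidfVectorizer(stop_words="english")
past_vecs = vectorizer.fit_transform(past_grants)

# Index past grants for k-nearest-neighbor lookup in the grant space
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(past_vecs)

# Vectorize a new grant and recommend it to awardees of similar past grants
new_grant = "innovations in data science education for undergraduates"
_, idx = knn.kneighbors(vectorizer.transform([new_grant]))
recommended = [awardees[i] for i in idx[0]]
```

Here the new grant's closest past grant is the education-themed one, so its awardee is recommended first.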
The second method uses GloVe (Global Vectors for Word Representation) and does not rely on past research grant awards. To match a faculty member to their k best research grants, we take their bag of words, convert each word to its 100-dimensional GloVe vector, and average the vectors. We do the same for each grant description after cleaning and filtering it (removing stop words, punctuation, etc.). We then run k-nearest neighbors to find the k best research grants for the faculty member; matching a research grant to its k best faculty members works the same way. One benefit of this method is that it captures the meaning or connotation of the faculty member's interests and the grant's subject areas. Similar words have geometrically similar GloVe vectors: words such as "HIV" and "AIDS" have very similar vectors because they appear in very similar contexts. So if "HIV" shows up often in a grant description but "AIDS" never does, while "AIDS" shows up in a faculty profile but "HIV" never does, the grant would still most likely be paired with that faculty member.
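A minimal sketch of the averaging-and-matching step: the tiny 3-dimensional vectors below are made-up stand-ins for real 100-dimensional GloVe embeddings (which would be loaded from a pretrained GloVe file), and the faculty words and grant names are hypothetical:

```python
# Sketch of GloVe-average matching with toy 3-d vectors standing in
# for real 100-d GloVe embeddings. All words and grants are hypothetical.
import numpy as np

glove = {
    "hiv":       np.array([0.90, 0.10, 0.00]),
    "aids":      np.array([0.85, 0.15, 0.05]),  # close to "hiv" by design
    "education": np.array([0.00, 0.90, 0.20]),
}

def embed(words):
    """Average the GloVe vectors of the words we have embeddings for."""
    vecs = [glove[w] for w in words if w in glove]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# A faculty profile that mentions "AIDS" but never "HIV"
faculty_vec = embed(["aids"])

# Grant descriptions reduced to averaged vectors
grant_vecs = {
    "hiv_prevention": embed(["hiv"]),
    "stem_education": embed(["education"]),
}

# The HIV grant wins despite sharing no literal words with the profile
best = max(grant_vecs, key=lambda g: cosine(faculty_vec, grant_vecs[g]))
```

This illustrates the HIV/AIDS point: the match comes from vector similarity, not word overlap. In the real app the ranking is done with k-nearest neighbors over all grants rather than a direct `max`.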
Using this method, John DeNero's top research grant match is Computational and Data-Enabled Science and Engineering in Mathematical and Statistical Sciences. For an EECS professor who doesn't do research but has been heavily involved in developing and growing the Data Science program here at Berkeley (developing and teaching DATA8), this is not an unreasonable match. However, more education-focused research grants would seem more appropriate from our perspective, since he is clearly enthusiastic about CS education and helping his students succeed.
We've used the requests Python library and BeautifulSoup to help gather the data.
For faculty members:
- Berkeley VCResearch
- Berkeley department personal faculty webpages
- Google Scholar
- Past research grant awards to faculty members
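The scraping step can be sketched as below; the `fetch_faculty_words` URL handling is a hypothetical simplification of our actual scrapers, which target VCResearch and department faculty pages:

```python
# Sketch of the data-gathering step with requests and BeautifulSoup.
# The real scrapers are page-specific; this is a generic simplification.
import requests
from bs4 import BeautifulSoup

def bag_of_words(html):
    """Strip markup from a faculty page and return a lowercase bag of words."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ")
    return [w.lower() for w in text.split() if w.isalpha()]

def fetch_faculty_words(url):
    """Download a faculty webpage and extract its bag of words."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return bag_of_words(resp.text)
```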
The following requirements are needed (Python 3):

Flask==0.12.1
Jinja2==2.9.6
MarkupSafe==1.0
PyMySQL==0.7.11
Werkzeug==0.12.1
beautifulsoup4==4.5.3
bs4==0.0.1
click==6.7
configparser==3.5.0
flup6==1.1.1
html5lib==0.999999999
itsdangerous==0.24
lxml==3.7.3
mysqlclient==1.3.10
nltk==3.2.2
numpy==1.12.1
pandas==0.19.2
python-dateutil==2.6.0
pytz==2017.2
requests==2.13.0
scikit-learn==0.18.1
scipy==0.19.0
six==1.10.0
sklearn==0.0
webencodings==0.5.1
To install the packages, run the following from the first flaskr directory:
pip install -r requirements.txt
To initialize the database (be in /flaskr/flaskr/data_management):
- Download the necessary files from https://drive.google.com/drive/u/1/folders/0B7Wc4Mfxs-1GM2Jrd2dhelBjNVU (they're too big to fit on GitHub) and place them into /flaskr/flaskr/data_management/temp_data
- The following files are needed to initialize the database:
- Either change your local MySQL credentials to user: root, password: pw, or change the code in init_db.py and database.py
To run the flask app (be in the inner flaskr directory):
export FLASK_APP=flaskr.py
flask run