# Mentor-Mentee Recommendation System
### Nidhi Singh
### 8th September 2017

## Problem Statement

* Design a system for users who are looking for one or more experts that match their stated preferences. The experts are referred to as mentors and the expert seekers as mentees.
* In our scenario, mentors are authors in the DBLP bibliographic reference data set (the name originally stood for "DataBase systems and Logic Programming"), and their expertise is inferred from the titles of their publications.
* Mentees are users of the system who are recommended mentors based on their captured preferences.


## System Process Flow


![System process flow](images/RecSysFlow.png "System process flow")

## Structure of DBLP data set

![Structure of a DBLP publication record](images/pub_description.png "dblp")

## Parse and store DBLP 

* DBLP is a data set of bibliographic information on computer science publications.
* Our system uses the latest copy, which indexes 3.7 million publications written in English by ~2 million authors.
* We use Python's SAX XML parser to read the fields required for our scenario and store them in the appropriate tables.
* PostgreSQL is used for storing the parsed data.
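
The parsing step could look roughly like the sketch below. It is a minimal illustration, not the project's actual parser: the handler class, the choice of `article` and `inproceedings` records, and the sample record are assumptions made for the example. Writing the rows to PostgreSQL is left out.

```python
import io
import xml.sax


class PublicationHandler(xml.sax.ContentHandler):
    """Collects key, title, year and authors for each publication record."""

    RECORD_TAGS = {"article", "inproceedings"}  # assumed subset of DBLP record types
    FIELD_TAGS = {"author", "title", "year"}

    def __init__(self):
        super().__init__()
        self.publications = []
        self._record = None  # publication currently being read
        self._field = None   # field currently being read
        self._chunks = []    # text pieces of the current field

    def startElement(self, name, attrs):
        if name in self.RECORD_TAGS:
            self._record = {"key": attrs.get("key"), "authors": [],
                            "title": None, "year": None}
        elif self._record is not None and name in self.FIELD_TAGS:
            self._field = name
            self._chunks = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if self._record is None:
            return
        if name == self._field:
            text = "".join(self._chunks).strip()
            if name == "author":
                self._record["authors"].append(text)
            else:
                self._record[name] = text
            self._field = None
        elif name in self.RECORD_TAGS:
            self.publications.append(self._record)
            self._record = None


def parse_publications(source):
    """Parse a file name or byte stream and return the collected records."""
    handler = PublicationHandler()
    xml.sax.parse(source, handler)
    return handler.publications


# Hypothetical record in the style of dblp.xml, used only for illustration.
SAMPLE = b"""<dblp>
<article key="journals/x/Doe17">
<author>Jane Doe</author><author>John Roe</author>
<title>Topic Models for Expert Finding.</title><year>2017</year>
</article>
</dblp>"""

publications = parse_publications(io.BytesIO(SAMPLE))
```

Because SAX streams the file instead of loading it into memory, the same handler can be pointed at the full DBLP dump; the real file also references a DTD for its character entities, which this sample avoids.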

### Key challenges
- Author disambiguation: the collection contains distinct authors who share the same name, and a few publications list co-authors with identical names. Since we did not capture extra author information that could help with disambiguation, our system assumes author names are unique.
- Duplicated publication fields: some publications carry multiple values for a field, such as several years. We assume each publication attribute has a single value.


## Topic Extraction
The DBLP data set does not contain keywords, paper abstracts or full text, so the topics are extracted from titles.
We use Python's NLTK package to perform text transformation and corpus preparation.
![Topics](images/topics.png "dblp")

### Corpus Preparation
There are two main design approaches to corpus preparation in Information Retrieval:
* Document based - this approach first ranks the documents in the corpus for a given query topic and then finds the associated candidates. The corpus consists of every document; in our case, each publication title is a document.
* Candidate based - this approach directly models each candidate's profile from all documents associated with the candidate and estimates a ranking score from that profile in response to a user query.

Since our system is query independent, we choose the candidate-based model. It also simplifies the design, because the topics extracted for a candidate can directly represent that candidate's expertise. Furthermore, the LDA algorithm works better on longer documents, and combining an author's titles yields longer documents than individual titles.

### Corpus Preparation and Topic modeling
* We identify and collect all publication titles of each author.
* We create a single document for each author.
* We combine all documents to create the corpus.
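
The three steps above can be sketched as follows. This is only an illustration: the function names and sample publications are hypothetical, and the tiny hard-coded stopword list stands in for the NLTK-based text transformation used in the system.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); NLTK's English list is larger.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(title):
    """Lowercase a title, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOPWORDS]


def build_author_documents(publications):
    """Map each author name to the tokens of all of that author's titles."""
    documents = defaultdict(list)
    for pub in publications:
        tokens = tokenize(pub["title"])
        for author in pub["authors"]:
            documents[author].extend(tokens)
    return dict(documents)


# Hypothetical publications, for illustration only.
publications = [
    {"title": "Topic Models for Expert Finding", "authors": ["Jane Doe", "John Roe"]},
    {"title": "Indexing the Web with Graphs", "authors": ["Jane Doe"]},
]

author_docs = build_author_documents(publications)  # one document per author
corpus = list(author_docs.values())                 # all documents together
```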

* We use the LDA (Latent Dirichlet Allocation) model from Python's Gensim package for topic modeling on the corpus.
* We set the number of topics to 20 and the number of passes to 5.
    * The number of topics was chosen by trying runs with 10, 15 and 20 topics; 20 gave a better distribution of terms over topics.
    * The number of passes, 5, was chosen arbitrarily, as the convergence point was computationally intensive to determine.
* The LDA topics are then used to identify the probable topics for each document, and the result is saved as the author-topic probability.
* The choice of LDA is influenced by:
    * LDA produces interpretable, semantically coherent topics, which can be examined by listing the most probable words for each topic.
    * Gensim's LDA has a multi-core variant. Since we have 1.9 million documents in our corpus and topic modeling is computationally intensive, this is vital.


## Author expert profile
Expertise can be defined as a combination of several topics, hence a candidate can be represented in terms of the mixing proportions of multiple topics.
* We take the document-topic distribution as the input to the expert model for each author.
* The topics are treated as the author's expertise topics, weighted by their associated probabilities.
* Expertise level can be further refined by:
    * Total number of publications. Publications per topic would be more useful but are not available in our scenario.
    * Co-authorship - each publication is written by one or more authors; the fewer the authors, the higher the expertise level credited to each. So if a paper has 4 authors, each author gets a co-authorship score of 1/4 (0.25).

* We construct an author profile feature vector, with topics as features.
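
The co-authorship score described above can be sketched as below. The per-paper share of 1/n comes from the text; summing the shares over all of an author's papers is an assumption made here for illustration, as is the function name.

```python
from collections import defaultdict


def coauthorship_scores(author_lists):
    """Give each author 1/n credit per paper with n authors, summed over papers.

    Summation across papers is an assumed aggregation, not stated in the text.
    """
    scores = defaultdict(float)
    for authors in author_lists:
        if not authors:
            continue  # skip records without authors
        share = 1.0 / len(authors)
        for author in authors:
            scores[author] += share
    return dict(scores)


# Hypothetical author lists, one per publication.
author_lists = [
    ["A", "B", "C", "D"],  # each author gets 0.25
    ["A"],                 # sole author gets 1.0
    ["A", "B"],            # each author gets 0.5
]

scores = coauthorship_scores(author_lists)
```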

## Mentor recommendations
* For a mentee to be recommended a mentor, we need to model the mentee's preferences. In our case, we have identified a list of topics from the corpus, and the mentee's preferences can be captured over these topics.
* From these preferences, we construct a mentee preference feature vector.
* To identify mentors similar to the preferences, there are 2 key decisions:
    * similarity measure - we use cosine similarity
    * similarity threshold - this is arbitrarily set to 0.5 and needs to be adjusted after evaluating the results.
* We compute the cosine similarity between each mentor profile and the mentee profile, and output the mentors whose similarity is greater than the threshold.
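
The matching step can be sketched in plain Python as follows. The mentor profiles and mentee vector are made-up three-topic examples, and the function names are hypothetical; sorting the matches by similarity is an added convenience not stated above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(mentee_vector, mentor_profiles, threshold=0.5):
    """Return (mentor, similarity) pairs above the threshold, best match first."""
    scored = [(name, cosine_similarity(mentee_vector, vec))
              for name, vec in mentor_profiles.items()]
    matches = [(name, score) for name, score in scored if score > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic-probability profiles over three topics.
mentor_profiles = {
    "Jane Doe": [0.7, 0.2, 0.1],
    "John Roe": [0.1, 0.1, 0.8],
    "Ann Poe": [0.4, 0.5, 0.1],
}
mentee_vector = [1.0, 0.0, 0.0]  # mentee is interested only in the first topic

recommendations = recommend(mentee_vector, mentor_profiles, threshold=0.5)
recommended_names = [name for name, _ in recommendations]
```

Cosine similarity depends only on the direction of the vectors, so a mentee preference vector does not need to be normalized to sum to 1 before comparison.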

### Recommendations evaluation
* As the first part of the evaluation, we need to create ground truth:
    * Pick a sample of users (candidate mentors), say 100.
    * Recruit evaluators, say 3 (more than 2, to offset any single evaluator's subjective bias).
    * Run one query (user preference) per evaluator.
    * For this query, mark each sampled user as relevant or not relevant. This can be done by looking up the author's name on scholar websites to retrieve their interests and other profile details.
* These labels can be used to fit a model of the probability of relevance as a function of cosine similarity; the similarity value at which P(relevant) > 0.5 can then serve as the threshold.
* With ground truth, we can use IR evaluation measures such as precision and recall, as well as RMSE.
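
Once relevance labels exist, the measures mentioned above could be computed along these lines. The function names and the sample data are hypothetical and only illustrate the calculations.

```python
import math


def precision_recall(recommended, relevant):
    """Precision and recall of a recommended set against the relevant set."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between predicted scores and 0/1 relevance labels."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical evaluation data for one query.
recommended = ["A", "B", "C", "D"]  # mentors returned by the system
relevant = ["A", "C", "E"]          # mentors marked relevant by evaluators

precision, recall = precision_recall(recommended, relevant)
error = rmse([0.9, 0.2], [1, 0])    # predicted P(relevant) vs. labels
```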

# Thank you