# Talk Recommender - PyCon 2018

With 32 tutorials, 12 sponsor workshops, 16 talks at the education summit, and 95 talks at the main conference, PyCon has a lot to offer. Reading through every talk description and picking out the ones you should attend is a tedious process. But not anymore.

## Introducing TalkRecommender
Talk Recommender is a recommendation system that suggests talks from this year's PyCon based on the ones you went to last year. This way you don't waste time preparing a schedule and get to see the talks that matter most to you!

As shown in the demo, users are asked to sort the previous year's talks into two categories: the ones they went to in person, and the ones they watched later online. Talk Recommender uses those labels to predict which of this year's talks will be interesting to them.

We will be using [`pandas`](https://pandas.pydata.org/) and [`scikit-learn`](http://scikit-learn.org/) to build the model.

*Remember to click on Save and Checkpoint from the File menu to save changes you made to the notebook* 

### Exercise A: Load the data
The data directory contains a snapshot of one such user's labeling. Let's load that up and start our analysis.

In [1]:
!ls -lrt data

total 184
-rw-r--r-- 1 1000 1000 186903 Jun  3 15:44 talks.csv


In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/talks.csv', index_col='id')
df.head()

id,title,description,presenters,date_created,date_modified,location,talk_dt,year,label
1,5 ways to deploy your Python web app in 2017,You’ve built a fine Python web application and...,Andrew T. Baker,2018-04-19 00:59:20.151875,2018-04-19 00:59:20.151875,Portland Ballroom 252–253,2017-05-08 15:15:00.000000,2017,0.0
2,A gentle introduction to deep learning with Te...,Deep learning's explosion of spectacular resul...,Michelle Fullwood,2018-04-19 00:59:20.158338,2018-04-19 00:59:20.158338,Oregon Ballroom 203–204,2017-05-08 16:15:00.000000,2017,0.0
3,aiosmtpd - A better asyncio based SMTP server,smtpd.py has been in the standard library for ...,Barry Warsaw,2018-04-19 00:59:20.161866,2018-04-19 00:59:20.161866,Oregon Ballroom 203–204,2017-05-08 14:30:00.000000,2017,1.0
4,Algorithmic Music Generation,Music is mainly an artistic act of inspired cr...,Padmaja V Bhagwat,2018-04-19 00:59:20.165526,2018-04-19 00:59:20.165526,Portland Ballroom 251 & 258,2017-05-08 17:10:00.000000,2017,0.0
5,An Introduction to Reinforcement Learning,Reinforcement learning (RL) is a subfield of m...,Jessica Forde,2018-04-19 00:59:20.169075,2018-04-19 00:59:20.169075,Portland Ballroom 252–253,2017-05-08 13:40:00.000000,2017,0.0


Here is a brief description of the interesting fields.

variable | description
------|------
`title`|Title of the talk
`description`|Description of the talk
`year`|Year of the talk: `2017` or `2018`
`label`|`1` indicates the user preferred seeing the talk in person,<br> `0` indicates they would schedule it for later.

Note that the `label` of every 2018 talk is set to 1. These values are only placeholders and are not used in training the model; we will use only the 2017 data for training.
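
A quick way to confirm which labels are real and which are placeholders is to count labels per year. The sketch below uses a small made-up frame (`toy`) rather than `talks.csv`, so the rows and counts are illustrative assumptions only; the same `groupby` call should work on `df`.

```python
import pandas as pd

# Toy stand-in for talks.csv: three labeled 2017 talks, two placeholder 2018 talks.
toy = pd.DataFrame({
    "year":  [2017, 2017, 2017, 2018, 2018],
    "label": [1.0, 0.0, 0.0, 1.0, 1.0],
})

# Label counts per year: 2017 shows real choices, 2018 only the placeholder value.
counts = toy.groupby("year")["label"].value_counts()
print(counts)

# Keep only the labeled year for training.
train = toy[toy.year == 2017]
print(len(train))
```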

Let's start by selecting the 2017 talk descriptions that the user labeled for watching in person.

```python
df[(df.year==2017) & (df.label==1)]['description']
```

Print the description of the talks that the user preferred watching in person. How many such talks are there?

In [3]:
fav_2017 = df[(df.year == 2017) & (df.label == 1)]

In [4]:
fav_2017[['title', 'description']]

id,title,description
3,aiosmtpd - A better asyncio based SMTP server,smtpd.py has been in the standard library for ...
8,Automate AWS With Python,AWS is one of the best-known cloud vendors. Us...
19,"Decorators, unwrapped: How do they work?",Decorators are a syntactically-pleasing way of...
9,Awesome Command Line Tools,Designing a good command line tool is challeng...
13,Building Stream Processing Applications,Do you have a stream of data that you would li...
16,Cython as a Game Changer for Efficiency,Are you running a Web application? Do you suff...
18,"Debugging in Python 3.6: Better, Faster, Stronger",Python 3.6 was released in December of 2016 an...
20,Designing secure APIs with state machines,Did you ever need to create an application who...
21,Dial M For Mentor,One of the nicest things about Python communit...
23,Ending Py2/Py3 compatibility in a user friendl...,"""Four shalt thou not count, neither count t..."


In [5]:
fav_2017.shape[0]

38

### Exercise B: Feature Extraction
In this step we build the feature set by tokenizing the text descriptions of the talks, then counting and normalizing their unigrams and bi-grams (`ngram_range=(1, 2)`). You can find more information on text feature extraction [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) and TfidfVectorizer [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
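
To see what these settings produce, here is a minimal sketch on two made-up sentences (they are not taken from `talks.csv`). It uses a separate `toy_vectorizer` so the notebook's `vectorizer` is left untouched. English stop words such as "with" are dropped before the bi-grams are formed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up descriptions, not taken from talks.csv.
docs = ["deep learning with python", "python web app"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each unigram or bi-gram to its column index.
terms = sorted(toy_vectorizer.vocabulary_)
print(terms)
print(matrix.shape)  # one row per document, one column per term
```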

The 2017 talks will be used for training and the 2018 talks will be used for predicting. Set `year_labeled` and `year_predict` to the appropriate values.

In [7]:
year_labeled = 2017
year_predict = 2018
vectorized_text_labeled = vectorizer.fit_transform(df[df.year == year_labeled]['description'])
vectorized_text_predict = vectorizer.transform(df[df.year == year_predict]['description'])

In [8]:
vectorized_text_labeled

<95x8244 sparse matrix of type '<class 'numpy.float64'>'
	with 11147 stored elements in Compressed Sparse Row format>
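
The cell above calls `fit_transform` on the labeled year but only `transform` on the year to predict. The following sketch, on made-up text, illustrates why: the vocabulary and IDF weights are learned once, so both matrices share the same columns, and terms that never appeared during fitting are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

labeled_docs = ["asyncio smtp server", "command line tools"]  # made-up "2017" text
predict_docs = ["asyncio command line", "quantum knitting"]    # made-up "2018" text

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
labeled = toy_vectorizer.fit_transform(labeled_docs)  # learns vocabulary and IDF weights
predict = toy_vectorizer.transform(predict_docs)      # reuses them, learns nothing new

# Same number of columns, so a classifier trained on `labeled` accepts `predict`.
print(labeled.shape, predict.shape)

# Terms never seen during fit are dropped: the second row has no stored values.
print(predict[1].nnz)
```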

### Exercise C: Split into Training and Testing Set

Next we split our data into a training set and a testing set. Evaluating on talks the model has not seen during training lets us detect overfitting. Use the `train_test_split` function from `sklearn.model_selection` to split `vectorized_text_labeled` into training and testing sets, with the test size set to 30% (0.3) of the labeled data.

[Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is the documentation for the function.

In [9]:
from sklearn.model_selection import train_test_split
labels = df[df.year == 2017]['label']
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(vectorized_text_labeled, labels, test_size=test_size, random_state=1)
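
As a rough check of what `test_size=0.3` means for 95 labeled talks, the sketch below splits 95 dummy rows (not the real vectors). scikit-learn rounds the test count up, so 0.3 × 95 = 28.5 becomes 29 test rows and 66 training rows. The `stratify=y` variant is an optional extra that this notebook does not use; it keeps the share of positive labels similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 95 dummy rows standing in for the 95 labeled talk vectors.
X = np.arange(95).reshape(-1, 1)
y = np.array([1] * 38 + [0] * 57)  # 38 positives, as counted in Exercise A

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
print(len(X_tr), len(X_te))

# Optional variation (not used in this notebook): keep the class ratio in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_te.mean(), y_te_s.mean())
```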

### Exercise D: Train the model
Finally we get to the stage for training the model. We are going to use a linear [support vector classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) and check its accuracy by using the `classification_report` function. Note that we have not done any parameter tuning yet, so your model might not give you the best results. 


[Here](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html) is some information for using [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) for doing exhaustive search over specified parameter values of an estimator. _However, this is purely for reference and not needed for this exercise._
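
For readers who want to try it anyway, this is a minimal sketch of a grid search over `LinearSVC` settings. It runs on synthetic data from `make_classification`, and the candidate values in `param_grid` and the choice of `scoring="f1"` are arbitrary picks for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=60, n_features=20, random_state=0)

# Candidate values are arbitrary picks for illustration.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, scoring="f1", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 2))
```

In the notebook itself you would pass `X_train` and `y_train` to `search.fit` instead of the synthetic arrays.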

Print out the `report` to see how well your model has been trained!

In [10]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
classifier = LinearSVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
report = classification_report(y_test, y_pred)

In [11]:
print(report)

             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83        20
        1.0       1.00      0.11      0.20         9

avg / total       0.80      0.72      0.64        29
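
To read this report, it helps to work back to the underlying counts. The confusion-matrix counts below are inferred from the report rather than printed by the notebook, so treat them as an assumption; they are chosen because they reproduce every number in the table. The model found only 1 of the 9 talks the user actually attended in person, which is what a recall of 0.11 for class `1.0` means.

```python
# Counts inferred from the report: 20 true negatives, 0 false positives,
# 8 false negatives, 1 true positive (an assumption that reproduces every row).
tn, fp, fn, tp = 20, 0, 8, 1

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 1.0 0.11 0.2
print(round(precision_0, 2), round(recall_0, 2))                  # 0.71 1.0
print(round(accuracy, 2))                                         # 0.72
```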



### Exercise E: Make Predictions
Use the model to predict which 2018 talks the user should go to. 

Using `predicted_talk_indexes`, print out the talk id, title, description, presenters, location and talk date.
How many talks should the user go to according to your model?

In [12]:
predicted_talks_vector = classifier.predict(vectorized_text_predict)
df_2018 = df[df.year == 2018]

# Offset the rows by 2017 talks
predicted_talk_indexes = predicted_talks_vector.nonzero()[0] + len(df[df.year==2017])
# your solution goes here
talks = df_2018[predicted_talks_vector == 1]

In [13]:
# sanity check
(talks == df.iloc[predicted_talk_indexes]).all().all()

True
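
The two selection routes in the cells above agree only because every 2017 row comes before every 2018 row in the file, which is what the offset relies on. Here is a small sketch of that equivalence on a made-up frame (`toy`), with a pretend prediction vector:

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout assumption as talks.csv:
# every 2017 row comes before every 2018 row.
toy = pd.DataFrame(
    {"year": [2017, 2017, 2018, 2018, 2018], "title": list("abcde")},
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)
toy_2018 = toy[toy.year == 2018]
predicted = np.array([0.0, 1.0, 1.0])  # pretend classifier output for the 2018 rows

# Route 1: positions within the 2018 block, shifted past the 2017 rows.
positions = predicted.nonzero()[0] + len(toy[toy.year == 2017])
by_position = toy.iloc[positions]

# Route 2: boolean mask applied directly to the 2018 slice.
by_mask = toy_2018[predicted == 1]

print(by_position.equals(by_mask))  # True
```

The boolean mask does not depend on row order, so it is the safer choice if the data might be shuffled.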

In [14]:
talks.shape[0]

20

In [15]:
talks[['title', 'description', 'presenters', 'talk_dt', 'location']]

id,title,description,presenters,talk_dt,location
103,Bayesian Non-parametric Models for Data Scienc...,"Nowadays, there are many ways of building data...",Christopher Fonnesbeck,2018-03-29 13:40:00.000000,Global Center Ballroom AB
104,Behavior-Driven Python,Behavior-Driven Development (BDD) is gaining p...,Andrew Knight,2018-03-29 12:10:00.000000,Grand Ballroom A
106,Beyond Unit Tests: Taking Your Testing to the ...,"You've used pytest and you've used mypy, but b...",Hillel Wayne,2018-03-29 12:10:00.000000,Room 26A/B/C
113,By the Numbers: Python Community Trends in 201...,Want to know about the latest trends in the Py...,"Dmitry Filippov, Ewa Jodlowska",2018-03-29 13:55:00.000000,Room 26A/B/C
114,Clearer Code at Scale: Static Types at Zulip a...,Python now offers static types! Companies like...,Greg Price,2018-03-29 13:50:00.000000,Grand Ballroom B
126,Demystifying the Patch Function,One of the most challenging and important thin...,Lisa Roach,2018-03-29 12:10:00.000000,Grand Ballroom B
132,Elegant Solutions For Everyday Python Problems,Are you an intermediate python developer looki...,Nina Zakharenko,2018-03-29 17:10:00.000000,Room 26A/B/C
135,Fighting the Good Fight: Python 3 in your orga...,"Today, services built on Python 3.6.3 are wide...",Jason Fried,2018-03-29 16:30:00.000000,Grand Ballroom B
139,How Netflix does failovers in 7 minutes flat,"During peak hours, Netflix video streams make ...",Amjith Ramanujam,2018-03-29 11:30:00.000000,Global Center Ballroom AB
140,HOWTO Write a Function,A function is a small chunk of code that does ...,Jack Diederich,2018-03-29 12:10:00.000000,Room 26A/B/C


### Exercise F: Expose it as a service

Now that you have pieces of the code ready, copy them together into the `model.py` file located in this folder, and rebuild your docker image. Copy the code from the above cells into the body of the `prediction` function.

Let's rebuild the docker image and start a new container following the comments.

In the following steps you will leave the jupyter notebook and stop the container serving it, so save any changes you have made up to this point.

```
docker stop <container_name>
docker build -t recommender .
docker run -p 8888:8888 -p 9000:9000 -v $(pwd):/app recommender
```
where `<container_name>` is the name of the container serving this jupyter notebook.

The `api.py` file in this directory is a flask app that calls the `model.py` module and exposes the model built in the previous steps as a service. To start the flask server, open a new terminal and run the following command.

```
docker exec $(docker ps -ql) python api.py
```
where `docker ps -ql` prints the numeric ID of the most recently created container.

Finally go to http://0.0.0.0:9000/predict to see the talks that were recommended for this user.
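
The contents of `api.py` are not shown in this notebook, so the following is only a sketch of what a flask wrapper of this kind could look like. The `prediction` function here is a hypothetical stand-in that returns one hard-coded talk, and the JSON shape is an assumption; the real `model.prediction` may return something different.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def prediction():
    """Hypothetical stand-in for the prediction function in model.py."""
    return [{"id": 104, "title": "Behavior-Driven Python"}]


@app.route("/predict")
def predict():
    # Serialize whatever the model returns as JSON.
    return jsonify(prediction())


# Exercise the route in-process, without starting a server.
client = app.test_client()
response = client.get("/predict")
print(response.status_code, response.get_json())
```

To serve it on the port used above, you would call `app.run(host="0.0.0.0", port=9000)` at the end of the file instead of using the test client.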

### Exercise G: Pickle the model

Finally, we do not have to retrain our model every time we need to make predictions. In most real-life data science applications, the training phase is a time-consuming process. We would separately train and serialize the model, which is then exposed through the API to make predictions. The `predict_api` directory of the TalkVoter app shows an approach where we wrap the model and separate out the calls to the prediction API, so that they use the trained model instead of retraining every time the API is called.

In [16]:
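try:
    import joblib  # standalone package, required on scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn
with open('talk_recommender.pkl', 'wb') as f:
    joblib.dump(classifier, f)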

This will create the pickle file in your directory.

In [17]:
!ls -l

total 148
-rw-r--r-- 1 1000 1000   277 Jun  3 15:44 Dockerfile
-rw-r--r-- 1 1000 1000   400 Jun  3 15:44 README.md
drwxr-xr-x 2 root root  4096 Jul  3 22:54 __pycache__
-rw-r--r-- 1 1000 1000   298 Jun  3 15:44 api.py
drwxr-xr-x 2 1000 1000  4096 Jun  3 15:44 data
-rw-r--r-- 1 1000 1000  1838 Jul  3 22:54 model.py
-rw-r--r-- 1 1000 1000   167 Jun  3 15:44 requirements.txt
-rw-r--r-- 1 1000 1000 49461 Jul  3 23:07 talk_recommender.ipynb
-rw-r--r-- 1 root root 66755 Jul  3 23:10 talk_recommender.pkl


Use the `joblib.load` function to read the `classifier` back from the `talk_recommender.pkl` file.

In [19]:
with open('talk_recommender.pkl', 'rb') as f:
    classifier = joblib.load(f)

In [20]:
classifier

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
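
Note that the pickled `classifier` expects input that has already been transformed by the fitted `vectorizer`, so a prediction service needs both objects. One possible approach, shown here as a sketch on made-up text and labels, is to bundle the two steps in a scikit-learn pipeline and serialize that instead. It uses the standard-library `pickle` module for an in-memory round trip; the texts, labels and variable names are illustrative assumptions.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up descriptions and labels, only to have something to fit.
texts = [
    "asyncio smtp server",
    "deep learning models",
    "asyncio event loop",
    "neural network training",
]
labels = [1, 0, 1, 0]

# Bundle the vectorizer with the classifier so they are fitted and saved together.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LinearSVC(),
)
pipeline.fit(texts, labels)

# Round-trip through bytes; joblib.dump / joblib.load on a file works the same way.
restored = pickle.loads(pickle.dumps(pipeline))

# The restored pipeline accepts raw text directly.
new_texts = ["asyncio server internals"]
print(restored.predict(new_texts))
```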