### Introduction

In this notebook we will create the main models and vectors that the chatbot will use during conversations.

### Data description

- `tagged_posts.tsv` — StackOverflow posts, tagged with one programming language (*positive samples*).
- `dialogues.tsv` — dialogue phrases from movie subtitles (*negative samples*).
- `word_embeddings.tsv` — word embeddings trained earlier with [StarSpace](https://github.com/facebookresearch/StarSpace) on StackOverflow query data.


### Models Description

As a result of this notebook, we will obtain the following new objects that we will then use in the running bot:

- `intent_recognizer.pkl` — intent recognition model;
- `tag_classifier.pkl` — programming language classification model;
- `tfidf_vectorizer.pkl` — vectorizer used during training;
- `thread_embeddings_by_tags` — folder with thread embeddings, arranged by tags.
    

Some functions are shared between this notebook and the bot's scripts, so they live in the *utils.py* file. Don't forget to open it and fill in the gaps.

In [2]:
from utils import *
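
Later cells rely on a `RESOURCE_PATH` dictionary that arrives through this wildcard import. Its contents are not shown in this notebook, so the following is only a sketch of what it might look like. The four keys are the ones used below; the `output/` prefix is an assumption taken from the path that `tfidf_features` writes to, and the name `EXAMPLE_RESOURCE_PATH` is hypothetical.

```python
# Hypothetical layout: keys match the ones used later in this notebook,
# the 'output/' prefix is an assumption and may differ in the real utils.py.
EXAMPLE_RESOURCE_PATH = {
    'INTENT_RECOGNIZER': 'output/intent_recognizer.pkl',
    'TAG_CLASSIFIER': 'output/tag_classifier.pkl',
    'TFIDF_VECTORIZER': 'output/tfidf_vectorizer.pkl',
    'THREAD_EMBEDDINGS_FOLDER': 'output/thread_embeddings_by_tags',
}

for name, path in EXAMPLE_RESOURCE_PATH.items():
    print('{:<26}{}'.format(name, path))
```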

## Part I. Intent and language recognition

This bot will not only **answer programming-related questions** but also **maintain a dialogue**. It will detect the user's *intent* from the question, which distinguishes programming-related questions from general ones.

The bot will also predict which programming language a particular question refers to. Searching within a single language shrinks the search space roughly by a factor of the number of languages (10 here), assuming posts are spread fairly evenly across them. This part involves creating a **text classification model**.

In [3]:
import numpy as np
import pandas as pd
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer

### Data preparation

In this part we will preprocess the texts (tokenizing them and removing stopwords and bad symbols) and then apply a TF-IDF transformation. We will also pickle the TF-IDF vectorizer so that the running bot can use it later.

In [4]:
def tfidf_features(X_train, X_test):
    """Performs TF-IDF transformation and dumps the model."""
    
    # Configure the vectorizer: split on whitespace and use unigrams and bigrams.
    tfidf_vectorizer = TfidfVectorizer(token_pattern=r'(\S+)', min_df=5, max_df=0.9, ngram_range=(1, 2))

    # Fit on X_train only, then transform both X_train and X_test.
    tfidf_vectorizer.fit(X_train)
    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    # Pickle the trained vectorizer so the running bot can reuse it.
    with open('output/tfidf_vectorizer.pkl', 'wb') as fout:
        pickle.dump(tfidf_vectorizer, fout)

    return X_train, X_test
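
The `token_pattern` argument above matters for programming titles. As a small illustration using only the standard `re` module, the sketch below compares scikit-learn's default token pattern with the whitespace pattern used in `tfidf_features`. The example title is made up.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
WHITESPACE_PATTERN = r"(\S+)"       # the pattern passed in tfidf_features

title = "sort a list in c# and c++"  # made-up example title

default_tokens = re.findall(DEFAULT_PATTERN, title)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, title)

print(default_tokens)     # ['sort', 'list', 'in', 'and']
print(whitespace_tokens)  # ['sort', 'a', 'list', 'in', 'c#', 'and', 'c++']
```

The default pattern keeps only runs of two or more word characters, so language names such as `c#` and `c++` vanish entirely. Splitting on whitespace keeps them as features for the classifiers.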

Load examples of the two classes, subsampling the StackOverflow data to balance the classes.

In [5]:
sample_size = 200000

dialogue_df = pd.read_csv('data/dialogues.tsv', sep='\t').sample(sample_size, random_state=0)
stackoverflow_df = pd.read_csv('data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)

Let's check what the data looks like.

In [6]:
dialogue_df.head()

Unnamed: 0,text,tag
82925,"Donna, you are a muffin.",dialogue
48774,He was here last night till about two o'clock....,dialogue
55394,"All right, then make an appointment with her s...",dialogue
90806,"Hey, what is this-an interview? We're supposed...",dialogue
107758,Yeah. He's just a friend of mine I was trying ...,dialogue


In [7]:
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


Apply the *text_prepare* function from utils.py to preprocess this data.

In [8]:
from utils import text_prepare

In [9]:
dialogue_df['text'] = dialogue_df['text'].map(text_prepare)
stackoverflow_df['title'] = stackoverflow_df['title'].map(text_prepare)
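
The body of `text_prepare` lives in utils.py and is not shown here. The following is a minimal sketch of the kind of preprocessing described above (lowercasing, replacing some punctuation with spaces, dropping bad symbols, removing stopwords). The regular expressions, the tiny stopword set, and the name `text_prepare_sketch` are assumptions for illustration, not the actual implementation.

```python
import re

# Assumed patterns: punctuation that separates tokens becomes a space,
# everything outside a small whitelist of characters is dropped.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Toy stopword set; a real one would come from a library such as NLTK.
STOPWORDS = {'a', 'an', 'the', 'in', 'on', 'to', 'is', 'of', 'and'}


def text_prepare_sketch(text):
    """Lowercase, clean symbols, and remove stopwords (illustrative only)."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    return ' '.join(word for word in text.split() if word not in STOPWORDS)


print(text_prepare_sketch('How to sort a list in C#?'))  # how sort list c#
print(text_prepare_sketch('Array[index] (C++)'))         # array index c++
```

Keeping `#` and `+` in the whitelist is what lets tokens like `c#` and `c++` reach the vectorizer.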

### Intent recognition

We will do a binary classification on TF-IDF representations of texts. Labels will be either `dialogue` for general questions or `stackoverflow` for programming-related questions. First, let's prepare the data for this task with the following steps:
- concatenate `dialogue` and `stackoverflow` examples into one sample
- split it into train and test in proportion 9:1, using *random_state=0* for reproducibility
- transform it into TF-IDF features

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X = np.concatenate([dialogue_df['text'].values, stackoverflow_df['title'].values])
y = ['dialogue'] * dialogue_df.shape[0] + ['stackoverflow'] * stackoverflow_df.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0) 
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test)

Train size = 360000, test size = 40000


Train the **intent recognizer** using LogisticRegression on the train set

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [13]:
intent_recognizer = LogisticRegression(penalty='l2', C=10, random_state=0)
intent_recognizer.fit(X_train_tfidf, y_train)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
#Checking test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.992


Dumping the classifier to use it in the running bot.

In [15]:
# Use a context manager so the file is closed after the dump.
with open(RESOURCE_PATH['INTENT_RECOGNIZER'], 'wb') as fout:
    pickle.dump(intent_recognizer, fout)

### Programming language classification 

We will train one more classifier for the programming-related questions. It will predict exactly one tag (programming language) and will also be based on Logistic Regression with TF-IDF features.

First, let us prepare the data for this task.

In [16]:
X = stackoverflow_df['title'].values
y = stackoverflow_df['tag'].values

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 160000, test size = 40000


Let us reuse the TF-IDF vectorizer that we have already created above. It should not make a huge difference which data was used to train it.

In [18]:
with open(RESOURCE_PATH['TFIDF_VECTORIZER'], 'rb') as fin:
    vectorizer = pickle.load(fin)

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

Training the **tag classifier** using OneVsRestClassifier wrapper over LogisticRegression.

In [19]:
from sklearn.multiclass import OneVsRestClassifier

In [20]:
tag_classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', C=5, random_state=0))
tag_classifier.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)

In [21]:
#Checking test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.8007


Dumping the classifier to use it in the running bot.

In [22]:
# Use a context manager so the file is closed after the dump.
with open(RESOURCE_PATH['TAG_CLASSIFIER'], 'wb') as fout:
    pickle.dump(tag_classifier, fout)
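
To show how the two classifiers from Part I could fit together in the running bot, here is a hedged sketch of the routing step: recognize the intent first, and ask the tag classifier only for programming questions. The function `route_question` and the stand-in classes are hypothetical; in the bot, the unpickled vectorizer and models would take their place.

```python
def route_question(question, prepare, vectorizer, intent_recognizer, tag_classifier):
    """Return (intent, tag); tag is None for general dialogue."""
    features = vectorizer.transform([prepare(question)])
    intent = intent_recognizer.predict(features)[0]
    if intent == 'dialogue':
        return intent, None
    return intent, tag_classifier.predict(features)[0]


# Stand-in objects so the sketch runs without the pickled models.
class _IdentityVectorizer:
    def transform(self, texts):
        return texts


class _KeywordModel:
    def __init__(self, keyword, hit, miss):
        self.keyword, self.hit, self.miss = keyword, hit, miss

    def predict(self, texts):
        return [self.hit if self.keyword in text else self.miss for text in texts]


vectorizer = _IdentityVectorizer()
intent_model = _KeywordModel('python', 'stackoverflow', 'dialogue')
tag_model = _KeywordModel('python', 'python', 'java')

print(route_question('How to sort a list in Python', str.lower, vectorizer, intent_model, tag_model))
print(route_question('How are you today', str.lower, vectorizer, intent_model, tag_model))
```

Because both models were trained on features from the same vectorizer, one `transform` call can serve both predictions.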

## Part II. Ranking questions with embeddings

To find a relevant answer (a thread from StackOverflow) to a question, we will use vector representations to calculate the similarity between the question and existing threads. We will use the `question_to_vec` function from _utils.py_, which creates such a representation from word vectors.

However, it would be costly to compute such representations for all possible answers in the bot's *online mode* (i.e. when the bot is running and answering questions from many users). For this reason we will create a *database* of pre-computed representations. These representations will be arranged by non-overlapping tags (programming languages), so that each search runs within a single tag. This makes the bot more efficient and means the whole database does not have to be held in RAM.
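
The body of `question_to_vec` is in utils.py and is not shown in this notebook. One common way to build a sentence vector from word vectors is to average the vectors of the known words, which is what the sketch below does. The averaging rule, the zero vector for unknown-only questions, and the name `question_to_vec_sketch` are assumptions; the real function may differ.

```python
import numpy as np


def question_to_vec_sketch(question, embeddings, dim):
    """Average the vectors of the words found in `embeddings` (illustrative only)."""
    vectors = [embeddings[word] for word in question.split() if word in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


# Toy two-dimensional embeddings, made up for the example.
toy_embeddings = {
    'sort': np.array([1.0, 0.0], dtype=np.float32),
    'list': np.array([0.0, 1.0], dtype=np.float32),
}

print(question_to_vec_sketch('sort list quickly', toy_embeddings, 2))  # unknown word is skipped
print(question_to_vec_sketch('hello there', toy_embeddings, 2))        # no known words
```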

Load the StarSpace embeddings.

In [23]:
from utils import load_embeddings

In [41]:
starspace_embeddings, embeddings_dim = load_embeddings('data/starspace_embedding.tsv')
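
`load_embeddings` is another utils.py function whose body is not shown. The sketch below assumes the file holds one word per line followed by its tab-separated vector components, and returns a dictionary together with the vector dimension, matching the two values unpacked above. The file format and the name `load_embeddings_sketch` are assumptions.

```python
import os
import tempfile

import numpy as np


def load_embeddings_sketch(path):
    """Read 'word<TAB>v1<TAB>v2...' lines into a dict (assumed file format)."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip('\n').split('\t')
            embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)
    dim = len(next(iter(embeddings.values())))
    return embeddings, dim


# Write a tiny made-up file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy_embeddings.tsv')
    with open(toy_path, 'w', encoding='utf-8') as f:
        f.write('sort\t0.1\t0.2\t0.3\n')
        f.write('list\t0.4\t0.5\t0.6\n')
    toy_embeddings, toy_dim = load_embeddings_sketch(toy_path)

print(toy_dim, sorted(toy_embeddings))
```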

Since we want to precompute representations for all possible answers, we need to load the whole posts dataset, unlike for the intent classifier, where a subsample was enough.

In [25]:
posts_df = pd.read_csv('data/tagged_posts.tsv', sep='\t')

Let's look at the distribution of posts across programming languages (tags) and find the most common ones.

In [26]:
counts_by_tag = posts_df.groupby(['tag']).count() 

Now for each `tag` we will create two data structures, which will serve as an online search index:
* `tag_post_ids` — a list of post_ids with shape `(counts_by_tag[tag],)`. It will be needed to show the title and link to the thread;
* `tag_vectors` — a matrix with shape `(counts_by_tag[tag], embeddings_dim)` where embeddings for each answer are stored.

In [38]:
counts_by_tag

tag,post_id,title
c#,394451,394451
c_cpp,281300,281300
java,383456,383456
javascript,375867,375867
php,321752,321752
python,208607,208607
r,36359,36359
ruby,99930,99930
swift,34809,34809
vb,35044,35044


In [39]:
counts_by_tag = posts_df['tag'].value_counts().to_dict()
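
Using the counts shown above, a quick calculation shows how much restricting the search to one tag shrinks the search space. The dictionary below simply repeats those counts so the snippet runs on its own; the variable names are illustrative.

```python
# Counts copied from the table above.
counts_by_tag_example = {
    'c#': 394451, 'c_cpp': 281300, 'java': 383456, 'javascript': 375867,
    'php': 321752, 'python': 208607, 'r': 36359, 'ruby': 99930,
    'swift': 34809, 'vb': 35044,
}

total_posts = sum(counts_by_tag_example.values())
# How many times fewer vectors are scanned when searching one tag only.
reduction = {tag: total_posts / count for tag, count in counts_by_tag_example.items()}

print('Total posts:', total_posts)
for tag, factor in sorted(reduction.items(), key=lambda item: item[1]):
    print('{:<12}{:>6.1f}x'.format(tag, factor))
```

The tags are far from balanced, so the gain is not a uniform tenfold: it ranges from about 5.5x for `c#` to about 62x for `swift`.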

In [42]:
import os
os.makedirs(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], exist_ok=True)

for tag, count in counts_by_tag.items():
    tag_posts = posts_df[posts_df['tag'] == tag]
    
    tag_post_ids = tag_posts['post_id'].values
    
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, starspace_embeddings, embeddings_dim) 

    # Dump post ids and vectors to a file named after the tag.
    filename = os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as fout:
        pickle.dump((tag_post_ids, tag_vectors), fout)
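
At query time the bot can unpickle the `(tag_post_ids, tag_vectors)` pair for the predicted tag and pick the thread whose vector is closest to the question vector. The sketch below uses cosine similarity for that comparison, which is an assumption; the function name `best_thread` and the toy data are hypothetical.

```python
import numpy as np


def best_thread(question_vec, post_ids, vectors):
    """Return the post id whose vector has the highest cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(question_vec)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    similarities = vectors @ question_vec / norms
    return post_ids[int(np.argmax(similarities))]


# Toy index for one tag: three posts with two-dimensional vectors.
toy_post_ids = np.array([101, 102, 103])
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)

print(best_thread(np.array([0.9, 0.1], dtype=np.float32), toy_post_ids, toy_vectors))  # 101
print(best_thread(np.array([0.1, 0.9], dtype=np.float32), toy_post_ids, toy_vectors))  # 102
```

Only the file for the predicted tag has to be loaded, which is what keeps memory use below the size of the full database.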