In [55]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


This notebook prepares the models for a *dialogue chat bot* that will:
* answer programming-related questions (using the StackOverflow dataset);
* chit-chat and simulate dialogue on all non-programming-related questions.

For the chit-chat mode, we will use a pre-trained neural network engine available from [ChatterBot](https://github.com/gunthercox/ChatterBot).

### Data description

To detect the *intent* of users' questions we will use two text collections:
- `tagged_posts.tsv`: StackOverflow posts, each tagged with one programming language (*positive samples*).
- `dialogues.tsv`: dialogue phrases from movie subtitles (*negative samples*).

For questions with programming-related intent, we will predict the programming language (only one tag per question is allowed here) and then rank candidate threads within that tag using embeddings.
For the ranking part, we will use:
- `word_embeddings.tsv`: word embeddings trained with StarSpace.
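
To make the ranking idea concrete, here is a minimal sketch of one common way to turn a question into a single vector: averaging the embeddings of its known words. The helper `mean_embedding` and the toy embeddings are hypothetical and only illustrate the idea; the `question_to_vec` function imported from `utils` later in this notebook may behave differently.

```python
import numpy as np


def mean_embedding(text, embeddings, dim):
    """Average the vectors of the known words; return zeros if none are known."""
    vectors = [embeddings[word] for word in text.split() if word in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


# Toy 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "python": np.array([1.0, 0.0, 0.0]),
    "list": np.array([0.0, 1.0, 0.0]),
}

# "sort" is not in the toy vocabulary, so only two words contribute.
vec = mean_embedding("sort python list", toy_embeddings, dim=3)
unknown = mean_embedding("something else", toy_embeddings, dim=3)
print(vec, unknown)
```

Returning a zero vector for text without any known word keeps the output shape fixed, which is convenient when vectors are later stacked into one matrix per tag.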

This notebook produces the following artifacts for the running chat bot:

- `intent_recognizer.pkl`: intent recognition model;
- `tag_classifier.pkl`: programming language classification model;
- `tfidf_vectorizer.pkl`: vectorizer used during training;
- `thread_embeddings_by_tag`: folder with thread embeddings, arranged by tags.
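
The bot's runtime code is not part of this notebook, so the following is only a sketch of how the pickled artifacts could fit together when a question arrives. It trains tiny stand-in models on invented data, saves them under the file names listed above, reloads them, and routes a question. The `route` and `load_artifact` helpers are hypothetical, and a real bot would presumably also apply `text_prepare` to the question first.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real artifacts (invented data, for illustration only).
texts = [
    "how are you today",
    "nice to meet you",
    "python list sort error",
    "java null pointer exception",
]
intents = ["dialogue", "dialogue", "stackoverflow", "stackoverflow"]
tags = ["python", "java"]  # tags of the two programming-related texts

toy_vectorizer = TfidfVectorizer().fit(texts)
toy_intent = LogisticRegression().fit(toy_vectorizer.transform(texts), intents)
toy_tagger = LogisticRegression().fit(toy_vectorizer.transform(texts[2:]), tags)

# Save the toy artifacts under the same file names as in the list above.
artifact_dir = tempfile.mkdtemp()
for name, obj in [
    ("tfidf_vectorizer.pkl", toy_vectorizer),
    ("intent_recognizer.pkl", toy_intent),
    ("tag_classifier.pkl", toy_tagger),
]:
    with open(os.path.join(artifact_dir, name), "wb") as f:
        pickle.dump(obj, f)


def load_artifact(name):
    """Load one pickled artifact from the temporary artifact folder."""
    with open(os.path.join(artifact_dir, name), "rb") as f:
        return pickle.load(f)


def route(question, vectorizer, intent_recognizer, tag_classifier):
    """Hypothetical routing step: returns (intent, tag or None)."""
    features = vectorizer.transform([question])
    intent = intent_recognizer.predict(features)[0]
    if intent == "dialogue":
        return intent, None  # hand over to the chit-chat engine
    return intent, tag_classifier.predict(features)[0]


vectorizer = load_artifact("tfidf_vectorizer.pkl")
intent_recognizer = load_artifact("intent_recognizer.pkl")
tag_classifier = load_artifact("tag_classifier.pkl")

so_result = route("python list sort error", vectorizer, intent_recognizer, tag_classifier)
chat_result = route("how are you today", vectorizer, intent_recognizer, tag_classifier)
print(so_result, chat_result)
```

The same fitted vectorizer feeds both classifiers, which is why it is saved as a separate artifact.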

In [56]:
import numpy as np
import pandas as pd
import pickle
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from utils import text_prepare, load_embeddings, question_to_vec, unpickle_file

## Part I. Intent and language recognition

### Intent recognition

In [4]:
def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""

    vectorizer = TfidfVectorizer()
    vectorizer.fit(X_train)
    X_train = vectorizer.transform(X_train)
    X_test = vectorizer.transform(X_test)

    with open(vectorizer_path, 'wb') as vp:
        pickle.dump(vectorizer, vp)
    
    return X_train, X_test

In [5]:
# use equal-sized subsamples of both datasets to balance the classes
sample_size = 200000

dialogue_df = pd.read_csv('./data/dialogues.tsv', sep='\t').sample(sample_size, random_state=0)
stackoverflow_df = pd.read_csv('./data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)

In [6]:
dialogue_df.head()

Unnamed: 0,text,tag
82925,"Donna, you are a muffin.",dialogue
48774,He was here last night till about two o'clock....,dialogue
55394,"All right, then make an appointment with her s...",dialogue
90806,"Hey, what is this-an interview? We're supposed...",dialogue
107758,Yeah. He's just a friend of mine I was trying ...,dialogue


In [7]:
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


In [8]:
dialogue_df['text'] = [text_prepare(x) for x in dialogue_df['text']]
stackoverflow_df['title'] = [text_prepare(x) for x in stackoverflow_df['title']]

In [9]:
# prepare train, test set by concatenating dialogue and stackoverflow text
X = np.concatenate([dialogue_df['text'].values, stackoverflow_df['title'].values])
y = ['dialogue'] * dialogue_df.shape[0] + ['stackoverflow'] * stackoverflow_df.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

# transform to tfidf vectors
X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, './model_artifacts/tfidf_vectorizer.pkl')

Train size = 360000, test size = 40000


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# C=10 is the inverse regularization strength (L2 strength 1/C = 0.1);
# max_iter is raised from the default 100 to 500 so that the solver converges
intent_recognizer = LogisticRegression(penalty='l2', C=10, random_state=0, max_iter=500)
intent_recognizer.fit(X_train_tfidf, y_train)

y_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_pred)
print('test acc: {}'.format(test_accuracy))

# save the trained model; the context manager closes the file handle
with open('./model_artifacts/intent_recognizer.pkl', 'wb') as f:
    pickle.dump(intent_recognizer, f)



test acc: 0.99055


### Programming language classification 

In [7]:
X = stackoverflow_df['title'].values
y = stackoverflow_df['tag'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

# transform to tfidf vectors, reusing the vectorizer fitted for intent recognition
with open('./model_artifacts/tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

Train size = 160000, test size = 40000




In [22]:
# for simplicity, use a multiclass logistic regression classifier
from sklearn.multiclass import OneVsRestClassifier

# The wrapper forces a one-vs-rest scheme: one binary logistic regression per tag.
# Without the wrapper, LogisticRegression with multi_class='auto' (in scikit-learn versions
# that have this argument) fits a single multinomial model on non-binary targets
# when the default lbfgs solver is used.
tag_classifier = OneVsRestClassifier(
    LogisticRegression(penalty='l2', C=10, random_state=0, max_iter=500)
).fit(X_train_tfidf, y_train)

# Check train accuracy.
y_train_pred = tag_classifier.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print('Train accuracy = {}'.format(train_accuracy))

# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

# save the trained model; the context manager closes the file handle
with open('./model_artifacts/tag_classifier.pkl', 'wb') as f:
    pickle.dump(tag_classifier, f)



Train accuracy = 0.87179375
Test accuracy = 0.7778


## Part II. Ranking questions with embeddings

To find a relevant answer (a thread from StackOverflow) to a question, we use vector representations to compute the similarity between the question and existing threads.

However, it would be costly to compute such representations for all candidate threads in the bot's *online mode* (i.e. while the bot is running and answering users' questions). We solve this by creating a *database* of pre-computed representations. These representations are arranged by non-overlapping tags (programming languages), so that each search is performed within a single tag. This makes the bot more efficient and avoids keeping the whole database in RAM.
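
As a rough illustration of the search within one tag, the sketch below ranks pre-computed thread vectors against a question vector. Cosine similarity is an assumption here, since the notebook does not state which similarity measure the bot uses. The function `rank_within_tag` and the toy vectors are hypothetical as well.

```python
import numpy as np


def rank_within_tag(question_vec, post_ids, thread_vectors, top_k=1):
    """Return the ids of the top_k threads most similar to the question."""
    # Cast up from compact storage (e.g. float16) before doing arithmetic.
    vectors = thread_vectors.astype(np.float32)
    query = question_vec.astype(np.float32)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    # Guard against zero vectors (e.g. titles without any known word).
    sims = (vectors @ query) / np.where(norms == 0, 1.0, norms)
    best = np.argsort(-sims)[:top_k]
    return post_ids[best]


# Toy database for a single tag: three threads in a 2-dimensional space.
post_ids = np.array([101, 102, 103])
thread_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float16)

question_vec = np.array([0.9, 0.1])
top = rank_within_tag(question_vec, post_ids, thread_vectors, top_k=2)
print(top)
```

Because only one tag's matrix is needed per query, the cost of the matrix-vector product scales with the number of threads in that tag rather than with the whole collection.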

In [2]:
starspace_embeddings, embeddings_dim = load_embeddings('./starspace_embeddings/data/stackoverflow_duplicate.tsv')

In [3]:
posts_df = pd.read_csv('data/tagged_posts.tsv', sep='\t')

In [62]:
# create dict(tag: counts)
counts_by_tag = posts_df.groupby(['tag']).count()['title'].to_dict()

In [74]:
import os
os.makedirs('./data/thread_embeddings_by_tag', exist_ok=True)

for tag, count in counts_by_tag.items():
    tag_posts = posts_df[posts_df['tag'] == tag]
    
    tag_post_ids = tag_posts['post_id'].to_numpy()
    
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float16)
    for i, title in enumerate(tag_posts['title']):
        # embed the title with the same dimensionality as the loaded embeddings
        tag_vectors[i, :] = question_to_vec(title, starspace_embeddings, embeddings_dim)

    # Dump the (post ids, vectors) tuple to a per-tag pickle file
    filename = os.path.join('./data/thread_embeddings_by_tag', os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as f:
        pickle.dump((tag_post_ids, tag_vectors), f)
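
Each file written above holds a `(post_ids, vectors)` tuple for one tag, so the bot only has to load the file for the predicted tag. Below is a minimal sketch of such an on-demand load, using a temporary folder and invented toy data. The helper `load_tag_embeddings` is hypothetical and not part of this notebook's `utils`.

```python
import os
import pickle
import tempfile

import numpy as np


def load_tag_embeddings(folder, tag):
    """Load the (post_ids, vectors) tuple stored for a single tag."""
    with open(os.path.join(folder, "%s.pkl" % tag), "rb") as f:
        return pickle.load(f)


# Toy round trip in a temporary folder (invented ids and vectors).
folder = tempfile.mkdtemp()
toy_ids = np.array([11, 12])
toy_vectors = np.zeros((2, 4), dtype=np.float16)
with open(os.path.join(folder, "python.pkl"), "wb") as f:
    pickle.dump((toy_ids, toy_vectors), f)

ids, vectors = load_tag_embeddings(folder, "python")
print(ids.shape, vectors.shape, vectors.dtype)
```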