# Evolving vector-space model

This lab is devoted to using the `doc2vec` model for information retrieval and text classification.

## 1. Searching in the curious facts database
The facts dataset is available [here](https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt); take a look. We want you to retrieve the facts relevant to a query: for example, you type "good mood" and learn that Cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we obtain fixed-size document vectors from a `doc2vec` model.

### 1.1 Loading trained `doc2vec` model

First, let's load the pre-trained `doc2vec` model from https://github.com/jhlau/doc2vec (Associated Press News DBOW (0.6GB))

In [None]:
!pip install gensim

In [7]:
from gensim.models.doc2vec import Doc2Vec

# unpack a model into 3 files and target the main one
# doc2vec.bin  <---------- this
# doc2vec.bin.syn0.npy
# doc2vec.bin.syn1neg.npy
model = Doc2Vec.load('doc2vec.bin', mmap=None)
print(type(model))
print(type(model.infer_vector(["to", "be", "or", "not"])))

<class 'gensim.models.doc2vec.Doc2Vec'>
<class 'numpy.ndarray'>




### 1.2 Reading data

Now, let's read the facts dataset. Download it from the URL above and read it into a list of sentences.

In [8]:
#TODO read facts into list
facts = []
with open("facts.txt", "rb") as file:
    facts= file.read().decode(errors="ignore").split("\n")

### 1.3 Tests

In [9]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
2. McDonalds calls frequent buyers of their food heavy users.
3. The average person spends 6 months of their lifetime waiting on a red light to turn green.
4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
5. You burn more calories sleeping than you do watching television.


### 1.4  Transforming sentences to vectors

Transform the list of facts into a NumPy array of vectors, one per document (`sent_vecs`), by inferring them with the model we just loaded.

In [10]:
#TODO infer vectors
import nltk
import string
import numpy as np
sent_vecs =[]
pun_trans = str.maketrans('', '', string.punctuation)
remove_digits = str.maketrans('', '', string.digits)
for para in facts:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    sent_vecs.append(model.infer_vector(words))
sent_vecs = np.array(sent_vecs)
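
To make the cleaning step above concrete, here is a minimal standard-library sketch of what the two translation tables do to a single fact. It assumes plain `str.split()` as a stand-in for `nltk.word_tokenize`, and the helper name `preprocess` is hypothetical, so the tokens may differ from the notebook's on unusual input.

```python
import string

# same translation tables as in the cell above
pun_trans = str.maketrans('', '', string.punctuation)
remove_digits = str.maketrans('', '', string.digits)


def preprocess(text):
    """Drop punctuation and digits, then split on whitespace.

    Assumption: whitespace splitting replaces nltk.word_tokenize here
    so that the sketch has no third-party dependencies.
    """
    cleaned = text.translate(pun_trans).translate(remove_digits)
    return cleaned.split()


tokens = preprocess("68. Cherophobia is the fear of fun.")
print(tokens)
```

The leading fact number and the final period disappear, so only the words of the fact reach `infer_vector`.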

In [11]:
sent_vecs.shape

(159, 300)

### 1.5 Tests 

In [12]:
print(sent_vecs.shape)
assert sent_vecs.shape == (159, 300)

(159, 300)


### 1.6 Find closest

Now, reusing the code from the last lab, find the facts closest to the query using the cosine similarity measure.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
def find_k_closest(query, dataset, k=5):    
    #TODO write here the code that will find k closest rows in dataset in terms of cosine similarity
    #HINT: as vectors in dataset are already normed, cosine similarity is just dot product.
    temp_df = pd.DataFrame(dataset)
    temp_df["similarity"] =  cosine_similarity([query], dataset)[0]
    matches = []
    # keep only the k most similar rows
    for top_match in temp_df.sort_values(by=['similarity'], ascending=False).index[:k]:
        matches.append([top_match, dataset[top_match, :], temp_df["similarity"][top_match]])
    return matches
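
The hint in the cell above says that for unit-length vectors cosine similarity reduces to a dot product. The following sketch illustrates this and the top-k ranking with the standard library only. The 2-d vectors are made up for illustration, and the helpers `cosine`, `normalize`, and `top_k` are hypothetical names, not part of the lab code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def top_k(query, rows, k=5):
    """Return (row_index, similarity) pairs for the k rows most similar to query."""
    scored = [(i, cosine(query, row)) for i, row in enumerate(rows)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# toy "document vectors" (illustrative values, not model output)
toy_rows = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0]]
toy_query = [2.0, 1.0]

ranking = top_k(toy_query, toy_rows, k=2)
print(ranking)

# for unit-length vectors the plain dot product gives the same value
q_unit = normalize(toy_query)
r_unit = normalize(toy_rows[2])
dot_of_units = sum(a * b for a, b in zip(q_unit, r_unit))
print(abs(dot_of_units - cosine(toy_query, toy_rows[2])) < 1e-12)
```

If the vectors are not normalized beforehand, a plain dot product also rewards long vectors, which is why an explicit cosine computation is the safer default.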

In [14]:
#TODO output closest facts to the query
query = "good mood"
# Convert the query words into vector 
query = query.translate(pun_trans) 
query = query.translate(remove_digits)
query_vector = model.infer_vector(nltk.word_tokenize(query))
r = find_k_closest(query_vector, sent_vecs)
print("Results for query:", query)
for k, v, p in r:
    print("\t", facts[k], "sim=", p)

Results for query: good mood
	 144. Dolphins sleep with one eye open! sim= 0.6341592
	 68. Cherophobia is the fear of fun. sim= 0.59027797
	 118. An ostrichs eye is bigger than its brain sim= 0.5770811
	 57. Gorillas burp when they are happy sim= 0.5745027
	 110. Cats have 32 muscles in each of their ears. sim= 0.57116723


## 2. Training doc2vec model and documents classifier

Now we would like you to train a `doc2vec` model yourself on [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

### 2.1 Read dataset

First, read the dataset. It consists of 4 parts, which you need to merge into a single list.

In [15]:
#TODO read the dataset into list
all_data = []
def read_txt_dataset(file_name):
    with open(file_name, "rb") as file:
        file_data = file.read().decode(errors="ignore").split("\n")
        if file_data[-1].strip()=="":
            del file_data[-1]
        return file_data
all_data += read_txt_dataset("testdata_braininjury_10000docs.txt")
all_data += read_txt_dataset("testdata_news_economy_2073docs.txt")
all_data += read_txt_dataset("testdata_news_fuel_845docs.txt")
all_data += read_txt_dataset("testdata_news_music_2084docs.txt")

### 2.2 Tests 

In [16]:
print(len(all_data))
assert len(all_data) == 15002

15002


### 2.3 Training `doc2vec` model

Train a `doc2vec` model on the dataset you've loaded. An example of training is provided.

In [17]:
#TODO change this according to the task
# small set of tokenized sentences
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# just a test set of tokenized sentences
print(common_texts, "\n")

# tokenize every document and tag it with its position in all_data
corpus = [nltk.word_tokenize(doc) for doc in all_data]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
# train a model
model = Doc2Vec(
    documents,     # collection of texts
    vector_size=5, # output vector size
    window=2,      # maximum distance between the target word and its neighboring word
    min_count=1,   # ignore words with total frequency lower than this
    workers=4      # number of worker threads
)

# clean training data
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# save and load
model.save("d2v.model")
model = Doc2Vec.load("d2v.model")

vec = model.infer_vector(["system", "response"])
print(vec)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']] 



[ 0.01329018  0.04400817  0.08257095  0.04851411 -0.07365471]
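
The training cell wraps each tokenized document in a `TaggedDocument`, which pairs a token list (`words`) with a list of tags (`tags`). The sketch below shows that structure with a `namedtuple` stand-in so it runs without gensim. The name `TaggedDoc` and the two toy texts are assumptions made for illustration.

```python
from collections import namedtuple

# stand-in with the same two fields as gensim's TaggedDocument
# (assumption: used only so this sketch has no third-party dependencies)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

toy_texts = ["fuel prices rose again", "the band released a new album"]

# whitespace tokenization for brevity; the notebook uses nltk.word_tokenize
toy_corpus = [text.split() for text in toy_texts]

# tag each document with its index, as the training cell does
toy_documents = [TaggedDoc(words, [i]) for i, words in enumerate(toy_corpus)]

for doc in toy_documents:
    print(doc.tags, doc.words)
```

Using the list index as the tag means that the vector learned for tag `i` corresponds to the `i`-th entry of the original list.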


### 2.4 Form train and test datasets

Transform the documents into vectors and split the data into train and test sets. Make sure the split is stratified, because the classes are imbalanced.

In [18]:
# generate labels in the order the files were read
# braininjury = 0
# news_economy = 1
# news_fuel = 2
# news_music = 3
y = [0] * 10000 + [1] * 2073 + [2] * 845 + [3] * 2084
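
The labels above rely on hard-coded document counts. A possible alternative, sketched here under the assumption that the four parts are kept as separate lists before merging, derives the labels from the actual list lengths. The helper name `build_labels` and the toy documents are hypothetical.

```python
def build_labels(parts):
    """Merge per-class document lists and emit one integer label per document.

    The label of a document is the position of its part in `parts`.
    """
    docs, labels = [], []
    for label, part in enumerate(parts):
        docs.extend(part)
        labels.extend([label] * len(part))
    return docs, labels


# toy parts standing in for the four dataset files
toy_parts = [
    ["injury doc a", "injury doc b", "injury doc c"],  # label 0
    ["economy doc a"],                                  # label 1
    ["fuel doc a", "fuel doc b"],                       # label 2
]

toy_docs, toy_y = build_labels(toy_parts)
print(toy_y)
```

With this approach the labels stay aligned with the documents even if a file turns out to contain a different number of lines than its name suggests.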

In [19]:
#TODO transform and make a train-test split
from sklearn.model_selection import train_test_split
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
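# NOTE: this split is not stratified, although the task asks for it;
# passing stratify=y would keep the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)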

len :  15002
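
The task statement asks for a stratified split. As a minimal sketch of what stratification means, the hypothetical helper `stratified_split` below splits every class separately, using only the standard library. In the notebook itself, passing `stratify=y` to `train_test_split` is the usual way to get the same effect.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class shares in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # take the same fraction of every class for the test part
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# label counts matching the four dataset files
y_toy = [0] * 10000 + [1] * 2073 + [2] * 845 + [3] * 2084
train_idx, test_idx = stratified_split(y_toy)

test_counts = Counter(y_toy[i] for i in test_idx)
print(dict(test_counts))
```

Because each class is sampled on its own, the smallest class cannot end up under-represented in the test set by chance.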


### 2.5 Train topics classifier

Train a classifier that assigns any document to one of four categories: fuel, brain injury, music, and economy.
Print a classification report for test data.

In [21]:
#TODO train a classifier and measure its performance
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Accuracy score :  0.8770409863378874
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2020
           1       0.59      0.73      0.65       399
           2       0.37      0.19      0.25       197
           3       0.74      0.75      0.74       385

   micro avg       0.88      0.88      0.88      3001
   macro avg       0.67      0.67      0.66      3001
weighted avg       0.87      0.88      0.87      3001



Which class is the hardest one to recognize?
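
One way to answer this from the report above is to compare the per-class f1-scores. The sketch below copies the f1-score and support columns from that report, picks the class with the lowest f1-score, and recomputes the macro and weighted averages to show how they are formed. The dictionary layout is an assumption made for illustration.

```python
# per-class f1-score and support copied from the classification report above
report = {
    0: {"f1": 1.00, "support": 2020},  # braininjury
    1: {"f1": 0.65, "support": 399},   # news_economy
    2: {"f1": 0.25, "support": 197},   # news_fuel
    3: {"f1": 0.74, "support": 385},   # news_music
}

# the hardest class is the one with the lowest f1-score
hardest = min(report, key=lambda c: report[c]["f1"])

total = sum(r["support"] for r in report.values())
# macro average: every class counts equally
macro_f1 = sum(r["f1"] for r in report.values()) / len(report)
# weighted average: every class counts in proportion to its support
weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total

print(hardest, round(macro_f1, 2), round(weighted_f1, 2))
```

The gap between the macro and the weighted average reflects the class imbalance: the large, easy class 0 dominates the weighted figure, while the macro figure exposes the weak minority classes.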

### 2.6 Bonus task

What if we trained our `doc2vec` model with a window size of 5 or 10? Would it improve the classification accuracy? What about vector dimensionality: does increasing it lead to better classification performance?

Explore the influence of these parameters on classification performance, visualizing it as a graph (e.g. window size vs f1-score, vector dim vs f1-score).

In [22]:
def train_doc2vec(documents, vector_size, window, model_name="d2v.model"):
    model = Doc2Vec(
        documents,     # collection of texts
        vector_size=vector_size, # output vector size
        window=window,      # maximum distance between the target word and its neighboring word
        min_count=1,   # ignore words with total frequency lower than this
        workers=4      # number of worker threads
    )

    # clean training data
    model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

    # save the model to disk; it is loaded again by the caller
    model.save(model_name)

### vector size = 5, window size = 5

In [23]:
model_name = "d2v_window5.model"
train_doc2vec(documents, vector_size=5, window=5, model_name=model_name)

In [24]:
model = Doc2Vec.load(model_name)

In [25]:
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)



neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

len :  15002
Accuracy score :  0.8830389870043319
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2014
           1       0.63      0.69      0.66       409
           2       0.37      0.24      0.29       174
           3       0.74      0.77      0.76       404

   micro avg       0.88      0.88      0.88      3001
   macro avg       0.68      0.68      0.68      3001
weighted avg       0.88      0.88      0.88      3001



### vector size = 5, window size = 10

In [27]:
model_name = "d2v_window10.model"
train_doc2vec(documents, vector_size=5, window=10, model_name=model_name)
model = Doc2Vec.load(model_name)

In [28]:
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)



neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

len :  15002
Accuracy score :  0.8810396534488504
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2018
           1       0.59      0.70      0.64       398
           2       0.38      0.23      0.29       163
           3       0.76      0.74      0.75       422

   micro avg       0.88      0.88      0.88      3001
   macro avg       0.68      0.67      0.67      3001
weighted avg       0.88      0.88      0.88      3001



### vector size = 50, window size = 2

In [30]:
model_name = "d2v_vector50.model"
train_doc2vec(documents, vector_size=50, window=2, model_name=model_name)
model = Doc2Vec.load(model_name)

In [31]:
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)



neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

len :  15002
Accuracy score :  0.895034988337221
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1970
           1       0.68      0.74      0.71       438
           2       0.47      0.31      0.38       153
           3       0.79      0.78      0.78       440

   micro avg       0.90      0.90      0.90      3001
   macro avg       0.73      0.71      0.72      3001
weighted avg       0.89      0.90      0.89      3001



### vector size = 100, window size = 2

In [32]:
model_name = "d2v_vector100.model"
train_doc2vec(documents, vector_size=100, window=2, model_name=model_name)
model = Doc2Vec.load(model_name)

In [33]:
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)



neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

len :  15002
Accuracy score :  0.9046984338553815
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2014
           1       0.69      0.77      0.73       398
           2       0.57      0.37      0.45       171
           3       0.80      0.79      0.80       418

   micro avg       0.90      0.90      0.90      3001
   macro avg       0.76      0.73      0.74      3001
weighted avg       0.90      0.90      0.90      3001



### vector size = 100, window size = 5

In [34]:
model_name = "d2v_vector100_win5.model"
train_doc2vec(documents, vector_size=100, window=5, model_name=model_name)
model = Doc2Vec.load(model_name)

In [35]:
corpus_vectors = []
for para in all_data:
    para = para.translate(pun_trans) 
    para = para.translate(remove_digits) 
    words = nltk.word_tokenize(para)
    corpus_vectors.append(model.infer_vector(words))
print("len : ", len(corpus_vectors))
X_train, X_test, y_train, y_test = train_test_split(corpus_vectors, y, test_size=0.2)



neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy score : ", neigh.score(X_test, y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

len :  15002
Accuracy score :  0.9003665444851716
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1999
           1       0.66      0.76      0.71       406
           2       0.55      0.43      0.48       160
           3       0.83      0.75      0.79       436

   micro avg       0.90      0.90      0.90      3001
   macro avg       0.76      0.73      0.74      3001
weighted avg       0.90      0.90      0.90      3001
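
The macro-averaged f1-scores reported in the runs above can be collected in one place for comparison. The sketch below copies them from the classification reports and prints a simple text bar chart using the standard library only. The dictionary layout and the bar scaling are assumptions made for illustration, and a plotting library could draw the same data as a proper graph.

```python
# (vector_size, window) -> macro-averaged f1-score, copied from the reports above
macro_f1 = {
    (5, 2): 0.66,
    (5, 5): 0.68,
    (5, 10): 0.67,
    (50, 2): 0.72,
    (100, 2): 0.74,
    (100, 5): 0.74,
}

# one row per configuration, with a bar proportional to the score
for (vector_size, window), score in sorted(macro_f1.items()):
    bar = "#" * round(score * 50)
    print(f"vector_size={vector_size:>3} window={window:>2} macro-f1={score:.2f} {bar}")

# configurations that share the highest score
best_score = max(macro_f1.values())
best_settings = sorted(k for k, v in macro_f1.items() if v == best_score)
print(best_settings)
```

In these runs, larger vectors go together with higher macro f1-scores, while changing the window size makes little difference. Each run used a single unseeded train-test split, so small differences between configurations may be noise rather than a real effect.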

