# Advanced Topics in Data Mining and Knowledge Discovery 
## Word2Vec
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. [Wikipedia](https://en.wikipedia.org/wiki/Word2vec)

## Questions

1. What are word embeddings?

Word embedding is a vector representing a particular word. The vecotr is capiable to capture context of a word in a document, semnatic and synthatic similarity and relation with other words.

2. Propose a way to create sentence (or full review) embeddings out of word embeddings. What will happen if you use your proposed embeddings on very long texts?

Make an average of all the word embeddings in the sentence will give us the sentence embedding. If i will use this approach on very long text than the sentence embedding will be less accurate.

3. What are the advantages of using embedding as opposed to other vector representations (counter vectorization, tf-idf)?

The advantages of using embedding over other vector representations are -  
1. Word embedding vector is multi dimentional vector that capture a word relationship to other words in contrast of others methods that can't capture the relation between words.
2. Word embedding can be trainned on large external data.
3. Can be applied to each word and not the whole document.
4. Ideal for problems involving single word such as a word translation.

# CODE

In [1]:
# imports - add additional imports here
import pandas as pd


In [42]:
df = pd.read_csv('https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv')

4. a. Explain the following Word2Vec parameters:
        * min_count
        * size
        * window
    b. Prepare you data for the word2vec model.
    
    c. Create a word2vec model using Gensim, what vector size will you choose? 

4.a.  
min count - The minimum number of word frequency that will get into the calculation. Ignores all words with total frequency lower than this.  
size - Number of dimensions of the word vector.  
window - Maximum distance between the current and predicted word within a sentence.  

In [43]:
# Question number 4.b.
import nltk
import string
import re
from copy import deepcopy
from nltk.tokenize import  word_tokenize, sent_tokenize
from sklearn.model_selection import train_test_split
import numpy as np


# nltk.download() - You must download nltk for using 
stopwords = set(nltk.corpus.stopwords.words('english') + ['reuter', '\x03'])
text_column = 'review_text'

def pre_process(row, column_name):
    text_column = deepcopy(row[column_name])

    # Replace numbers wih 'num'
    text_column = re.sub(r'\d+', 'num', text_column)

    # Tokenize
    tokenized_row = word_tokenize(text_column.lower())

    # remove stop words + lower + remove punctuation
    for word in tokenized_row:
        if word in stopwords or word in string.punctuation:
            tokenized_row.remove(word)

    return tokenized_row


df[f"final_{text_column}"] = df.apply(lambda row: pre_process(row, text_column), axis=1)

# Split to train and validation

X = df[f"final_{text_column}"]
train, validation = train_test_split(df, test_size=0.3)
msk = np.random.rand(len(df)) < 0.7

X_train = X[msk]
y_train = df['business_category'][msk]

X_validation = X[~msk]
y_validation = df['business_category'][~msk]


X_train

0       [first, place, tried, toronto, spa-hopping, pr...
3       [value, world, one, messed, places, earth, ima...
4       [live, num, num, miles, family, friends, ea, g...
5       [awful, experience, dealing, sales, n't, even,...
6       [lavo, become, go-to, place, bouncing, back, s...
8       [came, num\/num, visiting, tattoo, places, str...
9       [one, irritating, inevitabilities, living, val...
10      [horrible, here, 's, letter, num, characters, ...
12      [tall, drink, brunette, bugs, bunny, special, ...
15      [num.num, stars, 've, tried, zak, 's, chocolat...
16      [planned, vegas, buffets, num, tips, buddies, ...
18      [start, dreadful, seriously, thee, worst, gues...
20      [woke, one, morning, went, take, photo, my, do...
21      [say, minus, num, could, go, pages, absolute, ...
22      [review, 'ritche, bridal, super, sale, happens...
23      [hospitality, amazing, really, memorable, expe...
24      [pittsburgh, area, it, 's, important, take, ti...
25      [sin, 

In [44]:
import gensim

# Question 4.c.
model = gensim.models.Word2Vec(X_train,min_count=5,size=200)

I choosed size=200 because i checked the accuracy of the model for different sizes (number of dimensions) and i discovered that 200 is big enough to get good accuracy and not overrfit the trainning. Also calculate time was not too long.

5. What is the models vocabulary size? 

In [45]:
print(f"Model vocabulary size is - {len(model.wv.vocab)}")

Model vocabulary size is - 10561


6. a. Display the 10 most similar words to 'good' according to the Word2Vec model.

    b. Explain why 'bad'/'terrible' are similar to 'good' according to the model.

In [46]:
# Question 6.a - 
model.wv.most_similar('good')

[('decent', 0.8184503316879272),
 ('great', 0.7816325426101685),
 ('impressed', 0.7194135785102844),
 ('bad', 0.7181156873703003),
 ('much', 0.7090676426887512),
 ('disappointing', 0.705264687538147),
 ('forgettable', 0.6867215633392334),
 ('tasty', 0.6860816478729248),
 ('big', 0.6811453104019165),
 ('amazing', 0.6783066987991333)]

Answer to 6.b -  
The words 'bad'/'terrible' are similiar to 'good' because all of them are adjectives that usally exists in the same location/context in the sentence (between similar other words). The Word2Vec algorithm depend on the context of a world in the sentence and that why all of those adjactives usally exists in the same context.

# Classifier

7. Create review embeddings using the method you suggested in question 2.

In [47]:
import numpy

def calculate_sentence_embedding(tokenized_review_column, word_embed_model):
    tokenized_review_column = deepcopy(tokenized_review_column)
    review_word_vectors = []
    for word in tokenized_review_column:
        if word in word_embed_model.wv.vocab:
            review_word_vectors.append(model.wv[word])

    review_embedding = numpy.average(review_word_vectors, axis=0)
    return review_embedding


X_train_sentence_embedding = X_train.apply(
    lambda row: calculate_sentence_embedding(
        row, word_embed_model=model)
)

X_validation_sentence_embedding = X_validation.apply(
    lambda row: calculate_sentence_embedding(
        row, word_embed_model=model)
)

X_train_sentence_embedding

0       [-0.32588437, 0.1845494, 0.20753913, 0.0462646...
3       [-0.369014, 0.23160389, 0.19598591, -0.0051019...
4       [-0.35817385, 0.24639024, 0.24574226, 0.112439...
5       [-0.31849933, 0.24133524, 0.22826841, 0.119337...
6       [-0.25779817, 0.1596853, 0.14181164, 0.0307057...
8       [-0.32832667, 0.19891585, 0.19202474, 0.125309...
9       [-0.34537798, 0.20179005, 0.1696644, -0.020736...
10      [-0.39839554, 0.27014935, 0.24044673, 0.077784...
12      [-0.28762263, 0.13525726, 0.13131414, 0.072323...
15      [-0.3047567, 0.15302491, 0.16422091, 0.0056613...
16      [-0.32582533, 0.16393247, 0.15367512, 0.056921...
18      [-0.4025262, 0.2358636, 0.23800135, 0.11219228...
20      [-0.3940325, 0.20872073, 0.22691229, 0.0368339...
21      [-0.28505367, 0.16458203, 0.18012623, -0.00957...
22      [-0.3037446, 0.18124746, 0.19464967, 0.0445338...
23      [-0.26883644, 0.17279707, 0.14958753, -0.02848...
24      [-0.14443347, 0.1389643, 0.09262905, 0.0020757...
25      [-0.29

8. Create a classifier and use it to classify the reviews.

In [48]:
from sklearn.ensemble import ExtraTreesClassifier

extra_tree = ExtraTreesClassifier(n_estimators=200)
classifier = extra_tree.fit(X_train_sentence_embedding.tolist(), y_train)


pred = classifier.predict(X_validation_sentence_embedding.tolist())

9. Calculate the accuracy, precision, recall and F1 score on the validation set.

In [49]:
from sklearn import metrics
from sklearn.metrics import f1_score,precision_score, recall_score

confusion_matrix = metrics.confusion_matrix(y_validation, pred)
f1 = f1_score(y_validation, pred, average="macro")
precision = precision_score(y_validation, pred, average="macro")
recall = recall_score(y_validation, pred, average="macro")


print(f"Accuracy is - {np.mean(pred == y_validation) * 100}%")
print(f"F1 score is - {f1}")
print(f"Precision score is - {precision}")
print(f"Recall score is - {recall}")
print("Confusion Matrix is - ")
pd.DataFrame(confusion_matrix)

Accuracy is - 85.34161490683229%
F1 score is - 0.833013770824716
Precision score is - 0.8358895591301758
Recall score is - 0.8319238417459321
Confusion Matrix is - 


Unnamed: 0,0,1,2
0,193,2,33
1,11,344,9
2,36,27,150
