# Text Embeddings

Before you start working on this notebook, have a look at this [blogpost](https://towardsdatascience.com/bow-to-bert-2695cdb19787).

The general goal in NLP is to learn from text data by transforming it into vectors while keeping the semantic meaning of each word and its context. These vectors are what we will call embeddings.

Depending on the goal of our project, we may want one embedding per word in our text, or a single embedding that represents the whole text (i.e. all the text from each of the restaurant reviews).

Coming back to the machine learning protocol, embeddings are the vectors that we feed to a model (e.g. a classifier, linear model, clustering algorithm or neural network) to train it and make predictions. Which model we use depends on the data science problem we face; we will dive deep into different use cases over the next days.

In this notebook we are going to focus on producing embeddings that describe the whole text, that is, one vector per restaurant review. We are going to start with the traditional and more intuitive methods:
* Bag of Words (BOW)
* TF-IDF

In [None]:
import re
import time
import nltk
import gensim
import pandas as pd
import numpy as np
import seaborn as sns
from nltk.corpus import wordnet
from datetime import datetime
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
def visualize_wordcloud_dict_frequencies(dict_freqs, title, relative_scaling=0.5, max_words=100,
                                         background_color='black'):
    """Plot a word cloud from a {word: frequency} dictionary."""
    plt.figure(figsize=(10, 10))
    wordcloud = WordCloud(
        width=900, height=500, max_words=max_words, relative_scaling=relative_scaling,
        normalize_plurals=False, background_color=background_color,
    ).generate_from_frequencies(dict_freqs)
    plt.title(title)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

### Load data

In [None]:
path_to_file = '../../datasets/Class_Exercises_for_Students/Ex 6.2. ratings.csv'
data = pd.read_csv(path_to_file)

In [None]:
data.shape

In [None]:
samples = data['review'].dropna()

## Bag of Words
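
A bag of words represents each document by how many times each vocabulary word occurs in it, ignoring word order. The snippet below is a minimal sketch on a hypothetical three-review toy corpus (`toy_reviews` and the other `toy_*` names are illustrative and not part of the dataset), so that the resulting count matrix can be inspected by eye before running the vectorizer on all the reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_reviews = [
    "the pizza was great",
    "the service was slow",
    "great pizza great service",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)  # scipy sparse matrix

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocabulary)
print(toy_counts.toarray())  # one row per review, one column per word
```

Each row is the embedding of one review and each column counts one word, which is the same layout that `df_X` has for the full dataset.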

In [None]:
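start = time.time()

# Keep only the 100 most frequent terms across the whole corpus
matrix = CountVectorizer(max_features=100)
# fit_transform returns a sparse matrix; toarray() converts it to a dense numpy array
X = matrix.fit_transform(samples).toarray()

end = time.time()
print("It took {} sec to fit and transform all documents.".format(end - start))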

In [None]:
# Build the column names dictionary -> ordered dataframe
bow_dict = matrix.vocabulary_
df_baw_voc = pd.DataFrame({'column_name': list(bow_dict.keys()), 'column_index': list(bow_dict.values())})
df_baw_voc = df_baw_voc.sort_values(by='column_index')

In [None]:
# Build the matrix dataframe with the right columns
df_X = pd.DataFrame(X)
df_X.columns = df_baw_voc['column_name'].tolist()

In [None]:
title = "Bag of Words all reviews"
d_freq_bow = df_X.sum().to_dict()
visualize_wordcloud_dict_frequencies(d_freq_bow, title, relative_scaling=0.5, max_words=1000,
                                     background_color='white')

In [None]:
# Exercise 1. You are transforming 1319968 reviews into vectors of 100 dimensions in 42s.
# Do you think that sklearn performs this operation column-wise or row-wise? Compare with
# the time that the tokenization of the reviews took with a row-based iterative process
# in the previous notebook.

In [None]:
# Exercise 2. Do you see words in the wordcloud that you would consider stopwords? Go to
# the sklearn CountVectorizer class documentation and find out how to fix this problem.
# Rerun the cells and have a look at the wordcloud: which words have the highest frequency now?

In [None]:
# Exercise 3. Watch the RAM usage bar while you transform the reviews into
# bag-of-words vectors. The parameter "max_features" determines the number of
# words that describe each review in your dataset (the dimensions). Play around with it,
# visualizing the wordcloud for each setup, and argue which number is optimal
# from your perspective and for which goal you think it is optimal.

## TF-IDF
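
TF-IDF reweights the raw counts so that words occurring in many documents count less than words that are specific to a few. With its default settings, scikit-learn's `TfidfVectorizer` multiplies each count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 norm. The sketch below reproduces that computation by hand on a hypothetical toy corpus (all `toy_*` and `manual_*` names are illustrative) and compares it with the library result; it assumes the vectorizer is used with default parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_reviews = [
    "the pizza was great",
    "the service was slow",
    "great pizza great service",
]

# Term frequencies: raw counts per review
toy_counts = CountVectorizer().fit_transform(toy_reviews).toarray()

# Smoothed inverse document frequency
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)        # number of reviews containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and scale each row to unit L2 norm
weighted = toy_counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Library result with default parameters, for comparison
sklearn_tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

When `max_df` and `min_df` are given as floats, they are read as proportions of documents: terms appearing in more than `max_df` of the reviews, or in fewer than `min_df` of them, are dropped from the vocabulary.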

In [None]:
start = time.time()

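# Drop terms that appear in more than 50% of the reviews (max_df)
# or in fewer than 10% of them (min_df)
vectorizer = TfidfVectorizer(max_df=0.5, min_df=0.1)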
X_tfidf = vectorizer.fit_transform(samples)

end = time.time()
print("It took {} sec to fit and transform all documents.".format(end - start))

In [None]:
X_tfidf.shape

In [None]:
# Build the column names dictionary -> ordered dataframe
tfidf_dict = vectorizer.vocabulary_
df_tf_idf_voc = pd.DataFrame({'column_name': list(tfidf_dict.keys()), 'column_index': list(tfidf_dict.values())})
df_tf_idf_voc = df_tf_idf_voc.sort_values(by='column_index')

In [None]:
# Build the matrix dataframe with the right columns
df_X_tfidf = pd.DataFrame(X_tfidf.toarray())
df_X_tfidf.columns = df_tf_idf_voc['column_name'].tolist()

In [None]:
title = "TF-IDF vector representation of all reviews"
d_freq_tfidf = df_X_tfidf.sum().to_dict()
visualize_wordcloud_dict_frequencies(d_freq_tfidf, title, relative_scaling=0.5, max_words=500,
                                     background_color='black')

In [None]:
# Exercise 4. As in the previous section, find out how to exclude stopwords and play around
# to find the best values for the hyperparameters "max_df" and "min_df". Which criteria did
# you base your choice on?

## Cosine similarity between reviews

Most likely you realized during the exercises that embeddings can be optimized with a specific goal in mind (e.g. capturing words related to sentiment in order to predict review ratings).

To evaluate what kind of semantic information the generated embeddings capture, we can use a pairwise distance (or similarity) metric to find out which reviews are closest to each other according to the knowledge encoded by each setup.

For high-dimensional vectors we will use cosine similarity (or, equivalently, cosine distance, which is 1 minus the similarity).
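
Cosine similarity measures the angle between two vectors and ignores their length: for vectors `a` and `b` it is `a · b / (‖a‖ ‖b‖)`. The following is a minimal sketch on a hypothetical toy corpus (the `toy_*` names, `a`, `b`, `manual_similarity` and `ranking` are illustrative) that checks the formula against `cosine_similarity` from sklearn and ranks the reviews by similarity to a reference review.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration
toy_reviews = [
    "great pizza and great service",
    "the pizza was great",
    "slow service and cold food",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_reviews)  # sparse matrix

# Similarity of the first review against every review (itself included).
# Slicing a sparse matrix with a single row index keeps it 2-D, as required.
toy_similarities = cosine_similarity(toy_tfidf[0], toy_tfidf).ravel()

# The same value for one pair of reviews, computed directly from the formula
a = toy_tfidf[0].toarray().ravel()
b = toy_tfidf[1].toarray().ravel()
manual_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Review indices ordered from most to least similar to the reference
ranking = np.argsort(toy_similarities)[::-1]
print(toy_similarities, manual_similarity, ranking)
```

The reference review always has similarity 1 with itself, so it appears first in the ranking, and so does any exact duplicate of it.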

#### Get the vector of a reference review containing the word "great"

In [None]:
# We want to know whether reviews that are similar to one with a high TF-IDF weight for
# the word "great" are also positive reviews, and the other way around. Let's use
# review number 1162740 as our reference.
df_X_tfidf['great'].sort_values().tail()

In [None]:
# We'll take one review that contains "great"
samples.iloc[1162740]

In [None]:
# Make a copy of the original df to avoid problems with different dimensionality
df_X_tfidf_ = df_X_tfidf.copy()

# Have a look at the sklearn function "cosine_similarity" to see how cosine similarity
# is calculated. It takes two 2-D matrices. Here we compute the cosine similarity of
# review 1162740 against all the reviews (itself included). Despite the variable
# name, the result holds similarities: higher values mean more similar reviews.
reference_review_matrix = np.expand_dims(np.array(df_X_tfidf_.iloc[1162740].values), axis=0)
distances_to_reference_review = cosine_similarity(reference_review_matrix, X_tfidf)

In [None]:
# Let's create a dataframe with the results
sim_df = pd.DataFrame(distances_to_reference_review).transpose()
sim_df.head()

In [None]:
# Get the most similar
similar = sim_df.nlargest(10, [0])
for i in similar.index:
    print(samples.iloc[i], "\n")

In [None]:
# What's going on? Check for duplicates and repeat the process to get
# most similar reviews.



In [None]:
# Get the most dissimilar
dissimilar = sim_df.nsmallest(10, [0])
for i in dissimilar.index:
    print(samples.iloc[i], "\n")

In [None]:
# Exercise 5. Is this the separation of reviews by customer satisfaction that you
# expected? If not, play around with the two text embedding methods, their hyperparameters
# and the code until you feel familiar with the whole process. Now try to find the best embeddings.