[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/04-basic-sentiment-classification-supervised-classification-embeddings.ipynb)

# The NLP revolution: embeddings

The main takeaway of the previous exercise is that we need to transform text into a mathematical representation (vectors)
so that we can feed the documents to machine learning algorithms, which then learn to map the features to the labels.

Word embeddings are a type of vector representation for words in natural language processing (NLP) that have revolutionized the field by enabling more accurate and efficient text analysis.

As we have seen before, early vectorization techniques like CountVectorizer or TF-IDF have limitations in capturing the complex relationships between words and their meanings. They produce only a sparse representation over the vocabulary of the documents, with one dimension per word.
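To make the limitation concrete, here is a minimal sketch of count vectors written in plain Python (not scikit-learn's `CountVectorizer`). The three toy documents and the helper names `count_vector` and `one_hot` are assumptions made for illustration. Every word gets its own dimension, so two near-synonyms such as "good" and "great" end up with vectors that have nothing in common.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["the food was good", "the food was great", "the service was bad"]

# Vocabulary: one dimension per distinct word in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return the bag-of-words count vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def one_hot(word):
    """Return the count vector of a single word."""
    return count_vector(word)

vectors = [count_vector(doc) for doc in docs]

# "good" and "great" occupy different dimensions, so their dot product is zero
dot_good_great = sum(a * b for a, b in zip(one_hot("good"), one_hot("great")))

print(vocab)
print(vectors[0])
print(dot_good_great)
```

With a realistic vocabulary of tens of thousands of words, almost every entry of such a vector is zero, which is what "sparse" means here.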

To address this issue, more sophisticated techniques like Global Vectors for Word Representation (**GloVe**) and **Word2vec** were developed.

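These techniques learn dense vector representations from large text corpora by taking into account the context in which each word appears. Word2vec trains a shallow neural network to predict a word from its neighbours (or the neighbours from the word), while GloVe fits the vectors to global word co-occurrence statistics.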
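One common way to take context into account is the skip-gram setup used by Word2vec: every word is paired with the words that appear within a small window around it, and those pairs become the training examples. The sketch below only builds the pairs; it does not train anything. The function name `context_pairs` and the window size are assumptions chosen for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the food was good".split()
pairs = context_pairs(tokens, window=1)
print(pairs)
```

Words that keep showing up with the same context words receive similar training signals, which is why they end up with similar vectors.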

<img src="https://file.notion.so/f/f/003df94c-172d-46b4-9c84-4a2f90ef0ed1/2ac17577-2a5d-438c-89e3-3ed0a60a74e6/Untitled.png?id=922ae51c-a398-42c2-9b57-43ed4e0f99b9&table=block&spaceId=003df94c-172d-46b4-9c84-4a2f90ef0ed1&expirationTimestamp=1705622400000&signature=JZp11wX0MDrgzBd9lRbRAkDCj5nReB4-plB1R7Gkni4&downloadName=Untitled.png" width="400" height="200">

# Import the basic libraries

In [1]:
import numpy as np

In [2]:
import pandas as pd

**We will use gensim to load the pre-trained vector embeddings**

In [3]:
from gensim.models import KeyedVectors

In [4]:
import gensim.downloader as gensim_api

In [5]:
import warnings

In [6]:
warnings.filterwarnings("ignore")

### Download the pre-trained word2vec embeddings

In [7]:
from pathlib import Path

In [13]:
#model_path = '/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'
model_path = '/Users/jhuguet/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'

In [14]:
file_exists = Path(model_path).is_file()

In [15]:
file_exists

True

In [16]:
if not file_exists:
    model_name = 'word2vec-google-news-300'
    model_path = gensim_api.load(model_name, return_path=True)

#### Load the pre-trained word2vec embeddings

In [17]:
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

This is a look-up table that has words as keys and embeddings as values.

>Now, we can retrieve the embedding of any word in the model's vocabulary

In [18]:
w2v_model["potato"].shape

(300,)

In [19]:
w2v_model["potato"]

array([-2.92968750e-01,  2.94921875e-01,  7.71484375e-02,  4.58984375e-01,
        7.66601562e-02,  6.44531250e-02,  1.80664062e-01, -1.57226562e-01,
       -5.46875000e-02, -3.75976562e-02,  2.16796875e-01, -8.85009766e-03,
       -9.71679688e-02,  2.81982422e-02, -2.91015625e-01,  1.33789062e-01,
       -1.25000000e-01,  1.67968750e-01, -2.15820312e-01, -1.36718750e-01,
       -1.69921875e-01,  2.53906250e-01,  1.23535156e-01,  1.62109375e-01,
        4.41894531e-02, -1.06445312e-01, -1.08886719e-01,  1.78710938e-01,
        1.30859375e-01, -9.33837891e-03, -2.69531250e-01, -1.71875000e-01,
        1.26953125e-01,  1.58203125e-01, -5.41992188e-02,  1.13281250e-01,
        4.24804688e-02,  8.10546875e-02,  3.94531250e-01,  3.36914062e-02,
        1.45874023e-02, -2.16796875e-01, -9.13085938e-02,  9.94873047e-03,
       -4.46777344e-02, -3.30078125e-01,  4.07714844e-02,  9.37500000e-02,
       -1.95312500e-02,  1.07421875e-01, -1.00585938e-01,  3.68652344e-02,
        1.39648438e-01,  

Let's see how we can operate with these vectors

<img src="https://file.notion.so/f/f/003df94c-172d-46b4-9c84-4a2f90ef0ed1/b5618b19-dbe1-435c-a7f2-0ae858a092f4/Untitled.png?id=f17dd007-e5aa-4ff4-a3a9-8200d4dd1d18&table=block&spaceId=003df94c-172d-46b4-9c84-4a2f90ef0ed1&expirationTimestamp=1705622400000&signature=MUqZi9enFd5_-yK0-DzoZRGp2ADJoFfutYAMpm4mPxw&downloadName=Untitled.png" width="400" height="200">

In [20]:
diff = w2v_model["king"] - w2v_model["man"] + w2v_model["woman"]

In [21]:
w2v_model.similar_by_vector(diff)

[('king', 0.8449392318725586),
 ('queen', 0.7300516366958618),
 ('monarch', 0.6454660296440125),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
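In the output above, "king" ranks first because the combined vector is still very close to one of its own inputs. A common convention is to drop the query words from the ranking; gensim's `most_similar(positive=[...], negative=[...])` follows it. The sketch below reproduces the idea in plain Python on hand-made 3-dimensional toy vectors. These are not real embeddings, and the helpers `cosine` and `analogy` are hypothetical names used only for illustration.

```python
import math

# Hand-made toy vectors with dimensions (royalty, male, female). Not real embeddings.
toy_vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative),
    skipping the query words themselves."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    query_words = set(positive) | set(negative)
    scored = [(word, cosine(target, vec))
              for word, vec in vectors.items() if word not in query_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(ranking[0][0])  # queen
```

With the real model, the equivalent call would be `w2v_model.most_similar(positive=["king", "woman"], negative=["man"])`; its output is not reproduced here.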

## Let's use these embeddings as features for a classifier...

In [22]:
from sklearn.linear_model import LogisticRegression

#### Let's simulate some reviews...

In [23]:
reviews_docs = ["The food was good",
                "The service was bad",
                "The service and the food were good",
               ]

labels = ["positive",
          "negative",
          "positive"]

### Let's average the word vectors to calculate a sentence embedding...

In [24]:
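def sentence_embedding(sentence):
    # Keep only the words that exist in the model's vocabulary
    vecs = [w2v_model[x] for x in sentence.split() if x in w2v_model]
    # Average the word vectors; fall back to zeros if no word is known
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v_model.vector_size)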
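The averaging step can be checked by hand on a toy example. The sketch below uses made-up 2-dimensional vectors instead of the real 300-dimensional model, and the helper `mean_embedding` with its zero-vector fallback is an assumption of this sketch, not part of gensim. It shows two properties of mean pooling: out-of-vocabulary words are simply skipped, and word order is lost.

```python
# Made-up 2-dimensional vectors standing in for the real embeddings
toy_vectors = {
    "food": [1.0, 0.0],
    "good": [0.0, 1.0],
    "bad":  [0.0, -1.0],
}

def mean_embedding(sentence, vectors, dim=2):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]

print(mean_embedding("the food was good", toy_vectors))  # [0.5, 0.5]
print(mean_embedding("xyzzy", toy_vectors))              # [0.0, 0.0]

# Word order does not matter for an average
same = mean_embedding("good food", toy_vectors) == mean_embedding("food good", toy_vectors)
print(same)  # True
```

Because order is discarded, sentences built from the same words in a different arrangement get identical features, which is one reason this approach is only a baseline.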

In [25]:
embeddings = [sentence_embedding(r) for r in reviews_docs]

In [26]:
reviews = pd.DataFrame(embeddings, index=reviews_docs)

In [27]:
reviews["sentiment"] = labels

In [28]:
reviews

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The food was good | -0.071991 | 0.126236 | 0.027252 | 0.056519 | -0.034729 | 0.034439 | -0.003113 | -0.056004 | 0.040527 | 0.075195 | ... | -0.01062 | -0.116821 | 0.165497 | 0.005157 | -0.001099 | -0.028778 | 0.001099 | 0.120356 | -0.068047 | positive |
| The service was bad | -0.003998 | 0.090164 | 0.126465 | -0.01947 | 0.003479 | 0.011475 | 0.03891 | -0.001316 | 0.131226 | 0.143066 | ... | 0.014526 | -0.098389 | 0.125153 | 0.010895 | 0.049561 | -0.077332 | 0.01062 | 0.095215 | -0.08197 | negative |
| The service and the food were good | -0.044434 | 0.090983 | 0.027608 | 0.056885 | -0.032104 | -0.014638 | 0.011332 | -0.049624 | 0.036987 | 0.087952 | ... | -0.03953 | -0.084605 | 0.112651 | 0.017578 | 0.040965 | -0.048299 | 0.014771 | 0.080278 | -0.078211 | positive |


### Let's use these features as input for the classifier

Note: this exercise only demonstrates how we can learn from features extracted from text. A real project needs many more examples and a proper modelling process that includes a train/test split and validation.

In [29]:
clf_lr = LogisticRegression()

In [30]:
X = reviews.drop(columns=["sentiment"])

In [31]:
y = reviews["sentiment"]

In [32]:
clf_lr.fit(X, y)

### Let's make a prediction...

In [33]:
new_review = "the weather was very good"

In [34]:
new_review_vect = sentence_embedding(new_review)

In [35]:
clf_lr.predict([new_review_vect])

array(['positive'], dtype=object)

## MAIN TAKEAWAY

> Thanks to neural networks, word embeddings have evolved from simple sparse representations to dense representations that take into account the context in which a word normally appears. In addition, arithmetic on these vectors behaves consistently with word meaning, as the king/queen example shows.