<a href="https://colab.research.google.com/github/james-monahan/Code-school-notebooks/blob/main/Week-14-nlp-regex/Word_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word embedding manipulation

## Load pre-trained embeddings

You could train your own word embeddings (using a library like [gensim](https://radimrehurek.com/gensim/models/word2vec.html)) if you want. However, you would need a lot of text, and you would have to choose many parameters (the size of the context window, the dimension of the embeddings, which algorithm to use, etc.).

Why go through all that hassle when you can just use embeddings that specialists in the field have already trained on huge corpora?

[SpaCy](https://spacy.io/usage/models) is an NLP library that provides such embeddings.

### Run the code below:

In [None]:
# Download the embeddings

!python3 -m spacy download en_core_web_md

# Load them

import en_core_web_md
nlp = en_core_web_md.load()

### Some optional information on this model

The word embeddings of this model are of size 300 (a pretty standard size) and are trained using the [GloVe](https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/) algorithm. The model you loaded also comes with other types of embeddings that may be useful for other NLP tasks (like part-of-speech vectors).

There is also a larger model with more words, as well as models for other languages (see the SpaCy link).

## Tokens embeddings and similarity

Now that the model is loaded, we can give it a sentence and it will tokenise it and return a list of tokens with a number of attributes.

Run the following two cells and try to understand them:

In [None]:
tokens = nlp("Hello, I'm a data analyst. aabbbb")

for t in tokens:
    print(t.text, t.has_vector, t.vector_norm)

# The attribute has_vector for "aabbbb" is False, which means that no vector exists for this word in the model.

Hello True 5.586428
, True 5.094723
I True 6.4231944
'm True 5.9417286
a True 5.306696
data True 7.1505103
analyst True 7.489983
. True 4.9316354
aabbbb False 0.0


In [None]:
print('Vector of "' + tokens[0].text + '" : \n', tokens[0].vector)

Vector of "Hello" : 
 [ 0.25233    0.10176   -0.67485    0.21117    0.43492    0.16542
  0.48261   -0.81222    0.041321   0.78502   -0.077857  -0.66324
  0.1464    -0.29289   -0.25488    0.019293  -0.20265    0.98232
  0.028312  -0.081276  -0.1214     0.13126   -0.17648    0.13556
 -0.16361   -0.22574    0.055006  -0.20308    0.20718    0.095785
  0.22481    0.21537   -0.32982   -0.12241   -0.40031   -0.079381
 -0.19958   -0.015083  -0.079139  -0.18132    0.20681   -0.36196
 -0.30744   -0.24422   -0.23113    0.09798    0.1463    -0.062738
  0.42934   -0.078038  -0.19627    0.65093   -0.22807   -0.30308
 -0.12483   -0.17568   -0.14651    0.15361   -0.29518    0.15099
 -0.51726   -0.033564  -0.23109   -0.7833     0.018029  -0.15719
  0.02293    0.49639    0.029225   0.05669    0.14616   -0.19195
  0.16244    0.23898    0.36431    0.45263    0.2456     0.23803
  0.31399    0.3487    -0.035791   0.56108   -0.25345    0.051964
 -0.10618   -0.30962    1.0585    -0.42025    0.18216   -0.11256

You can also get the similarity between two tokens.

In [None]:
tokens = nlp("dog cat banana")

for i in range(len(tokens)):
    for j in range(i+1, len(tokens)):
        print(tokens[i].text, tokens[j].text, tokens[i].similarity(tokens[j]))

dog cat 0.80168545
dog banana 0.24327643
cat banana 0.28154364
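
The `similarity` method compares two vectors with cosine similarity. As a minimal sketch of that computation, here it is on made-up 3-dimensional vectors (the real ones have 300 dimensions, and the numbers below are invented for illustration). Returning 0.0 for a zero vector is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))

# Toy vectors, invented for illustration (not taken from the SpaCy model)
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.8, 0.9, 0.2])
banana = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(dog, cat))     # close to 1: the vectors point the same way
print(cosine_similarity(dog, banana))  # much lower: the vectors point elsewhere
```

Cosine similarity only looks at the direction of the vectors, not at their length, so the `vector_norm` printed earlier has no effect on the score.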


**Warning**: You may find other pre-trained embeddings that you want to use, or even train your own with another library. Every library has different methods, attributes and ways of handling embeddings, so read the documentation and examples before using them.

# Sentence embeddings

Now you know how to manipulate word embeddings, congratulations!
So you have the sentence that you want to classify, and you have the embedding of each word of this sentence... Now what?

Maybe you can concatenate all of these vectors and just give it to the classifier? 

Problems: 

- It would give a very, very big vector.

- It would be EXTREMELY sensitive to the order of the words.

- You would have to handle sentences of different lengths with padding.
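
To make these problems concrete, here is a small sketch with invented 4-dimensional vectors standing in for the 300-dimensional ones. The vocabulary `toy_vectors` and the helper `concat_embedding` are hypothetical, written only for this illustration.

```python
import numpy as np

DIM = 4  # stand-in for the real embedding size of 300

# Invented vectors, for illustration only
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0, 0.0]),
    "is": np.array([0.0, 1.0, 0.0, 0.0]),
    "cute": np.array([0.0, 0.0, 1.0, 0.0]),
}

def concat_embedding(words, max_len):
    """Concatenate word vectors, zero-padding (or truncating) to max_len words."""
    vectors = [toy_vectors[w] for w in words[:max_len]]
    padding = [np.zeros(DIM)] * (max_len - len(vectors))
    return np.concatenate(vectors + padding)

a = concat_embedding(["cat", "is", "cute"], max_len=5)
b = concat_embedding(["cute", "is", "cat"], max_len=5)

print(a.shape)               # the size is max_len * DIM, whatever the sentence
print(np.array_equal(a, b))  # same words in another order give another vector
```

With 300 dimensions and a maximum of 50 words per sentence, each sentence would already need a vector of 15,000 values.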

In practice, state-of-the-art models either train special sentence embeddings for their task or use sequential neural networks (RNN/LSTM).

But we won't do that here (phew!). Actually, just taking the average of the vectors works surprisingly well. And good news: SpaCy comes with this functionality!

In [None]:
tokens = nlp("Hello, I am a sentence.")
tokens.vector.shape

(300,)
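
By default, the vector of a SpaCy document is the average of its token vectors. A minimal sketch of that idea, with invented 3-dimensional vectors in place of the real ones:

```python
import numpy as np

# Invented token vectors, one row per token (illustration only)
token_vectors = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 4.0, 2.0],
])

sentence_vector = token_vectors.mean(axis=0)
shuffled_vector = token_vectors[::-1].mean(axis=0)

print(sentence_vector.shape)                          # one vector, whatever the sentence length
print(np.allclose(sentence_vector, shuffled_vector))  # word order does not matter
```

The price of this simplicity is that word order is lost entirely: two sentences made of the same words get the same vector.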

You can also get the similarity between sentences.

In [None]:
tokens1 = nlp("Hello, I am a sentence.")
tokens2 = nlp("Hi, also some sort of phrase!")
tokens3 = nlp("This cat is cute.")

print(tokens1.similarity(tokens2))
print(tokens1.similarity(tokens3))
print(tokens2.similarity(tokens3))

0.832282210939598
0.7502564755692778
0.7618915522647609


In [None]:
tokens1

Hello, I am a sentence.

Taking a plain average over an unprocessed sentence has one problem: it gives too much weight to stop words and other very frequent but unimportant words.

That is why you should delete the stop words like you did previously.

Try to do it now and compute the embeddings for each processed sentence:

In [None]:
import nltk
nltk.download('popular')

In [None]:
stopwordsenglish  = nltk.corpus.stopwords.words("english")
'Hello,' in stopwordsenglish

False

In [None]:
tokens1 = "Hello, I am a sentence."
tokens2 = "Hi, also some sort of phrase!"
tokens3 = "This cat is cute."

# Interesting: compare the results when the punctuation is removed manually
# tokens1 = "Hello I am a sentence"
# tokens2 = "Hi also some sort of phrase"
# tokens3 = "This cat is cute"

tokens1 = [word for word in tokens1.split() if word.lower() not in stopwordsenglish]
tokens2 = [word for word in tokens2.split() if word.lower() not in stopwordsenglish]
tokens3 = [word for word in tokens3.split() if word.lower() not in stopwordsenglish]

tokens1 = nlp(" ".join(tokens1))
tokens2 = nlp(" ".join(tokens2))
tokens3 = nlp(" ".join(tokens3))

print(tokens1.similarity(tokens2))
print(tokens1.similarity(tokens3))
print(tokens2.similarity(tokens3))

0.8174121063676709
0.6202515567511576
0.6381025221733543


In [None]:
tokens1, tokens2, tokens3 

(Hello, sentence., Hi, also sort phrase!, cat cute.)
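
The output above shows that punctuation stays attached to the words after `split()`. As a consequence, a stop word followed by punctuation (for example `"Then,"`) would not match the stop-word list and would be kept. One possible refinement, sketched below, strips punctuation from each word before the lookup. The tiny `STOP_WORDS` set is a hypothetical stand-in for the NLTK list so that the snippet runs on its own, and the function name is invented for this example.

```python
import string

# Hypothetical mini stop-word list; the notebook uses nltk.corpus.stopwords instead
STOP_WORDS = {"i", "am", "a", "is", "this", "then", "of", "some"}

def remove_stop_words_and_punctuation(text):
    """Drop stop words after stripping leading/trailing punctuation from each word."""
    kept = []
    for word in text.split():
        cleaned = word.strip(string.punctuation)
        if cleaned and cleaned.lower() not in STOP_WORDS:
            kept.append(cleaned)
    return " ".join(kept)

print(remove_stop_words_and_punctuation("Hello, I am a sentence."))
print(remove_stop_words_and_punctuation("Then, this is cute!"))
```

Note that `str.strip` only removes punctuation at the edges of a word, so contractions such as `we'll` keep their apostrophe.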

# Sentiment analysis

## The dataset

### Run the code below:

We won't use the Twitter dataset that you already know because, as strong as embeddings are, they aren't great with unknown words, abbreviations and emoji, and the Twitter dataset is full of them.

We will instead use a dataset of reviews from Amazon, Yelp and IMDB.

In [None]:
import pandas as pd
df_source = pd.read_csv("https://raw.githubusercontent.com/CindyAloui/datasets_wcs/master/sentiment_dataset.csv", usecols=("sentence", "sentiment", "source"))
df_source

Unnamed: 0,sentence,sentiment,source
0,So there is no way for me to plug it in here i...,0,amazon_cells_labelled
1,"Good case, Excellent value.",1,amazon_cells_labelled
2,Great for the jawbone.,1,amazon_cells_labelled
3,Tied to charger for conversations lasting more...,0,amazon_cells_labelled
4,The mic is great.,1,amazon_cells_labelled
...,...,...,...
2995,I think food should have flavor and texture an...,0,yelp_labelled
2996,Appetite instantly gone.,0,yelp_labelled
2997,Overall I was not impressed and would not go b...,0,yelp_labelled
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled


In [None]:
df_source.groupby(['source', 'sentiment']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,sentence
source,sentiment,Unnamed: 2_level_1
amazon_cells_labelled,0,500
amazon_cells_labelled,1,500
imdb_labelled,0,500
imdb_labelled,1,500
yelp_labelled,0,500
yelp_labelled,1,500


## Challenge

Now you have all the elements to train a classifier for sentiment analysis using embeddings! A little reminder of the steps: 

- First, take out the stop words so you won't have to do a weighted average. You can also lemmatize the text if you want, but in this case it shouldn't have a big influence.

- Then compute the sentence embeddings of the reviews. These are going to be our features.

- Do a train test split.

- Choose a type of classifier you want to use (for example a Logistic Regression).

- Train and evaluate your classifier. 

You should easily be able to reach an accuracy of 80%.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import nltk
nltk.download('popular')

In [None]:
df_source.tail()

Unnamed: 0,sentence,sentiment,source,no_stops,embeddings
2995,I think food should have flavor and texture an...,0,yelp_labelled,think food flavor texture lacking.,"[[-0.16743083, 0.19883184, 0.049234163, -0.275..."
2996,Appetite instantly gone.,0,yelp_labelled,Appetite instantly gone.,"[[-0.07648475, 0.191955, -0.057646506, -0.2006..."
2997,Overall I was not impressed and would not go b...,0,yelp_labelled,Overall impressed would go back.,"[[0.0021266676, 0.29672068, -0.16125, -0.01485..."
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled,"whole experience underwhelming, think we'll go...","[[0.01728477, 0.088313386, -0.028295077, -0.12..."
2999,"Then, as if I hadn't wasted enough of my life ...",0,yelp_labelled,"Then, wasted enough life there, poured salt wo...","[[0.020121753, 0.21565683, -0.023613028, -0.05..."


In [None]:
def remove_stop_words(text):
  words = []
  for t in text.split():
    if t.lower() not in stopwordsenglish:
      words.append(t)
  return " ".join(words)
remove_stop_words("You are better when I am well.")

'better well.'

In [None]:
df_source['no_stops'] = df_source['sentence'].apply(remove_stop_words)

In [None]:
# df_source['embeddings'] = df_source['no_stops'].apply(lambda x: nlp(x).vector)
df_source['embeddings'] = df_source['no_stops'].apply(lambda x: nlp(x).vector.reshape(1,300))

In [None]:
type(df_source['embeddings'][0]), len(df_source['embeddings'][0]),df_source['embeddings'][0].shape

(numpy.ndarray, 1, (1, 300))

In [None]:
# One row per review, one column per embedding dimension
matrix = np.zeros(shape=(len(df_source), 300))

In [None]:
for i in range(len(df_source['embeddings'])):
  matrix[i] = df_source['embeddings'][i]
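
The loop above copies each `(1, 300)` array into one row of the matrix. An equivalent and shorter option is to stack the arrays with `np.vstack`. The sketch below uses a small list of invented `(1, 4)` arrays in place of the `embeddings` column so that it runs on its own.

```python
import numpy as np

# Invented stand-in for the 'embeddings' column: three arrays of shape (1, 4)
embeddings = [
    np.array([[0.1, 0.2, 0.3, 0.4]]),
    np.array([[0.5, 0.6, 0.7, 0.8]]),
    np.array([[0.9, 1.0, 1.1, 1.2]]),
]

# Loop version, as in the cell above
matrix_loop = np.zeros(shape=(len(embeddings), 4))
for i in range(len(embeddings)):
    matrix_loop[i] = embeddings[i]

# Stacked version: one call, no pre-allocated matrix
matrix_stacked = np.vstack(embeddings)

print(matrix_stacked.shape)
print(np.array_equal(matrix_loop, matrix_stacked))
```

With the DataFrame from this notebook, the same idea would read `np.vstack(df_source['embeddings'].tolist())`.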

In [None]:
X = matrix
y = df_source["sentiment"].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=32)

In [None]:
lr = LogisticRegression()
lr_model = lr.fit(X_train, y_train)

In [None]:
lr_model.score(X_train, y_train), lr_model.score(X_test, y_test)

(0.8706666666666667, 0.828)