<a href="https://colab.research.google.com/github/mlukan/GDA3B2021/blob/main/Martin_Lukan_Word_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word embedding manipulation

## Load pre-trained embeddings

You could train your own word embeddings (using a library like [gensim](https://radimrehurek.com/gensim/models/word2vec.html)) if you wanted to. However, you would need a lot of text, and you would have to choose many parameters (the size of the context window, the dimension of the embeddings, which algorithm to use, etc.).

Why go through all that hassle when you can just use embeddings that specialists in the field have already trained on huge corpora?

[SpaCy](https://spacy.io/usage/models) is an NLP library that provides such embeddings.

### Run the code below:

In [1]:
# Download the embeddings

!python3 -m spacy download en_core_web_md

# Load them

import en_core_web_md
nlp = en_core_web_md.load()

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.3MB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp37-none-any.whl size=98051305 sha256=f7db416e14429c0b63eff4a95781170bc574c333625c9ae58ba9afeeed81f63a
  Stored in directory: /tmp/pip-ephem-wheel-cache-_vl_9asa/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


### Some optional information on this model

The word embeddings of this model have 300 dimensions (a pretty standard size) and were trained using the [GloVe](https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/) algorithm. The model you loaded also comes with other types of embeddings that may be useful for other NLP tasks (like part-of-speech vectors).

There is also a larger model with more words, as well as models for other languages (see the SpaCy link).

## Token embeddings and similarity

Now that the model is loaded, we can give it a sentence: it will tokenise it and return a sequence of tokens, each with a number of attributes.

Run the following two cells and try to understand them:

In [2]:
tokens = nlp("Hello, I'm a data analyst. aabbbb")

for t in tokens:
    print(t.text, t.has_vector, t.vector_norm)

# has_vector is False for "aabbbb": the model has no vector for this word.

Hello True 5.586428
, True 5.094723
I True 6.4231944
'm True 5.9417286
a True 5.306696
data True 7.1505103
analyst True 7.489983
. True 4.9316354
aabbbb False 0.0
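
The `vector_norm` column is the L2 (Euclidean) norm of each token's vector, and, as the output above shows, a word without a vector ends up with a norm of 0. The sketch below mimics that behaviour in plain Python. `toy_vectors`, `lookup` and `l2_norm` are hypothetical names, and the 3-dimensional values are made up; this is an illustration, not how spaCy stores its vectors.

```python
import math

# Hypothetical 3-dimensional embeddings; the real model uses 300 dimensions.
toy_vectors = {
    "hello": [3.0, 4.0, 0.0],
    "data": [1.0, 2.0, 2.0],
}
DIM = 3

def lookup(word):
    """Return (has_vector, vector); unknown words get an all-zero vector."""
    if word in toy_vectors:
        return True, toy_vectors[word]
    return False, [0.0] * DIM

def l2_norm(vector):
    """Euclidean length of the vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))

for word in ["hello", "data", "aabbbb"]:
    has_vector, vector = lookup(word)
    print(word, has_vector, l2_norm(vector))
```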


In [4]:
print('Vector of "' + tokens[0].text + '" : \n', tokens[0].vector)
len(tokens[0].vector)

Vector of "Hello" : 
 [ 0.25233    0.10176   -0.67485    0.21117    0.43492    0.16542
  0.48261   -0.81222    0.041321   0.78502   -0.077857  -0.66324
  0.1464    -0.29289   -0.25488    0.019293  -0.20265    0.98232
  0.028312  -0.081276  -0.1214     0.13126   -0.17648    0.13556
 -0.16361   -0.22574    0.055006  -0.20308    0.20718    0.095785
  0.22481    0.21537   -0.32982   -0.12241   -0.40031   -0.079381
 -0.19958   -0.015083  -0.079139  -0.18132    0.20681   -0.36196
 -0.30744   -0.24422   -0.23113    0.09798    0.1463    -0.062738
  0.42934   -0.078038  -0.19627    0.65093   -0.22807   -0.30308
 -0.12483   -0.17568   -0.14651    0.15361   -0.29518    0.15099
 -0.51726   -0.033564  -0.23109   -0.7833     0.018029  -0.15719
  0.02293    0.49639    0.029225   0.05669    0.14616   -0.19195
  0.16244    0.23898    0.36431    0.45263    0.2456     0.23803
  0.31399    0.3487    -0.035791   0.56108   -0.25345    0.051964
 -0.10618   -0.30962    1.0585    -0.42025    0.18216   -0.11256

300

You can also get the similarity between two tokens.

In [None]:
tokens = nlp("dog cat banana")

for i in range(len(tokens)):
    for j in range(i+1, len(tokens)):
        print(tokens[i].text, tokens[j].text, tokens[i].similarity(tokens[j]))

dog cat 0.80168545
dog banana 0.24327643
cat banana 0.28154364
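
spaCy documents the default `similarity` as the cosine similarity of the two vectors. Here is a minimal sketch of that formula in plain Python, using made-up 3-dimensional vectors instead of the real 300-dimensional ones. Returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for words without a vector
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, cat))     # close to 1: similar directions
print(cosine_similarity(dog, banana))  # much lower: different directions
```

The score depends only on the direction of the vectors, not on their length, which is why it stays between -1 and 1.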


**Warning**: You may find other pre-trained embeddings that you want to use, or even train your own with another library. Every library has different methods, attributes and ways of handling embeddings, so read the documentation and examples before using them.

# Sentence embeddings

Now you know how to manipulate word embeddings, congratulations!
So you have the sentence that you want to classify, and you have the embedding of each word of this sentence... Now what?

Maybe you can concatenate all of these vectors and just give the result to the classifier?

Problems:

- It would give a very, very big vector.

- It would be EXTREMELY sensitive to the order of the words.

- You would have to handle sentences of different lengths with padding.
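
The three problems above can be seen on a toy example. This is only a sketch: `toy_vectors` and `concatenate` are hypothetical, and the 2-dimensional values are made up.

```python
# Made-up 2-dimensional embeddings, for illustration only
toy_vectors = {"the": [0.1, 0.1], "cat": [0.9, 0.2], "sleeps": [0.3, 0.8]}
DIM = 2

def concatenate(words, max_words):
    """Concatenate word vectors and pad with zeros up to max_words words."""
    flat = []
    for word in words:
        flat.extend(toy_vectors[word])
    flat.extend([0.0] * (DIM * (max_words - len(words))))
    return flat

a = concatenate(["the", "cat", "sleeps"], max_words=4)
b = concatenate(["cat", "sleeps"], max_words=4)

print(len(a))  # DIM * max_words values, whatever the sentence length
print(a)
print(b)       # same content words, but shifted to other positions
```

With 300-dimensional vectors and a (hypothetical) limit of 50 words, the same approach would already produce 15,000 features per sentence.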

In practice, state-of-the-art models either train special sentence embeddings for their task or use special sequential neural networks (RNN/LSTM).

But we won't do that here (phew!). Actually, simply averaging the vectors works surprisingly well. And good news: spaCy comes with this functionality!

In [6]:
tokens = nlp("Hello, I am a sentence.")
len(tokens.vector)

300
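
spaCy documents `Doc.vector` as defaulting to the average of the token vectors, which is why the result still has 300 dimensions. A minimal sketch of that averaging in plain Python, on made-up 3-dimensional vectors (`average_vectors` is a hypothetical helper, not a spaCy function):

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

# Made-up 3-dimensional token vectors, for illustration only
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_b = [[3.0, 2.0, 0.0], [1.0, 0.0, 2.0]]  # same tokens, other order

print(average_vectors(sentence_a))  # [2.0, 1.0, 1.0]
print(average_vectors(sentence_a) == average_vectors(sentence_b))  # True
```

Unlike concatenation, the average has a fixed size whatever the sentence length, and it ignores word order entirely.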

You can also get the similarity between sentences.

In [16]:
tokens1 = nlp("Hello, I am a sentence.")
tokens2 = nlp("Hi, also some sort of phrase!")
tokens3 = nlp("This cat is cute.")

print(tokens1.similarity(tokens2))
print(tokens1.similarity(tokens3))
print(tokens2.similarity(tokens3))


0.832282210939598
0.7502564114243815
0.7618915522647609



Taking a plain average over an unprocessed sentence has one problem: it gives too much weight to stop words and other very frequent but unimportant words.

That is why you should remove the stop words, like you did previously.

Try to do it now and compute the embedding of each processed sentence:

In [19]:
import string
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Build the stop word set once instead of reloading the list for every word
stop_words = set(nltk.corpus.stopwords.words("english"))

def clean(sentence):
    # Lowercase the words and drop stop words and punctuation tokens
    words = nltk.word_tokenize(sentence)
    return ' '.join(w.lower() for w in words if w.lower() not in stop_words and w not in string.punctuation)

tokens1 = nlp(clean("Hello, I am a sentence."))
tokens2 = nlp(clean("Hi, also some sort of phrase!"))
tokens3 = nlp(clean("This cat is cute."))

print(tokens1.similarity(tokens2))
print(tokens1.similarity(tokens3))
print(tokens2.similarity(tokens3))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
0.6751346465775762
0.4319715571396021
0.48920482878630084
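
To see why the scores drop once stop words are removed, here is a toy sketch in plain Python. All vectors and the stop word list are made up, and the helper names are hypothetical; the point is only that shared stop words pull two averages towards each other.

```python
import math

# Made-up 3-dimensional vectors: the three stop words share the same direction
toy_vectors = {
    "this": [1.0, 1.0, 0.0],
    "is": [1.0, 1.0, 0.0],
    "a": [1.0, 1.0, 0.0],
    "cat": [0.0, 0.2, 1.0],
    "bank": [0.0, 1.0, -0.5],
}
toy_stop_words = {"this", "is", "a"}  # hypothetical stop word list

def sentence_vector(words):
    """Average the vectors of the given words."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def drop_stop_words(words):
    """Keep only the words that are not in the toy stop word list."""
    return [w for w in words if w not in toy_stop_words]

s1 = "this is a cat".split()
s2 = "this is a bank".split()

similarity_with = cosine_similarity(sentence_vector(s1), sentence_vector(s2))
similarity_without = cosine_similarity(
    sentence_vector(drop_stop_words(s1)), sentence_vector(drop_stop_words(s2))
)

print(similarity_with)     # high: dominated by the shared stop words
print(similarity_without)  # much lower: only "cat" and "bank" are compared
```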


# Sentiment analysis

## The dataset

### Run the code below:

We won't use the Twitter dataset that you already know because, as strong as embeddings are, they aren't great with unknown words, abbreviations or emoji, and the Twitter dataset is full of them.

We will instead use a dataset with reviews from Amazon, Yelp and IMDB.

In [20]:
import pandas as pd
df_source = pd.read_csv("https://raw.githubusercontent.com/CindyAloui/datasets_wcs/master/sentiment_dataset.csv", usecols=("sentence", "sentiment", "source"))
df_source

Unnamed: 0,sentence,sentiment,source
0,So there is no way for me to plug it in here i...,0,amazon_cells_labelled
1,"Good case, Excellent value.",1,amazon_cells_labelled
2,Great for the jawbone.,1,amazon_cells_labelled
3,Tied to charger for conversations lasting more...,0,amazon_cells_labelled
4,The mic is great.,1,amazon_cells_labelled
...,...,...,...
2995,I think food should have flavor and texture an...,0,yelp_labelled
2996,Appetite instantly gone.,0,yelp_labelled
2997,Overall I was not impressed and would not go b...,0,yelp_labelled
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled


In [21]:
df_source.groupby(['source', 'sentiment']).count()

source,sentiment,sentence
amazon_cells_labelled,0,500
amazon_cells_labelled,1,500
imdb_labelled,0,500
imdb_labelled,1,500
yelp_labelled,0,500
yelp_labelled,1,500


## Challenge

Now you have all the elements to train a classifier for sentiment analysis using embeddings! A little reminder of the steps: 

- First take out the stop words so you won't have to do a weighted average. You can also lemmatize the text if you want, but in this case it shouldn't have a big influence.

- Then compute the sentence embeddings of the reviews. These will be our features.

- Do a train test split.

- Choose a type of classifier you want to use (for example a Logistic Regression).

- Train and evaluate your classifier. 

You should easily be able to reach an accuracy of 80%.

### Removing stopwords and punctuation

In [140]:
mystr = "So there is no way for me to plug it in here i..."

# Note: stemming with the Snowball stemmer gave a worse-performing model, so it is not applied here
stop_words = set(nltk.corpus.stopwords.words("english"))  # built once, not once per word
extra_tokens = {"...", "``", "''"}  # tokenizer artifacts to drop as well

def swremover(sent):
    # Lowercase the words and drop stop words, punctuation and tokenizer artifacts
    wordlist = nltk.word_tokenize(sent)
    return ' '.join(w.lower() for w in wordlist
                    if w.lower() not in stop_words and w not in string.punctuation and w not in extra_tokens)

swremover(mystr)

# Clean every review in a copy of the dataset
df_work = df_source.copy()
df_work['sentence'] = df_work['sentence'].apply(swremover)


### Using a pretrained BERT model for sentence embeddings

In [132]:
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)




Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Tokenization

In [141]:
import numpy as np

# Encode each sentence into BERT token ids (special tokens are added)
tokenized = df_work['sentence'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

# Pad every sequence with 0 up to the length of the longest one
max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

# The attention mask is 1 for real tokens and 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(3000, 62)
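
The padding and masking step can be followed on a tiny example. The sketch below uses plain Python lists and made-up token ids; as in the cell above, 0 is used as the padding id.

```python
# Made-up token id sequences of different lengths (0 is the padding id)
tokenized = [[101, 2204, 2553, 102], [101, 2307, 102]]

max_len = max(len(ids) for ids in tokenized)

# Right-pad every sequence with 0 so that all rows have the same length
padded = [ids + [0] * (max_len - len(ids)) for ids in tokenized]

# 1 marks a real token, 0 marks padding that the model should ignore
attention_mask = [[1 if token_id != 0 else 0 for token_id in row] for row in padded]

print(padded)
print(attention_mask)
```

In the output above, `(3000, 62)` therefore means 3000 reviews, each padded to the 62 tokens of the longest one.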

### Retrieving the vectors

In [142]:
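import numpy as np

# Convert the padded ids and the attention mask to tensors
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Run BERT in inference mode (no gradients are needed)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)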




In [143]:
# Features: the hidden state of the first token ([CLS]) of each sentence
features = last_hidden_states[0][:,0,:].numpy()
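
The slice `[:,0,:]` keeps, for every sentence, the vector at token position 0. Since the sentences were encoded with `add_special_tokens=True`, that position holds the `[CLS]` token. A toy sketch of the same selection with nested lists (the values and the tiny shape are made up; the real tensor has 768 hidden values per token for `bert-base-uncased`):

```python
# Toy "last hidden state" with shape (batch=2, sequence=3, hidden=2)
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # sentence 0: one vector per token
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # sentence 1
]

# Equivalent of hidden[:, 0, :]: keep the vector at position 0 of each sentence
features = [sentence[0] for sentence in hidden]

print(features)  # [[0.1, 0.2], [0.7, 0.8]]
```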


## Logistic regression

In [144]:
# Stratify on the labels so that the train and test sets keep the same class proportions as the original dataset
X_train, X_test, y_train, y_test = train_test_split(features, df_work['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df_work['sentiment'])
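
To make the effect of `stratify` concrete, here is a rough sketch of a stratified split in plain Python. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper that simply splits each class separately, and the labels are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with the same class ratio in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])   # same share of every class goes to test
        train.extend(indices[n_test:])
    return train, test

# Toy labels: 10 negative and 10 positive examples
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both sets keep the 50/50 balance of the toy labels, which matches the balanced classes shown in the `groupby` output earlier.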




In [149]:
from sklearn.linear_model import LogisticRegression
lreg = LogisticRegression(max_iter=1000)
lreg.fit(X_train, y_train)
lreg.score(X_test, y_test)


0.815

In [148]:
from sklearn.tree import DecisionTreeClassifier
modelDTC = DecisionTreeClassifier()
modelDTC.fit(X_train, y_train)

modelDTC.predict(X_test)
modelDTC.score(X_test,y_test)

0.6466666666666666
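
For scikit-learn classifiers, `score` returns the mean accuracy, i.e. the share of test examples that were predicted correctly. A minimal sketch of that computation on made-up labels (`accuracy` is a hypothetical helper):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 0.6
```

With 600 test reviews (20% of 3000), a score of 0.815 corresponds to 489 correct predictions for the logistic regression, against 388 for the decision tree.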