# Word embedding manipulation

## Load pre-trained embeddings

You could train your own word embeddings (using a library like [gensim](https://radimrehurek.com/gensim/models/word2vec.html)) if you want, but you would need a lot of text and you would have to choose many parameters (the size of the context window, the dimension of the embeddings, which algorithm to use, etc.).

Why go through all that hassle when you can just use embeddings that specialists in the field have already trained on huge corpora?

[SpaCy](https://spacy.io/usage/models) is an NLP library that provides such embeddings.

### Run the code below:

In [1]:
# Download the embeddings

!python3 -m spacy download en_core_web_md

# Download the NLTK stop word list (used later in this notebook)
import nltk
nltk.download('stopwords')

# Load the embeddings
import en_core_web_md
nlp = en_core_web_md.load()

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Some optional information on this model

The word embeddings of this model are of size 300 (a pretty standard size) and are trained using the [GloVe](https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/) algorithm. The model you loaded also comes with other types of embeddings that may be useful for other NLP tasks (like part-of-speech vectors).

There is also a larger model with more words, as well as models for other languages (see the SpaCy link).

## Token embeddings and similarity

Now that the model is loaded, we can give it a sentence and it will tokenise it and return a list of tokens with a number of attributes.

Run the following two cells and try to understand them:

In [2]:
tokens = nlp("Hello, I'm a data analyst. aabbbb")

for t in tokens:
    print(t.text, t.has_vector, t.vector_norm)

# The attribute has_vector for "aabbbb" is False: it means that no vector exists for this word in the model.

Hello True 5.586428
, True 5.094723
I True 6.4231944
'm True 5.9417286
a True 5.306696
data True 7.1505103
analyst True 7.489983
. True 4.9316354
aabbbb False 0.0


In [3]:
print('Vector of "' + tokens[0].text + '" : \n', tokens[0].vector)

Vector of "Hello" : 
 [ 0.25233    0.10176   -0.67485    0.21117    0.43492    0.16542
  0.48261   -0.81222    0.041321   0.78502   -0.077857  -0.66324
  0.1464    -0.29289   -0.25488    0.019293  -0.20265    0.98232
  0.028312  -0.081276  -0.1214     0.13126   -0.17648    0.13556
 -0.16361   -0.22574    0.055006  -0.20308    0.20718    0.095785
  0.22481    0.21537   -0.32982   -0.12241   -0.40031   -0.079381
 -0.19958   -0.015083  -0.079139  -0.18132    0.20681   -0.36196
 -0.30744   -0.24422   -0.23113    0.09798    0.1463    -0.062738
  0.42934   -0.078038  -0.19627    0.65093   -0.22807   -0.30308
 -0.12483   -0.17568   -0.14651    0.15361   -0.29518    0.15099
 -0.51726   -0.033564  -0.23109   -0.7833     0.018029  -0.15719
  0.02293    0.49639    0.029225   0.05669    0.14616   -0.19195
  0.16244    0.23898    0.36431    0.45263    0.2456     0.23803
  0.31399    0.3487    -0.035791   0.56108   -0.25345    0.051964
 -0.10618   -0.30962    1.0585    -0.42025    0.18216   -0.11256

You can also get the similarity between two tokens.

In [4]:
tokens = nlp("dog cat banana")

for i in range(len(tokens)):
    for j in range(i+1, len(tokens)):
        print(tokens[i].text, tokens[j].text, tokens[i].similarity(tokens[j]))

dog cat 0.80168545
dog banana 0.24327643
cat banana 0.28154364
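
By default, spaCy's `similarity` is the cosine similarity of the two vectors. Here is a minimal sketch of that computation in plain Python, using made-up 3-dimensional vectors in place of the real 300-dimensional ones. The values and the helper name `cosine_similarity` are illustrative assumptions, not part of spaCy.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A token without a vector (all zeros) has no direction to compare
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print("dog cat", cosine_similarity(dog, cat))
print("dog banana", cosine_similarity(dog, banana))
```

Vectors that point in nearly the same direction score close to 1, whatever their norms.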


**Warning**: You may find other pre-trained embeddings that you want to use, or even train your own with another library. Every library has different methods, attributes and ways of handling embeddings, so read the documentation and examples before using them.

# Sentence embeddings

Now you know how to manipulate word embeddings, congratulations!
So you have the sentence that you want to classify, and you have the embedding of each word of this sentence... Now what?

Maybe you can concatenate all of these vectors and just give the result to the classifier?

Problems: 

- It would give a very, very big vector.

- It would be EXTREMELY sensitive to the order of the words.

- You would have to handle sentences of different lengths with padding.
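
These problems show up even on a toy example. The sketch below uses invented 2-dimensional word vectors and a hypothetical helper `concat_with_padding`. With real 300-dimensional embeddings, a limit of 50 words would already give 15,000 features.

```python
# Toy 2-dimensional word vectors, invented for illustration
toy_vectors = {
    "good": [0.9, 0.1],
    "not": [-0.5, 0.4],
    "bad": [-0.8, 0.2],
}

def concat_with_padding(words, max_len, dim=2):
    """Concatenate the word vectors, then pad with zeros up to max_len words."""
    flat = []
    for word in words[:max_len]:
        flat.extend(toy_vectors[word])
    # Shorter sentences are filled with zeros so every row has the same length
    flat.extend([0.0] * (dim * max_len - len(flat)))
    return flat

a = concat_with_padding(["not", "bad"], max_len=4)
b = concat_with_padding(["bad", "not"], max_len=4)

print(len(a))   # max_len * dim features, whatever the sentence length
print(a == b)   # the same words in a different order give a different vector
```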

In practice, state-of-the-art models either train special sentence embeddings for their task or use sequential neural networks (RNN/LSTM).

But we won't do that here (phew!). Actually, just taking the average of the vectors works surprisingly well. And good news: spaCy comes with this functionality!

In [5]:
tokens = nlp("Hello, I am a sentence.")
tokens.vector

array([ 1.44052897e-02,  3.73151302e-01, -3.24717134e-01, -1.14375697e-02,
        1.16910569e-01,  1.52780011e-01,  1.90909535e-01, -4.37690973e-01,
       -3.23535725e-02,  1.87437439e+00, -2.64555395e-01,  1.18573561e-01,
        3.17365713e-02, -8.37386176e-02, -1.97842836e-01, -5.29319979e-02,
       -1.00312009e-01,  1.20892560e+00, -8.57821181e-02,  5.27822860e-02,
       -7.06457123e-02, -1.33178145e-01,  4.75231409e-02,  1.12724580e-01,
       -1.40458560e-02, -2.75365766e-02, -6.30009994e-02, -1.70811281e-01,
        4.11111683e-01, -6.48657158e-02, -1.10077433e-01,  1.48285711e-02,
       -7.37885758e-02,  8.64421576e-02, -5.08786440e-02, -8.35811496e-02,
        9.39526111e-02, -5.84518574e-02, -2.87427843e-01, -1.23048410e-01,
        5.54607138e-02, -2.00897142e-01,  6.62029907e-02, -8.45435113e-02,
        5.04015684e-02,  8.26912746e-02, -1.58224300e-01, -1.18329972e-02,
        1.42642856e-01, -1.73901469e-02, -1.13422573e-01,  3.16255152e-01,
       -3.69520001e-02,  
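
By default, a document's `vector` is the average of its token vectors. As a rough sketch of that computation, here it is in plain Python with invented 3-dimensional vectors and a hypothetical `average_vector` helper:

```python
def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Toy token vectors, invented for illustration
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

sentence_vector = average_vector(token_vectors)
print(sentence_vector)  # one vector of the same size as a single token vector
```

Unlike concatenation, the result has a fixed size and does not depend on the order of the words.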

You can also get the similarity between sentences.

In [6]:
tokens1 = nlp("Hello, I am a sentence.")
tokens2 = nlp("Hi, also some sort of phrase!")
tokens3 = nlp("This cat is cute.")

print(tokens1.similarity(tokens2))
print(tokens1.similarity(tokens3))
print(tokens2.similarity(tokens3))

0.832282210939598
0.7502564755692778
0.7618915522647609


Taking a plain average over an unprocessed sentence has one problem: it gives too much weight to stop words and other very frequent but unimportant words.

That is why you should remove the stop words, like you did previously.

Try to do it now and compute the embeddings of each processed sentence:

In [0]:
def PunkStopWordsExterm(sentence, output='sentence'):
  '''
  Remove punctuation and stop words from a sentence and return the result as
  a sentence or as a list of words.

  :output: str : 'sentence' or 'words'
  'sentence'  returns the result as a single string
  'words'     returns the result as a list of words

  Note: the comparison is case-sensitive, so capitalised stop words such as
  "I" or "The" are kept.
  '''
  # Requires the NLTK stop word list: nltk.download('stopwords')
  from nltk.corpus import stopwords
  from nltk.tokenize import RegexpTokenizer

  # Keep only runs of word characters, which drops the punctuation
  tokenizer = RegexpTokenizer(r'\w+')
  words = tokenizer.tokenize(sentence)

  stop_words = set(stopwords.words("english"))
  filtered = [w for w in words if w not in stop_words]

  if output == 'sentence':
    return " ".join(filtered)
  elif output == 'words':
    return filtered
  else:
    # Report the problem; the function then returns None
    print("Error : Wrong output format selected.",
          "Please chose between 'sentence' or 'words'!")


In [8]:
sentence = "Hello, I am a sentence."

PunkStopWordsExterm(sentence, 'sentence')

'Hello I sentence'

In [9]:
PunkStopWordsExterm(sentence, 'words')

['Hello', 'I', 'sentence']

In [10]:
PunkStopWordsExterm(sentence, 'word')

Error : Wrong output format selected. Please chose between 'sentence' or 'words'!
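
Note that `'I'` survives the filter in the outputs above: the comparison is case-sensitive, and the NLTK English list is lowercase. One possible variant compares lowercased words instead. It is sketched here with a tiny hard-coded stop word set so that it runs without NLTK. The set `TOY_STOP_WORDS` and the function name `remove_stop_words` are assumptions for illustration.

```python
import re

# Tiny stand-in for stopwords.words("english"), for illustration only
TOY_STOP_WORDS = {"i", "am", "a", "the", "is", "this"}

def remove_stop_words(sentence, stop_words=TOY_STOP_WORDS):
    """Drop punctuation, then drop stop words regardless of their case."""
    words = re.findall(r"\w+", sentence)
    return [w for w in words if w.lower() not in stop_words]

print(remove_stop_words("Hello, I am a sentence."))
print(remove_stop_words("This cat is cute."))
```

Switching to a variant like this would change the filtered sentences, so the scores recorded in the rest of this notebook would no longer match exactly.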


In [11]:
# Put the sentences in a corpus to process them in one go
corpus = ["Hello, I am a sentence.", "Hi, also some sort of phrase!", 
          "This cat is cute."]

# Build a new corpus by applying the filter function, then run spaCy on each result
fltd_tok_corp = [nlp(PunkStopWordsExterm(sent, 'sentence')) for sent in corpus]

# Compare the documents of the filtered corpus
print(fltd_tok_corp[0].similarity(fltd_tok_corp[1]))
print(fltd_tok_corp[0].similarity(fltd_tok_corp[2]))
print(fltd_tok_corp[1].similarity(fltd_tok_corp[2]))

0.7297824993581993
0.5886927521791829
0.604620161888518


# Sentiment analysis

## The dataset

### Run the code below:

We won't use the Twitter dataset that you already know because, as powerful as embeddings are, they don't cope well with unknown words, abbreviations and emoji, and the Twitter dataset is full of them.

We will instead use a dataset of reviews from Amazon, Yelp and IMDb.

In [12]:
import pandas as pd
df_source = pd.read_csv("https://raw.githubusercontent.com/CindyAloui/datasets_wcs/master/sentiment_dataset.csv", usecols=("sentence", "sentiment", "source"))
df_source

Unnamed: 0,sentence,sentiment,source
0,So there is no way for me to plug it in here i...,0,amazon_cells_labelled
1,"Good case, Excellent value.",1,amazon_cells_labelled
2,Great for the jawbone.,1,amazon_cells_labelled
3,Tied to charger for conversations lasting more...,0,amazon_cells_labelled
4,The mic is great.,1,amazon_cells_labelled
...,...,...,...
2995,I think food should have flavor and texture an...,0,yelp_labelled
2996,Appetite instantly gone.,0,yelp_labelled
2997,Overall I was not impressed and would not go b...,0,yelp_labelled
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled


In [13]:
df_source.groupby(['source', 'sentiment']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,sentence
source,sentiment,Unnamed: 2_level_1
amazon_cells_labelled,0,500
amazon_cells_labelled,1,500
imdb_labelled,0,500
imdb_labelled,1,500
yelp_labelled,0,500
yelp_labelled,1,500


## Challenge

Now you have all the elements to train a classifier for sentiment analysis using embeddings! A little reminder of the steps:

- First, remove the stop words so you won't have to do a weighted average. You can also lemmatize the text if you want, but in this case it shouldn't have a big influence.

- Then compute the sentence embeddings of the reviews. These are going to be our features.

- Do a train/test split.

- Choose the type of classifier you want to use (for example a logistic regression).

- Train and evaluate your classifier.

You should easily be able to reach an accuracy of 80%.
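
The split, train and evaluate steps can be sketched on a toy example in plain Python. The 2-dimensional "sentence embeddings", the labels and the nearest-centroid rule are all stand-ins invented for illustration. The challenge itself is meant to be done with the real spaCy vectors and a proper classifier such as a logistic regression.

```python
import random

# Toy "sentence embeddings" with their sentiment labels (invented data)
samples = [([0.9, 0.8], 1), ([0.8, 1.0], 1), ([1.0, 0.7], 1), ([0.7, 0.9], 1),
           ([-0.9, -0.8], 0), ([-0.8, -1.0], 0), ([-1.0, -0.7], 0), ([-0.7, -0.9], 0)]

# Train/test split: shuffle with a fixed seed, keep 75% for training
rng = random.Random(42)
shuffled = samples[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, test = shuffled[:cut], shuffled[cut:]

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# "Training": one centroid per class (a stand-in for a real classifier)
centroids = {
    label: centroid([vec for vec, lab in train if lab == label])
    for label in {lab for _, lab in train}
}

def predict(vec):
    """Return the label whose centroid is closest to vec."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Evaluation: share of test samples predicted correctly
accuracy = sum(predict(vec) == lab for vec, lab in test) / len(test)
print("Test accuracy =", accuracy)
```

The classifier is scored only on samples it never saw during training, which is the point of the split.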

### Word preprocessing

In [14]:
df_source['filtred_sentence'] = [PunkStopWordsExterm(sent, 'sentence') for sent in df_source['sentence']]
df_source

Unnamed: 0,sentence,sentiment,source,filtred_sentence
0,So there is no way for me to plug it in here i...,0,amazon_cells_labelled,So way plug US unless I go converter
1,"Good case, Excellent value.",1,amazon_cells_labelled,Good case Excellent value
2,Great for the jawbone.,1,amazon_cells_labelled,Great jawbone
3,Tied to charger for conversations lasting more...,0,amazon_cells_labelled,Tied charger conversations lasting 45 minutes ...
4,The mic is great.,1,amazon_cells_labelled,The mic great
...,...,...,...,...
2995,I think food should have flavor and texture an...,0,yelp_labelled,I think food flavor texture lacking
2996,Appetite instantly gone.,0,yelp_labelled,Appetite instantly gone
2997,Overall I was not impressed and would not go b...,0,yelp_labelled,Overall I impressed would go back
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled,The whole experience underwhelming I think go ...


### Embedding sentences


In [15]:
df_source['embedding_sentence'] = [nlp(sent).vector for sent in df_source['filtred_sentence']]
df_source

Unnamed: 0,sentence,sentiment,source,filtred_sentence,embedding_sentence
0,So there is no way for me to plug it in here i...,0,amazon_cells_labelled,So way plug US unless I go converter,"[0.1096631, 0.16896625, -0.14765373, 0.0498342..."
1,"Good case, Excellent value.",1,amazon_cells_labelled,Good case Excellent value,"[-0.27085474, 0.560745, -0.16895999, -0.057802..."
2,Great for the jawbone.,1,amazon_cells_labelled,Great jawbone,"[0.013746999, 0.37386, -0.0674105, -0.074564, ..."
3,Tied to charger for conversations lasting more...,0,amazon_cells_labelled,Tied charger conversations lasting 45 minutes ...,"[0.0052317046, 0.263143, 0.038863745, 0.032554..."
4,The mic is great.,1,amazon_cells_labelled,The mic great,"[-0.015982, 0.35231665, 0.025606329, 0.1886543..."
...,...,...,...,...,...
2995,I think food should have flavor and texture an...,0,yelp_labelled,I think food flavor texture lacking,"[-0.13820933, 0.23190516, -0.01509251, -0.2692..."
2996,Appetite instantly gone.,0,yelp_labelled,Appetite instantly gone,"[-0.10598, 0.18677, -0.03493534, -0.06977666, ..."
2997,Overall I was not impressed and would not go b...,0,yelp_labelled,Overall I impressed would go back,"[0.03134817, 0.32979402, -0.22557668, -0.00845..."
2998,"The whole experience was underwhelming, and I ...",0,yelp_labelled,The whole experience underwhelming I think go ...,"[0.062392544, 0.003429463, -0.03594419, -0.126..."


### Modelling

In [0]:
def EmbeddingMatrix(data):
  '''
  Turn a one-column DataFrame whose cells each hold a vector into a DataFrame
  with one column per vector element and one row per vector.
  e.g., 
      [1, 2, 3, 4, 5],
      [6, 7, 8, 9, 10]    (2, 1)

       | 0| 1| 2| 3| 4|
      0| 1| 2| 3| 4| 5|
      1| 6| 7| 8| 9|10|   (2, 5)
  '''
  
  import numpy as np

  # Stack the vectors row by row; the output shape follows the input instead
  # of being hard-coded to (3000, 300)
  matrix = np.vstack(data.iloc[:, 0].tolist()).astype(np.float64)

  return pd.DataFrame(matrix)

In [19]:
X = EmbeddingMatrix(df_source[['embedding_sentence']])
X.shape

(3000, 300)

In [20]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299
0,0.109663,0.168966,-0.147654,0.049834,0.102829,-0.094142,0.109589,-0.206444,-0.027967,1.78435,-0.124273,0.063895,0.116049,-0.08463,-0.187175,-0.153012,-0.262444,1.480843,-0.136094,0.134874,0.091645,-0.016464,-0.099847,0.014625,-0.079578,0.051355,-0.166281,-0.316079,0.169389,-0.164995,-0.185009,0.054753,0.040482,0.249062,-0.014022,-0.045909,0.140762,0.061508,-0.145026,-0.297461,...,0.02768,0.090858,0.105971,-0.173263,0.236921,-0.16725,-0.002762,0.04335,0.450269,0.249652,0.217144,0.051456,-0.009744,0.111146,-0.022863,0.169901,-0.00817,-0.035847,-0.151876,0.112259,0.02679,-0.004259,0.028025,0.170896,-0.062507,-0.056613,0.102537,-0.131265,0.158625,-0.062705,-0.187955,-0.053995,-0.156746,-0.0333,0.13758,-0.061015,-0.023025,-0.073138,-0.014318,0.085721
1,-0.270855,0.560745,-0.16896,-0.057803,0.036918,0.014717,0.108985,-0.328349,-0.006825,1.99395,-0.230334,0.130527,0.210568,0.095102,-0.238488,-0.16415,-0.185345,1.642375,-0.046432,-0.165923,-0.325705,-0.017494,0.082217,-0.191642,0.168568,0.072748,0.05903,-0.097233,-0.045005,-0.187548,0.196536,-0.076026,0.186652,0.090627,0.001399,-0.008536,0.276623,-0.215912,-0.160085,-0.381673,...,0.086577,0.115545,0.132033,0.32274,0.206147,-0.033628,0.181212,0.23286,0.536573,-0.112325,-0.040972,-0.049456,-0.040866,-0.28287,-0.06228,0.153102,0.062588,-0.054287,-0.089045,0.262835,-0.239125,0.083166,0.003665,-0.139467,-0.026252,-0.048946,-0.018232,-0.02886,0.032971,0.134638,-0.31233,0.200558,-0.281409,-0.127516,0.184591,-0.139078,0.031646,-0.38943,-0.034399,0.036702
2,0.013747,0.37386,-0.06741,-0.074564,0.070689,-0.15354,0.143205,-0.13925,-0.083925,1.70345,0.01152,-0.040388,-0.070225,0.371185,0.08471,0.05288,-0.33013,1.55515,-0.064726,0.23663,-0.032779,-0.153369,0.02404,-0.350315,0.071991,-0.09515,0.17764,-0.037675,-0.15586,0.054762,0.145335,-0.132815,0.10159,-0.222905,0.378795,0.05105,0.152127,-0.215993,0.029535,0.079135,...,-0.008265,-0.073052,0.01375,0.088435,0.397781,0.240161,0.069705,0.028425,0.03762,0.214086,0.15599,0.323997,-0.487735,0.060985,0.008515,-0.029136,-0.3272,0.092521,-0.21974,0.171205,-0.111631,0.208718,-0.125098,-0.213375,0.07864,-0.136414,0.2016,-0.0889,-0.02484,-0.14473,-0.50721,0.132917,-0.30248,0.071517,0.25498,0.315783,0.293955,0.137308,0.143229,0.105892
3,0.005232,0.263143,0.038864,0.032554,0.052446,-0.300596,-0.138098,0.033495,0.276415,1.736114,0.069309,0.104837,0.194245,-0.129325,-0.135233,-0.01432,-0.044309,1.289629,-0.196687,0.200776,-0.14231,0.194908,-0.068316,-0.107279,0.242936,0.004401,0.103688,0.015258,-0.016261,0.116775,0.05318,0.145179,-0.297973,-0.241276,0.032697,0.187086,0.081994,-0.084893,0.067034,0.016117,...,0.140689,0.044818,0.082057,0.025478,-0.018318,-0.155836,0.011366,0.168235,0.142625,-0.144056,0.074373,-0.003572,0.038383,-0.064905,-0.139747,0.105948,0.097436,0.102608,0.02027,0.063612,-0.114077,0.020457,0.066293,0.088737,-0.007222,0.119746,0.079966,-0.172343,0.044939,0.192429,-0.198691,-0.003271,-0.066565,0.127852,0.111966,0.010233,0.007898,-0.068726,-0.098732,0.016123
4,-0.015982,0.352317,0.025606,0.188654,0.182467,0.060571,0.23385,-0.234687,0.133407,1.682603,-0.091687,-0.039279,0.321583,-0.01106,-0.005474,-0.034167,-0.26267,1.128467,-0.277277,0.081237,-0.068542,-0.02728,0.082719,-0.16942,0.046568,-0.104087,-0.250583,0.249136,-0.066547,0.076403,-0.100633,0.30987,0.06,0.265996,-0.03195,0.006489,0.114733,-0.127352,-0.208155,-0.248147,...,-0.097483,-0.060752,0.156287,-0.131477,-0.005229,-0.036176,0.261663,0.155129,0.282853,0.050193,0.129882,0.120452,-0.36638,-0.10133,0.086427,-0.087886,0.047033,0.131491,-0.069938,-0.12676,0.078902,0.126602,-0.106804,0.116491,0.047214,-0.068193,0.33348,0.012526,0.00504,-0.006707,-0.49502,0.177245,-0.094147,-0.074577,0.040951,0.139583,0.196097,0.010625,-0.051216,0.25962


In [21]:
y = df_source['sentiment']
y.shape

(3000,)

#### Model training


In [0]:
def LogRegModeling(X, y, rand_state=42, train_size=.8):
  '''
  Split the data, fit a logistic regression on the train set and print its
  accuracy on the train and test sets. Return the fitted model.
  '''

  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
    
  # Hold out part of the data to evaluate the model on unseen reviews
  X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                      random_state = rand_state, 
                                                      train_size = train_size)
  LogisticModel = LogisticRegression().fit(X_train, y_train)

  print("Model accuracy score on train set =", 
        LogisticModel.score(X_train, y_train))
  print("Model accuracy score on test set =", 
        LogisticModel.score(X_test, y_test))
  
  return LogisticModel


In [24]:
LogRModel = LogRegModeling(X, y, rand_state = 42)

Model accuracy score on train set = 0.8745833333333334
Model accuracy score on test set = 0.8233333333333334
