# Word Embeddings
A word embedding is a dense vector representation of a word that captures something about its meaning.

Word embeddings improve on simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
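
To make the sparsity concrete, here is a minimal sketch of count-based encoding in plain Python. The three sample sentences mirror the reviews used later in this notebook; the helper name `count_vector` is an illustrative assumption, not part of any library.

```python
from collections import Counter

docs = [
    "this is a great movie",
    "i liked the length of the movie",
    "i did not like the movie",
]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return one count per vocabulary word (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc, vocab) for doc in docs]

for vec in vectors:
    zeros = vec.count(0)
    print(f"{vec}  ({zeros} of {len(vec)} entries are zero)")
```

Even with only three short sentences, most entries are zero. With a realistic vocabulary of tens of thousands of words, nearly every entry would be.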


Word embeddings work by using an algorithm to train a set of fixed-length, dense, continuous-valued vectors on a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.
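
Because each word is a point in a continuous space, the closeness of two words can be measured, most commonly with cosine similarity. The sketch below shows the calculation on three made-up 4-dimensional vectors. The values are illustrative assumptions, not the output of any trained model.

```python
import math

# Hypothetical toy embeddings; real ones are learned from a corpus.
embeddings = {
    "good":  [0.9, 0.1, 0.3, 0.0],
    "great": [0.8, 0.2, 0.4, 0.1],
    "table": [0.0, 0.9, 0.0, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["table"])
print(round(sim_close, 3), round(sim_far, 3))
```

In this toy setup the first similarity is close to 1 and the second is close to 0, which is the pattern one would hope to see for related and unrelated words in a well-trained embedding.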

Gensim is an open-source Python library for natural language processing, with a focus on topic modeling. It also provides the Word2Vec implementation used below.

In [1]:
!pip install gensim


Gensim's `Word2Vec` class accepts several parameters that control training:

* size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). In Gensim 4.0 and later this parameter is named `vector_size`.
* window: (default 5) The maximum distance between a target word and the words around it.
* min_count: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored.
* workers: (default 3) The number of threads to use while training.
* sg: (default 0, i.e. CBOW) The training algorithm, either CBOW (0) or skip-gram (1).
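
To see what `window` and `min_count` control, the following plain-Python sketch mimics the two ideas conceptually: dropping rare words, then pairing each target word with its neighbours. It is an illustration only, not Gensim's actual implementation, and the function names `filter_rare` and `context_pairs` are hypothetical.

```python
from collections import Counter

sentences = [
    ["this", "is", "a", "great", "movie"],
    ["i", "did", "not", "like", "the", "movie"],
]

def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

pairs = context_pairs(sentences[0], window=2)
print(pairs[:4])
print(filter_rare(sentences, min_count=2))
```

Roughly speaking, CBOW (`sg=0`) predicts the target word from its context words, while skip-gram (`sg=1`) predicts the context words from the target. On a corpus as small as the one in this notebook, `min_count` has to be lowered to 1 or almost every word would be discarded.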

In [2]:
import gensim

In [3]:
from gensim.models import Word2Vec

In [4]:
from gensim.models import KeyedVectors

In [6]:
import pandas as pd

In [9]:
#load data
df=pd.read_csv('reviews.csv', encoding='latin-1')

In [10]:
df.shape

(7, 2)

In [11]:
df.head(3)

|   | review | label |
|---|---|---|
| 0 | this is a great movie | 1 |
| 1 | I liked the length of the movie | 1 |
| 2 | I did not like the movie | 0 |


In [29]:
def tokenizer(text):
    return text.split()

In [33]:
# split each review on single spaces; a trailing space yields an empty '' token
df['tokenzied'] = df["review"].str.split(" ")

Word2Vec expects its input as a list of sentences, each of which is a list of words. Here every review is treated as a single sentence, so splitting each review into words, as done above, already gives the required structure: the `tokenzied` column holds one list of tokens per review.
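
The sketch below builds this list-of-token-lists structure in plain Python and shows one pitfall. Splitting on a literal space keeps empty strings when a text has trailing or repeated spaces, which is one way an empty token such as `''` can end up in a vocabulary. The sample texts and the helper name `to_sentences` are assumptions for illustration.

```python
texts = [
    "this is a great movie",
    "i liked the length of the movie ",  # note the trailing space
]

def to_sentences(texts):
    """Lowercase each text and split on any run of whitespace."""
    return [text.lower().split() for text in texts]

naive = [text.split(" ") for text in texts]  # split on single spaces only
clean = to_sentences(texts)

print(naive[1])  # ends with an empty string
print(clean[1])  # no empty tokens
```

Calling `split()` with no argument discards leading, trailing, and repeated whitespace, so it is usually the safer default for simple tokenization.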

In [42]:
model = Word2Vec(df['tokenzied'], min_count=1)

In [43]:
print(model)

Word2Vec(vocab=27, size=100, alpha=0.025)


In [44]:
# summarize vocabulary (Gensim 4.x removed wv.vocab; use list(model.wv.key_to_index) there)
words = list(model.wv.vocab)

In [45]:
words

['this',
 'is',
 'a',
 'great',
 'movie',
 'i',
 'liked',
 'the',
 'length',
 'of',
 '',
 'did',
 'not',
 'like',
 'should',
 'have',
 'gone',
 'to',
 'will',
 'try',
 'another',
 'please',
 'donõt',
 'watch',
 'best',
 'use',
 'time']

In [46]:
# access vector for one word (indexing the model directly is deprecated; go through model.wv)
print(model.wv['watch'])

[  3.09633644e-04   3.70263751e-03   2.38276972e-03  -3.87261924e-03
   3.09008430e-03  -3.69046163e-03  -5.54425584e-04  -2.06267717e-03
  -3.19953506e-05   4.66789724e-03   3.14125558e-03   2.37497129e-03
  -2.04226188e-03   3.44053190e-03  -6.20116480e-04   7.44130870e-04
   1.18909136e-03  -2.66512646e-03  -1.15553506e-04   5.24986594e-04
  -2.51844712e-03  -1.74200255e-03   2.53457529e-03   4.45861602e-03
  -8.28336400e-04  -3.58780287e-03  -4.10189899e-03   1.75538671e-03
  -4.87980293e-03  -1.64285919e-03   3.67766595e-03  -2.67773808e-04
   6.63481886e-04   3.90672730e-03  -2.53078528e-03   2.92938971e-03
   4.41400008e-03   4.71163075e-03  -2.33827275e-03   3.92537657e-03
   2.13778228e-03  -2.83788703e-03   9.19706144e-05  -1.64399657e-03
   1.05461280e-03  -2.20665173e-03  -1.96755980e-03  -4.37681517e-03
  -3.96122830e-03   2.14403844e-03  -3.84839205e-03   4.28637536e-03
   4.63919714e-03   4.99490241e-04   7.11607339e-04  -4.36716713e-03
   1.49905623e-03  -3.77917616e-03

  


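# Sentence Vector

A simple way to represent a whole review is to combine the vectors of its words into a single vector, here by summing them.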

In [48]:
import numpy as np

In [49]:
df.head(3)

|   | review | label | tokenzied |
|---|---|---|---|
| 0 | this is a great movie | 1 | [this, is, a, great, movie] |
| 1 | i liked the length of the movie | 1 | [i, liked, the, length, of, the, movie, ] |
| 2 | i did not like the movie | 0 | [i, did, not, like, the, movie] |


In [64]:
def vectorizer(text):
    # look up the embedding of every token; result has shape (n_tokens, 100)
    return np.array([model.wv[x] for x in text])
    

In [65]:
# apply the vectorizer function to every tokenized review
df['vec_text'] = df['tokenzied'].apply(vectorizer)

  


In [67]:
# sum the word vectors of each review into one 100-dimensional sentence vector
df['sent_vec'] = list(map(lambda x: x.sum(axis=0), df.vec_text))
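
Summing word vectors makes longer reviews produce larger vectors; averaging is a common alternative that removes this length effect. The sketch below shows both in plain Python. A tiny hypothetical lookup table stands in for the trained model (the 3-dimensional values and the name `sentence_vector` are assumptions), and tokens missing from the table are skipped rather than raising an error.

```python
# Hypothetical 3-dimensional word vectors standing in for model lookups.
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 2.0],
    "the":   [1.0, 1.0, 0.0],
}

def sentence_vector(tokens, vectors, average=False):
    """Combine the vectors of known tokens by summing or averaging."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known words, so no vector can be built
    summed = [sum(dim) for dim in zip(*known)]  # element-wise sum
    if average:
        return [value / len(known) for value in summed]
    return summed

tokens = ["the", "great", "movie", "unknownword"]
summed_vec = sentence_vector(tokens, toy_vectors)
mean_vec = sentence_vector(tokens, toy_vectors, average=True)
print(summed_vec)  # [2.0, 2.0, 4.0]
print(mean_vec)
```

Whether the sum or the mean works better depends on the downstream task, so it is worth trying both.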

In [83]:
X = pd.DataFrame(df['sent_vec'].tolist())

In [84]:
X.shape

(7, 100)

In [85]:
X

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.001679 | 0.002369 | -0.01219 | 0.013261 | -0.00113 | 0.004322 | -0.00082 | -0.003699 | 0.002799 | -0.000391 | ... | 0.000428 | 0.000418 | -0.00099 | -0.005109 | 0.010388 | 0.010266 | -0.01565 | 0.003815 | 0.005719 | 0.01231 |
| 1 | -0.00832 | -0.003231 | -0.010294 | -0.00525 | -0.01167 | 0.002519 | -0.000537 | 0.015894 | -0.005816 | -0.013788 | ... | -0.00341 | -0.000831 | 0.007358 | -0.003313 | 0.019346 | 0.006974 | -0.011077 | 0.01074 | 0.009 | 0.007877 |
| 2 | -0.006566 | 0.004054 | -0.001793 | -0.000389 | -0.000106 | 0.006649 | -0.000122 | -0.000466 | -0.001075 | -0.001082 | ... | 0.0118 | 0.001475 | 0.006896 | 0.000904 | -0.000249 | 0.003613 | -0.007766 | 0.003834 | 0.009996 | 0.011406 |
| 3 | -0.005039 | 0.014803 | 0.004784 | 0.003938 | 0.00856 | -0.004391 | 0.003821 | -0.012251 | -0.011578 | 0.006688 | ... | -0.001032 | -0.01448 | 0.015763 | -0.006751 | 0.005362 | 0.011469 | -0.007362 | 0.003082 | 0.004728 | 0.01528 |
| 4 | 0.01093 | -0.006461 | -0.001861 | 0.003614 | -0.002374 | 0.003197 | 0.001534 | 0.008267 | 0.0078 | -0.004337 | ... | 0.003912 | -0.003774 | -0.003448 | -0.004118 | 0.007381 | 0.006081 | -0.006517 | 0.00199 | -0.003789 | 0.011328 |
| 5 | 0.004003 | 0.005289 | -0.005221 | 0.007832 | 0.011042 | -0.011512 | 0.005874 | -0.008491 | -0.006839 | 0.005848 | ... | -0.005712 | -0.010749 | -0.006398 | -0.000234 | -0.000387 | 0.006732 | -0.012781 | -0.00541 | -0.002541 | 0.007355 |
| 6 | 0.00142 | 0.002101 | -0.004637 | 0.003078 | -0.005869 | 0.000756 | 0.005624 | 0.00798 | 0.00656 | -0.003168 | ... | 0.002237 | 0.009013 | 0.001045 | -0.009482 | 0.005547 | -0.004015 | -0.006422 | -0.006949 | 0.001613 | 0.004426 |


In [87]:
final_df=df.merge(X, how='outer', left_index=True, right_index=True)

In [88]:
final_df.head(3)

|   | review | label | tokenzied | vec_text | sent_vec | 0 | 1 | 2 | 3 | 4 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | this is a great movie | 1 | [this, is, a, great, movie] | [[-0.00286979, 0.00450099, -0.00367972, 0.0025... | [-0.00167917, 0.00236917, -0.0121904, 0.013261... | -0.001679 | 0.002369 | -0.01219 | 0.013261 | -0.00113 | ... | 0.000428 | 0.000418 | -0.00099 | -0.005109 | 0.010388 | 0.010266 | -0.01565 | 0.003815 | 0.005719 | 0.01231 |
| 1 | i liked the length of the movie | 1 | [i, liked, the, length, of, the, movie, ] | [[-0.00258756, -0.000346607, -0.000324643, -0.... | [-0.00831955, -0.00323095, -0.010294, -0.00524... | -0.00832 | -0.003231 | -0.010294 | -0.00525 | -0.01167 | ... | -0.00341 | -0.000831 | 0.007358 | -0.003313 | 0.019346 | 0.006974 | -0.011077 | 0.01074 | 0.009 | 0.007877 |
| 2 | i did not like the movie | 0 | [i, did, not, like, the, movie] | [[-0.00258756, -0.000346607, -0.000324643, -0.... | [-0.00656643, 0.00405386, -0.00179299, -0.0003... | -0.006566 | 0.004054 | -0.001793 | -0.000389 | -0.000106 | ... | 0.0118 | 0.001475 | 0.006896 | 0.000904 | -0.000249 | 0.003613 | -0.007766 | 0.003834 | 0.009996 | 0.011406 |


In [90]:
# drop the intermediate columns, keeping only the label and the 100 vector features
final_df = final_df.drop(['review', 'tokenzied', 'vec_text', 'sent_vec'], axis=1)

In [94]:
final_df.shape

(7, 101)

In [95]:
final_df.head(3)

|   | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.001679 | 0.002369 | -0.01219 | 0.013261 | -0.00113 | 0.004322 | -0.00082 | -0.003699 | 0.002799 | ... | 0.000428 | 0.000418 | -0.00099 | -0.005109 | 0.010388 | 0.010266 | -0.01565 | 0.003815 | 0.005719 | 0.01231 |
| 1 | 1 | -0.00832 | -0.003231 | -0.010294 | -0.00525 | -0.01167 | 0.002519 | -0.000537 | 0.015894 | -0.005816 | ... | -0.00341 | -0.000831 | 0.007358 | -0.003313 | 0.019346 | 0.006974 | -0.011077 | 0.01074 | 0.009 | 0.007877 |
| 2 | 0 | -0.006566 | 0.004054 | -0.001793 | -0.000389 | -0.000106 | 0.006649 | -0.000122 | -0.000466 | -0.001075 | ... | 0.0118 | 0.001475 | 0.006896 | 0.000904 | -0.000249 | 0.003613 | -0.007766 | 0.003834 | 0.009996 | 0.011406 |
