# Word Embedding Techniques

https://www.youtube.com/watch?v=Do8cVbx-HOs&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=19

- Word2vec
- GloVe
- fastText

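Dense vectors: similar words have similar vectors, and the vectors are much shorter than sparse encodings such as one-hot vectors (300 elements is a popular size)

Word2vec and fastText are trained with continuous bag of words (CBOW) or skip-gram; GloVe is instead fit to global word co-occurrence counts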

- BERT (variants with different training and parameter choices: ALBERT, RoBERTa)
- GPT

Transformer-based embedding techniques

- ELMo

Based on LSTM (long short-term memory, a type of RNN)

# Word2Vec

https://www.youtube.com/watch?v=hQwFeIupNP0

Word vectors are learned as a side effect of training a neural network to predict the probability of a word given its context word(s).

The input layer takes one-hot encoded vectors for the context words before and after the target word; the number of words taken on each side is the window size. This is the continuous bag of words (CBOW) approach. The output should be the target word, also as a one-hot encoded vector. The hidden layer has as many neurons as the word vectors should have dimensions. Through backpropagation over a number of epochs, the network learns the weights connecting each word to the hidden layer. After training, a word's weights are that word's dense vector. The individual weights in the vector don't correspond to any real-world measurement (such as semantic components).
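To see why a word's weights can be read off as its vector, here is a minimal sketch, assuming a made-up 4-word vocabulary and an invented weight matrix (neither comes from a trained model). Multiplying a one-hot input by the weight matrix simply selects the row belonging to that word.

```python
# Hypothetical 4-word vocabulary and a made-up 4 x 3 input weight matrix
# (one row per word, one column per hidden neuron).
vocab = ["phone", "case", "battery", "charger"]
input_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(word):
    """Return a one-hot list with a 1.0 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]


def hidden_activations(input_vector):
    """Multiply a 1 x V input vector by the V x N weight matrix."""
    n_hidden = len(input_weights[0])
    return [
        sum(x * row[j] for x, row in zip(input_vector, input_weights))
        for j in range(n_hidden)
    ]


hidden = hidden_activations(one_hot("battery"))
print(hidden)  # [0.7, 0.8, 0.9], i.e. the row for "battery"
```

In CBOW with several context words, the hidden activations would typically be the average of the selected rows rather than a single row.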

Another approach is skip-gram, which is similar except that the network is trained to predict the context word(s) given the target word.
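The difference between the two approaches is easiest to see in the training examples they produce. The sketch below is illustrative only: the helper names `cbow_pairs` and `skipgram_pairs` and the sample sentence are hypothetical, and real implementations add details such as subsampling and negative sampling.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs, as used by CBOW."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def skipgram_pairs(tokens, window):
    """Yield (target_word, context_word) pairs, as used by skip-gram."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word


sentence = ["the", "phone", "case", "fits", "well"]
cbow = list(cbow_pairs(sentence, window=1))
skipgram = list(skipgram_pairs(sentence, window=1))

print(cbow[1])       # (['the', 'case'], 'phone')
print(skipgram[:3])  # [('the', 'phone'), ('phone', 'the'), ('phone', 'case')]
```

CBOW produces one example per target word with all context words as input, while skip-gram produces one example per (target, context) pair.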

# Gensim Code Tutorial

https://www.youtube.com/watch?v=Q2NtCcqmIww

Data: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

Note: the file downloads gzip-compressed (json.gz). To decompress it, run this command in Git Bash: gunzip reviews_Cell_Phones_and_Accessories_5.json.gz
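If `gunzip` is not available, the decompression can also be sketched with Python's standard library. The helper name `gunzip_file` is hypothetical; it streams the data so the whole file never has to fit in memory.

```python
import gzip
import shutil


def gunzip_file(src_path, dest_path):
    """Decompress a .gz file to dest_path, streaming in chunks."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

For this dataset the call would be `gunzip_file("reviews_Cell_Phones_and_Accessories_5.json.gz", "reviews_Cell_Phones_and_Accessories_5.json")`. As far as I know, `pd.read_json` also infers gzip compression from a `.gz` file name by default, so decompressing first may not be strictly necessary.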

In [7]:
# !pip install gensim
# !pip install python-levenshtein

In [8]:
import gensim
import pandas as pd

In [9]:
df = pd.read_json("reviews_Cell_Phones_and_Accessories_5.json", 
                  lines = True)

df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [10]:
df.shape

(194439, 9)

## Simple Preprocessing & Tokenization

The first step in most data science tasks is to clean the data. For NLP, this typically means converting all words to lower case, trimming spaces, and removing punctuation. We will do the same here.

Additionally, we can remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, like 'running' to 'run'.
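The optional steps can be sketched in plain Python. This is a toy illustration: the stop word set is the one listed above, and `ROOT_FORMS` is a hypothetical hand-written lookup table standing in for a real stemmer or lemmatizer.

```python
# Stop words taken from the list above.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

# Hypothetical lookup table; a real pipeline would use a stemmer or lemmatizer.
ROOT_FORMS = {"running": "run", "looks": "look", "stickers": "sticker"}


def normalize(tokens):
    """Drop stop words, then map each remaining token to its root form."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [ROOT_FORMS.get(t, t) for t in kept]


tokens = ["the", "stickers", "and", "the", "case", "looks", "good"]
normalized = normalize(tokens)
print(normalized)  # ['sticker', 'case', 'look', 'good']
```

The notebook itself skips these steps and relies only on `simple_preprocess`.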

In [11]:
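# we only want the 'reviewText' column
# gensim.utils.simple_preprocess: tokenize, remove punctuation,
# lowercase, trim spaces; by default it also drops tokens shorter
# than 2 characters (which is why "don't" becomes just "don")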
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

# now each review is a list of tokens
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [12]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

## Training the Word2Vec Model

Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. The min_count parameter sets the minimum number of times a word must appear in the corpus to be kept in the vocabulary; with min_count=2, words that occur only once are ignored.

The workers parameter sets how many CPU threads are used for training.
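The effect of `min_count` can be sketched with a plain frequency count. The helper `filter_vocab` and the three tiny documents are hypothetical; this only mimics the frequency threshold, not gensim's actual vocabulary building.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_count):
    """Keep only words whose total corpus frequency is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, n in counts.items() if n >= min_count}


docs = [
    ["great", "case"],
    ["great", "cable"],
    ["case", "broke"],
]
vocab = filter_vocab(docs, min_count=2)
print(sorted(vocab))  # ['case', 'great']
```

Here "cable" and "broke" each appear once, so they fall below the threshold and would get no vector.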

In [13]:
# initialize the model

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

In [14]:
# build vocabulary

model.build_vocab(review_text, progress_per = 1000)

In [15]:
model.corpus_count

194439

In [16]:
model.epochs

5

In [17]:
# train the word2vec model

model.train(review_text, 
            total_examples = model.corpus_count,
            epochs = model.epochs)

(61505889, 83868975)

In [18]:
# save the model so it can be reused elsewhere

model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

## Finding similar words

In [19]:
model.wv.most_similar("bad")

[('shabby', 0.6867200136184692),
 ('terrible', 0.6813408732414246),
 ('horrible', 0.6117178797721863),
 ('good', 0.588121235370636),
 ('funny', 0.5419744849205017),
 ('awful', 0.5320730805397034),
 ('okay', 0.5306132435798645),
 ('poor', 0.5153629779815674),
 ('crappy', 0.5038278698921204),
 ('cheap', 0.5017231106758118)]

In [20]:
model.wv.similarity(w1 = 'cheap', w2 = 'inexpensive')

0.52422905

In [21]:
model.wv.similarity(w1 = 'great', w2 = 'good')

0.7756392
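The scores returned by `model.wv.similarity` are cosine similarities between the two word vectors. A minimal from-scratch sketch, using made-up 2-element vectors instead of the trained ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
print(round(opposite, 6))        # -1.0
```

Because only the angle matters, vector length has no effect on the score. This may help explain why 'good' shows up near 'bad' above: words used in similar contexts point in similar directions, even when their meanings are opposite.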

## Further Reading

You can read more about gensim at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/

# spaCy Word Vectors

https://www.youtube.com/watch?v=vyohzuTkty8&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=20

In [23]:
# word vectors are only included in spacy's medium or large
# models
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     -------------------------------------- 587.7/587.7 MB 5.8 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [24]:
import spacy

nlp = spacy.load("en_core_web_lg")

In [25]:
# check whether this model has vectors for some words

doc = nlp("dog cat banana afskfsd")

for token in doc:
    print(token.text, 
          "Vector:", token.has_vector, 
          "OOV:", token.is_oov)

dog Vector: True OOV: False
cat Vector: True OOV: False
banana Vector: True OOV: False
afskfsd Vector: False OOV: True


In [26]:
# check out a vector

# dog
doc[0].vector

array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

In [28]:
# check out the shape of the spacy vector
doc[0].vector.shape

(300,)

In [29]:
# the vector for a whole doc is the average of the vectors
# of the words in that doc
doc.vector

array([ 1.285995  ,  1.51985   , -3.1519876 , -4.857275  ,  0.40372053,
       -0.702725  , -1.97505   , -1.9329001 , -0.79143   ,  0.99263746,
        3.560485  ,  1.390425  ,  0.26564184,  2.01145   ,  3.3977425 ,
       -3.612475  , -0.15815374, -2.1185076 ,  1.435475  , -1.710825  ,
       -2.4027236 ,  2.909375  , -2.1509075 , -2.2286    , -0.668355  ,
       -0.9713    , -2.6473498 ,  3.782715  , -2.5905025 , -0.33405   ,
       -0.61644995, -0.599235  , -1.24345   , -0.14730498,  0.490825  ,
       -4.184225  ,  1.0886575 ,  1.9182426 ,  2.1102002 , -2.239075  ,
       -0.19210999, -2.6021075 ,  5.2194247 ,  2.7733    ,  1.3173975 ,
        0.5136955 ,  1.3593975 , -1.86975   , -0.20521674, -1.4796726 ,
        2.3111901 ,  5.665     ,  2.3114748 ,  0.7079749 , -0.90067494,
        1.17948   ,  2.5487623 ,  0.68675   ,  1.7658175 ,  1.3378    ,
        0.59345746, -3.6535451 ,  0.527775  ,  1.3896024 , -2.6922002 ,
       -3.325725  , -1.3890749 , -0.874045  ,  0.09935001,  0.87
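The averaging behind `doc.vector` can be sketched without spaCy. The 2-element vectors below are made up, and the sketch assumes that a token without a vector contributes all zeros to the average, which is consistent with the numbers above (the first element of the doc vector is lower than a three-word average would be).

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Three hypothetical in-vocabulary tokens plus one out-of-vocabulary token,
# assumed here to be represented by an all-zero vector.
token_vectors = [
    [1.0, 4.0],
    [3.0, 0.0],
    [4.0, 4.0],
    [0.0, 0.0],
]
doc_vector = average_vectors(token_vectors)
print(doc_vector)  # [2.0, 2.0]
```

Under that assumption, out-of-vocabulary tokens pull the document vector toward zero.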

## Compare similarity scores

In [31]:
base_token = nlp("bread")

doc = nlp("bread sandwich burger car tiger human wheat")

for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

# higher scores (closer to 1) indicate more similarity

bread <-> bread: 1.0
sandwich <-> bread: 0.6341067010130894
burger <-> bread: 0.47520687769584247
car <-> bread: 0.06451532596945217
tiger <-> bread: 0.04764611272488976
human <-> bread: 0.2151154210812192
wheat <-> bread: 0.615036141030184


In [32]:
# define helper function to print similarity scores

def print_similarity(base_word, words_to_compare):
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

In [33]:
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone: 0.4387907748060368
samsung <-> iphone: 0.6708590303423401
iphone <-> iphone: 1.0
dog <-> iphone: 0.08211864228011527
kitten <-> iphone: 0.10222317834969896


In [34]:
# vector arithmetic: king - man + woman should land close to queen
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

result = king - man + woman
result

array([ 1.9392200e+00, -2.3115001e+00, -1.3863000e+00, -1.9133999e+00,
        4.1749401e+00, -1.5401300e+00, -3.8272700e+00,  5.0291996e+00,
       -2.4454002e+00,  2.0851002e+00,  1.6605499e+01, -1.3788500e+00,
       -5.7085404e+00,  2.7210798e+00,  6.6530025e-01,  3.4804001e+00,
        1.0497000e+00, -1.1281996e+00, -6.6435003e-01, -3.5216696e+00,
       -8.0680294e+00, -3.8434997e+00, -4.4948001e+00,  8.7943001e+00,
       -6.3383985e-01, -4.8098001e+00, -1.2955203e+00, -6.1078286e-01,
        4.1610003e-01, -4.1724200e+00,  3.7961500e+00, -5.5350199e+00,
       -1.4319000e+00, -4.7633996e+00,  3.7440000e+00, -1.2749730e+00,
        3.1816001e+00,  1.0476298e+00,  1.0784001e+00, -3.0779200e+00,
       -1.2711000e+00, -3.6251001e+00, -2.7258501e+00,  4.7676001e+00,
        1.5000498e+00,  2.5363998e+00,  9.6959996e-01,  2.8748999e+00,
        2.6771998e+00,  1.8741999e+00, -5.3535199e+00,  3.7624002e+00,
       -5.4443008e-01, -2.8594000e+00, -2.3983500e+00,  7.5615001e-01,
      

In [35]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([result], [queen])

array([[0.6178014]], dtype=float32)
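The same analogy can be turned into a nearest-neighbour lookup. The sketch below uses hand-made 3-element vectors (roughly "royalty", "male", "female"), which are purely illustrative and far cleaner than real embeddings; the helper `analogy` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy embeddings: [royalty, male, female]
vectors = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "bread": [0.1, 0.2, 0.1],
}


def analogy(a, b, c):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


best = analogy("king", "man", "woman")
print(best)  # queen
```

With real vectors the match is less exact, as the score of about 0.62 above shows. With a trained gensim model, a comparable query would be `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`.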

# Example: Text Classification Using spaCy Word Vectors

https://www.youtube.com/watch?v=ibi5hvw6f3g&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=21

In [1]:
import pandas as pd

df = pd.read_csv("Fake_Real_Data.csv")

df.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [2]:
df.shape

(9900, 2)

In [3]:
# check for class imbalance
df.label.value_counts()

# no class imbalance

Fake    5000
Real    4900
Name: label, dtype: int64

In [4]:
# convert label into numbers

df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})

df.head()

Unnamed: 0,Text,label,label_num
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0
1,U.S. conservative leader optimistic of common ...,Real,1
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0
4,Democrats say Trump agrees to work on immigrat...,Real,1


In [5]:
# convert text column into word vectors
# (make new column with each doc's vector)

import spacy
nlp = spacy.load("en_core_web_lg")

In [6]:
# takes a long time! :)
df['vector'] = df['Text'].apply(lambda x: nlp(x).vector)

In [8]:
# train/test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,  # one 300-element document vector per row
    df.label_num,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.label_num)  # keep the Fake/Real ratio the same in both sets

In [13]:
# X_train and X_test are 1-D object arrays whose elements are
# themselves numpy arrays; np.stack combines them into the single
# 2-D array (n_samples, 300) that the classifier expects

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [14]:
# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
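What min-max scaling does can be sketched in plain Python. The helpers `fit_min_max` and `transform_min_max` and the tiny data sets are hypothetical; the sketch follows the usual formula (x - min) / (max - min) per column, with min and max taken from the training data only.

```python
def fit_min_max(rows):
    """Return per-column minimums and maximums of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Scale each column with (x - min) / (max - min); constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi != lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[-2.0, 10.0], [0.0, 20.0], [2.0, 30.0]]
test = [[1.0, 15.0], [3.0, 5.0]]

mins, maxs = fit_min_max(train)
scaled_train = transform_min_max(train, mins, maxs)
scaled_test = transform_min_max(test, mins, maxs)

print(scaled_train)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(scaled_test)   # [[0.75, 0.25], [1.25, -0.25]]
```

Note the second test row: a test value below the training minimum still comes out negative, so scaling fitted on the training set does not strictly guarantee non-negative test features.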

In [15]:
# create classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

In [17]:
# evaluate model performance on test data
from sklearn.metrics import classification_report

y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      1000
           1       0.94      0.95      0.95       980

    accuracy                           0.95      1980
   macro avg       0.95      0.95      0.95      1980
weighted avg       0.95      0.95      0.95      1980
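The per-class columns of the report can be reproduced by hand. A minimal sketch with made-up labels (not the notebook's predictions); the helper name `precision_recall_f1` is hypothetical:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy, positive=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.5 0.57
```

Here class 1 has 2 true positives, 1 false positive and 2 false negatives. The support column in the report is simply the number of true examples of each class.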



In [18]:
# knn

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1000
           1       0.99      1.00      0.99       980

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980
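How a k-nearest-neighbours classifier with Euclidean distance makes a prediction can be sketched in a few lines. The helper `knn_predict` and the 2-element points are hypothetical; a library implementation would use faster search structures, but the voting idea is the same.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
near_five = knn_predict(train_X, train_y, [5.0, 5.5], k=3)
print(near_origin, near_five)  # 0 1
```

Unlike MultinomialNB, this approach has no problem with negative feature values, which is why the notebook fits it on the unscaled `X_train_2d`.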

