# Word Embedding Techniques

https://www.youtube.com/watch?v=Do8cVbx-HOs&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=19

- Word2vec
- GloVe
- fastText

Dense vectors: similar words have similar vectors, and the vectors are much shorter than one-hot encodings (300 elements is a popular size)

Word2vec and fastText are trained with the continuous bag of words (CBOW) or skip-gram approach; GloVe is instead fit to global word co-occurrence counts

- BERT (variants: ALBERT, RoBERTa)
- GPT

Transformer-based embedding techniques (based on transformer architecture)

- ELMo

Based on LSTM (Long Short-Term Memory RNN)

# Word2Vec

https://www.youtube.com/watch?v=hQwFeIupNP0

Word vectors are learned as a side effect of training a neural network to predict the probability of a word given its context word(s).

The input layer takes one-hot encoded vectors for the context words that precede and follow the target word; the number of words included on each side is the window size. The output is the target word, also as a one-hot encoded vector. This is the continuous bag of words (CBOW) approach. The size of the hidden layer is a hyperparameter, and it sets the length of the resulting word vectors. Through backpropagation over a number of epochs, the network learns the hidden-layer weights, and the weights associated with each word become that word's dense word vector. The individual values in the vector don't correspond to any real-world measurement (like semantic components).

Another approach is skip grams, which is similar except that the network is trained to predict the context word(s) given the target word.
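
To make the two training setups concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The function names `cbow_pairs` and `skipgram_pairs` and the example sentence are made up for illustration; real libraries build their examples internally and with extra steps such as subsampling.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs, one per token."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, window):
    """Return (target_word, context_word) pairs, one per context word."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        pairs.extend((target, word) for word in context)
    return pairs


# hypothetical example sentence
sentence = ["the", "phone", "case", "fits", "well"]

print(cbow_pairs(sentence, window=1)[1])
# (['the', 'case'], 'phone')

print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'phone'), ('phone', 'the'), ('phone', 'case')]
```

CBOW gets one example per target word (all context words in, one word out), while skip-gram gets one example per (target, context) pair.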

# Gensim Code Tutorial

https://www.youtube.com/watch?v=Q2NtCcqmIww

Data: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

Note: downloads as zipped file (json.gz); to unzip, run command in gitbash: gunzip reviews_Cell_Phones_and_Accessories_5.json.gz

In [1]:
# !pip install gensim
# !pip install python-levenshtein

In [2]:
import gensim
import pandas as pd

In [3]:
df = pd.read_json("reviews_Cell_Phones_and_Accessories_5.json", 
                  lines = True)

df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [4]:
df.shape

(194439, 9)

## Simple Preprocessing & Tokenization

The first step in any data science task is to clean the data. For NLP, typical preprocessing includes converting all words to lowercase, trimming spaces, and removing punctuation. We do the same here.

Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, e.g. 'running' to 'run'.
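
As a rough sketch of what stop word removal and root-form conversion do, the snippet below uses a tiny hand-written stop list and lookup table. Both `STOP_WORDS` and `ROOT_FORMS` are assumptions made up for this example; in practice a library such as spaCy or NLTK would supply much larger lists and a real lemmatizer.

```python
import re

# Assumption: toy stop list and root-form table, for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}
ROOT_FORMS = {"running": "run", "cases": "case"}


def clean_tokens(text):
    """Lowercase, keep alphabetic tokens, drop stop words, map to root forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [ROOT_FORMS.get(t, t) for t in tokens if t not in STOP_WORDS]


print(clean_tokens("The battery is running hot, and the cases crack."))
# ['battery', 'run', 'hot', 'case', 'crack']
```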

In [5]:
# we only want the 'reviewText' column
# gensim.utils.simple_preprocess: tokenize, remove punctuation,
# lowercase, trim spaces
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

# now each review is a list of tokens
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [6]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

## Training the Word2Vec Model

Train the model on the reviews. Use a window of size 10, i.e. 10 words before the current word and 10 words after it. Words that occur fewer than 2 times in the whole corpus are ignored; this is configured with the min_count parameter.

The workers parameter sets how many CPU threads are used for training.
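
gensim documents `min_count` as ignoring all words whose total frequency in the corpus is below the threshold. The sketch below imitates that effect in plain Python on three made-up reviews; `filter_rare_words` is a hypothetical helper written for this note, not gensim's implementation.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop every word whose corpus-wide count is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, n in counts.items() if n >= min_count}
    filtered = [[w for w in s if w in vocab] for s in sentences]
    return filtered, vocab


# hypothetical tokenized reviews
reviews = [["great", "case"], ["great", "charger"], ["case", "broke"]]

filtered, vocab = filter_rare_words(reviews, min_count=2)
print(sorted(vocab))  # ['case', 'great']
print(filtered)       # [['great', 'case'], ['great'], ['case']]
```

Words that fall below the threshold ('charger' and 'broke' here) never get a vector, so looking them up in the trained model would fail.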

In [7]:
# initialize the model

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

In [8]:
# build vocabulary

model.build_vocab(review_text, progress_per = 1000)

In [9]:
model.corpus_count

194439

In [10]:
model.epochs

5

In [11]:
# train the word2vec model

model.train(review_text, 
            total_examples = model.corpus_count,
            epochs = model.epochs)

(61500920, 83868975)

In [12]:
# save the model so it can be reused elsewhere

model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

## Finding similar words

In [13]:
model.wv.most_similar("bad")

[('terrible', 0.6464673280715942),
 ('shabby', 0.6327041983604431),
 ('horrible', 0.5904310941696167),
 ('good', 0.585347592830658),
 ('awful', 0.545529842376709),
 ('poor', 0.5279961228370667),
 ('sad', 0.5254566073417664),
 ('disappointing', 0.5213345885276794),
 ('okay', 0.5137054324150085),
 ('crappy', 0.509948194026947)]

In [14]:
model.wv.similarity(w1 = 'cheap', w2 = 'inexpensive')

0.5457079

In [15]:
model.wv.similarity(w1 = 'great', w2 = 'good')

0.7778567

## Further Reading

You can read more about gensim at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/

# spaCy Word Vectors

https://www.youtube.com/watch?v=vyohzuTkty8&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=20

In [16]:
# word vectors are only included in spacy's medium or large
# models
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     -------------------------------------- 587.7/587.7 MB 7.1 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [17]:
import spacy

nlp = spacy.load("en_core_web_lg")

In [18]:
# check out if this model has vectors for some words

doc = nlp("dog cat banana afskfsd")

for token in doc:
    print(token.text, 
          "Vector:", token.has_vector, 
          "OOV:", token.is_oov)

dog Vector: True OOV: False
cat Vector: True OOV: False
banana Vector: True OOV: False
afskfsd Vector: False OOV: True


In [19]:
# check out a vector

# dog
doc[0].vector

array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

In [20]:
# check out the shape of the spacy vector
doc[0].vector.shape

(300,)

In [21]:
# the vector for a whole doc is the average of the vectors
# of the words in that doc
doc.vector

array([ 1.285995  ,  1.51985   , -3.1519876 , -4.857275  ,  0.40372053,
       -0.702725  , -1.97505   , -1.9329001 , -0.79143   ,  0.99263746,
        3.560485  ,  1.390425  ,  0.26564184,  2.01145   ,  3.3977425 ,
       -3.612475  , -0.15815374, -2.1185076 ,  1.435475  , -1.710825  ,
       -2.4027236 ,  2.909375  , -2.1509075 , -2.2286    , -0.668355  ,
       -0.9713    , -2.6473498 ,  3.782715  , -2.5905025 , -0.33405   ,
       -0.61644995, -0.599235  , -1.24345   , -0.14730498,  0.490825  ,
       -4.184225  ,  1.0886575 ,  1.9182426 ,  2.1102002 , -2.239075  ,
       -0.19210999, -2.6021075 ,  5.2194247 ,  2.7733    ,  1.3173975 ,
        0.5136955 ,  1.3593975 , -1.86975   , -0.20521674, -1.4796726 ,
        2.3111901 ,  5.665     ,  2.3114748 ,  0.7079749 , -0.90067494,
        1.17948   ,  2.5487623 ,  0.68675   ,  1.7658175 ,  1.3378    ,
        0.59345746, -3.6535451 ,  0.527775  ,  1.3896024 , -2.6922002 ,
       -3.325725  , -1.3890749 , -0.874045  ,  0.09935001,  0.87
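
The averaging itself is simple enough to sketch without spaCy. Assuming made-up 3-element vectors (the names `mean_vector` and `toy_vectors` are hypothetical), a document vector is the element-wise mean of its word vectors:

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Assumption: toy 3-element vectors instead of real 300-element ones.
toy_vectors = {"dog": [1.0, 0.0, 2.0], "cat": [3.0, 2.0, 0.0]}

print(mean_vector([toy_vectors["dog"], toy_vectors["cat"]]))
# [2.0, 1.0, 1.0]
```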

## Compare similarity scores

In [22]:
base_token = nlp("bread")

doc = nlp("bread sandwich burger car tiger human wheat")

for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

# higher scores (closer to 1) indicate more similarity

bread <-> bread: 1.0
sandwich <-> bread: 0.6341067010130894
burger <-> bread: 0.47520687769584247
car <-> bread: 0.06451532596945217
tiger <-> bread: 0.04764611272488976
human <-> bread: 0.2151154210812192
wheat <-> bread: 0.615036141030184


In [23]:
# define helper function to print similarity scores

def print_similarity(base_word, words_to_compare):
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

In [24]:
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone: 0.4387907748060368
samsung <-> iphone: 0.6708590303423401
iphone <-> iphone: 1.0
dog <-> iphone: 0.08211864228011527
kitten <-> iphone: 0.10222317834969896


In [25]:
# vector arithmetic: king - man + woman should land close to queen
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

result = king - man + woman
result

array([ 1.9392200e+00, -2.3115001e+00, -1.3863000e+00, -1.9133999e+00,
        4.1749401e+00, -1.5401300e+00, -3.8272700e+00,  5.0291996e+00,
       -2.4454002e+00,  2.0851002e+00,  1.6605499e+01, -1.3788500e+00,
       -5.7085404e+00,  2.7210798e+00,  6.6530025e-01,  3.4804001e+00,
        1.0497000e+00, -1.1281996e+00, -6.6435003e-01, -3.5216696e+00,
       -8.0680294e+00, -3.8434997e+00, -4.4948001e+00,  8.7943001e+00,
       -6.3383985e-01, -4.8098001e+00, -1.2955203e+00, -6.1078286e-01,
        4.1610003e-01, -4.1724200e+00,  3.7961500e+00, -5.5350199e+00,
       -1.4319000e+00, -4.7633996e+00,  3.7440000e+00, -1.2749730e+00,
        3.1816001e+00,  1.0476298e+00,  1.0784001e+00, -3.0779200e+00,
       -1.2711000e+00, -3.6251001e+00, -2.7258501e+00,  4.7676001e+00,
        1.5000498e+00,  2.5363998e+00,  9.6959996e-01,  2.8748999e+00,
        2.6771998e+00,  1.8741999e+00, -5.3535199e+00,  3.7624002e+00,
       -5.4443008e-01, -2.8594000e+00, -2.3983500e+00,  7.5615001e-01,
      

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([result], [queen])

array([[0.6178014]], dtype=float32)
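
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up 2-element vectors; the `cosine` helper and the toy values for king, man, woman, and queen are assumptions chosen so the arithmetic works out exactly, unlike real embeddings.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 3.0]))  # 0.0 (perpendicular)

# Assumption: toy vectors with a "royalty" and a "gender" component.
king, man = [1.0, 1.0], [0.0, 1.0]
woman, queen = [0.0, -1.0], [1.0, -1.0]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(round(cosine(result, queen), 6))  # 1.0
```

With real vectors the match is only approximate, which is why the score above is about 0.62 rather than 1.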

# Example: Text Classification Using spaCy Word Vectors

https://www.youtube.com/watch?v=ibi5hvw6f3g&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=21

In [27]:
import pandas as pd

df = pd.read_csv("Fake_Real_Data.csv")

df.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [28]:
df.shape

(9900, 2)

In [29]:
# check for class imbalance
df.label.value_counts()

# no class imbalance

Fake    5000
Real    4900
Name: label, dtype: int64

In [30]:
# convert label into numbers

df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})

df.head()

Unnamed: 0,Text,label,label_num
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0
1,U.S. conservative leader optimistic of common ...,Real,1
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0
4,Democrats say Trump agrees to work on immigrat...,Real,1


In [31]:
# convert text column into word vectors
# (make new column with each doc's vector)

import spacy
nlp = spacy.load("en_core_web_lg")

In [32]:
# takes a long time! :)
df['vector'] = df['Text'].apply(lambda x: nlp(x).vector)

KeyboardInterrupt: 

In [None]:
# train/test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,  # one document vector per row
    df.label_num,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.label_num)

In [None]:
# X_train and X_test are numpy arrays whose elements are numpy arrays
# the classifier expects a single 2d numpy array,
# so stack the per-document vectors into shape (n_samples, n_features)

import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [None]:
# scale values so there are no negative values
# MultinomialNB doesn't accept negative values
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
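
To show what min-max scaling does to the numbers, here is a plain-Python sketch on made-up data. `fit_min_max` and `transform_min_max` are hypothetical helpers that imitate the fit/transform split, not scikit-learn's code: the minimum and maximum of each column are learned from the training rows only and then reused for the test rows.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Rescale each column so the training range maps onto [0, 1]."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# hypothetical 2-feature data
train = [[-2.0, 10.0], [0.0, 20.0], [2.0, 30.0]]
test = [[1.0, 15.0]]

mins, maxs = fit_min_max(train)
print(transform_min_max(train, mins, maxs))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(transform_min_max(test, mins, maxs))   # [[0.75, 0.25]]
```

Note that a test value smaller than the training minimum would still come out negative under this arithmetic, so the non-negativity is only guaranteed for the training rows.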

In [None]:
# create classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

In [None]:
# evaluate model performance on test data
from sklearn.metrics import classification_report

y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))

In [None]:
# knn

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

# Gensim Word Vectors

https://github.com/codebasics/nlp-tutorials/blob/main/15_word_vectors_gensim_overview/nlp_word_vectors_gensim_overview.ipynb

In [None]:
import gensim.downloader as api

# download model that is trained on google news
wv = api.load("word2vec-google-news-300")

# other models available:
# twitter, wiki
# glove, fasttext

In [None]:
wv.similarity(w1 = 'great', w2 = 'good')

In [None]:
wv.similarity(w1 = 'great', w2 = 'great')

In [None]:
wv.most_similar('good')

In [None]:
# do vector math
wv.most_similar(positive = ['france', 'berlin'], negative = ['paris'])

In [None]:
wv.most_similar(positive = ['king', 'woman'], negative = ['man'])

In [None]:
wv.most_similar(positive = ['puppy', 'adult'], negative = ['young'])

In [None]:
wv.doesnt_match(['facebook', 'cat', 'google', 'microsoft'])

In [None]:
wv.doesnt_match(['dog', 'cat', 'lion', 'google'])

In [None]:
# new model
glv = api.load('glove-twitter-25')

In [None]:
# will give different answers to same math questions
glv.most_similar('good')

# Example: News Classification With Gensim Word Vectors

https://www.youtube.com/watch?v=ZrgVlfNduj8&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=23

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [None]:
import pandas as pd
df = pd.read_csv('fake_and_real_news.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# class imbalance?
df.label.value_counts()

In [None]:
# create numeric column for label
df['label_num'] = df.label.map({'Fake': 0, 'Real': 1})

df.head()

In [None]:
# preprocess and get gensim doc vector

import spacy

nlp = spacy.load("en_core_web_lg")

def preprocess_and_vectorize(text):
    doc = nlp(text)
    
    filtered_tokens = []
    
    for token in doc:
        if token.is_punct or token.is_stop:
            continue
        filtered_tokens.append(token.lemma_)
    
    return wv.get_mean_vector(filtered_tokens)

In [None]:
# check
preprocess_and_vectorize("No worries if you don't understand!")

In [None]:
# convert text into gensim word embeddings

df['gensim_vector'] = df['Text'].apply(preprocess_and_vectorize)

In [None]:
df.head()

In [None]:
# train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.gensim_vector.values,
    df.label_num,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.label_num)

In [None]:
# create 2d np arrays for X train and test sets
import numpy as np

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [None]:
# gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

clf = GradientBoostingClassifier()

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

# fastText

https://www.youtube.com/watch?v=Br-Ozg9D4mc&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=24

**Word2vec:** neural network is trained on individual words (in context)

**fastText:** neural network is trained on character n-grams (where n is a hyperparameter; 3-6 are popular)
- ex: word 'capable', n = 3
- character n-grams: 'cap', 'apa', 'pab', 'abl', 'ble'
- will also include the whole word
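
The n-gram extraction can be sketched in a few lines of Python. One detail not listed above: fastText wraps each word in the boundary markers `<` and `>` before splitting it, so the first and last n-grams carry a marker. The helper name `char_ngrams` is made up for this example.

```python
def char_ngrams(word, n):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("capable", 3))
# ['<ca', 'cap', 'apa', 'pab', 'abl', 'ble', 'le>']
```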

**Advantage of fastText vs. Word2vec:** Word2vec has an out-of-vocabulary (OOV) problem: if a word appears in the test data but not in the training data, the model has no vector for it (even if it is close to a word the model does know, like 'capability' vs. 'capable'). fastText largely avoids this problem because, for example, the model has vectors for the n-grams 'cap', 'apa', and 'pab', which occur in both 'capability' and 'capable'.
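
A quick way to see why the two words end up related is to intersect their n-gram sets. This is only a sketch with a hypothetical helper (`char_ngrams`, here returning a set and using the `<`/`>` boundary markers that fastText adds); it shows which pieces are shared, not how the vectors themselves are combined.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}


shared = char_ngrams("capable") & char_ngrams("capability")
print(sorted(shared))
# ['<ca', 'apa', 'cap', 'pab']
```

Even if 'capability' never appeared in the training text, these shared n-grams were trained through 'capable', so the model can still assemble a vector for it.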

fastText is a technique as well as a library.

Often a first choice when you want to train custom word embeddings for your domain. 

**Potential problem:** in fastText the n-gram 'cap' has a single vector wherever it occurs, so 'capability'/'capable' and an unrelated word such as 'cap' (as in 'baseball cap') share part of their representation even though their meanings differ. Word2vec learns a separate vector for each whole word, so it keeps such words apart.

In [1]:
# pretrained model

!pip install fasttext



In [3]:
import fasttext
fasttext

<module 'fasttext' from 'C:\\Users\\yang0108\\AppData\\Local\\anaconda3\\envs\\dojo-env\\lib\\site-packages\\fasttext\\__init__.py'>

In [4]:
model_en = fasttext.load_model('cc.en.300.bin')



In [5]:
dir(model_en)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_labels',
 '_words',
 'f',
 'get_analogies',
 'get_dimension',
 'get_input_matrix',
 'get_input_vector',
 'get_label_id',
 'get_labels',
 'get_line',
 'get_meter',
 'get_nearest_neighbors',
 'get_output_matrix',
 'get_sentence_vector',
 'get_subword_id',
 'get_subwords',
 'get_word_id',
 'get_word_vector',
 'get_words',
 'is_quantized',
 'labels',
 'predict',
 'quantize',
 'save_model',
 'set_args',
 'set_matrices',
 'test',
 'test_label',
 'words']

In [6]:
model_en.get_nearest_neighbors("good")

[(0.7517593502998352, 'bad'),
 (0.7426098585128784, 'great'),
 (0.7299689054489136, 'decent'),
 (0.7123614549636841, 'nice'),
 (0.6796907186508179, 'Good'),
 (0.6737031936645508, 'excellent'),
 (0.669592022895813, 'goood'),
 (0.6602178812026978, 'ggod'),
 (0.6479219794273376, 'semi-good'),
 (0.6417751908302307, 'good.Good')]

In [7]:
model_en.get_word_vector("good")

array([-0.09213716, -0.0634383 ,  0.00173813,  0.13524324, -0.06561062,
        0.00619071,  0.12609869, -0.01646539,  0.0174491 , -0.00126792,
       -0.09709831,  0.02329333,  0.00996784,  0.00463419,  0.01587938,
        0.00689824,  0.08575399, -0.01988525, -0.0601579 , -0.02327966,
        0.01183712,  0.08217917,  0.01488847,  0.00902181,  0.00696296,
       -0.06426616,  0.03345198, -0.02101481,  0.06767873,  0.03022419,
        0.07203474, -0.05689922, -0.04370377,  0.00642597,  0.0439174 ,
        0.0604848 , -0.00611545, -0.12256738, -0.03530414, -0.02696739,
       -0.02058216,  0.00752347, -0.00686451,  0.0362783 , -0.03308735,
        0.05801626,  0.00832448, -0.06336953, -0.05775082,  0.01089846,
       -0.0925179 ,  0.01559984, -0.04079024,  0.0066871 , -0.06374165,
        0.05881973,  0.07209535, -0.05387195, -0.14658651, -0.04046486,
       -0.02507038, -0.04954465, -0.05224417, -0.06846938,  0.0467079 ,
        0.00459271, -0.07522177,  0.03627685, -0.0698283 ,  0.01

In [8]:
model_en.get_word_vector("good").shape

(300,)

In [9]:
model_en.get_analogies("berlin", "germany", "india")

[(0.7148876190185547, 'delhi'),
 (0.6974374055862427, 'mumbai'),
 (0.648612916469574, 'jaipur'),
 (0.6349966526031494, 'kolkata'),
 (0.6279922723770142, 'pune'),
 (0.6277596354484558, 'bangalore'),
 (0.6044078469276428, 'hyderabad'),
 (0.6021745800971985, 'noida'),
 (0.6018899083137512, 'bhubaneswar'),
 (0.599077582359314, 'nashik')]

In [10]:
model_en.get_analogies("driving", "car", "phone")

[(0.610385537147522, 'texting'),
 (0.5203558802604675, 'phone-calling'),
 (0.5153835415840149, 'cellphone'),
 (0.5135326981544495, 'cell-phone'),
 (0.5117910504341125, 'dialing'),
 (0.5087355971336365, 'texing'),
 (0.5079342722892761, 'text-messaging'),
 (0.500900387763977, 'txting'),
 (0.4960441589355469, 'texting.'),
 (0.4951859414577484, 'Texting')]

In [13]:
model_en.get_analogies("driving", "car", "book")

[(0.5302355885505676, 'reading'),
 (0.517051637172699, 'book.I'),
 (0.5137901306152344, 'book--and'),
 (0.5090512633323669, 'book.That'),
 (0.5005884766578674, 'book--it'),
 (0.49395182728767395, 'book--I'),
 (0.49293914437294006, 're-reading'),
 (0.49156999588012695, 'book.This'),
 (0.49107635021209717, 'reading--and'),
 (0.48960915207862854, 'book--the')]

In [14]:
# train a model on your own dataset

# indian food dataset

import pandas as pd

df = pd.read_csv("Cleaned_Indian_Food_Dataset.csv")

In [15]:
df.head()

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,TotalTimeInMins,Cuisine,TranslatedInstructions,URL,Cleaned-Ingredients,image-url,Ingredient-count
0,Masala Karela Recipe,"1 tablespoon Red Chilli powder,3 tablespoon Gr...",45,Indian,"To begin making the Masala Karela Recipe,de-se...",https://www.archanaskitchen.com/masala-karela-...,"salt,amchur (dry mango powder),karela (bitter ...",https://www.archanaskitchen.com/images/archana...,10
1,Spicy Tomato Rice (Recipe),"2 teaspoon cashew - or peanuts, 1/2 Teaspoon ...",15,South Indian Recipes,"To make tomato puliogere, first cut the tomato...",https://www.archanaskitchen.com/spicy-tomato-r...,"tomato,salt,chickpea lentils,green chilli,rice...",https://www.archanaskitchen.com/images/archana...,12
2,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1 Onion - sliced,1 teaspoon White Urad Dal (Sp...",50,South Indian Recipes,"To begin making the Ragi Vermicelli Recipe, fi...",https://www.archanaskitchen.com/ragi-vermicell...,"salt,rice vermicelli noodles (thin),asafoetida...",https://www.archanaskitchen.com/images/archana...,12
3,Gongura Chicken Curry Recipe - Andhra Style Go...,"1/2 teaspoon Turmeric powder (Haldi),1 tablesp...",45,Andhra,To begin making Gongura Chicken Curry Recipe f...,https://www.archanaskitchen.com/gongura-chicke...,"tomato,salt,ginger,sorrel leaves (gongura),fen...",https://www.archanaskitchen.com/images/archana...,15
4,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"oil - as per use, 1 tablespoon coriander seed...",30,Andhra,"To make Andhra Style Alam Pachadi, first heat ...",https://www.archanaskitchen.com/andhra-style-a...,"tomato,salt,ginger,red chillies,curry,asafoeti...",https://www.archanaskitchen.com/images/archana...,12


In [25]:
# use 'TranslatedInstructions' column to train model
# clean dataset: remove special characters, etc. with regex
# regex101.com

import re

In [26]:
# use one row's text to practice in regex101.com
text = df['TranslatedInstructions'][0]
text

'To begin making the Masala Karela Recipe,de-seed the karela and slice.\nDo not remove the skin as the skin has all the nutrients.\nAdd the karela to the pressure cooker with 3 tablespoon of water, salt and turmeric powder and pressure cook for three whistles.\nRelease the pressure immediately and open the lids.\nKeep aside.Heat oil in a heavy bottomed pan or a kadhai.\nAdd cumin seeds and let it sizzle.Once the cumin seeds have sizzled, add onions and saute them till it turns golden brown in color.Add the karela, red chilli powder, amchur powder, coriander powder and besan.\nStir to combine the masalas into the karela.Drizzle a little extra oil on the top and mix again.\nCover the pan and simmer Masala Karela stirring occasionally until everything comes together well.\nTurn off the heat.Transfer Masala Karela into a serving bowl and serve.Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family.\n'

In [27]:
# substitute any character that is not a word character or whitespace with a space

text = re.sub(r'[^\w\s]', ' ', text, flags = re.MULTILINE)
text

'To begin making the Masala Karela Recipe de seed the karela and slice \nDo not remove the skin as the skin has all the nutrients \nAdd the karela to the pressure cooker with 3 tablespoon of water  salt and turmeric powder and pressure cook for three whistles \nRelease the pressure immediately and open the lids \nKeep aside Heat oil in a heavy bottomed pan or a kadhai \nAdd cumin seeds and let it sizzle Once the cumin seeds have sizzled  add onions and saute them till it turns golden brown in color Add the karela  red chilli powder  amchur powder  coriander powder and besan \nStir to combine the masalas into the karela Drizzle a little extra oil on the top and mix again \nCover the pan and simmer Masala Karela stirring occasionally until everything comes together well \nTurn off the heat Transfer Masala Karela into a serving bowl and serve Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family \n'

In [29]:
# remove newline characters
text = re.sub(r"\n", " ", text, flags = re.MULTILINE)
text

'To begin making the Masala Karela Recipe de seed the karela and slice  Do not remove the skin as the skin has all the nutrients  Add the karela to the pressure cooker with 3 tablespoon of water  salt and turmeric powder and pressure cook for three whistles  Release the pressure immediately and open the lids  Keep aside Heat oil in a heavy bottomed pan or a kadhai  Add cumin seeds and let it sizzle Once the cumin seeds have sizzled  add onions and saute them till it turns golden brown in color Add the karela  red chilli powder  amchur powder  coriander powder and besan  Stir to combine the masalas into the karela Drizzle a little extra oil on the top and mix again  Cover the pan and simmer Masala Karela stirring occasionally until everything comes together well  Turn off the heat Transfer Masala Karela into a serving bowl and serve Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family  '

In [31]:
# remove extra spaces
text = re.sub(r" {2,}", " ", text, flags = re.MULTILINE)
text

'To begin making the Masala Karela Recipe de seed the karela and slice Do not remove the skin as the skin has all the nutrients Add the karela to the pressure cooker with 3 tablespoon of water salt and turmeric powder and pressure cook for three whistles Release the pressure immediately and open the lids Keep aside Heat oil in a heavy bottomed pan or a kadhai Add cumin seeds and let it sizzle Once the cumin seeds have sizzled add onions and saute them till it turns golden brown in color Add the karela red chilli powder amchur powder coriander powder and besan Stir to combine the masalas into the karela Drizzle a little extra oil on the top and mix again Cover the pan and simmer Masala Karela stirring occasionally until everything comes together well Turn off the heat Transfer Masala Karela into a serving bowl and serve Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family '

In [32]:
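# put all these steps into a preprocessing function
def preprocess(text):
    # replace everything except word characters and double quotes with a space
    text = re.sub(r"[^\w\"]", " ", text)
    # collapse runs of spaces and newlines into a single space
    text = re.sub(r"[ \n]+", " ", text)
    return text.strip().lower()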

In [34]:
text = df['TranslatedInstructions'][1]
text

'To make tomato puliogere, first cut the tomatoes.\nNow put in a mixer grinder and puree it.\nNow heat oil in a pan.\nAfter the oil is hot, add chana dal, urad dal, cashew and let it cook for 10 to 20 seconds.\nAfter 10 to 20 seconds, add cumin seeds, mustard seeds, green chillies, dry red chillies and curry leaves.\nAfter 30 seconds, add tomato puree to it and mix.\nAdd BC Belle Bhat powder, salt and mix it.\nAllow to cook for 7 to 8 minutes and then turn off the gas.\nTake it out in a bowl, add cooked rice and mix it.\nServe hot.\nServe tomato puliogre with tomato cucumber raita and papad for dinner.'

In [35]:
preprocess(text)

'to make tomato puliogere first cut the tomatoes now put in a mixer grinder and puree it now heat oil in a pan after the oil is hot add chana dal urad dal cashew and let it cook for 10 to 20 seconds after 10 to 20 seconds add cumin seeds mustard seeds green chillies dry red chillies and curry leaves after 30 seconds add tomato puree to it and mix add bc belle bhat powder salt and mix it allow to cook for 7 to 8 minutes and then turn off the gas take it out in a bowl add cooked rice and mix it serve hot serve tomato puliogre with tomato cucumber raita and papad for dinner'

In [36]:
# apply to entire column
df['TranslatedInstructions'] = df['TranslatedInstructions'].map(preprocess)

In [37]:
df.TranslatedInstructions[2]

'to begin making the ragi vermicelli recipe first steam the ragi vermicelli in a rice cooker or a steamer for about 5 6 minutes or till it is cooked but firm keep aside this aside till later use you can add a few drops of oil and mix it so that they don t stick to each other place a kadai on the heat add the ghee or oil to it and when warm add hing and allow it to sizzle for 30 seconds then follow it up with mustard seeds urad dal and curry leaves and allow them to crackle saute for 1 minute or so till the urad dal is slightly browned then add onions and fry till translucent and soft next add the green chillies along with par boiled carrots and peas sprinkle some salt and cook for 2 3 minutes or until the vegetables are semi cooked then add the steamed ragi vermicelli toss it together so the vegetables are all well combined switch off the heat take the vermicelli out into a serving dish and to with lemon juice mix well and serve along with coconut chutney and a hot cup of coffee or tea

In [41]:
# export data to csv to train model
df.to_csv("food_recipes.txt", 
          columns = ['TranslatedInstructions'], 
          header = None, 
          index = False)

# every line is a recipe

In [43]:
# train fasttext model on text
model = fasttext.train_unsupervised("food_recipes.txt")

In [44]:
# now model has different word vectors than pre-trained models
model.get_nearest_neighbors("good")

[(0.5873678922653198, 'give'),
 (0.5548170804977417, 'retained'),
 (0.5144150853157043, 'goodness'),
 (0.5059404969215393, '70'),
 (0.5023975968360901, 'got'),
 (0.4985058903694153, 'attained'),
 (0.48711153864860535, 'amchoor'),
 (0.4862218201160431, 'condiments'),
 (0.48579737544059753, 'well'),
 (0.48556724190711975, 'bright')]

In [None]:
# train_unsupervised defaults to skip-gram, but you can use CBOW instead:
# model = fasttext.train_unsupervised("food_recipes.txt", model = "cbow")
# hyperparameter tuning: change the dimension of the word vectors
# (dim, default is 100) or the number of epochs (epoch, default is 5)
# more info: https://fasttext.cc/docs/en/unsupervised-tutorial.html


# Example: Text Classification Using fastText

https://www.youtube.com/watch?v=Cq_pbQYO3M8&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=25

In [3]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", 
                 names = ['category', 'description'], 
                 header = None)

df.head()

Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [4]:
df.shape

(50425, 2)

In [5]:
df.category.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: category, dtype: int64

In [6]:
df.isna().sum()

category       0
description    1
dtype: int64

In [7]:
df.dropna(inplace = True)

In [8]:
df.isna().sum()

category       0
description    0
dtype: int64

In [9]:
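# rename the category that contains spaces so the label has no spaces
# (assign the result back; calling replace with inplace on df.category may not
# update df in recent pandas versions)
df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_Accessories")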

In [10]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [11]:
# fasttext expects the training text in the following format:
# one example per line
# each line starts with a label (one string, no spaces), then a space,
# then the text
# if the text has more than one label, the extra labels follow the first,
# separated by spaces
# every label must be prefixed with the string "__label__"

# prefix "__label__" to all categories
df['category'] = "__label__" + df['category']

In [12]:
# check

df.category.unique()

array(['__label__Household', '__label__Books',
       '__label__Clothing_Accessories', '__label__Electronics'],
      dtype=object)
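
The line format described above can be sketched without pandas. The helper `to_fasttext_line` and the label "Gift Ideas" are hypothetical and only for illustration. Unlike the manual rename in the notebook, this helper replaces every space in a label with an underscore.

```python
def to_fasttext_line(labels, text):
    """Build one training line: prefixed labels first, then the text."""
    prefixed = []
    for label in labels:
        # Labels must be single tokens, so replace inner spaces
        prefixed.append("__label__" + label.strip().replace(" ", "_"))
    # A newline would split the example across lines, so flatten whitespace
    flat_text = " ".join(text.split())
    return " ".join(prefixed + [flat_text])

print(to_fasttext_line(["Household"], "Framed wall hanging"))
print(to_fasttext_line(["Books", "Gift Ideas"], "A short book\nfor writers"))
```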

In [33]:
# merge the two columns so the text is on the same line as the label
df['category_description'] = df['category'] + " " + df['description']

# check
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__Household SAF Flower Print Framed Pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__Household Incredible Gifts India Wood...


In [34]:
# recommended preprocessing: convert to lowercase, remove punctuation,
# remove extra white space, etc.

# preprocess with re (regex101.com)

import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"

# replace every character that is not a word character, whitespace,
# or an apostrophe with a space
text = re.sub(r"[^\w\s\']", " ", text)
text

"  VIKI's   Bookcase Bookshelf  3 Shelf Shelve  White        hi"

In [35]:
# sub multiple spaces with one space
text = re.sub(r" +", " ", text)
text

" VIKI's Bookcase Bookshelf 3 Shelf Shelve White hi"

In [36]:
text = text.strip().lower()
text

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [37]:
# put it all into a function

def preprocess(text):
    text = re.sub(r"[^\w\s\']", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()

In [38]:
# test
text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"

preprocess(text)

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [39]:
# preprocess on entire category_description column
df['category_description'] = df['category_description'].map(preprocess)

df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...
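
Applying `preprocess` to the merged column is safe for the label prefix because `\w` also matches the underscore, so `__label__` survives. The label name itself is lowercased, as the output above shows. A small sketch on a made-up line (the product text is hypothetical):

```python
import re

def preprocess(text):
    # Same steps as the preprocess function defined in the notebook
    text = re.sub(r"[^\w\s\']", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()

line = "__label__Clothing_Accessories Men's T-Shirt (Pack of 2)"
cleaned = preprocess(line)
print(cleaned)

# The prefix survives, but the label name is now lowercase
label, _, body = cleaned.partition(" ")
print(label)
print(body)
```

Since the training and test files go through the same function, the lowercase labels stay consistent between them.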


In [41]:
# train/test split, stratified so both sets keep the same category proportions

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, 
                               test_size = 0.2, 
                               stratify = df.category)

In [42]:
train.shape

(40339, 3)

In [43]:
test.shape

(10085, 3)

In [44]:
train.head()

Unnamed: 0,category,description,category_description
27643,__label__Books,How to Think Like a Writer: A Short Book for C...,__label__books how to think like a writer a sh...
30244,__label__Books,"Archaeology: Theories, Methods and Practice Re...",__label__books archaeology theories methods an...
32499,__label__Clothing_Accessories,Littly Khadi Style Ethnic Wear Kids Cotton Kur...,__label__clothing_accessories littly khadi sty...
10430,__label__Household,R Crafts Handmade Wooden Non-Stick Serving And...,__label__household r crafts handmade wooden no...
13183,__label__Household,JAPP ABS Multifunction EU Plug Electric Steame...,__label__household japp abs multifunction eu p...


In [49]:
train.category.value_counts(normalize = True)

__label__Household               0.383004
__label__Books                   0.234413
__label__Electronics             0.210640
__label__Clothing_Accessories    0.171943
Name: category, dtype: float64

In [46]:
test.head()

Unnamed: 0,category,description,category_description
18536,__label__Household,What the Dog Saw: And Other Adventures About t...,__label__household what the dog saw and other ...
21702,__label__Books,Chapterwise Solved Papers Chemistry GATE 2019 ...,__label__books chapterwise solved papers chemi...
6241,__label__Household,Gala Spin Mop Rod (Compatible with only Gala m...,__label__household gala spin mop rod compatibl...
17839,__label__Household,"Cera Cloister 1046 Wash Basin (White , One Pie...",__label__household cera cloister 1046 wash bas...
9985,__label__Household,EZ Life Kitchen Roll Dispenser - Towel Holder ...,__label__household ez life kitchen roll dispen...


In [48]:
test.category.value_counts(normalize = True)

__label__Household               0.383044
__label__Books                   0.234408
__label__Electronics             0.210610
__label__Clothing_Accessories    0.171939
Name: category, dtype: float64
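
The matching proportions in train and test come from the `stratify` argument. To show the idea, here is a rough pure-Python sketch of a stratified split. It is not scikit-learn's implementation, and the function name, the toy labels, and the 40/25/20/15 mix are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within each class, then cut off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels: four classes in a 40/25/20/15 mix
labels = ["Household"] * 40 + ["Books"] * 25 + ["Electronics"] * 20 + ["Clothing"] * 15
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```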

In [50]:
# save category_description column in file for fasttext to read
train.to_csv("ecommerce_train", 
             columns = ["category_description"], 
             index = False, 
             header = False)

test.to_csv("ecommerce_test", 
             columns = ["category_description"], 
             index = False, 
             header = False)
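
Before training, it can help to check that every line in the exported files starts with the label prefix. `to_csv` wraps a field in double quotes when it contains, for example, a newline, which would put a quote in front of `__label__` or split one example over several lines. A hedged sketch of such a check, where `count_unlabeled_lines` is a hypothetical helper and the demo content is made up:

```python
import os
import tempfile

def count_unlabeled_lines(path, prefix="__label__"):
    """Count non-empty lines that do not start with the label prefix."""
    bad = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith(prefix):
                bad += 1
    return bad

# Demo on a small temporary file with hypothetical content
sample = (
    "__label__books how to think like a writer\n"
    "__label__household gala spin mop rod\n"
    "a stray line without a label\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

print(count_unlabeled_lines(tmp.name))
os.remove(tmp.name)
```

On the real files this would be called as `count_unlabeled_lines("ecommerce_train")`. Any count above zero points to lines that need a closer look.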

In [53]:
import fasttext

# train_supervised is used for text classification
# train_unsupervised from the previous tutorial is used to generate word embeddings

# the supervised model learns word embeddings from your data while it is
# trained, and uses those word embeddings for classification

model = fasttext.train_supervised(input = 'ecommerce_train')
model.test("ecommerce_test")

(10083, 0.9686601209957354, 0.9686601209957354)

In [54]:
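# first number is the number of test samples
# second is precision (at k = 1, the default)
# third is recall (at k = 1)
# the two match here because every example has exactly one label

# can predict new texts with:
# model.predict("TEXT HERE")

# the model has been trained, so it has word embeddings
# (they are learned for classification, so the neighbors are not
# necessarily similar in meaning)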
model.get_nearest_neighbors("sony")

[(0.9979151487350464, 'colour'),
 (0.9975535273551941, 'tripod'),
 (0.9973239898681641, 'laptops'),
 (0.997262179851532, 'whey'),
 (0.9971546530723572, 'gen'),
 (0.9969862103462219, 'link'),
 (0.996778130531311, 'mac'),
 (0.9967317581176758, 'glossy'),
 (0.9966189861297607, 'connectivity'),
 (0.9963929057121277, 'gps')]
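
The precision and recall returned by `model.test` are, as I understand it, computed at k (k = 1 by default): precision divides the correct predictions by the number of predicted labels, recall divides them by the number of true labels. A minimal sketch under that assumption, with a hypothetical helper name and made-up predictions:

```python
def precision_recall_at_k(predicted, gold):
    """predicted: list of top-k label lists; gold: list of true-label sets."""
    hits = sum(
        len(set(pred) & truth) for pred, truth in zip(predicted, gold)
    )
    n_predicted = sum(len(pred) for pred in predicted)
    n_gold = sum(len(truth) for truth in gold)
    return hits / n_predicted, hits / n_gold

# Hypothetical predictions (k = 1) and single-label ground truth
predicted = [["household"], ["books"], ["books"], ["electronics"]]
gold = [{"household"}, {"books"}, {"electronics"}, {"electronics"}]

precision, recall = precision_recall_at_k(predicted, gold)
print(precision, recall)
```

With one predicted label and one true label per example the two denominators are equal, so precision and recall come out the same, as they do in the `model.test` output earlier.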