# Sentiment analysis with an MLP and vector representation


# Case Study: Sentiment Analysis

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published by Kaggle. The dataset contain the following information:

- Product Name
- Brand Name
- Price
- Rating
- Reviews
- Review Votes

We are mainly interested by the 'Reviews' (X) and by the 'Rating' (y)

The goal is to try to predict the 'Rating' after reading the 'Reviews'. I've prepared for you TRAIN and TEST set.
The work to be done is as follows:

1. Feature extraction and baseline
   - read the dataset and understand it
   - put it in a format so that you can use `CountVectorizer` or`Tf-IDF` to extract the desired features
   - perform on the desired dates and preprocessing
   - use one of the classifiers you know to predict the polarity of different sentences
1. My first neural network
   - reuse the features already extracted
   - proposed a neural network built with Keras
1. Hyper-parameter fitting
   - for the base line: adjust min_df, max_df, ngram, max_features + model's hyper-parameter
   - for the neural network: adjust batch size, number of layers and number of neuron by layers, use earlystop
1. <span style="color:red">Word embedding
   - stage 1 build a network that uses Keras' embedding which is not language sensitive.
   - stage 2 build a network that simultaneously uses Keras' embedding and the features extracted in the first weeks.
   - stage 3 try to use an existing embedding (https://github.com/facebookresearch/MUSE)
     </span>

**WARNING:** the dataset is voluminous, I can only encourage you to work first on a small part of it and only at the end, when the code is well debugged and that it is necessary to build the "final model", to use the whole dataset.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-the-dataset" data-toc-modified-id="Read-the-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read the dataset</a></span></li><li><span><a href="#Text-normalisation" data-toc-modified-id="Text-normalisation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Text normalisation</a></span></li><li><span><a href="#Approach1---BOW-and-MLP-classifier" data-toc-modified-id="Approach1---BOW-and-MLP-classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Approach1 - BOW and MLP classifier</a></span></li><li><span><a href="#Approach2---Keras-word-embedding-and-MLP-classifier" data-toc-modified-id="Approach2---Keras-word-embedding-and-MLP-classifier-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Approach2 - Keras word embedding and MLP classifier</a></span></li></ul></div>


## Read the dataset

Could you find below a proposal. You can complete them.


In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
from keras.layers import Dense, Embedding, Flatten, Input, TextVectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot
from tensorflow_addons.metrics import F1Score


In [2]:
TRAIN = pd.read_csv(
    "http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz"
)
TEST = pd.read_csv(
    "http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz"
)

TRAIN.head()


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,Samsung Galaxy Note 4 N910C Unlocked Cellphone...,Samsung,449.99,4,I love it!!! I absolutely love it!! 👌👍,0.0
1,BLU Energy X Plus Smartphone - With 4000 mAh S...,BLU,139.0,5,I love the BLU phones! This is my second one t...,4.0
2,Apple iPhone 6 128GB Silver AT&T,Apple,599.95,5,Great phone,1.0
3,BLU Advance 4.0L Unlocked Smartphone -US GSM -...,BLU,51.99,4,Very happy with the performance. The apps work...,2.0
4,Huawei P8 Lite US Version- 5 Unlocked Android ...,Huawei,198.99,5,Easy to use great price,0.0


In [3]:
# Construct X_train and y_train
X_train = TRAIN["Reviews"]
y_train = np.array(TRAIN["Rating"]).reshape(-1, 1)

X_test = TEST["Reviews"]
y_test = np.array(TEST["Rating"]).reshape(-1, 1)

nb_classes = len(np.unique(y_train))

ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
y_train_ohe = ohe.fit_transform(y_train)
y_test_ohe = ohe.fit_transform(y_test)

X_train.shape, y_train_ohe.shape, np.unique(y_train)


((5000,), (5000, 5), array([1, 2, 3, 4, 5]))

## Approach1 - BOW and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using a BOW representation of the dataset and evaluate the model.

The dataset being unbalanced the metric will be the f1 score.


$$TO DO STUDENT$$

> - Build BOW representation of the train and test set
> - Fix a value for vocab_size = the maximum number of words to keep, based on word frequency. Only the most common vocab_size-1 words will be kept.


In [4]:
# Your code
vocab_size = 20000
tokenize = Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(X_train)
X_train_ohe = tokenize.texts_to_matrix(X_train, mode="tfidf")
X_test_ohe = tokenize.texts_to_matrix(X_test, mode="tfidf")


$$TO DO STUDENT$$

> - Build an MLP and print the model (model.summary())


In [5]:
# build sequential model
model = Sequential()
model.add(Input(shape=(vocab_size,), name="input", dtype=tf.float32))
model.add(Dense(64, activation="relu", name="hidden"))
model.add(Dense(5, activation="softmax"))
model.build()
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 hidden (Dense)              (None, 64)                1280064   
                                                                 
 dense (Dense)               (None, 5)                 325       
                                                                 
Total params: 1,280,389
Trainable params: 1,280,389
Non-trainable params: 0
_________________________________________________________________


2022-02-02 22:29:59.079944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-02 22:29:59.119856: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-02-02 22:29:59.119880: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-02-02 22:29:59.120533: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN

$$ TO DO STUDENT $$

> - Compile the network
> - Fit the network using EarlyStopping
> - Babysit your model
> - Evaluate the network with f1 score


In [6]:
X_train_ohe.shape


(5000, 20000)

In [7]:
y_train_ohe.shape, y_test_ohe.shape


((5000, 5), (1000, 5))

In [8]:
## compile the model with f1 metrics
# define F1Score instance

f1_score_name = "f1_score"
f1 = F1Score(
    num_classes=len(np.unique(y_test)),
    name=f1_score_name,
    average="weighted",
)
# compile model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[f1, "accuracy"],
)

# define early stopping
early_stop = EarlyStopping(
    monitor="val_f1_score",
    patience=10,
    verbose=1,
    restore_best_weights=True,
    mode="max",
)

# fit model using early stopping
history = model.fit(
    x=X_train_ohe,
    y=y_train_ohe,
    validation_data=(X_test_ohe, y_test_ohe),
    # validation_split=.3,
    # batch_size=1,
    epochs=2000,
    verbose=1,
    callbacks=[early_stop],
    workers=6,
    use_multiprocessing=True,
)


2022-02-02 22:29:59.500494: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 400000000 exceeds 10% of free system memory.


Epoch 1/2000
Epoch 2/2000
Epoch 3/2000
Epoch 4/2000
Epoch 5/2000
Epoch 6/2000
Epoch 7/2000
Epoch 8/2000
Epoch 9/2000
Epoch 10/2000
Epoch 11/2000
Epoch 12/2000
Epoch 13/2000
Epoch 14/2000
Epoch 15/2000
Epoch 00015: early stopping


In [9]:
import plotly.express as px

# Babysit the model - use you favourite plot
px.line(
    pd.DataFrame(
        {
            "val_loss": history.history["val_loss"],
            "loss": history.history["loss"],
            "val_f1_score": history.history["val_f1_score"],
            "f1_score": history.history["f1_score"],
            "val_accuracy": history.history["val_accuracy"],
            "accuracy": history.history["accuracy"],
        }
    )
)


In [10]:
# Evaluate the model with f1 metrics (Tensorflow f1 metrics or sklearn)
model.evaluate(X_test_ohe, y_test_ohe)




[1.3393656015396118, 0.6449266672134399, 0.6690000295639038]

## Approach2 - Keras word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an Embedding Keras layer and the same classifier as in approach 1. Evaluate the model.


$$ TO DO STUDENTS $$

> - fix the max_lengh of a review (max number of token in a review)
> - use the same vocab_size as previously
> - fix the embedding dimension (embed_dim variable)


In [11]:
import nltk

X_train_tok = [nltk.word_tokenize(review) for review in X_train]
max_len = int(
    np.amax([len(review_tok) for review_tok in X_train_tok])
)  # Sequence length to pad the outputs to
# In order to fix it, you have to know the distribution on lengh... see first lab
embed_dim = 300  # embedding dimension


$$ TO DO STUDENTS $$

> - Create a vectorizer_layer with TextVectorization function
> - Fit the vectorizer_layer (adapt function


In [12]:
vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_sequence_length=max_len,
)
vectorize_layer.adapt(X_train)
vectorize_layer(X_test)  # display vectorized test set


<tf.Tensor: shape=(1000, 1137), dtype=int64, numpy=
array([[    2,   438,    12, ...,     0,     0,     0],
       [    3,    29,   108, ...,     0,     0,     0],
       [    3,    45,    15, ...,     0,     0,     0],
       ...,
       [   16,    17,   113, ...,     0,     0,     0],
       [  198,  1559, 10239, ...,     0,     0,     0],
       [   82,     4,     9, ...,     0,     0,     0]])>

In [13]:
vectorize_layer.get_vocabulary()


['',
 '[UNK]',
 'the',
 'i',
 'it',
 'and',
 'phone',
 'a',
 'to',
 'is',
 'this',
 'for',
 'of',
 'with',
 'my',
 'not',
 'was',
 'in',
 'that',
 'but',
 'on',
 'have',
 'you',
 'great',
 'good',
 'as',
 'very',
 'so',
 'its',
 'had',
 'one',
 'be',
 'like',
 'no',
 'all',
 'or',
 'me',
 'if',
 'just',
 'battery',
 'use',
 'screen',
 'has',
 'are',
 'an',
 'would',
 'from',
 'only',
 'at',
 'when',
 'works',
 'can',
 'love',
 'will',
 'get',
 'new',
 'they',
 'work',
 'up',
 'time',
 'really',
 'than',
 'phones',
 'dont',
 'price',
 'product',
 'out',
 'camera',
 'im',
 'am',
 'about',
 'do',
 'because',
 'well',
 'buy',
 'after',
 'sim',
 'bought',
 'card',
 'got',
 'even',
 'more',
 'what',
 'also',
 'other',
 'there',
 'which',
 'back',
 'now',
 'your',
 'iphone',
 'does',
 'any',
 'some',
 'used',
 'nice',
 'excellent',
 'fast',
 'did',
 'better',
 'quality',
 'apps',
 'unlocked',
 'doesnt',
 'then',
 'much',
 'could',
 'case',
 'been',
 'problem',
 'came',
 'by',
 'best',
 'perfe

In [14]:
vectorize_layer.get_config()


{'name': 'text_vectorization',
 'trainable': True,
 'batch_input_shape': (None, None),
 'dtype': 'string',
 'max_tokens': 20000,
 'standardize': 'lower_and_strip_punctuation',
 'split': 'whitespace',
 'ngrams': None,
 'output_mode': 'int',
 'output_sequence_length': 1137,
 'pad_to_max_tokens': False,
 'sparse': False,
 'ragged': False,
 'vocabulary': None,
 'idf_weights': None}

$$TO DO STUDENT$$

> - Build an MLP and print the model (model.summary())


In [18]:
# Flatten after Embedding in order to reduce the dimension of tensors
model = Sequential()
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim=vocab_size, output_dim=64))
model.add(Flatten())
model.add(Dense(units=128, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=5, activation="sigmoid"))
model.build()

# get summary of the model
model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 1137)             0         
 torization)                                                     
                                                                 
 embedding_1 (Embedding)     (None, 1137, 64)          1280000   
                                                                 
 flatten_1 (Flatten)         (None, 72768)             0         
                                                                 
 dense_3 (Dense)             (None, 128)               9314432   
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 5)                 645       
                                                      

$$ TO DO STUDENT $$

> - Compile the network
> - Fit the network using EarlyStopping
> - Babysit your model
> - Evaluate the network with f1 score


In [19]:
vectorize_layer(X_train)


<tf.Tensor: shape=(5000, 1137), dtype=int64, numpy=
array([[ 3, 52,  4, ...,  0,  0,  0],
       [ 3, 52,  2, ...,  0,  0,  0],
       [23,  6,  0, ...,  0,  0,  0],
       ...,
       [ 6, 50, 73, ...,  0,  0,  0],
       [23,  7,  0, ...,  0,  0,  0],
       [10,  9,  7, ...,  0,  0,  0]])>

In [20]:
# compile the model with metrics f1 score
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[f1, "accuracy"],
)

# define early stopping
early_stop = EarlyStopping(
    monitor=f"val_{f1_score_name}",
    patience=10,
    verbose=3,
    restore_best_weights=True,
    mode="max",
)

# fit model using ealy stopping
history = model.fit(
    x=X_train,
    y=y_train_ohe,
    epochs=2000,
    validation_data=(X_test, y_test_ohe),
    callbacks=early_stop,
    use_multiprocessing=True,
    workers=6,
)


Epoch 1/2000
Epoch 2/2000
Epoch 3/2000
Epoch 4/2000
Epoch 5/2000
Epoch 6/2000
Epoch 7/2000
Epoch 8/2000
Epoch 9/2000
Epoch 10/2000
Epoch 11/2000
Epoch 12/2000
Epoch 13/2000
Epoch 14/2000
Epoch 15/2000
Epoch 00015: early stopping


In [21]:
# Babysit the model
px.line(
    pd.DataFrame(
        {
            "val_loss": history.history["val_loss"],
            "loss": history.history["loss"],
            "val_f1_score": history.history["val_f1_score"],
            "f1_score": history.history["f1_score"],
            "val_accuracy": history.history["val_accuracy"],
            "accuracy": history.history["accuracy"],
        }
    )
)


In [22]:
# Evaluate the model
model.evaluate(X_test, y_test_ohe)




[1.025903344154358, 0.6308760643005371, 0.6629999876022339]

**The model seems to overfit: its results improve on the train set, but (at best) remain stable on the validation set.**


## Approach3 - Word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an existing embedding matrix (Word2Vec / Glove or FastText), or on an embedding matrix that you will have built using Gensim.

Use the same constant as a previous steps.

Evaluate the model.


In [23]:
import gensim
from gensim import models, utils

gensim_path = f"{gensim.__path__[0]}/test/test_data/"
corpus = "lee_background.cor"
corpus_path = gensim_path + corpus


class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)


In [24]:
sentences = MyCorpus()
model = models.Word2Vec(sentences=sentences, vector_size=150)


In [25]:
# Same steps as Keras Embedding
max_len = 10  # Sequence length to pad the outputs to.
vectorizer = TextVectorization(max_tokens=vocab_size, output_sequence_length=max_len)
vectorizer.adapt(X_train)
X_train_vec = vectorizer(X_train)
X_train_vec


<tf.Tensor: shape=(5000, 10), dtype=int64, numpy=
array([[   3,   52,    4, ..., 4907,    0,    0],
       [   3,   52,    2, ...,   14,  348,   30],
       [  23,    6,    0, ...,    0,    0,    0],
       ...,
       [   6,   50,   73, ...,  182,   13,  100],
       [  23,    7,    0, ...,    0,    0,    0],
       [  10,    9,    7, ...,    4,    8, 2459]])>

In [26]:
# Build word dict
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))


In [27]:
# Make a dict mapping words (strings) to their NumPy vector representation:

path_to_glove_file = "glove.6B/glove.6B.50d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")


Found 400000 word vectors.


In [28]:
# Prepare embedding matrix

num_tokens = len(voc) + 2
embedding_dim = 50
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    print(word, i)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print(f"Converted {hits} words ({misses} misses)")


 0
[UNK] 1
the 2
i 3
it 4
and 5
phone 6
a 7
to 8
is 9
this 10
for 11
of 12
with 13
my 14
not 15
was 16
in 17
that 18
but 19
on 20
have 21
you 22
great 23
good 24
as 25
very 26
so 27
its 28
had 29
one 30
be 31
like 32
no 33
all 34
or 35
me 36
if 37
just 38
battery 39
use 40
screen 41
has 42
are 43
an 44
would 45
from 46
only 47
at 48
when 49
works 50
can 51
love 52
will 53
get 54
new 55
they 56
work 57
up 58
time 59
really 60
than 61
phones 62
dont 63
price 64
product 65
out 66
camera 67
im 68
am 69
about 70
do 71
because 72
well 73
buy 74
after 75
sim 76
bought 77
card 78
got 79
even 80
more 81
what 82
also 83
other 84
there 85
which 86
back 87
now 88
your 89
iphone 90
does 91
any 92
some 93
used 94
nice 95
excellent 96
fast 97
did 98
better 99
quality 100
apps 101
unlocked 102
doesnt 103
then 104
much 105
could 106
case 107
been 108
problem 109
came 110
by 111
best 112
perfect 113
using 114
first 115
android 116
too 117
working 118
off 119
2 120
still 121
want 122
life 123
need 124
ev

In [29]:
# Define embedding layers

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)


In [30]:
# define the model
input_ = Input(shape=(max_len,), dtype=tf.int32)
x = embedding_layer(input_)
x = Flatten()(x)
output_ = Dense(5, activation="sigmoid")(x)
model = Model(input_, output_)
# summarize the model
model.summary()


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_2 (Embedding)     (None, 10, 50)            550200    
                                                                 
 flatten_2 (Flatten)         (None, 500)               0         
                                                                 
 dense_5 (Dense)             (None, 5)                 2505      
                                                                 
Total params: 552,705
Trainable params: 2,505
Non-trainable params: 550,200
_________________________________________________________________


In [31]:
# compile the model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[f1, "accuracy"],
)

# define early stopping
early_stop = EarlyStopping(
    monitor=f"val_{f1_score_name}",
    patience=100,
    verbose=2,
    mode="max",
    restore_best_weights=True,
)


# fit model using ealy stopping
history = model.fit(
    x=vectorizer(X_train),
    y=y_train_ohe,
    validation_data=(vectorizer(X_test), y_test_ohe),
    epochs=2000,
    callbacks=early_stop,
)


Epoch 1/2000
Epoch 2/2000
Epoch 3/2000
Epoch 4/2000
Epoch 5/2000
Epoch 6/2000
Epoch 7/2000
Epoch 8/2000
Epoch 9/2000
Epoch 10/2000
Epoch 11/2000
Epoch 12/2000
Epoch 13/2000
Epoch 14/2000
Epoch 15/2000
Epoch 16/2000
Epoch 17/2000
Epoch 18/2000
Epoch 19/2000
Epoch 20/2000
Epoch 21/2000
Epoch 22/2000
Epoch 23/2000
Epoch 24/2000
Epoch 25/2000
Epoch 26/2000
Epoch 27/2000
Epoch 28/2000
Epoch 29/2000
Epoch 30/2000
Epoch 31/2000
Epoch 32/2000
Epoch 33/2000
Epoch 34/2000
Epoch 35/2000
Epoch 36/2000
Epoch 37/2000
Epoch 38/2000
Epoch 39/2000
Epoch 40/2000
Epoch 41/2000
Epoch 42/2000
Epoch 43/2000
Epoch 44/2000
Epoch 45/2000
Epoch 46/2000
Epoch 47/2000
Epoch 48/2000
Epoch 49/2000
Epoch 50/2000
Epoch 51/2000
Epoch 52/2000
Epoch 53/2000
Epoch 54/2000
Epoch 55/2000
Epoch 56/2000
Epoch 57/2000
Epoch 58/2000
Epoch 59/2000
Epoch 60/2000
Epoch 61/2000
Epoch 62/2000
Epoch 63/2000
Epoch 64/2000
Epoch 65/2000
Epoch 66/2000
Epoch 67/2000
Epoch 68/2000
Epoch 69/2000
Epoch 70/2000
Epoch 71/2000
Epoch 72/2000
E

In [32]:
# Babysit the model
px.line(
    pd.DataFrame(
        {
            "loss": history.history["loss"],
            "val_loss": history.history["val_loss"],
            "f1_score": history.history["f1_score"],
            "val_f1_score": history.history["val_f1_score"],
            "accuracy": history.history["accuracy"],
            "val_accuracy": history.history["val_accuracy"],
        }
    )
)


In [33]:
# Evaluate the model
model.evaluate(vectorizer(X_test), y_test_ohe)




[1.2321563959121704, 0.5482052564620972, 0.5820000171661377]

## Approach3 (bis) - Word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an existing embedding matrix (Word2Vec / Glove or FastText), or on an embedding matrix that you will have built using Gensim.

Use the same constant as a previous steps.

Evaluate the model.


In [34]:
# Build gensim model
import gensim
from gensim.test.utils import datapath
from gensim import utils
import gensim.models

gensim_path = f"{gensim.__path__[0]}/test/test_data/"
corpus = "lee_background.cor"
corpus_path = gensim_path + corpus


class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)


sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences, vector_size=150)


In [35]:
# Export gensim model
import tempfile

with tempfile.NamedTemporaryFile(prefix="gensim-model-", delete=False) as tmp:
    temporary_filepath = tmp.name
    print(temporary_filepath)
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)


/tmp/gensim-model-ylf38uv3


In [36]:
# Load gensim model
new_model = gensim.models.Word2Vec.load(temporary_filepath)


In [37]:
# Prepare embedding matrix
num_tokens = len(voc) + 2
embedding_dim = 150
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    try:
        model.wv[word]
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = model.wv[word]
        hits += 1
    except :
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 1083 words (9919 misses)


In [38]:
# Define embedding layers
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)


In [39]:
# define the model
input_ = Input(shape=(max_len,), dtype=tf.int32)
x = embedding_layer(input_)
x = Flatten()(x)
output_ = Dense(5, activation='sigmoid')(x)
model = Model(input_, output_)
# summarize the model
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_3 (Embedding)     (None, 10, 150)           1650600   
                                                                 
 flatten_3 (Flatten)         (None, 1500)              0         
                                                                 
 dense_6 (Dense)             (None, 5)                 7505      
                                                                 
Total params: 1,658,105
Trainable params: 7,505
Non-trainable params: 1,650,600
_________________________________________________________________


In [44]:
# compile the model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy", f1],
)

# define early stopping
early_stop = EarlyStopping(
    monitor="val_f1_score",
    patience=100,
    verbose=1,
    restore_best_weights=True,
    mode="max",
)


# fit model using ealy stopping
history = model.fit(
    x=vectorizer(X_train),
    y=y_train_ohe,
    epochs=2000,
    verbose=1,
    validation_data=(vectorizer(X_test), y_test_ohe),
    callbacks=[early_stop],
    workers=6,
    use_multiprocessing=True,
)


Epoch 1/2000
Epoch 2/2000
Epoch 3/2000
Epoch 4/2000
Epoch 5/2000
Epoch 6/2000
Epoch 7/2000
Epoch 8/2000
Epoch 9/2000
Epoch 10/2000
Epoch 11/2000
Epoch 12/2000
Epoch 13/2000
Epoch 14/2000
Epoch 15/2000
Epoch 16/2000
Epoch 17/2000
Epoch 18/2000
Epoch 19/2000
Epoch 20/2000
Epoch 21/2000
Epoch 22/2000
Epoch 23/2000
Epoch 24/2000
Epoch 25/2000
Epoch 26/2000
Epoch 27/2000
Epoch 28/2000
Epoch 29/2000
Epoch 30/2000
Epoch 31/2000
Epoch 32/2000
Epoch 33/2000
Epoch 34/2000
Epoch 35/2000
Epoch 36/2000
Epoch 37/2000
Epoch 38/2000
Epoch 39/2000
Epoch 40/2000
Epoch 41/2000
Epoch 42/2000
Epoch 43/2000
Epoch 44/2000
Epoch 45/2000
Epoch 46/2000
Epoch 47/2000
Epoch 48/2000
Epoch 49/2000
Epoch 50/2000
Epoch 51/2000
Epoch 52/2000
Epoch 53/2000
Epoch 54/2000
Epoch 55/2000
Epoch 56/2000
Epoch 57/2000
Epoch 58/2000
Epoch 59/2000
Epoch 60/2000
Epoch 61/2000
Epoch 62/2000
Epoch 63/2000
Epoch 64/2000
Epoch 65/2000
Epoch 66/2000
Epoch 67/2000
Epoch 68/2000
Epoch 69/2000
Epoch 70/2000
Epoch 71/2000
Epoch 72/2000
E

In [45]:
# Babysit the model
px.line(
    pd.DataFrame(
        {
            "val_loss": history.history["val_loss"],
            "loss": history.history["loss"],
            "val_f1_score": history.history["val_f1_score"],
            "f1_score": history.history["f1_score"],
            "val_accuracy": history.history["val_accuracy"],
            "accuracy": history.history["accuracy"],
        }
    )
)


In [46]:
# evaluate the model
model.evaluate(vectorizer(X_test), y_test_ohe, verbose=1)




[1.209080457687378, 0.546999990940094, 0.47998419404029846]