# Sentiment analysis with an MLP and vector representation


# Case Study: Sentiment Analysis

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published by Kaggle. The dataset contain the following information:

- Product Name
- Brand Name
- Price
- Rating
- Reviews
- Review Votes

We are mainly interested by the 'Reviews' (X) and by the 'Rating' (y)

The goal is to try to predict the 'Rating' after reading the 'Reviews'. I've prepared for you TRAIN and TEST set.
The work to be done is as follows:

1. Feature extraction and baseline
   - read the dataset and understand it
   - put it in a format so that you can use `CountVectorizer` or`Tf-IDF` to extract the desired features
   - perform on the desired dates and preprocessing
   - use one of the classifiers you know to predict the polarity of different sentences
1. My first neural network
   - reuse the features already extracted
   - proposed a neural network built with Keras
1. Hyper-parameter fitting
   - for the base line: adjust min_df, max_df, ngram, max_features + model's hyper-parameter
   - for the neural network: adjust batch size, number of layers and number of neuron by layers, use earlystop
1. <span style="color:red">Word embedding
   - stage 1 build a network that uses Keras' embedding which is not language sensitive.
   - stage 2 build a network that simultaneously uses Keras' embedding and the features extracted in the first weeks.
   - stage 3 try to use an existing embedding (https://github.com/facebookresearch/MUSE)
     </span>

**WARNING:** the dataset is voluminous, I can only encourage you to work first on a small part of it and only at the end, when the code is well debugged and that it is necessary to build the "final model", to use the whole dataset.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-the-dataset" data-toc-modified-id="Read-the-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read the dataset</a></span></li><li><span><a href="#Text-normalisation" data-toc-modified-id="Text-normalisation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Text normalisation</a></span></li><li><span><a href="#Approach1---BOW-and-MLP-classifier" data-toc-modified-id="Approach1---BOW-and-MLP-classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Approach1 - BOW and MLP classifier</a></span></li><li><span><a href="#Approach2---Keras-word-embedding-and-MLP-classifier" data-toc-modified-id="Approach2---Keras-word-embedding-and-MLP-classifier-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Approach2 - Keras word embedding and MLP classifier</a></span></li></ul></div>


## Read the dataset

Could you find below a proposal. You can complete them.


In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
from keras.layers import Dense, Embedding, Flatten, Input, TextVectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot
from tensorflow_addons.metrics import F1Score


In [2]:
TRAIN = pd.read_csv(
    "http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz"
)
TEST = pd.read_csv(
    "http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz"
)

TRAIN.head()


URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

In [None]:
""" Construct X_train and y_train """
X_train = TRAIN["Reviews"]
y_train = np.array(TRAIN["Rating"]).reshape(-1, 1)

X_test = TEST["Reviews"]
y_test = np.array(TEST["Rating"]).reshape(-1, 1)

nb_classes = len(np.unique(y_train))

ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
y_train_ohe = ohe.fit_transform(y_train)
y_test_ohe = ohe.fit_transform(y_test)

X_train.shape, y_train_ohe.shape, np.unique(y_train)


((5000,), (5000, 5), array([1, 2, 3, 4, 5]))

## Approach1 - BOW and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using a BOW representation of the dataset and evaluate the model.

The dataset being unbalanced the metric will be the f1 score.


$$TO DO STUDENT$$

> - Build BOW representation of the train and test set
> - Fix a value for vocab_size = the maximum number of words to keep, based on word frequency. Only the most common vocab_size-1 words will be kept.


In [None]:
# Your code
vocab_size = 10
tokenize = Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(X_train)
X_train_ohe = tokenize.texts_to_matrix(X_train, mode="tfidf")
X_test_ohe = tokenize.texts_to_matrix(X_test, mode="tfidf")


$$TO DO STUDENT$$

> - Build an MLP and print the model (model.summary())


In [None]:
# build sequential model
model = Sequential()
model.add(Input(shape=(vocab_size,), name="input", dtype=tf.float32))
model.add(Dense(128, activation="relu", name="hidden"))
model.add(Dense(5, activation="softmax", name="output"))
model.build()
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 hidden (Dense)              (None, 128)               1408      
                                                                 
 output (Dense)              (None, 5)                 645       
                                                                 
Total params: 2,053
Trainable params: 2,053
Non-trainable params: 0
_________________________________________________________________


2022-01-30 12:26:24.886902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-30 12:26:24.923335: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-01-30 12:26:24.923380: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-01-30 12:26:24.923674: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN

$$ TO DO STUDENT $$

> - Compile the network
> - Fit the network using EarlyStopping
> - Babysit your model
> - Evaluate the network with f1 score


In [None]:
X_train


0                  I love it!!! I absolutely love it!! 👌👍
1       I love the BLU phones! This is my second one t...
2                                             Great phone
3       Very happy with the performance. The apps work...
4                                 Easy to use great price
                              ...                        
4995    Easy to use. Easy to use. Easy to use. I can s...
4996    I got this phone because I did not want to go ...
4997    Phone works well, just as expected. No issues ...
4998                                            Great A++
4999    This is a really good phone, bought it to deve...
Name: Reviews, Length: 5000, dtype: object

In [None]:
## compile the model with f1 metrics
# define F1Score instance

f1_score_name = "f1_score"
f1 = F1Score(
    num_classes=len(np.unique(y_test)),
    name=f1_score_name,
    average="weighted",
)
# compile model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[f1],
)

# define early stopping
early_stop = EarlyStopping(monitor=f1_score_name, patience=40, verbose=2)

# fit model using early stopping
history = model.fit(
    X_train_ohe,
    y_train_ohe,
    # batch_size=5,
    epochs=50,
    verbose=1,
    callbacks=[early_stop],
    workers=6,
    use_multiprocessing=True,
)


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 00041: early stopping


In [None]:
import plotly.express as px

# Babysit the model - use you favourite plot
px.line(
    pd.DataFrame(
        {  #'val_loss':history.history['val_loss'],
            "loss": history.history["loss"],
            #'val_f1_score':history.history['val_f1_score'],
            "f1_score": history.history["f1_score"],
            #'val_accuracy':history.history['val_accuracy'],
            #'accuracy':history.history['accuracy']
        }
    )
)


In [None]:
# Evaluate the model with f1 metrics (Tensorflow f1 metrics or sklearn)
model.evaluate(X_test_ohe, y_test_ohe)




[1.2164429426193237, 0.4412788450717926]

## Approach2 - Keras word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an Embedding Keras layer and the same classifier as in approach 1. Evaluate the model.


$$ TO DO STUDENTS $$

> - fix the max_lengh of a review (max number of token in a review)
> - use the same vocab_size as previously
> - fix the embedding dimension (embed_dim variable)


In [None]:
import nltk

X_train_tok = [nltk.word_tokenize(review) for review in X_train]
max_len = int(
    np.amax([len(review_tok) for review_tok in X_train_tok])
)  # Sequence length to pad the outputs to
# In order to fix it, you have to know the distribution on lengh... see first lab
embed_dim = 300  # embedding dimension


$$ TO DO STUDENTS $$

> - Create a vectorizer_layer with TextVectorization function
> - Fit the vectorizer_layer (adapt function


In [None]:
vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_sequence_length=max_len,
)
vectorize_layer.adapt(X_train)
vectorize_layer(X_test)  # display vectorized test set


<tf.Tensor: shape=(1000, 1137), dtype=int64, numpy=
array([[2, 1, 1, ..., 0, 0, 0],
       [3, 1, 1, ..., 0, 0, 0],
       [3, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 4, 9, ..., 0, 0, 0]])>

In [None]:
vectorize_layer.get_vocabulary()


['', '[UNK]', 'the', 'i', 'it', 'and', 'phone', 'a', 'to', 'is']

In [None]:
vectorize_layer.get_config()


{'name': 'text_vectorization',
 'trainable': True,
 'batch_input_shape': (None, None),
 'dtype': 'string',
 'max_tokens': 10,
 'standardize': 'lower_and_strip_punctuation',
 'split': 'whitespace',
 'ngrams': None,
 'output_mode': 'int',
 'output_sequence_length': 1137,
 'pad_to_max_tokens': False,
 'sparse': False,
 'ragged': False,
 'vocabulary': None,
 'idf_weights': None}

$$TO DO STUDENT$$

> - Build an MLP and print the model (model.summary())


In [None]:
# Flatten after Embedding in order to reduce the dimension of tensors
input_ = Input(
    shape=(1,),
    dtype=tf.string,
)
x = vectorize_layer(input_)
x = Embedding(
    input_dim=vocab_size,
    output_dim=256,
    name="Embedding",
)(x)
x = Flatten()(x)
output_ = Dense(units=5, activation="sigmoid")(x)
model = Model(input_, output_)

# get summary of the model
model.summary()


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 1137)             0         
 torization)                                                     
                                                                 
 Embedding (Embedding)       (None, 1137, 256)         2560      
                                                                 
 flatten (Flatten)           (None, 291072)            0         
                                                                 
 dense (Dense)               (None, 5)                 1455365   
                                                                 
Total params: 1,457,925
Trainable params: 1,457,925
Non-trainable params: 0
___________________________________________________

$$ TO DO STUDENT $$

> - Compile the network
> - Fit the network using EarlyStopping
> - Babysit your model
> - Evaluate the network with f1 score


In [None]:
y_train_ohe.shape


(5000, 5)

In [None]:
# compile the model with metrics f1 score
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=f1,
)

# define early stopping
early_stop = EarlyStopping(
    monitor=f1_score_name,
    patience=15,
    verbose=2,
)

# fit model using ealy stopping
history = model.fit(
    x=X_train,
    y=y_train_ohe,
    epochs=50,
    batch_size=512,
    callbacks=early_stop,
)


Epoch 1/50


2022-01-30 12:26:36.632993: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 596115456 exceeds 10% of free system memory.
2022-01-30 12:26:36.735613: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 596115456 exceeds 10% of free system memory.


 1/10 [==>...........................] - ETA: 8s - loss: 1.6181 - f1_score: 0.3854

2022-01-30 12:26:37.160856: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 596115456 exceeds 10% of free system memory.
2022-01-30 12:26:37.300761: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 596115456 exceeds 10% of free system memory.


 2/10 [=====>........................] - ETA: 4s - loss: 4.0471 - f1_score: 0.3859

2022-01-30 12:26:37.741040: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 596115456 exceeds 10% of free system memory.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 00021: early stopping


In [None]:
# Babysit the model
px.line(
    pd.DataFrame(
        {
            # "val_loss": history.history["val_loss"],
            "loss": history.history["loss"],
            # "val_f1_score": history.history["val_f1_score"],
            "f1_score": history.history["f1_score"],
        }
    )
)


In [None]:
# Evaluate the model
model.evaluate(X_test, y_test_ohe)




[1.2580500841140747, 0.46112141013145447]

## Approach3 - Word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an existing embedding matrix (Word2Vec / Glove or FastText), or on an embedding matrix that you will have built using Gensim.

Use the same constant as a previous steps.

Evaluate the model.


In [None]:
import gensim
from gensim import models, utils

gensim_path = f"{gensim.__path__[0]}/test/test_data/"
corpus = "lee_background.cor"
corpus_path = gensim_path + corpus


class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)


In [None]:
sentences = MyCorpus()
model = models.Word2Vec(sentences=sentences, vector_size=150)


In [None]:
# Same steps as Keras Embedding
max_len = 10  # Sequence length to pad the outputs to.
vectorizer = TextVectorization(max_tokens=vocab_size, output_sequence_length=max_len)
vectorizer.adapt(X_train)
X_train_vec = vectorizer(X_train)
X_train_vec


<tf.Tensor: shape=(5000, 10), dtype=int64, numpy=
array([[3, 1, 4, ..., 1, 0, 0],
       [3, 1, 2, ..., 1, 1, 1],
       [1, 6, 0, ..., 0, 0, 0],
       ...,
       [6, 1, 1, ..., 1, 1, 1],
       [1, 7, 0, ..., 0, 0, 0],
       [1, 9, 7, ..., 4, 8, 1]])>

In [None]:
# Build word dict
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))


In [None]:
# Make a dict mapping words (strings) to their NumPy vector representation:

path_to_glove_file = "glove.6B/glove.6B.50d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")


Found 400000 word vectors.


In [None]:
# Prepare embedding matrix

num_tokens = len(voc) + 2
embedding_dim = 50
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    print(word, i)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print(f"Converted {hits} words ({misses} misses)")


 0
[UNK] 1
the 2
i 3
it 4
and 5
phone 6
a 7
to 8
is 9
Converted 8 words (2 misses)


In [None]:
# Define embedding layers

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

In [None]:
# define the model
input_ = Input(shape=(max_len,), dtype=tf.int32)
x = embedding_layer(input_)
x = Flatten()(x)
output_ = Dense(5, activation='sigmoid')(x)
model = Model(input_, output_)
# summarize the model
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_2 (Embedding)     (None, 10, 50)            600       
                                                                 
 flatten_4 (Flatten)         (None, 500)               0         
                                                                 
 dense_4 (Dense)             (None, 5)                 2505      
                                                                 
Total params: 3,105
Trainable params: 2,505
Non-trainable params: 600
_________________________________________________________________


In [None]:
# compile the model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=f1,
)


In [None]:
# define early stopping
early_stop = EarlyStopping(monitor=f1_score_name, patience=40, verbose=2)

# fit model using ealy stopping
history = model.fit(
    vectorizer(X_train),
    y_train_ohe,
    epochs=200,
    callbacks=early_stop,
    batch_size=64
)
history


Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<keras.callbacks.History at 0x7fa31d77abb0>

In [None]:
# evaluate the model
loss_result, f1_result = model.evaluate(vectorizer(X_train), y_train_ohe, verbose=0)
print(
    f"""loss: {loss_result}\nf1: {f1_result}"""
)


loss: 1.1984031200408936
f1: 0.44999581575393677


In [None]:
# Babysit the model
px.line(
    pd.DataFrame(
        {
            # "val_loss": history.history["val_loss"],
            "loss": history.history["loss"],
            # "val_f1_score": history.history["val_f1_score"],
            "f1_score": history.history["f1_score"],
        }
    )
)


In [None]:
# Evaluate the model
# Your code


## Approach3 - Word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an existing embedding matrix (Word2Vec / Glove or FastText), or on an embedding matrix that you will have built using Gensim.

Use the same constant as a previous steps.

Evaluate the model.


In [None]:
# Avalaible in your gensim installation...
# You can also use the train reviews.
corpus_path = (
    "/Users/riveill/opt/miniconda3/lib/python3.9/site-packages/gensim/test/test_data/"
)
corpus = "lee_background.cor"


In [None]:
# Build gensim model
from gensim.test.utils import datapath
from gensim import utils
import gensim.models


class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        for line in open(corpus_path + corpus):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)


sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences, vector_size=150)


In [None]:
# Export gensim model
import tempfile

with tempfile.NamedTemporaryFile(prefix="gensim-model-", delete=False) as tmp:
    temporary_filepath = tmp.name
    print(temporary_filepath)
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)


In [None]:
# Load gensim model
new_model = gensim.models.Word2Vec.load(temporary_filepath)


In [None]:
# Prepare embedding matrix
# Your code


In [None]:
# Define embedding layers
# Your code


In [None]:
# define the model
# Your code


In [None]:
# compile the model
# Your code


In [None]:
# fit model using ealy stopping
# Your code


In [None]:
# Babysit the model
pd.DataFrame(
    {
        "val_loss": history.history["val_loss"],
        "loss": history.history["loss"],
        "val_f1_score": history.history["val_f1_score"],
        "f1_score": history.history["f1_score"],
    }
).plot(figsize=(8, 5))


In [None]:
# Evaluate the model
# Your code
