# Sentiment analysis with an MLP and vector representation

# Case Study: Sentiment Analysis

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published by Kaggle. The dataset contains the following information:
* Product Name
* Brand Name
* Price
* Rating
* Reviews
* Review Votes

We are mainly interested in the 'Reviews' (X) and the 'Rating' (y).


The goal is to predict the 'Rating' from the text of the 'Reviews'. I've prepared a TRAIN set and a TEST set for you.
The work to be done is as follows:

1. Feature extraction and baseline
    * read the dataset and understand it
    * put it in a format so that you can use `CountVectorizer` or `TF-IDF` to extract the desired features
    * perform the desired preprocessing on the data
    * use one of the classifiers you know to predict the polarity of different sentences
1. My first neural network
    * reuse the features already extracted 
    * propose a neural network built with Keras
1. Hyper-parameter fitting
    * for the baseline: adjust min_df, max_df, ngram, max_features + the model's hyper-parameters
    * for the neural network: adjust batch size, number of layers and number of neurons per layer, use early stopping
1. <span style="color:red">Word embedding
    * stage 1: build a network that uses Keras' embedding, which is not language sensitive.
    * stage 2: build a network that simultaneously uses Keras' embedding and the features extracted in the first weeks.
    * stage 3: try to use an existing embedding (https://github.com/facebookresearch/MUSE)
    </span>

**WARNING:** the dataset is large. Work on a small part of it first, and use the whole dataset only at the end, once the code is well debugged and you need to build the "final model".
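
One possible way to work on a small part of the data is to draw a stratified sample, so that each rating keeps roughly its original share. The sketch below uses a toy DataFrame as a stand-in for TRAIN (the rows are made up for illustration), and the 20% fraction is an arbitrary assumption.

```python
import pandas as pd

# Toy stand-in for TRAIN: hypothetical rows, not taken from the real dataset
toy = pd.DataFrame({
    "Reviews": [f"review {i}" for i in range(100)],
    "Rating": [1 + i % 5 for i in range(100)],
})

# Keep 20% of the rows of each rating so class proportions are preserved
small = toy.groupby("Rating").sample(frac=0.2, random_state=0)

print(len(small))
print(small["Rating"].value_counts().sort_index())
```

Fixing `random_state` keeps the subset identical between runs, which makes debugging easier.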

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-the-dataset" data-toc-modified-id="Read-the-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read the dataset</a></span></li><li><span><a href="#Text-normalisation" data-toc-modified-id="Text-normalisation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Text normalisation</a></span></li><li><span><a href="#Approach1---BOW-and-MLP-classifier" data-toc-modified-id="Approach1---BOW-and-MLP-classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Approach1 - BOW and MLP classifier</a></span></li><li><span><a href="#Approach2---Keras-word-embedding-and-MLP-classifier" data-toc-modified-id="Approach2---Keras-word-embedding-and-MLP-classifier-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Approach2 - Keras word embedding and MLP classifier</a></span></li></ul></div>

## Read the dataset

Below is a proposed starting point. You can complete it.

In [1]:
#!pip install tensorflow-addons

In [4]:
import os
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, TextVectorization, Dense, Flatten, Embedding

from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder

import tensorflow_addons as tfa
from tensorflow_addons.metrics import F1Score

In [5]:
TRAIN = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz")
TEST = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz")

TRAIN.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,Samsung Galaxy Note 4 N910C Unlocked Cellphone...,Samsung,449.99,4,I love it!!! I absolutely love it!! 👌👍,0.0
1,BLU Energy X Plus Smartphone - With 4000 mAh S...,BLU,139.0,5,I love the BLU phones! This is my second one t...,4.0
2,Apple iPhone 6 128GB Silver AT&T,Apple,599.95,5,Great phone,1.0
3,BLU Advance 4.0L Unlocked Smartphone -US GSM -...,BLU,51.99,4,Very happy with the performance. The apps work...,2.0
4,Huawei P8 Lite US Version- 5 Unlocked Android ...,Huawei,198.99,5,Easy to use great price,0.0


In [6]:
''' Construct X_train and y_train '''
X_train = TRAIN['Reviews']
y_train = np.array(TRAIN['Rating']).reshape(-1,1)

X_test = TEST['Reviews']
y_test = np.array(TEST['Rating']).reshape(-1,1)

nb_classes = len(np.unique(y_train))

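# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adapt to your version
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
y_train_enc = ohe.fit_transform(y_train)
# Reuse the encoder fitted on the train labels: do not refit on the test set
y_test_enc = ohe.transform(y_test)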

X_train.shape, y_train_enc.shape, np.unique(y_train)

((5000,), (5000, 5), array([1, 2, 3, 4, 5]))
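
The following sketch illustrates why the encoder should be fitted on the training labels only. The label arrays are toy values chosen for illustration: when a split happens to miss some ratings, refitting the encoder on it yields a different number of columns, which no longer matches the output layer of the network.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels (hypothetical): the "test" split has no rating 2, 3 or 4
y_tr = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_te = np.array([5, 1, 5]).reshape(-1, 1)

# Fit on train, then only transform the test labels
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(y_tr)
reused = enc.transform(y_te).toarray()

# Refitting on the test labels builds a different, smaller encoding
refit = OneHotEncoder(handle_unknown="ignore").fit_transform(y_te).toarray()

print(reused.shape)  # (3, 5)
print(refit.shape)   # (3, 2)
```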

## Approach1 - BOW and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using a BOW representation of the dataset and evaluate the model.

The dataset is unbalanced, so the metric will be the F1 score.

$$TO DO STUDENT$$
> * Build BOW representation of the train and test set
> * Fix a value for vocab_size = the maximum number of words to keep, based on word frequency. Only the most common vocab_size-1 words will be kept.
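
A minimal sketch of these two steps, assuming scikit-learn's `CountVectorizer` is used for the BOW representation (the course companion notebook may use a different tool). The texts and the value of `vocab_size` are toy assumptions. Note that `max_features` keeps exactly `vocab_size` terms, whereas the "vocab_size-1" wording above matches the `num_words` argument of the Keras `Tokenizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora (hypothetical), standing in for X_train and X_test
train_texts = ["great phone great price", "bad battery", "great battery life"]
test_texts = ["bad phone", "amazing screen"]

vocab_size = 4  # assumption: tune this on the real data

# Learn the vocabulary on the train set only, then reuse it on the test set
bow = CountVectorizer(max_features=vocab_size)
X_train_bow = bow.fit_transform(train_texts).toarray()
X_test_bow = bow.transform(test_texts).toarray()

print(sorted(bow.vocabulary_))
print(X_train_bow.shape, X_test_bow.shape)
```

Words never seen in the train set, such as "amazing" here, simply produce zero counts in the test matrix.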

In [7]:
# Your code

$$TO DO STUDENT$$

> * Build an MLP and print the model (model.summary())
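
As one possible starting point, the sketch below builds a small MLP with the Keras functional API. The single hidden layer of 64 units is an arbitrary assumption, and `vocab_size` is a placeholder for the width of your BOW matrix.

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

vocab_size = 1000  # assumption: number of BOW features
nb_classes = 5     # ratings 1 to 5

# One BOW vector per review goes in, one probability per rating comes out
inputs = Input(shape=(vocab_size,))
hidden = Dense(64, activation="relu")(inputs)
outputs = Dense(nb_classes, activation="softmax")(hidden)

mlp = Model(inputs=inputs, outputs=outputs)
mlp.summary()
```

The number of layers and of neurons per layer are exactly the hyper-parameters the lab asks you to adjust later.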

In [8]:
# Your code

$$ TO DO STUDENT $$
> * Compile the network
> * Fit the network using EarlyStopping
> * Babysit your model
> * Evaluate the network with f1 score
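
The compile and fit steps could look like the sketch below. It runs on random toy data and tracks accuracy only, to stay independent of `tensorflow_addons`; in the notebook, adding `F1Score(num_classes=nb_classes, average='macro')` to `metrics` should produce the `f1_score` and `val_f1_score` history keys used by the plotting cell. The patience value and layer sizes are assumptions.

```python
import numpy as np
import tensorflow as tf

# Random toy data (hypothetical): 200 samples, 20 features, 5 classes
rng = np.random.default_rng(0)
X = rng.random((200, 20)).astype("float32")
y = tf.keras.utils.to_categorical(rng.integers(0, 5, size=200), num_classes=5)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Stop when the validation loss has not improved for 3 epochs
# and restore the weights of the best epoch
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(X, y,
                    validation_split=0.2,
                    epochs=20,
                    batch_size=32,
                    callbacks=[early_stop],
                    verbose=0)
print(sorted(history.history.keys()))
```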

In [9]:
# compile the model with f1 metrics
# Your code

In [10]:
# fit model using early stopping
# Your code

In [11]:
# Babysit the model - use your favourite plot
pd.DataFrame({#'val_loss':history.history['val_loss'],
              #'loss':history.history['loss'],
              #'val_f1_score':history.history['val_f1_score'],
              # 'f1_score':history.history['f1_score'],
              #'val_accuracy':history.history['val_accuracy'],
              #'accuracy':history.history['accuracy']
             }).plot(figsize=(8,5))

TypeError: no numeric data to plot

In [12]:
# Evaluate the model with f1 metrics (Tensorflow f1 metrics or sklearn)
# Your code
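
For the scikit-learn route, a sketch of the evaluation: the predicted probabilities are turned into class labels with `argmax`, then compared to the true ratings. The probability matrix below is made up for illustration, and macro averaging is an assumption (it weights every rating equally, which suits an unbalanced dataset).

```python
import numpy as np
from sklearn.metrics import f1_score

classes = np.array([1, 2, 3, 4, 5])    # column order of the one-hot encoding
y_true = np.array([5, 5, 1, 3, 5, 1])  # toy ground-truth ratings

# Toy stand-in for model.predict(X_test): one row of probabilities per review
probs = np.array([
    [0.10, 0.10, 0.10, 0.10, 0.60],
    [0.00, 0.10, 0.20, 0.20, 0.50],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.10, 0.40],
    [0.20, 0.10, 0.10, 0.10, 0.50],
    [0.30, 0.10, 0.40, 0.10, 0.10],
])

# Most probable column, mapped back to the rating it stands for
y_pred = classes[np.argmax(probs, axis=1)]

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(y_pred, round(macro_f1, 3))
```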

## Approach2 - Keras word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an Embedding Keras layer and the same classifier as in approach 1. Evaluate the model.

$$ TO DO STUDENTS $$
> * fix the max_len of a review (maximum number of tokens in a review)
> * use the same vocab_size as previously
> * fix the embedding dimension (embed_dim variable)

In [None]:
max_len = ...   # Sequence length to pad the outputs to.
                # To choose it, look at the distribution of review lengths... see first lab
embed_dim = ... # embedding dimension
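
One way to choose `max_len` is to look at a high percentile of the review lengths, so that most reviews fit without padding every sequence to the length of the longest one. The sketch below uses toy reviews, a plain whitespace split as tokenizer, and the 95th percentile; all three are assumptions.

```python
import numpy as np

# Toy reviews (hypothetical), standing in for X_train
reviews = [
    "great phone",
    "bad",
    "works well but the battery drains fast",
    "love it so much",
]

# Number of whitespace-separated tokens in each review
lengths = np.array([len(review.split()) for review in reviews])

# Cover 95% of the reviews; longer ones will be truncated
max_len = int(np.percentile(lengths, 95))
print(lengths, max_len)
```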

$$ TO DO STUDENTS $$

>* Create a vectorizer_layer with the TextVectorization layer
>* Fit the vectorizer_layer (adapt function)
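
A minimal sketch of these two steps on toy texts. The values of `vocab_size` and `max_len` are placeholders to replace with your own choices.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Toy reviews (hypothetical), standing in for X_train
texts = ["great phone", "bad battery and bad screen", "great price great phone"]

vocab_size = 10  # assumption
max_len = 4      # assumption

# Map each review to a fixed-length sequence of integer token ids
vectorizer_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=max_len,
)

# adapt() learns the vocabulary from the training texts only
vectorizer_layer.adapt(texts)

sequences = vectorizer_layer(tf.constant(texts))
print(vectorizer_layer.get_vocabulary()[:4])
print(sequences.shape)
```

Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so short reviews are padded with zeros and long ones are truncated to `max_len`.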

In [13]:
# Your code

$$TO DO STUDENT$$

> * Build an MLP and print the model (model.summary())
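
A sketch of what such a model might look like, taking integer token ids as input (the output of the vectorizer layer). All sizes are placeholder assumptions. `Flatten` turns the `(max_len, embed_dim)` output of the embedding into a single vector that the dense layers can consume.

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
from tensorflow.keras.models import Model

vocab_size = 1000  # assumption
max_len = 50       # assumption
embed_dim = 16     # assumption
nb_classes = 5

# Each review is a sequence of max_len token ids
inputs = Input(shape=(max_len,), dtype="int32")
# (batch, max_len) -> (batch, max_len, embed_dim)
x = Embedding(input_dim=vocab_size, output_dim=embed_dim)(inputs)
# (batch, max_len, embed_dim) -> (batch, max_len * embed_dim)
x = Flatten()(x)
x = Dense(64, activation="relu")(x)
outputs = Dense(nb_classes, activation="softmax")(x)

embedding_mlp = Model(inputs=inputs, outputs=outputs)
embedding_mlp.summary()
```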

In [14]:
# Your code... perhaps you need to use Flatten after Embedding in order to reduce the dimension of tensors
# Your code

$$ TO DO STUDENT $$
> * Compile the network
> * Fit the network using EarlyStopping
> * Babysit your model
> * Evaluate the network with f1 score

In [15]:
# compile the model with metrics f1 score
# Your code

In [16]:
# fit model using early stopping
# Your code

In [17]:
# Babysit the model
pd.DataFrame({'val_loss':history.history['val_loss'],
              'loss':history.history['loss'],
             'val_f1_score':history.history['val_f1_score'],
              'f1_score':history.history['f1_score']}).plot(figsize=(8,5))

NameError: name 'history' is not defined

In [18]:
# Evaluate the model
# Your code

## Approach3 - Word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an existing embedding matrix (Word2Vec, GloVe or FastText), or an embedding matrix that you build yourself using Gensim.

Use the same constants as in the previous steps.

Evaluate the model.

In [None]:
# Same steps as Keras Embedding
# Your code

In [None]:
# Build word dict
# Your code

In [None]:
# Make a dict mapping words (strings) to their NumPy vector representation:
# Your code
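
As an illustration, the sketch below parses vectors stored in a plain-text format with one word per line followed by its coefficients, which is the layout used by GloVe text files. An in-memory `StringIO` with made-up 3-dimensional vectors stands in for the real file, whose path depends on the embedding you download.

```python
import io
import numpy as np

# In-memory stand-in for an embedding text file (made-up values)
fake_file = io.StringIO(
    "great 0.1 0.2 0.3\n"
    "phone 0.4 0.5 0.6\n"
)

embeddings_index = {}
for line in fake_file:
    # First field is the word, the remaining fields are its coefficients
    word, *coefs = line.rstrip().split(" ")
    embeddings_index[word] = np.asarray(coefs, dtype="float32")

print(len(embeddings_index), embeddings_index["great"])
```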

In [None]:
# Prepare embedding matrix
# Your code
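
The embedding matrix has one row per token id of the vectorizer vocabulary. A sketch with toy values follows: `vocabulary` stands in for `vectorizer_layer.get_vocabulary()` and `embeddings_index` for the word-to-vector dictionary. Words without a pretrained vector keep an all-zero row, which is one common choice among others.

```python
import numpy as np

# Toy stand-ins (hypothetical values)
vocabulary = ["", "[UNK]", "great", "phone", "battery"]
embeddings_index = {
    "great": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "phone": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
embed_dim = 3

# Row i holds the vector of the word whose token id is i
embedding_matrix = np.zeros((len(vocabulary), embed_dim))
hits, misses = 0, 0
for index, word in enumerate(vocabulary):
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[index] = vector
        hits += 1
    else:
        misses += 1  # row stays at zero

print(f"Converted {hits} words ({misses} misses)")
```

Checking the number of misses on the real vocabulary is a quick way to see how well the pretrained embedding covers the reviews.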

In [None]:
# Define embedding layers
# Your code
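
A sketch of an embedding layer initialised from a precomputed matrix and frozen so that training does not modify the vectors. The random matrix is a placeholder for your own `embedding_matrix`; whether to freeze the layer or fine-tune it is a design choice left to you.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding

vocab_size, embed_dim = 5, 4  # assumption: must match the matrix shape

# Placeholder for the matrix built from pretrained vectors
rng = np.random.default_rng(0)
embedding_matrix = rng.random((vocab_size, embed_dim))

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors fixed
)

# Looking up token ids returns the matching rows of the matrix
looked_up = np.asarray(embedding_layer(np.array([[2, 0, 1]])))
print(looked_up.shape)
```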

In [None]:
# define the model
# Your code

In [None]:
# compile the model
# Your code

In [None]:
# fit model using early stopping
# Your code

In [None]:
# Babysit the model
pd.DataFrame({'val_loss':history.history['val_loss'],
              'loss':history.history['loss'],
             'val_f1_score':history.history['val_f1_score'],
              'f1_score':history.history['f1_score']}).plot(figsize=(8,5))

In [None]:
# Evaluate the model
# Your code

## Approach3 - Word embedding and MLP classifier (Gensim embedding)

This variant of approach 3 uses an embedding matrix that you build yourself using Gensim instead of an existing one.

Use the same constants as in the previous steps.

Evaluate the model.

In [None]:
# Available in your gensim installation...
# You can also use the train reviews.
from gensim.test.utils import datapath

# datapath() resolves a file inside gensim's bundled test data directory,
# so no machine-specific absolute path is needed
corpus_file = datapath("lee_background.cor")

In [None]:
# Build gensim model
from gensim import utils
import gensim.models

class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        # the context manager closes the file once iteration ends
        with open(corpus_file, encoding="utf-8") as f:
            for line in f:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)


sentences = MyCorpus()
# Train Word2Vec vectors of dimension 150 on the corpus
model = gensim.models.Word2Vec(sentences=sentences, vector_size=150)

In [None]:
# Export gensim model
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    print(temporary_filepath)
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

In [None]:
# Load gensim model
new_model = gensim.models.Word2Vec.load(temporary_filepath)

In [None]:
# Prepare embedding matrix
# Your code

In [None]:
# Define embedding layers
# Your code

In [None]:
# define the model
# Your code

In [None]:
# compile the model
# Your code

In [None]:
# fit model using early stopping
# Your code

In [None]:
# Babysit the model
pd.DataFrame({'val_loss':history.history['val_loss'],
              'loss':history.history['loss'],
             'val_f1_score':history.history['val_f1_score'],
              'f1_score':history.history['f1_score']}).plot(figsize=(8,5))

In [None]:
# Evaluate the model
# Your code