# Sentiment analysis with MLP and vector word representation (Keras embedding)

In [3]:
"""
(Practical tip) A table of contents can be generated directly in a Jupyter notebook with the following code.
The try/except handles installation: if the package is already installed, it is imported;
otherwise it is installed first and then imported.
"""
try:
    from jyquickhelper import add_notebook_menu 
except ImportError:
    !pip install jyquickhelper
    from jyquickhelper import add_notebook_menu
    
"""
Output a table of contents to navigate the notebook easily.
For interested readers, the package also includes IPython magic commands to jump back to this cell
from anywhere in the notebook, which makes it faster to find other cells.
"""
add_notebook_menu()

## Imports

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import time

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, TextVectorization, Flatten, Embedding
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
#!pip install tensorflow-addons

In [None]:
import tensorflow_addons as tfa
from tensorflow_addons.metrics import F1Score

## The dataset

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published on Kaggle. The dataset contains the following information:
* Product Name
* Brand Name
* Price
* Rating
* Reviews
* Review Votes

We are mainly interested in the 'Reviews' (X) and the 'Rating' (y).


The goal is to predict the 'Rating' from the text of the 'Reviews'. I have prepared TRAIN, VAL and TEST sets for you.
The work to be done is as follows:

1. Feature extraction and baseline
    * read the dataset and understand it
    * put it in a format so that you can use `CountVectorizer` or `Tf-IDF` to extract the desired features
    * apply the desired preprocessing to the data
    * use one of the classifiers you know to predict the polarity of different sentences
1. My first neural network
    * reuse the features already extracted 
    * propose a neural network built with Keras
1. Hyper-parameter fitting
    * for the baseline: adjust min_df, max_df, ngram, max_features + the model's hyper-parameters
    * for the neural network: adjust the batch size, the number of layers and the number of neurons per layer, and use early stopping
1. <span style="color:red">Word embedding
    * stage 1: build a network that uses the Keras embedding layer, which is language-independent.
    * stage 2: build a network that uses both the Keras embedding and the features extracted in the first weeks.
    * stage 3: try to use an existing embedding (https://github.com/facebookresearch/MUSE)
    </span>

**WARNING:** the dataset is large. Work on a small subset first, and use the whole dataset only at the end, once the code is debugged and you need to build the "final model".

In [None]:
TRAIN = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz")
VAL = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/val.csv.gz")
TEST = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz")

TRAIN.head()

In [None]:
''' Construct X_train and y_train '''
X_train = TRAIN['Reviews']
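# OneHotEncoder expects 2D input, so every label array is reshaped to a column
y_train = np.array(TRAIN['Rating']).reshape(-1,1)
X_val = VAL['Reviews']
y_val = np.array(VAL['Rating']).reshape(-1,1)
X_test = TEST['Reviews']
y_test = np.array(TEST['Rating']).reshape(-1,1)

nb_classes = len(np.unique(y_train))

# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
# Fit the encoder on the training labels only, then reuse it for val and test
y_train_enc = ohe.fit_transform(y_train)
y_val_enc = ohe.transform(y_val)
y_test_enc = ohe.transform(y_test)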

X_train.shape, y_train_enc.shape, np.unique(y_train)

## Baseline model (last week)

$$ TO DO STUDENTS $$
> * As the classes are unbalanced, it is preferable to use the `f1 score`.
> * So modify your model, its compilation, its fit and its evaluation accordingly.

In [None]:
# Put here your previous preprocessing
# Evaluate your baseline model with f1 score (sklearn)
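
To see why accuracy is misleading on unbalanced ratings, here is a minimal sketch on made-up labels (the arrays are illustrative and not taken from the dataset). A classifier that always predicts the majority rating gets a high accuracy but a low macro-averaged F1, because macro averaging gives every class the same weight.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: nine 5-star reviews and a single 1-star review
y_true = np.array([5] * 9 + [1])
# A degenerate classifier that always predicts the majority class
y_pred = np.full_like(y_true, 5)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class that is never predicted as 0, without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}, macro F1 = {macro_f1:.3f}")
```

The same `f1_score(..., average="macro")` call can be applied to the predictions of a baseline pipeline.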

## Approach1 - BOW and MLP classifier (last week)

The work was done last week but...
$$ TO DO STUDENTS $$
> * Create a vectorizer_layer with TextVectorization function
> * Fit the vectorizer_layer (adapt function)
> * Build an MLP and print the model (model.summary())
> * Babysit your model
> * Evaluate your model
>
> * As the classes are unbalanced, it is preferable to use the `f1 score`.
> * So modify your model, its compilation, its fit and its evaluation accordingly.

In [None]:
# Evaluate the model with f1 metrics (Tensorflow f1 metrics or sklearn)
# Your code
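
As a possible starting point for the bag-of-words step, the sketch below shows `TextVectorization` producing one 0/1 indicator per vocabulary entry. The toy reviews, the name `bow_vectorizer` and the value of `max_tokens` are assumptions made for illustration. It is one way to do it, not the expected solution.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Toy reviews, made up for illustration
toy_reviews = tf.constant(["great phone", "bad battery bad screen", "great screen"])

max_tokens = 12  # assumed vocabulary size for this toy example
bow_vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="multi_hot",  # one 0/1 indicator per vocabulary entry
    pad_to_max_tokens=True,   # force the output width to max_tokens
)
bow_vectorizer.adapt(toy_reviews)  # builds the vocabulary from the texts

bow = np.asarray(bow_vectorizer(toy_reviews))
print(bow.shape)  # one row per review, one column per vocabulary slot
```

Because each review becomes a fixed-size vector, the result can feed `Dense` layers directly.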

## Approach2 - Keras word embedding and MLP classifier

Using the course companion notebook, build a multi-layer perceptron using an Embedding Keras layer and the same classifier as in approach 1. Evaluate the model.

> * Create a vectorizer_layer with TextVectorization function
> * Fit the vectorizer_layer (adapt function)
> * Build an MLP and print the model (model.summary())
> * Babysit your model
> * Evaluate your model

In [None]:
vocab_size = ... # fix your vocabulary size
max_len = ...   # Sequence length to pad the outputs to.
                # To choose it, look at the distribution of review lengths (see the first lab)
embed_dim = ... # embedding dimension
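
One common heuristic for `max_len` is to take a high percentile of the review lengths rather than the maximum, so that a few very long reviews do not inflate the padded size. The sketch below uses made-up token counts. The 95th percentile is an arbitrary choice here, not a value prescribed by the lab.

```python
import numpy as np

# Made-up token counts standing in for the review lengths of X_train:
# most reviews are short, one is very long
token_counts = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                15, 5, 6, 7, 8, 9, 10, 11, 12, 200]
toy_reviews = ["word " * n for n in token_counts]

# Whitespace tokenization is a rough proxy for the vectorizer's tokenization
lengths = np.array([len(review.split()) for review in toy_reviews])

# Smallest integer length covering (about) 95% of the reviews
max_len = int(np.ceil(np.percentile(lengths, 95)))
coverage = float(np.mean(lengths <= max_len))

print(f"longest review: {lengths.max()} tokens")
print(f"max_len = {max_len}, reviews fully covered: {coverage:.0%}")
```

Reviews longer than `max_len` are truncated by the vectorizer, and shorter ones are padded with zeros.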

$$ TO DO STUDENTS $$

>* Create a vectorizer_layer with TextVectorization function
>* Fit the vectorizer_layer (adapt function)

In [None]:
# Your code
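
For reference, here is a minimal sketch of a `TextVectorization` layer in integer mode, which is the form an `Embedding` layer expects. The toy reviews and the small values of `vocab_size` and `max_len` are assumptions for illustration; real values should come from your own analysis of the data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Toy reviews, made up for illustration
toy_reviews = tf.constant([
    "great phone",
    "battery died after one week",
    "great battery",
])

vocab_size = 20  # toy value
max_len = 4      # toy value

vectorizer_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",               # each token becomes an integer index
    output_sequence_length=max_len,  # pad or truncate every review to max_len
)
vectorizer_layer.adapt(toy_reviews)  # builds the vocabulary from the texts

token_ids = np.asarray(vectorizer_layer(toy_reviews))
vocabulary = vectorizer_layer.get_vocabulary()

print(vocabulary[:2])  # index 0 is padding, index 1 is the unknown token
print(token_ids)
```

Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so the short first review ends with zeros while the long second review is cut to `max_len` tokens.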

$$ TO DO STUDENT $$

> * Build an MLP and print the model (model.summary())

In [None]:
# Hint: you may need a Flatten layer after Embedding to collapse its
# (max_len, embed_dim) output into a single vector per review
# Your code
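
The sketch below illustrates how the tensor shapes evolve in an Embedding-based MLP. All sizes (`vocab_size`, `max_len`, `embed_dim`, `nb_classes`, and the 16 hidden units) are toy values assumed for the example, and the architecture is only one possibility. `Embedding` maps each token index to a vector, giving a `(batch, max_len, embed_dim)` tensor. `Flatten` then reshapes it to `(batch, max_len * embed_dim)` before the classifier.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
from tensorflow.keras.models import Model

# Toy sizes, assumed for illustration only
vocab_size, max_len, embed_dim, nb_classes = 50, 6, 4, 5

inputs = Input(shape=(max_len,), dtype="int32")                    # token indices
x = Embedding(input_dim=vocab_size, output_dim=embed_dim)(inputs)  # (batch, 6, 4)
x = Flatten()(x)                                                   # (batch, 24)
x = Dense(16, activation="relu")(x)
outputs = Dense(nb_classes, activation="softmax")(x)               # one probability per rating
model = Model(inputs, outputs)

model.summary()

# Forward pass on a dummy batch of two padded sequences
dummy_batch = np.zeros((2, max_len), dtype="int32")
probs = np.asarray(model(dummy_batch))
print(probs.shape)
```

In a full model, the vectorizer layer would sit in front of the `Embedding` layer, or the texts would be vectorized beforehand.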

$$ TO DO STUDENT $$
> * Compile the network
> * Fit the network using EarlyStopping
> * Babysit your model
> * Evaluate the network with f1 score

In [None]:
# Compile the model with the F1 score as a metric
# Your code

In [None]:
# Fit the model using early stopping
# Your code
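
To build intuition for the `patience` setting, here is a simplified pure-Python illustration of the early-stopping rule: training stops once the monitored value has not improved for `patience` consecutive epochs. The function `early_stopping_epoch` and the loss values are hypothetical, written for this example. This is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch where training stops, epoch with the best loss).

    Simplified illustration of patience-based early stopping.
    Epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # number of epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs to the last epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64, 0.60]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stopped at epoch {stopped_at}, best epoch was {best_epoch}")
```

With `patience=2`, training stops before the later improvements are reached, which shows why too small a patience can end training early. In Keras, the comparable configuration is `EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)`, passed to `model.fit` through the `callbacks` argument.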

In [None]:
# Babysit the model
pd.DataFrame({'val_loss': history.history['val_loss'],
              'loss': history.history['loss'],
              'val_f1_score': history.history['val_f1_score'],
              'f1_score': history.history['f1_score']}).plot(figsize=(8,5))

In [None]:
# Evaluate the model
# Your code
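
When evaluating with sklearn, the softmax outputs first have to be turned back into rating labels. The sketch below does this with `argmax` on made-up probabilities. The `classes` array is an assumption standing in for `ohe.categories_[0]`, which holds the rating values in the column order of the one-hot encoding.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rating values in one-hot column order (assumed; see ohe.categories_[0])
classes = np.array([1, 2, 3, 4, 5])

# Made-up softmax outputs for four reviews
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.10, 0.10, 0.50, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.30, 0.50],
])
y_true = np.array([1, 5, 3, 4])  # made-up true ratings

# Most probable column for each review, mapped back to its rating value
y_pred = classes[probs.argmax(axis=1)]

macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(y_pred, f"macro F1 = {macro_f1:.3f}")
```

With real data, `probs` would come from `model.predict` on the test set, and `classification_report` or `confusion_matrix` can be applied to the same `y_true` and `y_pred`.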