# P0: Simple models with Keras

**Goal**: implement **three models** for multiclass text classification on the [Women's E-commerce clothing reviews](https://github.com/ya-stack/Women-s-Ecommerce-Clothing-Reviews) dataset. Two of them are simple feed-forward models that use a `Tokenizer` and a `TextVectorization` layer, respectively; the third is a Convolutional Neural Network (CNN) that uses a `TextVectorization` layer and embeddings.

**Teams**: one person or two.

**Due date**: Check virtual campus.


### 1. Data preparation

The first step is to download the dataset (a `csv` file) from *GitHub*. Suggestions:
- You can use the utility function [`tensorflow.keras.utils.get_file()`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file) to download the file. You should set an absolute path to save the file, taking into account that, in *Google Colaboratory*, you have direct access to the folder `/content/`.
- There are many ways to load a `csv` in memory. One simple way is to use `csv.reader()`.

~~~
import csv

# `path` is the location of the downloaded csv file
with open(path, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
~~~

The resulting data structure (`data`) is a Python list of lists, with one inner list per row of the file (the header row followed by the reviews).
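To make the "list of lists" shape concrete, here is a minimal sketch that runs `csv.reader()` on a small in-memory sample instead of the downloaded file. The sample rows are hypothetical (shortened and slightly altered from the real data) and only serve to show how the header is separated and how quoted fields keep their commas.

```python
import csv
import io

# Hypothetical three-line sample that mimics the layout of the real file.
sample = (
    ',Clothing ID,Age,Title,Review Text,Rating\n'
    '0,767,33,,"Absolutely wonderful - silky, sexy and comfortable",4\n'
    '1,1077,60,Some major design flaws,"I had such high hopes, but no.",3\n'
)

# newline='' mirrors the way the csv module expects files to be opened.
reader = csv.reader(io.StringIO(sample, newline=''))
data = list(reader)

# The first row holds the column names; the remaining rows are reviews.
header, rows = data[0], data[1:]
print(header)
print(len(rows))
print(rows[0][4])  # the quoted review text is a single field, comma included
```

Each row is a list of strings, so numeric fields such as `Rating` still need an explicit `int()` conversion.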

In [9]:

import os
import csv
from tensorflow.keras.utils import get_file

DATA_URL = "https://raw.githubusercontent.com/ya-stack/Women-s-Ecommerce-Clothing-Reviews/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
FILE_NAME = "Womens_Clothing_E-Commerce_Reviews.csv"

DOWNLOAD_DIR = "/content/datasets"
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

csv_path = get_file(
    fname=FILE_NAME,
    origin=DATA_URL,
    cache_dir=DOWNLOAD_DIR,
    cache_subdir="",
    extract=False
)

with open(csv_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    data = list(reader)

print(f"Path of the downloaded file: {csv_path}")
print(f"Number of rows: {len(data)}")
print("Column headers:", data[0])


Path of the downloaded file: /content/datasets/Womens_Clothing_E-Commerce_Reviews.csv
Number of rows: 23487
Column headers: ['', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name']


In [3]:
# Take a look at the data
print(data[0:5])

[['', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'], ['0', '767', '33', '', 'Absolutely wonderful - silky and sexy and comfortable', '4', '1', '0', 'Initmates', 'Intimate', 'Intimates'], ['1', '1080', '34', '', 'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.', '5', '1', '4', 'General', 'Dresses', 'Dresses'], ['2', '1077', '60', 'Some major design flaws', 'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall,

Once you have the rows of the `csv` file in a data structure (remember that the first row contains the names of the attributes of the dataset and must be discarded), you have to preprocess the data so it can be used as input to the neural networks:
1. Extract the textual data from the rows, contained in the fields `Title` and `Review Text`, and join both fields if the title is not empty.
2. Convert the field `Rating`, whose values are integers in the interval [1,5], into three classes: negative (ratings 1,2), neutral (rating 3) and positive (ratings 4,5).
3. The dataset contains about 23,000 reviews. Reserve the first 18,000 for training, and the rest for validation.

In [24]:
from typing import List

# Split off the first row, which holds the column headers
header, *rows = data

def juntar_titulo_y_reseña(title: str, review: str) -> str:
    """Join title and review text; fall back to whichever one is not empty."""
    title = title.strip()
    review = review.strip()
    if title and review:
        return f"{title}. {review}"
    if title:
        return title
    return review

def rating_agrupa(rating_str: str) -> int:
    """Map a 1-5 rating (given as a string) to one of three sentiment classes."""
    rating = int(rating_str)
    if rating <= 2:
        return 0  # class 0: negative reviews
    if rating == 3:
        return 1  # class 1: neutral reviews
    return 2      # class 2: positive reviews

texts: List[str] = []
labels: List[int] = []

for row in rows:
    title = row[3]
    review_text = row[4]
    rating_str = row[5]

    combined_text = juntar_titulo_y_reseña(title, review_text)
    texts.append(combined_text)
    labels.append(rating_agrupa(rating_str))


train_texts, val_texts = texts[:18000], texts[18000:]
train_labels, val_labels = labels[:18000], labels[18000:]

print(f"Reviews: {len(texts)}")
print(f"Training on {len(train_texts)} reviews")
print(f"Validating on {len(val_texts)} reviews")
print(train_texts[0:5])

Reviews: 23486
Training on 18000 reviews
Validating on 5486 reviews
['Absolutely wonderful - silky and sexy and comfortable', 'Love this dress!  it\'s sooo pretty.  i happened to find it in a store, and i\'m glad i did bc i never would have ordered it online bc it\'s petite.  i bought a petite and am 5\'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.', 'Some major design flaws. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c', "My favorite buy!. I love, love

### 2. Perceptron with Tokenizer.

In the first model, you are going to use a `Tokenizer()` object to process the training and validation texts, transforming each review into a binary vector (of length *n*, where *n* is the size of the vocabulary) in which the positions of the words appearing in the review are coded as `1` (hint: you can use the method `texts_to_matrix()` for this). You can set a maximum size for the vocabulary (parameter `num_words`), but it is not necessary.

Remember that you have to use the `fit_on_texts()` method in order to build the vocabulary of the tokenizer from the training data.

In addition, you have to convert the label vectors (negative=0, neutral=1, positive=2) of the training and validation sets to a data type that makes it possible to use them with the loss function `categorical_crossentropy` (hint: you may want to use the utility function `tensorflow.keras.utils.to_categorical()`).
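As a rough illustration of the two encodings described above, the following pure-Python sketch builds a tiny vocabulary, turns a text into a binary vector and turns a label into a one-hot vector. It is not how `Tokenizer` or `to_categorical()` are implemented: the helper names and the sample texts are hypothetical, and the real `Tokenizer` differs in details such as punctuation filtering and the order in which indices are assigned.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase word to an index, starting at 1 (index 0 stays unused)."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def to_binary_vector(text, vocabulary):
    """Return a vector with 1.0 at the index of every known word that appears in the text."""
    vector = [0.0] * (len(vocabulary) + 1)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored in this sketch
            vector[index] = 1.0
    return vector


def to_one_hot(label, num_classes=3):
    """Return a vector of zeros with a single 1.0 at the position of the class."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


train = ["love this dress", "this dress runs small"]  # hypothetical training texts
vocabulary = build_vocabulary(train)

print(vocabulary)
print(to_binary_vector("love this top", vocabulary))
print(to_one_hot(2))
```

Note that the vocabulary is built from the training texts only, so a word seen for the first time at validation time ("top" in this example) leaves no trace in the vector.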

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Set a maximum vocabulary size (even though it is not required)
VOCAB_SIZE = 20000

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)  # Learn the vocabulary from the training set

# Build the binary matrices, as in the tokenizer example from the other notebook
x_train = tokenizer.texts_to_matrix(train_texts, mode="binary")
x_val = tokenizer.texts_to_matrix(val_texts, mode="binary")

# Convert the labels to one-hot vectors
y_train = to_categorical(train_labels, num_classes=3)
y_val = to_categorical(val_labels, num_classes=3)

print("Shape x_train:", x_train.shape)
print("Shape x_val:", x_val.shape)
print("First rows of y_train:\n", y_train[:3])
print(x_train[0:3])

Shape x_train: (18000, 20000)
Shape x_val: (5486, 20000)
First rows of y_train:
 [[0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]


Now it is time to create the `Sequential` architecture of our first model. In this case, a simple perceptron with three layers (input, hidden with ReLU, output with softmax) will suffice. A few pointers:
- You will need to set the `input_shape` of the first layer of the network to the size of the vocabulary in the `Tokenizer`.
- The number of units and the activation function in the output layer must be appropriate for a three-class classification problem.
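To see why the input size matters, it can help to count the trainable parameters by hand. The sketch below assumes a vocabulary of 20,000 words and a hidden layer of 64 units (both are choices, not requirements of the assignment); the helper `dense_params` is hypothetical and simply applies the usual "weights plus biases" formula for a fully connected layer.

```python
def dense_params(inputs, units):
    """Trainable parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


vocab_size = 20000    # assumed size of the binary input vector
hidden_units = 64     # assumed width of the hidden layer
num_classes = 3       # negative, neutral, positive

hidden = dense_params(vocab_size, hidden_units)
output = dense_params(hidden_units, num_classes)

print(hidden)           # 1280064
print(output)           # 195
print(hidden + output)  # 1280259
```

Almost all parameters sit in the first layer, so the vocabulary size is the main driver of model size. The totals can be compared with what `model.summary()` reports.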

In [15]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

model = Sequential(
    [
        Input(shape=(VOCAB_SIZE,)),           # same dimension as the vocabulary size set in the Tokenizer
        Dense(64, activation="relu"),
        Dense(3, activation="softmax"),
    ]
)
model.summary()

Now compile and train the model. You can use any optimizer you want, but the loss function must be `categorical_crossentropy`, the metric used will be `accuracy`, and you will provide the validation sets for the computation of the validation loss and validation accuracy at the end of each epoch of training, with the argument `validation_data`.

The model will train for 10 epochs.

Expect a validation accuracy of 0.80-0.83, approximately.

In [16]:
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x_train,
    y_train,
    epochs=10,
    batch_size=512,
    validation_data=(x_val, y_val),
)

Epoch 1/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 9s 222ms/step - accuracy: 0.7336 - loss: 0.8025 - val_accuracy: 0.7763 - val_loss: 0.5611
Epoch 2/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 9s 245ms/step - accuracy: 0.7784 - loss: 0.5003 - val_accuracy: 0.7858 - val_loss: 0.4785
Epoch 3/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 6s 118ms/step - accuracy: 0.8031 - loss: 0.4145 - val_accuracy: 0.7906 - val_loss: 0.4570
Epoch 4/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 6s 153ms/step - accuracy: 0.8229 - loss: 0.3804 - val_accuracy: 0.8246 - val_loss: 0.4349
Epoch 5/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 4s 115ms/step - accuracy: 0.8926 - loss: 0.3101 - val_accuracy: 0.8356 - val_loss: 0.4215
Epoch 6/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 5s 116ms/step - accuracy: 0.9146 - loss: 0.2649 - val_accuracy: 0.8347 - val_loss: 0.4243
Epoch 7/10
36/36

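Does the validation accuracy grow with each epoch?

Not in every epoch. In the epochs shown in the log, the validation accuracy rises from 0.7763 in epoch 1 to 0.8356 in epoch 5, and then dips slightly to 0.8347 in epoch 6, while the validation loss goes up from 0.4215 to 0.4243. Meanwhile the training accuracy keeps climbing (0.9146 in epoch 6), so the gap between training and validation widens, which suggests that the model starts to overfit around epoch 5.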
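A question like this can also be checked programmatically instead of by reading the log. The sketch below uses the validation accuracies of the first six epochs copied from the output above; in the notebook, the same list would be available as `history.history["val_accuracy"]`. The helper `summarize` is a hypothetical name introduced for illustration.

```python
# Validation accuracy of epochs 1-6, copied from the training log above.
val_accuracy = [0.7763, 0.7858, 0.7906, 0.8246, 0.8356, 0.8347]


def summarize(values):
    """Return the 1-based epoch with the best value and whether the series grows at every step."""
    best_epoch = max(range(len(values)), key=lambda i: values[i]) + 1
    strictly_increasing = all(later > earlier for earlier, later in zip(values, values[1:]))
    return best_epoch, strictly_increasing


best_epoch, strictly_increasing = summarize(val_accuracy)
print(best_epoch)           # 5
print(strictly_increasing)  # False
```

The same helper can be applied to the histories of the other models to compare their best epochs.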


### 3. Perceptron with a TextVectorization layer.

Now you are going to implement a new neural network, with two differences with respect to the previous one:
- We will use a `TextVectorization` layer instead of a `Tokenizer`.
- The loss function will be `sparse_categorical_crossentropy`.

Your first task is to set up the `TextVectorization` layer. Remember that you have to create the layer and call the method `adapt()` on the training data before adding the layer to the new model. You can use the default values when creating the layer if you wish, except for `output_mode`, which has to be set to `'multi_hot'` so that a binary vector the size of the vocabulary is generated for each example, as `Tokenizer` did in the first model.

In [17]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

text_vectorizer = TextVectorization(output_mode="multi_hot")
# As the assignment states, the output must be multi-hot so that it can feed the perceptron
text_vectorizer.adapt(train_texts)


x_train_tv = text_vectorizer(tf.constant(train_texts))
x_val_tv = text_vectorizer(tf.constant(val_texts))

vocab_size = text_vectorizer.vocabulary_size()
print("Learned vocabulary size:", vocab_size)
print("Shape x_train_tv:", x_train_tv.shape)
print("Shape x_val_tv:", x_val_tv.shape)
print(x_train_tv[0])
# The training and validation sets are now binary vectors, one per review, marking which words
# of the vocabulary built by text_vectorizer appear in it

Learned vocabulary size: 17424
Shape x_train_tv: (18000, 17424)
Shape x_val_tv: (5486, 17424)
tf.Tensor([0 0 0 ... 0 0 0], shape=(17424,), dtype=int64)


Now you can create your second `Sequential` model, adding its layers one by one. The previously created `TextVectorization` layer goes first. There is no need to define an input layer. You can add the rest of the layers after the text vectorizer.



In [18]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

second_model = Sequential()
second_model.add(text_vectorizer)
second_model.add(Dense(64, activation="relu"))
second_model.add(Dense(3, activation="softmax"))


_ = second_model(tf.constant(train_texts[:1]))

second_model.summary()

Once the topology of the new model is set, you will prepare the datasets, compile the model and train it. Important:

- Remember that you are supposed to use `sparse_categorical_crossentropy`, so the label vectors for both training and validation will have to be of the appropriate type and dimensions.
- `TextVectorization` will not accept its training (or validation) input as a list of strings. If you are using lists to store your input strings, convert those lists to NumPy arrays with `np.array()`.

You can use whichever optimizer you prefer, but you will use accuracy to measure the performance of the model, provide the validation data through the argument `validation_data`, and train for 10 epochs.
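The difference between the two loss functions is only in how the labels are passed: `categorical_crossentropy` expects one-hot vectors, while `sparse_categorical_crossentropy` expects the integer class index. The sketch below computes both for a single example in plain Python to show that they give the same value. It is a simplification (one example, no batching or numerical clipping), and the probabilities are hypothetical.

```python
import math


def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for a one-hot label: minus the sum of target times log-probability."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy for an integer label: minus the log-probability of the true class."""
    return -math.log(probs[label])


probs = [0.1, 0.2, 0.7]      # hypothetical softmax output for one review
label = 2                    # integer label: positive
one_hot = [0.0, 0.0, 1.0]    # the same label, one-hot encoded

loss_dense = categorical_crossentropy(one_hot, probs)
loss_sparse = sparse_categorical_crossentropy(label, probs)
print(loss_dense, loss_sparse)
```

Since both losses agree, the sparse variant mainly saves the `to_categorical()` step and the memory taken by the one-hot matrix.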

In [19]:
import numpy as np

# Convert the Python lists to NumPy arrays
# so that Keras builds tf.string tensors.
train_texts_np = np.array(train_texts, dtype=object)
val_texts_np = np.array(val_texts, dtype=object)

# Convert the labels to integers
train_labels_np = np.array(train_labels, dtype="int64")
val_labels_np = np.array(val_labels, dtype="int64")

second_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

In [20]:
history_tv = second_model.fit(
    train_texts_np,
    train_labels_np,
    epochs=10,
    batch_size=512,
    validation_data=(val_texts_np, val_labels_np),
)

Epoch 1/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 8s 191ms/step - accuracy: 0.6794 - loss: 0.8624 - val_accuracy: 0.7763 - val_loss: 0.6308
Epoch 2/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 6s 159ms/step - accuracy: 0.7740 - loss: 0.5717 - val_accuracy: 0.7814 - val_loss: 0.5139
Epoch 3/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 7s 182ms/step - accuracy: 0.8057 - loss: 0.4428 - val_accuracy: 0.8243 - val_loss: 0.4592
Epoch 4/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 6s 159ms/step - accuracy: 0.8677 - loss: 0.3669 - val_accuracy: 0.8321 - val_loss: 0.4282
Epoch 5/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 7s 190ms/step - accuracy: 0.8919 - loss: 0.3135 - val_accuracy: 0.8356 - val_loss: 0.4203
Epoch 6/10
36/36 ━━━━━━━━━━━━━━━━━━━━ 6s 159ms/step - accuracy: 0.9096 - loss: 0.2742 - val_accuracy: 0.8363 - val_loss: 0.4203
Epoch 7/10
36/36

Is the new model any better than the previous one?

(you can write your answer here)

### 4. CNN with TextVectorization layer and word embeddings

Finally, you are going to train a third model with the following components:
- A `TextVectorization` layer.
- An `Embedding` layer.
- One or more `Conv1D` layers.
- A `GlobalMaxPooling1D` layer.
- One or more `Dense` layers for the computation of results.
- An output layer with the appropriate activation function for a multiclass classifier.

You will use the functional API.

Our goal is to process the input texts token by token using a Convolutional Neural Network (CNN) and embeddings. The first step is to define the `TextVectorization` layer. This time the output of this layer will be a vector of integers (the input for the `Embedding` layer), with one integer for each token in the input text, so `output_mode` must be set to `'int'` or omitted (since `'int'` is the default value for this parameter). In addition, all sequences of integers (words) given to the embedding layer must have the same length. To ensure that, you will use the parameter `output_sequence_length` in the definition of the `TextVectorization` layer (e.g. `output_sequence_length=100`). That will truncate sequences longer than the value of `output_sequence_length` and pad shorter ones with zeros.

Once the layer is defined, its vocabulary will be built with the method `adapt()`.
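The effect of `output_sequence_length` can be pictured with a small pure-Python sketch: sequences that are too long are cut, and sequences that are too short are filled with zeros at the end. The function `pad_or_truncate` and the token ids are hypothetical and only illustrate the behaviour described above, not the layer's actual implementation.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return a list of exactly `length` ids: cut extra ids, or append padding at the end."""
    trimmed = token_ids[:length]
    return trimmed + [pad_value] * (length - len(trimmed))


short_review = [5, 8, 2]             # hypothetical token ids of a 3-word review
long_review = [5, 8, 2, 9, 4, 7]     # hypothetical token ids of a 6-word review

print(pad_or_truncate(short_review, 5))  # [5, 8, 2, 0, 0]
print(pad_or_truncate(long_review, 5))   # [5, 8, 2, 9, 4]
```

With a fixed length, every review becomes a row of the same size, which is what allows a batch of reviews to be stacked into a single integer tensor for the `Embedding` layer.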

In [21]:
from tensorflow import keras
from tensorflow.keras import layers


cnn_text_vectorizer = layers.TextVectorization(
    output_sequence_length=100
)
cnn_text_vectorizer.adapt(train_texts)

vocab_size_cnn = cnn_text_vectorizer.vocabulary_size()


You have to start the definition of the model with an `Input` layer, e.g.:

~~~
inputs = keras.Input(shape=(1,),dtype=tf.string)
~~~

Then you can add the `TextVectorization`, `Embedding`, `Conv1D`, ... layers.

The `Embedding` layer has at least two parameters: the size of the vocabulary and the size of the embeddings. For the vocabulary you have two choices: fix its size in advance when creating the `TextVectorization` layer, via the `max_tokens` parameter, or let all tokens of the training set be part of the vocabulary. In the latter case, you can get the vocabulary size from the layer, using the method `vocabulary_size()`.

You must set the embedding dimension to an integer value, e.g. `30`.

In the `Conv1D` layer you have to set two parameters, `filters` and `kernel_size`. Both are integers. The first can take any positive integer value (e.g., `64` or `128`), but the higher it is set, the more computation the layer requires, while the second should be small compared to the length of the sequences of words (e.g. `3` or `5`).
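To get a feeling for what `filters` and `kernel_size` imply, the sketch below works out the output length and the parameter count of a 1D convolution, and applies one filter to a toy sequence before taking the global maximum. It assumes stride 1 and no padding, ignores the activation function, and uses a single input channel in the toy example; all helper names and numbers are hypothetical illustrations.

```python
def conv1d_output_length(sequence_length, kernel_size):
    """Number of positions a kernel can take with stride 1 and no padding."""
    return sequence_length - kernel_size + 1


def conv1d_params(kernel_size, input_channels, filters):
    """One weight per (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * input_channels * filters + filters


def conv1d_single_filter(sequence, kernel):
    """Slide one kernel over a one-channel sequence and return the resulting feature map."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


# Sequences of 100 tokens, 30-dimensional embeddings, 128 filters of width 5.
print(conv1d_output_length(100, 5))  # 96
print(conv1d_params(5, 30, 128))     # 19328

# Toy example: one filter of width 3 over a sequence of 5 values.
feature_map = conv1d_single_filter([1, 0, 2, 3, 1], [1, 0, -1])
print(feature_map)       # [-1, -3, 1]
print(max(feature_map))  # 1, the value global max pooling would keep for this filter
```

Global max pooling keeps one number per filter regardless of the sequence length, so with 128 filters the `Dense` layers that follow receive a vector of 128 values.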

We finish with the output layer:

~~~
outputs = tf.keras.layers.Dense(...)(x)
~~~

where `x` is the output of the previous layer. At this point we can define the model:

~~~
model_functional = keras.Model(inputs=inputs, outputs=outputs)
~~~

In [22]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim = 30  # Set the dimension of each word's embedding vector to 30

inputs = keras.Input(shape=(1,), dtype=tf.string)
x = cnn_text_vectorizer(inputs)
x = layers.Embedding(input_dim=vocab_size_cnn, output_dim=embedding_dim)(x)
x = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)

cnn_model = keras.Model(inputs=inputs, outputs=outputs, name="cnn_text_vectorizer")
cnn_model.summary()

You can use exactly the same datasets as in the previous model for training and validation, and the optimizer of your preference, but you will use accuracy as the performance metric and sparse categorical crossentropy as the loss function, provide the validation data through the argument `validation_data`, and train the model for 10 epochs.

In [23]:
# Reuse the training and validation datasets from the previous model
# The comparison between the last two models depends on the architecture
cnn_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history_cnn = cnn_model.fit(
    train_texts_np,
    train_labels_np,
    epochs=10,
    batch_size=256,
    validation_data=(val_texts_np, val_labels_np),
)

Epoch 1/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 10s 105ms/step - accuracy: 0.7180 - loss: 0.8083 - val_accuracy: 0.7763 - val_loss: 0.6352
Epoch 2/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 11s 118ms/step - accuracy: 0.7678 - loss: 0.5983 - val_accuracy: 0.8024 - val_loss: 0.4761
Epoch 3/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 8s 115ms/step - accuracy: 0.8189 - loss: 0.4226 - val_accuracy: 0.8263 - val_loss: 0.4224
Epoch 4/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 7s 99ms/step - accuracy: 0.8529 - loss: 0.3521 - val_accuracy: 0.8292 - val_loss: 0.4130
Epoch 5/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 8s 120ms/step - accuracy: 0.8829 - loss: 0.2939 - val_accuracy: 0.8277 - val_loss: 0.4155
Epoch 6/10
71/71 ━━━━━━━━━━━━━━━━━━━━ 10s 113ms/step - accuracy: 0.9082 - loss: 0.2488 - val_accuracy: 0.8192 - val_loss: 0.4443
Epoch 7/10
71/71

Does the new model perform any better than the previous two?

In terms of validation accuracy, no: in the epochs shown, the CNN does not reach a higher peak than the two previous models (0.8292, against 0.8356 and 0.8363). Furthermore, its validation accuracy starts to fall from epoch 5 onwards, and it does so faster than in the other two models, so we cannot conclude that this model performs better.