## Test goal 🎯

The goal of this test is to build a model that classifies documents into one of two sentiment categories: **positive** or **negative**.

**Datasets**

The train dataset (train.txt file) contains 25000 documents: the first 12500 are positive and the last 12500 are negative.

The test dataset (test.txt file) contains 25000 documents to assess: the first 12500 are positive and the last 12500 are negative.

You will be guided through the different stages of model development.

**Question 1:** Load the train and test data into dataframes.

In [4]:
import pandas as pd

# read train data
with open("train.txt", "r", encoding="utf-8") as f:
    train_lines = f.readlines()

# creation of a dataframe from train data
train_df = pd.DataFrame(train_lines, columns=["data"])

# read test data
with open("test.txt", "r", encoding="utf-8") as f:
    test_lines = f.readlines()

# creation of a dataframe from test data
test_df = pd.DataFrame(test_lines, columns=["data"])

print("train_df shape :", train_df.shape)
display(train_df.head(5))
print("\ntest_df shape :", test_df.shape)
display(test_df.head(5))


train_df shape : (25000, 1)


Unnamed: 0,data
0,Bromwell High is a cartoon comedy. It ran at t...
1,Homelessness (or Houselessness as George Carli...
2,Brilliant over-acting by Lesley Ann Warren. Be...
3,This is easily the most underrated film inn th...
4,This is not the typical Mel Brooks film. It wa...



test_df shape : (25000, 1)


Unnamed: 0,data
0,I went and saw this movie last night after bei...
1,Actor turned director Bill Paxton follows up h...
2,As a recreational golfer with some knowledge o...
3,"I saw this film in a sneak preview, and it is ..."
4,Bill Paxton has taken the true story of the 19...



**Question 2:** Once the train and test datasets are loaded, add to each dataframe a "label" column that contains the sentiment of each document. In each file, the first half of the documents have positive sentiment and the second half have negative sentiment.

In [5]:
# add label column to train data frame
train_df['label'] = [1] * 12500 + [0] * 12500

# add label column to test data frame
test_df['label'] = [1] * 12500 + [0] * 12500

print("\ntrain_df\n")
display(train_df.iloc[12498:12502])
print("\ntest_df\n")
display(test_df.iloc[12498:12502])


train_df



Unnamed: 0,data,label
12498,A Christmas Together actually came before my t...,1
12499,Working-class romantic drama from director Mar...,1
12500,Story of a man who has unnatural feelings for ...,0
12501,Airport '77 starts as a brand new luxury 747 p...,0



test_df



Unnamed: 0,data,label
12498,"This movie, with all its complexity and subtle...",1
12499,I've seen this story before but my kids haven'...,1
12500,Once again Mr. Costner has dragged out a movie...,0
12501,This is an example of why the majority of acti...,0
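
The hard-coded `12500` above works only if each file holds exactly 25000 lines. As a minimal sketch, the labels can instead be derived from the number of documents. The helper name `half_split_labels` is hypothetical and not part of the original notebook.

```python
def half_split_labels(n_documents):
    """Return 1 for the first half of the documents and 0 for the second half."""
    if n_documents % 2 != 0:
        raise ValueError("expected an even number of documents")
    half = n_documents // 2
    return [1] * half + [0] * half


labels = half_split_labels(25000)
print(sum(labels), len(labels))  # 12500 25000
```

In the notebook this could be used as `train_df['label'] = half_split_labels(len(train_df))`, assuming the file layout described in the question.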


**Question 3:** Before transforming the documents into a format that machine learning algorithms can process, we first need to clean their content. The following regular expressions might be helpful.

```
<br /><br /> # matches HTML line-break tags
[^\x00-\x7f] # matches non-ASCII characters
[^\w\s] # matches punctuation (anything that is neither a word character nor whitespace)
[0-9]+[a-z]* # matches digits, optionally followed by lowercase letters
```
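
To see what each pattern actually matches, it can help to run them with `re.findall` on a small string. The sample sentence below is made up for illustration and does not come from the dataset.

```python
import re

# Made-up review snippet; "\u00e9" is the non-ASCII letter e with an acute accent.
sample = "Great film!<br /><br />I rated it 10 stars in the 90s, a real caf\u00e9 classic."

tags = re.findall(r"<br /><br />", sample)       # the literal double line-break tag
non_ascii = re.findall(r"[^\x00-\x7f]", sample)  # characters outside the ASCII range
symbols = re.findall(r"[^\w\s]", sample)         # punctuation and other symbols
numbers = re.findall(r"[0-9]+[a-z]*", sample)    # digit runs plus trailing lowercase letters

print(tags)       # ['<br /><br />']
print(non_ascii)  # one match: the accented letter
print(symbols)    # ['!', '<', '/', '>', '<', '/', '>', ',', '.']
print(numbers)    # ['10', '90s']
```

Note that `[^\w\s]` also matches the `<`, `/`, and `>` of an HTML tag, which is why removing the tags first matters: otherwise `br` would be left behind as a word.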

In [6]:
import re

def clean_text(text):
    # remove HTML line-break tags
    text = re.sub(r'<br /><br />', ' ', text)

    # remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]+', ' ', text)

    # remove punctuation, then digits (with any lowercase letters attached, e.g. "90s")
    text = re.sub(r'[^\w\s]+', '', text)
    text = re.sub(r'[0-9]+[a-z]*', '', text)

    # collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply the cleanup function to each document in train_df
train_df['data'] = train_df['data'].apply(clean_text)

# Apply the cleanup function to each test_df document
test_df['data'] = test_df['data'].apply(clean_text)


**Question 4:** Now that the content of the documents is clean, we have to tokenize the documents and transform them into vectors using the text tokenizer from Keras (don't forget to pad the sequences produced by the tokenizer so that all vectors have the same size).

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# initialize tokenizer
tokenizer = Tokenizer(num_words=10000)

# fit on clean text
tokenizer.fit_on_texts(train_df['data'])

# convert text to sequences
train_sequences = tokenizer.texts_to_sequences(train_df['data'])
test_sequences = tokenizer.texts_to_sequences(test_df['data'])

# pad sequences
train_padded_sequences = pad_sequences(train_sequences, maxlen=500)
test_padded_sequences = pad_sequences(test_sequences, maxlen=500)
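
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`) and fills with `0`. The pure-Python sketch below mimics that behaviour for a single sequence, to make the effect of `maxlen=500` concrete. `pad_pre` is a hypothetical helper for illustration, not a Keras function, and it assumes `maxlen` is positive.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence (maxlen > 0)."""
    truncated = sequence[-maxlen:]                 # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(truncated))  # zeros to add on the left
    return padding + truncated


print(pad_pre([7, 3, 9], 5))           # [0, 0, 7, 3, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

For reviews longer than 500 tokens, this means the beginning of the review is discarded and the end is kept.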


**Question 5:** Create and train a deep learning model using Keras to classify the documents.

Hint1: use an Embedding layer. You can use CNN, RNN and feed-forward layers; there is no need for a transformer model in this case.

Hint2: if you do not have access to a GPU, you may train the model on a portion of the train dataset (Colab notebooks also give access to GPUs for free).
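
Because the train file is ordered (all positive documents first, then all negative ones), taking a portion by slicing the first rows would yield a single class. A minimal sketch for drawing a balanced random subset follows. The helper name `balanced_subset_indices`, the subset size, and the seed are assumptions for illustration.

```python
import random


def balanced_subset_indices(n_documents, n_per_class, seed=0):
    """Pick n_per_class random row indices from each half, then shuffle them."""
    half = n_documents // 2
    rng = random.Random(seed)
    positive = rng.sample(range(half), n_per_class)               # first half: positive
    negative = rng.sample(range(half, n_documents), n_per_class)  # second half: negative
    indices = positive + negative
    rng.shuffle(indices)
    return indices


subset = balanced_subset_indices(25000, 2500)
print(len(subset))  # 5000
```

The resulting indices could then select rows from the padded array and the label column, for example `train_padded_sequences[subset]` and `train_df['label'].iloc[subset]`.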

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# define model
model = Sequential()

# Embedding layer: maps each word index to a dense vector
# (each word becomes a 128-dimensional vector)
model.add(Embedding(10000, 128, input_length=500))

# Convolution layer: extracts local features by sliding filters over the sequence of embedding vectors
# (64 filters, each spanning 5 consecutive words)
model.add(Conv1D(64, 5, activation='relu'))

# Pooling layer: reduces the length of the convolution output
# (keeps only the maximum value in each window of 4 consecutive positions)
model.add(MaxPooling1D(pool_size=4))

# Flatten layer: reshapes the pooled feature maps into a one-dimensional vector
model.add(Flatten())

# Output layer: a single sigmoid unit that produces a value between 0 and 1
model.add(Dense(1, activation='sigmoid'))

# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit model
model.fit(train_padded_sequences, train_df['label'], epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x242ce1a9a10>
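
The output shape and parameter count of each layer can be worked out by hand from the settings used above. The sketch below assumes the Keras defaults for the arguments not set in the notebook: `'valid'` padding and stride 1 for `Conv1D`, and a pooling stride equal to `pool_size`.

```python
vocab_size, embed_dim, seq_len = 10000, 128, 500
filters, kernel_size, pool_size = 64, 5, 4

# Sequence length after each layer
conv_len = seq_len - kernel_size + 1  # 496 positions after the convolution
pool_len = conv_len // pool_size      # 124 positions after max pooling
flat_len = pool_len * filters         # 7936 values fed to the output layer

# Trainable parameters per layer
embedding_params = vocab_size * embed_dim                  # 1280000
conv_params = kernel_size * embed_dim * filters + filters  # 41024 (weights + biases)
dense_params = flat_len + 1                                # 7937 (weights + bias)

total_params = embedding_params + conv_params + dense_params
print(conv_len, pool_len, flat_len)  # 496 124 7936
print(total_params)                  # 1328961
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension are the settings that most affect model size.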

**Question 6:** Evaluate the model on the test dataset using the metrics of your choice

In [18]:
# evaluate model
loss, accuracy = model.evaluate(test_padded_sequences, test_df['label'])

print('\nTest Loss:', loss)
print('Test Accuracy:', accuracy, "\n")



Test Loss: 0.6448918581008911
Test Accuracy: 0.8676400184631348 



**Conclusion**: The model reaches about 87% accuracy on the test set, which is a reasonable baseline. It could be improved by tuning the hyperparameters (e.g. learning rate, number of layers, number of neurons, batch size) or by trying other types of deep learning models. A more detailed analysis could also use other indicators such as a confusion matrix, precision, recall, or F1-score.
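
As a sketch of the more detailed analysis mentioned above, the confusion matrix counts and the metrics derived from them can be computed from true labels and predicted probabilities. The helper `binary_report`, the 0.5 threshold, and the toy inputs are assumptions for illustration, not results from the notebook.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Confusion matrix counts plus precision, recall and F1 for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 positive and 3 negative documents
report = binary_report([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.3, 0.6, 0.2, 0.1])
print(report["tp"], report["tn"], report["fp"], report["fn"])  # 2 2 1 1
```

In the notebook, `y_prob` could come from `model.predict(test_padded_sequences).ravel()` and `y_true` from `test_df['label']`.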