# M2 Data Science - NLP 

#### Hugo Rialan

## Research paper analysis: 

### <u>A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks</u>

Link of the paper: https://aclanthology.org/2020.acl-main.514/

The purpose of this notebook is to show that preprocessing tools on text are an important step before using these texts in more complicated AI models. 

To show this we will build a simple recurrent neural network. We will classify movie reviews from IMDB by good or bad feelings using this network. We will first train the network without preprocessing techniques and then using simple preprocessing techniques like stemming.

We will then compare the results.

We will use:
- __nltk__ and __keras__ for text preprocessing
- __keras__ to build a neural network

In [1]:
import numpy as np
import nltk
import pandas as pd
from sklearn.metrics import f1_score

import re
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from tensorflow.keras import backend as K
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Dense, Activation, Embedding, Dropout, Input, LSTM, Reshape, Lambda, RepeatVector

In [2]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hugorialan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hugorialan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

I have defined here some global variables according to the power of my computer:

In [3]:
nb_review = 15000 # max = 25000
max_review_length = 100
train_size = int(nb_review * 0.6)

vocabulary_size = 5000 
embedding_vector_length = 32

The following function is one of the most important. It will allow the preprocessing of a text. Depending on its arguments, we can add a preprocessing technique. 

In [4]:
def preprocessing(text, basic=True, pos=False, removeStopwords=False, stemming=False):
    """
    Preprocess a text in the order given in the article: Basic, Part Of Speech, Stop Words, Stemming.
    Not all the preprocessing techniques of the article are implemented.
    """
    stop_words = set(stopwords.words('english')) 
    stemmer = SnowballStemmer('english')
    good_tags = ['NN', 'VB', 'JJ', 'RB'] # Tags taken from the article, page 5802
    
    # Basic
    if basic:
        remove_special_char = re.compile('r[^a-z\d]', re.IGNORECASE)
        replace_numerics = re.compile(r'\d+', re.IGNORECASE)
        text = remove_special_char.sub('', text)
        text = replace_numerics.sub('', text)

    text = text.lower().split()
    processedText = []
    
    # Part Of Speech
    if pos:
        pos_tag_text = pos_tag(text)
        text = [text[i] for i in range(len(text)) if pos_tag_text[i][1] in good_tags]
        
    for word in text:
         # Stop Words
        if removeStopwords:
             if word in stop_words:
                    continue    #stop the loop  for this word and continue to the following word
            
        # Stemming
        if stemming:
            word = stemmer.stem(word)
        
        processedText.append(word)
        
    text = ' '.join(processedText)
    return text

In [5]:
imdb = pd.read_csv('./IMDB Dataset.csv')

In [6]:
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Our dataset is quite simple. Reviews are labelled with there sentiment class. 

In [7]:
imdb = imdb[:nb_review]

In [8]:
imdb.shape

(15000, 2)

Then, I created a DataFrame in order to store our results and print them at the end of the notebook:

In [9]:
import pandas as pd
from tabulate import tabulate
 
data = {'Processing':[],
        
        'Accuracy': [],
        'F-score':[]
       }
 
df_results = pd.DataFrame(data)

---

## 1 - Just a Basic preprocessing

In [10]:
x = [preprocessing(text,basic=True, pos=False, removeStopwords=False, stemming=False) for text in list(imdb['review'])]
y = np.array([1 if sentiment=='positive' else 0 for sentiment in list(imdb['sentiment'])])
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(x)
x = pad_sequences(tokenizer.texts_to_sequences(x), maxlen=max_review_length)

For all our experiments, we preprocess the data and separate them into train and test sets. 

In [11]:
x, x_test = x[:train_size], x[train_size:]

In [12]:
y, y_test = y[:train_size], y[train_size:]

In [13]:
K.clear_session()

In [14]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

In [15]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          160000    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               82432     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 242,561
Trainable params: 242,561
Non-trainable params: 0
_________________________________________________________________
None


Then, We use a recurrent neural network to solve our classification problem.

In [16]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=64, verbose=1, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa1f12c4df0>

In [17]:
scores = model.evaluate(x_test, y_test, verbose=0)
y_pred = [int(pred >= 0.5) for pred in model.predict(x_test).ravel()]
df_results = df_results.append({'Processing' : 'Basic', 
                
                'Accuracy' : (scores[1]*100), 
                'F-score' : (f1_score(y_test, y_pred)*100)
               }, ignore_index=True)
print("Accuracy: %.2f%%" % (scores[1]*100))
print("F score: %.2f" % (f1_score(y_test, y_pred)*100))

Accuracy: 82.60%
F score: 83.10


I computed the F-score because this is the score that they used in the research paper.

---

## 2 - Basic, POS, remove stopwords, Stemming on train and test datasets

In [18]:
x = [preprocessing(text, basic=True, pos=True, removeStopwords=True, stemming=True) for text in list(imdb['review'])]
y = np.array([1 if sentiment=='positive' else 0 for sentiment in list(imdb['sentiment'])])
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(x)
x = pad_sequences(tokenizer.texts_to_sequences(x), maxlen=max_review_length)

In [19]:
x, x_test = x[:train_size], x[train_size:]

In [20]:
y, y_test = y[:train_size], y[train_size:]

In [21]:
K.clear_session()

In [22]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

In [23]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=64, verbose=1, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa1f12bcc40>

In [24]:
scores = model.evaluate(x_test, y_test, verbose=0)
y_pred = [int(pred >= 0.5) for pred in model.predict(x_test).ravel()]
df_results = df_results.append({'Processing' : 'All', 
                
                'Accuracy' : (scores[1]*100), 
                'F-score' : (f1_score(y_test, y_pred)*100)
               }, ignore_index=True)
print("Accuracy: %.2f%%" % (scores[1]*100))
print("F score: %.2f" % (f1_score(y_test, y_pred)*100))

Accuracy: 83.12%
F score: 81.74


---

## 3 - All - POS

In [25]:
x = [preprocessing(text, basic=True, pos=False, removeStopwords=True, stemming=True) for text in list(imdb['review'])]
y = np.array([1 if sentiment=='positive' else 0 for sentiment in list(imdb['sentiment'])])
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(x)
x = pad_sequences(tokenizer.texts_to_sequences(x), maxlen=max_review_length)

In [26]:
x, x_test = x[:train_size], x[train_size:]

In [27]:
y, y_test = y[:train_size], y[train_size:]

In [28]:
K.clear_session()

In [29]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

In [30]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=64, verbose=1, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa1f31baaf0>

In [31]:
scores = model.evaluate(x_test, y_test, verbose=0)
y_pred = [int(pred >= 0.5) for pred in model.predict(x_test).ravel()]
df_results = df_results.append({'Processing' : 'All - pos', 
                
                'Accuracy' : (scores[1]*100), 
                'F-score' : (f1_score(y_test, y_pred)*100)
               }, ignore_index=True)
print("Accuracy: %.2f%%" % (scores[1]*100))
print("F score: %.2f" % (f1_score(y_test, y_pred)*100))

Accuracy: 84.12%
F score: 84.57


---

## 4 - All - STOP

In [32]:
x = [preprocessing(text, basic=True, pos=True, removeStopwords=False, stemming=True) for text in list(imdb['review'])]
y = np.array([1 if sentiment=='positive' else 0 for sentiment in list(imdb['sentiment'])])
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(x)
x = pad_sequences(tokenizer.texts_to_sequences(x), maxlen=max_review_length)

In [33]:
x, x_test = x[:train_size], x[train_size:]

In [34]:
y, y_test = y[:train_size], y[train_size:]

In [35]:
K.clear_session()

In [36]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

In [37]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=64, verbose=1, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa1dca56a90>

In [38]:
scores = model.evaluate(x_test, y_test, verbose=0)
y_pred = [int(pred >= 0.5) for pred in model.predict(x_test).ravel()]
df_results = df_results.append({'Processing' : 'All - stop', 
                
                'Accuracy' : (scores[1]*100), 
                'F-score' : (f1_score(y_test, y_pred)*100)
               }, ignore_index=True)
print("Accuracy: %.2f%%" % (scores[1]*100))
print("F score: %.2f" % (f1_score(y_test, y_pred)*100))

Accuracy: 84.47%
F score: 83.54


---

## 5 - All - STEM

In [39]:
x = [preprocessing(text, basic=True, pos=True, removeStopwords=True, stemming=False) for text in list(imdb['review'])]
y = np.array([1 if sentiment=='positive' else 0 for sentiment in list(imdb['sentiment'])])
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(x)
x = pad_sequences(tokenizer.texts_to_sequences(x), maxlen=max_review_length)

In [40]:
x, x_test = x[:train_size], x[train_size:]

In [41]:
y, y_test = y[:train_size], y[train_size:]

In [42]:
K.clear_session()

In [43]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_vector_length))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

In [44]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=3, batch_size=64, verbose=1, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa1de38c910>

In [45]:
scores = model.evaluate(x_test, y_test, verbose=0)
y_pred = [int(pred >= 0.5) for pred in model.predict(x_test).ravel()]
df_results = df_results.append({'Processing' : 'All - stem', 
                
                'Accuracy' : (scores[1]*100), 
                'F-score' : (f1_score(y_test, y_pred)*100)
               }, ignore_index=True)
print("Accuracy: %.2f%%" % (scores[1]*100))
print("F score: %.2f" % (f1_score(y_test, y_pred)*100))

Accuracy: 84.08%
F score: 83.30


---

# Final results

In [46]:
 print(tabulate(round(df_results, 2), headers='keys', tablefmt='pretty',showindex=False))

+------------+----------+---------+
| Processing | Accuracy | F-score |
+------------+----------+---------+
|   Basic    |   82.6   |  83.1   |
|    All     |  83.12   |  81.74  |
| All - pos  |  84.12   |  84.57  |
| All - stop |  84.47   |  83.54  |
| All - stem |  84.08   |  83.3   |
+------------+----------+---------+


Finally, if we look at the accuracy, There is indeed an improvement of the results between Basic and All. The best result for accuracy is All-stop. This is interesting because in the research paper on page 5805, it is specified in the table that their best result for IMDB was also obtained with All-stop. The worst results are with All-stem, which confirms the importance of stemming.

If we look at the F-score, the best results are with All-pos.

---

sources:
- Natural Language Processing courses from Chloé Clavel and Matthieu Labeau
- Deep Learning 1 courses from G. Peeters and A. Newson
- https://www.kaggle.com/natlee/sentiment-analysis-of-imdb-50k-with-keras-model/data?select=IMDB+Dataset.csv