# Grade prediction with BERT embeddings

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the reviews we scraped in previous courses, predicting whether each one is positive or negative.

Under the hood, the model is actually made up of two models.


1.   DistilBERT processes the sentence and passes the information it extracted on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at HuggingFace. It's a lighter and faster version of BERT that roughly matches its performance.
2.   The next model, a basic logistic regression model from scikit-learn, identifies whether the review is positive or negative.



## Installing the transformers library

Let's start by installing the Hugging Face transformers library so we can load our deep learning NLP model.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-win_amd64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import nltk
import re
import itertools
import unidecode
import tensorflow as tf
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset

We'll import the dataset as we did in previous notebooks:

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
dirpath = "drive" 
os.listdir(dirpath)

Mounted at /content/drive


['Othercomputers',
 '.file-revisions-by-id',
 'MyDrive',
 '.shortcut-targets-by-id',
 '.Trash-0']

In [None]:
# Load your reviews file (adjust the path to wherever the CSV is stored)
reviews = pd.read_csv("corpus_total.csv")

In [None]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            200 non-null    int64 
 1   text             200 non-null    object
 2   grade            200 non-null    int64 
 3   date             200 non-null    object
 4   company          200 non-null    object
 5   n_words          200 non-null    int64 
 6   clean_text       200 non-null    object
 7   tokenized_text   200 non-null    object
 8   bigrams          200 non-null    object
 9   trigrams         200 non-null    object
 10  wordDict         200 non-null    object
 11  tfBOW            200 non-null    object
 12  tfIDF            200 non-null    object
 13  lemmatized_text  200 non-null    object
 14  stemmed_text     200 non-null    object
dtypes: int64(3), object(12)
memory usage: 23.6+ KB


In [None]:
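# Parse each stringified token list back into a Python list
# (ast.literal_eval only accepts literals, so it is safer than eval)
import ast
reviews["tokenized_text"] = [ast.literal_eval(i) for i in reviews.tokenized_text]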

First, we need to decide what counts as a positive review. Here, we consider a review positive if its grade is 4 or 5, and negative otherwise.

In [None]:
# Label each review: True (positive) for grades 4 and 5, False (negative) otherwise
reviews["sentiment"] = reviews.grade.isin([4,5])
# Check for missing or empty reviews (nothing is removed here, we only count them)
print("Number of nan reviews: ", reviews.text.isna().sum())
print("Number of Empty Reviews: ", sum([len(i) == 0 for i in reviews.text]))


Number of nan reviews:  0
Number of Empty Reviews:  0


To keep computation manageable, we would ideally restrict ourselves to reviews of fewer than 100 words. Let's first check how many reviews exceed that length.

In [None]:
reviews.tokenized_text

0      [service, electricite, tres, bien, si, service...
1      [honteux, honteux, si, pouvais, mettre, zero, ...
2      [incomprehension, service, client, incomprehen...
3      [tres, satisfaite, souscrivant, ligne, voulais...
4      [tres, bonne, application, tres, bonne, applic...
                             ...                        
195                            [rapide, efficace, clair]
196    [souscription, total, energie, explicite, donc...
197                                 [bien, cordialement]
198                                    [aucun, probleme]
199                                    [aucun, probleme]
Name: tokenized_text, Length: 200, dtype: object

In [None]:
#Save in two variables the comments and the labels

#sentences = list(itertools.chain(*reviews["review_sentences"]))
print(f"Number of total reviews: {reviews.shape[0]}")

sentence_lengths = [len(i) for i in reviews.tokenized_text]
cond  = [i > 100 for i in sentence_lengths]

long_sentences = [x for x, m in zip(sentence_lengths, cond) if m]
print("Number of sentences longer than 100 words: ", len(long_sentences))
labels = [int(i) for i in reviews.sentiment]


Number of total reviews: 200
Number of sentences longer than 100 words:  170


There are 170 reviews longer than 100 words. Our corpus consists of only the last 200 reviews for Total Energies, so dropping them would leave us with a very small sample. We will therefore keep them.

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model.

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Right now, the variable model holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.

## Model 1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [None]:
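# Fit a word-level Keras tokenizer on the reviews.
# Note: this rebinds `tokenizer`, replacing the DistilBERT tokenizer loaded above.
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=' ', char_level=False)
tokenizer.fit_on_texts(reviews.tokenized_text)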

In [None]:
reviews.clean_text

0      service electricite tres bien mais si service ...
1      honteux honteux si je pouvais mettre zero ils ...
2      incomprehension avec le service client incompr...
3      tres satisfaite en souscrivant en ligne je vou...
4      tres bonne application tres bonne application ...
                             ...                        
195                             rapide efficace et clair
196    souscription total energie explicite et donc p...
197                                    bien cordialement
198                                       aucun probleme
199                                       aucun probleme
Name: clean_text, Length: 200, dtype: object

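We created a Keras tokenizer and fitted it on the tokenized reviews, so each word now maps to an integer index. Note that these indices come from a vocabulary built on our own corpus, not from DistilBERT's WordPiece vocabulary, so the pretrained model will not associate them with the words they stand for.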

In [None]:
sequences = tokenizer.texts_to_sequences(reviews.tokenized_text)
print(reviews["tokenized_text"][0], sequences[0], sep="\n\n")
print("\n\n\n")
print(len(reviews["tokenized_text"][0]), len(sequences[0]), sep=" ")

['service', 'electricite', 'tres', 'bien', 'si', 'service', 'electricite', 'tres', 'bien', 'si', 'surplus', 'reglement', 'ca', 'complique', 'mot', 'plus', 'employe', 'services', 'patience', 'cest', 'service', 'comptable', 'priori', 'plutot', 'usine', 'gaz', 'franchement', 'jai', 'nerfs', 'faire', 'tresorerie', 'total']

[1, 74, 5, 13, 20, 1, 74, 5, 13, 20, 573, 574, 30, 575, 964, 3, 965, 83, 314, 11, 1, 966, 967, 410, 968, 25, 411, 2, 969, 21, 970, 4]




32 32
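For comparison, BERT-family tokenizers split text into WordPiece subwords rather than whole words. The following is a minimal sketch of the greedy longest-match-first lookup, assuming a tiny made-up vocabulary. The names `wordpiece_tokenize` and `toy_vocab` are hypothetical and exist only for illustration; the real DistilBERT vocabulary is far larger and has its own IDs.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary, for illustration only
toy_vocab = {"[UNK]", "service", "elect", "##ric", "##ite", "tres", "bien"}

for word in ["service", "electricite", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word that is in the vocabulary stays whole, a rarer word is split into several pieces, and a word with no matching pieces falls back to the unknown token.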


### Padding
After tokenization, sequences is a list of reviews -- each review is represented as a list of token IDs. We want BERT to process our examples all at once (as one batch), which is faster than processing them one by one. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array rather than a list of lists of different lengths.


In [None]:
# Perform the padding: right-pad every sequence with 0 up to the longest review

max_len = max(len(tokens) for tokens in reviews.tokenized_text)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
print(padded[0])

# Check that every padded sequence has length max_len
(pd.Series([len(pad) for pad in padded]) == max_len).all()

[1, 74, 5, 13, 20, 1, 74, 5, 13, 20, 573, 574, 30, 575, 964, 3, 965, 83, 314, 11, 1, 966, 967, 410, 968, 25, 411, 2, 969, 21, 970, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


True
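The same padding can be written as a small reusable helper that also truncates, which matters if a review ever exceeds the model's maximum input length. This is a minimal sketch on made-up sequences; `pad_right` and `toy_sequences` are hypothetical names, not part of the notebook.

```python
import numpy as np


def pad_right(seqs, pad_id=0, max_len=None):
    """Right-pad (and truncate, if needed) integer sequences to one length."""
    if max_len is None:
        max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(seqs):
        trimmed = seq[:max_len]          # drop anything beyond max_len
        out[row, :len(trimmed)] = trimmed
    return out


# Made-up token ID sequences of different lengths
toy_sequences = [[1, 74, 5], [13], [20, 1, 74, 5, 13]]
padded_toy = pad_right(toy_sequences)
print(padded_toy.shape)
print(padded_toy)
```

Returning a NumPy array directly avoids the separate `np.array(padded)` conversion step.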

Our dataset is now in the padded variable. We can view its dimensions below:

In [None]:
padded = np.array(padded)
padded.shape

(200, 207)

### Masking
If we directly send padded to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(200, 207)
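To see why the mask matters, here is a simplified sketch of how an attention mask is typically applied: masked positions get a very negative score before the softmax, so they end up with (almost) zero attention weight. The scores below are made up and the variable names are hypothetical; this is an illustration of the idea, not DistilBERT's actual implementation.

```python
import numpy as np

# Two made-up padded sequences (0 is the padding ID)
padded_toy = np.array([[1, 74, 5, 0, 0],
                       [13, 20, 0, 0, 0]])
mask_toy = np.where(padded_toy != 0, 1, 0)

# Made-up raw attention scores for one query position per row
scores = np.zeros(padded_toy.shape, dtype=float)

# Padded positions receive a very negative score, so softmax gives them ~0 weight
masked_scores = np.where(mask_toy == 1, scores, -1e9)
weights = np.exp(masked_scores - masked_scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(mask_toy)
print(weights.round(3))
```

Each row of `weights` sums to 1 over the real tokens only, so the padding contributes nothing to the result.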

## Model 1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model! 

The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.

In [None]:
max(tokenizer.index_word.keys())

2268

In [None]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. For sentence classification, BERT's own tokenizer adds a special token called [CLS] (for classification) at the beginning of every sentence, and the output corresponding to that token can be thought of as an embedding for the entire sentence. Keep in mind that the sequences above were built with the Keras tokenizer, which does not insert [CLS], so here position 0 holds the output for the first word of each review.

We'll save those in the features variable, as they'll serve as the features for our logistic regression model.



In [None]:
features = last_hidden_states[0][:,0,:].numpy()
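The indexing `[:, 0, :]` may be easier to read on a stand-in array. The sketch below assumes the model output has shape (batch, sequence length, hidden size) and uses random numbers in place of real hidden states; the sizes and names are made up for illustration.

```python
import numpy as np

batch, seq_len, hidden = 4, 6, 768
rng = np.random.default_rng(0)
# Stand-in for the model's hidden states: one vector per token per sequence
hidden_states = rng.normal(size=(batch, seq_len, hidden))

# All sequences, position 0 only, all hidden units
first_token = hidden_states[:, 0, :]
print(first_token.shape)
```

The result has one row per sequence and one column per hidden unit, which is the 2-d layout scikit-learn expects for a feature matrix.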

In [None]:
features.shape, len(labels)

((200, 768), 200)
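As an alternative to building IDs, padding and mask by hand, the Hugging Face tokenizer can produce all three in one call, using the model's own vocabulary and adding the [CLS] token. The following is a minimal sketch under that assumption; `cls_embeddings`, `hf_tokenizer` and `hf_model` are hypothetical names, and the function expects the tokenizer and model objects returned by `from_pretrained`.

```python
import torch


def cls_embeddings(texts, hf_tokenizer, hf_model, max_length=512):
    """Return one first-token ([CLS]) vector per text as a NumPy array."""
    # The tokenizer adds special tokens, pads the batch and builds the attention mask
    encoded = hf_tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    hf_model.eval()  # make sure dropout is disabled
    with torch.no_grad():
        output = hf_model(**encoded)
    # Keep position 0 of every sequence
    return output.last_hidden_state[:, 0, :].numpy()
```

With the DistilBERT tokenizer and model kept under their own names (the notebook later rebinds both `tokenizer` and `model`), this could be called as `cls_embeddings(list(reviews.clean_text), hf_tokenizer, hf_model)`. For a larger corpus the texts would need to be processed in smaller batches to limit memory use.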

## Model 2: Train/Test Split
Let's now split our dataset into a training set and a testing set (keeping in mind that we only have 200 reviews in total).

In [None]:
# Split the data in train and test with the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(features, labels)
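With so few reviews, an unseeded random split can give different class proportions in the training and testing sets from one run to the next. One option is to fix `random_state` and pass `stratify` so both sets keep the overall class balance. This is a minimal sketch on made-up data; `X_toy` and `y_toy` are hypothetical stand-ins for `features` and `labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8))          # stand-in for the 768-d features
y_toy = np.array([1] * 80 + [0] * 120)     # made-up 40/60 class balance

# stratify keeps the class proportions; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Both printed means should be close to the overall share of positive labels.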

## [Bonus] Grid Search for Parameters
We can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which is the inverse of the regularization strength (smaller values of C mean stronger regularization).

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = LogisticRegression()
param_grid = {'C': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 2, 4, 6]}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X_train, y_train)

print("Best hyperparameters: ", search.best_params_)
print("Best score: ", search.best_score_)



Best hyperparameters:  {'C': 0.4}
Best score:  0.8533333333333334
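To make the role of C concrete: with an L2 penalty and labels $y_i \in \{-1, 1\}$, the logistic regression objective can be written in the following form. This is the form I recall from the scikit-learn user guide, so treat the exact notation as an assumption; $w$ are the coefficients and $c$ the intercept.

$$
\min_{w,\, c} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

C multiplies the data-fit term, so a small C lets the penalty $\frac{1}{2} w^\top w$ dominate and shrinks the coefficients, while a large C lets the model fit the training data more closely.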


We now train the LogisticRegression model. If you ran the grid search, you can plug the best value of C into the model declaration (here, LogisticRegression(C=0.4)).

In [None]:
# Fit the logistic regression model
model = LogisticRegression(C=0.4)
model.fit(X_train, y_train)


LogisticRegression(C=0.4)

## Evaluating Model 2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [None]:
# Evaluate the score of the model
acc = model.score(X_test, y_test)
print("Model Accuracy on test data: ", acc)

Model Accuracy on test data:  0.72


How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [None]:

# Compare your model with a dummy classifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
clf.fit(X_train, y_train)
acc_dummy = clf.score(X_test, y_test)
print("Dummy model accuracy: ", acc_dummy)

Dummy model accuracy:  0.4
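As far as I know, the default DummyClassifier always predicts the most frequent class seen during training, so its accuracy is simply the share of that class in the test set. A minimal pure-Python sketch of that baseline, on made-up labels (`y_train_toy` and `y_test_toy` are hypothetical):

```python
from collections import Counter

# Made-up labels, for illustration only
y_train_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_test_toy = [0, 0, 1, 0, 1]

# The baseline always predicts the most frequent training label
majority_class = Counter(y_train_toy).most_common(1)[0][0]
baseline_acc = sum(y == majority_class for y in y_test_toy) / len(y_test_toy)
print(majority_class, baseline_acc)
```

This also shows how such a baseline can score below 50%: if the majority class of the training set happens to be the minority in the test set, which is plausible with a small, unstratified split.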


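Our model reaches 72% accuracy against 40% for the dummy classifier, so it clearly does better, though with only 200 reviews in total these figures are noisy estimates.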