<a href="https://colab.research.google.com/github/pierredevillers/DMML2022_Coop/blob/main/BERT_TensorFlow_model/BERT_and_Tensor_Flow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BERT and TensorFlow for Text Classification**

In this notebook, we implement a classification model that builds on BERT-like models imported from TensorFlow Hub to obtain a multilingual text classifier. Of the models we tried, this one gave us the highest accuracy score; it operates on embeddings of the input sentences. The implementation follows these steps:

1. Load the data and restructure it into a suitable format
2. Install and import the packages and tools needed to set up the model
3. Preprocess and prepare the data for training
4. Set up the model
5. Train the model
6. Predict on the unlabelled data

**References:** This model is adapted from the one proposed [here](https://towardsdatascience.com/multi-label-text-classification-using-bert-and-tensorflow-d2e88d8f488d).

##1. Data loading and structuring

In this part, we download the CSV files from our GitHub repository and restructure the data, replacing the French difficulty levels (A1 to C2) with numeric values.

In [1]:
# Connect the colab to the Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# Import and read the training data in a dataframe 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/pierredevillers/DMML2022_Coop/main/CSV_files/training_data.csv')
df.head()

Unnamed: 0,id,sentence,difficulty
0,0,Les coûts kilométriques réels peuvent diverger...,C1
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1
2,2,Le test de niveau en français est sur le site ...,A1
3,3,Est-ce que ton mari est aussi de Boston?,A1
4,4,"Dans les écoles de commerce, dans les couloirs...",B1


In [3]:
# Replace the difficulty levels with numerical values
df['Labels'] = df['difficulty'].map({'A1': 0,
                                     'A2': 1,
                                     'B1': 2,
                                     'B2': 3,
                                     'C1': 4,
                                     'C2': 5})

# Drop unused column
df = df.drop(["difficulty"], axis=1)

df.head()

Unnamed: 0,id,sentence,Labels
0,0,Les coûts kilométriques réels peuvent diverger...,4
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",0
2,2,Le test de niveau en français est sur le site ...,0
3,3,Est-ce que ton mari est aussi de Boston?,0
4,4,"Dans les écoles de commerce, dans les couloirs...",2
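
One thing to watch with the mapping above: pandas `Series.map` returns `NaN` for any value that is not a key of the dictionary, so a misspelled or unexpected level would pass through silently. The following is a minimal plain-Python sketch of the same encoding with an explicit check. `LEVEL_TO_ID` and `encode_levels` are illustrative names that do not appear in the notebook.

```python
# Hypothetical helper mirroring the dictionary used in the mapping cell.
LEVEL_TO_ID = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

def encode_levels(levels):
    """Map CEFR level strings to integer class ids, failing loudly on unknown labels."""
    unknown = sorted(set(levels) - set(LEVEL_TO_ID))
    if unknown:
        raise ValueError(f"Unexpected difficulty levels: {unknown}")
    return [LEVEL_TO_ID[level] for level in levels]

# The five labels shown in the training data preview.
print(encode_levels(["C1", "A1", "A1", "A1", "B1"]))  # [4, 0, 0, 0, 2]
```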


##2. Setup and import of the different tools and packages needed for the model

In this part, we install and import the packages needed to implement the model.

In [4]:
%%capture
# Installing kaggle (the %%capture cell magic has to be the first line of the cell)
! pip install kaggle

In [5]:
%%capture
# Installing the needed packages from the TensorFlow software library
! pip install "tensorflow>=1.7.0"
! pip install tensorflow-hub

In [6]:
%%capture
! pip install tensorflow-text

To speed up training, we run the model on a GPU available through Colab.

In [7]:
# Verifying the availability of the GPU 

import torch
torch.cuda.is_available()

True

In [8]:
# Testing the GPU 

import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

In [9]:
from sklearn.model_selection import train_test_split
import tensorflow_hub as hub
import tensorflow_text as text
from keras import backend as K
from torch.utils.data import Dataset, DataLoader

##3. Preprocessing and preparation of the data for the training

In this part, we split and reshape the data so that the model can use it. We first create the train and test splits, and then define the BERT embedding step that turns each sentence into a vector for the model.

In [10]:
# Defining the number of classes that we want in our model 

num_classes = len(df["Labels"].value_counts())

# Transforming the class labels into a binary class matrix (one-hot encoding)
y = tf.keras.utils.to_categorical(df["Labels"].values, num_classes=num_classes)

# Splitting the labelled data: 99% for training, 1% held out for testing
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], y, test_size=0.01,  random_state=0)
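
`to_categorical` turns each integer label into a one-hot row, which is the target format that the `categorical_crossentropy` loss expects. The plain-Python sketch below shows the transformation on the five labels from the preview. `one_hot` is an illustrative stand-in, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([4, 0, 0, 0, 2], num_classes=6)
print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```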

In [11]:
# Import of the layers of the model and definition of a function applying them

preprocessor = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1")


def get_embeddings(sentences):
  '''return BERT-like embeddings of input text
  Args:
    - sentences: list of strings
  Output:
    - BERT-like embeddings: tf.Tensor of shape=(len(sentences), 768)
  '''
  preprocessed_text = preprocessor(sentences)
  return encoder(preprocessed_text)['pooled_output']

# Testing of the function on one line of the data

get_embeddings([
    "Les coûts kilométriques réels peuvent diverger sensiblement des valeurs moyennes en fonction du moyen de transport utilisé, du taux d'occupation ou du taux de remplissage, de l'infrastructure utilisée, de la topographie des lignes, du flux de trafic, etc."]
)



<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-3.63910288e-01,  2.15399116e-01, -5.70777953e-01,
        -4.57450926e-01,  1.94023758e-01, -3.41217756e-01,
        -4.04708870e-02, -2.72382408e-01, -2.03393355e-01,
         4.53950949e-02, -7.60403514e-01, -3.80386561e-01,
         5.12292683e-02, -3.07877928e-01, -6.63932621e-01,
        -3.10593367e-01, -1.94578730e-02, -3.24810803e-01,
         9.24853329e-03, -6.82702288e-02,  4.39894229e-01,
         1.90010861e-01, -1.96524426e-01,  3.22621286e-01,
        -2.80323680e-02, -9.78578866e-01,  2.13867724e-02,
        -7.58467019e-01, -3.53789777e-01, -4.06974316e-01,
         2.11517304e-01, -6.07400119e-01, -3.57876718e-01,
        -4.86232966e-01, -5.95954359e-01,  4.41087745e-02,
        -2.92999297e-02,  2.86389202e-01, -5.22589386e-01,
         5.57419658e-01, -1.57448545e-01, -3.20107073e-01,
        -4.56426620e-01,  3.34122241e-01,  1.46667613e-02,
         3.74394916e-02, -1.25981212e-01, -2.15156242e-01,
      

In [12]:
# Definition of the evaluation metrics used to assess the performance of the model

def balanced_recall(y_true, y_pred):
    """This function calculates the balanced recall metric
    recall = TP / (TP + FN)
    """
    recall_by_class = 0
    # iterate over each predicted class to get class-specific metric
    for i in range(y_pred.shape[1]):
        y_pred_class = y_pred[:, i]
        y_true_class = y_true[:, i]
        true_positives = K.sum(K.round(K.clip(y_true_class * y_pred_class, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true_class, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        recall_by_class = recall_by_class + recall
    return recall_by_class / y_pred.shape[1]

def balanced_precision(y_true, y_pred):
    """This function calculates the balanced precision metric
    precision = TP / (TP + FP)
    """
    precision_by_class = 0
    # iterate over each predicted class to get class-specific metric
    for i in range(y_pred.shape[1]):
        y_pred_class = y_pred[:, i]
        y_true_class = y_true[:, i]
        true_positives = K.sum(K.round(K.clip(y_true_class * y_pred_class, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred_class, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        precision_by_class = precision_by_class + precision
    # return average balanced metric for each class
    return precision_by_class / y_pred.shape[1]

def balanced_f1_score(y_true, y_pred):
    """This function calculates the F1 score metric"""
    precision = balanced_precision(y_true, y_pred)
    recall = balanced_recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))
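
Because these metrics round the softmax probabilities, a class counts as predicted only when its probability exceeds 0.5. A sentence whose highest probability is lower than that contributes no predicted positive at all. The sketch below reproduces the same macro-averaged computation in plain Python on a toy batch with three classes. The data and the names `macro_scores` and `EPS` are illustrative; `EPS` stands in for `K.epsilon()`, whose default is 1e-7.

```python
EPS = 1e-7  # stand-in for K.epsilon()

def macro_scores(y_true, y_prob):
    """Macro-averaged recall, precision and F1, with probabilities thresholded at 0.5."""
    n_classes = len(y_true[0])
    recalls, precisions = [], []
    for c in range(n_classes):
        truth = [row[c] for row in y_true]
        pred = [1 if row[c] > 0.5 else 0 for row in y_prob]
        tp = sum(t * p for t, p in zip(truth, pred))
        recalls.append(tp / (sum(truth) + EPS))
        precisions.append(tp / (sum(pred) + EPS))
    recall = sum(recalls) / n_classes
    precision = sum(precisions) / n_classes
    f1 = 2 * precision * recall / (precision + recall + EPS)
    return recall, precision, f1

# Toy batch: 4 sentences, 3 classes (one-hot truth, softmax-like predictions).
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y_prob = [
    [0.8, 0.1, 0.1],  # confident and correct
    [0.6, 0.3, 0.1],  # confident but wrong (predicts class 0)
    [0.1, 0.2, 0.7],  # confident and correct
    [0.4, 0.3, 0.3],  # no class above 0.5, so no predicted positive
]
recall, precision, f1 = macro_scores(y_true, y_prob)
print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.5 0.5 0.5
```

When such functions are passed to `model.compile`, Keras evaluates them per batch and reports the running average, so the values logged for an epoch can differ slightly from the same metric computed once over the whole dataset.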

##4. Set-up of the model 

In this part, we define the architecture of the model before training it on our split data.

In [13]:
# Definition of the model's architecture

i = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
x = preprocessor(i)
x = encoder(x)
x = tf.keras.layers.Dropout(0.01, name="dropout")(x['pooled_output'])
x = tf.keras.layers.Dense(num_classes, activation='softmax', name="output")(x)

model = tf.keras.Model(i, x)
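
The final `Dense` layer maps the 768-dimensional pooled embedding to one score per class, and the softmax activation turns those scores into probabilities that sum to 1. Here is a small plain-Python sketch of that last step, using made-up scores for the six levels:

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for A1..C2; not taken from the trained model.
probs = softmax([0.2, 1.5, 3.0, 0.7, -1.0, -2.0])
print(max(range(len(probs)), key=probs.__getitem__))  # 2, i.e. B1
```

Since `hub.KerasLayer` defaults to `trainable=False`, the encoder weights stay frozen in this setup and only the output layer is trained.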



##5. Training of the model

Now that the model is set up, we train it on the training data. We also define an early-stopping callback that halts training when the validation loss has not improved for three consecutive epochs and restores the weights from the best epoch.

In [14]:
# Training the model for up to 20 epochs; after each epoch, the scores on the training data and on the test data are reported

n_epochs = 20

METRICS = [
      tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
      balanced_recall,
      balanced_precision,
      balanced_f1_score
]

earlystop_callback = tf.keras.callbacks.EarlyStopping(monitor = "val_loss", 
                                                      patience = 3,
                                                      restore_best_weights = True)

model.compile(optimizer = "adam",
              loss = "categorical_crossentropy",
              metrics = METRICS)

model_fit = model.fit(x_train, 
                      y_train, 
                      epochs = n_epochs,
                      validation_data = (x_test, y_test),
                      callbacks = [earlystop_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
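
The log above ends at epoch 10 of 20, which is consistent with the early-stopping callback ending the training. The sketch below illustrates the rule in plain Python: training halts once the validation loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one whose weights are kept. The loss values are invented for illustration, since the per-epoch metrics are not shown above, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # No early stop: training ran through every epoch.
    return len(val_losses), best_epoch

# Made-up validation losses: best value at epoch 7, then three epochs without improvement.
losses = [1.60, 1.45, 1.38, 1.33, 1.31, 1.30, 1.28, 1.29, 1.30, 1.29, 1.27]
print(early_stopping_epoch(losses))  # (10, 7)
```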


##6. Prediction on the unlabelled data

Once the model is trained, we use it on the unlabelled data to predict the difficulty level of each French sentence. We create a function that classifies the sentences into numerical classes, which we then convert back into difficulty levels.


In [15]:
#Loading the unlabelled data

test_df = pd.read_csv('https://raw.githubusercontent.com/pierredevillers/DMML2022_Coop/main/CSV_files/unlabelled_test_data.csv')

reviews = test_df['sentence']

# The function classifies each sentence by returning the index of the highest predicted probability

def predict_class(reviews):
  '''predict class of input text
  Args:
    - reviews (list of strings)
  Output:
    - class (list of int)
  '''
  return [np.argmax(pred) for pred in model.predict(reviews)]
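
`model.predict` returns one row of six probabilities per sentence. Taking the index of the largest value and looking it up in the inverse of the label mapping recovers the CEFR level. Here is a plain-Python sketch with invented probabilities; `ID_TO_LEVEL` and `decode` are illustrative names.

```python
ID_TO_LEVEL = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2"}

def decode(prob_rows):
    """Pick the most probable class per row and return its CEFR label."""
    labels = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        labels.append(ID_TO_LEVEL[best])
    return labels

# Invented probability rows, one per sentence.
rows = [
    [0.02, 0.03, 0.05, 0.10, 0.20, 0.60],
    [0.10, 0.55, 0.20, 0.10, 0.03, 0.02],
]
print(decode(rows))  # ['C2', 'A2']
```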

In [16]:
# Applying the prediction function and converting the predicted classes back into difficulty levels

x_new = test_df.sentence
y_new_pred = predict_class(reviews)

test_df['difficulty'] = y_new_pred
test_df['difficulty'] = test_df['difficulty'].replace({0:'A1', 1:'A2', 2: 'B1', 3:'B2', 4:'C1', 5:'C2'})

test_df



Unnamed: 0,id,sentence,difficulty
0,0,Nous dûmes nous excuser des propos que nous eû...,C2
1,1,Vous ne pouvez pas savoir le plaisir que j'ai ...,A2
2,2,"Et, paradoxalement, boire froid n'est pas la b...",B1
3,3,"Ce n'est pas étonnant, car c'est une saison my...",B1
4,4,"Le corps de Golo lui-même, d'une essence aussi...",C2
...,...,...,...
1195,1195,C'est un phénomène qui trouve une accélération...,B1
1196,1196,Je vais parler au serveur et voir si on peut d...,B1
1197,1197,Il n'était pas comme tant de gens qui par pare...,C2
1198,1198,Ils deviennent dangereux pour notre économie.,B2


In [17]:
test_df = test_df.drop(columns=['sentence'])

In [18]:
test_df

Unnamed: 0,id,difficulty
0,0,C2
1,1,A2
2,2,B1
3,3,B1
4,4,C2
...,...,...
1195,1195,B1
1196,1196,B1
1197,1197,C2
1198,1198,B2


In [23]:
# Saving the predictions to a local CSV file

test_df.to_csv('Group_Coop_BERT.csv', index=False)