# OpenClassrooms - AI Engineer
# Project 7 - AirParadis
# Detect Bad Buzz with Deep Learning

## Project objective:
- **Develop the prototype of an AI product that predicts the sentiment associated with a tweet**

## Three approaches:
- **Approach 1: 'Off-the-shelf API', using the API of the cognitive service offered by Microsoft Azure for sentiment analysis**
- **Approach 2: 'Simple custom model', using the Azure Machine Learning Studio (classic) service**
- **Approach 3: 'Advanced custom model', using the Azure Machine Learning service to develop a model based on deep neural networks that predicts the sentiment associated with a tweet**

## Plan - Approach 3: 'Advanced custom model':
- **DistilBERT model + Transfer Learning approach**
    - Loading the data
    - Splitting the data
    - Modeling: using the 'simpletransformers' library
        - Configuration
        - Modeling
        - Training
        - Evaluation

### Explanatory note:
- In this notebook we try an approach based on Transfer Learning with the DistilBERT model
- Transfer Learning here consists of adding a binary classifier on top of the pretrained DistilBERT model, then training the result on our labeled tweets
- We reuse the knowledge of the pretrained DistilBERT model, to which a classifier is added, to carry out our task of classifying the sentiment of tweets
- DistilBERT is a 'lighter' version of BERT: it has fewer parameters (about 40% fewer, according to its authors) and is therefore faster to train, while performing almost as well as BERT
- BERT is a language model developed by Google
- We used two libraries:
    - Transformers: the reference Python library for Transformer-based models
    - Simpletransformers: a library that simplifies the use of Transformers, in particular for adding the binary classification layer
- This notebook was run in Google Colab because it requires significant hardware resources (it uses Colab's GPU)

##### Installing the 'transformers' and 'simpletransformers' libraries

In [1]:
pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled P

In [2]:
pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.61.13-py3-none-any.whl (221 kB)

In [3]:
import time
import random as python_random

import pandas as pd

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from simpletransformers.classification import ClassificationModel, ClassificationArgs
import torch

##### Fix the random seed so that the modeling runs are reproducible:

In [4]:
python_random.seed(0)
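
Note that `python_random.seed(0)` only seeds Python's built-in generator. NumPy and PyTorch keep their own random state, so a fuller reproducibility setup would seed them as well. The helper below is a minimal sketch of that idea, not part of the original notebook: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they can be imported. simpletransformers also appears to expose a `manual_seed` option in its model arguments, which is worth checking in its documentation.

```python
import random


def seed_everything(seed=0):
    """Seed every random number generator available in this environment."""
    # Python's built-in generator (the only one seeded in the cell above).
    random.seed(seed)

    # NumPy, if installed (pandas and scikit-learn rely on it).
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch, if installed: CPU generator and all CUDA devices.
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same draw.
seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()
print(first_draw == second_draw)
```

Even with every generator seeded, some GPU operations can remain non-deterministic, so small run-to-run differences in the final score are still possible.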

In [5]:
from google.colab import drive
drive.mount('/content/drive')
!ls '/content/drive/MyDrive'

Mounted at /content/drive
'Colab Notebooks'   P7


# DistilBERT model + Transfer Learning approach

## Loading the data

In [6]:
df_train = pd.read_csv('/content/drive/MyDrive/P7/airparadis_train_dataset.csv')

In [7]:
df_train

Unnamed: 0,SENTIMENT,TWEET,TWEET_PREPROCESSED
0,0,@oconel *grins* Am wearing new shirt and manag...,*grins* Am wearing new shirt and managed to ge...
1,1,Did my second senior project... Now watching s...,Did my second senior project... Now watching s...
2,1,@feedmydisaster noo... well... facebook... fri...,noo... well... facebook... friends... postmate...
3,1,"blah , a light bulb in my room is broke and no...","blah , a light bulb in my room is broke and no..."
4,0,@lakikix yea I wish I was there too Nite nite...,yea I wish I was there too Nite nite kiki
...,...,...,...
1238395,0,"I'm just not able to be in USA for a month, wi...","I'm just not able to be in USA for a month, wi..."
1238396,0,@16StarGirl16 why are you sad bby?,why are you sad bby?
1238397,1,Nap time's the best time.,Nap time's the best time.
1238398,0,@sangitashres ok.. i need to by it then,ok.. i need to by it then


## Splitting the data

#### TRAIN: training set

In [8]:
TRAIN_BERT_SIZE = 100000

In [9]:
df_train_bert, df_tweets_bert = train_test_split(df_train, stratify=df_train["SENTIMENT"], train_size=TRAIN_BERT_SIZE, random_state=0)
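
The `stratify` argument asks `train_test_split` to keep the proportion of each `SENTIMENT` class the same in the sample as in the full dataset. The sketch below illustrates the idea in plain Python on toy data. It is not scikit-learn's implementation, and `stratified_sample_indices` and the toy `labels` list are hypothetical names introduced only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, sample_size, seed=0):
    """Pick row indices so that each class keeps roughly its original share."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    picked = []
    for label, indices in sorted(by_class.items()):
        # Draw a number of rows proportional to the class frequency.
        # (Rounding may shift the total by a row or two on awkward sizes.)
        quota = round(sample_size * len(indices) / len(labels))
        picked.extend(rng.sample(indices, quota))
    return picked


# Toy dataset: 60% of class 0 and 40% of class 1.
labels = [0] * 600 + [1] * 400
sample = stratified_sample_indices(labels, 100)
share_of_ones = sum(labels[i] for i in sample) / len(sample)
print(len(sample), share_of_ones)
```

The sample keeps the 60/40 balance of the toy data, which is the property the notebook relies on when it draws 100,000 training tweets from the full dataset.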

In [10]:
df_train_bert

Unnamed: 0,SENTIMENT,TWEET,TWEET_PREPROCESSED
581572,1,Oh I know why; It's cause i'm 1337 http://tum...,Oh I know why; Its cause im
822732,1,It's summer! here goes... This is either very...,It's summer! here goes... This is either very ...
372254,0,you know what really irks me? people like demi...,you know what really irks me? people like demi...
760431,0,Yea....i need my own car &lt;benoit&gt;,Yea....i need my own car &lt;benoit&gt;
438456,0,Bout to do soundcheck...was gona go live..but ...,Bout to do soundcheck...was gona go live..but ...
...,...,...,...
448899,1,Reese Cups! mhm,Reese Cups! mhm
646239,1,Choking on dust from cleaning my room...Making...,Choking on dust from cleaning my room...Making...
1074761,1,@prempanicker You got any space for mine?,You got any space for mine?
20112,1,@acemaker Well a golf day sounds right up the ...,Well a golf day sounds right up the alley of a...


In [11]:
df_train_bert = df_train_bert.drop(columns=['TWEET'])

In [12]:
df_train_bert = df_train_bert.rename(columns={'SENTIMENT': 'labels', 'TWEET_PREPROCESSED': 'text'})

In [13]:
df_train_bert = df_train_bert.reset_index(drop=True)

In [14]:
df_train_bert

Unnamed: 0,labels,text
0,1,Oh I know why; Its cause im
1,1,It's summer! here goes... This is either very ...
2,0,you know what really irks me? people like demi...
3,0,Yea....i need my own car &lt;benoit&gt;
4,0,Bout to do soundcheck...was gona go live..but ...
...,...,...
99995,1,Reese Cups! mhm
99996,1,Choking on dust from cleaning my room...Making...
99997,1,You got any space for mine?
99998,1,Well a golf day sounds right up the alley of a...


#### TEST: test set

In [15]:
TEST_BERT_SIZE = 20000

In [16]:
df_test_bert, df_tweets_bert = train_test_split(df_tweets_bert, stratify=df_tweets_bert["SENTIMENT"], train_size=TEST_BERT_SIZE, random_state=0)

In [17]:
df_test_bert

Unnamed: 0,SENTIMENT,TWEET,TWEET_PREPROCESSED
629019,0,"@feliciaheartsDW totally jealous, I want to be...","totally jealous, I want to be at that party"
407379,0,@tommcfly sad sad sad face i missed it,sad sad sad face i missed it
906066,0,@theorganichome ur link doesnt work,ur link doesnt work
1136069,0,My blackberry was all messed up last night! Ma...,My blackberry was all messed up last night! Ma...
1091662,0,Mitchell Davis deleted his owl City video,Mitchell Davis deleted his owl City video
...,...,...,...
1085815,1,Holy crap I just woke up gotta go in town to g...,Holy crap I just woke up gotta go in town to g...
1036151,1,They r finally done entering but theres till ...,They r finally done entering but theres till a...
886443,0,@givinallmyluv2u he's on now ...,he's on now ...
222105,1,@girlambrosia I think I just fell in love with...,I think I just fell in love with you a whole l...


In [18]:
df_test_bert = df_test_bert.drop(columns=['TWEET'])

In [19]:
df_test_bert = df_test_bert.rename(columns={'SENTIMENT': 'labels', 'TWEET_PREPROCESSED': 'text'})

In [20]:
df_test_bert = df_test_bert.reset_index(drop=True)

In [21]:
df_test_bert = df_test_bert.dropna()

In [22]:
df_test_bert

Unnamed: 0,labels,text
0,0,"totally jealous, I want to be at that party"
1,0,sad sad sad face i missed it
2,0,ur link doesnt work
3,0,My blackberry was all messed up last night! Ma...
4,0,Mitchell Davis deleted his owl City video
...,...,...
19995,1,Holy crap I just woke up gotta go in town to g...
19996,1,They r finally done entering but theres till a...
19997,0,he's on now ...
19998,1,I think I just fell in love with you a whole l...


## Modeling

### Configuration

In [31]:
model_args = ClassificationArgs(num_train_epochs=5, overwrite_output_dir=True)

### Modeling

In [32]:
model_bert = ClassificationModel("distilbert", "distilbert-base-uncased", args=model_args, use_cuda=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

### Training

In [33]:
model_bert.train_model(df_train_bert)

  0%|          | 0/100000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/12500 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/12500 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/12500 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/12500 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/12500 [00:00<?, ?it/s]

(62500, 0.30727014332556724)
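
The numbers in this output can be cross-checked against the progress bars. As far as the simpletransformers documentation indicates, `train_model` returns the number of optimization steps and the average training loss, so 62500 would be the step count and about 0.307 the loss. The short calculation below is a sketch of that reading. The batch size of 8 is inferred from the progress bars rather than set explicitly in the notebook, and the variable names are introduced here for illustration.

```python
import math

num_train_examples = 100_000   # TRAIN_BERT_SIZE
num_epochs = 5                 # num_train_epochs in ClassificationArgs
steps_per_epoch = 12_500       # read from the "Running Epoch" progress bars

# Batch size implied by the progress bars (not set explicitly in the notebook).
implied_batch_size = num_train_examples // steps_per_epoch

# Total number of optimization steps over all epochs.
total_steps = steps_per_epoch * num_epochs

# Same reasoning for prediction: 19962 test tweets, assuming the same batch size.
num_test_examples = 19_962
eval_batches = math.ceil(num_test_examples / implied_batch_size)

print(implied_batch_size, total_steps, eval_batches)  # 8 62500 2496
```

The result matches the first value of the tuple returned by training (62500) and the 2496 batches shown by the prediction progress bar in the evaluation step.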

### Evaluation

In [34]:
predictions, raw_outputs = model_bert.predict(df_test_bert['text'].tolist())

  0%|          | 0/19962 [00:00<?, ?it/s]

  0%|          | 0/2496 [00:00<?, ?it/s]

In [35]:
predictions

array([0, 0, 0, ..., 1, 0, 0])

In [36]:
print(f"Accuracy score = {accuracy_score(df_test_bert['labels'], predictions):.3f}")

Accuracy score = 0.799


In [37]:
print(confusion_matrix(df_test_bert['labels'], predictions))

[[7942 2037]
 [1982 8001]]


In [38]:
print(classification_report(df_test_bert['labels'], predictions))

              precision    recall  f1-score   support

           0       0.80      0.80      0.80      9979
           1       0.80      0.80      0.80      9983

    accuracy                           0.80     19962
   macro avg       0.80      0.80      0.80     19962
weighted avg       0.80      0.80      0.80     19962
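
The accuracy and the per-class scores above can all be recomputed by hand from the confusion matrix, which helps when reading the report. The sketch below does this in plain Python. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are introduced here for illustration.

```python
# Confusion matrix printed above: rows = true labels, columns = predictions.
confusion = [[7942, 2037],
             [1982, 8001]]

true_0_pred_0, true_0_pred_1 = confusion[0]
true_1_pred_0, true_1_pred_1 = confusion[1]
total = true_0_pred_0 + true_0_pred_1 + true_1_pred_0 + true_1_pred_1

# Accuracy: share of tweets on the diagonal (correct predictions).
accuracy = (true_0_pred_0 + true_1_pred_1) / total

# Precision: among tweets predicted as a class, share that truly belong to it.
precision_0 = true_0_pred_0 / (true_0_pred_0 + true_1_pred_0)
precision_1 = true_1_pred_1 / (true_1_pred_1 + true_0_pred_1)

# Recall: among tweets truly in a class, share that were predicted as such.
recall_0 = true_0_pred_0 / (true_0_pred_0 + true_0_pred_1)
recall_1 = true_1_pred_1 / (true_1_pred_1 + true_1_pred_0)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.799
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}")
```

The errors are almost symmetric (2037 versus 1982), which is why precision and recall both round to 0.80 for each class.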



## Conclusion
- We obtain an accuracy of **0.799** on the test set (19,962 tweets)
- This is slightly lower than the score of the best advanced model trained on the full training set (0.811), whereas DistilBERT was trained here on a sample of only 100,000 tweets