# Effect inference - Multilabel strategy - score table
In this notebook, we build a model that infers the EFFECT from the incident classification and the free-text fields.

We treat this as a multiclass, multilabel classification problem: there are several possible effects, and a single incident can lead to more than one of them.

Our evaluation metric is therefore f1_samples, the F1 score computed per incident and then averaged over all incidents.
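
To make the metric concrete, here is a minimal NumPy sketch of the samples-averaged F1 on made-up indicator matrices. The toy arrays and the helper name `f1_samples_manual` are illustrative and not taken from the project; the function is meant to mirror what `f1_score(..., average='samples')` computes, namely one F1 per incident followed by a mean.

```python
import numpy as np

# Toy indicator matrices (rows = incidents, columns = effects); illustrative values only.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 1]])

def f1_samples_manual(y_true, y_pred):
    """Per-row F1 = 2*TP / (|true labels| + |predicted labels|), averaged over rows."""
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = int(np.sum((t == 1) & (p == 1)))
        denom = int(t.sum() + p.sum())
        # Convention assumed here: a row with no true and no predicted label scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

score = f1_samples_manual(y_true, y_pred)
print(round(score, 4))
```

Because every incident contributes equally, an incident with a single rare effect weighs as much as one with several frequent effects, which is the behaviour we want when the output is read incident by incident.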

In the previous notebook, we had not taken the multilabel aspect into account and our score was f1_weighted = 0.28.

In this notebook, we test several models:
- SVM
- XGBoost
- LSTM
- NBSVM

And several encodings:
- TF-IDF
- CountVectorizer

The scores are summarized in the following table: https://starclay-my.sharepoint.com/:x:/g/personal/rquillivic_starclay_fr/EZPS3DrBBQ9MrZskrcwKVAEBGsLY61W089kd8RFvIEirjg?e=ve9g9K


In [20]:
import warnings
warnings.filterwarnings('ignore')

import joblib
import pandas as pd
import numpy as np


from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfTransformer,CountVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score,f1_score,classification_report,recall_score,precision_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


import spacy
nlp =spacy.load('fr')
from spacy.lang.fr.stop_words import STOP_WORDS

## 0.1 Loading the data

In [2]:
%time

mlb = MultiLabelBinarizer()

train = pd.read_pickle('./data_split/train.pkl')
# Optional: to build a model without the incidents labelled 106, uncomment the next line
#train = train[~train['TEF_ID'].map(lambda x : 106 in x)]
X_train = train[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_train = mlb.fit_transform(train['TEF_ID'])
test =  pd.read_pickle('./data_split/test.pkl')
#test = test[~test['TEF_ID'].map(lambda x : k in x)]
X_test = test[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_test = mlb.transform(test['TEF_ID'])


X_train_dgs = np.load('results/dgs_camenbert_train_vec.npy')
X_test_dgs =np.load('results/dgs_camenbert_test_vec.npy')





df_effets = pd.read_csv("data/ref_MRV/referentiel_dispositif_effets_connus.csv",delimiter=';',encoding='ISO-8859-1')
df_dys = pd.read_csv("data/ref_MRV/referentiel_dispositif_dysfonctionnement.csv",delimiter=';',encoding='ISO-8859-1')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.63 µs
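
The role of `MultiLabelBinarizer` in the cell above can be illustrated with a small sketch. The label lists below are invented for the example; only the pattern (fit on the training labels, then reuse the same binarizer on the test labels) reflects the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical TEF_ID lists: one list of effect identifiers per incident.
train_labels = [[106, 12], [12], [7, 106]]
test_labels = [[7], [12, 106]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_labels)  # learns the label set, one column per label
Y_test = mlb.transform(test_labels)        # reuses the same column order

print(mlb.classes_)  # columns are the sorted distinct labels
print(Y_train)
print(Y_test)
```

A label that appears only in the test set is ignored by `transform` (with a warning), so the train/test split should ideally be checked for such labels before scoring.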


## 1.1 Building the pipeline with a ONE-VS-REST strategy

> "This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice."

In [4]:
%%time
preprocess = ColumnTransformer(
    [('description_tfidf',TfidfVectorizer(sublinear_tf=True, min_df=3,
                            ngram_range=(1, 1),
                            stop_words=STOP_WORDS,
                            max_features = 10000,norm = 'l2'), 'DESCRIPTION_INCIDENT'),
     
     ('etat_pat_tfidf', TfidfVectorizer(sublinear_tf=True, min_df=3,ngram_range=(1, 1),
                                       stop_words=STOP_WORDS,
                                       max_features = 10000,norm = 'l2'), 'ETAT_PATIENT'),
     
     ('fabricant_tfidf',TfidfVectorizer(sublinear_tf=True, min_df=3,
                            ngram_range=(1, 1),
                            stop_words=STOP_WORDS,
                            max_features = 5000,norm = 'l2'), 'FABRICANT')
     ],
    
    remainder='passthrough')

preprocess_2 = ColumnTransformer(
    [('description_tfidf',CountVectorizer( min_df=3,
                            ngram_range=(1, 1),
                            stop_words=STOP_WORDS,
                            max_features = 10000), 'DESCRIPTION_INCIDENT'),
     
     ('etat_pat_tfidf', CountVectorizer( min_df=3,ngram_range=(1, 1),
                                       stop_words=STOP_WORDS,
                                       max_features = 10000), 'ETAT_PATIENT'),
     
     ('fabricant_tfidf',CountVectorizer(min_df=3,
                            ngram_range=(1, 1),
                            stop_words=STOP_WORDS,
                            max_features = 5000), 'FABRICANT')
     ],
    
    remainder='passthrough')


pipeline = Pipeline([
    ('vect', preprocess),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
])

pipeline_2 = Pipeline([
    ('vect', preprocess_2),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
])

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 507 µs
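
As a quick illustration of what the one-vs-rest wrapper in these pipelines does, the sketch below fits it on synthetic multilabel data. The data from `make_multilabel_classification` is only a stand-in for the vectorized incidents, and the sizes are arbitrary assumptions.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic multilabel data standing in for the vectorized incidents (assumption).
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=5, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(class_weight='balanced', random_state=0))
ovr.fit(X, Y)

# One fitted binary classifier per label column.
print(len(ovr.estimators_))

# Predictions come back as an indicator matrix with the same layout as Y.
pred = ovr.predict(X)
print(pred.shape)
```

Each label is learned independently of the others, so any correlation between effects is ignored by this strategy.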


In [83]:
%%time

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)
f1 = f1_score(y_test , y_pred,average='samples')
print('f1_score samples : ',f1)

f1_score samples :  0.6378449926352557
CPU times: user 1min 5s, sys: 700 ms, total: 1min 6s
Wall time: 1min 7s


In [85]:
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.67      0.29      0.40         7
           1       1.00      0.20      0.33         5
           2       0.00      0.00      0.00         1
           3       0.00      0.00      0.00         3
           4       0.89      0.57      0.70        14
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00        10
           7       0.36      0.39      0.37        44
           8       0.32      0.23      0.27        48
           9       0.00      0.00      0.00         2
          10       1.00      0.50      0.67         2
          11       0.40      0.22      0.29         9
          12       0.00      0.00      0.00        20
          13       0.00      0.00      0.00         5
          14       1.00      0.20      0.33        10
          15       0.00      0.00      0.00         9
          16       0.00      0.00      0.00         7
          17       0.00    

In [77]:
%%time

pipeline_2.fit(X_train,y_train)

y_pred = pipeline_2.predict(X_test)
f1 = f1_score(y_test , y_pred,average='samples')
print('f1_score samples : ',f1)
print(classification_report(y_test , y_pred))

f1_score samples :  0.5980593937810289
              precision    recall  f1-score   support

           0       0.60      0.43      0.50         7
           1       0.50      0.20      0.29         5
           2       0.00      0.00      0.00         1
           3       0.00      0.00      0.00         3
           4       0.86      0.43      0.57        14
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00        10
           7       0.35      0.27      0.31        44
           8       0.28      0.25      0.26        48
           9       0.00      0.00      0.00         2
          10       1.00      0.50      0.67         2
          11       0.67      0.22      0.33         9
          12       0.00      0.00      0.00        20
          13       0.00      0.00      0.00         5
          14       0.33      0.20      0.25        10
          15       0.00      0.00      0.00         9
          16       0.00      0.00      0.0

## Which columns matter most?
We build one SVM per column and compare the resulting scores:

In [7]:
pipeline_col =Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=2,
                            ngram_range=(1, 1),
                            stop_words=STOP_WORDS,
                            max_features = 10000,norm = 'l2')),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
])
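PRED = []
# ACTION_PATIENT is not part of X_train/X_test as built above, so each column is read
# from the full train/test frames (assumes ACTION_PATIENT exists there, as the output suggests).
for col in ['FABRICANT','DESCRIPTION_INCIDENT','ETAT_PATIENT','ACTION_PATIENT'] :
    x_train,x_test = train[col],test[col]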
    pipeline_col.fit(x_train,y_train)
    pred= pipeline_col.predict(x_test)
    PRED.append(pred)
    f1 = f1_score(y_test , pred,average='samples')
    print('##############################')
    print(col)
    print('f1_score samples : ',f1)

##############################
FABRICANT
f1_score samples :  0.18312606719785002
##############################
DESCRIPTION_INCIDENT
f1_score samples :  0.5921288892501373
##############################
ETAT_PATIENT
f1_score samples :  0.36551621481944147
##############################
ACTION_PATIENT
f1_score samples :  0.26827221425301057


In [11]:
y_e = np.mean(PRED,axis=0)
thresholds = [0.4,0.5,0.6,0.65,0.7,0.72,0.75,0.8]
for val in thresholds:
    print("For threshold: ", val)
    pred=y_e.copy()
  
    pred[pred>=val]=1
    pred[pred<val]=0
  
    precision = precision_score(y_test, pred, average='samples')
    recall = recall_score(y_test, pred, average='samples')
    f1 = f1_score(y_test, pred, average='samples')
   
    print("Samples-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

For threshold:  0.4
Samples-average quality numbers
Precision: 0.3606, Recall: 0.8025, F1-measure: 0.4411
For threshold:  0.5
Samples-average quality numbers
Precision: 0.3606, Recall: 0.8025, F1-measure: 0.4411
For threshold:  0.6
Samples-average quality numbers
Precision: 0.5238, Recall: 0.6161, F1-measure: 0.5393
For threshold:  0.65
Samples-average quality numbers
Precision: 0.5238, Recall: 0.6161, F1-measure: 0.5393
For threshold:  0.7
Samples-average quality numbers
Precision: 0.5238, Recall: 0.6161, F1-measure: 0.5393
For threshold:  0.72
Samples-average quality numbers
Precision: 0.5238, Recall: 0.6161, F1-measure: 0.5393
For threshold:  0.75
Samples-average quality numbers
Precision: 0.5238, Recall: 0.6161, F1-measure: 0.5393
For threshold:  0.8
Samples-average quality numbers
Precision: 0.3151, Recall: 0.3113, F1-measure: 0.3059


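## Comment:

The DESCRIPTION_INCIDENT column appears to be by far the most important one for predicting the effect: on its own it reaches an f1_samples of 0.59, versus 0.37 for ETAT_PATIENT, 0.27 for ACTION_PATIENT and 0.18 for FABRICANT. Averaging the four per-column predictions does not help, since the best threshold only reaches 0.54.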
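
The plateaus in the threshold sweep above (identical scores at 0.4 and 0.5, and from 0.6 to 0.75) are consistent with a simple fact: the mean of four binary predictions can only take the values 0, 0.25, 0.5, 0.75 and 1. The sketch below illustrates this on random stand-in predictions (not the project's data), so each threshold effectively acts as an "at least k of the 4 models vote yes" rule.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for PRED: 4 per-column models, 6 incidents, 5 labels, binary outputs.
preds = rng.integers(0, 2, size=(4, 6, 5))
mean_pred = preds.mean(axis=0)

print(np.unique(mean_pred))  # only multiples of 0.25 can appear

# Thresholds that fall between the same two multiples of 0.25 select the same cells.
same_low = np.array_equal(mean_pred >= 0.4, mean_pred >= 0.5)    # both: at least 2 votes
same_high = np.array_equal(mean_pred >= 0.6, mean_pred >= 0.75)  # both: at least 3 votes
print(same_low, same_high)
```

In practice this means only four distinct voting rules are worth evaluating for an ensemble of four hard classifiers.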

 
## 2.0 The Multioutput approach

> Multioutput classification support can be added to any classifier with MultiOutputClassifier. This strategy consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class is to extend estimators to be able to estimate a series of target functions (f1,f2,f3…,fn) that are trained on a single X predictor matrix to predict a series of responses (y1,y2,y3…,yn).

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html




In [86]:
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([
    ('vect', preprocess),
    ('clf', MultiOutputClassifier(LinearSVC(class_weight='balanced'))),
])
#### prediction
pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)
f1 = f1_score(y_test , y_pred,average='samples')
print('f1_score samples : ',f1)

f1_score samples :  0.6378449926352557


### Comment
As expected, the score is unchanged (0.6378): with a binary indicator matrix as target, both approaches fit one independent binary classifier per label.
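
The equivalence can be checked on toy data. This is a minimal sketch on synthetic labels (an assumption, not the project's data); the base estimator is given a fixed `random_state` so that both wrappers train the same per-label models.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)

base = LinearSVC(class_weight='balanced', random_state=0)
pred_ovr = OneVsRestClassifier(base).fit(X, Y).predict(X)
pred_mo = MultiOutputClassifier(base).fit(X, Y).predict(X)

# Fraction of (incident, label) cells on which the two wrappers agree.
agreement = float(np.mean(pred_ovr == pred_mo))
print(agreement)
```

The two wrappers differ mainly in what they accept as targets: `MultiOutputClassifier` also handles non-binary target columns, which is not needed here.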

## 2.1 One-vs-One approach

>This strategy consists in fitting one classifier per class pair. At prediction time, the class which received the most votes is selected. Since it requires to fit n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for algorithms such as kernel algorithms which don’t scale well with n_samples. This is because each individual learning problem only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes times.



In [88]:
%%time
from sklearn.multiclass import OneVsOneClassifier
pipeline = Pipeline([
    ('vect', preprocess),
    ('clf', MultiOutputClassifier(OneVsOneClassifier(LinearSVC(class_weight='balanced')))),
])
#### prediction
pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)
f1 = f1_score(y_test , y_pred,average='samples')
print('f1_score samples : ',f1)

f1_score samples :  0.6378449926352557
CPU times: user 1min 10s, sys: 440 ms, total: 1min 10s
Wall time: 1min 10s


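### Comment
We observe no change in performance, only a slight increase in computation time (1min 10s versus 1min 7s). This is expected: inside MultiOutputClassifier each target is binary, so One-vs-One has a single pair of classes to separate and reduces to the same classifier.
## 2.2 The ClassifierChain approach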
>A multi-label model that arranges binary classifiers into a chain.
Each model makes a prediction in the order specified by the chain using all of the available features provided to the model plus the predictions of models that are earlier in the chain.
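
The chaining mechanism described above can be observed directly on a toy problem. In this sketch the data is synthetic and `LogisticRegression` replaces the notebook's `LinearSVC` purely to keep the example fast; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)

# LogisticRegression keeps the toy example fast (assumption);
# the notebook itself chains LinearSVC models.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=0)
chain.fit(X, Y)

print(chain.order_)  # order in which the labels are predicted
pred = chain.predict(X)
print(pred.shape)

# The k-th model in the chain sees the original features plus the k earlier labels.
n_inputs = [est.coef_.shape[1] for est in chain.estimators_]
print(n_inputs)
```

Because the result depends on the label order, averaging several chains with different random orders, as done in the next cells, is a common way to reduce that dependence.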



In [85]:
from sklearn.multioutput import ClassifierChain

In [93]:
%%time
X_train_, X_test_ =preprocess.fit_transform(X_train),preprocess.transform(X_test)
clf = LinearSVC(class_weight='balanced')


chains = [ClassifierChain(clf, order='random', random_state=i) for i in range(10)]

for chain in chains:
    chain.fit(X_train_, y_train)
    
y_pred_chains = np.array([chain.predict(X_test_) for chain in chains])

chain_f1_scores = [f1_score(y_test, y_pred_chain, average='samples') for y_pred_chain in y_pred_chains]

y_pred_ensemble = y_pred_chains.mean(axis=0)

y_e = y_pred_ensemble>=0.4

ensemble_f1_score = f1_score(y_test,y_e, average='samples')

print(ensemble_f1_score)

0.6826849841667144
CPU times: user 11min 43s, sys: 3.68 s, total: 11min 47s
Wall time: 11min 48s


In [94]:
print(classification_report(y_test,y_e))

              precision    recall  f1-score   support

           0       0.67      0.29      0.40         7
           1       1.00      0.20      0.33         5
           2       0.00      0.00      0.00         1
           3       0.00      0.00      0.00         3
           4       0.73      0.57      0.64        14
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00        10
           7       0.39      0.48      0.43        44
           8       0.33      0.29      0.31        48
           9       0.00      0.00      0.00         2
          10       1.00      0.50      0.67         2
          11       0.33      0.22      0.27         9
          12       0.00      0.00      0.00        20
          13       0.00      0.00      0.00         5
          14       0.50      0.20      0.29        10
          15       0.00      0.00      0.00         9
          16       0.00      0.00      0.00         7
          17       0.00    

## With the CountVectorizer?

In [86]:
%%time
X_train_, X_test_ =preprocess_2.fit_transform(X_train),preprocess_2.transform(X_test)
clf = LinearSVC(class_weight='balanced')


chains = [ClassifierChain(clf, order='random', random_state=i) for i in range(10)]

for chain in chains:
    chain.fit(X_train_, y_train)
    
y_pred_chains = np.array([chain.predict(X_test_) for chain in chains])

chain_f1_scores = [f1_score(y_test, y_pred_chain, average='samples') for y_pred_chain in y_pred_chains]

y_pred_ensemble = y_pred_chains.mean(axis=0)

y_e = y_pred_ensemble>=0.4

ensemble_f1_score = f1_score(y_test,y_e, average='samples')

print(ensemble_f1_score)

0.6547455834998173
CPU times: user 6min 10s, sys: 0 ns, total: 6min 10s
Wall time: 6min 10s


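### Comment
We observe a significant change in performance (0.68 with TF-IDF and 0.65 with CountVectorizer, versus 0.64 and 0.60 with One-vs-Rest): the ClassifierChain approach takes the relationships between labels into account. The price is computation time, roughly 12 minutes for 10 TF-IDF chains versus about 1 minute for One-vs-Rest.
## 3. Other machine learning models
### 3.1 XGBoost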

In [14]:
%%time
import xgboost as xgb

X_train_, X_test_ =preprocess.fit_transform(X_train),preprocess.transform(X_test)



CPU times: user 3.06 s, sys: 20 ms, total: 3.08 s
Wall time: 3.08 s


In [65]:
%%time
from xgboost import XGBClassifier
#binary:hinge
#Objective candidate: multi:softmax
#Objective candidate: multi:softprob


print("Preprocessing...")
X_train_, X_test_ =preprocess.fit_transform(X_train),preprocess.transform(X_test)
print("Done !")



clf = OneVsRestClassifier(XGBClassifier(n_jobs=-1,eta= 0.1, max_depth=10,
                                        n_estimators=10 ,objective ='binary:hinge'))

print("Fitting the model...")
clf.fit(X_train_,y_train)
print("Done !")
print("Prediction..")
pred = clf.predict(X_test_)
print("Done !")
f1 = f1_score(y_test,pred, average='samples')
print("f1_score samples :",f1 )
                                                                         
                                                                         


Preprocessing...
Done !
Fitting the model...
Done !
Prediction..
Done !
f1_score samples : 0.6559961687794019
CPU times: user 15min 43s, sys: 3.86 s, total: 15min 47s
Wall time: 15min 47s


## 3.1.2 LGBM (to do)

### 3.2 LSTM

In [66]:
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding,SpatialDropout1D, Bidirectional, Flatten, Conv1D, Conv2D, MaxPooling1D, Dropout, Activation,GlobalMaxPool1D

Using TensorFlow backend.


In [67]:
X_train_, X_test_ =preprocess.fit_transform(X_train),preprocess.transform(X_test)
X_train_= np.array(X_train_.todense())
##
X_test_= np.array(X_test_.todense())

In [68]:
%%time
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=1000)
X_train_ = svd.fit_transform(X_train_)
X_test_ = svd.transform(X_test_)

CPU times: user 48min 9s, sys: 21min 35s, total: 1h 9min 44s
Wall time: 5min 31s


In [69]:
X_train_ = np.reshape(X_train_, (X_train_.shape[0], 1, X_train_.shape[1]))
X_test_ = np.reshape(X_test_, (X_test_.shape[0], 1, X_test_.shape[1]))

In [70]:
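model = Sequential()
# Input has shape (n_samples, 1, n_components): the LSTM sees a sequence of length 1.
model.add(LSTM(200))
# Note: softmax makes the outputs sum to 1 across labels, which likely explains why low
# thresholds work best below; sigmoid is the usual output activation for multilabel targets.
model.add(Dense(y_train.shape[1], activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])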

epochs = 5
batch_size = 32

history = model.fit(X_train_, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2)

score,cat_acc = model.evaluate(X_test_,y_test)

y_pred = model.predict(X_test_)

print('loss : ', score)
print('categorical accuracy: ',cat_acc)

print('####################################')

thresholds = [0.01,0.04,0.06,0.08,0.1,0.12,0.14,0.16,0.2,0.25,0.3,0.35,0.4,0.5,0.6,0.7]
for val in thresholds:
    print("For threshold: ", val)
    pred=y_pred.copy()
  
    pred[pred>=val]=1
    pred[pred<val]=0
  
    precision = precision_score(y_test, pred, average='samples')
    recall = recall_score(y_test, pred, average='samples')
    f1 = f1_score(y_test, pred, average='samples')
   
    print("Samples-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))
    

Train on 21059 samples, validate on 5265 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
loss :  0.012502517492869887
categorical accuracy:  0.6524316072463989
####################################
For threshold:  0.01
Samples-average quality numbers
Precision: 0.2997, Recall: 0.9150, F1-measure: 0.4039
For threshold:  0.04
Samples-average quality numbers
Precision: 0.5410, Recall: 0.8305, F1-measure: 0.6148
For threshold:  0.06
Samples-average quality numbers
Precision: 0.5972, Recall: 0.7972, F1-measure: 0.6486
For threshold:  0.08
Samples-average quality numbers
Precision: 0.6322, Recall: 0.7724, F1-measure: 0.6637
For threshold:  0.1
Samples-average quality numbers
Precision: 0.6554, Recall: 0.7514, F1-measure: 0.6703
For threshold:  0.12
Samples-average quality numbers
Precision: 0.6666, Recall: 0.7312, F1-measure: 0.6687
For threshold:  0.14
Samples-average quality numbers
Precision: 0.6729, Recall: 0.7152, F1-measure: 0.6674
For threshold:  0.16
Samples-average quality 

In [28]:
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 200)               960800    
_________________________________________________________________
dense_8 (Dense)              (None, 273)               54873     
Total params: 1,015,673
Trainable params: 1,015,673
Non-trainable params: 0
_________________________________________________________________


## Comment:
Setting the threshold to 0.1 gives the best result:

**Samples-average quality numbers**
- Precision: 0.6554
- Recall: 0.7514
- F1-measure: 0.6703
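
One plausible reason the best threshold is so low is the softmax output layer: its outputs sum to 1, so several effects that are all likely must share the probability mass. The sketch below contrasts softmax and sigmoid on hypothetical logits (invented values, not model outputs) to show the effect.

```python
import numpy as np

# Hypothetical logits for one incident: three plausible effects, two unlikely ones.
logits = np.array([4.0, 4.0, 4.0, -2.0, -2.0])

softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()                 # outputs compete: they always sum to 1
sigmoid = 1.0 / (1.0 + np.exp(-logits))  # outputs are independent per label

print(np.round(softmax, 3))
print(np.round(sigmoid, 3))
```

With softmax, none of the three plausible effects reaches 0.5 even though the network is confident about each of them, whereas with sigmoid a 0.5 threshold would recover all three. Switching the last layer to sigmoid is therefore worth testing when the LSTM is tuned.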

## 3.3 nbSVM

In [212]:
from nbsvm import NBSVMClassifier

In [None]:
from sklearn.preprocessing import MinMaxScaler
X_train_, X_test_ =preprocess.fit_transform(X_train),preprocess.transform(X_test)
X_train_= np.array(X_train_.todense())
##
X_test_= np.array(X_test_.todense())

svd = TruncatedSVD(n_components=300)
X_train_ = svd.fit_transform(X_train_)
X_test_ = svd.transform(X_test_)


scaler = MinMaxScaler()
X_train_ = scaler.fit_transform(X_train_)
# Reuse the scaling learned on the training set; refitting on the test set would scale it differently
X_test_ = scaler.transform(X_test_)

clf = OneVsRestClassifier(NBSVMClassifier(class_weight='balanced'))

In [None]:
%%time
#### prediction
clf.fit(X_train_,y_train)

# Predict with the NBSVM classifier fitted above (not the earlier `pipeline`)
y_pred = clf.predict(X_test_)
f1 = f1_score(y_test , y_pred,average='samples')
print('f1_score samples : ',f1)

## 3.4 ktrain

In [208]:
import ktrain
from ktrain import text
encoder_TEF_ID = joblib.load('data_split/TEF_ID_encodeur.sav')

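features = ['ETAT_PATIENT','DESCRIPTION_INCIDENT']
# Note: elt[0] keeps only the first listed feature (ETAT_PATIENT); DESCRIPTION_INCIDENT is not passed to ktrain here.
train_list = [elt[0] for elt in X_train[features].values.tolist()]
test_list =  [elt[0] for elt in X_test[features].values.tolist()]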




trn, val, preproc = text.texts_from_array(x_train=train_list, y_train=y_train,
                                          x_test=test_list, y_test=y_test,
                                          class_names=encoder_TEF_ID.classes_.tolist(),
                                          preprocess_mode='standard',maxlen=350)

#t = text.Transformer(MODEL_NAME, maxlen=256)
#trn = t.preprocess_train(train_list, y_train)
#val = t.preprocess_test(test_list, y_test)
#model = t.get_classifier('nbsvm', multilabel=True, class_names = encoder_TEF_ID.transform(encoder_TEF_ID.classes_))




task: text classification
language: fr
Word Counts: 21168
Nrows: 26324
26324 train sequences
train sequence lengths:
	mean : 12
	95percentile : 39
	99percentile : 83
x_train shape: (26324,350)
y_train shape: (26324, 273)
Is Multi-Label? True
6580 test sequences
test sequence lengths:
	mean : 11
	95percentile : 37
	99percentile : 70
x_test shape: (6580,350)
y_test shape: (6580, 273)


In [209]:
model = text.text_classifier('nbsvm', train_data=trn, preproc=preproc,multilabel =True)

Is Multi-Label? True
compiling word ID features...
maxlen is 350
done.


In [210]:
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback

class f1_Evaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        # super(Callback, self) would skip Callback.__init__; call the parent initializer directly
        super().__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            val=0.5
            y_pred[y_pred>=val]=1
            y_pred[y_pred<val]=0
            score = f1_score(self.y_val, y_pred,average='samples')
            print("\n f1 samples - epoch: %d - score: %.6f \n" % (epoch+1, score))
            
f1 = f1_Evaluation(validation_data=val, interval=1)

In [211]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=10)
learner.lr_plot()




begin training using onecycle policy with max lr of 3e-05...
Train on 26324 samples, validate on 6580 samples
Epoch 1/4
 f1 samples - epoch: 1 - score: 0.014942 

Epoch 2/4
 f1 samples - epoch: 2 - score: 0.348176 

Epoch 3/4
 f1 samples - epoch: 3 - score: 0.349797 

Epoch 4/4
 f1 samples - epoch: 4 - score: 0.345137 



<tensorflow.python.keras.callbacks.History at 0x7f70181a5210>

## Conclusion:

The tests we ran lead to the following conclusions:
- SVM + TF-IDF remains a solid baseline that is easy to implement and deploy.
- XGBoost and the LSTM slightly improve performance (0.65 and 0.67 respectively), and both have many hyperparameters left to tune. We will do so with the Optuna library in the next stage of our work.
- The best result is obtained with ClassifierChain, which indicates that there are relationships between our labels. By applying a grouping mapping to the labels, the other models should be able to compete with it. We will also test this hypothesis later.

Our exploration of models has significantly improved our performance. The first model we built (currently in the application) had an f1_samples score of 0.59; our best model now reaches an f1_samples of 0.68. This is encouraging, since a lot of fine-tuning remains to be done.

We also found that encoding each column separately helps capture the information and therefore improves performance. Conversely, applying an SVD often comes with a drop in performance.

Finally, we noticed that pretrained embeddings performed poorly on our problem compared with TF-IDF.

Outside this notebook, we tried all the models from the ktrain library that can run on CPU; unfortunately, they did not improve our performance significantly: https://github.com/amaiya/ktrain/tree/master/examples#textclass

The next steps are:
- Fine-tune XGBoost (https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst)
- Fine-tune the LSTM
- Build LSTMs with a separate encoding for each column:
    - https://keras.io/examples/nlp/text_classification_from_scratch/
- Test new architectures (GRU, BiGRU, BiLSTM, etc.)
- Try adding an attention layer to our deep learning models, since in the literature it is often associated with better performance
- Work on the loss functions is also needed because our corpus is highly imbalanced: https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/
- Finally, we will try the methods developed in these two papers on text classification (attention-based extreme classification):
    - https://github.com/iliaschalkidis/lmtc-eurlex57k
    - https://github.com/yourh/AttentionXML
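
For the loss-function item in the list above, a binary focal loss is one candidate for imbalanced multilabel targets. The sketch below is a plain NumPy version for intuition only: the function names, the toy probabilities and the defaults `gamma=2.0` and `alpha=0.25` (the values commonly quoted for focal loss) are assumptions, and an actual training run would need the equivalent written for the deep learning framework in use.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all (incident, label) cells.

    gamma down-weights cells that are already well classified;
    alpha balances the contribution of positives and negatives.
    """
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)              # probability given to the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)  # class-dependent weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Plain mean binary cross-entropy, for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# One incident, four labels, a single true effect (toy values).
y = np.array([[1, 0, 0, 0]])
easy = np.array([[0.9, 0.1, 0.1, 0.1]])  # confident and correct
hard = np.array([[0.3, 0.1, 0.1, 0.6]])  # misses the true label, one false alarm

print(binary_focal_loss(y, easy), binary_focal_loss(y, hard))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy, which gives a convenient sanity check when porting it to a training framework.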