# Effect Inference - Multilabel Strategy - Deep Learning Approach
In this notebook, we build a model that infers the EFFET (effect) from the incident classification and the textual data, using recurrent networks such as GRU and LSTM.

These approaches have shown very encouraging results, and we want to explore this direction to see whether recurrent networks could become our default model.

We treat this as a multiclass, multilabel classification problem: there are several possible effects, and a single incident can lead to several of them.
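
To make the multilabel framing concrete, here is a minimal sketch of the multi-hot target encoding, written in plain NumPy. The effect IDs are invented for illustration, and `multi_hot` is a hypothetical helper; the notebook itself relies on scikit-learn's `MultiLabelBinarizer` for this step.

```python
import numpy as np

def multi_hot(label_sets):
    """Encode a list of label collections as a binary indicator matrix."""
    classes = sorted({label for labels in label_sets for label in labels})
    index = {label: j for j, label in enumerate(classes)}
    y = np.zeros((len(label_sets), len(classes)), dtype=int)
    for i, labels in enumerate(label_sets):
        for label in labels:
            y[i, index[label]] = 1
    return classes, y

# Hypothetical effect IDs: the second incident has two effects at once.
classes, y = multi_hot([[106], [12, 106], [40]])
print(classes)
print(y)
```

Each row can contain several ones, which is what distinguishes this target from a one-hot multiclass target.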

In the first notebook, we asked the following questions:
- What is the impact of dropout?
- Does adding layers improve performance?
- Is using bidirectional networks relevant?
- Is an attention layer useful?
- Attention is all we need, really?
- Use of embeddings
- Concatenating models over different inputs?

In this notebook, we want to try the following:
- Multi-head attention: https://www.kaggle.com/fareise/multi-head-self-attention-for-text-classification, https://github.com/CyberZHG/keras-multi-head
- Hierarchical attention: https://paperswithcode.com/paper/hierarchical-attentional-hybrid-neural
- Concatenating the embeddings and the TF-IDF features
- Looking for new regularization methods for recurrent networks
- Testing CNNs: https://www.kaggle.com/sanikamal/text-classification-with-python-and-keras
- Libraries to try quickly:
    - text-classification-keras: https://pypi.org/project/text-classification-keras/
    - pytext: https://github.com/facebookresearch/pytext

- Approaches with pretrained embeddings:
    - https://adventuresinmachinelearning.com/word2vec-keras-tutorial/
    - https://medium.com/@ppasumarthi_69210/word-embeddings-in-keras-be6bb3092831

In [24]:
import numpy as np
import pandas as pd
import tensorflow
# Use the Keras bundled with TensorFlow everywhere, to avoid mixing
# standalone `keras` layers with `tensorflow.keras` layers in one model.
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Attention, Bidirectional, Concatenate, Conv1D, Dense, Dropout, Embedding,
    GlobalMaxPool1D, GlobalMaxPooling1D, Input, Lambda, LSTM, Reshape,
    SimpleRNN, SpatialDropout1D, concatenate,
)

from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, classification_report,
    confusion_matrix, f1_score, precision_score, recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Fix the TensorFlow seed for reproducibility.
tensorflow.random.set_seed(1234)

## 1. Loading the data

In [5]:
mlb = MultiLabelBinarizer()

train = pd.read_pickle('./data_split/train.pkl')
# To build a model without effect 106, uncomment the next line.
#train = train[~train['TEF_ID'].map(lambda x : 106 in x)]
X_train = train[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_train = mlb.fit_transform(train['TEF_ID'])
test =  pd.read_pickle('./data_split/test.pkl')
#test = test[~test['TEF_ID'].map(lambda x : k in x)]
X_test = test[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_test = mlb.transform(test['TEF_ID'])


X_train_dgs = np.load('results/dgs_camenbert_train_vec.npy')
X_test_dgs =np.load('results/dgs_camenbert_test_vec.npy')



df_effets = pd.read_csv("data/ref_MRV/referentiel_dispositif_effets_connus.csv",delimiter=';',encoding='ISO-8859-1')
df_dys = pd.read_csv("data/ref_MRV/referentiel_dispositif_dysfonctionnement.csv",delimiter=';',encoding='ISO-8859-1')

## 2.1 text-classification-keras https://pypi.org/project/text-classification-keras/

In [2]:
%%time


from __future__ import absolute_import

from keras.layers import LSTM, Bidirectional, Conv1D, Dropout, GlobalAveragePooling1D, GlobalMaxPooling1D, MaxPooling1D, Dense, Flatten, GRU
from keras.layers.merge import Concatenate, concatenate

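# The two imports below are copied from the package source. The relative import
# cannot be resolved from a notebook cell and raises the error shown after this cell.
# Assumption: with the package installed, the absolute module paths would be
# texcla.models.layers and texcla.utils.format.
from layers import AttentionLayer
from ..utils.format import to_fixed_digits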


class SequenceEncoderBase(object):

    def __init__(self, dropout_rate=0.5):
        """Creates a new instance of sequence encoder.
        Args:
            dropout_rate: The final encoded output dropout.
        """
        self.dropout_rate = dropout_rate

    def __call__(self, x):
        """Build the actual model here.
        Args:
            x: The encoded or embedded input sequence.
        Returns:
            The model output tensor.
        """

        x = self.build_model(x)
        if self.dropout_rate > 0:
            x = Dropout(self.dropout_rate)(x)
        return x

    def build_model(self, x):
        """Build your model graph here.
        Args:
            x: The encoded or embedded input sequence.
        Returns:
            The model output tensor without the classification block.
        """
        raise NotImplementedError()

    def allows_dynamic_length(self):
        """Return a boolean indicating whether this model is capable of handling variable time steps per mini-batch.
        For example, this should be True for RNN models since you can use them with variable time steps per mini-batch.
        CNNs on the other hand expect fixed time steps across all mini-batches.
        """
        # Assume default as False. Should be overridden as necessary.
        return False

class AttentionRNN(SequenceEncoderBase):

    def __init__(self, rnn_class=LSTM, encoder_dims=50, bidirectional=True, dropout_rate=0.5, **rnn_kwargs):
        """Creates an RNN model with attention. The attention mechanism is implemented as described
        in https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf, but without
        sentence level attention.
        Args:
            rnn_class: The type of RNN to use. (Default Value = LSTM)
            encoder_dims: The number of hidden units of RNN. (Default Value: 50)
            bidirectional: Whether to use bidirectional encoding. (Default Value = True)
            **rnn_kwargs: Additional args for building the RNN.
        """
        super(AttentionRNN, self).__init__(dropout_rate)
        self.rnn_class = rnn_class
        self.encoder_dims = encoder_dims
        self.bidirectional = bidirectional
        self.rnn_kwargs = rnn_kwargs

    def build_model(self, x):
        rnn = self.rnn_class(
            self.encoder_dims, return_sequences=True, **self.rnn_kwargs)
        if self.bidirectional:
            word_activations = Bidirectional(rnn)(x)
        else:
            word_activations = rnn(x)

        attention_layer = AttentionLayer()
        doc_vector = attention_layer(word_activations)
        self.attention_tensor = attention_layer.get_attention_tensor()
        return doc_vector

    def get_attention_tensor(self):
        if not hasattr(self, 'attention_tensor'):
            raise ValueError('You need to build the model first')
        return self.attention_tensor

    def allows_dynamic_length(self):
        return True

    def __str__(self):
        bi = 'bi' if self.bidirectional else 'nobi'
        rnn_kwargs_str = str(self.rnn_kwargs) if len(
            self.rnn_kwargs) > 0 else ''
        li = ['stacked', str(self.rnn_class), str(self.encoder_dims),
              bi, 'do', to_fixed_digits(self.dropout_rate), rnn_kwargs_str]

        return '_'.join(li)

ValueError: attempted relative import beyond top-level package

In [None]:
experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model,
    word_encoder_model=word_encoder_model)

## Experiment 2: RCNN

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [6]:
# Model constants.
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 300
MAX_NB_WORDS = 50000

def vectorize(df_train, df_test, max_nb_words, max_sequence_length):
    """Fit a tokenizer on the training texts and return padded index sequences."""
    tokenizer = Tokenizer(num_words=max_nb_words)
    tokenizer.fit_on_texts(df_train.values)
    # Size of the full vocabulary; only the most frequent words are kept.
    print(len(tokenizer.word_index))

    seq_train = tokenizer.texts_to_sequences(df_train.values)
    seq_test = tokenizer.texts_to_sequences(df_test.values)

    seq_train = pad_sequences(seq_train, maxlen=max_sequence_length)
    seq_test = pad_sequences(seq_test, maxlen=max_sequence_length)
    return seq_train, seq_test


# One (train, test) pair of padded sequences per text column,
# each column with its own tokenizer.
TRAIN = []
for col in ['DESCRIPTION_INCIDENT', 'ETAT_PATIENT', 'FABRICANT']:
    seq_train, seq_test = vectorize(train[col], test[col], MAX_NB_WORDS, MAX_SEQUENCE_LENGTH)
    TRAIN.append((seq_train, seq_test))

57288
21168
2262
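
As an illustration of the padding step, the sketch below reproduces in NumPy what the default settings of Keras `pad_sequences` do (`padding='pre'`, `truncating='pre'`): short sequences are filled with zeros on the left, and long ones keep only their last tokens. `pad_pre` is a hypothetical helper and the token ids are made up.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=int)
    for i, seq in enumerate(sequences):
        kept = seq[-maxlen:]            # keep the last `maxlen` tokens
        if kept:
            out[i, -len(kept):] = kept  # right-align, padding value on the left
    return out

padded = pad_pre([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
print(padded)
```

With `MAX_SEQUENCE_LENGTH = 300`, a description longer than 300 tokens therefore loses its beginning, not its end.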


In [20]:
def rcnn_branch(inputs):
    """Embed one text field and encode it with a forward and a backward SimpleRNN."""
    emb = Embedding(input_dim=MAX_NB_WORDS, output_dim=EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH)(inputs)
    forward = SimpleRNN(128, return_sequences=True)(emb)
    backward = SimpleRNN(128, return_sequences=True, go_backwards=True)(emb)
    # go_backwards returns the outputs in reversed time order; flip them back so
    # both directions are aligned on the time axis before concatenation.
    backward = Lambda(lambda t: tensorflow.reverse(t, axis=[1]))(backward)
    h = Concatenate(axis=2)([forward, backward])
    h = Conv1D(64, kernel_size=1, activation='tanh')(h)
    return GlobalMaxPooling1D()(h)


# One input per text column, in the same order as TRAIN:
# DESCRIPTION_INCIDENT, ETAT_PATIENT, FABRICANT.
inputs_1 = Input(shape=(MAX_SEQUENCE_LENGTH,))
inputs_2 = Input(shape=(MAX_SEQUENCE_LENGTH,))
inputs_3 = Input(shape=(MAX_SEQUENCE_LENGTH,))

x = rcnn_branch(inputs_1)
y = rcnn_branch(inputs_2)
z = rcnn_branch(inputs_3)

w = concatenate([x, y, z])

# NOTE: softmax makes the scores sum to 1 across labels, hence the low decision
# thresholds explored below; a sigmoid output is the usual choice for
# independent multilabel targets.
out = Dense(y_train.shape[1], activation='softmax')(w)

model = keras.models.Model(inputs=[inputs_1, inputs_2, inputs_3], outputs=out)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

model.summary()


Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_22 (InputLayer)           (None, 300)          0                                            
__________________________________________________________________________________________________
input_23 (InputLayer)           (None, 300)          0                                            
__________________________________________________________________________________________________
input_24 (InputLayer)           (None, 300)          0                                            
__________________________________________________________________________________________________
embedding_12 (Embedding)        (None, 300, 300)     15000000    input_22[0][0]                   
____________________________________________________________________________________________
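
The model above reverses the output of the backward recurrent layer before concatenating it with the forward one. The toy sketch below shows why, using a running sum as a stand-in for a recurrent layer (`toy_rnn` is hypothetical and not part of Keras): a layer run with `go_backwards=True` returns its outputs in processing order, so they have to be flipped to line up with the forward outputs.

```python
import numpy as np

def toy_rnn(x, go_backwards=False):
    """Stand-in for a recurrent layer: the state is a running sum of the inputs.

    As with go_backwards=True in Keras, the backward variant returns its
    outputs in processing order, i.e. reversed in time.
    """
    seq = x[::-1] if go_backwards else x
    return np.cumsum(seq)

x = np.array([1, 2, 3, 4])
forward = toy_rnn(x)                      # step t has seen x[0..t]
backward = toy_rnn(x, go_backwards=True)  # output k has seen the last k+1 inputs
aligned = backward[::-1]                  # step t has seen x[t..end]

print(forward)
print(backward)
print(aligned)
```

After the flip, position `t` of the concatenated tensor combines the left context and the right context of the same token, which is the idea behind the RCNN architecture.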

In [21]:
history = model.fit([TRAIN[0][0],TRAIN[1][0],TRAIN[2][0]], y_train, epochs=5, validation_split=0.2, verbose=1, batch_size = 64)

Train on 21059 samples, validate on 5265 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [25]:
y_pred = model.predict([TRAIN[0][1],TRAIN[1][1],TRAIN[2][1]])
print('####################################')

thresholds = [0.01,0.04,0.06,0.08,0.1,0.12,0.14,0.16,0.2,0.25,0.3,0.35,0.4,0.5,0.6,0.7]
for val in thresholds:
    print("For threshold: ", val)
    # Binarize the predicted scores at the current threshold.
    pred = (y_pred >= val).astype(int)
  
    precision = precision_score(y_test, pred, average='samples')
    recall = recall_score(y_test, pred, average='samples')
    f1 = f1_score(y_test, pred, average='samples')
   
    print("Samples-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

####################################
For threshold:  0.01
Samples-average quality numbers
Precision: 0.3180, Recall: 0.8617, F1-measure: 0.4037
For threshold:  0.04
Samples-average quality numbers
Precision: 0.5216, Recall: 0.7714, F1-measure: 0.5794
For threshold:  0.06
Samples-average quality numbers
Precision: 0.5691, Recall: 0.7391, F1-measure: 0.6093
For threshold:  0.08
Samples-average quality numbers
Precision: 0.5970, Recall: 0.7106, F1-measure: 0.6204
For threshold:  0.1
Samples-average quality numbers
Precision: 0.6198, Recall: 0.6853, F1-measure: 0.6242
For threshold:  0.12
Samples-average quality numbers
Precision: 0.6335, Recall: 0.6640, F1-measure: 0.6222
For threshold:  0.14
Samples-average quality numbers
Precision: 0.6413, Recall: 0.6517, F1-measure: 0.6211
For threshold:  0.16
Samples-average quality numbers
Precision: 0.6416, Recall: 0.6410, F1-measure: 0.6182
For threshold:  0.2
Samples-average quality numbers
Precision: 0.6261, Recall: 0.6135, F1-measure: 0.6024
For threshold:  0.25
Samples-average quality numbers
Precision: 0.5872, Recall: 0.5840, F1-measure: 0.5779
For threshold:  0.3
Samples-average quality numbers
Precision: 0.5632, Recall: 0.5599, F1-measure: 0.5576
For threshold:  0.35
Samples-average quality numbers
Precisio
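
The `average='samples'` option used in the threshold sweep computes each metric per incident and then averages over incidents. The sketch below re-implements that averaging in NumPy and picks the threshold with the best F1 on a made-up example; `samples_prf`, the labels and the scores are all hypothetical, and samples with an empty denominator are scored 0, which is scikit-learn's default behaviour.

```python
import numpy as np

def samples_prf(y_true, y_pred):
    """Precision, recall and F1 computed per sample, then averaged."""
    tp = (y_true * y_pred).sum(axis=1)
    n_pred = y_pred.sum(axis=1)
    n_true = y_true.sum(axis=1)
    zeros = lambda: np.zeros(len(tp))
    # Samples with an empty denominator score 0.
    precision = np.divide(tp, n_pred, out=zeros(), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=zeros(), where=n_true > 0)
    denom = n_pred + n_true
    f1 = np.divide(2 * tp, denom, out=zeros(), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean()

# Made-up ground truth and predicted scores for two incidents, three effects.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.6, 0.1, 0.3], [0.2, 0.5, 0.05]])

results = {t: samples_prf(y_true, (scores >= t).astype(int)) for t in (0.1, 0.25, 0.5)}
best = max(results, key=lambda t: results[t][2])
print(best, results[best])
```

Selecting the threshold on the test set, as in this sweep, gives an optimistic estimate; a held-out validation split would be a safer place to tune it.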

# Comment:
We observe that our model overfits very quickly and to a large extent. There are several standard ways to counter this:
- Regularization
- Dropout
- Cleaning the data with clean text
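
As a reminder of what dropout does, here is a minimal NumPy sketch of inverted dropout, the variant commonly used at training time. `inverted_dropout` is a hypothetical helper and the rate of 0.5 is only an example value, not a setting taken from this notebook.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(1234)
h = np.ones((1000, 128))
h_drop = inverted_dropout(h, rate=0.5, rng=rng)

print((h_drop == 0).mean())  # fraction of dropped units, close to 0.5
print(h_drop.mean())         # close to 1.0: the expected activation is preserved
```

In Keras, the same idea is available through the `Dropout` layer and through the `dropout` and `recurrent_dropout` arguments of `SimpleRNN`, `LSTM` and `GRU`, which would be natural candidates to try on the models in this notebook.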

## Variant: https://github.com/airalcorn2/Recurrent-Convolutional-Neural-Network-Text-Classifier/blob/master/recurrent_convolutional_keras.py


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

X_train = train[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_train = mlb.fit_transform(train['TEF_ID'])
X_test = test[['FABRICANT','CLASSIFICATION','DESCRIPTION_INCIDENT','ETAT_PATIENT']]
y_test = mlb.transform(test['TEF_ID'])


# One TF-IDF vectorizer per text column; the remaining column (CLASSIFICATION)
# is passed through unchanged.
preprocess = ColumnTransformer(
    [
        ('description_tfidf',
         TfidfVectorizer(sublinear_tf=True, min_df=3, ngram_range=(1, 1),
                         max_features=10000, norm='l2'),
         'DESCRIPTION_INCIDENT'),
        ('etat_pat_tfidf',
         TfidfVectorizer(sublinear_tf=True, min_df=3, ngram_range=(1, 1),
                         max_features=10000, norm='l2'),
         'ETAT_PATIENT'),
        ('fabricant_tfidf',
         TfidfVectorizer(sublinear_tf=True, min_df=3, ngram_range=(1, 1),
                         max_features=5000, norm='l2'),
         'FABRICANT'),
    ],
    remainder='passthrough')

X_train_, X_test_ = preprocess.fit_transform(X_train), preprocess.transform(X_test)

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=1000)
X_train_ = svd.fit_transform(X_train_)
X_test_ = svd.transform(X_test_)
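
To clarify what the `sublinear_tf=True` and `norm='l2'` settings compute, the sketch below reproduces the weighting in NumPy on a tiny made-up count matrix, followed by a truncated SVD projection. It assumes scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`; `sublinear_tfidf` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def sublinear_tfidf(counts):
    """TF-IDF with sublinear tf, smoothed idf and L2-normalised rows."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)              # document frequency per term
    idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
    tf = np.zeros_like(counts)
    nz = counts > 0
    tf[nz] = 1 + np.log(counts[nz])            # sublinear tf: 1 + ln(count)
    weights = tf * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.where(norms == 0, 1, norms)

# Two documents, three terms; the last term appears in both documents.
X = sublinear_tfidf([[3, 0, 1], [0, 2, 1]])

# Truncated SVD keeps the top-k right singular vectors; no centring is applied.
_, _, vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ vt[:1].T
print(X.round(3))
print(X_reduced.shape)
```

The sublinear scaling dampens the weight of words repeated many times in one description, which can matter for long free-text incident reports.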

In [32]:
X_train_ = np.reshape(X_train_, (X_train_.shape[0], 1, X_train_.shape[1]))
X_test_ = np.reshape(X_test_, (X_test_.shape[0], 1, X_test_.shape[1]))
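
Recurrent layers expect a 3D input of shape `(batch, timesteps, features)`. The reshape above therefore turns each document vector into a sequence of length 1, as this small sketch on made-up data shows.

```python
import numpy as np

features = np.arange(6, dtype="float32").reshape(2, 3)  # 2 documents, 3 features
as_sequence = features.reshape(features.shape[0], 1, features.shape[1])

print(as_sequence.shape)  # (2, 1, 3): batch, timesteps, features

# Reversing along a time axis of length 1 changes nothing.
print(np.array_equal(as_sequence[:, ::-1, :], as_sequence))
```

With a single time step, a recurrent layer runs for one step only, so the forward and backward passes see exactly the same input; the recurrence itself carries no sequence information in this setting.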

In [55]:
NUM_CLASSES = y_train.shape[1]

# Each document is a single "time step" holding its 1000-dimensional SVD vector.
document = Input(shape=(1, 1000), dtype="float32")
#left_context = Input(shape = (1,1000, ), dtype = "float32")
#right_context = Input(shape = (1,1000, ), dtype = "float32")

x_1 = LSTM(200, return_sequences=True)(document)
x_2 = LSTM(20, return_sequences=True, go_backwards=True)(document)
# Flip the backward outputs so both directions are aligned on the time axis.
x_2 = Lambda(lambda t: tensorflow.reverse(t, axis=[1]))(x_2)
x = Concatenate(axis=2)([x_1, x_2])
x = Conv1D(64, kernel_size=1, activation='tanh')(x)
x = GlobalMaxPooling1D()(x)

# NOTE: softmax makes the scores sum to 1 across labels; a sigmoid output is the
# usual choice for independent multilabel targets.
out = Dense(NUM_CLASSES, activation='softmax')(x)

model = keras.models.Model(inputs=[document], outputs=out)
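
The output layer uses `softmax` together with `binary_crossentropy`. The NumPy sketch below, on made-up logits, illustrates how the two activations differ for a multilabel target: softmax scores compete and sum to 1, whereas sigmoid scores are independent per label. This may explain why the best thresholds found in this notebook are close to 0.1 and not 0.5.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for one incident where two effects are clearly present.
logits = np.array([4.0, 4.0, -2.0, -2.0])

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)
print(p_softmax.round(3), p_softmax.sum())
print(p_sigmoid.round(3))
```

With softmax, two equally likely effects can each get at most 0.5, so a 0.5 threshold would miss them; with sigmoid, both can be close to 1. Replacing `activation='softmax'` with `activation='sigmoid'` would be the natural variant to test, keeping in mind that the results recorded here were obtained with softmax.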

In [None]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["categorical_accuracy"])

epochs = 10
batch_size = 32

history = model.fit(X_train_, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2)

score,cat_acc = model.evaluate(X_test_,y_test)

y_pred = model.predict(X_test_)

print('loss : ', score)
print('categorical accuracy: ',cat_acc)

print('####################################')

thresholds = [0.01,0.04,0.06,0.08,0.1,0.12,0.14,0.16,0.2,0.25,0.3,0.35,0.4,0.5,0.6,0.7]
for val in thresholds:
    print("For threshold: ", val)
    # Binarize the predicted scores at the current threshold.
    pred = (y_pred >= val).astype(int)
  
    precision = precision_score(y_test, pred, average='samples')
    recall = recall_score(y_test, pred, average='samples')
    f1 = f1_score(y_test, pred, average='samples')
   
    print("Samples-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

Train on 21059 samples, validate on 5265 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
loss :  0.012681970073300835
categorical accuracy:  0.6524316072463989
####################################
For threshold:  0.01
Samples-average quality numbers
Precision: 0.3606, Recall: 0.9045, F1-measure: 0.4613
For threshold:  0.04
Samples-average quality numbers
Precision: 0.5393, Recall: 0.8327, F1-measure: 0.6125
For threshold:  0.06
Samples-average quality numbers
Precision: 0.5869, Recall: 0.8049, F1-measure: 0.6420
For threshold:  0.08
Samples-average quality numbers
Precision: 0.6180, Recall: 0.7844, F1-measure: 0.6574
For threshold:  0.1
Samples-average quality numbers
Precision: 0.6416, Recall: 0.7623, F1-measure: 0.6644
For threshold:  0.12
Samples-average quality numbers
Precision: 0.6552, Recall: 0.7443, F1-measure: 0.6662
For threshold:  0.14
Samples-average quality numbers
Precision: 0.6595, Recall: 0.7250, F1-measure: 0.6630
For threshold:  0.16
Samples-average quality numbers
Precision: 0.6616, Recall: 0.7095, F1-measure: 0.6599
For threshold:  0.2
Samples-average quality numbers
Precision: 0.6562, Recall: 0.6794, F1-measure: 0.6480
For threshold:  0.25
Samples-average quality numbers
Precision: 0.6411, Recall: 0.6511, F1-measure: 0.6333
For threshold:  0.3


In [54]:
X_train_[:,:,].shape

(26324, 1, 999)