<a href="https://colab.research.google.com/github/kevaniy/mon-nouveau-blog/blob/master/LSTM_classification_title.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with deep learning (LSTM and convolution)

## Task objective

Starting from a dataset of PubMed articles, the goal is to build a model that uses the titles and abstracts of the articles to predict the SIGAPS category associated with each article.

After a text preprocessing phase, we will train a convolution-based model, then a model based on a recurrent neural network (LSTM).

## Clone the repo https://github.com/aneuraz/intro-keras.git

In [15]:
!git clone https://github.com/aneuraz/intro-keras.git

Cloning into 'intro-keras'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 18 (delta 5), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.


## Import the libraries

In [16]:
%tensorflow_version 2.x
import json 
import tensorflow as tf
import numpy as np

## Loading the data

All the loaded data is located in the `/content/` directory.
The data is stored in a JSON file.

In [17]:
with open('/content/intro-keras/ai_pub_samp.json','r') as f:
  data = json.load(f)

In [18]:
len(data)

10000

In [19]:
data


[{'Cat_2013': 'C',
  'Cat_2014': 'C',
  'Cat_2015': 'C',
  'Cat_2016': 'C',
  'Cat_2017': 'B',
  'Disciplines': ['XQ'],
  'ESSN': '1873-3557',
  'IF_2013': '2.129',
  'IF_2014': '2.353',
  'IF_2015': '2.653',
  'IF_2016': '2.536',
  'IF_2017': '2.88',
  'ISSN': '1386-1425',
  'ISSN_online': '1873-3557',
  'ISSN_print': '1386-1425',
  'IsoAbbr': 'Spectrochim Acta A Mol Biomol Spectrosc',
  'JrId': 20555,
  'MedAbbr': 'Spectrochim Acta A Mol Biomol Spectrosc',
  'NLMid': '9602533',
  'Titre': 'Spectrochim Acta A Mol Biomol Spectrosc',
  'abstract': 'In this research, ZnO nanoparticle loaded on activated carbon (ZnO-NPs-AC) was synthesized simply by a low cost and nontoxic procedure. The characterization and identification have been completed by different techniques such as SEM and XRD analysis. A three layer artificial neural network (ANN) model is applicable for accurate prediction of dye removal percentage from aqueous solution by ZnO-NRs-AC following conduction of 270 experimental dat

## TODO: Extract the titles and categories

In [20]:
# store the lowercased title in the variable X
X = [ x['title'].lower() for x in data ]

# store the category (first element of the list) in the variable Y
Y = [ x['categories'][0] for x in data ]

## TODO: Compute the maximum length of the titles and abstracts in the dataset

In [21]:
# maximum title length (in characters), variable max_len
_len = [ len(title) for title in X ]
max_len = max(_len)
print(max_len)

299
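
Note that `len(title)` counts characters, so the value printed above is a character count, not a number of words. A small sketch on a made-up title (the string below is an illustrative assumption, not taken from the dataset) shows the difference between the two notions of length:

```python
# Hypothetical title, used only to illustrate the two notions of length.
title = "deep learning for text classification"

n_chars = len(title)           # what len() returns on a string
n_tokens = len(title.split())  # number of whitespace-separated words

print(n_chars, n_tokens)  # 37 5
```

Sequence models consume tokens, so the token count is usually the relevant one when choosing a padding length.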


## TODO: Split the dataset into train (X_train, Y_train, Z_train) and test (X_test, Y_test, Z_test)

In [22]:
# X_train, Y_train
X_train = X[:8000]
Y_train = Y[:8000]


# X_test, Y_test
X_test = X[8000:]
Y_test = Y[8000:]




In [49]:
Y_train

['BIOCHEMISTRY & MOLECULAR BIOLOGY',
 'RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING',
 'RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING',
 'ONCOLOGY',
 'SPECTROSCOPY',
 'SURGERY',
 'MATHEMATICAL & COMPUTATIONAL BIOLOGY',
 'NEUROSCIENCES',
 'COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE',
 'ENGINEERING, BIOMEDICAL',
 'CARDIAC & CARDIOVASCULAR SYSTEMS',
 'CHEMISTRY, MEDICINAL',
 'MULTIDISCIPLINARY SCIENCES',
 'NEUROSCIENCES',
 'TOXICOLOGY',
 'ENVIRONMENTAL SCIENCES',
 'ENVIRONMENTAL SCIENCES',
 'BIOLOGY',
 'PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH',
 'OPHTHALMOLOGY',
 'MULTIDISCIPLINARY SCIENCES',
 'MULTIDISCIPLINARY SCIENCES',
 'CHEMISTRY, ANALYTICAL',
 'OBSTETRICS & GYNECOLOGY',
 'MULTIDISCIPLINARY SCIENCES',
 'MULTIDISCIPLINARY SCIENCES',
 'RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING',
 'OPTICS',
 'CHEMISTRY, ANALYTICAL',
 'MULTIDISCIPLINARY SCIENCES',
 'MEDICINE, GENERAL & INTERNAL',
 'MULTIDISCIPLINARY SCIENCES',
 'MEDICINE, GENERAL & INTERNAL',
 'PSYCHIATRY',
 'MULTIDISCIPLINARY SC

In [23]:
from sklearn.model_selection import train_test_split 

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 21)

## Convert the variable Y into a numeric vector

["Cat 1", "Cat 2"] -> [1, 2] (id 0 is reserved for unknown categories)

In [24]:
Y_train

cat_to_id = {'<UNK>':0}

for cat in Y_train: 
  if cat not in cat_to_id.keys(): 
    cat_to_id[cat] = len(cat_to_id)

id_to_cat = { v: k for k,v in cat_to_id.items()}

In [25]:
num_cat = len(cat_to_id)
print(num_cat)

96


In [26]:
def preprocess_Y(Y, cat_to_id): 
  """returns list of cat_ids for Y
  """
  res = []
  for ex in Y: 
    if ex not in cat_to_id.keys(): 
      res.append(cat_to_id['<UNK>'])
    else:
      res.append(cat_to_id[ex])
  return np.array(res)

In [27]:
Y_train_id = preprocess_Y(Y_train, cat_to_id)
Y_test_id = preprocess_Y(Y_test, cat_to_id)

In [28]:
Y_train_id

array([ 1,  2,  2, ..., 24,  3, 12])
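
To make the role of `<UNK>` concrete, here is a minimal sketch of the same encoding on toy labels (the label lists and helper names are hypothetical). A category that appears only in the test set has no id of its own and falls back to 0:

```python
def build_cat_to_id(categories):
    """Give each category an id in order of first appearance; 0 is reserved for unknowns."""
    cat_to_id = {"<UNK>": 0}
    for cat in categories:
        if cat not in cat_to_id:
            cat_to_id[cat] = len(cat_to_id)
    return cat_to_id

def encode(categories, cat_to_id):
    """Map categories to ids, falling back to the <UNK> id for unseen ones."""
    return [cat_to_id.get(cat, cat_to_id["<UNK>"]) for cat in categories]

# Toy labels (hypothetical, for illustration only).
train = ["ONCOLOGY", "SURGERY", "ONCOLOGY"]
test = ["SURGERY", "OPTICS"]  # "OPTICS" never appears in train

cat_to_id = build_cat_to_id(train)
print(encode(train, cat_to_id))  # [1, 2, 1]
print(encode(test, cat_to_id))   # [2, 0]
```

Because the `<UNK>` slot counts as a class, `num_cat` includes it.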

## Tokenize the titles

To do this, you can use the Keras `Tokenizer` class.

The goal is to convert each text into a numeric vector:

text -> list of tokens -> numeric vector

"Meow the cat" -> ["Meow", "the", "cat"] -> [1, 2, 3]
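
The pipeline above can be sketched in plain Python. This is a simplified stand-in for what `Tokenizer` does, not its actual implementation: it lowercases and splits on whitespace, ranks words by frequency (index 1 is the most frequent, 0 is left free for padding), and drops unknown words. The function names and the toy corpus are assumptions made for illustration; the real class also strips punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown or too-rare words are dropped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

corpus = ["the cat sat", "the cat slept"]
word_index = fit_word_index(corpus)
sequences = texts_to_sequences(["the cat sat", "the dog"], word_index)
print(word_index)
print(sequences)  # [[1, 2, 3], [1]]
```

The word "dog" was never seen during fitting, so it simply disappears from the second sequence.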

In [32]:
# Create the tokenizer
vocab_size = 10000

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words = vocab_size)

In [33]:
# Fit the tokenizer on the training set 
tokenizer.fit_on_texts(X_train)

In [34]:
# Convert the texts to numeric vectors with the tokenizer
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)



In [35]:
max_len = max([len(x) for x in X_train_seq])


In [36]:
max_len

39

## Pad the resulting sequences so that they all have the same length (see the `pad_sequences` function)

[1, 2, 3]       -> [0, 0, 1, 2 ,3]

[4, 5, 6, 7, 8] -> [4, 5, 6, 7, 8]
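
A minimal pure-Python sketch of this behaviour, assuming zeros are added on the left (the `pad_sequences` default, `padding='pre'`) and overly long sequences lose their trailing items (`truncating='post'`). The helper name is hypothetical:

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value` up to maxlen; drop trailing items if the sequence is too long."""
    truncated = seq[:maxlen]                                  # mirrors truncating='post'
    return [value] * (maxlen - len(truncated)) + truncated    # mirrors padding='pre'

print(pad_sequence([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_sequence([4, 5, 6, 7, 8], 5))        # [4, 5, 6, 7, 8]
print(pad_sequence([1, 2, 3, 4, 5, 6, 7], 5))  # [1, 2, 3, 4, 5]
```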

In [37]:
# Pad the sequences
X_train_pad = tf.keras.preprocessing.sequence.pad_sequences(X_train_seq, maxlen = max_len, truncating= 'post')
X_test_pad = tf.keras.preprocessing.sequence.pad_sequences(X_test_seq, maxlen = max_len, truncating= 'post')


In [70]:
X_train_pad.shape


(8000, 39)

In [71]:
X_test_pad.shape


(2000, 39)

In [76]:
Y_test_id[0:100]

array([ 6,  2,  8, 12,  0, 55, 21,  1, 57,  3, 12, 12, 11, 12, 11, 24, 24,
       32,  7, 42, 12, 45,  5,  2,  9, 20, 27,  9, 12,  5, 38,  5, 12,  3,
        5,  7,  2, 71,  3,  7,  7,  1, 34, 12,  2, 12,  5, 12, 12,  2,  5,
       12,  5, 11, 39, 12,  9, 18, 12, 24, 32, 32, 24, 78,  5, 53,  3,  0,
       24,  5,  5, 14, 12, 12, 12,  9,  7, 17, 10, 14,  8,  1, 43, 12,  5,
       11,  6, 45,  9, 21,  7, 12, 20,  2,  2,  7, 24,  3,  7, 12])

# Convolutional network for text classification

Convolutional networks can also be used on text, in particular for text classification. Here we will build a CNN along the same lines as for images, with a few specific adjustments.

Since text is a sequence of words, it is a one-dimensional sequence. We will therefore apply a 1D convolution.

To process text, the first layer of our network will be an embedding layer.

As a reminder, word embedding consists of projecting tokens into a vector space so that tokens used in similar contexts (and which presumably have similar meanings) end up close to one another.

![Texte alternatif…](https://www.ibm.com/blogs/research/wp-content/uploads/2018/10/WMEFig1.png)

Embeddings can be computed in various ways. For example, word2vec, one of the best known, relies on two sibling algorithms: Skip-gram and CBOW.

![Texte alternatif…](https://pathmind.com/images/wiki/word2vec_diagrams.png)

For reference, there are now algorithms that outperform word2vec, such as [fastText](https://fasttext.cc), which takes subword information into account, or the family of contextual embeddings such as [ELMo](https://allennlp.org/elmo) and [BERT](https://arxiv.org/abs/1810.04805), which use the context in which a word appears to compute its vector.
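
To illustrate what "close in the vector space" means, here is a minimal sketch that compares words by cosine similarity. The 3-dimensional vectors are hand-picked assumptions for illustration, not the output of any trained model:

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

With these toy vectors, "cat" ends up much closer to "dog" than to "car", which is the kind of geometry a trained embedding layer aims for.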

In [9]:

# Model 1


embed_dim= 128
dropout1 = 0.2
conv_filters= 32
conv_kernel = 2
maxpool_size = 2
dense_size = 128
batch_size = 128
epochs = 10

# Build the model with at least the following layers

model_cnn1 = tf.keras.models.Sequential()
# Embedding 
model_cnn1.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim, 
                                        input_length= max_len))
# Dropout
model_cnn1.add(tf.keras.layers.Dropout(dropout1))
# Convolution
model_cnn1.add(tf.keras.layers.Conv1D(conv_filters, conv_kernel, 
                                     padding='valid', 
                                     strides= 1, 
                                     activation='relu'))
# Max pooling, then flatten to a single vector per example
model_cnn1.add(tf.keras.layers.MaxPooling1D(maxpool_size))
model_cnn1.add(tf.keras.layers.Flatten())
# Dense
model_cnn1.add(tf.keras.layers.Dense(dense_size,  activation='relu'))

# Classifier (Dense + softmax activation)
model_cnn1.add(tf.keras.layers.Dense(num_cat))
model_cnn1.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_cnn1.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_cnn1.summary()


In [78]:
# Fit model 1
model_cnn1.fit(X_train_pad, Y_train_id, batch_size = batch_size , epochs = epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb391ce7208>

In [79]:
# Evaluate model 1
model_cnn1.evaluate(X_test_pad, Y_test_id)



[3.9380040168762207, 0.29649999737739563]
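
`evaluate` returns the loss followed by the metrics passed to `compile`, so the two numbers above are `[loss, accuracy]`. To give the loss a point of comparison, the sketch below computes the sparse categorical cross-entropy of a classifier that spreads its probability uniformly over the 96 classes. The helper functions are illustrative plain-Python versions, not the Keras implementations:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability assigned to the true class id."""
    return -math.log(probs[label])

num_classes = 96  # value of num_cat printed earlier
uniform = softmax([0.0] * num_classes)
chance_loss = sparse_categorical_crossentropy(uniform, label=3)
print(round(chance_loss, 3))  # ln(96), about 4.564
```

A test loss below roughly 4.56 therefore means the model does better, on average, than a uniform guess.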

In [84]:

# Model 2


embed_dim2= 128
dropout2 = 0.4
conv_filters2= 32
conv_kernel2 = 2
maxpool_size2 = 4
dense_size2 = 128
batch_size2 = 128
epochs2 = 40

# Build the model with at least the following layers

model_cnn2 = tf.keras.models.Sequential()
# Embedding 
model_cnn2.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim2, 
                                        input_length= max_len))
# Dropout
model_cnn2.add(tf.keras.layers.Dropout(dropout2))
# Convolution
model_cnn2.add(tf.keras.layers.Conv1D(conv_filters2, conv_kernel2, 
                                     padding='valid', 
                                     strides= 1, 
                                     activation='elu'))
# Max pooling, then flatten to a single vector per example
model_cnn2.add(tf.keras.layers.MaxPooling1D(maxpool_size2))
model_cnn2.add(tf.keras.layers.Flatten())
# Dense
model_cnn2.add(tf.keras.layers.Dense(dense_size2,  activation='elu'))

# Classifier (Dense + softmax activation)
model_cnn2.add(tf.keras.layers.Dense(num_cat))
model_cnn2.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_cnn2.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_cnn2.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 39, 128)           1280000   
_________________________________________________________________
dropout_3 (Dropout)          (None, 39, 128)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 38, 32)            8224      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 9, 32)             0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 288)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 128)               36992     
_________________________________________________________________
dense_7 (Dense)              (None, 96)               
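
The output shapes in this summary can be checked by hand. With `padding='valid'` and a stride of 1, a 1D convolution over a sequence of length L with kernel size k yields L - k + 1 positions; max pooling with pool size p (and a stride equal to p, the Keras default) keeps floor(n / p) of them. The helper names below are illustrative, not Keras functions:

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with padding='valid'."""
    return (input_length - kernel_size) // stride + 1

def maxpool1d_length(input_length, pool_size):
    """Output length of max pooling with stride equal to pool_size and no padding."""
    return input_length // pool_size

seq_len, filters, kernel, pool = 39, 32, 2, 4  # Model 2 settings
after_conv = conv1d_valid_length(seq_len, kernel)
after_pool = maxpool1d_length(after_conv, pool)
flattened = after_pool * filters
print(after_conv, after_pool, flattened)  # 38 9 288
```

These match the `(None, 38, 32)`, `(None, 9, 32)` and `(None, 288)` rows of the summary.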

In [85]:
# Fit model 2
model_cnn2.fit(X_train_pad, Y_train_id, batch_size = batch_size2 , epochs = epochs2)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7fb332515588>

In [86]:
# Evaluate model 2
model_cnn2.evaluate(X_test_pad, Y_test_id)



[7.827927112579346, 0.24250000715255737]

In [91]:

# Model 3


embed_dim3= 128
dropout3 = 0.3
conv_filters3= 64
conv_kernel3 = 4
maxpool_size3 = 4
dense_size3 = 256
batch_size3 = 256
epochs3 = 100

# Build the model with at least the following layers

model_cnn3 = tf.keras.models.Sequential()
# Embedding 
model_cnn3.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim3, 
                                        input_length= max_len))
# Dropout
model_cnn3.add(tf.keras.layers.Dropout(dropout3))
# Convolution
model_cnn3.add(tf.keras.layers.Conv1D(conv_filters3, conv_kernel3, 
                                     padding='valid', 
                                     strides= 1, 
                                     activation='relu'))
# Max pooling, then flatten to a single vector per example
model_cnn3.add(tf.keras.layers.MaxPooling1D(maxpool_size3))
model_cnn3.add(tf.keras.layers.Flatten())
# Dense
model_cnn3.add(tf.keras.layers.Dense(dense_size3,  activation='relu'))
# Additional dense (fully connected) layer; reuses dense_size2 (128) defined for Model 2
model_cnn3.add(tf.keras.layers.Dense(dense_size2,  activation='relu'))

# Classifier (Dense + softmax activation)
model_cnn3.add(tf.keras.layers.Dense(num_cat))
model_cnn3.add(tf.keras.layers.Activation('softmax'))


# Compile the model 
model_cnn3.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_cnn3.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 39, 128)           1280000   
_________________________________________________________________
dropout_6 (Dropout)          (None, 39, 128)           0         
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 36, 64)            32832     
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 9, 64)             0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 256)               147712    
_________________________________________________________________
dense_14 (Dense)             (None, 128)              
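
The parameter counts in this summary follow from simple formulas: an embedding stores one vector per vocabulary entry, a 1D convolution has one weight per (kernel position, input channel, filter) plus one bias per filter, and a dense layer has one weight per (input, unit) pair plus one bias per unit. A short sketch, with illustrative helper names:

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

def dense_params(in_features, units):
    return in_features * units + units  # weights + biases

# Model 3 settings
print(embedding_params(10000, 128))  # 1280000
print(conv1d_params(4, 128, 64))     # 32832
print(dense_params(576, 256))        # 147712
```

Almost all of the parameters sit in the embedding layer, which is typical for small text models.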

In [88]:
# Fit model 3
model_cnn3.fit(X_train_pad, Y_train_id, batch_size = batch_size3 , epochs = epochs3)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fb3323ab208>

In [89]:
# Evaluate model 3
model_cnn3.evaluate(X_test_pad, Y_test_id)



[14.073052406311035, 0.24699999392032623]
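
Across the three CNNs, the test loss rises from about 3.94 (10 epochs) to 7.83 (40 epochs) and 14.07 (100 epochs) while the test accuracy does not improve. The architectures differ too, so this is not a controlled comparison, but the pattern is consistent with overfitting as training gets longer. One common remedy is to monitor a validation loss and stop once it no longer improves; Keras offers the `tf.keras.callbacks.EarlyStopping` callback for this. The sketch below only illustrates the stopping rule in plain Python, on made-up validation losses:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving, then degrading.
val_losses = [4.2, 3.9, 3.8, 4.0, 4.5, 5.1]
result = early_stopping_epoch(val_losses)
print(result)  # (5, 3)
```

Here training would stop at epoch 5 and the weights from epoch 3 would be the ones worth keeping.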

# LSTM for text classification

Another type of neural network can also be used for this kind of task: recurrent neural networks, or RNNs.

RNNs are designed to handle sequences. The network reads the tokens one at a time and, at each step, computes a representation of the sequence that takes all previous steps into account.

![Texte alternatif…](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Recurrent_neural_network_unfold.svg/450px-Recurrent_neural_network_unfold.svg.png)
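
The recurrence can be sketched with a scalar toy cell, where the new state is computed from the current input and the previous state: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). The weights below are arbitrary assumptions chosen for illustration; a real layer learns matrices instead of scalars.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states_a = simple_rnn([1.0, 0.0, 0.0])  # signal at the first step only
states_b = simple_rnn([0.0, 0.0, 0.0])  # no signal at all
print(states_a)
print(states_b)
```

The last state of the first sequence is still non-zero even though its last two inputs are zero: the state carries information from earlier steps, although it fades at each step.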


There are several types of RNN. Here we will use Long Short-Term Memory (LSTM) networks, which improve performance on long sequences by means of a set of "gates".

![Texte alternatif…](http://dprogrammer.org/wp-content/uploads/2019/04/RNN-vs-LSTM-vs-GRU-1200x361.png)

In [92]:
# Build an LSTM-based network with at least:
# Embedding
# Dropout
# LSTM
# Dropout
# Classifier

# LSTM model 1

lstm_size= 128
dropout2= 0.2  # overrides the dropout2 value set for CNN Model 2

model_lstm1 = tf.keras.models.Sequential()
# Embedding 
model_lstm1.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim, 
                                        input_length= max_len))
# Dropout
model_lstm1.add(tf.keras.layers.Dropout(dropout1))
# Bidirectional LSTM (lstm_size // 2 units per direction)
model_lstm1.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_size //2)))
model_lstm1.add(tf.keras.layers.Dropout(dropout2))
# Dense
model_lstm1.add(tf.keras.layers.Dense(dense_size,  activation='relu'))

# Classifier (Dense + softmax activation)
model_lstm1.add(tf.keras.layers.Dense(num_cat))
model_lstm1.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_lstm1.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_lstm1.summary()


Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 39, 128)           1280000   
_________________________________________________________________
dropout_7 (Dropout)          (None, 39, 128)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               98816     
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_17 (Dense)             (None, 96)                12384     
_________________________________________________________________
activation_6 (Activation)    (None, 96)               
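
The 98,816 parameters of the bidirectional layer can be recovered by hand. An LSTM has four weight blocks (input, forget and output gates, plus the cell candidate), each with input weights, recurrent weights and a bias; the bidirectional wrapper holds two independent LSTMs. A short sketch, with illustrative helper names:

```python
def lstm_params(input_dim, units):
    """Four weight blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """Two independent LSTMs, one reading forward and one reading backward."""
    return 2 * lstm_params(input_dim, units)

lstm_size, embed_dim = 128, 128
total = bidirectional_lstm_params(embed_dim, lstm_size // 2)
print(total)  # 98816
```

Each direction outputs `lstm_size // 2 = 64` values, and the two are concatenated, hence the `(None, 128)` output shape.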

In [93]:
# Fit LSTM model 1
model_lstm1.fit(X_train_pad, Y_train_id, batch_size = batch_size , epochs = epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb391e51320>

In [94]:
# Evaluate LSTM model 1
model_lstm1.evaluate(X_test_pad, Y_test_id)



[4.1148600578308105, 0.31150001287460327]

In [95]:

# LSTM model 2

lstm_size= 128


model_lstm2 = tf.keras.models.Sequential()
# Embedding 
model_lstm2.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim2, 
                                        input_length= max_len))
# Dropout
model_lstm2.add(tf.keras.layers.Dropout(dropout1))
# Bidirectional LSTM (lstm_size // 2 units per direction)
model_lstm2.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_size //2)))
model_lstm2.add(tf.keras.layers.Dropout(dropout3))
# Dense
model_lstm2.add(tf.keras.layers.Dense(dense_size,  activation='relu'))

# Classifier (Dense + softmax activation)
model_lstm2.add(tf.keras.layers.Dense(num_cat))
model_lstm2.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_lstm2.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_lstm2.summary()


Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 39, 128)           1280000   
_________________________________________________________________
dropout_9 (Dropout)          (None, 39, 128)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dropout_10 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_18 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_19 (Dense)             (None, 96)                12384     
_________________________________________________________________
activation_7 (Activation)    (None, 96)               

In [96]:
# Fit LSTM model 2
model_lstm2.fit(X_train_pad, Y_train_id, batch_size = batch_size2 , epochs = epochs2)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7fb3a6375630>

In [97]:
# Evaluate LSTM model 2
model_lstm2.evaluate(X_test_pad, Y_test_id)



[9.50861930847168, 0.28200000524520874]

In [98]:

# LSTM model 3

lstm_size= 128


model_lstm3 = tf.keras.models.Sequential()
# Embedding 
model_lstm3.add(tf.keras.layers.Embedding(vocab_size, 
                                        embed_dim3, 
                                        input_length= max_len))
# Dropout
model_lstm3.add(tf.keras.layers.Dropout(dropout2))
# Bidirectional LSTM (lstm_size // 2 units per direction)
model_lstm3.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_size //2)))
model_lstm3.add(tf.keras.layers.Dropout(dropout3))
# Dense
model_lstm3.add(tf.keras.layers.Dense(dense_size3,  activation='elu'))

# Classifier (Dense + softmax activation)
model_lstm3.add(tf.keras.layers.Dense(num_cat))
model_lstm3.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_lstm3.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_lstm3.summary()


Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 39, 128)           1280000   
_________________________________________________________________
dropout_11 (Dropout)         (None, 39, 128)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               98816     
_________________________________________________________________
dropout_12 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_20 (Dense)             (None, 256)               33024     
_________________________________________________________________
dense_21 (Dense)             (None, 96)                24672     
_________________________________________________________________
activation_8 (Activation)    (None, 96)               

In [99]:
# Fit LSTM model 3
model_lstm3.fit(X_train_pad, Y_train_id, batch_size = batch_size3 , epochs = epochs3)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fb3a59c1160>

In [100]:
# Evaluate LSTM model 3
model_lstm3.evaluate(X_test_pad, Y_test_id)



[11.67905044555664, 0.2709999978542328]

# Using pre-trained embeddings

To improve the quality of the word representations, embeddings can be trained on large corpora of unannotated text (typically Wikipedia). Such models are often available online for download. Here we will use 50-dimensional [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings (for practical reasons; larger dimensions, between 100 and 300, are preferable).

In [101]:
# Function for loading pre-trained embeddings

import numpy as np
import re
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def load_glove_embeddings(fp, embedding_dim, include_empty_char=True):
    """
    Loads pre-trained word embeddings (GloVe embeddings)
        Inputs: - fp: filepath of pre-trained glove embeddings
                - embedding_dim: dimension of each vector embedding
                - include_empty_char: whether to add an index (with an all-zero vector) for the empty string
        Outputs:
                - word2index: Dictionary. Word to word-index
                - embedding_matrix: Embedding matrix for Keras Embedding layer
    """
    # First, build the "word2coefs" and "word2index"
    word2coefs = {} # word to its corresponding coefficients
    word2index = {} # word to word-index
    with open(fp) as f:
        for idx, line in enumerate(f):
            try:
                data = [x.strip().lower() for x in line.split()]
                word = data[0]
                coefs = np.asarray(data[1:embedding_dim+1], dtype='float32')
                word2coefs[word] = coefs
                if word not in word2index:
                    word2index[word] = len(word2index)
            except Exception as e:
                print('Exception occurred in `load_glove_embeddings`:', e)
                continue
        # End of for loop.
    # End of with open
    if include_empty_char:
        word2index[''] = len(word2index)
    # Second, build the "embedding_matrix"
    # Words not found in embedding index will be all-zeros. Hence, the "+1".
    vocab_size = len(word2coefs)+1 if include_empty_char else len(word2coefs)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, idx in word2index.items():
        embedding_vec = word2coefs.get(word)
        if embedding_vec is not None and embedding_vec.shape[0]==embedding_dim:
            embedding_matrix[idx] = np.asarray(embedding_vec)
    # return word2coefs, word2index, embedding_matrix
    return word2index, np.asarray(embedding_matrix)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [102]:
# Download the embeddings

!wget https://github.com/kmr0877/IMDB-Sentiment-Classification-CBOW-Model/raw/master/glove.6B.50d.txt.gz
!gunzip /content/glove.6B.50d.txt.gz

--2021-01-03 13:42:45--  https://github.com/kmr0877/IMDB-Sentiment-Classification-CBOW-Model/raw/master/glove.6B.50d.txt.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kmr0877/IMDB-Sentiment-Classification-CBOW-Model/master/glove.6B.50d.txt.gz [following]
--2021-01-03 13:42:45--  https://raw.githubusercontent.com/kmr0877/IMDB-Sentiment-Classification-CBOW-Model/master/glove.6B.50d.txt.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69182520 (66M) [application/octet-stream]
Saving to: ‘glove.6B.50d.txt.gz’


2021-01-03 13:42:47 (147 MB/s) - ‘glove.6B.50d.txt.gz’ saved [69182520/69182520]



In [109]:
pwd

'/content'

In [110]:
# Load the embeddings by reading the GloVe file line by line
# (the load_glove_embeddings function defined above is not used here)

embeddings_index = dict()

with open('/content/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                 # first field: the word
        coefs = np.asarray(values[1:], dtype='float32')  # remaining fields: its vector
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [5]:
# write a custom tokenization function to preprocess the texts
import numpy as np
from keras.preprocessing.text import Tokenizer

size_of_vocabulary = 400000  # number of word vectors loaded from the GloVe file


# Build the embedding matrix: row i holds the GloVe vector of the word whose
# index in tokenizer.word_index is i; rows of words without a vector stay at zero
embedding_matrix = np.zeros((size_of_vocabulary, 50))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
# Encode the texts with the custom function
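
The key point in the loop above is that the rows of the matrix are addressed by the tokenizer's indices, not by the order of words in the GloVe file. A toy sketch of the same construction, with hypothetical 2-dimensional vectors, also measures how many vocabulary words actually have a pre-trained vector:

```python
# Hypothetical tokenizer index and pre-trained vectors (2-dimensional for brevity).
word_index = {"the": 1, "cat": 2, "zyxq": 3}
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
embedding_dim = 2

# Row 0 is reserved for padding, hence len(word_index) + 1 rows.
num_rows = len(word_index) + 1
embedding_matrix = [[0.0] * embedding_dim for _ in range(num_rows)]

found = 0
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = list(vector)
        found += 1

coverage = found / len(word_index)
print(embedding_matrix)
print(f"coverage: {coverage:.2f}")
```

In this sketch the matrix has only as many rows as the tokenizer needs. Allocating one row per GloVe word works too, as long as no tokenizer index exceeds the number of rows, but any rows beyond the tokenizer's vocabulary stay at zero and are never used.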


In [None]:
# Pad the sequences


In [12]:
# Build a model that loads the embedding weights into the Embedding layer


# LSTM model with pre-trained GloVe embeddings

lstm_size= 128
embed_dim2= 128
dropout2 = 0.4
conv_filters2= 32
conv_kernel2 = 2
maxpool_size2 = 4
dense_size2 = 128
batch_size2 = 128
epochs2 = 40

model_lstmp = tf.keras.models.Sequential()
# Embedding initialised with the GloVe matrix and frozen during training
model_lstmp.add(tf.keras.layers.Embedding(size_of_vocabulary, 
                                        50, 
                                        weights=[embedding_matrix],
                                        input_length=100,
                                        trainable=False))
# Dropout
model_lstmp.add(tf.keras.layers.Dropout(dropout1))
# Bidirectional LSTM (lstm_size // 2 units per direction)
model_lstmp.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_size //2)))
model_lstmp.add(tf.keras.layers.Dropout(dropout2))
# Dense
model_lstmp.add(tf.keras.layers.Dense(dense_size,  activation='relu'))

# Classifier (Dense + softmax activation)
model_lstmp.add(tf.keras.layers.Dense(64))
model_lstmp.add(tf.keras.layers.Activation('softmax'))

# Compile the model 
model_lstmp.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics= ['accuracy'])

# Print the model summary
model_lstmp.summary()


Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 50)           20000000  
_________________________________________________________________
dropout_3 (Dropout)          (None, 100, 50)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               58880     
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation (Activation)      (None, 64)               

In [41]:
# Fit the model

model_lstmp.fit(X_train_pad, Y_train_id, batch_size = batch_size2 , epochs = 100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f0dc8031400>

In [42]:
# Evaluate the model
model_lstmp.evaluate(X_test_pad, Y_test_id)



[nan, 0.0010000000474974513]
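
A NaN loss combined with near-zero accuracy is the kind of result one would expect when some labels fall outside the range of the output layer. The classifier above has 64 units while `num_cat` is 96, so any label id of 64 or more has no corresponding output. The embedding layer also declares `input_length=100`, whereas the padded titles have length 39. This is a plausible explanation rather than a confirmed diagnosis. A minimal sketch of a sanity check that could be run before training, on made-up label ids and with a hypothetical helper name:

```python
def labels_fit_output(label_ids, output_units):
    """True if every label id indexes a valid output unit (0 .. output_units - 1)."""
    return all(0 <= label < output_units for label in label_ids)

# Hypothetical label ids, in the range of a 96-category problem.
label_ids = [1, 2, 12, 71, 95]

print(labels_fit_output(label_ids, 64))  # False: 71 and 95 have no output unit
print(labels_fit_output(label_ids, 96))  # True
```

Under that assumption, sizing the final `Dense` layer with `num_cat`, as in the earlier models, would be the first thing to try.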