### Outline of this notebook

1. Use pre-trained GloVe word embeddings
2. Use pre-trained word2vec word embeddings
3. What about Freebase?
4. Use pre-trained GloVe word embeddings with an LSTM model

# Data cleaning and numerical encoding

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
import numpy as np

from tools import *
from embeddings import *
from models import *

Using TensorFlow backend.


In [2]:
# load raw string data
data_train, y_train_all, data_test, id_test = load_data()

## Data cleaning (optional)

In [3]:
params = {'lower': True, 
          'lemma': False, 
          'stop_words': False}

comment = data_train[2]
print(comment)
print('-------')
print(CommentCleaner(**params).transform(comment))

Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
-------
hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info 
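
`CommentCleaner` comes from the local `tools` module, which is not shown in this notebook. As a rough illustration of what the `lower=True` setting appears to do in the output above, here is a minimal sketch. The name `clean_comment` is hypothetical and the regular expression is an assumption, not the actual implementation.

```python
import re


def clean_comment(text):
    """Lowercase the text and collapse each run of non-alphanumeric characters into one space."""
    lowered = text.lower()
    # Assumption: anything outside a-z / 0-9 is treated as a separator.
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


print(clean_comment("Hey man, I'm really not trying to edit war."))
```

Unlike the output shown above, this sketch strips the trailing space left by the final punctuation mark.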


In [4]:
clean_data_train = transform_dataset(data_train, transformer=CommentCleaner, kwargs=params)
clean_data_test = transform_dataset(data_test, transformer=CommentCleaner, kwargs=params)

Transformation: 100%       
Transformation: 100%       


## Tokenization and splitting of the text data

This follows the approach of the GitHub notebook https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb

In [5]:
# Convert strings to integer indexes,
# keeping only the VOCAB_SIZE most common words,
# and pad the sentences to SENTENCE_LENGTH words
VOCAB_SIZE = 30000
SENTENCE_LENGTH = 200  # 200 if stop_words deleted, 120 otherwise

In [6]:
tokenizer = TokenVectorizer(max_len=SENTENCE_LENGTH, max_features=VOCAB_SIZE)

# Alternative: encode the raw (uncleaned) data instead
# X_train_all, X_test = encode(data_train, data_test, vectorizer=tokenizer)
X_train_all, X_test = encode(clean_data_train, clean_data_test, vectorizer=tokenizer)

ENCODING: Fitting vectorizer to data
ENCODING: transforming data to numerical
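
`TokenVectorizer` and `encode` are defined in the local modules and are not shown here. The sketch below illustrates the general idea described in the cell comments: keep the most frequent words, map them to integer indexes, and pad every sentence to a fixed length. The function names are hypothetical, and the left-padding and left-truncation choices are assumptions (they mirror the defaults of Keras `pad_sequences`), not a description of the actual helper.

```python
from collections import Counter


def fit_vocab(texts, max_features):
    """Map the max_features most frequent words to indexes 1..max_features (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_features))}


def encode_and_pad(text, word_index, max_len):
    """Turn a sentence into exactly max_len integers, dropping out-of-vocabulary words."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-max_len:]                          # keep the last max_len tokens
    return [0] * (max_len - len(ids)) + ids       # pad on the left with zeros


demo_vocab = fit_vocab(["a b a", "a c"], max_features=2)
print(demo_vocab)
print(encode_and_pad("a c b", demo_vocab, max_len=4))
```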


In [7]:
SPLIT_VALID_RATIO = 0.10
SPLIT_RANDOM_SEED = 0  # classes are unbalanced: the split obtained with this seed was checked and is OK

X_train, X_valid, y_train, y_valid = train_test_split(X_train_all, y_train_all, 
                                                      test_size=SPLIT_VALID_RATIO,
                                                      random_state=SPLIT_RANDOM_SEED)
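
The seed comment mentions that the split was checked because the classes are unbalanced. One simple way to run such a check, sketched below on synthetic data, is to compare the fraction of positive examples per label between the two subsets. The helper names are hypothetical, and the assumption that the targets form a binary matrix with one column per label is not confirmed by this notebook.

```python
import numpy as np


def positive_rates(y):
    """Fraction of positive examples in each label column."""
    return np.asarray(y, dtype=float).mean(axis=0)


def max_rate_gap(y_a, y_b):
    """Largest absolute difference in per-label positive rate between two subsets."""
    return float(np.max(np.abs(positive_rates(y_a) - positive_rates(y_b))))


# Synthetic stand-in for the real targets: 1000 rows, 6 rare binary labels.
rng = np.random.RandomState(0)
y_all_demo = (rng.rand(1000, 6) < 0.05).astype(int)
perm = rng.permutation(len(y_all_demo))
y_valid_demo, y_train_demo = y_all_demo[perm[:100]], y_all_demo[perm[100:]]

gap = max_rate_gap(y_train_demo, y_valid_demo)
print("largest per-label gap between train and validation:", gap)
```

A small gap suggests that the rare labels are represented similarly in both subsets.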

# 1. Use pre-trained GloVe word embeddings

https://medium.com/@sabber/classifying-yelp-review-comments-using-cnn-lstm-and-pre-trained-glove-word-embeddings-part-3-53fcea9a17fa

https://github.com/msahamed/yelp_comments_classification_nlp

https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb

## embedding_matrix with the GloVe embeddings

GloVe is distributed in several pre-trained variants with different embedding vector sizes; see: https://nlp.stanford.edu/projects/glove/
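
The GloVe downloads are plain text files in which each line holds a word followed by its space-separated vector components. The sketch below parses that format from any iterable of lines. `parse_glove_lines` is a hypothetical helper, not the loader used by `load_pretrained_embeddings`, and the in-memory demo file stands in for a real download.

```python
import io


def parse_glove_lines(lines, wanted=None):
    """Read 'word v1 v2 ... vd' lines into a dict, optionally keeping only the words in `wanted`."""
    vectors = {}
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if wanted is None or word in wanted:
            vectors[word] = [float(v) for v in values]
    return vectors


# Tiny stand-in for a real GloVe file (3 dimensions instead of 200).
demo_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.6 -0.7\n")
glove_demo = parse_glove_lines(demo_file, wanted={"cat"})
print(glove_demo)
```

Filtering on `wanted` keeps memory use proportional to the task vocabulary rather than to the full pre-trained vocabulary.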

In [8]:
# Load GloVe pre-trained embeddings
EMBEDDING_DIM = 200  # several embeddings sizes depending on source : 25, 50, 100, 200, 300 
EMBEDDING_SOURCE = 'glove_wikipedia'  # {'glove_twitter', 'glove_wikipedia', 'word2vec_googlenews'}

embeddings_matrix = load_pretrained_embeddings(tokenizer.word_index, VOCAB_SIZE, EMBEDDING_DIM, EMBEDDING_SOURCE)

Number of pre-trained word vectors in database       : 400000
Number of our words with a pre-trained embedding     : 27289
Percentage of our words with a pre-trained embedding : 90.963%
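
`load_pretrained_embeddings` lives in the local `embeddings` module and is not shown. A common way to build such a matrix is sketched below: rows of known words are copied from the pre-trained vectors, and the remaining rows are drawn from a normal distribution matching the mean and standard deviation of the pre-trained values. The function name, the random initialization of missing words, and the all-zero padding row are assumptions, not the author's code.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, vocab_size, dim, seed=0):
    """Return a (vocab_size, dim) matrix and the number of words found in `vectors`."""
    rng = np.random.RandomState(seed)
    stacked = np.stack(list(vectors.values()))
    # Assumption: words without a pre-trained vector get values with matching statistics.
    matrix = rng.normal(stacked.mean(), stacked.std(), size=(vocab_size, dim))
    matrix[0] = 0.0                       # index 0 is assumed to be the padding token
    n_found = 0
    for word, idx in word_index.items():
        if idx >= vocab_size:
            continue                      # word is outside the vocabulary cap
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
            n_found += 1
    return matrix, n_found


demo_vectors = {"a": np.array([1.0, 2.0]), "b": np.array([3.0, 4.0])}
demo_index = {"a": 1, "c": 2, "b": 5}
demo_matrix, demo_found = build_embedding_matrix(demo_index, demo_vectors, vocab_size=4, dim=2)
print(demo_matrix.shape, demo_found)
```

The coverage reported above is consistent with dividing the number of words found by `VOCAB_SIZE`: 27289 / 30000 is about 90.963%.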


## Defining the Yoon Kim network for GloVe

In [13]:
N_FILTERS = 100
FILTERS_SIZES = (3, 5, 7)
TRAIN_EMBEDDINGS = True
MODEL_NAME = "embed_conv_fc_GLOVE_emb200_pretrained_trainableTrue_goodStatInit"

model = yoon_kim(sentence_length=SENTENCE_LENGTH, vocab_size=VOCAB_SIZE,
                 n_filters=N_FILTERS, filters_sizes=FILTERS_SIZES,
                 embedding_dim=EMBEDDING_DIM, embedding_matrix=embeddings_matrix, train_embeddings=TRAIN_EMBEDDINGS)
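
The `yoon_kim` builder is defined in the local `models` module and is not shown. To make the roles of `N_FILTERS` and `FILTERS_SIZES` concrete, the NumPy sketch below computes the feature vector that a Kim-style text CNN typically extracts from one embedded sentence: a 1D convolution per filter size, a ReLU, then max-over-time pooling, with the pooled outputs concatenated. It is an illustration with random weights under those assumptions, not the author's Keras model.

```python
import numpy as np


def kim_features(embedded, filters):
    """embedded: (seq_len, dim); filters: list of arrays shaped (filter_size, dim, n_filters)."""
    pooled = []
    for w in filters:
        k = w.shape[0]
        n_windows = embedded.shape[0] - k + 1
        # All windows of k consecutive word vectors: (n_windows, k, dim)
        windows = np.stack([embedded[i:i + k] for i in range(n_windows)])
        # Valid 1D convolution over time: (n_windows, n_filters)
        conv = np.tensordot(windows, w, axes=([1, 2], [0, 1]))
        # ReLU, then max-over-time pooling: one value per filter
        pooled.append(np.maximum(conv, 0.0).max(axis=0))
    return np.concatenate(pooled)


rng = np.random.RandomState(0)
sentence = rng.normal(size=(200, 8))                       # 200 tokens, 8-dim embeddings
demo_filters = [rng.normal(size=(k, 8, 4)) for k in (3, 5, 7)]
features = kim_features(sentence, demo_filters)
print(features.shape)
```

With three filter sizes and `n_filters` filters each, the pooled vector has `3 * n_filters` entries, regardless of the sentence length.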

In [10]:
# train
BATCH_SIZE = 32
N_EPOCHS = 2

RocAuc = RocAucEvaluation(validation_data=(X_valid, y_valid))

hist = model.fit(X_train, y_train, 
                 batch_size=BATCH_SIZE, 
                 epochs=N_EPOCHS, 
                 validation_data=(X_valid, y_valid),
                 callbacks=[RocAuc])

# save trained nnet to disk for later use
save_nnet(model, MODEL_NAME)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
epoch: 1 - val_roc_auc: 0.9849
Epoch 2/2
epoch: 2 - val_roc_auc: 0.9868


In [12]:
# final model evaluation
y_train_pred = model.predict(X_train, batch_size=512)
train_score = evaluate(y_train, y_train_pred)
print("ROC-AUC score on train set : {:.4f}".format(train_score)) 

y_valid_pred = model.predict(X_valid, batch_size=512)
valid_score = evaluate(y_valid, y_valid_pred)
print("ROC-AUC score on validation set : {:.4f}".format(valid_score))

ROC-AUC score on train set : 0.9938
ROC-AUC score on validation set : 0.9868
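
`evaluate` and `RocAucEvaluation` come from the local modules, so their exact definition is not visible here. The sketch below shows how a ROC-AUC score can be computed from first principles, as the probability that a positive example is scored above a negative one, and how it can be averaged over label columns. Averaging over columns is an assumption about the metric, and the function names are hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC for one binary label: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    # A tie between a positive and a negative score counts as half a correct pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_column_auc(y_true_rows, y_score_rows):
    """Average the per-label ROC-AUC over all label columns."""
    aucs = [roc_auc(t, s) for t, s in zip(zip(*y_true_rows), zip(*y_score_rows))]
    return sum(aucs) / len(aucs)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This pairwise version is quadratic in the number of examples, so it is only suitable for small checks; on the full training set a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice.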


In [13]:
# predict
y_test_pred = model.predict(X_test, batch_size=512, verbose=2)

In [14]:
# write submission file
submission(y_test_pred, id_test, name=MODEL_NAME)
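
`submission` is a helper from the local `tools` module. A plausible shape for such a helper is sketched below: one row per test id, one probability column per label, written as CSV. The function name, the `label_0`/`label_1` column names, and the six-decimal formatting are all hypothetical placeholders, since the real file layout is not shown in this notebook.

```python
import csv
import io


def write_submission(stream, ids, predictions, label_names):
    """Write a header row, then one row per id with its predicted probabilities."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id"] + list(label_names))
    for row_id, probs in zip(ids, predictions):
        writer.writerow([row_id] + ["{:.6f}".format(p) for p in probs])


# Demo on an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission(buffer, ["a1", "b2"], [[0.9, 0.1], [0.25, 0.5]], ["label_0", "label_1"])
print(buffer.getvalue())
```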

# 2. Use pre-trained word2vec word embeddings

The GitHub notebook cited earlier uses word2vec to train embeddings on the task's own corpus, not as a source of pre-trained embeddings: https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb

"In this subsection, I use word2vec to create word embeddings from the review comments. Word2vec is one algorithm for learning a word embedding from a text corpus." Word2vec is fundamentally the network that learns the embeddings; what we want here is a reference set of vectors already trained with that architecture.

We therefore need the word2vec vectors trained on Google News by Mikolov, with embeddings of size 300: https://code.google.com/archive/p/word2vec/

Blog post that loads the word2vec vectors pre-trained by Mikolov:

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Download link for the pre-trained embeddings:

https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download
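
The Google News vectors are distributed in the word2vec binary format: a text header with the vocabulary size and the dimension, then for each word its bytes followed by a space, the raw 32-bit floats, and a newline. The blog post above loads the file with gensim; the standard-library sketch below reads the same layout and keeps only the words of interest. `read_word2vec_binary` is a hypothetical helper, the little-endian float layout is an assumption, and the demo bytes stand in for the real file.

```python
import io
import struct


def read_word2vec_binary(stream, wanted=None):
    """Parse a word2vec binary stream; return (dim, {word: tuple of floats})."""
    n_words, dim = (int(x) for x in stream.readline().decode("utf-8").split())
    vectors = {}
    for _ in range(n_words):
        chars = []
        while True:
            ch = stream.read(1)
            if ch in (b" ", b""):
                break                     # end of the word (or end of the stream)
            if ch != b"\n":
                chars.append(ch)          # skip the newline left by the previous record
        raw = stream.read(4 * dim)
        if len(raw) < 4 * dim:
            break                         # truncated file: stop cleanly
        word = b"".join(chars).decode("utf-8", errors="ignore")
        if wanted is None or word in wanted:
            vectors[word] = struct.unpack("<%df" % dim, raw)
    return dim, vectors


# Two-word, 3-dimensional stand-in for the 3,000,000-word, 300-dimensional file.
demo_bytes = (b"2 3\n"
              + b"cat " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
              + b"dog " + struct.pack("<3f", 0.5, -1.0, 4.0) + b"\n")
demo_dim, demo_w2v = read_word2vec_binary(io.BytesIO(demo_bytes), wanted={"dog"})
print(demo_dim, demo_w2v)
```

Passing the tokenizer vocabulary as `wanted` avoids holding all pre-trained vectors in memory when only a few tens of thousands are needed.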

## Statistical initialization of the embedding matrix

In [9]:
# Load Word2Vec Google News pre-trained embeddings
EMBEDDING_DIM = 300  # the Google News word2vec vectors have size 300
EMBEDDING_SOURCE = 'word2vec_googlenews'  # {'glove_twitter', 'glove_wikipedia', 'word2vec_googlenews'}

embeddings_matrix = load_pretrained_embeddings(tokenizer.word_index, VOCAB_SIZE, EMBEDDING_DIM, EMBEDDING_SOURCE)

Number of pre-trained word vectors in database       : 3000000
Number of our words with a pre-trained embedding     : 24431
Percentage of our words with a pre-trained embedding : 81.437%


## Defining the bidirectional LSTM network for word2vec

In [12]:
TRAIN_EMBEDDINGS = True
MODEL_NAME = "embed_LSTM_BIDIR_word2vec_emb300_pretrained_trainableTrue_goodStatInit"

model_google = bidirectional_lstm(sentence_length=SENTENCE_LENGTH, vocab_size=VOCAB_SIZE,
                    embedding_dim=EMBEDDING_DIM, embedding_matrix=embeddings_matrix, train_embeddings=TRAIN_EMBEDDINGS)

In [13]:
# train
BATCH_SIZE = 32
N_EPOCHS = 2

RocAuc = RocAucEvaluation(validation_data=(X_valid, y_valid))

hist = model_google.fit(X_train, y_train, 
                 batch_size=BATCH_SIZE, 
                 epochs=N_EPOCHS, 
                 validation_data=(X_valid, y_valid),
                 callbacks=[RocAuc])

# save trained nnet to disk for later use
# (the network trained in this section is model_google, not model)
save_nnet(model_google, MODEL_NAME)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
epoch: 1 - val_roc_auc: 0.9799
Epoch 2/2
epoch: 2 - val_roc_auc: 0.9853


In [15]:
# final model evaluation
y_train_pred = model_google.predict(X_train, batch_size=512)
train_score = evaluate(y_train, y_train_pred)
print("ROC-AUC score on train set : {:.4f}".format(train_score)) 

y_valid_pred = model_google.predict(X_valid, batch_size=512)
valid_score = evaluate(y_valid, y_valid_pred)
print("ROC-AUC score on validation set : {:.4f}".format(valid_score))

ROC-AUC score on train set : 0.9927
ROC-AUC score on validation set : 0.9853


In [16]:
# predict
y_test_pred = model_google.predict(X_test, batch_size=512, verbose=2)

In [17]:
# write submission file
submission(y_test_pred, id_test, name=MODEL_NAME)

# 3. What about Freebase?

https://code.google.com/archive/p/word2vec/

# 4. Use pre-trained GloVe word embeddings with an LSTM model

## embedding_matrix with the GloVe embeddings

In [14]:
# Load GloVe pre-trained embeddings
EMBEDDING_DIM = 200  # several embeddings sizes depending on source : 25, 50, 100, 200, 300 
EMBEDDING_SOURCE = 'glove_twitter'  # {'glove_twitter', 'glove_wikipedia', 'word2vec_googlenews'}

embeddings_matrix = load_pretrained_embeddings(tokenizer.word_index, VOCAB_SIZE, EMBEDDING_DIM, EMBEDDING_SOURCE)

## Defining the LSTM network for GloVe

In [17]:
TRAIN_EMBEDDINGS = True
MODEL_NAME = "draft_embed_bidirlstm_2fc_EMB_PRETRAINED_GLOVE200t_TWITTER"

model = bidirectional_lstm(sentence_length=SENTENCE_LENGTH, vocab_size=VOCAB_SIZE,
                    embedding_dim=EMBEDDING_DIM, embedding_matrix=embeddings_matrix, train_embeddings=TRAIN_EMBEDDINGS)

In [None]:
# train
BATCH_SIZE = 32
N_EPOCHS = 2

RocAuc = RocAucEvaluation(validation_data=(X_valid, y_valid))

hist = model.fit(X_train, y_train, 
                 batch_size=BATCH_SIZE, 
                 epochs=N_EPOCHS, 
                 validation_data=(X_valid, y_valid),
                 callbacks=[RocAuc])

# save trained nnet to disk for later use
save_nnet(model, MODEL_NAME)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
  5824/143613 [>.............................] - ETA: 28:16 - loss: 0.1313 - acc: 0.9605

In [14]:
# final model evaluation
y_train_pred = model.predict(X_train, batch_size=512)
train_score = evaluate(y_train, y_train_pred)
print("ROC-AUC score on train set : {:.4f}".format(train_score)) 

y_valid_pred = model.predict(X_valid, batch_size=512)
valid_score = evaluate(y_valid, y_valid_pred)
print("ROC-AUC score on validation set : {:.4f}".format(valid_score))

ROC-AUC score on train set : 0.9933
ROC-AUC score on validation set : 0.9867


In [15]:
# predict
y_test_pred = model.predict(X_test, batch_size=512, verbose=2)

In [16]:
# write submission file
submission(y_test_pred, id_test, name=MODEL_NAME)