<a href="https://colab.research.google.com/github/jumafernandez/clasificacion_correos/blob/main/notebooks/jcc/02-Word2Vec%2BLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM+Word2Vec

This notebook trains and tests a sentence classifier that uses an LSTM on top of pre-trained Word2Vec embeddings.

The main benefit of pre-trained word embeddings is that words that never appear in the training data can still be represented well, because the embeddings were trained on a much larger corpus than the current dataset.
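
As a toy illustration of this point, the sketch below uses a made-up three-word embedding table (the vectors, the vocabulary and the `lookup` helper are assumptions for the example, not part of any real model). A word that is absent from the task's training vocabulary still gets a meaningful vector, and that vector is closer to a related word than to an unrelated one.

```python
import numpy as np

# Hypothetical pre-trained table; a real one would cover a very large vocabulary.
pretrained = {
    "beca": np.array([0.9, 0.1, 0.0]),
    "boleto": np.array([0.8, 0.2, 0.1]),
    "ingreso": np.array([0.1, 0.9, 0.2]),
}

# Words that appear in the (small) task-specific training set.
train_vocab = {"boleto", "ingreso"}


def lookup(word, dim=3):
    """Return the pre-trained vector, or a zero vector if the word is unknown."""
    return pretrained.get(word, np.zeros(dim))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


unseen = "beca"  # not in train_vocab, but present in the pre-trained table
sim_boleto = cosine(lookup(unseen), lookup("boleto"))
sim_ingreso = cosine(lookup(unseen), lookup("ingreso"))
print(sim_boleto > sim_ingreso)  # True
```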


## Loading libraries, the pre-trained word2vec model, and helper functions

In [39]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential

# Dimensionality of the Wiki-words-250 embeddings loaded below
VECTOR_EMBEDDINGS = 250

In [40]:
# Load Pretrained Word2Vec
embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")

In [41]:
embed(['Juan', 'Fernandwxxdo', 'Mujer'])





<tf.Tensor: shape=(3, 250), dtype=float32, numpy=
array([[ 1.42693724e-02, -9.08979494e-03, -3.98036987e-02,
         3.62966694e-02,  9.49378684e-02,  1.33175060e-01,
         5.39111458e-02,  1.10701146e-02,  1.46255925e-01,
        -2.13303957e-02, -3.83548699e-02,  4.41266783e-02,
        -9.04915407e-02,  2.29147449e-02, -3.18627581e-02,
        -1.03184409e-01, -1.68420419e-01, -4.47667427e-02,
         2.33685635e-02, -5.40872291e-03,  1.03382301e-03,
        -1.41673787e-02,  5.66975288e-02, -1.82882976e-02,
        -1.00069664e-01, -1.74700860e-02,  9.95630622e-02,
         9.80864614e-02, -6.69433102e-02,  4.79473807e-02,
         4.57753688e-02, -2.03111470e-02, -5.50246350e-02,
        -1.25571623e-01, -2.35121641e-02, -7.92662799e-02,
        -5.50914090e-03, -7.49683520e-03,  1.01595342e-01,
        -2.06922018e-03,  1.18362702e-01,  1.83063328e-01,
         6.25873879e-02, -1.19993433e-01,  5.89830708e-03,
        -6.69010654e-02, -6.98054358e-02,  5.13526574e-02,
      

### Loading Word2Vec in Spanish

In [52]:
# Load the pre-trained Spanish Word2Vec model
# Reference: https://github.com/dccuchile/spanish-word-embeddings
from os import path

# Download it from the URL if it is not already present
filename = "SBW-vectors-300-min5.bin.gz"
if not path.exists(filename):
  !wget http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz

# Load the vectors into the `embeddings` variable
import gensim

embeddings = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)
# L2-normalize the vectors in place (init_sims is deprecated in gensim 4.x)
embeddings.init_sims(replace=True)

--2021-03-11 02:30:46--  http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz
Resolving cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)... 200.16.17.55
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz [following]
--2021-03-11 02:30:47--  https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1123304474 (1.0G) [application/x-gzip]
Saving to: ‘SBW-vectors-300-min5.bin.gz’


2021-03-11 02:31:50 (17.2 MB/s) - ‘SBW-vectors-300-min5.bin.gz’ saved [1123304474/1123304474]



In [42]:
def get_max_length(text):
    """
    Return the maximum token count over the given sentences.
    This number is used as the fixed input length for the RNN.
    """
    return max((len(row.split(" ")) for row in text), default=0)

def get_word2vec_enc(texts):
    """
    Get the word2vec vector for each word in each sentence.
    Each sentence becomes a (n_tokens, VECTOR_EMBEDDINGS) tensor usable as RNN input.
    """
    encoded_texts = []
    for text in texts:
        tokens = text.split(" ")
        word2vec_embedding = embed(tokens)
        encoded_texts.append(word2vec_embedding)
    return encoded_texts
        
def get_padded_encoded_text(encoded_text, max_length):
    """
    For short sentences, prepend zero padding so all inputs to the RNN have the same length.
    """
    padded_text_encoding = []
    for enc_text in encoded_text:
        # Number of zero rows to prepend (none if the sentence is already long enough)
        zero_padding_cnt = max(max_length - enc_text.shape[0], 0)
        pad = np.zeros((zero_padding_cnt, VECTOR_EMBEDDINGS))
        padded_text_encoding.append(np.concatenate((pad, enc_text), axis=0))
    return padded_text_encoding

def category_encode(category):
    """
    One-hot encode the class labels as dummy variables.
    """
    return pd.get_dummies(category)


def preprocess(x, y, max_length):
    """
    Encode the text and class values as numeric arrays.
    """
    # Encode words into word2vec vectors
    text = x.tolist()

    encoded_text = get_word2vec_enc(text)
    padded_encoded_text = get_padded_encoded_text(encoded_text, max_length)

    # One-hot encode the classes
    categories = y.tolist()
    encoded_category = category_encode(categories)
    X = np.array(padded_encoded_text)
    Y = np.array(encoded_category)
    return X, Y
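
To make the padding step concrete, here is a minimal, self-contained sketch of zero pre-padding on toy data. The `pre_pad` helper and the embedding size of 4 are assumptions for the example (the notebook uses 250-dimensional vectors); only NumPy is needed.

```python
import numpy as np

EMBEDDING_DIM = 4  # toy size; the notebook uses 250


def pre_pad(sequence, max_length, dim=EMBEDDING_DIM):
    """Prepend zero rows so that `sequence` has at least `max_length` rows."""
    missing = max(max_length - sequence.shape[0], 0)
    return np.concatenate((np.zeros((missing, dim)), sequence), axis=0)


short = np.ones((2, EMBEDDING_DIM))     # a 2-token sentence
longest = np.ones((5, EMBEDDING_DIM))   # a 5-token sentence

batch = np.array([pre_pad(s, 5) for s in (short, longest)])
print(batch.shape)          # (2, 5, 4)
print(batch[0, :3].sum())   # 0.0 -> the first three rows of the short sentence are padding
```

Because the zeros are prepended rather than appended, the real tokens are the last thing the LSTM reads before it emits its final state.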

## Loading the dataset and balancing the classes

In [43]:
# Load the file with the queries, which is hosted on GitHub
from os import path

# If the file is not already in Colab, download it
if not path.exists('03-Correos_variables_estaticas.csv'):
  !wget https://raw.githubusercontent.com/jumafernandez/clasificacion_correos/main/data/03-Correos_variables_estaticas.csv

# Read the file into a dataframe
import pandas as pd

df = pd.read_csv('03-Correos_variables_estaticas.csv', delimiter="|")

# Keep only the query text and the class
x_df = df["Consulta"]
y_df = df["Clase"]

In [44]:
# Number of classes (the remaining ones are grouped into "Otras Consultas")
CANTIDAD_CLASES = 4

# Relabel all minority classes (the number of classes sent to "Otras Consultas" can be varied)
# Work on a copy so the assignment below does not trigger pandas' SettingWithCopyWarning
y_df = y_df.copy()
clases = y_df.value_counts()
clases_minoritarias = clases.iloc[CANTIDAD_CLASES-1:].keys().to_list()
y_df.loc[y_df.isin(clases_minoritarias)] = "Otras Consultas"

# Encode the class labels as integers
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(y_df)

# Keep the class names (index = encoded value)
class_list = le.classes_

class_list

array(['Boleto Universitario', 'Ingreso a la Universidad',
       'Otras Consultas', 'Requisitos de Ingreso'], dtype=object)
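
The grouping logic can be sketched without pandas on made-up labels. This is an illustrative equivalent that uses `collections.Counter` instead of `value_counts()`; the labels and the toy `N_CLASSES` value are assumptions for the example.

```python
from collections import Counter

N_CLASSES = 3              # toy value: keep the 2 most frequent labels plus a catch-all
OTHER = "Otras Consultas"

labels = ["A", "A", "A", "B", "B", "C", "D"]  # made-up labels

# most_common() sorts by descending frequency, much like value_counts()
top = [label for label, _ in Counter(labels).most_common(N_CLASSES - 1)]
grouped = [label if label in top else OTHER for label in labels]

print(sorted(set(grouped)))  # ['A', 'B', 'Otras Consultas']
```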

## Train/test split

In [45]:
x = x_df

# Split into training and test data
from sklearn.model_selection import train_test_split

# 80-20 split: training/validation and test
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.2)

## Preprocessing (encoding the text as numeric vectors)

In [46]:
# max_length is used for max sequence of input
max_length = get_max_length(x_train)

train_X, train_Y = preprocess(x_train, y_train, max_length)
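
One caveat of `pd.get_dummies` is that it creates one column per label value present in the data it receives, so a split that happens to lack a class would yield fewer columns than the model's output layer. A possible alternative, sketched here with a hypothetical `one_hot` helper and toy labels, fixes the number of columns up front:

```python
import numpy as np


def one_hot(labels, n_classes):
    """One-hot encode integer labels into a fixed number of columns."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_classes, dtype=np.float32)[labels]


y_toy = [0, 2, 2, 3]                 # class 1 never appears in this toy split
encoded = one_hot(y_toy, n_classes=4)
print(encoded.shape)                 # (4, 4): the column for class 1 is kept, all zeros
```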

## Building the Model

In [47]:
# LSTM model
model = Sequential()
model.add(LSTM(32))
model.add(Dense(CANTIDAD_CLASES, activation='softmax'))

In [48]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
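
For reference, the `categorical_crossentropy` loss used here can be written out by hand. The following is a minimal NumPy sketch on made-up logits, not the Keras implementation; it also shows that a uniform guess over 4 classes gives a loss of ln 4 ≈ 1.386, which is a useful baseline when reading the loss values.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())


# Made-up logits for two samples and four classes
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.0, 0.0, 0.0, 0.0]])   # second row: a uniform guess
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=float)

probs = softmax(logits)
uniform_loss = categorical_crossentropy(y_true[1:], probs[1:])
print(round(uniform_loss, 3))  # 1.386
```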

### Training the model

In [49]:
print('Train...')
model.fit(train_X, train_Y,epochs=50)

Train...
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fc659cf84d0>

In [50]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
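Layer (type)                 Output Shape              Param #   
=================================================================
lstm_3 (LSTM)                (32, 32)                  36224     
_________________________________________________________________
dense_3 (Dense)              (32, 4)                   132       
=================================================================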
Total params: 36,356
Trainable params: 36,356
Non-trainable params: 0
_________________________________________________________________
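
The parameter counts in the summary can be checked by hand. The short sketch below (the helper names are illustrative) applies the standard formulas for an LSTM layer with 32 units fed 250-dimensional inputs and a dense layer with 4 outputs.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    """Weight matrix plus one bias per output."""
    return inputs * outputs + outputs


lstm = lstm_params(units=32, input_dim=250)
dense = dense_params(inputs=32, outputs=4)
print(lstm, dense, lstm + dense)  # 36224 132 36356
```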


### Testing the Model

In [51]:
# max_length is recomputed here, so test inputs are padded to the longest test sentence rather than to the training max_length
max_length = get_max_length(x_test)

test_X, test_Y = preprocess(x_test, y_test, max_length)


score, acc = model.evaluate(test_X, test_Y, verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

7/7 - 1s - loss: 1.0720 - accuracy: 0.6600
Test score: 1.0720036029815674
Test accuracy: 0.6600000262260437
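
Overall accuracy hides how each class behaves. A per-class breakdown could be sketched as follows, assuming integer class labels; the labels below are made up, and with the real model `y_pred` would come from something like `np.argmax(model.predict(test_X), axis=1)`.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


# Made-up labels for illustration only
y_true = np.array([0, 0, 1, 2, 2, 3])
y_pred = np.array([0, 2, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=4)
accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)    # per-class recall (every class appears in y_true here)

print(cm)
print(round(float(accuracy), 3))  # 0.667
```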
