<a href="https://colab.research.google.com/github/jumafernandez/clasificacion_correos/blob/main/notebooks/05-Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trying Word2Vec with pre-trained embeddings

1. Install gensim:


In [31]:
!pip install gensim
import gensim



2. Download the pre-trained Spanish embeddings:

    __Source:__ https://github.com/dccuchile/spanish-word-embeddings

In [32]:
from os import path

# Download the vectors only if they are not already present locally
if not path.exists("SBW-vectors-300-min5.bin.gz"):
  !wget http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz

3. Load the pre-trained model as gensim `KeyedVectors`:

In [42]:
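from gensim.models import KeyedVectors

filename = "SBW-vectors-300-min5.bin.gz"
# The file is in the binary word2vec format
embeddings = KeyedVectors.load_word2vec_format(filename, binary=True)
# L2-normalize the vectors in place (gensim < 4.0 API; deprecated in gensim 4.x)
embeddings.init_sims(replace=True)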
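
For intuition, `init_sims(replace=True)` rescales every vector to unit Euclidean length, so that a plain dot product between two vectors equals their cosine similarity. Below is a minimal, library-free sketch of that idea. It uses made-up 2-dimensional vectors rather than the real 300-dimensional embeddings, and the helper names are illustrative, not part of gensim:

```python
import math

def l2_normalize(vector):
    """Return the vector scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy vectors (hypothetical values, not taken from the SBWCE embeddings)
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

# After normalization, the dot product is the cosine similarity
cosine = dot(u, v)
print(u, v, round(cosine, 2))
```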

4. Explore a slice of the vocabulary, for example the 50 terms at positions 1000 to 1049:

In [43]:
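from itertools import islice
# `vocab` is the gensim < 4.0 attribute; gensim >= 4.0 exposes the vocabulary as `embeddings.index_to_key`
list(islice(embeddings.vocab, 1000, 1050))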

['refiere',
 'hayan',
 'gt',
 'ciudades',
 'Asuntos',
 'estilo',
 'lleva',
 'dispuesto',
 'curso',
 'bienes',
 'imagen',
 'Costa',
 'gubernamentales',
 'distintos',
 'sectores',
 'realizado',
 'continuación',
 'pruebas',
 'terreno',
 'especies',
 'propiedad',
 'Sus',
 'indicó',
 'ocasiones',
 'presenta',
 'instalaciones',
 'presentar',
 'regiones',
 'haciendo',
 'prueba',
 'Federación',
 'III',
 'allí',
 'escuela',
 'diálogo',
 'aplicar',
 'Más',
 'hotel',
 'quiere',
 'responsable',
 'edificio',
 'obligaciones',
 'entidad',
 'cuarto',
 'declaraciones',
 'señor',
 'delito',
 'intereses',
 'inversión',
 'Banco']

5. Run a semantic test: look for the term closest to `mujer` (woman) plus `rey` (king) minus `hombre` (man):

In [44]:
result = embeddings.most_similar(positive=['mujer', 'rey'], negative=['hombre'], topn=1)
print(result)

[('reina', 0.7493031024932861)]
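
The query above amounts to vector arithmetic: add the vectors of the positive words, subtract the vectors of the negative ones, and return the vocabulary word whose vector has the highest cosine similarity with the result, excluding the query words themselves. Below is a toy sketch with invented 2-dimensional vectors. The dictionary, its values, and the `analogy` helper are illustrative assumptions; gensim's own `most_similar` works on unit-normalized vectors and is far more efficient:

```python
import math

# Hypothetical 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender
toy_vectors = {
    "rey":    [1.0, 1.0],
    "reina":  [1.0, -1.0],
    "hombre": [0.0, 1.0],
    "mujer":  [0.0, -1.0],
    "hotel":  [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy(positive=["mujer", "rey"], negative=["hombre"], vectors=toy_vectors)
print(best)
```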


6. Load my labeled data and handle class imbalance:

In [47]:
import pandas as pd

# Download the file with the queries, hosted on GitHub
if not path.exists("Correos_Seleccionados_y_Etiquetados.csv"):
  !wget https://raw.githubusercontent.com/jumafernandez/UNLP/master/TFI/data/Correos_Seleccionados_y_Etiquetados.csv

df = pd.read_csv('Correos_Seleccionados_y_Etiquetados.csv', delimiter="|")

# Handle class imbalance: keep the most frequent classes and merge the
# minority classes into "Otras Consultas" (other queries)
cantidad_clases = 3
clases = df['Clase'].value_counts()
clases_minoritarias = clases.iloc[cantidad_clases:].index.to_list()
# Use .loc to assign in place and avoid pandas' SettingWithCopyWarning
df.loc[df['Clase'].isin(clases_minoritarias), 'Clase'] = "Otras Consultas"

df.head()


Unnamed: 0,Fecha,Hora,Apellido y Nombre,Legajo,Documento,Carrera,Teléfono,E-mail,Consulta,Respuesta,Clase
0,08-05-2019,10:49:26,florencia roland,169336,33829069,licenciatura en enfermeria(52),1121550750.0,rolandflorencia@gmail.com,"hola quiero anotarme a las materias ,para el s...",te falta presentar alguna de las vacunas sal...,Otras Consultas
1,08-08-2017,12:29:59,lourdes vanesa gómez,150786,33220121,licenciatura en enfermeria(52),1131066251.0,vane_male@outlook.com,hola buenos días! quería saber cuando voy a po...,lo que falta es que la coordinación autorice l...,Otras Consultas
2,05-31-2017,01:30:49,karg solange,156535,43455018,contador publico(54),,solangekarg8@gmail.com,hola quisiera saber si en la consulta de situa...,"no, las notas de parciales no aparecen en tu s...",Otras Consultas
3,02-05-2018,22:58:24,topa maria luz,155395,38859638,licenciatura en trabajo social(5),1566431259.0,luztopa@hotmail.com,buenas noches. en mi situacion academica apare...,es que tenes que mirar por la opción finales l...,Otras Consultas
4,08-06-2016,13:16:16,yanet elizabeth marquez,115623,35756071,contador publico(54),44556937.0,yanet868@hotmail.com,"hola, quisiera obtener mi promedio o saber co...",lo calculas sumando las calificaciones de la o...,Otras Consultas


7. Download the NLTK stop-word corpus:

In [None]:
import nltk
nltk.download('stopwords')

8. Incorporate the embeddings: represent each query as the average of its word vectors:

In [53]:
import numpy as np

count_esta = 0
count_no_esta = 0

stopwords = set(nltk.corpus.stopwords.words('spanish'))  # Spanish stop words to skip
doc_vectors_list = []  # one averaged vector per document

# Basic cleaning: lowercase and keep only a-z and spaces.
# Note: this pattern also drops accented letters and ñ (e.g. "situación" becomes "situacin").
for doc in df['Consulta'].str.lower().str.replace('[^a-z ]', '', regex=True):
    word_vectors = []  # vectors of this document's words that exist in the embeddings
    for word in doc.split(' '):  # split the document into words
        if word not in stopwords:  # skip stop words
            try:
                word_vectors.append(embeddings[word])  # look the word up in the pre-trained embeddings
                count_esta += 1  # count the words that are found
            except KeyError:
                count_no_esta += 1  # count the out-of-vocabulary words
    # Average the word vectors; a document with no known words gets a NaN vector
    if word_vectors:
        doc_vectors_list.append(np.mean(word_vectors, axis=0))
    else:
        doc_vectors_list.append(np.full(embeddings.vector_size, np.nan))

# Final dataframe: one row per document, one column per embedding dimension
docs_vectors = pd.DataFrame(doc_vectors_list)

print(f"{count_esta} are in the pre-trained embeddings")
print(f"{count_no_esta} are NOT in the pre-trained embeddings")

14452 are in the pre-trained embeddings
3707 are NOT in the pre-trained embeddings
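
Part of the out-of-vocabulary count may come from the cleaning step: the pattern `[^a-z ]` removes accented letters and `ñ`, so a word like `situación` is looked up as `situacin`. Below is a hedged sketch of an alternative tokenizer that keeps Spanish letters. The tiny vocabulary is a hypothetical stand-in for the embeddings, the helper names are invented for illustration, and whether this improves coverage on the real data has not been measured here:

```python
import re

# Hypothetical stand-in for the embedding vocabulary
toy_vocab = {"situación", "académica", "materias", "quiero"}

def tokens_ascii_only(text):
    """Same pattern as the cell above: keep only a-z and spaces."""
    cleaned = re.sub(r"[^a-z ]", "", text.lower())
    return [w for w in cleaned.split(" ") if w]

def tokens_spanish(text):
    """Keep a-z plus accented vowels, ü and ñ."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def coverage(tokens, vocab):
    """Return (found, missing) token counts with respect to the vocabulary."""
    found = sum(1 for t in tokens if t in vocab)
    return found, len(tokens) - found

text = "Quiero ver mi situación académica"
print(tokens_ascii_only(text), coverage(tokens_ascii_only(text), toy_vocab))
print(tokens_spanish(text), coverage(tokens_spanish(text), toy_vocab))
```

In this toy case the ASCII-only cleaning finds one word out of five, while the accent-preserving tokenizer finds three.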


9. Add the class label to the vectorized documents:

In [55]:
docs_vectors['Clase'] = df['Clase']
docs_vectors.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,Clase
0,0.006431,-0.009342,0.019562,0.004289,0.011717,-0.029468,-0.013735,-0.016649,0.056541,0.013139,0.023258,-0.035717,-0.026671,-0.059891,0.039803,0.010155,-0.034154,0.027674,0.012747,0.042631,0.011165,-0.001284,0.026483,0.023362,0.004638,-0.061453,-0.03484,0.048891,-0.101722,0.053355,0.004811,-0.031459,-0.019401,-0.016548,0.001853,-0.03143,-0.027833,-0.021435,-0.02226,-0.01663,...,0.049395,-0.033905,-0.013737,-0.026353,0.012318,-0.013092,0.00846,-0.027355,0.065042,-0.001175,0.062504,0.033647,0.001586,-0.006234,-0.030336,-0.05175,0.046526,-0.001611,0.021513,-0.015356,-0.022595,-0.016073,0.078183,0.040872,0.002164,0.03693,0.02371,-0.016223,-0.034074,-0.048917,0.018302,-0.035978,0.019141,-0.04545,-0.047646,-0.000216,-0.033998,-0.005368,0.03491,Otras Consultas
1,0.002369,-0.015795,0.008516,-0.009306,-0.008713,-0.014095,-0.034005,-0.054447,0.054238,-0.011008,0.030823,-0.030618,-0.007342,-0.02942,0.008231,0.021344,-0.027991,-0.00871,0.005828,0.072837,0.005667,-0.018589,-0.012001,0.016404,-0.02119,-0.051351,-0.030322,0.044252,-0.074323,0.040171,-0.001171,-0.036806,-0.028483,-0.019692,-0.00619,-0.036091,-0.039644,-0.017939,-0.011261,-0.051703,...,0.082673,-0.029352,-0.018165,-0.02682,0.00809,0.006627,-0.019473,-0.030422,0.081443,-0.003396,0.036391,0.022918,0.012265,-0.012186,-0.022778,-0.011106,0.038096,-0.005326,-0.015675,-0.017562,-0.008405,-0.033801,0.061027,0.044871,-0.011784,0.020978,0.007014,-0.007709,-0.02098,-0.036263,0.000915,-0.036715,0.010582,-0.028672,-0.028392,-0.008085,-0.0274,0.007835,0.025657,Otras Consultas
2,0.013886,-0.00583,0.022338,0.018002,-0.001219,-0.014044,-0.013217,-0.025088,0.032274,0.022423,0.011219,-0.012576,-0.007802,-0.025862,0.031199,0.007098,-0.0425,-0.024051,0.041442,0.033397,0.012645,0.006965,0.001999,0.026033,-0.005464,-0.055565,-0.018448,0.037779,-0.054619,0.027261,0.02829,-0.026796,0.011784,-0.02871,0.014743,-0.042365,-0.039483,-0.02327,-0.034164,-0.010689,...,0.087875,-0.045448,-0.014048,-0.016818,-0.005466,-0.023313,-0.014992,-0.01046,0.084187,-0.008849,0.042013,0.025929,0.010893,0.019059,-0.022715,-0.05122,0.044968,0.013569,0.003081,-0.044446,-0.016212,0.032907,0.054954,0.042064,-0.022351,0.012225,0.008397,-0.002949,-0.040603,-0.019559,-0.016417,-0.033908,0.011083,-0.02565,-0.039967,-0.005116,-0.032856,0.016731,0.037949,Otras Consultas
3,0.004119,-0.002486,0.044253,-0.009965,0.006727,-0.04033,-0.014752,-0.052373,0.038104,-0.005434,0.028855,-0.003339,-0.028225,-0.025469,0.028341,0.005477,-0.027525,0.000942,-0.00554,0.031093,-0.012317,-0.001265,0.004829,0.045481,0.017422,-0.060793,-0.017276,0.043502,-0.057966,0.06056,0.016564,-0.029534,-0.029159,0.009814,-0.000298,-0.03316,-0.042544,-0.015639,-0.004029,-0.037405,...,0.092389,-0.03796,0.00586,-0.032974,-0.005161,-0.00802,-0.008256,-0.057044,0.043194,0.03728,0.032843,0.026802,0.02284,0.002081,-0.028819,-0.019784,0.058049,0.013038,-0.015701,-0.027065,-0.004788,-0.024877,0.02536,0.047831,-0.041555,0.039662,0.014778,-0.007117,-0.014467,-0.058065,0.022331,-0.004518,-0.006239,-0.033535,-0.030493,0.005118,-0.014152,-0.017904,0.024175,Otras Consultas
4,-0.013964,-0.007655,0.023141,0.00544,0.00833,-0.023091,-0.009174,-0.019315,0.035563,0.016122,0.048655,-0.04897,-0.021151,-0.065668,0.020106,0.005968,-0.027917,0.003851,0.037852,0.053073,0.017134,0.016696,0.009764,0.049485,0.009628,-0.07068,-0.04681,0.047627,-0.118441,0.035296,0.007232,-0.03907,-0.024893,-0.040426,-0.009545,-0.03345,-0.031191,-0.006677,-0.008112,-0.009537,...,0.048867,-0.059556,-8.9e-05,-0.01592,0.012154,-0.01973,0.023718,-0.028236,0.077646,0.006075,0.061348,0.04722,0.02482,-0.009576,-0.023623,-0.007919,0.073704,0.053593,0.039357,-0.007499,-0.046087,-0.016846,0.047867,0.024816,-0.014027,0.0396,0.013614,-0.000791,-0.015289,-0.061121,-0.00531,-0.039734,0.005953,-0.027858,-0.068998,-0.004713,-0.04271,0.008178,0.010627,Otras Consultas


10. Split into training and test sets:

In [57]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(docs_vectors.drop('Clase', axis = 1),
                                                   docs_vectors['Clase'],
                                                   test_size = 0.2,
                                                   random_state = 1)
train_x.shape, train_y.shape, test_x.shape, test_y.shape

((800, 300), (800,), (200, 300), (200,))
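
Because the classes are imbalanced, it may be worth keeping the class proportions identical in both partitions. `train_test_split` accepts a `stratify` argument for this. Below is a small sketch on synthetic labels; the data and class names are invented for illustration and are unrelated to the e-mail dataset:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 / 20 / 10 (hypothetical class names)
labels = ["otras"] * 70 + ["clase_a"] * 20 + ["clase_b"] * 10
features = [[i] for i in range(len(labels))]  # dummy one-column features

x_train, x_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,
    random_state=1,
    stratify=labels,  # preserve the class proportions in both partitions
)

print(Counter(y_train))
print(Counter(y_test))
```

In the notebook, the equivalent change would be passing `stratify=docs_vectors['Clase']` to the existing call. That would alter the partitions and therefore the accuracy figures reported in this notebook.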

11. Train the classifiers:

In [58]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

model = AdaBoostClassifier(n_estimators=800, random_state = 1)
model.fit(train_x, train_y)
test_pred = model.predict(test_x)

accuracy_score(test_y, test_pred)

0.665

In [59]:
from sklearn.linear_model import LogisticRegression

modelo_regresion = LogisticRegression()
modelo_regresion.fit(train_x, train_y)

# Predict the labels for the test set
test_pred = modelo_regresion.predict(test_x)

accuracy_score(test_y, test_pred)

0.665
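
Both classifiers report the same accuracy, so it is worth comparing that figure with a trivial baseline that always predicts the most frequent class. If the baseline scores about the same, the models may have learned little beyond the class prior. Below is a minimal sketch with invented labels; the counts and the helper name are hypothetical and do not come from the notebook's `test_y`:

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label in y_true."""
    most_common_label, count = Counter(y_true).most_common(1)[0]
    return most_common_label, count / len(y_true)

# Hypothetical test labels, for illustration only
toy_test_y = ["Otras Consultas"] * 6 + ["Clase A"] * 3 + ["Clase B"] * 1

label, baseline = majority_baseline_accuracy(toy_test_y)
print(label, baseline)
```

With the notebook's data, calling `majority_baseline_accuracy(list(test_y))` would give the figure to compare against 0.665.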

## References
- <a href="https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568">Multi-Class Text Classification Model Comparison and Selection</a>
- <a href="https://unipython.com/como-desarrollar-embeddings-incrustaciones-de-palabras-con-gensim/">Cómo desarrollar embeddings (incrustraciones) de palabras con GENSIM</a>
- <a href="https://www.kaggle.com/ananyabioinfo/text-classification-using-word2vec">Text classification using word2vec</a>