# Lab 2: Representation learning with embeddings


**Team members:**
1. Ceballos Equihua Conan Nathaniel
2. Murrieta Villegas Alfonso
3. Salas Mora Mónica

Starting from the corpus selected in the previous assignment, build an
embedding model based on Word2Vec (using Negative Sampling is not required).



## 1. Importing libraries and loading the dataset produced in the previous assignment


In [None]:
pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 6.6 MB/s
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
import gensim
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from sklearn import utils
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.graph_objects as go

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
urlData = 'https://raw.githubusercontent.com/aMurryFly/nlp_course/main/data/process_corpus_.csv'
dataORIGINAL = pd.read_csv(urlData)#, encoding='latin1')
# Drops rows with NaN value on either clean_headline or clean_description columns
data = dataORIGINAL.dropna(subset=['clean_headline', 'clean_description'])

NOTE: The original columns are headline and short_description; the columns cleaned with regular expressions are clean_headline and clean_description; and the subword columns are the final result of the previous assignment.


In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,category,headline,short_description,keywords,clean_headline,clean_description,subword_headline,subword_description
0,0,BUSINESS,America's Most Expensive Neighborhoods: 24/7 W...,"A little-known Northern California town, estab...",,americas expensive neighborhoods wall st,littleknown northern california town establish...,americas@ exp ens ive@ ne igh bor hood s@ wall...,little know n@ nor ther n@ c ali for ni a@ tow...
1,1,BUSINESS,Investment Crowdfunding Draws a Crowd,It's getting crowded in the investment crowdfu...,investment-crowdfunding-draw,investment crowdfunding draws crowd,getting crowded investment crowdfunding space ...,invest ment@ cr ow d fun ding@ d ra w s@ cr ow d@,getting@ c row d ed@ inv est ment@ c row d fun...
2,2,BUSINESS,Office Romances Often Sparked By Emoticon-Lace...,But even though they might literally be sendin...,office-romances-emoticons,office romances often sparked emoticonlaced em...,even though might literally sending wrong mess...,off ice@ r om anc es@ o ft en@ sp ar k ed@ em ...,even@ though@ might@ li ter ally@ s ending@ wr...
3,3,BUSINESS,Verizon Could Buy Yahoo In The Next Few Days,A deal could be coming soon.,verizon-could-buy-yahoo-next-few-days,verizon buy yahoo next days,deal coming soon,ver iz on@ bu y@ y ah oo@ next@ days@,deal@ coming@ so on@
4,4,BUSINESS,What's Your Meeting Brand?,Many leaders operate as if their meeting brand...,whats-your-meeting-brand,meeting brand,many leaders operate meeting brand directly af...,meet ing@ br and@,many@ leaders@ op er ate@ me e ting@ br and@ d...


In [None]:
# Separate data for training and testing

# Keep only the label and the cleaned description, with a fresh 0..n-1 index
data_desc = data[["category", "clean_description"]].copy()
data_desc.index = range(data_desc.shape[0])
print(data_desc)

train_data_desc, test_data_desc = train_test_split(data_desc, test_size = 0.3, random_state = 0)

print("\nTotal number of rows: " + str(len(data_desc)))
print("Training set size: " + str(len(train_data_desc)))
print("Test set size: " + str(len(test_data_desc)))

         category                                  clean_description
0        BUSINESS  littleknown northern california town establish...
1        BUSINESS  getting crowded investment crowdfunding space ...
2        BUSINESS  even though might literally sending wrong mess...
3        BUSINESS                                   deal coming soon
4        BUSINESS  many leaders operate meeting brand directly af...
...           ...                                                ...
29945  WORLD NEWS  many willing endure unthinkable horror escape ...
29946  WORLD NEWS  israel facing kind struggle many countries enc...
29947  WORLD NEWS  macron merkel must figure common way forward o...
29948  WORLD NEWS  mexico become dangerous place courage devote l...
29949  WORLD NEWS  freedom happiness gratitude words people use d...

[29950 rows x 2 columns]

Total number of rows: 29950
Training set size: 20965
Test set size: 8985
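
The split sizes printed above can be checked by hand. The sketch below assumes that, for a fractional test_size, scikit-learn rounds the test share up and assigns the remaining rows to training; the variable names are illustrative only.

```python
import math

n_samples = 29950   # rows remaining after dropna, as printed above
test_size = 0.3     # value passed to train_test_split

# Assumption: the test share is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the arithmetic reproduces the 20965 / 8985 split reported by the notebook.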


## 2. Obtaining the training pairs from the contexts

Because this work relies on Gensim, the following projects, which illustrate how to use the library's doc2vec model, were taken as references.

1. https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint.ipynb

2. https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb

Because Gensim is used, the training pairs are TaggedDocument objects, each of which holds the list of words in the document and a list of tags (labels) corresponding to the document's class.

The example provided at https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html was also used as a reference.
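
To make the shape of a training pair concrete, here is a minimal sketch that does not depend on Gensim. `TaggedDoc` and `to_tagged` are hypothetical stand-ins that only mirror the two fields described above (words and tags), and the whitespace split is a simplification of the NLTK tokenization used in the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields as a TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def to_tagged(description, category):
    """Pair a whitespace-tokenized description with its category label."""
    return TaggedDoc(words=description.lower().split(), tags=[category])

doc = to_tagged("deal coming soon", "BUSINESS")
print(doc.words)
print(doc.tags)
```

Because the tag is the category rather than a unique document id, all documents of one category share a single tag.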


In [None]:
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            tokens.append(word.lower())
    return tokens

In [None]:
train_tagged_desc = train_data_desc.apply(
    lambda r: TaggedDocument(words = tokenize_text(r['clean_description']), tags = [r.category]), axis = 1)
test_tagged_desc = test_data_desc.apply(
    lambda r: TaggedDocument(words = tokenize_text(r['clean_description']), tags = [r.category]), axis = 1)

In [None]:
print("Training sample\nType: " + str(type(train_tagged_desc[:1])) + "\n" + str(train_tagged_desc[:5]))
print(*train_tagged_desc[:5].tolist(), sep='\n')
print("\nTesting sample\nType: " + str(type(test_tagged_desc[:1])) + "\n" + str(test_tagged_desc[:5]))
print(*test_tagged_desc[:5].tolist(), sep='\n')

Training sample
Type: <class 'pandas.core.series.Series'>
2694     ([pension, institutional, investment, funds, a...
9520     ([usda, secretary, nominee, sonny, perdues, re...
21160    ([different, recipe, knew, share, also, recipe...
12793    ([sometimes, letting, steam, pot, helpful, avo...
320      ([taxpayers, seen, drastic, cuts, public, serv...
dtype: object
TaggedDocument(['pension', 'institutional', 'investment', 'funds', 'actually', 'loans', 'would', 'get', 'paid', 'fair', 'market', 'value', 'mortgage', 'resolution'], ['BUSINESS'])
TaggedDocument(['usda', 'secretary', 'nominee', 'sonny', 'perdues', 'record', 'raising', 'concerns', 'probably', 'stall', 'confirmation'], ['POLITICS'])
TaggedDocument(['different', 'recipe', 'knew', 'share', 'also', 'recipe', 'customize', 'family', 'see', 'thats'], ['FOOD & DRINK'])
TaggedDocument(['sometimes', 'letting', 'steam', 'pot', 'helpful', 'avoiding', 'future', 'resentment', 'provided', 'learn', 'nonreactive', 'take', 'personally', 'say', 

Size of the resulting corpus:

In [None]:
# Add the lengths; `+` on two Series would align them by index instead of concatenating
len(train_tagged_desc) + len(test_tagged_desc)

29950

## 3. Neural network


Building a neural network with one layer of 128 hidden units.
Training the network to obtain the embeddings.

NOTE: because Gensim is used, the structure of the model cannot be modified, only certain parameters, such as the training algorithm (here dm=0, which selects *distributed bag of words*, or DBOW), the learning rate (*alpha*, with *min_alpha* setting the value to which the learning rate decays as training progresses), the dimensionality of the feature vectors (*vector_size*), the minimum frequency a word needs in order to be taken into account (*min_count*), and whether *negative sampling* is used and how many *noise words* are drawn (*negative*), among others. Documentation: https://radimrehurek.com/gensim/models/doc2vec.html.

In [None]:
model_dbow_desc = Doc2Vec(dm = 0, vector_size = 128, alpha = 0.065, min_alpha = 0.065, min_count = 1, negative = 5)
model_dbow_desc.build_vocab([x for x in tqdm(train_tagged_desc.values)])

100%|██████████| 20965/20965 [00:00<00:00, 1420551.90it/s]


In [None]:
for epoch in range(30):
    model_dbow_desc.train(utils.shuffle([x for x in tqdm(train_tagged_desc)]), total_examples = len(train_tagged_desc), epochs = 1)
    model_dbow_desc.alpha -= 0.002
    model_dbow_desc.min_alpha = model_dbow_desc.alpha

100%|██████████| 20965/20965 [00:00<00:00, 1456046.88it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1448754.17it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1314265.82it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2103472.95it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2216794.40it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1413382.36it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1553981.26it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2076942.31it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1284808.57it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2196801.82it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2115364.41it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1444802.72it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2031127.05it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1299504.68it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1378852.86it/s]
100%|██████████| 20965/20965 [00:00<00:00, 2149229.69it/s]
100%|██████████| 20965/20965 [00:00<00:00, 1335199.72it/
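
The training loop above lowers *alpha* by hand after every pass. The sketch below only tabulates the resulting learning rates; it assumes that each call to train uses the model's current alpha and min_alpha, and the variable names are illustrative.

```python
start_alpha = 0.065  # initial learning rate passed to Doc2Vec
step = 0.002         # amount subtracted after every pass
n_passes = 30        # iterations of the training loop

# Learning rate in effect during each pass (pass 0 uses start_alpha unchanged).
schedule = [start_alpha - step * i for i in range(n_passes)]

print(f"first pass: {schedule[0]:.3f}")
print(f"last pass:  {schedule[-1]:.3f}")
```

With these values the rate stays positive throughout and ends near 0.007. A larger step or more passes would eventually drive it below zero, so the three numbers have to be chosen together.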

In [None]:
print(f"The word 'economic' appeared {model_dbow_desc.wv.get_vecattr('economic', 'count')} times in the training corpus.")

The word 'economic' appeared 130 times in the training corpus.


In [None]:
print("Vector for the word 'economic':")
model_dbow_desc.wv['economic']

Vector for the word 'economic':


array([ 7.4432604e-04,  6.7666881e-03, -7.1584303e-03,  3.5435762e-03,
       -3.9355773e-03, -5.5194348e-03, -4.4522099e-03, -1.7174296e-03,
        7.5029768e-03, -7.3310193e-03,  5.4428801e-03,  1.2219734e-03,
        1.6115420e-04,  2.3328364e-03, -7.7509806e-03, -6.9922023e-03,
        5.6693181e-03,  3.2610372e-03,  1.0681283e-03, -7.0845410e-03,
        4.9575251e-03, -3.5520643e-03,  2.2673085e-03, -6.0153287e-03,
        4.4362266e-03,  3.8694646e-03, -6.9131423e-03,  1.5859287e-03,
        5.6804866e-03,  8.2476065e-04,  7.1614273e-03, -6.6940114e-03,
       -1.3119057e-03,  3.1260792e-03,  5.8824960e-03, -7.3782261e-03,
        4.1465014e-03, -6.2560234e-03, -2.9166974e-04,  5.5764616e-03,
       -1.6990602e-03, -9.2182495e-04, -3.1702016e-03,  1.1174642e-03,
        4.4349805e-03,  6.6295061e-03, -6.1835572e-03, -3.6766473e-03,
        7.5351670e-03, -5.9115067e-03, -5.3236075e-03,  5.7176054e-03,
        4.7806073e-03, -6.3178260e-03,  8.0449693e-04,  7.1042925e-03,
      

In [None]:
print("Normalized vector for the word 'economic':")
model_dbow_desc.wv.get_vector('economic', norm = True)

Normalized vector for the word 'economic':


array([ 0.0142686 ,  0.12971619, -0.1372258 ,  0.06792971, -0.0754443 ,
       -0.10580657, -0.08534806, -0.03292282,  0.1438307 , -0.14053431,
        0.10433903,  0.02342501,  0.0030893 ,  0.04472005, -0.1485849 ,
       -0.13403925,  0.1086798 ,  0.06251349,  0.02047583, -0.13580938,
        0.09503486, -0.06809243,  0.04346389, -0.11531276,  0.08504166,
        0.07417694, -0.13252369,  0.03040197,  0.1088939 ,  0.01581051,
        0.13728327, -0.12832299, -0.025149  ,  0.05992637,  0.11276639,
       -0.14143926,  0.07948768, -0.11992684, -0.00559126,  0.10689976,
       -0.03257068, -0.01767122, -0.06077219,  0.02142159,  0.08501777,
        0.12708643, -0.11853767, -0.07048067,  0.14444779, -0.11332251,
       -0.10205259,  0.10960546,  0.09164338, -0.12111158,  0.01542206,
        0.136188  ,  0.11880226, -0.05808588, -0.06137659, -0.00762874,
       -0.03466398, -0.01486836, -0.04949834,  0.08614331,  0.07632186,
        0.07856124,  0.10525833,  0.08288524,  0.0184363 , -0.03
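
The raw and normalized outputs above differ by the same factor in every component (roughly 0.052), which is consistent with dividing the vector by its Euclidean norm. A minimal sketch of that operation, assuming L2 normalization and using a toy vector rather than the model's values:

```python
import math

def unit_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(component * component for component in vector))
    return [component / norm for component in vector]

toy = [3.0, 4.0]                  # toy vector whose norm is 5
normalized = unit_normalize(toy)
print(normalized)
```

Normalizing keeps the direction of the vector and discards its length, which is what cosine-similarity comparisons rely on.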

In [None]:
# Get the embeddings from the model with infer_vector, which infers a vector for each document using the trained model

def get_embeddings(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs = 20)) for doc in sents])
    return targets, regressors
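
The comprehension followed by `zip(*...)` in get_embeddings builds (label, vector) pairs and then transposes them into two parallel tuples. The sketch below isolates that idiom; `fake_infer` is a hypothetical stand-in for model.infer_vector and does not produce real embeddings.

```python
def fake_infer(words):
    # Hypothetical stand-in for model.infer_vector: returns a 2-number "embedding".
    return [float(len(words)), float(sum(len(w) for w in words))]

docs = [
    (["deal", "coming", "soon"], ["BUSINESS"]),
    (["meeting", "brand"], ["BUSINESS"]),
]

# Build (label, vector) pairs, then transpose them into two parallel tuples.
pairs = [(tags[0], fake_infer(words)) for words, tags in docs]
targets, regressors = zip(*pairs)

print(targets)
print(regressors)
```

Position i of `targets` and position i of `regressors` always refer to the same document, which is what a downstream classifier would need.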

In [None]:
y_train_labels_desc, X_train_embeddins_desc = get_embeddings(model_dbow_desc, train_tagged_desc)
y_test_labels_desc, X_test_embeddins_desc = get_embeddings(model_dbow_desc, test_tagged_desc)

In [None]:
print("y axis, i.e. the labels: " + str(set(y_test_labels_desc)))
print("First five test labels: " + str(y_test_labels_desc[:5]))
print("First five training embeddings:")
X_train_embeddins_desc[:5]

y axis, i.e. the labels: {'BUSINESS', 'WELLNESS', 'FOOD & DRINK', 'PARENTING', 'POLITICS', 'WORLD NEWS'}
First five test labels: ('PARENTING', 'WORLD NEWS', 'FOOD & DRINK', 'PARENTING', 'WELLNESS')
First five training embeddings:


(array([ 0.02103367, -0.11758392, -0.18193802,  0.17595255, -0.01418224,
         0.02661354, -0.17660363, -0.08029262, -0.09313839,  0.06845414,
        -0.12586948,  0.03556092,  0.06418078, -0.00439869,  0.01528305,
        -0.17160633, -0.06462339,  0.11337325, -0.15157287,  0.04021445,
        -0.11640828,  0.06296335, -0.1323707 ,  0.23071061,  0.10206681,
        -0.11079489, -0.0987075 , -0.03779867,  0.06382439, -0.01328756,
         0.06832284, -0.00548487, -0.06920493,  0.11900113,  0.08485771,
         0.05893385, -0.03202688, -0.1714899 , -0.00219152,  0.01909421,
        -0.15072843, -0.12655395,  0.17888188, -0.16958421, -0.04327394,
         0.14235675,  0.1180502 ,  0.10807802,  0.0745083 , -0.10939606,
        -0.1058779 ,  0.12083827, -0.07131643, -0.16671295, -0.00645302,
        -0.03751068, -0.05038116,  0.16543321, -0.19422026,  0.12312002,
         0.14444873,  0.20793648,  0.01419958, -0.01568417,  0.04004592,
         0.06798903,  0.11684328,  0.04336371,  0.0

## 4. Visualizing the embeddings

In [None]:
pca = PCA(n_components=3)
X = pca.fit_transform(X_train_embeddins_desc)
print(X)

[[-0.10151977  0.41897866 -0.30539418]
 [-0.24709658  0.16066292  0.08089196]
 [ 0.08719808 -0.2312198  -0.01761065]
 ...
 [-0.28071395  0.15874381  0.14260716]
 [ 0.30717324  0.32921194 -0.18429641]
 [-0.02591095  0.12670829 -0.32722286]]
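
For intuition about what the projection computes, here is a sketch of PCA via a singular value decomposition in NumPy. It is not scikit-learn's implementation: the sign of each axis may differ from the output above, and the random data merely stands in for the 128-dimensional embeddings.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal axes via SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T    # coordinates along each axis

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 8))              # stand-in for the real embeddings
Z = pca_project(toy, 3)
print(Z.shape)
```

The columns of `Z` are ordered by decreasing variance, so the first three coordinates keep as much of the spread as any three-dimensional linear projection can.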


In [None]:
# Scale every projected point to unit length (row-wise L2 normalization)
norms = X / np.linalg.norm(X, axis=1, keepdims=True)
print(norms)

[[-0.19215826  0.79304955 -0.57805502]
 [-0.8084693   0.5256691   0.26466844]
 [ 0.35197092 -0.9333078  -0.07108458]
 ...
 [-0.79609388  0.4501913   0.40442839]
 [ 0.63136902  0.67666773 -0.37880594]
 [-0.07364122  0.3601162  -0.9299964 ]]


Visual sample of 50 embeddings:

In [None]:
df = pd.DataFrame(data=norms, columns=['x','y','z'])
df['word'] = list(train_tagged_desc)

fig = px.scatter_3d(df[:50], x="x", y="y", z="z", text="word")
fig.update_traces(marker=dict(size=1), textposition='top center')
fig.show()

## 5. Saving the embedding-layer vectors associated with the words

In this case, we decided to store the embeddings in a DataFrame so that they can be imported and loaded easily later.

In [None]:
# Series.append was removed in pandas 2.0; pd.concat works in old and new versions
allDesc = pd.concat([train_tagged_desc, test_tagged_desc])
allDesc = allDesc.sort_index()
labels_temp, allEmbds = get_embeddings(model_dbow_desc, allDesc)

In [None]:
allEmbds = pd.Series(allEmbds)

frame = { 'category': data_desc.category, 'clean_description': data_desc.clean_description, 'tagged_document': allDesc, 'embedding': allEmbds }
  
embdData = pd.DataFrame(frame)

In [None]:
embdData

Unnamed: 0,category,clean_description,tagged_document,embedding
0,BUSINESS,littleknown northern california town establish...,"([littleknown, northern, california, town, est...","[0.045154523, -0.06516076, -0.15734805, 0.1524..."
1,BUSINESS,getting crowded investment crowdfunding space ...,"([getting, crowded, investment, crowdfunding, ...","[-0.002537974, -0.113792755, -0.26249868, 0.22..."
2,BUSINESS,even though might literally sending wrong mess...,"([even, though, might, literally, sending, wro...","[0.054077085, -0.066021316, -0.17997105, 0.171..."
3,BUSINESS,deal coming soon,"([deal, coming, soon], [BUSINESS])","[0.032379396, -0.01841027, -0.070468225, 0.069..."
4,BUSINESS,many leaders operate meeting brand directly af...,"([many, leaders, operate, meeting, brand, dire...","[0.01784022, -0.09093062, -0.25879616, 0.22807..."
...,...,...,...,...
29945,WORLD NEWS,many willing endure unthinkable horror escape ...,"([many, willing, endure, unthinkable, horror, ...","[0.076169886, -0.03192123, -0.09570384, 0.1007..."
29946,WORLD NEWS,israel facing kind struggle many countries enc...,"([israel, facing, kind, struggle, many, countr...","[0.10293097, -0.08019456, -0.16608433, 0.15193..."
29947,WORLD NEWS,macron merkel must figure common way forward o...,"([macron, merkel, must, figure, common, way, f...","[0.06970699, -0.048544854, -0.15155764, 0.1320..."
29948,WORLD NEWS,mexico become dangerous place courage devote l...,"([mexico, become, dangerous, place, courage, d...","[0.060442276, -0.0100320745, -0.14605603, 0.13..."


In [None]:
print("First full row:")
print(embdData.loc[[0]].to_string())

First full row:
   category    clean_description    tagged_document

In [None]:
embdData.to_csv('embeddings.csv', index=True, header=True)
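
When the CSV is read back, the embedding column arrives as text rather than as numeric arrays. The sketch below shows one way to recover the numbers; it assumes each cell holds bracketed values separated by whitespace or commas, and `parse_embedding` is a hypothetical helper, not part of pandas.

```python
def parse_embedding(cell):
    """Convert the text form of a numeric array back into a list of floats."""
    # Assumption: values are separated by whitespace and/or commas inside brackets.
    cleaned = cell.strip().strip("[]").replace(",", " ")
    return [float(token) for token in cleaned.split()]

cell = "[ 0.045154523 -0.06516076\n -0.15734805]"
vector = parse_embedding(cell)
print(len(vector), vector[0])
```

Such a helper could be applied to the whole column after loading, for example through the column's apply method.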