# Intro to class


[word embeddings colah](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

# Word Embeddings

Cuando empezó esto de los word embeddings? Hace algo de tiempo, aquí por ejemplo. [Distributed Representations](http://repository.cmu.edu/cgi/viewcontent.cgi?article=2841&context=compsci)

![Word Embeddings](https://qph.fs.quoracdn.net/main-qimg-e8b83b14d7261d75754a92d0d3605e36 "Word Embeddings")

Los word embeddings son vectores densos que representan tokens, aunque el término es word, sería más correcto usar token embeddings.

Hay varias implementaciones, como CBOW o Skip-Gram (del famoso word2vec) o Global Vectors (GloVe). 

Són una solución al problema de dimensionalidad que comentamos el primer día de clase.

### Vale pero que son realmente los valores de los word embeddings?

Los valores de los word embeddings representan "clases", aunque es muy complicado estar 100% seguro de que clases son, sobretodo porque cada vez que entrenemos word embeddings, el modelo puede aprender unas u otras clases. Con lo cuál, hay que olvidar-se un poco del concepto de vector dónde la posición 1 siempre tendremos la clase 1, y en la posición 2 la clase 2.

![](https://cdn-images-1.medium.com/max/1600/1*mLrheV1nGz7XemDAVRcZ4A.png)

Pensad que cuando escogemos la dimensionalidad de estos, ya estamos fijando cuántas clases puede aprender, y además, cada dimensionalidad pueden ser una mezcla de clases. Así pues, y el objetivo de aprender word vectors, es el siguiente:

> *Rezamos para que palabras "similares" tengan vectores similares*. Es decir a representar palabras por su semantismo.

### Recordatorio:

In [2]:
import numpy as np
import io

In [7]:
# Bag-of-Words
nb_features = 10
bow_doc = np.zeros(shape=(1,nb_features))
print(bow_doc.shape)
nb_docs = 5
bow_docs = np.zeros(shape=(nb_docs, nb_features))
print(bow_docs.shape)
# Si quisieramos representar cada posición del documento, con una feature, simplemente habría 
# que generar un vector de maxlen features.
doc_representation = np.array([1, 2, 3, 4, 5, 0 , 0]) # 0 padding

(1, 10)
(5, 10)


Estas representaciones son simples features, pero podemos mejorar de alguna forma? Si quisieramos representar un documento con la ultima representación y ngrams por ejemplo, creariamos un array tal que asi:

In [8]:
maxlen = len([1, 2, 3, 4, 5, 0 , 0])
doc_ngrams = np.zeros(shape=(maxlen, nb_features))
doc_ngrams.shape

(7, 10)

## Problemas que arreglamos

Los shapes de las representaciones son muy indicativas del problema que queremos resolver. Es decir, si ahora en lugar de una frase, queremos representar un documento de 400 posiciones, y que tiene 20k features? Por cada documento generaremos arrays de shape=(400, 20000), lo cual se hace insostenible a medida de que hagamos crecer las features. 

Ahora que ya habeis visto la cantidad de datos necesarios para ciertos problemas como computer vision, texto no ha sido indiferente a la revolución del "big data". Aún así, los word embeddings NO son deep learning, pero si se usan en él. Sás las features más usadas para texto en deep learning.

### Semántica

Además muchos problemas de NLP se basan basicamente en la semantica de estos tokens, y hasta la fecha no hemos visitado demasiado la semantica de las palabras. Usando Word Embeddings podremos boostear nuestras features y consecuentemente nuestros outputs de los algoritmos. 

Haremos uns dummy implementación del CBOW y GloVe, pero podéis encontrar vectores ya entrenados, o algoritmos implementados de forma eficiente en librerías como gensim o spaCy. 

CBOW esta basado en redes neuronales, y GloVe en factorización de matríces.

<div align="center">
    ![cbow.png](https://adriancolyer.files.wordpress.com/2016/04/word2vec-cbow.png =500x)
</div>

### Load Data

In [1]:
!pip install tqdm
from tqdm import tqdm

!pip install -q pydot

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot




Using TensorFlow backend.


In [0]:
!pip install spacy
!python -m spacy download es_core_news_sm

import spacy

nlp = spacy.load('es_core_news_sm', disable= ['ner', 'parser'])

In [3]:
from google.colab import files
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving quijote.txt to quijote (1).txt
User uploaded file "quijote.txt" with length 2141457 bytes


In [11]:
# construir el corpus
quijote = uploaded['quijote.txt']
quijote = io.StringIO(quijote.decode('utf-8')).getvalue()

quijote = quijote.split('.')


### Dataset Preparation

In [12]:
len(quijote)

8210

In [13]:
quijote_docs = []
for doc in tqdm(nlp.pipe(quijote, batch_size=1000, n_threads=4)):
    quijote_docs.append(doc)



0it [00:00, ?it/s][A[A

1it [00:04,  4.64s/it][A[A

1001it [00:08, 122.66it/s][A[A

2001it [00:12, 164.90it/s][A[A

3001it [00:16, 180.64it/s][A[A

4001it [00:20, 194.90it/s][A[A

5001it [00:24, 203.89it/s][A[A

6001it [00:28, 211.70it/s][A[A

7001it [00:32, 217.96it/s][A[A

8001it [00:32, 243.00it/s][A[A

8210it [00:32, 249.22it/s][A[A

In [15]:
vocab = list({t.text for doc in quijote_docs for t in doc})
vocab.insert(0, '<PAD>')

In [16]:
w2id = {w:i for i, w in enumerate(vocab)}
id2w = {i:w for w, i in w2id.items()}

In [17]:
id2w[1]

'dividiéndolas'

In [7]:
config = {
    'vocab_size': 25166,
    'embed_size': 50,
    'context_size':2
}
config

{'context_size': 2, 'embed_size': 50, 'vocab_size': 25166}

### Generate Dataset

In [19]:
def generate_context_target(data, context_size, w2id, vocab_size):
    # x-2 x-1 y x+1 x+2
    for i in range(context_size, len(data)-context_size):
        x = np.zeros(shape=(context_size*2))
        y = np.zeros(shape=(vocab_size))
        y[w2id[data[i]]]=1
        for idx, j in enumerate(range(i-context_size, i+context_size)):
            if j < i:
                x[idx] = w2id[data[j]]
            else:
                x[idx] = w2id[data[j+1]]
        yield x.reshape(1,-1), y.reshape(1,-1)

In [20]:
quijote_corpus = [t.text for doc in quijote_docs for t in doc]
print(len(quijote_corpus))
# generator for x, y in generate_context_target(quijote_corpus, config['context_size'], w2id, config['vocab_size']):


467707


### Create Model

In [0]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

In [8]:
# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=config['vocab_size'], output_dim=config['embed_size'], input_length=config['context_size']*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(config['embed_size'],)))
cbow.add(Dense(config['vocab_size'], activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')
cbow.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 50)             1258300   
_________________________________________________________________
lambda_1 (Lambda)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 25166)             1283466   
Total params: 2,541,766
Trainable params: 2,541,766
Non-trainable params: 0
_________________________________________________________________


In [9]:

SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False, 
                 rankdir='TB').create(prog='dot', format='svg'))

OSError: ignored

### Train

In [0]:
if False:
    for epoch in range(1, 6):
        loss = 0.
        i = 0
        for x, y in generate_context_target(quijote_corpus, config['context_size'], w2id, config['vocab_size']):
            i += 1
            loss += cbow.train_on_batch(x, y)
            if i % 10000 == 0:
                print('Processed {} (context, word) pairs'.format(i))

        print('Epoch:', epoch, '\tLoss:', loss)
        print()

### Get word embeddings

In [0]:
word_embeddings = cbow.get_weights()[0]
word_embeddings = word_embeddings[1:]

In [0]:
from sklearn.metrics.pairwise import euclidean_distances

### Test

In [0]:
# compute pairwise distance matrix
if False:
    distance_matrix = euclidean_distances(word_embeddings)
    print(distance_matrix.shape)

    # view contextually similar words
    similar_words = {search_term: [id2w[idx] for idx in distance_matrix[w2id[search_term]-1].argsort()[1:6]+1] 
                       for search_term in ['Quijote', 'Mancha', 'escudero', 'Sancho']}


## GloVe

El algoritmo que veremos se basa en la siguiente idea:

<div align="center">
![](https://image.slidesharecdn.com/analyticssummitnov082013final-131114141831-phpapp01/95/analytics-summit-2013-25-638.jpg?cb=1384439169 =300x)
</div>

Es una idea que ya hemos visto presentada en formatos diferentes. Cuando entrenamos el Language Modeling, al final, estabamos asignando mayores probabilidades a palabras o tokens que aparecían en un contexto concreto, sólo que en ese caso, el contexto era sólo del pasado.

Vamos a ver una implementación "sencilla" de GloVe. Como hemos comentado, está basada en dos cosas, la primera, en la factorización de matrices. No entraremos en detalles de que es la factorización de matrices, nos basta con saber que en el caso de GloVe, nuestro objetivo será, dado una matriz de co-ocurrencias, obtener dos matrices que multiplicadas reproduzcan la original.

![glove.png](https://cdn-images-1.medium.com/max/800/1*UNtsSilztKXjLG99VXxSQw.png)

La matriz lila, será nuestra matriz de co-ocurrencias. ¿Qué es una matriz de coocurrencias? Tan fácil como ir sumando, dado un contexto, esas palabras que van aparenciendo entre si. Aquí tenéis un ejemplo de su uso.

![](https://cdn-images-1.medium.com/max/1600/0*Yl7I7bH52zk8m_8R.)

Primera Cosa en la que tenemos que fijarnos. Que valor tienen words y context? Es el mismo valor?

Y si es así, que hacemos con las dos matrices que generamos?

#### Algoritmo

![](https://i.imgur.com/HCo2ZwE.png)


### Notación en GloVe

* v_main: the word vector for the main word in the co-occurrence

* v_context: the word vector for the context word in the co-occurrence

* b_main: bias scalar for main word

* b_context: bias scalar for context word

* gradsq_W_main: a vector storing the squared gradient history for the main word vector (for use in the AdaGrad update)

* gradsq_W_context: a vector gradient history for the context word vector

* gradsq_b_main: a scalar gradient history for the main word bias

* gradsq_b_context: a scalar gradient history for the context word bias

* cooccurrence: the Xij value for the co-occurrence pair, described at length above




In [0]:
from scipy.sparse import lil_matrix
from math import log
from functools import partial

In [3]:
def build_cooccur(vocab, corpus, vocab_size, window_size=10, min_count=5):
    """
    Build a word co-occurrence list for the given corpus.
    This function is a tuple generator, where each element (representing
    a cooccurrence pair) is of the form
        (i_main, i_context, cooccurrence)
    where `i_main` is the ID of the main word in the cooccurrence and
    `i_context` is the ID of the context word, and `cooccurrence` is the
    `X_{ij}` cooccurrence value as described in Pennington et al.
    (2014).
    If `min_count` is not `None`, cooccurrence pairs where either word
    occurs in the corpus fewer than `min_count` times are ignored.
    """

    # Collect cooccurrences internally as a sparse matrix for passable
    # indexing speed; we'll convert into a list later
    print('vocab size', vocab_size)
    cooccurrences = lil_matrix((vocab_size, vocab_size),
                                      dtype=np.float64)

    for i, line in tqdm(enumerate(corpus)):
        tokens = [t.text for t in line]
        token_ids = [vocab[word] for word in tokens]

        for center_i, center_id in enumerate(token_ids):
            # Collect all word IDs in left window of center word
            context_ids = token_ids[max(0, center_i - window_size) : center_i]
            contexts_len = len(context_ids)

            for left_i, left_id in enumerate(context_ids):
                # Distance from center word
                distance = contexts_len - left_i

                # Weight by inverse of distance between words
                increment = 1.0 / float(distance)

                # Build co-occurrence matrix symmetrically (pretend we
                # are calculating right contexts as well)
                cooccurrences[center_id, left_id] += increment
                cooccurrences[left_id, center_id] += increment

    # Now yield our tuple sequence (dig into the LiL-matrix internals to
    # quickly iterate through all nonzero cells)
    for i, (row, data) in enumerate(zip(cooccurrences.rows,
                                                   cooccurrences.data)):
        if min_count is not None and vocab[id2w[i]] < min_count:
            continue

        for data_idx, j in enumerate(row):
            if min_count is not None and vocab[id2w[j]] < min_count:
                continue

            yield i, j, data[data_idx]

In [5]:
def run_iter(vocab, data, learning_rate=0.05, x_max=100, alpha=0.75):
    """
    Run a single iteration of GloVe training using the given
    cooccurrence data and the previously computed weight vectors /
    biases and accompanying gradient histories.
    `data` is a pre-fetched data / weights list where each element is of
    the form
        (v_main, v_context,
         b_main, b_context,
         gradsq_W_main, gradsq_W_context,
         gradsq_b_main, gradsq_b_context,
         cooccurrence)
    as produced by the `train_glove` function. Each element in this
    tuple is an `ndarray` view into the data structure which contains
    it.
    See the `train_glove` function for information on the shapes of `W`,
    `biases`, `gradient_squared`, `gradient_squared_biases` and how they
    should be initialized.
    The parameters `x_max`, `alpha` define our weighting function when
    computing the cost for two word pairs; see the GloVe paper for more
    details.
    Returns the cost associated with the given weight assignments and
    updates the weights by online AdaGrad in place.
    """

    global_cost = 0

    # We want to iterate over data randomly so as not to unintentionally
    # bias the word vector contents
    shuffle(data)

    for (v_main, v_context, b_main, b_context, gradsq_W_main, gradsq_W_context,
         gradsq_b_main, gradsq_b_context, cooccurrence) in data:

        weight = (cooccurrence / x_max) ** alpha if cooccurrence < x_max else 1

        # Compute inner component of cost function, which is used in
        # both overall cost calculation and in gradient calculation
        #
        #   $$ J' = w_i^Tw_j + b_i + b_j - log(X_{ij}) $$
        cost_inner = (v_main.dot(v_context)
                      + b_main[0] + b_context[0]
                      - log(cooccurrence))

        # Compute cost
        #
        #   $$ J = f(X_{ij}) (J')^2 $$
        cost = weight * (cost_inner ** 2)

        # Add weighted cost to the global cost tracker
        global_cost += 0.5 * cost

        # Compute gradients for word vector terms.
        #
        # NB: `main_word` is only a view into `W` (not a copy), so our
        # modifications here will affect the global weight matrix;
        # likewise for context_word, biases, etc.
        grad_main = cost_inner * v_context
        grad_context = cost_inner * v_main

        # Compute gradients for bias terms
        grad_bias_main = cost_inner
        grad_bias_context = cost_inner

        # Now perform adaptive updates
        v_main -= (learning_rate * grad_main / np.sqrt(gradsq_W_main))
        v_context -= (learning_rate * grad_context / np.sqrt(gradsq_W_context))

        b_main -= (learning_rate * grad_bias_main / np.sqrt(gradsq_b_main))
        b_context -= (learning_rate * grad_bias_context / np.sqrt(
                gradsq_b_context))

        # Update squared gradient sums
        gradsq_W_main += np.square(grad_main)
        gradsq_W_context += np.square(grad_context)
        gradsq_b_main += grad_bias_main ** 2
        gradsq_b_context += grad_bias_context ** 2

    return global_cost

In [4]:
def train_glove(vocab, cooccurrences, iter_callback=None, vector_size=100,
                iterations=25, **kwargs):
    """
    Train GloVe vectors on the given generator `cooccurrences`, where
    each element is of the form
        (word_i_id, word_j_id, x_ij)
    where `x_ij` is a cooccurrence value $X_{ij}$ as presented in the
    matrix defined by `build_cooccur` and the Pennington et al. (2014)
    paper itself.
    If `iter_callback` is not `None`, the provided function will be
    called after each iteration with the learned `W` matrix so far.
    Keyword arguments are passed on to the iteration step function
    `run_iter`.
    Returns the computed word vector matrix `W`.
    """

    vocab_size = len(vocab)

    # Word vector matrix. This matrix is (2V) * d, where N is the size
    # of the corpus vocabulary and d is the dimensionality of the word
    # vectors. All elements are initialized randomly in the range (-0.5,
    # 0.5]. We build two word vectors for each word: one for the word as
    # the main (center) word and one for the word as a context word.
    #
    # It is up to the client to decide what to do with the resulting two
    # vectors. Pennington et al. (2014) suggest adding or averaging the
    # two for each word, or discarding the context vectors.
    W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)

    # Bias terms, each associated with a single vector. An array of size
    # $2V$, initialized randomly in the range (-0.5, 0.5].
    biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)

    # Training is done via adaptive gradient descent (AdaGrad). To make
    # this work we need to store the sum of squares of all previous
    # gradients.
    #
    # Like `W`, this matrix is (2V) * d.
    #
    # Initialize all squared gradient sums to 1 so that our initial
    # adaptive learning rate is simply the global learning rate.
    gradient_squared = np.ones((vocab_size * 2, vector_size),
                               dtype=np.float64)

    # Sum of squared gradients for the bias terms.
    gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

    # Build a reusable list from the given cooccurrence generator,
    # pre-fetching all necessary data.
    #
    # NB: These are all views into the actual data matrices, so updates
    # to them will pass on to the real data structures
    #
    # (We even extract the single-element biases as slices so that we
    # can use them as views)
    data = [(W[i_main], W[i_context + vocab_size],
             biases[i_main : i_main + 1],
             biases[i_context + vocab_size : i_context + vocab_size + 1],
             gradient_squared[i_main], gradient_squared[i_context + vocab_size],
             gradient_squared_biases[i_main : i_main + 1],
             gradient_squared_biases[i_context + vocab_size
                                     : i_context + vocab_size + 1],
             cooccurrence)
            for i_main, i_context, cooccurrence in cooccurrences]

    for i in tqdm(range(iterations), desc='training'):
        
        cost = run_iter(vocab, data, **kwargs)
        if iter_callback is not None:
            iter_callback(W)

    return W

In [0]:
cooccurrences = build_cooccur(w2id, quijote_docs, vocab_size=config['vocab_size'],
                             window_size=5,
                             min_count=None)

In [6]:
from random import shuffle
import itertools
import logging

logger = logging.getLogger("glove")

In [0]:
config = {
    'vocab_size': len(vocab),
    'embed_size': 100,
    'context_size':3
}
logger = logging.getLogger("glove")

In [0]:
if True:
    word_embeddings = train_glove(w2id, cooccurrences,
                        vector_size=config['embed_size'],
                        iterations=100,
                        learning_rate=0.1)
    np.save('word_embeddings', word_embeddings)
else:
    word_embeddings = np.load('word_embeddings.npy')


In [6]:
word_embeddings.shape

In [0]:
for i, row in enumerate(word_embeddings):
    word_embeddings[i, :] /= np.linalg.norm(row)

In [0]:
emb_words = np.array(word_embeddings[config['vocab_size']:,])
print(emb_words.shape)
context_words = np.array(word_embeddings[config['vocab_size']:config['vocab_size']*2,])
print(context_words.shape)
word_embeddings = np.mean([emb_words, context_words], axis=0)
print(word_embeddings.shape)

(25164, 100)
(25164, 100)
(25164, 100)


In [0]:
# compute pairwise distance matrix
if False:
    distance_matrix = euclidean_distances(word_embeddings)
    print(distance_matrix.shape)

    # view contextually similar words
    similar_words = {search_term: [id2w[idx] for idx in distance_matrix[w2id[search_term]-1].argsort()[1:6]+1] 
                       for search_term in ['Quijote', 'Mancha', 'escudero', 'Sancho']}


In [0]:
word_embeddings.shape

(25164, 100)

In [0]:
w_id = w2id['Quijote']
print(w_id)
quijote_w = word_embeddings[w_id,].reshape(1,-1)
quijote_w.shape

8870


(1, 100)

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

k_close = 10

In [0]:
distances = cosine_similarity(quijote_w, word_embeddings).flatten()

In [0]:
ind = np.argpartition(distances, -k_close)[-k_close:]
distances_top = distances[ind]
np.argsort(distances_top)
for _id in np.argsort(distances_top):
    print(id2w[ind[_id]], distances_top[_id])

Camila 0.4180056502353547
sentóse 0.4259783727708059
escudero 0.45016300866762365
hermano 0.45127875962076225
amo 0.45368394197308304
casen 0.46910466912129367
Teresa 0.4932793481066978
marido 0.5000075675864711
Sancho 0.5659655974218973
Panza 0.9999999999999998


In [0]:
for w in ['Sancho', 'Panza', 'Rocinante', 'Dulcinea', 'ingenioso']:
    print('{}\n---------'.format(w))
    w_id = w2id[w]
    w_vector = word_embeddings[w_id,].reshape(1,-1)
    distances = cosine_similarity(w_vector, word_embeddings).flatten()
    ind = np.argpartition(distances, -k_close)[-k_close:]
    distances_top = distances[ind]
    np.argsort(distances_top)
    for _id in np.argsort(distances_top):
        print(id2w[ind[_id]], distances_top[_id])
    print()

Sancho
---------
padre 0.3986807090799656
amigo 0.4413413117945051
Pallida 0.4721364595852338
APROBACIÓN 0.474137181166619
Soneto 0.47972340418665466
don 0.49837576999921923
Epitafio 0.5120628747940579
Quijote 0.5303728928664393
Panza 0.5659655974218973
Sancho 0.9999999999999998

Panza
---------
Camila 0.4180056502353547
sentóse 0.4259783727708059
escudero 0.45016300866762365
hermano 0.45127875962076225
amo 0.45368394197308304
casen 0.46910466912129367
Teresa 0.4932793481066978
marido 0.5000075675864711
Sancho 0.5659655974218973
Panza 0.9999999999999998

Rocinante
---------
buen 0.3634907377935789
volábanle 0.3653954218416763
ellos 0.3663850850603597
alhucema 0.3674997404700465
sus 0.3731778244271906
pecho 0.3759047645827094
sobre 0.38284440776936535
volver 0.39995210973155054
caballo 0.4253226551985809
Rocinante 1.0000000000000002

Dulcinea
---------
corazón 0.37125082099209367
freno 0.3737632192001009
alma 0.37654391288598216
color 0.38836480253850436
Rodríguez 0.39510549022835983
de

In [0]:
w_hombre = w2id['principe']
w_mujer = w2id['princesa']
w_quijote = w2id['rey']

w_h = word_embeddings[w_hombre,].reshape(1,-1)
w_m = word_embeddings[w_mujer,].reshape(1,-1)
w_q = word_embeddings[w_quijote,].reshape(1,-1)

w_result = w_q - w_h + w_m

distances = cosine_similarity(w_result, word_embeddings).flatten()
ind = np.argpartition(distances, -k_close)[-k_close:]
distances_top = distances[ind]
np.argsort(distances_top)
for _id in np.argsort(distances_top):
    print(id2w[ind[_id]], distances_top[_id])
print()

NameError: name 'w2id' is not defined