# FastText

Unlike Word2Vec, which works at the word level, FastText also tries to capture the morphological information inside each word.

>*"[...] we propose a new approach **based on the skipgram model, where each word is represented as a bag of character n-grams**. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. [...]"* <br>(Bojanowski et al., Enriching Word Vectors with Subword Information, https://arxiv.org/pdf/1607.04606.pdf)

In this way, a word is represented by its character n-grams.

The size of the n-grams must be set as a hyperparameter:
- min_n: minimum value of _n_ to consider
- max_n: maximum value of _n_ to consider
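
To make the role of `min_n` and `max_n` concrete, here is a minimal sketch of the n-gram extraction in plain Python. The function name `char_ngrams` is hypothetical (it is not part of gensim). The sketch only lists the n-grams of the word padded with the boundary symbols `<` and `>`; the paper additionally keeps the whole padded word as one extra sequence, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=4):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


grams = char_ngrams("procesado", min_n=3, max_n=4)
print(grams[:3])   # first 3-grams: ['<pr', 'pro', 'roc']
print(grams[-3:])  # last 4-grams: ['esad', 'sado', 'ado>']
```

Widening the range (for example `max_n=6`, as in the hyperparameters used in this notebook) increases the number of n-grams per word and therefore the number of subword vectors that are trained.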

Example:
>*"Me gusta el procesado del lenguaje natural"* ("I like natural language processing")
>* Example of *skip-gram* preprocessing with a context window of 2 words
>
>$w_{target} =$ "procesado" &emsp;$w_{context} =$ ["gusta", "el", "del", "lenguaje"]
>
>     ("procesado", "gusta")
>
> n-gram decomposition with min_n=3 and max_n=4:
>
>"procesado" = ["$<$pr", "pro", ..., "ado", "do$>$", "$<$pro", "proc", ..., "sado", "ado$>$"]
>
>* The score between the target word and a context word is then: <br><br>
>&emsp;$\boxed{s(w_{target}, w_{context}) = \sum_{g \in G_{w_{target}}}z_{g}^T v_{w_{context}}}$, where $G_{w_{target}}\subset\{g_{1}, ..., g_{G}\}$ is the set of n-grams of $w_{target}$, $z_{g}$ is the vector of n-gram $g$, and $v_{w_{context}}$ is the vector of the context word.
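
The scoring function above can be sketched in a few lines of plain Python. The vectors below are made-up toy values, the word "pro" is chosen only because it has just three 3-grams, and the helper names `dot` and `score` are hypothetical. This is an illustration of the formula, not the library's training code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score(target_ngrams, ngram_vectors, context_vector):
    """s(w_target, w_context): sum over g in G_target of z_g . v_context."""
    return sum(dot(ngram_vectors[g], context_vector) for g in target_ngrams)


# Toy 2-dimensional n-gram vectors z_g for the 3-grams of "<pro>" (made-up values).
ngram_vectors = {"<pr": [1.0, 0.0], "pro": [0.0, 2.0], "ro>": [1.0, 1.0]}
v_context = [0.5, -1.0]  # toy context-word vector (made-up values)

s = score(["<pr", "pro", "ro>"], ngram_vectors, v_context)
print(s)  # -2.0
```

Because the dot product is linear, the same value is obtained by first summing the n-gram vectors into a single word vector and then taking one dot product with the context vector.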

## Most similar words

In [None]:
!pip install gensim spacy numpy pandas

In [None]:
def print_sim_words(word, model1, model2):
    """Print the most similar words to `word` from two models, side by side."""
    query = f"Most similar to {word}"
    print(query)
    print("-" * len(query))
    pairs = zip(model1.wv.most_similar(word), model2.wv.most_similar(word))
    for (word1, score1), (word2, score2) in pairs:
        # Pad each "word:" label to 21 characters so the scores line up in columns.
        print(f"{word1 + ':':<21}{score1:.3f}{' ' * 10}{word2 + ':':<21}{score2:.3f}")
    print("\n")

## Importing the libraries

In [None]:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases, Phraser

## Reading the data

In [None]:
# `unzip` is a system command-line utility, so it does not need a pip install.
!unzip df_clean_simpsons.csv.zip

In [None]:
import pandas as pd
df_clean = pd.read_csv('./df_clean_simpsons.csv')

In [None]:

# Split each cleaned line into tokens; drop missing rows, which pandas reads as NaN.
sent = [row.split() for row in df_clean['clean'].dropna()]

## Hyperparameters

In [None]:
sg_params = {
    'sg': 1,
    'vector_size': 300,
    'min_count': 5,
    'window': 5,
    'hs': 0,
    'negative': 20,
    'workers': 4,
    'min_n': 3,
    'max_n': 6
}



## Initializing the FastText object

In [None]:
help(FastText)

In [None]:
# Skip Gram
ft_sg = FastText(**sg_params)

## Building the vocabulary

In [None]:
# Skip Gram
ft_sg.build_vocab(sent)



In [None]:
print('Vocabulary contains {} words'.format(len(ft_sg.wv.key_to_index)))


## Training the embedding weights

In [None]:
# Skip Gram
ft_sg.train(sent, total_examples=len(sent), epochs=20)


## Saving the model

In [None]:
ft_sg.save('./w2v_model_fast.pkl')


## Some results

In [None]:
ft_sg.wv.most_similar(positive=["homer"])

In [None]:
ft_sg.wv.most_similar(positive=["marge"])

In [None]:
ft_sg.wv.most_similar(positive=["bart"])

In [None]:
ft_sg.wv.similarity('maggie', 'baby')

In [None]:
ft_sg.wv.similarity('bart', 'nelson')
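
For reference, `similarity` returns the cosine similarity between the two word vectors. A minimal plain-Python sketch of that measure, using made-up 2-dimensional vectors and a hypothetical helper name `cosine_similarity`, could look like this:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value ranges from -1 (opposite directions) to 1 (same direction) and ignores vector length, which is why it is a common choice for comparing embeddings.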

In [None]:
ft_sg.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])

In [None]:
ft_sg.wv.doesnt_match(['homer', 'patty', 'selma'])
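
One way to understand `doesnt_match` is as "the word whose vector is least similar to the average direction of the group". The sketch below follows that idea in plain Python. It is an assumption about the general approach, not a copy of gensim's code; the function name `odd_one_out` is hypothetical and the vectors are made up.

```python
import math


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the group mean.

    `vectors` maps each word to its (non-zero) embedding.
    """
    normed = {word: unit(vec) for word, vec in vectors.items()}
    # Mean of the unit vectors, normalized again to get a mean direction.
    mean = unit([sum(col) / len(normed) for col in zip(*normed.values())])
    # With unit vectors, the dot product equals the cosine similarity.
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Toy embeddings (made-up values): "a" and "b" point roughly the same way.
toy = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [-0.2, 1.0]}
print(odd_one_out(toy))
```

With only three words the mean direction is heavily influenced by each member, so results on small groups should be read with some caution.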

## Out-of-Vocabulary (OOV) Words

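Because FastText learns vectors for character n-grams, a word that is not in the vocabulary can still get a vector, built from its n-grams. The number of n-grams created during training makes it unlikely (though not impossible) that a word cannot be built as a bag of n-grams.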

In [None]:
'asereje' in ft_sg.wv.key_to_index

In [None]:
ft_sg.wv.most_similar('asereje')

In [None]:
ft_sg.wv['asereje'].shape
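
The cells above show that an out-of-vocabulary word still gets a vector of the usual size. A rough sketch of how such a vector can be assembled from subword vectors follows. Everything here is an illustrative assumption: the tiny bucket table, the random vectors, the plain FNV-1a hash, the averaging step, and the helper names `char_ngrams`, `fnv1a_32` and `oov_vector`. The actual library implementation may differ in its details.

```python
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word padded with '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


DIM, BUCKETS = 4, 50  # toy sizes, far smaller than in a real model
rng = random.Random(0)
# Stand-in for the trained n-gram vectors: one random vector per hash bucket.
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def oov_vector(word):
    """Average the bucket vectors of all n-grams of `word`."""
    rows = [bucket_vectors[fnv1a_32(g) % BUCKETS] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


vec = oov_vector("asereje")
print(len(vec))  # 4
```

Hashing n-grams into a fixed number of buckets keeps memory bounded, at the cost of different n-grams occasionally sharing a vector.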