# FastText

Unlike Word2Vec, which operates at the word level, FastText tries to capture the morphological information of words.

>*"[...] we propose a new approach **based on the skipgram model, where each word is represented as a bag of character n-grams**. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. [...]"* <br>(Bojanowski et al., Enriching Word Vectors with Subword Information, https://arxiv.org/pdf/1607.04606.pdf)

In this way, a word is represented by its character n-grams.

The range of n-gram lengths must be set through two hyperparameters:
- min_n: minimum value of _n_ to consider
- max_n: maximum value of _n_ to consider
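
The role of `min_n` and `max_n` can be sketched with a small helper. This is a minimal illustration, not the library's code: `char_ngrams` is a hypothetical name, and the sketch leaves out the extra token for the whole word (`<where>`) that the paper also adds. The word "where" with n=3 is the example used in the paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` for every n in [min_n, max_n].

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    can be told apart from n-grams in the middle of the word.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A wider range (for instance the 3 to 6 used later in this notebook) yields more n-grams per word, and therefore more parameters to share across morphologically related words.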

Example:
>*"Me gusta el procesado del lenguaje natural"* ("I like natural language processing")
>* *Skip-gram* preprocessing with a context window of 2 words
>
>$w_{target} =$ "procesado" &emsp;$w_{context} =$ ["gusta", "el", "del", "lenguaje"] 
>
> One of the resulting (target, context) training pairs:
>
>     ("procesado", "gusta")
>
> n-gram decomposition with min_n=3 and max_n=4:
>
>"procesado" = ["$<$pr", "pro", ..., "ado", "do$>$", "$<$pro", "proc", ..., "sado", "ado$>$"]
>
>* The similarity score is then: <br><br>
>&emsp;$\boxed{s(w_{target}, w_{context}) = \sum_{g \in G_{w_{target}}}z_{g}^T v_{w_{context}}}$, where $G_{w_{target}}\subset\{g_{1}, ..., g_{G}\}$ is the set of n-grams of the target word, $z_{g}$ is the vector of n-gram $g$, and $v_{w_{context}}$ is the vector of the context word
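
The example can be reproduced with a short, library-free sketch. The helper names (`skipgram_pairs`, `char_ngrams`, `score`) are hypothetical, the sentence is lowercased for simplicity, and the vectors are random toy values rather than trained embeddings, so the resulting score has no meaning beyond showing how the sum is formed.

```python
import random


def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def score(target, context, z, v):
    """s(target, context): sum over the target's n-grams g of z[g] . v[context]."""
    return sum(
        sum(a * b for a, b in zip(z[g], v[context]))
        for g in char_ngrams(target)
    )


tokens = "me gusta el procesado del lenguaje natural".split()
pairs = [p for p in skipgram_pairs(tokens, window=2) if p[0] == "procesado"]
print(pairs)

# Toy parameters: one random vector per n-gram (z) and per context word (v).
rng = random.Random(0)
dim = 5
z = {g: [rng.uniform(-1, 1) for _ in range(dim)] for g in char_ngrams("procesado")}
v = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in tokens}
print(score("procesado", "gusta", z, v))
```

Because the dot product is linear, summing the per-n-gram scores is the same as first summing the n-gram vectors into a single word vector and then taking one dot product with the context vector.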

## Most similar words

In [None]:
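def print_sim_words(word, model1, model2):
    """Print the nearest neighbours of `word` in two models, side by side."""
    query = "Most similar to {}".format(word)
    print(query)
    print("-" * len(query))
    # most_similar returns (word, cosine similarity) tuples, best match first
    for (w1, s1), (w2, s2) in zip(model1.wv.most_similar(word), model2.wv.most_similar(word)):
        # Pad each "word:" field to a fixed width so the two columns line up
        print("{:<21}{:.3f}{}{:<21}{:.3f}".format(w1 + ":", s1, " " * 10, w2 + ":", s2))
    print("\n")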

## Import the libraries

In [None]:
from gensim.models import FastText
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases, Phraser

## Reading the data

In [None]:
!unzip df_clean_simpsons.csv.zip

Archive:  df_clean_simpsons.csv.zip
  inflating: df_clean_simpsons.csv   
  inflating: __MACOSX/._df_clean_simpsons.csv  


In [None]:
import pandas as pd
df_clean = pd.read_csv('./df_clean_simpsons.csv')

In [None]:

sent = [row.split() for row in df_clean['clean']]

## Hyperparameters

In [None]:
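sg_params = {
    'sg': 1,          # 1 = skip-gram, 0 = CBOW
    'size': 300,      # embedding dimension (gensim 3.x name; 'vector_size' in gensim >= 4.0)
    'min_count': 5,   # ignore words that appear fewer than 5 times
    'window': 5,      # context window size
    'hs': 0,          # no hierarchical softmax ...
    'negative': 20,   # ... use negative sampling with 20 noise words instead
    'workers': 4,     # training threads
    'min_n': 3,       # shortest character n-gram
    'max_n': 6        # longest character n-gram
}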



## Initialize the FastText object

In [None]:
help(FastText)

In [None]:
# Skip Gram
ft_sg = FastText(**sg_params)



## Build the vocabulary

In [None]:
# Skip Gram
ft_sg.build_vocab(sent)



In [None]:
# gensim 3.x API; in gensim >= 4.0 use len(ft_sg.wv.key_to_index)
print('Vocabulary of {} words'.format(len(ft_sg.wv.vocab)))

Vocabulary of 8770 words


## Train the embedding weights

In [None]:
# Skip Gram
ft_sg.train(sentences=sent, total_examples=ft_sg.corpus_count, epochs=20)

## Save the model

In [None]:
ft_sg.save('./w2v_model_fast.pkl')


## Some results

In [None]:
ft_sg.wv.most_similar(positive=["homer"])

[('knockahomer', 0.6135291457176208),
 ('homey', 0.6062021255493164),
 ('homeboy', 0.5659101009368896),
 ('hom', 0.5419987440109253),
 ('hometown', 0.5186790823936462),
 ('astronomer', 0.5175546407699585),
 ('fonzie', 0.5121568441390991),
 ('homosexual', 0.4817216694355011),
 ('homie', 0.4720458984375),
 ('timer', 0.46853768825531006)]

In [None]:
ft_sg.wv.most_similar(positive=["marge"])

[('sarge', 0.6486552953720093),
 ('margarita', 0.5945071578025818),
 ('margie', 0.5929368138313293),
 ('margaret', 0.5686913132667542),
 ('barge', 0.5686628818511963),
 ('marmaduke', 0.5014662742614746),
 ('marjorie', 0.48846176266670227),
 ('marble', 0.45711302757263184),
 ('marco', 0.452824205160141),
 ('marlon', 0.44601792097091675)]

In [None]:
ft_sg.wv.most_similar(positive=["bart"])

[('barto', 0.5967378616333008),
 ('bartman', 0.5079664587974548),
 ('bartron', 0.5004277229309082),
 ('bartholomew', 0.4930424690246582),
 ('barty', 0.4902939200401306),
 ('baryshnikov', 0.4792308807373047),
 ('nikki', 0.4688391089439392),
 ('dart', 0.465351939201355),
 ('art', 0.4629773795604706),
 ('impart', 0.4613049030303955)]

In [None]:
ft_sg.wv.similarity('maggie', 'baby')

0.32618228

In [None]:
ft_sg.wv.similarity('bart', 'nelson')

0.29418728
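
`wv.similarity` returns the cosine similarity between the two word vectors. A minimal, library-free sketch of that computation follows; the function name `cosine_similarity` and the two-dimensional toy vectors are illustrative assumptions, not the trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors 45 degrees apart.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

The value ranges from -1 (opposite directions) to 1 (same direction), which is why the scores above can be compared across word pairs regardless of vector length.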

In [None]:
ft_sg.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])

'milhouse'

In [None]:
ft_sg.wv.doesnt_match(['homer', 'patty', 'selma'])

'homer'
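
The idea behind `doesnt_match` can be sketched as follows, to the best of our understanding of gensim's implementation: normalize every vector, average them, and return the word whose vector is least similar to that mean. The function names and the toy vectors below are made up for illustration.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors):
    """Return the key whose unit vector has the lowest dot product with the mean."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed)
                 for i in range(dim)])
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Two toy vectors pointing roughly the same way and one pointing elsewhere.
toy = {"w1": [1.0, 0.1], "w2": [0.9, 0.2], "w3": [-0.2, 1.0]}
print(odd_one_out(toy))  # w3
```

With only three words the mean is heavily influenced by each of them, so results on short lists such as the ones above should be read with some caution.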

## Out-of-Vocabulary (OOV) Words 

The number of n-grams created while training FastText makes it unlikely (though not impossible) that a word cannot be built as a bag of n-grams.
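
How a vector can be assembled for an unseen word is sketched below under simplifying assumptions: n-grams are hashed into a fixed number of buckets and the word vector is the average of the bucket vectors. The bucket count, the use of `zlib.crc32` as a stand-in hash, the random bucket vectors, and the names `bucket_of` and `oov_vector` are all illustrative choices, not gensim's internals.

```python
import random
import zlib

BUCKETS = 1000  # toy value; real models use far more buckets
DIM = 4         # toy embedding size


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def bucket_of(ngram):
    """Map an n-gram to a bucket index (crc32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS


# Stand-in for the trained n-gram matrix: one random vector per bucket.
rng = random.Random(0)
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def oov_vector(word):
    """Average the bucket vectors of all n-grams of `word`."""
    rows = [bucket_vectors[bucket_of(g)] for g in char_ngrams(word)]
    if not rows:
        # No n-gram of the requested lengths exists, so no vector can be built.
        raise KeyError(f"no n-grams available for {word!r}")
    return [sum(col) / len(rows) for col in zip(*rows)]


print(len(oov_vector("asereje")))  # 4
```

The `KeyError` branch corresponds to the "not impossible" case: a word that yields no usable n-gram cannot be given a vector this way.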

In [None]:
'asereje' in ft_sg.wv.vocab

False

In [None]:
ft_sg.wv.most_similar('asereje')

[('reject', 0.6776005029678345),
 ('serenity', 0.5853143930435181),
 ('ohmygod', 0.5259625911712646),
 ('fulfill', 0.5136178135871887),
 ('sera', 0.5104902386665344),
 ('unnecessary', 0.5089079141616821),
 ('eraser', 0.5040486454963684),
 ('taser', 0.4980926513671875),
 ('guarantee', 0.4953840374946594),
 ('vengeful', 0.48939526081085205)]

In [None]:
ft_sg.wv['asereje'].shape

(300,)