# FastText

Unlike Word2Vec, which works at the word level, FastText tries to capture the morphological information of words.

>*"[...] we propose a new approach **based on the skipgram model, where each word is represented as a bag of character n-grams**. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. [...]"* <br>(Bojanowski et al., Enriching Word Vectors with Subword Information, https://arxiv.org/pdf/1607.04606.pdf)

In this way, a word is represented by its character n-grams.

The range of n-gram sizes must be set through two hyperparameters:
- min_n: minimum value of _n_ to consider
- max_n: maximum value of _n_ to consider
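
As a rough illustration of what `min_n` and `max_n` control, the sketch below extracts the character n-grams of a word after wrapping it in the boundary markers `<` and `>`. The helper name `char_ngrams` is hypothetical (it is not part of gensim), and the original paper additionally keeps the whole wrapped word as one extra special sequence, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` for every n in [min_n, max_n].

    Hypothetical helper for illustration only; not gensim's implementation.
    """
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


grams = char_ngrams("procesado", min_n=3, max_n=4)
print(grams)
```

Widening the range produces more n-grams per word, so the model shares more subword information at the cost of a larger n-gram table.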

Example:
>*"Me gusta el procesado del lenguaje natural"* (Spanish for "I like natural language processing")
>* *Skip-gram* example after preprocessing, with a context window of 2 words
>
>$w_{target} =$ "procesado" &emsp;$w_{context} =$ ["gusta", "el", "del", "lenguaje"]
>
>     ("procesado", "gusta")
>
> Decomposition into n-grams with min_n=3 and max_n=4:
>
>"procesado" = ["$<$pr", "pro", ..., "ado", "do$>$", "$<$pro", "proc", ..., "sado", "ado$>$"]
>
>* The similarity score is then: <br><br>
>&emsp;$\boxed{s(w_{target}, w_{context}) = \sum_{g \in G_{w_{target}}}z_{g}^T v_{w_{context}}}$, where $G_{w_{target}}\subset\{g_{1}, ..., g_{G}\}$
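
The scoring function above can be sketched in a few lines of plain Python. The n-gram subset, the 4-dimensional random vectors, and the names `z`, `v_context`, and `score` are all assumptions made for illustration; a trained model would supply learned vectors instead.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality, chosen only for readability


def random_vector(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# A few n-grams of the target word "procesado" (a subset, for brevity).
target_ngrams = ["<pr", "pro", "ado", "do>"]

# z[g]: hypothetical vector of n-gram g; v_context: vector of the context word.
z = {g: random_vector(DIM) for g in target_ngrams}
v_context = random_vector(DIM)

# s(w_target, w_context) = sum over n-grams g of z_g^T v_context
score = sum(dot(z[g], v_context) for g in target_ngrams)

# Equivalent view: sum the n-gram vectors first, then take a single dot product.
word_vector = [sum(z[g][i] for g in target_ngrams) for i in range(DIM)]
score_from_word_vector = dot(word_vector, v_context)

print(score, score_from_word_vector)
```

Both values agree up to floating-point rounding, which is why the word itself can be treated as the sum of its n-gram vectors.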

## Most similar words

In [None]:
!pip install gensim spacy numpy

In [None]:
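def print_sim_words(word, model1, model2):
    """Print the words most similar to `word` in two models, side by side."""
    query = "Most similar to {}".format(word)
    print(query)
    print("-" * len(query))
    # most_similar returns (word, similarity) pairs; walk both lists in parallel.
    for (w1, s1), (w2, s2) in zip(model1.wv.most_similar(word),
                                  model2.wv.most_similar(word)):
        # Pad "word:" to 21 characters so both similarity columns line up.
        print("{:<21}{:.3f}{}{:<21}{:.3f}".format(w1 + ":", s1, " " * 10, w2 + ":", s2))
    print("\n")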

## Import the libraries

In [None]:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases, Phraser

## Reading the data

In [None]:
# `unzip` is a system utility invoked through the shell; no pip package is required.
!unzip df_clean_simpsons.csv.zip

Archive:  df_clean_simpsons.csv.zip
  inflating: df_clean_simpsons.csv   
  inflating: __MACOSX/._df_clean_simpsons.csv  


In [None]:
import pandas as pd
df_clean = pd.read_csv('./df_clean_simpsons.csv')

In [None]:

sent = [row.split() for row in df_clean['clean']]

## Hyperparameters

In [None]:
sg_params = {
    'sg': 1,
    'vector_size': 300,
    'min_count': 5,
    'window': 5,
    'hs': 0,
    'negative': 20,
    'workers': 4,
    'min_n': 3,
    'max_n': 6
}



## Initialize the FastText object

In [None]:
help(FastText)

Help on class FastText in module gensim.models.fasttext:

class FastText(gensim.models.word2vec.Word2Vec)
 |  FastText(sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=(), max_final_vocab=None, shrink_windows=True)
 |
 |  Method resolution order:
 |      FastText
 |      gensim.models.word2vec.Word2Vec
 |      gensim.utils.SaveLoad
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function has

In [None]:
# Skip Gram
ft_sg = FastText(**sg_params)



## Build the vocabulary

In [None]:
# Skip Gram
ft_sg.build_vocab(sent)



In [None]:
print('Vocabulary contains {} words'.format(len(ft_sg.wv.key_to_index)))


Vocabulary contains 8770 words


## Train the embedding weights

In [None]:
# Skip Gram


ft_sg.train(sent, total_examples=len(sent), epochs=20)


(9125060, 10741780)

## Save the model

In [None]:
ft_sg.save('./w2v_model_fast.pkl')


## Some results

In [None]:
ft_sg.wv.most_similar(positive=["homer"])

[('knockahomer', 0.6395146250724792),
 ('homey', 0.6154800653457642),
 ('homeboy', 0.5624999403953552),
 ('hom', 0.5478023290634155),
 ('astronomer', 0.5175456404685974),
 ('hometown', 0.5051910877227783),
 ('fonzie', 0.4758222699165344),
 ('homosexual', 0.47149208188056946),
 ('home', 0.4672473669052124),
 ('thompson', 0.4571842849254608)]

In [None]:
ft_sg.wv.most_similar(positive=["marge"])

[('sarge', 0.654728889465332),
 ('margie', 0.6150627732276917),
 ('margarita', 0.6041011214256287),
 ('margaret', 0.5914472341537476),
 ('barge', 0.5579581260681152),
 ('marmaduke', 0.5227359533309937),
 ('marjorie', 0.5183671712875366),
 ('marlon', 0.4624136686325073),
 ('marco', 0.460935115814209),
 ('mars', 0.45457586646080017)]

In [None]:
ft_sg.wv.most_similar(positive=["bart"])

[('barto', 0.5829876065254211),
 ('bartman', 0.520098865032196),
 ('baryshnikov', 0.4961232542991638),
 ('bartron', 0.4920521676540375),
 ('barty', 0.47650855779647827),
 ('barf', 0.471396267414093),
 ('dart', 0.47074419260025024),
 ('bartholomew', 0.4664735198020935),
 ('impart', 0.46606168150901794),
 ('nikki', 0.46264785528182983)]

In [None]:
ft_sg.wv.similarity('maggie', 'baby')

np.float32(0.32119647)

In [None]:
ft_sg.wv.similarity('bart', 'nelson')

np.float32(0.28339562)

In [None]:
ft_sg.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])

'milhouse'

In [None]:
ft_sg.wv.doesnt_match(['homer', 'patty', 'selma'])

'homer'

## Out-of-Vocabulary (OOV) Words

The number of n-grams created while training FastText makes it unlikely (though not impossible) that a word cannot be built as a bag of n-grams.
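
One way to picture why an unseen word still receives a vector: its n-grams are hashed into a fixed number of buckets, and the bucket vectors are combined (here, averaged). The sketch below is a toy version under stated assumptions: `zlib.crc32` stands in for the hash function, the table has only 50 buckets of 4 dimensions filled with random numbers, and the helper names are hypothetical. It does not reproduce gensim's internals.

```python
import random
import zlib

random.seed(0)
NUM_BUCKETS = 50  # toy value; the `bucket` parameter defaults to 2000000
DIM = 4           # toy dimensionality

# Hypothetical n-gram table: one random vector per bucket.
bucket_vectors = [[random.uniform(-1.0, 1.0) for _ in range(DIM)]
                  for _ in range(NUM_BUCKETS)]


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def oov_vector(word):
    """Average the bucket vectors of the word's n-grams (illustrative only)."""
    grams = char_ngrams(word)
    if not grams:
        # The rare failure case: the word is too short to yield any n-gram.
        raise KeyError(f"no n-grams available for {word!r}")
    # crc32 is a stand-in hash, an assumption of this sketch.
    rows = [bucket_vectors[zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS]
            for g in grams]
    return [sum(column) / len(rows) for column in zip(*rows)]


vec = oov_vector("asereje")
print(len(vec))
```

Because every n-gram maps to some bucket, lookup only fails when a word yields no n-grams at all, such as an empty string with `min_n=3`.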

In [None]:
'asereje' in ft_sg.wv.key_to_index

False

In [None]:
ft_sg.wv.most_similar('asereje')

[('taser', 0.6168254613876343),
 ('eraser', 0.6077090501785278),
 ('serenity', 0.598382830619812),
 ('phaser', 0.5907742381095886),
 ('laser', 0.5797291398048401),
 ('heeere', 0.5598991513252258),
 ('sera', 0.5534683465957642),
 ('analysis', 0.5377046465873718),
 ('liser', 0.5299598574638367),
 ('derriere', 0.5272972583770752)]

In [None]:
ft_sg.wv['asereje'].shape

(300,)