### FastText

1.  FastText is an extension of the Word2Vec model.

2.  Word2Vec treats every word as an atomic unit; FastText instead breaks each word into subwords (character n-grams) and learns a vector for each of them.

3.  Because related spellings share subwords, it copes better with misspelled words and with morphological variations of a word.
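
To make the subword idea concrete, here is a minimal sketch that lists the character n-grams of a word. It assumes the fastText convention of wrapping the word in `<` and `>` boundary markers and n-gram lengths of 3 to 6 (the gensim defaults for `min_n` and `max_n`). The helper name `char_ngrams` is hypothetical and is not part of gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"  # mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Only trigrams, to keep the printed list short
trigrams = char_ngrams("learn", min_n=3, max_n=3)
print(trigrams)
```

A word such as "learning" produces many of the same n-grams as "learn", which is what lets related words share information.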


### Importance of FastText:

1.  It breaks words into subwords (character n-grams).

2.  It is more robust to misspelled words, since a misspelling still shares most of its subwords with the correct spelling.

3.  It captures morphological variations of a word, such as "learn" and "learning".

4.  Its main advantage is that it can derive a vector for an unknown word that is not in the vocabulary, using the subwords that word shares with words in the vocabulary.

### Steps used in this Algorithm:

1.  Import all the necessary libraries

2.  Download the necessary NLTK resources

3.  Prepare Your Text Data

4.  Perform the Tokenization on the corpus text

5.  Perform the normalization on the text

6.  Remove the stopwords from the text

7.  Train the FastText Model

8.  Explore the Trained Model

9.  Handle Out-of-Vocabulary (OOV) Words

### Step 1: Import all the necessary libraries

In [352]:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from gensim.models import FastText

### OBSERVATIONS:

1.  nltk ------------->  Library for text preprocessing

2.  tokenize ---------> NLTK module that provides tokenizers (splitting text into smaller units)

3.  word_tokenize ----> splits a text into word and punctuation tokens

4.  numpy ------------> Numerical array computation

5.  gensim -----------> Provides the FastText implementation used here

6.  FastText ---------> Model class that learns vectors for words and their subwords, so it can build vectors for unknown or misspelled words

### Step 2: Download the necessary NLTK resources

In [353]:
nltk.download('punkt_tab')
nltk.download('average_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Error loading average_perceptron_tagger_eng: Package
[nltk_data]     'average_perceptron_tagger_eng' not found in index


False

### OBSERVATIONS:

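1.  punkt_tab provides the data for NLTK's Punkt tokenizer, which word_tokenize relies on.

2.  average_perceptron_tagger_eng is meant to be the POS tagging resource, but the download fails because the name is misspelled; the package is called averaged_perceptron_tagger_eng. POS tagging is not used in the steps below, so the failure does not affect them.

3.  The stop word list used in Step 6 needs the stopwords resource, which can be fetched with nltk.download('stopwords') if it is not already installed.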

### Step 3:  Prepare Your Text Data

In [354]:
corpus = [
    "Natural language processing is a fascinating field",
    "Machine learning enables computers to learn from data",
    "Deep learning is a subset of machine learning",
    "Word embeddings represent words as vectors",
    "FastText handles rare and unseen words effectively"
]

In [355]:
corpus

['Natural language processing is a fascinating field',
 'Machine learning enables computers to learn from data',
 'Deep learning is a subset of machine learning',
 'Word embeddings represent words as vectors',
 'FastText handles rare and unseen words effectively']

In [356]:
corpus = " ".join(corpus)

In [357]:
corpus

'Natural language processing is a fascinating field Machine learning enables computers to learn from data Deep learning is a subset of machine learning Word embeddings represent words as vectors FastText handles rare and unseen words effectively'

### OBSERVATIONS:

1. A corpus is a collection of texts. Here it is a list of five sentences, which are then joined into a single string.

### Step 4: Perform the Tokenization on the corpus text

In [358]:
from  nltk.tokenize import word_tokenize

words = word_tokenize(corpus)

In [359]:
words

['Natural',
 'language',
 'processing',
 'is',
 'a',
 'fascinating',
 'field',
 'Machine',
 'learning',
 'enables',
 'computers',
 'to',
 'learn',
 'from',
 'data',
 'Deep',
 'learning',
 'is',
 'a',
 'subset',
 'of',
 'machine',
 'learning',
 'Word',
 'embeddings',
 'represent',
 'words',
 'as',
 'vectors',
 'FastText',
 'handles',
 'rare',
 'and',
 'unseen',
 'words',
 'effectively']

In [360]:
words = " ".join(words)

print(words)

Natural language processing is a fascinating field Machine learning enables computers to learn from data Deep learning is a subset of machine learning Word embeddings represent words as vectors FastText handles rare and unseen words effectively


### Step 5: Perform the normalization on the text

In [361]:
from nltk.tokenize import RegexpTokenizer

### Create the regular expression tokenizer; the pattern \w+ keeps runs of word characters and drops punctuation

reg = RegexpTokenizer(r'\w+')

res = reg.tokenize(words)

print(res)

['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'Machine', 'learning', 'enables', 'computers', 'to', 'learn', 'from', 'data', 'Deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning', 'Word', 'embeddings', 'represent', 'words', 'as', 'vectors', 'FastText', 'handles', 'rare', 'and', 'unseen', 'words', 'effectively']
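
Because this corpus contains no punctuation, the output above is identical to the tokens from Step 4. The sketch below shows what the `\w+` pattern does on text that does contain punctuation. It uses the standard-library `re` module as a stand-in for `RegexpTokenizer` so that it runs without NLTK; the sample sentence is made up, and the lowercasing is an extra normalization step that the cell above does not perform.

```python
import re

sample = "FastText, unlike Word2Vec, handles rare words!"

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
tokens = re.findall(r"\w+", sample)

# Lowercasing makes "FastText" and "fasttext" map to the same token
normalized = [t.lower() for t in tokens]
print(normalized)
```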


### Step 6: Remove the stopwords from the text

In [362]:
### Define the English stop words

from nltk.corpus import stopwords

english_stopwords = stopwords.words("english")

print("-------------------------------List of all the english stopwords----------------------------------------------------")

print(english_stopwords)


### Filter out the stop words (compare in lowercase, since the stop word list is all lowercase)

res = [x for x in res if x.lower() not in english_stopwords]

print("-------------------------------List of all the Filtered words----------------------------------------------------")
print(res)

-------------------------------List of all the english stopwords----------------------------------------------------
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once

In [363]:
### Tokenized sentences

corpus = [
    "Natural language processing is a fascinating field",
    "Machine learning enables computers to learn from data",
    "Deep learning is a subset of machine learning",
    "Word embeddings represent words as vectors",
    "FastText handles rare and unseen words effectively"
]

tokenized_sentences = [word_tokenize(x.lower()) for x in corpus]

print(tokenized_sentences)

[['natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field'], ['machine', 'learning', 'enables', 'computers', 'to', 'learn', 'from', 'data'], ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning'], ['word', 'embeddings', 'represent', 'words', 'as', 'vectors'], ['fasttext', 'handles', 'rare', 'and', 'unseen', 'words', 'effectively']]
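
Note that the cell above tokenizes the original sentences again, so the stop word filtering from Step 6 does not carry over into `tokenized_sentences`. If filtered input is wanted for training, one option is to apply the filter per sentence, as in this sketch. The small `STOP_WORDS` set is a hypothetical stand-in for `stopwords.words("english")`, and `str.split` stands in for `word_tokenize`, so the example runs without NLTK data.

```python
# Hypothetical subset of the English stop word list, for illustration only
STOP_WORDS = {"is", "a", "to", "from", "of", "as", "and"}

corpus = [
    "Natural language processing is a fascinating field",
    "Deep learning is a subset of machine learning",
]

# Lowercase, split on whitespace, then drop stop words sentence by sentence
filtered_sentences = [
    [tok for tok in sentence.lower().split() if tok not in STOP_WORDS]
    for sentence in corpus
]
print(filtered_sentences)
```

Whether to remove stop words before training embeddings is a design choice; keeping them, as the notebook does, preserves more context words in a corpus this small.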


### Step 7: Train the FastText Model

In [364]:
fasttext_model = FastText(
    sentences    =      tokenized_sentences                  ,
    vector_size  =          50                               ,
    window       =           3                               ,
    min_count    =           1                               ,
    sg           =           1                               ,
    epochs       =          100              
)

In [365]:
fasttext_model

<gensim.models.fasttext.FastText at 0x10c558b4e20>

### OBSERVATIONS:

1.  The FastText model is trained with the following input parameters:

     (a.)   sentences = tokenized_sentences -----------> Input is a list of sentences, each given as a list of tokens

     (b.)   vector_size = 50    -----------------------> Each word is represented by a dense vector of 50 dimensions

     (c.)   window      = 3     ------------------------> Up to 3 words on each side of the target word are used as context

     (d.)   min_count   = 1     ------------------------> Words that occur fewer than 1 time are ignored, so every word is kept

     (e.)   sg   = 1            ------------------------> The skip-gram architecture is used (sg = 0 would select CBOW)

     (f.)   epochs = 100         -----------------------> The number of training passes over the corpus is 100
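
To illustrate what `window` and `sg = 1` mean, the following sketch generates the (target, context) pairs that a skip-gram model would try to predict. It is a simplification under stated assumptions: it always uses the full window, whereas gensim by default also shrinks the effective window at random during training. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=3):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["deep", "learning", "is", "a", "subset"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1` only direct neighbours are paired; a larger window adds more distant words as context.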

### Step 8: Explore the Trained Model

In [366]:
### Vector representation of the word 'natural' (tokens were lowercased before training)

fasttext_model.wv['natural']

array([-7.2463701e-04, -5.3047930e-04,  6.6479814e-04, -1.2593132e-03,
        3.0367132e-03, -2.3750232e-03,  4.7879447e-03,  2.8217584e-03,
       -1.8424509e-04,  2.9925893e-03, -3.1626706e-03, -1.3293079e-03,
        1.6390678e-03,  1.4841135e-03,  2.2499785e-03, -8.2287972e-04,
       -2.3260086e-03,  6.3928356e-04, -2.3356080e-03, -4.8548065e-04,
       -8.5085107e-04,  2.4301901e-03, -1.5599162e-03, -1.4388099e-03,
        3.9753891e-03, -1.0512471e-03, -3.0597935e-03, -2.0209956e-03,
       -8.8080805e-04, -3.6776997e-04,  2.1390072e-03,  2.8840434e-03,
       -1.3368093e-05, -1.0970196e-03, -2.0248929e-04, -2.4140853e-04,
        4.1443360e-04, -4.2922402e-04, -1.4309796e-03, -9.1862434e-04,
        1.2138172e-03,  2.1118002e-03, -4.0835668e-03,  1.6549744e-03,
       -4.9756712e-04, -3.3199308e-03, -2.2702778e-03, -9.2173531e-04,
        7.1208744e-04,  1.6569267e-03], dtype=float32)

In [367]:
### Vector representation of the word 'machine'

fasttext_model.wv['machine']

array([-1.8151738e-03,  1.8731835e-03, -1.3348084e-03,  1.3475234e-03,
        1.3195594e-03,  2.3620769e-03, -1.6945349e-03,  4.7780303e-03,
       -2.8437979e-03,  1.1530526e-03, -3.7804700e-03,  1.4331297e-04,
        2.8665448e-03,  9.4495638e-04,  1.8654962e-03,  2.3849651e-03,
       -3.0143530e-04,  2.7725664e-03,  4.1818535e-03,  1.6576212e-04,
       -3.2666209e-03, -3.6462671e-03,  4.1487538e-03, -2.0292855e-03,
       -6.5047789e-04,  1.5003482e-03, -1.6418330e-03, -2.2933257e-03,
        1.2275674e-04,  2.1315969e-03, -2.4044698e-03, -2.2868451e-03,
        2.7977030e-03, -1.7129083e-03,  5.4581878e-03,  4.6377955e-03,
        3.1717520e-03,  2.6782481e-03, -2.6908063e-03,  1.2677161e-03,
       -2.5304388e-03,  3.3428398e-04,  1.1895931e-03, -5.2325479e-03,
        2.4010935e-03, -5.8245561e-03,  5.0611510e-05, -2.2862332e-03,
        1.8495488e-03, -4.6029566e-03], dtype=float32)

In [368]:
### Vector representation of the word 'deep'

fasttext_model.wv['deep']

array([-2.2241930e-03,  3.8676456e-04, -3.7578144e-03, -1.0782945e-03,
        8.5799803e-04,  4.8130631e-04, -2.7068353e-03,  1.7609344e-03,
        3.4147568e-04,  5.3201732e-03,  2.5767388e-03, -8.7948246e-03,
       -1.9125718e-03,  1.4498566e-03, -3.6431815e-05,  5.0643785e-03,
       -5.6656729e-03, -1.2239072e-03,  1.7446767e-03,  2.0881162e-03,
        3.7119447e-04,  9.0195914e-04, -1.7362975e-03, -1.8576275e-03,
        9.7739801e-04,  8.3807507e-04, -4.0719640e-03,  1.5394125e-03,
        3.6462902e-03, -9.2380783e-03,  4.0885871e-03, -2.8355708e-03,
       -3.6883189e-03,  5.9062638e-04, -9.2948861e-03,  3.8188382e-03,
       -3.9598304e-03, -3.2933389e-03, -2.1968626e-03,  1.1714228e-03,
       -5.3502065e-03,  3.9881472e-03,  3.6802196e-03, -6.3529387e-03,
       -2.1181349e-04, -1.0554730e-03, -4.1337675e-04,  4.3161181e-03,
       -2.1706399e-05,  8.1578776e-04], dtype=float32)

In [369]:
### Find most similar words to a given word 'learning'

fasttext_model.wv.most_similar('learning')

[('learn', 0.566283643245697),
 ('embeddings', 0.3166630268096924),
 ('processing', 0.3066617250442505),
 ('is', 0.2814252972602844),
 ('language', 0.22058233618736267),
 ('fascinating', 0.1873551607131958),
 ('computers', 0.1703764796257019),
 ('natural', 0.1666349172592163),
 ('and', 0.1589219719171524),
 ('as', 0.14239010214805603)]

### OBSERVATIONS:

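1. The list above shows the words most similar to 'learning', ranked by the cosine similarity between their vectors and the vector of 'learning'. A higher score means the two vectors point in a more similar direction. 'learn' ranks first, most likely because it shares many subwords with 'learning'.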
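
The scores returned by `most_similar` are cosine similarities. As a minimal sketch of that measure, written in plain Python on made-up two-dimensional vectors (the function name `cosine_similarity` is defined here and is not a gensim call):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because only the direction matters, the small magnitudes of the trained vectors shown earlier do not by themselves make the similarity scores small.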

In [370]:
### Find most similar words to a given word 'machine'

fasttext_model.wv.most_similar('machine')

[('rare', 0.4053672254085541),
 ('of', 0.26202529668807983),
 ('handles', 0.14107008278369904),
 ('and', 0.07821687310934067),
 ('word', 0.07267072051763535),
 ('enables', 0.06929602473974228),
 ('language', 0.05143461003899574),
 ('processing', 0.031261101365089417),
 ('represent', 0.028187723830342293),
 ('unseen', 0.027802225202322006)]

### OBSERVATIONS:

1. The list above shows the words most similar to 'machine', ranked by cosine similarity to its vector. The scores are low and the top match, 'rare', has no obvious relation to 'machine', which is not surprising for a corpus of only five sentences.

### Step 9: Handle Out-of-Vocabulary (OOV) Words

This is where FastText stands out: it can derive a vector for an unknown word that is not in the vocabulary by using the subwords it shares with words that are in the vocabulary.

In [371]:
### Find the vector representation of the word 'learnings' that is unknown

fasttext_model.wv['learnings']

array([-8.6743815e-04, -9.2491199e-04, -1.3378768e-03, -1.8773484e-03,
       -5.4910220e-04,  1.5938377e-03, -7.2624895e-04, -1.0992755e-03,
       -1.9265991e-03,  5.2845478e-04, -2.2062492e-04, -3.9157597e-03,
        3.1014791e-04,  2.5739563e-03, -1.4379311e-03, -1.1849512e-03,
        6.8146852e-05, -1.3054026e-03, -1.9864601e-03,  3.4175890e-03,
        4.0605391e-04, -4.9657107e-04, -7.8516663e-04,  1.0991159e-03,
       -1.3701107e-03, -3.7043048e-03, -2.8391177e-04,  3.4112758e-03,
       -2.7216771e-03,  5.7162889e-03,  1.1733305e-03,  1.8359000e-03,
       -2.6961204e-03,  2.9323995e-03, -7.0885126e-04, -2.5519500e-03,
        1.7971118e-03, -9.7494310e-04,  2.7351868e-03,  4.3866527e-03,
        2.2974834e-03,  6.8506144e-04,  2.4359836e-03, -7.4982658e-05,
        1.2949929e-03,  1.5505051e-03, -2.9311841e-03, -2.3272594e-03,
       -2.3558319e-03,  2.7747609e-04], dtype=float32)

### OBSERVATIONS:

1. 'learnings' is not present in the vocabulary.

2. The trained FastText model still returns a vector for it, because it builds the vector from the character n-grams of 'learnings', most of which were already learned from the vocabulary word 'learning'. The word 'learnings' itself is not trained.
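
The following sketch illustrates why an unseen word can still get a vector. It counts the character n-grams that 'learnings' shares with 'learning' and then composes a vector by averaging n-gram vectors. The assumptions are loud ones: the n-gram vectors are random toy values rather than trained ones, a plain dictionary replaces the hashing of n-grams into buckets that fastText implementations use, and the names `char_ngrams`, `ngram_vectors` and `compose` are hypothetical.

```python
import random

def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams of `word`, with < and > as boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

known = char_ngrams("learning")     # word seen during training
unknown = char_ngrams("learnings")  # out-of-vocabulary word
shared = known & unknown

# Toy table: random 4-dimensional vectors standing in for trained n-gram vectors
rng = random.Random(0)
ngram_vectors = {
    gram: [rng.uniform(-1, 1) for _ in range(4)]
    for gram in sorted(known | unknown)
}

def compose(word):
    """Average the vectors of the word's n-grams, dimension by dimension."""
    grams = sorted(char_ngrams(word))
    columns = zip(*(ngram_vectors[g] for g in grams))
    return [sum(col) / len(grams) for col in columns]

oov_vector = compose("learnings")
print(len(shared), "of", len(unknown), "n-grams are shared")
print(oov_vector)
```

In a trained model the n-grams containing the trailing "s" may have received little or no training signal, so the quality of an out-of-vocabulary vector depends on how much of the word is covered by well-trained n-grams.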

In [372]:
# Save model (this writes gensim's own format, despite the .bin extension)
fasttext_model.save("fasttext_model.bin")

# Load model later
loaded_model = FastText.load("fasttext_model.bin")

# Verify loaded model
print("\nSimilar words to 'data' from loaded model:")
print(loaded_model.wv.most_similar('data'))


Similar words to 'data' from loaded model:
[('represent', 0.40004679560661316), ('to', 0.28480449318885803), ('is', 0.26224812865257263), ('from', 0.24866099655628204), ('and', 0.15458159148693085), ('enables', 0.15229645371437073), ('handles', 0.14785687625408173), ('fascinating', 0.09904083609580994), ('effectively', 0.09806496649980545), ('subset', 0.09262485802173615)]
