# Building word vectors using the fastText library

fastText is a library for efficient learning of word representations and sentence classification, created by the Facebook Research team.

To build word vectors using fastText, I will follow the steps given below:
* Use the FastText class from the gensim library.
* Preprocess the input data.
* Break each input sentence into a list of words, giving a list of lists.
* Build a vocabulary on top of the input list of lists.
* Train the model with the preceding input data over multiple epochs.
* Calculate the similarity between words.

fastText differs from word2vec in the unit for which it learns vectors:
* word2vec considers every single word as the smallest unit whose vector representation is to be found.
* fastText assumes a word to be formed of character n-grams. For example, sunny is composed of [sun, sunn, sunny], [sunny, unny, nny], and so on, where each piece is a substring of the original word of size n. In principle n could range from 1 to the length of the original word; in gensim it is bounded by the min_n and max_n parameters (3 and 6 by default).
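
The n-gram decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not gensim's internal code: the function name `char_ngrams` is hypothetical, and the sketch assumes the fastText convention of wrapping the word in `<` and `>` boundary markers before slicing it.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked word
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


print(char_ngrams("sunny", min_n=3, max_n=4))
# ['<su', 'sun', 'unn', 'nny', 'ny>', '<sun', 'sunn', 'unny', 'nny>']
```

The boundary markers let the model distinguish a prefix such as `<su` from the same letters appearing in the middle of another word.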

fastText uses skip-gram/CBOW models.
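
To make the difference between the two training schemes concrete, here is a small sketch of the training examples each one derives from a tokenized sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and only illustrate the idea; gensim builds its examples internally and selects the scheme with the `sg` parameter (`sg=0` for CBOW, `sg=1` for skip-gram).

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from all of its context words."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["virginamerica", "really", "big", "bad", "thing"]
print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_examples(tokens, window=1)[:2])
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position, which is part of why CBOW tends to train faster on the same corpus.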

Import the modules.

In [1]:
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
nltk.download('stopwords')
from gensim.models.fasttext import FastText

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maninaya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Read the airline tweets sentiment dataset, which contains comments (text) related to airlines and their corresponding sentiment. The dataset can be obtained from https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

In [3]:
data = pd.read_csv('https://www.dropbox.com/s/8yq0edd4q908xqw/airline_sentiment.csv?dl=1')

A sample of the dataset looks as follows

In [4]:
data.head()

Unnamed: 0,airline_sentiment,text
0,1,@VirginAmerica plus you've added commercials t...
1,0,@VirginAmerica it's really aggressive to blast...
2,0,@VirginAmerica and it's a really big bad thing...
3,0,@VirginAmerica seriously would pay $30 a fligh...
4,1,"@VirginAmerica yes, nearly every time I fly VX..."


To preprocess the input sentences, I followed these steps:
* Lowercase all words.
* Remove punctuation.
* Remove the stop words.

In [5]:
stop = set(stopwords.words('english'))
def preprocess(text):
    text=text.lower()
    text=re.sub('[^0-9a-zA-Z]+',' ',text)
    words = text.split()
    words2 = [word for word in words if word not in stop]
    words3=' '.join(words2)
    return(words3)
data['text'] = data['text'].apply(preprocess)

After preprocessing, the dataset looks as follows.

In [6]:
data.head()

Unnamed: 0,airline_sentiment,text
0,1,virginamerica plus added commercials experienc...
1,0,virginamerica really aggressive blast obnoxiou...
2,0,virginamerica really big bad thing
3,0,virginamerica seriously would pay 30 flight se...
4,1,virginamerica yes nearly every time fly vx ear...


Convert the input text into a list of lists.

In [7]:
# Split each preprocessed tweet into a list of tokens
list_words = [text.split() for text in data['text']]

Print the first element of list_words.

In [8]:
print(list_words[0])

['virginamerica', 'plus', 'added', 'commercials', 'experience', 'tacky']


Define the model (specify the number of dimensions of each word vector) and build a vocabulary. Note that the size parameter used here was renamed to vector_size in gensim 4.0.

In [9]:
ft_model = FastText(size=100)
ft_model.build_vocab(list_words)

2019-06-24 21:09:45,587 : INFO : resetting layer weights
  "C extension not loaded, training will be slow. "
2019-06-24 21:09:53,873 : INFO : collecting all words and their counts
2019-06-24 21:09:53,875 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-24 21:09:53,919 : INFO : PROGRESS: at sentence #10000, processed 108465 words, keeping 11585 word types
2019-06-24 21:09:53,926 : INFO : collected 12532 word types from a corpus of 125858 raw words and 11541 sentences
2019-06-24 21:09:53,928 : INFO : Loading a fresh vocabulary
2019-06-24 21:09:53,943 : INFO : effective_min_count=5 retains 2544 unique words (20% of original 12532, drops 9988)
2019-06-24 21:09:53,945 : INFO : effective_min_count=5 leaves 111195 word corpus (88% of original 125858, drops 14663)
2019-06-24 21:09:53,959 : INFO : deleting the raw counts dictionary of 12532 items
2019-06-24 21:09:53,961 : INFO : sample=0.001 downsamples 54 most-common words
2019-06-24 21:09:53,962 : INFO : down
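
The log above reports that `effective_min_count=5` retains 2544 of the 12532 word types. The following is a minimal sketch of that kind of frequency filter, assuming a plain word count over the list of lists; the function name `filter_vocab` and the toy sentences are illustrative and are not part of gensim or of the dataset.

```python
from collections import Counter


def filter_vocab(sentences, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy corpus (made up for illustration)
toy = [["flight", "late"], ["flight", "cancelled"], ["flight", "late", "again"]]
print(filter_vocab(toy, min_count=2))
# {'flight': 3, 'late': 2}
```

Rare words dropped this way get no word-level vector of their own, but fastText can still build a vector for them from their character n-grams.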

Train the model.

In [10]:
ft_model.train(list_words, total_examples=ft_model.corpus_count,epochs=100)

2019-06-24 21:09:59,769 : INFO : training model with 3 workers on 2544 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-06-24 21:10:27,574 : INFO : EPOCH 1 - PROGRESS: at 7.89% examples, 271 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:10:29,717 : INFO : EPOCH 1 - PROGRESS: at 15.69% examples, 495 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:10:52,608 : INFO : EPOCH 1 - PROGRESS: at 31.29% examples, 555 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:10:55,480 : INFO : EPOCH 1 - PROGRESS: at 39.97% examples, 656 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:11:06,482 : INFO : EPOCH 1 - PROGRESS: at 48.12% examples, 657 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:11:14,546 : INFO : EPOCH 1 - PROGRESS: at 56.63% examples, 682 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:11:21,999 : INFO : EPOCH 1 - PROGRESS: at 64.39% examples, 710 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:11:40,085 : INFO : EPOCH 1 - PROGRESS: at 71.89% examples, 654 words

2019-06-24 21:19:23,489 : INFO : EPOCH - 5 : training on 125858 raw words (92119 effective words) took 108.3s, 851 effective words/s
2019-06-24 21:19:50,583 : INFO : EPOCH 6 - PROGRESS: at 7.96% examples, 266 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:19:52,697 : INFO : EPOCH 6 - PROGRESS: at 15.76% examples, 496 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:19:54,590 : INFO : EPOCH 6 - PROGRESS: at 23.65% examples, 706 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:20:18,242 : INFO : EPOCH 6 - PROGRESS: at 31.29% examples, 535 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:20:22,677 : INFO : EPOCH 6 - PROGRESS: at 39.43% examples, 618 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:20:45,664 : INFO : EPOCH 6 - PROGRESS: at 56.63% examples, 621 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:20:54,035 : INFO : EPOCH 6 - PROGRESS: at 64.14% examples, 644 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:20:56,033 : INFO : EPOCH 6 - PROGRESS: at 71.89% examples, 710 words/s, in_qsize 4, 

2019-06-24 21:28:40,408 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 21:28:40,410 : INFO : EPOCH - 10 : training on 125858 raw words (92180 effective words) took 113.6s, 811 effective words/s
2019-06-24 21:29:07,613 : INFO : EPOCH 11 - PROGRESS: at 7.96% examples, 266 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:29:09,940 : INFO : EPOCH 11 - PROGRESS: at 15.86% examples, 498 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:29:34,466 : INFO : EPOCH 11 - PROGRESS: at 31.29% examples, 542 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:29:40,022 : INFO : EPOCH 11 - PROGRESS: at 39.43% examples, 613 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:30:06,176 : INFO : EPOCH 11 - PROGRESS: at 56.63% examples, 593 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:30:13,014 : INFO : EPOCH 11 - PROGRESS: at 64.14% examples, 628 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:30:31,186 : INFO : EPOCH 11 - PROGRESS: at 79.69% examples, 658 words/s, in_qsize 3, out_qsize 0

2019-06-24 21:38:08,076 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 21:38:08,702 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 21:38:08,703 : INFO : EPOCH - 15 : training on 125858 raw words (92075 effective words) took 117.1s, 786 effective words/s
2019-06-24 21:38:34,446 : INFO : EPOCH 16 - PROGRESS: at 7.96% examples, 282 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:38:37,538 : INFO : EPOCH 16 - PROGRESS: at 15.76% examples, 502 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:38:39,011 : INFO : EPOCH 16 - PROGRESS: at 23.65% examples, 726 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:39:00,939 : INFO : EPOCH 16 - PROGRESS: at 31.29% examples, 560 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:39:07,156 : INFO : EPOCH 16 - PROGRESS: at 39.97% examples, 625 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:39:10,972 : INFO : EPOCH 16 - PROGRESS: at 48.12% examples, 703 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:39:28

2019-06-24 21:48:06,757 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 21:48:06,757 : INFO : EPOCH - 20 : training on 125858 raw words (92069 effective words) took 114.5s, 804 effective words/s
2019-06-24 21:48:32,618 : INFO : EPOCH 21 - PROGRESS: at 7.96% examples, 279 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:48:35,007 : INFO : EPOCH 21 - PROGRESS: at 15.76% examples, 513 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:48:36,485 : INFO : EPOCH 21 - PROGRESS: at 23.65% examples, 739 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:48:59,342 : INFO : EPOCH 21 - PROGRESS: at 31.29% examples, 557 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:49:03,597 : INFO : EPOCH 21 - PROGRESS: at 39.43% examples, 644 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:49:24,146 : INFO : EPOCH 21 - PROGRESS: at 56.63% examples, 659 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:49:32,026 : INFO : EPOCH 21 - PROGRESS: at 64.39% examples, 684 words/s, in_qsize 5, out_qsize 0

2019-06-24 21:58:15,052 : INFO : EPOCH 26 - PROGRESS: at 15.76% examples, 513 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:58:37,599 : INFO : EPOCH 26 - PROGRESS: at 31.29% examples, 577 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:58:42,339 : INFO : EPOCH 26 - PROGRESS: at 39.97% examples, 658 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:59:01,645 : INFO : EPOCH 26 - PROGRESS: at 56.63% examples, 681 words/s, in_qsize 6, out_qsize 0
2019-06-24 21:59:09,844 : INFO : EPOCH 26 - PROGRESS: at 64.39% examples, 702 words/s, in_qsize 5, out_qsize 0
2019-06-24 21:59:27,785 : INFO : EPOCH 26 - PROGRESS: at 79.69% examples, 723 words/s, in_qsize 3, out_qsize 0
2019-06-24 21:59:38,380 : INFO : EPOCH 26 - PROGRESS: at 87.80% examples, 721 words/s, in_qsize 2, out_qsize 1
2019-06-24 21:59:38,381 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 21:59:38,739 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 21:59:39,556 : INFO : EPOCH 26

2019-06-24 22:09:19,829 : INFO : EPOCH 31 - PROGRESS: at 87.80% examples, 690 words/s, in_qsize 2, out_qsize 1
2019-06-24 22:09:19,832 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:09:22,866 : INFO : EPOCH 31 - PROGRESS: at 95.55% examples, 734 words/s, in_qsize 1, out_qsize 1
2019-06-24 22:09:22,867 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:09:23,210 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:09:23,211 : INFO : EPOCH - 31 : training on 125858 raw words (92043 effective words) took 119.9s, 768 effective words/s
2019-06-24 22:09:48,787 : INFO : EPOCH 32 - PROGRESS: at 7.80% examples, 285 words/s, in_qsize 5, out_qsize 0
2019-06-24 22:09:50,395 : INFO : EPOCH 32 - PROGRESS: at 15.76% examples, 534 words/s, in_qsize 5, out_qsize 0
2019-06-24 22:09:52,453 : INFO : EPOCH 32 - PROGRESS: at 23.65% examples, 755 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:10:15,683 : INFO : EPOCH 

2019-06-24 22:18:45,612 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:18:48,843 : INFO : EPOCH 36 - PROGRESS: at 92.25% examples, 785 words/s, in_qsize 1, out_qsize 1
2019-06-24 22:18:48,845 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:18:49,627 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:18:49,628 : INFO : EPOCH - 36 : training on 125858 raw words (91914 effective words) took 108.6s, 846 effective words/s
2019-06-24 22:19:13,348 : INFO : EPOCH 37 - PROGRESS: at 7.80% examples, 309 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:19:20,217 : INFO : EPOCH 37 - PROGRESS: at 23.65% examples, 719 words/s, in_qsize 5, out_qsize 0
2019-06-24 22:19:37,132 : INFO : EPOCH 37 - PROGRESS: at 31.29% examples, 616 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:19:40,458 : INFO : EPOCH 37 - PROGRESS: at 39.43% examples, 718 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:19:48,582 : INFO : EPOCH 

2019-06-24 22:27:47,793 : INFO : EPOCH 41 - PROGRESS: at 71.89% examples, 656 words/s, in_qsize 4, out_qsize 0
2019-06-24 22:28:00,571 : INFO : EPOCH 41 - PROGRESS: at 79.69% examples, 648 words/s, in_qsize 3, out_qsize 0
2019-06-24 22:28:05,959 : INFO : EPOCH 41 - PROGRESS: at 87.80% examples, 680 words/s, in_qsize 2, out_qsize 1
2019-06-24 22:28:05,961 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:28:11,705 : INFO : EPOCH 41 - PROGRESS: at 95.55% examples, 708 words/s, in_qsize 1, out_qsize 1
2019-06-24 22:28:11,706 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:28:11,880 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:28:11,881 : INFO : EPOCH - 41 : training on 125858 raw words (92149 effective words) took 124.2s, 742 effective words/s
2019-06-24 22:28:37,698 : INFO : EPOCH 42 - PROGRESS: at 7.80% examples, 283 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:28:39,203 : INFO : EPOCH 

2019-06-24 22:37:10,976 : INFO : EPOCH 46 - PROGRESS: at 64.39% examples, 692 words/s, in_qsize 5, out_qsize 0
2019-06-24 22:37:15,537 : INFO : EPOCH 46 - PROGRESS: at 71.89% examples, 739 words/s, in_qsize 4, out_qsize 0
2019-06-24 22:37:26,106 : INFO : EPOCH 46 - PROGRESS: at 79.69% examples, 735 words/s, in_qsize 3, out_qsize 0
2019-06-24 22:37:41,082 : INFO : EPOCH 46 - PROGRESS: at 87.80% examples, 703 words/s, in_qsize 2, out_qsize 1
2019-06-24 22:37:41,084 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:37:41,496 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:37:41,909 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:37:41,910 : INFO : EPOCH - 46 : training on 125858 raw words (92212 effective words) took 115.4s, 799 effective words/s
2019-06-24 22:38:08,623 : INFO : EPOCH 47 - PROGRESS: at 7.80% examples, 272 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:38:09,969 : INFO : EPOCH 

2019-06-24 22:46:44,959 : INFO : EPOCH 51 - PROGRESS: at 64.39% examples, 670 words/s, in_qsize 5, out_qsize 0
2019-06-24 22:46:48,584 : INFO : EPOCH 51 - PROGRESS: at 71.89% examples, 724 words/s, in_qsize 4, out_qsize 0
2019-06-24 22:47:06,069 : INFO : EPOCH 51 - PROGRESS: at 79.69% examples, 675 words/s, in_qsize 3, out_qsize 0
2019-06-24 22:47:14,602 : INFO : EPOCH 51 - PROGRESS: at 87.80% examples, 689 words/s, in_qsize 2, out_qsize 1
2019-06-24 22:47:14,604 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:47:19,243 : INFO : EPOCH 51 - PROGRESS: at 95.55% examples, 723 words/s, in_qsize 1, out_qsize 1
2019-06-24 22:47:19,247 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:47:19,553 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:47:19,555 : INFO : EPOCH - 51 : training on 125858 raw words (92111 effective words) took 121.8s, 756 effective words/s
2019-06-24 22:47:44,729 : INFO : EPOCH

2019-06-24 22:57:29,054 : INFO : EPOCH 56 - PROGRESS: at 79.69% examples, 613 words/s, in_qsize 3, out_qsize 0
2019-06-24 22:57:30,440 : INFO : EPOCH 56 - PROGRESS: at 87.80% examples, 667 words/s, in_qsize 2, out_qsize 1
2019-06-24 22:57:30,441 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 22:57:32,034 : INFO : EPOCH 56 - PROGRESS: at 95.55% examples, 719 words/s, in_qsize 1, out_qsize 1
2019-06-24 22:57:32,036 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 22:57:35,120 : INFO : EPOCH 56 - PROGRESS: at 100.00% examples, 735 words/s, in_qsize 0, out_qsize 1
2019-06-24 22:57:35,121 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 22:57:35,122 : INFO : EPOCH - 56 : training on 125858 raw words (91900 effective words) took 125.0s, 735 effective words/s
2019-06-24 22:58:04,568 : INFO : EPOCH 57 - PROGRESS: at 7.96% examples, 243 words/s, in_qsize 6, out_qsize 0
2019-06-24 22:58:06,070 : INFO : EPOCH

2019-06-24 23:08:02,415 : INFO : EPOCH 62 - PROGRESS: at 7.96% examples, 258 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:08:03,465 : INFO : EPOCH 62 - PROGRESS: at 15.76% examples, 499 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:08:31,609 : INFO : EPOCH 62 - PROGRESS: at 31.29% examples, 513 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:08:33,429 : INFO : EPOCH 62 - PROGRESS: at 48.12% examples, 744 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:09:01,723 : INFO : EPOCH 62 - PROGRESS: at 56.63% examples, 585 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:09:04,030 : INFO : EPOCH 62 - PROGRESS: at 64.39% examples, 653 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:09:33,669 : INFO : EPOCH 62 - PROGRESS: at 79.69% examples, 614 words/s, in_qsize 3, out_qsize 0
2019-06-24 23:09:35,829 : INFO : EPOCH 62 - PROGRESS: at 87.80% examples, 664 words/s, in_qsize 2, out_qsize 1
2019-06-24 23:09:35,831 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 23:09:36,51

2019-06-24 23:20:06,072 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 23:20:07,233 : INFO : EPOCH 67 - PROGRESS: at 95.55% examples, 721 words/s, in_qsize 1, out_qsize 1
2019-06-24 23:20:07,235 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 23:20:10,757 : INFO : EPOCH 67 - PROGRESS: at 100.00% examples, 735 words/s, in_qsize 0, out_qsize 1
2019-06-24 23:20:10,759 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 23:20:10,760 : INFO : EPOCH - 67 : training on 125858 raw words (92166 effective words) took 125.3s, 735 effective words/s
2019-06-24 23:20:40,820 : INFO : EPOCH 68 - PROGRESS: at 7.96% examples, 241 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:20:41,864 : INFO : EPOCH 68 - PROGRESS: at 23.65% examples, 708 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:21:10,601 : INFO : EPOCH 68 - PROGRESS: at 32.34% examples, 489 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:21:11,979 : INFO : EPOCH

2019-06-24 23:31:15,339 : INFO : EPOCH 73 - PROGRESS: at 7.96% examples, 251 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:31:16,390 : INFO : EPOCH 73 - PROGRESS: at 15.76% examples, 486 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:31:44,772 : INFO : EPOCH 73 - PROGRESS: at 31.29% examples, 504 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:31:46,792 : INFO : EPOCH 73 - PROGRESS: at 48.12% examples, 728 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:32:13,226 : INFO : EPOCH 73 - PROGRESS: at 56.63% examples, 589 words/s, in_qsize 6, out_qsize 0
2019-06-24 23:32:15,094 : INFO : EPOCH 73 - PROGRESS: at 64.39% examples, 660 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:32:17,311 : INFO : EPOCH 73 - PROGRESS: at 71.89% examples, 724 words/s, in_qsize 4, out_qsize 0
2019-06-24 23:32:43,330 : INFO : EPOCH 73 - PROGRESS: at 79.69% examples, 627 words/s, in_qsize 3, out_qsize 0
2019-06-24 23:32:44,878 : INFO : EPOCH 73 - PROGRESS: at 87.80% examples, 681 words/s, in_qsize 2, out_qsize 1
20

2019-06-24 23:43:03,836 : INFO : EPOCH 78 - PROGRESS: at 87.80% examples, 674 words/s, in_qsize 2, out_qsize 1
2019-06-24 23:43:03,839 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 23:43:05,321 : INFO : EPOCH 78 - PROGRESS: at 95.55% examples, 727 words/s, in_qsize 1, out_qsize 1
2019-06-24 23:43:05,323 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 23:43:06,668 : INFO : EPOCH 78 - PROGRESS: at 100.00% examples, 754 words/s, in_qsize 0, out_qsize 1
2019-06-24 23:43:06,669 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 23:43:06,670 : INFO : EPOCH - 78 : training on 125858 raw words (91990 effective words) took 122.0s, 754 effective words/s
2019-06-24 23:43:31,851 : INFO : EPOCH 79 - PROGRESS: at 7.80% examples, 289 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:43:33,664 : INFO : EPOCH 79 - PROGRESS: at 15.76% examples, 537 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:43:37,318 : INFO : EPOCH

2019-06-24 23:51:55,006 : INFO : EPOCH 83 - PROGRESS: at 64.39% examples, 782 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:52:08,446 : INFO : EPOCH 83 - PROGRESS: at 71.89% examples, 746 words/s, in_qsize 4, out_qsize 0
2019-06-24 23:52:17,340 : INFO : EPOCH 83 - PROGRESS: at 79.69% examples, 754 words/s, in_qsize 3, out_qsize 0
2019-06-24 23:52:17,860 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 23:52:26,027 : INFO : EPOCH 83 - PROGRESS: at 92.25% examples, 802 words/s, in_qsize 1, out_qsize 1
2019-06-24 23:52:26,030 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 23:52:26,402 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 23:52:26,403 : INFO : EPOCH - 83 : training on 125858 raw words (92101 effective words) took 106.0s, 869 effective words/s
2019-06-24 23:52:49,247 : INFO : EPOCH 84 - PROGRESS: at 7.96% examples, 315 words/s, in_qsize 5, out_qsize 0
2019-06-24 23:52:50,456 : INFO : EPOCH 

2019-06-25 00:02:00,900 : INFO : EPOCH 88 - PROGRESS: at 79.69% examples, 738 words/s, in_qsize 3, out_qsize 0
2019-06-25 00:02:12,070 : INFO : EPOCH 88 - PROGRESS: at 87.80% examples, 729 words/s, in_qsize 2, out_qsize 1
2019-06-25 00:02:12,073 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-25 00:02:13,884 : INFO : EPOCH 88 - PROGRESS: at 92.25% examples, 756 words/s, in_qsize 1, out_qsize 1
2019-06-25 00:02:13,885 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-25 00:02:14,622 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-25 00:02:14,623 : INFO : EPOCH - 88 : training on 125858 raw words (91922 effective words) took 112.5s, 817 effective words/s
2019-06-25 00:02:38,975 : INFO : EPOCH 89 - PROGRESS: at 7.96% examples, 297 words/s, in_qsize 5, out_qsize 0
2019-06-25 00:02:44,259 : INFO : EPOCH 89 - PROGRESS: at 23.65% examples, 742 words/s, in_qsize 6, out_qsize 0
2019-06-25 00:03:03,521 : INFO : EPOCH 

2019-06-25 00:11:44,875 : INFO : EPOCH 93 - PROGRESS: at 87.80% examples, 733 words/s, in_qsize 2, out_qsize 1
2019-06-25 00:11:44,878 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-25 00:11:46,123 : INFO : EPOCH 93 - PROGRESS: at 95.55% examples, 791 words/s, in_qsize 1, out_qsize 1
2019-06-25 00:11:46,125 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-25 00:11:46,767 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-25 00:11:46,768 : INFO : EPOCH - 93 : training on 125858 raw words (91989 effective words) took 111.5s, 825 effective words/s
2019-06-25 00:12:10,585 : INFO : EPOCH 94 - PROGRESS: at 7.96% examples, 302 words/s, in_qsize 5, out_qsize 0
2019-06-25 00:12:11,664 : INFO : EPOCH 94 - PROGRESS: at 15.76% examples, 580 words/s, in_qsize 6, out_qsize 0
2019-06-25 00:12:15,854 : INFO : EPOCH 94 - PROGRESS: at 23.65% examples, 754 words/s, in_qsize 6, out_qsize 0
2019-06-25 00:12:34,990 : INFO : EPOCH 

2019-06-25 00:20:14,124 : INFO : EPOCH 98 - PROGRESS: at 64.39% examples, 719 words/s, in_qsize 5, out_qsize 0
2019-06-25 00:20:15,556 : INFO : EPOCH 98 - PROGRESS: at 71.89% examples, 795 words/s, in_qsize 4, out_qsize 0
2019-06-25 00:20:32,780 : INFO : EPOCH 98 - PROGRESS: at 79.69% examples, 732 words/s, in_qsize 3, out_qsize 0
2019-06-25 00:20:40,807 : INFO : EPOCH 98 - PROGRESS: at 87.44% examples, 746 words/s, in_qsize 2, out_qsize 1
2019-06-25 00:20:40,809 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-25 00:20:45,567 : INFO : EPOCH 98 - PROGRESS: at 91.89% examples, 753 words/s, in_qsize 1, out_qsize 1
2019-06-25 00:20:45,568 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-25 00:20:45,745 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-25 00:20:45,746 : INFO : EPOCH - 98 : training on 125858 raw words (92004 effective words) took 112.6s, 817 effective words/s
2019-06-25 00:21:11,029 : INFO : EPOCH

Check the word vectors of a word that is not present in the vocabulary of the model. For example, the word first is present in the vocabulary; however, the word firstli is not present in the vocabulary. In such a scenario, check the similarity between the word vectors for first and firstli.

In [11]:
ft_model.wv.similarity('first', 'firstli')



0.97277683

The output of the preceding code snippet is 0.97, which indicates a very high cosine similarity between the vectors of the two words.
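
The score returned by gensim here is a cosine similarity between the two word vectors. As a minimal sketch of what that measure computes, the following plain-Python function (the name `cosine_similarity` is chosen for illustration) divides the dot product of two vectors by the product of their lengths.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing in the same direction score close to 1
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors score 0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on direction, a value near 1 means the two vectors point almost the same way, regardless of their magnitudes.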

Thus, we can see that fastText can generate word vectors for words that are not present in the vocabulary, because it composes them from the character n-grams they share with known words.
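
One way to see why first and firstli end up close together is to count the character n-grams they have in common. The sketch below only measures that overlap; it does not reproduce how gensim combines n-gram vectors, and the actual similarity depends on the learned weights. The helper name `char_ngrams`, the `<`/`>` boundary markers, and the 3 to 6 n-gram range are assumptions made for the illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` with boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)}


known = char_ngrams("first")     # in-vocabulary word
unseen = char_ngrams("firstli")  # out-of-vocabulary word
shared = known & unseen

print(sorted(shared))
print(len(shared), "of", len(known), "n-grams of 'first' also occur in 'firstli'")
```

Most of the n-grams of first reappear in firstli, so a vector assembled from the n-grams of firstli draws largely on the same learned pieces.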

The preceding method could also be used to correct spelling mistakes, if any, within our corpus of data. Incorrectly spelled words are likely to occur rarely, so among the words most similar to a misspelling, the one with the highest frequency is likely to be the correctly spelled version.
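
The frequency heuristic just described could be sketched as follows. This is only one possible reading of the idea: the function `suggest_correction`, its `min_similarity` threshold, and the neighbour scores and counts in the example are all hypothetical values chosen for illustration, not results from the trained model.

```python
def suggest_correction(word, neighbours, counts, min_similarity=0.8):
    """Return the most frequent close neighbour of `word`, or `word` itself.

    neighbours: list of (candidate, similarity) pairs, e.g. from a
                nearest-neighbour query on the word vectors.
    counts:     corpus frequency of each word.
    """
    candidates = [
        cand for cand, sim in neighbours
        if sim >= min_similarity and counts.get(cand, 0) > counts.get(word, 0)
    ]
    if not candidates:
        return word  # nothing both close and more frequent: leave unchanged
    return max(candidates, key=lambda cand: counts[cand])


# Toy inputs (made up for illustration)
neighbours = [("experience", 0.88), ("experienced", 0.81), ("expense", 0.42)]
counts = {"experience": 120, "experienced": 30, "exprience": 1}
print(suggest_correction("exprience", neighbours, counts))
# experience
```

Requiring the candidate to be more frequent than the input word keeps the function from rewriting a common, correctly spelled word into a rarer neighbour.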


Spelling corrections can be performed using vector arithmetic, as follows:

In [12]:
result = ft_model.wv.most_similar(positive=['exprience', 'prmise'], negative=['experience'], topn=1)
print(result)

2019-06-25 00:24:32,989 : INFO : precomputing L2-norms of word weight vectors
2019-06-25 00:24:32,998 : INFO : precomputing L2-norms of ngram weight vectors


[('injury', 0.2846823036670685)]


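Note that in the preceding code, the positive words contain spelling mistakes, while the negative word does not. The intended result of the arithmetic is promise. In the run shown above, however, the top result is injury with a low similarity of about 0.28, so vector arithmetic did not correct the spelling mistake in this case.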
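
The idea behind a query with positive and negative words can be sketched with hand-picked toy vectors. This is a rough illustration of the arithmetic (add the normalized positive vectors, subtract the normalized negative ones, rank the remaining words by cosine similarity), not gensim's exact implementation. The function `analogy_query` and the three-dimensional vectors are hypothetical; on a trained model the result depends entirely on the learned vectors.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy_query(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[word]))]
    query = normalize(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [
        (word, sum(q * x for q, x in zip(query, normalize(vec))))
        for word, vec in vectors.items() if word not in exclude
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors chosen by hand so that the misspelling offset is shared
toy = {
    "experience": [1.0, 0.0, 0.0],
    "exprience": [1.0, 0.2, 0.0],
    "prmise": [0.0, 0.2, 1.0],
    "promise": [0.0, 0.0, 1.0],
    "injury": [0.5, 0.5, 0.0],
}
result = analogy_query(toy, ["exprience", "prmise"], ["experience"])
print(result[0][0])
```

The approach relies on misspellings shifting vectors in a consistent direction, which a small, noisy corpus does not guarantee.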

Alternatively, spelling correction can be performed by looking up the nearest neighbor of the misspelled word directly:

In [13]:
ft_model.wv.most_similar('exprience', topn=1)



[('experience', 0.8755375742912292)]

Note that this does not work when a word contains multiple spelling mistakes.