**Subscribe** to this channel and **follow** *@dsbyhadi* on twitter for updates!

*email: datasciencebyhadi@gmail.com*

----

### Sentiment analysis via Embedding.
#### Tool: fastText
main website: https://www.fasttext.cc  
text classification tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html  
github: https://github.com/facebookresearch/fastText/tree/master/python

In [3]:
import fastText #version 0.8.22
import pandas as pd
import os
from fastText import train_supervised
import numpy as np

***Note:*** 
fasttext's default arguments differ between the Python API and the CLI. Always check which values are actually in effect, or pass every parameter explicitly.

In [4]:
# To show the output of all lines in a cell rather than just the last line
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

`The architecture of fasttext supervised learning is like word2vec CBOW where the target word is replaced with the label.`
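To make the statement above concrete, here is a minimal sketch of that forward pass: average the token embeddings, apply one linear layer, and take a softmax over the labels. The embeddings and label weights are hand-picked toy numbers (hypothetical, not learned from any data), and `predict_proba` is an illustrative name, not part of the fasttext API.

```python
import math

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(tokens, embeddings, label_weights):
    """Hidden layer = mean of token embeddings; output = softmax(W . hidden)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("none of the tokens has an embedding")
    hidden = average(vecs)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in label_weights]
    return softmax(scores)

# Toy parameters (assumed for illustration only)
embeddings = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "case": [0.0, 0.5]}
label_weights = [[-2.0, 0.0],   # row for __label__0
                 [2.0, 0.0]]    # row for __label__1

probs = predict_proba(["good", "case"], embeddings, label_weights)
print(probs)  # probability of __label__0, then __label__1
```

In CBOW the output layer would score every vocabulary word as the target; here it scores only the labels, which is why the output layer is so small.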

Link to download the data: https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set#sentiment%20labelled%20sentences.zip

In [6]:
DATA_DIR = "~/Downloads/sentiment_labelled_sentences/"

In [7]:
data_amazon = pd.read_table(DATA_DIR+"amazon_cells_labelled.txt"
                          , sep='\t', header=None, names=['sentence', 'sentiment'])
data_imdb = pd.read_table(DATA_DIR+"imdb_labelled.txt"
                          , sep='\t', header=None, names=['sentence', 'sentiment'])
data_yelp = pd.read_table(DATA_DIR+"yelp_labelled.txt"
                          , sep='\t', header=None, names=['sentence', 'sentiment'])

In [8]:
all_data = pd.concat([data_amazon, data_imdb, data_yelp])
all_data.head(2)

Unnamed: 0,sentence,sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1


*Formatting the label with fasttext's default `__label__` prefix*

***Note***: You can have multiple labels for each entry, but fasttext will randomly select one of those labels while training.
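As a small illustration of the line format, the sketch below builds and parses a labeled line by hand. The helper names `to_fasttext_line` and `split_fasttext_line` are hypothetical, and the sketch assumes that labels and words are separated by whitespace.

```python
def to_fasttext_line(text, labels, prefix="__label__"):
    """Build one training line: all labels first, then the text."""
    tags = " ".join(prefix + str(label) for label in labels)
    return tags + " " + " ".join(text.split())

def split_fasttext_line(line, prefix="__label__"):
    """Split a training line back into (labels, text)."""
    tokens = line.split()
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)

line = to_fasttext_line("good case, excellent value.", [1])
print(line)  # __label__1 good case, excellent value.

multi = to_fasttext_line("works  fine", ["positive", "electronics"])
print(split_fasttext_line(multi))
```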

In [9]:
all_data['label'] = all_data.apply(lambda row: '__label__' + str(row.sentiment), axis=1)

In [10]:
all_data.head(2)

Unnamed: 0,sentence,sentiment,label
0,So there is no way for me to plug it in here i...,0,__label__0
1,"Good case, Excellent value.",1,__label__1


In [11]:
import re

def normalize(row):
    lower = row['sentence'].lower()
    #correct all multiple white spaces to a single white space
    no_mult_ws = re.sub(r'\s+', ' ', lower)
    text = no_mult_ws.strip() #trim white spaces from left and right of text.
    return text

In [15]:
all_data['normalized_sentence'] = all_data.apply(normalize, axis=1)
all_data.head(2)

Unnamed: 0,sentence,sentiment,label,normalized_sentence
0,So there is no way for me to plug it in here in the US unless I go by a converter.,0,__label__0,so there is no way for me to plug it in here in the us unless i go by a converter.
1,"Good case, Excellent value.",1,__label__1,"good case, excellent value."


In [16]:
labeled_data = all_data.drop(['sentence', 'sentiment'], axis=1)
labeled_data.head(2)

Unnamed: 0,label,normalized_sentence
0,__label__0,so there is no way for me to plug it in here in the us unless i go by a converter.
1,__label__1,"good case, excellent value."


In [17]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(labeled_data, test_size=0.35, random_state=4)

In [18]:
pd.set_option('display.max_colwidth', None)  # no truncation (pandas >= 1.0; older versions used -1)
test_data.head(2)

Unnamed: 0,label,normalized_sentence
177,__label__1,"the atmosphere is modern and hip, while maintaining a touch of coziness."
720,__label__1,"cute, quaint, simple, honest."


#### Save data to disk to be read by fasttext

In [14]:
labeled_data.to_csv(path_or_buf='./sentiment.all', header=False, index=False, sep='\t')
train_data.to_csv(path_or_buf='./sentiment.train', header=False, index=False, sep='\t')
test_data.to_csv(path_or_buf='./sentiment.test', header=False, index=False, sep='\t')

### Fasttext model training/eval/etc.

In [5]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

In [21]:
BASE_PATH = "./"

In [74]:
train_data_path = os.path.join(BASE_PATH, 'sentiment.train')
test_data_path = os.path.join(BASE_PATH, 'sentiment.test')

# train_supervised(): learns the word (and subword/n-gram) embeddings and the
# linear classifier on top of them jointly from the input data.
# If pretrained word vectors are given, the embedding weights are
# initialized from them (transfer learning).
sentiment_model = train_supervised(
    input=train_data_path,
    lr=1,
    dim=100,
    ws=5,
    epoch=5,
    minCount=1,
    minCountLabel=0,
    minn=2,
    maxn=3,
    neg=5,
    wordNgrams=2,
    loss="softmax",  # hs, ns (multi-label), ova (per the documentation it should be multi-label, but in reality it's not!)
    bucket=200000,
    lrUpdateRate=100,
    t=1e-4,
    label="__label__",
    verbose=2,
    pretrainedVectors="",
)
print("On train")
print_results(*sentiment_model.test(train_data_path))
print("On test")
print_results(*sentiment_model.test(test_data_path))
sentiment_model.save_model("sentiment_model.bin")

On train
N	1786
P@1	0.941
R@1	0.941
On test
N	962
P@1	0.790
R@1	0.790


**Precision@k** = percentage of top k labels that are correct

**Recall@k** = percentage of actual labels that are in the top k predicted labels.

($\uparrow$ is defined for multi-label classifiers, recommendation systems, etc.)

In our binary classification every example has exactly one label, so P@1 and R@1 are identical and both equal accuracy, which is also what micro-averaged precision and recall reduce to.
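The two definitions above can be sketched in a few lines. This is a plain-Python illustration on made-up predictions, not fasttext's own implementation; `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(gold, ranked, k=1):
    """gold: list of sets of true labels; ranked: list of label lists, best first."""
    n_correct = n_predicted = n_gold = 0
    for true_labels, predictions in zip(gold, ranked):
        top_k = predictions[:k]
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return n_correct / n_predicted, n_correct / n_gold

# Four single-label examples (toy data); the second prediction is wrong.
gold = [{"1"}, {"0"}, {"1"}, {"0"}]
ranked = [["1", "0"], ["1", "0"], ["1", "0"], ["0", "1"]]

p1, r1 = precision_recall_at_k(gold, ranked, k=1)
p2, r2 = precision_recall_at_k(gold, ranked, k=2)
print(p1, r1)  # 0.75 0.75
print(p2, r2)  # 0.5 1.0
```

With one label per example and k=1, the number of predictions equals the number of true labels, so precision and recall coincide. At k=2 on a two-label problem, recall is trivially 1 and precision drops to 0.5.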

Check here for the input arguments explanations: 'https://fasttext.cc/docs/en/options.html'

In [23]:
# Testing trained model on some random document
sentiment_model.predict("you are not a cool guy but i really like you", k=2)
sentiment_model.predict("yeah..", k=2)

(('__label__1', '__label__0'), array([ 0.53369594,  0.46632403]))

(('__label__1', '__label__0'), array([ 0.82612067,  0.17389934]))

Quantizing sacrifices a bit of performance to reduce the size of the model. This is useful when size matters, e.g. when deploying to an edge device such as a mobile phone.

In [26]:
sentiment_model.quantize(input=train_data_path, qnorm=True, retrain=True, cutoff=200000)
sentiment_model.save_model("sent_model.ftz")
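One quick way to see what quantization bought you is to compare the sizes of the two saved files on disk. The sketch below is a hypothetical helper (`size_report` is not part of fasttext); the demo uses two throwaway files as stand-ins, and in the notebook you would pass `"sentiment_model.bin"` and `"sent_model.ftz"` instead.

```python
import os
import tempfile

def size_report(original_path, quantized_path):
    """Return both file sizes in bytes and the quantized/original ratio."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    return {
        "original_bytes": original,
        "quantized_bytes": quantized,
        "ratio": quantized / original,
    }

# Demo with dummy files standing in for the .bin and .ftz models
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "model.bin")
    small = os.path.join(tmp, "model.ftz")
    with open(big, "wb") as f:
        f.write(b"\0" * 1000)
    with open(small, "wb") as f:
        f.write(b"\0" * 250)
    report = size_report(big, small)

print(report)
```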

In [27]:
print_results(*sentiment_model.test(test_data_path))

N	962
P@1	0.805
R@1	0.805


In [29]:
sentiment_model.predict("you are not a cool guy but i really like you", k=2)
sentiment_model.predict("yeah..", k=2)
sentiment_model.predict("yeah..", k=1)

(('__label__1', '__label__0'), array([ 0.69983828,  0.30018169]))

(('__label__1', '__label__0'), array([ 0.68600965,  0.31401041]))

(('__label__1',), array([ 0.68600965]))

In [31]:
def get_pos_probab(doc):
    pr = sentiment_model.predict(doc, k=1)
    if "0" in pr[0][0]:
        return round(1 - pr[1][0], 2)
    else:
        return round(pr[1][0], 2)
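A slightly more defensive variant of the same idea compares against the full label string rather than searching for the character "0". This is only a sketch: `positive_probability` is a hypothetical helper that takes the `(labels, probabilities)` pair returned by `predict`, and the fallback assumes a binary problem.

```python
def positive_probability(prediction, positive_label="__label__1"):
    """Extract P(positive) from a (labels, probabilities) pair."""
    labels, probs = prediction
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    # Positive label not returned (e.g. k=1 and the top label is negative):
    # in a binary problem its probability is the complement of the top one.
    return 1.0 - float(probs[0])

# Values copied from the k=2 prediction output shown earlier
print(positive_probability((("__label__1", "__label__0"), [0.53369594, 0.46632403])))

# k=1 case where only the negative label comes back (made-up probability)
print(positive_probability((("__label__0",), [0.7])))
```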

In [32]:
test_data['probability'] = test_data['normalized_sentence'].apply(get_pos_probab)
test_data['prediction'] = test_data['probability'].apply(lambda row: int(round(row, 0)))
test_data['orig_label'] = test_data['label'].apply(lambda row: row[-1:])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [33]:
test_data.head(5)

Unnamed: 0,label,normalized_sentence,probability,prediction,orig_label
177,__label__1,"the atmosphere is modern and hip, while maintaining a touch of coziness.",0.89,1,1
720,__label__1,"cute, quaint, simple, honest.",0.55,1,1
525,__label__0,"it's an empty, hollow shell of a movie.",0.33,0,0
630,__label__0,don't bother - go to the store.,0.03,0,0
137,__label__0,"to those who find this movie intelligent or even masterful, i can only say - it's your intelligence and your imagination you obviously used to try and make some sense of this pitiful attempt (it's in our human nature to try and make sense of things) .",0.87,1,0


In [34]:
pd.options.mode.chained_assignment = None

In [35]:
train_data['probability'] = train_data['normalized_sentence'].apply(get_pos_probab)
train_data['prediction'] = train_data['probability'].apply(lambda row: int(round(row, 0)))
train_data['orig_label'] = train_data['label'].apply(lambda row: row[-1:])

In [36]:
train_data.head(3)

Unnamed: 0,label,normalized_sentence,probability,prediction,orig_label
15,__label__1,highly recommend for any one who has a blue tooth phone.,0.95,1,1
356,__label__0,"sadly, gordon ramsey's steak is a place we shall sharply avoid during our next trip to vegas.",0.07,0,0
496,__label__1,the pancake was also really good and pretty large at that.,0.67,1,1


In [50]:
train_data.dtypes
train_data['orig_label'] = train_data['orig_label'].astype('int')
train_data['prediction'] = train_data['prediction'].astype('int')
train_data.dtypes

label                  object  
normalized_sentence    object  
probability            float64 
prediction             category
orig_label             category
dtype: object

label                  object 
normalized_sentence    object 
probability            float64
prediction             int64  
orig_label             int64  
dtype: object

In [51]:
test_data['orig_label'] = test_data['orig_label'].astype('category')
test_data['prediction'] = test_data['prediction'].astype('category')

In [52]:
test_data.head(6)
train_data.head(6)

Unnamed: 0,label,normalized_sentence,probability,prediction,orig_label
177,__label__1,"the atmosphere is modern and hip, while maintaining a touch of coziness.",0.89,1,1
720,__label__1,"cute, quaint, simple, honest.",0.55,1,1
525,__label__0,"it's an empty, hollow shell of a movie.",0.33,0,0
630,__label__0,don't bother - go to the store.,0.03,0,0
137,__label__0,"to those who find this movie intelligent or even masterful, i can only say - it's your intelligence and your imagination you obviously used to try and make some sense of this pitiful attempt (it's in our human nature to try and make sense of things) .",0.87,1,0
555,__label__0,"i know this is not like the other restaurants at all, something is very off here!",0.15,0,0


Unnamed: 0,label,normalized_sentence,probability,prediction,orig_label
15,__label__1,highly recommend for any one who has a blue tooth phone.,0.95,1,1
356,__label__0,"sadly, gordon ramsey's steak is a place we shall sharply avoid during our next trip to vegas.",0.07,0,0
496,__label__1,the pancake was also really good and pretty large at that.,0.67,1,1
753,__label__1,cheap but hey it works.. was pleasantly suprised given the low cost of this item.,0.78,1,1
9,__label__1,a great touch.,1.0,1,1
975,__label__1,it is the best charger i have seen on the market yet.,0.99,1,1


In [53]:
from sklearn import metrics

metrics.recall_score(train_data['orig_label'], train_data['prediction'], average='micro')
metrics.recall_score(test_data['orig_label'], test_data['prediction'], average='micro')
metrics.accuracy_score(test_data['orig_label'], test_data['prediction'])
metrics.recall_score(test_data['orig_label'], test_data['prediction'], average='macro')

0.99048152295632697

0.80353430353430355

0.80353430353430355

0.80355290456431527

For the rest of this example we use the precision@1 reported by fasttext itself; the cells above show how to compute any other metric from the predictions.
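To see why the micro-averaged recall and the accuracy printed above are the same number while the macro-averaged recall differs slightly, here is a plain-Python sketch of both averages on toy labels. `recall_scores` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter

def recall_scores(y_true, y_pred):
    """Return (micro recall, macro recall, per-class recall)."""
    true_counts = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {c: hits[c] / true_counts[c] for c in true_counts}
    micro = sum(hits.values()) / len(y_true)          # pooled over all examples
    macro = sum(per_class.values()) / len(per_class)  # unweighted mean over classes
    return micro, macro, per_class

y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
micro, macro, per_class = recall_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro, accuracy)  # 0.75 0.75
print(macro)            # mean of 2/3 and 1
```

Micro-averaging pools all examples, so for single-label data it is exactly the fraction of correct predictions. Macro-averaging weights each class equally, so it moves away from accuracy when the classes are imbalanced.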

Model parameter search/tuning:

In [54]:
train_data_path = os.path.join(BASE_PATH, 'sentiment.train')
test_data_path = os.path.join(BASE_PATH, 'sentiment.test')

from itertools import product

def grid_search(lr, dim, ws, epoch, minn, maxn, wordNgrams):
    """Train one model per hyperparameter combination and print train/test P@1.

    Rows marked *** have train P@1 above 0.8 and a train/test ratio in (0.95, 1.06).
    """
    # product() varies the rightmost argument (wordNgrams) fastest.
    for l_rate, d, s, ep, mi, ma, n in product(lr, dim, ws, epoch, minn, maxn, wordNgrams):
        if ma < mi:  # skip invalid subword ranges
            continue
        sentiment_model_mc = train_supervised(
            input=train_data_path,
            lr=l_rate,
            dim=d,
            ws=s,
            epoch=ep,
            minCount=1,
            minCountLabel=0,
            minn=mi,
            maxn=ma,
            neg=5,
            wordNgrams=n,
            loss='softmax',
            bucket=200000,
            lrUpdateRate=100,
            t=1e-4,
            label="__label__",
            verbose=2,
            pretrainedVectors="",
        )
        train_res = round(sentiment_model_mc.test(train_data_path)[1], 2)  # precision@1
        test_res = round(sentiment_model_mc.test(test_data_path)[1], 2)
        ratio = train_res / test_res
        balanced = (0.95 < ratio < 1.06) and (train_res > 0.8)
        marker = "***" if balanced else "   "
        print("{}, {}: {} lr = {}, dim = {}, ws = {}, epoch = {}, minn = {}, maxn = {}, wordNgrams = {}".format(
            train_res, test_res, marker, l_rate, d, s, ep, mi, ma, n))

In [299]:
grid_search(lr=[0.9, 0.95], dim=[5, 10, 20, 40], ws=[4, 5, 6], 
            epoch=[1, 2, 3], minn=[2, 3, 4], maxn=[3, 4, 5, 6], wordNgrams=[1, 2, 3])

0.79, 0.75: *** lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 3, wordNgrams = 1
0.81, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 3, wordNgrams = 2
0.82, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 3, wordNgrams = 3
0.81, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 4, wordNgrams = 1
0.81, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 4, wordNgrams = 2
0.82, 0.76:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 4, wordNgrams = 3
0.82, 0.76:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 5, wordNgrams = 1
0.81, 0.74:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 5, wordNgrams = 2
0.81, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 5, wordNgrams = 3
0.83, 0.76:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 6, wordNgrams = 1
0.8, 0.75:     lr = 0.9, dim = 5, ws = 4, epoch = 1, minn = 2, maxn = 6, wordNgrams = 2
0.81, 0.74:     lr = 0
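Before launching a grid like the one above, it helps to know how many models it will train. The sketch below counts the combinations for the first `grid_search` call, applying the same `maxn >= minn` filter. `count_configurations` is a hypothetical helper that trains nothing.

```python
from itertools import product

def count_configurations(lr, dim, ws, epoch, minn, maxn, wordNgrams):
    """Count grid points, skipping combinations where maxn < minn."""
    return sum(
        1
        for (_, _, _, _, mi, ma, _) in product(lr, dim, ws, epoch, minn, maxn, wordNgrams)
        if ma >= mi
    )

n_models = count_configurations(
    lr=[0.9, 0.95], dim=[5, 10, 20, 40], ws=[4, 5, 6],
    epoch=[1, 2, 3], minn=[2, 3, 4], maxn=[3, 4, 5, 6], wordNgrams=[1, 2, 3],
)
print(n_models)  # 2376
```

Of the 12 (minn, maxn) pairs, 11 survive the filter (only minn=4, maxn=3 is dropped), so the total is 2 * 4 * 3 * 3 * 11 * 3 = 2376 training runs.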

In [300]:
grid_search(lr=[0.95, 1], dim=[5, 20], ws=[5, 6], 
            epoch=[1, 2], minn=[3,4], maxn=[3, 4, 5, 6], wordNgrams=[1, 2, 3])

0.85, 0.77:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 1
0.85, 0.77:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 2
0.87, 0.77:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 3
0.84, 0.76:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 1
0.84, 0.77:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 2
0.85, 0.76:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 3
0.85, 0.78:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 1
0.85, 0.78:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 2
0.85, 0.78:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 3
0.85, 0.78:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 1
0.85, 0.78:     lr = 0.95, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 2
0.84, 0.78

In [392]:
grid_search(lr=[0.05, 0.1], dim=[5, 20], ws=[5, 6], 
            epoch=[1, 2], minn=[3,4], maxn=[3, 4, 5, 6], wordNgrams=[1, 2, 3])

0.57, 0.58:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 1
0.58, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 2
0.57, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 3
0.56, 0.57:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 1
0.56, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 2
0.56, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 3
0.54, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 1
0.55, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 2
0.55, 0.56:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 3
0.55, 0.54:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 1
0.55, 0.55:     lr = 0.05, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 2
0.55, 0.55

In [395]:
grid_search(lr=[0.5, 2], dim=[5, 20], ws=[5, 6], 
            epoch=[1, 2], minn=[3,4], maxn=[3, 4, 5, 6], wordNgrams=[1, 2, 3])

0.74, 0.71:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 1
0.71, 0.67:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 2
0.68, 0.66:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 3, wordNgrams = 3
0.67, 0.64:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 1
0.65, 0.63:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 2
0.64, 0.62:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 4, wordNgrams = 3
0.65, 0.63:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 1
0.64, 0.62:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 2
0.62, 0.62:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 5, wordNgrams = 3
0.63, 0.6:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 1
0.62, 0.6:     lr = 0.5, dim = 5, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 2
0.61, 0.59:     lr = 0.

`0.83, 0.79: *** lr = 0.95, dim = 20, ws = 5, epoch = 1, minn = 3, maxn = 6, wordNgrams = 3`

In [56]:
all_data_path = os.path.join(BASE_PATH, 'sentiment.all')

sentiment_model_all = train_supervised(
    input=all_data_path,
    lr=0.95,
    dim=20,
    ws=5,
    epoch=1,
    minCount=1,
    minCountLabel=0,
    minn=3,
    maxn=6,
    neg=5,
    wordNgrams=3,
    loss='softmax',
    bucket=200000,
    lrUpdateRate=100,
    t=1e-4,
    label="__label__",
    verbose=2,
    pretrainedVectors=""
)
print_results(*sentiment_model_all.test(all_data_path))
sentiment_model_all.save_model("sentiment_model_all.bin")

N	2748
P@1	0.788
R@1	0.788


**Word vectors**

In [57]:
sentiment_model_all = fastText.load_model("sentiment_model_all.bin")

In [77]:
sentiment_model_all.get_word_vector("cute")
sentiment_model_all.get_sentence_vector("nice item is presented")

array([-0.00723714, -0.00616771,  0.00885935, -0.00201952,  0.00482896,
        0.0093122 ,  0.00326151,  0.01230557, -0.00532732, -0.0014651 ,
       -0.00998272, -0.00285204, -0.00020808, -0.00811909, -0.00340461,
        0.00046724, -0.00755076, -0.00977706, -0.00885135,  0.01863101], dtype=float32)

array([  7.83204008e-03,   1.94313086e-03,  -7.48884631e-03,
        -1.47117181e-02,   3.76849691e-03,   2.30366364e-03,
         2.74004657e-02,  -2.12049372e-02,  -7.56282685e-03,
        -3.56747094e-03,   2.04938836e-03,   9.27819777e-03,
         2.74633220e-03,   1.06016221e-02,   1.66965183e-05,
        -6.18755678e-03,  -3.47857509e-04,  -4.02185321e-03,
         2.90481024e-04,  -8.05132650e-03], dtype=float32)
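Raw vectors like the ones above are hard to read on their own; a common way to use them is to compare two of them with cosine similarity. The following is a minimal plain-Python sketch (the helper name `cosine_similarity` and the toy vectors are assumptions); in the notebook you could pass it the outputs of `get_word_vector` or `get_sentence_vector`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```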

In [61]:
sentiment_model_all.get_subwords("high quality")

(['<hi',
  '<hig',
  '<high',
  '<high ',
  'hig',
  'high',
  'high ',
  'high q',
  'igh',
  'igh ',
  'igh q',
  'igh qu',
  'gh ',
  'gh q',
  'gh qu',
  'gh qua',
  'h q',
  'h qu',
  'h qua',
  'h qual',
  ' qu',
  ' qua',
  ' qual',
  ' quali',
  'qua',
  'qual',
  'quali',
  'qualit',
  'ual',
  'uali',
  'ualit',
  'uality',
  'ali',
  'alit',
  'ality',
  'ality>',
  'lit',
  'lity',
  'lity>',
  'ity',
  'ity>',
  'ty>'],
 array([ 64551, 179660, 144286,  42140,  35950, 148644,  85966, 155465,
        137226, 204512, 140607,  23998, 182437, 169092, 202335,  81922,
         84435, 117954,  65525,  61421,  98772, 190411, 104063, 114122,
         61317,  14493, 140276,  14498,  27332, 146947, 149663,  38522,
         85874, 203604, 105187, 124621, 184669,  76804,  33216,  36274,
        128470,  71191]))
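The list of subwords above can be reproduced with a short sketch: wrap the string in `<` and `>` and slide windows of length `minn` to `maxn` over it. This is an illustration only, assuming ASCII input and the `minn=3`, `maxn=6` used to train this model; `char_ngrams` is a hypothetical helper, and it does not compute the hashed ids in the second array.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' for every start position and length."""
    wrapped = "<" + word + ">"
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n > len(wrapped):
                break  # window would run past the closing '>'
            grams.append(wrapped[start:start + n])
    return grams

grams = char_ngrams("high quality")
print(len(grams))   # 42
print(grams[:4])    # ['<hi', '<hig', '<high', '<high ']
print(grams[-3:])   # ['ity', 'ity>', 'ty>']
```

These subwords are what let the model produce a vector for a string it never saw as a whole word.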

In [79]:
words = sentiment_model_all.get_words()
for w in words[0:10]:
    print("{}  --> {}".format(w, sentiment_model_all.get_word_vector(w)))

</s>  --> [-0.02297329  0.03757703 -0.03839799 -0.02031958  0.02653294  0.00632489
  0.03328654 -0.00822168  0.05098927  0.04454439 -0.0130855   0.01819863
  0.0491783   0.02601182 -0.01822265 -0.00582005 -0.00695993  0.0187833
 -0.00284449 -0.05156025]
the  --> [-0.02462273  0.02364281  0.03974668  0.05990673 -0.00494396  0.00117975
 -0.09151746  0.0817562   0.06554657  0.03269861 -0.03168119 -0.02839037
 -0.01864444 -0.04172178 -0.03375281 -0.02457489  0.01192389 -0.00960783
  0.01140527  0.04683148]
and  --> [ 0.09868445 -0.04007725 -0.07414707 -0.12422622  0.01914132  0.00744683
  0.19312902 -0.21660616 -0.14063309 -0.0814503   0.03317532  0.102029
  0.04324149  0.09457062  0.0011426  -0.02000786 -0.0237851   0.00716737
 -0.01189006 -0.1891796 ]
a  --> [-0.00578491  0.02615113  0.04035453  0.02788285 -0.00274592  0.01685284
 -0.04814026  0.05728907  0.07139228  0.02230552 -0.05793533 -0.0143242
 -0.00569259 -0.03179817 -0.00097854 -0.00497582  0.00059156  0.00209976
 -0.00768848 -0

### Check out the following for the full set of functionality you can get from fasttext

https://fasttext.cc/docs/en/unsupervised-tutorial.html

In case you need a smaller model to deploy to an edge device:

In [315]:
sentiment_model_all.quantize(input=all_data_path, qnorm=True, retrain=True, cutoff=200000)
print_results(*sentiment_model_all.test(all_data_path))
sentiment_model_all.save_model("sentiment_model_all.ftz")

N	2748
P@1	0.900
R@1	0.900


In [95]:
from fastText import train_unsupervised

In [96]:
model = train_unsupervised(input=os.path.join(BASE_PATH, 'sentiment.all'), model='skipgram')

In [401]:
model.save_model("unsup_model")

In [402]:
model.get_word_vector("item")

array([ 0.11493368,  0.03844072, -0.30901629, -0.05930671,  0.17192714,
       -0.11968267, -0.02324919,  0.08157369,  0.12652105,  0.09489042,
       -0.16677582,  0.05363207,  0.09907467,  0.06755538,  0.02177368,
        0.09048193, -0.02327499,  0.09259994, -0.16422673,  0.23294239,
        0.14148104, -0.03935983,  0.13805135, -0.12262625, -0.12731853,
        0.16763139,  0.3956145 , -0.07006402,  0.13103153,  0.07855626,
       -0.21472511, -0.06144235, -0.02110785, -0.26635283,  0.03972701,
       -0.1000495 ,  0.19960244, -0.25491592, -0.1815532 , -0.05189068,
        0.0047689 , -0.01270215, -0.09911472,  0.20988707,  0.17559691,
       -0.24684374,  0.11519057, -0.11204102, -0.26574621, -0.12333245,
       -0.07794633, -0.09868582, -0.16511314,  0.04410603,  0.32537642,
       -0.27806637,  0.06206042,  0.22618234,  0.06395185, -0.03906425,
       -0.19645485,  0.05740105,  0.09258997, -0.16286848,  0.03635918,
        0.16308087, -0.16798693,  0.07991626, -0.02836273, -0.09

In [97]:
wiki_vectors_path = os.path.expanduser('~/Downloads/wiki-news-300d-1M.vec')  # expand "~" to the home directory

For the cells below, I downloaded English word embeddings trained on Wikipedia (and some other sources) from the fasttext website: https://fasttext.cc/docs/en/english-vectors.html

Pretrained vectors are also available for many other languages (157 languages as of June 2019): https://fasttext.cc/docs/en/crawl-vectors.html
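The downloaded `.vec` files are plain text: the first line holds the number of words and the vector dimension, and every following line holds a word and its space-separated values. A minimal reader might look like the sketch below; `read_vec` and its `limit` parameter are hypothetical, and the demo parses a tiny in-memory sample rather than the real file.

```python
import io

def read_vec(handle, limit=None):
    """Parse .vec text: a 'count dim' header, then 'word v1 v2 ...' per line."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(x) for x in values]
    return dim, vectors

# Tiny made-up sample in the same layout as a real .vec file
sample = io.StringIO("3 2\nthe 0.1 -0.2\nand 0.3 0.4\na -0.5 0.6\n")
dim, vectors = read_vec(sample, limit=2)
print(dim, list(vectors))
```

For the real file you would open it with `open(path, encoding="utf-8")`. The `limit` argument is there because loading all one million 300-dimensional vectors into Python lists takes a lot of memory.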

In [90]:
train_data_path = os.path.join(BASE_PATH, 'sentiment.train')
test_data_path = os.path.join(BASE_PATH, 'sentiment.test')

In [89]:
sentiment_model_transfer = train_supervised(
    input=train_data_path,
    lr=0.95,
    dim=300,
    epoch=1,
    loss='softmax',
    bucket=1000,
    label="__label__",
    pretrainedVectors=''
)
print_results(*sentiment_model_transfer.test(train_data_path))
print_results(*sentiment_model_transfer.test(test_data_path))

N	1786
P@1	0.803
R@1	0.803
N	962
P@1	0.735
R@1	0.735


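Below we see the power of transfer learning: with otherwise identical hyperparameters, initializing from the pretrained vectors raises test P@1 from 0.735 to 0.813. Train P@1 reaches 1.000, so the model also overfits the training set more.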

In [91]:
sentiment_model_transfer = train_supervised(
    input=train_data_path,
    lr=0.95,
    dim=300,
    epoch=1,
    loss='softmax',
    bucket=1000,
    label="__label__",
    pretrainedVectors=os.path.expanduser("~/Downloads/wiki-news-300d-1M.vec")  # "~" must be expanded before it reaches fasttext
)
print_results(*sentiment_model_transfer.test(train_data_path))
print_results(*sentiment_model_transfer.test(test_data_path))

N	1786
P@1	1.000
R@1	1.000
N	962
P@1	0.813
R@1	0.813


In [92]:
sentiment_model_transfer = train_supervised(
    input=train_data_path,
    lr=0.005,
    dim=300,
    epoch=1,
    loss='softmax',
    bucket=1000,
    label="__label__",
    pretrainedVectors=os.path.expanduser("~/Downloads/wiki-news-300d-1M.vec")  # "~" must be expanded before it reaches fasttext
)
print_results(*sentiment_model_transfer.test(train_data_path))
print_results(*sentiment_model_transfer.test(test_data_path))

N	1786
P@1	0.820
R@1	0.820
N	962
P@1	0.780
R@1	0.780


In [93]:
%time sentiment_model_all = train_supervised(input=train_data_path, lr=0.0005, dim=300, epoch=7, loss='ns', bucket=1000, label="__label__", pretrainedVectors=os.path.expanduser("~/Downloads/wiki-news-300d-1M.vec"))
print_results(*sentiment_model_all.test(train_data_path))
print_results(*sentiment_model_all.test(test_data_path))

CPU times: user 3min 55s, sys: 2.04 s, total: 3min 57s
Wall time: 3min 39s
N	1786
P@1	0.822
R@1	0.822
N	962
P@1	0.771
R@1	0.771


- While powerful and fast, fasttext is not mature yet, so be vigilant for bugs and feel free to report any you find: https://github.com/facebookresearch/fastText/issues

- The documentation is thin on many details, but people are deciphering it over time, because it's a good tool and folks want to keep using it!