In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import fasttext
import fasttext_models as fm

In [2]:
print("First things first: we need to build our models.\nWe will build 10 models in total: 5 trained on cleaned and lemmatized data and 5 trained on the raw text. Within each training set, the 5 models differ only in vector length: 25, 50, 100, 200, and 300.")
# execfile('fasttext_models.py') does not work in Python 3; the equivalent is:
#exec(open("./fasttext_models.py").read())

# fm.main() is expected to return two lists of five models each (cleaned, uncleaned);
# the chained assignment also binds every model to its own name.
mcs,mus = ([mc0,mc1,mc2,mc3,mc4],[mu0,mu1,mu2,mu3,mu4]) = fm.main()

First things first: we need to build our models.
We will build 10 models in total: 5 trained on cleaned and lemmatized data and 5 trained on the raw text. Within each training set, the 5 models differ only in vector length: 25, 50, 100, 200, and 300.
reviews_train.json loaded:
   Id  HelpfulnessNumerator  HelpfulnessDenominator  Score  \
0   1                     1                       1      5   
1   2                     0                       0      1   
2   3                     1                       1      4   
3   4                     3                       3      2   
4   5                     0                       0      5   

                 Summary                                               Text  \
0  Good Quality Dog Food  I have bought several of the Vitality canned d...   
1      Not as Advertised  Product arrived labeled as Jumbo Salted Peanut...   
2  "Delight" says it all  This is a confection that has been around a fe...   
3         Cou

In [3]:
# load the models for evaluation
'''
mc0 = fasttext.load_model('data/fasttext_skipgram_cleaned_D25.bin')
mc1 = fasttext.load_model('data/fasttext_skipgram_cleaned_D50.bin')
mc2 = fasttext.load_model('data/fasttext_skipgram_cleaned_D100.bin')
mc3 = fasttext.load_model('data/fasttext_skipgram_cleaned_D200.bin')
mc4 = fasttext.load_model('data/fasttext_skipgram_cleaned_D300.bin')
mcs = [mc0,mc1,mc2,mc3,mc4]

mu0 = fasttext.load_model('data/fasttext_skipgram_uncleaned_D25.bin')
mu1 = fasttext.load_model('data/fasttext_skipgram_uncleaned_D50.bin')
mu2 = fasttext.load_model('data/fasttext_skipgram_uncleaned_D100.bin')
mu3 = fasttext.load_model('data/fasttext_skipgram_uncleaned_D200.bin')
mu4 = fasttext.load_model('data/fasttext_skipgram_uncleaned_D300.bin')
mus = [mu0,mu1,mu2,mu3,mu4]
'''

print("\nHere are the top 50 words for the model trained on cleaned data with D100:")
mc2_words = mc2.get_words()
print(mc2_words[:50])
print("\nHere are the top 50 words for the model trained on uncleaned data with D100:")
mu2_words = mu2.get_words()
print(mu2_words[:50])

print("Nota Bene: I cannot figure out how to remove some of these symbols. I am pretty sure the strategy I used in fasttext_models.py works, but it is still not working out. My theory is that some of these are subwords... but I doubt that it is that simple.")


Here are the top 50 words for the model trained on cleaned data with D100:
['>', '<', 'br', '</s>', 'like', 'good', 'taste', 'flavor', 'one', 'get', 'love', 'product', 'make', 'use', 'coffee', 'great', 'try', 'well', 'food', 'buy', 'tea', 'find', 'would', 'eat', 'dog', 'go', 'really', 'time', 'much', 'amazon', 'order', 'also', 'price', 'bag', 'cup', 'give', 'little', 'even', 'drink', 'say', 'think', 'store', 'day', 'cat', 'add', 'box', 'chocolate', 'treat', 'come', 'first']

Here are the top 50 words for the model trained on uncleaned data with D100:
['the', 'I', 'and', 'a', 'to', 'of', 'is', 'it', '</s>', 'for', 'in', 'this', 'that', 'my', 'with', 'have', 'but', 'are', 'was', 'not', 'you', '/><br', 'on', 'as', 'like', 'they', 'so', 'be', 'The', 'or', 'at', 'these', 'just', 'them', 'very', 'from', 'one', 'good', 'It', '"I', 'has', 'can', 'taste', 'will', 'would', 'had', 'all', 'more', 'than', 'when']
Nota Bene: I cannot figure out how to remove some of these symbols. I am pretty sure 
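
A note on the stray symbols above: '<', '>', 'br' and '/><br' look like fragments of HTML line-break tags (`<br />`) left in the review text, while '</s>' is the end-of-line token that fastText adds on its own, so preprocessing will not remove it. Below is a minimal sketch of stripping the tags before training, assuming that is indeed where the symbols come from. `strip_br_tags` is a hypothetical helper, not something defined in fasttext_models.py.

```python
import re

# Matches <br>, <br/>, <br /> and similar variants, case-insensitively.
BR_TAG = re.compile(r"<\s*br\s*/?\s*>", flags=re.IGNORECASE)

def strip_br_tags(text):
    """Replace HTML line-break tags with a space and collapse extra whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return " ".join(without_tags.split())

example = "Great taste.<br /><br />Will buy again"
print(strip_br_tags(example))
```

Running this on the raw review text before tokenizing should keep the tags from being split into separate vocabulary entries.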

In [4]:
import get_accuracy

print("Now let's compare our 10 different models' accuracy, precision, and recall.")
print("First, a quick note about precision and recall: The fasttext test() function calculates precision and recall \"at k\" for our models. Note that these are not the usual definitions of precision and recall used in most discussions about binary classifiers. More precisely, precision at k is:\n      P@k = r / k,\n      the # of relevant labels r divided by the number of top predictions k.")
print("In our case, P@1 will be either 1 or 0 in each test case, because the top guess either is or is not the correct label.\nFor our purposes, just considering precision and recall at 1 makes sense, because there is only one label we care about: the correct one. When k>1, our models can only get a value of either 1/k or 0 in each test case, making this a poor measure of accuracy.\nIt is not explicitly stated in the fasttext documentation, but it is reasonable to assume that P@1 will give the same output as we would get using the scikit-learn library's tools.\nSpecifically, it should give the same results as sklearn.metrics.precision_score(average='micro') would. This means that fasttext.test() is calculating metrics globally by counting the total true positives, false negatives and false positives across all the labels and producing one number for the precision and recall score. Unfortunately, there seems to be no way to change this calculation beyond varying the k value.")
print("To calculate accuracy for comparison with our other models (with different embeddings), we need to use our own code. See the module get_accuracy.py for more information.\n\n\tThe following dataframe compares our models' performance on test data:")

columns = ['Train Data','Word Vector Size','Accuracy','Precision @1','Recall @1','N']

def evaluate(models, train_label):
    # Build one row of metrics per model.
    rows = []
    for m in models :
        # NOTE: both the cleaned and the uncleaned models are scored on the uncleaned test file.
        n,p,r = m.test('data/reviews_uncleaned.test', k=1)
        rows.append([train_label, m.get_dimension(), get_accuracy.get_accuracy(m), p, r, n])
    return pd.DataFrame(rows, columns=columns)

df1 = evaluate(mcs, 'Clean')
df2 = evaluate(mus, 'Unclean')

df_merged=pd.concat([df1,df2], ignore_index=True)
df_merged

Now let's compare our 10 different models' accuracy, precision, and recall.
First, a quick note about precision and recall: The fasttext test() function calculates precision and recall "at k" for our models. Note that these are not the usual definitions of precision and recall used in most discussions about binary classifiers. More precisely, precision at k is:
      P@k = r / k,
      the # of relevant labels r divided by the number of top predictions k.
In our case, P@1 will be either 1 or 0 in each test case, because the top guess either is or is not the correct label.
For our purposes, just considering precision and recall at 1 makes sense, because there is only one label we care about: the correct one. When k>1, our models can only get a value of either 1/k or 0 in each test case, making this a poor measure of accuracy.
It is not explicitly stated in the fasttext documentation, but it is reasonable to assume that P@1 will give the same output as we would get using the scikit-learn libr

   Train Data  Word Vector Size  Accuracy  Precision @1  Recall @1       N
0       Clean                25  0.590187      0.590187   0.590187  113691
1       Clean                50   0.59017       0.59017    0.59017  113691
2       Clean               100  0.589792      0.589792   0.589792  113691
3       Clean               200  0.589924      0.589924   0.589924  113691
4       Clean               300  0.590056      0.590056   0.590056  113691
5     Unclean                25   0.68658       0.68658    0.68658  113691
6     Unclean                50  0.687029      0.687029   0.687029  113691
7     Unclean               100  0.686712      0.686712   0.686712  113691
8     Unclean               200  0.687082      0.687082   0.687082  113691
9     Unclean               300  0.686835      0.686835   0.686835  113691
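
The Accuracy, Precision @1 and Recall @1 columns are identical in every row. That is what we would expect when each review has exactly one true label and the model makes exactly one guess: hits divided by predictions, hits divided by true labels, and hits divided by examples all share the same denominator. The sketch below checks this on made-up labels; the helper names and the toy data are illustrative assumptions, not fastText's or get_accuracy.py's actual code.

```python
from collections import Counter

def precision_recall_at_1(true_labels, top_guesses):
    """P@1 and R@1 when every example has one true label and one guess."""
    hits = sum(t == g for t, g in zip(true_labels, top_guesses))
    n_predictions = len(top_guesses)  # k = 1, so one prediction per example
    n_true_labels = len(true_labels)  # one gold label per example
    return hits / n_predictions, hits / n_true_labels

def micro_precision(true_labels, top_guesses):
    """Micro-averaged precision: pool true/false positives over all labels."""
    true_pos, false_pos = Counter(), Counter()
    for t, g in zip(true_labels, top_guesses):
        if t == g:
            true_pos[g] += 1
        else:
            false_pos[g] += 1
    tp, fp = sum(true_pos.values()), sum(false_pos.values())
    return tp / (tp + fp)

# Toy data (hypothetical): two of the four guesses are correct.
true_labels = ['__label__5', '__label__1', '__label__4', '__label__5']
top_guesses = ['__label__5', '__label__5', '__label__4', '__label__2']

p_at_1, r_at_1 = precision_recall_at_1(true_labels, top_guesses)
accuracy = sum(t == g for t, g in zip(true_labels, top_guesses)) / len(true_labels)
print(p_at_1, r_at_1, micro_precision(true_labels, top_guesses), accuracy)
```

All four printed numbers agree, which mirrors the three matching columns in the table.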


In [7]:
# Let's get some more stats about our models. We will use scikit-learn to compute statistics for comparing them.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# get_accuracy.get_guesses(m, labels, reviews) gives us the information we need for our analysis:
# it returns a 3-tuple (list of guesses, list of test labels, list of reviews), all aligned by index

In [21]:
# Now let's get the RMSE and accuracy for each of our ten models:
# get the list of predictions for each model, measure RMSE and accuracy with scikit-learn, and finally add the results to df_merged

# First, we need the list of string labels and the reviews for the testing dataset.
# To get them, run get_accuracy.get_accuracy() on an arbitrary model.
a, str_labels, reviews = get_accuracy.get_accuracy(mc2)

str_guesses = [] # will be a list of lists of label strings, one inner list per model

# cleaned models first, then uncleaned, so the order matches df_merged
for m in mcs + mus :
    g, l, r = get_accuracy.get_guesses(m, str_labels, reviews)
    str_guesses.append(g)

In [24]:
# guesses are ordered to match the models: guesses[0] --> mc0, guesses[1] --> mc1, ... guesses[9] --> mu4
# convert all of the guesses and labels to ints (necessary for RMSE):
convert = {
    '__label__1' : 1,
    '__label__2' : 2,
    '__label__3' : 3,
    '__label__4' : 4,
    '__label__5' : 5,
}

# one inner list of int scores per model
guesses = [[convert[g] for g in inner] for inner in str_guesses]

labels = [convert[l] for l in str_labels]
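
As an alternative to a hard-coded lookup table, the score can be parsed straight from the label string, since every label here is the `__label__` prefix followed by the star rating. A small sketch; `label_to_score` is a hypothetical helper and not part of get_accuracy.py.

```python
LABEL_PREFIX = "__label__"

def label_to_score(label):
    """Turn a label string such as '__label__4' into the int 4."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"unexpected label format: {label!r}")
    return int(label[len(LABEL_PREFIX):])

print([label_to_score(l) for l in ["__label__1", "__label__5"]])
```

Failing loudly on an unexpected label makes it easier to notice a malformed prediction than a silent KeyError deep inside a loop.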

In [31]:
# calculate RMSEs; guesses is ordered mc0..mc4 followed by mu0..mu4, matching df_merged
# RMSE is the square root of the MSE; taking the root ourselves avoids the
# `squared=False` argument, which newer scikit-learn releases have deprecated.
RMSEs = [mean_squared_error(labels, g) ** 0.5 for g in guesses]

print(RMSEs)

[1.3172043876432558, 1.3189594286289186, 1.318499206663412, 1.3186259504764013, 1.318319076073167, 1.2753002906328834, 1.2709892639549203, 1.2741169027360137, 1.270196629475611, 1.2738752599324947]
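
RMSE complements accuracy because it accounts for how far off a wrong guess is: predicting 4 stars for a 5-star review costs much less than predicting 1 star. The sketch below computes RMSE by hand on made-up scores to show that the two metrics can disagree. The data is purely illustrative.

```python
import math

def rmse(true_scores, predicted_scores):
    """Root mean squared error between two equal-length lists of scores."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(true_scores, predicted_scores):
    """Fraction of exact matches."""
    hits = sum(t == p for t, p in zip(true_scores, predicted_scores))
    return hits / len(true_scores)

true_scores = [5, 1, 4, 2]
near_misses = [4, 2, 5, 1]   # every guess is off by exactly one star
one_big_miss = [5, 5, 4, 2]  # three exact hits, one guess off by four stars

print(accuracy(true_scores, near_misses), rmse(true_scores, near_misses))
print(accuracy(true_scores, one_big_miss), rmse(true_scores, one_big_miss))
```

The second set of guesses has the higher accuracy but also the higher RMSE, because a single four-star miss contributes a squared error of 16.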


In [32]:
# Now simply merge the RMSE data with the df_merged dataframe and we're done!
df_merged['RMSE'] = RMSEs
df_merged


   Train Data  Word Vector Size  Accuracy  Precision @1  Recall @1       N      RMSE
0       Clean                25  0.590187      0.590187   0.590187  113691  1.317204
1       Clean                50   0.59017       0.59017    0.59017  113691  1.318959
2       Clean               100  0.589792      0.589792   0.589792  113691  1.318499
3       Clean               200  0.589924      0.589924   0.589924  113691  1.318626
4       Clean               300  0.590056      0.590056   0.590056  113691  1.318319
5     Unclean                25   0.68658       0.68658    0.68658  113691    1.2753
6     Unclean                50  0.687029      0.687029   0.687029  113691  1.270989
7     Unclean               100  0.686712      0.686712   0.686712  113691  1.274117
8     Unclean               200  0.687082      0.687082   0.687082  113691  1.270197
9     Unclean               300  0.686835      0.686835   0.686835  113691  1.273875


In [5]:
print("Now let's take a look at some of the vector embeddings.")
print("Let's look at the vector for \"chocolate\" in the cleaned-data model:\n")
print(mc2.get_word_vector('chocolate'))
print("\nWe can also look at the most similar words, or the vector's 'neighbors.' These are determined by the cosine similarity of vectors.")
print(mc2.get_nearest_neighbors('chocolate'))

print("\nFasttext also lets us try out analogies. Let's see how it does with the following:")
print("\n\"'hot' is to 'cold' as 'good' is to _____\"")
# get_analogies(A, B, C) ranks words near A - B + C, so cold - hot + good answers this analogy.
print(mc2.get_analogies('cold','hot','good'))
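
To make the cosine-similarity ranking behind the neighbor lookup concrete, here is a minimal pure-Python sketch. The three-dimensional vectors and the words attached to them are invented for illustration; they are not taken from any of the trained models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k=2):
    """Return the k (similarity, word) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), word) for word, vec in vectors.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy embeddings.
toy_vectors = {
    "cocoa":  [0.9, 0.1, 0.0],
    "fudge":  [0.8, 0.3, 0.1],
    "kibble": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

for score, word in nearest_neighbors(query, toy_vectors):
    print(f"{word}: {score:.3f}")
```

Because cosine similarity ignores vector length and compares direction only, the words whose vectors point the same way as the query rank highest.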

Now let's take a look at some of the vector embeddings.
Let's look at the vector for "chocolate" in the cleaned-data model:

[-0.01593476  0.09763412 -0.03379307  0.05999282 -0.08622997  0.08399839
 -0.04973645  0.08821716  0.13238464  0.05473518 -0.05669185  0.00199589
 -0.05042139  0.08592727  0.01320869 -0.00363729 -0.06829321 -0.1010453
  0.00493081 -0.02913536  0.1875842   0.25373566  0.14334284  0.05230877
 -0.12404761 -0.12505235 -0.01709576 -0.09508679  0.02292475  0.07112854
 -0.01314331 -0.05340886  0.04346133  0.00161902  0.10733654 -0.05789496
 -0.03510344  0.00099835  0.02680125 -0.08761607  0.09337546  0.01593737
 -0.04629854 -0.04687629  0.09315459  0.076052   -0.07473697  0.07008194
  0.04242377 -0.23148057  0.04395402  0.06255033  0.06598131 -0.04706314
 -0.07252772  0.06335913  0.16925715  0.12258145  0.1409662   0.04487699
  0.08523075  0.17074719  0.14271969  0.04857241 -0.01685663  0.06836378
  0.06888569  0.0767359   0.02995878 -0.06690644 -0.02951036 -0.02265557
