# Text Classification
<br>

<br>

### Problem 1: 

Writing performance metric functions to evaluate the models.


In [1]:
# ----- Importing libraries -----


import numpy as np
import pandas as pd    # To load data into dataframe

In [2]:
# ----- Confusion Matrix Function to calculate Precision, Recall, F1-score -----


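def confusion_mat(y_pred, actual_y):
    # Rows index the predicted label, columns the actual label (binary 0/1 labels only)
    cm = np.zeros((2, 2))
    for p, t in zip(y_pred, actual_y):
        cm[p][t] += 1
    # Returns (false negatives, false positives, true positives)
    return cm[0,1], cm[1,0], cm[1,1]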


# Confusion Matrix that gets generated:

#             Actual
#             0    1
#           0 TN   FN
# Predicted
#           1 FP   TP



# ----- Function to calculate Precision  -----


def get_precision(y_pred, actual_y):
    _, fp, tp = confusion_mat(y_pred, actual_y)
    return (tp/(tp+fp))   # precision = true_positives/total_predicted_positives





# ----- Function to calculate Recall  -----


def get_recall(y_pred, actual_y):
    fn, _, tp = confusion_mat(y_pred, actual_y)
    return (tp/(tp+fn))   # recall = true_positives/total_actual_positives





# ----- Function to calculate F1-score  -----


def f_score(y_pred, actual_y):
    fn, fp, tp = confusion_mat(y_pred, actual_y)
    
    # f-measure = 2*precision*recall/(precision+recall)
    
    return (2 * (tp/(tp+fp)) * (tp/(tp+fn))) / ((tp/(tp+fp)) + (tp/(tp+fn)))


# ----- Resources used in https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python
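
As a quick sanity check of the layout described above, the sketch below recomputes the three metrics on six made-up labels. The function name `confusion_counts` and the toy labels are illustrative assumptions, not part of the notebook; only the `cm[predicted][actual]` indexing mirrors `confusion_mat`.

```python
def confusion_counts(y_pred, y_true):
    """Count (false negatives, false positives, true positives) for 0/1 labels."""
    cm = [[0, 0], [0, 0]]            # cm[predicted][actual]
    for p, t in zip(y_pred, y_true):
        cm[p][t] += 1
    return cm[0][1], cm[1][0], cm[1][1]


# Hypothetical toy labels: three actual positives, three actual negatives
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

fn, fp, tp = confusion_counts(y_pred, y_true)

precision = tp / (tp + fp)           # 2 of the 3 predicted positives are correct
recall = tp / (tp + fn)              # 2 of the 3 actual positives are found
f1 = 2 * precision * recall / (precision + recall)

print(fn, fp, tp)
print(precision, recall, f1)
```

Here one positive is missed and one negative is flagged, so precision, recall and F1-score all come out at 2/3.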

<br>
<br>

## Baseline Model #1


### Problem 2:

#### Majority Class Baseline model


The label predicted for every review is the majority class of the training data, i.e., the class with the most reviews in the training set.

There is no class imbalance in the training data: the positive and negative classes contain an equal number of reviews.

Precision, recall and F-measure are reported for both the training and test data.

In [3]:
# ----- Importing libraries ----- 

import sklearn
from sklearn.datasets import load_files   # To load training & test data


# ----- Importing training data ----- 
# ----- Folders 'pos' and 'neg' are specified to just retrieve data from those ----- 
# ----- The "train" folder is inside "aclImdb" folder

imdb_train_dir = r'aclImdb/train/'
train = load_files(imdb_train_dir, categories=['pos','neg'])


# ----- Resources used: http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html

In [4]:
# ----- Importing testing data ----- 
# ----- The "test" folder is inside "aclImdb" folder

imdb_test_dir = r'aclImdb/test/'
test = load_files(imdb_test_dir, categories=['pos','neg'])

In [5]:
# ----- Printing the list of targets ----- 


train.target

array([1, 0, 1, ..., 0, 0, 0])

In [6]:
# ----- Printing the first filename ----- 


train.filenames[0]

'aclImdb/train/pos/11485_10.txt'

In [7]:
# ----- Testing to see if everything's working fine ----- 


train.data[0][:500]

b'Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and act'

In [8]:
# ----- Finding out the maximum class ----- 


print("Maximum class is: ")
if (train.target == 1).sum() > (train.target == 0).sum():
    print("Positive")
elif (train.target == 1).sum() < (train.target == 0).sum():
    print("Negative")
else:
    print("Equal")

Maximum class is: 
Equal


Both classes contain the same number of reviews, so either one could be taken as the majority class and the accuracy would be the same. Positive is taken as the majority class here.

In [9]:
# ----- Making y_pred = 0 to test the baseline model -----  

total = (train.target == 1).sum() + (train.target == 0).sum()
y_pred = [0]*total
y_pred[:10]  # Checking

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The accuracy of both variants (always predicting positive and always predicting negative) is the same, 50%, because the classes are balanced. Precision, recall and F1-score differ, however, because the metric functions treat class 1 as the positive class. When every prediction is 0 there are no predicted positives, so precision is 0/0, which NumPy evaluates to nan. Recall is 0 because no actual positive is found, and since the F1-score is computed from precision, it is nan as well.
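
The degenerate case can be reproduced on a tiny example. This is a minimal sketch with four made-up labels (an assumption for illustration); the counts are cast to `np.float64` to mirror the float matrix built by `confusion_mat`, and `np.errstate` only silences the warning NumPy would otherwise emit for 0/0.

```python
import numpy as np

# Hypothetical balanced labels and an "always negative" prediction
y_true = np.array([1, 0, 1, 0])
y_pred = np.zeros_like(y_true)

tp = np.float64(np.sum((y_pred == 1) & (y_true == 1)))   # 0 true positives
fp = np.float64(np.sum((y_pred == 1) & (y_true == 0)))   # 0 false positives
fn = np.float64(np.sum((y_pred == 0) & (y_true == 1)))   # 2 false negatives

with np.errstate(invalid="ignore", divide="ignore"):
    precision = tp / (tp + fp)       # 0/0 -> nan
    recall = tp / (tp + fn)          # 0/2 -> 0.0
    f1 = 2 * precision * recall / (precision + recall)   # nan propagates

accuracy = np.mean(y_pred == y_true)

print(precision, recall, f1, accuracy)
```

Plain Python numbers would raise `ZeroDivisionError` instead of returning nan, which is why the notebook's NumPy-based functions print nan rather than failing.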

In [10]:
print("")
print("Majority Class Baseline Model")
print("-----------------------------")
print("Performance Measures of our model on Training Data:")
print("Precison:", get_precision(y_pred, train.target)*100)
print("Recall:", get_recall(y_pred, train.target)*100)
print("F1-score:", f_score(y_pred, train.target)*100)
print("")
print("------------------------------------------")
print("")
print("Majority Class Baseline Model")
print("-----------------------------")
print("Performance Measures of our model on Testing Data:")
print("Precison:", get_precision(y_pred, test.target)*100)
print("Recall:", get_recall(y_pred, test.target)*100)
print("F1-score:", f_score(y_pred, test.target)*100)
print("")


Majority Class Baseline Model
-----------------------------
Performance Measures of our model on Training Data:
Precison: nan
Recall: 0.0
F1-score: nan

------------------------------------------

Majority Class Baseline Model
-----------------------------
Performance Measures of our model on Testing Data:
Precison: nan
Recall: 0.0
F1-score: nan



In [11]:
# ----- Making y_pred = 1 -----  

y_pred = [1]*total
y_pred[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [12]:
print("")
print("Baseline Model")
print("--------------")
print("Performance Measures of our model on Training Data:")
print("Precison:", get_precision(y_pred, train.target)*100)
print("Recall:", get_recall(y_pred, train.target)*100)
print("F1-score:", f_score(y_pred, train.target)*100)
print("")
print("------------------------------------------")
print("")
print("Baseline Model")
print("--------------")
print("Performance Measures of our model on Testing Data:")
print("Precison:", get_precision(y_pred, test.target)*100)
print("Recall:", get_recall(y_pred, test.target)*100)
print("F1-score:", f_score(y_pred, test.target)*100)
print("")


Baseline Model
--------------
Performance Measures of our model on Training Data:
Precison: 50.0
Recall: 100.0
F1-score: 66.66666666666666

------------------------------------------

Baseline Model
--------------
Performance Measures of our model on Testing Data:
Precison: 50.0
Recall: 100.0
F1-score: 66.66666666666666



<br>
<br>

## Baseline Model #2


### Problem 3:

#### Review Length Baseline

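The length of each review (measured in bytes of the raw review text) is used as the deciding factor to classify it.


Threshold rules:

1. Every review shorter than the mean length is positive.

2. Every review longer than the mean length is positive.

3. Arbitrary threshold: every review longer than 750 is negative.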
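
The three rules can be sketched as one small function. The name `predict_by_length` and its `positive_if_shorter` flag are hypothetical; the sample lengths and the mean are copied from the notebook outputs in this section.

```python
def predict_by_length(lengths, threshold, positive_if_shorter=True):
    """Label each review 1 (positive) or 0 (negative) from its length alone."""
    if positive_if_shorter:
        return [1 if n < threshold else 0 for n in lengths]
    return [1 if n > threshold else 0 for n in lengths]


sample_lengths = [764, 1072, 411, 832, 1588]   # first five training review lengths
mean_length = 1325.31292                       # mean training length reported by the notebook

rule_1 = predict_by_length(sample_lengths, mean_length)                             # below mean -> positive
rule_2 = predict_by_length(sample_lengths, mean_length, positive_if_shorter=False)  # above mean -> positive
# The notebook treats a length of exactly 750 as positive; this sketch uses a strict "<",
# so the two differ only for reviews of exactly 750 bytes.
rule_3 = predict_by_length(sample_lengths, 750)

print(rule_1, rule_2, rule_3)
```

Rules 1 and 2 are exact opposites for any length that differs from the mean, so comparing them shows which direction of the length signal, if any, is useful.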


In [13]:
# ----- Creating a reviews length list -----

train['length'] = [len(review) for review in train.data]   # length in bytes of each raw review
train['length'][:10] #Checking

[764, 1072, 411, 832, 1588, 341, 606, 1474, 2439, 259]

In [14]:
# ----- Putting data inside a Pandas dataframe to manipulate it -----

train = pd.DataFrame.from_dict({key: train[key] for key in train.keys()
                                & {'data', 'filenames', 'target', 'length'}})
train.head(10)

| | data | filenames | target | length |
|---|---|---|---|---|
| 0 | `b"Zero Day leads you to think, even re-think w...` | aclImdb/train/pos/11485_10.txt | 1 | 764 |
| 1 | `b'Words can\'t describe how bad this movie is....` | aclImdb/train/neg/6802_1.txt | 0 | 1072 |
| 2 | `b'Everyone plays their part pretty well in thi...` | aclImdb/train/pos/7641_10.txt | 1 | 411 |
| 3 | `b'There are a lot of highly talented filmmaker...` | aclImdb/train/neg/9698_1.txt | 0 | 832 |
| 4 | `b'I\'ve just had the evidence that confirmed m...` | aclImdb/train/neg/3141_2.txt | 0 | 1588 |
| 5 | `b"The Movie was sub-par, but this Television P...` | aclImdb/train/pos/1018_8.txt | 1 | 341 |
| 6 | `b"This movie has a special way of telling the ...` | aclImdb/train/pos/9637_8.txt | 1 | 606 |
| 7 | `b"the single worst film i've ever seen in a th...` | aclImdb/train/neg/10278_1.txt | 0 | 1474 |
| 8 | `b"The plot of this terrible film is so convolu...` | aclImdb/train/neg/265_1.txt | 0 | 2439 |
| 9 | `b'I had no idea that Mr. Izzard was so damn fu...` | aclImdb/train/pos/3908_10.txt | 1 | 259 |


In [15]:
# ----- Finding the average length; every review below it is labelled positive -----
# ----- According to this paper, shorter reviews tend to be more positive:
# ----- https://www.emerald.com/insight/content/doi/10.1108/IntR-12-2016-0394/full/html

import statistics 

average = statistics.mean(train['length'])
average

1325.31292

In [16]:
# ----- Predicting y from training data -----

train['y_pred'] = train.length.apply(lambda x: 1 if x < average else 0)
train['y_pred'][:10]

0    1
1    1
2    1
3    1
4    0
5    1
6    1
7    0
8    0
9    1
Name: y_pred, dtype: int64

In [17]:
test['length'] = [len(review) for review in test.data]   # length in bytes of each raw review
test = pd.DataFrame.from_dict({key: test[key] for key in test.keys()
                               & {'data', 'filenames', 'target', 'length'}})
test.head(10)

| | data | filenames | target | length |
|---|---|---|---|---|
| 0 | `b"Don't hate Heather Graham because she's beau...` | aclImdb/test/pos/11485_9.txt | 1 | 425 |
| 1 | `b'I don\'t know how this movie has received so...` | aclImdb/test/neg/6802_1.txt | 0 | 705 |
| 2 | `b"I caught this movie on the Horror Channel an...` | aclImdb/test/pos/7641_8.txt | 1 | 1549 |
| 3 | `b'NBC had a chance to make a powerful religiou...` | aclImdb/test/neg/9698_1.txt | 0 | 1104 |
| 4 | `b"Looking for something shocking? Okay fine......` | aclImdb/test/neg/3141_1.txt | 0 | 1049 |
| 5 | `b'"Are You in the House Alone?" belongs to the...` | aclImdb/test/pos/1018_7.txt | 1 | 823 |
| 6 | `b"I think this is one hell of a movie............` | aclImdb/test/pos/9637_8.txt | 1 | 295 |
| 7 | `b'I watched this movie a couple of weeks ago a...` | aclImdb/test/neg/10278_4.txt | 0 | 2591 |
| 8 | `b'Ocean\'s Twelve: just plain stupid, bad and ...` | aclImdb/test/neg/265_1.txt | 0 | 742 |
| 9 | `b"This excellent movie starring Elizabeth Mont...` | aclImdb/test/pos/3908_10.txt | 1 | 723 |


In [18]:
# ----- Predicting y from test data -----

test['y_pred'] = test.length.apply(lambda x: 1 if x < average else 0)
test['y_pred'][:10]

0    1
1    1
2    0
3    1
4    1
5    1
6    1
7    0
8    1
9    1
Name: y_pred, dtype: int64

In [19]:
print("")
print("Everything below mean length is positive")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Training Data")
print("Precison:", get_precision(train.y_pred, train.target)*100)
print("Recall:", get_recall(train.y_pred, train.target)*100)
print("F1-score:", f_score(train.y_pred, train.target)*100)
print("")
print("------------------------------------------")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Testing Data")
print("Precison:", get_precision(test.y_pred, test.target)*100)
print("Recall:", get_recall(test.y_pred, test.target)*100)
print("F1-score:", f_score(test.y_pred, test.target)*100)
print("")


Everything below mean length is positive

Review Length Baseline Model
----------------------------
Performance Measures of our model on Training Data
Precison: 49.31696455437203
Recall: 65.56
F1-score: 56.29013978088403

------------------------------------------

Review Length Baseline Model
----------------------------
Performance Measures of our model on Testing Data
Precison: 50.07048044167743
Recall: 68.2
F1-score: 57.74571564045249



In [20]:
# ----- Reversing rule one (everything above the mean is positive) -----
# ----- to see which of the two rules performs better


train['y_pred'] = train.length.apply(lambda x: 1 if x > average else 0)

In [21]:
test['y_pred'] = test.length.apply(lambda x: 1 if x > average else 0)

In [22]:
print("")
print("Everything above mean length is positive")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Training Data")
print("Precison:", get_precision(train.y_pred, train.target)*100)
print("Recall:", get_recall(train.y_pred, train.target)*100)
print("F1-score:", f_score(train.y_pred, train.target)*100)
print("")
print("------------------------------------------")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Testing Data")
print("Precison:", get_precision(test.y_pred, test.target)*100)
print("Recall:", get_recall(test.y_pred, test.target)*100)
print("F1-score:", f_score(test.y_pred, test.target)*100)
print("")


Everything above mean length is positive

Review Length Baseline Model
----------------------------
Performance Measures of our model on Training Data
Precison: 51.35393057378027
Recall: 34.44
F1-score: 41.22970837523345

------------------------------------------

Review Length Baseline Model
----------------------------
Performance Measures of our model on Testing Data
Precison: 49.849510910458996
Recall: 31.8
F1-score: 38.829735274006055



In [23]:
# ----- Taking an arbitrary value to see how it performs  -----
# ----- Taking everything above 750 to be negative -----


train['y_pred'] = train.length.apply(lambda x: 0 if x > 750 else 1)
train['y_pred'][:10]

0    0
1    0
2    1
3    0
4    0
5    1
6    1
7    0
8    0
9    1
Name: y_pred, dtype: int64

In [24]:
test['y_pred'] = test.length.apply(lambda x: 0 if x > 750 else 1)
test['y_pred'][:10]

0    1
1    1
2    0
3    0
4    0
5    0
6    1
7    0
8    1
9    1
Name: y_pred, dtype: int64

In [25]:
print("")
print("Everything above review length 750 is negative")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Training Data")
print("Precison:", get_precision(train.y_pred, train.target)*100)
print("Recall:", get_recall(train.y_pred, train.target)*100)
print("F1-score:", f_score(train.y_pred, train.target)*100)
print("")
print("------------------------------------------")
print("")
print("Review Length Baseline Model")
print("----------------------------")
print("Performance Measures of our model on Testing Data")
print("Precison:", get_precision(test.y_pred, test.target)*100)
print("Recall:", get_recall(test.y_pred, test.target)*100)
print("F1-score:", f_score(test.y_pred, test.target)*100)
print("")


Everything above review length 750 is negative

Review Length Baseline Model
----------------------------
Performance Measures of our model on Training Data
Precison: 51.209416634329315
Recall: 31.672
F1-score: 39.1379566012555

------------------------------------------

Review Length Baseline Model
----------------------------
Performance Measures of our model on Testing Data
Precison: 51.707255837308566
Recall: 32.952
F1-score: 40.25212547639988



<br>
<br>

# Naïve Bayes classifier #1

### Problem 4

Using the built-in Gaussian Naive Bayes model from sklearn to train a classifier on TF-IDF features.

In [26]:
# ----- Importing libraries for this model -----


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB
import nltk

In [27]:
# ----- Creating a CountVectorizer object and making vectors of the reviews data ----- 


Count_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)

review_counts = Count_vec.fit_transform(train.data)
test_counts = Count_vec.transform(test.data)

In [28]:
# ----- Dimensions of raw frequency counts -----

review_counts.shape

(25000, 53838)

In [29]:
# ----- Convert raw frequency counts into TF-IDF values ----- 


tfidf_transformer = TfidfTransformer()

reviews_tfidf = tfidf_transformer.fit_transform(review_counts)
test_tfidf = tfidf_transformer.transform(test_counts)

In [30]:
# ----- Same dimensions, now with tf-idf values instead of raw frequency counts -----

reviews_tfidf.shape

(25000, 53838)
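
To make the count-to-TF-IDF step concrete, here is a minimal NumPy sketch on a made-up 3x3 count matrix. It assumes scikit-learn's default `TfidfTransformer` settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the matrix and variable names are illustrative only.

```python
import numpy as np

# Hypothetical raw counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 1.0, 0.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # number of documents containing each term

# Smoothed inverse document frequency (assumed default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                                # weight counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each document

print(idf)
print(tfidf)
```

The term in the middle column appears in every document, so it gets the lowest idf weight, while the matrix keeps the same shape as the raw counts, as the two `.shape` outputs above also show.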

In [31]:
# ----- Training a Gaussian Naive Bayes classifier on Training data ----- 

G_clf = GaussianNB()

# ----- Using partial_fit on batches of 5000 rows due to memory constraints (each batch is converted to a dense array); fit() can be used to do the same in one call -----

G_clf.partial_fit(reviews_tfidf[:5000].toarray(), train.target[:5000], classes=[0,1])

GaussianNB(priors=None, var_smoothing=1e-09)

In [32]:
G_clf.partial_fit(reviews_tfidf[5000:10000].toarray(), train.target[5000:10000])

GaussianNB(priors=None, var_smoothing=1e-09)

In [33]:
G_clf.partial_fit(reviews_tfidf[10000:15000].toarray(), train.target[10000:15000])

GaussianNB(priors=None, var_smoothing=1e-09)

In [34]:
G_clf.partial_fit(reviews_tfidf[15000:20000].toarray(), train.target[15000:20000])

GaussianNB(priors=None, var_smoothing=1e-09)

In [35]:
G_clf.partial_fit(reviews_tfidf[20000:].toarray(), train.target[20000:])

GaussianNB(priors=None, var_smoothing=1e-09)
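
The five `partial_fit` cells above follow one pattern and could be written as a loop. The helper `batch_slices` below is a hypothetical name introduced for this sketch; the commented usage refers to the notebook's own `G_clf`, `reviews_tfidf` and `train.target` and is not executed here.

```python
def batch_slices(n_rows, batch_size):
    """Yield consecutive slices that cover rows 0..n_rows-1 in batches."""
    for start in range(0, n_rows, batch_size):
        yield slice(start, min(start + batch_size, n_rows))


batches = list(batch_slices(25000, 5000))
print(batches)

# Sketch of how the batches would be used with the notebook's objects:
# for i, rows in enumerate(batch_slices(reviews_tfidf.shape[0], 5000)):
#     X_batch = reviews_tfidf[rows].toarray()      # densify one batch at a time
#     if i == 0:
#         G_clf.partial_fit(X_batch, train.target[rows], classes=[0, 1])
#     else:
#         G_clf.partial_fit(X_batch, train.target[rows])
```

The same slices could also drive prediction in batches instead of one row at a time, which keeps memory bounded while avoiding 25000 separate `predict` calls.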

In [36]:
# ----- Calculating predicted y values for training data ----- 

G_y_pred_train = []
for i in range(reviews_tfidf.shape[0]):
    # predict() returns a one-element array for a single row; take its only entry
    G_y_pred_train.append(int(G_clf.predict(reviews_tfidf[i].toarray())[0]))

In [37]:
# ----- Calculating model accuracy using built-in function ----- 

sklearn.metrics.accuracy_score(train.target.tolist(), G_y_pred_train)

0.94692

In [38]:
# ----- Calculating predicted y values for test data ----- 

G_y_pred_test = []
for i in range(test_tfidf.shape[0]):
    # predict() returns a one-element array for a single row; take its only entry
    G_y_pred_test.append(int(G_clf.predict(test_tfidf[i].toarray())[0]))

In [39]:
# ----- Calculating model accuracy using built-in function ----- 

sklearn.metrics.accuracy_score(test.target.tolist(), G_y_pred_test)

0.5988

In [40]:
print("")
print("Gaussian Naive Bayes Model")
print("--------------------------")
print("Performance Measures of our model on Training Data:")
print("Precison:", get_precision(G_y_pred_train, train.target.tolist())*100)
print("Recall:", get_recall(G_y_pred_train, train.target.tolist())*100)
print("F1-score:", f_score(G_y_pred_train, train.target.tolist())*100)
print("")
print("------------------------------------------")
print("")
print("Gaussian Naive Bayes Model")
print("--------------------------")
print("Performance Measures of our model on Testing Data:")
print("Precison:", get_precision(G_y_pred_test, test.target.tolist())*100)
print("Recall:", get_recall(G_y_pred_test, test.target.tolist())*100)
print("F1-score:", f_score(G_y_pred_test, test.target.tolist())*100)
print("")


Gaussian Naive Bayes Model
--------------------------
Performance Measures of our model on Training Data:
Precison: 99.33763136977832
Recall: 89.984
F1-score: 94.42975275993787

------------------------------------------

Gaussian Naive Bayes Model
--------------------------
Performance Measures of our model on Testing Data:
Precison: 62.81386179705333
Recall: 48.431999999999995
F1-score: 54.693287559851846



<br>
<br>

# Naïve Bayes classifier #2

### Problem 5

A Multinomial Naive Bayes classifier is implemented here using the sklearn package.

In [41]:
# ----- Using Multinomial Naive Bayes as the model ----- 


from sklearn.naive_bayes import MultinomialNB

In [42]:
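# ----- Training a Multinomial Naive Bayes classifier on Training data ----- 

M_clf = MultinomialNB()


# ----- Using partial_fit on batches of 5000 rows, as for the Gaussian model; fit() can be used to do the same -----
# ----- MultinomialNB also accepts sparse input, so .toarray() is not strictly required here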

M_clf.partial_fit(reviews_tfidf[:5000].toarray(), train.target[:5000], classes=[0,1])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [43]:
M_clf.partial_fit(reviews_tfidf[5000:10000].toarray(), train.target[5000:10000])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [44]:
M_clf.partial_fit(reviews_tfidf[10000:15000].toarray(), train.target[10000:15000])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [45]:
M_clf.partial_fit(reviews_tfidf[15000:20000].toarray(), train.target[15000:20000])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [46]:
M_clf.partial_fit(reviews_tfidf[20000:].toarray(), train.target[20000:])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [47]:
# ----- Calculating predicted y values for training data ----- 

M_y_pred_train = []
for i in range(reviews_tfidf.shape[0]):
    # predict() returns a one-element array for a single row; take its only entry
    M_y_pred_train.append(int(M_clf.predict(reviews_tfidf[i].toarray())[0]))

In [48]:
# ----- Calculating model accuracy using built-in function ----- 

sklearn.metrics.accuracy_score(train.target.tolist(), M_y_pred_train)

0.90688

In [49]:
# ----- Calculating predicted y values for test data ----- 

M_y_pred_test = []
for i in range(test_tfidf.shape[0]):
    # predict() returns a one-element array for a single row; take its only entry
    M_y_pred_test.append(int(M_clf.predict(test_tfidf[i].toarray())[0]))

In [50]:
# ----- Calculating model accuracy using built-in function ----- 

sklearn.metrics.accuracy_score(test.target.tolist(), M_y_pred_test)

0.83244

In [51]:
print("")
print("Multinominal Naive Bayes Model")
print("------------------------------")
print("Performance Measures of our model on Training Data:")
print("Precison:", get_precision(M_y_pred_train, train.target.tolist())*100)
print("Recall:", get_recall(M_y_pred_train, train.target.tolist())*100)
print("F1-score:", f_score(M_y_pred_train, train.target.tolist())*100)
print("")
print("------------------------------------------")
print("")
print("Multinominal Naive Bayes Model")
print("------------------------------")
print("Performance Measures of our model on Testing Data:")
print("Precison:", get_precision(M_y_pred_test, test.target.tolist())*100)
print("Recall:", get_recall(M_y_pred_test, test.target.tolist())*100)
print("F1-score:", f_score(M_y_pred_test, test.target.tolist())*100)
print("")


Multinominal Naive Bayes Model
------------------------------
Performance Measures of our model on Training Data:
Precison: 92.70361041141898
Recall: 88.32799999999999
F1-score: 90.46292503072512

------------------------------------------

Multinominal Naive Bayes Model
------------------------------
Performance Measures of our model on Testing Data:
Precison: 87.84953092267055
Recall: 77.16
F1-score: 82.15852463903913



### Problem 6:

We can deduce the following from the four models above:

1. MAJORITY CLASS BASELINE MODEL: The Majority Class baseline is the most naive of them all. Because the classes are balanced, it reaches exactly 50% accuracy no matter which class we choose as the majority class. 


2. REVIEW LENGTH BASELINE MODEL: The Review Length model performs close to chance. Precision stays near 50% on the test data for all three thresholds, and the test F1-score ranges from about 39% to about 58%. Review length alone carries little information about sentiment, and the thresholds used are only guesses.


3. GAUSSIAN NAIVE BAYES CLASSIFIER MODEL: Unlike the two baselines, the Gaussian model learns from the words in the reviews. However, it models every feature with a normal distribution, which suits dense continuous features but fits sparse TF-IDF values, which are mostly zero, poorly. It is not good at generalizing: accuracy drops from 94.7% on the training data to 59.9% on the test data. 


4. MULTINOMIAL NAIVE BAYES CLASSIFIER MODEL: The Multinomial model is designed for discrete features such as word counts, and it also works well with the TF-IDF values used here. This is the best model of the lot, with the highest test scores (83.2% accuracy, 82.2% F1-score). It is good at generalizing: test accuracy is only about 7 points below training accuracy.
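
The claims about generalization can be checked directly from the accuracies the notebook printed. The sketch below only does arithmetic on those reported values; the dictionary layout and variable names are illustrative.

```python
# Accuracies copied from the notebook outputs above
reported_accuracy = {
    "Gaussian NB":    {"train": 0.94692, "test": 0.5988},
    "Multinomial NB": {"train": 0.90688, "test": 0.83244},
}

# Generalization gap: how much accuracy is lost when moving from training to test data
gaps = {
    name: round(acc["train"] - acc["test"], 5)
    for name, acc in reported_accuracy.items()
}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap * 100:.2f} points")
```

A large gap suggests the model has fit the training reviews far more closely than it can reproduce on unseen ones, which is the pattern the Gaussian model shows here.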


<br>
<br>

# Vector Semantics

### Problem 7:

Analogy prediction.

In [52]:
# ----- Reading file word-test.v1.txt, which has all the analogies -----
# ----- The file is in the same directory as this notebook

with open("word-test.v1.txt", 'rt') as file:
    text = file.read()



# ----- Lines starting with ":" delimit the groups of words -----

tokens = text.split(":")



# ----- PREPROCESSING text -----
# ----- taking the right index from the split list and splitting it further on "\n",
# ----- lowercasing the text and removing the tabs,
# ----- leaving out the first entry (the group name) and the last (empty) entry

captial_world = [x.lower().replace('\t', '') for x in tokens[2].split("\n")][1:-1]
currency = [x.lower().replace('\t', '') for x in tokens[3].split("\n")][1:-1]
city_in_state = [x.lower().replace('\t', '') for x in tokens[4].split("\n")][1:-1]
family = [x.lower().replace('\t', '') for x in tokens[5].split("\n")][1:-1]
gram1_adjective_to_adverb = [x.lower().replace('\t', '') for x in tokens[6].split("\n")][1:-1]
gram2_opposite = [x.lower().replace('\t', '') for x in tokens[7].split("\n")][1:-1]
gram3_comparative = [x.lower().replace('\t', '') for x in tokens[8].split("\n")][1:-1]
gram6_nationality_adjective = [x.lower().replace('\t', '') for x in tokens[11].split("\n")][1:-1]
captial_world

['abuja nigeria accra ghana',
 'abuja nigeria algiers algeria',
 'abuja nigeria amman jordan',
 'abuja nigeria ankara turkey',
 'abuja nigeria antananarivo madagascar',
 'abuja nigeria apia samoa',
 'abuja nigeria ashgabat turkmenistan',
 'abuja nigeria asmara eritrea',
 'abuja nigeria astana kazakhstan',
 'abuja nigeria athens greece',
 'abuja nigeria baghdad iraq',
 'abuja nigeria baku azerbaijan',
 'abuja nigeria bamako mali',
 'abuja nigeria bangkok thailand',
 'abuja nigeria banjul gambia',
 'abuja nigeria beijing china',
 'abuja nigeria beirut lebanon',
 'abuja nigeria belgrade serbia',
 'abuja nigeria belmopan belize',
 'abuja nigeria berlin germany',
 'abuja nigeria bern switzerland',
 'abuja nigeria bishkek kyrgyzstan',
 'abuja nigeria bratislava slovakia',
 'abuja nigeria brussels belgium',
 'abuja nigeria bucharest romania',
 'abuja nigeria budapest hungary',
 'abuja nigeria bujumbura burundi',
 'abuja nigeria cairo egypt',
 'abuja nigeria canberra australia',
 ...
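
Selecting groups by position (`tokens[2]`, `tokens[11]`, ...) depends on the exact order of the file. A possible alternative, sketched here under the assumption that each group starts with a line of the form `: group-name`, is to key the groups by name. The function `parse_analogy_sections`, the group names and the sample text are illustrative, not taken from the actual file.

```python
def parse_analogy_sections(text):
    """Map each ': group-name' header to its list of lowercased analogy lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            # Collapse tabs and repeated spaces into single spaces
            sections[current].append(" ".join(line.lower().split()))
    return sections


# Hypothetical excerpt in the assumed file format
sample_text = (
    ": capital-world\n"
    "Abuja Nigeria Accra Ghana\n"
    "Abuja Nigeria\tAlgiers Algeria\n"
    ": family\n"
    "boy girl brother sister\n"
)

sections = parse_analogy_sections(sample_text)
print(sections["capital-world"])
print(sections["family"])
```

Lines that appear before the first header (for example a comment line) are skipped, so the result does not shift if such lines are added or removed.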

<br>

GloVe Download link: https://nlp.stanford.edu/projects/glove/

GloVe File Download link: http://nlp.stanford.edu/data/glove.twitter.27B.zip

In [53]:
# ----- Converting pretrained GloVe word embeddings to word2vec and creating a model -----

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import os


# ----- The file "glove.twitter.27B.100d.txt" is inside "glove.twitter.27B" folder -----
# ----- which is in the same directory as this notebook

glove_file = datapath(os.path.abspath("glove.twitter.27B/glove.twitter.27B.100d.txt"))
tmp_file = get_tmpfile("test_word2vec.txt")
glove2word2vec(glove_file, tmp_file)
glove_model = KeyedVectors.load_word2vec_format(tmp_file)


# ----- Reference: https://radimrehurek.com/gensim/scripts/glove2word2vec.html -----

In [54]:
# ----- Model accuracy function that calculates accuracy for all models ----- 

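def model_accuracy(test, model):
    # Each test line holds four words "a b c d", read as "a is to b as c is to d"
    accuracy = []
    for words in test:
        a, b, c, d = words.split()
        try:
            # Word most similar (by cosine similarity) to the combination b - a + c
            p = model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        except KeyError:
            # A word missing from the model's vocabulary counts as a wrong answer
            p = 'dummy'
        if p == d:
            accuracy.append(1)
        else:
            accuracy.append(0)
    return (accuracy.count(1)/len(accuracy))*100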
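
The idea behind `most_similar(positive=[b, c], negative=[a])` can be illustrated with toy vectors. This is a rough approximation under stated assumptions: the 3-dimensional vectors are invented, and `solve_analogy` is a hypothetical helper that combines unit-length vectors as `b - a + c` and returns the closest remaining word by cosine similarity, skipping the three input words.

```python
import numpy as np

# Invented toy embeddings, for illustration only
toy_vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "king":  np.array([1.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([1.0, 0.0, 1.0]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest word to b - a + c."""
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue                      # the input words are never returned
        score = float(np.dot(target, unit(vec)))   # cosine similarity
        if score > best_score:
            best_word, best_score = word, score
    return best_word


answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

An analogy counts as correct in `model_accuracy` only when this single top-ranked word equals the expected fourth word, which makes the measure strict.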

In [55]:
print("GloVe Model Accuracy for 'captial_world':", model_accuracy(captial_world, glove_model))

GloVe Model Accuracy for 'captial_world': 45.44650751547303


In [56]:
print("GloVe Model Accuracy for 'currency':", model_accuracy(currency, glove_model))

GloVe Model Accuracy for 'currency': 1.0392609699769053


In [57]:
print("GloVe Model Accuracy for 'city_in_state':", model_accuracy(city_in_state, glove_model))

GloVe Model Accuracy for 'city_in_state': 17.916497770571542


In [58]:
print("GloVe Model Accuracy for 'family':", model_accuracy(family, glove_model))

GloVe Model Accuracy for 'family': 60.27667984189723


In [59]:
print("GloVe Model Accuracy for 'gram1_adjective_to_adverb':", model_accuracy(gram1_adjective_to_adverb, glove_model))

GloVe Model Accuracy for 'gram1_adjective_to_adverb': 7.560483870967742


In [60]:
print("GloVe Model Accuracy for 'gram2_opposite':", model_accuracy(gram2_opposite, glove_model))

GloVe Model Accuracy for 'gram2_opposite': 16.871921182266007


In [61]:
print("GloVe Model Accuracy for 'gram3_comparative':", model_accuracy(gram3_comparative, glove_model))

GloVe Model Accuracy for 'gram3_comparative': 62.53753753753754


In [62]:
print("GloVe Model Accuracy for 'gram6_nationality_adjective':", model_accuracy(gram6_nationality_adjective, glove_model))

GloVe Model Accuracy for 'gram6_nationality_adjective': 45.77861163227016


<br>

Lexvec Download link: https://github.com/alexandres/lexvec

Lexvec File Download link: https://www.dropbox.com/s/kguufyc2xcdi8yk/lexvec.enwiki%2Bnewscrawl.300d.W.pos.vectors.gz?dl=1

In [63]:
# ----- Loading pretrained LexVec word embeddings (already in word2vec format) and creating a model -----
# ----- File "lexvec.enwiki+newscrawl.300d.W.pos.vectors" is stored in the same directory as the notebook 

lexvec_model = KeyedVectors.load_word2vec_format("lexvec.enwiki+newscrawl.300d.W.pos.vectors")

In [64]:
print("Lexvec Model Accuracy for 'captial_world':", model_accuracy(captial_world, lexvec_model))

Lexvec Model Accuracy for 'captial_world': 94.36339522546419


In [65]:
print("Lexvec Model Accuracy for 'currency':", model_accuracy(currency, lexvec_model))

Lexvec Model Accuracy for 'currency': 22.0554272517321


In [66]:
print("Lexvec Model Accuracy for 'city_in_state':", model_accuracy(city_in_state, lexvec_model))

Lexvec Model Accuracy for 'city_in_state': 72.63883259019052


In [67]:
print("Lexvec Model Accuracy for 'family':", model_accuracy(family, lexvec_model))

Lexvec Model Accuracy for 'family': 87.74703557312253


In [68]:
print("Lexvec Model Accuracy for 'gram1_adjective_to_adverb':", model_accuracy(gram1_adjective_to_adverb, lexvec_model))

Lexvec Model Accuracy for 'gram1_adjective_to_adverb': 24.899193548387096


In [69]:
print("Lexvec Model Accuracy for 'gram2_opposite':", model_accuracy(gram2_opposite, lexvec_model))

Lexvec Model Accuracy for 'gram2_opposite': 36.57635467980296


In [70]:
print("Lexvec Model Accuracy for 'gram3_comparative':", model_accuracy(gram3_comparative, lexvec_model))

Lexvec Model Accuracy for 'gram3_comparative': 87.31231231231232


In [71]:
print("Lexvec Model Accuracy for 'gram6_nationality_adjective':", model_accuracy(gram6_nationality_adjective, lexvec_model))

Lexvec Model Accuracy for 'gram6_nationality_adjective': 91.80737961225766


<br>
<br>

### Problem 8: 

Searching for the 10 most similar words to 'increase' and 'enter' using cosine similarity.

In [72]:
glove_model.most_similar("increase", topn = 10)

[('increased', 0.887993574142456),
 ('increasing', 0.8340300917625427),
 ('decrease', 0.7918903827667236),
 ('increases', 0.7880600690841675),
 ('improve', 0.7716851234436035),
 ('boost', 0.7650824785232544),
 ('reduce', 0.7445826530456543),
 ('profits', 0.7273502349853516),
 ('growth', 0.7266506552696228),
 ('lower', 0.7259215116500854)]

In [73]:
lexvec_model.most_similar("increase", topn = 10)

[('decrease', 0.8544164896011353),
 ('increased', 0.8349589109420776),
 ('increases', 0.7846347689628601),
 ('increasing', 0.7320787906646729),
 ('decreased', 0.6769896745681763),
 ('reduce', 0.6768884658813477),
 ('rise', 0.6671466827392578),
 ('reduced', 0.6613093614578247),
 ('decline', 0.6586759090423584),
 ('reduction', 0.6572977304458618)]

In [74]:
glove_model.most_similar("enter", topn = 10)

[('giveaway', 0.8220357894897461),
 ('entered', 0.8185302019119263),
 ('prize', 0.766309380531311),
 ('competition', 0.7326720356941223),
 ('contest', 0.7288600206375122),
 ('win', 0.7278855443000793),
 ('prizes', 0.7236887216567993),
 ('comp', 0.7213659882545471),
 ('winners', 0.6834891438484192),
 ('winner', 0.6830912232398987)]

In [75]:
lexvec_model.most_similar("enter", topn = 10)

[('entering', 0.7462573647499084),
 ('entered', 0.7203364968299866),
 ('reenter', 0.6578646898269653),
 ('participate', 0.5398669242858887),
 ('enters', 0.5306301116943359),
 ('arrive', 0.5217719078063965),
 ('exiting', 0.5177127718925476),
 ('join', 0.505316972732544),
 ('depart', 0.4942780137062073),
 ('leave', 0.48944905400276184)]

Explanation: Word embeddings such as GloVe and LexVec do not distinguish synonyms from antonyms. They give two words a high similarity when the words occur in similar contexts. A word like 'increase' occurs in the same contexts as 'decrease', 'decreased', 'reduce', 'reduced' and 'decline', which are its antonyms, so these show up among its nearest neighbours. Take the sentence "the pilot _______ the cabin": the blank can be filled with either 'enters' or 'exits', so the similarity between these two words is high, because they occur with the same surrounding words. Words that can substitute for each other in this way are said to be in a paradigmatic relation. The training corpus matters as well: the GloVe vectors used here were trained on Twitter data, where 'enter' mostly appears next to giveaways, prizes and contests.
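
A small numeric illustration of this effect, using invented co-occurrence counts rather than the real embeddings: two antonyms that share their contexts end up with almost parallel vectors, while an unrelated word does not. The context words and all counts are assumptions made up for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence counts with the contexts: prices, sharply, will, eat
context_counts = {
    "increase": [8, 5, 6, 0],
    "decrease": [7, 6, 5, 0],
    "banana":   [0, 0, 1, 9],
}

sim_antonym = cosine(context_counts["increase"], context_counts["decrease"])
sim_unrelated = cosine(context_counts["increase"], context_counts["banana"])

print(round(sim_antonym, 3), round(sim_unrelated, 3))
```

The antonym pair scores close to 1 because the two words appear with the same neighbours, even though their meanings are opposite.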

<br>

<br>

### Problem 9:

Two new types of analogy tests are designed: 'vehicle_mode' and 'animal_child'. 
'vehicle_mode' is a list of vehicles and the medium they travel on. 
'animal_child' is a list of animals and the names of their young.
Both lists follow the format of the analogies dataset.

Two other lists, 'animal_habitat' and 'animal_children', are also created. 

In [76]:
vehicle_mode = ['boat water car road',
                'boat water train track',
                'car road train track',
                'car road boat water',
                'train track boat water',
                'train track car road']

animal_child = ['cat kitten dog puppy',
                'cat kitten goat kid',
                'dog puppy cat kitten',
                'dog puppy goat kid',
                'goat kid cat kitten',
                'goat kid dog puppy']

In [78]:
print("Glove Model Accuracy for 'vehicle_mode':", model_accuracy(vehicle_mode, glove_model))

Glove Model Accuracy for 'vehicle_mode': 0.0


In [79]:
print("Lexvec Model Accuracy for 'vehicle_mode':", model_accuracy(vehicle_mode, lexvec_model))

Lexvec Model Accuracy for 'vehicle_mode': 0.0


In [80]:
print("Glove Model Accuracy for 'animal_child':", model_accuracy(animal_child, glove_model))

Glove Model Accuracy for 'animal_child': 33.33333333333333


In [81]:
print("Lexvec Model Accuracy for 'animal_child':", model_accuracy(animal_child, lexvec_model))

Lexvec Model Accuracy for 'animal_child': 33.33333333333333


In [82]:
animal_habitat = ['dog kennel fowl coop',
                  'lion den mouse hole',
                  'horse stable bee hive']

animal_children = ['cat kitten dog puppy',
                   'cow calf goat kid',
                   'peacock peachicks rooster chick']

In [83]:
print("Glove Model Accuracy for 'animal_habitat':", model_accuracy(animal_habitat, glove_model))

Glove Model Accuracy for 'animal_habitat': 0.0


In [84]:
print("Lexvec Model Accuracy for 'animal_habitat':", model_accuracy(animal_habitat, lexvec_model))

Lexvec Model Accuracy for 'animal_habitat': 0.0


In [85]:
print("Glove Model Accuracy for 'animal_children':", model_accuracy(animal_children, glove_model))

Glove Model Accuracy for 'animal_children': 33.33333333333333


In [86]:
print("Lexvec Model Accuracy for 'animal_children':", model_accuracy(animal_children, lexvec_model))

Lexvec Model Accuracy for 'animal_children': 33.33333333333333


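If two words do not appear in each other's context windows in the training corpus, they will not show up in each other's similarity lists, and the accuracy for those analogies will be zero. In addition, model_accuracy counts an analogy as wrong whenever one of its words is missing from the model's vocabulary.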
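
One way to separate the two causes is to list the test words that a model does not know at all. The helper `missing_words` and the toy vocabulary below are hypothetical; with a real gensim model, the model object itself could likely be passed as `vocabulary`, assuming it supports the `in` operator for words.

```python
def missing_words(analogies, vocabulary):
    """Return the sorted words from the analogy lines that are not in the vocabulary."""
    missing = set()
    for line in analogies:
        for word in line.split():
            if word not in vocabulary:
                missing.add(word)
    return sorted(missing)


# Analogy list from the notebook
animal_children = ['cat kitten dog puppy',
                   'cow calf goat kid',
                   'peacock peachicks rooster chick']

# Made-up vocabulary standing in for a model; it deliberately lacks 'peachicks'
toy_vocabulary = {"cat", "kitten", "dog", "puppy", "cow", "calf",
                  "goat", "kid", "peacock", "rooster", "chick"}

unknown = missing_words(animal_children, toy_vocabulary)
print(unknown)
```

An analogy that contains an unknown word can never be scored as correct, so such lines say nothing about the quality of the vectors themselves.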

In [None]:
# ----- Deleting models -----

del glove_model
del lexvec_model