In [85]:
NAME = "Tyler Freund"

---

# Lab 6: Skip Gram

**Please read the following instructions very carefully**

## Working on the assignment / FAQs
- **Always use the seed/random_state as *42* wherever applicable** (This is to ensure repeatability in answers, across students and coding environments) 
- The type of question and the points they carry are indicated in each question cell
- To avoid any ambiguity, each question also specifies what *value* must be set. Note that these are dummy values and not the answers
- If an autograded question has multiple answers (due to differences in handling NaNs, zeros etc.), all answers will be considered.
- You can delete the `raise NotImplementedError()`
- **Submitting the assignment** : Download the '.ipynb' file from Colab and upload it to bcourses. Do not delete any outputs from cells before submitting.
- That's about it. Happy coding!


Available software:
 - Python's Gensim module: https://radimrehurek.com/gensim/ (install using pip)
 - Sklearn’s TSNE module, in case you use t-SNE to reduce dimensionality (optional)
 - Python’s Matplotlib (optional)

_Note: The most important hyperparameters of skip-gram/CBOW are vector size and window size_


In [7]:
!pip install gensim
!wget -nc https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz 


import pandas as pd
import numpy as np 
import gensim



File ‘GoogleNews-vectors-negative300.bin.gz’ already there; not retrieving.



### **Q1 (1 point)** 
Find the cosine similarity between the following word pairs

- (France, England)
- (smaller, bigger)
- (England, London)
- (France, Rocket)
- (big, bigger)

In [2]:
#Replace 0 with the code / value; Do not delete this cell
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

similarity_pair1 = model.similarity('France', 'England')
similarity_pair2 = model.similarity('smaller', 'bigger')
similarity_pair3 = model.similarity('England', 'London')
similarity_pair4 = model.similarity('France', 'Rocket')
similarity_pair5 = model.similarity('big', 'bigger')


In [3]:
#This is an autograded cell, do not edit/delete
print(similarity_pair1, similarity_pair2, similarity_pair3, similarity_pair4, similarity_pair5)

0.39804944 0.7302272 0.43992856 0.07114174 0.68423855
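
As a sanity check on what `model.similarity` returns, here is a minimal NumPy sketch of cosine similarity. The helper name `cosine_similarity` and the toy vectors are assumptions made for illustration; they are not part of the lab or of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (hypothetical, not taken from the Google News model).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

sim_ab = cosine_similarity(a, b)  # vectors 45 degrees apart
sim_ac = cosine_similarity(a, c)  # orthogonal vectors
print(sim_ab, sim_ac)
```

With the model loaded, `cosine_similarity(model['France'], model['England'])` should agree with `model.similarity('France', 'England')` up to floating-point rounding.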


### **Q2 (1 point)** 
Write an expression to extract the vector representations of the words: 

- France
- England
- smaller
- bigger
- rocket
- big

Get only the first 5 elements for each vector representation.

In [8]:
#Replace 0 with the code / value to get the first 5 elements of each vector; Do not delete this cell

vector_1 = model['France'][0:5]
vector_2 = model['England'][0:5]
vector_3 = model['smaller'][0:5]
vector_4 = model['bigger'][0:5]
vector_5 = model['rocket'][0:5]
vector_6 = model['big'][0:5]



In [9]:
#This is an autograded cell, do not edit/delete
print(vector_1)
print(vector_2)
print(vector_3)
print(vector_4)
print(vector_5)
print(vector_6)


[0.04858398 0.07861328 0.32421875 0.03491211 0.07714844]
[-0.19824219  0.11523438  0.0625     -0.05834961  0.2265625 ]
[-0.05004883  0.03417969 -0.0703125   0.17578125  0.00689697]
[-0.06542969 -0.09521484 -0.06225586  0.16210938  0.01989746]
[-0.03198242  0.27148438 -0.2890625  -0.15429688  0.16894531]
[ 0.11132812  0.10595703 -0.07373047  0.18847656  0.07666016]


### **Q3 (1 point)** 
Find the Euclidean distances between the word pairs:

- (France, England)
- (smaller, bigger)
- (England, London)
- (France, Rocket)
- (big, bigger)


In [10]:
#Replace 0 with the code / value; Do not delete this cell
def eu_dist_func(x_vec, y_vec):
    # Sum the squared element-wise differences, then take the square root.
    summ = 0
    for i in range(len(x_vec)):
        summ += (x_vec[i] - y_vec[i])**2
    return np.sqrt(summ)

eu_dist1 = eu_dist_func(model['France'], model['England'])
eu_dist2 = eu_dist_func(model['smaller'], model['bigger'])
eu_dist3 = eu_dist_func(model['England'], model['London'])
eu_dist4 = eu_dist_func(model['France'], model['Rocket'])
eu_dist5 = eu_dist_func(model['big'], model['bigger'])



In [11]:
#This is an autograded cell, do not edit / delete
print(eu_dist1)
print(eu_dist2)
print(eu_dist3)
print(eu_dist4)
print(eu_dist5)


3.015106670206923
1.861874365812832
2.8752836561510935
3.892070952915602
1.9586496085134812
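
The loop above can also be written with `np.linalg.norm`. The sketch below uses toy vectors and a hypothetical helper name, `euclidean_distance`, and also checks the identity that links Euclidean distance to cosine similarity for unit-length vectors. The distances printed above exceed 2, so the Google News vectors are not unit length and the identity does not apply to them directly. Even so, the two measures rank the five pairs in the same order here.

```python
import numpy as np

def euclidean_distance(u, v):
    """Euclidean (L2) distance between two 1-D vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.linalg.norm(u - v))

# Toy vectors (hypothetical): legs of a 3-4-5 right triangle.
u = np.array([3.0, 0.0])
v = np.array([0.0, 4.0])
dist_uv = euclidean_distance(u, v)

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0]) / np.sqrt(2.0)
lhs = euclidean_distance(a, b) ** 2
rhs = 2.0 - 2.0 * float(np.dot(a, b))
print(dist_uv, lhs, rhs)
```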


### **Q4 (1 point)**
Time to dabble with the power of Word2Vec. Find the 2 closest words for the following conditions:  
- (King - Man + Queen)
- (bigger - big + small)
- (waiting - wait + run)
- (Texas + Milwaukee – Wisconsin)

In [12]:
#Replace 0 with the code / value; Do not delete this cell
topmatches = model.most_similar(positive=['Queen', 'King'], negative=['Man'])
closest1 = topmatches[0][0], topmatches[1][0]

topmatches = model.most_similar(positive=['bigger', 'small'], negative=['big'])
closest2 = topmatches[0][0], topmatches[1][0]

topmatches = model.most_similar(positive=['waiting', 'run'], negative=['wait'])
closest3 = topmatches[0][0], topmatches[1][0]

topmatches = model.most_similar(positive=['Texas', 'Milwaukee'], negative=['Wisconsin'])
closest4 = topmatches[0][0], topmatches[1][0]

closest5 = 0






In [13]:
#This is an autograded cell, do not edit/delete
print(closest1)
print(closest2)
print(closest3)
print(closest4)
print(closest5)


('Queen_Elizabeth', 'monarch')
('larger', 'smaller')
('running', 'runs')
('Houston', 'Fort_Worth')
0
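
To make the vector arithmetic behind these analogies concrete, here is a rough sketch on hand-picked 2-D embeddings. It adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The `analogy` helper and `toy_vectors` are hypothetical, and this is only an approximation of the idea, not gensim's actual implementation.

```python
import numpy as np

# Hypothetical 2-D embeddings chosen by hand for illustration only.
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.7, 0.1]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=2):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    # Input words are excluded from the candidates.
    excluded = set(positive) | set(negative)
    scores = {
        w: float(np.dot(query, unit(v)))
        for w, v in vectors.items()
        if w not in excluded
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```

Excluding the input words from the candidates mirrors what `most_similar` does, which is why 'Queen' and 'King' never appear among their own closest matches above.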


### **Q5 (3 points)**
Using the vectors for the words in the Google News dataset, explore the semantic representation of these words through K-means clustering and explain your findings.

*Note : Since there are ~3Mil words in the vocabulary, you can downsample it to ~20-30k randomly selected words*

**Do not delete the below cell**

In [14]:
# YOUR CODE HERE
import random
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

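#Get a random sample of words and extract the corresponding vectors into a list.
#`model` is already a KeyedVectors object, so the deprecated `.wv` accessor is not needed.
#Assumes gensim >= 4; on gensim 3.x use list(model.vocab) instead.
all_words_list = list(model.index_to_key)
random.seed(42)
sample_list = random.sample(all_words_list, 25000)
vector_list = []
for word in sample_list:
  vector_list.append(model[word])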

#Explore when we make only 2 clusters.
print(" ")
print("2 CLUSTERS")
X = vector_list
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
values = kmeans.labels_
keys = sample_list
cluster_dict = dict(zip(keys, values))

print("--------First 100 Words in Cluster 1:----------")
n = 0
for i in range(len(sample_list)):
  if (cluster_dict[sample_list[i]] == 0) and n < 100:
    print(sample_list[i])
    n+=1

print("---------First 100 Words in Cluster 2:---------")
n = 0
for i in range(len(sample_list)):
  if (cluster_dict[sample_list[i]] == 1) and n < 100:
    print(sample_list[i])
    n+=1

#Explore when we make 20 clusters.
print(" ")
print("20 CLUSTERS")
X = vector_list
kmeans = KMeans(n_clusters=20, random_state=42)
kmeans.fit(X)
values = kmeans.labels_
keys = sample_list
cluster_dict = dict(zip(keys, values))

clust_range = np.arange(0, 20, 1)
for j in clust_range:
  print("--------First 10 Words in Cluster:----------")
  print(j)
  print(" ")
  n = 0
  for i in range(len(sample_list)):
    if (cluster_dict[sample_list[i]] == j) and n < 10:
      print(sample_list[i])
      n+=1





 
2 CLUSTERS
--------First 100 Words in Cluster 1:----------
Spokesman_Ramin_Mehmanparast
Houthis
Aloha_Airlines
INVESTING_ACTIVITIES_Capital
Ben_Matulino
Carlos_Mamani
wholly_owned_subsidiaries_Novatris
########_Fax_+##
automaker_Toyota_Motor
Service_RxPG
Mark_Mowers
Enterprises_NASDAQ_CETV
fever_cough_headache
Audited_Results
Napavalleyregister.com
replaceable_battery
Bureau_d'_Enquetes
BOARD_OF_REVIEW
Quiteh
genomic_medicine
By_Kristen_Zambo
Roberto_Carlos_Magalhaes
minister_Ayham_al
AirPrime_MC####
fried_onions
Ted_Schlafke
stent_thrombosis
www.advaoptical.com
Khalifa_Al_Thani
ZEW_index
Emre_Peker
Manthan_Systems
AsiaInfo_Holdings_NASDAQ
malignant_plasma
ABDULLAH
SENATORS
Crime_Victims_Compensation
Ronnie_Bremer
WARRI_Nigeria
ESCONDIDO_Calif.
Mary_MacElveen_Page
michael_jackson
bebe_outlet
Legal_WebTV_Network
skid
MATTHEW_PENNINGTON
Susan_Teri_Hatcher
Acanthamoeba_keratitis_painful
Allen_Lessels_covers
arrows_machetes
Pigeon_Detectives
agent_Pini_Zahavi
Particular_Concern
Faruqi_LL

K-means clustering findings:
With only 2 clusters (i.e., k = 2), it was hard to see a distinct pattern separating the two clusters. I think this is partly because the sampled words span such a vast range of meanings: some are foods, some are places, some are people, some are animals, some are objects, and so on. With 20 clusters, however, I saw a much stronger pattern within each cluster. For example, cluster 4 of the k = 20 model contained 4 food/drink words among those listed, and nearly all of the words in cluster 17 were people, almost all of them athletes. Thus, increasing the number of clusters made the theme of each cluster more apparent.
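
Reading the first few sampled words of each cluster is somewhat arbitrary. One possible alternative is to list the words closest to each centroid. The sketch below uses toy data and a hypothetical helper, `words_nearest_centroids`; the toy arrays stand in for `sample_list`, `vector_list`, `kmeans.labels_`, and `kmeans.cluster_centers_`.

```python
import numpy as np

def words_nearest_centroids(words, X, labels, centers, topn=3):
    """For each cluster, return its member words closest to the centroid."""
    X = np.asarray(X, dtype=np.float64)
    labels = np.asarray(labels)
    result = {}
    for k, center in enumerate(centers):
        member_idx = np.flatnonzero(labels == k)        # rows in cluster k
        dists = np.linalg.norm(X[member_idx] - center, axis=1)
        order = member_idx[np.argsort(dists)][:topn]    # closest first
        result[k] = [words[i] for i in order]
    return result

# Hypothetical toy data (not from the Google News model).
toy_words = ["pizza", "pasta", "soup", "striker", "goalie"]
toy_X = np.array([[0.0, 0.1], [0.0, 0.4], [0.3, 0.0], [5.0, 5.0], [5.0, 5.5]])
toy_labels = np.array([0, 0, 0, 1, 1])
toy_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

summary = words_nearest_centroids(toy_words, toy_X, toy_labels, toy_centers, topn=2)
print(summary)
```

On the real data the call would look like `words_nearest_centroids(sample_list, vector_list, kmeans.labels_, kmeans.cluster_centers_, topn=10)`.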

### **Q6 (1 point)**
What loss function does the skipgram model use and briefly describe what this function is minimizing.

**Do not delete the below cell**

In [15]:
# YOUR CODE HERE
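# The skip-gram model uses a categorical cross-entropy loss function. Cross-entropy measures the difference between two probability distributions; less
# formally, it measures how far the predicted probability of each word is from its target (which is either 0 or 1). Minimizing it is equivalent to
# maximizing the log-probability that the model assigns to the true context words given the center word.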
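
To make the loss concrete, here is a minimal sketch of the cross-entropy for a single (center, context) pair under a full softmax. The function name, the 3-word vocabulary, and the vectors are all hypothetical. Practical word2vec implementations usually replace the full softmax with an approximation such as negative sampling or hierarchical softmax, so this illustrates the idea rather than the exact training objective.

```python
import numpy as np

def skipgram_softmax_loss(center_vec, output_vecs, context_idx):
    """Cross-entropy loss -log p(context | center) under a full softmax."""
    scores = output_vecs @ center_vec      # one score per vocabulary word
    scores = scores - scores.max()         # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[context_idx])

# Hypothetical 3-word vocabulary with 2-D vectors.
center = np.array([1.0, 0.0])
outputs = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

loss_likely = skipgram_softmax_loss(center, outputs, context_idx=0)
loss_unlikely = skipgram_softmax_loss(center, outputs, context_idx=2)
print(loss_likely, loss_unlikely)
```

The loss is small when the true context word already has a high score against the center word, and large when it does not, so minimizing it pulls the vectors of co-occurring words into alignment.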

### **Bonus Question (1 point)** 
Find at least 2 interesting word vec combinations like the ones given in Q4

**Do not delete the below cell**

In [16]:
# YOUR CODE HERE

# (Dog + Kitten – Cat)
topmatches = model.most_similar(positive=['Dog', 'Kitten'], negative=['Cat'])
closest1_bonus = topmatches[0][0], topmatches[1][0]
print(closest1_bonus)

# (Dad + daughter – Mom)
topmatches = model.most_similar(positive=['Dad', 'daughter'], negative=['Mom'])
closest2_bonus = topmatches[0][0], topmatches[1][0]
print(closest2_bonus)


('Puppy', 'dog')
('son', 'father')
