<a href="https://colab.research.google.com/github/jan-kreischer/UZH_ML4NLP/blob/main/ex02_wordembeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 - Word Embeddings with PyTorch

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typcially, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.

Implement this model in Pytorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.


## Part 1: Training CBOW embeddings for both datasets
### 1. Setup
#### 1.1 Imports

In [1]:
# Loading all required external modules
import os

# torch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(1)

# numpy and pandas
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 800)

# tokenization
import nltk;
nltk.download('stopwords');
%matplotlib inline

from argparse import Namespace
from collections import Counter
import json
import string
import itertools
import regex as re
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split
import requests

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#### 1.2 Environment

In [2]:
# Check if device supports CUDA interface
CUDA = torch.cuda.is_available()
# Make program run on gpu (cuda:0) if available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu:0')
torch.cuda.set_device(device)
print('Using device:', device)

Using device: cuda:0


In [3]:
# Check and print information about available GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Oct 26 17:57:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    27W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Get GPU name
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-33ed3149-e1cb-65ff-04b8-d1ed1ce43f25)


In [5]:
# Check Memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


#### 1.3 Constants

In [34]:
# Define necessary constants
# which were given to us
# and dont change throughout the execution
FEATURE_COLUMN = 'Review'
CONTEXT_OFFSET = 2 # n words to the left, n to the right
BATCH_SIZE = 64

# Since overfitting is not an issue according
# to Claudio, we decided to train for more episodes
EPOCHS_HOTEL = 30
EMBEDDING_DIM_HOTEL = 50

# We decided to stay with 2 eposides here
# because the corpus is very big.
EPOCHS_SCIFI = 2
EMBEDDING_DIM_SCIFI = 50

### Hotel Reviews dataset
### 2. Data Preprocessing
#### 2.1 Data Acquisition

In [7]:
# Loading the tripadvisor data
url_tripadvisor = (r'https://raw.githubusercontent.com/abandonedrepo/test/master/tripadvisor_hotel_reviews.csv')
reviews_dataset = pd.read_csv(url_tripadvisor)
reviews_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


In [8]:
reviews_dataset.head(5)

Unnamed: 0,Review,Rating
0,"nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,",4
1,"ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite paid extra website description not, suite bedroom bathroom standard hotel room, took printed reservation desk showed said things like tv couch ect desk clerk told oh mixed suites description kimpton website sorry free breakfast, got kidding, embassy suits sitting room bathroom bedroom unlike kimpton calls suite, 5 day stay offer correct false advertising, send kimpton preferred guest website email asking failure provide suite advertised website reservation description furnished hard copy reservation printout website desk manager duty did not reply solution, send email trip guest survey did not follow email mail, guess tell concerned guest.the staff ranged indifferent not help...",2
2,"nice rooms not 4* experience hotel monaco seattle good hotel n't 4* level.positives large bathroom mediterranean suite comfortable bed pillowsattentive housekeeping staffnegatives ac unit malfunctioned stay desk disorganized, missed 3 separate wakeup calls, concierge busy hard touch, did n't provide guidance special requests.tv hard use ipod sound dock suite non functioning. decided book mediterranean suite 3 night weekend stay 1st choice rest party filled, comparison w spent 45 night larger square footage room great soaking tub whirlpool jets nice shower.before stay hotel arrange car service price 53 tip reasonable driver waiting arrival.checkin easy downside room picked 2 person jacuzi tub no bath accessories salts bubble bath did n't stay, night got 12/1a checked voucher bottle cham...",3
3,"unique, great stay, wonderful time hotel monaco, location excellent short stroll main downtown shopping area, pet friendly room showed no signs animal hair smells, monaco suite sleeping area big striped curtains pulled closed nice touch felt cosy, goldfish named brandi enjoyed, did n't partake free wine coffee/tea service lobby thought great feature, great staff friendly, free wireless internet hotel worked suite 2 laptops, decor lovely eclectic mix pattens color palatte, animal print bathrobes feel like rock stars, nice did n't look like sterile chain hotel hotel personality excellent stay,",5
4,"great stay great stay, went seahawk game awesome, downfall view building did n't complain, room huge staff helpful, booked hotels website seahawk package, no charge parking got voucher taxi, problem taxi driver did n't want accept voucher barely spoke english, funny thing speak arabic called started making comments girlfriend cell phone buddy, took second realize just said fact speak language face priceless, ass told, said large city, told head doorman issue called cab company promply answer did n't, apologized offered pay taxi, bucks 2 miles stadium, game plan taxi return going humpin, great walk did n't mind, right christmas wonderful lights, homeless stowed away building entrances leave, police presence not greatest area stadium, activities 7 blocks pike street waterfront great coff...",5


#### 2.2 Data Cleaning

In [9]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def clean_document(x):
    x = re.sub(r'\w*\d\w*', ' ', x) # Remove all words containing a number at any point e.g 20th
    x = re.sub(r'[^a-zA-Z\s]', ' ', x.lower(), re.I|re.A)
    x = re.sub(r'[\-!+_@*#\/$:)"\'.;,?&({}[]]*', ' ', x) # Remove all punctuation
    x = re.sub(r'\b\w{1,2}\b', ' ', x) # Removed short words with a length of less than 3
    x = re.sub(' +', ' ', x) # Substitute all multi spaces into single spaces.
    tokens = wpt.tokenize(x) # Tokenize and remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]
    x = ' '.join(filtered_tokens)
    return x

clean_corpus = np.vectorize(clean_document) # Apply cleaning to every document in the corpus

In [10]:
# Clean the reviews by removing punctuation characters and stopwords.
reviews = clean_corpus(reviews_dataset['Review'])

In [71]:
np.random.choice(reviews, 5)

array(['great hotel great location hotel located rambla city centre close access shops tour buses room immaculate plasma screen screen bathroom nice blue ambient lighting room good amenities bathroom large equipped concierge staff overall friendly helpful recommended lot restaurants night spots visit absolutely wonderful business centre use free open hours thing disliked face room safe bit small fit laptop definitely recommend hotel',
       'botel great looking reasonably priced hotel amsterdam location excellent minute walk central station mins dam square building work going botel day quiet night rooms basic clean small suite shower talked locals hotels amsterdam private facilities gives botel advantage breakfast pretty basic caters tastes road botel brilliant mexican restaurant called guadalupe worth visit handy staying botel',
       'nice star hotel clean hotel helpful staff enjoyable sip glass wine outside patio grand canal booked room low end hotel price scale adequate said clea

In [12]:
# Since we want to train a CBOW model with context width of 2
# on the reviews, we drop all reviews with less than 5 words.
# This is equivalent to only keeping instances with at least 5 words.
reviews = [review for review in reviews if len(review.split(" ")) >=  (2*CONTEXT_OFFSET + 1)]

In [13]:
# Remove infrequent words
reviews_word_list=list((" ".join(reviews)).split())

frequency = pd.value_counts(reviews_word_list)
infrequent_words = list(frequency[frequency <= 1].keys())
frequent_words = list(frequency[frequency > 1].keys())

# Create lookup table to check if word is infrequent (1) or not (0)
is_infrequent = {}
for infrequent_word in infrequent_words:
  is_infrequent[infrequent_word] = 1

for frequent_word in frequent_words:
  is_infrequent[frequent_word] = 0

In [14]:
print("The length of the frequent_words:{}".format(len(frequent_words)))
print("The length of the infrequent_words:{}".format(len(infrequent_words)))

The length of the frequent_words:24730
The length of the infrequent_words:23833


In [15]:
# In order to build the corpus for the reviews 
# we want to find every distinct word that occurs
# in at least one review.
# We join all reviews into one large string and then
# split it at every space to receive a list of words
# Then the set method is used in order to only
# retain unique words.
# This list is then alphabetically sorted
review_words = " ".join(reviews).split()
review_words = [w for w in review_words if not is_infrequent[w]]
reviews_vocabulary = sorted(set(review_words))

In [16]:
reviews_vocabulary_size = len(reviews_vocabulary)
print("The reviews use a vocabulary comprising {} different words.".format(reviews_vocabulary_size))

The reviews use a vocabulary comprising 24730 different words.


In [17]:
word2index = {w:i for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping words to indices
index2word = {i:w for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping indices to words

In [35]:
# Removed infrequent words from our corpus
def clean_infrequent_words(reviews):
  new_reviews=[]
  for review in reviews:
    infrequent_index=[]
    raw_text = review.split()
    for i in range(0,len(raw_text)):
      if is_infrequent[raw_text[i]]==1:
        infrequent_index.append(i)
    for i in sorted(infrequent_index,reverse=True):
      del raw_text[i]
    review=' '.join(raw_text)
    new_reviews.append(review)
  return new_reviews

reviews=clean_infrequent_words(reviews)
# drop review with less than 5 words again
reviews = [review for review in reviews if len(review.split(" ")) >=  (2*CONTEXT_OFFSET + 1)]

In [36]:
# Create the context => center word tuples
# from the documents
data = []
for review in reviews:
  raw_text = review.split()
  for i in range(CONTEXT_OFFSET, len(raw_text) - CONTEXT_OFFSET):
      context = [raw_text[i - 2], raw_text[i - 1],
                raw_text[i + 1], raw_text[i + 2]]
      #print(context)
      target = raw_text[i]
      data.append((context, target))

# Show some sample 'context -> center word' mappings
for i in range(5):
  print(data[i])

(['nice', 'hotel', 'parking', 'got'], 'expensive')
(['hotel', 'expensive', 'got', 'good'], 'parking')
(['expensive', 'parking', 'good', 'deal'], 'got')
(['parking', 'got', 'deal', 'stay'], 'good')
(['got', 'good', 'stay', 'hotel'], 'deal')


In [37]:
# The following function transforms the context
# into index notation
def make_context_vector(context, word2index):
    idxs = [word2index[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# Show one transformed sample context
make_context_vector(data[0][0], word2index)  # example

tensor([14585, 10605, 15681,  9535])

In [52]:
# Dataset containing the hotel reviews
# Complies with pytorchs Dataset interface
class WordContextsDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

  def __len__(self):
    return len(self.X)

In [39]:
X = np.array([i[0] for i in data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index) , X))

In [40]:
# Print some vetorized sample contexts
for i in range(5):
  print(X_vectors[i])

tensor([14585, 10605, 15681,  9535])
tensor([10605,  7848,  9535,  9494])
tensor([ 7848, 15681,  9494,  5718])
tensor([15681,  9535,  5718, 20853])
tensor([ 9535,  9494, 20853, 10605])


In [41]:
y = np.array([i[1] for i in data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index), y))

In [42]:
# Print some vectorized sample center words
for i in range(5):
  print(y_vectors[i])

tensor([7848])
tensor([15681])
tensor([9535])
tensor([9494])
tensor([5718])


In [43]:
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y_vectors, test_size=0.2, random_state=42, shuffle=True)

In [53]:
# Create the training dataset from vectors
hotel_reviews_dataset = WordContextsDataset(X_train, y_train)

### 3. Modelling

In [45]:
hotel_reviews_loader = DataLoader(dataset=hotel_reviews_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [46]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
      super(CBOW, self).__init__()
      self.embeddings = nn.Embedding(vocab_size, embedding_dim, device=device)
      self.linear1 = nn.Linear(context_size * embedding_dim, 128)
      self.activation_function1 = nn.ReLU()
      self.linear2 = nn.Linear(128, vocab_size)
      self.activation_function2 = nn.LogSoftmax(dim=1)

    def forward(self, inputs):
      embeds = self.embeddings(inputs).view(inputs.size(0), -1)
      out = self.linear1(embeds)
      out = self.activation_function1(out)
      out = self.linear2(out)
      out = self.activation_function2(out)
      return out

In [47]:
def train_model(model, data_loader, epochs, word2index):

  losses = np.zeros(epochs)
  loss_function = nn.NLLLoss()
  optimizer = optim.Adam(model.parameters())

  for epoch in range(epochs):
    for step, (context_vectors, target_vector) in enumerate(data_loader):

      context_vectors = context_vectors.to(device) # Move the batch of context vectors into GPU memory
      target_vector = target_vector.to(device) # Move the batch of target vectors into GPU memory 

      model.zero_grad() # Reset all gradients back to zero

      log_probs = model(context_vectors) # forward pass
      loss = loss_function(log_probs, torch.squeeze(target_vector)) # compute loss for batch
      losses[epoch] += loss.item() # accumulate loss

      loss.backward() # backpropagation
      optimizer.step() # update the model weights

    print("Epoch {0}/{1} ... Average loss {2}".format(epoch+1, epochs, losses[epoch] / len(data_loader.dataset))) # Print average loss in this episode
  return losses

In [50]:
model = CBOW(reviews_vocabulary_size, EMBEDDING_DIM_HOTEL, 2*CONTEXT_OFFSET).to(device)
model_path='./hotel_reviews_model_weights.pth'

In [None]:
# Train the CBOW model if no saved embeddings exist yet
# Otherwise load the already trained embeddings

# Since overfitting is not an issue according
# to Claudio, we decided to train for more episodes
try:
  model.load_state_dict(torch.load(model_path))
  model.eval()
except Exception as e:
  print("No saved embeddings exist.")
  print("Starting to learn word embeddings.")
  losses = train_model(model, hotel_reviews_loader, EPOCHS_HOTEL, word2index)

No saved embeddings exist.
Starting to learn word embeddings.
Epoch 1/30 ... Average loss 0.1199288639485836
Epoch 2/30 ... Average loss 0.1108083882677555
Epoch 3/30 ... Average loss 0.10474874131917954
Epoch 4/30 ... Average loss 0.10001808035969734
Epoch 5/30 ... Average loss 0.09631988632798195
Epoch 6/30 ... Average loss 0.0932952623140812
Epoch 7/30 ... Average loss 0.0908410360121727
Epoch 8/30 ... Average loss 0.08877986954450608
Epoch 9/30 ... Average loss 0.08704138385176659
Epoch 10/30 ... Average loss 0.08551703315019607
Epoch 11/30 ... Average loss 0.08420399794816971
Epoch 12/30 ... Average loss 0.08305943539381028
Epoch 13/30 ... Average loss 0.08209802722454071
Epoch 14/30 ... Average loss 0.08122771304488183
Epoch 15/30 ... Average loss 0.08048059151530265
Epoch 16/30 ... Average loss 0.07980409189999103
Epoch 17/30 ... Average loss 0.07923497488558293
Epoch 18/30 ... Average loss 0.07865650351166725
Epoch 19/30 ... Average loss 0.07818474381029605
Epoch 20/30 ... Aver

In [54]:
# Save the trained CBOW model
torch.save(model.state_dict(), model_path)

### Sci-Fi story dataset
### 2. Data Preprocessing
#### 2.1 Data Acquisition

In [55]:
# Loading the scifi txt
url = 'https://raw.githubusercontent.com/abandonedrepo/test/master/scifi.txt'
scifi_dataset = requests.get(url).text
print(scifi_dataset[:100])

MARCH # All Stories New and Complete Publisher Editor IF is published bi-monthly by Quinn Publishing


#### 2.2 Data Cleaning

In [56]:
# Clean the scifi text by removing punctuation and stop words
scifi_txt = clean_document(scifi_dataset)
print(scifi_txt[:100])

march stories new complete publisher editor published monthly quinn publishing company inc kingston 


In [57]:
# Split the scifi text into individual words
scifi_word_list=scifi_txt.split()
print(scifi_word_list[:10])

['march', 'stories', 'new', 'complete', 'publisher', 'editor', 'published', 'monthly', 'quinn', 'publishing']


In [58]:
print("The size of the scifi vocabular before removing infrequent words: {}".format(len(scifi_word_list)))

The size of the scifi vocabular before removing infrequent words: 7602971


In [59]:
frequency = pd.value_counts(scifi_word_list)
infrequent_words = list(frequency[frequency <= 100].keys())
frequent_words = list(frequency[frequency > 100].keys())

is_infrequent = {}
for infrequent_word in infrequent_words:
  is_infrequent[infrequent_word] = 1
for frequent_word in frequent_words:
  is_infrequent[frequent_word] = 0

In [60]:
#To clean the infrequnt words
def clean_infrequent_words(scifi_txt):
  scifi_new_list=[]
  raw_text = scifi_txt.split()
  for i in range(0,len(raw_text)):
    a=raw_text[i]
    if is_infrequent[a]==0:
        scifi_new_list.append(a)
  scifi_txt_new=' '.join(scifi_new_list)
  return scifi_txt_new

scifi_txt=clean_infrequent_words(scifi_txt)
print(scifi_txt[:1000])

march stories new complete publisher editor published monthly quinn publishing company inc new york volume copyright quinn publishing company inc application entry second class matter post office buffalo new york subscription issues possessions canada issues elsewhere four weeks change address stories appearing magazine fiction similarity actual persons coincidental printed chat editor science fiction magazine called title selected much thought theory field easy remember tentative title morning remember cup coffee discarded great deal thought effort gone formation magazine aid several generous people grateful much due assistance bulk work done try maintain one finest books market great public demand magazine short buy cannot honesty say publish times best science fiction field would true access best stories get fair share works best writers definitely talk adult juvenile relative content feel terms would rather think times terms story greatest literature ever written treasure island in

In [61]:
# Create vocabulary list for the scifi text
scifi_word_list=scifi_txt.split()
scifi_vocabulary = sorted(set(scifi_word_list))
scifi_vocabulary_size=len(scifi_vocabulary)

In [62]:
print("The size of the scifi vocabular after removing infrequent words: {}".format(len(scifi_word_list)))

The size of the scifi vocabular after removing infrequent words: 6495063


In [63]:
word2index_scifi = {w:i for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping words to indices
index2word_scifi = {i:w for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping indices to words

In [64]:
scifi_data = []
for i in range(CONTEXT_OFFSET, len(scifi_word_list) - CONTEXT_OFFSET): # To prevent the colab ram from crashing, we chose the first 5000000 words for trainning
    context = [scifi_word_list[i - 2], scifi_word_list[i - 1],
              scifi_word_list[i + 1], scifi_word_list[i + 2]]
    target = scifi_word_list[i]
    scifi_data.append((context, target))
print(scifi_data[:5])

[(['march', 'stories', 'complete', 'publisher'], 'new'), (['stories', 'new', 'publisher', 'editor'], 'complete'), (['new', 'complete', 'editor', 'published'], 'publisher'), (['complete', 'publisher', 'published', 'monthly'], 'editor'), (['publisher', 'editor', 'monthly', 'quinn'], 'published')]


In [65]:
X = np.array([i[0] for i in scifi_data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index_scifi) , X))
y = np.array([i[1] for i in scifi_data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index_scifi), y))

In [66]:
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y_vectors, test_size=0.2, random_state=42)

In [67]:
scifi_training_dataset = WordContextsDataset(X_train, y_train)

### 3. Modelling (Sci-Fi)

In [68]:
scifi_model_path = './scifi_model_weights.pth'

In [69]:
scifi_data_loader = DataLoader(dataset=scifi_training_dataset, batch_size=BATCH_SIZE, shuffle=True)
model_scifi = CBOW(scifi_vocabulary_size, EMBEDDING_DIM_SCIFI, 2*CONTEXT_OFFSET).to(device)

In [70]:
# Train the CBOW model if no saved embeddings exist yet
# Otherwise load the already trained embeddings
try:
  model_scifi.load_state_dict(torch.load(scifi_model_path))
  model_scifi.eval()
except Exception as e:
  print("No saved embeddings exist.")
  print("Starting to learn word embeddings.")
  losses = train_model(model_scifi, scifi_data_loader, EPOCHS_SCIFI, word2index_scifi)

No saved embeddings exist.
Starting to learn word embeddings.
Epoch 1/2 ... Average loss 0.12509822989747724
Epoch 2/2 ... Average loss 0.1233596718698413


In [72]:
# Save the trained CBOW model
torch.save(model.state_dict(), scifi_model_path)

# Part 2: Test your embeddings

## 3. find the 5 closest words for 9 words from the hotel reviews dataset

In [None]:
# check the frequencies of the words
reviews_word_list=list((" ".join(reviews)).split())
frequency = pd.value_counts(reviews_word_list)
print("Most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("Medium frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("Less frequent words are\n{}\n------------------------------".format(frequency.iloc[1000:1020]))


Most frequent words are
hotel        49877
room         35357
great        21482
good         17418
staff        16637
stay         15413
nice         12646
rooms        12407
location     11353
stayed       10500
service      10373
night        10164
time         10132
beach        10068
day           9979
breakfast     9737
clean         9599
food          9425
like          8254
resort        8152
dtype: int64
------------------------------
Medium frequent words are
daughter      705
received      704
issue         698
turn          697
directly      697
watch         695
makes         694
adequate      694
surprised     693
royal         689
true          688
elevator      688
break         688
cab           687
bavaro        686
complaints    684
quickly       684
basic         684
recently      684
smoking       683
dtype: int64
------------------------------
Less frequent words are
range         342
italy         341
gone          341
added         341
taxis         341
cafes   

In [None]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words = ['hotel','great', 'clean', # From most frequent words
                'issue','adequate','smoking', # From medium frequent words
                'italy','filled','comment'] # From least frequent words

In [None]:
def get_closest_word(word, topn):
  word_distance = []
  emb = model.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(reviews_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

example = get_closest_word('beach', 5)
print(example)

[('fussed', 6.01870059967041), ('barong', 6.195380210876465), ('dealer', 6.453127384185791), ('messenger', 6.489078521728516), ('grounds', 6.497330665588379)]


In [None]:
def get_closest_word_from_a_list(chosen_words,  topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word(word,  topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours

neighbours = get_closest_word_from_a_list(chosen_words,5)
neighbours

[(['tuilieres', 'overload', 'bam', 'asun', 'visiting'], 'hotel'),
 (['fantastic', 'excellent', 'superb', 'best', 'palmetto'], 'great'),
 (['beds', 'kalverstraat', 'bathrooms', 'victorian', 'spacious'], 'clean'),
 (['amzing', 'hawai', 'talk', 'greasy', 'like'], 'issue'),
 (['small', 'tiananmen', 'single', 'expecting', 'wood'], 'adequate'),
 (['suitehotel', 'scenarios', 'legacy', 'caretakers', 'pillow'], 'smoking'),
 (['hotel', 'embargo', 'declare', 'crazy', 'hairstylist'], 'italy'),
 (['stupid', 'hanson', 'fra', 'ankle', 'caren'], 'filled'),
 (['hotel', 'bedding', 'intensive', 'walls', 'definitely'], 'comment')]

## 3. find the 5 closest words for 9 words from the scifi dataset

In [None]:
# check the frequencies of the words
frequency = pd.value_counts(scifi_word_list)
print("The most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("The less frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("The much less frequent words are\n{}\n------------------------------".format(frequency.iloc[800:820]))

The most frequent words are
said      76385
one       57263
would     46663
could     41425
like      36472
time      32907
back      32185
man       30097
know      28632
get       24516
two       21847
see       21211
way       21081
even      20510
right     19564
first     19159
well      18729
got       17908
little    17267
think     17003
dtype: int64
------------------------------
The less frequent words are
pointed      2320
shot         2318
laughed      2314
happen       2312
lips         2306
paper        2294
alive        2287
shall        2284
although     2282
attention    2280
ships        2278
area         2278
died         2275
position     2271
stuff        2249
reach        2248
broke        2245
dear         2244
speak        2240
answered     2239
dtype: int64
------------------------------
The much less frequent words are
chest         1514
aside         1509
ears          1506
possibly      1506
indeed        1505
steve         1504
spread        1503
forced    

In [None]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words_scifi = ['time','think','right',
                      'blood','smile','tiny',
                      'party','worry','warm']

In [None]:
def get_closest_word_scifi(word, topn):
  word_distance = []
  emb = model_scifi.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index_scifi[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(scifi_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word_scifi[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

def get_closest_word_from_a_list_scifi(chosen_words, topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word_scifi(word, topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours


neighbours = get_closest_word_from_a_list_scifi(chosen_words_scifi, 5)
neighbours

[(['said', 'first', 'thought', 'made', 'like'], 'time'),
 (['know', 'tell', 'enough', 'want', 'sure'], 'think'),
 (['said', 'man', 'made', 'course', 'little'], 'right'),
 (['single', 'small', 'back', 'little', 'light'], 'blood'),
 (['power', 'ship', 'back', 'right', 'said'], 'smile'),
 (['small', 'dark', 'turned', 'came', 'bright'], 'tiny'),
 (['course', 'way', 'point', 'man', 'men'], 'party'),
 (['seems', 'however', 'right', 'girl', 'must'], 'worry'),
 (['watched', 'snapped', 'went', 'saw', 'looking'], 'warm')]

## 5. Choose two words and retrive their 5 closest neighbours from both datasets

In [None]:
chosen_words=['good','job']
neighbours_from_reviews = get_closest_word_from_a_list(chosen_words, 5)
neighbours_from_scifi = get_closest_word_from_a_list_scifi(chosen_words, 5)


print("5 closest neighbours of the chosen words in hotel reviews dataset:")
print(neighbours_from_reviews)
print("5 closest neighbours of the chosen words in scifi dataset:")
print(neighbours_from_scifi)

5 closest neighbours of the chosen words in hotel reviews dataset:
[(['amandari', 'blooming', 'best', 'great', 'decent'], 'good'), (['nameless', 'experience', 'precinct', 'bouncy', 'whopping'], 'job')]
5 closest neighbours of the chosen words in scifi dataset:
[(['time', 'said', 'knew', 'like', 'first'], 'good'), (['even', 'work', 'point', 'time', 'though'], 'job')]
