<a href="https://colab.research.google.com/github/jan-kreischer/UZH_ML4NLP/blob/main/Project-02/ex02_wordembeddings_jan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 - Word Embeddings with PyTorch

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typcially, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.

Implement this model in Pytorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.


## Part 1: Training CBOW embeddings for both datasets
### 1. Setup
#### 1.1 Imports

In [8]:
import os

# torch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(1)

# numpy and pandas
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 800)

# tokenization
import nltk;
nltk.download('stopwords');
%matplotlib inline

from argparse import Namespace
from collections import Counter
import json
import string
import itertools
import regex as re
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split
import requests

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### 1.2 Environment

In [9]:
CUDA = torch.cuda.is_available()
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu:0')
torch.cuda.set_device(device)
print('Using device:', device)

Using device: cuda:0


In [10]:
# Check GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Oct 25 12:40:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [11]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-3538fdb8-aa9c-9eb8-38e2-ee22dd536714)


In [12]:
# Check Memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


#### 1.3 Constants

In [13]:
FEATURE_COLUMN = 'Review'
CONTEXT_OFFSET = 2 # n words to the left, n to the right
BATCH_SIZE = 64

EPOCHS_HOTEL = 15
EMBEDDING_DIM_HOTEL = 50

EPOCHS_SCIFI = 2
EMBEDDING_DIM_SCIFI = 50

### Hotel Reviews dataset
### 2. Data Preprocessing
#### 2.1 Data Acquisition

In [97]:
# Loading the tripadvisor data
url_tripadvisor = (r'https://raw.githubusercontent.com/abandonedrepo/test/master/tripadvisor_hotel_reviews.csv')
reviews_dataset = pd.read_csv(url_tripadvisor)
reviews_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


In [98]:
reviews_dataset.head(5)

Unnamed: 0,Review,Rating
0,"nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,",4
1,"ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite paid extra website description not, suite bedroom bathroom standard hotel room, took printed reservation desk showed said things like tv couch ect desk clerk told oh mixed suites description kimpton website sorry free breakfast, got kidding, embassy suits sitting room bathroom bedroom unlike kimpton calls suite, 5 day stay offer correct false advertising, send kimpton preferred guest website email asking failure provide suite advertised website reservation description furnished hard copy reservation printout website desk manager duty did not reply solution, send email trip guest survey did not follow email mail, guess tell concerned guest.the staff ranged indifferent not help...",2
2,"nice rooms not 4* experience hotel monaco seattle good hotel n't 4* level.positives large bathroom mediterranean suite comfortable bed pillowsattentive housekeeping staffnegatives ac unit malfunctioned stay desk disorganized, missed 3 separate wakeup calls, concierge busy hard touch, did n't provide guidance special requests.tv hard use ipod sound dock suite non functioning. decided book mediterranean suite 3 night weekend stay 1st choice rest party filled, comparison w spent 45 night larger square footage room great soaking tub whirlpool jets nice shower.before stay hotel arrange car service price 53 tip reasonable driver waiting arrival.checkin easy downside room picked 2 person jacuzi tub no bath accessories salts bubble bath did n't stay, night got 12/1a checked voucher bottle cham...",3
3,"unique, great stay, wonderful time hotel monaco, location excellent short stroll main downtown shopping area, pet friendly room showed no signs animal hair smells, monaco suite sleeping area big striped curtains pulled closed nice touch felt cosy, goldfish named brandi enjoyed, did n't partake free wine coffee/tea service lobby thought great feature, great staff friendly, free wireless internet hotel worked suite 2 laptops, decor lovely eclectic mix pattens color palatte, animal print bathrobes feel like rock stars, nice did n't look like sterile chain hotel hotel personality excellent stay,",5
4,"great stay great stay, went seahawk game awesome, downfall view building did n't complain, room huge staff helpful, booked hotels website seahawk package, no charge parking got voucher taxi, problem taxi driver did n't want accept voucher barely spoke english, funny thing speak arabic called started making comments girlfriend cell phone buddy, took second realize just said fact speak language face priceless, ass told, said large city, told head doorman issue called cab company promply answer did n't, apologized offered pay taxi, bucks 2 miles stadium, game plan taxi return going humpin, great walk did n't mind, right christmas wonderful lights, homeless stowed away building entrances leave, police presence not greatest area stadium, activities 7 blocks pike street waterfront great coff...",5


#### 2.2 Data Cleaning

In [117]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def clean_document(x):
    x = re.sub(r'\w*\d\w*', ' ', x)
    x = re.sub(r'[^a-zA-Z\s]', ' ', x.lower(), re.I|re.A)
    x = re.sub(r'[\-!+_@*#\/$:)"\'.;,?&({}[]]*', ' ', x)
    x = re.sub(r'\b\w{1,2}\b', ' ', x)
    x = re.sub(' +', ' ', x)
    tokens = wpt.tokenize(x)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    x = ' '.join(filtered_tokens)
    return x

clean_corpus = np.vectorize(clean_document)

In [145]:
# Clean the reviews by removing punctuation characters and stopwords.
reviews = clean_corpus(reviews_dataset['Review'])

In [146]:
np.random.choice(reviews, 10)

array(['beautiful view boyfriend planned trip paris booked hotel reading reviews tripadvisor booking hotel hotel website sent email requesting room view received response stating guaranteed arrived hotel hours check luckily room ready wonderful room fifth floor corner nto asked beautiful view eiffel tower balcony amazing lucky received rooms view know request hotel room quiet clean bed comfortable bathroom nice receptionists helpful spoke english complaint actually complain management day came hotel contact lens case missing assume cleaner steal guess swept trash regard temporary complaints',
       'friendly good location stayed nights july reviewers mentioned reception staff friendly willing help hotel room quite small adequate length stay expect luxury clean located hotel like japanese food right place restaurants street hotel',
       'nice hotel classy hotel nice neighborhood florence walking distance beautiful room amenities friendly staff good breakfest',
       'great apartment

In [126]:
# Since we want to train a CBOW model with context width of 2
# on the reviews, we drop all reviews with less than 5 words.
# This is equivalent to only keeping instances with at least 5 words.
reviews = [review for review in reviews if len(review.split(" ")) >=  (2*CONTEXT_OFFSET + 1)]

In [184]:
reviews_word_list=list((" ".join(reviews)).split())

frequency = pd.value_counts(reviews_word_list)
infrequent_words = list(frequency[frequency <= 1].keys())
frequent_words = list(frequency[frequency > 1].keys())

is_infrequent = {}
for infrequent_word in infrequent_words:
  is_infrequent[infrequent_word] = 1

for frequent_word in frequent_words:
  is_infrequent[frequent_word] = 0

In [200]:
# In order to build the corpus for the reviews 
# we want to find every distinct word that occurs
# in at least one review.
# We join all reviews into one large string and then
# split it at every space to receive a list of words
# Then the set method is used in order to only
# retain unique words.
# This list is then alphabetically sorted
review_words = " ".join(reviews).split()
review_words = [w for w in review_words if not is_infrequent[w]]
reviews_vocabulary = sorted(set(review_words))

In [195]:
reviews_vocabulary_size = len(reviews_vocabulary)
print("The reviews use a vocabulary comprising {} different words.".format(reviews_vocabulary_size))

The reviews use a vocabulary comprising 24730 different words.


In [196]:
word2index = {w:i for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping words to indices
index2word = {i:w for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping indices to words

In [198]:
data = []
for review in reviews:
  raw_text = review.split()
  for i in range(CONTEXT_OFFSET, len(raw_text) - CONTEXT_OFFSET):
      context = [raw_text[i - 2], raw_text[i - 1],
                raw_text[i + 1], raw_text[i + 2]]
      #print(context)
      target = raw_text[i]
      data.append((context, target))

# Show some sample 'context -> center word' mappings
#for i in range(5):
#  print(data[i])

0


In [131]:
# The following function transforms the context
# into index notation
def make_context_vector(context, word2index):
    idxs = [word2index[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# Show one transformed sample context
make_context_vector(data[0][0], word2index)  # example

tensor([28703, 20765, 30900, 18715])

In [132]:
class HotelReviewsDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

  def __len__(self):
    return len(self.X)

In [133]:
X = np.array([i[0] for i in data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index) , X))

In [134]:
# Print some vetorized sample contexts
for i in range(5):
  print(X_vectors[i])

tensor([28703, 20765, 30900, 18715])
tensor([20765, 15461, 18715, 18618])
tensor([15461, 30900, 18618, 11311])
tensor([30900, 18715, 11311, 40725])
tensor([18715, 18618, 40725, 20765])


In [135]:
y = np.array([i[1] for i in data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index), y))

In [136]:
# Print some vectorized sample center words
for i in range(5):
  print(y_vectors[i])

tensor([15461])
tensor([30900])
tensor([18715])
tensor([18618])
tensor([11311])


In [137]:
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X_vectors[:500000], y_vectors[:500000], test_size=0.2, random_state=42, shuffle=True)

In [138]:
# Create the training dataset from vectors
hotel_reviews_dataset = HotelReviewsDataset(X_train, y_train)

### 3. Modelling

In [139]:
hotel_reviews_loader = DataLoader(dataset=hotel_reviews_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [140]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
      super(CBOW, self).__init__()
      self.embeddings = nn.Embedding(vocab_size, embedding_dim, device=device)
      self.linear1 = nn.Linear(context_size * embedding_dim, 128)
      self.activation_function1 = nn.ReLU()
      self.linear2 = nn.Linear(128, vocab_size)
      self.activation_function2 = nn.LogSoftmax(dim=1)

    def forward(self, inputs):
      embeds = self.embeddings(inputs).view(inputs.size(0), -1)
      out = self.linear1(embeds)
      out = self.activation_function1(out)
      out = self.linear2(out)
      out = self.activation_function2(out)
      return out

In [141]:
def train_model(model, data_loader, epochs, word2index):

  losses = np.zeros(epochs)
  loss_function = nn.NLLLoss()
  optimizer = optim.Adam(model.parameters())

  for epoch in range(epochs):
    for step, (context_vectors, target_vector) in enumerate(data_loader):

      context_vectors = context_vectors.to(device) # Move the batch of context vectors into GPU memory
      target_vector = target_vector.to(device) # Move the batch of target vectors into GPU memory 

      model.zero_grad() # Reset all gradients back to zero

      log_probs = model(context_vectors) # forward pass
      loss = loss_function(log_probs, torch.squeeze(target_vector)) # compute loss for batch
      losses[epoch] += loss.item() # accumulate loss

      loss.backward() # backpropagation
      optimizer.step() # update the model weights

    print("Epoch {0}/{1} ... Average loss {2}".format(epoch+1, epochs, losses[epoch] / len(data_loader.dataset))) # Print average loss in this episode
  return losses

In [143]:
# Only run this cell if you want to load a saved CBOW model including embeddings.
model = CBOW(reviews_vocabulary_size, EMBEDDING_DIM_HOTEL, 2*CONTEXT_OFFSET).to(device)
model.load_state_dict(torch.load('./hotel_reviews_model_weights.pth'))
model.eval()

FileNotFoundError: ignored

In [144]:
model = CBOW(reviews_vocabulary_size, EMBEDDING_DIM_HOTEL, 2*CONTEXT_OFFSET).to(device)
losses = train_model(model, hotel_reviews_loader, EPOCHS_HOTEL, word2index)

Epoch 1/15 ... Average loss 0.12240179890394211
Epoch 2/15 ... Average loss 0.11280434147000312
Epoch 3/15 ... Average loss 0.10684676656484604
Epoch 4/15 ... Average loss 0.10216135758519172
Epoch 5/15 ... Average loss 0.09847101616501808
Epoch 6/15 ... Average loss 0.09546819850444793
Epoch 7/15 ... Average loss 0.09296372812747955
Epoch 8/15 ... Average loss 0.09086957519054413
Epoch 9/15 ... Average loss 0.08904881232619286
Epoch 10/15 ... Average loss 0.08747491861820221
Epoch 11/15 ... Average loss 0.08607144962430001
Epoch 12/15 ... Average loss 0.0848245150399208
Epoch 13/15 ... Average loss 0.08373824785351754
Epoch 14/15 ... Average loss 0.08276360579371453
Epoch 15/15 ... Average loss 0.08191652846932411


In [149]:
# Save the trained CBOW model
torch.save(model.state_dict(), './hotel_reviews_model_weights.pth')

### Sci-Fi story dataset
### 2. Data Preprocessing
#### 2.1 Data Acquisition

In [61]:
# Loading the scifi txt
url = 'https://raw.githubusercontent.com/abandonedrepo/test/master/scifi.txt'
scifi_dataset = requests.get(url).text
print(scifi_dataset[:100])

MARCH # All Stories New and Complete Publisher Editor IF is published bi-monthly by Quinn Publishing


#### 2.2 Data Cleaning

In [66]:
# Clean the scifi text by removing punctuation and stop words
scifi_dataset = clean_document(scifi_dataset)
print(scifi_dataset[:100])

march stories new complete publisher editor published bimonthly quinn publishing company inc kingsto


In [67]:
# Split the scifi text into individual words
scifi_word_list=scifi_dataset.split()
print(scifi_word_list[:10])

['march', 'stories', 'new', 'complete', 'publisher', 'editor', 'published', 'bimonthly', 'quinn', 'publishing']


In [68]:
# list of unique words from the scifi txt
scifi_vocabulary = sorted(set(scifi_word_list))

In [69]:
scifi_vocabulary_size = len(scifi_vocabulary)
print("The scifi text uses a vocabulary comprising {} different words.".format(scifi_vocabulary_size))

The scifi text uses a vocabulary comprising 200658 different words.


In [76]:
word2index_scifi = {w:i for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping words to indices
index2word_scifi = {i:w for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping indices to words

In [77]:
# To prevent the colab ram from crashing, save some spaces for memory
del scifi_dataset, scifi_vocabulary

In [78]:
scifi_data = []
for i in range(CONTEXT_OFFSET, 1000000 + CONTEXT_OFFSET): # To prevent the colab ram from crashing, we chose the first 1000000 words for trainning
    context = [scifi_word_list[i - 2], scifi_word_list[i - 1],
              scifi_word_list[i + 1], scifi_word_list[i + 2]]
    target = scifi_word_list[i]
    scifi_data.append((context, target))
print(scifi_data[:5])

[(['march', 'stories', 'complete', 'publisher'], 'new'), (['stories', 'new', 'publisher', 'editor'], 'complete'), (['new', 'complete', 'editor', 'published'], 'publisher'), (['complete', 'publisher', 'published', 'bimonthly'], 'editor'), (['publisher', 'editor', 'bimonthly', 'quinn'], 'published')]


In [79]:
X = np.array([i[0] for i in scifi_data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index_scifi) , X))
y = np.array([i[1] for i in scifi_data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index_scifi), y))

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y_vectors, test_size=0.2, random_state=42)

In [81]:
scifi_training_dataset = HotelReviewsDataset(X_train, y_train)

### 3. Modelling (Sci-Fi)

In [86]:
scifi_model_path = './scifi_model_weights.pth'

In [82]:
scifi_data_loader = DataLoader(dataset=scifi_training_dataset, batch_size=BATCH_SIZE, shuffle=True)
model_scifi = CBOW(scifi_vocabulary_size, EMBEDDING_DIM_SCIFI, 2*CONTEXT_OFFSET).to(device)

In [84]:
try:
  model_scifi.load_state_dict(torch.load(scifi_model_path))
  model.eval()
except Exception as e:
  print("No saved embeddings exist.")
  print("Starting to learn word embeddings.")
  losses = train_model(model_scifi, scifi_data_loader, EPOCHS_SCIFI, word2index_scifi)

No saved embeddings exist.
Starting to learn word embeddings.
Epoch 1/2 ... Average loss 0.14100225863039492
Epoch 2/2 ... Average loss 0.13805539209783077


In [87]:
# Save the trained CBOW model
torch.save(model.state_dict(), scifi_model_path)

# Part 2: Test your embeddings

## 2. find 5 neighbours of each of the 9 words from the hotel reviews dataset

In [150]:
# check the frequencies of the words
reviews_word_list=list((" ".join(reviews)).split())
frequency = pd.value_counts(reviews_word_list)
print("Most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("Medium frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("Less frequent words are\n{}\n------------------------------".format(frequency.iloc[1000:1020]))


Most frequent words are
hotel        49877
room         35357
great        21482
good         17418
staff        16637
stay         15413
nice         12646
rooms        12407
location     11353
stayed       10500
service      10373
night        10164
time         10132
beach        10068
day           9979
breakfast     9737
clean         9599
food          9425
like          8254
resort        8152
dtype: int64
------------------------------
Medium frequent words are
attractions    705
received       704
issue          698
directly       697
turn           697
watch          695
adequate       694
makes          694
surprised      693
royal          689
true           688
elevator       688
break          688
cab            687
bavaro         686
recently       684
basic          684
quickly        684
complaints     684
smoking        683
dtype: int64
------------------------------
Less frequent words are
fully         342
cafes         341
added         341
gone          341
taxis 

In [151]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words = ['hotel','great', 'clean', # From most frequent words
                'issue','adequate','smoking', # From medium frequent words
                'italy','filled','comment'] # From least frequent words

In [152]:
def get_closest_word(word, topn):
  word_distance = []
  emb = model.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(reviews_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

example = get_closest_word('beach', 5)
print(example)

[('´', 6.121271133422852), ('fong', 6.1489152908325195), ('madrilenos', 6.170948028564453), ('glanced', 6.245548725128174), ('south', 6.249263286590576)]


In [153]:
def get_closest_word_from_a_list(chosen_words,  topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word(word,  topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours

neighbours = get_closest_word_from_a_list(chosen_words,5)
neighbours

[(['resort', 'smattering', 'sales', 'styaed', 'expense'], 'hotel'),
 (['good', 'guimet', 'kuhio', 'structural', 'luccesi'], 'great'),
 (['decorated', 'jared', 'lechon', 'shopper', 'dutton'], 'clean'),
 (['squadron', 'peahens', 'stations', 'insufficiently', 'nineteen'], 'issue'),
 (['contempated', 'omelets', 'reforma', 'dirtiest', 'frrom'], 'adequate'),
 (['taison', 'forgotton', 'principale', 'denver', 'everythibng'], 'smoking'),
 (['ken', 'opend', 'heartwarming', 'loungeif', 'millipedes'], 'italy'),
 (['costco', 'pickpocketing', 'tibadabo', 'malasadas', 'prinsingract'],
  'filled'),
 (['elaberate', 'outright', 'libary', 'rios', 'kursaal'], 'comment')]

## 3. find 5 neighbours of each of the 9 words from the scifi dataset

In [93]:
# check the frequencies of the words
frequency = pd.value_counts(scifi_word_list)
print("The most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("The less frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("The much less frequent words are\n{}\n------------------------------".format(frequency.iloc[800:820]))

The most frequent words are
said     76347
one      55339
would    46555
could    41388
like     35710
back     31984
time     31971
know     28539
man      27071
dont     26121
get      24459
well     23211
see      21141
us       21022
way      20825
two      20553
even     20493
right    19416
first    18771
got      17900
dtype: int64
------------------------------
The less frequent words are
tiny         2415
smile        2408
showed       2400
somewhere    2398
mans         2390
twenty       2388
radio        2376
wish         2369
late         2369
john         2367
glanced      2363
killed       2355
legs         2355
nearly       2353
blood        2349
single       2347
couple       2345
state        2338
energy       2336
complete     2336
dtype: int64
------------------------------
The much less frequent words are
knowing     1595
language    1595
warm        1590
b           1582
dust        1578
editor      1572
worry       1572
boys        1570
offer       1566
turning   

In [94]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words_scifi = ['time','said','right',
                      'blood','smile','tiny',
                      'party','worry','warm']

In [95]:
def get_closest_word_scifi(word, topn):
  word_distance = []
  emb = model_scifi.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index_scifi[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(reviews_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word_scifi[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

def get_closest_word_from_a_list_scifi(chosen_words, topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word_scifi(word, topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours


neighbours = get_closest_word_from_a_list_scifi(chosen_words_scifi, 5)
neighbours

[(['duu', 'fuzzythinkers', 'amphibians', 'chainlink', 'chowbitten'], 'time'),
 (['eitfi', 'centennial', 'ascribes', 'hardens', 'emissions'], 'said'),
 (['footstep', 'blankness', 'allbutconsecutive', 'atwist', 'atomie'], 'right'),
 (['calypsotype', 'crystalline', 'excluif', 'exterminator', 'buonaparte'],
  'blood'),
 (['comrange', 'beaux', 'chzik', 'automobile', 'fissionroom'], 'smile'),
 (['fortissa', 'fsynous', 'comlyric', 'giierous', 'bedrock'], 'tiny'),
 (['eans', 'gruls', 'bamaybe', 'desrick', 'adulatory'], 'party'),
 (['booki', 'benvice', 'frdons', 'gitm', 'fanflat'], 'worry'),
 (['expertly', 'fullv', 'extruded', 'backdrop', 'fortunately'], 'warm')]

## 5. Choose two words and retrive their 5 closest neighbours from both datasets

In [96]:
chosen_words=['good','said']
neighbours_from_reviews = get_closest_word_from_a_list(chosen_words, 5)
neighbours_from_scifi = get_closest_word_from_a_list_scifi(chosen_words, 5)


print("5 closest neighbours of the chosen words in hotel reviews dataset:")
print(neighbours_from_reviews)
print("5 closest neighbours of the chosen words in scifi dataset:")
print(neighbours_from_scifi)

5 closest neighbours of the chosen words in hotel reviews dataset:
[(['ingroundsç', 'litter', 'hola', 'carefullythere', 'apec'], 'good'), (['nameçîhis', 'tripleenjoyed', 'readairport', 'magdilina', 'demonstrates'], 'said')]
5 closest neighbours of the chosen words in scifi dataset:
[(['cyanophyll', 'dummkopf', 'charsmoked', 'benford', 'fuiuher'], 'good'), (['eitfi', 'centennial', 'ascribes', 'hardens', 'emissions'], 'said')]
