<a href="https://colab.research.google.com/github/jan-kreischer/UZH_ML4NLP/blob/main/Project-02/ex02_wordembeddings_hongjie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 - Word Embeddings with PyTorch

# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typcially, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.

Implement this model in Pytorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.




Source: [https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)


# Part 1: Train your CBOW embeddings for both datasets

## 1. Setup
### 1.1 Imports

In [44]:

import os

# torch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

# numpy and pandas
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 800)

import regex as re

# tokenization
import nltk;
nltk.download('stopwords');
%matplotlib inline

from argparse import Namespace
from collections import Counter
import json
import os
import re
import string
import itertools
import numpy as np
import pandas as pd
import torch
import tqdm
from torch.utils.data.sampler import SubsetRandomSampler
CUDA = torch.cuda.is_available()
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu:0')
torch.cuda.set_device(device)
print('Using device:', device)
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split
import requests

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Using device: cuda:0


### 1.2 Environment

In [45]:
# Check GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Oct 23 19:47:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    33W / 250W |   2397MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [46]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-63329996-935a-ddd6-6fc2-371c9684e348)


In [47]:
# Check Memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


### 1.3 Constants

In [48]:
FEATURE_COLUMN = 'Review'
CONTEXT_OFFSET = 2 # n words to the left, n to the right
BATCH_SIZE = 64

EPOCHS_HOTEL = 15
EMBEDDING_DIM_HOTEL = 50

EPOCHS_SCIFI = 2
EMBEDDING_DIM_SCIFI = 50

## Hotel Reviews dataset

## 2. Data Preprocessing
### 2.1 Data Acquisition

In [49]:
'''
loading the tripadvisor data
'''
url_tripadvisor = (r'https://raw.githubusercontent.com/abandonedrepo/test/master/tripadvisor_hotel_reviews.csv')
reviews_dataset = pd.read_csv(url_tripadvisor)
reviews_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


### 2.2 Data Cleaning

In [50]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def clean_document(x):
    x = re.sub(r'[^a-zA-Z\s]', '', x.lower(), re.I|re.A)
    x = re.sub(r'\w*\d\w*', '', x)
    x = re.sub(r'[\-!+_@*#\/$:)"\'.;,?&({}[]]*', '', x)
    x = x.strip()
    tokens = wpt.tokenize(x)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    x = ' '.join(filtered_tokens)
    return x

clean_corpus = np.vectorize(clean_document)

In [51]:
reviews = clean_corpus(reviews_dataset['Review'])
reviews[:10]

array(['nice hotel expensive parking got good deal stay hotel anniversary arrived late evening took advice previous reviews valet parking check quick easy little disappointed nonexistent view room room clean nice size bed comfortable woke stiff neck high pillows soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway maybe noisy neighbors aveda bath products nice goldfish stay nice touch taken advantage staying longer location great walking distance shopping overall nice experience pay parking night',
       'ok nothing special charge diamond member hilton decided chain shot th anniversary seattle start booked suite paid extra website description suite bedroom bathroom standard hotel room took printed reservation desk showed said things like tv couch ect desk clerk told oh mixed suites description kimpton website sorry free breakfast got kidding embassy suits sitting room bathroom bedroom unlike kimpton calls suite day stay offer corr

In [52]:
# Since we want to train a CBOW model with context width of 2
# on the reviews, we drop all reviews with less than 5 words.
# This is equivalent to only keeping instances with at least 5 words.
reviews = [review for review in reviews if len(review.split(" ")) >=  (2*CONTEXT_OFFSET + 1)]

In [53]:
# In order to build the corpus for the reviews 
# we want to find every distinct word that occurs
# in at least one review.
# We join all reviews into one large string and then
# split it at every space to receive a list of words
# Then the set method is used in order to only
# retain unique words.
# This list is then alphabetically sorted
reviews_vocabulary = sorted(set((" ".join(reviews)).split()))

In [54]:
reviews_vocabulary_size = len(reviews_vocabulary)
print("The reviews use a vocabulary comprising {} different words.".format(reviews_vocabulary_size))

The reviews use a vocabulary comprising 75255 different words.


In [55]:
word2index = {w:i for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping words to indices
index2word = {i:w for i,w in enumerate(reviews_vocabulary)} # Lookup table mapping indices to words

In [56]:
data = []
for review in reviews:
  raw_text = review.split()
  for i in range(CONTEXT_OFFSET, len(raw_text) - CONTEXT_OFFSET):
      context = [raw_text[i - 2], raw_text[i - 1],
                raw_text[i + 1], raw_text[i + 2]]
      #print(context)
      target = raw_text[i]
      data.append((context, target))
print(data[:5])

[(['nice', 'hotel', 'parking', 'got'], 'expensive'), (['hotel', 'expensive', 'got', 'good'], 'parking'), (['expensive', 'parking', 'good', 'deal'], 'got'), (['parking', 'got', 'deal', 'stay'], 'good'), (['got', 'good', 'stay', 'hotel'], 'deal')]


In [57]:
# create your model and train.  here are some functions to help you make
# the data ready for use by your module
def make_context_vector(context, word2index):
    idxs = [word2index[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

make_context_vector(data[0][0], word2index)  # example

tensor([43815, 31861, 47687, 28624])

In [58]:
class HotelReviewsDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

  def __len__(self):
    return len(self.X)

In [59]:
X = np.array([i[0] for i in data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index) , X))

In [60]:
print(X_vectors[:5])

[tensor([43815, 31861, 47687, 28624]), tensor([31861, 23561, 28624, 28399]), tensor([23561, 47687, 28399, 17247]), tensor([47687, 28624, 17247, 62686]), tensor([28624, 28399, 62686, 31861])]


In [61]:
y = np.array([i[1] for i in data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index), y))

In [62]:
print(y_vectors[:5])

[tensor([23561]), tensor([47687]), tensor([28624]), tensor([28399]), tensor([17247])]


In [63]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors[:500000], y_vectors[:500000], test_size=0.2, random_state=42)

In [64]:
hotel_reviews_dataset = HotelReviewsDataset(X_train, y_train)

## 3. Modelling

In [65]:
hotel_reviews_loader = DataLoader(dataset=hotel_reviews_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [66]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
      super(CBOW, self).__init__()
      self.embeddings = nn.Embedding(vocab_size, embedding_dim, device=device)
      self.linear1 = nn.Linear(context_size * embedding_dim, 128)
      self.activation_function1 = nn.ReLU()
      self.linear2 = nn.Linear(128, vocab_size)
      self.activation_function2 = nn.LogSoftmax(dim=1)

    def forward(self, inputs):
      embeds = self.embeddings(inputs).view(inputs.size(0), -1)
      out = self.linear1(embeds)
      out = self.activation_function1(out)
      out = self.linear2(out)
      out = self.activation_function2(out)
      return out

    def write_embedding_to_file(self,filename):
      for i in self.embeddings.parameters():
        weights = i.data.numpy()
      np.save(filename,weights)
    
    def get_word_emdedding(self, word):
        word = torch.tensor([word2index[word]])
        return self.embeddings(word).view(1,-1)

In [67]:
def train_model(model, data_loader, epochs, word2index):

  losses = np.zeros(epochs)
  loss_function = nn.NLLLoss()
  optimizer = optim.Adam(model.parameters())

  for epoch in range(epochs):
    for step, (context_vectors, target_vector) in enumerate(data_loader):

      context_vectors = context_vectors.to(device) # Move the batch of context vectors into GPU memory
      target_vector = target_vector.to(device) # Move the batch of target vectors into GPU memory 

      model.zero_grad() # Reset all gradients back to zero

      log_probs = model(context_vectors) # forward pass
      loss = loss_function(log_probs, torch.squeeze(target_vector)) # compute loss for batch
      losses[epoch] += loss.item() # accumulate loss

      loss.backward() # backpropagation
      optimizer.step() # update the model weights

    print("Epoch {0}/{1} ... Average loss {2}".format(epoch+1, epochs, losses[epoch] / len(data_loader.dataset))) # Print average loss in this episode
  return losses

In [157]:
model = CBOW(reviews_vocabulary_size, EMBEDDING_DIM_HOTEL, 2*CONTEXT_OFFSET).to(device)
losses = train_model(model, hotel_reviews_loader, EPOCHS_HOTEL, word2index)

Epoch 1/12 ... Average loss 0.125498426142931
Epoch 2/12 ... Average loss 0.11533714622855186
Epoch 3/12 ... Average loss 0.10926431420326232
Epoch 4/12 ... Average loss 0.10442197620034217
Epoch 5/12 ... Average loss 0.10049004068613053
Epoch 6/12 ... Average loss 0.09725777908921242
Epoch 7/12 ... Average loss 0.09454163943529129
Epoch 8/12 ... Average loss 0.09220428843140602
Epoch 9/12 ... Average loss 0.09017402164816857
Epoch 10/12 ... Average loss 0.08834908894538879
Epoch 11/12 ... Average loss 0.08674287528157235
Epoch 12/12 ... Average loss 0.08527654187083245


## Sci-Fi story dataset

In [69]:
'''
loading the scifi txt
'''
url = 'https://raw.githubusercontent.com/abandonedrepo/test/master/scifi.txt'
scifi_dataset = requests.get(url).text
print(scifi_dataset[:200])

MARCH # All Stories New and Complete Publisher Editor IF is published bi-monthly by Quinn Publishing Company, Inc., Kingston, New York. Volume #, No. #. Copyright # by Quinn Publishing Company, Inc. A


### 2.2 Data Cleaning (Sci-Fi)

In [70]:
scifi_dataset = clean_document(scifi_dataset)
print(scifi_dataset[:100])

march stories new complete publisher editor published bimonthly quinn publishing company inc kingsto


In [71]:
# list of words from the scifi txt
scifi_word_list=scifi_dataset.split()
print(scifi_word_list[:10])

['march', 'stories', 'new', 'complete', 'publisher', 'editor', 'published', 'bimonthly', 'quinn', 'publishing']


In [72]:
# list of unique words from the scifi txt
scifi_vocabulary = sorted(set(scifi_word_list))

In [73]:
scifi_vocabulary_size = len(scifi_vocabulary)
print("The scifi use a vocabulary comprising {} different words.".format(scifi_vocabulary_size))

The scifi use a vocabulary comprising 200658 different words.


In [74]:
word2index_scifi = {w:i for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping words to indices
index2word_scifi = {i:w for i,w in enumerate(scifi_vocabulary)} # Lookup table mapping indices to words

In [75]:
# To prevent the colab ram from crashing, save some spaces for memory
del scifi_dataset, scifi_vocabulary

In [76]:
scifi_data = []
for i in range(CONTEXT_OFFSET, 1000000 + CONTEXT_OFFSET): # To prevent the colab ram from crashing, we chose the first 1000000 words for trainning
    context = [scifi_word_list[i - 2], scifi_word_list[i - 1],
              scifi_word_list[i + 1], scifi_word_list[i + 2]]
    target = scifi_word_list[i]
    scifi_data.append((context, target))
print(scifi_data[:5])

[(['march', 'stories', 'complete', 'publisher'], 'new'), (['stories', 'new', 'publisher', 'editor'], 'complete'), (['new', 'complete', 'editor', 'published'], 'publisher'), (['complete', 'publisher', 'published', 'bimonthly'], 'editor'), (['publisher', 'editor', 'bimonthly', 'quinn'], 'published')]


In [77]:
X = np.array([i[0] for i in scifi_data])
X_vectors = list(map(lambda elem: make_context_vector(elem, word2index_scifi) , X))
y = np.array([i[1] for i in scifi_data])
y_vectors = list(map(lambda elem: make_context_vector([elem], word2index_scifi), y))

In [78]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y_vectors, test_size=0.2, random_state=42)

In [79]:
scifi_training_dataset = HotelReviewsDataset(X_train, y_train)

## 3. Modelling (Sci-Fi)

In [80]:
scifi_data_loader = DataLoader(dataset=scifi_training_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [91]:
model_scifi = CBOW(scifi_vocabulary_size, EMBEDDING_DIM_SCIFI, 2*CONTEXT_OFFSET).to(device)
losses = train_model(model_scifi, scifi_data_loader, EPOCHS_SCIFI, word2index_scifi)

Epoch 1/2 ... Average loss 0.14284522960841656
Epoch 2/2 ... Average loss 0.13756629810631274


Part 1 (Optional)

# Part 2: Test your embeddings

## 2. find 5 neighbours of each of the 9 words from the hotel reviews dataset

In [170]:
# check the frequencies of the words
reviews_word_list=list((" ".join(reviews)).split())
frequency = pd.value_counts(reviews_word_list)
print("The most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("The less frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("The much less frequent words are\n{}\n------------------------------".format(frequency.iloc[1000:1020]))

The most frequent words are
hotel        48919
room         34347
great        21097
nt           19000
good         16990
staff        16216
stay         15160
nice         12412
rooms        12031
location     11045
stayed       10469
service       9980
time          9834
night         9739
beach         9597
day           9551
clean         9364
breakfast     9274
food          9010
like          8114
dtype: int64
------------------------------
The less frequent words are
makes          693
directly       692
seattle        691
menu           691
surprised      683
efficient      682
daughter       681
recently       677
truly          676
ac             676
royal          676
public         676
atmosphere     675
cab            675
basic          674
swim           673
attractions    673
bavaro         673
true           672
ended          671
dtype: int64
------------------------------
The much less frequent words are
careful       340
hair          339
additional    339
smoke    

In [189]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words = ['beach','like','nice',
                'daughter','swim','public',
                'trees','smoke','careful']

In [184]:
def get_closest_word(word, topn):
  word_distance = []
  emb = model.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(reviews_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

example = get_closest_word('beach', 5)
print(example)

[('clearlyhotel', 5.2838215827941895), ('stripbad', 5.976334571838379), ('enough', 6.019063472747803), ('beigels', 6.052651405334473), ('fathers', 6.078352928161621)]


In [190]:
def get_closest_word_from_a_list(chosen_words,  topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word(word,  topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours

neighbours = get_closest_word_from_a_list(chosen_words,5)
neighbours

[(['clearlyhotel', 'stripbad', 'enough', 'beigels', 'fathers'], 'beach'),
 (['racing', 'phoenixthe', 'informationi', 'bfest', 'itineraryyou'], 'like'),
 (['large', 'great', 'pressureoh', 'stupor', 'redundancies'], 'nice'),
 (['hasslesi', 'pointlessthe', 'geli', 'domenico', 'cap'], 'daughter'),
 (['kimptonoverall', 'bedmatress', 'womanno', 'plasmas', 'stinging'], 'swim'),
 (['oxidized', 'walk', 'gospainif', 'fhr', 'thin'], 'public'),
 (['goody', 'wifehad', 'waitressing', 'fraudulently', 'wit'], 'trees'),
 (['flute', 'simonewe', 'cotraveller', 'fridgeexcursions', 'tvconsoleremote'],
  'smoke'),
 (['jon', 'spreading', 'memorize', 'orderedif', 'confortablelovely'],
  'careful')]

## 3. find 5 neighbours of each of the 9 words from the scifi dataset

In [175]:
# check the frequencies of the words
frequency = pd.value_counts(scifi_word_list)
print("The most frequent words are\n{}\n------------------------------".format(frequency.head(20)))
print("The less frequent words are\n{}\n------------------------------".format(frequency.iloc[500:520]))
print("The much less frequent words are\n{}\n------------------------------".format(frequency.iloc[800:820]))

The most frequent words are
said     76347
one      55339
would    46555
could    41388
like     35710
back     31984
time     31971
know     28539
man      27071
dont     26121
get      24459
well     23211
see      21141
us       21022
way      20825
two      20553
even     20493
right    19416
first    18771
got      17900
dtype: int64
------------------------------
The less frequent words are
tiny         2415
smile        2408
showed       2400
somewhere    2398
mans         2390
twenty       2388
radio        2376
wish         2369
late         2369
john         2367
glanced      2363
killed       2355
legs         2355
nearly       2353
blood        2349
single       2347
couple       2345
state        2338
energy       2336
complete     2336
dtype: int64
------------------------------
The much less frequent words are
language    1595
knowing     1595
warm        1590
b           1582
dust        1578
worry       1572
editor      1572
boys        1570
offer       1566
pass      

In [176]:
# We chose 3 nouns, 3 verbs, and 3 adjectives respectively from the above 3 frequency levels.
chosen_words_scifi = ['time','said','right',
                      'blood','smile','tiny',
                      'party','worry','warm']

In [183]:
def get_closest_word_scifi(word, topn):
  word_distance = []
  emb = model_scifi.embeddings
  pdist = nn.PairwiseDistance()
  i = word2index_scifi[word]
  lookup_tensor_i = torch.tensor([i],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
  v_i = emb(lookup_tensor_i)
  for j in range(len(reviews_vocabulary)):
    if j !=i:
      lookup_tensor_j = torch.tensor([j],dtype=torch.long).to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
      v_j = emb(lookup_tensor_j)
      word_distance.append((index2word_scifi[j],float(pdist(v_i,v_j))))
  word_distance.sort(key=lambda x:x[1])
  return word_distance[:topn]

def get_closest_word_from_a_list_scifi(chosen_words, topn):
  chosen_words_and_their_neighbours=[]
  for word in chosen_words:
    get_result = get_closest_word_scifi(word, topn)
    neighbours = [nb[0] for nb in get_result]
    chosen_words_and_their_neighbours.append((neighbours,word))
  return chosen_words_and_their_neighbours


neighbours = get_closest_word_from_a_list_scifi(chosen_words_scifi, 5)
neighbours

[(['endorses', 'gunnysack', 'eor', 'brion', 'hebbentons'], 'time'),
 (['compels', 'drooled', 'amck', 'enhancement', 'cheaply'], 'said'),
 (['carefully', 'bilge', 'controlcircles', 'gboats', 'anjouan'], 'right'),
 (['harobr', 'broucheoperas', 'bari', 'chickenhearted', 'concomitants'],
  'blood'),
 (['domicile', 'dunican', 'engineersoldier', 'cerebellar', 'brakeman'],
  'smile'),
 (['babushka', 'cowdye', 'chinadoll', 'accelerationdissipaters', 'carefully'],
  'tiny'),
 (['guided', 'glitterof', 'babyrational', 'delisted', 'crumples'], 'party'),
 (['cozzi', 'gadzoons', 'discrepancies', 'dober', 'careillustrated'], 'worry'),
 (['bxeathing', 'basses', 'depresses', 'asmells', 'damnwell'], 'warm')]

## 5. Choose two words and retrive their 5 closest neighbours from both datasets

In [188]:
chosen_words=['good','said']
neighbours_from_reviews = get_closest_word_from_a_list(chosen_words, 5)
neighbours_from_scifi = get_closest_word_from_a_list_scifi(chosen_words, 5)


print("5 closest neighbours of the chosen words in hotel reviews dataset:")
print(neighbours_from_reviews)
print("5 closest neighbours of the chosen words in scifi dataset:")
print(neighbours_from_scifi)

5 closest neighbours of the chosen words in hotel reviews dataset:
[(['named', 'efficientexcellent', 'yiiiiiiiiiiiiiiiii', 'acquired', 'adequatei'], 'good'), (['interfering', 'hearhiring', 'gilbert', 'unfortunatelyafter', 'fullnow'], 'said')]
5 closest neighbours of the chosen words in scifi dataset:
[(['brokenhome', 'cronuss', 'could', 'ballooningup', 'b'], 'good'), (['compels', 'drooled', 'amck', 'enhancement', 'cheaply'], 'said')]
