<a href="https://colab.research.google.com/github/nanfang-wuyu/ML4NLP1_UZH/blob/main/Assignment%202/word_embeddings_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [43]:
%matplotlib inline

In [44]:
# Optional for word correction
# %pip install textblob
# %pip install pyspellchecker

Source: [https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*; it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Moreover, in what sense could you
combine these representations? We often want dense outputs from our
neural networks, where the inputs are $|V|$-dimensional, where
$V$ is our vocabulary, but the outputs often have only a few
dimensions (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
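As a small illustration of this encoding, the sketch below builds one-hot vectors with `torch.nn.functional.one_hot`. The three-word vocabulary and the name `toy_word_to_ix` are assumptions made up for this example, not part of the tutorial's data.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary, chosen only for illustration.
toy_word_to_ix = {"mathematician": 0, "physicist": 1, "store": 2}

# One row per word; row i has a single 1 at column i.
indices = torch.tensor(list(toy_word_to_ix.values()))
one_hot = F.one_hot(indices, num_classes=len(toy_word_to_ix)).float()
print(one_hot)

# The dot product between the vectors of two different words is 0.
math_vec = one_hot[toy_word_to_ix["mathematician"]]
phys_vec = one_hot[toy_word_to_ix["physicist"]]
print(torch.dot(math_vec, phys_vec).item())
```

Each vector has as many entries as the vocabulary has words, so the representation grows linearly with $|V|$.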

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
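To make the two similarity measures concrete, here is a minimal sketch that evaluates them on the hand-crafted vectors. It uses only the three attributes written out above; the elided entries are unknown, so the numbers describe the truncated vectors, not the full ones.

```python
import torch
import torch.nn.functional as F

# Only the first three attributes from the vectors above; the elided
# entries ("...") are unknown, so this is a truncated illustration.
q_mathematician = torch.tensor([2.3, 9.4, -5.5])
q_physicist = torch.tensor([2.5, 9.1, 6.4])

# Unnormalized similarity: the plain dot product.
dot = torch.dot(q_physicist, q_mathematician)

# Normalized similarity: divide by the product of the vector lengths.
cosine = dot / (q_physicist.norm() * q_mathematician.norm())

# F.cosine_similarity computes the same normalized quantity.
cosine_builtin = F.cosine_similarity(q_physicist, q_mathematician, dim=0)
print(dot.item(), cosine.item(), cosine_builtin.item())
```

The opposite signs on the "majored in Physics" attribute pull the cosine well below 1, even though the other two attributes nearly agree.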


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).




In [23]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7a48a89c7e90>

In [24]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings, randomly initialized
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


In [47]:
embeds(torch.LongTensor([1]))

tensor([[-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]],
       grad_fn=<EmbeddingBackward0>)
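The lookup-table view can be checked directly. The sketch below, which uses a separate hypothetical module named `toy_embeds`, shows that the table is the $|V| \times D$ weight matrix and that a lookup returns the matching rows, one per index.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
toy_embeds = nn.Embedding(2, 5)  # |V| = 2 words, D = 5 dimensions

# The lookup table is the |V| x D weight matrix.
print(toy_embeds.weight.shape)

# Looking up several indices at once returns one row per index.
batch = torch.tensor([0, 1, 0], dtype=torch.long)
looked_up = toy_embeds(batch)
print(looked_up.shape)

# Each returned row equals the matching row of the weight matrix.
print(torch.equal(looked_up[1], toy_embeds.weight[1]))
```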

# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.




In [48]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # n probs, 1 target idx
        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[523.4714274406433, 520.8815507888794, 518.309253692627, 515.754070520401, 513.2145917415619, 510.6909005641937, 508.1826972961426, 505.6879105567932, 503.2051067352295, 500.7331578731537]


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
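The objective can be traced term by term for a single example. This is a minimal sketch with made-up sizes and indices; the names `q`, `affine`, `context_ids` and `target_id` are assumptions introduced here, and the `nn.Linear` layer plays the role of $A$ and $b$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_dim = 6, 4               # hypothetical sizes
q = nn.Embedding(toy_vocab_size, toy_dim)    # q_w for every word w
affine = nn.Linear(toy_dim, toy_vocab_size)  # computes A x + b

context_ids = torch.tensor([0, 1, 3, 4])     # the words in C (N = 2)
target_id = torch.tensor([2])                # the target word w_i

summed = q(context_ids).sum(dim=0)           # sum over w in C of q_w
scores = affine(summed).unsqueeze(0)         # shape (1, |V|)
log_probs = F.log_softmax(scores, dim=1)     # log Softmax(...)
loss = -log_probs[0, target_id.item()]       # -log p(w_i | C)

# The same value via the built-in negative log-likelihood loss.
print(loss.item(), F.nll_loss(log_probs, target_id).item())
```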

Implement this model in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.




In [49]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])




def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

# data[i]: i-th contexts-target pair
# data[i][0]: 4 contexts (list)
# data[i][1]: target (text)
make_context_vector(data[0][0], word_to_ix)  # example

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


tensor([33, 41, 46, 27])

## Model

In [25]:
class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        """
        inputs: (batch, 2 * context_size) tensor of context word indices.
        Returns unnormalized scores (logits) of shape (batch, vocab_size).
        No softmax layer is applied here because nn.CrossEntropyLoss, used
        for training below, applies log-softmax internally.
        """
        embeds = self.embeddings(inputs)
        embeds = torch.sum(embeds, dim=1)
        out = self.linear(embeds)
        return out

# create your model and train.  here are some functions to help you make
# the data ready for use by your module
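A quick shape check can help before training on real data. The sketch below uses a hypothetical `TinyCBOW` with the same two layers and random made-up indices. It also illustrates that `nn.CrossEntropyLoss` on raw scores gives the same value as `nn.NLLLoss` on `log_softmax` output, which is why a model that returns unnormalized scores can still be trained with a softmax objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCBOW(nn.Module):
    """Hypothetical stand-in with an embedding layer and one linear layer."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: (batch, 2 * context_size) -> logits: (batch, vocab_size)
        return self.linear(self.embeddings(inputs).sum(dim=1))


torch.manual_seed(0)
tiny_model = TinyCBOW(vocab_size=10, embedding_dim=3)
contexts = torch.randint(0, 10, (4, 4))   # 4 examples, 4 context words each
targets = torch.randint(0, 10, (4,))

logits = tiny_model(contexts)
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(logits.shape, ce.item(), nll.item())
```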


## Load Data

In [4]:
!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /content/tripadvisor_hotel_reviews_reduced.csv
100% 7.36M/7.36M [00:00<00:00, 111MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 169MB/s]


In [26]:
import pandas as pd

with open('scifi_reduced.txt') as f:
    text_scifi = f.read()
df_hotel = pd.read_csv('tripadvisor_hotel_reviews_reduced.csv')

In [27]:
text_scifi[:20]

' A chat with the edi'

In [28]:
df_hotel[:5]

Unnamed: 0,Review,Rating
0,fantastic service large hotel caters business ...,5
1,"great hotel modern hotel good location, locate...",4
2,3 star plus glasgowjust got 30th november 4 da...,4
3,nice stayed hotel nov 19-23. great little bout...,4
4,great place wonderful hotel ideally located me...,5


In [29]:
text_hotel = df_hotel.drop(columns=["Rating"])

In [30]:
text_hotel[:5]

Unnamed: 0,Review
0,fantastic service large hotel caters business ...
1,"great hotel modern hotel good location, locate..."
2,3 star plus glasgowjust got 30th november 4 da...
3,nice stayed hotel nov 19-23. great little bout...
4,great place wonderful hotel ideally located me...


## Data Preprocessing

<!-- 1. Data Cleaning, 2. **Character Casing**, 3. **Splitting Sentences** (optional), 4. **Tokenizing Words**, 5. **Dealing with Punctuation**, 6. Expanding Contractions (optional), 7. **Removing Stopwords**, 8. Lemmatization and Stemming (optional), 9. Handling Special Cases, 10. Handling Rare Words and Out-of-Vocabulary Words.
 -->


1. Special Characters Cleaning
2. Character Casing
3. Tokenizing Words
4. Stop Word Removal
5. Stemming
6. Handling Rare Words and Out-of-Vocabulary Words

In [31]:
import nltk

In [32]:
# Brief text to test pre-processing functions
text = "It's a text to test pre-processing functions (a tstword). This is a repeat: It's a text to test pre-processing functions (a tstword)."

### Special Characters Cleaning

In [33]:
# import library: Regular Expression
import re


def sp_chara_cleaning(text):
    """Clean the data by replacing each special character (e.g. punctuation) with a space."""
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    return clean_text

In [34]:
# Test it
sp_chara_cleaning("#trap music <123> @gmail! How are you?").split()

['trap', 'music', '123', 'gmail', 'How', 'are', 'you']

In [35]:
# Special characters like ' . - are replaced with spaces.
text = sp_chara_cleaning(text)
text

'It s a text to test pre processing functions  a tstword   This is a repeat  It s a text to test pre processing functions  a tstword  '

### Character Casing

In [36]:
def character_casing(text):
    """Lowercase all words."""
    lower_text = text.lower()
    return lower_text

In [37]:
# All characters are now lowercase.
text = character_casing(text)
text

'it s a text to test pre processing functions  a tstword   this is a repeat  it s a text to test pre processing functions  a tstword  '

### Tokenizing Words

In [38]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [39]:
from nltk.tokenize import word_tokenize


def tokenize_words(text):
    """Tokenize the text into words for the later preprocessing functions."""
    return word_tokenize(text)

In [40]:
words = tokenize_words(text)
print(words)

['it', 's', 'a', 'text', 'to', 'test', 'pre', 'processing', 'functions', 'a', 'tstword', 'this', 'is', 'a', 'repeat', 'it', 's', 'a', 'text', 'to', 'test', 'pre', 'processing', 'functions', 'a', 'tstword']


### Stop Word Removal

In [41]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [42]:
from nltk.corpus import stopwords


def stop_word_removal(words):
    """Remove words that appear in NLTK's English stop word list."""
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
    clean_words = [w for w in words if w not in stop_words]
    return clean_words



In [43]:
# Stop words like 'it', 's', 'a', 'this' are removed.
words = stop_word_removal(words)
print(words)

['text', 'test', 'pre', 'processing', 'functions', 'tstword', 'repeat', 'text', 'test', 'pre', 'processing', 'functions', 'tstword']


### Stemming

In [44]:
from nltk.stem import PorterStemmer


def stemming(words):
    """
    Stem the text, which reduces words to their root form.
    We use PorterStemmer, a widely used rule-based suffix-stripping algorithm.
    """
    stemmer = PorterStemmer()
    clean_words = [stemmer.stem(w) for w in words]
    return clean_words



In [45]:
stemming(["guests", "takes"])

['guest', 'take']

In [46]:
# Here 'processing' is reduced to 'process', 'functions' is reduced to 'function'.
words = stemming(words)
print(words)

['text', 'test', 'pre', 'process', 'function', 'tstword', 'repeat', 'text', 'test', 'pre', 'process', 'function', 'tstword']


### Handling Rare Words and Out-of-Vocabulary Words

In [47]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [48]:
from nltk import FreqDist


def handle_rare_and_OOV(words, min_freq=10, handle_oov=False):
    """
    Handle rare words and out-of-vocabulary (OOV) words.
    FreqDist counts the frequency of every word, and
    nltk.corpus.words.words() serves as the vocabulary for OOV handling.
    We keep words that reach the minimum frequency and, if handle_oov
    is set, that also appear in the vocabulary.
    Since handling OOV words costs much time, it defaults to False,
    which is more convenient for peer reviewers.
    """
    word_freq = FreqDist(words)
    if handle_oov:
        vocab = set(nltk.corpus.words.words())  # a set gives fast membership tests
        clean_words = [w for w in words if word_freq[w] >= min_freq and w in vocab]
    else:
        clean_words = [w for w in words if word_freq[w] >= min_freq]
    return clean_words


In [49]:
# Here the rare word 'repeat' (frequency 1 < min_freq=2) is removed.
# The OOV word 'tstword' is kept because handle_oov defaults to False.
words = handle_rare_and_OOV(words, 2)
print(words)

['text', 'test', 'pre', 'process', 'function', 'tstword', 'text', 'test', 'pre', 'process', 'function', 'tstword']
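The frequency filter on its own can be reproduced with the standard library. The sketch below is an illustration only: `filter_rare` is a hypothetical helper built on `collections.Counter`, and it leaves out the OOV check. It is run on the stemmed tokens of the running example to show which token falls below `min_freq=2`.

```python
from collections import Counter

# The stemmed tokens from the running example above.
stemmed = ['text', 'test', 'pre', 'process', 'function', 'tstword', 'repeat',
           'text', 'test', 'pre', 'process', 'function', 'tstword']


def filter_rare(words, min_freq):
    """Keep tokens that occur at least `min_freq` times (hypothetical helper)."""
    counts = Counter(words)
    return [w for w in words if counts[w] >= min_freq]


kept = filter_rare(stemmed, min_freq=2)
removed = sorted(set(stemmed) - set(kept))
print(removed)  # ['repeat']
```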


### Apply Preprocessing Functions
Skip this if you already have saved data after pre-processing.

#### Hotel Dataset

In [50]:
# show text before preprocessing
text_hotel["Review"].iloc[0]

'fantastic service large hotel caters business corporates, serve provided better wife experienced- nothing short world.the room upgraded superior room overlooking harbour marina large window 50 feet length, anniversary bottle champagne sent chocolates compliments management, expensive did not regret moment choice hotel, highly recommended exclusive hotel break pamper,  '

In [None]:
funcs = [sp_chara_cleaning, character_casing, tokenize_words, stop_word_removal, stemming, handle_rare_and_OOV]

full_text_hotel = ' '.join(text_hotel['Review'])
words_hotel = full_text_hotel
for i, func in enumerate(funcs):
    print(func.__name__)
    words_hotel = func(words_hotel)


sp_chara_cleaning
character_casing
tokenize_words
stop_word_removal
stemming
handle_rare_and_OOV


In [None]:
# show text after preprocessing
" ".join(words_hotel[:20])

'fantast hotel cater better wife short world room superior room overlook harbour marina window length sent compliment regret moment hotel'

In [246]:
with open('clean_hotel.txt', 'w') as file:
    file.write(' '.join(words_hotel))

#### Scifi Dataset

In [51]:
# show text before preprocessing
text_scifi[:60]

' A chat with the editor  i #  science fiction magazine calle'

In [None]:
funcs = [sp_chara_cleaning, character_casing, tokenize_words, stop_word_removal, stemming, handle_rare_and_OOV]

words_scifi = text_scifi
for i, func in enumerate(funcs):
    print(func.__name__)
    words_scifi = func(words_scifi)


sp_chara_cleaning
character_casing
tokenize_words
stop_word_removal
stemming
handle_rare_and_OOV


In [None]:
# show text after preprocessing
" ".join(words_scifi[:20])

'chat editor scienc fiction magazin call titl select much thought breviti theori indic field easi rememb tent titl morn rememb'

In [248]:
with open('clean_scifi.txt', 'w') as file:
    file.write(' '.join(words_scifi))

### Create Dataset For Training

In [52]:
# Load dataset after cleaning. If you have done data pre-processing in the same
# kernel process, you can skip it.
with open('clean_hotel.txt', 'r') as file:
    words_hotel = file.read().split(' ')
with open('clean_scifi.txt', 'r') as file:
    words_scifi = file.read().split(' ')


In [53]:
def create_triagrams(i_text, context_size=2):
    """
    Save the indices of the targets to i_target_tensor and the indices of
    each target's context words to i_context_tensor.
    """
    i_context_tensor = []
    i_target_tensor = []
    for i in range(context_size, len(i_text) - context_size):
        i_contexts = [i_text[j] for j in range(i - context_size, i + context_size + 1) if i != j]
        i_target = i_text[i]
        i_context_tensor.append(i_contexts)
        i_target_tensor.append(i_target)
    i_target_tensor = torch.LongTensor(i_target_tensor)
    i_context_tensor = torch.LongTensor(i_context_tensor)
    return i_context_tensor, i_target_tensor
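For corpora with millions of tokens, the Python loop can be replaced by tensor slicing. The following is a sketch of one possible vectorized variant, not the notebook's method: `cbow_pairs_unfold` is a hypothetical name, and it relies on `Tensor.unfold` to produce every sliding window in one call. The toy input is made up so the result is easy to read.

```python
import torch


def cbow_pairs_unfold(i_text, context_size=2):
    """Hypothetical vectorized variant: returns (contexts, targets)."""
    ids = torch.as_tensor(i_text, dtype=torch.long)
    # Every sliding window of length 2 * context_size + 1, one per row.
    windows = ids.unfold(0, 2 * context_size + 1, 1)
    # The middle column of each window is the target word.
    targets = windows[:, context_size]
    # The remaining columns, left and right of the middle, are the context.
    contexts = torch.cat(
        [windows[:, :context_size], windows[:, context_size + 1:]], dim=1
    )
    return contexts, targets


toy_ids = list(range(7))  # made-up token indices 0..6
toy_contexts, toy_targets = cbow_pairs_unfold(toy_ids, context_size=2)
print(toy_contexts)
print(toy_targets)
```

With seven tokens and a context size of 2, three windows fit, so the sketch yields three context rows of four indices each, matching the `len(i_text) - 2 * context_size` examples the loop produces.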

#### SCIFI DATASET

In [54]:
vocab_scifi = list(set(words_scifi))
word_to_ix_scifi = {word: i for i, word in enumerate(vocab_scifi)}
i_text_scifi = [word_to_ix_scifi[w] for w in words_scifi]
i_context_scifi, i_target_scifi = create_triagrams(i_text_scifi, 2)

In [55]:
len(words_scifi), len(vocab_scifi), i_target_scifi.shape, i_context_scifi.shape

(3770291, 16754, torch.Size([3770287]), torch.Size([3770287, 4]))

#### HOTEL DATASET

In [56]:
vocab_hotel = list(set(words_hotel))
word_to_ix_hotel = {word: i for i, word in enumerate(vocab_hotel)}
i_text_hotel = [word_to_ix_hotel[w] for w in words_hotel]
i_context_hotel_2, i_target_hotel_2 = create_triagrams(i_text_hotel, 2)
i_context_hotel_5, i_target_hotel_5 = create_triagrams(i_text_hotel, 5)

In [57]:
len(words_hotel), len(vocab_hotel)

(708517, 2776)

In [58]:
i_context_hotel_2.shape, i_target_hotel_2.shape, i_context_hotel_5.shape, i_target_hotel_5.shape

(torch.Size([708513, 4]),
 torch.Size([708513]),
 torch.Size([708507, 10]),
 torch.Size([708507]))

## Train Model

In [59]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# device = torch.device("mps")
# device

In [60]:
torch.cuda.is_available()

True

In [61]:
import time
import logging

In [62]:
from torch.utils.data import IterableDataset, DataLoader

# Example from 05_intro_to_PyTorch.ipynb
class MyDataset(IterableDataset):
    def __init__(self, data_X, data_y):
        assert len(data_X) == len(data_y)
        self.data_X = data_X.to(device)
        self.data_y = data_y.to(device)

    def __len__(self):
        return len(self.data_X)

    def __iter__(self):
        for i in range(len(self.data_X)):
            yield (self.data_X[i], self.data_y[i])



In [63]:
def createDataLoader(X, y):
    torch.manual_seed(1)
    train_set = MyDataset(X, y)
    train_loader = DataLoader(train_set, batch_size=64)
    return train_loader
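One property of this setup is worth knowing: `DataLoader` does not accept `shuffle=True` for an `IterableDataset`, so batches arrive in corpus order every epoch. If shuffling is wanted, a map-style dataset such as `TensorDataset` is one option. The sketch below uses small made-up tensors (`toy_X`, `toy_y`) in place of the real context and target tensors; whether shuffling changes results on these datasets is not something the notebook tests.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up stand-ins for the context and target index tensors.
toy_X = torch.arange(40).reshape(10, 4)
toy_y = torch.arange(10)

torch.manual_seed(1)  # makes the shuffling order reproducible
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=4, shuffle=True)

for batch_X, batch_y in toy_loader:
    print(batch_X.shape, batch_y.tolist())
```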

### CBOW 2 with Hotel Dataset

In [260]:
# Setting logging, dataloader, hyperparams, loss function and optimizer.
logging.basicConfig(filename='training_hotel2.log', level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', force=True)
train_loader = createDataLoader(i_context_hotel_2, i_target_hotel_2)
CONTEXT_SIZE = 2
EMBEDDING_DIM = 50
losses = []
loss_function = nn.CrossEntropyLoss()
CBOW2_hotel_model = CBOW(len(vocab_hotel), EMBEDDING_DIM, CONTEXT_SIZE).to(device)
optimizer = optim.SGD(CBOW2_hotel_model.parameters(), lr=0.1)
epochs = 20

In [261]:
for epoch in range(epochs):
    total_loss = 0
    start_time = time.time()
    for batch_num, (i_context, i_target) in enumerate(train_loader):
        # Torch accumulates gradients. Before passing in a
        # new batch, zero out the gradients from the old batch
        CBOW2_hotel_model.zero_grad()

        # Run the forward pass, getting unnormalized scores (logits) over
        # the vocabulary for the target word
        logits = CBOW2_hotel_model(i_context)

        # Compute loss function (CrossEntropyLoss applies log-softmax internally).
        loss = loss_function(logits, i_target)

        # Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()

    info = f"Epoch: {epoch + 1} / {epochs} Loss: {total_loss / len(train_loader):.4f} \
Time: {time.time() - start_time:.2f}s"
    print(info)
    logging.info(info)
torch.save(CBOW2_hotel_model.state_dict(), f'CBOW2_hotel_model_{epochs}.pth')

Epoch: 1 / 20 Loss: 6.5569 Time: 15.84s
Epoch: 2 / 20 Loss: 6.1161 Time: 17.06s
Epoch: 3 / 20 Loss: 6.0217 Time: 16.70s
Epoch: 4 / 20 Loss: 5.9646 Time: 18.62s
Epoch: 5 / 20 Loss: 5.9233 Time: 16.64s
Epoch: 6 / 20 Loss: 5.8908 Time: 16.02s
Epoch: 7 / 20 Loss: 5.8641 Time: 16.44s
Epoch: 8 / 20 Loss: 5.8414 Time: 16.02s
Epoch: 9 / 20 Loss: 5.8216 Time: 15.88s
Epoch: 10 / 20 Loss: 5.8040 Time: 16.56s
Epoch: 11 / 20 Loss: 5.7882 Time: 19.43s
Epoch: 12 / 20 Loss: 5.7739 Time: 16.31s
Epoch: 13 / 20 Loss: 5.7607 Time: 16.02s
Epoch: 14 / 20 Loss: 5.7486 Time: 15.77s
Epoch: 15 / 20 Loss: 5.7373 Time: 15.95s
Epoch: 16 / 20 Loss: 5.7267 Time: 16.46s
Epoch: 17 / 20 Loss: 5.7168 Time: 18.79s
Epoch: 18 / 20 Loss: 5.7074 Time: 16.69s
Epoch: 19 / 20 Loss: 5.6986 Time: 15.91s
Epoch: 20 / 20 Loss: 5.6902 Time: 15.92s
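One way to put these numbers in context is to compare them with a model that assigns equal probability to every word, whose cross-entropy is $\ln |V|$. The sketch below does that arithmetic with the vocabulary size and final loss reported in this notebook; the variable names are introduced here for illustration.

```python
import math

vocab_size_hotel = 2776   # len(vocab_hotel) reported above
final_loss = 5.6902       # mean loss logged for epoch 20

# A model that assigns equal probability to every word has loss ln|V|.
uniform_loss = math.log(vocab_size_hotel)

# Perplexity is the exponential of the mean cross-entropy (in nats).
perplexity = math.exp(final_loss)
print(f"uniform baseline: {uniform_loss:.4f}")
print(f"perplexity after 20 epochs: {perplexity:.1f}")
```

The logged loss sits below the uniform baseline from the first epoch onward, so the model is learning something beyond chance, although the loss is still decreasing slowly after 20 epochs.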


### CBOW 5 with Hotel Dataset

In [279]:
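# Setting logging, dataloader, hyperparams, loss function and optimizer.
# force=True (Python 3.8+) replaces the handlers of any earlier basicConfig call;
# without it, a repeated basicConfig call in the same kernel has no effect.
logging.basicConfig(filename='training_hotel5.log', level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', force=True)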
train_loader = createDataLoader(i_context_hotel_5, i_target_hotel_5)
CONTEXT_SIZE = 5
EMBEDDING_DIM = 50
losses = []
loss_function = nn.CrossEntropyLoss()
CBOW5_hotel_model = CBOW(len(vocab_hotel), EMBEDDING_DIM, CONTEXT_SIZE).to(device)
optimizer = optim.SGD(CBOW5_hotel_model.parameters(), lr=0.1)
epochs = 20

In [280]:
for epoch in range(epochs):
    total_loss = 0
    start_time = time.time()
    for batch_num, (i_context, i_target) in enumerate(train_loader):

        # Torch accumulates gradients. Before passing in a
        # new batch, zero out the gradients from the old batch
        CBOW5_hotel_model.zero_grad()

        # Run the forward pass, getting unnormalized scores (logits) over
        # the vocabulary for the target word
        logits = CBOW5_hotel_model(i_context)

        # Compute loss function (CrossEntropyLoss applies log-softmax internally).
        loss = loss_function(logits, i_target)

        # Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    info = f"Epoch: {epoch + 1} / {epochs} Loss: {total_loss / len(train_loader):.4f} \
Time: {time.time() - start_time:.2f}s"
    print(info)
    logging.info(info)
#     losses.append(total_loss)
# print(losses)  # The loss decreased every iteration over the training data!
torch.save(CBOW5_hotel_model.state_dict(), 'CBOW5_hotel_model.pth')

Epoch: 1 / 20 Loss: 6.6255 Time: 16.82s
Epoch: 2 / 20 Loss: 6.2512 Time: 16.79s
Epoch: 3 / 20 Loss: 6.1686 Time: 18.01s
Epoch: 4 / 20 Loss: 6.1169 Time: 16.79s
Epoch: 5 / 20 Loss: 6.0788 Time: 20.59s
Epoch: 6 / 20 Loss: 6.0484 Time: 21.54s
Epoch: 7 / 20 Loss: 6.0230 Time: 23.22s
Epoch: 8 / 20 Loss: 6.0010 Time: 17.90s
Epoch: 9 / 20 Loss: 5.9816 Time: 17.96s
Epoch: 10 / 20 Loss: 5.9642 Time: 18.78s
Epoch: 11 / 20 Loss: 5.9484 Time: 17.31s
Epoch: 12 / 20 Loss: 5.9339 Time: 20.03s
Epoch: 13 / 20 Loss: 5.9205 Time: 18.92s
Epoch: 14 / 20 Loss: 5.9080 Time: 16.17s
Epoch: 15 / 20 Loss: 5.8964 Time: 16.77s
Epoch: 16 / 20 Loss: 5.8854 Time: 16.57s
Epoch: 17 / 20 Loss: 5.8751 Time: 16.00s
Epoch: 18 / 20 Loss: 5.8653 Time: 22.79s
Epoch: 19 / 20 Loss: 5.8561 Time: 18.57s
Epoch: 20 / 20 Loss: 5.8473 Time: 18.58s


### CBOW 2 with Scifi Dataset

In [276]:
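# Setting logging, dataloader, hyperparams, loss function and optimizer.
# force=True (Python 3.8+) replaces the handlers of any earlier basicConfig call;
# without it, a repeated basicConfig call in the same kernel has no effect.
logging.basicConfig(filename='training_scifi.log', level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', force=True)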
train_loader = createDataLoader(i_context_scifi, i_target_scifi)
CONTEXT_SIZE = 2
EMBEDDING_DIM = 50
losses = []
loss_function = nn.CrossEntropyLoss()
CBOW2_scifi_model = CBOW(len(vocab_scifi), EMBEDDING_DIM, CONTEXT_SIZE).to(device)
optimizer = optim.SGD(CBOW2_scifi_model.parameters(), lr=0.1)
epochs = 10

In [277]:
for epoch in range(epochs):
    total_loss = 0
    start_time = time.time()
    for batch_num, (i_context, i_target) in enumerate(train_loader):

        # Torch accumulates gradients. Before passing in a
        # new instance, zero out the gradients from the old instance
        CBOW2_scifi_model.zero_grad()

        # Run the forward pass, getting the model's scores over the vocabulary
        # for the target (center) word of each context window
        log_probs = CBOW2_scifi_model(i_context)

        # Compute loss function.
        loss = loss_function(log_probs, i_target)

        # Do the backward pass, then let the optimizer update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()

    # Report the mean loss per batch for this epoch.
    info = f"Epoch: {epoch + 1} / {epochs} Loss: {total_loss / len(train_loader):.4f} \
Time: {time.time() - start_time:.2f}s"
    print(info)
    logging.info(info)
    losses.append(total_loss)  # summed (not averaged) batch losses for this epoch
print(losses)  # The loss decreased every epoch over the training data!
torch.save(CBOW2_scifi_model.state_dict(), 'CBOW2_scifi_model.pth')

Epoch: 1 / 10 Loss: 8.4294 Time: 88.90s
Epoch: 2 / 10 Loss: 7.9138 Time: 89.89s
Epoch: 3 / 10 Loss: 7.8121 Time: 90.43s
Epoch: 4 / 10 Loss: 7.7514 Time: 91.18s
Epoch: 5 / 10 Loss: 7.7072 Time: 91.25s
Epoch: 6 / 10 Loss: 7.6721 Time: 95.86s
Epoch: 7 / 10 Loss: 7.6428 Time: 89.50s
Epoch: 8 / 10 Loss: 7.6177 Time: 89.75s
Epoch: 9 / 10 Loss: 7.5958 Time: 99.98s
Epoch: 10 / 10 Loss: 7.5763 Time: 101.23s
[496584.8132376671, 466208.168926239, 460217.6973621845, 456644.96947336197, 454040.4538923502, 451970.48763239384, 450246.8695342541, 448769.1400618553, 447476.03984856606, 446326.8780350685]
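
The printed `losses` list holds the summed batch losses, while the epoch lines report the mean per batch, so the two differ by a factor of `len(train_loader)`. The sketch below uses only the first and last epoch values copied from the output above to recover that factor, and also converts the mean loss to perplexity. It assumes the loss is the natural-log cross-entropy with mean reduction (the default for `nn.CrossEntropyLoss`); the variable names are illustrative and not part of the notebook.

```python
import math

# First and last epoch values copied from the output above.
total_losses = [496584.8132376671, 446326.8780350685]  # entries of `losses`
mean_losses = [8.4294, 7.5763]                         # "Loss:" in the epoch lines

# total = mean * number_of_batches, so the ratio estimates len(train_loader).
estimated_batches = [t / m for t, m in zip(total_losses, mean_losses)]

# Perplexity is exp(mean cross-entropy) when the loss uses natural logarithms.
perplexities = [math.exp(m) for m in mean_losses]

print([round(b) for b in estimated_batches])
print([round(p) for p in perplexities])
```

Both ratios should agree up to the rounding of the printed means, and a falling perplexity is an easier-to-read view of the same improvement.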


# Part 2 Embeddings Evaluation

In [64]:
# Run here only if you load the model from directory.
# If you have a model trained in this jupyter notebook kernel, move ahead.

# CBOW2_hotel_model = CBOW(len(vocab_hotel), 50, 2).to(device)
# CBOW2_hotel_model.load_state_dict(torch.load('CBOW2_hotel_model_20.pth', map_location=device))
# CBOW2_hotel_model.eval()

# map_location=device lets a checkpoint saved on GPU load on a CPU-only machine too.
CBOW5_hotel_model = CBOW(len(vocab_hotel), 50, 5).to(device)
CBOW5_hotel_model.load_state_dict(torch.load('CBOW5_hotel_model.pth', map_location=device))
CBOW5_hotel_model.eval()

CBOW(
  (embeddings): Embedding(2776, 50)
  (linear): Linear(in_features=50, out_features=2776, bias=True)
)

In [65]:
import torch
import torch.nn as nn

def get_closest_word(word, net, word_to_index, vocabulary, topn=5, device=device):
    """Return the topn (word, distance) pairs closest to `word` in embedding space."""
    net.eval()
    word_distance = []
    emb = net.embeddings
    pdist = nn.PairwiseDistance()  # Euclidean (p=2) distance by default
    i = word_to_index[word]
    lookup_tensor_i = torch.tensor([i], dtype=torch.long).to(device)
    with torch.no_grad():  # no gradients are needed for evaluation
        v_i = emb(lookup_tensor_i)
        for j in range(len(vocabulary)):
            if j != i:
                lookup_tensor_j = torch.tensor([j], dtype=torch.long).to(device)
                v_j = emb(lookup_tensor_j)
                word_distance.append((vocabulary[j], float(pdist(v_i, v_j))))
    # Sort once after the loop instead of re-sorting on every iteration.
    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]
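
`get_closest_word` looks up one embedding row per loop iteration. As a rough alternative, the distances to every word can be computed in a single tensor operation and the smallest ones picked with `torch.topk`. The function name `closest_words_vectorized` and the toy vocabulary below are hypothetical, and the function takes the `nn.Embedding` layer directly (for example `net.embeddings`) instead of the whole model. Distances may differ from `nn.PairwiseDistance` in the last decimals, because that module adds a small `eps` to the difference before taking the norm.

```python
import torch
import torch.nn as nn

def closest_words_vectorized(word, embedding, word_to_index, vocabulary, topn=5):
    """Return the topn (word, distance) pairs nearest to `word` by Euclidean distance."""
    weights = embedding.weight.detach()              # shape: (vocab_size, dim)
    i = word_to_index[word]
    dists = torch.norm(weights - weights[i], dim=1)  # distance from word i to every row
    dists[i] = float("inf")                          # exclude the query word itself
    k = min(topn, len(vocabulary) - 1)
    values, indices = torch.topk(dists, k, largest=False)
    return [(vocabulary[j], float(d)) for j, d in zip(indices.tolist(), values.tolist())]

# Toy check with a hypothetical 4-word vocabulary and hand-set 2-d vectors.
toy_vocab = ["good", "great", "bad", "hotel"]
toy_index = {w: k for k, w in enumerate(toy_vocab)}
toy_emb = nn.Embedding(4, 2)
with torch.no_grad():
    toy_emb.weight.copy_(torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]))
toy_neighbors = closest_words_vectorized("good", toy_emb, toy_index, toy_vocab, topn=2)
print(toy_neighbors)
```

With the notebook's objects, a call would look like `closest_words_vectorized('hotel', CBOW5_hotel_model.embeddings, word_to_ix_hotel, vocab_hotel)`, assuming `vocab_hotel[j]` is the word with index `j`, as `get_closest_word` also assumes.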

### Hotel

#### CBOW2

In [66]:
# word frequency
word_freq_hotel = FreqDist(words_hotel)

In [67]:
# sort by frequency
sorted_keys_hotel = sorted(word_freq_hotel.keys(), key = lambda x : word_freq_hotel[x], reverse = True)

In [68]:
word_freq_hotel.most_common(10)

[('hotel', 26584),
 ('room', 23165),
 ('stay', 13991),
 ('great', 10501),
 ('n', 9323),
 ('good', 8687),
 ('staff', 8203),
 ('night', 7148),
 ('day', 6641),
 ('nice', 6533)]

In [285]:
[(x, word_freq_hotel[x]) for x in sorted_keys_hotel[880:890]]

[('slipper', 105),
 ('rip', 105),
 ('period', 105),
 ('barrier', 105),
 ('known', 105),
 ('neat', 105),
 ('approach', 105),
 ('plug', 105),
 ('calm', 104),
 ('cramp', 104)]

In [69]:
chosen_words_hotel = ['hotel', 'staff', 'statement', 'eat', 'jump', 'weigh', 'good', 'nice', 'calm']
s = f"Chosen words:   \n\
    nouns:  'hotel': {word_freq_hotel['hotel']}, \n\
            'staff': {word_freq_hotel['staff']}, \n\
            'statement': {word_freq_hotel['statement']},\n\
    verbs:  'eat': {word_freq_hotel['eat']}, \n\
            'jump': {word_freq_hotel['jump']},\n\
            'weigh': {word_freq_hotel['weigh']},\n\
    adjs:   'good': {word_freq_hotel['good']}, \n\
            'nice': {word_freq_hotel['nice']},\n\
            'calm': {word_freq_hotel['calm']},"

In [287]:
print(s)

Chosen words:   
    nouns:  'hotel': 26584, 
            'staff': 8203, 
            'statement': 21,
    verbs:  'eat': 1594, 
            'jump': 106,
            'weigh': 16,
    adjs:   'good': 8687, 
            'nice': 6533,
            'calm': 104,


In [70]:
chosen_words_neighbors_hotel = [get_closest_word(w, CBOW2_hotel_model, word_to_ix_hotel, vocab_hotel) for w in chosen_words_hotel]

In [289]:
for i, w in enumerate(chosen_words_hotel):
    print(f"{w}:{chosen_words_neighbors_hotel[i]}")

hotel:[('impress', 5.191305160522461), ('room', 5.312654972076416), ('place', 5.365931510925293), ('quit', 5.450966835021973), ('time', 5.5018205642700195)]
staff:[('way', 6.993133544921875), ('ruin', 7.030202865600586), ('eye', 7.045656204223633), ('wow', 7.053786277770996), ('threw', 7.065330505371094)]
statement:[('gon', 6.269299030303955), ('motor', 6.735846519470215), ('jean', 7.025935173034668), ('pedestrian', 7.571342468261719), ('oven', 7.595150470733643)]
eat:[('sign', 6.66425895690918), ('recommend', 7.06675386428833), ('breath', 7.135968208312988), ('studio', 7.2372236251831055), ('run', 7.262462139129639)]
jump:[('cross', 6.364423751831055), ('slot', 6.530276298522949), ('dart', 6.550929069519043), ('sent', 6.554869651794434), ('teen', 6.569332599639893)]
weigh:[('whirlpool', 6.831803321838379), ('bite', 7.150975704193115), ('shot', 7.3557281494140625), ('bland', 7.364491939544678), ('swimsuit', 7.547054290771484)]
good:[('great', 3.9591073989868164), ('excel', 5.0803871154

#### CBOW5

In [71]:
chosen_words_neighbors_hotel_5 = [get_closest_word(w, CBOW5_hotel_model, word_to_ix_hotel, vocab_hotel) for w in chosen_words_hotel]

In [72]:
for i, w in enumerate(chosen_words_hotel):
    print(f"{w}:{chosen_words_neighbors_hotel_5[i]}")

hotel:[('spring', 5.974390983581543), ('grant', 6.408576965332031), ('invest', 6.439939022064209), ('concert', 6.478260040283203), ('card', 6.489580154418945)]
staff:[('coconut', 5.714028358459473), ('palm', 5.7422566413879395), ('land', 6.039821624755859), ('housekeep', 6.172938823699951), ('face', 6.232258319854736)]
statement:[('pour', 7.124289512634277), ('al', 7.1709442138671875), ('nonetheless', 7.262602806091309), ('towel', 7.268130779266357), ('n', 7.268770217895508)]
eat:[('spring', 6.313792705535889), ('palm', 6.437282562255859), ('overbook', 6.4566826820373535), ('facial', 6.485250473022461), ('depress', 6.498972415924072)]
jump:[('coconut', 7.1404547691345215), ('fever', 7.264527797698975), ('hectic', 7.315446853637695), ('wo', 7.407448768615723), ('supper', 7.47681999206543)]
weigh:[('freeway', 6.294049263000488), ('nit', 6.451268196105957), ('depress', 6.636817455291748), ('lagoon', 6.64177942276001), ('els', 6.867648601531982)]
good:[('import', 6.512632369995117), ('part

In [75]:
for i, w in enumerate(chosen_words_hotel):
    print(f"{w}:{[x[0] for x in chosen_words_neighbors_hotel_5[i]]}")

hotel:['spring', 'grant', 'invest', 'concert', 'card']
staff:['coconut', 'palm', 'land', 'housekeep', 'face']
statement:['pour', 'al', 'nonetheless', 'towel', 'n']
eat:['spring', 'palm', 'overbook', 'facial', 'depress']
jump:['coconut', 'fever', 'hectic', 'wo', 'supper']
weigh:['freeway', 'nit', 'depress', 'lagoon', 'els']
good:['import', 'partial', 'gon', 'oyster', 'facial']
nice:['door', 'appoint', 'concert', 'wrong', 'partial']
calm:['builder', 'grant', 'cupboard', 'score', 'fell']
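
One simple way to compare the two context sizes is to measure how many nearest neighbors the CBOW2 and CBOW5 models share for the same query word. The sketch below uses the Jaccard overlap, which is an illustrative choice and not something the notebook computes. The neighbor lists are copied from the outputs above for three words whose CBOW2 lists are fully visible; `jaccard` and the dictionary names are hypothetical.

```python
# Neighbor words copied from the CBOW2 and CBOW5 outputs above.
cbow2_neighbors = {
    "hotel": ["impress", "room", "place", "quit", "time"],
    "staff": ["way", "ruin", "eye", "wow", "threw"],
    "jump": ["cross", "slot", "dart", "sent", "teen"],
}
cbow5_neighbors = {
    "hotel": ["spring", "grant", "invest", "concert", "card"],
    "staff": ["coconut", "palm", "land", "housekeep", "face"],
    "jump": ["coconut", "fever", "hectic", "wo", "supper"],
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

neighbor_overlap = {
    w: jaccard(cbow2_neighbors[w], cbow5_neighbors[w]) for w in cbow2_neighbors
}
for w, score in neighbor_overlap.items():
    print(f"{w}: {score:.2f}")
```

For these three words the two models share no neighbors at all, which suggests the learned neighborhoods are not yet stable across context sizes.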


In [79]:
st = """hotel:['spring', 'grant', 'invest', 'concert', 'card']
staff:['coconut', 'palm', 'land', 'housekeep', 'face']
statement:['pour', 'al', 'nonetheless', 'towel', 'n']
eat:['spring', 'palm', 'overbook', 'facial', 'depress']
jump:['coconut', 'fever', 'hectic', 'wo', 'supper']
weigh:['freeway', 'nit', 'depress', 'lagoon', 'els']
good:['import', 'partial', 'gon', 'oyster', 'facial']
nice:['door', 'appoint', 'concert', 'wrong', 'partial']
calm:['builder', 'grant', 'cupboard', 'score', 'fell']"""
st = st.replace('[', '')
st = st.replace(']', '')
print(st)

hotel:'spring', 'grant', 'invest', 'concert', 'card'
staff:'coconut', 'palm', 'land', 'housekeep', 'face'
statement:'pour', 'al', 'nonetheless', 'towel', 'n'
eat:'spring', 'palm', 'overbook', 'facial', 'depress'
jump:'coconut', 'fever', 'hectic', 'wo', 'supper'
weigh:'freeway', 'nit', 'depress', 'lagoon', 'els'
good:'import', 'partial', 'gon', 'oyster', 'facial'
nice:'door', 'appoint', 'concert', 'wrong', 'partial'
calm:'builder', 'grant', 'cupboard', 'score', 'fell'
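
The bracket-free listing above can also be built directly from the neighbor tuples rather than by pasting the printed output into a string. This is a minimal sketch; `format_neighbors` is a hypothetical helper, and the sample data is a two-word excerpt of the CBOW5 results shown earlier.

```python
def format_neighbors(words, neighbors):
    """Render one 'word:neighbor, neighbor, ...' line per query word."""
    lines = []
    for word, pairs in zip(words, neighbors):
        names = ", ".join(repr(name) for name, _dist in pairs)
        lines.append(f"{word}:{names}")
    return "\n".join(lines)

# Excerpt of the (word, distance) pairs printed for the CBOW5 model.
sample_words = ["hotel", "staff"]
sample_neighbors = [
    [("spring", 5.974390983581543), ("grant", 6.408576965332031)],
    [("coconut", 5.714028358459473), ("palm", 5.7422566413879395)],
]
formatted = format_neighbors(sample_words, sample_neighbors)
print(formatted)
```

In the notebook this would be called as `format_neighbors(chosen_words_hotel, chosen_words_neighbors_hotel_5)`.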


### Scifi

In [290]:
# word frequency
word_freq_scifi = FreqDist(words_scifi)

In [291]:
# sort by frequency
sorted_keys_scifi = sorted(word_freq_scifi.keys(), key = lambda x : word_freq_scifi[x], reverse = True)

In [297]:
word_freq_scifi.most_common(40)

[('said', 36714),
 ('one', 29690),
 ('would', 22545),
 ('could', 20325),
 ('like', 20157),
 ('look', 18611),
 ('time', 18051),
 ('back', 17022),
 ('go', 16137),
 ('man', 15834),
 ('know', 15684),
 ('get', 14976),
 ('see', 11592),
 ('two', 11542),
 ('way', 11287),
 ('come', 10938),
 ('even', 10906),
 ('thing', 10732),
 ('hand', 10632),
 ('think', 10442),
 ('eye', 10264),
 ('us', 10019),
 ('right', 9851),
 ('want', 9803),
 ('make', 9770),
 ('first', 9452),
 ('thought', 9387),
 ('well', 9305),
 ('got', 9065),
 ('turn', 9055),
 ('ship', 8655),
 ('take', 8556),
 ('littl', 8552),
 ('long', 8505),
 ('face', 8496),
 ('still', 8323),
 ('around', 8292),
 ('came', 8218),
 ('year', 8204),
 ('someth', 8132)]

In [306]:
[(x, word_freq_scifi[x]) for x in sorted_keys_scifi[2280:2330]]

[('ach', 289),
 ('angel', 289),
 ('cautious', 289),
 ('clerk', 289),
 ('pole', 289),
 ('reliev', 289),
 ('log', 289),
 ('expand', 288),
 ('shade', 288),
 ('sensit', 288),
 ('hut', 288),
 ('nervous', 288),
 ('initi', 288),
 ('inquir', 288),
 ('spoken', 287),
 ('estim', 287),
 ('throughout', 287),
 ('respond', 286),
 ('seed', 286),
 ('tightli', 286),
 ('snort', 286),
 ('etern', 286),
 ('cycl', 286),
 ('super', 286),
 ('deeper', 285),
 ('waist', 285),
 ('dave', 285),
 ('react', 285),
 ('gree', 285),
 ('bedroom', 284),
 ('neighbor', 284),
 ('hollow', 284),
 ('eric', 284),
 ('boardman', 284),
 ('gadget', 283),
 ('leather', 283),
 ('drum', 283),
 ('suspici', 283),
 ('specul', 283),
 ('booth', 283),
 ('blew', 283),
 ('scarc', 283),
 ('clip', 283),
 ('volunt', 283),
 ('reef', 283),
 ('packag', 282),
 ('chill', 282),
 ('tune', 282),
 ('stronger', 282),
 ('heap', 282)]

In [309]:
chosen_words_scifi = ['year', 'bedroom', 'way', 'eat', 'think', 'look', 'good', 'super', 'long']
print("Chosen words: \n")
for word in chosen_words_scifi:
    print("{}: {}".format(word, word_freq_scifi[word]))

Chosen words: 

year: 8204
bedroom: 284
way: 11287
eat: 1193
think: 10442
look: 18611
good: 8129
super: 286
long: 8505


In [310]:
chosen_words_neighbors_scifi = [get_closest_word(w, CBOW2_scifi_model, word_to_ix_scifi, vocab_scifi) for w in chosen_words_scifi]

In [311]:
for i, w in enumerate(chosen_words_scifi):
    print(f"{w}:{chosen_words_neighbors_scifi[i]}")

year:[('extract', 7.5398712158203125), ('scan', 7.825392723083496), ('allallu', 7.936631202697754), ('day', 8.008155822753906), ('entic', 8.28494644165039)]
bedroom:[('deaden', 5.900077819824219), ('instead', 6.141233444213867), ('cari', 6.186210632324219), ('rhythm', 6.204065322875977), ('musingli', 6.223509311676025)]
way:[('think', 4.8308587074279785), ('hing', 5.207362651824951), ('alway', 5.374516487121582), ('mean', 5.477041721343994), ('want', 5.497170448303223)]
eat:[('modern', 6.836548328399658), ('opportun', 6.878442764282227), ('thought', 6.9683661460876465), ('nausea', 6.972916126251221), ('tang', 7.009042739868164)]
think:[('know', 3.3312723636627197), ('sure', 4.284049034118652), ('find', 4.642031192779541), ('better', 4.791561603546143), ('want', 4.7957682609558105)]
look:[('stare', 5.401568412780762), ('turn', 5.785499572753906), ('think', 5.978990077972412), ('flew', 5.981897354125977), ('mightier', 6.16929292678833)]
good:[('fixtur', 6.120141983032227), ('desert', 6.4