<h1>Embedding Words and Types<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Why-Learn-Embeddings?" data-toc-modified-id="Why-Learn-Embeddings?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why Learn Embeddings?</a></span><ul class="toc-item"><li><span><a href="#Efficiency-of-Embeddings" data-toc-modified-id="Efficiency-of-Embeddings-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Efficiency of Embeddings</a></span></li><li><span><a href="#Approaches-to-Learning-Word-Embeddings" data-toc-modified-id="Approaches-to-Learning-Word-Embeddings-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Approaches to Learning Word Embeddings</a></span></li><li><span><a href="#The-Practical-Use-of-Pretrained-Word-Embeddings" data-toc-modified-id="The-Practical-Use-of-Pretrained-Word-Embeddings-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>The Practical Use of Pretrained Word Embeddings</a></span></li></ul></li><li><span><a href="#Example:-Learning-the-Continous-Bag-of-Words-Embeddings" data-toc-modified-id="Example:-Learning-the-Continous-Bag-of-Words-Embeddings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Example: Learning the Continous Bag of Words Embeddings</a></span><ul class="toc-item"><li><span><a href="#Data-Vectorization-Classes" data-toc-modified-id="Data-Vectorization-Classes-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data Vectorization Classes</a></span></li><li><span><a href="#The-CBOW-Model" data-toc-modified-id="The-CBOW-Model-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>The CBOW Model</a></span></li><li><span><a href="#Model-Training-&amp;-Evaluation" data-toc-modified-id="Model-Training-&amp;-Evaluation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Model Training &amp; Evaluation</a></span></li><li><span><a href="#Trained-Embeddings" data-toc-modified-id="Trained-Embeddings-3.4"><span 
class="toc-item-num">3.4&nbsp;&nbsp;</span>Trained Embeddings</a></span></li></ul></li><li><span><a href="#Example:-Transfer-Learning-Using-Pretrained-Embeddings-for-Document-Classification" data-toc-modified-id="Example:-Transfer-Learning-Using-Pretrained-Embeddings-for-Document-Classification-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Example: Transfer Learning Using Pretrained Embeddings for Document Classification</a></span><ul class="toc-item"><li><span><a href="#Data-Vectorization-classes" data-toc-modified-id="Data-Vectorization-classes-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Data Vectorization classes</a></span></li><li><span><a href="#The-NewsClassifier" data-toc-modified-id="The-NewsClassifier-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>The NewsClassifier</a></span></li><li><span><a href="#Utils" data-toc-modified-id="Utils-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Utils</a></span></li><li><span><a href="#Model-Training-&amp;-Evaluation" data-toc-modified-id="Model-Training-&amp;-Evaluation-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Model Training &amp; Evaluation</a></span><ul class="toc-item"><li><span><a href="#Inference" data-toc-modified-id="Inference-4.4.1"><span class="toc-item-num">4.4.1&nbsp;&nbsp;</span>Inference</a></span></li></ul></li></ul></li></ul></div>

## Introduction

*Representation learning*, or *embedding*, refers to learning a mapping from a discrete type to a point in a vector space. When the discrete types are words, the dense vector representation is called a _word embedding_. TF-IDF (Term Frequency-Inverse Document Frequency) is an example of a _count-based embedding_ method.
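
To make the count-based idea concrete, here is a minimal sketch of one common TF-IDF variant (raw term count times `log(N / document_frequency)`). The three toy documents are made up for illustration, and real libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "the monster awoke".split(),
    "the monster fled".split(),
    "the professor spoke".split(),
]
vocab = sorted({tok for doc in docs for tok in doc})
# Number of documents each token appears in
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf(doc):
    """One vector entry per vocabulary word: term count * log(N / df)."""
    counts = Counter(doc)
    return [counts[t] * math.log(len(docs) / doc_freq[t]) for t in vocab]


vector = tfidf(docs[0])
print(dict(zip(vocab, vector)))
```

Note that the vector has one dimension per vocabulary word, and that a word appearing in every document ("the") gets weight zero.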

## Why Learn Embeddings?

- The count-based representations are also called _distributional representations_ because their significant content or meaning is represented by multiple dimensions in the vector. These representations are not learned from the data but heuristically constructed.

**Benefits of Low Dimensional Learned Representations:**
- Reducing the dimensionality is computationally efficient.
- Count-based representations produce high-dimensional vectors that redundantly encode similar information along many dimensions and do not share statistical strength.
- Very high-dimensional inputs can cause real problems in machine learning and optimisation, a phenomenon often called the _curse of dimensionality_.
- Representations learned from task-specific data are better suited to the task at hand than heuristically constructed ones.
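
One way to read "do not share statistical strength" is that sparse vectors with one dimension per word treat every pair of distinct words as equally unrelated. The sketch below contrasts that with dense vectors; the dense values are hypothetical and hand-picked for illustration, not learned.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical dense vectors, hand-picked for illustration only
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

# One-hot: every distinct pair is orthogonal
print(cosine(one_hot["cat"], one_hot["dog"]), cosine(one_hot["cat"], one_hot["car"]))
# Dense: related words can sit closer together than unrelated ones
print(cosine(dense["cat"], dense["dog"]), cosine(dense["cat"], dense["car"]))
```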

### Efficiency of Embeddings

Multiplying a one-hot vector by a weight matrix simply selects the row indicated by the nonzero entry, so the multiplication can be replaced by an integer index lookup into the matrix.
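
A minimal sketch of this equivalence, assuming toy sizes and random weights chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3  # toy sizes, chosen for illustration
W = rng.normal(size=(vocab_size, embedding_dim))  # weight (embedding) matrix

token_index = 2
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0

via_matmul = one_hot @ W     # touches every row of W
via_lookup = W[token_index]  # reads a single row

print(np.allclose(via_matmul, via_lookup))
```

The lookup does the same job without ever building the one-hot vector, which is why embedding layers take integer indices as input.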

![Figure 5.1](../images/figure_5_1.png)

### Approaches to Learning Word Embeddings

Auxiliary Tasks used to train Word Embeddings:
- Given a sequence of words, predict the next word. This is also called the _language modeling task_.
- Given a sequence of words before and after, predict the missing word.
- Given a word, predict words that occur within a window, independent of the position.
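
The three tasks above differ mainly in how training examples are cut out of the text. The sketch below builds examples for each from a toy sentence; the function names and the `window` parameter are hypothetical, introduced only for illustration.

```python
def language_modeling_examples(tokens):
    """(previous words) -> next word."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cbow_examples(tokens, window=2):
    """(words before and after) -> missing center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """center word -> each word inside the window, position ignored."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
print(language_modeling_examples(tokens)[1])
print(cbow_examples(tokens)[2])
print(skipgram_examples(tokens)[:3])
```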

### The Practical Use of Pretrained Word Embeddings

In [1]:
# Loading Embeddings
# Download Embeddings file from https://www.kaggle.com/danielwillgeorge/glove6b100dtxt?select=glove.6B.100d.txt
%load_ext nb_black

import numpy as np
from annoy import AnnoyIndex


class PreTrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index: mapping from words to integer indices.
            word_vectors: list of numpy arrays.
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]), metric="euclidean")
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """
        Init from pretrained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 ... x0_N
            word1 x1_0 x1_1 x1_2 ... x1_N

        Args:
            embedding_file: location of the file
        Returns:
            instance of PreTrainedEmbeddings
        """
        word_to_index, word_vectors = {}, []
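        # GloVe files contain non-ASCII tokens, so read them explicitly as UTF-8
        with open(embedding_file, encoding="utf-8") as fp:
            # Iterate lazily instead of loading the whole file into memory
            for line in fp:
                line = line.rstrip("\n").split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])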

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index=word_to_index, word_vectors=word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word: Input word to get embedding for.
        Returns:
            an embedding for given word
        """
        return self.word_vectors[self.word_to_index[word]]

    def get_closed_to_vector(self, vector, n=1):
        """
        Given a vector, return its n nearest neighbors.

        Args:
            vector: should match the size of the vectors in the Annoy Index.
            n: the number of neighbors to return
        Returns:
            Unsorted list of words nearest to the given vector.
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor] for neighbor in nn_indices]

    def compute_and_print_analogy(self, word1, word2, word3):
        """
        Prints the solutions to analogies using word embeddings.

        Analogies are of the form: word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1, word2, word3
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closed_words = self.get_closed_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closed_words = [word for word in closed_words if word not in existing_words]
        if len(closed_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        for word4 in closed_words:
            print(f"{word1}:{word2} :: {word3}:{word4}")


embeddings = PreTrainedEmbeddings.from_embeddings_file("../data/glove.6B.100d.txt")


In [2]:
# Relationships between word embeddings

# Relationship 1: the relationship between gendered nouns and pronouns
print("the relationship between gendered nouns and pronouns")
embeddings.compute_and_print_analogy("man", "he", "woman")
print()

# Relationship 2: Verb-noun relationships
print("Verb-noun relationships")
embeddings.compute_and_print_analogy("fly", "plane", "sail")
print()

#  Relationship 3: Noun-noun relationships
print("Noun-noun relationships")
embeddings.compute_and_print_analogy("cat", "kitten", "dog")
print()

# Relationship 4: Hypernymy (broader category)
print("Hypernymy (broader category)")
embeddings.compute_and_print_analogy("blue", "color", "dog")
print()

# Relationship 5: Meronymy (part-to-whole)
print("Meronymy (part-to-whole)")
embeddings.compute_and_print_analogy("toe", "foot", "finger")
print()

# Relationship 6: Troponymy (difference in manner)
print("Troponymy (difference in manner)")
embeddings.compute_and_print_analogy("talk", "communicate", "read")
print()

# Relationship 7: Metonymy (convention / figures of speech)
print("Metonymy (convention / figures of speech)")
embeddings.compute_and_print_analogy("blue", "democrat", "red")
print()

# Relationship 8: Adjectival scales
print("Adjectival scales")
embeddings.compute_and_print_analogy("fast", "fastest", "young")
print()

the relationship between gendered nouns and pronouns
man:he :: woman:she
man:he :: woman:never

Verb-noun relationships
fly:plane :: sail:ship
fly:plane :: sail:vessel

Noun-noun relationships
cat:kitten :: dog:puppy
cat:kitten :: dog:puppies
cat:kitten :: dog:toddler

Hypernymy (broader category)
blue:color :: dog:behavior
blue:color :: dog:touch
blue:color :: dog:viewer

Meronymy (part-to-whole)
toe:foot :: finger:ground
toe:foot :: finger:pointing

Troponymy (difference in manner)
talk:communicate :: read:interpret
talk:communicate :: read:typed
talk:communicate :: read:correctly
talk:communicate :: read:instructions

Metonymy (convention / figures of speech)
blue:democrat :: red:republican
blue:democrat :: red:congressman
blue:democrat :: red:senator

Adjectival scales
fast:fastest :: young:female
fast:fastest :: young:fellow
fast:fastest :: young:younger




In [3]:
embeddings.compute_and_print_analogy("fast", "fastest", "small")
embeddings.compute_and_print_analogy("man", "king", "woman")
embeddings.compute_and_print_analogy("man", "doctor", "woman")

fast:fastest :: small:smallest
fast:fastest :: small:large
man:king :: woman:queen
man:king :: woman:monarch
man:king :: woman:throne
man:doctor :: woman:nurse
man:doctor :: woman:physician



In [4]:
embeddings.compute_and_print_analogy("sachin", "cricket", "messi")

sachin:cricket :: messi:rugby
sachin:cricket :: messi:soccer
sachin:cricket :: messi:football
sachin:cricket :: messi:club



In [5]:
embeddings.compute_and_print_analogy("nifty", "sensex", "nasdaq")

nifty:sensex :: nasdaq:index
nifty:sensex :: nasdaq:composite
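
The arithmetic behind `compute_and_print_analogy` can be sketched on a toy example. The 2-d vectors below are hypothetical and hand-picked so that the analogy works out exactly; real embeddings are learned and far noisier, as the outputs above show. The search here is exact brute force, whereas the class above relies on Annoy's approximate index.

```python
import numpy as np

# Hypothetical 2-d embeddings, hand-picked for illustration only
vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-2.0, 0.5]),
}


def nearest(vector, exclude, n=1):
    """Exact (brute-force) nearest neighbors by Euclidean distance."""
    candidates = [w for w in vectors if w not in exclude]
    candidates.sort(key=lambda w: np.linalg.norm(vectors[w] - vector))
    return candidates[:n]


# word1 : word2 :: word3 : ?
word1, word2, word3 = "man", "king", "woman"
vec4 = vectors[word3] + (vectors[word2] - vectors[word1])
print(nearest(vec4, exclude={word1, word2, word3}))
```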



## Example: Learning the Continous Bag of Words Embeddings

The CBOW model is a multiclass classification task: it scans over a text, builds a context window of words, removes the center word from the window, and classifies the remaining context to predict the missing word. It is essentially a fill-in-the-blank task: given a sentence with a missing word, the model's job is to figure out what that word should be.

![Figure 5.2](../images/figure_5_2.png)

### Data Vectorization Classes

In [133]:
import os
from argparse import Namespace
from collections import Counter
import json
import re
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

import utils

In [134]:
class Vocabulary(object):
    """
    Class to process text and extract vocabulary for mapping.
    """

    def __init__(
        self, token_to_idx=None, mask_token="<MASK>", add_unk=True, unk_token="<UNK>"
    ):
        """
        Args:
            token_to_idx: a pre-existing map of tokens to indices.
            mask_token: the MASK token to add into the Vocab.
            add_unk: a flag that indicates whether to add the UNK token.
            unk_token: the UNK token to add into the vocab.
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}
        self._add_unk = add_unk
        self._unk_token = unk_token
        self._mask_token = mask_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """
        Returns a dictionary that can be serialized.
        """
        return {
            "token_to_idx": self._token_to_idx,
            "add_unk": self._add_unk,
            "unk_token": self._unk_token,
            "mask_token": self._mask_token,
        }

    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiates the vocab from a serialized dictionary.
        """
        return cls(**contents)

    def add_token(self, token):
        """
        Update mapping dicts based on the token.

        Args:
            token: the item to add into the vocab.
        Returns:
            index: the integer corresponding to the token.
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """
        Add a list of tokens into the Vocabulary.

        Args:
            tokens: a list of string tokens.
        Returns:
            indices: a list of indices corresponding to the tokens.
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """
        Retrieve the index associated with the token or the UNK
        index if the token isn't present.

        Args:
            token: the token to lookup.
        Returns:
            index: the index corresponding to the token
        Notes:
            `unk_index` needs to be >= 0 (having been added into the vocab).
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """
        Return the token associated with the index.

        Args:
            index: the index to look up.
        Returns:
            token: the token corresponding to the index.
        Raises:
            KeyError: if the index is not in the vocab.
        """
        if index not in self._idx_to_token:
            raise KeyError(f"the index {index} is not in the vocab")
        return self._idx_to_token[index]

    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"

    def __len__(self):
        return len(self._token_to_idx)

In [135]:
class CBOWVectorizer(object):
    """
    The vectorizer which coordinates the vocabularies and puts them to use.
    """

    def __init__(self, cbow_vocab):
        """
        Args:
            cbow_vocab: maps words to integers.
        """
        self.cbow_vocab = cbow_vocab

    def vectorize(self, context, vector_length=-1):
        """
        Args:
            context: the string of words separated by a space.
            vector_length: an argument for forcing the length of index vector.
        """
        indices = [self.cbow_vocab.lookup_token(token) for token in context.split(" ")]
        if vector_length < 0:
            vector_length = len(indices)
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[: len(indices)] = indices
        out_vector[len(indices) :] = self.cbow_vocab.mask_index
        return out_vector

    @classmethod
    def from_dataframe(cls, cbow_df):
        """
        Instantiate the vectorizer from the dataset df.

        Args:
            cbow_df: the target dataset.
        Returns:
            an instance of the CBOWVectorizer.
        """
        cbow_vocab = Vocabulary()
        for index, row in cbow_df.iterrows():
            for token in row.context.split(" "):
                cbow_vocab.add_token(token)
            cbow_vocab.add_token(row.target)
        return cls(cbow_vocab)

    @classmethod
    def from_serializable(cls, contents):
        cbow_vocab = Vocabulary.from_serializable(contents["cbow_vocab"])
        return cls(cbow_vocab=cbow_vocab)

    def to_serializable(self):
        return {"cbow_vocab": self.cbow_vocab.to_serializable()}
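
The padding behavior of `vectorize` is easiest to see on a small input. The sketch below re-implements the same idea as a standalone function over a hypothetical toy vocabulary (a plain dict rather than the `Vocabulary` class), so it can be run in isolation.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 plays the role of the mask token
token_to_idx = {"<MASK>": 0, "<UNK>": 1, "the": 2, "monster": 3, "awoke": 4}


def vectorize(context, vector_length=-1):
    """Map tokens to indices, then right-pad with the mask index."""
    unk = token_to_idx["<UNK>"]
    indices = [token_to_idx.get(tok, unk) for tok in context.split(" ")]
    if vector_length < 0:
        vector_length = len(indices)
    out = np.full(vector_length, token_to_idx["<MASK>"], dtype=np.int64)
    out[: len(indices)] = indices
    return out


# "fled" is not in the toy vocabulary, so it maps to the UNK index
print(vectorize("the monster fled", vector_length=5))
```

Padding every context to the same length is what lets a batch of contexts be stacked into a single integer tensor.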

In [136]:
class CBOWDataset(Dataset):
    def __init__(self, cbow_df, vectorizer):
        """
        Args:
            cbow_df: the dataset.
            vectorizer: vectorizer instantiated from dataset.
        """
        self.cbow_df = cbow_df
        self._vectorizer = vectorizer
        measure_len = lambda context: len(context.split(" "))
        self._max_seq_length = max(map(measure_len, cbow_df.context))

        self.train_df = self.cbow_df[self.cbow_df.split == "train"]
        self.train_size = len(self.train_df)

        self.val_df = self.cbow_df[self.cbow_df.split == "val"]
        self.val_size = len(self.val_df)

        self.test_df = self.cbow_df[self.cbow_df.split == "test"]
        self.test_size = len(self.test_df)

        self._lookup_dict = {
            "train": (self.train_df, self.train_size),
            "val": (self.val_df, self.val_size),
            "test": (self.test_df, self.test_size),
        }
        self.set_split("train")

    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """
        Load dataset and make a new vectorizer.

        Args:
            cbow_csv: location of the dataset.
        Returns:
            an instance of CBOWDataset.
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split == "train"]
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, cbow_csv, vectorizer_filepath):
        """
        Load dataset and the corresponding vectorizer.
        Used in the case where the vectorizer has been cached for re-use.

        Args:
            cbow_csv: location of the dataset.
            vectorizer_filepath: location of the saved vectorizer.
        Returns:
            an instance of CBOWDataset.
        """
        cbow_df = pd.read_csv(cbow_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(cbow_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """
        A static method for loading the vectorizer from file.

        Args:
            vectorizer_filepath: the location of the serialized vectorizer.
        Returns:
            an instance of CBOWVectorizer.
        """
        with open(vectorizer_filepath) as fp:
            return CBOWVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """
        Saves the vectorizer to disk using json.

        Args:
            vectorizer_filepath: the location to save the vectorizer.
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """
        Returns the vectorizer.
        """
        return self._vectorizer

    def set_split(self, split="train"):
        """
        Selects the splits in the dataset using a column in the dataframe.
        """
        self._train_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """
        The primary entry point method for PyTorch dataset.

        Args:
            index: the index to the data point.
        Returns:
            a dictionary holding the data point's features(x_data) and label(y_target).
        """
        row = self._target_df.iloc[index]
        context_vector = self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)
        return {"x_data": context_vector, "y_target": target_index}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

### The CBOW Model

In [137]:
class CBOWClassifier(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        """
        Args:
            vocabulary_size: number of vocab items.
            embedding_size: size of the embeddings.
            padding_idx: default 0, Embedding will not use this index.
        """
        super(CBOWClassifier, self).__init__()
        self.embedding = nn.Embedding(
            num_embeddings=vocabulary_size,
            embedding_dim=embedding_size,
            padding_idx=padding_idx,
        )
        self.fc1 = nn.Linear(in_features=embedding_size, out_features=vocabulary_size)

    def forward(self, x_in, apply_softmax=False):
        """
        The forward pass of the classifier.

        Args:
            x_in: an input data tensor. x_in.shape should be (batch, input_dim).
            apply_softmax: a flag for the softmax activation.
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim).
        """
        x_embedded_sum = self.embedding(x_in).sum(dim=1)
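        # Pass training=self.training so dropout is disabled in eval mode;
        # F.dropout defaults to training=True and would otherwise always drop
        x_embedded_sum = F.dropout(x_embedded_sum, 0.3, training=self.training)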
        y_out = self.fc1(x_embedded_sum)
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
        return y_out
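
The shape flow of this forward pass can be traced with a NumPy sketch. It is a simplified stand-in, not the PyTorch model itself: sizes and weights are toy values, dropout is omitted, and the padding row is zeroed by hand to mimic what `padding_idx=0` does in `nn.Embedding`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 6, 4  # toy sizes for illustration
E = rng.normal(size=(vocab_size, embedding_size))
E[0] = 0.0  # padding row: contributes nothing to the sum
W = rng.normal(size=(embedding_size, vocab_size))
b = np.zeros(vocab_size)


def forward(x_in):
    """x_in: (batch, context_len) indices -> (batch, vocab_size) probabilities."""
    summed = E[x_in].sum(axis=1)                   # (batch, embedding_size)
    logits = summed @ W + b                        # (batch, vocab_size)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)    # softmax over the vocabulary


padded = np.array([[2, 3, 0, 0]])
unpadded = np.array([[2, 3]])
probs = forward(padded)
print(probs.shape, np.allclose(probs, forward(unpadded)))
```

Because the padding row is all zeros, a padded context and its unpadded version produce the same prediction.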

### Model Training & Evaluation

In [138]:
args = Namespace(
    # Data and Path information
    cbow_csv="../data/books/frankenstein_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="models/chapter05/cbow",
    # Model hyper parameters
    embedding_size=50,
    # Training hyper parameters
    seed=1337,
    num_epochs=50,
    learning_rate=0.0001,
    batch_size=32,
    early_stopping_criteria=5,
    # Runtime options
    cuda=True,
    catch_keyboard_interrupt=True,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))


# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")

print("Using CUDA: {}".format(args.cuda))


# Set seed for reproducibility
utils.set_seed_everywhere(args.seed, args.cuda)

# handle dirs
utils.handle_dirs(args.save_dir)

Expanded filepaths: 
	models/chapter05/cbow/vectorizer.json
	models/chapter05/cbow/model.pth
Using CUDA: False


In [139]:
if args.reload_from_files:
    print("Loading dataset and loading vectorizer")
    dataset = CBOWDataset.load_dataset_and_load_vectorizer(
        args.cbow_csv, args.vectorizer_file
    )
else:
    print("Loading dataset & creating vectorizer")
    dataset = CBOWDataset.load_dataset_and_make_vectorizer(args.cbow_csv)
    dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = CBOWClassifier(
    vocabulary_size=len(vectorizer.cbow_vocab), embedding_size=args.embedding_size
)
print(classifier)

Loading dataset & creating vectorizer
CBOWClassifier(
  (embedding): Embedding(6138, 50, padding_idx=0)
  (fc1): Linear(in_features=50, out_features=6138, bias=True)
)


In [140]:
classifier = classifier.to(args.device)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer, mode="min", factor=0.5, patience=1
)

train_state = utils.train_model(
    classifier=classifier,
    loss_func=loss_func,
    optimizer=optimizer,
    scheduler=scheduler,
    dataset=dataset,
    args=args,
)
train_state = utils.evaluate_test_split(
    classifier, dataset, loss_func, train_state, args
)


--------------- 0th Epoch Stats---------------
Training Loss=8.791101831822646, Training Accuracy=1.5703755040322613
Validation Loss=8.070989081438857, Validation Accuracy=4.176470588235291.
------------------------------------------------------------
--------------- 10th Epoch Stats---------------
Training Loss=5.938392519229839, Training Accuracy=13.904989919354827
Validation Loss=6.7094865484798625, Validation Accuracy=13.0.
------------------------------------------------------------
--------------- 20th Epoch Stats---------------
Training Loss=5.410481493799914, Training Accuracy=15.265877016129041
Validation Loss=6.579481416590075, Validation Accuracy=13.98529411764705.
------------------------------------------------------------
--------------- 30th Epoch Stats---------------
Training Loss=5.16410692624987, Training Accuracy=15.908518145161317
Validation Loss=6.535959159626684, Validation Accuracy=14.55882352941176.
------------------------------------------------------------
--
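
The training cell configures `ReduceLROnPlateau` with `factor=0.5` and `patience=1`. How `utils.train_model` calls the scheduler is not shown here, so the sketch below is an assumption: it steps the scheduler once per epoch with a made-up validation loss and records the resulting learning rate. The single dummy parameter and the SGD optimizer are stand-ins chosen only to keep the example small.

```python
import torch
import torch.optim as optim

param = torch.nn.Parameter(torch.zeros(1))  # stand-in for model parameters
optimizer = optim.SGD([param], lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=1
)

val_losses = [1.0, 0.9, 0.95, 0.96]  # made-up validation losses
lrs = []
for loss in val_losses:
    optimizer.step()      # a real loop would compute gradients first
    scheduler.step(loss)  # the scheduler watches the validation loss
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

The learning rate is cut only after the loss has failed to improve for more than `patience` consecutive epochs, so a single bad epoch leaves it unchanged.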

### Trained Embeddings

In [145]:
def pretty_print(results):
    """
    Pretty Print Embedding Results.
    """
    for item in results:
        print(f"[{item[1]}] = {item[0]}")
        
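def get_closest(target_word, word_to_idx, embeddings, n=5):
    """
    Get the words closest to your word.
    """
    # Lowercase once so the lookup and the self-match check agree
    target_word = target_word.lower()
    word_embedding = embeddings[word_to_idx[target_word]]
    distances = []
    for word, index in word_to_idx.items():
        if word == "<MASK>" or word == target_word:
            continue
        distances.append((word, torch.dist(word_embedding, embeddings[index])))
    # NOTE: this slice skips the single closest word and returns n + 1 results;
    # use [:n] to get exactly the n nearest neighbors instead.
    results = sorted(distances, key=lambda x: x[1])[1:n+2]
    return results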
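
`get_closest` computes one distance per Python loop iteration. As an alternative sketch (not the notebook's method), the same search can be done with a single vectorized NumPy call. The function name and the toy data are hypothetical; unlike `get_closest`, this version returns exactly the `n` nearest words.

```python
import numpy as np


def get_closest_vectorized(target_word, word_to_idx, embeddings, n=5):
    """Return the n nearest (word, distance) pairs, skipping the word itself and <MASK>."""
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    target_idx = word_to_idx[target_word.lower()]
    # Euclidean distance from the target row to every row at once
    dists = np.linalg.norm(embeddings - embeddings[target_idx], axis=1)
    order = [
        i for i in np.argsort(dists).tolist()
        if i != target_idx and idx_to_word[i] != "<MASK>"
    ]
    return [(idx_to_word[i], float(dists[i])) for i in order[:n]]


# Hypothetical toy data for illustration
toy_word_to_idx = {"<MASK>": 0, "a": 1, "b": 2, "c": 3}
toy_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [4.0, 4.0]])
print(get_closest_vectorized("a", toy_word_to_idx, toy_embeddings, n=2))
```

To apply it to the trained model, the embedding weights would first need converting to an array, for example with `classifier.embedding.weight.data.cpu().numpy()`.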

In [147]:
word = input("Enter a word:")
embeddings = classifier.embedding.weight.data
word_to_idx = vectorizer.cbow_vocab._token_to_idx
pretty_print(
    get_closest(word, word_to_idx, embeddings, n=5)
)

Enter a word:monster
[7.512077808380127] = cares
[7.6816558837890625] = griefs
[7.735276222229004] = saw
[7.779043674468994] = confused
[7.783658027648926] = truly
[7.7867045402526855] = relief


In [148]:
target_words = ['frankenstein', 'monster', 'science', 'sickness', 'lonely', 'happy']

embeddings = classifier.embedding.weight.data
word_to_idx = vectorizer.cbow_vocab._token_to_idx

for target_word in target_words: 
    print(f"======={target_word}=======")
    if target_word not in word_to_idx:
        print("Not in vocabulary")
        continue
    pretty_print(get_closest(target_word, word_to_idx, embeddings, n=5))

[7.184050559997559] = irradiated
[7.610386371612549] = men
[7.64546537399292] = enslaved
[7.70945405960083] = mode
[7.720515251159668] = professor
[7.734167098999023] = wound
[7.512077808380127] = cares
[7.6816558837890625] = griefs
[7.735276222229004] = saw
[7.779043674468994] = confused
[7.783658027648926] = truly
[7.7867045402526855] = relief
[6.982034206390381] = mutual
[6.998227596282959] = impression
[7.0495219230651855] = mist
[7.153514385223389] = swelling
[7.237594127655029] = darkened
[7.29238748550415] = nearly
[6.25612211227417] = while
[6.5490875244140625] = awoke
[6.605112075805664] = foundations
[6.687403202056885] = consoles
[6.693512916564941] = depend
[6.726706504821777] = literally
[6.723246097564697] = excessive
[6.872636795043945] = ought
[6.897297382354736] = moonlight
[7.065033912658691] = bed
[7.114339828491211] = three
[7.160595893859863] = superhuman
[6.369864463806152] = bottom
[6.404392242431641] = penetrated
[6.4275641441345215] = wand
[6.476714134216309] =

## Example: Transfer Learning Using Pretrained Embeddings for Document Classification

### Data Vectorization classes

In [102]:
import os
from argparse import Namespace
from collections import Counter
import json
import re
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

import utils

In [103]:
class Vocabulary(object):
    """
    Class to process text and extract vocabulary for mapping.
    """
    def __init__(self, token_to_idx=None):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices.
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {
            idx: token
            for token, idx in self._token_to_idx.items()
        }
        
    def to_serializable(self):
        """
        Returns a dictionary that can be serialized.
        """
        return {
            'token_to_idx': self._token_to_idx
        }
    
    @classmethod
    def from_serializable(cls, contents):
        """
        Instantiates the Vocabulary from a serialized dictionary.
        """
        return cls(**contents)
    
    def add_token(self, token):
        """
        Update mapping dicts based on the token.
        
        Args:
            token: the item to add into the Vocab.
        Returns:
            index: the integer corresponding to the token.
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    

    def add_many(self, tokens):
        """
        Add a list of tokens into the Vocab.
        
        Args:
            tokens: a list of string tokens.
        Returns:
            indices: a list of indices corresponding to the tokens.
        """
        return [self.add_token(token) for token in tokens]
    
    def lookup_token(self, token):
        """
        Retrieve the index associated with the token.
        
        Args:
            token: the token to look up.
        Returns:
            index: the index corresponding to the token.
        """
        return self._token_to_idx[token]
    
    def lookup_index(self, index):
        """
        Return the token associated with the index.
        
        Args:
            index: the index to lookup.
        Returns:
            token: the token corresponding to the index.
        Raises:
            KeyError: if the index is not in the vocab.
        """
        if index not in self._idx_to_token:
            raise KeyError(f"The index {index} is not in the Vocab.")
        return self._idx_to_token[index]
    
    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"
    
    def __len__(self):
        return len(self._token_to_idx)
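
The two dictionaries kept by `Vocabulary` form a bidirectional mapping: adding a token that is already known returns its existing index, and every index can be turned back into its token. The sketch below illustrates that round trip with a stripped-down stand-in; `TinyVocab` is a hypothetical name used only here, and the category strings are made up.

```python
class TinyVocab:
    """Hypothetical, minimal stand-in for the Vocabulary class above."""

    def __init__(self):
        self._token_to_idx = {}
        self._idx_to_token = {}

    def add_token(self, token):
        # Only assign a new index when the token has not been seen before
        if token not in self._token_to_idx:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return self._token_to_idx[token]

    def lookup_index(self, index):
        return self._idx_to_token[index]


vocab = TinyVocab()
indices = [vocab.add_token(t) for t in ["Business", "Sports", "Business"]]
tokens = [vocab.lookup_index(i) for i in indices]
print(indices)  # [0, 1, 0]
print(tokens)   # ['Business', 'Sports', 'Business']
```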

![Figure 5.3](../images/figure_5_3.png)

In [104]:
class SequenceVocabulary(Vocabulary):
    def __init__(
        self, token_to_idx=None, unk_token="<UNK>",
        mask_token="<MASK>", begin_seq_token="<BEGIN>",
        end_seq_token="<END>"
    ):
        super(SequenceVocabulary, self).__init__(token_to_idx)
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token
        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)
        
    def to_serializable(self):
        contents = super(SequenceVocabulary, self).to_serializable()
        contents.update(
            {
                'unk_token': self._unk_token,
                'mask_token': self._mask_token,
                'begin_seq_token': self._begin_seq_token,
                'end_seq_token': self._end_seq_token
            }
        )
        return contents
    
    def lookup_token(self, token):
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
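
Two details of `SequenceVocabulary` are easy to miss. Assuming the vocabulary starts empty, the four special tokens are added first and therefore take indices 0 to 3, with `<MASK>` at 0. And `lookup_token` falls back to the `<UNK>` index for any token it has not seen. A minimal sketch with a plain dictionary (the words are made up):

```python
# Special tokens are added first, so in a fresh vocabulary they take indices 0-3
token_to_idx = {}
for special in ["<MASK>", "<UNK>", "<BEGIN>", "<END>"]:
    token_to_idx[special] = len(token_to_idx)
unk_index = token_to_idx["<UNK>"]

# Ordinary tokens follow the special ones
for word in ["stocks", "rally"]:
    token_to_idx.setdefault(word, len(token_to_idx))


def lookup_token(token):
    # Fall back to the <UNK> index for out-of-vocabulary tokens
    return token_to_idx.get(token, unk_index)


print(lookup_token("rally"))     # 5
print(lookup_token("zeppelin"))  # 1, the <UNK> index
```

Because `<MASK>` lands on index 0 under this assumption, it lines up with the `padding_idx=0` default used by the classifier later in the notebook.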
        

In [105]:
class NewsVectorizer(object):
    def __init__(self, title_vocab, category_vocab):
        self.title_vocab = title_vocab
        self.category_vocab = category_vocab
        
    def vectorize(self, title, vector_length=-1):
        indices = [self.title_vocab.begin_seq_index]
        indices.extend(
            self.title_vocab.lookup_token(token)
            for token in title.split(" ")
        )
        indices.append(self.title_vocab.end_seq_index)
        
        if vector_length < 0:
            vector_length = len(indices)
            
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.title_vocab.mask_index
        return out_vector
    
    @classmethod
    def from_dataframe(cls, news_df, cutoff=25):
        category_vocab = Vocabulary()
        for category in sorted(set(news_df.category)):
            category_vocab.add_token(category)
        
        word_counts = Counter()
        for title in news_df.title:
            for token in title.split(" "):
                if token not in string.punctuation:
                    word_counts[token] += 1
        
        title_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count >= cutoff:
                title_vocab.add_token(word)
        return cls(title_vocab, category_vocab)
    
    @classmethod
    def from_serializable(cls, contents):
        title_vocab = SequenceVocabulary.from_serializable(
            contents['title_vocab']
        )
        category_vocab = Vocabulary.from_serializable(
            contents['category_vocab']
        )
        return cls(title_vocab=title_vocab, category_vocab=category_vocab)
    
    def to_serializable(self):
        return {
            'title_vocab': self.title_vocab.to_serializable(),
            'category_vocab': self.category_vocab.to_serializable()
        }
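
To make the padding behaviour of `vectorize` concrete, here is a small sketch that uses plain lists instead of a NumPy array. The index layout (0 for `<MASK>`, 1 for `<UNK>`, 2 for `<BEGIN>`, 3 for `<END>`) and the toy vocabulary are assumptions for illustration, and the sketch assumes `vector_length` is at least the title length plus two.

```python
def vectorize(title, token_to_idx, vector_length=-1):
    # Hypothetical index layout: 0=<MASK>, 1=<UNK>, 2=<BEGIN>, 3=<END>
    mask_index, unk_index, begin_index, end_index = 0, 1, 2, 3
    indices = [begin_index]
    indices.extend(token_to_idx.get(tok, unk_index) for tok in title.split(" "))
    indices.append(end_index)
    if vector_length < 0:
        vector_length = len(indices)
    # Right-pad with the mask index up to the requested length
    return indices + [mask_index] * (vector_length - len(indices))


toy_vocab = {"oil": 4, "prices": 5, "rise": 6}
padded = vectorize("oil prices soar", toy_vocab, vector_length=8)
print(padded)  # [2, 4, 5, 1, 3, 0, 0, 0]
```

The unknown word "soar" becomes 1, and the three trailing zeros are padding that brings every title to the same length so titles can be batched.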

In [106]:
class NewsDataset(Dataset):
    def __init__(self, news_df, vectorizer):
        self.news_df = news_df
        self._vectorizer = vectorizer
        
        measure_len = lambda content: len(content.split(" "))
        self._max_seq_length = max(map(measure_len, news_df.title)) + 2

        self.train_df = self.news_df[self.news_df.split == 'train']
        self.train_size = len(self.train_df)
        
        self.val_df = self.news_df[self.news_df.split == 'val']
        self.val_size = len(self.val_df)
        
        self.test_df = self.news_df[self.news_df.split == 'test']
        self.test_size = len(self.test_df)
        
        self._lookup_dict = {
            'train': (self.train_df, self.train_size),
            'val': (self.val_df, self.val_size),
            'test': (self.test_df, self.test_size)
        }
        self.set_split('train')
        
        class_counts = news_df.category.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.category_vocab.lookup_token(item[0])
        
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)
        
    @classmethod
    def load_dataset_and_make_vectorizer(cls, news_csv):
        news_df = pd.read_csv(news_csv)
        train_news_df = news_df[news_df.split == 'train']
        return cls(news_df, 
                   NewsVectorizer.from_dataframe(train_news_df))
    
    @classmethod
    def load_dataset_and_load_vectorizer(cls, news_csv, vectorizer_filepath):
        news_df = pd.read_csv(news_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(news_df, vectorizer)
    
    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath) as fp:
            return NewsVectorizer.from_serializable(json.load(fp))
    
    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, 'w') as fp:
            json.dump(self._vectorizer.to_serializable(), fp)
            
    def get_vectorizer(self):
        return self._vectorizer
    
    def set_split(self, split='train'):
        self._train_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
        
    def __len__(self):
        return self._target_size
    
    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        title_vector = self._vectorizer.vectorize(
            row.title, self._max_seq_length
        )
        category_index = self._vectorizer.category_vocab.lookup_token(
            row.category
        )
        return {
            'x_data': title_vector,
            'y_target': category_index
        }
    
    def get_num_batches(self, batch_size):
        return len(self) // batch_size
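
The class weights computed in `__init__` are inverse class frequencies, ordered by class index so that weight `i` belongs to class `i`. The sketch below reproduces that calculation with plain Python on made-up labels; the label counts and the `category_to_idx` mapping are illustrative assumptions.

```python
from collections import Counter

labels = ["Business"] * 6 + ["Sports"] * 3 + ["World"] * 1  # made-up data
category_to_idx = {"Business": 0, "Sports": 1, "World": 2}

counts = Counter(labels)
# Order the counts by class index so weight i lines up with class i
sorted_counts = sorted(counts.items(), key=lambda item: category_to_idx[item[0]])
class_weights = [1.0 / count for _, count in sorted_counts]
print(class_weights)  # rarer classes receive larger weights

# get_num_batches uses floor division, so a final partial batch is not counted
num_batches = len(labels) // 4
print(num_batches)  # 2
```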

### The NewsClassifier

In [107]:
class NewsClassifier(nn.Module):
    def __init__(
        self, embedding_size, num_embeddings, num_channels,
        hidden_dim, num_classes, dropout_p,
        pretrained_embeddings=None, padding_idx=0
    ):
        super(NewsClassifier, self).__init__()
        if pretrained_embeddings is None:
            self.emb = nn.Embedding(
                embedding_dim=embedding_size,
                num_embeddings=num_embeddings,
                padding_idx=padding_idx
            )
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.emb = nn.Embedding(
                embedding_dim=embedding_size,
                num_embeddings=num_embeddings,
                padding_idx=padding_idx,
                _weight=pretrained_embeddings
            )
        self.convnet = nn.Sequential(
            nn.Conv1d(
                in_channels=embedding_size, out_channels=num_channels,
                kernel_size=3
            ),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                     kernel_size=3),
            nn.ELU()
        )
        self._dropout_p = dropout_p
        self.fc1 = nn.Linear(num_channels, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        
    def forward(self, x_in, apply_softmax=False):
        # Embed, then permute (batch, seq_len, emb) -> (batch, emb, seq_len):
        # nn.Conv1d expects the channel (feature) dimension before the sequence
        x_embedded = self.emb(x_in).permute(0, 2, 1)
        features = self.convnet(x_embedded)
        
        # Average and remove the extra dimension
        remaining_size = features.size(dim=2)
        features = F.avg_pool1d(features, remaining_size).squeeze(dim=2)
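        # F.dropout defaults to training=True, so tie it to the module's mode
        features = F.dropout(features, p=self._dropout_p, training=self.training)
        
        # MLP Classifier
        intermediate_vector = F.relu(
            F.dropout(
                self.fc1(features),
                p=self._dropout_p,
                training=self.training))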
        prediction_vector = self.fc2(intermediate_vector)
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector
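
Each unpadded convolution shortens the sequence, which is why `forward` measures the remaining length before average pooling. The sketch below traces the sequence length through the four `Conv1d` layers using the standard output-length formula for a convolution without padding or dilation. The input lengths 30 and 17 are arbitrary examples, not values taken from the dataset.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    # Output length of an unpadded, undilated 1-D convolution
    return (length - kernel_size) // stride + 1


# (kernel_size, stride) for the four Conv1d layers in self.convnet
layers = [(3, 1), (3, 2), (3, 2), (3, 1)]


def trace_lengths(seq_len):
    lengths = []
    for kernel_size, stride in layers:
        seq_len = conv1d_out_len(seq_len, kernel_size, stride)
        lengths.append(seq_len)
    return lengths


print(trace_lengths(30))  # [28, 13, 6, 4]
print(trace_lengths(17))  # [15, 7, 3, 1]
```

Under this formula, an input of 17 positions is the shortest that still leaves one position after the last layer, so padded titles need to be at least that long for this stack.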

### Utils

In [108]:
def load_glove_from_file(glove_filepath):
    word_to_index = {}
    embeddings = []
    with open(glove_filepath, 'r', encoding='utf-8') as fp:
        for index, line in enumerate(fp):
            line = line.split(" ")
            word_to_index[line[0]] = index
            embedding_i = np.array(
                [float(val) for val in line[1:]]
            )
            embeddings.append(embedding_i)
    return word_to_index, np.stack(embeddings)

def make_embedding_matrix(glove_filepath, words):
    word_to_idx, glove_embeddings = load_glove_from_file(glove_filepath)
    embedding_size = glove_embeddings.shape[1]
    final_embeddings = np.zeros((len(words), embedding_size))
    for i, word in enumerate(words):
        if word in word_to_idx:
            final_embeddings[i, :] = glove_embeddings[word_to_idx[word]]
        else:
            embedding_i = torch.ones(1, embedding_size)
            torch.nn.init.xavier_uniform_(embedding_i)
            final_embeddings[i, :] = embedding_i
    return final_embeddings
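
The GloVe text format stores one word per line followed by its vector components, separated by spaces. The sketch below parses a tiny in-memory file in that format and aligns the vectors with a vocabulary, using plain lists rather than NumPy. The two three-dimensional vectors and the vocabulary are made up for illustration.

```python
import io

# Two made-up 3-dimensional vectors in GloVe's "word v1 v2 ..." text format
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

word_to_index = {}
embeddings = []
for index, line in enumerate(fake_glove):
    parts = line.rstrip().split(" ")
    word_to_index[parts[0]] = index
    embeddings.append([float(val) for val in parts[1:]])

vocab_words = ["<MASK>", "cat", "the"]
# Row i of the matrix must correspond to vocabulary index i.
# Words missing from the file are left as None in this sketch;
# the notebook fills those rows with Xavier-initialized values instead.
matrix = [
    embeddings[word_to_index[w]] if w in word_to_index else None
    for w in vocab_words
]
print(matrix)  # [None, [0.4, 0.5, 0.6], [0.1, 0.2, 0.3]]
```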

### Model Training & Evaluation

In [109]:

args = Namespace(
    # Data and Path hyper parameters
    news_csv="../data/ag_news/news_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="models/chapter05/document_classification",
    # Model hyper parameters
    glove_filepath='../data/glove.6B.100d.txt', 
    use_glove=False,
    embedding_size=100, 
    hidden_dim=100, 
    num_channels=100, 
    # Training hyper parameter
    seed=1337, 
    learning_rate=0.001, 
    dropout_p=0.1, 
    batch_size=128, 
    num_epochs=100, 
    early_stopping_criteria=5, 
    # Runtime option
    cuda=True, 
    catch_keyboard_interrupt=True, 
    reload_from_files=False,
    expand_filepaths_to_save_dir=True
) 

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
    
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

# Set seed for reproducibility
utils.set_seed_everywhere(args.seed, args.cuda)

# handle dirs
utils.handle_dirs(args.save_dir)

Expanded filepaths: 
	models/chapter05/document_classification/vectorizer.json
	models/chapter05/document_classification/model.pth
Using CUDA: False


In [110]:
args.use_glove = True

if args.reload_from_files:
    # training from a checkpoint
    dataset = NewsDataset.load_dataset_and_load_vectorizer(args.news_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    dataset = NewsDataset.load_dataset_and_make_vectorizer(args.news_csv)
    dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()

# Use GloVe or randomly initialized embeddings
if args.use_glove:
    words = vectorizer.title_vocab._token_to_idx.keys()
    embeddings = make_embedding_matrix(glove_filepath=args.glove_filepath, 
                                       words=words)
    print("Using pre-trained embeddings")
else:
    print("Not using pre-trained embeddings")
    embeddings = None

classifier = NewsClassifier(embedding_size=args.embedding_size, 
                            num_embeddings=len(vectorizer.title_vocab),
                            num_channels=args.num_channels,
                            hidden_dim=args.hidden_dim, 
                            num_classes=len(vectorizer.category_vocab), 
                            dropout_p=args.dropout_p,
                            pretrained_embeddings=embeddings,
                            padding_idx=0)

Using pre-trained embeddings


In [111]:
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(
    classifier.parameters(),
    lr=args.learning_rate
)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer,
    mode='min',
    factor=0.5,
    patience=1
)
train_state = utils.train_model(
    classifier=classifier,
    loss_func=loss_func,
    optimizer=optimizer,
    scheduler=scheduler,
    dataset=dataset,
    args=args,
)
train_state = utils.evaluate_test_split(
    classifier, dataset, loss_func, train_state, args
)

Training Routine:   0%|          | 0/100 [00:00<?, ?it/s]

split=train:   0%|          | 0/656 [00:00<?, ?it/s]

split=val:   0%|          | 0/140 [00:00<?, ?it/s]

--------------- 0th Epoch Stats---------------
Training Loss=0.7482051128839566, Training Accuracy=70.51733993902434
Validation Loss=0.5905174578939169, Validation Accuracy=78.16406250000003.
------------------------------------------------------------
--------------- 10th Epoch Stats---------------
Training Loss=0.35484601164282104, Training Accuracy=86.50438262195128
Validation Loss=0.5897864873920168, Validation Accuracy=79.5591517857143.
------------------------------------------------------------
--------------- 20th Epoch Stats---------------
Training Loss=0.3304986624592324, Training Accuracy=87.51548208841474
Validation Loss=0.625057249196938, Validation Accuracy=79.1685267857143.
------------------------------------------------------------
--------------- 30th Epoch Stats---------------
Training Loss=0.32986575882972735, Training Accuracy=87.51786394817078
Validation Loss=0.6239482183541574, Validation Accuracy=79.16852678571423.
-----------------------------------------------
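
The scheduler configured above halves the learning rate when the monitored value stops improving. The sketch below is a simplified, pure-Python imitation of that rule for `mode='min'`, `factor=0.5`, and `patience=1`. It ignores the cooldown, minimum learning rate, and epsilon options, and it assumes the training loop passes the validation loss to the scheduler once per epoch. The loss values are made up.

```python
def simulate_plateau(val_losses, lr=0.001, factor=0.5, patience=1, threshold=1e-4):
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1 - threshold):
            # A relative improvement resets the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` epochs without improvement: shrink the rate
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


history = simulate_plateau([0.60, 0.59, 0.61, 0.62, 0.63, 0.64])
print(history)  # [0.001, 0.001, 0.001, 0.0005, 0.0005, 0.00025]
```

With `patience=1`, one bad epoch is tolerated and the second consecutive one triggers the reduction.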

#### Inference

In [119]:
def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
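
A quick way to see what `preprocess_text` does is to run it on a couple of strings. The function is repeated here only so the snippet runs on its own; the first input is one of the validation titles shown in this notebook and the second is made up.

```python
import re


def preprocess_text(text):
    # Same steps as the cell above, repeated so this snippet is self-contained
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)      # pad punctuation with spaces
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)   # collapse everything else
    return text


cleaned = preprocess_text("Airbus Raises Delivery Target for 2004")
spaced = preprocess_text("Hello, World!")
print(repr(cleaned))  # 'airbus raises delivery target for '
print(repr(spaced))   # 'hello , world ! '
```

Digits are dropped, and both results end in a space. Since `vectorize` splits on a single space, that trailing space produces an empty final token, which falls back to `<UNK>`.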

In [126]:
def predict_category(title, classifier, vectorizer, max_length):
    title = preprocess_text(title)
    vectorized_title = torch.tensor(vectorizer.vectorize(title, vector_length=max_length))
    result = classifier(vectorized_title.unsqueeze(0),
                       apply_softmax=True)
    probability_values, indices = result.max(dim=1)
    predicted_category = vectorizer.category_vocab.lookup_index(indices.item())
    return {
        'category': predicted_category,
        'probability': probability_values.item()
    }
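
`predict_category` turns the classifier's raw scores into probabilities with a softmax and reports the largest one. The sketch below shows that step in plain Python. The category order and the logits are made-up values for illustration, not outputs of the trained model.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


categories = ["Business", "Sci/Tech", "Sports", "World"]  # illustrative order
logits = [2.0, 0.5, 0.1, -1.0]                            # made-up scores
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"category": categories[best], "probability": probs[best]}
print(prediction["category"])
```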

In [131]:
def get_samples():
    samples = {}
    for cat in dataset.val_df.category.unique():
        samples[cat] = dataset.val_df.title[
            dataset.val_df.category == cat
        ].tolist()[:15]
    return samples

In [132]:
val_samples = get_samples()

classifier = classifier.to('cpu')
for truth, sample_group in val_samples.items():
    print(f"True Category: {truth}")
    print("=" * 30)
    for sample in sample_group:
        prediction = predict_category(
            sample, classifier, vectorizer, dataset._max_seq_length + 1
        )
        print(f"Prediction: {prediction['category']} "
              f"(p={prediction['probability']:.2f})")
        print(f"\t + Sample={sample}")
    print("-" * 30 + "\n")

True Category: Business
Prediction: Business (p=0.87)
	 + Sample=AZ suspends marketing of cancer drug
Prediction: Business (p=0.99)
	 + Sample=Business world has mixed reaction to Perez move
Prediction: Sports (p=0.69)
	 + Sample=Betting Against Bombay
Prediction: Sports (p=0.39)
	 + Sample=Malpractice Insurers Face a Tough Market
Prediction: Sports (p=0.76)
	 + Sample=NVIDIA Is Vindicated
Prediction: Sci/Tech (p=0.84)
	 + Sample=It Takes Time to Judge the True Impact of New Technology
Prediction: Business (p=0.76)
	 + Sample=Union agrees to Karstadt job cuts
Prediction: Sports (p=0.99)
	 + Sample=QRS Jilts JDA, Teams with Inovis
Prediction: Sci/Tech (p=0.86)
	 + Sample=Night flights fuel dispute
Prediction: Business (p=0.99)
	 + Sample=Intel Profit Up; Outlook Reassures Market
Prediction: Business (p=0.69)
	 + Sample=Airbus Raises Delivery Target for 2004
Prediction: Sports (p=0.34)
	 + Sample=Bribery Considered, Halliburton Notes Suggest
Prediction: Business (p=0.99)
	 + Sample=Mutual Funds Weigh