
# Embedding Words and Types

Part of implementing any NLP task involves dealing with different kinds of discrete types. Examples of discrete types are:

* words
* characters
* parts-of-speech tags (POS)
* named entities
* named entity types
* parse features 
* items in a product catalog

Any input feature that comes from a finite (or countably infinite) set, also known as a vocabulary, is a *discrete type*.

One of the core successes of deep learning in NLP is representing discrete types as dense vectors. "Representation learning" or "embedding" refers to learning a mapping from a discrete type to a point in a vector space. In the context of words, this mapping is referred to as a *word embedding*. Other embedding methods exist, such as count-based embeddings (TF-IDF). The focus here is on *learning-based* or *prediction-based* embedding methods, where the representations are learned by maximizing an objective for a specific learning task, such as predicting a word from its context. Learned embeddings are now so central to modern NLP that adding one can generally be expected to improve performance on an NLP task.

## Why Learn Embeddings?

Learned embeddings have several advantages over more classical representations, such as count-based methods that are heuristically constructed.

First, they are more computationally efficient, since their size does not scale with the size of the vocabulary. Second, count-based methods produce high-dimensional vectors that encode redundant information along many dimensions. Third, very high-dimensional inputs make machine learning models hard to fit (*the curse of dimensionality*). Finally, learned representations are tuned to the task at hand, whereas count-based or low-dimensional approaches (SVD and PCA) are not necessarily optimized for the relevant task.

### Efficiency of Embeddings

One of the major efficiencies of word embeddings is that they are typically much smaller than one-hot or count-based representations. Typical sizes range between 25 and 500 dimensions, usually dictated by hardware limitations.
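To make the size difference concrete, here is a minimal sketch comparing a one-hot vector with an embedding lookup. The vocabulary size, embedding dimension, and the random matrix standing in for learned weights are all assumptions chosen for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 100   # hypothetical sizes for illustration

rng = np.random.default_rng(0)
# Stand-in for a learned embedding matrix: one row per vocabulary item.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_index = 42                          # arbitrary token ID

# One-hot route: a vocab_size-long vector with a single 1, multiplied by the matrix.
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0
via_matmul = one_hot @ embedding_matrix

# Lookup route: simply read the matching row.
via_lookup = embedding_matrix[token_index]

print(one_hot.shape, via_lookup.shape)      # (10000,) (100,)
print(np.allclose(via_matmul, via_lookup))  # True
```

Both routes give the same vector, but the lookup never materializes the 10,000-dimensional input.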


### Approaches to Learning Word Embeddings

All word embedding methods train on just words (that is, unlabeled text), yet in a supervised fashion. This is accomplished by constructing *auxiliary* tasks in which the data is implicitly labeled. Some examples:

* given a sequence of words, predict the next word (also called the *language modeling* task)
* given a sequence of words before and after, predict the missing word
* given a word, predict words that occur within a window, independent of the position
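The three auxiliary tasks above can be sketched as functions that turn unlabeled tokens into (input, label) pairs. The function names, the `history` and `window` parameters, and the sample sentence are hypothetical and only meant to show where the implicit labels come from.

```python
def language_modeling_pairs(tokens, history=2):
    """(previous words) -> next word."""
    return [(tokens[i - history:i], tokens[i])
            for i in range(history, len(tokens))]

def fill_in_blank_pairs(tokens, window=1):
    """(words before and after) -> missing center word."""
    return [(tokens[i - window:i] + tokens[i + 1:i + 1 + window], tokens[i])
            for i in range(window, len(tokens) - window)]

def window_pairs(tokens, window=1):
    """center word -> each word within the window, position ignored."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = "the cat sat on the mat".split()
print(language_modeling_pairs(tokens)[0])  # (['the', 'cat'], 'sat')
print(fill_in_blank_pairs(tokens)[0])      # (['the', 'sat'], 'cat')
print(window_pairs(tokens)[:2])            # [('the', 'cat'), ('cat', 'the')]
```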

Generally, it's more worthwhile to use a pretrained word embedding and fine-tune than to create one from scratch.

### The Practical Use of Pretrained Word Embeddings



#### Loading Embeddings

Some popular pretrained word embeddings are:

* Word2Vec 
* GloVe
* FastText

The typical file format for these embeddings is as follows: each line starts with the word/type that is being embedded and is followed by a sequence of numbers (the vector representation). The length of this sequence is the dimension of the representation (embedding dimension). 
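As a quick illustration of this layout, the sketch below parses two made-up lines from an in-memory file. The words and values are invented for the example and are not real GloVe vectors.

```python
import io

# Two made-up lines in the "word x0 x1 ... xN" layout.
sample_file = io.StringIO(
    "cat 0.1 -0.2 0.3\n"
    "dog 0.0 0.5 -0.4\n"
)

vectors = {}
for line in sample_file:
    # The first field is the word; the rest are the vector components.
    word, *values = line.rstrip().split(" ")
    vectors[word] = [float(v) for v in values]

embedding_dim = len(vectors["cat"])
print(embedding_dim)    # 3
print(vectors["dog"])   # [0.0, 0.5, -0.4]
```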

A utility class called `PretrainedEmbeddings` is used to load and process embeddings. This class builds an in-memory index of all the word vectors for quick lookups and nearest-neighbor queries using an approximate nearest-neighbor package called `annoy`.

*Example 5-1. Using pretrained word embeddings*

In [1]:
%%shell
# download glove model 
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove*.zip





In [3]:
!pip install annoy

import numpy as np
from annoy import AnnoyIndex

class PretrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index (dict): mapping from word to integers
            word_vectors (list of numpy arrays)
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = \
            {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]),
                                metric='euclidean')
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    @classmethod 
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from pretrained vector file.

        Vector file should be the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PretrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []
        with open(embedding_file) as fp:
            for line in fp.readlines():
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index, word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns:
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]
    
    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors
        Args:
            vector (np.ndarray): should match the size of the vectors in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words nearest to the given vector
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor]
                    for neighbor in nn_indices]
    
    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # Simple hypothesis: Analogy is spatial relationship
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words 
                                if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        
        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4) )



In [4]:
# load glove embeddings
embeddings = \
    PretrainedEmbeddings.from_embeddings_file('glove.6B.100d.txt')

In [5]:
# print the index of the word 'working'
print(embeddings.word_to_index['working'])
# print the word with index 500
print(embeddings.index_to_word[500])
# print the word embedding of the word 'working'
embeddings.word_vectors[500]

500
working


array([ 0.076552 ,  0.17843  , -0.44464  ,  0.085718 ,  0.28268  ,
       -0.30546  , -0.30637  ,  0.36632  , -0.19919  ,  0.35636  ,
        0.088981 , -0.7717   ,  0.68709  , -0.055057 , -0.47002  ,
       -0.52158  ,  0.58331  , -0.32255  , -0.28368  , -0.020115 ,
        0.12133  ,  0.63264  ,  0.2717   , -0.61169  , -0.015634 ,
       -0.54613  , -0.19113  , -0.77745  , -0.048714 ,  0.38825  ,
       -0.68519  ,  0.71731  , -0.075302 , -0.26239  , -0.013498 ,
        0.19442  , -0.19793  ,  0.040908 ,  0.78602  , -0.049446 ,
       -0.83782  ,  0.10923  , -0.15471  ,  0.12925  , -0.26784  ,
        0.045059 , -0.60726  , -0.41779  , -0.063718 , -0.58079  ,
       -0.40284  , -0.37669  , -0.18443  ,  1.4475   , -0.15176  ,
       -2.215    , -0.22298  , -0.28886  ,  1.3392   ,  0.55239  ,
        0.022604 ,  0.70506  , -0.34004  , -0.26593  ,  0.80853  ,
        0.26161  ,  0.38258  ,  0.44347  ,  0.37905  ,  0.26225  ,
        0.082587 , -0.049931 , -0.19572  , -0.48894  ,  0.2075

#### Relationships between word embeddings

One of the features of word embeddings is that they encode syntactic and semantic relationships between words. For words such as cats and dogs, the respective word vectors are close together relative to other words, such as ducks or elephants.

One way to explore these similarities is the analogy task: $$\text{Word1 : Word2 :: Word3 : ____}$$

In this task, three words are provided and the fourth is missing. The fourth word is chosen so that its relationship to the third word matches the relationship between the first two. With word embeddings, these relationships are encoded spatially. Subtracting the embedding of `Word1` from that of `Word2` yields a difference vector that represents this relationship. Adding this difference vector to the embedding of `Word3` produces a new vector that is close to the fourth word. A nearest-neighbor query over the other word embeddings then finds the word most similar to this output vector, solving the analogy task.
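The arithmetic can be checked by hand on a toy example. The sketch below uses hand-made 2-D vectors (hypothetical values, not real embeddings) and an exact Euclidean search in place of the approximate `annoy` index; `solve_analogy` is an illustrative helper, not part of the class used in this notebook.

```python
import numpy as np

# Hand-made 2-D "embeddings", chosen so the arithmetic is easy to follow.
toy_vectors = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
    "apple": np.array([-1.0, 0.5]),
}

def solve_analogy(word1, word2, word3, vectors):
    """word1 : word2 :: word3 : ? using an exact Euclidean nearest neighbor."""
    target = vectors[word3] + (vectors[word2] - vectors[word1])
    # Exclude the three query words, as the analogy method does.
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

print(solve_analogy("man", "king", "woman", toy_vectors))  # queen
```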

*Example 5-2. The analogy task using word embeddings*

```
    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns:
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]
    
    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors
        Args:
            vector (np.ndarray): should match the size of the vectors in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words nearest to the given vector
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor]
                    for neighbor in nn_indices]
    
    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # Simple hypothesis: Analogy is spatial relationship
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words 
                                if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        
        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4) )
```

*Example 5-3. Word embeddings encode many linguistic relationships, as illustrated using the SAT analogy task*

In [6]:
# Relationship 1: the relationship between gendered nouns and pronouns
embeddings.compute_and_print_analogy('man', 'he', 'woman')

man : he :: woman : she
man : he :: woman : her


In [7]:
# Relationship 2: Verb-noun relationships
embeddings.compute_and_print_analogy('fly', 'plane', 'sail')

fly : plane :: sail : ship
fly : plane :: sail : vessel
fly : plane :: sail : boat


In [8]:
# Relationship 3: Noun-noun relationships
embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')

cat : kitten :: dog : puppy
cat : kitten :: dog : puppies
cat : kitten :: dog : junkyard


In [9]:
# Relationship 4: Hypernymy (broader category)
embeddings.compute_and_print_analogy('blue', 'color', 'dog')

blue : color :: dog : animal
blue : color :: dog : pet
blue : color :: dog : taste
blue : color :: dog : touch


In [10]:
# Relationship 5: Meronymy (part-to-whole)
embeddings.compute_and_print_analogy('toe', 'foot', 'finger')

toe : foot :: finger : hand
toe : foot :: finger : kept
toe : foot :: finger : ground


In [11]:
# Relationship 6: Troponymy (difference in manner)
embeddings.compute_and_print_analogy('talk', 'communicate', 'read')

talk : communicate :: read : interpret
talk : communicate :: read : communicated
talk : communicate :: read : transmit


In [12]:
# Relationship 7: Metonymy (convention / figures of speech)
embeddings.compute_and_print_analogy('blue', 'democrat', 'red')

blue : democrat :: red : republican
blue : democrat :: red : congressman
blue : democrat :: red : senator


In [13]:
# Relationship 8: Adjectival scales
embeddings.compute_and_print_analogy('fast', 'fastest', 'young')

fast : fastest :: young : younger
fast : fastest :: young : sixth
fast : fastest :: young : fifth
fast : fastest :: young : seventh


*Example 5-4. An example illustrating the danger of using co-occurrences to encode meaning: sometimes they do not!*

In [14]:
embeddings.compute_and_print_analogy('fast', 'fastest', 'small')

fast : fastest :: small : smallest
fast : fastest :: small : largest
fast : fastest :: small : among
fast : fastest :: small : quarters


*Example 5-5. Watch out for protected attributes such as gender encoded in word embeddings. This can introduce unwanted biases in downstream models.*

In [15]:
embeddings.compute_and_print_analogy('man', 'king', 'woman')

man : king :: woman : queen
man : king :: woman : monarch
man : king :: woman : throne


*Example 5-6. Cultural gender bias encoded in vector analogy*

In [16]:
embeddings.compute_and_print_analogy('man', 'doctor', 'woman')

man : doctor :: woman : nurse
man : doctor :: woman : physician
man : doctor :: woman : doctors


The above example shows how codified cultural biases can manifest within word embeddings, something practitioners should be aware of.

## Example: Learning the Continuous Bag of Words Embeddings

This example walks through one of the most famous methods for constructing word embeddings, the Word2Vec Continuous-Bag-of-Words (CBOW) model. The CBOW model is a multiclass classification task that scans over text, creating a "context" window of words. The center word of this window is removed, and the remaining window is used to classify the missing word, essentially filling in the blank.

PyTorch's `nn.Embedding` layer will be used to implement the CBOW model. The `Embedding` layer encapsulates an embedding matrix and provides a mapping from a token's integer ID to a vector that is used in the neural network computation. When the model is training and updating weights, these vectors will also be updated. 


### The Frankenstein Dataset

The Frankenstein Dataset is a digitized version of Mary Shelley's novel *Frankenstein*, available at [Project Gutenberg](https://bit.ly/2T5iU8J).

Using the raw text file, the text is split into sentences with NLTK's `Punkt` tokenizer, and each sentence is then converted to lowercase and stripped of all punctuation. With the text preprocessed, a list of tokens can be retrieved by splitting on whitespace.

The data is then enumerated as a sequence of windows so that the CBOW model can be optimized. The list of tokens for each sentence is iterated over, grouping them into windows of a specific window size. As a quick example with window size of 2:

* i pitied frankenstein my pity amounted to horror i abhorred myself
* *i* <ins>pitied frankenstein</ins> my pity amounted to horror i abhorred myself
* <ins>i</ins> *pitied* <ins>frankenstein my</ins> pity amounted to horror i abhorred myself
* <ins>i pitied</ins> *frankenstein* <ins>my pity</ins> amounted to horror i abhorred myself
* i <ins>pitied frankenstein</ins> *my* <ins>pity amounted</ins> to horror i abhorred myself

The list above shows the sliding context window over the processed sentence "i pitied frankenstein my pity amounted to horror i abhorred myself", with the *target* italicized and the <ins>context window</ins> underlined.
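A minimal sketch of how such windows could be enumerated is shown here. The helper name `cbow_windows` and its signature are assumptions for illustration; the actual preprocessing script used to build the dataset is not shown in these notes.

```python
def cbow_windows(tokens, window_size=2):
    """Yield (context, target) pairs with up to window_size words per side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield " ".join(left + right), target

sentence = "i pitied frankenstein my pity amounted to horror i abhorred myself"
pairs = list(cbow_windows(sentence.split(" ")))

# Show the first four windows, matching the bullets in the list.
for context, target in pairs[:4]:
    print(f"{target!r:16} <- {context!r}")
```

Windows near the start or end of a sentence have fewer context words, which is why the vectorizer later pads contexts to a fixed length.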

Once the data is processed, it is split into the familiar train, validation, and test sets.

The data with splits is loaded into a Pandas `DataFrame` and indexed in the `CBOWDataset` class. The `__getitem__()` method uses the `Vectorizer` to convert the context into a vector. The target word is converted to an integer using the `Vocabulary`.

*Example 5-7. Constructing a dataset class for the CBOW task*

```
class CBOWDataset(Dataset):
    # ...existing implementation from Example 3-15
    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            cbow_csv (str): location of the dataset
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split=='train']
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dict with features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        context_vector = \
            self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)

        return {'x_data': context_vector,
                'y_target': target_index}
```

### Vocabulary, Vectorizer, DataLoader

The pipeline from text to vectorized minibatch is largely unchanged for the CBOW classification task. However, the `Vectorizer` does not construct one-hot vectors. Instead, a vector of integers representing the indices of the context is constructed and returned.

*Example 5-8. A Vectorizer for the CBOW data*

```
class CBOWVectorizer(object):
    """The Vectorizer which coordinates the Vocabularies and puts them to use"""

    def vectorize(self, context, vector_length=-1):
        """
        Args:
            context (str): the string of words separated by a space
            vector_length (int): an argument for forcing the length of index vector
        """

        indices = \
            [self.cbow_vocab.lookup_token(token) for token in context.split(' ')]
        if vector_length < 0:
            vector_length = len(indices)
        
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.cbow_vocab.mask_index

        return out_vector
```

An important note about the implementation: if the number of tokens in the context is less than the max length, the remaining entries are filled with the vocabulary's mask index (zero in this setup), so the vector is effectively zero-padded.
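The padding behavior can be seen in a standalone sketch. The toy `token_to_index` mapping and the free function `vectorize` are assumptions standing in for the `Vocabulary` and `CBOWVectorizer` classes, whose full definitions are not shown here.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for the mask (padding) token.
token_to_index = {"<MASK>": 0, "i": 1, "pitied": 2,
                  "frankenstein": 3, "my": 4, "pity": 5}
mask_index = token_to_index["<MASK>"]

def vectorize(context, vector_length=-1):
    """Map a space-separated context to indices, padded to vector_length."""
    indices = [token_to_index[token] for token in context.split(" ")]
    if vector_length < 0:
        vector_length = len(indices)
    # Start from an all-mask vector, then overwrite the leading positions.
    out_vector = np.full(vector_length, mask_index, dtype=np.int64)
    out_vector[:len(indices)] = indices
    return out_vector

print(vectorize("pitied frankenstein", vector_length=4))  # [2 3 0 0]
print(vectorize("i pitied my pity", vector_length=4))     # [1 2 4 5]
```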

### The CBOWClassifier Model

This model has three essential steps:

1. Indices representing the words of the context are passed through an `Embedding` layer to create a vector for each word in the context.
2. The vectors are combined so that the result captures the overall context. One option is to sum over the vectors (done here), but other arithmetic options are valid.
3. The context vector is passed through a `Linear` layer to compute a prediction vector. After a softmax, this vector represents a probability distribution over the entire vocabulary. The missing word (target) is predicted using the largest value in this vector.

The `Embedding` layer is parameterized by two numbers: the number of embeddings (size of the vocabulary) and the size of the embeddings (embedding dimension). A third argument, `padding_idx`, serves as a sentinel value for the embedding layer in situations where the data points may not all be the same length. 
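To see what `padding_idx` does in practice, the following sketch embeds two short index vectors and sums over the context dimension, as in the first two steps of the model. The sizes and input indices are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 6 vocabulary items, 3-dimensional embeddings, index 0 is padding.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=3, padding_idx=0)

# Two context windows of length 4; the first is padded with two zeros.
x_in = torch.tensor([[2, 3, 0, 0],
                     [1, 2, 4, 5]])

embedded = embedding(x_in)          # shape: (batch, context_length, embedding_dim)
context_sum = embedded.sum(dim=1)   # shape: (batch, embedding_dim)

print(embedded.shape)       # torch.Size([2, 4, 3])
print(context_sum.shape)    # torch.Size([2, 3])
# The padding row is all zeros, so padded positions add nothing to the sum.
print(torch.equal(embedded[0, 2], torch.zeros(3)))  # True
```

Because the padding vector is zero and receives no gradient updates, summing over a padded context gives the same result as summing over only its real tokens.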

*Example 5-9. The CBOWClassifier model*

```
class CBOWClassifier(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        """
        Args:
            vocabulary_size (int): number of vocabulary items, controls the 
                number of embeddings and prediction vector size
            embedding_size (int): size of the embeddings
            padding_idx (int): default 0; Embedding will not use this index
        """
        super(CBOWClassifier, self).__init__()

        self.embedding = nn.Embedding(num_embeddings=vocabulary_size,
                                      embedding_dim=embedding_size,
                                      padding_idx=padding_idx)
        self.fc1 = nn.Linear(in_features=embedding_size,
                             out_features=vocabulary_size)
    
    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be False if used with the cross-entropy losses
        Returns:
            the resulting tensor
                tensor.shape should be (batch, output_dim)
        """
        x_embedded_sum = self.embedding(x_in).sum(dim=1)
        y_out = self.fc1(x_embedded_sum)

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        return y_out 
```

### The Training Routine

The training routine remains unchanged from previous examples.

*Example 5-10. Arguments to the CBOW training script*

```
args = Namespace(
    # Data and path information
    cbow_csv="data/books/frankenstein_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch5/cbow",
    # Model hyperparameters
    embedding_size=300,
    # Training hyperparameters
    seed=1337,
    num_epochs=100,
    learning_rate=0.001,
    batch_size=128,
    early_stopping_criteria=5,
    # Runtime options omitted
)
```

### Model Evaluation and Prediction

The CBOW model is evaluated on how well it can predict a missing word given a context window. This implementation reaches only around 15% accuracy, far below the original. Part of the gap comes from omitted components that add significant complexity but boost performance. Additionally, the dataset is only around 70,000 words, far too little to properly learn word contexts and relationships. A state-of-the-art model requires terabytes of data. 

A more practical approach is to take an existing model that has been trained on one task and use it as an initializer for a new model. This approach is known as *transfer learning*.

## Example: Transfer Learning Using Pretrained Embeddings for Document Classification

This example will walk through:

* loading pretrained word embeddings
* fine-tuning these embeddings on the AG News Dataset
* using these embeddings in a CNN to classify categories based on headlines

To model the sequences of words in the news dataset, the `SequenceVocabulary` class will be used. 

### The AG News Dataset

The AG News dataset is a collection of more than one million news articles. A trimmed-down version will be used here: 120,000 articles split evenly between four categories (Sports, Science/Technology, World, and Business). Additionally, only the article headline will be used to predict which category the article belongs to. 

The text is preprocessed by removing punctuation, adding spaces around punctuation, and converting the text to lowercase. The processed data is split into the familiar train, validation, and test sets such that the class labels are evenly distributed across the three splits.
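One way to keep class labels evenly distributed across splits is to split each category separately. The sketch below is a hypothetical helper, not the script used to produce the dataset; the 70/15/15 proportions, the seed, and the made-up rows are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, proportions=(0.7, 0.15, 0.15), seed=1337):
    """Assign a 'split' field per row so each label keeps the same proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_train = round(proportions[0] * len(label_rows))
        n_val = round(proportions[1] * len(label_rows))
        for i, row in enumerate(label_rows):
            if i < n_train:
                row["split"] = "train"
            elif i < n_train + n_val:
                row["split"] = "val"
            else:
                row["split"] = "test"
    return rows

# Made-up rows standing in for the headline data: 20 per category.
rows = [{"title": f"headline {i}", "category": category}
        for category in ("World", "Sports", "Business", "Sci/Tech")
        for i in range(20)]
stratified_split(rows, "category")

world_train = sum(1 for r in rows
                  if r["category"] == "World" and r["split"] == "train")
print(world_train)  # 14
```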

The `__getitem__()` method retrieves the title string from a specific row in the dataset, vectorizes it with the `Vectorizer`, and pairs it with the integer representing the news category.

*Example 5-11. The* `NewsDataset.__getitem__()` *method*

```
class NewsDataset(Dataset):
    @classmethod
    def load_dataset_and_make_vectorizer(cls, news_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            news_csv (str): location of the dataset
        Returns:
            an instance of NewsDataset
        """
        news_df = pd.read_csv(news_csv)
        train_news_df = news_df[news_df.split=='train']
        return cls(news_df, NewsVectorizer.from_dataframe(train_news_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dict holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        title_vector = \
            self._vectorizer.vectorize(row.title, self._max_seq_length)
        
        category_index = \
            self._vectorizer.category_vocab.lookup_token(row.category)

        return {'x_data': title_vector,
                'y_target': category_index}
                
```

### Vocabulary, Vectorizer, and DataLoader

The `SequenceVocabulary` (a subclass of the `Vocabulary` class) bundles four special tokens used for sequence data:

* `UNK` (unknown) - allows the model to learn a representation for rare words so it can handle words not seen at test time
* `MASK` - sentinel token for the `Embedding` layer and loss calculations when the sequences have variable length
* `BEGIN-OF-SEQUENCE` and `END-OF-SEQUENCE` - marks the boundaries for a sequence 

To illustrate, take the toy example "Jerry is happy". Letting this be the entire corpus, the `SequenceVocabulary` would then be:

```
SequenceVocabulary
{
'<MASK>': 0,
'<UNK>': 1,
'<BEGIN-OF-SEQUENCE>': 2,
'<END-OF-SEQUENCE>': 3,
'is': 4,
'happy': 5
}
```

Here, "Jerry" is treated as a "rare" word for illustrative purposes. The sequence is then encoded as follows:

* Step 0: "Jerry is happy"
* Step 1: map words to integers -> [1, 4, 5]
* Step 2: add boundary tokens -> [2, 1, 4, 5, 3]
* Step 3: pad sequence for consistent length -> [2, 1, 4, 5, 3, 0, 0]
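The encoding steps above can be reproduced with a plain dictionary. The `encode` helper and the `max_length` argument are assumptions for illustration; lowercasing the input mirrors the preprocessing described earlier.

```python
# The toy vocabulary from the example, as a plain dict.
vocab = {"<MASK>": 0, "<UNK>": 1, "<BEGIN-OF-SEQUENCE>": 2,
         "<END-OF-SEQUENCE>": 3, "is": 4, "happy": 5}

def encode(text, max_length):
    # Step 1: map words to integers, falling back to <UNK> for unseen words.
    indices = [vocab.get(token, vocab["<UNK>"])
               for token in text.lower().split(" ")]
    # Step 2: add boundary tokens.
    indices = [vocab["<BEGIN-OF-SEQUENCE>"]] + indices + [vocab["<END-OF-SEQUENCE>"]]
    # Step 3: pad with <MASK> up to a consistent length.
    padding = [vocab["<MASK>"]] * (max_length - len(indices))
    return indices + padding

print(encode("Jerry is happy", max_length=7))  # [2, 1, 4, 5, 3, 0, 0]
```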

The second part of turning text to vectorized minibatch is the `Vectorizer`, which instantiates and encapsulates the use of the `SequenceVocabulary`. This implementation follows a previous example that restricts the total set of words by counting and retaining only those above a certain frequency. Limiting low frequency words assists the model by removing noise and also lowers the model's memory usage.
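Frequency-based filtering can be sketched with `collections.Counter`. The titles and the `cutoff` value here are made up; they only show which words survive the threshold.

```python
from collections import Counter

# Made-up headlines for illustration.
titles = ["stocks rise on earnings",
          "stocks fall on fears",
          "team wins on penalties"]

# Count every whitespace-separated token across all titles.
word_counts = Counter(token for title in titles for token in title.split(" "))

cutoff = 2  # hypothetical frequency threshold
kept = sorted(word for word, count in word_counts.items() if count >= cutoff)
print(kept)  # ['on', 'stocks']
```

Every word below the cutoff would be mapped to the `UNK` token at vectorization time.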

Once instantiated, the `Vectorizer`'s `vectorize()` method takes a news title and returns a vector that is as long as the longest title in the dataset. This method stores the maximum sequence length locally. Normally, the dataset tracks the max sequence length, and at inference time, the length of the test sequence is taken as the length of the vector. However, because the model used is a CNN model, a static length size is maintained at inference time. The outputs are zero-padded vectors of integers, representing the words of the input sequence. These sequences are also supplemented with the begin and end tokens described above. 

*Example 5-12. Implementing a Vectorizer for the AG News dataset*

```
class NewsVectorizer(object):
    def vectorize(self, title, vector_length=-1):
        """
        Args:
            title (str): the string of the words separated by spaces
            vector_length (int): forces the length of the index vector
        Returns:
            the vectorized title (numpy.array)
        """
        indices = [self.title_vocab.begin_seq_index]
        indices.extend(self.title_vocab.lookup_token(token)
                       for token in title.split(" "))
        indices.append(self.title_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.title_vocab.mask_index

        return out_vector

    @classmethod
    def from_dataframe(cls, news_df, cutoff=25):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            news_df (pandas.DataFrame): the target dataset
            cutoff (int): frequency threshold for including in Vocabulary
        Returns:
            an instance of the NewsVectorizer
        """
        category_vocab = Vocabulary()
        for category in sorted(set(news_df.category)):
            category_vocab.add_token(category)

        word_counts = Counter()
        for title in news_df.title:
            for token in title.split(" "):
                if token not in string.punctuation:
                    word_counts[token] += 1
        
        title_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count >= cutoff:
                title_vocab.add_token(word)

        return cls(title_vocab, category_vocab)
```

### The NewsClassifier Model

To initialize an embedding matrix with a pretrained embedding, the pretrained embedding is loaded from disk, a subset of embedding vectors is selected based on the words present in the data, and the `Embedding` layer's weight matrix is set to the loaded subset. Some words in the data may have no pretrained vector available. One approach to handle this situation is to initialize their word vectors with the Xavier Uniform method. 

*Example 5-13. Selecting a subset of the word embeddings based on the vocabulary*

```
def load_glove_from_file(glove_filepath):
    """Load the GloVe embeddings

    Args:
        glove_filepath(str): path to the glove embeddings file
    Returns:
        word_to_index (dict), embeddings (numpy.ndarray)
    """
    word_to_index = {}
    embeddings = []
    with open(glove_filepath, "r") as fp:
        for index, line in enumerate(fp):
            line = line.split(" ") # each line: word num1 num2 ...
            word_to_index[line[0]] = index # word = line[0]
            embedding_i = np.array([float(val) for val in line[1:]])
            embeddings.append(embedding_i)
    
    return word_to_index, np.stack(embeddings)

def make_embedding_matrix(glove_filepath, words):
    """Create embedding matrix for a specific set of words.

    Args:
        glove_filepath (str): file path to the glove embeddings
        words (list): list of words in the dataset
    Returns:
        final_embeddings (numpy.ndarray): embedding matrix
    """
    word_to_idx, glove_embeddings = load_glove_from_file(glove_filepath)
    embedding_size = glove_embeddings.shape[1]
    final_embeddings = np.zeros((len(words), embedding_size))

    for i, word in enumerate(words):
        if word in word_to_idx:
            final_embeddings[i, :] = glove_embeddings[word_to_idx[word]]
        else:
            # no pretrained vector for this word: draw a Xavier-uniform one
            embedding_i = torch.ones(1, embedding_size)
            torch.nn.init.xavier_uniform_(embedding_i)
            final_embeddings[i, :] = embedding_i
    
    return final_embeddings
```
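The selection step can be tried without downloading GloVe. The following is a rough sketch that uses only NumPy: `load_vectors` and `build_matrix` are hypothetical names, the three-dimensional vectors are made up, and the Xavier uniform draw is reimplemented by hand (bound of sqrt(6 / (fan_in + fan_out)) for a `(1, size)` row) rather than calling PyTorch.

```python
import os
import tempfile

import numpy as np


def load_vectors(path):
    """Parse a GloVe-style text file: one 'word v1 v2 ...' entry per line."""
    word_to_index, vectors = {}, []
    with open(path, "r", encoding="utf-8") as fp:
        for index, line in enumerate(fp):
            parts = line.rstrip().split(" ")
            word_to_index[parts[0]] = index
            vectors.append(np.array([float(val) for val in parts[1:]]))
    return word_to_index, np.stack(vectors)


def build_matrix(path, words, seed=0):
    """Copy known vectors; draw Xavier-uniform rows for unknown words."""
    rng = np.random.default_rng(seed)
    word_to_index, vectors = load_vectors(path)
    size = vectors.shape[1]
    # Xavier uniform bound for a (1, size) row: fan_in = size, fan_out = 1
    bound = np.sqrt(6.0 / (size + 1))
    matrix = np.zeros((len(words), size))
    for i, word in enumerate(words):
        if word in word_to_index:
            matrix[i] = vectors[word_to_index[word]]
        else:
            matrix[i] = rng.uniform(-bound, bound, size=size)
    return matrix


# Write a tiny, made-up file in GloVe's text format (values are illustrative).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the 0.1 0.2 0.3\nmarket 0.4 0.5 0.6\n")
    toy_path = tmp.name

matrix = build_matrix(toy_path, ["market", "zzz-unknown", "the"])
os.remove(toy_path)
print(matrix.shape)  # (3, 3)
```

Row order follows the order of `words`, not the order of the file, which is what lets the matrix line up with the vocabulary's token indices.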

The `NewsClassifier` uses an `Embedding` layer to map input token indices to vector representations. The weight matrix of this layer is replaced with the pretrained embeddings. The embedding is then used in the `forward()` method to map the indices to vectors, which are then fed into a sequence of convolution layers.
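As a small illustration of replacing the weight matrix, the sketch below uses PyTorch's public `nn.Embedding.from_pretrained` classmethod. This is an alternative to the private `_weight` keyword used in Example 5-14, not the approach taken there. The matrix values are random placeholders, and treating row 0 as the padding index is an assumption.

```python
import numpy as np
import torch
import torch.nn as nn

# Made-up pretrained matrix: 5 vocabulary entries, 4 dimensions each.
pretrained = np.random.rand(5, 4)
pretrained[0] = 0.0  # row 0 is assumed to be the padding/mask index

weights = torch.from_numpy(pretrained).float()
# freeze=False keeps the vectors trainable so they are fine-tuned with the model
emb = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)

batch = torch.tensor([[2, 3, 0], [4, 1, 0]])  # (batch=2, seq_len=3)
embedded = emb(batch)                          # (2, 3, 4)
channels_first = embedded.permute(0, 2, 1)     # (2, 4, 3), the layout Conv1d expects
print(channels_first.shape)  # torch.Size([2, 4, 3])
```

Note that `from_pretrained` defaults to `freeze=True`, so `freeze=False` is needed here to keep the weights trainable, as they are when the matrix is passed through `_weight`.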

*Example 5-14. Implementing the NewsClassifier*

```
class NewsClassifier(nn.Module):
    def __init__(self, embedding_size, num_embeddings, num_channels,
                 hidden_dim, num_classes, dropout_p,
                 pretrained_embeddings=None, padding_idx=0):
        """
        Args:
            embedding_size (int): the size of the embedding vectors
            num_embeddings (int): number of embedding vectors
            num_channels (int): number of convolutional kernels per layer
            hidden_dim (int): size of the hidden dimension
            num_classes (int): number of classes in the classification
            dropout_p (float): a dropout parameter
            pretrained_embeddings (numpy.array): previously trained word embeddings
                default is None.
            padding_idx (int): an index representing a null position
        """
        super(NewsClassifier, self).__init__()

        if pretrained_embeddings is None:
            self.emb = nn.Embedding(embedding_dim=embedding_size,
                                    num_embeddings=num_embeddings,
                                    padding_idx=padding_idx)

        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.emb = nn.Embedding(embedding_dim=embedding_size,
                                    num_embeddings=num_embeddings,
                                    padding_idx=padding_idx,
                                    _weight=pretrained_embeddings)

        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=embedding_size,
                   out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels,
                   out_channels=num_channels, kernel_size=3,
                   stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels,
                   out_channels=num_channels, kernel_size=3,
                   stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels,
                   out_channels=num_channels, kernel_size=3),
            nn.ELU()
        )

        self._dropout_p = dropout_p
        self.fc1 = nn.Linear(num_channels, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor
                x_in.shape should be (batch, dataset._max_seq_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the cross-entropy losses
        Returns:
            the resulting tensor
                tensor.shape should be (batch, num_classes)
        """
        # embed and permute so features are channels
        x_embedded = self.emb(x_in).permute(0, 2, 1)
        features = self.convnet(x_embedded)

        # average and remove the extra dimension
        remaining_size = features.size(dim=2)
        features = F.avg_pool1d(features, remaining_size).squeeze(dim=2)
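        # F.dropout defaults to training=True, so pass the module's mode explicitly
        features = F.dropout(features, p=self._dropout_p,
                             training=self.training)

        # final linear layer to produce classification outputs
        intermediate_vector = F.relu(F.dropout(self.fc1(features),
                                               p=self._dropout_p,
                                               training=self.training))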
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
```
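Because the convolutions use no padding, each layer shortens the sequence, and the average pooling in `forward()` collapses whatever length remains. The arithmetic can be sketched in plain Python with the standard output-length formula for an unpadded, undilated 1D convolution. The helper names are hypothetical, and the input length of 30 is only an example.

```python
def conv1d_out_length(length, kernel_size, stride=1):
    """Output length of an unpadded, undilated 1D convolution."""
    return (length - kernel_size) // stride + 1


# (kernel_size, stride) for the four Conv1d layers in NewsClassifier
layers = [(3, 1), (3, 2), (3, 2), (3, 1)]


def trace_lengths(seq_length):
    """Return the sequence length after each convolution layer."""
    lengths = []
    for kernel_size, stride in layers:
        seq_length = conv1d_out_length(seq_length, kernel_size, stride)
        lengths.append(seq_length)
    return lengths


print(trace_lengths(30))  # [28, 13, 6, 4]
print(trace_lengths(17))  # [15, 7, 3, 1]
```

With these kernel sizes and strides, 17 is the shortest input length that still leaves one position after the last layer, so padded titles need to be at least that long.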


### The Training Routine

The training routine follows the same steps as previously described.

*Example 5-15. Arguments to the CNN NewsClassifier using pretrained embeddings*

```
args = Namespace(
    # Data and path hyperparameters
    news_csv="data/ag_news/news_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch5/document_classification",
    # Model hyperparameters
    glove_filepath='data/glove/glove.6B.100d.txt',
    use_glove=False,
    embedding_size=100,
    hidden_dim=100,
    num_channels=100,
    # Training hyperparameters
    seed=1337,
    learning_rate=0.001,
    dropout_p=0.1,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    # Runtime options omitted
)
```
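The `early_stopping_criteria` argument is not explained in this section. One common reading, assumed here, is a patience counter: training stops once the validation loss has not improved for that many consecutive epochs. The sketch below illustrates that idea with a hypothetical `should_stop` helper and made-up loss values. It is not the training routine itself.

```python
def should_stop(val_losses, patience):
    """Return True once the best validation loss is `patience` or more epochs old."""
    best_epoch = val_losses.index(min(val_losses))
    epochs_since_best = len(val_losses) - 1 - best_epoch
    return epochs_since_best >= patience


# Made-up validation losses, one per epoch
history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70]

stopped_at = None
for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch], patience=5):
        stopped_at = epoch
        break

print(stopped_at)  # 8
```

The best loss occurs at epoch 3, so with a patience of 5 the loop stops after epoch 8, well before the `num_epochs` limit would be reached.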

### Model Evaluation and Prediction


#### Evaluating on the test dataset

Evaluating the model's performance on the test set follows the same procedure as in previous examples.

#### Predicting the category of novel news headlines

To perform inference on a single news headline, the text has to be preprocessed in the same manner as the training data and then passed through the same vectorization pipeline.

*Example 5-16. Predicting with the trained model*

```
def predict_category(title, classifier, vectorizer, max_length):
    """Predict a news category for a new title

    Args:
        title (str): a raw title string
        classifier (NewsClassifier): an instance of the trained classifier
        vectorizer (NewsVectorizer): the corresponding vectorizer
        max_length (int): the max sequence length
            Note: CNNs are sensitive to the input data tensor size.
              This ensures it's kept to the same size as the training data.
        """
        title = preprocess_text(title)
        vectorized_title = \
            torch.tensor(vectorizer.vectorize(title, vector_length=max_length))
        result = classifier(vectorized_title.unsqueeze(0), apply_softmax=True)
        probability_values, indices = result.max(dim=1)
        predicted_category = vectorizer.category_vocab.lookup_index(indices.item())

        return {'category': predicted_category,
                'probability': probability_values.item()}
```
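The last three lines of `predict_category` turn the classifier's output into a label: apply softmax, take the largest probability, and look up its index in the category vocabulary. A plain-Python sketch of that step follows. The category list and the logits are illustrative placeholders, not output from the trained model.

```python
import math

# Illustrative labels and logits; real values come from the trained classifier.
categories = ["Business", "Sci/Tech", "Sports", "World"]
logits = [0.2, 2.5, -1.0, 0.3]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
# index of the largest probability, analogous to result.max(dim=1)
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
prediction = {"category": categories[best_index],
              "probability": probabilities[best_index]}
print(prediction["category"])  # Sci/Tech
```

Since softmax preserves the ordering of the logits, the predicted category would be the same without it. It is applied so that the returned value can be read as a probability.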