<a href="https://colab.research.google.com/github/hookskl/nlp_w_pytorch/blob/main/nlp_w_pytorch_ch5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding Words and Types

Part of implementing any NLP task involves dealing with different kinds of discrete types. Examples of discrete types are:

* words
* characters
* parts-of-speech tags (POS)
* named entities
* named entity types
* parse features 
* items in a product catalog

Any input feature that comes from a finite (or countably infinite) set, also known as a vocabulary, is a *discrete type*.

One of the core successes of deep learning in NLP is the method of representing discrete types as dense vectors. "Representation learning" or "embedding" refers to learning a mapping from a discrete type to a point in a vector space. In the context of words, this mapping is referred to as a *word embedding*. Other embedding methods exist, such as count-based embeddings (for example, TF-IDF). The focus here is on *learning-based* or *prediction-based* embedding methods, where the representations are learned by maximizing an objective for a specific learning task, such as predicting a word from its context. Learned embeddings are now so central to modern NLP that adding one can be expected to improve performance on most NLP tasks.

## Why Learn Embeddings?

Learned embeddings have several advantages over more classical representations, such as count-based methods that are heuristically constructed.

First, they are more computationally efficient, since the dimension of each vector does not scale with the size of the vocabulary. Second, count-based methods result in high-dimensional vectors that encode redundant information along many dimensions. Third, very high dimensions lead to problems in fitting machine learning models (*the curse of dimensionality*). Finally, learned representations are better suited to the task at hand, whereas count-based approaches and low-dimensional approaches (such as SVD and PCA) are not necessarily optimized for the relevant task.

### Efficiency of Embeddings

One of the major efficiencies of word embeddings is that their size is typically much smaller than that of one-hot or count-based representations. Typical sizes range between 25 and 500 dimensions, with the choice usually dictated by hardware limitations.
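A second source of efficiency can be sketched with NumPy. Multiplying a one-hot vector by a weight matrix only selects one row of that matrix, so an embedding lookup can skip the multiplication and index the row directly. The vocabulary size of 5 and embedding size of 3 below are illustrative assumptions, not values from the text.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes chosen for illustration
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embedding_dim))

token_index = 2

# One-hot route: a vocab_size-long vector with a single 1
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0
via_matmul = one_hot @ weights

# Embedding route: index the row directly, no multiplication needed
via_lookup = weights[token_index]

same = np.allclose(via_matmul, via_lookup)
print(same)  # True
```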


### Approaches to Learning Word Embeddings

All word embedding methods train on words alone (that is, unlabeled text), but in a supervised fashion. This is accomplished by constructing *auxiliary* tasks in which the data is implicitly labeled. Some examples:

* given a sequence of words, predict the next word (also called the *language modeling* task)
* given a sequence of words before and after, predict the missing word
* given a word, predict words that occur within a window, independent of the position
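As a rough sketch of how such implicit labels can be derived from raw tokens, the helpers below build (input, label) pairs for each of the three tasks. The function names and the window size of 1 are hypothetical choices made for illustration only.

```python
tokens = "i pitied frankenstein my pity".split()

def next_word_pairs(tokens):
    """(preceding words, next word) pairs for the language modeling task."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def fill_in_pairs(tokens, window=1):
    """(words before and after, missing word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def window_pairs(tokens, window=1):
    """(word, neighbor) pairs; the neighbor's position in the window is ignored."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

print(next_word_pairs(tokens)[0])  # (['i'], 'pitied')
print(fill_in_pairs(tokens)[1])    # (['i', 'frankenstein'], 'pitied')
print(window_pairs(tokens)[:3])
```

In every case the label comes from the text itself, which is why no manual annotation is needed.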

Generally, it's more worthwhile to use a pretrained word embedding and fine-tune than to create one from scratch.

### The Practical Use of Pretrained Word Embeddings



#### Loading Embeddings

Some popular pretrained word embeddings are:

* Word2Vec 
* GloVe
* FastText

The typical file format for these embeddings is as follows: each line starts with the word/type that is being embedded and is followed by a sequence of numbers (the vector representation). The length of this sequence is the dimension of the representation (embedding dimension). 

A utility class called `PretrainedEmbeddings` is used to load and process embeddings. This class builds an in-memory index of all the word vectors for quick lookups and nearest-neighbor queries using an approximate nearest-neighbor package called `annoy`.

*Example 5-1. Using pretrained word embeddings*

In [None]:
%%shell
# download glove model 
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove*.zip


--2021-01-31 20:12:30--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-01-31 20:12:31--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-01-31 20:12:31--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’





In [None]:
# !pip install annoy

import numpy as np
from annoy import AnnoyIndex

class PretrainedEmbeddings(object):
    def __init__(self, word_to_index, word_vectors):
        """
        Args:
            word_to_index (dict): mapping from word to integers
            word_vectors (list of numpy arrays)
        """
        self.word_to_index = word_to_index
        self.word_vectors = word_vectors
        self.index_to_word = \
            {v: k for k, v in self.word_to_index.items()}
        self.index = AnnoyIndex(len(word_vectors[0]),
                                metric='euclidean')
        for _, i in self.word_to_index.items():
            self.index.add_item(i, self.word_vectors[i])
        self.index.build(50)

    @classmethod 
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from pretrained vector file.

        Vector file should be the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PretrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []
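        # GloVe files are UTF-8 text; iterate lazily instead of reading every line up front
        with open(embedding_file, encoding="utf-8") as fp:
            for line in fp:
                # drop the trailing newline before splitting on single spaces
                line = line.rstrip().split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])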

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index, word_vectors)

    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns:
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]
    
    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors
        Args:
            vector (np.ndarray): should match the size of the vectors in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words nearest to the given vector
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor]
                    for neighbor in nn_indices]
    
    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # Simple hypothesis: Analogy is spatial relationship
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words 
                                if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        
        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4) )


In [None]:
# load glove embeddings
embeddings = \
    PretrainedEmbeddings.from_embeddings_file('glove.6B.100d.txt')

In [None]:
# print the index of the word 'working'
print(embeddings.word_to_index['working'])
# print the word with index 500
print(embeddings.index_to_word[500])
# print the word embedding of the word 'working'
embeddings.word_vectors[500]

500
working


array([ 0.076552 ,  0.17843  , -0.44464  ,  0.085718 ,  0.28268  ,
       -0.30546  , -0.30637  ,  0.36632  , -0.19919  ,  0.35636  ,
        0.088981 , -0.7717   ,  0.68709  , -0.055057 , -0.47002  ,
       -0.52158  ,  0.58331  , -0.32255  , -0.28368  , -0.020115 ,
        0.12133  ,  0.63264  ,  0.2717   , -0.61169  , -0.015634 ,
       -0.54613  , -0.19113  , -0.77745  , -0.048714 ,  0.38825  ,
       -0.68519  ,  0.71731  , -0.075302 , -0.26239  , -0.013498 ,
        0.19442  , -0.19793  ,  0.040908 ,  0.78602  , -0.049446 ,
       -0.83782  ,  0.10923  , -0.15471  ,  0.12925  , -0.26784  ,
        0.045059 , -0.60726  , -0.41779  , -0.063718 , -0.58079  ,
       -0.40284  , -0.37669  , -0.18443  ,  1.4475   , -0.15176  ,
       -2.215    , -0.22298  , -0.28886  ,  1.3392   ,  0.55239  ,
        0.022604 ,  0.70506  , -0.34004  , -0.26593  ,  0.80853  ,
        0.26161  ,  0.38258  ,  0.44347  ,  0.37905  ,  0.26225  ,
        0.082587 , -0.049931 , -0.19572  , -0.48894  ,  0.2075

#### Relationships between word embeddings

One of the features of word embeddings is that they encode syntactic and semantic relationships between words. For words such as cats and dogs, their respective word vectors are close together relative to other words, such as ducks or elephants.

One way to explore these similarities is the analogy task: $$\text{Word1 : Word2 :: Word3 : ____}$$

In this task, three words are provided with a missing fourth word. The fourth word is chosen such that its relationship with the third word is congruent to the relationship between words one and two. With word embeddings, these relationships are encoded spatially. Subtracting the word embedding of `Word1` from that of `Word2` yields a difference vector that represents this relationship. Adding this difference vector to the embedding of `Word3` produces a new vector that is close to the fourth word. The other word embeddings can then be queried using nearest neighbors to find the word embedding most similar to this output vector, solving the analogy task.
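The arithmetic can be sketched on a toy scale. The 2-D vectors below are hand-made for illustration and are not real embeddings, and the nearest neighbor is found by brute-force Euclidean distance rather than an approximate index; `solve_analogy` is a hypothetical helper name.

```python
import numpy as np

# Hand-made 2-D vectors, chosen only so the geometry is easy to follow
toy_vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-2.0, 4.0]),
}

def solve_analogy(word1, word2, word3, vectors):
    """Return the word closest to vec3 + (vec2 - vec1), excluding the inputs."""
    target = vectors[word3] + (vectors[word2] - vectors[word1])
    candidates = [w for w in vectors if w not in {word1, word2, word3}]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)  # queen
```

The input words are excluded from the candidates because the third word itself is often among the closest vectors to the result.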

*Example 5-2. The analogy task using word embeddings*

```
    def get_embedding(self, word):
        """
        Args:
            word (str)
        Returns:
            an embedding (numpy.ndarray)
        """
        return self.word_vectors[self.word_to_index[word]]
    
    def get_closest_to_vector(self, vector, n=1):
        """Given a vector, return its n nearest neighbors
        Args:
            vector (np.ndarray): should match the size of the vectors in the Annoy index
            n (int): the number of neighbors to return
        Returns:
            [str, str, ...]: words nearest to the given vector
                The words are not ordered by distance
        """
        nn_indices = self.index.get_nns_by_vector(vector, n)
        return [self.index_to_word[neighbor]
                    for neighbor in nn_indices]
    
    def compute_and_print_analogy(self, word1, word2, word3):
        """Prints the solutions to analogies using word embeddings

        Analogies are word1 is to word2 as word3 is to __
        This method will print: word1 : word2 :: word3 : word4

        Args:
            word1 (str)
            word2 (str)
            word3 (str)
        """
        vec1 = self.get_embedding(word1)
        vec2 = self.get_embedding(word2)
        vec3 = self.get_embedding(word3)

        # Simple hypothesis: Analogy is spatial relationship
        spatial_relationship = vec2 - vec1
        vec4 = vec3 + spatial_relationship

        closest_words = self.get_closest_to_vector(vec4, n=4)
        existing_words = set([word1, word2, word3])
        closest_words = [word for word in closest_words 
                                if word not in existing_words]

        if len(closest_words) == 0:
            print("Could not find nearest neighbors for the vector!")
            return
        
        for word4 in closest_words:
            print("{} : {} :: {} : {}".format(word1, word2, word3, word4) )
```

*Example 5-3. Word embeddings encode many linguistic relationships, as illustrated using the SAT analogy task*

In [None]:
# Relationship 1: the relationship between gendered nouns and pronouns
embeddings.compute_and_print_analogy('man', 'he', 'woman')

man : he :: woman : she
man : he :: woman : her


In [None]:
# Relationship 2: Verb-noun relationships
embeddings.compute_and_print_analogy('fly', 'plane', 'sail')

fly : plane :: sail : ship
fly : plane :: sail : vessel
fly : plane :: sail : boat


In [None]:
# Relationship 3: Noun-noun relationships
embeddings.compute_and_print_analogy('cat', 'kitten', 'dog')

cat : kitten :: dog : puppy
cat : kitten :: dog : puppies
cat : kitten :: dog : junkyard


In [None]:
# Relationship 4: Hypernymy (broader category)
embeddings.compute_and_print_analogy('blue', 'color', 'dog')

blue : color :: dog : animal
blue : color :: dog : pet
blue : color :: dog : taste
blue : color :: dog : touch


In [None]:
# Relationship 5: Meronymy (part-to-whole)
embeddings.compute_and_print_analogy('toe', 'foot', 'finger')

toe : foot :: finger : hand
toe : foot :: finger : kept
toe : foot :: finger : ground


In [None]:
# Relationship 6: Troponymy (difference in manner)
embeddings.compute_and_print_analogy('talk', 'communicate', 'read')

talk : communicate :: read : interpret
talk : communicate :: read : communicated
talk : communicate :: read : transmit


In [None]:
# Relationship 7: Metonymy (convention / figures of speech)
embeddings.compute_and_print_analogy('blue', 'democrat', 'red')

blue : democrat :: red : republican
blue : democrat :: red : congressman
blue : democrat :: red : senator


In [None]:
# Relationship 8: Adjectival scales
embeddings.compute_and_print_analogy('fast', 'fastest', 'young')

fast : fastest :: young : younger
fast : fastest :: young : sixth
fast : fastest :: young : fifth
fast : fastest :: young : seventh


*Example 5-4. An example illustrating the danger of using co-occurrences to encode meaning; sometimes they do not!*

In [None]:
embeddings.compute_and_print_analogy('fast', 'fastest', 'small')

fast : fastest :: small : smallest
fast : fastest :: small : largest
fast : fastest :: small : among
fast : fastest :: small : quarters


*Example 5-5. Watch out for protected attributes such as gender encoded in word embeddings. This can introduce unwanted biases in downstream models.*

In [None]:
embeddings.compute_and_print_analogy('man', 'king', 'woman')

man : king :: woman : queen
man : king :: woman : monarch
man : king :: woman : throne


*Example 5-6. Cultural gender bias encoded in vector analogy*

In [None]:
embeddings.compute_and_print_analogy('man', 'doctor', 'woman')

man : doctor :: woman : nurse
man : doctor :: woman : physician
man : doctor :: woman : doctors


The above example shows how codified cultural biases can manifest within word embeddings, which is something practitioners should be aware of.

## Example: Learning the Continuous Bag of Words Embeddings

This example walks through one of the most famous methods for constructing word embeddings, the Word2Vec Continuous Bag-of-Words (CBOW) model. The CBOW model is a multiclass classification task: it scans over a text, creating a "context" window of words at each position. The center word of this window is removed, and the remaining context is used to classify the missing word, essentially filling in the blank.

PyTorch's `nn.Embedding` layer will be used to implement the CBOW model. The `Embedding` layer encapsulates an embedding matrix and maps a token's integer ID to the vector used in the neural network computation. Because these vectors are model weights, they are updated during training along with the rest of the model.
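A minimal sketch of this behavior, using toy sizes (10 types, 4 dimensions) that are assumptions made for illustration: the layer returns one row of its weight matrix per token ID, and after a backward pass only the rows that were looked up receive a nonzero gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes

# A batch of 2 contexts with 3 token IDs each
token_ids = torch.tensor([[1, 2, 5], [7, 0, 3]])
vectors = embedding(token_ids)  # shape: (2, 3, 4)

# The lookup is row selection from the weight matrix
row_matches = torch.equal(vectors[0, 1], embedding.weight[2])

# Gradients reach only the rows that were looked up
vectors.sum().backward()
rows_with_grad = (
    embedding.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
)
print(vectors.shape, row_matches, rows_with_grad)
```

Rows 4, 6, 8, and 9 were never looked up in this batch, so an optimizer step would leave them unchanged.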


### The Frankenstein Dataset

The Frankenstein Dataset is a digitized version of Mary Shelley's novel *Frankenstein* and is available at [Project Gutenberg](https://bit.ly/2T5iU8J).

Starting from the raw text file, the text is split into sentences using NLTK's `Punkt` tokenizer, and then each sentence is converted to lowercase and stripped of all punctuation. With the text preprocessed, a list of tokens can be retrieved by splitting on whitespace.
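The per-sentence cleanup can be sketched with the standard library alone, assuming sentence splitting has already been done. The regular expression is one possible reading of "stripped of all punctuation" (it keeps only the letters a to z), and `clean_sentence` is a hypothetical helper name.

```python
import re

def clean_sentence(sentence):
    """Lowercase a sentence, drop non-letter characters, collapse whitespace."""
    sentence = sentence.lower()
    # Assumption: anything that is not a-z or whitespace counts as punctuation
    sentence = re.sub(r"[^a-z\s]", " ", sentence)
    return " ".join(sentence.split())

raw = "I pitied Frankenstein; my pity amounted to horror: I abhorred myself."
cleaned = clean_sentence(raw)
tokens = cleaned.split(" ")
print(cleaned)
```

One consequence of this choice is that contractions are split at the apostrophe, so "don't" becomes the two tokens "don" and "t".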

The data is then enumerated as a sequence of windows so that the CBOW model can be optimized. The list of tokens for each sentence is iterated over, grouping them into windows of a specific window size. As a quick example with window size of 2:

* i pitied frankenstein my pity amounted to horror i abhorred myself
* *i* <ins>pitied frankenstein</ins> my pity amounted to horror i abhorred myself
* <ins>i</ins> *pitied* <ins>frankenstein my</ins> pity amounted to horror i abhorred myself
* <ins>i pitied</ins> *frankenstein* <ins>my pity</ins> amounted to horror i abhorred myself
* i <ins>pitied frankenstein</ins> *my* <ins>pity amounted</ins> to horror i abhorred myself

The list above shows the sliding context window over the processed sentence "i pitied frankenstein my pity amounted to horror i abhorred myself", with the *target* italicized and the <ins>context window</ins> underlined.
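The windowing step can be sketched as a small generator. The name `cbow_windows` is hypothetical, and the output layout (a space-joined context string plus a target word) is an assumption made for illustration. Near the sentence boundaries the context is simply shorter.

```python
def cbow_windows(tokens, window_size=2):
    """Yield (context, target) pairs for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield " ".join(left + right), target

sentence = "i pitied frankenstein my pity amounted to horror i abhorred myself"
rows = list(cbow_windows(sentence.split(" ")))

for context, target in rows[:4]:
    print(f"{target:>12} <- {context}")
```

The first four rows correspond to the four windows in the example sentence, and each sentence yields one row per token.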

Once the data is processed, the familiar step of splitting the data into train, validation, and test sets is done.

The data with splits is loaded into a Pandas `DataFrame` and indexed in the `CBOWDataset` class. The `__getitem__()` method uses the `Vectorizer` to convert the context into a vector, and the target word is converted to an integer using the `Vocabulary`.

*Example 5-7. Constructing a dataset class for the CBOW task*

```
import pandas as pd
from torch.utils.data import Dataset

class CBOWDataset(Dataset):
    # ...existing implementation from Example 3-15
    @classmethod
    def load_dataset_and_make_vectorizer(cls, cbow_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            cbow_csv (str): location of the dataset
        Returns:
            an instance of CBOWDataset
        """
        cbow_df = pd.read_csv(cbow_csv)
        train_cbow_df = cbow_df[cbow_df.split=='train']
        return cls(cbow_df, CBOWVectorizer.from_dataframe(train_cbow_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dict with features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        context_vector = \
            self._vectorizer.vectorize(row.context, self._max_seq_length)
        target_index = self._vectorizer.cbow_vocab.lookup_token(row.target)

        return {'x_data': context_vector,
                'y_target': target_index}
```

### Vocabulary, Vectorizer, and DataLoader

*Example 5-8. A Vectorizer for the CBOW data*

```
```

### The CBOWClassifier Model

*Example 5-9. The CBOWClassifier model*

```
```

### The Training Routine

*Example 5-10. Arguments to the CBOW training script*

```
```

### Model Evaluation and Prediction

## Example: Transfer Learning Using Pretrained Embeddings for Document Classification

### The AG News Dataset

*Example 5-11. The `NewsDataset.__getitem__()` method*

```
```

### Vocabulary, Vectorizer, and DataLoader

*Example 5-12. Implementing a Vectorizer for the AG News dataset*

```
```

### The NewsClassifier Model

*Example 5-13. Selecting a subset of the word embeddings based on the vocabulary*

```
```

*Example 5-14. Implementing the NewsClassifier*

```
```


### The Training Routine

*Example 5-15. Arguments to the CNN NewsClassifier using pretrained embeddings*

```
```

### Model Evaluation and Prediction


#### Evaluating on the test dataset

#### Predicting the category of novel news headlines

*Example 5-16. Predicting with the trained model*

```
```