<a href="https://colab.research.google.com/github/hookskl/nlp_w_pytorch/blob/main/nlp_w_pytorch_ch5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding Words and Types

Part of implementing any NLP task involves dealing with different kinds of discrete types. Examples of discrete types are:

* words
* characters
* part-of-speech (POS) tags
* named entities
* named entity types
* parse features 
* items in a product catalog

Any input feature that comes from a finite (or countably infinite) set, also known as a vocabulary, is a *discrete type*.

One of the core successes of deep learning in NLP is representing discrete types as dense vectors. "Representation learning" or "embedding" refers to learning a mapping from a discrete type to a point in a vector space. When the discrete types are words, this mapping is called a *word embedding*. Other embedding methods exist, such as count-based embeddings (for example, TF-IDF). The focus here is on *learning-based* or *prediction-based* embedding methods, where the representations are learned by maximizing an objective for a specific learning task, such as predicting a word from its context. Learned embeddings are now so ubiquitous in modern NLP that adding one can be expected to improve performance on most NLP tasks.

## Why Learn Embeddings?

Learned embeddings have several advantages over more classical representations, such as count-based methods that are heuristically constructed.

First, they are more computationally efficient, since the dimensionality of each vector does not grow with the size of the vocabulary. Second, count-based methods produce high-dimensional vectors that encode redundant information along many dimensions. Third, very high-dimensional inputs make machine learning models harder to fit (*the curse of dimensionality*). Finally, learned representations are fitted to the task at hand, whereas count-based or low-dimensional approaches (such as SVD and PCA) are not necessarily optimized for the relevant task.

### Efficiency of Embeddings

One of the major efficiencies of word embeddings is that they are typically much smaller than one-hot or count-based representations. Typical sizes range between 25 and 500 dimensions, with the exact choice usually dictated by hardware limitations.
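To make the size argument concrete, here is a minimal sketch in plain Python. It assumes a toy embedding table with arbitrary values and hypothetical helper names (`one_hot`, `one_hot_times_table`, `lookup`). Multiplying a one-hot vector by the table selects a single row, so the same result can be obtained by indexing the table directly, which is what an embedding layer effectively does.

```python
# Toy embedding table: 5 vocabulary items, 3 dimensions each (values are arbitrary).
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(vector, table):
    """Multiply a one-hot row vector by the table: work grows with vocabulary size."""
    dim = len(table[0])
    return [
        sum(vector[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]


def lookup(index, table):
    """Index the table directly: work grows only with the embedding dimension."""
    return list(table[index])


word_index = 3
via_matmul = one_hot_times_table(one_hot(word_index, len(embedding_table)), embedding_table)
via_lookup = lookup(word_index, embedding_table)
```

Both `via_matmul` and `via_lookup` hold the same row of the table; only the amount of work differs.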


### Approaches to Learning Word Embeddings

All word embedding methods train on just words (that is, unlabeled text), but in a supervised fashion. This is accomplished by constructing *auxiliary* tasks in which the data is implicitly labeled. Some examples:

* given a sequence of words, predict the next word (also called the *language modeling* task)
* given a sequence of words before and after, predict the missing word
* given a word, predict words that occur within a window, independent of their position
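The three auxiliary tasks above can be sketched as functions that turn raw tokens into implicitly labeled (input, target) pairs. This is only an illustration: the function names and the toy sentence are hypothetical. The second task matches what is usually called CBOW and the third what is usually called skip-gram.

```python
def next_word_pairs(tokens, context_size):
    """Language modeling: the previous `context_size` words -> the next word."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]


def missing_word_pairs(tokens, window):
    """CBOW-style: the words before and after a position -> the word at that position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


def window_pairs(tokens, window):
    """Skip-gram-style: a word -> each word inside its window, position ignored."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the monster fled into the night".split()
lm_pairs = next_word_pairs(tokens, context_size=2)
cbow_pairs = missing_word_pairs(tokens, window=1)
skip_pairs = window_pairs(tokens, window=1)
```

No human labeling is involved in any of the three: the targets come from the text itself.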

Generally, it's more worthwhile to start from pretrained word embeddings and fine-tune them than to train embeddings from scratch.

### The Practical Use of Pretrained Word Embeddings

#### Loading Embeddings

*Example 5-1. Using pretrained word embeddings*
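As a rough sketch of what loading involves, the snippet below parses embeddings stored in a GloVe-style text format, where each line holds a word followed by its space-separated vector components. The file contents here are made up and held in memory. `load_embeddings` is a hypothetical helper, not the notebook's own loader.

```python
import io


def load_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a word->index map and vectors."""
    word_to_index = {}
    vectors = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word_to_index[parts[0]] = len(vectors)
        vectors.append([float(x) for x in parts[1:]])
    return word_to_index, vectors


# Stand-in for an opened embeddings file; the numbers are invented for illustration.
fake_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.3\ncar 0.9 0.8 0.1\n")
word_to_index, vectors = load_embeddings(fake_file)
```

With a real file, the same function could be called on the handle returned by `open(path, encoding="utf-8")`.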

#### Relationships between word embeddings

*Example 5-2. The analogy task using word embeddings*
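The idea behind the analogy task can be sketched with a handful of hand-made vectors. The sketch assumes the common vector-offset formulation (word1 : word2 :: word3 : ?) with cosine similarity as the closeness measure. The toy vectors and the `analogy` helper are hypothetical, and real pretrained embeddings will behave less cleanly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(word1, word2, word3, embeddings):
    """Complete word1 : word2 :: word3 : ? using the offset word2 - word1."""
    v1, v2, v3 = embeddings[word1], embeddings[word2], embeddings[word3]
    target = [c + (b - a) for a, b, c in zip(v1, v2, v3)]
    # Exclude the query words so the answer is not one of the inputs.
    candidates = [w for w in embeddings if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hand-made toy vectors; the second dimension loosely plays the role of gender.
toy_embeddings = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.0, 0.2, 0.1],
}
result = analogy("man", "woman", "king", toy_embeddings)
```

Here `result` is `"queen"`, because the toy vectors were built so that the offset from "man" to "woman" lands exactly on it.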

*Example 5-3. Word embeddings encode many linguistic relationships, as illustrated using the SAT analogy task*

*Example 5-4. An example illustrating the danger of relying on co-occurrences to encode meaning: sometimes they do not!*

*Example 5-5. Watch out for protected attributes such as gender encoded in word embeddings. This can introduce unwanted biases in downstream models.*

*Example 5-6. Cultural gender bias encoded in vector analogy*

## Example: Learning the Continuous Bag of Words Embeddings

### The Frankenstein Dataset

*Example 5-7. Constructing a dataset class for the CBOW task*

```
```

### Vocabulary, Vectorizer, and DataLoader

*Example 5-8. A Vectorizer for the CBOW data*

```
```
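Since the cell above is empty, here is a hedged sketch of what vectorizing a CBOW context might look like: tokens are mapped to integer indices, unknown tokens fall back to an `<UNK>` index, and the result is padded with a mask index to a fixed length. `ToyVocabulary` and `vectorize` are hypothetical stand-ins, not the notebook's actual classes.

```python
class ToyVocabulary:
    """Maps tokens to integer indices; the mask (padding) token gets index 0."""

    def __init__(self, mask_token="<MASK>", unk_token="<UNK>"):
        self.token_to_index = {}
        self.mask_index = self.add_token(mask_token)
        self.unk_index = self.add_token(unk_token)

    def add_token(self, token):
        """Add a token if it is new and return its index."""
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)
        return self.token_to_index[token]

    def lookup_token(self, token):
        """Return the token's index, or the <UNK> index if it was never added."""
        return self.token_to_index.get(token, self.unk_index)


def vectorize(context, vocab, vector_length):
    """Turn a space-separated context into a fixed-length list of indices."""
    indices = [vocab.lookup_token(t) for t in context.split(" ")][:vector_length]
    return indices + [vocab.mask_index] * (vector_length - len(indices))


vocab = ToyVocabulary()
for token in "i saw the monster".split():
    vocab.add_token(token)

vector = vectorize("the monster ran", vocab, vector_length=5)
```

Fixed-length index vectors are what allow contexts of different sizes to be stacked into a single batch.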

### The CBOWClassifier Model

*Example 5-9. The CBOWClassifier model*

```
```

### The Training Routine

*Example 5-10. Arguments to the CBOW training script*

```
```

### Model Evaluation and Prediction

## Example: Transfer Learning Using Pretrained Embeddings for Document Classification

### The AG News Dataset

*Example 5-11. The NewsDataset.__getitem__() method*

```
```

### Vocabulary, Vectorizer, and DataLoader

*Example 5-12. Implementing a Vectorizer for the AG News dataset*

```
```

### The NewsClassifier Model

*Example 5-13. Selecting a subset of the word embeddings based on the vocabulary*

```
```
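The cell above is empty, so the following is only a sketch of the idea named in the caption: build one embedding row per vocabulary word, copying the pretrained vector when the word is available and falling back to a small random initialization otherwise. The helper name, the uniform initialization range, and the toy data are all assumptions.

```python
import random


def make_embedding_matrix(pretrained, words, seed=0):
    """Return one row per word: the pretrained vector if present, else a random one."""
    rng = random.Random(seed)
    dim = len(next(iter(pretrained.values())))
    matrix = []
    for word in words:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            # Assumed fallback: small uniform noise for words without a pretrained vector.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


# Toy pretrained vectors and a toy vocabulary, in vocabulary-index order.
pretrained = {"stocks": [0.5, 0.1], "game": [0.2, 0.9]}
words = ["<UNK>", "stocks", "game", "frankenstein"]
matrix = make_embedding_matrix(pretrained, words)
```

In a PyTorch model, a matrix built this way could be converted to a float tensor and used to initialize the embedding layer, for instance through `torch.nn.Embedding.from_pretrained`.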

*Example 5-14. Implementing the NewsClassifier*

```
```


### The Training Routine

*Example 5-15. Arguments to the CNN NewsClassifier using pretrained embeddings*

```
```

### Model Evaluation and Prediction


#### Evaluating on the test dataset

#### Predicting the category of novel news headlines

*Example 5-16. Predicting with the trained model*

```
```