In [1]:
%run Latex_macros.ipynb

# Dense representation for categorical variables

One drawback that we flagged with One Hot Encoding (OHE) of a categorical variable:
- Tokens that seem to be related in the input domain ("dog", "dogs")
- Become *unrelated* when One Hot Encoded

Because each value is a long vector
- With a single non-zero element
- That is *different* across the two values
- The OHE vectors are *orthogonal*

This means that there is no useful measure of the distance between two tokens

To illustrate the "lack of distance" issue, let $\text{rep}$ be a mapping from tokens to their One Hot Encodings.

Using the dot product (which equals cosine similarity for unit-length vectors such as these) as a measure of similarity to the token "dog":

| word   | rep(word) | Similarity to "dog"|
| ---    | ---       | :---:        |
| dog   | [1,0,0,0]   | rep(word) $\cdot$ rep(dog)  = 1  |
| dogs  | [0,1,0,0]   | rep(word) $\cdot$ rep(dog)  = 0  |
| cat   | [0,0,1,0]   | rep(word) $\cdot$ rep(dog)  = 0  |
| apple | [0,0,0,1]   | rep(word) $\cdot$ rep(dog)  = 0  |

All words other than "dog" are equidistant from "dog".
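The table can be reproduced in a few lines. This is a minimal sketch using NumPy; the names `vocab`, `rep` and `similarity_to_dog` are chosen here for illustration only.

```python
import numpy as np

# One Hot Encodings from the table above (4-token toy vocabulary)
vocab = ["dog", "dogs", "cat", "apple"]
rep = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Dot product of each token's OHE vector with the OHE vector of "dog"
similarity_to_dog = {word: float(rep[word] @ rep["dog"]) for word in vocab}
print(similarity_to_dog)
# {'dog': 1.0, 'dogs': 0.0, 'cat': 0.0, 'apple': 0.0}
```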

Intuitively, we observe similarity between "dog" and other words
- ("dog", "dogs"): Same root, different Singular/Plural form 
- ("dog", "cat"): Same concept (pet)

and complete lack of similarity between "dog" and "apple".

Yet all have the same similarity to "dog": $0$.

We can consider an alternate encoding to OHE.

Suppose each dimension of the encoded vector
- Measured the intensity of the token against some concept
    - Singular/Plural
    - Domestic Animal
    - Edible

This type of representation is called *continuous*
- As the strength is a continuous value
- Compared to the *discrete* encoding of OHE as binary 0/1

It is also called a *dense* representation
- Multiple non-zero elements in the vector
- Compared to the single non-zero element in the OHE vector

In a continuous, dense representation
two values expressing similar concepts will be "closer" than two values that do not share concepts

- "Cats", "Dogs", "Apples"
    - Share the concept "Plural"
- "Cat", "Dog"
    - Share the concept "Domestic animal"
- "good", "bad"
    - Share the concept "Opposite"

# Doing math with words

Let's explore the implication and power of dense vector representation of words.

If each element of the vector
- Expresses a concept
- And the number of concepts is small compared to $|\Vocab|$
- And the concepts are fairly independent
- Then we have found an alternate basis (compared to the $|\Vocab|$ basis vectors of OHE) of smaller dimension
- For representing words

Such a representation is commonly called a *word embedding*.

Let $\v_w$ denote the dense representation of token $w$:


| $w$    | $\v_w$          |
| ---    | ---             |
| cat    | [.7, .5, .01]   |
| cats   | [.7, .5, .95]   |
| dog    | [.7, .2, .01]   |
| dogs   | [.7, .2, .95]   |
| apple  | [.1, .4, .01]   |
| apples | [.1, .4, .95]   |

Notice that "dogs" and "apples"
- Are similar along one dimension (the last, perhaps encoding "Is Plural")
- Are dissimilar along one dimension (the first, perhaps encoding "Is Pet")

Also notice that "dog" and "cat"
- Are similar along the first dimension (reinforcing the notion that this dimension may be "Is Pet")


Taking this a step further: we can perform element-wise math on dense vector representations:

$$
\v_\text{cats} - \v_\text{cat} \approx \v_\text{dogs} - \v_\text{dog} \approx \v_\text{apples} - \v_\text{apple}
$$

because
- "cats" and "cat" are similar in all concepts *except* "Plural".
- As are "dogs" and "dog"
- As are "apples" and "apple"

If that's the case, we can approximate the vector that expresses the "pure" concept "Is Plural"
- Without expressing any other concept

as $(\v_\text{cats} - \v_\text{cat})$

Then we can construct the Plural form of "apple"
- By adding the pure vector for Plural to the vector for "apple"

$$
\v_\text{apples} \approx \v_\text{apple} + (\v_\text{cats} - \v_\text{cat})
$$
With the toy vectors above, $\v_\text{apple} + (\v_\text{cats} - \v_\text{cat}) = [.1, .4, .95]$, which is exactly $\v_\text{apples}$.
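The same arithmetic can be checked numerically. This is a small sketch using the toy vectors from the table above; the dictionary `v` and the names `plural` and `constructed` are illustrative and not part of any library.

```python
import numpy as np

# Toy dense vectors copied from the table above
v = {
    "cat":    np.array([.7, .5, .01]),
    "cats":   np.array([.7, .5, .95]),
    "dog":    np.array([.7, .2, .01]),
    "dogs":   np.array([.7, .2, .95]),
    "apple":  np.array([.1, .4, .01]),
    "apples": np.array([.1, .4, .95]),
}

# Approximation of the "pure" Plural concept
plural = v["cats"] - v["cat"]

# Construct the Plural form of "apple" by adding the Plural vector
constructed = v["apple"] + plural

print(np.allclose(constructed, v["apples"]))            # True
print(np.allclose(v["dogs"] - v["dog"], plural))        # True
```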

## Word analogies

The implications of doing math on words are even more powerful.

Consider solving the analogy problem
>king:man :: ?:woman

That is: what is the female analog of "king" ?

Suppose the concepts ("dimensions") of the dense representation were
- Gender (man or woman)
- Regal (Royal or commoner)

Then
$$
\v_\text{king} - \v_\text{man} + \v_\text{woman} \approx \v_\text{queen}
$$
because
$$
\begin{array}{llll}
\v_\text{king} & = & (\text{Man, Royal}) & \text{vector representation of "king"}\\
\v_\text{king} - (\text{Man}, 0) &  = & (0, \text{Royal}) & \text{subtract vector for "pure" concept "Man"} \\
\v_\text{king} - (\text{Man}, 0) + (\text{Woman}, 0) & = & (\text{Woman, Royal})  & \text{add vector for "pure" concept "Woman"} \\
& = & \v_\text{queen} & \text{the word having concepts "Woman" and "Royal"}\\
\end{array}
$$

We can use math on dense vectors to compute analogies!
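Here is the derivation with numbers plugged in. The 2-dimensional encoding below is hypothetical (first element: Gender, with +1 for man and -1 for woman; second element: Regal, with 1 for royal and 0 for commoner); real embeddings do not have such clean, labeled dimensions.

```python
import numpy as np

# Hypothetical 2-d encoding: [Gender, Regal]
v = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# Subtract the "pure" Man concept, add the "pure" Woman concept
result = v["king"] - v["man"] + v["woman"]

print(np.array_equal(result, v["queen"]))  # True
```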

Let's formalize the "math" of word vectors

For tokens $w, w'$ with dense vectors $\v_{w}, \v_{w'}$
- Define a metric $d(\v_{w}, \v_{w'})$ of the distance between the words
- For example: 
$$d(\v_{w}, \v_{w'}) = 1 - \text{cosine similarity}(\v_{w}, \v_{w'})$$


Define the set of tokens $N_{n',d}(w)$ in vocabulary $\Vocab$
- That are among the $n'$ "closest" to a token $w$
- According to distance metric $d$

$$
\begin{array}{llll}
\text{wv}_{n',d}(w) & = & \left\{ \; \v_{w'} \, | \, \text{rank}_\Vocab( d(\v_{w}, \v_{w'}) ) \le n' \; \right\} & \text{the dense vectors of the } n' \text{ tokens in } \Vocab \text{ closest to token } w \\
N_{n',d}(w) & = & \left\{ \; w' \, | \, \v_{w'} \in \text{wv}_{n',d}(w) \; \right\} & \text{the corresponding tokens} \\
\end{array}
$$

This is the "neighborhood" of token $w$ as defined by the distance metric.

Token $w'$ is defined to be *approximately equal to* token $w$ 
- Denoted as $w \approx_{n',d} w'$
- If $w'$ is in the neighborhood of $w$

$$
w \approx_{n',d} w' \; \; \text{if } w' \in N_{n',d}(w) 
$$



Thus, the analogy
>a:b :: c:d

implies

$$
\v_a - \v_b  \approx_{n',d}  \v_c - \v_d 
$$

So to solve the word analogy for $c$:
$$
\v_c \approx_{n',d}  \v_a - \v_b + \v_d
$$
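The definitions above translate fairly directly into code. The sketch below uses the toy vectors from the earlier table; the function names `cosine_distance`, `neighborhood` and `solve_analogy` are made up for this illustration. Excluding the three query tokens from the candidates is an added assumption (a common practical choice), not part of the definition above.

```python
import numpy as np

def cosine_distance(x, y):
    """d(x, y) = 1 - cosine similarity(x, y)."""
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def neighborhood(target, vectors, n_prime, exclude=()):
    """Tokens whose vectors are among the n_prime closest to the vector `target`."""
    candidates = [w for w in vectors if w not in exclude]
    ranked = sorted(candidates, key=lambda w: cosine_distance(target, vectors[w]))
    return ranked[:n_prime]

def solve_analogy(a, b, d, vectors, n_prime=1):
    """Solve a:b :: ?:d by ranking tokens near v_a - v_b + v_d."""
    target = vectors[a] - vectors[b] + vectors[d]
    return neighborhood(target, vectors, n_prime, exclude=(a, b, d))

# Toy dense vectors from the earlier table
vectors = {
    "cat":    np.array([.7, .5, .01]),
    "cats":   np.array([.7, .5, .95]),
    "dog":    np.array([.7, .2, .01]),
    "dogs":   np.array([.7, .2, .95]),
    "apple":  np.array([.1, .4, .01]),
    "apples": np.array([.1, .4, .95]),
}

# cats:cat :: ?:dog
print(solve_analogy("cats", "cat", "dog", vectors))  # ['dogs']

# The 2 tokens closest to "dog" (other than "dog" itself)
print(neighborhood(vectors["dog"], vectors, n_prime=2, exclude=("dog",)))
```

Note that with cosine distance the nearest neighbor of "dog" in this toy table is "cat" rather than "dogs": the large "Is Plural" component dominates the direction of the plural vectors.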

## GloVe: Pretrained embeddings

Fortunately, you don't have to create your own word-embeddings from scratch.

There are a number of precomputed embeddings freely available.

GloVe is a family of word embeddings that have been trained on large corpora
- GloVe6b
    - Trained on 6 Billion tokens
    - 400K words
    - Corpus:  Wikipedia (2014) + GigaWord5 (version 5, news wires 1994-2010)
    - Many different dense vector lengths to choose from
        - 50, 100, 200, 300

We will illustrate the power of word embeddings using GloVe6b vectors of length $100$.

$
\begin{array}{lll}
\text{king - man + woman} &  \approx_{n',d} & \text{queen} \\
\text{man - boy + girl} &  \approx_{n',d} & \text{woman} \\
\text{Paris - France + Germany} &  \approx_{n',d} & \text{Berlin} \\
\text{Einstein - science + art} &  \approx_{n',d} & \text{Picasso} \\
\end{array}
$

You can see that the dense vectors seem to encode "concepts" that we can manipulate mathematically.
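To experiment with results like these yourself, the pretrained vectors first need to be read into memory. GloVe vectors are distributed as plain text, one token per line followed by its space-separated vector components. The sketch below parses that format; `load_glove` is a helper name invented here, the sample values are made up, and the file name in the comment is assumed to match your local download.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text format: a token followed by its vector components, per line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Stand-in for a real file; the numbers below are made up for illustration
sample = io.StringIO("king 0.5 0.1 -0.2\nqueen 0.4 0.2 -0.1\n")
glove = load_glove(sample)
print(glove["king"].shape)  # (3,)

# With a downloaded file (assumed local file name):
# with open("glove.6B.100d.txt", encoding="utf-8") as f:
#     glove = load_glove(f)
```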

You may also discover some unintended bias:

$
\begin{array}{lll}
\text{doctor - man + woman} &  \approx_{n',d} & \text{nurse} \\
\text{mechanic - man + woman} &  \approx_{n',d} & \text{teacher} \\
\end{array}
$

## Domain specific embeddings

Do we speak Wikipedia English in this room ?

Here are the neighborhoods of some financial terms, according to GloVe:

$
\begin{array}{lll}
N(\text{bull}) & =  & [ \text{cow, elephant, dog, wolf, pit, bear, rider, lion, horse}] \\
N(\text{short}) & =  & [ \text{rather, instead, making, time, though, well, longer, shorter, long}] \\
N(\text{strike}) & =  & [ \text{workers, struck, action, blow, striking, protest, stoppage, walkout, strikes}] \\
N(\text{FX}) & =  & [ \text{showtime, cnbc, ff, nickelodeon, hbo, wb, cw, vh1}] \\
\end{array}
$

It may be desirable to create word embeddings on a narrow (domain specific) corpus.

This is not difficult provided you have enough data.

# Obtaining Dense Vectors: Transfer Learning

How do we obtain Dense Vector representations that seem to have these wonderful properties ?

- Through Machine Learning !
- As a by-product of solving a specific Source task
- Once we have the embeddings, we can re-use them in many other Target tasks.


This is exactly what we called Transfer Learning
- We train a Source Task
- The layers and associated weights learned in training the Source Task
- Are re-used for a different Target Task

The layer and associated weights that implement the dense vector encoding are re-used
- Called an *Embedding Layer*

We will show code that trains an Embedding Layer shortly.

# Word prediction problems: high-level

The Source Task we will use to create word embeddings is from a class of *Word Prediction* tasks
- Given a set of tokens (the "context")
- Predict a related token

For example
- Given prefix $\w_{(1)} \ldots \w_{(\tt-1)}$ of a sequence of tokens $\w$
- Predict the next token $\w_\tp$.
>"Machine Learning is  $\langle ??? \rangle $"

Or a similar problem
- Predict token $\w_\tp$
- From surrounding tokens 
$$\w_{(\tt-o)}, \ldots, \w_{(\tt-1)} , \langle ??? \rangle, \w_{(\tt+1)} \ldots, \w_{(\tt+o)}$$
>"Machine  $\langle ??? \rangle$ is easy"



The inspiration behind using a Word Prediction task to learn embeddings
- Is that the meaning of a word can be inferred from its context
- "You are known by the company that you keep"

For example
- "I ate an apple"
- "I ate a blueberry"
- "I ate a pie"

"apple", "blueberry" and "pie" appear in the same context, so they share a concept: things that you eat

The Word Prediction task is a form of Classification: the classes are the tokens in the vocabulary.

We need a large number of training examples as this is a Supervised Learning problem.

One reason that Word Prediction is used is that it is fairly easy to obtain training examples
- From any source of raw text
- Just reformat
- That is: the target/label for an example is just an adjacent token

Since targets can be derived from examples, this is sometimes called *Semi-Supervised Learning* (the term *Self-Supervised Learning* is also widely used)
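As a sketch of the "just reformat" step: the function below turns a token sequence into (context, target) pairs for next-word prediction. The name `next_word_examples` and the whitespace tokenization are simplifying assumptions made for this illustration.

```python
def next_word_examples(tokens, context_size):
    """Return (context, target) pairs: `context_size` preceding tokens -> next token."""
    return [
        (tokens[t - context_size:t], tokens[t])
        for t in range(context_size, len(tokens))
    ]

tokens = "machine learning is easy".split()
for context, target in next_word_examples(tokens, context_size=2):
    print(context, "->", target)
# ['machine', 'learning'] -> is
# ['learning', 'is'] -> easy
```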




Let $\w$ be the sequence of $n_\w$ words 

A *word prediction* is a mapping 
- from input $\w$
- to a probability distribution $\hat{\y}$ over all words in vocabulary $\Vocab$
    - $\hat{\y}_j = \pr{\Vocab_j}$
    - That is: it assigns a probability to each word in the vocabulary

Here are some simple word prediction problems:

$
\begin{array}{ll}
\text{predict next word from context}  & \pr{\w_\tp | \w_{(\tt-\offset)}, \ldots, \w_{(\tt-1)} } \\
\text{predict a surrounding word}      & \pr{\w_{(\tt')} | \w_\tp }, \quad \tt' \in \{ \tt - \offset, \ldots, \tt + \offset \} - \{ \tt \} \\
\text{predict center word from context} & \pr{ \w_\tp | \w_{(\tt-\offset)}, \ldots, \w_{(\tt-1)}, \w_{(\tt+1)}, \ldots, \w_{(\tt+\offset)} } \\
\end{array}
$
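The second and third formulations differ only in which token plays the role of feature and which the role of target. A minimal sketch of both follows; the function names are invented here, and the choice to clip the window at the sequence boundaries for the surrounding-word case (rather than pad) is an assumption.

```python
def center_word_examples(tokens, offset):
    """(surrounding tokens, center token) pairs: predict center word from context."""
    examples = []
    for t in range(offset, len(tokens) - offset):
        context = tokens[t - offset:t] + tokens[t + 1:t + offset + 1]
        examples.append((context, tokens[t]))
    return examples

def surrounding_word_examples(tokens, offset):
    """(center token, surrounding token) pairs: predict a surrounding word."""
    examples = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - offset), min(len(tokens), t + offset + 1)
        examples.extend((center, tokens[u]) for u in range(lo, hi) if u != t)
    return examples

tokens = "machine learning is easy".split()
print(center_word_examples(tokens, offset=1))
# [(['machine', 'is'], 'learning'), (['learning', 'easy'], 'is')]
print(surrounding_word_examples(tokens, offset=1))
```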

Here is the Neural Network we construct for the Source Task that will learn embeddings
- Ignoring for the moment the issue of converting variable length sequences to a fixed length


<table>
    <tr>
        <th><center>Embedding Layer</center></th>
    </tr>
    <tr>
        <td><img src="images/Embedding_Layer.png"></td>
    </tr>
</table>

Layers:
- One Hot Encoded token
- Embedding: converts sparse encoding to dense encoding
- Classifier: operating on dense encodings


The only "new" layer type is the Embedding layer
- This is nothing more than Matrix Multiplication
- The mapping can be implemented as an $(|\Vocab| \times n_e)$ matrix $\Emb$
- Where $n_e$ is the length chosen for the dense vector

That is because 
- The OHE vector for the $j^{th}$ word $\Vocab_j$ in vocabulary $\Vocab$
- Is the $(|\Vocab| \times 1)$ vector of all $0$'s except at index $j$
$$
\text{OHE}(\Vocab_j)^{(j)} = 1
$$
- $\Emb^T * \text{OHE}(\Vocab_j)$
- Selects row $j$ of $\Emb$, which (written as a column) is the $(n_e \times 1)$ *dense vector* encoding of $\Vocab_j$
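A quick numerical check of the row-selection claim. The sizes and the random matrix below are placeholders standing in for learned weights.

```python
import numpy as np

vocab_size, n_e = 4, 3                      # illustrative sizes: |V| = 4, dense length 3
rng = np.random.default_rng(0)
Emb = rng.normal(size=(vocab_size, n_e))    # stand-in for learned embedding weights

j = 2
ohe = np.zeros(vocab_size)
ohe[j] = 1.0                                # One Hot Encoding of the j-th word

dense = Emb.T @ ohe                         # (n_e x |V|) times (|V| x 1)
print(np.allclose(dense, Emb[j]))           # True: the product selects row j
```

In practice an Embedding layer performs the row lookup `Emb[j]` directly rather than materializing the OHE vector, since the two are equivalent.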


The elements of matrix $\Emb$ are *weights to be learned* by training
- Along with the weights of the Classifier layer

In other words
- We train the Neural Network
- To create an embedding
- That makes it easy for a Classifier
- To solve the Source Task
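To make this concrete, here is a deliberately tiny sketch in plain NumPy rather than a deep learning framework. Everything about the setup is an assumption made for illustration: the corpus (built from the "I ate ..." sentences used earlier), a context of a single preceding token, $n_e = 2$, a single softmax Classifier layer, and the learning rate and number of epochs. It is not the architecture of any particular published model.

```python
import numpy as np

corpus = "i ate an apple . i ate a blueberry . i ate a pie .".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Training examples: current token -> next token
x = np.array([index[w] for w in corpus[:-1]])
y = np.array([index[w] for w in corpus[1:]])

n_vocab, n_e = len(vocab), 2
rng = np.random.default_rng(0)
Emb = rng.normal(scale=0.1, size=(n_vocab, n_e))   # Embedding layer weights
W = rng.normal(scale=0.1, size=(n_e, n_vocab))     # Classifier weights
b = np.zeros(n_vocab)                              # Classifier bias

def forward(x):
    h = Emb[x]                                     # row lookup, same as Emb^T @ OHE
    logits = h @ W + b
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    p = np.exp(logits)
    return h, p / p.sum(axis=1, keepdims=True)     # softmax probabilities

def cross_entropy(p, y):
    return -np.log(p[np.arange(len(y)), y]).mean()

lr, losses = 0.2, []
for epoch in range(500):
    h, p = forward(x)
    losses.append(cross_entropy(p, y))

    # Gradient of the mean cross entropy with respect to the logits
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)

    grad_W = h.T @ grad_logits
    grad_b = grad_logits.sum(axis=0)
    grad_h = grad_logits @ W.T
    grad_Emb = np.zeros_like(Emb)
    np.add.at(grad_Emb, x, grad_h)   # accumulate over repeated tokens

    # Embedding and Classifier weights are updated together
    Emb -= lr * grad_Emb
    W -= lr * grad_W
    b -= lr * grad_b

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print({w: Emb[index[w]].round(2) for w in ("apple", "blueberry", "pie")})
```

After training, the rows of `Emb` are the learned dense vectors. With a corpus this small they carry little meaning; the point is only that the embedding weights are trained jointly with the Classifier.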



# Conclusion

Categorical variables (such as tokens/words) are easily represented as One Hot Encoded values.

This is perfectly adequate when there is no relationship between tokens.

Word embeddings/Dense representations create a representation
- Not just of a token in isolation
- But a token with multiple dimensions of meaning
- Which enable inter-token relationships

We showed how to create dense representations of words as a by-product of solving a Source Task.

The Source Task we used was Word Prediction, but other tasks may work as well.

The embeddings learned for the Source Task may be useful in other tasks
- This is Transfer Learning in the real world


