In [1]:
%run Latex_macros.ipynb

$$
\newcommand{\V}{\mathbf{V}}
\newcommand{\v}{\mathbf{v}}
\newcommand{\offset}{o}
\newcommand{\o}{o}
\newcommand{\E}{\mathbf{E}}
$$


# Natural Language Processing

The datasets for Machine Learning have historically been mainly numeric.

But non-numeric data, such as images and text, is an abundant and potentially rich source of insight.

We have illustrated many of the concepts in this course with Image data.

We will briefly dive into the world of text.
  
*Natural Language Processing* is the set of tools/techniques that facilitate using text as raw material.

## The world of text

- SEC filings
- analyst reports
- news articles
- tweets

We will approach text mainly from a Deep Learning perspective
- lots of data
- minimal pre-processing
- "feature engineering" by the Neural Network

That is not to discount more "classical" methods for NLP
- Part of speech
- Stemming
- Lemmatization
- n-grams

All of these are potentially useful as pre-processing steps for Deep Learning.

However, if our data sets are big enough, it may be counter-productive to preprocess.

# Issues with text

There are several big issues to tackle regarding text data
- Words are categorical variables
- Token sequences (sentences/paragraphs) are variable length
- Token sequences: order matters

We are using the term "token" rather than word
- tokens may include punctuation, special characters
- tokens may be characters rather than entire words

## Notation
- $\w$ is a sequence of $n_\w$ tokens $\w_{(1)}, \ldots, \w_{(n_\w)}$; we also write $||\w||$ for the length $n_\w$
- each token is an element of vocabulary $\V$: $\w_\tp \in \V$ for $1 \le \tt \le ||\w||$
    - token $j$ in vocabulary $\V$ is denoted $\V_j$
- We define two pseudo-tokens to denote the start/end of the sentence
    - $\w_{(0)} = \text{<START>}$
    - $\w_{(n_\w+1)} =\text{<END>}$
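
As a rough illustration of the notation, here is a minimal sketch of turning a sentence into a token sequence with the start/end pseudo-tokens. The regular expression and the helper names `tokenize` and `add_boundaries` are assumptions made for this example; real tokenizers are considerably more careful.

```python
import re

START, END = "<START>", "<END>"

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

def add_boundaries(tokens):
    """Wrap a token sequence with the <START>/<END> pseudo-tokens."""
    return [START] + tokens + [END]

tokens = tokenize("The bull market ended!")
padded = add_boundaries(tokens)
print(padded)
```

Note that the punctuation mark becomes a token of its own, which is why we speak of tokens rather than words.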

We need a function to convert a token into a numeric vector:
$$\text{rep}: \text{token} \mapsto \mathbb{R}^{n_\V}
$$


One Hot Encoding (OHE) and word embeddings are examples of such a function.

- For OHE: $n_\V = ||\V||$
- For Word Embeddings: $n_\V$ is the dimension of the embedding vector

We will extend $\text{rep}$ to sequences $\w$:
$$\text{rep}(\w) = \left[ \text{rep}(\w_\tp) | 1 \le \tt \le ||\w||  \right]$$

# Issue 1: Words are categorical variables

We address the first issue relating to text: words are *categorical variables*.

By now, we should know to **not** treat categorical variables as ordinals.

Let's review the reason.

Treating a word as an ordinal 
$$\text{rep}(w) \in \mathbb{R}^1$$ 
would imply
- "apple" < "orange" is a sensible statement
- that this ordering is meaningful to a Machine Learning model

**Example**

Linear regression:
$$
\y = \Theta^T \text{rep}(w)
$$

Predict $\y$ given feature vector (attributes) $\text{rep}(w)$
- by learning parameters $\Theta$

Suppose that we tried to encode word $w$ with an integer:  $\text{rep}(w) = I_w$.
- If $I_\text{apple} = 10 * I_\text{orange}$
    - then "apple" has 10 times the impact on prediction $\hat{\y}$ as "orange"
    - since the impact is $\Theta * I_w$
- Re-encoding "apple" with a value 10 times larger would make it 10 times more important
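
To make the point concrete, here is a small sketch with a single weight and two arbitrary integer codings of the same words. The value of `theta` and the codes themselves are made up for illustration.

```python
theta = 0.5  # a single learned weight (value chosen for illustration)

# Two arbitrary integer codings of the same three words
coding_a = {"orange": 1, "apple": 10, "pear": 3}
coding_b = {"orange": 10, "apple": 1, "pear": 3}

def predict(word, coding):
    # Linear model on a scalar feature: y_hat = theta * I_w
    return theta * coding[word]

# Relative impact of "apple" versus "orange" under each coding
ratio_a = predict("apple", coding_a) / predict("orange", coding_a)
ratio_b = predict("apple", coding_b) / predict("orange", coding_b)
print(ratio_a, ratio_b)
```

The relative impact of the two words flips purely because of how we numbered them, not because of anything in the data.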
       

## Sparse Representation of words by One Hot Encoding (OHE)

So the natural way of representing a word is as a categorical variable
- indicator per word: $\text{Is}_\text{apple}$
- One Hot Encoding

OHE is a *sparse* representation
- length of $\text{rep}(w)$ is $| \V |$, yet only a single non-zero element

The problem is that there are lots of words!
- $|\V|$ is large!
- $\text{rep}(\w)$ length is $|\w| |\V|$
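
A minimal sketch of OHE, assuming a toy four-word vocabulary. The names `vocab`, `ohe` and `rep` are illustrative and do not come from a library.

```python
import numpy as np

vocab = ["apple", "cat", "dog", "dogs"]            # toy vocabulary V
index = {word: j for j, word in enumerate(vocab)}  # word -> position j in V

def ohe(word):
    """One Hot Encoding: a length-|V| vector with a single 1."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def rep(sequence):
    """Encode a token sequence as an array of shape (len(sequence), |V|)."""
    return np.stack([ohe(word) for word in sequence])

sentence = ["dog", "cat", "dogs"]
encoded = rep(sentence)
print(encoded.shape)        # (3, 4)
print(encoded.sum(axis=1))  # one non-zero entry per token
```

With a realistic vocabulary of several hundred thousand words, each row would still contain only a single 1.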

# Issue 1 revisited: Sparse versus dense representation of categoricals

## Dense representation of words: Embeddings

Sparse encodings, such as OHE
- convert a token into a vector of features
- where the features are orthogonal: only one is active at a time

This is called a *discrete* representation.

Discrete representations have major drawbacks
- they are long
    -  $\text{rep}(\w)$ length is $||\w|| * ||\V||$
- there is no meaningful metric of "distance" between the representation of words

To illustrate the "lack of distance" issue, let 

$$
\text{OHE}(w)
$$

denote the One Hot Encoding of word $w$.

Using the dot product as a measure of similarity (for one-hot vectors, which have unit length, it equals the cosine similarity)

| word   | OHE(word) | Similarity |
| ---    | ---       | :---:        |
| dog   | [1,0,0,0]   | OHE(word) $\cdot$ OHE(dog)  = 1  |
| dogs  | [0,1,0,0]   | OHE(word) $\cdot$ OHE(dog)  = 0  |
| cat   | [0,0,1,0]   | OHE(word) $\cdot$ OHE(dog)  = 0  |
| apple | [0,0,0,1]   | OHE(word) $\cdot$ OHE(dog)  = 0  |


Each pair of distinct words has 0 similarity
- no recognition of plural form
- no recognition of commonality (pets)

This is due to the fact that only a single "feature" of the OHE is active (non-zero).
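
The table can be reproduced with a few lines of NumPy. This is a sketch using the same four words; `cosine_similarity` is a helper written for this example.

```python
import numpy as np

words = ["dog", "dogs", "cat", "apple"]
one_hot = np.eye(len(words))   # row i is OHE(words[i])

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity of every word to "dog" (row 0)
similarity_to_dog = {
    w: cosine_similarity(one_hot[i], one_hot[0]) for i, w in enumerate(words)
}
print(similarity_to_dog)

# All pairwise dot products at once: the identity matrix
gram = one_hot @ one_hot.T
```

The matrix of pairwise similarities is the identity: every word is similar only to itself.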

However, it's possible that, in reality, there are many "dimensions" to a word, for example
- singular/plural
- entity type, e.g., Person
- positive/negative

- "Cats", "Dogs", "Apples"
    - related by being plural form
- "Cat", "Dog"
    - related by being animals
- "good", "bad"
    - related by being "opposites"

Thus it is not unreasonable to represent a word as a short *dense vector* of features 
- each feature (vector element) captures a concept
- numeric value of element encodes the strength of the word's relation to the concept

Ideally the features would be independent.

This is called a *continuous* word representation.


# Doing math with words

Let's explore the implication and power of dense vector representation of words.

Let $\v_w$ be the dense vector/embedding for word $w$
- captures multiple aspects of a word
- where each element of the vector is a nearly-independent aspect
- then we can perform interesting mathematical manipulations on word vectors


| $w$    | $\v_w$          |
| ---    | ---             |
| cat    | [.7, .5, .01]   |
| cats   | [.7, .5, .95]   |
| dog    | [.7, .2, .01]   |
| dogs   | [.7, .2, .95]   |
| apple  | [.1, .4, .01]   |
| apples | [.1, .4, .95]   |

Does the last dimension encode "plural form" ?
$$
\v_\text{cats} - \v_\text{cat} \approx \v_\text{dogs} - \v_\text{dog} \approx \v_\text{apples} - \v_\text{apple}
$$

If so:
$$
\v_\text{apples} \approx \v_\text{apple} + (\v_\text{cats} - \v_\text{cat})
$$
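
Using the toy vectors from the table above, the arithmetic can be checked directly. The dictionary `v` simply transcribes the table.

```python
import numpy as np

# Toy dense vectors, copied from the table
v = {
    "cat":    np.array([0.7, 0.5, 0.01]),
    "cats":   np.array([0.7, 0.5, 0.95]),
    "dog":    np.array([0.7, 0.2, 0.01]),
    "dogs":   np.array([0.7, 0.2, 0.95]),
    "apple":  np.array([0.1, 0.4, 0.01]),
    "apples": np.array([0.1, 0.4, 0.95]),
}

# The "plural" offset: only the last dimension changes
plural_offset = v["cats"] - v["cat"]
print(plural_offset)

# Adding the offset to a singular word lands on its plural
predicted_apples = v["apple"] + plural_offset
print(np.allclose(predicted_apples, v["apples"]))
```

In these hand-made vectors the match is exact; with learned embeddings we would only expect it to hold approximately.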


## Word analogies

king:man :: ? : woman


Let
- $\v_w$ be the dense vector for word $w$
- $d(\v_{w}, \v_{w'})$ be some measure of the distance between the two vectors $\v_{w}, \v_{w'}$
    - e.g., ( $1 - \text{cosine similarity}$ )

Using the distance metric, we define the set of words in vocabulary $\V$ that are "closest" to a word $w$.

Let
- $\text{wv}_{n',d}(\v_w)$ be the dense vectors of the $n'$ words in $\V$ closest to word $w$
$$
\text{wv}_{n',d}(\v_w) = \{ \v_{w'} | \text{rank}_\V( d(\v_{w}, \v_{w'}) ) \le n' \}
$$
- $N_{n',d}(w)$ be the set of $n'$ words associated with $\text{wv}_{n',d}(\v_w)$
$$
N_{n',d}(w) = \{ w' | \v_{w'} \in \text{wv}_{n',d}(\v_w) \}
$$
$$

We say that two words $w, w'$ are approximately equal if $w'$ is among the words closest to $w$

$$
w \approx_{n',d} w' \; \; \text{if } w' \in N_{n',d}(w) 
$$

That is: 
- word $w$ is approximately equal to word $w'$
- if $w'$ is among the $n'$ words closest to $w$ according to distance metric $d$.

Finally, we can define word analogies:

a:b :: c:d

means

$$
\v_a - \v_b  \approx_{n',d}  \v_c - \v_d 
$$

So to solve the word analogy for $c$, we look for the word whose vector is among the closest to $\v_a - \v_b + \v_d$:
$$
\v_c \approx_{n',d}  \v_a - \v_b + \v_d
$$

To be concrete:
$$
\v_\text{king} - \v_\text{man} + \v_\text{woman} \approx_{n',d} \v_\text{queen}
$$
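
The analogy procedure can be sketched as follows. The four-dimensional vectors are invented for this example (roughly: royalty, male, female, fruit), and the query words are excluded from the candidates, which is a common convention that we assume here.

```python
import numpy as np

# Hypothetical toy embeddings: [royalty, male, female, fruit]
vectors = {
    "king":  np.array([0.9, 0.9, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.9, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine_distance(u, v):
    """d(u, v) = 1 - cosine similarity."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def neighbors(target, n, exclude=()):
    """The n words whose vectors are closest to the vector `target`."""
    candidates = [w for w in vectors if w not in exclude]
    return sorted(candidates, key=lambda w: cosine_distance(target, vectors[w]))[:n]

def solve_analogy(a, b, d, n=1):
    """Solve a:b :: ?:d by searching near v_a - v_b + v_d."""
    target = vectors[a] - vectors[b] + vectors[d]
    return neighbors(target, n, exclude={a, b, d})

answer = solve_analogy("king", "man", "woman")
print(answer)
```

With real embeddings the target vector rarely coincides with any word vector, which is why the definition asks only for membership in the neighborhood.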

## GloVe: Pre-trained embeddings

Fortunately, you don't have to create your own word-embeddings from scratch.

There are a number of pre-computed embeddings freely available.

GloVe is a family of word embeddings that have been trained on large corpora
- GloVe6b
    - Trained on 6 Billion tokens
    - 400K words
    - Corpus:  Wikipedia (2014) + GigaWord5 (version 5, news wires 1994-2010)
    - Many different dense vector lengths to choose from
        - 50, 100, 200, 300

We will illustrate the power of word embeddings using GloVe6b vectors of length $100$.

$
\begin{array}{lll}
\text{king - man + woman} &  \approx_{n',d} & \text{queen } \\
\text{man - boy + girl} &  \approx_{n',d} & \text{woman } \\
\text{Paris - France + Germany} &  \approx_{n',d} & \text{Berlin } \\
\text{Einstein - science + art} &  \approx_{n',d} & \text{Picasso} \\
\end{array}
$

You can see that the dense vectors seem to encode "concepts", that we can manipulate mathematically.
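
For those who want to experiment, here is a sketch of reading GloVe vectors from their plain text format, where each line holds a word followed by its vector components separated by spaces. The file name in the comment is an assumption about the download, and the sample values below are made up so that the example runs without the file.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text format into a dict: word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = np.asarray([float(x) for x in values], dtype=np.float32)
    return vectors

# Stand-in for: open("glove.6B.100d.txt", encoding="utf-8")  (assumed file name)
sample = io.StringIO("king 0.5 0.7 -0.1\nqueen 0.4 0.8 -0.2\n")
glove = load_glove(sample)
print(glove["king"].shape)
```

The resulting dictionary can be used for the neighborhood and analogy computations described earlier.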

You may also discover some unintended bias:

$
\begin{array}{lll}
\text{doctor - man + woman} &  \approx_{n',d} & \text{nurse } \\
\text{mechanic - man + woman} &  \approx_{n',d} & \text{teacher } \\
\end{array}
$

### Domain specific embeddings

Do we speak Wikipedia English in this room ?

Here are the neighborhoods of some financial terms, according to GloVe:

$
\begin{array}{lll}
N(\text{bull}) & =  & [ \text{cow, elephant, dog, wolf, pit, bear, rider, lion, horse}] \\
N(\text{short}) & =  & [ \text{rather, instead, making, time, though, well, longer, shorter, long}] \\
N(\text{strike}) & =  & [ \text{workers, struck, action, blow, striking, protest, stoppage, walkout, strikes}] \\
N(\text{FX}) & =  & [ \text{showtime, cnbc, ff, nickelodeon, hbo, wb, cw, vh1}] \\
\end{array}
$

It may be desirable to create word embeddings on a narrow (domain specific) corpus.

This is not difficult provided you have enough data.

# Obtaining Dense Vectors: Transfer Learning

How do we obtain dense vector representations of words?

We learn them!

Suppose we had a task T 
that involves mapping a sequence of words to an outcome.

To be concrete: mapping a movie review to an indicator of Positive/Negative sentiment.



Ignoring for the moment the issue of converting variable length sequences to a fixed length
- inputs are the OHE of words
- target is the Positive/Negative label
- the model is Logistic Regression from the sentence representation to the binary target Positive/Negative

One could also ask
- can we map the OHE of a word $\w_\tp$ (length $|\V|$)
- to a shorter, dense vector $\mathbf{e}_\tp$ of length $n_e$
- and use the dense vector in the Logistic Regression?
 
This mapping can be represented by a $(|\V| \times n_e)$ matrix $\E$

$$
\mathbf{e}_\tp = \text{OHE}(\w_\tp)^T \E 
$$
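
Because the OHE vector has a single 1, the product above simply selects one row of the matrix. A small sketch, with a randomly filled matrix standing in for a learned one and sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_e = 5, 3
E = rng.normal(size=(vocab_size, n_e))  # embedding matrix, one row per word

j = 2                                   # index of the token in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[j] = 1.0

e_via_product = one_hot @ E             # OHE(w)^T E
e_via_lookup = E[j]                     # equivalent: read row j directly
print(np.allclose(e_via_product, e_via_lookup))
```

This is why, in practice, an embedding is usually implemented as a table lookup rather than a matrix multiplication.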

Using Machine Learning, 
- we solve for both the Logistic Regression parameters $\W$ *and* $\mathbf{E}$
- when solving the Classification Task via Logistic Regression.
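
Here is a minimal sketch of learning both sets of parameters together with plain gradient descent. Several things are assumptions made for illustration: the six toy reviews, the choice to average the word embeddings of a review into a fixed length sentence representation, and the learning rate and iteration count. The code names the classifier weights `W`, the embedding matrix `E` and the bias `b`.

```python
import numpy as np

vocab = ["good", "great", "bad", "awful", "movie"]
index = {w: j for j, w in enumerate(vocab)}
reviews = [(["good", "movie"], 1), (["great", "movie"], 1),
           (["bad", "movie"], 0), (["awful", "movie"], 0),
           (["good", "great"], 1), (["bad", "awful"], 0)]

# Row i of X is the average of the OHE vectors of the tokens in review i
X = np.zeros((len(reviews), len(vocab)))
for i, (tokens, _) in enumerate(reviews):
    for t in tokens:
        X[i, index[t]] += 1.0
X /= X.sum(axis=1, keepdims=True)
y = np.array([label for _, label in reviews], dtype=float)

rng = np.random.default_rng(0)
n_e = 2                                           # embedding dimension
E = rng.normal(scale=0.1, size=(len(vocab), n_e))  # embedding matrix
W = rng.normal(scale=0.1, size=n_e)               # Logistic Regression weights
b = 0.0
lr = 0.5

for _ in range(2000):
    H = X @ E                                # averaged embeddings per review
    p = 1.0 / (1.0 + np.exp(-(H @ W + b)))   # predicted P(positive)
    err = (p - y) / len(y)                   # d(mean cross-entropy)/d(logit)
    grad_W = H.T @ err
    grad_E = X.T @ np.outer(err, W)
    grad_b = err.sum()
    W -= lr * grad_W
    E -= lr * grad_E
    b -= lr * grad_b

predictions = (1.0 / (1.0 + np.exp(-(X @ E @ W + b))) > 0.5).astype(int)
print(predictions)
```

The rows of `E` after training are the learned word vectors; they are a by-product of solving the classification task.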

The matrix $\mathbf{E}$ is called 
- an *embedding matrix* for words 
- and $\mathbf{e}_\tp$ is called an *embedding* or *word vector* for word $\w_\tp$

*Word embeddings* have become an important component of Deep Learning for NLP.

<table>
    <tr>
        <th><center>Word prediction: Neural Net</center></th>
    </tr>
    <tr>
        <td><img src="images/w2v_word_prediction_layers.jpg" width=800></td>
    </tr>
</table>

In other words
- we have learned a dense vector representation of words $\mathbf{E}$
- that is useful for a particular classification task



Might it be possible that the dense vector representation of words learned for this task
- is useful for other tasks involving words?

Reusing it in this way is *Transfer Learning*.

In [2]:
print("Done")

Done
