## [NLP with Spacy](https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/59562)

Relevant git repository: https://github.com/datascienceinc/jupytercon-2017

### Tokenization:

What is tokenization? Separating a document / corpus into useful, distinct tokens.  Naive approach: splitting the text on whitespace. Drawbacks of this approach:
* a multi-word proper noun such as Great Britain is split in two, even though it might qualify as its own token
* punctuation gets lumped into the adjacent word
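To make the two drawbacks above concrete, here is a minimal sketch of the naive approach using Python's built-in `str.split`. The example sentence is made up for illustration and is not taken from the tutorial.

```python
# Naive tokenization: split on whitespace only.
text = "She moved to Great Britain (finally)."
tokens = text.split()

# "Great Britain" ends up as two tokens, and the parentheses and
# the final period stay glued to the word "finally".
print(tokens)
```

Both problems show up in the result: `Great` and `Britain` are separate entries, and the last token still carries its surrounding punctuation.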

#### How Spacy does it:

Earlier tokenizers leaned heavily on regular expressions, which have their limits.  Rough pseudocode of spaCy's tokenizer:

(taken from: https://github.com/datascienceinc/jupytercon-2017/blob/master/tutorial/2-Text-Processing-and-Tokenization.ipynb)
* Split text by whitespace.
* Iterate over space-separated substrings: 
    * The dog doesn't shop at Macy's (anymore).$\rightarrow$ [The, dog, doesn't, shop, at, Macy's, (anymore).]
* Check whether we have an explicitly defined rule for this substring. If we do, use it.
    * doesn't $\rightarrow$ [does, n't]
* Otherwise, try to consume a prefix.
    * (anymore). $\rightarrow$ [(, anymore).]
* If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
* If we didn't consume a prefix, try to consume a suffix.
    * anymore). $\rightarrow$ [anymore, ), .]
* If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
* Once we can't consume any more of the string, handle it as a single token.
    * [The, dog, does, n't, shop, at, Macy's, (, anymore, ), .]
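The loop described above can be sketched in plain Python. This is only an illustration of the control flow, not spaCy's implementation: the special-case table and the prefix, suffix, and infix patterns below are tiny hypothetical stand-ins for spaCy's much larger language-specific rule sets.

```python
import re

# Hypothetical, minimal rule sets (assumptions for illustration only).
SPECIAL_CASES = {"doesn't": ["does", "n't"]}
PREFIX_RE = re.compile(r'^[(\["]')
SUFFIX_RE = re.compile(r'[)\]".,!?]$')
INFIX_RE = re.compile(r'([-~])')


def tokenize(text):
    tokens = []
    for substring in text.split():  # split text by whitespace
        suffixes = []  # suffixes are collected outermost-first
        while substring:
            # Special cases are checked first on every pass.
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                substring = ""
                break
            prefix = PREFIX_RE.search(substring)
            if prefix:
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue  # back to the top so special cases get priority
            suffix = SUFFIX_RE.search(substring)
            if suffix:
                suffixes.append(suffix.group())
                substring = substring[:suffix.start()]
                continue
            # Nothing left to strip: split on infixes, keep the pieces.
            tokens.extend(p for p in INFIX_RE.split(substring) if p)
            substring = ""
        # Emit the stripped suffixes in their original left-to-right order.
        tokens.extend(reversed(suffixes))
    return tokens


print(tokenize("The dog doesn't shop at Macy's (anymore)."))
```

With these toy rules the example sentence yields the same token list as the walk-through above; `Macy's` stays whole only because no rule here splits on the apostrophe.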
    
#### Extending:

Can extend in a variety of ways:
* adding special tokenization cases
* modifying how a tokenizer operates / handles prefixes/suffixes
* creating a whole new tokenizer

(see bottom of [notebook 2](https://github.com/datascienceinc/jupytercon-2017/blob/master/tutorial/2-Text-Processing-and-Tokenization.ipynb) for examples)

## [Word / Document Representations](https://github.com/datascienceinc/jupytercon-2017/blob/master/tutorial/3-Word-Embeddings.ipynb)

### Document Representation

`sklearn` provides both count and TF-IDF vectorization implementations, which are often useful for document representation (recall the 20 Newsgroups assignment in COMP330).  The issue with the count / TF-IDF implementations is that they need the whole corpus up front to build their vocabulary, so new tokens cannot be added online.

#### Online Token Addition: [Feature Hashing for Stateless Transformations](https://arxiv.org/pdf/0902.2206.pdf)

Rather than giving each unique token its own slot in the document vector representation, hash the token, take the hash modulo the vector's dimensionality $K$, and increment that index:
* v $\leftarrow$ zero vector of length $K$
* for each token do
 - hash $\leftarrow$ $H$(token) mod $K$
 - $v_{hash} \leftarrow v_{hash} + 1$
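A minimal Python sketch of the hashing trick described above. The function name `hash_vectorize`, the choice of MD5 as $H$, and the default $K$ are assumptions made for illustration; any stable hash function would do. Python's built-in `hash` is avoided because its value for strings changes between interpreter runs.

```python
import hashlib


def hash_vectorize(tokens, k=16):
    """Map a list of tokens to a count vector of fixed length k."""
    v = [0] * k
    for token in tokens:
        # Stable hash of the token, reduced to an index in [0, k).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % k
        v[index] += 1
    return v


doc_vector = hash_vectorize("the dog saw the cat".split(), k=8)
print(doc_vector)
```

No vocabulary is stored, so previously unseen tokens can be vectorized at any time. The price is that distinct tokens may collide in the same index, more often as $K$ gets smaller.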
 

### Word Representation

The simplest representation is a term-term matrix, i.e. the product $X^T X$ of the document-term matrix $X$ (shape: [n sentences, n words]) with itself. Basically, it counts the co-occurrences of two words / tokens in the corpus.
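As a small illustration, the term-term matrix can be built from a toy corpus by multiplying the transposed document-term matrix with the document-term matrix. The three "sentences" are made up, and plain nested lists are used so the sketch needs no extra libraries; in practice one would likely use sparse matrices.

```python
# Toy corpus: each inner list is one tokenized sentence (made-up data).
docs = [["the", "dog", "barks"],
        ["the", "cat", "meows"],
        ["the", "dog", "sleeps"]]

vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

# Document-term matrix X with shape [n sentences, n words].
X = [[0] * len(vocab) for _ in docs]
for row, doc in zip(X, docs):
    for word in doc:
        row[index[word]] += 1

# Term-term matrix C = X^T X with shape [n words, n words].
n = len(vocab)
C = [[sum(X[d][i] * X[d][j] for d in range(len(docs))) for j in range(n)]
     for i in range(n)]

print(C[index["the"]][index["dog"]])  # sentences containing both words
```

Off-diagonal entries count how often two words share a sentence, which is why the matrix is symmetric. The diagonal holds the summed squared counts of each word rather than a co-occurrence.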

Raw co-occurrence counts often aren't very useful on their own, so one typically uses Positive Pointwise Mutual Information (PPMI), which is the PMI below with negative values clipped to zero:  

\begin{equation}
\text{PMI}(x, y) = \log_2\frac{P(x,y)}{P(x)P(y)}
\end{equation}

Recall probabilistic independence:
\begin{equation}
P(A, B) = P(A)P(B)
\end{equation}

Note that the denominator is the product of the probabilities of seeing $x$ and $y$ individually, which is what the joint probability would equal if they were independent.  So the PMI measures how much more likely it is to see $x$ and $y$ together than independence would predict.  (If they are independent we compute $\log_2(1) = 0$; if they are strongly dependent we compute something like $\log_2\left(\frac{0.1}{(0.1)(0.2)}\right) = \log_2(5) \approx 2.3$, for example.)
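A short sketch of the computation, assuming the probabilities have already been estimated from counts. The function name `ppmi` and the example probabilities are illustrative. The positive variant clips negative PMI values to zero and treats a zero joint probability as zero.

```python
import math


def ppmi(p_xy, p_x, p_y):
    """Positive PMI: max(0, log2(P(x, y) / (P(x) * P(y))))."""
    if p_xy == 0:
        return 0.0  # log2(0) is undefined; PPMI treats it as 0
    return max(0.0, math.log2(p_xy / (p_x * p_y)))


# x always appears together with y: strongly positive association.
print(ppmi(0.1, 0.1, 0.2))
# x and y co-occur less often than independence predicts: clipped to 0.
print(ppmi(0.001, 0.1, 0.2))
```

The joint probability can never exceed either marginal, so the first call is the largest value possible for these marginals.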

#### Neural Embeddings

Some images providing some intuition:
![NN](jupytercon-2017/tutorial/assets/softmax-nplm.png)
![NN](jupytercon-2017/tutorial/assets/BengioNN.png)

## [POS Tagging](https://github.com/datascienceinc/jupytercon-2017/blob/master/tutorial/4-Part-of-Speech-Tagging.ipynb)

(Last Assignment in 182)

### Overview of what talk covered

* What POS are
* Why they are useful in natural language processing
* Why word vectors need to take into consideration POS
* What SpaCy provides, how to extend it / train your own tagger
* Full POS tagger from start to finish with SpaCy
* A bit about perceptrons / weighted perceptrons

spaCy's `nlp` pipeline automatically tags the parts of speech of a sentence when it processes the text:  

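```python
import spacy
nlp = spacy.load("en_core_web_sm")  # assumption: any installed English pipeline with a tagger
sentence = nlp("I went to the store")
sentence[0].tag_  # fine-grained POS tag of the first token ("I")
```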

so that the `tag_` attribute is available to the user.  You can also train your own, which requires:

* a vocabulary object: stores lexeme data
* a statistical model for prediction: consumes features (such as the POS of a word, the preceding word, the following word, Brown cluster information, etc.)
* a tagger: uses the vocab and the model to tag the tokens
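To give a feel for how a statistical model can consume such features, here is a minimal perceptron tagger in plain Python. It is a sketch under stated assumptions, not spaCy's tagger: the feature template, the class and function names, and the two training sentences are all hypothetical, and it uses a plain (non-averaged) perceptron update.

```python
from collections import defaultdict


def features(words, i, prev_tag):
    """Hypothetical feature template: current word, neighbors, previous tag."""
    return [
        "word=" + words[i].lower(),
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"),
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"),
        "prev_tag=" + prev_tag,
    ]


class Perceptron:
    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> weight

    def predict(self, feats):
        # Highest-scoring tag; ties go to the earliest tag in self.tags.
        return max(self.tags,
                   key=lambda t: sum(self.weights[(f, t)] for f in feats))

    def update(self, feats, gold, guess):
        if gold == guess:
            return
        for f in feats:
            self.weights[(f, gold)] += 1.0   # reward the correct tag
            self.weights[(f, guess)] -= 1.0  # penalize the wrong guess


def train(model, sentences, epochs=5):
    for _ in range(epochs):
        for words, tags in sentences:
            prev = "<s>"
            for i, gold in enumerate(tags):
                feats = features(words, i, prev)
                model.update(feats, gold, model.predict(feats))
                prev = gold  # use the gold tag as history while training


def tag(model, words):
    tags, prev = [], "<s>"
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev))
        tags.append(prev)
    return tags


# Made-up toy training data.
data = [(["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
        (["the", "cat", "sleeps"], ["DT", "NN", "VBZ"])]
model = Perceptron(["DT", "NN", "VBZ"])
train(model, data)
print(tag(model, ["the", "dog", "sleeps"]))
```

A realistic tagger would average the weights over all updates and use far richer features and training data.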


## [Deep Learning for Classification](https://github.com/datascienceinc/jupytercon-2017/blob/master/tutorial/6-Deep-Learning-for-Classification.ipynb)

Briefly discussed the history of language models and what they are.  In general, a language model is:  

\begin{equation}
P(word_i | context)
\end{equation}
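One simple way to estimate this conditional probability is by counting, as in the n-gram approach listed in the history below. The sketch assumes a trigram model with no smoothing; the function name `train_trigram` and the toy corpus are illustrative.

```python
from collections import Counter


def train_trigram(tokens):
    """Return a function estimating p(word | prev2, prev1) from counts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each two-word context only where a next word follows it.
    contexts = Counter((a, b) for a, b, _ in zip(tokens, tokens[1:], tokens[2:]))

    def prob(word, prev2, prev1):
        seen = contexts[(prev2, prev1)]
        return trigrams[(prev2, prev1, word)] / seen if seen else 0.0

    return prob


corpus = "the dog barks and the dog sleeps and the dog barks".split()
p = train_trigram(corpus)
print(p("barks", "the", "dog"))   # 2 of the 3 continuations of "the dog"
print(p("sleeps", "the", "dog"))  # 1 of the 3 continuations of "the dog"
```

Any word sequence that never occurred in the corpus gets probability zero here, which is one face of the dimensionality and generality problem.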

### History (Taken from link above)

* Old approach: ngram model: 
    * empirically estimate $p(word_i | word_{i-1}, word_{i-2})$

* Bengio et al 2003: A neural probabilistic language model:
    * Too many words! High dimensionality, low generality of ngram models.
    * Not leveraging dependencies learned between individual words.
    * Ignoring words outside a very small context window.
    * Solution: Construct a feed forward network to predict context given word (or vice versa), and express word probabilities in terms of the intermediate feature vector representation learned by the network.
    * Bonus: Models that assign high probabilities to word sequences "understand" them, so the intermediate representations are likely useful for other tasks  (word embeddings).
    
    ![Kiku](jupytercon-2017/tutorial/assets/BengioNN.png)
    
* Mikolov et al. 2010: 
    * Feed forward model requires a fixed length window (researcher degree of freedom, etc).
        * Long range dependencies are lost!
        * Context confined to neighborhoods defined by heuristics.
    * Use recurrent connections to pass contextual information through arbitrarily long (theoretically) sequences of words:
    
    ![RNN](jupytercon-2017/tutorial/assets/RNN.png)
    
    such that: 
    * $h_t = \text{activation}(W_h h_{t-1} + W_x x_t)$
    * $y_t = \text{softmax}(W_y h_t)$
    * $W_x$ can be used as an embedding matrix
    
    Challenges and Extensions:
    * Ignores future words $\rightarrow$ bidirectional RNNs
    * numerical challenges with gradients (exploding/vanishing)
    * hard to train weights that retain long term dependencies $\rightarrow$ trainable activation units
        * gated recurrent units or LSTMs
            * learn the degree to which the hidden state is updated with information from the current word versus the last hidden state, i.e., should we keep this hidden-layer highway going, or should we forget context?
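The two recurrence equations above can be sketched as a single forward step in plain Python. This assumes `tanh` as the activation and one-hot word vectors; the weight values are arbitrary made-up numbers, and no training is shown.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(W_h, W_x, W_y, h_prev, x_t):
    # h_t = activation(W_h h_{t-1} + W_x x_t), with tanh as the activation
    pre = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]
    h_t = [math.tanh(v) for v in pre]
    # y_t = softmax(W_y h_t): a distribution over the vocabulary
    y_t = softmax(matvec(W_y, h_t))
    return h_t, y_t


# Arbitrary toy weights: hidden size 2, vocabulary size 3.
W_h = [[0.5, -0.1], [0.2, 0.3]]
W_x = [[0.1, 0.4, -0.3], [0.7, -0.2, 0.5]]
W_y = [[0.2, -0.5], [0.1, 0.3], [-0.4, 0.6]]

h0 = [0.0, 0.0]
x1 = [0.0, 1.0, 0.0]  # one-hot vector for word index 1
h1, y1 = rnn_step(W_h, W_x, W_y, h0, x1)
print(h1, y1)
```

With a one-hot input, `matvec(W_x, x1)` simply selects column 1 of `W_x`, which is why $W_x$ can be read as an embedding matrix with one column per word.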