### In this section:
* Thoughts on word embeddings as a means of representing text
* Integrating your embeddings into SpaCy
* Thoughts on LSTMs for classification
* Integrating your classifier into SpaCy
* Visualizing your embeddings [if time permits]
* Visualizing your model [if time permits]


### How to represent words for NLP
* motivation: why bag of words / one-hot encoding is a poor representation
    * curse of dimensionality, sparsity, ignores context, cannot handle new (out-of-vocabulary) words, etc.
* Word vectors
    * distributional hypothesis 
        * describing the landscape of models as using different types of "context"   
        * count and predictive approaches [<a href="#note1">note.</a>]
        * larger context: semantic relatedness (e.g. “boat” – “water”)
        * smaller context: semantic similarity (e.g. “boat” – “ship”)
    * quick overview on methods
    * SVD on doc/word matrices
    * SVD on co-occurrence matrices with window
    
    * some issues:
        * large matrices!
        * expensive to decompose (a full SVD of an m×n matrix costs O(mn·min(m, n)))
        * sparse
    * GloVe
    * word2vec: make word vectors the parameters of a model whose objective is to predict local context.
    * go over word2vec in a little more detail
        * skip gram
        * cbow
        * negative sampling
    * word embeddings in python:
        * sklearn/pydsm + numpy (vectorizers + matrix decompositions)
        * gensim (word2vec)
* Neural models 

        
        
* Inspecting results of word embeddings:
    * self organizing maps
* Validating word vectors:
    * intrinsic vs extrinsic
    
* A note on NNs:
    * transferable features live in the shallow parts of a network; there's an analogy there with word2vec (a shallow network).
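
To make the count-based route from the outline concrete ("SVD on co-occurrence matrices with window"), here is a minimal sketch. The toy corpus, the window of 2, and the choice of `k = 2` dimensions are all made up for illustration; a real corpus would need sparse matrices and a truncated solver.

```python
import numpy as np

# Toy corpus (made up for illustration).
sentences = [
    "the boat sails on the water".split(),
    "the ship sails on the sea".split(),
    "the cat sits on the mat".split(),
]
window = 2  # how many neighbours on each side count as context

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric word-by-word co-occurrence counts within the window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for pos, word in enumerate(sent):
        lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
        for ctx in sent[lo:pos] + sent[pos + 1:hi]:
            counts[index[word], index[ctx]] += 1

# Truncated SVD: keep the top-k left singular vectors, scaled by singular values.
k = 2
U, S, _ = np.linalg.svd(counts)
vectors = U[:, :k] * S[:k]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[index["boat"]], vectors[index["ship"]]))
```

Each row of `vectors` is a dense embedding for one vocabulary word; widening `window` moves the vectors from similarity toward relatedness, as noted in the outline.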

### Exercise 1:
Train your own word2vec model on the 20 Newsgroups dataset loaded below, and load those vectors into spaCy. Visually inspect the resulting vectors as a self-organizing map.

In [1]:
!pip install gensim >> gensim-log.txt
from gensim.models import Word2Vec
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()
corpus = dataset.data







Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


In [12]:
#!pip install spacy >> spacy-install.log
#!python -m spacy download en >> spacy-download.log
import spacy
nlp = spacy.load('en')

In [13]:
doc = nlp(corpus[0])

In [None]:
from spacy.tokens import Doc
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, IS_PUNCT

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

def process_token(token):
    if token.is_punct or token.is_space:
        return False
    elif token.like_url:
        return 'URL'
    elif token.like_email:
        return 'EMAIL'
    elif token.like_num:
        return 'NUM'
    else:
        return token.lower_.replace(" ",'')
    
    
def process_sentence(tokenized_sent):
    tokens = []
    
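    doc = Doc(nlp.vocab, words = tokenized_sent)
    nlp.tagger(doc)
    # `matcher` is not defined in this cell: it is expected to be a Matcher built on
    # nlp.vocab whose patterns use merge_phrases as their on_match callback.
    matcher(doc)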
    
    for token in doc:
        processed_token = process_token(token)
        if processed_token:
            tokens.append(processed_token)    
    return tokens



In [102]:
from spacy.tokens import Doc
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, IS_PUNCT

class TextProcesser(object):
    def __init__(self, nlp=None):
        self.nlp = nlp or spacy.load('en')
        
    def __call__(self, corpus):
        for doc in self.nlp.pipe(corpus, parse=False):
            for ent in doc.ents:
                ent.merge()
            # Yield one token list per document: Word2Vec expects a list of lists of strings.
            yield list(map(self.process_token, doc))

            
    def process_token(self, token):
        if token.like_url:
            return 'URL'
        elif token.like_email:
            return 'EMAIL'
        elif token.like_num:
            return 'NUM'
        else:
            return token.lower_

In [103]:
T = TextProcesser(nlp)

In [104]:
t = T(corpus)

In [107]:
g = list(t)

In [None]:
model = Word2Vec(sentences=g,  # tokenized documents: must be a list of lists of strings
                 size=300,  # size of embedding vectors (`vector_size` in gensim >= 4.0)
                 workers=8,  # number of worker threads
                 min_count=5,  # ignore tokens that occur fewer than 5 times
                 sample=0,  # threshold for downsampling frequent words; 0 disables it
                 sg=0,  # 1 = skip-gram, 0 = CBOW
                 hs=0,  # 1 = hierarchical softmax, 0 = negative sampling
                 iter=5  # training epochs (`epochs` in gensim >= 4.0)
        )
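
The `sg` flag above switches between the two training setups. As a rough illustration of what each one trains on, the sketch below builds the training examples by hand. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and this leaves out things gensim actually does, such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """(centre, context) pairs: skip-gram predicts each context word from the centre."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context list, centre) examples: CBOW predicts the centre from its context."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, centre))
    return examples

tokens = "the boat sails on the water".split()
print(skipgram_pairs(tokens, window=1)[:4])
# [('the', 'boat'), ('boat', 'the'), ('boat', 'sails'), ('sails', 'boat')]
print(cbow_examples(tokens, window=1)[:2])
# [(['boat'], 'the'), (['the', 'sails'], 'boat')]
```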

In [None]:
<p id="note1">
It turns out the distinction may not be that important (see Levy and Goldberg (2014), Pennington et al. (2014), and Österlund et al. (2015), as referenced in https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/).
</p>
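
For the visual-inspection step of Exercise 1, a self-organizing map can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: the grid size, learning rate, linear decay schedule, and the random two-cluster data standing in for word vectors are all illustrative choices, not tuned values.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit whose weight vector is closest to x."""
    dist2 = ((weights - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(dist2), dist2.shape)

def train_som(data, grid=(4, 4), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Fit a small self-organizing map; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    # Grid coordinates of every unit, used by the neighbourhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # shrink step size and radius over time
        for x in rng.permutation(data):
            bmu = best_matching_unit(weights, x)
            grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            influence = np.exp(-grid_dist2 / (2 * (sigma * decay + 1e-3) ** 2))
            # Pull the winning unit and its grid neighbours toward the sample.
            weights += (lr * decay) * influence[..., None] * (x - weights)
    return weights

# Stand-in for word vectors: two well-separated random clusters (made-up data).
rng = np.random.default_rng(1)
vectors = np.vstack([rng.normal(-3, 0.3, size=(20, 5)), rng.normal(3, 0.3, size=(20, 5))])
som = train_som(vectors)
print(best_matching_unit(som, vectors[0]), best_matching_unit(som, vectors[-1]))
```

With real embeddings, each word would be placed at the grid cell of its best matching unit, so words with similar vectors end up in nearby cells.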

### TensorFlow model outline
#### Definition phase
* Decide on an architecture
* Define all variables as tensors
* Define how to generate outputs from your inputs and variables
* Define a cost function with respect to your predictions and your labels
* Define an optimizer that minimizes your cost function
#### Execution phase
* Create an execution session
* Initialize your variables
* Over n epochs, run the optimizer, feeding it some data in batches
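
The same steps can be written out by hand, with NumPy standing in for TensorFlow (so there is no graph or session here, only the order of the steps). This is a sketch on made-up data: a logistic regression trained by mini-batch gradient descent, with an arbitrary learning rate and batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, linearly separable data: the label is 1 when the features sum to > 0.
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(float)

# --- Definition phase ---
w = np.zeros(3)          # variables
b = 0.0

def predict(X, w, b):
    """Outputs from inputs and variables: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def cost(p, y):
    """Cross-entropy between predictions and labels."""
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

learning_rate = 0.5      # the "optimizer" is plain gradient descent

# --- Execution phase ---
batch_size = 32
for epoch in range(50):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        error = predict(xb, w, b) - yb      # gradient of the cost w.r.t. the linear score
        w -= learning_rate * xb.T @ error / len(xb)
        b -= learning_rate * error.mean()

print(cost(predict(X, w, b), y))
```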

In [None]:
class BatchFeeder(object):
    """Iterate over (X, y) in consecutive mini-batches of `batch_size` rows."""
    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.i = 0

    def __iter__(self):
        # Restart from the first row each time a new pass (epoch) begins.
        self.i = 0
        return self

    def __next__(self):
        # Stop once every row has been served in this pass.
        if self.i >= len(self.X):
            raise StopIteration
        X = self.X[self.i:self.i + self.batch_size]
        y = self.y[self.i:self.i + self.batch_size]
        self.i += self.batch_size
        return X, y

In [89]:

import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import label_binarize
import numpy as np
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X = cancer.data
y = label_binarize(cancer.target, classes=[0,1,2])[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)

class MLP(object):
    def __init__(self, X, y, layer_size = 1000):
        
        """
        32-512 rows per batch
        """
        self.X = X
        self.y = y
        self.n_classes = y.shape[1]  # y is one-hot encoded: one column per class
        self.hidden_dim = layer_size
        self.input_dim = X.shape[1]  # use the data passed in, not the global X_train
        self.model_path = 'model.chkpt'
        self.saver = None
        self.graph = tf.Graph()
        self.default_dtype = tf.float64
        
        with self.graph.as_default():
            with tf.variable_scope('mlp_model') as scope:
                self.learning_rate = tf.Variable(0.0, dtype=tf.float32, trainable=False)
                self.x_input = tf.placeholder(X.dtype, shape = (None, self.input_dim))
                self.y_output = tf.placeholder(X.dtype, shape = (None, self.n_classes))
                self.weights = {
                    'weights1':tf.get_variable('weights1', (self.input_dim,self.hidden_dim ), dtype=self.default_dtype), 
                    'bias1':tf.get_variable('bias1', (self.hidden_dim, ), dtype=self.default_dtype), 
                    'weights2':tf.get_variable('weights2', (self.hidden_dim, self.n_classes ), dtype=self.default_dtype), 
                    'bias2':tf.get_variable('bias2', (self.n_classes, ), dtype=self.default_dtype)}
                self.get_logit_op = self.feed_forward(self.x_input, self.weights)
                self.predict_proba_op = tf.nn.softmax(self.get_logit_op)  # matches the softmax cross-entropy loss
                self.predict_op = tf.argmax(self.predict_proba_op, axis=1)
                self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=self.y_output, logits=self.get_logit_op))
                self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)            
    
    def feed_forward(self, x_input, weights):
        hidden = tf.matmul(x_input, weights['weights1'])
        hidden = tf.add(hidden, weights['bias1'])
        hidden = tf.nn.relu(hidden)
        output = tf.matmul(hidden, weights['weights2'])
        output = tf.add(output, weights['bias2'])
        return output

    def predict(self, X):
        
        with tf.Session(graph=self.graph) as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_op, feed_dict={self.x_input:X})
        return preds
            
    def predict_proba(self, X):
        with tf.Session(graph=self.graph) as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_proba_op, feed_dict={self.x_input:X})
        return preds
    
    def fit(self, epochs=100):
        
        epochs = range(epochs)
  
        with tf.Session(graph=self.graph) as sess:
            self.saver = tf.train.Saver()
            for var in self.graph.get_collection('variables'):
                sess.run(var.initializer)
                
            for epoch in epochs:
                sess.run(self.optimizer, feed_dict={self.x_input: self.X, 
                                                    self.y_output: self.y})
            
            self.saver.save(sess, self.model_path)
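
Written as equations, the network in `feed_forward` and the loss minimized in `fit` are as follows, using $W_1, b_1, W_2, b_2$ as shorthand for `weights1`, `bias1`, `weights2`, `bias2`, and $N$ for the number of rows fed in (the symbols are notation introduced here, not names from the code):

$$h = \max(0,\; xW_1 + b_1), \qquad z = hW_2 + b_2$$

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} y_{nk}\,\log\frac{e^{z_{nk}}}{\sum_{j} e^{z_{nj}}}$$

The predicted class is $\arg\max_k z_{nk}$, which is unchanged by any monotonic squashing applied to each logit.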

In [90]:
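# Note: this fits on all of X, y, which includes the rows later used as X_test,
# so the report below is optimistic. Fit on X_train, y_train for a fair test score.
c = MLP(X, y)
c.fit()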

In [93]:
from sklearn.metrics import classification_report
print(classification_report(np.argmax(y_test, axis=1), c.predict(X_test)))

INFO:tensorflow:Restoring parameters from model.chkpt
             precision    recall  f1-score   support

          0       0.95      0.91      0.93        67
          1       0.94      0.97      0.96       104

avg / total       0.95      0.95      0.95       171



In [37]:
from sklearn.neural_network import MLPClassifier
gb = MLPClassifier(hidden_layer_sizes=(1000, ))
gb.fit(X_train, np.argmax(y_train, axis=1))
print("Train: ", np.mean(gb.predict(X_train) == np.argmax(y_train, axis=1)))
print("Test: ", np.mean(gb.predict(X_test) == np.argmax(y_test, axis=1)))


Train:  0.929648241206
Test:  0.888888888889


### Text Sequence modeling
* Preprocessing
* Create data structures


In [58]:
from collections import Counter
c = Counter()
c['apple']  # a Counter returns 0 for a missing key instead of raising KeyError
l = set(range(100000))
d = {i:j for i, j in enumerate(l)}

In [59]:
%%timeit
77777 in l

63.7 ns ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [70]:
%%timeit
d[77777]

58 ns ± 0.337 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [71]:
%%timeit
77777 in d

60.3 ns ± 3.49 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
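
The three timings above are close because sets and dicts both hash the key. For contrast, the sketch below adds a plain list, where membership is a linear scan. It uses the standard `timeit` module instead of the `%%timeit` magic so it runs outside a notebook; the repeat count of 200 is arbitrary.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
as_dict = {i: i for i in as_list}

# Membership in a list scans elements one by one; sets and dicts hash the key.
t_list = timeit.timeit(lambda: 77777 in as_list, number=200)
t_set = timeit.timeit(lambda: 77777 in as_set, number=200)
t_dict = timeit.timeit(lambda: 77777 in as_dict, number=200)

print(f"list: {t_list:.6f}s  set: {t_set:.6f}s  dict: {t_dict:.6f}s")
```

This is why the vocabulary below is kept in a dict and not a list: every token in the corpus triggers a lookup.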


In [None]:
class OrderedDefaultDict(object):
    pass  # stub: a dict that keeps insertion order and fills in defaults for missing keys
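
One possible way to flesh out the `OrderedDefaultDict` idea is to subclass `OrderedDict` and add the `__missing__` hook that `defaultdict` relies on. This is a sketch of one common recipe, not necessarily what was intended here. On Python 3.7+ a plain `defaultdict` already preserves insertion order, so this mostly matters on older versions.

```python
from collections import OrderedDict

class OrderedDefaultDict(OrderedDict):
    """OrderedDict that creates missing keys with `default_factory`, like defaultdict."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when `key` is absent.
        if self.default_factory is None:
            raise KeyError(key)
        self[key] = value = self.default_factory()
        return value

counts = OrderedDefaultDict(int)
for word in ["boat", "ship", "boat"]:
    counts[word] += 1
print(list(counts.items()))  # [('boat', 2), ('ship', 1)]
```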

In [81]:
from collections import defaultdict
defaultdict.default_factory

<member 'default_factory' of 'collections.defaultdict' objects>

In [82]:
{}[1]

KeyError: 1

In [61]:
#!pip install gensim >> gensim-log.txt
#!pip install spacy >> spacy-log.txt
#!python -m spacy download en >> spacy-download.txt
#!pip install keras
#!pip install tensorflow
from gensim.models import Word2Vec
from sklearn.datasets import fetch_20newsgroups
from spacy.tokens import Doc
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, IS_PUNCT
from collections import OrderedDict
from functools import partial

def pad(obj, max_len=0, pad_value = 'PAD'):
    n_pads = max(max_len - len(obj), 0)
    return obj[:max_len] + [pad_value] * n_pads

class TextProcesser(object):
    def __init__(self, nlp=None, max_len=200, max_vocab_size=20000):
        
        self.max_vocab_size = max_vocab_size
        self.max_len = max_len
        self.nlp = nlp or spacy.load('en')
        self.PADDING_VAL = 0
        self.MISSING_VAL = 1
        self.INDEX_OFFSET = 2
        self.vocab = OrderedDict()
        self.padder = partial(pad, max_len=max_len, pad_value=self.PADDING_VAL)
        
    def get_current_vocab_size(self):
        return len(self.vocab)
        
    def check_word(self, word):
        current_vocab_size = self.get_current_vocab_size()
        # Add unseen words until the vocabulary holds max_vocab_size entries.
        if current_vocab_size < self.max_vocab_size:
            if word not in self.vocab:
                self.vocab.update({word: current_vocab_size + self.INDEX_OFFSET}) # first word gets index 2
        try:
            return self.vocab[word]
        except KeyError:
            return self.MISSING_VAL
        
    def __call__(self, corpus, merge_ents=True):
        docs = []
        if merge_ents:
            for doc in self.nlp.pipe(corpus, parse=False):
                for ent in doc.ents:
                    ent.merge()
                tokens = list(map(self.process_token, doc[:self.max_len]))
                docs.append(self.padder(tokens))
        else:
            for doc in self.nlp.pipe(corpus, parse=False, tag=False, entity=False):
                tokens = list(map(self.process_token, doc[:self.max_len]))
                docs.append(self.padder(tokens))
        
        return docs
            

            
    def process_token(self, token):
        if token.like_url:
            return self.check_word("URL")
        elif token.like_email:
            return self.check_word("EMAIL")
        elif token.like_num:
            return self.check_word("NUM")
        else:
            return self.check_word(token.lower_)


#nlp = spacy.load('en')

In [67]:
#dataset = fetch_20newsgroups()
#corpus = dataset.data
processor = TextProcesser(nlp=nlp, max_len=100)
import numpy as np
processed_corpus = processor(corpus, merge_ents=False)
np.array(processed_corpus)  # display the padded index matrix

array([[   2,    3,    4, ...,   34,   70,    2],
       [   2,    3,    4, ...,   27,  111,  112],
       [   2,    3,    4, ...,  117,  159,    5],
       ..., 
       [   2,    3,    4, ...,    1,   45, 1673],
       [   2,    3, 5146, ..., 4819,  388, 2319],
       [   2,    3,    4, ..., 2957,   33, 1190]])
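
The indexing scheme behind that matrix (0 for padding, 1 for out-of-vocabulary words, real words from 2 upward) can be shown without spaCy. The sketch below is a simplified stand-in: the function names `build_vocab` and `encode` are hypothetical, the documents are made up, and unlike `TextProcesser` it builds the vocabulary in a separate pass before encoding.

```python
PAD, UNK, OFFSET = 0, 1, 2   # same reserved values as PADDING_VAL, MISSING_VAL, INDEX_OFFSET

def build_vocab(docs, max_vocab_size):
    """Assign indices in order of first appearance, starting at OFFSET."""
    vocab = {}
    for doc in docs:
        for word in doc:
            if word not in vocab and len(vocab) < max_vocab_size:
                vocab[word] = len(vocab) + OFFSET
    return vocab

def encode(doc, vocab, max_len):
    """Map words to indices, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(word, UNK) for word in doc[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

docs = [["from", ":", "alice"], ["from", ":", "bob", "re", ":", "cars"]]
vocab = build_vocab(docs, max_vocab_size=4)
encoded = [encode(doc, vocab, max_len=5) for doc in docs]
print(encoded)  # [[2, 3, 4, 0, 0], [2, 3, 5, 1, 3]]
```

Every row has the same length, so the result can be stacked into a rectangular array and fed to an embedding layer.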

In [None]:
def feed_forward(self, words):
    hidden = tf.nn.embedding_lookup(self.embeddings, words)  # word ids -> embedding vectors
    output = self.lstm(hidden)
    return output

$X \rightarrow \text{embedding} \rightarrow \text{LSTM} \rightarrow \text{output}$
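
To see what each arrow in that pipeline computes, here is a forward pass written in plain NumPy. It is a sketch with randomly initialized, untrained weights and made-up sizes. The gate layout (input, forget, output, candidate packed into one matrix) is one common convention and not necessarily the one TensorFlow uses internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, W, U, b):
    """Run one LSTM layer over x_seq of shape (time, batch, emb); return the last hidden state."""
    hidden = U.shape[0]
    h = np.zeros((x_seq.shape[1], hidden))
    c = np.zeros_like(h)
    for x_t in x_seq:
        gates = x_t @ W + h @ U + b                    # (batch, 4 * hidden)
        i, f, o, g = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # forget old memory, write new
        h = sigmoid(o) * np.tanh(c)                    # expose part of the memory
    return h

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden, n_classes = 50, 8, 16, 2  # made-up sizes

embedding = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
W = rng.normal(scale=0.1, size=(emb_dim, 4 * hidden))
U = rng.normal(scale=0.1, size=(hidden, 4 * hidden))
b = np.zeros(4 * hidden)
W_out = rng.normal(scale=0.1, size=(hidden, n_classes))

X = rng.integers(0, vocab_size, size=(3, 10))          # 3 documents, 10 word ids each
embedded = embedding[X]                                # X -> embedding: (3, 10, emb_dim)
h_last = lstm_forward(embedded.transpose(1, 0, 2), W, U, b)  # embedding -> LSTM
logits = h_last @ W_out                                # LSTM -> output: (3, n_classes)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs)
```

Only the hidden state after the final word is used for classification here; averaging over all time steps would be another reasonable choice.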

In [None]:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0

for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    loss += loss_function(probabilities, target_words)

In [35]:
import numpy as np
embedding = tf.Variable(initial_value=np.array([[1,2,3,4], [5,6,7,8]]))
vars_to_get = tf.Variable([1])
with tf.Session() as s:
    init = tf.variables_initializer([embedding, vars_to_get])
    s.run(init)
    b = tf.nn.embedding_lookup(embedding, vars_to_get)
    a = s.run(b)

In [36]:
a

array([[5, 6, 7, 8]])

TensorShape([Dimension(2), Dimension(4)])
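
The lookup above is plain row indexing. The NumPy sketch below reproduces it with the same 2×4 matrix and shows that it equals multiplying a one-hot vector by the embedding matrix, which ties embeddings back to the one-hot encoding discussed at the start.

```python
import numpy as np

embedding = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
ids = np.array([1])

by_index = embedding[ids]                              # row lookup
one_hot = np.eye(embedding.shape[0], dtype=int)[ids]   # [[0, 1]]
by_matmul = one_hot @ embedding                        # same rows via a one-hot product

print(by_index)  # [[5 6 7 8]]
```

Indexing is preferred in practice because it skips the multiplications by zero that the one-hot product performs.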

In [None]:
init = tf.initialize_variables  # deprecated name; tf.variables_initializer replaces it

In [44]:
tf.nn.rnn_cell.BasicRNNCell?

In [None]:
g = tf.nn.rnn_cell.LSTMCell

In [42]:
from tensorflow.contrib.rnn import BasicLSTMCell
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import label_binarize
import numpy as np
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X = cancer.data
y = label_binarize(cancer.target, classes=[0,1,2])[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)

class RNN(object):
    def __init__(self, X, y, max_seq_len, lstm_size=1, layer_size = 1000):
        
        """
        32-512 rows per batch
        """
        self.X = X
        self.y = y
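        self.n_classes = y.shape[1]  # y is one-hot encoded: one column per class
        self.hidden_dim = layer_size
        self.max_seq_len = max_seq_len or X.shape[1]  # default: width of the padded X
        self.model_path = 'model.chkpt'
        self.saver = None
        self.graph = tf.Graph()
        self.default_dtype = tf.float64
        # Assumes X holds integer word indices, so the largest index fixes the vocabulary size.
        self.vocab_size = int(np.max(X)) + 1
        with self.graph.as_default():
            with tf.variable_scope('rnn_model') as scope:
                self.learning_rate = tf.Variable(0.0, dtype=tf.float32, trainable=False)
                self.x_input = tf.placeholder(tf.int32, shape = (None, self.max_seq_len))
                self.y_output = tf.placeholder(self.default_dtype, shape = (None, self.n_classes))
                self.embeddings = tf.get_variable('embeddings', (self.vocab_size, self.hidden_dim), dtype=self.default_dtype)
                self.lstm = tf.nn.rnn_cell.LSTMCell(lstm_size)
                # Projection from the last LSTM output to one logit per class.
                self.weights = {
                    'weights_out': tf.get_variable('weights_out', (lstm_size, self.n_classes), dtype=self.default_dtype),
                    'bias_out': tf.get_variable('bias_out', (self.n_classes, ), dtype=self.default_dtype)}
                self.get_logit_op = self.feed_forward(self.x_input)
                self.predict_proba_op = tf.nn.softmax(self.get_logit_op)
                self.predict_op = tf.argmax(self.predict_proba_op, axis=1)
                self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=self.y_output, logits=self.get_logit_op))
                self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)            
    
    def feed_forward(self, word_input):
        # (batch, time) word ids -> (batch, time, hidden_dim) embedding vectors
        hidden = tf.nn.embedding_lookup(self.embeddings, word_input)
        # Unroll the LSTM over the time axis; the initial state defaults to zeros.
        outputs, _ = tf.nn.dynamic_rnn(self.lstm, hidden, dtype=self.default_dtype)
        # Classify from the output at the final time step.
        output = tf.matmul(outputs[:, -1, :], self.weights['weights_out'])
        return tf.add(output, self.weights['bias_out'])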

    def predict(self, X):
        
        with tf.Session(graph=self.graph) as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_op, feed_dict={self.x_input:X})
        return preds
            
    def predict_proba(self, X, session):
        with tf.Session(graph=self.graph) as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_proba_op, feed_dict={self.x_input:X})
        return preds
    
    def fit(self, epochs=100):
        
        epochs = range(epochs)
  
        with tf.Session(graph=self.graph) as sess:
            self.saver = tf.train.Saver()
            for var in self.graph.get_collection('variables'):
                sess.run(var.initializer)
                
            for epoch in epochs:
                sess.run(self.optimizer, feed_dict={self.x_input: self.X, 
                                                    self.y_output: self.y})
            
            self.saver.save(sess, self.model_path)
