### How to represent words for NLP
* motivator on why bag of words/ one hot encoding is bad
    * curse of dimensionality, sparsity, ignores context, new words, etc.
* Word vectors
    * distributional hypothesis 
        * describing the landscape of models as using different types of "context"   
        * count and predictive approachs [<a href="#note1">note.</a>]
        * larger context: semantic relatedness (e.g. “boat” – “water”)
        * smaller context: semantic similarity (e.g. “boat” – “ship”)
    * quick overview on methods
    * SVD on doc/word matrices
    * SVD on co-occurance matrices with window
    
    * some issues:
        * large matrices!
        * expensive to SVD (quadratic time)
        * Sparse
    * Glove
    * word2vec: make word vectors the parameters of a model with the objective of defining local context.
    * go over word2vec in a little more detail
        * skip gram
        * cbow
        * negative sampling
    * word embeddings in python:
        * sklearn/pydsm + numpy (vectorizers + matrix decompositions)
        * gensim (word2vec)
* Neural models 

        
        
* Inspecting results of word embeddings:
    * self organizing maps
* Validating word vectors:
    * intrinsic vs extrinsic
    
* A note on NNs:
    * transferable features in shallow parts of a network, theres an analogy their with word2vec (shallow networks).

### Exercise 1:
Train your own word2vec model using dataset, and load those vectors into spacy. Visually inspect the results of the vector as a self organizing map.

In [1]:
!pip install gensim >> gensim-log.txt
from gensim.models import Word2Vec
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()
corpus = dataset.data







Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


In [12]:
#!pip install spacy >> spacy-install.log
#!python -m spacy download en >> spacy-download.log
import spacy
nlp = spacy.load('en')

In [13]:
doc = nlp(corpus[0])

In [None]:
from spacy.tokens import Doc
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, IS_PUNCT

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

def process_token(token):
    if token.is_punct or token.is_space:
        return False
    elif token.like_url:
        return'URL'
    elif token.like_email:
        return'EMAIL'
    elif token.like_num:
        return'NUM'
    else:
        return token.lower_.replace(" ",'')
    
    
def process_sentence(tokenized_sent):
    tokens = []
    
    doc = Doc(nlp.vocab, words = tokenized_sent)
    nlp.tagger(doc)
    matcher(doc)
    
    for token in doc:
        processed_token = process_token(token)
        if processed_token:
            tokens.append(processed_token)    
    return tokens



In [102]:
from spacy.tokens import Doc
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, IS_PUNCT

class TextProcesser(object):
    def __init__(self, nlp=None):
        self.nlp = nlp or spacy.load('en')
        
    def __call__(self, corpus):
        for doc in self.nlp.pipe(corpus, parse=False):
            for ent in doc.ents:
                ent.merge()
            yield from map(self.process_token, doc)

            
    def process_token(self, token):
        if token.like_url:
            return'URL'
        elif token.like_email:
            return'EMAIL'
        elif token.like_num:
            return'NUM'
        else:
            return token.lower_

In [103]:
T = TextProcesser(nlp)

In [104]:
t = T(corpus)

In [107]:
g = list(t)

In [None]:
model = Word2Vec(sentences=processed_sents, ###tokenized senteces, list of list of strings
                 size=300,  #size of embedding vectors
                 workers=8, #how many threads?
                 min_count=5, #minimum number of token instances to be considered
                 sample=0, #weight of downsampling common words? 
                 sg = 0, #should we use skip-gram? if 0, then cbow
                 hs=0, #heirarchical softmax?
                 iter=5 #training epocs
        )

In [None]:
<p id="note1">
turns out the distinction may not be that important. (see [Levy and Goldberg (2014), Pennington et al. (2014), Österlund et al. (2015)] as referenced in https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/).
</p>

### Tensorflow model outline
#### Definitions phrase
* Decide on an architecture
* Define all variables as tensors
* Define how to generate outputs from your inputs and variables
* Define a cost function with respect to your predictions and you labels
* Define an optimizer that minimizes your cost function
#### Execution phase
* create an execution session
* Initialize your variables
* over n epochs, run the optimizer, feeding it some data in batches

In [None]:
class BatchFeeder(object):
    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.i = 0
        
    def __call__(self):
        while True:
            yield self.__iter__(self)
    def __iter__(self):
        X = self.X[self.i:self.i + self.batch_size]
        y = self.y[self.i:self.i + self.batch_size]
        self.i += self.batch_size
        return X, y
        
    def __next__(self):
        return self.__iter__()

In [41]:
class MLP(object):
    def __init__(self, X, y, layer_size = 1000):
        self.X = X
        self.y = y
        self.n_classes = len(np.unique(y))
        self.hidden_dim = layer_size
        self.input_dim = X_train.shape[1]
        self.model_path = 'model.chkpt'
        self.saver = None
        self.graph = tf.Graph()
        self.default_dtype = tf.float64
        
        with self.graph.as_default():
            with tf.variable_scope('mlp_model') as scope:
                self.x_input = tf.placeholder(X.dtype, shape = (None, self.input_dim))
                self.y_output = tf.placeholder(X.dtype, shape = (None, self.n_classes))
                self.weights = {
                    'weights1':tf.get_variable('weights1', (self.input_dim,self.hidden_dim ), dtype=self.default_dtype), 
                    'bias1':tf.get_variable('bias1', (self.hidden_dim, ), dtype=self.default_dtype), 
                    'weights2':tf.get_variable('weights2', (self.hidden_dim, self.n_classes ), dtype=self.default_dtype), 
                    'bias2':tf.get_variable('bias2', (self.n_classes, ), dtype=self.default_dtype)}
                self.get_logit_op = self.feed_forward(self.x_input, self.weights)
                self.predict_proba_op = tf.sigmoid(self.get_logit_op)
                self.predict_op = tf.argmax(self.predict_proba_op, axis=1)
                self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=self.y_output, logits=self.get_logit_op))
                self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)            
    
    def feed_forward(self, x_input, weights):
        hidden = tf.matmul(x_input, weights['weights1'])
        hidden = tf.add(hidden, weights['bias1'])
        hidden = tf.nn.relu(hidden)
        output = tf.matmul(hidden, weights['weights2'])
        output = tf.add(output, weights['bias2'])
        return output
    
    @staticmethod
    def init_vector(shape, name=None):
        init_vals = tf.random_normal(shape, dtype=tf.float64)
        return tf.Variable(init_vals, name=name)

    def predict(self, X):
        
        with tf.Session() as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_op, feed_dict={self.x_input:X})
        return preds
            
    def predict_proba(self, X, session):
        with tf.Session() as sess:
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_proba_op, feed_dict={self.x_input:X})
        return preds
    
    def fit(self, X_train, y_train, epochs=100):
        
        epochs = range(epochs)
  
        with tf.Session(graph=self.graph) as sess:
        self.saver = tf.train.sa
            for var in self.graph.get_collection('variables'):
                sess.run(var.initializer)
                
            for epoch in epochs:
                sess.run(self.optimizer, feed_dict={self.x_input: X_train, self.y_output: y_train})
            
            self.saver.restore(sess, self.model_path)
            preds = sess.run(self.predict_proba_op, feed_dict={self.x_input: X_train, self.y_output: y_train})
        
        return preds










In [45]:
c.predict(X)

AttributeError: 'NoneType' object has no attribute 'restore'

In [43]:
c = MLP(X, y)
preds = c.fit(X, y)

In [239]:
c.graph.get_collection('variables')[0]

<tf.Variable 'weights1:0' shape=(30, 1000) dtype=float64_ref>

In [5]:
with tf.Session() as s:
    s.run(c.graph.get_collection('variables')[0].initializer)

ValueError: Fetch argument <tf.Operation 'weights1/Assign' type=Assign> cannot be interpreted as a Tensor. (Operation name: "weights1/Assign"
op: "Assign"
input: "weights1"
input: "random_normal"
attr {
  key: "T"
  value {
    type: DT_DOUBLE
  }
}
attr {
  key: "_class"
  value {
    list {
      s: "loc:@weights1"
    }
  }
}
attr {
  key: "use_locking"
  value {
    b: true
  }
}
attr {
  key: "validate_shape"
  value {
    b: true
  }
}
 is not an element of this graph.)

In [4]:
c = MLP(X, y)
g = c.fit(X_train, y_train)

ValueError: Fetch argument <tf.Operation 'weights1/Assign' type=Assign> cannot be interpreted as a Tensor. (Operation name: "weights1/Assign"
op: "Assign"
input: "weights1"
input: "random_normal"
attr {
  key: "T"
  value {
    type: DT_DOUBLE
  }
}
attr {
  key: "_class"
  value {
    list {
      s: "loc:@weights1"
    }
  }
}
attr {
  key: "use_locking"
  value {
    b: true
  }
}
attr {
  key: "validate_shape"
  value {
    b: true
  }
}
 is not an element of this graph.)

In [182]:
c = MLP(X, y)

epochs = range(100)
with tf.Session() as sess:
    sess.run(c.init)
    
    for epoch in epochs:
        sess.run(c.optimizer, feed_dict={c.x_input:X_train, c.y_output:y_train})
    
    train_preds = sess.run(c.predict_op, feed_dict={c.x_input: X_train, c.y_output: y_train})
    test_preds = sess.run(c.predict_op, feed_dict={c.x_input: X_test, c.y_output: y_test})
    train_accuracy = np.mean(np.argmax(y_train, axis=1) == train_preds)
    print(train_accuracy)
    
    test_accuracy = np.mean(np.argmax(y_test, axis=1) == test_preds)
    print(test_accuracy)

FailedPreconditionError: Attempting to use uninitialized value Variable_272
	 [[Node: Variable_272/read = Identity[T=DT_DOUBLE, _class=["loc:@Variable_272"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_272)]]

Caused by op 'Variable_272/read', defined at:
  File "/opt/conda/envs/python3/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/python3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2698, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2802, in run_ast_nodes
    if self.run_code(code, result):
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-182-1b1e9ba69bad>", line 1, in <module>
    c = MLP(X, y)
  File "<ipython-input-181-d49fd2352648>", line 12, in __init__
    'weights1':init_vector((input_dim,hidden_1_dim )),
  File "<ipython-input-160-49a3a6e0c647>", line 19, in init_vector
    return tf.Variable(init_vals)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 319, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1303, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/opt/conda/envs/python3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_272
	 [[Node: Variable_272/read = Identity[T=DT_DOUBLE, _class=["loc:@Variable_272"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_272)]]


In [2]:
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import label_binarize
import numpy as np
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X = cancer.data
y = label_binarize(cancer.target, classes=[0,1,2])[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)

input_rows = X_train.shape[0]
input_dim = X_train.shape[1]
hidden_1_dim = 1000
n_classes = len(np.unique(y))
batch_size = 500

def init_vector(shape):
    init_vals = tf.random_normal(shape, dtype=tf.float64)
    return tf.Variable(init_vals)

def feed_forward(X, weights):
    
    hidden = tf.nn.relu(tf.add(tf.matmul(X, weights['weights1']), weights['bias1']))
    output = tf.add(tf.matmul(hidden, weights['weights2']), weights['bias2'])
    return output


x_input = tf.placeholder(X.dtype, shape = (None, input_dim))
y_output = tf.placeholder(X.dtype, shape = (None, n_classes))
    
weights = {
    'weights1':init_vector((input_dim,hidden_1_dim )), 
    'bias1':init_vector((hidden_1_dim, )),
    'weights2':init_vector((hidden_1_dim,n_classes )), 
    'bias2':init_vector((n_classes, ))    
}


yhat = feed_forward(x_input, weights)
predict = tf.argmax(yhat, axis=1)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_output, logits=yhat))

optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(cost)
init = tf.global_variables_initializer()


epochs = range(100)
with tf.Session() as sess:
    sess.run(init)
    
    for epoch in epochs:
        sess.run(optimizer, feed_dict={x_input:X_train, y_output:y_train})
    
    train_preds = sess.run(predict, feed_dict={x_input: X_train, y_output: y_train})
    test_preds = sess.run(predict, feed_dict={x_input: X_test, y_output: y_test})
    train_accuracy = np.mean(np.argmax(y_train, axis=1) == train_preds)
    print(train_accuracy)
    
    test_accuracy = np.mean(np.argmax(y_test, axis=1) == test_preds)
    print(test_accuracy)

0.869346733668
0.900584795322


In [180]:
tf.local_variables()

[]

In [37]:
from sklearn.neural_network import MLPClassifier
gb = MLPClassifier(hidden_layer_sizes=(1000, ))
gb.fit(X_train, np.argmax(y_train, axis=1))
print("Train: ", np.mean(gb.predict(X_train) == np.argmax(y_train, axis=1)))
print("Test: ", np.mean(gb.predict(X_test) == np.argmax(y_test, axis=1)))


Train:  0.929648241206
Test:  0.888888888889


0.0
0.0


  # Remove the CWD from sys.path while we load stuff.
  del sys.path[0]


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1,

FailedPreconditionError: Attempting to use uninitialized value Variable_196
	 [[Node: _retval_Variable_196_0_0 = _Retval[T=DT_DOUBLE, index=0, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_196)]]

TensorShape([Dimension(None), Dimension(30)])

In [127]:
l = logistic_regression(X, y)

AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'