### Deep Learning Tutorial for NLP with TensorFlow
This tutorial shows how to work with NLP tasks using TensorFlow. It borrows material from [Deep Learning for Natural Language Processing with Pytorch](https://github.com/rguthrie3/DeepLearningForNLPInPytorch/blob/master/Deep%20Learning%20for%20Natural%20Language%20Processing%20with%20Pytorch.ipynb).

---

In [1]:
import tensorflow as tf
import os
import numpy as np

tf.set_random_seed(1) # seed to obtain similar outputs
os.environ['CUDA_VISIBLE_DEVICES'] = '' # avoids using GPU for this session

### 1. Introduction to Tensors in TensorFlow
This section shows how to create and handle tensors, which are the building blocks of any deep learning architecture you wish to implement. You can find more information about tensors and why they are used for deep learning modeling here [LINK].

#### Creating Tensors

In [2]:
# start session
sess = tf.Session()

#initialize_op =  tf.global_variables_initializer()
#sess.run(initialize_op)

# 1D tensor (also known as a 1D vector)
V_data = [1., 2., 3.]
V = tf.constant(V_data, tf.float32)
print(V, "\n",sess.run(V))

# 2D tensor (also known as a 2D matrix)
M_data = [[1.,2.,3.],[4.,5.,6.]]
M = tf.constant(M_data, tf.float32)
print(M,"\n", sess.run(M))

# 3D tensor of size 2x2x3
T_data = [[[1.,2., 3.], [4.,5., 6.]],
          [[7.,8., 9.], [10.,11., 12]]]
T = tf.constant(T_data, tf.float32)
print(T, "\n", sess.run(T))

Tensor("Const:0", shape=(3,), dtype=float32) 
 [ 1.  2.  3.]
Tensor("Const_1:0", shape=(2, 3), dtype=float32) 
 [[ 1.  2.  3.]
 [ 4.  5.  6.]]
Tensor("Const_2:0", shape=(2, 2, 3), dtype=float32) 
 [[[  1.   2.   3.]
  [  4.   5.   6.]]

 [[  7.   8.   9.]
  [ 10.  11.  12.]]]


We can also operate on tensors the same way we would on standard NumPy arrays. The following cell shows some examples.

In [3]:
# Index into V and produce a scalar
print(V, "\n",sess.run(V[0]))

# Index into M and produce a vector (the first row)
print(M, "\n",sess.run(M[0]))

# Index into T and produce a matrix
print(T, "\n",sess.run(T[0]))

Tensor("Const:0", shape=(3,), dtype=float32) 
 1.0
Tensor("Const_1:0", shape=(2, 3), dtype=float32) 
 [ 1.  2.  3.]
Tensor("Const_2:0", shape=(2, 2, 3), dtype=float32) 
 [[ 1.  2.  3.]
 [ 4.  5.  6.]]


We can also create a tensor of a specified shape filled with random data.

In [4]:
x = tf.random_normal([3,4,5])
print(x, "\n", sess.run(x))

Tensor("random_normal:0", shape=(3, 4, 5), dtype=float32) 
 [[[ 2.92425132  0.1904023  -0.0999928   0.54333866  0.44673878]
  [-0.74975967 -0.54365152 -0.11328146 -0.73304367  0.84540504]
  [-0.98038179  0.22330996  0.00350116  0.63260406  0.53027982]
  [ 0.11028062 -1.06750703  0.78615493  1.87881851  0.29188421]]

 [[ 0.65026432 -0.20335354 -0.80023402 -0.09808423 -0.38593856]
  [-0.82919121  0.76829058 -0.30424473  0.76561087  0.09717353]
  [-0.03512896 -0.44248524 -0.63942212 -1.37797427  0.00825686]
  [-0.61289674 -2.01852942  0.82595825 -0.49025506  2.95250511]]

 [[-1.63209784  1.47359169  0.18774952  2.14593959 -2.32931423]
  [-1.32003772  0.49694589 -0.96136755 -0.05908597  0.28159368]
  [-3.31978583  0.13563539  1.1548022  -0.09598015  0.69415414]
  [-1.30712473 -0.62319744  0.53103876  0.84003556 -0.92114925]]]


#### Operations with Tensors

In [5]:
x = tf.constant([1.,2.,3.])
y = tf.constant([4.,5.,6.])
z = x+y
print(sess.run(z))

[ 5.  7.  9.]


#### Concatenation

In [6]:
# concatenate by rows (axis=0)
x_1 = tf.random_normal([2,5])
y_1 = tf.random_normal([3,5])
z_1 = tf.concat([x_1, y_1], 0)
print(sess.run(z_1))

print("\n")
# concatenate by columns (axis=1)
x_2 = tf.random_normal([2,3])
y_2 = tf.random_normal([2,5])
z_2 = tf.concat([x_2, y_2], 1)
print(sess.run(z_2))

[[ 0.58697659  0.64898354 -0.00182672  0.46972319 -0.41258761]
 [ 0.50580209  1.08674705 -0.98539138 -0.37500492 -0.27378666]
 [-0.10245068 -0.42229936  0.53483063 -0.90609199  0.2286306 ]
 [ 1.33724499  0.54341757 -0.11454915 -0.13848655  0.70253915]
 [ 0.18164249 -0.91935599  1.01541901 -0.56042647  0.84225392]]


[[ 1.01857746 -0.81870914 -0.97899282 -0.23221652  0.54886436 -1.43723702
   1.37461388  0.19729844]
 [-2.54092383  0.40189755 -0.17309178 -0.0217983   1.60369706  0.98966628
  -1.32113194 -0.79742384]]


More on concatenation [here](https://www.tensorflow.org/api_docs/python/tf/concat).
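
The shape rule behind the two calls above can be illustrated without TensorFlow. The following is a minimal sketch using nested lists; the helper names `shape`, `concat_rows`, and `concat_cols` are illustrative and are not TensorFlow APIs. Concatenating along axis 0 requires matching column counts, and concatenating along axis 1 requires matching row counts.

```python
def shape(matrix):
    """Return (rows, cols) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))

def concat_rows(a, b):
    """Mimic tf.concat([a, b], 0): stack b underneath a."""
    assert shape(a)[1] == shape(b)[1], "column counts must match"
    return a + b

def concat_cols(a, b):
    """Mimic tf.concat([a, b], 1): append b's columns to a's rows."""
    assert shape(a)[0] == shape(b)[0], "row counts must match"
    return [row_a + row_b for row_a, row_b in zip(a, b)]

x_1 = [[0.0] * 5 for _ in range(2)]  # 2x5
y_1 = [[1.0] * 5 for _ in range(3)]  # 3x5
x_2 = [[0.0] * 3 for _ in range(2)]  # 2x3
y_2 = [[1.0] * 5 for _ in range(2)]  # 2x5

print(shape(concat_rows(x_1, y_1)))  # (5, 5)
print(shape(concat_cols(x_2, y_2)))  # (2, 8)
```

The resulting shapes match the 5x5 and 2x8 outputs printed by the TensorFlow cell.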

#### Reshaping tensors

In [7]:
x = tf.random_normal([2,3,4])
print(sess.run(x))
print("\n")
print(sess.run(tf.reshape(x,[2,12])))
print("\n")
print(sess.run(tf.reshape(x,[2,-1]))) # -1 lets TensorFlow infer that dimension automatically

[[[ 0.20525731 -0.89691323  1.07025433 -0.15564451]
  [-0.40598848 -0.29619998 -0.82002103 -0.37928221]
  [ 0.12490512  0.03255307  1.95663142 -1.24699843]]

 [[ 0.42128563 -0.4631182   0.95954072 -1.19090807]
  [-1.57321525  0.02937024  1.77990592 -0.65396672]
  [-1.10802829  0.33289853 -1.41283214 -0.69903141]]]


[[-1.53472745 -0.57235247 -1.88776898 -0.37644315  1.04152322  0.10515906
  -0.41811749 -1.1674329   0.40431187  0.34636515 -0.1303246   2.76180029]
 [-1.75117695 -0.15002298  0.65409285  0.44413102 -0.12226854  0.90484196
   0.21590623  1.56579196 -0.41805771 -0.28057054 -0.06562074 -0.55228049]]


[[-0.51017123  0.04450967 -0.04642046  0.04758257 -2.67784333 -2.01859713
  -2.0040257  -0.24659356  1.00061321  0.10434803 -0.38498423  0.78364938]
 [ 1.9038918  -1.16755736 -0.9858014   0.01895356  0.82711017  0.77959973
  -1.15610087  0.94547653  0.65504938 -0.99168533  0.3092854  -0.84525639]]


Notice that every time we call sess.run(x) the values of the tensor change, because the random op is re-evaluated on each run. We can fix this as follows:

In [8]:
x = tf.random_normal([2,3,4])
x_result = sess.run(x) # evaluate once and keep the NumPy result so the values stay fixed
print(x_result)
print("\n")
print(sess.run(tf.reshape(x_result,[2,-1])))

[[[ 1.58181369  0.81615424  1.2647419   2.44118857]
  [-0.64325184 -0.52695727  0.87262011 -2.33419275]
  [-0.91412055 -0.65119022 -0.2877526  -0.03355991]]

 [[ 0.03930664 -0.3554261  -1.84036934  0.64453197]
  [-0.70999211 -0.07287408  0.58443463  0.53111529]
  [-1.32720828  0.61744171 -1.12603593  1.55679595]]]


[[ 1.58181369  0.81615424  1.2647419   2.44118857 -0.64325184 -0.52695727
   0.87262011 -2.33419275 -0.91412055 -0.65119022 -0.2877526  -0.03355991]
 [ 0.03930664 -0.3554261  -1.84036934  0.64453197 -0.70999211 -0.07287408
   0.58443463  0.53111529 -1.32720828  0.61744171 -1.12603593  1.55679595]]
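
Reshaping keeps the elements in row-major order and only changes how they are grouped. As a rough illustration, here is a pure-Python sketch; `flatten` and `reshape_2d` are hypothetical helpers written for this example, and integers are used instead of random values so the ordering is easy to see.

```python
def flatten(nested):
    """Flatten a nested list in row-major order (last axis varies fastest)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat

def reshape_2d(nested, rows):
    """Mimic tf.reshape(x, [rows, -1]): infer the column count from the data."""
    flat = flatten(nested)
    assert len(flat) % rows == 0, "element count must be divisible by rows"
    cols = len(flat) // rows
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# A 2x3x4 "tensor" holding the integers 0..23
t = [[[i * 12 + j * 4 + k for k in range(4)] for j in range(3)] for i in range(2)]
reshaped = reshape_2d(t, 2)
print(len(reshaped), len(reshaped[0]))  # 2 12
print(reshaped[0])
```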


### 2. Computation Graphs and Automatic Differentiation

In [9]:
# slightly different from PyTorch, but the effect is the same
x = tf.random_normal([2,2])
y = tf.random_normal([2,2])
z = x+y
print(sess.run(z))

[[-0.31353098 -0.52566063]
 [-2.63175964 -2.74863577]]


In [10]:
# we specify a session to run the evaluation
with sess.as_default():
    var_x = tf.Variable(x, name="X")
    var_y = tf.Variable(y, name="Y")
    var_z = var_x + var_y

    var_x.initializer.run()
    var_y.initializer.run()

    print(var_z.eval())




In [11]:
tf.gradients(var_z, var_x) # compute gradients of var_z with respect to var_x

[<tf.Tensor 'gradients/add_2_grad/Reshape:0' shape=(2, 2) dtype=float32>]
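
`tf.gradients` differentiates the sum of the entries of its first argument, so for `z = x + y` every entry of the gradient with respect to `x` should be 1. The sketch below checks this numerically with finite differences in plain Python; the function names and sample values are made up for illustration and the tensors are flattened to lists.

```python
def total(x, y):
    """Sum of all entries of z = x + y."""
    return sum(xi + yi for xi, yi in zip(x, y))

def numeric_grad(f, x, y, eps=1e-6):
    """Forward-difference estimate of d f / d x[i] for each i."""
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        grads.append((f(bumped, y) - f(x, y)) / eps)
    return grads

x = [0.5, -1.0, 2.0, 3.0]   # a flattened 2x2 tensor
y = [1.0, 1.0, -2.0, 0.25]
grad = numeric_grad(total, x, y)
print([round(g, 4) for g in grad])  # [1.0, 1.0, 1.0, 1.0]
```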

### 3. Deep Learning Building Blocks: Affine maps, non-linearities and objectives
The affine map is a function $f(x)$ where
$$ f(x) = Wx + b $$

where $W$ refers to a weight matrix and vectors $x,b$ are input and bias, respectively. 
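
As a small worked example of the affine map, the sketch below computes it by hand in plain Python. It assumes the batched row-vector convention, where each row of `X` is one input and the map is written `X W + b`; the function name `affine` and the numbers are illustrative only.

```python
def affine(X, W, b):
    """Compute X @ W + b for X: [batch, in], W: [in, out], b: [out]."""
    out = []
    for row in X:
        out.append([sum(row[k] * W[k][j] for k in range(len(W))) + b[j]
                    for j in range(len(b))])
    return out

X = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]   # 3 inputs of dimension 2
W = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]    # maps dimension 2 to dimension 3
b = [0.5, 0.5, 0.5]

Y = affine(X, W, b)
print(Y)  # [[5.5, 2.5, -0.5], [11.5, 4.5, -2.5], [2.5, 1.5, 0.5]]
```

The output has one row per input and one column per output unit, which is why a `[128, 20]` input and a `[20, 30]` weight matrix give a `[128, 30]` result.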

In [12]:
# In TensorFlow we use matmul; X holds one input per row, so the affine map is written XW + b
X = tf.Variable(tf.random_normal([128,20]), name="X")
W = tf.Variable(tf.random_normal([20,30]), name="W")
b = tf.Variable(tf.random_normal([30]), name="b")

tf.matmul(X, W) + b

<tf.Tensor 'add_3:0' shape=(128, 30) dtype=float32>

#### Non-linearities
Examples: $tanh(x)$, $\sigma(x)$, and $ReLU(x)$

In [13]:
data = tf.Variable(tf.random_normal([2,2]), name="data")
data.initializer.run(session=sess)
print(sess.run(data))
print(sess.run(tf.nn.relu(data)))

[[ 1.61521125  0.01385026]
 [ 0.04062277 -0.31989396]]
[[ 1.61521125  0.01385026]
 [ 0.04062277  0.        ]]


#### Softmax and Probabilities
We compute $softmax(x)$, where the $i^{th}$ component of $softmax(x)$ is:
$$ \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

In [14]:
data = tf.Variable(tf.random_normal([2]), name="data")
data.initializer.run(session=sess)
print(sess.run(data))
print(sess.run(tf.nn.softmax(data)))
print(sess.run(tf.reduce_sum(tf.nn.softmax(data))))
print(sess.run(tf.log(tf.nn.softmax(data))))

[ 1.60825002 -2.16805053]
[ 0.97760576  0.02239429]
1.0
[-0.0226488  -3.79894924]
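
Taking the log of a softmax directly can underflow when one logit is much smaller than the others. A common remedy is to subtract the maximum logit before exponentiating (TensorFlow also offers `tf.nn.log_softmax` for this purpose). The following plain-Python sketch, with an illustrative function name, applies that trick to the same input values printed above.

```python
import math

def log_softmax(x):
    """Numerically stable log-softmax: shift by max(x) before exponentiating."""
    m = max(x)
    log_sum = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_sum for v in x]

x = [1.60825002, -2.16805053]
log_probs = log_softmax(x)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)
print(probs)
print(sum(probs))

# Large logits would overflow a naive exp(), but work with the shifted form
print(log_softmax([1000.0, 1000.0]))
```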


### 4. Optimization and Training
We compute the gradients of the loss function and use them to update our weights so as to minimize the loss:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta) $$
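
To see the update rule in action, here is a minimal sketch on a one-dimensional toy loss $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$. The loss, the learning rate, and the function name `grad_descent` are assumptions chosen for illustration.

```python
def grad_descent(grad_fn, theta, lr, steps):
    """Apply theta <- theta - lr * grad(theta) a fixed number of times."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3)
theta = grad_descent(lambda t: 2.0 * (t - 3.0), theta=0.0, lr=0.1, steps=100)
print(theta)  # approaches the minimizer theta = 3
```

With $\eta = 0.1$ each step shrinks the distance to the minimizer by a factor of 0.8, so the iterate converges quickly.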

### 5. Creating Network Components in Tensorflow
#### Logistic regression on BOW classifier
Given a BOW vector representation $x$ of a sentence $s$, the output of our network is computed as follows:
$$\log(\text{softmax}(Wx+b))$$
where
$$\forall i \in \{1, \dots, |V|\}: x_i = \text{count}(w_i, s)$$
that is, $x_i$ is the number of times the vocabulary word $w_i$ occurs in $s$, and $|V|$ is the size of the vocabulary.

In [15]:
data = [ ("me gusta comer en la cafeteria".split(), "SPANISH"),
         ("Give it to me".split(), "ENGLISH"),
         ("No creo que sea una buena idea".split(), "SPANISH"),
         ("No it is not a good idea to get lost at sea".split(), "ENGLISH") ]

test_data = [ ("Yo creo que si".split(), "SPANISH"),
              ("it is lost on me".split(), "ENGLISH")]

# defining the training set labels or targets
label_to_ix = { "SPANISH": [1,0], "ENGLISH": [0,1] }

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print (word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [16]:
class BOWClassifier(object):
    def __init__(self, num_labels, vocab_size):
        # input
        self.X = tf.placeholder(tf.float32, [None, vocab_size])
        self.Y = tf.placeholder(tf.float32, [None, num_labels])
        
        # weights and bias
        self.W = tf.Variable(tf.random_uniform([vocab_size, num_labels]))
        self.b = tf.Variable(tf.random_uniform([num_labels]))
        
        # log probabilities of output
        self.logits = tf.matmul(self.X, self.W) + self.b
        self.log_probs = tf.log(tf.nn.softmax(self.logits))
        
        # loss
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.Y))
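
For a one-hot label, the softmax cross-entropy used as the loss above reduces to the negative log-probability of the correct class. The sketch below computes it by hand in plain Python for a single example; the function name and the logits are illustrative.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    # log softmax_i = logits_i - log_z; weight each term by the target
    return -sum(t * (v - log_z) for v, t in zip(logits, one_hot))

# Logits favour class 0 and the label is class 0 ("SPANISH" = [1, 0])
loss_correct = softmax_cross_entropy([2.0, 0.0], [1, 0])
# Same logits, but the label is class 1 ("ENGLISH" = [0, 1])
loss_wrong = softmax_cross_entropy([2.0, 0.0], [0, 1])
print(loss_correct, loss_wrong)
```

The loss is small when the logits agree with the label and large when they do not, which is what gradient descent exploits during training.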

In [17]:
def make_bow_vector(sentence, word_to_ix):
    vec = np.zeros(len(word_to_ix)) # need to find what is the way to assign to tf tensors
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.reshape(1, -1)

def make_target(label, label_to_ix):
    # need to binarize the labels => [0,1] or [1,0]
    return np.array([label_to_ix[label]])

My observation is that in PyTorch it is easy to create variables and assign values to them. In TensorFlow this is less straightforward, so I used NumPy to build the vector in the make_bow_vector function.

In [18]:
model = BOWClassifier(NUM_LABELS, VOCAB_SIZE)

In [19]:
# parameters can be obtained directly
model.W

<tf.Variable 'Variable:0' shape=(26, 2) dtype=float32_ref>

Let's just obtain the log probabilities of the output

In [20]:
with tf.Session() as sess:    
    # initialize and run all variables so that we can use their values directly
    init = tf.global_variables_initializer()
    sess.run(init)
    
     # prepare bow vector
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    
    log_probs = sess.run(model.log_probs, feed_dict = {model.X: bow_vector})

In [21]:
print(log_probs)

[[-0.12336739 -2.15363789]]


Now let's print the matrix parameters corresponding to a specific word such as "creo"

In [22]:
with tf.Session() as sess:    
    # initialize and run all variables so that we can use their values directly
    init = tf.global_variables_initializer()
    sess.run(init)
    
    for instance, label in test_data:
        bow_vector = make_bow_vector(instance, word_to_ix)
        params, log_probs = sess.run([model.W, model.log_probs], feed_dict = {model.X: bow_vector})
        print(log_probs)

[[-0.28383601 -1.39792216]]
[[-1.90972412 -0.16031106]]


In [23]:
print(params.T[:,word_to_ix["creo"]])

[ 0.69216204  0.48723626]


Here we train the model, using gradient descent to minimize the loss.

In [24]:
with tf.Session() as sess:    
    # training procedure
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    train_op = optimizer.minimize(model.loss)
    
    # initialize and run all variables so that we can use their values directly
    init = tf.global_variables_initializer()
    sess.run(init)
    
    # train cycle with training data
    for epoch in range(15):
        for instance, label in data:
            bow_vector = make_bow_vector(instance, word_to_ix)
            target = make_target(label, label_to_ix)
            _, loss, params, log_probs = sess.run([train_op, model.loss, model.W, model.log_probs], 
                                         feed_dict = {model.X: bow_vector, model.Y:target})
        print(loss)
    print("Optimization finished!!!")
    
    # test with testing data
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        params, log_probs = sess.run([model.W, model.log_probs],
                                    feed_dict={model.X: bow_vec})
        print(log_probs)
    print(params.T[:,word_to_ix["creo"]])

0.918552
0.297093
0.17745
0.128839
0.101833
0.08441
0.0721463
0.0630087
0.0559212
0.0502563
0.0456219
0.0417586
0.0384884
0.0356845
0.0332538
Optimization finished!!!
[[-0.15573406 -1.93646169]]
[[-3.43429089 -0.03277975]]
[ 0.88358289  0.29581532]


We can observe that the log probability for Spanish is much higher for the first test sample, while the log probability for English is much higher for the second, which matches the labels in the test data.
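
To turn these log probabilities into predicted labels, we can take the index of the largest entry in each row. The sketch below does this in plain Python using the two rows printed above; `ix_to_label` and `predict` are names introduced here for illustration, with the index order following the one-hot encoding in `label_to_ix`.

```python
# Index 0 is SPANISH and index 1 is ENGLISH, matching label_to_ix
ix_to_label = {0: "SPANISH", 1: "ENGLISH"}

def predict(log_probs_row):
    """Return the label whose log probability is largest."""
    best = max(range(len(log_probs_row)), key=lambda i: log_probs_row[i])
    return ix_to_label[best]

# Log probabilities printed for the two test sentences after training
test_log_probs = [[-0.15573406, -1.93646169],
                  [-3.43429089, -0.03277975]]
predictions = [predict(row) for row in test_log_probs]
print(predictions)  # ['SPANISH', 'ENGLISH']
```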

In [25]:
print(params.T[:,word_to_ix["gusta"]])

[ 0.90241164 -0.08874875]


### 6. Word Embeddings: Encoding Lexical Semantics
We aim to build dense representations of the words in a vocabulary so that semantic similarity between words can be measured. This builds on the distributional hypothesis (words appearing in similar contexts are semantically related).

Each vector representation of a word contains some semantic attributes, so similar words can be found through a similarity measure such as cosine similarity.
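
As an illustration of such a similarity measure, the sketch below computes cosine similarity in plain Python. The three toy vectors are hypothetical values invented for this example and are not learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, for illustration only
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog))  # close to 1: similar direction
print(cosine_similarity(cat, car))  # noticeably smaller
```

Cosine similarity ignores vector length and only compares direction, so it ranges from -1 to 1.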

These latent semantic attributes (features) are learned automatically by a neural network. In other words, the parameters of the model are the word embeddings, which are learned during training. The learned attributes are generally not directly interpretable, since the network discovers them during training rather than having them specified by hand.

"In summary, word embeddings are a representation of the semantics of a word, efficiently encoding semantic information that might be relevant to the task at hand. You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field."

### Word Embeddings in Tensorflow
First, we build a lookup table indexed by word ids. The result is a $|V|\times D$ matrix, where $|V|$ is the size of the vocabulary and $D$ is the dimensionality of the embeddings. This means that a word with index $i$ has its embedding stored in the $i$th row of the matrix.

In [26]:
sess = tf.Session()
word_to_ix = { "hello": 0, "world": 1 }
word_ids = [0,1]
embeds = tf.Variable(tf.random_uniform([2,5]), name="word_embeddings")
embedded_word_ids = tf.nn.embedding_lookup(embeds, word_ids)

#print(sess.run(embedded_word_ids))
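
Conceptually, an embedding lookup is just row selection from the $|V|\times D$ matrix. The following plain-Python sketch mirrors the cell above with a list of lists; the `embedding_lookup` helper here is a stand-in written for illustration, not the TensorFlow function.

```python
import random

random.seed(1)  # fixed seed so the toy table is reproducible

word_to_ix = {"hello": 0, "world": 1}
EMBEDDING_DIM = 5

# |V| x D table: one row of D random values per word
embeds = [[random.random() for _ in range(EMBEDDING_DIM)]
          for _ in range(len(word_to_ix))]

def embedding_lookup(table, ids):
    """Return the rows of `table` selected by `ids`, in order."""
    return [table[i] for i in ids]

looked_up = embedding_lookup(embeds, [word_to_ix["world"], word_to_ix["hello"]])
print(len(looked_up), len(looked_up[0]))  # 2 5
```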

---

### Example: N-gram Language Modeling 
In this model we aim to predict a target word from the sequence of words that precedes it; i.e., we aim to compute:
$$ P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) $$
where $w_i$ is the $i$th word of the sequence.

In [27]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [ ([test_sentence[i], test_sentence[i+1]], test_sentence[i+2]) for i in range(len(test_sentence) - 2) ]
print (trigrams[:3]) # print the first 3, just so you can see what they look like

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
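
The trigram construction generalizes to any context size. Here is a small sketch of that idea; the helper `make_ngrams` is a name introduced for this example, and only the first line of the sonnet is used to keep the output short.

```python
def make_ngrams(tokens, context_size):
    """Build (context, target) pairs: `context_size` words, then the next word."""
    return [(tokens[i:i + context_size], tokens[i + context_size])
            for i in range(len(tokens) - context_size)]

tokens = "When forty winters shall besiege thy brow,".split()

pairs = make_ngrams(tokens, 2)
print(pairs[0])    # (['When', 'forty'], 'winters')
print(len(pairs))  # 5

# A larger context simply yields fewer, longer examples
print(make_ngrams(tokens, 3)[0])  # (['When', 'forty', 'winters'], 'shall')
```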


In [28]:
vocab = set(test_sentence)
word_to_ix = { word: i for i, word in enumerate(vocab) }

In [85]:
class NGramLanguageModeler(object):
    def __init__(self, vocab_size, embedding_dim, context_size):
        
        # inputs and outputs
        self.X = tf.placeholder(tf.int32, [ context_size ]) # context (word sequence)
        self.Y = tf.placeholder(tf.int32, [1,1]) # target word 
        
        # Embeddings
        self.embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0))
        self.embed = tf.nn.embedding_lookup(self.embeddings, self.X)
        
        # layer
        self.W = tf.Variable(tf.random_uniform([vocab_size, embedding_dim]))
        self.b = tf.Variable(tf.random_uniform([vocab_size]))
        
        # loss        
        self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.W,
                     biases=self.b,
                     labels=self.Y,
                     inputs=self.embed,
                     num_sampled=64,
                     num_classes=vocab_size))        

In [89]:
# create model with class
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
losses = []

with tf.Session() as sess:    
    # training procedures
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(model.loss)
        
    # initialize and run all variables so that we can use their values directly
    init = tf.global_variables_initializer()
    sess.run(init)
    
    # train cycle with training data
    for epoch in range(10):
        total_loss = 0
        for context, target in trigrams: # one (context, target) example per step
            context_idxs = list(map(lambda w: word_to_ix[w], context))
            
            _, embed,loss, params = sess.run([train_op, model.embed, model.loss, model.W],
                                                 feed_dict = {model.X:context_idxs,
                                                             model.Y:np.array(word_to_ix[target]).reshape(1,1)})
            total_loss+=loss
        print(total_loss)
    

7355.27872467
3092.70562935
1670.26257944
1151.23467731
904.935578346
789.814936161
719.891431808
671.3481884
645.542729855
625.694592476


#### TODO: Improve the graph structure by adding name scopes. We can also visualize embeddings using TensorBoard by logging summaries.

### Computing Word Embeddings: Continuous Bag-of-Words
The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance a couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and a context window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$ and $w_{i+1}, \dots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize
$$ -\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b) $$
where $q_w$ is the embedding for word $w$.
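
To make the objective concrete, the sketch below evaluates the negative log-probability of a target word with a full softmax in plain Python. The tiny vocabulary of 3 words, the 2-dimensional embeddings, and the values of `A` and `b` are hypothetical numbers chosen so the result can be checked by hand. Note that the TensorFlow model in this section uses `tf.nn.nce_loss`, a sampled approximation, rather than this full softmax.

```python
import math

def cbow_neg_log_prob(context_ids, target_id, embeddings, A, b):
    """Return -log p(target | context) under Softmax(A * sum(q_w) + b)."""
    dim = len(embeddings[0])
    # Sum the embeddings q_w of all context words
    summed = [sum(embeddings[w][d] for w in context_ids) for d in range(dim)]
    # One logit per vocabulary word
    logits = [sum(A[v][d] * summed[d] for d in range(dim)) + b[v]
              for v in range(len(A))]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[target_id] - log_z)

# Hypothetical parameters: vocabulary of 3 words, embedding dimension 2
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]

loss = cbow_neg_log_prob([0, 1], 2, embeddings, A, b)
print(loss)  # equals log(2e + 1), since the logits are [1, 1, 0]
```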

In [92]:
CONTEXT_SIZE = 2 # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process. Computational processes are abstract
beings that inhabit computers. As they evolve, processes manipulate other abstract
things called data. The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()
word_to_ix = { word: i for i, word in enumerate(set(raw_text)) }
data = []
for i in range(2, len(raw_text) - 2):
    context = [ raw_text[i-2], raw_text[i-1], raw_text[i+1], raw_text[i+2] ]
    target = raw_text[i]
    data.append( (context, target) )
print (data[:5])

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


#### Building Model with Tensorflow

In [97]:
class CBOW(object):
    def __init__(self, vocab_size, embedding_dim, context_size):
        
        # inputs and outputs
        self.X = tf.placeholder(tf.int32, [ context_size * 2 ]) # context (word sequence)
        self.Y = tf.placeholder(tf.int32, [1,1]) # target word 
        
        # Embeddings
        self.embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0))
        self.embed = tf.nn.embedding_lookup(self.embeddings, self.X)
        
        # layer
        self.W = tf.Variable(tf.random_uniform([vocab_size, embedding_dim]))
        self.b = tf.Variable(tf.random_uniform([vocab_size]))
        
        # loss        
        self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.W,
                     biases=self.b,
                     labels=self.Y,
                     inputs=self.embed,
                     num_sampled=64,
                     num_classes=vocab_size))  

In [100]:
# create model with class
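# NOTE: `vocab` was built from the sonnet in the N-gram example; the vocabulary size of this corpus is len(word_to_ix)
model = CBOW(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)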
losses = []

with tf.Session() as sess:    
    # training procedures
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(model.loss)
        
    # initialize and run all variables so that we can use their values directly
    init = tf.global_variables_initializer()
    sess.run(init)
    
    # train cycle with training data
    for epoch in range(10):
        total_loss = 0
        for context, target in data: # one (context, target) example per step
            context_idxs = list(map(lambda w: word_to_ix[w], context))
            
            _, embed,loss, params = sess.run([train_op, model.embed, model.loss, model.W],
                                                 feed_dict = {model.X:context_idxs,
                                                             model.Y:np.array(word_to_ix[target]).reshape(1,1)})
            total_loss+=loss
        print(total_loss)
    

3483.83104134
1645.26916409
980.017921448
688.383241653
546.793506622
466.670639515
417.156669617
386.634778023
369.295153141
351.957611561

