# Deep Learning for Data Analytics Summer School 3 -- NLP Tutorial

## Part 1: Preliminaries and Task Introduction

We want to solve a task called "stance detection", which is about classifying the attitude of a sentence towards a concept. Read more about the task here: http://alt.qcri.org/semeval2016/task6/

Let's have a look at the data:

In [7]:
from readwrite.reader import *
from readwrite.writer import *

fp = "data/semeval/"
train_path = fp + "semeval2016-task6-train+dev.txt"
test_path = fp + "SemEval2016-Task6-subtaskB-testdata-gold.txt"
pred_path = fp + "SemEval2016-Task6-subtaskB-testdata-pred.txt"
tweets_train, targets_train, labels_train, ids_train = readTweetsOfficial(train_path)
tweets_test, targets_test, labels_test, ids_test = readTweetsOfficial(test_path)
print(tweets_train[0], targets_train[0], labels_train[0])
print(tweets_train[721], targets_train[721], labels_train[721])

dear lord thank u for all of ur blessings forgive my sins lord give me strength and energy for this busy day ahead #blessed #hope #SemST Atheism AGAINST
Stress on #water resources threatens lives and livelihoods #anthropoceneage #sustainability #SemST Climate Change is a Real Concern FAVOR


In [8]:
print(tweets_train[5])
print(targets_train[5])
print(labels_train[5])

target_set = set(targets_train)

print(target_set)

If we are unsure whether something is halal or haram, we should leave it - this will #safeguard our #deen #rule #SemST
Atheism
AGAINST
{'Atheism', 'Feminist Movement', 'Climate Change is a Real Concern', 'Hillary Clinton', 'Legalization of Abortion'}


As you can see, each instance consists of a tweet, a target, for which we want to predict a label ("`FAVOR, AGAINST, NONE`").

## Part 2: Scikit-learn

Our second approach uses a pre-implemented classifier and feature extractor from the scikit-learn package.

In [9]:
# let's first merge the tweets and targets for easier feature extraction
tweets_targets_train = [" | ".join([tweets_train[i], targets_train[i]]) for i in range(len(tweets_train))]
tweets_targets_test = [" | ".join([tweets_test[i], targets_test[i]]) for i in range(len(tweets_test))]
tweets_targets_train[0], labels_train[0]

('dear lord thank u for all of ur blessings forgive my sins lord give me strength and energy for this busy day ahead #blessed #hope #SemST | Atheism',
 'AGAINST')

We now transform the instances into features using sklearn's count vectoriser that assigns an ID to each word, then weighs them based on their frequency.

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(tweets_train)
cv.vocabulary_

{'dear': 2144,
 'lord': 4879,
 'thank': 8002,
 'for': 3198,
 'all': 447,
 'of': 5696,
 'ur': 8485,
 'blessings': 1082,
 'forgive': 3211,
 'my': 5413,
 'sins': 7365,
 'give': 3465,
 'me': 5110,
 'strength': 7720,
 'and': 527,
 'energy': 2714,
 'this': 8094,
 'busy': 1292,
 'day': 2126,
 'ahead': 408,
 'blessed': 1079,
 'hope': 3957,
 'semst': 7164,
 'are': 649,
 'the': 8010,
 'peacemakers': 5955,
 'they': 8080,
 'shall': 7234,
 'be': 894,
 'called': 1325,
 'children': 1538,
 'god': 3506,
 'matthew': 5089,
 'scripture': 7096,
 'peace': 5952,
 'am': 477,
 'not': 5602,
 'conformed': 1816,
 'to': 8179,
 'world': 8899,
 'transformed': 8258,
 'by': 1303,
 'renewing': 6707,
 'mind': 5222,
 'ispeaklife': 4307,
 '2014': 58,
 'salah': 6969,
 'should': 7301,
 'prayed': 6215,
 'with': 8845,
 'focus': 3172,
 'understanding': 8409,
 'allah': 448,
 'warns': 8673,
 'against': 389,
 'lazy': 4664,
 'prayers': 6220,
 'done': 2455,
 'just': 4457,
 'show': 7307,
 'surah': 7824,
 'al': 425,
 'maoon': 5029,
 

In [11]:
features_train = cv.transform(tweets_targets_train)
features_test = cv.transform(tweets_targets_test)
print(features_train[0])

  (0, 408)	1
  (0, 447)	1
  (0, 527)	1
  (0, 728)	1
  (0, 1079)	1
  (0, 1082)	1
  (0, 1292)	1
  (0, 2126)	1
  (0, 2144)	1
  (0, 2714)	1
  (0, 3198)	2
  (0, 3211)	1
  (0, 3465)	1
  (0, 3957)	1
  (0, 4879)	2
  (0, 5110)	1
  (0, 5413)	1
  (0, 5696)	1
  (0, 7164)	1
  (0, 7365)	1
  (0, 7720)	1
  (0, 8002)	1
  (0, 8094)	1
  (0, 8485)	1


We now define and train a simple logistic regression model with L2 regularisation.

In [12]:
# s(f(x), g(x)) + loss function handled by this model
model = LogisticRegression(penalty='l2')
model.fit(features_train, labels_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We can now use this model to make predictions.

In [13]:
predictions = model.predict(features_test)
predictions[5]

'NONE'

**Exercise**: inspect the predictions and check for which examples incorrect vs. correct features are made. Inspect which features are good vs. bad predictors of the test set instances.

Let's see how well we did overall and compute evaluation metrics.

In [14]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(labels_test, predictions))
print(set(labels_test))
print(set(predictions))

             precision    recall  f1-score   support

    AGAINST       0.75      0.01      0.02       299
      FAVOR       0.00      0.00      0.00       148
       NONE       0.37      1.00      0.54       260

avg / total       0.45      0.37      0.21       707

{'FAVOR', 'AGAINST', 'NONE'}
{'AGAINST', 'NONE'}


  'precision', 'predicted', average, warn_for)


Let's also look at which labels were often confused with one another.

In [15]:
print(confusion_matrix(labels_test, predictions))

[[  3   0 296]
 [  0   0 148]
 [  1   0 259]]


**Exercise**: try to understand the confusion matrix and think about what would cause the results you observe.

## Part 3: Word2vec

Our third approach is to use word embeddings, which are trained using a simple feed-forward neural network. Word embeddings are commonly used in NLP, so there are many ready-made software packages, the most common one of which is word2vec.

While scikit-learn did all the preprocessing and feature extraction for us, we now have to put in a little bit more work for this.
First, we tokenise the data.

In [16]:
from ex3_word2vec.tokenize_tweets import tokenise_tweets
#tweet_tokens = tokenise_tweets(tweets_train)
#target_tokens = tokenise_tweets(targets_train)
#tweet_tokens_test = tokenise_tweets(tweets_test)
#target_tokens_test = tokenise_tweets(targets_test)
tweets_targets_train_tokens = tokenise_tweets(tweets_targets_train)
tweets_targets_test_tokens = tokenise_tweets(tweets_targets_test)

Then, we need to convert labels to indeces

In [17]:
def label2Indeces(labels):
    labels_ret = []
    for i, lab in enumerate(labels):
        if lab == 'NONE':
            labels_ret.append(0)
        elif lab == 'FAVOR':
            labels_ret.append(2)
        elif lab == 'AGAINST':
            labels_ret.append(1)
    return labels_ret

labels_train_idx = label2Indeces(labels_train)
labels_test_idx = label2Indeces(labels_test)

Then, we need to train a word2vec model. We first turn on logging to monitor the training process and set the word2vec model hyperparameters.

In [18]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# set params
num_features = 100    # Word vector dimensionality
min_word_count = 2   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
trainalgo = 1 # cbow: 0 / skip-gram: 1

We'll import the word2vec `gensim` package

In [19]:
from gensim.models import word2vec

2018-11-29 15:39:02,080 : INFO : 'pattern' package not found; tag filters are not available for English


If the package is not found, uncomment and run the line below to install gensim

In [None]:
# !{sys.executable} -m pip install gensim

Now we can start training the model

In [20]:
print("Training model...")
model = word2vec.Word2Vec(tweets_targets_train_tokens, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling, sg = trainalgo)

# add for memory efficiency
model.init_sims(replace=True)

# save the model
model.save("models/skip_nostop_sing_100features_5minwords_10context")

2018-11-29 15:39:04,965 : INFO : collecting all words and their counts
2018-11-29 15:39:04,968 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-29 15:39:05,059 : INFO : collected 9910 word types from a corpus of 107326 raw words and 5628 sentences
2018-11-29 15:39:05,063 : INFO : Loading a fresh vocabulary
2018-11-29 15:39:05,142 : INFO : effective_min_count=2 retains 9910 unique words (100% of original 9910, drops 0)
2018-11-29 15:39:05,145 : INFO : effective_min_count=2 leaves 107326 word corpus (100% of original 107326, drops 0)


Training model...


2018-11-29 15:39:05,334 : INFO : deleting the raw counts dictionary of 9910 items
2018-11-29 15:39:05,337 : INFO : sample=0.001 downsamples 51 most-common words
2018-11-29 15:39:05,338 : INFO : downsampling leaves estimated 81982 word corpus (76.4% of prior 107326)
2018-11-29 15:39:05,411 : INFO : estimated required memory for 9910 words and 100 dimensions: 12883000 bytes
2018-11-29 15:39:05,412 : INFO : resetting layer weights
2018-11-29 15:39:05,827 : INFO : training model with 4 workers on 9910 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=10
2018-11-29 15:39:06,393 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-11-29 15:39:06,541 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-11-29 15:39:06,559 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-11-29 15:39:06,570 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-11-29 15:39:06,574 : INFO : EPOCH - 1 : training 

Now we can example what the word2vec model has learned.

Exercise: play around with the three functions below by inputting diifferent words. What do you observe? 
Hint: you can access the model's vocabulary with "`model.wv.vocab`"

In [21]:
# if needed, load a word2vec model
# model = word2vec.Word2Vec.load(modelname)

# find most similar n words to given word
def applyWord2VecMostSimilar(model, word="#abortion", top=20):
    print("Find ", top, " terms most similar to ", word, "...")
    for res in model.wv.most_similar(word, topn=top):
        print(res)
    print("\n")
    
# determine similarity between words
def applyWord2VecSimilarityBetweenWords(model, w1="trump", w2="conservative"):
    print("Computing similarity between ", w1, " and ", w2, "...")
    print(model.wv.similarity(w1, w2), "\n")
    
# search which words/phrases the model knows which contain a searchterm
def applyWord2VecFindWord(model, searchterm="trump"):
    print("Finding terms containing ", searchterm, "...")
    for v in model.wv.vocab:
        if searchterm in v:
            print(v)
    print("\n")
    
applyWord2VecMostSimilar(model)
applyWord2VecSimilarityBetweenWords(model)
applyWord2VecFindWord(model)

Find  20  terms most similar to  #abortion ...
('rich', 0.990666389465332)
('personally', 0.9897016882896423)
('#thingsyoudontsayasapolitician', 0.9896957874298096)
('beg', 0.9891961812973022)
('quit', 0.9891592860221863)
('atheist', 0.9889217615127563)
('busy', 0.9888014197349548)
('hurt', 0.9883986711502075)
('dad', 0.9882718920707703)
('refuse', 0.988067626953125)
('taken', 0.9879710674285889)
('obviously', 0.9875509738922119)
('dick', 0.987427830696106)
('scared', 0.9870878458023071)
('boobs', 0.9863909482955933)
('troll', 0.9859837293624878)
('tiny', 0.9858169555664062)
('asking', 0.9855866432189941)
('living', 0.9850107431411743)
('married', 0.9849096536636353)


Computing similarity between  trump  and  conservative ...
0.91756403 

Finding terms containing  trump ...
@realdonaldtrump
trump
#donaldtrump
@strumpetcity
#trump
#trump2016




  if np.issubdtype(vec.dtype, np.int):


**Exercise**: there's another gensim package that automatically detects phrases, which can be a useful preprocessing step. Train such a model and see what it learns. Here is how to train one.

In [22]:
from gensim.models import Phrases
bigram = Phrases(tweets_targets_train_tokens)
# bigram.save("models/phrases.model")

2018-11-29 15:39:10,636 : INFO : collecting all words and their counts
2018-11-29 15:39:10,640 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-11-29 15:39:11,120 : INFO : collected 42844 word types from a corpus of 107326 words (unigram + bigrams) and 5628 sentences
2018-11-29 15:39:11,122 : INFO : using 42844 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>


**Exercise** (try at home): An alternative is to use word embeddings pre-trained on a larger dataset. Here's how to import word2vec embeddings. 

In [23]:
import gensim

# download pre-trained word embeddings: $ wget https://www.dropbox.com/s/bnm0trligffakd9/GoogleNews-vectors-negative300.bin.gz
# load them
#w2vmodel = word2vec.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
w2vmodel = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

2018-11-29 15:39:13,351 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin.gz
2018-11-29 15:44:17,154 : INFO : loaded (3000000, 300) matrix from GoogleNews-vectors-negative300.bin.gz


Now let's use the word embeddings as features for a stance detection model.
Because word embeddings encode words, but each of our instances consists of more than one word, we need to apply some additional function to convert this list of word vectors into something we can use as input to our stance detection model. A simple approach is to bag of word embeddings, which is to merely average all word embeddings for a sentence / instance. This can be implemented in a few lines of code using the Python numpy package.

In [24]:
def encodeSentW2V(w2vmodel, sents, dim=100):

    feats = []
    # for each tweet, get the word vectors and average them
    for i, tweet in enumerate(sents):
        numvects = 0
        vect = []
        for token in tweet:
            try:
                s = w2vmodel.wv[token]
                vect.append(s)
                numvects += 1
            except KeyError:
                s = 0.0
        if vect.__len__() > 0:
            mtrmean = np.average(vect, axis=0)
            if i == 0:
                feats = mtrmean
            else:
                feats = np.vstack((feats, mtrmean))
        else:
            feats = np.vstack((feats, np.zeros(dim)))

    return feats

**Exercises** (optional): 
- understand what each line in the above code does
- write an alternative function to the above that encodes tweets and targets separately and concatenates their representations
- write an alternative function to the above that encodes tweets and targets separately and concatenates their representations, then also concatenates the outer product between the vectors to the tweet-target representation to capture the interaction between tweets and targets

Now we'll convert each training and testing instance to features, using the function above.

In [25]:
features_train_w2v = encodeSentW2V(model, tweets_targets_train_tokens)
features_test_w2v = encodeSentW2V(model, tweets_targets_test_tokens)

Now we can train a logistic regression classifier with l2 regularisation

In [26]:
model = LogisticRegression(penalty='l2')
model.fit(features_train_w2v, labels_train_idx)
preds = model.predict(features_test_w2v)
preds_prob = model.predict_proba(features_test_w2v)
coef = model.coef_
print("Label options", model.classes_)
print("Labels", labels_train_idx)
print("Predictions", preds)
print("Predictions probabilities", preds_prob)
print("Feat length ", features_train_w2v[0].__len__())

Label options [0 1 2]
Labels [1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 2, 0, 1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 1, 2, 2, 1, 2, 1, 1, 0, 2, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 0, 2, 2, 0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 2, 1, 0, 1, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 0, 1, 1, 0, 1, 0, 2, 1, 1, 0, 1, 2, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 2, 0, 1, 0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2, 1, 2, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 0, 2, 1, 0, 0, 1, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 1, 0, 0, 1, 1, 0, 2, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 2, 1, 2, 1, 2, 0, 2, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 0

We then check the performance again

In [27]:
print(classification_report(labels_test_idx, preds))

             precision    recall  f1-score   support

          0       0.40      0.09      0.15       260
          1       0.42      0.87      0.57       299
          2       0.40      0.11      0.17       148

avg / total       0.41      0.42      0.33       707



In [28]:
print(confusion_matrix(labels_test_idx, preds))

[[ 23 228   9]
 [ 25 259  15]
 [  9 123  16]]


**Exercises:**
- as for the model in Part 2, examine the correct and incorrect predictions. How do the results compare to the ones you obtained in Part 2?
- replace the logistic regression classifier with a simple neural network, a multi-layer perceptron (Hint: see http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) and compare performances

### Replace the lgr model with a simple neural network

In [29]:
from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(200,400,600,400,200))
nn_model.fit(features_train_w2v, labels_train_idx)

nn_preds = nn_model.predict(features_test_w2v)
#nn_preds_prob = model.predict_proba(features_test_w2v)
#coef = model.coef_

print(classification_report(labels_test_idx, nn_preds))
print(confusion_matrix(labels_test_idx, nn_preds))

             precision    recall  f1-score   support

          0       0.37      0.76      0.50       260
          1       0.43      0.14      0.21       299
          2       0.12      0.05      0.07       148

avg / total       0.34      0.35      0.29       707

[[198  35  27]
 [222  43  34]
 [117  23   8]]


## Part 4: RNNs

Up until now, we have trained models that ingore word order. We will now train RNNs, that take as input the word embeddings we have trained in Part 3 and learn to construct a sentence, then predict a stance label.

Some more intricate pre-processing than in the previous part is necessary to map words to IDs and account for unseen words at test time. For now, let's assume we have a function that takes care of this.

First, let's define some preliminaries.

In [30]:
from readwrite.reader import *
from readwrite.writer import *
import tensorflow as tf
from collections import defaultdict
from ex4_rnns.tensoriser import prepare_data
from ex4_rnns.batch import get_feed_dicts
from ex4_rnns.map import numpify

# Set initial random seed so results are more stable
np.random.seed(1337)
tf.set_random_seed(1337)

We can define various options for training our models, which have a big impact on performance. For now, let's set them to values that allow us to do rapid prototyping.

In [31]:
# Define model options / hyperparameters
options = {"main_num_layers": 3, "model_type": "tweet-only-lstm", "batch_size": 32, "emb_dim": 16, 
            "max_epochs": 50, "skip_connections": True, "learning_rate": 0.001, "dropout_rate": 0.3, 
            "rnn_cell_type": "lstm", "attention": False, "pretr_word_embs": False}

We first need to define placeholders, which define what shape the data we pass on to the optmiser has.
In our case, our data consists of instance IDs, tweets, targets and labels. For tweets and targets, we also need to provide how long the instance are, i.e. how many tokens each sentence is made up of. This is important for the RNN later on -- because an unrolled RNN consists of several time steps, one step for each token, we need to know exactly how many time steps we need for each instance.

In [32]:
def set_placeholders():
    ids = tf.placeholder(tf.int32, [None], name="ids")
    tweets = tf.placeholder(tf.int32, [None, None], name="tweets")
    tweet_lengths = tf.placeholder(tf.int32, [None], name="tweets_lengths")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    target_lengths = tf.placeholder(tf.int32, [None], name="targets_lengths")
    labels = tf.placeholder(tf.int32, [None, None], name="labels")
    placeholders = {"ids": ids, "tweets": tweets, "tweets_lengths": tweet_lengths, "targets": targets, "targets_lengths": target_lengths, "labels": labels}
    return placeholders

placeholders = set_placeholders()

Let's load the data, turn it into indeces and then tensors.

In [33]:
from ex4_rnns.classifier_rnns import loadData
data_train, data_test, vocab, labels = loadData(train_path, test_path, placeholders, **options)
print("Data loaded and tensorised.")

Data loaded and tensorised.


Now let's start defining our first model, a bidirectional RNN. In a quite most basic form with LSTM cells, it looks like this:

In [34]:
def reader_simple(inputs, lengths, output_size, scope=None):
    """Dynamic bi-LSTM reader.

    Args:
        inputs (tensor): The inputs into the bi-LSTM
        lengths (tensor): The lengths of the sequences
        output_size (int): Size of the LSTM state of the reader
        scope (string): The TensorFlow scope for the reader.

    Returns:
        Outputs (tensor): The outputs from the bi-LSTM.
        States (tensor): The cell states from the bi-LSTM.
    """
    with tf.variable_scope(scope or "reader", reuse=tf.AUTO_REUSE) as varscope:
        cell_fw = tf.contrib.rnn.LSTMCell(output_size, initializer=tf.contrib.layers.xavier_initializer())
        cell_bw = tf.contrib.rnn.LSTMCell(output_size, initializer=tf.contrib.layers.xavier_initializer())
    
        outputs, states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw,
            cell_bw,
            inputs,
            sequence_length=lengths,
            dtype=tf.float32
        )
        
    # ( (outputs_fw,outputs_bw) , (output_state_fw,output_state_bw) )
    # in case LSTMCell: output_state_fw = (c_fw,h_fw), and output_state_bw = (c_bw,h_bw)
    # each [batch_size x max_seq_length x output_size]
    return outputs, states

So we need to define what cells we want to have for the forwards backwards and backwards reading, and then define a `tf.nn.bidirectional_dynamic_rnn`. The latter takes as arguments the forwards and backwards cells, the inputs to the RNN, i.e. a sentence, and the sequence lengths, i.e. the token length of the sentence.

Let's define another function for reading a sentence with an RNN now, but with a few additional bells and whistles.

In [35]:
def reader(inputs, lengths, output_size, contexts=(None, None), scope=None, **options):
    """Dynamic bi-LSTM reader; can be conditioned with initial state of other rnn.

    Args:
        inputs (tensor): The inputs into the bi-LSTM
        lengths (tensor): The lengths of the sequences
        output_size (int): Size of the LSTM state of the reader.
        context (tensor=None, tensor=None): Tuple of initial (forward, backward) states
                                  for the LSTM
        scope (string): The TensorFlow scope for the reader.

    Returns:
        Outputs (tensor): The outputs from the bi-LSTM.
        States (tensor): The cell states from the bi-LSTM.
    """

    skip_connections = options["skip_connections"]
    attention = options["attention"]
    num_layers = options["main_num_layers"]
    drop_keep_prob = options["dropout_rate"]

    with tf.variable_scope(scope or "reader", reuse=tf.AUTO_REUSE) as varscope:
        if options["rnn_cell_type"] == "layer_norm":
            cell_fw = tf.contrib.rnn.LayerNormBasicLSTMCell(output_size)
            cell_bw = tf.contrib.rnn.LayerNormBasicLSTMCell(output_size)
        elif options["rnn_cell_type"] == "nas":
            cell_fw = tf.contrib.rnn.NASCell(output_size)
            cell_bw = tf.contrib.rnn.NASCell(output_size)
        elif options["rnn_cell_type"] == "phasedlstm":
            cell_fw = tf.contrib.rnn.PhasedLSTMCell(output_size)
            cell_bw = tf.contrib.rnn.PhasedLSTMCell(output_size)
        else: #LSTM cell
            cell_fw = tf.contrib.rnn.LSTMCell(output_size, initializer=tf.contrib.layers.xavier_initializer())
            cell_bw = tf.contrib.rnn.LSTMCell(output_size, initializer=tf.contrib.layers.xavier_initializer())
        if num_layers > 1:
            cell_fw = tf.nn.rnn_cell.MultiRNNCell([cell_fw] * num_layers)
            cell_bw = tf.nn.rnn_cell.MultiRNNCell([cell_bw] * num_layers)

        if drop_keep_prob != 1.0:
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell=cell_fw, output_keep_prob=drop_keep_prob)
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell=cell_bw, output_keep_prob=drop_keep_prob)

        if skip_connections == True:
            cell_fw = tf.contrib.rnn.ResidualWrapper(cell_fw)
            cell_bw = tf.contrib.rnn.ResidualWrapper(cell_bw)

        if attention == True:
            cell_fw = tf.contrib.rnn.AttentionCellWrapper(cell_fw, attn_length=10)
            cell_bw = tf.contrib.rnn.AttentionCellWrapper(cell_bw, attn_length=10)

        outputs, states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw,
            cell_bw,
            inputs,
            sequence_length=lengths,
            initial_state_fw=contexts[0],
            initial_state_bw=contexts[1],
            dtype=tf.float32
        )

        # ( (outputs_fw,outputs_bw) , (output_state_fw,output_state_bw) )
        # in case LSTMCell: output_state_fw = (c_fw,h_fw), and output_state_bw = (c_bw,h_bw)
        # each [batch_size x max_seq_length x output_size]
        return outputs, states

As you can see above, we have now added options for different cells, for multiple layers, for dropout, skip connections and word by word attention. All those are tricks of the trade to achieve better performance. We have also expanded the arguments of the `tf.nn.bidirectional_dynamic_rnn()` function such that we can control the initialisation of the RNNs (`initial_state_fw, initial_state_fw`).

Now that we've defined an RNN, we can use that to define a first model. 

In [36]:
def bilstm_tweet_reader(placeholders, label_size, vocab_size, emb_init=None, **options):
    emb_dim = options["emb_dim"] # embedding dimensionality

    # [batch_size, max_seq1_length]
    seq1 = placeholders['tweets']

    # [batch_size, labels_size]
    labels = tf.to_float(placeholders['labels'])

    init = tf.contrib.layers.xavier_initializer(uniform=True)
    if init is None:
        emb_init = init

    # embed the words, i.e. look up the embedding for each word
    with tf.variable_scope("embeddings", reuse=tf.AUTO_REUSE):
        embeddings = tf.get_variable("word_embeddings", [vocab_size, emb_dim], dtype=tf.float32, initializer=emb_init)

    with tf.variable_scope("embedders", reuse=tf.AUTO_REUSE) as varscope:
        seq1_embedded = tf.nn.embedding_lookup(embeddings, seq1)

    # give those embeddings as an input to the RNN reader we have defined above
    with tf.variable_scope("reader_seq", reuse=tf.AUTO_REUSE) as varscope1:
        # seq1_states: (c_fw, h_fw), (c_bw, h_bw)
        outputs, states = reader(seq1_embedded, placeholders['tweets_lengths'], emb_dim,
                            scope=varscope1, **options)

    # shape output: [batch_size, 2*emb_dim]
    if options["main_num_layers"] == 1:
        # shape states: [2, 2]
        output = tf.concat([states[0][1], states[1][1]], 1)
    else:
        # shape states: [2, num_layers, 2]
        output = tf.concat([states[0][-1][1], states[1][-1][1]], 1)

    # pass the RNN encoding to an output layer to make prediction
    with tf.variable_scope("bilstm_preds", reuse=tf.AUTO_REUSE):
        # output of sequence encoders is projected into an output layer
        scores = tf.contrib.layers.fully_connected(output, label_size, weights_initializer=init, activation_fn=tf.tanh)
        loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=labels)
        predict = tf.nn.softmax(scores)

    return scores, loss, predict

This first model encodes only the tweets using an RNN and makes a prediction based on that encoding. The model consists of three parts: 1) word embedding learning and lookup, 2) tweet encoding with an RNN, 3) output layer: projection of the tweet RNN encoding into the space of output labels

**Exercise**: write a variant of the above model that encodes both the tweet and the tweet target with an RNN each.

**Thought exercise**: what happens with the word embeddings here and how does it relate to what we have seen in the previous part of the tutorial? Could we use the embeddings we have trained in the previous part for our model? What would be the benefits, downsides and challenges with that?

In [37]:
def bilstm_tweet_and_target_reader(placeholders, label_size, vocab_size, emb_init=None, **options):
    emb_dim = options["emb_dim"] # embedding dimensionality

    # [batch_size, max_seq1_length]
    seq1 = placeholders['tweets']

    # [batch_size, labels_size]
    labels = tf.to_float(placeholders['labels'])

    init = tf.contrib.layers.xavier_initializer(uniform=True)
    if init is None:
        emb_init = init

    # embed the words, i.e. look up the embedding for each word
    with tf.variable_scope("embeddings", reuse=tf.AUTO_REUSE):
        embeddings = tf.get_variable("word_embeddings", [vocab_size, emb_dim], dtype=tf.float32, initializer=emb_init)

    with tf.variable_scope("embedders", reuse=tf.AUTO_REUSE) as varscope:
        seq1_embedded = tf.nn.embedding_lookup(embeddings, seq1)

    # give those embeddings as an input to the RNN reader we have defined above
    with tf.variable_scope("reader_seq", reuse=tf.AUTO_REUSE) as varscope1:
        # seq1_states: (c_fw, h_fw), (c_bw, h_bw)
        outputs, states = reader(seq1_embedded, placeholders['tweets_lengths'], emb_dim,
                            scope=varscope1, **options)
        
    with tf.variable_scope("target_reader_seq", reuse=tf.AUTO_REUSE) as varscope:
        

    # shape output: [batch_size, 2*emb_dim]
    if options["main_num_layers"] == 1:
        # shape states: [2, 2]
        output = tf.concat([states[0][1], states[1][1]], 1)
    else:
        # shape states: [2, num_layers, 2]
        output = tf.concat([states[0][-1][1], states[1][-1][1]], 1)

    # pass the RNN encoding to an output layer to make prediction
    with tf.variable_scope("bilstm_preds", reuse=tf.AUTO_REUSE):
        # output of sequence encoders is projected into an output layer
        scores = tf.contrib.layers.fully_connected(output, label_size, weights_initializer=init, activation_fn=tf.tanh)
        loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=labels)
        predict = tf.nn.softmax(scores)

    return scores, loss, predict

IndentationError: expected an indented block (<ipython-input-37-964c7354ebfc>, line 31)

The next thing we need is a training loop. What we want to do is: for a number of epochs, draw a batch of training instances, train our model on that, adjust the parameters of the model; and repeat this for a fixed number of epochs, or until the model converges.

In [38]:
def training_loop(placeholders, train_feed_dicts, min_op, logits, loss, preds, sess, **options):

    max_epochs = options["max_epochs"]

    for i in range(1, max_epochs + 1):
        loss_all, correct_all = [], 0.0
        total, correct_dev_all = 0.0, 0.0
        for batch in train_feed_dicts:
            _, current_loss, p = sess.run([min_op, loss, preds], feed_dict=batch)
            loss_all.append(current_loss)
            correct_all, total = calculate_hits(correct_all, total, placeholders, p, batch)

        # Randomise batch IDs, so that selection of batch is random
        np.random.shuffle(train_feed_dicts)
        acc = correct_all / total

        mean_loss = np.mean(loss_all)
        print('Epoch %d :' % i, "Loss: ", mean_loss, "Acc: ", acc)

    return logits, loss, preds

Now let's define the rest of the training procedure. We have a model and a training loop, now we also need to define an optmiser and we need to initialise our graph. For the optmiser, we could use vanilla gradient descent, or something cleverer that adjusts the learning rate. Here, we use `RMSProp`, which works well in practice and is space-efficient.
For ease of use, we wrap this all inside a function.

In [39]:
#from ex4_rnns.classifier_rnns import bicond_reader

def train(placeholders, target_labels, train_feed_dicts, vocab, w2v_model=None, sess=None, **options):
    # placeholders, labels, data_train, vocab, sess=sess, **options

    init = None
    if w2v_model != None:
        init = tf.constant_initializer(w2v_model.wv.syn0)

    # Create model. The second one is the one defined above, the first one encodes both the tweet and the target
    if options["model_type"] == 'bicond':
        logits, loss, preds = bicond_reader(placeholders, len(target_labels), len(vocab), init, **options)  # those return dicts where the keys are the task names
    elif options["model_type"] == 'tweet-only-lstm':
        logits, loss, preds = bilstm_tweet_reader(placeholders, len(target_labels), len(vocab), init, **options)  # those return dicts where the keys are the task names

    # define an optimiser and initialise graph
    optim = tf.train.RMSPropOptimizer(learning_rate=options["learning_rate"])
    min_op = optim.minimize(tf.reduce_mean(loss))
    tf.global_variables_initializer().run(session=sess)

    # call the training loop function
    logits, loss, preds = training_loop(placeholders, train_feed_dicts, min_op, logits, loss, preds, sess, **options)

    return logits, loss, preds

To monitor how well we do during training, we calculate accuracy, in addition to printing the loss.

In [40]:
def calculate_hits(correct_all, total, placeholders, p, batch):
    hits = [pp for ii, pp in enumerate(p) if np.argmax(pp) == np.argmax(batch[placeholders["targets"]][ii])]
    correct_all += len(hits)
    total += len(batch[placeholders["targets"]])
    return correct_all, total

Now we have defined everything we need to start training a model. To do this, we need to start a new session, then call the training routine. We first train on the training data, monitoring performance as we go, then apply the trained model to the test data.

In [None]:
# Do not take up all the GPU memory all the time.
sess_config = tf.ConfigProto()
sess_config.gpu_options.allow_growth = True
with tf.Session(config=sess_config) as sess:
    logits, loss, preds = train(placeholders, labels, data_train, vocab, sess=sess, **options)

    print("Finished training, evaluating on test set")

    correct_test_all, total_test = 0.0, 0.0
    p_inds_test, g_inds_test = [], []
    for batch_test in data_test:
        p_test = sess.run(preds, feed_dict=batch_test)

        pred_inds_test = [np.argmax(pp_test) for pp_test in p_test]
        p_inds_test.extend(pred_inds_test)
        gold_inds_test = [np.argmax(batch_test[placeholders["targets"]][i_d]) for i_d, targ in
                              enumerate(batch_test[placeholders["targets"]])]
        g_inds_test.extend(gold_inds_test)

        correct_test_all, total_test = calculate_hits(correct_test_all, total_test, placeholders, p_test, batch_test)


    acc_test = correct_test_all / total_test

    print("Test accuracy:", acc_test)





Epoch 1 : Loss:  1.0622361 Acc:  0.39701704545454547
Epoch 2 : Loss:  1.0465156 Acc:  0.39683948863636365
Epoch 3 : Loss:  0.9854704 Acc:  0.3036221590909091
Epoch 4 : Loss:  0.91591954 Acc:  0.27095170454545453
Epoch 5 : Loss:  0.856513 Acc:  0.28373579545454547
Epoch 6 : Loss:  0.7918611 Acc:  0.31977982954545453
Epoch 7 : Loss:  0.7302194 Acc:  0.3407315340909091
Epoch 8 : Loss:  0.6628153 Acc:  0.3464133522727273
Epoch 9 : Loss:  0.6078312 Acc:  0.3370028409090909
Epoch 10 : Loss:  0.56794417 Acc:  0.34925426136363635
Epoch 11 : Loss:  0.5273678 Acc:  0.3464133522727273
Epoch 12 : Loss:  0.4880041 Acc:  0.34321732954545453
Epoch 13 : Loss:  0.4646991 Acc:  0.3487215909090909
Epoch 14 : Loss:  0.44139308 Acc:  0.34534801136363635
Epoch 15 : Loss:  0.4262697 Acc:  0.34996448863636365
Epoch 16 : Loss:  0.41587022 Acc:  0.34410511363636365
Epoch 17 : Loss:  0.40644482 Acc:  0.34801136363636365
Epoch 18 : Loss:  0.39521807 Acc:  0.3513849431818182
Epoch 19 : Loss:  0.38839865 Acc:  0.35

**Exercises**:
- We have defined a number of hyperparameters above, mainly set to allow for rapid prototyping, not to achieve a good performance. What would be better ones? Try a few different combinations and monitor loss, training accuracy and observe test accuracy.
- There are also a number of different optmisers you can use, see https://www.tensorflow.org/api_docs/python/tf/train/
- Debug tip: if you receive a weird Tensorflow message about reusing variables, select "`Kernel -> Restart & Run All`"
- Replace the accuracy printing function with the sklearn classification report printing, as introduced in the second part of the tutorial