## 1. Introduction

This notebook walks you through building a simple neural model for detecting disastrous Tweets. The Keras documentation may be helpful for the exercises: https://keras.io/.

The first step is to import the required Python packages and fix the random seeds so that experiments are reproducible:

In [None]:
import os
import random
from collections import namedtuple

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow as tf
from tqdm import tqdm

# Set random seed settings for repeatable experiments
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)  # TensorFlow 2.x API; on TensorFlow 1.x use tf.set_random_seed(0)
print('done')

## 2. Dataset loading

In the next step, we will load the dataset of disaster and casual Tweets from the provided .tsv file. Disaster Tweets are labeled with `1`, and casual Tweets with `0`:

In [None]:
TextDataset = namedtuple('TextDataset', ('texts', 'labels'))

def load_disaster_dataset():
    tweets = []
    labels = []
    data_path = os.path.join('data', 'disasters_socialmedia.tsv')
    with open(data_path, encoding='utf8') as data_file:
        # Wrap the lines in a progress bar and skip the header row
        for line in tqdm(list(data_file)[1:]):
            tweet, _, label = line.strip().split('\t')
            label = int(label)
            if label == 2:  # Skip the "Can't decide" (2) category
                continue
            tweets.append(tweet)
            labels.append(label)
    return TextDataset(texts=tweets, labels=labels)

disasters_dataset = load_disaster_dataset()

disasterous = [disasters_dataset.texts[i] for i, label in enumerate(disasters_dataset.labels) 
                      if label == 1]

casual = [disasters_dataset.texts[i] for i, label in enumerate(disasters_dataset.labels) 
                      if label == 0]
 
print('Fraction of casual tweets: {:.3f}'.format(len(casual)/(len(disasterous)+len(casual))))

print('\n=================\nDISASTROUS TWEETS\n=================')
for tweet in random.sample(disasterous, 5):
    print('------\n',tweet)


print('\n==================\nCASUAL TWEETS\n==================')

for tweet in random.sample(casual, 5):
    print('------\n',tweet)


## 3. Data encoding

After loading the textual data, we have to transform it into a numerical form that the neural model can process. Keras requires the data to be indexed (every unique word -> unique index) and padded to the same length. We will use the `keras.preprocessing.text.Tokenizer` and `keras.preprocessing.sequence` tools for this purpose:

In [None]:
# Data preparation: encode the texts as padded sequences of word indices

EncodedDataset = namedtuple('EncodedDataset', ('instances', 'labels'))

def encode_dataset(text_dataset, tokenizer):
    texts_enco = tokenizer.texts_to_sequences(text_dataset.texts)
    texts_padded = sequence.pad_sequences(texts_enco, padding='pre')
    encoded_dataset = EncodedDataset(instances=texts_padded, labels=text_dataset.labels)
    vocab_size = len(tokenizer.word_index) + 1
    sequence_len = len(texts_padded[0])
    return encoded_dataset, vocab_size, sequence_len

tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=' ',
                      char_level=False)
tokenizer.fit_on_texts(disasters_dataset.texts)
encoded_dataset, vocab_size, sequence_len = encode_dataset(disasters_dataset, tokenizer)
print('Vocab size: {}, Sequence len: {}'.format(vocab_size, sequence_len))
print('\nEncoded Tweets:')
for instance in encoded_dataset.instances[:10]:
    print(instance)
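
To make the two encoding steps concrete, here is a minimal plain-Python sketch of indexing and pre-padding on two toy sentences. The helpers `build_word_index`, `encode`, and `pad_pre` are hypothetical names introduced only for illustration. They are not the Keras implementation: in particular, the Keras `Tokenizer` assigns indices by word frequency, while this sketch assigns them in order of first appearance.

```python
def build_word_index(texts):
    """Map every unique lower-cased word to an index, starting at 1.

    Index 0 is left free for padding, which is why the notebook
    computes vocab_size as len(word_index) + 1.
    """
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]


# Toy data, not taken from the Disasters dataset
toy_texts = ['fire in the forest', 'the forest is calm today']
toy_index = build_word_index(toy_texts)
toy_padded = pad_pre(encode(toy_texts, toy_index))

for row in toy_padded:
    print(row)
# [0, 1, 2, 3, 4]
# [3, 4, 5, 6, 7]
```

The shorter sentence receives a leading `0`, which corresponds to `padding='pre'` in the cell above.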




## 4. Preparing training/validation splits

Now we will split our dataset into two parts: 
- one used for training (75%) and 
- one used for validation (assessment of the model's accuracy, 25%)


In [None]:
def get_splits(dataset, valid_fraction=0.25):
    data_size = len(dataset.instances)
    split_id = int(data_size * valid_fraction)
    merged_data = list(zip(dataset.instances, dataset.labels))
    random.shuffle(merged_data)
    shuffled_instances, shuffled_labels = zip(*merged_data)
    valid_data = EncodedDataset(instances=shuffled_instances[0:split_id],
                                labels=shuffled_labels[0:split_id])
    train_data = EncodedDataset(instances=shuffled_instances[split_id:],
                                labels=shuffled_labels[split_id:])
    return train_data, valid_data

train_data, valid_data = get_splits(encoded_dataset)

print('Training data size: {}, Validation data size: {}'.format(len(train_data.labels), len(valid_data.labels)))
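
Because the split is random, it can be worth checking that both parts contain a similar share of each class. The sketch below shows one possible check on toy label lists; `label_fractions` is a hypothetical helper, and the toy lists merely stand in for `train_data.labels` and `valid_data.labels`.

```python
from collections import Counter


def label_fractions(labels):
    """Return a dict mapping each label to its share of the examples."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Toy stand-ins for train_data.labels and valid_data.labels
toy_train_labels = [0, 0, 1, 0, 1, 0, 1, 0]
toy_valid_labels = [0, 1, 0, 0]

print('train:', label_fractions(toy_train_labels))
print('valid:', label_fractions(toy_valid_labels))
```

If the fractions differ a lot between the two parts, validation accuracy becomes harder to compare with training accuracy.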

## 5. Building and training the neural model

We will build a simple RNN model and train it with the SGD algorithm: 

In [None]:
print('Building model...', end=' ')

model = Sequential()
model.add(Embedding(input_length=sequence_len, input_dim=vocab_size, output_dim=50, mask_zero=True)) 
# Set mask_zero=False when using CNN
model.add(SimpleRNN(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=SGD(), metrics=['accuracy'])
print('done')

print('Training model..')
# Inputs are word indices, so integer arrays are used for the instances as well
x_train = np.array(train_data.instances, np.int32)
y_train = np.array(train_data.labels, np.int32)
x_valid = np.array(valid_data.instances, np.int32)
y_valid = np.array(valid_data.labels, np.int32)
model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid), batch_size=32, epochs=5, verbose=2)
print('Training completed')


Finally, let's evaluate our trained model once more on the validation split to confirm the final accuracy of the solution:

In [None]:
print("Final validation..")
_, acc = model.evaluate(x_valid, y_valid, verbose=2)
print("Accuracy on validation set: {:.3f}".format(acc))

## Exercise 1 (warmup): Display performance on the training set 

As with the validation split, please evaluate your model on the training data and examine the performance. The difference between training and validation accuracy will give you some insight into how well the model generalizes.

## Exercise 2: Improve network performance

The default network architecture should achieve ~60% accuracy on the Disasters dataset. This is better than random, but still not satisfactory. Please use the knowledge acquired during the presentation (about different network architectures, optimization algorithms, etc.) to improve the model. Try to reach 80% accuracy, or even more!

## Exercise 3: Generate network predictions for example text data

Please write simple code that generates predictions for a list of custom sentences, for example:
```
texts = ['Fiery 14-vehicle crash on Highway 400 kills 3',
         'I have a crush on Monica']
```
the output should look like this:

```
1.000 -> Fiery 14-vehicle crash on Highway 400 kills 3
0.000 -> I have a crush on Monica

```
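
The output format above can be produced with ordinary string formatting. The sketch below covers only that formatting step: `format_predictions` is a hypothetical helper, and the probabilities are made-up placeholders rather than real model outputs. For this single-output model, `model.predict()` typically returns one row per text, so the result would likely need to be flattened first.

```python
def format_predictions(texts, probabilities):
    """Pair each probability with its text in the '0.000 -> text' format."""
    return ['{:.3f} -> {}'.format(prob, text)
            for prob, text in zip(probabilities, texts)]


texts = ['Fiery 14-vehicle crash on Highway 400 kills 3',
         'I have a crush on Monica']
# Placeholder values for illustration only, NOT produced by the model
made_up_probabilities = [0.97, 0.04]

for line in format_predictions(texts, made_up_probabilities):
    print(line)
```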


Hints:
- The example texts should be processed in the same way as the training data
    - Use `keras.preprocessing.sequence.pad_sequences` with the `maxlen` parameter set to `sequence_len` to pad the encoded data to the desired length:
```
keras.preprocessing.sequence.pad_sequences(sequences=..., maxlen=sequence_len)
```

- Use the `model.predict()` method to generate predictions for the data 
- The `EarlyStopping` callback can be used to stop training when the validation score stops increasing:

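```
from tensorflow.keras.callbacks import EarlyStopping

# The monitored key depends on the version: in TensorFlow 2.x it is 'val_accuracy'
early_stopping = EarlyStopping(monitor='val_acc', mode='max')
# A validation metric can only be monitored when validation data is passed to fit()
model.fit(x=..., y=..., validation_data=..., epochs=100, callbacks=[early_stopping])
```
  Another option is to reduce the number of epochs so that the best model is the one obtained in the last epoch.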
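
The idea behind early stopping can be illustrated in plain Python. This is a simplified sketch of the logic, not the Keras implementation: `stopping_epoch` is a hypothetical helper, the validation accuracies are invented, and `patience` is assumed here to mean the number of consecutive epochs without improvement that triggers the stop.

```python
def stopping_epoch(val_scores, patience=1):
    """Return the 0-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs fail to improve
    on the best score seen so far. If that never happens, the index
    of the last epoch is returned.
    """
    best = float('-inf')
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_scores) - 1


# Invented validation accuracies, one per epoch
toy_val_acc = [0.60, 0.65, 0.70, 0.69, 0.68, 0.72]

print(stopping_epoch(toy_val_acc, patience=1))  # 3
print(stopping_epoch(toy_val_acc, patience=3))  # 5
```

With a small `patience` the run stops at the first dip and never sees the later improvement to 0.72, which is the trade-off to keep in mind when choosing this value.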


## Exercise 4: Display misclassified examples for error analysis

Manual inspection of misclassified examples (error analysis) is a good way to gain insight into the model's behaviour and a source of ideas for possible improvements. Using the code from the previous exercise, try to display misclassified examples from the validation data and check whether you can see any patterns in the errors that the network makes. Assume that the disaster label (`1`) is chosen when the prediction is > 0.5; otherwise the tweet is treated as casual (`0`).
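
The thresholding rule can be sketched without a trained model. In the snippet below, `find_misclassified` is a hypothetical helper and the texts, labels, and probabilities are invented placeholders. In the notebook, the probabilities would come from the model, and the validation instances are stored as word indices, so the readable texts would have to be kept alongside them or recovered separately.

```python
def find_misclassified(texts, labels, probabilities, threshold=0.5):
    """Return (text, true_label, probability) for every wrong prediction.

    A probability strictly above the threshold is read as label 1
    (disaster); anything else is read as label 0 (casual).
    """
    errors = []
    for text, label, prob in zip(texts, labels, probabilities):
        predicted = 1 if prob > threshold else 0
        if predicted != label:
            errors.append((text, label, prob))
    return errors


# Invented placeholder data, NOT real dataset rows or model outputs
toy_texts = ['bridge collapsed downtown', 'this exam was a disaster', 'nice weather today']
toy_labels = [1, 0, 0]
toy_probabilities = [0.31, 0.88, 0.05]

for text, label, prob in find_misclassified(toy_texts, toy_labels, toy_probabilities):
    print('true={} predicted_prob={:.3f} -> {}'.format(label, prob, text))
```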

