## 1-Getting the data

This notebook is based on the work presented [here](http://nadbordrozd.github.io/blog/2017/06/03/python-or-scala/) by [Nadbor](https://www.linkedin.com/in/nadbor-drozd-12316063/).

A PyTorch version can be found [here](https://github.com/jrzaurin/RNN_character_tagging).

Our goal is to tell text sources apart at the character level. Let's start by downloading the data.

In [1]:
%%bash

# to get the data simply run this in your working directory
mkdir -p data/austen
cd data

wget http://www.gutenberg.org/files/31100/31100.txt
mv 31100.txt austen/austen.txt

mkdir shakespeare
wget http://www.gutenberg.org/files/100/100-0.txt
mv 100-0.txt shakespeare/shakespeare.txt

git clone https://github.com/scikit-learn/scikit-learn.git
git clone https://github.com/scalaz/scalaz.git

--2018-09-15 17:48:46--  http://www.gutenberg.org/files/31100/31100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4454075 (4.2M) [text/plain]
Saving to: ‘31100.txt’

     0K .......... .......... .......... .......... ..........  1%  206K 21s
    50K .......... .......... .......... .......... ..........  2%  377K 16s
   100K .......... .......... .......... .......... ..........  3%  431K 14s
   150K .......... .......... .......... .......... ..........  4% 2.04M 11s
   200K .......... .......... .......... .......... ..........  5%  485K 10s
   250K .......... .......... .......... .......... ..........  6% 1.32M 9s
   300K .......... .......... .......... .......... ..........  8%  382K 9s
   350K .......... .......... .......... .......... ..........  9% 4.66M 8s
   400K .......... .......... .......... .......... .......

In [2]:
%%bash

find data/ -maxdepth 1 -type d

data/
data//shakespeare
data//austen
data//scalaz
data//scikit-learn


## 2-Prepare input files

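Now we prepare the files with some text pre-processing: every file is transliterated to ASCII, tabs are replaced by four spaces, and any character outside a fixed set of 96 characters is dropped.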

In [4]:
import numpy as np
import fnmatch
import os

from unidecode import unidecode

chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'
charset = set(chars)
n_chars = len(charset)
char2ind = dict((c, i) for i, c in enumerate(chars))
ind2char = dict((i, c) for i, c in enumerate(chars))

char2vec = {}
for c in charset:
    vec = np.zeros(n_chars)
    vec[char2ind[c]] = 1
    char2vec[c] = vec


def sanitize_text(text):
    return ''.join(c for c in unidecode(text.decode('utf-8', 'ignore')).replace('\t', '    ') if c in charset)


input_dirs = ['data/scikit-learn', 'data/scalaz', 'data/austen', 'data/shakespeare']
output_dirs = ['data/sklearn_clean', 'data/scalaz_clean', 'data/austen_clean', 'data/shakespeare_clean']
file_patterns = ['*.py','*.scala','austen.txt','shakespeare.txt']
for input_dir, output_dir, file_pattern in zip(input_dirs, output_dirs, file_patterns):
    # create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    for root, dirnames, filenames in os.walk(input_dir):
        for filename in fnmatch.filter(filenames, file_pattern):
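            src_path = os.path.join(root, filename)
            # note: all files land directly in output_dir, so files that share a name
            # (e.g. __init__.py) overwrite each other
            dst_path = os.path.join(output_dir, filename)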
            # read in bytes (rb), write in text ('w')
            with open(src_path, 'rb') as in_f, open(dst_path, 'w') as out_f:
                out_f.write(sanitize_text(in_f.read()))
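
To make the pre-processing concrete, here is a minimal, standard-library-only sketch of the same two ideas: dropping characters outside the 96-character set and one-hot encoding the rest. Note that `sanitize_ascii` and `one_hot` are hypothetical helpers written for this illustration. Unlike `sanitize_text` above, `sanitize_ascii` does not transliterate with `unidecode`, so an accented character is simply dropped.

```python
chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'
charset = set(chars)
char2ind = {c: i for i, c in enumerate(chars)}


def sanitize_ascii(raw_bytes):
    """Simplified stand-in for sanitize_text (assumption: no unidecode step)."""
    text = raw_bytes.decode('utf-8', 'ignore').replace('\t', '    ')
    return ''.join(c for c in text if c in charset)


def one_hot(c):
    """Return a list with a single 1 at the index of character c."""
    vec = [0] * len(chars)
    vec[char2ind[c]] = 1
    return vec


# the tab becomes four spaces; the accented character and '\r' are dropped
clean = sanitize_ascii(b"x\t= 1 # caf\xc3\xa9\r\n")
encoded = [one_hot(c) for c in clean]
```

Each element of `encoded` has length 96, which is where the last dimension of the network input comes from later on.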

### 2_1. Dealing with the books

For this tutorial we will be using the `python` and `scala` datasets. 

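In case you want to use Austen's and Shakespeare's books instead, the following additional processing splits each book into many smaller files (for example, one per chapter for Austen), so that they can be shuffled into train and test sets just like the code files.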

In [5]:
import re
import os

authors = ['austen', 'shakespeare']

ebook_d = {}
ebook_d['austen'] = {}
ebook_d['shakespeare'] = {}

ebook_d['austen']['dir'] = 'data/austen_clean'
ebook_d['austen']['fname'] = 'austen.txt'
ebook_d['austen']['regex'] = r'Chapter\s+.*|CHAPTER\s+.*' # regular expression to split on (raw string)
ebook_d['austen']['startidx'] = 1 # starting index for the resulting partitions
ebook_d['austen']['endex'] = 'THE END' # expression to denote the end of the document 

ebook_d['shakespeare']['dir'] = 'data/shakespeare_clean'
ebook_d['shakespeare']['fname'] = 'shakespeare.txt'
ebook_d['shakespeare']['regex'] = r'\s+\d+\s+|ACT\s+.*\.|SCENE\s+.*\.'
ebook_d['shakespeare']['startidx'] = 3
ebook_d['shakespeare']['endex'] = 'FINIS'

for author in authors:
    filepath = os.path.join(ebook_d[author]['dir'],ebook_d[author]['fname'])
    # the with-block closes the file, no explicit close() needed
    with open(filepath, 'r') as f:
        ebook = f.read()

    endex = ebook_d[author]['endex']
    startidx = ebook_d[author]['startidx']
    the_end = [m.start() for m in re.finditer(endex, ebook)][-1]
    ebook = ebook[:the_end]
    parts = re.split(ebook_d[author]['regex'], ebook)[startidx:]

    for i,p in enumerate(parts):
        fname = 'part' + str(i).zfill(4) + '.txt'
        fpath = os.path.join(ebook_d[author]['dir'],fname)
        with open(fpath, 'w') as f:
            f.write(p)
    os.remove(filepath)
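
As a small illustration of the splitting logic, the sketch below applies the Austen settings to a made-up toy string (the text is invented for this example and is not taken from the book). Everything from the last `THE END` onwards is cut off, the remainder is split at chapter headings, and the first piece (the front matter) is skipped via `startidx = 1`.

```python
import re

# toy stand-in for a cleaned ebook (hypothetical content)
toy = ("PREFACE text\n"
       "CHAPTER I\nFirst chapter.\n"
       "Chapter 2\nSecond chapter.\n"
       "THE END\nlicence text")

regex = r'Chapter\s+.*|CHAPTER\s+.*'
endex = 'THE END'
startidx = 1

# keep only the text before the last occurrence of the end marker
the_end = [m.start() for m in re.finditer(endex, toy)][-1]
body = toy[:the_end]

# split at the chapter headings and drop the front matter
parts = re.split(regex, body)[startidx:]
fnames = ['part' + str(i).zfill(4) + '.txt' for i in range(len(parts))]
```

Here `parts` holds one string per chapter and `fnames` the matching zero-padded file names.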

## 3-Train/Test split

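Nothing fancy here: for each dataset we shuffle the files and move roughly 75% of them to a `train` directory and the rest to a `test` directory.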

In [6]:
import shutil

data_dirs = ['data/sklearn_clean/', 'data/scalaz_clean/', 'data/austen_clean/', 'data/shakespeare_clean/']
test_fraction = 0.25

for data_dir in data_dirs:
    files = os.listdir(data_dir)
    train_dir = os.path.join(data_dir, 'train')
    test_dir = os.path.join(data_dir, 'test')

    # randomly shuffle the files
    files = list(np.array(files)[np.random.permutation(len(files))])
    os.makedirs(train_dir)
    os.makedirs(test_dir)

    train_fraction = 1 - test_fraction
    for i, f in enumerate(files):
        file_path = os.path.join(data_dir, f)
        # i is 0-based: the first ceil(len(files) * train_fraction) files go to train
        if len(files) * train_fraction > i:
            shutil.move(file_path, train_dir)
        else:
            shutil.move(file_path, test_dir)
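
The shuffle-and-split idea can also be sketched without touching the file system, which makes it easy to check the resulting proportions. `split_files` and its `seed` argument are hypothetical and only for illustration; the cell above uses `np.random.permutation` and moves the files directly.

```python
import math
import random


def split_files(files, test_fraction=0.25, seed=0):
    """Shuffle a list of file names and split it into (train, test)."""
    files = list(files)
    # a seeded generator makes the split reproducible (an assumption of this sketch)
    random.Random(seed).shuffle(files)
    n_train = math.ceil(len(files) * (1 - test_fraction))
    return files[:n_train], files[n_train:]


names = ['file%d.py' % i for i in range(8)]
train, test = split_files(names)
```

With 8 files and `test_fraction=0.25` this gives 6 training files and 2 test files.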

In [7]:
%%bash
find data/*clean -maxdepth 2  -type d

data/austen_clean
data/austen_clean/test
data/austen_clean/train
data/scalaz_clean
data/scalaz_clean/test
data/scalaz_clean/train
data/shakespeare_clean
data/shakespeare_clean/test
data/shakespeare_clean/train
data/sklearn_clean
data/sklearn_clean/test
data/sklearn_clean/train


## 4-Training Using Keras

Remember, our objective is, given a sequence of characters, to decide whether each character comes from Python code or Scala code (or from Austen's books or Shakespeare's). 

This is where things get interesting: [Nadbor](https://www.linkedin.com/in/nadbor-drozd-12316063/), the author of the original blog post, designed a very neat way to feed the network, using an "infinite" sequence of characters that keeps alternating between the two sources. 

Let's go with the details. Let's start with a series of helpers that will be useful to generate batches:

In [1]:
from random import choice

def chars_from_files(list_of_files):
    """
    open a file from list_of_files and yield the chars
    """
    while True:
        filename = choice(list_of_files)
        with open(filename, 'r') as f:
            chars = f.read()
            for c in chars:
                yield c


def splice_texts(files_a, jump_size_a, files_b, jump_size_b):
    """
    Alternately pick snippets from source A/B whose length is at least
    jump_size_a/b[0] and less than jump_size_a/b[1], and splice them
    
    Params:
    -------
    files_a/b: list of files
    jump_size_a/b: list with two values [min_length, max_length]
    """    
    a_chars = chars_from_files(files_a)
    b_chars = chars_from_files(files_b)
    generators = [a_chars, b_chars]

    a_range = range(jump_size_a[0], jump_size_a[1])
    b_range = range(jump_size_b[0], jump_size_b[1])
    ranges = [a_range, b_range]

    source_ind = choice([0, 1])
    while True:
        jump_size = choice(ranges[source_ind])
        gen = generators[source_ind]
        for _ in range(jump_size):
            yield (next(gen), source_ind)
        source_ind = 1 - source_ind


def generate_batches(files_a, jump_size_a, files_b, jump_size_b, batch_size, sample_len, return_text=False):
    """
    Batch generator for keras: yields batches of batch_size sequences, each sample_len
    characters long, where characters from files_a and files_b are spliced using splice_texts.

    For example, we have n_chars=96, and say that we use batch_size=1024 and sample_len=100. This generator will yield:
    1) X: an array of shape (1024, 100, 96) that is explained as follows:
        1024 -> batch size
        100  -> length of a sequence of characters
        96   -> one hot encoded char (96 different chars)
    2) y: an array of shape (1024, 100, 1) that is explained as follows:
        1024 -> batch size
        100  -> length of a sequence of characters
        1    -> the label of the corresponding character. In this example 0=python, 1=scala
    3) the text sequences of 100 characters (optional)
    """    

    gens = [splice_texts(files_a, jump_size_a, files_b, jump_size_b) for _ in range(batch_size)]
    while True:
        X = []
        y = []
        texts = []
        for g in gens:
            chars = []
            vecs = []
            labels = []
            for _ in range(sample_len):
                c, l = next(g)
                vecs.append(char2vec[c])
                labels.append([l])
                chars.append(c)
            X.append(vecs)
            y.append(labels)

            if return_text:
                texts.append(''.join(chars))

        if return_text:
            yield (np.array(X), np.array(y), texts)
        else:
            yield (np.array(X), np.array(y))
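
To see the splicing idea in isolation, here is a toy version that works on in-memory character streams instead of files. `splice` is a hypothetical, simplified counterpart of `splice_texts`, and the two constant streams (`'a'` and `'B'`) are made up so that the source of every character is obvious.

```python
from itertools import cycle
from random import Random


def splice(stream_a, stream_b, min_jump, max_jump, rng):
    """Yield (char, source_index) pairs, alternating between two streams."""
    streams = [stream_a, stream_b]
    source = rng.choice([0, 1])
    while True:
        # stay on the current source for a random number of characters
        for _ in range(rng.randrange(min_jump, max_jump)):
            yield next(streams[source]), source
        source = 1 - source


rng = Random(0)
gen = splice(cycle('a'), cycle('B'), 2, 5, rng)
sample = [next(gen) for _ in range(20)]
text = ''.join(c for c, _ in sample)
labels = [l for _, l in sample]
```

`text` is a run of `a`s and `B`s of random lengths, and `labels` marks every `a` with 0 and every `B` with 1, which is exactly the per-character target the network is trained on.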

Let's go ahead and check what `generate_batches` produces:

In [2]:
from glob import glob

dir_a = "data/sklearn_clean/"
dir_b = "data/scalaz_clean"
files_a = glob(os.path.join(dir_a, "train/*"))
files_b = glob(os.path.join(dir_b, "train/*"))

# using Nadbor's original settings
min_jump_size_a = 20
max_jump_size_a = 200
min_jump_size_b = 20
max_jump_size_b = 200
juma = [min_jump_size_a, max_jump_size_a]
jumb = [min_jump_size_b, max_jump_size_b]

batch_size = 1024
seq_len = 100

In [3]:
gen = generate_batches(files_a, juma, files_b, jumb, batch_size, seq_len, return_text=True)
X, y, texts = next(gen)

In [4]:
print(X.shape, y.shape)
print(texts[0])

(1024, 100, 96) (1024, 100, 1)
# Author: Brian M. Clapper, G Varoquaux
# License: BSD

import numpy as np

# XXX we should be testi


Ok, time to build the model, and this is where Keras excels: we can build stateful or bidirectional stacked LSTMs in a few lines of code. Let's start with 3 stateful LSTM layers. The code is deliberately over-commented for clarity:

In [5]:
from keras.layers import Dense, Dropout, LSTM, TimeDistributed, Bidirectional
from keras.models import Sequential, load_model

train_a = glob(os.path.join(dir_a, "train/*"))
train_b = glob(os.path.join(dir_b, "train/*"))
val_a = glob(os.path.join(dir_a, "test/*"))
val_b = glob(os.path.join(dir_b, "test/*"))

juma = [min_jump_size_a, max_jump_size_a]
jumb = [min_jump_size_b, max_jump_size_b]

lstm_layers=3
rnn_size=128
epochs=5
steps_per_epoch=100
validation_steps=50
dropout_rate = 0.2
batch_shape = (batch_size, seq_len, n_chars)

model = Sequential()
for _ in range(lstm_layers):
    # Note that we make the RNN stateful just by adding the stateful=True parameter. Stateful
    # means that the last hidden state of sample i after one batch is used as the initial hidden
    # state of sample i in the next batch. This makes perfect sense when using generate_batches(),
    # since the sequences of characters in batch number n+1 follow those of batch number n.
    model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape,stateful=True))
    model.add(Dropout(dropout_rate))

# A TimeDistributed is a "Sequence-wise" operation, and applies a layer across each element of the sequence. 
# For example, in our case, it will apply a 1-neuron dense layer with sigmoid activation to each of the 100
# elements of the sequence, leading to a final output of size (batch_size, seq_len, 1). The job of this layer
# is simply to classify each character of the sequence as python/scala
model.add(TimeDistributed(Dense(units=1, activation='sigmoid')))
model.compile(optimizer='adam', loss='mse', metrics=['accuracy', 'binary_crossentropy'])

train_gen = generate_batches(train_a, juma, train_b, jumb, batch_size, seq_len)
validation_gen = generate_batches(val_a, juma, val_b, jumb, batch_size, seq_len)
model.fit_generator(train_gen,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_gen,
                    validation_steps=validation_steps,
                    epochs=epochs)

Using TensorFlow backend.


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa581d2bb38>
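
The comment on `stateful=True` relies on one property of `generate_batches`: row `i` of every batch is read from the same endless stream, so row `i` of batch `n+1` continues exactly where row `i` of batch `n` stopped. The toy generator below (hypothetical, with integer counters standing in for characters) shows that layout.

```python
from itertools import count


def toy_batches(batch_size, sample_len):
    """One independent endless stream per batch row, as in generate_batches."""
    streams = [count(start=1000 * row) for row in range(batch_size)]
    while True:
        yield [[next(s) for _ in range(sample_len)] for s in streams]


gen = toy_batches(batch_size=2, sample_len=3)
first = next(gen)
second = next(gen)
```

Row 0 of `second` picks up right after row 0 of `first`, which is why carrying the hidden state across batches is meaningful here.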

Not bad for 100 steps per epoch and 5 epochs. At this stage Nadbor realised that some of the misclassifications arose from the fact that the RNN *"can only interpret a character in the context of characters that came before"*. A `Bidirectional` LSTM, which reads each sequence from both ends, can potentially solve this issue.

When using bidirectional LSTMs it does not make much sense to set `stateful=True`. The backward layer reads each sequence starting from its end, so the state it reaches after one batch is not a natural starting point for the next batch.

As you can imagine, coding 3 `Bidirectional` LSTM layers in Keras is as simple as:

In [6]:
model = Sequential()
for _ in range(lstm_layers):
    model.add(Bidirectional(LSTM(rnn_size, return_sequences=True),batch_input_shape=batch_shape))
    model.add(Dropout(dropout_rate))

model.add(TimeDistributed(Dense(units=1, activation='sigmoid')))
model.compile(optimizer='adam', loss='mse', metrics=['accuracy', 'binary_crossentropy'])

train_gen = generate_batches(train_a, juma, train_b, jumb, batch_size, seq_len)
validation_gen = generate_batches(val_a, juma, val_b, jumb, batch_size, seq_len)

# this time let's save the model
from keras.callbacks import ModelCheckpoint
checkpointer = ModelCheckpoint("models/model_keras")

model.fit_generator(train_gen,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_gen,
                    validation_steps=validation_steps,
                    epochs=epochs,
                    callbacks=[checkpointer])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa5749755f8>

Indeed, bidirectional LSTMs perform better (in his original post Nadbor trains for 1000 steps and more epochs, so the results are nearly perfect).

Model trained! Now what? Let's visualize the results (again, all credit to Nadbor for the following code).

If you wanted to run this from the terminal:

```
python apply_tagger_keras.py models/model_keras output/sklearn_or_scala_preds_keras data/sklearn_clean/ data/scalaz_clean
python plot_predictions.py output/sklearn_or_scala_preds_keras output/sklearn_or_scala_preds_keras_html
```

The following generator first yields the "features" and "target" for n steps, and then yields the corresponding texts together with all the labels.

In [54]:
def get_batches_and_text(files_a, jump_size_a, files_b, jump_size_b, batch_size, sample_len, n):
    """first yields n batches, then yields a list of texts + all the labels"""
    gen = generate_batches(files_a, jump_size_a, files_b, jump_size_b, batch_size, sample_len, True)
    texts = []
    labels = []
    for i in range(n):
        X, y, txt = next(gen)
        texts.append(txt)
        labels.append(y.reshape((batch_size, sample_len)))
        yield (X, y)
    yield ["".join(parts) for parts in zip(*texts)], np.hstack(labels)
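
The last line of the generator stitches the per-batch chunks back into one long text per batch row. A small sketch with made-up chunks (two rows, three batches) shows what `zip(*texts)` does there.

```python
# texts[k][i] is the chunk of row i in batch k (toy data for illustration)
texts = [
    ["def ", "val "],   # batch 0
    ["foo(", "x = "],   # batch 1
    ["):\n", "1\n"],    # batch 2
]

# zip(*texts) groups the chunks by row, so joining them rebuilds each row's full text
full_texts = ["".join(parts) for parts in zip(*texts)]
```

`np.hstack(labels)` does the same for the labels: it concatenates the `(batch_size, sample_len)` arrays along the second axis, so each row ends up with one label per character of its full text.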

For more information on the `max_queue_size` parameter used in the next cell, have a look [here](https://keunwoochoi.wordpress.com/2017/08/24/tip-fit_generator-in-keras-how-to-parallelise-correctly/).

In [96]:
from joblib import dump

output_preds = "output/sklearn_or_scala_preds_keras"

from keras.models import load_model
model_path = "models/model_keras"
model = load_model(model_path)

fa = glob(os.path.join(dir_a, "test/*"))
fb = glob(os.path.join(dir_b, "test/*"))
steps = 50

# for 50 steps it will return X,y for the model to predict.
# Note that max_queue_size needs to be greater than 0. We set it to 1,
# which means that the first batch will be "ignored" for prediction
gen = get_batches_and_text(fa, juma, fb, jumb, batch_size, seq_len, steps+1)
predictions = model.predict_generator(gen, steps=steps, max_queue_size=1)

# then it will return text and labels for plotting purposes
texts, labels = next(gen)

# # we drop the first sequence as is ignored for prediction
# texts = []
# for text in tmp_texts:
#     texts.append(text[seq_len:])
# labels = tmp_labels[:, seq_len:]

os.makedirs(output_preds, exist_ok=True)
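# build the (batch_size, steps * seq_len) prediction matrix once, outside the loop:
# row j collects the predictions of batch row j across all steps
preds = np.vstack([predictions[j::batch_size, :].ravel() for j in range(batch_size)])
for i in range(batch_size):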
    path = os.path.join(output_preds, 'part_' + str(i).zfill(5) + ".joblib")
    dump((texts[i], preds[i], labels[i]), path)
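
The slicing `predictions[j::batch_size, :]` deserves a closer look. Assuming the predictions of consecutive batches are concatenated along the first axis (so batch `k` occupies rows `k * batch_size` to `(k + 1) * batch_size - 1`), every `batch_size`-th row belongs to the same batch row. The sketch below checks this on a tiny fake array and compares it with an equivalent reshape/transpose; the sizes are made up for illustration.

```python
import numpy as np

steps, batch_size, seq_len = 3, 2, 4

# fake predictions with shape (steps * batch_size, seq_len, 1)
predictions = np.arange(steps * batch_size * seq_len, dtype=float)
predictions = predictions.reshape(steps * batch_size, seq_len, 1)

# slicing approach: one long row of predictions per batch row
sliced = np.vstack([predictions[j::batch_size, :].ravel() for j in range(batch_size)])

# equivalent formulation with reshape and transpose
reshaped = (predictions.reshape(steps, batch_size, seq_len)
            .transpose(1, 0, 2)
            .reshape(batch_size, steps * seq_len))
```

Both arrays have shape `(batch_size, steps * seq_len)` and hold the same values, so either form can be used.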

And now let's render the predictions as pretty HTML files:

In [97]:
import matplotlib
import matplotlib.pyplot as plt

from joblib import load
%matplotlib inline

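from html import escape

def prediction_to_html(text, predictions, labels, cmap='Reds'):
    # plt.get_cmap is used because matplotlib.cm.get_cmap is deprecated in recent matplotlib versions
    cmap = plt.get_cmap(cmap)
    html_chars = []
    for c, p, l in zip(text, predictions, labels):
        if c == '\n':
            html_chars.append('<br>')
        else:
            # the background colour encodes the prediction, the font encodes the true label
            r, g, b, a = cmap(p)
            # scale to 0-255 (256 * 1.0 would give an out-of-range value)
            r, g, b = int(255*r), int(255*g), int(255*b)
            # escape '<', '>' and '&' so that source code is not parsed as HTML
            c = escape(c)
            if l:
                c = '<font face="Times New Roman" \nsize="5">%s</font>' % c
            else:
                c = '<font face="monospace" \nsize="3">%s</font>' % c
            html_chars.append('<span style="background-color:rgb(%s, %s, %s); color:black;">%s</span>' % (r, g, b, c))
    tot_html = "".join(html_chars)
    return tot_html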

output_html = "output/sklearn_or_scala_preds_keras_html"

os.makedirs(output_html, exist_ok=True)
# read the predictions saved in the previous cell
files = glob(os.path.join(output_preds, "*"))
# an arbitrary range of files
for i, f in enumerate(files[100:110]): 
    text, prediction, labels = load(f)
    html = prediction_to_html(text, prediction, labels)
    out_path = os.path.join(output_html, 'part-' + str(i).zfill(5) + ".html")
    with open(out_path, "w") as out:
        out.write(html)
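
For readers without matplotlib at hand, the colouring idea can be sketched with the standard library only. `white_to_red` is a hypothetical linear colour ramp that merely stands in for the `'Reds'` colormap, and `to_html` is a stripped-down counterpart of `prediction_to_html` that ignores the labels.

```python
from html import escape


def white_to_red(p):
    """Map a probability in [0, 1] to an RGB triple from white (0) to red (1)."""
    p = min(max(float(p), 0.0), 1.0)
    fade = int(round(255 * (1 - p)))
    return 255, fade, fade


def to_html(text, predictions):
    """Wrap every character in a span coloured by its prediction."""
    pieces = []
    for c, p in zip(text, predictions):
        if c == '\n':
            pieces.append('<br>')
        else:
            r, g, b = white_to_red(p)
            # escape() keeps characters such as '<' from being parsed as HTML tags
            pieces.append('<span style="background-color:rgb(%d, %d, %d);">%s</span>'
                          % (r, g, b, escape(c)))
    return ''.join(pieces)


snippet = to_html("a<b\n", [0.0, 1.0, 0.5, 0.0])
```

The resulting string can be written to a file and opened in a browser, or shown in a notebook with `IPython.display.HTML`.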

In [98]:
from IPython.display import display, HTML
display(HTML(filename="output/sklearn_or_scala_preds_keras_html/part-00003.html"))

Looks very good!