[View in Colaboratory](https://colab.research.google.com/github/hamelsmu/kdd-2018-hands-on-tutorials/blob/master/Feature%20Extraction%20and%20Summarization%20with%20Sequence%20to%20Sequence%20Learning.ipynb)

# Setup Notebook

Install [ktext](https://github.com/hamelsmu/ktext) and [annoy](https://github.com/spotify/annoy).

In [0]:
!pip install -q ktext
!pip install -q annoy

In [6]:
import json
from urllib.request import urlopen

from annoy import AnnoyIndex
from keras import optimizers
from keras.layers import Input, Dense, LSTM, GRU, Embedding, Lambda, BatchNormalization
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from ktext.preprocess import processor
import numpy as np
import pandas as pd
import random
from tqdm import tqdm

Using TensorFlow backend.


# Data sets

## [English to French](http://www.manythings.org/anki/)

In [0]:
# !wget http://www.manythings.org/anki/fra-eng.zip
# !unzip -o fra-eng.zip

In [0]:
# with open('fra.txt', 'r') as f:
#     lines = f.readlines()
# target_docs, source_docs = zip(*[line.strip().split('\t') for line in lines])
# target_docs = list(set(target_docs))

## [CoNaLa](https://conala-corpus.github.io/)

In [7]:
!wget http://www.phontron.com/download/conala-corpus-v1.1.zip
!unzip -o conala-corpus-v1.1.zip

--2018-08-08 03:23:04--  http://www.phontron.com/download/conala-corpus-v1.1.zip
Resolving www.phontron.com (www.phontron.com)... 208.113.196.149
Connecting to www.phontron.com (www.phontron.com)|208.113.196.149|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52105440 (50M) [application/zip]
Saving to: ‘conala-corpus-v1.1.zip’


2018-08-08 03:23:05 (76.1 MB/s) - ‘conala-corpus-v1.1.zip’ saved [52105440/52105440]

Archive:  conala-corpus-v1.1.zip
   creating: conala-corpus/
  inflating: conala-corpus/conala-mined.jsonl  
  inflating: conala-corpus/conala-train.json  
  inflating: conala-corpus/conala-test.json  


In [0]:
with open('conala-corpus/conala-mined.jsonl', 'r') as f:
    lines = [json.loads(line) for line in f.readlines()]
source_docs = [line['snippet'] for line in lines]
target_docs = [line['intent'] for line in lines]

In [0]:
with open('conala-corpus/conala-train.json', 'r') as f:
    lines = json.load(f)
train_source_docs = [line['snippet'] for line in lines]
train_target_docs = [line['intent'] for line in lines]
test_docs = [line['rewritten_intent'] for line in lines if line['rewritten_intent']]

In [0]:
with open('conala-corpus/conala-test.json', 'r') as f:
    lines = json.load(f)
test_source_docs = [line['snippet'] for line in lines]
test_target_docs = [line['intent'] for line in lines]

## Other Data Sources (For Later Use)

### GitHub issues data

In [0]:
# issues = pd.read_csv('https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip')
# source_docs = list(issues.body)
# target_docs = list(issues.issue_title)

### Python functions data

In [0]:
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.function')
# source_docs = [line.decode('utf-8') for line in f.readlines()]
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.docstring')
# target_docs = [line.decode('utf-8') for line in f.readlines()]

## Use subset of the data

We will use only the first 100,000 examples in the interest of brevity.  However, we can use the full dataset in a subsequent pass if desired.

In [0]:
source_docs = source_docs[:100000]
target_docs = target_docs[:100000]

# Language Model

## Preprocessing
Tokenize, generate vocabulary, apply padding and vectorize.

Let's inspect the raw text of the target docs.

In [12]:
target_docs[:10]

['Sort a nested list by two elements',
 'converting integer to list in python',
 'Converting byte string in unicode string',
 'List of arguments with argparse',
 'How to convert a Date string to a DateTime object?',
 'How to efficiently convert Matlab engine arrays to numpy ndarray?',
 'Converting html to text with Python',
 'regex for repeating words in a string in Python',
 'Ordering a list of dictionaries in python',
 'Two Combination Lists from One List']

To pre-process this text for deep learning, we need to convert it into integer values.  We will use the `ktext` package to do this.

In [13]:
proc = processor(hueristic_pct_padding=.7, keep_n=20000)
vecs = proc.fit_transform(target_docs)

 See full histogram by insepecting the `document_length_stats` attribute.


Below is an example where tokens are mapped to integers

In [14]:
print('original list: ', target_docs[0].split())
print('tokenized list: ', [proc.token2id[x] for x in target_docs[0].lower().split()])

original list:  ['Sort', 'a', 'nested', 'list', 'by', 'two', 'elements']
tokenized list:  [118, 2, 151, 10, 43, 38, 56]


We can see the most common words by calling the `token_count_pandas()` method.

In [15]:
proc.token_count_pandas().head(20)

| token | count |
|---|---|
| a | 52824 |
| python | 48171 |
| in | 47702 |
| to | 47281 |
| how | 36116 |
| of | 22831 |
| with | 15954 |
| the | 13505 |
| list | 12651 |
| from | 11626 |


Furthermore, the documents in our corpus have different lengths.  By setting `hueristic_pct_padding=.7`, `ktext` will truncate and pad all sequences to the 70th percentile length.  However, it can be useful to sanity check a histogram of lengths.  We inspect the `document_length_stats` property below which displays a histogram of document lengths. 

In [16]:
proc.document_length_stats

|   | bin | doc_count | cumsum_pct |
|---|---|---|---|
| 6 | 0 | 31 | 0.00031 |
| 0 | 5 | 34978 | 0.35009 |
| 1 | 10 | 50700 | 0.85709 |
| 2 | 15 | 12664 | 0.98373 |
| 3 | 20 | 1486 | 0.99859 |
| 5 | 25 | 124 | 0.99983 |
| 4 | 30 | 17 | 1.0 |
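
The percentile-based truncation and padding described above can be sketched with NumPy. This is only an illustration under simple assumptions (left padding with 0, truncation at the end, `np.percentile` for the cutoff); `pad_to_percentile` is a hypothetical helper, not part of `ktext`, whose exact binning rules may differ.

```python
import numpy as np

def pad_to_percentile(token_ids, pct=70, pad_id=0):
    """Left-pad or truncate integer sequences to the pct-th percentile length."""
    lengths = [len(seq) for seq in token_ids]
    max_len = int(np.percentile(lengths, pct))
    out = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(token_ids):
        kept = seq[:max_len]                        # truncate long documents
        if len(kept):
            out[row, max_len - len(kept):] = kept   # pad short ones on the left
    return out, max_len

# Toy documents that are already mapped to integer ids (made-up values).
docs = [[1, 2, 3], [4], [5, 6, 7, 8, 9], [1, 2]]
padded, max_len = pad_to_percentile(docs, pct=70)
print(max_len)
print(padded)
```

Choosing a percentile rather than the maximum length keeps a few very long documents from inflating the size of every input.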


It is useful to keep track of the maximum document length and the number of unique tokens in the corpus, since both are needed later to define the model.

In [17]:
vocab_size = max(proc.id2token.keys()) + 1
max_length = proc.padding_maxlen

print('vocab size: ', vocab_size)
print('max length allowed for documents: ', max_length)

vocab size:  10225
max length allowed for documents:  10


## Language model

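Prepare training data for the language model.  Each document is expanded into all of its prefixes: the last token of a prefix is the label, and the tokens before it are the input.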

In [18]:
sequences = []
for arr in tqdm(vecs):
    non_zero = (arr != 0).argmax()
    for i in range(non_zero, len(arr)):
        sequences.append(arr[:i+1])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
# y = to_categorical(y, num_classes=vocab_size)

100%|██████████| 100000/100000 [00:00<00:00, 125350.72it/s]
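
To make the shape of this training data concrete, here is a small sketch of the same prefix expansion on a single toy vector. `prefix_examples` is a hypothetical helper written with NumPy only, and the ids are made up; it assumes, as in the cell above, that 0 is the padding id and that vectors are padded on the left.

```python
import numpy as np

def prefix_examples(vec, max_length):
    """Expand one left-padded id vector into (context, next_token) pairs."""
    first = int((vec != 0).argmax())           # position of the first real token
    rows = []
    for end in range(first, len(vec)):
        prefix = vec[:end + 1][-max_length:]   # tokens seen so far
        padded = np.zeros(max_length, dtype=vec.dtype)
        padded[max_length - len(prefix):] = prefix
        rows.append(padded)
    rows = np.stack(rows)
    return rows[:, :-1], rows[:, -1]           # inputs, labels

vec = np.array([0, 0, 5, 6, 7])
X_toy, y_toy = prefix_examples(vec, max_length=5)
print(X_toy)
print(y_toy)
```

Each real token in the document becomes one training label, which is why 100,000 documents turn into several hundred thousand training samples.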


In [19]:
i = Input(shape=(max_length-1,))
x = Embedding(vocab_size, 256, input_length=max_length-1)(i)
x = LSTM(256, return_sequences=True)(x)
last_timestep = Lambda(lambda x: x[:, -1, :])(x)
last_timestep = Dense(vocab_size, activation='softmax')(last_timestep)
model = Model(i, last_timestep)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 9)                 0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 9, 256)            2617600   
_________________________________________________________________
lstm_1 (LSTM)                (None, 9, 256)            525312    
_________________________________________________________________
lambda_1 (Lambda)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10225)             2627825   
Total params: 5,770,737
Trainable params: 5,770,737
Non-trainable params: 0
_________________________________________________________________


## Training

Now that we have created our architecture, we can train our model.  

**This step takes approximately 20 minutes.  This is a good time to take a bathroom break!**

In [20]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X, y, epochs=10, batch_size=50, validation_split=0.1)

Train on 734247 samples, validate on 81584 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10

## Generate sequence

The goal of a language model is to predict the next word in a sequence.  To sanity check the language model, we will see what kind of sentence is generated when we start with a seed word of 'is'.  We are looking to see if the generated sentence appears to be sampled from the distribution of the data.

In other words, does the generated sentence look like it was written by the same author(s), about the same domain, as the training corpus?

In [0]:
def generate_seq(model, proc, n_words, seed_text):
    in_text = seed_text
    for _ in range(n_words):
        vec = proc.transform([in_text])[:,1:]
        index = np.argmax(model.predict(vec, verbose=0), axis=1)[0]
        out_word = ''
        if index == 1:
            out_word = '_unk_'
        else:
            out_word = proc.id2token[index]
        in_text += ' ' + out_word
    return in_text

See what sentence is generated from the language model when seeded with the word `is`.

In [32]:
generate_seq(model, proc, max_length, 'is')

'is there a way to use phantomjs in python to print'

## Generate sentence embeddings

One of the goals of training the language model is learning representations of sentences in our corpus.  For example, we can extract values from intermediate layers of this language model, and use those as sentence embeddings.

In [0]:
embedding_model = Model(inputs=model.inputs, outputs=model.layers[-3].output)

The code below extracts the hidden states of the LSTM layer for a given input.  There is one hidden state for each position (time step) in the padded sentence.

In [34]:
# random.choice avoids the off-by-one IndexError that random.randint(0, len(...)) can raise
input_sequence = random.choice(test_docs)
print('input sequence: ', input_sequence, '\n\nhidden states:\n')
vec = proc.transform([input_sequence])[:,1:]
embedding_model.predict(vec)


input sequence:  for a dictionary `a`, set default value for key `somekey` as list and append value `bob`  in that key 

hidden states:



array([[[ 0.        , -0.        , -0.        , ..., -0.        ,
          0.6728434 , -0.        ],
        [-0.        , -0.        , -0.68634623, ..., -0.00608053,
          0.04332689,  0.        ],
        [-0.02194916, -0.        , -0.        , ..., -0.        ,
          0.        ,  0.        ],
        ...,
        [ 0.        ,  0.        , -0.        , ...,  0.        ,
         -0.        , -0.        ],
        [-0.7565876 ,  0.27250797, -0.97665966, ..., -0.16200553,
         -0.8420539 , -0.        ],
        [-0.761593  ,  0.40664664, -0.        , ..., -0.04313043,
         -0.02480136, -0.07966985]]], dtype=float32)

Let's extract the hidden states for all the sentences in `test_docs`.

In [0]:
vecs = proc.transform(test_docs)

In [0]:
hidden_states = embedding_model.predict(vecs[:, 1:])

To create a sentence embedding, we need to summarize the hidden states (there is one for each term).  A simple approach is to use aggregate statistics like the mean, max, or sum of all the hidden states.  There are other approaches that are outside the scope of this notebook, but that we will discuss.

In [0]:
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
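
The following sketch shows what these aggregations do to the array shapes, using a tiny hand-made array in place of real hidden states. The optional masked mean is an addition that this notebook does not use; it assumes id 0 marks padding, and `pool_hidden_states` is a hypothetical helper.

```python
import numpy as np

def pool_hidden_states(hidden_states, token_ids=None):
    """Collapse (batch, timesteps, units) hidden states into (batch, units) vectors."""
    pooled = {
        'mean': hidden_states.mean(axis=1),
        'max': hidden_states.max(axis=1),
        'sum': hidden_states.sum(axis=1),
    }
    if token_ids is not None:
        # Ignore padded positions (assumed to have id 0) when averaging.
        mask = (token_ids != 0)[..., None].astype(hidden_states.dtype)
        counts = np.maximum(mask.sum(axis=1), 1.0)
        pooled['masked_mean'] = (hidden_states * mask).sum(axis=1) / counts
    return pooled

# One sentence, three time steps, two hidden units; the first step is padding.
toy_states = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])
toy_ids = np.array([[0, 7, 8]])
pooled = pool_hidden_states(toy_states, toy_ids)
for name, value in pooled.items():
    print(name, value.shape, value)
```

Whichever statistic is used, the time axis disappears and every sentence ends up as a single vector with one entry per hidden unit.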

## Application - Nearest Neighbor Search

Now that we have a way to represent each sentence as a vector, we can use this representation on many kinds of downstream tasks.  One such task is finding a similar sentence to any given sentence. 

### Build vector indices

We will first place all the vectorized sentences in a special data structure that allows for fast nearest neighbor lookups.  We will use [annoy](https://github.com/spotify/annoy) for this purpose.

In [38]:
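dimension = hidden_states.shape[-1]
# 'angular' has been annoy's default metric; newer releases warn if it is not passed explicitly
index = AnnoyIndex(dimension, 'angular')
for i, v in enumerate(sum_vecs):
    index.add_item(i, v)
index.build(10)  # number of trees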

True

### Search nearest neighbors

In [40]:
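# random.choice avoids the off-by-one IndexError that random.randint(0, len(...)) can raise
input_sequence = random.choice(test_docs)
print('Query: ', input_sequence)

vec = proc.transform([input_sequence])[:,1:]
vec = np.sum(embedding_model.predict(vec), axis=1)
# vec has shape (1, dimension); pass the single row as a flat vector
ids, _ = index.get_nns_by_vector(vec[0], 10, include_distances=True)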

print('\n\nSearch Results:\n')
[test_docs[i] for i in ids]

Query:  sort a list of lists `list_to_sort` by indices 2,0,1 of the inner list


Search Results:



['sort a list of lists `list_to_sort` by indices 2,0,1 of the inner list',
 "convert a list of lists `list_of_lists` into a list of strings keeping empty sub-lists as empty string ''",
 'sort a list of dictionaries `list_of_dct` by values in an order `order`',
 "ordering a list of dictionaries `mylist` by elements 'weight' and 'factor'",
 'make a list of lists in which each list `g` are the elements from list `test` which have the same characters up to the first `_` character',
 'sort a list of lists `L` by index 2 of the inner list',
 'sort a list of lists `l` by index 2 of the inner list',
 "order a list of lists `[[1, 'mike'], [1, 'bob']]` by the first value of individual list",
 'merge a list of dictionaries in list `L` into a single dict',
 'split a list of tuples `data` into sub-lists of the same tuple field using itertools']
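
For intuition about what the index returns, here is an exact (brute-force) version of the same lookup in NumPy. It ranks by cosine similarity, which gives the same ordering as annoy's angular distance; annoy returns an approximation of this ranking but scales to far larger collections. `nearest_neighbors` is a hypothetical helper and the vectors are toy values.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=10):
    """Exact top-k search by cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                       # cosine similarity to every stored vector
    order = np.argsort(-sims)[:k]      # most similar first
    return order.tolist(), sims[order].tolist()

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_ids, toy_sims = nearest_neighbors(np.array([1.0, 0.1]), toy_vectors, k=2)
print(toy_ids, toy_sims)
```

A brute-force scan like this is fine for a few thousand sentences; an approximate index pays off when the collection is too large to compare against every vector on each query.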

# Sequence to Sequence Model

Let's remind ourselves what the data looks like.  The `source` is a snippet of code and the `target` is a description of that code.  The goal is to train a model that can generate a description given a snippet of code.

A [sequence to sequence model](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8) can be used to tackle this problem, because the input to the model is a sequence of tokens (code) and the output we want to predict is also a sequence of tokens (a description of the code).


In [56]:
print('source (code input):         ', source_docs[2])
print('target (description output): ', target_docs[2])

source (code input):          c.decode('unicode_escape')
target (description output):  Converting byte string in unicode string


## Preprocessing

Similar to previous exercises, we must pre-process the raw strings into a format that can be utilized by our model.  One such format maps each word in our corpus to a unique integer value; we refer to this mapping as a vocabulary.  If the source and target are from the same distribution (which they are not in this example), the vocabulary can be shared.


Concretely, we will tokenize, generate vocabulary, apply padding and vectorize.  These steps are as follows:

**1. Tokenize:**  process of parsing strings into discrete words or tokens.

**2. Generate Vocabulary:**  assign each token to a unique integer; rarely occurring tokens may be assigned to the same integer.

**3. Padding:**  we standardize the sequence length by truncating and padding each example to the same length.

The `ktext` package helps us accomplish these steps.
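
For intuition, the first two steps can be sketched in plain Python. This is not how `ktext` is implemented; `build_vocab` and `vectorize` are hypothetical helpers, tokenization is plain whitespace splitting, and the ids follow the convention seen elsewhere in this notebook (0 for padding, 1 for unknown or rare tokens).

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # assumed convention: 0 = padding, 1 = unknown/rare token

def build_vocab(docs, keep_n):
    """Map the keep_n most frequent lower-cased tokens to ids starting at 2."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}

def vectorize(doc, token2id):
    """Tokenize on whitespace and replace out-of-vocabulary tokens with UNK_ID."""
    return [token2id.get(tok, UNK_ID) for tok in doc.lower().split()]

toy_docs = ['sort a list', 'sort a dict', 'a sort', 'a']
toy_vocab = build_vocab(toy_docs, keep_n=2)
toy_ids = vectorize('Sort a dict', toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Capping the vocabulary with `keep_n` is what sends rare tokens to the shared unknown id; the resulting id lists are then padded or truncated to a common length.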


In [41]:
source_proc = processor(hueristic_pct_padding=.7, keep_n=20000)
source_vecs = source_proc.fit_transform(source_docs)

target_proc = processor(append_indicators=True, hueristic_pct_padding=.7, keep_n=14000, padding='post')
target_vecs = target_proc.fit_transform(target_docs)

 See full histogram by insepecting the `document_length_stats` attribute.
 See full histogram by insepecting the `document_length_stats` attribute.


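We will use teacher forcing for the decoder of the sequence to sequence model: at each step the decoder receives the true previous token as input and is trained to predict the next one.  We therefore offset the target sequence by one.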

In [0]:
encoder_input_data = source_vecs
encoder_seq_len = encoder_input_data.shape[1]

decoder_input_data = target_vecs[:, :-1]
decoder_target_data = target_vecs[:, 1:]

num_encoder_tokens = max(source_proc.id2token.keys()) + 1
num_decoder_tokens = max(target_proc.id2token.keys()) + 1
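
The one-step offset is easiest to see on a single toy target sequence. The ids below are made up for illustration (2 for `_start_`, 3 for `_end_`, 0 for post-padding) and are not the ids that `target_proc` actually assigns.

```python
import numpy as np

# Toy target: _start_, three word ids, _end_, then one padding position.
toy_target = np.array([[2, 5, 6, 7, 3, 0]])

toy_decoder_input = toy_target[:, :-1]    # what the decoder sees at each step
toy_decoder_target = toy_target[:, 1:]    # what it should predict at that step

for seen, predict in zip(toy_decoder_input[0], toy_decoder_target[0]):
    print(f'input {seen} -> target {predict}')
```

Because the input at each step is the ground-truth previous token rather than the model's own prediction, training does not compound early mistakes.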

## Encoder model

The role of the encoder is to extract features and generate a representation of the input sequence, which in this case is a snippet of code. 

In [0]:
word_emb_dim = 512
hidden_state_dim = 1024
# encoder_seq_len, num_encoder_tokens and num_decoder_tokens were computed in the previous cell

encoder_inputs = Input(shape=(encoder_seq_len,), name='Encoder-Input')
x = Embedding(num_encoder_tokens, word_emb_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = GRU(hidden_state_dim, return_state=True, name='Encoder-Last-GRU', dropout=.5)(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)

## Decoder model

The role of the decoder is to generate a description of the code conditioned on the features extracted by the encoder.

In [0]:
decoder_inputs = Input(shape=(None,), name='Decoder-Input')
dec_emb = Embedding(num_decoder_tokens, word_emb_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)
decoder_gru = GRU(hidden_state_dim, return_state=True, return_sequences=True, name='Decoder-GRU', dropout=.5)
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
decoder_outputs = decoder_dense(x)

## Sequence to sequence model

We can connect the encoder and decoder together to create the sequence to sequence model.

In [0]:
seq2seq_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

## Training

The hyperparameters below were found through some trial and error.

**This should take approximately 35 minutes to train.**

In [0]:
batch_size = 1024
epochs = 16

seq2seq_model.compile(optimizer=optimizers.Nadam(lr=0.00005), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = seq2seq_model.fit([encoder_input_data, decoder_input_data],
                            np.expand_dims(decoder_target_data, -1),
                            batch_size=batch_size,
                            epochs=epochs,
                            validation_split=0.1)

Train on 90000 samples, validate on 10000 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16

## Extract encoder and decoder models

For inference, we will not have teacher forcing for the decoder.  Therefore, we must re-assemble our model so that we can feed in one prediction at a time.

In [0]:
def extract_decoder_model(model):
    latent_dim = model.get_layer('Encoder-Model').output_shape[-1]
    decoder_inputs = model.get_layer('Decoder-Input').input
    dec_emb = model.get_layer('Decoder-Word-Embedding')(decoder_inputs)
    dec_bn = model.get_layer('Decoder-Batchnorm-1')(dec_emb)
    gru_inference_state_input = Input(shape=(latent_dim,), name='hidden_state_input')
    gru_out, gru_state_out = model.get_layer('Decoder-GRU')([dec_bn, gru_inference_state_input])
    dec_bn2 = model.get_layer('Decoder-Batchnorm-2')(gru_out)
    dense_out = model.get_layer('Final-Output-Dense')(dec_bn2)
    decoder_model = Model([decoder_inputs, gru_inference_state_input], [dense_out, gru_state_out])
    return decoder_model

One side effect of training a sequence-to-sequence model in this way is that the encoder can be re-used as a general-purpose feature extractor.  We extract the encoder below for use in a later exercise.

In [0]:
encoder_model = seq2seq_model.get_layer('Encoder-Model')
for layer in encoder_model.layers:
    layer.trainable = False

decoder_model = extract_decoder_model(seq2seq_model)
decoder_model.summary()

## Predict function description using trained sequence-to-sequence model

In [0]:
# randrange excludes the upper bound, so the index is always valid
i = random.randrange(len(test_source_docs))

max_len = target_proc.padding_maxlen
raw_input_text = test_source_docs[i]

raw_tokenized = source_proc.transform([raw_input_text])
encoding = encoder_model.predict(raw_tokenized)
original_encoding = encoding
state_value = np.array(target_proc.token2id['_start_']).reshape(1, 1)

decoded_sentence = []
stop_condition = False
while not stop_condition:
    preds, st = decoder_model.predict([state_value, encoding])
    pred_idx = np.argmax(preds[:, :, 2:]) + 2
    pred_word_str = target_proc.id2token[pred_idx]

    if pred_word_str == '_end_' or len(decoded_sentence) >= max_len:
        stop_condition = True
        break
    decoded_sentence.append(pred_word_str)

    # update the decoder for the next word
    encoding = st
    state_value = np.array(pred_idx).reshape(1, 1)

print(raw_input_text)
print(test_target_docs[i])
print(' '.join(decoded_sentence))
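
Stripped of the Keras details, the loop above is a greedy decoder: predict a token, feed it back in together with the updated state, and stop at the end marker or a length limit. The sketch below shows that control flow with a hypothetical `toy_step` function standing in for `decoder_model.predict`; the token ids are made up.

```python
import numpy as np

def greedy_decode(step_fn, state, start_id, end_id, max_len):
    """Generic greedy loop: feed the previous prediction back in at each step."""
    token, decoded = start_id, []
    while len(decoded) < max_len:
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))   # pick the most likely next token
        if token == end_id:
            break
        decoded.append(token)
    return decoded

def toy_step(token, state):
    """Hypothetical stand-in for a decoder: always predicts the next id."""
    probs = np.zeros(6)
    probs[(token + 1) % 6] = 1.0
    return probs, state + 1

toy_decoded = greedy_decode(toy_step, state=0, start_id=2, end_id=5, max_len=10)
print(toy_decoded)
```

Greedy decoding keeps only the single best token at each step; it is simple and fast, at the cost of sometimes missing a better overall sequence.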

## Generate Embeddings

In [0]:
train_source_emb = encoder_model.predict(source_proc.transform(train_source_docs))

In [0]:
vecs = proc.transform(train_target_docs)
hidden_states = embedding_model.predict(vecs[:, 1:])
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
train_target_emb = sum_vecs

In [0]:
print(train_source_emb.shape)
print(train_target_emb.shape)

# Construct a Joint Vector Space

Useful when you have different modalities, such as code and natural language in this example.

In [0]:
inp = Input(shape=(train_source_emb.shape[1],))
x = Dense(train_target_emb.shape[1], use_bias=False)(inp)
# x = BatchNormalization()(x)
# x = Dense(512)(x)
modal_model = Model([inp], x)
modal_model.summary()

In [0]:
modal_model.compile(optimizer=optimizers.Nadam(lr=0.002), loss='cosine_proximity', metrics=['accuracy'])

batch_size = 1024
epochs = 10
history = modal_model.fit([train_source_emb], train_target_emb,
                          batch_size=batch_size, epochs=epochs, validation_split=0.1)
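
The loss used here rewards predicted vectors that point in the same direction as their targets, regardless of length. As far as we understand the Keras 2.x implementation, `cosine_proximity` is the negative cosine similarity, so lower is better and -1 is a perfect match. The NumPy function below is a sketch of that quantity on toy vectors, not the Keras code itself.

```python
import numpy as np

def cosine_proximity(y_true, y_pred, eps=1e-12):
    """Negative mean cosine similarity between matching rows of y_true and y_pred."""
    t = y_true / (np.linalg.norm(y_true, axis=1, keepdims=True) + eps)
    p = y_pred / (np.linalg.norm(y_pred, axis=1, keepdims=True) + eps)
    return float(-np.mean(np.sum(t * p, axis=1)))

toy_true = np.array([[3.0, 4.0]])
print(cosine_proximity(toy_true, 2 * toy_true))             # same direction
print(cosine_proximity(toy_true, np.array([[-4.0, 3.0]])))  # orthogonal
print(cosine_proximity(toy_true, -toy_true))                # opposite direction
```

A direction-based loss suits this setup because the nearest neighbor index built from these embeddings also compares vectors by angle rather than by magnitude.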

## Applications
### Use test data

In [0]:
test_source_emb = encoder_model.predict(source_proc.transform(test_source_docs))

In [0]:
vecs = proc.transform(test_target_docs)
hidden_states = embedding_model.predict(vecs[:, 1:])
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
test_target_emb = sum_vecs

In [0]:
print(test_source_emb.shape)
print(test_target_emb.shape)

### Build vector indices

In [0]:
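dimension = hidden_states.shape[-1]
# 'angular' has been annoy's default metric; newer releases warn if it is not passed explicitly
index = AnnoyIndex(dimension, 'angular')
for i, v in enumerate(test_target_emb):
    index.add_item(i, v)
index.build(10)  # number of trees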

### Search nearest neighbors

In [0]:
# randrange excludes the upper bound, so the index is always valid
i = random.randrange(len(test_source_docs))
input_sequence = test_source_docs[i]
print(input_sequence)

vec = np.expand_dims(test_source_emb[i], 0)
out_vec = modal_model.predict(vec)
# out_vec has shape (1, dimension); pass the single row as a flat vector
ids, _ = index.get_nns_by_vector(out_vec[0], 10, include_distances=True)
[test_target_docs[j] for j in ids]