[View in Colaboratory](https://colab.research.google.com/github/hamelsmu/kdd-2018-hands-on-tutorials/blob/master/Feature%20Extraction%20and%20Summarization%20with%20Sequence%20to%20Sequence%20Learning.ipynb)

# Setup Notebook

Install [ktext](https://github.com/hamelsmu/ktext) and [annoy](https://github.com/spotify/annoy).

In [0]:
!pip install -q ktext
!pip install -q annoy

In [1]:
import json
from urllib.request import urlopen

from annoy import AnnoyIndex
from keras import optimizers
from keras.layers import Input, Dense, LSTM, GRU, Embedding, Lambda, BatchNormalization
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from ktext.preprocess import processor
import numpy as np
import pandas as pd
import random
from tqdm import tqdm

Using TensorFlow backend.


# Data sets

## [English to French](http://www.manythings.org/anki/)

In [0]:
# !wget http://www.manythings.org/anki/fra-eng.zip
# !unzip -o fra-eng.zip

In [0]:
# with open('fra.txt', 'r') as f:
#     lines = f.readlines()
# target_docs, source_docs = zip(*[line.strip().split('\t') for line in lines])
# target_docs = list(set(target_docs))

## [CoNaLa](https://conala-corpus.github.io/)

In [2]:
!wget http://www.phontron.com/download/conala-corpus-v1.1.zip
!unzip -o conala-corpus-v1.1.zip

--2018-08-07 17:41:27--  http://www.phontron.com/download/conala-corpus-v1.1.zip
Resolving www.phontron.com (www.phontron.com)... 208.113.196.149
Connecting to www.phontron.com (www.phontron.com)|208.113.196.149|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52105440 (50M) [application/zip]
Saving to: ‘conala-corpus-v1.1.zip.1’


2018-08-07 17:41:28 (45.7 MB/s) - ‘conala-corpus-v1.1.zip.1’ saved [52105440/52105440]

Archive:  conala-corpus-v1.1.zip
  inflating: conala-corpus/conala-mined.jsonl  
  inflating: conala-corpus/conala-train.json  
  inflating: conala-corpus/conala-test.json  


In [0]:
with open('conala-corpus/conala-mined.jsonl', 'r') as f:
    lines = [json.loads(line) for line in f.readlines()]
source_docs = [line['snippet'] for line in lines]
target_docs = [line['intent'] for line in lines]

In [0]:
with open('conala-corpus/conala-train.json', 'r') as f:
    lines = json.load(f)
train_source_docs = [line['snippet'] for line in lines]
train_target_docs = [line['intent'] for line in lines]
test_docs = [line['rewritten_intent'] for line in lines if line['rewritten_intent']]

In [0]:
with open('conala-corpus/conala-test.json', 'r') as f:
    lines = json.load(f)
test_source_docs = [line['snippet'] for line in lines]
test_target_docs = [line['intent'] for line in lines]

## Other Data Sources (For Later Use)

### GitHub issues data

In [0]:
# issues = pd.read_csv('https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip')
# source_docs = list(issues.body)
# target_docs = list(issues.issue_title)

### Python functions data

In [0]:
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.function')
# source_docs = [line.decode('utf-8') for line in f.readlines()]
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.docstring')
# target_docs = [line.decode('utf-8') for line in f.readlines()]

## Use subset of the data

We will use only the first 100,000 examples in the interest of brevity. The full dataset can be used in a subsequent pass if desired.

In [0]:
source_docs = source_docs[:100000]
target_docs = target_docs[:100000]

# Language Model

## Preprocessing
Tokenize, generate vocabulary, apply padding and vectorize.

Let's inspect the raw text of the target docs.

In [8]:
target_docs[:10]

['Sort a nested list by two elements',
 'converting integer to list in python',
 'Converting byte string in unicode string',
 'List of arguments with argparse',
 'How to convert a Date string to a DateTime object?',
 'How to efficiently convert Matlab engine arrays to numpy ndarray?',
 'Converting html to text with Python',
 'regex for repeating words in a string in Python',
 'Ordering a list of dictionaries in python',
 'Two Combination Lists from One List']

To pre-process this text for deep learning, we need to convert it into integer values. We will use the `ktext` package to do this.

In [9]:
proc = processor(hueristic_pct_padding=.7, keep_n=20000)
vecs = proc.fit_transform(target_docs)

 See full histogram by insepecting the `document_length_stats` attribute.


Below is an example where tokens are mapped to integers

In [10]:
print('original list: ', target_docs[0].split())
print('tokenized list: ', [proc.token2id[x] for x in target_docs[0].lower().split()])

original list:  ['Sort', 'a', 'nested', 'list', 'by', 'two', 'elements']
tokenized list:  [118, 2, 151, 10, 43, 38, 56]
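
To make the mapping concrete, here is a minimal sketch of frequency-based vocabulary building in plain Python. It is not `ktext`'s implementation: the helper names `build_vocab` and `encode` are hypothetical, and reserving id 0 for padding and id 1 for unknown tokens is an assumption inferred from how the notebook treats those ids later on.

```python
from collections import Counter

PAD_ID = 0  # assumed: id reserved for padding
UNK_ID = 1  # assumed: id reserved for out-of-vocabulary tokens


def build_vocab(docs, keep_n):
    """Map the keep_n most frequent lowercase tokens to ids starting at 2."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def encode(doc, token2id):
    """Replace each token by its id, falling back to UNK_ID for unseen tokens."""
    return [token2id.get(tok, UNK_ID) for tok in doc.lower().split()]


toy_docs = ["sort a list", "sort a nested list", "reverse a list"]
toy_token2id = build_vocab(toy_docs, keep_n=3)

# 'tuple' never appears in toy_docs, so it is encoded as UNK_ID
encoded = encode("sort a tuple", toy_token2id)
print(toy_token2id)
print(encoded)
```

Capping the vocabulary with `keep_n` keeps the embedding and output layers small; rare tokens all share the unknown id.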


We can see the most common words by calling the `token_count_pandas()` method.

In [11]:
proc.token_count_pandas().head(20)

| token | count |
|---|---|
| a | 52824 |
| python | 48171 |
| in | 47702 |
| to | 47281 |
| how | 36116 |
| of | 22831 |
| with | 15954 |
| the | 13505 |
| list | 12651 |
| from | 11626 |


Furthermore, the documents in our corpus have different lengths. By setting `hueristic_pct_padding=.7`, `ktext` will truncate and pad all sequences to the 70th percentile length. However, it is useful to sanity-check a histogram of lengths. The `document_length_stats` property below displays a histogram of document lengths.

In [12]:
proc.document_length_stats

| | bin | doc_count | cumsum_pct |
|---|---|---|---|
| 6 | 0 | 31 | 0.00031 |
| 0 | 5 | 34978 | 0.35009 |
| 1 | 10 | 50700 | 0.85709 |
| 2 | 15 | 12664 | 0.98373 |
| 3 | 20 | 1486 | 0.99859 |
| 5 | 25 | 124 | 0.99983 |
| 4 | 30 | 17 | 1.0 |
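
The idea of percentile-based padding can be sketched in plain Python as follows. This is an illustration only: `percentile_length` and `pad_or_truncate` are hypothetical helpers, the percentile is taken directly on the sorted lengths (whereas `ktext` works from binned lengths, so its result may differ), and the choice to pad on the left and keep the first tokens of long documents is an assumption.

```python
def percentile_length(lengths, pct):
    """Length found at the given fraction of the sorted lengths (rounded down)."""
    ordered = sorted(lengths)
    return ordered[int(pct * (len(ordered) - 1))]


def pad_or_truncate(ids, maxlen, pad_id=0):
    """Left-pad with pad_id, or keep only the first maxlen ids if too long."""
    ids = ids[:maxlen]
    return [pad_id] * (maxlen - len(ids)) + ids


toy_lengths = [2, 3, 3, 4, 5, 5, 6, 8, 9, 14]
toy_maxlen = percentile_length(toy_lengths, 0.7)

short_doc = pad_or_truncate([7, 8, 9], toy_maxlen)
long_doc = pad_or_truncate(list(range(1, 9)), toy_maxlen)
print(toy_maxlen, short_doc, long_doc)
```

Choosing a percentile rather than the maximum length avoids padding every document out to the length of a few outliers.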


It is useful to keep track of the maximum length and the number of unique tokens in the corpus for later use.

In [13]:
vocab_size = max(proc.id2token.keys()) + 1
max_length = proc.padding_maxlen

print('vocab size: ', vocab_size)
print('max length allowed for documents: ', max_length)

vocab size:  10225
max length allowed for documents:  10


## Language model

Prepare training data for language model.

In [25]:
sequences = []
for arr in tqdm(vecs):
    non_zero = (arr != 0).argmax()
    for i in range(non_zero, len(arr)):
        sequences.append(arr[:i+1])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
# y = to_categorical(y, num_classes=vocab_size)

100%|██████████| 100000/100000 [00:01<00:00, 93113.13it/s]
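
The cell above turns every document into several training examples, one per prefix. The following plain-Python sketch mirrors that logic on a single toy vector so the result is easy to inspect; `expand_prefixes` is a hypothetical helper and the token ids are made up.

```python
def expand_prefixes(padded, max_length, pad_id=0):
    """Turn one pre-padded id vector into (context, next_token) pairs."""
    # Position of the first real token; 0 if the row is all padding
    first = next((i for i, t in enumerate(padded) if t != pad_id), 0)
    pairs = []
    for end in range(first, len(padded)):
        prefix = padded[: end + 1]
        # Pre-pad the prefix back to max_length
        prefix = [pad_id] * (max_length - len(prefix)) + prefix
        # All but the last id form the input; the last id is the label
        pairs.append((prefix[:-1], prefix[-1]))
    return pairs


toy_pairs = expand_prefixes([0, 0, 0, 5, 7, 9], max_length=6)
for context, label in toy_pairs:
    print(context, "->", label)
```

A document with three real tokens yields three examples, which is why the training set reported further down is several times larger than the number of documents.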


In [26]:
i = Input(shape=(max_length-1,))
x = Embedding(vocab_size, 256, input_length=max_length-1)(i)
x = LSTM(256, return_sequences=True)(x)
last_timestep = Lambda(lambda x: x[:, -1, :])(x)
last_timestep = Dense(vocab_size, activation='softmax')(last_timestep)
model = Model(i, last_timestep)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 9)                 0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 9, 256)            2617600   
_________________________________________________________________
lstm_4 (LSTM)                (None, 9, 256)            525312    
_________________________________________________________________
lambda_4 (Lambda)            (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 10225)             2627825   
Total params: 5,770,737
Trainable params: 5,770,737
Non-trainable params: 0
_________________________________________________________________
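
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the standard Keras layer layouts (an LSTM with four gates, each with input weights, recurrent weights and a bias); the variable names are illustrative.

```python
n_vocab = 10225    # vocabulary size reported earlier in the notebook
emb_dim = 256      # embedding width
lstm_units = 256   # LSTM hidden size

# One emb_dim-sized vector per vocabulary entry
embedding_params = n_vocab * emb_dim

# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (emb_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# One weight per (hidden unit, vocabulary entry) plus one bias per entry
dense_params = lstm_units * n_vocab + n_vocab

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 2617600 525312 2627825 5770737
```

Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size.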


## Training

In [0]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X, y, epochs=10, batch_size=50, validation_split=0.1)

Train on 734247 samples, validate on 81584 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10

## Generate sequence

The goal of a language model is to predict the next word in a sequence. To sanity check the language model, we will see what kind of sentence is generated when we start with a seed word of 'is'. We are looking to see whether the generated sentence appears to be sampled from the distribution of the training data.

In [0]:
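def generate_seq(model, proc, n_words, seed_text):
    """Greedily extend seed_text by n_words predicted tokens."""
    in_text = seed_text
    for _ in range(n_words):
        # Vectorize the text so far; drop the first position to match the model input length
        vec = proc.transform([in_text])[:,1:]
        # Pick the most probable next token id
        index = np.argmax(model.predict(vec, verbose=0), axis=1)[0]
        # Id 1 is treated as the unknown token
        if index == 1:
            out_word = '_unk_'
        else:
            out_word = proc.id2token[index]
        in_text += ' ' + out_word
    return in_text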

In [58]:
generate_seq(model, proc, max_length, 'is')

'there python is there a way to make the tkinter text'
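
The generation loop is greedy: at each step it appends the single most probable next word and feeds the longer text back in. The sketch below shows the same loop with a toy bigram count table standing in for the LSTM, so it runs without a trained model. The bigram model, the helper names and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_bigrams(docs):
    """Count, for every word, which words follow it."""
    table = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table


def generate_greedy(table, seed, n_words):
    """Repeatedly append the most frequent follower of the last word."""
    words = [seed]
    for _ in range(n_words):
        followers = table.get(words[-1])
        if not followers:  # no known continuation
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


toy_table = fit_bigrams([
    "is there a way",
    "is there a list",
    "is it a list",
    "a list of lists",
])
generated = generate_greedy(toy_table, "is", 4)
print(generated)
```

Because the choice at each step is an argmax, the same seed always produces the same sentence.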

## Generate embeddings

In [0]:
embedding_model = Model(inputs=model.inputs, outputs=model.layers[-3].output)

In [60]:
input_sequence = test_docs[random.randrange(len(test_docs))]
print(input_sequence)
vec = proc.transform([input_sequence])[:,1:]
embedding_model.predict(vec)

check if list `li` is empty


array([[[ 0.        , -0.        ,  0.        , ...,  0.        ,
          0.2822097 ,  0.        ],
        [ 0.        , -0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        , -0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [ 0.55555606,  0.44329646, -0.6894753 , ..., -0.30388248,
          0.        ,  0.56213903],
        [ 0.        ,  0.7900059 , -0.22815275, ..., -0.7061561 ,
          0.        ,  0.7455072 ],
        [ 0.7805286 , -0.73729074, -0.9997308 , ..., -0.8817332 ,
         -0.3556277 , -0.11047834]]], dtype=float32)

In [0]:
vecs = proc.transform(test_docs)

In [0]:
hidden_states = embedding_model.predict(vecs[:, 1:])

In [0]:
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
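
Each of the three reductions above collapses the time axis, so a document with one hidden state per timestep becomes a single fixed-size vector. The plain-Python sketch below shows the effect on one toy document; the numbers are made up and the variable names are illustrative.

```python
# One toy document: 3 timesteps, each with a 4-dimensional hidden state
toy_states = [
    [0.0, 0.0, 0.0, 0.0],    # a padded position
    [0.5, -1.0, 2.0, 0.0],
    [1.5, 1.0, -2.0, 3.0],
]

# zip(*rows) groups the values of each feature across timesteps
columns = list(zip(*toy_states))

toy_mean = [sum(col) / len(col) for col in columns]
toy_max = [max(col) for col in columns]
toy_sum = [sum(col) for col in columns]

print(toy_mean)
print(toy_max)
print(toy_sum)
```

Note that padded positions still take part in the reduction: they pull the mean towards whatever value the model outputs there, which is one reason the three poolings can rank neighbors differently.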

## Applications
### Build vector indices

In [64]:
dimension = hidden_states.shape[-1]
index = AnnoyIndex(dimension)
for i, v in enumerate(sum_vecs):
    index.add_item(i, v)
index.build(10)

True

### Search nearest neighbors

In [65]:
ids, _ = index.get_nns_by_item(1000, 10, include_distances=True)
[test_docs[i] for i in ids]

['check if any of the items in  `search` appear in `string`',
 "check if all of the following items in list `['a', 'b']` are in a list `['a', 'b', 'c']`",
 'check if any elements in one list `list1` are in another list `list2`',
 'check if any item from list `b` is in list `a`',
 'check if any element of list `substring_list` are in string `string`',
 'Check if a given key `key` exists in dictionary `d`',
 "Check if a given key 'key1' exists in dictionary `dict`",
 "test if either of strings `a` or `b` are members of the set of strings, `['b', 'a', 'foo', 'bar']`",
 'check if any values in a list `input_list` is a list',
 'check if the third element of all the lists in a list "items" is equal to zero.']
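
Annoy returns approximate neighbors. For intuition, the exact search it approximates can be written as a brute-force scan. The sketch below assumes the index uses Annoy's angular metric, under which ranking by distance is equivalent to ranking by cosine similarity; the helper names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, query, k):
    """Ids of the k vectors most similar to query, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(vectors[i], query),
        reverse=True,
    )
    return ranked[:k]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbor_ids = nearest_neighbors(toy_vectors, toy_vectors[0], k=2)
print(neighbor_ids)
```

The scan costs one similarity computation per stored vector, which is fine for a few thousand items; an approximate index trades a little accuracy for much faster queries on large collections.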

In [66]:
input_sequence = test_docs[random.randrange(len(test_docs))]
print(input_sequence)

vec = proc.transform([input_sequence])[:,1:]
vec = np.sum(embedding_model.predict(vec), axis=1)
ids, _ = index.get_nns_by_vector(vec[0], 10, include_distances=True)
[test_docs[i] for i in ids]

interleave the elements of two lists `a` and `b`


['interleave the elements of two lists `a` and `b`',
 'sum the product of elements of two lists named `a` and `b`',
 'get indexes of the largest `2` values from a list `a` using itemgetter',
 'apply itertools.product to elements of a list of lists `arrays`',
 'Find all the items from a dictionary `D` if the key contains the string `Light`',
 "Concatenate elements of a list 'x' of multiple integers to a single integer",
 "get unique values from the list `['a', 'b', 'c', 'd']`",
 'get index of elements in array `A` that occur in another array `B`',
 'insert elements of list `k` into list `a` at position `n`',
 'get tuples of the corresponding elements from lists `lst` and `lst2`']

# Sequence to Sequence Model

## Preprocessing
Tokenize, generate vocabulary, apply padding and vectorize.
If source and target are from the same distribution, vocabulary can be shared.

In [0]:
source_proc = processor(hueristic_pct_padding=.7, keep_n=20000)
source_vecs = source_proc.fit_transform(source_docs)

target_proc = processor(append_indicators=True, hueristic_pct_padding=.7, keep_n=14000, padding='post')
target_vecs = target_proc.fit_transform(target_docs)

In [0]:
encoder_input_data = source_vecs
encoder_seq_len = encoder_input_data.shape[1]

decoder_input_data = target_vecs[:, :-1]
decoder_target_data = target_vecs[:, 1:]

num_encoder_tokens = max(source_proc.id2token.keys()) + 1
num_decoder_tokens = max(target_proc.id2token.keys()) + 1
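
The two slices of `target_vecs` set up teacher forcing: at every position the decoder receives the true previous token and is trained to predict the next one. The toy example below shows the alignment on one sequence. The token ids, including those for the start and end markers, are hypothetical.

```python
START, END, PAD = 2, 3, 0  # hypothetical ids, for illustration only
toy_target = [START, 11, 12, 13, END, PAD]

toy_decoder_input = toy_target[:-1]   # everything except the last position
toy_decoder_target = toy_target[1:]   # the same sequence shifted left by one

for seen, expected in zip(toy_decoder_input, toy_decoder_target):
    print(f"decoder sees {seen:>2} -> should predict {expected}")
```

Both slices have the same length, one less than the padded target, so inputs and labels line up position by position.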

## Encoder model

In [0]:
word_emb_dim = 512
hidden_state_dim = 1024

encoder_inputs = Input(shape=(encoder_seq_len,), name='Encoder-Input')
x = Embedding(num_encoder_tokens, word_emb_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = GRU(hidden_state_dim, return_state=True, name='Encoder-Last-GRU', dropout=.5)(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)

## Decoder model

In [0]:
decoder_inputs = Input(shape=(None,), name='Decoder-Input')
dec_emb = Embedding(num_decoder_tokens, word_emb_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)
decoder_gru = GRU(hidden_state_dim, return_state=True, return_sequences=True, name='Decoder-GRU', dropout=.5)
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
decoder_outputs = decoder_dense(x)

## Sequence to sequence model

In [0]:
seq2seq_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

## Training

In [0]:
batch_size = 1024
epochs = 16

seq2seq_model.compile(optimizer=optimizers.Nadam(lr=0.00005), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = seq2seq_model.fit([encoder_input_data, decoder_input_data],
                            np.expand_dims(decoder_target_data, -1),
                            batch_size=batch_size,
                            epochs=epochs,
                            validation_split=0.1)
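
The `sparse_categorical_crossentropy` loss takes integer token ids as labels instead of one-hot vectors, which is why the targets are passed as ids (with a trailing axis added by `np.expand_dims`). Per position, the loss is the negative log of the probability the model assigns to the correct id. The plain-Python sketch below shows that calculation on a made-up distribution; the function here is a hand-written illustration, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(target_id, probs):
    """Negative log-probability assigned to the correct class id."""
    return -math.log(probs[target_id])


# A made-up softmax output over a 4-token vocabulary
toy_probs = [0.1, 0.2, 0.6, 0.1]

loss_likely = sparse_categorical_crossentropy(2, toy_probs)    # -log(0.6)
loss_unlikely = sparse_categorical_crossentropy(0, toy_probs)  # -log(0.1)
print(loss_likely, loss_unlikely)
```

With a vocabulary of several thousand tokens, integer labels avoid materializing a one-hot array of shape (examples, timesteps, vocabulary).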

## Extract encoder and decoder models

In [0]:
def extract_decoder_model(model):
    latent_dim = model.get_layer('Encoder-Model').output_shape[-1]
    decoder_inputs = model.get_layer('Decoder-Input').input
    dec_emb = model.get_layer('Decoder-Word-Embedding')(decoder_inputs)
    dec_bn = model.get_layer('Decoder-Batchnorm-1')(dec_emb)
    gru_inference_state_input = Input(shape=(latent_dim,), name='hidden_state_input')
    gru_out, gru_state_out = model.get_layer('Decoder-GRU')([dec_bn, gru_inference_state_input])
    dec_bn2 = model.get_layer('Decoder-Batchnorm-2')(gru_out)
    dense_out = model.get_layer('Final-Output-Dense')(dec_bn2)
    decoder_model = Model([decoder_inputs, gru_inference_state_input], [dense_out, gru_state_out])
    return decoder_model

In [0]:
encoder_model = seq2seq_model.get_layer('Encoder-Model')
for layer in encoder_model.layers:
    layer.trainable = False

decoder_model = extract_decoder_model(seq2seq_model)
decoder_model.summary()

## Predict sequence

In [0]:
i = random.randrange(len(test_source_docs))

max_len = target_proc.padding_maxlen
raw_input_text = test_source_docs[i]

raw_tokenized = source_proc.transform([raw_input_text])
encoding = encoder_model.predict(raw_tokenized)
original_encoding = encoding
state_value = np.array(target_proc.token2id['_start_']).reshape(1, 1)

decoded_sentence = []
stop_condition = False
while not stop_condition:
    preds, st = decoder_model.predict([state_value, encoding])
    pred_idx = np.argmax(preds[:, :, 2:]) + 2
    pred_word_str = target_proc.id2token[pred_idx]

    if pred_word_str == '_end_' or len(decoded_sentence) >= max_len:
        stop_condition = True
        break
    decoded_sentence.append(pred_word_str)

    # update the decoder for the next word
    encoding = st
    state_value = np.array(pred_idx).reshape(1, 1)

print(raw_input_text)
print(test_target_docs[i])
print(' '.join(decoded_sentence))
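
The decoding loop above follows a common greedy pattern: feed the last predicted token and the current state to the decoder, take the best-scoring token, and stop at the end marker or the length limit. The sketch below isolates that control flow with a toy step function in place of `decoder_model.predict`. The vocabulary, the ids and `toy_step` are all hypothetical.

```python
ID2TOKEN = {0: "_pad_", 1: "_unk_", 2: "_start_", 3: "_end_",
            4: "sort", 5: "a", 6: "list"}
SCRIPT = [4, 5, 6, 3]  # the toy decoder always "says": sort a list _end_


def toy_step(token_id, state):
    """Stand-in for the decoder: returns (scores, new_state).

    The state is just a step counter and token_id is ignored by this toy.
    """
    scores = [0.0] * len(ID2TOKEN)
    scores[SCRIPT[min(state, len(SCRIPT) - 1)]] = 1.0
    return scores, state + 1


def greedy_decode(step, start_id, init_state, max_len):
    token_id, state, words = start_id, init_state, []
    while len(words) < max_len:
        scores, state = step(token_id, state)
        # Skip ids 0 and 1 so padding and unknown are never emitted
        token_id = max(range(2, len(scores)), key=scores.__getitem__)
        if ID2TOKEN[token_id] == "_end_":
            break
        words.append(ID2TOKEN[token_id])
    return words


decoded = greedy_decode(toy_step, start_id=2, init_state=0, max_len=10)
print(" ".join(decoded))
```

The length limit matters in practice: a model that never emits the end marker would otherwise loop forever.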

## Generate embeddings

In [0]:
train_source_emb = encoder_model.predict(source_proc.transform(train_source_docs))

In [0]:
vecs = proc.transform(train_target_docs)
hidden_states = embedding_model.predict(vecs[:, 1:])
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
train_target_emb = sum_vecs

In [0]:
print(train_source_emb.shape)
print(train_target_emb.shape)

# Modality

In [0]:
inp = Input(shape=(train_source_emb.shape[1],))
x = Dense(train_target_emb.shape[1], use_bias=False)(inp)
# x = BatchNormalization()(x)
# x = Dense(512)(x)
modal_model = Model([inp], x)
modal_model.summary()

In [0]:
modal_model.compile(optimizer=optimizers.Nadam(lr=0.002), loss='cosine_proximity', metrics=['accuracy'])

batch_size = 1024
epochs = 10
history = modal_model.fit([train_source_emb], train_target_emb,
                          batch_size=batch_size, epochs=epochs, validation_split=0.1)
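
The model above learns a linear map from the code embedding space to the text embedding space, scored by the direction of the output rather than its magnitude. The sketch below assumes `cosine_proximity` is the negative cosine similarity of the L2-normalized vectors, which is how Keras 2.x defined it. It is a hand-written illustration in plain Python with made-up vectors.

```python
import math


def l2_normalize(v):
    """Scale v to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine_proximity(y_true, y_pred):
    """Negative cosine similarity: -1 is a perfect match, +1 the opposite direction."""
    return -sum(a * b for a, b in zip(l2_normalize(y_true), l2_normalize(y_pred)))


same_direction = cosine_proximity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_proximity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_proximity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Since the loss ignores vector length, it pairs naturally with an angular nearest-neighbor search over the mapped vectors. The `accuracy` metric is not informative for this kind of regression target, so the loss value is the number to watch.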

## Applications
### Use test data

In [0]:
test_source_emb = encoder_model.predict(source_proc.transform(test_source_docs))

In [0]:
vecs = proc.transform(test_target_docs)
hidden_states = embedding_model.predict(vecs[:, 1:])
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
test_target_emb = sum_vecs

In [0]:
print(test_source_emb.shape)
print(test_target_emb.shape)

### Build vector indices

In [0]:
dimension = hidden_states.shape[-1]
index = AnnoyIndex(dimension)
for i, v in enumerate(test_target_emb):
    index.add_item(i, v)
index.build(10)

### Search nearest neighbors

In [0]:
i = random.randrange(len(test_source_docs))
input_sequence = test_source_docs[i]
print(input_sequence)

vec = np.expand_dims(test_source_emb[i], 0)
out_vec = modal_model.predict(vec)
ids, _ = index.get_nns_by_vector(out_vec[0], 10, include_distances=True)
[test_target_docs[i] for i in ids]