# KDD 2018 Hands-On Tutorial  https://kddseq2seq.com/

Feature Extraction and Summarization With Sequence-to-Sequence Learning


### Pre-requisites

This tutorial is aimed at moderately skilled users who have some familiarity with neural networks and are comfortable writing code.  These blog posts are good background for this tutorial:

- [How To Create Data Products That Are Magical Using Sequence-to-Sequence Models](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8)

- [How To Create Natural Language Semantic Search For Arbitrary Objects With Deep Learning](https://towardsdatascience.com/semantic-code-search-3cd6d244a39c)

### Google Colab Notebooks

This tutorial can be run in Google Colab, which provides a free GPU-enabled Jupyter notebook in the cloud.  **You can open this notebook in Colab by following [this link](https://colab.research.google.com/github/hohsiangwu/kdd-2018-hands-on-tutorials/blob/master/Feature%20Extraction%20and%20Summarization%20with%20Sequence%20to%20Sequence%20Learning.ipynb).**

# Takeaway

1. Language Model
  * Self-supervised learning
  * Sequence generation
  * Pooling to get representations
2. Sequence to Sequence Model
  * Machine translation
  * Encoder to get representations
3. Joint Vector Space

# Motivating Example: Semantic Code Search

Yes, this is a gif of a notebook inside another notebook.

Motivation:  What if you could search code semantically instead of by keyword?

![alt text](https://github.com/hamelsmu/code_search/raw/master/gifs/live_search.gif?sanitize=true)

A detailed, open source end to end tutorial on how to create semantic code search yourself is [here](https://towardsdatascience.com/semantic-code-search-3cd6d244a39c).

# Setup Notebook

Install [ktext](https://github.com/hamelsmu/ktext) and [annoy](https://github.com/spotify/annoy).

In [None]:
!pip install -q ktext
!pip install -q annoy

In [None]:
import json
from urllib.request import urlopen

from annoy import AnnoyIndex
from keras import optimizers
from keras.layers import Input, Dense, LSTM, GRU, Embedding, Lambda, BatchNormalization
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from ktext.preprocess import processor
import numpy as np
import pandas as pd
import random
from tqdm import tqdm

# Data sets

## [CoNaLa](https://conala-corpus.github.io/)

Challenge designed to test systems for generating program snippets from natural language.


### Preview of the CoNaLa Dataset

```
{
  "question_id": 36875258,
  "intent": "copying one file's contents to another in python", 
  "rewritten_intent": "copy the content of file 'file.txt' to file 'file2.txt'", 
  "snippet": "shutil.copy('file.txt', 'file2.txt')", 
}

{
  "intent": "How do I check if all elements in a list are the same?", 
  "rewritten_intent": "check if all elements in list `mylist` are the same", 
  "snippet": "len(set(mylist)) == 1", 
  "question_id": 22240602
}

{
  "intent": "Iterate through words of a file in Python", 
  "rewritten_intent": "get a list of words `words` of a file 'myfile'", 
  "snippet": "words = open('myfile').read().split()", 
  "question_id": 7745260
}
```

In [None]:
!wget http://www.phontron.com/download/conala-corpus-v1.1.zip
!unzip -o conala-corpus-v1.1.zip

In [None]:
with open('conala-corpus/conala-mined.jsonl', 'r') as f:
    lines = [json.loads(line) for line in f.readlines()]
source_docs = [line['snippet'] for line in lines]
target_docs = [line['intent'] for line in lines]

In [None]:
with open('conala-corpus/conala-train.json', 'r') as f:
    lines = json.load(f)
train_source_docs = [line['snippet'] for line in lines]
train_target_docs = [line['intent'] for line in lines]
test_docs = [line['rewritten_intent'] for line in lines if line['rewritten_intent']]

In [None]:
with open('conala-corpus/conala-test.json', 'r') as f:
    lines = json.load(f)
test_source_docs = [line['snippet'] for line in lines]
test_target_docs = [line['intent'] for line in lines]

## Other Data Sources (For Later Use)

The datasets below are alternative sources of data for this same exercise.  We will not be reviewing them as part of this tutorial.  However, we encourage you to inspect them for additional practice and to build more intuition for these techniques.  Practicing with these other datasets will give you confidence in the general applicability of the techniques taught in this tutorial.

### [English to French](http://www.manythings.org/anki/)

In [None]:
# !wget http://www.manythings.org/anki/fra-eng.zip
# !unzip -o fra-eng.zip

In [None]:
# with open('fra.txt', 'r') as f:
#     lines = f.readlines()
# target_docs, source_docs = zip(*[line.strip().split('\t') for line in lines])
# target_docs = list(set(target_docs))

### GitHub issues data

In [None]:
# issues = pd.read_csv('https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip')
# source_docs = list(issues.body)
# target_docs = list(issues.issue_title)

### Python (function, docstring) pairs

The purpose of this dataset is to see if you can generate the docstring of a Python function or method by looking at the code.

In [None]:
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.function')
# source_docs = [line.decode('utf-8') for line in f.readlines()]
# f = urlopen('https://storage.googleapis.com/kubeflow-examples/code_search/data/train.docstring')
# target_docs = [line.decode('utf-8') for line in f.readlines()]

## Use subset of the data

We will use only the first 50,000 examples of the training set in the interest of brevity.  However, we can use the full dataset in a subsequent pass if desired.

In [None]:
source_docs = source_docs[:50000]
target_docs = target_docs[:50000]

# Language Model

What is a language model?

![alt text](https://cdn-images-1.medium.com/max/1440/1*XGfyUGtWq0yZ4RfufYfbRw.jpeg)

Source: https://medium.com/paper-club/language-modeling-survey-333077e43dd9

## Preprocessing
Tokenize, generate vocabulary, apply padding and vectorize.

#### Keras Text Pre-Processing Primer

Now that we have gathered the data, we need to prepare the data for the modeling. Before jumping into the code, let’s warm up with a toy example of two documents:

```
[“The quick brown fox jumped over the lazy dog 42 times.”, “The dog is lazy”]
```

Below is a rough outline of the steps I will take in order to pre-process this raw text:

**1. Clean text:** in this step, we want to remove or replace specific characters and lower case all the text. This step is discretionary and depends on the size of the data and the specifics of your domain. In this toy example, I lower-case all characters and replace numbers with *number* in the text. In the real data, I handle more scenarios.

[“the quick brown fox jumped over the lazy dog *number* times”, “the dog is lazy”]


**2. Tokenize:** split each document into a list of words.

```
[[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumped’, ‘over’, ‘the’, ‘lazy’, ‘dog’, ‘*number*’, ‘times’], [‘the’, ‘dog’, ‘is’, ‘lazy’]]
```

**3. Build vocabulary:** You will need to represent each distinct word in your corpus as an integer, which means you will need to build a map of token -> integers. Furthermore, I find it useful to reserve an integer for rare words that occur below a certain threshold as well as 0 for padding (see next step). After you apply a token -> integer mapping, your data might look like this:

```
[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [2, 9, 12, 8]]
```

**4. Padding:** You will have documents that have different lengths. There are many strategies for dealing with this in deep learning; for simplicity, this tutorial pads and truncates documents so that they all have the same length. You can decide to pad (with zeros) and truncate your document at the beginning or end, which I will refer to as “pre” and “post” respectively. After pre-padding our toy example, the data might look like this:

```
[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [0, 0, 0, 0, 0, 0, 0, 2, 9, 12, 8]]
```

A reasonable way to decide your target document length is to build a histogram of document lengths and choose a sensible number. (Note that the above example has padded the data in front but we could also pad at the end. We will discuss this more in the next section).
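
The steps above can be sketched in plain Python on the toy documents. This is only an illustration of the idea, not how `ktext` is implemented; the helper names (`clean`, `pad_pre`) and the choice to assign ids in order of first appearance are assumptions made for this example.

```python
import re

docs = ["The quick brown fox jumped over the lazy dog 42 times.", "The dog is lazy"]

def clean(text):
    """Lower-case, replace digit runs with *number*, drop other punctuation."""
    text = re.sub(r"\d+", "*number*", text.lower())
    return re.sub(r"[^a-z*\s]", "", text)

# Clean and tokenize (simple whitespace split).
tokens = [clean(d).split() for d in docs]

# Build the vocabulary: 0 is reserved for padding and 1 for rare words,
# so real tokens start at 2 (assigned here in order of first appearance).
token2id = {}
for doc in tokens:
    for tok in doc:
        token2id.setdefault(tok, len(token2id) + 2)
vectors = [[token2id[tok] for tok in doc] for doc in tokens]

def pad_pre(seq, maxlen):
    """Truncate at the end ("post"), then left-pad with zeros ("pre")."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq

# Pad every document to the length of the longest one.
maxlen = max(len(v) for v in vectors)
padded = [pad_pre(v, maxlen) for v in vectors]
print(padded)
```

On real data you would also map tokens below a frequency threshold to the reserved id 1, which this toy version skips.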

Inspect the raw text of source and target documents:

Source docs:


In [None]:
for x in source_docs[:10]:
  print(x)

Target docs:

In [None]:
target_docs[:10]

In order to pre-process this data, we will use the [`ktext` package](https://github.com/hamelsmu/ktext). `ktext` helps accomplish the pre-processing steps outlined in the previous section. This library is a thin wrapper around Keras and spaCy text processing utilities, and uses Python process-based parallelism to speed things up. It also chains all of the pre-processing steps together and provides a bunch of convenience functions. Warning: this package is under development, so use it with caution outside this tutorial (pull requests are welcome!). To learn more about how this library works, look at this [tutorial](https://github.com/hamelsmu/ktext/blob/master/notebooks/Tutorial.ipynb) (but for now I suggest reading ahead).

In [None]:
proc = processor(hueristic_pct_padding=.7, keep_n=5000)
vecs = proc.fit_transform(target_docs)

In [None]:
assert vecs.shape[0] == len(target_docs)

The above code cleans, tokenizes, and applies pre-padding and post-truncating such that each document length is equal to the 70th percentile of document lengths, which is an arbitrary choice. I made decisions about padding length by studying histograms of document length provided by ktext. Furthermore, only the top 5,000 words in the vocabulary are retained, and the remaining words are set to the index 1, which corresponds to rare words (this was another arbitrary choice).
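
To illustrate the percentile heuristic, the sketch below picks a padding length from a list of document lengths with NumPy. The lengths are made up, and the exact rule `ktext` applies internally may differ from `np.percentile`; treat this as an assumption-laden illustration.

```python
import numpy as np

doc_lengths = [3, 5, 6, 7, 8, 9, 10, 12, 20, 45]   # made-up token counts

# Choose the padding length as the 70th percentile of document lengths.
pad_len = int(np.percentile(doc_lengths, 70))

# Documents longer than pad_len would be truncated, the rest padded.
n_truncated = sum(length > pad_len for length in doc_lengths)
print(pad_len, n_truncated)
```

A few very long documents (like the one with 45 tokens here) no longer dictate the input size, at the cost of truncating them.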

Below is an example where tokens are mapped to integers.

In [None]:
print('original list: ', target_docs[0])
print('tokenized list: ', vecs[0])

We can see the most common words here, by calling the `token_count_pandas()` method.

In [None]:
proc.token_count_pandas().head(20)

Furthermore, the documents in our corpus have different lengths. By setting `hueristic_pct_padding=.7`, `ktext` will truncate and pad all sequences to the 70th percentile length. However, it can be useful to sanity check a histogram of lengths. We inspect the `document_length_stats` property below which displays a histogram of document lengths.

In [None]:
proc.document_length_stats

It is useful to keep track of the maximum length and the unique number of tokens in the corpus for later purposes.

In [None]:
vocab_size = max(proc.id2token.keys()) + 1
max_length = proc.padding_maxlen

print('vocab size: ', vocab_size)
print('max length allowed for documents: ', max_length)

## Language model architecture

Prepare training data for language model.

In [None]:
sequences = []
for arr in tqdm(vecs):
    non_zero = (arr != 0).argmax()
    for i in range(non_zero, len(arr)):
        sequences.append(arr[:i+1])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
# y = to_categorical(y, num_classes=vocab_size)
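
To make the loop above concrete, here is the same prefix expansion applied to one toy pre-padded vector, written in plain Python without Keras. The helper name `prefix_examples` is made up for this sketch, and it assumes the vector contains at least one non-zero token.

```python
def prefix_examples(vec):
    """Turn one pre-padded document into (context, next_token) pairs."""
    maxlen = len(vec)
    # Index of the first real (non-zero) token.
    start = next(i for i, t in enumerate(vec) if t != 0)
    pairs = []
    for i in range(start, maxlen):
        prefix = vec[:i + 1]
        padded = [0] * (maxlen - len(prefix)) + prefix   # pre-pad back to maxlen
        pairs.append((padded[:-1], padded[-1]))          # X = all but last, y = last
    return pairs

for context, target in prefix_examples([0, 0, 2, 9, 12, 8]):
    print(context, '->', target)
```

Each document of n real tokens yields n training examples, which is why the language model sees many more rows than there are documents.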

In [None]:
i = Input(shape=(max_length-1,))
x = Embedding(vocab_size, 256, input_length=max_length-1)(i)
x = LSTM(256, return_sequences=True)(x)
last_timestep = Lambda(lambda x: x[:, -1, :])(x)
last_timestep = Dense(vocab_size, activation='softmax')(last_timestep)
model = Model(i, last_timestep)
model.summary()

## Training

Now that we have created our architecture, we can train our model.  

**This step takes approximately 25 minutes.  This is a good time to take a bathroom break!**

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X, y, epochs=10, batch_size=128, validation_split=0.1)

## Generate sequence

The goal of a language model is to predict the next word in a sequence. To sanity check the language model, we will see what kind of sentence is generated when we start with a seed word of 'is'. We are looking to see if the sentence generated appears to be sampled from the distribution of the data.

In other words, does the sentence generated look like it was written by the same author(s) pertaining to the same domain as the training corpus?

In [None]:
def generate_seq(model, proc, n_words, seed_text):
    in_text = seed_text
    for _ in range(n_words):
        vec = proc.transform([in_text])[:,1:]
        index = np.argmax(model.predict(vec, verbose=0), axis=1)[0]
        out_word = ''
        if index == 1:
            out_word = '_unk_'
        else:
            out_word = proc.id2token[index]
        in_text += ' ' + out_word
    return in_text

See what sentence is generated from the language model, seeded with the word `is`.

In [None]:
generate_seq(model, proc, max_length, 'is')

## Generate sentence embeddings

One of the goals of training the language model is learning representations of sentences in our corpus.


There are a plethora of general purpose pre-trained models that will generate high-quality embeddings of phrases (also called sentence embeddings). [This article](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a) provides a great overview of the landscape. For example, Google’s universal sentence encoder works very well for many use cases and is available on [Tensorflow Hub](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1).

Despite the convenience of these pre-trained models, it can be advantageous to train a model that captures the domain-specific vocabulary and semantics of docstrings. There are many techniques one can use to create sentence embeddings. These range from simple approaches, like averaging word vectors, to more sophisticated techniques like those used in the construction of the universal sentence encoder.

For this tutorial, we will leverage the language model we trained earlier to generate embeddings for sentences.  It is important to carefully consider the corpus you use for training when building a language model. Ideally, you want to use a corpus that is of a similar domain to your downstream problem so you can adequately capture the relevant semantics and vocabulary. For example, a great corpus for this problem would be Stack Overflow data, since that is a forum that contains an extremely rich discussion of code. However, in order to keep this tutorial simple, we re-use the set of docstrings as our corpus. This is sub-optimal, as discussions on Stack Overflow often contain richer semantic information than what is in a one-line docstring. We leave it as an exercise for the reader to examine the impact on the final outcome of using an alternate corpus.

After we train the language model, our next task is to use this model to generate an embedding for each sentence. A common way of doing this is to summarize the hidden states of the language model.   A simple approach is to use aggregate statistics like the mean, max, or sum of all the hidden states. There are other approaches that are outside the scope of this tutorial, which we will discuss if time permits.

The code below extracts the hidden states from the LSTM layer of the language model when given an input. There is one hidden state for each time step of the (padded) sentence.

In [None]:
embedding_model = Model(inputs=model.inputs, outputs=model.layers[-3].output)

We can extract values from intermediate layers of this language model, and use those as sentence embeddings.  Here is how you can do that concretely with the language model we trained:

In [None]:
# randrange excludes the upper bound, so the index is always valid
input_sequence = test_docs[random.randrange(len(test_docs))]
print('input sequence: ', input_sequence, '\n\nhidden states:\n')
vec = proc.transform([input_sequence])[:,1:]
embedding_model.predict(vec)

Let's extract the hidden states for all the sentences in our training data.

In [None]:
test_vecs = proc.transform(test_docs)

In [None]:
hidden_states = embedding_model.predict(test_vecs[:, 1:])

As mentioned earlier, we can compute aggregate statistics over the hidden states.  This is how you can do that:

In [None]:
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
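
As a small worked example of what pooling over `axis=1` does, the sketch below uses a made-up array in place of real hidden states, with shape (sentences, time steps, hidden units).

```python
import numpy as np

# Toy hidden states: 2 sentences, 3 time steps, 4 hidden units.
toy_states = np.arange(24, dtype=float).reshape(2, 3, 4)

# Pool over the time axis (axis=1) so each sentence becomes one vector.
toy_mean = toy_states.mean(axis=1)
toy_max = toy_states.max(axis=1)
toy_sum = toy_states.sum(axis=1)

print(toy_states.shape, '->', toy_mean.shape)
print(toy_mean[0], toy_max[0], toy_sum[0])
```

Note that the language model above does not mask padded positions, so the hidden states at padded time steps are included in these statistics.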

## Application - Nearest Neighbor Search

Now that we have a way to represent each sentence as a vector, we can use this representation on many kinds of downstream tasks. One such task is finding a similar sentence to any given sentence.  


### Build vector indices

We will first place all the vectorized sentences in a special data structure that allows for fast nearest neighbor lookups. We will use [annoy](https://github.com/spotify/annoy) for this purpose.

In [None]:
dimension = hidden_states.shape[-1]
index = AnnoyIndex(dimension, 'angular')  # state the metric explicitly
for i, v in enumerate(sum_vecs):
    index.add_item(i, v)
index.build(10)

### Search nearest neighbors

In [None]:
# randrange excludes the upper bound, so the index is always valid
input_sequence = test_docs[random.randrange(len(test_docs))]
print('Query: ', input_sequence)

vec = proc.transform([input_sequence])[:,1:]
vec = np.sum(embedding_model.predict(vec), axis=1)
# vec has shape (1, dimension); pass the single row as a flat vector
ids, _ = index.get_nns_by_vector(vec[0], 10, include_distances=True)

print('\nSearch Results:')
[test_docs[i] for i in ids][1:]
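
For intuition, here is an exact, brute-force version of the lookup that an approximate index trades accuracy for speed against, written with NumPy. It ranks by cosine similarity, which orders neighbours the same way as annoy's angular metric. The function name `nearest_neighbors` and the toy vectors are made up for this sketch, and it assumes no vector is all zeros.

```python
import numpy as np

def nearest_neighbors(query, vectors, k):
    """Return indices of the k rows of `vectors` most cosine-similar to `query`."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sims)[:k].tolist()

toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors([1.0, 0.05], toy_vecs, k=2))
```

This scans every vector for each query, which is fine for a few thousand sentences but becomes slow for large corpora.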

# Sequence to Sequence Model

A [sequence to sequence model](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8) allows you to take an input sequence (source), and predict an output sequence (target).  These sequences can be anything, however we will focus on natural language for this tutorial. Sequence-to-sequence models have been used with great success in summarizing texts as well as generating translations from one language to another. For this tutorial, we will demonstrate a very creative task: given a snippet of code, we will train a model that generates a description of that code!

### Sequence to Sequence Primer

There are many variants of seq2seq models; however, we will walk through one of the simplest forms: an encoder-decoder network using RNNs.

#### Training

The decoder receives the ground truth shifted by one time step (that is, it is allowed to see the ground truth of the previous time step).  This is called teacher forcing.

![alt text](https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png)

#### Inference

At inference time, we will not be able to see the ground truth from the last time step.   Instead, we use the last predicted output in place of the previous time step's ground truth.  We will generate our sequence this way using a greedy approach, stopping only when we either reach a maximum length or predict a special `<stop>` token.  There are more sophisticated ways of generating sequences, such as [beam search](https://en.wikipedia.org/wiki/Beam_search), that we will not cover in this tutorial.

![alt text](https://blog.keras.io/img/seq2seq/seq2seq-inference.png)

Credit: https://blog.keras.io/category/tutorials.html
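
The greedy inference loop described above can be sketched without any neural network. Here a hypothetical lookup table of next-token scores stands in for the trained decoder; a real decoder would also carry a hidden state from step to step, which this sketch leaves out. The names `greedy_decode` and `toy_table` are invented for the example.

```python
def greedy_decode(step, start_token, end_token, max_len):
    """Feed each prediction back in as the next input until end_token or max_len."""
    token, output = start_token, []
    while len(output) < max_len:
        scores = step(token)                    # dict: candidate token -> score
        token = max(scores, key=scores.get)     # greedy: keep only the best one
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical stand-in for a trained decoder: fixed next-token scores.
toy_table = {
    '_start_': {'copy': 0.7, 'file': 0.2, '_end_': 0.1},
    'copy':    {'file': 0.8, 'copy': 0.1, '_end_': 0.1},
    'file':    {'_end_': 0.6, 'file': 0.3, 'copy': 0.1},
}
print(greedy_decode(toy_table.__getitem__, '_start_', '_end_', max_len=10))
```

Beam search differs only in that it keeps several candidate sequences alive at each step instead of a single best token.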


**Building a neural network architecture is like stacking lego bricks.** For beginners, it can be useful to think of each layer as an API: you send the API some data and then the API returns some data. Thinking of things this way frees you from becoming overwhelmed, and you can build your understanding of things slowly. It is important to understand two concepts:

- The shape of data that each layer expects, and the shape of data the layer will return. (When you stack many layers on top of each other, the input and output shapes must be compatible, like legos.)
- Conceptually, what will the output(s) of a layer represent? What does the output of a subset of stacked layers represent?

Let's take a look at the data we want to use. The `source` is the snippet of code and the `target` is the description of that code.

In [None]:
print('source (code input): ', source_docs[2])
print('target (description output): ', target_docs[2])

## Preprocessing

Similar to previous exercises, we must pre-process the raw strings into a format that can be utilized by our model. One such format is to map each word in our corpus to a unique integer value; we refer to this mapping as a vocabulary. If the source and target are from the same distribution (which they are not in this example), the vocabulary can be shared.


Concretely, we will tokenize, generate vocabulary, apply padding and vectorize. These steps are as follows:

**1. Tokenize:** Process of parsing strings into discrete words or tokens.

**2. Generate Vocabulary:** Assign each token to a unique integer; rarely occurring tokens may be assigned to the same integer.

**3. Padding:** We standardize the sequence length by truncating and padding each example to the same length.

The `ktext` package helps us accomplish these steps.

In [None]:
source_proc = processor(hueristic_pct_padding=.7, keep_n=20000)
source_vecs = source_proc.fit_transform(source_docs)

Note that we pre-process the source documents in the same way as for the language model.  The target documents, however, are processed with some subtle differences.

In [None]:
target_proc = processor(append_indicators=True, hueristic_pct_padding=.7, keep_n=14000, padding='post')
target_vecs = target_proc.fit_transform(target_docs)

Above, we passed some additional parameters:

- **append_indicators=True** will append the tokens `_start_` and `_end_` to the start and end of each document, respectively.

- **padding='post'** means that zero padding will be added to the end of the document instead of the default of 'pre'.


The reason for processing the target documents in this way is that we want our model to know when the first token of the docstring is supposed to occur, and also learn to predict when the end of a phrase should be. This will make more sense in the next section where model architecture is discussed.

Additionally, we will use teacher forcing for the decoder of the sequence to sequence model, so we will offset the target sequence by one.

In [None]:
encoder_input_data = source_vecs
encoder_seq_len = encoder_input_data.shape[1]

decoder_input_data = target_vecs[:, :-1]
decoder_target_data = target_vecs[:, 1:]

num_encoder_tokens = max(source_proc.id2token.keys()) + 1
num_decoder_tokens = max(target_proc.id2token.keys()) + 1
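
The one-step offset used for teacher forcing is easiest to see on a single toy target document. The integer ids below are made up for illustration; only the slicing mirrors the code above.

```python
# One post-padded target document: _start_ copy file _end_ <pad> <pad>
# (the integer ids are made up for illustration)
target = [2, 7, 9, 3, 0, 0]

decoder_input = target[:-1]    # what the decoder sees at each step
decoder_target = target[1:]    # what it should predict at that step

for seen, expected in zip(decoder_input, decoder_target):
    print(seen, '->', expected)
```

At every position the decoder is shown the true previous token and asked for the next one, so `_start_` is paired with the first real word and the last real word is paired with `_end_`.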

## Encoder model

The role of the encoder is to extract features and generate a representation of the input sequence, which in this case is a snippet of code. 

In [None]:
word_emb_dim = 512
hidden_state_dim = 1024

encoder_inputs = Input(shape=(encoder_seq_len,), name='Encoder-Input')
x = Embedding(num_encoder_tokens, word_emb_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = GRU(hidden_state_dim, return_state=True, name='Encoder-Last-GRU', dropout=.5)(x)
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
seq2seq_encoder_out = encoder_model(encoder_inputs)

In [None]:
encoder_model.summary()

## Decoder model

The role of the decoder is to generate a description of the code conditioned on the features extracted by the encoder.

In [None]:
decoder_inputs = Input(shape=(None,), name='Decoder-Input')
dec_emb = Embedding(num_decoder_tokens, word_emb_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)
decoder_gru = GRU(hidden_state_dim, return_state=True, return_sequences=True, name='Decoder-GRU', dropout=.5)
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
decoder_outputs = decoder_dense(x)

## Sequence to sequence model

We can connect the encoder and decoder together to create the sequence to sequence model.

In [None]:
seq2seq_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Summary of model architecture:

In [None]:
seq2seq_model.summary()

![alt text](https://raw.githubusercontent.com/hohsiangwu/kdd-2018-hands-on-tutorials/master/images/seq2seq_model_architecture.svg?sanitize=true)

## Training

The below hyperparameters were found through some trial and error.

**This should take approximately 35 minutes to train.**

In [None]:
batch_size = 1024
epochs = 16

seq2seq_model.compile(optimizer=optimizers.Nadam(lr=0.00005), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = seq2seq_model.fit([encoder_input_data, decoder_input_data],
                            np.expand_dims(decoder_target_data, -1),
                            batch_size=batch_size,
                            epochs=epochs,
                            validation_split=0.1)

To keep track of different experiments, we recommend:

http://wandb.com

(We will not be covering this in the tutorial)

## Extract encoder and decoder models

To prepare the model for inference (to make predictions), we have to re-assemble it (with its trained weights intact) such that the decoder uses the last prediction as input rather than being fed the right answer for the previous time step, as illustrated below:

![alt text](https://blog.keras.io/img/seq2seq/seq2seq-inference.png)

In [None]:
def extract_decoder_model(model):
    latent_dim = model.get_layer('Encoder-Model').output_shape[-1]
    decoder_inputs = model.get_layer('Decoder-Input').input
    dec_emb = model.get_layer('Decoder-Word-Embedding')(decoder_inputs)
    dec_bn = model.get_layer('Decoder-Batchnorm-1')(dec_emb)
    gru_inference_state_input = Input(shape=(latent_dim,), name='hidden_state_input')
    gru_out, gru_state_out = model.get_layer('Decoder-GRU')([dec_bn, gru_inference_state_input])
    dec_bn2 = model.get_layer('Decoder-Batchnorm-2')(gru_out)
    dense_out = model.get_layer('Final-Output-Dense')(dec_bn2)
    decoder_model = Model([decoder_inputs, gru_inference_state_input], [dense_out, gru_state_out])
    return decoder_model

One side effect of training a sequence-to-sequence model in this way is that the encoder can be re-used as a general purpose feature extractor. We extract the encoder below for this purpose in a later exercise.

In [None]:
encoder_model = seq2seq_model.get_layer('Encoder-Model')
for layer in encoder_model.layers:
    layer.trainable = False

decoder_model = extract_decoder_model(seq2seq_model)
decoder_model.summary()

## Predict code descriptions using the trained sequence-to-sequence model

You will see that the predicted descriptions are not perfect, but seem to be picking up on correlations between common code token sequences and natural language descriptions of that code.

Feel free to run the below block of code as many times as you want. A new random sample from the test set will be drawn each time.

In [None]:
# randrange excludes the upper bound, so the index is always valid
i = random.randrange(len(test_source_docs))

max_len = target_proc.padding_maxlen
raw_input_text = test_source_docs[i]

raw_tokenized = source_proc.transform([raw_input_text])
encoding = encoder_model.predict(raw_tokenized)
original_encoding = encoding
state_value = np.array(target_proc.token2id['_start_']).reshape(1, 1)

decoded_sentence = []
stop_condition = False
while not stop_condition:
    preds, st = decoder_model.predict([state_value, encoding])
    pred_idx = np.argmax(preds[:, :, 2:]) + 2
    pred_word_str = target_proc.id2token[pred_idx]

    if pred_word_str == '_end_' or len(decoded_sentence) >= max_len:
        stop_condition = True
        break
    decoded_sentence.append(pred_word_str)

    # update the decoder for the next word
    encoding = st
    state_value = np.array(pred_idx).reshape(1, 1)

print('sample code from test set:\n------------------------\n', raw_input_text)
print('\nground truth:\n------------------------\n', test_target_docs[i])
print('\npredicted description:\n------------------------')
print(' '.join(decoded_sentence))

## Generate Embeddings

We need two embeddings:

1.  Embeddings for the code snippets, from the seq2seq encoder.

2. Embeddings for the docstrings, from the language model.



Embeddings for the code snippets:

In [None]:
train_source_emb = encoder_model.predict(source_proc.transform(train_source_docs))

Embeddings for the target documents, which are natural language summaries (like docstrings):

In [None]:
train_target_vecs = proc.transform(train_target_docs)
hidden_states = embedding_model.predict(train_target_vecs[:, 1:])

Summarize the hidden states from the language model.

In [None]:
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
train_target_emb = sum_vecs

Check the shapes of each embedding.

In [None]:
print('source embedding shape on training set: ', train_source_emb.shape)
print('target embedding shape on training set: ', train_target_emb.shape)

# Construct a Joint Vector Space (Semantic Code Search)

Right now we have a way of representing:
- a blob of code as a vector using the encoder of the sequence-to-sequence model, and 
- the code descriptions as a vector using the language model.

However, these two vector spaces are not related to each other. It can be useful to project the vectors for code and descriptions into the same space so that we can search code with natural language. There are many ways of accomplishing this task; we will demonstrate a technique inspired by [this paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41473.pdf), where we use regression to "pull" these vectors into the same space.  This idea is further illustrated below:

![alt text](https://cdn-images-1.medium.com/max/1280/1*zhLXNHK8ILaYV8tT-jDlOQ.png)

### Review of the high-level process:  How do we build a joint vector space?  

Surprise! You are almost there!  We have already completed steps 1 - 3 as illustrated below.   

![alt text](https://raw.githubusercontent.com/hohsiangwu/kdd-2018-hands-on-tutorials/master/images/joint_space_diagram.svg?sanitize=true)

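Most of the pieces for this step come from prior steps in this tutorial. In this step, we train a model on top of the seq2seq encoder's output to predict docstring embeddings instead of docstrings. In the code below, this model is a single linear layer that maps code embeddings to description embeddings.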

In [None]:
inp = Input(shape=(train_source_emb.shape[1],))
x = Dense(train_target_emb.shape[1], use_bias=False)(inp)
# x = BatchNormalization()(x)
# x = Dense(512)(x)
modal_model = Model([inp], x)
modal_model.summary()

In [None]:
modal_model.compile(optimizer=optimizers.Nadam(lr=0.002), loss='cosine_proximity', metrics=['accuracy'])

batch_size = 1024
epochs = 10
history = modal_model.fit([train_source_emb], train_target_emb,
                          batch_size=batch_size, epochs=epochs, validation_split=0.1)
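
A Dense layer without bias is a single matrix multiplication. As a rough illustration of the same idea, the sketch below fits such a matrix with ordinary least squares in NumPy on synthetic data. This swaps in a different fitting method (closed-form least squares rather than gradient descent on a cosine loss), and all array names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
code_emb = rng.normal(size=(50, 8))      # stand-in for encoder outputs
true_map = rng.normal(size=(8, 4))
text_emb = code_emb @ true_map           # stand-in for language-model embeddings

# Fit W so that code_emb @ W is as close as possible to text_emb.
W, *_ = np.linalg.lstsq(code_emb, text_emb, rcond=None)
projected = code_emb @ W

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(W.shape, round(cosine(projected[0], text_emb[0]), 4))
```

Because the synthetic targets are an exact linear function of the inputs, the fit here is essentially perfect; on real embeddings the projection is only an approximation.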

## Application - Semantic Search

### Use test data

In [None]:
test_source_emb = encoder_model.predict(source_proc.transform(test_source_docs))

In [None]:
test_target_vecs = proc.transform(test_target_docs)
hidden_states = embedding_model.predict(test_target_vecs[:, 1:])
mean_vecs = np.mean(hidden_states, axis=1)
max_vecs = np.max(hidden_states, axis=1)
sum_vecs = np.sum(hidden_states, axis=1)
test_target_emb = sum_vecs

In [None]:
print(test_source_emb.shape)
print(test_target_emb.shape)

### Build vector indices

In [None]:
dimension = hidden_states.shape[-1]
index = AnnoyIndex(dimension, 'angular')  # state the metric explicitly
for i, v in enumerate(test_target_emb):
    index.add_item(i, v)
index.build(10)

### Search nearest neighbors

In [None]:
# randrange excludes the upper bound, so the index is always valid
i = random.randrange(len(test_source_docs))
input_sequence = test_source_docs[i]
print(input_sequence)

vec = np.expand_dims(test_source_emb[i], 0)
out_vec = modal_model.predict(vec)
# out_vec has shape (1, dimension); pass the single row as a flat vector
ids, _ = index.get_nns_by_vector(out_vec[0], 10, include_distances=True)
[test_target_docs[i] for i in ids]