# DSSM

In this assignment you are going to create and train a DSSM neural network. If you need a recap, feel free to read the original paper [Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) by Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, Larry Heck.

The training dataset consists of query-document pairs from user sessions in the [Yandex](http://yandex.com) search engine. Each document is represented by the title of the page the user landed on after clicking a search result. The dataset was intentionally kept small so that training does not take long. It was also filtered: only English query-document pairs were kept, and some adult content was removed.

As in the original paper we will use trigrams to represent input text. "Some, phrase!" gives the following trigrams: \["som", "ome", "phr", "hra", "ras", "ase"\].
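
The split in the example above can be reproduced in a few lines. This is only a sketch of that example: `word_trigrams` is a hypothetical helper, not part of the assignment code. The functions later in this notebook also produce trigrams that touch word boundaries (padded with spaces), as their tests show.

```python
import re

def word_trigrams(text):
    """Return letter trigrams of each word, ignoring case and punctuation."""
    # Keep only lowercase letters, digits and spaces, then split into words.
    words = re.sub('[^a-z0-9 ]+', '', text.lower()).split()
    trigrams = []
    for word in words:
        # Slide a window of width 3 over the word; short words yield nothing.
        trigrams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return trigrams

print(word_trigrams('Some, phrase!'))
# ['som', 'ome', 'phr', 'hra', 'ras', 'ase']
```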

Some code has already been written to help you with this assignment. Put your code in the places marked by the comment YOUR CODE HERE. Tests are provided for several basic functions. Make sure that all tests pass before you move on.

There are some prerequisites that you have to install for this notebook to run. We are going to need numpy, pandas, tensorflow and scipy. Install them via whichever method you find most convenient (pip, conda, etc.).

Ask on the forum if you need help.

In [1]:
# Import all the libraries.

import numpy as np
import pandas as pd
import tensorflow as tf

from scipy.spatial import cKDTree
from tensorflow import keras

import re
import string

from functools import partial



In [2]:
# Define necessary constants.

ALPHABET = ' ' + string.ascii_lowercase + string.digits
ALPHABET_SIZE = len(ALPHABET)

INEXISTANT_TRIGRAM = 0
TOTAL_TRIGRAMS = ALPHABET_SIZE ** 3 + 1
DOCUMENTS_PER_GROUP = 5

# We will read our dataset chunk by chunk.
CHUNK_SIZE = 10**4

# Queries and titles have arbitrary length, but to simplify things,
# we will truncate them if they are very long, or pad them
# with INEXISTANT_TRIGRAM if they are too short.
QUERY_PADDING_SIZE = 40
TITLE_PADDING_SIZE = 80

In [8]:
# Neural networks use numbers as inputs, so let us convert trigrams to numbers.
# Take a look at tests below to get the idea of this encoding.
# This function should not return 0, because 0 is reserved for INEXISTANT_TRIGRAM.
char_to_index = dict(zip(ALPHABET, range(ALPHABET_SIZE)))

def trigram_to_index(trigram):
    a, b, c = trigram
    index = 1 + char_to_index[c] + ALPHABET_SIZE * char_to_index[b] + ALPHABET_SIZE ** 2 * char_to_index[a]
    return index

In [9]:
assert trigram_to_index('   ') == 1
assert trigram_to_index('  a') == 2
assert trigram_to_index(' aa') == 39
assert trigram_to_index('aaa') == 1408
assert trigram_to_index('aa ') == 1407
assert trigram_to_index('zzz') == 36583
assert trigram_to_index('000') == 37990
assert trigram_to_index('7f ') == 46769
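
The encoding above treats a trigram as a three-digit number in base `ALPHABET_SIZE`, shifted by one so that 0 stays free for padding. As a sanity check, the mapping can be inverted. The sketch below redefines the constants locally so that it runs on its own; `index_to_trigram` is a hypothetical helper that the assignment does not require.

```python
import string

ALPHABET = ' ' + string.ascii_lowercase + string.digits
ALPHABET_SIZE = len(ALPHABET)  # 37 characters

def index_to_trigram(index):
    """Invert the trigram encoding: index in [1, ALPHABET_SIZE ** 3] -> trigram."""
    # Undo the +1 shift that reserves 0 for the padding trigram.
    value = index - 1
    # Peel off base-37 digits, least significant (last character) first.
    value, c = divmod(value, ALPHABET_SIZE)
    a, b = divmod(value, ALPHABET_SIZE)
    return ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

print(repr(index_to_trigram(1408)))   # 'aaa'
print(repr(index_to_trigram(46769)))  # '7f '
```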

In [28]:
# The original input text may contain punctuation, different case, extra spaces. 
# Let's convert everything to lower case and filter out all characters except [a-z0-9 ].
import re
def filter_text(text):
    return re.sub(' {2,}', ' ', re.sub('[^a-z0-9 ]+', '', text.lower())).strip()

In [29]:
assert filter_text('AAA') == 'aaa'
assert filter_text('A   B') == 'a b'
assert filter_text('  123  asdf    ') == '123 asdf'
assert filter_text('  !@#$%^&*()-  +=`~,./?_  ') == ''

In [46]:
# With the help of the two functions defined above we are ready to implement 
# a function that converts arbitrary text to trigrams. 
# Note that it should return np.array.
# Truncate/pad the number of output trigrams to padding_size with INEXISTANT_TRIGRAM.

def text_to_trigram_vector(text, padding_size):
    vec = []
    text = ' ' + filter_text(text) + ' '
    for i in range(min(len(text) - 2, padding_size)):
        if text[i + 1] != ' ':
            vec.append(trigram_to_index(text[i: i + 3]))
    return np.asarray(vec + [INEXISTANT_TRIGRAM] * (padding_size - len(vec)))

In [48]:
assert type(text_to_trigram_vector('aaa', 3)) is type(np.array([]))
assert np.array_equal(text_to_trigram_vector('aaa', 3), np.array([39, 1408, 1407]))
assert np.array_equal(text_to_trigram_vector('aaa', 2), np.array([39, 1408]))
assert np.array_equal(text_to_trigram_vector('aaa', 4), np.array([39, 1408, 1407, 0]))
assert np.array_equal(text_to_trigram_vector('aaa aaa', 7), np.array([39, 1408, 1407, 39, 1408, 1407, 0]))
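
Truncating and padding to a fixed length is what lets texts of different lengths be stacked into one rectangular array. The step can be shown in isolation on plain Python lists. `pad_or_truncate` is a hypothetical helper used only for illustration, and 0 stands in for `INEXISTANT_TRIGRAM`.

```python
def pad_or_truncate(values, size, pad_value=0):
    """Return a list of exactly `size` items."""
    # Drop everything beyond `size`, then fill the remainder with pad_value.
    clipped = list(values)[:size]
    return clipped + [pad_value] * (size - len(clipped))

print(pad_or_truncate([39, 1408, 1407], 2))  # [39, 1408]
print(pad_or_truncate([39, 1408, 1407], 4))  # [39, 1408, 1407, 0]
```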

In [49]:
# df is pandas dataframe. 
# This function computes and adds two new columns to df: 
# 'query_vector' and 'title_vector' which are numerical 
# trigram representations of query and title columns.

def add_vectors(df):
    df['query_vector'] = df['query'].apply(partial(text_to_trigram_vector, padding_size=QUERY_PADDING_SIZE))
    df['title_vector'] = df['title'].apply(partial(text_to_trigram_vector, padding_size=TITLE_PADDING_SIZE))

In [50]:
# This is the data generator function. It reads input csv file in chunks and generates
# data for training. As in the original paper, we take one query, duplicate it 5 times, then take
# the corresponding document title, put it in the first position, and put "negative" document titles in the
# 2nd - 5th positions. 
#
# There are several ways to construct negatives. The first one is to obtain them 
# from user sessions, but this way is not cheap. The second one, soft negatives, is to
# randomly choose titles of other documents. And the third, hard negatives, is to choose such
# titles of other documents that increase the error of the current model. Here we implement the
# second and third approaches in the chunk_size window.

def data_generator(filename, batch_size, chunk_size, query_model=None, document_model=None):
    df_reader = pd.read_csv(filename, iterator=True, chunksize=chunk_size)
    for df in df_reader:
        df = df.dropna()
        add_vectors(df)
        
        query_groups = []
        title_groups = []
        targets = []

        for row in df.itertuples():
            index, query, positive_title, query_vector, positive_title_vector = row
            negative_title_vectors = []
            
            good_df = df[df['query'] != query]
            if query_model is None or document_model is None:
                negative_title_vectors.extend(good_df['title_vector'].sample(DOCUMENTS_PER_GROUP - 1))
            else:
                query_embedding = predict_for_one(query_model, query_vector)
                how_close = lambda title_vector: np.inner(query_embedding, predict_for_one(document_model, title_vector))
                good_subsample = good_df['title_vector'].sample(batch_size * 2)
                closest = good_subsample.apply(how_close).nlargest(DOCUMENTS_PER_GROUP - 1)
                negative_title_vectors.extend(good_subsample[closest.index])
            query_groups.append([query_vector] * DOCUMENTS_PER_GROUP)
            title_groups.append([positive_title_vector] + negative_title_vectors)

            probabilities = np.zeros(DOCUMENTS_PER_GROUP)
            probabilities[0] = 1

            targets.append(probabilities)

            if len(query_groups) == batch_size:
                yield [np.array(query_groups), np.array(title_groups)], np.array(targets)
                query_groups = []
                title_groups = []
                targets = []
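
For the soft-negative case, building one training group comes down to picking random titles that belong to other queries. The sketch below shows that step on plain Python lists. `make_group` and the toy `pairs` are made up for illustration; the generator above does the same thing with trigram vectors inside a pandas chunk.

```python
import random

def make_group(pairs, position, group_size=5, rng=random):
    """Return (titles, targets) for the query at `position`.

    titles[0] is the clicked title; the rest are random titles of other queries.
    """
    query, positive_title = pairs[position]
    # Candidates are titles attached to a different query.
    candidates = [title for other_query, title in pairs if other_query != query]
    negatives = rng.sample(candidates, group_size - 1)
    # One-hot target: only the first document in the group is relevant.
    targets = [1.0] + [0.0] * (group_size - 1)
    return [positive_title] + negatives, targets

# Toy data: ten distinct queries, each with its own title.
pairs = [('q{}'.format(i), 'title {}'.format(i)) for i in range(10)]
titles, targets = make_group(pairs, position=3, rng=random.Random(0))
print(titles[0], targets)
```

Hard negatives differ only in how the four negatives are chosen: instead of a uniform sample, the candidates that the current model scores closest to the query are kept.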

In [51]:
# Helper functions.

def cosine_distance(query_semantic_feature, document_semantic_feature):
    distance = tf.losses.cosine_distance(query_semantic_feature, document_semantic_feature, axis=-1, reduction=tf.losses.Reduction.NONE)
    distance = tf.reshape(distance, [-1, DOCUMENTS_PER_GROUP])
    return distance

def sum_over_axis(axis):
    return lambda x: tf.reduce_sum(x, axis=axis)

def mean_over_axis(axis):
    return lambda x: tf.reduce_mean(x, axis=axis)

Now we are ready to define our DSSM model. We will use [tensorflow](https://www.tensorflow.org) and its [keras API](https://www.tensorflow.org/guide/keras) for that purpose. Our model will consist of three submodels that share weights: full_model, query_model, and title_model. The full model is the whole DSSM model. The query model is the part of it that calculates query embeddings. The title model calculates title embeddings.

In [64]:
def dssm_models(embedding_size, query_layers_params, title_layers_params):
    # Construct inputs.
    query_input = keras.Input(shape=(DOCUMENTS_PER_GROUP, QUERY_PADDING_SIZE,), name='queries_input')
    title_input = keras.Input(shape=(DOCUMENTS_PER_GROUP, TITLE_PADDING_SIZE), name='titles_input')

    # Define functional layer for common embeddings.
    embedding_layer = keras.layers.Embedding(TOTAL_TRIGRAMS, embedding_size, name='common_embedding')
    
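    # Define functional dense layers for queries.
    query_layers = []
    for i, units in enumerate(query_layers_params[:-1]):
        query_layers.append(keras.layers.Dense(units, activation='tanh', name='query_projection_{}'.format(i)))
    # NOTE: `units` still holds the last value from the loop above, so this layer
    # reuses that width rather than query_layers_params[-1].
    query_layers.append(keras.layers.Dense(units, activation='tanh', name='query_semantic'))
    
    # Define functional dense layers for titles.
    title_layers = []
    for i, units in enumerate(title_layers_params[:-1]):
        # NOTE: this appends to query_layers, not title_layers. The expected summary
        # below lists title_projection_0 as connected to query_semantic, so the
        # wiring is left unchanged here.
        query_layers.append(keras.layers.Dense(units, activation='tanh', name='title_projection_{}'.format(i)))
    title_layers.append(keras.layers.Dense(units, activation='tanh', name='title_semantic'))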
    
    # Construct neural network for queries.                 
    query_word_embeddings = embedding_layer(query_input)
    query_embedding = keras.layers.Lambda(mean_over_axis(-2), name='query_word_embeddings_sum')(query_word_embeddings)
    query_inner_layer = query_embedding
    for layer in query_layers:
        query_inner_layer = layer(query_inner_layer)
    query_semantic_feature = keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=-1), name='normalized_query_semantic_feature')(query_inner_layer)

    # Construct neural network for titles.
    title_word_embeddings = embedding_layer(title_input)
    title_embedding = keras.layers.Lambda(mean_over_axis(-2), name='title_word_embeddings_sum')(title_word_embeddings)
    title_inner_layer = title_embedding
    for layer in title_layers:
        title_inner_layer = layer(title_inner_layer)
    title_semantic_feature = keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=-1), name='normalized_title_semantic_feature')(title_inner_layer)

    # Combine results to evaluate relevance.
    relevance = keras.layers.Multiply(name='relevance_mult')([query_semantic_feature, title_semantic_feature])
    relevance =  keras.layers.Lambda(sum_over_axis(-1), name='relevance_sum')(relevance)

    # Calculate posterior probabilities with a softmax over the documents in each group.
    posterior_probability = keras.layers.Softmax()(relevance)
    
    # Define separate models with common weights.
    full_model = keras.Model(inputs=[query_input, title_input], outputs=posterior_probability)
    query_model = keras.Model(inputs=query_input, outputs=query_semantic_feature)
    title_model = keras.Model(inputs=title_input, outputs=title_semantic_feature)
    
    full_model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])
    query_model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
                        loss='categorical_crossentropy',
                        metrics=['accuracy'])
    title_model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
                        loss='categorical_crossentropy',
                        metrics=['accuracy'])
    
    return full_model, query_model, title_model
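
The last two steps of the model can be checked by hand on tiny vectors: the dot product of L2-normalized vectors (which equals their cosine similarity), followed by a softmax over the documents of a group. The sketch below uses plain Python and made-up three-dimensional vectors. Like the model above, it omits the smoothing factor that the original paper applies before the softmax.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def posterior(query, documents):
    """Softmax over cosine similarities between one query and its documents."""
    q = l2_normalize(query)
    relevance = [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in documents]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(relevance)
    weights = [math.exp(r - top) for r in relevance]
    total = sum(weights)
    return [w / total for w in weights]

# Cosine similarities with the query are 1, 0 and -1 respectively.
probabilities = posterior([1.0, 0.0, 0.0],
                          [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [-1.0, 0.0, 0.0]])
print(probabilities)
```

Because cosine similarities lie in [-1, 1], an unscaled softmax over a group of 5 can assign the positive document a probability of at most e²/(e² + 4) ≈ 0.65, so the training loss never reaches zero.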

In [65]:
# The models are trained on groups of 5 query-document pairs, but sometimes we need
# the output for a single input vector. This helper repeats the vector to fill
# a whole group and returns the first result.

def predict_for_one(model, inp):
    return model.predict(inp[np.newaxis, np.newaxis, :].repeat(DOCUMENTS_PER_GROUP, axis=1))[0][0]

In [104]:
# Let us define a class, that will use the trained models to index documents
# and rank them for a given search query.
# For every document we will calculate its title embedding and store it in KDTree,
# a data structure that supports fast search of top_k nearest neighbors of an arbitrary vector.
# During search a query is also converted to its embedding and then we find top_k documents with closest embeddings.

class Searcher:
    def __init__(self, query_model, document_model):
        self.query_model = query_model
        self.document_model = document_model
    
    def index(self, titles):
        self.titles = titles
        # Map titles to trigrams with correct padding.
        title_vectors = keras.backend.variable(np.asarray(
            [text_to_trigram_vector(t, TITLE_PADDING_SIZE) for t in titles]
        ))
        # Use document_model to predict embeddings for each title vector.
        title_embeddings = keras.backend.eval(self.document_model(title_vectors))
        # Store embeddings in KDTree for fast search.
        self.titles_tree = cKDTree(np.vstack(title_embeddings))
        
    def search(self, query, top=5):
        # Map query to trigrams with correct padding.
        query_vector = keras.backend.variable(np.asarray(
            [text_to_trigram_vector(query, QUERY_PADDING_SIZE)]
        ))
        # Use query_model to predict the embedding for the query vector.
        query_embedding = keras.backend.eval(self.query_model(query_vector))[0]
        # Run top_k query. 
        title_indices = self.titles_tree.query(query_embedding, top)[1]
        return title_indices
    
    def search_titles(self, query, top=5):
        # Same as search, but returns titles, not their indices in the index.
        return self.titles[self.search(query, top)]
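
A KD-tree query ranks neighbours by Euclidean distance by default, while the model is trained on cosine similarity. For L2-normalized embeddings the two orderings coincide, because ||q - d||² = 2 - 2 q·d. The brute-force sketch below illustrates this on made-up 2-D unit vectors. `top_k_euclidean` and `top_k_cosine` are hypothetical helpers and are not a replacement for `cKDTree`.

```python
import math

def unit(angle_degrees):
    """Unit vector in 2-D at the given angle."""
    angle = math.radians(angle_degrees)
    return (math.cos(angle), math.sin(angle))

def top_k_euclidean(query, vectors, k):
    """Indices of the k vectors with the smallest Euclidean distance to query."""
    distance = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    return sorted(range(len(vectors)), key=lambda i: distance(vectors[i]))[:k]

def top_k_cosine(query, vectors, k):
    """Indices of the k vectors with the largest dot product with query."""
    similarity = lambda v: sum(a * b for a, b in zip(query, v))
    return sorted(range(len(vectors)), key=lambda i: -similarity(vectors[i]))[:k]

documents = [unit(a) for a in (0, 45, 90, 180, 270)]
query = unit(30)
print(top_k_euclidean(query, documents, 3))  # [1, 0, 2]
print(top_k_cosine(query, documents, 3))     # [1, 0, 2]
```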

In [115]:
# We need a function to evaluate the quality of our searcher.
from tqdm import tqdm_notebook as tqdm

def evalute_searcher_quality(searcher, df, top=5):
    hits = 0
    total = 0
    for row in df.itertuples():
        print(total)
        index, query, title = row
        if title in searcher.search_titles(query, top):
            hits += 1
        total += 1
    return hits / total
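
The metric computed above is the share of queries whose clicked title appears among the first `top` results. A minimal sketch on made-up rankings shows how it behaves as `top` grows. `hit_rate_at_k` is a hypothetical stand-alone version that takes precomputed result lists instead of a searcher.

```python
def hit_rate_at_k(ranked_results, relevant, k):
    """Fraction of queries whose relevant item is within the first k results."""
    hits = sum(1 for results, target in zip(ranked_results, relevant)
               if target in results[:k])
    return hits / len(relevant)

# Three queries; the relevant item 'a' is ranked 1st, 2nd and 3rd respectively.
ranked = [['a', 'b', 'c'], ['b', 'a', 'c'], ['c', 'b', 'a']]
relevant = ['a', 'a', 'a']
print(hit_rate_at_k(ranked, relevant, 1))  # 0.3333333333333333
print(hit_rate_at_k(ranked, relevant, 3))  # 1.0
```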

In [106]:
# We will train our model on the training data and then evaluate it on the validation dataset.

df_val = pd.read_csv('en-query-title-unique-validation.csv')

In [107]:
full_model, query_model, document_model = dssm_models(embedding_size=128, 
                                                      query_layers_params=(128, 64), 
                                                      title_layers_params=(128, 64))
print(full_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
queries_input (InputLayer)      (None, 5, 40)        0                                            
__________________________________________________________________________________________________
common_embedding (Embedding)    multiple             6483712     queries_input[0][0]              
                                                                 titles_input[0][0]               
__________________________________________________________________________________________________
query_word_embeddings_sum (Lamb (None, 5, 128)       0           common_embedding[0][0]           
__________________________________________________________________________________________________
titles_input (InputLayer)       (None, 5, 80)        0                                            
__________

**Make sure that the previous cell outputs something like this:**
```
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
queries_input (InputLayer)      (None, 5, 40)        0                                            
__________________________________________________________________________________________________
common_embedding (Embedding)    multiple             6483712     queries_input[0][0]              
                                                                 titles_input[0][0]               
__________________________________________________________________________________________________
query_word_embeddings_sum (Lamb (None, 5, 128)       0           common_embedding[0][0]           
__________________________________________________________________________________________________
titles_input (InputLayer)       (None, 5, 80)        0                                            
__________________________________________________________________________________________________
query_projection_0 (Dense)      (None, 5, 128)       16512       query_word_embeddings_sum[0][0]  
__________________________________________________________________________________________________
query_semantic (Dense)          (None, 5, 128)       16512       query_projection_0[0][0]         
__________________________________________________________________________________________________
title_word_embeddings_sum (Lamb (None, 5, 128)       0           common_embedding[1][0]           
__________________________________________________________________________________________________
title_projection_0 (Dense)      (None, 5, 128)       16512       query_semantic[0][0]             
__________________________________________________________________________________________________
title_semantic (Dense)          (None, 5, 128)       16512       title_word_embeddings_sum[0][0]  
__________________________________________________________________________________________________
normalized_query_semantic_featu (None, 5, 128)       0           title_projection_0[0][0]         
__________________________________________________________________________________________________
normalized_title_semantic_featu (None, 5, 128)       0           title_semantic[0][0]             
__________________________________________________________________________________________________
relevance_mult (Multiply)       (None, 5, 128)       0           normalized_query_semantic_feature
                                                                 normalized_title_semantic_feature
__________________________________________________________________________________________________
relevance_sum (Lambda)          (None, 5)            0           relevance_mult[0][0]             
__________________________________________________________________________________________________
softmax (Activation)            (None, 5)            0           relevance_sum[0][0]              
==================================================================================================
Total params: 6,549,760
Trainable params: 6,549,760
Non-trainable params: 0
__________________________________________________________________________________________________
None
```

In [108]:
# To save you time, we provide a pretrained model. 
# It was trained on a larger dataset for a couple of days,
# first on soft negatives, then on hard negatives.
# Then the weights of the last layers were reset.

full_model.load_weights('full_model_trained_weights.h5')

In [109]:
# Let's try to index the validation dataset with our Searcher.

searcher = Searcher(query_model, document_model)
searcher.index(df_val['title'].values)

In [110]:
# Perform a sanity check. Because some weights were reset, 
# the results should be random for now, so it is OK to see something irrelevant like
# 'brandon colby wikipedia', 'using bibtex', 'symfony certification',
# 'tpv compound srl', 'turbulence' or anything else not connected to python.

searcher.search_titles('python')

array(['brandon colby wikipedia', 'using bibtex', 'symfony certification',
       'tpv compound srl', 'turbulence'], dtype=object)

In [117]:
df_val.shape

(5000, 2)

In [116]:
# Evaluate the pretrained model. Results are going to be very bad, mostly zeros, but we will fix it later on.

for top in (1, 5, 10):
    print('top{}: {}'.format(top, evalute_searcher_quality(searcher, df_val, top)))

0
1
2
3
4
5
6
7
8
9
10


KeyboardInterrupt: 

In [None]:
# Define a generator for soft negatives.

random_negatives_generator = data_generator('en-query-title-unique-train-short.csv', batch_size=512, chunk_size=10**4)

In [None]:
# And start the training process. It should take about 5-15 minutes.

full_model.fit_generator(random_negatives_generator, steps_per_epoch=100, epochs=1)

In [None]:
# Now repeat the indexing of the validation dataset. 

searcher = Searcher(query_model, document_model)
searcher.index(df_val['title'].values)

In [None]:
# This time the sanity check should pass. Make sure that python is mentioned in the titles.

searcher.search_titles('python')

In [None]:
# Your goal is to achieve over 0.70 top5 on the evaluation dataset, so make sure that validation dataset
# provides you with >0.60 top1, >0.70 top5, >0.75 top10.

for top in (1, 5, 10):
    print('top{}: {}'.format(top, evalute_searcher_quality(searcher, df_val, top)))

# Submit for evaluation

To score your achievement, you have to send the results for the test dataset via the submit page on Coursera.

In [None]:
# Read the test data.

df_test = pd.read_csv('en-query-title-unique-test-student.csv')

In [None]:
# Index it.

searcher = Searcher(query_model, document_model)
searcher.index(df_test['title'].values)

In [None]:
# Prepare the submission dataframe.

submission = df_test['query'].apply(lambda q: pd.Series(searcher.search(q)))

In [None]:
# Save the submission in the current directory.

submission.to_csv('submission.csv', index=False)