## DSSM and beyond

Повторяем идею из [Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/)


<img src="https://raw.githubusercontent.com/v-liaha/v-liaha.github.io/master/assets/dssm.png" width=600>

В качестве энкодера используем **conv - maxpooling**

Скачиваем данные [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset)

**Описание данных:**

* id - the id of a training set question pair
* qid1, qid2 - unique ids of each question (only available in train.csv)
* question1, question2 - the full text of each question
* is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

In [0]:
import os
import re

import tensorflow as tf
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
# tf.enable_eager_execution()

Using TensorFlow backend.


In [0]:
# choose the GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

In [0]:
time_steps = 12
vocab_size = 7000

**Задание 1**

Написать функцию, которая приводит строку к нижнему регистру, оставляет запятые, числа, вопросительный и восклицательный знаки

In [0]:
def tokenize_string(string):
    
    re.sub(r'[^A-Za-z,?!]', '', string).lower()
    
    return string


def vectorize(data, tokenizer, time_steps=time_steps):
    data = tokenizer.texts_to_sequences(data)
    data = pad_sequences(data, maxlen=time_steps, padding='post')
    return data

## Обработка данных

Поменяем постановку задачи: теперь вместо того, чтобы предсказывать, с какой вероятностью данные примеры являются дубликатами, будем находить дубликаты среди пула примеров.

In [0]:
# nrows -- сколько строк с *.csv файла загрузить в память
data = pd.read_csv('questions.csv', nrows=1000)

# оставляем только дубликаты
data = data[data['is_duplicate'] == 1]
data = data.dropna()
data = data.rename({'question1': 'query', 'question2': 'd+'}, axis=1)

# очищаем данные от шума
data['query'] = data['query'].apply(lambda x: tokenize_string(x))
data['d+'] = data['d+'].apply(lambda x: tokenize_string(x))

# создаем K=4 не дубликатов для данного примера
data['d1-'] = np.random.permutation(data['d+'].values)
data['d2-'] = np.random.permutation(data['d+'].values)
data['d3-'] = np.random.permutation(data['d+'].values)
data['d4-'] = np.random.permutation(data['d+'].values)

# первый пример всегда является дубликатом, все остальные --- нет
y = np.zeros((data.shape[0], 5), dtype=int)
y[:,0] = 1

In [0]:
data.head()

Unnamed: 0,id,qid1,qid2,query,d+,is_duplicate,d1-,d2-,d3-,d4-
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,Which are best mobile phones to buy under 15000?,What books are worth reading in early 20s?,What jobs are available with a bachelor’s degr...,What is the way to watch Comedy Nights with Ka...
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1,Why Central Govt banned old 500 and 1000 Rs no...,How can I efficiently learn while sleeping?,How can I teach myself how to sing?,Who will disrupt Bloomberg?
11,11,23,24,How do I read and find my YouTube comments?,How can I see all my Youtube comments?,1,Does eating prunes help with constipation?,How do I potty train my two-month-old Labrador...,How can you play a Blu Ray DVD on a regular DV...,What are the best places to visit in Kerala fo...
12,12,25,26,What can make Physics easy to learn?,How can you make physics easy to learn?,1,What are some high paying jobs for a fresher w...,How can I use Twitter for business?,Why does Quora mark my questions as needing im...,How do I become a good computer science engineer?
13,13,27,28,What was your first sexual experience like?,What was your first sexual experience?,1,What are some of the products made from crude ...,Is it healthy to eat a whole avocado every day?,How do I become a good computer science engineer?,How do I potty train my two-month-old Labrador...


In [0]:
data.shape

(380, 10)

In [0]:
# фитим токенайзер

corpus = data['query'].tolist() + data['d+'].tolist()
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(corpus)

In [0]:
# векторизуем данные

q = vectorize(data['query'].values, tok)
d0 = vectorize(data['d+'].values, tok)
d1 = vectorize(data['d1-'].values, tok)
d2 = vectorize(data['d2-'].values, tok)
d3 = vectorize(data['d3-'].values, tok)
d4 = vectorize(data['d4-'].values, tok)

In [0]:
# делим датасет на обучение и валидацию

x = np.hstack((q, d0, d1, d2, d3, d4)).reshape((-1, 6, time_steps))
xtr, xev, ytr, yev = train_test_split(x, y, test_size=0.1, random_state=24)

In [0]:
print(x.shape)

(380, 6, 12)


In [0]:
x[0]

array([[ 230,  445,  231,   13,  445,  925,    2,   21,   36,  446,   49,
          44],
       [ 230,  231,   13, 1204,   10,  326,    2,   21,  158,  446,   49,
          44],
       [  30,   12,   15, 1297, 1298,    6,  378,  124, 1299,    0,    0,
           0],
       [   2,  137,   12, 1339, 1340,   10,  292,  676,    0,    0,    0,
           0],
       [   2,  123,   12,  765,   32,    8, 1288, 1289,   10,  608,  269,
           0],
       [   1,   63,    6,  201,  287,  652,   32,  653,   93,   19,    1,
        1324]], dtype=int32)

## input_fn

С помощью tf.data создаем итератор, который будет подавать данные в модель

In [0]:
def expand_x(x):
    return {'q': x[:,0],
            'd0': x[:,1],
            'd1': x[:,2],
            'd2': x[:,3],
            'd3': x[:,4],
            'd4': x[:,5]}

# функция, которая подает данные в модель
def input_fn(x, labels, params, is_training):
    dataset = tf.data.Dataset.from_tensor_slices((x, labels))

    if is_training:
        dataset = dataset.shuffle(buffer_size=params['train_size'])
        dataset = dataset.repeat(count=params['num_epochs'])

    dataset = dataset.batch(params['batch_size'])
    dataset = dataset.map(lambda x, y: (expand_x(x), y))
    dataset = dataset.prefetch(buffer_size=100)
    return dataset

# Model

**Задание 2**

Реализуйте функцию, котора считает косинусную близость между тензорами размера **(batch_size, dim)**

In [0]:
# hint: try to use tf.nn.l2_normalize, tf.multiply

def cosine_sim(a, b):
    """
    Подсчет косинусной близости между двумя тензорами размера (batch_size, dim)
    """
    normalize_a = tf.nn.l2_normalize(a,0)        
    normalize_b = tf.nn.l2_normalize(b,0)
    
    # tf.multiply(normalize_a,normalize_b) должен иметь shape=(batch_size, dim)=(256, 12)
    
    cos_sim = tf.reduce_sum(tf.multiply(normalize_a,normalize_b), axis=1)
    # a cos_sim с shape = (256)

    return cos_sim

**Задание 3**

Реализуйте энкодер, который переводит тензор размера **(batch_size, time_steps, emb_size)** в тензор **(batch_size, new_dim)**

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/73d826d4c2363701b88e3e234fe3b8756c0f9671/3-Figure1-1.png" width=600>


Применить два типа свертки: **[kernel_size=3, strides=2, filters=32], [kernel_size=5, strides=3, filters=32]**

Над выходами **average-pooling, max-pooling** соответственно. Полученные тензоры сконкатенировать.

In [0]:
def build_model(features, params, is_training):


  emb_matrix = tf.get_variable('embedding_matrix',
                             shape=[params['vocab_size'], params['emb_size']],
                             dtype=tf.float32)

  def encode(sentences):
    """
    Args:
        sentences: (batch_size, time_steps) последовательности индексов
    Returns:
        out: (batch_size, new_dim) представление текста в новом пространстве
    """

    # hints: use tf.nn.embedding_lookup, tf.layers.conv1d, tf.reduce_max
    # tf.reduce_mean, tf.concat
    embs = tf.nn.embedding_lookup(emb_matrix, sentences)

    conv_1 = tf.layers.conv1d(embs,
                      filters=32,
                      kernel_size=3,
                      strides=2)
    conv_2 = tf.layers.conv1d(embs,
                      filters=32,
                      kernel_size=5,
                      strides=3)
    pool_1 = tf.reduce_mean(conv_1, axis=1)
    pool_2 = tf.reduce_max(conv_2, axis=1)
    cc = tf.concat([pool_1, pool_2], axis=1)
    out = tf.layers.dense(cc, 29, activation=tf.nn.relu)
    return out


  # энкодим все документы
  encoded_features = {}        

  with tf.variable_scope('enc'):
      encoded_features['q'] = encode(features['q'])

  for key, value in features.items():
      if key != 'q':
          with tf.variable_scope('enc', reuse=True):
              encoded_features[key] = encode(value)

  # считаем косинусные близости между q и всеми документами
  cos_sims = {}

  for key, value in encoded_features.items():
      if key != 'q':
          cos_sims[key] = cosine_sim(encoded_features['q'], encoded_features[key])

  # конкатинируем косинусные близости

  to_concatenate = [cos_sims['d0'], cos_sims['d1'], cos_sims['d2'], cos_sims['d3'], cos_sims['d4']]
  concatenated = tf.stack(to_concatenate, axis=1)
  
  return concatenated, encoded_features

Функция потерь:

$$J(\theta) = - \sum_i y_i \ln(\hat{y_i})$$

Мы хотим, чтобы $cosine\_similarity(q, d_0) = 1$, а $cosine\_similarity(q, d_j) = 0$, где $j \in \{1,2,3,4\}$, тогда лосс будет стремиться к нулю.


**Задание 4**

Реализовать метрики:

* Accuracy
* MSE

In [0]:
def model_fn(features, labels, mode, params):
    
    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    
    with tf.variable_scope('model'):
        logits, _ = build_model(features, params, is_training)
        
    preds = tf.argmax(logits, axis=1)
    
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {'preds': preds, 'logits': logits}
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions)
    
    # hints: tf.equal, tf.square, tf.substract, tf.cast, tf.reduce_mean
    labels = tf.cast(labels, tf.float32)
    accuracy = tf.reduce_mean(tf.cast(tf.abs(labels - logits)<0.5, tf.float32))
    mse = tf.reduce_mean(tf.cast(tf.square(labels - logits), tf.float32))
    
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
    
    if mode == tf.estimator.ModeKeys.EVAL:
        with tf.variable_scope('metrics'):
            eval_metrics = {'accuracy': tf.metrics.mean(accuracy),
                           'mse': tf.metrics.mean(mse)}
        
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metrics)
    
    tf.summary.scalar('accuracy', accuracy)
    tf.summary.scalar('mse', mse)
    tf.summary.scalar('loss', loss)
    
    optimizer = tf.train.AdamOptimizer()
    
    global_step = tf.train.get_global_step()
    train_op = optimizer.minimize(loss, global_step=global_step)
    
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

In [0]:
model_params = {
    'vocab_size': vocab_size,
    'emb_size': 300
}

config = tf.estimator.RunConfig(tf_random_seed=123,
                                model_dir='masha',
                                save_summary_steps=5)

estimator = tf.estimator.Estimator(model_fn,
                                   params=model_params,
                                   config=config)

INFO:tensorflow:Using config: {'_model_dir': 'masha', '_tf_random_seed': 123, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa51d45da90>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [0]:
params = {
    'batch_size': 256,
    'num_epochs': 5,
    'train_size': int(len(xtr) * 0.9)
}

In [0]:
estimator.train(lambda: input_fn(xtr, ytr, params=params, is_training=True))

INFO:tensorflow:Calling model_fn.
emb_matrix.shape = (7000, 300)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 

<tensorflow.python.estimator.estimator.Estimator at 0x7fa51d5d9128>

In [0]:
eval_results = estimator.evaluate(lambda: input_fn(xev, yev, params=params, is_training=False))

for key, value in eval_results.items():
    print(f'{key}: {value}')

INFO:tensorflow:Calling model_fn.
emb_matrix.shape = (7000, 300)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 

In [0]:
preds = estimator.predict(lambda: input_fn(xev, yev, params=params, is_training=False))

In [0]:
logits = []

for el in preds:
    logits.append(el['logits'])
    
logits = np.array(logits, dtype=float)

INFO:tensorflow:Calling model_fn.
emb_matrix.shape = (7000, 300)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 32)
cc.shape = (?, 64)
out.shape = (?, 29)
embs.shape = (?, 12, 300)
conv_1.shape = (?, 5, 32)
conv_2.shape = (?, 3, 32)
pool_1.shape = (?, 32)
pool_2.shape = (?, 