# DRMM (Deep Relevance Matching Model)


Goal of this notebook: build a working DRMM architecture with PyTorch.

This takes two steps:

- build the preprocessing pipeline:
    - generate relevant and non-relevant document-query pairs for training
    - generate local interaction histograms at the document-query level
- build the DRMM architecture

For now, the interactions are local interactions over word embeddings, measured as the cosine similarity between the query word vectors and the document word vectors.
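As a minimal sketch of these local interactions, the snippet below computes the cosine similarity between every query term vector and every document term vector with NumPy. The toy 3-dimensional vectors and the helper name `interaction_matrix` are assumptions made for illustration; the notebook itself works with 300-dimensional embeddings.

```python
import numpy as np

def interaction_matrix(query_vecs, doc_vecs):
    """Cosine similarity between each query term (rows) and each document term (columns)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T

# Toy embeddings (hypothetical values): 2 query terms, 3 document terms.
query_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
doc_vecs = np.array([[2.0, 0.0, 0.0],
                     [0.0, -3.0, 0.0],
                     [1.0, 1.0, 0.0]])

sims = interaction_matrix(query_vecs, doc_vecs)
print(sims.shape)  # (2, 3)
```

Each row of `sims` holds the similarities of one query term with all document terms, which is the raw material that gets binned into a histogram.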

In [1]:
from gensim.models import KeyedVectors
import numpy as np
import matplotlib.pyplot as plt
from os import sep

%matplotlib inline

embeddings_path = "embeddings_wiki2017"
dataset_path = "data"

## Preprocessing

### Loading word embeddings

The word embeddings used here have the following characteristics:

- Gensim Continuous Skipgram
- vector size ${300}$
- window ${5}$
- trained on English Wikipedia, February 2017
- no lemmatization
- ${302866}$ words

http://vectors.nlpl.eu/repository/

In [2]:
model = KeyedVectors.load_word2vec_format(embeddings_path + sep + "model.bin", binary=True)

In [9]:
# gensim < 4.0 API; in gensim >= 4.0 use model.key_to_index instead of model.vocab
vocabulary = [w for w in model.vocab]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', vocabulary=vocabulary, binary=True)

### Retrieving the relevant/non-relevant pairs for each query

We build a dictionary keyed by query id; each value is a dictionary holding two lists:
- "relevant": ids of the documents relevant to the query.
- "irrelevant": ids of the documents not relevant to the query.

In [3]:
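paires = {}

with open(dataset_path + sep + "qrels.robust2004.txt", "r") as f:
    for line in f:
        # qrels line: query id, iteration, document id, relevance label
        fields = line.split()
        paires.setdefault(fields[0], {})
        paires[fields[0]].setdefault('relevant', [])
        paires[fields[0]].setdefault('irrelevant', [])
        # only label '1' counts as relevant here; every other label goes to 'irrelevant'
        if fields[-1] == '1':
            paires[fields[0]]["relevant"].append(fields[2])
        else:
            paires[fields[0]]["irrelevant"].append(fields[2])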
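One possible way to turn such a dictionary into training examples is to pair, for each query, a relevant document with a randomly drawn irrelevant one. The sketch below is an assumption about how this step could look, not the notebook's own method; the helper `sample_training_pairs`, the toy dictionary, and the document ids are hypothetical.

```python
import random

def sample_training_pairs(pairs, n_per_query, seed=0):
    """Return (query_id, relevant_doc_id, irrelevant_doc_id) triples."""
    rng = random.Random(seed)
    triples = []
    for query_id, docs in pairs.items():
        # A query needs at least one document of each kind to form a pair.
        if not docs["relevant"] or not docs["irrelevant"]:
            continue
        for _ in range(n_per_query):
            pos = rng.choice(docs["relevant"])
            neg = rng.choice(docs["irrelevant"])
            triples.append((query_id, pos, neg))
    return triples

# Toy dictionary with the same layout as the one built from the qrels file.
toy_pairs = {
    "301": {"relevant": ["DOC-A"], "irrelevant": ["DOC-B", "DOC-C"]},
    "308": {"relevant": [], "irrelevant": ["DOC-D"]},
}

triples = sample_training_pairs(toy_pairs, n_per_query=2)
print(triples)
```

Query "308" yields no triple because it has no relevant document in the toy data.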

### Retrieving the queries

They are stored as tuples ([keywords], [query text]). We keep only the keywords.

In [4]:
import ast
with open(dataset_path + sep + "robust2004.txt", "r") as f:
    queries = ast.literal_eval(f.read())
queries = {d:queries[d][0] for d in queries}
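To make the expected file layout concrete, here is a small sketch that parses a string shaped like the file content. The layout is an assumption inferred from the loading code (a Python dict literal mapping each query id to a `(keywords, text)` tuple), and the full-text strings are placeholders.

```python
import ast

# Hypothetical file content: a dict literal mapping query id -> (keywords, full text).
raw = """{
    "301": ("international organized crime", "<full text of query 301>"),
    "308": ("implant dentistry", "<full text of query 308>"),
}"""

parsed = ast.literal_eval(raw)
# Keep only the keywords, as in the cell above.
keywords = {query_id: parsed[query_id][0] for query_id in parsed}
print(keywords["301"])  # international organized crime
```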

In [6]:
print(queries["301"])
print(queries["308"])

international organized crime
implant dentistry


In [7]:
def query_term_maxlen(queries):
    return np.max([len(queries[q].split()) for q in queries])

### We can now build the histogram of interactions between the query embeddings and the document embeddings.

As an example, we use 4 bins: ${[-1, -0.5]}$ ${[-0.5, 0]}$ ${[0, 0.5]}$ ${[0.5, 1]}$.

**There are several ways to build a histogram**: count the values, count and then normalize...
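The variants can be compared on a toy example. The similarity values below are made up, and the log-scaled variant uses `np.log1p`, an assumption chosen here so that empty bins stay at zero. Note that `np.histogram` treats every bin as half-open except the last one, which also includes its right edge.

```python
import numpy as np

# Cosine similarities between one query term and the terms of a document (toy values).
sims = np.array([-0.7, -0.2, 0.1, 0.3, 0.4, 0.9, 1.0])

# 5 edges delimit the 4 bins [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1].
edges = np.linspace(-1.0, 1.0, 5)

counts, _ = np.histogram(sims, bins=edges)   # raw counts per bin
normalized = counts / counts.sum()           # counts turned into proportions
log_counts = np.log1p(counts)                # log-scaled counts

print(counts)  # [1 1 3 2]
```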

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

class DRMMinputGenerator:
    """
    Builds DRMM inputs for a (query, document) pair: one histogram of
    query-term/document-term cosine similarities per query term.
    """
    
    def __init__(self, query_term_maxlen, intervals, normalize=False):
        self.q_max_len = query_term_maxlen
        self.intervals = intervals
        self.normalize = normalize
        # `intervals` evenly spaced edges over [-1, 1], i.e. `intervals - 1` bins
        self.intvlsArray = np.linspace(-1, 1, self.intervals)
        
    def set_params(self, model_wv, vectorizer):
        self.model_wv = model_wv
        self.vectorizer = vectorizer
        
    def hist(self, query, document):
        """
        query, document: sparse 1 x |vocabulary| rows (e.g. from the vectorizer).
        Returns (histograms, query term embeddings).
        """
        X = []
        for i in query.nonzero()[1]:
            histo = []
            for j in document.nonzero()[1]:
                histo.append(cosine_similarity([self.model_wv.vectors[i]], [self.model_wv.vectors[j]])[0][0])
            histo, bin_edges = np.histogram(histo, bins=self.intvlsArray)
            if self.normalize and histo.sum() > 0:
                # skip normalization for empty histograms to avoid dividing by zero
                histo = histo / histo.sum()
            X.append(histo)
        # return the interaction histograms and the query term embeddings
        return np.array(X), np.array([self.model_wv.vectors[i] for i in query.nonzero()[1]])
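Calling `cosine_similarity` once per term pair can become slow on long documents. As a hedged alternative, the sketch below computes all similarities with one matrix product and bins each row afterwards. The function name `interaction_histograms`, its `n_bins` parameter, and the random vectors are assumptions for illustration, not part of the class above.

```python
import numpy as np

def interaction_histograms(query_vecs, doc_vecs, n_bins=4, normalize=False):
    """One histogram of cosine similarities per query term; shape (n_query_terms, n_bins)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1].
    sims = np.clip(q @ d.T, -1.0, 1.0)
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    hists = np.stack([np.histogram(row, bins=edges)[0] for row in sims]).astype(float)
    if normalize:
        totals = hists.sum(axis=1, keepdims=True)
        # Rows with no document term stay at zero instead of producing NaN.
        hists = np.divide(hists, totals, out=np.zeros_like(hists), where=totals > 0)
    return hists

rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(3, 300))   # 3 query terms (random stand-ins for embeddings)
doc_vecs = rng.normal(size=(50, 300))    # 50 document terms

hists = interaction_histograms(query_vecs, doc_vecs, n_bins=4, normalize=True)
print(hists.shape)  # (3, 4)
```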
        

## Model architecture

Based on <a href="https://dl.acm.org/citation.cfm?id=2983769">the paper</a>

Code adapted from <a href="https://github.com/sebastian-hofstaetter/neural-ranking-drmm/blob/master/neural-ranking/keras_model.py">this genius</a>

In [86]:
import keras
from keras.models import Sequential, Model
from keras.layers import Input, Embedding, Dense, Activation, Lambda, Permute, merge
from keras.layers import Reshape, Dot
from keras.activations import softmax

np.random.seed(1)

def build_keras_model(params):
    """
    Build the DRMM model: a feed-forward network applied to each query term's
    histogram, combined through a softmax term-gating network.
    """
    
    initializer_interactions = keras.initializers.RandomUniform(minval=-0.1, maxval=0.1, seed=11)
    initializer_gating = keras.initializers.RandomUniform(minval=-0.01, maxval=0.01, seed=11)
    
    # input: interaction histograms, one per query term
    interactions = Input(name='interactions', shape=(params['query_term_maxlen'], params['hist_size']))
    
    # input: one value per query term, used by the gating network
    query = Input(name='term_vector', shape=(params['query_term_maxlen'],1))

    # feed-forward part
    z = interactions
    for i in range(len(params['hidden_sizes'])):
        z = Dense(params['hidden_sizes'][i], kernel_initializer=initializer_interactions, name="dense_layer_{}".format(i))(z)
        z = Activation('tanh', name="activation_of_layer_{}".format(i))(z)

    z = Permute((2, 1))(z)
    z = Reshape((params['query_term_maxlen'],))(z)

    # term gating part
    q_w = Dense(1, kernel_initializer=initializer_gating, use_bias=False, name="gating_W")(query)
    q_w = Lambda(lambda x: softmax(x, axis=1), output_shape=(params['query_term_maxlen'],))(q_w)
    q_w = Reshape((params["query_term_maxlen"],))(q_w)

    # combination of softmax(query term idf) and feed forward result per query term
    out_ = Dot(axes=[1, 1], name='s')([z, q_w])

    model = Model(inputs=[query, interactions], outputs=[out_])

    return model
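To make the data flow of this graph easier to follow, here is a NumPy sketch of the same forward pass for a single query-document pair: a tanh feed-forward network scores each query term's histogram, a softmax over the gated term inputs produces weights, and the final score is their dot product. The function `drmm_score`, the random weights, and the idf-like values are assumptions for illustration; nothing here is trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def drmm_score(histograms, term_inputs, layers, gating_weight):
    """histograms: (n_terms, hist_size); term_inputs: (n_terms,), one value per query term."""
    z = histograms
    for W, b in layers:
        z = np.tanh(z @ W + b)          # applied independently to each query term
    z = z[:, 0]                         # last layer has size 1: one score per term
    gates = softmax(gating_weight * term_inputs)  # term gating weights, sum to 1
    return float(gates @ z)

rng = np.random.default_rng(1)
n_terms, hist_size = 4, 5
sizes = [hist_size, 5, 32, 1]           # input size followed by the hidden sizes
layers = [(rng.uniform(-0.1, 0.1, (a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

histograms = rng.integers(0, 10, size=(n_terms, hist_size)).astype(float)
idf = np.array([1.2, 0.4, 2.0, 0.7])    # hypothetical per-term values
score = drmm_score(histograms, idf, layers, gating_weight=0.01)
print(score)
```

Because every per-term score comes out of a tanh and the gates form a convex combination, the final score always lies between -1 and 1.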

# from https://github.com/faneshion/MatchZoo/blob/master/matchzoo/losses/rank_losses.py
from keras import backend as K  # explicit import: rank_hinge_loss uses K
from keras.backend import tf
from keras.losses import *
from keras.layers import Lambda
# y_true is IGNORED (!), you don't have to set a label to train (?)
# y_pred contains the complete batch (!)
#  -> the slicing splits the tensors in even and odd (pos and negative from the input)
#  -> VERY IMPORTANT: The input data must not be shuffled !! shuffle = False

def rank_hinge_loss(y_true, y_pred):

    y_pos = Lambda(lambda a: a[::2, :], output_shape= (1,))(y_pred)
    y_neg = Lambda(lambda a: a[1::2, :], output_shape= (1,))(y_pred)
    
    loss = K.maximum(0., 1. + y_neg - y_pos)
    return K.mean(loss)
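The even/odd slicing can be checked by hand with a NumPy version of the same computation. The helper `rank_hinge_loss_np` and the batch scores are assumptions for illustration; the batch is laid out as positive, negative, positive, negative, which is why the input must not be shuffled.

```python
import numpy as np

def rank_hinge_loss_np(scores, margin=1.0):
    """Pairwise hinge loss over a batch ordered as [pos, neg, pos, neg, ...]."""
    scores = np.asarray(scores, dtype=float).reshape(-1)
    pos, neg = scores[0::2], scores[1::2]
    return float(np.mean(np.maximum(0.0, margin + neg - pos)))

# Two (positive, negative) pairs: the first is ranked correctly with a margin,
# the second ranks the negative document above the positive one.
batch_scores = [2.0, 0.5, 0.25, 0.5]
loss = rank_hinge_loss_np(batch_scores)
print(loss)  # 0.625
```

The first pair contributes 0 and the second contributes 1.25, which gives the mean of 0.625.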


In [85]:
params = {
    "hidden_sizes": [5, 32, 1],
    "hist_size": 5,
    "query_term_maxlen": 4,
    "embedding_size": 300
}

model = build_keras_model(params)
model.summary()
#model.compile(loss=rank_hinge_loss, optimizer='adam')

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
interactions (InputLayer)       (None, 4, 5)         0                                            
__________________________________________________________________________________________________
dense_layer_0 (Dense)           (None, 4, 5)         30          interactions[0][0]               
__________________________________________________________________________________________________
activation_of_layer_0 (Activati (None, 4, 5)         0           dense_layer_0[0][0]              
__________________________________________________________________________________________________
dense_layer_1 (Dense)           (None, 4, 32)        192         activation_of_layer_0[0][0]      
__________________________________________________________________________________________________
activation