# DRMM (Deep Relevance Matching Model)


But de ce notebook: Construire une architecture DRMM fonctionnelle avec Pytorch.

Pour cela, 2 étapes:

- construire la chaîne de pré traitements:
    - générer des paires document-requête non pertinentes et pertinentes pour l'apprentissage
    - générer des histogrammes d'interaction locales au niveau document-requête
- construire l'architecture DRMM

Les interractions sont pour le moment des interactions locales sur des word embeddings et sont mesurées comme une similarité cosinus entre les vecteurs des mots de la requête et ceux du document.

In [1]:
from gensim.models import KeyedVectors
import numpy as np
import matplotlib.pyplot as plt
from os import sep

%matplotlib inline

embeddings_path = "embeddings_wiki2017"
dataset_path = "data"

## Pré traitements: 

### Récupérer des word embeddings 

Ce word embedding a les caractéristiques suivantes:

- Gensim Continuous Skipgram
- taille de vecteur ${300}$
- window ${5}$
- entrainé sur wikipédia février 2017 en langue anglaise
- pas de lemmatisation
- ${302866}$ mots

http://vectors.nlpl.eu/repository/

In [2]:
model = KeyedVectors.load_word2vec_format(embeddings_path + sep + "model.bin", binary=True)

In [9]:
vocabulary = [w for w in model.vocab]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', vocabulary=vocabulary, binary=True)

### On Récupère les paires de pertinence/non pertinence pour chaque requête 

On génère un dictionnaire qui contient pour chaque requête en clé, un dictionnaire contenant 2 listes:
- "relevant" contenant des id de document pertinents pour la requête.
- "irrelevant" contenant des id de document non pertinents pour la requête.

In [3]:
paires = {}

with open(dataset_path + sep + "qrels.robust2004.txt", "r") as f:
    for line in f:
        lol = line.split()
        paires.setdefault(lol[0], {})
        paires[lol[0]].setdefault('relevant', []) 
        paires[lol[0]].setdefault('irrelevant', [])
        if lol[-1] == '1':
            paires[lol[0]]["relevant"].append(lol[2])
        else:
            paires[lol[0]]["irrelevant"].append(lol[2])

### On récupère les requêtes:

Elles se trouvent sous forme de tuple ([mots clés], [texte de la requête]). On ne garde que les mots clés.

In [4]:
import ast
with open(dataset_path + sep + "robust2004.txt", "r") as f:
    queries = ast.literal_eval(f.read())
queries = {d:queries[d][0] for d in queries}

In [6]:
print(queries["301"])
print(queries["308"])

international organized crime
implant dentistry


In [7]:
def query_term_maxlen(queries):
    return np.max([len(queries[q].split()) for q in queries])

### On peut maintenant construire l'histogramme des interactions entre les embeddings de la requête et ceux du document.

On prend comme exemple 4 bins: ${[-1, -0.5]}$ ${[-0.5, 0]}$ ${[0, 0.5]}$ ${[0.5, 1]}$.

**Plusieurs manières de construire un histogramme**: compter le nombre de valeurs, compter puis normaliser...

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

class DRMMinputGenerator:
    """
    
    """
    
    def __init__(self, query_term_maxlen, intervals, normalize=False):
        self.q_max_len = query_term_maxlen
        self.intervals = intervals
        self.normalize = normalize
        self.intvlsArray = np.linspace(-1, 1, self.intervals)
        
    def set_params(self, model_wv, vectorizer):
        self.model_wv = model_wv
        self.vectorizer = vectorizer
        
    def hist(self, query, document):
        """
        """
        X = []
        for i in query.nonzero()[1]:
            histo = []
            for j in document.nonzero()[1]:
                histo.append(cosine_similarity([self.model_wv.vectors[i]], [self.model_wv.vectors[j]])[0][0])
            histo, bin_edges = np.histogram(histo, bins=self.intvlsArray)
            if self.normalize:
                histo = histo / histo.sum()
            X.append(histo)
        #retourner histogramme interactions + embeddings termes query
        return np.array(X), np.array([self.model_wv.vectors[i] for i in query.nonzero()[1]])
        

## Architecture du modèle

D'après <a href="https://dl.acm.org/citation.cfm?id=2983769">l'article</a>

Code d'après <a href="https://github.com/X-Wei/ETH-thesis-TREC/blob/master/2-DRMM/DRMM.py">ce génie</a>

In [35]:
N_HISTBINS = 30
query_term_maxlen = 4
embedding_size = 300

#from settings import * 
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, InputLayer, Flatten, Input, merge, Reshape
import keras.backend as K
import tensorflow as tf
import pydot
from IPython.display import SVG
from keras.utils.visualize_util import model_to_dot

np.random.seed(1)


# helper function: model visualization 
def viz_model(model):
    return SVG(model_to_dot(model).create(prog='dot', format='svg'))

#### component 1: feed-forward network 
feed_forward = Sequential(
    [Dense(input_dim = N_HISTBINS, output_dim=5, activation='tanh'),
     Dense(output_dim=1, activation='tanh'),
     ], # 30-5-1 as described in the paper
    name='feed_forward_nw')


#### component 2: gating network
from keras.engine.topology import Layer
class ScaledLayer(Layer): # a scaled layer
    def __init__(self, **kwargs):
        super(ScaledLayer
    , self).__init__(**kwargs)
    def build(self, input_shape):
        self.output_dim = input_shape[1] 
        self.W = self.add_weight(shape=(1,), # Create a trainable weight variable for this layer.
                                 initializer='one', trainable=True)
        super(ScaledLayer
    , self).build(input_shape)  # Be sure to call this somewhere!
    def call(self, x, mask=None):
        return tf.mul(x, self.W)
    def get_output_shape_for(self, input_shape):
        return (input_shape[0], self.output_dim)

input_idf = Input(shape=(None,), name='input_idf')
scaled = ScaledLayer()(input_idf)
gs = Activation('softmax', name='softmax')(scaled)
gating = Model(input=input_idf, output=gs, name='gating')

#### some helper functions to build the DRMM 
from keras.layers.core import Lambda
# take the ith slice of the input x
def slicei(x, i): return x[:,i,:]
def slicei_output_shape(input_shape): return (input_shape[0], input_shape[2])

# concatenate inputs into one tensor 
def concat(x): return K.concatenate(x) 
def concat_output_shape(input_shape): return (input_shape[0][0], input_shape[0][1])

# innerproduct of zs and gs
def innerprod(x): return K.sum( tf.mul(x[0],x[1]), axis=1)
def innerprod_output_shape(input_shape): return (input_shape[0][0],1)

# get the difference of input 
def diff(x): return tf.sub(x[0], x[1]) 
def diff_output_shape(input_shape): return input_shape[0]

# custom loss function: hinge of (score_pos - score_neg)
def pairwise_hinge(y_true, y_pred): # y_pred = score_pos - score_neg, **y_true doesn't matter here**
    return K.mean( K.maximum(0.1 - y_pred, y_true*0.0) )  

# self-defined metrics
def ranking_acc(y_true, y_pred):
    y_pred = y_pred > 0 
    return K.mean(y_pred)

def gen_DRMM_model(QLEN, feed_forward = feed_forward): 
    '''
    generates a DRMM model for query length = `QLEN`
    returns 2 items: `scoring_model, ranking_model`
    * `scoring_model` is for predicting;
    * `ranking_model` is for training, and ranking_model has another field: `initial_weights`, which is used in `KFold`
    '''
    # input 1: idf 
    input_idf = Input(shape=(QLEN,), name='input_idf')

    # `gs` = gating output 
    # scaled = ScaledLayer(input_idf)
    # gs = Activation('softmax', name='softmax')(scaled)
    # gating = Model(input=input_idf, output=gs, name='gating')
    gs = gating(input_idf)

    # input 2: hist vectors, shape = QLEN x  N_HISTBINS
    input_hists = Input(shape=(QLEN, N_HISTBINS), name='input_hists')
    
    # `zs` = feed_forward network output  
    zs = [ feed_forward( Lambda(lambda x:slicei(x,i), slicei_output_shape, name='slice%d'%i)(input_hists) )\
            for i in xrange(QLEN) ] 
    zs = Lambda(concat, concat_output_shape, name='concat_zs')(zs)

    # score = inner product of zs and gs 
    scores = Lambda(innerprod, innerprod_output_shape, name='innerprod_zs_gs')([zs, gs])

    # scoring model 
    scoring_model = Model(input=[input_idf, input_hists], output=[scores], name='scoring_model')

    # input3 -- the negative hists vector 
    input_hists_neg = Input(shape=(QLEN, N_HISTBINS), name='input_hists_neg')
    zs_neg = [ feed_forward( Lambda(lambda x:slicei(x,i), slicei_output_shape, name='slice%d_neg'%i)(input_hists_neg) )\
          for i in xrange(QLEN) ]
    zs_neg = Lambda(concat, concat_output_shape, name='concat_zs_neg')(zs_neg)
    scores_neg = Lambda(innerprod, innerprod_output_shape, name='innerprod_zs_gs_neg')([zs_neg, gs])

    ''' **NOTE** 
    for negative document's score, can't just do (don't know why...): 
        `scores_neg = scoring_model([input_idf, input_hists_neg])`
    if we do like above, both outputs from `two_score_model` will be 0.0 
    '''

    two_score_model = Model(input=[input_idf, input_hists, input_hists_neg], 
                            output=[scores, scores_neg], name='two_score_model')
    # positive score - negative score 
    posneg_score_diff = Lambda(diff, diff_output_shape, name='posneg_score_diff')([scores, scores_neg])
    ranking_model = Model(input=[input_idf, input_hists, input_hists_neg], 
                            output=[posneg_score_diff], name='ranking_model')
    
    ranking_model.compile(optimizer='adagrad', loss=pairwise_hinge, metrics=[ranking_acc])
    ranking_model.initial_weights = ranking_model.get_weights()
    return scoring_model, ranking_model


ImportError: cannot import name 'Merge' from 'keras.layers' (/home/ismael/anaconda3/lib/python3.7/site-packages/keras/layers/__init__.py)