<h1><font color='blue'><i>Learning to Rank</i></font></h1>
<h2><b>Trabalho Final - Estudo de caso em <i>learning to rank<i></b></h2>


---

# **Orientações**

## Objetivos

Aplicar os conhecimentos adquiridos em sala sobre máquinas de busca, sistemas de recuperação de informação e *learning to rank*.

## Atividades (escolher uma das seguintes alternativas)

1. **Alternativa 1**: Desenvolver uma máquina de busca, como exemplificado em sala, coletando dados da web, indexando e executando consultas. 

2. **Alternativa 2**: Pesquisar na web, *cases* (estudos de casos, exemplos) de soluções de máquinas de busca ou *learning to rank* e preparar um notebook próprio para apresentar o estudo de caso.

2. **Alternativa 3**: Executar pelo menos 3 (três) abordagens de *learning to rank*, uma ***pointwise***, uma ***pairwise*** e uma ***listwise*** sobre a coleção MQ2008, disponível na LETOR, e apresentar os resulados de MAP em todos os 5 folds. Coleção disponível para download em [MQ2008](https://drive.google.com/uc?export=download&id=1s8QZ7qio9Y9qUB7tp6Xb-NuAUUMI-ybZ).


## <font color='red'>Importante</font>

* O notebook a ser entregue deve ser funcional, isto é, **deve funcionar para que eu consiga executar o estudo de caso apresentado**.

* Se você utilizar algum arquivo de dados, preferencialmente, **ele deve estar disponível na web como um link do Google Drive** para que eu possa executar o seu notebook.

* **Comente o seu código** explicando os passos da solução que será apresentada.

# **Estudo de caso** - Alternativa 3

## Carregando as bibliotecas

In [37]:
!pip install lightgbm



In [42]:
from urllib.request import urlretrieve
import zipfile
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_svmlight_file
from collections import Counter
from sklearn import svm, linear_model
import lightgbm as lgb
import xgboost as xgb

## Download da base de Dados MQ2008

In [18]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [19]:
# link de download
#urlretrieve('https://drive.google.com/file/d/1RM-S5P8mZYCeZHOpgZCStlEJPT58l1aX/view?usp=sharing', 'MQ2008.zip')
# leitura e extração do arquivo zip
zip_ref = zipfile.ZipFile('/content/gdrive/MyDrive/Datasets/MQ2008.zip', 'r')
zip_ref.extractall()

## Funções de Avaliação

In [20]:
def precision_at_k(r, k):
    assert k >= 1
    r = np.asarray(r)[:k] != 0
    if r.size != k:
        raise ValueError('k é muito grande!')
    return np.mean(r)

def average_precision(r):
    r = np.asarray(r) != 0
    out = [precision_at_k(r, k+1) for k in range(r.size) if r[k]]
    if not out:
        return 0.
    return np.mean(out)

def map(rs):
    return np.mean([average_precision(r) for r in rs])

def predict_rank(model, X, y, q, probability=False):
    df = pd.DataFrame( {'feats' : list(X), 'qid' : q, 'rel' : y})
    rels = []
    for qid in df['qid'].unique():
        query = '(qid==' + str(qid) + ')'
        X = np.asarray([ x for x in df.query(query)['feats'] ])
        y = df.query(query)['rel']

        if probability:
            y_probs = model.predict_proba(X)
            y_scores = []
            for x in y_probs:
                y_scores.append(x.dot([0,1,2]))
        else:
            y_scores = model.predict(X)

        aux = pd.DataFrame( {'rel' : y, 'score' : y_scores })
        rel = list(aux.sort_values(by=['score'], ascending=False)['rel'].values)
        rels.append( rel )
    return map(rels)

## FOLD 1

In [30]:
feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold1/train.txt', query_id=True)
X_train = feats.toarray()
y_train = np.array([ int(r) for r in rels ])
q_train = np.array([q for q in qids])
count_train = list(Counter(q_train).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold1/vali.txt', query_id=True)
X_vali = feats.toarray()
y_vali = np.array([ int(r) for r in rels ])
q_vali = np.array([q for q in qids])
count_vali = list(Counter(q_vali).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold1/test.txt', query_id=True)
X_test = feats.toarray()
y_test = np.array([ int(r) for r in rels ])
q_test = np.array([q for q in qids])
count_test = list(Counter(q_test).values())

#### PoinWise - Regressão

In [32]:
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print("MAP FOLD 1: ", predict_rank(reg, X_test, y_test, q_test))

MAP FOLD 1:  0.35008534383880985


#### PairWise - LightGBM

In [41]:
gbm = lgb.LGBMRanker()
gbm.fit(X_train, y_train, group=count_train,
        eval_set=[(X_vali, y_vali)], eval_group=[count_vali],
        eval_at=[5, 10], early_stopping_rounds=50, verbose=False)

print("MAP FOLD 1: ", predict_rank(gbm, X_test, y_test, q_test))

MAP FOLD 1:  0.4446464050669653


#### ListWise - XGBoost

In [44]:
xgbr = xgb.XGBRanker(
    bosster='gbtree',
    objective='rank:map',
    random_state=42,
    learning_rate=0.1,
    colsample_bytree=0.9,
    eta=0.05,
    max_depth=6,
    n_estimators=100,
    subsamples=0.75
    )

xgbr.fit(X_train, y_train, group=count_train, verbose=True)
print("MAP FOLD 1: ", predict_rank(xgbr, X_test, y_test, q_test))

MAP FOLD 1:  0.46348733136435966


## FOLD 2

In [45]:
feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold2/train.txt', query_id=True)
X_train = feats.toarray()
y_train = np.array([ int(r) for r in rels ])
q_train = np.array([q for q in qids])
count_train = list(Counter(q_train).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold2/vali.txt', query_id=True)
X_vali = feats.toarray()
y_vali = np.array([ int(r) for r in rels ])
q_vali = np.array([q for q in qids])
count_vali = list(Counter(q_vali).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold2/test.txt', query_id=True)
X_test = feats.toarray()
y_test = np.array([ int(r) for r in rels ])
q_test = np.array([q for q in qids])
count_test = list(Counter(q_test).values())


#### PoinWise - Regressão


In [46]:
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print("MAP FOLD 2: ", predict_rank(reg, X_test, y_test, q_test))

MAP FOLD 2:  0.3439708364528137


#### PairWise - LightGBM

In [47]:
gbm = lgb.LGBMRanker()
gbm.fit(X_train, y_train, group=count_train,
        eval_set=[(X_vali, y_vali)], eval_group=[count_vali],
        eval_at=[5, 10], early_stopping_rounds=50, verbose=False)

print("MAP FOLD 2: ", predict_rank(gbm, X_test, y_test, q_test))

MAP FOLD 2:  0.4210342316442376


#### ListWise - XGBoost

In [48]:
xgbr = xgb.XGBRanker(
    bosster='gbtree',
    objective='rank:map',
    random_state=42,
    learning_rate=0.1,
    colsample_bytree=0.9,
    eta=0.05,
    max_depth=6,
    n_estimators=100,
    subsamples=0.75
    )

xgbr.fit(X_train, y_train, group=count_train, verbose=True)
print("MAP FOLD 2: ", predict_rank(xgbr, X_test, y_test, q_test))

MAP FOLD 2:  0.43113740803256656


## FOLD 3

In [49]:
feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold3/train.txt', query_id=True)
X_train = feats.toarray()
y_train = np.array([ int(r) for r in rels ])
q_train = np.array([q for q in qids])
count_train = list(Counter(q_train).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold3/vali.txt', query_id=True)
X_vali = feats.toarray()
y_vali = np.array([ int(r) for r in rels ])
q_vali = np.array([q for q in qids])
count_vali = list(Counter(q_vali).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold3/test.txt', query_id=True)
X_test = feats.toarray()
y_test = np.array([ int(r) for r in rels ])
q_test = np.array([q for q in qids])
count_test = list(Counter(q_test).values())

#### PoinWise - Regressão

In [50]:
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print("MAP FOLD 3: ", predict_rank(reg, X_test, y_test, q_test))

MAP FOLD 3:  0.3288143671121634


#### PairWise - LightGBM

In [51]:
gbm = lgb.LGBMRanker()
gbm.fit(X_train, y_train, group=count_train,
        eval_set=[(X_vali, y_vali)], eval_group=[count_vali],
        eval_at=[5, 10], early_stopping_rounds=50, verbose=False)

print("MAP FOLD 3: ", predict_rank(gbm, X_test, y_test, q_test))

MAP FOLD 3:  0.44402166612003063


#### ListWise - XGBoost

In [52]:
xgbr = xgb.XGBRanker(
    bosster='gbtree',
    objective='rank:map',
    random_state=42,
    learning_rate=0.1,
    colsample_bytree=0.9,
    eta=0.05,
    max_depth=6,
    n_estimators=100,
    subsamples=0.75
    )

xgbr.fit(X_train, y_train, group=count_train, verbose=True)
print("MAP FOLD 3: ", predict_rank(xgbr, X_test, y_test, q_test))

MAP FOLD 3:  0.4349613785324614


## Fold 4

In [53]:
feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold4/train.txt', query_id=True)
X_train = feats.toarray()
y_train = np.array([ int(r) for r in rels ])
q_train = np.array([q for q in qids])
count_train = list(Counter(q_train).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold4/vali.txt', query_id=True)
X_vali = feats.toarray()
y_vali = np.array([ int(r) for r in rels ])
q_vali = np.array([q for q in qids])
count_vali = list(Counter(q_vali).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold4/test.txt', query_id=True)
X_test = feats.toarray()
y_test = np.array([ int(r) for r in rels ])
q_test = np.array([q for q in qids])
count_test = list(Counter(q_test).values())

#### PoinWise - Regressão

In [55]:
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print("MAP FOLD 4: ", predict_rank(reg, X_test, y_test, q_test))

MAP FOLD 4:  0.4031693308510521


#### PairWise - LightGBM

In [56]:
gbm = lgb.LGBMRanker()
gbm.fit(X_train, y_train, group=count_train,
        eval_set=[(X_vali, y_vali)], eval_group=[count_vali],
        eval_at=[5, 10], early_stopping_rounds=50, verbose=False)

print("MAP FOLD 4: ", predict_rank(gbm, X_test, y_test, q_test))

MAP FOLD 4:  0.515457856703494


#### ListWise - XGBoost

In [57]:
xgbr = xgb.XGBRanker(
    bosster='gbtree',
    objective='rank:map',
    random_state=42,
    learning_rate=0.1,
    colsample_bytree=0.9,
    eta=0.05,
    max_depth=6,
    n_estimators=100,
    subsamples=0.75
    )

xgbr.fit(X_train, y_train, group=count_train, verbose=True)
print("MAP FOLD 4: ", predict_rank(xgbr, X_test, y_test, q_test))

MAP FOLD 4:  0.5258585353673043


## Fold 5

In [58]:
feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold5/train.txt', query_id=True)
X_train = feats.toarray()
y_train = np.array([ int(r) for r in rels ])
q_train = np.array([q for q in qids])
count_train = list(Counter(q_train).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold5/vali.txt', query_id=True)
X_vali = feats.toarray()
y_vali = np.array([ int(r) for r in rels ])
q_vali = np.array([q for q in qids])
count_vali = list(Counter(q_vali).values())

feats, rels, qids = load_svmlight_file('/content/MQ2008/Fold5/test.txt', query_id=True)
X_test = feats.toarray()
y_test = np.array([ int(r) for r in rels ])
q_test = np.array([q for q in qids])
count_test = list(Counter(q_test).values())

#### PoinWise - Regressão

In [59]:
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

print("MAP FOLD 5: ", predict_rank(reg, X_test, y_test, q_test))

MAP FOLD 5:  0.3557017302150307


#### PairWise - LightGBM

In [60]:
gbm = lgb.LGBMRanker()
gbm.fit(X_train, y_train, group=count_train,
        eval_set=[(X_vali, y_vali)], eval_group=[count_vali],
        eval_at=[5, 10], early_stopping_rounds=50, verbose=False)

print("MAP FOLD 5: ", predict_rank(gbm, X_test, y_test, q_test))

MAP FOLD 5:  0.5076872466418735


#### ListWise - XGBoost

In [61]:
xgbr = xgb.XGBRanker(
    bosster='gbtree',
    objective='rank:map',
    random_state=42,
    learning_rate=0.1,
    colsample_bytree=0.9,
    eta=0.05,
    max_depth=6,
    n_estimators=100,
    subsamples=0.75
    )

xgbr.fit(X_train, y_train, group=count_train, verbose=True)
print("MAP FOLD 5: ", predict_rank(xgbr, X_test, y_test, q_test))

MAP FOLD 5:  0.49701138219517205


## Interpretação
Minha estratégia foi selecionar os métodos mais eficientes dos métodos disponibilizados em aula, com isso ao comparar os valores de MAP, como já era esperado o pointwise ficou com o menor valor de MAP, contudo me surpreende o pairwise ter ficado ligeiramente com MAP maior do que o listwise, esperava o listwise com valores maiores. Acredito que esse comportamente se deva ao tamanho e o tipo de dados do dataset.
Talvés o trade-off nesse caso foi complexidade do método versus complexidade do Dataset, levando em conta que o método listwise é bem mais complexo do que o pairwise, e com os valores de MAP ligeiramente melhor no pairwise, mostra que nesse caso o pairwise atenderia perfeitamente não tornando necessário aplicar o listwise que é mais complexo.