# XGBoost

Ranklib is a relatively old library and doesn't have the widespread use that XGBoost does. Ranklib is still under active development, but the fork of the project that OSC created reflects an older version.

The ES-LTR plugin is designed to work with the XGBoost model format. This notebook starts with the `classic` training data generated in `hello-ltr.ipynb` and shows how you could use XGBoost instead of Ranklib to create a model and use it with the plugin.

### Input Data

Gather the data generated for our `classic` model in `hello-ltr.ipynb`. If `data/classic-training.txt` doesn't exist yet, rerun that notebook!

In [None]:
import ltr.judgments as judge

# Read the judgments, closing the file when done
with open('data/classic-training.txt') as f:
    judgments = list(judge.judgments_from_file(f))

df = judge.judgments_to_dataframe(judgments)
df

### Libraries for xgboost-ing

Just the dependencies we need to train and visualize our model trained with XGBoost instead of Ranklib.

In [None]:
import pandas as pd
import xgboost as xgb
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 50,150

### Set up our training Matrix

XGBoost has its own data format, the `DMatrix`, so we need to get our features and labels into that format to use it.


In [None]:
df = df[['grade', 'features0']]
features = df[['features0']]
labels = df[['grade']]

dmx = xgb.DMatrix(features, labels)

### Train the first XGBoost model

Using the demo parameters for our model, we will train a standard regression tree model.

In [None]:
param = {'max_depth':2, 'eta':1, 'silent':1}
num_round = 2

model = xgb.train(param, dmx, num_round)

### Inspect as dataframe

Looking at the model as a dataframe can tell you which splits helped the most (see the `Gain` column).

In [None]:
model.trees_to_dataframe()

In [None]:
xgb.plot_tree(model)

### Adjust the objective for LTR

We don't really want plain regression as our objective function. In LTR we instead use a pairwise loss function to find the optimal splits for a regression tree.

This doesn't make a massive difference to the model that is generated, because it is still a regression tree at the end of the day, but we are no longer minimizing squared error.
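To make the difference between the two objectives concrete, here is a minimal sketch in plain Python. It uses a logistic pairwise loss, which is one common form of pairwise loss; it illustrates the idea and is not XGBoost's actual implementation. The function names and the toy scores are made up for this example.

```python
import math

def squared_error(scores, grades):
    """Pointwise loss: mean squared difference between score and grade."""
    return sum((s - g) ** 2 for s, g in zip(scores, grades)) / len(scores)

def pairwise_logistic_loss(scores, grades):
    """Pairwise loss: for every pair where doc i has a higher grade than
    doc j, penalize the model when score_i is not above score_j."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if grades[i] > grades[j]:
                total += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0

grades = [4, 2, 0]             # toy relevance grades for one query
good_order = [0.9, 0.5, 0.1]   # scores rank the documents correctly
bad_order = [0.1, 0.5, 0.9]    # scores rank the documents in reverse

print(pairwise_logistic_loss(good_order, grades))
print(pairwise_logistic_loss(bad_order, grades))
```

The pairwise loss only looks at score differences, so it rewards getting the order right even when the raw scores are nowhere near the grades. Squared error, by contrast, changes whenever the scores shift, even if the ordering stays the same.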

In [None]:
param2 = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'rank:pairwise'}

ranking_model = xgb.train(param2, dmx, num_round)

In [None]:
ranking_model.trees_to_dataframe()

In [None]:
xgb.plot_tree(ranking_model)

### Uploading an XGBoost model to the plugin

Since the model can be represented as JSON, the plugin can parse it. But we need to make sure the splits in the model refer to the feature names the plugin knows about, otherwise it cannot match them to the stored features.

These are supplied via a mapping `txt` file, `fmap.txt`.
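As a rough sketch of what such a file contains: XGBoost's feature map has one line per feature, giving the feature index, the feature name, and a type (`q` for quantitative, `i` for indicator, `int` for integer). The helper `write_fmap` and the feature name `release_year` below are hypothetical; use the names from your own feature set. The demo writes to a temporary directory so it does not overwrite an existing `fmap.txt`.

```python
import os
import tempfile

def write_fmap(feature_names, path):
    """Write an XGBoost feature map: '<index>\t<name>\t<type>' per line."""
    with open(path, 'w') as f:
        for idx, name in enumerate(feature_names):
            # 'q' marks a quantitative (continuous) feature
            f.write(f"{idx}\t{name}\tq\n")

# 'release_year' is a hypothetical feature name used for illustration
demo_path = os.path.join(tempfile.mkdtemp(), 'fmap.txt')
write_fmap(['release_year'], demo_path)

with open(demo_path) as f:
    fmap_contents = f.read()
print(fmap_contents)
```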

The first step is to dump the model with the feature mapping to the features already stored in the plugin.

In [None]:
model_dump = ranking_model.get_dump(fmap='fmap.txt', dump_format='json')

### Massage the JSON

Manipulate the XGBoost output format to clean it up for posting to the plugin.

In [None]:
import json
clean_model = []
for line in model_dump:
    clean_model.append(json.loads(line))
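Each entry in the dump is a JSON string describing one tree as nested nodes. The sketch below parses a hand-written stand-in for such an entry (the values are made up) and collects the feature names used in its splits, which is a quick way to check that they match the names in the plugin's feature set. The helper `split_features` is hypothetical.

```python
import json

# Hand-written stand-in for one entry of get_dump(dump_format='json');
# the feature name and numbers are made up for illustration.
sample_tree = '''
{"nodeid": 0, "depth": 0, "split": "release_year", "split_condition": 2000.0,
 "yes": 1, "no": 2, "missing": 1,
 "children": [{"nodeid": 1, "leaf": -0.2}, {"nodeid": 2, "leaf": 0.3}]}
'''

def split_features(node):
    """Recursively collect the feature name used at every split node."""
    if 'leaf' in node:
        return set()  # leaves carry a value, not a split
    names = {node['split']}
    for child in node.get('children', []):
        names |= split_features(child)
    return names

tree = json.loads(sample_tree)
used = split_features(tree)
print(used)
```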

### Post it to the plugin

The upload still references the feature set and index the model will be associated with.

In [None]:
from ltr.client import ElasticClient

# Import the class directly so the name `client` does not shadow the module
client = ElasticClient()

client.submit_xgboost_model('release', 'tmdb', 'xgb', clean_model)
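The `ltr.client` helper hides the HTTP request. As a hedged sketch, the plugin's documentation describes a model-upload body shaped roughly like the one below, with the model type `model/xgboost+json`. The function `build_xgboost_payload` is hypothetical and not part of `ltr.client`, and serializing the definition as a JSON string is an assumption; check the plugin documentation for your version.

```python
import json

def build_xgboost_payload(model_name, trees):
    """Build a request body in the shape the LTR plugin documents for
    XGBoost models (assumed here, not taken from ltr.client)."""
    return {
        "model": {
            "name": model_name,
            "model": {
                "type": "model/xgboost+json",
                # The list of trees, serialized as a JSON string
                "definition": json.dumps(trees),
            },
        }
    }

trees = [{"nodeid": 0, "leaf": 0.5}]  # made-up single-leaf tree
payload = build_xgboost_payload("xgb", trees)
print(json.dumps(payload, indent=2))
```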

### Confirm it works

In [None]:
from ltr.release_date_plot import search
search(client, 'batman', 'xgb')

### Compare it to the classic Ranklib model

In [None]:
from ltr.release_date_plot import plot
plot(client, "batman", models=['classic', 'xgb'])