# Hello LTR!

Fire up an Elasticsearch server with the LTR plugin installed and run through the cells below to get started with Learning to Rank. The notebooks in this training share a small LTR client library, which gives us a starting point for demonstrating several important Learning to Rank capabilities.

This notebook documents many of the important pieces so you can reuse them in future training sessions.

### Download some requirements

Several requirements and data sets are stored online, including various training sets and tools. You only need to download them once. If you lose the data, just rerun this command.

In [1]:
from ltr import download
corpus='http://es-learn-to-rank.labs.o19s.com/tmdb.json'

download([corpus], dest='data/');

data/tmdb.json already exists
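
The `data/tmdb.json already exists` message suggests the helper skips files that are already present. Here is a minimal sketch of that skip-if-present behaviour. It assumes nothing about how `ltr.download` is actually implemented, and `download_if_missing` and its `fetch` parameter are hypothetical names:

```python
import os
import urllib.request


def download_if_missing(urls, dest="data/", fetch=urllib.request.urlretrieve):
    """Fetch each URL into dest unless a file with that name is already there."""
    os.makedirs(dest, exist_ok=True)
    fetched = []
    for url in urls:
        filename = url.rstrip("/").split("/")[-1]
        path = os.path.join(dest, filename)
        if os.path.exists(path):
            # Mirrors the message printed by the cell above.
            print(f"{path} already exists")
            continue
        fetch(url, path)  # urlretrieve(url, filename) writes the response body to path
        fetched.append(path)
    return fetched
```

Deleting a file from `dest` and calling the function again would fetch just that file.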


### Use the Elastic client

Two LTR clients exist in this code: an `ElasticClient` and a `SolrClient`. The workflow for doing Learning to Rank is the same in both search engines.

In [2]:
from ltr.client import ElasticClient
client = ElasticClient()

### Index Movies

In these demos, we'll use [TheMovieDB](http://themoviedb.org) alongside some supporting assets from places like MovieLens.

When we reindex, we'll use `rebuild`, which deletes and recreates the index, with a few hooks to help us enrich the underlying data or modify the search engine configuration for feature engineering.

In [3]:
from ltr.index import rebuild
from ltr.helpers.movies import indexable_movies

movies=indexable_movies(movies='data/tmdb.json')
rebuild(client, index='tmdb', doc_src=movies)

Index tmdb already exists. Use `force = True` to delete and recreate


### Configure Learning to Rank

We'll discuss feature sets in more detail later. For now, think of a feature set as a series of stored queries that are executed to produce feature values before we train a model.

The cell below resets the LTR feature store with `reset_ltr`, then stores a feature set with `create_featureset` so Learning to Rank can optimize search using those features. In this stock demo, we have just one feature: the year of the movie's release.

In [4]:
client.reset_ltr(index='tmdb')

config = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": { "match_all": {} }
                    }
                }
            }
        ]
    }
}

client.create_featureset(index='tmdb', name='release', ftr_config=config)

Removed Default LTR feature store [Status: 200]
Initialize Default LTR feature store [Status: 200]
Create release feature set [Status: 201]
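
Because each feature is just a stored query template, a text-relevance feature would follow the same shape as `release_year`. The sketch below assumes the plugin's mustache-style `{{keywords}}` parameter substitution and a `title` field in the `tmdb` index. The `title_match` feature and the `make_feature` helper are hypothetical and not part of this notebook's library.

```python
def make_feature(name, template, params=()):
    """Build one feature entry in the same shape as the config above."""
    return {"name": name, "params": list(params), "template": template}


# The feature from the cell above, rebuilt with the helper.
release_year = make_feature(
    "release_year",
    {
        "function_score": {
            "field_value_factor": {"field": "release_year", "missing": 2000},
            "query": {"match_all": {}},
        }
    },
)

# Hypothetical second feature: how well the user's keywords match the title.
title_match = make_feature(
    "title_match",
    {"match": {"title": "{{keywords}}"}},
    params=["keywords"],
)

config = {"featureset": {"features": [release_year, title_match]}}

# Feature values are logged positionally, so the order in this list matters.
feature_names = [f["name"] for f in config["featureset"]["features"]]
print(feature_names)
```

A config like this could then be stored with the same `client.create_featureset(index='tmdb', name=..., ftr_config=config)` call used above.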


## Is this thing on?

Before we dive into all the pieces with a real training set, we'll try out two example models: one that always prefers newer movies, and another that always prefers older movies. If you're curious, you can open `classic-training.txt` and `latest-training.txt` after running this to see what the training sets look like.

In [7]:
from ltr import years_as_ratings
from ltr.judgments import judgments_from_file

years_as_ratings.synthesize(client, 
                            featureSet='release',
                            classicTrainingSetOut='data/classic-training.txt',
                            latestTrainingSetOut='data/latest-training.txt')

# Load the judgments into training sets, closing each file when done
with open('data/classic-training.txt') as f:
    classic_training_set = list(judgments_from_file(f))
with open('data/latest-training.txt') as f:
    latest_training_set = list(judgments_from_file(f))

classic_training_set

Generating ratings for classic and latest model
Generating Classic judgments:
Generating Recent judgments:
Recognizing 1 queries...
Recognizing 1 queries...


[Judgment(grade=0,qid=1,keywords=,docId=374430,features=[2014.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=19404,features=[1995.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=278,features=[1994.0],weight=1,
 Judgment(grade=0,qid=1,keywords=,docId=372058,features=[2016.0],weight=1,
 Judgment(grade=2,qid=1,keywords=,docId=238,features=[1972.0],weight=1,
 Judgment(grade=0,qid=1,keywords=,docId=360814,features=[2016.0],weight=1,
 Judgment(grade=0,qid=1,keywords=,docId=244786,features=[2014.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=424,features=[1993.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=129,features=[2001.0],weight=1,
 Judgment(grade=2,qid=1,keywords=,docId=240,features=[1974.0],weight=1,
 Judgment(grade=0,qid=1,keywords=,docId=432517,features=[2017.0],weight=1,
 Judgment(grade=3,qid=1,keywords=,docId=18148,features=[1953.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=550,features=[1999.0],weight=1,
 Judgment(grade=1,qid=1,keywords=,docId=680,f
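
The grades in the output above track release year: in the classic set, older movies get higher grades. The sketch below shows one way such judgments could be synthesized and written as RankLib-style lines (`grade qid:N ordinal:value # docId`). The cut-off years are assumptions picked to be consistent with the sample shown here, not values read from `years_as_ratings`. `classic_grade`, `latest_grade`, and `to_ranklib_line` are hypothetical helpers.

```python
def classic_grade(year):
    """Older movies get higher grades (assumed cut-offs)."""
    if year > 2010:
        return 0
    if year > 1990:
        return 1
    if year > 1970:
        return 2
    return 3


def latest_grade(year):
    """Mirror image of classic_grade: newer movies get higher grades."""
    return 3 - classic_grade(year)


def to_ranklib_line(grade, qid, features, doc_id):
    """Format one judgment as '<grade> qid:<qid> <ordinal>:<value> ... # <docId>'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{grade} qid:{qid} {feats} # {doc_id}"


# (docId, release_year) pairs copied from the output above.
sample = [(374430, 2014.0), (238, 1972.0), (18148, 1953.0), (550, 1999.0)]
for doc_id, year in sample:
    print(to_ranklib_line(classic_grade(year), 1, [year], doc_id))
```

Swapping `classic_grade` for `latest_grade` in the loop would produce the opposite preference from the same documents.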

### Train and Submit

We'll train a lot of models in this class! Our ltr library has a `train` function that wraps a tool called `Ranklib` (more on Ranklib later). It lets you pass the most common options to Ranklib, stores the resulting model in the search engine, and returns diagnostic output that's worth inspecting.

For now we'll just train on the generated training sets and store two models, `latest` and `classic`.


In [8]:
from ltr.ranklib import train
train(client, training_set=latest_training_set, 
      index='tmdb', featureSet='release', modelName='latest')
train(client, training_set=classic_training_set, 
      index='tmdb', featureSet='release', modelName='classic')

/var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/training.txt -save data/latest_model.txt 
Delete model latest: 200
Created Model latest [Status: 201]
Model saved
/var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/7_/cvjz84n54vx7zv_pw3gmdqr00000gn/T/training.txt -save data/classic_model.txt 
Delete model classic: 200
Created Model classic [Status: 201]
Model saved


<ltr.helpers.ranklib_result.RanklibResult at 0x11f751fd0>
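
The Ranklib command lines in the output above pass `-metric2t DCG@10`, so training optimizes discounted cumulative gain over the top ten results. Here is a minimal sketch of one common DCG formulation, with gain `2^grade - 1` and a `log2(position + 1)` discount. Ranklib's own implementation may differ in details, and `dcg_at_k` is a hypothetical helper.

```python
import math


def dcg_at_k(grades, k=10):
    """DCG over the first k grades, listed in ranked order (top result first)."""
    return sum(
        (2 ** grade - 1) / math.log2(position + 1)
        for position, grade in enumerate(grades[:k], start=1)
    )


# Grades listed in the order a model ranked the documents.
good_order = [3, 2, 2, 1, 0]  # highest grades at the top
bad_order = [0, 1, 2, 2, 3]   # same documents, reversed

print(dcg_at_k(good_order) > dcg_at_k(bad_order))
```

Placing highly graded documents near the top raises the score, which is what pushes the `classic` model toward old movies and the `latest` model toward new ones.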

### Ben Affleck vs Adam West
If we search for `batman`, how do the results compare? Since the `classic` model preferred old movies, it puts old movies in the top positions, and the opposite is true for the `latest` model. To continue learning LTR, brainstorm more features and generate some real judgments for real queries.

In [9]:
import ltr.release_date_plot as rdp
rdp.plot(client, 'batman')

In [10]:
import pandas as pd
classic_results = rdp.search(client, 'batman', 'classic')
print('top results from classic model:')
pd.json_normalize(classic_results)[['id', 'title', 'release_year', 'score']].head(12)

top results from classic model:


|   | id | title | release_year | score |
|---|---|---|---|---|
| 0 | 93560 | Batman and Robin | 1949 | 1.717044 |
| 1 | 125249 | Batman | 1943 | 1.571955 |
| 2 | 268 | Batman | 1989 | -3.974195 |
| 3 | 2661 | Batman | 1966 | -4.0437 |
| 4 | 364 | Batman Returns | 1992 | -4.086597 |
| 5 | 142061 | Batman: The Dark Knight Returns, Part 2 | 2013 | -4.198322 |
| 6 | 16234 | Batman Beyond: Return of the Joker | 2000 | -4.198322 |
| 7 | 123025 | Batman: The Dark Knight Returns, Part 1 | 2012 | -4.198322 |
| 8 | 40662 | Batman: Under the Red Hood | 2010 | -4.198322 |
| 9 | 272 | Batman Begins | 2005 | -4.198322 |


In [11]:
latest_results = rdp.search(client, 'batman', 'latest')
print('top results from latest model:')
pd.json_normalize(latest_results)[['id', 'title', 'release_year', 'score']].head(12)

top results from latest model:


|   | id | title | release_year | score |
|---|---|---|---|---|
| 0 | 242643 | Batman: Assault on Arkham | 2014 | 6.397139 |
| 1 | 251519 | Son of Batman | 2014 | 6.397139 |
| 2 | 382322 | Batman: The Killing Joke | 2016 | 1.37108 |
| 3 | 209112 | Batman v Superman: Dawn of Justice | 2016 | 1.37108 |
| 4 | 366924 | Batman: Bad Blood | 2016 | 1.37108 |
| 5 | 324849 | The Lego Batman Movie | 2017 | 1.371054 |
| 6 | 321528 | Batman vs. Robin | 2015 | 1.371003 |
| 7 | 142061 | Batman: The Dark Knight Returns, Part 2 | 2013 | 0.959237 |
| 8 | 123025 | Batman: The Dark Knight Returns, Part 1 | 2012 | 0.959237 |
| 9 | 69735 | Batman: Year One | 2011 | 0.959237 |
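
One quick way to quantify the difference is to compare the release years in each top ten and see which movies appear in both. The sketch below uses the `id` and `release_year` values copied from the two tables above rather than a live search, so it runs without a server.

```python
from statistics import mean

# release_year column of each table, in ranked order
classic_years = [1949, 1943, 1989, 1966, 1992, 2013, 2000, 2012, 2010, 2005]
latest_years = [2014, 2014, 2016, 2016, 2016, 2017, 2015, 2013, 2012, 2011]

# id column of each table
classic_ids = {93560, 125249, 268, 2661, 364, 142061, 16234, 123025, 40662, 272}
latest_ids = {242643, 251519, 382322, 209112, 366924,
              324849, 321528, 142061, 123025, 69735}

print("mean year, classic:", mean(classic_years))
print("mean year, latest: ", mean(latest_years))
print("ids in both top tens:", sorted(classic_ids & latest_ids))
```

With a running server, the same comparison could be made on the lists returned by `rdp.search` instead of hard-coded values.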
