# Hello LTR!

Fire up an Elasticsearch server with the LTR plugin installed and run through the cells below to get started with Learning to Rank. The notebooks we'll use in this training come with a small LTR client library, which serves as a starting point for demonstrating several important Learning to Rank capabilities.

This notebook documents many of the important pieces so you can reuse them in future training sessions.

### Download some requirements

Several requirements are stored online. These include the training data sets (judgments), the data to be indexed, and tools. You only need to download them once, but if you lose the data you can rerun this command.

In [1]:
from ltr import download
download()

GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
GET http://es-learn-to-rank.labs.o19s.com/blog.jsonl
GET http://es-learn-to-rank.labs.o19s.com/osc_judgments.txt
GET http://es-learn-to-rank.labs.o19s.com/RankyMcRankFace.jar
GET http://es-learn-to-rank.labs.o19s.com/title_judgments.txt
GET http://es-learn-to-rank.labs.o19s.com/genome_judgments.txt
GET http://es-learn-to-rank.labs.o19s.com/sample_judgments_train.txt
Done.


### Use the Elastic client

Two LTR clients exist in this code: an `ElasticClient` and a `SolrClient`. The workflow for Learning to Rank is the same in both search engines.

In [2]:
from ltr.client import ElasticClient
client = ElasticClient()

### Index Movies

In these demos, we'll use [TheMovieDB](http://themoviedb.org) alongside some supporting assets from places like MovieLens.

When we reindex, we'll use `rebuild_tmdb`, which deletes and recreates the index, with a few hooks to help us enrich the underlying data or modify the search engine configuration for feature engineering. On a first run the index does not exist yet, so the delete step reports a 404; that is expected.

In [3]:
from ltr.index import rebuild_tmdb
rebuild_tmdb(client)

Reconfig from disk...
Deleted index tmdb [Status: 404]
{
  "error": {
    "root_cause": [
      {
        "type": "index_not_found_exception",
        "reason": "no such index",
        "resource.type": "index_or_alias",
        "resource.id": "tmdb",
        "index_uuid": "_na_",
        "index": "tmdb"
      }
    ],
    "type": "index_not_found_exception",
    "reason": "no such index",
    "resource.type": "index_or_alias",
    "resource.id": "tmdb",
    "index_uuid": "_na_",
    "index": "tmdb"
  },
  "status": 404
}
Created index tmdb [Status: 200]
Reindexing...
Indexed 0 movies (last Black Mirror: White Christmas)
Indexed 100 movies (last Apocalypse Now)
Indexed 200 movies (last Crooks in Clover)
Indexed 300 movies (last For a Few Dollars More)
Indexed 400 movies (last Downfall)
Indexed 500 movies (last Finding Nemo)
Indexed 600 movies (last Platoon)
Indexed 700 movies (last Night of the Living Dead)
Indexed 800 movies (last Evangelion: 1.0: You Are (Not) Alone)
Indexed 900 movi

Indexed 17300 movies (last Blade of the Ripper)
Indexed 17400 movies (last Kiler)
Indexed 17500 movies (last Kaïrat)
Indexed 17600 movies (last Body Bags)
Indexed 17700 movies (last Dave Attell: Captain Miserable)
Indexed 17800 movies (last Wodehouse In Exile)
Indexed 17900 movies (last Duel in the Sun)
Indexed 18000 movies (last The Message)
Indexed 18100 movies (last Shock)
Indexed 18200 movies (last Harvey)
Indexed 18300 movies (last The Worthless)
Indexed 18400 movies (last Queen of the Mountains)
Indexed 18500 movies (last Urgh! A Music War)
Indexed 18600 movies (last Wuthering Heights)
Indexed 18700 movies (last Gabriel Over the White House)
Indexed 18800 movies (last Friendship!)
Indexed 18900 movies (last Mía)
Indexed 19000 movies (last Danger! 50,000 Zombies)
Indexed 19100 movies (last Top Dog)
Indexed 19200 movies (last Reaching for the Moon)
Indexed 19300 movies (last A Child's Christmas in Wales)
Indexed 19400 movies (last The Dog Who Stopped the War)
Indexed 19500 movies (

### Configure Learning to Rank

We'll discuss feature sets in more detail later. For now, you can think of a feature set as a series of queries that are stored in the search engine and executed to compute feature values before we train a model.

`setup` is our function for preparing Learning to Rank with a set of features. In this stock demo, we have just one feature: the year of the movie's release.

In [None]:
config = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": { "match_all": {} }
                    }
                }
            }
        ]
    }
}


from ltr import setup
setup(client, config=config, index='tmdb', featureset='release')
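
To make the feature concrete, here is a minimal sketch of the value this template should produce per document. It assumes the default `function_score` behavior (the `match_all` score of 1.0 multiplied by the field value, with no `modifier` or `factor` set). The helper `release_year_feature` and the sample documents are hypothetical and are not part of the `ltr` library.

```python
def release_year_feature(doc, missing=2000):
    """Approximate the `release_year` feature value for one document.

    Assumption: match_all scores 1.0 and field_value_factor multiplies it
    by the field value, falling back to `missing` when the field is absent.
    """
    match_all_score = 1.0
    year = doc.get("release_year")
    factor = float(year) if year is not None else float(missing)
    return match_all_score * factor


# Hypothetical documents, for illustration only
docs = [
    {"title": "hypothetical old movie", "release_year": 1966},
    {"title": "hypothetical undated movie"},
]
feature_values = [release_year_feature(d) for d in docs]
print(feature_values)
```

In other words, the single feature logged for each movie is simply its release year, with 2000 substituted when the year is unknown.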

## Is this thing on?

Before we dive into all the pieces with a real training set, we'll try out two example models: one that always prefers newer movies, and another that always prefers older movies. If you're curious, you can open `data/classic-training.txt` and `data/latest-training.txt` after running this cell to see what the training sets look like.

In [None]:
from ltr import years_as_ratings
years_as_ratings.synthesize(client, 
                            featureSet='release',
                            classicTrainingSetOut='data/classic-training.txt',
                            latestTrainingSetOut='data/latest-training.txt')
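
If you do open those files, it helps to know the general shape of a Ranklib-style training line: a grade, a query id, numbered feature values, and a trailing comment that usually carries the document id. The sketch below formats and parses such lines. The grading rule in `grades_for_year` is a made-up stand-in and is not necessarily how `years_as_ratings` assigns grades; the exact whitespace and comment contents in the generated files may also differ.

```python
def grades_for_year(year):
    """Hypothetical grading: newer movies score higher for 'latest',
    and 'classic' is the mirror image. Grades run from 0 to 4."""
    latest = min(4, max(0, (year - 1980) // 10))
    classic = 4 - latest
    return latest, classic


def to_ranklib_line(grade, qid, features, doc_id):
    """Format one judgment as '<grade> qid:<qid> 1:<f1> 2:<f2> ... # <doc_id>'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{grade} qid:{qid} {feats} # {doc_id}"


def parse_ranklib_line(line):
    """Parse a line back into (grade, qid, {feature_index: value}, comment)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    grade = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return grade, qid, features, comment.strip()


latest_grade, classic_grade = grades_for_year(2015)
line = to_ranklib_line(latest_grade, 1, [2015.0], "doc_a")
print(line)
print(parse_ranklib_line(line))
```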

### Train and Submit

We'll train a lot of models in this class! Our `ltr` library has a `train` function that wraps a tool called `Ranklib` (more on Ranklib later). It lets you pass the most common options to Ranklib, stores the resulting model in the search engine, and returns diagnostic output that's worth inspecting.

For now we'll just train on the generated training sets and store two models, `latest` and `classic`.


In [None]:
from ltr import train
train(client, trainingInFile='data/latest-training.txt', 
      index='tmdb', featureSet='release', modelName='latest')
train(client, trainingInFile='data/classic-training.txt', 
      index='tmdb', featureSet='release', modelName='classic')
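
For a rough idea of what a wrapper around Ranklib has to do, the sketch below assembles a Ranklib command line without running it. The flags `-ranker`, `-train`, `-metric2t`, and `-save` are standard Ranklib options, and ranker `6` selects LambdaMART. The jar location under `data/`, the model output path, and the helper `ranklib_command` are assumptions for illustration; the options the library's `train` actually passes may differ.

```python
def ranklib_command(training_file, model_out,
                    jar="data/RankyMcRankFace.jar",  # assumed download location
                    ranker=6,                        # 6 = LambdaMART in Ranklib
                    metric="NDCG@10"):
    """Build (but do not execute) a Ranklib training command as an argument list."""
    return [
        "java", "-jar", jar,
        "-ranker", str(ranker),
        "-train", training_file,
        "-metric2t", metric,   # metric optimized on the training data
        "-save", model_out,    # where Ranklib writes the trained model
    ]


# Hypothetical model output path, for illustration only
cmd = ranklib_command("data/latest-training.txt", "data/latest_model.txt")
print(" ".join(cmd))
```

The argument list could be handed to `subprocess.run` to train locally; the saved model file would then still need to be uploaded to the search engine, which is the step `train` handles for you.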

### Ben Affleck vs Adam West
If we search for `batman`, how do the results compare? Because the `classic` model learned to prefer old movies, it puts old movies in the top positions, and the opposite is true for the `latest` model. To continue learning LTR, brainstorm more features and generate some real judgments for real queries.

In [None]:
from ltr.release_date_plot import plot
plot(client)
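
To compare the two models outside of the plot, one option is to rescore a baseline query with the plugin's `sltr` query. The sketch below only builds the request body; it assumes the movies have a `title` field and that the models were stored under the names used above. The helper `ltr_rescore_query` is hypothetical, and the `keywords` parameter is unused by the `release_year` feature but is included because keyword-based features would need it.

```python
import json


def ltr_rescore_query(keywords, model, window_size=1000):
    """Build a search body that rescores the top hits with a stored LTR model."""
    return {
        "query": {"match": {"title": keywords}},  # assumes a `title` field
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model,
                    }
                }
            },
        },
    }


classic_body = ltr_rescore_query("batman", "classic")
latest_body = ltr_rescore_query("batman", "latest")
print(json.dumps(classic_body, indent=2))
```

Sending each body to the `tmdb` index's `_search` endpoint should return the same candidate movies in opposite orders of preference by release year.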