# Hello LTR!

Fire up an OpenSearch server with the LTR plugin installed and run through the cells below to get started with Learning to Rank. The notebooks used in this training share a small `ltr` client library and serve as a starting point for demonstrating several important Learning to Rank capabilities.

This notebook documents many of the important pieces so you can reuse them in future training sessions.

### The library: ltr 

This is a Python library, located at the top level of the repository in `hello-ltr/ltr/`. It contains helper functions used throughout the notebooks.

If you want to edit the source code, make sure you are running the Jupyter Notebook server locally and not from a Docker container.

In [1]:
import ltr
import ltr.client as client
import ltr.index as index
import ltr.helpers.movies as helpers

### Download some requirements

Several requirements are stored online, including various training sets, data sets, and tools. You only need to download them once. If you lose the data, rerun this command.

In [2]:
corpus = 'http://es-learn-to-rank.labs.o19s.com/tmdb.json'

ltr.download([corpus], dest='data/')

data/tmdb.json already exists
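
The `already exists` message suggests that the download step skips files that are already present. Here is a minimal sketch of that skip-if-present pattern. It assumes nothing about how `ltr.download` is actually implemented; `download_once` and its `fetch` parameter are hypothetical names used only for illustration.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_once(urls, dest='data/', fetch=urlretrieve):
    """Fetch each URL into dest unless a file with that name is already there.

    Hypothetical helper for illustration; not the ltr library's code.
    """
    os.makedirs(dest, exist_ok=True)
    fetched = []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest, filename)
        if os.path.exists(target):
            print(f'{target} already exists')
            continue
        fetch(url, target)
        fetched.append(target)
    return fetched
```

Passing `fetch` as a parameter keeps the skip logic easy to exercise without network access.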


### Use the OpenSearch client

Three LTR clients exist in this code: an `ElasticClient`, a `SolrClient`, and an `OpenSearchClient`. The workflow for doing Learning to Rank is the same in all three search engines.

In [3]:
client = client.OpenSearchClient()

http://localhost:9201/_ltr; <OpenSearch([{'host': 'localhost', 'port': 9201}])>


### Index Movies

In these demos, we'll use [TheMovieDB](http://themoviedb.org) alongside some supporting assets from places like MovieLens.

When we reindex, we'll use `ltr.index.rebuild`, which deletes and recreates the index and provides a few hooks to help us enrich the underlying data or modify the search engine configuration for feature engineering.

In [4]:
movies = helpers.indexable_movies(movies='data/tmdb.json')

index.rebuild(client, index='tmdb', doc_src=movies)

Index tmdb already exists. Use `force = True` to delete and recreate


### Configure Learning to Rank

We'll discuss feature sets in more detail later. You can think of a feature set as a series of stored queries that are executed to compute feature values before we train a model.

The cells below prepare Learning to Rank to optimize search using a set of features: `reset_ltr` clears the feature store and `create_featureset` uploads the features. In this stock demo, we have just one feature, the year of the movie's release.

In [5]:
# wipes out any existing LTR models/feature sets in the tmdb index
client.reset_ltr(index='tmdb')

Removed Default LTR feature store [Status: 200]
Initialize Default LTR feature store [Status: 200]


In [6]:
# A feature set as a Python dict, which looks a lot like JSON
feature_set = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": { "match_all": {} }
                    }
                }
            }
        ]
    }
}

feature_set

{'featureset': {'features': [{'name': 'release_year',
    'params': [],
    'template': {'function_score': {'field_value_factor': {'field': 'release_year',
       'missing': 2000},
      'query': {'match_all': {}}}}}]}}

In [7]:
# pushes the feature set to the tmdb index's LTR store (a hidden index)
client.create_featureset(index='tmdb', name='release', ftr_config=feature_set)

Create release feature set [Status: 201]
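
The `release_year` feature above is query-independent: it has no `params`, so it yields the same value for a document no matter what the user searched for. Most real feature sets also include query-dependent features. The sketch below shows what a second, parameterized feature might look like, using the Mustache-style `{{...}}` placeholders that the LTR plugin's templates support. The feature name `title_bm25`, the parameter name `keywords`, and the use of the `title` field are assumptions made for illustration and are not part of this notebook's feature set.

```python
# Illustrative only: a feature set with one query-independent feature
# (copied from this notebook) and one hypothetical query-dependent feature.
feature_set_sketch = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": {"match_all": {}}
                    }
                }
            },
            {
                # Hypothetical feature: relevance score of the user's
                # keywords against the title field.
                "name": "title_bm25",
                "params": ["keywords"],
                "template": {
                    "match": {"title": "{{keywords}}"}
                }
            }
        ]
    }
}

# Feature order matters: training rows refer to features by position.
feature_names = [f["name"] for f in feature_set_sketch["featureset"]["features"]]
print(feature_names)
```

In the training files shown later in this notebook, each row carries a `1:` entry, which appears to correspond to the first (and here only) feature in the set.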


## Is this thing on?

Before we dive into all the pieces with a real training set, we'll try out two example models: one that always prefers newer movies, and another that always prefers older movies. If you're curious, you can open `data/classic-training.txt` and `data/latest-training.txt` after running this to see what the training sets look like.

### Generate some judgment data

This will write out judgment data to a file path.

Look at the source code in `ltr/years_as_ratings.py` to see what assumptions are being made in this synthetic judgment. What assumptions do you make in your judgment process?

In [8]:
from ltr.years_as_ratings import synthesize

synthesize(
    client, 
    featureSet='release', # must match the name set in client.create_featureset(...)
    classicTrainingSetOut='data/classic-training.txt',
    latestTrainingSetOut='data/latest-training.txt'
)

Generating 'classic' biased judgments:
Generating 'recent' biased judgments:


In [9]:
with open("data/classic-training.txt") as classic:
    for line in classic:
        print(line)

# qid:1: *1



0	qid:1	1:2014.0 # 374430	

1	qid:1	1:1995.0 # 19404	

1	qid:1	1:1994.0 # 278	

0	qid:1	1:2016.0 # 372058	

2	qid:1	1:1972.0 # 238	

0	qid:1	1:2016.0 # 360814	

0	qid:1	1:2014.0 # 244786	

1	qid:1	1:1993.0 # 424	

1	qid:1	1:2001.0 # 129	

2	qid:1	1:1974.0 # 240	

0	qid:1	1:2017.0 # 432517	

3	qid:1	1:1953.0 # 18148	

1	qid:1	1:1999.0 # 550	

1	qid:1	1:1994.0 # 680	

1	qid:1	1:1997.0 # 637	

1	qid:1	1:2008.0 # 155	

3	qid:1	1:1964.0 # 16672	

2	qid:1	1:1990.0 # 769	

2	qid:1	1:1975.0 # 510	

1	qid:1	1:1994.0 # 13	

3	qid:1	1:1954.0 # 346	

0	qid:1	1:2016.0 # 399106	

1	qid:1	1:1999.0 # 497	

3	qid:1	1:1957.0 # 389	

2	qid:1	1:1980.0 # 1891	

1	qid:1	1:1997.0 # 128	

0	qid:1	1:2011.0 # 77338	

3	qid:1	1:1960.0 # 539	

4	qid:1	1:1950.0 # 599	

4	qid:1	1:1948.0 # 19542	

0	qid:1	1:2013.0 # 313106	

1	qid:1	1:1994.0 # 101	

1	qid:1	1:2002.0 # 598	

1	qid:1	1:1998.0 # 73	

4	qid:1	1:1931.0 # 901	

2	qid:1	1:1988.0 # 12477	

1	qid:1	1:2003.0 # 325553	

0	qid:1	1:2014.0 # 265177

In [10]:
with open("data/latest-training.txt") as latest:
    for line in latest:
        print(line)

# qid:1: *1



4	qid:1	1:2014.0 # 374430	

3	qid:1	1:1995.0 # 19404	

3	qid:1	1:1994.0 # 278	

4	qid:1	1:2016.0 # 372058	

2	qid:1	1:1972.0 # 238	

4	qid:1	1:2016.0 # 360814	

4	qid:1	1:2014.0 # 244786	

3	qid:1	1:1993.0 # 424	

3	qid:1	1:2001.0 # 129	

2	qid:1	1:1974.0 # 240	

4	qid:1	1:2017.0 # 432517	

1	qid:1	1:1953.0 # 18148	

3	qid:1	1:1999.0 # 550	

3	qid:1	1:1994.0 # 680	

3	qid:1	1:1997.0 # 637	

3	qid:1	1:2008.0 # 155	

1	qid:1	1:1964.0 # 16672	

2	qid:1	1:1990.0 # 769	

2	qid:1	1:1975.0 # 510	

3	qid:1	1:1994.0 # 13	

1	qid:1	1:1954.0 # 346	

4	qid:1	1:2016.0 # 399106	

3	qid:1	1:1999.0 # 497	

1	qid:1	1:1957.0 # 389	

2	qid:1	1:1980.0 # 1891	

3	qid:1	1:1997.0 # 128	

4	qid:1	1:2011.0 # 77338	

1	qid:1	1:1960.0 # 539	

0	qid:1	1:1950.0 # 599	

0	qid:1	1:1948.0 # 19542	

4	qid:1	1:2013.0 # 313106	

3	qid:1	1:1994.0 # 101	

3	qid:1	1:2002.0 # 598	

3	qid:1	1:1998.0 # 73	

0	qid:1	1:1931.0 # 901	

2	qid:1	1:1988.0 # 12477	

3	qid:1	1:2003.0 # 325553	

4	qid:1	1:2014.0 # 265177
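
The grades printed in the two files above appear to follow fixed bands of release year, with the `latest` grade being the mirror image of the `classic` grade. The sketch below is inferred only from those printed rows. The real rules live in `ltr/years_as_ratings.py` and may differ, and the function names here are hypothetical.

```python
def classic_grade(year):
    """Grade 0-4 that rewards older movies (bands inferred from the printed rows)."""
    if year > 2010:
        return 0
    if year > 1990:
        return 1
    if year > 1970:
        return 2
    if year > 1950:
        return 3
    return 4


def latest_grade(year):
    """Grade 0-4 that rewards newer movies: the mirror image of classic_grade."""
    return 4 - classic_grade(year)


# Spot-check against a few rows from the files above.
for year in (2014, 1995, 1972, 1953, 1950):
    print(year, classic_grade(year), latest_grade(year))
```

Note the assumption baked into such a scheme: relevance depends on the release year alone and ignores the query entirely, which is exactly what makes these judgments synthetic.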

### Format the training data as two arrays of Judgment objects

This step is in preparation for passing the training data into Ranklib.

In [11]:
import ltr.judgments as judge

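# Read each file inside a `with` block so the file handle is closed afterwards
with open('data/classic-training.txt') as classic_file:
    classic_training_set = list(judge.judgments_from_file(classic_file))
with open('data/latest-training.txt') as latest_file:
    latest_training_set = list(judge.judgments_from_file(latest_file))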

Recognizing 1 queries
Recognizing 1 queries
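
To make the file format concrete, here is a minimal sketch of parsing one row of the training files printed earlier: a grade, a `qid:` query id, numbered `index:value` features, and a document id after the `#`. This is a stand-in written for illustration. It is not the implementation of `judge.judgments_from_file`, and `parse_judgment_line` is a hypothetical name.

```python
def parse_judgment_line(line):
    """Parse a row such as '0\tqid:1\t1:2014.0 # 374430'.

    Returns None for blank lines and for header lines that start with '#'.
    """
    body, _, comment = line.partition('#')
    tokens = body.split()
    if not tokens:
        return None
    grade = int(tokens[0])
    qid = int(tokens[1].split(':', 1)[1])
    features = {}
    for token in tokens[2:]:
        index, value = token.split(':', 1)
        features[int(index)] = float(value)
    return {
        'grade': grade,
        'qid': qid,
        'features': features,
        'doc_id': comment.strip(),
    }


print(parse_judgment_line("0\tqid:1\t1:2014.0 # 374430\t"))
```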


### Train and Submit

We'll train a lot of models in this class! Our `ltr` library has a `train` function that wraps a tool called `Ranklib` (more on Ranklib later). It lets you pass the most common options to Ranklib, stores the resulting model in the search engine, and returns diagnostic output that's worth inspecting.

For now, we'll just train using the generated training sets and store two models, `latest` and `classic`.


In [12]:
from ltr.ranklib import train

train(client, training_set=latest_training_set, 
      index='tmdb', featureSet='release', modelName='latest')

/var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/training.txt -save data/latest_model.txt 
Delete model latest: 404
Created Model latest [Status: 201]
Model saved


<ltr.helpers.ranklib_result.RanklibResult at 0x1087ac910>

Now train another model based on the 'classic' movie judgments.

In [13]:
train(client, training_set=classic_training_set, 
      index='tmdb', featureSet='release', modelName='classic')

/var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t DCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/training.txt -save data/classic_model.txt 
Delete model classic: 404
Created Model classic [Status: 201]
Model saved


<ltr.helpers.ranklib_result.RanklibResult at 0x108405ba0>
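
The training logs above show `-metric2t DCG@10`, meaning training optimizes discounted cumulative gain over the top 10 results. As a rough illustration of what that metric rewards, here is a minimal sketch using a common formulation of DCG with exponential gain and a log2 position discount. Whether this matches Ranklib's internals term for term is an assumption, and `dcg_at_k` is a hypothetical helper.

```python
import math


def dcg_at_k(grades, k=10):
    """DCG over the first k grades, listed in ranked order (best position first).

    Uses gain 2**grade - 1 and discount log2(rank + 1), a common formulation.
    """
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


# The same four judged documents in two different orders:
good_order = [4, 3, 1, 0]   # highest grades ranked first
bad_order = [0, 1, 3, 4]    # highest grades ranked last
print(dcg_at_k(good_order) > dcg_at_k(bad_order))
```

Because the gain is divided by a value that grows with rank, placing highly graded documents near the top yields a larger DCG than placing them near the bottom.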

### Ben Affleck vs Adam West
If we search for `batman`, how do the results compare? Since the `classic` model preferred old movies, it puts old movies in the top positions, and the opposite is true for the `latest` model. To continue learning LTR, brainstorm more features and generate some real judgments for real queries.

In [14]:
import ltr.release_date_plot as rdp
rdp.plot(client, 'batman')
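
This notebook does not show how a stored model is applied at query time. With the LTR plugin, this is typically done by running a normal query and then rescoring the top hits with an `sltr` query that names the model. The sketch below builds such a request body as a plain dict. It is an assumption about the general shape of the request, not the code behind `rdp.search`; the function name, the `keywords` parameter, the window size, and the choice to zero out the original query's score are all illustrative.

```python
def sltr_rescore_body(keywords, model, window_size=1000, size=12):
    """Build a search body that rescoring the top hits with a stored LTR model."""
    return {
        "size": size,
        # First pass: an ordinary keyword match on the title field.
        "query": {"match": {"title": keywords}},
        # Second pass: rescore the top `window_size` hits with the model.
        "rescore": {
            "window_size": window_size,
            "query": {
                # Illustrative choice: rank by the model's score alone.
                "query_weight": 0,
                "rescore_query_weight": 1,
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model,
                    }
                },
            },
        },
    }


body = sltr_rescore_body("batman", "classic")
print(body["rescore"]["query"]["rescore_query"]["sltr"]["model"])
```

Switching between the two models trained above would then only require changing the `model` argument.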

### See the top results for both models

Let's look at the `classic` model first.

In [15]:
import pandas as pd
classic_results = rdp.search(client, 'batman', 'classic')
pd.json_normalize(classic_results)[['id', 'title', 'release_year', 'score']].head(12)

| | id | title | release_year | score |
|---|---|---|---|---|
| 0 | 93560 | Batman and Robin | 1949 | 1.717154 |
| 1 | 125249 | Batman | 1943 | 1.717093 |
| 2 | 268 | Batman | 1989 | -3.982395 |
| 3 | 2661 | Batman | 1966 | -4.039894 |
| 4 | 364 | Batman Returns | 1992 | -4.095062 |
| 5 | 142061 | Batman: The Dark Knight Returns, Part 2 | 2013 | -4.183458 |
| 6 | 16234 | Batman Beyond: Return of the Joker | 2000 | -4.183458 |
| 7 | 123025 | Batman: The Dark Knight Returns, Part 1 | 2012 | -4.183458 |
| 8 | 40662 | Batman: Under the Red Hood | 2010 | -4.183458 |
| 9 | 272 | Batman Begins | 2005 | -4.183458 |


And then the `latest` model.

In [16]:
latest_results = rdp.search(client, 'batman', 'latest')
pd.json_normalize(latest_results)[['id', 'title', 'release_year', 'score']].head(12)

| | id | title | release_year | score |
|---|---|---|---|---|
| 0 | 242643 | Batman: Assault on Arkham | 2014 | 6.397678 |
| 1 | 251519 | Son of Batman | 2014 | 6.397678 |
| 2 | 382322 | Batman: The Killing Joke | 2016 | 1.371252 |
| 3 | 209112 | Batman v Superman: Dawn of Justice | 2016 | 1.371252 |
| 4 | 366924 | Batman: Bad Blood | 2016 | 1.371252 |
| 5 | 324849 | The Lego Batman Movie | 2017 | 1.371228 |
| 6 | 321528 | Batman vs. Robin | 2015 | 1.371181 |
| 7 | 142061 | Batman: The Dark Knight Returns, Part 2 | 2013 | 0.948931 |
| 8 | 123025 | Batman: The Dark Knight Returns, Part 1 | 2012 | 0.948931 |
| 9 | 69735 | Batman: Year One | 2011 | 0.948931 |
