# Basics & Prereqs (run once)

Run this if you don't already have the downloaded dependencies, or if you don't have TheMovieDB data indexed.

In [1]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

from ltr import download, index
download(); index.rebuild_tmdb(client)

GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
GET http://es-learn-to-rank.labs.o19s.com/RankyMcRankFace.jar
GET http://es-learn-to-rank.labs.o19s.com/title_judgments.txt
GET http://es-learn-to-rank.labs.o19s.com/genome_judgments.txt
GET http://es-learn-to-rank.labs.o19s.com/sample_judgments_train.txt
Done.
Deleted index tmdb [Status: 200]
Created index tmdb [Status: 200]
Reindexing...
Indexed 0 movies (last Black Mirror: White Christmas)
Indexed 100 movies (last Apocalypse Now)
Indexed 200 movies (last Crooks in Clover)
Indexed 300 movies (last For a Few Dollars More)
Indexed 400 movies (last Downfall)
Flushing 500 movies
Done [Status: 200]
Indexed 500 movies (last Finding Nemo)
Indexed 600 movies (last Platoon)
Indexed 700 movies (last Night of the Living Dead)
Indexed 800 movies (last Evangelion: 1.0: You Are (Not) Alone)
Indexed 900 movies (last Batman: Assault on Arkham)
Flushing 500 movies
Done [Status: 200]
Indexed 1000 movies (last Riley's First Date?)
Indexed 1100 movies 

Done [Status: 200]
Indexed 15500 movies (last Viva Cuba)
Indexed 15600 movies (last Big Pun: The Legacy)
Indexed 15700 movies (last Hurt)
Indexed 15800 movies (last The Mudge Boy)
Indexed 15900 movies (last The Hollywood Complex)
Flushing 500 movies
Done [Status: 200]
Indexed 16000 movies (last The Great Northfield Minnesota Raid)
Indexed 16100 movies (last Lotta Leaves Home)
Indexed 16200 movies (last Just One of the Girls)
Indexed 16300 movies (last Which Way Is The Front Line From Here? The Life and Time of Tim Hetherington)
Indexed 16400 movies (last The Ladies Man)
Flushing 500 movies
Done [Status: 200]
Indexed 16500 movies (last Assassin of the Tsar)
Indexed 16600 movies (last The Adventures of Tarzan)
Indexed 16700 movies (last Vendetta)
Indexed 16800 movies (last Trucker)
Indexed 16900 movies (last Branded)
Flushing 500 movies
Done [Status: 200]
Indexed 17000 movies (last Mariage à Mendoza)
Indexed 17100 movies (last Love Bites)
Indexed 17200 movies (last The Ballad of Ramblin'

### Use Solr Client

In [1]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction, but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that rewards new science fiction movies as more relevant, and old drama movies as more relevant.

In [2]:
from ltr.date_genre_judgments import synthesize
judgments = synthesize(client, judgmentsOutFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Query {'q': '*:*... [Status: 200]
Done
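To make the idea concrete, here is a minimal sketch of how such a judgment list could be synthesized. It is not the actual `ltr.date_genre_judgments` code: the function names, the 0-4 grade scale, and the 1900-2020 year range are all assumptions made for illustration.

```python
# Hypothetical sketch; NOT the real ltr.date_genre_judgments implementation.
# Assumed: grades run 0..4 and release years are clamped to 1900..2020.

def grade_by_year(release_year, newer_is_better,
                  min_year=1900, max_year=2020, max_grade=4):
    """Map a release year to an integer grade in [0, max_grade]."""
    # Clamp the year into the assumed range, then scale it to 0..1
    clamped = max(min_year, min(max_year, release_year))
    position = (clamped - min_year) / (max_year - min_year)
    if not newer_is_better:
        # For drama, older movies should get the higher grade
        position = 1.0 - position
    return round(position * max_grade)


def synthesize_judgments(movies):
    """Return (grade, query, doc_id) tuples for the two genre queries."""
    judgments = []
    for movie in movies:
        if "Science Fiction" in movie["genres"]:
            grade = grade_by_year(movie["release_year"], newer_is_better=True)
            judgments.append((grade, "Science Fiction", movie["id"]))
        if "Drama" in movie["genres"]:
            grade = grade_by_year(movie["release_year"], newer_is_better=False)
            judgments.append((grade, "Drama", movie["id"]))
    return judgments
```

Note that a sketch like this only ever judges science fiction movies for the "Science Fiction" query and drama movies for the "Drama" query.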


### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [3]:
config = [
            {
                "store": "genre", # Note: This overrides the _DEFAULT_ feature store location
                "name" : "release_year",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "{!func}def(release_year,2000)"
                }
            },
            {
                "store": "genre",
                "name" : "is_sci_fi",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Science Fiction\"^=10.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_drama",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Drama\"^=4.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_genre_match",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"${keywords}\"^=100.0"
                }
            }
]


from ltr.setup import setup
setup(client, config=config, featureset='genre')

Deleted classic model [Status: 200]
Deleted genre model [Status: 200]
Deleted latest model [Status: 200]
Deleted title model [Status: 200]
Deleted title_fuzzy model [Status: 200]
Deleted _DEFAULT Featurestore [Status: 200]
Deleted genre Featurestore [Status: 200]
Deleted release Featurestore [Status: 200]
Deleted title Featurestore [Status: 200]
Deleted title_fuzzy Featurestore [Status: 200]
Created genre feature store under tmdb: [Status: 200]


### Log from search engine -> to training set

Each feature is a query, scored against every document in the judgment list.

In [4]:
from ltr.log import judgments_to_training_set
trainingSet = judgments_to_training_set(client,
                                        judgmentInFile='data/genre_by_date_judgments.txt', 
                                        trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                        featureSet='genre')

Recognizing 2 queries...
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
Discarded 0 Keep 2725
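RankLib-style training files commonly use one line per judged document: the grade, a query ID, then `featureId:value` pairs. The sketch below shows that general shape. The helper name, the document ID, and the feature values are made up, and the exact layout written by `judgments_to_training_set` may differ.

```python
# Hypothetical helper showing the general LETOR/RankLib line layout.
# The doc ID and feature values below are illustrative, not real log output.

def format_training_row(grade, qid, features, doc_id, keywords):
    """Format one judged document as a RankLib-style training line."""
    # Feature IDs are 1-based and follow the order of the feature config
    feature_str = " ".join(
        "{}:{}".format(ftr_id, value)
        for ftr_id, value in enumerate(features, start=1)
    )
    # Everything after '#' is a comment; here it records the doc and query
    return "{} qid:{} {} # {} {}".format(grade, qid, feature_str, doc_id, keywords)


# Order matches the config: release_year, is_sci_fi, is_drama, is_genre_match
row = format_training_row(4, 1, [2016.0, 10.0, 0.0, 100.0],
                          "12345", "Science Fiction")
print(row)  # 4 qid:1 1:2016.0 2:10.0 3:0.0 4:100.0 # 12345 Science Fiction
```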


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [5]:
from ltr.train import train
trainLog = train(client, 
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 10 -leaf 50 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt 
DONE
Submit Model genre Ftr Set genre [Status: 200]
Feature Set genre... [Status: 200]
Deleted Model genre [Status: 200]
Created Model genre [Status: 200]

Impact of each feature on the model
1 - 43825558.67883423
2 - 11087385.597156812
3 - 4818.4748707032195
4 - 0.0
Perfect NDCG! 1.0
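It helps to remember what NDCG measures: how well the model orders the documents that appear in the training file, relative to the ideal ordering of those same documents. Below is a minimal sketch of one common NDCG@k formulation (gain `2^grade - 1`, log2 position discount). RankLib's own implementation may differ in detail, so treat this as an illustration rather than a reference.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, in ranked order."""
    return sum((2 ** grade - 1) / math.log2(rank + 2)
               for rank, grade in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Judged documents already sorted best-first score a perfect 1.0
print(ndcg_at_k([4, 3, 2, 1, 0]))
# Putting a grade-0 document above a grade-4 document lowers the score
print(ndcg_at_k([0, 4]))
```

A score of 1.0 therefore says nothing about documents that were never judged for a query.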


### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [6]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

Query {'fl': '*,... [Status: 200]
['Lazer Team'] 
1.4061091 
2016 
['Action', 'Comedy', 'Science Fiction'] 
["In the late 1970's, the SETI project received a one time signal from outer space. It looked exactly as theorists thought a communication from an alien civilization would -- unfortunately it has never been decoded. Or so we were told. Unbeknownst to the general public the signal was translated and told us two things:  1) We are not alone.  2) The galaxy is a dangerous place."] 
---------------------------------------
['The Void'] 
1.4061091 
2016 
['Mystery', 'Horror', 'Science Fiction'] 
["In the middle of a routine patrol, officer Daniel Carter happens upon a blood-soaked figure limping down a deserted stretch of road. He rushes the young man to a nearby rural hospital staffed by a skeleton crew, only to discover that patients and personnel are transforming into something inhuman. As the horror intensifies, Carter leads the other survivors on a hellish voyage into the subterra

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model relies on even USE the keywords?

### RankLib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.
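The negation trick can be sketched as follows. This is an assumed illustration of the idea, not the code behind `autoNegate`: the function name, the tuple layout, and the choice to negate only documents with a grade above 0 are all hypothetical.

```python
# Hypothetical sketch of the negation trick; the real autoNegate flag
# in ltr.date_genre_judgments may behave differently.

def auto_negate(judgments):
    """judgments: list of (grade, query, doc_id) tuples.

    For every positively graded document of one query, add a grade-0
    judgment for each other query that has not already judged it.
    """
    judged = {(query, doc_id) for _, query, doc_id in judgments}
    queries = sorted({query for _, query, _ in judgments})
    negatives = []
    for grade, query, doc_id in judgments:
        if grade <= 0:
            continue  # only positives of one query become negatives elsewhere
        for other in queries:
            if other != query and (other, doc_id) not in judged:
                negatives.append((0, other, doc_id))
                judged.add((other, doc_id))
    return judgments + negatives


example = [(4, "Science Fiction", "a"), (3, "Drama", "b")]
for judgment in auto_negate(example):
    print(judgment)
```

A movie that is already judged for both queries (say, a science fiction drama) is left alone, so an existing grade is never overwritten by a negative.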

In [7]:
from ltr import date_genre_judgments
from ltr.log import judgments_to_training_set

date_genre_judgments.synthesize(client,
                                judgmentsOutFile='data/genre_by_date_judgments.txt',
                                autoNegate=True)

judgments_to_training_set(client,
                          judgmentInFile='data/genre_by_date_judgments.txt', 
                          trainingOutFile='data/genre_by_date_judgments_train.txt', 
                          featureSet='genre')

from ltr.train import train
trainLog = train(client,
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Query {'q': '*:*... [Status: 200]
Done
Recognizing 2 queries...
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
Discarded 0 Keep 5450
Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 10 -leaf 50 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt 
DONE
Submit Model genre Ftr Set genre [Status: 200]
Feature Set genre... [Status: 200]
Deleted Model genre [Status: 200]
Created Model genre [Status: 200]

Impact of each feature on the model
4 - 2502122962.046671
3 - 268634889.38288647
1 - 116717710.59445384
2 - 4

### Now try those queries...

Replace `keywords` below with 'science fiction' or 'drama' and see how it works.

In [8]:
from ltr.search import search
search(client, keywords="science fiction", modelName="genre")

Query {'fl': '*,... [Status: 200]
['Lazer Team'] 
1.4070512 
2016 
['Action', 'Comedy', 'Science Fiction'] 
["In the late 1970's, the SETI project received a one time signal from outer space. It looked exactly as theorists thought a communication from an alien civilization would -- unfortunately it has never been decoded. Or so we were told. Unbeknownst to the general public the signal was translated and told us two things:  1) We are not alone.  2) The galaxy is a dangerous place."] 
---------------------------------------
['The Void'] 
1.4070512 
2016 
['Mystery', 'Horror', 'Science Fiction'] 
["In the middle of a routine patrol, officer Daniel Carter happens upon a blood-soaked figure limping down a deserted stretch of road. He rushes the young man to a nearby rural hospital staffed by a skeleton crew, only to discover that patients and personnel are transforming into something inhuman. As the horror intensifies, Carter leads the other survivors on a hellish voyage into the subterra

In [9]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

Query {'fl': '*,... [Status: 200]
['A Man There Was'] 
1.3831193 
1917 
['Drama'] 
["Terje Vigen, a sailor, suffers the loss of his family through the cruelty of another man. Years later, when his enemy's family finds itself dependent on Terje's benevolence, Terje must decide whether to avenge himself."] 
---------------------------------------
['The Immigrant'] 
1.3831193 
1917 
['Comedy', 'Drama'] 
['Charlie is an immigrant who endures a challenging voyage and gets into trouble as soon as he arrives in America.'] 
---------------------------------------
['The Last of the Mohicans'] 
1.3828125 
1920 
['Adventure', 'Drama', 'Action', 'History'] 
['As Alice and Cora Munro attempt to find their father, a British officer in the French and Indian War, they are set upon by French soldiers and their cohorts, Huron tribesmen led by the evil Magua. Fighting to rescue the women are Chingachgook and his son Uncas, the last of the Mohican tribe, and their white ally, the frontiersman Natty Bumppo

### The next problem

- Overfit to these two examples
- We need many more queries, covering more use cases