# Basics & Prereqs (run once)

If you haven't already downloaded the dependencies, or you don't have TheMovieDB data indexed, run this.

In [1]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

from ltr import download, index
download(); index.rebuild_tmdb(client)

data/tmdb.json already exists
data/blog.jsonl already exists
data/osc_judgments.txt already exists
data/RankyMcRankFace.jar already exists
data/title_judgments.txt already exists
data/genome_judgments.txt already exists
data/sample_judgments_train.txt already exists
Done.
Reconfig from disk...
Deleted index tmdb [Status: 200]
Created index tmdb [Status: 200]
Reindexing...
Indexed 0 movies (last Black Mirror: White Christmas)
Indexed 100 movies (last Apocalypse Now)
Indexed 200 movies (last Crooks in Clover)
Indexed 300 movies (last For a Few Dollars More)
Indexed 400 movies (last Downfall)
Indexed 500 movies (last Finding Nemo)
Indexed 600 movies (last Platoon)
Indexed 700 movies (last Night of the Living Dead)
Indexed 800 movies (last Evangelion: 1.0: You Are (Not) Alone)
Indexed 900 movies (last Batman: Assault on Arkham)
Indexed 1000 movies (last Riley's First Date?)
Indexed 1100 movies (last The Raid)
Indexed 1200 movies (last Falling Down)
Indexed 1300 movies (last Kal Ho Naa Ho)


Done [Status: 200]
Indexed 20000 movies (last Left Behind III: World at War)
Indexed 20100 movies (last Dragon Ball Z: Lord Slug)
Indexed 20200 movies (last The Adventures of Sherlock Holmes)
Indexed 20300 movies (last Billy's Hollywood Screen Kiss)
Indexed 20400 movies (last Short Night of Glass Dolls)
Indexed 20500 movies (last Kawa)
Indexed 20600 movies (last Bears)
Indexed 20700 movies (last Pyrates)
Indexed 20800 movies (last Bastard Out of Carolina)
Indexed 20900 movies (last The Mole People)
Indexed 21000 movies (last Till Human Voices Wake Us)
Indexed 21100 movies (last It's a Wonderful Afterlife)
Indexed 21200 movies (last The Bingo Long Traveling All-Stars & Motor Kings)
Indexed 21300 movies (last Ciao! Manhattan)
Indexed 21400 movies (last The Night They Raided Minsky's)
Indexed 21500 movies (last The Girl Can't Help It)
Indexed 21600 movies (last Sam Peckinpah's West: Legacy of a Hollywood Renegade)
Indexed 21700 movies (last A Guy Named Joe)
Indexed 21800 movies (last Odd 

### Use Solr Client

In [1]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction, but older drama. As in a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that grades new science fiction movies and old drama movies as more relevant.

In [2]:
from ltr.date_genre_judgments import synthesize
judgments = synthesize(client, judgmentsOutFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Done
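To make the grading rule concrete, here is a minimal sketch of the kind of logic such a synthetic judgment generator could apply. The function name `grade_movie`, the binary 0/1 grades, and the `cutoff_year` of 2000 are assumptions for illustration; the actual `synthesize` implementation in `ltr.date_genre_judgments` may grade differently.

```python
def grade_movie(keywords, genres, release_year, cutoff_year=2000):
    """Hypothetical grading rule: 1 = relevant, 0 = not relevant.

    New science fiction and old drama are treated as relevant.
    The cutoff year is an assumed value, not taken from the library.
    """
    if keywords == "Science Fiction" and "Science Fiction" in genres:
        return 1 if release_year >= cutoff_year else 0
    if keywords == "Drama" and "Drama" in genres:
        return 1 if release_year < cutoff_year else 0
    # Movies outside the queried genre never get a positive grade here
    return 0


print(grade_movie("Science Fiction", ["Action", "Science Fiction"], 2017))
print(grade_movie("Drama", ["Drama"], 1917))
```

Note that the same release year pushes the grade in opposite directions depending on the query, which is exactly what the model has to learn.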


### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [3]:
config = [
            {
                "store": "genre", # Note: This overrides the _DEFAULT_ feature store location
                "name" : "release_year",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "{!func}def(release_year,2000)"
                }
            },
            {
                "store": "genre",
                "name" : "is_sci_fi",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Science Fiction\"^=10.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_drama",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Drama\"^=4.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_genre_match",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"${keywords}\"^=100.0"
                }
            }
]


from ltr.setup import setup
setup(client, config=config, index='tmdb', featureset='genre')

Deleted genre Featurestore [Status: 200]
Created genre feature store under tmdb: [Status: 200]


### Log from search engine -> to training set

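Each feature is a query that the search engine scores against every document in the judgment list. The logged scores become the feature values in the training set.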

In [4]:
from ltr.log import judgments_to_training_set
trainingSet = judgments_to_training_set(client,
                                        judgmentInFile='data/genre_by_date_judgments.txt', 
                                        trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                        featureSet='genre')

Recognizing 2 queries...
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
Discarded 0 Keep 2725
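For orientation, each kept row ends up as one line in the RankLib training file, in the usual `grade qid:N 1:value 2:value ...` layout. The sketch below builds such a line by hand. The helper name `to_ranklib_line`, the document ID, the feature values, and the exact layout of the trailing comment are made up for illustration and are not taken from `ltr.log`.

```python
def to_ranklib_line(grade, qid, feature_values, doc_id, keywords):
    """Format one judgment plus its logged feature values as a RankLib row.

    Feature IDs are 1-based and follow the order of the feature list:
    1 = release_year, 2 = is_sci_fi, 3 = is_drama, 4 = is_genre_match.
    """
    features = " ".join(
        "{}:{}".format(ftr_id, value)
        for ftr_id, value in enumerate(feature_values, start=1)
    )
    # Everything after '#' is a comment that RankLib ignores
    return "{} qid:{} {} # {} {}".format(grade, qid, features, doc_id, keywords)


# Hypothetical row: a relevant 2017 science fiction movie
line = to_ranklib_line(1, 1, [2017.0, 10.0, 0.0, 100.0], "doc42", "Science Fiction")
print(line)
```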


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [5]:
from ltr.train import train
trainLog = train(client, 
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 index='tmdb',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Running java -jar data/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt 
DONE
Submit Model genre Ftr Set genre [Status: 200]
Feature Set genre... [Status: 200]
{"store": "genre", "name": "genre", "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel", "features": [{"name": "release_year"}, {"name": "is_sci_fi"}, {"name": "is_drama"}, {"name": "is_genre_match"}], "params": {"trees": [{"weight": "0.1", "root": {"feature": "release_year", "threshold": "1940.0", "left": {"feature": "release_year", "threshold": "1939.0", "left": {"feature": "release_year", "threshold": "1932.0", "left": {"value": "2.0"}, "right": {"feature": "release_year", "threshold": "1936.0", "left": {"feature": "release_year", "threshold": "1935.0", "left": {"feature": "release_year", "threshold": "1933.0", "left": {"value": "1.9016839265823364"}, "right": {"value": "2.0"}}, "ri
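It helps to remember what the `NDCG@10` number is actually computed over: only the documents that appear in the training file. Here is a minimal sketch of one common NDCG formulation (gain `2^grade - 1`, log2 position discount). Treat it as an illustration rather than a copy of RankLib's internals.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, in ranked order."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 2)
        for rank, grade in enumerate(grades[:k])
    )


def ndcg_at_k(grades, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Grades of the judged documents, in the order the model ranked them
print(ndcg_at_k([1, 1, 0, 0]))  # relevant documents first
print(ndcg_at_k([0, 0, 1, 1]))  # relevant documents last
```

A document that was never judged does not appear in `grades` at all, so it cannot lower the score, no matter where the model would rank it in a live search.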

### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [7]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

['The Girl from the Marsh Croft'] 
5.4329715 
1917 
N/A 
["A 1917 Swedish drama film directed by Victor Sjöström, based on a 1913 novel by Selma Lagerlöf. It was the first in a series of successful Lagerlöf adaptions by Sjöström, made possible by a deal between Lagerlöf and A-B Svenska Biografteatern (later AB Svensk Filmindustri) to adapt at least one Lagerlöf novel each year. Lagerlöf had for many years denied any proposal to let her novels be adapted for film, but after seeing Sjöström's Terje Vigen she finally decided to give her allowance."] 
---------------------------------------
['Straight Shooting'] 
5.4329715 
1917 
['Western'] 
["Cattleman Flint cuts off farmer Sims' water supply. When Sims' son Ted goes for water, one of Flint's men kills him. Cheyenne is sent to finish off Sims, but finding the family at the newly dug grave, he changes sides."] 
---------------------------------------
['The Dying Swan'] 
5.4329715 
1917 
N/A 
['When Viktor meets Gizella one day beside the 

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
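The second check can be partly automated. The sketch below scans a feature config for the `${keywords}` placeholder to list which features depend on the query at all. The helper name `keyword_dependent_features` is hypothetical, and the trimmed-down config only mirrors the `name` and `q` fields of the one defined earlier.

```python
def keyword_dependent_features(config, placeholder="${keywords}"):
    """Return the names of features whose query references the user's keywords."""
    return [
        feature["name"]
        for feature in config
        if placeholder in feature.get("params", {}).get("q", "")
    ]


# Trimmed copy of the feature config above (only the fields we need)
sample_config = [
    {"name": "release_year", "params": {"q": "{!func}def(release_year,2000)"}},
    {"name": "is_sci_fi", "params": {"q": "genres:\"Science Fiction\"^=10.0"}},
    {"name": "is_drama", "params": {"q": "genres:\"Drama\"^=4.0"}},
    {"name": "is_genre_match", "params": {"q": "genres:\"${keywords}\"^=100.0"}},
]

print(keyword_dependent_features(sample_config))
```

If the only keyword-dependent feature shows little or no impact in `trainLog.impacts`, the model has no way to tell a "Drama" search from a "Science Fiction" search.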

### RankLib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.

In [8]:
from ltr import date_genre_judgments
from ltr.log import judgments_to_training_set

date_genre_judgments.synthesize(client,
                                judgmentsOutFile='data/genre_by_date_judgments.txt',
                                autoNegate=True)

judgments_to_training_set(client,
                          judgmentInFile='data/genre_by_date_judgments.txt', 
                          trainingOutFile='data/genre_by_date_judgments_train.txt', 
                          featureSet='genre')

from ltr.train import train
trainLog = train(client,
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 index='tmdb',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Done
Recognizing 2 queries...
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
Discarded 0 Keep 5450
Running java -jar data/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt 
DONE
Submit Model genre Ftr Set genre [Status: 200]
Feature Set genre... [Status: 200]
{"store": "genre", "name": "genre", "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel", "features": [{"name": "release_year"}, {"name": "is_sci_fi"}, {"nam
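The auto-negation trick itself is simple enough to sketch. The version below assumes judgments are plain `(grade, keywords, doc_id)` tuples and that a grade above 0 counts as positive. Both the tuple layout and the function name `auto_negate` are assumptions, and the library's `autoNegate` flag may work differently under the hood.

```python
def auto_negate(judgments):
    """Add grade-0 rows: one query's positives become the other queries' negatives.

    judgments: list of (grade, keywords, doc_id) tuples (assumed layout).
    """
    judged = {(keywords, doc_id) for _, keywords, doc_id in judgments}
    queries = sorted({keywords for _, keywords, _ in judgments})
    negatives = []
    for grade, keywords, doc_id in judgments:
        if grade <= 0:
            continue  # only positive results are turned into negatives
        for other in queries:
            # Never overwrite a judgment that already exists for this pair
            if other != keywords and (other, doc_id) not in judged:
                negatives.append((0, other, doc_id))
                judged.add((other, doc_id))
    return judgments + negatives


original = [(1, "Science Fiction", "movie_a"), (1, "Drama", "movie_b")]
for row in auto_negate(original):
    print(row)
```

With these extra rows, a new non-science-fiction movie finally shows up as a bad result for "Science Fiction", which gives the genre match feature something to separate.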

### Now try those queries...

Replace keywords below with 'science fiction' or 'drama' and see how it works

In [9]:
from ltr.search import search
search(client, keywords="science fiction", modelName="genre")

['Dr. Jekyll and Mr. Hyde'] 
2.9053094 
1920 
['Drama', 'Horror', 'Science Fiction'] 
["Dr. Jekyll and Mr. Hyde is a 1920 horror silent film based upon Robert Louis Stevenson's novella The Strange Case of Dr Jekyll and Mr Hyde and starring actor John Barrymore."] 
---------------------------------------
['Guardians of the Galaxy Vol. 2'] 
2.4134026 
2017 
['Action', 'Adventure', 'Comedy', 'Science Fiction'] 
["The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage."] 
---------------------------------------
['Rogue One: A Star Wars Story'] 
2.4134026 
2016 
['Adventure', 'Science Fiction', 'Action'] 
['A rogue band of resistance fighters unite for a mission to steal the Death Star plans and bring a new hope to the galaxy.'] 
---------------------------------------
['Wonder Woman'] 
2.4134026 
2017 
['Action', 'Adventure', 'Fantasy', 'Science Fiction'] 
['An Amazon princess comes to the world of Man to become the gre

In [10]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

['A Man There Was'] 
3.5908287 
1917 
['Drama'] 
["Terje Vigen, a sailor, suffers the loss of his family through the cruelty of another man. Years later, when his enemy's family finds itself dependent on Terje's benevolence, Terje must decide whether to avenge himself."] 
---------------------------------------
['The Immigrant'] 
3.5908287 
1917 
['Comedy', 'Drama'] 
['Charlie is an immigrant who endures a challenging voyage and gets into trouble as soon as he arrives in America.'] 
---------------------------------------
['The Birth of a Nation'] 
3.5895665 
1915 
['War', 'Drama', 'History', 'Romance'] 
['The Birth of A Nation is a silent film from 1915 and the highest grossing silent film in film history. The film tells a romance story during the American civil war. D.W. Griffith invested heavily in its high production values, pioneering many new camera effects. The Birth of a Nation was strongly protested for its negative portrayal of newly freed slaves (mostly white actors in black

### The next problem

- The model is overfit to these two queries
- We need many more queries, covering more use cases