# Basics & Prereqs (run once)

Run this cell if you haven't already downloaded the dependencies or indexed the TheMovieDB (TMDB) data.

In [1]:
from ltr import download, index
download.run(); index.run()

GET https://dl.bintray.com/o19s/RankyMcRankFace/com/o19s/RankyMcRankFace/0.1.1/RankyMcRankFace-0.1.1.jar
GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
Done.


# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that grades newer science fiction movies as more relevant and older drama movies as more relevant.

In [3]:
from ltr import date_genre_judgments
judgments = date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Done


In [2]:
# Uncomment this line to see the judgments
# 
# for judgment in judgments:
#    print(judgment.toRanklibFormat())
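
To make the idea of a date-driven judgment concrete, here is a minimal sketch of how a release year could be turned into a relevance grade. It is not the implementation inside `date_genre_judgments`; the function name `year_to_grade`, the 0-4 grade scale, and the 1920-2020 year range are all assumptions chosen for illustration.

```python
def year_to_grade(release_year, newer_is_better, oldest=1920, newest=2020, max_grade=4):
    """Map a release year onto an integer grade in [0, max_grade].

    Hypothetical helper: the year range and grade scale are assumptions.
    """
    # Clamp the year into the assumed range so outliers don't overflow the scale.
    year = min(max(release_year, oldest), newest)
    # Position of the year within the range: 0.0 = oldest, 1.0 = newest.
    fraction = (year - oldest) / (newest - oldest)
    # For queries that prefer older movies (e.g. "Drama"), flip the direction.
    if not newer_is_better:
        fraction = 1.0 - fraction
    return round(fraction * max_grade)


# "Science Fiction" rewards newer movies, "Drama" rewards older ones.
scifi_grade = year_to_grade(2016, newer_is_better=True)
drama_grade = year_to_grade(1926, newer_is_better=False)
```

The same year produces opposite grades depending on the query, which is exactly the tension the model has to learn.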

### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [3]:
config = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": { "match_all": {} }
                    }
                }
            },
            {
                "name": "is_sci_fi",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Science Fiction"}
                        },
                        "boost": 10.0
                    }
                }
            },
            {
                "name": "is_drama",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Drama"}
                        },
                        "boost": 4.0
                    }
                }
            },
            {
                "name": "is_genre_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "{{keywords}}"}
                        },
                        "boost": 100.0
                    }
                }
            }
        ]
    },
    "validation": {
        "params": {
            "keywords": "Science Fiction"
        },
        "index": "tmdb"
    }
}

from ltr import setup_ltr
setup_ltr.run(config=config, featureset='genre')

Removed LTR feature store: 200
Initialize LTR: 200
Created genre feature set: 201
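
Only `is_genre_match` declares a `keywords` parameter, so it is the only feature whose query changes with what the user typed. The sketch below illustrates that substitution with a plain-Python stand-in. The real rendering happens inside the search engine; `render_template` is a hypothetical helper, not part of the `ltr` package or the plugin.

```python
def render_template(template, params):
    """Recursively replace {{name}} placeholders in a query template.

    Hypothetical stand-in for the engine's own template rendering.
    """
    if isinstance(template, dict):
        return {key: render_template(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render_template(value, params) for value in template]
    if isinstance(template, str):
        for name, value in params.items():
            template = template.replace("{{" + name + "}}", str(value))
        return template
    # Numbers, booleans, None: nothing to substitute.
    return template


genre_match_template = {
    "constant_score": {
        "filter": {"match_phrase": {"genres": "{{keywords}}"}},
        "boost": 100.0,
    }
}

rendered = render_template(genre_match_template, {"keywords": "Drama"})
```

The other three templates contain no placeholder, so they produce the same query, and the same feature value for a given movie, regardless of the search terms.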


### Log from search engine -> to training set

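Each feature is a query that the search engine scores against the documents in the judgment list. The logged scores become the feature values in the training set.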

In [1]:
from ltr import collectFeatures
trainingSet = collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                                       trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                                       featureSet='genre')

Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
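
Each row of the resulting training file pairs a judgment with the logged feature values, in the RankLib text format: grade, `qid`, then `featureId:value` pairs and an optional trailing comment. The sketch below builds one such line by hand. The helper `to_ranklib_line`, the document ID, and the feature values are made up for illustration and are not taken from the generated file.

```python
def to_ranklib_line(grade, qid, feature_values, comment):
    """Format one training row as '<grade> qid:<qid> 1:<v1> 2:<v2> ... # <comment>'.

    Hypothetical helper; feature IDs are 1-based positions in feature_values.
    """
    pairs = " ".join(
        "{}:{}".format(feature_id, value)
        for feature_id, value in enumerate(feature_values, start=1)
    )
    return "{} qid:{} {} # {}".format(grade, qid, pairs, comment)


# Illustrative values for a 2016 science fiction movie under the "Science Fiction" query:
# release_year=2016, is_sci_fi matched (boost 10), is_drama missed, is_genre_match matched (boost 100).
line = to_ranklib_line(grade=4, qid=1,
                       feature_values=[2016.0, 10.0, 0.0, 100.0],
                       comment="12345")
```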


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [2]:
from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
Delete model genre: 404
Created model genre: 201

Impact of each feature on the model
1 - 30989934.22204985
2 - 2891683.161237409
3 - 24.47444925009026
4 - 0.0
Perfect NDCG! 1.0
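
For reference, here is a minimal sketch of how NDCG@k can be computed from the grades of a ranked list. It uses a common formulation (gain `2^grade - 1` with a log2 position discount); whether the trainer uses exactly this variant is an assumption, and `dcg_at_k` and `ndcg_at_k` are illustrative names.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades of a ranked list."""
    return sum(
        (2 ** grade - 1) / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, grade in enumerate(grades[:k])
    )


def ndcg_at_k(grades, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([4, 3, 2, 0], k=10)   # judged documents already in ideal order
shuffled = ndcg_at_k([0, 2, 3, 4], k=10)  # same documents, worst order
```

The metric is computed only over the documents that appear in the judgment list, so a score of 1.0 says nothing about how the model ranks documents that were never judged.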


### But this search sucks!
Try searches for "Science Fiction" and "Drama".

In [3]:
from ltr import search
search.run(keywords="Science Fiction", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "Science Fiction"}, "model": "genre"}}}
Why Him? 
10.492266 
2016 
['Comedy'] 
Ned, an overprotective dad, visits his daughter at Stanford where he meets his biggest nightmare: her well-meaning but socially awkward Silicon Valley billionaire boyfriend, Laird. A rivalry develops and Ned's panic level goes through the roof when he finds himself lost in this glamorous high-tech world and learns Laird is about to pop the question. 
---------------------------------------
X-Men: Apocalypse 
10.492266 
2016 
['Science Fiction'] 
After the re-emergence of the world's first mutant, world-destroyer Apocalypse, the X-Men must unite to defeat his extinction level plan. 
---------------------------------------
I Am Not a Serial Killer 
10.492266 
2016 
['Horror', 'Thriller'] 
Fifteen-year old John Cleaver is dangerous, and he knows it. He’s obsessed with serial killers, but really doesn’t want to become one. Terrible impulses constantly tempt h

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
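
The second check can be sketched in a few lines. The feature names and `params` mirror the feature set defined earlier, and the impacts are copied by hand from the training output above. `keyword_features_used` is a hypothetical helper, and keying the impacts by integer feature ID is an assumption about how you would store them.

```python
# Feature ID is the 1-based position in the feature set.
features = [
    {"name": "release_year", "params": []},
    {"name": "is_sci_fi", "params": []},
    {"name": "is_drama", "params": []},
    {"name": "is_genre_match", "params": ["keywords"]},
]

# Impacts from the first training run, copied from its printed output.
impacts = {1: 30989934.22204985, 2: 2891683.161237409, 3: 24.47444925009026, 4: 0.0}


def keyword_features_used(features, impacts):
    """Return names of features that depend on keywords AND have non-zero impact."""
    used = []
    for feature_id, feature in enumerate(features, start=1):
        depends_on_keywords = "keywords" in feature["params"]
        if depends_on_keywords and impacts.get(feature_id, 0.0) > 0.0:
            used.append(feature["name"])
    return used


used = keyword_features_used(features, impacts)
```

An empty result means the model never looks at what the user typed, so every query gets the same ranking.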

### Ranklib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.

In [4]:
from ltr import date_genre_judgments, collectFeatures
date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt',
                                    autoNegate=True)

collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                         trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                         featureSet='genre')

from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Done
Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
Delete model genre: 200
Created model genre: 201

Impact of each feature on the model
4 - 802616758.725657
1 - 89257217.35776828
3 - 1573733.5830246205
2 - 1.953984578809708e-20
Perfect NDCG! 1.0
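
The auto-negation trick itself is simple enough to sketch. This is not the code behind the `autoNegate` flag; `auto_negate`, the document IDs, and the grades below are hypothetical and only show the idea of borrowing another query's positives as grade-0 examples.

```python
def auto_negate(positives_by_query):
    """Add other queries' positive documents as grade-0 judgments for each query.

    positives_by_query maps query -> {doc_id: grade}. Returns (query, doc_id, grade) tuples.
    Hypothetical helper for illustration only.
    """
    judgments = []
    for query, positives in positives_by_query.items():
        # Keep the query's own graded documents.
        for doc_id, grade in positives.items():
            judgments.append((query, doc_id, grade))
        # Borrow documents judged for other queries as negatives.
        for other_query, other_positives in positives_by_query.items():
            if other_query == query:
                continue
            for doc_id in other_positives:
                if doc_id not in positives:  # never overwrite a real judgment
                    judgments.append((query, doc_id, 0))
    return judgments


# Made-up document IDs and grades.
judged = auto_negate({
    "Science Fiction": {"scifi_2016": 4},
    "Drama": {"drama_1926": 4},
})
```

With these negatives in place, a keyword-independent feature like release year can no longer separate good from bad results on its own, which is consistent with `is_genre_match` (feature 4) dominating the impacts above.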


### Now try those queries...

Replace the keywords below with 'science fiction' or 'drama' and see how it works.

In [6]:
from ltr import search
search.run(keywords="Drama", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "Drama"}, "model": "genre"}}}
The Shamrock Handicap 
9.445964 
1926 
['Romance', 'Drama'] 
The first film having an Irish motif that John Ford directed, a six reel delight set in Eire's County Kildare and in the United States, with a steeplechase background, mixing charged elements of comedy and sentimental drama, benefiting from a sterling cast including Leslie Fenton, Janet Gaynor, and Ford favourite J. Farrell MacDonald. 
---------------------------------------
Sparrows 
9.445964 
1926 
['Drama'] 
Evil Mr.Grimes keeps a rag-tag bunch orphans on his farm deep in a swamp in the US South. He forces them to work in his garden and treats them like slaves. They are watched over by the eldest, Molly. A gang in league with Mr. Grimes kidnaps Doris, the beautiful little daughter of a rich man, and hides her out on Grimes' farm, awaiting ransom. When the police close in, and Mr. Grimes threatens to throw Doris into the bottomless mire, Mol

### The next problem

- The model is overfit to these two example queries
- We need many more queries, covering more use cases