# Basics & Prereqs (run once)

Run this cell once if you haven't already downloaded the dependencies or indexed the TheMovieDB (TMDB) data.

In [1]:
from ltr.client.opensearch_client import OpenSearchClient
client = OpenSearchClient()

from ltr import download
from ltr.index import rebuild
from ltr.helpers.movies import indexable_movies

corpus='http://es-learn-to-rank.labs.o19s.com/tmdb.json'
download([corpus], dest='data/');

movies=indexable_movies(movies='data/tmdb.json')
rebuild(client, index='tmdb', doc_src=movies)

http://localhost:9201/_ltr; <OpenSearch([{'host': 'localhost', 'port': 9201}])>
data/tmdb.json already exists
Index tmdb already exists. Use `force = True` to delete and recreate


## Create OpenSearch Client

In [2]:
from ltr.client.opensearch_client import OpenSearchClient
client = OpenSearchClient()

http://localhost:9201/_ltr; <OpenSearch([{'host': 'localhost', 'port': 9201}])>


# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

We want to train a model that returns the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction, but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that rewards new science fiction movies as more relevant, and old drama movies as more relevant.
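As a rough illustration of what such a judgment list encodes, the sketch below grades the same release year in opposite directions depending on the query. The function name `synthetic_grade`, the 0-4 scale, and the year cutoffs are assumptions made up for this example; they are not taken from `ltr.date_genre_judgments`.

```python
def synthetic_grade(query, genres, release_year):
    """Return a hypothetical 0-4 grade, or None if the movie is not judged."""
    if query not in genres:
        # Assumption: movies outside the queried genre are simply left unjudged
        return None
    # Hypothetical year cutoffs, ordered oldest to newest
    cutoffs = [1970, 1990, 2005, 2015]
    newness = sum(release_year > c for c in cutoffs)  # 0 (oldest) .. 4 (newest)
    if query == "Science Fiction":
        return newness        # newer is better
    if query == "Drama":
        return 4 - newness    # older is better
    return None

print(synthetic_grade("Science Fiction", ["Science Fiction"], 2017))
print(synthetic_grade("Drama", ["Drama"], 1917))
print(synthetic_grade("Drama", ["Drama"], 2017))
```

Note that in this sketch a drama is never graded for "Science Fiction" at all, so a list built this way contains no out-of-genre examples.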

In [3]:
from ltr.date_genre_judgments import synthesize
judgments = synthesize(client, judgmentsOutFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
{'query': {'match_all': {}}, 'size': 10000, 'sort': [{'_id': 'asc'}]}


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 458233.62it/s]


### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4
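To make the four features concrete, here is a plain-Python approximation of the values they might take for one movie. The helper `feature_vector` is hypothetical, and exact lowercase string comparison only approximates what an analyzed `match_phrase` query does in the search engine.

```python
def feature_vector(movie, keywords):
    """Approximate the four feature values, in feature ID order."""
    genres = [g.lower() for g in movie.get("genres", [])]
    return [
        float(movie.get("release_year", 2000)),         # 1: release_year (default is an assumption)
        1.0 if "science fiction" in genres else 0.0,    # 2: is_sci_fi
        1.0 if "drama" in genres else 0.0,              # 3: is_drama
        1.0 if keywords.lower() in genres else 0.0,     # 4: is_genre_match
    ]

movie = {
    "title": "Guardians of the Galaxy Vol. 2",
    "release_year": 2017,
    "genres": ["Action", "Adventure", "Comedy", "Science Fiction"],
}
print(feature_vector(movie, "science fiction"))
print(feature_vector(movie, "drama"))
```

Only the fourth value changes when the keywords change; the first three depend on the document alone.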


In [4]:
client.reset_ltr(index='tmdb')

config = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": {"match_all": {}}
                    }
                }
            },
            {
                "name": "is_sci_fi",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Science Fiction"}
                        },
                        "boost": 1.0
                    }
                }
            },
            {
                "name": "is_drama",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Drama"}
                        },
                        "boost": 1.0
                    }
                }
            },
            {
                "name": "is_genre_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "{{keywords}}"}
                        },
                        "boost": 1.0
                    }
                }
            }
        ]
    },
    "validation": {
        "params": {
            "keywords": "Science Fiction"
        },
        "index": "tmdb"
    }
}

client.create_featureset(index='tmdb', name='genre', ftr_config=config)

Removed Default LTR feature store [Status: 200]
Initialize Default LTR feature store [Status: 200]
Create genre feature set [Status: 201]


### Log from search engine -> to training set

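Each feature is a query that the search engine scores against every document in the judgment list. The logged scores become the feature values of the training set.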
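RankLib reads training data as LETOR-style text, one judged document per line: the grade, the query ID, then `featureId:value` pairs, with an optional trailing comment. The sketch below formats one such row; the helper `to_ranklib_line` and the sample values are illustrative assumptions, not the code `FeatureLogger` uses.

```python
def to_ranklib_line(grade, qid, features, doc_id=""):
    """Format one judged document as a LETOR-style training row."""
    # Feature IDs are 1-based and follow the order of the feature set
    pairs = " ".join(f"{fid}:{value}" for fid, value in enumerate(features, start=1))
    line = f"{grade} qid:{qid} {pairs}"
    # The trailing comment is ignored by the trainer but helps debugging
    return f"{line} # {doc_id}" if doc_id else line

print(to_ranklib_line(4, 1, [2017.0, 1.0, 0.0, 1.0], doc_id="example-doc"))
# prints: 4 qid:1 1:2017.0 2:1.0 3:0.0 4:1.0 # example-doc
```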

In [5]:
from ltr.judgments import judgments_open
from ltr.log import FeatureLogger
from itertools import groupby

ftr_logger=FeatureLogger(client, index='tmdb', feature_set='genre')
with judgments_open('data/genre_by_date_judgments.txt') as judgment_list:
    for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
        ftr_logger.log_for_qid(judgments=query_judgments, 
                               qid=qid,
                               keywords=judgment_list.keywords(qid))


Recognizing 2 queries


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.
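The training metric here is NDCG@10. The sketch below shows one common formulation (exponential gain, log2 position discount) so the reported number is easier to interpret; RankLib's implementation may differ in details, and the function names are this example's own.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([4, 3, 2, 0]))  # judged documents already in ideal order
print(ndcg_at_k([0, 2, 3, 4]))  # same documents, worst order
```

The metric is computed only over judged documents. An NDCG of 1.0 on the training set means those documents are ordered perfectly; it says nothing about documents that were never judged.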

In [6]:
from ltr.ranklib import train
trainResponse = train(client,
                 training_set=ftr_logger.logged,
                 metric2t='NDCG@10',
                 index='tmdb',
                 featureSet='genre',
                 modelName='genre')

trainLog = trainResponse.trainingLogs[0]

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(client.get_feature_name(config, ftrId), impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

/var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/training.txt -save data/genre_model.txt 
Delete model genre: 404
Created Model genre [Status: 201]
Model saved

Impact of each feature on the model
release_year - 84672139.80027786
is_sci_fi - 57048.53836173947
is_drama - 15085.984020625705
is_genre_match - 0.0
Perfect NDCG! 1.0


### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [7]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "drama", "keywordsList": ["drama"]}, "model": "genre"}}}
{'size': 5, 'query': {'sltr': {'params': {'keywords': 'drama', 'keywordsList': ['drama']}, 'model': 'genre'}}}
The Girl from the Marsh Croft 
5.4329715 
1917 
[] 
A 1917 Swedish drama film directed by Victor Sjöström, based on a 1913 novel by Selma Lagerlöf. It was the first in a series of successful Lagerlöf adaptions by Sjöström, made possible by a deal between Lagerlöf and A-B Svenska Biografteatern (later AB Svensk Filmindustri) to adapt at least one Lagerlöf novel each year. Lagerlöf had for many years denied any proposal to let her novels be adapted for film, but after seeing Sjöström's Terje Vigen she finally decided to give her allowance. 
---------------------------------------
Straight Shooting 
5.4329715 
1917 
['Western'] 
Cattleman Flint cuts off farmer Sims' water supply. When Sims' son Ted goes for water, one of Flint's men kills him. Cheyenne is sent to finish 

### Why didn't it work!?!? Training data

1. Examine the training data: does it cover every kind of BAD result?
2. Examine the feature impacts: do any of the features the model relies on even USE the keywords?
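One quick way to run the first check is to look for features that take a single value across the whole training set, since a tree has nothing to split on there. The sketch below uses made-up rows in which every judged movie matches its query's genre; `constant_features` is a hypothetical helper, not part of `ltr`.

```python
def constant_features(rows):
    """rows: iterable of (qid, grade, feature_list).
    Returns the 1-based IDs of features that never vary."""
    distinct = {}
    for _qid, _grade, features in rows:
        for fid, value in enumerate(features, start=1):
            distinct.setdefault(fid, set()).add(value)
    return sorted(fid for fid, values in distinct.items() if len(values) == 1)

# Hypothetical rows: [release_year, is_sci_fi, is_drama, is_genre_match]
rows = [
    (1, 4, [2017.0, 1.0, 0.0, 1.0]),
    (1, 1, [1980.0, 1.0, 0.0, 1.0]),
    (2, 4, [1917.0, 0.0, 1.0, 1.0]),
    (2, 0, [2016.0, 0.0, 1.0, 1.0]),
]
print(constant_features(rows))  # prints: [4]
```

In this toy data `is_genre_match` is always 1.0, which is consistent with the 0.0 impact reported for it above: the only keyword-aware feature carries no signal.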

### RankLib only sees the data you give it; we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add these negative examples to the judgment list.
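The idea of borrowing negatives from other queries can be sketched in a few lines. This is an illustration of the trick under simple assumptions (judgments as `(qid, doc_id, grade)` tuples, grade 0 for negatives), not the actual implementation behind `autoNegate`.

```python
def auto_negate(judgments):
    """Add each query's positive documents as grade-0 negatives
    for every other query that has not already judged them."""
    judged = {(qid, doc_id) for qid, doc_id, _ in judgments}
    qids = sorted({qid for qid, _, _ in judgments})
    extra = []
    for qid, doc_id, grade in judgments:
        if grade <= 0:
            continue  # only positives are borrowed as negatives
        for other in qids:
            if other != qid and (other, doc_id) not in judged:
                extra.append((other, doc_id, 0))
                judged.add((other, doc_id))
    return judgments + extra

# Hypothetical judgments: qid 1 = "Science Fiction", qid 2 = "Drama"
judgments = [(1, "scifi-a", 4), (2, "drama-b", 4)]
print(auto_negate(judgments))
```

A document already judged for both queries, such as a movie tagged with both genres, keeps its existing grades because the `judged` set is checked first.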

In [8]:
from ltr import date_genre_judgments
date_genre_judgments.synthesize(client,
                                judgmentsOutFile='data/genre_by_date_judgments.txt',
                                autoNegate=True)

from ltr.log import FeatureLogger
from ltr.judgments import judgments_open
from itertools import groupby

ftr_logger=FeatureLogger(client, index='tmdb', feature_set='genre')
with judgments_open('data/genre_by_date_judgments.txt') as judgment_list:
    for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
        ftr_logger.log_for_qid(judgments=query_judgments, 
                               qid=qid,
                               keywords=judgment_list.keywords(qid))
        
        
from ltr.ranklib import train
trainResponse = train(client,
                 training_set=ftr_logger.logged,
                 metric2t='NDCG@10',
                 index='tmdb',
                 featureSet='genre',
                 modelName='genre')

trainLog = trainResponse.trainingLogs[0]

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(client.get_feature_name(config, ftrId), impact))
    
print("NDCG {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
{'query': {'match_all': {}}, 'size': 10000, 'sort': [{'_id': 'asc'}]}


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 487494.36it/s]


Recognizing 2 queries
/var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/training.txt -save data/genre_model.txt 
Delete model genre: 200
Created Model genre [Status: 201]
Model saved

Impact of each feature on the model
is_genre_match - 178579056.081049
release_year - 107974596.42232133
is_drama - 14215772.365316376
is_sci_fi - 7654167.392121586
NDCG 1.0


### Now try those queries...

Replace keywords below with 'science fiction' or 'drama' and see how it works

In [9]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "drama", "keywordsList": ["drama"]}, "model": "genre"}}}
{'size': 5, 'query': {'sltr': {'params': {'keywords': 'drama', 'keywordsList': ['drama']}, 'model': 'genre'}}}
A Man There Was 
3.5908291 
1917 
['Drama'] 
Terje Vigen, a sailor, suffers the loss of his family through the cruelty of another man. Years later, when his enemy's family finds itself dependent on Terje's benevolence, Terje must decide whether to avenge himself. 
---------------------------------------
The Immigrant 
3.5908291 
1917 
['Comedy', 'Drama'] 
Charlie is an immigrant who endures a challenging voyage and gets into trouble as soon as he arrives in America. 
---------------------------------------
Tillie's Punctured Romance 
3.5895667 
1914 
['Comedy', 'Drama', 'Romance'] 
Chaplin plays a womanizing city man who meets Tillie (Dressler) in the country after a fight with his girlfriend (Normand). When he sees that Tillie's father has a very large bankroll for h

In [10]:
from ltr.search import search
search(client, keywords="science fiction", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "science fiction", "keywordsList": ["science fiction"]}, "model": "genre"}}}
{'size': 5, 'query': {'sltr': {'params': {'keywords': 'science fiction', 'keywordsList': ['science fiction']}, 'model': 'genre'}}}
Dr. Jekyll and Mr. Hyde 
2.9053097 
1920 
['Drama', 'Horror', 'Science Fiction'] 
Dr. Jekyll and Mr. Hyde is a 1920 horror silent film based upon Robert Louis Stevenson's novella The Strange Case of Dr Jekyll and Mr Hyde and starring actor John Barrymore. 
---------------------------------------
Guardians of the Galaxy Vol. 2 
2.4134026 
2017 
['Action', 'Adventure', 'Comedy', 'Science Fiction'] 
The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage. 
---------------------------------------
Monster Trucks 
2.4134026 
2016 
['Action', 'Comedy', 'Science Fiction'] 
Looking for any way to get away from the life and town he was born into, Tripp, a high school se

### The next problem

- The model is overfit to these two queries
- We need many more queries, covering more use cases