# Basics & Prereqs (run once)

Run this cell if you have not already downloaded the dependencies or indexed TheMovieDB data.

In [None]:
from ltr.client.elastic_client import ElasticClient
client = ElasticClient()

from ltr import download, index
download(); index.rebuild_tmdb(client)

## Create Elastic Client

In [None]:
from ltr.client.elastic_client import ElasticClient
client = ElasticClient()

# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

Through analysis, we learn that searchers prefer newer science fiction but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that rewards new science fiction movies as more relevant, and old drama movies as more relevant.

In [None]:
from ltr.date_genre_judgments import synthesize
judgments = synthesize(client, judgmentsOutFile='data/genre_by_date_judgments.txt')
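
To make "newer is better for science fiction, older is better for drama" concrete, here is a minimal sketch of a grading rule of that shape. The function name `synthetic_grade`, the decade buckets, and the 0 to 4 grade scale are all assumptions for illustration; this is not necessarily how `synthesize` assigns grades.

```python
def synthetic_grade(query, release_year):
    """Hypothetical rule: grade 0 (worst) to 4 (best) based on release decade."""
    # Bucket the year: pre-1980 -> 0, 1980s -> 1, 1990s -> 2, 2000s -> 3, 2010+ -> 4
    bucket = min(max((release_year - 1970) // 10, 0), 4)
    if query == "science fiction":
        return bucket          # newer science fiction is more relevant
    if query == "drama":
        return 4 - bucket      # older drama is more relevant
    raise ValueError("no grading rule for query: {}".format(query))


for year in (1975, 1995, 2015):
    print(year, synthetic_grade("science fiction", year), synthetic_grade("drama", year))
```

The same release year earns opposite grades depending on the query, which is exactly the tension the model has to learn.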

### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [None]:
config = {
    "featureset": {
        "features": [
            {
                "name": "release_year",
                "params": [],
                "template": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "release_year",
                            "missing": 2000
                        },
                        "query": {"match_all": {}}
                    }
                }
            },
            {
                "name": "is_sci_fi",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Science Fiction"}
                        },
                        "boost": 1.0
                    }
                }
            },
            {
                "name": "is_drama",
                "params": [],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "Drama"}
                        },
                        "boost": 1.0
                    }
                }
            },
            {
                "name": "is_genre_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                        "filter": {
                            "match_phrase": {"genres": "{{keywords}}"}
                        },
                        "boost": 1.0
                    }
                }
            }
        ]
    },
    "validation": {
        "params": {
            "keywords": "Science Fiction"
        },
        "index": "tmdb"
    }
}

from ltr.setup import setup
setup(client, index='tmdb', config=config, featureset='genre')
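
Only `is_genre_match` declares a `keywords` parameter, so it is the only feature whose query changes with what the user typed. The sketch below shows roughly what filling in the `{{keywords}}` placeholder looks like. The helper `render_template` is hypothetical and uses plain string substitution as a stand-in; the search engine performs its own template rendering.

```python
import json


def render_template(template, params):
    """Hypothetical stand-in for the engine's templating: fill {{name}} placeholders."""
    text = json.dumps(template)
    for name, value in params.items():
        # json.dumps(value)[1:-1] escapes quotes/backslashes so the JSON stays valid
        text = text.replace("{{" + name + "}}", json.dumps(value)[1:-1])
    return json.loads(text)


is_genre_match = {
    "constant_score": {
        "filter": {"match_phrase": {"genres": "{{keywords}}"}},
        "boost": 1.0
    }
}

print(render_template(is_genre_match, {"keywords": "drama"}))
```

The other three templates contain no placeholder, so they produce the same value for a given movie regardless of the query.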

### Log from search engine -> to training set

Each feature is a query, scored against each document in the judgment list.

In [None]:
from ltr.log import judgments_to_training_set
trainingSet = judgments_to_training_set(client,
                                        judgmentInFile='data/genre_by_date_judgments.txt', 
                                        trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                        featureSet='genre')
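
RankLib-style training files conventionally hold one judged document per line: a grade, a `qid:` query ID, then `featureId:value` pairs and an optional `#` comment. Assuming the training file written above follows that convention, a small parser like this can help when eyeballing it. The sample line is made up for illustration, not taken from the real file.

```python
def parse_ranklib_line(line):
    """Split one RankLib-style line into (grade, qid, features, comment)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    grade = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return grade, qid, features, comment.strip()


# Hypothetical row: feature IDs follow the list above (1=release_year ... 4=is_genre_match)
sample = "4 qid:1 1:2015.0 2:1.0 3:0.0 4:1.0 # 12345 science fiction"
grade, qid, features, comment = parse_ranklib_line(sample)
print(grade, qid, features, comment)
```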

### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [None]:
from ltr.train import train
trainLog = train(client,
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 index='tmdb',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))
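
It helps to remember what NDCG measures: how well the model orders the documents that appear in the judgment list, and nothing else. Below is a sketch of one common formulation of NDCG@k (gain of `2^grade - 1`, log2 position discount). Whether it matches the exact variant used during training is an assumption.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Judged science fiction movies, ranked newest first: grades already descend
print(ndcg_at_k([4, 3, 2, 1, 0], 10))
# The same judged movies ranked oldest first
print(ndcg_at_k([0, 1, 2, 3, 4], 10))
```

If every judged document for a query already matches the genre, a model can reach a perfect score by sorting on release year alone, without ever learning that genre matters.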

### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [None]:
from ltr.search import search
search(client, keywords="science fiction", modelName="genre")

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
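
One way to carry out the first check is to look for features that never vary within a query's judged rows. A feature that is constant for a query cannot help the model separate good from bad results for that query. The rows below are invented to mimic the situation described here; `constant_features` is a hypothetical helper, not part of the `ltr` package.

```python
from collections import defaultdict

FEATURE_NAMES = {1: "release_year", 2: "is_sci_fi", 3: "is_drama", 4: "is_genre_match"}


def constant_features(rows):
    """Return {qid: names of features with a single value across that query's rows}."""
    seen = defaultdict(lambda: defaultdict(set))
    for grade, qid, features in rows:
        for feature_id, value in features.items():
            seen[qid][feature_id].add(value)
    return {
        qid: sorted(FEATURE_NAMES[fid] for fid, values in by_feature.items() if len(values) == 1)
        for qid, by_feature in seen.items()
    }


# Made-up rows: every judged movie already matches its query's genre
rows = [
    (4, "scifi", {1: 2015.0, 2: 1.0, 3: 0.0, 4: 1.0}),
    (0, "scifi", {1: 1975.0, 2: 1.0, 3: 0.0, 4: 1.0}),
    (0, "drama", {1: 2015.0, 2: 0.0, 3: 1.0, 4: 1.0}),
    (4, "drama", {1: 1975.0, 2: 0.0, 3: 1.0, 4: 1.0}),
]
report = constant_features(rows)
print(report)
```

In this toy data only `release_year` varies within a query, and `is_genre_match` is 1.0 everywhere, so the one keyword-aware feature carries no signal.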

### RankLib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.
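
The trick can be sketched in a few lines. This is an illustration of the idea only, with a hypothetical `auto_negate` helper and made-up document IDs; it is not the code behind the `autoNegate` flag.

```python
def auto_negate(judgments):
    """Add grade-0 rows: each query's positives become negatives for other queries.

    judgments: list of (grade, query, doc_id) tuples.
    """
    by_query = {}
    for grade, query, doc_id in judgments:
        by_query.setdefault(query, {})[doc_id] = grade

    augmented = list(judgments)
    for query, docs in by_query.items():
        seen = set(docs)  # never overwrite a doc this query already judged
        for other_query, other_docs in by_query.items():
            if other_query == query:
                continue
            for doc_id, grade in other_docs.items():
                if grade > 0 and doc_id not in seen:
                    augmented.append((0, query, doc_id))
                    seen.add(doc_id)
    return augmented


judgments = [(4, "science fiction", "sf1"), (4, "drama", "d1")]
for row in auto_negate(judgments):
    print(row)
```

One caveat of this approach: a movie that legitimately belongs to both genres would be marked irrelevant for the query that did not judge it, so the added negatives are only as good as the assumption that the queries do not overlap.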

In [None]:
from ltr import date_genre_judgments
from ltr.log import judgments_to_training_set
date_genre_judgments.synthesize(client,
                                judgmentsOutFile='data/genre_by_date_judgments.txt',
                                autoNegate=True)

judgments_to_training_set(client,
                          judgmentInFile='data/genre_by_date_judgments.txt', 
                          trainingOutFile='data/genre_by_date_judgments_train.txt', 
                          featureSet='genre')

from ltr.train import train
trainLog = train(client,
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 index='tmdb',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("NDCG {}".format(trainLog.rounds[-1]))

### Now try those queries...

Replace keywords below with 'science fiction' or 'drama' and see how it works

In [None]:
from ltr.search import search
search(client, keywords="drama", modelName="genre")

In [None]:
from ltr.search import search
search(client, keywords="science fiction", modelName="genre")

### The next problem

- The model is overfit to these two examples
- We need many more queries, covering more use cases
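
Once more queries are available, one way to notice overfitting is to hold out whole queries and evaluate on them separately. The sketch below splits rows by query ID so that no query appears on both sides. The helper `split_by_query` and the row layout `(grade, qid, features)` are assumptions for illustration.

```python
import random


def split_by_query(rows, test_fraction=0.2, seed=0):
    """Split (grade, qid, features) rows so each query lands wholly in train or test."""
    qids = sorted({qid for _, qid, _ in rows})
    rng = random.Random(seed)      # fixed seed keeps the split repeatable
    rng.shuffle(qids)
    n_test = max(1, int(len(qids) * test_fraction))
    test_qids = set(qids[:n_test])
    train = [row for row in rows if row[1] not in test_qids]
    test = [row for row in rows if row[1] in test_qids]
    return train, test


# Made-up rows for five hypothetical queries, two judged documents each
rows = [(g, "q{}".format(q), {1: 2000.0 + q}) for q in range(5) for g in (0, 4)]
train, test = split_by_query(rows)
print(len(train), len(test))
```

With only the two queries in this notebook, a split like this leaves almost nothing to train on, which is another way of seeing why more queries are needed.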