# TermStatQuery
Introduced in ES-LTR v1.5.2, the TermStatQuery gives Lucene expressions and Painless scripts access to low-level term statistics from the index.

This allows feature engineers to easily experiment with features derived directly from the index without having to write any Java code.

Review the documentation [here](https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/advanced-functionality.html#termstat-query) and use the notebook below to experiment with the functionality that the TermStatQuery provides.

## Setup Client

In [None]:
from ltr.client import ElasticClient
client = ElasticClient()

## Step 1 - Create a Feature Set

In [None]:
'''
  TASK:
  Experiment with the TermStatQuery
  - Create a feature that uses a Lucene expression.
  - Create a feature that uses Painless scripting.
'''

client.reset_ltr(index='tmdb')

config = {
   "featureset": {
        "features": [
            {
                "name": "tsq_expr_title_tfidf",
                "params": ["keywords"],
                "template": {
                    "term_stat": {
                        "expr": "tf * idf",             # The Lucene expression evaluated for each term
                        "aggr": "max",                  # How the calculated per-term values are aggregated
                        "terms": ["{{keywords}}"],      # The list of terms to run the expr on
                        "fields": ["title"]             # Which fields to lookup terms in
                    }
                }
            },
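            {
                "name": "tsq_script_title_unique_terms",
                "params": ["keywords"],
                "template_language": "script_feature",
                "template": {
                    "lang": "painless",
                    "source": "params.uniqueTerms",
                    "params": {
                        "term_stat": {
                            "analyzer": "!standard",  # "!" prefix: use the named analyzer directly, not a parameter lookup
                            "terms": "keywordsList",  # Name of the parameter that supplies the terms
                            "fields": ["title"]       # Which fields to look the terms up in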
                        }
                    }
                    
                }
            }
        ]
    }
}

client.create_featureset(index='tmdb', name='sandbox', ftr_config=config)
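
To make the `expr` and `aggr` settings above more concrete, here is a minimal pure-Python sketch of the idea: an expression is evaluated once per term, and the per-term results are then collapsed into a single feature value. It does not call Elasticsearch. The statistics are made-up numbers, `term_stat_feature` is a hypothetical helper, and the aggregations other than `max` are included only for comparison (check the plugin documentation for the supported list).

```python
from statistics import mean

# Assumed per-term statistics for a single document (illustrative numbers only;
# in the plugin these values come from the index).
term_stats = [
    {"term": "rambo", "tf": 2.0, "idf": 3.5},
    {"term": "first", "tf": 1.0, "idf": 1.2},
    {"term": "blood", "tf": 1.0, "idf": 2.8},
]

# Hypothetical mapping from an aggregation name to a Python function.
AGGREGATORS = {"max": max, "min": min, "sum": sum, "avg": mean}


def term_stat_feature(stats, expr, aggr):
    """Evaluate expr for every term, then aggregate the per-term values."""
    values = [expr(s) for s in stats]
    return AGGREGATORS[aggr](values)


def tfidf(s):
    # Python stand-in for the "tf * idf" expression string.
    return s["tf"] * s["idf"]


feature_value = term_stat_feature(term_stats, tfidf, "max")
print(feature_value)
```

Switching `"max"` to `"sum"` or `"avg"` changes how much weight a single strongly matching term carries compared with the query as a whole, which is the kind of variation the task above invites you to explore.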

## Step 2 - Log Features for Training

In [None]:
from ltr.log import FeatureLogger
from ltr.judgments import judgments_open, to_dataframe
from itertools import groupby

ftr_logger = FeatureLogger(client, index='tmdb', feature_set='sandbox')
with judgments_open('data/title_judgments.txt') as judgment_list:
    for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
        ftr_logger.log_for_qid(judgments=query_judgments, 
                               qid=qid,
                               keywords=judgment_list.keywords(qid))

df = to_dataframe(ftr_logger.logged)
df
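
The logging loop above relies on `itertools.groupby`, which only groups consecutive items that share a key. The loop therefore presumably expects the judgments to arrive ordered by `qid`; that is an assumption, since the notebook does not state it. The sketch below shows the difference using a hypothetical `Judgment` tuple and made-up rows rather than the real judgment objects.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a judgment row; the real class lives in ltr.judgments.
Judgment = namedtuple("Judgment", ["qid", "doc_id", "grade"])

# Made-up rows, deliberately not ordered by qid.
judgments = [
    Judgment(1, "doc_a", 4),
    Judgment(2, "doc_b", 3),
    Judgment(1, "doc_c", 3),
]


def by_qid(judgment):
    return judgment.qid


# Grouping the unsorted list splits qid 1 into two separate groups.
unsorted_groups = [(qid, len(list(group))) for qid, group in groupby(judgments, key=by_qid)]

# Sorting by the same key first yields exactly one group per qid.
sorted_groups = [
    (qid, len(list(group)))
    for qid, group in groupby(sorted(judgments, key=by_qid), key=by_qid)
]

print(unsorted_groups)
print(sorted_groups)
```

If a judgment file were not ordered by `qid`, the same query would be logged in several partial batches, so sorting first is a cheap safeguard.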

## Step 3 - Train a Model

In [None]:
'''
  TASK:
  Experiment with the leafs and trees variables. How do they affect NDCG?
  Does a high leaf value increase your NDCG? What are the potential downsides?
'''
from ltr.ranklib import train
trainResponse = train(client,
                  index='tmdb',
                  training_set=ftr_logger.logged,
                  metric2t='NDCG@10',
                  leafs=20,
                  trees=20,
                  featureSet='sandbox',
                  modelName='sandbox')

trainLog = trainResponse.trainingLogs[0]
print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(client.get_feature_name(config, ftrId), impact))
    
for roundDcg in trainLog.rounds:
    print(roundDcg)
    
print("Train NDCG@10 %s" % trainLog.rounds[-1])
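
For readers who want to see what the NDCG@10 figure measures, here is a minimal sketch of NDCG@k with the common exponential-gain formulation. It is an illustration, not RankLib's code, and RankLib's implementation may differ in details. The grade list is made up, and the ideal ordering is computed from the same list of judged documents.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum(
        (2 ** grade - 1) / math.log2(position + 2)
        for position, grade in enumerate(grades[:k])
    )


def ndcg_at_k(grades, k):
    """DCG normalised by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Made-up relevance grades in the order a model ranked the documents.
ranked_grades = [3, 4, 0, 1]
print(ndcg_at_k(ranked_grades, 10))
```

A ranking that already places the highest grades first scores 1.0; swapping a relevant document lower in the list reduces the score, with mistakes near the top costing the most.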

## Search

In [None]:
from ltr import search
search(client, "rambo", modelName='sandbox')
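
The internals of `ltr.search` are not shown in this notebook. As a rough sketch, a search that applies a stored model usually takes the shape of a baseline query rescored with the plugin's `sltr` query. The helper name, the `title` match query, and the window size below are assumptions for illustration only. Any further parameters that a feature references would also need to be supplied in `params`.

```python
import json


def build_ltr_rescore_query(keywords, model_name, window_size=1000):
    """Hypothetical helper: a baseline match query rescored by a stored LTR model."""
    return {
        # Assumed baseline query; the real helper may retrieve candidates differently.
        "query": {"match": {"title": keywords}},
        "rescore": {
            "window_size": window_size,  # how many top hits get rescored
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = build_ltr_rescore_query("rambo", "sandbox")
print(json.dumps(body, indent=2))
```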