Download the corpus and the judgment list.

In [1]:
from ltr import download
corpus='http://es-learn-to-rank.labs.o19s.com/blog.jsonl'
judgments='http://es-learn-to-rank.labs.o19s.com/osc_judgments.txt'

download([corpus, judgments], dest='data/');

data/blog.jsonl already exists
data/osc_judgments.txt already exists
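
The "already exists" messages suggest that the download step skips files that are already on disk. A minimal sketch of that behavior follows. It is not the actual `ltr.download` implementation: `download_if_missing` and its `fetch` parameter are hypothetical names introduced here for illustration.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(urls, dest='data/', fetch=urlretrieve):
    """Fetch each URL into dest unless a file with that name is already there.

    Hypothetical helper: `fetch` is any callable taking (url, target_path).
    """
    os.makedirs(dest, exist_ok=True)
    fetched = []
    for url in urls:
        # Use the last path segment of the URL as the local file name
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest, filename)
        if os.path.exists(target):
            print(f'{target} already exists')
            continue
        fetch(url, target)
        fetched.append(target)
    return fetched
```

Passing the fetch function as a parameter keeps the skip logic easy to try out without touching the network.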


Parse out OSC's blog into `articles`

In [2]:
import json

articles = []

with open('data/blog.jsonl') as f:
    for line in f:
        blog = json.loads(line)
        articles.append(blog)

articles[-7]

{'title': "Lets Stop Saying 'Cognitive Search'",
 'url': 'https://opensourceconnections.com/blog/2019/05/28/lets-stop-saying-cognitive-search/',
 'author': 'doug-turnbull',
 'content': ' I consume a lot of search materials: blogs, webinars, papers, and marketing collateral. There’s a consistent theme that crops up over the years: buzzwords! I understand why this happens, and it’s not all negative. We want to invite others outside our community into what we do. Heck I write my own marketing collateral, so I get the urge to jump on a buzzword bandwagon from time-to-time. \n\n That being said, I want to tell you a dirty little secret. Nobody really knows what ‘cognitive search’ means in any concrete sense. Sit two people down, and ask them what problem ‘cognitive search’ solves, and you’ll get two different answers. Most likely they imagine some kind of silver-bullet solution to a unique, painful search relevance problem they’re experiencing. Problems that require careful, deep thought wh

Instantiate an OpenSearch client.

In [3]:
from ltr.client import OpenSearchClient
client=OpenSearchClient()

http://localhost:9201/_ltr; <OpenSearch([{'host': 'localhost', 'port': 9201}])>


Reindex from the corpus into the `blog` index. The JSON file at `docker/elasticsearch/<index_name>_settings.json` is loaded to configure the index.

In [4]:
from ltr.index import rebuild
rebuild(client, index='blog', doc_src=articles)

Index blog already exists. Use `force = True` to delete and recreate


Define a set of features that we have found to work well for OSC's blog. Note that these features are Elasticsearch-specific.

In [5]:
client.reset_ltr(index='blog')

config = {
    "featureset": {
        "features": [
            {
                "name": "title_term_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                       "filter": {
                            "match": {
                                "title": "{{keywords}}"
                            }
                       },
                       "boost": 1.0
                    }
                }
            },
           {
                "name": "content_bm25",
                "params": ["keywords"],
                "template": {
                    "match": {
                       "content": {
                          "query": "{{keywords}}"
                        }
                    }
                }
            },
            {
                "name": "title_phrase_bm25",
                "params": ["keywords"],
                "template": {
                    "match_phrase": {
                       "title": "{{ keywords }}"
                    }
                }
            },
            {
                "name": "title_phrase_match",
                "params": ["keywords"],
                "template": {
                    "constant_score": {
                       "filter": {
                            "match_phrase": {
                                "title": "{{keywords}}"
                            }
                       },
                       "boost": 1.0
                    }
                }
            },
            
            {
                "name": "stepwise_post_date",
                "params": ["keywords"],
                "template": {
                  "function_score": {
                     "query": {
                        "match_all": {
                        }
                     },
                     "boost_mode": "replace",
                     "score_mode": "sum",
                     "functions": [
                        {
                            "filter": {
                                "range": {
                                    "post_date": {
                                        "gte": "now-180d"
                                    }
                                }
                            },
                            "weight": "100"               
                        },
                        {
                            "filter": {
                                "range": {
                                    "post_date": {
                                        "gte": "now-360d"
                                    }
                                }
                            },
                            "weight": "100"               
                        },
                          {
                            "filter": {
                                "range": {
                                    "post_date": {
                                        "gte": "now-90d"
                                    }
                                }
                            },
                            "weight": "100"               
                        }

                     ]
                  }
                }
            },
            {
                "name": "category_phrase_bm25",
                "params": ["keywords"],
                "template": {
                    "match_phrase": {
                       "categories": "{{ keywords }}"
                    }
                }
            },
            {
                "name": "excerpt_bm25",
                "params": ["keywords"],
                "template": {
                    "match": {
                       "excerpt": "{{ keywords }}"
                    }
                }
            },
            {
                "name": "excerpt_phrase_bm25",
                "params": ["keywords"],
                "template": {
                    "match_phrase": {
                       "excerpt": "{{ keywords }}"
                    }
                }
            },
        ]
    },
    "validation": {
      "index": "blog",
      "params": {
          "keywords": "rambo"
      }

   }
}

client.create_featureset(index='blog', name='test', ftr_config=config)

Removed Default LTR feature store [Status: 200]
Initialize Default LTR feature store [Status: 200]
Create test feature set [Status: 201]
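
To see what the `stepwise_post_date` feature computes, the sketch below mimics its three range filters in plain Python. With `score_mode` set to `sum`, the weights of all matching filters are added up, so more recent posts collect more steps. The function name and the fixed `now` are assumptions for illustration, and the weights are written as integers rather than the strings used in the config. The sketch returns 0 when no filter matches; the value the engine reports in that case is not modeled here.

```python
from datetime import datetime, timedelta

# (window in days, weight) pairs copied from the `stepwise_post_date` feature
STEPS = [(180, 100), (360, 100), (90, 100)]


def stepwise_post_date(post_date, now):
    """Sum the weights of every range filter that the post date satisfies."""
    return sum(weight for days, weight in STEPS
               if post_date >= now - timedelta(days=days))


now = datetime(2019, 9, 1)  # stand-in for the engine's `now`
for age_in_days in (30, 120, 300, 1000):
    score = stepwise_post_date(now - timedelta(days=age_in_days), now)
    print(age_in_days, score)
# 30 300
# 120 200
# 300 100
# 1000 0
```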


With the features loaded, transform the judgment list (`query,doc,label`) into a full training set (`query,doc,label,ftr1,ftr2,...`) to prepare for training.

In [6]:
from ltr.judgments import judgments_open
from ltr.log import FeatureLogger
from itertools import groupby

ftr_logger=FeatureLogger(client, index='blog', feature_set='test')
with judgments_open('data/osc_judgments.txt') as judgment_list:
    for qid, query_judgments in groupby(judgment_list, key=lambda j: j.qid):
        ftr_logger.log_for_qid(judgments=query_judgments, 
                               qid=qid,
                               keywords=judgment_list.keywords(qid))

Recognizing 47 queries
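
The loop above relies on `itertools.groupby`, which only merges consecutive items that share a key. It therefore works as intended only if the judgments arrive ordered by `qid`. Whether `judgments_open` guarantees that order is an assumption here. The `Judgment` tuple below is a hypothetical stand-in for the library's judgment objects.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the judgment objects yielded by the library
Judgment = namedtuple('Judgment', ['qid', 'grade', 'doc_id'])

ordered = [Judgment(1, 4, 'a'), Judgment(1, 2, 'b'), Judgment(2, 3, 'c')]
interleaved = [Judgment(1, 4, 'a'), Judgment(2, 3, 'c'), Judgment(1, 2, 'b')]


def group_sizes(judgments):
    """Return (qid, number of judgments) for each group that groupby yields."""
    return [(qid, len(list(group)))
            for qid, group in groupby(judgments, key=lambda j: j.qid)]


print(group_sizes(ordered))      # [(1, 2), (2, 1)]
print(group_sizes(interleaved))  # [(1, 1), (2, 1), (1, 1)]
```

With interleaved input, query 1 is split into two groups, so sorting by `qid` first is the safe choice when the order is unknown.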


# Judgment List Definition

From https://opensourceconnections.com/blog/2019/08/13/hello-ltr-sandbox-for-learning-to-rank/:
The judgment list is expressed as a ‘stub’ RankSVM file format. This file format, common to learning to rank tasks, tracks the grade in the first column. In our example, we use the standard of a 0 meaning most irrelevant and a 4 meaning perfectly relevant for the query. The second column is a unique identifier for the query, prefixed with qid. A comment with the document identifier follows.
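
As a rough illustration of that layout, the sketch below parses one stub line and then writes a full training line with logged feature values appended. The sample line, the function names, and the feature values are assumptions made for this example. The `ordinal:value` feature columns follow the usual RankLib/SVMrank convention of 1-based feature numbers.

```python
import re

# grade, then qid:<n>, then a comment holding the document identifier
STUB_LINE = re.compile(r'^(\d+)\s+qid:(\d+)\s+#\s*(\S+)')


def parse_stub_line(line):
    """Return (grade, qid, doc_id), or None if the line is not a judgment."""
    match = STUB_LINE.match(line.strip())
    if match is None:
        return None
    grade, qid, doc_id = match.groups()
    return int(grade), int(qid), doc_id


def to_training_line(grade, qid, features, doc_id):
    """Append feature values as 1-based `ordinal:value` columns."""
    columns = ' '.join(f'{i}:{value}' for i, value in enumerate(features, start=1))
    return f'{grade} qid:{qid} {columns} # {doc_id}'


grade, qid, doc_id = parse_stub_line('4 qid:1 # 4036602523')
print(to_training_line(grade, qid, [1.0, 9.5], doc_id))
# 4 qid:1 1:1.0 2:9.5 # 4036602523
```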

In [7]:
with judgments_open('data/osc_judgments.txt') as print_judgement_list:
    for line in print_judgement_list:
        print(line)
                

Recognizing 47 queries
grade:4 qid:1 (solr) docid:4036602523
grade:4 qid:1 (solr) docid:2349740828
grade:4 qid:1 (solr) docid:2307460466
grade:4 qid:1 (solr) docid:4021588268
grade:3 qid:1 (solr) docid:3873568549
grade:3 qid:1 (solr) docid:2721405712
grade:3 qid:1 (solr) docid:3780360030
grade:3 qid:1 (solr) docid:3941005933
grade:3 qid:1 (solr) docid:1686720823
grade:3 qid:1 (solr) docid:3394242689
grade:3 qid:1 (solr) docid:2916170833
grade:2 qid:1 (solr) docid:3245818064
grade:2 qid:1 (solr) docid:3913088476
grade:2 qid:1 (solr) docid:3501843710
grade:2 qid:1 (solr) docid:4124823494
grade:2 qid:1 (solr) docid:3225155378
grade:2 qid:1 (solr) docid:1502936313
grade:2 qid:1 (solr) docid:3252179363
grade:2 qid:1 (solr) docid:2848549888
grade:2 qid:1 (solr) docid:417217470
grade:2 qid:1 (solr) docid:3986527021
grade:3 qid:1 (solr) docid:3369505975
grade:2 qid:1 (solr) docid:3252179363
grade:2 qid:1 (solr) docid:2502109327
grade:2 qid:1 (solr) docid:3440818580
grade:1 qid:1 (solr) docid:7

Train using RankyMcRankFace with the training set, optimizing search for a specific metric (here `NDCG@10`). Note that `ltr.train` has additional capabilities for performing k-fold cross validation to ensure the model isn't overfit to the training data.

The model is stored in the search engine under the name `test`, which can be referred to later when searching.
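
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k computed from a ranked list of grades. It uses the common exponential gain `2^grade - 1` with a `log2` position discount. Whether RankyMcRankFace uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math


def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum((2 ** grade - 1) / math.log2(rank + 1)
               for rank, grade in enumerate(grades[:k], start=1))


def ndcg_at_k(grades, k=10):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([4, 3, 2, 1]))  # 1.0, already in ideal order
print(ndcg_at_k([1, 4]) < 1.0)  # True, the best document is ranked second
```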

In [8]:
from ltr.ranklib import train
trainLog = train(client,
                 training_set=ftr_logger.logged,
                 metric2t='NDCG@10',
                 featureSet='test',
                 index='blog',
                 modelName='test')



/var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar already exists
Running java -jar /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/RankyMcRankFace.jar -ranker 6 -shrinkage 0.1 -metric2t NDCG@10 -tree 50 -bag 1 -leaf 10 -frate 1.0 -srate 1.0 -train /var/folders/33/jx0mw87156q2hmtrr_r82s7r0000gs/T/training.txt -save data/test_model.txt 
Delete model test: 404
Created Model test [Status: 201]
Model saved
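
The k-fold cross validation mentioned earlier is usually done per query, so that all judgments for one query land in the same fold. The sketch below shows one way to split query ids. It is not the `ltr` API, and `kfold_by_qid` is a hypothetical helper.

```python
def kfold_by_qid(qids, k=5):
    """Yield (train_qids, test_qids) pairs, keeping each query in one fold."""
    unique = sorted(set(qids))
    # Deal the query ids round-robin into k folds
    folds = [unique[i::k] for i in range(k)]
    for i, test_qids in enumerate(folds):
        train_qids = [qid for j, fold in enumerate(folds) if j != i
                      for qid in fold]
        yield train_qids, test_qids


# 47 queries, matching the "Recognizing 47 queries" message
for train_qids, test_qids in kfold_by_qid(range(1, 48), k=5):
    print(len(train_qids), len(test_qids))
```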


Search! Pass some configuration in (`blog_fields`) for display purposes.

In [13]:
blog_fields = {
    'title': 'title',
    'display_fields': ['url', 'author', 'categories', 'post_date']
}

from ltr import search
search(client, "haystack ml", modelName='test', 
       index='blog', fields=blog_fields)

{"size": 5, "query": {"sltr": {"params": {"keywords": "haystack ml", "keywordsList": ["haystack ml"]}, "model": "test"}}}
{'size': 5, 'query': {'sltr': {'params': {'keywords': 'haystack ml', 'keywordsList': ['haystack ml']}, 'model': 'test'}}}
Scott Stults to Speak on Solr Security at the Solr and Text Analytics Conference 
2.0238843 
https://opensourceconnections.com/blog/2011/03/21/scott-stults-to-speak-on-solr-security-at-the-solr-and-text-analytics-conference/ 
jason-hull 
['blog', 'Conference', 'Government', 'News', 'Speaking'] 
2011-03-21T00:00:00-0400 
---------------------------------------
Recap of Activate 2018 
0.88978434 
https://opensourceconnections.com/blog/2018/11/09/recap-of-activate-2018/ 
elizabeth-haubert 
['blog', 'Solr', 'Machine-learning'] 
2018-11-09T00:00:00-0500 
---------------------------------------
Haystack - The Search Relevance Conference. 
0.78783745 
https://opensourceconnections.com/blog/2019/04/24/haystack/ 
eric-pugh 
['events'] 
2019-04-24T00:00:00
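
The first lines of the output above show the request body that `search` sends: an `sltr` query that names the stored model and passes the keywords as parameters. A small sketch that rebuilds the same body is shown below. The `sltr_query` helper is hypothetical and only mirrors the printed JSON.

```python
import json


def sltr_query(keywords, model, size=5):
    """Build a request body shaped like the one printed by `search`."""
    return {
        'size': size,
        'query': {
            'sltr': {
                'params': {'keywords': keywords, 'keywordsList': [keywords]},
                'model': model,
            }
        },
    }


print(json.dumps(sltr_query('haystack ml', 'test')))
```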