# Self-Guessing Hybrid Search (aka Part 3/3)

Assume we have a working recipe for hybrid search in the form `(query, [keyword]) => [(snippet, similarity)]`.

The goal of this notebook is to find out: _can we guess the keywords from the query itself, so that the caller supplies just the query?_

In [1]:
import os
from functools import partial
from dotenv import load_dotenv

_ = load_dotenv('.env')

In [2]:
import cassio

In [3]:
cassio.init(
    token=os.environ['ASTRA_DB_APPLICATION_TOKEN'],
    database_id=os.environ['ASTRA_DB_ID'],
    keyspace=os.environ.get('ASTRA_DB_KEYSPACE'),
)
session = cassio.config.resolve_session()
keyspace = cassio.config.resolve_keyspace()

In [4]:
import openai

embedding_model_name = "text-embedding-ada-002"

def get_embeddings(texts):
    result = openai.Embedding.create(
        input=texts,
        engine=embedding_model_name,
    )
    return [res.embedding for res in result.data]

### Recap

Let's use the latest "hybrid search" function from the previous investigation. One important point is that we'll combine the keywords with OR (i.e. a single keyword match suffices for a hit). This lets the keyword side make a meaningful contribution to the "score"; more importantly, since the keywords are now self-guessed, it offers some protection against overly demanding "guesses" derived from the query.

**We have packaged the latest machinery from Part 2 into a Python module to reduce clutter; there is nothing new in it.**
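To illustrate the difference between OR and AND semantics, here is a minimal pure-Python sketch. The helper names (`tokenize`, `matches_any`, `matches_all`) are made up for this example; in the actual search the matching is done by the database, not by code like this.

```python
import re

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (a deliberate simplification).
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_any(snippet, keywords):
    # OR semantics: a single keyword found in the snippet is enough.
    return bool(tokenize(snippet) & set(keywords))

def matches_all(snippet, keywords):
    # AND semantics: every keyword must be found in the snippet.
    return set(keywords) <= tokenize(snippet)

snippet = "I cannot open the support chat."
guessed = {"chat", "come"}  # one good guess, one irrelevant guess

print(matches_any(snippet, guessed))  # True
print(matches_all(snippet, guessed))  # False
```

With AND, a single off-target guess such as `"come"` would exclude an otherwise good snippet; with OR it only lowers the keyword contribution to the score.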

_(We just renamed the final hybrid-search function for convenience and made sure the DB parameters pass through the calls)_

In [5]:
from kw_hybrid_tools import hybrid_search_with_kw, show, keyword_similarity, sum_score_merger

That has the following signature:
```
def hybrid_search_with_kw(session, keyspace, get_embeddings, query, keywords=[],
                          top_k=3, kw_similarity_function=keyword_similarity,
                          score_merger_function=sum_score_merger, prefetch_factor=5):
    ...
```

Now, let's define a handy shortcut and run a sanity check (to be compared with the "QUERY7/KW7" run in the previous notebook):

In [6]:
hybrid_kw = partial(hybrid_search_with_kw, session=session, keyspace=keyspace, get_embeddings=get_embeddings)

KW0 = ['support', 'chat']
QUERY0 = "How come I cannot chat?"

print(f"[with safe prefetch] QUERY: '{QUERY0}', KEYWORDS: \'{', '.join(KW0)}\'")
show(hybrid_kw(query=QUERY0, keywords=KW0))

[with safe prefetch] QUERY: 'How come I cannot chat?', KEYWORDS: 'support, chat'
    [1] 0.96761 "I cannot open the support chat."
    [2] 0.96156 "I see no messages in the support chat."
    [3] 0.95498 "The support chat on the website is lagging."


## Guessing the keywords from the query

Let us first consider a simple and disappointing keyword-guesser function (there will be several more later):

In [7]:
PUNKT = set('!,.?;\'"-+=/[]{}()\n')

def guess_kws_simple(query):
    # Strip punctuation, lowercase, and keep only words longer than 4 characters
    _qry = ''.join([c for c in query if c not in PUNKT]).lower()
    return {
        w
        for w in _qry.split(' ')
        if len(w) > 4
    }

Rather crude, isn't it?

In [8]:
print(guess_kws_simple("The report due today is on Mia's desk, Benjamin!"))

{'today', 'report', 'benjamin'}


Let's start from this and repackage a keyword-guessing hybrid search function (again, we take advantage of the partialed shortcut to focus on the important bits):

In [9]:
def hybrid_guess(query, kw_guesser, top_k=3, kw_similarity_function=keyword_similarity,
                 score_merger_function=sum_score_merger, prefetch_factor=5):
    keywords = kw_guesser(query)
    return hybrid_kw(
        query=query,
        keywords=keywords,
        top_k=top_k,
        kw_similarity_function=kw_similarity_function,
        score_merger_function=score_merger_function,
        prefetch_factor=prefetch_factor,
    )

A little test with the crude guesser (and, let's not bother with the other settings now):

In [10]:
QUERY1 = "How come I cannot chat?"
print(f"QUERY: '{QUERY1}' [keywords={guess_kws_simple(QUERY1)}]")
show(hybrid_guess(QUERY1, kw_guesser=guess_kws_simple))

QUERY: 'How come I cannot chat?' [keywords={'cannot'}]
    [1] 0.96762 "I cannot open the support chat."
    [2] 0.95840 "I cannot speak with the support operator!"


But clearly this is not the best solution in general:

In [11]:
QUERY2 = "Do you currently have any offers?"
print(f"QUERY: '{QUERY2}' [keywords={guess_kws_simple(QUERY2)}]")
show(hybrid_guess(QUERY2, kw_guesser=guess_kws_simple))

QUERY3 = "Why does the site experience these lags?"
print(f"QUERY: '{QUERY3}' [keywords={guess_kws_simple(QUERY3)}]")
show(hybrid_guess(QUERY3, kw_guesser=guess_kws_simple))

QUERY: 'Do you currently have any offers?' [keywords={'offers', 'currently'}]
QUERY: 'Why does the site experience these lags?' [keywords={'experience', 'these'}]


_Note: with just "Why does the site lag?" you **would** get some results, of course: every word is too short for this guesser, so no keywords are found and the search falls back to ANN-only._
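The empty-keyword case from the note can be checked directly. The snippet below restates the crude guesser so that it runs on its own; nothing else is assumed.

```python
PUNKT = set('!,.?;\'"-+=/[]{}()\n')

def guess_kws_simple(query):
    # Same logic as the crude guesser: drop punctuation, lowercase,
    # keep only words longer than 4 characters.
    _qry = ''.join(c for c in query if c not in PUNKT).lower()
    return {w for w in _qry.split(' ') if len(w) > 4}

# Every word has at most 4 characters, so nothing is guessed.
print(guess_kws_simple("Why does the site lag?"))  # set()

# Here two irrelevant words pass the length filter, while "lags" does not.
print(sorted(guess_kws_simple("Why does the site experience these lags?")))
# ['experience', 'these']
```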

### Known keyword set, quick approaches

Suppose your knowledge of the problem domain lets you make an explicit list of the keywords you want to potentially use:

> You may achieve this "by hand", or by passing a random sample subset of the snippets to an LLM, ... or a combination of both.

In [12]:
available_keywords = set("buy gift discounts support operator chat message offer lag product payment process shop cart".split(" "))

With such a set, we can build keyword extractors that are very "cheap" yet, in some cases, effective (and fast!).

_Note: sometimes these might yield false positives, such as "lag" being a substring of "flagged". The semantic side of the search would mostly take care of these, with the proper tuning in the score-merging phase._

In [13]:
def guess_kws_substr_from_set(query, kws=available_keywords):
    _qry = query.lower().strip()
    # use the `kws` argument (not the global set), so that a custom keyword set is honored
    return {kw for kw in kws if kw in _qry}

def guess_kws_tokens_from_set(query, kws=available_keywords):
    _qry = ''.join([c for c in query if c not in PUNKT]).lower()
    toks = {tk for tk in _qry.split(' ') if tk}
    return toks & kws

In [14]:
queries = [QUERY1, QUERY2, QUERY3]
guessers = [('SUBSTR, from set', guess_kws_substr_from_set), ('TOKENS, from set', guess_kws_tokens_from_set)]

for qry in queries:
    print(f"\nQUERY='{qry}'")
    for kw_g_name, kw_g in guessers:
        print(f"  KwGuesser=<{kw_g_name}> [keywords={kw_g(qry)}]")
        show(hybrid_guess(qry, kw_guesser=kw_g))


QUERY='How come I cannot chat?'
  KwGuesser=<SUBSTR, from set> [keywords={'chat'}]
    [1] 0.96762 "I cannot open the support chat."
    [2] 0.96608 "A message disappeared from the chat?"
    [3] 0.96157 "I see no messages in the support chat."
  KwGuesser=<TOKENS, from set> [keywords={'chat'}]
    [1] 0.96762 "I cannot open the support chat."
    [2] 0.96608 "A message disappeared from the chat?"
    [3] 0.96157 "I see no messages in the support chat."

QUERY='Do you currently have any offers?'
  KwGuesser=<SUBSTR, from set> [keywords={'offer'}]
    [1] 0.96833 "Is there any special offer today?"
    [2] 0.47047 "Are special offers available?"
  KwGuesser=<TOKENS, from set> [keywords=set()]
    [1] 0.97047 "Are special offers available?"
    [2] 0.96833 "Is there any special offer today?"
    [3] 0.94816 "I would like to buy gift cards. Where can I get discounts?"

QUERY='Why does the site experience these lags?'
  KwGuesser=<SUBSTR, from set> [keywords={'lag'}]
    [1] 0.46398 "The 

The substring mode seems to fare better: we get `"lag"` (which was too short in the crude approach) and we don't get confused by `"cannot"` or similar irrelevant things. _Note: the many results from the token mode for the second and third query signal that no keywords were found, don't let that fool you._

As remarked earlier, however, we may get false positives (especially if the store holds no better-matching vectors that would climb to the top results and displace the intruders):

In [15]:
QUERY4 = "I have been flagged by an admin... what do I do?"
print(f"\nKwGuesser=<SUBSTR, from set>, QUERY='{QUERY4}' [keywords={guess_kws_substr_from_set(QUERY4)}]")
show(hybrid_guess(QUERY4, kw_guesser=guess_kws_substr_from_set))


KwGuesser=<SUBSTR, from set>, QUERY='I have been flagged by an admin... what do I do?' [keywords={'lag'}]
    [1] 0.43556 "The support chat on the website is lagging."
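
One possible middle ground between the substring and token modes, not part of the toolkit used in this notebook, is prefix matching: a keyword is accepted when some query token starts with it. The function name `guess_kws_prefix_from_set` and the regex tokenization are assumptions made for this sketch; the keyword set is the one defined earlier, restated so the block runs on its own.

```python
import re

AVAILABLE_KEYWORDS = set(
    "buy gift discounts support operator chat message offer lag "
    "product payment process shop cart".split(" ")
)

def guess_kws_prefix_from_set(query, kws=AVAILABLE_KEYWORDS):
    # Tokenize into lowercase alphanumeric runs, then accept a keyword
    # if at least one token begins with it.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return {kw for kw in kws if any(tok.startswith(kw) for tok in tokens)}

print(guess_kws_prefix_from_set("Why does the site experience these lags?"))
# {'lag'}
print(guess_kws_prefix_from_set("I have been flagged by an admin... what do I do?"))
# set()
```

This still catches plurals such as "lags" or "offers" while ignoring "flagged". It is not immune to false positives either (any word that merely begins with a keyword would match), so the score-merging phase remains the real safety net.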


### Known keyword set, using AI

Of course, the next thing to try is employing AI to nail the keywords for us. This, however, comes at a performance cost. Depending on the use case, an additional delay of one second or more may or may not be acceptable. Let's see what we can do, and defer timing the performance to a later section.

#### HuggingFace zero-shot classifier

There's a new parameter here: the minimum classifier score a keyword must reach to be accepted.

> Install PyTorch first. This is not covered in the `requirements.txt` since it's ... complicated. Please follow [the official instructions](https://pytorch.org/get-started/locally/).

In [16]:
from transformers import pipeline

hf_zs_classifier = pipeline("zero-shot-classification")

def guess_kws_hf_zs_from_set(query, kws=available_keywords, keyword_threshold=0.5):
    _kws = list(kws)
    result = hf_zs_classifier([query], _kws, multi_label=True)[0]
    return {
        keyword
        for keyword, kw_score in zip(result["labels"], result["scores"])
        if kw_score >= keyword_threshold
    }

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Just the keyword extraction in action:

In [17]:
queries = [QUERY1, QUERY2, QUERY3, QUERY4]

for qry in queries:
    print(f"QUERY='{qry}' [keywords={guess_kws_hf_zs_from_set(qry)}]")

QUERY='How come I cannot chat?' [keywords={'lag'}]
QUERY='Do you currently have any offers?' [keywords={'discounts', 'operator', 'message', 'offer'}]
QUERY='Why does the site experience these lags?' [keywords={'lag', 'operator'}]
QUERY='I have been flagged by an admin... what do I do?' [keywords={'process', 'chat', 'operator', 'message'}]


Remember you can set the extraction to be more or less generous by playing with the threshold:

In [18]:
print(f"QUERY='{QUERY4}', keywords by threshold for guess_kws_hf_zs:")
for kw_t in [0.3, 0.4, 0.5, 0.7, 0.8, 0.9]:
    print(f"    threshold={kw_t:0.2f} ==> [keywords={guess_kws_hf_zs_from_set(QUERY4, keyword_threshold=kw_t)}]")

QUERY='I have been flagged by an admin... what do I do?', keywords by threshold for guess_kws_hf_zs:
    threshold=0.30 ==> [keywords={'chat', 'process', 'lag', 'operator', 'message', 'cart', 'offer'}]
    threshold=0.40 ==> [keywords={'process', 'chat', 'operator', 'message'}]
    threshold=0.50 ==> [keywords={'process', 'chat', 'operator', 'message'}]
    threshold=0.70 ==> [keywords={'process', 'operator', 'message'}]
    threshold=0.80 ==> [keywords={'process', 'message'}]
    threshold=0.90 ==> [keywords=set()]


Not very fast (comparable to calling a hosted service such as OpenAI), but probably a good choice if one wants to run everything locally and/or save on OpenAI costs.

#### _(Abandoned)_ HuggingFace LLM to get the keywords from a set

This does not seem to lead to any quickly usable result. Abandoned for now (besides, it's slower than calling LLM services).

The attempts are kept here for the record, one per cell:

In [19]:
_ = '''

# FROM: https://huggingface.co/docs/transformers/v4.15.0/en/task_summary#text-generation


hf_tg_llm = pipeline("text-generation")

# just did a few tests with prompts, no luck
KW_EXTRACTION_PROMPT_TEMPLATE = """The relevant keywords extracted from "{query}" are: """
prompt = KW_EXTRACTION_PROMPT_TEMPLATE.format(query=QUERY3)

result = hf_tg_llm(prompt, max_length=50 + len(prompt), do_sample=False)
print(result[0]['generated_text'])
'''

In [20]:
_ = '''
# FROM: https://huggingface.co/docs/transformers/v4.15.0/en/task_summary#text-generation

from transformers import AutoModelForCausalLM, AutoTokenizer


model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

prompt = f"The top search keyword extracted from \"{QUERY4}\" are ..."

inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)
'''

#### Using a greater LLM for keywords from a set

What if we switch to a more modern, powerful LLM such as GPT-3.5 from OpenAI, and have it pick our keywords from the set?

In [21]:
KW_EXTRACTION_FROM_SET_PROMPT_TEMPLATE = """
You are to extract keywords from a phrase for use in a keyword-based search engine.
The keywords need not appear in the input phrase, meaning you can relate synonyms.
Please output them in a comma-separated list.
Be very careful that keywords MUST be given in stemmed form: nouns are singular, verbs are in the infinite, and so on.
Do not exceed {max_num_keywords} keywords. Keywords must not include whitespaces, i.e. they must be a single word.
Include proper nouns if relevant. Discard stop words, pronouns and generic verbs such as be, do, get and so on.
Important: limit your selection to a provided pool of available keywords.

EXAMPLE INPUT PHRASE: Does the site currently offer discounts? It featured them on the portal yesterday.
EXAMPLE AVAILABLE KEYWORDS: sale, discount, portal, reboot, offer, website, malfunction, discount, purchase, technical
EXAMPLE KEYWORDS: website, offer, discount, feature, portal

INPUT PHRASE: {query}
AVAILABLE KEYWORDS: {available_keywords}
KEYWORDS:"""

completion_model_name = "gpt-3.5-turbo"

In [22]:
def guess_kws_gpt3_from_set(query, available_keywords=available_keywords, max_num_keywords=12):
    llm_prompt = KW_EXTRACTION_FROM_SET_PROMPT_TEMPLATE.format(
        query=query,
        available_keywords=', '.join(sorted(available_keywords)),
        max_num_keywords=max_num_keywords,
    )
    response = openai.ChatCompletion.create(
        model=completion_model_name,
        messages=[{"role": "user", "content": llm_prompt}],
        temperature=0.0,
        max_tokens=40,
    )
    kws_string = response.choices[0].message.content
    kws = {
        kw_tok.strip()
        for kw_tok in kws_string.split(',')
        if kw_tok.strip()
    }
    # Constraining the LLM to stay in the provided set is not easy at all. We fix it here (not ideal)
    return kws & set(available_keywords)

Let's check how this works. We also add a funny query to put the LLM to the test:

In [23]:
QUERYZ = "Is Santa a user of your website? MY FRIEND'S MESSAGE ASSURES ME SO. Santa gives me a nice present each year."
queries = [QUERY1, QUERY2, QUERY3, QUERY4, QUERYZ]

print(f"Available keywords::\n    {available_keywords}")
for qry in queries:
    print(f"QUERY='{qry}'\n    keywords => {guess_kws_gpt3_from_set(qry)}")

Available keywords::
    {'discounts', 'chat', 'process', 'lag', 'operator', 'payment', 'gift', 'support', 'shop', 'message', 'cart', 'product', 'buy', 'offer'}
QUERY='How come I cannot chat?'
    keywords => {'chat'}
QUERY='Do you currently have any offers?'
    keywords => {'offer'}
QUERY='Why does the site experience these lags?'
    keywords => {'lag'}
QUERY='I have been flagged by an admin... what do I do?'
    keywords => set()
QUERY='Is Santa a user of your website? MY FRIEND'S MESSAGE ASSURES ME SO. Santa gives me a nice present each year.'
    keywords => {'message'}


Hmm, not bad, but not very good either (notice the absence of "gift" in the last result?).

The open-set version of this approach will probably do better.

Note that the quality of the output (the choice of keywords, but also getting the proper stemming/lowercase) depends heavily on engineering the right prompt.
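
Part of the burden can be moved from the prompt to post-processing. The helper below is a hypothetical sketch (the name `normalize_llm_keywords` and its `allowed` parameter are not part of this notebook's code): it lowercases the raw completion, splits on commas and whitespace, strips stray punctuation and, if a keyword pool is given, discards anything outside it. It cannot fix stemming, which still has to come from the prompt.

```python
import re

def normalize_llm_keywords(raw, allowed=None):
    # Split the raw completion on commas and any whitespace.
    tokens = re.split(r"[,\s]+", raw.lower())
    # Strip everything but ASCII letters and digits (a simplification:
    # accented or non-Latin keywords would need a different rule).
    cleaned = {re.sub(r"[^a-z0-9]", "", tok) for tok in tokens}
    cleaned.discard("")
    if allowed is not None:
        # Enforce the closed set after the fact.
        cleaned &= set(allowed)
    return cleaned

print(sorted(normalize_llm_keywords(" Chat, Support.\nLag ")))
# ['chat', 'lag', 'support']
print(sorted(normalize_llm_keywords("Chat, Website", allowed={"chat", "lag"})))
# ['chat']
```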

### Open keyword set, using AI

Now we drop the closed keyword set: these keyword guessers are free to produce whatever keywords they want.

Pros:
- no need to prepare the set beforehand
- ready to work with unexpected, new or unpredicted queries and stored snippets

Cons:
- the "wrong synonym" may be found, lowering effectiveness
- weird "keywords" may find their way in, skewing the results or their score/ordering

#### HuggingFace NER

A typical "Named Entity Recognition" (NER) vanilla pipeline from HuggingFace is designed to identify tokens representing people, locations, organizations or "miscellaneous".

This, by itself, would not help much for finding generic "keywords" (such as `"website"`), but in some cases it would complement a well-functioning vector-based search by helping to deal with proper nouns such as brand names and so on.

In [27]:
ner_classifier = pipeline("ner")  # we could add aggregation_strategy="simple" and avoid collating the "##segment" tokens...

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
def collate_ner_entities(e_raw):
    # a crude approach (refine before real usage)
    e_string = {
        e_tok.strip()
        for e_tok in ' '.join(e_raw).replace(' ##', '').lower().split(' ')
        if e_tok.strip()
    }
    return e_string

def guess_kws_hf_ner_open(query):
    entities_raw = [e['word'] for e in ner_classifier(query)]
    return collate_ner_entities(entities_raw)
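
To see what the collation step does, here is a run on a hand-written token list. The function is restated so the snippet runs on its own; the particular split of "AwesomeProduct" into `##` word pieces is invented for illustration and may not match what the tokenizer actually produces.

```python
def collate_ner_entities(e_raw):
    # Join the tokens with spaces, glue "##" continuation pieces back onto
    # the preceding token, lowercase, and deduplicate through a set.
    return {
        e_tok.strip()
        for e_tok in ' '.join(e_raw).replace(' ##', '').lower().split(' ')
        if e_tok.strip()
    }

# Hypothetical raw output: one word arrives as several "##" pieces.
raw_words = ['Santa', 'Claus', 'Baden', 'Baden', 'Awesome', '##P', '##rod', '##uct']

print(sorted(collate_ner_entities(raw_words)))
# ['awesomeproduct', 'baden', 'claus', 'santa']
```

Note that a `##` piece appearing as the very first item would not be merged, since the replacement relies on the preceding space; this is one of the reasons the approach is called crude.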

Let's test it. We also add another funny query with some entities to recognize:

In [26]:
QUERYX = "Santa Claus was at Baden Baden yesterday, perhaps, visiting the headquarters of AwesomeProduct, inc."

queries = [QUERY1, QUERY2, QUERY3, QUERY4, QUERYZ, QUERYX]

for qry in queries:
    print(f"QUERY='{qry}'\n    keywords => {guess_kws_hf_ner_open(qry)}")

QUERY='How come I cannot chat?'
    keywords => set()
QUERY='Do you currently have any offers?'
    keywords => set()
QUERY='Why does the site experience these lags?'
    keywords => set()
QUERY='I have been flagged by an admin... what do I do?'
    keywords => set()
QUERY='Is Santa a user of your website? MY FRIEND'S MESSAGE ASSURES ME SO. Santa gives me a nice present each year.'
    keywords => {'santa'}
QUERY='Santa Claus was at Baden Baden yesterday, perhaps, visiting the headquarters of AwesomeProduct, inc.'
    keywords => {'awesomeproduct', 'claus', 'santa', 'baden'}


#### _(Redundant)_ A likely identical pipeline

This is probably a more manual way to achieve the above (perhaps modulo the choice of model), as you'll see. We will not pursue it further.

In [27]:
# As seen on: https://huggingface.co/docs/transformers/v4.15.0/en/task_summary#named-entity-recognition
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
def guess_kws_hf_ner_open2(query):
    inputs = tokenizer(query, return_tensors="pt")
    tokens = inputs.tokens()
    outputs = model(**inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    entities_raw = [
        token
        for token, prediction in zip(tokens, predictions[0].numpy())
        if model.config.id2label[prediction] != 'O'
    ]
    return collate_ner_entities(entities_raw)

In [29]:
queries = [QUERY4, QUERYZ, QUERYX]

for qry in queries:
    print(f"QUERY='{qry}'\n    keywords => {guess_kws_hf_ner_open2(qry)}")

QUERY='I have been flagged by an admin... what do I do?'
    keywords => set()
QUERY='Is Santa a user of your website? MY FRIEND'S MESSAGE ASSURES ME SO. Santa gives me a nice present each year.'
    keywords => {'santa'}
QUERY='Santa Claus was at Baden Baden yesterday, perhaps, visiting the headquarters of AwesomeProduct, inc.'
    keywords => {'awesomeproduct', 'claus', 'santa', 'baden'}


#### Using GPT3.5 with an open set

As anticipated, this might be the best solution (if one can afford the added latency, that is).

We simply repeat the above OpenAI case, without constraining the keywords to a known set.

In [30]:
KW_EXTRACTION_OPEN_PROMPT_TEMPLATE = """
You are to extract keywords from a phrase for use in a keyword-based search engine.
The keywords need not appear in the input phrase, meaning you can relate synonyms.
Please output them in a comma-separated list.
Be very careful that keywords MUST be given in stemmed form: nouns are singular, verbs are in the infinite, and so on.
Do not exceed {max_num_keywords} keywords. Keywords must not include whitespaces, i.e. they must be a single word.
Include proper nouns if relevant. Discard stop words, pronouns and generic verbs such as be, do, get and so on.

EXAMPLE INPUT PHRASE: Does the site currently offer discounts? It featured them on the portal yesterday.
EXAMPLE KEYWORDS: website, offer, discount, feature, portal

INPUT PHRASE: {query}
KEYWORDS:"""

completion_model_name = "gpt-3.5-turbo"

In [31]:
def guess_kws_gpt3_open(query, max_num_keywords=12):
    llm_prompt = KW_EXTRACTION_OPEN_PROMPT_TEMPLATE.format(
        query=query,
        max_num_keywords=max_num_keywords,
    )
    response = openai.ChatCompletion.create(
        model=completion_model_name,
        messages=[{"role": "user", "content": llm_prompt}],
        temperature=0.0,
        max_tokens=40,
    )
    kws_string = response.choices[0].message.content
    kws = {
        kw_tok.strip()
        for kw_tok in kws_string.replace(' ', ',').split(',')   # <-- sometimes keywords such as "baden baden" come out, we fix
        if kw_tok.strip()
    }
    return kws

In [32]:
queries = [QUERY1, QUERY2, QUERY3, QUERY4, QUERYZ, QUERYX]

for qry in queries:
    print(f"QUERY='{qry}'\n    keywords => {guess_kws_gpt3_open(qry)}")

QUERY='How come I cannot chat?'
    keywords => {'chat', 'come'}
QUERY='Do you currently have any offers?'
    keywords => {'currently', 'offer'}
QUERY='Why does the site experience these lags?'
    keywords => {'experience', 'website', 'lag'}
QUERY='I have been flagged by an admin... what do I do?'
    keywords => {'admin', 'flag'}
QUERY='Is Santa a user of your website? MY FRIEND'S MESSAGE ASSURES ME SO. Santa gives me a nice present each year.'
    keywords => {'website', 'assure', 'user', 'friend', 'message', 'present', 'year', 'santa'}
QUERY='Santa Claus was at Baden Baden yesterday, perhaps, visiting the headquarters of AwesomeProduct, inc.'
    keywords => {'claus', 'inc', 'headquarters', 'visit', 'yesterday', 'awesomeproduct', 'baden', 'santa'}


## Two possible end-to-end solutions

In the following we adopt (and compare) two options that are, in a sense, opposite to each other:
- the "substring" simple choice with closed keyword set (arguably the best among the **fast** solutions)
- the last option, OpenAI and an open keyword set (**slow**, but probably the most effective outcome)

In [34]:
queries = [QUERY1, QUERY2, QUERY3, QUERY4, QUERYZ, QUERYX]

guessers = [('substr/closed', guess_kws_substr_from_set), ('gpt3.5/open', guess_kws_gpt3_open)]

for qry in queries:
    print(f"\nQUERY='{qry}'")
    for kw_g_name, kw_g in guessers:
        print(f"  KwGuesser=<{kw_g_name}> [keywords={kw_g(qry)}]")
        show(hybrid_guess(qry, kw_guesser=kw_g))


QUERY='How come I cannot chat?'
  KwGuesser=<substr/closed> [keywords={'chat'}]
    [1] 0.96762 "I cannot open the support chat."
    [2] 0.96609 "A message disappeared from the chat?"
    [3] 0.96159 "I see no messages in the support chat."
  KwGuesser=<gpt3.5/open> [keywords={'chat', 'come'}]
    [1] 0.71762 "I cannot open the support chat."
    [2] 0.71608 "A message disappeared from the chat?"
    [3] 0.71157 "I see no messages in the support chat."

QUERY='Do you currently have any offers?'
  KwGuesser=<substr/closed> [keywords={'offer'}]
    [1] 0.96833 "Is there any special offer today?"
    [2] 0.47047 "Are special offers available?"
  KwGuesser=<gpt3.5/open> [keywords={'currently', 'offer'}]
    [1] 0.71833 "Is there any special offer today?"
    [2] 0.47047 "Are special offers available?"

QUERY='Why does the site experience these lags?'
  KwGuesser=<substr/closed> [keywords={'lag'}]
    [1] 0.46385 "The support chat on the website is lagging."
  KwGuesser=<gpt3.5/open> [key

On this limited dataset, the two differ only slightly. A more realistic test will be performed.

An important observation is that the numeric similarities depend heavily on the keywords that have been guessed, so either
- one retains the vector similarity only (i.e. sets `rho=0` in the passed score merger), or
- one does not attempt to find a "threshold" valid for all queries.
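
To make this dependence concrete, here is a toy calculation. The two functions are hypothetical stand-ins, not the `keyword_similarity` and `sum_score_merger` from `kw_hybrid_tools` (whose definitions are not shown in this notebook), and the ANN similarity value is made up. The sketch assumes the keyword score is the fraction of guessed keywords found in the snippet and that the merger is a weighted sum controlled by `rho`.

```python
def toy_keyword_similarity(snippet_tokens, keywords):
    # Assumption: fraction of guessed keywords present in the snippet;
    # with no keywords at all, the keyword side is treated as neutral.
    if not keywords:
        return 1.0
    return len(set(keywords) & set(snippet_tokens)) / len(keywords)

def toy_score_merger(ann_similarity, kw_similarity, rho=0.5):
    # Assumption: weighted sum; rho=0 keeps the vector similarity only.
    return (1 - rho) * ann_similarity + rho * kw_similarity

snippet_tokens = {"i", "cannot", "open", "the", "support", "chat"}
ann_similarity = 0.93  # made-up value, identical in both runs

for keywords in (["chat"], ["chat", "come"]):
    kw_sim = toy_keyword_similarity(snippet_tokens, keywords)
    merged = toy_score_merger(ann_similarity, kw_sim)
    vector_only = toy_score_merger(ann_similarity, kw_sim, rho=0)
    print(f"{keywords}: merged={merged:.3f}, rho=0 -> {vector_only:.3f}")
# ['chat']: merged=0.965, rho=0 -> 0.930
# ['chat', 'come']: merged=0.715, rho=0 -> 0.930
```

The same snippet and the same query vector yield different merged scores just because an extra, unmatched keyword was guessed, which is why a single cutoff across queries is unreliable unless `rho=0`.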

### Timing

Before wrapping up, let us compare how long these two solutions take, measuring both the keyword extraction alone and the complete search operation.

_This is not a rigorous performance measurement; we just want to get a rough idea._
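
If a slightly steadier number is wanted, one option is to repeat the loop a few times and take the median per-call time. The helper below is a sketch with a made-up name (`time_per_call`) and is not used in the cells that follow; keep in mind that repeating calls to a paid API multiplies the cost.

```python
from statistics import median
from time import perf_counter

def time_per_call(fn, inputs, repeats=5):
    # Run fn over all inputs `repeats` times and return the median
    # of the per-call averages, in seconds.
    per_call = []
    for _ in range(repeats):
        start = perf_counter()
        for item in inputs:
            fn(item)
        per_call.append((perf_counter() - start) / len(inputs))
    return median(per_call)

# Usage with a trivial stand-in function:
sample_queries = ["How come I cannot chat?", "Do you currently have any offers?"]
print(f"{time_per_call(str.lower, sample_queries):.6f} s per call")
```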

In [37]:
from time import perf_counter

In [38]:
ini = perf_counter()
for qry in queries:
    _ = guess_kws_substr_from_set(qry)
guess_time_substr = (perf_counter() - ini) / len(queries)

ini = perf_counter()
for qry in queries:
    _ = guess_kws_gpt3_open(qry)
guess_time_gpt3 = (perf_counter() - ini) / len(queries)

print(f"Keyword guessing takes {guess_time_substr:.6f} s for substring")
print(f"Keyword guessing takes {guess_time_gpt3:.6f} s for GPT3.5")

Keyword guessing takes 0.000106 s for substring
Keyword guessing takes 0.750707 s for GPT3.5


Does this difference also matter within the larger process that is the whole search?

In [39]:
ini = perf_counter()
for qry in queries:
    _ = hybrid_guess(qry, kw_guesser=guess_kws_substr_from_set)
hsearch_time_substr = (perf_counter() - ini) / len(queries)

ini = perf_counter()
for qry in queries:
    _ = hybrid_guess(qry, kw_guesser=guess_kws_gpt3_open)
hsearch_time_gpt3 = (perf_counter() - ini) / len(queries)

print(f"Full hybrid search takes {hsearch_time_substr:.6f} s for substring")
print(f"Full hybrid search takes {hsearch_time_gpt3:.6f} s for GPT3.5")

Full hybrid search takes 1.086411 s for substring
Full hybrid search takes 1.793387 s for GPT3.5


Not as dramatic as the guessing time comparison, but **far from negligible** when one has to take latencies into account.

## To be continued ...

Yes, there'll be a "Part 4 of 3" (?) with miscellaneous extra experiments and tests.