# Suggest

One strategy for improving user experience is to give users hints or to autocomplete their queries. Let's consider 'suggest'.

Today we will practice generating suggestions using the [Trie](https://en.wikipedia.org/wiki/Trie) data structure (prefix tree); see the example below.

The **Trie** data structure is a tree-like data structure used for storing a dynamic set of strings. It is commonly used for efficient retrieval and storage of keys in a large dataset. The structure supports operations such as insertion, search, and deletion of keys, making it a valuable tool in fields like computer science and information retrieval.

A Trie consists of nodes connected by edges. Each node represents a character or a part of a string. The root node, the starting point of the Trie, represents an empty string. Each edge emanating from a node signifies a specific character. The path from the root to a node represents the prefix of a string stored in the Trie.

![image](https://www.ritambhara.in/wp-content/uploads/2017/05/Screen-Shot-2017-05-01-at-4.01.38-PM.png)
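To make the description above concrete, here is a minimal sketch of a trie written from scratch. The names `TrieNode`, `SimpleTrie`, and `with_prefix` are illustrative assumptions and are not part of any library; the rest of the lesson uses a ready-made implementation instead.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode (one edge per character)
        self.is_key = False   # True if the path from the root spells a stored string


class SimpleTrie:
    def __init__(self):
        self.root = TrieNode()  # the root stands for the empty string

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def with_prefix(self, prefix: str) -> list[str]:
        # Walk down the path spelled by the prefix.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every stored string in the subtree below that node.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_key:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


demo_trie = SimpleTrie()
for word in ["tea", "ten", "to", "inn"]:
    demo_trie.insert(word)

print(demo_trie.with_prefix("te"))  # ['tea', 'ten']
```

All strings sharing a prefix live in one subtree, which is why a prefix lookup only touches the nodes below the prefix path.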

Plan of this lesson:

1. Build a Trie from real search query data provided by AOL;
2. Generate suggestions based on the Trie;
3. Measure suggestion speed.

## 0. Install Trie data structure support

You are free to use any library implementation of Trie, as well as the one we suggest (read the docs before asking any questions!): https://github.com/google/pygtrie

In [None]:
!pip install pygtrie

## 1. Build a trie upon a dataset

### 1.1. Read the dataset

Download the [dataset](https://github.com/IUCVLab/information-retrieval/tree/main/datasets/aol). For simplicity, we provide only the first part of the original data (~3.5 million queries).

Explore the data (see the README file). Load the dataset. Pass the assert.

In [None]:
import pandas as pd

aol_data = None

#TODO: Read the dataset, e.g. as pandas dataframe

### 1.1.1. Tests

In [None]:
assert aol_data.shape[0] == 3558411, "Dataset size does not match"

### 1.2. Build a Trie

We want the suggest function to be **non-sensitive to stop words**, because we don't want to upset users who confuse or omit prepositions. Consider *"public events in Innopolis"* vs *"public events at Innopolis"* or *"public events Innopolis"*: they all mean the same.
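One possible way to get this behaviour is to normalize every query before it is used as a key, dropping stop words on both the indexing and the lookup side. The sketch below is only an illustration: the helper name `normalize_query` is hypothetical, and lowercasing is an extra assumption that the task does not require.

```python
# Same stop-word list as in the notebook cell below, repeated so the sketch is self-contained.
STOPS = set('a on at of to is from for and with using the in &'.split())


def normalize_query(query: str, stops=STOPS) -> str:
    """Lowercase the query (assumption) and drop stop words, keeping token order."""
    tokens = query.lower().split()
    return " ".join(token for token in tokens if token not in stops)


variants = [
    "public events in Innopolis",
    "public events at Innopolis",
    "public events Innopolis",
]
print({normalize_query(v) for v in variants})  # {'public events innopolis'}
```

Because all three variants collapse to the same key, they end up in the same trie node. The original query text then has to be kept in the node's value if you want to show it to the user unchanged.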

Build a Trie based on the dataset, **storing query statistics such as query _frequency_, URLs and ranks in the nodes**. Some queries may have no associated URLs, others may have multiple ranked URLs. Think of a way to store this information.
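As one hedged option, the value stored in each node could be a small record like the one below. The class name `QueryStats`, its fields, and the rule "one dataset row counts as one occurrence" are assumptions made for illustration; only the `urls` list property matches what the tests further down expect. The rows are toy data, not taken from the AOL dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryStats:
    """Hypothetical payload stored under one (normalized) query key."""
    frequency: int = 0                                    # how many rows had this query
    originals: Counter = field(default_factory=Counter)  # original query text -> count
    url_ranks: dict = field(default_factory=dict)        # url -> best (lowest) rank seen

    def add(self, original: str, url: Optional[str] = None,
            rank: Optional[int] = None) -> None:
        self.frequency += 1
        self.originals[original] += 1
        if url is not None and rank is not None:
            self.url_ranks[url] = min(rank, self.url_ranks.get(url, rank))

    @property
    def urls(self) -> list:
        # URLs ordered from best rank to worst rank.
        return sorted(self.url_ranks, key=self.url_ranks.get)


# Toy rows: (query, clicked url or None, rank or None).
rows = [
    ("sample question", "http://a.example", 2),
    ("sample question", "http://b.example", 1),
    ("sample question", None, None),
]

index = {}  # a plain dict stands in for the trie in this sketch
for query, url, rank in rows:
    index.setdefault(query, QueryStats()).add(query, url, rank)

stats = index["sample question"]
print(stats.frequency, stats.urls)  # 3 ['http://b.example', 'http://a.example']
```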

Pass the asserts.

In [3]:
stops = set('a on at of to is from for and with using the in &'.split())

In [None]:
import pygtrie

aol_trie = pygtrie.CharTrie()


#TODO: build a trie based on the dataset

### 1.2.1. Tests

In [None]:
# test trie
bag = []
for key, val in aol_trie.iteritems("sample q"):
    print(key, '~', val)
    
    #NB: here we assume you store urls in a property of list type. But you can do something different. 
    bag += val.urls
    
    assert "sample question" in key, "All examples have `sample question` substring"
    assert key[:len("sample question")] == "sample question", "All examples have `sample question` starting string"

for url in ["http://www.surveyconnect.com", "http://www.custominsight.com", 
            "http://jobsearchtech.about.com", "http://www.troy.k12.ny.us",
            "http://www.flinders.edu.au", "http://uscis.gov"]:
    assert url in bag, "This url should be in a trie"

## 2. Non-sensitive to stop words

### 2.1. Write a suggest function which is non-sensitive to stop words

Suggest options for a user query based on the Trie you just built.
Output the results sorted by frequency and print the query count for each suggestion. If a URL is available, print it too. If multiple URLs are available, print the one with the best rank (lower is better).

Pass the asserts.
Question for analysis: what is the empirical minimum prefix length at which suggestions become useful?
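The ranking step described above can be sketched independently of the trie. The snippet assumes the candidates have already been collected into a dict of the form `query -> (frequency, {url: rank})`; that shape, the function name `top_k_by_frequency`, and the toy numbers are all assumptions for illustration.

```python
import heapq


def top_k_by_frequency(candidates: dict, top_k: int = 5) -> list:
    """Return (query, frequency, best_url) for the top_k most frequent candidates."""
    best = heapq.nlargest(top_k, candidates.items(), key=lambda item: item[1][0])
    results = []
    for query, (frequency, url_ranks) in best:
        # The best URL is the one with the lowest rank; None if there is no URL.
        best_url = min(url_ranks, key=url_ranks.get) if url_ranks else None
        results.append((query, frequency, best_url))
    return results


# Toy candidates with made-up frequencies and ranks.
toy_candidates = {
    "trie tutorial": (5, {"http://x.example": 3, "http://y.example": 1}),
    "trie python": (2, {}),
    "trie": (1, {"http://z.example": 4}),
}

for query, frequency, url in top_k_by_frequency(toy_candidates, top_k=2):
    print(query, frequency, url)
```

`heapq.nlargest` avoids sorting the whole candidate list when only a few suggestions are needed, which matters for short prefixes that match many queries.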

In [None]:
def complete_user_query(query: str, trie, top_k=5) -> list[str]:
    #TODO: suggest top_k options for a user query
    # sort results by frequency (!), 
    # suggest the QUERIES for first k ranked urls if available

    #NB we assume you return suggested query string only
    
    pass

In [None]:
inp = "trie"
print("Query:", inp)
print("Results:")
res = complete_user_query(inp, aol_trie)
print(res)

### 2.1.1. Tests

In [None]:
assert res[0] == "tried and true tattoo"
assert res[1] == "triest" or res[1] == "triethanalomine"

assert "boys and girls club of conyers georgia" \
            in complete_user_query("boys girls club conyers", aol_trie, 10), "Should be here"

## 3. Measure suggest speed

### 3.1. Full Trie test

Check how fast your search is working. Consider changing your code if it takes too long on average.

Success criteria:
- there is an average and a standard deviation for **multiple runs** of the given bucket of queries.
- there is an average and a standard deviation for **multiple runs** of naive search in the unindexed dataset.
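A minimal timing harness for these criteria might look as follows. It uses only the standard library (`statistics` instead of `numpy`), and the names `time_per_query` and `naive_search` as well as the three-item corpus are placeholders; in the notebook you would pass your own suggest function and the real data.

```python
import statistics
import time


def time_per_query(search, queries, repeats=5):
    """Run search(q) `repeats` times per query; return q -> (mean_ms, std_ms)."""
    timings_ms = {}
    for q in queries:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            search(q)
            samples.append((time.perf_counter() - start) * 1000.0)
        timings_ms[q] = (statistics.mean(samples), statistics.stdev(samples))
    return timings_ms


# Placeholder corpus and naive (unindexed) prefix search.
corpus = ["information retrieval", "infinite loop", "new york times"]


def naive_search(prefix):
    return [q for q in corpus if q.startswith(prefix)]


timings = time_per_query(naive_search, ["inf", "new york"])
for q, (mean_ms, std_ms) in timings.items():
    print(f"{q!r}: {mean_ms:.4f} ms +/- {std_ms:.4f} ms")
```

The same harness can be called once with the trie-based function and once with the naive one, so both measurements share the same procedure.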

In [None]:
import time
import numpy as np

inp_queries = ["inf", "the best ", "information retrieval", "sherlock hol", "carnegie mell", 
               "babies r", "new york", "googol", "inter", "USA sta", "Barbara "]

#TODO: measure average execution time and standard deviation (in milliseconds) per query and print it out
# Repeat this for index and for no index.

## 4. Spellchecking

### 4.1. Add spellchecking to your suggest

Try to make the search results for a misspelled query as close as possible to those for the correctly spelled one. Compare the top-5 results of each query with the top-5 results for its corrected version.

You can use the [pyspellchecker](https://pypi.org/project/pyspellchecker/) `candidates()` call, or use any other spellchecker implementation.
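To illustrate the idea of "any other spellchecker implementation", the sketch below corrects tokens with the standard library's `difflib` instead of pyspellchecker. This is a swapped-in technique, not the suggested library; the function name `correct_tokens`, the cutoff value, and the tiny vocabulary are assumptions.

```python
import difflib


def correct_tokens(query: str, vocabulary: list[str], cutoff: float = 0.75) -> str:
    """Replace each unknown token with its closest vocabulary word, if one is close enough."""
    corrected = []
    for token in query.lower().split():
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Toy vocabulary; in practice it could be built from the words of the indexed queries.
vocabulary = ["information", "retrieval", "sherlock", "carnegie"]

print(correct_tokens("inormation retrieval", vocabulary))  # information retrieval
```

Note that the last token of a query being typed is often an incomplete prefix (e.g. `hol`), so a real implementation may want to leave it uncorrected and pass the corrected query on to the prefix search.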

In [None]:
from spellchecker import SpellChecker

def complete_user_query_with_spellchecker(query, trie, top_k=5) -> list[str]:
    #TODO: suggest top_k options for a user query
    # sort results by frequency (!!), 
    # suggest the QUERIES for first k ranked urls if available
    pass

### 4.1.1. Tests

In [None]:
inp_queries = ["inormation retrieval", "shelrock hol", "carnagie mell", "babis r", "Barrbara "]
inp_queries_corrected = ["information retrieval", "sherlock hol", "carnegie mell", "babies r", "Barbara "]

for q, qc in zip(inp_queries, inp_queries_corrected):
    assert complete_user_query(qc, aol_trie, 5) == \
            complete_user_query_with_spellchecker(q, aol_trie, 5), "Assert {} and {} give different results".format(q, qc)

## 5. Assess how dataset size affects search time

Study the speed of the trie data structure on $\frac{1}{10}, \frac{1}{4}, \frac{1}{2}$ of the dataset, and on the full dataset.
- Sample the data at random.
- Plot a graph which shows how search time changes with dataset size.
- Compare against brute force.
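The sampling step could be sketched like this, using only the standard library. The helper name `sample_fractions`, the fixed seed, and the synthetic query list are assumptions; with a pandas dataframe you would sample rows in an equivalent way.

```python
import random


def sample_fractions(queries, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Return fraction -> random sample of the queries of the corresponding size."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    samples = {}
    for fraction in fractions:
        size = max(1, round(len(queries) * fraction))
        samples[fraction] = rng.sample(queries, size)
    return samples


# Synthetic stand-in for the real query list.
toy_queries = [f"query {i}" for i in range(100)]

for fraction, sample in sample_fractions(toy_queries).items():
    print(fraction, len(sample))
```

Each sample can then be used to build a separate index and timed with the same set of input prefixes, giving one point per dataset size for the plot.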

In [None]:
import matplotlib.pyplot as plt

### YOUR CODE HERE

## 6. What if the query is in the middle?

Modify your code to suggest a string even if the query is found **in the middle** of the text. Think about techniques you can borrow from our previous classes, e.g. wildcard search.

E.g. `Semantic Parsing` in 

```
3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-scale 3D Point Clouds
                                                           ~~~~~~~~~~~~~~~~
```

**NB**: Please extend your trie-based approach. Even if using `in` or a regexp can give you the same result, this is not a scalable approach, and we will not accept it.

Pass the asserts.
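One technique that fits this requirement is to index every word-level suffix of each query, so that a match in the middle becomes an ordinary prefix match on one of the suffixes. The sketch below is an assumption-laden illustration: a sorted key list with `bisect` stands in for the trie's prefix lookup, and the function names and toy queries are hypothetical.

```python
import bisect


def word_suffixes(query: str) -> list[str]:
    """'a b c' -> ['a b c', 'b c', 'c']"""
    tokens = query.split()
    return [" ".join(tokens[i:]) for i in range(len(tokens))]


def build_suffix_index(queries):
    """Map every word-level suffix to the set of full queries it came from."""
    mapping = {}
    for query in queries:
        for suffix in word_suffixes(query):
            mapping.setdefault(suffix, set()).add(query)
    return sorted(mapping), mapping


def find_in_middle(prefix, keys, mapping):
    """Prefix lookup over the suffix keys; keys sharing a prefix are adjacent when sorted."""
    hits = set()
    i = bisect.bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        hits |= mapping[keys[i]]
        i += 1
    return sorted(hits)


toy = ["ricky martin beach", "free adult movies", "martin luther king"]
keys, mapping = build_suffix_index(toy)

print(find_in_middle("martin beach", keys, mapping))  # ['ricky martin beach']
```

In a trie-based version, each suffix would be inserted as a key whose value points back to the full query and its statistics. The price is a larger index, roughly one key per word of every query.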

In [None]:
newtrie = None

## YOUR CODE HERE

def complete_user_query_with_spellchecker_and_middle(query, trie, top_k=5) -> list[str]:
    #TODO: suggest top_k options for a user query
    # sort results by frequency (!), 
    # suggest the QUERIES for first k ranked urls if available
    pass

### 6.1.1. Tests

In [None]:
assert "ricky martin beach" in complete_user_query_with_spellchecker_and_middle(
            "martin beach", newtrie, 20)
assert "free adult movies" in  complete_user_query_with_spellchecker_and_middle(
            "adult movie", newtrie, 20)

## 7. Enrich your suggest with search results

Your users will be happy if, while typing a query, they see not only suggested queries but also snippets of the answers to these queries!

Imagine you type "continental air", and the search engine suggests "continental airlines" together with the URL and a snippet such as `"Continental Airlines was a major American airline founded in 1934 and eventually headquartered in Houston, Texas..."`, which you borrow from the search engine snippet. How can you add an existing search engine to your code? [One](https://yandex.com/dev/xml/doc/dg/task/quickstart.html), [two](https://docs.microsoft.com/en-us/bing/search-apis/bing-web-search/search-the-web), [three](https://searx.roughs.ru/), [four](https://serpapi.com/) ...

Improve your suggest. It should return a tuple of 3 instead of just a string: your result is now `(query, text, url)`. Write your own tests which check that, for the query `continental air`, the results include:
1. `query` = `continental airlines`.
2. `Continental Airlines was a major American airline founded in 1934` in `text`.
3. `url` = `https://en.wikipedia.org/wiki/Continental_Airlines`.

In [None]:
def complete_user_query_with_spellchecker_and_middle_with_snippets(query, trie, top_k=5) -> list[tuple]:
    #TODO
    pass

### 7.1.1. Tests

In [None]:
results = complete_user_query_with_spellchecker_and_middle_with_snippets("continental air", newtrie, 5)

assert any("continental airlines" in result[0] for result in results), "Query 'continental airlines' should be in the results"
assert any("Continental Airlines was a major American airline founded in 1934" in result[1] for result in results), "Snippet should contain 'Continental Airlines was a major American airline founded in 1934'"