In [10]:
## Add this directory to the path and load our functions
import sys
sys.path.append("../src/")

import paware

import polars as pl

# Indexing the Data

Now that we have vector representations of all of our data, we can load it into a vector database for streamlined access through indexing.

We are using [LanceDB](https://lancedb.github.io/lancedb/) for our database needs, and we are taking advantage of its built-in Approximate Nearest Neighbor (ANN) index capabilities, provided as a composite inverted file index with product quantization ([IVF_PQ](https://lancedb.github.io/lancedb/concepts/index_ivfpq/)).

The purpose of indexing is to improve retrieval times at the cost of some accuracy. We found a set of parameters that worked well for our purposes here, but it's possible that we could achieve even faster retrieval with minimal loss in performance by adjusting them.

For this demonstration, we chose the following parameters:

* `EMBEDDING_CONFIG_NAME = "demo"`: This specifies that we are going to index the "demo" configuration, for which we have already computed the embeddings and attached our engineered metadata.
* `EMBEDDING_DIR = "../paw_demo/embedded_subs"`: This tells our indexer where to find the embedded data.
* `INDEX_CONFIG_NAME = "_demo"`: This attaches a label to this specific indexing configuration. In testing, we can index the same data using different indexing parameters, and this label allows us to differentiate between them.
* `DB_SAVE_DIR = "../paw_demo/indexed_db"`: This specifies where to save the database table once indexing is complete.
* `METRIC = "cosine"`: Here we choose the metric to be used by the index. Our default is `cosine`.
* `NUM_PARTITIONS = 128`: Here we specify how many partitions should be created by the IVF. Our default is 1024 on the full dataset, but for this demo we'll set it much smaller.
* `NUM_SUB_VECTORS = 24`: Here we decide the level of PQ reduction. Our default on the full dataset was 96, and again we chose a smaller number for this demo.
* `ACCELERATOR = 'mps'`: LanceDB supports `'mps'` and `'cuda'` options here. The default is `None`.
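
To give a feel for what `NUM_SUB_VECTORS` controls, here is a rough back-of-the-envelope sketch of product quantization storage. It assumes a 768-dimensional float32 embedding and one 8-bit code per sub-vector. Neither value is stated above, so treat both as illustrative. The helper `pq_summary` is hypothetical and is not part of `paware` or LanceDB.

```python
def pq_summary(dim, num_sub_vectors, bytes_per_float=4, bits_per_code=8):
    """Return sub-vector length and raw vs. PQ-compressed bytes per vector.

    Assumption: each sub-vector is replaced by a single code of
    `bits_per_code` bits (8 bits means 256 centroids per codebook).
    """
    if dim % num_sub_vectors != 0:
        raise ValueError("dim must be divisible by num_sub_vectors")
    sub_dim = dim // num_sub_vectors          # floats covered by each code
    raw_bytes = dim * bytes_per_float         # uncompressed vector size
    pq_bytes = num_sub_vectors * bits_per_code // 8
    return {
        "sub_dim": sub_dim,
        "raw_bytes": raw_bytes,
        "pq_bytes": pq_bytes,
        "compression": raw_bytes / pq_bytes,
    }


# Compare the demo setting (24) with the full-dataset default (96),
# using the assumed 768-dimensional embedding.
for m in (24, 96):
    print(m, pq_summary(768, m))
```

Fewer sub-vectors means each code has to stand in for a longer slice of the vector, so compression is stronger and distance estimates are coarser. That is the speed versus accuracy trade-off described above.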



In [3]:
## Set up the indexing tool
indexing_tool = paware.PawIndex(
    EMBEDDING_CONFIG_NAME = "demo",
    EMBEDDING_DIR =  "../paw_demo/embedded_subs/",
    INDEX_CONFIG_NAME = "_demo",
    DB_SAVE_DIR = "../paw_demo/indexed_db",
    METRIC = "cosine",
    NUM_PARTITIONS  = 128,
    NUM_SUB_VECTORS = 24, ## Default is 96, smaller for demo
    ACCELERATOR = 'mps',
    )

## Index the data
indexing_tool.index_data()



# Querying the Index

Now that the data is indexed, we can start to query it. At this stage, we can choose to implement several of our strategies for improving the relevance of results. The following parameters define the basic behavior of the query and how it will use the index.

* `CONFIG_NAME`: Specifies which indexing configuration we will be querying.
* `DB_DIR`: Specifies the directory containing the database.
* `QUERY_SAVE_DIR`: Specifies where to save query results.
* `QUERY_NAME`: Used to keep track of this query configuration for later evaluation.
* `LIMIT`: Specifies how many results to return.
* `NPROBES`: Specifies how many nearby partitions (created by the IVF) should be visited while looking for results.
* `REFINE_FACTOR`: This is a multiplier that tells the index to retrieve `LIMIT*REFINE_FACTOR` results, then re-rank using the actual (non-quantized) distances before returning the top `LIMIT` as the final set.
* `METRIC`: This is the metric used for retrieval. The default we use is `cosine`.

The parameters `QUERY_SAVE_DIR` and `QUERY_NAME` are only used when we ask the standard set of queries we use to score our results.
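
The interplay of `LIMIT`, `NPROBES`, and `REFINE_FACTOR` can be illustrated with a toy, pure-Python search. This is only a sketch of the general idea and not how LanceDB or `paware` implement it. The function names, the tiny 2-D dataset, and the use of grid rounding as a stand-in for product quantization are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); 1.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def coarse(v, step=0.5):
    """Hypothetical stand-in for quantization: snap coordinates to a grid."""
    return [round(x / step) * step for x in v]


def search(query, vectors, centroids, limit, nprobes, refine_factor):
    # IVF step: assign every vector to the partition of its nearest centroid.
    partitions = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest = min(partitions, key=lambda c: cosine_distance(v, centroids[c]))
        partitions[nearest].append(idx)
    # Visit only the `nprobes` partitions whose centroids are closest to the query.
    probed = sorted(partitions, key=lambda c: cosine_distance(query, centroids[c]))
    candidates = [idx for c in probed[:nprobes] for idx in partitions[c]]
    # Rank by approximate distance and keep limit * refine_factor candidates.
    candidates.sort(key=lambda i: cosine_distance(query, coarse(vectors[i])))
    shortlist = candidates[: limit * refine_factor]
    # Refine step: re-rank the shortlist with exact distances, return the top `limit`.
    shortlist.sort(key=lambda i: cosine_distance(query, vectors[i]))
    return shortlist[:limit]


centroids = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[1.0, 0.74], [1.0, 0.24], [0.1, 1.0], [0.3, 1.0]]
query = [1.0, 0.45]

print(search(query, vectors, centroids, limit=1, nprobes=1, refine_factor=1))
print(search(query, vectors, centroids, limit=1, nprobes=1, refine_factor=2))
```

With this toy data, `refine_factor=1` returns row 0 because its coarsened form looks closest to the query, while `refine_factor=2` lets the exact re-ranking promote row 1, which is the true nearest neighbor. Raising `nprobes` widens the candidate pool in the same way, at the cost of visiting more partitions.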

## Query Parameters (Pre-query)

Prior to retrieval, we can narrow down our corpus of text through pre-filtering. Under the hood, this is handled through the database using SQL queries. Our two built-in filtering options are:

* `FILTER_SUBMISSIONS`: This filters out any rows that have `'submission'` as their `aware_post_type`. This information was provided with the raw data.
* `FILTER_SHORT_QUESTIONS`: This filters out any rows where the `reddit_text` is shorter than 100 characters and ends in a `?`. We added this information during preprocessing.
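
The two filter rules above can be expressed as plain Python predicates. In the actual pipeline the filtering happens inside the database via SQL, so the following is only an in-memory sketch of the same logic. The function names and the sample rows (including the `'comment'` post type) are hypothetical.

```python
def is_short_question(text, max_len=100):
    """True for texts shorter than `max_len` characters that end in '?'."""
    return len(text) < max_len and text.endswith("?")


def prefilter(rows, filter_submissions=False, filter_short_questions=False):
    """Drop rows matching the enabled filters; keep everything else in order."""
    kept = []
    for row in rows:
        if filter_submissions and row["aware_post_type"] == "submission":
            continue
        if filter_short_questions and is_short_question(row["reddit_text"]):
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration; 'comment' is an assumed post type.
rows = [
    {"aware_post_type": "submission", "reddit_text": "Weekly thread about shift swaps."},
    {"aware_post_type": "comment", "reddit_text": "Anyone know?"},
    {"aware_post_type": "comment", "reddit_text": "I sear it in cast iron, then finish it in the oven."},
]

for row in prefilter(rows, filter_submissions=True, filter_short_questions=True):
    print(row["reddit_text"])
```

Filtering before retrieval means the excluded rows never compete for a place in the top `LIMIT` results.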

## Query Parameters (Post-query)

After retrieving our results, we can then re-rank them before returning them to the user. We have three re-ranking strategies we could apply, each of which depends on the engineered metadata that we've generated.

* `RERANK_SENTIMENT`: This implements our re-ranking by `summed_sentiment_of_replies`
* `RERANK_AGREE_DISTANCE`: This implements our re-ranking by `avg_reply_agree_distance`
* `RERANK_DISAGREE_DISTANCE`: This implements our re-ranking by `avg_reply_disagree_distance`
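
The exact ordering rule behind each re-ranker is not spelled out here, so the following is only a minimal sketch of what re-ranking by a metadata column can look like. It assumes that a higher `summed_sentiment_of_replies` should rank first. The function `rerank_by` and the sample rows are hypothetical and are not taken from `paware`.

```python
def rerank_by(results, column, descending=True):
    """Return a new list of result rows ordered by one metadata column.

    Python's sort is stable, so rows with equal values keep their
    original (distance-based) order.
    """
    return sorted(results, key=lambda row: row[column], reverse=descending)


# Made-up retrieval results, initially ordered by `_distance`.
results = [
    {"reddit_text": "a", "_distance": 0.10, "summed_sentiment_of_replies": -1.5},
    {"reddit_text": "b", "_distance": 0.12, "summed_sentiment_of_replies": 2.0},
    {"reddit_text": "c", "_distance": 0.15, "summed_sentiment_of_replies": 0.5},
]

reranked = rerank_by(results, "summed_sentiment_of_replies")
print([row["reddit_text"] for row in reranked])
```

Because re-ranking only reorders rows that were already retrieved, it can change which results appear first but cannot bring back rows that the index or the pre-filters excluded.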


Below, we choose the parameters of our top performing configuration on the whole dataset.


In [4]:
query_tool = paware.PawQuery(
    CONFIG_NAME = "demo_demo",
    DB_DIR = "../paw_demo/indexed_db/",
    QUERY_SAVE_DIR = "../paw_demo/demo_query_results/",
    QUERY_NAME = "top_config",
    LIMIT = 20, ## Default is 50, smaller for demo
    NPROBES = 5, ## Default is 20, smaller for demo
    REFINE_FACTOR = 5, ## Default is 10, smaller for demo
    FILTER_SUBMISSIONS = False,
    FILTER_SHORT_QUESTIONS = True,
    RERANK_SENTIMENT = True,
    RERANK_AGREE_DISTANCE = True,
    RERANK_DISAGREE_DISTANCE = False,
    METRIC = "cosine"
    )

Now, we can query the data.

In [16]:
results = query_tool.ask_a_query("What is the best way to cook a steak?")

In [17]:
with pl.Config(tbl_rows=26, tbl_width_chars=180, fmt_str_lengths=300):
    print(results[["reddit_subreddit","reddit_text", "_distance"]].head(5))

shape: (5, 3)
┌──────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│ reddit_subreddit ┆ reddit_text                                                                                                                                       ┆ _distance │
│ ---              ┆ ---                                                                                                                                               ┆ ---       │
│ str              ┆ str                                                                                                                                               ┆ f32       │
╞══════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════╡
│ wholefoods       ┆ Steak tartare is great.                                     