In [1]:
## Add this directory to the path and load our functions
import sys
sys.path.append("../src/")

import paware

# Embedding the data

First we apply an embedding to our data. For demonstration purposes, we embed only the data from the GM subreddit. At this stage we also apply some preprocessing and set the embedding parameters.

To process the whole dataset, we first split it into separate files, one for each subreddit. This kept file sizes manageable while preserving all of the structure of replies.

## Preprocessing

The steps in preprocessing are:

* Drop deleted or removed comments and submissions
* Drop likely bots and memes
* Handle blank comments and submissions

Here, we also add a column `is_short_question` that indicates whether the `reddit_text` is fewer than 100 characters and ends with a `"?"`.
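The steps above can be sketched in plain Python. This is only an illustration, not the `paware` implementation: the function names and the list-of-dicts row format are assumptions, the bot and meme filter is left out because its criteria are not shown in this notebook, and the blank-text handling mirrors what the preprocessing log prints (blank comments are dropped, blank submissions fall back to their title).

```python
# Illustrative sketch only; names and row format are assumptions, not paware's API.
DROPPED_MARKERS = {"[deleted]", "[removed]"}
BLANK_TEXTS = {"", " "}


def is_short_question(text: str, max_chars: int = 100) -> bool:
    """True when the text is fewer than `max_chars` characters and ends with '?'."""
    return len(text) < max_chars and text.endswith("?")


def preprocess(rows: list[dict]) -> list[dict]:
    """Apply the drop/replace rules and add the `is_short_question` flag."""
    cleaned = []
    for row in rows:
        text = row["reddit_text"]
        # Drop deleted or removed comments and submissions.
        if text in DROPPED_MARKERS:
            continue
        if text in BLANK_TEXTS:
            # Blank comments carry no content, so they are dropped.
            if row["aware_post_type"] == "comment":
                continue
            # Blank submissions fall back to their title.
            text = row["reddit_title"]
        cleaned.append(
            {**row, "reddit_text": text, "is_short_question": is_short_question(text)}
        )
    return cleaned


rows = [
    {"aware_post_type": "comment", "reddit_text": "[deleted]", "reddit_title": None},
    {"aware_post_type": "comment", "reddit_text": " ", "reddit_title": None},
    {"aware_post_type": "submission", "reddit_text": "", "reddit_title": "Is the bonus paid in March?"},
    {"aware_post_type": "comment", "reddit_text": "It depends on the team.", "reddit_title": None},
]
cleaned_rows = preprocess(rows)
```

In this toy input, two rows survive: the submission (now carrying its title as text and flagged as a short question) and the ordinary comment.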

## Embedding

For embedding, the main parameters we vary are:

* `CHUNK_WITH_METADATA` - This determines whether we append some information about the subreddit to the start of each text chunk before embedding
* `CHUNK_SIZE` - This determines the maximum token length we will send to the embedding model at a time
* `CHUNK_OVERLAP_PCT` - This determines the minimum overlap between adjacent text chunks

Under the hood, we use a BERT-based embedding model, gte-base. See more at: [Hugging Face](https://huggingface.co/thenlper/gte-base)
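To make `CHUNK_SIZE` and `CHUNK_OVERLAP_PCT` concrete, here is a minimal sliding-window sketch. It is not the code `paware` runs: `chunk_tokens` is a hypothetical helper, the input is assumed to be an already tokenized list (the real tokenizer is not shown here), and rounding the overlap up is an assumption chosen so that the overlap never falls below the requested percentage.

```python
import math


def chunk_tokens(tokens: list, chunk_size: int = 512, overlap_pct: float = 0.2) -> list[list]:
    """Split `tokens` into windows of at most `chunk_size` items.

    Adjacent windows share at least ceil(chunk_size * overlap_pct) items
    (assumed rounding rule; the real implementation may differ).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 1:
        raise ValueError("overlap_pct must be in [0, 1)")
    if not tokens:
        return []

    overlap = math.ceil(chunk_size * overlap_pct)
    stride = max(1, chunk_size - overlap)  # how far the window advances each step

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Toy example: 8 tokens, windows of 4, 50% overlap.
demo_chunks = chunk_tokens(list(range(8)), chunk_size=4, overlap_pct=0.5)
```

With the demo settings used in this notebook (`CHUNK_SIZE=512`, `CHUNK_OVERLAP_PCT=0.2`), this rule would advance the window by 409 tokens at a time.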

In [2]:
paware.PawEmbedding(
    CONFIG_NAME="demo", ## This is the name of the embedding configuration
    RAW_DATA_PATH="../temp_data/raw_data_subset_gm.parquet", ## This is the path to the raw data
    EMBEDDED_SAVE_DIR="../temp_vector_data/", ## All of the embedded data will be saved in a subdirectory of this directory: "config_[CONFIG_NAME]/"
    BATCH_SIZE=10000, ## This is the batch size for the embedding
    CHUNK_WITH_METADATA=False, ## This is a flag to indicate if we are including metadata in the embedding
    CHUNK_SIZE=512, ## This is the size of the chunks to embed
    CHUNK_OVERLAP_PCT=0.2 ## This is the overlap percentage for the chunks
    ).embed_data() 

Dropping 1099 rows with reddit_text=='[deleted]'
Dropping 119 rows with reddit_text=='[removed]'
Dropping 67 rows that are likely bots or memes
Dropping 0 rows with 'reddit_text'=='' and 'aware_post_type'=='comment'
Dropping 0 rows with 'reddit_text'==' ' and 'aware_post_type'=='comment'
Replacing 'reddit_text' with 'reddit_title' in 235 rows with 'reddit_text'=='' or 'reddit_text'==' '
Working on batch 0, rows 0 through 10000...
Done with batch 0.
Working on batch 1, rows 10000 through 20000...
Done with batch 1.
Working on batch 2, rows 20000 through 30000...
Done with batch 2.
Working on batch 3, rows 30000 through 40000...
Done with batch 3.
Working on batch 4, rows 40000 through 41089...
Done with batch 4.


## Building a Vector Database and Indexing the Data

The following loads our data into a [LanceDB](https://lancedb.github.io/lancedb/) vector database and creates an [IVF-PQ](https://lancedb.github.io/lancedb/concepts/index_ivfpq/#ivf-pq) ANN index. Behind the scenes, we can also vary the IVF-PQ hyperparameters to improve query speed at the cost of some accuracy.

In [3]:
paware.PawIndex(
    EMBEDDING_CONFIG_NAME="demo", ## This is the name of the embedding configuration to load for indexing
    EMBEDDING_DIR="../temp_vector_data/", ## This is the path to the embedded data
    INDEX_CONFIG_NAME="01", ## This is the name of the index configuration
    DB_SAVE_DIR="../temp_db_data/", ## The database and associated table will be in a subdirectory of this directory: "db_[EMBEDDING_CONFIG_NAME][INDEX_CONFIG_NAME]/"
    METRIC="cosine", ## This is the metric to use for the index
    ACCELERATOR="mps"   ## This is the accelerator to use for creating the index (replace with `None` if you aren't on a Mac)
).index_data()



# Querying the Data

Once the data is loaded and indexed in a database, we can perform queries. In its most basic form, a query only needs the following:

* `METRIC` - The metric we are using to determine distance (we always chose cosine similarity in this application)
* `LIMIT` - The number of results we want to retrieve

We also can vary parameters associated with our index:

* `NPROBES` - The number of nearby Voronoi cells to check for results
* `REFINE_FACTOR` - This multiplied by `LIMIT` is the number of results retrieved behind the scenes, which are then re-ranked based on actual distances (rather than just the distances to the quantized vectors).
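The two index parameters can be illustrated with a toy sketch. This is not how LanceDB is implemented; it only mirrors the descriptions above. All names here (`probe_cells`, `refine`, the dictionaries of centroids and vectors) are hypothetical, and the approximate ranking is passed in as a plain list standing in for the order produced by quantized distances.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def probe_cells(query: list[float], centroids: dict, nprobes: int) -> list:
    """Pick the `nprobes` Voronoi cells whose centroids are closest to the query."""
    ranked = sorted(centroids, key=lambda cell: cosine_distance(query, centroids[cell]))
    return ranked[:nprobes]


def refine(query: list[float], approx_ranked_ids: list, vectors: dict,
           limit: int, refine_factor: int) -> list:
    """Take limit * refine_factor approximate hits, re-rank them by exact distance."""
    candidates = approx_ranked_ids[: limit * refine_factor]
    reranked = sorted(candidates, key=lambda i: cosine_distance(query, vectors[i]))
    return reranked[:limit]


query = [1.0, 0.0]
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
approx_order = ["b", "a", "c"]  # pretend quantization ranked "b" ahead of "a"

without_refine = refine(query, approx_order, vectors, limit=1, refine_factor=1)
with_refine = refine(query, approx_order, vectors, limit=1, refine_factor=2)
```

In the toy data, the approximate ranking puts `"b"` first; widening the candidate pool with `refine_factor=2` lets the exact re-ranking recover `"a"`, which is the true nearest neighbor. Larger `NPROBES` and `REFINE_FACTOR` values generally trade query speed for recall.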

In [8]:
query_tool = paware.PawQuery(
    CONFIG_NAME="demo01", ## This is the name of the [EMBEDDING_CONFIG_NAME][INDEX_CONFIG_NAME] to load for querying
    DB_DIR="../temp_db_data/", ## This is the path to the database
    QUERY_SAVE_DIR="../temp_query_data/", ## The query results will be saved in this directory as: "queries_[EMBEDDING_CONFIG_NAME][INDEX_CONFIG_NAME][QUERY_NAME].parquet"
    QUERY_NAME="query_demo", ## This is the name of the query
    METRIC="cosine", ## This is the metric to use for the query
    LIMIT=50, ## This is the number of results to return
    NPROBES=20, ## This is the number of probes to use for the query
    REFINE_FACTOR=10 ## This is the refine factor to use for the query
    )
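
The save locations named in the cell comments follow a simple concatenation pattern. The helpers below are hypothetical (they are not part of `paware`) and only restate that pattern, so you can predict where each stage writes its output.

```python
from pathlib import PurePosixPath

# Hypothetical helpers that restate the naming patterns from the cell comments.


def embedding_dir(save_dir: str, config_name: str) -> PurePosixPath:
    """Embedded data: "config_[CONFIG_NAME]/" under EMBEDDED_SAVE_DIR."""
    return PurePosixPath(save_dir) / f"config_{config_name}"


def db_dir(save_dir: str, embedding_config: str, index_config: str) -> PurePosixPath:
    """Database: "db_[EMBEDDING_CONFIG_NAME][INDEX_CONFIG_NAME]/" under DB_SAVE_DIR."""
    return PurePosixPath(save_dir) / f"db_{embedding_config}{index_config}"


def query_file(save_dir: str, embedding_config: str, index_config: str,
               query_name: str) -> PurePosixPath:
    """Query results: "queries_[EMBEDDING][INDEX][QUERY_NAME].parquet" under QUERY_SAVE_DIR."""
    name = f"queries_{embedding_config}{index_config}{query_name}.parquet"
    return PurePosixPath(save_dir) / name


results_path = query_file("../temp_query_data/", "demo", "01", "query_demo")
```

For the demo configuration, `results_path` is the same file that the scoring tool loads later in this notebook.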

## Asking a Single Query

Here, we demonstrate how to retrieve the results for a single query using our tools.

In [9]:
query_tool.ask_a_query("What kind of results does this query return?")

aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission,text_chunk,vector,_distance
str,str,str,str,i64,str,str,str,str,str,str,str,str,str,str,"array[f32, 768]",f32
"""comment""","""2024-01-30T08:…","""kk959ov""","""t1_kk959ov""",1706620648,"""ArtisticOpposi…","""What does this…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_1aenui0""","""t3_1aenui0""","""1aenui0""","""What does this…","[-0.012349, -0.014472, … 0.026717]",0.180064
"""comment""","""2023-03-12T17:…","""jbzbstn""","""t1_jbzbstn""",1678658342,"""Revolutionary_…","""yes .many retu…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_11ns57e""","""t1_jby2cbm""","""11ns57e""","""yes .many retu…","[0.001542, -0.041068, … -0.011364]",0.183029
"""submission""","""2023-10-10T03:…","""174fa8a""","""t3_174fa8a""",1696923055,"""HighVoltageZ06…","""Check your ema…","""/r/GeneralMoto…","""Workplace of C…","""https://www.re…","""GeneralMotors""",,,,"""Check your ema…","[-0.010385, -0.014085, … -0.00345]",0.184921
"""comment""","""2023-10-24T00:…","""k67ftlm""","""t1_k67ftlm""",1698120035,"""GrandpaJoeSlot…","""Which function…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_17f0ikm""","""t3_17f0ikm""","""17f0ikm""","""Which function…","[-0.010807, 0.005565, … 0.029601]",0.185719
"""comment""","""2023-03-15T16:…","""jccflhz""","""t1_jccflhz""",1678913598,"""Mysterious-One…","""Which function…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_11rvzz9""","""t1_jcc7zh2""","""11rvzz9""","""Which function…","[-0.010807, 0.005565, … 0.029601]",0.185719
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""comment""","""2023-01-05T12:…","""j32l0gu""","""t1_j32l0gu""",1672938165,"""None""","""Oh good questi…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_103zna6""","""t1_j32ixuz""","""103zna6""","""Oh good questi…","[0.020299, -0.006411, … -0.019769]",0.215532
"""comment""","""2023-03-13T09:…","""jc1y38x""","""t1_jc1y38x""",1678714302,"""RevelacaoVerda…","""Anything of no…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_11q93mp""","""t3_11q93mp""","""11q93mp""","""Anything of no…","[-0.008088, -0.003823, … -0.013716]",0.215636
"""comment""","""2023-09-07T19:…","""jzlho94""","""t1_jzlho94""",1694128399,"""Extreme-Intern…","""Ah got it. Tha…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_16adez5""","""t1_jzjrmir""","""16adez5""","""Ah got it. Tha…","[-0.007146, -0.015612, … -0.044951]",0.215964
"""comment""","""2023-01-14T10:…","""j4bifa9""","""t1_j4bifa9""",1673709176,"""continue_impro…","""All of the abo…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_10b0jb5""","""t1_j4axkoy""","""10b0jb5""","""All of the abo…","[-0.006142, 0.003509, … -0.032284]",0.216029


# Evaluating Results

We used a standard set of queries to evaluate how well we were able to retrieve relevant results.

In [10]:
query_tool.ask_standard_queries()

aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission,text_chunk,vector,_distance,query_text
str,str,str,str,i64,str,str,str,str,str,str,str,str,str,str,"array[f32, 768]",f32,str
"""comment""","""2024-03-01T03:…","""kstqpl1""","""t1_kstqpl1""",1709281654,"""Loose_Warthog5…","""I was just as …","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_1b34wy3""","""t1_ksqte2l""","""1b34wy3""","""I was just as …","[-0.01999, 0.007888, … 0.019346]",0.070577,"""How do General…"
"""comment""","""2023-12-23T23:…","""kepcb17""","""t1_kepcb17""",1703392492,"""noliesheretoda…","""Im not in IT, …","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_18plz0u""","""t1_kepata3""","""18plz0u""","""caused by an a…","[0.003495, -0.006993, … 0.029323]",0.070585,"""How do General…"
"""comment""","""2024-01-11T13:…","""khe7ply""","""t1_khe7ply""",1704996325,"""TagProNoah""","""It depends on …","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_1947fo4""","""t3_1947fo4""","""1947fo4""","""It depends on …","[0.02223, 0.011951, … 0.0319]",0.072132,"""How do General…"
"""comment""","""2024-01-06T14:…","""kgme97k""","""t1_kgme97k""",1704568719,"""TheRealActaeus…","""GM is pushing …","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_18zorlu""","""t3_18zorlu""","""18zorlu""","""GM is pushing …","[0.015377, 0.016594, … 0.041821]",0.074729,"""How do General…"
"""comment""","""2024-01-03T10:…","""kg4uurn""","""t1_kg4uurn""",1704294873,"""AccurateBarnac…","""Again I’m not …","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_18x7myx""","""t3_18x7myx""","""18x7myx""","""Again I’m not …","[0.015673, 0.014669, … 0.022316]",0.076631,"""How do General…"
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""comment""","""2022-12-10T20:…","""izqdauc""","""t1_izqdauc""",1670722928,"""HoldTacICU""","""CVS Caremark i…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_z983eq""","""t3_z983eq""","""z983eq""","""CVS Caremark i…","[0.030247, 0.025746, … -0.000005]",0.185341,"""What do CVS wo…"
"""comment""","""2024-02-16T08:…","""kqombr1""","""t1_kqombr1""",1708090427,"""beautiflywings…","""Unfortunately,…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_1as8df8""","""t1_kqolkqw""","""1as8df8""","""Unfortunately,…","[0.006657, 0.022748, … 0.002534]",0.185456,"""What do CVS wo…"
"""comment""","""2024-02-16T19:…","""kqrmo8a""","""t1_kqrmo8a""",1708129173,"""bilog-ang-mund…","""Case by case b…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_1asm1wq""","""t3_1asm1wq""","""1asm1wq""","""Case by case b…","[-0.010265, -0.02142, … -0.019939]",0.185828,"""What do CVS wo…"
"""comment""","""2023-03-14T23:…","""jc92mc6""","""t1_jc92mc6""",1678852768,"""EllieSouthwort…","""His username c…","""/r/GeneralMoto…",,,"""GeneralMotors""","""t3_11ris46""","""t1_jc8zjn0""","""11ris46""","""His username c…","[0.03201, 0.009154, … -0.014161]",0.185842,"""What do CVS wo…"


You can then load the results into a scoring tool to compute scores for them.
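As a point of reference, the method names suggest reciprocal rank (RR) and discounted cumulative gain (DCG). The sketch below gives the textbook definitions of those two metrics only. It is not the `paware` implementation: how relevance labels are assigned to results, and what the `mext_rr` variant computes, are not shown in this notebook, so the per-result relevance list is an assumed input.

```python
import math


def reciprocal_rank(relevance: list[float]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg(relevance: list[float]) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over ranks i = 1..n."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


# Assumed relevance labels for the top three results of one query (1 = relevant).
labels = [0, 0, 1]
rr_score = reciprocal_rank(labels)
dcg_score = dcg([1, 0, 1])
```

Under these definitions, a query whose first relevant result sits at rank 1 gets an RR of 1.0, and a query with no relevant results in the retrieved set gets 0.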

In [3]:
## Initialize the scoring tool
score_tool = paware.PawScores(
    RESULTS_FILE_PATH="../temp_query_data/queries_demo01query_demo.parquet"
)

## Compute the scores
score_tool.compute_mext_rr_scores()
score_tool.compute_rr_scores()
score_tool.compute_dcg_scores()


In [4]:
score_tool.get_mext_rr_scores()

{'How do FedEx employees feel about route cuts?': 0.0,
 'Do Kraken employees see themselves staying at the company for the long term?': 0.0,
 'What causes bank employees the most stress at work?': 0.0,
 'When should you apply for a promotion at GM?': 0.4236425339366516,
 'What benefits do Chase employees value most?': 0.0,
 'What does a typical day look like when working for GameStop?': 0.0,
 'How do UPS employees feel about route cuts?': 0.0,
 'How much does a driver make with UPS?': 0.0,
 'How do Whole Foods workers feel about store managers?': 0.0,
 'How long is a typical UPS shift? OR Should I work a double shift at UPS?': 0.0,
 'How often do you get a raise at Lowes?': 0.0,
 'What do Kraken employees find frustrating in their day to day work?': 0,
 'What kind of benefits does GM offer?': 0.7651515151515151,
 'Does your schedule get changed often at Lowes?': 0.0,
 'What are some reasons that bank employees quit their jobs?': 0.0,
 'Is it better to work at fedex express or fedex gro

In [5]:
score_tool.get_rr_scores()

{'How do FedEx employees feel about route cuts?': 0,
 'Do Kraken employees see themselves staying at the company for the long term?': 0,
 'What causes bank employees the most stress at work?': 0,
 'When should you apply for a promotion at GM?': 1.0,
 'What benefits do Chase employees value most?': 0,
 'What does a typical day look like when working for GameStop?': 0,
 'How do UPS employees feel about route cuts?': 0,
 'How much does a driver make with UPS?': 0,
 'How do Whole Foods workers feel about store managers?': 0,
 'How long is a typical UPS shift? OR Should I work a double shift at UPS?': 0,
 'How often do you get a raise at Lowes?': 0,
 'What do Kraken employees find frustrating in their day to day work?': 0,
 'What kind of benefits does GM offer?': 1.0,
 'Does your schedule get changed often at Lowes?': 0,
 'What are some reasons that bank employees quit their jobs?': 0,
 'Is it better to work at fedex express or fedex ground?': 0,
 'What do CVS workers do if they notice thef

In [6]:
score_tool.get_dcg_scores()

{}