# Background

This is simple jupyter notebook, that part of a large project aimed at processing the Common Lisp Chat Logs to retrieve lots that are related to user queries. Previous work involved processing the multi-year data and storing a parquet files. Please note that this notebook is for illustration processes as the full dataset is quite large. Consequently, a smaller version of the dataset is used.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path 

In [3]:
# Define the top directories
CASENAME='2024_11_01'
#CASENAME='2024_12_31'
CL_DATA_TOPDIR=Path('~/data/datasets/machine-learning/cl-chat-log-data').expanduser()
ANALYSIS_TOPDIR= CL_DATA_TOPDIR / 'analyses' / CASENAME.lower()

DATADIR_TOPDIR=Path('~/data/datasets/machine-learning/cl-chat-log-data/logs').expanduser()
DATADIR=Path('~/data/datasets/machine-learning/cl-chat-log-data/raw_logs').expanduser()

CL_DATA_FILENAME = CL_DATA_TOPDIR / 'raw_logs' / 'commonlisp' / f'commonlisp_log_data_full_v{CASENAME}.gzip'
processed_data_filename = CL_DATA_TOPDIR / 'processed' / 'commonlisp' / f'commonlisp_processed_log_data_full_v{CASENAME}.gzip'
processed_data_filename_small = CL_DATA_TOPDIR / 'processed' / 'commonlisp' / f'commonlisp_processed_log_data_full_v{CASENAME}_small.gzip'

## Read the base dataframe

In [None]:
#base_df = pd.read_parquet(processed_data_filename)    

In [40]:
from datasets import load_dataset

ds = load_dataset("parquet", data_files=str(processed_data_filename_small))
#dataset = load_dataset("parquet", data_files=str(processed_data_filename))

In [41]:
ds

DatasetDict({
    train: Dataset({
        features: ['date', 'id', 'username', 'message', 'room', 'message_length', 'num_sentences', 'day_of_year', 'year', 'month', 'month_name', 'hour', 'weekday', 'week', 'year_week', 'year_month_name'],
        num_rows: 1000
    })
})

In [37]:
#remove_columns = ['date', 'id', 'username', 'message', 'room', 'message_length', 'num_sentences', 'day_of_year', 'year', 'month', 'month_name', 'h]

In [None]:
# Keep only the message column
#columns_to_keep = ["date", "message"] # Replace with your desired columns
#new_dataset = dataset.map(lambda example: {col: example[col] for col in columns_to_keep})

In [22]:
df = dataset['train'].to_pandas()

In [24]:
df['message_length'].describe()

count    1.355569e+06
mean     7.037532e+01
std      6.029763e+01
min      1.000000e+00
25%      2.900000e+01
50%      5.500000e+01
75%      9.400000e+01
max      9.110000e+02
Name: message_length, dtype: float64

In [42]:
ds

DatasetDict({
    train: Dataset({
        features: ['date', 'id', 'username', 'message', 'room', 'message_length', 'num_sentences', 'day_of_year', 'year', 'month', 'month_name', 'hour', 'weekday', 'week', 'year_week', 'year_month_name'],
        num_rows: 1000
    })
})

In [30]:
processed_data_filename_small = CL_DATA_TOPDIR / 'processed' / 'commonlisp' / f'commonlisp_processed_log_data_full_v{CASENAME}_small.gzip'

In [31]:
#df2.to_parquet(processed_data_filename_small, index=False, compression='gzip')

### Create embeddings

In [39]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])

In [45]:
def create_embeddings(batch):
    embeddings = model.encode(batch["message"], convert_to_numpy=True)
    batch["embeddings"] = embeddings.tolist()
    return batch

In [46]:
ds = ds.map(create_embeddings, batched=True)

Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 7684.69 examples/s]


In [47]:
ds

DatasetDict({
    train: Dataset({
        features: ['date', 'id', 'username', 'message', 'room', 'message_length', 'num_sentences', 'day_of_year', 'year', 'month', 'month_name', 'hour', 'weekday', 'week', 'year_week', 'year_month_name', 'embeddings'],
        num_rows: 1000
    })
})

### Push dataset (with embedding columns) to HF

In [50]:
ds.push_to_hub("jeosol/chat-log-embeddings-small")

Uploading the dataset shards:   0%|                                                                                                                                                                                                                                                                | 0/1 [00:00<?, ? shards/s]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.53ba/s][A
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.76s/ shards]


CommitInfo(commit_url='https://huggingface.co/datasets/jeosol/chat-log-embeddings-small/commit/827a18587ecb36970fc4ecd3c5c93c3a3b11437f', commit_message='Upload dataset', commit_description='', oid='827a18587ecb36970fc4ecd3c5c93c3a3b11437f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/jeosol/chat-log-embeddings-small', endpoint='https://huggingface.co', repo_type='dataset', repo_id='jeosol/chat-log-embeddings-small'), pr_revision=None, pr_num=None)

### Vector Search Without Index

In [68]:
import duckdb
from typing import List

#'hf://datasets/{dataset_name}/**/*.parquet'
def similarity_search_without_duckdb_index(
    query: str,
    k: int = 5,
    dataset_name: str = "jeosol/chat-log-embeddings-small", 
    embedding_column: str = "embeddings",
):
    # Use same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    sql = f"""
        SELECT 
            *,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}], 
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """
    return duckdb.sql(sql).to_df()[['message']]

def print_messages(df):
    for item, row in df.items():
        print(f"{item} {row}")
        
#vs_nwi = similarity_search_without_duckdb_index("What is the future of AI?")
#vs_nwi

In [59]:
print_messages(vs_nwi)

0 I would imagine.
1 Zhivago: yup....there's an impedance mismatch between c/python thinking and functional thinking
2 lil could use a sequence interface.
3 What does "atomicity" mean?
4 see you next year, maybe


In [60]:
print_messages(similarity_search_without_duckdb_index("What is a sequence interface"))

0 lil could use a sequence interface.
1 How do you add a new sequence type? :)
2 Perhaps using SPLIT-SEQUENCE:SPLIT-SEQUENCE.
3 ykm: Also, compare an interface with an implementation
4 axion: with split-sequence you just split the string by newline and etc.


In [69]:
print_messages(similarity_search_without_duckdb_index("What is package?"))

message 0    Zhivago: *nods* I see what you mean. Stuff lik...
1                   see also asdf-package-system style
2    Xach: what's the recommended way to remove old...
3    I think this probably works best if FOOBAR is ...
4    I am thinking of writing a macro that changes,...
Name: message, dtype: object
