The purpose of this notebook is to explore semantic search over browsing history.
- An important consideration is support for multiple languages.

Reference link -> https://data.firefox.com/dashboard/usage-behavior

Worldwide, English (US) remains the most common language, at about 40% of the population, with German (11%) and French (8.1%) in 2nd and 3rd place. Simplified Chinese is the 4th most common language (6.7%), and Spanish (Spain) is the 5th (5%).

In [1]:
import pandas as pd
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import requests
import os
import sys

In [2]:
# Add the project root directory to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

In [3]:
from src.constants import EMBEDDING_MODELS_DICT
from src.feature_extractor import FeatureExtractor

In [4]:
# !cp /tmp/output_file.txt /Users/cgopal/Downloads/places_output_file_v1.txt

#### Let's try reading browsing history

Download browsing history:

1) cp "/Users/<username>/Library/Application Support/Firefox/Profiles/<profilename>/places.sqlite" /tmp/places.sqlite
2) sqlite3 /tmp/places.sqlite
3) Within sqlite3, run the commands below one by one:
```
.schema moz_places
.mode list
.separator "~|"
.output output_file_v2.txt
SELECT url, title, description, preview_image_url, frecency, last_visit_date FROM moz_places
WHERE description NOTNULL
ORDER BY last_visit_date DESC LIMIT 600;
```
4) Copy the file output_file_v2.txt to ~/Downloads/places_output_file_v2.txt
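
The manual sqlite3 steps above can also be scripted. The following is a minimal sketch, not the method used to produce the file in this notebook: `export_places` is a hypothetical helper that runs the same query through Python's built-in `sqlite3` module and writes `~|`-separated lines. Replacing embedded newlines with spaces is an added assumption, so that each record stays on a single line for a line-based reader.

```python
import sqlite3

QUERY = """
    SELECT url, title, description, preview_image_url, frecency, last_visit_date
    FROM moz_places
    WHERE description IS NOT NULL
    ORDER BY last_visit_date DESC
    LIMIT ?
"""


def export_places(db_path, out_path, limit=600, sep="~|"):
    """Write the most recently visited rows that have a description to out_path.

    Hypothetical helper: returns the number of rows written.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(QUERY, (limit,)).fetchall()
    finally:
        con.close()

    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            # NULL becomes an empty field; newlines are flattened (assumption)
            # so that one record never spans several lines.
            fields = [
                "" if v is None else str(v).replace("\r", " ").replace("\n", " ")
                for v in row
            ]
            f.write(sep.join(fields) + "\n")
    return len(rows)
```

Run it against a copy of `places.sqlite` (as in step 1), not the live profile file, since Firefox may hold a lock on the original.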


In [5]:
history = pd.read_csv("/Users/cgopal/Downloads/places_output_file_v2.txt",
                      sep="~\\|", engine="python", header=None, encoding="utf-8", on_bad_lines="skip", index_col=False,
                      names=['url', 'title', 'description', 'preview_image_url', 'frecency', 'last_visit_date'])

print(len(history))
history.head().T

671


,0,1,2,3,4
url,https://huggingface.co/Xenova/LaBSE/tree/main,https://huggingface.co/Xenova/LaBSE/blob/main/...,https://huggingface.co/Xenova/LaBSE/tree/main/...,https://huggingface.co/Xenova/LaBSE,https://huggingface.co/models?other=base_model...
title,Xenova/LaBSE at main,onnx/model_quantized.onnx · Xenova/LaBSE at main,Xenova/LaBSE at main,Xenova/LaBSE · Hugging Face,Models - Hugging Face
description,We’re on a journey to advance and democratize ...,We’re on a journey to advance and democratize ...,We’re on a journey to advance and democratize ...,We’re on a journey to advance and democratize ...,We’re on a journey to advance and democratize ...
preview_image_url,https://cdn-thumbnails.huggingface.co/social-t...,https://cdn-thumbnails.huggingface.co/social-t...,https://cdn-thumbnails.huggingface.co/social-t...,https://cdn-thumbnails.huggingface.co/social-t...,https://cdn-thumbnails.huggingface.co/social-t...
frecency,200.0,100.0,100.0,100.0,100.0
last_visit_date,1733944720780624.0,1733944695474407.0,1733944640389418.0,1733944636441166.0,1733944631579349.0


In [6]:
# history['last_visit_date'].fillna(0)

In [7]:
history['last_visit_date'] = pd.to_datetime(history['last_visit_date'], unit='us')

# fill empty last_visit_date with the default value "1970-01-01"
history['last_visit_date'] = history['last_visit_date'].fillna(pd.to_datetime("1970-01-01"))
history['combined_text'] = history['title'].fillna('') + " " + history['description'].fillna('')
# combined_text always contains the joining space, so strip before testing for emptiness
history = history.loc[history['combined_text'].str.strip() != ''].reset_index(drop=True)

print(len(history))

671
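
The raw `last_visit_date` values are microseconds since the Unix epoch, which is why `unit='us'` is passed to `pd.to_datetime`. As a quick sanity check, the first raw value shown earlier can be converted with the standard library alone. This is only an illustrative sketch; `prtime_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def prtime_to_datetime(microseconds: int) -> datetime:
    """Convert microseconds since the Unix epoch to an aware UTC datetime."""
    # Integer microseconds keep full precision, unlike dividing by 1e6 first.
    return EPOCH + timedelta(microseconds=microseconds)


print(prtime_to_datetime(1733944720780624))  # 2024-12-11 19:18:40.780624+00:00
```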


In [8]:
history

,url,title,description,preview_image_url,frecency,last_visit_date,combined_text
0,https://huggingface.co/Xenova/LaBSE/tree/main,Xenova/LaBSE at main,We’re on a journey to advance and democratize ...,https://cdn-thumbnails.huggingface.co/social-t...,200.0,2024-12-11 19:18:40.780624,Xenova/LaBSE at main We’re on a journey to adv...
1,https://huggingface.co/Xenova/LaBSE/blob/main/...,onnx/model_quantized.onnx · Xenova/LaBSE at main,We’re on a journey to advance and democratize ...,https://cdn-thumbnails.huggingface.co/social-t...,100.0,2024-12-11 19:18:15.474407,onnx/model_quantized.onnx · Xenova/LaBSE at ma...
2,https://huggingface.co/Xenova/LaBSE/tree/main/...,Xenova/LaBSE at main,We’re on a journey to advance and democratize ...,https://cdn-thumbnails.huggingface.co/social-t...,100.0,2024-12-11 19:17:20.389418,Xenova/LaBSE at main We’re on a journey to adv...
3,https://huggingface.co/Xenova/LaBSE,Xenova/LaBSE · Hugging Face,We’re on a journey to advance and democratize ...,https://cdn-thumbnails.huggingface.co/social-t...,100.0,2024-12-11 19:17:16.441166,Xenova/LaBSE · Hugging Face We’re on a journey...
4,https://huggingface.co/models?other=base_model...,Models - Hugging Face,We’re on a journey to advance and democratize ...,https://cdn-thumbnails.huggingface.co/social-t...,100.0,2024-12-11 19:17:11.579349,Models - Hugging Face We’re on a journey to ad...
...,...,...,...,...,...,...,...
666,https://source.coop/repositories/fused/fsq-os-...,Source Cooperative,"Source Cooperative is a neutral, non-profit da...",https://source.coop/repositories/fused/fsq-os-...,258.0,2024-11-27 16:55:08.667714,Source Cooperative Source Cooperative is a neu...
667,https://huggingface.co/docs/transformers/en/mo...,Summary of the tokenizers,We’re on a journey to advance and democratize ...,https://huggingface.co/front/thumbnails/docs/t...,172.0,2024-11-27 14:34:40.388465,Summary of the tokenizers We’re on a journey t...
668,https://pernos.co/,Pernosco,"Fast, fun, omniscient debugging. Record failur...",,172.0,2024-11-27 13:17:17.232790,"Pernosco Fast, fun, omniscient debugging. Reco..."
669,https://mozilla.zoom.us/j/96967081587?pwd=Plcw...,Launch Meeting - Zoom,Zoom is the leader in modern enterprise video ...,,88.0,2024-11-27 13:05:19.391346,Launch Meeting - Zoom Zoom is the leader in mo...


#### Find an appropriate max token length

In [9]:
!python -V

Python 3.12.8


In [10]:
# !python -m pip install tiktoken
# !python -m pip freeze| grep tiktoken

In [11]:
# print(tiktoken.list_encoding_names())

In [12]:
# # import pandas as pd
# import tiktoken
# # import numpy as np

# # Sample data
# # history

# # Initialize the tokenizer
# # Replace "gpt2" with the encoding you want to use
# tokenizer = tiktoken.get_encoding("gpt2")

# # Tokenize each text and count tokens
# history['token_count'] = history['combined_text'].apply(lambda x: len(tokenizer.encode(x)))

# # Compute statistics
# max_length = history['token_count'].max()
# percentile_95 = np.percentile(history['token_count'], 95)
# percentile_99 = np.percentile(history['token_count'], 99)

# print(f"Maximum token count: {max_length}")
# print(f"95th percentile token count: {percentile_95}")
# print(f"99th percentile token count: {percentile_99}")

# # Decide on an appropriate max_length based on these statistics


Maximum token count: 110
95th percentile token count: 78.0
99th percentile token count: 100.29999999999995
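
The statistics above can be computed with any tokenizer. A minimal sketch follows, assuming a hypothetical `count_tokens` callable is supplied by the caller. A whitespace split stands in for a real tokenizer here, so the numbers it prints are not comparable to the ones above. In practice the tokenizer of the embedding model itself is probably the better choice, since its counts may differ from those of the `gpt2` encoding.

```python
import numpy as np


def token_length_stats(texts, count_tokens, percentiles=(95, 99)):
    """Return the max and the requested percentiles of per-text token counts."""
    counts = np.array([count_tokens(t) for t in texts])
    stats = {"max": int(counts.max())}
    for p in percentiles:
        # np.percentile interpolates linearly between neighbouring counts.
        stats[f"p{p}"] = float(np.percentile(counts, p))
    return stats


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
stats = token_length_stats(["a b c", "a", "a b"], lambda t: len(t.split()))
print(stats)
```

A `max_length` slightly above the 99th percentile truncates only a handful of outliers while keeping padding small.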


In [13]:
EMBEDDING_MODELS_DICT

{'Xenova/all-MiniLM-L6-v2': 'https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/onnx/model_quantized.onnx',
 'nomic-ai/nomic-embed-text-v1.5': 'https://huggingface.co/nomic-ai/nomic-embed-text-v1.5/resolve/main/onnx/model_quantized.onnx',
 'Xenova/all-mpnet-base-v2': 'https://huggingface.co/Xenova/all-mpnet-base-v2/resolve/main/onnx/model_quantized.onnx',
 'Xenova/paraphrase-mpnet-base-v2': 'https://huggingface.co/Xenova/paraphrase-mpnet-base-v2/resolve/main/onnx/model_quantized.onnx',
 'Xenova/all-MiniLM-L12-v2': 'https://huggingface.co/Xenova/all-MiniLM-L12-v2/resolve/main/onnx/model_quantized.onnx',
 'nomic-ai/modernbert-embed-base': 'https://huggingface.co/nomic-ai/modernbert-embed-base/resolve/main/onnx/model_quantized.onnx'}

In [14]:
texts = history['combined_text'].values.tolist()
embeddings_dict = {}
embeddings_sizes = {}

for model in EMBEDDING_MODELS_DICT.keys():
    fe = FeatureExtractor(EMBEDDING_MODELS_DICT, model_name=model)
    embeddings_dict[model] = fe.get_embeddings(texts)
    print(model, embeddings_dict[model].shape)
    embeddings_sizes[model] = embeddings_dict[model].shape[1]


selected model is Xenova/all-MiniLM-L6-v2
Xenova/all-MiniLM-L6-v2 (671, 384)
selected model is nomic-ai/nomic-embed-text-v1.5
nomic-ai/nomic-embed-text-v1.5 (671, 768)
selected model is Xenova/all-mpnet-base-v2
Xenova/all-mpnet-base-v2 (671, 768)
selected model is Xenova/paraphrase-mpnet-base-v2
Xenova/paraphrase-mpnet-base-v2 (671, 768)
selected model is Xenova/all-MiniLM-L12-v2
Xenova/all-MiniLM-L12-v2 (671, 384)
selected model is nomic-ai/modernbert-embed-base
nomic-ai/modernbert-embed-base (671, 768)
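
The internals of `FeatureExtractor` are not shown in this notebook. One common way to turn token-level model outputs into one fixed-size vector per text is attention-masked mean pooling followed by L2 normalization. The sketch below assumes that approach purely for illustration; it may not match what `FeatureExtractor.get_embeddings` actually does, and both function names are hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim), attention_mask: (batch, seq_len)
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


# Toy batch: one text, three token positions, the last one is padding.
tokens = np.array([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = l2_normalize(mean_pool(tokens, mask))
print(pooled.shape)  # (1, 2)
```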


In [15]:
embeddings_sizes

{'Xenova/all-MiniLM-L6-v2': 384,
 'nomic-ai/nomic-embed-text-v1.5': 768,
 'Xenova/all-mpnet-base-v2': 768,
 'Xenova/paraphrase-mpnet-base-v2': 768,
 'Xenova/all-MiniLM-L12-v2': 384,
 'nomic-ai/modernbert-embed-base': 768}

In [16]:
embeddings_dict.keys()

dict_keys(['Xenova/all-MiniLM-L6-v2', 'nomic-ai/nomic-embed-text-v1.5', 'Xenova/all-mpnet-base-v2', 'Xenova/paraphrase-mpnet-base-v2', 'Xenova/all-MiniLM-L12-v2', 'nomic-ai/modernbert-embed-base'])

In [17]:
embeddings_dict['nomic-ai/modernbert-embed-base'].shape

(671, 768)

In [18]:
# embeddings_dict['answerdotai/ModernBERT-base'][0]

In [19]:
!mkdir -p ../data

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [20]:
import pickle

with open("../data/embeddings_dict.pkl", "wb") as f:
    pickle.dump(embeddings_dict, f)

with open("../data/embeddings_sizes.pkl", "wb") as f:
    pickle.dump(embeddings_sizes, f)

history.to_csv("../data/history.csv", index=False)

#### Explore sqlite vector DB

In [106]:
import numpy as np
import sqlite3
import sqlite_vec

from typing import List
import struct

In [152]:

def serialize_f32(vector: List[float]) -> bytes:
    """serializes a list of floats into a compact "raw bytes" format"""
    return struct.pack("%sf" % len(vector), *vector)
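
As a quick check of the packing format, the sketch below round-trips a short vector. `deserialize_f32` is a hypothetical inverse added only for illustration. Each float32 occupies 4 bytes, and `struct` uses the machine's native byte order when no prefix is given.

```python
import struct
from typing import List


def serialize_f32(vector: List[float]) -> bytes:
    """Pack a list of floats as consecutive 32-bit floats."""
    return struct.pack("%sf" % len(vector), *vector)


def deserialize_f32(blob: bytes) -> List[float]:
    """Unpack raw float32 bytes back into a list of Python floats."""
    return list(struct.unpack("%sf" % (len(blob) // 4), blob))


blob = serialize_f32([0.5, -1.0, 2.0])
print(len(blob))              # 12
print(deserialize_f32(blob))  # [0.5, -1.0, 2.0]
```

Values that are not exactly representable as float32 (for example 0.1) come back slightly rounded.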

In [153]:
db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

sqlite_version, vec_version = db.execute(
    "select sqlite_version(), vec_version()"
).fetchone()
print(f"sqlite_version={sqlite_version}, vec_version={vec_version}")

sqlite_version=3.47.2, vec_version=v0.1.6


In [154]:
path = "../data/embeddings_dict.pkl"

with open(path, "rb") as f:
    embeddings_dict = pickle.load(f)

In [155]:
embeddings_dict.keys()

dict_keys(['Xenova/all-MiniLM-L6-v2', 'nomic-ai/nomic-embed-text-v1.5', 'Xenova/all-mpnet-base-v2', 'Xenova/paraphrase-mpnet-base-v2', 'Xenova/all-MiniLM-L12-v2', 'nomic-ai/modernbert-embed-base'])

In [156]:
# model_name = "Xenova/paraphrase-multilingual-MiniLM-L12-v2"
# model_name = "Xenova/distiluse-base-multilingual-cased-v1"
# model_name = "Xenova/all-MiniLM-L6-v2"
# model_name = "nomic-ai/nomic-embed-text-v1.5"
model_name = "nomic-ai/modernbert-embed-base"
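# take the dimension from the loaded embeddings so this cell does not depend on embeddings_sizes
EMBEDDING_SIZE = embeddings_dict[model_name].shape[1]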

In [157]:
# pair each embedding with its positional index, which is used as the rowid below
items = [(idx, list(vec)) for idx, vec in enumerate(embeddings_dict[model_name])]

In [158]:
model_name_normalized = model_name.replace("/","_").replace("-","_").replace(".","_")
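
The chained `replace` calls cover the three characters that occur in the model names used here. A slightly more general sketch, assuming any character outside letters, digits, and underscore should become an underscore, could use a regular expression; `to_table_suffix` is a hypothetical helper name.

```python
import re


def to_table_suffix(model_name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", model_name)


print(to_table_suffix("nomic-ai/modernbert-embed-base"))
# nomic_ai_modernbert_embed_base
```

For the model names in `EMBEDDING_MODELS_DICT` this gives the same result as the chained `replace` calls.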

In [159]:
db.execute(f"CREATE VIRTUAL TABLE vec_items_{model_name_normalized} USING vec0(embedding float[{EMBEDDING_SIZE}])")

with db:
    for item in items:
        db.execute(
            f"INSERT INTO vec_items_{model_name_normalized}(rowid, embedding) VALUES (?, ?)",
            [item[0], serialize_f32(item[1])],
        )



In [160]:
history = pd.read_csv("../data/history.csv")

In [171]:
query = "quantization"

fe = FeatureExtractor(EMBEDDING_MODELS_DICT, model_name=model_name)
query_embedding = fe.get_embeddings([query])[0]


selected model is nomic-ai/modernbert-embed-base


In [172]:
query_embedding.shape

(768,)

In [173]:
# rank rows by cosine distance to the query (smaller distance = more similar)
rows = db.execute(
    f"""
      SELECT
        rowid,
        vec_distance_cosine(embedding, ?) AS cosine_distance
      FROM vec_items_{model_name_normalized}
      ORDER BY cosine_distance
      LIMIT 3
    """,
    [serialize_f32(query_embedding)],
).fetchall()

print(rows)

[(231, 0.4183337986469269), (526, 0.45645061135292053), (16, 0.4878166913986206)]
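
Cosine distance is 1 minus cosine similarity, so smaller values mean closer matches. A brute-force NumPy version of the same ranking can serve as a cross-check on small collections. This is a sketch with hypothetical function names, and its results may differ from sqlite-vec in the last decimal places.

```python
import numpy as np


def cosine_distances(matrix, query):
    """Cosine distance (1 - cosine similarity) from each row of matrix to query."""
    matrix = np.asarray(matrix, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - sims


def top_k(matrix, query, k=3):
    """Return (row_index, distance) pairs for the k nearest rows."""
    distances = cosine_distances(matrix, query)
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]


# Toy data: row 0 points the same way as the query, row 2 is 45 degrees off.
result = top_k([[1, 0], [0, 1], [1, 1]], [1, 0], k=2)
print(result)
```

With the notebook's variables, `top_k(embeddings_dict[model_name], query_embedding)` would be the corresponding call.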


In [174]:
pd.set_option('display.max_colwidth', 200)

In [175]:
print(f"query = {query}")
# rowids were assigned from the positional index, so they can be used directly with iloc
row_indices = [row for row, score in rows]
distance = [score for row, score in rows]

selected_rows = history.iloc[row_indices].copy()
selected_rows["distance"] = distance
selected_rows

query = quantization


,url,title,description,preview_image_url,frecency,last_visit_date,combined_text,token_count,distance
231,https://alexgarcia.xyz/sqlite-vec/guides/binary-quant.html,Binary Quantization | sqlite-vec,A vector search SQLite extension that runs anywhere!,,98.0,2024-12-09 21:29:40.745769,Binary Quantization | sqlite-vec A vector search SQLite extension that runs anywhere!,19,0.418334
526,https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html,Quantize ONNX models | onnxruntime,"ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator",https://onnxruntime.ai/images/logos/onnxruntime/ORT_icon_for_light_bg.png,94.0,2024-12-04 16:42:44.791372,"Quantize ONNX models | onnxruntime ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator",29,0.456451
16,https://github.com/YingfanWang/PaCMAP,YingfanWang/PaCMAP: PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure,PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure - YingfanWang/PaCMAP,https://opengraph.githubassets.com/cf111310da0fdafa277a43ffcfa4a4b2d4fb16e1e9132d7963fdba8b3442f507/YingfanWang/PaCMAP,2060.0,2024-12-11 18:23:49.302624,YingfanWang/PaCMAP: PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local ...,53,0.487817
