**NOTES:**
1. We can also compare the quality of the results using nomic embeddings instead of Jina AI embeddings.
2. Unit tests are intentionally omitted, since they would be overkill for our audience.

In [None]:
!pip install torch
!pip install transformers
!pip install scikit-learn

In [None]:
# Install DuckDB (torch and transformers are installed in the previous cell)
!pip install duckdb

import sqlite3
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
import duckdb
import numpy as np
import os

# Step 1: Connect to Google Drive and SQLite Database, retrieve data
from google.colab import drive
drive.mount('/content/drive')

# Specify the path to your SQLite database on Google Drive
sqlite_db_path = '/content/drive/MyDrive/db_data.db'
conn = sqlite3.connect(sqlite_db_path)
df = pd.read_sql_query("SELECT URL, Cleaned_Content FROM crawled_data", conn)
conn.close()

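# Step 2: Initialize tokenizer and model from Hugging Face
# jinaai/jina-embeddings-v2-base-en ships its own model code, so it must be
# loaded with trust_remote_code=True. Without it, transformers falls back to a
# plain BertModel and reports many weights as newly initialized.
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)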

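# Function to generate one embedding vector per text
def get_embedding(text):
    """Return a mean-pooled embedding for a single text as a 1-D NumPy array."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into one vector. A single, unbatched text
    # gets no padding tokens, so a plain mean is safe here; batched inputs
    # would need an attention-mask-weighted mean instead. Other model
    # architectures may need a different pooling strategy.
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embedding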

# Step 3: Generate embeddings and prepare data for DuckDB
df['Embedding'] = df['Cleaned_Content'].apply(get_embedding)

# Stack the per-row embeddings into a single 2-D NumPy array (not used by the steps below)
embeddings = np.array(df['Embedding'].tolist())

# Step 4: Store the URLs and Embeddings in DuckDB
# Specify the path to your DuckDB database on Google Drive
duckdb_db_path = '/content/drive/MyDrive/VectorJinaDuckDB.db'

# duckdb.connect() creates the database file if it does not exist yet,
# so the check below only decides which message to print
if os.path.exists(duckdb_db_path):
    print("DuckDB database already exists, connecting to it...")
else:
    print("DuckDB database does not exist, creating a new one...")
duckdb_conn = duckdb.connect(duckdb_db_path)

# Create the table if needed and insert the data
# (note: re-running this cell appends the same rows again)
duckdb_conn.execute("""
CREATE TABLE IF NOT EXISTS url_embeddings (
    url VARCHAR,
    embedding FLOAT[]
)
""")

# Convert each embedding to a plain Python list and pair it with its URL
data_to_insert = list(zip(df['URL'], df['Embedding'].apply(lambda x: x.tolist())))

duckdb_conn.executemany("""
INSERT INTO url_embeddings (url, embedding) VALUES (?, ?)
""", data_to_insert)

print("Data inserted into DuckDB successfully.")

# Optional: Query the table to verify the data
result_df = duckdb_conn.execute("SELECT * FROM url_embeddings LIMIT 5").df()
print(result_df)

# Close the DuckDB connection
duckdb_conn.close()


Mounted at /content/drive


Some weights of BertModel were not initialized from the model checkpoint at jinaai/jina-embeddings-v2-base-en and are newly initialized: ['embeddings.position_embeddings.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.intermedi

DuckDB database does not exist, creating a new one...
Data inserted into DuckDB successfully.
                                                 url  \
0              https://kalicube.com/learning-spaces/   
1      https://kalicube.com/learning-spaces/page/44/   
2         https://kalicube.com/learning-spaces/feed/   
3  https://kalicube.com/learning-spaces/knowledge...   
4  https://kalicube.com/learning-spaces/faq/seo-g...   

                                           embedding  
0  [-0.41959742, 0.8399466, 0.17952608, 0.0146907...  
1  [-0.4035261, 0.53907293, -0.046222396, -0.6105...  
2  [-0.7346743, 0.780547, 1.6596516, -0.21444003,...  
3  [-0.32937163, -0.34818545, 0.42623773, -0.6159...  
4  [-0.8044908, -0.14284484, 1.1074668, -0.948705...  
