# Notebook for Running the ADB Project, Phase 2

**This notebook is divided into two main parts, each focusing on different database sizes:**
- **Part 1: Database Size 10K**
  - Initiate a new database and insert vectors into it.
  - Retrieve vectors from the database.
  - Ensure that the insertion time for this database does not exceed 5 minutes.
  - Allow flexible RAM usage during insertion but ensure it stays within Google Colab limits.
  - Evaluate retrieval time and accuracy.
  - Ensure that the peak RAM usage for retrieval does not exceed 5 MB.

- **Part 2: Database Sizes 100K and More**
  - Generate database vectors using a random seed (refer to the provided code).
  - You have to generate the database and its index before the submission.
  - Implement a VecDB class that loads the pre-generated database, including its index, and retrieves vectors from it.
  - Evaluate retrieval time and accuracy for different database sizes.
  - Peak RAM usage for retrieval must not exceed:
    - 100K: 10 MB
    - 1M: 25 MB
    - 5M: 75 MB
    - 10M: 150 MB
    - 15M: 225 MB
    - 20M: 300 MB
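
The retrieval limits listed above can be collected into a lookup table for local self-checks before submission. This is only a minimal sketch: `RAM_LIMITS_MB` and `within_limit` are hypothetical names that do not appear in the provided evaluation code, and the 10K entry is taken from the Part 1 requirement.

```python
# Hypothetical helper for local self-checks; not part of the evaluation code.
# Peak retrieval RAM limits in MB, keyed by database size label.
RAM_LIMITS_MB = {
    "10K": 5,
    "100K": 10,
    "1M": 25,
    "5M": 75,
    "10M": 150,
    "15M": 225,
    "20M": 300,
}


def within_limit(size_label: str, measured_mb: float) -> bool:
    """Return True if the measured peak RAM is within the limit for this size."""
    return measured_mb <= RAM_LIMITS_MB[size_label]


print(within_limit("1M", 10.12))   # a 1M run that measured 10.12 MB
print(within_limit("10K", 14.93))  # a 10K run that measured 14.93 MB
```

A dictionary keeps the limits in one place, so a change to the requirements needs only one edit.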

**This notebook is structured into two parts:**

- **Part 1 - Modifiable Cells:**
This section contains cells that teams are allowed to modify. Only variable values may be changed, and they are to be submitted during the project's final phase. They are:
  - GitHub repository link (including PAT token).
  - Database (DB) variables, providing the path to the directory or file for loading existing databases and indexes (refer to provided code to see how).

- **Part 2 - Non-Modifiable Cells:** This section must not be modified by any team. It includes essential setup and evaluation code. Ensure that the notebook runs smoothly by providing the required inputs in Part 1.




  

## Part 1 - Modifiable Cells

Each team will provide its own GitHub repository link.
It should include a PAT token so that I can download the repository.

In [1]:
!git clone https://mtheggi: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@github.com/mtheggi/sematic_search_DB.git

Cloning into 'sematic_search_DB'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 56 (delta 25), reused 37 (delta 11), pack-reused 0
Receiving objects: 100% (56/56), 30.41 KiB | 5.07 MiB/s, done.
Resolving deltas: 100% (25/25), done.


Teams are required to provide unique paths for the generated databases of sizes 100K, 1M, 5M, 10M, 15M, and 20M. Follow these steps to submit the databases:

- Once you have the database and index ready, zip the necessary folders/files.
- Upload the zip file to Google Drive.
- Ensure the file is shareable with "anyone with the link."
- Obtain the zip file link (e.g., https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view?usp=drive_link).
- Extract the zip file ID (e.g., 1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah).
- Place the ID in the designated variable (to be submitted during the project final phase).
- The code will automatically download the zip file and unzip it inside this directory.
- Provide the local PATH for each database to be passed to the initializer for automatic loading of the database and index (to be submitted during the project final phase).
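
The ID-extraction step above can be sketched with a regular expression. This assumes the share link has the `/file/d/<ID>/` shape shown in the example; `extract_drive_file_id` is a hypothetical helper, not something the notebook provides, and other Google Drive link formats are not handled.

```python
import re
from typing import Optional


def extract_drive_file_id(link: str) -> Optional[str]:
    """Return the file ID from a '/file/d/<ID>/...' style link, or None."""
    # Assumption: the ID consists of letters, digits, '_' and '-' only.
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", link)
    return match.group(1) if match else None


link = "https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view?usp=drive_link"
print(extract_drive_file_id(link))
```

The returned string is the value to place in the matching `GDRIVE_ID_DB_*` variable.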

In [2]:
TEAM_NUMBER = 1
GDRIVE_ID_DB_100K = "1GFncvb9ghfQyjTxEnt2of2qC6l0xaXKr" # my 100k*
GDRIVE_ID_DB_1M = "1TXaOgR14U5wAywHO0lqMYkTP0R0XDpdr" # my 1m *
GDRIVE_ID_DB_5M = "1MLY5q2He6ddNpV4zTiBjct5erK3FZCl-" # my 5m *
GDRIVE_ID_DB_10M = "1yy4WJUN_Fb6bUq5uBFEFIodCaaGKWqhL"# my 10m *
GDRIVE_ID_DB_15M = "1U3uTVURkZBICS29bLuvOFs6fZ0U9SVU_" #my 15m *
GDRIVE_ID_DB_20M = "1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah"
PATH_DB_100K = "test_100k_d/output.csv"
PATH_DB_1M = "test_1m_d/output.csv"
PATH_DB_5M = "test_5m_d/output.csv"
PATH_DB_10M = "test_10m_d/output.csv"
PATH_DB_15M = "test_15m_d/output.csv"
PATH_DB_20M = "test_15m_d/output.csv"

I will change these two variables when running the notebook during the discussion.

In [3]:
QUERY_SEED_NUMBER = 10
DB_SEED_NUMBER = 20

This means that the project submission will include the following:
- TEAM_NUMBER
- GitHub clone link
- GDRIVE_ID_DB_100K
- GDRIVE_ID_DB_1M
- GDRIVE_ID_DB_5M
- GDRIVE_ID_DB_10M
- GDRIVE_ID_DB_15M
- GDRIVE_ID_DB_20M
- PATH_DB_100K
- PATH_DB_1M
- PATH_DB_5M
- PATH_DB_10M
- PATH_DB_15M
- PATH_DB_20M
- The project document that describes what you did

In [4]:
!pwd

/content


## Part 2: No edits from here
#### You can't edit this part, and neither can I.
#### Note: I may edit it if there is a major bug.

In [5]:
%cd sematic_search_DB

/content/sematic_search_DB


This cell installs any additional requirements that your code needs.


In [6]:
!pip install memory-profiler >> log.txt
!pip install -r requirements.txt

Collecting csvsort==1.6.1 (from -r requirements.txt (line 2))
  Downloading csvsort-1.6.1.tar.gz (3.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: csvsort
  Building wheel for csvsort (setup.py) ... done
  Created wheel for csvsort: filename=csvsort-1.6.1-py3-none-any.whl size=4091 sha256=c79edfbe63aef7f1e017244ecfa716b1d6a084f3d25b3238ea146525eac01038
  Stored in directory: /root/.cache/pip/wheels/b9/ef/be/26ac1963436fdf9a7e477ffa63182d8010e7391b3001ee1aa7
Successfully built csvsort
Installing collected packages: csvsort
Successfully installed csvsort-1.6.1


This cell downloads the zip files and unzips them here.

In [7]:
!gdown $GDRIVE_ID_DB_100K -O saved_db_100k.zip
!gdown $GDRIVE_ID_DB_1M -O saved_db_1m.zip
!gdown $GDRIVE_ID_DB_5M -O saved_db_5m.zip
!gdown $GDRIVE_ID_DB_10M -O saved_db_10m.zip
!gdown $GDRIVE_ID_DB_15M -O saved_db_15m.zip
!gdown $GDRIVE_ID_DB_20M -O saved_db_20m.zip
!unzip saved_db_100k.zip
!unzip saved_db_1m.zip
!unzip saved_db_5m.zip
!unzip saved_db_10m.zip
!unzip saved_db_15m.zip
!unzip saved_db_20m.zip

Downloading...
From: https://drive.google.com/uc?id=1GFncvb9ghfQyjTxEnt2of2qC6l0xaXKr
To: /content/sematic_search_DB/saved_db_100k.zip
100% 32.3M/32.3M [00:00<00:00, 66.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1TXaOgR14U5wAywHO0lqMYkTP0R0XDpdr
To: /content/sematic_search_DB/saved_db_1m.zip
100% 323M/323M [00:05<00:00, 64.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1MLY5q2He6ddNpV4zTiBjct5erK3FZCl-
To: /content/sematic_search_DB/saved_db_5m.zip
100% 1.62G/1.62G [00:14<00:00, 108MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yy4WJUN_Fb6bUq5uBFEFIodCaaGKWqhL
To: /content/sematic_search_DB/saved_db_10m.zip
100% 3.24G/3.24G [00:32<00:00, 99.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1U3uTVURkZBICS29bLuvOFs6fZ0U9SVU_
To: /content/sematic_search_DB/saved_db_15m.zip
100% 4.87G/4.87G [00:35<00:00, 138MB/s]
Downloading...
From: https://drive.google.com/uc?id=1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah
To: /content/sematic_search_DB/saved_db_20m.

These are the functions for running and reporting

In [8]:
import numpy as np
from vec_db import VecDB
import time
from dataclasses import dataclass
from typing import List
from memory_profiler import memory_usage
import gc

@dataclass
class Result:
    run_time: float
    top_k: int
    db_ids: List[int]
    actual_ids: List[int]

results = []
to_print_arr = []

def run_queries(db, query, top_k, actual_ids, num_runs):
    global results
    results = []
    for _ in range(num_runs):
        tic = time.time()
        db_ids = db.retrive(query, top_k)
        toc = time.time()
        run_time = toc - tic
        results.append(Result(run_time, top_k, db_ids, actual_ids))
    return results

def memory_usage_run_queries(args):
    global results
    # This part is added to calculate the RAM usage
    mem_before = max(memory_usage())
    mem = memory_usage(proc=(run_queries, args, {}), interval = 1e-3)
    return results, max(mem) - mem_before

def evaluate_result(results: List[Result]):
    # Scores are never positive, so 0 is the best possible score.
    scores = []
    run_time = []
    for res in results:
        run_time.append(res.run_time)
        # Case where the number of unique retrieved IDs is not equal to top_k: the score will be the lowest
        if len(set(res.db_ids)) != res.top_k or len(res.db_ids) != res.top_k:
            scores.append( -1 * len(res.actual_ids) * res.top_k)
            continue
        score = 0
        for id in res.db_ids:
            try:
                ind = res.actual_ids.index(id)
                if ind > res.top_k * 3:
                    score -= ind
            except ValueError:  # raised by list.index when the id is not in actual_ids
                score -= len(res.actual_ids)
        scores.append(score)

    return sum(scores) / len(scores), sum(run_time) / len(run_time)

def get_actual_ids_first_k(actual_sorted_ids, k):
    return [id for id in actual_sorted_ids if id < k]
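
To make the scoring rule in `evaluate_result` easier to follow, here is a standalone sketch that scores a single result the same way. `score_single` is a hypothetical name used only for illustration, and the toy ground truth below is an assumption, not data from the project.

```python
from typing import List


def score_single(db_ids: List[int], actual_ids: List[int], top_k: int) -> int:
    """Score one retrieval result following the rule used in evaluate_result."""
    # Wrong number of results or duplicate IDs: lowest possible score.
    if len(db_ids) != top_k or len(set(db_ids)) != top_k:
        return -1 * len(actual_ids) * top_k
    score = 0
    for returned_id in db_ids:
        if returned_id in actual_ids:
            rank = actual_ids.index(returned_id)
            # Only ranks beyond 3 * top_k are penalized, by the rank itself.
            if rank > top_k * 3:
                score -= rank
        else:
            # An ID that is not in the ground truth costs the full list length.
            score -= len(actual_ids)
    return score


actual = list(range(100))  # toy ground truth: id 0 is the best match, 99 the worst
print(score_single([0, 1, 2, 3, 4], actual, 5))   # all results in the true top 5
print(score_single([0, 1, 2, 3, 15], actual, 5))  # rank 15 is not above 3 * 5
print(score_single([0, 1, 2, 3, 40], actual, 5))  # rank 40 is penalized
print(score_single([0, 0, 1, 2, 3], actual, 5))   # duplicate ID
```

With this toy data the four calls print 0, 0, -40, and -500. Any returned ID whose true rank is at most `3 * top_k` costs nothing, so a score of 0 does not require the exact top-k set.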

This cell generates the 10K database and the query using the seed numbers, which will be changed on submission day.

In [9]:
rng = np.random.default_rng(DB_SEED_NUMBER)
vectors = rng.random((10**4, 70), dtype=np.float32)

rng = np.random.default_rng(QUERY_SEED_NUMBER)
query = rng.random((1, 70), dtype=np.float32)

actual_sorted_ids_10k = np.argsort(vectors.dot(query.T).T / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)), axis= 1).squeeze().tolist()[::-1]
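
The ground-truth line above ranks every database row by cosine similarity to the query, most similar first. The same idea is sketched below without NumPy on a tiny hand-made dataset. `cosine_similarity` and `rank_by_cosine` are hypothetical names, the vectors are made up for illustration, and all vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(vectors: List[Sequence[float]], query: Sequence[float]) -> List[int]:
    """Return row indices ordered from most to least similar to the query."""
    sims = [cosine_similarity(v, query) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: sims[i], reverse=True)


vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.2]
print(rank_by_cosine(vectors, query))
```

Here the query points mostly along the first axis, so row 0 ranks first and row 2 last. Tie ordering may differ from the `np.argsort(...)[::-1]` version used in the notebook.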

Open a new DB, add 10K records, then retrieve and evaluate. Then add another 90K (100K total), then retrieve and evaluate.

In [10]:
db = VecDB()

records_dict = [{"id": i, "embed": list(row)} for i, row in enumerate(vectors)]
db.insert_records(records_dict)
res = run_queries(db, query, 5, actual_sorted_ids_10k, 1) # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_sorted_ids_10k, 5)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"10K\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


records size =  10000
10000
0
node array size ,  0
records inserted and serilized successfully
10K	score	0.0	time	1.14	RAM	14.93 MB


Remove existing variables to free some RAM.

In [11]:
del vectors
del query
del actual_sorted_ids_10k
del records_dict
del db
gc.collect()

0

This code generates the 20M database. The seed (50) will not be changed. Create the same DB and prepare its files, indexes, and every related file. <br>
Note that at submission I will not run the record insertion. <br>
The query itself will be changed on submission day, but not the DB.

In [12]:
rng = np.random.default_rng(50)
vectors = rng.random((10**7*2, 70), dtype=np.float32)

rng = np.random.default_rng(QUERY_SEED_NUMBER)
query = rng.random((1, 70), dtype=np.float32)

actual_sorted_ids_20m = np.argsort(vectors.dot(query.T).T / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)), axis= 1).squeeze().tolist()[::-1]
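
The evaluation cell that follows reuses this single 20M ranking for every smaller database by passing it through `get_actual_ids_first_k`. That suggests each smaller database is expected to hold the first N rows of the 20M database with the same IDs. The filtering step is sketched below on a toy ranking: `first_k_ids` is a hypothetical stand-in for `get_actual_ids_first_k`, and the ranking values are made up.

```python
from typing import List


def first_k_ids(ranked_ids: List[int], k: int) -> List[int]:
    """Keep only IDs below k, preserving their relative order in the ranking."""
    return [i for i in ranked_ids if i < k]


# Toy ranking over a 10-row database, best match first.
ranked_full = [7, 2, 9, 0, 5, 3, 8, 1, 6, 4]

# Ground-truth ranking for a database that holds only rows 0..4.
print(first_k_ids(ranked_full, 5))
```

This prints `[2, 0, 3, 1, 4]`. Filtering preserves relative order, so the result is the correct ranking for the prefix database and no second similarity computation is needed.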

In [13]:
print("Team Number", TEAM_NUMBER)
print("\n".join(to_print_arr))

db = VecDB(file_path = PATH_DB_100K, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**5)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"100K\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path = PATH_DB_1M, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"1M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path = PATH_DB_5M, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6*5)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"5M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path = PATH_DB_10M, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6*10)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"10M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path = PATH_DB_15M, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6*15)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"15M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path = PATH_DB_20M, new_db = False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6*20)
res = run_queries(db, query, 5, actual_ids, 1)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries((db, query, 5, actual_ids, 3)) # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"20M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


Team Number 1
10K	score	0.0	time	1.14	RAM	14.93 MB
100K	score	0.0	time	10.66	RAM	14.13 MB
1M	score	0.0	time	49.63	RAM	10.12 MB
5M	score	-40.0	time	89.58	RAM	1.14 MB
10M	score	-46.0	time	115.05	RAM	0.00 MB
15M	score	-123.0	time	103.72	RAM	-73.60 MB
20M	score	-163.0	time	106.17	RAM	-107.78 MB


In [14]:
print("Team Number", TEAM_NUMBER)
print("\n".join(to_print_arr))

Team Number 1
10K	score	0.0	time	1.14	RAM	14.93 MB
100K	score	0.0	time	10.66	RAM	14.13 MB
1M	score	0.0	time	49.63	RAM	10.12 MB
5M	score	-40.0	time	89.58	RAM	1.14 MB
10M	score	-46.0	time	115.05	RAM	0.00 MB
15M	score	-123.0	time	103.72	RAM	-73.60 MB
20M	score	-163.0	time	106.17	RAM	-107.78 MB


