# ADB Phase 2 Project Evaluation Notebook


**Purpose**: This notebook evaluates the performance of a semantic search project by analyzing databases of various sizes.

### Evaluation Focus:
- **Database Sizes**:
  - 1 Million Records
  - 10 Million Records
  - 15 Million Records
  - 20 Million Records

For each database size, this notebook will:
- Generate random vectors for the database.
- Use the `VecDB` class (implemented by students) to answer retrieval queries.
- Evaluate and report retrieval time, accuracy, and RAM usage.

### Project Constraints:
Refer to the project document for details on RAM, Disk, Time, and Score constraints.

### Notebook Structure:
1. **Part 1 - Modifiable Cells**:
   - Includes cells that teams are allowed to modify, specifically for these variables only:
     - GitHub repository link (including PAT token).
     - Google Drive IDs for index files.
     - Paths for loading existing indexes.

2. **Part 2 - Non-Modifiable Cells**:
   - Contains essential setup and evaluation code that must not be modified.
   - Students should only modify inputs in Part 1 to ensure smooth execution of the notebook.

## Part 1 - Modifiable Cells

Each team must provide a unique GitHub repository link that includes a PAT token. This link will allow the notebook to download the necessary code for evaluation.

In [1]:
!git clone https://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@github.com/mou-code/vec_db.git

Cloning into 'vec_db'...
remote: Enumerating objects: 389, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 389 (delta 123), reused 153 (delta 95), pack-reused 192 (from 1)
Receiving objects: 100% (389/389), 4.34 MiB | 14.53 MiB/s, done.
Resolving deltas: 100% (223/223), done.


# Database Path Instructions


Teams need to specify paths for each database (1M, 10M, 15M, 20M records) as follows:

1. Zip each database directory/file after generation.
2. Upload the zip file to Google Drive.
3. Share the file with "Anyone with the link."
4. Extract the file ID from the link (e.g., for `https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view`, the ID is `1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah`).
5. Assign each ID to the appropriate variable in Part 1.
6. Provide the local PATH for each database to be passed to the initializer for automatic loading of the database and index (to be submitted during the project final phase).

**Note**: The code will download and unzip these files automatically. Once extracted, the local path for each database should be specified to enable the notebook to load databases and indexes.
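
Step 4 above (extracting the file ID from a share link) can be checked with a short script. This is a minimal sketch, not part of the evaluation code: `extract_drive_id` is a hypothetical helper, and it assumes the link has the `.../file/d/<ID>/...` shape shown in the example.

```python
import re
from typing import Optional


def extract_drive_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the file ID from a Google Drive share link.

    Assumes the link contains a '/file/d/<ID>' segment, as in the example
    above. Returns None when no such segment is found.
    """
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None


link = "https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view"
print(extract_drive_id(link))  # → 1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah
```

The printed value is what would be assigned to the matching `GDRIVE_ID_DB_*` variable.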

In [2]:
TEAM_NUMBER = 14
GDRIVE_ID_DB_1M = "1bRyKa-Wv8rfu62opI28iFMnMkn02QYdA"
GDRIVE_ID_DB_10M = "1gPQpPzw0LKFH4ysnbtOpA-59PZu9Sjq0"
GDRIVE_ID_DB_15M = "1vO8vCeAE6wcK0Lyig9JdI1H97Kxl2GDN"
GDRIVE_ID_DB_20M = "1-d9bvnG4Z3ddtX0omURwleeRreewk9I9"
PATH_DB_1M = "saved_db_1m.csv"
PATH_DB_10M = "saved_db_10m.csv"
PATH_DB_15M = "saved_db_15m.csv"
PATH_DB_20M = "saved_db_20m.csv"

**Query Seed Number**:
This number will be adjusted during discussions by the instructor.


In [3]:
QUERY_SEED_NUMBER = 10

**Final Submission Checklist**:
Ensure the following items are included in your final submission:
- `TEAM_NUMBER`
- GitHub clone link (with PAT token)
- Google Drive IDs for each database:
  - `GDRIVE_ID_DB_1M`, `GDRIVE_ID_DB_10M`, `GDRIVE_ID_DB_15M`, `GDRIVE_ID_DB_20M`
- Paths for each database:
  - `PATH_DB_1M`, `PATH_DB_10M`, `PATH_DB_15M`, `PATH_DB_20M`
- Project document detailing the work and findings.

## Part 2: Do Not Modify Beyond This Point
### Note:
This section contains setup and evaluation code that should not be edited by students. Only the instructor may modify this section in case of a major bug.


In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
%cd vec_db

/kaggle/working/vec_db


This cell installs any additional requirements that your code needs.


In [6]:
!conda install -y gdown  &> log.txt

In [7]:
!pip install memory-profiler &> log.txt
!pip install -r requirements.txt



This cell downloads the zip files and unzips them here.

In [8]:
!gdown $GDRIVE_ID_DB_1M -O saved_db_1m.zip
!gdown $GDRIVE_ID_DB_10M -O saved_db_10m.zip
!gdown $GDRIVE_ID_DB_15M -O saved_db_15m.zip
!gdown $GDRIVE_ID_DB_20M -O saved_db_20m.zip
!unzip saved_db_1m.zip
!unzip saved_db_10m.zip
!unzip saved_db_15m.zip
!unzip saved_db_20m.zip

Downloading...
From: https://drive.google.com/uc?id=1bRyKa-Wv8rfu62opI28iFMnMkn02QYdA
To: /kaggle/working/vec_db/saved_db_1m.zip
100%|██████████████████████████████████████| 7.32M/7.32M [00:00<00:00, 17.6MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1gPQpPzw0LKFH4ysnbtOpA-59PZu9Sjq0
From (redirected): https://drive.google.com/uc?id=1gPQpPzw0LKFH4ysnbtOpA-59PZu9Sjq0&confirm=t&uuid=beaea3cb-91c7-4acb-8950-d14461175311
To: /kaggle/working/vec_db/saved_db_10m.zip
100%|███████████████████████████████████████| 60.3M/60.3M [00:00<00:00, 176MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1vO8vCeAE6wcK0Lyig9JdI1H97Kxl2GDN
From (redirected): https://drive.google.com/uc?id=1vO8vCeAE6wcK0Lyig9JdI1H97Kxl2GDN&confirm=t&uuid=9a9be21b-3e42-484e-a359-43b700c9ff27
To: /kaggle/working/vec_db/saved_db_15m.zip
100%|█████████████████████████████████████████| 105M/105M [00:00<00:00, 207MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1-d9bvnG4Z3

These are the functions for running the queries and reporting the results.

In [9]:
import numpy as np
DB_SEED_NUMBER = 42
ELEMENT_SIZE = np.dtype(np.float32).itemsize
DIMENSION = 70

In [10]:
import numpy as np
import os
from vec_db import VecDB
import time
from dataclasses import dataclass
from typing import List
from memory_profiler import memory_usage
import gc

@dataclass
class Result:
    run_time: float
    top_k: int
    db_ids: List[int]
    actual_ids: List[int]

def run_queries(db, queries, top_k, actual_ids, num_runs):
    """
    Run queries on the database and record results for each query.

    Parameters:
    - db: Database instance to run queries on.
    - queries: List of query vectors.
    - top_k: Number of top results to retrieve.
    - actual_ids: List of actual results to evaluate accuracy.
    - num_runs: Number of query executions to perform for testing.

    Returns:
    - List of Result
    """
    global results
    results = []
    for i in range(num_runs):
        tic = time.time()
        db_ids = db.retrieve(queries[i], top_k)
        toc = time.time()
        run_time = toc - tic
        results.append(Result(run_time, top_k, db_ids, actual_ids[i]))
    return results

def memory_usage_run_queries(args):
    """
    Run queries and measure memory usage during the execution.

    Parameters:
    - args: Arguments to be passed to the run_queries function.

    Returns:
    - results: The results of the run_queries.
    - memory_diff: Peak memory usage while running the queries minus the baseline measured just before the run.
    """
    global results
    mem_before = max(memory_usage())
    mem = memory_usage(proc=(run_queries, args, {}), interval = 1e-3)
    return results, max(mem) - mem_before

def evaluate_result(results: List[Result]):
    """
    Evaluate the results based on accuracy and runtime.
    Scores are negative. So getting 0 is the best score.

    Parameters:
    - results: A list of Result objects

    Returns:
    - avg_score: The average score across all queries.
    - avg_runtime: The average runtime for all queries.
    """
    scores = []
    run_time = []
    for res in results:
        run_time.append(res.run_time)
        # Case where the number of retrieved IDs differs from top_k (or IDs are duplicated): assign the lowest score
        if len(set(res.db_ids)) != res.top_k or len(res.db_ids) != res.top_k:
            scores.append( -1 * len(res.actual_ids) * res.top_k)
            continue
        score = 0
        for id in res.db_ids:
            try:
                ind = res.actual_ids.index(id)
                if ind > res.top_k * 3:
                    score -= ind
            except ValueError:  # list.index raises ValueError when the ID is not in actual_ids
                score -= len(res.actual_ids)
        scores.append(score)

    return sum(scores) / len(scores), sum(run_time) / len(run_time)

def get_actual_ids_first_k(actual_sorted_ids, k):
    """
    Retrieve the IDs from the sorted list of actual IDs.
    The actual IDs hold the top results for the 20M database; for smaller databases, IDs greater than or equal to the DB size must be removed.

    Parameters:
    - actual_sorted_ids: A list of lists containing the sorted actual IDs for each query.
    - k: The DB size.

    Returns:
    - List of lists containing the actual IDs for each query for this DB.
    """
    return [[id for id in actual_sorted_ids_one_q if id < k] for actual_sorted_ids_one_q in actual_sorted_ids]
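
To make the scoring rule in `evaluate_result` concrete, the sketch below reproduces the per-query logic on toy data. `score_one_query` is a hypothetical standalone helper written for illustration; it mirrors the rules above (no penalty while the true rank is at most `3 * top_k`, a penalty equal to the rank otherwise, and a penalty of `len(actual_ids)` for IDs missing from the ground truth).

```python
from typing import List


def score_one_query(db_ids: List[int], actual_ids: List[int], top_k: int) -> int:
    """Hypothetical helper mirroring the per-query scoring in evaluate_result."""
    # Wrong number of results or duplicated IDs: lowest possible score.
    if len(db_ids) != top_k or len(set(db_ids)) != top_k:
        return -1 * len(actual_ids) * top_k
    score = 0
    for returned_id in db_ids:
        try:
            rank = actual_ids.index(returned_id)
            # Ranks up to and including 3 * top_k are not penalized.
            if rank > top_k * 3:
                score -= rank
        except ValueError:
            # The ID is not in the ground-truth list at all.
            score -= len(actual_ids)
    return score


actual = list(range(100))  # toy ground truth: ID i has rank i

print(score_one_query([0, 1, 2, 3, 4], actual, 5))    # → 0
print(score_one_query([0, 1, 2, 3, 40], actual, 5))   # → -40
print(score_one_query([0, 1, 2, 3, 500], actual, 5))  # → -100
print(score_one_query([0, 1, 2, 3], actual, 5))       # → -500
```

The last case shows why returning fewer than `top_k` IDs is far worse than returning a poor match: the whole query receives `-len(actual_ids) * top_k`.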

This code generates all the database files.

In [11]:
def _write_vectors_to_file(vectors: np.ndarray, db_path) -> None:
    mmap_vectors = np.memmap(db_path, dtype=np.float32, mode='w+', shape=vectors.shape)
    mmap_vectors[:] = vectors[:]
    mmap_vectors.flush()

def generate_database(size: int) -> np.ndarray:
    rng = np.random.default_rng(DB_SEED_NUMBER)
    vectors = rng.random((size, DIMENSION), dtype=np.float32)
    return vectors

vectors = generate_database(20*10**6)

db_filename_size_20M = 'saved_db_20M.dat'
if not os.path.exists(db_filename_size_20M): _write_vectors_to_file(vectors, db_filename_size_20M)
db_filename_size_15M = 'saved_db_15M.dat'
if not os.path.exists(db_filename_size_15M): _write_vectors_to_file(vectors[:15*10**6], db_filename_size_15M)
db_filename_size_10M = 'saved_db_10M.dat'
if not os.path.exists(db_filename_size_10M): _write_vectors_to_file(vectors[:10*10**6], db_filename_size_10M)
db_filename_size_1M = 'saved_db_1M.dat'
if not os.path.exists(db_filename_size_1M): _write_vectors_to_file(vectors[:1*10**6], db_filename_size_1M)
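
The `.dat` files written above are raw `float32` buffers with no header, so a reader has to derive the number of records from the file size. The following is a minimal sketch of reading one record back through `np.memmap` without loading the whole file; the toy size (1,000 rows) and the temporary path are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np

DIMENSION = 70
ELEMENT_SIZE = np.dtype(np.float32).itemsize  # 4 bytes per float32

# Write a small toy database in the same raw layout as the notebook's files.
rng = np.random.default_rng(42)
toy_vectors = rng.random((1000, DIMENSION), dtype=np.float32)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_db.dat")  # illustrative path
writer = np.memmap(toy_path, dtype=np.float32, mode="w+", shape=toy_vectors.shape)
writer[:] = toy_vectors
writer.flush()
del writer

# Recover the record count from the file size, then map the file read-only.
num_records = os.path.getsize(toy_path) // (DIMENSION * ELEMENT_SIZE)
reader = np.memmap(toy_path, dtype=np.float32, mode="r", shape=(num_records, DIMENSION))

# Only the pages backing row 123 need to be read from disk.
row = np.array(reader[123])
print(num_records, row.shape)  # → 1000 (70,)
```

Mapping the file instead of loading it keeps resident memory small, which matters because RAM usage is one of the reported metrics.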


Code to generate the queries used for evaluation.

Note: `QUERY_SEED_NUMBER` will be changed on submission day.

In [12]:
needed_top_k = 10000
rng = np.random.default_rng(QUERY_SEED_NUMBER)
query1 = rng.random((1, 70), dtype=np.float32)
query2 = rng.random((1, 70), dtype=np.float32)
query3 = rng.random((1, 70), dtype=np.float32)
query_dummy = rng.random((1, 70), dtype=np.float32)

actual_sorted_ids_20m_q1 = np.argsort(vectors.dot(query1.T).T / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query1)), axis= 1).squeeze().tolist()[::-1][:needed_top_k]
gc.collect()
actual_sorted_ids_20m_q2 = np.argsort(vectors.dot(query2.T).T / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query2)), axis= 1).squeeze().tolist()[::-1][:needed_top_k]
gc.collect()
actual_sorted_ids_20m_q3 = np.argsort(vectors.dot(query3.T).T / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query3)), axis= 1).squeeze().tolist()[::-1][:needed_top_k]
gc.collect()

queries = [query1, query2, query3]
actual_sorted_ids_20m = [actual_sorted_ids_20m_q1, actual_sorted_ids_20m_q2, actual_sorted_ids_20m_q3]
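
The ground truth above ranks every vector by cosine similarity to the query and keeps the best `needed_top_k` IDs. Because the smaller databases are prefixes of the 20M one, the same ranking can be reused for them by dropping IDs that fall outside the smaller database. Here is a toy-sized sketch of both steps; the sizes (1,000 vectors, top 100, prefix of 500) are assumptions chosen to keep it fast.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.random((1000, 70), dtype=np.float32)
toy_query = rng.random((1, 70), dtype=np.float32)

# Cosine similarity between the query and every row of the database.
scores = (toy_vectors @ toy_query.T).ravel() / (
    np.linalg.norm(toy_vectors, axis=1) * np.linalg.norm(toy_query)
)

# Sort ascending, reverse to get best-first, and keep the top 100 IDs.
top_ids = np.argsort(scores)[::-1][:100].tolist()

# Reuse the ranking for a database holding only the first 500 vectors.
top_ids_first_500 = [i for i in top_ids if i < 500]

print(len(top_ids), len(top_ids_first_500) <= len(top_ids))  # → 100 True
```

Note that the filtered list is usually shorter than the original, which is one reason to keep far more ground-truth IDs (`needed_top_k`) than the `top_k` requested from the database.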

In [13]:
# The actual vectors are no longer needed, so delete them
del vectors
gc.collect()

0

This code runs the class you have implemented. The `VecDB` class should take the database path and the index path that you provided.<br>
Note: at submission, I will not run record insertion.<br>
The query itself will be changed on submission day, but the DB will not.
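
For orientation, the evaluation loop only relies on two things: a constructor accepting `database_file_path`, `index_file_path`, and `new_db`, and a `retrieve(query, top_k)` method returning a list of IDs. The sketch below is a hypothetical brute-force stand-in named `ToyVecDB`, not the students' `VecDB`; it ignores the index file entirely and would not meet the time constraints on large databases. Since the loop passes queries of shape `(1, 70)` in the evaluated runs and of shape `(70,)` in the warm-up run, the sketch flattens the query first.

```python
import os
import tempfile

import numpy as np

DIMENSION = 70


class ToyVecDB:
    """Hypothetical brute-force stand-in with the interface the notebook calls."""

    def __init__(self, database_file_path, index_file_path=None, new_db=True):
        self.db_path = database_file_path
        self.index_path = index_file_path  # unused in this sketch
        # new_db is accepted for interface compatibility only.

    def retrieve(self, query, top_k=5):
        itemsize = np.dtype(np.float32).itemsize
        num_records = os.path.getsize(self.db_path) // (DIMENSION * itemsize)
        vectors = np.memmap(self.db_path, dtype=np.float32, mode="r",
                            shape=(num_records, DIMENSION))
        q = np.asarray(query, dtype=np.float32).reshape(-1)  # (1, 70) or (70,)
        sims = (vectors @ q) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
        # Best-first IDs as plain Python ints.
        return np.argsort(sims)[::-1][:top_k].tolist()


# Toy usage on a 1,000-row file written in the same raw float32 layout.
rng = np.random.default_rng(1)
toy = rng.random((1000, DIMENSION), dtype=np.float32)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_db.dat")  # illustrative path
toy.tofile(toy_path)

toy_db = ToyVecDB(database_file_path=toy_path, index_file_path=None, new_db=False)
toy_query = rng.random((1, DIMENSION), dtype=np.float32)
ids = toy_db.retrieve(toy_query, 5)
print(len(ids))  # → 5
```

A real implementation would replace the full scan with an index lookup so that only a small part of the database file is read per query.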

In [14]:
results = []
to_print_arr = []

In [15]:
print("Team Number", TEAM_NUMBER)
database_info = {
    "1M": {
        "database_file_path": db_filename_size_1M,
        "index_file_path": PATH_DB_1M,
        "size": 10**6
    },
    "10M": {
        "database_file_path": db_filename_size_10M,
        "index_file_path": PATH_DB_10M,
        "size": 10 * 10**6
    },
    "15M": {
        "database_file_path": db_filename_size_15M,
        "index_file_path": PATH_DB_15M,
        "size": 15 * 10**6
    },
    "20M": {
        "database_file_path": db_filename_size_20M,
        "index_file_path": PATH_DB_20M,
        "size": 20 * 10**6
    }
}

for db_name, info in database_info.items():
    db = VecDB(database_file_path = info["database_file_path"], index_file_path = info["index_file_path"], new_db = False)
    actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, info["size"])
    # Run one dummy query first so that everything is loaded and warm (warm-up)
    res = run_queries(db, query_dummy, 5, actual_ids, 1)
    # actual runs to evaluate
    res, mem = memory_usage_run_queries((db, queries, 5, actual_ids, 3))
    eval = evaluate_result(res)
    to_print = f"{db_name}\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
    print(to_print)
    to_print_arr.append(to_print)
    del db
    del actual_ids
    del res
    del mem
    del eval
    gc.collect()

Team Number 14
1M	score	-15.0	time	0.71	RAM	6.44 MB
10M	score	-24.0	time	3.80	RAM	8.31 MB
15M	score	-30.333333333333332	time	5.09	RAM	7.59 MB
20M	score	-38.666666666666664	time	5.80	RAM	8.42 MB


In [16]:
print("Team Number", TEAM_NUMBER)
print("\n".join(to_print_arr))

Team Number 14
1M	score	-15.0	time	0.71	RAM	6.44 MB
10M	score	-24.0	time	3.80	RAM	8.31 MB
15M	score	-30.333333333333332	time	5.09	RAM	7.59 MB
20M	score	-38.666666666666664	time	5.80	RAM	8.42 MB
