# 2 - Embedder

This notebook generates embeddings for each chunk extracted from the website and stores them in a `Chroma` vector store.

Make sure to follow these steps before executing:

- Upload the provided `.env` file to the same folder as the `2-Embedder.ipynb` notebook and take note of its path.
- Create a `data` folder in this same location.
- Upload the file `kworld_chunked_dataset.pkl` into the `data` folder.
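
The checklist above can be verified before running anything expensive. The following is a minimal sketch, not part of the original notebook: the helper name `missing_inputs` is hypothetical, and the base folder is assumed to be the same GDrive folder used in the cells below.

```python
from pathlib import Path


def missing_inputs(base_dir):
    """Return the expected input files that are absent under base_dir."""
    base = Path(base_dir)
    expected = [
        base / ".env",
        base / "data" / "kworld_chunked_dataset.pkl",
    ]
    # Report paths relative to the base folder so the message stays short.
    return [p.relative_to(base).as_posix() for p in expected if not p.is_file()]


# Assumed location: replace with the folder that holds this notebook.
print(missing_inputs("/content/drive/MyDrive/Colab Notebooks/KPMG"))
```

An empty list means both files are where the notebook expects them.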

**NOTE**: Running this notebook on a GPU (preferably an A100 or V100) is strongly advised.

## Set-up

In [1]:
!pip install langchain
!pip install python-dotenv
!pip install sentence-transformers
!pip install chromadb
!pip install wandb

Collecting langchain
  Downloading langchain-0.0.319-py3-none-any.whl (1.9 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.43 (from langchain)
  Downloading langsmith-0.0.47-py3-none-any.whl (41 kB)

In [2]:
import os
import pickle
from typing import List
from dotenv import load_dotenv, find_dotenv
from collections import defaultdict

from langchain.schema.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import numpy as np

import wandb
wandb.init()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Oct 19 20:34:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    46W / 350W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
# NOTE: Update with your own GDrive path
# and make sure the provided .env file has been uploaded in there
env_path = "/content/drive/MyDrive/Colab Notebooks/KPMG/.env"
load_dotenv(env_path)

True

## Load and Transform data

In [5]:
class Dataset:
    """
    Dataset contains the mapping between source (i.e., the website) and
    its corresponding chunks of text, extracted through the Chunker pipeline.
    """
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)

    def __init__(self):
        self.data = defaultdict(list)

    def __len__(self):
        return len(self.data)

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, data):
        self.__dict__ = data

    def add_data(self, source: str, chunks: List[Document]):
        # Reject the call unless both arguments have the expected types
        if not (isinstance(source, str) and isinstance(chunks, list)):
            raise TypeError("Make sure 'source' and 'chunks' are in the right format")
        self.data[source].extend(chunks)

    def get_chunks(self, source: str):
        return self.data.get(source, None)

In [6]:
# NOTE: Update with your own GDrive path
# and make sure the kworld_chunked_dataset.pkl file has been uploaded in there
data_path = "/content/drive/MyDrive/Colab Notebooks/KPMG/data/kworld_chunked_dataset.pkl"

with open(data_path, "rb") as pickle_file:
  dataset = pickle.load(pickle_file)

In [7]:
# Prepare dataset for embedding process

flattened_dataset = [chunk for chunks in dataset.data.values() for chunk in chunks]
len(flattened_dataset)

2727267

## Generate Embeddings

**NOTES:**
- Set the `persist_directory` argument to a valid GDrive path.
- Once the embeddings have been computed and stored in that folder, the subsequent code cells generate a `.zip` file containing all the necessary `Chroma` files. Unzip this file and copy its contents into the `data/<vector_db>` folder.
  - **IMPORTANT:** The `<vector_db>` folder name defaults to `vector_db`, as set in `.env`; change it there if needed.
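
Because the `zip` command further down is given an absolute path, the archive entries keep the full `content/drive/.../vector_db/` prefix (visible in its output). The sketch below shows one way to copy only the `Chroma` files into the target folder without that prefix. It is an illustration under assumptions: the helper name `extract_vector_db` and its `folder_name` parameter are hypothetical, and it uses only the Python standard library.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_vector_db(zip_path, target_dir, folder_name="vector_db"):
    """Copy the files stored under `folder_name` in the archive into target_dir.

    Returns the written paths, relative to target_dir.
    """
    target = Path(target_dir)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts
            if folder_name not in parts:
                continue
            # Keep only the path components after the vector store folder.
            rest = parts[parts.index(folder_name) + 1:]
            if not rest or ".." in rest:
                continue  # skip odd or unsafe entries
            destination = target.joinpath(*rest)
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(archive.read(info))
            written.append("/".join(rest))
    return written


# Usage sketch (paths are assumptions; adjust to your project layout):
# extract_vector_db("vectordb.zip", "data/vector_db")
```

If the folder name was changed in `.env`, pass the same name as `folder_name` and adjust `target_dir` accordingly.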

In [8]:
embedding_model_name = "thenlper/gte-base"
# Load the embedding model on the GPU and encode in batches of 128
embedder = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"device": "cuda", "batch_size": 128}
)

# Embed every chunk and store the vectors in the persist directory
vectordb = Chroma.from_documents(
    flattened_dataset,
    embedding=embedder,
    persist_directory='/content/drive/MyDrive/Colab Notebooks/KPMG/vector_db'
)

# Write the vector store to disk
vectordb.persist()


In [10]:
len(vectordb.get()["documents"])

2727267
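
The embedding cell above hands all 2,727,267 chunks to a single `Chroma.from_documents` call, so an interrupted session may mean starting over. A possible alternative, sketched here and not used in the original run, is to add the chunks in smaller batches. The helper `batched` and the batch size of 10,000 are assumptions chosen for illustration; the commented usage relies on `embedder` and `flattened_dataset` from the cells above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Usage sketch (not executed here; the batch size is an arbitrary assumption):
# vectordb = Chroma(
#     embedding_function=embedder,
#     persist_directory='/content/drive/MyDrive/Colab Notebooks/KPMG/vector_db',
# )
# for batch in batched(flattened_dataset, 10_000):
#     vectordb.add_documents(batch)
# vectordb.persist()
```

Smaller batches trade some throughput for the ability to track progress and resume after a failure.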

In [16]:
!zip -r "/content/drive/MyDrive/Colab Notebooks/KPMG/out/vectordb.zip" "/content/drive/MyDrive/Colab Notebooks/KPMG/vector_db"

  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/ (stored 0%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/chroma.sqlite3 (deflated 45%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/ (stored 0%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/header.bin (deflated 52%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/data_level0.bin (deflated 23%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/length.bin (deflated 63%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/link_lists.bin (deflated 76%)
  adding: content/drive/MyDrive/Colab Notebooks/KPMG/vector_db/b02981fe-7174-43a1-995d-c5fc31293159/index_metadata.pickle (deflated 81%)


In [17]:
from google.colab import files
files.download("/content/drive/MyDrive/Colab Notebooks/KPMG/out/vectordb.zip")
