# Let's build a RAG system

Steps:

1. Load the necessary data into datasets
2. Tokenize the data
3. Store it in a vector database (Faiss)

**We are going to use the Llama 3.2 1B tokenizer**

## Dataset

The dataset used for augmentation is a subset of PubMed with 500k documents. Each document contains:

1. `title`: The title of the document
2. `content`: The content of the document
3. `contents`: The title + content of the document
4. `PMID`: The PubMed ID of the document
5. `id`: The unique identifier for the document
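
The loader in the next cells reads each field by row position (`data["title"][idx]`), which suggests the JSON file is column-oriented: one list per field. The following is a minimal sketch of that assumed layout. The two records, their values, and the `read_row` helper are hypothetical and only illustrate the shape.

```python
import json

# Hypothetical two-document sample in the assumed column-oriented layout:
# each field maps to a list, and position i in every list is document i.
sample = {
    "title": ["Title A", "Title B"],
    "content": ["Body A", "Body B"],
    "contents": ["Title A Body A", "Title B Body B"],
    "PMID": [1, 2],
    "id": ["0", "1"],
}


def read_row(data: dict, idx: int) -> dict:
    """Collect position idx of every column into one record."""
    return {field: values[idx] for field, values in data.items()}


# Round-trip through JSON text, as a file on disk would be
loaded = json.loads(json.dumps(sample))
row = read_row(loaded, 1)
print(row["title"], row["PMID"])
```

With this layout, the number of documents is the length of any one column, which is why `__len__` can use `len(self.data["id"])`.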

In [18]:
import json
from typing import List

from torch.utils.data import Dataset
from transformers import AutoTokenizer
import faiss
import numpy as np

In [2]:
class FromJsonDataset(Dataset):
    def __init__(self, json_file):
        # Parse the JSON file directly instead of also keeping the raw text in memory
        with open(json_file, "r", encoding="utf-8") as f:
            self.data = json.load(f)

    def __len__(self):
        return len(self.data["id"])

    def __getitem__(self, idx: int):
        return {
            "title": self.data["title"][idx],
            "content": self.data["content"][idx],
            "contents": self.data["contents"][idx],
            "PMID": self.data["PMID"][idx],
            "id": self.data["id"][idx],
        }

In [3]:
dataset = FromJsonDataset(json_file="./data/pubmed_500K.json")

In [4]:
dataset[0]["title"], dataset[0]["content"]

("[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)].",
 '(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.')

# Tokenizer and tokenized dataset

We tokenize every document so that it can be stored in a vector database later.

We do not add special tokens, so the tokenizer is used as it is, apart from setting a padding token.

Each document is truncated or padded to a `max_length` of 800 tokens, which is enough for most of the documents in the dataset.
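
To make the fixed-length step concrete, here is a minimal sketch in plain Python of what truncating and right-padding a token-ID sequence to `max_length` amounts to. The helper `pad_or_truncate` and the pad ID `0` are illustrative assumptions, not part of the tokenizer's API. The real tokenizer also returns an attention mask, and the padding side depends on its configuration.

```python
from typing import List


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int) -> List[int]:
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long = pad_or_truncate(list(range(10)), max_length=5, pad_id=0)
print(short)  # [5, 6, 7, 0, 0]
print(long)   # [0, 1, 2, 3, 4]
```

Every output has the same length, which is what allows the sequences to be stacked into a single matrix with one row per document.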

In [21]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token  # Set pad token to eos token for Llama 3.2

MAX_TOKEN_LENGTH = 800  # Maximum token length for our dataset

In [22]:
class TokenizedDataset(Dataset):
    def __init__(self, tokenizer, dataset: Dataset, max_length: int = 800):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx: int):
        item = self.dataset[idx]
        tokenized_item = self.tokenizer(
            item["contents"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
            add_special_tokens=False,
        )
        tokenized_item["id"] = item["id"]
        tokenized_item["PMID"] = item["PMID"]
        return tokenized_item

In [23]:
tokenized_dataset = TokenizedDataset(tokenizer, dataset, max_length=MAX_TOKEN_LENGTH)

In [24]:
tokenized_dataset[0]

{'input_ids': tensor([[ 33722,    822,  32056,   7978,    389,   6730,    316,    458,   6956,
             14,  23440,     13,    763,  55004,   7978,    922,    279,   3276,
           3527,  27330,   5820,    315,  58719,   7435,   7288,   1481,    285,
          53904,    337,    320,   3170,    596,  12215,  27261,  58719,   7435,
           7288,   7826,    285,  53904,    337,    706,    264,   6156,   3276,
           3527,  27330,   1957,  11911,    389,  47040,     11,    902,    374,
            539,   9057,    555,    459,  73681,    315,    279,  37143,  19625,
             13,    578,   5541,   5849,  29150,   5820,    315,    281,   7270,
            258,    374,  11293,    555,    220,   1135,   3346,   1555,   5369,
            315,  15184,  53904,    337,    304,    279,  11595,    315,    220,
             16,     14,     15,     13,     20,     13,    578,   3276,   3527,
          27330,   1957,    315,  15184,  53904,    337,   1193,  13980,    304,
           116

# Faiss storage

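We use Faiss as the vector database. To do so, we need a storage object that handles indexing and searching the data. Note that the vectors stored here are the padded token-ID sequences cast to `float32`, not learned embeddings, so the L2 distance compares token IDs position by position.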
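
As a rough picture of what a flat L2 index computes, the sketch below performs an exhaustive nearest-neighbour search in plain Python. It is not Faiss code: `squared_l2` and `brute_force_search` are hypothetical helpers, and the vectors are made up. Like `IndexFlatL2`, it ranks candidates by squared Euclidean distance.

```python
from typing import List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: List[List[float]], query: List[float], k: int
) -> Tuple[List[float], List[int]]:
    """Compare the query with every stored vector and keep the k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
distances, indices = brute_force_search(vectors, [1.0, 0.5], k=2)
print(indices)    # [2, 0]
print(distances)  # [0.25, 1.25]
```

Because the distance is squared and token IDs can be large, values in the hundreds of billions are plausible for 800-dimensional token-ID vectors.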

In [28]:
class FaissStorage:
    """
    FaissStorage is a thin wrapper around a Faiss index.
    It uses the Faiss library to store and query vectors efficiently.
    """

    def __init__(
            self,
            dimension: int,
            index = None,
    ):
        """
        Initializes the FaissStorage with a specified vector dimension.

        Args:
            dimension: The dimensionality of the vectors to be stored.
            index: An optional existing Faiss index. If None, a new IndexFlatL2 is created.
        """
        self.dimension = dimension
        if index is None:
            index = faiss.IndexFlatL2(dimension)
        self.index = index

    def store(self, key: str, data):
        """
        Stores a batch of vectors in the index.

        Args:
            key: An identifier for the batch (e.g., a string). It is not used by
                this simple implementation.
            data: A batch of vectors (a sequence of rows), each with length equal to 'dimension'.
        """
        if len(data[0]) != self.dimension:
            raise ValueError(f"Data must have {self.dimension} dimensions.")

        # Convert the batch to a 2D float32 array of shape (n, dimension), as Faiss expects
        data = np.array(data, dtype='float32')

        self.index.add(data)  # Add the vectors to the index

    def query(self, key: str, k: int) -> (List, List):
        """
        Searches the index for the k nearest vectors.

        Args:
            key: The query identifier. It is currently ignored (see the comments below).
            k: The number of nearest neighbours to return.

        Returns:
            A tuple containing the distances and indices of the nearest vectors.
        """
        # In this simple implementation, we don't actually use the key for retrieval.
        # In a real-world scenario, you would need to maintain a mapping from keys to indices.

        # An empty index has nothing to return; keep the (distances, indices) shape
        if self.index.ntotal == 0:
            return [], []

        # For demonstration purposes, we search for the neighbours of an all-zero vector
        zero_query = np.zeros((1, self.dimension), dtype='float32')
        distances, indices = self.index.search(zero_query, k)

        return distances[0].tolist(), indices[0].tolist()

    def export(self, file_path: str):
        """
        Exports the Faiss index to a file.

        Args:
            file_path: The path where the index will be saved.
        """
        faiss.write_index(self.index, file_path)

    def load(self, file_path: str):
        """
        Loads a Faiss index from a file.

        Args:
            file_path: The path from where the index will be loaded.
        """
        self.index = faiss.read_index(file_path)

def index(
        data: Dataset, # The dataset to index, it should be tokenized
        storage: FaissStorage,
        data_transform: callable = None,
):
    """
    Indexes the dataset into the storage system in batches.

    Args:
        data (Dataset): The dataset to index, it should be tokenized.
        storage (FaissStorage): The storage system to use for indexing.
        data_transform (callable): Optional function applied to each item before it is stored.
    """
    buffer = []

    for i in range(len(data)):
        item = data[i]
        key = item["id"]
        if data_transform:
            item = data_transform(item)
        buffer.append(item)
        if i % 10000 == 0:
            print(f"Indexed {i} items")
            storage.store(key, buffer)
            buffer = []

    # Store the items left in the buffer after the last full batch
    if buffer:
        storage.store(key, buffer)

    return storage

def item_transform(item):
    """
    Transforms the item to be stored in the storage system.
    """
    # convert to numpy array
    return item["input_ids"][0].numpy().astype("float32")
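
The comment in `query` notes that a real system needs a mapping between keys and index positions. A flat Faiss index numbers vectors sequentially in insertion order, so one simple way to do that is to record the document ids in the same order as the vectors are added. The sketch below shows that bookkeeping in plain Python. The `KeyedRows` class and the `doc-*` keys are hypothetical and not part of Faiss or of the code above.

```python
from typing import Dict, List


class KeyedRows:
    """Tracks which document key sits at which row of an append-only index."""

    def __init__(self) -> None:
        self.keys: List[str] = []       # row position -> key
        self.rows: Dict[str, int] = {}  # key -> row position

    def add(self, batch_keys: List[str]) -> None:
        # Rows are assigned sequentially, mirroring insertion order
        for key in batch_keys:
            self.rows[key] = len(self.keys)
            self.keys.append(key)

    def keys_for(self, row_indices: List[int]) -> List[str]:
        """Translate row positions returned by a search back into keys."""
        return [self.keys[i] for i in row_indices]


mapping = KeyedRows()
mapping.add(["doc-a", "doc-b"])
mapping.add(["doc-c"])
print(mapping.keys_for([2, 0]))  # ['doc-c', 'doc-a']
print(mapping.rows["doc-b"])     # 1
```

Calling `add` with the keys of each batch at the same point where the batch is stored would keep the two structures aligned.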

In [29]:
storage = FaissStorage(
    dimension=MAX_TOKEN_LENGTH,
)

In [30]:
storage = index(
    data=tokenized_dataset,
    storage=storage,
    data_transform=item_transform,
)

Indexed 0 items
Indexed 10000 items
Indexed 20000 items


KeyboardInterrupt: 

In [31]:
query = "Biochemical studies on camomile components/"

# Let's query the index

In [34]:
distances, indexes = storage.query(key=query, k=5)

In [35]:
for i in range(len(indexes)):
    print(f"Index: {indexes[i]}, Distance: {distances[i]}")
    print(tokenizer.decode(tokenized_dataset[indexes[i]]["input_ids"][0], skip_special_tokens=True))
    print("\n")

Index: 1410, Distance: 236034326528.0
Synthesis of angiotensin II antagonists containing N- and O-methylated and other amino acid residues. [1-N-Methylisoasparagine,8-isoleucine]- (I), [1-sarcosine,4-N-methyltyrosine,8-isoleucine]- (II), [1-sarcosine,5-N-methylisoleucine,8-isoleucine]- (III), [1-sarcosine,8-N-methylisoleucine]- (IV), [1-sarcosine8k-N-methylisoleucine,8-N-methylisoleucine]- (V), [1-sarcosine,8-O-methylthreonine]- (VI), [1-sarcosine,8-methionine]- (VII), and [1-sarcosine,8-serine]angiotensin II (VIII), synthesized by Merrifield's solid-phase procedure, possess respectively 0.8, 0.3, 0.5, 1.0, 0.0, 0.5, 3.7, and 0.7% pressor activity of angiotensin II (vagotomized, ganglion-blocked rats). They caused an initial rise in blood pressure (30 min of infusion, 250 ng/kg/min in vagotomized, ganglion-blocked rats) of 16.57, 9.80, 22.80, 32.00, 7.00, 15.06, 32.50, and 11.42 mmHg and showed secretory activity (isolated cat adrenal medulla) of 1.0, 0.1, 0.01, 0.1, less than 0.01, 0.

In [39]:
STORAGE_FILE_PATH = "./outputs/pubmed_500K.index"

In [None]:
# storage.export(STORAGE_FILE_PATH) # uncomment to save the index

# Dataset: pre-tokenized dataset

We provide a Faiss index that was already built from the tokenized dataset and is ready to load.

In [40]:
storage.load(STORAGE_FILE_PATH)

In [41]:
distances, indexes = storage.query(key=query, k=5)
for i in range(len(indexes)):
    print(f"Index: {indexes[i]}, Distance: {distances[i]}")
    print(tokenizer.decode(tokenized_dataset[indexes[i]]["input_ids"][0], skip_special_tokens=True))
    print("\n")

Index: 281918, Distance: 171539365888.0
3,3'-Diiodothyronine production, a major pathway of peripheral iodothyronine metabolism in man. 3,3'-Diiodothyronine (3,3'-T(2)) has been detected in human serum and in thyroglobulin. However, no quantitative assessment of its clearance rate (CR), production rate (PR), or of the importance of extrathyroidal sources of 3,3'-T(2) relative to direct thyroidal secretion is yet available. This study examines these parameters in seven euthyroid subjects, and in eight athyreotic subjects (H) eumetabolic due to thyroxine therapy (HT(4)) (n = 5) or triiodothyronine replacement (HT(3)) (n = 3). A highly specific radioimmunoassay for the measurement of 3,3'-T(2) in whole serum was developed. Serum 3,3'-T(2) concentrations were (mean +/- SD) 6.0+/-1.0 ng/100 ml in 13 normal subjects, 9.0+/-4.6 ng/100 ml in 25 hyperthyroid patients, and 2.7+/-1.1 ng/100 ml in 17 hypothyroid patients. The values in each of the latter two groups were significantly different fro