# Indexing: load -> tokenize -> store

In this notebook we load the dataset, tokenize it, and store the resulting vectors in a FAISS index in a single pass.

Finally, we save the index to a file.

In [1]:
from models_.building.llama_tokenizer import load_tokenizer
from rag.indexing import index
from storage.faiss_ import FaissStorage
from data.pubmed.from_json import FromJsonDataset
from data.pubmed.tokenized import TokenizedDataset

In [2]:
dataset = FromJsonDataset(json_file="../../data/pubmed_500K.json")
dataset[0]

{'title': "[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)].",
 'content': '(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.',
 'contents': "[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only o

In [3]:
tokenizer = load_tokenizer()
tokenizer.pad_token = tokenizer.eos_token
tokenizer

PreTrainedTokenizerFast(name_or_path='meta-llama/Llama-3.2-1B', vocab_size=128000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|end_of_text|>'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|finetune_right_pad_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128005: AddedToken("<|re

In [4]:
MAX_TOKEN_LENGTH = 800 # check train_dataset.ipynb for more details

In [5]:
tokenized_dataset = TokenizedDataset(
    tokenizer=tokenizer,
    dataset=dataset,
    max_length=MAX_TOKEN_LENGTH,
)
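
The internals of `TokenizedDataset` are not shown in this notebook. Assuming it truncates or right-pads every sequence to `max_length` (the tokenizer above reports `padding_side='right'`, and the index built later needs vectors of one fixed size), the idea can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and only illustrates the behaviour; it is not the project's implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return exactly max_length token IDs: cut long inputs, right-pad short ones."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


# ID of <|end_of_text|>, which this notebook reuses as the pad token
PAD_ID = 128001

short = pad_or_truncate([33722, 822, 32056], max_length=5, pad_id=PAD_ID)
long = pad_or_truncate(list(range(10)), max_length=5, pad_id=PAD_ID)

print(short)  # [33722, 822, 32056, 128001, 128001]
print(long)   # [0, 1, 2, 3, 4]
```

With a fixed length, every item maps to a vector of the same dimension, which is what a FAISS index requires.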

In [6]:
tokenized_dataset[0]

[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.


{'input_ids': tensor([[ 33722,    822,  32056,   7978,    389,   6730,    316,    458,   6956,
             14,  23440,     13,    763,  55004,   7978,    922,    279,   3276,
           3527,  27330,   5820,    315,  58719,   7435,   7288,   1481,    285,
          53904,    337,    320,   3170,    596,  12215,  27261,  58719,   7435,
           7288,   7826,    285,  53904,    337,    706,    264,   6156,   3276,
           3527,  27330,   1957,  11911,    389,  47040,     11,    902,    374,
            539,   9057,    555,    459,  73681,    315,    279,  37143,  19625,
             13,    578,   5541,   5849,  29150,   5820,    315,    281,   7270,
            258,    374,  11293,    555,    220,   1135,   3346,   1555,   5369,
            315,  15184,  53904,    337,    304,    279,  11595,    315,    220,
             16,     14,     15,     13,     20,     13,    578,   3276,   3527,
          27330,   1957,    315,  15184,  53904,    337,   1193,  13980,    304,
           116

In [7]:
# check the decoded content
tokenizer.decode(tokenized_dataset[0]["input_ids"][0], skip_special_tokens=True)

[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.


"[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost."

In [8]:
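# each stored vector is one token-ID sequence of length MAX_TOKEN_LENGTH,
# so the index dimension must match that length
storage = FaissStorage(
    dimension=MAX_TOKEN_LENGTH,
)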

In [9]:
def item_transform(item):
    """
    Transforms the item to be stored in the storage system.
    """
    # convert to numpy array
    return item["input_ids"][0].numpy().astype("float32")
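
A natural question about `item_transform` is whether casting integer token IDs to `float32` loses information. A 32-bit float has a 24-bit significand, so every integer up to 2^24 = 16,777,216 is represented exactly, and the token IDs shown above (up to roughly 128,000) are far below that. The small check below uses only the standard library; `through_float32` is a helper introduced here for illustration.

```python
import struct


def through_float32(value):
    """Round-trip a number through a 32-bit float and return the result."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


LIMIT = 2 ** 24  # largest range in which float32 holds every integer exactly

print(through_float32(128005) == 128005)        # True
print(through_float32(LIMIT + 1) == LIMIT + 1)  # False
```

So the cast is lossless for this vocabulary, although it says nothing about whether raw token IDs make meaningful distance vectors.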

In [None]:
storage = index(
    data=tokenized_dataset,
    storage=storage,
    data_transform=item_transform,
)

[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.
Indexed 0 items
[Demonstration of tumor inhibiting properties of a strongly immunostimulating low-molecular weight substance. Comparative studies with ifosfamide on the immuno-labile DS carcinosarcoma. Stimulation of the autoimmune activity for approx. 20 days by BA 1, a N-(2-cyanoethylene)-urea. Novel prophylactic possibilities]. A report is given on the recent discovery of outstanding immunological properties in BA 1 [N-(2-cyanoethylene)-urea] having a (low)
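
The source of `rag.indexing.index` is not part of this notebook. Judging only from the call signature and the `Indexed 0 items` line in the output, it plausibly iterates over the data, applies `data_transform`, and adds each result to the storage while logging progress. The sketch below is an assumption along those lines: `ListStorage`, `index_sketch`, and `log_every` are hypothetical names, and the real function may batch, log, or store differently.

```python
class ListStorage:
    """Hypothetical stand-in for a vector store; it keeps vectors in a list."""

    def __init__(self):
        self.vectors = []

    def add(self, vector):
        self.vectors.append(vector)


def index_sketch(data, storage, data_transform, log_every=2):
    """Transform each item, add it to the storage, and log progress."""
    for i, item in enumerate(data):
        storage.add(data_transform(item))
        if i % log_every == 0:
            print(f"Indexed {i} items")
    return storage


# toy items shaped like the tokenized items above: a batch holding one sequence
toy_data = [{"input_ids": [[1, 2, 3]]}, {"input_ids": [[4, 5, 6]]}, {"input_ids": [[7, 8, 9]]}]
toy_storage = index_sketch(toy_data, ListStorage(), lambda item: item["input_ids"][0])

print(len(toy_storage.vectors))  # 3
```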

In [12]:
import faiss
# save the index to a file
faiss.write_index(storage.index, "../../outputs/store/pubmed_500Kv2.index")

In [1]:
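import faiss  # the kernel was restarted (the cell counter reset), so import faiss again
# assumes `storage` exists in this session and wraps an index that supports reconstruction (e.g. a flat index)
vectors = storage.index.reconstruct_n(0, storage.index.ntotal)
print("Stored vectors:\n", vectors)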