<a href="https://colab.research.google.com/github/imusicmash/stanford_llm_python/blob/main/llamaindex_load_index_store_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading data

I explore how to chunk data for retrieval-augmented generation (RAG) when using a vector database.

Before an LLM can act on your data, you first need to load it. I use the full text of "Alice's Adventures in Wonderland" by Lewis Carroll, stored in the `data/` directory.

The text comes from Project Gutenberg: Gutenberg.org/ebooks/11

In [1]:
!mkdir -p 'data/'

In [3]:
# from site gutenberg.org/ebooks/11
# https://www.gutenberg.org/cache/epub/11/pg11.txt
!wget 'https://www.gutenberg.org/cache/epub/11/pg11.txt' -O 'data/pg11.txt'

--2024-03-06 23:45:19--  https://www.gutenberg.org/cache/epub/11/pg11.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174385 (170K) [text/plain]
Saving to: ‘data/pg11.txt’


2024-03-06 23:45:19 (2.41 MB/s) - ‘data/pg11.txt’ saved [174385/174385]



In [8]:
from openai import OpenAI
from google.colab import userdata

open_ai_key = userdata.get('openai')
# client = OpenAI(api_key=open_ai_key)

In [9]:
import os
os.environ["OPENAI_API_KEY"] = open_ai_key

In [5]:
!pip install llama-index --upgrade

Collecting llama-index
  Downloading llama_index-0.10.16-py3-none-any.whl (5.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.1.5-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.7-py3-none-any.whl (25 kB)
Collecting llama-index-core<0.11.0,>=0.10.16 (from llama-index)
  Downloading llama_index_core-0.10.17-py3-none-any.whl (15.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.3/15.3 MB 58.4 MB/s eta 0:00:00
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.6-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.1.3-py3-none-any.whl (6.6 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading l

In [6]:
from llama_index.core import SimpleDirectoryReader

# Load the data from the data directory
documents = SimpleDirectoryReader("./data").load_data()

# Transformations & Indexing

After the data is loaded, you then need to process and transform your data before putting it into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk.

In [10]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Create a text splitter that splits the text into chunks of 512 tokens with an overlap of 10 tokens.
# SentenceSplitter tries to keep sentences and paragraphs together.
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# Set the text splitter globally in the settings.
Settings.text_splitter = text_splitter
# Alternatively, set only the chunk size and overlap globally
# (note that this example uses an overlap of 50 rather than 10):
# Settings.chunk_size = 512
# Settings.chunk_overlap = 50

# A VectorStoreIndex is by far the most commonly used type of index.
# It takes your Documents and splits them up into Nodes.
# It then creates a vector embedding of the text of every Node, ready to be queried.
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)
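
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of a sliding-window chunker over whitespace-separated words. It is not how `SentenceSplitter` works internally (that splitter measures size in tokens and tries to keep sentences together); `chunk_words` and its parameters are hypothetical names used only for illustration.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words (toy illustration only).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "Alice was beginning to get very tired of sitting by her sister on the bank"
chunks = chunk_words(sample, chunk_size=6, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first starts with the last two words of the previous chunk, which is the role the overlap plays: context that straddles a chunk boundary is not lost entirely.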

In [12]:
# Let's do some introspection on the index!
vector_store_dict = index.vector_store.to_dict()

In [15]:
# As the keys below show, the vector store holds the embedding of each chunk,
# a mapping from each chunk to its source document, and per-chunk metadata.
vector_store_dict.keys()

dict_keys(['embedding_dict', 'text_id_to_ref_doc_id', 'metadata_dict'])

In [16]:
# The vector store also keeps the mapping from each text_id (node) to its ref_doc_id (source document).
# I use the very first text_id below as an example in the following cells.
# Note that vector_store_dict is a dictionary whose values are themselves dictionaries.
vector_store_dict["text_id_to_ref_doc_id"]

{'fa2af1ae-8dde-446b-9a07-beccff4967d4': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '2a97e8a6-daf5-4e2e-b3b5-3751d9da187b': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '7ff4554c-a8c5-45c6-aa06-fb70257266ff': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'e34882b5-fb32-4bb5-87e8-46f0bb2bb778': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'bcc7c419-39e7-4b78-9f95-05c6e5e1ee05': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '87582b65-9560-499b-b05a-c9a678f510f0': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '86756d86-4691-4273-a78a-7d6dd2977317': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'b22144a7-9c1d-49c5-8e07-98a2e8091603': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'b7b39fa5-bbf9-45a6-92bf-8c220ea72d0f': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '86010ce1-e432-4042-a458-b834241c501d': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '6bbb6cb1-0b9b-4ae5-9027-161e2f7db186': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 '58c80688-c679-4492-8df4-283f94ccf56c': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'db124ab8-5873-4c6b-afae-9c
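
Since `text_id_to_ref_doc_id` maps each node ID to the ID of the document it came from, inverting it shows how many chunks each source document produced. Below is a minimal sketch that uses a small hand-written dictionary (`sample_mapping`, with made-up short IDs) in place of the real `vector_store_dict["text_id_to_ref_doc_id"]`; `group_nodes_by_document` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical stand-in for vector_store_dict["text_id_to_ref_doc_id"].
sample_mapping = {
    "node-1": "doc-a",
    "node-2": "doc-a",
    "node-3": "doc-a",
    "node-4": "doc-b",
}


def group_nodes_by_document(mapping):
    """Invert a node_id -> doc_id mapping into doc_id -> [node_id, ...]."""
    grouped = defaultdict(list)
    for node_id, doc_id in mapping.items():
        grouped[doc_id].append(node_id)
    return dict(grouped)


nodes_per_doc = group_nodes_by_document(sample_mapping)
for doc_id, node_ids in nodes_per_doc.items():
    print(doc_id, len(node_ids))
```

In this notebook there is only one source file, and all the entries shown in the output above map to the same document ID.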

In [None]:
vector_store_dict["embedding_dict"]

In [None]:
# The embedding of the very first text_id is stored in the vector store!
vector_store_dict["embedding_dict"]["fa2af1ae-8dde-446b-9a07-beccff4967d4"]


In [23]:
# Metadata is also stored in the vector store.
vector_store_dict["metadata_dict"]["fa2af1ae-8dde-446b-9a07-beccff4967d4"]


{'file_path': '/content/data/pg11.txt',
 'file_name': '/content/data/pg11.txt',
 'file_type': 'text/plain',
 'file_size': 174385,
 'creation_date': '2024-03-06',
 'last_modified_date': '2024-03-01',
 '_node_type': 'TextNode',
 'document_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'doc_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'ref_doc_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e'}

In [24]:
# Metadata for another node; it points to the same source document.
vector_store_dict["metadata_dict"]["940bf7d9-1810-4cfd-93f3-f79b253dce63"]


{'file_path': '/content/data/pg11.txt',
 'file_name': '/content/data/pg11.txt',
 'file_type': 'text/plain',
 'file_size': 174385,
 'creation_date': '2024-03-06',
 'last_modified_date': '2024-03-01',
 '_node_type': 'TextNode',
 'document_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'doc_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e',
 'ref_doc_id': '5c900e1d-62bf-4f4a-b4e1-f8497f84805e'}

In [27]:
# The chunk text itself is stored in the docstore, not in the vector store.
index.storage_context.docstore.get_document("b545e8a8-eab0-4041-bb84-62b6b268ba21").text

'I suppose you’ll be telling me next that you never\r\ntasted an egg!”\r\n\r\n“I _have_ tasted eggs, certainly,” said Alice, who was a very truthful\r\nchild; “but little girls eat eggs quite as much as serpents do, you\r\nknow.”\r\n\r\n“I don’t believe it,” said the Pigeon; “but if they do, why then\r\nthey’re a kind of serpent, that’s all I can say.”\r\n\r\nThis was such a new idea to Alice, that she was quite silent for a\r\nminute or two, which gave the Pigeon the opportunity of adding, “You’re\r\nlooking for eggs, I know _that_ well enough; and what does it matter to\r\nme whether you’re a little girl or a serpent?”\r\n\r\n“It matters a good deal to _me_,” said Alice hastily; “but I’m not\r\nlooking for eggs, as it happens; and if I was, I shouldn’t want\r\n_yours_: I don’t like them raw.”\r\n\r\n“Well, be off, then!” said the Pigeon in a sulky tone, as it settled\r\ndown again into its nest. Alice crouched down among the trees as well\r\nas she could, for her neck kept getting en

# Storing

The API calls that create the embeddings for a VectorStoreIndex can be expensive in terms of time and money, so you will want to store the embeddings to avoid re-indexing the same data over and over.
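
The idea of paying for each embedding only once can be sketched without any external service. The example below is a toy built on assumptions: `fake_embed` stands in for a paid embedding API call, and `EmbeddingCache` is a hypothetical helper that saves vectors to a JSON file keyed by a hash of the chunk text. It is not how LlamaIndex or Chroma store data.

```python
import hashlib
import json
import os
import tempfile


def fake_embed(text):
    """Stand-in for a paid embedding API call (assumed to return a list of floats)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


class EmbeddingCache:
    """Hypothetical on-disk cache: chunk hash -> embedding vector."""

    def __init__(self, path):
        self.path = path
        self.api_calls = 0  # counts how often fake_embed had to be called
        self.vectors = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.vectors = json.load(f)

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.vectors:
            self.vectors[key] = fake_embed(text)
            self.api_calls += 1
        return self.vectors[key]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.vectors, f)


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
chunks = ["Down the Rabbit-Hole", "The Pool of Tears"]

# First run: nothing is cached yet, so every chunk triggers an "API call".
first_run = EmbeddingCache(cache_path)
for chunk in chunks:
    first_run.get(chunk)
first_run.save()

# Second run: the vectors are loaded from disk, so no calls are needed.
second_run = EmbeddingCache(cache_path)
for chunk in chunks:
    second_run.get(chunk)

print(first_run.api_calls, second_run.api_calls)
```

A vector database plays a similar role at scale, and additionally supports similarity search over the stored vectors.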

In [28]:
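# Requires: pip install chromadb llama-index-vector-stores-chroma
import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

# EphemeralClient keeps the collection in memory only; use
# chromadb.PersistentClient(path=...) to keep it across sessions.
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("alice")

# Wrap the Chroma collection so LlamaIndex can use it as its vector store.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)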
# A likely source of confusion: the VectorStoreIndex is not the same thing as the vector store we inspected above!
# VectorStoreIndex takes "storage_context" as an argument. If we go deeper into the VectorStoreIndex base class,
# we can see that it takes the storage_context:
# https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/storage_context.py#L50
# The storage context is a utility container for storing nodes, indices, and vectors.
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [29]:
index.as_query_engine().query("How did Alice come to meet the Queen of Hearts?").response

'Alice came to meet the Queen of Hearts when a procession passed by her in the garden. The Queen noticed Alice and inquired about her identity, to which Alice politely responded.'
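
Conceptually, a vector index answers a question like this by embedding the query and looking for the stored chunks whose embeddings are closest to it. The following is a minimal sketch of that retrieval step using cosine similarity, which is one common choice; the measure a given vector store actually uses may differ. The three-dimensional vectors and node IDs in `toy_embeddings` are invented for illustration, and `top_k` is a hypothetical helper, not a LlamaIndex or Chroma function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, embedding_dict, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vector, vector))
        for node_id, vector in embedding_dict.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings; real ones have many more dimensions.
toy_embeddings = {
    "node-queen": [0.9, 0.1, 0.0],
    "node-rabbit": [0.1, 0.9, 0.0],
    "node-tea": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, toy_embeddings, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

The text of the best-matching chunks is then passed to the LLM as context, which is how the query engine can produce an answer grounded in the book.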