
# Loading data

Before an LLM can act on your data, you first need to load it. This notebook uses the full text of "Alice's Adventures in Wonderland" by Lewis Carroll, stored in the `data/` directory.

The text comes from Project Gutenberg: Gutenberg.org/ebooks/11

In [None]:
from llama_index.core import SimpleDirectoryReader

# Load the data from the data directory
documents = SimpleDirectoryReader("./data").load_data()

# Transformations & Indexing

After the data is loaded, it must be processed and transformed before it goes into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk.
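
To make the chunking step concrete, here is a minimal sketch in plain Python. It is not how `SentenceSplitter` works internally (that class counts tokens and tries to respect sentence boundaries). The function `chunk_words` and its word-based sizes are assumptions made purely for illustration.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words. This is a hypothetical
    helper for illustration, not part of LlamaIndex.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "Alice was beginning to get very tired of sitting by her sister on the bank"
chunks = chunk_words(sample, chunk_size=6, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two words of the previous one, so a sentence cut at a chunk boundary still appears with some context on both sides.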

In [None]:
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Create a text splitter that splits the text into chunks of 512 tokens with an overlap of 10 tokens.
# SentenceSplitter tries to keep sentences and paragraphs together.
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

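# Register the text splitter globally so other components pick it up by default.
Settings.text_splitter = text_splitter
# Alternative: set the chunking parameters globally instead of a splitter object
# (use the same values as above to get the same chunking):
# Settings.chunk_size = 512
# Settings.chunk_overlap = 10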

# VectorStoreIndex is by far the most commonly used type of index.
# It takes your Documents and splits them into Nodes (chunks).
# It then creates a vector embedding of the text of every Node, ready to be queried.
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)
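
Once every node has an embedding, a query is typically answered by embedding the question and ranking the nodes by a similarity measure such as cosine similarity. The sketch below shows that idea in plain Python. The three-dimensional vectors, the node names, and the helper `top_k` are made up for illustration, and this is not a description of what the index does internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], embedding_dict: dict[str, list[float]], k: int = 2):
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [
        (node_id, cosine_similarity(query, vector))
        for node_id, vector in embedding_dict.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
toy_embeddings = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.0, 1.0, 0.0],
    "node-c": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], toy_embeddings, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```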

In [None]:
# Let's do some introspection on the index!
vector_store_dict = index.vector_store.to_dict()

In [None]:
# The vector store holds the node embeddings, a mapping from node IDs to source document IDs, and per-node metadata.
vector_store_dict.keys()

dict_keys(['embedding_dict', 'text_id_to_ref_doc_id', 'metadata_dict'])

In [None]:
# The mapping from text_id (node ID) to ref_doc_id (source document ID) is also stored in the vector store.
# The first text_id in the output is used as the example in the cells below.
vector_store_dict["text_id_to_ref_doc_id"]

{'ec85792b-0bb4-4b18-8fc9-f47820cdf4fb': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '632176a2-655a-46cc-b8c2-7b268cb0253b': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '98be59d7-2179-493c-bcaf-29ef0627b5fb': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '48911502-daac-4404-a843-61e1819b340e': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'a4d80479-3197-4863-a5a9-b95b9ae1a7fe': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'b37460db-46ed-406f-a3ef-8d0cd417f21a': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'f2ad7c67-3164-469b-9daf-70d0675da657': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '065b7c5f-3fcf-4ac8-884e-130816c5ba4d': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'b38f72b2-cd8b-4b1a-8676-7c987aaa776f': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '35e8842a-5078-4464-ac77-40c006920ab7': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'a4d417a9-14f6-4f99-a4e8-d276de29edf8': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'c0422674-eccf-4ab0-b9fc-603ec9a8d24d': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 '8b881f97-673c-42bc-ae21-dc
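
Every node in the output above points back to the same `ref_doc_id` because only one file was loaded. As a small sketch of how this mapping can be used, the snippet below inverts it to list the nodes that belong to each document. The short IDs are hypothetical stand-ins for the UUIDs above.

```python
from collections import defaultdict

# Hypothetical, shortened IDs standing in for the UUIDs shown above.
text_id_to_ref_doc_id = {
    "node-1": "doc-alice",
    "node-2": "doc-alice",
    "node-3": "doc-other",
}

# Invert the mapping: source document ID -> list of node IDs.
nodes_by_document = defaultdict(list)
for text_id, ref_doc_id in text_id_to_ref_doc_id.items():
    nodes_by_document[ref_doc_id].append(text_id)

for ref_doc_id, text_ids in nodes_by_document.items():
    print(ref_doc_id, len(text_ids))
```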

In [None]:
# The embedding of the very first text_id is stored in the vector store!
vector_store_dict["embedding_dict"]["ec85792b-0bb4-4b18-8fc9-f47820cdf4fb"]


[0.001972062746062875,
 0.007032161578536034,
 -0.030411550775170326,
 -0.022489327937364578,
 0.004609972704201937,
 0.032178085297346115,
 -0.003862593090161681,
 -0.03269445523619652,
 -0.025845741853117943,
 -0.003210334572941065,
 0.010143977589905262,
 0.023508481681346893,
 -0.0058465455658733845,
 -0.01269186194986105,
 0.01456710509955883,
 0.001254408503882587,
 0.019037794321775436,
 -0.008920992724597454,
 -0.00830270629376173,
 -0.02769380621612072,
 -0.010143977589905262,
 0.02038307674229145,
 -0.001268846564926207,
 0.004959881771355867,
 -0.022557271644473076,
 -0.01032742578536272,
 0.03087356686592102,
 -0.036010101437568665,
 0.02331824041903019,
 -0.022013721987605095,
 -0.002559774788096547,
 -0.00013312697410583496,
 -0.021932190284132957,
 -0.008635629899799824,
 -0.018140938133001328,
 0.0010607693111523986,
 0.015368839725852013,
 -8.615035039838403e-05,
 0.0126578900963068,
 -0.012637507170438766,
 0.02278827875852585,
 -0.0011796705657616258,
 -0.00134868023

In [None]:
# Metadata is also stored in the vector store.
vector_store_dict["metadata_dict"]["ec85792b-0bb4-4b18-8fc9-f47820cdf4fb"]

{'file_path': '/Users/dimatimofeev/Projects/allin-search/data/Alice_Adventures_in_Wonderland.txt',
 'file_name': '/Users/dimatimofeev/Projects/allin-search/data/Alice_Adventures_in_Wonderland.txt',
 'file_type': 'text/plain',
 'file_size': 174385,
 'creation_date': '2024-03-04',
 'last_modified_date': '2024-03-04',
 '_node_type': 'TextNode',
 'document_id': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'doc_id': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69',
 'ref_doc_id': 'c4b4b37d-5ddb-4ecc-ad01-a6b745630a69'}

In [None]:
# The chunk text itself is stored in the docstore (not the vector store) and is looked up by the same ID.
index.storage_context.docstore.get_document("ec85792b-0bb4-4b18-8fc9-f47820cdf4fb").text

"\ufeffThe Project Gutenberg eBook of Alice's Adventures in Wonderland\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: Alice's Adventures in Wonderland\r\n\r\n\r\nAuthor: Lewis Carroll\r\n\r\nRelease date: June 27, 2008 [eBook #11]\r\n                Most recently updated: February 4, 2024\r\n\r\nLanguage: English\r\n\r\nCredits: Arthur DiBianca and David Widger\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nAlice’s Adventures in Wonderland\r\n\r\nby Lewis Carroll\r\n\r\nTHE MIL

# Storing

The API calls that create the embeddings in a VectorStoreIndex can be expensive in both time and money, so you will want to store the embeddings rather than re-index the same data every time.
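
The cost argument can be illustrated with a small, library-free sketch: embeddings are cached on disk, keyed by a hash of the chunk text, so a second run makes no further calls. Everything here is an assumption for illustration. `fake_embed` stands in for a paid embedding API, and the JSON cache file is not how LlamaIndex or Chroma store data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a paid embedding API call."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


def text_key(text: str) -> str:
    """Stable cache key derived from the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts: list[str], cache_path: Path):
    """Return (embeddings, number_of_api_calls), reusing vectors stored on disk."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    calls = 0
    for text in texts:
        key = text_key(text)
        if key not in cache:
            cache[key] = fake_embed(text)  # only pay for unseen chunks
            calls += 1
    cache_path.write_text(json.dumps(cache))
    return [cache[text_key(text)] for text in texts], calls


chunks = ["Down the Rabbit-Hole", "The Pool of Tears"]
cache_file = Path(tempfile.mkdtemp()) / "embeddings.json"
first_vectors, first_calls = embed_with_cache(chunks, cache_file)
second_vectors, second_calls = embed_with_cache(chunks, cache_file)
print(first_calls, second_calls)
```

The first call embeds both chunks. The second call finds both in the cache file and embeds nothing.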

In [None]:
import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

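# EphemeralClient keeps the collection in memory only, so it is lost when the process exits.
# To keep the embeddings across sessions, chromadb.PersistentClient(path=...) can be used instead.
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("alice")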


vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# A common point of confusion: VectorStoreIndex (the index) is not the same thing as the vector store used above.
# The index takes a "storage_context" argument that tells it where to keep its data.
# Looking deeper into the VectorStoreIndex base class shows that it accepts the storage_context:
# https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/storage_context.py#L50
# The storage context is a utility container for storing nodes, indices, and vectors.
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [None]:
index.as_query_engine().query("How did Alice come to meet the Queen of Hearts?").response

'Alice came to meet the Queen of Hearts when a procession passed by her. The Queen noticed Alice and inquired about her identity.'