<a href="https://colab.research.google.com/github/mindfulcoder49/NorthShoreAI/blob/main/ReadTheDocs_2024_03_18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Read The Docs Notebook for March 18 2024

Hi! At the last meeting I mentioned that I might start creating Jupyter notebooks to learn llama-index because that's what worked for me the last time I tried to create a real project.

I created this notebook, and I realized there are a few specific techniques I use when I'm doing this that are very different from just reading the documentation. In fact, it's more like generating the documentation you need to answer the questions you have.

So I have left this notebook open-ended so I can get it out today, on Wednesday, and leave the rest of the exploration up to you. I would love it if people brought their own versions of this notebook and if they found the learning techniques here useful.

Best Regards,
Alex

In [1]:
!pip install -q llama-index

## Install llama-index and import API Key

One thing I do know is we need to install llama-index, so we did that first. Then I guess it's time to start looking under the hood. First, we should make sure we have our API key set as an environment variable the way llama-index expects.

In [6]:
import os
from google.colab import userdata

#this method assumes you have the key set in your google colab secrets
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
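
The cell above only works inside Colab. As a minimal sketch, assuming you might also run this notebook locally, the helper below tries Colab secrets first and falls back to a variable that is already set in your environment. The name `load_openai_key` is hypothetical and mine; it is not part of llama-index or Colab.

```python
import os


def load_openai_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment.

    Hypothetical helper for illustration; not part of llama-index.
    """
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # the secret is missing or the notebook was not granted access
            key = None

    # fall back to whatever is already set in the environment
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

You would then write `os.environ["OPENAI_API_KEY"] = load_openai_key()` in place of the direct `userdata.get` call.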

## Loading Documents: What do we get?

The first thing llama_index suggests is to load your documents. That sounds simple enough, but I would like to know exactly what we are getting when we do that, so let's look:

In [7]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("sample_data").load_data()

In [8]:
#print information about the documents variable
length = len(documents)
print(f"Number of documents: {length}")
#type of each list element
print(f"Document type: {type(documents[0])}")
#properties in each document
print(f"Document properties: document.metadata {documents[0].metadata}")
#get a list of all document properties
print(f"Document properties: document.metadata.keys {documents[0].metadata.keys()}")
#document data
print(f"Document data: document.text {documents[0].text}")

Number of documents: 1
Document type: <class 'llama_index.core.schema.Document'>
Document properties: document.metadata {'file_path': '/content/sample_data/README.md', 'file_name': 'README.md', 'file_type': 'text/markdown', 'file_size': 930, 'creation_date': '2024-03-11', 'last_modified_date': '2000-01-01'}
Document properties: document.metadata.keys dict_keys(['file_path', 'file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date'])
Document data: document.text This directory includes a few sample datasets to get you started.

*   `california_housing_data*.csv` is California housing data from the 1990 US
    Census; more information is available at:
    https://developers.google.com/machine-learning/crash-course/california-housing-data-description

*   `mnist_*.csv` is a small sample of the
    MNIST database, which is
    described at: http://yann.lecun.com/exdb/mnist/

*   `anscombe.json` contains a copy of
    Anscombe's quartet; it
    was originally described in

## Creating an Index using OpenAI

llama-index is already solving problems for us. The SimpleDirectoryReader.load_data method gave us a list of Document objects, and each of the Document objects has a text property and a metadata property. The metadata property is a dict with the following keys:

'file_path', 'file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date'
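
To make those keys concrete, here is a rough sketch that builds a dict with the same six keys using only the standard library. This is my own guess at where each value could come from, not how SimpleDirectoryReader actually computes them, and `describe_file` is a hypothetical name.

```python
import mimetypes
import os
from datetime import datetime, timezone


def describe_file(path: str) -> dict:
    """Build a metadata dict with the same keys seen above (illustration only)."""
    stat = os.stat(path)

    def as_date(timestamp: float) -> str:
        # format a POSIX timestamp as YYYY-MM-DD in UTC
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")

    return {
        "file_path": os.path.abspath(path),
        "file_name": os.path.basename(path),
        # guess_type returns None for extensions it does not know
        "file_type": mimetypes.guess_type(path)[0],
        "file_size": stat.st_size,
        # st_ctime is creation time on Windows but metadata-change time on Unix
        "creation_date": as_date(stat.st_ctime),
        "last_modified_date": as_date(stat.st_mtime),
    }
```

Calling `describe_file("sample_data/README.md")` in Colab should give a dict shaped like the one printed above, though the exact values may differ from what llama-index reports.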

Now let's make the index, and look at that. This will query OpenAI to generate the embeddings.

In [9]:
index = VectorStoreIndex.from_documents(documents)

In [10]:
print(index)

<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x797490378c70>


## Exploring the Index

We can iterate over the list returned by the dir() function to see all the properties and methods of the VectorStoreIndex object used in LlamaIndex.

In [11]:
for item in dir(index):
  print(item)

__abstractmethods__
__annotations__
__class__
__class_getitem__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__orig_bases__
__parameters__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__slots__
__str__
__subclasshook__
__weakref__
_abc_impl
_add_nodes_to_index
_aget_node_with_embedding
_async_add_nodes_to_index
_build_index_from_nodes
_callback_manager
_delete_node
_docstore
_embed_model
_get_node_with_embedding
_graph_store
_index_struct
_insert
_insert_batch_size
_is_protocol
_object_map
_service_context
_show_progress
_storage_context
_store_nodes_override
_transformations
_use_async
_vector_store
as_chat_engine
as_query_engine
as_retriever
build_index_from_nodes
delete
delete_nodes
delete_ref_doc
docstore
from_documents
from_vector_store
index_id
index_struct
index_struct_cls
insert
insert_nodes
ref_doc_info
refresh
refresh_ref_docs
service_context
set_in
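
Most of that list is dunder and private names. Here is a small sketch of a filter that keeps only the public ones. `public_names` is a hypothetical helper, and `Demo` is a stand-in class so the block runs without llama-index; in the notebook you would pass `index` instead.

```python
from typing import List


def public_names(obj) -> List[str]:
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Demo:
    """Stand-in object for illustration only."""

    def __init__(self):
        self._hidden = 1
        self.visible = 2

    def as_retriever(self):
        return None


print(public_names(Demo()))  # ['as_retriever', 'visible']
```

Running `public_names(index)` should leave only the names from `as_chat_engine` onward in the output above.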

## Looking at retrievers, query engines, and chat engines

The things that catch my eye are the as_chat_engine, as_query_engine, and as_retriever methods, which will give me three different objects that are all used in LlamaIndex. This could be a good opportunity to learn the differences.

In [13]:
#generate a retriever, query engine, and chat engine from my index and save them to different variables
retriever = index.as_retriever()
query_engine = index.as_query_engine()
chat_engine = index.as_chat_engine()

First we can take a look at what objects we get for each of these:

In [20]:
#print all objects
print(retriever)
print(query_engine)
print(chat_engine)

<llama_index.core.indices.vector_store.retrievers.retriever.VectorIndexRetriever object at 0x7974486a8430>
<llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine object at 0x7974486a9660>
<llama_index.agent.openai.base.OpenAIAgent object at 0x7974470154b0>


The first two are sort of intuitive, but the third is interesting. The chat_engine we get is an OpenAIAgent. That's the default, but I bet they tried to make it easy to use others.

Let's take a look at the methods and properties for each one:

In [14]:
for item in dir(retriever):
  print(item)

__abstractmethods__
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__slots__
__str__
__subclasshook__
__weakref__
_abc_impl
_aget_nodes_with_embeddings
_ahandle_recursive_retrieval
_alpha
_aretrieve
_aretrieve_from_object
_as_query_component
_build_node_list_from_query_result
_build_vector_store_query
_check_callback_manager
_doc_ids
_docstore
_embed_model
_filters
_get_nodes_with_embeddings
_get_prompt_modules
_get_prompts
_handle_recursive_retrieval
_index
_kwargs
_node_ids
_retrieve
_retrieve_from_object
_similarity_top_k
_sparse_top_k
_update_prompts
_validate_prompts
_vector_store
_vector_store_query_mode
_verbose
aretrieve
as_query_component
callback_manager
get_prompts
get_service_context
object_map
retrieve
similarity_top_k
update_prompts


In [15]:
#for the query engine
for item in dir(query_engine):
  print(item)

__abstractmethods__
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__slots__
__str__
__subclasshook__
__weakref__
_abc_impl
_apply_node_postprocessors
_aquery
_as_query_component
_get_prompt_modules
_get_prompts
_node_postprocessors
_query
_response_synthesizer
_retriever
_update_prompts
_validate_prompts
aquery
aretrieve
as_query_component
asynthesize
callback_manager
from_args
get_prompts
query
retrieve
retriever
synthesize
update_prompts
with_retriever


In [16]:
#for the chat engine
for item in dir(chat_engine):
  print(item)

__abstractmethods__
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__slots__
__str__
__subclasshook__
__weakref__
_abc_impl
_achat
_aquery
_arun_step
_as_query_component
_chat
_get_prompt_modules
_get_prompts
_query
_run_step
_update_prompts
_validate_prompts
achat
agent_worker
aquery
arun_step
as_query_component
astream_chat
astream_step
asynthesize
callback_manager
chat
chat_history
chat_repl
create_task
default_tool_choice
delete_task
delete_task_on_finish
finalize_response
from_llm
from_tools
get_completed_step
get_completed_steps
get_prompts
get_task
get_upcoming_steps
init_task_state_kwargs
list_tasks
memory
query
reset
retrieve
run_step
state
stream_chat
stream_step
streaming_chat_repl
synthesize
undo_step
update_prompts
verbose


Reading through these three lists helps me see that each object wraps the index and exposes the methods we will want once we start thinking about what our app needs.

One thing, though: it would be nice to know which of these are properties and which are methods.

In [18]:
#get each attribute and check whether it is callable, then label it as a method or a property
my_objects = retriever, query_engine, chat_engine
for my_object in my_objects:
    #print the object so we know which list we are looking at
    print(f"Object: {my_object}")
    for attribute in dir(my_object):
        #use getattr to get the actual attribute value
        attr_value = getattr(my_object, attribute)

        #callable values are labeled methods; everything else is labeled a property
        #(a rough split: __class__ is callable, so it shows up as a method)
        if callable(attr_value):
            print(f"{attribute} is a method")
        else:
            print(f"{attribute} is a property")



Object: <llama_index.core.indices.vector_store.retrievers.retriever.VectorIndexRetriever object at 0x7974486a8430>
__abstractmethods__ is a property
__class__ is a method
__delattr__ is a method
__dict__ is a property
__dir__ is a method
__doc__ is a property
__eq__ is a method
__format__ is a method
__ge__ is a method
__getattribute__ is a method
__gt__ is a method
__hash__ is a method
__init__ is a method
__init_subclass__ is a method
__le__ is a method
__lt__ is a method
__module__ is a property
__ne__ is a method
__new__ is a method
__reduce__ is a method
__reduce_ex__ is a method
__repr__ is a method
__setattr__ is a method
__sizeof__ is a method
__slots__ is a property
__str__ is a method
__subclasshook__ is a method
__weakref__ is a property
_abc_impl is a property
_aget_nodes_with_embeddings is a method
_ahandle_recursive_retrieval is a method
_alpha is a property
_aretrieve is a method
_aretrieve_from_object is a method
_as_query_component is a method
_build_node_list_from_que
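
The `callable` check is a rough split: it calls every property getter along the way and reports `__class__` as a method. A possible refinement, sketched here with a stand-in `Demo` class rather than the real llama-index objects, is to look names up with `inspect.getattr_static`, which does not run getters. `classify_attributes` is a hypothetical helper, and I have not checked how it labels every attribute on the real objects.

```python
import inspect


def classify_attributes(obj) -> dict:
    """Map each public attribute name to 'property', 'method', or 'attribute'."""
    kinds = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        # getattr_static looks the name up without running property getters
        static_value = inspect.getattr_static(obj, name)
        if isinstance(static_value, property):
            kinds[name] = "property"
        elif isinstance(static_value, (staticmethod, classmethod)):
            kinds[name] = "method"
        elif inspect.isroutine(static_value):
            kinds[name] = "method"
        else:
            kinds[name] = "attribute"
    return kinds


class Demo:
    """Stand-in class for illustration only."""

    def __init__(self):
        self.verbose = False  # plain instance attribute

    @property
    def retriever(self):
        return "a retriever"

    def query(self, text):
        return text

    @classmethod
    def from_args(cls):
        return cls()


for name, kind in classify_attributes(Demo()).items():
    print(f"{name} is a {kind}")
```

In the notebook you could call `classify_attributes(query_engine)` and compare the result with the labels printed above.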

## Hunting down the source code for the interesting functions

We might be able to get the same information from the documentation, but in this format we can ask for exactly the information we are interested in and ready to absorb. Inspecting the live objects is also the most authoritative source, and more trustworthy than the documentation, because it reflects the code we actually have installed.

The source code is also a good read, and the BaseIndex class in base.py is, in my opinion, an exciting and illuminating piece of code:

https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/base.py

I'll copy it below:

In [None]:
"""Base index classes."""

import logging
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, Optional, Sequence, Type, TypeVar

from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.chat_engine.types import BaseChatEngine, ChatMode
from llama_index.core.data_structs.data_structs import IndexStruct
from llama_index.core.ingestion import run_transformations
from llama_index.core.llms.utils import LLMType, resolve_llm
from llama_index.core.schema import BaseNode, Document, IndexNode, TransformComponent
from llama_index.core.service_context import ServiceContext
from llama_index.core.settings import (
    Settings,
    callback_manager_from_settings_or_context,
    llm_from_settings_or_context,
    transformations_from_settings_or_context,
)
from llama_index.core.storage.docstore.types import BaseDocumentStore, RefDocInfo
from llama_index.core.storage.storage_context import StorageContext

IS = TypeVar("IS", bound=IndexStruct)
IndexType = TypeVar("IndexType", bound="BaseIndex")

logger = logging.getLogger(__name__)


class BaseIndex(Generic[IS], ABC):
    """Base LlamaIndex.

    Args:
        nodes (List[Node]): List of nodes to index
        show_progress (bool): Whether to show tqdm progress bars. Defaults to False.
        service_context (ServiceContext): Service context container (contains
            components like LLM, Embeddings, etc.).

    """

    index_struct_cls: Type[IS]

    def __init__(
        self,
        nodes: Optional[Sequence[BaseNode]] = None,
        objects: Optional[Sequence[IndexNode]] = None,
        index_struct: Optional[IS] = None,
        storage_context: Optional[StorageContext] = None,
        callback_manager: Optional[CallbackManager] = None,
        transformations: Optional[List[TransformComponent]] = None,
        show_progress: bool = False,
        # deprecated
        service_context: Optional[ServiceContext] = None,
        **kwargs: Any,
    ) -> None:
        """Initialize with parameters."""
        if index_struct is None and nodes is None and objects is None:
            raise ValueError("One of nodes, objects, or index_struct must be provided.")
        if index_struct is not None and nodes is not None:
            raise ValueError("Only one of nodes or index_struct can be provided.")
        # This is to explicitly make sure that the old UX is not used
        if nodes is not None and len(nodes) >= 1 and not isinstance(nodes[0], BaseNode):
            if isinstance(nodes[0], Document):
                raise ValueError(
                    "The constructor now takes in a list of Node objects. "
                    "Since you are passing in a list of Document objects, "
                    "please use `from_documents` instead."
                )
            else:
                raise ValueError("nodes must be a list of Node objects.")

        self._storage_context = storage_context or StorageContext.from_defaults()
        # deprecated
        self._service_context = service_context

        self._docstore = self._storage_context.docstore
        self._show_progress = show_progress
        self._vector_store = self._storage_context.vector_store
        self._graph_store = self._storage_context.graph_store
        self._callback_manager = (
            callback_manager
            or callback_manager_from_settings_or_context(Settings, service_context)
        )

        objects = objects or []
        self._object_map = {obj.index_id: obj.obj for obj in objects}
        for obj in objects:
            obj.obj = None  # clear the object to avoid serialization issues

        with self._callback_manager.as_trace("index_construction"):
            if index_struct is None:
                nodes = nodes or []
                index_struct = self.build_index_from_nodes(
                    nodes + objects  # type: ignore
                )
            self._index_struct = index_struct
            self._storage_context.index_store.add_index_struct(self._index_struct)

        self._transformations = (
            transformations
            or transformations_from_settings_or_context(Settings, service_context)
        )

    @classmethod
    def from_documents(
        cls: Type[IndexType],
        documents: Sequence[Document],
        storage_context: Optional[StorageContext] = None,
        show_progress: bool = False,
        callback_manager: Optional[CallbackManager] = None,
        transformations: Optional[List[TransformComponent]] = None,
        # deprecated
        service_context: Optional[ServiceContext] = None,
        **kwargs: Any,
    ) -> IndexType:
        """Create index from documents.

        Args:
            documents (Optional[Sequence[BaseDocument]]): List of documents to
                build the index from.

        """
        storage_context = storage_context or StorageContext.from_defaults()
        docstore = storage_context.docstore
        callback_manager = (
            callback_manager
            or callback_manager_from_settings_or_context(Settings, service_context)
        )
        transformations = transformations or transformations_from_settings_or_context(
            Settings, service_context
        )

        with callback_manager.as_trace("index_construction"):
            for doc in documents:
                docstore.set_document_hash(doc.get_doc_id(), doc.hash)

            nodes = run_transformations(
                documents,  # type: ignore
                transformations,
                show_progress=show_progress,
                **kwargs,
            )

            return cls(
                nodes=nodes,
                storage_context=storage_context,
                callback_manager=callback_manager,
                show_progress=show_progress,
                transformations=transformations,
                service_context=service_context,
                **kwargs,
            )

    @property
    def index_struct(self) -> IS:
        """Get the index struct."""
        return self._index_struct

    @property
    def index_id(self) -> str:
        """Get the index struct."""
        return self._index_struct.index_id

    def set_index_id(self, index_id: str) -> None:
        """Set the index id.

        NOTE: if you decide to set the index_id on the index_struct manually,
        you will need to explicitly call `add_index_struct` on the `index_store`
        to update the index store.

        Args:
            index_id (str): Index id to set.

        """
        # delete the old index struct
        old_id = self._index_struct.index_id
        self._storage_context.index_store.delete_index_struct(old_id)
        # add the new index struct
        self._index_struct.index_id = index_id
        self._storage_context.index_store.add_index_struct(self._index_struct)

    @property
    def docstore(self) -> BaseDocumentStore:
        """Get the docstore corresponding to the index."""
        return self._docstore

    @property
    def service_context(self) -> Optional[ServiceContext]:
        return self._service_context

    @property
    def storage_context(self) -> StorageContext:
        return self._storage_context

    @property
    def summary(self) -> str:
        return str(self._index_struct.summary)

    @summary.setter
    def summary(self, new_summary: str) -> None:
        self._index_struct.summary = new_summary
        self._storage_context.index_store.add_index_struct(self._index_struct)

    @abstractmethod
    def _build_index_from_nodes(self, nodes: Sequence[BaseNode]) -> IS:
        """Build the index from nodes."""

    def build_index_from_nodes(self, nodes: Sequence[BaseNode]) -> IS:
        """Build the index from nodes."""
        self._docstore.add_documents(nodes, allow_update=True)
        return self._build_index_from_nodes(nodes)

    @abstractmethod
    def _insert(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:
        """Index-specific logic for inserting nodes to the index struct."""

    def insert_nodes(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:
        """Insert nodes."""
        for node in nodes:
            if isinstance(node, IndexNode):
                try:
                    node.dict()
                except ValueError:
                    self._object_map[node.index_id] = node.obj
                    node.obj = None

        with self._callback_manager.as_trace("insert_nodes"):
            self.docstore.add_documents(nodes, allow_update=True)
            self._insert(nodes, **insert_kwargs)
            self._storage_context.index_store.add_index_struct(self._index_struct)

    def insert(self, document: Document, **insert_kwargs: Any) -> None:
        """Insert a document."""
        with self._callback_manager.as_trace("insert"):
            nodes = run_transformations(
                [document],
                self._transformations,
                show_progress=self._show_progress,
            )

            self.insert_nodes(nodes, **insert_kwargs)
            self.docstore.set_document_hash(document.get_doc_id(), document.hash)

    @abstractmethod
    def _delete_node(self, node_id: str, **delete_kwargs: Any) -> None:
        """Delete a node."""

    def delete_nodes(
        self,
        node_ids: List[str],
        delete_from_docstore: bool = False,
        **delete_kwargs: Any,
    ) -> None:
        """Delete a list of nodes from the index.

        Args:
            doc_ids (List[str]): A list of doc_ids from the nodes to delete

        """
        for node_id in node_ids:
            self._delete_node(node_id, **delete_kwargs)
            if delete_from_docstore:
                self.docstore.delete_document(node_id, raise_error=False)

        self._storage_context.index_store.add_index_struct(self._index_struct)

    def delete(self, doc_id: str, **delete_kwargs: Any) -> None:
        """Delete a document from the index.
        All nodes in the index related to the index will be deleted.

        Args:
            doc_id (str): A doc_id of the ingested document

        """
        logger.warning(
            "delete() is now deprecated, please refer to delete_ref_doc() to delete "
            "ingested documents+nodes or delete_nodes to delete a list of nodes."
        )
        self.delete_ref_doc(doc_id)

    def delete_ref_doc(
        self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
    ) -> None:
        """Delete a document and it's nodes by using ref_doc_id."""
        ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
        if ref_doc_info is None:
            logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
            return

        self.delete_nodes(
            ref_doc_info.node_ids,
            delete_from_docstore=False,
            **delete_kwargs,
        )

        if delete_from_docstore:
            self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)

    def update(self, document: Document, **update_kwargs: Any) -> None:
        """Update a document and it's corresponding nodes.

        This is equivalent to deleting the document and then inserting it again.

        Args:
            document (Union[BaseDocument, BaseIndex]): document to update
            insert_kwargs (Dict): kwargs to pass to insert
            delete_kwargs (Dict): kwargs to pass to delete

        """
        logger.warning(
            "update() is now deprecated, please refer to update_ref_doc() to update "
            "ingested documents+nodes."
        )
        self.update_ref_doc(document, **update_kwargs)

    def update_ref_doc(self, document: Document, **update_kwargs: Any) -> None:
        """Update a document and it's corresponding nodes.

        This is equivalent to deleting the document and then inserting it again.

        Args:
            document (Union[BaseDocument, BaseIndex]): document to update
            insert_kwargs (Dict): kwargs to pass to insert
            delete_kwargs (Dict): kwargs to pass to delete

        """
        with self._callback_manager.as_trace("update"):
            self.delete_ref_doc(
                document.get_doc_id(),
                delete_from_docstore=True,
                **update_kwargs.pop("delete_kwargs", {}),
            )
            self.insert(document, **update_kwargs.pop("insert_kwargs", {}))

    def refresh(
        self, documents: Sequence[Document], **update_kwargs: Any
    ) -> List[bool]:
        """Refresh an index with documents that have changed.

        This allows users to save LLM and Embedding model calls, while only
        updating documents that have any changes in text or metadata. It
        will also insert any documents that previously were not stored.
        """
        logger.warning(
            "refresh() is now deprecated, please refer to refresh_ref_docs() to "
            "refresh ingested documents+nodes with an updated list of documents."
        )
        return self.refresh_ref_docs(documents, **update_kwargs)

    def refresh_ref_docs(
        self, documents: Sequence[Document], **update_kwargs: Any
    ) -> List[bool]:
        """Refresh an index with documents that have changed.

        This allows users to save LLM and Embedding model calls, while only
        updating documents that have any changes in text or metadata. It
        will also insert any documents that previously were not stored.
        """
        with self._callback_manager.as_trace("refresh"):
            refreshed_documents = [False] * len(documents)
            for i, document in enumerate(documents):
                existing_doc_hash = self._docstore.get_document_hash(
                    document.get_doc_id()
                )
                if existing_doc_hash is None:
                    self.insert(document, **update_kwargs.pop("insert_kwargs", {}))
                    refreshed_documents[i] = True
                elif existing_doc_hash != document.hash:
                    self.update_ref_doc(
                        document, **update_kwargs.pop("update_kwargs", {})
                    )
                    refreshed_documents[i] = True

            return refreshed_documents

    @property
    @abstractmethod
    def ref_doc_info(self) -> Dict[str, RefDocInfo]:
        """Retrieve a dict mapping of ingested documents and their nodes+metadata."""
        ...

    @abstractmethod
    def as_retriever(self, **kwargs: Any) -> BaseRetriever:
        ...

    def as_query_engine(
        self, llm: Optional[LLMType] = None, **kwargs: Any
    ) -> BaseQueryEngine:
        # NOTE: lazy import
        from llama_index.core.query_engine.retriever_query_engine import (
            RetrieverQueryEngine,
        )

        retriever = self.as_retriever(**kwargs)
        llm = (
            resolve_llm(llm, callback_manager=self._callback_manager)
            if llm
            else llm_from_settings_or_context(Settings, self.service_context)
        )

        return RetrieverQueryEngine.from_args(
            retriever,
            llm=llm,
            **kwargs,
        )

    def as_chat_engine(
        self,
        chat_mode: ChatMode = ChatMode.BEST,
        llm: Optional[LLMType] = None,
        **kwargs: Any,
    ) -> BaseChatEngine:
        service_context = kwargs.get("service_context", self.service_context)

        if service_context is not None:
            llm = (
                resolve_llm(llm, callback_manager=self._callback_manager)
                if llm
                else service_context.llm
            )
        else:
            llm = (
                resolve_llm(llm, callback_manager=self._callback_manager)
                if llm
                else Settings.llm
            )

        query_engine = self.as_query_engine(llm=llm, **kwargs)

        # resolve chat mode
        if chat_mode in [ChatMode.REACT, ChatMode.OPENAI, ChatMode.BEST]:
            # use an agent with query engine tool in these chat modes
            # NOTE: lazy import
            from llama_index.core.agent import AgentRunner
            from llama_index.core.tools.query_engine import QueryEngineTool

            # convert query engine to tool
            query_engine_tool = QueryEngineTool.from_defaults(query_engine=query_engine)

            return AgentRunner.from_llm(
                tools=[query_engine_tool],
                llm=llm,
                **kwargs,
            )

        if chat_mode == ChatMode.CONDENSE_QUESTION:
            # NOTE: lazy import
            from llama_index.core.chat_engine import CondenseQuestionChatEngine

            return CondenseQuestionChatEngine.from_defaults(
                query_engine=query_engine,
                llm=llm,
                **kwargs,
            )
        elif chat_mode == ChatMode.CONTEXT:
            from llama_index.core.chat_engine import ContextChatEngine

            return ContextChatEngine.from_defaults(
                retriever=self.as_retriever(**kwargs),
                llm=llm,
                **kwargs,
            )

        elif chat_mode == ChatMode.CONDENSE_PLUS_CONTEXT:
            from llama_index.core.chat_engine import CondensePlusContextChatEngine

            return CondensePlusContextChatEngine.from_defaults(
                retriever=self.as_retriever(**kwargs),
                llm=llm,
                **kwargs,
            )

        elif chat_mode == ChatMode.SIMPLE:
            from llama_index.core.chat_engine import SimpleChatEngine

            return SimpleChatEngine.from_defaults(
                llm=llm,
                **kwargs,
            )
        else:
            raise ValueError(f"Unknown chat mode: {chat_mode}")


# legacy
BaseGPTIndex = BaseIndex

## Generating the Documentation you need

Here is some documentation generated with GPT-4:

The code you've provided defines the `BaseIndex` class, which is a generic, abstract base class for constructing indexes in the LlamaIndex system. Here's a brief explanation of each method:

- **`__init__`**: Initializes the `BaseIndex` instance. It checks for the validity of input parameters (nodes, objects, or index_struct must be provided) and sets up storage, service contexts, and callback managers. It builds or sets the index structure (`_index_struct`) and processes transformation components.

- **`from_documents`**: A class method that creates an index from a sequence of documents. It processes the documents through transformations and initializes an index with the resulting nodes.

- **`index_struct`**: A property that returns the index structure associated with this index.

- **`index_id`**: A property that returns the unique identifier of the index structure.

- **`set_index_id`**: Allows setting a new unique identifier for the index, updating the storage context accordingly.

- **`docstore`**: Provides access to the document store associated with this index.

- **`service_context`**: Returns the service context associated with the index.

- **`storage_context`**: Returns the storage context associated with the index.

- **`summary`**: A property that provides a summary of the index structure. It's also writable, allowing you to update the summary.

- **`_build_index_from_nodes`**: An abstract method that must be implemented by subclasses to define how the index is built from nodes.

- **`build_index_from_nodes`**: Takes a sequence of nodes and builds the index from them. It updates the document store and constructs the index structure using `_build_index_from_nodes`.

- **`_insert`**: An abstract method for inserting nodes into the index, with subclass-specific logic.

- **`insert_nodes`**: Inserts a sequence of nodes into the index, updating the document store and index structure as needed.

- **`insert`**: Inserts a single document into the index, transforming it into nodes before inserting.

- **`_delete_node`**: An abstract method for deleting a single node from the index.

- **`delete_nodes`**: Deletes a list of nodes from the index, optionally removing them from the document store as well.

- **`delete`**: Deprecated method for deleting a document from the index by its document ID.

- **`delete_ref_doc`**: Deletes a referenced document and its associated nodes from the index and, optionally, the document store.

- **`update`**: Deprecated method for updating a document and its nodes in the index.

- **`update_ref_doc`**: Updates a referenced document and its nodes in the index, effectively reinserting it.

- **`refresh`**: Deprecated method for refreshing an index with updated documents.

- **`refresh_ref_docs`**: Updates the index with a list of documents, reprocessing only the ones that have changed, which saves LLM and embedding model calls.

- **`ref_doc_info`**: An abstract property that should provide mapping information for referenced documents and their nodes.

- **`as_retriever`**: An abstract method that should return a retriever instance based on the index.

- **`as_query_engine`**: Creates a query engine for the index, allowing for complex queries using a retriever and possibly a large language model (LLM).

- **`as_chat_engine`**: Sets up a chat engine with various modes, utilizing the index for retrieving information and a large language model for generating responses.

These methods provide a comprehensive interface for building, managing, and utilizing indexes in the LlamaIndex system, facilitating complex data retrieval and interaction scenarios.
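
The pairing of public methods with underscore-prefixed abstract ones (`build_index_from_nodes` / `_build_index_from_nodes`, `insert_nodes` / `_insert`) is easier to see in a stripped-down form. Here is a minimal toy sketch of that pattern in plain Python. `ToyBaseIndex` and `ToyListIndex` are hypothetical names I made up for illustration, not LlamaIndex classes, and the real `BaseIndex` does far more (transformations, storage context, callbacks).

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ToyBaseIndex(ABC):
    """Toy stand-in for the public/abstract method split in BaseIndex."""

    def __init__(self, nodes: List[str]) -> None:
        # Shared bookkeeping lives in the base class: a tiny "docstore".
        self.docstore: Dict[int, str] = {}
        self.index_struct = self.build_index_from_nodes(nodes)

    def build_index_from_nodes(self, nodes: List[str]) -> List[str]:
        # Public method: update the docstore, then delegate to the subclass.
        for node in nodes:
            self.docstore[len(self.docstore)] = node
        return self._build_index_from_nodes(nodes)

    def insert_nodes(self, nodes: List[str]) -> None:
        # Same shape: shared bookkeeping first, subclass logic second.
        for node in nodes:
            self.docstore[len(self.docstore)] = node
        self._insert(nodes)

    @abstractmethod
    def _build_index_from_nodes(self, nodes: List[str]) -> List[str]:
        """Subclass decides what the index structure looks like."""

    @abstractmethod
    def _insert(self, nodes: List[str]) -> None:
        """Subclass decides how new nodes enter the structure."""


class ToyListIndex(ToyBaseIndex):
    """Simplest possible subclass: the index structure is just a list."""

    def _build_index_from_nodes(self, nodes: List[str]) -> List[str]:
        return list(nodes)

    def _insert(self, nodes: List[str]) -> None:
        self.index_struct.extend(nodes)


index = ToyListIndex(["alpha", "beta"])
index.insert_nodes(["gamma"])
print(index.index_struct)
print(len(index.docstore))
```

Trying to instantiate `ToyBaseIndex` directly raises a `TypeError`, which is the same reason you always work with a concrete subclass rather than `BaseIndex` itself.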

I also generated more specific documentation on the three methods we used above:

The provided code snippet outlines the architecture of a sophisticated indexing system designed to work with various types of data and leverage large language models (LLMs) for processing. Here's an expansion on the key components you've asked about:

### as_retriever Method
- **Purpose**: This abstract method is intended to return a `BaseRetriever` instance specific to the index. A retriever is a component that allows for efficient fetching or searching of data from the index based on queries.
- **How It Works**: When implemented, this method configures and initializes a retriever that is capable of interacting with the index's underlying structure (document store, vector store, etc.) to retrieve relevant information or documents based on the input queries. This is crucial for building search and query functionalities on top of the indexed data.

### as_query_engine Method
- **Purpose**: Creates a `BaseQueryEngine` instance that facilitates complex queries over the index, potentially using a retriever and a large language model. This engine can handle more sophisticated queries than simple retrievals, including those that require understanding and processing natural language queries.
- **How It Works**: This method obtains a retriever via `as_retriever` and pairs it with a large language model. At query time the retriever fetches the relevant nodes from the index, and the LLM synthesizes a natural-language answer from them. That is what lets the query engine answer a question rather than just return matching documents.
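
To make the retriever and query engine relationship concrete, here is a toy sketch in plain Python: the query engine retrieves first, then hands the retrieved text to an LLM. Everything in it is hypothetical and made up for illustration. `ToyRetriever` ranks by simple word overlap, and `fake_llm` is a stand-in function that only formats a string, whereas the real engines rely on embeddings and an actual model.

```python
from typing import Callable, List


class ToyRetriever:
    """Ranks stored texts by how many words they share with the query."""

    def __init__(self, texts: List[str], top_k: int = 1) -> None:
        self.texts = texts
        self.top_k = top_k

    def retrieve(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[: self.top_k]


class ToyQueryEngine:
    """Retrieve first, then ask the LLM to answer from the retrieved context."""

    def __init__(self, retriever: ToyRetriever, llm: Callable[[str], str]) -> None:
        self.retriever = retriever
        self.llm = llm

    def query(self, question: str) -> str:
        context = "\n".join(self.retriever.retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return self.llm(prompt)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: echoes the first line of context.
    return "Answer based on: " + prompt.splitlines()[1]


texts = ["llamas live in the Andes", "an index stores nodes for retrieval"]
engine = ToyQueryEngine(ToyRetriever(texts), fake_llm)
print(engine.query("where do llamas live"))
```

The point of the sketch is the shape: a retriever can be used on its own, and a query engine is a retriever plus a step that turns retrieved text into an answer.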

### as_chat_engine Method
- **Purpose**: Sets up a `BaseChatEngine` with various modes, utilizing the index for retrieving information and a large language model for generating responses. This is geared towards building conversational interfaces that can interact intelligently with users, providing answers or information based on the indexed data.
- **How It Works**: The method configures a chat engine that can operate in different modes (e.g., `ChatMode.BEST`, `ChatMode.REACT`, etc.), depending on the requirements. Which component the engine wraps depends on the mode. Some modes wrap the query engine (set up via `as_query_engine`); others, as the source above shows, take the retriever directly (`ChatMode.CONDENSE_PLUS_CONTEXT` passes `self.as_retriever(**kwargs)`), and `ChatMode.SIMPLE` uses only the LLM without consulting the index at all. In the index-backed modes, the LLM generates conversational responses informed by the data retrieved from the index. This enables the development of interactive chat-based interfaces or bots that can provide users with information, answer questions, or assist with tasks based on the indexed data.
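
The mode dispatch can be sketched as a plain function. This is a simplified illustration, not the library code: `ToyChatMode` and `describe_chat_engine` are hypothetical names, the string values are my assumption, and only the two branches visible at the end of the source listing above are modeled.

```python
from enum import Enum


class ToyChatMode(str, Enum):
    # Hypothetical mirror of two ChatMode members; values are assumed.
    CONDENSE_PLUS_CONTEXT = "condense_plus_context"
    SIMPLE = "simple"


def describe_chat_engine(chat_mode: str) -> str:
    """Return which components a mode wires together (toy version)."""
    if chat_mode == ToyChatMode.CONDENSE_PLUS_CONTEXT:
        # In the listing, this branch receives retriever=self.as_retriever(...).
        return "retriever + llm"
    elif chat_mode == ToyChatMode.SIMPLE:
        # In the listing, this branch receives only llm=llm.
        return "llm only"
    else:
        raise ValueError(f"Unknown chat mode: {chat_mode}")


print(describe_chat_engine(ToyChatMode.SIMPLE))
print(describe_chat_engine("condense_plus_context"))
```

Because the toy enum subclasses `str`, a plain string compares equal to the matching member, so both call styles work in this sketch.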

Each of these components plays a critical role in building a comprehensive and interactive data indexing and retrieval system. They enable the construction of sophisticated applications capable of understanding and responding to complex queries, supporting conversational interfaces, and leveraging the power of large language models to enhance user interaction and information retrieval.

## Questions for Monday March 18:

1. What exactly are documents and nodes and when are they used?
2. What are the service context and storage context and when are they used?
3. What other chat agents are available in as_chat_engine?