# Advanced Indexing Techniques with LlamaIndex and Ollama: Part 2

Welcome back to our deep dive into LlamaIndex and Ollama! In Part 1, we covered the essentials of setting up and using these tools for information retrieval. Now it’s time to explore advanced indexing techniques that give you finer control over document processing and querying.

## 1. Introduction

Before we proceed, let’s quickly recap the key takeaways from Part 1:

- Setting up LlamaIndex and Ollama
- Creating a basic index
- Performing simple queries

In this part, we’ll dive into different index types, learn how to customize index settings, manage multiple documents, and explore advanced querying techniques. By the end, you’ll have a robust understanding of how to leverage LlamaIndex and Ollama for complex information retrieval tasks.

If you haven’t set up your environment yet, make sure to refer back to Part 1 for detailed instructions on installing and configuring LlamaIndex and Ollama.

## 2. Exploring Different Index Types

LlamaIndex offers various index types, each tailored to different use cases. Let’s explore the four main types:

### 2.1 List Index

The List Index is the simplest form of indexing in LlamaIndex. It’s an ordered list of text chunks, ideal for straightforward use cases.

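```python
from dotenv import load_dotenv
from llama_index.core import ListIndex, SimpleDirectoryReader

# Load API keys and other settings from a local .env file
load_dotenv()

# Read every file in ./data and build a flat, ordered list of chunks
documents = SimpleDirectoryReader('data').load_data()
index = ListIndex.from_documents(documents)

# Query the index in natural language
query_engine = index.as_query_engine()
response = query_engine.query("What is the capital of France?")
print(response)
```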

LlamaIndex provides a simple interface to query your data using natural language.



**Pros:**

- Simple and quick to create
- Best suited for small document sets

**Cons:**

- Less efficient with large datasets
- Limited semantic understanding

### 2.2 Vector Store Index

The Vector Store Index leverages embeddings to create a semantic representation of your documents, enabling more sophisticated searches.

```python
import chromadb
from IPython.display import Markdown, display
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Create an in-memory Chroma client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")

# Embedding model served by a local Ollama instance
embed_model = OllamaEmbedding(
    model_name="snowflake-arctic-embed",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

# Load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"{response}"))
```

The author spent most of 2014 painting after deciding to do something completely different following his work with Y Combinator.
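To make the idea of a semantic representation more concrete, here is a minimal, library-free sketch of what a vector index does at query time: it ranks stored chunks by cosine similarity to the query vector. The three-dimensional vectors and the `rank_by_similarity` helper are made up for illustration. Real embeddings from a model such as `snowflake-arctic-embed` have hundreds of dimensions, and a vector store is free to use a faster search strategy than this linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, chunk_vecs, top_k=2):
    """Return the top_k (chunk_id, score) pairs, best match first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings, not produced by any real model
chunk_vecs = {
    "childhood": [0.9, 0.1, 0.0],
    "painting": [0.1, 0.9, 0.2],
    "startups": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

top = rank_by_similarity(query_vec, chunk_vecs)
print([cid for cid, _ in top])  # ['childhood', 'painting']
```

Because the ranking depends on vector direction rather than shared words, a query can match a chunk that uses entirely different vocabulary, which is what separates this index type from keyword lookup.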

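### 2.3 Tree Index

The Tree Index organizes chunks into a hierarchy: leaf nodes hold the original text, and each parent node holds a summary of its children. Queries traverse the tree from the root toward the most relevant leaves.

```python
from llama_index.core import TreeIndex, SimpleDirectoryReader

# Build a hierarchical index over the files in ./data
documents = SimpleDirectoryReader('data').load_data()
tree_index = TreeIndex.from_documents(documents)

query_engine = tree_index.as_query_engine()
response = query_engine.query("Explain the structure of the human respiratory system.")
print(response)
```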

LlamaIndex is a powerful data framework designed to help developers build AI applications with large language models (LLMs). It provides a set of tools and techniques to connect custom data sources to LLMs, enabling more accurate and context-aware responses. Key features include data ingestion, data indexing, a query interface, LLM integration, customization options, and scalability. Use cases for LlamaIndex include question-answering systems, chatbots with domain-specific knowledge, semantic search applications, and document analysis and summarization.
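As a rough, library-free illustration of how a tree of summaries can be assembled bottom-up, the sketch below groups leaves into parents until a single root remains. It is a conceptual model only, not LlamaIndex's implementation: `summarize` is a stand-in that simply truncates text where the real index would call an LLM, and the sample sentences and the fan-out of 2 are chosen for illustration.

```python
def summarize(texts, max_words=6):
    """Stand-in for an LLM summary: keep the first few words of the joined text."""
    words = " ".join(texts).split()
    return " ".join(words[:max_words])

def build_tree(leaves, num_children=2):
    """Build levels bottom-up until a single root remains; returns all levels."""
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = [
            summarize(current[i:i + num_children])
            for i in range(0, len(current), num_children)
        ]
        levels.append(parents)
    return levels

# Hypothetical sample chunks
leaves = [
    "The nose filters incoming air.",
    "The trachea carries air to the lungs.",
    "The bronchi branch inside each lung.",
    "The alveoli exchange oxygen and carbon dioxide.",
]
levels = build_tree(leaves)
for depth, nodes in enumerate(levels):
    print(depth, len(nodes))  # 0 4, then 1 2, then 2 1
```

The trade-off this makes visible is that building the tree costs one summary per parent node, so construction is more expensive than for a flat list, while a query only needs to inspect a few nodes per level.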


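### 2.4 Keyword Table Index

The Keyword Table Index extracts keywords from each chunk and maps every keyword to the chunks that contain it. At query time, keywords extracted from the query are used to look up matching chunks.

```python
from llama_index.core import KeywordTableIndex, SimpleDirectoryReader

# Build a keyword-to-chunk table over the Paul Graham essays
documents = SimpleDirectoryReader('data/paul_graham').load_data()
keyword_index = KeywordTableIndex.from_documents(documents)

query_engine = keyword_index.as_query_engine()
response = query_engine.query("What are the symptoms of influenza?")
print(response)
```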

Empty Response
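The empty response is most likely explained by how keyword lookup works: if none of the query's keywords appear in the table built from the essays, no chunks are retrieved and there is nothing to answer from. The sketch below models that behavior with a plain dictionary. The stopword list, the regex-based `extract_keywords` helper, and the two sample chunks are assumptions for illustration, and the real index may extract keywords differently.

```python
import re
from collections import defaultdict

# Small illustrative stopword list
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "what"}

def extract_keywords(text):
    """Lowercase, split into words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def build_keyword_table(chunks):
    """Map each keyword to the set of chunk IDs that contain it."""
    table = defaultdict(set)
    for chunk_id, text in chunks.items():
        for keyword in extract_keywords(text):
            table[keyword].add(chunk_id)
    return table

def lookup(table, query):
    """Return IDs of chunks sharing at least one keyword with the query."""
    matches = set()
    for keyword in extract_keywords(query):
        matches |= table.get(keyword, set())
    return sorted(matches)

# Hypothetical sample chunks
chunks = {
    "c1": "I wrote short stories and programmed on the IBM 1401.",
    "c2": "Painting students work on still life in art school.",
}
table = build_keyword_table(chunks)

print(lookup(table, "What did the author write about painting?"))  # ['c2']
print(lookup(table, "What are the symptoms of influenza?"))        # []
```

Note that "write" does not match "wrote" in the first query: exact keyword matching has no notion of related word forms or synonyms, which is the main limitation compared with embedding-based retrieval.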


## 3. Customizing Index Settings

### 3.1 Chunking Strategies

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser

# Fixed-size chunking with a chunk size of 1024 tokens
parser = SimpleNodeParser.from_defaults(chunk_size=1024)
# Sentence-based chunking (alternative, left disabled as in the run shown here)
# parser = SimpleNodeParser.from_defaults(chunk_size=None, chunk_overlap=0)

documents = SimpleDirectoryReader('data').load_data()
nodes = parser.get_nodes_from_documents(documents)
print(nodes[0])
```

Node ID: 42190b55-9682-457e-a22c-d8c9bd28869c
Text: LlamaIndex: An Introduction  LlamaIndex is a powerful data
framework designed to help developers build AI applications with large
language models (LLMs). It provides a set of tools and techniques to
connect custom data sources to LLMs, enabling more accurate and
context-aware responses.  Key Features of LlamaIndex:  1. Data
Ingestion: LlamaIndex...
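To show what chunk size and chunk overlap control, here is a simplified sketch that splits text into overlapping windows. It counts whitespace-separated words, whereas LlamaIndex's parser counts tokens and tries to respect sentence boundaries, so treat the `chunk_words` helper as a hypothetical model of the idea rather than the library's algorithm.

```python
def chunk_words(text, chunk_size=8, chunk_overlap=2):
    """Split text into windows of chunk_size words; consecutive windows share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(1, 15))  # w1 ... w14
chunks = chunk_words(text)
for chunk in chunks:
    print(chunk)
# w1 w2 w3 w4 w5 w6 w7 w8
# w7 w8 w9 w10 w11 w12 w13 w14
```

Larger chunks give the model more context per retrieved node but make retrieval less precise, and overlap reduces the chance that a sentence split across a boundary loses its meaning.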


### 3.2 Embedding Models

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.ollama import OllamaEmbedding

# Embedding model served by a local Ollama instance
embed_model = OllamaEmbedding(
    model_name="snowflake-arctic-embed",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

# Load the documents and embed them with the Ollama model
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

query_engine = index.as_query_engine()
response = query_engine.query("What are the main themes in Shakespeare's plays?")
print(response)
```

LlamaIndex is a powerful data framework designed to help developers build AI applications with large language models (LLMs). It provides a set of tools and techniques to connect custom data sources to LLMs, enabling more accurate and context-aware responses. Key features of LlamaIndex include data ingestion, data indexing, query interface, LLM integration, customization, and scalability. Use cases for LlamaIndex include question-answering systems, chatbots with domain-specific knowledge, semantic search applications, and document analysis and summarization.


## 4. Handling Multiple Documents

### 4.1 Creating a Multi-Document Index

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from different sources
# pdf_docs = SimpleDirectoryReader('pdfs').load_data()
# csv_docs = SimpleDirectoryReader('csvs').load_data()
txt_docs = SimpleDirectoryReader('data/paul_graham').load_data()
web_docs = SimpleDirectoryReader('web_pages').load_data()

# Combine everything into one list and index it
all_docs = txt_docs + web_docs
index = VectorStoreIndex.from_documents(all_docs)

query_engine = index.as_query_engine()
response = query_engine.query("What is LlamaIndex?")
print(response)
```

LlamaIndex is a tool that offers various index types for different use cases. It is a tool used for efficient information retrieval and document processing tasks.


### 4.2 Cross-Document Querying

```python
from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine

# Retrieve the five most similar chunks from the multi-document index
retriever = index.as_retriever(similarity_top_k=5)
query_engine = RetrieverQueryEngine.from_args(retriever, response_mode="compact")

query = QueryBundle("What are the common themes across all documents?")
response = query_engine.query(query)
print(response)
```

The common themes across all documents include personal experiences and reflections, career development, technological advancements, educational pursuits, and the intersection of art and technology. The documents touch on topics such as the author's journey in writing and programming, the evolution of computers from mainframes to microcomputers, the importance of working on unprestigious projects, the challenges and rewards of pursuing art and painting, the significance of studying philosophy and transitioning to AI, and the impact of the World Wide Web on society.
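The `response_mode="compact"` setting used above concatenates retrieved chunks so that fewer LLM calls are needed. The sketch below illustrates that packing idea with a greedy, character-based budget. It is an assumption-laden simplification: the `compact_chunks` helper and the 60-character limit are hypothetical, and the real synthesizer works with token counts and prompt templates.

```python
def compact_chunks(chunks, max_chars=60, separator="\n\n"):
    """Greedily pack chunks into as few prompt-sized strings as possible."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits, or a single oversized chunk stands alone
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed

# Five hypothetical retrieved chunks of different lengths
retrieved = ["A" * 25, "B" * 25, "C" * 25, "D" * 10, "E" * 10]
prompts = compact_chunks(retrieved)
print(len(retrieved), "chunks ->", len(prompts), "prompts")  # 5 chunks -> 2 prompts
```

With `similarity_top_k=5`, packing like this means the five retrieved chunks may be answered in one or two calls instead of five, which matters when chunks come from several documents.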


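## 5. Updating and Managing Indices

### 5.1 Adding New Documents to Existing Indices

```python
from llama_index.core import SimpleDirectoryReader

# Pass single files via input_files; the first positional argument is a directory
new_doc = SimpleDirectoryReader(input_files=['data/doc-1.txt']).load_data()[0]
index.insert(new_doc)
```

Note that passing the file path positionally, as in `SimpleDirectoryReader('data/doc-1.txt')`, fails with `ValueError: Directory data/doc-1.txt does not exist.` because the reader treats that argument as a directory.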

### 5.3 Index Persistence and Serialization

```python
from llama_index.core import StorageContext, load_index_from_storage

# Write the index to disk
index.storage_context.persist("save_dir")

# Rebuild the storage context from disk and reload the index
storage_context = StorageContext.from_defaults(persist_dir="save_dir")
loaded_index = load_index_from_storage(storage_context)
print(loaded_index)
```

<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x142df0950>


## 6. Advanced Querying Techniques

### 6.1 Query Preprocessing

```python
import re

def preprocess_query(query):
    # Normalize text
    query = query.lower()
    # Remove special characters
    query = re.sub(r'[^\w\s]', '', query)
    # Expand common abbreviations (whole words only, so "main" is left alone)
    query = re.sub(r'\bai\b', 'artificial intelligence', query)
    return query

preprocessed_query = preprocess_query("What's the latest in AI?")
response = query_engine.query(preprocessed_query)
print(response)
```

The latest in AI involves advancements in various fields such as natural language processing, computer vision, and machine learning. Researchers are working on developing more sophisticated AI models that can understand and generate human-like text, recognize objects and patterns in images, and improve decision-making processes. Additionally, there is a focus on creating AI systems that can learn from limited data and adapt to new environments, pushing the boundaries of what AI can achieve in terms of intelligence and autonomy.
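If more than one abbreviation needs expanding, a table-driven variant keeps the preprocessing step easy to extend. The following is a sketch under stated assumptions: the `ABBREVIATIONS` table and the `expand_abbreviations` name are hypothetical, and matching is restricted to whole words so that letters inside longer words are never rewritten.

```python
import re

# Hypothetical abbreviation table; extend it for your own domain
ABBREVIATIONS = {
    "ai": "artificial intelligence",
    "llm": "large language model",
    "nlp": "natural language processing",
}

# One pattern that matches any listed abbreviation as a whole word
_ABBREV_RE = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(query):
    """Lowercase, strip punctuation, then expand whole-word abbreviations."""
    query = query.lower()
    query = re.sub(r"[^\w\s]", "", query)
    return _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], query)

print(expand_abbreviations("What's the latest in AI?"))
# whats the latest in artificial intelligence
print(expand_abbreviations("Explain the main idea."))
# explain the main idea
```

The second call shows why the word boundary matters: "main" contains the letters "ai" but is left untouched.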


### 6.2 Hybrid Search Strategies

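```python
from llama_index.core import KeywordTableIndex, QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import BaseRetriever

# `index` and `all_docs` come from section 4.1
keyword_retriever = KeywordTableIndex.from_documents(all_docs).as_retriever()
vector_retriever = index.as_retriever(similarity_top_k=5)

class HybridRetriever(BaseRetriever):
    """Union of vector and keyword results, de-duplicated by node ID."""
    def _retrieve(self, query_bundle: QueryBundle):
        nodes = vector_retriever.retrieve(query_bundle) + keyword_retriever.retrieve(query_bundle)
        unique = {}
        for n in nodes:
            unique.setdefault(n.node.node_id, n)  # keep the first occurrence
        return list(unique.values())

query_engine = RetrieverQueryEngine.from_args(HybridRetriever())
response = query_engine.query("What are the environmental impacts of renewable energy?")
print(response)
```

Note that a plain function cannot be passed to `RetrieverQueryEngine.from_args`; doing so fails with `AttributeError: 'function' object has no attribute 'retrieve'`. Wrapping the merge logic in a `BaseRetriever` subclass avoids this, and keyword retrieval here comes from a separate `KeywordTableIndex` rather than from the vector index.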
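A simple union treats every retrieved node equally. One common alternative for merging two ranked result lists, not used in the original example, is reciprocal rank fusion (RRF), where each list contributes `1 / (k + rank)` to a node's score. The sketch below works on plain node IDs; the IDs are made up, and `k=60` is the conventional default rather than a value tuned for this data.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of IDs; each hit contributes 1 / (k + rank), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))

# Hypothetical ranked results from the two retrievers
keyword_hits = ["n3", "n1", "n7"]
vector_hits = ["n1", "n4", "n3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['n1', 'n3', 'n4', 'n7']
```

Nodes found by both retrievers (`n1`, `n3`) rise to the top, while nodes found by only one keep their relative order. RRF needs only ranks, so it works even when the two retrievers produce scores on incompatible scales.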