# LlamaIndex: Starter Tutorial (Using Local LLMs)

- Starter Tutorial (Using Local LLMs)<br>
  https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/
- Embedding model **BAAI/bge-base-en-v1.5** and LLM **meta-llama/Llama-3.1-8B** (on Hugging Face)
  - https://huggingface.co/BAAI/bge-base-en-v1.5
  - https://huggingface.co/meta-llama/Llama-3.1-8B
- **LlamaHub Integrations**
  - https://llamahub.ai/
- **Ollama Jupyter Notebook Integration**
  - https://www.restack.io/p/ollama-answer-jupyter-notebook-cat-ai
- **Ollama Python Library**
  - https://github.com/ollama/ollama-python

## SETUP

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables (for API key)
load_dotenv()

# Set up OpenAI API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable or add it to a .env file")

# Define the model to use
MODEL_GPT = "gpt-4o-mini"

## Ollama
- https://ollama.com/
- https://github.com/ollama/ollama

**Llama 3.1 8B (4.9 GB)**

```shell
ollama run llama3.1
ollama run llama3.1:8b
```
```shell
ollama pull llama3.1
ollama list
ollama ps
ollama rm llama3.1
```
```shell
ollama show llama3.1:8b
```
```
  Model
    architecture        llama
    parameters          8.0B
    context length      131072
    embedding length    4096
    quantization        Q4_K_M

  Parameters
    stop    "<|start_header_id|>"
    stop    "<|end_header_id|>"
    stop    "<|eot_id|>"

  License
    LLAMA 3.1 COMMUNITY LICENSE AGREEMENT
    Llama 3.1 Version Release Date: July 23, 2024
```

## SETUP (LlamaIndex, Ollama, HuggingFace)

In [2]:
# pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
# pip install llama-index-llms-ollama llama-index-embeddings-huggingface

In [3]:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

## Basic Agent Example

In [19]:
# Ollama Jupyter Notebook Integration
# - https://www.restack.io/p/ollama-answer-jupyter-notebook-cat-ai
# Ollama Python Library
# - https://github.com/ollama/ollama-python
import ollama

ollama.pull('llama3.1')
ollama.list()
# ollama.delete('llama3.1')

ListResponse(models=[Model(model='llama3.1:latest', modified_at=datetime.datetime(2025, 3, 27, 11, 3, 8, 745338, tzinfo=TzInfo(+01:00)), digest='46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e', size=4920753328, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='8.0B', quantization_level='Q4_K_M')), Model(model='llama3.1:8b', modified_at=datetime.datetime(2025, 3, 27, 9, 47, 58, 235191, tzinfo=TzInfo(+01:00)), digest='46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e', size=4920753328, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='8.0B', quantization_level='Q4_K_M'))])
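
The `size` field in the listing above is reported in bytes. As a quick check (a minimal sketch; the helper name `human_size` is made up for illustration), the value 4920753328 matches the 4.9 GB quoted for Llama 3.1 8B when decimal gigabytes are used:

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count in decimal gigabytes (1 GB = 10**9 bytes)."""
    return f"{num_bytes / 10**9:.1f} GB"


size_bytes = 4920753328  # copied from the ollama.list() output above
print(human_size(size_bytes))
```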

In [14]:
import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.ollama import Ollama

# Define a simple calculator tool
def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

# Create an agent workflow with our calculator tool
# agent = FunctionAgent(
#     tools=[multiply],
#     llm=Ollama(model="llama3.1", request_timeout=360.0),
#     system_prompt="You are a helpful assistant that can multiply two numbers.",
# )
agent = FunctionAgent(
    name="Agent",
    description="Useful for multiplying two numbers",
    tools=[multiply],
    llm=Ollama(model="llama3.1", request_timeout=360.0),
    system_prompt="You are a helpful assistant that can multiply two numbers.",
)

# async def main():
#     # Run the agent
#     response = await agent.run("What is 1234 * 4567?")
#     print(str(response))

# # Run the agent
# if __name__ == "__main__":
#     asyncio.run(main())
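
The agent learns what `multiply` does from the function itself: the name, the type-annotated signature, and the docstring are what a tool description can be built from. The sketch below uses only the standard library's `inspect` module to show that information. It illustrates what is available to a framework, not LlamaIndex's actual implementation, and `describe_tool` is a hypothetical helper.

```python
import inspect


def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b


def describe_tool(fn) -> dict:
    """Collect the name, signature, and docstring of a plain function."""
    return {
        "name": fn.__name__,
        "signature": str(inspect.signature(fn)),
        "description": inspect.getdoc(fn),
    }


info = describe_tool(multiply)
print(info["name"])
print(info["signature"])
print(info["description"])
```

A vague or missing docstring leaves the model with little to decide when the tool applies, so it is worth keeping the docstring short and specific.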

In [15]:
# Run the agent
response = await agent.run("What is 1234 * 4567?")

In [16]:
print(response)

The result of multiplying 1234 and 4567 is 5,635,678.
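
A language model can produce a plausible-looking number without actually calling the tool, so it is worth checking the reply independently. A minimal check in plain Python, with the expected value taken from the agent's answer above:

```python
def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b


expected = 5_635_678  # the figure reported by the agent above
result = multiply(1234, 4567)
print(result == expected)
```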


## Adding Chat History

In [21]:
# from llama_index.core.workflow import Context

# # create context
# ctx = Context(agent)

# # run agent with context
# response = await agent.run("My name is Logan", ctx=ctx)
# response = await agent.run("What is my name?", ctx=ctx)

In [22]:
# ERROR: AttributeError: 'FunctionAgent' object has no attribute '_get_steps'
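
An `AttributeError` like this one may point to the installed package version rather than to the code, although that is only a guess here. A minimal sketch for printing the installed versions with the standard library; the package names are taken from the pip commands above, and `installed_version` is a hypothetical helper:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("llama-index-core", "llama-index-llms-ollama"):
    print(name, installed_version(name))
```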

## Adding Chat History (AgentWorkflow Basic Introduction)
- https://docs.llamaindex.ai/en/stable/examples/agent/agent_workflow_basic/

In [34]:
# %pip install llama-index
# %pip install tavily-python

In [35]:
# from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama

# llm = OpenAI(model="gpt-4o-mini")
llm = Ollama(model="llama3.1", request_timeout=360.0)

In [36]:
# Creating a tool
import os
from tavily import AsyncTavilyClient

tavily_api_key = os.getenv("TAVILY_API_KEY")

async def search_web(query: str) -> str:
    """Useful for using the web to answer questions."""
    client = AsyncTavilyClient(api_key=tavily_api_key)
    return str(await client.search(query))
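
`os.getenv` returns `None` when `TAVILY_API_KEY` is unset, and in that case the problem only surfaces once the tool is called. A minimal sketch of a fail-fast check, mirroring the `OPENAI_API_KEY` check in the SETUP cell; `require_env` is a hypothetical helper, not part of Tavily or LlamaIndex, and `EXAMPLE_API_KEY` is a placeholder name:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Please set the {name} environment variable")
    return value


# Demonstration with a placeholder variable so the sketch runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook this would be called as `require_env("TAVILY_API_KEY")` before the tool is defined.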

In [37]:
# Creating an AgentWorkflow that uses the tool
from llama_index.core.agent.workflow import AgentWorkflow

workflow = AgentWorkflow.from_tools_or_functions(
    [search_web],
    llm=llm,
    system_prompt="You are a helpful assistant that can search the web for information.",
)

In [None]:
# Running the Agent
# response = await workflow.run(user_msg="What is the weather in San Francisco?")
response = await workflow.run(user_msg="What is the weather in Prague?")
print(str(response))

In [None]:
# Maintaining State
from llama_index.core.workflow import Context

ctx = Context(workflow)

response = await workflow.run(
    user_msg="My name is Logan, nice to meet you!", ctx=ctx
)
print(str(response))

In [None]:
response = await workflow.run(user_msg="What is my name?", ctx=ctx)
print(str(response))

## Adding RAG Capabilities

Example data
```shell
mkdir data
wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -O data/paul_graham_essay.txt
```
- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt

In [40]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import asyncio
import os

# Settings control global defaults
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)

# Create a RAG tool using LlamaIndex
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    # we can optionally override the embed_model here
    # embed_model=Settings.embed_model,
)
query_engine = index.as_query_engine(
    # we can optionally override the llm here
    # llm=Settings.llm,
)

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

async def search_documents(query: str) -> str:
    """Useful for answering natural language questions about a personal essay written by Paul Graham."""
    response = await query_engine.aquery(query)
    return str(response)

# Create an enhanced workflow with both tools
agent = AgentWorkflow.from_tools_or_functions(
    [multiply, search_documents],
    llm=Settings.llm,
    system_prompt="""You are a helpful assistant that can perform calculations
    and search through documents to answer questions.""",
)

# # Now we can ask questions about the documents or do calculations
# async def main():
#     response = await agent.run(
#         "What did the author do in college? Also, what's 7 * 8?"
#     )
#     print(response)

# # Run the agent
# if __name__ == "__main__":
#     asyncio.run(main())

In [None]:
response = await agent.run("What did the author do in college? Also, what's 7 * 8?")
print(response)
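
Behind `search_documents`, the query engine embeds the question and retrieves the document chunks whose embeddings are most similar to it. The toy sketch below shows that ranking step with cosine similarity on hand-made 3-dimensional vectors. The vectors, the chunk labels, and the `top_k` helper are all made up for illustration; real `BAAI/bge-base-en-v1.5` embeddings have 768 dimensions, and the actual retrieval code in LlamaIndex is more involved.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunks, k=2):
    """Return the labels of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in chunks.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]


# Hypothetical chunk embeddings (toy values, not produced by a model).
chunks = {
    "college": [0.9, 0.1, 0.0],
    "startups": [0.1, 0.9, 0.1],
    "painting": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of a question about college
print(top_k(query, chunks))
```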

## Storing the RAG Index

In [46]:
# Save the index
# index.storage_context.persist("storage")

# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
index.storage_context.persist("storage_bge")  # embed_model = BGE

# Later, load the index
from llama_index.core import StorageContext, load_index_from_storage

# storage_context = StorageContext.from_defaults(persist_dir="storage")
storage_context = StorageContext.from_defaults(persist_dir="storage_bge")
index = load_index_from_storage(
    storage_context,
    # we can optionally override the embed_model here
    # it's important to use the same embed_model as the one used to build the index
    # embed_model=Settings.embed_model,
)
query_engine = index.as_query_engine(
    # we can optionally override the llm here
    # llm=Settings.llm,
)
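
Embedding the documents is the slow part, so a common pattern is to build the index only when no persisted copy exists and to load it otherwise. The sketch below shows that control flow with the standard library only. `load_or_build` and its `build` and `load` callables are hypothetical stand-ins for the `VectorStoreIndex.from_documents(...)` plus `persist(...)` step and the `load_index_from_storage(...)` step in the cell above.

```python
import os
import tempfile


def load_or_build(persist_dir, build, load):
    """Load from persist_dir if it exists; otherwise build (and persist) a new copy."""
    if os.path.isdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


def build(persist_dir):
    # Stand-in for building the index and persisting it to persist_dir.
    os.makedirs(persist_dir)
    return "built"


def load(persist_dir):
    # Stand-in for loading the index from persist_dir.
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "storage_bge")
    first = load_or_build(target, build, load)   # directory missing: build
    second = load_or_build(target, build, load)  # directory present: load
    print(first, second)
```

As the comment in the cell above notes, the loaded index should be queried with the same embedding model that was used to build it.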

In [43]:
ollama.list()

ListResponse(models=[Model(model='llama3.1:latest', modified_at=datetime.datetime(2025, 3, 27, 11, 3, 8, 745338, tzinfo=TzInfo(+01:00)), digest='46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e', size=4920753328, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='8.0B', quantization_level='Q4_K_M')), Model(model='llama3.1:8b', modified_at=datetime.datetime(2025, 3, 27, 9, 47, 58, 235191, tzinfo=TzInfo(+01:00)), digest='46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e', size=4920753328, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='8.0B', quantization_level='Q4_K_M'))])

In [44]:
ollama.delete('llama3.1')

StatusResponse(status='success')

In [45]:
ollama.list()

ListResponse(models=[Model(model='llama3.1:8b', modified_at=datetime.datetime(2025, 3, 27, 9, 47, 58, 235191, tzinfo=TzInfo(+01:00)), digest='46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e', size=4920753328, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='8.0B', quantization_level='Q4_K_M'))])
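
Both tags in the earlier listing carry the same digest, which suggests that `llama3.1:latest` and `llama3.1:8b` refer to the same model data; that would be consistent with `llama3.1:8b` still being listed after `llama3.1` is deleted. The sketch groups tags by digest using values copied from the output above (the `group_by_digest` helper is made up for illustration):

```python
from collections import defaultdict


def group_by_digest(models):
    """Map each digest to the list of tags that share it."""
    groups = defaultdict(list)
    for tag, digest in models:
        groups[digest].append(tag)
    return dict(groups)


DIGEST = "46e0c10c039e019119339687c3c1757cc81b9da49709a3b3924863ba87ca666e"
models = [("llama3.1:latest", DIGEST), ("llama3.1:8b", DIGEST)]
groups = group_by_digest(models)
print(groups[DIGEST])
```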