# Llama3 Cookbook with Ollama and Replicate

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/llama3_cookbook_ollama_replicate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Meta developed and released the Meta [Llama 3](https://ai.meta.com/blog/meta-llama-3/) family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

In this notebook, we demonstrate how to use Llama 3 with LlamaIndex across the following use cases:
1. Basic completion / chat 
2. Basic RAG (Vector Search, Summarization)
3. Advanced RAG (Routing, Sub-Questions)
4. Text-to-SQL 
5. Structured Data Extraction
6. Chat Engine + Memory
7. Agents


We use Llama3-8B through Ollama, and Llama3-70B through Replicate. 

## Installation and Setup

In [20]:
# !pip install llama-index
# !pip install llama-index-llms-ollama
# !pip install llama-index-llms-replicate
# !pip install llama-index-embeddings-huggingface
# !pip install llama-parse
# !pip install replicate

In [21]:
import nest_asyncio

nest_asyncio.apply()

### Setup LLM using Ollama

In [22]:
from llama_index.llms.ollama import Ollama

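# A longer request_timeout may help avoid the ReadTimeout seen in the
# summarization cell when local inference is slow.
llm = Ollama(model="llama3.2", request_timeout=120.0)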

### Setup LLM using Replicate

Make sure you have REPLICATE_API_TOKEN specified!

In [23]:
# import os
# os.environ["REPLICATE_API_TOKEN"] = "<YOUR_API_KEY>"

In [24]:
# from llama_index.llms.replicate import Replicate

# NOTE: despite its name, llm_replicate is a local Ollama model in this run,
# not a Replicate-hosted model.
llm_replicate = Ollama(model="llama3.1")

### Setup Embedding Model

In [25]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

### Define Global Settings Configuration

In LlamaIndex, you can define global settings so you don't have to pass the LLM / embedding model objects everywhere.

In [26]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

### Download Data

Here you'll download data that's used in section 2 and onwards.

We'll download some articles on Kendrick, Drake, and their beef (as of May 2024).

In [27]:
# !mkdir data
# !wget "https://www.dropbox.com/scl/fi/t1soxfjdp0v44an6sdymd/drake_kendrick_beef.pdf?rlkey=u9546ymb7fj8lk2v64r6p5r5k&st=wjzzrgil&dl=1" -O data/drake_kendrick_beef.pdf
# !wget "https://www.dropbox.com/scl/fi/nts3n64s6kymner2jppd6/drake.pdf?rlkey=hksirpqwzlzqoejn55zemk6ld&st=mohyfyh4&dl=1" -O data/drake.pdf
# !wget "https://www.dropbox.com/scl/fi/8ax2vnoebhmy44bes2n1d/kendrick.pdf?rlkey=fhxvn94t5amdqcv9vshifd3hj&st=dxdtytn6&dl=1" -O data/kendrick.pdf

In [28]:
# Function to save document content as JSON with text and metadata
import json
import os


def save_as_json(docs, file_path, sub_dir):
    json_file_name = os.path.splitext(os.path.basename(file_path))[0] + ".json"
    json_file_path = os.path.join("./data", sub_dir, json_file_name)

    # Create ./data/<sub_dir> if it does not exist yet; open() would fail otherwise
    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)

    doc_data = [{"text": doc.text, "metadata": doc.metadata} for doc in docs]

    with open(json_file_path, "w", encoding="utf-8") as json_file:
        json.dump(doc_data, json_file, ensure_ascii=False, indent=2)

    print(f"Saved: {json_file_path}")

### Load Data

You can load the data with LlamaParse or, if you don't have an account, with our free pypdf reader (the default PDF reader in SimpleDirectoryReader). The LlamaParse cell below is commented out, and the loading cell after it uses SimpleDirectoryReader.

1. LlamaParse: Sign up for an account at cloud.llamaindex.ai. You get 1k free pages a day, and the paid plan includes 7k free pages plus 0.3c per additional page. LlamaParse is a good option for parsing complex documents, such as PDFs with charts and tables.

2. Default PDF parser (in `SimpleDirectoryReader`): If you don't want to sign up for an account or use a PDF service, use the default PyPDF reader bundled in our file loader. It's a good choice for getting started!

In [29]:
# from llama_parse import LlamaParse
# import os

# llama_cloud_api_key = os.environ.get("LLAMA_CLOUD_API_KEY")

# if not llama_cloud_api_key:
#     raise ValueError("LLAMA_CLOUD_API_KEY environment variable not found. Please set it to use LlamaParse, or use SimpleDirectoryReader instead.")

# docs_kendrick = LlamaParse(api_key=llama_cloud_api_key,result_type="text").load_data("./data/kendrick.pdf")
# docs_drake = LlamaParse(api_key=llama_cloud_api_key,result_type="text").load_data("./data/drake.pdf")
# docs_both = LlamaParse(api_key=llama_cloud_api_key,result_type="text").load_data(
#     "./data/drake_kendrick_beef.pdf"
# )

# sub_dir = "llama_parse"
# save_as_json(docs_kendrick,"./data/kendrick.pdf", sub_dir)
# save_as_json(docs_drake,"./data/drake.pdf", sub_dir)
# save_as_json(docs_both,"./data/drake_kendrick_beef.pdf", sub_dir)

In [30]:
from llama_index.core import SimpleDirectoryReader

docs_kendrick = SimpleDirectoryReader(input_files=["./data/kendrick.pdf"]).load_data()
docs_drake = SimpleDirectoryReader(input_files=["./data/drake.pdf"]).load_data()
docs_both = SimpleDirectoryReader(input_files=["./data/drake_kendrick_beef.pdf"]).load_data()

sub_dir = "simple_directory_reader"
save_as_json(docs_kendrick, "./data/kendrick.pdf", sub_dir)
save_as_json(docs_drake, "./data/drake.pdf", sub_dir)
save_as_json(docs_both, "./data/drake_kendrick_beef.pdf", sub_dir)

Saved: ./data/simple_directory_reader/kendrick.json
Saved: ./data/simple_directory_reader/drake.json
Saved: ./data/simple_directory_reader/drake_kendrick_beef.json

## 1. Basic Completion and Chat

### Call complete with a prompt

In [31]:
response = llm.complete("do you like drake or kendrick better?")

print(response)

As a neutral AI, I don't have personal preferences or opinions. I can provide information and comparisons about both Drake and Kendrick Lamar if that's what you're looking for.

Both artists are highly acclaimed and have made significant contributions to the music industry. They have distinct styles, lyrical themes, and musical genres.

Drake is known for his introspective and emotive lyrics, often exploring themes of love, relationships, and fame. His genre-bending sound combines elements of hip-hop, R&B, and pop.

Kendrick Lamar, on the other hand, is celebrated for his storytelling ability, socially conscious lyrics, and fusion of jazz and funk influences into his music. He's known for tackling complex issues like racism, police brutality, and black identity in his work.

Ultimately, the question of who is "better" depends on individual tastes and preferences. Some people may prefer Drake's more melodic flow and emotive delivery, while others might appreciate Kendrick's lyrical comp

In [32]:
stream_response = llm.stream_complete(
    "you're a drake fan. tell me why you like drake more than kendrick"
)

for t in stream_response:
    print(t.delta, end="")

.. I'm gonna give it to you straight, no cap.

First of all, let's talk about the emotional depth of Drake's music. Kendrick Lamar is definitely a master storyteller, but for me, his lyrics can sometimes feel a little too literal, too preachy. I mean, don't get me wrong, "Alright" is a banger and all, but sometimes I feel like it's hard to relate to the dude when he's just straight-up ranting about racism and police brutality.

Drake, on the other hand, has this ability to tap into my feelings in a way that Kendrick can't. His music is like therapy for me – it's like he's speaking directly to my soul. I mean, have you listened to "Marvin's Room" or "Hold On, We're Going Home"? Those songs are like the ultimate feel-good vibes, you know? They make me wanna put on a pair of sweatpants and just vibe out all day.

And don't even get me started on his versatility. Kendrick is great at making hip-hop, but Drake has this ability to experiment with different genres and styles that's unmatched.

### Call chat with a list of messages

In [33]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are Kendrick."),
    ChatMessage(role="user", content="Write a verse."),
]
response = llm.chat(messages)

In [34]:
print(response)

assistant: "I see the city lights, they reflect my pain
A reflection of the world, where love's in vain
I'm searching for a truth, that I've yet to find
A sense of belonging, but it's hard to unwind"


## 2. Basic RAG (Vector Search, Summarization)

### Basic RAG (Vector Search)

In [35]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs_both)
query_engine = index.as_query_engine(similarity_top_k=3)

In [36]:
response = query_engine.query("Tell me about family matters")

In [37]:
print(str(response))

“Family Matters” is a seven-and-a-half-minute diss track that sees Drake responding to Kendrick Lamar's attacks on him. The song is presented in three parts, with each section tackling different aspects of their feud. 

The first part delves into the personal lives of both rappers, touching upon allegations of infidelity, family struggles, and relationships. This includes claims about Kendrick having multiple children fathered by other men, including a 11-year-old daughter.

The second half focuses on issues of domestic abuse and pedophilia accusations made against Drake by Kendrick in his track “Euphoria”. 

Throughout the song, Drake defends himself against these allegations, claiming that he has only ever dated women named Whitney.
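
The query engine above retrieves the `similarity_top_k=3` chunks closest to the query embedding before passing them to the LLM. The following is a minimal sketch of that ranking step only, using hand-made toy vectors in place of real embeddings. The vectors, chunk labels, and the `top_k` helper are hypothetical, and this is not LlamaIndex's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; a real embedding model returns far more dimensions.
chunks = [
    ("chunk about Family Matters", [0.9, 0.1, 0.0]),
    ("chunk about Euphoria", [0.7, 0.3, 0.1]),
    ("chunk about album sales", [0.0, 0.2, 0.9]),
    ("chunk about tour dates", [0.1, 0.9, 0.2]),
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, chunks, k=3)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Raising `k` gives the LLM more context at the cost of a longer prompt, which is the same trade-off `similarity_top_k` controls.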


### Basic RAG (Summarization)

In [38]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(docs_both)
summary_engine = summary_index.as_query_engine()

In [39]:
response = summary_engine.query(
    "Given your assessment of this article, who won the beef?"
)

ReadTimeout: timed out

In [None]:
print(str(response))

## 3. Advanced RAG (Routing, Sub-Questions)

### Build a Router that can choose whether to do vector search or summarization

In [20]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts.",
    ),
)

summary_tool = QueryEngineTool(
    index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document.",
    ),
)

In [None]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool], select_multi=False, verbose=True
)

response = query_engine.query(
    "Tell me about the song meet the grahams - why is it significant"
)

In [None]:
print(response)
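
With `select_multi=False`, the router picks exactly one tool per query, guided by the tool descriptions. In LlamaIndex the selection is made by the LLM; the sketch below substitutes a crude word-overlap score purely to show the control flow. `Tool`, `route`, and the lambda engines are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def route(query: str, tools: List[Tool]) -> Tool:
    """Pick the single tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(tool: Tool) -> int:
        description_words = set(tool.description.lower().rstrip(".").split())
        return len(query_words & description_words)

    # max() returns the first tool on ties, so list order acts as the tie-breaker.
    return max(tools, key=overlap)


tools = [
    Tool("vector_search", "Useful for searching for specific facts.", lambda q: f"[vector] {q}"),
    Tool("summary", "Useful for summarizing an entire document.", lambda q: f"[summary] {q}"),
]

chosen = route("give me specific facts about meet the grahams", tools)
print(chosen.name, "->", chosen.run("meet the grahams"))
```

Because the choice depends entirely on the descriptions, writing distinct, specific tool descriptions matters more than the tool names.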

### Break Complex Questions down into Sub-Questions

Our Sub-Question Query Engine breaks complex questions down into sub-questions.


In [None]:
drake_index = VectorStoreIndex.from_documents(docs_drake)
drake_query_engine = drake_index.as_query_engine(similarity_top_k=3)

kendrick_index = VectorStoreIndex.from_documents(docs_kendrick)
kendrick_query_engine = kendrick_index.as_query_engine(similarity_top_k=3)

In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

drake_tool = QueryEngineTool(
    drake_index.as_query_engine(),
    metadata=ToolMetadata(
        name="drake_search",
        description="Useful for searching over Drake's life.",
    ),
)

kendrick_tool = QueryEngineTool(
    kendrick_index.as_query_engine(),
    metadata=ToolMetadata(
        name="kendrick_search",
        description="Useful for searching over Kendrick's life.",
    ),
)

In [None]:
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [drake_tool, kendrick_tool],
    llm=llm_replicate,  # defined above; an Ollama llama3.1 model in this run
    verbose=True,
)

response = query_engine.query("Which albums did Drake release in his career?")

print(response)
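
Conceptually, a sub-question engine generates sub-questions for the relevant tools, answers each one independently, and then combines the partial answers. In LlamaIndex the LLM handles both the decomposition and the final synthesis; the sketch below hard-codes both steps to show the fan-out. All names here (`decompose`, `answer_with_sub_questions`, the stub engines) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def decompose(question: str, tool_names: List[str]) -> List[Tuple[str, str]]:
    """Stand-in for the LLM step: emit one (tool, sub-question) pair per tool."""
    return [(name, f"{question} (answer using {name} only)") for name in tool_names]


def answer_with_sub_questions(
    question: str, engines: Dict[str, Callable[[str], str]]
) -> str:
    """Fan the sub-questions out to their engines and join the partial answers."""
    partial_answers = []
    for tool_name, sub_question in decompose(question, list(engines)):
        partial_answers.append(f"{tool_name}: {engines[tool_name](sub_question)}")
    # A real engine would pass these partial answers back to the LLM to synthesize.
    return "\n".join(partial_answers)


# Stub engines standing in for the Drake and Kendrick query engines.
engines = {
    "drake_search": lambda q: "stub answer from the Drake index",
    "kendrick_search": lambda q: "stub answer from the Kendrick index",
}

combined = answer_with_sub_questions("How did each artist grow up?", engines)
print(combined)
```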

## 4. Text-to-SQL 

Here, we download and use a sample SQLite database with 11 tables, with various info about music, playlists, and customers. We will limit to a select few tables for this test.

In [None]:
!wget "https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip" -O "./data/chinook.zip"
!unzip "./data/chinook.zip"

In [None]:
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
    column,
)

engine = create_engine("sqlite:///chinook.db")

In [None]:
from llama_index.core import SQLDatabase

sql_database = SQLDatabase(engine)

In [None]:
from llama_index.core.indices.struct_store import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["albums", "tracks", "artists"],
    llm=llm_replicate,
)

In [None]:
response = query_engine.query("What are some albums?")

print(response)

In [None]:
response = query_engine.query("What are some artists? Limit it to 5.")

print(response)

This last query requires a more complex join.

In [None]:
response = query_engine.query(
    "What are some tracks from the artist AC/DC? Limit it to 3"
)

print(response)

In [None]:
print(response.metadata["sql_query"])
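
To answer the AC/DC question, the generated SQL has to join tracks to albums to artists. The sketch below shows the general shape of such a query against a tiny in-memory stand-in database. The table and column names are assumed to mirror the Chinook schema, the rows are toy data, and the SQL is hand-written rather than model output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT, AlbumId INTEGER);

    INSERT INTO artists VALUES (1, 'AC/DC'), (2, 'Other Artist');
    INSERT INTO albums VALUES (10, 'Toy Album A', 1), (20, 'Toy Album B', 2);
    INSERT INTO tracks VALUES
        (100, 'Toy Track 1', 10),
        (101, 'Toy Track 2', 10),
        (102, 'Toy Track 3', 10),
        (103, 'Toy Track 4', 10),
        (200, 'Unrelated Track', 20);
    """
)

# Two joins walk from each track to its album and then to the album's artist.
sql = """
    SELECT t.Name
    FROM tracks AS t
    JOIN albums AS al ON t.AlbumId = al.AlbumId
    JOIN artists AS ar ON al.ArtistId = ar.ArtistId
    WHERE ar.Name = ?
    ORDER BY t.TrackId
    LIMIT 3
"""
rows = [name for (name,) in conn.execute(sql, ("AC/DC",))]
print(rows)
conn.close()
```

Comparing a query like this with `response.metadata["sql_query"]` is a quick way to check whether the model produced the joins you expected.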

## 5. Structured Data Extraction

An important use case for function calling is extracting structured objects. LlamaIndex provides an intuitive interface for this through `structured_predict`: define the target Pydantic class (which can be nested) and, given a prompt, LlamaIndex extracts the desired object.

**NOTE**: Since there's no native function calling support with Llama3 / Ollama, the structured extraction is performed by prompting the LLM + output parsing.

In [None]:
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str


llm = Ollama(model="llama3.2")
prompt_tmpl = PromptTemplate(
    "Generate a restaurant in a given city {city_name}"
)

In [None]:
restaurant_obj = llm.structured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
)
print(restaurant_obj)
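
The "prompting + output parsing" approach mentioned in the note can be illustrated as follows: ask the model for JSON, then parse and validate the reply into a typed object. This sketch uses a standard-library dataclass instead of Pydantic so that it runs without extra packages. `parse_restaurant` is a hypothetical helper, and `raw_output` is a hand-written stand-in for a model reply.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class Restaurant:
    name: str
    city: str
    cuisine: str


def parse_restaurant(llm_text: str) -> Restaurant:
    """Extract a JSON object from free-form LLM text and validate its fields."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", llm_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    data = json.loads(match.group(0))
    expected = {f.name for f in fields(Restaurant)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Restaurant(**{key: str(data[key]) for key in expected})


# Hand-written stand-in for a model reply; a real reply will differ.
raw_output = 'Sure! Here it is:\n{"name": "Example Cafe", "city": "Miami", "cuisine": "Cuban"}'
restaurant = parse_restaurant(raw_output)
print(restaurant)
```

Validation failures are worth surfacing as errors, since a model without native function calling can return malformed or incomplete JSON.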

## 6. Adding Chat History to RAG (Chat Engine)

In this section we create a stateful chatbot from a RAG pipeline, with our chat engine abstraction.

Unlike a stateless query engine, the chat engine maintains conversation history (through a memory module like buffer memory). It performs retrieval given a condensed question, and feeds the condensed question + context + chat history into the final LLM prompt.

Related resource: https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_condense_plus_context/
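
To make the memory idea concrete, here is a minimal sketch of a token-limited buffer that keeps only the most recent messages fitting a budget. It is not `ChatMemoryBuffer` itself: `BufferMemory` is a hypothetical class, and it assumes one whitespace-separated word counts as one token, whereas a real buffer uses a tokenizer.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


class BufferMemory:
    """Keeps the most recent messages whose combined size fits a token budget."""

    def __init__(self, token_limit: int) -> None:
        self.token_limit = token_limit
        self.messages: List[Message] = []

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: one whitespace-separated word is one token.
        return len(text.split())

    def put(self, role: str, content: str) -> None:
        self.messages.append((role, content))

    def get(self) -> List[Message]:
        """Walk backwards from the newest message until the budget is spent."""
        kept: List[Message] = []
        used = 0
        for role, content in reversed(self.messages):
            cost = self.count_tokens(content)
            if used + cost > self.token_limit:
                break
            kept.append((role, content))
            used += cost
        return list(reversed(kept))


memory = BufferMemory(token_limit=8)
memory.put("user", "Tell me about the songs Drake released")
memory.put("assistant", "He released several tracks")
memory.put("user", "What about Kendrick?")

history = memory.get()
print(history)
```

With a budget of 8, the oldest message is dropped, which is why a follow-up such as "What about Kendrick?" is first condensed into a standalone question before retrieval.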

In [None]:
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    index.as_retriever(),
    memory=memory,
    llm=llm,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about the Kendrick and Drake beef.\n"
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=True,
)

In [None]:
response = chat_engine.chat(
    "Tell me about the songs Drake released in the beef."
)
print(str(response))

In [None]:
response = chat_engine.chat("What about Kendrick?")
print(str(response))

## 7. Agents

Here we build agents with Llama 3. The agents first call simple function tools, then query engine tools built over the documents above.

### Agents And Tools

In [None]:
import json
from typing import Sequence, List

from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
from llama_index.core.agent import ReActAgent

import nest_asyncio

nest_asyncio.apply()

### Define Tools

In [None]:
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the resulting integer."""
    return a * b


def add(a: int, b: int) -> int:
    """Add two integers and return the resulting integer."""
    return a + b


def subtract(a: int, b: int) -> int:
    """Subtract two integers and return the resulting integer."""
    return a - b


def divide(a: int, b: int) -> float:
    """Divide two integers and return the result as a float."""
    return a / b


multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
subtract_tool = FunctionTool.from_defaults(fn=subtract)
divide_tool = FunctionTool.from_defaults(fn=divide)

### ReAct Agent

In [None]:
agent = ReActAgent.from_tools(
    [multiply_tool, add_tool, subtract_tool, divide_tool],
    llm=llm_replicate,
    verbose=True,
)

### Querying

In [None]:
response = agent.chat("What is (121 + 2) * 5?")
print(str(response))
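
For a question like "(121 + 2) * 5", a ReAct agent is expected to alternate between choosing a tool and reading its result: first `add`, then `multiply` on the previous observation. The sketch below replays a scripted plan to show that action/observation loop. In the real agent the LLM chooses each action; `run_plan` and the `"$prev"` placeholder are hypothetical.

```python
from typing import Callable, Dict, List, Tuple, Union

Arg = Union[int, str]  # a literal integer, or "$prev" for the previous observation


def add(a: int, b: int) -> int:
    return a + b


def multiply(a: int, b: int) -> int:
    return a * b


TOOLS: Dict[str, Callable[[int, int], int]] = {"add": add, "multiply": multiply}


def run_plan(plan: List[Tuple[str, Arg, Arg]]) -> int:
    """Execute (tool, arg, arg) steps in order, feeding each observation forward."""
    observation = 0
    for tool_name, left, right in plan:
        args = [observation if arg == "$prev" else arg for arg in (left, right)]
        observation = TOOLS[tool_name](*args)
        print(f"Action: {tool_name}{tuple(args)} -> Observation: {observation}")
    return observation


# Scripted stand-in for the actions an LLM would choose for "(121 + 2) * 5".
result = run_plan([("add", 121, 2), ("multiply", "$prev", 5)])
print(result)
```

With `verbose=True`, the agent's log shows its actual thought, action, and observation steps, which can be compared against this expected sequence.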

### ReAct Agent With RAG QueryEngine Tools

In [None]:
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)

from llama_index.core.tools import QueryEngineTool, ToolMetadata

### Create ReAct Agent using RAG QueryEngine Tools

In [None]:
drake_tool = QueryEngineTool(
    drake_index.as_query_engine(),
    metadata=ToolMetadata(
        name="drake_search",
        description="Useful for searching over Drake's life.",
    ),
)

kendrick_tool = QueryEngineTool(
    kendrick_index.as_query_engine(),
    metadata=ToolMetadata(
        name="kendrick_search",
        description="Useful for searching over Kendrick's life.",
    ),
)

query_engine_tools = [drake_tool, kendrick_tool]

In [None]:
agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm_replicate,
    verbose=True,
)

### Querying

In [None]:
response = agent.chat("Tell me about how Kendrick and Drake grew up")
print(str(response))