<a href="https://colab.research.google.com/github/isamdr86/towards-ai/blob/main/notebooks/12-Improve_Query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Setup Variables


In [1]:
!pip install -q llama-index==0.10.57 openai==1.37.0 llama-index-finetuning llama-index-embeddings-huggingface llama-index-embeddings-cohere llama-index-readers-web cohere==5.6.2 tiktoken==0.7.0 chromadb==0.5.5 html2text sentence_transformers pydantic llama-index-vector-stores-chroma==0.1.10 kaleido==0.2.1 llama-index-llms-gemini==0.1.11

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/56.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.5/56.5 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m46.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━

In [2]:
%%capture
!pip install openai==1.55.3 httpx==0.27.2 tiktoken==0.7.0 --force-reinstall --quiet

In [3]:
import os

from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
os.environ["GOOGLE_API_KEY"] = userdata.get('google_api_key')

In [4]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.
import nest_asyncio

nest_asyncio.apply()

# Load a Model


In [5]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=0, model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load Indexes


In [6]:
from huggingface_hub import hf_hub_download

vectorstore = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir=".")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


vectorstore.zip:   0%|          | 0.00/97.2M [00:00<?, ?B/s]

In [7]:
!unzip -o vectorstore.zip

Archive:  vectorstore.zip
   creating: ai_tutor_knowledge/
   creating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [8]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

# Create the vector index
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
vector_index = VectorStoreIndex.from_vector_store(vector_store)

# Multi-Step Query Engine


## GPT-4o-mini


In [9]:
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform, #generate subqueries from original query
)

step_decompose_transform_gpt4o = StepDecomposeQueryTransform(verbose=True, llm=Settings.llm)

In [10]:
from llama_index.core.query_engine.multistep_query_engine import MultiStepQueryEngine

#Default query engine
query_engine_gpt4o_mini = vector_index.as_query_engine()

# Multi Step Query Engine
multi_step_query_engine = MultiStepQueryEngine(
    query_engine = query_engine_gpt4o_mini,
    query_transform = step_decompose_transform_gpt4o,
    index_summary = "Used to answer the Questions about RAG, Machine Learning, Deep Learning, and Generative AI",
)

# Query Dataset

## Default

In [11]:
# Default query engine
query_engine = vector_index.as_query_engine()
res = query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")
print(res.response)

The provided information does not include details about the Llama 3.1 Model or BERT. However, it does discuss the LLaMA model and the PEFT (Parameter-Efficient Fine-Tuning) library, which are relevant to fine-tuning and adapting models for various tasks.

The LLaMA model is a foundational model used for natural language processing tasks, and it can be fine-tuned using methods like LoRA (Low-Rank Adaptation) through the PEFT library. This library offers tools for efficient fine-tuning, allowing users to adapt the LLaMA model with minimal additional parameters and reduced training time.

Additionally, the Llama-Adapter is a specific PEFT method designed to transform the LLaMA model into an instruction-following model by integrating learnable adaptation prompts while preserving the model's pre-trained knowledge. This method is efficient, requiring only a small number of learnable parameters and a short fine-tuning duration.

For more detailed information about Llama 3.1 and BERT, addition

In [12]:
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 781b7b12-eca2-47c0-a66e-9d6be670e951
Title	 LLaMA
Text	 on how to fine-tune LLaMA model using LoRA method via the 🤗 PEFT library with intuitive UI. 🌎 - A [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-open-llama.ipynb) on how to deploy Open-LLaMA model for text generation on Amazon SageMaker. 🌎 ## LlamaConfig[[autodoc]] LlamaConfig## LlamaTokenizer[[autodoc]] LlamaTokenizer    - build_inputs_with_special_tokens    - get_special_tokens_mask    - create_token_type_ids_from_sequences    - save_vocabulary## LlamaTokenizerFast[[autodoc]] LlamaTokenizerFast    - build_inputs_with_special_tokens    - get_special_tokens_mask    - create_token_type_ids_from_sequences    - update_post_processor    - save_vocabulary## LlamaModel[[autodoc]] LlamaModel    - forward## LlamaForCausalLM[[autodoc]] LlamaForCausalLM    - forward## LlamaForSequenceClassification[[autodoc]] LlamaForSequenceClassif

## GPT-4o-mini Multi-Step


In [13]:
response = multi_step_query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")
print(response.response)

[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: What is the Llama 3.1 Model?
[0m[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: What are the key features of the Llama 3.1 Model?
[0m[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: What are the key features of the Llama 3.1 Model?
[0mLlama 3.1 is an advanced open-source AI model developed by Meta, recognized for its significant scale and capabilities. It is the largest in the Llama series, having been trained on over 15 trillion tokens with the help of more than 16,000 H100 GPUs. One of its standout features is a context length of 128K, which allows it to process and understand longer texts effectively. The model excels in reasoning, coding, and multilingual processing, with approximately 50% of its training data consisting of multilingual tokens. It demonstrates strong logical r

In [14]:
for query, response in response.metadata['sub_qa']:
    print(f"**{query}**\n{response}\n")

**What is the Llama 3.1 Model?**
Llama 3.1 is an advanced open-source AI model developed by Meta, recognized as the largest in the Llama series, trained on over 15 trillion tokens using more than 16,000 H100 GPUs. It features a 128K context length and enhanced capabilities in reasoning, coding, and multilingual processing. The model supports zero-shot tool use and is designed to generate high-quality code while demonstrating strong logical reasoning and problem-solving skills. Llama 3.1 has shown superior performance in benchmark tests compared to other models like GPT-4o and Claude 3.5 Sonnet, particularly in areas such as mathematical reasoning, complex reasoning, and long text processing.

**What are the key features of the Llama 3.1 Model?**
The Llama 3.1 model boasts several key features, including:

1. **Model Scale and Training**: It is the largest model from Meta, trained on over 15 trillion tokens using more than 16,000 H100 GPUs.

2. **Extended Context Length**: The model sup

In [15]:
for src in response.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 f5453fa3-1b20-4c2e-8549-9882cc954df3
Text	 Llama 3.1 models  especially the 405 billion parameter version (also the 70B  8B)?   For the 405B parameter version  substantial GPU resources are required  up to 16K H100 GPUs for training  with 80GB HBM3 memory each  connected via NVLink within servers equipped with eight GPUs and two CPUs. Smaller versions (70B  8B) have lower resource requirements  using Nvidia Quantum2 InfiniBand fabric with 400 Gbps interconnects between GPUs  making them more accessible for many organizations  while storage requirements include a distributed file system offering up to 240 PB of storage with a peak throughput of 7 TB/s. Recently  Elie Bakouch (known for training LLMs on Hugging Face) shared that one can fine-tune Llama 3 405B using 8 H100 GPUs.   5  What specific advantages does Llama 3.1 offer in terms of performance  cost  and potential cost savings compared to closed models like GPT-4o?   Llama 3.1 offers significant advantages in performance

# Test gemini-1.5-flash Multi-Step


In [16]:
from llama_index.core import ServiceContext
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform,
)
from llama_index.core.query_engine.multistep_query_engine import MultiStepQueryEngine

from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash")

service_context_gemini = ServiceContext.from_defaults(llm=llm)

step_decompose_transform = StepDecomposeQueryTransform(llm=llm, verbose=True)

query_engine_gemini = vector_index.as_query_engine(
    service_context=service_context_gemini
)
query_engine_gemini = MultiStepQueryEngine(
    query_engine=query_engine_gemini,
    query_transform=step_decompose_transform,
    index_summary="Used to answer the Questions about RAG, Machine Learning, Deep Learning, and Generative AI",
)

  service_context_gemini = ServiceContext.from_defaults(llm=llm)


In [17]:
response_gemini = query_engine_gemini.query("Write about Llama 3.1 Model, BERT and PEFT")

[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: What are Llama 3.1, BERT, and PEFT, and how do they relate to RAG, Machine Learning, Deep Learning, and Generative AI?

[0m[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: What are the key characteristics and applications of Llama 3.1, BERT, and PEFT within the context of machine learning, deep learning, and generative AI?

[0m[1;3;33m> Current query: Write about Llama 3.1 Model, BERT and PEFT
[0m[1;3;38;5;200m> New query: None

[0m

In [18]:
response_gemini.response

'Llama 3.1 is an open-source language model that has been optimized for performance and cost efficiency. It incorporates techniques such as weight pruning and knowledge distillation, resulting in a more compact version known as Llama-3.1-Minitron. This model is designed to maintain high performance while minimizing computational requirements, making it suitable for a range of applications in natural language processing. Its architecture supports various parameter sizes, with larger versions offering significant capabilities for high-performance tasks.\n\nBERT, or Bidirectional Encoder Representations from Transformers, is a prominent language model that utilizes a transformer architecture to understand the context of words within sentences. This capability allows BERT to excel in various natural language processing tasks, including sentiment analysis, question answering, and language translation. Its bidirectional approach enables a deeper understanding of context, which is crucial for

## Test Retriever on Multistep


In [None]:
# import llama_index
# from llama_index.core.indices.query.schema import QueryBundle

# t = QueryBundle("How Retrieval Augmented Generation (RAG) work?")
# query_engine_gemini.retrieve(t)

## Subquestion Query Engine

In [19]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = vector_index.as_query_engine()

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="LlamaIndex",
            description="Used to answer the Questions about RAG, Machine Learning, Deep Learning, and Generative AI",
        ),
    ),
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)

response = sub_question_engine.query("Write about Llama 3.1 Model, BERT and PEFT")


Generated 5 sub questions.
[1;3;38;2;237;90;200m[LlamaIndex] Q: What are the key features and improvements of the Llama 3.1 model compared to its predecessors?
[0m[1;3;38;2;90;149;237m[LlamaIndex] Q: How does BERT work and what are its main applications in natural language processing?
[0m[1;3;38;2;11;159;203m[LlamaIndex] Q: What is PEFT (Parameter-Efficient Fine-Tuning) and how does it enhance the performance of models like BERT?
[0m[1;3;38;2;155;135;227m[LlamaIndex] Q: What are the differences in architecture between Llama 3.1 and BERT?
[0m[1;3;38;2;237;90;200m[LlamaIndex] Q: In what scenarios is PEFT particularly beneficial for fine-tuning models?
[0m[1;3;38;2;155;135;227m[LlamaIndex] A: The provided information does not detail the architectural differences between Llama 3.1 and BERT. It primarily focuses on the specifications, performance, and advantages of Llama 3.1, particularly its 405 billion parameter version, as well as its open-source nature and hardware requiremen

In [20]:
response.response

'Llama 3.1 is a state-of-the-art AI model developed by Meta, notable for its significant advancements over previous versions. It boasts an impressive scale, having been trained on over 15 trillion tokens with more than 16,000 H100 GPUs. This model features a 128K context length, enhancing its ability to process longer texts and complex interactions. Additionally, Llama 3.1 demonstrates improved reasoning and coding capabilities, excels in multilingual processing, and supports zero-shot tool use, making it versatile for various applications. Its performance benchmarks indicate superiority over earlier models and competitors in areas such as mathematical reasoning and long text processing.\n\nBERT, or Bidirectional Encoder Representations from Transformers, utilizes a transformer architecture that processes text in both directions, allowing for a comprehensive understanding of context. It is pre-trained on tasks like Masked Language Modeling and Next Sentence Prediction, which help it ge

# HyDE Transform


In [21]:
query_engine = vector_index.as_query_engine()

In [22]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine.transform_query_engine import TransformQueryEngine

hyde = HyDEQueryTransform(include_original=True) # The include_original argument decides whether to include the original query string as one of the embedding strings during retrieval.
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

In [23]:
response = hyde_query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")

In [24]:
response.response

'Llama 3.1 405B is a significant advancement in the field of AI, developed by Meta. It stands out as the largest open-source model to date, trained on over 15 trillion tokens using more than 16,000 H100 GPUs. This extensive training has enabled it to achieve a 128K context length, which enhances its capabilities in reasoning, coding, and multilingual processing. The model has been designed to perform on par with leading proprietary models in various areas, including general knowledge, steerability, and tool use.\n\nIn contrast, BERT (Bidirectional Encoder Representations from Transformers) is a model developed by Google that focuses on understanding the context of words in a sentence by looking at the words that come before and after it. BERT has been widely used for tasks such as natural language understanding and has set benchmarks in various NLP tasks.\n\nPEFT (Parameter-Efficient Fine-Tuning) refers to techniques that allow models to be fine-tuned with fewer parameters, making the 

In [25]:
for src in response.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 5624cdc8-2997-4e4d-82d1-c7383d389215
Text	 3.1 405B is Metas largest model  trained with over 15 trillion tokens. For this  Meta optimized the entire training stack and trained it on more than 16 000 H100 GPUs  making it the first Llama model trained at this scale.   According to Meta  this version of the original model (Llama 1 and Llama 2) has 128K context length  improved reasoning and coding capabilities. Meta has also upgraded both multilingual 8B and 70B models.   Key Features of Llama 3.1 40 5B:Llama 3.1 comes with a host of features and capabilities that appeal to The users  such as:   RAG & tool use  Meta states that you can use Llama system components to extend the model using zero-shot tool use and build agentic behaviors with RAG.   Multi-lingual  Llama 3 naturally supports multilingual processing. The pre-training data includes about 50% multilingual tokens and can process and understand multiple languages.   Programming and Reasoning  Llama 3 has powerful program

In [26]:
query_bundle = hyde("Write about Llama 3.1 Model, BERT and PEFT")

In [27]:
hyde_doc = query_bundle.embedding_strs[0]

In [28]:
hyde_doc

'The Llama 3.1 model, developed by Meta, represents a significant advancement in the field of natural language processing (NLP). It builds upon the foundation laid by its predecessors, Llama 1 and Llama 2, by incorporating more extensive training data and improved architectural designs. Llama 3.1 is designed to enhance performance in various NLP tasks, such as text generation, summarization, and question-answering, making it a versatile tool for developers and researchers alike.\n\nIn contrast, BERT (Bidirectional Encoder Representations from Transformers), introduced by Google in 2018, revolutionized the way models understand context in language. BERT employs a transformer architecture that processes text bidirectionally, allowing it to capture the nuances of language more effectively than previous models that read text in a unidirectional manner. This capability enables BERT to excel in tasks like sentiment analysis, named entity recognition, and other applications requiring a deep u