<a href="https://colab.research.google.com/github/isamdr86/towards-ai/blob/main/notebooks/14-Adding_Chat_ir.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Set Up Variables


In [1]:
!pip install -q llama-index==0.10.57 openai==1.37.0 llama-index-finetuning llama-index-embeddings-huggingface llama-index-embeddings-cohere llama-index-readers-web cohere==5.6.2 tiktoken==0.7.0 chromadb==0.5.5 html2text sentence_transformers pydantic llama-index-vector-stores-chroma==0.1.10 kaleido==0.2.1


In [2]:
%%capture
!pip install openai==1.55.3 httpx==0.27.2 tiktoken==0.7.0 --force-reinstall

In [3]:
import os

from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')


In [4]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.
import nest_asyncio

nest_asyncio.apply()

# Load Models


In [5]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=1, model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load Indexes


In [6]:
# Download the vector store archive from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

vectorstore = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir=".")


In [7]:
!unzip -o vectorstore.zip

Archive:  vectorstore.zip
   creating: ai_tutor_knowledge/
   creating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [8]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

# Load the vector store from the local storage.
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
vector_index = VectorStoreIndex.from_vector_store(vector_store)

# Display result


In [9]:
# A simple function to show the response and the sources.
def display_res(response):
    print("Response:\n\t", response.response.replace("\n", ""))

    print("Sources:")
    if response.source_nodes:
        for src in response.source_nodes:
            print("\tNode ID\t", src.node_id)
            print("\tText\t", src.text)
            print("\tScore\t", src.score)
            print("\t" + "-_" * 20)
    else:
        print("\tNo sources used!")

# Chat Engine


In [10]:
# Create the chat engine from the index (default chat mode).
chat_engine = vector_index.as_chat_engine(llm=Settings.llm)

In [11]:
# First Question:
response = chat_engine.chat("Use the tool to answer, how does parameter efficient finetuning work?")

display_res(response)

Response:
	 Parameter Efficient Fine Tuning (PEFT) enhances the fine-tuning of large language models (LLMs) by making targeted adjustments to model weights rather than performing full fine-tuning, which can alter every weight in the model. Here's how it works:1. **Pretrained Models**: PEFT starts with a pretrained LLM that already has significant language knowledge.2. **Task-Specific Datasets**: It employs specific datasets relevant to the task at hand, guiding the fine-tuning process.3. **Strategies**: The approach involves three main strategies:   - **Selective**: Only a subset of the model's parameters is chosen for fine-tuning, allowing for more focused adjustments.   - **Reparameterization**: Model weights are modified using low-rank representations, effectively reducing the number of parameters that need to be fine-tuned.   - **Additive**: This involves additional techniques that aid the fine-tuning process.Overall, PEFT aims to maintain the performance of LLMs while significantl

In [12]:
# Second Question:
response = chat_engine.chat("Could you tell me a joke?")
display_res(response)

Response:
	 Sure! Here's one for you:Why did the scarecrow win an award?Because he was outstanding in his field!
Sources:
	No sources used!


In [13]:
# Third Question: (check if it can recall previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was, "How does parameter efficient finetuning work?"
Sources:
	No sources used!


In [14]:
# Reset the session to clear the memory
chat_engine.reset()

In [15]:
# Fourth Question: (after the reset, it should no longer recall the previous interactions)
response = chat_engine.chat("What was the first question I asked?")
display_res(response)

Response:
	 The first question you asked was, "What was the first question I asked?"
Sources:
	No sources used!


# Streaming


In [17]:
# Stream the words as soon as they are available instead of waiting for the model to finish generation.
streaming_response = chat_engine.stream_chat(
    "Write a paragraph explaining how RAG and PEFT work, and highlight the differences between them."
)
streaming_response.print_response_stream()

Retrieval-Augmented Generation (RAG) and Parameter-Efficient Fine-Tuning (PEFT) are two innovative strategies employed to optimize the performance of natural language processing models. RAG integrates a retrieval mechanism alongside a generative model, enabling it to fetch relevant information from an external knowledge base in real-time during the generation process. This approach enhances the accuracy and relevance of the responses by grounding them in up-to-date data. In contrast, PEFT focuses on refining existing pre-trained models by fine-tuning only a limited number of parameters, rather than retraining the entire model. This method is computationally efficient, requiring significantly less computational power and labeled data, making it accessible for various applications. The primary difference between the two lies in their approach: RAG leverages external information to enrich generative outputs, while PEFT seeks to optimize internal model performance with minimal resource usa
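
`print_response_stream()` writes tokens to standard output as they arrive. If you want to handle the tokens yourself (for example, to forward them to a UI), the idea can be sketched as a loop over a token iterator. This is a minimal sketch: `consume_stream` and `fake_token_stream` are hypothetical names introduced here, and the stub generator stands in for a real model stream so the cell runs offline. With LlamaIndex, the iterable would presumably be `streaming_response.response_gen`; treat that attribute as an assumption and check it against your installed version.

```python
import sys
from typing import Iterable, Iterator


def consume_stream(tokens: Iterable[str], out=sys.stdout) -> str:
    """Write each token as soon as it arrives and return the full text."""
    parts = []
    for token in tokens:
        out.write(token)
        out.flush()  # Show partial output immediately instead of buffering.
        parts.append(token)
    out.write("\n")
    return "".join(parts)


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model's token stream."""
    for token in ["RAG ", "retrieves; ", "PEFT ", "fine-tunes."]:
        yield token


# Assumption: with a real engine, pass the response's token generator here.
full_text = consume_stream(fake_token_stream())
```

Returning the joined text keeps the complete answer available after the stream is exhausted, since a generator can only be iterated once.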

## Condense Question


The condense question mode rewrites the user's latest message into a standalone query by combining it with the previous chat history. The rewritten query is then used to retrieve the nodes from the index.
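
The rewriting step can be illustrated without any library. The sketch below is not LlamaIndex's implementation: the prompt wording, `build_condense_prompt`, `condense_question`, and the `fake_llm` stub are all hypothetical and exist only to show the idea of folding the history and the follow-up into one standalone question.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt template; the library ships its own wording.
CONDENSE_TEMPLATE = (
    "Given the conversation below and a follow-up message, rewrite the "
    "follow-up as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def build_condense_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Format (role, text) turns and the new question into one prompt."""
    lines = [f"{role}: {text}" for role, text in history]
    return CONDENSE_TEMPLATE.format(history="\n".join(lines), question=question)


def condense_question(
    history: List[Tuple[str, str]], question: str, llm: Callable[[str], str]
) -> str:
    """Return a standalone query built from the history and the question."""
    if not history:
        # Design choice of this sketch: nothing to condense on the first turn.
        return question
    return llm(build_condense_prompt(history, question)).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return " How does LoRA reduce the number of trainable parameters? "


history = [
    ("user", "What is LoRA?"),
    ("assistant", "LoRA is a low-rank adaptation method for fine-tuning."),
]
standalone = condense_question(history, "How does it reduce parameters?", fake_llm)
print(standalone)
```

The standalone query, rather than the raw follow-up, is what would be sent to the retriever, which is why a pronoun like "it" no longer hurts retrieval.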


In [18]:
# Define the GPT-4o model that the chat engine will use to rewrite the query.
gpt4 = OpenAI(temperature=0.9, model="gpt-4o")

In [19]:
chat_engine = vector_index.as_chat_engine(
    chat_mode="condense_question", llm=gpt4, verbose=True
)

In [21]:
response = chat_engine.chat(
    "How does Retrieval-Augmented Generation (RAG) work, and which problem does it solve?"
)
display_res(response)

Querying with: What did I ask you about how Retrieval-Augmented Generation (RAG) works and the problem it solves?
Response:
	 You asked about how Retrieval-Augmented Generation (RAG) works and the problem it solves. RAG enhances the performance of large language models by addressing challenges such as producing outdated information and fabricating facts. It integrates pretraining with retrieval-based models to provide more reliable and accurate information generation. The typical workflow involves several processing steps: query classification, retrieval, reranking, repacking, and summarization. These steps help improve the quality of responses by integrating current and relevant information.
Sources:
	Node ID	 2aa05360-f43a-4819-bce7-0acf7b897eab
	Text	 Generative large language models are prone to producing outdated information or fabricating facts, although they were aligned with human preferences by reinforcement learning [1] or lightweight alternatives [2–5]. Retrieval-augmented g

## ReAct


ReAct is an agent-based chat mode. On each turn, the agent runs a loop in which it decides whether to query the data engine or answer directly. This makes it flexible, but the quality of its responses depends heavily on the underlying Large Language Model, so it needs careful management to avoid inaccurate answers.
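
To make the loop concrete, here is a minimal sketch of a thought/action/observation cycle. It is not LlamaIndex's agent: the `Action: tool[input]` and `Answer:` text format, `react_loop`, the scripted model, and the stub `search` tool are all assumptions made for illustration, so that the control flow can run offline.

```python
import re
from typing import Callable, Dict, List


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate model replies and tool calls until the model answers."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(reply)

        answer = re.search(r"Answer:\s*(.*)", reply, re.DOTALL)
        if answer:
            return answer.group(1).strip()

        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
        if action is None:
            # The model ignored the expected format; return its raw reply.
            return reply.strip()

        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        transcript.append(f"Observation: {observation}")
    return "Stopped: step limit reached without a final answer."


calls: List[str] = []


def search(query: str) -> str:
    """Stub tool standing in for a query engine over the index."""
    calls.append(query)
    return f"stub passage for '{query}'"


def scripted_llm(prompt: str) -> str:
    """Scripted stand-in for an LLM: look something up once, then answer."""
    if "Observation:" not in prompt:
        return "Thought: I should look this up.\nAction: search[Claude 3.5 Sonnet]"
    return "Thought: I have what I need.\nAnswer: (stub) summary of the retrieved passage."


final = react_loop("Which company developed Claude 3.5 Sonnet?", scripted_llm, {"search": search})
print(final)
```

The `max_steps` cap is one simple form of the careful management mentioned above: a model that never produces a final answer cannot keep the loop running indefinitely.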


In [22]:
chat_engine = vector_index.as_chat_engine(chat_mode="react", verbose=True)

In [None]:
response = chat_engine.chat(
    "Which company developed Claude 3.5 Sonnet, and what is its primary application?"
)

Added user message to memory: Which company developed Claude 3.5 Sonnet, and what is its primary application?


In [None]:
display_res(response)