# Notebook: Why RAG

# About

In this notebook, we motivate Retrieval-Augmented Generation (RAG).

We will see why we can't rely on a Large Language Model (LLM) directly.



# Imports

In [1]:
from dotenv import load_dotenv
import rich
import logging
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
from huggingface_hub import InferenceClient
import os

In [2]:
#logging.basicConfig(level=logging.DEBUG)


In [3]:
load_dotenv(dotenv_path="../env")

True

In [4]:
llm = OpenAI(model="gpt-3.5-turbo",temperature=0)


# Problems

There are at least two problems with using a Large Language Model directly:

- Knowledge Cutoff
- Hallucination

## Knowledge Cutoff

In [5]:
llm = OpenAI(model="gpt-4o-mini",temperature=0)


In [6]:
def get_response(query: str):
    """Send a single user message to the LLM and return its chat response."""
    messages = [
        ChatMessage(role="user", content=query),
    ]

    resp = llm.chat(messages)

    return resp

In [7]:
query="what is different about Llama3.2 than Llama2 ?"


In [11]:
response = get_response(query)

In [16]:
rich.print(response.message.content)

In [17]:
response

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="As of my last update in October 2023, specific details about Llama 3.2 compared to Llama 2 were not available. However, in general, updates from one version of a model to another typically include improvements in several areas:\n\n1. **Model Architecture**: Newer versions may incorporate changes in the underlying architecture that enhance performance, efficiency, or scalability.\n\n2. **Training Data**: Updates often involve training on larger and more diverse datasets, which can improve the model's understanding and generation of language.\n\n3. **Fine-tuning and Specialization**: Newer models may include better fine-tuning techniques or be specialized for certain tasks, leading to improved performance in specific applications.\n\n4. **Performance Metrics**: Improvements in metrics such as accuracy, coherence, and relevance of generated text are common in newer versions.\n\n5. **Robustness and Safety*

In [19]:
rich.print(response)

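Note that in the response the model is aliased to `model='gpt-4o-mini-2024-07-18'`.

The model's training data therefore cannot contain anything published after that snapshot date, and the response itself reports an even earlier cutoff of October 2023. With no specific information about Llama 3.2, the model falls back to generic statements about version upgrades.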
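As a rough illustration of the reasoning above, the snapshot date can be read from the model alias and compared with a release date. This is only a sketch: `snapshot_date` is a hypothetical helper, not part of any library, and the Llama 3.2 release date is an assumption that should be checked against Meta's announcement. The snapshot date is an upper bound; the actual training cutoff can be earlier, as the October 2023 answer suggests.

```python
import re
from datetime import date
from typing import Optional


def snapshot_date(model_name: str) -> Optional[date]:
    """Hypothetical helper: read a trailing YYYY-MM-DD suffix from a model alias."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})$", model_name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Alias taken from the response printed above.
snapshot = snapshot_date("gpt-4o-mini-2024-07-18")

# Assumption: Llama 3.2 was announced on 2024-09-25.
llama_3_2_release = date(2024, 9, 25)

# True when the topic appeared after the snapshot was taken,
# so the model cannot have seen it during training.
released_after_snapshot = snapshot is not None and llama_3_2_release > snapshot
print(snapshot, released_after_snapshot)
```

If the alias carries no date suffix, the helper returns `None` and no conclusion can be drawn from the name alone.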

## Hallucination

![Snapshot of Hallucination](../images/llama_2__hallucination.png)


In the snapshot above, we ask about a nonexistent Llama model.

We also give it a link to a study-music YouTube video.

Note: we are also using an "older" version of the Llama model to run the inference.

In [None]:
client = InferenceClient(api_key=os.environ['HF_API_KEY'])

In [None]:
message = """
What did Andrej Karpathy say about Meta's Llama 5.9 in the below youtube talk

https://www.youtube.com/watch?v=n61ULEU7CO0&ab_channel=LofiGirl

"""

In [None]:
messages = [
    { "role": "user", "content": message },
]

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf", 
    messages=messages, 
    temperature=0,
)

In [None]:
rich.print(response)

In [None]:
rich.print(response.choices[0].message.content)

## Notes

We explored at least two problems with using LLMs directly:

- Knowledge Cutoff
- Hallucination
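Both problems point toward giving the model relevant text at query time instead of relying on what it memorized. The sketch below shows only the prompt-assembly step of that idea, under stated assumptions: `build_augmented_prompt` is a hypothetical helper, the instruction wording is illustrative, and the snippet is a placeholder, not retrieved data. A real pipeline would fetch the snippets from a document store.

```python
from typing import List


def build_augmented_prompt(query: str, snippets: List[str]) -> str:
    """Hypothetical helper: prepend numbered context snippets to the user query."""
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(snippets, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Placeholder snippet; it stands in for text a retriever would return.
snippets = ["<retrieved passage about Llama 3.2 goes here>"]

prompt = build_augmented_prompt(
    "what is different about Llama3.2 than Llama2 ?", snippets
)
print(prompt)
```

The resulting string could be passed to the `get_response` function defined earlier in place of the bare query.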

