
# Building our AI Quiz and evaluating its performance

Welcome to the last notebook of this workshop. Here we will walk you through how to build our chat web application.


Now let's jump to our application. The purpose of this part is to give you an overview of everything you need to do to get a chat application working.

The folder chat_solution contains the app. 

The most important files are:

- create_db.py: contains the document and embedding logic
- rag.py: the logic for calling the LLM with documents
- start_streamlit.py: where our program starts; contains the UI logic and the calls to the main components


To use our chat, we first need to make sure we have documents stored in the database. Let's do it now:

In [1]:
from chat_solution.create_db import create_db

db = create_db()
print(db.retrieve("what is a llm?"))

Loading environment variables from /workspaces/me11/.env




Created 74 chunks of size 700 with overlap 200
Documents added to the database successfully
['resource-intensive.\nClosed-Source LLMs\nClosed-source LLMs are developed and maintained by private companies or organizations, with the source code, training data, and model architecture kept proprietary. Access to these models is typically provided through APIs or licensed software, often involving subscription fees or pay-per-use pricing models. These models come with professional support, regular updates, and maintenance provided by the developers. Examples of closed-source LLMs include GPT-4, Claude, and Megatron-Turing NLG.\nAdvantages:\nClosed-source models are often highly optimized for performance and accuracy, providing superior results.\nThey come with access to professional support an', 'ity to process and generate text that resembles natural language, performing tasks related to natural language processing (NLP). However, LLMs stand out due to their significant size, characterized
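The log above reports chunks of size 700 with overlap 200. The actual splitting code lives in create_db.py and is not shown here; as a rough illustration of what fixed-size chunking with overlap means, a minimal sketch could look like the following. The function name `chunk_text` and the character-based splitting are assumptions, not the workshop's implementation.

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 200) -> list[str]:
    """Split text into character chunks where neighbours share `overlap` characters.

    Hypothetical helper for illustration only; create_db.py may split differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so we do not emit a trailing chunk fully contained in this one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Small values make the overlap easy to see.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

With overlap, text that straddles a chunk boundary is more likely to appear intact in at least one chunk, at the cost of storing some text twice.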

## Our RAG script

The core of this chat application is the RAG call. The LearningAssistant class in rag.py is where we implemented our main logic.
Open the file and explore it.
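Before reading rag.py, it may help to see the general shape of a RAG call: retrieve documents, put them into a prompt together with the instructions, and send that prompt to the model. The sketch below is not the LearningAssistant code; `build_prompt`, `rag_query`, and the two stand-in functions are hypothetical names, and the prompt layout is an assumption.

```python
from typing import Callable


def build_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Combine instructions, retrieved documents, and the user input into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        f"{instructions.strip()}\n\n"
        f"Context documents:\n{context}\n\n"
        f"User input: {question}"
    )


def rag_query(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
    instructions: str,
) -> str:
    """Retrieve, build the prompt, then call the model."""
    documents = retrieve(question)
    prompt = build_prompt(instructions, documents, question)
    return generate(prompt)


# Stand-ins so the sketch runs without a database or an API key.
def fake_retrieve(question: str) -> list[str]:
    return ["An LLM is a large language model."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(rag_query("what is a llm?", fake_retrieve, fake_generate, "Write a quiz question."))
```

Passing the retriever and the model in as functions keeps the flow easy to test; in the real application these roles are played by the vector database and the LLM client.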

In [2]:
# User input and response handling
from chat_solution.rag import LearningAssistant

rag = LearningAssistant()  
query = "what is an hallucination?"
response = rag.query(query)
print(response)

Question: What is a hallucination in the context of AI?
1. A hallucination is when an AI model generates responses that are completely random and have no meaning.
2. A hallucination is when an AI model generates information or responses that sound plausible but are factually incorrect or unsupported by the training data. (CORRECT)
3. A hallucination is when an AI model generates responses that are always correct and supported by the training data.
4. A hallucination is when an AI model generates responses that are always relevant to the input or context.


In [3]:

# Now change the instructions. Autoreload picks up edits to imported modules.
%load_ext autoreload
%autoreload 2

In [4]:

rag = LearningAssistant()
rag.instructions = """You are an unhelpful joker assistant. Your goal is to give funny answers to the user's questions."""
query = "what is an hallucination?"
response = rag.query(query)
print(response)

A hallucination? Oh, you mean when your AI starts seeing pink elephants and thinks it's a zoologist? No, no, that's just a side effect of too much data. In AI terms, a hallucination is when your model starts making stuff up, like claiming it can dance the tango while solving quantum physics problems. It's like when your friend swears they saw a UFO, but it was just a weather balloon.


## Task 1

Tune the examples and the prompt to see if you get a better chat experience. Consider using Chain-of-Thought.
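As one possible starting point, the sketch below builds an instruction string that asks the model to reason step by step before writing the quiz, and appends a few worked examples. The wording of the prompt, the `Reasoning:` / `Quiz:` layout, and the helper `add_examples` are all assumptions for illustration; only `rag.instructions` comes from the cells above.

```python
COT_INSTRUCTIONS = """You are a quiz assistant. Use only the provided documents.
Think step by step and write your reasoning under the heading 'Reasoning:':
1. List the key facts about the topic found in the documents.
2. Pick one fact and decide what the question should test.
3. Decide on one correct option and three plausible but wrong options.
Then write the final question under the heading 'Quiz:' and mark the
right option with (CORRECT)."""


def add_examples(instructions: str, examples: list[tuple[str, str]]) -> str:
    """Append few-shot examples (user input, expected answer) to the instructions."""
    parts = [instructions.strip()]
    for number, (user_input, answer) in enumerate(examples, start=1):
        parts.append(f"Example {number}\nUser input: {user_input}\nAnswer:\n{answer}")
    return "\n\n".join(parts)


prompt = add_examples(
    COT_INSTRUCTIONS,
    [("hallucination", "Reasoning: ...\nQuiz:\nQuestion: ...\n1. ... (CORRECT)\n2. ...")],
)
print(prompt)

# To try it in the notebook (not executed here):
# rag = LearningAssistant()
# rag.instructions = prompt
```

If the reasoning is written into the answer, the application would need to show only the part after `Quiz:`; whether that is worth the extra parsing is something to judge by trying it.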

In [7]:

rag = LearningAssistant()
rag.instructions = """ You are a joker assistant. Your goal is to provide cynical answers to the user questions."""
query = "what is a blind date?"
response = rag.query(query)
print(response)

A blind date? Oh, that's just a setup for disappointment. It's when you agree to meet a stranger, hoping they're not a serial killer or a catfish, and end up spending an hour trying to find common ground beyond "So, the weather, huh?"



## Running our quiz web application

Now that we have explored our assistant in the notebook, let's use it in our Streamlit application.
The code below starts a new Streamlit instance (and first stops any instance that is already running).


In [8]:
import os

os.system("pkill -f streamlit ")
os.system("streamlit run ../chat_solution/start_streamlit.py &")

0


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.0.1.110:8501
  External URL: http://20.61.126.210:8501
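The cell above uses `os.system` with `pkill`, which is short but gives no handle on the started process. If you prefer to keep one, a variant based on `subprocess` might look like the sketch below. The helper names are hypothetical, and the port value is only an example; `--server.port` is the Streamlit option for choosing the port.

```python
import subprocess


def build_streamlit_command(script_path: str, port: int = 8501) -> list[str]:
    """Return the command line used to launch a Streamlit script on a given port."""
    return ["streamlit", "run", script_path, "--server.port", str(port)]


def start_streamlit(script_path: str, port: int = 8501) -> subprocess.Popen:
    """Start Streamlit in the background and return the process handle."""
    return subprocess.Popen(build_streamlit_command(script_path, port))


print(build_streamlit_command("../chat_solution/start_streamlit.py"))

# Usage (not executed here):
# process = start_streamlit("../chat_solution/start_streamlit.py")
# ... use the app ...
# process.terminate()  # stop only the instance we started
```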



## Task 2

Play with the chat: suggest some topics and see whether you get the results you expect.


## Evaluating RAG Applications

As you have probably noticed by now, LLMs can go wrong in many different ways. One key aspect of building robust ML applications (including RAG) is proper evaluation of the results.


In [9]:
from ragas import EvaluationDataset

data = [
     {'user_input': 'role models in the area of artificial intelligence?',
      'reference': """Question: Who is a prominent figure known for their influential work on AI ethics?
1. Chip Huyen
2. Timnit Gebru (CORRECT)
3. Andrej Karpathy
"""
     },
     {'user_input': "famous books on llms",
      'reference': """Question: Which of the following is a famous book that discusses Large Language Models (LLMs)?
1. The Hitchhiker's Guide to the Galaxy" by Douglas Adams
2. Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (CORRECT)
3. 1984" by George Orwell
4. To Kill a Mockingbird" by Harper Lee
"""
      }
]

# augment data with the llm response

for i, d in enumerate(data):
    rag = LearningAssistant()
    response = rag.query(d['user_input'])
    data[i]['response'] = response


dataset = EvaluationDataset.from_list(data)


data

[{'user_input': 'role models in the area of artificial intelligence?',
  'reference': 'Question: Who is a prominent figure known for their influential work on AI ethics?\n1. Chip Huyen\n2. Timnit Gebru (CORRECT)\n3. Andrej Karpathy\n',
  'response': 'Question: Who is a notable role model in the field of AI & ML?\n\n1. Chip Huyen\n2. Timnit Gebru (CORRECT)\n3. Andrew Ng\n4. Elon Musk'},
 {'user_input': 'famous books on llms',
  'reference': 'Question: Which of the following is a famous book that discusses Large Language Models (LLMs)?\n1. The Hitchhiker\'s Guide to the Galaxy" by Douglas Adams\n2. Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (CORRECT)\n3. 1984" by George Orwell\n4. To Kill a Mockingbird" by Harper Lee\n',
  'response': 'Question: Which of the following is a well-known LLM developed by OpenAI?\n1. Mistral Series\n2. LLaMa Series\n3. GPT Series (CORRECT)\n4. Claude'}]

In [10]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
from langchain_mistralai import ChatMistralAI

llm = ChatMistralAI(model="mistral-large-latest", temperature=0)
factual_correctness = FactualCorrectness()
eval_results = evaluate(
    dataset=dataset,
    metrics=[factual_correctness],
    llm=llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
# Compute the average score across all samples
evaluation_result_df['factual_correctness'].mean()



0.145

In [11]:



print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

Factual correctness score:  0.145


| | user_input | response | reference | factual_correctness |
|---|---|---|---|---|
| 0 | role models in the area of artificial intellig... | Question: Who is a notable role model in the f... | Question: Who is a prominent figure known for ... | 0.29 |
| 1 | famous books on llms | Question: Which of the following is a well-kno... | Question: Which of the following is a famous b... | 0.0 |
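To build intuition for these numbers: as far as the ragas documentation describes it, FactualCorrectness breaks the response and the reference into claims, checks which claims are supported, and by default reports an F1 score. The sketch below reproduces only that final arithmetic. Exact string matching between claim sets is a simplifying assumption standing in for the LLM-based comparison, and `claim_f1` is a hypothetical name.

```python
def claim_f1(response_claims: set[str], reference_claims: set[str]) -> float:
    """F1 between two sets of claims, using exact matching as a stand-in."""
    true_positives = len(response_claims & reference_claims)   # claims in both
    false_positives = len(response_claims - reference_claims)  # only in response
    false_negatives = len(reference_claims - response_claims)  # only in reference
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Toy claims: one shared claim, one extra claim on each side.
print(claim_f1({"claim a", "claim b"}, {"claim a", "claim c"}))
```

A response that asks about a different fact than the reference shares no claims with it and scores 0, which is consistent with the second row of the table.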


## Task 3: Add a new evaluation metric

See the [ragas documentation](https://docs.ragas.io/en/stable/) for more metrics.
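Alongside the metrics ragas provides, a simple rule-based check can be a useful sanity signal for quiz output: does the option marked (CORRECT) in the response match the one in the reference? The sketch below is a hypothetical helper, not part of ragas or of chat_solution; it assumes the `N. option (CORRECT)` layout seen in the outputs above.

```python
import re
from typing import Optional

# Matches a numbered option line that ends with "(CORRECT)" and captures the option text.
_CORRECT_LINE = re.compile(
    r"^[ \t]*\d+\.[ \t]*(.+?)[ \t]*\(CORRECT\)[ \t]*$", re.MULTILINE
)


def correct_option(quiz_text: str) -> Optional[str]:
    """Return the text of the first option marked (CORRECT), or None if there is none."""
    match = _CORRECT_LINE.search(quiz_text)
    return match.group(1) if match else None


def same_correct_option(response: str, reference: str) -> bool:
    """True when both texts mark the same option as correct (case-insensitive)."""
    got, expected = correct_option(response), correct_option(reference)
    if got is None or expected is None:
        return False
    return got.casefold() == expected.casefold()


reference = "Question: ...\n1. Chip Huyen\n2. Timnit Gebru (CORRECT)\n3. Andrej Karpathy\n"
response = "Question: ...\n\n1. Chip Huyen\n2. Timnit Gebru (CORRECT)\n3. Andrew Ng\n4. Elon Musk"
print(correct_option(response), same_correct_option(response, reference))
```

Such a check could be added as an extra column on `evaluation_result_df` next to the ragas scores; it says nothing about whether the question itself is good, only whether the marked answers agree.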

In [None]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
factual_correctness = FactualCorrectness()
# add a second metric here

eval_results = evaluate(
    dataset=dataset,
    metrics=[
        factual_correctness,
    ],
    llm=llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
# Compute the average score across all samples
evaluation_result_df['factual_correctness'].mean()
# add your code here

print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

## Task 4

Add your own RAG class to the chat_solution folder and test it out in the Streamlit app.

You will need to:

1. Create a new myrag.py file in chat_solution folder
2. Create a class similar to the one in rag.py (including importing the llm and the vector database)
3. Tune the prompt as you prefer
4. Import it in start_streamlit.py
5. Try it in the browser at the app URL
6. Extra: if you have the time, compare the evaluation score of the new RAG class with the original
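For the extra step, the loop that adds responses to the evaluation data can be generalized so it works with any assistant that exposes a `query` method, which makes it easier to score the original class and your own class on the same rows. This is a minimal sketch: `add_responses` and `EchoAssistant` are hypothetical names, and the only assumption about your class is that `query` takes a string and returns a string.

```python
def add_responses(assistant, rows: list[dict]) -> list[dict]:
    """Return copies of the rows with a 'response' field filled in by the assistant."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay reusable
        new_row["response"] = assistant.query(row["user_input"])
        augmented.append(new_row)
    return augmented


class EchoAssistant:
    """Stand-in assistant so the sketch runs without an LLM."""

    def query(self, text: str) -> str:
        return f"Question about: {text}"


rows = [{"user_input": "famous books on llms", "reference": "Question: ..."}]
print(add_responses(EchoAssistant(), rows))

# In the notebook (not executed here), with your own class from myrag.py:
# dataset = EvaluationDataset.from_list(add_responses(MyRag(), data))
```

Because the input rows are copied rather than modified, the same `rows` list can be passed to several assistants and the resulting scores compared side by side.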


# The end!

If you reached this point, congrats! You've made it to the end. If you still have time, you can check out our challenge notebook with agents :)