
# Building our AI Quiz and evaluating its performance

Welcome to the last notebook of this workshop. In it, we will walk you through how to build our chat web application.


Now let's jump into our application. The purpose of this part is to give you an overview of everything you need to do to get a chat application working.

The folder chat_solution contains the app. 

The most important files are:

- create_db.py: contains the document and embedding logic
- rag.py: contains the logic for calling the LLM with documents
- start_streamlit.py: the entry point of the program; contains the UI logic and the calls to the main components
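
To make the "document / embedding logic" more concrete, here is a minimal, self-contained sketch of the embed, store, and retrieve idea behind a vector database. It is not the code in create_db.py: `ToyVectorDB`, `embed`, and the bag-of-words "embedding" are hypothetical stand-ins for a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.

    Assumption: a real application would call a learned embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorDB:
    """Hypothetical in-memory store that keeps each document next to its vector."""

    def __init__(self):
        self.entries = []  # list of (document, vector) pairs

    def add(self, document: str) -> None:
        self.entries.append((document, embed(document)))

    def retrieve(self, query: str, k: int = 1) -> list:
        """Return the k documents most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vector, entry[1]),
            reverse=True,
        )
        return [document for document, _ in ranked[:k]]


toy_db = ToyVectorDB()
toy_db.add("A large language model (LLM) predicts the next token in a text.")
toy_db.add("Streamlit turns Python scripts into web apps.")
print(toy_db.retrieve("what is a llm?"))
```

The query shares the words "a" and "llm" with the first document and nothing with the second, so the first document is ranked highest. A real embedding model would also match on meaning, not only on shared words.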


To use our chat, we first need to make sure we have documents stored in the database. Let's do it now:

In [None]:
from chat_solution.create_db import create_db

db = create_db()
print(db.retrieve("what is a llm?"))

## Our RAG script

The core of this chat application is the RAG call. The LearningAssistant class in rag.py is where we implemented our main logic.
Explore it.
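
As a mental model before you open rag.py: a RAG call usually has three steps, namely retrieve relevant documents, put them into the prompt, and ask the model. The sketch below shows that flow under stated assumptions. `SketchAssistant`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the two stub functions stand in for the vector database and the LLM. The real LearningAssistant may be organised differently.

```python
from typing import Callable, List


class SketchAssistant:
    """Hypothetical RAG assistant; NOT the LearningAssistant from rag.py."""

    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        generate: Callable[[str], str],
        instructions: str = "Answer using only the context provided.",
    ) -> None:
        self.retrieve = retrieve          # stand-in for the vector database lookup
        self.generate = generate          # stand-in for the LLM call
        self.instructions = instructions  # system-style instructions, editable later

    def build_prompt(self, question: str, documents: List[str]) -> str:
        """Combine instructions, retrieved documents, and the question."""
        context = "\n".join(f"- {doc}" for doc in documents)
        return f"{self.instructions}\n\nContext:\n{context}\n\nQuestion: {question}"

    def query(self, question: str) -> str:
        """Retrieve, build the prompt, then generate."""
        documents = self.retrieve(question)
        prompt = self.build_prompt(question, documents)
        return self.generate(prompt)


def fake_retrieve(question: str) -> List[str]:
    # Placeholder document; a real retriever would search the database.
    return ["A hallucination is a fluent but unsupported model output."]


def fake_generate(prompt: str) -> str:
    # Placeholder reply; a real implementation would call the LLM here.
    return f"(model reply to a prompt of {len(prompt)} characters)"


assistant = SketchAssistant(fake_retrieve, fake_generate)
print(assistant.build_prompt("what is a hallucination?", fake_retrieve("")))
print(assistant.query("what is a hallucination?"))
```

Keeping the instructions as a plain attribute means the behaviour can be changed without touching retrieval, which is the same kind of change the cells in this section make to `rag.instructions`.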

In [None]:
# User input and response handling
from chat_solution.rag import LearningAssistant

rag = LearningAssistant()  
query = "what is a hallucination?"
response = rag.query(query)
print(response)

In [None]:

# Now change the instructions (autoreload picks up edits to the source files)
%load_ext autoreload
%autoreload 2

In [None]:

rag = LearningAssistant()
rag.instructions = """You are an unhelpful joker assistant. Your goal is to give funny answers to the user's questions."""
query = "what is a hallucination?"
response = rag.query(query)
print(response)

## Task 1

Tune the examples and the prompt to see if you get a better chat experience. Consider using Chain-of-Thought.
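
One possible way to approach this task is to assemble the instructions from three parts: a role, a few worked examples, and an explicit request to reason step by step (Chain-of-Thought). The sketch below only builds the string. `build_instructions` and the example pair are hypothetical, and assigning the result to `rag.instructions` relies on the attribute used earlier in this notebook.

```python
from typing import List, Tuple


def build_instructions(
    role: str,
    examples: List[Tuple[str, str]],
    chain_of_thought: bool = True,
) -> str:
    """Compose instructions from a role, an optional reasoning cue, and few-shot examples."""
    parts = [role.strip()]
    if chain_of_thought:
        parts.append(
            "Think step by step: first list the relevant facts from the context, "
            "then give the final answer."
        )
    for question, answer in examples:
        parts.append(f"Example question: {question}\nExample answer: {answer}")
    return "\n\n".join(parts)


instructions = build_instructions(
    role="You are a patient tutor who explains AI concepts to beginners.",
    examples=[
        (
            "What is a token?",
            "Facts: models read text in small pieces. Answer: a token is one such piece of text.",
        ),
    ],
)
print(instructions)
# In the notebook you could then try: rag.instructions = instructions
```

Note that the example answer follows the same "facts first, then answer" shape that the reasoning cue asks for, so the example and the instruction reinforce each other.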

In [None]:

rag = LearningAssistant()
# add your code here
response = rag.query(query)
print(response)


## Running our quiz web application

Now that we have explored our assistant in the notebook, let's use it in our Streamlit application.
The code below starts a new Streamlit instance (and first stops any instance that is already running).


In [None]:
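import os

# Stop any Streamlit instance that is already running
os.system("pkill -f streamlit")
# Start the app in the background so the notebook stays usable
os.system("streamlit run ../chat_solution/start_streamlit.py &")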

## Task 2

Play with the chat: try suggesting some topics and see whether you get the results you expect.


## Evaluating RAG Applications

As you have probably noticed by now, LLMs can go wrong in many different ways. One key aspect of building robust ML applications (including RAG) is proper evaluation of the results.
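
To build intuition for a factual correctness metric: broadly, such a metric breaks the response and the reference into individual claims and checks which claims are supported by the other side. In Ragas an LLM handles both the claim extraction and the checking. The sketch below skips that part and assumes the claims are already given as sets of strings that can be compared exactly, so `claim_f1` and the example claims are purely illustrative.

```python
from typing import Dict, Set


def claim_f1(response_claims: Set[str], reference_claims: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 over exact-match claims.

    Assumption: claims match only if the strings are identical. A real metric
    would use an LLM or an entailment model to judge whether a claim is supported.
    """
    true_positives = len(response_claims & reference_claims)
    precision = true_positives / len(response_claims) if response_claims else 0.0
    recall = true_positives / len(reference_claims) if reference_claims else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hand-written claims for illustration only
reference_claims = {
    "Timnit Gebru is known for work on AI ethics",
    "Timnit Gebru is a prominent figure",
}
response_claims = {
    "Timnit Gebru is known for work on AI ethics",
    "Timnit Gebru invented the transformer",
}
scores = claim_f1(response_claims, reference_claims)
print(scores)
```

Here one of the two response claims is supported (precision 0.5) and one of the two reference claims is covered (recall 0.5), so the F1 score is 0.5. A low precision points to unsupported statements in the response; a low recall points to missing information.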


In [None]:
from ragas import EvaluationDataset

data = [
     {'user_input': 'role models in the area of artificial intelligence?',
      'reference': """Question: Who is a prominent figure known for their influential work on AI ethics?
1. Chip Huyen
2. Timnit Gebru (CORRECT)
3. Andrej Karpathy
"""
     },
     {'user_input': "famous books on llms",
      'reference': """Question: Which of the following is a famous book that discusses Large Language Models (LLMs)?
1. "The Hitchhiker's Guide to the Galaxy" by Douglas Adams
2. "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (CORRECT)
3. "1984" by George Orwell
4. "To Kill a Mockingbird" by Harper Lee
"""
      }
]

# augment data with the llm response

for i, d in enumerate(data):
    rag = LearningAssistant()
    response = rag.query(d['user_input'])
    data[i]['response'] = response


dataset = EvaluationDataset.from_list(data)


data

In [None]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
from langchain_mistralai import ChatMistralAI

llm = ChatMistralAI(model="mistral-large-latest", temperature=0)
factual_correctness = FactualCorrectness()
eval_results = evaluate(
    dataset=dataset,
    metrics=[
        factual_correctness,
    ],
    llm=llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
# compute the average score
evaluation_result_df['factual_correctness'].mean()


In [None]:



print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

## Task 3: Add a new evaluation metric

Look at the [Ragas documentation](https://docs.ragas.io/en/stable/) for more metrics.
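
Not every metric needs an LLM judge. As an illustration of what a simple lexical metric measures, the sketch below computes a unigram-overlap F1 between a response and a reference in plain Python. `token_f1` is a hypothetical helper written for this notebook, not a Ragas API; check the documentation linked above for the metric classes Ragas actually provides.

```python
import re
from collections import Counter


def token_f1(response: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (case-insensitive, word tokens only)."""
    response_tokens = Counter(re.findall(r"\w+", response.lower()))
    reference_tokens = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((response_tokens & reference_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(response_tokens.values())
    recall = overlap / sum(reference_tokens.values())
    return 2 * precision * recall / (precision + recall)


print(token_f1("Timnit Gebru is known for AI ethics", "Timnit Gebru (CORRECT)"))
```

A score like this rewards shared wording rather than shared meaning, which is why it is usually paired with an LLM-based metric instead of replacing one. Assuming the result DataFrame keeps the `response` and `reference` columns, the helper could be applied row by row to add a second score column.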

In [None]:
from ragas.metrics import FactualCorrectness
from ragas import evaluate
factual_correctness = FactualCorrectness()
# add a second metric here

eval_results = evaluate(
    dataset=dataset,
    metrics=[
        factual_correctness,
    ],
    llm=llm,
    raise_exceptions=False,
)

evaluation_result_df = eval_results.to_pandas()
# compute the average score
evaluation_result_df['factual_correctness'].mean()
# add your code here

print("Factual correctness score: ", evaluation_result_df['factual_correctness'].mean())
evaluation_result_df.iloc[:5]

## Task 4

Add your own RAG class to the chat_solution folder and test it out in the Streamlit app.

You will need to:

1. Create a new myrag.py file in the chat_solution folder
2. Create a class similar to the one in rag.py (including the imports of the LLM and the vector database)
3. Tune the prompt as you prefer
4. Import it in start_streamlit.py
5. Try it out in the browser at the app's URL
6. Extra: if you have time, re-run the evaluation with the new RAG class and compare the scores


# The end!

If you have reached this point, congrats! You've made it to the end. If you still have time, you can check out our challenge notebook with agents :)