
# Evaluation

When building a complex application with an LLM, one of the important but sometimes tricky steps is evaluation:
- How do you evaluate how well your application is doing?
- Is it meeting some accuracy criteria?

Also, you may decide to change your implementation:
- swap in a different LLM, OR
- change the strategy for how you use a vector database, OR
- use a different chunking configuration, OR
- change some other parameters of your system.

<b>How do you know if you are making it better or worse?</b>

In this notebook we will see how to think about evaluating LLM-based applications, as well as the utility chains in LangChain that help with this.

LLM applications are really chains and sequences of many different steps. So the first thing you should do is understand exactly what is going into and coming out of each step.

One way to do that is to look at things by eye. But there's also this really cool idea of using language models and chains themselves to evaluate other language models, chains and applications.
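To make "looking at things by eye" concrete, here is a minimal sketch of a wrapper that prints what goes into and comes out of each step. The `retrieve` and `answer` functions are hypothetical stand-ins for real steps, not part of this notebook's application.

```python
from typing import Any, Callable


def traced(name: str, step: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Wrap a step so its input and output are printed every time it runs."""
    def wrapper(value: Any) -> Any:
        print(f"[{name}] input:  {value!r}")
        result = step(value)
        print(f"[{name}] output: {result!r}")
        return result
    return wrapper


# Hypothetical stand-ins for real steps such as retrieval and answer generation.
def retrieve(question: str) -> list:
    return [f"passage relevant to: {question}"]


def answer(passages: list) -> str:
    return f"answer based on {len(passages)} passage(s)"


pipeline = [traced("retrieve", retrieve), traced("answer", answer)]

# Feed the output of each step into the next one, printing as we go.
value = "Why do data security measures differ between agencies?"
for step in pipeline:
    value = step(value)
```

Depending on the installed LangChain version, setting `langchain.debug = True` may give similar visibility into the steps of a real chain; check that against the documentation for your version.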


In [0]:
%run "./utils/config"

In [0]:
import mlflow
requirements_path = mlflow.pyfunc.get_model_dependencies(config['model_uri'])
%pip install -r $requirements_path

2023/09/16 10:07:35 INFO mlflow.pyfunc: To install the dependencies that were used to train the model, run the following command: '%pip install -r /tmp/tmpt7i28vwp/requirements.txt'.


Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting mlflow==2.5.0
  Using cached mlflow-2.5.0-py3-none-any.whl (18.2 MB)
Collecting nemoguardrails==0.5.0
  Using cached nemoguardrails-0.5.0-py3-none-any.whl (13.9 MB)
Collecting langchain==0.0.251
  Using cached langchain-0.0.251-py3-none-any.whl (1.4 MB)
Collecting watchdog==3.0.0
  Using cached watchdog-3.0.0-py3-none-manylinux2014_x86_64.whl (82 kB)
Collecting PyMuPDF==1.23.3
  Using cached PyMuPDF-1.23.3-cp310-none-manylinux2014_x86_64.whl (4.3 MB)
Collecting pysqlite-binary
  Using cached pysqlite_binary-0.5.1.3380300-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Collecting querystring-parser<2
  Using cached querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting alembic!=1.10.0,<2
  Using cached alembic-1.12.0-py3-none-any.whl (226 kB)
Collecting docker<7,>=4.0.0
  Using cached docker-6.1.3-py3-none-any.whl (148 kB)
Collecting 

In [0]:
dbutils.library.restartPython()

In [0]:
%run "./utils/config"

In [0]:
%run "./utils/functions"

In [0]:
import pandas as pd
import mlflow
from langchain.evaluation.qa import QAEvalChain
from langchain.llms import OpenAI

from langchain.evaluation.qa import QAGenerateChain
from langchain.chat_models import ChatOpenAI

In [0]:
data = read_pdf_to_string(config['data_dir_path'])
data

['\u2029\nGovernment Data Security Policies  |   �1\nGOVERNMENT \nDATA SECURITY \nPOLICIES\nThis document contains general information for the \npublic only. It is not intended to be relied upon as a \ncomprehensive or deﬁnitive guide on each agency’s \npolicies and practices. The data security measures \nimplemented by each agency will differ depending on \nvarious factors such as the sensitivity of the data and \nthe agency’s assessment of data security risks. The \nGovernment may update the policies set out in this \ndocument without publishing such updates to the \npublic.    \nThe Government takes its responsibility as a \ncustodian of data very seriously.\nSince 2001, the Government’s data security policies have been set out in the Government \nInstruction Manual (IM) on Infocomm Technology and Smart Systems (ICT&SS) Management. In \n2019, the Public Sector Data Security Review Committee recommended additional technical and \nprocess measures to protect data and prevent data comp

In [0]:
data = [doc[:2000] for doc in data]

# Generate Question/Answer Pairs

So we have loaded the index and set up the retriever, and we also have the application that we assembled in `Step#2`.

The first thing we need to do is figure out which data points we want to evaluate it on.

The simplest way: we come up with data points that we think are good examples ourselves. To do that, we can look at some of the data and write example questions and example ground truth answers that we can later use to evaluate. But this doesn't scale well. It takes a lot of time to go through each example and figure out what's going on in order to build an evaluation dataset. So is there a way that we can AUTOMATE it?

<b>One of the really cool ways that we think we can automate it is with language models themselves.</b> Let's implement it:

- examples (question-answer pairs) are generated using an LLM
- the questions from "examples" are passed into our model to generate predictions
- compare the answer prediction[answer] v/s example_question_answer_pair[answe


### `QAGenerateChain` takes in documents and creates a question-answer pair from each one, using a language model to do so.

In [0]:
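example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

# Generate one question/answer pair per (truncated) document.
generated_examples = []

for doc in data:
    try:
        generated_examples += example_gen_chain.apply_and_parse([{"doc": doc}])
    except ValueError as e:
        # Skip documents for which generation raises a ValueError
        # (for example, LLM output that cannot be parsed into a QA pair).
        print(f'QAGenerateChain failed to generate a QA pair: {e}')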



##### So now, if we look at what exactly is returned here, we can see a query and we can see an answer. Look at that!!! 

We just generated a bunch of question-answer pairs. More importantly we didn't have to write it ourselves. Saves us a bunch of time allowing us to code more exciting things. 
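Since the pairs come from a language model, it can be worth a quick sanity check before using them. The sketch below assumes the pairs have already been extracted into plain dictionaries with `query` and `answer` keys; `keep_valid_pairs` is a hypothetical helper, and the sample data is made up to show the filtering.

```python
def keep_valid_pairs(examples):
    """Return only examples that are dicts with non-empty 'query' and 'answer' strings."""
    valid = []
    for example in examples:
        if not isinstance(example, dict):
            continue
        query = example.get("query")
        answer = example.get("answer")
        has_query = isinstance(query, str) and query.strip() != ""
        has_answer = isinstance(answer, str) and answer.strip() != ""
        if has_query and has_answer:
            valid.append(example)
    return valid


# Illustrative input only: the second and third entries are deliberately malformed.
sample = [
    {"query": "Why do the data security measures implemented by each agency differ?",
     "answer": "They depend on the sensitivity of the data and each agency's risk assessment."},
    {"query": "", "answer": "An answer without a question."},
    {"query": "A question without an answer?"},
]

valid = keep_valid_pairs(sample)
print(f"kept {len(valid)} of {len(sample)} examples")
```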


In [0]:
generated_examples = [example["qa_pairs"] for example in generated_examples]
generated_examples

[{'query': 'According to the document, why do the data security measures implemented by each agency differ?',
  'answer': "The data security measures implemented by each agency differ depending on various factors such as the sensitivity of the data and the agency's assessment of data security risks."},
 {'query': 'What are the two legal frameworks that govern data management in the public and private sectors?',
  'answer': 'The two legal frameworks that govern data management in the public and private sectors are the Public Sector (Governance) Act (PSGA) for the public sector and the Personal Data Protection Act (PDPA) for the private sector.'},
 {'query': "What are the key policies of the Government's Third-Party Management Framework?",
  'answer': "The key policies of the Government's Third-Party Management Framework are designed to guide agencies in ensuring that third parties adequately safeguard data. These policies are organized based on the lifecycle of the relationship between 


Let's run these generated examples through the chain and generate predictions

In [0]:
model = mlflow.pyfunc.load_model(config['model_uri'])
queries = pd.DataFrame({'question': [r['query'] for r in generated_examples]})
predictions = model.predict(queries)

In [0]:
predictions

[{'question': 'According to the document, why do the data security measures implemented by each agency differ?',
  'answer': 'The data security measures implemented by each agency differ due to the specific needs of the agency, the type of data being protected, and the resources available to the agency.'},
 {'question': 'What are the two legal frameworks that govern data management in the public and private sectors?',
  'answer': 'The two legal frameworks that govern data management in the public and private sectors are the General Data Protection Regulation (GDPR) and the Data Protection Act 2018 (DPA 2018).\nThe above response may have been hallucinated, and should be independently verified.'},
 {'question': "What are the key policies of the Government's Third-Party Management Framework?",
  'answer': "The key policies of the Government's Third-Party Management Framework include the following: 1) Establishing a risk-based approach to third-party management; 2) Establishing a framewor


# Evaluation

So we have generated a whole bunch of predictions for the question/answer pairs we created.

HOW ARE WE EVER GOING TO EVALUATE THIS?

Similar to when we created them, one way to do it would be manually. We could run the chain over all the examples, look at the outputs, and try to figure out what's going on: whether each one is correct, incorrect or even partially correct. This really starts to get a little tedious over time and a whole lot BORING.

Let's go back to our favorite solution. <b>Can we ask a language model to do it for us?</b>

## YES WE CAN



#### `QAEvalChain` to the rescue

In [0]:
llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(generated_examples, predictions, question_key="query", prediction_key="answer")

In [0]:
graded_outputs

[{'results': ' CORRECT'}, {'results': ' CORRECT'}, {'results': ' CORRECT'}]

In [0]:
# Walk the examples, predictions and grades together, one row at a time.
for i, (example, prediction, grade) in enumerate(zip(generated_examples, predictions, graded_outputs)):
    print(f"\n\n\nExample {i}:")
    print("Question: " + example['query'])
    print("Real Answer: " + example['answer'])
    print("Predicted Answer: " + prediction['answer'])
    print("Predicted Grade: " + grade["results"])




Example 0:
Question: According to the document, why do the data security measures implemented by each agency differ?
Real Answer: The data security measures implemented by each agency differ depending on various factors such as the sensitivity of the data and the agency's assessment of data security risks.
Predicted Answer: The data security measures implemented by each agency differ due to the specific needs of the agency, the type of data being protected, and the resources available to the agency.
Predicted Grade:  CORRECT



Example 1:
Question: What are the two legal frameworks that govern data management in the public and private sectors?
Real Answer: The two legal frameworks that govern data management in the public and private sectors are the Public Sector (Governance) Act (PSGA) for the public sector and the Personal Data Protection Act (PDPA) for the private sector.
Predicted Answer: The two legal frameworks that govern data management in the public and private sectors are 


#### Let's think for a moment about why we actually need to use the language model in the first place.

As you can see, the predicted answer and real answer strings are actually nothing alike. They're very different. Sometimes one is really short while the other is very long.

So if we were to try some string matching, exact matching, or even some regexes here, it wouldn't work. The strings are simply not the same. This shows the importance of using a language model to do the evaluation here.

We have got these answers, which are arbitrary strings. There's no single ground-truth string that is the best possible answer; there are many different variants. As long as they have the same semantic meaning, they should be graded as being similar. That's what a language model helps with, as opposed to just doing exact matching.
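To see why simple string comparison falls short, here is a small sketch that applies exact matching and a unique-token overlap score to the first real/predicted answer pair shown above. The `exact_match` and `token_f1` helpers are illustrative and not part of LangChain.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(reference, prediction):
    """True only if both strings have identical token sequences."""
    return tokens(reference) == tokens(prediction)


def token_f1(reference, prediction):
    """F1 score over the sets of unique tokens in each string."""
    ref, pred = set(tokens(reference)), set(tokens(prediction))
    common = ref & pred
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = ("The data security measures implemented by each agency differ depending on "
             "various factors such as the sensitivity of the data and the agency's "
             "assessment of data security risks.")
prediction = ("The data security measures implemented by each agency differ due to the "
              "specific needs of the agency, the type of data being protected, and the "
              "resources available to the agency.")

print("exact match:", exact_match(reference, prediction))
print("token F1:   ", round(token_f1(reference, prediction), 2))
```

Exact matching rejects the pair outright, and the overlap score only measures shared words, so neither tells us whether the two answers mean the same thing.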

This difficulty in comparing strings is what makes evaluation of language models so hard in the first place. We are using them for really open-ended tasks, where they are asked to generate text. This hasn't really been done before, as models until recently weren't good enough to do it. So a lot of the evaluation metrics that existed up to this point just aren't good enough, and we are having to invent new ones, and new heuristics for doing so.

And the most interesting and most popular of those heuristics at the moment is actually using a language model to do the evaluation. 
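Because the grader is itself a heuristic, it can help to spot-check some grades by hand. In the graded outputs above, all three examples come back `CORRECT`, including the second one, where the real answer names the PSGA and PDPA while the predicted answer names the GDPR and DPA 2018. The sketch below is one possible way to pick rows for manual review; `needs_review` is a hypothetical helper that relies on the self-reported hallucination note appearing in this notebook's predictions.

```python
# Text that this notebook's model appends to answers it is unsure about.
HALLUCINATION_NOTE = "may have been hallucinated"


def needs_review(grade, prediction):
    """Flag a row if it was not graded CORRECT or the answer carries the hallucination note."""
    not_correct = grade.strip().upper() != "CORRECT"
    self_flagged = HALLUCINATION_NOTE in prediction.lower()
    return not_correct or self_flagged


# Grades and (shortened) predicted answers taken from the outputs shown above.
rows = [
    (" CORRECT", "The data security measures implemented by each agency differ due to "
                 "the specific needs of the agency."),
    (" CORRECT", "The two legal frameworks are the General Data Protection Regulation (GDPR) "
                 "and the Data Protection Act 2018 (DPA 2018).\nThe above response may have "
                 "been hallucinated, and should be independently verified."),
]

flagged = [i for i, (grade, prediction) in enumerate(rows) if needs_review(grade, prediction)]
print("rows to review by hand:", flagged)
```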

In [0]:
data = {"same_as_answer": [r['results'].strip() for r in graded_outputs],
        'question': [r['query'] for r in generated_examples], 
        'answer': [r['answer'] for r in generated_examples], 
        'predicted_answer': [r['answer'] for r in predictions],
        }

results = pd.DataFrame(data)
display(results)

same_as_answer,question,answer,predicted_answer
CORRECT,"According to the document, why do the data security measures implemented by each agency differ?",The data security measures implemented by each agency differ depending on various factors such as the sensitivity of the data and the agency's assessment of data security risks.,"The data security measures implemented by each agency differ due to the specific needs of the agency, the type of data being protected, and the resources available to the agency."
CORRECT,What are the two legal frameworks that govern data management in the public and private sectors?,The two legal frameworks that govern data management in the public and private sectors are the Public Sector (Governance) Act (PSGA) for the public sector and the Personal Data Protection Act (PDPA) for the private sector.,"The two legal frameworks that govern data management in the public and private sectors are the General Data Protection Regulation (GDPR) and the Data Protection Act 2018 (DPA 2018). The above response may have been hallucinated, and should be independently verified."
CORRECT,What are the key policies of the Government's Third-Party Management Framework?,"The key policies of the Government's Third-Party Management Framework are designed to guide agencies in ensuring that third parties adequately safeguard data. These policies are organized based on the lifecycle of the relationship between the agency and the third party, including evaluation and selection, contracting and on-boarding, service management, and transition out.","The key policies of the Government's Third-Party Management Framework include the following: 1) Establishing a risk-based approach to third-party management; 2) Establishing a framework for assessing and managing third-party risk; 3) Establishing a process for monitoring and reporting on third-party performance; 4) Establishing a process for managing third-party contracts; and 5) Establishing a process for managing third-party data security. The above response may have been hallucinated, and should be independently verified."


In [0]:
print("Correct results:  " + str(results.value_counts("same_as_answer")))
print("Total results:  " + str(len(results)))

Correct results:  same_as_answer
CORRECT    3
dtype: int64
Total results:  3
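The counts above can be turned into a single accuracy number, which makes it easier to answer the "better or worse?" question when you change the implementation. This is a minimal sketch: `accuracy` is an illustrative helper, and the `candidate` grades are hypothetical values standing in for a second run with a changed setup.

```python
def accuracy(grades):
    """Fraction of grades that are CORRECT, ignoring surrounding whitespace and case."""
    if not grades:
        return 0.0
    correct = sum(1 for grade in grades if grade.strip().upper() == "CORRECT")
    return correct / len(grades)


# Grades from the run shown above.
baseline = [" CORRECT", " CORRECT", " CORRECT"]
# Hypothetical grades for the same questions after changing the setup.
candidate = [" CORRECT", " INCORRECT", " CORRECT"]

baseline_acc = accuracy(baseline)
candidate_acc = accuracy(candidate)
print(f"baseline:  {baseline_acc:.2f}")
print(f"candidate: {candidate_acc:.2f}")
print("candidate is better" if candidate_acc > baseline_acc else "candidate is not better")
```

With only three examples a single grade moves the score by a third, so a larger evaluation set gives a more trustworthy comparison.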


In [0]:
question = "The personal data can be disclosed for individuals who have been been dead for how minimum many years?"
queries = pd.DataFrame({'question': [question]})
predictions = model.predict(queries)
predictions

[{'question': 'The personal data can be disclosed for individuals who have been been dead for how minimum many years?',
  'answer': 'I cannot answer the question.'}]

In [0]:
model.unwrap_python_model().qabot.retriever.get_relevant_documents(question)

[Document(page_content='Government Personal Data Protection Policies (Annex A)   |   28\n \n \n \n03\nThe collection, use or disclosure (as the case may be) of personal data \nabout an individual, where — \n(a)\nconsent for the collection, use or disclosure (as the case may \nbe) cannot be obtained in a timely way; and \n(b)\nthere are reasonable grounds to believe that the health or \nsafety of the individual or another individual will be seriously \naffected..04\nThe collection, use or disclosure of personal data is for the purpose of \ncontacting the next-of-kin or a friend of any injured, ill or deceased \nindividual.', metadata={}),
 Document(page_content='03\nThe collection, use or disclosure (as the case may be) of personal data \nabout an individual is solely for archival or historical purposes, if a \nreasonable person would not consider the personal data to be too \nsensitive to the individual to be collected, used or disclosed (as the case \nmay be) at the proposed time..Gov