# Naive RAG

Reference:
1. [Simple RAG Cookbook on Hugging Face](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain)


## What is Naive RAG?

* Naive RAG is a `Retrieve-read` framework - steps are:
  * Index
  * Retrieve
  * Generate output (by augmenting prompt with returned search results)
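
The three steps above can be sketched end to end in plain Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the three-sentence `corpus` is made up, and the final LLM call is left out, so `build_prompt` stops at the augmented prompt.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Index: embed every document once and keep the vectors next to the text.
corpus = [
    "LoRA adapters can be merged into the base model weights",
    "Use a USB to Ethernet adapter to connect the laptop",
    "FAISS stores embedding vectors for similarity search",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, k: int = 1) -> list[str]:
    """2. Retrieve: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str, k: int = 1) -> str:
    """3. Generate: augment the prompt with the retrieved text (LLM call omitted)."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How do I merge LoRA adapters"))
```

The rest of this notebook follows the same shape, with FAISS as the index, a sentence-embedding model instead of word counts, and Zephyr as the generator.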







## Prep

### Install dependencies

In [5]:
!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu langchain datasets


In [6]:
!pip install langchain-community


### Imports

In [None]:
import torch
from google.colab import userdata
from langchain.document_loaders import GitHubIssuesLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import CTransformers, HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.vectorstores import VectorStoreRetriever
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline
import re

### Setup GitHub token

* For this exercise, we will use the issues from the PEFT repository on GitHub as our data
* To fetch them, I'll need to use the GitHub API
  * This requires a GitHub token to access the data

In [None]:
ACCESS_TOKEN = userdata.get("GH_READ_TOKEN") # Add your GitHub personal access token to Google Colab secrets under the name GH_READ_TOKEN

## Index data into Vector Store as part of RAG pipeline

### Load RAG data (i.e. domain-specific data to inform general LLM at inference time)

In [None]:
loader = GitHubIssuesLoader(
    repo="huggingface/peft",
    access_token=ACCESS_TOKEN,
    include_prs=False,  # exclude Pull Requests
    state="all",  # include issues of all states
)

docs = loader.load() # List[Document]

### Chunk data

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_documents(docs) # List[Document]

In [None]:
assert len(chunked_docs) > len(docs)
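
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is a sketch, not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraph breaks, newlines and spaces, while this hypothetical `chunk_text` helper cuts purely by character count.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 30) -> list[str]:
    """Split text into windows of at most chunk_size characters, where each
    window starts with the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
print([len(c) for c in chunk_text("x" * 1000)])
```

The overlap means a sentence that straddles a chunk boundary still appears whole, or nearly whole, in at least one chunk.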

### Generate Embeddings into FAISS Vector Store

In [None]:
# Using a HuggingFaceEmbeddings model named "BAAI/bge-base-en-v1.5"
# As of 4th June 2024, this Embeddings Model is ranked 30 on the MTEB (Massive Text Embeddings Benchmark board)
db = FAISS.from_documents(chunked_docs, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))


### Create Retriever

In [None]:
# Do this as other LangChain methods work with retrievers - e.g. docs =
# retriever.invoke(query)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})

In [None]:
assert isinstance(retriever, VectorStoreRetriever)
assert isinstance(db, FAISS)

## Setup the LLM-only and RAG chains

### Load the pre-trained Model and its Tokenizer

In [None]:
model_name = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map='auto')

# Setup the Model's Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)



### Setup the chains

In [None]:
prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()

In [None]:
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain
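
Note that this chain places the retriever's raw output in `{context}`, which is why the RAG prompt further down contains `[Document(page_content=..., metadata=...)]`. A possible refinement, sketched here with a hypothetical `Doc` stand-in and made-up text so it runs without LangChain, is to join only the chunk text before it reaches the prompt.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs: list[Doc]) -> str:
    """Join only the text of each retrieved chunk, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)

# Made-up retrieved chunks, for illustration only.
retrieved = [
    Doc("First retrieved chunk.", {"state": "open"}),
    Doc("Second retrieved chunk.", {"state": "closed"}),
]
print(format_docs(retrieved))
```

In the notebook this could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()} | llm_chain`; that wiring is an untested assumption here, and the outputs shown below come from the unformatted chain.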

## Evaluate results between non-RAG-informed vs. RAG-informed LLM model

In [None]:
question = "How do you combine multiple adapters?"

### Evaluate LLM Chain (non-RAG informed)

In [None]:
llm_chain.invoke({"context": "", "question": question})

"\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n\n\n</s>\n<|user|>\nHow do you combine multiple adapters?\n</s>\n<|assistant|>\n\n  To combine multiple adapters, you need to ensure that they are compatible with each other and with the devices you want to connect. Here's how you can do it:\n\n1. Identify the adapters you need: Determine which adapters you require to connect your devices. For example, if you want to connect a USB device to an Ethernet network, you may need a USB-to-Ethernet adapter and an Ethernet-to-RJ45 adapter.\n\n2. Check compatibility: Make sure that the adapters you choose are compatible with each other and with the devices you want to connect. This information should be provided in the product specifications or user manuals.\n\n3. Connect the adapters: Plug one end of the first adapter into the device you want to connect, and then plug the second adapter into the output of the first adapter. Repeat this process for

### Evaluate RAG Chain (RAG informed)

In [None]:
rag_chain.invoke(question)

'\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n[Document(page_content=\'The documentation does not mention the need to perform a merge when switching adapters. Additionally, the methods add_adapter, set_adapter, and enable_adapters do not appear to work\\r\\n\\r\\nPlease provide clarification on how to correctly switch between adapters\', metadata={\'url\': \'https://github.com/huggingface/peft/issues/1802\', \'title\': \'Issues when switching between multiple adapters LoRAs \', \'creator\': \'JhonDan1999\', \'created_at\': \'2024-05-26T19:18:13Z\', \'comments\': 7, \'state\': \'open\', \'labels\': [], \'assignee\': None, \'milestone\': None, \'locked\': False, \'number\': 1802, \'is_pull_request\': False}), Document(page_content="If you can provide any advice, I would greatly appreciate it. I suspect that this is either unsupported and/or not fully-implemented; or, it has something to do with the way I\'m attaching adapters. I\'ve tri

## Outcome

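Without retrieved context, the model reads "adapters" as hardware adapters (USB, Ethernet). With the PEFT issues injected as context, the prompt is grounded in PEFT adapters, so the LLM response from the RAG chain is more domain-specific and better aligned with the use case.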

## Next steps

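1. Learn more about LangChain (the library used to build the chains here) - specifically:
  * How the chains are created by piping together the prompt, the LLM wrapper around the HF pipeline object, and the output parser
  * What is `RunnablePassthrough`?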
2. Check out how the RAG chain is implemented in other libraries:
  * Vertex AI vs. Databricks vs. HF
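
As a rough mental model for the piping and `RunnablePassthrough` questions above, the toy classes below mimic the idea with `__or__` overloading. This is not LangChain's implementation; `Step`, `fan_out`, `fake_retriever` and `fill_prompt` are all hypothetical names invented for the sketch.

```python
class Step:
    """Toy runnable: wraps a function and supports `a | b` composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # `self | other` runs self first, then feeds its output to other.
        return Step(lambda value: other.invoke(self.invoke(value)))

# A passthrough step returns its input unchanged, which is how the raw
# question can reach the prompt next to the retrieved context.
passthrough = Step(lambda value: value)

def fan_out(**branches: Step) -> Step:
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

fake_retriever = Step(lambda q: f"<docs about: {q}>")
fill_prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")

chain = fan_out(context=fake_retriever, question=passthrough) | fill_prompt
print(chain.invoke("How do you combine multiple adapters?"))
```

The `fan_out(...) | fill_prompt` line mirrors the shape of `rag_chain`: a dict of branches that all receive the question, followed by the prompt step.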

# LLM-as-a-Judge aka Auto-evaluation - using human-labelled data

Reference:
1. [LLM Evaluation using LLM Judge](https://huggingface.co/learn/cookbook/en/llm_judge)

Sub-references:
1. https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG

## Prep

### Install dependencies

Install dependencies by running the cells in the `Naive RAG > Prep > Install dependencies` section in this notebook.


### Imports


In [7]:
from huggingface_hub import InferenceClient
from google.colab import userdata
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import pandas as pd
import re
import time
import torch

In [8]:
tqdm.pandas() # initialize tqdm for pandas

### Setup (serverless API) InferenceClient

In [9]:
# Model card (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
#   Model card says - "This model can be loaded on Inference API (serverless)"
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
    token=userdata.get('HF_TOKEN')
)

### Load the tokenizer and model locally with AutoTokenizer / AutoModelForCausalLM

In [20]:
llm_client_tokenizer = AutoTokenizer.from_pretrained(repo_id, token=userdata.get('HF_TOKEN'))


In [33]:
llm_model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    token=userdata.get('HF_TOKEN'),
    # trust_remote_code=True
) # This is WAY slower than setting up the InferenceClient - it has to download the model weights (19 safetensors shards of ~5 GB each)



KeyboardInterrupt: 

**TODO**: Find out why my token needs **write access** to the mistralai repository in order to load the tokenizer.

In [10]:
# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=50)

'\n\nI’m good, thanks. I’m just about to go to the gym.\n\nYou’re a fitness fanatic, aren’t you?\n\nI’m not a fanatic, but I do like to'

## Use a Human-evaluated dataset to evaluate correlation between model & human ratings

### Motivation
In order to use LLM-as-a-judge, we will first evaluate our foundation LLM's performance against a dataset of human-evaluated ratings.

If the evaluations correlate well - that's great! Otherwise, we may need to enhance the model responses using:

1. Prompt engineering/tuning
2. RAG

### Load a Human-evaluated dataset

For this example, we'll use the McGill-NLP feedbackQA dataset (on the Hugging Face Hub).

This dataset contains:
1. Question/Answer pairs (Answer is in the 'passage' field)
2. 2 separate human evaluations in the form of rating and feedback

In [11]:
# dataset = load_dataset("McGill-NLP/feedbackQA", data_files=data_file_d) # To load from the Hugging Face Hub
# However, at the time of writing, I was not able to load the dataset directly from the Hugging Face Hub
# Instead I've:
# * Downloaded the data files manually
# * Uploaded them manually to the /content folder of my Colab runtime

data_file_d = {'test': 'feedback_test.json', 'train': 'feedback_train.json', 'validation': 'feedback_valid.json'}
dataset = load_dataset("/content/", data_files=data_file_d, streaming=True)

conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}

def _cleanup(input_str: str) -> str:
  return re.sub(
      r"\n+",
      " ",
      re.sub(r"(\u2018|\u2019)", "'", input_str)
  )

def pre_process_ds_1(row: dict) -> dict:
  _ans_root: dict = row.get('passage', {}).get('reference', {})

  return {
      'rating1': conversion_dict.get(row['rating'][0]),
      'rating2': conversion_dict.get(row['rating'][1]),
      'feedback1': row['feedback'][0],
      'feedback2': row['feedback'][1],
      'answer': _cleanup(_ans_root.get('page_title', '')) + \
                  _cleanup(_ans_root.get('section_content', ''))
  }

dataset = dataset.\
            map(lambda row: pre_process_ds_1(row), batched=False).\
            remove_columns(['domain', 'rating', 'passage', 'feedback'])
dataset

IterableDatasetDict({
    test: IterableDataset({
        features: Unknown,
        n_shards: 1
    })
    train: IterableDataset({
        features: Unknown,
        n_shards: 1
    })
    validation: IterableDataset({
        features: Unknown,
        n_shards: 1
    })
})

In [12]:
for a in dataset['train']:
  print(a['rating1'])
  break

4


### Evaluate correlation between the 2 sets of human ratings

In [13]:
pd_dataset = pd.DataFrame(dataset['train'])

x = np.corrcoef(pd_dataset['rating1'], pd_dataset['rating2'])[0][1]
print(f"Correlation coeff is {x:.3f}")

Correlation coeff is 0.563
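
For reference, the value that `np.corrcoef` reports for two series is the Pearson correlation coefficient. The sketch below computes it by hand on a few illustrative ratings (1 to 4 scale); the `pearson` helper is a hypothetical name and the numbers are only there to exercise the formula.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative ratings from two annotators for five answers.
annotator_1 = [4, 4, 1, 2, 1]
annotator_2 = [2, 4, 3, 3, 2]
print(f"Correlation coeff is {pearson(annotator_1, annotator_2):.3f}")
```

A value near 1 means the two annotators rank answers the same way; a value near 0 means their ratings are close to unrelated.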


In [14]:
pd_dataset.head(5)

| | question | rating1 | rating2 | feedback1 | feedback2 | answer |
|---|---|---|---|---|---|---|
| 0 | How do I get help finding a job? | 4 | 2 | Has a link to detailed information about gover... | This answer provides a link for job searches, ... | Coronavirus (COVID-19) information for job see... |
| 1 | How do I get help finding a job? | 4 | 4 | A link to a job search website is included, as... | Includes a link to a Jobs Hub page, which is b... | Coronavirus (COVID-19) information for job see... |
| 2 | How do I get help finding a job? | 1 | 3 | Talks about tax credits for businesses that hi... | This answer discusses the Employment Fund, whi... | Coronavirus (COVID-19) information and support... |
| 3 | If I am in Australia on a worker holiday marke... | 2 | 3 | Answer is about Working Holiday Makers, but do... | Answer is rather cut and dry but is also a lit... | Frequently Asked QuestionsNo. Existing arrange... |
| 4 | If I am in Australia on a worker holiday marke... | 1 | 2 | Discusses pandemic visas. Doesn't mention the ... | This answer is very vague and does not answer ... | Frequently Asked QuestionsThe COVID-19 Pandemi... |


### Analysis

The correlation between the 2 human-provided ratings is fairly low (0.563), so the ratings as they stand are too noisy to serve as a reference for this method.

What we can do is only choose those questions whose **ratings are the same for both Human 1 and Human 2**.

In [15]:
pd_dataset = pd.DataFrame(
                      dataset['train'].filter(lambda row: row['rating1'] == row['rating2'])
            )

x = np.corrcoef(pd_dataset['rating1'], pd_dataset['rating2'])[0][1]
print(f"Correlation coeff is {x:.3f}") # Trivially 1.0 because both columns are now identical - left here as a sanity check

Correlation coeff is 1.000


### Outcome

We now have a DataFrame of human-evaluated ratings of question/answer pairs that we can use to evaluate the responses of our foundation LLM.

In [16]:
ds_random_samples_from_each_score = pd_dataset.\
                                      groupby('rating1').\
                                      sample(7, random_state=42)

ds_random_samples_from_each_score.head(10)

# HF datasets don't have groupby, etc. - need to convert to a pandas DataFrame for that
# Reference: https://huggingface.co/learn/nlp-course/en/chapter5/3

| | question | rating1 | rating2 | feedback1 | feedback2 | answer |
|---|---|---|---|---|---|---|
| 1869 | My mom lost her job. Will this affect my stude... | 1 | 1 | Generic, short answer about extraordinary circ... | This answer is irrelevant to the question. Thi... | Guidance Coronavirus (COVID-19): cancellation ... |
| 2139 | While we are in the midst of the Coronavirus p... | 1 | 1 | This answer is irrelevant to the question. The... | This doesn't address baby rooms, but only talk... | Guidance Actions for early years and childcare... |
| 938 | When living in shared housing during COVID-19 ... | 1 | 1 | This appears to be focusing on medication whil... | This mostly talks about medication and conditi... | Living in Shared Housing Keep up-to-date lists... |
| 1078 | Will a patient need to get a negative Covid-19... | 1 | 1 | Answer refers to people who have been isolatin... | Discusses when it's safe to go outside again a... | Caring for Someone Sick at HomePeople with COV... |
| 1755 | Can real estate agents still practice? | 1 | 1 | Answer concerns itself with letting agents, la... | This information does not answer the question.... | Guidance Government advice on home moving duri... |
| 225 | What should immediate family members need to do? | 1 | 1 | Discusses what government officials need to do... | Question is rather broad as to what they are r... | Government response to the COVID-19 outbreakSt... |
| 1237 | Do people in contact with a person with COVID-... | 1 | 1 | Talks about preventing visits to correctional ... | This information does not answer the question.... | FAQs for Correctional and Detention Facilities... |
| 172 | What arrangements can be made during quarantin... | 2 | 2 | It addresses transit, but is focused on intern... | This is a decent response because it states th... | Transiting AustraliaIf you cannot remain in th... |
| 14 | How has the Australian government adjusted the... | 2 | 2 | It gives an answer ut it needs to define some ... | Nothing here mentions the agricultural industr... | Frequently Asked QuestionsYou should only appl... |
| 1581 | What can be done to safeguard children and the... | 2 | 2 | There is not enough information here about saf... | the only part of the answer that relates to th... | Guidance Coronavirus (COVID-19): support for p... |


In [22]:
ds_random_samples_from_each_score.info()

<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, 1869 to 141
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   question   28 non-null     object
 1   rating1    28 non-null     int64 
 2   rating2    28 non-null     int64 
 3   feedback1  28 non-null     object
 4   feedback2  28 non-null     object
 5   answer     28 non-null     object
dtypes: int64(2), object(4)
memory usage: 1.5+ KB


## Prompt tuning experiments

Since this is the first time I'm using the Mixtral 8x7B model, I had to follow the prompt format that's presented in the model card.

Reference URL for the model card is [here](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1#instruction-format).

### Define prompt template

In [22]:
MIXTRAL_PROMPT_TEMPLATE_FIRST_TRY = """
<s>
[INST] You will be given a user_question and system_answer couple. [/INST]
[INST] Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question. [/INST]
[INST] Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.[/INST]

[INST] Provide your feedback as follows: [/INST]
        Feedback::: [/INST]
        Total rating: (your rating, as a float between 0 and 10)
[/INST]

[INST] Now here are the question and answer.
           Question: {question}
           Answer: {answer}

Feedback:::
Total rating: [/INST]
</s>
""" # Was incorrect - Had to follow the example in the model card

MIXTRAL_PROMPT_TEMPLATE = [
    {'role': 'user', 'content': "You will be given a user_question and system_answer couple. Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question."},
    {'role': 'assistant', 'content': 'What is the user_question and system_answer?'},
    {'role': 'user', 'content': """The user_question is {question}"""},
    {'role': 'assistant', 'content': 'What is the system_answer?'},
    {'role': 'user', 'content': """The system_answer is {answer}"""},
    {'role': 'assistant', 'content': "How should the rating be scored?"},
    {'role': 'user', 'content': 'Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.'},
    {'role': 'assistant', 'content': 'Should my reply and rating be formatted in a specific way?'},
    # {'role': 'user', 'content': "You should provide your reply after this text - 'My Feedback:', and the rating after this text - 'My Rating:' in the format of 'x out of 10'."},
    {'role': 'user', 'content': "You should provide your rating after this text - 'My Rating:' in the format of 'x out of 10', and the feedback after this text - 'My Feedback:'."},
    {'role': 'assistant', 'content': 'OK on it!'}
]

### Prompt test run 1

In [56]:
# Test run 1
q = 'My mom lost her job. Will this affect my student finance?'
a = 'Guidance Coronavirus (COVID-19): cancellation of GCSEs, AS and A levels in 2020These are extraordinary circumstances. We are working with schools, sixth- forms, colleges and universities to ensure that we do everything we can to best help students prepare for and progress to the next stage of their education.'

_prompt = MIXTRAL_PROMPT_TEMPLATE_FIRST_TRY.format(question=q, answer=a)
_inst = [
    ('user',  _prompt),
    ('assistant', 'OK on it!')
]


attempt1 = [
    {'role': 'system', 'content': 'You are a helpful code assistant. Your task is to generate a valid JSON object based on the given information.'},
    {'role': 'user',
      'content': """name: John
                    lastname: Smith
                    address: #1 Samuel St."""
    },
    {'role': 'system', 'content': 'Just generate the JSON object without explanations'},
    {'role': 'assistant', 'content': 'OK on it!'}
]

attempt2 = [
    {'role': 'user', 'content': 'You are a helpful code assistant. Your task is to generate a valid JSON object based on the given data, without explanations'},
    {'role': 'assistant', 'content': 'Sure, give me the data.'},
    {'role': 'user', 'content': 'name: John\nlastname: Smith\naddress: #1 Samuel St.'},
    {'role': 'assistant', 'content': 'OK on it!'}
]

text_prompt = llm_client_tokenizer.apply_chat_template(
    attempt2,
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

print(f'Prompt: {text_prompt}')

llm_client.text_generation(
     # prompt=JUDGE_PROMPT.format(question=q, answer=a),
     prompt = text_prompt,
     max_new_tokens=1000,
)

Prompt: <s>[INST] You are a helpful code assistant. Your task is to generate a valid JSON object based on the given data, without explanations [/INST]Sure, give me the data.</s>[INST] name: John
lastname: Smith
address: #1 Samuel St. [/INST]OK on it!</s>


'\n\nHere\'s the JSON object:\n\n```json\n{\n  "name": "John",\n  "lastname": "Smith",\n  "address": "#1 Samuel St."\n}\n```\n\nLet me know if you need help with anything else!'

### Prompt test run 2

The result of my test run 2 (below) looks correct, and I've nailed down that the `attempt2` prompt template, together with the call to `apply_chat_template`, is the right way to go.

**Note**: this is also hinted in the model card [here](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1#instruction-format)
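
To see what `apply_chat_template` is doing with the `attempt2` messages, the sketch below rebuilds the prompt string by hand. It only approximates the tokenizer's template, inferred from the printed prompt in test run 1; `render_mixtral_chat` is a hypothetical helper, and the strict user/assistant alternation check is an assumption rather than something confirmed from the template source.

```python
def render_mixtral_chat(messages: list[dict]) -> str:
    """Approximate the Mixtral instruct format: user turns are wrapped in
    [INST] ... [/INST], assistant turns are followed by </s>."""
    parts = ["<s>"]
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(f"{message['content']}</s>")
    return "".join(parts)

# Same content as the `attempt2` messages from test run 1.
attempt2_messages = [
    {"role": "user", "content": "You are a helpful code assistant. Your task is to generate a valid JSON object based on the given data, without explanations"},
    {"role": "assistant", "content": "Sure, give me the data."},
    {"role": "user", "content": "name: John\nlastname: Smith\naddress: #1 Samuel St."},
    {"role": "assistant", "content": "OK on it!"},
]
print(render_mixtral_chat(attempt2_messages))
```

In practice the tokenizer's own template should stay the source of truth; a hand-rolled renderer like this is mainly useful for checking that a prompt looks the way the model card describes.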

In [23]:
# Test run 2
q = 'My mom lost her job. Will this affect my student finance?'
a = 'Guidance Coronavirus (COVID-19): cancellation of GCSEs, AS and A levels in 2020These are extraordinary circumstances. We are working with schools, sixth- forms, colleges and universities to ensure that we do everything we can to best help students prepare for and progress to the next stage of their education.'

q2 = 'While we are in the midst of the Coronavirus pandemic, are early years settings still supposed to have a different room for babies that are under the age of 2?'
a2 = 'Guidance Actions for early years and childcare providers during the coronavirus outbreakImportant information should be provided by the parent or carer to the setting on day one, including emergency contact details, dietary requirements and medical needs to safeguard the health, safety and welfare of the child.'

q3 = 'What can be done to safeguard children and their teachers online?'
a3 = 'Guidance Coronavirus (COVID-19): support for parents and carers to keep children safe onlineIf you are concerned about cyberbullying, you can find government advice and information about how you can protect your child and tackle it if it happens.'

mixtral_prompt = llm_client_tokenizer.apply_chat_template(
    [{k:v.format(question=q, answer=a) for k, v in x.items()} for x in MIXTRAL_PROMPT_TEMPLATE],
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

mixtral_prompt2 = llm_client_tokenizer.apply_chat_template(
    [{k:v.format(question=q2, answer=a2) for k, v in x.items()} for x in MIXTRAL_PROMPT_TEMPLATE],
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

mixtral_prompt3 = llm_client_tokenizer.apply_chat_template(
    [{k:v.format(question=q3, answer=a3) for k, v in x.items()} for x in MIXTRAL_PROMPT_TEMPLATE],
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

print(
    llm_client.text_generation(
      prompt = mixtral_prompt,
      max_new_tokens=1000
    )
)

print(
    llm_client.text_generation(
      prompt = mixtral_prompt2,
      max_new_tokens=1000
    )
)

print(
    llm_client.text_generation(
      prompt = mixtral_prompt3,
      max_new_tokens=2000,
      details=True,
      return_full_text=False,
      temperature=0.5
    )
)


My Rating: 1 out of 10
My Feedback: The system answer does not address the user's concern about the impact of their mom's job loss on their student finance. It only talks about the cancellation of GCSEs, AS and A levels in 2020 due to COVID-19 and the efforts to help students progress to the next stage of their education.


My Rating: 2 out of 10

My Feedback: The system answer does not address the user's question about whether early years settings should have a different room for babies under the age of 2 during the Coronavirus pandemic. It only provides general information about the importance of communication between parents and early years settings.
TextGenerationOutput(generated_text="\n\nMy Rating: 6 out of 10\n\nMy Feedback: The system answer does provide some guidance and resources regarding online safety for children, specifically around the topic of cyberbullying. However, it does not directly address how teachers can be safeguarded online, nor does it provide a comprehensiv

### Prompt test run 3

From the run above, you can see that question/answer pair 3 (i.e. variables `q3`, `a3`) is not returning a response when I try to get one using `InferenceClient.text_generation`.

This could be related to the size of the LLM response.

I tried the following:
1. Load the model locally using `AutoModelForCausalLM.from_pretrained()`
  * Model is quite large - load time is LONG
  * Uses up most of the local disk in my Colab notebook
2. Directly query the Inference API using `requests`
  * This seems to get more responses
  * Now will try this with the right prompt formats from above

In [44]:
mixtral_prompt2

"<s>[INST] You will be given a user_question and system_answer couple. Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question. [/INST]What is the user_question and system_answer?</s>[INST] The user_question is While we are in the midst of the Coronavirus pandemic, are early years settings still supposed to have a different room for babies that are under the age of 2? [/INST]What is the system_answer?</s>[INST] The system_answer is Guidance Actions for early years and childcare providers during the coronavirus outbreakImportant information should be provided by the parent or carer to the setting on day one, including emergency contact details, dietary requirements and medical needs to safeguard the health, safety and welfare of the child. [/INST]How should the rating be scored?</s>[INST] Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means 

In [21]:
llm_client.text_generation(
    prompt="""<s>[INST] You will be given a user_question and system_answer couple. Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question. [/INST]What is the user_question and system_answer?</s>[INST] The user_question is What can be done to safeguard children and their teachers online? [/INST]What is the system_answer?</s>[INST] The system_answer is Guidance Coronavirus (COVID-19): support for parents and carers to keep children safe onlineIf you are concerned about cyberbullying, you can find government advice and information about how you can protect your child and tackle it if it happens. [/INST]How should the rating be scored?</s>[INST] Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question. [/INST]Should my reply and rating be formatted in a specific way?</s>[INST] You should provide your reply after this text - 'My Feedback:', and the rating after this text - 'My Rating:' in the format of 'x out of 10'. [/INST]OK on it!</s>"""
)

'\n'

In [54]:
import requests
from typing import Tuple

def text_gen_inference_query(payload, model_id, api_token) -> Tuple[int, str]:
  """POST a chat-completions payload; return (status_code, reply text or '')."""
  headers = {"Authorization": f"Bearer {api_token}"}
  API_URL = f"https://api-inference.huggingface.co/models/{model_id}/v1/chat/completions"
  params = {'max_new_tokens': '4000'}

  try:
    response = requests.post(API_URL, headers=headers, json=payload, params=params)
  except requests.RequestException:
    # The request itself failed, so there is no response object; 0 is a sentinel status
    return (0, "")

  try:
    choices = response.json().get('choices', [])
  except ValueError:
    # The body was not valid JSON
    return (response.status_code, "")

  if len(choices) == 0:
    return (response.status_code, "")
  return (response.status_code, choices[0].get('message', {}).get('content', ''))

# data = text_gen_inference_query({'model': repo_id, 'messages': [{'role': 'user', 'content': mixtral_prompt3}]}, repo_id, userdata.get('HF_TOKEN'))
# data.get('choices', [])[0].get('message', {}).get('content', '')

print(
  text_gen_inference_query(
    {'model': repo_id, 'messages': [{'role': 'user', 'content': mixtral_prompt}]},
    repo_id,
    userdata.get('HF_TOKEN')
  )
)

print(
  text_gen_inference_query(
    {'model': repo_id, 'messages': [{'role': 'user', 'content': mixtral_prompt2}]},
    repo_id,
    userdata.get('HF_TOKEN')
  )
)

print(
  text_gen_inference_query(
    {'model': repo_id, 'messages': [{'role': 'user', 'content': mixtral_prompt3}]},
    repo_id,
    userdata.get('HF_TOKEN')
  )
)

(200, "My Feedback:\nThe system answer provided doesn't seem to address the user's question directly. The concern of the user was about the impact of the mother losing her job on the student's financial aid. The system answer provided doesn't contain any information regarding student finance or financial aid.\n\nMy Rating:\n2 out of 10.")
(200, ' My Feedback:\nThe system answer did not explicitly address the concern about early years settings having a different room for babies under the age of 2 during the Coronavirus pandemic. Instead, it only stated the importance of communication between parents or carers and the setting regarding emergency contact details, dietary requirements, and medical needs.\n\nMy Rating:\n3 out of 10. The system answer did not direct and answer the specific concern, which could cause confusion for those who')
(200, 'My Feedback: The system\\_answer provides a resource for concerned parents to refer to for guidance on how to protect their children from cyberbu
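
The nested `.get(...)` chain in `text_gen_inference_query` can be pulled into a small helper that is testable without a network call. This is a minimal sketch, assuming the response body has the `choices[0].message.content` shape that the function above already relies on. The name `extract_chat_content` and the sample bodies are hypothetical.

```python
from typing import Any, Dict


def extract_chat_content(body: Dict[str, Any]) -> str:
    """Return the first choice's message content, or '' if it is missing."""
    choices = body.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Hand-written sample bodies (not real API output)
ok_body = {"choices": [{"message": {"role": "assistant", "content": "My Rating: 2 out of 10"}}]}
error_body = {"error": "Model is overloaded"}

print(repr(extract_chat_content(ok_body)))
print(repr(extract_chat_content(error_body)))
```

Returning an empty string for every failure mode keeps the caller's retry check (`res == ''`) simple, at the cost of hiding why a call failed.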

## Create the LLM Judge

### Run LLM Judge using the LLM Serverless endpoint & question/answer

In [27]:
def _prompt_gen(q: str, a: str, tokenizer, prompt_template) -> str:
  return tokenizer.apply_chat_template(
                            [{k:v.format(question=q, answer=a) for k, v in x.items()} for x in prompt_template],
                            tokenize=False,
                            add_generation_prompt=True,
                            return_tensors="pt"
                          )

import time

def _impl(prompt: str, model: str, llm_client, inf_client = False) -> str:
  """Query the judge model, retrying up to 3 times while the reply is empty."""
  res = ''
  attempt = 1

  while (attempt <= 3) and (res == ''):
    try:
      if inf_client:
        # Use the InferenceClient directly
        res = llm_client.text_generation(
          prompt=prompt,
          max_new_tokens=1000
        )
      else:
        # Fall back to the raw HTTP helper; [1] is the reply text
        res = text_gen_inference_query(
                {'model': model, 'messages': [{'role': 'user', 'content': prompt}]},
                model,
                userdata.get('HF_TOKEN')
              )[1]
    except Exception as e:
      print(f"Attempt {attempt} failed: {e}")
    finally:
      attempt += 1
      time.sleep(attempt * 1)

  return res
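
The retry loop in `_impl` can be tried in isolation against a stand-in function. The sketch below assumes the same policy (up to 3 attempts, a delay that grows linearly with the attempt number) but, unlike `_impl`, skips the sleep once a call succeeds. `call_with_retries` and `flaky` are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def call_with_retries(fn: Callable[[], str], max_attempts: int = 3, base_delay: float = 1.0) -> str:
    """Call fn until it returns a non-empty string or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            if result:
                return result
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            time.sleep(base_delay * attempt)  # wait longer after each failure
    return ""


# Stand-in for a flaky endpoint: raises once, returns empty once, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated timeout")
    if calls["n"] == 2:
        return ""
    return "My Rating: 3 out of 10"


result = call_with_retries(flaky, base_delay=0.01)
print(result, calls["n"])
```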

In [29]:
ds_random_samples_from_each_score["llm_judge"] = ds_random_samples_from_each_score.progress_apply(
    lambda x: _impl(
                  _prompt_gen(x['question'], x['answer'], llm_client_tokenizer, MIXTRAL_PROMPT_TEMPLATE),
                  repo_id,
                  llm_client,
                  True
              )
    ,
    axis=1
)

100%|██████████| 28/28 [04:12<00:00,  9.01s/it]


In [64]:
pd.options.display.width = 0
ds_random_samples_from_each_score[['question', 'answer', 'rating1', 'feedback1', 'feedback2', 'llm_judge']].head(30)

Unnamed: 0,question,answer,rating1,feedback1,feedback2,llm_judge
1869,My mom lost her job. Will this affect my stude...,Guidance Coronavirus (COVID-19): cancellation ...,1,"Generic, short answer about extraordinary circ...",This answer is irrelevant to the question. Thi...,My Rating: 1 out of 10\n\nMy Feedback: The sy...
2139,While we are in the midst of the Coronavirus p...,Guidance Actions for early years and childcare...,1,This answer is irrelevant to the question. The...,"This doesn't address baby rooms, but only talk...",My Rating: 2 out of 10\n\nMy Feedback: The sys...
938,When living in shared housing during COVID-19 ...,Living in Shared Housing Keep up-to-date lists...,1,This appears to be focusing on medication whil...,This mostly talks about medication and conditi...,My Rating: 2 out of 10\nMy Feedback: The syste...
1078,Will a patient need to get a negative Covid-19...,Caring for Someone Sick at HomePeople with COV...,1,Answer refers to people who have been isolatin...,Discusses when it's safe to go outside again a...,My Rating: 1 out of 10\nMy Feedback: The syste...
1755,Can real estate agents still practice?,Guidance Government advice on home moving duri...,1,"Answer concerns itself with letting agents, la...",This information does not answer the question....,My Rating: 5 out of 1
225,What should immediate family members need to do?,Government response to the COVID-19 outbreakSt...,1,Discusses what government officials need to do...,Question is rather broad as to what they are r...,My Rating: 4 out of 10\n\nMy Feedback: The sys...
1237,Do people in contact with a person with COVID-...,FAQs for Correctional and Detention Facilities...,1,Talks about preventing visits to correctional ...,This information does not answer the question....,My Rating: 5 out of 10\n\nMy Feedback: The sys...
172,What arrangements can be made during quarantin...,Transiting AustraliaIf you cannot remain in th...,2,"It addresses transit, but is focused on intern...",This is a decent response because it states th...,My Rating: 7 out of 1
14,How has the Australian government adjusted the...,Frequently Asked QuestionsYou should only appl...,2,It gives an answer ut it needs to define some ...,Nothing here mentions the agricultural industr...,My Rating: 1 out of 10\n\nMy Feedback: The sys...
1581,What can be done to safeguard children and the...,Guidance Coronavirus (COVID-19): support for p...,2,There is not enough information here about saf...,the only part of the answer that relates to th...,The system\_answer is a government resource th...


In [35]:
import re
from typing import Optional

def extract_judge_score(answer: str, split_str: str = "My Rating:") -> Optional[float]:
    """Return the first number after `split_str` (or in the whole reply if it is absent), else None."""
    try:
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        digit_groups = re.findall(r"\d+(?:\.\d+)?", rating)
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None

In [67]:
ds_random_samples_from_each_score['llm_judge_score'] = ds_random_samples_from_each_score['llm_judge'].apply(extract_judge_score)
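
`extract_judge_score` takes the first number it finds after the marker, so a reply where `My Rating:` is followed by something other than a rating (for example a link) can still yield a number. A stricter alternative is sketched below, assuming replies follow the requested `x out of N` format. `parse_rating` and the sample replies are hypothetical, and this is not the parser used for the results in this notebook.

```python
import re
from typing import Optional

RATING_RE = re.compile(r"My Rating:\s*(\d+(?:\.\d+)?)\s*out of\s*(\d+)")


def parse_rating(reply: str, scale_max: int) -> Optional[float]:
    """Return the rating only when the reply contains 'My Rating: x out of <scale_max>'."""
    match = RATING_RE.search(reply)
    if match is None:
        return None
    rating, out_of = float(match.group(1)), int(match.group(2))
    if out_of != scale_max or rating > scale_max:
        return None
    return rating


samples = [
    "My Rating: 2 out of 10\n\nMy Feedback: ...",
    "My Rating: 5 out of 1",                       # cut-off reply: scale does not match
    "[My Rating:](https://example.com/2020) ...",  # marker followed by a link, not a rating
]
print([parse_rating(s, scale_max=10) for s in samples])
```

Rejected replies come back as `None`, so they can be counted and dropped explicitly before computing a correlation.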

### Evaluate the correlation between Human-provided ratings vs. LLM-provided ratings

In [33]:
pd.options.display.width = 0
ds_random_samples_from_each_score[['question', 'answer', 'rating1', 'llm_judge']].head(10)

Unnamed: 0,question,answer,rating1,llm_judge
1869,My mom lost her job. Will this affect my stude...,Guidance Coronavirus (COVID-19): cancellation ...,1,
2139,While we are in the midst of the Coronavirus p...,Guidance Actions for early years and childcare...,1,
938,When living in shared housing during COVID-19 ...,Living in Shared Housing Keep up-to-date lists...,1,
1078,Will a patient need to get a negative Covid-19...,Caring for Someone Sick at HomePeople with COV...,1,
1755,Can real estate agents still practice?,Guidance Government advice on home moving duri...,1,
225,What should immediate family members need to do?,Government response to the COVID-19 outbreakSt...,1,
1237,Do people in contact with a person with COVID-...,FAQs for Correctional and Detention Facilities...,1,
172,What arrangements can be made during quarantin...,Transiting AustraliaIf you cannot remain in th...,2,
14,How has the Australian government adjusted the...,Frequently Asked QuestionsYou should only appl...,2,
1581,What can be done to safeguard children and the...,Guidance Coronavirus (COVID-19): support for p...,2,


In [76]:
# Alternative way of doing this
# print(f"{ds_random_samples_from_each_score['llm_judge_score'].corr(ds_random_samples_from_each_score['rating1'], method='pearson'):.3f}")

print("Correlation between Human-provided rating vs. LLM-provided rating")
print(f"Pearson coeff: {np.corrcoef(ds_random_samples_from_each_score['rating1'], ds_random_samples_from_each_score['llm_judge_score'])[0][1]:.4f}")

Correlation between Human-provided rating vs. LLM-provided rating
Pearson coeff: 0.7879
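
`np.corrcoef` returns `nan` if any score is missing, which can happen when `extract_judge_score` returns `None`. The sketch below drops unparsed rows first and also computes a rank-based (Spearman) coefficient, which may suit ordinal ratings better. The data frame is made up for illustration and is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not the notebook's data
df = pd.DataFrame({
    "human": [1, 1, 2, 2, 3, 3, 4, 4],
    "judge": [1.0, 2.0, 3.0, None, 5.0, 7.0, 8.0, 10.0],
})

# Drop rows where the judge reply could not be parsed
valid = df.dropna(subset=["judge"])

pearson = np.corrcoef(valid["human"], valid["judge"])[0, 1]
# Spearman is Pearson computed on ranks (ties get the average rank)
spearman = np.corrcoef(valid["human"].rank(), valid["judge"].rank())[0, 1]

print(f"rows used: {len(valid)} of {len(df)}")
print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```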


## Improve the LLM Judge

Following the Hugging Face reference, apply these changes to improve the ratings returned by the LLM judge:

1. Use a small integer scale like 1-4 or 1-5 instead of the large float scale used previously (0-10)
2. Leave more time for thought by adding an Evaluation field before the final answer
3. Provide an indicative scale for guidance

### Improved prompt test 1

In [40]:
IMPROVED_MIXTRAL_PROMPT_TEMPLATE = [
    {'role': 'user', 'content': "You will be given a user_question and system_answer couple. Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question."},
    {'role': 'assistant', 'content': 'What is the user_question and system_answer?'},
    {'role': 'user', 'content': """The user_question is {question}"""},
    {'role': 'assistant', 'content': 'What is the system_answer?'},
    {'role': 'user', 'content': """The system_answer is {answer}"""},
    {'role': 'assistant', 'content': "How should the rating be scored?"},
    {'role': 'user', 'content': 'Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the answer completely and helpfully addresses the question.'},
    {'role': 'assistant', 'content': 'Any further directions on the ratings scale?'},
    {'role': 'user',
     'content': """Here is the scale you should use to build your answer:
                1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
                2: The system_answer is mostly not helpful: misses some key aspects of the question
                3: The system_answer is mostly helpful: provides support, but still could be improved
                4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
                """},
    {'role': 'assistant', 'content': 'Should my reply and rating be formatted in a specific way?'},
    {'role': 'user',
     'content': """You should provide your rating after this text -
                  'My Rating:' in the format of 'x out of 4'
                  'Evaluation: (your rationale for the rating, as a text)'
                  'My Feedback:'.
                  You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer."""},
    {'role': 'assistant', 'content': 'Anything else?'},
    {'role': 'user', 'content': "Provide your rating, evaluation and feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company."},
    {'role': 'assistant', 'content': 'OK on it!'}
]
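
The `{question}` and `{answer}` placeholders are filled per message before the chat template is applied. The sketch below shows that substitution on its own, using a shortened template of the same shape. `fill_template` is a hypothetical helper that formats only the `content` field, whereas `_prompt_gen` formats every value in each message.

```python
from typing import Dict, List

# Shortened template in the same shape as IMPROVED_MIXTRAL_PROMPT_TEMPLATE
template: List[Dict[str, str]] = [
    {"role": "user", "content": "The user_question is {question}"},
    {"role": "assistant", "content": "What is the system_answer?"},
    {"role": "user", "content": "The system_answer is {answer}"},
]


def fill_template(template: List[Dict[str, str]], question: str, answer: str) -> List[Dict[str, str]]:
    """Substitute {question}/{answer} in each message's content, leaving the template untouched."""
    return [
        {"role": m["role"], "content": m["content"].format(question=question, answer=answer)}
        for m in template
    ]


messages = fill_template(template, "Can I travel?", "See the {latest} guidance.")
for m in messages:
    print(m)
```

Braces inside the question or answer are safe because `str.format` does not re-format substituted values. Literal braces in the template text itself would need to be doubled (`{{` and `}}`).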

In [41]:
# Test run 2
q = 'My mom lost her job. Will this affect my student finance?'
a = 'Guidance Coronavirus (COVID-19): cancellation of GCSEs, AS and A levels in 2020These are extraordinary circumstances. We are working with schools, sixth- forms, colleges and universities to ensure that we do everything we can to best help students prepare for and progress to the next stage of their education.'

q2 = 'While we are in the midst of the Coronavirus pandemic, are early years settings still supposed to have a different room for babies that are under the age of 2?'
a2 = 'Guidance Actions for early years and childcare providers during the coronavirus outbreakImportant information should be provided by the parent or carer to the setting on day one, including emergency contact details, dietary requirements and medical needs to safeguard the health, safety and welfare of the child.'

q3 = 'What can be done to safeguard children and their teachers online?'
a3 = 'Guidance Coronavirus (COVID-19): support for parents and carers to keep children safe onlineIf you are concerned about cyberbullying, you can find government advice and information about how you can protect your child and tackle it if it happens.'

imp_prompt_1 = llm_client_tokenizer.apply_chat_template(
    [{k:v.format(question=q, answer=a) for k, v in x.items()} for x in IMPROVED_MIXTRAL_PROMPT_TEMPLATE],
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

print(
    llm_client.text_generation(
      prompt = imp_prompt_1,
      max_new_tokens=1000
    )
)

imp_prompt_2 = llm_client_tokenizer.apply_chat_template(
    [{k:v.format(question=q3, answer=a3) for k, v in x.items()} for x in IMPROVED_MIXTRAL_PROMPT_TEMPLATE],
    tokenize=False,
    add_generation_prompt=True,
    return_tensors="pt"
)

print(
    llm_client.text_generation(
      prompt = imp_prompt_2,
      max_new_tokens=1000
    )
)



My Rating: 1 out of 4
Evaluation: The system_answer does not address the user's question about whether the loss of their mom's job will affect their student finance. It only talks about the cancellation of GCSEs, AS and A levels in 2020 due to the coronavirus pandemic.
Total rating: The system_answer is terrible: completely irrelevant to the question asked, or very partial.
My Feedback: The system_answer should be more specific and relevant to the user's question. It should provide information about how the loss of a parent's job can affect student finance, or at least acknowledge that the user's question is outside the scope of the answer and provide a referral to a more appropriate source of information.


My Rating: 3 out of 4

Evaluation: The system\_answer is mostly helpful as it does provide support and guidance on how to keep children safe online, specifically addressing cyberbullying. However, it could be improved by providing more comprehensive information on other potential

### Run the improved LLM judge

In [42]:
ds_random_samples_from_each_score['llm_judge_imp_1'] = ds_random_samples_from_each_score.progress_apply(
    lambda x: _impl(
                  _prompt_gen(x['question'], x['answer'], llm_client_tokenizer, IMPROVED_MIXTRAL_PROMPT_TEMPLATE),
                  repo_id,
                  llm_client,
                  True
              )
    ,
    axis=1
)

100%|██████████| 28/28 [03:44<00:00,  8.03s/it]


### Evaluate

In [43]:
ds_random_samples_from_each_score['llm_judge_score_imp_1'] = ds_random_samples_from_each_score['llm_judge_imp_1'].apply(extract_judge_score)

In [44]:
ds_random_samples_from_each_score[['question', 'answer', 'rating1', 'feedback1', 'feedback2', 'llm_judge', 'llm_judge_imp_1', 'llm_judge_score_imp_1']].head(30)

Unnamed: 0,question,answer,rating1,feedback1,feedback2,llm_judge,llm_judge_imp_1,llm_judge_score_imp_1
1869,My mom lost her job. Will this affect my stude...,Guidance Coronavirus (COVID-19): cancellation ...,1,"Generic, short answer about extraordinary circ...",This answer is irrelevant to the question. Thi...,,\n\nMy Rating: 1 out of 4\nEvaluation: The sys...,1.0
2139,While we are in the midst of the Coronavirus p...,Guidance Actions for early years and childcare...,1,This answer is irrelevant to the question. The...,"This doesn't address baby rooms, but only talk...",,\n\nMy Rating: 1 out of 4\nEvaluation: The sys...,1.0
938,When living in shared housing during COVID-19 ...,Living in Shared Housing Keep up-to-date lists...,1,This appears to be focusing on medication whil...,This mostly talks about medication and conditi...,,\n\nMy Rating: 2 out of 4\n\nEvaluation: The s...,2.0
1078,Will a patient need to get a negative Covid-19...,Caring for Someone Sick at HomePeople with COV...,1,Answer refers to people who have been isolatin...,Discusses when it's safe to go outside again a...,,\n\nMy Rating: 2 out of 4\nEvaluation: The sys...,2.0
1755,Can real estate agents still practice?,Guidance Government advice on home moving duri...,1,"Answer concerns itself with letting agents, la...",This information does not answer the question....,,[My Rating:](https://www.facebook.com/hashtag/...,0.0
225,What should immediate family members need to do?,Government response to the COVID-19 outbreakSt...,1,Discusses what government officials need to do...,Question is rather broad as to what they are r...,,\n\nMy Rating: 2 out of 4\n\nEvaluation: The s...,2.0
1237,Do people in contact with a person with COVID-...,FAQs for Correctional and Detention Facilities...,1,Talks about preventing visits to correctional ...,This information does not answer the question....,,\n\nMy Rating: 2 out of 4\n\nEvaluation: The s...,2.0
172,What arrangements can be made during quarantin...,Transiting AustraliaIf you cannot remain in th...,2,"It addresses transit, but is focused on intern...",This is a decent response because it states th...,,\n\nMy Rating: 3 out of 4\n\nEvaluation: The s...,3.0
14,How has the Australian government adjusted the...,Frequently Asked QuestionsYou should only appl...,2,It gives an answer ut it needs to define some ...,Nothing here mentions the agricultural industr...,,\n\nMy Rating: 1 out of 4\nEvaluation: The sys...,1.0
1581,What can be done to safeguard children and the...,Guidance Coronavirus (COVID-19): support for p...,2,There is not enough information here about saf...,the only part of the answer that relates to th...,,\n\nMy Rating: 3 out of 4\n\nEvaluation: The s...,3.0


In [45]:
print("Correlation between Human-provided rating vs. LLM-provided rating")
print(f"Pearson coeff: {np.corrcoef(ds_random_samples_from_each_score['rating1'], ds_random_samples_from_each_score['llm_judge_score_imp_1'])[0][1]:.4f}")

Correlation between Human-provided rating vs. LLM-provided rating
Pearson coeff: 0.8047


That's a small improvement, from 0.7879 to 0.8047 on 28 samples, using the updated prompt that asks for an Evaluation, Feedback and Rating.
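
With a sample this small, a difference of about 0.02 in the Pearson coefficient may fall within sampling noise. One way to gauge that is a percentile bootstrap confidence interval. The sketch below uses synthetic ratings, not the notebook's data, and `bootstrap_pearson_ci` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for 28 human ratings and judge scores (not the notebook's data)
human = rng.integers(1, 5, size=28)
judge = np.clip(np.round(human + rng.normal(0, 0.8, size=28)), 1, 4)


def bootstrap_pearson_ci(x, y, n_boot=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the Pearson coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))  # resample rows with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:
            continue  # correlation is undefined for a constant sample
        stats.append(np.corrcoef(xs, ys)[0, 1])
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high


low, high = bootstrap_pearson_ci(human, judge)
print(f"Pearson: {np.corrcoef(human, judge)[0, 1]:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")
```

If the intervals for the two prompts overlap heavily, the gain is better read as suggestive than as established.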

Observations:
1. Adding the incentive (the offer of 100 H100 GPUs) at the end of the prompt may have helped make the ratings more accurate, though I'm not sure
2. I really like the Evaluation explanation - I found it to be quite useful