# Evaluate LLM results

Install Dependencies

In [1]:
# %%capture
# !pip install datasets==2.20.0
# !pip install -U langsmith==0.1.99
# !pip install langchain_openai==0.1.22
# !pip install langchain==0.2.13
# !pip install langchain_community==0.2.12                          
# !pip install transformers==4.44.0
# !pip install termcolor==2.4.0
# !pip install accelerate==0.33.0
# !pip install pandas==2.2.2
# !pip install openpyxl==3.1.5
# !pip install python-dotenv==1.0.1
# !pip install einops==0.8.0
# !pip install wheel==0.44.0
# !pip install sentencepiece==0.2.0
# !pip install protobuf==5.27.3 #Mistral models need this
# !pip install groq==0.10.0 #Groq models need this

# !pip install flash-attn==2.6.3 #Install it at the end, after wheel has been installed
# !pip install anthropic==0.34.1 #Anthropic models need this

# #Only if CPU is used
# !pip install torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

RunPod specific parameters

In [2]:
#For RunPod change to persistent storage directory
import os
os.chdir('/workspace')

Specify Path and Load API Keys

In [3]:
file_path ='/workspace/Example_QA_data_raw.xlsx' #Dataset generated with the help of GPT-4o
#'/Users/nikolaossourlo/Desktop/Example_QA_data_raw.xlsx' #For MacOS
#'C:/Users/soyrl/Desktop/Example_QA_data_raw.xlsx' #For Windows
# '/content/drive/My Drive/Example_QA_data_raw.xlsx' #For Google Colab
#'/home/nikolaossourlo/Example_QA_data_raw.xlsx' #For Delft Blue
# '/workspace/Example_QA_data_raw.xlsx' #For RunPod

custom_cache_dir="/workspace/cache/huggingface" #Save models here so that we don't have to download them again
#"/scratch/nikolaossourlo/cache" in Delft Blue

# Use custom_cache_dir if it is defined above, otherwise fall back to the default Hugging Face cache
try:
    cache_dir=custom_cache_dir
except NameError:
    cache_dir=None

from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv(dotenv_path=os.getcwd()+"/env")

# Get the OpenAI API key
openai_api_key = os.getenv('OPENAI_API_KEY_DRACO')
langsmith_api_key = os.getenv('LANGSMITH_API_KEY')

#Login to Hugging Face
from huggingface_hub import login
# Log in with your Hugging Face token
login(token=os.getenv('HF_TOKEN'))

# print(openai_api_key)
# print(langsmith_api_key)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
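A missing key only shows up later as an authentication error, so it can help to check the environment right after `load_dotenv`. The sketch below is one possible check; `missing_env_vars` is a hypothetical helper (not part of python-dotenv), and the example environment is made up.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names in `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Variable names as read in the cell above; the values here are dummies
required = ["OPENAI_API_KEY_DRACO", "LANGSMITH_API_KEY", "HF_TOKEN"]
example_env = {"OPENAI_API_KEY_DRACO": "dummy-value", "HF_TOKEN": ""}
print(missing_env_vars(required, example_env))  # ['LANGSMITH_API_KEY', 'HF_TOKEN']
```

Calling `missing_env_vars(required)` without a second argument checks the real process environment.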


Select model and name experiment

In [4]:
# models=["meta-llama/Meta-Llama-3.1-8B-Instruct", #Takes 2.5-3mins in A4500 (20GB VRAM) and in Delft Blue
    # "microsoft/Phi-3.5-mini-instruct", #Took 5mins in A40 with 48GB VRAM, 2mins in A4500 with 20GB VRAM, 3mins in Delft Blue
    # "mistralai/Mistral-7B-Instruct-v0.3", #4mins in A40 with 48GB VRAM, 2.5mins in A4500 with 20GB VRAM and in Delft Blue
    # "google/gemma-2-9b-it", #More than 20GB of GPU memory needed - Works with A40 with 48GB VRAM (sometimes with A4500 too - 4mins) and takes ~8mins, 4.5mins in Delft Blue
    # "Qwen/Qwen2-7B-Instruct", #4mins in A40 with 48GB VRAM, 2 mins in A4500 with 20GB VRAM, 2.5mins in Delft Blue
    # 'AI-MO/NuminaMath-7B-TIR', #2.5 in A4500 with 20GB VRAM and in Delft Blue
    # 'openai/gpt-4o-mini']

# Groq models are defined as: groq_website/model_name e.g. 'groq_website/llama-3.1-70b-versatile'
# OpenAI models are defined as: 'openai/model_name', e.g. 'openai/gpt-4o-mini'

# I couldn't run 'nvidia/Mistral-NeMo-Minitron-8B-Base', "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4" (Conflicting dependencies)
# Anthropic models are defined as 'anthropic/model_name', e.g. 'anthropic/claude-3-haiku-20240307' - Couldn't use due to billing issues

#Model to generate responses to the questions - 'gpt-4o-mini' is very cheap (~0.01$ for 9 Q&A pairs).
model_name="google/gemma-2-9b-it"#'meta-llama/Meta-Llama-3.1-8B-Instruct'

#Define model to act as a judge
judge_model='openai/gpt-4o-mini' #If used with Llama, only 0.01$ for 9 Q&A pairs for gpt-4o-mini, and 0.22$ for gpt-4o

#How the dataset will be named in Langsmith
try: #For identifiers of the form 'provider/model_name' (Hugging Face, OpenAI, Groq)
    dataset_name = "Chemical_Engineering_Evaluation_"+model_name.split('/')[1]+'_with_judge_'+judge_model
except IndexError: #For model names without a '/'
    dataset_name = "Chemical_Engineering_Evaluation_"+model_name+'_with_judge_'+judge_model

#Define maximum number of tokens in the judge LLM output
max_output_tokens=500

#Limit of tokens in the generated response from LLM
generate_max_tokens=1000
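The `provider/model_name` convention described in the comments above can be made explicit with a small parser. This is a minimal sketch, assuming the provider is always the part before the first slash; `split_model_id` and `backend_for` are hypothetical helpers, and the notebook itself uses substring checks instead.

```python
def split_model_id(model_id):
    """Split 'provider/model' into (provider, model); provider is None without a slash."""
    provider, sep, name = model_id.partition("/")
    if not sep:
        return None, model_id
    return provider, name

def backend_for(model_id):
    """Return the API backend for a model id, defaulting to Hugging Face."""
    provider, _ = split_model_id(model_id)
    if provider in ("openai", "groq_website", "anthropic"):
        return provider
    return "huggingface"

print(split_model_id("openai/gpt-4o-mini"))   # ('openai', 'gpt-4o-mini')
print(backend_for("google/gemma-2-9b-it"))    # huggingface
```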

Define prompts for custom evaluation metrics

In [5]:
#Define prompts
system_prompt="""
You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. 
You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct 
answer, you say so. Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend 
a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question. 
Your users are experts in chemical engineering, so they already know you're a language model and your capabilities and limitations, so don't 
remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either. 
Don't be verbose in your answers, but do provide details and examples where it might help the explanation. 
Your goal is to act as a judge on if the predicted and actual answer on a given question have the same context or not. Be concise and provide
an explanation only in case the two answers do not have the same context. 
IMPORTANT: End your responses with the sentence: "FINAL JUDGEMENT:" followed by 'CORRECT' or 'INCORRECT' based on your evaluation.
"""

Check if GPU is available

In [6]:
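import torch
print(torch.version.cuda) #CUDA version PyTorch was built with (None for CPU-only builds); torch.cuda.is_available() reports whether a GPU is usable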

12.1
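The CUDA version alone does not say whether a GPU can actually be used at runtime. A minimal sketch of a device check follows; `pick_device` is a hypothetical helper, and it falls back to `"cpu"` when PyTorch is not installed so that it runs anywhere.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: without PyTorch there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch versions (Apple Silicon support)
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print("Using device:", pick_device())
```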


Google Drive mount (If run in Colab)

In [7]:
if 'content/drive/My Drive' in file_path:
    from google.colab import drive
    drive.mount('/content/drive')

Read Excel File

In [8]:
import pandas as pd
qa=pd.read_excel(file_path) #Read Excel

Create Dataset from df

In [9]:
from datasets import Dataset
loaded_dataset=Dataset.from_pandas(qa)
loaded_dataset = loaded_dataset.train_test_split(test_size=0.2, seed=42)

In [10]:
dataset_train=loaded_dataset['train']
dataset_test=loaded_dataset['test']

Create Langsmith Test Dataset

In [11]:
#https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets
from langsmith import Client

example_inputs = [(x['input'],x['output']) for x in dataset_test]
print(example_inputs)

client = Client(api_key=langsmith_api_key)

#Load the dataset if it already exists
dataset_langsmith = None
for existing_dataset in client.list_datasets():
    if existing_dataset.name==dataset_name:
        dataset_langsmith=existing_dataset
        print("Dataset Loaded")
        break

if dataset_langsmith is None: #Otherwise create it
    print("Dataset not found. Creating new dataset")
    # Storing inputs in a dataset lets us run chains and LLMs over a shared set of examples.
    dataset_langsmith = client.create_dataset(dataset_name=dataset_name,
                                               description="Q&A chemical engineering.")

    for input_prompt, output_answer in example_inputs:
        client.create_example(
            inputs={"question": input_prompt},
            outputs={"answer": output_answer},
            # metadata={"source": "Wikipedia"},
            dataset_id=dataset_langsmith.id,
        )

[('How can I separate my liquid-liquid mixture using a reactive separation method?', 'Some options to consider are reactive extraction or membrane reactors. The best unit to separate the mixture depends on the characteristics of the components.'), ('What is the selling point of membrane chromatography', 'The main feature and advantage of membrane chromatography is the elimination of the pore diﬀusion in dead-ended pores of the adsorbent beads, which is the main transport resistance in traditional chromatography. This limitation becomes particularly important in the case of large molecules, such as proteins. In membrane chromatography, dissolved molecules are carried by the forced convective ﬂow through the membrane pores, where the ﬁlm diﬀusion (much faster than the pore diﬀusion) leads them to functional ligands attached to the inner pore structure, which act as adsorbents. This way, the process throughput can be signiﬁcantly increased.'), ('What is the function of a membrane in the e
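Before uploading, the rows can be converted into the `inputs`/`outputs` payloads used above and incomplete pairs dropped. This is a sketch only, assuming each row is a dict with string `input` and `output` fields; `to_langsmith_payloads` is a hypothetical helper and makes no API calls.

```python
def to_langsmith_payloads(rows):
    """Map rows with 'input'/'output' fields to question/answer payloads."""
    payloads = []
    for row in rows:
        question = (row.get("input") or "").strip()
        answer = (row.get("output") or "").strip()
        if not question or not answer:
            continue  # skip incomplete pairs rather than upload them
        payloads.append({"inputs": {"question": question},
                         "outputs": {"answer": answer}})
    return payloads

# Made-up rows for illustration
rows = [
    {"input": " What is distillation? ", "output": "A thermal separation process."},
    {"input": "", "output": "Orphan answer"},
    {"input": "Orphan question", "output": None},
]
print(to_langsmith_payloads(rows))
```

Only the first row survives here, because the other two are missing a question or an answer.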

Custom Evaluation Metric

In [12]:
# https://docs.smith.langchain.com/old/cookbook/introduction
# https://docs.smith.langchain.com/old/evaluation/faq/custom-evaluators
from langsmith.schemas import Run, Example
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from termcolor import colored

#Function that compares the real answer with the predicted answer of an LLM and returns 'correct' or 'incorrect' based on the evaluation
def custom_evaluator(run: Run, example: Example) -> dict: 
    # print("Run:",run)

    question=run.inputs.get("inputs")['question']
    # print("Question:",question)
    actual_answer = example.outputs.get("answer")
    # print("Real answer:",example.outputs.get("answer"))
    predicted_answer = run.outputs.get("output")
    # print("Predicted Answer:",answer)
    
    # Check if there is output from LLM
    if not predicted_answer:
        print("No output from LLM")
        return {"key": "custom_metric" , "score": 0} 
    else:
        # Define roles and placeholders
        chat_template = ChatPromptTemplate.from_messages(
          [("system", system_prompt),
            ("user", "Question: {question}, Actual answer: {actual_answer}, Predicted answer: {predicted_answer}"),
            # ("ai", "It's sunny and warm outside."), #Use this if we want to use few shot prompts
          ]
        )

        messages = chat_template.format_messages(question=question, actual_answer=actual_answer, predicted_answer=predicted_answer)
        # print("Messages:",messages)

        formatted_messages = [(role, msg.content) for role, msg in zip(["system", "user"], messages)]
        # print("Formatted messages:",formatted_messages) #[('system', 'You are an autoregressive lan....', 'user':.....)]

        # Initialize the model and get response
        llm = ChatOpenAI(model_name=judge_model.split('/')[1], api_key=openai_api_key, temperature=0, max_tokens=max_output_tokens, seed=42)
        ai_response = llm.invoke(formatted_messages)

        # Output
        # print(colored("System message:"+ messages[0].content,'blue'))
        print(colored("User message:"+ messages[1].content, 'green'))
        print(colored("AI message:"+ ai_response.content,'red'))

        #Decide what the final score is based on output
        if "FINAL JUDGEMENT: CORRECT" in ai_response.content: 
            score = 1
        elif "FINAL JUDGEMENT: INCORRECT" in ai_response.content:
            score = 0
        else:
            print("Invalid response from LLM:", ai_response.content)
            score = 0

        print("\n")
        return {"key": "custom_metric" , "score": score} #Key is the name in the chart
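The exact substring check above misses verdicts that the judge wraps in quotes or markdown, which the system prompt's wording may encourage. A more tolerant parser could look like the sketch below; `parse_verdict` is a hypothetical helper and the accepted variants are assumptions about typical judge output.

```python
import re

# Accepts optional quotes or asterisks before the verdict and the 'JUDGMENT' spelling
_VERDICT = re.compile(r"FINAL JUDGE?MENT:[\s*'\"]*(CORRECT|INCORRECT)\b", re.IGNORECASE)

def parse_verdict(text):
    """Return 1 for CORRECT, 0 for INCORRECT, None if no verdict is found."""
    matches = _VERDICT.findall(text or "")
    if not matches:
        return None
    # Use the last verdict, since the prompt asks for it at the end of the response
    return 1 if matches[-1].upper() == "CORRECT" else 0

print(parse_verdict("Both answers agree. FINAL JUDGEMENT: CORRECT"))  # 1
print(parse_verdict("FINAL JUDGEMENT: 'INCORRECT'"))                  # 0
print(parse_verdict("No verdict given"))                              # None
```

Returning `None` instead of 0 for an unparseable response makes it possible to tell judge failures apart from genuinely incorrect answers.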

Other evaluators from Langsmith

In [13]:
# https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations
# https://docs.smith.langchain.com/old/evaluation/quickstart
# https://docs.smith.langchain.com/tutorials/Developers/evaluation
from langsmith.evaluation import LangChainStringEvaluator

eval_llm = ChatOpenAI(model_name=judge_model.split('/')[1], api_key=openai_api_key, temperature=0.0, seed=42)

#Evaluators
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm}) #LLM just gives 'correct' or 'incorrect' based on reference answer
context_qa_evaluator = LangChainStringEvaluator("context_qa", config={"llm": eval_llm}) #Also uses reference context of example outputs to do the above
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": eval_llm}) #Same as above but with chain of thought 'reasoning'

#Prompts Used internally:

# 1) context_qa_evaluator: You are a teacher grading a quiz.
# You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, 
# based on the context.

# Example Format:
# QUESTION: question here
# CONTEXT: context the question is about here
# STUDENT ANSWER: student's answer here
# GRADE: CORRECT or INCORRECT here

# Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. 
# It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 


# 2) cot_qa_evaluator: You are a teacher grading a quiz.
# You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, 
# based on the context.
# Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.

# Example Format:
# QUESTION: question here
# CONTEXT: context the question is about here
# STUDENT ANSWER: student's answer here
# EXPLANATION: step by step reasoning here
# GRADE: CORRECT or INCORRECT here

# Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer.
#  It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 


# 3) qa_evaluator: You are a teacher grading a quiz.
# You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.

# Example Format:
# QUESTION: question here
# STUDENT ANSWER: student's answer here
# TRUE ANSWER: true answer here
# GRADE: CORRECT or INCORRECT here

# Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer.
#  It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 

Alternatively, use custom prompts as shown below (and set {"prompt": PROMPT} as additional argument inside the config above)

In [14]:
# from langchain_core.prompts.prompt import PromptTemplate

# _PROMPT_TEMPLATE = """You are an expert professor specialized in chemical engineering answers to questions.
# You are grading the following question:
# {query}
# Here is the real answer:
# {answer}
# You are grading the following predicted answer:
# {result}
# Respond with CORRECT or INCORRECT:
# """

# PROMPT = PromptTemplate(
#     input_variables=["query", "result", "answer"], template=_PROMPT_TEMPLATE
# )
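To preview what the judge would receive from such a template, the placeholders can be filled with plain `str.format`. This is a sketch for inspection only; `render_grading_prompt` is a hypothetical helper, and the question and answers are made up.

```python
PROMPT_TEMPLATE = """You are an expert professor specialized in chemical engineering answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
"""

def render_grading_prompt(query, answer, result):
    """Fill the three placeholders of the grading template."""
    return PROMPT_TEMPLATE.format(query=query, answer=answer, result=result)

preview = render_grading_prompt(
    query="What is the chemical formula of water?",
    answer="H2O",
    result="Water is H2O.",
)
print(preview)
```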

Define Models that Generate Responses

In [15]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.random.manual_seed(0) #Set for reproducibility

def initialize_model(model_id):
    # # Check if mps acceleration is available (For MacOS)
    # device = "mps" if torch.backends.mps.is_available() else "cpu"
    # print(f"Using device {device}")
    # model.to(device)

    # transformers.set_seed(42) #Tried for reproducibility but didn't work
    
    pipeline = transformers.pipeline( 
            "text-generation",
            model=model_id,
            model_kwargs={"torch_dtype": torch.bfloat16, "cache_dir":cache_dir},
            # trust_remote_code=True,
            device_map="cuda" #Activate this if GPU available (works in Delft Blue with 32GB VRAM) - 'auto' the alternative
        )
    return pipeline

def get_model(model_id):
    """Given a model name, return the model, tokenizer, and pipeline (None for any that are not loaded)"""

    #Only the pipeline is loaded here; model and tokenizer stay None
    model, tokenizer, pipeline = None, None, None

    if 'openai' not in model_id and 'groq_website' not in model_id: #For Hugging Face models
        pipeline=initialize_model(model_id)

    return model, tokenizer, pipeline

In [16]:
model, tokenizer, pipeline = get_model(model_name)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.42it/s]


Generate Responses

In [17]:
import time

def predict(inputs: dict) -> dict:
    """Given a question, return the answer from the model"""
    
    messages = [
      # {"role": "system", "content": "You are a language model specialized in chemical engineering. Answer the question."},
      {"role": "user", "content": inputs['question']},
    ]
    # print("Prompt:",messages)
    global model_name
    response = "" #Returned as-is if an API call below fails, so that 'response' is always defined
    # print("Model name for response generation:",model_name)

    generation_args = { 
        "max_new_tokens": generate_max_tokens, #Limit for the generated response (max_output_tokens is the judge limit)
        "return_full_text": False, 
        "temperature": 0.0001, 
        "do_sample": True #False, 
      } 
    
    if 'openai' not in model_name and 'groq_website' not in model_name: #For Hugging Face models
      response=pipeline(messages, **generation_args)[0]['generated_text']
      print(model_name,':',response)

    else: 
      if 'openai' in model_name:
        try:
          import openai
          from langsmith.wrappers import wrap_openai

          # print("Using OpenAI models for generating the response")
                  
          # Define OpenAI client
          openai_client = wrap_openai(openai.Client(api_key=openai_api_key))
          
          response = openai_client.chat.completions.create(messages=messages, temperature=0, model=model_name.split('/')[1],  seed=42) 
          # print("Response:",response.choices[0].message.content)
          response=response.choices[0].message.content #That's the response without formatting
          time.sleep(5)
        except Exception as e:
          print("Error:",e)
          print("OpenAI Model ID:",model_name)
      elif 'groq_website' in model_name:
        try:
          from groq import Groq

          client = Groq()
          actual_model_name=model_name.split('/')[1]
          completion = client.chat.completions.create(
              model=actual_model_name,
              max_tokens=generate_max_tokens,
              temperature=0,
              messages=messages)
          response = completion.choices[0].message.content #Keep only the generated text
          # print("Response from Groq:",response)
        except Exception as e:
          print("Error:",e)
          print("Groq Model ID:",model_name)

    return {"output": response}
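Since the evaluators expect a string under the `output` key, one option is to wrap the prediction function so that any backend error becomes an empty answer instead of aborting the run. The sketch below shows the idea; `make_safe` is a hypothetical helper, and the empty-string fallback is an assumption that matches the "No output from LLM" branch of `custom_evaluator`.

```python
def make_safe(predict_fn, fallback=""):
    """Wrap a predict function so it always returns {'output': <str>}."""
    def wrapped(inputs):  # parameter named `inputs` to mirror `predict`
        try:
            result = predict_fn(inputs)
            output = result.get("output") if isinstance(result, dict) else None
        except Exception as error:  # broad on purpose: any backend error becomes a fallback
            print("Prediction failed:", error)
            output = None
        return {"output": output if isinstance(output, str) else fallback}
    return wrapped

# Made-up stand-in for a failing backend
def failing_predict(inputs):
    raise RuntimeError("API unavailable")

print(make_safe(failing_predict)({"question": "What is a membrane reactor?"}))
```

With this wrapper, `make_safe(predict)` could be used wherever `predict` is called.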

Perform the Evaluation

In [18]:
#https://python.langchain.com/v0.2/docs/integrations/chat/openai/
from langsmith.evaluation import evaluate

# Evaluation
begin=time.time()

evaluate(
    predict, #Function that call our LLM and returns its output
    data=dataset_langsmith.name, #Passing dataset_name directly did not work for a newly created dataset
    evaluators=[custom_evaluator, qa_evaluator, context_qa_evaluator, cot_qa_evaluator],
    # metadata={"revision_id": "the version of your pipeline you are testing"},
    experiment_prefix=str(judge_model)+'_judge_with_'+str(model_name) # A prefix for your experiment names to easily identify them
)

end=time.time()
print("Total time taken:",end-begin) #Takes ~25mins on V100 GPU (probably CPU though) for quantized

View the evaluation results for experiment: 'openai/gpt-4o-mini_judge_with_google/gemma-2-9b-it-1f5e0173' at:
https://smith.langchain.com/o/42866633-791c-5d79-833d-4e624439c57c/datasets/2e26bbc4-5b41-45c5-8f7d-554f1aaba7ca/compare?selectedSessions=b286feb0-2617-44d1-bb39-4b4b9dab07b8




0it [00:00, ?it/s]

google/gemma-2-9b-it : In emulsion-free enzymatic splitting of fats, a membrane doesn't play a direct role in the enzymatic reaction itself. 

Here's why:

* **Emulsion-free:** This process implies that fats are not emulsified, meaning they are not broken down into tiny droplets dispersed in water. Emulsification is typically necessary for efficient enzymatic breakdown of fats because it increases the surface area available for the enzyme to act upon.
* **Enzymatic splitting:** This refers to the process of breaking down fats (triglycerides) into smaller molecules like fatty acids and glycerol using enzymes called lipases.

**However, membranes can play a supporting role in this process:**

* **Encapsulation:** Lipases can be encapsulated within membranes, creating microcompartments that concentrate the enzyme and substrate. This can enhance the efficiency of the reaction.
* **Selective permeability:** Membranes can be designed to be selectively permeable, allowing substrates to enter 

9it [03:54, 26.01s/it] 

Total time taken: 235.27580451965332
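Each evaluator returns a dict with a `key` and a `score`, so per-metric accuracy is simply the mean score per key. A minimal sketch, assuming the results have been collected into a list of such dicts; `mean_scores` is a hypothetical helper and the example scores are made up, not results from this run.

```python
from collections import defaultdict

def mean_scores(results):
    """Average the 'score' of each evaluator 'key' over a list of result dicts."""
    totals, counts = defaultdict(float), defaultdict(int)
    for item in results:
        if item.get("score") is None:
            continue  # ignore results without a numeric score
        totals[item["key"]] += item["score"]
        counts[item["key"]] += 1
    return {key: totals[key] / counts[key] for key in totals}

example = [
    {"key": "custom_metric", "score": 1},
    {"key": "custom_metric", "score": 0},
    {"key": "custom_metric", "score": 1},
    {"key": "custom_metric", "score": 1},
]
print(mean_scores(example))  # {'custom_metric': 0.75}
```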





For real-time inference (below implementation only for meta-llama/Meta-Llama-3.1-8B-Instruct)

In [19]:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# import torch
# # del pipeline #Otherwise too much memory is used

# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name,device_map='cuda')

# #Example of real-time response generation
# messages=[{"role": "user", "content": "What is the chemical formula of water?"}]

# inputs_tokenized = tokenizer.apply_chat_template(
#     messages,
#     tokenize=True,
#     add_generation_prompt=True,
#     return_tensors="pt",
#     return_dict=True,
# ).to("cuda")

# input_ids = inputs_tokenized['input_ids']

# # Generate tokens one by one
# max_new_tokens = 256 #Upper bound on the number of generated tokens
# output_ids = input_ids
# for _ in range(max_new_tokens):
#     outputs = model.generate(
#         output_ids,
#         max_new_tokens=1,
#         do_sample=True,
#         top_k=50,
#         pad_token_id=tokenizer.eos_token_id
#     )
#     new_token_id = outputs[0, -1].item()
#     if new_token_id == tokenizer.eos_token_id:
#         break
#     output_ids = torch.cat([output_ids, outputs[:, -1:]], dim=1)
#     new_token = tokenizer.decode(new_token_id, skip_special_tokens=True)
#     print(new_token, end="", flush=True)

# print()

Notes: Results are not fully reproducible, even with the seed set (https://platform.openai.com/docs/api-reference/chat/create#chat-create-seed) and temperature=0 (top_p should be left unchanged when temperature is changed).
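One way to quantify this is to ask the same question several times and measure how often the outputs agree. The sketch below assumes the repeated outputs have already been collected as strings; `agreement_rate` is a hypothetical helper and compares answers after whitespace normalization only.

```python
from collections import Counter

def agreement_rate(outputs):
    """Fraction of outputs equal to the most common one (1.0 means fully reproducible)."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    # Collapse whitespace so that formatting differences do not count as disagreement
    normalized = [" ".join(text.split()) for text in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)

# Made-up outputs for illustration
print(agreement_rate(["H2O", "H2O", "H2O", "Water is H2O"]))  # 0.75
```

The outputs could be gathered with a loop such as `[predict({"question": q})["output"] for _ in range(5)]` for a fixed question `q`.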