# Model Evaluation using Vertex AI

In this notebook, we compare the responses of three models that we are using to build our travel advisor chatbot.

* Model #1 - Stock Gemini 1.5 Flash 002 model
* Model #2 - Fine-tuned Gemini 1.5 Flash, tuned with sample responses generated by Gemini
* Model #3 - Gemma2 9b instruction-tuned model

## Steps Performed

Before getting to the evaluation, the following steps were performed. 

### 1) Generate sample queries / prompts
We used `Gemini 1.5 Pro` to generate a set of arbitrary travel-related questions. We asked the model to adopt various personas, from curious to annoyed, and to ask its questions in the style of each persona.

This data is stored in `eval_queries.json`. 

### 2) Generate responses using each model
We then read the prompts generated in the previous step, fed them to each of the models above, and saved the responses. For each question, we therefore have one response from each model. We can use these for either a pointwise evaluation or a pairwise comparison.

The files are present in `gemini_responses.json`, `gemma_responses.json`, `tuned_responses.json`.
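
As a rough illustration of the difference between the two evaluation styles, the sketch below arranges saved responses as pointwise rows (one prompt, one response) and as pairwise rows (one prompt, a candidate response, and a baseline response). The sample records, the `prompt` key, and both helper names are assumptions made for illustration; only the `response` key is confirmed by the files. The `baseline_model_response` column name is our understanding of what pairwise metrics expect and should be checked against the SDK documentation.

```python
# Hypothetical records standing in for the real response files.
# Only the "response" key is confirmed by the notebook; "prompt" is assumed.
gemini = [
    {"prompt": "Best time to visit Kyoto?", "response": "Spring, for the cherry blossoms."},
    {"prompt": "Is Lisbon walkable?", "response": "Yes, but expect steep hills."},
]
tuned = [
    {"prompt": "Best time to visit Kyoto?", "response": "Spring!"},
    {"prompt": "Is Lisbon walkable?", "response": "Totally, bring comfy shoes!"},
]


def pointwise_rows(records):
    """One row per (prompt, response): each response is scored on its own."""
    return [{"prompt": r["prompt"], "response": r["response"]} for r in records]


def pairwise_rows(candidate, baseline):
    """One row per prompt, pairing a candidate response with a baseline response."""
    baseline_by_prompt = {r["prompt"]: r["response"] for r in baseline}
    return [
        {
            "prompt": r["prompt"],
            "response": r["response"],
            # Assumed column name for the response being compared against.
            "baseline_model_response": baseline_by_prompt[r["prompt"]],
        }
        for r in candidate
        if r["prompt"] in baseline_by_prompt  # skip prompts the baseline lacks
    ]


print(len(pointwise_rows(tuned)), "pointwise rows")
print(pairwise_rows(tuned, gemini)[0])
```

Matching on the prompt text rather than on list position keeps the pairing correct even if the files were written in different orders.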

## Let's get some data from the files

In [3]:
# How many prompts are present?

import json
with open('eval_queries.json', 'r') as f:
    data = json.load(f)

print(f'there are {len(data)} records')

there are 57 records


In [8]:
# Let's check the average length (in characters) of the responses of each model

def print_avg_len(filename):
    with open(filename, 'r') as f:
        prompts = json.load(f)

    responses = [prompt['response'] for prompt in prompts]
    avg_len_resp = sum(len(s) for s in responses) / len(responses)
    print(f'avg length of responses in {filename} is {avg_len_resp:.2f}')

print_avg_len('gemini_responses.json')
print_avg_len('gemma_responses.json')
print_avg_len('tuned_responses.json')



avg length of responses in gemini_responses.json is 553.98
avg length of responses in gemma_responses.json is 386.75
avg length of responses in tuned_responses.json is 170.12
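
The averages above are measured in characters, and a mean on its own can be skewed by a few very long answers. A minimal sketch of a slightly richer summary follows; it runs on in-memory sample strings rather than the real files, and the function name `length_summary` and the sample data are ours, introduced only for illustration.

```python
import statistics


def length_summary(responses):
    """Return mean and median length in characters, plus mean length in words."""
    char_lens = [len(r) for r in responses]
    word_lens = [len(r.split()) for r in responses]  # whitespace-separated words
    return {
        "mean_chars": statistics.mean(char_lens),
        "median_chars": statistics.median(char_lens),
        "mean_words": statistics.mean(word_lens),
    }


# Hypothetical sample responses, not taken from the evaluation files.
sample = ["Go in spring!", "Yes.", "Pack light and book early."]
print(length_summary(sample))
```

The same function could be applied to the `responses` list built inside `print_avg_len` to compare the three models on more than one statistic.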


## Let's do a pointwise evaluation of the responses based on certain criteria

1. We want to ensure that the response actually answered the question and didn't give a vague answer or ask a follow-up
2. We want to ensure the response is easy to read and friendly
3. We want to ensure the response is fun and has emojis and smileys

_Note: Ensure you are authenticated with Google Cloud using Application Default Credentials (ADC), for example via `gcloud auth application-default login`._

In [10]:
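# install necessary packages: the evaluation extra provides vertexai.evaluation,
# and python-dotenv provides the dotenv module imported below
!pip install --upgrade "google-cloud-aiplatform[evaluation]" python-dotenv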



In [14]:
# import necessary packages
import pandas as pd

import os
import json
import dotenv

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate

In [15]:
# load env vars (expects PROJECT_ID and REGION in the environment or a .env file)
dotenv.load_dotenv()
project_id = os.environ.get("PROJECT_ID")
location = os.environ.get("REGION")

In [17]:
# init vertex ai
vertexai.init(project=project_id, location=location)

In [None]:
# create a pointwise metric for the responses

travel_response_quality = PointwiseMetric(
    metric="travel_response_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)