### LLM Evaluation 

This notebook uses an LLM as a judge to compare the content generated by two different LLMs.

<b>LLM as a judge using user-provided metrics:</b><br>
- multimodal content coverage comparisons




In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]

In [14]:
#import libraries
import time
import random
from google.cloud import bigquery
import json
from datetime import datetime
import pandas as pd
from sklearn.utils import shuffle
from LLM_PairWiseEval_cls import PairwiseEvaluationClient


### Prepare Data Sample for Multimodal Coverage Comparisons
We assume the generated content is JSON that contains the fields the LLMs were asked to extract from the content.<br>
Because no data was available in our environment, we create some sample data.
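The JSON assumption above can be checked before any comparison is run. The following is a minimal sketch, not part of the evaluation client: `missing_fields` is a hypothetical helper, and the field list is copied from the `required` entry of the schema defined in the next cell.

```python
import json

# Field names copied from the "required" list of the schema used in this notebook.
REQUIRED_FIELDS = [
    "Category",
    "DetailedDescriptionOfEventsAndConversations",
    "BrandsCompanyNamesLogos",
    "KeyLocationsAndScenes",
    "KeyThemes",
    "PeopleAppearingAndMentioned",
]


def missing_fields(response_text, required=REQUIRED_FIELDS):
    """Return the required fields absent from a JSON response (hypothetical helper)."""
    parsed = json.loads(response_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return [field for field in required if field not in parsed]


# Made-up partial response, used only to exercise the helper.
sample = json.dumps({"Category": "TV Show", "KeyThemes": ["Marriage"]})
print(missing_fields(sample))
```

A response that fails this check could be flagged or regenerated before it reaches the judge, so that coverage scores are not skewed by malformed output.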

# Sample User Prompt
This is the prompt text used to generate the content for each video segment or image during batch/online content generation.
Here, we used this prompt to generate content for a sample video from 600s to 900s with two different models, 'gemini-1.5-pro-002' and 'gemini-1.5-flash-002'. The generated content is stored in JSON format in the output_model1.txt and output_model2.txt files.
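For batch generation over a whole video, the segment boundaries have to come from somewhere. The sketch below shows one way to derive them; `segment_offsets` is a hypothetical helper, and the 300-second segment length is an assumption inferred from the 600s to 900s sample rather than a documented setting.

```python
def segment_offsets(duration_seconds, segment_length=300):
    """Yield (start, end) offsets in seconds that cover the whole duration."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    for start in range(0, duration_seconds, segment_length):
        # The last segment is clipped to the end of the video.
        yield start, min(start + segment_length, duration_seconds)


# A 1000-second video is an arbitrary example duration.
segments = list(segment_offsets(1000))
print(segments)
```

Each `(start, end)` pair could then be substituted into the prompt in the same way as the `start` and `end` variables in the next cell.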

In [3]:
start=600
end=900
schema="""{
    "description": "A structured schema to represent detailed information from a video or text analysis",
    "type": "object",
    "properties": {
        "Category": {
            "type": "string",
            "description": "The category or general type of the content"
        },
        "DetailedDescriptionOfEventsAndConversations": {
            "type": "string",
            "description": "A detailed textual description of the events and conversations in the content"
        },
        "BrandsCompanyNamesLogos": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of brands, company names, or logos appearing or mentioned in the content"
        },
        "KeyLocationsAndScenes": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of key locations and scenes appearing or mentioned in the content"
        },
        "KeyThemes": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of key themes discussed or portrayed in the content"
        },
        "PeopleAppearingAndMentioned": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of people who appear or are mentioned in the content"
        }
    },
    "required": [
        "Category",
        "DetailedDescriptionOfEventsAndConversations",
        "BrandsCompanyNamesLogos",
        "KeyLocationsAndScenes",
        "KeyThemes",
        "PeopleAppearingAndMentioned"
    ]
}"""


VAR_VIDEO_SEGMENT=f"Your task is to provide a comprehensive description of this video from segment {start} seconds to {end} seconds.\n"
VAR_INSTRUCTIONS= """To complete the task you need to follow these steps:\n
                           No greetings, closing remarks, or additional comments. Begin immediately with the video analysis and provide only the requested information in the specified format.\n
                           Identify all instances of visual product placement. Pay close attention to background details and items held by the characters. List each product placement with the following
                            information: Brand name, product name (if applicable), and a brief description. Include information about product placement into the description generated for the video\n
                           Create a transcript of all the speeches, dialogs, narration.\n
                           Scrupulously examine each scene for any and all visible brand names, logos, and products. Even if a product appears briefly or in the background, it should be included.\n"""

VAR_CONSTRAINTS= """Describe the video content objectively, avoiding any subjective opinions or assumptions.\n
                           Specify who is saying what. If a person talking can be seen, specify their name and/or occupation. If it is voice behind the scenes, then describe it as a narrator.\n
                           Be specific when describing. Include all the information that is shown or given.\n
                           Do not show timestamps.\n
                           If an unidentified person is shown in the video first, but then their name is mentioned later in the video, make sure to mention their name in the description from the start.\n
                           """

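# Parse the schema string first so that json.dumps embeds a JSON object rather than a double-encoded string
VAR_STRUCTURE= f"""Organize the description with the following properties, and give a valid json file with JSON schema.<JSONSchema>{json.dumps(json.loads(schema))}</JSONSchema>: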
                       \n**Category**\n
                       \n**DetailedDescriptionOfEventsAndConversations**\n
                       \n**BrandsCompanyNamesLogos**\n
                       \n**KeyLocationsAndScenes**\n
                       \n**KeyThemes**\n
                       \n**PeopleAppearingAndMentioned**\n 
                 """ 

VAR_CONDITIONS = """Identify a video as one of these categories: News, TV Shows, Live Sports Events, News Analyses. \n
                       When describing the DetailedDescriptionOfEventsAndConversations, consider the following instructions for specific video types:\n
                       * **News:** Pay close attention to transitions, graphics, and on-screen text.\n
                       * **TV Shows:** Describe facial expressions, body language, appearances, and overall mood.\n
                       * **Live Sports Events:** Focus on key moments, like goals or fouls, and describe the overall flow and momentum of the game.\n
                       * **News Analyses:** Identify different perspectives, arguments, and supporting evidence.\n
                       Make sure to mention people's names in the DetailedDescriptionOfEventsAndConversations and in PeopleAppearingAndMentioned as well as any other information about them like their age, occupation, location, etc. \n"""

VAR_EXAMPLE = """Follow this example for the format of the output:\n
              {
                "Category": "TV Show",
                "DetailedDescriptionOfEventsAndConversations": "The video starts with a man sitting at a dining table, reading a letter. Two Fiji bottles are visible on the benchtop. He has short, light brown hair and a beard. His name is Harrison. The scene changes to Melissa. Melissa says: \"I'm Melissa, and I'm a hairdresser. I'm 41 years old, and I'm from Sydney.\"",
                "BrandsCompanyNamesLogos": ["Lacoste", "Fiji"],
                "KeyLocationsAndScenes": ["Apartment"],
                "KeyThemes": ["Marriage"],
                "PeopleAppearingAndMentioned": [
                "Harrison, 32, Builder, NSW",
                "Melissa, 41, Hairdresser, NSW"
                ]
            }
               """
  
    
video_description_prompt=VAR_VIDEO_SEGMENT+VAR_INSTRUCTIONS+VAR_CONSTRAINTS+VAR_STRUCTURE+VAR_CONDITIONS+VAR_EXAMPLE
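One detail worth noting when a schema is embedded in a prompt: `json.dumps` applied to a Python string returns a quoted, escaped string rather than a JSON object. The sketch below compares the two results using a shortened stand-in schema (an assumption for brevity, not the notebook's full schema).

```python
import json

# Shortened stand-in for a schema held as a string.
schema_text = '{"type": "object", "required": ["Category"]}'

# json.dumps on a str wraps it in quotes and escapes the inner quotes.
double_encoded = json.dumps(schema_text)

# Parsing first and serialising again yields a plain JSON object.
as_object = json.dumps(json.loads(schema_text))

print(double_encoded)
print(as_object)
```

The second form is usually easier for a model to read, since the schema appears in the prompt without backslash escapes.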
 

In [4]:
#load pre-executed predictions
with open('output_model1.txt', 'r') as file:
    response1 = json.dumps(json.load(file))
with open('output_model2.txt', 'r') as file:
    response2 = json.dumps(json.load(file))

In [5]:
# Create one comparison sample
generated_response = [
    response1,
    response2
]
llm_models = ['gemini-1.5-pro-002', 'gemini-1.5-flash-002']

items = pd.DataFrame(
    {
        "asset_id": "MAAT2024_1_A_HBB.mp4",
        "prompt_text_A": video_description_prompt,  # prompt text used for model A during batch generation
        "prompt_text_B": video_description_prompt,  # prompt text used for model B during batch generation
        "fileUri": 'gs://raw_nine_files/vlt_video_extract/MAAT/MAAT2024_1_A_HBB.mp4',  # file URI used during batch generation
        "description_A": response1,  # content generated by model A
        "description_B": response2,  # content generated by model B
        "asset_type": 'video/mp4',  # asset type (MIME type) used during batch generation
        "startOffset_seconds": [int(start)],
        "endOffset_seconds": [int(end)],
        "modelVersion_A": llm_models[0],  # name of model A
        "modelVersion_B": llm_models[1],  # name of model B
    }
)

In [8]:
prompt_A="Describe this image in detail. Identify any text, brands or company logos, locations and key themes. If there is a famous person like politician, celebrity or athlete, indicate their name and describe what they are famous for. Organize the description in the following format: \n**DetailedDescription**\n \n**BrandsCompanyNamesLogos**\n \n**KeyLocations**\n \n**KeyThemes**\n \n**PeopleAppearing**\n If you are not sure about any info, please do not make it up. No greetings, closing remarks, or additional comments."
response_A="**DetailedDescription**\nEye-level close-up of what appears to be a formal press conference or bilateral meeting.  Two individuals are seated at a table, likely dignitaries or officials, given the setting and presence of flags and microphones. The person on the left wears a yellow and grey patterned dress and a traditional floral lei. She has grey hair and glasses. The person on the right, with short grey hair, wears a dark suit jacket and a similar floral lei.  They are both looking directly towards the camera, possibly addressing an audience. \n\n\nThe table in front of them is white and holds two bottles of water, two small flags (Samoa and Australia), microphones with foam covers, a small bottle of hand sanitizer, and a document.  A centerpiece of pink, purple, and white flowers sits between the individuals. \n\n\nThe backdrop consists of a dark, possibly wooden, wall with a textured or patterned design. Two large flags are prominently displayed behind the individuals, one appearing to be the flag of Samoa and the other the flag of Australia. The lighting is well-balanced, illuminating the subjects and the foreground elements while the background remains slightly darker.\n\n**BrandsCompanyNamesLogos**\nNone\n\n**KeyLocations**\nSamoa (flag present)\nAustralia (flag present)\n\n**KeyThemes**\nInternational relations/diplomacy\nBilateral meeting/press conference\nFormal occasion\n\n**PeopleAppearing**\nFiame Naomi Mata'afa, Prime Minister of Samoa."

prompt_B="Describe this image in detail. Identify any text, brands or company logos, locations and key themes. If there is a famous person like politician, celebrity or athlete, indicate their name and describe what they are famous for. Organize the description in the following format: \n**DetailedDescription**\n \n**BrandsCompanyNamesLogos**\n \n**KeyLocations**\n \n**KeyThemes**\n \n**PeopleAppearing**\n If you are not sure about any info, please do not make it up. No greetings, closing remarks, or additional comments."
response_B="**DetailedDescription**\nEye-level view of a scene that appears to be a political rally or public speaking event. \n\n\nThe central figure is Donald Trump, standing on a stage with a large American flag draped across the front. He is wearing a dark suit, red tie, and his signature hairstyle. His hands are clasped in front of him, and he appears to be looking out at the crowd. \n\n\nA person with a camera and stabilizer rig is positioned to the right of the frame, seemingly filming or photographing the event. \n\n\nA crowd of people is visible behind a barrier in the background, many of whom are holding up cell phones, presumably taking pictures or videos. They appear to be engaged and focused on the stage. \n\n\nThe sky is visible above, mostly clear with some light clouds. The lighting suggests it's daytime, likely in the late afternoon or early evening.\n\n\n**BrandsCompanyNamesLogos**\nThe Trump campaign slogan appears on a woman's tank top.\n\n**KeyLocations**\nThe location appears to be an outdoor venue in the United States, possibly a fairground or open field, judging by the flat terrain and crowd barriers.\n\n\n**KeyThemes**\nPolitical rally, public speaking, presidential campaign\n\n\n**PeopleAppearing**\nDonald Trump, former President of the United States.\n"

new_item = pd.DataFrame(
    [
        {
            "asset_id": "04ae4ae696419a0d23df328343bc5e893bb8b666.jpeg",
            "prompt_text_A": prompt_A,  # prompt text used for model A during batch generation
            "prompt_text_B": prompt_B,  # prompt text used for model B during batch generation
            "fileUri": 'gs://nineshowcaseassets/IMAGES/04ae4ae696419a0d23df328343bc5e893bb8b666.jpeg',  # file URI used during batch generation
            "description_A": response_A,  # content generated by model A
            "description_B": response_B,  # content generated by model B
            "asset_type": 'image/jpeg',  # asset type (MIME type) used during batch generation
            "modelVersion_A": 'gemini-1.5-pro-002',  # name of model A
            "modelVersion_B": 'gemini-1.5-pro-002',  # name of model B
        }
    ]
)

items=pd.concat([items,new_item], axis=0)

In [9]:
items

| | asset_id | prompt_text_A | prompt_text_B | fileUri | description_A | description_B | asset_type | startOffset_seconds | endOffset_seconds | modelVersion_A | modelVersion_B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MAAT2024_1_A_HBB.mp4 | Your task is to provide a comprehensive descri... | Your task is to provide a comprehensive descri... | gs://raw_nine_files/vlt_video_extract/MAAT/MAA... | `{"Category": "TV Show", "DetailedDescriptionOf...` | `{"Category": "TV Show", "DetailedDescriptionOf...` | video/mp4 | 600.0 | 900.0 | gemini-1.5-pro-002 | gemini-1.5-flash-002 |
| 0 | 04ae4ae696419a0d23df328343bc5e893bb8b666.jpeg | Describe this image in detail. Identify any te... | Describe this image in detail. Identify any te... | gs://nineshowcaseassets/IMAGES/04ae4ae696419a0... | `**DetailedDescription**\nEye-level close-up of...` | `**DetailedDescription**\nEye-level view of a s...` | image/jpeg | NaN | NaN | gemini-1.5-pro-002 | gemini-1.5-pro-002 |
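Both rows in the output above carry the index label 0, because `pd.concat` keeps each frame's original index by default. If unique row labels are wanted, `ignore_index=True` renumbers them. A small sketch with toy frames (the asset names are placeholders):

```python
import pandas as pd

# Two single-row frames, each with its own index starting at 0.
first = pd.DataFrame({"asset_id": ["video.mp4"]})
second = pd.DataFrame({"asset_id": ["image.jpeg"]})

# Default behaviour: the original index labels are kept.
kept = pd.concat([first, second], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Duplicate labels are harmless for positional access, but label-based lookups such as `items.loc[0]` would return both rows.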


In [10]:
experiment_name = "content-generation-qa-quality"
file_path = 'PairWiseMultimodalContentEvaluationMetrics.json'

# Open and load the JSON file
with open(file_path, 'r') as file:
    multimodal_eval_prompt_metrics = json.load(file)

    
video_multimodal_content_evaluation_metric_promopt =multimodal_eval_prompt_metrics['video_multimodal_content_evaluation_metric_promopt']
image_multimodal_content_evaluation_metric_promopt= multimodal_eval_prompt_metrics['image_multimodal_content_evaluation_metric_promopt']

multimodal_evaluation_promt={'video_prompt': video_multimodal_content_evaluation_metric_promopt,'image_prompt':image_multimodal_content_evaluation_metric_promopt}
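The dictionary above holds one evaluation prompt per media family. How `PairwiseEvaluationClient` chooses between them is not shown in this notebook; the sketch below is only an illustration of one plausible approach, mapping the MIME type in `asset_type` to the `video_prompt` or `image_prompt` key. `select_evaluation_prompt` is a hypothetical helper and the prompt values are placeholders.

```python
def select_evaluation_prompt(prompts, asset_type):
    """Pick the evaluation prompt for a MIME type (hypothetical helper)."""
    # "video/mp4" -> "video", "image/jpeg" -> "image"
    family = asset_type.split("/", 1)[0]
    key = f"{family}_prompt"
    if key not in prompts:
        raise KeyError(f"no evaluation prompt configured for {asset_type!r}")
    return prompts[key]


# Placeholder prompt texts; the real ones are loaded from the metrics JSON file.
prompts = {"video_prompt": "<video rubric>", "image_prompt": "<image rubric>"}

print(select_evaluation_prompt(prompts, "video/mp4"))
print(select_evaluation_prompt(prompts, "image/jpeg"))
```

Raising on an unknown media family makes a missing prompt visible early instead of silently evaluating with the wrong rubric.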



In [15]:
pairwise_evaluation_client = PairwiseEvaluationClient(
    project='nine-quality-test',
    location='us-central1',
    items=items,
    response_A_desc_column_name="description_A",
    response_B_desc_column_name="description_B",
    response_A_llm_model_column_name="modelVersion_A",
    response_B_llm_model_column_name="modelVersion_B",
    experiment_name="pairwise-evaluation-experiment",
    multimodal_evaluation_promt=multimodal_evaluation_promt,  # prompts used to compare the generated content for videos and images
    response_A_userPrompt_column_name="prompt_text_A",  # column in `items` holding the user prompt used to generate model A's content
    response_B_userPrompt_column_name="prompt_text_B",  # column in `items` holding the user prompt used to generate model B's content
    response_media_column_metadata={
        'fileUri': 'fileUri',
        'startOffset': 'startOffset_seconds',
        'endOffset': 'endOffset_seconds',
        'mediaType': 'asset_type',
    },  # names of the metadata columns in `items`
    response_mediaType_column_name='asset_type',
)
evaluations = pairwise_evaluation_client.get_evaluations()

Dataset nine-quality-test.vlt_eval_statistics_schema exists.
Table nine-quality-test.vlt_eval_statistics_schema.vlt_pairwise_eval_statistics exists. Checking schema...
Evaluations have successfully been loaded into nine-quality-test.vlt_eval_statistics_schema.vlt_pairwise_eval_statistics.


In [16]:
evaluations

| | response_A | response_B | mediaType | multimodal_evaluation_promt | instruction_A | instruction_B | reference | response_A_llm_model | response_B_llm_model | run_experiment_name | ... | detailed_description_of_events_and_conversations_score | detailed_description_of_events_and_conversations_explanation | brands_companynames_and_logos_score | brands_companynames_and_logos_explanation | keylocations_and_scenes_score | keylocations_and_scenes_explanation | key_themes_score | key_themes_explanation | people_appearing_and_mentioned_score | people_appearing_and_mentioned_explanation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `{"Category": "TV Show", "DetailedDescriptionOf...` | `{"Category": "TV Show", "DetailedDescriptionOf...` | video/mp4 | `# Instruction\nYou are an expert evaluator. Yo...` | Your task is to provide a comprehensive descri... | Your task is to provide a comprehensive descri... | `{"fileuri": "gs://raw_nine_files/vlt_video_ext...` | gemini-1.5-pro-002 | gemini-1.5-flash-002 | pairwise-evaluation-experiment-1d4257a8-fc36-4... | ... | 5 | Both models capture the main events of the vid... | 1 | Model A correctly identifies the Fiji water bo... | 5 | Both Models capture most of the key locations ... | 3 | Both Models are almost similar in identifying ... | 5 | Both models correctly identify most of the peo... |
| 0 | `**DetailedDescription**\nEye-level close-up of...` | `**DetailedDescription**\nEye-level view of a s...` | image/jpeg | `# Instruction\nYou are an expert evaluator. Yo...` | Describe this image in detail. Identify any te... | Describe this image in detail. Identify any te... | `{"fileuri": "gs://nineshowcaseassets/IMAGES/04...` | gemini-1.5-pro-002 | gemini-1.5-pro-002 | pairwise-evaluation-experiment-1d4257a8-fc36-4... | ... | 5 | Both models capture the main events of the vid... | 1 | Model A correctly identifies the Fiji water bo... | 5 | Both Models capture most of the key locations ... | 3 | Both Models are almost similar in identifying ... | 5 | Both models correctly identify most of the peo... |
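The per-property results come back as paired `*_score` and `*_explanation` columns. A quick way to get an overview is to average the score columns. The sketch below uses a toy frame whose column names follow that pattern; the values are made up for illustration, and how to read them depends on the rating rubric in the evaluation prompt.

```python
import pandas as pd

# Toy stand-in for the `evaluations` frame; the values are illustrative only.
evaluations_demo = pd.DataFrame(
    {
        "mediaType": ["video/mp4", "image/jpeg"],
        "key_themes_score": [3, 5],
        "key_themes_explanation": ["...", "..."],
        "brands_companynames_and_logos_score": [1, 3],
    }
)

# Keep only the numeric score columns, identified by their suffix.
score_columns = [c for c in evaluations_demo.columns if c.endswith("_score")]

# Mean score per evaluated property across all compared items.
mean_scores = evaluations_demo[score_columns].mean()
print(mean_scores)
```

The same selection applied to the real `evaluations` frame could be grouped by `mediaType` or by the model columns to compare model pairs.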


In [18]:
print(video_multimodal_content_evaluation_metric_promopt)

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user prompt, and AI-generated responses for model A and B, video and the segment for which this response is generated.
You should first read the user input carefully for analyzing the task, then look into video segment, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing coverage, which measures the ability to provide a detailed response based on the given video segment and requested properties.

## Criteria
Coverage: It is the quality of capturing all required detail for each requested property.
In the context of video content capturing, it re