### LLM Evaluation

This notebook uses an LLM as a judge to compare the content generated by two different LLM models.

<b>LLM as a judge using user-provided metrics:</b><br>
- multimodal content coverage comparisons




In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]

In [1]:
#import libraries
import time
import random
from google.cloud import bigquery
import json
from datetime import datetime
import pandas as pd
from sklearn.utils import shuffle
#from LLM_PairWiseEval_cls import PairwiseEvaluationClient


### Prepare Data Sample for Multimodal Coverage Comparisons
The generated content is assumed to be JSON containing the fields that the LLM models were asked to extract from the media.<br>
Because no data was available in our environment, we create some sample data.
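
As a minimal sketch of that assumption, the snippet below checks that a generated response contains every required field. The field names are the ones listed under `required` in the JSON schema of the prompt cell; the helper `missing_fields` and the shortened `sample_response` are hypothetical and exist only for illustration.

```python
import json

# The six required top-level fields, taken from the JSON schema used in the prompt.
REQUIRED_FIELDS = [
    "Category",
    "DetailedDescriptionOfEventsAndConversations",
    "BrandsCompanyNamesLogos",
    "KeyLocationsAndScenes",
    "KeyThemes",
    "PeopleAppearingAndMentioned",
]


def missing_fields(response_text: str) -> list:
    """Return the required fields that are absent from a JSON response string."""
    record = json.loads(response_text)
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical, shortened model output used only for illustration.
sample_response = json.dumps({
    "Category": "TV Show",
    "DetailedDescriptionOfEventsAndConversations": "A man reads a letter at a dining table.",
    "BrandsCompanyNamesLogos": ["Fiji"],
    "KeyLocationsAndScenes": ["Apartment"],
    "KeyThemes": ["Marriage"],
})

print(missing_fields(sample_response))  # ['PeopleAppearingAndMentioned']
```

A check like this could run before the pairwise comparison so that the judge is not asked to score malformed responses.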

#### Sample User Prompt
This is the prompt text used to generate the content for each video segment or image during batch or online content generation.
Here, the prompt was used to describe a sample video from 600 s to 900 s with two different models, `gemini-1.5-pro-002` and `gemini-1.5-flash-002`. The generated content is stored in JSON format in the files `output_model1.txt` and `output_model2.txt`.

In [2]:
start=600
end=900
schema="""{
    "description": "A structured schema to represent detailed information from a video or text analysis",
    "type": "object",
    "properties": {
        "Category": {
            "type": "string",
            "description": "The category or general type of the content"
        },
        "DetailedDescriptionOfEventsAndConversations": {
            "type": "string",
            "description": "A detailed textual description of the events and conversations in the content"
        },
        "BrandsCompanyNamesLogos": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of brands, company names, or logos appearing or mentioned in the content"
        },
        "KeyLocationsAndScenes": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of key locations and scenes appearing or mentioned in the content"
        },
        "KeyThemes": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of key themes discussed or portrayed in the content"
        },
        "PeopleAppearingAndMentioned": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "A list of people who appear or are mentioned in the content"
        }
    },
    "required": [
        "Category",
        "DetailedDescriptionOfEventsAndConversations",
        "BrandsCompanyNamesLogos",
        "KeyLocationsAndScenes",
        "KeyThemes",
        "PeopleAppearingAndMentioned"
    ]
}"""


VAR_VIDEO_SEGMENT=f"Your task is to provide a comprehensive description of this video from segment {start} seconds to {end} seconds.\n"
VAR_INSTRUCTIONS= """To complete the task you need to follow these steps:\n
                           No greetings, closing remarks, or additional comments. Begin immediately with the video analysis and provide only the requested information in the specified format.\n
                           Identify all instances of visual product placement. Pay close attention to background details and items held by the characters. List each product placement with the following
                            information: Brand name, product name (if applicable), and a brief description. Include information about product placement into the description generated for the video\n
                           Create a transcript of all the speeches, dialogs, narration.\n
                           Scrupulously examine each scene for any and all visible brand names, logos, and products. Even if a product appears briefly or in the background, it should be included.\n"""

VAR_CONSTRAINTS= """Describe the video content objectively, avoiding any subjective opinions or assumptions.\n
                           Specify who is saying what. If a person talking can be seen, specify their name and/or occupation. If it is voice behind the scenes, then describe it as a narrator.\n
                           Be specific when describing. Include all the information that is shown or given.\n
                           Do not show timestamps.\n
                           If an unidentified person is shown in the video first, but then their name is mentioned later in the video, make sure to mention their name in the description from the start.\n
                           """

VAR_STRUCTURE= f"""Organize the description with the following properties, and give a valid json file with JSON schema.<JSONSchema>{schema}</JSONSchema>:
                       \n**Category**\n
                       \n**DetailedDescriptionOfEventsAndConversations**\n
                       \n**BrandsCompanyNamesLogos**\n
                       \n**KeyLocationsAndScenes**\n
                       \n**KeyThemes**\n
                       \n**PeopleAppearingAndMentioned**\n 
                 """ 

VAR_CONDITIONS = """Identify a video as one of these categories: News, TV Shows, Live Sports Events, News Analyses. \n
                       When describing the DetailedDescriptionOfEventsAndConversations, consider the following instructions for specific video types:\n
                       * **News:** Pay close attention to transitions, graphics, and on-screen text.\n
                       * **TV Shows:** Describe facial expressions, body language, appearances, and overall mood.\n
                       * **Live Sports Events:** Focus on key moments, like goals or fouls, and describe the overall flow and momentum of the game.\n
                       * **News Analyses:** Identify different perspectives, arguments, and supporting evidence.\n
                       Make sure to mention people's names in the DetailedDescriptionOfEventsAndConversations and in PeopleAppearingAndMentioned as well as any other information about them like their age, occupation, location, etc. \n"""

VAR_EXAMPLE = """Follow this example for the format of the output:\n
              {
                "Category": "TV Show",
                "DetailedDescriptionOfEventsAndConversations": "The video starts with a man sitting at a dining table, reading a letter. Two Fiji bottles are visible on the benchtop. He has short, light brown hair and a beard. His name is Harrison. The scene changes to Melissa. Melissa says: \"I'm Melissa, and I'm a hairdresser. I'm 41 years old, and I'm from Sydney.\"",
                "BrandsCompanyNamesLogos": ["Lacoste", "Fiji"],
                "KeyLocationsAndScenes": ["Apartment"],
                "KeyThemes": ["Marriage"],
                "PeopleAppearingAndMentioned": [
                "Harrison, 32, Builder, NSW",
                "Melissa, 41, Hairdresser, NSW"
                ]
            }
               """
  
    
video_description_prompt=VAR_VIDEO_SEGMENT+VAR_INSTRUCTIONS+VAR_CONSTRAINTS+VAR_STRUCTURE+VAR_CONDITIONS+VAR_EXAMPLE
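
Because `schema` is held as a string rather than a dict, it can help to confirm that it parses as JSON before embedding it in a prompt. The sketch below uses a shortened, hypothetical `schema_text` in place of the full schema. It also shows that `json.dumps` applied to a string produces an escaped string literal, whereas parsing first and then dumping yields a compact, singly encoded schema.

```python
import json

# Hypothetical, shortened schema held as a string, as in the cell above.
schema_text = """{
    "type": "object",
    "properties": {"Category": {"type": "string"}},
    "required": ["Category"]
}"""

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
parsed = json.loads(schema_text)
print(parsed["required"])  # ['Category']

# json.dumps on a str returns a quoted JSON string literal with escaped
# quotes and newlines, not the schema object itself.
encoded = json.dumps(schema_text)
print(encoded.startswith('"{\\n'))  # True

# Parsing first and then dumping gives a compact, singly encoded schema.
compact = json.dumps(parsed)
print(compact)
```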
 

In [3]:
#load pre-executed predictions
with open('output_model1.txt', 'r') as file:
    response1 = json.dumps(json.load(file))
with open('output_model2.txt', 'r') as file:
    response2 = json.dumps(json.load(file))
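
The loading step above assumes both files contain valid JSON. A slightly more defensive variant is sketched below; the helper name `load_response` and the temporary demo file are hypothetical and stand in for `output_model1.txt`.

```python
import json
import os
import tempfile


def load_response(path: str) -> str:
    """Read a JSON file and return it re-serialized as a JSON string."""
    with open(path, "r", encoding="utf-8") as handle:
        try:
            payload = json.load(handle)
        except json.JSONDecodeError as err:
            # Surface the file name so a broken model output is easy to locate.
            raise ValueError(f"{path} does not contain valid JSON: {err}") from err
    return json.dumps(payload)


# Demonstration with a temporary file standing in for a model output file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_output.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"Category": "TV Show", "KeyThemes": ["Marriage"]}')
    demo_response = load_response(demo_path)

print(demo_response)
```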

In [4]:
experiment_name = "content-generation-qa-quality"
file_path = 'PairWiseMultimodalContentEvaluationMetrics.json'

# Open and load the JSON file
with open(file_path, 'r') as file:
    multimodal_eval_prompt_metrics = json.load(file)

    
video_multimodal_content_evaluation_metric_promopt =multimodal_eval_prompt_metrics['video_multimodal_content_evaluation_metric_promopt']
image_multimodal_content_evaluation_metric_promopt= multimodal_eval_prompt_metrics['image_multimodal_content_evaluation_metric_promopt']

multimodal_evaluation_promt={'video_prompt': video_multimodal_content_evaluation_metric_promopt,'image_prompt':image_multimodal_content_evaluation_metric_promopt}


generated_response = [
    response1,
    response2
]
llm_models=['gemini-1.5-pro-002', 'gemini-1.5-flash-002']

items= pd.DataFrame(
    {   'asset_id':"MAAT2024_1_A_HBB.mp4",
        "prompt_text_A": video_description_prompt, #this should be set with the prompt_text when doing batch generation for model -A
        "prompt_text_B": video_description_prompt, #this should be set with the prompt_text when doing batch generation for model -B
        "fileUri":'gs://raw_nine_files/vlt_video_extract/MAAT/MAAT2024_1_A_HBB.mp4' , #this should be set to file uri when doing batch generation
        "description_A": response1, #this should be set to the generated content when doing batch generation for model-A
        "description_B": response2, #this should be set to the generated content when doing batch generation for model-B
     
        "asset_type": 'video/mp4', #this should be set to asset_type/mime_type when doing batch generation
        "startOffset_seconds":[int(start)],
        "endOffset_seconds":[int(end)],
        "modelVersion_A":llm_models[0],#this should be set to the model-A name
        "modelVersion_B":llm_models[1],#this should be set to the model-B name

    }
)
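
The `asset_type` value determines which judge prompt applies to a row: the evaluation class defined further down checks whether the MIME type contains `video` or `image`. That selection can be sketched in isolation as follows; the function name `select_eval_prompt` and the placeholder prompt texts are hypothetical.

```python
from typing import Optional


def select_eval_prompt(mime_type: str, prompts: dict) -> Optional[str]:
    """Pick the judge prompt for a MIME type; None if it is neither video nor image."""
    kind = str(mime_type).lower()
    if "video" in kind:
        return prompts["video_prompt"]
    if "image" in kind:
        return prompts["image_prompt"]
    return None


# Placeholder prompt texts; the real ones come from the metrics JSON file.
demo_prompts = {
    "video_prompt": "<video judge prompt>",
    "image_prompt": "<image judge prompt>",
}

for mime in ("video/mp4", "image/jpeg", "audio/wav"):
    print(mime, "->", select_eval_prompt(mime, demo_prompts))
```

Rows with any other media type get no prompt, so they would need to be filtered out or handled separately before being sent to the judge.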

In [5]:
items

Unnamed: 0,asset_id,prompt_text_A,prompt_text_B,fileUri,description_A,description_B,asset_type,startOffset_seconds,endOffset_seconds,modelVersion_A,modelVersion_B
0,MAAT2024_1_A_HBB.mp4,Your task is to provide a comprehensive descri...,Your task is to provide a comprehensive descri...,gs://raw_nine_files/vlt_video_extract/MAAT/MAA...,"{""Category"": ""TV Show"", ""DetailedDescriptionOf...","{""Category"": ""TV Show"", ""DetailedDescriptionOf...",video/mp4,600,900,gemini-1.5-pro-002,gemini-1.5-flash-002


In [19]:
from google.cloud import aiplatform
import vertexai
import pandas as pd
import json
import math
from collections import Counter
import uuid 
from google.cloud import bigquery
from google.api_core.exceptions import NotFound
from datetime import datetime

from vertexai.preview.generative_models import (
    Content,
    GenerationConfig,
    GenerationResponse,
    GenerativeModel,
    Image,
    Part as GenerativeModelPart,
    HarmBlockThreshold,
    HarmCategory,
)

class PairwiseEvaluationClient:
    """Wrapper around Pairwise Evaluation Client."""

    def __init__(
        self,
        project: str=None,
        location: str = "us-central1",
        items: pd.core.frame.DataFrame = None,
        response_A_desc_column_name: str= 'description_A',
        response_B_desc_column_name: str= 'description_B',
        response_A_llm_model_column_name: str= None,
        response_B_llm_model_column_name: str=None,
        response_mediaType_column_name: str=None,
        response_media_column_metadata : dict=None,
        response_A_userPrompt_column_name: str=None,
        response_B_userPrompt_column_name: str=None,
        multimodal_evaluation_promt: dict=None,       
        experiment_name: str="pairwise-evaluation-experiment",
       
        ):
        """
        Initializes the evaluation parameters.
        
        Args:
         str project: project ID
         str location: project location
         DataFrame items: dataframe of AI-generated responses
         str response_A_desc_column_name: name of the column in the 'items' dataframe that holds the AI-generated response of model A
         str response_B_desc_column_name: name of the column in the 'items' dataframe that holds the AI-generated response of model B
         str response_A_llm_model_column_name: name of the column in the 'items' dataframe that holds the name of model A, which generated response A
         str response_B_llm_model_column_name: name of the column in the 'items' dataframe that holds the name of model B, which generated response B
         str response_mediaType_column_name: name of the column in the 'items' dataframe that holds the media (MIME) type
         str response_A_userPrompt_column_name: name of the column in the 'items' dataframe that holds the user prompt from which model A generated response A
         str response_B_userPrompt_column_name: name of the column in the 'items' dataframe that holds the user prompt from which model B generated response B
         dict response_media_column_metadata: dictionary mapping to the column names of the file URI and, if available, the start and end offsets of the media,
                                              e.g. {'fileUri':'fileUri', 'startOffset':'startOffset_seconds','endOffset':'endOffset_seconds', 'mediaType':'mediaType'}
         dict multimodal_evaluation_promt: dictionary of prompts for multimodal content evaluation,
                                           e.g. {"video_prompt":"...","image_prompt":"..."}
         str experiment_name: name of the evaluation experiment
        """
        
        #set the parameters
        self.location = location  
        self.project = project   
        self.items =items  
        self.experiment_name=experiment_name      
        self.multimodal_evaluation_promt=multimodal_evaluation_promt
        self.response_A_userPrompt_column_name=response_A_userPrompt_column_name
        self.response_B_userPrompt_column_name=response_B_userPrompt_column_name
        self.response_A_llm_model_column_name=response_A_llm_model_column_name
        self.response_B_llm_model_column_name=response_B_llm_model_column_name        
        self.response_media_column_metadata=response_media_column_metadata
        self.response_mediaType_column_name=response_mediaType_column_name
        self.response_A_desc_column_name=response_A_desc_column_name
        self.response_B_desc_column_name=response_B_desc_column_name
      
        self.run_experiment_name=self.experiment_name+"-"+ str(uuid.uuid4())

        # Load the schema from PairWise_Schema.json
        with open('PairWise_Schema.json') as config_file:
            self.pairwise_schema = json.load(config_file)
        
        #initialize Vertex AI
        vertexai.init(project=self.project, location= self.location )
         

    def set_evaluation_data(self):
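        """
        Prepares the input data as a dataframe for evaluation.

        Returns:
        DataFrame eval_dataset: one row per item with responses, prompts, media reference, and run metadata
        """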
            
        eval_dataset = pd.DataFrame(
                        {
                            "response_A": self.items[self.response_A_desc_column_name].to_list(),
                            "response_B": self.items[self.response_B_desc_column_name].to_list(),
                                     
                            **({"mediaType": self.items[self.response_mediaType_column_name].to_list()} if
                               self.response_mediaType_column_name is not None else {}),
                            **({"multimodal_evaluation_promt": [
                                self.multimodal_evaluation_promt['video_prompt'] if 'video' in str(self.items[self.response_mediaType_column_name].to_list()[i]).lower() else
                                self.multimodal_evaluation_promt['image_prompt'] if 'image' in str(self.items[self.response_mediaType_column_name].to_list()[i]).lower() else None
                                for i in range(len(self.items))
                            ]} if self.response_mediaType_column_name is not None and self.multimodal_evaluation_promt is not None else {}),

                            **({"instruction_A": self.items[self.response_A_userPrompt_column_name].to_list()} if
                               self.response_A_userPrompt_column_name is not None else {}),
                            **({"instruction_B": self.items[self.response_B_userPrompt_column_name].to_list()} if
                               self.response_B_userPrompt_column_name is not None else {}),
                            
                            "reference": [
                                        json.dumps(
                                            {
                                                "fileuri": self.items[self.response_media_column_metadata['fileUri']].to_list()[i],
                                                "metadata": {
                                                    "start_offset": {
                                                        "seconds": int(self.items[self.response_media_column_metadata['startOffset']].to_list()[i]),
                                                        "nanos": 0,
                                                    },
                                                    "end_offset": {
                                                        "seconds": int(self.items[self.response_media_column_metadata['endOffset']].to_list()[i]),
                                                        "nanos": 0,
                                                    },
                                                } if self.response_media_column_metadata['startOffset'] in self.items.columns and 
                                                     self.response_media_column_metadata['endOffset'] in self.items.columns and 
                                                     'video' in str(self.items[self.response_mediaType_column_name].to_list()[i]).lower() 
                                                else {}
                                            }
                                        ) if self.response_media_column_metadata is not None and 
                                             self.response_media_column_metadata.get('fileUri') is not None 
                                          else "{}"
                                
                                        for i in range(len(self.items))
                                    ],
                            # Use plain lists, like the other columns, so the result keeps a default index
                            "response_A_llm_model": self.items[self.response_A_llm_model_column_name].to_list(),
                            "response_B_llm_model": self.items[self.response_B_llm_model_column_name].to_list(),
                            "run_experiment_name": [self.run_experiment_name] * len(self.items),
                            "run_experiment_date": [datetime.today().strftime('%Y-%m-%d')] * len(self.items),
                        }
                    )
        
        return eval_dataset
    
    def log_evaluations(self,result):
        """
        Log the evaluation result into BigQuery, converting all columns to string type.

        Args:
            dataframe result : The evaluation result to be recorded into the database.
        """
        import json
        from google.cloud import bigquery
        from google.cloud.exceptions import NotFound

        # Load configuration from config.json
        with open('config.json') as config_file:
            config = json.load(config_file)

        table_id = config['pairwise_eval_table']
        dataset_id = config['eval_dataset']
        project_id = config["project"]
        location_id = config["project_location"]
        table_full_id = f"{project_id}.{dataset_id}.{table_id}"
        dataset_full_id = f"{project_id}.{dataset_id}"

        # Remove unwanted characters from column names
        result.columns = result.columns.str.replace("/", "_").str.replace(',','')

        # Convert all columns to string
        result = result.astype(str)

        # Convert DataFrame to list of dictionaries
        data_as_dict = result.to_dict(orient='records')

        # Initialize BigQuery Client
        client = bigquery.Client()


        # Ensure the dataset exists, creating it if necessary
        try:
            client.get_dataset(dataset_full_id)
            print(f"Dataset {dataset_full_id} exists.")
        except NotFound:
            print(f"Dataset {dataset_full_id} not found. Creating dataset...")
            dataset = bigquery.Dataset(dataset_full_id)
            dataset.location = location_id
            client.create_dataset(dataset)
            print(f"Dataset {dataset_full_id} created successfully.")


        # Ensure the table exists, adding any columns missing from its schema
        try:
            # Fetch the existing table
            table = client.get_table(table_full_id)
            existing_schema = {field.name: field.field_type for field in table.schema}
            print(f"Table {table_full_id} exists. Checking schema...")

            # Identify new columns to be added
            schema_changes = []
            for col in result.columns:
                if col not in existing_schema:
                    schema_changes.append(bigquery.SchemaField(col, bigquery.enums.SqlTypeNames.STRING))

            if schema_changes:
                print("Altering schema to add new columns...")
                table.schema = table.schema + schema_changes
                table = client.update_table(table, ["schema"])
                print("Schema updated successfully.")

        except NotFound:
            print(f"Table {table_full_id} not found. Creating table...")
            # Define schema as all string types
            schema = [bigquery.SchemaField(name, bigquery.enums.SqlTypeNames.STRING) for name in result.columns]

            # Create the table
            table = bigquery.Table(table_full_id, schema=schema)
            table = client.create_table(table)
            print(f"Table {table_full_id} created successfully.")


        # Insert rows into BigQuery
        try:
            errors = client.insert_rows_json(table_full_id, data_as_dict)
            if not errors:
                print(f"Evaluations have successfully been loaded into {table_full_id}.")
            else:
                print("Errors occurred while loading data:")
                for error in errors:
                    print(error)
        except Exception as e:
            print(f"An error occurred while inserting data: {e}")
    
    
    def get_autorater_response(self, metric_prompt: list, llm_model: str="gemini-1.5-pro") -> dict:
        
        """Scores AI-generated content with an LLM-as-a-judge (autorater) model.
        
        Args:
        list metric_prompt: the parts of the evaluation prompt sent to the judge
        str llm_model: name of the evaluation (judge) model

        Returns:
        dict response_json: the evaluated metrics parsed from the judge's JSON output, or {} if no text was returned
        """
            
        # set evaluation metric schema
        metric_response_schema = self.pairwise_schema 

        #define a generative model as an autorater
        autorater = GenerativeModel(
            llm_model,
            generation_config=GenerationConfig(
                response_mime_type="application/json",
                response_schema=metric_response_schema,
            ),
            safety_settings={
                HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
                HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
            },
        )

        #generate the rating metrics as per requested measures and metric in the prompt
        response = autorater.generate_content(metric_prompt)

        response_json = {}

        if response.candidates and len(response.candidates) > 0:
            candidate = response.candidates[0]
            if (
                candidate.content
                and candidate.content.parts
                and len(candidate.content.parts) > 0
            ):
                part = candidate.content.parts[0]
                if part.text:
                    response_json = json.loads(part.text)

        return response_json

    def custom_coverage_fn(self,instance):
       
        """Builds the pairwise evaluation prompt for one row and returns the judge's response.
        
        Args:
        pandas.Series instance: one row of the evaluation dataset, holding both responses, both user prompts, and the media reference
       
        Returns:
        dict evaluation_response: scores and explanations of the judgements for each requested metric
        """
        
        fileUri = json.loads(instance["reference"])["fileuri"]
        eval_instruction_template =instance["multimodal_evaluation_promt"]      
        user_prompt_A_instruction= instance["instruction_A"]
        user_prompt_B_instruction= instance["instruction_B"]
        response_A = instance["response_A"]
        response_B = instance["response_B"]
        
        evaluation_prompt=[]
        # set the evaluation prompt
        if 'video' in instance["mediaType"]:   
            evaluation_prompt = [
                eval_instruction_template,       
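                "VIDEO URI: ",
                # Note: the URI is passed as plain text, so the judge receives only the path string.
                # Passing GenerativeModelPart.from_uri(fileUri, mime_type=instance["mediaType"]) instead
                # would attach the media file itself (the whole file, not only the segment).
                fileUri,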
                "VIDEO METADATA: ",
                json.dumps(json.loads(instance["reference"])["metadata"]),  
                "USER'S INPUT PROMPT MODEL A:",
                user_prompt_A_instruction,
                "USER'S INPUT PROMPT MODEL B:",
                user_prompt_B_instruction,
                "GENERATED RESPONSE MODEL A: ",
                 response_A,
                 "GENERATED RESPONSE MODEL B: ",
                 response_B,
            ]
        elif 'image' in instance["mediaType"]:
            # generate the evaluation prompt
            evaluation_prompt = [
                eval_instruction_template,       
                "IMAGE URI: ",
                fileUri,   
                "USER'S INPUT PROMPT MODEL A:",
                user_prompt_A_instruction,
                "USER'S INPUT PROMPT MODEL B:",
                user_prompt_B_instruction,
                "GENERATED RESPONSE MODEL A: ",
                 response_A,
                 "GENERATED RESPONSE MODEL B: ",
                 response_B,
            ]
     
        #generate evaluation response
        evaluation_response = self.get_autorater_response(evaluation_prompt)
        return evaluation_response

    # Function to extract the score and explanation for each category
    def flatten_evaluations(self,instance):
        """Flattens one judge response into score and explanation entries per metric.
        
        Args:
        dict instance: the judge response for one row, keyed by metric name
       
        Returns:
        dict flattened_data: '<metric>_score' and '<metric>_explanation' entries for every required metric
        """ 
        flattened_data = {}
        for key in self.pairwise_schema['required']:
            flattened_data[f"{key.lower().replace(' ', '_')}_score"] = instance[key]['score']
            flattened_data[f"{key.lower().replace(' ', '_')}_explanation"] = instance[key]['explanation']
        
        return flattened_data
  
    
    def get_evaluations(self):
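        """
        Extracts the evaluation metrics using user-defined metrics and rating criteria,
        logs them to BigQuery, and returns them.

        Returns:
        DataFrame eval_results: the evaluation dataset with one score and one explanation column per metric
        """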
        # set evaluation data
        eval_dataset=self.set_evaluation_data()       
            
        #calculate coverage metrics
        if self.multimodal_evaluation_promt:
            #get evaluations
            eval_dataset['custom_coverage']=eval_dataset.apply(self.custom_coverage_fn,axis=1)
             
            # Apply the function to flatten the 'custom_coverage' column and create new columns
            flattened_df = eval_dataset['custom_coverage'].apply(self.flatten_evaluations)
                                                                 
            # Join the flattened columns to the original dataframe
            eval_dataset = eval_dataset.join(pd.json_normalize(flattened_df))
            eval_dataset = eval_dataset.drop(columns=["custom_coverage"])
            
        eval_results=eval_dataset
            
        #log the statistics into bigquery
        self.log_evaluations(eval_results)
            
        return eval_results 
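
The flattening step depends on the shape of `PairWise_Schema.json`, which is not shown in this notebook. The sketch below assumes that every required metric is an object with a `score` and an `explanation`, which is what `flatten_evaluations` reads. The metric names, the `integer` score type, and the placeholder judgement values are assumptions made for illustration, not the contents of the real file.

```python
# Hypothetical stand-in for PairWise_Schema.json. The metric names below are
# assumptions; only the score/explanation structure is implied by the class.
METRICS = ["Key Themes", "People Appearing And Mentioned"]


def metric_schema() -> dict:
    """Schema of a single metric: a score plus a free-text explanation."""
    return {
        "type": "object",
        "properties": {
            "score": {"type": "integer"},
            "explanation": {"type": "string"},
        },
        "required": ["score", "explanation"],
    }


demo_schema = {
    "type": "object",
    "properties": {name: metric_schema() for name in METRICS},
    "required": METRICS,
}

# Placeholder judge output that conforms to demo_schema.
demo_judgement = {
    "Key Themes": {"score": 3, "explanation": "placeholder explanation"},
    "People Appearing And Mentioned": {"score": 3, "explanation": "placeholder explanation"},
}


def flatten(judgement: dict, schema: dict) -> dict:
    """Mirror of flatten_evaluations: one score and one explanation entry per metric."""
    flat = {}
    for key in schema["required"]:
        column = key.lower().replace(" ", "_")
        flat[f"{column}_score"] = judgement[key]["score"]
        flat[f"{column}_explanation"] = judgement[key]["explanation"]
    return flat


print(sorted(flatten(demo_judgement, demo_schema)))
```

With these assumed metric names, the resulting keys are `key_themes_score`, `key_themes_explanation`, `people_appearing_and_mentioned_score`, and `people_appearing_and_mentioned_explanation`.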

    


In [20]:
pairwise_evaluation_client = PairwiseEvaluationClient(
    project='nine-quality-test',
    location='us-central1',
    items=items,
    response_A_desc_column_name="description_A",
    response_B_desc_column_name="description_B",
    response_A_llm_model_column_name="modelVersion_A",
    response_B_llm_model_column_name="modelVersion_B",
    experiment_name="pairwise-evaluation-experiment",
    multimodal_evaluation_promt=multimodal_evaluation_promt,  # prompts used to compare the generated video and image content
    response_A_userPrompt_column_name="prompt_text_A",  # column in 'items' that holds the user prompt used to generate content with model A
    response_B_userPrompt_column_name="prompt_text_B",  # column in 'items' that holds the user prompt used to generate content with model B
    response_media_column_metadata={'fileUri': 'fileUri', 'startOffset': 'startOffset_seconds', 'endOffset': 'endOffset_seconds', 'mediaType': 'asset_type'},  # names of the metadata columns in 'items'
    response_mediaType_column_name='asset_type',
)
evaluations=pairwise_evaluation_client.get_evaluations()

Dataset nine-quality-test.vlt_eval_statistics_schema exists.
Table nine-quality-test.vlt_eval_statistics_schema.vlt_pairwise_eval_statistics not found. Creating table...
Table nine-quality-test.vlt_eval_statistics_schema.vlt_pairwise_eval_statistics created successfully.
Evaluations have successfully been loaded into nine-quality-test.vlt_eval_statistics_schema.vlt_pairwise_eval_statistics.
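
The log lines above come from `log_evaluations`, which reads its target from `config.json`. A minimal sketch of that configuration and of the two pure-Python steps involved (cleaning column names and finding columns missing from an existing table) is given below. The project, dataset, and table values are taken from the log output; the location value and the helper names are assumptions.

```python
# Keys match those read from config.json. The location is an assumed
# placeholder, since its real value is not shown in this notebook.
demo_config = {
    "project": "nine-quality-test",
    "project_location": "US",  # assumption
    "eval_dataset": "vlt_eval_statistics_schema",
    "pairwise_eval_table": "vlt_pairwise_eval_statistics",
}


def full_table_id(config: dict) -> str:
    """Build the fully qualified BigQuery table ID: project.dataset.table."""
    return f"{config['project']}.{config['eval_dataset']}.{config['pairwise_eval_table']}"


def sanitize(columns: list) -> list:
    """Replace '/' with '_' and drop commas, as log_evaluations does."""
    return [name.replace("/", "_").replace(",", "") for name in columns]


def new_columns(result_columns: list, existing_columns: list) -> list:
    """Columns present in the result but absent from the existing table schema."""
    existing = set(existing_columns)
    return [name for name in result_columns if name not in existing]


print(full_table_id(demo_config))
print(sanitize(["a/b", "c,d"]))  # ['a_b', 'cd']
print(new_columns(["response_A", "key_themes_score"], ["response_A"]))  # ['key_themes_score']
```

Each column returned by `new_columns` corresponds to a `STRING` field that `log_evaluations` appends to the table schema before inserting rows.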


In [21]:
evaluations

Unnamed: 0,response_A,response_B,mediaType,multimodal_evaluation_promt,instruction_A,instruction_B,reference,response_A_llm_model,response_B_llm_model,run_experiment_name,...,detailed_description_of_events_and_conversations_score,detailed_description_of_events_and_conversations_explanation,brands_companynames_and_logos_score,brands_companynames_and_logos_explanation,keylocations_and_scenes_score,keylocations_and_scenes_explanation,key_themes_score,key_themes_explanation,people_appearing_and_mentioned_score,people_appearing_and_mentioned_explanation
0,"{""Category"": ""TV Show"", ""DetailedDescriptionOf...","{""Category"": ""TV Show"", ""DetailedDescriptionOf...",video/mp4,# Instruction\nYou are an expert evaluator. Yo...,Your task is to provide a comprehensive descri...,Your task is to provide a comprehensive descri...,"{""fileuri"": ""gs://raw_nine_files/vlt_video_ext...",gemini-1.5-pro-002,gemini-1.5-flash-002,pairwise-evaluation-experiment-03b2e9a2-b118-4...,...,3,Both models provided similar and relatively de...,5,"Model A is slightly better as it identifies ""F...",3,Both models captured the key locations accurat...,3,Both models accurately captured the key themes...,3,Both models accurately identified the people a...
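
The `reference` column in the result stores the media location and segment offsets as a JSON string. The sketch below decodes such a string back into its parts. The structure follows what `set_evaluation_data` builds, while the bucket name, file name, and helper `segment_bounds` are hypothetical.

```python
import json

# Hypothetical reference string in the shape built by set_evaluation_data;
# the bucket and file name are placeholders.
reference = json.dumps({
    "fileuri": "gs://example-bucket/example.mp4",
    "metadata": {
        "start_offset": {"seconds": 600, "nanos": 0},
        "end_offset": {"seconds": 900, "nanos": 0},
    },
})


def segment_bounds(reference_text: str) -> tuple:
    """Return (uri, start_seconds, end_seconds); missing values are None."""
    payload = json.loads(reference_text)
    metadata = payload.get("metadata") or {}
    start = metadata.get("start_offset", {}).get("seconds")
    end = metadata.get("end_offset", {}).get("seconds")
    return payload.get("fileuri"), start, end


print(segment_bounds(reference))  # ('gs://example-bucket/example.mp4', 600, 900)
print(segment_bounds("{}"))       # (None, None, None)
```

The second call covers rows for which no media metadata was supplied, where the class stores the string `"{}"`.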
