### LLM Evaluation

This notebook uses the Vertex AI Gen AI Evaluation Service on GCP to evaluate content generated by a generative AI API in terms of
- safety and sexually harmful content
- coherence and fluency
- verbosity and repetition

`PointWiseEvaluationMetrics.json` is the JSON file that supplies the requested metrics and their rating rubrics.
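
The schema of `PointWiseEvaluationMetrics.json` is not shown in this notebook, so the following is only a rough sketch of what such a file might contain and how it could be sanity-checked before use. The key names `criteria` and `rating_rubric`, the file name `ExampleMetrics.json`, and the helper `validate_metrics` are assumptions made for illustration; only the three metric names come from the evaluation output further down.

```python
import json
import os
import tempfile

# Hypothetical layout: one entry per metric, each with criteria and a rubric.
# The key names expected by PointWiseEvaluationClient may well differ.
example_metrics = {
    "safety": {
        "criteria": "The response contains no harmful or sexually explicit content.",
        "rating_rubric": {"5": "Completely safe", "1": "Clearly unsafe"},
    },
    "coherence and fluency": {
        "criteria": "The response is logically organised and reads naturally.",
        "rating_rubric": {"5": "Coherent and fluent", "1": "Incoherent"},
    },
    "verbosity": {
        "criteria": "The response is concise and free of repetition.",
        "rating_rubric": {"5": "Concise", "1": "Excessively verbose"},
    },
}


def validate_metrics(metrics):
    """Return the metric names after checking each entry has the assumed keys."""
    for name, spec in metrics.items():
        missing = {"criteria", "rating_rubric"} - set(spec)
        if missing:
            raise ValueError(f"Metric '{name}' is missing keys: {sorted(missing)}")
    return list(metrics)


# Round-trip through a temporary file to mimic loading the JSON from disk.
path = os.path.join(tempfile.mkdtemp(), "ExampleMetrics.json")
with open(path, "w") as fh:
    json.dump(example_metrics, fh, indent=2)
with open(path) as fh:
    loaded = json.load(fh)

metric_names = validate_metrics(loaded)
print(metric_names)
```

Validating the file up front means a malformed metric definition fails before any paid evaluation request is sent.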

### Get data from BigQuery

In [1]:
import time
import random
from google.cloud import bigquery
import json
from datetime import datetime
import pandas as pd

    
def get_predictions(table, dataset, project_id, filter_query=""):
    """Fetch AI-generated descriptions and metadata for image/video assets from BigQuery,
    joined with the surrounding article text and caption for images."""
  
    sql = f"""  
        WITH SEARCH_RESULT AS
         (SELECT 

                        asset_id, 
                        content,
                        headline,
                        html_safe_text,
                        description,
                        startOffset_seconds,
                        endOffset_seconds,
                        fileUri,
                        asset_type,
                        first_published_timestamp,
                        brand_type,
                        primary_category_name,
                        byline,
                        image_license_type,
                        publisher_type,
                        photographer,
                        date_published,
                        dxcId,
                        text_embedding_result ,
                        byline[SAFE_OFFSET(0)].author_name ,                    
                        CAST(JSON_EXTRACT_SCALAR(media_jsonbody, '$.response.candidates[0].avgLogprobs') AS FLOAT64) AS  avgLogprobs
                 FROM  `{dataset}.{table}` WHERE 1=1 and (LOWER(asset_type) LIKE '%video%' OR LOWER(asset_type) LIKE '%image%' ) {filter_query} 
        ),
          IMAGE_CONTEXT AS (
                   SELECT
                          pd.asset_id,
                          plain_text_column,
                          JSON_EXTRACT_SCALAR(entry, '$.image.mediaId') AS image_id,
                          JSON_EXTRACT_SCALAR(entry, '$.image.caption') AS image_caption
                        FROM
                          (SELECT
                              asset_id,
                              plain_text_column,
                              JSON_EXTRACT_ARRAY(article_body_json) AS article_body_json_array
                            FROM
                              `vlt_media_content_prelanding.vlt_article_content` -- change to vlt
                            WHERE
                              article_body_json IS NOT NULL
                          ) pd,
                          UNNEST(pd.article_body_json_array) AS entry -- Unnest the article body JSON array
                        WHERE
                          UPPER(JSON_EXTRACT_SCALAR(entry, '$.type')) = 'IMAGE' -- Filter to only 'IMAGE' type
                          AND JSON_EXTRACT_SCALAR(entry, '$.image.mediaId') IS NOT NULL -- Ensure there's an image ID
                       
          ) 
        
        SELECT sr.*,    plain_text_column as image_context ,  image_caption
        FROM SEARCH_RESULT   sr
        LEFT JOIN IMAGE_CONTEXT imgcnxt
        on REGEXP_REPLACE( sr.asset_id, r'\..*', '') =imgcnxt.image_id
    """       
 ##LOWER(asset_type) LIKE '%image%' OR 
    #print(sql)
    bq_client = bigquery.Client(project_id)
  
    # Run the query
    query_job = bq_client.query(sql)
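    # Default to an empty DataFrame so callers always receive a DataFrame, even on failure
    output = pd.DataFrame()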
    try:
        # Fetch results
        results = query_job.result()  
        df = results.to_dataframe()
       
        #drop duplicates
        df = df.drop_duplicates(subset=['asset_id', 'headline', 'description',
            'startOffset_seconds', 'endOffset_seconds', 'fileUri', 'asset_type',
            'first_published_timestamp', 'brand_type', 'primary_category_name',
            'author_name', 'image_license_type', 'publisher_type', 'photographer',
            'date_published', 'dxcId','avgLogprobs', 'image_context','image_caption' ])
        print(len(df))
        # Sort by asset_id and startOffset_seconds to ensure proper order
        df = df.sort_values(by=['asset_id', 'startOffset_seconds'])
        
     
        # Descriptions are deliberately NOT aggregated per asset_id, so that
        # segments with different timestamps keep their own description.
        #df['description'] = df.groupby('asset_id')['description'].transform(lambda x: '\n'.join(x))

        # Build a string describing the time range of each row's segment
        df['time_lines'] = df.apply(
            lambda row: f"{{'startOffset_seconds': {row['startOffset_seconds']}, 'endOffset_seconds': {row['endOffset_seconds']}}}", axis=1)
            
        # Now group by 'asset_id' and concatenate the strings in 'time_lines'
        time_lines = df.groupby(['asset_id'])['time_lines'].apply(lambda x: ', '.join(x)).reset_index()
        
        df.drop('time_lines', axis=1, inplace=True)
        # Merge the time_lines into the original DataFrame
        df = df.merge(time_lines, on=['asset_id'], how='left')
    
        #drop duplicates
        df = df.drop_duplicates(subset=['asset_id', 'headline', 'description',
                'fileUri', 'asset_type',
            'first_published_timestamp', 'brand_type', 'primary_category_name',
            'author_name', 'image_license_type', 'publisher_type', 'photographer',
            'date_published', 'dxcId',  'time_lines','avgLogprobs' ,'image_context','image_caption' ])[['asset_id', 'headline', 'description',
                'fileUri', 'asset_type',
            'first_published_timestamp', 'brand_type', 'primary_category_name',
            'author_name', 'image_license_type', 'publisher_type', 'photographer',
            'date_published', 'dxcId',  'time_lines','avgLogprobs' ,'image_context','image_caption' ]]
            
        # Convert datetime to string using astype(str)
        df['date_published'] = df['date_published'].astype(str)
        df['first_published_timestamp'] = df['first_published_timestamp'].astype(str) 
        
        #set the output
        output = df#.to_dict(orient='records') 
 
    except Exception as e:
        print(f"Error while fetching predictions: {e}")
    return output
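
The `time_lines` step inside `get_predictions` can be hard to follow inside the larger function. As a minimal sketch on made-up data (the asset IDs and offsets below are hypothetical), this is how the per-row strings are built and then joined per `asset_id`:

```python
import pandas as pd

# Toy stand-in for the query result: two segments for one asset, one for another.
toy = pd.DataFrame({
    "asset_id": ["a.mp4", "a.mp4", "b.mp4"],
    "startOffset_seconds": [10, 0, 0],
    "endOffset_seconds": [20, 10, 5],
})

# Sort so that segments are joined in chronological order within each asset.
toy = toy.sort_values(by=["asset_id", "startOffset_seconds"])

# One dict-like string per row, mirroring the format used in get_predictions.
toy["time_lines"] = toy.apply(
    lambda row: f"{{'startOffset_seconds': {row['startOffset_seconds']}, "
                f"'endOffset_seconds': {row['endOffset_seconds']}}}",
    axis=1,
)

# Concatenate the strings of all segments belonging to the same asset.
time_lines = (
    toy.groupby("asset_id")["time_lines"]
    .apply(", ".join)
    .reset_index()
)

lookup = dict(zip(time_lines["asset_id"], time_lines["time_lines"]))
print(lookup["a.mp4"])
```

Because the frame is sorted before grouping, the joined string lists each asset's segments from earliest to latest.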


In [2]:
dataset = "vlt_media_embeddings_integration"
content_table = "vlt_all_media_content_text_embeddings"
project_id = "nine-quality-test"
df = get_predictions(content_table, dataset, project_id, filter_query="")
df = df.reset_index(drop=True)

1568


### Pick some samples (a small subset keeps evaluation costs down)

In [3]:
# Pick one random sample
from sklearn.utils import shuffle
df = shuffle(df)
items=df.sample(1)
items

Unnamed: 0,asset_id,headline,description,fileUri,asset_type,first_published_timestamp,brand_type,primary_category_name,author_name,image_license_type,publisher_type,photographer,date_published,dxcId,time_lines,avgLogprobs,image_context,image_caption
1175,0b1dae83b0384856028b5efbcaf29a4866c12e4d.jpeg,,"The image features a man, likely in his 50s or...",gs://nineshowcaseassets/IMAGES/0b1dae83b038485...,image/jpeg,NaT,,,,Royalty Free,The Age,Chris Johnson,2023-01-23,0b1dae83b0384856028b5efbcaf29a4866c12e4d,"{'startOffset_seconds': <NA>, 'endOffset_secon...",-0.185971,Swiss authorities have moved quickly to calm f...,The fall comes just one day after Credit Suiss...
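
Since `DataFrame.sample` already draws rows at random, the separate shuffle is not strictly needed, and without a seed each run evaluates a different asset. A small sketch of a repeatable draw follows; the `demo` frame and the choice of `random_state=42` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_predictions (hypothetical values).
demo = pd.DataFrame({
    "asset_id": [f"asset_{i}" for i in range(10)],
    "description": [f"description {i}" for i in range(10)],
})

# A fixed random_state makes the draw repeatable across notebook runs,
# which helps when comparing evaluation scores between experiments.
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)

print(first["asset_id"].tolist())
```

Raising `n` increases coverage at the price of more evaluation requests, since each sampled response is scored once per metric.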


In [4]:
import json
experiment_name = "content-generation-qa-quality"
file_path = 'PointWiseEvaluationMetrics.json'

# Open and load the JSON file
with open(file_path, 'r') as file:
    eval_metrics = json.load(file)
 
#AI-generated Responses
items=items['description'].to_list()

In [5]:
from LLM_PointWiseEval_cls import PointWiseEvaluationClient

In [6]:
pointwise_evaluation_client = PointWiseEvaluationClient(
    project='nine-quality-test',
    location='us-central1',
    items=items,
    response_llm_model='gemini-pro-1.5',
    eval_metrics=eval_metrics,
    experiment_name="pointwise-evaluation-experiment",
    delete_experiment=True,
)
evaluations=pointwise_evaluation_client.get_evaluations()
evaluations

Associating projects/494586852359/locations/us-central1/metadataStores/default/contexts/pointwise-evaluation-experiment-pointwise-evaluation-experiment-4926a293-ea26-4e64-8195-dc849555b0c1 to Experiment: pointwise-evaluation-experiment


Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 3/3 [00:05<00:00,  1.79s/it]

All 3 metric requests are successfully computed.
Evaluation Took:5.387495030008722 seconds





Dataset nine-quality-test.vlt_eval_statistics_schema exists.
Table nine-quality-test.vlt_eval_statistics_schema.vlt_pointwise_eval_statistics exists. Checking schema...
Type change detected for column 'run_experiment_date' from STRING to DATE.
Schema is already up-to-date.
Evaluations have successfully been loaded into nine-quality-test.vlt_eval_statistics_schema.vlt_pointwise_eval_statistics.
Experiment run pointwise-evaluation-experiment-4926a293-ea26-4e64-8195-dc849555b0c1 skipped backing tensorboard run deletion.
To delete backing tensorboard run, execute the following:
tensorboard_run_artifact = aiplatform.metadata.artifact.Artifact(artifact_name=f"pointwise-evaluation-experiment-pointwise-evaluation-experiment-4926a293-ea26-4e64-8195-dc849555b0c1-tb-run")
tensorboard_run_resource = aiplatform.TensorboardRun(tensorboard_run_artifact.metadata["resourceName"])
tensorboard_run_resource.delete()
tensorboard_run_artifact.delete()
Deleting Context : projects/494586852359/locations/us-ce

Unnamed: 0,response,response_llm_model,run_experiment_name,run_experiment_date,safety-explanation,safety-score,coherence and fluency-explanation,coherence and fluency-score,verbosity-explanation,verbosity-score
0,"The image features a man, likely in his 50s or...",gemini-pro-1.5,pointwise-evaluation-experiment-4926a293-ea26-...,2025-01-21,"The response is comprehensive, detailed, and s...",5.0,The AI response provides a highly descriptive ...,5.0,"The response is excessively verbose, containin...",2.0
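
Once more than a handful of responses have been evaluated, reading the table row by row stops being practical. The sketch below shows one possible way to summarise the `*-score` columns and flag weak responses. It assumes the column naming seen in the output above; the data values and the `threshold` of 3.0 are hypothetical.

```python
import pandas as pd

# Hypothetical evaluation rows using the "<metric>-score" column naming above.
evaluations_demo = pd.DataFrame({
    "response": ["response A", "response B"],
    "safety-score": [5.0, 5.0],
    "coherence and fluency-score": [5.0, 4.0],
    "verbosity-score": [2.0, 4.0],
})

# Collect every score column regardless of which metrics were requested.
score_cols = [c for c in evaluations_demo.columns if c.endswith("-score")]

# Average score per metric across all evaluated responses.
mean_scores = evaluations_demo[score_cols].mean()

# Flag responses where any metric falls below the (assumed) threshold.
threshold = 3.0
low = evaluations_demo[score_cols].lt(threshold)
flagged = evaluations_demo.loc[low.any(axis=1), "response"].tolist()

print(mean_scores)
print(flagged)
```

The flagged responses can then be read alongside their `*-explanation` columns to see why the rater scored them low.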


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]
%pip install --upgrade --user bigframes -q
%pip install --quiet --upgrade nest_asyncio