#### (An Attempt at) RAG Evaluation with Metrics

Use the Python 3.12 environment when running this notebook locally.

To evaluate the effectiveness of the RAG system, we need a dataset and an evaluation method. Luckily, DeepEval can generate the dataset automatically. Unfortunately, it is designed for text only, so we will need a different approach to evaluate the images.

DeepEval uses gpt-4o by default. I tried setting it up to use Llama so that I could run it locally, but it didn't work, and I didn't have time to figure out what was wrong.

In [1]:
import random

import pandas as pd

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

We load the metadata sample that we used for the language detection test in language_detection.ipynb and for the product type check with Llama-vision in Huggingface_Llama_Vision.ipynb.

In [2]:
pdf_sample = pd.read_pickle('D:/abo-dataset/abo-listings-sample.pkl')
# Keep only the text columns; .copy() avoids a possible SettingWithCopyWarning below.
pdf_sample = pdf_sample[['item_name', 'brand', 'model_name', 'model_year',
                         'product_description', 'product_type', 'color',
                         'fabric_type', 'style', 'material', 'item_keywords',
                         'pattern', 'finish_type', 'bullet_point']].copy()

# Make product types readable: underscores become spaces, run-together names are split.
pdf_sample['product_type'] = pdf_sample['product_type'].str.replace('_', ' ')
product_type_fixes = {
    'FINERING': 'FINE RING',
    'FINENECKLACEBRACELETANKLET': 'FINE NECKLACE BRACELET ANKLET',
    'FINEEARRING': 'FINE EARRING',
    'FASHIONNECKLACEBRACELETANKLET': 'FASHION NECKLACE BRACELET ANKLET',
    'FINEOTHER': 'FINE OTHER',
    'FASHIONEARRING': 'FASHION EARRING',
    'SHOWERHEAD': 'SHOWER HEAD',
    'FASHIONOTHER': 'FASHION OTHER',
}
pdf_sample['product_type'] = pdf_sample['product_type'].replace(product_type_fixes)
pdf_sample['product_type'] = pdf_sample['product_type'].str.replace('ABIS ', '')

Because OpenAI's API is both rate-limited and paid, I randomly select 25 items to generate ground truths for, rather than using the whole sample set.

In [3]:
random.seed(42)
subsample_item_ids = random.sample(list(pdf_sample.index), 25)
subsample_item_ids

['B01IJ5A2UA',
 'B07DFCMDT1',
 'B00L21KJ20',
 'B07H53W5WP',
 'B082XCSHKB',
 'B07RXR8VWC',
 'B074H73HNQ',
 'B07W35MDVN',
 'B07D55NJXD',
 'B075MD3TN4',
 'B07R59V5SD',
 'B07VB5RTCP',
 'B01G7ISMZS',
 'B07V4TFWFK',
 'B00G5K7L24',
 'B07PM7MNY3',
 'B073WGMTB8',
 'B07KTKFNXC',
 'B07CW43RFT',
 'B07RX8DBH4',
 'B074V3VT8V',
 'B07QW3JRT4',
 'B07FRK9WG7',
 'B07P488YBD',
 'B06ZY79QCZ']

LLMs take strings as input, so we convert each row to a single string.

In [4]:
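def row_to_str(row):
    """Flatten one metadata row into a single ';'-separated string."""
    row_filtered = row.dropna()  # skip missing fields
    text = []
    for row_item in row_filtered:
        if isinstance(row_item, list):
            # List-valued fields contribute one entry per value.
            for list_item in row_item:
                text.append(str(list_item) + ';')
        else:
            text.append(str(row_item) + ';')

    # Replace newlines and carets with spaces, and put a space after each comma.
    final_string = ' '.join(text).replace('\n', ' ').replace('^', ' ').replace(',', ', ')
    return final_string[:-1]  # strip the trailing ';'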

In [5]:
row_strings = [row_to_str(pdf_sample.loc[item_id]) for item_id in subsample_item_ids]

In [6]:
row_strings

['AmazonBasics Multi-Angle Portable Stand for Tablets,  E-readers and Phones - Black; AmazonBasics; PORTABLE ELECTRONIC DEVICE STAND; Black; Rubber; ipad stand for desk holder tablet lamicall laptop adjustable; Portable stand for comfortable,  hands-free viewing of a 4- to 10-inch tablet,  e-reader,  or smartphone; Easily adjusts to multiple viewing angles using convenient side button; holds device in either portrait or landscape position; Compatible with Kindle,  iPhone,  iPad,  Samsung Galaxy / Tab,  Google Nexus,  HTC,  LG,  Nokia Lumia,  OnePlus,  and more; Removable rubber pad for slip and scratch-resistant performance; Durable zinc-alloy body can hold up to 4.9 kg; folds flat when closed',
 'Amazon Brand: Umi.Essentials Stainless Steel Self-Wringing Microfibre Spin Mop and Bucket Floor Cleaning System(Green); UMI; CLEANING JUST GOT EASIER-The UMI Spin Mop and Bucket System is an effective and revolutionary system that keeps your hands dry.; BUCKET; Spin Mop-grey Green; floor clea
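The `replace(',', ', ')` step in `row_to_str` adds a second space after commas that already had one, which is why the output contains runs like `Tablets,  E-readers`. The following is a minimal sketch of a variant that normalizes the spacing. `flatten_fields` and `toy_row` are hypothetical names, and the function takes a plain list of values (skipping `None` only) rather than a pandas row so that it runs on its own.

```python
import re

def flatten_fields(fields):
    """Join non-missing field values into one '; '-separated string."""
    parts = []
    for value in fields:
        if value is None:  # stand-in for the dropna() step
            continue
        items = value if isinstance(value, list) else [value]
        parts.extend(str(item) for item in items)
    text = '; '.join(parts).replace('\n', ' ').replace('^', ' ')
    # Put exactly one space after each comma, then collapse whitespace runs.
    text = re.sub(r',\s*', ', ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical toy row: a name, a missing field, a list field, a bullet point.
toy_row = ['Stand for Tablets, E-readers', None, ['Black', 'Rubber'],
           'folds flat,when closed']
print(flatten_fields(toy_row))
```

Like the original, this still inserts a space into numbers such as `1,000`, so it is only a cosmetic cleanup.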

We need to tell the synthesizer how our RAG system is to be used.

In [7]:
styling_config = StylingConfig(
    input_format="An English-language request with a description of a product the user desires.",
    expected_output_format="A product name with a brief description.",
    task="A useful shopping assistant that helps users find products that meet their specifications.",
    scenario="Non-technical users looking to do some shopping without the usual hassle of looking through lists of products."
    )

Run the synthesizer.

In [8]:
import os
os.environ['OPENAI_API_KEY'] = "XXXX"

In [9]:
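synthesizer = Synthesizer(max_concurrent=1, styling_config=styling_config)
# Each context is a list of strings; here every product is its own one-item context.
contexts = [[row_string] for row_string in row_strings]
goldens = synthesizer.generate_goldens_from_contexts(contexts=contexts)
goldens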

Event loop is already running. Applying nest_asyncio patch to allow async execution...


✨ Generating up to 50 goldens using DeepEval (using gpt-4o, method=default):   6%|▌         | 3/50 [00:52<13:49, 17.64s/it]ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
✨ Generating up to 50 goldens using DeepEval (using gpt-4o, method=default):  16%|█▌        | 8/50 [01:49<07:35, 10.84s/it]ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
✨ Generating up to 50 goldens using DeepEval (using gpt-4o, method=default):  18%|█▊        | 9/50 [02:19<11:26, 16.73s/it]ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 3 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 4 time(s)...
✨ Generating up to 50 goldens using DeepEval (using gpt-4o, method=default):  24%|██▍       | 12/50 [03:18<10:07, 15.98s/it]ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
✨ Generatin

[Golden(input='Can you recommend a portable stand that securely holds an iPad, similar to the AmazonBasics Multi-Angle stand?', actual_output=None, expected_output='AmazonBasics Multi-Angle Portable Stand: A durable, adjustable stand for comfortable, hands-free viewing of tablets, e-readers, and smartphones, compatible with devices like iPads.', context=['AmazonBasics Multi-Angle Portable Stand for Tablets,  E-readers and Phones - Black; AmazonBasics; PORTABLE ELECTRONIC DEVICE STAND; Black; Rubber; ipad stand for desk holder tablet lamicall laptop adjustable; Portable stand for comfortable,  hands-free viewing of a 4- to 10-inch tablet,  e-reader,  or smartphone; Easily adjusts to multiple viewing angles using convenient side button; holds device in either portrait or landscape position; Compatible with Kindle,  iPhone,  iPad,  Samsung Galaxy / Tab,  Google Nexus,  HTC,  LG,  Nokia Lumia,  OnePlus,  and more; Removable rubber pad for slip and scratch-resistant performance; Durable zin

These ground truths are pretty horrible, and the inputs are way too descriptive: the first one even names the exact product it is asking for.
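One rough way to put a number on "too descriptive" is to check how many words of an input are copied straight from its context. The sketch below is only a heuristic, not a DeepEval metric. `input_overlap` is a hypothetical helper, and the example context is shortened to the item name.

```python
import re

def input_overlap(input_text, context_text):
    """Fraction of distinct input words that also occur in the context."""
    def tokenize(text):
        return set(re.findall(r'[a-z0-9]+', text.lower()))

    input_words = tokenize(input_text)
    if not input_words:
        return 0.0
    return len(input_words & tokenize(context_text)) / len(input_words)

golden_input = ('Can you recommend a portable stand that securely holds an iPad, '
                'similar to the AmazonBasics Multi-Angle stand?')
context = ('AmazonBasics Multi-Angle Portable Stand for Tablets, '
           'E-readers and Phones - Black')
print(round(input_overlap(golden_input, context), 2))
```

Brand and model words shared between the input and the context (here `amazonbasics`, `multi`, `angle`) are the clearest sign that the question gives away the answer.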

In [10]:
goldens_dict = {'item_id': [], 'input': [], 'expected_output': []}
for i, golden in enumerate(goldens):
    # Assumes exactly two goldens per context, in input order (25 contexts -> 50 goldens).
    goldens_dict['item_id'].append(subsample_item_ids[i//2])
    goldens_dict['input'].append(golden.input)
    goldens_dict['expected_output'].append(golden.expected_output)
goldens_df = pd.DataFrame(goldens_dict)
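The `i//2` lookup only holds if every context yields exactly two goldens, in order, and the progress bar only promises "up to 50". Since each `Golden` printed earlier carries the `context` it was generated from, a lookup keyed on that string avoids the assumption. Here is a minimal sketch, using a hypothetical `FakeGolden` stand-in and toy data so that it runs without DeepEval. It assumes the context strings come back unchanged.

```python
from dataclasses import dataclass

@dataclass
class FakeGolden:
    """Hypothetical stand-in exposing only the Golden fields used here."""
    input: str
    expected_output: str
    context: list

def attach_item_ids(goldens, row_strings, item_ids):
    """Return one record per golden, looking the item ID up by context string."""
    context_to_item_id = dict(zip(row_strings, item_ids))
    records = []
    for golden in goldens:
        records.append({
            'item_id': context_to_item_id[golden.context[0]],
            'input': golden.input,
            'expected_output': golden.expected_output,
        })
    return records

# Toy data: the first context produced only one golden, the second produced two.
toy_row_strings = ['stand; black', 'mop; green']
toy_item_ids = ['ID_A', 'ID_B']
toy_goldens = [
    FakeGolden('find a stand', 'Stand', ['stand; black']),
    FakeGolden('find a mop', 'Mop', ['mop; green']),
    FakeGolden('find a green mop', 'Mop', ['mop; green']),
]
records = attach_item_ids(toy_goldens, toy_row_strings, toy_item_ids)
print([record['item_id'] for record in records])
```

With this toy data, `i//2` would assign `ID_A` to the second golden, while the context lookup assigns `ID_B`. Passing `records` to `pd.DataFrame` gives the same three columns as the cell above.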

In [11]:
goldens_df.to_pickle('D:/abo-dataset/abo-listings-subsample-ground-truths.pkl')
goldens_df = pd.read_pickle('D:/abo-dataset/abo-listings-subsample-ground-truths.pkl')
goldens_df

| | item_id | input | expected_output |
|---|---|---|---|
| 0 | B01IJ5A2UA | Can you recommend a portable stand that secure... | AmazonBasics Multi-Angle Portable Stand: A dur... |
| 1 | B01IJ5A2UA | Can you help me find a durable, adjustable tab... | AmazonBasics Multi-Angle Portable Stand: A dur... |
| 2 | B07DFCMDT1 | I'm looking for a mop system that offers easy,... | Umi.Essentials Stainless Steel Self-Wringing M... |
| 3 | B07DFCMDT1 | I'm looking for an advanced mop with integrate... | Umi.Essentials Stainless Steel Self-Wringing M... |
| 4 | B00L21KJ20 | I'm looking for an AmazonBasics tablet accesso... | AmazonBasics Capacitive Stylus Pen: An Amazon ... |
| 5 | B00L21KJ20 | I'm looking for a reliable capacitive stylus p... | AmazonBasics Capacitive Stylus Pen: A reliable... |
| 6 | B07H53W5WP | I am looking for Amazon Elements fragrance-fre... | Amazon Elements Baby Wipes: Fragrance-free, su... |
| 7 | B07H53W5WP | I'm looking for multipack fragrance-free wipes... | Amazon Elements Baby Wipes: Fragrance-free, al... |
| 8 | B082XCSHKB | I'm looking for a 5-blade fan that combines mo... | Amazon Brand – Stone & Beam Remote-Controlled ... |
| 9 | B082XCSHKB | I'm looking for a 52" remote-controlled ceilin... | Amazon Brand – Stone & Beam Remote-Controlled ... |
