# Tutorial task - Keywords

Authors: 
- Mikołaj Gałkowski
- Julia Przybytniowska

We decided to use the following models:

- "google/gemma-2-9b-it",
- "Qwen/Qwen2.5-7B-Instruct",
- "meta-llama/Llama-3.1-8B-Instruct", 
- "ministral-8b-latest",

### Requirements

1. Identifying Product Data with LLMs
2. Defining similarity between products from the dataset and reviews

In [53]:
from dotenv import load_dotenv

load_dotenv() # loading api keys

True

In [54]:
import os
import json
import time
from mistralai import Mistral
from transformers import pipeline
from tqdm.notebook import tqdm

api_key = os.environ["API_KEY"]
model = "mistral-embed"

client = Mistral(api_key=api_key)

In [55]:
prompts = {
    "review1": {
        "review_content": "Gets my clothes fresh and clean every time. No lingering odor with Tide.",
        "golden_answer": {
            "product category": "Powder Detergents for Laundry",
            "brand": "Tide",
            "other keywords": ['fresh', 'clean', 'no lingering odor']
        },
    },
    "review2": {
        "review_content": "Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo.",
        "golden_answer": {
            "product category": "Thermal Mugs",
            "brand": "Contigo",
            "other keywords": ["hot for hours"]
        },
    },
}

In [198]:
reviews_for_evaluation = [
    "Gets my clothes fresh and clean every time. No lingering odor with Tide.",
    "Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo.",
]

reviews = [
    # Thermal Mugs
    "Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo.",
    "The lid isn’t leak-proof, but it keeps drinks warm for a decent amount of time.",
    "Love the sleek design of my Zojirushi mug, and it fits perfectly in my car cup holder!",
    "It’s lightweight but keeps my drinks at the right temperature for hours with Hydro Flask.",
    "No more cold coffee! This Yeti thermal mug does the job.",
    "It’s easy to clean, and the thermal insulation works like a charm.",
    "The handle makes it easy to carry, and it doesn’t spill.",
    "Not great for keeping drinks cold, but excellent for hot beverages.",
    "I wish it were bigger, but it’s perfect for my morning tea.",
    "I accidentally dropped it, and it didn’t dent! Very sturdy.",
    "The rubber seal around the lid came loose after a few weeks. Disappointing.",
    "Great for both coffee and soup—keeps them warm for hours.",
    "The exterior stays cool, even when my drink is piping hot inside.",
    "I love the color options, and it’s great for on-the-go.",
    "Keeps ice water cold for hours, even in hot weather!",
    "It’s a little tricky to open one-handed, but overall, a great mug.",
    "The size is perfect for travel, and it keeps drinks hot all day.",
    "It doesn’t leak, even when I toss it in my bag. Highly recommend Contigo.",
    "The lid is a little tight, but the mug works well for keeping drinks warm.",
    "Very stylish and functional! I get compliments all the time.",
    "Keeps my coffee scalding hot for longer than any mug I’ve owned with Zojirushi.",
    "Great value for the price. Works just as well as more expensive brands.",
    "The mug is lightweight and easy to carry around.",
    "It fits perfectly under my single-serve coffee machine!",
    "Durable, sleek, and it does exactly what it’s supposed to.",

    # Dishwasher Detergents
    "My dishes come out sparkling clean every time with Cascade. Love this detergent!",
    "It works well on glass, but I’ve noticed spots on my silverware.",
    "Great for tough, greasy messes. Leaves no residue! Thanks, Finish.",
    "This detergent smells amazing and leaves my dishwasher fresh.",
    "It’s a little pricey, but my dishes have never looked better with Cascade Platinum.",
    "Gets rid of even the most stubborn baked-on food. Highly recommend Finish Quantum.",
    "Not the best on hard water stains, but otherwise it works great.",
    "My dishes have never been so spotless after a wash!",
    "It’s very effective, but I wish it came in a fragrance-free version.",
    "Cuts through grease like a dream. No more pre-rinsing with Cascade Complete.",
    "This detergent doesn’t leave any residue on plastic, which I love.",
    "My glasses come out clear and sparkling every single time.",
    "It doesn’t work well with my eco dishwasher. Dishes aren’t as clean.",
    "Very efficient—gets rid of food stains and smells with no problem.",
    "I noticed some streaks on my glassware, but overall it works well.",
    "Leaves my dishes spotless and my machine smelling fresh.",
    "A great, eco-friendly option that actually works!",
    "It’s a little hard on some of my delicate dishware.",
    "This is the only detergent that works on my hard water stains.",
    "No need to rewash dishes after using this—so efficient!",
    "Perfect for everyday use. My dishes are clean and shiny.",
    "Leaves a chemical smell, but it’s effective at cleaning.",
    "A bit expensive, but worth it for the spotless results.",
    "No more streaks or water spots! Best dishwasher detergent ever.",
    "My silverware and dishes look brand new after every wash.",

    # Sunscreens
    "Absorbs quickly and doesn’t leave a greasy residue. Great for daily use with Neutrogena.",
    "The scent is a little strong, but it protects well with Coppertone.",
    "Perfect for sensitive skin! No breakouts or irritation with La Roche-Posay.",
    "A bit thick to apply, but once it’s on, it stays all day.",
    "I love the lightweight formula of CeraVe, perfect for wearing under makeup.",
    "Doesn’t leave a white cast, even on darker skin tones.",
    "The spray bottle makes it super easy to apply on the go.",
    "This sunscreen saved me from burning on a beach vacation with Banana Boat!",
    "It’s waterproof, which is a must for pool days. Highly recommend Hawaiian Tropic.",
    "A bit pricey, but the protection it provides is worth every penny with Supergoop.",
    "This is my go-to sunscreen for both my face and body with Neutrogena.",
    "It’s a little greasy, but it gets the job done in strong sun.",
    "No weird scent, and it goes on smooth. Love this product!",
    "Perfect for outdoor activities—no sunburn, even after hours outside.",
    "It’s great for kids! No irritation, and it’s easy to apply with Blue Lizard.",
    "A little too heavy for my face, but works perfectly for the body.",
    "The texture is nice and light, not sticky at all.",
    "This sunscreen doesn’t clog my pores, which is a huge plus with EltaMD.",
    "I’ve tried a lot of sunscreens, and this one offers the best protection with La Roche-Posay.",
    "It leaves a slight sheen, but I love how protected my skin feels.",
    "This formula doesn’t dry out my skin like others do.",
    "It’s great under makeup—no pilling or greasiness.",
    "I wish it were more affordable, but it’s worth it for the protection.",
    "Very effective, even after swimming for hours.",
    "My skin stays soft and protected all day with Neutrogena sunscreen.",

     # Powder Detergents for Laundry
    "Gets my clothes fresh and clean every time. No lingering odor with Tide.",
    "It dissolves well, even in cold water. My whites have never been brighter thanks to Ariel.",
    "A little pricey, but worth it for the excellent stain removal power of Persil.",
    "This powder leaves a residue on darker clothes. Not a fan of OMO.",
    "Great for sensitive skin! No itching or redness after using Seventh Generation.",
    "I love how eco-friendly this detergent is. It’s a big plus for me with Ecover.",
    "I don’t need fabric softener anymore—this leaves my clothes so soft!",
    "My laundry has never smelled so fresh, and it lasts for days with Gain.",
    "It’s not the best for heavy stains but works great for daily washes.",
    "Great value for the price. This box lasts forever! Thanks, Arm & Hammer.",
    "Perfect for my workout gear—gets rid of all the sweat smells.",
    "Leaves a bit of powder behind in the machine, but it cleans well.",
    "I’ve been using it for years, and Tide never disappoints.",
    "Not as effective in hard water areas, but still decent.",
    "My go-to detergent for all of my family’s laundry needs.",
    "I noticed some fading in my darker clothes after a few washes.",
    "It’s gentle on my baby’s clothes and skin with Dreft.",
    "Very effective at removing mud and grass stains from the kids’ clothes.",
    "I like the scent, but it might be too strong for some.",
    "No complaints so far! My clothes feel clean and fresh.",
    "Works just as well as liquid detergents but at a lower cost.",
    "A bit too perfumed for my taste, but it gets the job done.",
    "My clothes are noticeably softer and smell better than before.",
    "The box is hard to pour from, but the detergent works well.",
    "This is my new favorite detergent. So much better than the leading brand!",
]

In [42]:
def run_mistral(user_message, model=model):
    messages = [{"role": "user", "content": user_message}]
    chat_response = client.chat.complete(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )
    return chat_response.choices[0].message.content

prompt_template = """
Extract information from the following reviews:
{content}

Return json format with the following JSON schema:
{{
        "product category": {{
            "type": "string",
            "enum": ["Powder Detergents for Laundry", "Thermal Mugs", "Dishwasher Detergents", "Sunscreens", "Nappies", "Others"]
        }},
        "brand": {{
            "type": "string" or N/A
        }},
        "other keywords": {{
            "type": "array",
            "items": {{
                "type": "string"
            }}
        }},

}}
"""

### Model inference loop

In [44]:
from transformers import pipeline
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

models = [
    "Qwen/Qwen2.5-7B-Instruct", "meta-llama/Llama-3.1-8B-Instruct", 
    "google/gemma-2-9b-it","ministral-8b-latest",
]

In [51]:
results_for_evaluation = {}

for model in tqdm(models, desc="Processing Models"):
    model_results = {}
    
    if model == "ministral-8b-latest":
        for review in tqdm(reviews_for_evaluation, desc=f"Processing Reviews for {model}", leave=False):
            user_message = prompt_template.format(content=review)
            response = run_mistral(user_message, model=model)
            model_results[review] = response
            time.sleep(4)
    else:
        pipe = pipeline(
            "text-generation", 
            model=model, 
            max_new_tokens=1000, 
            model_kwargs={
                "torch_dtype": torch.float16,
                "quantization_config": {"load_in_4bit": True},
                "low_cpu_mem_usage": True,
            },
            trust_remote_code=True,
        )
        
        for review in tqdm(reviews_for_evaluation, desc=f"Processing Reviews for {model}", leave=False):
            user_message = prompt_template.format(content=review)
            response = pipe(user_message)
            try:
                model_results[review] = response[0]['generated_text']
            except:
                model_results[review] = response
        
        del pipe
    
    results_for_evaluation[model] = model_results

with open('merged_results.json', 'w') as f:
    json.dump(results_for_evaluation, f, indent=4)

Processing Models:   0%|          | 0/4 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Processing Reviews for Qwen/Qwen2.5-7B-Instruct:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Processing Reviews for meta-llama/Llama-3.1-8B-Instruct:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Processing Reviews for google/gemma-2-9b-it:   0%|          | 0/2 [00:00<?, ?it/s]

Processing Reviews for ministral-8b-latest:   0%|          | 0/2 [00:00<?, ?it/s]

In [419]:
with open('merged_results.json', 'r') as f:
    results = json.load(f)
results.keys()

dict_keys(['ministral-8b-latest', 'google/gemma-2-9b-it', 'Qwen/Qwen2.5-7B-Instruct', 'meta-llama/Llama-3.1-8B-Instruct'])

In [422]:
for k, v in results.items():
    print(k, ' : ', len(v))

ministral-8b-latest  :  100
google/gemma-2-9b-it  :  100
Qwen/Qwen2.5-7B-Instruct  :  100
meta-llama/Llama-3.1-8B-Instruct  :  100


Each model was called for each review.

# Evaluation on 2 cases from `prompts`

In [None]:
def compare_json_objects(obj1, obj2, keys_to_compare={'other keywords', 'brand', 'product category'}):
    total_fields = 0
    identical_fields = 0
    common_keys = set(obj1.keys()) & set(obj2.keys() & keys_to_compare)
    for key in common_keys:
        if obj2[key] == "N/A":
            obj2[key] = "Not mentioned."
        identical_fields += obj1[key] == obj2[key]
    percentage_identical = (identical_fields / max(len(keys_to_compare), 1)) * 100

    return percentage_identical

In [300]:
import unicodedata
import re

json_pattern = re.compile(r'\{\s*"product category":\s*".+?",\s*"brand":\s*".+?",\s*"other keywords":\s*\[.*?\]\s*?\}', re.DOTALL)

accuracy_rates = {}

for model_name in models:
    model_accuracy_rates = []
    i=0
    for _, content in prompts.items():
        i+=1
        if i==1: continue
        text = content["review_content"]
        golden_answer = content["golden_answer"]
        response = results[model_name].get(text)

        response = re.sub(r'("brand":\s*)(N/A)(,)', r'\1"N/A"\3', response)

        match = json_pattern.search(response)
        if match:
            response = match.group()
        else:
            continue
        
        response = unicodedata.normalize("NFKD", response).encode("ascii", "ignore").decode("ascii")
                
        accuracy_rate = compare_json_objects(ast.literal_eval(response), 
                                         golden_answer, 
                                         keys_to_compare={'product category', 'brand'})
        model_accuracy_rates.append(accuracy_rate)

    average_accuracy = sum(model_accuracy_rates) / len(model_accuracy_rates) if model_accuracy_rates else 0
    accuracy_rates[model_name] = average_accuracy

print("Model Accuracy Rates:")
for model_name, accuracy in accuracy_rates.items():
    print(f"{model_name}: {accuracy:.2f}%")

Model Accuracy Rates:
Qwen/Qwen2.5-7B-Instruct: 100.00%
meta-llama/Llama-3.1-8B-Instruct: 100.00%
google/gemma-2-9b-it: 100.00%
ministral-8b-latest: 100.00%


All models achieved 100% accuracy on both test examples. 

To achieve a more meaningful accuracy measure, we expanded the test set (`golden data`) by generating additional answers with `Llama3.1-7b` and new prompt with **few-shot examples**, allowing for a more robust comparison.

In [71]:
prompt_template_for_golden_dataset = """
You are experienced at categorizing reviews in different product categories ['Powder Detergents for Laundry', 'Thermal Mugs', 'Dishwasher Detergents', 'Sunscreens', 'Nappies', 'Others']
and also providing brand name and other relevant keyword.

For example:
    - for this review: 'Gets my clothes fresh and clean every time. No lingering odor with Tide.'
{{
    'product category': 'Powder Detergents for Laundry',
    'brand': 'Tide',
    'other keywords': ['fresh', 'clean', 'no lingering odor']
}}

    - for this review: 'Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo.'
{{
    'product category': 'Thermal Mugs',
    'brand': 'Contigo',
    'other keywords': ['hot for hours']
}}

If brand is not mentioned please assign: "Not mentioned."

Please return only dictionary in your response with these keys ['product category', 'brand', 'other keywords'] as in examples above for this review:

{review}
"""

### Llama3.1 7b - for golden dataset creation

In [74]:
pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-7B-Instruct", max_new_tokens=250,
                model_kwargs={
                    "torch_dtype": torch.float16,
                    "quantization_config": {"load_in_4bit": True},
                    "low_cpu_mem_usage": True,
                 },
                trust_remote_code=True,               
)
pipe.model.generation_config.pad_token_id = pipe.tokenizer.pad_token_id

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [87]:
json_pattern = re.compile(
    r"\{\s*'product category':\s*'.+?',\s*'brand':\s*'.+?',\s*'other keywords':\s*\[.*?\]\s*\}", re.DOTALL
)

golden_data = {}
i = 3  # Starting index (to have unique keys in final dictionary)

skip_reviews = {
    "Gets my clothes fresh and clean every time. No lingering odor with Tide.",
    "Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo."
}

for review in tqdm(reviews):
    if review in skip_reviews:
        continue

    user_message = prompt_template_for_golden_dataset.format(review=review)
    response = pipe(user_message)
    generated_text = response[0]['generated_text']

    match = json_pattern.search(generated_text)
    if match:
        try:
            extracted_dict = ast.literal_eval(match.group())
        except (ValueError, SyntaxError) as e:
            print(f"Error parsing response for review: {review} (index: {i}) - {e}")
            i += 1
            continue
    else:
        print(f"No valid JSON found for review: {review} (index: {i})")
        i += 1
        continue

    golden_data[f'review{i}'] = {
        "review_content": review,
        "golden_answer": extracted_dict,
    }
    i += 1

  0%|          | 0/99 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Settin

In [89]:
golden = prompts | golden_data

{'review1': {'review_content': 'Gets my clothes fresh and clean every time. No lingering odor with Tide.',
  'golden_answer': {'product category': 'Powder Detergents for Laundry',
   'brand': 'Tide',
   'other keywords': ['fresh', 'clean', 'no lingering odor']}},
 'review2': {'review_content': 'Keeps my coffee hot for hours—just what I need for long workdays. Thanks, Contigo.',
  'golden_answer': {'product category': 'Thermal Mugs',
   'brand': 'Contigo',
   'other keywords': ['hot for hours']}},
 'review3': {'review_content': 'The lid isn’t leak-proof, but it keeps drinks warm for a decent amount of time.',
  'golden_answer': {'product category': 'Powder Detergents for Laundry',
   'brand': 'Tide',
   'other keywords': ['fresh', 'clean', 'no lingering odor']}},
 'review4': {'review_content': 'Love the sleek design of my Zojirushi mug, and it fits perfectly in my car cup holder!',
  'golden_answer': {'product category': 'Powder Detergents for Laundry',
   'brand': 'Tide',
   'other key

In [158]:
with open('golden_dataset.json', 'w') as f:
    json.dump(golden, f, indent=4)

As we can see updated prompt resulted in accurate responses, so we decided to evaluate all previous results against these.

### Evaluation using `golden` dataset

In [301]:
accuracy_rates = {}

for model_name in models:
    model_accuracy_rates = []
    i=0
    for _, content in golden.items():
        i+=1
        if i==1: continue
        text = content["review_content"]
        golden_answer = content["golden_answer"]
        response = results[model_name].get(text)
        if response is None:
            print('Response is None')
            print(text)
            continue

        # JSON extraction using RegEx
        response = re.sub(r'("brand":\s*)(N/A)(,)', r'\1"N/A"\3', response)
        
        match = json_pattern.search(response)
        if match:
            response = match.group()
        else:
            continue
        
        response = unicodedata.normalize("NFKD", response).encode("ascii", "ignore").decode("ascii")
                
        accuracy_rate = compare_json_objects(ast.literal_eval(response), 
                                         golden_answer, 
                                         keys_to_compare={'product category', 'brand'})
        model_accuracy_rates.append(accuracy_rate)

        results[model_name][text] = ast.literal_eval(response)

    average_accuracy = sum(model_accuracy_rates) / len(model_accuracy_rates) if model_accuracy_rates else 0
    accuracy_rates[model_name] = average_accuracy

print("Model Accuracy Rates:")
for model_name, accuracy in accuracy_rates.items():
    print(f"{model_name}: {accuracy:.2f}%")

Model Accuracy Rates:
Qwen/Qwen2.5-7B-Instruct: 15.96%
meta-llama/Llama-3.1-8B-Instruct: 20.54%
google/gemma-2-9b-it: 14.14%
ministral-8b-latest: 16.67%


Let's investigate what is the reason behind poor accuracies.

In [428]:
print(results['meta-llama/Llama-3.1-8B-Instruct']['A bit too perfumed for my taste, but it gets the job done.'])


Extract information from the following reviews:
A bit too perfumed for my taste, but it gets the job done.

Return json format with the following JSON schema:
{
        "product category": {
            "type": "string",
            "enum": ["Powder Detergents for Laundry", "Thermal Mugs", "Dishwasher Detergents", "Sunscreens", "Nappies", "Others"]
        },
        "brand": {
            "type": "string" or N/A
        },
        "other keywords": {
            "type": "array",
            "items": {
                "type": "string"
            }
        },

}
----------------------------------------------------------

Review 1:
Product: Powder Detergents for Laundry
Brand: N/A
Other keywords: perfumed, job done

Review 2:
Product: Thermal Mugs
Brand: Yeti
Other keywords: job done, good

Review 3:
Product: Dishwasher Detergents
Brand: N/A
Other keywords: job done, cheap

Review 4:
Product: Sunscreens
Brand: N/A
Other keywords: job done, good, perfumed

Review 5:
Product: Nappies
Bra

The response above is an example that didn’t follow the expected format due to unnecessary repetition. To get better results we would try different prompt engineering techniques as we did in `golden` dataset.

### Product offers

In [302]:
product_offers = [
    # Thermal Mugs
    "Contigo Workday Travel Mug – Keeps Coffee Hot for Hours!",
    "Zojirushi Sleek Travel Mug – Perfect Fit for Car Holders",
    "Hydro Flask Lightweight Insulation Mug – Stay Warm for Hours",
    "Yeti Thermal Mug – No More Cold Coffee!",
    "Contigo All-Day Heat Retention Mug – Ideal for Travel",
    "Contigo Leak-Proof Mug – Toss in Your Bag with Confidence",
    "Zojirushi Scalding Hot Coffee Mug – Best Insulation Yet",

    # Dishwasher Detergents
    "Cascade Sparkling Clean Detergent – Your Dishes Will Shine",
    "Finish Detergent – Tough on Grease, No Residue Left",
    "Cascade Platinum Detergent – Pricey, But Worth It for Results",
    "Finish Quantum Detergent – Stubborn Food Stains Gone",
    "Cascade Complete Detergent – No Pre-Rinsing Needed for Grease",

    # Sunscreens
    "Neutrogena Daily Sunscreen – Quick Absorption, No Grease",
    "Coppertone Suncream – Strong Scent, Strong Protection",
    "La Roche-Posay Sensitive Skin Sunscreen – No Breakouts",
    "CeraVe Lightweight Sunscreen – Perfect Under Makeup",
    "Banana Boat Beach-Saver Sunscreen – No Burns, Just Fun",
    "Hawaiian Tropic Waterproof Sunscreen – Pool Day Essential",
    "Supergoop Premium Sunscreen – Worth Every Penny",
    "Neutrogena Face & Body Sunscreen – All-Purpose Protection",
    "EltaMD Pore-Friendly Sunscreen – Protection Without Clogging",
    "La Roche-Posay Suncream – The Best in Sun Protection",
    "Neutrogena Sunscreen – Soft, Protected Skin All Day",

    # Powder Detergents for Laundry
    "Tide Powder Detergent – Fresh, Clean Clothes Every Time",
    "Ariel Powder Detergent – Whites Brighter, Even in Cold Water",
    "Persil Powder Detergent – Powerful Stain Removal",
    "OMO Powder Detergent – Leaves Residue on Dark Clothes",
    "Seventh Generation Powder Detergent – Great for Sensitive Skin",
    "Ecover Eco-Friendly Detergent – Perfect for the Eco-Conscious",
    "Gain Powder Detergent – Fresh-Smelling Laundry for Days",
    "Arm & Hammer Powder Detergent – Great Value, Lasts Forever",
    "Dreft Baby Powder Detergent – Gentle on Baby Clothes"
]

In [None]:
product_results = {}

for model in models:
    model_results = {}
    
    if model == "ministral-8b-latest":
        for product in tqdm(product_offers, desc=f"Processing Reviews for {model}", leave=False):
            user_message = prompt_template.format(content=product)
            response = run_mistral(user_message, model=model)
            model_results[product] = response
            time.sleep(4)
    else:
        pipe = pipeline(
            "text-generation", 
            model=model, 
            max_new_tokens=1000, 
            model_kwargs={
                "torch_dtype": torch.float16,
                "quantization_config": {"load_in_4bit": True},
                "low_cpu_mem_usage": True,
            },
            trust_remote_code=True,
        )
        
        for product in tqdm(product_offers, desc=f"Processing Reviews for {model}", leave=False):
            user_message = prompt_template.format(content=product)
            response = pipe(user_message)
            try:
                model_results[product] = response[0]['generated_text']
            except Exception as e:
                print(f"Error processing review: {e}")
                model_results[product] = response
    
        del pipe
    
    product_results[model] = model_results

with open(f'product_offers/merged_results_product_offers.json', 'w') as f:
    json.dump(product_results, f, indent=4)

In [None]:
with open('product_offers/merged_results_product_offers.json', 'r') as f:
    product_results = json.load(f)

JSON extraction using RegEx

In [313]:
json_pattern = re.compile(r'\{\s*"product category":\s*".+?",\s*"brand":\s*".+?",\s*"other keywords":\s*\[.*?\]\s*?\}', re.DOTALL)

for model_name in product_results:
    results_per_model = product_results[model_name]

    for review in results_per_model:
        response = results_per_model[review]

        if isinstance(response, str):
            response = re.sub(r'("brand":\s*)(N/A)(,)', r'\1"N/A"\3', response)
            match = json_pattern.search(response)

            if match:
                response = match.group()
            else:
                product_results[model_name][review] = 'Nothing was extracted'    
                continue
        
            response = unicodedata.normalize("NFKD", response).encode("ascii", "ignore").decode("ascii")
            response = ast.literal_eval(response)
        product_results[model_name][review] = response       

### Similarity analysis between Reviews and Product offers

In [433]:
from bert_score import score
import random
random.seed(3)

In [339]:
def compare_json_objects_with_keywords_method_1(obj1, obj2, keys_to_compare={'other keywords', 'brand', 'product category'}):
    identical_fields = 0
    common_keys = set(obj1.keys()) & set(obj2.keys() & keys_to_compare)
    for key in common_keys:
        if key == 'other keywords':
            if isinstance(obj1.get(key), list) and isinstance(obj2.get(key), list):
                common_keywords = set(obj1.get(key)) & set(obj2.get(key))
                identical_fields += len(common_keywords) * 0.3  # Keywords have lower weight
            else:
              identical_fields += 0
        else:
          identical_fields += obj1[key] == obj2[key]
    percentage_identical = (identical_fields / max(len(keys_to_compare), 1)) * 100

    return percentage_identical

def compare_json_objects_with_keywords_method_2(obj1, obj2, keys_to_compare={'other keywords', 'brand', 'product category'}):
    if obj1.get('product category') != obj2.get('product category'):
        return 0  # No similarity if categories don't match

    similarity_score = 0
    common_keys = set(obj1.keys()) & set(obj2.keys() & keys_to_compare)

    if obj1.get('product category') == obj2.get('product category'):
        similarity_score += 0.3  # Base similarity for matching categories

    if obj1.get('brand') == obj2.get('brand'):
        similarity_score += 0.4  # Additional similarity for matching brands

    for key in common_keys:
        if key == 'other keywords':
            if isinstance(obj1.get(key), list) and isinstance(obj2.get(key), list):
                common_keywords = set(obj1.get(key)) & set(obj2.get(key))
                similarity_score += len(common_keywords) * 0.05  # Keywords have lower weight
            else:
                similarity_score += 0

    return min(100, round(similarity_score * 100))

def compare_keywords_with_bert_score(review_keywords, product_keywords):
    """Compares keywords using BERTScore."""

    max_length = max(len(review_keywords), len(product_keywords))
    review_keywords = review_keywords + [''] * (max_length - len(review_keywords))
    product_keywords = product_keywords + [''] * (max_length - len(product_keywords))

    if not review_keywords or not product_keywords:
        return 0
    P, R, F1 = score(
        review_keywords,
        product_keywords,
        lang="en",
        model_type="bert-base-uncased",
        verbose=False
    )
    return F1.mean().item()*100

def compare_product_review_similarity(review_data, product_data):
    """
    Calculates a similarity score between a review and a product based on
    categories, brands, and keywords, including a comparison of full product
    title with review information using BERTScore.
    """
    similarity_score = 0

    # Category Matching (Highest weight)
    if review_data.get('product category') == product_data.get('product category'):
        similarity_score += 0.5

    # Brand Matching (Medium weight)
    if review_data.get('brand') == product_data.get('brand'):
        similarity_score += 0.3

    # String Comparison (BERTScore) between Product Title and Review Data
    review_info_string = " ".join(
        [
            review_data.get("product category"),
            review_data.get("brand"),
            " ".join(review_data.get("other keywords")),
        ]
    )

    product_info_string = " ".join(
        [
            product_data.get("product category"),
            product_data.get("brand"),
            " ".join(product_data.get("other keywords")),
        ]
    )

    P, R, F1 = score(
        [product_info_string],
        [review_info_string],
        lang="en",
        model_type="bert-base-uncased",
        verbose=False
    )

    similarity_score += F1.mean().item() * 0.2

    return round(min(1, similarity_score) * 100)

We chose only 3 random review and products for further comparison.

In [340]:
random_reviews = random.sample(reviews, 3)
random_products = random.sample(product_offers, 3)

In [435]:
similarity_results = {}

for model in models:
    similarity_results[model] = {}

    for review in random_reviews:
        review_json = results[model][review]
        review_keywords = review_json.get('other keywords', [])

        for product in random_products:
            product_json = product_results[model][product]
            product_keywords = product_json.get('other keywords', [])
            
            # Calculate similarity rates for the chosen model, review, and product
            similarity_rate_m_1 = round(compare_json_objects_with_keywords_method_1(review_json, product_json), 2)
            similarity_rate_m_2 = round(compare_json_objects_with_keywords_method_2(review_json, product_json), 2)
            bert_score = round(compare_keywords_with_bert_score(review_keywords, product_keywords), 2)
            custom_score = round(compare_product_review_similarity(review_json, product_json), 2)

            # Store results in nested dictionary structure
            similarity_results[model][(review, product)] = {
                "similarity_1": similarity_rate_m_1,
                "similarity_2": similarity_rate_m_2,
                "bert_score": bert_score,
                "custom_score": custom_score,
                "review_extracted": review_json,
                "product_extracted": product_json,
            }

similarity_results



{'Qwen/Qwen2.5-7B-Instruct': {('Gets rid of even the most stubborn baked-on food. Highly recommend Finish Quantum.',
   'Finish Detergent – Tough on Grease, No Residue Left'): {'similarity_1': 0.0,
   'similarity_2': 0,
   'bert_score': 29.47,
   'custom_score': 12,
   'review_extracted': {'product category': 'Dishwasher Detergents',
    'brand': 'Finish',
    'other keywords': ['stubborn baked-on food',
     'highly recommend',
     'Finish Quantum']},
   'product_extracted': {'product category': 'Powder Detergents for Laundry',
    'brand': 'N/A',
    'other keywords': ['tough on grease', 'no residue left']}},
  ('Gets rid of even the most stubborn baked-on food. Highly recommend Finish Quantum.',
   'Tide Powder Detergent – Fresh, Clean Clothes Every Time'): {'similarity_1': 0.0,
   'similarity_2': 0,
   'bert_score': 38.62,
   'custom_score': 10,
   'review_extracted': {'product category': 'Dishwasher Detergents',
    'brand': 'Finish',
    'other keywords': ['stubborn baked-on foo

In [444]:
import pandas as pd


rows = []

for model, reviews in similarity_results.items():
    for (review, product), scores in reviews.items():
        row = {
            'review': review,
            'product': product,
            'review_extracted': scores['review_extracted'],
            'product_extracted': scores['product_extracted'],
            'model': model,
            'similarity_1': scores['similarity_1'],
            'similarity_2': scores['similarity_2'],
            'bert_score': scores['bert_score'],
            'custom_score': scores['custom_score']
        }
        rows.append(row)

df = pd.DataFrame(rows)
df.sort_values(['review', 'product'])

Unnamed: 0,review,product,review_extracted,product_extracted,model,similarity_1,similarity_2,bert_score,custom_score
5,Gets my clothes fresh and clean every time. No...,"Arm & Hammer Powder Detergent – Great Value, L...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,Qwen/Qwen2.5-7B-Instruct,33.33,30,34.68,64
14,Gets my clothes fresh and clean every time. No...,"Arm & Hammer Powder Detergent – Great Value, L...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,meta-llama/Llama-3.1-8B-Instruct,33.33,30,59.12,64
23,Gets my clothes fresh and clean every time. No...,"Arm & Hammer Powder Detergent – Great Value, L...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,google/gemma-2-9b-it,33.33,30,34.68,65
32,Gets my clothes fresh and clean every time. No...,"Arm & Hammer Powder Detergent – Great Value, L...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,ministral-8b-latest,33.33,30,34.68,64
3,Gets my clothes fresh and clean every time. No...,"Finish Detergent – Tough on Grease, No Residue...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,Qwen/Qwen2.5-7B-Instruct,33.33,30,31.69,65
12,Gets my clothes fresh and clean every time. No...,"Finish Detergent – Tough on Grease, No Residue...",{'product category': 'Powder Detergents for La...,"{'product category': 'Dishwasher Detergents', ...",meta-llama/Llama-3.1-8B-Instruct,0.0,0,55.4,13
21,Gets my clothes fresh and clean every time. No...,"Finish Detergent – Tough on Grease, No Residue...",{'product category': 'Powder Detergents for La...,"{'product category': 'Dishwasher Detergents', ...",google/gemma-2-9b-it,0.0,0,31.69,13
30,Gets my clothes fresh and clean every time. No...,"Finish Detergent – Tough on Grease, No Residue...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,ministral-8b-latest,33.33,30,31.69,66
4,Gets my clothes fresh and clean every time. No...,"Tide Powder Detergent – Fresh, Clean Clothes E...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,Qwen/Qwen2.5-7B-Instruct,33.33,30,63.82,64
13,Gets my clothes fresh and clean every time. No...,"Tide Powder Detergent – Fresh, Clean Clothes E...",{'product category': 'Powder Detergents for La...,{'product category': 'Powder Detergents for La...,meta-llama/Llama-3.1-8B-Instruct,76.67,75,44.43,96


From the results we can see that `Llama3.1-7b` exhibits in brand and keywords extraction, hence having higher score results than other models for the same pairs (*review-product*). The rest of the models behave in a similar manner.