Data Curation Part 1
====================

Begin scrubbing and curating the dataset by focusing on a subset of the data: Home Appliances.

Dataset used:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

The folder containing all the product metadata categories is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories

# Dependencies

In [None]:
# imports

import os
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt

# Setup

In [None]:
# environment

load_dotenv(override=True)
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
%matplotlib inline

## HuggingFace Token

**IMPORTANT:** The token requires read and write permissions.

Add `HF_TOKEN` to the notebook secrets, paste in the token value, and toggle access on for this notebook.

In [None]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
# Pass the token explicitly; call login() with no arguments for an interactive prompt instead
login(token=hf_token, add_to_git_credential=True)

In [None]:
# One more import after logging in
import sys
sys.path.append('../testing/')
from items import Item

# Investigate Chosen Dataset to Verify Suitability

## Step 1. Load in dataset

Number of Appliances: **94,327**

In [None]:
# Load in our dataset

dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)

In [None]:
print(f"Number of Appliances: {len(dataset):,}")

## Step 2. Investigate a particular datapoint

> Clothes Dryer Drum Slide, General Electric, Hotpoint, WE1M333, WE1M504
>
> ['Brand new dryer drum slide, replaces General Electric, Hotpoint, RCA, WE1M333, WE1M504.']
> []
> {"Manufacturer": "RPI", "Part Number": "WE1M333,", "Item Weight": "0.352 ounces", "Package Dimensions": "5.5 x 4.7 x 0.4 inches", "Item model number": "WE1M333,", "Is Discontinued By Manufacturer": "No", "Item Package Quantity": "1", "Batteries Included?": "No", "Batteries Required?": "No", "Best Sellers Rank": {"Tools & Home Improvement": 1315213, "Parts & Accessories": 181194}, "Date First Available": "February 25, 2014"}
>
> None

In [None]:
# Investigate a particular datapoint
datapoint = dataset[2]

In [None]:
# Investigate

print(datapoint["title"])
print(datapoint["description"])
print(datapoint["features"])
print(datapoint["details"])
print(datapoint["price"])

## Step 3. How many have prices? Are there enough for training, validation, and testing?

There are 46,726 with prices, which is 49.5%

In [None]:
# How many have prices?

prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError:
        # Skip entries whose price is not a valid number
        pass

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")
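The counting loop above can also be written around a small helper that returns `None` for unusable prices. This is only a sketch and not part of the original notebook: `parse_price` and the sample rows are hypothetical names and values used for illustration.

```python
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Return the price as a positive float, or None if it is missing or invalid."""
    try:
        price = float(raw)
    except (TypeError, ValueError):
        # Covers None as well as strings that are not numbers
        return None
    return price if price > 0 else None


# Hypothetical rows standing in for entries of the dataset
sample = [{"price": "16.52"}, {"price": "None"}, {"price": None}, {"price": "0"}]
priced = [p for p in (parse_price(row["price"]) for row in sample) if p is not None]
print(f"There are {len(priced):,} with prices, which is {len(priced) / len(sample) * 100:.1f}%")
```

Catching `TypeError` in addition to `ValueError` is a defensive choice for the case where a price is a real `None` rather than a string.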

## Step 4. For those with prices, gather the price and the length

In [None]:
# For those with prices, gather the price and the length

prices = []
lengths = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices.append(price)
            contents = datapoint["title"] + str(datapoint["description"]) + str(datapoint["features"]) + str(datapoint["details"])
            lengths.append(len(contents))
    except ValueError:
        # Skip entries whose price is not a valid number
        pass
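The length measured above is the combined character count of the title, description, features, and details. A minimal sketch of the same calculation as a reusable function follows; `content_length` and the example record are hypothetical and do not appear in the original notebook.

```python
def content_length(datapoint: dict) -> int:
    """Total characters across the text fields used in the loop above."""
    fields = ("title", "description", "features", "details")
    # str() mirrors the loop above: lists are measured by their printed form
    return sum(len(str(datapoint.get(name, ""))) for name in fields)


# Hypothetical record with the same field names as the dataset
example = {"title": "abc", "description": ["de"], "features": [], "details": "{}"}
print(content_length(example))
```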

## Step 5. Plot the distribution of lengths

<img src="./../images/Product-Pricer-Curation-Part-1-Length-Distribution.jpg" alt="Distribution of Content Length" />

In [None]:
# Plot the distribution of lengths

plt.figure(figsize=(15, 6))
plt.title(f"Lengths: Avg {sum(lengths)/len(lengths):,.0f} and highest {max(lengths):,}\n")
plt.xlabel('Length (chars)')
plt.ylabel('Count')
plt.hist(lengths, rwidth=0.7, color="lightblue", bins=range(0, 6000, 100))
plt.show()

## Step 6. Plot the distribution of prices

<img src="./../images/Product-Pricer-Curation-Part-1-Price-Distribution.jpg" alt="Distribution of Prices" />

In [None]:
# Plot the distribution of prices

plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.2f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="orange", bins=range(0, 1000, 10))
plt.show()
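The plot title reports only the average and the highest price, and a single extreme value can pull the average far from a typical price. As a supplement, the median can be reported alongside them. This is a sketch using the standard `statistics` module; `summarize` and `sample_prices` are hypothetical names, and the sample values are illustrative rather than taken from the full dataset.

```python
import statistics


def summarize(values):
    """Return count, mean, median, and maximum for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Illustrative prices, including one extreme value
sample_prices = [7.99, 16.52, 24.00, 118.00, 21095.62]
summary = summarize(sample_prices)
print(f"Avg {summary['mean']:,.2f}, median {summary['median']:,.2f}, highest {summary['max']:,}")
```

In the notebook, the `prices` list built in Step 4 could be passed to such a function in place of `sample_prices`.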

## Step 7. Identify highly priced outliers

> TurboChef BULLET Rapid Cook Electric Microwave Convection Oven
>
> 21095.62

\$21k for a microwave?! Let's exclude it from the training dataset.

The closest match found on Amazon, which is no longer available:
https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/

In [None]:
# So what is this item??

for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 21000:
            print(datapoint['title'])
            print(datapoint["description"])
            print(datapoint["features"])
            print(datapoint["details"])
            print(datapoint["price"])
    except ValueError:
        # Skip entries whose price is not a valid number
        pass
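The loop above relies on a hard-coded threshold of 21,000. An alternative is to list the most expensive items directly. A minimal sketch, assuming the priced items are available as (title, price) pairs; `top_priced` and the sample pairs are hypothetical and only illustrate the idea.

```python
import heapq


def top_priced(pairs, n=3):
    """Return the n (title, price) pairs with the highest prices, highest first."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Hypothetical (title, price) pairs standing in for priced datapoints
priced_items = [
    ("Door Pivot Block", 16.52),
    ("Ice Maker Assembly", 118.00),
    ("Rapid Cook Oven", 21095.62),
    ("Dryer Drum Slide", 7.99),
]

for title, price in top_priced(priced_items, n=2):
    print(f"{price:>12,.2f}  {title}")
```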

# Curate Dataset

Chosen approach:

- Select items that cost between 1 and 999 USD
- Create `Item` instances, which truncate the text to fit within 180 tokens using the appropriate tokenizer
- Create a prompt to be used during training
- Reject items that don't have enough characters
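The `Item` class is imported from `items.py` and is not shown in this notebook. The sketch below only illustrates the steps listed above and should not be read as the actual implementation. `SketchItem`, the `MIN_CHARS` value, rounding the price to a whole dollar, and the whitespace split used in place of a real tokenizer are all assumptions; only the 1 to 999 USD range, the 180-token limit, and the prompt wording come from this notebook.

```python
from dataclasses import dataclass, field

MIN_PRICE, MAX_PRICE = 1, 999  # price range from the list above (USD)
MAX_TOKENS = 180               # token budget from the list above
MIN_CHARS = 300                # assumed threshold; the real value lives in items.py
QUESTION = "How much does this cost to the nearest dollar?"
PREFIX = "Price is $"


@dataclass
class SketchItem:
    title: str
    text: str
    price: float
    include: bool = field(init=False, default=False)
    prompt: str = field(init=False, default="")
    token_count: int = field(init=False, default=0)

    def __post_init__(self):
        # Reject items outside the price range or with too little text
        if not (MIN_PRICE <= self.price <= MAX_PRICE):
            return
        if len(self.text) < MIN_CHARS:
            return
        # Stand-in for a real tokenizer: split on whitespace, then truncate
        tokens = self.text.split()[:MAX_TOKENS]
        self.token_count = len(tokens)
        body = " ".join(tokens)
        self.prompt = f"{QUESTION}\n\n{self.title}\n\n{body}\n\n{PREFIX}{round(self.price)}.00"
        self.include = True

    def test_prompt(self) -> str:
        # Same prompt with the answer removed, so the model has to supply it
        head, sep, _ = self.prompt.rpartition(PREFIX)
        return head + sep


good = SketchItem("Ice Maker", "replacement part " * 30, 118.0)
expensive = SketchItem("Oven", "x " * 200, 21095.62)
short = SketchItem("Slide", "Brand new dryer drum slide", 7.99)
print(good.include, expensive.include, short.include)
```

A real implementation would count tokens with the tokenizer of the model being trained, so the numbers from this whitespace-based stand-in are not comparable to actual token counts.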

## Step 1. Create an Item object for each with a price

There are 29,191 items

In [None]:
# Create an Item object for each with a price

items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError:
        # Skip entries whose price is not a valid number
        pass

print(f"There are {len(items):,} items")

## Step 2. Look at a sample item

> <WP67003405 67003405 Door Pivot Block - Compatible Kenmore KitchenAid Maytag Whirlpool Refrigerator - Replaces AP6010352 8208254 PS11743531 - Quick DIY Repair Solution = $16.52>

In [None]:
# Look at a sample item (index 1 is the second item in the list)

items[1]

## Step 3. Investigate the prompt that will be used during training - the model learns to complete this

> How much does this cost to the nearest dollar?
>
> Samsung Assembly Ice Maker-Mech
>
> This is an O.E.M. Authorized part, fits with various Samsung brand models, oem part # this product in manufactured in south Korea. This is an O.E.M. Authorized part Fits with various Samsung brand models Oem part # This is a Samsung replacement part Part Number This is an O.E.M. part Manufacturer J&J International Inc., Part Weight 1 pounds, Dimensions 18 x 12 x 6 inches, model number Is Discontinued No, Color White, Material Acrylonitrile Butadiene Styrene, Quantity 1, Certification Certified frustration-free, Included Components Refrigerator-replacement-parts, Rank Tools & Home Improvement Parts & Accessories 31211, Available April 21, 2011
>
> Price is $118.00

In [None]:
# Investigate the prompt that will be used during training - the model learns to complete this

print(items[100].prompt)

## Step 4. Investigate the prompt that will be used during testing - the model has to complete this

> How much does this cost to the nearest dollar?
>
> Samsung Assembly Ice Maker-Mech
>
> This is an O.E.M. Authorized part, fits with various Samsung brand models, oem part # this product in manufactured in south Korea. This is an O.E.M. Authorized part Fits with various Samsung brand models Oem part # This is a Samsung replacement part Part Number This is an O.E.M. part Manufacturer J&J International Inc., Part Weight 1 pounds, Dimensions 18 x 12 x 6 inches, model number Is Discontinued No, Color White, Material Acrylonitrile Butadiene Styrene, Quantity 1, Certification Certified frustration-free, Included Components Refrigerator-replacement-parts, Rank Tools & Home Improvement Parts & Accessories 31211, Available April 21, 2011
>
> Price is $

In [None]:
# Investigate the prompt that will be used during testing - the model has to complete this

print(items[100].test_prompt())

## Step 5. Plot the distribution of token counts

<img src="./../images/Product-Pricer-Curation-Part-1-Token-Distribution.jpg" alt="Distribution of Tokens" />

In [None]:
# Plot the distribution of token counts

tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="green", bins=range(0, 300, 10))
plt.show()

## Step 6. Plot the distribution of prices

<img src="./../images/Product-Pricer-Curation-Part-1-Price-Distribution-Curated.jpg" alt="Distribution of Prices After Being Curated to Between 1 and 999 USD" />

In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 300, 10))
plt.show()