# Houzz Data Cleaning Automation Tool

## Overview

This notebook automates the process of cleaning and transforming data for Houzz product listings. The objective is to ensure the data meets required quality standards before analysis and reporting.

Key tasks include:
- Loading CSV data
- Renaming columns and selecting key columns
- Handling missing values and cleaning string data
- Parsing JSON-like fields
- Merging parsed data back into the main DataFrame
- Exporting the final dataset to Excel and applying styling

The following changes were applied to improve efficiency:
- Remove unused imports
- Consolidate and vectorize DataFrame operations
    - Vectorized string replacement instead of loops
    - List comprehensions to parse the price columns
    - List comprehensions to build DataFrames from the specifications
- Optimize handling of the combinations field
    - Apply the combiner function directly
    - Use the built-in `explode` method instead of loops
- Streamline Excel styling
    - Combine alignment and borders in one loop
    - Use a loop for column-specific operations
    - Use dictionaries for column widths and number formats

## 1. Import Libraries and Load Data

In [5]:
# Import necessary libraries
import pandas as pd
import ast
from itertools import repeat
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Border, Side, Alignment, Font
from openpyxl.styles import numbers

In [6]:
# Load the CSV file
df = pd.read_csv("../data/raw/snap_m82yajnh2ozqyqbwdf.csv")

## 2. Data Cleaning and Preprocessing

In [7]:
# Rename columns
df = df.rename(columns={'SKU_number':'SKU',
                        'name':'product_name',
                        'regular_price':'regular_retail_price',
                        'promo_price':'discounted_retail_price',
                        'avrega_rating':'average_rating',
                        'combnations':'combinations'})
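
Because `rename` ignores mapping keys that match no column by default, a change in the raw column names would pass unnoticed. The sketch below, on a toy frame with made-up values, shows how `errors="raise"` could surface that; the names `raw`, `mapping`, and `missing_detected` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the raw export (values are made up)
raw = pd.DataFrame({"SKU_number": ["A1"], "avrega_rating": [4.5]})

mapping = {"SKU_number": "SKU", "avrega_rating": "average_rating"}

# errors="raise" makes rename fail if a key is not an existing column
renamed = raw.rename(columns=mapping, errors="raise")

# A key that does not exist in the frame now raises KeyError
try:
    raw.rename(columns={"average_rating": "rating"}, errors="raise")
    missing_detected = False
except KeyError:
    missing_detected = True
```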

In [8]:
# Define the list of columns to retain
column_list = ['SKU', 'category', 'breadcrumbs', 'product_name', 'availability', 'sale_tag',
               'regular_retail_price', 'discounted_retail_price', 'image_count',
               'product_description', 'product_specifications', 'number_of_reviews',
               'average_rating', 'is_video', 'combinations', 'shipping_and_returns', 'reviews', 'url']

# Keep only the specified columns
df = df[column_list]

In [9]:
# Fill missing SKU values with the string 'None'
df['SKU'] = df['SKU'].fillna('None')

# Fill missing regular prices with a default JSON string
df['regular_retail_price'] = df['regular_retail_price'].fillna('{"value":0,"currency":"USD","symbol":"$"}')

# Fill missing discounted prices with a default JSON string
df['discounted_retail_price'] = df['discounted_retail_price'].fillna('{"value":0,"currency":"USD","symbol":"$"}')

# Fill missing product specifications with a default string
df['product_specifications'] = df['product_specifications'].fillna('Model # (MPN) : None/Product ID : None/Manufactured By : None/Sold By : Houzz/Size/Weight : None/Materials : None/Assembly Required : None/Category : None/Style : None/')

# Fill missing combinations with a default value
df['combinations'] = df['combinations'].fillna("{'option_name': 'Variants not found','price': '$0.00'}")
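
The per-column defaults can also be passed to `DataFrame.fillna` as one dict, which avoids calling a method on a column selection. This is a minimal sketch on a toy frame; the frame contents and the name `defaults` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one missing value per column (values are made up)
toy = pd.DataFrame({"SKU": ["A1", None], "combinations": [None, "[]"]})

# One default per column, applied in a single call
defaults = {
    "SKU": "None",
    "combinations": "{'option_name': 'Variants not found','price': '$0.00'}",
}
toy = toy.fillna(defaults)
```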

In [10]:
# Clean the 'combinations' field by replacing a known problematic string format
df['combinations'] = df['combinations'].str.replace(
    "[{'option_name': 'Variants not found', 'price': '$0.00'}]",
    "{'option_name': 'Variants not found','price': '$0.00'}",
    regex=False
)
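
The search string contains `[`, `{`, and `$`, which are regex metacharacters, so `regex=False` is what makes the replacement literal. The sketch below shows this on a one-element Series and, for comparison, the equivalent call with an escaped pattern; the variable names are illustrative.

```python
import re
import pandas as pd

pattern = "[{'option_name': 'Variants not found', 'price': '$0.00'}]"
replacement = "{'option_name': 'Variants not found','price': '$0.00'}"
s = pd.Series([pattern])

# Literal replacement: brackets, braces and '$' are matched as plain characters
literal = s.str.replace(pattern, replacement, regex=False)

# Equivalent regex form: the pattern has to be escaped first
escaped = s.str.replace(re.escape(pattern), replacement, regex=True)
```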

## 3. Parse Product Specifications

In [11]:
# Function to clean and convert the product_specifications string into a dictionary
def dictionary_maker(record: str) -> dict:
    # Protect slashes that belong to a key or value so they survive the split on '/'
    list_specifications = (
        record
        .replace(" / ", "x ")  # Replace spaced slashes within values
        .replace('Size/Weight', 'Size_Weight')
        .replace("Faux Leather/Leatherette", "Faux_Leather")
        .replace("Fabric/Linen", "Fabric_Linen")
        .replace("Metal Turnplate/Kickplate", "Metal_Turnplate_Kickplate")
        .replace("MDF/Solid Wood", "MDF_Solid_Wood")
        .replace("Fabric/Walnut", "Fabric_Walnut")
        .replace("Coated/Black", "Coated_Black")
        .split('/')  # Split the string into "key : value" entries
    )

    specs = {}
    for entry in list_specifications:
        # Skip fragments that carry no key-value separator
        if ':' not in entry:
            continue
        # Split on the first ':' only, so values may themselves contain colons
        key, value = entry.split(':', 1)
        # Strip padding so keys such as 'Product ID ' match the column names selected later
        specs[key.strip()] = value.strip()
    return specs
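
To make the expected input and output shapes concrete, here is a compact sketch of the same split-on-`/`, then split-on-first-`:` idea. The function `parse_specs` and the sample string are hypothetical, only one of the slash substitutions is included, and the whitespace stripping is an assumption about the desired output.

```python
def parse_specs(record: str) -> dict:
    """Hypothetical compact variant: split on '/', then on the first ':'."""
    # Protect the one slash that belongs to a key in this sample
    record = record.replace("Size/Weight", "Size_Weight")
    pairs = (part.split(":", 1) for part in record.split("/") if ":" in part)
    # Assumption: surrounding whitespace is not wanted in keys or values
    return {key.strip(): value.strip() for key, value in pairs}


# Made-up sample in the "key : value/key : value/" layout
sample = "Product ID : 123/Size/Weight : W 20 in/Materials : Wood/"
parsed = parse_specs(sample)
```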


In [12]:
# Convert the product specifications column to dictionaries
df['product_specifications'] = df['product_specifications'].map(dictionary_maker, na_action='ignore')

## 4. Parse Price Data for Regular and Discounted Prices

In [13]:
# Extract URL column
url = df[['url']]

# Extract SKU and product_specifications columns
product_s = df[['SKU','product_specifications']]

# Extract SKU and regular price columns
reg_price = df[['SKU','regular_retail_price']]

# Extract SKU and discounted price columns
promo_price = df[['SKU','discounted_retail_price']]

In [14]:
# Parse the price strings into DataFrames
reg_dfs = [
    pd.DataFrame(ast.literal_eval(row['regular_retail_price']), index=[row['SKU']])
    for _, row in reg_price.iterrows()
]
reg_total = pd.concat(reg_dfs).reset_index().rename(columns={"value": "regular-retail-price"})

promo_dfs = [
    pd.DataFrame(ast.literal_eval(row['discounted_retail_price']), index=[row['SKU']])
    for _, row in promo_price.iterrows()
]
promo_total = pd.concat(promo_dfs).reset_index().rename(columns={"value": "discounted-retail-price"})
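
An alternative that avoids building one DataFrame per row is to parse every string first and construct a single frame from the resulting list of dicts. This is a sketch on toy data (the SKUs and price values are made up), not the notebook's exact method.

```python
import ast
import pandas as pd

# Toy stand-in for reg_price (values are made up)
reg_price = pd.DataFrame({
    "SKU": ["A1", "B2"],
    "regular_retail_price": [
        '{"value":38400,"currency":"USD","symbol":"$"}',
        '{"value":0,"currency":"USD","symbol":"$"}',
    ],
})

# Parse each string once, then build one DataFrame indexed by SKU
parsed = [ast.literal_eval(s) for s in reg_price["regular_retail_price"]]
reg_total = (
    pd.DataFrame(parsed, index=reg_price["SKU"].to_numpy())
    .reset_index()
    .rename(columns={"value": "regular-retail-price"})
)
```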

In [15]:
# Function to insert a decimal point before the last two digits of a price string
def decimalAdder(x: str) -> str:
    # Only strings longer than four characters are changed
    if len(x) > 4:
        x = x[:-2] + '.' + x[-2:]
    return x
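
The length check means the helper leaves short strings untouched, which is worth seeing on a few inputs. The sketch below re-implements the same rule under the hypothetical name `add_decimal`; the inputs are made up, and whether a four-character value should stay unchanged depends on the source data.

```python
def add_decimal(x: str) -> str:
    # Same rule as the helper above: only strings longer than 4 characters change
    if len(x) > 4:
        x = x[:-2] + "." + x[-2:]
    return x


# Made-up inputs: five digits, four digits, and the default value 0
examples = {raw: add_decimal(raw) for raw in ["38400", "2140", "0"]}
```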

In [16]:
# Process regular and discounted prices: convert to string, apply decimalAdder, then convert to float

# Regular prices
reg_total['regular-retail-price'] = (
    reg_total['regular-retail-price'].map(str).map(decimalAdder).astype(float)
)

# Discounted prices
promo_total['discounted-retail-price'] = (
    promo_total['discounted-retail-price'].map(str).map(decimalAdder).astype(float)
)

## 5. Process Product Specification Data

In [17]:
# Build a DataFrame from the product_specifications for each SKU 
spec_dfs = [
    pd.DataFrame([specs], index=[sku])
    for sku, specs in zip(product_s['SKU'], product_s['product_specifications'])
]
specifications = pd.concat(spec_dfs)

In [19]:
# Define the list of columns to retain from the specifications
column_list =['Product ID','Model # (MPN)','Manufactured By','Sold By','Size_Weight','Materials',
              'Assembly Required','Category','Style','Color','Collection','Commercial-grade']#,'Size']

# Select the defined columns
specifications_new = specifications[column_list]

# Reset index of the specifications DataFrame
specifications_new  = specifications_new.reset_index()
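
Selecting with a list of labels raises `KeyError` if any listed specification never occurs in the data. A tolerant alternative, sketched here on a toy frame with made-up values, is `reindex(columns=...)`, which adds absent columns filled with NaN; the name `wanted` is illustrative.

```python
import pandas as pd

# Toy specifications frame indexed by SKU (values are made up)
specifications = pd.DataFrame(
    [{"Product ID": "123", "Style": "Modern"}, {"Product ID": "456"}],
    index=["A1", "B2"],
)
wanted = ["Product ID", "Style", "Color"]  # 'Color' is absent in this toy frame

# reindex adds absent columns as NaN instead of raising KeyError
specifications_new = specifications.reindex(columns=wanted).reset_index()
```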

In [20]:
# Concatenate the original DataFrame with specifications, promo, and regular price
df_new = pd.concat([df,specifications_new,promo_total,reg_total],axis=1)
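
`pd.concat(..., axis=1)` aligns rows on index labels, not on position, so this step relies on all four frames sharing the same 0..n-1 index. The sketch below uses two toy frames (made-up values) to show what happens when the indexes differ and how resetting them restores positional alignment.

```python
import pandas as pd

left = pd.DataFrame({"SKU": ["A1", "B2"]})                    # index 0, 1
right = pd.DataFrame({"price": [10.0, 20.0]}, index=[1, 2])   # index 1, 2

# Mismatched index labels produce extra rows padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes both frames line up by position
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
```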

## 6. Process Combinations Field

In [22]:
# Extract columns related to combinations processing
combinations = df_new[['SKU','combinations','url']]

In [23]:
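# Function to convert the combinations field from a string to a list of dictionaries
def combiner(record: str) -> list:
    try:
        check_point = ast.literal_eval(record)
        # A single dictionary is wrapped in a list
        if isinstance(check_point, dict):
            return [check_point]
        # A list is re-parsed after removing the space between a comma and a quote
        if isinstance(check_point, list):
            return ast.literal_eval(record.replace(", '", ",'"))
        # Any other literal yields no options
        return []
    except (ValueError, SyntaxError, TypeError):
        # Strings that are not valid Python literals yield no options
        return []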
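
The three cases the function has to handle (a single dict, a list of dicts, and an unparseable string) can be illustrated with a simplified variant. `to_option_list` is a hypothetical name, the inputs are made up apart from the default value, and the sketch leaves out the `", '"` replacement step.

```python
import ast


def to_option_list(record: str) -> list:
    """Hypothetical variant: always return a list of option dicts."""
    try:
        parsed = ast.literal_eval(record)
    except (ValueError, SyntaxError):
        return []
    if isinstance(parsed, dict):
        return [parsed]
    return parsed if isinstance(parsed, list) else []


single = to_option_list("{'option_name': 'Variants not found','price': '$0.00'}")
many = to_option_list("[{'option_name': 'Gray', 'price': '$384.00'}]")
broken = to_option_list("[{'option_name': ")
```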

In [24]:
# Apply the combiner function; assign() returns a new DataFrame, so no slice of df_new is modified
combinations = combinations.assign(
    combinations=combinations['combinations'].map(combiner, na_action='ignore')
)


In [26]:
# Explode the combinations list
combinations_exploded = combinations.explode('combinations')

# Normalize the dictionary stored in combinations
comb_expanded = pd.json_normalize(combinations_exploded['combinations'])
combinations_exploded = combinations_exploded.drop(['combinations', 'SKU'], axis=1).reset_index(drop=True)
combination_new = pd.concat([combinations_exploded, comb_expanded], axis=1)
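# Preview the rows that have an option name (the result is not assigned, so combination_new is unchanged)
combination_new.dropna(subset=['option_name'])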

| | url | option_name | price |
|---|---|---|---|
| 0 | https://www.houzz.com/products/agi-mid-century... | Gray | $384.00 |
| 1 | https://www.houzz.com/products/agi-mid-century... | Orange | $384.00 |
| 5 | https://www.houzz.com/products/aberdeen-30-mid... | Brown, Counter Stool | $214.00 |
| 6 | https://www.houzz.com/products/aberdeen-30-mid... | Brown, Barstool | $217.00 |
| 7 | https://www.houzz.com/products/aberdeen-30-mid... | Gray, Barstool | $217.00 |
| ... | ... | ... | ... |
| 2134 | https://www.houzz.com/products/arden-back-bars... | Gray Walnut and Black Faux Leather, 26" Counte... | $430.00 |
| 2135 | https://www.houzz.com/products/arden-back-bars... | Gray Walnut and Black Faux Leather, 30" Bar He... | $436.00 |
| 2137 | https://www.houzz.com/products/sandringham-26-... | , Gray | $246.00 |
| 2138 | https://www.houzz.com/products/sandringham-26-... | 30", Brushed stainless steel and Gray | $248.00 |
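
The explode-then-normalize pattern used in this cell can be seen end to end on a toy frame. In the sketch below the URLs `u1` and `u2` are placeholders and the option values are illustrative; it shows one row per option with the dictionary keys turned into columns.

```python
import pandas as pd

# Toy frame: each row holds a list of option dicts (placeholder URLs)
combos = pd.DataFrame({
    "url": ["u1", "u2"],
    "combinations": [
        [{"option_name": "Gray", "price": "$384.00"},
         {"option_name": "Orange", "price": "$384.00"}],
        [{"option_name": "Brown", "price": "$214.00"}],
    ],
})

# One row per list element, with a fresh 0..n-1 index
exploded = combos.explode("combinations").reset_index(drop=True)

# Turn each dict into columns, then attach them to the matching url rows
options = pd.json_normalize(exploded["combinations"].tolist())
flat = pd.concat([exploded.drop(columns="combinations"), options], axis=1)
```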


In [27]:
# Merge the combinations data with the main DataFrame on 'url'
df_new_2 = pd.merge(df_new,combination_new[['url','option_name','price']],on='url', how='left')

# Rename columns
df_new_2 = df_new_2.rename(columns={
    'regular_retail_price': 'drop1',
    'discounted_retail_price': 'drop2',
    'price': 'option_price',
    'url': 'URL'
})

# Drop extra index column if present
if 'index' in df_new_2.columns:
    df_new_2 = df_new_2.drop('index', axis=1)

# Drop unwanted columns if present
for col in ['currency', 'symbol']:
    if col in df_new_2.columns:
        df_new_2 = df_new_2.drop(col, axis=1)

# Drop temporary columns
df_new_2 = df_new_2.drop(['drop1', 'drop2'], axis=1)   
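
The "drop if present" checks can also be expressed with the `errors="ignore"` option of `DataFrame.drop`, which skips labels that do not exist. This is a minimal sketch on a toy frame with made-up values, offered as one possible simplification.

```python
import pandas as pd

# Toy frame with two of the three helper columns present (values are made up)
toy = pd.DataFrame({"SKU": ["A1"], "index": [0], "currency": ["USD"]})

# errors="ignore" skips labels that are not present ('symbol' here)
cleaned = toy.drop(columns=["index", "currency", "symbol"], errors="ignore")
```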

## 7. Finalize DataFrame and Export to Excel

In [30]:
# Define the final column order
final_column_list = ['Product ID','SKU', 'Model # (MPN)','category', 'breadcrumbs', 'product_name', 'regular-retail-price','discounted-retail-price', 'availability',
       'option_name', 'option_price','sale_tag', 'image_count', 'product_description',
       'product_specifications', 'number_of_reviews', 'average_rating','reviews',
       'is_video', 'combinations', 'shipping_and_returns', 'URL',
       'Manufactured By', 'Sold By',
       'Size_Weight', 'Materials', 'Assembly Required', 'Category', 'Style',
       'Color']

# Select the final columns for the output
df_final = df_new_2[final_column_list]

# Export the final DataFrame to an Excel file
df_final.to_excel('../data/processed/houzz_retail_data_3132025.xlsx')

## 8. Load Excel Workbook and Apply Styling

In [31]:
# Load the exported Excel workbook
wb = load_workbook(filename='../data/processed/houzz_retail_data_3132025.xlsx')

# Select the active worksheet
ws = wb.active

# Apply an auto-filter 
ws.auto_filter.ref = ws.dimensions

In [32]:
# Define a header font style
font = Font(size=15, bold=True, italic=False, vertAlign=None, underline='none', strike=False, color='FF000000')

# Define text wrapping alignment
wrap = Alignment(wrapText=True,horizontal='left')

# Define left alignment for cells
left_alignment = Alignment(horizontal='left')

# Define a fill pattern for header cells
fill = PatternFill("solid", fgColor="00CCFFCC")

# Define thin black borders for all four sides of a cell
thin = Side(border_style='thin', color="FF000000")
border = Border(top=thin, bottom=thin, left=thin, right=thin)

In [33]:
# Get the total number of rows in the worksheet
last_row = ws.max_row

# Set a standard row height for all rows
for i in range(2,last_row+1):
    ws.row_dimensions[i].height = 15

# Apply number formatting to specific columns 
for col in ["B", "C", "D", "Q", "R"]:
    for cell in ws[col]:
        cell.number_format = numbers.FORMAT_NUMBER

# Apply left alignment and thin borders to all cells
for row in ws.iter_rows(min_row=1, max_row=last_row):
    for cell in row:
        cell.alignment = left_alignment
        cell.border = border

# Format the header row
for cell in ws["1:1"]:
    cell.font = font
    cell.fill = fill

## 9. Apply Additional Alignments for Specific Columns

In [34]:
# Enable text wrapping for specific columns
for col in ['G', 'F', 'O', 'V']:
    for cell in ws[col]:
        cell.alignment = wrap

## 10. Freeze Panes and Set Column Widths

In [35]:
# Freeze panes at B2 to keep the header row and the first column visible when scrolling
ws.freeze_panes = ws["B2"]

# Define column widths
col_widths = {
    "B": 20, "C": 20, "D": 20, "E": 20, "F": 60, "G": 60,
    "H": 30, "I": 30, "J": 20, "K": 30, "L": 20, "M": 20,
    "N": 20, "O": 100, "P": 60, "Q": 20, "R": 20, "S": 20,
    "T": 30, "U": 40, "V": 100, "W": 20, "X": 20, "Y": 30,
    "Z": 30, "AA": 30, "AB": 30, "AC":30, "AD": 30
    }

# Apply column widths
for col, width in col_widths.items():
    ws.column_dimensions[col].width = width
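
Hard-coded column letters shift whenever the column order changes. One way to reduce that coupling, sketched here under the assumption that column A holds the DataFrame index written by `to_excel`, is to derive letters from header names with `openpyxl.utils.get_column_letter`; the `headers` list and the `widths_by_name` dict are illustrative.

```python
from openpyxl.utils import get_column_letter

# Hypothetical header order: column A holds the index written by to_excel
headers = ["", "Product ID", "SKU", "Model # (MPN)"]

# Map header names to Excel letters (positions are 1-based)
letters = {name: get_column_letter(pos) for pos, name in enumerate(headers, start=1)}

# Widths keyed by column name, translated to letters for ws.column_dimensions
widths_by_name = {"Product ID": 20, "SKU": 20}
widths_by_letter = {letters[name]: width for name, width in widths_by_name.items()}
```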

## 11. Save the Styled Workbook

In [36]:
wb.save("../data/processed/houzz_retail_data_3132025_styled.xlsx")