# Home Depot Data Cleaning Automation Tool

## Overview

This notebook automates the process of cleaning and transforming data for Home Depot product listings. The objective is to ensure the data meets required quality standards before analysis and reporting.

Key tasks include:
- Handling missing values
- Renaming columns
- Flattening nested JSON data
- Subsetting the data
- Excel styling and formatting

The following changes have been applied to improve efficiency:
- Remove unused imports
- Combine concatenations
- Vectorize string operations
- Optimize Excel styling
- Iterate over columns
- Use a dictionary for column widths

Most recent updates (03-17-25):
- Extracted the 'value' key from the regular_retail_price and discounted_retail_price columns and renamed it to regular_price and promo_price
- Extracted the 'value' key from the combinations column and renamed it to combinations_value
- Normalized the seller_info column to extract both name and url

## 1. Import Libraries and Load Data

In [127]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Border, Side, Alignment, Font

In [128]:
# Read JSON file
df = pd.read_json("/Users/merveogretmek/Desktop/AL/March/13th/Data Cleaning/data/raw/snap_m82yajn2uy51wplt3.json")
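
The absolute path in this cell only resolves on one machine. A minimal sketch of a more portable load step follows, assuming a hypothetical project-relative layout in which raw snapshots live under `data/raw`; the names `DATA_DIR`, `RAW_FILE`, and `load_raw` are illustrative and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: raw snapshots sit under ./data/raw next to the notebook.
DATA_DIR = Path("data") / "raw"
RAW_FILE = DATA_DIR / "snap_m82yajn2uy51wplt3.json"


def load_raw(path: Path) -> pd.DataFrame:
    """Read the raw JSON snapshot, failing early with a clear message."""
    if not path.is_file():
        raise FileNotFoundError(f"Raw snapshot not found: {path}")
    return pd.read_json(path)


# Only load when the file is present, so the sketch runs on any machine.
if RAW_FILE.is_file():
    df = load_raw(RAW_FILE)
```

Keeping the path in one constant means the later export and styling cells could derive their own locations from the same base directory.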

## 2. Data Cleaning and Preprocessing

In [129]:
# Filling missing prices with a default value
df['list_price'] = df['list_price'].fillna('{"currency":"USD","symbol":"$", "value":0}')
df['promo_price'] = df['promo_price'].fillna('{"currency":"USD","symbol":"$", "value":0}')
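
pandas discourages `df[col].fillna(value, inplace=True)` because the call acts on an intermediate object rather than on the DataFrame itself. One alternative is to pass a column-to-value mapping to `DataFrame.fillna` and assign the result back. The sketch below uses a toy frame (its contents are assumptions made for illustration) with the same placeholder string as the notebook.

```python
import pandas as pd

# Same placeholder string the notebook uses for missing prices.
PLACEHOLDER = '{"currency":"USD","symbol":"$", "value":0}'

# Toy frame standing in for the raw data (illustrative values only).
toy = pd.DataFrame(
    {
        "list_price": [{"value": 729.91, "currency": "USD", "symbol": "$"}, None],
        "promo_price": [None, None],
    }
)

# One call on the frame itself: no intermediate Series, no chained assignment.
toy = toy.fillna({"list_price": PLACEHOLDER, "promo_price": PLACEHOLDER})
```

Both columns are filled in a single pass, and the result is assigned back explicitly instead of relying on `inplace=True`.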


In [130]:
# Rename columns 
df = df.rename(columns={"name":"product_name","list_price":"regular_retail_price","promo_price":"discounted_retail_price"})

In [131]:
# Fill missing regular prices with a default JSON string
df['regular_retail_price'] = df['regular_retail_price'].fillna('{"value":0,"currency":"USD","symbol":"$"}')

# Fill missing discounted prices with a default JSON string
df['discounted_retail_price'] = df['discounted_retail_price'].fillna('{"value":0,"currency":"USD","symbol":"$"}')

# Normalize nested JSON data
regular_price_value = pd.json_normalize(df['regular_retail_price'])[['value']].rename(columns={'value': 'regular_price'})
discounted_price_value = pd.json_normalize(df['discounted_retail_price'])[['value']].rename(columns={'value': 'promo_price'})
link = pd.json_normalize(df['url'])
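
`pd.json_normalize` flattens dictionaries and does not parse JSON text, so a string placeholder would not be expected to contribute a `value` of 0. If the intent is for missing prices to become 0, one option is to substitute a default dictionary before flattening. The sketch below assumes that intent; `DEFAULT_PRICE` and `extract_price_value` are hypothetical names, and the toy Series is illustrative. Note that this changes the result: missing prices would appear as 0 rather than as empty values.

```python
import pandas as pd

# Hypothetical default, written as a dict rather than a JSON string.
DEFAULT_PRICE = {"value": 0, "currency": "USD", "symbol": "$"}


def extract_price_value(series: pd.Series, new_name: str) -> pd.DataFrame:
    """Flatten a column of price dicts and keep only the 'value' field."""
    # Anything that is not a dict (NaN, None, strings) gets the default dict.
    records = [v if isinstance(v, dict) else dict(DEFAULT_PRICE) for v in series]
    flat = pd.json_normalize(records)
    flat.index = series.index  # keep row alignment for a later pd.concat(axis=1)
    return flat[["value"]].rename(columns={"value": new_name})


# Toy column: one real price dict and one missing entry.
toy = pd.Series(
    [{"value": 729.91, "currency": "USD", "symbol": "$"}, None],
    name="regular_retail_price",
)
regular_price_value = extract_price_value(toy, "regular_price")
```

The same helper could be applied to the discounted price column with `new_name="promo_price"`.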


In [132]:
# Normalize the combinations column and extract value
combinations_value = pd.json_normalize(df['combinations'])[['value']].rename(columns = {'value': 'combinations_value'})

In [133]:
# Normalize seller_info column
seller_info_df = pd.json_normalize(df['seller_info'])[['name', 'url']]

# Rename columns
seller_info_df = seller_info_df.rename(columns = {'name': 'seller_name', 'url': 'seller_url'})
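
Selecting `[['name', 'url']]` raises a `KeyError` if no record carries one of those keys. A minimal sketch of a more tolerant variant follows, using toy records (assumed for illustration) and `reindex` so that both columns always exist.

```python
import pandas as pd

# Toy seller records; the second one has no 'url' key.
records = [
    {"name": "Armen Living", "url": "https://www.example.com/b/Armen-Living"},
    {"name": "Another Seller"},
]

seller_info_df = (
    pd.json_normalize(records)
    .reindex(columns=["name", "url"])  # guarantees both columns are present
    .rename(columns={"name": "seller_name", "url": "seller_url"})
)
```

Missing keys end up as NaN in the corresponding column instead of stopping the notebook.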

In [134]:
# Merge the new columns with main data
df_new = pd.concat([df, regular_price_value, discounted_price_value, link, combinations_value, seller_info_df], axis=1)

# Create a new column 'omsid' from the last path segment of the product URL
df_new['omsid'] = df_new['url'].str.split('/').str[-1]
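
The extraction keeps whatever follows the last `/`, so a URL ending in a slash would produce an empty string. A small sketch on hypothetical URLs (the hosts and product slugs are made up for illustration) shows a vectorized variant that strips a trailing slash first.

```python
import pandas as pd

# Hypothetical URLs in the same general shape as the product links.
urls = pd.Series(
    [
        "https://www.example.com/p/Example-Bar-Stool/320959275",
        "https://www.example.com/p/Example-Recliner/315389745/",
    ]
)

# Vectorized: drop a trailing slash, then keep the text after the last '/'.
omsid = urls.str.rstrip("/").str.rsplit("/", n=1).str[-1]
```

The values stay strings, which preserves any leading zeros an identifier might have.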

## 3. Select Relevant Columns and Export as Excel

In [135]:
# Define a list of columns to retain in the final DataFrame
clist = ['omsid', 'product_name', 'breadcrumbs', 'category', 'availability', 'regular_price', 'promo_price', 'average_ratings',
         'combinations_value', 'description', 'image_counter', 'is_video', 'number_of_reviews', 'reviews', 'seller_name', 'seller_url', 'special_buy_tag', 'url']

# Select only the relevant columns for further analysis
df_new = df_new[clist]

# Write the final DataFrame to an Excel file
df_new.to_excel('/Users/merveogretmek/Desktop/AL/March/17th/Improving Notebooks/data/processed/homedepot_retail_data_3132025.xlsx')
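
Subsetting with a fixed list fails with a fairly terse error when the source schema changes. A minimal sketch of an explicit check before the selection follows; `select_columns` is a hypothetical helper and the toy frame is illustrative.

```python
import pandas as pd


def select_columns(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return frame[columns], naming every missing column up front."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame[columns]


# Toy frame with one extra column that the subset drops.
toy = pd.DataFrame({"omsid": ["1"], "product_name": ["Chair"], "extra": [0]})
subset = select_columns(toy, ["omsid", "product_name"])
```

In the notebook this would be called as `select_columns(df_new, clist)` before the export.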

In [136]:
df_new

| | omsid | product_name | breadcrumbs | category | availability | regular_price | promo_price | average_ratings | combinations_value | description | image_counter | is_video | number_of_reviews | reviews | seller_name | seller_url | special_buy_tag | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 320959275 | Armen Living Menorca Bar Height Aluminum Outdo... | Outdoors, Patio Furniture, Outdoor Bar Furnitu... | Outdoor Bar Stools | true | 729.91 | | | Dark Gray, Dark Grey, Dark Grey, Gray | Embrace a sophisticated outdoor entertaining e... | 11 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Menor... |
| 1 | 315389745 | Armen Living Wesley Chestnut Leather Power Rec... | Furniture, Living Room Furniture, Chairs, Recl... | Recliners | false | | | | | The beautiful Wesley Leather Power Reclining T... | 8 | No video found | 1.0 | [The leather on the chair was wonderful. The o... | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Wesle... |
| 2 | 302247024 | Armen Living Tudor 30 in. Kahlua Faux Leather ... | Furniture, Bar Furniture, Bar Stools | Bar Stools | false | | | | | Spice up your kitchen, dining area or even you... | 6 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Tudor... |
| 3 | 310535531 | Armen Living Primrose Dark Metal and Greige Ge... | Furniture, Living Room Furniture, Sofas & Couches | Sofas & Couches | false | | | | | The Armen Living Primrose Contemporary Top Gra... | 9 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Primr... |
| 4 | 321824865 | Armen Living Abbey 3-Piece Silver Grey Oak Kin... | Furniture, Bedroom Furniture, Bedroom Sets | Bedroom Sets | false | | | | | Make a sleek and elegant statement in your bed... | 9 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Abbey... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2684 | 315244832 | Armen Living Brielle Armed Outdoor UV Protecte... | Outdoors, Patio Furniture, Patio Chairs, Outdo... | Outdoor Dining Chairs | true | 480.86 | | | CHARCOAL / DARK GREY, SHADES OF GREY | The Armen Living Brielle Eucalyptus and Rope O... | 10 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Briel... |
| 2685 | 310359927 | Armen Living Brisbane Contemporary 30 in. Matt... | Furniture, Bar Furniture, Bar Stools | Bar Stools | false | 127.42 | | 3.0 | | The Armen Living Brisbane contemporary barstoo... | 8 | No video found | 2.0 | [not comfortable, padding is not soft, soft, w... | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Brisb... |
| 2686 | 314827789 | Armen Living Ulric Walnut Wood and Charcoal Fa... | Furniture, Kitchen & Dining Room Furniture, Di... | Dining Chairs | true | 144.80 | | | | The Ulrich is an amazing dining room chair add... | 8 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Ulric... |
| 2687 | 324206251 | Armen Living Cuffay Brown 4-Piece Metal Patio ... | Outdoors, Patio Furniture, Outdoor Lounge Furn... | Patio Conversation Sets | true | 3268.63 | | | Brown, Dark Grey | Mixed materials are all the rage, and the Cuff... | 10 | No video found | | [] | Armen Living | https://www.homedepot.com/b/Armen-Living/N-5yc... | False | https://www.homedepot.com/p/Armen-Living-Cuffa... |


## 4. Load Excel Workbook and Apply Styling

In [137]:
# Load the exported Excel workbook 
wb = load_workbook(filename='/Users/merveogretmek/Desktop/AL/March/17th/Improving Notebooks/data/processed/homedepot_retail_data_3132025.xlsx')

# Select the active worksheet in the workbook
ws = wb.active

# Apply an auto-filter
ws.auto_filter.ref = ws.dimensions
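
`ws.dimensions` is the bounding range of the populated cells, so assigning it to `auto_filter.ref` puts filter buttons on every header. A small in-memory sketch (the workbook contents are made up for illustration) shows the resulting range without touching any file.

```python
from openpyxl import Workbook

# In-memory workbook with a header row and two data rows (illustrative values).
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append(["omsid", "product_name", "regular_price"])
ws_demo.append(["320959275", "Example stool", 729.91])
ws_demo.append(["310359927", "Example barstool", 127.42])

# The filter covers the full populated range, header included.
ws_demo.auto_filter.ref = ws_demo.dimensions
```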

In [138]:
# Define the font style for header cells
font = Font(size=15, bold=True, italic=False, vertAlign=None, underline='none', strike=False, color='FF000000')

# Define a cell alignment that enables text wrapping
wrap = Alignment(wrapText=True, horizontal='left')

# Define left alignment for cells
left_alignment = Alignment(horizontal='left')

# Define a fill pattern for header cells
fill = PatternFill("solid", fgColor="00CCFFCC")

# Define thin borders for cells (one Side object reused for all four edges)
thin = Side(border_style='thin', color="FF000000")
border = Border(top=thin, bottom=thin, left=thin, right=thin)

In [139]:
# Get the total number of rows in the worksheet
last_row = ws.max_row

# Set a standard row height for all data rows (row 1, the header, is left unchanged)
for i in range(2,last_row+1):
    ws.row_dimensions[i].height = 15

# Apply left alignment and thin border to all cells in the worksheet
for row in ws.iter_rows(min_row=1, max_row=ws.max_row):
    for cell in row:
        cell.alignment = left_alignment
        cell.border = border

# Format the header row 
for cell in ws["1:1"]:
    cell.font = font
    cell.fill = fill

## 5. Apply Additional Alignment for Specific Columns

In [140]:
# Enable text wrapping for columns J, D, E, and G
for col in ['J', 'D', 'E', 'G']:
    for cell in ws[col]:
        cell.alignment = wrap
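
Hard-coded column letters shift whenever a column is added or the index column is dropped from the export. A hedged alternative is to pick the columns by header text; in the sketch below the set `WRAP_HEADERS` is a hypothetical choice and the sheet contents are illustrative.

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment

# In-memory sheet with a header row and one data row (illustrative values).
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append(["omsid", "description", "category"])
ws_demo.append(["320959275", "A longer product description that benefits from wrapping", "Bar Stools"])

# Hypothetical choice of columns to wrap, identified by header text.
WRAP_HEADERS = {"description", "category"}
wrap = Alignment(wrap_text=True, horizontal="left")

for header_cell in ws_demo[1]:
    if header_cell.value in WRAP_HEADERS:
        # ws[letter] yields every cell in that column, header included.
        for cell in ws_demo[header_cell.column_letter]:
            cell.alignment = wrap
```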

## 6. Freeze Panes and Set Column Widths

In [141]:
# Freeze the header row and the first column so both stay visible when scrolling
ws.freeze_panes = ws["B2"]
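
Column widths can be driven from a dictionary keyed by column letter, as the overview suggests. The sketch below is one way to do that on an in-memory sheet; the widths in `COLUMN_WIDTHS` are placeholder values chosen for illustration, not taken from the notebook.

```python
from openpyxl import Workbook

# In-memory sheet with a header row only (illustrative).
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append(["index", "omsid", "product_name"])

# Hypothetical widths, in Excel character units, keyed by column letter.
COLUMN_WIDTHS = {"A": 8, "B": 14, "C": 50}

for letter, width in COLUMN_WIDTHS.items():
    ws_demo.column_dimensions[letter].width = width

# A cell reference string works as well as a cell object here.
ws_demo.freeze_panes = "B2"
```

Keeping the widths in one mapping makes it easy to adjust a single column without touching the loop.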

## 7. Save the Styled Workbook

In [142]:
wb.save("/Users/merveogretmek/Desktop/AL/March/17th/Improving Notebooks/data/processed/homedepot_retail_data_3132025_styled.xlsx")