# Home Depot Data Cleaning Automation Tool

## Overview

This notebook automates the process of cleaning and transforming data for Home Depot product listings. The objective is to ensure the data meets required quality standards before analysis and reporting.

Key tasks include:
- Handling missing values
- Renaming columns
- Flattening nested JSON data
- Subsetting the data
- Excel styling and formatting

The following changes were applied to improve efficiency:
- Remove unused imports
- Combine concatenations
- Vectorize string operations
- Optimize Excel styling
- Iterate over columns
- Use a dictionary for column widths

## 1. Import Libraries and Load Data

In [19]:
# Import the required libraries
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Border, Side, Alignment, Font

In [20]:
# Read JSON file
df = pd.read_json("../data/raw/snap_m82yajn2uy51wplt3.json")

## 2. Data Cleaning and Preprocessing

In [21]:
# Fill missing prices with a default value.
# Assigning the result back avoids the chained in-place call, which pandas 3.0 will no longer support.
default_price = '{"value":0,"currency":"USD","symbol":"$"}'
df['list_price'] = df['list_price'].fillna(default_price)
df['promo_price'] = df['promo_price'].fillna(default_price)

In [22]:
# Rename columns
df = df.rename(columns={
    "name": "product_name",
    "list_price": "regular_price",
    "promo_price": "discounted_price",
})

In [23]:
# Normalize Nested JSON data
regular_price = pd.json_normalize(df['regular_price'])
link = pd.json_normalize(df['url'])
discounted_price = pd.json_normalize(df['discounted_price'])

In [24]:
df_new = pd.concat([df, regular_price, link, discounted_price], axis=1)
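Both price columns share the keys `value`, `currency`, and `symbol`, so flattening them side by side can yield repeated column names. The sketch below shows one way to keep them apart by prefixing each flattened column with its source name, and to substitute a default dict for missing entries. The sample rows, `DEFAULT_PRICE`, and the `flatten_price` helper are hypothetical and only illustrate the idea; it also assumes the frame has a default integer index so the pieces line up in `pd.concat`.

```python
import pandas as pd

# Hypothetical sample shaped like the price columns in this notebook
sample = pd.DataFrame({
    "product_name": ["Drill", "Hammer"],
    "regular_price": [
        {"value": 99.0, "currency": "USD", "symbol": "$"},
        {"value": 15.5, "currency": "USD", "symbol": "$"},
    ],
    "discounted_price": [
        {"value": 79.0, "currency": "USD", "symbol": "$"},
        None,  # missing promo price
    ],
})

# Assumed default for missing prices, stored as a dict rather than a JSON string
DEFAULT_PRICE = {"value": 0, "currency": "USD", "symbol": "$"}


def flatten_price(frame, col):
    """Flatten one dict-valued column into '<col>_<key>' columns."""
    records = [p if isinstance(p, dict) else DEFAULT_PRICE for p in frame[col]]
    return pd.json_normalize(records).add_prefix(f"{col}_")


flat = pd.concat(
    [sample[["product_name"]]]
    + [flatten_price(sample, c) for c in ("regular_price", "discounted_price")],
    axis=1,
)
print(flat.columns.tolist())
```

With prefixed names, a later column selection can pick `regular_price_value` or `discounted_price_value` without ambiguity.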

In [25]:
# Create a new column 'omsid' by extracting the SKU (the last path segment of the URL)
df_new['omsid'] = df_new['url'].str.split('/').str[-1]
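Taking the last element of the split returns an empty string when a URL ends with a slash. A minimal sketch of a more defensive variant, using made-up URLs (the real listing URLs may look different), strips trailing slashes first:

```python
import pandas as pd

# Hypothetical URLs; the second one has a trailing slash
urls = pd.Series([
    "https://www.example.com/p/cordless-drill/100000001",
    "https://www.example.com/p/claw-hammer/100000002/",
])

# Drop trailing slashes, then keep the last path segment
omsid = urls.str.rstrip("/").str.split("/").str[-1]
print(omsid.tolist())
```

All three steps stay vectorized through the `.str` accessor, so no Python-level loop over rows is needed.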

## 3. Select Relevant Columns and Export as Excel

In [26]:
# Define a list of columns to retain in the final DataFrame
clist = [
    'omsid', 'product_name', 'breadcrumbs', 'category', 'availability',
    'regular_price', 'discounted_price', 'average_ratings', 'combinations',
    'description', 'image_counter', 'is_video', 'number_of_reviews',
    'reviews', 'seller_info', 'special_buy_tag', 'url',
]

# Select only the relevant columns for further analysis
df_new = df_new[clist]

# Write the final DataFrame to an Excel file (the default index is written to column A)
df_new.to_excel('../data/processed/homedepot_retail_data_3132025.xlsx')

## 4. Load Excel Workbook and Apply Styling

In [27]:
# Load the exported Excel workbook
wb = load_workbook(filename='../data/processed/homedepot_retail_data_3132025.xlsx')

# Select the active worksheet in the workbook
ws = wb.active

# Apply an auto-filter
ws.auto_filter.ref = ws.dimensions

In [28]:
# Define the font style for header cells
font = Font(size=15, bold=True, italic=False, vertAlign=None, underline='none', strike=False, color='FF000000')

# Define a cell alignment that enables text wrapping
wrap = Alignment(wrapText=True, horizontal='left')

# Define left alignment for cells
left_alignment = Alignment(horizontal='left')

# Define a fill pattern for header cells
fill = PatternFill("solid", fgColor="00CCFFCC")

# Define thin black borders for all four sides of a cell
thin = Side(border_style='thin', color="FF000000")
border = Border(top=thin, bottom=thin, left=thin, right=thin)

In [29]:
# Get the index of the last populated row in the worksheet
last_row = ws.max_row

# Set a standard height for every data row (the header row keeps its default height)
for i in range(2, last_row + 1):
    ws.row_dimensions[i].height = 15

# Apply left alignment and thin border to all cells in the worksheet
for row in ws.iter_rows(min_row=1, max_row=last_row):
    for cell in row:
        cell.alignment = left_alignment
        cell.border = border

# Format the header row 
for cell in ws["1:1"]:
    cell.font = font
    cell.fill = fill

## 5. Apply Additional Alignment for Specific Columns

In [30]:
# Enable text wrapping (keeping left alignment) for columns D, E, G, and J
for col in ['J', 'D', 'E', 'G']:
    for cell in ws[col]:
        cell.alignment = wrap

## 6. Freeze Panes and Set Column Widths

In [31]:
# Freeze panes at B2 to keep the header row and the first column visible when scrolling
ws.freeze_panes = ws["B2"]

# Set custom column widths
col_widths = {
    "B": 15, "C": 20, "D": 40, "E": 40,
    "F": 20, "G": 40, "H": 30, "I": 30,
    "J": 100, "K": 30, "L": 20, "M": 20,
    "N": 20, "O": 40, "P": 100
}

# Apply column widths
for col, width in col_widths.items():
    ws.column_dimensions[col].width = width
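Column letters shift whenever a column is added, removed, or the index column is dropped from the export. As a hedged alternative, the widths can be keyed by header text and resolved to letters at run time with `openpyxl.utils.get_column_letter`. The in-memory workbook, the names `wb_demo`, `ws_demo`, and `widths_by_header`, and the width values below are illustrative only:

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

# Hypothetical in-memory sheet standing in for the exported workbook
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append(["omsid", "product_name", "description"])
ws_demo.append(["100000001", "Cordless drill", "A long description"])

# Widths keyed by header text instead of column letter (illustrative values)
widths_by_header = {"omsid": 15, "product_name": 20, "description": 100}

# Walk the header row and apply a width wherever the header is listed
for cell in ws_demo[1]:
    width = widths_by_header.get(cell.value)
    if width is not None:
        letter = get_column_letter(cell.column)
        ws_demo.column_dimensions[letter].width = width
```

Headers that are not listed keep their default width, so the mapping only needs to name the columns that matter.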

## 7. Save the Styled Workbook

In [32]:
wb.save("../data/processed/homedepot_retail_data_3132025_styled.xlsx")
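The processed and styled file names are typed out separately in this notebook. One way to keep them in sync, sketched here with a hypothetical `styled_path` helper, is to derive the styled name from the processed path using `pathlib`:

```python
from pathlib import Path


def styled_path(processed):
    """Return the same path with '_styled' inserted before the file extension."""
    p = Path(processed)
    return p.with_name(f"{p.stem}_styled{p.suffix}")


target = styled_path("../data/processed/homedepot_retail_data_3132025.xlsx")
print(target)
```

The result could then be passed to `wb.save(target)`, so renaming the export only requires changing one string.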