# Walmart Data Cleaning Automation Tool

## Overview

This notebook automates the process of cleaning and transforming data for Walmart product listings. The objective is to ensure the data meets required quality standards before analysis and reporting. 

Key tasks include:

- Renaming and reformatting columns
- Extracting unique item identifiers from URLs
- Cleaning and converting price data from JSON-like strings to numeric values
- Parsing seller information 
- Merging additional product IDs from an external Excel file
- Exporting the final dataset to Excel and applying styling

The following changes were applied to improve efficiency:
- Remove unused imports
- Consolidate repeated lines for removing HTML tags
- Use string operations
- Rename variables for clarity
- Modify price_converter to handle a wider range of formats
- Optimize Excel styling

## 1. Import Libraries and Load Data

In [36]:
# Import the required libraries
import ast

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side, numbers

In [37]:
# Read CSV file
df = pd.read_csv("../data/raw/snap_m82yajne15u4d4bgnm.csv")

In [38]:
df

Unnamed: 0,name,regular_price,promo_price,image_count,description,category,availability,discount,number_reviews,average_ratings,seller_info,breadcrumbs,is_video,buy_box_winner,url
0,"Armen Living Indoor Celine 26"" Counter Height ...","{""currency"":""USD"",""symbol"":""$"",""value"":219.1}",$183.29,12,Timeless style is at your fingertips with the ...,Shop All Bar Stools & Counter Stools,IN_STOCK,"{""currency"":""USD"",""symbol"":""$"",""value"":35.81}",4,4.500,"[{""price"":""$183.29"",""seller_name"":""Walmart.com""}]",Home Furniture Kitchen & Dining Furniture Bar ...,False,Sold and shipped by Walmart.com,https://www.walmart.com/ip/Celine-26-Counter-H...
1,Portals Outdoor 4 Piece Sofa Set in Matte Sand...,"{""currency"":""USD"",""symbol"":""$"",""value"":2133.68}","$2,133.68",4,The Portals Outdoor 4 Piece set with 2 Modular...,Shop All Outdoor Teak Furniture,OUT_OF_STOCK,,0,,"[{""price"":""$2,133.68"",""seller_name"":""Merkatoos...",Patio & Garden Patio Furniture Shop Patio Furn...,False,Merkatoos LLC,https://www.walmart.com/ip/Portals-Outdoor-4-P...
2,"Charlotte 26"" Swivel Counter Stool in Brushed ...","{""currency"":""USD"",""symbol"":""$"",""value"":298.27}",$298.27,8,Embrace the refined style of the Charlotte swi...,Shop All Bar Stools & Counter Stools,IN_STOCK,,0,,"[{""price"":""$298.27"",""seller_name"":""Walmart.com...",Home Furniture Kitchen & Dining Furniture Bar ...,False,Sold and shipped by Walmart.com,https://www.walmart.com/ip/Charlotte-26-Swivel...
3,Armen Living Portals Outdoor C-Shape Side Tabl...,"{""currency"":""USD"",""symbol"":""$"",""value"":146.68}",$146.68,5,The Armen Living Portals Outdoor C-Shape Side ...,Rectangular End Tables,OUT_OF_STOCK,,0,,"[{""price"":""$146.68"",""seller_name"":""UnbeatableS...",Home Furniture Living Room Furniture End Table...,False,"UnbeatableSale.com, Inc.",https://www.walmart.com/ip/Portals-Outdoor-C-S...
4,Elegance Loveseat in Blush Velvet with Acrylic...,"{""currency"":""USD"",""symbol"":""$"",""value"":609.6}",$609.60,5,The Armen Living Elegance contemporary lovesea...,Velvet Loveseats,OUT_OF_STOCK,,0,,"[{""price"":""$609.60"",""seller_name"":""UnbeatableS...",Home Furniture Living Room Furniture Loveseats...,False,"UnbeatableSale.com, Inc.",https://www.walmart.com/ip/Elegance-Loveseat-i...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3604,"Armen Living Corbin 30"" Modern Faux Leather Sw...","{""currency"":""USD"",""symbol"":""$"",""value"":215.61}",$215.61,8,The Armen Living Corbin Gray wood barstool wit...,Shop All Bar Stools & Counter Stools,IN_STOCK,,0,,"[{""price"":""$215.61"",""seller_name"":""Walmart.com...",Home Furniture Kitchen & Dining Furniture Bar ...,False,Walmart.com,https://www.walmart.com/ip/Corbin-30-Bar-Heigh...
3605,"Armen Living Indoor Fargo 30"" Counter Height M...","{""currency"":""USD"",""symbol"":""$"",""value"":272.17}",$272.17,9,Modernize your home with the Flynn swivel bar ...,Shop All Bar Stools & Counter Stools,IN_STOCK,,0,,"[{""price"":""$272.17"",""seller_name"":""Cymax""},{""p...",Home Furniture Kitchen & Dining Furniture Bar ...,False,Sold and shipped by Homesquare,https://www.walmart.com/ip/Armen-Living-Armen-...
3606,Armen Living Artemio Queen Platform Wood Bed F...,"{""currency"":""USD"",""symbol"":""$"",""value"":893.95}",$893.95,6,Invite a dose of style and comfort to your bed...,Queen Beds,IN_STOCK,,0,,"[{""price"":""$893.95"",""seller_name"":"" ""},{""price...",Home Furniture Bedroom Furniture Beds Queen Beds,False,Sold and shipped by VirVentures,https://www.walmart.com/ip/Artemio-Queen-Platf...
3607,Armen Living Rainbow Micro Fiber Storage Ottom...,,,3,The 530 Rainbow Storage Ottoman is a wood fram...,Ottomans,OUT_OF_STOCK,,40,4.325,[{}],Home Furniture Living Room Furniture Ottomans,False,,https://www.walmart.com/ip/Micro-Fiber-Storage...


In [39]:
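# Rename columns to standardize names
# Note: 'category' already exists in the raw data, so renaming 'breadcrumbs'
# yields two columns labeled 'category'; both are dropped in section 5
df = df.rename(columns={'name': 'product_name',
                        'regular_price': 'regular_retail_price',
                        'promo_price': 'discounted_retail_price',
                        'seller_info': 'sellers',
                        'breadcrumbs': 'category',
                        'url': 'URL'})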

## 2. Data Cleaning and Preprocessing

In [40]:
# Create 'item_id' from the last path segment of the URL
item_ids = df['URL'].str.rsplit("/", n=1).str[-1]
df.insert(0, 'item_id', pd.to_numeric(item_ids))
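
The extraction above assumes every URL ends in a purely numeric ID, with no query string or trailing slash. A minimal sketch of a more defensive variant, using only the standard library, is shown below. The helper name `extract_item_id` and the sample URLs are hypothetical and not part of the original notebook.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_item_id(url) -> Optional[int]:
    """Return the trailing numeric path segment of a URL, or None if absent."""
    if not isinstance(url, str):
        return None  # missing URLs arrive as NaN (a float), not as str
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return int(last_segment) if last_segment.isdigit() else None


# Hypothetical URLs for illustration only
print(extract_item_id("https://www.example.com/ip/Some-Product/12345"))
print(extract_item_id("https://www.example.com/ip/Some-Product/12345?from=search"))
print(extract_item_id("https://www.example.com/ip/Some-Product/"))
```

If this were applied with `df['URL'].map(extract_item_id)`, rows without a valid ID would hold missing values instead of raising an error, which may or may not be the desired behavior for this dataset.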

In [41]:
# Clean columns
df['availability'] = df['availability'].str.replace("https://schema.org/", "", regex=False)
# Represent missing seller lists as an empty list string
df['sellers'] = df['sellers'].fillna("[]").astype(str)
df['discounted_retail_price'] = df['discounted_retail_price'].str.replace("$", "", regex=False)
# Fill missing product names by assigning the result back to the column
df['product_name'] = df['product_name'].fillna('None')


In [42]:
# Remove common HTML tags from 'description'
tags_to_remove = ["<p>", "</p>", "<strong>", "</strong>", 
                  "<br>", "</br>", "<li>", "</li>", "&nbsp"]
for tag in tags_to_remove:
    df['description'] = df['description'].str.replace(tag, "", regex=False)
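
The fixed tag list only removes the tags it names. As a rough alternative, a regular expression can strip any tag and the standard library can decode entities. The sketch below is illustrative rather than the notebook's method. `strip_html` is a hypothetical helper, and the pattern would also remove literal text that happens to sit between `<` and `>`.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # matches any <...> tag


def strip_html(text: str) -> str:
    """Remove HTML tags and decode entities such as &nbsp; and &amp;."""
    no_tags = TAG_PATTERN.sub("", text)
    decoded = html.unescape(no_tags)
    # html.unescape turns &nbsp; into a non-breaking space (U+00A0); normalize it
    return decoded.replace("\u00a0", " ").strip()


print(strip_html("<p><strong>Solid wood</strong> frame&nbsp;&amp; legs</p>"))
```

Unlike the loop above, this version replaces `&nbsp;` with a regular space instead of deleting it, so adjacent words stay separated.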

## 3. Merge Additional Product ID Data

In [43]:
# Merge external product IDs
product_id = pd.read_excel('../data/raw/walmart_product_id_4302024.xlsx')
df = pd.merge(df, product_id[['item_id', 'SKU']], on='item_id', how='left')
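
A left merge silently multiplies rows if `item_id` repeats in the lookup table, and it leaves `SKU` empty for unmatched listings. The sketch below shows how the `validate` and `indicator` parameters of `DataFrame.merge` can surface both cases. The two small frames are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped listings and the external ID file
listings = pd.DataFrame({"item_id": [101, 102, 103], "product_name": ["A", "B", "C"]})
product_ids = pd.DataFrame({"item_id": [101, 102], "SKU": ["SKU-1", "SKU-2"]})

merged = listings.merge(
    product_ids[["item_id", "SKU"]],
    on="item_id",
    how="left",
    validate="many_to_one",  # raises MergeError if item_id repeats in product_ids
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Collect the listings that found no match in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "item_id"].tolist()
print(f"{len(unmatched)} listing(s) without a SKU: {unmatched}")
merged = merged.drop(columns="_merge")
```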

## 4. Process Price Data

In [44]:
# Extract a table with 'regular_retail_price'
price_df = df[['regular_retail_price', 'item_id']].copy()
price_df['regular_retail_price'] = price_df['regular_retail_price'].fillna(
    '{"value":0,"currency":"USD","symbol":"$"}')

# Parse the string representations
parsed_price_list = []
for idx in price_df.index:
    raw_str = price_df.at[idx, 'regular_retail_price']
    try:
        parsed = ast.literal_eval(raw_str)  # Expecting a single dict with a 'value' key
        # Wrap the dict in a list so each item yields exactly one row
        temp_df = pd.DataFrame([parsed], index=[price_df.at[idx, 'item_id']])
        parsed_price_list.append(temp_df)
    except (ValueError, SyntaxError) as e:
        # literal_eval raises SyntaxError as well as ValueError on malformed input
        print(f"Error parsing price at index {idx}: {e}")

parsed_prices = pd.concat(parsed_price_list).reset_index().rename(columns={'index': 'item_id'})
df.drop(columns='regular_retail_price', inplace=True)
df = df.merge(parsed_prices[['item_id', 'value']], on='item_id', how='left')
df.rename(columns={'value': 'regular_retail_price'}, inplace=True)
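
Because each price cell holds one JSON object, the numeric value can also be pulled out per cell without building intermediate DataFrames. The following is a minimal sketch assuming the `{"currency": ..., "symbol": ..., "value": ...}` shape seen in the sample output. `parse_price_value` is a hypothetical helper, and it returns NaN for missing prices instead of the 0 used above.

```python
import json
import math


def parse_price_value(raw) -> float:
    """Return the numeric 'value' field of a JSON price string, or NaN."""
    if not isinstance(raw, str):
        return math.nan  # missing cells arrive as float NaN, not as str
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return math.nan
    if not isinstance(parsed, dict):
        return math.nan
    value = parsed.get("value")
    # Accept real numbers only; bool is a subclass of int, so exclude it
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return math.nan


print(parse_price_value('{"currency":"USD","symbol":"$","value":219.1}'))
print(parse_price_value(None))
```

Applied as `df['regular_retail_price'].map(parse_price_value)`, this would keep one row per listing and avoid the merge back onto `df`.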


## 5. Process Seller Data

In [45]:
# Extract a table with 'sellers'
sellers_df = df[['sellers','item_id']].copy()

In [46]:
# Convert a string representation into a list of dictionaries
def parse_sellers(seller_str: str) -> list:
    default = [{'seller_name': '', 'price': ''}]
    try:
        # Parse the whole string at once; splitting on ", " would break
        # seller names that contain a comma, such as "UnbeatableSale.com, Inc."
        parsed = ast.literal_eval(seller_str)
    except (ValueError, SyntaxError):
        return default
    # A single dict is wrapped in a list
    if isinstance(parsed, dict):
        return [parsed]
    # If empty or invalid, return a default
    if not isinstance(parsed, (list, tuple)) or len(parsed) == 0:
        return default
    return list(parsed)

In [47]:
# Apply the parse_sellers function
sellers_df['sellers'] = sellers_df['sellers'].apply(parse_sellers)

In [48]:
# For each row, create a DataFrame from the sellers list 
seller_frames = []
for idx in sellers_df.index:
    item_id_val = sellers_df.at[idx, 'item_id']
    row_sellers = sellers_df.at[idx, 'sellers']
    # Repeat the item_id for each seller entry
    item_ids = [item_id_val] * len(row_sellers)
    seller_df = pd.DataFrame(row_sellers, index=item_ids)
    seller_frames.append(seller_df)

all_sellers = pd.concat(seller_frames).reset_index().rename(columns={'index':'item_id'})
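
The row-by-row loop can also be expressed with `DataFrame.explode` and `pd.json_normalize`. The sketch below uses a small hypothetical frame (`sellers_demo`) in place of `sellers_df` and assumes every row already holds a non-empty list of dicts, as `parse_sellers` is meant to guarantee.

```python
import pandas as pd

# Hypothetical parsed data: each row holds a list of seller dicts
sellers_demo = pd.DataFrame({
    "item_id": [101, 102],
    "sellers": [
        [{"price": "$10.00", "seller_name": "Seller A"}],
        [{"price": "$20.50", "seller_name": "Seller B"},
         {"price": "$21.00", "seller_name": "Seller C"}],
    ],
})

# explode() emits one row per list element and repeats item_id alongside it
exploded = sellers_demo.explode("sellers", ignore_index=True)

# json_normalize() expands each dict into 'price' and 'seller_name' columns
seller_cols = pd.json_normalize(exploded["sellers"].tolist())
all_sellers_demo = pd.concat([exploded[["item_id"]], seller_cols], axis=1)
print(all_sellers_demo)
```

This builds two DataFrames in total rather than one per listing, which tends to matter once the input reaches thousands of rows.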

In [49]:
# Function to clean a price string by removing '$', commas, and a trailing '.00'
def price_converter(price_str: str) -> str:
    # Remove $ and commas
    clean_str = price_str.replace('$', '').replace(',', '')
    # If it ends with .00, remove only that suffix
    if clean_str.endswith('.00'):
        clean_str = clean_str[:-3]
    return clean_str

In [50]:
# Fill missing prices first so they become '' rather than the string 'nan'
all_sellers['price'] = all_sellers['price'].fillna('').astype(str)
all_sellers['price'] = all_sellers['price'].apply(price_converter)

df.drop(columns=['sellers','category'], inplace=True)
df = df.merge(all_sellers[['item_id','seller_name','price']], on='item_id', how='left')
df.rename(columns={'price':'seller_retail_price'}, inplace=True)
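
`price_converter` returns strings, so `seller_retail_price` is still text at this point. If numeric values are wanted, one possible approach is to strip the formatting characters and let `pd.to_numeric` coerce anything unparseable to NaN. This is a sketch on a hypothetical sample Series, not the notebook's own conversion.

```python
import pandas as pd

# Hypothetical price strings, including an empty value and a stray 'nan'
prices = pd.Series(["$183.29", "$2,133.68", "609.60", "", "nan"])

# Drop dollar signs, thousands separators, and whitespace in one pass
cleaned = prices.str.replace(r"[$,\s]", "", regex=True)

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```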

## 6. Select Relevant Columns and Export to Excel

In [51]:
# Define the desired column order for final DataFrame
final_cols = [
    'item_id', 'SKU','product_name', 'regular_retail_price','discounted_retail_price','seller_name','buy_box_winner', 'seller_retail_price','image_count',
    'description', 'availability', 'number_reviews','average_ratings','is_video', 'URL'
    ]

# Select the target columns in defined order and remove duplicates
df_final = df[final_cols].drop_duplicates()

# Write the final DataFrame to an Excel file
df_final.to_excel('../data/processed/walmart_retail_data_3132025.xlsx')

## 7. Load Excel Workbook and Apply Styling

In [52]:
# Load the exported Excel workbook
wb = load_workbook(filename='../data/processed/walmart_retail_data_3132025.xlsx')

# Select the active worksheet in the workbook
ws = wb.active

# Apply an auto-filter
ws.auto_filter.ref = ws.dimensions

In [53]:
# Define the font style for header cells
font = Font(size=15, bold=True, italic=False, vertAlign=None, underline='none', strike=False, color='FF000000')

# Define a cell alignment that enables text wrapping
wrap = Alignment(wrapText=True, horizontal='left')

# Define left alignment for cells
left_alignment = Alignment(horizontal='left')

# Define a fill pattern for header cells
fill = PatternFill("solid", fgColor="00CCFFCC")

# Define thin borders for cells (the same Side is used on all four edges)
thin = Side(border_style='thin', color="FF000000")
border = Border(top=thin, bottom=thin, left=thin, right=thin)

In [54]:
# Get the total number of rows in the worksheet
last_row = ws.max_row

# Set a standard row height for all rows
for i in range(2,last_row+1):
    ws.row_dimensions[i].height = 15

# Apply number formatting to specific columns
for col in ["B", "C"]:
    for cell in ws[col]:
        cell.number_format = numbers.FORMAT_NUMBER

# Apply left alignment and thin borders to every cell
for row in ws.iter_rows(min_row=1, max_row=last_row):
    for cell in row:
        cell.alignment = left_alignment
        cell.border = border

# Format the header row
for cell in ws["1:1"]:
    cell.font = font
    cell.fill = fill

## 8. Apply Additional Alignment for Specific Columns

In [55]:
# Enable text wrapping for specific columns
for col in ['D', 'E', 'F', 'L']:
    for cell in ws[col]:
        cell.alignment = wrap

## 9. Freeze Panes and Set Column Widths

In [56]:
# Freeze panes at B2 to keep the header row and the index column (A) visible when scrolling
ws.freeze_panes = ws["B2"]

# Define column widths
col_widths = {
    "B": 30, "C": 30, "D": 60, "E": 60, "F": 60,
    "G": 30, "H": 30, "I": 30, "J": 30, "K": 30,
    "L": 100, "M": 30, "N": 30, "O": 30, "P": 30
}

# Apply column widths
for col, width in col_widths.items():
    ws.column_dimensions[col].width = width
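
The hard-coded letters depend on the order of `final_cols` and on the index that `to_excel` writes to column A, so they shift whenever a column is added or reordered. A possible alternative is to look up letters from the header row. The sketch below uses a hypothetical in-memory worksheet (`ws_demo`) and illustrative widths.

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

# Hypothetical in-memory sheet standing in for the exported workbook;
# the first header cell is empty, like the index column written by to_excel
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append([None, "item_id", "SKU", "product_name", "description"])

# Map each header name in row 1 to its column letter
header_to_letter = {
    cell.value: get_column_letter(cell.column)
    for cell in ws_demo[1]
    if cell.value is not None
}

# Widths keyed by column name rather than by letter (values are illustrative)
widths_by_name = {"product_name": 60, "description": 100}
for name, width in widths_by_name.items():
    ws_demo.column_dimensions[header_to_letter[name]].width = width

print(header_to_letter)
```

The same mapping could drive the text-wrapping and number-format loops, so that each style follows its column by name.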

## 10. Save the Styled Workbook

In [57]:
wb.save("../data/processed/walmart_retail_data_3132025_styled.xlsx")