# Target Data Cleaning Automation Tool

## Overview

This notebook automates the process of cleaning and transforming data for Target product listings. The objective is to ensure the data meets required quality standards before analysis and reporting.

Key tasks include:
- Loading and normalizing raw JSON data
- Parsing nested price and specifications fields into flat structures
- Merging processed specifications with the main DataFrame
- Dropping duplicate or unnecessary columns and reordering the columns
- Exporting the final dataset to Excel and applying styling

The following changes were applied to improve efficiency:
- Removed unused imports
- Implemented cleaner dictionary-maker logic
- Consolidated the creation of the specifications data
- Renamed variables for clarity
- Optimized the styling process by merging loops and using dictionaries

## 1. Import Libraries and Load the Data

In [73]:
# Import the necessary libraries
import pandas as pd
import re
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Border, Side, Alignment, Font
from openpyxl.styles import numbers

In [74]:
# Read JSON file
df = pd.read_json("../data/raw/snap_m82yajnk2ckxnn3lyv.json")

## 2. Data Cleaning and Preprocessing

In [75]:
# Normalize the nested JSON fields
regular_price = pd.json_normalize(df['regular_price']).rename(columns={'value': 'regular_retail_price'})
promo_price = pd.json_normalize(df['promo_price']).rename(columns={'value': 'discounted_retail_price'})

# Concatenate the original DataFrame with the normalized price DataFrame
dF = pd.concat([df, regular_price, promo_price], axis=1)
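
To make the normalization step concrete, here is a minimal sketch on two made-up rows. The sample names and prices are assumptions, not records from the dataset. It also shows why `currency` and `symbol` end up duplicated after the concatenation: each price field contributes its own copy.

```python
import pandas as pd

# Hypothetical sample mimicking the nested price fields (values are made up)
sample = pd.DataFrame({
    "name": ["Stool A", "Stool B"],
    "regular_price": [
        {"value": 100.0, "currency": "USD", "symbol": "$"},
        {"value": 80.0, "currency": "USD", "symbol": "$"},
    ],
    "promo_price": [
        {"value": 90.0, "currency": "USD", "symbol": "$"},
        {"value": 70.0, "currency": "USD", "symbol": "$"},
    ],
})

# Flatten each nested field and give its 'value' column a descriptive name
regular = pd.json_normalize(sample["regular_price"].tolist()).rename(
    columns={"value": "regular_retail_price"}
)
promo = pd.json_normalize(sample["promo_price"].tolist()).rename(
    columns={"value": "discounted_retail_price"}
)

flat = pd.concat([sample, regular, promo], axis=1)

# 'currency' and 'symbol' each occur twice, once per price field
duplicated = flat.columns[flat.columns.duplicated()].tolist()
print(duplicated)
```

The duplicated names are harmless at this point but have to be resolved before columns are selected by name, which is what section 4 deals with.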

In [76]:
dF.columns

Index(['availability', 'average_ratings', 'breadcrumbs', 'category',
       'description', 'details', 'image_counter', 'is_video', 'name',
       'number_of_reviews', 'promo_price', 'regular_price', 'reviews',
       'sale_tag', 'seller_info', 'specifications', 'url',
       'variations/list_of_options', 'zipcode', 'currency', 'symbol',
       'regular_retail_price', 'currency', 'symbol',
       'discounted_retail_price'],
      dtype='object')

## 3. Convert Specifications to a DataFrame

In [77]:
# Extract the 'specifications' and 'url' columns
spes = dF[['specifications','url']]

In [78]:
# Function to convert the 'Table' list of "key: value" strings inside a specifications dict into a flat dictionary
def dictionary_maker(dict_specifications: dict) -> dict:
    keys, values = [], []
    spec_list = dict_specifications.get('Table', [])

    for item in spec_list:
        try:
            # Skip items containing hyphens
            if '-' in item:
                continue
            # Split on the first colon
            if ':' in item:
                parts = re.split(r':\s?', item, maxsplit=1)
                if len(parts) == 2:
                    keys.append(parts[0].strip())
                    values.append(parts[1].strip())
        except Exception as e:
            print(f"Error parsing item '{item}': {e}")
            continue

    return dict(zip(keys, values))
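
The parsing rule can be tried out on a small input. The sketch below uses a hypothetical, more compact helper (`parse_spec_items` is not part of the notebook) and invented specification strings; it is meant to follow the same rules as `dictionary_maker`, namely skipping hyphenated entries and splitting on the first colon.

```python
def parse_spec_items(items):
    """Hypothetical compact variant: turn 'Key: value' strings into a dict."""
    parsed = {}
    for item in items:
        # Mirror the notebook's rule: skip entries containing a hyphen,
        # and ignore entries that have no colon at all
        if "-" in item or ":" not in item:
            continue
        key, _, value = item.partition(":")  # split on the first colon only
        parsed[key.strip()] = value.strip()
    return parsed


# Made-up input in the same shape as the 'specifications' field
sample_spec = {"Table": [
    "Number of Pieces: 2",
    "Weight: 25 Pounds",
    "Assembly Details: Adult assembly required - tools not provided",
    "no colon here",
]}

result = parse_spec_items(sample_spec.get("Table", []))
print(result)
```

Note that the hyphen rule also discards otherwise valid entries, such as the `Assembly Details` line in this sample, so it is worth checking whether that filter is stricter than intended.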

In [79]:
# Apply the dictionary_maker function to the 'specifications' column.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is assigned on a slice of dF.
spes = spes.assign(specifications=spes['specifications'].apply(dictionary_maker))


In [80]:
spes

| | specifications | url |
|---|---|---|
| 0 | {'Number of Pieces': '2', 'Seats up to': '2', ... | https://www.target.com/p/set-of-2-azalea-fabri... |
| 1 | {'Features': 'Padded Back, Contoured Back, Arm... | https://www.target.com/p/30%EF%BF%BD-Sonia-Swi... |
| 2 | {'Features': 'With Back, Armless, Swivels, Foo... | https://www.target.com/p/30-Carise-Swivel-Coun... |
| 3 | {'Features': 'Low Back', 'Number of Pieces': '... | https://www.target.com/p/30-Alec-Bar-Height-Sw... |
| 4 | {'Features': 'Rectangle (shape)', 'Number of P... | https://www.target.com/p/Rectangular-Solange-C... |
| ... | ... | ... |
| 526 | {'Features': 'Square (shape)', 'Number of Piec... | https://www.target.com/p/Duval-Adjustable-Bars... |
| 527 | {'Number of Pieces': '2', 'Seats up to': '2', ... | https://www.target.com/p/armen-living-set-of-2... |
| 528 | {'Features': 'Curved Back, Swivels, Footrest, ... | https://www.target.com/p/30-Cohen-Barstool-wit... |
| 529 | {'Features': 'Round (shape)', 'Number of Piece... | https://www.target.com/p/armen-living-30-flynn... |


In [81]:
# Build a list of rows for the new specs DataFrame
spec_data = []

for idx in spes.index:
    spec_dict = spes.at[idx, 'specifications']
    if isinstance(spec_dict, dict):
        # Filter out excessively long keys or complex types
        filtered_dict = {
            k: (str(v) if isinstance(v, (dict, list)) else v)
            for k, v in spec_dict.items()
            if len(k) < 50
        }
        filtered_dict['url'] = spes.at[idx, 'url']
        spec_data.append(filtered_dict)

reg_total = pd.DataFrame(spec_data).reset_index(drop=True)
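
Building the frame from a list of dictionaries works because pandas takes the union of all keys as columns and fills the gaps with `NaN`. A minimal sketch with invented rows (the URLs and values are placeholders):

```python
import pandas as pd

# Two made-up specification rows; the second one has no 'Material' entry
rows = [
    {"Weight": "25 Pounds", "Material": "Steel", "url": "https://example.com/a"},
    {"Weight": "30 Pounds", "url": "https://example.com/b"},
]

specs = pd.DataFrame(rows)

print(specs.columns.tolist())           # union of all keys, in first-seen order
print(specs["Material"].isna().tolist())  # missing keys become NaN
```

This is why `reg_total` ends up wide and sparse: every specification key seen on any product becomes a column.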

In [82]:
# Reset the indices so both DataFrames are numbered from 0
dF = dF.reset_index(drop=True)
reg_total = reg_total.reset_index(drop=True)

# Rename a stray 'index' column to 'url' (no effect when no such column exists)
reg_total = reg_total.rename(columns={"index": "url"})

# Retrieve the list of columns
totalColumns = reg_total.columns.to_list()

## 4. Handle Duplicate Columns and Rename Specific Ones

In [83]:
# Check for duplicates
dup_dF = dF.columns[dF.columns.duplicated()].unique()
dup_regTotal = reg_total.columns[reg_total.columns.duplicated()].unique()

print("Duplicate columns in dF:", dup_dF)
print("Duplicate columns in reg_total:", dup_regTotal)


Duplicate columns in dF: Index(['currency', 'symbol'], dtype='object')
Duplicate columns in reg_total: Index([], dtype='object')


In [84]:
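# Rename the duplicated price columns in dF. Both occurrences of each name
# receive the same new name, so they stay duplicated until make_unique_columns runs.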
dF.rename(columns={
    'currency': 'currency_dF',
    'symbol': 'symbol_dF',
    'value': 'value_dF'
}, inplace=True)

In [85]:
# Helper function to make column names unique by appending suffixes
def make_unique_columns(df):
    seen = {}
    new_cols = []
    for col in df.columns:
        if col in seen:
            seen[col] += 1
            new_cols.append(f"{col}_{seen[col]}")
        else:
            seen[col] = 0
            new_cols.append(col)
    df.columns = new_cols
    return df
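
The suffixing logic can be checked in isolation on a plain list of names. The helper below (`unique_names`, a hypothetical stand-in that works on lists instead of DataFrames) follows the same counting scheme:

```python
def unique_names(names):
    """Append _1, _2, ... to repeated names, leaving the first occurrence as-is."""
    seen = {}
    result = []
    for name in names:
        if name in seen:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            seen[name] = 0
            result.append(name)
    return result


deduped = unique_names(["currency_dF", "symbol_dF", "currency_dF", "symbol_dF"])
print(deduped)
```

One limitation of this scheme: it does not check whether a generated name such as `currency_dF_1` already exists in the input, so a collision is possible in principle.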

In [86]:
# Apply the helper to both DataFrames
dF = make_unique_columns(dF)
reg_total = make_unique_columns(reg_total)

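# Concatenate along the rows (axis=0): reg_total is stacked below dF, so the
# result holds the rows of both frames, with NaN wherever a column is absent
dF2 = pd.concat([dF, reg_total], axis=0, ignore_index=True)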
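
Because `axis=0` stacks the two frames, the result has twice as many rows, and the specification values sit in separate rows from the product fields. If the goal is one row per product, a join on `url` is one possible alternative. The sketch below is an assumption about the intent, not the notebook's method, and uses made-up data:

```python
import pandas as pd

# Made-up stand-ins for dF and reg_total
products = pd.DataFrame({
    "name": ["Stool A", "Stool B"],
    "url": ["https://example.com/a", "https://example.com/b"],
})
specs = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "Weight": ["25 Pounds", "30 Pounds"],
})

# Row-wise stacking, as in the notebook: rows double and values do not line up
stacked = pd.concat([products, specs], axis=0, ignore_index=True)

# Column-wise join on 'url': one row per product.
# validate="one_to_one" raises if a URL is repeated in either frame.
merged = products.merge(specs, on="url", how="left", validate="one_to_one")

print(stacked.shape)
print(merged.shape)
```

With real data, repeated URLs would make the `validate` check fail, so duplicates would need to be handled before such a join.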

In [None]:
# Define the target list of columns after concatenation
columnsList = ['name','availability',
 'average_ratings',
 'breadcrumbs',
 'category',
 'description',
 'details',
 'image_counter',
 'is_video',
 'number_of_reviews',
 'regular_retail_price',
 'discounted_retail_price',
 'reviews',
 'sale_tag',
 'seller_info',
 'specifications',
 'url',
 'variations/list_of_options',
 'zipcode',
 'currency_dF',
 'symbol_dF',
 #'currency_dF.1',
 #'symbol_dF.1',
 #'value_dF.1',
 'Number of Pieces',
 'Seats up to',
 'Dimensions (Overall)',
 'Seat Dimensions',
 'Seat Back Dimensions',
 'Weight',
 'Holds up to',
 'Material',
 'Assembly Details',
 'Upholstery',
 'Care & Cleaning',
 'TCIN',
 'UPC',
 'Origin',
 'WARNING',
 'Features',
 'Maximum Height, Floor to Seat Top',
 'Minimum Height, Floor to Seat Top',
 'Height Adjustability',
 'Fill Material',
 'Includes',
 'Seat back height',
 'Tabletop Material',
 'Frame Color',
 'Floor to apron height',
 'Tabletop Thickness',
 'Tabletop color',
 'Table base style',
 'Leg Height',
 'Street Date',
 'Textile Material',
 'Arm Height',
 'Finish',
 'Suggested Age',
 'Compartment 1 Type',
 'Number of Shelves',
 'Number of Doors',
 'Drawers',
 'Furniture Legs Material',
 'Surface Material',
 'Piece 1',
 'Piece 1 Weight',
 'Piece 2',
 'Piece 2 Weight',
 'Arm Type',
 'Piece 3',
 'Piece 3 Weight',
 'Bed Size',
 'Box spring required',
 'Seating Piece 1',
 'Seating Piece 1 Weight',
 'Seating Piece 1 Holds up to',
 'Table Piece 1',
 'table piece 1 weight',
 'Table Piece 1 Floor to Apron Height',
 'Table Piece 1 Max Number of Seats',
 'Seat Material',
 'Seat Cushion Dimensions',
 'Required, Not Included',
 'Floor to Frame Height',
 'Headboard Surface Type',
 'Piece 4',
 'Piece 4 Weight',
 'null']

In [88]:
# Report discrepancies between dF2 and columnsList. A printed name marks a
# column that is present in one of them but missing from the other.
for col in dF2.columns:
    if col not in columnsList:
        print(col, ':', 'found')
for col in columnsList:
    if col not in dF2.columns:
        print(col, ':', 'found')

currency_dF_1 : found
symbol_dF_1 : found
Hardware : found
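
The same check can be written so that the two directions of the comparison are reported separately, which makes the output easier to read. A minimal sketch with made-up column lists (the variable names are illustrative):

```python
# Illustrative stand-ins for dF2.columns and columnsList
actual_columns = ["name", "url", "currency_dF_1", "Hardware"]
expected_columns = ["name", "url", "Weight"]

expected_set = set(expected_columns)
actual_set = set(actual_columns)

# Columns in the data that the target list does not mention
unexpected = [c for c in actual_columns if c not in expected_set]
# Columns in the target list that the data does not contain
missing = [c for c in expected_columns if c not in actual_set]

print("Not in target list:", unexpected)
print("Not in data:", missing)
```

Any name in the second group would cause a `KeyError` when the target list is used to select columns, so that group should be empty before the selection step.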


In [89]:
dF2.columns

Index(['availability', 'average_ratings', 'breadcrumbs', 'category',
       'description', 'details', 'image_counter', 'is_video', 'name',
       'number_of_reviews', 'promo_price', 'regular_price', 'reviews',
       'sale_tag', 'seller_info', 'specifications', 'url',
       'variations/list_of_options', 'zipcode', 'currency_dF', 'symbol_dF',
       'regular_retail_price', 'currency_dF_1', 'symbol_dF_1',
       'discounted_retail_price', 'Number of Pieces', 'Seats up to',
       'Dimensions (Overall)', 'Seat Dimensions', 'Seat Back Dimensions',
       'Weight', 'Holds up to', 'Material', 'Assembly Details', 'Upholstery',
       'Maximum Height, Floor to Seat Top',
       'Minimum Height, Floor to Seat Top', 'Textile Material',
       'Furniture Legs Material', 'Includes', 'Height Adjustability',
       'Fill Material', 'Seat back height', 'Tabletop Material', 'Frame Color',
       'Floor to apron height', 'Tabletop Thickness', 'Tabletop color',
       'Table base style', 'Leg Height', 

In [91]:
# Select the target columns in the desired order
dF2 = dF2[columnsList]

In [100]:
dF2['regular_retail_price']

0        247.99
1        116.99
2        163.99
3        189.99
4       1999.99
         ...   
1057        NaN
1058        NaN
1059        NaN
1060        NaN
1061        NaN
Name: regular_retail_price, Length: 1062, dtype: float64

## 5. Finalize DataFrame and Export to Excel

In [92]:
# Export the processed DataFrame 
dF2.to_excel('../data/processed/target_retail_data_3132025.xlsx')

In [99]:
dF2.columns

Index(['name', 'availability', 'average_ratings', 'breadcrumbs', 'category',
       'description', 'details', 'image_counter', 'is_video',
       'number_of_reviews', 'promo_price', 'regular_price', 'reviews',
       'sale_tag', 'seller_info', 'specifications', 'url',
       'variations/list_of_options', 'zipcode', 'regular_retail_price',
       'discounted_retail_price', 'currency_dF', 'symbol_dF',
       'Number of Pieces', 'Seats up to', 'Dimensions (Overall)',
       'Seat Dimensions', 'Seat Back Dimensions', 'Weight', 'Holds up to',
       'Material', 'Assembly Details', 'Upholstery', 'Care & Cleaning', 'TCIN',
       'Maximum Height, Floor to Seat Top',
       'Minimum Height, Floor to Seat Top', 'Height Adjustability',
       'Fill Material', 'Includes', 'Seat back height', 'Tabletop Material',
       'Frame Color', 'Floor to apron height', 'Tabletop Thickness',
       'Tabletop color', 'Table base style', 'Leg Height', 'Street Date',
       'Textile Material', 'Arm Height', 'Fi

## 6. Load Excel Workbook and Apply Styling

In [93]:
# Load the exported Excel workbook
wb = load_workbook(filename='../data/processed/target_retail_data_3132025.xlsx')

# Select the active worksheet
ws = wb.active

# Apply an auto-filter
ws.auto_filter.ref = ws.dimensions

In [94]:
# Define a header font style
font = Font(size=15, bold=True, italic=False, vertAlign=None, underline='none', strike=False, color='FF000000')

# Define text wrapping alignment
wrap = Alignment(wrapText=True, horizontal='left')

# Define left alignment for cells
left_alignment = Alignment(horizontal='left')

# Define a fill pattern for header cells
fill = PatternFill("solid", fgColor="00CCFFCC")

# Define a thin black border for all four sides of a cell
thin = Side(border_style='thin', color="FF000000")
border = Border(top=thin, bottom=thin, left=thin, right=thin)

In [95]:
# Get the total number of rows in the worksheet
last_row = ws.max_row

# Set a standard row height for all data rows (the header row is skipped)
for i in range(2,last_row+1):
    ws.row_dimensions[i].height = 15

# Apply number formatting to specific columns
for col in ["G", "I", "J", "K", "O", "P"]:
    for cell in ws[col]:
        cell.number_format = numbers.FORMAT_NUMBER

# Apply left alignment and a thin border to every cell
for row in ws.iter_rows(min_row=1, max_row=last_row):
    for cell in row:
        cell.alignment = left_alignment
        cell.border = border

# Format the header row
for cell in ws["1:1"]:
    cell.font = font
    cell.fill = fill

## 7. Apply Additional Alignment for Specific Columns

In [96]:
# Enable text wrapping for specific columns
for col in ['C', 'H', 'S']:
    for cell in ws[col]:
        cell.alignment = wrap
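
The column letters above are hard-coded, so they silently point at the wrong fields if the column order changes. One possible safeguard is to look the letters up from the header row. The sketch below assumes `openpyxl` is installed and uses an in-memory workbook with made-up headers instead of the exported file:

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment
from openpyxl.utils import get_column_letter

# In-memory stand-in for the exported worksheet (headers are illustrative)
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.append(["index", "name", "description"])
ws_demo.append([0, "Stool A", "A long description that should wrap"])

# Map each header text to its column letter, e.g. 'description' -> 'C'
header_to_letter = {
    cell.value: get_column_letter(cell.column) for cell in ws_demo[1]
}

# Enable wrapping for a column selected by name instead of by letter
wrap_demo = Alignment(wrapText=True, horizontal="left")
for cell in ws_demo[header_to_letter["description"]]:
    cell.alignment = wrap_demo

print(header_to_letter)
```

The same lookup could drive the number-format and column-width steps, at the cost of one extra dictionary.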

## 8. Freeze Panes and Set Column Widths

In [97]:
# Freeze the header row and the first column (the exported index) so both stay visible
ws.freeze_panes = ws["B2"]

# Define column widths
col_widths = {
    "B": 20, "C": 60, "D": 20, "E": 20, "F": 20,
    "G": 20, "H": 60, "I": 20, "J": 20, "K": 20,
    "L": 30, "M": 20, "N": 20, "O": 20, "P": 20,
    "Q": 20, "R": 60, "S": 20, "T": 20, "U": 20,
    "V": 20, "W": 20, "X": 20, "Y": 20, "Z": 20
}

# Apply column widths
for col, width in col_widths.items():
    ws.column_dimensions[col].width = width

## 9. Save the Styled Workbook

In [98]:
wb.save("../data/processed/target_retail_data_3132025_styled.xlsx")