# Walmart Products Dataset Preprocessing

This script performs preliminary cleaning of the Walmart Products dataset from Kaggle, which can be found [here](https://www.kaggle.com/datasets/thedevastator/product-prices-and-sizes-from-walmart-grocery/code).

Some of the cleaning steps were inspired by a similar project on Walmart price prediction on Kaggle, which can be found [here](https://www.kaggle.com/code/ryanbell62101/walmart-product-price-predictor).



In [38]:
# Necessary Imports
import numpy as np
import pandas as pd
import re

## Step 1: Read in Dataset

First, we take a look at the dataset to verify that it imported successfully.

In [39]:
# Read in dataset and display the top couple of rows to verify it imported properly
walmart = pd.read_csv('walmart_dataset.csv')
walmart.head()


| | index | SHIPPING_LOCATION | DEPARTMENT | CATEGORY | SUBCATEGORY | BREADCRUMBS | SKU | PRODUCT_URL | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | PROMOTION | RunDate | tid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 79936 | Deli | Hummus, Dips, & Salsa | | Deli/Hummus, Dips, & Salsa | 110895339 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | | 2022-09-11 21:20:04 | 16163804 |
| 1 | 1 | 79936 | Deli | Hummus, Dips, & Salsa | | Deli/Hummus, Dips, & Salsa | 105455228 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | | 2022-09-11 21:20:04 | 16163805 |
| 2 | 2 | 79936 | Deli | Hummus, Dips, & Salsa | | Deli/Hummus, Dips, & Salsa | 128642379 | https://www.walmart.com/ip/Marketside-Classic-... | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | | 2022-09-11 21:20:04 | 16163806 |
| 3 | 3 | 79936 | Deli | Hummus, Dips, & Salsa | | Deli/Hummus, Dips, & Salsa | 366126367 | https://www.walmart.com/ip/Marketside-Everythi... | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10 | | 2022-09-11 21:20:04 | 16163807 |
| 4 | 4 | 79936 | Deli | Hummus, Dips, & Salsa | | Deli/Hummus, Dips, & Salsa | 160090316 | https://www.walmart.com/ip/Price-s-Jalapeno-Di... | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12 | | 2022-09-11 21:20:04 | 16163808 |
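
Beyond inspecting the first rows by eye, a few programmatic checks can confirm that an import looks as expected. The following is a minimal sketch on a small in-memory CSV; the sample rows, the `expected_cols` set, and the variable names are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for walmart_dataset.csv (made-up sample rows)
csv_text = """index,DEPARTMENT,PRICE_RETAIL,PRODUCT_SIZE
0,Deli,2.67,10
1,Deli,3.12,12
"""
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical list of columns we expect to find after loading
expected_cols = {"index", "DEPARTMENT", "PRICE_RETAIL", "PRODUCT_SIZE"}
missing_cols = expected_cols - set(df.columns)

print(df.shape)        # (rows, columns)
print(df.dtypes)       # inferred type of each column
print(missing_cols)    # empty set when every expected column is present
```

On the real file, the same checks would be run against `walmart` with the full column list.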


## Step 2: Look for Unique Values in each Feature

A feature that carries meaningful information should take more than one value across a dataset with roughly 569k entries. We therefore count the unique values in each feature and remove any feature with zero or one unique value, since a constant or entirely empty column cannot distinguish one entry from another.

In [40]:
# Analyze number of unique items in each column
unique = walmart.nunique()
unique

index                568534
SHIPPING_LOCATION        26
DEPARTMENT               14
CATEGORY                114
SUBCATEGORY             125
BREADCRUMBS             116
SKU                   30827
PRODUCT_URL           32008
PRODUCT_NAME          30688
BRAND                  4368
PRICE_RETAIL           1852
PRICE_CURRENT          1833
PRODUCT_SIZE           1290
PROMOTION                 0
RunDate                   1
tid                  568534
dtype: int64

In [41]:
# Drop "Promotion" and "RunDate" features as they only have 0 and 1 unique values respectively, so they provide no valuable info
walmart.drop(columns=['RunDate', 'PROMOTION'], inplace=True)
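
The columns to drop can also be found programmatically rather than read off the `nunique()` output. This is a minimal sketch on a toy DataFrame; the toy values and the "at most one unique value" threshold are assumptions chosen to mirror the reasoning in this step.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "DEPARTMENT": ["Deli", "Bakery", "Deli"],
    "PROMOTION": [None, None, None],      # all missing -> 0 unique values
    "RunDate": ["2022-09-11"] * 3,        # constant -> 1 unique value
})

# nunique() ignores missing values by default, so an all-missing column reports 0
unique_counts = toy.nunique()

# Assumed threshold: a column with at most one distinct value carries no information
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

cleaned = toy.drop(columns=constant_cols)
print(constant_cols)
print(cleaned.columns.tolist())
```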


## Step 3: Deal with Missing Data

Next, we want to analyze how much missing data (typically "NA" values) is in the dataset.

In [42]:
# sum all missing values for each feature
walmart.isna().sum()

index                     0
SHIPPING_LOCATION         0
DEPARTMENT                0
CATEGORY                  0
SUBCATEGORY          207210
BREADCRUMBS               0
SKU                       0
PRODUCT_URL               0
PRODUCT_NAME              0
BRAND                    27
PRICE_RETAIL              0
PRICE_CURRENT             0
PRODUCT_SIZE          62825
tid                       0
dtype: int64

Here, we see that Brand (27 NA values) and Product Size (about 63k NA values, roughly 11% of the 568,534 entries) have relatively few missing values compared to Subcategory, which has 207k (roughly 36%). Dropping every entry with a missing Subcategory would remove too much data, so we keep those entries and fill the missing Subcategory with a placeholder. For now, we drop the entries with NA values for Brand or Product Size.
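
To compare columns on the same scale, the raw NA counts can be expressed as a share of all rows. This is a minimal sketch on a toy DataFrame with made-up values; `missing_share` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the pattern of missing data matters here
toy = pd.DataFrame({
    "SUBCATEGORY": [np.nan, "domestic beer", np.nan, np.nan],
    "BRAND": ["marketside", np.nan, "price's", "marketside"],
    "PRICE_RETAIL": [2.67, 12.98, 3.12, 2.67],
})

# isna() yields booleans, so mean() gives the fraction of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, `walmart.isna().mean()` would give the corresponding shares for each feature.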

In [43]:
# Subcategory has too many missing values to drop the affected rows without significantly shrinking the dataset,
# so fill them with a 'none' placeholder instead. Assigning the result back avoids the chained inplace call,
# which does not reliably modify the original DataFrame.
walmart['SUBCATEGORY'] = walmart['SUBCATEGORY'].fillna('none')

walmart.dropna(inplace=True) # Drop entries with NA values for BRAND and PRODUCT_SIZE
walmart.isna().sum()

index                0
SHIPPING_LOCATION    0
DEPARTMENT           0
CATEGORY             0
SUBCATEGORY          0
BREADCRUMBS          0
SKU                  0
PRODUCT_URL          0
PRODUCT_NAME         0
BRAND                0
PRICE_RETAIL         0
PRICE_CURRENT        0
PRODUCT_SIZE         0
tid                  0
dtype: int64
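
The fill-then-drop pattern used in this step can also name the affected columns explicitly, which keeps the drop from silently widening if other columns gain missing values later. This is a minimal sketch on toy data; the values are made up, and using `subset` is a suggested variation rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "SUBCATEGORY": [np.nan, "domestic beer", np.nan],
    "BRAND": ["marketside", np.nan, "price's"],
    "PRODUCT_SIZE": ["10", "12", np.nan],
})

# Fill by assignment instead of calling fillna(inplace=True) on a selected column
toy["SUBCATEGORY"] = toy["SUBCATEGORY"].fillna("none")

# Drop only rows that are missing BRAND or PRODUCT_SIZE
toy = toy.dropna(subset=["BRAND", "PRODUCT_SIZE"])
print(toy)
```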

## Step 4: Analyze Data Types for Each Feature

Now, we should verify the data types for each feature to ensure that each data type makes logical sense and will be what we want to use moving forward.

In [44]:
# Look at overview of dataset thus far
walmart.info()

<class 'pandas.core.frame.DataFrame'>
Index: 505709 entries, 0 to 568533
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   index              505709 non-null  int64  
 1   SHIPPING_LOCATION  505709 non-null  int64  
 2   DEPARTMENT         505709 non-null  object 
 3   CATEGORY           505709 non-null  object 
 4   SUBCATEGORY        505709 non-null  object 
 5   BREADCRUMBS        505709 non-null  object 
 6   SKU                505709 non-null  int64  
 7   PRODUCT_URL        505709 non-null  object 
 8   PRODUCT_NAME       505709 non-null  object 
 9   BRAND              505709 non-null  object 
 10  PRICE_RETAIL       505709 non-null  float64
 11  PRICE_CURRENT      505709 non-null  float64
 12  PRODUCT_SIZE       505709 non-null  object 
 13  tid                505709 non-null  int64  
dtypes: float64(2), int64(4), object(8)
memory usage: 57.9+ MB


Here, we notice that PRODUCT_SIZE has dtype "object", meaning its values are stored as strings. Since a size is a quantity, it should be a numerical value instead.

In [45]:
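# Product size should be numerical: extract the first run of digits from each string and convert it to a number
def get_digits(string):
    # Return the first run of digits as a string, or None if the string contains no digits
    digit_search = re.search(r'([0-9]+)', string)
    return digit_search.group(1) if digit_search else None

# Entries for which get_digits returns None become NaN after pd.to_numeric
walmart['PRODUCT_SIZE'] = pd.to_numeric(walmart['PRODUCT_SIZE'].map(get_digits))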
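
Because `get_digits` keeps only the first run of digits, a size such as "1.5 L" would be reduced to 1. If decimal sizes occur in the data (an assumption, not something checked here), a vectorized alternative with `Series.str.extract` could preserve the fractional part. This is a sketch on made-up strings, not the notebook's method.

```python
import pandas as pd

# Made-up size strings covering the cases of interest
sizes = pd.Series(["10", "12 oz", "1.5 L", "each"])

# Capture an integer, optionally followed by a decimal part
extracted = sizes.str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# Strings without any digits yield NaN
numeric_sizes = pd.to_numeric(extracted)
print(numeric_sizes)
```

Note that this still discards the unit, so sizes given in different units are not directly comparable.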

Another issue is that text features such as Category, Subcategory, Brand, and Breadcrumbs may contain the same value written with different letter casing. We standardize the casing so that such variants are counted as a single value rather than as separate ones.

In [46]:
# Merge values that differ only in letter casing by converting all text to lowercase
walmart['CATEGORY'] = walmart['CATEGORY'].str.lower()
walmart['SUBCATEGORY'] = walmart['SUBCATEGORY'].str.lower()
walmart['BRAND'] = walmart['BRAND'].str.lower()
walmart['BREADCRUMBS'] = walmart['BREADCRUMBS'].str.lower()

walmart.nunique()

index                505709
SHIPPING_LOCATION        26
DEPARTMENT               14
CATEGORY                113
SUBCATEGORY             121
BREADCRUMBS             115
SKU                   26604
PRODUCT_URL           27634
PRODUCT_NAME          26537
BRAND                  3871
PRICE_RETAIL           1709
PRICE_CURRENT          1684
PRODUCT_SIZE            137
tid                  505709
dtype: int64
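
The effect of case normalization on the unique counts can be seen on a small example. This is a minimal sketch with made-up brand strings; the extra `str.strip()` step is an assumption about another common source of near-duplicates and is not applied in the notebook.

```python
import pandas as pd

# Made-up brand labels: three spellings of one brand (one with a trailing space) plus another brand
brands = pd.Series(["Marketside", "marketside", "MARKETSIDE ", "Price's"])

before = brands.nunique()                                   # every spelling counted separately
lower_only = brands.str.lower().nunique()                   # casing variants merged
lower_stripped = brands.str.lower().str.strip().nunique()   # whitespace variants merged too

print(before, lower_only, lower_stripped)
```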

## Step 5: Final Cleaning and Organization of Data

In [47]:
# Check if any new NA values were introduced in the above steps
walmart.isna().sum()

index                  0
SHIPPING_LOCATION      0
DEPARTMENT             0
CATEGORY               0
SUBCATEGORY            0
BREADCRUMBS            0
SKU                    0
PRODUCT_URL            0
PRODUCT_NAME           0
BRAND                  0
PRICE_RETAIL           0
PRICE_CURRENT          0
PRODUCT_SIZE         202
tid                    0
dtype: int64

The 202 new missing values in PRODUCT_SIZE come from entries whose size string contains no digits: `get_digits` returns `None` for them, which `pd.to_numeric` converts to NaN.

In [48]:
# Drop any new NA values and re-verify
walmart.dropna(inplace=True)
walmart.isna().sum()

index                0
SHIPPING_LOCATION    0
DEPARTMENT           0
CATEGORY             0
SUBCATEGORY          0
BREADCRUMBS          0
SKU                  0
PRODUCT_URL          0
PRODUCT_NAME         0
BRAND                0
PRICE_RETAIL         0
PRICE_CURRENT        0
PRODUCT_SIZE         0
tid                  0
dtype: int64

In [49]:
# Group the data hierarchically by category, then subcategory, and summarize retail prices per group
category_groups = walmart.groupby(['CATEGORY','SUBCATEGORY'])
category_groups[['PRICE_RETAIL']].describe()

PRICE_RETAIL summary statistics per group:

| CATEGORY | SUBCATEGORY | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| bacon, hot dogs, sausage | none | 5939.0 | 5.474102 | 3.054611 | 0.84 | 3.78 | 4.72 | 6.84 | 24.66 |
| baking nuts & seeds | none | 552.0 | 6.776793 | 4.086878 | 1.18 | 3.24 | 6.12 | 9.30 | 17.92 |
| baking soda & starch | none | 434.0 | 3.898710 | 3.824149 | 0.72 | 1.48 | 2.48 | 4.12 | 15.86 |
| beef jerky | none | 1510.0 | 8.375377 | 3.907909 | 1.08 | 4.98 | 7.88 | 11.98 | 18.58 |
| beer | domestic beer | 1490.0 | 12.444389 | 5.758637 | 1.48 | 7.99 | 12.98 | 16.98 | 27.98 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| wine | sparkling wine | 739.0 | 13.576685 | 8.817995 | 3.72 | 8.98 | 11.48 | 14.98 | 67.27 |
| wine | specialty wine | 48.0 | 9.535417 | 3.102972 | 5.48 | 6.98 | 8.98 | 11.48 | 18.98 |
| wine | white wine | 2190.0 | 10.602868 | 4.910865 | 2.96 | 6.99 | 9.98 | 12.98 | 90.00 |
| yeast | none | 237.0 | 3.521561 | 1.981249 | 0.86 | 1.72 | 4.62 | 5.18 | 6.37 |
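
When only a few statistics are needed, the same grouping can be combined with `agg` instead of `describe`. This is a minimal sketch on a toy DataFrame; the rows are made up and the choice of statistics is an assumption for illustration.

```python
import pandas as pd

# Toy frame with made-up prices
toy = pd.DataFrame({
    "CATEGORY": ["wine", "wine", "wine", "yeast"],
    "SUBCATEGORY": ["white wine", "white wine", "sparkling wine", "none"],
    "PRICE_RETAIL": [9.98, 12.98, 11.48, 4.62],
})

# One row per (CATEGORY, SUBCATEGORY) pair, one column per requested statistic
summary = (
    toy.groupby(["CATEGORY", "SUBCATEGORY"])["PRICE_RETAIL"]
       .agg(["count", "mean", "min", "max"])
)
print(summary)
```

The result is indexed by a two-level MultiIndex, so a single group can be looked up with `summary.loc[("wine", "white wine")]`.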


In [50]:
# Save cleaned dataset
walmart.to_csv("walmart_cleaned.csv", index=False)