# Data Cleaning Walkthrough

_A short, practical intro to getting data into model-ready shape._

---

## I. **Data Cleaning**

This walkthrough shows how to prepare messy data so AI libraries can use it—typically by converting to numerical form **without losing meaning**.

**Goal:** Transform raw data into consistent, validated, and machine-readable features.

**In this section, we’ll cover:**
- Custom functions (e.g., text/field normalization, parsing, feature engineering)
- One-Hot Encoding (categorical → binary indicators)
- Label Encoding (categorical → integer labels)

---
### **Question:** Can we turn natural language into numerical representations suitable for models?
---

## **Lesson Outcome: Screenshots**
### Before Data Cleaning
<img src="../screenshots/data-cleaning/before.png" width="900" height="300">

### After Data Cleaning
<img src="../screenshots/data-cleaning/after.png" width="900" height="300">



In [1]:
# Install required packages 
# !pip install pandas kagglehub scikit-learn matplotlib

### Access the dataset (learn more about it at the link below)

[Kaggle e-commerce customer behavior](https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset)

In [2]:
# Import kagglehub (used to download the dataset from Kaggle)
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uom190346a/e-commerce-customer-behavior-dataset")

# Display path to dataset
print("Path to dataset files:", path)

Path to dataset files: /Users/mo/.cache/kagglehub/datasets/uom190346a/e-commerce-customer-behavior-dataset/versions/1


In [3]:
# Import Pandas library
import pandas as pd

# Load the Kaggle e-commerce dataset into a pandas DataFrame
data = pd.read_csv(f'{path}/E-commerce Customer Behavior - Sheet1.csv')

# Display the dataset (pandas shows the first and last rows of long output)
data

Unnamed: 0,Customer ID,Gender,Age,City,Membership Type,Total Spend,Items Purchased,Average Rating,Discount Applied,Days Since Last Purchase,Satisfaction Level
0,101,Female,29,New York,Gold,1120.20,14,4.6,True,25,Satisfied
1,102,Male,34,Los Angeles,Silver,780.50,11,4.1,False,18,Neutral
2,103,Female,43,Chicago,Bronze,510.75,9,3.4,True,42,Unsatisfied
3,104,Male,30,San Francisco,Gold,1480.30,19,4.7,False,12,Satisfied
4,105,Male,27,Miami,Silver,720.40,13,4.0,True,55,Unsatisfied
...,...,...,...,...,...,...,...,...,...,...,...
345,446,Male,32,Miami,Silver,660.30,10,3.8,True,42,Unsatisfied
346,447,Female,36,Houston,Bronze,470.50,8,3.0,False,27,Neutral
347,448,Female,30,New York,Gold,1190.80,16,4.5,True,28,Satisfied
348,449,Male,34,Los Angeles,Silver,780.20,11,4.2,False,21,Neutral


# Data Exploration

Simple data exploration helps us understand the columns and how they need to be transformed before we can run them through an unsupervised ML algorithm.

### Numerical vs. Categorical Data

##### **Numerical Data** is what the computer can process directly.
##### **Categorical Data** is letters, words, or symbols that we have to turn into a numerical representation before processing, while keeping the meaning behind each value.

In [4]:
# Separate the numerical columns from the text columns
# Note: a column of any other dtype (for example, boolean) ends up in neither group
numerical_data = data.select_dtypes(include=['int64', 'float64'])
categorical_data = data.select_dtypes(include=['object'])

---
### Numerical vs. Categorical (Text)
##### Notice how categorical data is represented in regular words that we understand.
##### As data scientists, it is our job to turn the text into meaningful numerical representations.
---

In [5]:
# Display a sample of both sets of data Numerical vs Categorical (Text)
display(numerical_data.head())
print("Categorical Data")
display(categorical_data.head())

Unnamed: 0,Customer ID,Age,Total Spend,Items Purchased,Average Rating,Days Since Last Purchase
0,101,29,1120.2,14,4.6,25
1,102,34,780.5,11,4.1,18
2,103,43,510.75,9,3.4,42
3,104,30,1480.3,19,4.7,12
4,105,27,720.4,13,4.0,55


Categorical Data


Unnamed: 0,Gender,City,Membership Type,Satisfaction Level
0,Female,New York,Gold,Satisfied
1,Male,Los Angeles,Silver,Neutral
2,Female,Chicago,Bronze,Unsatisfied
3,Male,San Francisco,Gold,Satisfied
4,Male,Miami,Silver,Unsatisfied
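
Neither table above contains `Discount Applied`. Its True/False values are most likely read as a boolean dtype, which matches neither the `int64`/`float64` filter nor the `object` filter. Below is a minimal sketch of one way to spot such leftover columns. The `sample` frame and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with one column per kind of dtype
sample = pd.DataFrame({
    "Age": [29, 34],
    "Total Spend": [1120.20, 780.50],
    "City": ["New York", "Los Angeles"],
    "Discount Applied": [True, False],
})

# 'number' selects integer and float columns; booleans are not included
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
bool_cols = sample.select_dtypes(include="bool").columns.tolist()

# Whatever remains is treated as text/categorical in this sketch
other_cols = [c for c in sample.columns if c not in numeric_cols + bool_cols]

# Booleans convert cleanly to 1/0 if the column should be kept as a feature
discount_as_int = sample["Discount Applied"].astype(int).tolist()

print(numeric_cols, bool_cols, other_cols, discount_as_int)
```

Listing the leftover columns explicitly makes it a deliberate choice whether a column like `Discount Applied` is dropped or converted.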


## Three main methods used to *turn categorical/text columns into numerical columns*

### **Custom functions** - pair each unique label with a number, for example 1 = True, 0 = False
### **OneHotEncoder** - Tool that transforms categorical labels by creating a column for each unique value, which widens the dataset
### **LabelEncoder** - Tool that transforms categorical labels into integers, keeping them in one column.
---

# 1. Custom Function Example

### Look at the **'Gender' column** and write a custom function that assigns a *1 for female and a 2 for male*.
 * #### lambda function (below): allows for one-line functions
 * #### a regular function would work just as easily; a lambda is a clean choice for short, simple logic
---

In [6]:
# Custom lambda function that changes the categorical 'Gender' column to 1 (Female) or 2 (any other value, here Male)
categorical_data['Gender'] = categorical_data['Gender'].apply(lambda x: 1 if x == 'Female' else 2)
categorical_data.head()

Unnamed: 0,Gender,City,Membership Type,Satisfaction Level
0,1,New York,Gold,Satisfied
1,2,Los Angeles,Silver,Neutral
2,1,Chicago,Bronze,Unsatisfied
3,2,San Francisco,Gold,Satisfied
4,2,Miami,Silver,Unsatisfied
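
One thing to keep in mind with the lambda above: any value other than `'Female'`, including a typo or a missing value, is encoded as 2. A minimal sketch of an alternative, assuming an explicit dictionary passed to `Series.map` (the `gender` series and the `gender_codes` name are made up for illustration):

```python
import pandas as pd

# Hypothetical values, including one label the mapping does not know about
gender = pd.Series(["Female", "Male", "Unknown"])

# The lambda from the cell above: anything that is not 'Female' becomes 2
with_lambda = gender.apply(lambda x: 1 if x == "Female" else 2)

# Explicit mapping: labels missing from the dictionary become NaN
gender_codes = {"Female": 1, "Male": 2}
with_map = gender.map(gender_codes)

print(with_lambda.tolist())      # [1, 2, 2]
print(with_map.isna().tolist())  # [False, False, True]
```

With the dictionary, unexpected labels show up as missing values that can be inspected instead of blending into an existing category.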


# 2. OneHotEncoder Example
* ### We can see that there are 6 unique values in the 'City' column.
* ### OneHotEncoder will create one column per unique value (6 in total), putting a 1 in the column that matches each row's city and a 0 in all the others.

#### 🔎 Let's take a look!

In [7]:
# Take a look at the unique cities that are in the dataset
categorical_data['City'].value_counts()

City
New York         59
Los Angeles      59
Chicago          58
San Francisco    58
Miami            58
Houston          58
Name: count, dtype: int64

In [8]:
# Import OneHotEncoder from 'scikit-learn' library
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder (sparse_output=False returns a regular dense array)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'City' column
encoded = encoder.fit_transform(categorical_data[['City']])

# Create a DataFrame from the encoded columns
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['City']))

# Combine the encoded columns with the dataset and drop the original 'City' column
final_df = pd.concat([categorical_data, encoded_df], axis=1).drop('City', axis=1)

final_df.head()



Unnamed: 0,Gender,Membership Type,Satisfaction Level,City_Chicago,City_Houston,City_Los Angeles,City_Miami,City_New York,City_San Francisco
0,1,Gold,Satisfied,0.0,0.0,0.0,0.0,1.0,0.0
1,2,Silver,Neutral,0.0,0.0,1.0,0.0,0.0,0.0
2,1,Bronze,Unsatisfied,1.0,0.0,0.0,0.0,0.0,0.0
3,2,Gold,Satisfied,0.0,0.0,0.0,0.0,0.0,1.0
4,2,Silver,Unsatisfied,0.0,0.0,0.0,1.0,0.0,0.0
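
To make the idea behind the table above concrete, here is a rough plain-Python sketch of one-hot encoding. The `one_hot` helper is hypothetical and written only for illustration; it is not how scikit-learn implements `OneHotEncoder`.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# A few city names in the style of the dataset
cities = ["New York", "Los Angeles", "Chicago", "New York"]
categories, rows = one_hot(cities)

print(categories)
for city, row in zip(cities, rows):
    print(city, row)
```

Each row contains exactly one 1, in the position of its own category, which is why no city ends up looking "larger" than another.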


#### 🕰️ Now you have seen how OneHotEncoder works. Pretty cool, right?
#### ✚ LabelEncoder takes the opposite approach: it keeps a single column and replaces each category with an integer.

# 3. Label Encoder Example

In [9]:
# Import LabelEncoder from 'scikit-learn' library
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
encoder = LabelEncoder()

# Fit and transform the LabelEncoder on the 'Membership Type' and 'Satisfaction Level' columns
# (fit_transform refits the encoder on each call, so the same object can be reused)
final_df['Membership Type'] = encoder.fit_transform(final_df['Membership Type'])
final_df['Satisfaction Level'] = encoder.fit_transform(final_df['Satisfaction Level'])

final_df.head()

Unnamed: 0,Gender,Membership Type,Satisfaction Level,City_Chicago,City_Houston,City_Los Angeles,City_Miami,City_New York,City_San Francisco
0,1,1,1,0.0,0.0,0.0,0.0,1.0,0.0
1,2,2,0,0.0,0.0,1.0,0.0,0.0,0.0
2,1,0,2,1.0,0.0,0.0,0.0,0.0,0.0
3,2,1,1,0.0,0.0,0.0,0.0,0.0,1.0
4,2,2,2,0.0,0.0,0.0,1.0,0.0,0.0


### **Label Encoder Results**

#### **Membership Type:**
* Gold - 1
* Silver - 2
* Bronze - 0

#### **Satisfaction Level:**
* Satisfied - 1
* Neutral - 0
* Unsatisfied - 2
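
The codes listed above follow the alphabetical order of the labels (Bronze, Gold, Silver become 0, 1, 2), which is how `LabelEncoder` assigns its integers. A rough plain-Python sketch of that idea follows. The `label_encode` helper is hypothetical, and the `tier_rank` ordering (Bronze < Silver < Gold) is an assumption about how the membership tiers rank, not something stated in the dataset.

```python
def label_encode(values):
    """Assign integers to labels in sorted (alphabetical) order."""
    classes = sorted(set(values))
    lookup = {label: code for code, label in enumerate(classes)}
    return [lookup[value] for value in values], lookup


membership = ["Gold", "Silver", "Bronze", "Gold", "Silver"]
codes, lookup = label_encode(membership)
print(lookup)
print(codes)

# Alternative: an explicit ranking, if the order of the tiers should carry meaning
tier_rank = {"Bronze": 0, "Silver": 1, "Gold": 2}  # assumed ranking
ranked = [tier_rank[value] for value in membership]
print(ranked)
```

Alphabetical codes are arbitrary, so a model that reads them as magnitudes would treat Silver (2) as "more" than Gold (1). An explicit mapping is one way to avoid that when the categories have a natural order.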

In [10]:
# Combine the transformed categorical dataset with the original numerical data
df = pd.concat([final_df, numerical_data], axis=1)

# Display the full dataset (from 11 columns to 15 columns, 350 rows)
df


Unnamed: 0,Gender,Membership Type,Satisfaction Level,City_Chicago,City_Houston,City_Los Angeles,City_Miami,City_New York,City_San Francisco,Customer ID,Age,Total Spend,Items Purchased,Average Rating,Days Since Last Purchase
0,1,1,1,0.0,0.0,0.0,0.0,1.0,0.0,101,29,1120.20,14,4.6,25
1,2,2,0,0.0,0.0,1.0,0.0,0.0,0.0,102,34,780.50,11,4.1,18
2,1,0,2,1.0,0.0,0.0,0.0,0.0,0.0,103,43,510.75,9,3.4,42
3,2,1,1,0.0,0.0,0.0,0.0,0.0,1.0,104,30,1480.30,19,4.7,12
4,2,2,2,0.0,0.0,0.0,1.0,0.0,0.0,105,27,720.40,13,4.0,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,2,2,2,0.0,0.0,0.0,1.0,0.0,0.0,446,32,660.30,10,3.8,42
346,1,0,0,0.0,1.0,0.0,0.0,0.0,0.0,447,36,470.50,8,3.0,27
347,1,1,1,0.0,0.0,0.0,0.0,1.0,0.0,448,30,1190.80,16,4.5,28
348,2,2,0,0.0,0.0,1.0,0.0,0.0,0.0,449,34,780.20,11,4.2,21


# The Foundation - Data Cleaning

Data cleaning is the foundation of your work with the data: plan how each categorical value in the dataset will be changed into a numerical representation, and keep a record of that mapping. In this example, we have to keep track of the fact that:
- **Gender**:      
    1 - Female    
    2 - Male

- **Membership Type**:    
    0 - Bronze  
    1 - Gold   
    2 - Silver
    
- **Satisfaction Level**:   
    1 - Satisfied    
    0 - Neutral   
    2 - Unsatisfied
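
One way to keep track of these mappings is to store them next to the code and invert them when a number needs to be read back as a label. A minimal sketch, assuming the codes listed above; the `ENCODINGS` dictionary and the `decode` helper are hypothetical names introduced for illustration.

```python
# Mappings copied from the summary above
ENCODINGS = {
    "Gender": {"Female": 1, "Male": 2},
    "Membership Type": {"Bronze": 0, "Gold": 1, "Silver": 2},
    "Satisfaction Level": {"Neutral": 0, "Satisfied": 1, "Unsatisfied": 2},
}


def decode(column, code):
    """Translate an integer code back into its original label."""
    inverse = {number: label for label, number in ENCODINGS[column].items()}
    return inverse[code]


print(decode("Membership Type", 1))
print(decode("Satisfaction Level", 2))
```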