# **01 | Data Cleaning Notebook**
## **Notebook Overview**
- **Author:** Trice Balthazar
- **Date:** 2025-12-03
- **Version:** 1.0
- **Purpose:** Transform raw data into clean, reliable datasets ready for preprocessing.

## **Table of Contents**
1. Setup & Imports
2. Load Raw Data
3. Cleaning Steps
4. Data Quality Checks
5. Export & Summary

# **1. Setup & Imports**

In [22]:
import pandas as pd
# import re

## **1.1. Function Definition**

In [23]:
# Function to normalize DataFrame columns
def normalize_columns(df):
    """
    Normalize column names by:
    - Converting to lowercase
    - Replacing spaces and certain special characters (- / #) with underscores
    - Removing all other special characters (keep letters, numbers, and underscores)
    - Collapsing multiple underscores into one
    - Stripping leading/trailing underscores
    """
    new_cols = (
        df.columns
        .str.lower()
        .str.replace(r'[\s\-\/#]+', '_', regex=True)      # replace space, -, /, # with _
        .str.replace(r'[^a-z0-9_]', '', regex=True)       # remove remaining special chars
        .str.replace(r'_+', '_', regex=True)              # collapse multiple underscores
        .str.strip('_')                                   # trim leading/trailing _
    )
    df.columns = new_cols
    return df
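
As a quick sanity check of the rules listed in the docstring, the sketch below applies the same four regex steps to single strings with the standard `re` module. `normalize_name` is a hypothetical helper that is not part of this notebook; the sample names are taken from the raw column headers.

```python
import re

def normalize_name(name):
    """Apply the same steps as normalize_columns to one string (hypothetical helper)."""
    name = name.lower()
    name = re.sub(r'[\s\-/#]+', '_', name)   # space, -, /, # become _
    name = re.sub(r'[^a-z0-9_]', '', name)   # drop remaining special characters
    name = re.sub(r'_+', '_', name)          # collapse repeated underscores
    return name.strip('_')                   # trim leading/trailing _

samples = [
    'User Name (Original Name)',
    'Time in Session (minutes)',
    'Country/Region Name',
    'Order #',
]
normalized = {raw: normalize_name(raw) for raw in samples}
for raw, clean in normalized.items():
    print(f'{raw!r} -> {clean!r}')
```

Parentheses are removed rather than replaced, so `Time in Session (minutes)` would become `time_in_session_minutes` if it were not renamed first, and a trailing `#` disappears entirely (`Order #` becomes `order`).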

# **2. Load Raw Data**

In [24]:
# Read the zoom attendee report, skipping the first 34 lines of the file
zoom_df_raw = pd.read_csv(r'..\data\raw\attendee_20250410.csv', skiprows=34)
# zoom_df_raw.head()
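
The hard-coded `skiprows=34` only works while the report keeps exactly the same number of lines above the column header. One possible alternative, sketched below, is to search for the header line instead. `find_header_row` is a hypothetical helper, and the sample lines are an invented layout used only for illustration; the one grounded assumption is that the header starts with the `Attended` column.

```python
def find_header_row(lines, first_column='Attended'):
    """Return the 0-based index of the first line whose first CSV field equals first_column."""
    for index, line in enumerate(lines):
        # Strip a possible byte-order mark, surrounding spaces and quotes
        first_field = line.lstrip('\ufeff').split(',')[0].strip().strip('"')
        if first_field == first_column:
            return index
    raise ValueError(f'No header row starting with {first_column!r} was found')

# Hypothetical report layout, for illustration only
sample_lines = [
    'Attendee Report',
    'Topic,Webinar ID',
    '',
    'Attendee Details',
    'Attended,User Name (Original Name),Email',
    'Yes,Example User,user@example.com',
]
header_row = find_header_row(sample_lines)
print(header_row)
```

With a real file, the lines could be read with `open(...)` and the result passed to `pd.read_csv` as `skiprows=header_row`.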

In [25]:
# Read the eventbrite report from the Excel file
eventbrite_df_raw = pd.read_excel(r'..\data\raw\Eventbrite Report-2025-04-10.xlsx')
# eventbrite_df_raw.head()
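
The raw-string paths above use Windows backslashes, which macOS and Linux treat as part of the file name rather than as folder separators. A minimal sketch of a portable alternative with `pathlib` follows; `RAW_DIR` is a hypothetical name, and the folder layout is assumed to match the paths used in this notebook.

```python
from pathlib import Path

# Assumes the notebook runs from a folder that sits next to `data/`
RAW_DIR = Path('..') / 'data' / 'raw'

zoom_path = RAW_DIR / 'attendee_20250410.csv'
eventbrite_path = RAW_DIR / 'Eventbrite Report-2025-04-10.xlsx'

# Report whether each input file is present before trying to load it
for path in (zoom_path, eventbrite_path):
    status = 'found' if path.exists() else 'missing'
    print(f'{path.name}: {status}')
```

Both `pd.read_csv` and `pd.read_excel` accept `Path` objects, so `zoom_path` and `eventbrite_path` could be passed to them directly.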

In [26]:
# Display info about dataset for a quick overview of the zoom data
zoom_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Attended                   599 non-null    object
 1   User Name (Original Name)  599 non-null    object
 2   Email                      599 non-null    object
 3   Join Time                  599 non-null    object
 4   Leave Time                 599 non-null    object
 5   Time in Session (minutes)  599 non-null    int64 
 6   Is Guest                   599 non-null    object
 7   Country/Region Name        599 non-null    object
dtypes: int64(1), object(7)
memory usage: 37.6+ KB


In [27]:
# Display info about dataset for a quick overview of the eventbrite data
eventbrite_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1487 entries, 0 to 1486
Data columns (total 45 columns):
 #   Column                                                                                                                                                                                                                                    Non-Null Count  Dtype         
---  ------                                                                                                                                                                                                                                    --------------  -----         
 0   Order #                                                                                                                                                                                                                                   1487 non-null   int64         
 1   Order Date                                                                    

In [28]:
# Display the columns of the zoom data
zoom_df_raw.columns

Index(['Attended', 'User Name (Original Name)', 'Email', 'Join Time',
       'Leave Time', 'Time in Session (minutes)', 'Is Guest',
       'Country/Region Name'],
      dtype='object')

In [29]:
# Display the columns of the eventbrite data
eventbrite_df_raw.columns

Index(['Order #', 'Order Date', 'First Name', 'Last Name', 'Email', 'Quantity',
       'Price Tier', 'Ticket Type', 'Attendee #', 'Group', 'Order Type',
       'Currency', 'Total Paid', 'Fees Paid', 'Eventbrite Fees',
       'Eventbrite Payment Processing', 'Attendee Status', 'Home Address 1',
       'Home Address 2', 'Home City', 'Home State', 'Home Zip', 'Home Country',
       'Home Phone', 'City', 'Province/Territory', 'Postal/Zip Code',
       'What is your current gender identity?', 'Please select your age group',
       'Which of the following best describes your background?',
       'I am (select all that apply)',
       'I am interested in information about (select all that apply)',
       'Are there any specific questions you would like to see answered during this event? *Please note – it may not be possible for us to address all questions during the event.',
       'How did you hear about this event?',
       'I consent to be contacted by the event organizer for feedback on m

# **3. Cleaning Steps**
- Create copies of the DataFrames so the raw data stays available if we need it.
- Keep only the columns needed for analysis.
- Ensure columns have the correct type.
- Remove duplicate rows.
- Save the cleaned data sets.

## **3.1. Zoom Data Cleaning**

In [30]:
# Copy the columns needed for analysis from the raw zoom data
zoom_df = zoom_df_raw[['User Name (Original Name)', 
                       'Email', 
                       'Join Time',
                       'Leave Time', 
                       'Time in Session (minutes)']].copy()

# Rename columns
zoom_df.rename(columns={'User Name (Original Name)': 'User Name',
                        'Time in Session (minutes)': 'Time in Session in minutes'}, inplace=True)

# Normalize columns
zoom_df = normalize_columns(zoom_df)

# Reset index
zoom_df.reset_index(drop=True, inplace=True)

# Convert join_time and leave_time to datetime64 using the 12-hour format in the report
zoom_df['join_time'] = pd.to_datetime(zoom_df['join_time'], format='%m/%d/%Y %I:%M:%S %p')
zoom_df['leave_time'] = pd.to_datetime(zoom_df['leave_time'], format='%m/%d/%Y %I:%M:%S %p')

# Remove duplicate values
zoom_df = zoom_df.drop_duplicates()

# zoom_df.head()
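
The format string passed to `pd.to_datetime` uses standard `strftime` directives, so it can be tried out with Python's own `datetime.strptime`. The sketch below is only an illustration: the timestamps are hypothetical values, not rows from the Zoom report, and `ZOOM_TIME_FORMAT` is a name introduced here.

```python
from datetime import datetime

ZOOM_TIME_FORMAT = '%m/%d/%Y %I:%M:%S %p'  # month/day/year, 12-hour clock, AM/PM

# Hypothetical timestamp in the layout the format string expects
sample = '04/10/2025 01:05:30 PM'
parsed = datetime.strptime(sample, ZOOM_TIME_FORMAT)
print(parsed)

# An hour of 13 is not valid for %I (12-hour clock), so parsing fails
try:
    datetime.strptime('04/10/2025 13:05:30 PM', ZOOM_TIME_FORMAT)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Passing an explicit `format` keeps parsing strict: a value in a different layout raises an error instead of being silently misread (for example, day and month being swapped).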

## **3.2. Eventbrite Data Cleaning**

In [31]:
# Copy the columns needed for analysis from the raw eventbrite data
eventbrite_df = eventbrite_df_raw[['First Name', 
                                   'Last Name', 
                                   'Email', 
                                   'City', 
                                   'Province/Territory', 
                                   'Please specify']].copy()
# Normalize columns
eventbrite_df = normalize_columns(eventbrite_df)

# Reset index
eventbrite_df.reset_index(drop=True, inplace=True)

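# Remove rows where the first_name is 'Info Requested'
# (.copy() makes the result independent, so the assignment to 'city' below does not raise SettingWithCopyWarning)
info_requested_mask = eventbrite_df['first_name'] == 'Info Requested'
eventbrite_df = eventbrite_df[~info_requested_mask].copy()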

# Replace na values with 'City not Specified' in city column
eventbrite_df['city'] = eventbrite_df['city'].fillna('City not Specified')

# Remove duplicates across first_name, last_name and email columns
eventbrite_df = eventbrite_df.drop_duplicates(subset=['first_name', 'last_name', 'email'])

# eventbrite_df.head()
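
To see how many rows each Eventbrite cleaning step removes, the same steps can be run on a small frame and the row counts compared. This is a sketch on invented data: `toy` and its names and emails are hypothetical and are not taken from the Eventbrite report.

```python
import pandas as pd

# Hypothetical rows: one exact duplicate person and one 'Info Requested' placeholder
toy = pd.DataFrame({
    'first_name': ['Ada', 'Ada', 'Info Requested', 'Grace'],
    'last_name': ['Lovelace', 'Lovelace', 'Info Requested', 'Hopper'],
    'email': ['ada@example.com', 'ada@example.com', 'x@example.com', 'grace@example.com'],
    'city': ['Toronto', None, None, None],
})

# Step 1: drop placeholder rows, then fill missing cities
without_placeholders = toy[toy['first_name'] != 'Info Requested'].copy()
without_placeholders['city'] = without_placeholders['city'].fillna('City not Specified')

# Step 2: keep the first row for each (first_name, last_name, email) combination
deduplicated = without_placeholders.drop_duplicates(subset=['first_name', 'last_name', 'email'])

placeholder_rows = len(toy) - len(without_placeholders)
duplicate_rows = len(without_placeholders) - len(deduplicated)
print(f'Placeholder rows removed: {placeholder_rows}')
print(f'Duplicate rows removed: {duplicate_rows}')
```

Because `drop_duplicates` keeps the first occurrence by default, the surviving row for a repeated attendee is whichever one appears first in the report, including its `city` value.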

# **4. Data Quality Checks**

In [32]:
# Check for missing values in the zoom data
zoom_df.isna().sum()

user_name                     0
email                         0
join_time                     0
leave_time                    0
time_in_session_in_minutes    0
dtype: int64

In [33]:
# Check for missing values in the eventbrite data
eventbrite_df.isna().sum()

# The 'please_specify' column is optional, so null values in this column are not an issue.

first_name               0
last_name                0
email                    0
city                     0
province_territory       0
please_specify        1074
dtype: int64

In [34]:
# Display zoom data info
zoom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   user_name                   599 non-null    object        
 1   email                       599 non-null    object        
 2   join_time                   599 non-null    datetime64[ns]
 3   leave_time                  599 non-null    datetime64[ns]
 4   time_in_session_in_minutes  599 non-null    int64         
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 23.5+ KB


In [35]:
# Display eventbrite data info
eventbrite_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1133 entries, 0 to 1484
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   first_name          1133 non-null   object
 1   last_name           1133 non-null   object
 2   email               1133 non-null   object
 3   city                1133 non-null   object
 4   province_territory  1133 non-null   object
 5   please_specify      59 non-null     object
dtypes: object(6)
memory usage: 62.0+ KB


In [36]:
# Check the shape of the eventbrite data
eventbrite_data_shape = eventbrite_df.shape
print(f'There are {eventbrite_data_shape[0]} rows and {eventbrite_data_shape[1]} columns in the cleaned eventbrite data set.')

There are 1133 rows and 6 columns in the cleaned eventbrite data set.


In [37]:
# Check the shape of the zoom data
zoom_data_shape = zoom_df.shape
print(f'There are {zoom_data_shape[0]} rows and {zoom_data_shape[1]} columns in the cleaned zoom data set.')

There are 599 rows and 5 columns in the cleaned zoom data set.
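
The checks above are read by eye. One way to make them repeatable is to collect the counts in a small report and assert on it, as in the sketch below. `quality_report` is a hypothetical helper and `toy` is invented data; the required columns and the duplicate key are assumed to match the ones used in section 3.2.

```python
import pandas as pd

def quality_report(df, required, key):
    """Count nulls in required columns and repeated key combinations (hypothetical helper)."""
    return {
        'missing_required': int(df[required].isna().sum().sum()),
        'duplicate_keys': int(df.duplicated(subset=key).sum()),
    }

# Hypothetical cleaned rows; 'please_specify' is optional and may be null
toy = pd.DataFrame({
    'first_name': ['Ada', 'Grace'],
    'last_name': ['Lovelace', 'Hopper'],
    'email': ['ada@example.com', 'grace@example.com'],
    'please_specify': [None, 'Newsletter'],
})

person_columns = ['first_name', 'last_name', 'email']
report = quality_report(toy, required=person_columns, key=person_columns)
print(report)

# Fail loudly if a later change to the cleaning steps breaks either guarantee
assert report['missing_required'] == 0
assert report['duplicate_keys'] == 0
```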


# **5. Export & Summary**

## **5.1. Export Cleaned Data**

In [38]:
# Export clean zoom data
zoom_df.to_csv(r'..\data\clean\clean-zoom-data.csv', index=False)

# Export clean eventbrite data
eventbrite_df.to_csv(r'..\data\clean\clean-eventbrite-data.csv', index=False)
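
Two details are worth keeping in mind when exporting: `to_csv` does not create missing folders, and CSV stores datetimes as plain text, so they need to be parsed again on load. The sketch below illustrates both on invented data, writing to a temporary folder rather than the project's `data/clean` folder; `toy` and `clean_dir` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned row with a datetime column
toy = pd.DataFrame({
    'user_name': ['Example User'],
    'join_time': pd.to_datetime(['2025-04-10 13:05:30']),
})

with tempfile.TemporaryDirectory() as tmp:
    clean_dir = Path(tmp) / 'data' / 'clean'
    clean_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    out_path = clean_dir / 'clean-zoom-data.csv'
    toy.to_csv(out_path, index=False)

    # parse_dates restores the datetime dtype when the file is read back
    reloaded = pd.read_csv(out_path, parse_dates=['join_time'])

print(reloaded.dtypes)
```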

## **5.2. Summary**
- Raw data: 599 Zoom records and 1,487 Eventbrite records  
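- Cleaned data: 599 Zoom records (no duplicates found) and 1,133 Eventbrite records after removing 'Info Requested' rows and duplicates
- Column names were normalized to lowercase with underscores
- Zoom join and leave times were converted to datetime
- No null values in the zoom data; the eventbrite data has nulls only in the optional please_specify column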