# 3.2 Guided Exercise: Data Cleaning (15 minutes)

**Objective**: Clean the dataset by addressing missing data, outliers, and errors.

**Steps**:
1. Load the dataset and display the first few rows.
2. Identify and handle missing data.
3. Identify and handle outliers.
4. Correct any data errors.

In [None]:
# Step 1: Load the dataset
import pandas as pd

# Load the dataset (!!! adapt the path to your directory !!!)
df = pd.read_csv('/mnt/data/nfl_stadiums.csv', encoding='ISO-8859-1')

# Display the first few rows of the dataframe
print(df.head())
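
Before cleaning, it helps to check which columns pandas parsed as text, because a numeric column stored with thousands separators loads as strings. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical, not taken from the stadium file) to show one way to spot and convert such a column.

```python
import pandas as pd

# Hypothetical toy data: 'capacity' mimics a numeric column stored as text
toy = pd.DataFrame({
    'name': ['Stadium A', 'Stadium B', 'Stadium C'],
    'capacity': ['65,000', '72,500', None],
    'open_year': [1995.0, 2002.0, None],
})

# Columns that pandas did not parse as numbers
text_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(text_cols)

# Strip the thousands separators and convert; unparseable values become NaN
capacity_num = pd.to_numeric(
    toy['capacity'].str.replace(',', '', regex=False),
    errors='coerce'
)
print(capacity_num.tolist())
```

Running `df.dtypes` on the real dataset gives the same kind of overview before any cleaning decision is made.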

In [None]:
# Step 2: Identify and handle missing data
# Checking for missing values
print(df.isnull().sum())

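# Fill missing values for numerical columns with the mean
# (plain assignment is used because chained inplace=True is unreliable in recent pandas versions)
# Note: a missing 'stadium_close' may simply mean the stadium is still in use
for col in ['stadium_open', 'stadium_close']:
    df[col] = df[col].fillna(df[col].mean())

# 'stadium_capacity' is stored as text with comma thousands separators, so it has
# no mean yet; it is converted to a number in Step 3

# Fill missing values for categorical columns with a placeholder
# ('stadium_capacity' is left out so that it can still be converted to a number)
categorical_cols = df.select_dtypes(exclude='number').columns.drop('stadium_capacity', errors='ignore')
df[categorical_cols] = df[categorical_cols].fillna('Unknown')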

print(df.isnull().sum())
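
The choice of fill value matters when a column contains extreme values. As a minimal sketch on made-up numbers (the column names and values are hypothetical), the example below reports the share of missing values per column and compares mean, median, and most-frequent-value imputation.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'capacity': [60000.0, 65000.0, 70000.0, None, 150000.0],
    'surface': ['Grass', None, 'Turf', 'Grass', 'Grass'],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# The mean is pulled upward by the single large value; the median is not
mean_filled = toy['capacity'].fillna(toy['capacity'].mean())
median_filled = toy['capacity'].fillna(toy['capacity'].median())
print(mean_filled[3], median_filled[3])

# For a categorical column, the most frequent value is an alternative to a placeholder
mode_filled = toy['surface'].fillna(toy['surface'].mode()[0])
print(mode_filled.tolist())
```

Here the mean fill is 86250.0 and the median fill is 67500.0, which shows how a single large value shifts mean imputation.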

In [None]:
# Step 3: Identify and handle outliers
# Here, we assume the 'stadium_capacity' column might have outliers
import numpy as np

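# 'stadium_capacity' is stored as text with comma thousands separators:
# strip the commas and convert to float (unparseable entries become NaN)
df['stadium_capacity'] = pd.to_numeric(
    df['stadium_capacity'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)
# Fill missing capacities with the mean so the filter below does not silently drop those rows
df['stadium_capacity'] = df['stadium_capacity'].fillna(df['stadium_capacity'].mean())

# Calculate the absolute z-scores
z_scores = np.abs((df['stadium_capacity'] - df['stadium_capacity'].mean()) / df['stadium_capacity'].std())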

# Set a threshold for z-score
threshold = 3

# Filter out outliers
df = df[z_scores < threshold]
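
In small samples, a single extreme value inflates the standard deviation so much that its own z-score can stay below 3. The interquartile range (IQR) rule is a common alternative; it is swapped in here for comparison and is not part of the exercise above. The values below are made up, and the 1.5 multiplier is the conventional choice rather than a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical capacities with one obviously extreme value
capacity = pd.Series([60000.0, 62000.0, 65000.0, 68000.0, 70000.0, 72000.0, 500000.0])

# z-score approach: the extreme value inflates the standard deviation
z_scores = np.abs((capacity - capacity.mean()) / capacity.std())
print(z_scores.max())  # stays below 3, so nothing is flagged

# IQR approach: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = capacity.quantile(0.25), capacity.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = capacity[(capacity < lower) | (capacity > upper)]
print(iqr_outliers.tolist())
```

With only seven rows the z-score filter keeps the 500,000 entry, while the IQR rule flags it.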

In [None]:
# Step 4: Correct any data errors
# For demonstration, let's assume 'stadium_azimuthangle' contains non-numeric entries such as 'Unknown'
# Convert those entries to NaN, then fill the gaps with the column mean
azimuth = pd.to_numeric(df['stadium_azimuthangle'], errors='coerce')
df = df.assign(stadium_azimuthangle=azimuth.fillna(azimuth.mean()))

print(df.head())
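
Data errors can also be values that are numeric but impossible. The sketch below assumes, purely for illustration, that a valid azimuth angle lies between 0 and 360 degrees; the series is made up, and the valid range for the real column should be checked against the dataset's documentation.

```python
import pandas as pd

# Hypothetical angles: 370 and -10 fall outside the assumed 0-360 range
angles = pd.Series([45.0, 370.0, -10.0, 180.0, None], name='azimuth')

# True where the value lies within the assumed valid range (NaN counts as False)
valid = angles.between(0, 360)

# Count entries that are present but out of range
n_invalid = int((~valid & angles.notna()).sum())
print(n_invalid)

# Replace out-of-range entries with NaN, then fill all gaps with the median
cleaned = angles.where(valid)
filled = cleaned.fillna(cleaned.median())
print(filled.tolist())
```

Counting the invalid entries before overwriting them keeps a record of how much of the column was corrected.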

# 3.3 Guided Exercise: Data Transformation (15 minutes)

**Objective**: Apply data transformation techniques to the dataset.

**Steps**:
1. Normalize and standardize numerical columns.
2. Encode categorical variables.

In [None]:
# Import necessary libraries
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

# Reload the raw dataset so this exercise runs on its own (!!! adapt the path to your directory !!!)
df = pd.read_csv('/mnt/data/nfl_stadiums.csv', encoding='ISO-8859-1')

In [None]:
# Step 1: Normalize and standardize numerical columns
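# Convert 'stadium_capacity' from text with comma thousands separators to float
df['stadium_capacity'] = df['stadium_capacity'].str.replace(',', '', regex=False).astype(float)

# Normalize 'stadium_capacity' (min-max scaling to the range [0, 1])
scaler = MinMaxScaler()
df['stadium_capacity_normalized'] = scaler.fit_transform(df[['stadium_capacity']]).ravel()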
print(df[['stadium_capacity', 'stadium_capacity_normalized']].head())

# Standardize 'stadium_capacity' (rescale to zero mean and unit variance)
scaler = StandardScaler()
df['stadium_capacity_standardized'] = scaler.fit_transform(df[['stadium_capacity']]).ravel()
print(df[['stadium_capacity', 'stadium_capacity_standardized']].head())
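
To make the two transformations concrete, the sketch below reproduces them by hand on three made-up values. With default settings, `MinMaxScaler` computes `(x - min) / (max - min)`, and `StandardScaler` computes `(x - mean) / std` using the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical capacities chosen so the results are easy to check by hand
capacity = pd.Series([60000.0, 70000.0, 80000.0], name='capacity')

# Min-max normalization: smallest value maps to 0, largest to 1
normalized = (capacity - capacity.min()) / (capacity.max() - capacity.min())
print(normalized.tolist())

# Standardization with the population standard deviation (ddof=0)
standardized = (capacity - capacity.mean()) / capacity.std(ddof=0)
print(standardized.round(4).tolist())
```

The standardized values have mean 0 and population standard deviation 1; computing them with `capacity.std()` instead would give slightly smaller magnitudes because of the `ddof=1` default.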

In [None]:
# Step 2: Encode categorical variables
# One-Hot Encode 'stadium_type'
df_encoded = pd.get_dummies(df, columns=['stadium_type'])
print(df_encoded.head())
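
`pd.get_dummies` creates columns only for the categories present in the data it is given, so two datasets encoded separately can end up with different columns. The sketch below shows one way to align them with `reindex`; the category values are hypothetical and may not match the real `stadium_type` column.

```python
import pandas as pd

# Hypothetical reference data and new data with fewer categories
train = pd.DataFrame({'stadium_type': ['indoor', 'outdoor', 'retractable']})
new = pd.DataFrame({'stadium_type': ['outdoor', 'outdoor']})

# Encode each frame; dtype=int gives 0/1 columns instead of booleans
train_encoded = pd.get_dummies(train, columns=['stadium_type'], dtype=int)
new_encoded = pd.get_dummies(new, columns=['stadium_type'], dtype=int)
print(list(new_encoded.columns))  # only the 'outdoor' column exists

# Align the new data to the reference columns, filling absent categories with 0
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_aligned)
```

After alignment both frames share the same three columns, which is what most models expect at prediction time.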