
# Data Cleaning: Handling Missing Values

This notebook walks through common pandas techniques for detecting, removing, and imputing missing values.


In [None]:
import pandas as pd
import numpy as np

%matplotlib inline


In [None]:
# Create a DataFrame with missing values

data = {
    'customer': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'age': [25, np.nan, 30, 35, np.nan],
    'city': ['Jakarta', 'Depok', np.nan, 'Bogor', 'Tangerang'],
    'purchase_amount': [100.0, 200.0, np.nan, 150.0, 120.0]
}

df = pd.DataFrame(data)
df



## Detecting Missing Values

`isnull()` (an alias of `isna()`) returns a boolean mask that is `True` wherever a value is missing. Summing that mask gives the number of missing values in each column.


In [None]:
# Count missing values per column
df.isnull().sum()
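
Raw counts are easier to interpret next to the share of each column that is missing and the rows that are affected. The following is a minimal sketch, assuming the same example data as above; the names `missing_share` and `rows_with_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame created earlier in this notebook
df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'age': [25, np.nan, 30, 35, np.nan],
    'city': ['Jakarta', 'Depok', np.nan, 'Bogor', 'Tangerang'],
    'purchase_amount': [100.0, 200.0, np.nan, 150.0, 120.0],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```

Taking the mean of the mask instead of the sum makes columns comparable regardless of how many rows the DataFrame has.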



## Removing Missing Values

Dropping rows or columns that contain missing values is appropriate when the incomplete records are not needed for the analysis and enough data remains afterwards.


In [None]:
# Drop rows with any missing values
df_dropped = df.dropna()
df_dropped
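
Dropping every row with any missing value can discard more data than necessary. As a hedged sketch of the narrower options that `dropna()` offers, the example below uses the `subset`, `thresh`, and `axis` parameters on the same example data; the result names and the threshold of 3 are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame created earlier in this notebook
df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'age': [25, np.nan, 30, 35, np.nan],
    'city': ['Jakarta', 'Depok', np.nan, 'Bogor', 'Tangerang'],
    'purchase_amount': [100.0, 200.0, np.nan, 150.0, 120.0],
})

# Drop a row only when 'age' is missing; gaps in other columns are kept
dropped_by_age = df.dropna(subset=['age'])

# Keep rows that have at least 3 non-missing values (illustrative threshold)
kept_thresh = df.dropna(thresh=3)

# Drop columns (instead of rows) that contain any missing value
dropped_cols = df.dropna(axis=1)

print(dropped_by_age)
print(kept_thresh)
print(dropped_cols)
```

On this data, dropping columns leaves only `customer`, which shows how aggressive column-wise removal can be on a small table.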



## Imputing Missing Values

When removing data isn’t practical, you can fill missing values with a summary statistic: the mean or median for numeric columns, or the mode (most frequent value) for categorical columns.


In [None]:
# Fill missing ages with the mean, missing cities with the mode,
# and missing purchase amounts with the median

mean_age = df['age'].mean()
mode_city = df['city'].mode()[0]

imputed_df = df.copy()
imputed_df['age'] = imputed_df['age'].fillna(mean_age)
imputed_df['city'] = imputed_df['city'].fillna(mode_city)
imputed_df['purchase_amount'] = imputed_df['purchase_amount'].fillna(imputed_df['purchase_amount'].median())

imputed_df
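
One caveat with mode imputation on this data: every city appears exactly once, so `mode()` returns all four cities in sorted order and `[0]` simply picks the first of them. The sketch below shows one possible alternative, assuming an explicit placeholder is acceptable for the analysis. The label `'Unknown'` and the name `imputed` are hypothetical choices, not part of the original notebook. It also checks that no missing values remain after filling.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame created earlier in this notebook
df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'age': [25, np.nan, 30, 35, np.nan],
    'city': ['Jakarta', 'Depok', np.nan, 'Bogor', 'Tangerang'],
    'purchase_amount': [100.0, 200.0, np.nan, 150.0, 120.0],
})

# All observed cities tie for most frequent, so mode() returns several values
city_modes = df['city'].mode()
print(len(city_modes), "cities tie for the mode")

# Fill each column with its own value by passing a dict to fillna()
imputed = df.fillna({
    'age': df['age'].mean(),
    'city': 'Unknown',  # hypothetical placeholder instead of an arbitrary mode
    'purchase_amount': df['purchase_amount'].median(),
})

# Confirm that imputation left no missing values behind
print(imputed.isna().sum())
```

Passing a dict to `fillna()` keeps the per-column choices in one place, which makes it easier to see which statistic was applied to which column.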
