# Data Cleaning

## Learning Objectives

By the end of this notebook, you will be able to:

1. Detect and handle missing data (NaN values)
2. Identify and remove duplicate rows
3. Convert and manage data types
4. Use string methods for text data cleaning
5. Apply transformations to clean and standardize data

---

## Setup

In [None]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)

In [None]:
# Create a messy DataFrame for cleaning exercises
messy_data = pd.DataFrame({
    'name': ['  Alice  ', 'BOB', 'charlie', 'Diana', 'eve', 'Alice', None, 'Frank'],
    'email': ['alice@email.com', 'bob@EMAIL.COM', 'charlie@email.com', None, 'eve@email.com', 'alice@email.com', 'grace@email.com', 'FRANK@email.com'],
    'age': ['25', '30', 'thirty-five', '28', '32', '25', '40', None],
    'salary': [50000, 60000, None, 55000, np.nan, 50000, 70000, 65000],
    'department': ['Sales', 'MARKETING', 'sales', 'Engineering', 'Marketing', 'Sales', 'HR', 'engineering'],
    'hire_date': ['2020-01-15', '2019/06/01', '2021-03-20', '20-11-2018', '2022-02-28', '2020-01-15', '2017-08-10', 'invalid']
})

print("Messy Data:")
print(messy_data)
print(f"\nData types:\n{messy_data.dtypes}")

---

## 1. Handling Missing Data

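Pandas represents missing data as `NaN` (Not a Number) in numeric columns, `None` in object columns, and `NaT` in datetime columns. `isnull()` and `isna()` detect all three.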
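
As a small illustration of how missing markers are stored, the sketch below builds two throwaway Series (the names `numbers` and `dates` are only for this example) and checks them with `isna()`.

```python
import pandas as pd

# None in a numeric Series is stored as NaN, which forces a float dtype
numbers = pd.Series([1, None, 3])

# A missing entry in a datetime Series is stored as NaT
dates = pd.to_datetime(pd.Series(['2020-01-15', None]))

print(numbers.dtype)
print(numbers.isna().tolist())
print(dates.isna().tolist())
```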

### 1.1 Detecting Missing Values

In [None]:
# Check for missing values
print("Missing values per column:")
print(messy_data.isnull().sum())

In [None]:
# Get percentage of missing values
missing_pct = (messy_data.isnull().sum() / len(messy_data)) * 100
print("Missing value percentage:")
print(missing_pct)

In [None]:
# Check if any value is missing in each row
print("Rows with any missing values:")
print(messy_data[messy_data.isnull().any(axis=1)])

In [None]:
# isnull() is an alias for isna(); the two are interchangeable
print("Total missing values:", messy_data.isna().sum().sum())

### 1.2 Dropping Missing Values

In [None]:
# Drop rows with ANY missing values
cleaned = messy_data.dropna()
print(f"Original rows: {len(messy_data)}, After dropna: {len(cleaned)}")
print(cleaned)

In [None]:
# Drop rows where ALL values are missing
cleaned = messy_data.dropna(how='all')
print(f"Rows after dropping all-NA rows: {len(cleaned)}")

In [None]:
# Drop rows with missing values in specific columns
cleaned = messy_data.dropna(subset=['name', 'email'])
print("After dropping rows missing name or email:")
print(cleaned)

In [None]:
# Drop columns that contain any missing values
cleaned = messy_data.dropna(axis=1)
print("Columns without missing values:")
print(cleaned)

In [None]:
# Keep only rows with at least N non-null values (thresh)
cleaned = messy_data.dropna(thresh=5)  # Rows with fewer than 5 non-null values are dropped
print(f"Rows with at least 5 non-null values: {len(cleaned)}")
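
To make the `thresh` behavior concrete, here is a minimal sketch on a made-up three-row frame (`demo` is hypothetical, not part of the notebook data). Each row has a different count of non-null values.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': ['x', None, None],
    'c': [1.5, 2.5, np.nan],
})

# Row 0 has 3 non-null values, row 1 has 2, row 2 has 0
kept = demo.dropna(thresh=2)
print(kept.index.tolist())
```

Only the rows that meet the threshold survive, so `thresh` is a "keep" rule rather than a "drop" rule.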

### 1.3 Filling Missing Values

In [None]:
# Create a simpler DataFrame for fill examples
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': ['x', None, 'y', 'z', None]
})
print("Original:")
print(df)

In [None]:
# Fill with a constant value
print("Fill with 0:")
print(df.fillna(0))

In [None]:
# Fill with different values per column
fill_values = {'A': 0, 'B': -1, 'C': 'unknown'}
print("Fill with different values:")
print(df.fillna(fill_values))

In [None]:
# Fill with a column statistic (the mean here; median() works the same way)
df_numeric = df[['A', 'B']].copy()
print("Fill with mean:")
print(df_numeric.fillna(df_numeric.mean()))
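
The same pattern extends to the median and the mode. The sketch below uses a small illustrative frame (`demo`, with assumed column names) where an outlier makes the median a more robust fill than the mean, and where the mode fills a text column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'score': [1.0, 2.0, np.nan, 100.0],
    'label': ['x', None, 'y', 'x'],
})

# The median ignores NaN and is less sensitive to the outlier (100.0)
demo['score_filled'] = demo['score'].fillna(demo['score'].median())

# mode() returns a Series (there can be ties), so take the first entry
demo['label_filled'] = demo['label'].fillna(demo['label'].mode().iloc[0])

print(demo)
```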

In [None]:
# Forward fill (use previous value)
print("Forward fill (ffill):")
print(df.ffill())

In [None]:
# Backward fill (use next value)
print("Backward fill (bfill):")
print(df.bfill())

In [None]:
# Interpolate numeric values
print("Interpolate:")
print(df[['A', 'B']].interpolate())
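
For a closer look at what `interpolate()` does with its default settings, this sketch uses a standalone Series `s` (illustrative only). Gaps between known values are filled linearly, while a leading gap has no earlier value to start from and stays missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 8.0])

# Default method is linear and fills in the forward direction only
filled = s.interpolate()
print(filled)
```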

---

## 2. Handling Duplicates

In [None]:
# Check the messy data for duplicates
print("Original data:")
print(messy_data[['name', 'email', 'age']])

### 2.1 Detecting Duplicates

In [None]:
# Check for duplicate rows
print("Duplicate rows:")
print(messy_data.duplicated())

In [None]:
# Show duplicate rows
print("Rows that are duplicates:")
print(messy_data[messy_data.duplicated(keep=False)])

In [None]:
# Check duplicates in specific columns
print("Duplicate emails:")
print(messy_data[messy_data.duplicated(subset=['email'], keep=False)])

### 2.2 Removing Duplicates

In [None]:
# Remove duplicate rows (keep first occurrence)
cleaned = messy_data.drop_duplicates()
print(f"Original: {len(messy_data)}, After removing duplicates: {len(cleaned)}")
print(cleaned)

In [None]:
# Remove duplicates based on specific columns
cleaned = messy_data.drop_duplicates(subset=['email'], keep='first')
print("After removing duplicate emails (keep first):")
print(cleaned)

In [None]:
# Keep last occurrence instead of first
cleaned = messy_data.drop_duplicates(subset=['email'], keep='last')
print("After removing duplicate emails (keep last):")
print(cleaned)
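
Duplicate detection compares values exactly, so differences in case or stray whitespace can hide duplicates. The sketch below, on a hypothetical `emails` frame, counts duplicates before and after normalizing into an assumed helper column `email_norm`.

```python
import pandas as pd

emails = pd.DataFrame({
    'email': ['bob@email.com', 'BOB@EMAIL.COM ', 'eve@email.com']
})

# Exact comparison: no duplicates are found
raw_dupes = int(emails.duplicated(subset=['email']).sum())

# Normalize case and whitespace, then compare again
emails['email_norm'] = emails['email'].str.strip().str.lower()
norm_dupes = int(emails.duplicated(subset=['email_norm']).sum())

print(f"Duplicates before normalizing: {raw_dupes}, after: {norm_dupes}")
```

This is one reason a cleaning pipeline usually standardizes text before it removes duplicates.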

---

## 3. Data Type Conversions

In [None]:
# Check current data types
print("Current data types:")
print(messy_data.dtypes)

### 3.1 Converting to Numeric

In [None]:
# The 'age' column stores numbers as strings and includes a non-numeric entry
print("Age column values:")
print(messy_data['age'])

In [None]:
# pd.to_numeric with errors='coerce' converts invalid values to NaN
numeric_age = pd.to_numeric(messy_data['age'], errors='coerce')
print("Converted to numeric (errors='coerce'):")
print(numeric_age)

In [None]:
# Convert column using astype (when data is clean)
df = pd.DataFrame({'numbers': ['1', '2', '3', '4', '5']})
df['numbers'] = df['numbers'].astype(int)
print(f"Converted type: {df['numbers'].dtype}")

### 3.2 Converting to DateTime

In [None]:
# The hire_date column has various formats
print("Hire date values:")
print(messy_data['hire_date'])

In [None]:
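# pd.to_datetime with errors='coerce' converts unparseable values to NaT.
# Note: pandas 2.0+ infers a single format from the first value, so entries
# written in a different format may also become NaT.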
dates = pd.to_datetime(messy_data['hire_date'], errors='coerce')
print("Converted dates:")
print(dates)
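
When a column mixes several known date formats, one possible approach is to try each format in turn and keep the first successful parse. The sketch below assumes the three formats listed in `formats`; both that list and the sample Series `raw` are illustrative choices, not a general solution.

```python
import pandas as pd

raw = pd.Series(['2020-01-15', '2019/06/01', '20-11-2018', 'invalid'])

# Candidate formats, tried in order (an assumption about this data)
formats = ['%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y']

# Start with the first format, then fill remaining NaT from later attempts
parsed = pd.to_datetime(raw, format=formats[0], errors='coerce')
for fmt in formats[1:]:
    attempt = pd.to_datetime(raw, format=fmt, errors='coerce')
    parsed = parsed.fillna(attempt)

print(parsed)
```

Values that match none of the formats remain `NaT`, which keeps genuinely invalid entries visible for later handling.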

In [None]:
# Specify format for consistent dates
date_strings = pd.Series(['2020-01-15', '2019-06-01', '2021-03-20'])
dates = pd.to_datetime(date_strings, format='%Y-%m-%d')
print("Dates with format:")
print(dates)

In [None]:
# Extract date components
df = pd.DataFrame({'date': dates})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
print("Date components:")
print(df)

### 3.3 Converting to Category

In [None]:
# Convert department to category type (memory efficient for repeated values)
df = messy_data.copy()
print(f"Before: {df['department'].dtype}")
print(f"Memory: {df['department'].memory_usage(deep=True)} bytes")

df['department'] = df['department'].astype('category')
print(f"\nAfter: {df['department'].dtype}")
print(f"Memory: {df['department'].memory_usage(deep=True)} bytes")
print(f"Categories: {df['department'].cat.categories.tolist()}")
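
The memory benefit of the category type shows up when a few distinct values repeat many times. The following sketch builds a larger illustrative Series (the size `n` and the labels are arbitrary) and compares the two representations.

```python
import pandas as pd

n = 10_000
departments = pd.Series(['Sales', 'Marketing', 'Engineering', 'HR'] * n)

# Plain string storage keeps one string per row
object_bytes = departments.memory_usage(deep=True)

# Category storage keeps small integer codes plus one copy of each label
category_bytes = departments.astype('category').memory_usage(deep=True)

print(f"strings:  {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

On a frame with only a handful of rows, the overhead of storing the category labels can outweigh the savings.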

---

## 4. String Methods

Pandas provides the `.str` accessor for string operations on Series.

In [None]:
# Create a fresh copy for string operations
df = messy_data.copy()
print("Names before cleaning:")
print(df['name'])

### 4.1 Case Conversion

In [None]:
# Convert to lowercase
print("Lowercase:")
print(df['name'].str.lower())

In [None]:
# Convert to uppercase
print("Uppercase:")
print(df['name'].str.upper())

In [None]:
# Title case
print("Title case:")
print(df['name'].str.title())

### 4.2 Stripping Whitespace

In [None]:
# Strip leading and trailing whitespace
print("Before strip:")
print(repr(df['name'].iloc[0]))  # repr shows the spaces

print("\nAfter strip:")
print(repr(df['name'].str.strip().iloc[0]))

In [None]:
# Clean names: strip whitespace and title case
df['name_clean'] = df['name'].str.strip().str.title()
print("Cleaned names:")
print(df[['name', 'name_clean']])

### 4.3 String Replacement

In [None]:
# Replace specific strings
df['department_clean'] = df['department'].str.replace('MARKETING', 'Marketing')
print(df[['department', 'department_clean']])

In [None]:
# Better approach: standardize all department names
df['department_clean'] = df['department'].str.strip().str.title()
print("Standardized departments:")
print(df['department_clean'].unique())

### 4.4 String Splitting and Extraction

In [None]:
# Split email to get username and domain
email_parts = df['email'].str.split('@', expand=True)
email_parts.columns = ['username', 'domain']
print("Email parts:")
print(email_parts)

In [None]:
# Extract using regex
# Extract just the username part
df['username'] = df['email'].str.extract(r'([^@]+)@')
print("Extracted usernames:")
print(df[['email', 'username']])

### 4.5 String Matching

In [None]:
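# Check if string contains a literal substring
# regex=False stops the '.' in 'email.com' from acting as a regex wildcard
print("Contains 'email.com':")
print(df['email'].str.contains('email.com', case=False, regex=False, na=False))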

In [None]:
# Check if string starts/ends with pattern
print("Starts with 'a' (case insensitive):")
print(df['name'].str.lower().str.startswith('a', na=False))
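
Because `str.contains` treats its pattern as a regular expression by default, a `.` matches any character. This short sketch, on a made-up Series `s`, contrasts regex matching with literal matching.

```python
import pandas as pd

s = pd.Series(['a@email.com', 'b@emailXcom', None])

# Default regex=True: '.' matches any single character, including 'X'
as_regex = s.str.contains('email.com', na=False)

# regex=False: the pattern is matched as a literal substring
as_literal = s.str.contains('email.com', regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
```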

---

## 5. Replacing Values

In [None]:
# Replace specific values
df = pd.DataFrame({
    'status': ['active', 'inactive', 'pending', 'Active', 'INACTIVE', 'unknown'],
    'grade': ['A', 'B', 'C', 'D', 'F', 'incomplete']
})
print("Original:")
print(df)

In [None]:
# Replace single value
df['status_clean'] = df['status'].replace('unknown', np.nan)
print("After replacing 'unknown':")
print(df)

In [None]:
# Replace multiple values with a mapping dictionary
status_mapping = {
    'active': 'Active',
    'inactive': 'Inactive',
    'Active': 'Active',
    'INACTIVE': 'Inactive',
    'pending': 'Pending',
    'unknown': np.nan
}
df['status_clean'] = df['status'].replace(status_mapping)
print("After mapping:")
print(df[['status', 'status_clean']])
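
A related option is `Series.map`, which differs from `replace` in how it treats values that are missing from the dictionary. The sketch below lowercases first so that the hypothetical `mapping` needs only one key per status, then compares the two methods.

```python
import pandas as pd

status = pd.Series(['active', 'INACTIVE', 'archived'])
mapping = {'active': 'Active', 'inactive': 'Inactive'}

# Lowercasing first avoids listing every case variant in the mapping
normalized = status.str.lower()

replaced = normalized.replace(mapping)  # unmapped values are left as-is
mapped = normalized.map(mapping)        # unmapped values become NaN

print(replaced.tolist())
print(mapped.tolist())
```

`replace` is the safer choice when unknown values should pass through, while `map` makes unexpected values easy to spot as missing.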

In [None]:
# Replace using regex
df = pd.DataFrame({'text': ['Price: $100', 'Cost: $250', 'Value: $75']})
df['numbers'] = df['text'].str.replace(r'[^\d]', '', regex=True)
print("Extract numbers:")
print(df)
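
Stripping non-digits leaves the result as text. If numeric values are needed, one possible follow-up is to capture the digits with `str.extract` and convert them with `pd.to_numeric`. The Series `prices` and the `\$(\d+)` pattern below are assumptions for this example.

```python
import pandas as pd

prices = pd.Series(['Price: $100', 'Cost: $250', 'Value: n/a'])

# Capture the digits after '$'; rows without a match yield NaN
digits = prices.str.extract(r'\$(\d+)', expand=False)

# Convert the captured text to numbers, keeping NaN for non-matches
amounts = pd.to_numeric(digits, errors='coerce')

print(amounts)
```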

---

## 6. Applying Functions for Custom Cleaning

In [None]:
# Apply a function to a column
def clean_name(name):
    if pd.isna(name):
        return None
    return name.strip().title()

df = messy_data.copy()
df['name_clean'] = df['name'].apply(clean_name)
print("Applied custom function:")
print(df[['name', 'name_clean']])

In [None]:
# Apply lambda function
df['email_lower'] = df['email'].apply(lambda x: x.lower() if pd.notna(x) else x)
print("Email lowercased:")
print(df[['email', 'email_lower']])

In [None]:
# Apply function to entire DataFrame
def clean_string_columns(df):
    df_clean = df.copy()
    for col in df_clean.select_dtypes(include=['object']).columns:
        df_clean[col] = df_clean[col].str.strip().str.lower()
    return df_clean

# Note: .str methods pass missing values (None/NaN) through unchanged,
# but any non-string value in an object column also comes back as NaN.
print(clean_string_columns(messy_data))
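
As a quick check of how the `.str` accessor treats missing entries, the sketch below chains two string methods over an illustrative Series `names` that contains `None`.

```python
import pandas as pd

names = pd.Series(['  alice  ', None, 'BOB'])

# Missing entries are skipped rather than raising an error
cleaned_names = names.str.strip().str.title()

print(cleaned_names.tolist())
```

This is why the vectorized `.str` version needs no explicit `pd.isna` check, unlike a plain Python function passed to `apply`.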

---

## 7. Putting It All Together

In [None]:
# Complete data cleaning pipeline
def clean_employee_data(df):
    """
    Clean the messy employee data.
    """
    df_clean = df.copy()
    
    # 1. Clean names: strip whitespace, title case
    df_clean['name'] = df_clean['name'].str.strip().str.title()
    
    # 2. Clean emails: lowercase
    df_clean['email'] = df_clean['email'].str.lower()
    
    # 3. Convert age to numeric, coercing errors to NaN
    df_clean['age'] = pd.to_numeric(df_clean['age'], errors='coerce')
    
    # 4. Standardize department names
    df_clean['department'] = df_clean['department'].str.strip().str.title()
    
    # 5. Convert hire_date to datetime
    df_clean['hire_date'] = pd.to_datetime(df_clean['hire_date'], errors='coerce')
    
    # 6. Remove duplicate rows based on email
    df_clean = df_clean.drop_duplicates(subset=['email'], keep='first')
    
    # 7. Drop rows missing critical information
    df_clean = df_clean.dropna(subset=['name', 'email'])
    
    # 8. Fill remaining missing values
    df_clean['salary'] = df_clean['salary'].fillna(df_clean['salary'].median())
    df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())
    
    return df_clean

# Apply the cleaning pipeline
cleaned_data = clean_employee_data(messy_data)
print("Cleaned Data:")
print(cleaned_data)
print(f"\nData types:\n{cleaned_data.dtypes}")
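
After running a pipeline like this, it can help to confirm that the result meets the expectations the steps were meant to enforce. The helper below is a hypothetical sketch (`find_cleaning_problems` and its parameters are not part of pandas or the notebook) that reports duplicate keys and missing required fields.

```python
import pandas as pd

def find_cleaning_problems(df, key='email', required=('name', 'email')):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df[key].duplicated().any():
        problems.append(f"duplicate values in '{key}'")
    for col in required:
        if df[col].isna().any():
            problems.append(f"missing values in '{col}'")
    return problems

good = pd.DataFrame({'name': ['Alice', 'Bob'], 'email': ['a@x.com', 'b@x.com']})
bad = pd.DataFrame({'name': ['Alice', None], 'email': ['a@x.com', 'a@x.com']})

print(find_cleaning_problems(good))
print(find_cleaning_problems(bad))
```

Returning a list instead of raising keeps the check usable both interactively and inside an `assert`.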

---

## Exercises

In [None]:
# Create exercise data
exercise_data = pd.DataFrame({
    'product_name': ['  Widget A  ', 'WIDGET B', 'gadget c', 'Widget A', None, 'Gadget D'],
    'price': ['19.99', '29.99', 'thirty', '19.99', '49.99', np.nan],
    'quantity': [100, None, 50, 100, 75, 25],
    'category': ['electronics', 'ELECTRONICS', 'home', 'Electronics', 'Home', 'home'],
    'date_added': ['2023-01-15', '2023/02/20', '15-03-2023', '2023-01-15', 'invalid', '2023-04-10']
})
print("Exercise Data:")
print(exercise_data)

### Exercise 1: Missing Data Analysis

1. Find the total number of missing values in the exercise_data
2. Calculate the percentage of missing values for each column
3. Identify which rows have any missing values

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Total missing values
total_missing = exercise_data.isnull().sum().sum()
print(f"Total missing values: {total_missing}")

# 2. Percentage missing per column
missing_pct = (exercise_data.isnull().sum() / len(exercise_data)) * 100
print(f"\nMissing percentage per column:")
print(missing_pct)

# 3. Rows with any missing values
print(f"\nRows with missing values:")
print(exercise_data[exercise_data.isnull().any(axis=1)])
```
</details>

### Exercise 2: Clean Product Names

Clean the 'product_name' column:
1. Strip whitespace
2. Convert to title case
3. Handle None values appropriately

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
df = exercise_data.copy()
# .str methods leave missing values (None/NaN) in place, so no extra handling is needed
df['product_name_clean'] = df['product_name'].str.strip().str.title()
print("Cleaned product names:")
print(df[['product_name', 'product_name_clean']])
```
</details>

### Exercise 3: Data Type Conversion

1. Convert the 'price' column to numeric (handling errors)
2. Convert the 'date_added' column to datetime (handling errors)
3. Convert 'category' to a categorical type after standardizing

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
df = exercise_data.copy()

# 1. Convert price to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')
print(f"Price dtype: {df['price'].dtype}")
print(df['price'])

# 2. Convert date_added to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
print(f"\nDate dtype: {df['date_added'].dtype}")
print(df['date_added'])

# 3. Standardize and convert category
df['category'] = df['category'].str.strip().str.title().astype('category')
print(f"\nCategory dtype: {df['category'].dtype}")
print(f"Categories: {df['category'].cat.categories.tolist()}")
```
</details>

### Exercise 4: Remove Duplicates

1. After cleaning the product names, identify duplicate products
2. Remove duplicates, keeping the first occurrence
3. Report how many rows were removed

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
df = exercise_data.copy()

# Clean product names first
df['product_name'] = df['product_name'].str.strip().str.title()

# 1. Identify duplicates
print("Duplicate products:")
print(df[df.duplicated(subset=['product_name'], keep=False)])

# 2. Remove duplicates
original_count = len(df)
df_deduped = df.drop_duplicates(subset=['product_name'], keep='first')

# 3. Report
removed = original_count - len(df_deduped)
print(f"\nOriginal rows: {original_count}")
print(f"After deduplication: {len(df_deduped)}")
print(f"Rows removed: {removed}")
```
</details>

### Exercise 5: Complete Cleaning Pipeline

Write a function that performs a complete cleaning of the exercise_data:
1. Clean product names (strip, title case)
2. Convert price to numeric
3. Fill missing quantities with 0
4. Standardize category names
5. Convert dates to datetime
6. Remove duplicates based on product name
7. Drop rows where product_name is null

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
def clean_product_data(df):
    """
    Complete cleaning pipeline for product data.
    """
    df_clean = df.copy()
    
    # 1. Clean product names
    df_clean['product_name'] = df_clean['product_name'].str.strip().str.title()
    
    # 2. Convert price to numeric
    df_clean['price'] = pd.to_numeric(df_clean['price'], errors='coerce')
    
    # 3. Fill missing quantities with 0
    df_clean['quantity'] = df_clean['quantity'].fillna(0).astype(int)
    
    # 4. Standardize category names
    df_clean['category'] = df_clean['category'].str.strip().str.title()
    
    # 5. Convert dates to datetime
    df_clean['date_added'] = pd.to_datetime(df_clean['date_added'], errors='coerce')
    
    # 6. Remove duplicates
    df_clean = df_clean.drop_duplicates(subset=['product_name'], keep='first')
    
    # 7. Drop rows with null product_name
    df_clean = df_clean.dropna(subset=['product_name'])
    
    return df_clean

# Apply the cleaning function
cleaned = clean_product_data(exercise_data)
print("Cleaned Data:")
print(cleaned)
print(f"\nData types:\n{cleaned.dtypes}")
```
</details>

---

## Summary

In this notebook, you learned:

1. **Missing Data**:
   - Detection: `isnull()`, `isna()`, `sum()`
   - Removal: `dropna()` with how, subset, thresh options
   - Filling: `fillna()`, `ffill()`, `bfill()`, `interpolate()`

2. **Duplicates**:
   - Detection: `duplicated()`
   - Removal: `drop_duplicates()` with subset, keep options

3. **Data Type Conversions**:
   - `pd.to_numeric()`, `pd.to_datetime()` with errors='coerce'
   - `astype()` for clean data
   - Category type for memory efficiency

4. **String Methods** (`.str` accessor):
   - Case: `lower()`, `upper()`, `title()`
   - Whitespace: `strip()`, `lstrip()`, `rstrip()`
   - Manipulation: `replace()`, `split()`, `extract()`
   - Matching: `contains()`, `startswith()`, `endswith()`

5. **Value Replacement**:
   - `replace()` with values or dictionaries
   - Regex support for pattern matching

---

## Next Steps

Continue to the next notebook: **[05_groupby_aggregation.ipynb](05_groupby_aggregation.ipynb)** to learn how to group data and perform aggregations.