# Day 17: Data Cleaning in Pandas

---

## Objectives
- Understand why data cleaning is important
- Handle missing values (`NaN`) and duplicates
- Rename columns and standardize formats
- Convert data types
- Detect and handle outliers
- Clean string columns

---

## 1. Inspecting the Data


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('datasets/employee_data.csv')

# Inspect data
print("First 5 rows:")
print(df.head())

print("\nData info:")
df.info()  # info() prints its summary directly and returns None, so no print() wrapper

print("\nSummary statistics:")
print(df.describe(include='all'))

print("\nCheck for missing values:")
print(df.isnull().sum())
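
Raw counts of missing values are easier to judge relative to the size of the table. The sketch below, using a small made-up DataFrame (`toy` is hypothetical and only stands in for the employee dataset), shows one way to express missing values as a percentage per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for employee_data.csv
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", None, "Dee"],
    "Salary": [50000.0, np.nan, 62000.0, np.nan],
})

# isnull() yields booleans; their mean is the fraction of missing entries
missing_pct = toy.isnull().mean().mul(100).round(1)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

A column that is mostly empty may be a better candidate for dropping than for filling.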


## 2. Handling Missing Values


In [None]:
# Drop rows with any missing values (result stored in df_drop; df itself is unchanged)
df_drop = df.dropna()
print("After dropping rows with missing values:", df_drop.shape)

# Fill missing values
# For numeric columns, fill with mean
if 'Salary' in df.columns:
    df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# For categorical columns, fill with mode
if 'Department' in df.columns:
    df['Department'] = df['Department'].fillna(df['Department'].mode()[0])

print("Missing values after filling:")
print(df.isnull().sum())
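
The mean is pulled toward extreme values, so the median is often a safer fill for skewed columns such as salaries. As a minimal sketch on made-up data (the `toy` frame and the new column names are assumptions for illustration), the example below compares an overall median fill with a per-department median fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the ones used above
toy = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "IT"],
    "Salary": [40000.0, np.nan, 70000.0, 90000.0, np.nan],
})

# Overall median: less sensitive to extreme salaries than the mean
overall_median = toy["Salary"].median()
toy["Salary_median_fill"] = toy["Salary"].fillna(overall_median)

# Group-wise fill: each missing salary gets its own department's median
dept_median = toy.groupby("Department")["Salary"].transform("median")
toy["Salary_group_fill"] = toy["Salary"].fillna(dept_median)

print(toy)
```

Which fill is appropriate depends on the data; the group-wise version assumes that salaries within a department are more alike than salaries across departments.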


## 3. Handling Duplicates


In [None]:
# Check for duplicates
duplicates = df.duplicated()
print("Number of duplicate rows:", duplicates.sum())

# Drop duplicate rows
df = df.drop_duplicates()
print("Shape after dropping duplicates:", df.shape)
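
`duplicated()` and `drop_duplicates()` compare entire rows by default, so two records for the same person that differ in one field are not caught. The sketch below assumes a hypothetical `EmployeeID` key column (not necessarily present in the dataset) and uses the `subset` and `keep` parameters to deduplicate on that key.

```python
import pandas as pd

# Hypothetical toy data; EmployeeID is an assumed key column
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 2, 3],
    "Name": ["Ann", "Bob", "Bob", "Cy"],
    "City": ["Pune", "Delhi", "Mumbai", "Pune"],
})

# The two rows for EmployeeID 2 differ in City, so a full-row check finds nothing
print("Full-row duplicates:", toy.duplicated().sum())

# Treat rows with the same EmployeeID as duplicates and keep the last occurrence
deduped = toy.drop_duplicates(subset=["EmployeeID"], keep="last")
print(deduped)
```

Keeping the last occurrence is only sensible if later rows are more recent; that is an assumption about row order, not something pandas checks.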


## 4. Renaming Columns


In [None]:
# Rename columns for clarity
df = df.rename(columns={'Age': 'Employee_Age', 'Salary': 'Employee_Salary'})
print(df.head())
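
Renaming columns one by one does not scale when many headers are inconsistent. One common approach, sketched here on made-up column names, is to normalize every header to lowercase snake_case with the string methods available on `df.columns`.

```python
import pandas as pd

# Hypothetical messy headers: stray spaces, mixed case, punctuation
toy = pd.DataFrame(columns=[" Employee Name ", "Joining-Date", "SALARY"])

toy.columns = (
    toy.columns
    .str.strip()                                   # drop leading/trailing spaces
    .str.lower()                                   # consistent case
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # other characters become underscores
)

print(list(toy.columns))
```

The rest of this notebook keeps the original capitalized names, so treat this as an alternative convention rather than a required step.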


## 5. Converting Data Types


In [None]:
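# Convert salary to integer. Round first, because astype(int) truncates decimals;
# the cast raises an error if any NaN values remain in the column.
if 'Employee_Salary' in df.columns:
    df['Employee_Salary'] = df['Employee_Salary'].round().astype(int)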

# Convert 'JoiningDate' to datetime
if 'JoiningDate' in df.columns:
    df['JoiningDate'] = pd.to_datetime(df['JoiningDate'], errors='coerce')

print(df.dtypes)
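
`astype(int)` fails when a column contains text that is not a number or still has missing values. A minimal sketch of a more forgiving conversion, on a made-up Series, combines `pd.to_numeric(..., errors='coerce')` with the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical raw salary values as they might arrive from a CSV
raw = pd.Series(["52000", "61000.0", "n/a", None], name="Salary")

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# The nullable integer dtype keeps missing entries as <NA>
salary_int = numeric.round().astype("Int64")

print(salary_int)
```

Coercion silently turns bad values into missing ones, so it is worth counting the missing values before and after to see how many entries were affected.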


## 6. String Cleaning


In [None]:
# Remove extra spaces in string columns and standardize
if 'Name' in df.columns:
    df['Name'] = df['Name'].str.strip()

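if 'City' in df.columns:
    df['City'] = df['City'].str.strip().str.title()  # Trim, then capitalize each word

# Show only the cleaned columns that actually exist, to avoid a KeyError
print(df[[c for c in ('Name', 'City') if c in df.columns]].head())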
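
`str.strip()` only removes whitespace at the ends of a string. The sketch below, on made-up names, chains it with a regular-expression replace to collapse repeated spaces and tabs inside the text as well.

```python
import pandas as pd

# Hypothetical messy names, including a missing entry
names = pd.Series(["  alice   SMITH ", "BOB\tjones", None])

cleaned = (
    names
    .str.strip()                            # trim the ends
    .str.replace(r"\s+", " ", regex=True)   # collapse internal whitespace runs
    .str.title()                            # capitalize each word
)

print(cleaned.tolist())
```

The `.str` methods skip missing values and leave them missing, so no separate null check is needed here.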


## 7. Detecting Outliers


In [None]:
# Simple method using IQR for Employee_Salary
if 'Employee_Salary' in df.columns:
    Q1 = df['Employee_Salary'].quantile(0.25)
    Q3 = df['Employee_Salary'].quantile(0.75)
    IQR = Q3 - Q1

    outliers = df[(df['Employee_Salary'] < (Q1 - 1.5*IQR)) | (df['Employee_Salary'] > (Q3 + 1.5*IQR))]
    print("Outliers based on Employee_Salary:")
    print(outliers)
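
Detecting outliers is only half of the task; they also need to be handled. As a sketch under assumed toy data (the `is_outlier` and `Salary_capped` column names are hypothetical), the example below reuses the same IQR bounds to flag outliers in a boolean column and to cap them at the bounds with `clip()`.

```python
import pandas as pd

# Hypothetical toy salaries with one extreme value
toy = pd.DataFrame(
    {"Employee_Salary": [40000, 42000, 45000, 47000, 50000, 250000]}
)

q1 = toy["Employee_Salary"].quantile(0.25)
q3 = toy["Employee_Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rows outside the bounds instead of deleting them
# (a missing salary would also be flagged, since between() returns False for NaN)
toy["is_outlier"] = ~toy["Employee_Salary"].between(lower, upper)

# Alternative handling: cap extreme values at the bounds
toy["Salary_capped"] = toy["Employee_Salary"].clip(lower=lower, upper=upper)

print(toy)
```

Flagging keeps the original values available for review, while capping changes them; which one fits depends on whether the extreme values are errors or genuine observations.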


## 8. Practice Exercises

1. Check for missing values in all columns and fill them appropriately.  
2. Remove duplicate rows if any.  
3. Rename columns to meaningful names.  
4. Convert date columns to datetime objects.  
5. Clean string columns (remove spaces, standardize case).  
6. Detect outliers in numeric columns and flag them.  

---

## End of Day 17 notebook
