# Title: Python Series – Day 48: Advanced Pandas (Data Cleaning & Transformation)

## 1. Introduction
**Data Cleaning** is a critical step in any data project. Real-world data is often messy, containing missing values, duplicates, and inconsistent formats.

**Today's Focus:**
- Handling Missing Data (NaN)
- Removing Duplicates
- String Operations & Transformations
- Grouping and Aggregation
- Merging and Concatenating DataFrames

## 2. Import Pandas & Load Sample Data

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Ali", "Sara", None, "Hina", "Ali"],
    "Age": [20, None, 22, 21, 20],
    "Marks": [85, 92, None, 75, 85],
    "City": ["Lahore", "Karachi", "Lahore", None, "Lahore"]
})

print("Original DataFrame:")
display(df)

## 3. Handling Missing Data (NaN)

In [None]:
print("Missing Values Count:")
print(df.isnull().sum())

# Filling missing values
# df.fillna(0) # Fills all NaNs with 0 (Be careful!)

df["City"] = df["City"].fillna("Unknown")  # assign back; inplace on a column selection may not update df
print("\nAfter filling City:")
display(df)

# Forward fill (propagates the last valid observation forward)
# df.ffill()  # fillna(method="ffill") is deprecated since pandas 2.1

# Dropping rows with ANY missing value
# df.dropna()

## 4. Handling Duplicate Data

In [None]:
print("Duplicates:", df.duplicated().sum())

df.drop_duplicates(inplace=True, keep="first")
print("\nAfter Removing Duplicates:")
display(df)

## 5. Replacing Values

In [None]:
df["City"] = df["City"].replace({"Lahore": "LHR", "Karachi": "KHI"})  # assign back instead of inplace
display(df)

## 6. String Operations
Pandas provides vectorized string methods via the `.str` accessor.

In [None]:
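# Fill missing names, then convert to uppercase
# (.str methods skip NaN rather than raise, but the value would stay missing)
df["Name"] = df["Name"].fillna("Unknown").str.upper()

# Select rows whose City contains "LHR"; na=False treats missing values as no match
lhr_city = df[df["City"].str.contains("LHR", na=False)]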
print("Rows with City containing 'LHR':")
display(lhr_city)
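
As a small illustration of how `.str` methods chain and how they treat missing values, here is a sketch on a hypothetical Series (`cities` is made up for this example):

```python
import pandas as pd

# Hypothetical messy values: stray whitespace, mixed case, one missing entry
cities = pd.Series(["  lhr ", "KHI", None])

# Chained .str calls; the missing entry stays missing instead of raising an error
cleaned = cities.str.strip().str.upper()

# na=False makes the missing entry count as "no match" in the boolean mask
has_lhr = cleaned.str.contains("LHR", na=False)

print(cleaned)
print(has_lhr)
```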

## 7. Applying Functions
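- `apply()`: Calls a function on each value of a Series, or on each column or row of a DataFrame.
- `map()`: Substitutes each value of a Series using a dict, a Series, or a function.
- `applymap()`: Works element-wise on a whole DataFrame (renamed to `DataFrame.map()` in pandas 2.1, where `applymap()` is deprecated).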
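
The three methods can be contrasted on a tiny frame. This is a sketch with hypothetical data (`frame`, `Bonus`); the `hasattr` check is only an assumption-friendly way to run on both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame({"Marks": [80, 90], "Bonus": [5, 10]})

# Series.apply: call a function on each value of one column
doubled = frame["Marks"].apply(lambda x: x * 2)

# DataFrame.apply with axis=1: call a function on each row
row_total = frame.apply(lambda row: row["Marks"] + row["Bonus"], axis=1)

# Series.map: look up each value in a dict (unmatched values become NaN)
grades = frame["Marks"].map({80: "B", 90: "A"})

# Element-wise over the whole DataFrame: DataFrame.map on pandas >= 2.1,
# falling back to applymap on older versions
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
as_text = elementwise(lambda x: f"{x} pts")

print(doubled.tolist(), row_total.tolist(), grades.tolist())
print(as_text)
```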

In [None]:
# Increase marks by 10% using lambda
df["Marks"] = df["Marks"].apply(lambda x: x * 1.1 if pd.notnull(x) else x)

# Map City names back to full form
df["City"] = df["City"].map({"LHR": "Lahore", "KHI": "Karachi", "Unknown": "Unknown"})

display(df)

## 8. Binning / Categorization

In [None]:
# Create Age Groups
# Note: Filling NaN Age for this example
df["Age"] = df["Age"].fillna(df["Age"].mean())

df["Age_Group"] = pd.cut(df["Age"], bins=[0, 20, 30, 100], labels=["Junior", "Young Adult", "Senior"])
display(df)
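
One detail worth checking with `pd.cut` is which side of each bin is inclusive. By default the right edge is included, so an age of exactly 20 lands in "Junior". The sketch below uses a hypothetical `ages` Series to show the default and the `right=False` alternative.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and next to the bin edges
ages = pd.Series([20, 21, 30, 31])
labels = ["Junior", "Young Adult", "Senior"]

# Default right=True: bins are (0, 20], (20, 30], (30, 100]
groups = pd.cut(ages, bins=[0, 20, 30, 100], labels=labels)

# right=False: bins are [0, 20), [20, 30), [30, 100)
groups_left = pd.cut(ages, bins=[0, 20, 30, 100], labels=labels, right=False)

print(groups.tolist())       # ['Junior', 'Young Adult', 'Young Adult', 'Senior']
print(groups_left.tolist())  # ['Young Adult', 'Young Adult', 'Senior', 'Senior']
```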

## 9. GroupBy (Very Important)
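`groupby` follows the split-apply-combine strategy: split the rows into groups, apply a function to each group, and combine the results into one object.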

In [None]:
print("Average Marks by City:")
print(df.groupby("City")["Marks"].mean())

print("\nDetailed Aggregation:")
print(df.groupby("City").agg({"Marks": "mean", "Age": "max"}))
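
To make the three steps of split-apply-combine visible, the sketch below performs them by hand and compares the result with the built-in call. The `scores` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({
    "City": ["Lahore", "Karachi", "Lahore"],
    "Marks": [80.0, 90.0, 70.0],
})

# Split: iterate over one sub-frame per City
# Apply: take the mean of Marks in each sub-frame
# Combine: collect the results into a single Series
manual = pd.Series({
    city: part["Marks"].mean()
    for city, part in scores.groupby("City")
})

# The built-in one-liner does all three steps at once
built_in = scores.groupby("City")["Marks"].mean()

print(manual)
print(built_in)
```

In practice the one-liner is preferable; the manual version is only there to show what it does internally.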

## 10. Merging & Joining DataFrames

In [None]:
df1 = pd.DataFrame({"ID": [1,2], "Name": ["Ali", "Sara"]})
df2 = pd.DataFrame({"ID": [1,2], "Score": [88, 92]})

merged = pd.merge(df1, df2, on="ID")
print("Merged Data:")
display(merged)
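
In the example above every `ID` appears in both frames, so the join type does not matter. When keys only partly overlap, the `how` parameter decides which rows survive. A short sketch with hypothetical frames `left` and `right`:

```python
import pandas as pd

# Hypothetical frames whose IDs only partly overlap
left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ali", "Sara", "Hina"]})
right = pd.DataFrame({"ID": [2, 3, 4], "Score": [92, 75, 60]})

# Default how="inner": keep only IDs present in both frames
inner = pd.merge(left, right, on="ID")

# how="left": keep every row of `left`; Score is NaN where no match exists
left_join = pd.merge(left, right, on="ID", how="left")

# how="outer": keep all IDs; indicator=True adds a _merge column showing each row's origin
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```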

## 11. Concatenation

In [None]:
df_concat = pd.concat([df1, df2], axis=1)  # axis=1: side by side (the ID column appears twice)
print("Concatenated Data:")
display(df_concat)
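
`pd.concat` can also stack frames vertically, which is the more common use when combining batches of the same kind of record. The sketch below contrasts the two axes using hypothetical frames `batch_a` and `batch_b`.

```python
import pandas as pd

# Hypothetical batches with the same columns
batch_a = pd.DataFrame({"ID": [1, 2], "Name": ["Ali", "Sara"]})
batch_b = pd.DataFrame({"ID": [3], "Name": ["Hina"]})

# Default axis=0: stack rows; ignore_index=True renumbers the index 0..n-1
stacked = pd.concat([batch_a, batch_b], ignore_index=True)

# axis=1: align on the index and place columns side by side
# (column names repeat, and the shorter frame is padded with NaN)
side_by_side = pd.concat([batch_a, batch_b], axis=1)

print(stacked)
print(side_by_side)
```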

## 12. Sorting & Reindexing

In [None]:
df.sort_values("Marks", ascending=False, inplace=True)
df.reset_index(drop=True, inplace=True)  # renumber rows 0..n-1 after sorting
display(df)

## 13. Practice Exercises
1. Convert all 'City' names to Uppercase in the original `df`.
2. Remove duplicate rows based on a specific subset of columns.
3. Fill missing 'Marks' with the class average.
4. Group by 'Age_Group' and find the count of students.

## 14. Mini Project – Data Cleaning Pipeline
We will simulate a messy raw dataset and clean it step by step.

In [None]:
# 1. Create Mock Raw Data
raw_data = {
    "STUDENT NAME": ["  Ali  ", "Sara", "ALI", "zara", None],
    " marKS ": [80, 95, 80, None, 40],
    "Grade": [None, "A", None, "F", "F"]
}
df_raw = pd.DataFrame(raw_data)
df_raw.to_csv("students_raw.csv", index=False)
print("Created students_raw.csv")

# 2. Pipeline Implementation
def clean_data_pipeline(filename):
    # Load
    df = pd.read_csv(filename)

    # Standardize headers: "STUDENT NAME" -> "student_name", " marKS " -> "marks"
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

    # Clean strings and drop rows without a name
    df["student_name"] = df["student_name"].str.strip().str.title()
    df.dropna(subset=["student_name"], inplace=True)

    # Remove duplicates after cleaning, so "  Ali  " and "ALI" count as the same row
    df.drop_duplicates(inplace=True)

    # Handle missing numeric values
    df["marks"] = df["marks"].fillna(df["marks"].mean())

    # Create new column
    df["status"] = df["marks"].apply(lambda x: "Pass" if x >= 50 else "Fail")

    return df

# Run Pipeline
cleaned_df = clean_data_pipeline("students_raw.csv")
print("\n--- Cleaned Data ---")
display(cleaned_df)

# Save
cleaned_df.to_csv("students_cleaned.csv", index=False)
print("Saved to students_cleaned.csv")

## 15. Day 48 Summary
- Data Cleaning is a prerequisite for analysis.
- Pandas offers robust tools for handling Nulls, Duplicates, and Strings.
- `groupby` and `merge` are essential for complex data manipulation.

**Next topic: Day 49 – Data Visualization with Matplotlib**