# Week 6 Tasks - Data Wrangling in Python

This notebook demonstrates essential data wrangling skills using Python and pandas. We will work through data loading, cleaning, transformation, and aggregation tasks.

**Author:** Your Name  
**Date:** [Current Date]


## Load Required Libraries


In [None]:
# Load required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries loaded successfully!")


# Task 1: Data Loading and Initial Exploration

## Load Dataset


In [None]:
# Load your chosen dataset
# Replace with your dataset path
# data = pd.read_csv('path/to/your/dataset.csv')

# For demonstration, we'll create a sample dataset with common data quality issues
np.random.seed(123)

sample_data = pd.DataFrame({
    'id': range(1, 1001),
    'name': [f'Customer {i}' for i in range(1, 1001)],
    'age': np.random.choice([*range(18, 81), np.nan], 1000, p=[0.98/63] * 63 + [0.02]),
    'income': np.round(np.random.normal(50000, 15000, 1000)),
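    # Use None in an object array: np.nan in a list of strings would be coerced to the string 'nan'
    'city': np.random.choice(np.array(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', None], dtype=object), 1000),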
    'purchase_date': pd.date_range('2020-01-01', '2023-12-31', periods=1000),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], 1000),
    'amount': np.round(np.random.uniform(10, 500, 1000), 2)
})

# Introduce some data quality issues
sample_data.loc[np.random.choice(1000, 50, replace=False), 'age'] = np.nan
sample_data.loc[np.random.choice(1000, 30, replace=False), 'income'] = np.nan
sample_data.loc[np.random.choice(1000, 25, replace=False), 'city'] = np.nan
sample_data.loc[np.random.choice(1000, 10, replace=False), 'name'] = ''  # Empty strings
sample_data.loc[np.random.choice(1000, 5, replace=False), 'name'] = '   '  # Whitespace only

print(f"Dataset shape: {sample_data.shape[0]} rows, {sample_data.shape[1]} columns")
print(f"Column names: {', '.join(sample_data.columns)}")


## Display First and Last Rows


In [None]:
# First 10 rows
print("First 10 rows:")
sample_data.head(10)


In [None]:
# Last 5 rows
print("Last 5 rows:")
sample_data.tail(5)
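
Initial exploration can also include a per-column count of missing and blank values. The sketch below is illustrative only: the small `demo` frame and the helper `missing_summary` are hypothetical and not part of the notebook. Blank strings are counted separately because `isna()` does not flag them.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing and blank values per column (hypothetical helper)."""
    n_missing = df.isna().sum()
    # Empty or whitespace-only strings are not NaN, so count them separately
    n_blank = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    summary = pd.DataFrame({"n_missing": n_missing, "n_blank": n_blank})
    summary["pct_missing"] = (summary["n_missing"] / len(df) * 100).round(1)
    return summary


# Made-up data with the same kinds of issues as the sample dataset
demo = pd.DataFrame({
    "name": ["Customer 1", "", "   ", "Customer 4"],
    "age": [34, np.nan, 52, np.nan],
    "city": ["Chicago", None, "Houston", "Phoenix"],
})

summary = missing_summary(demo)
print(summary)
```

Assuming the same column layout, `missing_summary(sample_data)` would report the blank `name` entries that a plain `isnull().sum()` misses.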


# Task 2: Data Cleaning and Preprocessing

## Handle Missing Values


In [None]:
# Create a copy for cleaning
data_cleaned = sample_data.copy()

# Strategy 1: Remove rows with missing critical information (if any)
# data_cleaned = data_cleaned.dropna(subset=['id'])

# Strategy 2: Impute missing values
# For age: use median
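# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated and may not update the DataFrame in recent pandas versions
data_cleaned['age'] = data_cleaned['age'].fillna(data_cleaned['age'].median())

# For income: use mean
data_cleaned['income'] = data_cleaned['income'].fillna(data_cleaned['income'].mean())

# For city: use mode (most frequent)
city_mode = data_cleaned['city'].mode()[0]
data_cleaned['city'] = data_cleaned['city'].fillna(city_mode)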

# Verify missing values after imputation
print("Missing values after imputation:")
print(data_cleaned.isnull().sum())
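
The sample dataset also contains empty and whitespace-only names, which the imputation step above does not touch. One possible way to handle them, together with exact duplicate rows, is sketched below on a small made-up frame (`demo` is hypothetical, not the notebook's data).

```python
import pandas as pd

# Made-up data: padded, empty, and whitespace-only names plus one duplicate row
demo = pd.DataFrame({
    "id": [1, 2, 3, 3, 4],
    "name": [" Customer 1 ", "", "Customer 3", "Customer 3", "   "],
})

# Strip surrounding whitespace, then mark empty strings as missing (NaN)
name = demo["name"].str.strip()
demo["name"] = name.mask(name.eq(""))

# Drop rows that exactly repeat an earlier row, keeping the first occurrence
demo = demo.drop_duplicates().reset_index(drop=True)

print(demo)
print("Missing names:", demo["name"].isna().sum())
```

Once blanks are real missing values, they show up in `isnull().sum()` and can be dropped or imputed like any other gap.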


# Task 3: Data Transformation and Feature Engineering

## Create Derived Variables


In [None]:
# Create age groups
def categorize_age(age):
    if age < 25:
        return 'Young'
    elif age < 40:
        return 'Adult'
    elif age < 60:
        return 'Middle-aged'
    else:
        return 'Senior'

data_cleaned['age_group'] = data_cleaned['age'].apply(categorize_age)

# Create income categories
def categorize_income(income):
    if income < 30000:
        return 'Low'
    elif income < 60000:
        return 'Medium'
    else:
        return 'High'

data_cleaned['income_category'] = data_cleaned['income'].apply(categorize_income)

print("Derived variables created successfully!")
print("Sample of new categorical variables:")
print(data_cleaned[['age', 'age_group', 'income', 'income_category']].head(10))
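
As an alternative to a row-wise `apply`, the same age bins can be expressed with `pd.cut`. This is a sketch on a made-up series of ages, assuming the cut-offs used in `categorize_age` (25, 40, 60) with the lower edge of each bin included.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on and around each bin edge
ages = pd.Series([18, 24, 25, 39, 40, 59, 60, 80], name="age")

# right=False makes bins left-closed: [0, 25), [25, 40), [40, 60), [60, inf)
age_group = pd.cut(
    ages,
    bins=[0, 25, 40, 60, np.inf],
    labels=["Young", "Adult", "Middle-aged", "Senior"],
    right=False,
)

print(pd.DataFrame({"age": ages, "age_group": age_group}))
```

The result is an ordered categorical, so groups sort from `Young` to `Senior` rather than alphabetically. The two approaches also differ on missing input: `pd.cut` leaves a missing age as missing, whereas `categorize_age` falls through to `'Senior'` because every comparison with NaN is false.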


# Task 4: Data Aggregation and Grouping Operations

## Groupby Operations with Multiple Aggregations


In [None]:
# Aggregate by age group
age_group_summary = data_cleaned.groupby('age_group').agg({
    'id': 'count',
    'income': ['mean', 'median'],
    'amount': ['sum', 'mean']
}).round(2)

# Flatten column names
age_group_summary.columns = ['count', 'avg_income', 'median_income', 'total_amount', 'avg_amount']

print("Summary by Age Group:")
print(age_group_summary)
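
The dictionary form above yields a two-level column index that then has to be renamed by position. pandas named aggregation produces the flat column names directly. The sketch below uses a small made-up frame (`demo` is hypothetical) with the same column names as the notebook.

```python
import pandas as pd

# Made-up data with the columns used in the age group summary
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age_group": ["Adult", "Adult", "Senior", "Senior"],
    "income": [40000.0, 60000.0, 30000.0, 50000.0],
    "amount": [100.0, 50.0, 20.0, 80.0],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = demo.groupby("age_group").agg(
    count=("id", "count"),
    avg_income=("income", "mean"),
    median_income=("income", "median"),
    total_amount=("amount", "sum"),
    avg_amount=("amount", "mean"),
).round(2)

print(summary)
```

Because each output name is tied to its source column and function, adding or reordering aggregations cannot silently mislabel a column.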


# Conclusion

This data wrangling exercise demonstrates essential skills for preparing data for analysis:

1. **Data Loading**: Built a sample dataset with deliberate data quality issues and inspected its first and last rows
2. **Data Cleaning**: Imputed missing values using the median (age), mean (income), and mode (city)
3. **Data Transformation**: Created derived categorical variables for age group and income category
4. **Data Aggregation**: Grouped by age group and computed counts, means, medians, and sums

The final dataset is now ready for further analysis, modeling, or visualization tasks.
