# Task 3: Exploratory Data Analysis (EDA) & Feature Engineering

## Objective
This notebook covers:
1. Comprehensive exploratory data analysis with visualizations
2. Handling missing values
3. Outlier detection and treatment
4. Quantifying class imbalance (treatment is deferred to the modeling notebook)
5. Engineering at least 3 new features
6. Justifying preprocessing and feature engineering choices

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats
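import os

warnings.filterwarnings('ignore')

# Create the figure output directory up front; later cells save plots into it
os.makedirs('../reports/figures', exist_ok=True)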

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully!")

## 1. Load Merged Dataset

In [None]:
# Load the merged dataset
df = pd.read_pickle('../data/raw/bank_merged_raw.pkl')

print(f"Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print("\nFirst few rows:")
df.head()

In [None]:
# Dataset overview
print("Dataset Information:")
print("=" * 80)
df.info()

## 2. Missing Values Analysis

In [None]:
# Calculate missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("Missing Values Summary:")
print("=" * 80)
if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values found!")

In [None]:
# Visualize missing values
if len(missing_df) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Bar plot of missing counts
    missing_df['Missing Count'].plot(kind='barh', ax=axes[0], color='coral')
    axes[0].set_title('Missing Values Count', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Count')
    
    # Bar plot of missing percentages
    missing_df['Percentage'].plot(kind='barh', ax=axes[1], color='skyblue')
    axes[1].set_title('Missing Values Percentage', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Percentage (%)')
    
    plt.tight_layout()
    plt.savefig('../reports/figures/missing_values.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No missing values to visualize")

### Missing Values Strategy

Based on the analysis, the missingness is structural: it comes from merging two source files that do not share all columns.
- **Economic features** (`emp.var.rate`, `cons.price.idx`, etc.): Missing for the bank-full dataset
  - **Strategy**: Keep as NaN for now; handle at modeling time with imputation or with separate models per source
- **balance, day**: Missing for the bank-additional dataset
  - **Strategy**: These values were never recorded for that source; impute only if a model requires them
- **day_of_week**: Missing for the bank-full dataset
  - **Strategy**: Cannot be accurately derived from `day` and `month` alone; keep as NaN or use an 'unknown' category
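
One way to confirm that the gaps are structural is to tabulate the missing-value rate per column within each source. The sketch below assumes the merged frame carries a `data_source` column (it is referenced later in this notebook); the toy rows and the labels `bank-full` and `bank-additional` are illustrative stand-ins, not real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame; values are made up for illustration.
toy = pd.DataFrame({
    'data_source': ['bank-full', 'bank-full', 'bank-additional', 'bank-additional'],
    'age': [30, 45, 52, 38],
    'balance': [1200.0, -50.0, np.nan, np.nan],    # only present in bank-full
    'emp.var.rate': [np.nan, np.nan, 1.1, -1.8],   # only present in bank-additional
})

# Share of missing values per column, computed separately for each source.
missing_by_source = (
    toy.drop(columns='data_source')
       .isna()
       .groupby(toy['data_source'])
       .mean()
)
print(missing_by_source)

# Treat a column as structurally missing when it is entirely absent from at
# least one source and has no gaps in the remaining sources.
structural = [
    col for col in missing_by_source.columns
    if missing_by_source[col].isin([0.0, 1.0]).all()
    and missing_by_source[col].max() == 1.0
]
print(structural)
```

A column with a rate strictly between 0 and 1 inside a single source would point to ordinary (non-structural) missingness and would need its own treatment.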

## 3. Target Variable Analysis

In [None]:
# Target variable distribution
print("Target Variable Distribution:")
print("=" * 80)
target_counts = df['y'].value_counts()
target_pct = df['y'].value_counts(normalize=True) * 100

target_summary = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_pct
})
print(target_summary)

# Calculate imbalance ratio
imbalance_ratio = target_counts['no'] / target_counts['yes']
print(f"\nImbalance Ratio (no:yes): {imbalance_ratio:.2f}:1")
print(f"This is a {'HIGHLY' if imbalance_ratio > 5 else 'MODERATELY'} imbalanced dataset")

In [None]:
# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
target_counts.plot(kind='bar', ax=axes[0], color=['salmon', 'lightgreen'])
axes[0].set_title('Target Variable Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Subscribed to Term Deposit')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

# Pie chart
axes[1].pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', 
            colors=['salmon', 'lightgreen'], startangle=90)
axes[1].set_title('Target Variable Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../reports/figures/target_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Numerical Features Analysis

In [None]:
# Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numerical features ({len(numerical_cols)}):")
print(numerical_cols)

# Statistical summary
print("\nStatistical Summary:")
df[numerical_cols].describe()

In [None]:
# Distribution plots for key numerical features
key_numerical = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
key_numerical = [col for col in key_numerical if col in df.columns]

n_cols = 3
n_rows = (len(key_numerical) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows*4))
axes = axes.flatten() if n_rows > 1 else [axes] if n_cols == 1 else axes

for idx, col in enumerate(key_numerical):
    df[col].dropna().hist(bins=50, ax=axes[idx], color='steelblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

# Hide unused subplots
for idx in range(len(key_numerical), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.savefig('../reports/figures/numerical_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows*4))
axes = axes.flatten() if n_rows > 1 else [axes] if n_cols == 1 else axes

for idx, col in enumerate(key_numerical):
    df.boxplot(column=col, by='y', ax=axes[idx], patch_artist=True)
    axes[idx].set_title(f'{col} by Target', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Subscribed')
    axes[idx].set_ylabel(col)

# Hide unused subplots
for idx in range(len(key_numerical), len(axes)):
    axes[idx].axis('off')

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.savefig('../reports/figures/numerical_boxplots.png', dpi=300, bbox_inches='tight')
plt.show()

### Outlier Detection and Treatment

In [None]:
# Detect outliers using IQR method
def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

print("Outlier Analysis:")
print("=" * 80)

for col in key_numerical:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    # Use non-missing rows as the denominator; some columns are absent for a whole source
    n_valid = df[col].notna().sum()
    outlier_pct = (len(outliers) / n_valid) * 100
    print(f"\n{col}:")
    print(f"  Outliers: {len(outliers):,} ({outlier_pct:.2f}% of non-missing rows)")
    print(f"  Bounds: [{lower:.2f}, {upper:.2f}]")
    print(f"  Range: [{df[col].min():.2f}, {df[col].max():.2f}]")

**Outlier Treatment Strategy:**
- **age, balance**: Natural variation; keep outliers (valid data points)
- **duration**: High values may indicate important cases; keep for analysis
- **campaign**: High values might indicate difficult clients; informative feature
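- **pdays**: The values -1 and 999 are codes for "not previously contacted", not real day counts, so the IQR rule may flag them even though they are not extreme observations; keep as-is
- **previous**: Counts of contacts before this campaign; keep as-is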

**Decision**: Keep outliers as they represent real scenarios and may be informative for the model.
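
Because the special codes in `pdays` are not real day counts, one option is to split the column into a flag plus a cleaned value before computing numeric summaries. This is a minimal sketch, assuming -1 and 999 are the only sentinel codes; the column names `was_contacted_before` and `pdays_clean` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Assumed "never contacted" codes for this sketch.
PDAYS_SENTINELS = [-1, 999]

# Toy values for illustration only.
toy = pd.DataFrame({'pdays': [-1, 999, 5, 180, np.nan]})

is_sentinel = toy['pdays'].isin(PDAYS_SENTINELS)

# 1 when the row holds a real days-since-last-contact value, 0 for sentinel or missing.
toy['was_contacted_before'] = (toy['pdays'].notna() & ~is_sentinel).astype(int)

# Keep real day counts and turn sentinel codes into NaN.
toy['pdays_clean'] = toy['pdays'].mask(is_sentinel)
print(toy)
```

The original `pdays` column is left untouched, which is consistent with the decision to keep all values as-is.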

## 5. Categorical Features Analysis

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if 'data_source' in categorical_cols:
    categorical_cols.remove('data_source')
if 'y' in categorical_cols:
    categorical_cols.remove('y')

print(f"Categorical features ({len(categorical_cols)}):")
print(categorical_cols)

print("\nUnique values per feature:")
for col in categorical_cols:
    print(f"  {col}: {df[col].nunique()} unique values")

In [None]:
# Value counts for each categorical feature
for col in categorical_cols:
    print(f"\n{col.upper()}:")
    print("=" * 80)
    value_counts = df[col].value_counts()
    value_pct = df[col].value_counts(normalize=True) * 100
    summary = pd.DataFrame({'Count': value_counts, 'Percentage': value_pct})
    print(summary.head(10))

In [None]:
# Visualize categorical features
key_categorical = ['job', 'marital', 'education', 'contact', 'month', 'poutcome']
key_categorical = [col for col in key_categorical if col in df.columns]

n_cols = 2
n_rows = (len(key_categorical) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows*4))
axes = axes.flatten() if n_rows > 1 else axes

for idx, col in enumerate(key_categorical):
    value_counts = df[col].value_counts()
    value_counts.plot(kind='bar', ax=axes[idx], color='teal', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Count')
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../reports/figures/categorical_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

## 6. Feature Relationships with Target

In [None]:
# Subscription rate by categorical features
print("Subscription Rate by Categorical Features:")
print("=" * 80)

for col in key_categorical:
    print(f"\n{col.upper()}:")
    ct = pd.crosstab(df[col], df['y'], normalize='index') * 100
    ct_sorted = ct.sort_values('yes', ascending=False)
    print(ct_sorted.round(2))

In [None]:
# Visualize subscription rates
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, n_rows*4))
axes = axes.flatten() if n_rows > 1 else axes

for idx, col in enumerate(key_categorical):
    ct = pd.crosstab(df[col], df['y'], normalize='index') * 100
    ct.plot(kind='bar', stacked=False, ax=axes[idx], color=['salmon', 'lightgreen'])
    axes[idx].set_title(f'Subscription Rate by {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Percentage (%)')
    axes[idx].legend(title='Subscribed', loc='best')
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../reports/figures/subscription_by_category.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Correlation Analysis

In [None]:
# Create binary target for correlation
df_corr = df.copy()
df_corr['y_binary'] = (df_corr['y'] == 'yes').astype(int)

# Select numerical columns for correlation
numerical_for_corr = df_corr.select_dtypes(include=[np.number]).columns.tolist()
if 'y_binary' not in numerical_for_corr:
    numerical_for_corr.append('y_binary')

# Calculate correlation matrix
corr_matrix = df_corr[numerical_for_corr].corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numerical Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../reports/figures/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Correlation with target variable
target_corr = corr_matrix['y_binary'].sort_values(ascending=False)
print("Correlation with Target Variable:")
print("=" * 80)
print(target_corr)

# Visualize
plt.figure(figsize=(10, 8))
target_corr[target_corr.index != 'y_binary'].plot(kind='barh', color='steelblue')
plt.title('Feature Correlation with Target', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.savefig('../reports/figures/target_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

## 8. Feature Engineering

We will create at least 3 new features based on domain knowledge and data insights.

### Feature 1: Contact Frequency Category

**Rationale**: The `campaign` variable records how many times the client was contacted during this campaign. Binning it into first contact, low, medium, and high frequency lets a model capture non-linear effects.
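
For reference, the same binning can be written in vectorized form with `pd.cut`. This is a sketch on toy values, using the thresholds from this notebook (1, 2-3, 4-6, 7 or more); unlike the row-wise function, it would label values of 0 or below as 'unknown', which is assumed not to occur in `campaign`.

```python
import numpy as np
import pandas as pd

# Toy contact counts for illustration only.
campaign = pd.Series([1, 2, 3, 4, 6, 7, 25, np.nan], name='campaign')

# Right-closed intervals: (0, 1], (1, 3], (3, 6], (6, inf)
bins = [0, 1, 3, 6, np.inf]
labels = ['first_contact', 'low', 'medium', 'high']

contact_frequency = (
    pd.cut(campaign, bins=bins, labels=labels)
      .cat.add_categories('unknown')   # register the extra label before filling
      .fillna('unknown')               # missing counts become 'unknown'
)
print(contact_frequency.tolist())
```

The result is an ordered categorical, which keeps the bins in a meaningful order in tables and plots.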

In [None]:
# Create contact frequency category
def categorize_campaign(x):
    if pd.isna(x):
        return 'unknown'
    elif x == 1:
        return 'first_contact'
    elif x <= 3:
        return 'low'
    elif x <= 6:
        return 'medium'
    else:
        return 'high'

df['contact_frequency'] = df['campaign'].apply(categorize_campaign)

print("Feature 1: Contact Frequency Category")
print("=" * 80)
print(df['contact_frequency'].value_counts())
print("\nSubscription rate by contact frequency:")
ct = pd.crosstab(df['contact_frequency'], df['y'], normalize='index') * 100
print(ct.round(2))

### Feature 2: Previous Campaign Success

**Rationale**: Combining `previous` (number of contacts before this campaign) with `poutcome` (outcome of the previous campaign) yields a single, more informative feature about past interaction success.
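
On large frames a row-wise `apply` can be slow, so the combination rule can alternatively be expressed with `np.select`, which evaluates the conditions in order and takes the first match. The sketch below runs on toy rows; the category names match the ones used in this notebook, and the `poutcome` values shown are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'previous': [0, 2, 1, 3, 1, np.nan],
    'poutcome': ['nonexistent', 'success', 'failure', 'other', np.nan, 'unknown'],
})

# No earlier contact: count is missing or zero.
no_contact = toy['previous'].isna() | (toy['previous'] == 0)

# Conditions are checked top to bottom; the first True wins.
conditions = [
    no_contact,
    toy['poutcome'].isna(),
    toy['poutcome'] == 'success',
    toy['poutcome'] == 'failure',
]
choices = [
    'no_previous_contact',
    'previous_unknown',
    'previous_success',
    'previous_failure',
]

# Any other recorded outcome falls through to 'previous_other'.
toy['previous_campaign_success'] = np.select(conditions, choices, default='previous_other')
print(toy)
```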

In [None]:
# Create previous campaign success indicator
def categorize_previous_success(row):
    if pd.isna(row['previous']) or row['previous'] == 0:
        return 'no_previous_contact'
    elif pd.notna(row['poutcome']):
        if row['poutcome'] == 'success':
            return 'previous_success'
        elif row['poutcome'] == 'failure':
            return 'previous_failure'
        else:
            return 'previous_other'
    else:
        return 'previous_unknown'

df['previous_campaign_success'] = df.apply(categorize_previous_success, axis=1)

print("Feature 2: Previous Campaign Success")
print("=" * 80)
print(df['previous_campaign_success'].value_counts())
print("\nSubscription rate by previous campaign success:")
ct = pd.crosstab(df['previous_campaign_success'], df['y'], normalize='index') * 100
print(ct.sort_values('yes', ascending=False).round(2))

### Feature 3: Age Group

**Rationale**: Different age groups may have different financial behaviors and receptiveness to term deposits.

In [None]:
# Create age groups
def categorize_age(x):
    if pd.isna(x):
        return 'unknown'
    elif x < 30:
        return 'young_adult'
    elif x < 40:
        return 'adult'
    elif x < 55:
        return 'middle_aged'
    elif x < 65:
        return 'pre_retirement'
    else:
        return 'senior'

df['age_group'] = df['age'].apply(categorize_age)

print("Feature 3: Age Group")
print("=" * 80)
print(df['age_group'].value_counts())
print("\nSubscription rate by age group:")
ct = pd.crosstab(df['age_group'], df['y'], normalize='index') * 100
print(ct.sort_values('yes', ascending=False).round(2))

### Feature 4: Economic Context Availability

**Rationale**: A binary flag marking whether the economic indicators are available (bank-additional rows) or not (bank-full rows). It lets a model, or a stratified analysis, distinguish the two data sources.

In [None]:
# Create economic data availability flag
economic_features = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
# Flag rows where every economic indicator is present
df['has_economic_data'] = df[economic_features].notna().all(axis=1).astype(int)

print("Feature 4: Economic Data Availability")
print("=" * 80)
print(df['has_economic_data'].value_counts())
print("\nSubscription rate by economic data availability:")
ct = pd.crosstab(df['has_economic_data'], df['y'], normalize='index') * 100
print(ct.round(2))

### Feature 5: Duration Category

**Rationale**: Call duration is a strong predictor. Categorizing it can help capture non-linear patterns.

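**Note**: `duration` is only known after the call has ended, so it is unavailable at prediction time for a pre-call model. Use it, and any feature derived from it, for benchmarking only.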

In [None]:
# Create duration categories
def categorize_duration(x):
    if pd.isna(x):
        return 'unknown'
    elif x < 60:
        return 'very_short'  # Less than 1 minute
    elif x < 180:
        return 'short'  # 1-3 minutes
    elif x < 300:
        return 'medium'  # 3-5 minutes
    else:
        return 'long'  # More than 5 minutes

df['duration_category'] = df['duration'].apply(categorize_duration)

print("Feature 5: Duration Category")
print("=" * 80)
print(df['duration_category'].value_counts())
print("\nSubscription rate by duration category:")
ct = pd.crosstab(df['duration_category'], df['y'], normalize='index') * 100
print(ct.sort_values('yes', ascending=False).round(2))

## 9. Summary of Engineered Features

We created 5 new features:

1. **contact_frequency**: Categorizes campaign contacts (first_contact, low, medium, high)
   - *Justification*: Captures non-linear relationship between contact frequency and success
   - *Impact*: Shows diminishing returns with more contacts

2. **previous_campaign_success**: Combines previous contacts with their outcomes
   - *Justification*: Past success is highly predictive of future success
   - *Impact*: Strong predictor - previous success shows ~65% subscription rate

3. **age_group**: Categorizes age into life stages
   - *Justification*: Different life stages have different financial priorities
   - *Impact*: Seniors and pre-retirement groups show higher subscription rates

4. **has_economic_data**: Indicator for economic data availability
   - *Justification*: Helps models handle mixed data sources appropriately
   - *Impact*: Can be used for stratification or as a feature

5. **duration_category**: Categorizes call duration
   - *Justification*: Longer calls indicate more interest/engagement
   - *Impact*: Very strong predictor, but only available post-call
   - *Caution*: Should be excluded for realistic predictive models
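
To make the caution about `duration_category` concrete, the sketch below builds two feature sets from a toy frame: a benchmark set that keeps the post-call columns and a pre-call set that drops them. The names `POST_CALL_COLUMNS`, `X_benchmark`, and `X_pre_call` are hypothetical, and the list of post-call columns is an assumption for this example.

```python
import pandas as pd

# Columns only known once the call has ended (assumed list for this sketch).
POST_CALL_COLUMNS = ['duration', 'duration_category']

# Toy rows for illustration only.
toy = pd.DataFrame({
    'age': [34, 58],
    'campaign': [1, 4],
    'duration': [95, 610],
    'duration_category': ['short', 'long'],
    'y': ['no', 'yes'],
})

# Benchmark features: everything except the target.
X_benchmark = toy.drop(columns=['y'])

# Pre-call features: additionally drop anything derived from call duration.
# errors='ignore' keeps the call safe if a listed column is absent.
X_pre_call = toy.drop(columns=['y'] + POST_CALL_COLUMNS, errors='ignore')

print(list(X_benchmark.columns))
print(list(X_pre_call.columns))
```

Keeping the exclusion list in one place makes it harder for a duration-derived column to slip into the pre-call model by accident.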

## 10. Save Processed Data

In [None]:
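# Create output directories (the pickle and CSV below are written to data/interim)
os.makedirs('../reports/figures', exist_ok=True)
os.makedirs('../data/interim', exist_ok=True)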

# Save processed dataset
output_path = '../data/interim/bank_with_features.pkl'
df.to_pickle(output_path)
print(f"✓ Saved processed dataset to: {output_path}")

# Also save as CSV
output_path_csv = '../data/interim/bank_with_features.csv'
df.to_csv(output_path_csv, index=False)
print(f"✓ Saved as CSV to: {output_path_csv}")

print(f"\nFinal dataset shape: {df.shape}")
print(f"Total columns (including target and engineered features): {len(df.columns)}")

## 11. Key Findings & Insights

### Data Quality:
✅ No unexpected missing values (only structural missingness from dataset merging)  
✅ Outliers are valid and informative  
✅ Class imbalance identified (~8:1 ratio)  

### Important Patterns:
- **Duration** is the strongest predictor (but not available pre-call)
- **Previous campaign success** strongly indicates future success
- **Economic indicators** (where available) show correlation with outcomes
- **Contact frequency**: Too many contacts reduce the success rate
- **Age groups**: Seniors and pre-retirement clients are more likely to subscribe

### Next Steps:
1. Prepare data for modeling (encoding, scaling, train-test split)
2. Handle class imbalance using SMOTE or class weights
3. Consider separate models:
   - With duration (benchmark, post-call)
   - Without duration (realistic, pre-call prediction)
   - With/without economic indicators
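
As a preview of the class-weight option in step 2, the sketch below computes weights with the common "balanced" heuristic, weight = n_samples / (n_classes × class count), which is the formula scikit-learn documents for `class_weight='balanced'`. The target values here are a toy series with an 8:1 ratio, not the real data.

```python
import pandas as pd

# Toy target with an 8:1 imbalance (counts are illustrative).
y = pd.Series(['no'] * 8 + ['yes'] * 1)

counts = y.value_counts()
n_samples = len(y)
n_classes = counts.size

# Balanced heuristic: rarer classes receive proportionally larger weights.
class_weight = (n_samples / (n_classes * counts)).to_dict()
print(class_weight)
```

A dictionary of this shape can typically be passed to estimators that accept a `class_weight` argument; whether weights or resampling such as SMOTE works better is left to the modeling notebook.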

---

**Proceed to Notebook 4 for Model Development**