# Academic Status and Dropout Prediction - Feature Engineering

This notebook builds on the findings from our data exploration to implement feature engineering for our academic status and dropout prediction system. Feature engineering transforms raw data into features that better represent the underlying patterns, which typically improves model performance.

## Table of Contents
1. [Setup and Configuration](#1.-Setup-and-Configuration)
2. [Data Loading](#2.-Data-Loading)
3. [Feature Preprocessing](#3.-Feature-Preprocessing)
   - [Handling Missing Values](#3.1-Handling-Missing-Values)
   - [Handling Outliers](#3.2-Handling-Outliers)
4. [Feature Creation](#4.-Feature-Creation)
   - [Academic Performance Indicators](#4.1-Academic-Performance-Indicators)
   - [Engagement Metrics](#4.2-Engagement-Metrics)
   - [Socioeconomic Indicators](#4.3-Socioeconomic-Indicators)
   - [Economic Context Features](#4.4-Economic-Context-Features)
5. [Feature Transformation](#5.-Feature-Transformation)
   - [Encoding Categorical Variables](#5.1-Encoding-Categorical-Variables)
   - [Scaling Numerical Features](#5.2-Scaling-Numerical-Features)
6. [Feature Selection](#6.-Feature-Selection)
   - [Statistical Methods](#6.1-Statistical-Methods)
   - [Model-Based Selection](#6.2-Model-Based-Selection)
7. [Data Splitting](#7.-Data-Splitting)
8. [Feature Set Evaluation](#8.-Feature-Set-Evaluation)
9. [Saving Processed Data](#9.-Saving-Processed-Data)

## 1. Setup and Configuration

Let's first import the necessary libraries and set up the environment.

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
import os
import pickle

# Feature engineering libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Import custom utility functions if any
import sys
sys.path.append('../')
# from src.features.build_features import create_academic_indicators  # Uncomment when available

# Configure visualizations
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set random state for reproducibility
RANDOM_STATE = 42

## 2. Data Loading

Let's load the dataset and set up our feature categorization based on the data exploration findings.

In [None]:
# Load dataset
file_path = '../data/raw/dataset.csv'
df = pd.read_csv(file_path)

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}")  # Excluding target column

# Display first few rows
df.head()

In [None]:
# Define feature categories based on data exploration findings
cat_features = [
    'Marital status', 'Application mode', 'Course',
    'Daytime/evening attendance', 'Previous qualification', 'Nacionality',
    'Mother\'s qualification', 'Father\'s qualification', 
    'Mother\'s occupation', 'Father\'s occupation',
    'Displaced', 'Educational special needs', 'Debtor',
    'Tuition fees up to date', 'Gender', 'Scholarship holder',
    'International'
]

num_features = [
    'Application order', 'Age at enrollment',
    'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)',
    'Curricular units 1st sem (evaluations)', 'Curricular units 1st sem (approved)',
    'Curricular units 1st sem (grade)', 'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)',
    'Curricular units 2nd sem (evaluations)', 'Curricular units 2nd sem (approved)',
    'Curricular units 2nd sem (grade)', 'Curricular units 2nd sem (without evaluations)',
    'Unemployment rate', 'Inflation rate', 'GDP'
]

target = 'Target'

# Based on data exploration, identify key features that were most predictive
# This would come from the feature importance analysis in notebook 01
key_academic_features = [
    'Curricular units 1st sem (approved)',
    'Curricular units 1st sem (grade)',
    'Curricular units 2nd sem (approved)',
    'Curricular units 2nd sem (grade)',
    'Curricular units 1st sem (evaluations)',
    'Curricular units 2nd sem (evaluations)'
]

key_demographic_features = [
    'Age at enrollment',
    'Scholarship holder',
    'Marital status',
    'Debtor'
]

key_economic_features = [
    'Unemployment rate',
    'Inflation rate',
    'GDP'
]

# Check target distribution again
target_counts = df[target].value_counts()
print("\nTarget Distribution:")
print(target_counts)

## 3. Feature Preprocessing

Before creating new features, let's handle any data quality issues like missing values and outliers.

### 3.1 Handling Missing Values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_data = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})

# Display features with missing values (if any)
missing_features = missing_data[missing_data['Missing Values'] > 0].sort_values(by='Missing Values', ascending=False)
print("Features with missing values:")
display(missing_features)

# If there are missing values, let's handle them
if len(missing_features) > 0:
    # For categorical features with missing values
    cat_missing = [col for col in cat_features if col in missing_features.index]
    if cat_missing:
        # Use mode imputation for categorical features
        for col in cat_missing:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
            print(f"Filled missing values in {col} with mode: {mode_val}")
    
    # For numerical features with missing values
    num_missing = [col for col in num_features if col in missing_features.index]
    if num_missing:
        # Use median imputation for numerical features
        for col in num_missing:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"Filled missing values in {col} with median: {median_val:.2f}")
else:
    print("No missing values found in the dataset.")
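The `fillna` approach above computes each fill value on the spot. As a minimal sketch, assuming the same median/mode strategy, the already-imported `SimpleImputer` keeps the fitted values so they can be reapplied to new data with `transform`. The toy frame and its values are hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numerical and one categorical (coded) column
toy = pd.DataFrame({
    "Age at enrollment": [18.0, 20.0, np.nan, 35.0],
    "Gender": [1.0, 0.0, 1.0, np.nan],
})

# Median for numerical columns, most frequent value for categorical ones
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

toy[["Age at enrollment"]] = median_imputer.fit_transform(toy[["Age at enrollment"]])
toy[["Gender"]] = mode_imputer.fit_transform(toy[["Gender"]])

print(toy)
```

The fitted imputers store their fill values in `statistics_`, which makes it straightforward to fit them on training rows only and reuse them elsewhere.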

### 3.2 Handling Outliers

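We'll use the IQR method to detect outliers in numerical features: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. Features where more than 5% of rows are flagged are capped at those bounds; the rest are left unchanged.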

In [None]:
# Function to detect outliers using IQR method
def detect_outliers_iqr(data, feature):
    q1 = data[feature].quantile(0.25)
    q3 = data[feature].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)][feature]
    return outliers, lower_bound, upper_bound

# Check for outliers in numerical features
outlier_summary = []

for feature in num_features:
    outliers, lower_bound, upper_bound = detect_outliers_iqr(df, feature)
    outlier_count = len(outliers)
    outlier_percentage = (outlier_count / len(df)) * 100
    
    outlier_summary.append({
        'Feature': feature,
        'Outlier Count': outlier_count,
        'Outlier Percentage': outlier_percentage,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound
    })

outlier_df = pd.DataFrame(outlier_summary)
display(outlier_df.sort_values(by='Outlier Percentage', ascending=False))

# Handle outliers for features with significant outliers (e.g., > 5%)
# We'll cap the values at the boundaries rather than removing rows
significant_outliers = outlier_df[outlier_df['Outlier Percentage'] > 5]['Feature'].tolist()

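for feature in significant_outliers:
    _, lower_bound, upper_bound = detect_outliers_iqr(df, feature)
    if lower_bound == upper_bound:
        # IQR is zero (e.g. mostly-zero count columns); clipping would make the feature constant
        print(f"Skipping {feature}: zero IQR, capping would collapse it to a single value")
        continue
    print(f"Capping outliers for {feature}")
    # Cap values at the lower and upper IQR bounds
    df[feature] = df[feature].clip(lower=lower_bound, upper=upper_bound)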

# For features with fewer outliers, we'll keep them as they might be important signals
print(f"\nFeatures with significant outliers (>5%): {significant_outliers}")
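One thing to watch with IQR capping is a column where most values are identical, such as a count that is usually zero. The sketch below uses a hypothetical helper `iqr_bounds` and made-up values to show that both bounds then coincide, so clipping erases all variation in the column.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical count column: nine zeros and a single non-zero value
credited = pd.Series([0] * 9 + [5])

lower, upper = iqr_bounds(credited)
capped = credited.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")
print(f"distinct values after capping: {capped.nunique()}")
```

Checking `lower == upper` before clipping is one simple way to leave such columns untouched.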

## 4. Feature Creation

Based on our data exploration, we'll create new features that might help improve model performance.

### 4.1 Academic Performance Indicators

In [None]:
# Create aggregated academic performance indicators

# 1. Success Rate (Approved vs. Enrolled)
df['1st_sem_success_rate'] = df['Curricular units 1st sem (approved)'] / df['Curricular units 1st sem (enrolled)'].replace(0, 1)
df['2nd_sem_success_rate'] = df['Curricular units 2nd sem (approved)'] / df['Curricular units 2nd sem (enrolled)'].replace(0, 1)
df['overall_success_rate'] = (df['Curricular units 1st sem (approved)'] + df['Curricular units 2nd sem (approved)']) / \
                           (df['Curricular units 1st sem (enrolled)'] + df['Curricular units 2nd sem (enrolled)']).replace(0, 1)

# Replace infinity values with 0 (in case of division by 0)
df.replace([np.inf, -np.inf], 0, inplace=True)

# 2. Evaluation Engagement Ratio
df['1st_sem_evaluation_ratio'] = df['Curricular units 1st sem (evaluations)'] / df['Curricular units 1st sem (enrolled)'].replace(0, 1)
df['2nd_sem_evaluation_ratio'] = df['Curricular units 2nd sem (evaluations)'] / df['Curricular units 2nd sem (enrolled)'].replace(0, 1)

# 3. Performance Trend (2nd semester vs 1st semester)
df['grade_trend'] = df['Curricular units 2nd sem (grade)'] - df['Curricular units 1st sem (grade)']
df['approval_trend'] = df['Curricular units 2nd sem (approved)'] - df['Curricular units 1st sem (approved)']

# 4. Weighted Grade (considering number of units)
df['weighted_grade'] = (df['Curricular units 1st sem (grade)'] * df['Curricular units 1st sem (enrolled)'] + 
                       df['Curricular units 2nd sem (grade)'] * df['Curricular units 2nd sem (enrolled)']) / \
                      (df['Curricular units 1st sem (enrolled)'] + df['Curricular units 2nd sem (enrolled)']).replace(0, 1)

# 5. Non-evaluation Rate
df['1st_sem_non_eval_rate'] = df['Curricular units 1st sem (without evaluations)'] / df['Curricular units 1st sem (enrolled)'].replace(0, 1)
df['2nd_sem_non_eval_rate'] = df['Curricular units 2nd sem (without evaluations)'] / df['Curricular units 2nd sem (enrolled)'].replace(0, 1)

# 6. Credit Efficiency
df['credit_efficiency'] = (df['Curricular units 1st sem (credited)'] + df['Curricular units 2nd sem (credited)']) / \
                         (df['Curricular units 1st sem (enrolled)'] + df['Curricular units 2nd sem (enrolled)']).replace(0, 1)

# List of new academic features
new_academic_features = [
    '1st_sem_success_rate', '2nd_sem_success_rate', 'overall_success_rate',
    '1st_sem_evaluation_ratio', '2nd_sem_evaluation_ratio',
    'grade_trend', 'approval_trend', 'weighted_grade',
    '1st_sem_non_eval_rate', '2nd_sem_non_eval_rate',
    'credit_efficiency'
]

# Add to numerical features list
num_features.extend(new_academic_features)

# Display summary of new features
print("Summary of new academic performance indicators:")
display(df[new_academic_features].describe())
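The ratios above guard against division by zero with `.replace(0, 1)`. A possible alternative, sketched here with a hypothetical helper named `safe_ratio`, masks zero denominators and fills the result explicitly. The two approaches agree whenever the numerator is also zero, but differ when a student has a non-zero numerator with zero enrolled units.

```python
import pandas as pd

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator, returning `fill` where denominator is 0."""
    denominator = denominator.astype(float)
    # Zero denominators become NaN, so the division yields NaN instead of inf
    ratio = numerator / denominator.where(denominator != 0)
    return ratio.fillna(fill)

# Hypothetical values for approved and enrolled units
approved = pd.Series([5, 0, 3])
enrolled = pd.Series([6, 0, 3])

success_rate = safe_ratio(approved, enrolled)
print(success_rate)
```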

### 4.2 Engagement Metrics

Now let's create features that reflect student engagement.

In [None]:
# Create engagement metrics

# 1. Overall Engagement Score (using evaluations vs. enrolled)
df['engagement_score'] = ((df['Curricular units 1st sem (evaluations)'] / df['Curricular units 1st sem (enrolled)'].replace(0, 1)) + 
                        (df['Curricular units 2nd sem (evaluations)'] / df['Curricular units 2nd sem (enrolled)'].replace(0, 1))) / 2 * 100

# 2. Dropout Risk Indicator (based on non-evaluation rate)
df['dropout_risk_indicator'] = (df['1st_sem_non_eval_rate'] + df['2nd_sem_non_eval_rate']) / 2 * 100

# 3. Academic Consistency (100 minus 50 times the absolute gap between semester success rates)
df['academic_consistency'] = np.where(
    (df['1st_sem_success_rate'] > 0) & (df['2nd_sem_success_rate'] > 0),
    100 - (abs(df['1st_sem_success_rate'] - df['2nd_sem_success_rate']) * 50),
    0  # If either semester has zero success rate, consistency is 0
)

# 4. Academic Momentum (improvement from 1st to 2nd semester)
df['academic_momentum'] = np.where(
    df['approval_trend'] > 0,
    df['approval_trend'] * 10,  # Positive momentum
    df['approval_trend'] * 5    # Negative momentum (less weight)
)

# List of new engagement features
new_engagement_features = [
    'engagement_score', 'dropout_risk_indicator',
    'academic_consistency', 'academic_momentum'
]

# Add to numerical features list
num_features.extend(new_engagement_features)

# Display summary of new engagement features
print("Summary of new engagement metrics:")
display(df[new_engagement_features].describe())

### 4.3 Socioeconomic Indicators

Let's create features that combine socioeconomic indicators.

In [None]:
# Create socioeconomic indicators

# 1. Family Educational Support (combining mother's and father's qualification)
# Note: this sums the raw qualification codes, which is only meaningful if the codes
# are ordered by education level; verify this against the dataset documentation
df['family_education'] = df["Mother's qualification"] + df["Father's qualification"]

# 2. Financial Status Indicator (0 to 15: scholarship, no debt, and fees up to date add 5 each)
df['financial_status'] = df['Scholarship holder'] * 5 + (1 - df['Debtor']) * 5 + (df['Tuition fees up to date']) * 5

# 3. Social Support Index (based on scholarship and family education)
df['social_support_index'] = df['Scholarship holder'] * 10 + df['family_education'] / 2

# 4. Financial Risk (debtor status and tuition payment)
df['financial_risk'] = df['Debtor'] * 5 + (1 - df['Tuition fees up to date']) * 5

# List of new socioeconomic features
new_socioeconomic_features = [
    'family_education', 'financial_status',
    'social_support_index', 'financial_risk'
]

# Add to numerical features list
num_features.extend(new_socioeconomic_features)

# Display summary of new socioeconomic features
print("Summary of new socioeconomic indicators:")
display(df[new_socioeconomic_features].describe())

### 4.4 Economic Context Features

Let's create features that combine economic indicators with student characteristics.

In [None]:
# Create economic context features

# 1. Economic Pressure Index (unemployment plus inflation, minus GDP scaled down by 100)
df['economic_pressure'] = df['Unemployment rate'] + df['Inflation rate'] - df['GDP'] / 100

# 2. Economic Risk for Non-Scholarship Students
df['economic_risk_non_scholarship'] = np.where(
    df['Scholarship holder'] == 0,
    df['economic_pressure'] * 1.5,  # Higher risk for non-scholarship students
    df['economic_pressure'] * 0.5   # Lower risk for scholarship students
)

# 3. Economic Support Need (based on economic indicators and financial status)
# financial_status ranges from 0 to 15, so normalize by 15 to keep the multiplier in [0, 1]
df['economic_support_need'] = df['economic_pressure'] * (15 - df['financial_status']) / 15

# List of new economic context features
new_economic_features = [
    'economic_pressure', 'economic_risk_non_scholarship', 'economic_support_need'
]

# Add to numerical features list
num_features.extend(new_economic_features)

# Display summary of new economic context features
print("Summary of new economic context features:")
display(df[new_economic_features].describe())

## 5. Feature Transformation

Let's transform our features to make them more suitable for machine learning models.

### 5.1 Encoding Categorical Variables

In [None]:
# First, let's create a copy of the original DataFrame
df_processed = df.copy()

# Encode target variable
label_encoder = LabelEncoder()
df_processed['Target_encoded'] = label_encoder.fit_transform(df_processed[target])
target_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Target Encoding Mapping:")
for original, encoded in target_mapping.items():
    print(f"{original} -> {encoded}")

# One-hot encode categorical features with low cardinality (<= 10 categories)
# For high-cardinality features, we'll use target encoding later
cat_features_low_card = []
cat_features_high_card = []

for feature in cat_features:
    if df_processed[feature].nunique() <= 10:
        cat_features_low_card.append(feature)
    else:
        cat_features_high_card.append(feature)

# One-hot encode low-cardinality features
if cat_features_low_card:
    print(f"\nOne-hot encoding {len(cat_features_low_card)} low-cardinality features")
    encoder = OneHotEncoder(sparse_output=False, drop='first')
    encoded_data = encoder.fit_transform(df_processed[cat_features_low_card])
    encoded_feature_names = [f"{col}_{cat}" for col, cats in zip(cat_features_low_card, encoder.categories_) 
                           for cat in cats[1:]]  # Drop the first category
    
    # Create a DataFrame with encoded features
    encoded_df = pd.DataFrame(encoded_data, columns=encoded_feature_names, index=df_processed.index)
    
    # Combine with original DataFrame
    df_processed = pd.concat([df_processed, encoded_df], axis=1)
    
    # Note these encoded feature names for later use
    print(f"Added {len(encoded_feature_names)} one-hot encoded features")
else:
    encoded_feature_names = []
    print("No low-cardinality categorical features to one-hot encode")

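# For high-cardinality features, use target encoding (mean target encoding)
# Note: the category means below are computed on the full dataset, before the train/test split,
# so these columns carry target information from the test rows. They also average the integer
# class codes, which treats the target classes as if they were ordered.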
if cat_features_high_card:
    print(f"\nTarget encoding {len(cat_features_high_card)} high-cardinality features")
    target_encoded_feature_names = []
    
    for feature in cat_features_high_card:
        # Calculate mean target value for each category
        target_means = df_processed.groupby(feature)['Target_encoded'].mean()
        
        # Create new column with target encoding
        encoded_feature_name = f"{feature}_target_encoded"
        df_processed[encoded_feature_name] = df_processed[feature].map(target_means)
        
        # Add to list of target encoded features
        target_encoded_feature_names.append(encoded_feature_name)
    
    print(f"Added {len(target_encoded_feature_names)} target encoded features")
else:
    target_encoded_feature_names = []
    print("No high-cardinality categorical features to target encode")

# Update numerical features list with encoded feature names
all_encoded_features = encoded_feature_names + target_encoded_feature_names
num_features.extend(all_encoded_features)
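Mean target encoding computed on all rows lets each row's own label influence its encoded value. One common mitigation is out-of-fold encoding, sketched below under stated assumptions: the helper `out_of_fold_target_encode` and the toy `course` and `label` series are hypothetical, and the target is treated as numeric, as in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(categories, target, n_splits=5, random_state=42):
    """Mean-encode `categories` so each row's value comes from folds that exclude it."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, apply_idx in kfold.split(categories):
        fit_target = target.iloc[fit_idx]
        # Category means estimated only from the rows outside the current fold
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        mapped = categories.iloc[apply_idx].map(fold_means)
        # Categories unseen in the fitting rows fall back to the fitting-rows mean
        encoded.iloc[apply_idx] = mapped.fillna(fit_target.mean()).to_numpy()
    return encoded

# Hypothetical category codes and integer-encoded labels
course = pd.Series([1, 1, 2, 2, 3, 3, 1, 2, 3, 1])
label = pd.Series([0, 1, 2, 2, 0, 1, 0, 2, 1, 0])

course_encoded = out_of_fold_target_encode(course, label, n_splits=5)
print(course_encoded)
```

For held-out test rows, the category means would be computed once from the training rows and applied with `map`.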

### 5.2 Scaling Numerical Features

Let's scale our numerical features to ensure they're on comparable scales.

In [None]:
# Let's scale the numerical features using StandardScaler
# We'll keep the original features and add scaled versions

# Choose features to scale (original numerical features, not derived ones for transparency)
features_to_scale = [
    'Application order', 'Age at enrollment',
    'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)',
    'Curricular units 1st sem (evaluations)', 'Curricular units 1st sem (approved)',
    'Curricular units 1st sem (grade)', 'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)',
    'Curricular units 2nd sem (evaluations)', 'Curricular units 2nd sem (approved)',
    'Curricular units 2nd sem (grade)', 'Curricular units 2nd sem (without evaluations)',
    'Unemployment rate', 'Inflation rate', 'GDP'
]

# Scale features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_processed[features_to_scale])
scaled_feature_names = [f"{col}_scaled" for col in features_to_scale]

# Create a DataFrame with scaled features
scaled_df = pd.DataFrame(scaled_features, columns=scaled_feature_names, index=df_processed.index)

# Combine with processed DataFrame
df_processed = pd.concat([df_processed, scaled_df], axis=1)

# Add scaled feature names to numerical features list
num_features.extend(scaled_feature_names)

print(f"Added {len(scaled_feature_names)} scaled features")
print(f"Total features in df_processed: {df_processed.shape[1]}")

# Display summary of scaled features
print("\nSummary of scaled features:")
display(df_processed[scaled_feature_names].describe())
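Fitting `StandardScaler` on every row means the test rows contribute to the mean and standard deviation. A minimal sketch of the alternative, using synthetic data rather than the real dataset, fits the scaler on the training portion and reuses it for the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical columns
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training rows only
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)  # reuses the training statistics

print("train mean:", X_tr_scaled.mean(axis=0))
print("test mean:", X_te_scaled.mean(axis=0))
```

The training columns come out with mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.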

## 6. Feature Selection

Let's select the most relevant features for our model using both statistical methods and model-based selection.

### 6.1 Statistical Methods

In [None]:
# First, let's collect all our numerical features for feature selection
# Exclude the original categorical features and target
all_features_for_selection = [col for col in df_processed.columns 
                            if col not in cat_features + [target, 'Target_encoded']]

print(f"Number of features for selection: {len(all_features_for_selection)}")

# Prepare X and y for feature selection
X = df_processed[all_features_for_selection]
y = df_processed['Target_encoded']

# 1. ANOVA F-value for feature selection
print("\nFeature selection using ANOVA F-value:")
f_selector = SelectKBest(score_func=f_classif, k=20)  # Select top 20 features
f_selector.fit(X, y)

# Get scores and p-values
f_scores = pd.DataFrame({
    'Feature': all_features_for_selection,
    'F_Score': f_selector.scores_,
    'P_Value': f_selector.pvalues_
})

# Display top features by F-score
top_f_features = f_scores.sort_values(by='F_Score', ascending=False).head(20)
display(top_f_features)

# Get selected feature names
f_support = f_selector.get_support()
f_selected_features = [all_features_for_selection[i] for i in range(len(all_features_for_selection)) if f_support[i]]
print(f"Selected {len(f_selected_features)} features using ANOVA F-value")

# 2. Mutual Information for feature selection
print("\nFeature selection using Mutual Information:")
mi_selector = SelectKBest(score_func=mutual_info_classif, k=20)  # Select top 20 features
mi_selector.fit(X, y)

# Get scores
mi_scores = pd.DataFrame({
    'Feature': all_features_for_selection,
    'MI_Score': mi_selector.scores_
})

# Display top features by MI score
top_mi_features = mi_scores.sort_values(by='MI_Score', ascending=False).head(20)
display(top_mi_features)

# Get selected feature names
mi_support = mi_selector.get_support()
mi_selected_features = [all_features_for_selection[i] for i in range(len(all_features_for_selection)) if mi_support[i]]
print(f"Selected {len(mi_selected_features)} features using Mutual Information")

### 6.2 Model-Based Selection

Now, let's use a Random Forest model to evaluate feature importance.

In [None]:
# Random Forest for feature importance
rf_model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
rf_model.fit(X, y)

# Get feature importances
importances = rf_model.feature_importances_
rf_importance_df = pd.DataFrame({
    'Feature': all_features_for_selection,
    'Importance': importances
})

# Sort features by importance
rf_importance_df = rf_importance_df.sort_values(by='Importance', ascending=False)

# Display top 20 features
print("Top 20 features by Random Forest importance:")
display(rf_importance_df.head(20))

# Plot feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=rf_importance_df.head(20))
plt.title('Top 20 Features by Random Forest Importance', fontsize=16)
plt.tight_layout()
plt.show()

# Select top 20 features based on Random Forest importance
rf_selected_features = rf_importance_df.head(20)['Feature'].tolist()
print(f"Selected {len(rf_selected_features)} features using Random Forest importance")

In [None]:
# Create a combined set of selected features from all methods
# Sort the union so the column order is reproducible across runs
all_selected_features = sorted(set(f_selected_features + mi_selected_features + rf_selected_features))
print(f"Total unique features selected across all methods: {len(all_selected_features)}")
print("\nFinal selected features:")
for feature in sorted(all_selected_features):
    print(f"- {feature}")

# Create a final feature set that includes all selected features
final_features = all_selected_features.copy()

# Add target for completeness
final_features.append('Target')
final_features.append('Target_encoded')
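The union above keeps any feature chosen by at least one method. If a stricter set is wanted, one option is to keep only features chosen by two or more methods. The sketch below uses short hypothetical lists in place of the real selections; `Debtor_1` in particular is an assumed one-hot column name.

```python
from collections import Counter

# Hypothetical selections from the three methods
f_sel = ["overall_success_rate", "weighted_grade", "Debtor_1"]
mi_sel = ["overall_success_rate", "weighted_grade", "Age at enrollment"]
rf_sel = ["overall_success_rate", "Age at enrollment", "grade_trend"]

# Each list holds unique names, so a count equals the number of methods that chose the feature
votes = Counter(f_sel + mi_sel + rf_sel)
consensus = sorted(name for name, count in votes.items() if count >= 2)

print(consensus)
```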

## 7. Data Splitting

Let's split our data into training and testing sets for model evaluation.

In [None]:
# Create final dataset with selected features
df_final = df_processed[final_features]

# Split data into train and test sets
X = df_final.drop(columns=['Target', 'Target_encoded'])
y = df_final['Target_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Verify class distribution in train and test sets
print("\nClass distribution in training set:")
print(pd.Series(y_train).value_counts(normalize=True) * 100)

print("\nClass distribution in testing set:")
print(pd.Series(y_test).value_counts(normalize=True) * 100)

## 8. Feature Set Evaluation

Let's evaluate our engineered feature set using a simple Random Forest model.

In [None]:
# Evaluate the feature set using Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate model
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Cross-validation for more robust evaluation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_scores = cross_val_score(rf_model, X, y, cv=cv, scoring='accuracy')

print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
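Cross-validation scores are most trustworthy when every fitted step, including scaling and feature selection, is refit inside each fold. A minimal sketch with the already-imported `Pipeline` follows. It uses synthetic data from `make_classification` instead of the real dataset, and the step names and sizes are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the real feature matrix
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=8, n_classes=3, random_state=42
)

# Scaling and selection are fitted on the training part of each fold only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=demo_cv, scoring="accuracy")

print(f"Mean CV accuracy: {scores.mean():.4f}")
```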

## 9. Saving Processed Data

Let's save our processed data and feature information for the model training phase.

In [None]:
# Create directories if they don't exist
processed_data_dir = '../data/processed'
os.makedirs(processed_data_dir, exist_ok=True)

# Save the final processed dataframe
df_final.to_csv(f'{processed_data_dir}/processed_data.csv', index=False)
print(f"Saved processed data to {processed_data_dir}/processed_data.csv")

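# Save train-test split
# X and y keep matching index labels after train_test_split, so concat pairs each row
# with its own label; resetting the index on y alone would misalign them
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)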

train_data.to_csv(f'{processed_data_dir}/train_data.csv', index=False)
test_data.to_csv(f'{processed_data_dir}/test_data.csv', index=False)

print(f"Saved train data to {processed_data_dir}/train_data.csv")
print(f"Saved test data to {processed_data_dir}/test_data.csv")

# Save label encoder for later use
with open(f'{processed_data_dir}/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)
print(f"Saved label encoder to {processed_data_dir}/label_encoder.pkl")

# Save feature information
feature_info = {
    'original_features': {
        'categorical': cat_features,
        'numerical': list(features_to_scale)  # Original numerical columns (num_features now also holds derived ones)
    },
    'engineered_features': {
        'academic': new_academic_features,
        'engagement': new_engagement_features,
        'socioeconomic': new_socioeconomic_features,
        'economic': new_economic_features
    },
    'transformed_features': {
        'encoded': all_encoded_features,
        'scaled': scaled_feature_names
    },
    'selected_features': {
        'anova': f_selected_features,
        'mutual_info': mi_selected_features,
        'random_forest': rf_selected_features,
        'final': [f for f in final_features if f not in ['Target', 'Target_encoded']]
    },
    'target_mapping': target_mapping
}

with open(f'{processed_data_dir}/feature_info.pkl', 'wb') as f:
    pickle.dump(feature_info, f)
print(f"Saved feature information to {processed_data_dir}/feature_info.pkl")
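When features and labels are recombined for saving, `pd.concat(..., axis=1)` matches rows by index label, not by position. The toy example below, with made-up values, shows what happens if only one side has its index reset.

```python
import pandas as pd

# Shuffled index labels, as produced by a random train/test split
X_part = pd.DataFrame({"f": [10, 20, 30]}, index=[7, 2, 5])
y_part = pd.Series([0, 1, 2], index=[7, 2, 5], name="Target_encoded")

# Matching index labels: every row is paired with its own label
aligned = pd.concat([X_part, y_part], axis=1)

# Index reset on y only: labels 0, 1, 2 no longer match 7, 2, 5
misaligned = pd.concat([X_part, y_part.reset_index(drop=True)], axis=1)

print(aligned)
print(misaligned)
```

The misaligned frame gains extra rows filled with `NaN`, and the one overlapping label pairs a feature row with the wrong target.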

## Summary and Next Steps

In this notebook, we performed comprehensive feature engineering for our academic status and dropout prediction system. Here's a summary of what we accomplished:

1. **Feature Preprocessing**:
   - Handled missing values through imputation
   - Identified and capped outliers using the IQR method

2. **Feature Creation**:
   - Created academic performance indicators (success rates, grade trends, etc.)
   - Developed engagement metrics to capture student participation
   - Generated socioeconomic indicators by combining family and financial features
   - Created economic context features relating external factors to student characteristics

3. **Feature Transformation**:
   - Encoded categorical variables using one-hot encoding and target encoding
   - Scaled numerical features for better model performance

4. **Feature Selection**:
   - Applied statistical methods (ANOVA F-test, Mutual Information)
   - Used model-based selection with Random Forest importance
   - Combined methods to create a final feature set

5. **Data Splitting and Evaluation**:
   - Split data into training and testing sets
   - Evaluated feature set performance using a Random Forest model
   - Validated results with cross-validation

6. **Data Persistence**:
   - Saved processed data and feature information for model training

**Next Steps**:
1. Develop and evaluate multiple machine learning models in the model training notebook
2. Perform hyperparameter tuning to optimize model performance
3. Implement more sophisticated ensemble techniques
4. Create a robust pipeline for production deployment

The Random Forest evaluation in Section 8 gives a first estimate of how well the engineered features predict the target. Because target encoding, scaling, and feature selection were all fitted on the full dataset before the split, that estimate is likely optimistic. Our next step is to build and optimize machine learning models using these features to create an accurate academic status and dropout prediction system.