# Academic Status and Dropout Prediction - Data Exploration

This notebook explores the dataset used for predicting academic status and dropout rates of students in higher education institutions. The goal is to understand the data, identify patterns, and gain insights for feature engineering and model development.

## Table of Contents
1. [Setup and Configuration](#1.-Setup-and-Configuration)
2. [Data Loading and Overview](#2.-Data-Loading-and-Overview)
3. [Data Cleaning and Preprocessing](#3.-Data-Cleaning-and-Preprocessing)
4. [Exploratory Data Analysis](#4.-Exploratory-Data-Analysis)
   - [Univariate Analysis](#4.1-Univariate-Analysis)
   - [Bivariate Analysis](#4.2-Bivariate-Analysis)
   - [Multivariate Analysis](#4.3-Multivariate-Analysis)
5. [Correlation Analysis](#5.-Correlation-Analysis)
6. [Target Analysis](#6.-Target-Analysis)
7. [Feature Importance](#7.-Feature-Importance)
8. [Findings and Recommendations](#8.-Findings-and-Recommendations)

## 1. Setup and Configuration

Let's first import the necessary libraries and set up the environment.

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Import custom utility functions if any
import sys
sys.path.append('../')
# from src.data.data_utilities import load_dataset  # Uncomment when available

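# Configure visualizations
# Apply the seaborn theme first: it overrides font-related rcParams,
# so the custom sizes are set afterwards to make sure they take effect.
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
warnings.filterwarnings('ignore')  # Silences all warnings; narrow this when debugging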

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

## 2. Data Loading and Overview

Let's load the dataset and get a general overview of its structure.

In [None]:
# Load dataset
file_path = '../data/raw/dataset.csv'
df = pd.read_csv(file_path)

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}")  # Excluding target column

# Display first few rows
df.head()

In [None]:
# Display information about dataset
df.info()

In [None]:
# Summary statistics
df.describe(include='all').T

### Target Variable Distribution

Let's examine the distribution of the target variable `Target`, which represents each student's academic status.

In [None]:
# Display target variable distribution
target_counts = df['Target'].value_counts()
print("Target Distribution:")
print(target_counts)

# Visualize target distribution
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='Target', data=df)
plt.title('Target Distribution', fontsize=16)
plt.xlabel('Academic Status', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Add percentages on top of bars
total = len(df)
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.1f}%'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='bottom', fontsize=12)

plt.show()
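
The counts above can also be expressed as proportions, which makes class imbalance easier to judge before modeling. Here is a minimal sketch on a small made-up series; the class labels and counts are illustrative assumptions, not values from the dataset, and `toy_target` and `imbalance_ratio` are hypothetical names.

```python
import pandas as pd

# Hypothetical target column; labels and counts are illustrative only
toy_target = pd.Series(
    ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2, name="Target"
)

# Relative frequency of each class
proportions = toy_target.value_counts(normalize=True)

# Ratio between the largest and smallest class as a simple imbalance indicator
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.2f}")
```

On the real data, the same two lines could be applied to `df['Target']`. A ratio far above 1 would suggest considering stratified splits or class weights later on.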

## 3. Data Cleaning and Preprocessing

Let's check for missing values, duplicates, and any other data quality issues.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_data = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})

# Display features with missing values (if any)
features_with_missing = missing_data[missing_data['Missing Values'] > 0]

if features_with_missing.empty:
    print("No missing values found in the dataset.")
else:
    print("Features with missing values:")
    display(features_with_missing.sort_values(by='Missing Values', ascending=False))

In [None]:
# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print("\nSample of duplicate rows:")
    # Show only the first few duplicates rather than every duplicated row
    display(df[df.duplicated(keep='first')].head())

### Create Features Dictionary

Let's organize our features by their types to facilitate further analysis.

In [None]:
# Categorize features by type
cat_features = [
    'Marital status', 'Application mode', 'Course',
    'Daytime/evening attendance', 'Previous qualification', 'Nacionality',
    'Mother\'s qualification', 'Father\'s qualification', 
    'Mother\'s occupation', 'Father\'s occupation',
    'Displaced', 'Educational special needs', 'Debtor',
    'Tuition fees up to date', 'Gender', 'Scholarship holder',
    'International'
]

num_features = [
    'Application order', 'Age at enrollment',
    'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)',
    'Curricular units 1st sem (evaluations)', 'Curricular units 1st sem (approved)',
    'Curricular units 1st sem (grade)', 'Curricular units 1st sem (without evaluations)',
    'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)',
    'Curricular units 2nd sem (evaluations)', 'Curricular units 2nd sem (approved)',
    'Curricular units 2nd sem (grade)', 'Curricular units 2nd sem (without evaluations)',
    'Unemployment rate', 'Inflation rate', 'GDP'
]

target = 'Target'

# Verify that all features are accounted for
all_features = cat_features + num_features + [target]
missing_features = set(df.columns) - set(all_features)
extra_features = set(all_features) - set(df.columns)

if missing_features:
    print(f"Dataset columns not covered by our categorization: {missing_features}")
if extra_features:
    print(f"Categorized names not found in the dataset: {extra_features}")

print(f"Categorical features: {len(cat_features)}")
print(f"Numerical features: {len(num_features)}")
print(f"Total features: {len(cat_features) + len(num_features)}")

## 4. Exploratory Data Analysis

### 4.1 Univariate Analysis

#### Categorical Features

In [None]:
# Function to plot categorical feature distributions
def plot_categorical_distributions(df, features, rows=3, cols=3):
    plt.figure(figsize=(18, 15))
    for i, feature in enumerate(features, 1):
        if i <= rows * cols:
            plt.subplot(rows, cols, i)
            value_counts = df[feature].value_counts().sort_index()
            value_counts.plot(kind='bar')
            plt.title(f'Distribution of {feature}')
            plt.ylabel('Count')
            plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Plot distributions for first 9 categorical features
plot_categorical_distributions(df, cat_features[:9])

# Plot distributions for remaining categorical features
remaining_cat_features = cat_features[9:]
if remaining_cat_features:
    plot_categorical_distributions(df, remaining_cat_features)

#### Numerical Features

In [None]:
# Function to plot numerical feature distributions
def plot_numerical_distributions(df, features, rows=3, cols=3):
    plt.figure(figsize=(18, 15))
    for i, feature in enumerate(features, 1):
        if i <= rows * cols:
            plt.subplot(rows, cols, i)
            sns.histplot(df[feature], kde=True)
            plt.title(f'Distribution of {feature}')
            plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

# Plot distributions for first 9 numerical features
plot_numerical_distributions(df, num_features[:9])

# Plot distributions for remaining numerical features
remaining_num_features = num_features[9:]
if remaining_num_features:
    plot_numerical_distributions(df, remaining_num_features)

In [None]:
# Box plots for numerical features to detect outliers
def plot_boxplots(df, features, rows=3, cols=3):
    plt.figure(figsize=(18, 15))
    for i, feature in enumerate(features, 1):
        if i <= rows * cols:
            plt.subplot(rows, cols, i)
            sns.boxplot(x=df[feature])
            plt.title(f'Boxplot of {feature}')
    plt.tight_layout()
    plt.show()

# Plot boxplots for first 9 numerical features
plot_boxplots(df, num_features[:9])

# Plot boxplots for remaining numerical features
if remaining_num_features:
    plot_boxplots(df, remaining_num_features)

### 4.2 Bivariate Analysis

Let's explore the relationship between features and the target variable.

In [None]:
# Categorical features vs Target
def plot_categorical_vs_target(df, features, target, rows=3, cols=2):
    plt.figure(figsize=(18, 15))
    for i, feature in enumerate(features, 1):
        if i <= rows * cols:
            ax = plt.subplot(rows, cols, i)
            
            # Create a crosstab to calculate percentages
            ct = pd.crosstab(df[feature], df[target], normalize='index') * 100
            # Pass ax explicitly; otherwise DataFrame.plot opens a new figure
            ct.plot(kind='bar', stacked=True, ax=ax)
            
            plt.title(f'{feature} vs {target}')
            plt.ylabel('Percentage (%)')
            plt.xticks(rotation=45)
            plt.legend(title=target)
    plt.tight_layout()
    plt.show()

# Plot first 6 categorical features vs target
plot_categorical_vs_target(df, cat_features[:6], target)

# Plot next 6 categorical features vs target
if len(cat_features) > 6:
    plot_categorical_vs_target(df, cat_features[6:12], target)

# Plot remaining categorical features vs target
remaining_cat = cat_features[12:]
if remaining_cat:
    plot_categorical_vs_target(df, remaining_cat, target)

In [None]:
# Numerical features vs Target (Box plots)
def plot_numerical_vs_target(df, features, target, rows=3, cols=2):
    plt.figure(figsize=(18, 15))
    for i, feature in enumerate(features, 1):
        if i <= rows * cols:
            plt.subplot(rows, cols, i)
            sns.boxplot(x=target, y=feature, data=df)
            plt.title(f'{feature} by {target}')
            plt.xlabel(target)
            plt.ylabel(feature)
    plt.tight_layout()
    plt.show()

# Plot first 6 numerical features vs target
plot_numerical_vs_target(df, num_features[:6], target)

# Plot next 6 numerical features vs target
if len(num_features) > 6:
    plot_numerical_vs_target(df, num_features[6:12], target)

# Plot remaining numerical features vs target
remaining_num = num_features[12:]
if remaining_num:
    plot_numerical_vs_target(df, remaining_num, target)

### 4.3 Multivariate Analysis

Let's examine relationships between multiple variables.

In [None]:
# Scatter plots for academic performance features
academic_features = [
    'Curricular units 1st sem (approved)',
    'Curricular units 1st sem (grade)',
    'Curricular units 2nd sem (approved)',
    'Curricular units 2nd sem (grade)'
]

# pairplot creates its own figure; calling plt.figure() first would only leave an empty one
sns.pairplot(df[academic_features + [target]], hue=target, diag_kind='kde')
plt.suptitle('Multivariate Analysis of Academic Performance', y=1.02, fontsize=16)
plt.show()

In [None]:
# Economic indicators and their relationship with target
economic_features = ['Unemployment rate', 'Inflation rate', 'GDP']

# pairplot creates its own figure; calling plt.figure() first would only leave an empty one
sns.pairplot(df[economic_features + [target]], hue=target, diag_kind='kde')
plt.suptitle('Multivariate Analysis of Economic Indicators', y=1.02, fontsize=16)
plt.show()

## 5. Correlation Analysis

Let's analyze correlations between numerical features.

In [None]:
# Correlation matrix for numerical features
corr_matrix = df[num_features].corr()

plt.figure(figsize=(16, 14))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', 
            vmin=-1, vmax=1, linewidths=0.5, cbar_kws={'shrink': .8})
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Identify highly correlated features
threshold = 0.7
high_corr = {}

for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > threshold:
            col_i = corr_matrix.columns[i]
            col_j = corr_matrix.columns[j]
            high_corr[f"{col_i} & {col_j}"] = corr_matrix.iloc[i, j]

if high_corr:
    print(f"Highly correlated features (|correlation| > {threshold}):")
    for pair, corr_val in sorted(high_corr.items(), key=lambda x: abs(x[1]), reverse=True):
        print(f"{pair}: {corr_val:.2f}")
else:
    print(f"No highly correlated features found (threshold = {threshold})")
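
The pair search above can also be written without explicit loops. Here is a minimal sketch on synthetic data, assuming we only want each pair once and want to skip the diagonal; the column names `a`, `b`, `c` and the `toy_*` variables are hypothetical and unrelated to the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

toy_corr = toy.corr()

# Keep the strict lower triangle so each pair appears once and the diagonal is dropped
lower = np.tril(np.ones(toy_corr.shape, dtype=bool), k=-1)
pairs = toy_corr.where(lower).stack().dropna()

# Filter by absolute correlation and sort the strongest pairs first
toy_threshold = 0.7
strong_pairs = pairs[pairs.abs() > toy_threshold].sort_values(key=np.abs, ascending=False)
print(strong_pairs)
```

The result is a Series indexed by `(row, column)` pairs, which is convenient for sorting or exporting. The loop version in the cell above gives the same information.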

### Chi-Square Test for Categorical Features vs Target

In [None]:
# Chi-square test for categorical features
from scipy.stats import chi2_contingency

chi2_results = []

for feature in cat_features:
    contingency_table = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    chi2_results.append({
        'Feature': feature,
        'Chi-Square': chi2,
        'P-Value': p,
        'Significant': p < 0.05
    })

chi2_df = pd.DataFrame(chi2_results).sort_values(by='P-Value')
print("Chi-Square Test for Independence between Categorical Features and Target:")
display(chi2_df)
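
With a large sample, the chi-square test tends to flag even weak associations as significant, so an effect size can help rank features. Cramér's V is one common choice. It is not part of the original analysis, and the helper `cramers_v` below is a hypothetical name, shown on a tiny made-up table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # No continuity correction, so a perfect 2x2 association yields exactly 1
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return float("nan")  # Undefined when either variable has a single level
    return float(np.sqrt(chi2_stat / (n * min_dim)))

# Made-up example in which "feature" fully determines "status"
toy = pd.DataFrame({
    "feature": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "status":  ["x", "x", "y", "y", "x", "y", "x", "y"],
})
print(f"Cramér's V: {cramers_v(toy['feature'], toy['status']):.2f}")
```

Applied to the notebook's data, the call would look like `cramers_v(df[feature], df[target])` for each entry in `cat_features`.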

### ANOVA Test for Numerical Features vs Target

In [None]:
# ANOVA test for numerical features
from scipy.stats import f_oneway

anova_results = []

for feature in num_features:
    # Create groups based on target values
    groups = [df[df[target] == val][feature].dropna() for val in df[target].unique()]
    
    # Run ANOVA
    f_stat, p_value = f_oneway(*groups)
    
    anova_results.append({
        'Feature': feature,
        'F-Statistic': f_stat,
        'P-Value': p_value,
        'Significant': p_value < 0.05
    })

anova_df = pd.DataFrame(anova_results).sort_values(by='P-Value')
print("ANOVA Test for Numerical Features vs Target:")
display(anova_df)
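
One-way ANOVA assumes roughly normal groups with similar variances, which count-like or skewed features may not satisfy. As a hedged cross-check, the Kruskal-Wallis test (`scipy.stats.kruskal`) compares groups by rank instead. The sketch below uses synthetic skewed data and is not part of the original analysis; the group names are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

# Synthetic, right-skewed groups; the third has a larger scale
rng = np.random.default_rng(42)
toy_groups = [
    rng.exponential(scale=1.0, size=100),
    rng.exponential(scale=1.0, size=100),
    rng.exponential(scale=3.0, size=100),
]

# Parametric test (compares means) and rank-based test
toy_f, toy_p_anova = f_oneway(*toy_groups)
toy_h, toy_p_kruskal = kruskal(*toy_groups)

print(f"ANOVA:          F = {toy_f:.2f}, p = {toy_p_anova:.4g}")
print(f"Kruskal-Wallis: H = {toy_h:.2f}, p = {toy_p_kruskal:.4g}")
```

If the two tests disagree for a feature, that is a hint to inspect its distribution before relying on the ANOVA ranking.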

## 6. Target Analysis

Let's look at how the strongest features from the ANOVA and chi-square tests are distributed across the target classes.

In [None]:
# Top features by target class
def plot_feature_by_target(df, feature, target):
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # If feature is categorical
    if feature in cat_features:
        ct = pd.crosstab(df[feature], df[target], normalize='index') * 100
        # Draw on the existing axes; DataFrame.plot would otherwise open a new figure
        ct.plot(kind='bar', stacked=True, ax=ax)
        plt.ylabel('Percentage (%)')
    # If feature is numerical
    else:
        for target_val in df[target].unique():
            subset = df[df[target] == target_val][feature]
            sns.kdeplot(subset, label=f'{target}={target_val}', ax=ax)
        plt.ylabel('Density')
    
    plt.title(f'{feature} by {target}')
    plt.xlabel(feature)
    plt.legend(title=target)
    plt.tight_layout()
    plt.show()

# Select top features based on ANOVA and Chi-Square tests
top_num_features = anova_df.head(3)['Feature'].tolist()
top_cat_features = chi2_df.head(3)['Feature'].tolist()

# Plot top numerical features
for feature in top_num_features:
    plot_feature_by_target(df, feature, target)

# Plot top categorical features
for feature in top_cat_features:
    plot_feature_by_target(df, feature, target)

## 7. Feature Importance

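Let's get a preliminary idea of feature importance using a simple Random Forest model. The model is fit on the full dataset without a held-out split, so the impurity-based importances below are only a rough guide.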

In [None]:
# Basic feature importance with Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Prepare data for modeling
X = df.drop(columns=[target])
y = df[target]

# Encode target if needed
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Create preprocessor
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_features),
        ('cat', categorical_transformer, cat_features)
    ])

# Create and fit model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

model.fit(X, y_encoded)

# Extract feature names after one-hot encoding
cat_pipeline = model.named_steps['preprocessor'].named_transformers_['cat']
cat_feature_names = cat_pipeline.named_steps['onehot'].get_feature_names_out(cat_features)
feature_names = np.concatenate([num_features, cat_feature_names])

# Get feature importances
importances = model.named_steps['classifier'].feature_importances_

# Create dataframe of feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Display top 20 features
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(20)
display(top_features)

# Plot feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=top_features)
plt.title('Top 20 Feature Importances', fontsize=16)
plt.tight_layout()
plt.show()
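
Because one-hot encoding splits each categorical feature into several columns, its importance is spread across them. Summing the importances back to the original column can make the ranking easier to read. Here is a minimal sketch with made-up importances, assuming encoded names follow the `<column>_<category>` pattern; `to_original` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up importances for encoded columns named "<column>_<category>"
encoded_names = ["Age", "Course_1", "Course_2", "Gender_0", "Gender_1"]
encoded_importances = np.array([0.40, 0.15, 0.10, 0.20, 0.15])
original_columns = ["Age", "Course", "Gender"]

def to_original(name, columns):
    # Match on the longest original column name that prefixes the encoded name
    matches = [c for c in columns if name == c or name.startswith(c + "_")]
    return max(matches, key=len) if matches else name

# Sum the importances of all encoded columns belonging to the same original column
aggregated = (
    pd.Series(encoded_importances, index=encoded_names)
    .groupby(lambda n: to_original(n, original_columns))
    .sum()
    .sort_values(ascending=False)
)
print(aggregated)
```

In the notebook, `feature_names` and `importances` would take the place of the made-up arrays, with `num_features + cat_features` as the list of original columns.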

## 8. Findings and Recommendations

### Key Findings

Based on the exploratory data analysis, we can make the following observations:

1. **Target Distribution**:
   - [Fill in after running the notebook]

2. **Important Categorical Features**:
   - [Fill in after running the notebook]

3. **Important Numerical Features**:
   - [Fill in after running the notebook]

4. **Correlations**:
   - [Fill in after running the notebook]

5. **Feature Importance**:
   - [Fill in after running the notebook]

### Recommendations for Feature Engineering

Based on our analysis, we recommend the following feature engineering steps:

1. **Feature Selection**:
   - Consider using the top features identified through statistical tests and feature importance.
   - Remove or combine highly correlated features to reduce multicollinearity.

2. **Feature Transformation**:
   - Apply appropriate scaling to numerical features.
   - Consider log transformation for skewed numerical features.

3. **Feature Creation**:
   - Create academic performance indicators by combining semester data.
   - Consider creating interaction features between economic indicators and academic performance.
   - Develop a socioeconomic status indicator based on parental education and occupation.

4. **Dimensionality Reduction**:
   - Consider using PCA or other dimensionality reduction techniques if needed.
   - Group categorical levels that have similar target distributions.

These recommendations will be implemented in the feature engineering notebook.
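
As an illustration of the "Feature Creation" item, here is a minimal sketch of per-semester approval rates built from the enrolled and approved unit counts. The toy values are made up, and the derived column names (`approval_rate_1st`, `approval_rate_2nd`, `approval_rate_change`) are hypothetical; the feature engineering notebook may define them differently.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; the values are made up
toy = pd.DataFrame({
    "Curricular units 1st sem (enrolled)": [6, 5, 0],
    "Curricular units 1st sem (approved)": [6, 2, 0],
    "Curricular units 2nd sem (enrolled)": [6, 5, 0],
    "Curricular units 2nd sem (approved)": [5, 0, 0],
})

def approval_rate(frame, semester):
    enrolled = frame[f"Curricular units {semester} sem (enrolled)"]
    approved = frame[f"Curricular units {semester} sem (approved)"]
    # Students with no enrolled units get NaN instead of a division by zero
    return approved / enrolled.replace(0, np.nan)

toy["approval_rate_1st"] = approval_rate(toy, "1st")
toy["approval_rate_2nd"] = approval_rate(toy, "2nd")
# Negative values indicate a drop in performance between semesters
toy["approval_rate_change"] = toy["approval_rate_2nd"] - toy["approval_rate_1st"]

print(toy[["approval_rate_1st", "approval_rate_2nd", "approval_rate_change"]])
```

How to treat the NaN rows (students with no enrolled units) is a modeling decision. Imputing them or adding an explicit indicator column are two possible options.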