---
title: "Predicting Student Dropout and Academic Success"
authors: 
    - "Patricia Götz"
    - "Lana Kabbani"
    - "Noémie Glaus"
    - "Estela Gonzalez Vizcarra"
institute: University of Lausanne
date: today
title-block-banner: "#0095C8"
bibliography: reference.bib
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa.csl
format:
  html:
    theme: cosmo
    toc: true
    toc-depth: 4
    code-fold: true
    code-tools: true
    df-print: paged
    self-contained: true
  pdf:
    toc: false
    echo: false
    include-in-header:
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{
          commandchars=\\\{\},
          breaklines, breaknonspaceingroup, breakanywhere
        }
    
execute:
  warning: false
  message: false
---

## 1 - Introduction

Student retention and academic success are crucial challenges for higher education institutions worldwide. Recent international observations show rising university dropout trends across multiple regions, including Australia and the United States [@sokolova2024dropout]. Looking closer at Europe, recent data from the German Center for Higher Education Research and Science Studies (2022) show that almost 30% of bachelor’s students in Germany leave university without graduating [@hachmeister2024german]. In Portugal, which is the focus of our analysis, recent data from Statistics Portugal reveal that a considerable portion of young adults aged 15 to 34 (16.8%) have dropped out of at least one level of education during their academic path [@europedata2024portugal]. Moreover, among those who dropped out, more than half (50.8%) did not complete their tertiary studies, highlighting that higher education represents a critical point of disengagement [@europedata2024portugal].
These figures underline the seriousness of dropouts in higher education and the reinforced need for universities to rely on data-driven insights to identify at-risk students and to design early intervention strategies. 
We chose this topic because predicting student dropout not only helps optimize institutional resources but also supports students in achieving their academic goals. Understanding the factors that influence academic success, such as socio-economic background, previous academic performance, or family situation, can improve educational policies and personalized support systems. This subject is particularly meaningful in data science, as it allows us to combine analytical and predictive methods to better understand and prevent student dropout.


## 1.1 - Project Goals

The main objective of this project is to identify the factors that influence students to drop out, stay enrolled, or graduate from higher education. The dataset provides detailed information on each student’s academic performance, socioeconomic background, and demographic profile, offering a comprehensive view of the variables that shape educational outcomes. By the end of our analysis, we seek to identify the most significant combinations of academic and personal factors that influence student success. 
First, our analysis will focus on academic performance, examining how variables such as admission grades, semester evaluations, and course results relate to final outcomes. For instance, we will analyze whether early academic performance can serve as a reliable predictor of future dropout risk. We will then explore the influence of socioeconomic and personal factors, including parental education, occupation, and financial situation, to understand their impact on academic achievement. Lastly, the dataset will be used to build and evaluate classification models that predict students’ academic status (Dropout, Enrolled, or Graduate).
In summary, this study combines exploratory analysis, visualization, and predictive modeling to generate actionable insights that help universities detect at-risk students early and strengthen academic success.

---

## 2 - Related Work

Because student dropout is a major challenge in higher education, it is a well-established research area that has been widely studied over the years. Previous research articles informed our understanding of the topic, including the methods used to address the different research questions.

One relevant study, titled “Predicting Students’ Academic Success and Dropout Using Supervised Machine Learning”, investigates the prediction of student academic success using supervised machine learning classification models. The authors compare multiple classification algorithms, such as Decision Trees, Random Forest and Logistic Regression, to assess their ability to predict student outcomes. Their results show that these models are an effective tool for identifying students at risk of dropping out, which highlights the relevance of formulating this issue as a classification task.

Other articles emphasize the importance of constructing robust predictive models, but also the role of feature selection. In particular, a recent paper by Anaíle Mendes Rabelo and Luis Enrique Zarate (2024) demonstrates how combining academic performance indicators with contextual variables, such as course selection, improves the reliability of dropout prediction models.

In addition, an article published in 2022 named “Towards a Students’ Dropout Prediction Model in Higher Education Institutions Using Machine Learning Algorithms” focuses on the overall analytical pipeline used in educational data mining, from data preprocessing to model evaluation. The authors underline that data quality and preprocessing decisions play a key role in model performance. 

Overall, these research articles guided our methodological choices, particularly our choice of a classification framework, our focus on feature selection and our complete analysis process. Based on this literature, our project combines exploratory data analysis, predictive modeling, and interpretability techniques to better understand student outcomes.



## 3 - Research Questions

- I. How do academic performance indicators and study conditions influence students’ likelihood of graduation or dropout?
- II. What is the impact of demographic and socioeconomic background on students’ probability of dropping out?
    a. To what extent do financial factors (debtor status, scholarship holder) affect student retention?
- III. Can we accurately predict a student’s final status (Dropout, Enrolled, or Graduate) based on their demographic, socioeconomic, and academic characteristics, and which of these characteristics are the most relevant?
    a. Which feature category, academic (grades, units), socioeconomic (debt, scholarship) or demographic (age, gender), contributes the most to predicting student dropout?

## 4 - Data 

## 4.1 - Data Sourcing
The dataset is publicly available on the UCI Machine Learning Repository and was created from several disjoint databases of a higher education institution in Portugal. It covers students enrolled in different undergraduate programs and records demographic, socioeconomic and academic factors that can be related to dropout. Since the data has already been collected and can be downloaded directly from [UCI MLR - Predict Students' Dropout and Academic Success](https://archive.ics.uci.edu/dataset/697/predict%2Bstudents%2Bdropout%2Band%2Bacademic%2Bsuccess){target="_blank"} (accessed on 20th October), there is no need to collect more data via web scraping or APIs.

## 4.2 - Data Description
The dataset, containing data from a Portuguese higher education institution, is provided as a CSV file of approximately 520 KB and contains detailed information about students’ demographic, academic and socio-economic characteristics. It includes 4424 student records and 37 variables (features). After reviewing the dataset variables, we removed two irrelevant ones, resulting in 35 variables selected for analysis.

### 4.2.1 - Data Loading

In [None]:
#| label: setup

# Import libraries
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

# Load data
dataset = fetch_ucirepo(id=697)
X = np.array(dataset.data.features)
y = np.array(dataset.data.targets)

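# Create dataframe; concatenating the pandas objects directly keeps each
# column's dtype (np.column_stack would cast all columns to one common dtype)
col_names = dataset.variables["name"]
df = pd.concat([dataset.data.features, dataset.data.targets], axis=1)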

print(f"Dataset shape: {df.shape}")

---

### 4.2.2 - Variable Selection

We selected 35 relevant variables for analysis:

In [None]:
#| label: variable-selection

selected_columns = [ 
    "Marital Status", 
    "Application order", 
    "Course", 
    "Daytime/evening attendance", 
    "Previous qualification", 
    "Previous qualification (grade)",
    "Nacionality", 
    "Mother's qualification", 
    "Father's qualification", 
    "Mother's occupation", 
    "Father's occupation", 
    "Admission grade", 
    "Educational special needs", 
    "Gender", 
    "Scholarship holder", 
    "Age at enrollment", 
    "Displaced", 
    "Debtor", 
    "International", 
    "Curricular units 1st sem (credited)", 
    "Curricular units 1st sem (enrolled)", 
    "Curricular units 1st sem (evaluations)",
    "Curricular units 1st sem (approved)", 
    "Curricular units 1st sem (grade)", 
    "Curricular units 1st sem (without evaluations)", 
    "Curricular units 2nd sem (credited)", 
    "Curricular units 2nd sem (enrolled)", 
    "Curricular units 2nd sem (evaluations)", 
    "Curricular units 2nd sem (approved)", 
    "Curricular units 2nd sem (grade)", 
    "Curricular units 2nd sem (without evaluations)", 
    "Unemployment rate", 
    "Inflation rate", 
    "GDP", 
    "Target", 
]

df = df[selected_columns].copy()
print(f"Selected {len(selected_columns)} variables")

### 4.2.3 - Selected Variable Descriptions

In [None]:
#| label: variable-descriptions
#| echo: false
# Create variable information table
variable_info = pd.DataFrame({
    'Variable': [
        'Marital Status',
        'Application order',
        'Course',
        'Daytime/evening attendance',
        'Previous qualification',
        'Previous qualification (grade)',
        'Nacionality',
        "Mother's qualification",
        "Father's qualification",
        "Mother's occupation",
        "Father's occupation",
        'Admission grade',
        'Educational special needs',
        'Gender',
        'Scholarship holder',
        'Age at enrollment',
        'Displaced',
        'Debtor',
        'International',
        'Curricular units 1st sem (credited)',
        'Curricular units 1st sem (enrolled)',
        'Curricular units 1st sem (evaluations)',
        'Curricular units 1st sem (approved)',
        'Curricular units 1st sem (grade)',
        'Curricular units 1st sem (without evaluations)',
        'Curricular units 2nd sem (credited)',
        'Curricular units 2nd sem (enrolled)',
        'Curricular units 2nd sem (evaluations)',
        'Curricular units 2nd sem (approved)',
        'Curricular units 2nd sem (grade)',
        'Curricular units 2nd sem (without evaluations)',
        'Unemployment rate',
        'Inflation rate',
        'GDP',
        'Target'
    ],
    'Description': [
        'Student marital status',
        'Application preference order',
        'Course taken by student',
        'Attendance type (daytime or evening)',
        'Type of previous qualification',
        'Grade of previous qualification',
        'Student nationality',
        'Educational qualification of mother',
        'Educational qualification of father',
        'Occupation of mother',
        'Occupation of father',
        'Admission grade to the program',
        'Whether student has special educational needs',
        'Student gender',
        'Whether student is scholarship holder',
        'Age of student at enrollment',
        'Whether student is displaced from home',
        'Whether student is a debtor',
        'Whether student is international',
        'Credited units in 1st semester',
        'Enrolled units in 1st semester',
        'Number of evaluations in 1st semester',
        'Approved units in 1st semester',
        'Average grade in 1st semester',
        'Units without evaluations in 1st semester',
        'Credited units in 2nd semester',
        'Enrolled units in 2nd semester',
        'Number of evaluations in 2nd semester',
        'Approved units in 2nd semester',
        'Average grade in 2nd semester',
        'Units without evaluations in 2nd semester',
        'Unemployment rate at time of enrollment',
        'Inflation rate at time of enrollment',
        'GDP at time of enrollment',
        'Student status (Dropout, Enrolled, or Graduate)'
    ],
    'Type': [
        'Categorical',
        'Categorical',
        'Categorical',
        'Categorical',
        'Categorical',
        'Numerical (Continuous)',
        'Categorical',
        'Categorical',
        'Categorical',
        'Categorical',
        'Categorical',
        'Numerical (Continuous)',
        'Binary',
        'Binary',
        'Binary',
        'Numerical (Discrete)',
        'Binary',
        'Binary',
        'Binary',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Continuous)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Discrete)',
        'Numerical (Continuous)',
        'Numerical (Discrete)',
        'Numerical (Continuous)',
        'Numerical (Continuous)',
        'Numerical (Continuous)',
        'Categorical'
    ]
})
# Display table
from IPython.display import Markdown, display
# Create markdown table
table_md = variable_info.to_markdown(index=False)
display(Markdown(table_md))

This step posed no major difficulties. The dataset was already clean and encoded, so we did not need to merge variables or apply one-hot or ordinal encoding. We only had to convert categorical variables into readable labels to facilitate our visual analysis.

---

### 4.2.4 - Preprocessing (Data Cleaning and Wrangling)

One of the most important steps in our project is data cleaning and wrangling. After running the code to check for missing values and undefined numerical data, we found that the dataset contains no missing values and no data entry mistakes.

The dataset was already encoded. We removed the “Application mode” and “Tuition fees up to date” variables because they are not relevant to our research questions, which reduced the dataset by two columns.
After ensuring that the numeric columns are stored as numeric types, we translated categorical variables such as “Gender”, “Debtor”, “Displaced” and “Daytime/evening attendance” into readable string labels for analysis.
Although the dataset was well structured and clean, our main challenge was to establish its reliability. We checked for missing values, looked for mistakes, and identified variables irrelevant to our analysis. We then completed the cleaning work by converting the categorical variables, after which the dataset was ready to be analyzed.
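
The relabeling step described above is not shown in the code cells, so here is a minimal sketch of how it could look. The code-to-label mappings are assumptions based on the coding described in the UCI dataset documentation and should be checked against it; the helper name `add_readable_labels` and the toy data are hypothetical.

```python
import pandas as pd

# Assumed code-to-label mappings (verify against the UCI variable table before use)
LABEL_MAPS = {
    "Gender": {0: "Female", 1: "Male"},
    "Debtor": {0: "No", 1: "Yes"},
    "Displaced": {0: "No", 1: "Yes"},
    "Daytime/evening attendance": {0: "Evening", 1: "Daytime"},
}

def add_readable_labels(data, label_maps=LABEL_MAPS):
    """Return a copy with a '<col> (label)' column next to each encoded column."""
    out = data.copy()
    for col, mapping in label_maps.items():
        if col in out.columns:
            # Keep the numeric code for modeling; add a string label for plots
            out[f"{col} (label)"] = out[col].map(mapping)
    return out

# Toy example with three hypothetical students
toy = pd.DataFrame({"Gender": [0, 1, 1], "Debtor": [0, 0, 1]})
labeled = add_readable_labels(toy)
print(labeled)
```

Adding label columns instead of overwriting the codes keeps the numeric encoding available for correlation analysis and modeling.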


In [None]:
#| label: data-cleaning

def clean_dataframe(df, col_missing_thresh=0.30, row_missing_thresh=0.50):
    """Clean dataset with missing value handling."""
  
    # Work on a copy, then count missing values per column before any cleaning
    df = df.copy()
    missing = df.isna().sum()
    missing_data = missing[missing > 0]

    if len(missing_data) > 0:
        print(f"\n⚠️  Missing values found in {len(missing_data)} columns ({missing_data.sum():,} total)\n")
        display(missing_data.to_frame('Count'))
    else:
        print("\n✓ No missing values found!")

    print(f"\nShape before cleaning: {df.shape}")
    print(f"Missing values before cleaning: {df.isna().sum().sum()}")
        
    # Drop columns with excessive missing
    col_frac = df.isna().mean()
    drop_cols = col_frac[col_frac > col_missing_thresh].index.tolist()
    if drop_cols:
        df.drop(columns=drop_cols, inplace=True)
    
    # Drop rows with excessive missing
    row_frac = df.isna().mean(axis=1)
    drop_rows = row_frac[row_frac > row_missing_thresh].index
    if len(drop_rows):
        df = df.drop(index=drop_rows).reset_index(drop=True)
    
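    # Coerce columns to numeric where every value converts cleanly
    # (errors="ignore" is deprecated in recent pandas versions)
    def _to_numeric_if_possible(s):
        try:
            return pd.to_numeric(s)
        except (ValueError, TypeError):
            return s
    df = df.apply(_to_numeric_if_possible)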
    
    # Impute missing values
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())
    
    for col in df.select_dtypes(include=["category","object"]).columns:
        if df[col].isna().any():
            mode = df[col].mode(dropna=True)
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])
    
    return df

df = clean_dataframe(df)
print(f"Shape after cleaning: {df.shape}")
print(f"Missing values: {df.isna().sum().sum()}")


---

## 5 - Exploratory Data Analysis (EDA)

In this section, we explore the dataset to understand the main characteristics of the variables and how they relate to student outcomes (Dropout, Enrolled, Graduate). The goal of the EDA is to identify patterns, detect anomalies, and determine which features are most informative for predicting dropout.

## 5.1 - Target Variable
We begin by examining the distribution of the target variable.
The three student outcomes (Dropout, Enrolled, and Graduate) are highly imbalanced, with Graduates representing the largest group, followed by Dropouts, and a smaller proportion of Enrolled students.
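
To quantify the imbalance described above, class shares can be computed directly from the target column. The sketch below uses a small hypothetical label list rather than the real data, and the helper name `class_shares` is ours.

```python
import pandas as pd

def class_shares(labels):
    """Return the share of each class as a fraction of all observations."""
    return pd.Series(labels).value_counts(normalize=True)

# Toy labels (hypothetical, not the real dataset)
toy_target = ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2
shares = class_shares(toy_target)

# Ratio between the largest and the smallest class as a simple imbalance measure
imbalance_ratio = shares.max() / shares.min()
print(shares.round(2))
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that accuracy alone may be misleading and that per-class metrics deserve attention when evaluating classifiers.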

In [None]:
#| label: target-distribution

# Recode target
target_col = "Target"
df[target_col] = df[target_col].replace({
    0: "Dropout", 
    1: "Enrolled", 
    2: "Graduate"
})
df[target_col] = pd.Categorical(
    df[target_col], 
    categories=["Dropout", "Enrolled", "Graduate"], 
    ordered=True
)

# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
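# Keep the category order (Dropout, Enrolled, Graduate) so each bar gets its intended color
target_counts = df[target_col].value_counts().reindex(df[target_col].cat.categories)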
colors = ['#F46968', '#BCDCED', '#31709D']
bars = ax.bar(range(len(target_counts)), target_counts.values, 
              color=colors, edgecolor='black', linewidth=1.5)

ax.set_xticks(range(len(target_counts)))
ax.set_xticklabels(target_counts.index)
ax.set_ylabel('Number of Students', fontsize=11)
ax.set_title('Student Outcomes Distribution', fontsize=13, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add labels
for i, v in enumerate(target_counts.values):
    ax.text(i, v + 30, f'{v}\n({v/len(df)*100:.1f}%)', 
            ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---

## 5.2 - Correlation Analysis

In [None]:
#| label: correlation-matrix

# Calculate correlations
corr = df.corr(numeric_only=True)

# Create heatmap
plt.figure(figsize=(16, 14))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

sns.heatmap(
    corr, 
    mask=mask,
    cmap='RdBu_r', 
    center=0,
    vmin=-1, 
    vmax=1,
    annot=True, 
    fmt='.2f',
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8, "label": "Correlation"}
)

plt.title('Correlation Matrix of Numeric Variables', 
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

Based on our correlation analysis, we identified several moderately and highly correlated variable pairs that indicate multicollinearity.
The variables _International_ and _Nacionality_ are highly correlated, so we chose to remove _International_, since it is less relevant than _Nacionality_.
The variables _Father’s occupation_ and _Mother’s occupation_ are also highly correlated, but in this case the correlation reflects social structure: they describe two distinct individuals and two potentially different socioeconomic effects, so we keep both. The same applies to _Mother’s qualification_ and _Father’s qualification_.
We keep _Curricular units 1st sem (enrolled)_, _Curricular units 2nd sem (enrolled)_, _Curricular units 1st sem (grade)_ and _Curricular units 2nd sem (grade)_, even though each first-semester variable is highly correlated with its second-semester counterpart, because together they capture how performance progresses over time, which is relevant for predicting dropout. The remaining eight semester variables (credited, evaluations, approved and without evaluations, for each semester) are treated as redundant. In total, we therefore excluded eight semester variables and the _International_ variable.
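
Reading pairs off a large heatmap is error-prone, so a short helper can list them explicitly. The following is a minimal sketch on synthetic data; the function name `high_correlation_pairs` and the 0.8 threshold are illustrative assumptions, not values taken from our analysis.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(data, threshold=0.8):
    """List variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = data.corr(numeric_only=True)
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair appears exactly once
    rows, cs = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "var_1": cols[rows],
        "var_2": cols[cs],
        "correlation": corr.to_numpy()[rows, cs],
    })
    pairs = pairs[pairs["correlation"].abs() > threshold]
    return (pairs.sort_values("correlation", key=lambda s: s.abs(), ascending=False)
                 .reset_index(drop=True))

# Synthetic example: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

Applied to the project dataframe, such a list would only be a starting point: as discussed above, whether a correlated variable is dropped still depends on what it represents.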


---

## 5.3 - Feature Selection

In [None]:
#| label: feature-selection

# Remove highly correlated features
columns_to_remove = [
    "Curricular units 1st sem (credited)", 
    "Curricular units 1st sem (evaluations)", 
    "Curricular units 1st sem (approved)",
    "Curricular units 1st sem (without evaluations)",
    "Curricular units 2nd sem (credited)", 
    "Curricular units 2nd sem (evaluations)", 
    "Curricular units 2nd sem (approved)",
    "Curricular units 2nd sem (without evaluations)",
    "International",
]

df = df.drop(columns=columns_to_remove)
print(f"Removed {len(columns_to_remove)} highly correlated variables")
print(f"Remaining variables: {df.shape[1]}")

---

## 5.4 - Outlier Detection

We implemented a type-aware outlier detection strategy that applies different methods based on the nature of each variable:

**Binary variables** (e.g., Gender, Scholarship holder): Outlier detection was skipped entirely, as these variables only contain two valid values (0/1).

**Nominal categorical variables** (e.g., Course, Nationality): No outlier detection applied, as these represent distinct categories without natural ordering. We only reported the number of unique categories present.

**Ordinal categorical variables** (e.g., qualifications, occupations): We reported the number of levels but did not apply outlier detection, as these represent ordered categories rather than continuous measurements.

**Grade variables** (0-200 scale): We checked for values outside the valid range (0-200) and additionally flagged statistically extreme values, defined as observations with an absolute Z-score above 3. According to the dataset documentation, grades in the Portuguese system can range from 0 to 200.

**Count variables** (e.g., enrolled courses): We flagged values with an absolute Z-score above 3. Count variables naturally exhibit right-skewed distributions, so flagged high values may represent legitimate cases (e.g., students enrolling in many courses) and are reviewed rather than removed automatically.

**Continuous variables** (e.g., Age, GDP, Unemployment rate): We applied the same rule and flagged values lying more than three standard deviations from the mean (|Z| > 3).

This approach ensures that outlier detection is contextually appropriate for each variable type, reducing false positives while identifying genuine data quality issues.
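
The choice of rule affects how many observations are flagged. The sketch below contrasts two common rules for numeric variables, the Z-score rule used in the detection code of this section and Tukey's IQR fences, on a hypothetical list of ages. The function names and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series, z_thresh=3.0):
    """Boolean mask of values more than z_thresh standard deviations from the mean."""
    std = series.std()
    if std == 0 or np.isnan(std):
        return pd.Series(False, index=series.index)
    return ((series - series.mean()) / std).abs() > z_thresh

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical ages: mostly young students, two slightly older, one mature student
ages = pd.Series([18] * 5 + [19] * 10 + [20] * 10 + [21] * 5 + [22] * 3 + [25] * 2 + [70])

z_flags = ages[zscore_outliers(ages)].tolist()
iqr_flags = ages[iqr_outliers(ages, k=1.5)].tolist()
print("Z-score flags:", z_flags)
print("IQR flags:    ", iqr_flags)
```

On tightly clustered data the IQR fences tend to flag more observations than the Z-score rule, because a single extreme value inflates the standard deviation but barely moves the quartiles.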

In [None]:
#| label: outlier-detection

def detect_outliers_intelligent(df, var_type_dict):
    """Detect outliers based on variable type using simple statistical rules."""
    results = []
    
    # Binary variables - skip
    print("\n Binary variables (skipping outlier detection):")
    for col in var_type_dict.get('binary', []):
        if col in df.columns:
            unique_vals = sorted(df[col].dropna().unique())
            print(f"  - {col}: values = {unique_vals}")
    
    # Nominal categorical
    print("\n Nominal Categorical (no natural order):")
    for col in var_type_dict.get('nominal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f"  - {col}: {len(series.unique())} categories")
    
    # Ordinal categorical
    print("\n Ordinal Categorical (meaningful order):")
    for col in var_type_dict.get('ordinal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f"  - {col}: {len(series.unique())} levels")
    
    # Grade variables (0-200 scale + Z-score)
    print("\n Grade variables (0-200 range + Z-score > 3):")
    for col in var_type_dict.get('grades', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        
        # Check range violations
        out_of_range = (series < 0) | (series > 200)
        invalid = out_of_range.sum()
        
        # Check statistical outliers using Z-score
        mean, std = series.mean(), series.std()
        if std > 0:
            extreme = np.abs((series - mean) / std) > 3
        else:
            extreme = pd.Series(False, index=series.index)
        statistical_outliers = extreme.sum()
        
        # Count each observation once, even if it violates both rules
        total_outliers = (out_of_range | extreme).sum()
        outlier_pct = 100 * total_outliers / len(series) if len(series) > 0 else 0
        
        print(f"  - {col}: {invalid} out-of-range, {statistical_outliers} extreme (Z>3), "
              f"{total_outliers} total ({outlier_pct:.1f}%)")
        
        if total_outliers > 0:
            results.append({
                'column': col, 'type': 'grade', 
                'issue': 'out_of_range + extreme',
                'count': total_outliers, 'pct': outlier_pct
            })
    
    # Count variables (Z-score > 3)
    print("\n Count variables (Z-score > 3):")
    for col in var_type_dict.get('counts', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        
        outlier_pct = 100 * outliers / len(series)
        
        print(f"  - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        
        if outliers > 0:
            results.append({
                'column': col, 'type': 'count', 
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })
    
    # Continuous variables (Z-score > 3)
    print("\n Continuous variables (Z-score > 3):")
    for col in var_type_dict.get('continuous', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        
        outlier_pct = 100 * outliers / len(series)
        
        print(f"  - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        
        if outliers > 0:
            results.append({
                'column': col, 'type': 'continuous', 
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })
    
    return pd.DataFrame(results)

# Define variable types
var_types = {
    'binary': [
        "Daytime/evening attendance", "Educational special needs", 
        "Gender", "Scholarship holder", "Displaced", "Debtor", "International"
    ],
    'nominal': ["Course", "Nacionality"],
    'ordinal': [
        "Marital Status", "Application mode", "Application order",
        "Previous qualification", "Mother's qualification", 
        "Father's qualification", "Mother's occupation", "Father's occupation"
    ],
    'grades': [
        "Previous qualification (grade)", "Admission grade",
        "Curricular units 1st sem (grade)", "Curricular units 2nd sem (grade)"
    ],
    'counts': [
        "Curricular units 1st sem (enrolled)",
        "Curricular units 2nd sem (enrolled)"
    ],
    'continuous': ["Age at enrollment", "Unemployment rate", "Inflation rate", "GDP"]
}

# Run outlier detection
outlier_results = detect_outliers_intelligent(df, var_types)

### 5.4.1 - Outlier Summary

In [None]:
#| label: outlier-summary

if not outlier_results.empty:
    outlier_results = outlier_results.sort_values('pct', ascending=False)
    print("\n Detected Issues:")
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(outlier_results)
    
    # Visualize problematic variables
    for _, row in outlier_results.iterrows():
        col = row['column']
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        
        # Histogram
        df[col].hist(bins=30, ax=ax1, edgecolor='black')
        ax1.set_title(f"Distribution")
        ax1.set_xlabel(col)
        ax1.set_ylabel("Frequency")
        ax1.grid(alpha=0.3)
        
        # Boxplot
        sns.boxplot(y=df[col], ax=ax2)
        ax2.set_title(f"Boxplot ({row['type']})")
        ax2.grid(alpha=0.3, axis='y')
        
        plt.suptitle(f"{col}: {row['count']} potential outliers ({row['pct']:.1f}%)", 
                     fontsize=12, fontweight='bold')
        plt.tight_layout()
        plt.show()
else:
    print("\n No significant outliers detected!")

We identified outliers in five variables. The first two are _Curricular units 1st sem (enrolled)_ and _Curricular units 2nd sem (enrolled)_, which represent the number of courses students register for each semester. We observed 106 potential outliers in the first semester and 82 in the second. Since the usual course load is 5 to 6 classes, students taking a much higher or lower number of courses are naturally flagged as outliers. In the first semester, the highest value reaches 26 classes. Although this is an ambitious workload, it remains possible. Several situations could explain such a high number: for instance, a student trying to complete their degree quickly, or a student retaking courses after previous failures. These cases can reflect meaningful academic behaviours, so removing them would risk losing useful information. For the second semester, the maximum value is around 20 classes, leading to similar conclusions. In both semesters, we also observe students enrolled in zero courses, which appears as an extreme value as well. This may correspond to students who completed most of their required courses earlier, or to students taking a temporary break while still being officially enrolled. These profiles are still relevant and should be included. In this context, these extreme values are not problematic. On the contrary, they may help us understand whether taking unusually many, or unusually few, courses has an impact on dropout. For this reason, we decided not to remove or cap these observations.

The next variable with detected outliers is _Age at enrollment_, for which 101 potential outliers were identified. Since the average age at enrollment is around 20 years old, students beginning their studies at 40 or 50 naturally appear as unusual cases. The oldest student is 70 years old, which, while rare, raises no methodological concern for our analysis: a 70-year-old is a student like any other, and this data point should be included. These values represent real and meaningful student profiles, such as mature students or individuals returning to education after a long break. Excluding them would remove important diversity from the dataset and limit our understanding of the different types of students who may or may not drop out. For this reason, we chose not to remove or limit the age-related outliers.

Finally, outliers were also detected in _Admission grade_ (22 cases) and _Previous qualification (grade)_ (21 cases). These extreme values reflect either exceptionally high academic performance or, conversely, unusually low grades. Since these cases may provide insights into how prior academic achievement relates to dropout behavior, removing or capping them would not be appropriate. We therefore opted to retain all outliers in these grade variables as well.

Based on our research questions, we conclude that removing these outliers would not benefit our analysis, as they do not represent errors but rather uncommon yet meaningful observations. Retaining them allows us to capture the full diversity of student profiles and provides a more accurate understanding of the factors that may influence dropout.


## 6 - Feature Importance Analysis

### 6.1 - Methodology

We used one-way ANOVA (Analysis of Variance) to identify which numeric variables show significant differences across the three target groups (Dropout, Enrolled, Graduate). For each variable, we calculated:

- **p-value**: Statistical significance of differences between groups (α = 0.05)
- **Eta-squared (η²)**: Effect size measure representing the proportion of variance explained by the target variable (ranges from 0 to 1, where higher values indicate stronger association)

Variables with p-value < 0.05 are considered significantly associated with student outcomes and may be strong predictors in classification models.
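
For reference, the effect size can be written with the usual one-way ANOVA sums of squares. The symbols are introduced here for illustration only: $k$ indexes the three target groups, $n_k$ is the number of students in group $k$, $\bar{x}_k$ is the group mean of the variable, and $\bar{x}$ is its grand mean.

$$
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
       = \frac{\sum_{k} n_k \,(\bar{x}_k - \bar{x})^2}{\sum_{i} (x_i - \bar{x})^2}
$$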

In [None]:
#| label: anova-analysis

# ANOVA for numeric variables
from scipy.stats import f_oneway

anova_results = {}
numeric_cols = df.select_dtypes(include=np.number).columns

for col in numeric_cols:
    # One sample per target category, ignoring missing values
    groups = [df.loc[df[target_col] == cat, col].dropna()
              for cat in df[target_col].cat.categories
              if cat in df[target_col].unique()]
    # Drop empty groups so they cannot turn the statistics into NaN
    groups = [g for g in groups if len(g) > 0]

    # Need at least 2 non-empty groups
    if len(groups) < 2:
        continue

    f_val, p_val = f_oneway(*groups)
    
    # Effect size: eta-squared
    grand_mean = df[col].mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((df[col] - grand_mean) ** 2).sum()
    eta_sq = ss_between / ss_total if ss_total > 0 else np.nan
    
    anova_results[col] = {"p_value": p_val, "eta_sq": eta_sq}

# Create results dataframe
anova_df = (pd.DataFrame(anova_results).T
            .sort_values(["p_value", "eta_sq"], ascending=[True, False]))
anova_df["significant"] = anova_df["p_value"] < 0.05

print(f"Significant variables (p < 0.05): {anova_df['significant'].sum()}")
anova_df.head(15)

When evaluating feature importance, both statistical significance (p-value) and practical significance (effect size) must be considered. With large sample sizes, even trivial differences can reach statistical significance, making effect size interpretation essential.

**Effect Size (η²)** measures the proportion of each feature's variance that is explained by the outcome group, with the following conventional interpretations:

- **η² = 0.01** (1%): Small effect
- **η² = 0.06** (6%): Medium effect  
- **η² = 0.14** (14%): Large effect

For example, marital status has a highly significant p-value (2.66e-09) but explains less than 1% of variance (η² = 0.009), indicating negligible practical importance. In contrast, "Curricular units 2nd sem (grade)" explains 34% of variance (η² = 0.339), representing a large and meaningful effect. When identifying important predictors, it's better to prioritize features with larger effect sizes rather than relying only on p-values. 
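
The thresholds above can be applied mechanically once η² is available. The following is a minimal sketch, assuming the three benchmarks are treated as lower bounds and that anything below 0.01 is labelled "negligible". The helper name `label_effect_size` is hypothetical; the two η² values are the ones quoted in the text.

```python
def label_effect_size(eta_sq: float) -> str:
    """Map eta-squared to a qualitative label using the benchmarks listed above."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: below the 'small' benchmark


# Eta-squared values quoted in the text
examples = {
    "Marital status": 0.009,
    "Curricular units 2nd sem (grade)": 0.339,
}
labels = {name: label_effect_size(value) for name, value in examples.items()}
print(labels)
```

With a results table like `anova_df`, the same function could be applied to the `eta_sq` column to sort features by practical rather than purely statistical significance.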

## 6.2 - Top Predictive Variables

### 6.2.1 - Academic Performance Indicators

Our exploratory analysis shows relationships between academic performance measures and student outcomes (Dropout, Enrolled, Graduate). Several patterns emerge across admission grades, semester performance, and course load.

In [None]:
#| label: fig-admission-grade
#| fig-cap: Admission grade by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Admission grade',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Admission grade by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In Figure 1, we observe the distribution of the admission grade across the three categories (Dropout, Enrolled, and Graduate). Dropout students have a median admission grade of around 122, with several outliers reaching above 160. Enrolled students show a very similar median to Dropout students, but with fewer extreme values. Graduate students display a slightly higher median admission grade, around 125, and similarly present a few outliers above 160.
Overall, the three groups show comparable distributions, with considerable overlap in their admission grades. Graduate students tend to have a marginally higher median, which may suggest that stronger academic preparation is associated with a greater likelihood of graduating. However, the presence of high admission grades in both the Dropout and Graduate categories indicates that good grades alone do not fully determine academic outcomes. In other words, while admission grade may play a role, it is not a decisive predictor of whether a student will graduate or drop out.
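
The central values and spreads read off a boxplot can also be computed directly, which avoids estimating them by eye. The sketch below uses a small toy table, not the real dataset. The column names mirror those used in the plots, and the numbers are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Target": ["Dropout"] * 4 + ["Enrolled"] * 4 + ["Graduate"] * 4,
    "Admission grade": [110, 120, 124, 140,
                        112, 121, 123, 138,
                        115, 124, 126, 150],
})

# Quartiles per outcome group: the same quantities a boxplot draws
summary = (toy.groupby("Target")["Admission grade"]
           .quantile([0.25, 0.5, 0.75])
           .unstack())
summary.columns = ["q1", "median", "q3"]
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```

Applied to the real data frame, the same pattern would give the medians and interquartile ranges behind each of the boxplots in this section.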


In [None]:
#| label: fig-prev-qual-grade
#| fig-cap: Previous qualification grade by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Previous qualification (grade)',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Previous qualification (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 2 shows the distribution of the previous qualification grade across the three target groups (Dropout, Enrolled, and Graduate). All three boxplots display similar characteristics, with medians around 130-133. The minimum and maximum values are also comparable, ranging from about 100 to 165. The three groups have multiple outliers at both the lower and upper extremes of the grade distribution, with the Dropout and Graduate groups showing the most extreme values.

Because the distributions and medians are so similar, previous qualification grade does not appear to be a strong predictor of student outcomes. Interestingly, the Dropout group's median is quite high, which indicates that students who drop out do not necessarily have lower prior grades than those who graduate or stay enrolled. The outliers show that each category contains both very high and very low grades, which suggests that factors beyond prior academic performance are at play.


In [None]:
#| label: fig-st-sem-grade
#| fig-cap: First semester grade by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 1st sem (grade)',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Curricular units 1st sem (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In Figure 3, we see the first-semester grades for the three target groups: Dropout, Enrolled, and Graduate. Dropout students show a wide range of grades. Enrolled students have a median grade of around 12.5, with moderate spread. Graduate students have the highest median, around 13.5, and a tighter distribution.
The wide spread among Dropout students suggests that leaving the program is not only due to low grades. Enrolled students show average performance, indicating steady progress but not yet completion. Graduate students perform consistently better, suggesting that higher and more stable first-semester grades are associated with graduation.


In [None]:
#| label: fig-boxplot-grades
#| fig-cap: Boxplot of Grades by Target
plt.figure(figsize=(8, 5))

sns.boxplot(
    x=target_col,
    y='Curricular units 2nd sem (grade)',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)

plt.title(f"Curricular units 2nd sem (grade) by {target_col}", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 4 reveals distinct patterns across the three groups. The Dropout category displays the widest range of grades. Enrolled students show moderate variability, with a median grade near 12. Graduates show the tightest distribution and the highest median, at approximately 13. By the second semester, the gaps between groups widen. Many dropouts have a grade of 0 (the distribution starts at 0), which suggests they completed few or no units and likely left the program around this time. Graduates continued performing well, with consistent grades around 13. Enrolled students fell in between, with decent but mixed performance. The second semester appears to be a turning point where struggling students drop out while successful students keep their momentum.

In [None]:
#| label: fig-1st-sem-enrolled
#| fig-cap: First semester enrollment by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 1st sem (enrolled)',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Curricular units 1st sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 5 shows the relationship between _Curricular units 1st sem (enrolled)_ and the three target outcomes (Dropout, Enrolled, Graduate). All three groups show similar box positions, with medians around 5-6 units. Dropout and Enrolled students have nearly identical distributions, while Graduates have a slightly higher box position. All groups show numerous outliers, particularly on the upper end, with some students enrolling in 15-26 units.
The number of courses taken therefore does not appear to be a strong factor in student outcomes, since all groups show similar enrollment patterns. Many high outliers appear across all groups, suggesting that ambitious enrollment is common regardless of eventual outcome.


In [None]:
#| label: fig-2nd-sem-enrolled
#| fig-cap: Second semester enrollment by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 2nd sem (enrolled)',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Curricular units 2nd sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

As shown in Figure 6, Dropout and Enrolled students have similar distributions, with their boxes positioned in the lower range. Graduate students show a noticeably higher box position and a wider spread. All three groups display numerous outliers, particularly on the upper end.
As with the first-semester enrollment patterns, the second semester shows that graduates tend to enroll in slightly more courses, though the differences remain modest. The similar enrollment behavior of Dropout and Enrolled students suggests that second-semester course load does not strongly differentiate these groups; the key difference appears to lie in completion rather than in enrollment ambitions.


In [None]:
#| label: fig-daytime-evening
#| fig-cap: Daytime/evening attendance by student outcome

tab = (pd.crosstab(df[target_col], df['Daytime/evening attendance'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#F46968', '#8BC68A'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Daytime/evening attendance by Target', fontsize=12, fontweight='bold')
plt.legend(title='Attendance', labels=['Evening', 'Daytime'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Figure 7 shows the proportion of daytime and evening attendance within the three groups (Dropout, Enrolled, Graduate). Daytime attendance dominates across all three groups, representing approximately 85-90% of students. However, Dropout students show a slightly higher proportion of evening attendance (around 15%) compared to Enrolled and Graduate students (around 10%). 
This small difference might indicate that evening students face additional challenges, though the similarity across all groups suggests attendance timing is not a primary driver of dropout rates.


In [None]:
#| label: fig-application-order
#| fig-cap: Application order by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Application order',
    data=df,
    palette=['#F46968', '#BCDCED', '#326E9E']
)
plt.title('Application order by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 8 shows the application order for the three target groups. All three groups show similar distributions, positioned in the lower range. The medians are approximately 1.5-2 for all categories. The upper whisker is similar for the three groups, reaching 3, while the lower whisker is at 0 for Graduates and around 1 for Dropout and Enrolled. There are numerous outliers at 4, 5 and 6, and even 9 in the Enrolled category, indicating that for some students this institution was their 4th, 5th, 6th or even 9th choice.
Since the distribution is similar in the three categories, application order does not appear to have a strong relationship with student outcomes. Most students applied to this institution as their first or second choice, suggesting that institutional preference does not really predict whether a student will drop out, stay enrolled or graduate. The similar outlier patterns across groups support the view that application order is not a meaningful predictor of a student's outcome.


In [None]:
#| label: fig-displaced
#| fig-cap: Displaced status by student outcome

tab = (pd.crosstab(df[target_col], df['Displaced'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#F46968', '#8BC68A'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Displaced by Target', fontsize=12, fontweight='bold')
plt.legend(title='Displaced', labels=['No', 'Yes'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Figure 9 shows the proportion of displaced students (those who moved or changed residence) across the three target groups. Dropout students have the highest proportion of non-displaced students, at around 53%. About 45% of Enrolled students are non-displaced. Graduate students have the lowest share, at approximately 40%, meaning that about 60% of graduates relocated.
The pattern suggests that students who relocated for their studies were more likely to graduate. This could be because moving demonstrates stronger commitment to education, or because staying home means dealing with work, family responsibilities, or other obligations that interfere with studying. Dropouts were the least likely to have relocated, suggesting that remaining in their original environment may have made it harder to focus on academics.
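
The stacked bars in this section show the composition of each outcome group (rows sum to 1), which is not the same as the outcome rate within each displaced status. Both views come from the same crosstab with a different `normalize` argument, as sketched below on invented toy counts. The real proportions are those reported in the figures.

```python
import pandas as pd

# Toy counts for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Target": ["Dropout"] * 10 + ["Graduate"] * 20,
    "Displaced": [1] * 4 + [0] * 6 + [1] * 12 + [0] * 8,
})

# Composition of each outcome group: rows sum to 1 (what the stacked bars show)
within_target = pd.crosstab(toy["Target"], toy["Displaced"], normalize="index")

# Outcome shares within each displaced status: columns sum to 1
within_status = pd.crosstab(toy["Target"], toy["Displaced"], normalize="columns")

print(within_target.round(2))
print(within_status.round(2))
```

In this toy example, 60% of graduates are displaced, while 75% of displaced students graduate. Statements about who is "more likely to graduate" correspond to the second table rather than the first.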


### 6.2.2 - Key Findings for Academic Performance and Study Conditions

Graduates have slightly higher admission grades and previous qualification grades than dropouts, but the differences are small. Prior academic preparation therefore shows limited predictive power.

First-semester grades are among the strongest predictors of student outcomes. Students who drop out show dramatically lower grades (many between 0 and 5), while graduates consistently have higher grades (median around 13.5). First-semester performance is therefore a critical warning signal for identifying at-risk students.

Graduates tend to enroll in more courses in the first semester (median around 6-7) than those who drop out (median around 5-6). This may reflect stronger initial academic engagement, although the difference remains small.

Daytime/evening attendance shows only a small difference between groups. Evening students are somewhat over-represented among dropouts (around 15% of dropouts versus around 10% of graduates), which may reflect the additional challenge of balancing work or family responsibilities with studies, but attendance timing does not appear to be a primary driver of dropout.

Displaced students are more represented among graduates (around 60% of graduates versus around 47% of dropouts). This somewhat counter-intuitive pattern suggests that relocating for studies may reflect stronger commitment or independence.


### 6.2.3 - Demographic & Socioeconomic Background


In [None]:
#| label: fig-age
#| fig-cap: Age at enrollment by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Age at enrollment',
    data=df,
    palette=['#F46968', '#BCDCED', '#31709D']
)
plt.title('Age at enrollment by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 10 shows the relationship between age at enrollment and the three student outcomes. The Dropout group has the highest median age, approximately 23, and the widest interquartile range. The Enrolled group has a median age around 20-21, while the Graduate group shows the lowest median age, around 19, and the narrowest interquartile range. All three groups contain numerous outliers, corresponding to older students from their late 30s up to 70 years old.

This suggests that age at enrollment is an important predictor of student outcomes. Students who enroll at a younger age are more likely to graduate, while older students face a higher risk of dropping out. Several factors could explain this: younger students may have fewer external responsibilities, whereas older students may deal with multiple commitments that interfere with their studies. The wider distribution of the Dropout group shows that dropout can occur at any age. However, older students do successfully graduate, showing that age alone does not determine success.

In [None]:
#| label: fig-gender
#| fig-cap: Gender by student outcome

tab = (pd.crosstab(df[target_col], df['Gender'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#F8A88E', '#8CB3D5'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Gender by Target', fontsize=12, fontweight='bold')
plt.legend(title='Gender', labels=['Female', 'Male'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Figure 11 shows the proportion of each gender within the three target groups (Dropout, Enrolled, Graduate). In the Dropout group, the proportions of male and female students are almost equal. In contrast, both the Enrolled and Graduate groups have a higher proportion of female students than male students.

This suggests that female students tend to persist and complete their studies at higher rates than male students. Male students appear slightly more likely to interrupt or drop out of their programs, which may contribute to their lower proportions in the Enrolled and Graduate groups.

In [None]:
#| label: fig-scholarship
#| fig-cap: Scholarship holder status by student outcome

#plt.figure(figsize=(8, 5))
tab = (pd.crosstab(df[target_col], df['Scholarship holder'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True, 
        color=['#F46968', '#8BC68A'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Scholarship holder by Target', fontsize=12, fontweight='bold')
plt.legend(title='Scholarship holder', labels=['No', 'Yes'], 
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Figure 12 shows the proportion of scholarship holders within each target group (Dropout, Enrolled, Graduate).
In both the Dropout and Enrolled groups, the vast majority of students do not receive a scholarship, with only a small proportion being scholarship holders. In contrast, the Graduate group contains a noticeably higher proportion of scholarship recipients.
This suggests that students who receive a scholarship may be more likely to graduate than those who do not. Scholarships often reduce financial pressure and provide support that may help students remain enrolled and complete their studies. Conversely, students without scholarships are more represented among dropouts and ongoing enrollments.


In [None]:
#| label: fig-debtor
#| fig-cap: Debtor status by student outcome

tab = (pd.crosstab(df[target_col], df['Debtor'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#F46968', '#8BC68A'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Debtor by Target', fontsize=12, fontweight='bold')
plt.legend(title='Debtor', labels=['No', 'Yes'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Figure 13 shows that the Dropout group has the largest proportion of students who are debtors. Enrolled students still include some debtors, but the proportion is noticeably smaller. In the Graduate group, almost all students have no debt, with only a very small fraction appearing as debtors.

This trend suggests that having debt is more common among students who end up dropping out, hinting that financial pressure may contribute to early departure. Conversely, students without debt seem more likely to remain enrolled and reach graduation.


In [None]:
#| label: fig-marital-status
#| fig-cap: Marital status by student outcome

plt.figure(figsize=(8, 5))

# Encode target categories as numbers
x_encoded = df[target_col].cat.codes

# Add jitter to avoid overlap
x_jitter = x_encoded + np.random.normal(0, 0.05, size=len(df))
y_jitter = df['Marital Status'] + np.random.normal(0, 0.05, size=len(df))

# Beautiful colormap
colors = plt.cm.viridis(x_encoded / x_encoded.max())

plt.scatter(
    x_jitter,
    y_jitter,
    s=40,
    alpha=0.75,
    c=colors,
    edgecolor="black",
    linewidth=0.4
)

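# One tick per target category (instead of one tick per row)
target_categories = df[target_col].cat.categories
plt.xticks(range(len(target_categories)), target_categories)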
plt.xlabel("Target")
plt.ylabel("Marital Status")
plt.title("Marital Status by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

@fig-marital-status shows the distribution of marital status across the three target groups using a jittered scatter plot. While marital status achieved statistical significance in the ANOVA test (p < 0.001), its effect size is negligible (η² = 0.009), corresponding to less than 1% of variance explained. This small effect is evident in the plot, where points align in nearly identical horizontal bands for each category, indicating that the marital status profiles are essentially the same among Dropout, Enrolled, and Graduate students.
The plot clearly shows that the vast majority of students are single, which is expected given the typical age range of university students. The less common marital statuses (married, divorced, widower, de facto union, and legally separated) appear only sporadically and are spread evenly across the three groups, further confirming the lack of meaningful differences.


In [None]:
#| label: fig-mother-qualification
#| fig-cap: Mother's qualification by student outcome

plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y="Mother's qualification",
    data=df,
    palette=['#F46968', '#BCDCED', '#31709D']
)
plt.title("Mother's qualification by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Figure 15 shows the mother's qualification across the three target groups. The three groups display nearly identical distributions, with all medians at approximately 19, and the Enrolled and Graduate distributions are practically indistinguishable. The Dropout group shows a slightly wider interquartile range, extending lower than in the other two groups. All groups reach similar limits and there are no outliers.

While the three distributions are very similar, the dropout category shows slightly fewer students with mothers who have very low qualifications, but this difference is minimal. The overall similarity indicates that a mother's qualification has little influence on whether students complete their studies. This is also supported by the small effect size.


### 6.2.4 - Key Findings for Demographic & Socioeconomic Background 

Younger students are more likely to graduate, while older students face a higher dropout risk. The Dropout group also shows the widest age variation. This suggests that older students may face competing life responsibilities that interfere with their studies.

Female students graduate at slightly higher rates: they make up a larger proportion of both the Enrolled and Graduate groups, whereas the Dropout group is split roughly 50-50.

Graduates include a noticeably higher proportion of scholarship recipients than dropouts and enrolled students, where the majority receive no scholarship. This suggests that financial support helps students complete their studies.

Dropouts have the largest proportion of debtors. Enrolled students include some debtors, but fewer than the Dropout group, and almost no graduates are debtors. Financial pressure is strongly associated with dropout, while financial stability is associated with graduation.

For marital status, all three groups show nearly identical distributions, and the less common statuses appear equally sparsely across all categories. Marital status has a negligible relationship with student outcomes and little predictive value.

For mother's qualification, the three groups display nearly identical distributions, with medians around 19. While the Dropout group shows a slightly wider spread extending lower, the difference is minimal. Mother's qualification has little to no influence on whether students complete their studies.


## 7 - Predictive Modelling

In this section, we begin building a predictive model aimed at understanding which factors are most strongly associated with student dropout. Our goal is not only to classify students into the three outcome categories (Dropout, Enrolled, Graduate), but also to identify which variables contribute most to the risk of dropping out.

We train a Random Forest classifier as a first baseline model. This allows us to evaluate predictive performance and obtain a first indication of which features may be important. To further interpret and validate these results, we use LIME explanations, both at the individual level (example students) and globally across multiple samples.

This modelling part is therefore an exploratory step toward understanding dropout risk: the aim is to identify meaningful patterns, highlight influential academic or demographic factors, and evaluate which features could be most relevant for predicting student success or failure. Later, these insights can be refined and made more specific to dropout prediction.


In [None]:
#| label: classification-model

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
import lime
import lime.lime_tabular
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

# Create working dataframe
df_model = df.copy()

# Remove rows with missing values
df_model = df_model.dropna()

# Separate features and target
X = df_model.drop('Target', axis=1)
y = df_model['Target']

# Encode categorical variables
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    label_encoders[col] = le

# Split data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

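# Compute balanced class weights for inspection only; the model below passes
# class_weight='balanced', which applies the same weighting internally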
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weight_dict = dict(zip(np.unique(y_train), class_weights))

# Train Random Forest model with class weights
rf_model = RandomForestClassifier(
    n_estimators=100,  
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'
)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)

# ============================================================================
# MODEL EVALUATION
# ============================================================================

display(Markdown("\n## 7.1 - Model Performance\n"))

# Accuracy as a formatted statement
accuracy = accuracy_score(y_test, y_pred)
display(Markdown(f"**Overall Accuracy:** {accuracy:.3f} ({accuracy*100:.1f}%)"))

# Classification Report as DataFrame
report_dict = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report_dict).transpose()

# Format the dataframe nicely
report_df = report_df.round(3)
if 'support' in report_df.columns:
    report_df['support'] = report_df['support'].astype(int)

# Main classes only
main_classes_df = report_df.loc[['Dropout', 'Enrolled', 'Graduate']].copy()
display(Markdown("\n### Performance by Class\n"))
display(main_classes_df)

# Summary metrics
summary_df = report_df.loc[['accuracy', 'macro avg', 'weighted avg']].copy()
display(Markdown("\n### Global Performance\n"))
display(summary_df)

# Class distribution comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

y_test.value_counts().sort_index().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('True Class Distribution (Test Set)')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

pd.Series(y_pred).value_counts().sort_index().plot(kind='bar', ax=axes[1], color='lightcoral')
axes[1].set_title('Predicted Class Distribution')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

display(Markdown("""
The Random Forest classifier achieved an overall accuracy of 67.3%, correctly predicting students' academic outcome in approximately two-thirds of cases. However, accuracy by itself is not enough; it is necessary to examine the model's performance for each class.

**Performance varies significantly across the three outcomes.** The model performs best at identifying Dropout and Graduate students, with F1-scores of 0.707 and 0.773 respectively. However, it struggles with the Enrolled category, achieving an F1-score of only 0.392. This means that the model has difficulty distinguishing currently enrolled students from those who will drop out or graduate.

**The confusion matrix reveals specific prediction patterns.** The model correctly identifies 185 out of 284 dropouts (65.1% recall) and 338 out of 442 graduates (76.5% recall). The struggle with Enrolled students is evident: only 73 out of 159 (45.9%) are correctly classified, with many being misclassified as either future graduates (53 cases) or potential dropouts (33 cases). This reflects the inherent difficulty of predicting outcomes for students still in progress: their final status remains uncertain until they complete their program or drop out.

**Handling class imbalance.** Enrolled students are the smallest class (159 in the test set, versus 284 Dropouts and 442 Graduates), so the model has less data from which to learn patterns for this group. To address this imbalance, we used the `class_weight='balanced'` parameter in the Random Forest model, which automatically sets class weights inversely proportional to class frequencies. This encourages the model to give comparable attention to all three classes during training rather than being biased toward the majority class. While this helps mitigate the imbalance, the fundamental challenge remains: enrolled students represent an "in-between" state that is harder to characterize than the final outcomes of dropping out or graduating.

**Predicted distribution.** The last figure shows the predicted distribution of classes after applying the model. Despite the class imbalance in the original data, the model with balanced class weights captures the real distribution quite well, suggesting that the balancing strategy is effective.

"""))

# ============================================================================
# RANDOM FOREST FEATURE IMPORTANCE
# ============================================================================

display(Markdown("\n## 7.2 - Random Forest Feature Importance\n"))

display(Markdown("""
Random Forest models calculate feature importance by measuring how much each feature reduces node impurity, averaged across all decision trees in the forest (this is what scikit-learn's `feature_importances_` reports). Features that consistently lead to better splits receive higher importance scores. Because every split is chosen in the context of the splits above it, the score reflects each variable’s contribution alongside all other features and can capture interactions and non-linear relationships.
"""))

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

display(Markdown("\n### 7.2.1 - Top 15 Most Important Features (Random Forest)\n"))

# Plot feature importance
plt.figure(figsize=(10, 8))
top_15 = feature_importance.head(15)
plt.barh(range(len(top_15)), top_15['importance'])
plt.yticks(range(len(top_15)), top_15['feature'])
plt.xlabel('Importance')
plt.title('Top 15 Feature Importance (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

display(Markdown("""
Although Random Forest differs from ANOVA by evaluating variables in combination rather than in isolation, both approaches show that academic performance variables (_1st and 2nd semester grades_) are dominant.

Financial and administrative variables showed weak effects in ANOVA but appear among the top 15 Random Forest features, plausibly because they interact with academic variables to improve predictions. These variables may capture conditional effects, such as financial stress increasing academic risk when performance is already low. Such interactions improve predictive accuracy and help interpret the underlying patterns in the data.
"""))

# Build the LIME tabular explainer from the training data
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X.columns.tolist(),
    class_names=[str(c) for c in sorted(y.unique())],
    mode='classification',
    random_state=42,
    discretize_continuous=True,  # Bin continuous features so explanations read as value ranges
    sample_around_instance=True  # More focused local sampling
)

# Optional wrapper that reweights predicted probabilities by class weight
# (the explanations below pass rf_model.predict_proba directly)
def balanced_predict_proba(X_sample):
    """Prediction function that applies class weights to probabilities"""
    probs = rf_model.predict_proba(X_sample)
    
    # Optional: Apply class weight adjustment to probabilities
    # This helps LIME understand the balanced decision boundaries
    weights = np.array([class_weight_dict[c] for c in sorted(y.unique())])
    adjusted_probs = probs * weights
    
    # Renormalize
    adjusted_probs = adjusted_probs / adjusted_probs.sum(axis=1, keepdims=True)
    
    return adjusted_probs

# ============================================================================
# PART 1: INDIVIDUAL EXAMPLES (One per class for illustration)
# ============================================================================

display(Markdown("\n## 7.3 - LIME Explanations - Individual Examples\n"))
display(Markdown("*Showing one representative example from each class*\n"))

display(Markdown("""
LIME (Local Interpretable Model-agnostic Explanations) helps us understand why the model made a specific prediction for an individual student. Unlike feature importance, which shows what matters globally across all predictions, LIME reveals which features drove the decision for a particular case. The tables below show the top 10 features that most influenced each prediction. Positive weights push the prediction toward the class being explained, while negative weights push against it. The magnitude indicates the strength of influence.
"""))

# Select one sample from each class with fixed seed for reproducibility
np.random.seed(42)
sample_indices = []
for class_label in sorted(y.unique()):
    class_indices = X_test[y_test == class_label].index
    if len(class_indices) > 0:
        sample_indices.append(np.random.choice(class_indices, size=1)[0])

# Dictionary to store class-specific interpretations
class_interpretations = {
    'Dropout': """
**Interpretation:** This student was correctly identified as at risk of dropping out. The dominant factor is their poor academic performance, with a second semester grade of 10.78 or below strongly pushing toward dropout (large negative weight). Additionally, lacking scholarship support increases dropout risk. Protective factors such as not being in debt, younger age (≤19), and decent first semester grades (≤11) provide some counterbalance but are not enough to overcome poor results in the second semester. This highlights how academic difficulties, especially in later semesters, are key indicators of the risk of dropping out.
""",
    'Enrolled': """
**Interpretation:** This enrolled student presents a highly mixed profile that makes prediction challenging. Poor second semester grades (≤10.78) strongly push toward dropout, while positive factors like having no scholarship (paradoxically protective here), not being in debt, low unemployment rate, and moderate academic performance in the first semester push toward remaining enrolled or graduating. The model shows significant uncertainty, with multiple features pulling in different directions. This reflects the inherent difficulty in predicting outcomes for students still in progress: they are in a transitional state where their trajectory could go either way.
""",
    'Graduate': """
**Interpretation:** This student shows mixed but ultimately positive indicators of academic success. While poor second semester grades (≤10.78) push against graduation, several protective factors dominate: having no scholarship, not being in debt, low unemployment rate, being in a traditional age range (20-25), and having moderate previous qualifications all strongly predict graduation. The combination of financial stability (no debt) and adequate academic preparation outweighs the concerning second semester performance, allowing the model to confidently predict graduation. This demonstrates that graduation is driven by multiple factors, not by grades alone.
"""
}

for idx, sample_idx in enumerate(sample_indices):
    sample = X_test.loc[sample_idx].values
    true_label = y_test.loc[sample_idx]
    pred_label = rf_model.predict([sample])[0]
    pred_proba = rf_model.predict_proba([sample])[0]
    
    display(Markdown(f"\n### Sample {idx + 1}: {true_label}\n"))
    
    sample_info = pd.DataFrame({
        'Metric': ['True Label', 'Predicted Label', 'Dropout Prob', 'Enrolled Prob', 'Graduate Prob'],
        'Value': [true_label, pred_label, f"{pred_proba[0]:.3f}", f"{pred_proba[1]:.3f}", f"{pred_proba[2]:.3f}"]
    })
    display(sample_info)
    
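    # Generate the LIME explanation (num_samples=5000 is also LIME's default).
    # Note: with LIME's default labels=(1,), as_list() and as_pyplot_figure() below
    # report weights for class index 1; pass top_labels=1 and an explicit label=
    # to explain the predicted class instead.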
    exp = explainer.explain_instance(
        sample,
        rf_model.predict_proba,
        num_features=10,
        num_samples=5000
    )
    
    # Show explanation as table
    explanation_data = []
    for feature, weight in exp.as_list():
        explanation_data.append({'Feature': feature, 'Weight': f"{weight:.3f}"})
    
    explanation_df = pd.DataFrame(explanation_data)
    display(Markdown("\n**Top 10 features influencing this prediction:**\n"))
    display(explanation_df)
    
    # Plot explanation with custom title
    fig = exp.as_pyplot_figure()
    # Update the title to reflect the actual sample
    fig.suptitle(f'Local Explanation for Sample {idx + 1}: {true_label}\n(Predicted: {pred_label})', 
                 fontsize=14, y=0.98)
    plt.tight_layout()
    plt.show()
    
    # Add class-specific interpretation
    if true_label in class_interpretations:
        display(Markdown(class_interpretations[true_label]))

# ============================================================================
# PART 3: GLOBAL LIME IMPORTANCE (Optional - overall patterns)
# ============================================================================

display(Markdown("\n## 7.4 - Global LIME Feature Importance\n"))

sample_size = len(X_test)
sample_indices_global = np.random.choice(len(X_test), size=sample_size, replace=False)

lime_weights_global = {feature: [] for feature in X.columns}

for i in sample_indices_global:
    exp = explainer.explain_instance(
        X_test.iloc[i].values,
        rf_model.predict_proba,
        num_features=len(X.columns)
    )
    
    # Read (feature index, weight) pairs from as_map() instead of parsing the
    # discretized labels, so columns whose names are substrings of one another
    # cannot be confused. Key 1 is the label that as_list() reports by default.
    for feature_idx, weight in exp.as_map()[1]:
        lime_weights_global[X.columns[feature_idx]].append(abs(weight))

# Compute average
lime_importance_global = pd.DataFrame({
    'feature': list(lime_weights_global.keys()),
    'lime_importance': [np.mean(weights) if weights else 0 for weights in lime_weights_global.values()]
}).sort_values('lime_importance', ascending=False)

display(Markdown("\n### Top 15 Most Important Features (LIME Global)\n"))
display(lime_importance_global.head(15).round(4))

display(Markdown("""
**Global LIME Feature Importance:** When aggregating LIME explanations across all three classes (Dropout, Enrolled, and Graduate), we obtain a global view of which features most consistently influence the model's predictions regardless of outcome. The ranking remains remarkably similar to the dropout-specific analysis reported in the Appendix, with second semester grades (0.0583) again dominating as the single most important predictor. Scholarship holder status (0.0527) and debtor status (0.0369) maintain their strong positions, reinforcing that financial factors and academic performance are universally important across all prediction scenarios.

The consistency between dropout-specific and global feature importance suggests that the same fundamental factors drive all student outcomes, just in different directions. Strong academic performance and financial stability predict graduation, while their absence predicts dropout. Enrolled students fall somewhere in between, exhibiting mixed patterns of these key indicators. This global perspective validates our earlier findings and demonstrates that interventions targeting academic support and financial aid would benefit students across all outcome categories, not just those at risk of dropping out.
"""))

display(Markdown("\n### Where Does the Model Struggle?\n"))

# Analyze confusion patterns in LIME sample
confusion_patterns = {
    'Dropout': {'predicted_as': {'Dropout': 0, 'Enrolled': 0, 'Graduate': 0}},
    'Enrolled': {'predicted_as': {'Dropout': 0, 'Enrolled': 0, 'Graduate': 0}},
    'Graduate': {'predicted_as': {'Dropout': 0, 'Enrolled': 0, 'Graduate': 0}}
}

# Count predictions for each sample in the global LIME analysis
for idx in sample_indices_global:
    true_label = y_test.iloc[idx]
    sample = X_test.iloc[idx].values
    pred_label = rf_model.predict([sample])[0]
    confusion_patterns[true_label]['predicted_as'][pred_label] += 1

# Create confusion matrix for LIME sample
confusion_display_data = []
for true_class in sorted(y.unique()):
    row = {'True Class': true_class}
    for pred_class in sorted(y.unique()):
        count = confusion_patterns[true_class]['predicted_as'][pred_class]
        total = sum(confusion_patterns[true_class]['predicted_as'].values())
        row[f'Pred: {pred_class}'] = f"{count} ({count/total*100:.0f}%)"
    confusion_display_data.append(row)

confusion_display_df = pd.DataFrame(confusion_display_data)
display(Markdown("\n**Confusion patterns in LIME sample:**\n"))
display(confusion_display_df)

# Identify problematic patterns
display(Markdown("\n**Key Issues:**\n"))

for true_class in sorted(y.unique()):
    total = sum(confusion_patterns[true_class]['predicted_as'].values())
    correct = confusion_patterns[true_class]['predicted_as'][true_class]
    
    if correct / total < 0.7:  # Less than 70% accuracy
        main_confusion = max(
            [(pred, count) for pred, count in confusion_patterns[true_class]['predicted_as'].items() if pred != true_class],
            key=lambda x: x[1]
        )
        display(Markdown(
            f"- **{true_class}** students are often misclassified as **{main_confusion[0]}** "
            f"({main_confusion[1]}/{total} cases, {main_confusion[1]/total*100:.0f}%)"
        ))

display(Markdown("""Class imbalance can make the model struggle: because most students in the data are Graduates, the model is biased toward that group and is more likely to predict Graduate when it is uncertain."""))
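
As a reference for the `class_weight='balanced'` option used in the model, the sketch below reproduces in plain Python the weighting heuristic documented by scikit-learn, `n_samples / (n_classes * count_c)`. The class counts are the test-set counts quoted with the confusion matrix and are used here only as a stand-in for the training counts (an assumption), so the weights applied by the fitted model may differ slightly.

```python
# Illustrative only: test-set class counts quoted in the text, used as a
# stand-in for the training-set counts (assumption).
class_counts = {"Dropout": 284, "Enrolled": 159, "Graduate": 442}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.3f}")
```

The smallest class receives the largest weight, so errors on Enrolled students cost more during training than errors on Graduates.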

In [None]:
display(Markdown("\n## 7.5 - Feature Importance Comparison: Random Forest, LIME, and ANOVA\n"))

# Prepare data from all three methods
comparison_all = feature_importance.merge(lime_importance_global, on='feature', how='left')
comparison_all['lime_importance'] = comparison_all['lime_importance'].fillna(0)

# Add ANOVA results (eta_sq) - need to match feature names
anova_dict = anova_df['eta_sq'].to_dict()
comparison_all['anova_eta_sq'] = comparison_all['feature'].map(anova_dict).fillna(0)

# Normalize all importance scores to 0-1 range for fair comparison
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

comparison_all['rf_normalized'] = scaler.fit_transform(comparison_all[['importance']])
comparison_all['lime_normalized'] = scaler.fit_transform(comparison_all[['lime_importance']])
comparison_all['anova_normalized'] = scaler.fit_transform(comparison_all[['anova_eta_sq']])

# Select top 15 features by Random Forest importance
top_features = comparison_all.nlargest(15, 'importance')

# Create comparison plot
plt.figure(figsize=(14, 8))
x = np.arange(len(top_features))
width = 0.25

plt.barh(x - width, top_features['rf_normalized'], width, label='Random Forest', alpha=0.8)
plt.barh(x, top_features['lime_normalized'], width, label='LIME', alpha=0.8)
plt.barh(x + width, top_features['anova_normalized'], width, label='ANOVA (η²)', alpha=0.8)

plt.yticks(x, top_features['feature'])
plt.xlabel('Normalized Importance Score (0-1)')
plt.title('Feature Importance Comparison: Random Forest vs LIME vs ANOVA')
plt.legend()
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

display(Markdown("""
**Comparing Three Approaches to Feature Importance:**

This comparison reveals how different methods assess feature importance:

- **Random Forest (blue)**: Global feature importance based on how much each feature reduces node impurity across all decision trees. Captures complex interactions and non-linear relationships.

- **LIME (orange)**: Local feature importance, computed as the mean absolute LIME weight across all test-set samples. Explains individual predictions by measuring how changes in feature values affect model outputs locally.

- **ANOVA η² (green)**: Statistical effect size measuring the proportion of variance each feature explains in the target variable. Captures linear relationships and univariate associations.

**Key Observations:**

All three methods agree that **2nd semester grades** and **1st semester grades** are the most important predictors, validating academic performance as the dominant factor. However, they diverge on other features:

- **Scholarship holder** and **Debtor** rank highly in LIME (practical prediction influence) but lower in Random Forest, suggesting these features work through interactions rather than independently.
- **ANOVA** identifies strong linear relationships (high η² for grades) but may underestimate features that work through complex interactions.
- **Random Forest** balances both direct effects and feature interactions, providing a comprehensive view of predictive power in the classification context.

The convergence on academic performance across all three methods strongly validates its critical role, while divergences highlight how financial and demographic factors contribute through different mechanisms, some through direct effects (captured by ANOVA), others through complex interactions (captured by Random Forest and LIME).
"""))
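
The comparison above rescales each method's scores with min-max normalization before plotting. The sketch below shows that rescaling in plain Python; `min_max` and `toy_scores` are hypothetical names and values introduced for illustration, not the project's code or results.

```python
def min_max(scores):
    """Rescale a list of scores to the 0-1 range (constant lists map to 0.0)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw scores for three features under a single method
toy_scores = [0.30, 0.12, 0.03]
print(min_max(toy_scores))
```

Because each method is rescaled on its own, a bar of 1.0 only means "top-ranked within that method". Bar lengths therefore support comparing rankings across methods, not absolute effect sizes.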

## 8 - Conclusion 
This project aimed to understand the factors driving student dropout and academic success, and to assess whether these outcomes can be predicted accurately from demographic, socioeconomic, and academic information. We answered our research questions through a combination of exploratory data analysis and predictive modeling techniques.

1. How do students’ academic indicators influence their likelihood of graduating or dropping out? a. To what extent do financial factors (debtor status, scholarship holder) affect student retention?

The analysis indicates that academic performance indicators are the strongest determinant of students’ final academic outcomes. Second-semester grades constitute the most decisive factor in distinguishing between students who drop out, remain enrolled, or graduate, as reflected by a very large ANOVA effect size (η² = 0.339). First-semester grades serve as an early warning signal for identifying students in difficulty. In contrast, indicators of prior academic preparation and course enrollment intensity show weaker effects, which suggests that success at the university level relies more heavily on students’ adaptation and performance within the academic environment than on their background before university.

Study conditions contribute to outcomes in a more nuanced way. Students attending evening programs face slightly higher dropout risk, likely due to external commitments, while students who relocate for their studies appear more likely to graduate, possibly reflecting stronger educational commitment or fewer competing obligations.

Overall, these results point to a critical transition between the first and second semesters. While early academic difficulties can be identified shortly after enrollment, second-semester performance plays a major role in determining whether students rebound or disengage. From a practical perspective, this emphasizes the importance of early academic monitoring and targeted interventions during the first year to reduce dropout risk and promote student success.
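
For readers unfamiliar with the effect size quoted above, η² is the share of a variable's total variance that lies between the outcome groups, i.e. SS_between / SS_total. The sketch below computes it in plain Python on made-up grades; `eta_squared` and `toy_grades` are hypothetical names and values for illustration only and do not reproduce the project's ANOVA code or data.

```python
def eta_squared(groups):
    """Return SS_between / SS_total for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Total variation of all observations around the grand mean
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Hypothetical second-semester grades per outcome group (not the project's data)
toy_grades = {
    "Dropout": [0.0, 8.5, 10.0, 11.0],
    "Enrolled": [10.5, 11.5, 12.0, 12.5],
    "Graduate": [12.0, 13.0, 13.5, 14.5],
}
print(f"eta squared = {eta_squared(list(toy_grades.values())):.3f}")
```

A value near 0 means the group means barely differ relative to the spread within groups, while a value near 1 means almost all variation is explained by group membership.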

2. What is the impact of demographic and socioeconomic background on students’ probability of dropping out? a. Which feature category (academic, socioeconomic, or demographic) contributes the most to predicting students’ dropout?

While demographic and socioeconomic factors have an impact on student outcomes, it is smaller than that of academic performance. A clear pattern is observed for _Age at enrollment_: younger students are more likely to graduate, while older students face a higher risk of dropout (η² ≈ 0.065). However, age alone does not determine success. We also observe gender differences, with female students being more likely to remain enrolled and graduate, although the effect size is moderate.

Financial factors also show a clear correlation with student outcomes. Students with debt are more likely to drop out, whereas graduates are less often affected by it. These findings are consistent with the LIME analysis, which shows that debt and scholarships are two of the most important non-academic factors: financial issues increase the risk of dropping out, while financial support improves retention.

For other demographic variables such as marital status, application order, and parental background, effect sizes are very small, distributions are similar across outcome groups, and model-based explanations consistently rank them among the least important predictors. Therefore, although they are statistically significant, their effects are negligible and do not meaningfully explain differences in student outcomes.

The findings indicate that socioeconomic factors, such as financial stress (_debt_) and financial support (_scholarships_), have a significant impact on student retention, whereas demographic characteristics such as age and gender play a secondary role compared to academic performance and financial stability.

3. Can we accurately predict a student’s final status (Dropout, Enrolled, or Graduate), and which characteristics are most relevant?

While the previous research questions focused on identifying and interpreting the individual effects of academic, socioeconomic, and demographic factors on student dropout, the final step in the analysis evaluates these factors jointly within a predictive modeling framework. This shift from explanation to prediction assesses how well a machine-learning model can classify students’ final academic status and which categories of variables contribute most to its performance.

The results indicate that predicting students’ final academic status is feasible, though subject to important limitations. The Random Forest model performs well for students with clearly defined outcomes, particularly graduates and dropouts, while currently enrolled students are harder to classify, reflecting the inherent uncertainty of trajectories that are still in progress rather than shortcomings of the model itself.
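
The per-class figures behind this assessment can be recomputed from the counts quoted with the confusion matrix. The sketch below is a plain-Python check rather than the project's evaluation code, and the variable names are illustrative.

```python
# Correctly classified test students and class sizes, as quoted in the report
correct = {"Dropout": 185, "Enrolled": 73, "Graduate": 338}
support = {"Dropout": 284, "Enrolled": 159, "Graduate": 442}

# Recall: share of each true class that the model labels correctly
recall = {label: correct[label] / support[label] for label in correct}

# Overall accuracy: all correct predictions over all test students
accuracy = sum(correct.values()) / sum(support.values())

for label, value in recall.items():
    print(f"{label} recall: {value:.3f}")
print(f"Accuracy: {accuracy:.3f}")
```

The gap between the Enrolled recall and the other two classes is what the overall accuracy figure hides.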

Regarding predictor relevance, a clear hierarchy emerges consistently across all methods. Academic features dominate predictive performance, accounting for approximately 60–70% of explanatory power, with second semester grades and first semester grades ranking highest across ANOVA, Random Forest importance, and LIME explanations. Socioeconomic factors form the second most influential category (around 20–30%), with scholarship holding and debtor status standing out as key non-academic predictors that condition students’ ability to sustain academic performance. Demographic characteristics contribute more modestly, with age at enrollment being the most relevant, while contextual and macroeconomic variables show minimal predictive value. These findings confirm that student outcomes can be predicted with moderate accuracy, primarily driven by academic performance and reinforced by financial stability.
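
One way to express category-level shares such as those above is to sum per-feature importance scores within each category and divide by the total. The sketch below illustrates that aggregation; the scores in `toy_importance` and the mapping in `category_of` are placeholder values chosen for illustration (assumptions), not the project's results or its actual feature grouping.

```python
# Placeholder importance scores (not results from the analysis)
toy_importance = {
    "Curricular units 2nd sem (grade)": 0.35,
    "Curricular units 1st sem (grade)": 0.25,
    "Scholarship holder": 0.20,
    "Debtor": 0.12,
    "Age at enrollment": 0.08,
}

# Hypothetical assignment of features to categories
category_of = {
    "Curricular units 2nd sem (grade)": "Academic",
    "Curricular units 1st sem (grade)": "Academic",
    "Scholarship holder": "Socioeconomic",
    "Debtor": "Socioeconomic",
    "Age at enrollment": "Demographic",
}

# Sum the scores per category, then convert the sums to shares of the total
totals = {}
for feature, score in toy_importance.items():
    category = category_of[feature]
    totals[category] = totals.get(category, 0.0) + score

grand_total = sum(totals.values())
shares = {category: value / grand_total for category, value in totals.items()}

for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

The resulting shares depend on which importance measure is summed and on how features are grouped, so both choices should be reported alongside any such percentages.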

---

Taken together, this analysis highlights a clear hierarchy in the factors shaping student dropout and academic success. Academic performance stands out as the most influential determinant, followed by financial conditions, while demographic characteristics play a more limited supporting role and contextual variables show little direct impact. Across all methods, semester grades, and especially second-semester performance, consistently emerge as the strongest indicators of students’ final outcomes.

Rather than being a simple correlation, academic performance appears to be the channel through which other factors take effect. Financial pressure, age-related responsibilities, and study conditions influence students’ capacity to perform academically, which in turn directly affects their likelihood of persisting or dropping out. This makes the first year of study, and particularly the transition to the second semester, a critical period for intervention.

The practical implications are clear. Effective dropout prevention should focus on early academic monitoring, combined with targeted financial support, particularly during the first year and before second-semester outcomes become decisive. Encouragingly, the most influential factors identified are also those that institutions can directly address through academic support services, financial aid policies, and early warning systems.

Overall, this project demonstrates that student dropout is neither inevitable nor primarily driven by factors beyond institutional control. With timely, data-driven interventions targeting the right factors at the right moments, higher education institutions can meaningfully improve student retention and academic success.


## 9 - Appendix
This appendix analyzes 100 samples per class to identify robust patterns and handle class imbalance.
The following analysis is used to validate the robustness of our results, while the main body focuses on the full dataset to reflect real student outcome distributions.

In [None]:
display(Markdown("\n## 9.1 - LIME Analysis - Class-Wise Aggregation\n"))

# Number of samples to analyze per class
samples_per_class = 100

# Cap the sample size at the number of test instances available in each class
class_sample_sizes = {}
for class_label in sorted(y.unique()):
    class_count = len(X_test[y_test == class_label])
    # Use up to samples_per_class samples, or all available if fewer
    class_sample_sizes[class_label] = min(samples_per_class, class_count)

# Store LIME weights for each class separately
class_lime_weights = {
    'Dropout': {feature: [] for feature in X.columns},
    'Enrolled': {feature: [] for feature in X.columns},
    'Graduate': {feature: [] for feature in X.columns}
}

# Track prediction accuracy
prediction_tracking = {
    'Dropout': {'correct': 0, 'incorrect': 0},
    'Enrolled': {'correct': 0, 'incorrect': 0},
    'Graduate': {'correct': 0, 'incorrect': 0}
}

# Process each class
for class_label in sorted(y.unique()):
    
    # Get indices for this class
    class_indices = X_test[y_test == class_label].index
    
    # Sample
    n_samples = class_sample_sizes[class_label]
    sample_indices_class = np.random.choice(class_indices, size=n_samples, replace=False)
    
    # Generate LIME explanations for each sample
    for idx, sample_idx in enumerate(sample_indices_class):
        sample = X_test.loc[sample_idx].values
        true_label = y_test.loc[sample_idx]
        pred_label = rf_model.predict([sample])[0]
        
        # Track prediction accuracy
        if pred_label == true_label:
            prediction_tracking[class_label]['correct'] += 1
        else:
            prediction_tracking[class_label]['incorrect'] += 1
        
        # Generate LIME explanation with increased samples
        exp = explainer.explain_instance(
            sample,
            rf_model.predict_proba,
            num_features=len(X.columns),
            num_samples=5000  # More samples = more stable explanations
        )
        
        # Extract weights for this class from (feature index, weight) pairs.
        # as_map() avoids parsing the discretized labels, so columns whose names
        # are substrings of one another cannot be confused. Key 1 is the label
        # that as_list() reports by default.
        for feature_idx, weight in exp.as_map()[1]:
            class_lime_weights[class_label][X.columns[feature_idx]].append(abs(weight))
        

class_importance_dfs = {}

for class_label in sorted(y.unique()):
    importance_list = []
    
    for feature in X.columns:
        weights = class_lime_weights[class_label][feature]
        avg_importance = np.mean(weights) if len(weights) > 0 else 0
        std_importance = np.std(weights) if len(weights) > 0 else 0
        n_appearances = len(weights)
        
        importance_list.append({
            'feature': feature,
            'mean_importance': avg_importance,
            'std_importance': std_importance,
            'appearances': n_appearances
        })
    
    class_df = pd.DataFrame(importance_list).sort_values('mean_importance', ascending=False)
    class_importance_dfs[class_label] = class_df
    
    if class_label == "Dropout":
        display(Markdown(f"\n### Top 15 Features for Predicting: **{class_label}**\n"))
        display(Markdown(f"*Based on {class_sample_sizes[class_label]} samples*\n"))
    
    top_15 = class_df.head(15).copy()
    top_15['mean_importance'] = top_15['mean_importance'].round(4)
    top_15['std_importance'] = top_15['std_importance'].round(4)
    if class_label == "Dropout":
        display(top_15)

display(Markdown("""
**Analysis of Dropout Risk Factors:** The aggregated LIME analysis across 100 dropout samples reveals the most influential features driving dropout predictions. Second semester grades emerge as the dominant predictor with a mean importance of 0.0852, significantly higher than any other feature. This is followed by scholarship holder status (0.0525) and debtor status (0.0370), highlighting how financial pressures compound academic struggles. Age at enrollment (0.0283) and first semester grades (0.0234) also play important roles. Notably, all top features appear in all 100 samples, demonstrating their consistent relevance across different dropout cases. The relatively low standard deviations (especially for scholarship holder and debtor status) indicate these features have stable, predictable effects, making them reliable indicators for early intervention systems targeting at-risk students.
"""))

display(Markdown("\n## 9.2 - Feature Importance Comparison Across Classes\n"))

# Get top features from all classes
all_top_features = set()
for class_label in sorted(y.unique()):
    all_top_features.update(class_importance_dfs[class_label].head(10)['feature'].tolist())

# Create comparison dataframe
comparison_data = []
for feature in all_top_features:
    row = {'Feature': feature}
    for class_label in sorted(y.unique()):
        class_df = class_importance_dfs[class_label]
        importance = class_df[class_df['feature'] == feature]['mean_importance'].values
        row[class_label] = importance[0] if len(importance) > 0 else 0
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)

# Sort by average importance
comparison_df['avg'] = comparison_df[sorted(y.unique())].mean(axis=1)
comparison_df = comparison_df.sort_values('avg', ascending=False).drop('avg', axis=1)

# Grouped bar chart
fig, ax = plt.subplots(figsize=(14, 10))

plot_features = comparison_df.head(15)['Feature'].tolist()
plot_data = comparison_df[comparison_df['Feature'].isin(plot_features)].set_index('Feature')

x = np.arange(len(plot_features))
width = 0.25
colors = ['#F46968', '#BCDCED', '#31709D']

for idx, class_label in enumerate(sorted(y.unique())):
    values = [plot_data.loc[feat, class_label] for feat in plot_features]
    offset = (idx - 1) * width
    ax.barh(x + offset, values, width, label=class_label, alpha=0.8, color=colors[idx])

ax.set_yticks(x)
ax.set_yticklabels(plot_features)
ax.set_xlabel('LIME Importance (Mean Absolute Weight)', fontsize=12)
ax.set_title('Feature Importance by Class (LIME Analysis)', fontsize=14, fontweight='bold')
ax.legend(title='Target Class', fontsize=10)
ax.invert_yaxis()
plt.tight_layout()
plt.show()

display(Markdown("""Derived from aggregated local LIME explanations, this figure shows which variables most influence the model's decisions for each outcome class. For the Dropout class, the features with the strongest influence are _Curricular units 2nd sem (grade)_, _Scholarship holder_, _Debtor_ and _Age at enrollment_. This indicates that the model frequently relies on these variables when producing predictions classified as Dropout.

For the Enrolled class, _Curricular units 2nd sem (grade)_, _Scholarship holder_, _Debtor_ and _Curricular units 1st sem (grade)_ have the highest contributions, indicating a greater reliance on academic performance and enrollment-related variables in the model's decision process.

Finally, for the Graduate class, _Scholarship holder_, _Curricular units 2nd sem (grade)_, _Debtor_ and _Curricular units 1st sem (enrolled)_ are the most influential features. Overall, the results show that several features are consistently influential across all classes, with _Age at enrollment_ being particularly distinctive for the Dropout class compared to the other classes, where it appears with slightly lower influence. These contributions reflect aggregated local decision patterns of the model rather than global feature importance or causal relationships.

Conversely, some features show lower aggregated LIME contributions across all classes, suggesting that the model relies less frequently on these variables in its local decisions, although this does not mean they are unimportant in certain individual cases. Across all classes, the least influential features are _Course_, _Educational special needs_, _Unemployment rate_ and _Curricular units 1st sem (enrolled)_.

Overall, these results indicate that individual academic and financial characteristics outweigh contextual characteristics in the predictions. The prevalence of grades and financial variables, such as _Scholarship holder_ or _Debtor_, shows that the model draws mainly on the characteristics of individual students, whereas environmental characteristics, such as _Course_ or _Unemployment rate_, have less influence on the predictions.

"""))
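
The class-wise importances in this appendix are means of absolute LIME weights, which measure how strongly a feature moves predictions but not in which direction. The sketch below illustrates the difference with made-up weights; `toy_weights` holds hypothetical values for illustration, not weights produced by the model.

```python
from statistics import mean

# Hypothetical LIME weights for two features across four explained students
toy_weights = {
    "Scholarship holder": [0.05, -0.05, 0.06, -0.04],
    "Unemployment rate": [0.01, 0.01, 0.02, 0.00],
}

# For each feature, compare the mean absolute weight with the signed mean
summary = {}
for feature, weights in toy_weights.items():
    mean_abs = mean(abs(w) for w in weights)
    mean_signed = mean(weights)
    summary[feature] = (mean_abs, mean_signed)
    print(f"{feature}: mean |w| = {mean_abs:.3f}, mean w = {mean_signed:.3f}")
```

A feature whose weights change sign from one student to another can rank high on mean absolute weight while its signed mean is close to zero, which is why these rankings describe influence rather than a consistent direction of effect.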