#### 3️⃣ 📊 Exploratory Data Analysis (EDA)

<small>

📌 **Goal:** Understand student patterns & test hypotheses.

- Descriptive statistics (demographics + performance).
- Correlation analysis (top drivers of G3).
- Group comparisons:
- Study time vs Grades
- Failures vs Outcomes
- School support vs Performance
- Hypothesis testing (examples):
- H1: More study time → higher G3.
- H2: School support → better grades.
- H3: More absences → lower performance.

---


In [1]:
# ===============================
# 📚 Essential Libraries for Project
# ===============================

# Data handling
import pandas as pd
import numpy as np

# Fetch UCI ML Repository datasets
from ucimlrepo import fetch_ucirepo

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import plotly.graph_objects as go
import missingno as msno

# Handle Warning
import warnings
warnings.filterwarnings("ignore")


# Machine Learning (Supervised & Unsupervised)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import (
    LogisticRegression,
    LinearRegression,
    Ridge,
    Lasso,
    ElasticNet,
)
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    AdaBoostRegressor,
)
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    silhouette_score,
    make_scorer,
    f1_score,
    precision_score,
    recall_score,
    mean_squared_error,
    r2_score,
    mean_absolute_error,
)


from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# Dimensionality Reduction & Feature Selection
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

# Stats & Hypothesis Testing
import scipy.stats as stats
from scipy.stats import chi2_contingency, ttest_ind, f_oneway

# Dashboard
import streamlit as st

# Save Models
from joblib import dump, load
from pickle import dump, load


# Set style for consistent plotting
plt.style.use("default")
sns.set_palette("husl")
plt.rcParams["figure.figsize"] = (12, 8)

ModuleNotFoundError: No module named 'ucimlrepo'

In [None]:
df_no_leak = pd.read_csv("data-set/student_data_no_leakage.csv")
df_with_leak = pd.read_csv("data-set/student_data_with_leakage.csv")

In [4]:
print(f"No leakage dataset: {df_no_leak.shape}")
print(f"With G1/G2 dataset: {df_with_leak.shape}")

No leakage dataset: (649, 47)
With G1/G2 dataset: (649, 50)


In [5]:
# Use the dataset without leakage for EDA (more realistic)
df = df_no_leak.copy()

3.1 DESCRIPTIVE STATISTICS


In [6]:
# Select key variables for analysis
key_variables = [
    "age",
    "absences",
    "studytime",
    "failures",
    "freetime",
    "goout",
    "G3",
    "attendance_rate",
    "study_efficiency",
    "family_edu_avg",
]

In [7]:
# Filter available variables
available_key_vars = [var for var in key_variables if var in df.columns]
print(f"Analyzing key variables: {available_key_vars}")

Analyzing key variables: ['age', 'absences', 'studytime', 'failures', 'freetime', 'goout', 'G3', 'attendance_rate', 'study_efficiency', 'family_edu_avg']


In [8]:
# Generate descriptive statistics
desc_stats = df[available_key_vars].describe().round(2)
print(f"\nDescriptive Statistics Table:")
print("=" * 80)
print(desc_stats.to_string())


Descriptive Statistics Table:
          age  absences  studytime  failures  freetime   goout      G3  attendance_rate  study_efficiency  family_edu_avg
count  649.00    649.00     649.00    649.00    649.00  649.00  649.00           649.00            649.00          649.00
mean    16.74      3.51       1.93      0.22      3.18    3.18   11.91             0.77              4.20            2.41
std      1.22      4.09       0.83      0.59      1.05    1.18    3.23             0.27              1.35            1.01
min     15.00      0.00       1.00      0.00      1.00    1.00    0.00             0.00              0.56            0.00
25%     16.00      0.00       1.00      0.00      3.00    2.00   10.00             0.60              3.33            1.50
50%     17.00      2.00       2.00      0.00      3.00    3.00   12.00             0.87              4.11            2.50
75%     18.00      6.00       2.00      0.00      4.00    4.00   14.00             1.00              5.00          

In [9]:
# Additional statistics
print(f"\nAdditional Statistics:")
print("-" * 25)
for var in available_key_vars:
    if var in df.columns:
        skewness = stats.skew(df[var])
        kurtosis = stats.kurtosis(df[var])
        print(f"{var:15} | Skewness: {skewness:6.2f} | Kurtosis: {kurtosis:6.2f}")


Additional Statistics:
-------------------------
age             | Skewness:   0.42 | Kurtosis:   0.06
absences        | Skewness:   1.24 | Kurtosis:   0.80
studytime       | Skewness:   0.70 | Kurtosis:   0.03
failures        | Skewness:   3.09 | Kurtosis:   9.74
freetime        | Skewness:  -0.18 | Kurtosis:  -0.40
goout           | Skewness:  -0.01 | Kurtosis:  -0.87
G3              | Skewness:  -0.91 | Kurtosis:   2.68
attendance_rate | Skewness:  -1.24 | Kurtosis:   0.80
study_efficiency | Skewness:   0.47 | Kurtosis:   0.47
family_edu_avg  | Skewness:   0.12 | Kurtosis:  -1.13


In [None]:
# Skewness:
#   - Measures **asymmetry** of the distribution.
#   - `0` = perfectly symmetric (normal).
#   - `> 0` = right-skewed (tail on the right).
#   - `< 0` = left-skewed (tail on the left).

# Kurtosis:
#   - Measures **tailedness / peakedness**.
#   - `0` = normal distribution.
#   - `> 0` = heavy tails (outliers more likely).
#   - `< 0` = light tails (values clustered around mean).

<span style="font-size: 14px; line-height: 1.4;">

#### 🔍 Additional Insights

- **Academic Performance**

  - Final grade distribution suggests **average performance (~12/20)**.
  - Students with **higher attendance & study efficiency** tend to score better.

- **Attendance & Absences**

  - Strong **attendance rate (≥87%)** for most students.
  - A few students with **zero attendance** may indicate dropouts or data anomalies.

- **Study Behavior**

  - Despite low average **study time (1–2 on 4 scale)**, efficiency scores suggest **some compensate with smarter study methods**.
  - Outliers exist with **very high efficiency (8.83)**.

- **Social Life**

  - Students maintain a balance between **study and socializing**.
  - Extreme cases (always going out vs. never) could correlate with performance differences.

- **Family Education**
  - Average around **secondary school level (2.4/4)**.
  - Parental education may influence **study habits & academic outcomes**.

---

#### 📌 Key Insights

- **Age**

  - Mean: **16.7 years** (range: 15 – 22).
  - Most students are **16–18 years old**.

- **Absences**

  - Average: **3.5 days** (std: 4.1, range: 0 – 15).
  - 25% of students had **no absences**.

- **Study Time (1–4 scale)**

  - Median: **2** (≈ 2–5 hours weekly).
  - Most students report **low to moderate study time**.

- **Failures**

  - Average: **0.22** (range: 0 – 3).
  - Majority of students have **no previous failures**.

- **Free Time & Going Out (1–5 scale)**

  - Free time avg: **3.2**, Going out avg: **3.2**.
  - Students balance study with **moderate social life**.

- **Final Grade (G3)**

  - Mean: **11.9** (range: 0 – 19).
  - 50% of students scored **10–14**.

- **Attendance Rate**

  - Average: **77%**.
  - 75% of students achieved **≥ 100% attendance**.

- **Study Efficiency**

  - Mean: **4.2** (range: 0.56 – 8.83).
  - Indicates variability in **study habits & performance**.

- **Family Education Average (0–4 scale)**
  - Mean: **2.4** (≈ secondary to higher education).
  - Families tend to have **mid-level educational background**.

---

#### 📌 Variable-Level Insights

- **Age**

  - Skewness: **0.42** → Slightly right-skewed (a few older students, up to 22).
  - Kurtosis: **0.06** → Close to normal distribution.

- **Absences**

  - Skewness: **1.24** → Right-skewed, most students have low absences but some have very high.
  - Kurtosis: **0.80** → Mild heavy tails → occasional extreme absences.

- **Study Time**

  - Skewness: **0.70** → Slight right skew, most report low–moderate study, few study a lot.
  - Kurtosis: **0.03** → Near-normal distribution.

- **Failures**

  - Skewness: **3.09** → Strong right skew, majority = 0 failures, few with 2–3 failures.
  - Kurtosis: **9.74** → Extremely heavy tails, strong outlier presence (students repeatedly failing).

- **Free Time**

  - Skewness: **-0.18** → Slight left skew, balanced free-time use.
  - Kurtosis: **-0.40** → Light tails, most students cluster around mid-values.

- **Going Out**

  - Skewness: **-0.01** → Symmetric, evenly distributed social activity.
  - Kurtosis: **-0.87** → Flat distribution, less extreme behavior.

- **Final Grade (G3)**

  - Skewness: **-0.91** → Left-skewed, more students score high; fewer with very low grades.
  - Kurtosis: **2.68** → Heavy tails, performance extremes exist.

- **Attendance Rate**

  - Skewness: **-1.24** → Strong left skew, most students attend regularly, few with poor attendance.
  - Kurtosis: **0.80** → Slightly heavy tails (outliers with zero attendance).

- **Study Efficiency**

  - Skewness: **0.47** → Slight right skew, most are average, a few very efficient.
  - Kurtosis: **0.47** → Somewhat heavier tails, showing variability in learning strategies.

- **Family Education Avg**
  - Skewness: **0.12** → Very close to symmetric.
  - Kurtosis: **-1.13** → Light tails, family education levels cluster around average.

---

✅ **Overall Conclusion**:  
Students are mostly teenagers (16–18) with **good attendance**, **low absences**, and **average academic performance**. A small portion struggles with failures or poor attendance, while **study efficiency and family education level** appear as strong differentiators in outcomes.

</span>


In [33]:
# Calculate correlation matrix for numerical variables
numerical_vars = df[available_key_vars].select_dtypes(include=["int64", "float64"])
correlation_matrix = numerical_vars.corr()

# Print correlation matrix
print("\nCorrelation Matrix:")
print("=" * 50)
print(correlation_matrix.round(2).to_string())

# Print strongest correlations
print("\nStrongest Correlations:")
print("=" * 50)

# Get unique pairs of correlations
correlations = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        correlations.append(
            {
                "var1": correlation_matrix.columns[i],
                "var2": correlation_matrix.columns[j],
                "corr": abs(correlation_matrix.iloc[i, j]),
            }
        )

# Sort by absolute correlation value
correlations.sort(key=lambda x: x["corr"], reverse=True)

# Print top 10 strongest correlations
print("\nTop 10 Strongest Variable Relationships:")
print("-" * 40)
for i, corr in enumerate(correlations[:10], 1):
    actual_corr = correlation_matrix.loc[corr["var1"], corr["var2"]]
    direction = "positive" if actual_corr > 0 else "negative"
    strength = (
        "strong"
        if abs(actual_corr) > 0.5
        else "moderate" if abs(actual_corr) > 0.3 else "weak"
    )
    print(
        f"{i}. {corr['var1']} vs {corr['var2']}: {actual_corr:.3f} ({strength} {direction})"
    )


Correlation Matrix:
                   age  absences  studytime  failures  freetime  goout    G3  attendance_rate  study_efficiency  family_edu_avg
age               1.00      0.16      -0.01      0.32     -0.00   0.11 -0.11            -0.16             -0.13           -0.13
absences          0.16      1.00      -0.11      0.13     -0.02   0.10 -0.10            -1.00              0.01            0.01
studytime        -0.01     -0.11       1.00     -0.15     -0.07  -0.08  0.25             0.11             -0.62            0.08
failures          0.32      0.13      -0.15      1.00      0.11   0.05 -0.39            -0.13             -0.20           -0.19
freetime         -0.00     -0.02      -0.07      0.11      1.00   0.35 -0.12             0.02             -0.03           -0.01
goout             0.11      0.10      -0.08      0.05      0.35   1.00 -0.09            -0.10             -0.00            0.02
G3               -0.11     -0.10       0.25     -0.39     -0.12  -0.09  1.00       

In [None]:
# Analyze correlations by category
categories = {
    "Academic": ["G3", "studytime", "failures", "study_efficiency"],
    "Attendance": ["attendance_rate", "absences"],
    "Social": ["freetime", "goout"],
    "Family": ["family_edu_avg"],
}

print("\nCorrelation Analysis by Category:")
print("=" * 50)

for category, vars in categories.items():
    available_vars = [var for var in vars if var in df.columns]
    if len(available_vars) > 1:  # Need at least 2 variables for correlation
        print(f"\n{category} Factors:")
        print("-" * 30)
        cat_corr = df[available_vars].corr()

        # Print correlations within category
        for i in range(len(available_vars)):
            for j in range(i):
                corr = cat_corr.iloc[i, j]
                var1, var2 = available_vars[i], available_vars[j]
                strength = (
                    "Strong"
                    if abs(corr) > 0.5
                    else "Moderate" if abs(corr) > 0.3 else "Weak"
                )
                direction = "positive" if corr > 0 else "negative"
                print(
                    f"{var1} vs {var2}: {corr:.3f} ({strength} {direction} correlation)"
                )

# Print correlations with final grade (G3)
if "G3" in df.columns:
    print("\nCorrelations with Final Grade (G3):")
    print("-" * 40)
    g3_corrs = correlation_matrix["G3"].sort_values(key=abs, ascending=False)
    g3_corrs = g3_corrs[g3_corrs.index != "G3"]  # Remove self-correlation

    for var, corr in g3_corrs.items():
        strength = (
            "Strong" if abs(corr) > 0.5 else "Moderate" if abs(corr) > 0.3 else "Weak"
        )
        direction = "positive" if corr > 0 else "negative"
        print(f"{var:15} | {corr:6.3f} ({strength} {direction})")


Correlation Analysis by Category:

Academic Factors:
------------------------------
studytime vs G3: 0.250 (Weak positive correlation)
failures vs G3: -0.393 (Moderate negative correlation)
failures vs studytime: -0.147 (Weak negative correlation)
study_efficiency vs G3: 0.528 (Strong positive correlation)
study_efficiency vs studytime: -0.623 (Strong negative correlation)
study_efficiency vs failures: -0.200 (Weak negative correlation)

Attendance Factors:
------------------------------
absences vs attendance_rate: -1.000 (Strong negative correlation)

Social Factors:
------------------------------
goout vs freetime: 0.346 (Moderate positive correlation)

Correlations with Final Grade (G3):
----------------------------------------
study_efficiency |  0.528 (Strong positive)
failures        | -0.393 (Moderate negative)
studytime       |  0.250 (Weak positive)
family_edu_avg  |  0.249 (Weak positive)
freetime        | -0.123 (Weak negative)
age             | -0.107 (Weak negative)
abse

<span style="font-size: 14px; line-height: 1.4;">

1. **Interpretation Guide:**

   - Correlation ranges from -1 to +1
   - Positive values indicate variables move together
   - Negative values indicate inverse relationships
   - Values closer to ±1 indicate stronger relationships

2. **Correlation Strength Categories:**

   - Strong: |r| > 0.5
   - Moderate: 0.3 < |r| ≤ 0.5
   - Weak: |r| ≤ 0.3

3. **Key Findings:**

   - **Academic Factors:** Study efficiency and failures show strongest correlations with final grades
   - **Attendance Impact:** Moderate correlation between attendance rate and academic performance
   - **Social Balance:** Weak to moderate correlations between social activities and grades
   - **Family Background:** Family education level shows notable influence on student performance

4. **Important Relationships:**

   - Strong negative correlation between failures and final grades
   - Positive correlation between study efficiency and performance
   - Moderate negative correlation between absences and grades
   - Social factors (freetime, goout) show weaker correlations

5. **Practical Implications:**
   - Focus on reducing failures and improving study efficiency
   - Monitor and improve attendance
   - Balance social activities with academic commitments
   - Consider family background in student support planning


3.2 TARGET VARIABLE ANALYSIS


In [None]:
g3_stats = df["G3"].describe()
print(f"G3 Grade Distribution:")
print(f"- Mean: {g3_stats.mean():.2f}")
print(f"- Median: {g3_stats.median():.2f}")
print(f"- Standard Deviation: {g3_stats.std():.2f}")
print(f"- Range: {g3_stats.min():.0f} - {g3_stats.max():.0f}")

# Pass/Fail analysis
pass_rate = df["pass_binary"].mean()
print(f"- Pass Rate (G3 ≥ 10): {pass_rate:.1%}")
print(f"- Fail Rate (G3 < 10): {1-pass_rate:.1%}")

# Risk category distribution
risk_dist = df["risk_category"].value_counts(normalize=True).sort_index()
print(f"- Risk Category Distribution:")
for category, pct in risk_dist.items():
    print(f"  • {category}: {pct:.1%}")

G3 Grade Distribution:
- Mean: 11.91
- Median: 12.00
- Standard Deviation: 3.23
- Range: 0 - 19
- Pass Rate (G3 ≥ 10): 84.6%
- Fail Rate (G3 < 10): 15.4%
- Risk Category Distribution:
  • High_Risk: 15.4%
  • Low_Risk: 29.9%
  • Medium_Risk: 54.7%


In [None]:
# Calculate correlations with G3
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns
correlations = df[numerical_cols].corr()["G3"].sort_values(key=abs, ascending=False)

print("Strongest correlations with G3 (final grade):")
print("=" * 50)
for var, corr in correlations.items():
    if var != "G3" and abs(corr) > 0.1:  # Show meaningful correlations
        direction = "positive" if corr > 0 else "negative"
        strength = (
            "strong" if abs(corr) > 0.5 else "moderate" if abs(corr) > 0.3 else "weak"
        )
        print(f"{var:20} | {corr:6.3f} ({strength} {direction})")

Strongest correlations with G3 (final grade):
pass_binary          |  0.663 (strong positive)
study_efficiency     |  0.528 (strong positive)
has_failures         | -0.438 (moderate negative)
failures             | -0.393 (moderate negative)
studytime            |  0.250 (weak positive)
family_edu_avg       |  0.249 (weak positive)
family_edu_max       |  0.244 (weak positive)
Medu                 |  0.240 (weak positive)
Fedu                 |  0.212 (weak positive)
Dalc                 | -0.205 (weak negative)
Walc                 | -0.177 (weak negative)
traveltime           | -0.127 (weak negative)
freetime             | -0.123 (weak negative)
age                  | -0.107 (weak negative)


In [15]:
# Top 5 positive and negative correlations
top_positive = correlations[correlations > 0].head(6)[1:]  # Exclude G3 itself
top_negative = correlations[correlations < 0].head(5)

print(f"\nTop 5 Positive Predictors of G3:")
for var, corr in top_positive.items():
    print(f"  • {var}: {corr:.3f}")

print(f"\nTop 5 Negative Predictors of G3:")
for var, corr in top_negative.items():
    print(f"  • {var}: {corr:.3f}")


Top 5 Positive Predictors of G3:
  • pass_binary: 0.663
  • study_efficiency: 0.528
  • studytime: 0.250
  • family_edu_avg: 0.249
  • family_edu_max: 0.244

Top 5 Negative Predictors of G3:
  • has_failures: -0.438
  • failures: -0.393
  • Dalc: -0.205
  • Walc: -0.177
  • traveltime: -0.127


3.3 GROUP COMPARISONS


In [43]:
df.columns

Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G3',
       'school_MS', 'sex_M', 'address_U', 'famsize_LE3', 'Pstatus_T',
       'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher',
       'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher',
       'reason_home', 'reason_other', 'reason_reputation', 'guardian_mother',
       'guardian_other', 'schoolsup_yes', 'famsup_yes', 'paid_yes',
       'activities_yes', 'nursery_yes', 'higher_yes', 'internet_yes',
       'romantic_yes', 'attendance_rate', 'pass_binary', 'risk_category',
       'study_efficiency', 'has_failures', 'family_edu_avg', 'family_edu_max'],
      dtype='object')

In [56]:
from scipy.stats import ttest_ind, f_oneway
import pandas as pd


def compare_groups(df, group_var, target_var="G3"):
    """Compare target variable across groups and run statistical tests"""

    if group_var not in df.columns or target_var not in df.columns:
        raise ValueError(f"Variable {group_var} or {target_var} not in dataframe.")

    # Ensure categorical is string
    df[group_var] = df[group_var].astype(str)

    # Group stats
    groups = df.groupby(group_var)[target_var].agg(["count", "mean", "std"]).round(2)

    # Collect values
    group_values = [
        df[df[group_var] == group][target_var].values
        for group in df[group_var].unique()
        if len(df[df[group_var] == group]) > 1
    ]

    test_result = None
    if len(group_values) == 2:
        stat, p_value = ttest_ind(group_values[0], group_values[1], equal_var=False)
        significance = "significant" if p_value < 0.05 else "not significant"
        test_result = {
            "test": "t-test",
            "stat": stat,
            "p_value": p_value,
            "significance": significance,
        }
    elif len(group_values) > 2:
        stat, p_value = f_oneway(*group_values)
        significance = "significant" if p_value < 0.05 else "not significant"
        test_result = {
            "test": "ANOVA",
            "stat": stat,
            "p_value": p_value,
            "significance": significance,
        }
    else:
        test_result = {"test": None, "note": "Not enough data for statistical test"}

    return groups, test_result

In [58]:
categorical_vars = [
    "studytime",
    "school_MS",
    "sex_M",
    "address_U",
    "famsize_LE3",
    "Pstatus_T",
    "Mjob_health",
    "Mjob_other",
    "Mjob_services",
    "Mjob_teacher",
    "Fjob_health",
    "Fjob_other",
    "Fjob_services",
    "Fjob_teacher",
    "reason_home",
    "reason_other",
    "reason_reputation",
    "guardian_mother",
    "guardian_other",
    "schoolsup_yes",
    "famsup_yes",
    "paid_yes",
    "activities_yes",
    "nursery_yes",
    "higher_yes",
    "internet_yes",
    "romantic_yes",
    "has_failures",
    "family_edu_max",
]

for col in categorical_vars:
    groups, result = compare_groups(df, col, "G3")
    print(f"--- {col} ---")
    print(groups)
    print(result)
    print("\n")

--- studytime ---
           count   mean   std
studytime                    
1            212  10.84  3.22
2            305  12.09  3.24
3             97  13.23  2.50
4             35  13.06  3.04
{'test': 'ANOVA', 'stat': 15.876267993177123, 'p_value': 5.705728458962843e-10, 'significance': 'significant'}


--- school_MS ---
           count   mean   std
school_MS                    
False        423  12.58  2.63
True         226  10.65  3.83
{'test': 't-test', 'stat': 6.754491544530737, 'p_value': 6.211839408463177e-11, 'significance': 'significant'}


--- sex_M ---
       count   mean   std
sex_M                    
False    383  12.25  3.12
True     266  11.41  3.32
{'test': 't-test', 'stat': 3.274707393354231, 'p_value': 0.0011245651360440646, 'significance': 'significant'}


--- address_U ---
           count   mean   std
address_U                    
False        197  11.09  3.61
True         452  12.26  2.99
{'test': 't-test', 'stat': 4.019882754058349, 'p_value': 7.2738534089

3.4 HYPOTHESIS TESTING


In [20]:
# Hypothesis 1: Students with higher study time perform better
print("\nHYPOTHESIS 1: Higher study time → Better performance")
if "studytime" in df.columns and "G3" in df.columns:
    corr_study_g3 = df["studytime"].corr(df["G3"])
    print(f"Study time - G3 correlation: {corr_study_g3:.3f}")

    # Group analysis
    study_groups = df.groupby("studytime")["G3"].mean()
    print(f"Average G3 by study time:")
    for time, avg_grade in study_groups.items():
        print(f"  Study time {time}: {avg_grade:.2f}")

    # Test high vs low study time
    high_study = df[df["studytime"] >= 3]["G3"]
    low_study = df[df["studytime"] <= 2]["G3"]
    stat, p_val = ttest_ind(high_study, low_study)
    result = (
        "SUPPORTED"
        if p_val < 0.05 and high_study.mean() > low_study.mean()
        else "NOT SUPPORTED"
    )
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 1: Higher study time → Better performance
Study time - G3 correlation: 0.250
Average G3 by study time:
  Study time 1: 10.84
  Study time 2: 12.09
  Study time 3: 13.23
  Study time 4: 13.06
Result: SUPPORTED (p-value: 0.0000)


In [21]:
# Hypothesis 2: Students with school support perform better
print("\nHYPOTHESIS 2: School support → Better performance")
if "schoolsup_yes" in df.columns and "G3" in df.columns:
    support_yes = df[df["schoolsup_yes"] == 1]["G3"]
    support_no = df[df["schoolsup_yes"] == 0]["G3"]
    stat, p_val = ttest_ind(support_yes, support_no)
    print(f"With school support: {support_yes.mean():.2f}")
    print(f"Without school support: {support_no.mean():.2f}")
    result = (
        "SUPPORTED"
        if p_val < 0.05 and support_yes.mean() > support_no.mean()
        else "NOT SUPPORTED"
    )
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 2: School support → Better performance
With school support: 11.28
Without school support: 11.98
Result: NOT SUPPORTED (p-value: 0.0910)


In [22]:
# Hypothesis 3: Higher absences lead to lower performance
print("\nHYPOTHESIS 3: Higher absences → Lower performance")
if "absences" in df.columns and "G3" in df.columns:
    corr_abs_g3 = df["absences"].corr(df["G3"])
    print(f"Absences - G3 correlation: {corr_abs_g3:.3f}")

    high_absence = df[df["absences"] > df["absences"].median()]["G3"]
    low_absence = df[df["absences"] <= df["absences"].median()]["G3"]
    stat, p_val = ttest_ind(high_absence, low_absence)
    result = "SUPPORTED" if p_val < 0.05 and corr_abs_g3 < 0 else "NOT SUPPORTED"
    print(f"High absences avg G3: {high_absence.mean():.2f}")
    print(f"Low absences avg G3: {low_absence.mean():.2f}")
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 3: Higher absences → Lower performance
Absences - G3 correlation: -0.099
High absences avg G3: 11.66
Low absences avg G3: 12.10
Result: NOT SUPPORTED (p-value: 0.0845)


In [23]:
# Hypothesis 3: Higher absences lead to lower performance
print("\nHYPOTHESIS 3: Higher absences → Lower performance")
if "absences" in df.columns and "G3" in df.columns:
    corr_abs_g3 = df["absences"].corr(df["G3"])
    print(f"Absences - G3 correlation: {corr_abs_g3:.3f}")

    high_absence = df[df["absences"] > df["absences"].median()]["G3"]
    low_absence = df[df["absences"] <= df["absences"].median()]["G3"]
    stat, p_val = ttest_ind(high_absence, low_absence)
    result = "SUPPORTED" if p_val < 0.05 and corr_abs_g3 < 0 else "NOT SUPPORTED"
    print(f"High absences avg G3: {high_absence.mean():.2f}")
    print(f"Low absences avg G3: {low_absence.mean():.2f}")
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 3: Higher absences → Lower performance
Absences - G3 correlation: -0.099
High absences avg G3: 11.66
Low absences avg G3: 12.10
Result: NOT SUPPORTED (p-value: 0.0845)


In [24]:
# Hypothesis 4: Past failures predict current performance
print("\nHYPOTHESIS 4: Past failures → Lower performance")
if "failures" in df.columns and "G3" in df.columns:
    no_failures = df[df["failures"] == 0]["G3"]
    with_failures = df[df["failures"] > 0]["G3"]
    stat, p_val = ttest_ind(no_failures, with_failures)
    print(f"No past failures avg G3: {no_failures.mean():.2f}")
    print(f"With past failures avg G3: {with_failures.mean():.2f}")
    result = (
        "SUPPORTED"
        if p_val < 0.05 and no_failures.mean() > with_failures.mean()
        else "NOT SUPPORTED"
    )
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 4: Past failures → Lower performance
No past failures avg G3: 12.51
With past failures avg G3: 8.59
Result: SUPPORTED (p-value: 0.0000)


In [25]:
# Hypothesis 5: Family education level affects performance
print("\nHYPOTHESIS 5: Higher family education → Better performance")
if "family_edu_avg" in df.columns and "G3" in df.columns:
    corr_fam_edu_g3 = df["family_edu_avg"].corr(df["G3"])
    print(f"Family education - G3 correlation: {corr_fam_edu_g3:.3f}")

    high_edu = df[df["family_edu_avg"] >= 3]["G3"]
    low_edu = df[df["family_edu_avg"] < 3]["G3"]
    stat, p_val = ttest_ind(high_edu, low_edu)
    result = "SUPPORTED" if p_val < 0.05 and corr_fam_edu_g3 > 0 else "NOT SUPPORTED"
    print(f"High family education avg G3: {high_edu.mean():.2f}")
    print(f"Low family education avg G3: {low_edu.mean():.2f}")
    print(f"Result: {result} (p-value: {p_val:.4f})")


HYPOTHESIS 5: Higher family education → Better performance
Family education - G3 correlation: 0.249
High family education avg G3: 12.80
Low family education avg G3: 11.34
Result: SUPPORTED (p-value: 0.0000)


3.5 BEHAVIORAL PATTERNS ANALYSIS


In [26]:
# Create behavioral profile
behavioral_vars = ["studytime", "goout", "freetime", "absences"]
available_behavioral = [var for var in behavioral_vars if var in df.columns]

if available_behavioral:
    print("Student behavioral patterns and their relationship with performance:")
    print("=" * 65)

    # High performers vs Low performers
    if "pass_binary" in df.columns:
        high_performers = df[df["pass_binary"] == 1]
        low_performers = df[df["pass_binary"] == 0]

        print(f"HIGH PERFORMERS (n={len(high_performers)}):")
        for var in available_behavioral:
            mean_high = high_performers[var].mean()
            print(f"  • {var}: {mean_high:.2f}")

        print(f"\nLOW PERFORMERS (n={len(low_performers)}):")
        for var in available_behavioral:
            mean_low = low_performers[var].mean()
            print(f"  • {var}: {mean_low:.2f}")

        print(f"\nKEY BEHAVIORAL DIFFERENCES:")
        for var in available_behavioral:
            diff = high_performers[var].mean() - low_performers[var].mean()
            direction = "higher" if diff > 0 else "lower"
            print(f"  • High performers have {direction} {var} ({diff:+.2f})")

Student behavioral patterns and their relationship with performance:
HIGH PERFORMERS (n=549):
  • studytime: 1.99
  • goout: 3.15
  • freetime: 3.14
  • absences: 3.36

LOW PERFORMERS (n=100):
  • studytime: 1.61
  • goout: 3.37
  • freetime: 3.41
  • absences: 4.32

KEY BEHAVIORAL DIFFERENCES:
  • High performers have higher studytime (+0.38)
  • High performers have lower goout (-0.22)
  • High performers have lower freetime (-0.27)
  • High performers have lower absences (-0.96)


3.6 FEATURE IMPORTANCE FOR CLUSTERING


In [None]:
# Analyze clustering features
clustering_features = ["studytime", "absences", "goout", "freetime", "attendance_rate"]
available_clustering = [feat for feat in clustering_features if feat in df.columns]

if available_clustering:
    print("Behavioral clustering features summary:")
    cluster_stats = df[available_clustering].describe().round(2)
    print(cluster_stats.to_string())

    # Calculate feature variance (important for clustering)
    print(f"\nFeature variance (higher = more discriminative):")
    for feat in available_clustering:
        variance = df[feat].var()
        print(f"  • {feat}: {variance:.3f}")


8. FEATURE ANALYSIS FOR CLUSTERING
--------------------------------------
Behavioral clustering features summary:
       studytime  absences   goout  freetime  attendance_rate
count     649.00    649.00  649.00    649.00           649.00
mean        1.93      3.51    3.18      3.18             0.77
std         0.83      4.09    1.18      1.05             0.27
min         1.00      0.00    1.00      1.00             0.00
25%         1.00      0.00    2.00      3.00             0.60
50%         2.00      2.00    3.00      3.00             0.87
75%         2.00      6.00    4.00      4.00             1.00
max         4.00     15.00    5.00      5.00             1.00

Feature variance (higher = more discriminative):
  • studytime: 0.688
  • absences: 16.695
  • goout: 1.382
  • freetime: 1.105
  • attendance_rate: 0.074


3.6 ADDITIONAL EDA INSIGHTS


In [None]:
# 2. Outlier Analysis
print("\nOutlier Analysis")
print("=" * 50)


def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
    return len(outliers), lower_bound, upper_bound


numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns
print("\nOutlier Summary:")
for column in numerical_columns:
    n_outliers, lower, upper = detect_outliers(df, column)
    if n_outliers > 0:
        print(f"\n{column}:")
        print(f"- Number of outliers: {n_outliers}")
        print(f"- Outlier boundaries: [{lower:.2f}, {upper:.2f}]")
        print(f"- Actual range: [{df[column].min():.2f}, {df[column].max():.2f}]")


Outlier Analysis

Outlier Summary:

age:
- Number of outliers: 1
- Outlier boundaries: [13.00, 21.00]
- Actual range: [15.00, 22.00]

traveltime:
- Number of outliers: 16
- Outlier boundaries: [-0.50, 3.50]
- Actual range: [1.00, 4.00]

failures:
- Number of outliers: 100
- Outlier boundaries: [0.00, 0.00]
- Actual range: [0.00, 3.00]

famrel:
- Number of outliers: 51
- Outlier boundaries: [2.50, 6.50]
- Actual range: [1.00, 5.00]

freetime:
- Number of outliers: 45
- Outlier boundaries: [1.50, 5.50]
- Actual range: [1.00, 5.00]

Dalc:
- Number of outliers: 34
- Outlier boundaries: [-0.50, 3.50]
- Actual range: [1.00, 5.00]

G3:
- Number of outliers: 16
- Outlier boundaries: [4.00, 20.00]
- Actual range: [0.00, 19.00]

pass_binary:
- Number of outliers: 100
- Outlier boundaries: [1.00, 1.00]
- Actual range: [0.00, 1.00]

study_efficiency:
- Number of outliers: 13
- Outlier boundaries: [0.83, 7.50]
- Actual range: [0.56, 8.83]


In [None]:
# 3. Categorical Variable Analysis
print("\nCategorical Variable Analysis")
print("=" * 50)

categorical_columns = df.select_dtypes(include=["object", "category"]).columns

if len(categorical_columns) > 0:
    print("\nCategory Distribution:")
    for column in categorical_columns:
        value_counts = df[column].value_counts()
        percentages = df[column].value_counts(normalize=True) * 100

        print(f"\n{column}:")
        for value, count in value_counts.items():
            percentage = percentages[value]
            print(f"- {value}: {count} ({percentage:.1f}%)")


Categorical Variable Analysis

Category Distribution:

studytime:
- 2: 305 (47.0%)
- 1: 212 (32.7%)
- 3: 97 (14.9%)
- 4: 35 (5.4%)

school_MS:
- False: 423 (65.2%)
- True: 226 (34.8%)

sex_M:
- False: 383 (59.0%)
- True: 266 (41.0%)

address_U:
- True: 452 (69.6%)
- False: 197 (30.4%)

famsize_LE3:
- False: 457 (70.4%)
- True: 192 (29.6%)

Pstatus_T:
- True: 569 (87.7%)
- False: 80 (12.3%)

Mjob_health:
- False: 601 (92.6%)
- True: 48 (7.4%)

Mjob_other:
- False: 391 (60.2%)
- True: 258 (39.8%)

Mjob_services:
- False: 513 (79.0%)
- True: 136 (21.0%)

Mjob_teacher:
- False: 577 (88.9%)
- True: 72 (11.1%)

Fjob_health:
- False: 626 (96.5%)
- True: 23 (3.5%)

Fjob_other:
- True: 367 (56.5%)
- False: 282 (43.5%)

Fjob_services:
- False: 468 (72.1%)
- True: 181 (27.9%)

Fjob_teacher:
- False: 613 (94.5%)
- True: 36 (5.5%)

reason_home:
- False: 500 (77.0%)
- True: 149 (23.0%)

reason_other:
- False: 577 (88.9%)
- True: 72 (11.1%)

reason_reputation:
- False: 506 (78.0%)
- True: 143 (22.0%

In [None]:
# 4. Feature Distribution Analysis
print("\nFeature Distribution Analysis")
print("=" * 50)

numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

for column in numerical_columns:
    distribution_stats = df[column].describe()

    print(f"\n{column} Distribution:")
    print(f"- Mean: {distribution_stats['mean']:.2f}")
    print(f"- Median: {distribution_stats['50%']:.2f}")
    print(f"- Mode: {df[column].mode().values[0]:.2f}")
    print(f"- Standard Deviation: {distribution_stats['std']:.2f}")
    print(
        f"- Coefficient of Variation: {(distribution_stats['std'] / distribution_stats['mean']):.2f}"
    )

    # Distribution shape
    skewness = df[column].skew()
    kurtosis = df[column].kurtosis()

    # Interpret distribution shape
    skew_interpretation = (
        "symmetric"
        if abs(skewness) < 0.5
        else "moderately skewed" if abs(skewness) < 1 else "highly skewed"
    )
    skew_direction = "right" if skewness > 0 else "left" if skewness < 0 else "none"

    kurt_interpretation = (
        "normal"
        if abs(kurtosis) < 0.5
        else "moderate" if abs(kurtosis) < 1 else "extreme"
    )
    kurt_direction = (
        "heavy-tailed"
        if kurtosis > 0
        else "light-tailed" if kurtosis < 0 else "normal-tailed"
    )

    print(f"- Distribution Shape:")
    print(
        f"  • Skewness: {skewness:.2f} ({skew_interpretation} with {skew_direction} skew)"
    )
    print(f"  • Kurtosis: {kurtosis:.2f} ({kurt_interpretation} {kurt_direction})")


Feature Distribution Analysis

age Distribution:
- Mean: 16.74
- Median: 17.00
- Mode: 17.00
- Standard Deviation: 1.22
- Coefficient of Variation: 0.07
- Distribution Shape:
  • Skewness: 0.42 (symmetric with right skew)
  • Kurtosis: 0.07 (normal heavy-tailed)

Medu Distribution:
- Mean: 2.51
- Median: 2.00
- Mode: 2.00
- Standard Deviation: 1.13
- Coefficient of Variation: 0.45
- Distribution Shape:
  • Skewness: -0.03 (symmetric with left skew)
  • Kurtosis: -1.26 (extreme light-tailed)

Fedu Distribution:
- Mean: 2.31
- Median: 2.00
- Mode: 2.00
- Standard Deviation: 1.10
- Coefficient of Variation: 0.48
- Distribution Shape:
  • Skewness: 0.22 (symmetric with right skew)
  • Kurtosis: -1.11 (extreme light-tailed)

traveltime Distribution:
- Mean: 1.57
- Median: 1.00
- Mode: 1.00
- Standard Deviation: 0.75
- Coefficient of Variation: 0.48
- Distribution Shape:
  • Skewness: 1.25 (highly skewed with right skew)
  • Kurtosis: 1.11 (extreme heavy-tailed)

failures Distribution:
- Me

3.7 EDA INSIGHTS SUMMARY

In [66]:
insights = []
# Demographic insights
insights.append(
    f"Age range: {df['age'].min()}-{df['age'].max()} years (avg: {df['age'].mean():.1f})"
)

# Study behavior insights
study_corr = df["studytime"].corr(df["G3"])
study_eff_corr = df["study_efficiency"].corr(df["G3"])
insights.append(
    f"Study efficiency impact: r={study_eff_corr:.3f} (strongest academic predictor)"
)
insights.append(
    f"Students with high study time (3-4 hrs): {len(df[df['studytime'].astype(int) >= 3])/len(df):.1%}"
)

# Risk patterns
risk_dist = df["risk_category"].value_counts(normalize=True)
insights.append(
    f"Risk distribution: {risk_dist['High_Risk']:.1%} high, {risk_dist['Medium_Risk']:.1%} medium"
)

# Family background impact
fam_edu_corr = df["family_edu_avg"].corr(df["G3"])
insights.append(f"Family education influence: r={fam_edu_corr:.3f}")

# Attendance patterns
att_corr = df["attendance_rate"].corr(df["G3"])
insights.append(f"Attendance correlation: r={att_corr:.3f}")

# Failure rate impact
if "has_failures" in df.columns:
    fail_impact = (
        df[df["has_failures"] == 1]["G3"].mean()
        - df[df["has_failures"] == 0]["G3"].mean()
    )
    insights.append(f"Impact of past failures: {fail_impact:.1f} grade points")
# Generate insights based on analysis
if "G3" in df.columns:
    avg_grade = df["G3"].mean()
    pass_rate = df["pass_binary"].mean() if "pass_binary" in df.columns else None
    insights.append(f"Average final grade: {avg_grade:.1f}/20")
    if pass_rate:
        insights.append(f"Overall pass rate: {pass_rate:.1%}")

# Top correlations
if "G3" in df.columns:
    num_cols = df.select_dtypes(include=["int64", "float64"]).columns
    if len(num_cols) > 1:
        corrs = df[num_cols].corr()["G3"].abs().sort_values(ascending=False)
        top_predictor = corrs.index[1]  # Skip G3 itself
        top_corr = corrs.iloc[1]
        insights.append(f"Strongest predictor: {top_predictor} (r={top_corr:.3f})")

# Behavioral insights
if "studytime" in df.columns and "G3" in df.columns:
    study_corr = df["studytime"].corr(df["G3"])
    insights.append(f"Study time correlation with grades: {study_corr:.3f}")

if "absences" in df.columns and "G3" in df.columns:
    abs_corr = df["absences"].corr(df["G3"])
    insights.append(f"Absence impact on grades: {abs_corr:.3f}")

print("KEY INSIGHTS:")
for i, insight in enumerate(insights, 1):
    print(f"{i}. {insight}")

print(f"\n" + "=" * 60)
print("EXPLORATORY DATA ANALYSIS COMPLETE ")
print("=" * 60)

KEY INSIGHTS:
1. Age range: 15-22 years (avg: 16.7)
2. Study efficiency impact: r=0.528 (strongest academic predictor)
3. Students with high study time (3-4 hrs): 20.3%
4. Risk distribution: 15.4% high, 54.7% medium
5. Family education influence: r=0.249
6. Attendance correlation: r=0.099
7. Impact of past failures: nan grade points
8. Average final grade: 11.9/20
9. Overall pass rate: 84.6%
10. Strongest predictor: pass_binary (r=0.663)
11. Study time correlation with grades: 0.250
12. Absence impact on grades: -0.099

EXPLORATORY DATA ANALYSIS COMPLETE 
