## 1) Load dataset

### Project Overview

**Dataset:** UCI Student Performance dataset (Portuguese course, `student-por.csv`, or Math course, `student-mat.csv`)

**Objective:** Predict student academic success (pass/fail) based on demographic, social, and school-related features

**Methods:**
1. **Exploratory Data Analysis (EDA)**: Analyze distributions, correlations, and class balance
2. **Data Preprocessing**: Handle missing values, encode categorical variables, create target variable
3. **Supervised Learning**: Train and evaluate three classification models
   - Logistic Regression
   - Random Forest
   - Gradient Boosting
4. **Model Evaluation**: Use accuracy, precision, recall, F1-score, ROC AUC, and cross-validation
5. **Unsupervised Learning**: K-Means clustering to identify student segments
6. **Association Rule Mining**: Discover frequent patterns in student characteristics

**Target Variable:** Binary pass/fail label derived from the final grade G3 (0-20 scale): Pass if G3 >= 10, Fail if G3 < 10
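
As a minimal sketch of this rule, the snippet below applies the threshold to a toy frame. The grades are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical grades for illustration only (not real student records)
toy = pd.DataFrame({'G3': [0, 9, 10, 15, 20]})

# Pass = 1 when the final grade reaches the threshold of 10, otherwise Fail = 0
toy['pass'] = (toy['G3'] >= 10).astype(int)

print(toy)
print(toy['pass'].value_counts(normalize=True).sort_index())
```

The boundary case matters: a grade of exactly 10 counts as a pass, while 9 is a fail.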

---


In [6]:
# Read the dataset into a pandas DataFrame
import pandas as pd
import os

# Update these paths if needed (raw strings keep the backslashes literal)
candidate_paths = [
    r'C:\Users\Acer\Downloads\Group Projrct\student-por.csv',
    r'C:\Users\Acer\Downloads\Group Projrct\student-mat.csv',
]
# Fall back to any CSV in the working directory (sorted so the pick is deterministic)
candidate_paths += sorted(f for f in os.listdir('.') if f.lower().endswith('.csv'))

data_path = next((p for p in candidate_paths if os.path.exists(p)), None)
if data_path is not None:
    df = pd.read_csv(data_path, sep=';')
    print('Loaded:', data_path)
else:
    df = pd.DataFrame()
    print('No CSV found in working directory. Please upload or set the correct path.')

print('Data shape:', df.shape)
df.head()


## 2) Exploratory Data Analysis (EDA)
Inspect distributions, missing values, correlations, and class balance.

In [None]:
# Basic EDA
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

print('\n--- Dataset Info ---')
df.info()  # info() prints its summary and returns None, so it should not be wrapped in display()

print('\n--- Descriptive Statistics (numeric) ---')
display(df.describe())

print('\n--- Missing Values ---')
missing = df.isnull().sum()
if missing.sum() > 0:
    display(missing[missing > 0])
else:
    print('No missing values found!')

# Plot distributions for numeric features (one chart per numeric column)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f'\nPlotting distributions for {len(num_cols)} numeric features...')
for col in num_cols[:15]:  # Limit to first 15 to avoid too many plots
    plt.figure(figsize=(6,2.5))
    plt.hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    plt.title(f'Distribution: {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot categorical value counts for top categorical columns
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
print(f'\nPlotting value counts for {min(len(cat_cols), 6)} categorical features...')
for col in cat_cols[:6]:
    plt.figure(figsize=(6,2.5))
    df[col].value_counts().plot(kind='bar', edgecolor='black', alpha=0.7)
    plt.title(f'Value Counts: {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

In [None]:
# Correlation analysis for numeric features
numeric_df = df.select_dtypes(include=[np.number])
if len(numeric_df.columns) > 1:
    plt.figure(figsize=(12, 10))
    correlation_matrix = numeric_df.corr()
    sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
                square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix of Numeric Features')
    plt.tight_layout()
    plt.show()
    
    # Show features most correlated with G3 (final grade)
    if 'G3' in correlation_matrix.columns:
        print('\nFeatures most correlated with Final Grade (G3):')
        g3_corr = correlation_matrix['G3'].sort_values(ascending=False)
        print(g3_corr.to_string())
else:
    print('Not enough numeric columns for correlation analysis.')


In [None]:
# Check class balance for target variable (G3 grades)
if 'G3' in df.columns:
    # Show distribution of final grades with the pass/fail threshold marked
    plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    # One bin per integer grade on the 0-20 scale
    plt.hist(df['G3'], bins=np.arange(-0.5, 21.5, 1), edgecolor='black')
    plt.xlabel('Final Grade (G3)')
    plt.ylabel('Count')
    plt.title('Distribution of Final Grades')
    plt.axvline(x=10, color='r', linestyle='--', label='Pass Threshold (G3=10)')
    plt.legend()
    
    plt.subplot(1, 2, 2)
    pass_count = (df['G3'] >= 10).sum()
    fail_count = (df['G3'] < 10).sum()
    plt.bar(['Fail (G3<10)', 'Pass (G3>=10)'], [fail_count, pass_count], color=['#ff6b6b', '#4ecdc4'])
    plt.ylabel('Count')
    plt.title('Class Balance: Pass vs Fail')
    for i, v in enumerate([fail_count, pass_count]):
        plt.text(i, v + 5, str(v), ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print(f'Pass rate: {pass_count}/{len(df)} = {pass_count/len(df)*100:.1f}%')
    print(f'Fail rate: {fail_count}/{len(df)} = {fail_count/len(df)*100:.1f}%')


## 3) Data Cleaning & Preprocessing
Handle missing values, encode categorical variables, and create target variable.

**Example target:** predict final grade pass/fail (G3 >= 10 -> pass). Adjust to your project's objective.
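
The imputation and encoding steps can be sketched on a toy frame as follows. The column values are hypothetical, and `sparse_threshold=0` is set only so the sketch always returns a dense array that is easy to print; the notebook's own preprocessor may be configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-dataset with one missing value per column type
toy = pd.DataFrame({
    'age': [15, 16, np.nan, 18],
    'school': ['GP', 'MS', 'GP', np.nan],
})

toy_preproc = ColumnTransformer(
    [
        # Numeric: fill gaps with the column median
        ('num', SimpleImputer(strategy='median'), ['age']),
        # Categorical: fill gaps with the most frequent value, then one-hot encode
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['school']),
    ],
    sparse_threshold=0,  # always return a dense array (assumption made for readability)
)

encoded = toy_preproc.fit_transform(toy)
print(encoded)  # columns: age, school=GP, school=MS
```

The missing age is replaced by the median of the observed ages, and the missing school by the most frequent school, before the one-hot columns are built.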

In [None]:
# Example preprocessing pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer

# Make a copy
data = df.copy()

# Example: create binary target 'pass' from final grade column G3 if present
if 'G3' in data.columns:
    data['pass'] = (data['G3'] >= 10).astype(int)
    target_col = 'pass'
else:
    # If no numeric grade present, user should set target manually
    print('No G3 column found. Please define your target_col manually.')
    target_col = None

# Drop grade columns to avoid target leakage:
# - G3 defines the target ('pass' is computed from it), so it must not remain among the features
# - G1 and G2 are period grades that are strongly predictive of G3; excluding them makes the problem more realistic
drop_cols = ['G1', 'G2', 'G3']
data = data.drop(columns=[c for c in drop_cols if c in data.columns])

# Separate features and target
if target_col:
    X = data.drop(columns=[target_col])
    y = data[target_col]
else:
    X = data.copy()
    y = None

# Simple handling: numeric columns fillna with median, categorical with mode, one-hot encode categoricals
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median'))])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

preproc = ColumnTransformer([('num', num_pipe, num_cols),
                             ('cat', cat_pipe, cat_cols)], remainder='drop')

# Note: For supervised learning, preproc will be fit on training data only (see Section 4)
# For unsupervised learning (clustering), we'll fit on the full dataset
print('Feature columns identified:')
print(f'  - Numeric: {len(num_cols)} columns')
print(f'  - Categorical: {len(cat_cols)} columns')

In [None]:
# Extract column names after OneHotEncoder for interpretability
# Fit preproc on full X temporarily just to get feature names
try:
    preproc_temp = ColumnTransformer([('num', num_pipe, num_cols),
                                      ('cat', cat_pipe, cat_cols)], remainder='drop')
    preproc_temp.fit(X)
    
    ohe = None
    for name, trans, cols in preproc_temp.transformers_:
        if name == 'cat':
            ohe = trans.named_steps['onehot']
            cat_in_cols = cols
    feature_names = []
    # numeric names
    feature_names.extend(num_cols)
    # onehot names
    if ohe is not None:
        ohe_names = ohe.get_feature_names_out(cat_in_cols)
        feature_names.extend(list(ohe_names))
    print('Number of features after preprocessing:', len(feature_names))
except Exception as e:
    feature_names = None
    print('Could not extract feature names automatically.', e)

feature_names[:50] if feature_names else None

## 4) Predictive Modeling (Supervised)
We train three models: Logistic Regression, Random Forest, and Gradient Boosting. Each is evaluated on a held-out 20% test set (accuracy, precision, recall, F1, ROC AUC) and with stratified 5-fold cross-validation (F1).
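
If several metrics are wanted from cross-validation in one pass, scikit-learn's `cross_validate` accepts a list of scorers. The sketch below runs on synthetic data from `make_classification`, not on the student data, and uses a `StandardScaler` step as an illustrative stand-in for the notebook's preprocessor.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples with a roughly 20/80 class split
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.2, 0.8], random_state=42
)

# Preprocessing lives inside the pipeline so it is refit on each training fold
demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_validate(demo_pipe, X_demo, y_demo, cv=cv, scoring=scoring)

# Report mean and spread across folds for every metric
for metric in scoring:
    fold_scores = cv_scores[f'test_{metric}']
    print(f'{metric:>9}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}')
```

Reporting the spread across folds alongside the mean makes it easier to judge whether differences between models are larger than fold-to-fold noise.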

In [None]:
# Modeling: train/test split and basic training pipeline
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

# Only proceed if we have a target
if y is None:
    print('No target variable defined. Define y to proceed with supervised modeling.')
else:
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
    # Fit preprocessor ONLY on training data to avoid data leakage
    X_train_pre = preproc.fit_transform(X_train)
    X_test_pre = preproc.transform(X_test)

    models = {
        'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
        'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
        'GradientBoosting': GradientBoostingClassifier(n_estimators=200, random_state=42)
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train_pre, y_train)
        y_pred = model.predict(X_test_pre)
        y_proba = model.predict_proba(X_test_pre)[:,1] if hasattr(model, 'predict_proba') else None
        res = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred, zero_division=0),
            'recall': recall_score(y_test, y_pred, zero_division=0),
            'f1': f1_score(y_test, y_pred, zero_division=0),
        }
        if y_proba is not None:
            res['roc_auc'] = roc_auc_score(y_test, y_proba)
        results[name] = res
        print(f'--- {name} ---')
        print(classification_report(y_test, y_pred))
    print('\nSummary results:')
    import pandas as pd
    display(pd.DataFrame(results).T)

In [None]:
# Visualize model performance comparison
if y is not None and len(results) > 0:
    import pandas as pd
    results_df = pd.DataFrame(results).T
    
    # Plot metrics comparison
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    
    for idx, metric in enumerate(metrics):
        ax = axes[idx // 2, idx % 2]
        if metric in results_df.columns:
            results_df[metric].plot(kind='bar', ax=ax, color=['#3498db', '#2ecc71', '#e74c3c'])
            ax.set_title(f'{metric.capitalize()} by Model')
            ax.set_ylabel(metric.capitalize())
            ax.set_xlabel('Model')
            ax.set_ylim([0, 1])
            ax.grid(True, alpha=0.3, axis='y')
            ax.set_xticklabels(results_df.index, rotation=45, ha='right')
            
            # Add value labels on bars
            for i, v in enumerate(results_df[metric]):
                ax.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Display ROC AUC comparison if available
    if 'roc_auc' in results_df.columns:
        plt.figure(figsize=(8, 5))
        results_df['roc_auc'].plot(kind='bar', color=['#3498db', '#2ecc71', '#e74c3c'])
        plt.title('ROC AUC Score by Model')
        plt.ylabel('ROC AUC')
        plt.xlabel('Model')
        plt.ylim([0, 1])
        plt.grid(True, alpha=0.3, axis='y')
        plt.xticks(rotation=45, ha='right')
        
        # Add value labels
        for i, v in enumerate(results_df['roc_auc']):
            plt.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')
        
        plt.tight_layout()
        plt.show()


In [None]:
# Cross-validation (Stratified K-Fold) for each model
# Use Pipeline to avoid data leakage - preprocessor fits only on training folds
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

if y is not None:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_results = {}

    for name, model in models.items():
        # Build a fresh preprocessor per model so nothing fitted earlier is reused
        num_pipe_cv = Pipeline([('imputer', SimpleImputer(strategy='median'))])
        cat_pipe_cv = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                                ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
        preproc_cv = ColumnTransformer([('num', num_pipe_cv, num_cols),
                                        ('cat', cat_pipe_cv, cat_cols)], remainder='drop')

        # cross_val_score clones the pipeline for every fold, so the models fitted above are left untouched
        pipeline = Pipeline([('preprocess', preproc_cv), ('classifier', model)])
        scores = cross_val_score(pipeline, X, y, cv=skf, scoring='f1')
        cv_results[name] = scores
    print('Cross-val F1 scores:')
    display(pd.DataFrame(cv_results))

### Feature importance (for tree-based models)
Show top features from RandomForest.
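
Impurity-based importances from a random forest are computed on the training data and can favor features with many distinct values. As a complementary check (a different technique, not the one used in this notebook), permutation importance on held-out data can be sketched as below. The data is synthetic, and all `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in F1
perm = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=10, random_state=42, scoring='f1'
)

# Compare both importance measures for the top 5 features by permutation importance
for i in perm.importances_mean.argsort()[::-1][:5]:
    print(f'feature {i}: permutation={perm.importances_mean[i]:.4f} '
          f'impurity={rf_demo.feature_importances_[i]:.4f}')
```

Features that rank high under both measures are safer candidates for interpretation than features that rank high under only one.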

In [None]:
# Feature importances from RandomForest
import numpy as np
if 'RandomForest' in models and feature_names is not None:
    rf = models['RandomForest']
    importances = rf.feature_importances_
    idx = np.argsort(importances)[::-1]
    
    print('Top 20 Most Important Features:')
    print('-' * 50)
    for i, feat_idx in enumerate(idx[:20], 1):
        print(f'{i:2d}. {feature_names[feat_idx]:40s} {importances[feat_idx]:.4f}')
    
    # Visualize top features (cap at the number of available features)
    plt.figure(figsize=(10, 6))
    top_n = min(15, len(idx))
    top_idx = idx[:top_n]
    plt.barh(range(top_n), importances[top_idx])
    plt.yticks(range(top_n), [feature_names[i] for i in top_idx])
    plt.xlabel('Feature Importance')
    plt.title(f'Top {top_n} Feature Importances (Random Forest)')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print('Feature names not available or RandomForest not trained.')

## 5) Unsupervised Modeling
### 5.1 K-Means clustering
Cluster students and inspect cluster characteristics.
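
K-Means relies on Euclidean distances, so features with large numeric ranges can dominate the clusters unless the inputs are scaled. The sketch below standardizes the features and uses the silhouette score as a second opinion next to an elbow plot. It runs on synthetic blobs, and the scaling step is an assumption of this sketch rather than part of the notebook's pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 blobs in 4 dimensions
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X_demo[:, 0] *= 50  # exaggerate one feature's range to mimic unscaled inputs

# Put all features on a comparable scale before computing distances
X_scaled = StandardScaler().fit_transform(X_demo)

# Silhouette score for each candidate k (higher means better-separated clusters)
sil = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sil[k] = silhouette_score(X_scaled, labels)

best_k = max(sil, key=sil.get)
for k, score in sil.items():
    print(f'k={k}: silhouette={score:.3f}')
print('k with the highest silhouette score:', best_k)
```

When the elbow plot is ambiguous, the k with the highest silhouette score is a reasonable starting point, though cluster interpretability should still guide the final choice.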

In [None]:
# KMeans clustering on preprocessed features (use PCA to reduce dimensionality for visualization)
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# For unsupervised learning, fit preprocessor on full dataset (no data leakage concerns)
preproc_unsup = ColumnTransformer([('num', num_pipe, num_cols),
                                   ('cat', cat_pipe, cat_cols)], remainder='drop')
X_full_pre = preproc_unsup.fit_transform(X)

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_full_pre)

# Choose k with simple elbow method
inertia = []
K = range(2,8)
for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_full_pre)
    inertia.append(km.inertia_)

plt.figure(figsize=(6,3))
plt.plot(list(K), inertia, marker='o')
plt.title('Elbow Method for Optimal K Selection')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid(True, alpha=0.3)
plt.show()

# Fit k=3 as example (adjust based on elbow plot)
k = 3
km = KMeans(n_clusters=k, random_state=42, n_init=10)
clusters = km.fit_predict(X_full_pre)

plt.figure(figsize=(6,4))
plt.scatter(X_pca[:,0], X_pca[:,1], c=clusters, cmap='viridis', alpha=0.6)
plt.title('PCA Projection Colored by Cluster')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='Cluster')
plt.grid(True, alpha=0.3)
plt.show()

# Attach cluster labels and PCA coordinates to a copy of the features (rows stay in the same order as X)
df_clustered = X.copy()
df_clustered = df_clustered.reset_index(drop=True)
df_clustered['cluster'] = clusters
df_clustered['pca1'] = X_pca[:,0]
df_clustered['pca2'] = X_pca[:,1]

# Show cluster characteristics (numeric columns only; the grouping key itself is excluded)
print('\nCluster Characteristics (mean values for numeric features):')
numeric_cols_clustered = [c for c in df_clustered.select_dtypes(include=[np.number]).columns if c != 'cluster']
display(df_clustered.groupby('cluster')[numeric_cols_clustered].mean().T)

# Show cluster sizes
print('\nCluster Sizes:')
cluster_counts = df_clustered['cluster'].value_counts().sort_index()
display(cluster_counts)

# For categorical columns, show most common value per cluster
print('\nMost Common Categorical Values per Cluster:')
cat_cols_clustered = df_clustered.select_dtypes(include=['object', 'category']).columns.tolist()
if len(cat_cols_clustered) > 0:
    for col in cat_cols_clustered[:5]:  # Show first 5 categorical columns
        print(f'\n{col}:')
        display(df_clustered.groupby('cluster')[col].agg(lambda x: x.mode()[0] if len(x.mode()) > 0 else None))

### 5.2 Association Rule Mining (optional)
Find frequent itemsets and association rules from categorical features using mlxtend.

Note: mlxtend must be installed in your environment. If running on Colab, uncomment the pip install line.
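
To make the rule metrics concrete, the sketch below computes support, confidence, and lift by hand for a single rule, using plain pandas. The transactions are hypothetical; only the item naming (`column=value`) follows the convention used in this notebook.

```python
import pandas as pd

# Hypothetical one-hot transactions (True = item present in that row)
tx = pd.DataFrame({
    'internet=yes': [True, True, True, False, True],
    'higher=yes':   [True, True, False, False, True],
})

# Rule under test: internet=yes -> higher=yes
support_a = tx['internet=yes'].mean()                         # P(A)
support_b = tx['higher=yes'].mean()                           # P(B)
support_ab = (tx['internet=yes'] & tx['higher=yes']).mean()   # P(A and B)

confidence = support_ab / support_a   # P(B | A)
lift = confidence / support_b         # > 1 means A and B co-occur more often than if independent

print(f'support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}')
```

A lift above 1 only indicates co-occurrence beyond chance in the sample; it does not imply that one item causes the other.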

In [None]:
# Association rules using mlxtend
# Uncomment to install: !pip install mlxtend

try:
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Prepare transactions: take categorical columns and treat each row as a transaction of 'col=value'
    cat_for_rules = cat_cols[:8] if len(cat_cols) >= 1 else []
    transactions = []
    
    if len(cat_for_rules) > 0:
        for _, row in df[cat_for_rules].iterrows():
            # Only include non-null values
            tx = [f'{col}={row[col]}' for col in cat_for_rules if pd.notna(row[col])]
            transactions.append(tx)

        te = TransactionEncoder()
        te_ary = te.fit(transactions).transform(transactions)
        df_tf = pd.DataFrame(te_ary, columns=te.columns_)

        # Find frequent itemsets
        freq = apriori(df_tf, min_support=0.05, use_colnames=True)
        
        if len(freq) > 0:
            print('Top 20 Frequent Itemsets:')
            display(freq.sort_values('support', ascending=False).head(20))
            
            # Generate association rules
            if len(freq) > 1:
                rules = association_rules(freq, metric='lift', min_threshold=1.2)
                if len(rules) > 0:
                    print('\nTop 20 Association Rules by Lift:')
                    display(rules.sort_values('lift', ascending=False).head(20))
                else:
                    print('No association rules found with the given threshold.')
            else:
                print('Not enough frequent itemsets to generate rules.')
        else:
            print('No frequent itemsets found with min_support=0.05')
    else:
        print('Not enough categorical columns for association rules.')
        
except ImportError:
    print('mlxtend not installed. Run: pip install mlxtend')
except Exception as e:
    print(f'Error in association rule mining: {e}')

## 6) Visualizations & Reporting
Create concise, publication-ready visuals for the report and slides.

In [None]:
# Visualizations: ROC curves and Confusion Matrix
from sklearn.metrics import roc_curve, auc, ConfusionMatrixDisplay

if y is not None and len(results) > 0:
    # Plot ROC curves for all models
    plt.figure(figsize=(8, 6))
    for name, model in models.items():
        if hasattr(model, 'predict_proba'):
            y_proba = model.predict_proba(X_test_pre)[:,1]
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            roc_auc = auc(fpr, tpr)
            plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')
    
    plt.plot([0,1], [0,1], 'k--', label='Random Classifier')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Model Comparison')
    plt.legend(loc='lower right')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Find best model by ROC AUC
    best_model_name = None
    best_auc = -1
    for name, res in results.items():
        if 'roc_auc' in res and res['roc_auc'] > best_auc:
            best_auc = res['roc_auc']
            best_model_name = name
    
    # Plot confusion matrix for best model
    if best_model_name:
        print(f'\nConfusion Matrix for Best Model: {best_model_name}')
        model = models[best_model_name]
        y_pred = model.predict(X_test_pre)
        
        fig, ax = plt.subplots(figsize=(6, 5))
        ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax, cmap='Blues')
        plt.title(f'Confusion Matrix - {best_model_name}')
        plt.tight_layout()
        plt.show()
else:
    print('No models trained to visualize.')

## 7) Save cleaned/processed dataset
Save the cleaned dataset and any intermediate artifacts to disk for reproducibility.
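
Fitted models are another artifact worth persisting. A minimal sketch with `joblib` is shown below, under the assumption that a fitted scikit-learn pipeline should be reloadable later. The data is synthetic, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline (assumptions for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Write to a temporary directory here; in a project, use a stable artifacts folder
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_pipeline.joblib')  # hypothetical file name
    joblib.dump(demo_pipe, model_path)
    restored = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X_demo) == demo_pipe.predict(X_demo)).all())
print('Reloaded pipeline reproduces predictions:', same_predictions)
```

Saving the whole pipeline, rather than the model alone, keeps the preprocessing and the classifier in sync when the artifact is reused.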

In [None]:
# Save the processed DataFrame (`data`, which includes the 'pass' target) and the clustered results
if not df.empty:
    cleaned_path = 'cleaned_student_data.csv'
    data.to_csv(cleaned_path, index=False)
    print('Saved cleaned data to', cleaned_path)
    try:
        df_clustered.to_csv('clustered_student_data.csv', index=False)
        print('Saved clustered data to clustered_student_data.csv')
    except NameError:
        # df_clustered only exists if the clustering cell was run
        print('Clustering results not found; skipped clustered_student_data.csv')
else:
    print('No dataframe to save.')

## 8) Conclusion & Next Steps

### Key Findings

**Model Performance:**
- Compared three classification models: Logistic Regression, Random Forest, and Gradient Boosting
- Evaluated using multiple metrics: accuracy, precision, recall, F1-score, and ROC AUC
- Cross-validation used to obtain more reliable performance estimates than a single train/test split and to help detect overfitting

**Feature Importance:**
- Random Forest feature importances reveal which factors most strongly predict student success
- Use these insights to identify at-risk students early and target interventions

**Student Segmentation:**
- K-Means clustering identified distinct student groups based on demographic and behavioral characteristics
- Clusters can inform personalized intervention strategies

**Association Rules:**
- Discovered patterns in categorical features that frequently co-occur
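- The rules here are mined from categorical features only; relating them to academic performance would require adding an outcome item (e.g. pass/fail) to each transaction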

### Recommendations for Interventions

1. **Early Warning System**: Use predictive models to identify students at risk of failing
2. **Targeted Tutoring**: Focus resources on students with low study time, high absences, or past failures
3. **Parent Engagement**: Reach out to families of at-risk students, especially where family support is low
4. **Behavioral Interventions**: Address factors like excessive going out, alcohol consumption, and low study time

### Model Improvements & Next Steps

1. **Handle Class Imbalance**: If pass/fail classes are imbalanced, try:
   - SMOTE (Synthetic Minority Over-sampling)
   - Class weights in model training
   - Stratified sampling

2. **Feature Engineering**: Create interaction features or polynomial features
3. **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV for optimal parameters
4. **Ensemble Methods**: Combine multiple models for potentially better performance
5. **Fairness Analysis**: Evaluate model performance across demographic subgroups (sex, age, school)
6. **Temporal Analysis**: If longitudinal data available, track student progress over time
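
Two of these steps, class weights and hyperparameter tuning, can be combined in a single search. The sketch below is one possible setup on synthetic imbalanced data. The parameter grid and the `f1_macro` scorer are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): roughly 15% minority class
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.15, 0.85], random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Search over regularization strength and whether to reweight classes
param_grid = {
    'clf__C': [0.1, 1.0, 10.0],
    'clf__class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring='f1_macro',  # weighs both classes equally, unlike plain accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Best macro-F1: {search.best_score_:.3f}')
```

Keeping the scaler inside the pipeline means every candidate is evaluated without leaking validation-fold statistics into preprocessing.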

### Ethical Considerations

- Ensure model predictions don't perpetuate bias against certain demographic groups
- Use predictions to help students, not punish them
- Maintain student privacy and data security
- Validate model decisions with educators before taking action
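
One simple way to start checking for uneven performance across groups is to compute the same metric per subgroup. The sketch below does this for recall by `sex`. The labels and predictions are invented for illustration, and a real audit would use the held-out test set.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical true labels and model predictions (not real results)
audit = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'y_true': [1, 1, 0, 1, 1, 1, 0, 1],
    'y_pred': [1, 1, 0, 1, 1, 0, 0, 0],
})

# Recall per subgroup: share of truly passing students the model identifies
recall_by_group = {
    group: recall_score(part['y_true'], part['y_pred'], zero_division=0)
    for group, part in audit.groupby('sex')
}

for group, value in recall_by_group.items():
    print(f'{group}: recall={value:.2f}')
```

A large gap between groups, as in this toy example, would be a signal to investigate the features and training data before the model informs any decision.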

---

*This analysis provides a comprehensive data mining approach to student performance prediction using supervised learning (classification), unsupervised learning (clustering), and association rule mining.*