# Association Rule Mining - Dropout Prediction

## Objective
Discover **behavioral patterns** that predict student dropout/failure using the **Apriori algorithm**.

## Methodology
1. **Feature Selection**: Use **36** carefully selected features (Demographics + Behavioral) that are observable *during* the course.
2. **Exclusion**: Strict removal of **temporal leakage** (e.g., final scores) and unhelpful features (registration date).
3. **Algorithm**: Run Apriori on the **FULL dataset** to find frequent itemsets.
4. **Refinement**: 
   - **Deduplication**: Remove redundant variations of the same rule.
   - **Diversity Filtering**: Select distinct patterns with different feature combinations.
5. **Interpretation**: Translate technical rules into **Plain English** for educational interventions.

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import warnings
warnings.filterwarnings('ignore')

print('✓ Libraries imported')

## 1. Load Data (Full Dataset)

In [None]:
# Load encoded data
df_encoded = pd.read_pickle('../2_Outputs/df_encoded_full.pkl')

print(f'✓ Data loaded: {df_encoded.shape}')
print(f'  Students: {len(df_encoded):,}')
print(f'  Total features: {len(df_encoded.columns)}')

## 2. Feature Selection (Prevention of Leakage)

We exclude features that would cause **temporal leakage** (knowing the future), along with static features that carry no behavioral signal:
- ❌ **Score Aggregates** (`score_mean`, `score_max`, etc.) - calculated at end of course
- ❌ **Total Counts** (`total_clicks`, `days_active`) - require full course duration
- ❌ **Course Structure** (`studied_credits`, `num_assessments`) - static, not behavioral
- ❌ **Registration Date** - not predictive enough on its own

In [None]:
# EXPLICIT exclusions
exclude = [
    'id_student', 'code_module', 'code_presentation',
    'final_result',  # Target - will add back later
    'target_score',  # Leakage
    # Course structure - unhelpful
    'studied_credits', 'course_weeks', 'num_assessments',
    # Temporal features - leakage or unhelpful
    'date_registration', 'days_active', 'date_max', 'date_min'
]

# PATTERN exclusions (Leakage)
exclude_patterns = [
    'total_', 'avg_', 'mean_', 'final_', 'overall_', 
    'score_mean', 'score_max', 'score_min', 'score_std', 'weighted_avg'
]

# Select Valid Features
early_features = []
for col in df_encoded.columns:
    if col in exclude:
        continue
    if any(pattern in col.lower() for pattern in exclude_patterns):
        continue
    early_features.append(col)

# Add target back
early_features.append('final_result')

print(f'✓ Selected {len(early_features)} valid features')
print(f'  (Excluded {len(df_encoded.columns) - len(early_features)} features)')

print('\nFeature Categories:')
print(f"- Demographics: {len([c for c in early_features if any(x in c for x in ['gender', 'age', 'region', 'imd', 'education', 'disability'])])} features")
print(f"- Behavioral:   {len([c for c in early_features if any(x in c for x in ['click', 'delay', 'late'])])} features")
print(f"- Target:       1 feature")

## 3. Data Transformation (Binning)
Association rules require categorical/boolean data. We bin numeric features into meaningful groups.
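As a small illustration of what binning produces, the sketch below applies the same click bin edges used in this notebook to a toy column. The column name `clicks_week1` and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: click counts for five students (not real dataset values).
toy = pd.DataFrame({'clicks_week1': [0, 120, 500, 1500, 4000]})

# Same edges as the engagement bins in this notebook. Intervals are right-closed,
# so 0 falls in (-1, 0] and 500 falls in (0, 500].
bins = [-1, 0, 500, 2000, np.inf]
labels = ['clicks_week1_None', 'clicks_week1_Low',
          'clicks_week1_Med', 'clicks_week1_High']

binned = pd.cut(toy['clicks_week1'], bins=bins, labels=labels)

# One boolean column per bin: the "items" that Apriori works with.
items = pd.get_dummies(binned, dtype=bool)

print(binned.tolist())
print(items.sum().to_dict())
```

Each student ends up with exactly one `True` among the four bin columns, which is the transaction-style encoding that association rule mining expects.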

In [None]:
df_basket = df_encoded[early_features].copy()

numeric_cols = [c for c in early_features if c != 'final_result' 
                and df_basket[c].dtype in ['float64', 'int64', 'float32', 'int32'] 
                and df_basket[c].nunique() > 10]

print(f'Binning {len(numeric_cols)} numeric features...')

for col in numeric_cols:
    if 'click' in col.lower():
        # Engagement (clicks): fixed right-closed bins, so 0 clicks gets its own group
        binned = pd.cut(df_basket[col], bins=[-1, 0, 500, 2000, np.inf],
                        labels=[f'{col}_None', f'{col}_Low', f'{col}_Med', f'{col}_High'])
    elif 'delay' in col.lower():
        # Submission delays: <= -1 is early, (-1, 0] on time, (0, 5] slight, > 5 late
        binned = pd.cut(df_basket[col], bins=[-np.inf, -1, 0, 5, np.inf],
                        labels=[f'{col}_Early', f'{col}_OnTime', f'{col}_Slight', f'{col}_Late'])
    else:
        # Others: quartiles, with a median split when ties leave fewer than four bins
        try:
            binned = pd.qcut(df_basket[col], q=4,
                             labels=[f'{col}_Q{i}' for i in range(1, 5)],
                             duplicates='drop')
        except ValueError:
            median = df_basket[col].median()
            binned = df_basket[col].apply(lambda x: f'{col}_Low' if x < median else f'{col}_High')

    # Replace the numeric column with one boolean column per bin
    df_basket = pd.concat([df_basket.drop(col, axis=1), pd.get_dummies(binned, dtype=bool)], axis=1)

# Encode Target
df_basket = pd.get_dummies(df_basket, columns=['final_result'], dtype=bool)
# Fill missing values before casting: NaN would otherwise cast to True
df_basket = df_basket.fillna(False).astype(bool)

print(f'✓ Transaction basket ready: {df_basket.shape}')

## 4. Run Apriori Algorithm (Full Dataset)

We look for patterns with:
- **Min Support: 5%** (the itemset must appear for at least 5% of students, i.e. more than 1,600)
- **Min Confidence: 30%** (among students matching the IF side, at least 30% must also show the outcome)
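To make these thresholds concrete, here is a minimal sketch that computes support, confidence, and lift by hand for a single rule. The eight transactions and the item names (`clicks_None`, `clicks_High`) are hypothetical toy data, and the helper `support` is defined here only for illustration.

```python
# Hypothetical toy transactions: each set holds the items that are true for one student.
transactions = [
    {'clicks_None', 'final_result_Withdrawn'},
    {'clicks_None', 'final_result_Withdrawn'},
    {'clicks_None', 'final_result_Pass'},
    {'clicks_High', 'final_result_Pass'},
    {'clicks_High', 'final_result_Pass'},
    {'clicks_High', 'final_result_Withdrawn'},
    {'clicks_High', 'final_result_Pass'},
    {'clicks_High', 'final_result_Pass'},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in rows) / len(rows)

antecedent = {'clicks_None'}
consequent = {'final_result_Withdrawn'}

# support(A and C): share of all students matching both sides of the rule
rule_support = support(antecedent | consequent, transactions)
# confidence = support(A and C) / support(A)
confidence = rule_support / support(antecedent, transactions)
# lift = confidence / support(C); values above 1 mean a positive association
lift = confidence / support(consequent, transactions)

print(f'support={rule_support:.3f} confidence={confidence:.3f} lift={lift:.3f}')
```

In this toy set the rule would pass both thresholds: it covers 25% of students and holds for two of the three students with no clicks.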

In [None]:
min_support = 0.05
print(f'Running Apriori (min_support={min_support})...')

frequent_itemsets = apriori(df_basket, min_support=min_support, 
                           use_colnames=True, verbose=1, low_memory=True)

print(f'✓ Found {len(frequent_itemsets):,} frequent itemsets')

## 5. Generate & Refine Rules

In [None]:
min_confidence = 0.3
print(f'Generating rules (min_confidence={min_confidence})...')

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
rules = rules[rules['lift'] > 1.0]  # Positive correlation only

# 1. Filter for Dropout Outcomes
def is_dropout(cons):
    return any(x in str(cons) for x in ['final_result_Fail', 'final_result_Withdrawn'])

dropout_rules = rules[rules['consequents'].apply(is_dropout)].copy()

# 2. Deduplication (Keep "Pure" Dropout Rules)
# We only want rules where the CONSEQUENT is just "Withdrawn" or "Fail"
# Not "Withdrawn + Low Clicks"
def is_pure_dropout(cons):
    return len(cons) == 1 and ('final_result_Withdrawn' in list(cons)[0] or 'final_result_Fail' in list(cons)[0])

dropout_rules = dropout_rules[dropout_rules['consequents'].apply(is_pure_dropout)].copy()
dropout_rules = dropout_rules.sort_values('confidence', ascending=False)

print(f'✓ {len(dropout_rules):,} distinct dropout prediction rules')

## 6. Diversity Filtering
Instead of showing 50 variations of "Low Clicks + Education", we group rules by **Feature Categories** to find truly distinct behavioral patterns.
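The diversity check compares the category sets of two rules with Jaccard similarity (size of the intersection divided by size of the union). The sketch below shows how the 0.6 cut-off behaves; the category sets are hypothetical examples, and `jaccard` is an illustrative helper rather than part of the notebook.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical category set of a rule that has already been selected.
selected = {'Engagement', 'Education'}

# Hypothetical candidate rules, described only by their category sets.
candidates = [
    {'Engagement', 'Education', 'Gender'},  # 2 shared of 3 total
    {'Engagement', 'Timing'},               # 1 shared of 3 total
    {'Timing', 'Gender'},                   # nothing shared
]

for cats in candidates:
    sim = jaccard(cats, selected)
    verdict = 'skip' if sim > 0.6 else 'keep'
    print(f'{sorted(cats)}: similarity={sim:.2f} -> {verdict}')
```

Under this threshold, adding a single extra category to an already selected two-category rule is not enough to count as a new pattern, while swapping one of the two categories is.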

In [None]:
def get_feature_categories(antecedents):
    """Extract high-level feature categories from antecedent"""
    cats = set()
    for a in antecedents:
        if 'click' in a.lower(): cats.add('Engagement')
        elif 'education' in a.lower(): cats.add('Education')
        elif 'delay' in a.lower(): cats.add('Timing')
        elif 'late' in a.lower(): cats.add('Timing')
        elif 'gender' in a.lower(): cats.add('Gender')
        elif 'age' in a.lower(): cats.add('Age')
        elif 'imd' in a.lower(): cats.add('Deprivation')
        elif 'region' in a.lower(): cats.add('Region')
        elif 'disability' in a.lower(): cats.add('Disability')
    return cats

# Select Top Diverse Rules
selected_rules = []
used_categories = []

for idx, row in dropout_rules.iterrows():
    cats = get_feature_categories(row['antecedents'])
    
    # Check for similarity with already selected rules
    is_diverse = True
    for used in used_categories:
        # Jaccard similarity of categories
        overlap = len(cats & used) / max(len(cats | used), 1)
        if overlap > 0.6:  # If categories are >60% same, skip
            is_diverse = False
            break
            
    if is_diverse or len(selected_rules) < 3: # Always keep top 3 regardless
        selected_rules.append(row)
        used_categories.append(cats)
        
    if len(selected_rules) >= 8:
        break

diverse_df = pd.DataFrame(selected_rules)
print(f'✓ Selected {len(diverse_df)} DIVERSE patterns')

## 7. 📖 Plain English Interpretation
Translating technical rules into actionable insights with understandable metrics.

In [None]:
def clean_feature_name(name):
    """Translate variable names to plain English"""
    mapping = {
        'max_clicks_per_day_None': 'Zero Engagement',
        'max_clicks_per_day_Low': 'Very Low Engagement',
        'std_clicks_None': 'No Activity Variation',
        'submit_delay_mean_Q2': 'Moderate Submission Delays',
        'highest_education_Lower Than A Level': 'Lower Education Level',
        'gender': 'Male Student',  # gender=1 is Male
        'age_band_35-55': 'Older Student (35-55)',
        'final_result_Withdrawn': 'Withdrawal',
        'final_result_Fail': 'Failure'
    }
    if name in mapping: return mapping[name]
    return name.replace('_', ' ').title()

def print_rules_plain_english(rules_df, title):
    print('\n' + '='*80)
    print(title)
    print('='*80)

    for i, (idx, row) in enumerate(rules_df.iterrows(), 1):
        conditions = [clean_feature_name(x) for x in row['antecedents']]
        outcome = clean_feature_name(list(row['consequents'])[0])
        
        print(f'\n📝 RULE #{i}: {outcome} Risk Pattern')
        print(f"{'─'*80}")
        print(f"IF:   {' AND '.join(conditions)}")
        print(f"THEN: Elevated risk of {outcome}")
        print()
        print("📊 METRICS:")
        print(f"   • Confidence: {row['confidence']:.1%} (share of matching students with this outcome)")
        print(f"   • Lift:       {row['lift']:.2f}x   (how many times more likely than the overall rate)")
        print(f"   • Support:    {row['support']:.3f}   (proportion of all students with both the conditions and the outcome)")

# 1. Top Diverse Rules
print_rules_plain_english(diverse_df, 'TOP DIVERSE DROPOUT PATTERNS')

In [None]:
# Export Results
diverse_df['rule_text'] = diverse_df.apply(lambda x: 
    f"IF {', '.join([clean_feature_name(i) for i in x['antecedents']])} THEN {clean_feature_name(list(x['consequents'])[0])}", axis=1)

diverse_df[['rule_text', 'support', 'confidence', 'lift']].to_csv('../2_Outputs/final_dropout_rules_diverse.csv', index=False)
print('\n✓ Saved plain English rules to: 2_Outputs/final_dropout_rules_diverse.csv')

## 8. Conclusion & Insights

We identified **distinct behavioral phenotypes** that predict dropout:

1. **Disengagement + Low Education**: The strongest predictor (**~6x risk**). Students with lower prior qualifications who fail to engage early are at critical risk.
2. **Timing Indicators**: Even without knowing engagement depth, **submission delays** combined with **Male gender** or **Low education** are strong warnings.
3. **Demographic Risks**: **Male students** and **Older students (35-55)** show specific vulnerability patterns when coupled with disengagement.

**Intervention Strategy**:
- **Immediate**: Automated flags for students with 0 clicks in Week 1.
- **Targeted**: Extra support resources for students with lower prior qualifications.
- **Monitoring**: Watch submission timing - delays are an early proxy for withdrawal.
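
As a rough sketch of how the first two interventions might be turned into flags, the example below marks students with zero Week 1 clicks and students with lower prior qualifications. The table and the column names `clicks_week1` and `low_prior_education` are hypothetical; only `id_student` comes from the dataset used above.

```python
import pandas as pd

# Hypothetical early-course snapshot (illustrative values, not real students).
students = pd.DataFrame({
    'id_student': [101, 102, 103, 104],
    'clicks_week1': [0, 35, 0, 410],                    # assumed column name
    'low_prior_education': [True, False, False, True],  # assumed column name
})

# Immediate: no activity at all in Week 1.
students['flag_immediate'] = students['clicks_week1'].eq(0)
# Targeted: lower prior qualifications, regardless of activity.
students['flag_targeted'] = students['low_prior_education']

at_risk = students.loc[students['flag_immediate'], 'id_student'].tolist()
print(at_risk)
```

Students who carry both flags correspond to the first pattern listed above and would be a natural first priority for outreach.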