# PrepaData: Feature Engineering and Metrics Computation

## Overview

The PrepaData microservice processes normalized data from the LMS connector to compute aggregated metrics at the **student-module-presentation** level. These metrics serve as features for downstream tasks:

- **Student Profiling**: Clustering students based on learning patterns
- **Path Prediction**: Predicting student success using XGBoost
- **Recommendation System**: Building personalized learning paths

## What Metrics Are Computed?

For each combination of `(student_id, code_module, code_presentation)`, we compute:

1. **avg_score**: Weighted average of assessment scores
2. **completion_rate**: Ratio of completed assessments to total assessments (0.0 to 1.0)
3. **total_clicks**: Total number of VLE (Virtual Learning Environment) clicks
4. **active_days**: Number of unique days with recorded activity
5. **final_result**: Final course result (Pass/Fail/Withdrawn/Distinction)
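
The four numeric metrics above can be sketched with plain pandas group-bys. This is only an illustration on toy data, not the code inside `run_prepa_data_pipeline`: the column names `assessment_id`, `weight`, `score`, `date`, and `sum_click`, the function name `compute_metrics_sketch`, and the choice to weight `avg_score` by assessment weight over submitted assessments only are all assumptions.

```python
import pandas as pd

KEYS = ["student_id", "code_module", "code_presentation"]
MODULE_KEYS = ["code_module", "code_presentation"]


def compute_metrics_sketch(assessments, submissions, vle):
    """Hypothetical re-implementation of the numeric metrics on toy tables."""
    # Attach module, presentation, and weight to every submitted assessment.
    scored = submissions.merge(assessments, on="assessment_id")
    scored["weighted"] = scored["score"] * scored["weight"]

    per_student = scored.groupby(KEYS).agg(
        weighted_sum=("weighted", "sum"),
        weight_sum=("weight", "sum"),
        n_done=("assessment_id", "nunique"),
    ).reset_index()

    # Assumption: weight by assessment weight; NaN when all weights are zero.
    safe_weights = per_student["weight_sum"].where(per_student["weight_sum"] > 0)
    per_student["avg_score"] = per_student["weighted_sum"] / safe_weights

    # completion_rate = submitted assessments / assessments in the presentation.
    totals = (
        assessments.groupby(MODULE_KEYS)["assessment_id"]
        .nunique()
        .rename("n_total")
        .reset_index()
    )
    per_student = per_student.merge(totals, on=MODULE_KEYS)
    per_student["completion_rate"] = per_student["n_done"] / per_student["n_total"]

    # total_clicks sums click counts; active_days counts distinct activity days.
    activity = vle.groupby(KEYS).agg(
        total_clicks=("sum_click", "sum"),
        active_days=("date", "nunique"),
    ).reset_index()

    return per_student[KEYS + ["avg_score", "completion_rate"]].merge(
        activity, on=KEYS, how="outer"
    )


# Toy data (made up for illustration only).
assessments = pd.DataFrame({
    "assessment_id": [1, 2],
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "weight": [40, 60],
})
submissions = pd.DataFrame({
    "student_id": [1, 1, 2],
    "assessment_id": [1, 2, 1],
    "score": [80, 50, 70],
})
vle = pd.DataFrame({
    "student_id": [1, 1, 1, 2],
    "code_module": ["AAA"] * 4,
    "code_presentation": ["2013J"] * 4,
    "date": [0, 0, 2, 1],
    "sum_click": [5, 3, 4, 10],
})

metrics = compute_metrics_sketch(assessments, submissions, vle)
print(metrics)
```

In this toy example student 1 submits both assessments, so `completion_rate` is 1.0 and `avg_score` is (80·40 + 50·60) / 100 = 62.0; student 2 submits one of two, giving a `completion_rate` of 0.5.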

## Pipeline Position

```
01_LMSConnector → 02_PrepaData → 03_StudentProfiler → 04_PathPredictor → 05_RecoBuilder
     (Raw Data)    (Features)      (Clustering)        (ML Model)        (Recommendations)
```

## Input Data

This notebook expects the following normalized CSV files in `data/processed/`:

- `student_info_normalized.csv`
- `registrations_normalized.csv`
- `assessments_normalized.csv`
- `student_assessment_normalized.csv`
- `student_vle_normalized.csv`
- `vle_info_normalized.csv`
- `courses_normalized.csv`
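
Before running the pipeline, it can help to confirm that all of these inputs are present. The helper below is a minimal sketch; `find_missing_inputs` is a hypothetical name and is not part of `libs`.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "student_info_normalized.csv",
    "registrations_normalized.csv",
    "assessments_normalized.csv",
    "student_assessment_normalized.csv",
    "student_vle_normalized.csv",
    "vle_info_normalized.csv",
    "courses_normalized.csv",
]


def find_missing_inputs(processed_dir):
    """Return the expected file names that are absent from processed_dir."""
    processed_dir = Path(processed_dir)
    return [name for name in EXPECTED_FILES if not (processed_dir / name).is_file()]


missing = find_missing_inputs("data/processed")
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All expected input files are present.")
```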

## Output

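The pipeline writes `data/processed/student_module_metrics.csv`, which holds one row of aggregated metrics per `(student_id, code_module, code_presentation)` combination and is ready for the downstream ML stages.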
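
A few sanity checks on the resulting table can catch problems before they reach the downstream stages. The sketch below assumes the output has exactly the key and metric columns named in this notebook; `check_metrics_frame` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd

KEY_COLUMNS = ["student_id", "code_module", "code_presentation"]
METRIC_COLUMNS = ["avg_score", "completion_rate", "total_clicks",
                  "active_days", "final_result"]


def check_metrics_frame(df):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = [c for c in KEY_COLUMNS + METRIC_COLUMNS if c not in df.columns]
    if missing:
        # Remaining checks need these columns, so stop early.
        return [f"missing columns: {missing}"]
    if df.duplicated(subset=KEY_COLUMNS).any():
        problems.append("duplicate student-module-presentation keys")
    if not df["completion_rate"].dropna().between(0.0, 1.0).all():
        problems.append("completion_rate outside [0, 1]")
    for col in ("total_clicks", "active_days"):
        if (df[col].dropna() < 0).any():
            problems.append(f"negative values in {col}")
    return problems


# Toy frames for illustration only.
good = pd.DataFrame({
    "student_id": [1, 2],
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "avg_score": [62.0, 70.0],
    "completion_rate": [1.0, 0.5],
    "total_clicks": [12, 10],
    "active_days": [2, 1],
    "final_result": ["Pass", "Fail"],
})
bad = good.copy()
bad.loc[1, "completion_rate"] = 1.5  # deliberately out of range

print(check_metrics_frame(good))
print(check_metrics_frame(bad))
```

The same function can be applied to `metrics_df` once the pipeline cell further down has run.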


In [None]:
# Import the PrepaData pipeline
from libs.prepa_data import run_prepa_data_pipeline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")


In [None]:
# Run the complete PrepaData pipeline
metrics_df = run_prepa_data_pipeline()


In [None]:
# Display the first few rows
print("First 10 rows of computed metrics:")
print(metrics_df.head(10))
print(f"\nDataFrame shape: {metrics_df.shape}")
print(f"\nColumn names: {list(metrics_df.columns)}")
print("\nData types:")
print(metrics_df.dtypes)


In [None]:
# Display basic statistics
print("Summary Statistics:")
print(metrics_df.describe())


## Visualizations

Let's explore the distributions of key metrics:


In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Distribution of average scores
axes[0, 0].hist(metrics_df['avg_score'].dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Average Scores', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Average Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

# 2. Distribution of completion rates
axes[0, 1].hist(metrics_df['completion_rate'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_title('Distribution of Completion Rates', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Completion Rate')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

# 3. Distribution of total clicks (non-zero rows only; the y-axis is log-scaled below)
nonzero_clicks = metrics_df.loc[metrics_df['total_clicks'] > 0, 'total_clicks']
axes[1, 0].hist(nonzero_clicks, bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1, 0].set_title('Distribution of Total Clicks (Non-zero)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Total Clicks')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)

# 4. Distribution of active days
axes[1, 1].hist(metrics_df['active_days'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
axes[1, 1].set_title('Distribution of Active Days', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Active Days')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Correlation heatmap of numeric metrics
numeric_cols = ['avg_score', 'completion_rate', 'total_clicks', 'active_days']
correlation_matrix = metrics_df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Student-Module Metrics', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()


In [None]:
# Show final result distribution
if 'final_result' in metrics_df.columns:
    result_counts = metrics_df['final_result'].value_counts()
    
    plt.figure(figsize=(10, 6))
    result_counts.plot(kind='bar', color='steelblue', edgecolor='black', alpha=0.7)
    plt.title('Distribution of Final Results', fontsize=14, fontweight='bold')
    plt.xlabel('Final Result')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()
    
    print("\nFinal Result Distribution:")
    print(result_counts)
    print("\nPercentages:")
    print((result_counts / len(metrics_df) * 100).round(2))


In [None]:
# Display the output file path
from pathlib import Path
from libs.utils import get_data_paths

_, processed_dir = get_data_paths()
# Wrap in Path so the checks below work even if a plain string is returned
output_file = Path(processed_dir) / "student_module_metrics.csv"

print("=" * 60)
print("OUTPUT FILE LOCATION")
print("=" * 60)
print(f"\n✓ Generated file: {output_file}")
print(f"✓ File exists: {output_file.exists()}")
if output_file.exists():
    print(f"✓ File size: {output_file.stat().st_size / 1024:.2f} KB")
    print(f"✓ Total rows: {len(metrics_df)}")
print("\n" + "=" * 60)
