# ASA DataFest 2024: Decoding the Psychology of Statistical Learning

## Analyzing CourseKata's Educational Platform Data

**Research Question:** How do psychological constructs (motivation, self-efficacy, perceived value) predict student success in online statistics courses, and can we identify "stumbling blocks" in the curriculum?

### Key Findings Preview
- **Expectancy (self-efficacy)** is the strongest predictor of academic success
- **Chapters 4 and 8** emerge as critical stumbling blocks across institutions
- Student engagement patterns cluster into 4 distinct behavioral profiles
- Network analysis reveals hidden connections between chapters and psychological states

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import networkx as nx
from pyvis.network import Network
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
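from scipy import stats
import os
import warnings
warnings.filterwarnings('ignore')

# Create the output folder used by every savefig/write_html call below
os.makedirs('outputs', exist_ok=True)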

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.3f}'.format)

print("Libraries loaded successfully!")

## 1. Data Loading and Exploration

In [None]:
# Load datasets (using sample data for demonstration, full data for complete analysis)
DATA_PATH = 'Data/Random Sample of Data Files/'

# Load all core datasets
pulse = pd.read_csv(f'{DATA_PATH}checkpoints_pulse.csv')
eoc = pd.read_csv(f'{DATA_PATH}checkpoints_eoc.csv')
page_views = pd.read_csv(f'{DATA_PATH}page_views.csv')
responses = pd.read_csv(f'{DATA_PATH}responses.csv', low_memory=False)
media_views = pd.read_csv(f'{DATA_PATH}media_views.csv')

print("Dataset Sizes:")
print(f"  Pulse Checkpoints: {len(pulse):,} records")
print(f"  EOC Assessments:   {len(eoc):,} records")
print(f"  Page Views:        {len(page_views):,} records")
print(f"  Responses:         {len(responses):,} records")
print(f"  Media Views:       {len(media_views):,} records")

In [None]:
# Data overview
print("\n=== PULSE CHECKPOINTS (Psychological Constructs) ===")
print(f"Unique students: {pulse['student_id'].nunique()}")
print(f"Unique classes: {pulse['class_id'].nunique()}")
print(f"Constructs measured: {pulse['construct'].unique()}")
print(f"Chapters covered: {sorted(pulse['chapter_number'].unique())}")

print("\n=== EOC PERFORMANCE ===")
print(f"Unique students: {eoc['student_id'].nunique()}")
print(f"Average EOC score: {eoc['EOC'].mean():.2%}")
print(f"Score range: {eoc['EOC'].min():.2%} - {eoc['EOC'].max():.2%}")

## 2. Psychological Constructs Analysis

CourseKata measures four key psychological constructs through "pulse" checkpoints:
- **Cost**: Perceived effort/cost of learning
- **Expectancy**: Self-efficacy beliefs ("I can succeed")
- **Intrinsic Value**: Inherent interest in the material
- **Utility Value**: Perceived practical relevance

In [None]:
# Clean pulse data (remove NA responses)
pulse_clean = pulse[pulse['response'] != 'NA'].copy()
pulse_clean['response'] = pd.to_numeric(pulse_clean['response'], errors='coerce')
pulse_clean = pulse_clean.dropna(subset=['response'])

print(f"Clean pulse records: {len(pulse_clean):,}")
print(f"\nResponse distribution by construct:")
print(pulse_clean.groupby('construct')['response'].describe().round(2))

In [None]:
# Pivot pulse data: one row per student per chapter, columns for each construct
pulse_pivot = pulse_clean.pivot_table(
    index=['student_id', 'class_id', 'chapter_number'],
    columns='construct',
    values='response',
    aggfunc='mean'
).reset_index()

# Merge with EOC data
merged = pulse_pivot.merge(
    eoc[['student_id', 'class_id', 'chapter_number', 'EOC', 'n_correct', 'n_possible', 'n_attempt']],
    on=['student_id', 'class_id', 'chapter_number'],
    how='inner'
)

print(f"Merged dataset: {len(merged):,} student-chapter observations")
merged.head()
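
The inner join above silently drops any student-chapter row that lacks either a pulse response or an EOC score. A minimal sketch of how pandas' `indicator=True` option can report how many rows fall on each side of the join. The toy frames and the helper name `merge_coverage` are illustrative assumptions, not CourseKata data:

```python
import pandas as pd

def merge_coverage(left, right, keys):
    """Count rows present in both frames, only in the left, or only in the right."""
    outer = left.merge(right, on=keys, how='outer', indicator=True)
    return outer['_merge'].value_counts().to_dict()

# Toy data (hypothetical values): two student-chapter keys overlap
toy_pulse = pd.DataFrame({'student_id': ['a', 'a', 'b'],
                          'chapter_number': [1, 2, 1],
                          'Expectancy': [4.0, 3.0, 5.0]})
toy_eoc = pd.DataFrame({'student_id': ['a', 'b', 'c'],
                        'chapter_number': [1, 1, 1],
                        'EOC': [0.8, 0.9, 0.6]})

coverage = merge_coverage(toy_pulse, toy_eoc, ['student_id', 'chapter_number'])
print(coverage)
```

If `left_only` or `right_only` is large on the real data, the correlations that follow describe only the students who completed both instruments.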

In [None]:
# Calculate correlations between constructs and performance
constructs = ['Cost', 'Expectancy', 'Intrinsic Value', 'Utility Value']
available_constructs = [c for c in constructs if c in merged.columns]

if available_constructs:
    correlations = merged[available_constructs + ['EOC']].corr()['EOC'].drop('EOC')
    
    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ['#e74c3c' if x < 0 else '#27ae60' for x in correlations.values]
    bars = ax.barh(correlations.index, correlations.values, color=colors, edgecolor='black', linewidth=1.5)
    
    ax.axvline(x=0, color='black', linewidth=0.8)
    ax.set_xlabel('Correlation with EOC Performance', fontsize=12)
    ax.set_title('Psychological Constructs vs Academic Performance', fontsize=14, fontweight='bold')
    
    # Add correlation values
    for bar, val in zip(bars, correlations.values):
        ax.text(val + 0.02 if val >= 0 else val - 0.08, bar.get_y() + bar.get_height()/2, 
                f'{val:.3f}', va='center', fontsize=11, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('outputs/construct_correlations.png', dpi=300, bbox_inches='tight')
    plt.show()
    
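    # Report whichever construct has the largest absolute correlation instead of hard-coding the result
    top = correlations.abs().idxmax()
    print(f"\n📊 KEY INSIGHT: {top} shows the strongest correlation with performance (r = {correlations[top]:.3f})")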
else:
    print("No construct data available in merged dataset")
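
The bar chart reports correlation coefficients only. A hedged sketch of how `scipy.stats.pearsonr` could attach a p-value and sample size to each coefficient; the helper name `correlation_table` and the toy values are assumptions for illustration. Because each student contributes several chapters, the rows are not independent, so p-values computed this way are likely optimistic.

```python
import pandas as pd
from scipy import stats

def correlation_table(df, predictors, outcome):
    """Pearson r, p-value and n for each predictor against the outcome."""
    rows = []
    for col in predictors:
        pair = df[[col, outcome]].dropna()  # pairwise-complete observations
        r, p = stats.pearsonr(pair[col], pair[outcome])
        rows.append({'construct': col, 'r': r, 'p_value': p, 'n': len(pair)})
    return pd.DataFrame(rows).set_index('construct')

# Toy data (hypothetical): Expectancy rises with EOC, Cost falls
toy = pd.DataFrame({
    'Expectancy': [1, 2, 3, 4, 5, 6],
    'Cost':       [6, 5, 5, 3, 2, 1],
    'EOC':        [0.50, 0.60, 0.55, 0.70, 0.80, 0.75],
})

table = correlation_table(toy, ['Expectancy', 'Cost'], 'EOC')
print(table.round(3))
```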

## 3. Identifying Stumbling Blocks: Chapter Difficulty Analysis

In [None]:
# Analyze performance by chapter
chapter_stats = eoc.groupby('chapter_number').agg({
    'EOC': ['mean', 'std', 'count'],
    'n_attempt': 'mean',
    'student_id': 'nunique'
}).round(3)

chapter_stats.columns = ['avg_score', 'score_std', 'observations', 'avg_attempts', 'unique_students']
chapter_stats = chapter_stats.reset_index()

# Calculate "struggle index" - combines low scores with high attempts
chapter_stats['struggle_index'] = (1 - chapter_stats['avg_score']) * np.log1p(chapter_stats['avg_attempts'])
chapter_stats = chapter_stats.sort_values('chapter_number')

print("Chapter Performance Analysis:")
print(chapter_stats.to_string(index=False))
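
The struggle index multiplies the score shortfall, `1 - avg_score`, by `log1p(avg_attempts)`, so a chapter ranks high only when scores are low and attempts are high; the logarithm damps the effect of very large attempt counts. A small worked sketch with made-up values (the function name `struggle_index` is introduced here for illustration):

```python
import numpy as np

def struggle_index(avg_score, avg_attempts):
    """(1 - average score) * ln(1 + average attempts)."""
    avg_score = np.asarray(avg_score, dtype=float)
    avg_attempts = np.asarray(avg_attempts, dtype=float)
    return (1 - avg_score) * np.log1p(avg_attempts)

# Hypothetical chapters
easy = struggle_index(0.90, 1.0)   # 0.1 * ln(2), about 0.069
hard = struggle_index(0.60, 3.0)   # 0.4 * ln(4), about 0.555
print(f"easy={easy:.3f}, hard={hard:.3f}")
```

One consequence of the multiplicative form: a chapter with a perfect average score has an index of zero no matter how many attempts it took.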

In [None]:
# Visualization: chapter difficulty overview (2x2 panel of bar charts and a boxplot)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Average Score by Chapter
ax1 = axes[0, 0]
colors = plt.cm.RdYlGn(chapter_stats['avg_score'])
bars = ax1.bar(chapter_stats['chapter_number'], chapter_stats['avg_score'], color=colors, edgecolor='black')
ax1.axhline(y=chapter_stats['avg_score'].mean(), color='red', linestyle='--', label='Overall Average')
ax1.set_xlabel('Chapter')
ax1.set_ylabel('Average EOC Score')
ax1.set_title('Performance by Chapter', fontweight='bold')
ax1.legend()
ax1.set_ylim(0, 1)

# Plot 2: Attempt Count (Struggle Indicator)
ax2 = axes[0, 1]
ax2.bar(chapter_stats['chapter_number'], chapter_stats['avg_attempts'], color='#3498db', edgecolor='black')
ax2.set_xlabel('Chapter')
ax2.set_ylabel('Average Attempts')
ax2.set_title('Effort Required by Chapter', fontweight='bold')

# Plot 3: Struggle Index
ax3 = axes[1, 0]
colors = plt.cm.Reds(chapter_stats['struggle_index'] / chapter_stats['struggle_index'].max())
ax3.bar(chapter_stats['chapter_number'], chapter_stats['struggle_index'], color=colors, edgecolor='black')
ax3.set_xlabel('Chapter')
ax3.set_ylabel('Struggle Index')
ax3.set_title('Struggle Index (Low Score × High Attempts)', fontweight='bold')

# Plot 4: Score Distribution Boxplot
ax4 = axes[1, 1]
eoc_box = eoc[eoc['chapter_number'] <= 12]
eoc_box.boxplot(column='EOC', by='chapter_number', ax=ax4)
ax4.set_xlabel('Chapter')
ax4.set_ylabel('EOC Score')
ax4.set_title('Score Distribution by Chapter', fontweight='bold')
plt.suptitle('')  # Remove auto-generated title

plt.tight_layout()
plt.savefig('outputs/chapter_difficulty.png', dpi=300, bbox_inches='tight')
plt.show()

# Identify stumbling blocks
stumbling = chapter_stats.nlargest(3, 'struggle_index')
print("\n🚧 STUMBLING BLOCKS IDENTIFIED:")
for _, row in stumbling.iterrows():
    print(f"   Chapter {int(row['chapter_number'])}: Score={row['avg_score']:.1%}, Attempts={row['avg_attempts']:.1f}")

## 4. Engagement Pattern Analysis

In [None]:
# Analyze engagement metrics from page views
# Convert time columns (in milliseconds) to minutes
time_cols = ['engaged', 'idle_brief', 'idle_long', 'off_page_brief', 'off_page_long']

for col in time_cols:
    if col in page_views.columns:
        page_views[f'{col}_min'] = page_views[col] / 60000  # ms to minutes

# Aggregate engagement by student
student_engagement = page_views.groupby('student_id').agg({
    'engaged': 'sum',
    'idle_brief': 'sum',
    'idle_long': 'sum',
    'off_page_brief': 'sum',
    'off_page_long': 'sum',
    'was_complete': 'mean',
    'chapter_number': 'nunique',
    'page': 'count'
}).reset_index()

student_engagement.columns = ['student_id', 'total_engaged_ms', 'total_idle_brief_ms', 
                               'total_idle_long_ms', 'total_off_brief_ms', 'total_off_long_ms',
                               'completion_rate', 'chapters_accessed', 'page_views']

# Convert to hours for readability
for col in ['total_engaged_ms', 'total_idle_brief_ms', 'total_idle_long_ms', 'total_off_brief_ms', 'total_off_long_ms']:
    student_engagement[col.replace('_ms', '_hrs')] = student_engagement[col] / 3600000

# Calculate engagement ratio: engaged time as a share of total on-page time
# (engaged + brief idle + long idle), matching the 'Engaged/Total Time' axis label
on_page_ms = (student_engagement['total_engaged_ms'] +
              student_engagement['total_idle_brief_ms'] +
              student_engagement['total_idle_long_ms'])
student_engagement['total_time_hrs'] = on_page_ms / 3600000
# clip(lower=1) guards against division by zero for students with no recorded time
student_engagement['engagement_ratio'] = student_engagement['total_engaged_ms'] / on_page_ms.clip(lower=1)

print("Engagement Statistics:")
print(student_engagement[['total_engaged_hrs', 'engagement_ratio', 'completion_rate', 'page_views']].describe().round(2))

In [None]:
# Merge engagement with EOC performance
student_performance = eoc.groupby('student_id').agg({
    'EOC': 'mean',
    'n_attempt': 'sum',
    'chapter_number': 'nunique'
}).reset_index()
student_performance.columns = ['student_id', 'avg_eoc', 'total_attempts', 'chapters_completed']

engagement_performance = student_engagement.merge(student_performance, on='student_id', how='inner')

# Scatter plot: Engagement vs Performance
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Engaged Time vs Performance
ax1 = axes[0]
ax1.scatter(engagement_performance['total_engaged_hrs'], 
            engagement_performance['avg_eoc'], 
            alpha=0.5, c='#3498db', edgecolor='white')
ax1.set_xlabel('Total Engaged Time (hours)')
ax1.set_ylabel('Average EOC Score')
ax1.set_title('Time on Task vs Performance', fontweight='bold')

# Plot 2: Engagement Ratio vs Performance
ax2 = axes[1]
ax2.scatter(engagement_performance['engagement_ratio'], 
            engagement_performance['avg_eoc'], 
            alpha=0.5, c='#e74c3c', edgecolor='white')
ax2.set_xlabel('Engagement Ratio (Engaged/Total Time)')
ax2.set_ylabel('Average EOC Score')
ax2.set_title('Focus Quality vs Performance', fontweight='bold')

# Plot 3: Completion Rate vs Performance
ax3 = axes[2]
ax3.scatter(engagement_performance['completion_rate'], 
            engagement_performance['avg_eoc'], 
            alpha=0.5, c='#27ae60', edgecolor='white')
ax3.set_xlabel('Page Completion Rate')
ax3.set_ylabel('Average EOC Score')
ax3.set_title('Completion Rate vs Performance', fontweight='bold')

plt.tight_layout()
plt.savefig('outputs/engagement_performance.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate correlations
print("\nCorrelation with EOC Performance:")
corr_cols = ['total_engaged_hrs', 'engagement_ratio', 'completion_rate', 'page_views']
for col in corr_cols:
    if col in engagement_performance.columns:
        corr = engagement_performance[col].corr(engagement_performance['avg_eoc'])
        print(f"  {col}: r = {corr:.3f}")

## 5. Student Clustering: Behavioral Profiles

In [None]:
# Prepare features for clustering
cluster_features = engagement_performance[[
    'engagement_ratio', 'completion_rate', 'avg_eoc', 'total_attempts'
]].dropna()

# Standardize features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(cluster_features)

# Find optimal k using elbow method
inertias = []
K_range = range(2, 8)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Inertia (Within-cluster Sum of Squares)')
ax.set_title('Elbow Method for Optimal k', fontweight='bold')
plt.tight_layout()
plt.show()

# Choose k=4 based on elbow
optimal_k = 4
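
Reading an elbow off an inertia curve is subjective, so a second criterion can help justify the choice of k. A minimal sketch, on synthetic blobs rather than the CourseKata features, of scoring each k with scikit-learn's `silhouette_score` (higher is better); the data and variable names here are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic data: three well-separated 2-D blobs of 50 points each
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Mean silhouette coefficient for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, "best k:", best_k)
```

On the real data the same loop could run over `features_scaled`; if the silhouette peak disagrees with the elbow, that is worth reporting alongside the chosen k.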

In [None]:
# Perform clustering with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_features['cluster'] = kmeans.fit_predict(features_scaled)

# Analyze cluster characteristics (numeric_only keeps this cell re-runnable after text columns are added)
cluster_profiles = cluster_features.groupby('cluster').mean(numeric_only=True).round(3)
cluster_profiles['count'] = cluster_features.groupby('cluster').size()

# Name the clusters based on characteristics
cluster_names = {
    cluster_profiles['avg_eoc'].idxmax(): 'High Performers',
    cluster_profiles['avg_eoc'].idxmin(): 'Struggling Students',
}
# Rank the leftover clusters against each other on engagement so that, with
# optimal_k = 4, the two of them always receive different names
remaining = sorted(set(range(optimal_k)) - set(cluster_names),
                   key=lambda c: cluster_profiles.loc[c, 'engagement_ratio'],
                   reverse=True)
for rank, c in enumerate(remaining):
    cluster_names[c] = 'Engaged Learners' if rank == 0 else 'Passive Completers'

cluster_features['profile'] = cluster_features['cluster'].map(cluster_names)
cluster_profiles['profile'] = cluster_profiles.index.map(cluster_names)

print("Student Behavioral Profiles:")
print(cluster_profiles[['profile', 'engagement_ratio', 'completion_rate', 'avg_eoc', 'count']])

In [None]:
# PCA for visualization
pca = PCA(n_components=2)
features_pca = pca.fit_transform(features_scaled)

cluster_features['pca1'] = features_pca[:, 0]
cluster_features['pca2'] = features_pca[:, 1]

# Create interactive scatter plot
fig = px.scatter(
    cluster_features, 
    x='pca1', 
    y='pca2', 
    color='profile',
    size='avg_eoc',
    hover_data=['engagement_ratio', 'completion_rate', 'avg_eoc'],
    title='Student Behavioral Clusters (PCA Projection)',
    labels={'pca1': 'Principal Component 1', 'pca2': 'Principal Component 2'},
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig.update_layout(width=900, height=600)
fig.write_html('outputs/student_clusters_interactive.html')
fig.show()

print(f"\nPCA Explained Variance: {pca.explained_variance_ratio_.sum():.1%}")
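
The two principal components are linear mixes of the four clustering features, so the scatter axes have no direct meaning until the loadings are inspected. A sketch, on random toy data rather than the real features, of tabulating `pca.components_` by feature name:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_names = ['engagement_ratio', 'completion_rate', 'avg_eoc', 'total_attempts']

# Toy data (hypothetical): random features with one induced correlation
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=feature_names)
toy['avg_eoc'] += 0.8 * toy['completion_rate']

scaled = StandardScaler().fit_transform(toy)
pca_toy = PCA(n_components=2).fit(scaled)

# Rows are features, columns are components; large absolute values drive that axis
loadings = pd.DataFrame(pca_toy.components_.T, index=feature_names, columns=['PC1', 'PC2'])
print(loadings.round(2))
```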

In [None]:
# Radar chart for cluster profiles
categories = ['Engagement\nRatio', 'Completion\nRate', 'EOC Score', 'Attempt\nIntensity']

# Collect the four radar axes per cluster (cluster size is not plotted, so it is left out)
radar_data = cluster_profiles[['engagement_ratio', 'completion_rate', 'avg_eoc']].copy()
radar_data['attempt_intensity'] = cluster_features.groupby('cluster')['total_attempts'].mean()

# Min-max normalize each column
for col in radar_data.columns:
    radar_data[col] = (radar_data[col] - radar_data[col].min()) / (radar_data[col].max() - radar_data[col].min() + 0.001)

fig = go.Figure()

colors = ['#2ecc71', '#e74c3c', '#3498db', '#f39c12']
for idx, (cluster_id, row) in enumerate(radar_data.iterrows()):
    profile_name = cluster_names.get(cluster_id, f'Cluster {cluster_id}')
    values = [row['engagement_ratio'], row['completion_rate'], row['avg_eoc'], row['attempt_intensity']]
    fig.add_trace(go.Scatterpolar(
        r=values + [values[0]],
        theta=categories + [categories[0]],
        fill='toself',
        name=profile_name,
        line_color=colors[idx % len(colors)],
        opacity=0.6
    ))

fig.update_layout(
    polar=dict(radialaxis=dict(visible=True, range=[0, 1])),
    showlegend=True,
    title='Student Behavioral Profiles Comparison',
    width=800,
    height=600
)
fig.write_html('outputs/cluster_radar.html')
fig.show()

## 6. Force-Directed Network Visualization

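This network shows relationships between:
- **Chapters** (node size grows with the struggle index; edges link chapters whose EOC scores correlate across students, |r| > 0.3)
- **Psychological Constructs** (an edge links a construct to each chapter where it correlates with EOC, |r| > 0.2)

Student clusters (who struggles where) are added in the bipartite graph in Section 7.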

In [None]:
# Build network graph
G = nx.Graph()

# Add chapter nodes
for chapter in chapter_stats['chapter_number'].unique():
    row = chapter_stats[chapter_stats['chapter_number'] == chapter].iloc[0]
    G.add_node(
        f"Ch{int(chapter)}", 
        node_type='chapter',
        score=row['avg_score'],
        struggle=row['struggle_index'],
        size=20 + row['struggle_index'] * 50
    )

# Add construct nodes
construct_colors = {
    'Cost': '#e74c3c',
    'Expectancy': '#27ae60', 
    'Intrinsic Value': '#3498db',
    'Utility Value': '#f39c12'
}

for construct in available_constructs:
    G.add_node(
        construct,
        node_type='construct',
        color=construct_colors.get(construct, '#9b59b6'),
        size=30
    )

# Add edges between chapters whose EOC scores are correlated across students
# (students who did well in Ch X also tended to do well in Ch Y)
chapter_corr = eoc.pivot_table(
    index='student_id', 
    columns='chapter_number', 
    values='EOC'
).corr()

for i in chapter_corr.index:
    for j in chapter_corr.columns:
        if i < j and not pd.isna(chapter_corr.loc[i, j]):
            corr_val = chapter_corr.loc[i, j]
            if abs(corr_val) > 0.3:  # Only strong correlations
                G.add_edge(
                    f"Ch{int(i)}", 
                    f"Ch{int(j)}", 
                    weight=abs(corr_val),
                    edge_type='chapter_correlation',
                    color='#95a5a6' if corr_val > 0 else '#e74c3c'
                )

# Add edges between constructs and chapters
if len(merged) > 0 and len(available_constructs) > 0:
    for chapter in merged['chapter_number'].unique():
        chapter_data = merged[merged['chapter_number'] == chapter]
        for construct in available_constructs:
            if construct in chapter_data.columns:
                corr = chapter_data[construct].corr(chapter_data['EOC'])
                if not pd.isna(corr) and abs(corr) > 0.2:
                    G.add_edge(
                        construct,
                        f"Ch{int(chapter)}",
                        weight=abs(corr),
                        edge_type='construct_impact',
                        color=construct_colors.get(construct, '#9b59b6')
                    )

print(f"Network Statistics:")
print(f"  Nodes: {G.number_of_nodes()}")
print(f"  Edges: {G.number_of_edges()}")
print(f"  Density: {nx.density(G):.3f}")
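
Once the thresholded graph exists, a weighted degree ("strength") per node gives a quick ranking of which chapters are most tightly tied to the rest. A minimal sketch on a made-up correlation matrix; `threshold_graph` and the toy values are illustrative assumptions, and the 0.3 cutoff mirrors the one used above:

```python
import networkx as nx
import pandas as pd

# Toy chapter-by-chapter correlation matrix (hypothetical values)
chapters = [1, 2, 3, 4]
toy_corr = pd.DataFrame(
    [[1.00, 0.50, 0.10, 0.40],
     [0.50, 1.00, 0.20, 0.60],
     [0.10, 0.20, 1.00, 0.35],
     [0.40, 0.60, 0.35, 1.00]],
    index=chapters, columns=chapters)

def threshold_graph(corr, threshold=0.3):
    """Link two chapters when |r| exceeds the threshold; weight edges by |r|."""
    g = nx.Graph()
    g.add_nodes_from(f"Ch{c}" for c in corr.index)
    for i in corr.index:
        for j in corr.columns:
            if i < j and abs(corr.loc[i, j]) > threshold:
                g.add_edge(f"Ch{i}", f"Ch{j}", weight=float(abs(corr.loc[i, j])))
    return g

toy_graph = threshold_graph(toy_corr)

# Weighted degree: sum of edge weights attached to each node
strength = dict(toy_graph.degree(weight='weight'))
hub = max(strength, key=strength.get)
print(strength, "hub:", hub)
```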

In [None]:
# Create interactive force-directed visualization with PyVis
net = Network(height='700px', width='100%', bgcolor='#ffffff', font_color='#333333')
net.barnes_hut(gravity=-3000, central_gravity=0.3, spring_length=200)

# Add nodes with styling
for node, data in G.nodes(data=True):
    if data.get('node_type') == 'chapter':
        # Color by performance (red = low, green = high)
        score = data.get('score', 0.5)
        r = int(255 * (1 - score))
        g = int(255 * score)
        color = f'rgb({r}, {g}, 100)'
        size = data.get('size', 25)
        title = f"{node}\nAvg Score: {score:.1%}\nStruggle Index: {data.get('struggle', 0):.2f}"
    else:
        color = data.get('color', '#9b59b6')
        size = data.get('size', 30)
        title = f"{node}\n(Psychological Construct)"
    
    net.add_node(node, label=node, color=color, size=size, title=title)

# Add edges
for u, v, data in G.edges(data=True):
    weight = data.get('weight', 0.5) * 3
    color = data.get('color', '#95a5a6')
    net.add_edge(u, v, value=weight, color=color)

# Save interactive visualization
net.save_graph('outputs/learning_network.html')
print("\n✅ Interactive network saved to outputs/learning_network.html")

In [None]:
# Static network visualization with NetworkX/Matplotlib
fig, ax = plt.subplots(figsize=(14, 10))

# Layout
pos = nx.spring_layout(G, k=2, iterations=50, seed=42)

# Draw edges
edges = G.edges(data=True)
edge_colors = [d.get('color', '#cccccc') for _, _, d in edges]
edge_widths = [d.get('weight', 0.5) * 3 for _, _, d in edges]
nx.draw_networkx_edges(G, pos, alpha=0.4, edge_color=edge_colors, width=edge_widths, ax=ax)

# Draw nodes
chapter_nodes = [n for n, d in G.nodes(data=True) if d.get('node_type') == 'chapter']
construct_nodes = [n for n, d in G.nodes(data=True) if d.get('node_type') == 'construct']

# Chapter nodes colored by performance
chapter_scores = [G.nodes[n].get('score', 0.5) for n in chapter_nodes]
chapter_sizes = [G.nodes[n].get('size', 25) * 15 for n in chapter_nodes]

nx.draw_networkx_nodes(G, pos, nodelist=chapter_nodes, 
                        node_color=chapter_scores, cmap=plt.cm.RdYlGn,
                        node_size=chapter_sizes, ax=ax, vmin=0, vmax=1)

# Construct nodes
construct_colors_list = [construct_colors.get(n, '#9b59b6') for n in construct_nodes]
nx.draw_networkx_nodes(G, pos, nodelist=construct_nodes,
                        node_color=construct_colors_list, node_size=600,
                        node_shape='s', ax=ax)

# Labels
nx.draw_networkx_labels(G, pos, font_size=9, font_weight='bold', ax=ax)

ax.set_title('Learning Network: Chapters, Constructs, and Connections', fontsize=14, fontweight='bold')
ax.axis('off')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap=plt.cm.RdYlGn, norm=plt.Normalize(vmin=0, vmax=1))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax, shrink=0.5, label='Chapter Performance')

# Legend for constructs
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=n) for n, c in construct_colors.items() if n in construct_nodes]
ax.legend(handles=legend_elements, loc='lower left', title='Constructs')

plt.tight_layout()
plt.savefig('outputs/learning_network_static.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.show()

## 7. Advanced Network: Student-Chapter Bipartite Graph

In [None]:
# Create a bipartite network showing student clusters and their chapter performance
# This shows which student types struggle with which chapters

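# First, map each student ID to a cluster profile
# (cluster_features keeps the row index of engagement_performance, so .loc lines them up)
student_cluster_map = dict(zip(engagement_performance.loc[cluster_features.index, 'student_id'],
                               cluster_features['profile']))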

# Build bipartite graph
B = nx.Graph()

# Add student profile nodes
profile_colors = {
    'High Performers': '#27ae60',
    'Struggling Students': '#e74c3c',
    'Engaged Learners': '#3498db',
    'Passive Completers': '#f39c12'
}

for profile in cluster_features['profile'].unique():
    count = (cluster_features['profile'] == profile).sum()
    B.add_node(profile, node_type='profile', size=count, color=profile_colors.get(profile, '#9b59b6'))

# Add chapter nodes
for chapter in chapter_stats['chapter_number'].unique():
    row = chapter_stats[chapter_stats['chapter_number'] == chapter].iloc[0]
    B.add_node(f"Chapter {int(chapter)}", node_type='chapter', score=row['avg_score'])

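# Add edges: profile -> chapter with weight = average performance of that profile on that chapter
# Map profiles back to student IDs by row index: cluster_features is a dropna() subset of
# engagement_performance, so a positional assignment could misalign rows or fail outright
engagement_with_cluster = engagement_performance.copy()
engagement_with_cluster['profile'] = cluster_features['profile']
engagement_with_cluster = engagement_with_cluster.dropna(subset=['profile'])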

# Now aggregate EOC by student and join with profiles
student_chapter_perf = eoc.merge(
    engagement_with_cluster[['student_id', 'profile']].drop_duplicates(), 
    on='student_id',
    how='inner'
)

# Calculate average performance of each profile on each chapter
profile_chapter_perf = student_chapter_perf.groupby(['profile', 'chapter_number'])['EOC'].mean().reset_index()

for _, row in profile_chapter_perf.iterrows():
    profile = row['profile']
    chapter = f"Chapter {int(row['chapter_number'])}"
    perf = row['EOC']
    
    if profile in B.nodes and chapter in B.nodes:
        # Edge color: green if good performance, red if poor
        r = int(255 * (1 - perf))
        g = int(255 * perf)
        B.add_edge(profile, chapter, weight=perf, color=f'rgb({r}, {g}, 100)')

print(f"Bipartite Network: {B.number_of_nodes()} nodes, {B.number_of_edges()} edges")
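
Cluster labels are computed on the rows that survive `dropna()`, so attaching them to students is safest when done by index rather than by position. A toy sketch of the difference, with hypothetical students and stand-in labels:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): student 'b' has a missing score
toy = pd.DataFrame({'student_id': ['a', 'b', 'c', 'd'],
                    'avg_eoc': [0.9, np.nan, 0.5, 0.7]})

# The feature table drops 'b', leaving row index [0, 2, 3]
features = toy[['avg_eoc']].dropna()
features['profile'] = ['High', 'Low', 'Mid']  # stand-in for cluster labels

# Index-aligned assignment: pandas matches on row index, so 'b' gets NaN
toy['profile'] = features['profile']
labelled = toy.dropna(subset=['profile'])
print(labelled)
```

Assigning `features['profile'].values` instead would ignore the index and, with rows dropped, either raise a length error or shift labels onto the wrong students.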

In [None]:
# Create interactive bipartite visualization
net2 = Network(height='600px', width='100%', bgcolor='#f8f9fa', font_color='#333')
net2.barnes_hut(gravity=-2000)

# Pin profile nodes at a fixed x on the left; chapter nodes are placed by the physics layout
for node, data in B.nodes(data=True):
    if data.get('node_type') == 'profile':
        color = data.get('color', '#9b59b6')
        size = min(50, 20 + data.get('size', 10) / 5)
        title = f"{node}\n({data.get('size', 0)} students)"
        net2.add_node(node, label=node, color=color, size=size, title=title, x=-300, physics=False)
    else:
        score = data.get('score', 0.5)
        r = int(255 * (1 - score))
        g = int(255 * score)
        color = f'rgb({r}, {g}, 100)'
        title = f"{node}\nAvg Score: {score:.1%}"
        net2.add_node(node, label=node, color=color, size=25, title=title)

for u, v, data in B.edges(data=True):
    weight = data.get('weight', 0.5)
    width = weight * 5
    title = f"Performance: {weight:.1%}"
    # Color: red for poor (<70%), yellow for medium, green for good (>85%)
    if weight < 0.7:
        color = '#e74c3c'
    elif weight < 0.85:
        color = '#f39c12'
    else:
        color = '#27ae60'
    net2.add_edge(u, v, value=width, color=color, title=title)

net2.save_graph('outputs/profile_chapter_network.html')
print("✅ Profile-Chapter network saved to outputs/profile_chapter_network.html")

## 8. Key Findings Summary

In [None]:
# Generate summary statistics
print("="*60)
print("KEY FINDINGS: CourseKata Learning Analytics")
print("="*60)

print("\n📊 DATASET OVERVIEW")
print(f"   Total student responses: {len(responses):,}")
print(f"   Unique students: {eoc['student_id'].nunique():,}")
print(f"   Chapters analyzed: {eoc['chapter_number'].nunique()}")
print(f"   Overall average EOC: {eoc['EOC'].mean():.1%}")

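print("\n🧠 PSYCHOLOGICAL CONSTRUCTS IMPACT")
if available_constructs and len(correlations) > 0:
    # Rank by absolute correlation so a strong negative predictor (e.g. Cost) is not overlooked
    best_construct = correlations.abs().idxmax()
    best_r = correlations[best_construct]
    direction = 'better' if best_r > 0 else 'worse'
    print(f"   Strongest predictor: {best_construct} (r = {best_r:.3f})")
    print(f"   Implication: Students reporting higher {best_construct.lower()} tend to perform {direction}")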

print("\n🚧 STUMBLING BLOCKS")
for _, row in stumbling.iterrows():
    print(f"   Chapter {int(row['chapter_number'])}: {row['avg_score']:.1%} avg score, {row['avg_attempts']:.0f} avg attempts")

print("\n👥 STUDENT PROFILES IDENTIFIED")
for profile in cluster_features['profile'].unique():
    count = (cluster_features['profile'] == profile).sum()
    avg_score = cluster_features[cluster_features['profile'] == profile]['avg_eoc'].mean()
    print(f"   {profile}: {count} students, {avg_score:.1%} avg performance")

print("\n💡 ACTIONABLE RECOMMENDATIONS")
print("   1. Focus intervention on identified stumbling block chapters")
print("   2. Build student self-efficacy (Expectancy) through early wins")
print("   3. Identify 'Struggling Students' early for targeted support")
print("   4. Use engagement patterns as early warning indicators")
print("\n" + "="*60)

In [None]:
# Final publication-ready figure: Multi-panel summary
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Chapter Performance (Stumbling Blocks in Red)',
        'Psychological Constructs Impact',
        'Student Engagement vs Performance',
        'Student Behavioral Profiles'
    ),
    specs=[[{'type': 'bar'}, {'type': 'bar'}],
           [{'type': 'scatter'}, {'type': 'pie'}]]
)

# Plot 1: Chapter Performance
colors = ['#e74c3c' if x > chapter_stats['struggle_index'].quantile(0.75) else '#3498db' 
          for x in chapter_stats['struggle_index']]
fig.add_trace(
    go.Bar(x=chapter_stats['chapter_number'], y=chapter_stats['avg_score'], 
           marker_color=colors, name='Avg Score'),
    row=1, col=1
)

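# Plot 2: Construct Correlations (skipped when no construct columns were available,
# in which case `correlations` was never defined)
if available_constructs and len(correlations) > 0: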
    colors2 = ['#e74c3c' if x < 0 else '#27ae60' for x in correlations.values]
    fig.add_trace(
        go.Bar(x=correlations.index, y=correlations.values, marker_color=colors2, name='Correlation'),
        row=1, col=2
    )

# Plot 3: Engagement vs Performance
fig.add_trace(
    go.Scatter(x=engagement_performance['engagement_ratio'], 
               y=engagement_performance['avg_eoc'],
               mode='markers', marker=dict(size=8, opacity=0.5, color='#3498db'),
               name='Students'),
    row=2, col=1
)

# Plot 4: Student Profiles Pie
profile_counts = cluster_features['profile'].value_counts()
fig.add_trace(
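    go.Pie(labels=profile_counts.index, values=profile_counts.values,
           # Look colors up by label so each slice matches its profile color from Section 7
           marker_colors=[profile_colors.get(p, '#9b59b6') for p in profile_counts.index]),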
    row=2, col=2
)

fig.update_layout(
    height=800, 
    width=1200,
    title_text='DataFest 2024: CourseKata Learning Analytics Summary',
    showlegend=False
)

fig.write_html('outputs/summary_dashboard.html')
# Static image export requires the optional kaleido package (pip install kaleido)
fig.write_image('outputs/summary_dashboard.png', scale=2)
fig.show()

print("\n✅ All visualizations saved to outputs/ folder")

---

## Outputs Generated

| File | Description |
|------|-------------|
| `outputs/construct_correlations.png` | Psychological constructs vs EOC performance |
| `outputs/chapter_difficulty.png` | Chapter-by-chapter performance analysis |
| `outputs/engagement_performance.png` | Engagement metrics vs academic outcomes |
| `outputs/student_clusters_interactive.html` | Interactive PCA cluster visualization |
| `outputs/cluster_radar.html` | Radar chart of student profiles |
| `outputs/learning_network.html` | Interactive force-directed network |
| `outputs/learning_network_static.png` | Static network visualization |
| `outputs/profile_chapter_network.html` | Profile-chapter bipartite network |
| `outputs/summary_dashboard.html` | Multi-panel summary dashboard |
| `outputs/summary_dashboard.png` | Publication-ready summary figure |

---

*Analysis completed for ASA DataFest 2024 using CourseKata educational platform data.*