# 5.1 Institutional Pattern Discovery with Unsupervised Learning

## Course 3: Advanced Classification Models for Student Success

## Opening Narrative

> *"We've spent this course predicting who will leave. Now we ask a different question:
> **What natural groupings exist among our students, and what do those patterns tell us
> about how to serve them?**"*

### The Provost's Question

Imagine your Provost says:

> *"We've been treating all first-year students the same in our retention programs.
> But intuitively, we know there are different kinds of at-risk students:
> some struggle academically, some financially, some socially.
> Can we **discover** these groups from the data itself,
> rather than imposing categories from above?"*

**That question is the heart of unsupervised learning.**

Unlike supervised learning (Modules 1–4), where we predicted a known target (`DEPARTED`),
unsupervised learning has **no target variable**. Instead, we let the data reveal
its own structure: clusters of similar students, latent dimensions that compress
dozens of variables into interpretable factors, or hidden patterns that no one
thought to look for.

### What You Will Learn

| Concept | What It Does | Institutional Use |
|:--------|:-------------|:------------------|
| **K-Means Clustering** | Groups similar observations into k clusters | Segment students into distinct risk/behavior profiles |
| **PCA** (Principal Component Analysis) | Reduces many variables to a few key dimensions | Simplify complex student data for visualization and insight |
| **Elbow Method & Silhouette Score** | Determine optimal number of clusters | Justify the number of student segments to leadership |
| **Cluster Profiling** | Describe what makes each cluster unique | Translate data patterns into actionable advising strategies |

### Key Difference from Supervised Learning

| | Supervised (Modules 1–4) | Unsupervised (This Module) |
|:--|:------------------------|:--------------------------|
| **Goal** | Predict a known outcome | Discover hidden structure |
| **Target variable** | Yes (`DEPARTED`) | No |
| **Evaluation** | AUC, F1, Accuracy | Silhouette, Inertia, Interpretability |
| **Question** | "Will this student leave?" | "What *kinds* of students do we have?" |
| **Action** | Flag individual students | Design group-level interventions |

## 1. Setup and Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Visualization defaults
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("All libraries loaded successfully!")
print("This module uses matplotlib/seaborn for static, publication-ready visuals.")

---

## Case Study 1: Student Entry Segmentation

### The Provost's Question

> *"We have 1,000 incoming first-year students with their high school records
> and demographic indicators. Can we identify **natural groupings**
> among these students to tailor our orientation and advising programs?"*

### Variables (7)

| Variable | Description |
|:---------|:------------|
| `HS_GPA` | High school GPA (2.0–4.0) |
| `HS_MATH_GPA` | High school math GPA (1.5–4.0) |
| `HS_ENGL_GPA` | High school English GPA (1.5–4.0) |
| `UNITS_ATTEMPTED_1` | Units attempted in first semester (9–18) |
| `FIRST_GEN` | First-generation status (0/1) |
| `PELL_ELIGIBLE` | Pell Grant eligibility (0/1) |
| `DISTANCE_FROM_CAMPUS` | Miles from campus (1–200) |

In [None]:
# --- Generate synthetic student entry data ---
n = 1000

entry_df = pd.DataFrame({
    'HS_GPA':               np.random.normal(3.2, 0.45, n).clip(2.0, 4.0),
    'HS_MATH_GPA':          np.random.normal(3.0, 0.55, n).clip(1.5, 4.0),
    'HS_ENGL_GPA':          np.random.normal(3.1, 0.50, n).clip(1.5, 4.0),
    'UNITS_ATTEMPTED_1':    np.random.choice([9, 12, 13, 14, 15, 16, 17, 18], n,
                                             p=[0.05, 0.15, 0.10, 0.15, 0.25, 0.15, 0.10, 0.05]),
    'FIRST_GEN':            np.random.binomial(1, 0.42, n),
    'PELL_ELIGIBLE':        np.random.binomial(1, 0.38, n),
    'DISTANCE_FROM_CAMPUS': np.random.exponential(25, n).clip(1, 200).round(1)
})

print(f"Dataset: {entry_df.shape[0]:,} students × {entry_df.shape[1]} variables")
entry_df.describe().round(2)

### Step 1: Scale the Data

**Why scale?** K-Means uses Euclidean distance. Without scaling, variables with larger ranges
(like `DISTANCE_FROM_CAMPUS` at 1–200) would dominate over variables with smaller ranges
(like `FIRST_GEN` at 0–1). `StandardScaler` puts all variables on the same footing.

In [None]:
scaler = StandardScaler()
entry_scaled = scaler.fit_transform(entry_df)

print("Before scaling:")
print(f"  HS_GPA range:    {entry_df['HS_GPA'].min():.1f} – {entry_df['HS_GPA'].max():.1f}")
print(f"  DISTANCE range:  {entry_df['DISTANCE_FROM_CAMPUS'].min():.1f} – {entry_df['DISTANCE_FROM_CAMPUS'].max():.1f}")

print(f"\nAfter scaling (mean ≈ 0, std ≈ 1):")
print(f"  HS_GPA range:    {entry_scaled[:, 0].min():.2f} – {entry_scaled[:, 0].max():.2f}")
print(f"  DISTANCE range:  {entry_scaled[:, 6].min():.2f} – {entry_scaled[:, 6].max():.2f}")
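
To make the scaling argument concrete, the sketch below measures how much of the squared Euclidean distance between two students comes from GPA, before and after `StandardScaler`. The four toy students and the helper `gpa_share` are hypothetical and exist only for illustration; they are not part of the course dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are [HS_GPA, DISTANCE_FROM_CAMPUS in miles]
toy = np.array([
    [2.0, 10.0],
    [4.0, 12.0],
    [2.1, 60.0],
    [3.9, 65.0],
])

def gpa_share(data, i, j):
    """Fraction of the squared Euclidean distance between rows i and j due to column 0 (GPA)."""
    sq_diff = (data[i] - data[j]) ** 2
    return sq_diff[0] / sq_diff.sum()

toy_scaled = StandardScaler().fit_transform(toy)

# Compare the first and last student, who differ on both variables
raw_share = gpa_share(toy, 0, 3)
scaled_share = gpa_share(toy_scaled, 0, 3)

print(f"GPA share of squared distance, raw:    {raw_share:.1%}")
print(f"GPA share of squared distance, scaled: {scaled_share:.1%}")
```

On the raw values, nearly all of the distance comes from miles, so K-Means would effectively cluster on distance alone. After scaling, GPA and distance contribute on a comparable order.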

### Step 2: Find the Optimal Number of Clusters

We use two complementary methods:

1. **Elbow Method**: Plot inertia (within-cluster sum of squares) vs. k. Look for the "elbow" where adding more clusters stops helping much.
2. **Silhouette Score**: Measures how similar each point is to its own cluster vs. other clusters. Ranges from -1 to +1; higher is better.

In [None]:
# Elbow method + Silhouette scores
K_range = range(2, 11)
inertias = []
silhouettes = []

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE)
    labels = km.fit_predict(entry_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(entry_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (Within-Cluster SS)')
axes[0].set_title('Elbow Method')
axes[0].axvline(x=4, color='red', linestyle='--', alpha=0.7, label='k = 4')
axes[0].legend()

# Silhouette plot
axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
axes[1].axvline(x=4, color='red', linestyle='--', alpha=0.7, label='k = 4')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nSilhouette scores: " + ", ".join([f"k={k}: {s:.3f}" for k, s in zip(K_range, silhouettes)]))
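
As a point of reference for reading these curves, the following sketch runs the same silhouette loop on data that does contain clear groups and picks the k with the highest score. The three-blob dataset from `make_blobs` is a hypothetical stand-in, not the course data. Because the synthetic variables in this notebook are drawn independently of one another, the curves above may not show an equally sharp peak, so k = 4 is best read as a teaching choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with three well-separated groups (not the course dataset)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.8,
    random_state=42,
)

# Same loop as in the notebook, storing the score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Highest silhouette at k={best_k}")
```

When real institutional data produce a flat silhouette curve, that is itself a finding: the population may form a continuum rather than distinct segments.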

### Step 3: Fit K-Means with k=4

In [None]:
km_entry = KMeans(n_clusters=4, n_init=10, random_state=RANDOM_STATE)
entry_df['Cluster'] = km_entry.fit_predict(entry_scaled)

print(f"Silhouette Score: {silhouette_score(entry_scaled, entry_df['Cluster']):.3f}")
print(f"\nCluster Sizes:")
print(entry_df['Cluster'].value_counts().sort_index())

### Step 4: PCA for Visualization

With 7 variables, we can't visualize the clusters directly. PCA compresses the data
into 2 dimensions that capture the most variance, letting us *see* the clusters.

In [None]:
pca = PCA(n_components=2, random_state=RANDOM_STATE)
entry_pca = pca.fit_transform(entry_scaled)

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(entry_pca[:, 0], entry_pca[:, 1],
                     c=entry_df['Cluster'], cmap='Set2', alpha=0.6, s=30, edgecolors='w', linewidth=0.3)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax.set_title('Student Entry Segments (PCA Projection)')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()

print(f"PC1 explains {pca.explained_variance_ratio_[0]:.1%} of variance")
print(f"PC2 explains {pca.explained_variance_ratio_[1]:.1%} of variance")
print(f"Together: {sum(pca.explained_variance_ratio_[:2]):.1%}")
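
The axes of a PCA plot are easier to explain to stakeholders once you know which original variables drive each component. A minimal sketch, assuming a hypothetical three-variable dataset in which the two GPA columns share a common signal: the fitted `components_` array is wrapped in a DataFrame so each row reads as one component's loadings.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_demo = 500

# Hypothetical data: two GPA columns share a common "preparation" signal,
# while distance is unrelated to both.
prep = rng.normal(size=n_demo)
demo_df = pd.DataFrame({
    'HS_GPA':      prep + rng.normal(scale=0.3, size=n_demo),
    'HS_MATH_GPA': prep + rng.normal(scale=0.3, size=n_demo),
    'DISTANCE':    rng.normal(size=n_demo),
})

demo_scaled = StandardScaler().fit_transform(demo_df)
demo_pca = PCA(n_components=2).fit(demo_scaled)

# Rows = principal components, columns = original variables
loadings = pd.DataFrame(
    demo_pca.components_,
    index=['PC1', 'PC2'],
    columns=demo_df.columns,
)

print(loadings.round(2))
print("Explained variance ratio:", demo_pca.explained_variance_ratio_.round(2))
```

Here PC1 loads mainly on the two GPA columns and PC2 mainly on distance, so the axes could be described as "academic preparation" and "distance". The sign of a component is arbitrary, so interpret magnitudes rather than direction.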

### Step 5: Cluster Profiling

This is the most important step for institutional use. We compute the mean of each
variable within each cluster, then **interpret what makes each group distinctive**.

In [None]:
profile = entry_df.groupby('Cluster').mean().round(2)
profile['Count'] = entry_df['Cluster'].value_counts().sort_index()

print("=" * 70)
print("CLUSTER PROFILES: Student Entry Segmentation")
print("=" * 70)
print(profile.to_string())
print("=" * 70)

# Heatmap of cluster profiles (z-scored across the cluster means for comparison)
profile_means = profile.drop(columns='Count')
profile_z = (profile_means - profile_means.mean()) / profile_means.std()

fig, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(profile_z, annot=profile_means.values, fmt='.2f',
            cmap='RdYlGn', center=0, linewidths=1, ax=ax)
ax.set_title('Cluster Profiles (color = z-score, numbers = raw means)')
ax.set_ylabel('Cluster')
plt.tight_layout()
plt.show()
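
The heatmap above standardizes across the four cluster means, which exaggerates small differences when clusters are similar. An alternative, sketched here under the assumption of a tiny hypothetical dataset, is to standardize each cluster mean against the full student population, so the numbers read as "standard deviations above or below the typical student".

```python
import pandas as pd

# Hypothetical toy data with an existing 'Cluster' column
toy_df = pd.DataFrame({
    'HS_GPA':    [2.0, 2.2, 3.8, 4.0],
    'FIRST_GEN': [1,   1,   0,   0],
    'Cluster':   [0,   0,   1,   1],
})

features = toy_df.drop(columns='Cluster')
cluster_means = toy_df.groupby('Cluster').mean()

# Standardize each cluster mean against the full population, so a value of
# +1 means "one population standard deviation above the overall mean".
relative_profile = (cluster_means - features.mean()) / features.std()

print(relative_profile.round(2))
```

A cluster whose values all sit within a few tenths of zero is hard to distinguish from the population as a whole, which is worth knowing before assigning it a label.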

### Interpretation for Stakeholders

After examining the cluster profiles, you would write institutional interpretations like:

| Cluster | Label | Key Characteristics | Suggested Intervention |
|:--------|:------|:-------------------|:----------------------|
| 0 | **Prepared Commuters** | High GPA, lives far from campus | Flexible scheduling, online office hours |
| 1 | **At-Risk First-Gen** | Lower GPA, first-gen, Pell-eligible | Intensive advising, bridge programs |
| 2 | **Strong Local Students** | High GPA, lives close, full load | Honors programs, research opportunities |
| 3 | **Moderate & Uncertain** | Average GPA, moderate distance | Mentoring, exploratory advising |

> **Note**: Exact labels depend on your data. The key is translating cluster statistics into
> language that advisors and administrators can act on.

---

## Case Study 2: Program Portfolio Segmentation

### The Dean's Question

> *"We have 300 academic programs across the university. Can we identify
> which programs are thriving, which are struggling, and which are in between,
> based on enrollment, outcomes, and efficiency data?"*

### Variables (6)

| Variable | Description |
|:---------|:------------|
| `ENROLLMENT` | Total headcount (50–2000) |
| `RETENTION_RATE` | First-year retention (0.40–0.98) |
| `GRADUATION_RATE` | 6-year grad rate (0.20–0.90) |
| `FACULTY_RATIO` | Student-to-faculty ratio (8–40) |
| `COST_PER_STUDENT` | Annual cost per student ($5k–$50k) |
| `JOB_PLACEMENT_RATE` | Employment within 1 year (0.50–0.98) |

In [None]:
# --- Generate synthetic program portfolio data ---
n_prog = 300

prog_df = pd.DataFrame({
    'ENROLLMENT':         np.random.lognormal(5.5, 0.7, n_prog).clip(50, 2000).astype(int),
    'RETENTION_RATE':     np.random.beta(8, 3, n_prog).clip(0.40, 0.98).round(3),
    'GRADUATION_RATE':    np.random.beta(5, 4, n_prog).clip(0.20, 0.90).round(3),
    'FACULTY_RATIO':      np.random.normal(22, 7, n_prog).clip(8, 40).round(1),
    'COST_PER_STUDENT':   np.random.lognormal(9.5, 0.5, n_prog).clip(5000, 50000).round(0),
    'JOB_PLACEMENT_RATE': np.random.beta(7, 3, n_prog).clip(0.50, 0.98).round(3)
})

print(f"Dataset: {prog_df.shape[0]} programs × {prog_df.shape[1]} variables")
prog_df.describe().round(2)

In [None]:
# Scale → Elbow + Silhouette (K-Means fit, PCA, and profile follow in the next cell)
scaler_prog = StandardScaler()
prog_scaled = scaler_prog.fit_transform(prog_df)

# Optimal k
inertias_p, sils_p = [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE)
    labels = km.fit_predict(prog_scaled)
    inertias_p.append(km.inertia_)
    sils_p.append(silhouette_score(prog_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(range(2, 11), inertias_p, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('k')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method: Programs')
axes[0].axvline(x=4, color='red', linestyle='--', alpha=0.7)

axes[1].plot(range(2, 11), sils_p, 'go-', linewidth=2, markersize=8)
axes[1].set_xlabel('k')
axes[1].set_ylabel('Silhouette')
axes[1].set_title('Silhouette: Programs')
axes[1].axvline(x=4, color='red', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

In [None]:
# Fit with k=4
km_prog = KMeans(n_clusters=4, n_init=10, random_state=RANDOM_STATE)
prog_df['Cluster'] = km_prog.fit_predict(prog_scaled)

# PCA visualization
pca_prog = PCA(n_components=2, random_state=RANDOM_STATE)
prog_pca = pca_prog.fit_transform(prog_scaled)

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(prog_pca[:, 0], prog_pca[:, 1],
                     c=prog_df['Cluster'], cmap='Set2', alpha=0.6, s=40, edgecolors='w', linewidth=0.3)
ax.set_xlabel(f'PC1 ({pca_prog.explained_variance_ratio_[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({pca_prog.explained_variance_ratio_[1]:.1%} variance)')
ax.set_title('Program Portfolio Segments (PCA)')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()

# Profile
profile_p = prog_df.groupby('Cluster').mean().round(2)
profile_p['Count'] = prog_df['Cluster'].value_counts().sort_index()

print("=" * 80)
print("CLUSTER PROFILES: Program Portfolio Segmentation")
print("=" * 80)
print(profile_p.to_string())

### Interpretation for the Dean

| Cluster | Label | Characteristics | Action |
|:--------|:------|:---------------|:-------|
| 0 | **High-Performing Flagships** | High enrollment, high outcomes, good placement | Invest and scale |
| 1 | **Efficient Small Programs** | Small but strong retention and placement | Protect from cuts |
| 2 | **Under-Performing, High-Cost** | Low outcomes, high cost per student | Restructure or sunset |
| 3 | **Growing Programs** | Rising enrollment, moderate outcomes | Invest in faculty, support |

---

## Case Study 3: Student Pathway Segmentation

### The Academic Affairs Question

> *"Can we identify distinct **behavioral pathways** among students based on their
> first-year academic behavior (course load, grades, DFW rates) to design
> targeted second-year interventions?"*

### Variables (5)

| Variable | Description |
|:---------|:------------|
| `GPA_1` | First-semester GPA |
| `GPA_2` | Second-semester GPA |
| `DFW_RATE_1` | First-semester DFW rate |
| `DFW_RATE_2` | Second-semester DFW rate |
| `UNITS_COMPLETED_RATIO` | Proportion of attempted units completed |

In [None]:
# --- Generate synthetic pathway data ---
n_path = 1000

pathway_df = pd.DataFrame({
    'GPA_1':                 np.random.normal(2.8, 0.7, n_path).clip(0.0, 4.0).round(2),
    'GPA_2':                 np.random.normal(2.9, 0.65, n_path).clip(0.0, 4.0).round(2),
    'DFW_RATE_1':            np.random.beta(2, 8, n_path).clip(0.0, 1.0).round(3),
    'DFW_RATE_2':            np.random.beta(2, 8, n_path).clip(0.0, 1.0).round(3),
    'UNITS_COMPLETED_RATIO': np.random.beta(8, 2, n_path).clip(0.3, 1.0).round(3)
})

print(f"Dataset: {pathway_df.shape[0]:,} students × {pathway_df.shape[1]} variables")
pathway_df.describe().round(2)

In [None]:
# Full pipeline: Scale → Silhouette → K-Means → PCA → Profile
scaler_path = StandardScaler()
path_scaled = scaler_path.fit_transform(pathway_df)

# Optimal k
sils_path = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE)
    sils_path.append(silhouette_score(path_scaled, km.fit_predict(path_scaled)))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(2, 11), sils_path, 'go-', linewidth=2, markersize=8)
ax.set_xlabel('k')
ax.set_ylabel('Silhouette Score')
ax.set_title('Silhouette Analysis: Student Pathways')
ax.axvline(x=4, color='red', linestyle='--', alpha=0.7, label='k = 4')
ax.legend()
plt.tight_layout()
plt.show()

# Fit k=4
km_path = KMeans(n_clusters=4, n_init=10, random_state=RANDOM_STATE)
pathway_df['Cluster'] = km_path.fit_predict(path_scaled)

# PCA + scatter
pca_path = PCA(n_components=2, random_state=RANDOM_STATE)
path_pca = pca_path.fit_transform(path_scaled)

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(path_pca[:, 0], path_pca[:, 1],
                     c=pathway_df['Cluster'], cmap='Set2', alpha=0.6, s=30, edgecolors='w', linewidth=0.3)
ax.set_xlabel(f'PC1 ({pca_path.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca_path.explained_variance_ratio_[1]:.1%})')
ax.set_title('Student Pathway Segments (PCA)')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()

# Profile
profile_path = pathway_df.groupby('Cluster').mean().round(3)
profile_path['Count'] = pathway_df['Cluster'].value_counts().sort_index()

print("=" * 70)
print("CLUSTER PROFILES: Student Pathways")
print("=" * 70)
print(profile_path.to_string())

### Interpretation for Academic Affairs

| Cluster | Label | Characteristics | Intervention |
|:--------|:------|:---------------|:-------------|
| 0 | **Steady Performers** | Consistent GPA, low DFW, high completion | Light-touch advising |
| 1 | **Improving Trajectory** | GPA rises semester 2, DFW drops | Encourage momentum, mentoring |
| 2 | **Declining Trajectory** | GPA drops semester 2, DFW rises | Early alert, intensive advising |
| 3 | **Chronically Struggling** | Low GPA both semesters, high DFW | Major intervention, bridge support |
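
Labels such as "improving" or "declining" are easier to defend when they rest on an explicit semester-over-semester change. The sketch below assumes a hypothetical table of cluster means shaped like `profile_path`; the values, the `trajectory` helper, and its 0.25 threshold are illustrative choices, not results from this notebook.

```python
import pandas as pd

# Hypothetical cluster means, shaped like `profile_path` above
toy_profile = pd.DataFrame(
    {
        'GPA_1':      [3.4, 2.3, 3.1, 1.9],
        'GPA_2':      [3.4, 2.9, 2.4, 1.9],
        'DFW_RATE_1': [0.05, 0.30, 0.10, 0.45],
        'DFW_RATE_2': [0.05, 0.15, 0.30, 0.45],
    },
    index=pd.Index([0, 1, 2, 3], name='Cluster'),
)

# Positive GPA change = improvement; positive DFW change = deterioration
toy_profile['GPA_CHANGE'] = toy_profile['GPA_2'] - toy_profile['GPA_1']
toy_profile['DFW_CHANGE'] = toy_profile['DFW_RATE_2'] - toy_profile['DFW_RATE_1']

def trajectory(gpa_change, threshold=0.25):
    """Label a cluster by its GPA change; the 0.25 threshold is an arbitrary illustration."""
    if gpa_change >= threshold:
        return 'improving'
    if gpa_change <= -threshold:
        return 'declining'
    return 'stable'

toy_profile['TRAJECTORY'] = toy_profile['GPA_CHANGE'].apply(trajectory)

print(toy_profile[['GPA_CHANGE', 'DFW_CHANGE', 'TRAJECTORY']].round(2))
```

A "stable" trajectory does not separate steady performers from chronically struggling students; that distinction comes from the GPA level, so read the change columns alongside the raw means.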

---

## Final Capstone: Course Bottleneck Detection

### The Provost & Curriculum Committee Question

> *"We have data on 1,000 course sections. Some courses seem to act as **bottlenecks**:
> high DFW rates, low pass rates, long repeat delays. Can we identify clusters
> of problematic courses and create a **Risk Prioritization Index** to guide
> curriculum reform?"*

### Variables (13)

| Variable | Description |
|:---------|:------------|
| `ENROLLMENT` | Section enrollment |
| `DFW_RATE` | DFW rate for the section |
| `PASS_RATE` | Pass rate (C or better) |
| `AVG_GPA` | Average grade in the section |
| `REPEAT_RATE` | Proportion of students repeating |
| `AVG_REPEAT_DELAY` | Average semesters before repeat |
| `SECTION_SIZE` | Number of seats |
| `PCT_FIRST_YEAR` | Proportion of first-year students |
| `PCT_STEM` | Proportion of STEM majors |
| `INSTRUCTOR_RATING` | Student evaluation score |
| `PREREQUISITE_COUNT` | Number of prerequisites |
| `IS_GATEWAY` | Gateway course flag (0/1) |
| `UNITS` | Course unit value |

In [None]:
# --- Generate synthetic course section data ---
n_courses = 1000

course_df = pd.DataFrame({
    'ENROLLMENT':         np.random.lognormal(3.5, 0.5, n_courses).clip(15, 500).astype(int),
    'DFW_RATE':           np.random.beta(2, 7, n_courses).clip(0.01, 0.70).round(3),
    'PASS_RATE':          np.random.beta(7, 2, n_courses).clip(0.30, 0.99).round(3),
    'AVG_GPA':            np.random.normal(2.7, 0.5, n_courses).clip(0.5, 4.0).round(2),
    'REPEAT_RATE':        np.random.beta(1.5, 8, n_courses).clip(0.0, 0.50).round(3),
    'AVG_REPEAT_DELAY':   np.random.exponential(1.5, n_courses).clip(0.5, 6.0).round(1),
    'SECTION_SIZE':       np.random.choice([25, 30, 35, 40, 50, 60, 80, 100, 150, 200], n_courses),
    'PCT_FIRST_YEAR':     np.random.beta(3, 5, n_courses).round(3),
    'PCT_STEM':           np.random.beta(3, 4, n_courses).round(3),
    'INSTRUCTOR_RATING':  np.random.normal(3.8, 0.6, n_courses).clip(1.0, 5.0).round(1),
    'PREREQUISITE_COUNT': np.random.choice([0, 1, 2, 3, 4, 5], n_courses,
                                           p=[0.15, 0.30, 0.25, 0.15, 0.10, 0.05]),
    'IS_GATEWAY':         np.random.binomial(1, 0.30, n_courses),
    'UNITS':              np.random.choice([1, 2, 3, 4, 5], n_courses,
                                           p=[0.05, 0.10, 0.50, 0.30, 0.05])
})

print(f"Dataset: {course_df.shape[0]:,} course sections × {course_df.shape[1]} variables")
course_df.describe().round(2)

In [None]:
# Pipeline: Scale → Silhouette → K-Means → PCA (profiling follows in the next cell)
scaler_c = StandardScaler()
course_scaled = scaler_c.fit_transform(course_df)

# Optimal k
sils_c = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE)
    sils_c.append(silhouette_score(course_scaled, km.fit_predict(course_scaled)))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(2, 11), sils_c, 'go-', linewidth=2, markersize=8)
ax.set_xlabel('k')
ax.set_ylabel('Silhouette Score')
ax.set_title('Silhouette Analysis: Course Sections')
ax.axvline(x=4, color='red', linestyle='--', alpha=0.7, label='k = 4')
ax.legend()
plt.tight_layout()
plt.show()

# Fit k=4
km_course = KMeans(n_clusters=4, n_init=10, random_state=RANDOM_STATE)
course_df['Cluster'] = km_course.fit_predict(course_scaled)

# PCA visualization
pca_c = PCA(n_components=2, random_state=RANDOM_STATE)
course_pca = pca_c.fit_transform(course_scaled)

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(course_pca[:, 0], course_pca[:, 1],
                     c=course_df['Cluster'], cmap='Set2', alpha=0.6, s=30, edgecolors='w', linewidth=0.3)
ax.set_xlabel(f'PC1 ({pca_c.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({pca_c.explained_variance_ratio_[1]:.1%})')
ax.set_title('Course Section Segments (PCA)')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()

In [None]:
# Cluster Profile
profile_c = course_df.groupby('Cluster').mean().round(3)
profile_c['Count'] = course_df['Cluster'].value_counts().sort_index()

print("=" * 100)
print("CLUSTER PROFILES: Course Bottleneck Detection")
print("=" * 100)
print(profile_c.to_string())
print("=" * 100)

# Heatmap (z-scored across the cluster means)
profile_c_means = profile_c.drop(columns='Count')
profile_cz = (profile_c_means - profile_c_means.mean()) / profile_c_means.std()

fig, ax = plt.subplots(figsize=(14, 5))
sns.heatmap(profile_cz, annot=True, fmt='.2f', cmap='RdYlGn_r', center=0, linewidths=1, ax=ax)
# With the reversed colormap, red marks above-average values. That is concerning for
# DFW_RATE or REPEAT_RATE but favorable for PASS_RATE or AVG_GPA, so read each column
# in light of its meaning.
ax.set_title('Course Cluster Profiles (z-scored: red = above average, green = below average)')
ax.set_ylabel('Cluster')
plt.tight_layout()
plt.show()

### Risk Prioritization Index

We create a composite risk score for each course section to help the Provost and
Curriculum Committee prioritize which courses to review first.

The index combines three weighted factors, each divided by its maximum across sections so that it falls between 0 and 1:

- **DFW Rate** (weight: 0.4): direct measure of student failure
- **Repeat Rate** (weight: 0.3): indicates students are stuck
- **Complement of Pass Rate**, `1 - PASS_RATE` (weight: 0.3): lower pass rate = higher risk
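
Before applying the index to all 1,000 sections, it can help to check the arithmetic on a case small enough to verify by hand. This sketch uses three hypothetical sections and a hypothetical helper `max_scale`; the weights are the ones listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical sections chosen so the arithmetic is easy to check by hand
sections = pd.DataFrame({
    'DFW_RATE':    [0.10, 0.40, 0.20],
    'REPEAT_RATE': [0.05, 0.20, 0.10],
    'PASS_RATE':   [0.90, 0.60, 0.80],
})

def max_scale(s):
    """Divide by the column maximum so the worst section scores 1.0."""
    return s / s.max()

sections['RISK_INDEX'] = (
    0.4 * max_scale(sections['DFW_RATE'])
    + 0.3 * max_scale(sections['REPEAT_RATE'])
    + 0.3 * max_scale(1 - sections['PASS_RATE'])
)

print(sections.round(3))
```

The middle section is worst on all three factors, so it scores 0.4 + 0.3 + 0.3 = 1.0; the first section sits at a quarter of each maximum and scores 0.25, and the third at half of each maximum scores 0.5. Because each factor is divided by its observed maximum, the index is relative to the sections in the data and is sensitive to a single extreme value.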

In [None]:
# Risk Prioritization Index
course_df['RISK_INDEX'] = (
    0.4 * (course_df['DFW_RATE'] / course_df['DFW_RATE'].max()) +
    0.3 * (course_df['REPEAT_RATE'] / course_df['REPEAT_RATE'].max()) +
    0.3 * ((1 - course_df['PASS_RATE']) / (1 - course_df['PASS_RATE']).max())
).round(3)

# Top 20 highest-risk sections
top_risk = course_df.nlargest(20, 'RISK_INDEX')[
    ['ENROLLMENT', 'DFW_RATE', 'PASS_RATE', 'REPEAT_RATE', 'IS_GATEWAY', 'RISK_INDEX', 'Cluster']
]

print("=" * 80)
print("TOP 20 HIGHEST-RISK COURSE SECTIONS")
print("=" * 80)
print(top_risk.to_string())

# Risk distribution by cluster
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
course_df.boxplot(column='RISK_INDEX', by='Cluster', ax=axes[0])
axes[0].set_title('Risk Index by Cluster')
axes[0].set_xlabel('Cluster')
axes[0].set_ylabel('Risk Index')
fig.suptitle('')  # remove the automatic "Boxplot grouped by Cluster" title added by pandas

# Histogram
for c in sorted(course_df['Cluster'].unique()):
    axes[1].hist(course_df[course_df['Cluster'] == c]['RISK_INDEX'],
                 alpha=0.5, bins=20, label=f'Cluster {c}')
axes[1].set_xlabel('Risk Index')
axes[1].set_ylabel('Count')
axes[1].set_title('Risk Index Distribution by Cluster')
axes[1].legend()

plt.tight_layout()
plt.show()

### Final Capstone Deliverables

In a real institutional analysis, you would deliver:

1. **Cluster summary table**: One row per cluster with mean statistics and interpretation
2. **PCA scatter plot**: Visual check of how well the clusters separate in two dimensions
3. **Risk Prioritization Index**: Ranked list of courses needing review
4. **Executive memo**: 1-page summary for the Provost (see template below)

#### Executive Memo Template

> **To:** Provost / Curriculum Committee
>
> **From:** Institutional Research
>
> **Re:** Course Bottleneck Analysis: Risk-Prioritized Findings
>
> Using K-Means clustering on 13 course-level metrics for 1,000 sections,
> we identified four distinct course profiles. Cluster [X] contains the highest-risk
> sections (avg DFW rate = [Y]%, avg repeat rate = [Z]%). These [N] sections
> account for [P]% of total DFW enrollments despite being only [Q]% of sections.
>
> **Recommendation:** Prioritize curriculum review for the top 20 sections
> identified by the Risk Prioritization Index (attached).
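
The bracketed placeholders in the memo can be filled directly from the clustered data. The sketch below is one possible way to do it, using a hypothetical four-section frame whose column names mirror `course_df`. It assumes "DFW enrollments" means enrollment multiplied by DFW rate and that the highest-risk cluster is the one with the highest mean DFW rate; both are interpretations rather than definitions from the template. The [Z] value would follow the same pattern using `REPEAT_RATE`.

```python
import pandas as pd

# Hypothetical sections; column names mirror `course_df`
memo_df = pd.DataFrame({
    'ENROLLMENT': [100, 100, 200, 100],
    'DFW_RATE':   [0.50, 0.40, 0.10, 0.10],
    'Cluster':    [2, 2, 0, 1],
})

# Estimated number of D/F/W outcomes per section (assumed definition)
memo_df['DFW_COUNT'] = memo_df['ENROLLMENT'] * memo_df['DFW_RATE']

# [X]: the cluster with the highest mean DFW rate
worst = memo_df.groupby('Cluster')['DFW_RATE'].mean().idxmax()
in_worst = memo_df['Cluster'] == worst

n_sections = int(in_worst.sum())                      # [N]
avg_dfw = memo_df.loc[in_worst, 'DFW_RATE'].mean()    # [Y]
share_of_dfw = (                                      # [P]
    memo_df.loc[in_worst, 'DFW_COUNT'].sum() / memo_df['DFW_COUNT'].sum()
)
share_of_sections = in_worst.mean()                   # [Q]

print(f"Cluster {worst}: {n_sections} sections, avg DFW {avg_dfw:.0%}, "
      f"{share_of_dfw:.0%} of DFW outcomes, {share_of_sections:.0%} of sections")
```

In this toy case the two sections in cluster 2 produce 90 of the 120 estimated DFW outcomes, which is 75% of the total from 50% of the sections.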

---

## Summary and Final Teaching Outcomes

### What You Learned

| Concept | Key Takeaway |
|:--------|:-------------|
| **Unsupervised vs. Supervised** | No target variable: we discover structure rather than predict outcomes |
| **StandardScaler** | Always scale before K-Means (distance-based) |
| **Elbow Method** | Plot inertia vs. k, look for the bend |
| **Silhouette Score** | Quantitative measure of cluster quality (higher = better) |
| **K-Means** | `KMeans(n_clusters=k).fit_predict(X_scaled)`: that's it! |
| **PCA** | `PCA(n_components=2).fit_transform(X_scaled)` for visualization |
| **Cluster Profiling** | `df.groupby('Cluster').mean()`, the institutional insight step |
| **Risk Index** | Weighted composite score for prioritization |

### The Consistent Pattern

Every case study followed the same pipeline:

```
Scale → Find k → Fit K-Means → PCA Visualization → Profile → Interpret
```

### Key Institutional Applications

1. **Student segmentation**: Tailor advising and retention programs
2. **Program portfolio analysis**: Identify thriving vs. struggling programs
3. **Behavioral pathway detection**: Design targeted year-2 interventions
4. **Course bottleneck identification**: Prioritize curriculum reform

### What's Next

In the **Special Topics module** (Module 6), you can explore:

- Hierarchical clustering (dendrograms)
- DBSCAN (density-based clustering)
- t-SNE and UMAP (advanced visualization)
- LightGBM, CatBoost, and Neural Networks