## Can we group students based on the number of upper-division courses they've taken (updv.num) and their comfort with math, statistics, and programming?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('data/background-clean.csv')

print(f"Dataset shape: {df.shape}")
print(f"Number of students: {df.shape[0]}")

Dataset shape: (51, 30)
Number of students: 51


In [10]:
# Convert updv.num from categorical ranges to a rough numeric estimate of courses taken
# (approximate midpoint of each range; '9+' is treated as 10)
updv_mapping = {
    '0-2': 1,    
    '3-5': 4,      
    '6-8': 7,    
    '9+': 10     
}

df['updv_numeric'] = df['updv.num'].map(updv_mapping)

feature_names = ['updv_numeric', 'prog.comf', 'math.comf', 'stat.comf']
feature_labels = ['Upper Div Courses', 'Programming Comfort', 'Math Comfort', 'Statistics Comfort']

# Create clustering dataset
X = df[feature_names].copy()
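
One thing worth checking after a dictionary-based `.map()` is that every category in the column was covered, because pandas turns any unmapped label into `NaN` without raising an error. The following is a minimal sketch on hypothetical toy data; the `toy` frame and the extra `'10+'` label are assumptions for illustration and are not part of the survey data.

```python
import pandas as pd

# Hypothetical toy data (NOT the survey data); '10+' is deliberately
# absent from the mapping to show what an uncovered label looks like.
toy = pd.DataFrame({'updv.num': ['0-2', '3-5', '6-8', '9+', '10+']})
updv_mapping = {'0-2': 1, '3-5': 4, '6-8': 7, '9+': 10}

toy['updv_numeric'] = toy['updv.num'].map(updv_mapping)

# Labels that are not keys of the mapping come back as NaN
unmapped = toy.loc[toy['updv_numeric'].isna(), 'updv.num'].unique().tolist()
print(unmapped)
```

An empty list here would mean every label was mapped, which is what the missing-value check on `X` also confirms for the real data.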

In [12]:
# Check for missing values in each feature (none found, see output)
print("\nMissing values per feature:")
print(X.isnull().sum())


Missing values per feature:
updv_numeric    0
prog.comf       0
math.comf       0
stat.comf       0
dtype: int64


## Exploratory Data Analysis

In [8]:
print("\nSummary Statistics:")
print(X.describe())


Summary Statistics:
       updv_numeric  prog.comf  math.comf  stat.comf
count     51.000000  51.000000  51.000000  51.000000
mean       7.882353   3.862745   4.039216   4.039216
std        2.635504   0.748855   0.773583   0.799019
min        1.000000   2.000000   3.000000   2.000000
25%        7.000000   3.000000   3.000000   3.500000
50%       10.000000   4.000000   4.000000   4.000000
75%       10.000000   4.000000   5.000000   5.000000
max       10.000000   5.000000   5.000000   5.000000


The mean of the numeric upper-division estimate is 7.88, which falls roughly in the original 6-8 category, although the median is 10, so at least half of the students reported 9+ courses. On average, students also report slightly higher comfort with math and statistics (mean 4.04 each) than with programming (mean 3.86).

In [9]:
# Pairwise Pearson correlations between the clustering features
print("\nCorrelation Matrix:")
correlation_matrix = X.corr()
print(correlation_matrix)


Correlation Matrix:
              updv_numeric  prog.comf  math.comf  stat.comf
updv_numeric      1.000000   0.153794   0.276982   0.382135
prog.comf         0.153794   1.000000   0.216623   0.243152
math.comf         0.276982   0.216623   1.000000   0.579885
stat.comf         0.382135   0.243152   0.579885   1.000000
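
To read the matrix more easily, the pairs can be ranked by strength. The sketch below rebuilds the matrix from the printed values above (rather than from the CSV, so it runs on its own) and lists each unique pair once; the variable names `corr`, `pairs`, and `strongest_pair` are illustrative and not from the notebook.

```python
import numpy as np
import pandas as pd

cols = ['updv_numeric', 'prog.comf', 'math.comf', 'stat.comf']

# Values copied from the printed correlation matrix above
corr = pd.DataFrame(
    [[1.000000, 0.153794, 0.276982, 0.382135],
     [0.153794, 1.000000, 0.216623, 0.243152],
     [0.276982, 0.216623, 1.000000, 0.579885],
     [0.382135, 0.243152, 0.579885, 1.000000]],
    index=cols, columns=cols,
)

# Keep only the strictly lower triangle so each pair appears once
# and the diagonal of 1.0s is excluded
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
strongest_pair = pairs.index[0]
```

In this ranking, math and statistics comfort form the strongest pair (about 0.58), and every pair involving programming comfort stays below 0.25.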
