<p style="font-family: Arial; font-size:3em;color:black;"> Lab Exercise 9</p>

In [6]:
# For this example, we use the K-Means Clustering Project data set from
# Kaggle (https://www.kaggle.com/faressayah/k-means-clustering-private-vs-public-universities)
# The data set includes labels (the 'Private' column), but we will NOT use them for the K-Means clustering algorithm,
# since K-Means is an unsupervised learning algorithm.
# As we will shortly see, the data frame has 777 observations on 18 variables.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [11]:
df = pd.read_csv('College_Data', index_col=0)
df.shape


(777, 18)

In [18]:
# Set Cazenovia College's graduation rate to 100.
# Use a single .loc call rather than chained indexing (df['Grad.Rate'][...] = ...),
# which triggers a pandas warning and will not update df under Copy-on-Write (the default in pandas 3.0).
df.loc['Cazenovia College', 'Grad.Rate'] = 100

# Try removing various columns (features) from the dataset and examine whether it improves or degrades your
# K-Means model performance, or has little impact.
# Report 10 cases where you removed one or more features and indicate how each impacted the model performance.

In [12]:
# Shows the first 5 row labels.
df.index[:5]

Index(['Abilene Christian University', 'Adelphi University', 'Adrian College',
       'Agnes Scott College', 'Alaska Pacific University'],
      dtype='object')

In [15]:
# Show the number of rows (777) and columns (18).
df.shape


(777, 18)

In [16]:
# Show the column names.
df.columns

Index(['Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
       'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books',
       'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend',
       'Grad.Rate'],
      dtype='object')

In [32]:
# Set aside the "Private" column: it is categorical (and is the label), and K-Means requires numerical features only.
y_private = df["Private"]

# Features = everything except 'Private' (all numeric columns)
X = df.drop(columns=["Private"])

X.shape


(777, 17)

In [21]:
X.isna().sum().sum()

np.int64(0)

In [23]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled.shape

(777, 17)
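The scaling step matters because K-Means relies on Euclidean distance, so a column measured in thousands would otherwise dominate a column measured in fractions. The sketch below uses a small made-up array (`X_toy` is hypothetical, not the college data) to show that `StandardScaler` amounts to subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_toy = np.array([
    [1000.0, 0.1],
    [2000.0, 0.2],
    [3000.0, 0.3],
    [4000.0, 0.6],
])

# What the notebook does: fit the scaler and transform in one step
X_std = StandardScaler().fit_transform(X_toy)

# The same result by hand: z = (x - column mean) / column std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes to the distance on a comparable footing.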

In [24]:
kmeans = KMeans(n_clusters=2, n_init=20, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

clusters[:10]

array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0], dtype=int32)

In [27]:
# Calculate the baseline silhouette score (all 17 features, k=2)

sil = silhouette_score(X_scaled, clusters)
sil

0.2286626165244013
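To make the silhouette score less of a black box, here is a minimal sketch that computes it by hand on a tiny made-up data set and checks it against scikit-learn. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the nearest other cluster; the point's score is `(b - a) / max(a, b)`, and the overall score is the mean over all points. The names `X_toy`, `labels_toy`, and `manual_silhouette` are illustrative only.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two tight groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def manual_silhouette(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), Euclidean distance."""
    # Pairwise distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0
        a = dist[i, same].mean()
        b = min(
            dist[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

manual = manual_silhouette(X_toy, labels_toy)
print(manual)
print(manual.mean(), silhouette_score(X_toy, labels_toy))
```

Scores near 1 indicate tight, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster.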

In [33]:
def kmeans_silhouette_after_removal(df, remove_cols, k=2, seed=42):
    # 1) Drop the label column (Private) + the columns we want to remove
    X = df.drop(columns=["Private"] + remove_cols)

    # 2) Scale features so distances are fair
    X_scaled = StandardScaler().fit_transform(X)

    # 3) Fit KMeans and get cluster labels
    km = KMeans(n_clusters=k, n_init=20, random_state=seed)
    clusters = km.fit_predict(X_scaled)

    # 4) Return silhouette score (higher = better clustering separation)
    return silhouette_score(X_scaled, clusters)


In [29]:
# testing the baseline against removing one column

baseline = kmeans_silhouette_after_removal(df, [], k=2)
case1 = kmeans_silhouette_after_removal(df, ["Books"], k=2)

print("Baseline (remove none):", baseline)
print("Remove Books:", case1)
print("Change:", case1 - baseline)

Baseline (remove none): 0.2286626165244013
Remove Books: 0.24526244203629538
Change: 0.016599825511894067
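One way to build intuition for why dropping a single column can raise the silhouette score is to run the opposite experiment on synthetic data: start with two clearly separated groups, then append a column of pure random noise. The sketch below assumes made-up data from `make_blobs` (not the college data), and `scaled_silhouette` is a hypothetical helper that mirrors the scale, cluster, score steps used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two well-separated synthetic groups in two informative features
X_clean, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Append one column of pure noise that carries no group information
noise = rng.normal(size=(300, 1))
X_noisy = np.hstack([X_clean, noise])

def scaled_silhouette(X, k=2, seed=42):
    """Scale, cluster with K-Means, and return the silhouette score."""
    X_scaled = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=20, random_state=seed).fit_predict(X_scaled)
    return silhouette_score(X_scaled, labels)

sil_clean = scaled_silhouette(X_clean)
sil_noisy = scaled_silhouette(X_noisy)
print("Informative features only:", sil_clean)
print("With a noise column added:", sil_noisy)
```

The noise column inflates within-cluster distances without helping to separate the groups, so the score drops. This is only an illustration of the mechanism; it does not by itself show that any particular column in the college data is noise.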


In [None]:
# 'Books' appears to behave like noise for this clustering. With it removed, the "distance picture"
# between schools becomes slightly cleaner, so K-Means finds slightly better-separated clusters
# (silhouette up by about 0.017).

In [30]:
# Run the baseline plus 9 removal cases automatically and build a report table

cases = [
    [],                             # Case 0: baseline
    ["Books"],                      # Case 1
    ["Personal"],                   # Case 2
    ["Books", "Personal"],          # Case 3
    ["Room.Board"],                 # Case 4
    ["Outstate"],                   # Case 5
    ["Outstate", "Room.Board"],     # Case 6
    ["perc.alumni"],                # Case 7
    ["Expend"],                     # Case 8
    ["S.F.Ratio", "PhD", "Terminal"]  # Case 9 (a "faculty metrics" bundle)
]

baseline = kmeans_silhouette_after_removal(df, [], k=2)

rows = []
for i, remove_cols in enumerate(cases):
    sil = kmeans_silhouette_after_removal(df, remove_cols, k=2)
    rows.append({
        "Case": i,
        "Removed Columns": ", ".join(remove_cols) if remove_cols else "(none)",
        "Silhouette": sil,
        "Change vs Baseline": sil - baseline
    })

report = pd.DataFrame(rows).sort_values("Silhouette", ascending=False)
report

   Case           Removed Columns  Silhouette  Change vs Baseline
9     9  S.F.Ratio, PhD, Terminal    0.390645            0.161982
6     6      Outstate, Room.Board    0.368168            0.139505
5     5                  Outstate    0.354737            0.126074
3     3           Books, Personal    0.257723            0.029060
1     1                     Books    0.245262            0.016600
4     4                Room.Board    0.242950            0.014288
2     2                  Personal    0.240881            0.012218
7     7               perc.alumni    0.237640            0.008977
0     0                    (none)    0.228663            0.000000
8     8                    Expend    0.223165           -0.005498


In [31]:
best = report.iloc[0]
worst = report.iloc[-1]

print("Best case:", best["Removed Columns"], "Silhouette:", best["Silhouette"], "Change:", best["Change vs Baseline"])
print("Worst case:", worst["Removed Columns"], "Silhouette:", worst["Silhouette"], "Change:", worst["Change vs Baseline"])


Best case: S.F.Ratio, PhD, Terminal Silhouette: 0.3906447726524985 Change: 0.16198215612809716
Worst case: Expend Silhouette: 0.22316457148584 Change: -0.0054980450385613
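Since the `Private` labels were set aside rather than discarded, an optional sanity check is to compare the cluster assignments with them after the fact. The sketch below is one possible approach, not part of the exercise as stated: it uses `adjusted_rand_score` on small made-up arrays (`private_toy` and `clusters_toy` are hypothetical, with 1 = private and 0 = public as an assumed encoding). The adjusted Rand index is 1.0 for identical groupings and close to 0 for random ones, and it ignores how the cluster ids happen to be numbered.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: 1 = private, 0 = public (assumed encoding)
private_toy = np.array([1, 1, 0, 0, 1, 0])

# Hypothetical K-Means output with one school placed in the "wrong" group
clusters_toy = np.array([1, 1, 0, 0, 1, 1])

ari = adjusted_rand_score(private_toy, clusters_toy)

# Swapping the cluster ids (0 <-> 1) leaves the score unchanged
ari_flipped = adjusted_rand_score(private_toy, 1 - clusters_toy)

# A perfect match scores exactly 1.0
ari_perfect = adjusted_rand_score(private_toy, private_toy)

print(ari, ari_flipped, ari_perfect)
```

In the notebook itself, the analogous call would take the saved labels and the K-Means output as its two arguments. A high silhouette score and a high agreement with `Private` are different things, so a feature subset can improve one without improving the other.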
