# Lab Exercise 9

## Group 10
Team members:
- Poleo Vargas, Ricardo
- Naji, Mohamed Oussama
- Ali, Syed Hamza
- Jayswal, Ishabahen Dilipkumar
- Karande, Pranali
- Tran, Cong Le Anh Tu

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
# For this example, we use the K-Means Clustering Project data set from Kaggle (https://www.kaggle.com/faressayah/k-means-clustering-private-vs-public-universities).
# The data set includes labels (the 'Private' column), but K-Means is an unsupervised learning algorithm and does not need them.
# As we will shortly see, the data frame has 777 observations on 18 variables.


In [3]:
df = pd.read_csv('College_Data',index_col=0)
df.columns

Index(['Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
       'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books',
       'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend',
       'Grad.Rate'],
      dtype='object')

In [4]:
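# Converting 'Private' column to numeric format (1 for 'Yes', 0 for 'No').
# Note: the converted column stays in df, so it is passed to K-Means as a feature unless a case below removes it.
df['Private'] = df['Private'].map({'Yes': 1, 'No': 0})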


In [5]:
# Set the graduation rate for Cazenovia College to 100.
# A single .loc call is used because chained indexing (df['Grad.Rate']['Cazenovia College'] = 100)
# raises SettingWithCopyWarning and may write to a temporary copy instead of df.
df.loc['Cazenovia College', 'Grad.Rate'] = 100

# Try removing various columns (features) from the dataset and examine whether it improves or degrades your K-Means model performance, or has little impact.
# Report 10 cases where you removed one or more features and indicate how each one impacted the model performance.
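
A graduation rate above 100% is not meaningful, which is presumably why the Cazenovia College value is set to 100. The following is a minimal sketch of how such values could be located and capped with one `.loc` assignment. It runs on a toy frame: the college names and rates are hypothetical and are not taken from `College_Data`.

```python
import pandas as pd

# Toy frame with hypothetical values; the real data comes from College_Data.
toy = pd.DataFrame(
    {"Grad.Rate": [60, 118, 100]},
    index=["College A", "College B", "College C"],
)

# Boolean mask of rows whose graduation rate exceeds 100%.
too_high = toy["Grad.Rate"] > 100
flagged = toy.index[too_high].tolist()

# A single .loc call writes into the original frame, not a temporary copy.
toy.loc[too_high, "Grad.Rate"] = 100

print(flagged)
print(toy["Grad.Rate"].tolist())
```

Using a mask instead of a hard-coded row label would also catch any other out-of-range rows, if there are any.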


In [6]:

# Function to evaluate K-Means performance
def evaluate_kmeans_performance(data, n_clusters=2):
    """
    Fit a K-Means model to the data and evaluate its performance.

    Parameters:
    data (DataFrame): The input data for clustering.
    n_clusters (int): The number of clusters to form.

    Returns:
    tuple: A tuple containing the inertia and silhouette score of the model.
    """
    # Scaling the data
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data)

    # Fitting K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(data_scaled)

    # Evaluating performance
    inertia = kmeans.inertia_
    silhouette = silhouette_score(data_scaled, kmeans.labels_)

    return inertia, silhouette

# Prepare to document the results for 10 different cases
results = []

# Original columns in the dataset
original_columns = df.columns.tolist()

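# Proceeding with 10 different scenarios of feature removal.
# Note: np.random is not seeded, so every run draws different feature subsets
# and the results table will differ from run to run.
for i in range(10):
    # Randomly select how many features to remove (between 1 and len(original_columns) - 1,
    # so at least one feature always remains)
    num_features_to_remove = np.random.randint(1, len(original_columns))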
    features_to_remove = np.random.choice(original_columns, num_features_to_remove, replace=False)
    modified_data = df.drop(columns=features_to_remove)

    # Evaluate and document the performance
    inertia, silhouette = evaluate_kmeans_performance(modified_data)
    results.append({
        "Case": i+1,
        "Removed Features": features_to_remove.tolist(),
        "Inertia": inertia,
        "Silhouette Score": silhouette
    })

# Displaying the results for the 10 cases
results_df = pd.DataFrame(results)
results_df




| | Case | Removed Features | Inertia | Silhouette Score |
|---|---|---|---|---|
| 0 | 1 | [Books] | 10337.870018 | 0.276798 |
| 1 | 2 | [Expend, Top25perc, Accept, Private, Grad.Rate] | 8078.352094 | 0.306957 |
| 2 | 3 | [Top25perc, Personal, Enroll, F.Undergrad] | 8460.90821 | 0.214368 |
| 3 | 4 | [Top25perc, Accept, Enroll, Grad.Rate, Termina... | 7274.206567 | 0.207724 |
| 4 | 5 | [Accept, Terminal, Top25perc, Top10perc] | 8265.525308 | 0.296638 |
| 5 | 6 | [Personal, PhD, Terminal, P.Undergrad, F.Under... | 3564.947861 | 0.234446 |
| 6 | 7 | [Enroll, Personal, Terminal, Private] | 8476.083662 | 0.240436 |
| 7 | 8 | [Books, Top25perc, perc.alumni, P.Undergrad, G... | 3099.973523 | 0.502702 |
| 8 | 9 | [perc.alumni, Terminal] | 9701.149689 | 0.30428 |
| 9 | 10 | [Top10perc, Outstate] | 9704.275338 | 0.302484 |
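
When reading the Inertia column, note that inertia is a sum of squared distances over all retained features, so cases with different numbers of columns are on different scales. One possible way to put them on a common scale is to divide by `n_samples * n_features`, which on standardized data equals the total sum of squares. The sketch below illustrates this on synthetic data from `make_blobs`. The helper name `inertia_per_feature` and the explicit `n_init=10` are assumptions made for the illustration and are not part of the exercise code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the college data (assumption, not the real data set).
X, _ = make_blobs(n_samples=300, centers=2, n_features=6, random_state=0)


def inertia_per_feature(data, n_clusters=2):
    """Return raw inertia and inertia divided by (n_samples * n_features)."""
    scaled = StandardScaler().fit_transform(data)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(scaled)
    n_samples, n_features = scaled.shape
    # After standardization every column has sum of squares n_samples, so the
    # total sum of squares is n_samples * n_features. The ratio is therefore
    # the share of total variance that the clusters leave unexplained.
    return model.inertia_, model.inertia_ / (n_samples * n_features)


raw_all, norm_all = inertia_per_feature(X)          # all 6 columns
raw_half, norm_half = inertia_per_feature(X[:, :3])  # only the first 3 columns

print(raw_all, norm_all)
print(raw_half, norm_half)
```

The normalized value lies between 0 and 1 regardless of how many columns were kept, which makes it easier to compare across cases than the raw inertia.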


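# ***Conclusion:*** Across these 10 random cases the silhouette score ranged from 0.208 (Case 4) to 0.503 (Case 8), so the choice of removed features clearly affects clustering quality. Removing non-informative or redundant features tends to improve it by simplifying the data structure, whereas removing key features can degrade it. Inertia is not directly comparable across cases, because it is a sum of squared distances over the remaining features and therefore tends to shrink as more columns are removed.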
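
The effect of non-informative features can be illustrated in isolation. The following is a minimal sketch on synthetic data, not on the college data set: three columns are drawn around two well-separated centers, three further columns are pure Gaussian noise, and the silhouette score is computed with and without the noise columns. The centers, sample size, helper name `silhouette_for`, and `n_init=10` are all assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three informative columns drawn around two well-separated centers (synthetic).
informative, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5, -5], [5, 5, 5]],
    cluster_std=1.0,
    random_state=0,
)
# Three pure-noise columns that carry no cluster structure.
noise = rng.normal(size=(300, 3))
X_all = np.hstack([informative, noise])


def silhouette_for(data, n_clusters=2):
    """Standardize, fit K-Means, and return the silhouette score."""
    scaled = StandardScaler().fit_transform(data)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(scaled)
    return silhouette_score(scaled, labels)


score_all = silhouette_for(X_all)              # informative + noise columns
score_informative = silhouette_for(informative)  # noise columns removed

print(score_all, score_informative)
```

In this setup the noise columns add within-cluster distance without adding separation, so dropping them raises the silhouette score. Whether a given column of the college data behaves like noise would have to be checked case by case, for example by removing one column at a time and comparing against the score with all columns present.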