This analysis determines which features are important to the clustering. Features whose modes are identical across all of the clusters are set aside,
which leaves around 13 features. For each remaining feature, a large difference between how often its mode value occurs within a cluster and how often it
occurs in the entire data set suggests that the feature played an important role in how that cluster was formed.

In [None]:
import pandas as pd
from kmodes.kmodes import KModes
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import pickle
import seaborn as sns

In [None]:
df = pd.read_csv("../data/clustered_data.csv")

In [None]:
with open('../models/kmodes_model.pkl', 'rb') as f:
    km = pickle.load(f)

In [None]:
features = df.columns.tolist()[:-1]  # every column except the last one, which holds the cluster label
centroids = km.cluster_centroids_

For most features, the mode is identical in every cluster: the feature is either present in all clusters or absent from all of them. Such features
do not vary between centroids and therefore cannot help explain what separates the clusters. The next cell finds the features whose modes differ
across the clusters.
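
As a minimal sketch of this idea, assume the centroids form a 2-D array with one row per cluster and one column per feature. The names `toy_centroids` and `toy_features` below are made up for illustration and are not the project data. A feature is kept only when its column contains more than one distinct mode:

```python
import numpy as np

# Hypothetical toy centroids: 3 clusters (rows) x 4 features (columns).
toy_centroids = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
toy_features = ["a", "b", "c", "d"]

# Keep a feature when its column holds more than one distinct mode.
differing = [
    name
    for idx, name in enumerate(toy_features)
    if len(set(toy_centroids[:, idx])) > 1
]
print(differing)
```

Features `a` and `d` have the same mode in every cluster, so only `b` and `c` remain.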

In [None]:
unique_modes_features = []
for feature_idx, feature in enumerate(features):
    modes_across_clusters = [centroids[c][feature_idx] for c in range(len(centroids))]
    unique_modes = len(set(modes_across_clusters))
    if unique_modes > 1:
        unique_modes_features.append(feature)

For every cluster, this cell prints the mode of each feature identified in the cell above, along with how often that mode value occurs within the
cluster and within the entire data set. If a value is roughly equally prevalent in the cluster and in the data set, the feature is unlikely to be important
in defining the cluster. A factor of two was chosen arbitrarily as the threshold for flagging features that may be particularly defining. If the mode value
is at least twice as prevalent, or at most half as prevalent, in the cluster as in the data set, the feature likely contributes heavily to how the cluster
was determined. Flagged features are listed at the bottom of each cluster's output with the feature name, mode value, prevalence in the cluster, and
prevalence in the data set.
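
The factor-of-two check can be sketched as a small standalone helper. The function name `is_distinguishing` and its `factor` parameter are hypothetical and only serve the illustration. Treating a zero prevalence as distinguishing whenever the other value is non-zero is an assumption of this sketch, not something the analysis above specifies:

```python
def is_distinguishing(cluster_pct, dataset_pct, factor=2.0):
    """Return True when the two prevalences differ by at least `factor`."""
    low, high = sorted((cluster_pct, dataset_pct))
    if low == 0:
        # Assumption: avoid dividing by zero and flag the feature
        # only when the other prevalence is non-zero.
        return high > 0
    return high / low >= factor


print(is_distinguishing(80.0, 35.0))  # ratio is about 2.29
print(is_distinguishing(60.0, 45.0))  # ratio is about 1.33
```

Sorting the pair first makes the check symmetric, so a value that is twice as common in the cluster and one that is half as common are both flagged.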

In [None]:
for cluster in range(len(centroids)):
    print(f"\n{'='*50}")
    print(f"Cluster {cluster}")
    print(f"{'='*50}")
    
    cluster_data = df[df["cluster"] == cluster]
    print(f"Size: {len(cluster_data)} samples\n")
    print("Distinguishing features:")
    important_features = []
    for feature in unique_modes_features:
        feature_idx = features.index(feature)
        mode_value = centroids[cluster][feature_idx]
        count = (cluster_data[feature] == mode_value).sum()
        cluster_prevalence = (count / len(cluster_data)) * 100
        df_prevalence = (df[feature] == mode_value).mean() * 100  # prevalence of the mode value across the entire df
        print(f"  {feature}: {mode_value} ({cluster_prevalence:.1f}%) || ({df_prevalence:.1f}%)")
        
        # Flag the feature when the two prevalences differ by a factor of two or more
        if max(cluster_prevalence, df_prevalence) / min(cluster_prevalence, df_prevalence) >= 2:
            important_features.append((feature, mode_value, cluster_prevalence, df_prevalence))
    
    for name, mode_value, cluster_pct, df_pct in important_features:
        print(f"** Important Feature - {name}: {mode_value} ({cluster_pct:.1f}%) || ({df_pct:.1f}%)")

In [None]:
len(unique_modes_features)