# KMeans Clustering Forward - Inverse Project


## Introduction to project


### Background

KMeans clustering is a commonly used unsupervised machine learning technique for identifying which data points may be related, based on the similarity of their feature values.  Because supervised learning techniques make use of labeled data, the accuracy of the fit can be assessed by splitting the data into training and testing sets, fitting the model with the training set, and measuring the accuracy of the predictions on the testing set.  However, assessing the quality of the fit for an unsupervised model with an unlabeled data set is not so straightforward.  Presumably, the KMeans algorithm should do a better job of properly identifying clusters when the inter-cluster distances significantly exceed the intra-cluster data point distances, but accuracy should degrade as the clusters are moved closer together and the data points begin to overlap.
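The expected degradation can be checked with a small experiment.  The sketch below is not part of the project code: it places cluster centers on circles of two different radii, fits KMeans with the true number of clusters, and scores each fit with the adjusted Rand index (my choice of metric for illustration, picked because it does not depend on label order).  The function name `ari_for_radius` and all parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def ari_for_radius(radius, n_clusters=4, std=1.0, n_points=400, seed=0):
    """Fit KMeans to blobs centered on a circle and score the fit against the true labels."""
    # cluster centers equally spaced on a circle of the given radius
    angles = 2 * np.pi * np.arange(n_clusters) / n_clusters
    centers = np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

    # forward model: noisy points around each center, with known labels
    X, y_true = make_blobs(n_samples=n_points, centers=centers,
                           cluster_std=std, random_state=seed)

    # inverse model: KMeans with the true number of clusters
    y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)

    # the adjusted Rand index ignores label order, so no reconciliation is needed
    return adjusted_rand_score(y_true, y_pred)


ari_far = ari_for_radius(radius=10.0)   # well-separated clusters
ari_near = ari_for_radius(radius=0.5)   # heavily overlapping clusters
print(f"ARI far: {ari_far:.2f}, ARI near: {ari_near:.2f}")
```

A score near 1 means the recovered grouping matches the true grouping, and a score near 0 means it is no better than chance.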


### Overview of Project

The goal of this project is to assess the fit of the KMeans clustering algorithm in `Scikit-Learn` by forward modeling a data set of several clusters with noise, applying the KMeans clustering algorithm to divide the generated data set into clusters, and then visualizing the sets of clusters to compare the forward and inverse models.  

Selection of the number of clusters to use (K in KMeans) is accomplished by fitting the data with a series of K values and selecting the 'best' fit using the Akaike Information Criterion (AIC).  A more common strategy for selecting the 'best' fit is to plot the residual sum of squares versus the number of clusters and select the point of maximum curvature, i.e., the 'elbow' in the curve.  In my experience applying this approach to picking regularization hyper-parameters in regression/inversion modeling, it is easy to pick the elbow visually, but automated picking of the point of maximum curvature regularly fails because of noise in the plot that the eye can smooth over.  This noise can come from many sources, e.g., an incomplete fit due to hitting the iteration limit (`max_iter` in `Scikit-Learn`'s `KMeans`) in the clustering algorithm.  Since the goal of the project is to implement fully automated picking, I felt that AIC was a more robust solution.
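For reference, the score computed by the `KMeans_AIC` function in the inverse model section can be written roughly as follows.  This is a sketch of the criterion as this notebook uses it, not a general definition of AIC.

$$
\mathrm{AIC}(K) = \frac{1}{\sigma^2} \sum_{i=1}^{N} \lVert x_i - \mu_{c(i)} \rVert^2 + 2 K d
$$

Here $x_i$ are the $N$ data points, $\mu_{c(i)}$ is the centroid of the cluster assigned to point $i$ (so the sum is the `inertia_` of the fit), $\sigma^2$ is the variance of the distances of the data points from the overall mean of the data, $K$ is the number of clusters, and $d$ is the number of features per data point.  The first term rewards a tighter fit, the second penalizes additional clusters, and the K with the smallest value is selected.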

Assessment of the accuracy of the fit is done visually, although in the future it could be quantified using Fisher's exact test or a $\chi^2$ test.  The output is a pair of plots: the forward modeled data points, color coded by true label, alongside the inverse modeled data points, color coded by the label assigned by the clustering algorithm.  Labels for the forward and inverse data are reconciled to avoid any label mismatch due to the order of the inverted cluster centroids provided by the KMeans algorithm.
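As a rough idea of what the $\chi^2$ option could look like, the sketch below builds a contingency table of true versus assigned labels and passes it to `scipy.stats.chi2_contingency`.  It is not implemented in this project, and the label arrays are made-up values for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical labels for illustration: 3 true clusters of 20 points each
true_labels = np.repeat([0, 1, 2], 20)

# pretend the clustering mislabeled two of the points
assigned_labels = true_labels.copy()
assigned_labels[[0, 25]] = [1, 2]

# contingency table: rows are true labels, columns are assigned labels
table = pd.crosstab(pd.Series(true_labels, name="true"),
                    pd.Series(assigned_labels, name="assigned"))

# chi-squared test of independence between the two labelings
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2_stat:.1f}, dof={dof}, p={p_value:.2e}")
```

A small p-value only says that the assigned labels are associated with the true labels.  It does not by itself measure how many points were labeled correctly, so it would complement the visual comparison rather than replace it.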

The overall algorithm is made interactive so the user can select forward model input values, run the full algorithm, and see the updated output simply by adjusting the values with slider widgets and hitting the 'Run' button.


### Structure of Project

The project is structured in four sections: the forward model, the inverse model, data visualization, and interactive implementation.  Each section contains a series of relevant functions, culminating in a single function that implements that section of the algorithm.  For example, the inverse modeling section has functions to fit the forward modeled data with an inverse model of K clusters, to repeat this fit procedure for different K values, to 'score' each fit, to select the ideal K value based on the returned scores, and to reconcile labels to facilitate plotting, all called by a 'master' function for that section that returns the relevant output.  The 'master' functions for each of the first three sections are called in the `main_function()`, which itself is called by the `interact_manual()` function in the interactive section of the program.  Thus, changing variable values using the interactive widgets generates new forward and inverse model data, which are visualized using the data visualization section.



Scroll to the bottom of this page to use this algorithm, or slowly work your way down to see how it works.

In [1]:
#preamble
#Some of these packages are probably not part of the normal distribution,
#but I chose not to force install them
import pandas as pd
import numpy as np
import math

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

from ipywidgets import interact_manual
import ipywidgets as widgets

In [2]:
#centers Matplotlib figures on the page
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

## Forward Model Section

This section implements the forward model by generating a series of clusters along the perimeter of a circle.  This section takes the number of clusters and the radius of the circle to define (x,y) values for the forward model cluster centers.  Normally distributed random noise is then added around each center to create a cluster of data points.  The total number of data points across all clusters is set by the user.  This section returns a dictionary that contains the true cluster centers, (x,y) values for the individual data points, and the true cluster labels.
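The geometry of the cluster centers determines how hard the problem is.  For n centers equally spaced on a circle of radius r, adjacent centers are separated by the chord length 2·r·sin(π/n), which can be compared against the noise standard deviation.  The standalone sketch below checks this with NumPy; `circle_centers` is a hypothetical helper for illustration, not one of the project functions.

```python
import numpy as np


def circle_centers(n, r, start_radians=np.pi / 2):
    """Return an (n, 2) array of centers equally spaced on a circle of radius r."""
    angles = start_radians + 2 * np.pi * np.arange(n) / n
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])


n, r = 6, 10.0
centers = circle_centers(n, r)

# distance from each center to the next one around the circle (wrapping at the end)
adjacent_dist = np.linalg.norm(centers - np.roll(centers, -1, axis=0), axis=1)

# closed-form chord length between adjacent centers
expected_chord = 2 * r * np.sin(np.pi / n)

print(adjacent_dist)
print(expected_chord)
```

With six clusters the chord length equals the radius, so a radius that is several times the noise standard deviation should give well-separated clusters.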

In [3]:
#functions for defining cluster centers, making the clusters, and computing true distance between cluster centers

def define_cluster_center(n, r, start_radians=math.pi/2):
    #defines the centroid for each cluster of points
    #clusters are equally spaced along perimeter of circle with radius r around origin
    #n=number of clusters
    #r=radius
    #start_radians=angle to put first cluster center point, math.pi/2 means it starts at North
    #returns x_y list of paired values
    
    #computes the radial positions in radians
    rad=np.arange(n)*2*math.pi/n + start_radians
    
    #converts polar coordinates to cartesian
    #saves in list of tuples [(x0,y0), (x1,y1), ... , (xn,yn)]
    #https://stackoverflow.com/questions/53074230/how-to-combine-list-of-x-and-list-of-y-into-one-x-y-list
    x_y_list = list(zip(r*np.cos(rad), r*np.sin(rad)))
    
    #return(rad)
    return(x_y_list)
#testing define_cluster_center function
#rad=define_cluster_center(6,1,0)
#x_y_list=define_cluster_center(6,1,math.pi/2)
#print(x_y_list)

def euclidean_dist(x1,y1,x2,y2):
    #computes euclidean distance between two points
    #x1,x2,y1,y2 are x,y coordinates for the two points
    #simplified 2D solution, not general nD solution
    return(np.sqrt((x2-x1)**2 + (y2-y1)**2))

#computes true distance between adjacent cluster centers
#[-1] to [0], [0] to [1], etc.
#dist_list=[euclidean_dist(*x_y_list[i],*x_y_list[i+1]) for i in range(-1, len(x_y_list)-1)]
#print(dist_list)
#print(sum(dist_list)/6)

def make_clusters(x_y_list=[(0,0)], nn=1000, std=1):
    #makes the clusters of data points
    #x_y_list=list of tuples of x,y for centers of clusters
    #nn=total number of data points
    #std=1sd for cluster data points around cluster center
    
    #for reproducibility I have set a consistent seed for the random number generator
    #need to deactivate to do Monte Carlo simulations
    return(make_blobs(n_samples=nn, n_features=2, centers=x_y_list, cluster_std=std, random_state=0))
#testing for the make clusters function 
#test_clusters=make_clusters(x_y_list, nn=1000, std=1)
#test_clusters[0]

def generate_fwd_data(n=6, r=1, std=1, nn=1000, start_radians=math.pi/2):
    #master function for this section
    #generates cluster centers and clustered data around centers
    #n=number of clusters
    #r=radius
    #std=1sd for cluster data points around cluster center
    #nn=number of data points
    #start_radians=angle to put first cluster center point, math.pi/2 means it starts at North
    #returns dictionary of the center points and the data points
    
    #generates the centroids
    x_y_list=define_cluster_center(n=n,r=r,start_radians=start_radians)
    
    #creates cluster blobs around the specified centers
    data_points=make_clusters(x_y_list,nn=nn, std=std)
    
    #packages x_y_list and data points as a dictionary to pass as a single object
    fwd_dict={'x_y_list':x_y_list, 'data_points':data_points}
    
    #return x_y_list, data_points
    return fwd_dict
#testing for generate_clusters
#fwd_dict=generate_fwd_data(n=6, r=10, std=1, nn=1000, start_radians=math.pi/2)
#x_y_list=fwd_dict['x_y_list']
#test_data_points=fwd_dict['data_points']
#test_data_points[0]
#x_y_list

## Inverse Model Section

This section implements the inverse model by taking the data points generated in the forward model section and fitting them with K clusters.  The data are fit with different values of K, each fit is 'scored' by AIC, and the 'best' fit is returned.  Even if the inverse centroids perfectly line up with the forward modeled cluster centers, they may not be sorted in the same order by the KMeans algorithm, thus the labels for each data point could be wrong even if they have been perfectly clustered.  Labels are reconciled by matching the inverse centroids to their nearest forward modeled cluster centers.  This section returns the ideal K value, the corresponding `KMeans()` object, and an array of the tested AIC values for visualization.
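The label ordering problem is easiest to see on a tiny example.  The sketch below is a simplified illustration of nearest-center matching for the case where the forward and inverse models have the same number of clusters.  It is not the full `reconcile_labels` function, and all of the arrays are made-up values.

```python
import numpy as np

# hypothetical forward (true) cluster centers, labeled 0..3
true_centers = np.array([[0.0, 10.0], [-10.0, 0.0], [0.0, -10.0], [10.0, 0.0]])

# hypothetical fitted centroids: nearly the same locations, but in a different order
fitted_centers = np.array([[9.8, 0.1], [0.2, 9.9], [-0.1, -10.2], [-9.9, 0.3]])

# hypothetical labels assigned to seven data points by the fit
fitted_labels = np.array([0, 0, 1, 2, 3, 3, 1])

# pairwise distances: rows are fitted centroids, columns are true centers
diff = fitted_centers[:, None, :] - true_centers[None, :, :]
dist = np.linalg.norm(diff, axis=2)

# for each fitted centroid, the index of the nearest true center
fitted_to_true = dist.argmin(axis=1)

# relabel every data point with the true label of its fitted centroid
reconciled_labels = fitted_to_true[fitted_labels]

print(fitted_to_true)
print(reconciled_labels)
```

Nearest-center matching can map two fitted centroids to the same true center when the fit is poor or the cluster counts differ, which is why the full function below treats those cases separately.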

In [4]:
#functions to compute the clusters, AIC score, and select 'best' result

def compute_clusters(X, n=6):
    #computes a cluster object with n clusters for data points X
    #X is the feature data set
    #n is the number of clusters to compute, I am not sure why but I think n must be >=2 (no single cluster)
    #returns the cluster object
    
    #for reproducibility I have set a consistent seed for the random number generator
    clust_obj=KMeans(n_clusters = n, init='k-means++', random_state=0)
    
    #fits the data points in feature data set
    clust_obj.fit(X)
    
    return(clust_obj)
#tests the cluster function
#test_clust=compute_clusters(test_data_points[0], n=6)
#print(test_clust.labels_)
#test_clust.inertia_
#len(test_clust.cluster_centers_)
#print(test_clust.score(test_data_points[0]))

def KMeans_AIC(clust_fit, data_points):
    #computes an AIC score for the current cluster fit
    #clust_fit is the current kmeans cluster
    #https://stats.stackexchange.com/questions/271516/akaike-information-criterion-for-k-means
    #returns AIC value
    
    #do not assume data set has been scaled
    #compute the mean of the data and the standard deviation distance
    #if data has been scaled then standard deviation should be 1
    mean_data=np.mean(data_points,axis=0)
    #print(mean_data)
    
    #computes euclidean distance from mean data points to all data points
    dist_from_mean=euclidean_dist(*mean_data,data_points[:,0],data_points[:,1])
    #print(dist_from_mean)
    
    #computes the standard deviation of the data points from the mean
    std_dist=np.sqrt(np.sum(dist_from_mean**2)/(len(dist_from_mean)-1))
    #print(std_dist)
    
    #solves AIC score equation, without assuming the data is scaled (std need not be 1)
    #AIC=1/std^2 * SSQ(distance between point and nearest cluster center) 
    #    + 2 * num clusters * dimensions in each data point
    AIC_val = 1/std_dist**2 * clust_fit.inertia_ + 2 * len(clust_fit.cluster_centers_) * clust_fit.n_features_in_
    #print(2 * len(clust_fit.cluster_centers_) * clust_fit.n_features_in_)
    
    return(AIC_val)
#testing
#AIC_test=KMeans_AIC(test_clust,test_data_points[0])
#print(AIC_test)


def cluster_and_score(data_points, min_clust=2, max_clust=10):
    #computes KMeans cluster fits and AIC score for the given set of data points
    #for a number of clusters from k=min_clust to k=max_clust
    #I don't think Kmeans algorithm allows for single cluster so min_clust>=2
    #data points=the set of data points
    #min_clust, max_clust are the minimum and maximum number of clusters to try and fit the data with
    
    #enforces min_clust>=2
    min_clust=max([min_clust,2])
    
    #fits the data with k clusters
    #I have adaptive algorithms that use a spline fit to search for the 'best' fit and can
    #expand the search range beyond what was initially specified, but for this project I am just going to
    #brute force through the options
    clust_fit_list=[compute_clusters(data_points, k) for k in range(min_clust, max_clust+1)]
    
    #computes AIC values for all of the clusters 
    AIC_arr=np.array([KMeans_AIC(clust_fit, data_points) for clust_fit in clust_fit_list])
    
    #finds the minimum AIC value and returns the number of clusters
    k_ind=np.argmin(AIC_arr)
    ideal_k=k_ind+min_clust
    #print(ideal_k)
    
    #returns the ideal number of clusters
    #the corresponding cluster object
    #and the AIC score array
    return(ideal_k, clust_fit_list[k_ind], AIC_arr)
#testing, test_data_points[0] are the x-y for the data points test_data_points[1] are the labels
#ideal_k, clust_obj, AIC_arr=cluster_and_score(test_data_points[0], min_clust=2, max_clust=4)
#AIC_arr
#clust_obj.cluster_centers_
#print(clust_obj.labels_.shape)

def reconcile_labels(x_y_list,clust_obj1):
    #even if the inverse modeled clusters in clust_obj are very similar in location to the 
    #forward modeled clusters in x_y_list, there is no guarantee the labels will be consistent
    #matches clust_obj.cluster_centers_ to the closest value in x_y_list, and relabels the clusters
    #based on that.
    
    #gets the number of labels in the forward model
    num_fwd_labels=len(x_y_list)
    #gets the number of labels in the inverse model
    num_inv_labels=len(clust_obj1.cluster_centers_)
    
    #initializes a new labels array
    #will be put in clust_obj.labels_reconciled_=new_labels
    new_labels=np.zeros_like(clust_obj1.labels_)
    
    def update_label_val(new_labels, current_labels, current_value_to_change, new_label):
        #takes new_labels and updates the current_labels where the values is the current_value_to_change
        #to the new label
        #new_labels=np.where(clust_obj1.labels_==fwd_label_inv_label[fwd_label],
        #                   fwd_label,new_labels)
        new_labels=np.where(current_labels==current_value_to_change,
                            new_label,new_labels)
        return(new_labels)
    
    #reconciles the labels using different approaches based on how many inverse and forward labels there are
    #there may be a more elegant way to do this, but I broke it into the case where there are more inverse labels and 
    #we need to match the forward labels to the closest inverse label and deal with any excess inverse labels
    #and the case where there are fewer inverse labels and we need to match the inverse labels to the closest
    #forward label
    if num_inv_labels>=num_fwd_labels:
        #when there are as many or more inverse labels as there are forward labels
        
        #computes pairwise distance between the forward cluster points in x_y_list 
        #and the inverse cluster_points in clust_obj.cluster_centers_
        dist_mat_fwd_inv=pairwise_distances(X=x_y_list, Y=clust_obj1.cluster_centers_)
        
        #looks for the minimum distance in each row
        #this is the inverse label value that most closely corresponds to this forward centroid
        #[14 15  9  8 12 11] means forward label 0 most closely aligns with inverse label 14, 
        #forward label 1 most closely aligns with inverse label 15
        #this assumes there are at least as many inverse labels as forward labels
        #if there are fewer inverse labels then we need to figure out which forward label most closely corresponds
        #to the inverse label in a separate step
        fwd_label_inv_label=dist_mat_fwd_inv.argmin(axis=1)
        
        #print(clust_obj.cluster_centers_)
        #print(clust_obj.labels_)
        #print(dist_mat_fwd_inverse)
        #print(fwd_label_inv_label)
        
        #list of values to indicate which inverse centroids correspond to which forward centroids 
        #once they are aligned in the reconciliation process
        #initializes empty list of correct length
        #list says which reconciled forward label the inverse label corresponds to
        centroid_reconcile_labels=[None] * num_inv_labels
        
        #goes through and updates the inverse labels to their closest forward label
        for fwd_label in range(num_fwd_labels):
            #new_labels=np.where(clust_obj1.labels_==fwd_label_inv_label[fwd_label],
             #                   fwd_label,new_labels)
            new_labels=update_label_val(new_labels, clust_obj1.labels_, fwd_label_inv_label[fwd_label], fwd_label)
            
            centroid_reconcile_labels[fwd_label_inv_label[fwd_label]]=fwd_label
            
        #print(new_labels)
        #print(centroid_reconcile_labels)
        
        if num_inv_labels>num_fwd_labels:
            
            #initiates a variable to be a rolling counter for the new labels to replace 
            #set as the last of the forward labels
            new_label_val=num_fwd_labels-1
            
            #these are the inverse labels that did not most closely correspond to a forward label
            #and now need to be shifted since we may have already used this label name in reconciling 
            #forward and inverse labels in the last step
            labels_to_shift=[i for i in range(num_inv_labels) if i not in fwd_label_inv_label]
            #print(labels_to_shift)
            
            #steps through each of the inv_labels in labels_to_shift
            #shifts them to the next available label number and updates the values in new_labels
            for inv_label in labels_to_shift:
                
                #increments the new label value variable to keep track of the next available label number
                new_label_val+=1
                
                #reorders the inverse labels that do not correspond to a forward label
                new_labels=update_label_val(new_labels, clust_obj1.labels_, inv_label, new_label_val)
                
                centroid_reconcile_labels[inv_label]=new_label_val
                #print(new_label_val)
            #print(centroid_reconcile_labels)
        
    else:
        #when there are fewer inverse labels than forward labels
        
        #computes pairwise distance between the forward cluster points in x_y_list 
        #and the inverse cluster_points in clust_obj.cluster_centers_
        dist_mat_inv_fwd=pairwise_distances(X=clust_obj1.cluster_centers_, Y=x_y_list)
        
        #if there are fewer inverse labels then we need to figure out which forward label most closely corresponds
        #to the inverse labels
        #[5 2 0 3] means inverse label 0 corresponds most closely with forward label 5, 
        #and inverse label 1 matches most closely with forward label 2
        inv_label_fwd_label=dist_mat_inv_fwd.argmin(axis=1)
        #print(inv_label_fwd_label)
        #this is already in the correct format for centroid_reconcile_labels
        centroid_reconcile_labels=inv_label_fwd_label
        
        #goes through and updates the inverse labels to their closest forward label
        for inv_label in range(num_inv_labels):
            #new_labels=np.where(clust_obj1.labels_==fwd_label_inv_label[fwd_label],
             #                   fwd_label,new_labels)
            new_labels=update_label_val(new_labels, clust_obj1.labels_, inv_label, inv_label_fwd_label[inv_label])
    
    #print(clust_obj.labels_)
    #print(new_labels)
    
    #updates clust_obj with the reconciled labels while preserving the original labels
    clust_obj1.labels_reconcile_=new_labels
    clust_obj1.centroid_reconcile_labels_=centroid_reconcile_labels
    
    return(clust_obj1)
#testing
#clust_obj=reconcile_labels(fwd_dict['x_y_list'],clust_obj)

def fit_clusters(fwd_dict, min_clust=2, max_clust=10):
    #master function for this section
    #implements the full KMeans cluster fit
    #fwd_dict=the dictionary of forward modeled data
    #fwd_dict['data_points'][0] are the xy data points to fit
    #min_clust=minimum number of clusters to try, needs to be >=2
    #max_clust=max number of clusters to try, increasing number increases processing time
    
    
    #fits the data with a clustering algorithm
    #tries a series of numbers of clusters and chooses the 'ideal' answer based on AIC score
    #returns the ideal_k value based on AIC score
    #returns the cluster_object that corresponds to ideal_k
    ideal_k, clust_obj, AIC_arr=cluster_and_score(fwd_dict['data_points'][0], min_clust, max_clust)
    
    #updates clust_obj to reconcile the labels to plot in a unified color scheme
    clust_obj=reconcile_labels(fwd_dict['x_y_list'],clust_obj)
    
    return ideal_k, clust_obj, AIC_arr
#testing
#ideal_k, clust_obj, AIC_arr=fit_clusters(fwd_dict, min_clust=2, max_clust=12)
#print(clust_obj.centroid_reconcile_labels_)

## Visualization Section 1 - `Matplotlib`

This section was my first cut at visualizing the output of the forward and inverse models using subplots from `Matplotlib`.  While the plots look nice, I realized it would be useful to be able to zoom and move around the plots, which cannot be done with the static inline `Matplotlib` figures used here.  I abandoned this approach in favor of using `Plotly` in a second visualization section, but I decided to preserve the code I had already written for future reference.

In [5]:
#fig=plt.figure(figsize=(12, 4))

#matplotlib plotting, I think the plotly plots look better and you can zoom in
#I will keep this code here in case I want to refer to it in the future
    
def initialize_fig(figsize=(24, 20)):
    fig=plt.figure(figsize=figsize)
    
    return(fig)
#testing
#fig=initialize_fig()

def scatter_plot_inverse_clusters(clust_obj, fwd_dict, fig):
    #plots inverse problem clusters, color coded by cluster number
    #I considered marking mis clustered data points but it feels too crowded
    #uses clust_obj.labels_reconcile_ so that cluster colors are consistent across plots

    #unpacks fwd_dict
    fwd_cluster_centers=fwd_dict['x_y_list']
    data_points=fwd_dict['data_points']
    #data_points[0] is x,y, data_points[1] are forward labels
    fwd_xy=data_points[0]
    fwd_label=data_points[1]
    #print(fwd_xy)
    
    #number of clusters to use in making the plot
    #extracted from the number of clusters in the cluster object
    k_use=len(set(clust_obj.labels_))
    
    #extracts the maximum number of clusters that will be plotted
    #this tries to match color ranges between plots
    k_color=max([len(set(clust_obj.labels_)),len(fwd_cluster_centers)])
    
    #gets a color map so each cluster is plotted in a different color
    colors = plt.cm.Spectral(np.linspace(0, 1, k_color))
    
    #adds the axis to the figure
    ax=fig.add_subplot(2,2,1)
    
    #steps through each cluster number and plots it using the assigned color
    #for k, col in zip(range(k_use), colors[:k_use]):
    for k in range(k_use):
        
        #boolean for whether data points are members of the current cluster
        cluster_member_bool = (clust_obj.labels_ == k)
        
        #sets color based on clust_obj.labels_reconcile_ so colors match between plots
        k_reconcile=clust_obj.labels_reconcile_[cluster_member_bool][0]
        col=colors[k_reconcile]
        #print([k,k_reconcile])
        
        #recovers corresponding cluster center, which need not be 
        cluster_center = clust_obj.cluster_centers_[k]
        
        #plots the cluster data points
        ax.plot(fwd_xy[cluster_member_bool,0], fwd_xy[cluster_member_bool,1], linewidth=0, color='w', 
                markerfacecolor=col, marker='.', markersize=10)
        
        #plots the cluster centroids
        ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,  markeredgecolor='k', markersize=15)
        
    #returns the updated figure for further plotting
    return (fig)
#fig=scatter_plot_inverse_clusters(clust_obj, fwd_dict, fig)
#fig.show()

def scatter_plot_forward_clusters(clust_obj, fwd_dict, fig):
    #plots forward problem clusters, color coded by cluster number
    #I considered marking mis clustered data points but it feels too crowded
    #uses clust_obj.labels_reconcile_ so that cluster colors are consistent across plots

    #unpacks fwd_dict
    fwd_cluster_centers=fwd_dict['x_y_list']
    data_points=fwd_dict['data_points']
    #data_points[0] is x,y, data_points[1] are forward labels
    fwd_xy=data_points[0]
    fwd_label=data_points[1]
    
    
    #number of clusters to use in making the plot
    #extracted from the number of clusters in the cluster object
    k_use=len(fwd_cluster_centers)
    
    #extracts the maximum number of clusters that will be plotted
    #this tries to match color ranges between plots
    k_color=max([len(set(clust_obj.labels_)),len(fwd_cluster_centers)])
    
    #gets a color map so each cluster is plotted in a different color
    colors = plt.cm.Spectral(np.linspace(0, 1, k_color))
    
    #adds the axis to the figure
    ax=fig.add_subplot(2,2,3)
    
    #steps through each cluster number and plots it using the assigned color
    for k, col in zip(range(k_use), colors[:k_use]):
    #for k in range(k_use):
    
        #boolean for whether data points are members of the current cluster
        cluster_member_bool = (fwd_label == k)
        
        ##sets color based on clust_obj.labels_reconcile_ so colors match between plots
        #k_reconcile=clust_obj.labels_[cluster_member_bool][0]
        #col=colors[k_reconcile]
        #print([k,k_reconcile])
        
        #recovers corresponding cluster center, which need not be 
        cluster_center = fwd_cluster_centers[k]
        
        #plots the cluster data points
        ax.plot(fwd_xy[cluster_member_bool,0], fwd_xy[cluster_member_bool,1], linewidth=0, color='w', 
                markerfacecolor=col, marker='.', markersize=10)
        
        #plots the cluster centroids
        ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,  markeredgecolor='k', markersize=15)
        
    #returns the updated figure for further plotting
    return (fig)
#fig=scatter_plot_forward_clusters(clust_obj, fwd_dict, fig)

## Visualization Section 2 - `Plotly` and `Matplotlib`

This section was my second cut at visualizing the output of the function using a mix of `Plotly` and `Matplotlib` generated plots.  This section takes the forward modeled data in `fwd_dict` and the inverse modeled fit in `clust_obj`, and generates a pair of `Plotly` plots to show data point labeling and cluster centers/centroids in the forward and inverse models, as well as a pair of `Matplotlib` plots to visualize the confusion matrix for the labels and the results of the AIC fit.  The `Plotly` plots are dynamic and linked, so zooming or panning in one results in a commensurate change in the other.  The `Matplotlib` plots are static within any run of the plotting function, but they present data that are suitable for static display.
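The confusion matrix is a cross-tabulation of the reconciled labels against the true labels.  The sketch below shows the idea with `pd.crosstab` on a small made-up data frame; the column names mirror those used in this section, but the values are hypothetical.

```python
import pandas as pd

# hypothetical labels for eight data points, for illustration only
labels_df = pd.DataFrame({
    "Reconciled Labels": [0, 0, 1, 1, 2, 2, 2, 1],
    "True Labels":       [0, 0, 1, 1, 2, 2, 1, 1],
})

# rows are the reconciled (inverse) labels, columns are the true (forward) labels
confusion = pd.crosstab(labels_df["Reconciled Labels"], labels_df["True Labels"])

# fraction of points whose reconciled label matches the true label
agreement = (labels_df["Reconciled Labels"] == labels_df["True Labels"]).mean()

print(confusion)
print(agreement)
```

Counts on the diagonal are points whose reconciled label matches the true label, and off-diagonal counts show which clusters are being confused with each other.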

In [6]:
#builds plots in plotly


def build_dfs(fwd_dict,clust_obj):
    #plotly seems to work best with dataframes so I am migrating cluster forward and inverse data to dataframe
    #fwd_dict is the forward data and clust_obj is the fitted inverse model
    
    #print(fwd_dict['data_points'][0][:,0])
    fwd_data_df=pd.DataFrame({'x': fwd_dict['data_points'][0][:,0],
                              'y': fwd_dict['data_points'][0][:,1],
                              'Labels': fwd_dict['data_points'][1],
                              'Reconciled Labels': fwd_dict['data_points'][1],
                              'Data Type':'Forward'})
    
    #print(fwd_data_df.head())
    #print([fwd_dict['x_y_list'][i][0] for i in range(len(fwd_dict['x_y_list']))])
    fwd_center_df=pd.DataFrame({'x': [fwd_dict['x_y_list'][i][0] for i in range(len(fwd_dict['x_y_list']))],
                                'y': [fwd_dict['x_y_list'][i][1] for i in range(len(fwd_dict['x_y_list']))],
                                'Labels':list(range(len(fwd_dict['x_y_list']))),
                                'Reconciled Labels':list(range(len(fwd_dict['x_y_list']))),
                                'Data Type':'Forward'})
    #print(fwd_center_df.head(6))
    inv_data_df=pd.DataFrame({'x': fwd_dict['data_points'][0][:,0],
                              'y': fwd_dict['data_points'][0][:,1],
                              'Labels': clust_obj.labels_,
                              'Reconciled Labels': clust_obj.labels_reconcile_,
                              'Data Type':'Inverse',
                              'True Labels':fwd_dict['data_points'][1]})
    
    #print(inv_data_df.head())
    #print((clust_obj.centroid_reconcile_labels_))
    #print(list(range(len(clust_obj.cluster_centers_))))
    #print(clust_obj.cluster_centers_)
    inv_center_df=pd.DataFrame({'x': [clust_obj.cluster_centers_[i][0] for i in range(len(clust_obj.cluster_centers_))],
                                'y': [clust_obj.cluster_centers_[i][1] for i in range(len(clust_obj.cluster_centers_))],
                                'Labels': list(range(len(clust_obj.cluster_centers_))),
                                'Reconciled Labels': clust_obj.centroid_reconcile_labels_,
                                'Data Type':'Inverse'})
    
    #print(inv_center_df.head())
    return fwd_data_df, fwd_center_df, inv_data_df, inv_center_df
#testing
#fwd_data_df, fwd_center_df, inv_data_df, inv_center_df = build_dfs(fwd_dict,clust_obj)
#build_dfs(fwd_dict,clust_obj)

def plot_centroids(fwd_data_df, fwd_center_df, inv_data_df, inv_center_df):
    #builds a plotly express plot to visualize the fwd and inverse models
    #px is easy but not sure how to overlay centroids on top of data points and tweak colors
    
    #stacks the two df into a single df to simplify plotting in plotly
    stacked_data_points=pd.concat([fwd_data_df, inv_data_df], axis=0)
    #converts 'Reconciled Labels' to string to simplify legend building
    stacked_data_points['Reconciled Labels']=stacked_data_points['Reconciled Labels'].astype('str')
    
    
    #generates the plots with the base data points
    fig=px.scatter(stacked_data_points, x='x', y='y', color='Reconciled Labels', facet_col='Data Type')
    
    return(fig)
#testing
#fig=plot_centroids(fwd_data_df, fwd_center_df, inv_data_df, inv_center_df)
#fig.show()

def plot_centroids2(fwd_data_df, fwd_center_df, inv_data_df, inv_center_df):
    #builds a plotly 1x2 plot to visualize the fwd and inverse models
    
    #creates a 1x2 subplot with linked x and y axes and adds titles
    fig=make_subplots(rows=1, cols=2,
                     shared_xaxes='all', shared_yaxes='all',
                     subplot_titles=('Forward Data','Inverse Data'))
    #print(fwd_data_df['Reconciled Labels'].to_list())
    
    #fwd plot
    #data points
    fig.add_trace(go.Scatter(x=fwd_data_df['x'], y=fwd_data_df['y'], mode='markers',
                             marker=dict(color=fwd_data_df['Reconciled Labels'].to_list(),coloraxis="coloraxis",
                                        size=3,
                                        line=dict(width=1,
                                                    color='white'))),
                 row=1, col=1)
    #centroids
    fig.add_trace(go.Scatter(x=fwd_center_df['x'], y=fwd_center_df['y'], mode='markers',
                             marker=dict(color=fwd_center_df['Reconciled Labels'].to_list(),coloraxis="coloraxis",
                                        size=12,
                                        line=dict(width=2,
                                                    color='black'))),
                 row=1, col=1)
    
    #inverse plot
    #data points
    fig.add_trace(go.Scatter(x=inv_data_df['x'], y=inv_data_df['y'], mode='markers',
                             marker=dict(color=inv_data_df['Reconciled Labels'].to_list(),coloraxis="coloraxis",
                                         size=3,
                                         line=dict(width=1,
                                                    color='white'))),
                 row=1, col=2)
    #centroids
    fig.add_trace(go.Scatter(x=inv_center_df['x'], y=inv_center_df['y'], mode='markers',
                             marker=dict(color=inv_center_df['Reconciled Labels'].to_list(),coloraxis="coloraxis",
                                         size=12,
                                         line=dict(width=2,
                                                    color='black'))),
                 row=1, col=2)
    
    #updates color scale to rainbow and removes color bar
    fig.update_layout(coloraxis=dict(colorscale='Rainbow'), showlegend=False)
    fig.update_coloraxes(showscale=False)
    
    return(fig)
#upper_fig=plot_centroids2(fwd_data_df, fwd_center_df, inv_data_df, inv_center_df)
#upper_fig.show()

def build_confusion_matrix(inv_data_df, col_names):
    #builds confusion matrix from inv_data_df
    #confusion_mat=pd.crosstab(inv_data_df['Reconciled Labels'], inv_data_df['True Labels'])
    confusion_mat=pd.crosstab(inv_data_df[col_names[0]], inv_data_df[col_names[1]])
    
    #print(confusion_mat)
    
    return(confusion_mat)
#confusion_mat=build_confusion_matrix(inv_data_df, ['Reconciled Labels','True Labels'])

def initialize_fig(figsize=(14, 5)):
    #initialize the matplotlib figure
    #second row of figures
    #I am sure there is a more automated way to determine figsize so that it scales appropriately
    #with the size of the monitor being used, but I have not figured it out
    fig=plt.figure(figsize=figsize)
    #fig=plt.figure()
    
    return(fig)

def generate_confusion_matrix_plot(fig, inv_data_df):
    #generates and plots the confusion matrix using seaborn
    
    #generates confusion matrix
    confusion_mat=build_confusion_matrix(inv_data_df, ['Reconciled Labels','True Labels'])
    #print(confusion_mat)
    
    #adds a subplot to figure for the confusion matrix
    ax=fig.add_subplot(1,2,1)
    
    #plots the confusion matrix on the new subplot, labels come from the dataframe
    #fmt='.0f' displays the counts with no decimal places
    ax=sns.heatmap(confusion_mat, annot=True, fmt='.0f', cmap='Blues', ax=ax)
    ax.set_title('Confusion Matrix')
    #plt.show()
    return(fig)
#generate_confusion_matrix_plot(fig, inv_data_df)

def generate_AIC_arr_plot(fig, AIC_arr, min_clust, max_clust, x_y_list, ideal_k):
    #generates and plots the AIC values versus the number of clusters using matplotlib
    
    #adds a subplot to figure for the AIC plot
    ax=fig.add_subplot(1,2,2)
    
    #plots the AIC value for each number of clusters tried
    ax.plot(range(min_clust,max_clust+1), AIC_arr,'b.-')
    ax.set_title('AIC Values for Number of Clusters')
    ax.set_xlabel('Number of Clusters')
    ax.set_ylabel('AIC value')
    
    #adds a vertical line at the true number of clusters in the forward model
    ax.axvline(x=len(x_y_list), color='k', linestyle='--', label='True Num Clusters')
    
    #adds a vertical line at the inverse model selected number of clusters
    ax.axvline(x=ideal_k, color='r', linestyle=':', label='Selected Num Clusters')
    
    #adds a legend for line
    ax.legend()
    
    #plt.show()
    return(fig)
#generate_AIC_arr_plot(fig, AIC_arr, 2, 12, x_y_list, ideal_k)

def build_lower_fig(inv_data_df,AIC_arr, min_clust, max_clust, fwd_dict, ideal_k):
    #builds the lower figures in the output
    fig=initialize_fig()
    
    #makes the confusion matrix plot
    fig=generate_confusion_matrix_plot(fig,inv_data_df)
    
    #makes the AIC plot
    fig=generate_AIC_arr_plot(fig,AIC_arr, min_clust, max_clust, fwd_dict['x_y_list'],ideal_k)
    
    #fig.show()
    
    return(fig)
#lower_fig=build_lower_fig(inv_data_df,AIC_arr, 2, 12, fwd_dict, ideal_k)

def builds_all_plots(fwd_dict,clust_obj,AIC_arr, min_clust, max_clust, ideal_k):
    #master function for this section
    #builds the upper plotly plots
    #and the lower confusion matrix and parameter selection plots
    
    #builds data frames for plotting
    fwd_data_df, fwd_center_df, inv_data_df, inv_center_df = build_dfs(fwd_dict,clust_obj)
    
    #builds the upper plotly data point figures
    print('These figures are dynamic and linked')
    upper_fig=plot_centroids2(fwd_data_df, fwd_center_df, inv_data_df, inv_center_df)
    upper_fig.show()
    
    #builds the lower confusion matrix and parameter selection plots
    print('These figures are static')
    lower_fig=build_lower_fig(inv_data_df,AIC_arr, min_clust, max_clust, fwd_dict, ideal_k)
    #lower_fig.show()
    return
#testing output   
#builds_all_plots(fwd_dict,clust_obj, AIC_arr, 2, 12, ideal_k)

## Interactive Control Section

This section ties all of the previous sections together in `main_function()` and connects `main_function()` to an `interact_manual()` widget.  Each run with new input values generates a new forward model, fits it with an inverse model, and visualizes the results.  The inverse modeling and visualization steps require too much processing time to update the results live as each slider is dragged, so a manual `Run_Interact` button is used to reduce lag.  Once the values for the variables are set, clicking `Run_Interact` runs the algorithm and generates the plots.

User inputs are: 
- `num_clusters` = the number of clusters to forward model
- `radius` = the radius of the circle on whose perimeter the cluster centers are placed
- `data_stdev` = the standard deviation of the noise used to generate each cluster of data points
- `num_data_points` = the total number of data points to generate for all of the clusters
- `max_clusters` = the maximum number of clusters to try in the inverse model, defaults to `2*num_clusters`

See what happens as the ratio of `radius` to `data_stdev` decreases, or when `max_clusters` is less than `num_clusters`, or both!
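
To get a rough feel for when the clusters start to overlap, the sketch below compares the distance between neighboring cluster centers to the noise level.  It assumes the centers are evenly spaced around the circle, so that neighboring centers are separated by a chord of length `2*radius*sin(pi/num_clusters)`; the function name `neighbor_separation_ratio` is hypothetical and is not part of the project code.

```python
import math

def neighbor_separation_ratio(num_clusters, radius, data_stdev):
    """Distance between adjacent cluster centers divided by the noise stdev.

    Assumption: the centers are evenly spaced on the circle's perimeter,
    so adjacent centers are joined by a chord of length 2*r*sin(pi/n).
    """
    chord = 2 * radius * math.sin(math.pi / num_clusters)
    return chord / data_stdev

# illustrative settings: 6 clusters, radius 10, stdev 1
wide = neighbor_separation_ratio(6, 10, 1)
# shrinking the radius pulls the clusters together
tight = neighbor_separation_ratio(6, 2, 1)
print(wide, tight)
```

A large ratio means neighboring clusters are many standard deviations apart and should be easy to separate; as the ratio falls toward a few standard deviations, the clusters overlap and the inverse model has less to work with.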

In [7]:
#merges all of the functions into a single master function which forward models, inverse models, and plots
#main function engages with the interact widget
def main_function(num_clusters,radius,data_stdev,num_data_points, max_clusters):
    #takes user input and generates the clusters
    #then inverse models the clusters using KMeans
    #and visualizes the results
    
    #inputs 
    #num_clusters=number of clusters to forward model
    #radius=radius of the circle to place clusters on perimeter of
    #data_stdev=1std noise on cluster data points
    #num_data_points=total number of data points to be divided by n clusters
    #max_clusters=the maximum number of clusters to try and fit the data with
    
    #KMeans() crashes with less than 2 clusters
    min_clust=2 
        
    #forward model data and generate fwd_dict containing relevant 
    #data points, labels, and centroids
    fwd_dict=generate_fwd_data(n=num_clusters, r=radius, std=data_stdev, nn=num_data_points, start_radians=math.pi/2)
    
    #inverse model the data points
    #fit KMeans clustering model trying different numbers of clusters and selecting based on AIC
    
    ideal_k, clust_obj, AIC_arr=fit_clusters(fwd_dict, min_clust=min_clust, max_clust=max_clusters)
    
    #generates plots
    builds_all_plots(fwd_dict,clust_obj, AIC_arr, min_clust=min_clust, max_clust=max_clusters, ideal_k=ideal_k)
    
    return
#testing
#main_function(6,10,1,1000,12)

#builds the widgets
num_clusters_widget=widgets.IntSlider(min=2, max=10, step=1, value=6)
radius_widget=widgets.FloatSlider(min=0.1, max=25, step=0.1, value=10)
data_stdev_widget=widgets.FloatSlider(min=0.1, max=5, step=0.1, value=1)
num_data_points_widget=widgets.IntSlider(min=100, max=10000, step=10, value=1000)
max_clusters_widget=widgets.IntSlider(min=2, max=30, step=1, value=12)

#ties update_max_clusters_range to num_clusters_widget
#sets the max_clusters_widget max to 5 * num_clusters
#and defaults its value to 2 * num_clusters
#the max is raised first so the new value is not clamped to the old max
def update_max_clusters_range(*args):
    max_clusters_widget.max = 5 * num_clusters_widget.value
    max_clusters_widget.value = 2 * num_clusters_widget.value
num_clusters_widget.observe(update_max_clusters_range, 'value')

#uses interact function to build sliders for all of the input variables
#The program takes too long to allow just sliding the sliders
#instead we set the settings and then click run instead of constantly fighting lag
interact_manual(main_function,num_clusters=num_clusters_widget,
                              radius=radius_widget,
                              data_stdev=data_stdev_widget,
                              num_data_points=num_data_points_widget,
                              max_clusters=max_clusters_widget)

