# Data Driven Subgrouping

# Clustering Techniques

## Clustering is a data driven way to groups samples of your data. Also known as unsupervised machine learning, clustering techniques are often times not guided by a set of correct predictions but must learn to make the appropriate choices through analysis of the data alone. It can be used for things such as attempting to classify objects based on the groupings, perform regression stasks by using the groupings to approximate values, as well as find hidden subgroups within your data. As with any sort of learning algorithm there are numerous variations but for today we will look at two of the main ones. 

### [K-Means](https://en.wikipedia.org/wiki/K-means_clustering)
* uses the idea of "simularity" of data points (census tracts) to group them based on some given K averages or center points the data will be grouped around. When the algorithm begins there will be k randomly designated center points or means and each sample/census tract will be assigned to one based on which one that particular sample is closest to based on some form of distance metric such as euclidean or city block distance. After each assignment step when all samples have been assigned to a group the samples will be used to calculate the new center for the groups and then the assignment step begins again reassigning samples based on the new centers. The process continues until no more samples change groups. The idea is that if you due have data that can be clustered into groups based on the variables or factors of each sample and you choose the appropriate number of means/centers to group with you can eventually have clusters where each clusters members are far closter to each other, than they are from samples in other clusters. In this way you seek to sepereate out the data into different groups that have similar characterstics to cluster members, but different ones from members of other clusters. If this is the result you can analyze the clusters to see various things such as what characteristics are similar within each cluster to help identify what characterstics are making up the clusters. Things such as the ANOVA test you saw before can help with this.  


### [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN):
* Unlike the KNN, DBSCAN will choose the appropriate number of groups for you. You set other hyper parameters that control how the algorithm compares samples, and it will decide the appropriate number of groupings to choose. The you must test how well these grouping differ


### Visualization of Clustering

* for simple 2-D or 3-D data you can easily visualize the clustering by coloring the values based on groupings and plotting, but as the number of dimensions in your data (variables) increases if becomes difficult to visualize purely from the data. There is a algorithm that transforms the data into what ever number of dimensions you would like and is said to appropriately still represent the data for visualization purpoes known as [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and this is another tool that will be show to you today


### Todays Goals
* learn about
    * unsupervised machine learning
    * [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
    * [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
    * [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)
* Cluster you data using both KNN and DBSCAN using your current set of tested variables, 
* play around with both clustering algorithms to see how well you can get them to cluster the data
    * look at the performance metrics described below
    * play around with tuning the algorithm hyperparameters to see if you can improve it
* play around with different sets of your variables to see if you can get better performance that way
* visualize your clusterings and save some interesting ones for next time for the best performing ones

* [clustering performance metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=metrics#module-sklearn.metrics.cluster)
    > * ***[Calinski-Harabasz index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score)*** (sklearn.metrics.calinski_harabasz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the model, where a ***higher Calinski-Harabasz score relates to a model with better defined clusters***. The score is defined as ratio between the within-cluster dispersion and the between-cluster dispersion.
    
    > * ***[Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score)***(sklearn.metrics.davies_bouldin_score):The score is defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther apart and less dispersed will result in a better score.The minimum score is zero, with ***lower values indicating better clustering***.



In [1]:
import numpy as np
import pandas as pd


# Machine learning tools
from sklearn.cluster import KMeans, DBSCAN
# from sklearn.model_selection import cross_val_score, KFold

# performance metrices
from sklearn.metrics.cluster import calinski_harabasz_score, davies_bouldin_score

# visualization tools
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors

from time import time

# Your data path
# data_path = r'C:\Users\gjone\DeepLearningDeepSolar\_Data\_SCALED_CLEANED_MODELS\PaperSet_6_28_OLR_NRML_allDROP_NOIMP.xlsx'
data_path = r'C:\Users\gjone\ConvergentDataTrainer\_Data\ConvergentMiniEXTEND.csv'
data_path = r'C:\Users\gjone\ConvergentDataTrainer\_Data\__PaperSet_7_6_OLR_None_FF_.xlsx'




data_df = pd.read_excel(data_path,)

data_df.head(10)

#model_data = data_df.filter(items=)



data_dfORIG = data_df.copy()
    


In [2]:
for v in sorted(data_dfORIG.columns.tolist()):
    print("'{}'".format(v))

'# 1 Person Homes'
'# 2 Person Homes'
'# 3 Person Homes'
'# 4 Person Homes'
'# CarPool rate'
'# Households'
'# Transpo Bike'
'% 1 Person Homes'
'% 2 Person Homes'
'% 3 Person Homes'
'% 4 Person Homes'
'% Green_Travelers'
'% Some College > 25yrs'
'AL'
'AR'
'AZ'
'Age_Median'
'Asian_Pop'
'Asian_pct'
'Average_Commute'
'Average_Household_Size'
'Avg_Monthly_Consumption_kwh'
'BS_or_ABOVE_rate'
'BS_rate'
'Black_AA'
'Black_pct'
'CA'
'CO'
'CT'
'Climate_Zone'
'College_Edu'
'Commutes__40min_%'
'Cooling_Degree_Days'
'Cooling_Degree_Days_std'
'DC'
'DE'
'Daily_Solar_Radiation'
'DeepSolar_HighSolar'
'Dem_Rep_Ratio_2012E'
'Energy_Cost($/kWh)'
'Energy_Cost_x_Consumption'
'Estimated_Yearly_savings_$'
'FL'
'Fam_Med_Income'
'Families_in_Poverty'
'Family_Household_rate'
'Female_%'
'Female_Pop'
'GA'
'Gender_Ratio'
'Has_Net_metering'
'Has_Property_tax'
'Heating_Degree_Days'
'Heating_Degree_Days_std'
'HighIncome_own_energy_cost'
'High_Solar_Areas'
'Hispanic_Pop'
'Hispanic_pct'
'Home_owner_rt'
'Homes Built afte

In [4]:

# useful for showin the columns
#for v in sorted(data_df.columns.tolist()):
#    print("'{}',".format(v))

# remove all columns in checklist that 
# have only zero values
def remove_allone(df, checklist):
    for c in checklist:
        if c in df.columns.tolist():
            if df.loc[df[c] > 0, :].shape[0] == 0:
                df.drop(columns=[c], inplace=True)
    return df
    

    
# your list of predictors/independent variables
usecols = [
'Total residential solar installations ',   # 0
'Adoption',                                 # 1    

'# >= 25 years of age ',
'# 2 personhouseholds',
'# Homeowners costs > $1k.1',
'# Homeowner',
'Gender_(female)',
# 'Hot_Spots_hh',
'Nonresidential Panel Area',
'Total RPV roof tops',
'Total number of households',
'popden_x_TotOK_RCnt',
'popden_x_TotOK_Rm2',
'Daily_solar_radiation',
'PopulationDdensity',
'Median_Household_income',
'Income_x_EnergyCost',
'Heating_degree_day_std',
'Family Income',
'Gender (Male)',
'Bachelors #',
'# Some college or More',
'# Nonresidential State Inc',
'# Housing Units',
'Heating Source Coal (%)',
'Cooling_degree_days_std',
# 'High_Solar_Areas',  

#  'Hot_Spots_AvgAr',
 'Housing unit count',
 
'Total Area',
'% Admin Occu',
'% Agriculture Occu',
'% Arts Occu',

 '% Information Occu',
 '% Owner Occupied',
 '% Working from home',
# 'DS_HighSolar',

 'Cooling_degree_days',
'Median Age',
]


# the states you want to use
my_states = [
    'al', 
    'ar', 
    'fl', 
    'ga', 
     'ky', 
     'la', 
     'ms',
     'nc', 
     'ok', 
     'sc', 
     'tn', 
     'tx', 
     'va', 
    'wv',  
]

data_dfORIG = data_df.copy()


# get a list of your 2 targets
targets_l = usecols[:2]


# make a list of only the features
features_only = usecols[2:]

# make a list of all the features and 
features = usecols[2:]  

# use only your selected state data
# data_df = data_df.loc[data_df['State'].isin(my_states), :]


# data_df.drop(columns=['State'], inplace=True)

print("Before Imputaton: ", data_df.shape)
# use only your selected features
Wmodel_data = data_df.filter(items=features+targets_l)




#  get a cleaned working model
# this way if you mess this one up just rerun this cell
Wmodel_data = Wmodel_data.dropna()


display(Wmodel_data.head())
print("Working Model: ", Wmodel_data.shape)





Before Imputaton:  (59052, 170)


Unnamed: 0,Income_x_EnergyCost
0,7588.936933
1,5547.918385
2,7061.118753
3,9438.881134
4,9017.289466


Working Model:  (58921, 1)


In [8]:
target = 'Total residential solar installations '
target = 'Solar_installations_per_home_owner'
target = 'Solar_installations_per_household'
met_dict = {}
for state in data_dfORIG['State'].unique():
    met_dict[state] = data_dfORIG.loc[data_dfORIG['State'] == state, target ].mean()

# print(met_dict)
met_dict = dict(sorted(met_dict.items(), key=lambda x: x[1],))
    
for state in met_dict: 
    print("{}: {:.3f}".format(state, met_dict[state]))

fontdict = {
                'family': 'serif',
                'style': 'normal',
                #'variant': 'normal',
                #'weight': 'bold',
                #'size': 'large',
                'size': 20,
            }
fontdict_Labels = {
                'family': 'serif',
                'style': 'normal',
                'variant': 'normal',
                'weight': 'bold',
                #'size': 'large',
                'size': 30,
            }


    
    
fig, ax = plt.subplots(1, figsize=(30, 30))
ax.barh(list(range(len(met_dict))),list(met_dict.values()), height=.5)

ax.set_yticklabels([ c.upper()  for c in list(met_dict.keys())], fontdict=fontdict)
ax.set_yticks(range(len(met_dict)))
ax.set_xlabel("Average Roof Top Solar Installations per Household by State".format(target), fontdict=fontdict_Labels)
ax.set_title("Average Roof Top Solar Installations per Household by State".format(target), fontdict=fontdict_Labels)
ax.set_ylim((-.5, len(met_dict)+.5))
cnt = 0
for c in met_dict:
    val = met_dict[c]
    m = 1
    while np.around(val, 0) < 1:
        m = m*10
        val = val*m
    val = np.around(val, 0)
    m *= 10
    # width, height
    ax.text(met_dict[c], cnt-.2, "{}/{}".format(int(val), m), fontdict={'size':20})
    cnt += 1
    
ax.text(0, 10, "hi")

KeyError: 'State'

# set up your data for clustering

In [6]:
# Select your target/dependent 
target = targets_l[0]


# do you want to only use the independent variables or do you want to use them and your dependent variable
# 

use_independent=False

if use_independent:
    X_vars = features_only
else:
    X_vars = features_only + [target]



Xo = Wmodel_data.filter(items=X_vars)




# create learning and vislualization tools

In [7]:
# create DBSCAN tool
DB_clster = DBSCAN(eps=.1,              #The maximum distance between two samples for one to be considered as in the neighborhood of the other.
                   min_samples=200,        # minimum number of neighbors to look at to make a cluster high == larger cluster often
                   metric='euclidean',  
                   # algorithm='auto', 
                   algorithm='ball_tree',
                   leaf_size=10,
                   p=None, 
                   n_jobs=None)



# create KMeans tool
kmu_clster = KMeans(
       n_clusters=8, 
       init='k-means++', 
       n_init=10, 
       max_iter=300, 
       tol=0.0001, 
       verbose=0, 
       random_state=None, 
       copy_x=True, 
       algorithm='auto',
      
        )


# create TSNE visualization tool
tsne_viz_2D = TSNE(
                n_components=2,
                perplexity=30.0, 
                early_exaggeration=12.0, 
                learning_rate=200.0, 
                n_iter=1000, 
                n_iter_without_progress=300, 
                min_grad_norm=1e-07, 
                metric='euclidean', 
                init='random', 
                verbose=0, 
                random_state=None, 
                method='barnes_hut', 
                angle=0.5, 
                n_jobs=None, )

# create TSNE visualization tool
tsne_viz_3D = TSNE(
                n_components=3,
                perplexity=30.0, 
                early_exaggeration=12.0, 
                learning_rate=200.0, 
                n_iter=1000, 
                n_iter_without_progress=300, 
                min_grad_norm=1e-07, 
                metric='euclidean', 
                init='random', 
                verbose=0, 
                random_state=None, 
                method='barnes_hut', 
                angle=0.5, 
                n_jobs=None, )

print("machine learning  and visualizations tools created")

# use DBscan to cluster the data and get the cluster assignements
# fit them to do the clustering
DB_clster.fit(Xo)
kmu_clster.fit(Xo)

# get the cluster labels
db_labels = DB_clster.labels_
kmu_labels = kmu_clster.labels_


# add them to your working data
Xo['cluster-db'] = db_labels
Xo['cluster-kmu'] = kmu_labels

# 
if len(Xo['cluster-db'].unique()) > 1:
    # attempt to score them:
    # calinski_harabasz_score, davies_bouldin_score
    db_chs, db_dbs = calinski_harabasz_score(Xo[X_vars], Xo['cluster-db']), davies_bouldin_score(Xo[X_vars], Xo['cluster-db'])
    print("DB cluster scores:")
    print("calinski_harabasz_score: ", db_chs)
    print("davies_bouldin_score: ", db_dbs)
else:
    db_chs, db_dbs = -1, -1
    print("DBSCAN failed to generate clusters")

    
kmu_chs, kmu_dbs = calinski_harabasz_score(Xo[X_vars], Xo['cluster-kmu']), davies_bouldin_score(Xo[X_vars], Xo['cluster-kmu'])    
print("K-means cluster scores:")
print("calinski_harabasz_score: ", kmu_chs)
print("davies_bouldin_score: ", kmu_dbs)  


# lets look at the number of labels
print(Xo['cluster-db'].unique())
print(Xo['cluster-kmu'].unique())

machine learning  and visualizations tools created
DBSCAN failed to generate clusters


KeyError: "['% Information Occu', 'Median HH Income', '# >= 25 years of age ', '# Housing Units', 'Gender_(female)', '% Admin Occu', '# Homeowners costs > $1k.1', 'Median Age', '% Owner Occupied', '# 2 personhouseholds', 'Family Income', 'Gender (Male)', '% Agriculture Occu', 'Total Area', 'Daily_solar_radiation', '# Homeowner', '# Nonresidential State Inc', 'Bachelors #', 'Housing unit count', 'Heating_degree_day_std', 'Total residential solar installations ', 'PopulationDdensity', 'Cooling_degree_days_std', 'Total number of households', 'Heating Source Coal (%)', '% Working from home', '# Some college or More', 'Cooling_degree_days', 'Total RPV roof tops', '% Arts Occu', 'Nonresidential Panel Area', 'popden_x_TotOK_RCnt', 'popden_x_TotOK_Rm2'] not in index"

# Methods for the tasks

In [None]:
# make a dictionary for the different cluster groups so each as a color

def get_cluster_colors(labels, colors):
    # get the unique values
    allcluster = list(set(labels))
    print(allcluster)
    allcolr = {c:clr for c, clr in zip(allcluster, colors)}
    # remove the non labeled colors
    if -1 in allcolr:
        del allcolr[-1]
    return allcolr

def plot_clusters(xtsne, labels, color_dict, fontdict=None, s=800, title="Cluster Plot", figsize=(20,20)):
    if fontdict is None:
        fontdict = {
                'family': 'serif',
                'style': 'normal',
                'variant': 'normal',
                'weight': 'bold',
                #'size': 'large',
                'size': 60,
                }
    fig, ax = plt.subplots(figsize=figsize)
    cnt = 0
    for pnt in  xtsne:
        # print(pnt)
        if labels[cnt] >= 0:
            ax.scatter(pnt[0], pnt[1], c=color_dict[labels[cnt]], s=s)
        cnt += 1

    #ax.invert_yaxis()  # labels read top-to-bottom

    ax.set_title(title, fontdict=fontdict)
    ax.set_facecolor("whitesmoke")
    fig.tight_layout()
    plt.show()

    
def plot_clusters3D(xtsne, labels, color_dict, fontdict=None, s=800, title="Cluster Plot", figsize=(20,20)):
    if fontdict is None:
        fontdict = {
                'family': 'serif',
                'style': 'normal',
                'variant': 'normal',
                'weight': 'bold',
                #'size': 'large',
                'size': 60,
                }
    fig, ax = plt.subplots(figsize=figsize)
    cnt = 0
    for pnt in xtsne:
        # print(pnt)
        if labels[cnt] >= 0:
            ax.scatter(pnt[0], pnt[1], pnt[2], c=color_dict[labels[cnt]], s=s)
        cnt += 1
    #ax.invert_yaxis()  # labels read top-to-bottom
    ax.set_title(title, fontdict=fontdict)
    ax.set_facecolor("whitesmoke")
    fig.tight_layout()
    plt.show()
    
    

In [None]:



# list of colors to be "mostly" be randomly assigned to cluster
# from the DBSCAN/kmeans
colors = [
    'b',
    'r',
    'g',
    'c',
    'm',
    'y',
    'darkblue',
    'orange',
    'gray',
    'khaki',
    'coral',
    'darkviolet',
    'darkgreen',
    'lime',
    'indigo',
    'bisque',
    'lavender',
    'gold',
    'purple',
    'firebrick',
    'black',
    'seagreen',
    'slateblue',
    'royalblue',
    'cornflowerblue',
]


colors += list(np.random.choice(list(mcolors.CSS4_COLORS.keys()), len(mcolors.CSS4_COLORS.keys()), replace=False))


db_cluster_colors = get_cluster_colors(Xo['cluster-db'], colors)
kmu_cluster_colors = get_cluster_colors(Xo['cluster-kmu'], colors)


print(db_cluster_colors)
print(kmu_cluster_colors)

print("running tsne")
# fit your data using the TSNE algorithm to compress it into only N-dimensions
db_2d = tsne_viz_2D.fit_transform(Xo)
db_3d = tsne_viz_3D.fit_transform(Xo)


kmu_2d = tsne_viz_2D.fit_transform(Xo)
kmu_3d = tsne_viz_3D.fit_transform(Xo)

fontdict2 = {
                'family': 'serif',
                'style': 'normal',
                'variant': 'normal',
                'weight': 'bold',
                #'size': 'large',
                'size': 60,
            }                           
                           

print("visualizations set up")                  

In [None]:
# db_2dx = tsne_viz_2D.transform()

# 2D plots of the clusters

In [None]:
#Now lets try to plot them in 2d 

if len(Xo['cluster-db'].unique()) == 1 and -1 in Xo['cluster-db'].unique():
    pass
else:
    print("plotting the DBSCAN generated clusters")
    plot_clusters(db_2d, Xo['cluster-db'].values.tolist(), db_cluster_colors, fontdict=fontdict2, s=800, title="Cluster Plot-DB", figsize=(20,20))

print("Plotting the Kmeans clusters")

plot_clusters(kmu_2d, Xo['cluster-kmu'].values.tolist(), kmu_cluster_colors, fontdict=fontdict2, s=800, 
                  title="Cluster Plot-KMU", figsize=(20,20))       

# 3D plots of the clusters

In [None]:
if len(Xo['cluster-db'].unique()) == 1 and -1 in Xo['cluster-db'].unique():
    pass
else:
    print("plotting the DBSCAN generated clusters")
    plot_clusters3D(db_3d, Xo['cluster-db'].values.tolist(), db_cluster_colors, fontdict=fontdict2, s=800, title="Cluster Plot-DB", figsize=(20,20))

print("Plotting the Kmueans clusters")
plot_clusters3D(kmu_3d, Xo['cluster-kmu'].values.tolist(),kmu_cluster_colors, fontdict=fontdict2, s=800, 
                  title="Cluster Plot-KMU", figsize=(20,20))     