Our preliminary tests showed, that Optigrid performed well on Data of November 2018. In this notebook we will explore a few other months and look for similarities and descrepancies in the results. We will use preprocessed data, where things like source stability and voltage breakdowns are indicated. Moreover, for now we will limit ourselfs to stable running sources, i.e. time periods with a low variance and a high current in the BCT25. We use the already preprocessed datasets.

### Module loading
We use the Python modules from the ionsrcopt package that will be loaded in the next cells.

In [1]:
%run ../ionsrcopt/import_notebooks/Setup.ipynb

In [17]:
%run ../ionsrcopt/import_notebooks/Clustering.ipynb

First, we need to specifiy all the columns we are interested in. There are three types: Parameters, these are the ones that will be clustered later on, Measurments and columns from preprocessing.

In [3]:
time = [SourceFeatures.TIMESTAMP]
parameters = [
        SourceFeatures.BIASDISCAQNV, 
        SourceFeatures.GASAQN, 
        SourceFeatures.OVEN1AQNP,
        SourceFeatures.SAIREM2_FORWARDPOWER,
        SourceFeatures.SOLINJ_CURRENT,
        SourceFeatures.SOLCEN_CURRENT,
        SourceFeatures.SOLEXT_CURRENT,
        SourceFeatures.SOURCEHTAQNI,
        SourceFeatures.BCT25_CURRENT]
measurements = []
preprocessing = [
        ProcessingFeatures.SOURCE_STABILITY, 
        ProcessingFeatures.HT_VOLTAGE_BREAKDOWN, 
        ProcessingFeatures.DATAPOINT_DURATION,
        ProcessingFeatures.SOURCE_RUNNING]

columns_to_load = time + parameters + measurements + preprocessing

Next, specify the important files..

In [4]:
input_folder = '../Data_Preprocessed/'
input_files = ['Jan2018.csv', 'Feb2018.csv', 'Mar2018.csv', 'Apr2018.csv', 'May2018.csv', 'Jun2018.csv', 'Jul2018.csv', 'Aug2018.csv', 'Sep2018.csv', 'Oct2018.csv', 'Nov2018.csv']
input_paths = [input_folder + f for f in input_files]
output_folder = '../Data_Clustered/'
output_file = 'JanNov2018_robust.csv'
output_path = output_folder + output_file

cluster_logfile = output_folder + 'cluster_runs.log'

...and load them.

In [5]:
df_total = read_data_from_csv(input_paths, columns_to_load, None)
df_total = fill_columns(df_total, None, fill_nan_with_zeros=True)
df_total = convert_column_types(df_total)
df_total.memory_usage()

Loading data from csv file '../Data_Preprocessed/Jan2018.csv'
Loading data from csv file '../Data_Preprocessed/Feb2018.csv'
Loading data from csv file '../Data_Preprocessed/Mar2018.csv'
Loading data from csv file '../Data_Preprocessed/Apr2018.csv'
Loading data from csv file '../Data_Preprocessed/May2018.csv'
Loading data from csv file '../Data_Preprocessed/Jun2018.csv'
Loading data from csv file '../Data_Preprocessed/Jul2018.csv'
Loading data from csv file '../Data_Preprocessed/Aug2018.csv'
Loading data from csv file '../Data_Preprocessed/Sep2018.csv'
Loading data from csv file '../Data_Preprocessed/Oct2018.csv'
Loading data from csv file '../Data_Preprocessed/Nov2018.csv'
Forward filling missing values...
Converting column types...


Index                            46599944
IP.NSRCGEN:BIASDISCAQNV          23299972
IP.NSRCGEN:GASAQN                23299972
IP.NSRCGEN:OVEN1AQNP             23299972
IP.NSRCGEN:SOURCEHTAQNI          23299972
IP.SAIREM2:FORWARDPOWER          23299972
IP.SOLCEN.ACQUISITION:CURRENT    23299972
IP.SOLEXT.ACQUISITION:CURRENT    23299972
IP.SOLINJ.ACQUISITION:CURRENT    23299972
ITF.BCT25:CURRENT                23299972
datapoint_duration               23299972
source_stable                    23299972
ht_voltage_breakdown             23299972
source_running                    5824993
dtype: int64

In [6]:
df_total.shape

(5824993, 13)

Now we select what data we are interested in.

In [7]:
def select_values(df_total, parameters, selector):
    data = df_total.loc[selector, parameters].values
    weights = df_total.loc[selector, ProcessingFeatures.DATAPOINT_DURATION].values
    return data, weights

Once the data is ready we can begin clustering. But first we standard scale it, so that all parameters have the same variance.

In [8]:
from sklearn import preprocessing

def scale_values(values, scaler):
    if not scaler:
        scaler = preprocessing.RobustScaler((10,90)).fit(values)
    values_scaled = scaler.transform(values)
    return scaler, values_scaled

The parameters for optigrid can be chosen by visually examening the distribution of normalized data, see below.

In [9]:
optigrid_params = {
    'd' : len(parameters), 
    'q' : 1, 
    'max_cut_score' : 0.04, 
    'noise_level' : 0.05,
    'kde_bandwidth' : [0.014, 0.011, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014],
    'verbose' : True}

In [10]:
def run_optigrid(values_scaled, weights, optigrid_params):
    optigrid = Optigrid(**optigrid_params)
    optigrid.fit(values_scaled, weights)
    return optigrid

Once the clusters are found, we set an according column in the original dataframe containing all data.

In [11]:
def assign_clusters_df_total(df_total, optigrid, num_values, selector):
    clusters = np.zeros(num_values)

    for i, cluster in enumerate(optigrid.clusters):
        clusters[cluster] = i
    df_total.loc[selector, ProcessingFeatures.CLUSTER] = clusters

And here we bundle all these steps together.

In [12]:
def cluster(df_total, parameters, source_stable, optigrid_params):
    print("Starting clustering for source stability {}".format(source_stable))
    source_stability = df_total[ProcessingFeatures.SOURCE_STABILITY] == source_stable
    voltage_breakdown_selection = df_total[ProcessingFeatures.HT_VOLTAGE_BREAKDOWN] > 0
    source_running = df_total[ProcessingFeatures.SOURCE_RUNNING] == True
    
    selector = source_stability & ~voltage_breakdown_selection & source_running
    values, weights = select_values(df_total, parameters, selector) # First, get the data without breakdowns,
    scaler, values_scaled = scale_values(values, None) # standard scale it
    optigrid = run_optigrid(values_scaled, weights, optigrid_params) # and compute the clusters.
    assign_clusters_df_total(df_total, optigrid, len(values), selector) # Then, assign the found clusters to the original dataframe in a new column 'optigrid_clusters'
    
    print("Scoring voltage breakdowns")
    selector = source_stability & voltage_breakdown_selection & source_running
    values, weights = select_values(df_total, parameters, selector) # Now, get the datapoints when the voltage broke down
    _, values_scaled = scale_values(values, scaler) # scale it to the same ranges
    scored_samples = optigrid.score_samples(values_scaled) # and find the corresponding clusters.
    df_total.loc[selector, ProcessingFeatures.CLUSTER] = scored_samples

In [18]:
df_total[ProcessingFeatures.CLUSTER] = -1
cluster(df_total, parameters, 1, optigrid_params)
cluster(df_total, parameters, 0, optigrid_params)

Starting clustering for source stability 1
Found following cuts: [(-5.659585432572799, 3, 0.0)]
Evaluating subgrid: 6.97% of datapoints
Found following cuts: [(1.7265652309764516, 5, 3.3531017968717147e-67)]
Evaluating subgrid: 6.67% of datapoints
Found following cuts: [(-1.8748212462723859, 1, 1.3492957552397255e-50)]
Evaluating subgrid: 5.39% of datapoints
Found following cuts: [(0.7578633265061812, 5, 1.229026978554436e-79)]
Evaluating subgrid: 1.19% of datapoints
Found following cuts: [(-0.20429084517739032, 2, 9.663252188038074e-37)]
Evaluating subgrid: 0.33% of datapoints
Found cluster 0: 0.33% of datapoints
Evaluating subgrid: 0.86% of datapoints
Found following cuts: [(0.02217063098920114, 2, 0.00045711205160885545)]
Evaluating subgrid: 0.28% of datapoints
Found cluster 1: 0.28% of datapoints
Evaluating subgrid: 0.58% of datapoints
Found cluster 2: 0.58% of datapoints
Evaluating subgrid: 4.20% of datapoints
Found following cuts: [(-2.8564594138150263, 0, 6.045355668609282e-41)]

Found cluster 40: 0.39% of datapoints
Evaluating subgrid: 2.66% of datapoints
Found following cuts: [(-0.3999016465562763, 1, 2.71570502542682e-11)]
Evaluating subgrid: 1.90% of datapoints
Found following cuts: [(0.690288568987991, 2, 5.072302078033743e-06)]
Evaluating subgrid: 0.74% of datapoints
Found following cuts: [(-1.6624136404557661, 0, 0.027213987568214876)]
Evaluating subgrid: 0.20% of datapoints
Found cluster 41: 0.20% of datapoints
Evaluating subgrid: 0.54% of datapoints
Found cluster 42: 0.54% of datapoints
Evaluating subgrid: 1.17% of datapoints
Found following cuts: [(0.9885301192601522, 2, 1.0896420195928585e-05)]
Evaluating subgrid: 0.46% of datapoints
Found cluster 43: 0.46% of datapoints
Evaluating subgrid: 0.70% of datapoints
Found cluster 44: 0.70% of datapoints
Evaluating subgrid: 0.76% of datapoints
Found following cuts: [(0.9719160435476688, 2, 3.669616168142805e-83)]
Evaluating subgrid: 0.25% of datapoints
Found cluster 45: 0.25% of datapoints
Evaluating subgri

Found following cuts: [(0.7597022980752617, 0, 5.572998145186105e-22)]
Evaluating subgrid: 1.23% of datapoints
Found following cuts: [(-0.3672743161218335, 2, 1.9406419366743496e-16)]
Evaluating subgrid: 0.49% of datapoints
Found cluster 83: 0.49% of datapoints
Evaluating subgrid: 0.75% of datapoints
Found following cuts: [(-0.2387830861891159, 1, 3.3133062184353634e-12)]
Evaluating subgrid: 0.27% of datapoints
Found cluster 84: 0.27% of datapoints
Evaluating subgrid: 0.47% of datapoints
Found cluster 85: 0.47% of datapoints
Evaluating subgrid: 2.49% of datapoints
Found cluster 86: 2.49% of datapoints
Evaluating subgrid: 0.24% of datapoints
Found cluster 87: 0.24% of datapoints
Evaluating subgrid: 0.70% of datapoints
Found following cuts: [(0.5481126979746, 0, 0.029627018819364852)]
Evaluating subgrid: 0.48% of datapoints
Found cluster 88: 0.48% of datapoints
Evaluating subgrid: 0.23% of datapoints
Found cluster 89: 0.23% of datapoints
Evaluating subgrid: 0.33% of datapoints
Found clus

Evaluating subgrid: 0.22% of datapoints
Found cluster 127: 0.22% of datapoints
Evaluating subgrid: 0.19% of datapoints
Found cluster 128: 0.19% of datapoints
Evaluating subgrid: 23.20% of datapoints
Found following cuts: [(-1.9880825931375676, 1, 9.983336425188878e-41)]
Evaluating subgrid: 0.64% of datapoints
Found cluster 129: 0.64% of datapoints
Evaluating subgrid: 22.56% of datapoints
Found following cuts: [(-0.2659017395491552, 3, 6.528180082136974e-18)]
Evaluating subgrid: 0.99% of datapoints
Found following cuts: [(-0.9014783905010031, 1, 5.107625786508194e-102)]
Evaluating subgrid: 0.26% of datapoints
Found cluster 130: 0.26% of datapoints
Evaluating subgrid: 0.73% of datapoints
Found following cuts: [(-0.44338119210618915, 0, 0.010319006423262327)]
Evaluating subgrid: 0.50% of datapoints
Found cluster 131: 0.50% of datapoints
Evaluating subgrid: 0.23% of datapoints
Found cluster 132: 0.23% of datapoints
Evaluating subgrid: 21.56% of datapoints
Found following cuts: [(0.12912595

Found following cuts: [(-0.5628478009291369, 1, 8.409666065385918e-10)]
Evaluating subgrid: 0.39% of datapoints
Found cluster 169: 0.39% of datapoints
Evaluating subgrid: 0.67% of datapoints
Found following cuts: [(-0.03792156113518608, 2, 1.9909055871055078e-09)]
Evaluating subgrid: 0.37% of datapoints
Found cluster 170: 0.37% of datapoints
Evaluating subgrid: 0.29% of datapoints
Found cluster 171: 0.29% of datapoints
Evaluating subgrid: 0.59% of datapoints
Found cluster 172: 0.59% of datapoints
Evaluating subgrid: 0.90% of datapoints
Found cluster 173: 0.90% of datapoints
Evaluating subgrid: 1.75% of datapoints
Found following cuts: [(-0.06309423663399438, 1, 4.27089470866763e-11)]
Evaluating subgrid: 0.95% of datapoints
Found cluster 174: 0.95% of datapoints
Evaluating subgrid: 0.80% of datapoints
Found following cuts: [(0.5534290198725883, 0, 0.038330111260560164)]
Evaluating subgrid: 0.43% of datapoints
Found cluster 175: 0.43% of datapoints
Evaluating subgrid: 0.37% of datapoints

Found following cuts: [(2.1660043458745935, 5, 5.012219730299329e-34)]
Evaluating subgrid: 82.98% of datapoints
Found following cuts: [(1.4856766022817052, 5, 3.3201056279510465e-28)]
Evaluating subgrid: 80.37% of datapoints
Found following cuts: [(1.1648056206077038, 5, 1.0750959299181945e-23)]
Evaluating subgrid: 79.62% of datapoints
Found following cuts: [(-2.640138566493988, 4, 1.6432572412273258e-21)]
Evaluating subgrid: 2.74% of datapoints
Found following cuts: [(0.49569731830346453, 5, 3.506161178745788e-28)]
Evaluating subgrid: 2.24% of datapoints
Found following cuts: [(0.15863568072367196, 5, 3.943555871294979e-29)]
Evaluating subgrid: 1.71% of datapoints
Found following cuts: [(-3.3013736551458184, 4, 1.5894452461778538e-19)]
Evaluating subgrid: 0.81% of datapoints
Found cluster 17: 0.81% of datapoints
Evaluating subgrid: 0.90% of datapoints
Found following cuts: [(-0.4619666301842893, 0, 2.5494061633964444e-10)]
Evaluating subgrid: 0.46% of datapoints
Found cluster 18: 0.46

Found cluster 56: 0.52% of datapoints
Evaluating subgrid: 12.65% of datapoints
Found following cuts: [(-0.5590600329216082, 0, 1.0147108114408526e-06)]
Evaluating subgrid: 1.12% of datapoints
Found cluster 57: 1.12% of datapoints
Evaluating subgrid: 11.53% of datapoints
Found following cuts: [(-0.08500470747851362, 5, 4.1697488970451373e-07)]
Evaluating subgrid: 1.40% of datapoints
Found following cuts: [(0.6349206163425638, 0, 8.290399218356006e-19)]
Evaluating subgrid: 1.20% of datapoints
Found following cuts: [(0.9325806661085649, 1, 1.1458917176951991e-07)]
Evaluating subgrid: 1.01% of datapoints
Found following cuts: [(0.6732245141809639, 1, 0.00022994198797586965)]
Evaluating subgrid: 0.79% of datapoints
Found cluster 58: 0.79% of datapoints
Evaluating subgrid: 0.21% of datapoints
Found cluster 59: 0.21% of datapoints
Evaluating subgrid: 0.19% of datapoints
Found cluster 60: 0.19% of datapoints
Evaluating subgrid: 0.20% of datapoints
Found cluster 61: 0.20% of datapoints
Evaluati

Found following cuts: [(1.425087581981312, 2, 1.936600134604768e-15)]
Evaluating subgrid: 2.17% of datapoints
Found following cuts: [(-0.14468460853653725, 2, 3.195174052122184e-07)]
Evaluating subgrid: 1.35% of datapoints
Found following cuts: [(0.5813892543917955, 5, 2.523864657386124e-09)]
Evaluating subgrid: 0.24% of datapoints
Found cluster 101: 0.24% of datapoints
Evaluating subgrid: 1.11% of datapoints
Found following cuts: [(0.11915065784647005, 0, 1.1230789041523967e-06)]
Evaluating subgrid: 0.74% of datapoints
Found following cuts: [(-0.4659606299617074, 2, 0.020350354919198742)]
Evaluating subgrid: 0.21% of datapoints
Found cluster 102: 0.21% of datapoints
Evaluating subgrid: 0.53% of datapoints
Found cluster 103: 0.53% of datapoints
Evaluating subgrid: 0.37% of datapoints
Found cluster 104: 0.37% of datapoints
Evaluating subgrid: 0.82% of datapoints
Found cluster 105: 0.82% of datapoints
Evaluating subgrid: 0.27% of datapoints
Found cluster 106: 0.27% of datapoints
Evaluati

Found cluster 141: 0.28% of datapoints
Evaluating subgrid: 0.44% of datapoints
Found cluster 142: 0.44% of datapoints
Evaluating subgrid: 0.39% of datapoints
Found cluster 143: 0.39% of datapoints
Evaluating subgrid: 0.30% of datapoints
Found cluster 144: 0.30% of datapoints
Evaluating subgrid: 0.21% of datapoints
Found cluster 145: 0.21% of datapoints
Evaluating subgrid: 4.34% of datapoints
Found following cuts: [(0.015831824956518248, 4, 6.764831175646442e-25)]
Evaluating subgrid: 0.21% of datapoints
Found cluster 146: 0.21% of datapoints
Evaluating subgrid: 4.12% of datapoints
Found following cuts: [(0.030041892119128555, 8, 0.0003249136296330472)]
Evaluating subgrid: 0.83% of datapoints
Found cluster 147: 0.83% of datapoints
Evaluating subgrid: 3.29% of datapoints
Found cluster 148: 3.29% of datapoints
Evaluating subgrid: 0.64% of datapoints
Found cluster 149: 0.64% of datapoints
Evaluating subgrid: 0.75% of datapoints
Found following cuts: [(-0.2560039433566006, 6, 1.4249740550624

#### Long term storage
We will save the clustered data to a file.

First, create the logging string.

In [19]:
from datetime import datetime

now = datetime.now()
dt_string = now.strftime("%d.%m.%Y %H:%M:%S")

logstring = "[{}] \'{}\' cluster results saved to \'{}\'. Columns used: {}. Parameters used: {}\n".format(dt_string, input_paths, output_path, parameters, optigrid_params)
with open(cluster_logfile, "a") as myfile:
    myfile.write(logstring)

logstring

"[11.12.2019 17:40:07] '['../Data_Preprocessed/Jan2018.csv', '../Data_Preprocessed/Feb2018.csv', '../Data_Preprocessed/Mar2018.csv', '../Data_Preprocessed/Apr2018.csv', '../Data_Preprocessed/May2018.csv', '../Data_Preprocessed/Jun2018.csv', '../Data_Preprocessed/Jul2018.csv', '../Data_Preprocessed/Aug2018.csv', '../Data_Preprocessed/Sep2018.csv', '../Data_Preprocessed/Oct2018.csv', '../Data_Preprocessed/Nov2018.csv']' cluster results saved to '../Data_Clustered/JanNov2018_robust.csv'. Columns used: ['IP.NSRCGEN:BIASDISCAQNV', 'IP.NSRCGEN:GASAQN', 'IP.NSRCGEN:OVEN1AQNP', 'IP.SAIREM2:FORWARDPOWER', 'IP.SOLINJ.ACQUISITION:CURRENT', 'IP.SOLCEN.ACQUISITION:CURRENT', 'IP.SOLEXT.ACQUISITION:CURRENT', 'IP.NSRCGEN:SOURCEHTAQNI', 'ITF.BCT25:CURRENT']. Parameters used: {'d': 9, 'q': 1, 'max_cut_score': 0.04, 'noise_level': 0.05, 'kde_bandwidth': [0.014, 0.011, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014], 'verbose': True}\n"

Now we can save the dataframe to a file.

In [20]:
df_total = df_total.astype({ProcessingFeatures.CLUSTER : 'int64'})
df_total[df_total.shift(1)==df_total] = np.nan
df_total.to_csv(output_path)

In [None]:
cluster_performance_dbi(values_scaled, optigrid.clusters, optigrid.num_clusters)

In [None]:
def silhouette_coefficient(a, b):
    if a < b:
        return 1 - a/b
    elif a == b:
        return 0
    else:
        return b/a - 1
    
def cluster_performance_silhouette(df_total, values_scaled, clusters, source_stability, voltage_breakdown_selection, num_clusters):
    mean_distances = np.array([np.array([np.sum(np.linalg.norm(values_scaled[cluster]-x, axis=1)) / len(cluster) for cluster in clusters]) for x in values_scaled])
    optigrid_cluster = df_total.loc[source_stability & voltage_breakdown_selection, 'optigrid_cluster']
    selector = np.ones((len(values_scaled), num_clusters), dtype=bool)
    selector[range(len(values)), optigrid_cluster] = False
    print(mean_distances)
    print(optigrid_cluster)
    print(selector)
    print(np.ma.masked_array(mean_distances, ~selector))
    df_total.loc[source_stability & voltage_breakdown_selection, 'mean_dist_same_cluster'] = np.amin(np.ma.masked_array(mean_distances, selector), axis=1)
    df_total.loc[source_stability & voltage_breakdown_selection, 'min_mean_dist_different_cluster'] = np.amin(np.ma.masked_array(mean_distances, ~selector), axis=1)
    df_total.loc[source_stability & voltage_breakdown_selection, 'silhouette'] = np.vectorize(silhouette_coefficient)(df_total.loc[source_stability & voltage_breakdown_selection, 'mean_dist_same_cluster'], df_total.loc[source_stability & voltage_breakdown_selection, 'min_mean_dist_different_cluster'])

In [None]:
def all_pairs_euclid_squared_numpy(A, B):
    sqrA = np.broadcast_to(np.sum(np.power(A, 2), 1).reshape(A.shape[0], 1), (A.shape[0], B.shape[0]))
    sqrB = np.broadcast_to(np.sum(np.power(B, 2), 1).reshape(B.shape[0], 1), (B.shape[0], A.shape[0])).transpose()

    return sqrA - 2*np.matmul(A, B.transpose()) + sqrB

def cluster_performance_dbi(values_scaled, clusters, num_clusters):
    print("values_scaled: {}".format(values_scaled))
    values_per_cluster = [np.take(values_scaled, c, axis=0) for c in clusters]
    means = np.array([np.mean(c, axis=0) for c in values_per_cluster])
    print("values_per_cluster: {}".format(values_per_cluster[0][:10]))
    print("means: {}".format(means))
    assigned_cluster_mean = np.zeros((len(values_scaled), len(values_scaled[0])))
    for i, c in enumerate(clusters):
        assigned_cluster_mean[c] = means[i]
    print("assigned_cluster_mean: {}".format(assigned_cluster_mean))
        
    dists_from_means = np.linalg.norm(values_scaled-assigned_cluster_mean, axis=1)
    print("dists_from_means: {}".format([dists_from_means[c] for c in clusters]))
    s = np.array([np.sqrt(1./len(c) * np.sum(dists_from_means[c])) for c in clusters])
    print("s: {}".format(s))
    
    dists_between_clusters = all_pairs_euclid_squared_numpy(means, means)
    np.fill_diagonal(dists_between_clusters, np.nan)
    print("dists_between_clusters: {}".format(dists_between_clusters))
    
    r = np.tile(s, (num_clusters, 1))
    r = (r + r.T) / dists_between_clusters
    print("r: {}".format(r))
    d = np.nanmax(r, axis=1)
    dbi = np.mean(d)
    print("Davies-Bouldin index per cluster: {}".format(d))
    print("Davies-Bouldin index total: {}".format(dbi))

In [None]:
def describe_clusters(optigrid, data, parameters):
    values = ['mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.DataFrame(columns = pd.MultiIndex.from_tuples([(p, v) for p in parameters for v in values] + [('DENSITY', 'count'), ('DENSITY', 'percentage')]))
    result.index.name = 'OPTIGRID_CLUSTER'
    
    for i, cluster in enumerate(optigrid.clusters):
        cluster_data = np.take(data, cluster, axis=0)
        mean = np.mean(cluster_data, axis=0)
        std = np.std(cluster_data, axis=0)
        quantiles = np.quantile(cluster_data, [0, 0.25, 0.5, 0.75, 1], axis=0)
        cluster_description = [[mean[i], std[i], quantiles[0][i], quantiles[1][i], quantiles[2][i], quantiles[3][i], quantiles[4][i]] for i in range(len(parameters))]
        cluster_description = [item for sublist in cluster_description for item in sublist]
        cluster_description.append(len(cluster))
        cluster_description.append(len(cluster)/len(data)*100)
        result.loc[i] = cluster_description
    return result

described = describe_clusters(optigrid, data, parameters)

In [None]:
pd.set_option('display.max_columns', 500)
wanted_statistics = [[(param, 'mean'), (param, 'std')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist] + [('DENSITY', 'percentage')]

num_of_clusters_to_print = 10
described.sort_values(by=[('DENSITY', 'percentage')], ascending=False, inplace = True)
print("Sum of densities of printed clusters: {:.1f}%".format(described.head(n=num_of_clusters_to_print)[('DENSITY', 'percentage')].sum()))
described.head(n=num_of_clusters_to_print)[wanted_statistics].round(3)

For visualizing the clusters we will plot the densities of the parameters. For comparability we will use explicit ranges for the x-axis per parameter. Those ranges should be chosen beforehand by an expert to validate or falsify his intuition.

In [None]:
num_clusters = 6 # number of clusters to visualize
data = df[parameters].values # We select the unscaled data again, because by clustering we did not change any ordering and this data corresponds to the real world
num_datapoints = len(data)

resolution = 200
bandwidth = [1, 0.01, 1, 10, 0.1, 0.001]
num_kde_samples = 40000

parameter_ranges = [[0,0] for i in range(len(parameters))]
parameter_ranges[0] = [-300, -200] # Biasdisc x-axis

parameter_ranges[1] = [5.1, 5.3] # Gas x-axis
#parameter_ranges[2] = [0, 3] # High voltage current x-axis
parameter_ranges[2] = [200, 300] # SolCen current x-axis
#parameter_ranges[3] = [900, 2100] # Forwardpower x-axis
parameter_ranges[3] = [1200, 1300] # SolExt current x-axis
parameter_ranges[4] = [5, 20] # Oven1 power x-axis
parameter_ranges[5] = [0, 0.05] # BCT25 current x-axis

best_clusters = sorted(optigrid.clusters, key=lambda x: len(x), reverse=True)
for i, cluster in enumerate(best_clusters[:num_clusters]):
    median = [described.iloc[i,described.columns.get_loc((param, '50%'))] for param in parameters]
    plot_cluster(data, cluster, parameters, parameter_ranges, resolution=resolution, median=median, bandwidth=bandwidth, percentage_of_values=1, num_kde_samples=num_kde_samples)

Now, we want to find all high voltage breakdowns that correspond to the currently considered source stability, and find out to which cluster each datapoint belongs.

In [None]:
wanted_statistics.append(('num_of_breakdowns', ''))
described.head(n=num_of_clusters_to_print)[wanted_statistics].round(3)

In [None]:
wanted_statistics = [[(param, 'mean')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist] + [('num_of_breakdowns', '')]
corr_described = described[wanted_statistics].corr()
corr_described.style.background_gradient(cmap='coolwarm', axis=None).set_precision(2)

In [None]:
pd.set_option('display.max_columns', 500)
wanted_statistics = [[(param, 'mean'), (param, 'std'),  (param, 'min'),  (param, 'max')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist]
df_breakdowns.groupby('is_breakdown').describe()[wanted_statistics].round(2)

In [None]:
import numpy as np

def d(x,y):
    return np.linalg.norm(x-y)

size = 10
data = np.random.uniform(0, 1, (size, 1))
data

In [None]:
a = np.array([np.sum(np.linalg.norm(data-x, axis=1)) for x in data]) / (size - 1)
a

In [None]:
import pandas as pd
import numpy as np

values = [0, 2, 2, 2, 3, 3, 1]
values = np.array([[x, x] for x in values])
clusters = [[0, 6], [1, 2, 3], [4, 5]]

values, clusters

In [None]:
cluster_performance_dbi(values, clusters, len(clusters))

In [None]:
from sklearn.metrics import davies_bouldin_score

davies_bouldin_score(values, [0, 1, 1, 1, 2, 2, 0])

In [None]:
source_stable = 1
print("Starting clustering for source stability {}".format(source_stable))
source_stability = df_total['source_stable'] == source_stable
voltage_breakdown_selection = df_total['is_breakdown'] > 0

values = select_values(df_total, parameters, source_stability, ~voltage_breakdown_selection) # First, get the data without breakdowns,
scaler, values_scaled = scale_values(values, None) # standard scale it
print(values_scaled)
optigrid = run_optigrid(values_scaled, optigrid_params) # and compute the clusters.
print(values_scaled)
#assign_clusters_df_total(df_total, optigrid, len(values), source_stability, ~voltage_breakdown_selection) # Then, assign the found clusters to the original dataframe in a new column 'optigrid_clusters'
print("Calculating cluster performance cluster performance")
#cluster_performance_silhouette(df_total, values_scaled, optigrid.clusters, source_stability, voltage_breakdown_selection, optigrid.num_clusters)