Our preliminary tests showed, that Optigrid performed well on Data of November 2018. In this notebook we will explore a few other months and look for similarities and descrepancies in the results. We will use preprocessed data, where things like source stability and voltage breakdowns are indicated. Moreover, for now we will limit ourselfs to stable running sources, i.e. time periods with a low variance and a high current in the BCT25. We use the already preprocessed datasets.

### Module loading
We use the Python modules from the ionsrcopt package that will be loaded in the next cells.

In [1]:
%run ../ionsrcopt/import_notebooks/Setup.ipynb

In [2]:
%run ../ionsrcopt/import_notebooks/Clustering.ipynb

First, we need to specifiy all the columns we are interested in. There are three types: Parameters, these are the ones that will be clustered later on, Measurments and columns from preprocessing.

In [3]:
time = [SourceFeatures.TIMESTAMP]
parameters = [
        SourceFeatures.BIASDISCAQNV, 
        SourceFeatures.GASAQN, 
        SourceFeatures.THOMSON_FORWARDPOWER,
        SourceFeatures.SOLINJ_CURRENT,
        SourceFeatures.SOLCEN_CURRENT,
        SourceFeatures.SOLEXT_CURRENT,
        SourceFeatures.SOURCEHTAQNI]
measurements = [
        SourceFeatures.OVEN1AQNP,
        SourceFeatures.BCT25_CURRENT]
preprocessing = [
        ProcessingFeatures.SOURCE_STABILITY, 
        ProcessingFeatures.HT_VOLTAGE_BREAKDOWN, 
        ProcessingFeatures.DATAPOINT_DURATION,
        ProcessingFeatures.SOURCE_RUNNING]

columns_to_load = time + parameters + measurements + preprocessing

Next, specify the important files..

In [4]:
input_folder = '../Data_Preprocessed/'
input_files = ['Jan2016.csv', 'Feb2016.csv', 'Mar2016.csv', 'Apr2016.csv', 'May2016.csv', 'Jun2016.csv', 'Jul2016.csv', 'Aug2016.csv', 'Sep2016.csv', 'Oct2016.csv', 'Nov2016.csv']
input_paths = [input_folder + f for f in input_files]
output_folder = '../Data_Clustered/'
output_file = 'JanNov2016.csv'
output_path = output_folder + output_file

cluster_logfile = output_folder + 'cluster_runs.log'

...and load them.

In [5]:
df_total = read_data_from_csv(input_paths, columns_to_load, None)
df_total = fill_columns(df_total, None, fill_nan_with_zeros=True)
df_total = convert_column_types(df_total)
df_total.memory_usage()

Loading data from csv file '../Data_Preprocessed/Jan2016.csv'
Loading data from csv file '../Data_Preprocessed/Feb2016.csv'
Loading data from csv file '../Data_Preprocessed/Mar2016.csv'
Loading data from csv file '../Data_Preprocessed/Apr2016.csv'
Loading data from csv file '../Data_Preprocessed/May2016.csv'
Loading data from csv file '../Data_Preprocessed/Jun2016.csv'
Loading data from csv file '../Data_Preprocessed/Jul2016.csv'
Loading data from csv file '../Data_Preprocessed/Aug2016.csv'
Loading data from csv file '../Data_Preprocessed/Sep2016.csv'
Loading data from csv file '../Data_Preprocessed/Oct2016.csv'
Loading data from csv file '../Data_Preprocessed/Nov2016.csv'
Forward filling missing values...
Converting column types...


Index                            39764480
IP.NSRCGEN:BIASDISCAQNV          19882240
IP.NSRCGEN:GASAQN                19882240
IP.NSRCGEN:OVEN1AQNP             19882240
IP.NSRCGEN:RFTHOMSONAQNFWD       19882240
IP.NSRCGEN:SOURCEHTAQNI          19882240
IP.SOLCEN.ACQUISITION:CURRENT    19882240
IP.SOLEXT.ACQUISITION:CURRENT    19882240
IP.SOLINJ.ACQUISITION:CURRENT    19882240
ITF.BCT25:CURRENT                19882240
datapoint_duration               19882240
source_stable                    19882240
ht_voltage_breakdown             19882240
source_running                    4970560
dtype: int64

In [6]:
df_total.shape

(4970560, 13)

Now we select what data we are interested in.

In [7]:
def select_values(df_total, parameters, selector):
    data = df_total.loc[selector, parameters].values
    weights = df_total.loc[selector, ProcessingFeatures.DATAPOINT_DURATION].values
    return data, weights

Once the data is ready we can begin clustering. But first we standard scale it, so that all parameters have the same variance.

In [8]:
from sklearn import preprocessing

def scale_values(values, scaler):
    if not scaler:
        scaler = preprocessing.RobustScaler((10,90)).fit(values)
    values_scaled = scaler.transform(values)
    return scaler, values_scaled

The parameters for optigrid can be chosen by visually examening the distribution of normalized data, see below.

In [9]:
optigrid_params = {
    'd' : len(parameters), 
    'q' : 1, 
    'max_cut_score' : 0.04, 
    'noise_level' : 0.05,
    'kde_bandwidth' : [0.014, 0.011, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014],
    'verbose' : True}

In [10]:
def run_optigrid(values_scaled, weights, optigrid_params):
    optigrid = Optigrid(**optigrid_params)
    optigrid.fit(values_scaled, weights)
    return optigrid

Once the clusters are found, we set an according column in the original dataframe containing all data.

In [11]:
def assign_clusters_df_total(df_total, optigrid, num_values, selector):
    clusters = np.zeros(num_values)

    for i, cluster in enumerate(optigrid.clusters):
        clusters[cluster] = i
    df_total.loc[selector, ProcessingFeatures.CLUSTER] = clusters

And here we bundle all these steps together.

In [12]:
def cluster(df_total, parameters, source_stable, optigrid_params):
    print("Starting clustering for source stability {}".format(source_stable))
    source_stability = df_total[ProcessingFeatures.SOURCE_STABILITY] == source_stable
    voltage_breakdown_selection = df_total[ProcessingFeatures.HT_VOLTAGE_BREAKDOWN] > 0
    source_running = df_total[ProcessingFeatures.SOURCE_RUNNING] == True
    
    selector = source_stability & ~voltage_breakdown_selection & source_running
    values, weights = select_values(df_total, parameters, selector) # First, get the data without breakdowns,
    scaler, values_scaled = scale_values(values, None) # standard scale it
    optigrid = run_optigrid(values_scaled, weights, optigrid_params) # and compute the clusters.
    assign_clusters_df_total(df_total, optigrid, len(values), selector) # Then, assign the found clusters to the original dataframe in a new column 'optigrid_clusters'
    
    print("Scoring voltage breakdowns")
    selector = source_stability & voltage_breakdown_selection & source_running
    values, weights = select_values(df_total, parameters, selector) # Now, get the datapoints when the voltage broke down
    _, values_scaled = scale_values(values, scaler) # scale it to the same ranges
    scored_samples = optigrid.score_samples(values_scaled) # and find the corresponding clusters.
    df_total.loc[selector, ProcessingFeatures.CLUSTER] = scored_samples

In [13]:
df_total[ProcessingFeatures.CLUSTER] = -1
cluster(df_total, parameters, 1, optigrid_params)
cluster(df_total, parameters, 0, optigrid_params)

Starting clustering for source stability 1
Found following cuts: [(1.771704825488004, 4, 3.199275662846151e-60)]
Evaluating subgrid: 98.87% of datapoints
Found following cuts: [(-0.7697825528154469, 5, 1.5177394613060173e-53)]
Evaluating subgrid: 6.94% of datapoints
Found following cuts: [(0.23725503261643222, 4, 2.6252298415117064e-80)]
Evaluating subgrid: 6.28% of datapoints
Found following cuts: [(0.21134945840546582, 3, 1.0143585980132177e-39)]
Evaluating subgrid: 4.86% of datapoints
Found following cuts: [(-1.2975912521583866, 0, 1.4162223613298884e-21)]
Evaluating subgrid: 2.23% of datapoints
Found following cuts: [(-0.3735504072121899, 4, 1.1443755555599827e-19)]
Evaluating subgrid: 1.82% of datapoints
Found following cuts: [(-0.07983980470835561, 1, 4.726839425250721e-06)]
Evaluating subgrid: 0.63% of datapoints
Found following cuts: [(-1.031938744313789, 5, 0.020921857262081376)]
Evaluating subgrid: 0.34% of datapoints
Found cluster 0: 0.34% of datapoints
Evaluating subgrid: 0

Found cluster 41: 0.44% of datapoints
Evaluating subgrid: 0.18% of datapoints
Found cluster 42: 0.18% of datapoints
Evaluating subgrid: 37.26% of datapoints
Found following cuts: [(1.0912976018106095, 0, 3.935324224491255e-22)]
Evaluating subgrid: 36.64% of datapoints
Found following cuts: [(-1.8681255013051659, 4, 1.1298832541043243e-17)]
Evaluating subgrid: 4.54% of datapoints
Found following cuts: [(-0.815589002888612, 1, 1.6971014150676444e-233)]
Evaluating subgrid: 0.31% of datapoints
Found cluster 43: 0.31% of datapoints
Evaluating subgrid: 4.23% of datapoints
Found following cuts: [(-0.6549756755732526, 2, 2.070706323233933e-71)]
Evaluating subgrid: 0.29% of datapoints
Found cluster 44: 0.29% of datapoints
Evaluating subgrid: 3.94% of datapoints
Found cluster 45: 3.94% of datapoints
Evaluating subgrid: 32.10% of datapoints
Found following cuts: [(-0.33138873360373755, 1, 2.389962345104719e-28)]
Evaluating subgrid: 6.00% of datapoints
Found following cuts: [(0.29260909715385147, 

Found following cuts: [(-1.0810886476979111, 2, 6.531848899488311e-27)]
Evaluating subgrid: 0.48% of datapoints
Found cluster 85: 0.48% of datapoints
Evaluating subgrid: 7.53% of datapoints
Found following cuts: [(-0.1261994618960101, 4, 1.7120514029289708e-15)]
Evaluating subgrid: 5.65% of datapoints
Found following cuts: [(-0.3508345040709081, 4, 1.4921928951626324e-12)]
Evaluating subgrid: 5.09% of datapoints
Found following cuts: [(-0.16638604860113115, 2, 1.756484429319844e-47)]
Evaluating subgrid: 4.29% of datapoints
Found following cuts: [(0.5457773879923001, 1, 2.798732014514027e-10)]
Evaluating subgrid: 0.32% of datapoints
Found cluster 86: 0.32% of datapoints
Evaluating subgrid: 3.97% of datapoints
Found following cuts: [(0.482905971055681, 0, 1.601312438796204e-12)]
Evaluating subgrid: 2.79% of datapoints
Found cluster 87: 2.79% of datapoints
Evaluating subgrid: 1.18% of datapoints
Found following cuts: [(0.5621722099756954, 5, 0.023216026442585053)]
Evaluating subgrid: 0.82

Found following cuts: [(-0.49765806776857135, 3, 5.369984955537967e-14)]
Evaluating subgrid: 0.48% of datapoints
Found following cuts: [(0.03086328664512345, 5, 0.011787677446698626)]
Evaluating subgrid: 0.29% of datapoints
Found cluster 128: 0.29% of datapoints
Evaluating subgrid: 0.19% of datapoints
Found cluster 129: 0.19% of datapoints
Evaluating subgrid: 1.65% of datapoints
Found following cuts: [(-0.639295467824647, 1, 5.17172305383954e-29)]
Evaluating subgrid: 0.65% of datapoints
Found following cuts: [(0.03227941240325119, 5, 0.01578515970840448)]
Evaluating subgrid: 0.44% of datapoints
Found cluster 130: 0.44% of datapoints
Evaluating subgrid: 0.21% of datapoints
Found cluster 131: 0.21% of datapoints
Evaluating subgrid: 1.00% of datapoints
Found following cuts: [(-0.25807822021571075, 1, 3.979225831179978e-11)]
Evaluating subgrid: 0.36% of datapoints
Found cluster 132: 0.36% of datapoints
Evaluating subgrid: 0.63% of datapoints
Found following cuts: [(0.03086328664512345, 5, 

Found cluster 171: 0.48% of datapoints
Evaluating subgrid: 0.45% of datapoints
Found cluster 172: 0.45% of datapoints
Evaluating subgrid: 3.66% of datapoints
Found following cuts: [(0.2025188472535875, 3, 1.0085372234564636e-31)]
Evaluating subgrid: 3.36% of datapoints
Found following cuts: [(0.8758242876842769, 4, 7.99043916379331e-18)]
Evaluating subgrid: 1.78% of datapoints
Found cluster 173: 1.78% of datapoints
Evaluating subgrid: 1.58% of datapoints
Found following cuts: [(-0.21639471963951062, 0, 1.1742104383909332e-07)]
Evaluating subgrid: 1.21% of datapoints
Found following cuts: [(0.1279461309313774, 2, 2.638474506530547e-05)]
Evaluating subgrid: 0.58% of datapoints
Found following cuts: [(-0.5042774713400638, 5, 0.024197625651118538)]
Evaluating subgrid: 0.28% of datapoints
Found cluster 174: 0.28% of datapoints
Evaluating subgrid: 0.30% of datapoints
Found cluster 175: 0.30% of datapoints
Evaluating subgrid: 0.63% of datapoints
Found following cuts: [(-0.5028674530260491, 5,

Found cluster 16: 1.29% of datapoints
Evaluating subgrid: 2.75% of datapoints
Found following cuts: [(-0.46320346268740575, 0, 1.9272648766621883e-229)]
Evaluating subgrid: 2.18% of datapoints
Found cluster 17: 2.18% of datapoints
Evaluating subgrid: 0.57% of datapoints
Found following cuts: [(-0.9695763058132596, 5, 0.02405341446038802)]
Evaluating subgrid: 0.35% of datapoints
Found cluster 18: 0.35% of datapoints
Evaluating subgrid: 0.21% of datapoints
Found cluster 19: 0.21% of datapoints
Evaluating subgrid: 2.84% of datapoints
Found following cuts: [(0.2617856360445119, 1, 7.656649656090495e-14)]
Evaluating subgrid: 0.78% of datapoints
Found cluster 20: 0.78% of datapoints
Evaluating subgrid: 2.06% of datapoints
Found following cuts: [(0.38748620134411427, 3, 8.011592504725076e-58)]
Evaluating subgrid: 0.63% of datapoints
Found following cuts: [(-0.9685997083933666, 5, 0.021693137470640722)]
Evaluating subgrid: 0.27% of datapoints
Found cluster 21: 0.27% of datapoints
Evaluating su

Found cluster 60: 0.66% of datapoints
Evaluating subgrid: 8.53% of datapoints
Found following cuts: [(1.3495430621233853, 3, 3.131925242567481e-61)]
Evaluating subgrid: 7.70% of datapoints
Found following cuts: [(-0.5057719909783566, 0, 6.224773197261259e-16)]
Evaluating subgrid: 0.58% of datapoints
Found cluster 61: 0.58% of datapoints
Evaluating subgrid: 7.12% of datapoints
Found following cuts: [(-0.22291644053025683, 3, 1.7605967725406556e-15)]
Evaluating subgrid: 4.41% of datapoints
Found following cuts: [(0.07828066204533446, 1, 3.0267822854770748e-06)]
Evaluating subgrid: 2.65% of datapoints
Found following cuts: [(-0.10151857318300195, 1, 3.960257216671106e-08)]
Evaluating subgrid: 0.90% of datapoints
Found cluster 62: 0.90% of datapoints
Evaluating subgrid: 1.75% of datapoints
Found following cuts: [(0.39249639408756987, 0, 1.7056235167172367e-25)]
Evaluating subgrid: 0.84% of datapoints
Found following cuts: [(1.0288149280981584, 5, 0.02398072986103862)]
Evaluating subgrid: 0

Found cluster 100: 0.21% of datapoints
Evaluating subgrid: 0.17% of datapoints
Found cluster 101: 0.17% of datapoints
Evaluating subgrid: 1.59% of datapoints
Found following cuts: [(0.5266433847701234, 5, 0.03775860776534893)]
Evaluating subgrid: 1.05% of datapoints
Found cluster 102: 1.05% of datapoints
Evaluating subgrid: 0.54% of datapoints
Found cluster 103: 0.54% of datapoints
Evaluating subgrid: 1.54% of datapoints
Found cluster 104: 1.54% of datapoints
Evaluating subgrid: 2.24% of datapoints
Found following cuts: [(0.39991085171097457, 5, 8.021651862184541e-11)]
Evaluating subgrid: 1.85% of datapoints
Found following cuts: [(0.1655844016508623, 0, 1.2079914916975188e-18)]
Evaluating subgrid: 0.88% of datapoints
Found following cuts: [(0.028933294625444847, 5, 0.028066024708839454)]
Evaluating subgrid: 0.35% of datapoints
Found cluster 105: 0.35% of datapoints
Evaluating subgrid: 0.53% of datapoints
Found cluster 106: 0.53% of datapoints
Evaluating subgrid: 0.97% of datapoints
Fo

Found cluster 149: 1.82% of datapoints
Evaluating subgrid: 0.92% of datapoints
Found following cuts: [(-1.1987376273280441, 3, 5.922322827751026e-17)]
Evaluating subgrid: 0.29% of datapoints
Found cluster 150: 0.29% of datapoints
Evaluating subgrid: 0.63% of datapoints
Found following cuts: [(0.529001448491607, 5, 0.03357708358135046)]
Evaluating subgrid: 0.37% of datapoints
Found cluster 151: 0.37% of datapoints
Evaluating subgrid: 0.26% of datapoints
Found cluster 152: 0.26% of datapoints
Optigrid found 153 clusters.
Scoring voltage breakdowns


#### Long term storage
We will save the clustered data to a file.

First, create the logging string.

In [14]:
from datetime import datetime

now = datetime.now()
dt_string = now.strftime("%d.%m.%Y %H:%M:%S")

logstring = "[{}] \'{}\' cluster results saved to \'{}\'. Columns used: {}. Parameters used: {}\n".format(dt_string, input_paths, output_path, parameters, optigrid_params)
with open(cluster_logfile, "a") as myfile:
    myfile.write(logstring)

logstring

"[13.12.2019 10:25:22] '['../Data_Preprocessed/Jan2016.csv', '../Data_Preprocessed/Feb2016.csv', '../Data_Preprocessed/Mar2016.csv', '../Data_Preprocessed/Apr2016.csv', '../Data_Preprocessed/May2016.csv', '../Data_Preprocessed/Jun2016.csv', '../Data_Preprocessed/Jul2016.csv', '../Data_Preprocessed/Aug2016.csv', '../Data_Preprocessed/Sep2016.csv', '../Data_Preprocessed/Oct2016.csv', '../Data_Preprocessed/Nov2016.csv']' cluster results saved to '../Data_Clustered/JanNov2016.csv'. Columns used: ['IP.NSRCGEN:BIASDISCAQNV', 'IP.NSRCGEN:GASAQN', 'IP.NSRCGEN:RFTHOMSONAQNFWD', 'IP.SOLINJ.ACQUISITION:CURRENT', 'IP.SOLCEN.ACQUISITION:CURRENT', 'IP.SOLEXT.ACQUISITION:CURRENT', 'IP.NSRCGEN:SOURCEHTAQNI']. Parameters used: {'d': 7, 'q': 1, 'max_cut_score': 0.04, 'noise_level': 0.05, 'kde_bandwidth': [0.014, 0.011, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014], 'verbose': True}\n"

Now we can save the dataframe to a file.

In [15]:
df_total = df_total.astype({ProcessingFeatures.CLUSTER : 'int64'})
df_total[df_total.shift(1)==df_total] = np.nan
df_total.to_csv(output_path)

In [None]:
cluster_performance_dbi(values_scaled, optigrid.clusters, optigrid.num_clusters)

In [None]:
def silhouette_coefficient(a, b):
    if a < b:
        return 1 - a/b
    elif a == b:
        return 0
    else:
        return b/a - 1
    
def cluster_performance_silhouette(df_total, values_scaled, clusters, source_stability, voltage_breakdown_selection, num_clusters):
    mean_distances = np.array([np.array([np.sum(np.linalg.norm(values_scaled[cluster]-x, axis=1)) / len(cluster) for cluster in clusters]) for x in values_scaled])
    optigrid_cluster = df_total.loc[source_stability & voltage_breakdown_selection, 'optigrid_cluster']
    selector = np.ones((len(values_scaled), num_clusters), dtype=bool)
    selector[range(len(values)), optigrid_cluster] = False
    print(mean_distances)
    print(optigrid_cluster)
    print(selector)
    print(np.ma.masked_array(mean_distances, ~selector))
    df_total.loc[source_stability & voltage_breakdown_selection, 'mean_dist_same_cluster'] = np.amin(np.ma.masked_array(mean_distances, selector), axis=1)
    df_total.loc[source_stability & voltage_breakdown_selection, 'min_mean_dist_different_cluster'] = np.amin(np.ma.masked_array(mean_distances, ~selector), axis=1)
    df_total.loc[source_stability & voltage_breakdown_selection, 'silhouette'] = np.vectorize(silhouette_coefficient)(df_total.loc[source_stability & voltage_breakdown_selection, 'mean_dist_same_cluster'], df_total.loc[source_stability & voltage_breakdown_selection, 'min_mean_dist_different_cluster'])

In [None]:
def all_pairs_euclid_squared_numpy(A, B):
    sqrA = np.broadcast_to(np.sum(np.power(A, 2), 1).reshape(A.shape[0], 1), (A.shape[0], B.shape[0]))
    sqrB = np.broadcast_to(np.sum(np.power(B, 2), 1).reshape(B.shape[0], 1), (B.shape[0], A.shape[0])).transpose()

    return sqrA - 2*np.matmul(A, B.transpose()) + sqrB

def cluster_performance_dbi(values_scaled, clusters, num_clusters):
    print("values_scaled: {}".format(values_scaled))
    values_per_cluster = [np.take(values_scaled, c, axis=0) for c in clusters]
    means = np.array([np.mean(c, axis=0) for c in values_per_cluster])
    print("values_per_cluster: {}".format(values_per_cluster[0][:10]))
    print("means: {}".format(means))
    assigned_cluster_mean = np.zeros((len(values_scaled), len(values_scaled[0])))
    for i, c in enumerate(clusters):
        assigned_cluster_mean[c] = means[i]
    print("assigned_cluster_mean: {}".format(assigned_cluster_mean))
        
    dists_from_means = np.linalg.norm(values_scaled-assigned_cluster_mean, axis=1)
    print("dists_from_means: {}".format([dists_from_means[c] for c in clusters]))
    s = np.array([np.sqrt(1./len(c) * np.sum(dists_from_means[c])) for c in clusters])
    print("s: {}".format(s))
    
    dists_between_clusters = all_pairs_euclid_squared_numpy(means, means)
    np.fill_diagonal(dists_between_clusters, np.nan)
    print("dists_between_clusters: {}".format(dists_between_clusters))
    
    r = np.tile(s, (num_clusters, 1))
    r = (r + r.T) / dists_between_clusters
    print("r: {}".format(r))
    d = np.nanmax(r, axis=1)
    dbi = np.mean(d)
    print("Davies-Bouldin index per cluster: {}".format(d))
    print("Davies-Bouldin index total: {}".format(dbi))

In [None]:
def describe_clusters(optigrid, data, parameters):
    values = ['mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.DataFrame(columns = pd.MultiIndex.from_tuples([(p, v) for p in parameters for v in values] + [('DENSITY', 'count'), ('DENSITY', 'percentage')]))
    result.index.name = 'OPTIGRID_CLUSTER'
    
    for i, cluster in enumerate(optigrid.clusters):
        cluster_data = np.take(data, cluster, axis=0)
        mean = np.mean(cluster_data, axis=0)
        std = np.std(cluster_data, axis=0)
        quantiles = np.quantile(cluster_data, [0, 0.25, 0.5, 0.75, 1], axis=0)
        cluster_description = [[mean[i], std[i], quantiles[0][i], quantiles[1][i], quantiles[2][i], quantiles[3][i], quantiles[4][i]] for i in range(len(parameters))]
        cluster_description = [item for sublist in cluster_description for item in sublist]
        cluster_description.append(len(cluster))
        cluster_description.append(len(cluster)/len(data)*100)
        result.loc[i] = cluster_description
    return result

described = describe_clusters(optigrid, data, parameters)

In [None]:
pd.set_option('display.max_columns', 500)
wanted_statistics = [[(param, 'mean'), (param, 'std')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist] + [('DENSITY', 'percentage')]

num_of_clusters_to_print = 10
described.sort_values(by=[('DENSITY', 'percentage')], ascending=False, inplace = True)
print("Sum of densities of printed clusters: {:.1f}%".format(described.head(n=num_of_clusters_to_print)[('DENSITY', 'percentage')].sum()))
described.head(n=num_of_clusters_to_print)[wanted_statistics].round(3)

For visualizing the clusters we will plot the densities of the parameters. For comparability we will use explicit ranges for the x-axis per parameter. Those ranges should be chosen beforehand by an expert to validate or falsify his intuition.

In [None]:
num_clusters = 6 # number of clusters to visualize
data = df[parameters].values # We select the unscaled data again, because by clustering we did not change any ordering and this data corresponds to the real world
num_datapoints = len(data)

resolution = 200
bandwidth = [1, 0.01, 1, 10, 0.1, 0.001]
num_kde_samples = 40000

parameter_ranges = [[0,0] for i in range(len(parameters))]
parameter_ranges[0] = [-300, -200] # Biasdisc x-axis

parameter_ranges[1] = [5.1, 5.3] # Gas x-axis
#parameter_ranges[2] = [0, 3] # High voltage current x-axis
parameter_ranges[2] = [200, 300] # SolCen current x-axis
#parameter_ranges[3] = [900, 2100] # Forwardpower x-axis
parameter_ranges[3] = [1200, 1300] # SolExt current x-axis
parameter_ranges[4] = [5, 20] # Oven1 power x-axis
parameter_ranges[5] = [0, 0.05] # BCT25 current x-axis

best_clusters = sorted(optigrid.clusters, key=lambda x: len(x), reverse=True)
for i, cluster in enumerate(best_clusters[:num_clusters]):
    median = [described.iloc[i,described.columns.get_loc((param, '50%'))] for param in parameters]
    plot_cluster(data, cluster, parameters, parameter_ranges, resolution=resolution, median=median, bandwidth=bandwidth, percentage_of_values=1, num_kde_samples=num_kde_samples)

Now, we want to find all high voltage breakdowns that correspond to the currently considered source stability, and find out to which cluster each datapoint belongs.

In [None]:
wanted_statistics.append(('num_of_breakdowns', ''))
described.head(n=num_of_clusters_to_print)[wanted_statistics].round(3)

In [None]:
wanted_statistics = [[(param, 'mean')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist] + [('num_of_breakdowns', '')]
corr_described = described[wanted_statistics].corr()
corr_described.style.background_gradient(cmap='coolwarm', axis=None).set_precision(2)

In [None]:
pd.set_option('display.max_columns', 500)
wanted_statistics = [[(param, 'mean'), (param, 'std'),  (param, 'min'),  (param, 'max')] for param in parameters]
wanted_statistics = [item for sublist in wanted_statistics for item in sublist]
df_breakdowns.groupby('is_breakdown').describe()[wanted_statistics].round(2)

In [None]:
import numpy as np

def d(x,y):
    return np.linalg.norm(x-y)

size = 10
data = np.random.uniform(0, 1, (size, 1))
data

In [None]:
a = np.array([np.sum(np.linalg.norm(data-x, axis=1)) for x in data]) / (size - 1)
a

In [None]:
import pandas as pd
import numpy as np

values = [0, 2, 2, 2, 3, 3, 1]
values = np.array([[x, x] for x in values])
clusters = [[0, 6], [1, 2, 3], [4, 5]]

values, clusters

In [None]:
cluster_performance_dbi(values, clusters, len(clusters))

In [None]:
from sklearn.metrics import davies_bouldin_score

davies_bouldin_score(values, [0, 1, 1, 1, 2, 2, 0])

In [None]:
source_stable = 1
print("Starting clustering for source stability {}".format(source_stable))
source_stability = df_total['source_stable'] == source_stable
voltage_breakdown_selection = df_total['is_breakdown'] > 0

values = select_values(df_total, parameters, source_stability, ~voltage_breakdown_selection) # First, get the data without breakdowns,
scaler, values_scaled = scale_values(values, None) # standard scale it
print(values_scaled)
optigrid = run_optigrid(values_scaled, optigrid_params) # and compute the clusters.
print(values_scaled)
#assign_clusters_df_total(df_total, optigrid, len(values), source_stability, ~voltage_breakdown_selection) # Then, assign the found clusters to the original dataframe in a new column 'optigrid_clusters'
print("Calculating cluster performance cluster performance")
#cluster_performance_silhouette(df_total, values_scaled, optigrid.clusters, source_stability, voltage_breakdown_selection, optigrid.num_clusters)