# Number of clusters

Let's determine a suitable number of clusters, so that the setup used in the live session is justified. We do this by clustering each objective separately and checking which number of clusters works for each of them.

In [1]:
%matplotlib inline
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from ASF import ASF
from gradutil import *
from pyomo.opt import SolverFactory
seedn = 1

In [2]:
%%time
revenue, carbon, deadwood, ha = init_boreal()
n_revenue = nan_to_bau(revenue)
n_carbon = nan_to_bau(carbon)
n_deadwood = nan_to_bau(deadwood)
n_ha = nan_to_bau(ha)
ide = ideal(False)
nad = nadir(False)
opt = SolverFactory('glpk')

In [3]:
x = pd.concat((n_revenue, n_carbon, n_deadwood, n_ha), axis=1)
x_stack = np.dstack((n_revenue, n_carbon, n_deadwood, n_ha))

x_norm = normalize(x.values)
x_norm_stack = normalize(x_stack)
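
The `normalize` helper comes from `gradutil`, and its implementation is not shown here. Assuming it is a plain min-max scaling to the 0-1 range, a minimal sketch could look like the following. `minmax_normalize` is a hypothetical name, not part of `gradutil`, and scaling over the whole array (rather than per column) is an assumption.

```python
import numpy as np

def minmax_normalize(values):
    """Scale an array to [0, 1] using its global min and max, ignoring NaNs.

    Hypothetical stand-in for gradutil.normalize; global scaling is an assumption.
    """
    values = np.asarray(values, dtype=float)
    lo = np.nanmin(values)
    hi = np.nanmax(values)
    if hi == lo:
        # Constant input: avoid division by zero and map everything to 0.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Small illustrative array, not data from this notebook.
demo = np.array([[10.0, 20.0], [30.0, 50.0]])
scaled = minmax_normalize(demo)
```

With a common scale like this, distance sums for different objectives are at least comparable in magnitude.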

Find the number of clusters at which solving a single problem starts to take a second or more, so that the user's waiting time can be kept below that.

In [4]:
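%%time
import time
dur = 0
nclust1 = 50
# Grow the cluster count in steps of 50 until a single solve takes at least one second.
while dur < 1:
    nclust1 += 50
    c, xtoc, dist = cluster(x_norm, nclust1, seedn, verbose=0)
    # Cluster sizes, passed as weights to the ASF problem.
    w = np.array([sum(xtoc == i) for i in range(nclust1)])
    # Mean of the normalized values over the members of each cluster.
    c_mean = np.array([x_norm_stack[xtoc == i].mean(axis=0) for i in range(nclust1)])
    # Time only the construction and solution of the optimization problem.
    start = time.time()
    ref = np.array((ide[0], 0, 0, 0))
    asf = ASF(ide, nad, ref, c_mean, weights=w)
    opt.solve(asf.model)
    dur = time.time() - start
# First cluster count whose solve time was one second or more.
print(nclust1)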

So, if possible, we try to keep the total number of clusters below the printed value.
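
The same search can be written as a small reusable helper. The sketch below is an illustration, not the notebook's code: `first_over_budget`, `timed` and `fake_measure` are hypothetical names, and `measure(n)` is assumed to return the solve time in seconds for `n` clusters.

```python
import time

def timed(func, *args):
    """Return the wall-clock seconds that one call to func(*args) takes."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

def first_over_budget(measure, start=100, step=50, budget=1.0, limit=10_000):
    """Return the first cluster count whose measured time reaches the budget.

    measure(n) must return a duration in seconds. Returns None if no
    count up to `limit` reaches the budget.
    """
    n = start
    while n <= limit:
        if measure(n) >= budget:
            return n
        n += step
    return None

def fake_measure(n):
    # Illustrative stand-in: pretend the solve time grows linearly with n.
    return n / 250.0

threshold = first_over_budget(fake_measure)
```

In the notebook, `measure` would wrap the clustering and the `opt.solve` call, for example through `timed`. Passing the measurement in as a function keeps the search logic testable without running the solver.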

In [5]:
def kmeans_and_eval(x, rng):
    distsum = []
    for nclust in rng:
        c, xtoc, dist = cluster(x, nclust, seedn, verbose=0)
        distsum.append(np.nansum(dist))
    return distsum
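
`cluster` is a `gradutil` helper whose internals are not shown here; the function above only relies on it returning per-point distances in `dist`. Assuming those are Euclidean distances from each point to its assigned cluster center, the quantity being summed can be sketched with plain NumPy as follows. `intra_cluster_distance_sum` is a hypothetical name, and the points are made up for illustration.

```python
import numpy as np

def intra_cluster_distance_sum(x, centers, labels):
    """Sum of Euclidean distances from each row of x to its assigned center."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Pick the assigned center for every point, then measure the offsets.
    diffs = x - centers[labels]
    # nansum skips points whose distance is undefined (NaN coordinates).
    return float(np.nansum(np.linalg.norm(diffs, axis=1)))

points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])

# One center for all points versus one center per pair of points.
one = intra_cluster_distance_sum(points, [[2.0, 1.0]], [0, 0, 0, 0])
two = intra_cluster_distance_sum(points, [[0.0, 1.0], [4.0, 1.0]], [0, 0, 1, 1])
```

Here the two-center assignment gives a smaller sum than the single center, which is the effect the evaluation over `rng` measures.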

In [6]:
rng = range(50,251,20)
distsum_revenue = kmeans_and_eval(x_norm[:,:7], rng)
distsum_carbon = kmeans_and_eval(x_norm[:,7:14], rng)
distsum_deadwood = kmeans_and_eval(x_norm[:,14:21], rng)
distsum_ha = kmeans_and_eval(x_norm[:,21:], rng)

In [7]:
plt.rcParams['figure.figsize'] = (15, 12)

fig, ax = plt.subplots(2,2)
fig.suptitle('Number of clusters and sum of intra-cluster distances')

ax[0,0].plot(rng, distsum_revenue)
ax[0,0].set_title('Revenue')

ax[0,1].plot(rng, distsum_carbon)
ax[0,1].set_title('Carbon')

ax[1,0].plot(rng, distsum_deadwood)
ax[1,0].set_title('Deadwood')

ax[1,1].plot(rng, distsum_ha)
ax[1,1].set_title('HA')

From the plots we can only conclude that more is more: the more clusters we have, the more accurate the results are. We also see that k-means does not handle the HA values as well as, for example, the revenue values. All variables are normalized to the 0-1 scale, so that cannot be the reason. There is simply something awkward in the data: the HA indices are calculated using some nonlinear approximations, so k-means does not handle them as gracefully.
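
One way to check numerically whether the curves have an elbow is to look at the relative drop in the distance sum between consecutive cluster counts. If the drops shrink smoothly with no abrupt change, no single count stands out. A minimal sketch, with `relative_drops` as a hypothetical helper and made-up numbers rather than results from this notebook:

```python
def relative_drops(counts, distsums):
    """Return (count, fractional decrease vs. the previous count) pairs."""
    drops = []
    for i in range(1, len(counts)):
        prev, cur = distsums[i - 1], distsums[i]
        drops.append((counts[i], (prev - cur) / prev))
    return drops

# Made-up values for illustration only, not output of this notebook.
example_counts = [50, 70, 90, 110]
example_distsums = [100.0, 80.0, 72.0, 68.4]
example_drops = relative_drops(example_counts, example_distsums)
```

For the notebook's own curves the call would be, for example, `relative_drops(list(rng), distsum_revenue)`.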

We now also have the map data, so we could use it here as well. That might give better results, but it would mean more data handling without contributing much to this thesis.

## Conclusion

We simply fix the number of clusters to be as large as we want, which in this case is 200 clusters, keeping the calculation time under one second.