# Forming clusters


The initial idea is to cluster all the stands according to their similarity and then to form a separate surrogate for every cluster. That of course raises some questions:
- what is the similarity measure?
- what is a good cluster size?
- how do we use the surrogates in the end?

Because the clusters, and then also the surrogates, must be created automatically, we need some way to measure and compare different clusterings and surrogates. The best way to do that would be to feed them straight into the optimization and then see how they compare with each other and with the solution in the paper. For that we would just need the optimization procedure first.

Would it be possible to make a neural network out of the entire problem, so that the input would be the same as in the optimization problems already formulated and the output would be all the objective functions?


So actually we would just be writing the entire problem again...

But if we fed in the optimum of every objective function together with a great number of other solutions, would we get something useful and maybe unexpected out?

At least it could be worth trying!

In [23]:
len(str(7**29666))


At least it is possible to generate quite big datasets (the number above is the number of digits in the number of possible combinations).
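
The digit count can also be obtained without building the huge integer at all, which matters because recent Python versions refuse by default to convert integers with more than 4300 digits to a string. The sketch below uses a logarithm instead. The helper name `digits_of_power` is made up for illustration, and the floating-point result could be off by one when the power lies extremely close to a power of ten, which is not the case here.

```python
import math

def digits_of_power(base, exponent):
    """Number of decimal digits of base**exponent, without building the integer."""
    return math.floor(exponent * math.log10(base)) + 1

# Same quantity as len(str(7**29666)) in the cell above
n_digits = digits_of_power(7, 29666)
print(n_digits)
```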

#### Back to reality

Even though all the previous ideas would be great to try some day, after talking with Jussi it is probably still a better idea to just form the cluster surrogate and always choose one (virtual) sample to represent the entire cluster in the optimization. That way the optimization doesn't have to be altered much and I can focus more on just executing. So clustering it is!

## Hierarchical clustering

In [24]:
%matplotlib inline
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import pdist

In [25]:
data_dir = os.path.join(os.getcwd(), '../boreal_data')

carbon = pd.read_csv(os.path.join(data_dir, 'Carbon_storage.csv'))
HA = pd.read_csv(os.path.join(data_dir, 'Combined_HA.csv'))
deadwood = pd.read_csv(os.path.join(data_dir, 'Deadwood_volume.csv'))
revenue = pd.read_csv(os.path.join(data_dir, 'Timber_revenues.csv'))

In [26]:
X1 = carbon.copy()
X1[carbon.isnull()] = np.nanmin(carbon.values) - 1

In the following the cosine metric is used, because we want to ignore the size of the stands and compare only how similar their profiles are.
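
To see why the cosine metric ignores size, here is a small sketch on made-up vectors (the three toy "stands" are not taken from the data). A stand and a copy of it scaled by ten have cosine distance zero, while their Euclidean distance is large.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

stands = np.array([
    [1.0, 2.0, 3.0],     # small stand
    [10.0, 20.0, 30.0],  # same profile, ten times larger
    [3.0, 2.0, 1.0],     # same size as the first, different profile
])

cos_d = squareform(pdist(stands, metric='cosine'))
euc_d = squareform(pdist(stands, metric='euclidean'))

print(np.round(cos_d, 3))  # rows 0 and 1 are at distance 0 from each other
print(np.round(euc_d, 3))  # rows 0 and 1 are far apart
```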

In [27]:
Z100 = linkage(X1[:100], metric='cosine')

In [28]:
# Use the same metric as in linkage; pdist defaults to Euclidean otherwise
c100, coph_dists = cophenet(Z100, pdist(X1[:100], metric='cosine'))
c100

The cophenetic correlation coefficient is quite close to 1, which suggests the dendrogram preserves the pairwise distances well, so there is no need to be worried (?)
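
As a rough illustration of what the coefficient measures, the sketch below builds two tight synthetic groups (made-up data, not the stands) and compares the dendrogram distances with the original pairwise distances. It also shows that the value depends on which metric is passed to `pdist`, so the reference distances should use the same metric as the linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two tight synthetic groups pointing in clearly different directions
group_a = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((20, 2))
group_b = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((20, 2))
toy = np.vstack([group_a, group_b])

Z = linkage(toy, metric='cosine')

# Reference distances with the same metric as the linkage
c_same, _ = cophenet(Z, pdist(toy, metric='cosine'))
# Reference distances with pdist's default (Euclidean) metric
c_mixed, _ = cophenet(Z, pdist(toy))

print(c_same, c_mixed)
```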

In [29]:
# %matplotlib inline is already active, so the deprecated %pylab magic is not needed
plt.rcParams['figure.figsize'] = (15, 9)
plt.figure()
dendrogram(Z100)
plt.show()

Okay, this works with small data.


Now the question: what is this green cluster?!

In [30]:
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z100, 0.14, criterion='distance')
clusters
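
The call above cuts the tree at a fixed height with `criterion='distance'`; later cells ask for a fixed number of clusters with `criterion='maxclust'` instead. A minimal sketch of the difference on made-up one-dimensional points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy points: a group near 0, a pair near 5 and a lone point at 9
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
Z = linkage(points, method='single', metric='euclidean')

# Cut wherever the merge distance exceeds 1.0
by_distance = fcluster(Z, t=1.0, criterion='distance')
# Ask for at most 2 clusters, whatever height that requires
by_count = fcluster(Z, t=2, criterion='maxclust')

print(by_distance)
print(by_count)
```

With a distance threshold the number of clusters follows from the data, whereas with `maxclust` the cut height does.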

In [31]:
carbon[:100][clusters==1]

In [32]:
carbon[:100][clusters==2]

In [33]:
carbon[:100][clusters==3]

That looks quite good already!

There is just one problem: we assigned NaN values to be the smallest value - 1. This means that NaNs do not differ much from the other values in the dataset:

In [34]:
np.nanmin(carbon.values)

In [35]:
np.nanmax(carbon.values)

As we can see, the differences between 'valid' data points are much greater than those between 'valid' points and points with NaN values. So it would make sense to assign a much more distinct value to the NaNs. That would also automatically put all the rows containing NaNs into the same clusters. Of course another option is to run the clustering separately for the rows with NaNs and the rows without NaNs. I am just not sure whether assigning a greatly different value is more efficient or more general than doing this separately. This should be studied!
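
Both options can be tried out on a toy table first. The sketch below uses a made-up DataFrame with hypothetical column names `regime_a` and `regime_b`; it compares the cosine distance between a complete row and a NaN row under the two fill values discussed in this notebook (smallest value - 1 and smallest value - largest value), and then shows how the rows could be split with a mask instead.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Made-up data; the last row has a missing value
toy = pd.DataFrame({
    'regime_a': [100.0, 110.0, 105.0],
    'regime_b': [200.0, 190.0, np.nan],
})
valid_row = toy.iloc[0].to_numpy()

lo, hi = np.nanmin(toy.values), np.nanmax(toy.values)
distances = []
for fill in (lo - 1, lo - hi):
    filled = toy.fillna(fill)
    d = cosine(valid_row, filled.iloc[2].to_numpy())
    distances.append(d)
    print(fill, round(d, 4))

# Alternative: cluster complete and incomplete rows separately
has_nan = toy.isnull().any(axis=1)
complete_rows = toy[~has_nan]
incomplete_rows = toy[has_nan]
print(len(complete_rows), len(incomplete_rows))
```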

## Timing

Let's compare some timings with different sized datasets and clustering methods
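
Before timing, it may help to estimate the memory side of hierarchical clustering. When `linkage` is given an observation matrix it first computes all pairwise distances, so the storage grows quadratically with the number of rows. The sketch below is simple arithmetic; the helper name is made up, and 29666 is assumed to be the total number of stands, as in the count used elsewhere in this notebook.

```python
def condensed_distance_bytes(n_rows, bytes_per_value=8):
    """Memory needed for the n*(n-1)/2 pairwise distances stored as float64."""
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs * bytes_per_value

for n in (100, 1000, 10000, 20000, 29666):
    print(n, round(condensed_distance_bytes(n) / 1e9, 3), 'GB')
```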

### Hierarchical clustering

In [36]:
%%time
Z100 = linkage(X1[:100], metric='cosine')

In [37]:
%%time
Z1000 = linkage(X1[:1000], metric='cosine')

In [38]:
%%time
Z10000 = linkage(X1[:10000], metric='cosine')

In [39]:
%%time
Z20000 = linkage(X1[:20000], metric='cosine')

In [40]:
%%time
Zall = linkage(X1, metric='cosine')

In [41]:
# %matplotlib inline is already active, so the deprecated %pylab magic is not needed
plt.rcParams['figure.figsize'] = (15, 9)
plt.figure()
dendrogram(Zall, truncate_mode='lastp', p=50)
plt.show()

In [59]:
from scipy.cluster.hierarchy import fcluster
clusters_all = fcluster(Zall, 50, criterion='maxclust')
clusters_all

In [60]:
ind = 1
print(len(carbon[clusters_all==ind]))
carbon[clusters_all==ind][:10]


In [44]:
29666*0.35

It was said that 35% of the stands were simulated. Could all the stands belonging to the first cluster be those?

#### Let's also try hierarchical clustering by assigning much worse values for NaNs

In [45]:
X2 = carbon.copy()
X2[carbon.isnull()] = np.nanmin(carbon.values) - np.nanmax(carbon.values)

In [46]:
Zall_diff = linkage(X2, metric='cosine')

In [47]:
# %matplotlib inline is already active, so the deprecated %pylab magic is not needed
plt.rcParams['figure.figsize'] = (15, 9)
plt.figure()
dendrogram(Zall_diff, truncate_mode='lastp', p=50)
plt.show()

In [48]:
from scipy.cluster.hierarchy import fcluster
clusters_diff = fcluster(Zall_diff, 50, criterion='maxclust')
clusters_diff

In [49]:
ind = 49
print(len(carbon[clusters_diff==ind]))
carbon[clusters_diff==ind][:10]


In [142]:
ind = 5568
((clusters_all==1)[:ind]==(clusters_diff==49)[:ind]).all()

In [138]:
ind = 6653
((clusters_all==1)[ind:]==(clusters_diff==49)[ind:]).all()

In [136]:
(clusters_all==1)[ind]

In [137]:
(clusters_diff==49)[ind]
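
The cells above locate the disagreement between the two clusterings by probing with hand-picked cut points. A hedged alternative is to ask NumPy directly for every position where the two membership masks differ. The helper name and the toy label arrays below are made up for illustration.

```python
import numpy as np

def differing_positions(labels_a, id_a, labels_b, id_b):
    """Positions where membership in cluster id_a (first labelling)
    disagrees with membership in cluster id_b (second labelling)."""
    mask_a = np.asarray(labels_a) == id_a
    mask_b = np.asarray(labels_b) == id_b
    return np.flatnonzero(mask_a != mask_b)

# Toy stand-ins for clusters_all and clusters_diff
toy_all = np.array([1, 1, 2, 1, 3, 1])
toy_diff = np.array([49, 49, 7, 8, 7, 49])

diff_idx = differing_positions(toy_all, 1, toy_diff, 49)
print(diff_idx)
```

On the real data the call would presumably be `differing_positions(clusters_all, 1, clusters_diff, 49)`.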

## K-means

In [143]:
from scipy.cluster.vq import kmeans, vq

In [144]:
%%time
data100 = X1[:100]
centroids100, _ = kmeans(data100, 50)
idx100, _ = vq(data100, centroids100)

In [145]:
%%time
data1000 = X1[:1000]
centroids1000, _ = kmeans(data1000, 50)
idx1000, _ = vq(data1000, centroids1000)

In [146]:
%%time
data10000 = X1[:10000]
centroids10000, _ = kmeans(data10000, 50)
idx10000, _ = vq(data10000, centroids10000)

In [147]:
%%time
data_all = X1
centroidsall, _ = kmeans(data_all, 50)
idxall, _ = vq(data_all, centroidsall)

The problem is that the k-means implementations in scipy and sklearn only support Euclidean distance, so the cosine distance cannot be used with them directly.
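
One possible workaround, sketched here as an assumption rather than as the approach taken in this notebook, is to scale every row to unit length first. For unit vectors the squared Euclidean distance equals twice the cosine distance, so Euclidean k-means on the normalized rows groups by direction. The data and the helper `normalize_rows` are made up, and the initial centroids are fixed by hand so the result is deterministic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist

def normalize_rows(X):
    """Scale every row to unit Euclidean length (all-zero rows are left as they are)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

# Two directions, each present at three very different magnitudes
X = np.array([
    [1.0, 0.1], [10.0, 1.0], [100.0, 12.0],
    [0.1, 1.0], [1.0, 10.0], [12.0, 100.0],
])
U = normalize_rows(X)

# For unit vectors: squared Euclidean distance == 2 * cosine distance
identity_holds = np.allclose(pdist(U, metric='sqeuclidean'),
                             2 * pdist(X, metric='cosine'))
print(identity_holds)

# Euclidean k-means on the normalized rows, starting from two chosen rows
init = U[[0, 3]]
centroids, labels = kmeans2(U, init, minit='matrix')
print(labels)
```

Note that the centroids found this way are plain means of unit vectors and are not themselves renormalized, so this is only an approximation of a true cosine k-means.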

In [149]:
# NOTE: this rebinds the name kmeans imported from scipy.cluster.vq above
from kmeans import kmeans, randomsample

In [173]:
%%time
randomcenters = randomsample(data_all.values, 50)
centers, xtoc, dist = kmeans(data_all.values, randomcenters, delta=.001,
                             maxiter=100, metric='cosine', verbose=2)

In [174]:
# Print the size of every cluster and accumulate the total as a sanity check
tot = 0
for i in range(50):
    size = sum(xtoc==i)
    tot += size
    print(size)
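
The same cluster sizes can be obtained in one call with `np.bincount`, assuming the assignment array holds non-negative integer cluster indices (as `xtoc` appears to). A sketch on a made-up assignment array:

```python
import numpy as np

# Toy stand-in for xtoc: cluster index of each sample, 4 clusters in total
toy_assignment = np.array([0, 2, 2, 1, 2, 0])

# minlength keeps empty clusters in the result with a count of zero
sizes = np.bincount(toy_assignment, minlength=4)
print(sizes)
print(sizes.sum())
```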

## Timing results


K-means is remarkably faster than hierarchical clustering. Its weakness is still that the number of clusters has to be fixed in advance.
I don't know how to overcome this reliably.
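
One common heuristic, offered here only as something that could be tried and not as a reliable answer, is to run k-means for several values of k and look for the point where the distortion stops dropping quickly (the "elbow"). The sketch uses synthetic blobs with three true groups and the distortion value returned by `scipy.cluster.vq.kmeans`. Results vary slightly between runs because the initial centroids are random.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

distortions = {}
for k in range(1, 7):
    # kmeans returns the codebook and the mean distance to the nearest centroid
    _, distortion = kmeans(blobs, k)
    distortions[k] = float(distortion)
    print(k, round(distortions[k], 3))
```

On data like this the distortion falls sharply up to k = 3 and only slowly afterwards; on the real stands the curve may well have no clear elbow.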