# Clustering and Optimization


## Let's form the clusters first using k-means

In [1]:
%matplotlib inline
import seaborn
import matplotlib.pyplot as plt
from kmeans import kmeans, randomsample
import numpy as np
import pandas as pd
import random
import os
from BorealWeights import BorealWeightedProblem
from pyomo.opt import SolverFactory


In [2]:
%%time
seed = 2
random.seed(seed)
np.random.seed(seed)
data_dir = os.path.join(os.getcwd(), '../boreal_data')
carbon = pd.read_csv(os.path.join(data_dir, 'Carbon_storage.csv'))
X = carbon.values
# Fill missing entries with (min - max) of the observed values
X[carbon.isnull().values] = np.nanmin(X) - np.nanmax(X)
randomcenters = randomsample(X, 50)
centers, xtoc, dist = kmeans(X,
                             randomcenters,
                             delta=.00001,
                             maxiter=100,
                             metric='cosine',
                             verbose=0)

In [3]:
%%time
C = centers.copy()
# Weight of each cluster = number of data points assigned to it
weights = np.array([np.sum(xtoc == i) for i in range(len(C))])

In [4]:
%%time
ClustProblem = BorealWeightedProblem(C,weights)
opt = SolverFactory('glpk')
res = opt.solve(ClustProblem.model, False)

In [5]:
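def res_to_list(model):
    """Return, for each row i, the column j whose binary variable x[i, j] equals 1."""
    resdict = model.x.get_values()
    reslist = np.zeros(model.n.value)
    for i, j in resdict.keys():
        # x[i, j] == 1 means that alternative j was selected for row i
        if resdict[i, j] == 1.:
            reslist[i] = j
    return reslist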
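
To see what this decoding step does without running a solver, here is a minimal sketch on a plain dictionary standing in for `model.x.get_values()`. The name `decode_assignment`, the toy values and the tolerance `1e-6` are assumptions made for illustration. They are not part of the notebook or of Pyomo. The sketch compares against 1 with a tolerance, since a solver may return a value such as 0.9999999 for a binary variable.

```python
import numpy as np


def decode_assignment(values, n, tol=1e-6):
    """Return, for each row index i, the column j whose variable is (approximately) 1."""
    chosen = np.zeros(n)
    for (i, j), value in values.items():
        # Skip unset variables and compare with a tolerance instead of ==
        if value is not None and abs(value - 1.0) < tol:
            chosen[i] = j
    return chosen


# Hypothetical solver output for 3 clusters and 2 alternatives
toy_values = {(0, 0): 0.0, (0, 1): 1.0,
              (1, 0): 1.0, (1, 1): 0.0,
              (2, 0): 0.0, (2, 1): 0.9999999}
toy_choice = decode_assignment(toy_values, 3)
print(toy_choice)
```

Whether a tolerance is actually needed depends on the solver, so the exact comparison in `res_to_list` may be fine for `glpk` on this problem.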

In [6]:
reslist = res_to_list(ClustProblem.model)

In [7]:
optim_result_surrogate = sum([C[ind,int(reslist[ind])]*weights[ind] for ind in range(len(reslist))])
optim_result_surrogate

That is pretty close to the value obtained by optimizing over the original, unclustered data!

Let's also check corresponding values using original values in clusters:

In [8]:
optim_result_surrogate_origin = sum([sum(X[xtoc==ind][:,int(reslist[ind])]) for ind in range(len(reslist))])
optim_result_surrogate_origin

In [9]:
(optim_result_surrogate - optim_result_surrogate_origin)/optim_result_surrogate_origin

The relative error between the surrogate result and the surrogate result mapped back to the original values is quite small.
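
The same comparison can be sketched on synthetic data. Everything in the block below is an assumption made for illustration: the random data, the random cluster labels standing in for `xtoc`, and in particular the centers, which are computed here as plain arithmetic means of the cluster members. The `kmeans` module used above may compute its centers differently, for example because of the cosine metric or the stopping tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 5
X_toy = rng.random((200, 4))                     # 200 rows, 4 alternatives (synthetic)
labels = rng.integers(0, n_clusters, size=200)   # stand-in for xtoc
labels[:n_clusters] = np.arange(n_clusters)      # make sure no cluster is empty

# Assumption: each center is the arithmetic mean of its members
centers_toy = np.array([X_toy[labels == k].mean(axis=0) for k in range(n_clusters)])
weights_toy = np.bincount(labels, minlength=n_clusters)

# Pick the best alternative per cluster directly from the centers (no constraints)
choice = centers_toy.argmax(axis=1)

# Surrogate value: center value times cluster size
surrogate = sum(centers_toy[k, choice[k]] * weights_toy[k] for k in range(n_clusters))
# Same choices evaluated on the original rows
mapped_back = sum(X_toy[labels == k][:, choice[k]].sum() for k in range(n_clusters))

rel_err = (surrogate - mapped_back) / mapped_back
print(rel_err)
```

With exact mean centers the two sums agree up to floating-point rounding, because a cluster mean times the cluster size is the cluster sum. A nonzero relative error therefore suggests that the centers are not exactly the means of the final assignment.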

Let's also compare with the optimization over the original data:

In [10]:
%%time
orig_weights = np.ones(len(X))
OrigProblem = BorealWeightedProblem(X,orig_weights)
opt.solve(OrigProblem.model, False)

In [11]:
orig_res_values = res_to_list(OrigProblem.model)
optim_result_orig = sum([X[ind,int(orig_res_values[ind])] for ind in range(len(orig_res_values))])
optim_result_orig

In [12]:
(optim_result_orig - optim_result_surrogate_origin) / optim_result_orig

The difference is then about 3%, which isn't so bad. 

The next step would be rethinking the clustering:
- what happens if we now use the same clustering to calculate, for example, revenue values?
- do we use all the data values to create clusters?
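
As a starting point for the first question, here is a minimal sketch of how an existing cluster assignment could be reused for a second variable. The data is synthetic and every name in it (`revenue`, `revenue_centers`, `unexplained`) is hypothetical. It only shows the mechanics: computing per-cluster means under fixed labels, and measuring how much of the second variable's variance the clusters leave unexplained.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_alternatives, n_clusters = 300, 4, 6
labels = rng.integers(0, n_clusters, size=n_rows)   # stand-in for xtoc
labels[:n_clusters] = np.arange(n_clusters)         # make sure no cluster is empty
revenue = rng.random((n_rows, n_alternatives))      # synthetic second objective

# Per-cluster means of the second variable under the existing assignment
sums = np.zeros((n_clusters, n_alternatives))
np.add.at(sums, labels, revenue)
counts = np.bincount(labels, minlength=n_clusters)
revenue_centers = sums / counts[:, None]

# Share of the variance not captured by the clusters
# (close to 1.0 means the clusters say little about this variable)
within = ((revenue - revenue_centers[labels]) ** 2).sum()
total = ((revenue - revenue.mean(axis=0)) ** 2).sum()
unexplained = within / total
print(unexplained)
```

Here the labels are random, so `unexplained` comes out close to 1. With real data, a high value for revenue under the carbon-based clustering would hint that the clustering should take both variables into account.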