## Optimizing HA separately

The greatest differences between the results with and without surrogates were in the HA objective values. That's why I would like to study this objective more thoroughly.

In [1]:
%matplotlib inline
import seaborn
import matplotlib.pyplot as plt
from kmeans import kmeans, randomsample
import numpy as np
import pandas as pd
import random
import os
from BorealWeights import BorealWeightedProblem
from pyomo.opt import SolverFactory
from gradutil import *
seed = 3

Let's now form the clusters using only the HA values, so we can see whether the original problem was in the cluster-forming step, or whether there is something more peculiar about the HA objective itself.

In [2]:
%%time
random.seed(seed)
np.random.seed(seed)
data_dir = os.path.join(os.getcwd(), '../boreal_data')
ha = pd.read_csv(os.path.join(data_dir, 'Combined_HA.csv'))
X = normalize(ha.values)
randomcenters = randomsample(X, 50)
centers, xtoc, dist = kmeans(X,
                             randomcenters,
                             delta=.00001,
                             maxiter=100,
                             metric='cosine',
                             verbose=1)

In [5]:
%%time
C = centers.copy()
weights = np.array([sum(xtoc==i) for i in range(len(C))])
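
The weights above count how many rows were assigned to each cluster. As a minimal sketch on toy data (the array `toy_xtoc` and the cluster count are made up for illustration, and I am assuming `xtoc` holds one integer cluster index per row), the same counts can also be obtained with `np.bincount`:

```python
import numpy as np

# Toy assignment vector: row j belongs to cluster toy_xtoc[j] (illustrative data only)
toy_xtoc = np.array([0, 2, 2, 1, 2, 0])
n_clusters = 4  # cluster 3 is deliberately left empty

# Same pattern as in the cell above: count the members of each cluster in a loop
weights_loop = np.array([sum(toy_xtoc == i) for i in range(n_clusters)])

# Vectorized alternative; minlength keeps empty trailing clusters as zeros
weights_fast = np.bincount(toy_xtoc, minlength=n_clusters)

print(weights_loop, weights_fast)
```

Both versions should agree; `minlength` matters only if the last clusters happen to be empty.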

In [6]:
%%time
clustProblemHA = BorealWeightedProblem(C,weights)
opt = SolverFactory('glpk')
resClustHA = opt.solve(clustProblemHA.model, False)

In [7]:
HASurrogateList = res_to_list(clustProblemHA.model)
resultSurrogateHA = cluster_to_value(C, HASurrogateList, weights)
print("(iv) Combined Habitat {:.0f}".format(resultSurrogateHA))

In [8]:
resultOriginHA = clusters_to_origin(X, xtoc, HASurrogateList)
print("(iv) Combined Habitat {:.0f}".format(resultOriginHA))

In the original optimization the result was:
- (iv) Combined Habitat 10327

This is exactly the same as the result here.

From this I conclude that the problem lies in the clustering when all the objectives are used. So let's repeat the steps above, but use more data for the clustering.

One solution could be to normalize all the objectives to a 0-1 scale, so that no objective gets more weight than the others in the clustering.
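
To make the 0-1 idea concrete, here is a minimal sketch of column-wise min-max scaling. The helper `minmax_scale_columns` is hypothetical and written only for this illustration; it is not the `normalize` function from `gradutil`, whose exact behavior is not shown here. NaN entries are ignored when finding the column range and are left as NaN in the output.

```python
import numpy as np

def minmax_scale_columns(values):
    """Scale every column to [0, 1], ignoring NaNs (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    col_min = np.nanmin(values, axis=0)
    col_max = np.nanmax(values, axis=0)
    span = col_max - col_min
    # Constant columns would divide by zero; map them to 0 instead
    span = np.where(span == 0, 1.0, span)
    return (values - col_min) / span

# Toy data: two objectives on very different scales, with one missing value
toy = np.array([[1.0, 100.0],
                [2.0, np.nan],
                [3.0, 300.0]])
scaled = minmax_scale_columns(toy)
print(scaled)
```

After scaling, both toy columns span the same range, so neither should dominate a distance computation just because of its units.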

In [41]:
data_dir = os.path.join(os.getcwd(), '../boreal_data')
carbon = pd.read_csv(os.path.join(data_dir, 'Carbon_storage.csv'))
deadwood = pd.read_csv(os.path.join(data_dir, 'Timber_revenues.csv'))
orig_hc_x = np.concatenate((ha, carbon), axis=1)
clust_hc_x = np.concatenate((normalize(ha.values), normalize(carbon.values),), axis=1)
no_nan_hc_x = orig_hc_x.copy()
hc_inds = np.where(np.isnan(no_nan_hc_x))
# Replace each NaN with (column min - column max) of its own column
no_nan_hc_x[hc_inds] = np.take(np.nanmin(no_nan_hc_x, axis=0) - np.nanmax(no_nan_hc_x, axis=0), hc_inds[1])
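
The NaN replacement in the cell above is easier to follow on a small example. This is only a sketch on made-up toy data: every NaN is replaced by the minimum of its column minus the maximum of its column, which lies below the column minimum whenever the column maximum is positive. Reading this as a penalty value for infeasible entries is my assumption.

```python
import numpy as np

# Toy matrix with missing values (illustrative data only)
toy = np.array([[1.0, 10.0],
                [np.nan, 30.0],
                [5.0, np.nan]])

filled = toy.copy()
nan_rows, nan_cols = np.where(np.isnan(filled))

# One fill value per column: column minimum minus column maximum
penalty = np.nanmin(filled, axis=0) - np.nanmax(filled, axis=0)

# Pick, for every NaN position, the fill value of its own column
filled[nan_rows, nan_cols] = np.take(penalty, nan_cols)
print(filled)
```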

In [42]:
%%time
random.seed(seed)
np.random.seed(seed)
randomcenters = randomsample(clust_hc_x, 50)
hc_centers, hc_xtoc, hc_dist = kmeans(clust_hc_x,
                             randomcenters,
                             delta=.00001,
                             maxiter=100,
                             metric='cosine',
                             verbose=1)

In [43]:
%%time
#hc_C = no_nan_hc_x[[np.argmin(hc_dist[hc_xtoc==i]) for i in range(len(hc_centers))]]
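# Represent each cluster by the mean of its members; iterate over hc_centers, since hc_C is only being defined here
hc_C = np.array([no_nan_hc_x[hc_xtoc == i].mean(axis=0) for i in range(len(hc_centers))])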

hc_weights = np.array([sum(hc_xtoc==i) for i in range(len(hc_C))])

In [44]:
%%time
clustProblem_hc_ha = BorealWeightedProblem(hc_C[:,:7],hc_weights)
opt = SolverFactory('glpk')
resClust_hc_ha = opt.solve(clustProblem_hc_ha.model, False)

In [45]:
hc_HASurrogateList = res_to_list(clustProblem_hc_ha.model)
hc_resultSurrogateHA = cluster_to_value(hc_C[:,:7], hc_HASurrogateList, hc_weights)
print("(iv) Combined Habitat {:.0f}".format(hc_resultSurrogateHA))

In [46]:
hc_resultOriginHA = clusters_to_origin(orig_hc_x[:,:7], hc_xtoc, hc_HASurrogateList)
print("(iv) Combined Habitat {:.0f}".format(hc_resultOriginHA))

Now, when clustering with carbon included, we see that the difference is quite big: pretty much the same as it was when all the data was used for the clustering. So the problem really is here. We still have to decide what to do about it.
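
To compare runs like the ones above on a common scale, the difference could be expressed relative to the value computed on the original data. This is just a sketch: `relative_gap` is a hypothetical helper, and the numbers in the example call are toy values, not results from this notebook.

```python
def relative_gap(surrogate_value, origin_value):
    """Relative difference between a surrogate result and the original-data result."""
    return abs(surrogate_value - origin_value) / abs(origin_value)

# Toy numbers only, chosen to make the arithmetic easy to check
print("{:.1%}".format(relative_gap(110.0, 100.0)))
```

In the notebook this could be called with a pair such as `hc_resultSurrogateHA` and `hc_resultOriginHA`.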