# How to handle NaN values so that the HA doesn't get marginalized?

So far the clustering has not worked as desired, and the problem has now been traced to the way NaN values are handled during clustering. We need a better way to handle them.

In [50]:
%matplotlib inline
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from gradutil import *
from pyomo.opt import SolverFactory

In [51]:
seedn = 2
opt = SolverFactory('glpk')
solutions = real_solutions()
revenue, carbon, deadwood, ha = init_boreal()
x = np.concatenate((revenue, carbon, deadwood, ha), axis=1)

## Set NaNs to the smallest existing value

Let's first try setting every NaN to the smallest value in the corresponding column.

In [52]:
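norm_data = x.copy()
# Locate the NaNs: inds[0] holds row indices, inds[1] column indices
inds = np.where(np.isnan(norm_data))
# Replace each NaN with the minimum of its own column
norm_data[inds] = np.take(np.nanmin(norm_data, axis=0), inds[1])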
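
As a quick sanity check of the indexing used above, the same fill can be run on a small made-up array (the values are illustrative only, not taken from the boreal data). `np.where` returns the row and column indices of the NaNs, and `np.take` picks the column minimum for each NaN's column:

```python
import numpy as np

# Toy data, purely illustrative: two columns with one NaN each.
toy = np.array([[1.0, np.nan],
                [np.nan, 5.0],
                [3.0, 4.0]])

filled = toy.copy()
rows, cols = np.where(np.isnan(filled))
col_min = np.nanmin(filled, axis=0)      # per-column minimum, ignoring NaNs
filled[rows, cols] = np.take(col_min, cols)

print(filled)
```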

Then normalize everything as before.

In [53]:
min_norm_x = normalize(norm_data)
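
`normalize` comes from `gradutil` and its implementation is not shown in this notebook. Assuming it performs column-wise min-max scaling to [0, 1] (an assumption, suggested only by the histogram limits used in the plots), a minimal stand-in might look like this. `minmax_normalize` is a hypothetical name, not part of `gradutil`:

```python
import numpy as np

def minmax_normalize(data):
    """Scale each column to [0, 1]. Hypothetical stand-in, not gradutil.normalize."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Avoid division by zero for constant columns.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / col_range

# Made-up data without NaNs, only to show the call.
toy = np.array([[1.0, 4.0],
                [1.0, 5.0],
                [3.0, 4.0]])
print(minmax_normalize(toy))
```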

In [73]:
# %matplotlib inline is already active; set the default figure size via plt
plt.rcParams['figure.figsize'] = (15, 12)

def hist_plot_norm(data, ax, limits):
    """Plot histograms of columns 0-6, 7-13, 14-20 and 21 onward on a 2x2 axes grid."""
    ax[0,0].hist(data[:, :7])
    ax[0,0].axis(limits)
    ax[0,0].set_title('Timber Harvest Revenues')

    ax[0,1].hist(data[:, 7:14])
    ax[0,1].axis(limits)
    ax[0,1].set_title('Carbon storage')

    ax[1,0].hist(data[:, 14:21])
    ax[1,0].axis(limits)
    ax[1,0].set_title('Deadwood')

    ax[1,1].hist(data[:, 21:])
    ax[1,1].axis(limits)
    ax[1,1].set_title('Habitat availability')
    return ax


In [81]:
data = min_norm_x
fig, ax = plt.subplots(2,2)
limits = [.0, 1., 0, 30000]
hist_plot_norm(data, ax, limits)
plt.show()

In [56]:
%%time
nclust = 50
optim_revenue50, optim_carbon50, optim_deadwood50, optim_ha50 = cNopt(x, min_norm_x, min_norm_x, opt, nclust, seedn)

In [57]:
print('Relative differences to original values, 50 clusters')
print("(i) Harvest revenues difference {:.3f}".format((optim_revenue50-solutions['revenue'])/solutions['revenue']))
print("(ii) Carbon storage {:.3f}".format((optim_carbon50-solutions['carbon'])/solutions['carbon']))
print("(iii) Deadwood index {:.3f}".format((optim_deadwood50-solutions['deadwood'])/solutions['deadwood']))
print("(iv) Combined Habitat {:.3f}".format((optim_ha50-solutions['ha'])/solutions['ha']))
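
The same four relative differences are printed for every experiment in this notebook, so a small helper could cut the repetition. This is only a sketch: `relative_differences` is a hypothetical name, and it assumes the reference values are available as a dict keyed by `'revenue'`, `'carbon'`, `'deadwood'` and `'ha'`, as `solutions` is used above. The numbers in the usage lines are made up.

```python
def relative_differences(optimized, reference):
    """Return (optimized - reference) / reference for each objective key."""
    return {key: (optimized[key] - reference[key]) / reference[key]
            for key in reference}

# Made-up numbers, only to show the call.
reference = {'revenue': 200.0, 'carbon': 50.0, 'deadwood': 10.0, 'ha': 4.0}
optimized = {'revenue': 190.0, 'carbon': 50.0, 'deadwood': 11.0, 'ha': 3.0}

for key, diff in relative_differences(optimized, reference).items():
    print("{} {:.3f}".format(key, diff))
```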

So this setting is not enough to drive the optimization away from the NaN points, and it tells us nothing about how the clustering performs. We need to adjust the values used in the optimization step so that we can see how well the clustering works.

In [58]:
no_nan_x = x.copy()
inds = np.where(np.isnan(no_nan_x))
# Replace each NaN with (column min - column max) as a penalty value
no_nan_x[inds] = np.take(np.nanmin(no_nan_x, axis=0) - np.nanmax(no_nan_x, axis=0), inds[1])

Choosing a NaN value now carries a large penalty in the optimization.
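
On a made-up array (illustrative values only) the penalty fill looks as follows. Note that the fill value is the column's `min - max`, that is, the negated column range, so how far it sits below the real values depends on where the column lies relative to zero:

```python
import numpy as np

# Toy data, purely illustrative: two columns with one NaN each.
toy = np.array([[1.0, np.nan],
                [np.nan, 5.0],
                [3.0, 4.0]])

penalized = toy.copy()
rows, cols = np.where(np.isnan(penalized))
# Per-column penalty: column minimum minus column maximum, ignoring NaNs.
penalty = np.nanmin(penalized, axis=0) - np.nanmax(penalized, axis=0)
penalized[rows, cols] = np.take(penalty, cols)

print(penalized)
```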

In [59]:
%%time
nclust = 50
penalty_optim_revenue50, penalty_optim_carbon50, penalty_optim_deadwood50, penalty_optim_ha50 = cNopt(x, min_norm_x, no_nan_x, opt, nclust, seedn)

In [60]:
print('Relative differences to original values, 50 clusters')
print("(i) Harvest revenues difference {:.3f}".format((penalty_optim_revenue50-solutions['revenue'])/solutions['revenue']))
print("(ii) Carbon storage {:.3f}".format((penalty_optim_carbon50-solutions['carbon'])/solutions['carbon']))
print("(iii) Deadwood index {:.3f}".format((penalty_optim_deadwood50-solutions['deadwood'])/solutions['deadwood']))
print("(iv) Combined Habitat {:.3f}".format((penalty_optim_ha50-solutions['ha'])/solutions['ha']))

We now have optimization results, and they suggest that the clustering is not working. We need another approach to handling the NaN values.

## Give NaNs some penalty

In [61]:
norm_data = x.copy()
inds = np.where(np.isnan(norm_data))
# Half of the penalty used for no_nan_x above
norm_data[inds] = np.take((np.nanmin(norm_data, axis=0) - np.nanmax(norm_data, axis=0)) / 2, inds[1])
penalty_norm_x = normalize(norm_data)

In [79]:
fig, ax = plt.subplots(2,2)
limits = [.0, 1., 0, 30000]
hist_plot_norm(penalty_norm_x, ax, limits)
plt.show()

In [62]:
%%time
nclust = 50
half_optim_revenue50, half_optim_carbon50, half_optim_deadwood50, half_optim_ha50 = cNopt(x, penalty_norm_x, no_nan_x, opt, nclust, seedn)

In [63]:
print('Relative differences to original values, 50 clusters')
print("(i) Harvest revenues difference {:.3f}".format((half_optim_revenue50-solutions['revenue'])/solutions['revenue']))
print("(ii) Carbon storage {:.3f}".format((half_optim_carbon50-solutions['carbon'])/solutions['carbon']))
print("(iii) Deadwood index {:.3f}".format((half_optim_deadwood50-solutions['deadwood'])/solutions['deadwood']))
print("(iv) Combined Habitat {:.3f}".format((half_optim_ha50-solutions['ha'])/solutions['ha']))

That does not work either. We need something else.

## Give NaNs a ridiculous penalty

In [64]:
norm_data = x.copy()
inds = np.where(np.isnan(norm_data))
# Twice the penalty used for no_nan_x above
norm_data[inds] = np.take((np.nanmin(norm_data, axis=0) - np.nanmax(norm_data, axis=0)) * 2, inds[1])
ridiculous_norm_x = normalize(norm_data)
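
To get a feel for what the penalty factor does to the clustering data, the sketch below applies factors 0.5, 1 and 2 to a made-up column and min-max scales the result. It assumes `normalize` behaves like column-wise min-max scaling, which is not confirmed in this notebook, and `real_value_floor` is a hypothetical helper. Under that assumption, a larger penalty leaves a narrower band of [0, 1] for the real values:

```python
import numpy as np

col = np.array([0.0, 5.0, 10.0, np.nan])   # made-up column with one NaN

def real_value_floor(column, factor):
    """Smallest normalized real value after filling NaNs with factor * (min - max)."""
    fill = factor * (np.nanmin(column) - np.nanmax(column))
    filled = np.where(np.isnan(column), fill, column)
    # Min-max scaling (assumed behaviour of normalize).
    scaled = (filled - filled.min()) / (filled.max() - filled.min())
    return scaled[~np.isnan(column)].min()

for factor in (0.5, 1.0, 2.0):
    print(factor, round(real_value_floor(col, factor), 3))
```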

In [80]:
fig, ax = plt.subplots(2,2)
limits = [.0, 1., 0, 30000]
hist_plot_norm(ridiculous_norm_x, ax, limits)
plt.show()

In [66]:
%%time
nclust = 50
ridic_optim_revenue50, ridic_optim_carbon50, ridic_optim_deadwood50, ridic_optim_ha50 = cNopt(x, ridiculous_norm_x, no_nan_x, opt, nclust, seedn)

In [67]:
print('Relative differences to original values, 50 clusters')
print("(i) Harvest revenues difference {:.3f}".format((ridic_optim_revenue50-solutions['revenue'])/solutions['revenue']))
print("(ii) Carbon storage {:.3f}".format((ridic_optim_carbon50-solutions['carbon'])/solutions['carbon']))
print("(iii) Deadwood index {:.3f}".format((ridic_optim_deadwood50-solutions['deadwood'])/solutions['deadwood']))
print("(iv) Combined Habitat {:.3f}".format((ridic_optim_ha50-solutions['ha'])/solutions['ha']))