# 05 Tune Model
This workbook uses the cleaned, scaled and normalised data to tune the model.

**Disclaimer.** I used AzureML for this step. My trial expired one week before the end of the competition, so I switched to scikit-learn. I submitted results from here to the competition but never bettered the score I got via AzureML, even though the RMSE scores were comparable. I can only assume that I was overtraining here!

## Initialise the styles for the workbooks

In [1]:
# Initialise styles and packages we need
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

## Imports and classes used

In [2]:
# All the imports used
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

print("Pandas version:       {}".format(pd.__version__))
#print("Scikit learn version: {}".format(sklearn.__version__))

Pandas version:       0.23.4


## Import cleaned, scaled and normalised data we created in 03 Data Scaling and Normalising

In [3]:
final_scaled_normalised_training_values_filename = 'data/DAT102x_Predicting_Chronic_Hunger_-_Clean_Normal_Training_values.csv'
training_labels_filename = 'data/DAT102x_Predicting_Chronic_Hunger_-_Training_labels.csv'
final_scaled_normalised_test_values_filename = 'data/DAT102x_Predicting_Chronic_Hunger_-_Clean_Normal_Test_values.csv'

training_values = pd.read_csv(final_scaled_normalised_training_values_filename)
training_labels = pd.read_csv(training_labels_filename)
test_values = pd.read_csv(final_scaled_normalised_test_values_filename)

# Makes sure country_code and year are treated as categorical!
training_values['country_code'] = training_values['country_code'].astype('category')
training_values['year'] = training_values['year'].astype('category')
test_values['country_code'] = test_values['country_code'].astype('category')
test_values['year'] = test_values['year'].astype('category')

print("Training values: {}".format(training_values.shape))
print("Training label: {}".format(training_labels.shape))
print("Test values:     {}".format(test_values.shape))
print(training_values.head())
#print(training_values.dtypes)

Training values: (1311, 19)
Training label: (1401, 2)
Test values:     (616, 19)
   row_id country_code  year  agricultural_land_area  forest_area  \
0       0      889f053  2002                0.644849     0.326591   
1       1      9e614ab  2012                0.393423     0.598121   
2       2      100c476  2000                0.013088     0.099657   
3       3      4609682  2013                0.545174     0.371419   
4       4      be2a7f5  2008                0.059169     0.177130   

   total_land_area  population_growth  avg_value_of_food_production  \
0         0.522760           0.555582                      0.221955   
1         0.435434           0.463276                      0.583676   
2         0.017267           0.515693                      0.360597   
3         0.396593           0.455697                      0.647380   
4         0.039847           0.379837                      0.672717   

   food_imports_as_share_of_merch_exports  \
0                               

## Join training features and labels into a single dataset

In [4]:
tempDF = pd.merge(training_values, training_labels, on='row_id', how='inner')
print(tempDF.shape)

(1311, 20)
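The shapes printed earlier show 1401 labels but only 1311 feature rows, so the inner join keeps only the rows whose row_id appears on both sides. Below is a minimal sketch of that behaviour on made-up data. The toy frames and the `validate` argument are my own additions for illustration and were not part of the original run.

```python
import pandas as pd

# Toy stand-ins for training_values and training_labels (made-up numbers).
values = pd.DataFrame({"row_id": [0, 1, 3],
                       "forest_area": [0.33, 0.60, 0.37]})
labels = pd.DataFrame({"row_id": [0, 1, 2, 3],
                       "prevalence_of_undernourishment": [31.3, 18.3, 39.5, 12.1]})

# validate="one_to_one" raises a MergeError if row_id is duplicated on either side.
merged = pd.merge(values, labels, on="row_id", how="inner", validate="one_to_one")

# Labels with no matching feature row (row_id 2 here) are dropped by the inner join.
dropped = labels.loc[~labels["row_id"].isin(values["row_id"]), "row_id"].tolist()

print(merged.shape)
print(dropped)
```

Checking the dropped ids this way is a cheap sanity test that the features and labels still line up row for row after the join.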


## Create the training feature matrix and training label vector.
Exclude row_id and country_code from the model and use get_dummies to convert the year column into "one hot encoding" format. That is, we are treating year as a categorical value.

In [5]:
# Start at the 3rd column (index 2), i.e. exclude row_id and country_code
X = pd.get_dummies(training_values.iloc[:, 2:])
y = tempDF['prevalence_of_undernourishment'].values
print(X.shape)
#print(X.dtypes)
print(y)

(1311, 32)
[31.26071279 18.29823274 39.51339713 ... 12.08848436 26.43666106
 13.71256945]
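To make the one-hot step concrete, here is a small sketch of what get_dummies does to a categorical year column. The toy frame is invented for illustration; only the column names mirror the real data.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real data set.
toy = pd.DataFrame({"year": [2000, 2001, 2000],
                    "forest_area": [0.10, 0.20, 0.30]})

# Without this cast, year is numeric and get_dummies would leave it untouched.
toy["year"] = toy["year"].astype("category")

# Each distinct year becomes its own indicator column; numeric columns pass through.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded.shape)
```

This is why the feature matrix above is wider than the source table: every distinct year contributes one indicator column.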


## Use grid search to tune the model parameters
From 04 Train Model we have a model with a score of "RMSE: 6.82 (+/- 5.84)". Let's tune the neural network parameters to see if we can improve the performance.

### Select parameters to use in grid search - Assuming 'adam' solver
The grid grows multiplicatively with every parameter value added: the cell below tries 3 * 1 * 2 * 5 * 2 = 60 parameter combinations and, with 10-fold cross-validation, executes the model training and prediction 10 * 60 = 600 times. This will take a while!

In [6]:
hidden_layer_sizes_range = [300, 400, 500]
alpha_range = [0.0001]
max_iter_range = [400, 500]
beta_1_range = [0.3, 0.5, 0.7, 0.9, 0.99]
beta_2_range = [0.5, 0.999]

param_grid = dict(hidden_layer_sizes=hidden_layer_sizes_range,
                  alpha=alpha_range, 
                  max_iter=max_iter_range,
                  beta_1=beta_1_range,
                  beta_2=beta_2_range)
print(param_grid)

{'hidden_layer_sizes': [300, 400, 500], 'alpha': [0.0001], 'max_iter': [400, 500], 'beta_1': [0.3, 0.5, 0.7, 0.9, 0.99], 'beta_2': [0.5, 0.999]}
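The fit count quoted above can be checked before committing to a long run. This is a minimal sketch using scikit-learn's `ParameterGrid` on the same parameter values. The variable names `n_candidates`, `cv_folds` and `total_fits` are mine.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the 'adam' grid in this workbook.
param_grid = {
    "hidden_layer_sizes": [300, 400, 500],
    "alpha": [0.0001],
    "max_iter": [400, 500],
    "beta_1": [0.3, 0.5, 0.7, 0.9, 0.99],
    "beta_2": [0.5, 0.999],
}

# ParameterGrid enumerates every combination, so its length is the candidate count.
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 10
total_fits = n_candidates * cv_folds

# GridSearchCV also refits the best candidate once at the end (refit=True by default).
print(n_candidates, total_fits)
```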


#### Instantiate the grid and commence search.
I used n_jobs = -1 to use all my CPUs to reduce waiting time. This really did hammer my CPUs so your mileage may vary!

In [7]:
nn = MLPRegressor(activation='identity',
                  verbose=False,
                  solver='adam')
grid = GridSearchCV(nn,
                    param_grid,
                    cv=10,
                    scoring='neg_mean_squared_error',
                    return_train_score=False,
                    n_jobs = -1)
# uncomment below to repeat the exercise.
#grid.fit(X, y)

In [8]:
# Uncomment if you wish to repeat

#print(pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
#print(grid.best_score_)
## gave -47.05303823697013, i.e. an RMSE of sqrt(47.0530) = 6.8595
#print(grid.best_params_)
## gave {'alpha': 0.0001, 'beta_1': 0.9, 'beta_2': 0.999, 'hidden_layer_sizes': 500, 'max_iter': 400}

#### Interpreting results
With the best "adam" parameters the RMSE is 6.8595, which is slightly worse than the 6.82 we got from the 'lbfgs' solver in 04 Train Model.

Repeat the above exercise for the 'lbfgs' solver and see if we can improve further.
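The RMSE quoted for the adam search comes from the `neg_mean_squared_error` score, which scikit-learn reports as a negative MSE so that larger is always better. Here is a small sketch of the conversion. The helper name `neg_mse_to_rmse` is hypothetical, not something from the workbook.

```python
import math

def neg_mse_to_rmse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    # Flip the sign to recover the MSE, then take the square root.
    return math.sqrt(-neg_mse)

# best_score_ reported by the adam grid search in this workbook.
adam_rmse = neg_mse_to_rmse(-47.05303823697013)
print(round(adam_rmse, 4))
```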

### Select parameters to use in grid search - Assuming 'lbfgs' solver

In [9]:
hidden_layer_sizes_range = [300, 400, 500, 600]
alpha_range = [0.00001, 0.0001, 0.001]
max_iter_range = [300, 400, 500]
param_grid = dict(hidden_layer_sizes=hidden_layer_sizes_range,
                  alpha=alpha_range, 
                  max_iter=max_iter_range)
print(param_grid)

{'hidden_layer_sizes': [300, 400, 500, 600], 'alpha': [1e-05, 0.0001, 0.001], 'max_iter': [300, 400, 500]}


#### Instantiate the grid and commence search.
I used n_jobs = -1 to use all my CPUs to reduce waiting time. This really did hammer my CPUs so your mileage may vary!

In [10]:
nn = MLPRegressor(activation='identity',
                  verbose=False,
                  solver='lbfgs')
grid = GridSearchCV(nn,
                    param_grid,
                    cv=10,
                    scoring='neg_mean_squared_error',
                    return_train_score=False,
                    n_jobs = -1)
# uncomment below to repeat the exercise.
#grid.fit(X, y)

In [11]:
# Uncomment if you wish to repeat

#print(pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
#print(grid.best_score_)
## gave -46.53876933228354, i.e. an RMSE of sqrt(46.5388) = 6.8219
#print(grid.best_params_)
## {'alpha': 0.001, 'hidden_layer_sizes': 500, 'max_iter': 500}

#### Interpreting the results
With the best "lbfgs" parameters the RMSE is 6.8219, which is effectively the same as the 6.82 we got from the 'lbfgs' solver in 04 Train Model.
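Because the fit calls in this workbook are commented out, here is a scaled-down sketch of the same search that runs in seconds. It assumes synthetic data from `make_regression`, a much smaller grid and 3-fold cross-validation, so its numbers say nothing about the competition data. It only shows the mechanics of fitting and reading back the best parameters.

```python
import math

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for X and y (assumption: not the competition data).
X_toy, y_toy = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=0)

# Deliberately tiny grid so the search finishes quickly.
small_grid = {"hidden_layer_sizes": [5, 10], "alpha": [0.0001, 0.001]}

nn_toy = MLPRegressor(activation="identity", solver="lbfgs",
                      max_iter=200, random_state=0)
search = GridSearchCV(nn_toy, small_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

# best_score_ is a negative MSE, so flip the sign before taking the root.
best_rmse = math.sqrt(-search.best_score_)
print(search.best_params_)
print(best_rmse)
```

Running a cut-down search like this first is a quick way to confirm the grid and scoring are wired up correctly before starting the full search.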

## Next steps
This workbook shows the process for tuning the model parameters. The lbfgs solver gives the slightly better score (RMSE 6.82 versus 6.86 for adam), and as we already used it in "04 Train Model", tuning has not improved on that result. The next step is to look at better feature selection.