
# Assignment 2: Hyperparameter Optimization for the Human Freedom Index Model

This notebook contains a set of exercises that will guide you through the different steps of this assignment. As in Assignment 1, solutions need to be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in the dedicated cell and not to modify the test cells. When you have completed all the exercises, submit this same notebook to Moodle in **.ipynb** format.

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021 ">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties—freedom of speech, religion, association, and assembly— in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. We also include nine variables pertaining to women-specific freedoms that are found in various categories of the index.

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, February 12th, 23:55</div>

In [9]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 


from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score

<div class="alert alert-info"><b>Exercise 1</b>
    
Load the Human Freedom Index data from the link: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv in a DataFrame called ```df```. The following columns are redundant and should be dropped:
* ```year```
* ```ISO```
* ```countries```
* All columns containing the word ```rank``` 
* All columns containing the word ```score```

Then store the independent variables in a DataFrame called ```X``` and the dependent variable (```hf_quartile```) in a DataFrame called ```y```.
    
<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
Do not download the dataset. Instead, read the data directly from the provided link.
</div>

In [2]:
df = pd.read_csv('https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv')

# remove rows whose target value (hf_quartile) is missing
df = df.dropna(subset=['hf_quartile'])

# remove columns that are specified to be redundant
df = df.drop(columns = ['year', 'ISO', 'countries'])

# drop columns that contain the word rank or score  
df = df.filter(regex='^(?!.*score.*|.*rank.*)')

# Separate the target variable from the features
X = df.drop(columns = ["hf_quartile"])
y = df["hf_quartile"]
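The column filter above relies on a negative lookahead: a column name is kept only if neither `score` nor `rank` appears anywhere in it. The following is a minimal sketch on a toy DataFrame (the column names are made up for illustration) that also shows an equivalent boolean-mask formulation, which some readers may find easier to audit.

```python
import pandas as pd

# Hypothetical toy frame; only the naming pattern matters here.
toy = pd.DataFrame({
    "pf_rol": [5.1, 6.3],
    "pf_score": [7.0, 8.0],
    "pf_rank": [10, 3],
    "hf_score": [7.5, 8.1],
    "hf_quartile": [2, 1],
})

# Same idea as the notebook's regex: keep names that do not contain
# "score" or "rank" anywhere.
kept = toy.filter(regex=r"^(?!.*(?:score|rank))")

# Equivalent formulation with an explicit boolean mask over the column names.
mask = ~toy.columns.str.contains("score|rank")
kept_alt = toy.loc[:, mask]

print(list(kept.columns))
print(list(kept_alt.columns))
```

Both variants keep `hf_quartile`, because that name contains neither word.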

<div class="alert alert-info"><b>Exercise 2</b>
    
Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the most frequent strategy, a ```OneHotEncoder``` for the categorical variables, a standard scaler, and a logistic regression model with the solver ```saga``` and ```max_iter``` set to 2000. Store the resulting pipeline in a variable called ```pipe```.
    
<br><i>[1 point]</i>
</div>
<div class='alert alert-warning'>

Not all the attributes are categorical. Ensure that all non-categorical attributes remain intact.
</div>

In [3]:
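# Identify the categorical feature(s) by dtype (kept for reference)
categorical_features = X.select_dtypes(include='object').columns

# One-hot encode the categorical column and pass the other columns through unchanged.
# The imputer upstream returns a NumPy array, so the column is addressed by position:
# index 0 is assumed to hold the only categorical feature.
transformer = ColumnTransformer([('ohe', OneHotEncoder(sparse=False), [0])], remainder='passthrough')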


# Create a pipeline using saga and max iterations of 2000
steps = [('imputer', SimpleImputer(strategy='most_frequent')),
          ('transformer', transformer),
          ('scaler', StandardScaler()),
          ('classifier', LogisticRegression(solver='saga', max_iter=2000))]

pipe = Pipeline(steps)
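The pipeline above addresses the categorical column by position. As an alternative, here is a minimal sketch, under the assumption that the columns can be named explicitly, that imputes and encodes per column type inside the `ColumnTransformer`. The toy data, the column names `region`, `pf_a`, `ef_b`, and the variable names are hypothetical, and the default solver is used to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column and two numeric columns with gaps.
toy_X = pd.DataFrame({
    "region": ["north", "south", "east", "east", "north", "south", "east", "north"],
    "pf_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "ef_b": [0.5, np.nan, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
})
# Object dtype mirrors what select_dtypes(include='object') looks for.
toy_X["region"] = toy_X["region"].astype(object)
toy_y = pd.Series([1, 1, 2, 2, 1, 2, 2, 1])

categorical_cols = ["region"]
numeric_cols = ["pf_a", "ef_b"]

preprocess = ColumnTransformer(
    [
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
        ("num", SimpleImputer(strategy="most_frequent"), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array for the scaler
)

toy_pipe = Pipeline([
    ("preprocess", preprocess),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=2000)),
])
toy_pipe.fit(toy_X, toy_y)

# Inspect the matrix that reaches the classifier: 3 one-hot columns + 2 numeric.
features = toy_pipe[:-1].transform(toy_X)
print(features.shape)
```

Selecting columns by name keeps the encoder pointed at the right feature even if the column order changes.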

<div class="alert alert-info"><b>Exercise 3</b>

Write the code to estimate the performance of the model using cross-validation with **three** stratified folds. Store the three test score values in a dictionary called ```fold_scores```.
    
<br><i>[1 point]</i>
</div>

In [4]:
single_fold_scores = cross_val_score(pipe, X, y, cv = StratifiedKFold(n_splits=3))
fold_scores = {'fold1' : single_fold_scores[0],
               'fold2' : single_fold_scores[1],
               'fold3' : single_fold_scores[2]}
fold_scores


{'fold1': 0.9181380417335474,
 'fold2': 0.9501607717041801,
 'fold3': 0.8954983922829582}
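The dictionary above can also be built without indexing each fold by hand. The sketch below assumes synthetic data from `make_classification` in place of the Human Freedom Index features, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class problem standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Three stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=2000), X_demo, y_demo, cv=cv)

# Name the folds fold1, fold2, fold3 in one pass.
demo_fold_scores = {f"fold{i}": s for i, s in enumerate(scores, start=1)}
print(demo_fold_scores)
```

The comprehension scales to any number of splits, so changing `n_splits` needs no further edits.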

<div class="alert alert-info"><b>Exercise 4</b>

    
Write the code to create a GridSearchCV object called ```grid``` and fit it using **only three folds**. The grid search object must include the previous pipeline and test the following hyperparameters:
* ```penalty``` : ['l1', 'l2']
* ```C``` : [0.1,10]

Finally, store the best achieved score (accuracy) in a variable called ```score```.

<br><i>[2.5 points]</i>
</div>

<div class='alert alert-warning'>

Use train and test datasets correctly.
</div>

In [6]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# specify parameters to vary 
param_grid = {'classifier__penalty':['l1', 'l2'],
               'classifier__C':[0.1, 10]}

# perform grid search 
grid = GridSearchCV(pipe, param_grid, cv = 3)
grid.fit(X_train, y_train)

score = grid.best_estimator_.score(X_test, y_test)
score 



0.9438502673796791
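A fitted `GridSearchCV` exposes two different numbers that are easy to confuse: `best_score_`, the mean cross-validated score of the best candidate on the training folds, and the score of the refitted best estimator on held-out data. The sketch below illustrates the difference on synthetic data. The data, the variable names, and the choice of `fit_intercept` as a second hyperparameter are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=7
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=2000)),
])

# Step names prefix the hyperparameter names: <step>__<parameter>.
demo_grid = GridSearchCV(
    demo_pipe,
    {"classifier__C": [0.1, 10], "classifier__fit_intercept": [True, False]},
    cv=3,
)
demo_grid.fit(X_tr, y_tr)

cv_score = demo_grid.best_score_          # mean CV accuracy on the training data
test_score = demo_grid.score(X_te, y_te)  # accuracy of the refitted best model on the test set
print(demo_grid.best_params_, cv_score, test_score)
```

With the default `refit=True`, `demo_grid.score` delegates to the refitted best estimator, so it matches `demo_grid.best_estimator_.score`.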

<div class="alert alert-info"><b>Exercise 5</b>
    
The previous grid search is incomplete because it only optimizes the hyperparameters of the logistic regression model. Now repeat the same process but testing parameters of all the steps of the pipeline. This exercise is open. You can use any hyperparameter from the scaler, imputer, transformer, encoder, or model. Do not limit yourself to linear models.

<br><i>[5 points]</i>
</div>

### The Approach
We evaluate each hyperparameter in isolation and compare the resulting accuracy with the untuned baseline. We then select the three hyperparameters with the highest accuracy and run a combined grid search over them. Finally, we report the best score observed across the baseline, the individual searches, and the combined search.

In [7]:
# Helper that implements the approach described above.
# It relies on the global X, y, X_train, X_test, y_train and y_test.

def best_hype(params, pipe):
    """Tune each hyperparameter in isolation, then jointly search the top three."""

    # Baseline: mean 3-fold cross-validation accuracy without any tuning
    score_initial = np.mean(cross_val_score(pipe, X, y, cv=3))
    print(score_initial, " Accuracy without tuning")

    # Evaluate each hyperparameter in isolation
    scores = {}
    for key, values in params.items():
        single_param_grid = {key: values}
        grid = GridSearchCV(pipe, single_param_grid, cv=3, n_jobs=-2)
        grid.fit(X_train, y_train)
        score = grid.best_estimator_.score(X_test, y_test)
        scores[key] = score
        print(f" {score} : {key}, {grid.best_params_}")

    # Keep the three hyperparameters with the highest test accuracy
    top_three = dict(sorted(scores.items(), key=lambda x: x[1], reverse=True)[:3])

    # Run a combined grid search over the top three hyperparameters
    print('\n', "We will evaluate the top three parameters: ")
    for key in top_three:
        print(f"{key}: {params[key]}")

    new_grid = GridSearchCV(pipe, {k: params[k] for k in top_three}, cv=3, n_jobs=-2)
    new_grid.fit(X_train, y_train)
    best = new_grid.best_estimator_.score(X_test, y_test)
    print('\n', best, "Best combination: ", new_grid.best_params_)

    # Best overall score: baseline, individual searches, or the combined search
    best_score = max(score_initial, max(scores.values()), best)
    print('\n', best_score, "Best overall score achieved ")
    return best_score
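The helper above ranks hyperparameters by their accuracy on the test set. A common variant, sketched here on synthetic data, ranks them by cross-validated score on the training data and touches the test set only once at the end. The function name `rank_params_by_cv`, the toy data, and the small grid are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def rank_params_by_cv(pipe, params, X_train, y_train, cv=3):
    """Return (name, best mean CV accuracy) pairs, best first."""
    ranking = []
    for name, values in params.items():
        search = GridSearchCV(pipe, {name: values}, cv=cv)
        search.fit(X_train, y_train)
        ranking.append((name, search.best_score_))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=7
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
demo_params = {
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_depth": [2, None],
    "classifier__min_samples_leaf": [1, 5],
}

# Rank on training folds only, then search the two best hyperparameters jointly.
ranking = rank_params_by_cv(demo_pipe, demo_params, X_tr, y_tr)
top_two = [name for name, _ in ranking[:2]]
final = GridSearchCV(demo_pipe, {k: demo_params[k] for k in top_two}, cv=3).fit(X_tr, y_tr)

# The test set is used a single time, for the final estimate.
test_accuracy = final.score(X_te, y_te)
print(ranking, test_accuracy)
```

Keeping the test set out of the selection loop means the final accuracy is not biased by the choices made during tuning.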
    

### Evaluate the Decision Tree

In [10]:
#construct a pipe for decision tree model 
pipe1 = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('transformer', transformer),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', DecisionTreeClassifier())
])

# Define the hyperparameters to test

dt_grid = {
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__splitter': ['best', 'random'],
    'classifier__max_depth': [200, 300, 500, None],
    'classifier__min_samples_split': [2, 3, 5],
    'classifier__min_samples_leaf': [1, 2, 3],
    'classifier__max_leaf_nodes': [200, 300, 500, None],
    'classifier__min_impurity_decrease': [0.0, 0.05, 0.1],
    'classifier__ccp_alpha': [0.0, 0.05, 0.1],   
}
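To see why the hyperparameters are first tested one at a time, it helps to count the candidates. The sketch below copies the decision tree grid from the cell above and uses `ParameterGrid` to compare the size of the full joint search with the isolated searches; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
dt_grid = {
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__splitter': ['best', 'random'],
    'classifier__max_depth': [200, 300, 500, None],
    'classifier__min_samples_split': [2, 3, 5],
    'classifier__min_samples_leaf': [1, 2, 3],
    'classifier__max_leaf_nodes': [200, 300, 500, None],
    'classifier__min_impurity_decrease': [0.0, 0.05, 0.1],
    'classifier__ccp_alpha': [0.0, 0.05, 0.1],
}

n_joint = len(ParameterGrid(dt_grid))                  # every combination
n_isolated = sum(len(v) for v in dt_grid.values())     # one hyperparameter at a time
n_folds = 3

print(n_joint, n_joint * n_folds)        # 5184 candidates, 15552 fits
print(n_isolated, n_isolated * n_folds)  # 24 candidates, 72 fits
```

The isolated searches are far cheaper, at the cost of ignoring interactions between hyperparameters until the final combined search.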

In [11]:

# Evaluate performance
score1 = best_hype(dt_grid, pipe1)

0.8564555559569831  Accuracy without tuning
 0.8850267379679144 : classifier__criterion, {'classifier__criterion': 'entropy'}
 0.9064171122994652 : classifier__splitter, {'classifier__splitter': 'best'}
 0.9064171122994652 : classifier__max_depth, {'classifier__max_depth': None}
 0.893048128342246 : classifier__min_samples_split, {'classifier__min_samples_split': 3}
 0.9010695187165776 : classifier__min_samples_leaf, {'classifier__min_samples_leaf': 1}
 0.9144385026737968 : classifier__max_leaf_nodes, {'classifier__max_leaf_nodes': None}
 0.9144385026737968 : classifier__min_impurity_decrease, {'classifier__min_impurity_decrease': 0.0}
 0.9037433155080213 : classifier__ccp_alpha, {'classifier__ccp_alpha': 0.0}

 We will evaluate the top three parameters: 
classifier__max_leaf_nodes: [200, 300, 500, None]
classifier__min_impurity_decrease: [0.0, 0.05, 0.1]
classifier__splitter: ['best', 'random']

 0.8877005347593583 Best combination:  {'classifier__max_leaf_nodes': 500, 'classifier__mi

### Evaluate Random Forest

In [12]:
# Establish the pipe for RF

pipe2 = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('transformer', transformer),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Define the hyperparameters to test

rf_grid = {
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'classifier__n_estimators': [10, 50, 100, 200, 500, 1000, 1500],
    'classifier__max_depth': [5, 10, 15, 20, 25, 30, None],
    'classifier__min_samples_split': [2, 5, 10, 20, 30],
    'classifier__min_samples_leaf': [1, 2, 5, 10],
    'classifier__max_features': ['auto', 'sqrt', 'log2', None],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__bootstrap': [True, False],
    'classifier__oob_score': [True, False],
    'classifier__warm_start': [True, False],
    'classifier__ccp_alpha': [0.0, 0.1, 0.5, 1.0],
    'classifier__max_leaf_nodes': [None, 5, 10, 20, 50]
}


In [13]:
# Evaluate performance
score2 = best_hype(rf_grid, pipe2)

0.9196580181984279  Accuracy without tuning


6 fits failed out of a total of 9.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/joblib/memory.py", line 349,

 0.9411764705882353 : imputer__strategy, {'imputer__strategy': 'most_frequent'}
 0.9491978609625669 : classifier__n_estimators, {'classifier__n_estimators': 500}
 0.9438502673796791 : classifier__max_depth, {'classifier__max_depth': 25}
 0.9518716577540107 : classifier__min_samples_split, {'classifier__min_samples_split': 5}
 0.9385026737967914 : classifier__min_samples_leaf, {'classifier__min_samples_leaf': 1}
 0.9545454545454546 : classifier__max_features, {'classifier__max_features': 'sqrt'}
 0.9545454545454546 : classifier__criterion, {'classifier__criterion': 'gini'}
 0.9491978609625669 : classifier__bootstrap, {'classifier__bootstrap': False}
 0.9438502673796791 : classifier__oob_score, {'classifier__oob_score': True}
 0.9438502673796791 : classifier__warm_start, {'classifier__warm_start': True}
 0.946524064171123 : classifier__ccp_alpha, {'classifier__ccp_alpha': 0.0}
 0.9411764705882353 : classifier__max_leaf_nodes, {'classifier__max_leaf_nodes': None}

 We will evaluate the to
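The traceback above is truncated, so the exact cause of the failed fits is not visible. One plausible explanation, consistent with 6 of 9 fits failing for a three-value `imputer__strategy` grid, is that the `mean` and `median` strategies cannot be applied to a column of strings. The sketch below reproduces that behaviour on a hypothetical toy frame and shows one way to impute per column type instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({
    "region": pd.Series(["north", "south", "north", "east"], dtype=object),
    "pf_a": [1.0, np.nan, 3.0, 5.0],
})

# A mean imputer applied to the whole frame fails on the text column.
try:
    SimpleImputer(strategy="mean").fit(toy)
    mean_on_text_failed = False
except ValueError as err:
    mean_on_text_failed = True
    print("mean strategy failed:", err)

# Assumed workaround: a separate imputer for each column type.
per_type = ColumnTransformer([
    ("cat", SimpleImputer(strategy="most_frequent"), ["region"]),
    ("num", SimpleImputer(strategy="mean"), ["pf_a"]),
])
imputed = per_type.fit_transform(toy)
print(imputed)
```

With this layout, the numeric strategy can be tuned in a grid search (for example through a key such as `num__strategy` nested under the transformer's step name) without breaking on the categorical column.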

### Evaluate Logistic Regression

In [14]:
log_grid = {
    'imputer__strategy': ['most_frequent', 'mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__penalty': ['l1', 'l2', 'elasticnet'],
    'classifier__solver': ['saga', 'liblinear'],
    'classifier__class_weight': [None, 'balanced'],
    'classifier__fit_intercept': [True, False],
    'classifier__C': [0.1, 1, 5, 10, 100],
    'classifier__max_iter': [1000, 2000, 5000],
    'classifier__multi_class': ['ovr', 'multinomial'],
    'classifier__tol': [1e-4, 1e-3, 1e-2, 1e-1],
    'classifier__l1_ratio': [0, 0.1, 0.3, 0.5, 0.7, 0.9],
    'classifier__warm_start': [True, False]
}

In [15]:
# Evaluate performance
score3 = best_hype(log_grid, pipe)

0.9212657352402287  Accuracy without tuning


6 fits failed out of a total of 9.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/joblib/memory.py", line 349,

 0.9598930481283422 : imputer__strategy, {'imputer__strategy': 'most_frequent'}
 0.9598930481283422 : scaler__with_mean, {'scaler__with_mean': True}
 0.9598930481283422 : scaler__with_std, {'scaler__with_std': True}


3 fits failed out of a total of 9.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1471, in fit
    raise ValueError(
ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)



 0.9598930481283422 : classifier__penalty, {'classifier__penalty': 'l2'}
 0.9598930481283422 : classifier__solver, {'classifier__solver': 'saga'}
 0.9598930481283422 : classifier__class_weight, {'classifier__class_weight': 'balanced'}
 0.9598930481283422 : classifier__fit_intercept, {'classifier__fit_intercept': True}
 0.9598930481283422 : classifier__C, {'classifier__C': 1}
 0.9598930481283422 : classifier__max_iter, {'classifier__max_iter': 2000}
 0.9598930481283422 : classifier__multi_class, {'classifier__multi_class': 'multinomial'}
 0.9598930481283422 : classifier__tol, {'classifier__tol': 0.0001}
 0.9598930481283422 : classifier__l1_ratio, {'classifier__l1_ratio': 0}
 0.9598930481283422 : classifier__warm_start, {'classifier__warm_start': True}

 We will evaluate the top three parameters: 
imputer__strategy: ['most_frequent', 'mean', 'median']
scaler__with_mean: [True, False]
scaler__with_std: [True, False]


24 fits failed out of a total of 36.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/wakeupjz/opt/anaconda3/lib/python3.9/site-packages/joblib/memory.py", line 3


 0.9598930481283422 Best combination:  {'imputer__strategy': 'most_frequent', 'scaler__with_mean': True, 'scaler__with_std': True}

 0.9598930481283422 Best overall score achieved 
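The `l1_ratio` error in the output above appears when `penalty='elasticnet'` is tested on its own while `l1_ratio` keeps its default of `None`. `GridSearchCV` also accepts a list of dictionaries, which makes it possible to pair each penalty only with compatible settings. The sketch below builds such a grid and enumerates it with `ParameterGrid`, without fitting anything. The specific values are illustrative, not the notebook's.

```python
from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded separately, so incompatible pairs never occur.
compatible_grid = [
    {
        "classifier__penalty": ["l1", "l2"],
        "classifier__solver": ["saga", "liblinear"],
    },
    {
        # Elastic net needs the saga solver and an explicit l1_ratio.
        "classifier__penalty": ["elasticnet"],
        "classifier__solver": ["saga"],
        "classifier__l1_ratio": [0.1, 0.5, 0.9],
    },
]

candidates = list(ParameterGrid(compatible_grid))
for candidate in candidates:
    print(candidate)
print(len(candidates))  # 4 + 3 = 7 candidates
```

The same list can be passed directly as the `param_grid` argument of `GridSearchCV`.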


### Final Evaluation


In [16]:
overview = pd.DataFrame({"Decision Tree": score1,
                         "Random Forest": score2,
                         "Logistic Regression": score3
                          }, index = ["Accuracy"])

overview

| | Decision Tree | Random Forest | Logistic Regression |
|---|---|---|---|
| Accuracy | 0.914439 | 0.954545 | 0.959893 |
