
# Assignment 2: Hyperparameter Optimization For The Human Freedom Index Model

This notebook contains a set of exercises that will guide you through the different steps of this assignment. As in Assignment 1, solutions need to be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in the dedicated cell and not to modify the test cells. When you have completed all the exercises, submit this same notebook back to Moodle in **.ipynb** format.

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021 ">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties—freedom of speech, religion, association, and assembly— in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. We also include nine variables pertaining to women-specific freedoms that are found in various categories of the index.

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, February 12th, 23:55</div>

In [1]:
import os
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

<div class="alert alert-info"><b>Exercise 1</b>
    
Load the Human Freedom Index data from the link: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv in a DataFrame called ```df```. The following columns are redundant and should be dropped:
* ```year```
* ```ISO```
* ```countries```
* All columns containing the word ```rank``` 
* All columns containing the word ```score```

Then store the independent variables in a DataFrame called ```X``` and the dependent variable (```hf_quartile```) in a DataFrame called ```y```.
    
<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
Do not download the dataset. Instead, read the data directly from the provided link
</div>
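
The column-dropping step can be tried offline first. The following is a minimal sketch on a toy DataFrame, assuming hypothetical column names (`pf_rol`, `pf_rank`, `ef_score`) that only mimic the layout of the real file; it is not the dataset itself.

```python
import pandas as pd

# Toy frame with hypothetical column names that mimic the dataset's layout.
toy = pd.DataFrame({
    "year": [2019, 2019],
    "region": ["A", "B"],
    "pf_rol": [5.0, 6.0],
    "pf_rank": [10, 20],
    "ef_score": [7.1, 6.9],
    "hf_quartile": [1.0, 2.0],
})

# Select every column whose name contains "rank" or "score".
pattern_cols = toy.filter(regex="rank|score").columns

# Drop the fixed column and the pattern-matched columns in one call.
reduced = toy.drop(columns=["year", *pattern_cols])

# Split the independent variables from the target.
X_toy = reduced.drop(columns=["hf_quartile"])
y_toy = reduced["hf_quartile"]

print(list(X_toy.columns))
```

Selecting by pattern with `DataFrame.filter` avoids listing each rank or score column by hand, which matters when the real file has many of them.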

In [2]:
#Read the data directly from the link and store it in df
df = pd.read_csv("https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv")

#Remove rows with missing values in the target variable
df.dropna(subset=['hf_quartile'], inplace=True)

#Remove the columns listed in the exercise instructions
df = df.drop(columns=['year', 'ISO', 'countries'])
df = df.drop(columns=df.filter(regex='rank').columns)
df = df.drop(columns=df.filter(regex='score').columns)

#Store the independent variables in X and the target in y
y = df['hf_quartile']
X = df.drop(columns=['hf_quartile'])

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2</b>
    
Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the most frequent strategy, a ```OneHotEncoder``` for the categorical variables, a standard scaler, and a logistic regression model with the solver ```saga``` and ```max_iter``` set to 2000. Store the resulting pipeline in a variable called ```pipe```.
    
<br><i>[1 point]</i>
</div>
<div class='alert alert-warning'>

Not all the attributes are categorical. Ensure that all non-categorical attributes remain intact.
</div>
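
One way to keep the non-categorical attributes intact is to route columns by name through a ```ColumnTransformer```. The sketch below is illustrative only: the toy data, the column names `region` and `indicator`, and the choice of `sparse_threshold=0` (to force a dense output for the scaler) are assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one categorical column and one numeric column with a gap.
toy_X = pd.DataFrame({
    "region": ["north", "south", "south", "north", "south", "north"],
    "indicator": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
})
toy_y = np.array([0, 1, 1, 0, 1, 0])

# Categorical branch: impute, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric columns are only imputed here; scaling happens in the next step.
preprocess = ColumnTransformer(
    [
        ("cat", categorical, ["region"]),
        ("num", SimpleImputer(strategy="most_frequent"), ["indicator"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_pipe = Pipeline([
    ("preprocess", preprocess),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(solver="saga", max_iter=2000)),
])

toy_pipe.fit(toy_X, toy_y)
print(toy_pipe.predict(toy_X).shape)
```

Referring to columns by name rather than by position keeps the encoder pointed at the right column even if the column order changes.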

In [113]:
#Separating numerical and categorical variables
num_variables=[]
cat_variables=[]
for col in X.columns: 
    if X[col].dtypes == 'float64':
        num_variables.append(col)
    else:
        cat_variables.append(col)

#Making the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#Obtaining the region column's index number
region_idx = X.columns.get_loc('region')

#Creating the transformer to make the pipeline
#(scikit-learn >= 1.2 renames the sparse argument to sparse_output)
onehotencoder = OneHotEncoder(sparse=False)
transformer = ColumnTransformer([('ohe', onehotencoder, [region_idx])], remainder='passthrough')

#Creating the pipeline
pipe_steps=[('impute', SimpleImputer(strategy='most_frequent')),
            ('preprocess', transformer), 
            ('scaler', StandardScaler()), 
            ('lr', LogisticRegression(solver='saga', max_iter=2000))]
pipe = Pipeline(pipe_steps)


In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3</b>

Write the code to estimate the performance of the model using cross-validation with **three** stratified folds. Store the three test score values in a dictionary called ```fold_scores```.
    
<br><i>[1 point]</i>
</div>
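
As a rough illustration of stratified cross-validation, the sketch below uses synthetic data from `make_classification` and a plain logistic regression; both are stand-ins chosen for the example, not the assignment's data or pipeline. The dictionary name `fold_scores_demo` is likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Three stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One accuracy value per fold.
scores_demo = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

# Map fold number (starting at 1) to its test score.
fold_scores_demo = {fold: score for fold, score in enumerate(scores_demo, start=1)}
print(fold_scores_demo)
```

Passing an integer such as `cv=3` with a classifier also yields stratified folds; building the `StratifiedKFold` object explicitly just makes the splitting strategy visible and reproducible.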

In [114]:
#Compute the cross-validation scores (an integer cv uses stratified folds for classifiers)
scores = cross_val_score(pipe, X_train, y_train, cv=3)

#Store the score of each fold in a dictionary keyed by fold number
fold_scores = {fold: score for fold, score in enumerate(scores, start=1)}

#result  
fold_scores

{1: 0.9550321199143469, 2: 0.9357601713062098, 3: 0.9291845493562232}

In [None]:
stop_patchers(patchers)
call_order = ['estimator', 'X', 'y', 'groups', 'scoring', 'cv', 'n_jobs', 'verbose', 'fit_params', 'pre_dispatch', 'error_score']
check_args({'estimator': pipe, 'X': X, 'y': y, 'cv': 3}, call_order, mocks)

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4</b>

    
Write the code to create a GridSearchCV object called ```grid``` and fit it using **only three folds**. The grid search object must include the previous pipeline and test the following hyperparameters:
* ```penalty``` : ['l1', 'l2']
* ```C``` : [0.1,10]

Finally, store the best achieved score (accuracy) in a variable called ```score```.

<br><i>[2.5 points]</i>
</div>

<div class='alert alert-warning'>

Use train and test datasets correctly.
</div>
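
The warning above is about keeping the test set out of the search. A minimal sketch of that workflow follows, assuming synthetic data and a two-step pipeline whose step names (`scaler`, `lr`) are chosen for the example: the grid is fitted on the training split only, and the test split is used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(solver="saga", max_iter=2000)),
])

# Hyperparameters of a pipeline step are addressed as "<step name>__<parameter>".
demo_grid = GridSearchCV(
    demo_pipe,
    {"lr__penalty": ["l1", "l2"], "lr__C": [0.1, 10]},
    cv=3,
)
demo_grid.fit(X_tr, y_tr)  # the search only ever sees the training split

cv_score = demo_grid.best_score_          # mean cross-validated accuracy
test_score = demo_grid.score(X_te, y_te)  # accuracy on the held-out split
print(demo_grid.best_params_, cv_score, test_score)
```

`best_score_` is an average over validation folds inside the training data, so it is the value to report as the search result; the test accuracy is a separate, final check.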

In [115]:
#Establish the desired parameters to test
param_grid = {'lr__penalty': ['l1', 'l2'], 'lr__C': [0.1, 10]}

#Set up the grid search and fit it on the training data only
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

#Store the best cross-validated accuracy
score = grid.best_score_

#Accuracy of the best estimator on the held-out test set
test_score = grid.best_estimator_.score(X_test, y_test)



In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5</b>
    
The previous grid search is incomplete because it only optimizes the hyperparameters of the logistic regression model. Now repeat the same process but testing parameters of all the steps of the pipeline. This exercise is open. You can use any hyperparameter from the scaler, imputer, transformer, encoder, or model. Do not limit yourself to linear models.

<br><i>[5 points]</i>
</div>
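
One way to search across different model families is to treat the final pipeline step itself as a hyperparameter and pass ```GridSearchCV``` a list of grids, one per model. The sketch below is a small illustration under assumptions: synthetic data, a step named `classifier`, and deliberately tiny grids so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real features and target.
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

# The classifier given here is only a placeholder; each grid replaces it.
search_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Each dictionary is searched independently, so model-specific
# parameters never get combined with the wrong estimator.
param_grids = [
    {
        "classifier": [LogisticRegression(solver="saga", max_iter=2000)],
        "scaler__with_mean": [True, False],
        "classifier__C": [0.1, 10.0],
    },
    {
        "classifier": [DecisionTreeClassifier(random_state=0)],
        "classifier__max_depth": [3, None],
    },
    {
        "classifier": [RandomForestClassifier(n_estimators=20, random_state=0)],
        "classifier__max_depth": [3, None],
    },
]

search = GridSearchCV(search_pipe, param_grids, cv=3)
search.fit(X_demo, y_demo)
print(type(search.best_params_["classifier"]).__name__, search.best_score_)
```

The number of candidates is the sum of the grid sizes (here 4 + 2 + 2), not their product, which keeps the search tractable as more models are added.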

In [109]:
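#Create variables that represent each candidate model
#(saga supports both the l1 and l2 penalties tested below; the default lbfgs solver rejects l1)
LR= LogisticRegression(solver='saga', max_iter=2000)
DT= DecisionTreeClassifier()
RF= RandomForestClassifier()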
#Creating the transformer to make the pipelines
onehotencoder= OneHotEncoder(sparse=False)
imputer= SimpleImputer(strategy='most_frequent')
inner_pipe_steps = [('impute', imputer), ('ohe', onehotencoder)]
inner_pipe= Pipeline(inner_pipe_steps)
transformer = ColumnTransformer([('inner', inner_pipe, cat_variables)], remainder = 'passthrough')

#Creating the shared pipeline; the classifier step is a placeholder that each parameter grid replaces
pipe_steps=[('preprocess', transformer),
            ('imputer', SimpleImputer()),
            ('scaler', StandardScaler()), 
            ('classifier', DT)]
pipe= Pipeline(pipe_steps)

# Define the parameter grid for the LR
param_gridlr = {
    'classifier':[LR],
    'imputer__strategy' : ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__C': [0.1, 1.0, 10.0], 
    'classifier__penalty' :  ['l1', 'l2']
}


# Define the parameter grid for the DT
param_griddt = {
    'classifier': [DT],
    'imputer__strategy' : ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [3, 4, 5, 6, 7, None],
    'classifier__min_samples_split': [2, 3, 4, 5],
    'classifier__min_samples_leaf': [1, 2, 3, 4]
}

# Define the parameter grid for the RF
param_gridrf = {
    'classifier': [RF],
    'imputer__strategy': ['median', 'mean'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [3, 4, 5, 6, 7, None],
    'classifier__min_samples_split': [2, 3, 4, 5],
    'classifier__min_samples_leaf': [1, 2, 3, 4]
}

# Perform the grid search over the three model families
grid_param_list = [param_gridlr, param_griddt, param_gridrf]
grid_search = GridSearchCV(pipe, grid_param_list, cv=3, n_jobs=-2)
grid_search.fit(X_train, y_train)

# Keep the best configuration, its cross-validated accuracy, and its accuracy on the test set
Best_parameters = grid_search.best_params_
Best_score = grid_search.best_score_
Best_test_score = grid_search.best_estimator_.score(X_test, y_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [80]:
#Check that no missing values remain in the whole DataFrame after most-frequent imputation

dfx = pd.read_csv("https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv")

imputer = SimpleImputer(strategy='most_frequent')
pd.DataFrame(imputer.fit_transform(dfx)).isnull().sum().sum()


0
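
To see what the imputer actually changes, it can help to count missing values before and after. The sketch below uses a small hand-made frame (the column names `a` and `b` are hypothetical) rather than the real file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one string column, each with a gap.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 1.0],
    "b": pd.Series(["x", "x", np.nan], dtype=object),
})

# Total number of missing cells before imputation.
missing_before = int(toy.isnull().sum().sum())

# Most-frequent imputation works for numeric and string columns alike.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Total number of missing cells after imputation.
missing_after = int(filled.isnull().sum().sum())
print(missing_before, missing_after)
```

Comparing the two counts confirms that the zero comes from the imputation itself and not from the data being complete to begin with.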