
# Assignment 3: Deep Learning Classifier for The Human Freedom Index

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions must be code-based: hard-coded or manually computed results will not be accepted. Write your solution to each exercise in its dedicated cell and do not modify the test cells. When you have completed all the exercises, submit this same notebook to Moodle in **.ipynb** format.

<div class="alert alert-success">

The <a href="https://www.cato.org/human-freedom-index/2021">Human Freedom Index</a> measures economic freedoms such as the freedom to trade or to use sound money, and it captures the degree to which people are free to enjoy the major freedoms often referred to as civil liberties (freedom of speech, religion, association, and assembly) in the countries in the survey. In addition, it includes indicators on rule of law, crime and violence, freedom of movement, and legal discrimination against same-sex relationships. We also include nine variables pertaining to women-specific freedoms that are found in various categories of the index.

<u>Citation</u>

Ian Vásquez, Fred McMahon, Ryan Murphy, and Guillermina Sutter Schneider, The Human Freedom Index 2021: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute and the Fraser Institute, 2021).
    
</div>


<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, February 26th, 23:55</div>


In [1]:
import numpy as np
import pandas as pd

<div class="alert alert-info"><b>Exercise 1</b>

Load the Human Freedom Index data from the link: https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv in a DataFrame called ```df```. The following columns are redundant and should be dropped:
* ```year```
* ```ISO```
* ```countries```
* All columns containing the word ```rank``` 
* All columns containing the word ```score```
    
Then store the independent variables in a DataFrame called ```X``` and the dependent variable (```hf_quartile```) in a DataFrame called ```y```. Finally, split them into separate training and test sets with relative sizes of 0.75 and 0.25. Store the training and test feature matrices in variables called ```X_train``` and ```X_test```, and the training and test label arrays in ```y_train``` and ```y_test```.

<br><i>[1 point]</i>
</div>
<div class="alert alert-warning">
Do not download the dataset. Instead, read the data directly from the provided link.
</div>
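
The column-dropping and splitting steps can be illustrated on a small made-up table. This is only a minimal sketch, not the reference solution: the toy column names and values below are invented for illustration, and the variable names (`toy`, `X_toy`, `X_tr`, ...) are hypothetical. It shows how a regular expression passed to `DataFrame.filter` selects every column whose name contains a given word, and how `train_test_split` produces a 0.75/0.25 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (column names and values are invented).
toy = pd.DataFrame({
    "year": [2019] * 8,
    "region": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "pf_score": [7.1, 6.4, 8.0, 5.5, 7.7, 6.1, 8.3, 5.9],
    "pf_rank": [3, 6, 2, 8, 4, 5, 1, 7],
    "ef_money": [9.0, 7.5, 8.8, 6.0, 9.1, 7.0, 9.4, 6.6],
    "hf_quartile": [1, 3, 1, 4, 2, 3, 1, 4],
})

# Names of all columns containing "rank" or "score" (regex alternation).
redundant = list(toy.filter(regex="rank|score").columns)
toy = toy.drop(columns=["year"] + redundant)

# Separate the features from the target.
X_toy = toy.drop(columns=["hf_quartile"])
y_toy = toy["hf_quartile"]

# 75% / 25% split; random_state fixes the shuffle for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(list(X_toy.columns), X_tr.shape, X_te.shape)
```

Fixing `random_state` is a design choice that makes the split repeatable across runs; the value itself is arbitrary.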

In [2]:
# YOUR CODE HERE
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv")

# Remove rows with a missing value in the target variable
df = df.dropna(subset=['hf_quartile'])

# Remove the redundant columns listed in the exercise instructions
df = df.drop(columns=['year', 'ISO', 'countries'])
df = df.drop(columns=list(df.filter(regex='rank')))
df = df.drop(columns=list(df.filter(regex='score')))

# Store the independent variables in X and the target in y
y = df['hf_quartile']
X = df.drop(columns=['hf_quartile'])

# Split into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
## LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2</b>

Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the most frequent strategy, a ```OneHotEncoder``` for the categorical variables, a standard scaler, and an ```MLPClassifier``` model specifying ```max_iter``` equal to 250. Store the resulting pipeline in a variable called ```pipe```.

<br><i>[1.5 points]</i>
</div>
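
One possible way to arrange such preprocessing is sketched below on invented toy data. It is a variant, not the expected solution: here the imputation and encoding are nested inside a `ColumnTransformer` so that each column type gets its own sub-pipeline, and only the numeric column is scaled. The toy data, the column names, the names `toy_pipe`, `categorical` and `numeric`, and the `random_state` value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with one categorical and one numeric column, each with a gap.
toy_X = pd.DataFrame({
    "region": pd.Series(["A", "B", np.nan, "B", "A", "B", "A", "B"], dtype=object),
    "ef_money": [9.0, 7.5, 8.8, np.nan, 9.1, 7.0, 9.4, 6.6],
})
toy_y = np.array([1, 2, 1, 2, 1, 2, 1, 2])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric columns: fill gaps with the most frequent value, then standardize.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["region"]),
    ("num", numeric, ["ef_money"]),
])

# Full pipeline: preprocessing followed by the neural-network classifier.
toy_pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", MLPClassifier(max_iter=250, random_state=0)),
])
toy_pipe.fit(toy_X, toy_y)
print(toy_pipe.predict(toy_X).shape)
```

Selecting columns by name keeps the encoder tied to the right column even if the column order changes; on a dataset this small the classifier may emit a convergence warning, which does not affect the illustration.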

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier

# Impute missing values with the most frequent value of each column
simpleimputer = SimpleImputer(strategy='most_frequent')

# One-hot encode the column at position 0, which this solution treats as the
# categorical one, and pass the remaining columns through unchanged
onehotencoder = OneHotEncoder()
transformer = ColumnTransformer([('ohe', onehotencoder, [0])], remainder='passthrough')

# Chain imputation, encoding, scaling, and the classifier
pipe_steps = [('imputer', simpleimputer), ('transformer', transformer), ('scaler', StandardScaler()), ('clf', MLPClassifier(max_iter=250))]
pipe = Pipeline(pipe_steps)



In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3</b>
    
Write the code to create a GridSearchCV object called ```grid``` and fit it. The grid search object must contain the pipeline created in the previous exercise. Then, consider the following hyperparameters:
* ```learning_rate_init``` : [0.001, 0.0001]
* ```alpha``` : [0.0001,1]

Finally, store the best score (accuracy) of the training phase in a variable called ```training_score```.
<br><i>[1.5 points]</i>
</div>
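
When the estimator is a pipeline, the grid keys must follow the `<step name>__<parameter>` convention so that `GridSearchCV` can route each value to the right step. The sketch below illustrates this on synthetic data from `make_classification`; the data, the step names `scaler` and `clf`, the `demo_` variable names, and the `cv=3` and `random_state` values are assumptions chosen for illustration rather than requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real training set.
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", MLPClassifier(max_iter=250, random_state=0)),
])

# Keys are "<step name>__<parameter name>"; here the step is called "clf".
demo_params = {
    "clf__learning_rate_init": [0.001, 0.0001],
    "clf__alpha": [0.0001, 1],
}

# 2 x 2 = 4 candidate settings, each evaluated with 3-fold cross-validation.
demo_grid = GridSearchCV(demo_pipe, demo_params, cv=3, scoring="accuracy")
demo_grid.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best candidate.
demo_training_score = demo_grid.best_score_
print(demo_grid.best_params_, round(demo_training_score, 3))
```

Note that `best_score_` is the mean score over the validation folds for the best parameter combination, not the accuracy on the full training set.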

In [None]:
# YOUR CODE HERE
from sklearn.model_selection import GridSearchCV

# Define the parameter grid; the `clf__` prefix targets the MLPClassifier step of the pipeline
param_gridMLP = {
    'clf__learning_rate_init': [0.001, 0.0001],
    'clf__alpha': [0.0001, 1]
}

# Perform the grid search (3-fold cross-validation, accuracy is the default score for classifiers)
grid_search = GridSearchCV(pipe, param_gridMLP, cv=3, n_jobs=-2)
grid_search.fit(X_train, y_train)

# The exercise asks for the fitted object to be available as `grid`
grid = grid_search

# Best mean cross-validated accuracy found during the search
training_score = grid_search.best_score_
training_score

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4</b>
    
Write the code to compute the ```score``` actually achieved by the previous grid search on the test set, to check whether or not your model is overfitting.
    
<br><i>[1 point]</i>
</div>
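
A simple way to read the two numbers together is to look at the gap between the cross-validated score and the test score: a test score well below the cross-validated one suggests overfitting. The helper names and the `tolerance` threshold below are hypothetical and arbitrary, chosen only to make the comparison concrete.

```python
def generalization_gap(cv_score: float, test_score: float) -> float:
    """Return how much higher the cross-validated score is than the test score."""
    return cv_score - test_score


def looks_overfit(cv_score: float, test_score: float, tolerance: float = 0.05) -> bool:
    """Flag a model whose test score trails its cross-validated score by more than `tolerance`.

    The default tolerance of 0.05 is an arbitrary illustrative threshold.
    """
    return generalization_gap(cv_score, test_score) > tolerance


# Made-up scores for illustration only.
print(looks_overfit(0.95, 0.80))
print(looks_overfit(0.90, 0.89))
```

What counts as an acceptable gap depends on the dataset size and the variance between folds, so any fixed threshold should be treated as a rule of thumb.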

In [None]:
# Accuracy of the best pipeline found by the grid search, measured on the held-out test set
score = grid_search.best_estimator_.score(X_test, y_test)
score

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5</b>

The previous exercises use the scikit-learn MLP classifier. Now create an MLP classifier using the Keras library. You can use any tutorial, website, or documentation for this task. Describe how you preprocessed the dataset, the network architecture used, and any tricks you employed in the exercise.
    
<br><i>[5 points]</i>
</div>
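
Whatever preprocessing is chosen, a common convention is to fit the transformers on the training set only and then apply them unchanged to the test set, so that no information from the test set leaks into training. The sketch below illustrates this on random toy arrays; the data, the array shapes, and the variable names are assumptions made for illustration. It also one-hot encodes the labels, which is the target format a softmax output layer trained with categorical cross-entropy expects.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random toy features and four-class labels (invented for illustration).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(12, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
y_tr = np.array([1, 2, 3, 4] * 3).reshape(-1, 1)
y_te = np.array([2, 4, 1, 1]).reshape(-1, 1)  # class 3 is absent from the test labels

# Learn the scaling statistics from the training set only, then reuse them.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Learn the label categories from the training set only, then reuse them.
label_encoder = OneHotEncoder().fit(y_tr)
y_tr_onehot = label_encoder.transform(y_tr).toarray()
y_te_onehot = label_encoder.transform(y_te).toarray()

print(X_tr_scaled.mean(axis=0).round(6), y_te_onehot.shape)
```

Because the encoder was fitted on the training labels, the test labels still get four columns even though one class does not occur in them; refitting the encoder on the test labels would silently produce a different number of columns.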

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the data into a new DataFrame, df2
df2 = pd.read_csv("https://github.com/jnin/information-systems/raw/main/data/hfi_cc_2021.csv")

# Remove rows with a missing value in the target variable
df2 = df2.dropna(subset=['hf_quartile'])

# Remove the redundant columns listed in the Exercise 1 instructions
df2 = df2.drop(columns=['year', 'ISO', 'countries'])
df2 = df2.drop(columns=list(df2.filter(regex='rank')))
df2 = df2.drop(columns=list(df2.filter(regex='score')))
df2['hf_quartile'] = pd.Categorical(df2.hf_quartile)

# Create X and y
y = df2['hf_quartile']
X = df2.drop(columns=['hf_quartile'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Preprocess X_train and X_test: impute missing values, one-hot encode the
# categorical column, and scale the data.
# `sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`.
simpleimputer = SimpleImputer(strategy='most_frequent')
onehotencoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
transformer = ColumnTransformer([('ohe', onehotencoder, [0])], remainder='passthrough')
pipe_steps = [('imputer', simpleimputer), ('transformer', transformer), ('scaler', StandardScaler())]
pipe = Pipeline(pipe_steps)

# Fit the preprocessing on the training set only, then apply it to the test set
X_train_preprocessed = pd.DataFrame(pipe.fit_transform(X_train))
X_test_preprocessed = pd.DataFrame(pipe.transform(X_test))

# One-hot encode y_train and y_test, fitting the encoder on the training labels only
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

onehotencoder = OneHotEncoder(sparse_output=False)

y_train_enc = pd.DataFrame(onehotencoder.fit_transform(y_train[['hf_quartile']]))
y_test_enc = pd.DataFrame(onehotencoder.transform(y_test[['hf_quartile']]))

In [None]:
%pip install -q -U keras-tuner

In [None]:
# Convert the data to float arrays, the format Keras expects

X_train_preprocessed = np.asarray(X_train_preprocessed).astype(np.float64)
y_train_enc = np.asarray(y_train_enc).astype(np.float64)
X_test_preprocessed = np.asarray(X_test_preprocessed).astype(np.float64)
y_test_enc = np.asarray(y_test_enc).astype(np.float64)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
# The keras-tuner package is imported as `keras_tuner`; `kerastuner` is its older, deprecated name
from keras_tuner import BayesianOptimization

# Define the model-builder function for the tuner. It declares the hyperparameters we want to optimize:
# the number of hidden layers in the neural network (between 2 and 5), the number of neurons per layer (between 32 and 512),
# the learning rate (0.01, 0.001, or 0.0001), and the activation function (relu, sigmoid, or tanh).
# The model performance metric we aim to optimize is the accuracy.
def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Flatten(input_shape=(X_train_preprocessed.shape[1],)))
    for i in range(hp.Int('num_layers', 2, 5)):
        model.add(layers.Dense(units=hp.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=512,
                                            step=32),
                               activation=hp.Choice('act_' + str(i),
                                                    values=['relu', 'tanh', 'sigmoid'])))
    model.add(layers.Dense(4, activation='softmax'))
    model.compile(
        optimizer=optimizers.Adam(
            hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])),
        # Since we are solving a classification task, categorical cross-entropy is a better-suited loss
        # than, for example, the mean squared error, which is more suited to regression tasks
        loss="categorical_crossentropy",
        metrics=['accuracy'])
    return model

# Instantiate the tuner with our build_model function, and indicate the objective metric to optimize
# as well as the number of trials and the number of executions per trial
tuner = BayesianOptimization(
    build_model,
    objective='val_accuracy',
    max_trials=10,
    executions_per_trial=2)

# Search for the best hyperparameters
tuner.search(X_train_preprocessed, y_train_enc,
             epochs=10,
             validation_split=0.2)

# Get the best model
best_model = tuner.get_best_models(num_models=1)[0]

# Evaluate the best model on the test set
test_loss, test_acc = best_model.evaluate(X_test_preprocessed, y_test_enc)
print('Test accuracy:', test_acc)

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

# Print the best hyperparameters
print(f"Number of layers: {best_hps.get('num_layers')}")
for i in range(best_hps.get('num_layers')):
    print(f"Units in layer {i}: {best_hps.get('units_' + str(i))}")
    print(f"Activation function in layer {i}: {best_hps.get('act_' + str(i))}")
print(f"Learning rate: {best_hps.get('learning_rate')}")