<a href="https://colab.research.google.com/github/pmontman/pub-choicemodels/blob/main/nb/practice_final_ML_DoE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Review Machine Learning and Design of Experiments

We will cover:

* Machine Learning (ML) models
* Model validation
* Design of Experiments (DoE)

---
---

#  Machine Learning question) Fitting a ML model
Including: 
* Sanity checking, for example, be careful when including variables such as availabilities, panel information or variables that might not be known when predicting new individuals.
* A bit of Model tuning:
 * Which variables to include in the model? 
 * Some fine tuning of the hyperparameters.
 * Using a train/test split for the tuning.


# Machine Learning + Logit Models question)  Final model selection
 * Be able to justify a decision to go for one particular model (including the multinomial logit variants). This involves subjectively weighing the accuracy and interpretability of all the models and then deciding.


# Design of Experiments questions) (a few short questions)

 Some questions about understanding important issues in DoE.

---
---

# 1) Fit a ML model to the dataset

We have seen either **decision trees** or **multilayer perceptron neural networks**. We will have to fit one model and fine tune it just a bit:
 * Try several values for the 'hyperparameter':
   * The maximum depth of the tree in decision trees: the `max_depth` argument in `DecisionTreeClassifier`.
   * The size of the hidden layer in the neural network. This means that we only need to try a 1-hidden-layer network (the `hidden_layer_sizes` argument in the sklearn class `MLPClassifier`).
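
As a rough illustration of what 'trying several values' can look like for a decision tree, here is a minimal sketch. The data is synthetic (generated with `make_classification`, an assumption made only so the block runs on its own), and the list of depths is an arbitrary choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the course dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Try a few values of the hyperparameter and record the test accuracy.
depth_scores = {}
for depth in [2, 4, 6, 9, 14]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    depth_scores[depth] = tree.score(X_test, y_test)

# Depth with the highest accuracy on the test split.
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores, best_depth)
```

The accuracy is only one criterion; the confusion matrix of each candidate is worth a look before settling on a value.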

### Some comments about Sanity checking

*Note: This section ended up being a bit too verbose, what it means is: be careful and do not add 'invalid' variables to the models, pay attention to the description of the dataset.*

Part of the benefit of ML models is that they can deal with complex variables or 'find transformations' of meaningful variables in an automatic way. Therefore we usually start by 'throwing all available variables at them' and check the results.

 This is quite the opposite of how we do things with the logit type models: we often carefully think about the relationship between the variables and the utility and define a model. 
 
 The two methodologies, logit and ML models, complement each other well, but we need to be careful when using all variables, since some of them might lead to models that are not useful in practice.

 * **Auxiliary variables such as panel id or availability information:** they are a by-product of the data-gathering process; they are used in the modeling but not as explanatory variables.
 * **Variables that would not be useful when predicting**
   * Classic example is the choice variable itself. 
   * A more subjective example is variables that are not useful when predicting because they would not be known for new individuals. For example, a dataset might gather information about several choice problems, such as travel mode (car, train, bus) and living area (city, suburb, rural area). In that case we should not use information about the preferences in one subproblem to model the other, because that data would not be available for new individuals (those that did not participate in the survey).
   * Another subjective example: in the infidelity dataset, we have variables such as rating, the self-rating of the marriage. I would say that this variable can be 'too informative': a model that explains affairs behavior but depends on the self-rating as an explanatory variable is not very useful in practice. On the other hand, it can be argued that knowing the effect that the self-rating has on the behavior can be an important finding (imagine that we find that the self-rating of the marriage does not influence the likelihood of infidelity).
 * **Variables that are not well documented or understood:** This one is a bit artificial, but if a dataset has variables whose meaning or encoding we do not really understand, using them can lead to results that are not very meaningful. 
 
  For example, variables such as availabilities or panel id are not useful explanatory variables, they are a by product of the data-gathering process, they are used in the modeling but not as explanatory.
  
**NOTE:** These problems are not specific to ML models; they can also appear in simpler logit models, but the flexibility of ML models makes them more susceptible. For example, adding the choice variable to a simple linear model might not be catastrophic, because the relationship between how it is encoded in the dataset and the utility is not a simple 'line'. But an ML model might be able to pick up a very complex pattern that is totally useless in practice.

----
----


# 2) Model selection, which model to choose among several

**IMPORTANT!:** You are not required to have found a 'perfect model', just that you tried several reasonable alternatives, and demonstrate that you can compare them in a principled way.

In this exercise we have to compare all types of models and variations of the variables and make a justified decision for one model.

A holdout dataset will be given to test the models (in addition to whatever statistical tests you might use). This is another dataset, different from the train/test split that you might use to tune the machine learning model.


A non-exhaustive list of things to consider:
 * Explainability:
    * Do we get an intuitive understanding of the model (usually linear vs machine learning)?
    * Can we compute easy measures such as WTP? 
    * Within the logit family:
      * Ordered logit is much simpler; it might be worth sacrificing some predictive capacity, even when a statistical test significantly favors other models.
      * Nested: How good is the nested structure? Does it make sense? Can we draw some insight about the process from it? 
 * Accuracy in the holdout dataset: Which models are more accurate, and by how much.
 * Confusion matrix
    * How are the errors distributed for each model? Does it make a difference when deciding for one of the candidate models?
    * Maybe a model is accurate but produces 'degenerate' predictions, such as all predictions going to the same class.
    * Maybe some confusions are more serious than others, even to the point of preferring a model with slightly worse accuracy that makes less serious errors. Examples of errors: *predicting a cheap phone from the functional brand of smartphones when the actual preference is for an expensive smartphone from the stylish brand* is a more serious error than *predicting a slightly cheaper phone within the same brand (the actual preference is a slightly more expensive phone within the same brand)*. In ordinal logit, the farther apart the alternatives, the worse the error (predicting *totally disagree when the actual preference is totally agree* is worse than predicting *totally disagree when the actual preference is 'neutral'*). When there is a clear nesting structure, the errors within nests might be less serious than errors across nests (red bus when the true value is blue bus vs red bus when the true value is car), though this can be quite subjective.
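
One possible way to put a number on 'how serious' the errors are in an ordinal setting is the average number of ordinal steps between the actual and the predicted class. The sketch below is only an illustration: the function name `mean_ordinal_distance` and the toy answers are made up, and other cost schemes are equally defensible.

```python
import numpy as np
import pandas as pd

def mean_ordinal_distance(actual, predicted, ordered_levels):
    """Average number of ordinal steps between actual and predicted class."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    a = pd.Series(actual).map(rank).to_numpy()
    p = pd.Series(predicted).map(rank).to_numpy()
    return float(np.mean(np.abs(a - p)))

levels = ['totally disagree', 'disagree', 'neutral', 'agree', 'totally agree']
actual = ['totally agree', 'neutral', 'agree', 'disagree']
# Model A: three correct answers and one extreme miss.
model_a = ['totally disagree', 'neutral', 'agree', 'disagree']
# Model B: two correct answers and two near misses.
model_b = ['agree', 'neutral', 'agree', 'neutral']

dist_a = mean_ordinal_distance(actual, model_a, levels)
dist_b = mean_ordinal_distance(actual, model_b, levels)
print(dist_a, dist_b)   # 1.0 0.5
```

Here model A is more accurate (3 of 4 correct against 2 of 4), yet model B has the smaller average distance, which is the kind of trade-off described in the list above.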



----
----

# Design of Experiments

These will be more 'theory' or check-your-understanding questions, no need to program (but of course you might code if you want to double check something).
The following is a list of potential questions to help you get an idea of the actual questions that *might* appear in the exam.

*I know they are not too specific, I hope you understand, we have to make an exam...*
*They will be clear and about important issues in DoE, nothing too far-fetched or academic.*

* ## Given a design description, How many observations are needed to estimate all parameters?

* ## Identify a potential problem in a given design

* ## What is a pilot study? Why is a pilot study important? 

* ## How many attributes could be tested given a specific sample size and description?

* ## Given a description, is it revealed preferences or stated preferences? Differences between the two?

* ## Given a description, How would you encode variables (linear or dummy) ?

* ## Given a design, which observations/attributes would you remove to make it fit a given budget of observations?
  * Will be obvious (no need of computing)

* ## Which principles are clearly violated in this design?
 * From orthogonality, minimal overlap, and level and utility balance...
  * Will be obvious (no need of computing)

* ## How many alternatives would be generated given a description? What kind of problems does it produce? How is it addressed when modelling?

* ## Key differences between designs for linear regression and choice modelling? 
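
For the counting-type questions in this list, a few lines of arithmetic are usually enough. The sketch below uses a made-up design (the attribute names and numbers of levels are purely illustrative) and counts only attribute parameters, ignoring alternative-specific constants or interactions that a real specification might add.

```python
from math import prod

# Hypothetical design: attribute name -> number of levels (illustrative only).
attribute_levels = {'price': 3, 'brand': 2, 'battery': 4}

# Full factorial: every combination of levels is one profile.
n_profiles = prod(attribute_levels.values())

# Dummy coding needs (levels - 1) parameters per attribute;
# linear coding needs a single parameter per attribute.
n_params_dummy = sum(k - 1 for k in attribute_levels.values())
n_params_linear = len(attribute_levels)

print(n_profiles, n_params_dummy, n_params_linear)   # 24 6 3
```

The number of parameters gives a lower bound on what the design has to support; how many observations are actually advisable depends on the design and is beyond this sketch.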

---
---

# Examples

We will show several examples of 'fine tuning' of the ML models + sanity checking + comparisons.

We will reuse the affairs dataset, a reminder of the variables.

# Description of the dataset

A scientific paper making use of the dataset to develop a 'Theory of Extramarital Affairs' can be found [here](https://www.uibk.ac.at/econometrics/data/fair78.pdf).

The dataset has 601 observations of the following variables:

* **affairs:** The answer to the survey, the answers are encoded with numbers.
  0 = None, 1 = Once, 2 = Twice, 3=Three times, 7= 4 to 10 times, 12= monthly, 12 = weekly, 12 = daily.
  As we see, the information was encoded in such a way that we lose information about the more frequent answers, and the encoded numbers do not completely coincide with the frequencies. However, there is an ordinal relationship among the possible answers, from less frequent to more frequent.

*  **gender:** Categorical variable indicating either male or female among the participants.

* **age:** Numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over. 

* **yearsmarried:** Numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.

* **children:** Categorical variable indicating if there are children in the marriage.

* **religiousness:** Categorical variable indicating how religious the person is, encoded as numbers: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

* **education:** Categorical variable indicating the level of education. Encoded as numbers: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.

* **occupation:** Categorical variable classifying the profession of the individual. Encoded as numbers, and the meaning of the numbers has been somewhat lost in time. But it could be something like the one in this [link.](https://dictionary.fitbir.nih.gov/portal/publicData/dataElementAction!view.action?dataElementName=HollingsheadJobClassCat&publicArea=true)

* **rating:** Categorical variable indicating how happy they are with the marriage. Encoded as numbers: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.
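
Since the codes of `affairs` are ordinal but unevenly spaced, it can be handy to have a mapping from each code to its position in the ordering. The sketch below shows one way to do it on a toy series; the names `code_to_rank` and `affairs_rank` are hypothetical and not part of the course code.

```python
import pandas as pd

# Codes used in the 'affairs' column, from least to most frequent.
affairs_codes = [0, 1, 2, 3, 7, 12]
code_to_rank = {code: rank for rank, code in enumerate(affairs_codes)}

# Toy values standing in for the real column.
toy = pd.Series([0, 7, 12, 1, 0, 3], name='affairs')
affairs_rank = toy.map(code_to_rank)
print(affairs_rank.tolist())   # [0, 4, 5, 1, 0, 3]
```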



# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [1]:
!pip install biogeme

Collecting biogeme
  Downloading biogeme-3.2.8.tar.gz (1.5 MB)
     |████████████████████████████████| 1.5 MB 5.1 MB/s 
Collecting unidecode
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
     |████████████████████████████████| 235 kB 46.8 MB/s 
Building wheels for collected packages: biogeme
  Building wheel for biogeme (setup.py) ... done
  Created wheel for biogeme: filename=biogeme-3.2.8-cp37-cp37m-linux_x86_64.whl size=4030751 sha256=51c9db805216067fc5d0dd4cec2fb2dd6bb45e22181cb259e65226344ede1db6
  Stored in directory: /root/.cache/pip/wheels/d4/52/61/de6c73d2bc17603c60e754e260bccb7d4da2503e97015ebd49
Successfully built biogeme
Installing collected packages: unidecode, biogeme
Successfully installed biogeme-3.2.8 unidecode-1.3.2


Load the packages, feel free to change the names.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools
# Needed by the ordered logit helper below, which calls dist.logisticcdf
import biogeme.distributions as dist

---
---

# Auxiliary functions

The first function takes the dictionary of utilities, a pandas dataframe, and the name of the variable that contains the observed choices. It returns the biogeme object with the model and the estimated 'results' object (the one from which we get the estimated values, likelihoods, etc.).
We have added the dictionary with the utilities to the biogeme object, in case we use it later.

In [3]:
def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
 av_auto = V.copy()
 for key, value in av_auto.items():
   av_auto[key] = 1
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V.copy()
 return bgm_model, bgm_model.estimate()

The next function calculates the predictions for a given biogeme object that was estimated with `qbus_estimate_bgm`. The output is the array with the choice probabilities, which can be used to calculate accuracies, confusion matrices and the output of what-if scenarios.

In [4]:
def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
  av_auto = None
  if hasattr(qbus_bgm_model, 'ord_probs'):
    av_auto = qbus_bgm_model.ord_probs.copy()
  else:
    av_auto = qbus_bgm_model.utility_dic.copy()

  for key, value in av_auto.items():
    av_auto[key] = 1

  targets = qbus_bgm_model.utility_dic.copy()
  for key, value in targets.items():
    if hasattr(qbus_bgm_model, 'nest_tuple'):
      targets[key] = models.nested(qbus_bgm_model.utility_dic, av_auto, qbus_bgm_model.nest_tuple, key)
    else:
      if hasattr(qbus_bgm_model, 'ord_probs'):
       targets[key] = qbus_bgm_model.ord_probs
      else:
       targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

  bgm_db = db.Database('simul', pred_pd_df)
  globals().update(bgm_db.variables)
  bgm_pred_model = bio.BIOGEME(bgm_db, targets)
  simulatedValues = bgm_pred_model.simulate(betas)
  return simulatedValues

The function `qbus_calc_accu_confusion` calculates the accuracy and the confusion matrix given the predicted choice probabilities, a pandas dataset and the name of the variable that contains the actual choices in that dataset.

In [5]:
def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix 
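
To see what happens inside `qbus_calc_accu_confusion`, here are the same three steps (`idxmax`, mean of matches, `crosstab`) on a toy table of probabilities. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy choice probabilities: one row per individual, one column per alternative id.
probs = pd.DataFrame({0: [0.7, 0.2, 0.5, 0.1],
                      1: [0.2, 0.5, 0.3, 0.3],
                      2: [0.1, 0.3, 0.2, 0.6]})
actual = pd.Series([0, 1, 1, 2], name='choice')

predicted = probs.idxmax(axis=1)          # alternative with the largest probability
accuracy = np.mean(predicted == actual)   # share of correct predictions
confusion = pd.crosstab(actual, predicted,
                        rownames=['Actual'], colnames=['Predicted'])
print(accuracy)    # 0.75
print(confusion)
```

The third individual chose alternative 1 but alternative 0 has the largest probability, so it shows up as the single off-diagonal count.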

The next function calculates the likelihood ratio test with a bit less code than the default biogeme function. The arguments are the results objects of the two models to be compared. The first is the more complex model and the second is the reference model (**the order is important!**). The third argument is the significance level for the test.

In [6]:
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)
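
For reference, the test itself is simple enough to compute by hand. The sketch below uses the standard statistic, minus two times the difference in log-likelihoods, compared against a chi-squared quantile with as many degrees of freedom as extra parameters. The function name `likelihood_ratio_by_hand` and the log-likelihood values are made up; this is not the biogeme implementation.

```python
from scipy.stats import chi2

def likelihood_ratio_by_hand(loglike_complex, nparam_complex,
                             loglike_reference, nparam_reference,
                             signif_level=0.05):
    """Return the LR statistic, the chi-squared threshold and the decision."""
    stat = -2.0 * (loglike_reference - loglike_complex)
    dof = nparam_complex - nparam_reference
    threshold = chi2.ppf(1.0 - signif_level, dof)
    # True means: reject the reference model in favour of the complex one.
    return stat, threshold, stat > threshold

# Illustrative numbers only: complex model first, reference model second.
stat, threshold, reject = likelihood_ratio_by_hand(-480.0, 6, -490.0, 4)
print(stat, round(threshold, 3), reject)
```

As with the helper above, swapping the two models changes the sign of the statistic, which is why the order of the arguments matters.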

The next function just updates the globals so that the variables of the dataset can be used by name when writing the utility expressions.

In [7]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

The next function calculates the nested logit version. Similar to the multinomial logit

In [8]:
def qbus_estimate_nested_bgm(V, pd_df, nests,  tgtvar_name, modelname='bgmdef'):
 av_auto = V.copy()
 for key, value in av_auto.items():
   av_auto[key] = 1
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)
 logprobnest = models.lognested (V, av_auto , nests , bgm_db.variables[tgtvar_name] )
 #logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprobnest )
 bgm_model.utility_dic = V.copy()
 bgm_model.nest_tuple = nests
 return bgm_model, bgm_model.estimate()

The auxiliary function for the ordered logit. The use is slightly different from the basic multinomial logit!
* The `V` argument is just the expression of a utility function, not the dictionary mapping alternative ids to the utility functions.
* The argument `ord_alt_ids` is a list with the ids of the alternatives **in the order that we want to impose**. This is the parameter to pay attention to.

The rest of the arguments are used as usual: `pd_df` the pandas dataframe, `tgtvar_name` the name of the variable with the choices, and an optional `modelname`.

In [9]:
def qbus_estimate_ordered_bgm(V, ord_alt_ids, pd_df, tgtvar_name, modelname='ord_bgm'):
 bgm_db = db.Database(modelname + '_db', pd_df)
 globals().update(bgm_db.variables)

 taus_map = {ord_alt_ids[0]: exp.Beta('tau1', -1, None, None, 0) }
 i = 1
 for id in ord_alt_ids[1:-1]:
  taus_map[id] = taus_map[ ord_alt_ids[i-1] ] + exp.Beta('delta_'+ str(i + 1), i, 0, None, 0)
  i = i + 1

 # Use the utility passed as the argument `V` (the original relied on a global `V_ord`)
 alt_probs_map = {ord_alt_ids[0]: dist.logisticcdf( taus_map[ord_alt_ids[0] ] - V) }

 i = 1
 for id in ord_alt_ids[1:-1]:
  alt_probs_map[id] = dist.logisticcdf( taus_map[id] - V) - dist.logisticcdf( taus_map[ ord_alt_ids[i-1] ] - V)
  i = i + 1

 alt_probs_map[ord_alt_ids[i] ] = 1 - dist.logisticcdf( taus_map[ord_alt_ids[i-1]] - V)

 logprob = exp.log(exp.Elem(alt_probs_map, bgm_db.variables[tgtvar_name]))

 #logprob = models.loglogit (V , av_auto , bgm_db.variables[tgtvar_name] )
 bgm_model = bio.BIOGEME ( bgm_db, logprob )
 bgm_model.utility_dic = V
 bgm_model.ord_probs = alt_probs_map.copy()
 return bgm_model, bgm_model.estimate()
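
The probabilities built in the ordered logit helper follow the usual pattern: the logistic CDF evaluated at each threshold minus the utility, and then differences between consecutive thresholds. A plain numpy sketch of the same idea, with made-up values for the utility and the thresholds (the function names are hypothetical):

```python
import numpy as np

def logistic_cdf(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordered_logit_probs(v, taus):
    """Probability of each ordered alternative for utility v and increasing thresholds taus."""
    cdf = logistic_cdf(np.asarray(taus, dtype=float) - v)
    # Pad with 0 and 1 so that consecutive differences cover every category.
    padded = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(padded)

# Three thresholds define four ordered alternatives.
probs = ordered_logit_probs(v=0.3, taus=[-1.0, 0.5, 2.0])
print(probs, probs.sum())
```

With K alternatives there are K - 1 thresholds, which matches the helper above, where the last alternative gets the remaining probability mass.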

# Load dataset

In [10]:
pd_dset = pd.read_csv('https://github.com/pmontman/pub-choicemodels/raw/main/data/affairs.csv')

Let's take a look at the dataset as a reminder of the variables involved.

In [11]:
pd_dset

Unnamed: 0,affairs,gender,age,yearsmarried,children,religiousness,education,occupation,rating
0,0,male,37.0,10.00,no,3,18,7,4
1,0,female,27.0,4.00,no,4,14,6,4
2,0,female,32.0,15.00,yes,1,12,1,4
3,0,male,57.0,15.00,yes,5,18,6,5
4,0,male,22.0,0.75,no,2,17,6,3
...,...,...,...,...,...,...,...,...,...
596,1,male,22.0,1.50,yes,1,12,2,5
597,7,female,32.0,10.00,yes,2,18,5,4
598,2,male,32.0,10.00,yes,2,17,6,5
599,2,male,22.0,7.00,yes,3,18,6,2


Transform categorical to numerical variables.

In [12]:
pd_dset['gender'] = pd_dset['gender'].factorize()[0]
pd_dset['children'] = pd_dset['children'].factorize()[0]
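
As a reminder of what `factorize` does: it assigns integer codes in order of first appearance and also returns the list of original labels. A tiny example on a made-up series:

```python
import pandas as pd

gender = pd.Series(['male', 'female', 'female', 'male'])
codes, labels = gender.factorize()
print(codes.tolist())   # [0, 1, 1, 0]
print(list(labels))     # ['male', 'female']
```

Because the codes depend on the order of the rows, it is worth checking which label received which number; judging from the first row shown above (male, no children), `male` and `no` would both be coded as 0 here.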


# Examples start here!

**IMPORTANT:** These are examples of potential models and comparisons, for illustration, they are not strict prescriptions on how to do model tuning and comparison.


The first step is to create a train/test split. *In this case the dataset is not too big and the size of the test set is 151, so conclusions about small differences in percentages are not very reliable. We will ignore that for the sake of the exercise and assume that the test set is large enough to get reliable comparisons.*

In [13]:
from sklearn.model_selection import train_test_split
dset_train, dset_test = train_test_split(pd_dset, test_size = 0.25, random_state = 3840)

To save space, we will create a helper function that fits the ML model, calculates the predictions and returns the accuracy and confusion matrix in the test set. *You do not need to do that, this is to avoid copy pasting the estimation and prediction code*.

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
ml_model = None

This helper function takes as first argument `model_type`: either `dec_tree` or `neur_net`.
The next argument is `hyperparam_val`, which will be the `max_depth` of the tree if we are using a decision tree, or the `hidden_layer_sizes` if we are using a neural net.
Then comes a list of variables to remove from the datasets, so they are not used as explanatory variables.
The two remaining arguments are the train set and the test set.

In [15]:
def get_ml_model_performance(model_type, hyperparam_val, variables_remove, train_set, test_set):
 # Build the requested model: a decision tree, or (for any other value) a neural network
 if (model_type == 'dec_tree'):
  print('Fitting a Decision Tree with max_depth: ' + str(hyperparam_val))
  ml_model = DecisionTreeClassifier(max_features= None, max_depth=hyperparam_val, random_state=3840 )

 else:
  print('Fitting a Neural Network  with hidden_layer_sizes: ' + str(hyperparam_val))
  ml_model = MLPClassifier(hidden_layer_sizes = (hyperparam_val,),
                        activation='logistic', solver='lbfgs', max_iter=3500, n_iter_no_change=1000,batch_size=32, tol=0.000000001, random_state=3840, verbose=True, max_fun=15000)
  
 # Train the model on the training data; the target is one-hot encoded (one column per class)
 ml_model.fit(train_set.drop(variables_remove, axis=1), pd.get_dummies(train_set['affairs']))

 # Predict the one-hot indicators on the test set and label the columns with the class codes
 predictions = ml_model.predict(test_set.drop(variables_remove, axis=1))
 ml_model_sim = pd.DataFrame(predictions, columns=[0, 1, 2, 3, 7, 12])

 ml_model_sim.index = test_set.index
 
 #pd.crosstab(index=test_set['affairs'], columns='count') / test_set.shape[0]
 return qbus_calc_accu_confusion(ml_model_sim, test_set, 'affairs')

The first example, we forget to remove the choice variable in the decision tree.
We see that the model gets perfect predictions on the test set, this should be a good hint of problems in the variables.

In [16]:
variables_remove = []
hyp_max_depth = 6
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 6


(1.0, Predicted   0   1   2   3   7   12
 Actual                            
 0          101   0   0   0   0   0
 1            0  14   0   0   0   0
 2            0   0   4   0   0   0
 3            0   0   0   8   0   0
 7            0   0   0   0  13   0
 12           0   0   0   0   0  11)

If we use a very simple decision tree (maximum depth of 2), we might not detect the error! So a good idea is to try a complex model just to check for these types of problems. The range of the max_depth in decision trees usually goes between 2 and 14, though more extreme values have worked in some datasets.

In [17]:
variables_remove = []
hyp_max_depth = 2
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 2


(0.7417218543046358, Predicted   0   12
 Actual            
 0          101   0
 1           14   0
 2            4   0
 3            8   0
 7           13   0
 12           0  11)

We now remove the affairs variable from the explanatory variables, and we get a more sensible accuracy/confusion matrix. At max_depth 4 we see that predictions are all of class 0: 'no affairs'.

In [18]:
variables_remove = ['affairs']
hyp_max_depth = 4
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 4


(0.6688741721854304, Predicted    0
 Actual        
 0          101
 1           14
 2            4
 3            8
 7           13
 12          11)
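
A model that always predicts class 0 is the 'naive' baseline that later results are compared against. Its accuracy can be computed directly from the class counts of the test set, which are read off the confusion matrix above:

```python
import pandas as pd

# Class counts of the test set, read off the confusion matrices in this notebook.
test_counts = pd.Series({0: 101, 1: 14, 2: 4, 3: 8, 7: 13, 12: 11})

majority_class = test_counts.idxmax()
naive_accuracy = test_counts.max() / test_counts.sum()
print(majority_class, round(naive_accuracy, 4))   # 0 0.6689
```

Any candidate model with an accuracy at or below this value needs some other argument in its favor, such as a more useful confusion matrix.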

A more complex tree, max_depth 6, predicts other classes, but incorrectly.

In [19]:
variables_remove = ['affairs']
hyp_max_depth = 6
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 6


(0.6622516556291391, Predicted    0  7
 Actual           
 0          100  1
 1           14  0
 2            3  1
 3            8  0
 7           13  0
 12          11  0)

If we increase the complexity of the tree, we get worse accuracy, but the prediction errors are less serious! For example, there are 9 confusions between class 0 (no affairs) and class 12 (monthly or more often); in the previous model we had 11.

In [20]:
variables_remove = ['affairs']
hyp_max_depth = 9
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 9


(0.6423841059602649, Predicted  0   1   2   3   7   12
 Actual                           
 0          93   2   1   1   3   1
 1          13   1   0   0   0   0
 2           3   0   0   0   1   0
 3           7   0   0   0   0   1
 7          10   0   0   0   1   2
 12          8   0   0   0   1   2)
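
Counts of 'extreme' confusions like the ones mentioned for these two trees can be read off the matrices by eye, or computed. The sketch below copies the two confusion matrices from the outputs above and counts the confusions between class 0 and class 12 in both directions; the helper name `extreme_confusions` is hypothetical.

```python
import pandas as pd

def extreme_confusions(confusion, low, high):
    """Count cases where `low` and `high` are confused, in either direction."""
    # Classes that were never predicted have no column, so fill them with 0.
    cm = confusion.reindex(index=[low, high], columns=[low, high], fill_value=0)
    return int(cm.loc[low, high] + cm.loc[high, low])

classes = [0, 1, 2, 3, 7, 12]
# Rows: actual class, columns: predicted class.
cm_depth6 = pd.DataFrame([[100, 1], [14, 0], [3, 1], [8, 0], [13, 0], [11, 0]],
                         index=classes, columns=[0, 7])
cm_depth9 = pd.DataFrame([[93, 2, 1, 1, 3, 1], [13, 1, 0, 0, 0, 0],
                          [3, 0, 0, 0, 1, 0], [7, 0, 0, 0, 0, 1],
                          [10, 0, 0, 0, 1, 2], [8, 0, 0, 0, 1, 2]],
                         index=classes, columns=classes)

print(extreme_confusions(cm_depth6, 0, 12))   # 11
print(extreme_confusions(cm_depth9, 0, 12))   # 9
```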

We play around with removing some variables, for example occupation was encoded in a 'weird' way, not well documented. We might as well check what happens if we remove it. We could not get a good model for a range of max_depth values.

In [21]:
variables_remove = ['affairs', 'occupation']
hyp_max_depth = 4
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 4


(0.6688741721854304, Predicted    0
 Actual        
 0          101
 1           14
 2            4
 3            8
 7           13
 12          11)

The 'rating' variable (the self-rating of how happy they are with the marriage) reduces the applicability of the model: we need to know its value to make predictions, so we need to ask the individuals, and then we might as well ask them about the infidelity. We can try removing it.

Playing around a bit with the depth of the tree, we see that we get relatively similar performance when comparing to the model that includes both rating and occupation.

In [22]:
variables_remove = ['affairs', 'occupation', 'rating']
hyp_max_depth = 9
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 9


(0.6490066225165563, Predicted  0   1   3   7   12
 Actual                       
 0          95   1   1   3   1
 1          12   1   0   0   1
 2           4   0   0   0   0
 3           7   0   1   0   0
 7          12   0   0   0   1
 12         10   0   0   0   1)

We also try removing education, because the encoding to integers might be a bit 'fuzzy'. We get better accuracy here, and there are fewer extreme errors. A good candidate for the final model?

In [23]:
variables_remove = ['affairs', 'occupation', 'education', 'rating']
hyp_max_depth = 5
get_ml_model_performance('dec_tree', hyp_max_depth, variables_remove, dset_train, dset_test)

Fitting a Decision Tree with max_depth: 5


(0.6754966887417219, Predicted   0   1   12
 Actual                
 0          100   1   0
 1           14   0   0
 2            4   0   0
 3            8   0   0
 7           13   0   0
 12           9   0   2)

With a **neural network** we can also get 'too good' results if we forget to remove some key variables. Also: please ignore the warnings about convergence.

In [24]:
variables_remove = []
hyp_network_size = 32
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 32


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.9933774834437086, Predicted   0   1   2   3   7   12
 Actual                            
 0          101   0   0   0   0   0
 1            0  14   0   0   0   0
 2            0   0   4   0   0   0
 3            0   0   0   8   0   0
 7            0   0   0   0  12   1
 12           0   0   0   0   0  11)

Removing the variable gets more reasonable results.

In [31]:
variables_remove = ['affairs']
hyp_network_size = 32
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 32


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6490066225165563, Predicted  0   1   3   7   12
 Actual                       
 0          96   1   1   2   1
 1          14   0   0   0   0
 2           4   0   0   0   0
 3           8   0   0   0   0
 7          11   1   0   1   0
 12         10   0   0   0   1)

Playing around with the size of the network (the hyperparameter) changes the predictions; in this case, a small network is better (though the accuracy is still lower than that of a naive model).

In [33]:
variables_remove = ['affairs']
hyp_network_size = 6
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 6


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6622516556291391, Predicted  0   7   12
 Actual               
 0          97   2   2
 1          14   0   0
 2           3   0   1
 3           7   0   1
 7          12   1   0
 12          9   0   2)

The range for this hyperparameter is wider than in the decision trees: we can go from 1 (very rare, it degenerates into a linear model) to as many units as observations (which can produce perfect fits). We can try a few values between 8 and 256 (be mindful that it does not get too slow), from smaller to larger. In this case, a large neural network does not produce better results.

In [34]:
variables_remove = ['affairs']
hyp_network_size = 128
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 128


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6158940397350994, Predicted  0   1   2   3   7   12
 Actual                           
 0          91   3   1   0   4   2
 1          13   0   0   0   1   0
 2           4   0   0   0   0   0
 3           5   1   1   0   1   0
 7          11   0   0   0   1   1
 12          8   0   0   1   1   1)
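
Trying a few hidden layer sizes from smaller to larger can also be written as a loop. The sketch below uses synthetic data so that it runs on its own, and it standardizes the features first, which is what the convergence warning suggests; the scaling step is an addition for illustration and is not part of the helper used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the affairs dataset).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

size_scores = {}
for size in [8, 32, 128]:
    # One hidden layer with `size` units, written as a 1-element tuple.
    net = MLPClassifier(hidden_layer_sizes=(size,), activation='logistic',
                        solver='lbfgs', max_iter=500, random_state=0)
    net.fit(X_train_s, y_train)
    size_scores[size] = net.score(X_test_s, y_test)

print(size_scores)
```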

We can check removing variables for the same reasons as in the decision trees.
In this case the results are close to the naive prediction: same accuracy but a strictly better confusion matrix.

In [36]:
variables_remove = ['affairs', 'occupation']
hyp_network_size = 10
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 10


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6688741721854304, Predicted   0   12
 Actual            
 0          101   0
 1           14   0
 2            4   0
 3            8   0
 7           12   1
 12          11   0)

In [37]:
variables_remove = ['affairs', 'occupation']
hyp_network_size = 32
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 32


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6556291390728477, Predicted  0   1   2   7   12
 Actual                       
 0          97   0   3   1   0
 1          12   1   0   1   0
 2           4   0   0   0   0
 3           7   0   0   1   0
 7          10   1   1   0   1
 12          8   1   1   0   1)

In [29]:
variables_remove = ['affairs', 'occupation']
hyp_network_size = 128
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 128


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6556291390728477, Predicted  0   1   2   3   7   12
 Actual                           
 0          93   2   0   3   2   1
 1           9   2   0   0   2   1
 2           3   0   1   0   0   0
 3           5   0   0   1   2   0
 7          13   0   0   0   0   0
 12          7   0   1   0   1   2)

Let's try removing only rating. The result is directly superior to a 'naive' method that predicts the most popular class: it has the same accuracy, but the model predicts more classes (it is more useful) and the confusions are less extreme. This is subjective; in my opinion an error of *the model predicting 'no affairs' when the true value is 7 affairs* is more serious than *the model predicting 2 affairs when the true value is 7 affairs*.
So here we get a reasonable model that uses fewer variables (compared to some of the previous models in this exercise).

In [30]:
variables_remove = ['affairs', 'rating']
hyp_network_size = 18
get_ml_model_performance('neur_net', hyp_network_size, variables_remove, dset_train, dset_test)

Fitting a Neural Network  with hidden_layer_sizes: 18


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


(0.6688741721854304, Predicted   0  1  2
 Actual             
 0          99  1  1
 1          13  1  0
 2           3  0  1
 3           8  0  0
 7          11  1  1
 12         11  0  0)