### Follow-up on comments during project presentation

This document addresses some of the issues and comments raised during the project presentation. 
It is written as a Jupyter notebook because several of the issues concern the exact calculations and the models' performance.

The code cells below read the data and import some code needed to investigate the issues in more depth. The code is run with the same RNG seed as `analysis.ipynb`, so the results are comparable to the ones shown in the slides.

In [1]:
import joblib
import random as rnd
import inspect

In [2]:
%run -i src/data_preproc.py

Reading file: orbis_active_be.xlsx ...
Reading file: orbis_active_de.xlsx ...
Reading file: orbis_active_dk.xlsx ...
Reading file: orbis_active_es.xlsx ...
Reading file: orbis_active_fin.xlsx ...
Reading file: orbis_active_fra.xlsx ...
Reading file: orbis_active_it.xlsx ...
Reading file: orbis_active_no.xlsx ...
Reading file: orbis_active_rest.xlsx ...
Reading file: orbis_active_se.xlsx ...
Reading file: orbis_default.xlsx ...


In [3]:
split_share = 0.8

rnd.seed(1)
train_id = rnd.sample(range(df.shape[0]), round(df.shape[0] * split_share))

train_df = df.drop(['country', 'last_year', 'sector'], axis=1).loc[train_id]
test_df  = df.drop(['country', 'last_year', 'sector'], axis=1).loc[~np.isin(list(range(df.shape[0])), train_id)]

for variable in train_df.loc[:, train_df.apply(lambda x: any(x.isna()))].columns:
    train_df.loc[train_df[variable].isna(), variable] = train_df[variable].median()
    test_df.loc[test_df[variable].isna(), variable] = test_df[variable].median()

I assume the problem is common to all models, so for simplicity I analyze only the XGBoost model trained on undersampled data.

In [4]:
xgb_us = joblib.load('models/xgb_us.sav')

In [5]:
%run -i "src/pred_metrics_class.py"

#### Low F1 score of the models

As the table below shows, the F1 scores of the models appear to be very low. 


![title](latex/img/balance_acc_table.png)

In the code I used the following formula:

$$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$$

Which gives the same result as the sklearn implementation:

In [6]:
metrics_xgb_us = PredMetrics(pred_pd = xgb_us.predict_proba(np.array(test_df.loc[:,test_df.columns != 'Inactive']))[:,1],
                             actual = np.array(test_df.Inactive))

metrics_xgb_us.f1_score(0.5) == metrics_xgb_us.f1_score_sk(0.5, 'binary')

True

I wrapped the sklearn function in the class to keep the code simple. You can check it in `pred_metrics_class.py`.

Since the $F_1$ score is a function of recall and precision, we can check both of these values:

In [7]:
{'recall: ': metrics_xgb_us.tpr(0.5),
 'precision: ': metrics_xgb_us.ppv(0.5)}

{'recall: ': 0.8071748878923767, 'precision: ': 0.07100124909604891}
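As a quick consistency check, the two values above can be plugged into the harmonic-mean form of $F_1$, which is algebraically the same as the count-based formula used in the code. This is only a minimal sketch: the function names are illustrative and are not part of `pred_metrics_class.py`, and the counts in the last lines are toy numbers.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts: TP / (TP + (FP + FN) / 2)."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recall and precision reported above for xgb_us at the 0.5 threshold.
recall = 0.8071748878923767
precision = 0.07100124909604891

f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))  # 0.1305

# Toy counts (hypothetical) showing that both forms agree.
tp, fp, fn = 8, 2, 4
print(abs(f1_from_counts(tp, fp, fn)
          - f1_from_precision_recall(tp / (tp + fp), tp / (tp + fn))) < 1e-12)  # True
```

Because the harmonic mean is dominated by the smaller of the two values, a precision of about 0.07 keeps $F_1$ low even with a recall above 0.8.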

Clearly, the model has a problem with precision: when it predicts a default, the chance that the company actually defaults is low. 
Before checking this further, it is worth noting that this combination of low precision and high recall is preferable to the opposite one. Low precision means the model is often wrong when it predicts a default, so a loan is often refused unnecessarily. High recall means the model identifies a high share of the actual defaults, so few loans are granted to companies that later default. Refusing a loan to a sound company is usually less painful than lending to a future defaulter.

Additionally, although some sources (e.g. [this](https://deepai.org/machine-learning-glossary-and-terms/f-score) article, which appears on the first page of Google results for "F1 score") claim that the F1 score is not sensitive to class imbalance, it is. And this matters in my application.

Let's consider random predictions on balanced and imbalanced data:

In [78]:
random_metric_balance = PredMetrics(pred_pd = np.random.choice([0,1], 10000, p = [0.5, 0.5]),
                                    actual = np.random.choice([0,1], 10000, p = [0.5, 0.5]))

random_metric_imbalance = PredMetrics(pred_pd = np.random.choice([0,1], 10000, p = [0.5, 0.5]),
                                      actual = np.random.choice([0,1], 10000, p = [0.9, 0.1]))

{'balanced :': round(random_metric_balance.f1_score(0.5), 3),
 'imbalanced :': round(random_metric_imbalance.f1_score(0.5), 3)}

{'balanced :': 0.499, 'imbalanced :': 0.168}

Indeed, although both 'models' have the same predictive ability, their F1 scores differ. The F1 score also varies when the predictions themselves are imbalanced (i.e. when the models were trained on a rebalanced dataset).
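The simulated values can be checked against a simple approximation. If predictions are independent of the labels, precision is roughly the share of positives in the data and recall is roughly the share of positive predictions, so the F1 score is roughly the harmonic mean of those two shares. The sketch below encodes this; the function name is illustrative, and the formula is a large-sample approximation rather than an exact expectation.

```python
def expected_random_f1(prevalence: float, positive_rate: float) -> float:
    """Approximate F1 of a classifier that predicts 1 at random.

    prevalence    -- share of actual positives in the data
    positive_rate -- probability with which the classifier predicts 1

    For label-independent predictions: precision ~ prevalence and
    recall ~ positive_rate, hence F1 ~ harmonic mean of the two.
    """
    if prevalence + positive_rate == 0:
        return 0.0
    return 2 * prevalence * positive_rate / (prevalence + positive_rate)


# Same settings as the simulation above: coin-flip predictions,
# 50% vs 10% share of positives among the actual labels.
print(round(expected_random_f1(0.5, 0.5), 3))  # 0.5
print(round(expected_random_f1(0.1, 0.5), 3))  # 0.167
```

Both values agree with the simulated 0.499 and 0.168 up to sampling noise, which shows that the drop in F1 comes purely from the share of positives and not from any difference in predictive ability.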

Given this property, we can compare the actual models with benchmark 'guessing' models that have the same class imbalance in the test dataset and in the predicted values:

In [101]:
random_metrics = PredMetrics(pred_pd = np.random.choice([0,1], 
                                                        test_df.Inactive.shape[0], 
                                                        # p[0] = P(predict 0), p[1] = P(predict 1): match the shares predicted by xgb_us
                                                        p = [sum(metrics_xgb_us.df_compare.default_prob < 0.5) / metrics_xgb_us.df_compare.shape[0], 
                                                             sum(metrics_xgb_us.df_compare.default_prob >= 0.5) / metrics_xgb_us.df_compare.shape[0]]),
                             actual = np.array(test_df.Inactive))

{'random prediction :': random_metrics.f1_score(0.5),
 'xgb_us prediction :': metrics_xgb_us.f1_score(0.5)}

{'random prediction :': 0.04769569847205369,
 'xgb_us prediction :': 0.1305214816605233}

In this context, the model's F1 score turns out to be good for the level of imbalance in the test dataset: it is almost three times that of the random benchmark. 

### Corrected logistic regression with $L_1$ and $L_2$ regularization (elastic net)

I have corrected the formula from the slides to the following (as per Hastie et al., ESL):

$$\min_{\beta} \; \alpha \lVert\beta\rVert_2^2 + (1-\alpha)\lVert\beta\rVert_1 - C \sum_{i = 1}^n \left[ y_i \beta' X_i - \ln\left(1 + e^{\beta' X_i}\right) \right]$$
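To make the objective concrete, here is a minimal sketch that evaluates it for a given $\beta$, reading it as the elastic-net penalty minus $C$ times the Bernoulli log-likelihood. It assumes labels $y_i \in \{0, 1\}$ and no intercept. The function name is hypothetical, and this is not the solver used for the models in the project.

```python
import math


def elastic_net_logit_objective(beta, X, y, alpha, C):
    """Penalty minus C times the log-likelihood, for labels y_i in {0, 1}."""
    # Penalty: alpha * squared L2 norm + (1 - alpha) * L1 norm.
    l2 = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    penalty = alpha * l2 + (1 - alpha) * l1

    # Log-likelihood: sum_i [ y_i * beta'x_i - ln(1 + exp(beta'x_i)) ].
    log_lik = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))
        # ln(1 + exp(eta)) written in a numerically stable form.
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        log_lik += y_i * eta - softplus

    return penalty - C * log_lik


# Toy data (hypothetical). With beta = 0 the penalty vanishes and every
# observation contributes ln(2), so the objective equals C * n * ln(2).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
val = elastic_net_logit_objective([0.0, 0.0], X, y, alpha=0.5, C=1.0)
print(round(val, 4))  # 1.3863
```

In this parameterization a larger $C$ puts more weight on the fit relative to the penalty, while $\alpha$ shifts the penalty between the $L_2$ term ($\alpha = 1$) and the $L_1$ term ($\alpha = 0$).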

#### Multicollinearity

To assess the impact of multicollinearity on the models, I removed highly correlated variables (those with $|R^2| > 0.5$). 
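One common way to do such a reduction is a greedy filter on pairwise correlations, sketched below. This is an illustration under assumptions, not necessarily the procedure used in the project branch: it applies the threshold to the absolute Pearson correlation $|r|$, it keeps the first variable of each correlated group, and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.5):
    """Keep a column only if |r| <= threshold with every column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (not the Orbis variables): "b" is almost a multiple of "a".
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
print(drop_correlated(toy))  # ['a', 'c']
```

Note that the result depends on the column order, since the filter always keeps the earlier variable of a correlated pair.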

The table below shows the out-of-sample performance of the models with a 50% threshold (the same as in the original project):

![title](latex/img/balance_acc_table_no_multicor.png)

And below is the performance table for the models with multicollinearity:

![title](latex/img/balance_acc_table.png)

The difference in the overall performance of the models (balanced accuracy) is rather small; the most significant impact is on the structure of the errors. The more complex ML models (xgb and mlpc) were better at correctly predicting positives and slightly worse at predicting negatives. Apart from the positive impact on overall performance, this trade-off is also more favorable.

Given the slight but positive impact on performance and the lower complexity (because the models are sparser), reducing the number of variables would indeed be a good choice.

If you want to check it further, the complete analysis from `analysis.ipynb` without multicollinearity is in a separate branch of the GitHub project (https://github.com/m-dadej/PD_estimation/tree/multicollinearity-impact).