# Project 1 - Credit risk for mortgages

## 2.1.1

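We want a model that decides whether or not to grant a loan to an individual based on relevant data, with the goal of maximizing profit. Writing $m$ for the loan amount, $r$ for the interest rate per period and $n$ for the duration of the loan in periods, our utility function is defined as: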

| Utility | Grant loan ($a_1$) | Don't grant loan ($a_2$) |
|----------------|-------------|------------------|
| Loan repaid ($\omega_1$)    | $m((1+r)^n-1)$       | 0                |
| Loan defaulted ($\omega_2$) | $-m$                 | 0                |

Thus the expected utility of granting a loan is our predicted probability that the person repays the loan times the return on the investment, minus our predicted probability that the person defaults times the amount lent. Not granting a loan always has utility zero.

$$E(U|\textrm{Grant loan}) = P(\omega_1)\cdot m((1+r)^n - 1) + P(\omega_2)\cdot (-m),$$
$$E(U|\textrm{Don't grant loan}) = 0$$

In [1]:
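def expected_utility(self, X):
    # Columns are assumed to follow scikit-learn's sorted class order:
    # column 0 = P(repaid), column 1 = P(defaulted).
    pr_1, pr_2 = self.predict_proba(X).T
    # Utility if repaid: the interest earned over the loan's duration.
    U_1 = X['amount'] * ((1 + self.interest_rate)**X['duration'] - 1)
    # Utility if defaulted: the full amount is lost.
    U_2 = -X['amount']
    return pr_1*U_1 + pr_2*U_2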

We grant a loan when:
$$E(U|\textrm{Grant loan}) > E(U|\textrm{Don't grant loan})$$
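
Because not granting always yields zero, the rule above can be rewritten as a threshold on the predicted repayment probability. A short derivation, using only $P(\omega_2) = 1 - P(\omega_1)$ and assuming $m > 0$:

$$P(\omega_1)\cdot m((1+r)^n - 1) - (1 - P(\omega_1))\cdot m > 0$$
$$P(\omega_1)\cdot(1+r)^n - 1 > 0$$
$$P(\omega_1) > (1+r)^{-n}$$

The threshold depends on the interest rate and the duration, but not on the loan amount.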

We considered several models for predicting these probabilities; the final choice is made in section 2.1.4.

## 2.1.2

We implemented `NameBanker.fit()` and `NameBanker.predict_proba()` by delegating to the underlying classifier's own `fit()` and `predict_proba()` methods.

In [2]:
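from sklearn.ensemble import RandomForestClassifier

def fit(self, X, y):
    self.classifier = RandomForestClassifier(n_estimators=self.n_estimators)
    self.classifier.fit(X, y)

def predict_proba(self, X):
    # X is a single applicant (a pandas Series); reshape it into a
    # one-row DataFrame, the 2-D input the classifier expects.
    return self.classifier.predict_proba(X.to_frame().transpose())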

For our `KNeighborsBanker` we standardized all the non-categorical features before fitting, and again before making each new prediction.
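
The important detail in standardizing is that the mean and standard deviation are learned from the training data and then reused unchanged at prediction time. The following is a minimal standard-library sketch of that idea, not our actual banker code: the `Standardizer` class, the `foreign` column and all the numbers are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


class Standardizer:
    """Hypothetical helper: z-scores selected columns using training statistics."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.stats = {}

    def fit(self, rows):
        # Learn mean and population standard deviation from the training rows only.
        for col in self.columns:
            values = [row[col] for row in rows]
            std = pstdev(values)
            # Guard against constant columns to avoid division by zero.
            self.stats[col] = (mean(values), std if std > 0 else 1.0)
        return self

    def transform(self, row):
        # Apply the stored training statistics; other columns pass through unchanged.
        out = dict(row)
        for col in self.columns:
            mu, sigma = self.stats[col]
            out[col] = (row[col] - mu) / sigma
        return out


# Made-up training rows: two numeric features and one categorical flag.
train = [
    {"amount": 1000, "duration": 12, "foreign": 1},
    {"amount": 3000, "duration": 24, "foreign": 0},
    {"amount": 5000, "duration": 36, "foreign": 1},
]
scaler = Standardizer(["amount", "duration"]).fit(train)
train_scaled = [scaler.transform(row) for row in train]

# A new applicant is scaled with the training mean and std, not its own.
new_scaled = scaler.transform({"amount": 3000, "duration": 48, "foreign": 0})
```

In practice scikit-learn's `StandardScaler` does the same job; the sketch only makes the fit-then-reuse pattern explicit.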

## 2.1.3

As noted earlier we grant a loan if
$$E(U|\textrm{Grant loan}) > E(U|\textrm{Don't grant loan})$$
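otherwise we don't grant the loan. Since $E(U|\textrm{Don't grant loan}) = 0$, this reduces to checking whether the expected utility of granting is positive.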

In [3]:
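def get_best_action(self, X):
    # X is a single applicant, so expected_utility(X) holds one value.
    exp_utility = self.expected_utility(X)[0]
    # Action 1 = grant the loan, action 2 = don't grant it.
    return 1 if exp_utility > 0 else 2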

## 2.1.4

For model selection we compared three classifiers, KNearestNeighbors, LogisticRegression and RandomForest, with varying hyperparameters, and also compared them against the RandomBanker (grants loans randomly) and the YesBanker (always grants loans). We kept the original interest rate of 0.5% per month: with the new rate of 5%, the YesBanker was one of the best-performing models and even the RandomBanker was profiting, and we felt it made little sense to optimize for an unrealistically high interest rate.
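
One way to see why the 5% rate makes even indiscriminate lending profitable: with the utility table from 2.1.1, granting has positive expected utility exactly when the predicted repayment probability exceeds $(1+r)^{-n}$. The sketch below evaluates that threshold for both rates. The function name `break_even_probability` and the example durations are ours, chosen only for illustration.

```python
def break_even_probability(interest_rate, duration):
    """Repayment probability above which granting has positive expected utility."""
    return (1 + interest_rate) ** (-duration)


# Compare the original monthly rate with the new one for a few example durations.
for rate in (0.005, 0.05):
    for months in (12, 24, 48):
        p = break_even_probability(rate, months)
        print(f"r={rate:.3f}, n={months:2d} months: grant if P(repaid) > {p:.3f}")
```

At 0.5% per month a 24-month loan needs a repayment probability of roughly 0.89 to break even, while at 5% roughly 0.31 suffices, so almost any applicant looks worth lending to.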

In [1]:
# Slightly modified TestLending.py so it runs faster and we can call the
# model checking as a function model_check
from TestLending import X, model_check
from banker import run
from randomforestbanker import RandomForestBanker
from logisticbanker import LogisticBanker
from kneighborsbanker import KNeighborsBanker
from randombanker import RandomBanker
from yesbanker import YesBanker

In [2]:
interest_rate = 0.005

print("RandomBanker: ")
model_check(X,
            RandomBanker(interest_rate=interest_rate),
            interest_rate=interest_rate)

print("YesBanker: ")
model_check(X,
            YesBanker(interest_rate=interest_rate),
            interest_rate=interest_rate)

print("LogisticBanker: ")
model_check(X,
            LogisticBanker(interest_rate=interest_rate),
            interest_rate=interest_rate)

for k in [1, 5, 15, 25, 35]:
    print(f"KNeighborsBanker with k={k}:")
    model_check(X,
                KNeighborsBanker(interest_rate=interest_rate, k=k),
                interest_rate=interest_rate)

for n in range(25, 151, 25):
    print(f"RandomForestBanker with estimators={n}:")
    model_check(X,
                RandomForestBanker(interest_rate=interest_rate, n_estimators=n),
                interest_rate=interest_rate)

RandomBanker: 
-101094.87977796681
YesBanker: 
-183481.48531310007
LogisticBanker: 
5402.441832063978
KNeighborsBanker with k=1:
-105561.42891528284
KNeighborsBanker with k=5:
-27201.04033618056
KNeighborsBanker with k=15:
-8426.810873055567
KNeighborsBanker with k=25:
-3267.789958120485
KNeighborsBanker with k=35:
-1614.665685385079
RandomForestBanker with estimators=25:
3781.32128916494
RandomForestBanker with estimators=50:
5084.396434927892
RandomForestBanker with estimators=75:
6150.781219333917
RandomForestBanker with estimators=100:
6003.25565680178
RandomForestBanker with estimators=125:
6238.063310505793
RandomForestBanker with estimators=150:
5803.069072435341


In [3]:
# n_estimators = 100 and k = 15 by default
run()

RandomForestBanker (581.3633305063215, 1292.6421177905368)
LogisticBanker (433.97574560359425, 2783.9969836057344)
KNeighborsBanker (96.85603287170403, 1938.901379761263)
RandomBanker (-9260.304286101706, 11830.89580573957)
YesBanker (-17924.250714848473, 11782.002412033245)


From these results we would pick `RandomForestBanker` with `n_estimators` somewhere between 75 and 150. Hesse's `banker.py` program shows that `RandomForestBanker` has both the highest utility and the lowest standard deviation of all the bankers. However, we have not evaluated the chosen model on a separate held-out test set. We should, because the results above were themselves used for model selection and are therefore likely to give an optimistic estimate of its performance.
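
A minimal sketch of how such a held-out evaluation could be set up: split the row indices into three disjoint parts, select the model on the validation part, and score the chosen model once on the test part. The helper `three_way_split`, the split fractions and the row count are assumptions for illustration, not part of `TestLending.py` or `banker.py`.

```python
import random


def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint train/validation/test index lists (hypothetical helper)."""
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Example with an assumed data set of 1000 rows.
train_idx, val_idx, test_idx = three_way_split(1000)
```

The bankers would then be fitted on `train_idx`, compared on `val_idx`, and only the final choice evaluated on `test_idx`.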

We made all our bankers compatible with both Dimitrakakis' `TestLending.py` and Hesse's `banker.py`. In this notebook, however, we used a modified version of `TestLending.py` that runs faster and wraps the main program in a function, `model_check()`; it is otherwise functionally identical to the original.

For reasons we could not determine, `KNeighborsBanker` performs much worse under `TestLending.py` than under `banker.py`; all the other models performed similarly in both programs.