# Fitting a `LogisticRegression`

In this notebook we fit a simple `LogisticRegression` model to some of the numeric features we identified in the previous notebook.

## Importing Packages

Let's begin by importing some packages.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

## Reading-In Data

Next we'll read in our full data set.

In [2]:
df_default = pd.read_csv('data_processed/03_categorical_processed.csv', low_memory=False)

## Numeric Features

In a previous year's class, some students noticed that a model that includes `['last_fico_range_high', 'last_fico_range_low', 'int_rate', 'fico_range_low']` seemed to perform quite well. I am going to explore this further.

In [7]:
features_all = [
    "int_rate",
    "annual_inc",
    "dti",
    "revol_util",
    "tot_cur_bal",
    "last_fico_range_high",
    "last_fico_range_low",
    "fico_range_low",
]

## Fitting `LogisticRegression` One Variable at a Time

First, I am going to fit a logistic regression to each of the features one at a time.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [13]:
lst_feature = []
lst_accuracy = []
for ix_feature in features_all:
    print(ix_feature)
    df_train = df_default[[ix_feature, "charged_off"]].dropna() 
    X = df_train[[ix_feature]]
    y = df_train["charged_off"]
    model = LogisticRegression(max_iter=1000)
    acc = cross_val_score(model, X, y)
    lst_feature.append(ix_feature)
    lst_accuracy.append(np.mean(acc))

int_rate
annual_inc
dti
revol_util
tot_cur_bal
last_fico_range_high
last_fico_range_low
fico_range_low


As we can see, `last_fico_range_high` and `last_fico_range_low` perform better than the other features, whose accuracies all sit at roughly 80%.

In [18]:
df_single_accuracy = pd.DataFrame({
    "feature": lst_feature,
    "accuracy": lst_accuracy,
})
df_single_accuracy = (
    df_single_accuracy
    .reset_index(drop=True)
    .sort_values(by="accuracy", ascending=False)
)
df_single_accuracy

| | feature | accuracy |
|---|---|---|
| 5 | last_fico_range_high | 0.894141 |
| 6 | last_fico_range_low | 0.889576 |
| 1 | annual_inc | 0.799983 |
| 7 | fico_range_low | 0.799983 |
| 3 | revol_util | 0.799981 |
| 2 | dti | 0.799647 |
| 0 | int_rate | 0.798781 |
| 4 | tot_cur_bal | 0.797476 |


## Fitting Progressive `LogisticRegressions`

In [28]:
lst_features = []
lst_accuracy = []
# Note: range(1, n) stops at n - 1 features, so the lowest-ranked feature is never added.
for ix in range(1, len(df_single_accuracy)):
    features = list(df_single_accuracy["feature"].iloc[:ix])
    print(features)
    df_train = df_default[features + ["charged_off"]].dropna()
    X = df_train[features]
    y = df_train["charged_off"]
    model = LogisticRegression(max_iter=1000)
    acc = cross_val_score(model, X, y)
    lst_features.append(features)
    lst_accuracy.append(np.mean(acc))

['last_fico_range_high']
['last_fico_range_high', 'last_fico_range_low']
['last_fico_range_high', 'last_fico_range_low', 'annual_inc']
['last_fico_range_high', 'last_fico_range_low', 'annual_inc', 'fico_range_low']
['last_fico_range_high', 'last_fico_range_low', 'annual_inc', 'fico_range_low', 'revol_util']
['last_fico_range_high', 'last_fico_range_low', 'annual_inc', 'fico_range_low', 'revol_util', 'dti']
['last_fico_range_high', 'last_fico_range_low', 'annual_inc', 'fico_range_low', 'revol_util', 'dti', 'int_rate']
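The printed feature sets stop at seven features, although eight were ranked. As a minimal sketch (not the notebook's own code), the full sequence of nested subsets can be built with a range that runs to `len + 1`. The `ranked_features` list below is copied by hand from the single-feature accuracy table, and the variable names are illustrative.

```python
# Ranking copied by hand from the single-feature accuracy table (best first).
ranked_features = [
    "last_fico_range_high",
    "last_fico_range_low",
    "annual_inc",
    "fico_range_low",
    "revol_util",
    "dti",
    "int_rate",
    "tot_cur_bal",
]

# range(1, n + 1) yields subset sizes 1..n, so the last subset holds every feature.
nested_subsets = [
    ranked_features[:k] for k in range(1, len(ranked_features) + 1)
]

for subset in nested_subsets:
    # Show the subset size and the feature that was just added.
    print(len(subset), subset[-1])
```

Each subset could then be passed to `cross_val_score()` in the same way as in the loop above.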


In [29]:
df_multiple_accuracy = pd.DataFrame({
    "feature": lst_features,
    "accuracy": lst_accuracy,
})
df_multiple_accuracy = (
    df_multiple_accuracy
    .reset_index(drop=True)
    .sort_values(by="accuracy", ascending=False)
)
df_multiple_accuracy

| | feature | accuracy |
|---|---|---|
| 6 | [last_fico_range_high, last_fico_range_low, an... | 0.897367 |
| 5 | [last_fico_range_high, last_fico_range_low, an... | 0.895683 |
| 0 | [last_fico_range_high] | 0.894141 |
| 1 | [last_fico_range_high, last_fico_range_low] | 0.894141 |
| 2 | [last_fico_range_high, last_fico_range_low, an... | 0.894132 |
| 4 | [last_fico_range_high, last_fico_range_low, an... | 0.893057 |
| 3 | [last_fico_range_high, last_fico_range_low, an... | 0.892553 |


## Organizing Features and Labels

I suspect that `last_fico_range_high` and `last_fico_range_low` are observed after the loan has been issued, which would leak information about the outcome into the model, so we will remove them.

Let's now isolate the columns that we will need for our analysis. Note that `loan_amnt` is not one of our predictive features; we are keeping it to do further analysis once we have our inferences.

In [30]:
columns = ["loan_amnt", "int_rate", "annual_inc", "dti", "revol_util", "tot_cur_bal", "charged_off"]
features = ["int_rate", "annual_inc", "dti", "revol_util", "tot_cur_bal"]

Here we are dropping observations that have missing values.  Missing values should be handled using a `pipeline`; we will demonstrate this in a subsequent notebook.

In [31]:
df_train = df_default[columns].dropna()
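As a rough preview of the pipeline approach mentioned above, here is a minimal sketch that imputes missing values inside the model instead of dropping rows. Everything here is an assumption made for illustration: the data is synthetic (not the loan data set), and median imputation followed by standard scaling is just one reasonable choice, not necessarily what the later notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (NOT the real data set).
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "int_rate": rng.normal(13.0, 4.0, n),
    "dti": rng.normal(18.0, 8.0, n),
})
y_demo = rng.random(n) < 0.2  # roughly 20% positives

# Blank out roughly 10% of one column to mimic missing values.
X_demo.loc[rng.random(n) < 0.1, "dti"] = np.nan

# Imputation and scaling are refit on each training fold, so nothing
# leaks from the held-out fold into the preprocessing steps.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, X_demo, y_demo)
print(scores.mean())
```

Because the imputer lives inside the pipeline, no rows are discarded and the same preprocessing is applied automatically at prediction time.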

Finally, we isolate our predictive features and labels.

In [32]:
X = df_train[features]
y = df_train["charged_off"]

## Cross-Validation Accuracy

Now we use `cross_val_score()` to estimate our out-of-sample performance.

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [34]:
model = LogisticRegression(max_iter=1000)
cross_val_score(model, X, y)

array([0.7971839 , 0.79709528, 0.79748518, 0.79652817, 0.79723706])

As we can see, our model does about as well as the degenerate model that always predicts `False`, whose accuracy is computed below. Thus, in terms of accuracy, our model doesn't have much predictive power.

In [35]:
1 - y.mean()

0.7974532791025335
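The degenerate model can also be written as an estimator and cross-validated in exactly the same way as the logistic regression. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data (an assumption for illustration, not the loan data) and compares its cross-validated accuracy with `1 - y.mean()`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: features are pure noise, about 20% positives.
rng = np.random.default_rng(0)
n = 1000
X_demo = rng.normal(size=(n, 2))
y_demo = rng.random(n) < 0.2

# "most_frequent" always predicts the majority class (False here).
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo)

# The two numbers should be nearly identical.
print(baseline_scores.mean(), 1 - y_demo.mean())
```

Keeping the baseline as an estimator makes it easy to swap in other scoring metrics later and compare against the same reference.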

## Fitting Model to Entire Data Set

Next, we fit our model to the entire data set and examine the in-sample inferences.

In [36]:
model.fit(X, y)

The in-sample accuracy is about the same as the out-of-sample accuracy.

In [37]:
model.score(X, y)

0.7971218686586738

Recall that about 20.3% of the individual loans in the training set defaulted.

In [39]:
y.mean()

0.20254672089746656

However, when using the hard predictions from the model, we see that only 2.5% of the loans are forecasted to default.

In [40]:
model.predict(X).mean()

0.024940851210888695
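The gap between the observed default rate and the predicted one comes from how hard predictions are formed: for a binary `LogisticRegression`, `predict()` flags an observation only when its predicted probability exceeds 0.5. The sketch below illustrates this with made-up probabilities (the values and the 0.20 cutoff are hypothetical, chosen only for illustration).

```python
import numpy as np

# Hypothetical predicted default probabilities (illustrative values only).
proba = np.array([0.05, 0.12, 0.18, 0.22, 0.31, 0.44, 0.55, 0.08, 0.27, 0.61])

# Hard predictions at the default 0.5 cutoff.
hard_at_half = proba > 0.5

# A lower, hypothetical cutoff flags more observations as defaults.
hard_at_020 = proba > 0.20

print(hard_at_half.mean())  # share flagged at the 0.5 cutoff
print(hard_at_020.mean())   # share flagged at the 0.20 cutoff
print(proba.mean())         # average predicted probability
```

When most predicted probabilities sit well below 0.5, very few loans are flagged even though the average predicted probability can still be close to the observed default rate.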

Now, let's check the dollar value of the loans that defaulted. Notice that about \\$1.79 billion of the \\$8.20 billion issued defaulted.

In [42]:
print("Defaulted:", np.sum(df_train["loan_amnt"] * df_train["charged_off"]))
print("Total:", np.sum(df_train["loan_amnt"]))

Defaulted: 1787274475.0
Total: 8198093675.0


That's about 21.8% of the total loan balance issued.

In [43]:
np.sum(df_train["loan_amnt"] * df_train["charged_off"]) / np.sum(df_train["loan_amnt"])

0.2180109847305447

The expected defaulted balance predicted by the model is \\$1.70 billion, which is about 95% of the actual defaulted balance.

In [45]:
np.sum(df_train["loan_amnt"] * model.predict_proba(X)[:,1])

1699247568.442789

In [46]:
np.sum(df_train["loan_amnt"] * model.predict_proba(X)[:,1]) / np.sum(df_train["loan_amnt"] * df_train["charged_off"])

0.9507479641272162
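To make the arithmetic concrete, here is a small worked example with four hypothetical loans (all amounts, outcomes, and probabilities are invented for illustration). It computes the actual defaulted balance, the expected defaulted balance from probabilities, and the balance implied by hard predictions at the 0.5 cutoff.

```python
import numpy as np

# Four hypothetical loans (invented numbers, not from the data set).
loan_amnt = np.array([10_000.0, 20_000.0, 5_000.0, 15_000.0])
charged_off = np.array([0, 1, 0, 0])           # observed outcomes
proba = np.array([0.10, 0.40, 0.20, 0.30])     # predicted default probabilities

# Actual defaulted balance: only the second loan defaulted.
actual = np.sum(loan_amnt * charged_off)

# Expected defaulted balance: each loan weighted by its default probability.
expected = np.sum(loan_amnt * proba)

# Balance implied by hard predictions: no probability exceeds 0.5.
hard = np.sum(loan_amnt * (proba > 0.5))

print(actual, expected, hard)
```

In this toy case the hard predictions imply no defaulted balance at all, while the probability-weighted sum recovers a sizeable share of the actual figure, which mirrors the pattern seen in the notebook.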

**Conclusion:** Our model isn't working that well in terms of hard predictions, but it performs reasonably well in-sample in terms of expected balance defaulted.