# Understanding credit risk
* Created on: 05/28/2021
* Created by: Michael Monahan
* Source reference: DataCamp

## Import the required libraries

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels as sm
import seaborn as sns
import numpy as np

## What is credit risk?
* The possibility that an entity that has borrowed money will not repay it
* Calculated risk difference between lending an entity money and a risk-free investment
* When an entity fails to repay, it is in default
* The likelihood that an entity will default is the probability of default (PD)

## Expected loss
* The dollar amount a firm loses as a result of a loan default
* Three primary components:
    * Probability of Default (PD): the likelihood that the borrower defaults
    * Exposure at Default (EAD): the outstanding loan amount at default
    * Loss Given Default (LGD): the share of the exposure that is lost after recoveries at default
* Formula for expected loss:  `expected_loss = PD * EAD * LGD`
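
As a quick worked instance of the formula, the sketch below plugs in made-up numbers for a single loan. All three inputs are illustrative assumptions, not values from the course data.

```python
# Hypothetical loan: every input below is invented for illustration
prob_of_default = 0.05         # PD: 5% chance the borrower defaults
exposure_at_default = 10_000.0 # EAD: dollars outstanding at default
loss_given_default = 0.40      # LGD: 40% of the exposure is not recovered

expected_loss = prob_of_default * exposure_at_default * loss_given_default
print(f"Expected loss: ${expected_loss:,.2f}")
```

Summing this quantity over every loan gives a portfolio-level expected loss.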

## Types of data used
* Two primary types of data used:
    * **Application data:** interest rates, FICO, loan amount, loan intent
    * **Behavioral data:** employment length, history of default, income

## Data processing
* Prepared data allows models to train faster
* Often positively impacts model performance
### Outliers and performance
* Possible causes of outliers:
    * Entry error
    * Technical issues or system failures 
### Detecting outliers with cross tables
* Using cross tables with aggregate functions
```python
pd.crosstab(cr_loan['person_home_ownership'],cr_loan['loan_status'],
            values=cr_loan['loan_int_rate'], aggfunc='mean').round(2)
```
* Removing outliers using the `.drop()` method in pandas
```python
indices = cr_loan[cr_loan['person_emp_length'] >= 60].index
cr_loan.drop(indices, inplace=True)
```
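
The two steps above can be tried on a small stand-in frame. This is a minimal sketch: `toy` and its values are invented, and only the column names mirror `cr_loan`.

```python
import pandas as pd

# Toy stand-in for cr_loan; the values are invented for illustration
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE", "OWN"],
    "loan_status":           [0, 1, 1, 0, 0],
    "person_emp_length":     [3.0, 123.0, 5.0, 10.0, 2.0],
})

# The per-group maximum makes the 123-year employment length stand out
print(pd.crosstab(toy["person_home_ownership"], toy["loan_status"],
                  values=toy["person_emp_length"], aggfunc="max"))

# Drop rows above the cutoff and keep the result in a new frame
toy_clean = toy.drop(toy[toy["person_emp_length"] > 60].index)
print(toy_clean["person_emp_length"].max())
```

Keeping the result in a new frame, rather than using `inplace=True`, leaves the original data available for comparison.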

In [None]:
# Example 
# Do not run

# Create the cross table for loan status, home ownership, and the max employment length
print(pd.crosstab(cr_loan['loan_status'],cr_loan['person_home_ownership'],
                  values=cr_loan['person_emp_length'], aggfunc='max'))

# Create an array of indices where employment length is greater than 60
indices = cr_loan[cr_loan['person_emp_length'] > 60].index

# Drop the records from the data based on the indices and create a new dataframe
cr_loan_new = cr_loan.drop(indices)

# Create the cross table from earlier and include minimum employment length
print(pd.crosstab(cr_loan_new['loan_status'],cr_loan_new['person_home_ownership'],
                  values=cr_loan_new['person_emp_length'], aggfunc=['min','max']))

# Use Pandas to drop the record from the data frame and create a new one
cr_loan_new = cr_loan.drop(cr_loan[cr_loan['person_age'] > 100].index)

# Create a scatter plot of age and interest rate
# ListedColormap must be imported explicitly; only pyplot is imported above
from matplotlib.colors import ListedColormap
colors = ["blue","red"]
plt.scatter(cr_loan_new['person_age'], cr_loan_new['loan_int_rate'],
            c = cr_loan_new['loan_status'],
            cmap = ListedColormap(colors),
            alpha=0.5)
plt.xlabel("Person Age")
plt.ylabel("Loan Interest Rate")
plt.show()

## Risk with missing data in loan data
### How to handle missing data?
* Generally, there are three ways to handle missing data:
1. Impute values where data is missing
2. Remove rows containing missing values
3. Leave the rows with missing data unchanged
* Understanding the data determines the course of action
### Finding missing data 
* Null values are found with the `.isnull()` method
* Null records are counted by chaining `.sum()`
* `.any()` flags each column that contains at least one null
```python
null_columns = cr_loan.columns[cr_loan.isnull().any()]
cr_loan[null_columns].isnull().sum()
```
### Replacing missing data 
* Replace missing values using `.fillna()` with an aggregate such as the mean or median
```python
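# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
cr_loan['loan_int_rate'] = cr_loan['loan_int_rate'].fillna(cr_loan['loan_int_rate'].mean())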
```
### Dropping missing data 
* Uses indices to identify records, the same way as with outliers
* Remove the records entirely using the `.drop()` method
```python
indices = cr_loan[cr_loan['person_emp_length'].isnull()].index
cr_loan.drop(indices, inplace=True)
```
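
A minimal end-to-end sketch of finding, imputing, and dropping missing values. The frame `toy` is invented for illustration, and only the column names follow `cr_loan`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (invented values)
toy = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 4.0, 7.0],
    "loan_int_rate":     [11.0, 13.5, np.nan, 9.0],
})

# Count the nulls per column
print(toy.isnull().sum())

# Impute employment length with the median of the observed values (1, 4, 7)
toy["person_emp_length"] = toy["person_emp_length"].fillna(
    toy["person_emp_length"].median())

# Drop the rows that still have a missing interest rate
toy_clean = toy.drop(toy[toy["loan_int_rate"].isnull()].index)
print(toy_clean)
```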

In [None]:
# Example
# Do not run

# Print a null value column array
print(cr_loan.columns[cr_loan.isnull().any()])

# Print the top five rows with nulls for employment length
print(cr_loan[cr_loan['person_emp_length'].isnull()].head())

# Impute the null values with the median value for all employment lengths
cr_loan['person_emp_length'] = cr_loan['person_emp_length'].fillna(cr_loan['person_emp_length'].median())

# Create a histogram of employment length
n, bins, patches = plt.hist(cr_loan['person_emp_length'], bins='auto', color='blue')
plt.xlabel("Person Employment Length")
plt.show()

# Print the number of nulls
print(cr_loan['loan_int_rate'].isnull().sum())

# Store the array of indices
indices = cr_loan[cr_loan['loan_int_rate'].isnull()].index

# Save the new data without missing data
cr_loan_clean = cr_loan.drop(indices)

## Logistic regression for probability of default
### Predicting probabilities
* Probabilities of default as an outcome from machine learning
    * Learn from data in columns (features)
* Classification models (default, non-default)
* Two most common models:
    * Logistic regression
    * Decision tree

### Logistic regression
* Similar to linear regression, but only produces values between `0` and `1`
* Logistic regression available within scikit-learn package
```python
from sklearn.linear_model import LogisticRegression
```
* Instantiated like a function call, with or without parameters
```python
clf_logistic = LogisticRegression(solver='lbfgs')
```
* Uses the `.fit()` method to train
```python
clf_logistic.fit(training_columns, np.ravel(training_labels))
```
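
To see why the output always falls between `0` and `1`, the sketch below applies the logistic (sigmoid) function to a linear combination by hand. The intercept and coefficient are hypothetical values chosen for illustration, not fitted ones.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient for one feature (interest rate)
intercept, coef = -4.0, 0.25
rates = np.array([5.0, 10.0, 16.0, 22.0])

# Same linear part as linear regression, then squashed by the sigmoid
linear_part = intercept + coef * rates
prob_default = sigmoid(linear_part)
print(prob_default.round(3))
```

A fitted `LogisticRegression` model does the same thing internally, with `intercept_` and `coef_` learned from the training data.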

## Training and testing sets (60 - 40 rule)
### Creating the training and test sets
* Separate the data into training columns and labels
```python
X = cr_loan.drop('loan_status', axis  = 1)
y = cr_loan[['loan_status']]
```
* Use the `train_test_split()` function in scikit-learn
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)
```
* `test_size`: the fraction of the data held out for the test set
* `random_state`: a random seed value for reproducibility
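
A small sketch of what the two parameters do, using ten invented rows in place of `cr_loan`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows; the values are invented for illustration
X = pd.DataFrame({"loan_int_rate": np.linspace(5, 14, 10)})
y = pd.DataFrame({"loan_status": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# test_size=0.4 holds out 40% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123)
print(len(X_train), len(X_test))

# Re-using the same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.4, random_state=123)
print(X_train.index.equals(X_train_again.index))
```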

In [None]:
# Example
# Do not run

# Create the X and y data sets
X = cr_loan_clean[['loan_int_rate']]
y = cr_loan_clean[['loan_status']]

# Create and fit a logistic regression model
clf_logistic_single = LogisticRegression(solver='lbfgs')
clf_logistic_single.fit(X, np.ravel(y))

# Print the parameters of the model
print(clf_logistic_single.get_params())

# Print the intercept of the model
print(clf_logistic_single.intercept_)


# Create X data for the model
X_multi = cr_loan_clean[['loan_int_rate','person_emp_length']]

# Create a set of y data for training
y = cr_loan_clean[['loan_status']]

# Create and train a new logistic regression
clf_logistic_multi = LogisticRegression(solver='lbfgs').fit(X_multi, np.ravel(y))

# Print the intercept of the model
print(clf_logistic_multi.intercept_)

# Create the X and y data sets
X = cr_loan_clean[['loan_int_rate','person_emp_length','person_income']]
y = cr_loan_clean[['loan_status']]

# Use train_test_split to create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)

# Create and fit the logistic regression model
clf_logistic = LogisticRegression(solver='lbfgs').fit(X_train, np.ravel(y_train))

# Print the model's coefficients
print(clf_logistic.coef_)

## Predicting the probability of default
* Use one-hot encoding to convert non-numeric columns to numeric indicator columns with `get_dummies()` in pandas
```python
# Separate the numeric columns
cred_num = cr_loan.select_dtypes(exclude=['object'])
# Separate the non-numeric columns
cred_cat = cr_loan.select_dtypes(include=['object'])
# One-hot encode the non-numeric columns only
cred_cat_onehot = pd.get_dummies(cred_cat)
# Union the numeric columns with the one-hot encoded columns
cr_loan = pd.concat([cred_num,cred_cat_onehot], axis = 1)
```
* Use the `predict_proba()` method in scikit-learn
```python
# Train the model
clf_logistic.fit(X_train, np.ravel(y_train))
# Predict using the model
clf_logistic.predict_proba(X_test)
```
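
The sketch below runs both steps on a toy frame. The data is invented, and `pd.get_dummies(..., columns=[...])` is used here as a shorter alternative to splitting and re-joining the columns by dtype.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data with one numeric and one non-numeric feature (invented values)
toy = pd.DataFrame({
    "loan_int_rate": [7.5, 15.2, 9.1, 18.4, 6.3, 16.8],
    "loan_intent":   ["EDUCATION", "MEDICAL", "EDUCATION",
                      "MEDICAL", "PERSONAL", "PERSONAL"],
    "loan_status":   [0, 1, 0, 1, 0, 1],
})

# One-hot encode the non-numeric column; numeric columns pass through unchanged
X = pd.get_dummies(toy.drop("loan_status", axis=1), columns=["loan_intent"])
y = toy["loan_status"]
print(list(X.columns))

# predict_proba returns one row per loan: [P(non-default), P(default)]
clf = LogisticRegression(solver="lbfgs").fit(X, y)
probs = clf.predict_proba(X)
print(probs.shape)
```

The second column, `probs[:, 1]`, is the probability of default used in the later sections.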

## Credit model performance
### Model accuracy scoring
* Calculate accuracy
    * Accuracy = number of correct predictions / number of predictions
* Use the `.score()` method from scikit-learn
```python 
# Check the accuracy against the test data 
clf_logistic.score(X_test, y_test)
```
### ROC curve charts
* Receiver Operating Characteristic curve
    * Plots the true positive rate (sensitivity) against the false positive rate (fall-out)
```python
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, color= 'darkorange')
```
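
A minimal sketch of both metrics on four hand-made predictions. The labels and probabilities are invented, and the plot is omitted so the numbers can be checked by hand.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Four toy loans: true labels and predicted probabilities of default
y_true = np.array([0, 0, 1, 1])
prob_default = np.array([0.1, 0.4, 0.35, 0.8])

# Accuracy at a 0.5 threshold: correct predictions / all predictions
pred = (prob_default > 0.5).astype(int)
accuracy = (pred == y_true).mean()

# ROC components and the area under the curve
fallout, sensitivity, thresholds = roc_curve(y_true, prob_default)
auc = roc_auc_score(y_true, prob_default)
print(accuracy, auc)
```

Here three of the four default/non-default pairs are ranked correctly, which is what an AUC of 0.75 expresses.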

In [None]:
# Example
# Do not run

# Create a dataframe for the probabilities of default
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_default'])

# Reassign loan status based on the threshold
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.50 else 0)

# Print the row counts for each loan status
print(preds_df['loan_status'].value_counts())

# Print the classification report
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))

# Print all the non-average values from the report
print(precision_recall_fscore_support(y_test,preds_df['loan_status']))

# Print the first two elements from the report (precision and recall)
print(precision_recall_fscore_support(y_test,preds_df['loan_status'])[:2])

# Create predictions and store them in a variable
preds = clf_logistic.predict_proba(X_test)

# Print the accuracy score of the model
print(clf_logistic.score(X_test, y_test))

# Plot the ROC curve of the probabilities of default
prob_default = preds[:, 1]
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, color = 'darkorange')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.show()

# Compute the AUC and store it in a variable
auc = roc_auc_score(y_test, prob_default)

## Model discrimination and impact
### Confusion matrices
* Shows the number of correct and incorrect predictions for each `loan_status`
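
A small sketch of how the matrix shifts when the threshold is lowered, using six invented loans:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Six toy loans: true labels and predicted probabilities of default
y_true = np.array([0, 0, 0, 1, 1, 1])
prob_default = np.array([0.10, 0.45, 0.60, 0.42, 0.55, 0.90])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm_50 = confusion_matrix(y_true, (prob_default > 0.5).astype(int))
cm_40 = confusion_matrix(y_true, (prob_default > 0.4).astype(int))

print("threshold 0.5:\n", cm_50)
print("threshold 0.4:\n", cm_40)
```

In this toy case, lowering the threshold from 0.5 to 0.4 catches the one missed default but flags one more non-default as a default.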

In [None]:
# Example
# Do not run

# Set the threshold for defaults to 0.5
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.5 else 0)

# Print the confusion matrix
print(confusion_matrix(y_test,preds_df['loan_status']))

# Set the threshold for defaults to 0.4
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.4 else 0)

# Print the confusion matrix
print(confusion_matrix(y_test,preds_df['loan_status']))

# Reassign the values of loan status based on the new threshold
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > 0.4 else 0)

# Store the number of loan defaults from the prediction data
num_defaults = preds_df['loan_status'].value_counts()[1]

# Store the default recall from the classification report
default_recall = precision_recall_fscore_support(y_test,preds_df['loan_status'])[1][1]

# Calculate the estimated impact of the new default recall rate
print(num_defaults * avg_loan_amnt * (1 - default_recall))

plt.plot(thresh,def_recalls)
plt.plot(thresh,nondef_recalls)
plt.plot(thresh,accs)
plt.xlabel("Probability Threshold")
plt.xticks(ticks)
plt.legend(["Default Recall","Non-default Recall","Model Accuracy"])
plt.show()

## Gradient boosted trees with XGBoost
### Decision trees
* Creates predictions similar to logistic regression
* Not structured like a regression model

### A forest of decision trees
* XGBoost uses many simplistic trees (ensemble)
* Each tree will be slightly better than a coin toss
* Creating and training trees
    * part of the `xgboost` package in Python, conventionally imported as `xgb`
    * trains with `.fit()` just like the logistic regression model
```python
# Examples
# Create a logistic regression model
clf_logistic = LogisticRegression()
# Train the logistic regression
clf_logistic.fit(X_train, np.ravel(y_train))
# Create a gradient boosted tree model
clf_gbt = xgb.XGBClassifier()
# Train the gradient boosted tree
clf_gbt.fit(X_train, np.ravel(y_train))
```

### Hyperparameters of gradient boosted trees
* **Hyperparameters:** model parameters (settings) that cannot be learned from the data
* Some common hyperparameters for gradient boosted trees:
    * `learning_rate`: smaller values make each step more conservative
    * `max_depth`: sets how deep each tree can go, larger indicates more complexity
```python
xgb.XGBClassifier(learning_rate = 0.2,
                  max_depth = 4)
```

In [None]:
# Example
# Do not run

# Train a model
import xgboost as xgb
clf_gbt = xgb.XGBClassifier().fit(X_train, np.ravel(y_train))

# Predict with a model
gbt_preds = clf_gbt.predict_proba(X_test)

# Create dataframes of first five predictions, and first five true labels
preds_df = pd.DataFrame(gbt_preds[:,1][0:5], columns = ['prob_default'])
true_df = y_test.head()

# Concatenate and print the two data frames for comparison
print(pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1))

# Print the first five rows of the portfolio data frame
print(portfolio.head())

# Create expected loss columns for each model using the formula
portfolio['gbt_expected_loss'] = portfolio['gbt_prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']
portfolio['lr_expected_loss'] = portfolio['lr_prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']

# Print the sum of the expected loss for lr
print('LR expected loss: ', np.sum(portfolio['lr_expected_loss']))

# Print the sum of the expected loss for gbt
print('GBT expected loss: ', np.sum(portfolio['gbt_expected_loss']))

# Predict the labels for loan status
gbt_preds = clf_gbt.predict(X_test)

# Check the values created by the predict method
print(gbt_preds)

# Print the classification report of the model
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, gbt_preds, target_names=target_names))

## Feature selection for credit risk
### Column importance 
* Use the `.get_booster()` and `.get_score()` methods
    * **Weight:** the number of times a column is used to split the data across all trees
```python
# Train the model
clf_gbt.fit(X_train, np.ravel(y_train))
# Print the feature importances
clf_gbt.get_booster().get_score(importance_type = 'weight')
```
### Plotting column importances
* Use the `plot_importance()` function
```python
xgb.plot_importance(clf_gbt, importance_type = 'weight')
```

## Cross validation for credit models
### Cross validation basics
* Used to train and test the model in a way that simulates using the model on new data
* Segments the training data into different pieces to estimate future performance
* Uses `DMatrix`, an internal structure optimized for `XGBoost`
* Early stopping tells cross validation to stop once the scoring metric has not improved for a set number of iterations
### How it works
* Processes parts of training data (folds) and tests against the unused training data 
* Final testing against the actual test set
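
The fold mechanics can be sketched with scikit-learn's `KFold` as a stand-in; it only illustrates how the training rows are split and does not reproduce what `xgb.cv` does internally. The ten-row `X_train` is invented.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy training rows with two features each (invented values)
X_train = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=99)

held_out = []
for fold, (fit_idx, val_idx) in enumerate(kf.split(X_train)):
    # Each round fits on four folds and validates on the remaining one
    print(f"fold {fold}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
    held_out.extend(int(i) for i in val_idx)

# Every training row is held out exactly once across the five rounds
print(sorted(held_out) == list(range(10)))
```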

### Setting up cross validation with XGBoost
```python
# Set the number of folds
n_folds = 2
# Set the early stopping number
early_stop = 5
# Set any specific parameters for cross validation 
params = {'objective':'binary:logistic',
          'seed':99, 'eval_metric':'auc'}
```
* `'objective':'binary:logistic'` is used to specify classification for `loan_status`
* `'eval_metric':'auc'` tells XGBoost to score the model's performance on AUC

### Using cross validation with XGBoost
```python
# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)
# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold= n_folds,
       early_stopping_rounds= early_stop)
```

### Cross validation scoring
* Combines cross validation and scoring through the `cross_val_score()` function in scikit-learn
```python
# Import the module
from sklearn.model_selection import cross_val_score
# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)
# Run 5-fold cross validation and return the accuracy score of each fold
cross_val_score(gbt, X_train, np.ravel(y_train), cv = 5)
```

In [None]:
# Example
# Do not run

# Create a gradient boosted tree model using two hyperparameters
gbt = xgb.XGBClassifier(learning_rate = 0.1, max_depth = 7)

# Calculate the cross validation scores for 4 folds
cv_scores = cross_val_score(gbt, X_train, np.ravel(y_train), cv = 4)

# Print the cross validation scores
print(cv_scores)

# Print the average accuracy and standard deviation of the scores
print("Average accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(),
                                              cv_scores.std() * 2))

## Class imbalances
### Model loss function
* Gradient boosted trees in `xgboost` use a loss function of log-loss
### Causes of imbalance
* Data problems:
    * Credit data was not sampled correctly
    * Data storage problems
* Business processes:
    * Measures already in place to not accept probable defaults
    * Probable defaults are quickly sold to other firms
* Behavioral factors:
    * Normally, people do not default on their loans
    * The less often they default, the higher their credit rating
    
### Undersampling to equalize class proportions
* The training columns and training labels must be put back together
* Create two new sets based on actual `loan_status`
```python
# Concat the training sets
X_y_train = pd.concat([X_train.reset_index(drop=True),
                       y_train.reset_index(drop=True)], axis = 1)
# Get counts of defaults and non-defaults
count_nondefault, count_default = X_y_train['loan_status'].value_counts()
# Separate nondefaults and defaults
nondefaults = X_y_train[X_y_train['loan_status']==0]
defaults = X_y_train[X_y_train['loan_status']==1]
```
* Randomly sample the data set of non-defaults
* Concatenate with data set of defaults
```python
# Undersample the non-defaults using sample() in pandas
nondefaults_under = nondefaults.sample(count_default)
# Concat the undersampled non-defaults with the defaults
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop=True),
                             defaults.reset_index(drop=True)],axis = 0)
```
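
A self-contained sketch of the same undersampling steps on eight invented rows. The `random_state` argument is an addition here so the sample is reproducible.

```python
import pandas as pd

# Toy training data: six non-defaults and two defaults (invented values)
X_y_train = pd.DataFrame({
    "loan_int_rate": [7.0, 8.5, 9.0, 10.5, 11.0, 12.5, 15.0, 17.5],
    "loan_status":   [0, 0, 0, 0, 0, 0, 1, 1],
})

# Look up the default count by label rather than relying on sort order
count_default = int(X_y_train["loan_status"].value_counts().loc[1])

nondefaults = X_y_train[X_y_train["loan_status"] == 0]
defaults = X_y_train[X_y_train["loan_status"] == 1]

# Keep only as many non-defaults as there are defaults
nondefaults_under = nondefaults.sample(n=count_default, random_state=0)
X_y_train_under = pd.concat([nondefaults_under, defaults],
                            axis=0).reset_index(drop=True)
print(X_y_train_under["loan_status"].value_counts())
```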

In [None]:
# Example 
# Do not run

# Create data sets for defaults and non-defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]

# Undersample the non-defaults
nondefaults_under = nondefaults.sample(count_default)

# Concatenate the undersampled nondefaults with defaults
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)

# Print the value counts for loan status
print(X_y_train_under['loan_status'].value_counts())

## Model evaluation and implementation
### Comparing classification reports
* Create reports with `classification_report()` and compare
* ROC and AUC analysis
    * models with better performance will have more lift
    * more lift suggests the AUC score is higher
    
### Calculating model calibration
* Shows the percentage of true defaults for each predicted probability
* A line plot of the results of `calibration_curve()`
```python
from sklearn.calibration import calibration_curve
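# Returns the fraction of true positives and the mean predicted value per bin
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, probabilities_of_default, n_bins = 5)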
```
* Plotting calibration curves
```python 
plt.plot(mean_predicted_value, fraction_of_positives, label = '%s' % "Example Model")
```
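
A minimal sketch of what `calibration_curve()` returns, using eight invented predictions that fall into two of the five bins. The plot call is left out so the numbers can be read directly.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy labels and predicted probabilities of default (invented values)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
prob_default = np.array([0.3, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.7])

# Empty bins are dropped, so only two points come back here
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, prob_default, n_bins=5)

print(mean_predicted_value)   # average predicted probability per bin
print(fraction_of_positives)  # observed default rate per bin
```

A well-calibrated model has the two arrays close to each other, so its points sit near the diagonal of the calibration plot.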


In [None]:
# Example
# Do not run

# Print the logistic regression classification report
target_names = ['Non-Default', 'Default']
print(classification_report(y_test, preds_df_lr['loan_status'], target_names=target_names))

# Print the gradient boosted tree classification report
print(classification_report(y_test, preds_df_gbt['loan_status'], target_names=target_names))

# Print the default F-1 scores for the logistic regression
print(precision_recall_fscore_support(y_test,preds_df_lr['loan_status'], average = 'macro')[2])

# Print the default F-1 scores for the gradient boosted tree
print(precision_recall_fscore_support(y_test,preds_df_gbt['loan_status'], average = 'macro')[2])

# ROC chart components
fallout_lr, sensitivity_lr, thresholds_lr = roc_curve(y_test, clf_logistic_preds)
fallout_gbt, sensitivity_gbt, thresholds_gbt = roc_curve(y_test, clf_gbt_preds)

# ROC Chart with both
plt.plot(fallout_lr, sensitivity_lr, color = 'blue', label='%s' % 'Logistic Regression')
plt.plot(fallout_gbt, sensitivity_gbt, color = 'green', label='%s' % 'GBT')
plt.plot([0, 1], [0, 1], linestyle='--', label='%s' % 'Random Prediction')
plt.title("ROC Chart for LR and GBT on the Probability of Default")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

# Print the logistic regression AUC with formatting
print("Logistic Regression AUC Score: %0.2f" % roc_auc_score(y_test, clf_logistic_preds))

# Print the gradient boosted tree AUC with formatting
print("Gradient Boosted Tree AUC Score: %0.2f" % roc_auc_score(y_test, clf_gbt_preds))

# Create the calibration curve plot with the guideline
plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated') 
plt.plot(mean_pred_val_lr, frac_of_pos_lr,
         's-', label='%s' % 'Logistic Regression')

plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()

## Credit acceptance rates
* Can use model predictions to set better loan acceptance thresholds
* The goal is to deny probable defaults for all new loans
* **Acceptance rate:** what percentage of *new loans* are accepted to keep the number of defaults in a portfolio low
    * Accepted loans which are defaults have an impact similar to false negatives

* Calculate the threshold value for an 85% acceptance rate
```python
# Compute the threshold for 85% acceptance (the 0.85 quantile of the probabilities of default)
threshold = np.quantile(prob_default, 0.85)
# Apply the threshold: loans above it are predicted defaults and rejected
preds_df['loan_status'] = preds_df['prob_default'].apply(lambda x: 1 if x > threshold else 0)
```
### Bad rate 
* Even with a calculated threshold, some of the accepted loans will be defaults
* These are loans with `prob_default` values where the model is not well calibrated
* Bad rate = accepted defaults / total accepted loans
```python
# Calculate the bad rate
np.sum(accepted_loans['true_loan_status']) / accepted_loans['true_loan_status'].count()
```
* If non-default is `0` and default is `1`, then the `sum()` is the count of defaults
* The `.count()` of a single column is the same as the row count for the data frame
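
The threshold and bad rate calculations can be checked on a toy frame. This sketch uses ten invented loans and an 80% acceptance rate so the numbers come out round. The column names follow `test_pred_df`.

```python
import numpy as np
import pandas as pd

# Ten toy loans with predicted probabilities and true outcomes (invented)
test_pred_df = pd.DataFrame({
    "prob_default":     [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.45, 0.70, 0.90],
    "true_loan_status": [0,    0,    0,    1,    0,    0,    0,    1,    1,    1],
})

# Threshold for an 80% acceptance rate: the 0.80 quantile of prob_default
threshold = np.quantile(test_pred_df["prob_default"], 0.80)

# Loans above the threshold are predicted defaults and are rejected
test_pred_df["pred_loan_status"] = (test_pred_df["prob_default"] > threshold).astype(int)
accepted_loans = test_pred_df[test_pred_df["pred_loan_status"] == 0]

# Bad rate: share of accepted loans that actually defaulted
bad_rate = accepted_loans["true_loan_status"].sum() / accepted_loans["true_loan_status"].count()
print(len(accepted_loans), bad_rate)
```

Eight loans are accepted and two of them are true defaults, so the bad rate is 2 / 8.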

In [None]:
# Example
# Do not run

# Check the statistics of the probabilities of default
print(test_pred_df['prob_default'].describe())

# Calculate the threshold for an 85% acceptance rate
threshold_85 = np.quantile(test_pred_df['prob_default'], 0.85)

# Apply acceptance rate threshold
test_pred_df['pred_loan_status'] = test_pred_df['prob_default'].apply(lambda x: 1 if x > threshold_85 else 0)

# Print the counts of loan status after the threshold
print(test_pred_df['pred_loan_status'].value_counts())

# Plot the predicted probabilities of default
plt.hist(clf_gbt_preds, color = 'blue', bins = 40)

# Calculate the threshold with quantile
threshold = np.quantile(clf_gbt_preds, 0.85)

# Add a reference line to the plot for the threshold
plt.hist(x = clf_gbt_preds, color = 'red')
plt.axvline(threshold)
plt.show()

# Print the top 5 rows of the new data frame
print(test_pred_df.head())

# Create a subset of only accepted loans
accepted_loans = test_pred_df[test_pred_df['pred_loan_status'] == 0]

# Calculate the bad rate
print(np.sum(accepted_loans['true_loan_status']) / accepted_loans['true_loan_status'].count())

# Print the statistics of the loan amount column
print(test_pred_df['loan_amnt'].describe())

# Store the average loan amount
avg_loan = np.mean(test_pred_df['loan_amnt'])

# Set the formatting for currency, and print the cross tab
pd.options.display.float_format = '${:,.2f}'.format
print(pd.crosstab(test_pred_df['true_loan_status'],
                 test_pred_df['pred_loan_status_15']).apply(lambda x: x * avg_loan, axis = 0))

## Credit strategy and minimum expected loss
### Selecting acceptance rates
* Acceptance rates are not fixed
* Two options to test different rates:
    * Calculate the threshold, bad rate and losses manually
    * Automatically create a table of these values and select an acceptance rate
* The table of all possible values is called a strategy table

### Setting up the strategy table 
* Set up arrays or lists to store each value
```python
# Set all the acceptance rates to test
accept_rates = [1.0,0.95,0.9,0.85,0.8,0.75,0.7,0.65,0.6,0.55,0.5,0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05]
# Create lists to store thresholds and bad rates
thresholds = []
bad_rates = []
```
```python
for rate in accept_rates:
    # Calculate the threshold for this acceptance rate
    threshold = np.quantile(preds_df['prob_default'], rate).round(3)
    # Store the threshold value in a list
    thresholds.append(threshold)
    # Apply the threshold to reassign loan status
    test_pred_df['pred_loan_status'] = \
    test_pred_df['prob_default'].apply(lambda x: 1 if x > threshold else 0)
    # Create accepted loans set of predicted non-defaults
    accepted_loans = test_pred_df[test_pred_df['pred_loan_status'] == 0]
    # Calculate and store the bad rate
    bad_rate = np.sum(accepted_loans['true_loan_status']) / accepted_loans['true_loan_status'].count()
    bad_rates.append(np.round(bad_rate, 3))
```
```python
strat_df = pd.DataFrame(zip(accept_rates, thresholds, bad_rates),
                        columns = ['Acceptance Rate','Threshold','Bad Rate'])
```

### Total expected loss
* How much one expects to lose on the defaults in the portfolio

In [None]:
# Example
# Do not run

# Print the first five rows of the data frame
print(test_pred_df.head())

# Calculate the bank's expected loss and assign it to a new column
test_pred_df['expected_loss'] = test_pred_df['prob_default'] * test_pred_df['loss_given_default'] * test_pred_df['loan_amnt']

# Calculate the total expected loss to two decimal places
tot_exp_loss = round(np.sum(test_pred_df['expected_loss']),2)

# Print the total expected loss
print('Total expected loss: ', '${:,.2f}'.format(tot_exp_loss))