
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://raw.githubusercontent.com/amandeep0/IS451/main/data/loans.csv")

df.head()
df.info()
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CreditPolicy    9578 non-null   int64  
 1   Purpose         9578 non-null   object 
 2   IntRate         9578 non-null   float64
 3   Installment     9578 non-null   float64
 4   LogAnnualInc    9578 non-null   float64
 5   Dti             9578 non-null   float64
 6   Fico            9578 non-null   int64  
 7   DaysWithCrLine  9578 non-null   float64
 8   RevolBal        9578 non-null   int64  
 9   RevolUtil       9578 non-null   float64
 10  InqLast6mths    9578 non-null   int64  
 11  Delinq2yrs      9578 non-null   int64  
 12  PubRec          9578 non-null   int64  
 13  NotFullyPaid    9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
       CreditPolicy      IntRate  Installment  LogAnnualInc          Dti  \
count   9578.000000  9578.000000  9

In [None]:
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42, stratify=df['NotFullyPaid'])
# Baseline: always predict the majority class (loan fully paid, NotFullyPaid == 0)
baseline_accuracy = (df_test['NotFullyPaid'] == 0).mean()
print(round(baseline_accuracy, 6))

0.839944

**ai). Baseline accuracy (always predicting NotFullyPaid = 0): 0.839944**
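
As a minimal sketch of what the baseline means, the snippet below computes the accuracy of always predicting the most common class. The helper name `majority_baseline_accuracy` is hypothetical; the class counts (2414 fully paid, 460 not fully paid) are the test-set row totals shown in the cross-tabulations of this notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Test-set class counts from this notebook's cross-tabulations:
# 2414 loans fully paid (0) and 460 not fully paid (1).
labels = [0] * 2414 + [1] * 460
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 6))
```

Any model evaluated on this test set should be compared against this number rather than against 0.5, because the classes are heavily imbalanced.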

In [None]:
logit_regression = smf.logit('NotFullyPaid ~ CreditPolicy + Purpose + IntRate + Installment + LogAnnualInc + Dti + Fico + DaysWithCrLine + RevolBal + RevolUtil + InqLast6mths + Delinq2yrs + PubRec', data=df_train)
logit_results = logit_regression.fit()
print(logit_results.summary())

Optimization terminated successfully.
         Current function value: 0.408503
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:           NotFullyPaid   No. Observations:                 6704
Model:                          Logit   Df Residuals:                     6685
Method:                           MLE   Df Model:                           18
Date:                Fri, 10 Nov 2023   Pseudo R-squ.:                 0.07107
Time:                        04:47:06   Log-Likelihood:                -2738.6
converged:                       True   LL-Null:                       -2948.1
Covariance Type:            nonrobust   LLR p-value:                 9.569e-78
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                         7.5239      1.533      4.908      0.

**aii). The resulting logistic regression model includes significant independent variables such as Purpose (credit card, debt consolidation, major purchase, small business), CreditPolicy, Installment, LogAnnualInc, Fico, RevolBal, RevolUtil, and InqLast6mths, indicating their effect on the likelihood of not fully paying back a loan.**

In [None]:
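Fico_A = 700
Fico_B = 710
coefficient_fico = -0.0079  # Fico coefficient read from the model summary

# Logit(A) - Logit(B): all other predictors are identical, so only the Fico term differs
difference = coefficient_fico * (Fico_A - Fico_B)

print("Logit(A) - Logit(B) =", difference)

Logit(A) - Logit(B) = 0.07900000000000001


**aiii). Logit(A) - Logit(B) = 0.079**

**Note that the coefficient for the "Fico" predictor is negative: holding the other predictors fixed, a higher FICO score lowers the log-odds of not fully paying. Applicant A has the lower score, so Logit(A) exceeds Logit(B).**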
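
A log-odds difference is easier to read as an odds ratio. The sketch below, which assumes the same Fico coefficient of -0.0079 used in the cell above, exponentiates the difference between the two otherwise identical applications (FICO 700 versus 710). The variable names are illustrative only.

```python
import math

coefficient_fico = -0.0079  # Fico coefficient used in the cell above
fico_a, fico_b = 700, 710

# Log-odds difference between the two otherwise identical applications
logit_diff = coefficient_fico * (fico_a - fico_b)

# Exponentiating a log-odds difference gives an odds ratio
odds_ratio = math.exp(logit_diff)

print(f"Logit(A) - Logit(B) = {logit_diff:.3f}")
print(f"Odds(A) / Odds(B)   = {odds_ratio:.3f}")
```

Under these assumptions, the odds of not fully paying are roughly 8% higher for the applicant with the lower score.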

In [None]:
df_test['PredictedRisk'] = logit_results.predict(df_test) > 0.5
pd.crosstab(df_test['NotFullyPaid'], df_test['PredictedRisk'])

PredictedRisk,False,True
NotFullyPaid,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2395,19
1,443,17


In [None]:
accuracy = (df_test['PredictedRisk'] == df_test['NotFullyPaid']).mean()

print("Accuracy:", accuracy)

Accuracy: 0.8392484342379958


**aiv). New Accuracy = 0.8392484342379958**

**The accuracy of the logistic regression model on the test set using a threshold of 0.5 is approximately 0.8392. The accuracy of the baseline model is 0.8399.**

**We can see that the logistic regression model performs slightly worse than the baseline model in terms of accuracy. However, the difference in accuracy between the two models is very small.**

In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Make predictions on both training and test sets
y_train_pred = logit_results.predict(df_train)
y_test_pred = logit_results.predict(df_test)

# Create a Pandas DataFrame for each set of predictions
train_predictions = pd.DataFrame({'Actual': df_train["NotFullyPaid"], 'Predicted': (y_train_pred > 0.5).astype(int)})
test_predictions = pd.DataFrame({'Actual': df_test["NotFullyPaid"], 'Predicted': (y_test_pred > 0.5).astype(int)})

# Calculate cross-tabulation tables for both training and test sets
train_cross_tab = pd.crosstab(train_predictions['Actual'], train_predictions['Predicted'], rownames=['Actual'], colnames=['Predicted'])
test_cross_tab = pd.crosstab(test_predictions['Actual'], test_predictions['Predicted'], rownames=['Actual'], colnames=['Predicted'])

print(train_cross_tab)
print(test_cross_tab)

Predicted     0   1
Actual             
0          5596  35
1          1038  35
Predicted     0   1
Actual             
0          2395  19
1           443  17


In [None]:
y_train = df_train["NotFullyPaid"]
y_test = df_test["NotFullyPaid"]

In [None]:
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred)
roc_auc_train = auc(fpr_train, tpr_train)

fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred)
roc_auc_test = auc(fpr_test, tpr_test)

print(roc_auc_train)
print(roc_auc_test)

0.6891376339505232
0.669282446597745


In [None]:
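# Counts from the test-set cross-tabulation (rows = actual, columns = predicted)
TP = 17   # actual 1, predicted 1
FP = 19   # actual 0, predicted 1
FN = 443  # actual 1, predicted 0

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")


Precision: 0.47
Recall: 0.04
F1 Score: 0.07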


**av). The ROC AUC score of the logistic regression model on the train set is approximately 0.6891. The ROC AUC score of the logistic regression model on the test set is approximately 0.6693.**

**An accuracy of 0.8392 looks high, but it is essentially the baseline rate of fully paid loans. The test AUC of 0.6693 indicates that the model's ability to rank loans by risk, that is, to distinguish loans that will be fully paid from those that will not, is relatively weak.**

**Precision: 0.47**

**Recall: 0.04**

**F1 Score: 0.07**
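
The metrics can be derived directly from the four cells of the test-set cross-tabulation, reading rows as actual and columns as predicted. The following is a small sketch with a hypothetical helper, `classification_metrics`, that guards against division by zero by returning 0.0 (a convention chosen here, not something the notebook specifies).

```python
def classification_metrics(tn, fp, fn, tp):
    """Return (precision, recall, f1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy


# Test-set counts for the full model at a 0.5 threshold
precision, recall, f1, accuracy = classification_metrics(tn=2395, fp=19, fn=443, tp=17)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 score:  {f1:.2f}")
print(f"Accuracy:  {accuracy:.4f}")
```

With these counts the model catches only 17 of the 460 loans that were not fully paid, which is why recall is so low even though accuracy looks respectable.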

In [None]:
logit_reg_rate = smf.logit("NotFullyPaid ~ IntRate", data = df_train)
logit_reg_rate_results = logit_reg_rate.fit()
print(logit_reg_rate_results.summary())

Optimization terminated successfully.
         Current function value: 0.426440
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:           NotFullyPaid   No. Observations:                 6704
Model:                          Logit   Df Residuals:                     6702
Method:                           MLE   Df Model:                            1
Date:                Fri, 10 Nov 2023   Pseudo R-squ.:                 0.03029
Time:                        04:47:25   Log-Likelihood:                -2858.9
converged:                       True   LL-Null:                       -2948.1
Covariance Type:            nonrobust   LLR p-value:                 9.877e-41
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.7806      0.170    -22.263      0.000      -4.113      -3.448
IntRate       16.7572      1.

**bi). IntRate is significant in this model as indicated by its p-value of 0.000. This suggests that the relationship between the interest rate and the probability of the loan not being fully paid is unlikely to be due to chance.**

**No, the variable IntRate was not significant in the first model, as indicated by its p-value of 0.318. This does not mean the interest rate is unrelated to repayment: it is likely correlated with other predictors in the full model, such as Fico, which absorb most of its effect once they are included.**

In [None]:
# Predicted probabilities from the IntRate-only model on the test set
test_int_rate_probs = logit_reg_rate_results.predict(df_test)
highest_int_rate_prediction = test_int_rate_probs.max()
# Count test loans predicted as not fully paid at a 0.5 threshold
pred_not_fully_paid = (test_int_rate_probs > 0.5).sum()
print(highest_int_rate_prediction)
print(pred_not_fully_paid)

0.44362906300227267
0


**bii). Highest predicted probability of a loan not being paid back in full: 0.44362906300227267**

**Number of loans predicted as not fully paid back: 0**
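
To see why no loan crosses the 0.5 threshold, one can solve for the interest rate at which the single-variable model's predicted probability equals 0.5. The sketch below assumes the intercept (-3.7806) and IntRate coefficient (16.7572) from the summary above; the function and variable names are illustrative.

```python
import math

intercept = -3.7806      # from the IntRate-only model summary
coef_int_rate = 16.7572  # from the IntRate-only model summary


def predicted_probability(int_rate):
    """Logistic function applied to the model's linear predictor."""
    z = intercept + coef_int_rate * int_rate
    return 1 / (1 + math.exp(-z))


# The predicted probability is 0.5 exactly when the log-odds are zero
rate_at_half = -intercept / coef_int_rate
print(f"Interest rate needed for p = 0.5: {rate_at_half:.4f}")
```

Because the predicted probability increases with the interest rate and the highest test-set prediction is about 0.44, no test loan carries a rate as high as this break-even value, so a lower threshold would be needed for the model to flag any loans.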

In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Make predictions on both training and test sets
y_train_int_rate = logit_reg_rate_results.predict(df_train)
y_test_int_rate = logit_reg_rate_results.predict(df_test)

# Create a Pandas DataFrame for each set of predictions
train_int_rate_predictions = pd.DataFrame({'Actual': df_train["NotFullyPaid"], 'Predicted': (y_train_int_rate > 0.5).astype(int)})
test_int_rate_predictions = pd.DataFrame({'Actual': df_test["NotFullyPaid"], 'Predicted': (y_test_int_rate > 0.5).astype(int)})

# Calculate cross-tabulation tables for both training and test sets
train_cross_tab_int = pd.crosstab(train_int_rate_predictions['Actual'], train_int_rate_predictions['Predicted'], rownames=['Actual'], colnames=['Predicted'])
test_cross_tab_int = pd.crosstab(test_int_rate_predictions['Actual'], test_int_rate_predictions['Predicted'], rownames=['Actual'], colnames=['Predicted'])

print(train_cross_tab_int)
print(test_cross_tab_int)

Predicted     0
Actual         
0          5631
1          1073
Predicted     0
Actual         
0          2414
1           460


In [None]:
# Accuracy of the IntRate-only model on the test set at a 0.5 threshold
accuracy_b = (test_int_rate_predictions['Predicted'] == test_int_rate_predictions['Actual']).mean()

print("Accuracy:", round(accuracy_b, 6))

Accuracy: 0.839944


In [None]:
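# Counts from the test-set cross-tabulation for the IntRate-only model:
# it never predicts class 1, so there are no true or false positives.
TP = 0
FP = 0
FN = 460

# Precision is undefined with no positive predictions; report it as 0
precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0

# Calculate recall
recall = TP / (TP + FN)

# F1 is 0 whenever precision and recall are both 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")

Precision: 0.00
Recall: 0.00
F1 Score: 0.00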


In [None]:
def calculate_recall(tp, fn):
    # Recall = TP / (TP + FN); returns 0.0 when there are no actual positives
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

In [None]:
y_train_b = df_train["NotFullyPaid"]
y_test_b = df_test["NotFullyPaid"]

fpr_train_b, tpr_train_b, _ = roc_curve(y_train_b, y_train_int_rate)
roc_auc_train_b = auc(fpr_train_b, tpr_train_b)

fpr_test_b, tpr_test_b, _ = roc_curve(y_test_b, y_test_int_rate)
roc_auc_test_b = auc(fpr_test_b, tpr_test_b)

print(roc_auc_train_b)
print(roc_auc_test_b)

0.6217173670648584
0.6165961240589315


**biii). The ROC AUC score of the logistic regression model on the train set is approximately 0.6217. The ROC AUC score of the logistic regression model on the test set is approximately 0.6166.**

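**This model's AUC is lower than the full model's (0.6166 versus 0.6693 on the test set), meaning it is worse at distinguishing fully paid loans from those not fully paid.**

**Accuracy: 0.8399 (equal to the baseline, since every loan is predicted as fully paid)**

**Precision: undefined (no loans are predicted as not fully paid)**

**Recall = 0**

**F1 score = 0**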
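
AUC can be read as the probability that a randomly chosen loan that was not fully paid receives a higher predicted risk than a randomly chosen loan that was fully paid, so higher is better and 0.5 is no better than chance. The sketch below illustrates that interpretation with a hypothetical `pairwise_auc` helper on made-up scores; it is not how `roc_curve` and `auc` compute the value internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))

# A model that gives every loan the same score cannot rank at all
print(pairwise_auc(y_true, [0.2, 0.2, 0.2, 0.2]))
```

This also shows why AUC is still informative for the IntRate-only model even though it predicts no positives at the 0.5 threshold: AUC depends only on the ordering of the predicted probabilities, not on where the threshold sits.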

In [None]:
import math

def calculate_total_amount_paid_back(investment, interest_rate, time_period):
    total_amount_paid_back = investment * math.exp(interest_rate * time_period)
    return total_amount_paid_back

investment = 10  # Initial investment amount
interest_rate = 0.06  # Annual interest rate
time_period = 3  # Number of years

total_paid_back = calculate_total_amount_paid_back(investment, interest_rate, time_period)
print(f"Total Amount Paid Back: ${total_paid_back:.2f}")

Total Amount Paid Back: $11.97


**ci). The amount paid back after 3 years is: $11.97**

In [None]:
def calculate_profit_or_loss(total_amount_paid_back, initial_investment):
    # A positive result is a profit, a negative result is a loss
    return total_amount_paid_back - initial_investment

initial_investment = 10
total_amount_paid_back = 11.97

profit_or_loss = calculate_profit_or_loss(total_amount_paid_back, initial_investment)
print(f"Profit or Loss: ${profit_or_loss:.2f}")

Profit or Loss: $1.97


In [None]:
value_of_investment = 11.97
cost_of_investment = 10

profit_if_paid_back = value_of_investment - cost_of_investment
print(f"Profit if investment is paid back in full: ${profit_if_paid_back:.2f}")

Profit if investment is paid back in full: $1.97


**cii). Profit if investment is paid back in full: V - c = 11.97 - 10 = $1.97**

**Profit if investment is not paid back in full: amount received - c, which is -c = -$10 in the worst case where nothing is recovered**


In [None]:
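# Profit on a $1 investment held for 3 years with continuous compounding
df_test['Profit'] = np.exp(df_test['IntRate'] * 3) - 1
# Pessimistic assumption: a loan not fully paid loses the whole dollar
df_test.loc[df_test['NotFullyPaid'] == 1, 'Profit'] = -1
max_profit = df_test['Profit'].max()
print(max_profit)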

0.8894768654675331


**ciii). Maximum profit on a $1 investment: $0.8895**

In [None]:
# Filter loans with interest rate >= 15%
HighInterest = df_test[df_test['IntRate'] >= 0.15]

# Calculate average profit of $1 investment in high-interest loans
average_profit = HighInterest['Profit'].mean()

# Count high-interest loans by repayment status (0 = fully paid, 1 = not fully paid)
not_paid_back_prop =  HighInterest['NotFullyPaid'].value_counts()

print(average_profit)
print(not_paid_back_prop)

0.23986972902172937
0    316
1    103
Name: NotFullyPaid, dtype: int64


In [None]:
prop_not_paid = not_paid_back_prop[1] / not_paid_back_prop.sum()
print(prop_not_paid)

0.2458233890214797


**civ). The average profit of a $1.00 investment in one of these high-interest loans is approximately $0.24, and about 24.6% of them (103 of 419) were not paid back in full.**
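
The trade-off between a higher interest rate and a higher chance of default can be sketched as an expected profit per dollar. The snippet below assumes, like the Profit column above, 3 years of continuous compounding and a total loss of the dollar when a loan is not fully paid; the function names are hypothetical.

```python
import math


def expected_profit_per_dollar(int_rate, default_prob, years=3):
    """Expected profit on $1: full repayment with interest, or a total loss."""
    profit_if_paid = math.exp(int_rate * years) - 1
    return (1 - default_prob) * profit_if_paid + default_prob * (-1.0)


def break_even_default_prob(int_rate, years=3):
    """Default probability at which the expected profit is exactly zero."""
    return 1 - math.exp(-int_rate * years)


rate = 0.15
p_star = break_even_default_prob(rate)
print(f"Break-even default probability at {rate:.0%}: {p_star:.4f}")
print(f"Expected profit at a 25% default rate: {expected_profit_per_dollar(rate, 0.25):.4f}")
```

Under these assumptions, a 15% loan stays profitable on average as long as fewer than roughly 36% of such loans default, which is consistent with the positive average profit observed for the high-interest group.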

In [None]:
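# Sort all test-set loans by PredictedRisk in ascending order and keep the first 100.
# PredictedRisk is the boolean 0.5-threshold flag set earlier, so this ordering only
# separates loans predicted as fully paid (False) from those predicted as not fully paid (True).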
sorted_loans = df_test.sort_values('PredictedRisk', ascending=True)
# print(sorted_loans)
SelectedLoans = sorted_loans.head(100)

print(SelectedLoans)

total_profit = SelectedLoans['Profit'].sum()
print(total_profit)

      CreditPolicy             Purpose  IntRate  Installment  LogAnnualInc  \
1102             1           all_other   0.1008        51.69     10.714418   
733              1  debt_consolidation   0.1008       290.75     10.571317   
4232             1         credit_card   0.1114       328.04     11.496796   
8977             0  debt_consolidation   0.1505       312.23     10.433998   
4728             1  debt_consolidation   0.1253       267.74     10.308953   
...            ...                 ...      ...          ...           ...   
6741             1      major_purchase   0.1183       297.38     10.085809   
2858             1           all_other   0.1600       351.58     11.385047   
9090             0  debt_consolidation   0.1316       540.33     12.676089   
5381             1           all_other   0.1253       160.64      9.505991   
4972             1         credit_card   0.0751       186.66     11.711776   

        Dti  Fico  DaysWithCrLine  RevolBal  RevolUtil  InqLast

In [None]:
not_paid_back_count = SelectedLoans['NotFullyPaid'].sum()
print(not_paid_back_count)

16


cv). The investor who put $1 into each of the 100 selected loans, 16 of which were not paid back in full, made a profit of about $19.80.

On the other hand, the simple strategy of investing $100 in all loans yielded a profit of $20.94.

Therefore, the simple strategy of investing in all loans is slightly more profitable than the strategy of investing $1 in each of the 100 selected loans.
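
One possible variant of the selection strategy ranks loans by the model's predicted probability rather than by a thresholded flag, so that the ordering among loans is meaningful. The sketch below uses a few made-up loans and hypothetical field names (`int_rate`, `predicted_prob`, `not_fully_paid`); it illustrates the idea only and does not reproduce the figures above.

```python
import math


def select_lowest_risk(loans, min_rate=0.15, n=100):
    """Keep loans at or above min_rate, then take the n lowest predicted risks."""
    high_interest = [loan for loan in loans if loan["int_rate"] >= min_rate]
    return sorted(high_interest, key=lambda loan: loan["predicted_prob"])[:n]


def profit_per_dollar(loan, years=3):
    """Profit on $1: compounded interest if repaid, total loss otherwise."""
    if loan["not_fully_paid"]:
        return -1.0
    return math.exp(loan["int_rate"] * years) - 1


# Made-up loans for illustration only
loans = [
    {"int_rate": 0.16, "predicted_prob": 0.30, "not_fully_paid": 1},
    {"int_rate": 0.15, "predicted_prob": 0.10, "not_fully_paid": 0},
    {"int_rate": 0.10, "predicted_prob": 0.05, "not_fully_paid": 0},
    {"int_rate": 0.17, "predicted_prob": 0.20, "not_fully_paid": 0},
]

selected = select_lowest_risk(loans, n=2)
total_profit = sum(profit_per_dollar(loan) for loan in selected)
print([loan["predicted_prob"] for loan in selected])
print(round(total_profit, 4))
```

In this toy example the 10% loan is excluded despite having the lowest risk, because the strategy only considers loans with an interest rate of at least 15%.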

d). Predictive models often fail in financial situations because they assume stationarity, which means that the statistical properties of a time series remain constant over time. However, financial markets exhibit non-stationary behavior, with volatility clusters, trends, and regime shifts. This violates the assumption of stationarity and makes it challenging to accurately predict financial outcomes using traditional models.

To improve the situation, analysts can continuously monitor the data, engage in feature engineering, use rolling window analysis, employ ensemble modeling, regularly validate and update models, and leverage domain expertise. These strategies help capture changing patterns and external factors that may impact financial outcomes, ensuring more accurate predictions in non-stationary environments.
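
Rolling window analysis can be sketched as repeatedly training on one block of observations and testing on the block that follows. The generator below is a minimal illustration with a hypothetical name, `rolling_window_splits`, and it assumes the rows are ordered by time, which the loan data in this notebook does not record.

```python
def rolling_window_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) for successive rolling windows.

    Assumes observations are ordered in time; each test block directly
    follows its training block, and the window advances by `step` rows.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        yield train_idx, test_idx
        start += step


splits = list(rolling_window_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Refitting the model on each training window and tracking the metric on each test window is one way to notice when performance drifts as conditions change.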

In [None]:
# Sort the high-interest loans by PredictedRisk and keep the first 100
HighInterest = HighInterest.sort_values(by='PredictedRisk', ascending=True)
SelectedLoans = HighInterest.iloc[:100]

print(SelectedLoans)
print(SelectedLoans['Profit'].sum())

      CreditPolicy             Purpose  IntRate  Installment  LogAnnualInc  \
5769             1  debt_consolidation   0.1565       587.77     11.289782   
7379             1           all_other   0.1682        53.35     10.308953   
8371             0  debt_consolidation   0.1545       334.91     11.066638   
7527             1  debt_consolidation   0.1941       471.89     10.858999   
1360             1           all_other   0.1588       245.69     11.461632   
...            ...                 ...      ...          ...           ...   
9408             0    home_improvement   0.1531       174.08     10.596635   
2061             1  debt_consolidation   0.1695       534.41     11.082204   
7983             0  debt_consolidation   0.1722       357.63     12.072541   
9164             0  debt_consolidation   0.1632       423.77     10.714418   
5811             1  debt_consolidation   0.1531       844.28     11.755872   

        Dti  Fico  DaysWithCrLine  RevolBal  RevolUtil  InqLast

In [None]:
import numpy as np

# Given values
principal = 10  # principal amount in dollars
annual_interest_rate = 6 / 100  # converting the annual interest rate from percent to proportion
time_years = 3  # time period in years

# Continuous compounding formula: A = P * e^(r*t)
amount = principal * np.exp(annual_interest_rate * time_years)
amount

11.972173631218102