# Micro Mortgages - Logistic Regression
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load the required packages and set up the workspace, e.g., set the theme for plotting. Random seeds are fixed where needed via `random_state`.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, auc, RocCurveDisplay
from sklearn.metrics import classification_report

import statsmodels.api as sm
import statsmodels.formula.api as smf


In [None]:
plt.style.use('fivethirtyeight')

## Problem description

In India, there are about 20 million home loan (mortgage) aspirants
working in the informal sector:

- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum

Providing mortgages to this group of customers requires assessing their
creditworthiness quickly and efficiently. Because formal documents and
objective data are lacking, most financial institutions rely on
interview-based processes to decide on these loan requests:

Strengths of the current process:

-   Interview-based field assessment

-   Relaxation of document requirements

Weaknesses of the current process:

-   Costly (total transaction costs as high as 30% of loan volume)

-   Subjective judgments; depends on individual skills and motivations

-   Low reliability across branches and credit officers

-   Risk of corruption and fraud

## Load data

Load the data from a CSV file. The split into training and test sets happens in the next section.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/micro_mortgages/data/micromortgage.csv')

In [None]:
data.head()

## Prepare data

In [None]:
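# Drop the identifier column; it carries no predictive information.
data = data.drop(['ID'], axis=1)
# Prefix Tier with "T" so it is treated as a categorical variable, not a number.
data["Tier"] = data["Tier"].apply(lambda x: "T"+str(x))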

In [None]:
train, test = train_test_split(data, test_size=0.2, random_state=42)

## Exploratory data analysis

### Descriptive summary statistics

Calculate base rate of mortgage approvals.

In [None]:
train["Decision"].mean()
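
The base rate also sets the bar that any classifier has to beat: a trivial model that always predicts the majority class is correct with probability max(p, 1 - p), where p is the base rate. A minimal sketch of this idea, assuming `Decision` is coded as 0/1; the series below is made up for illustration and is not taken from the data set:

```python
import pandas as pd

# Hypothetical 0/1 decisions, for illustration only (not the real data).
decisions = pd.Series([1, 0, 1, 1, 0, 1, 1, 1, 0, 1], name="Decision")

# For a 0/1 variable, the mean equals the share of ones (the base rate).
base_rate = decisions.mean()

# Accuracy of a trivial model that always predicts the majority class.
majority_accuracy = max(base_rate, 1 - base_rate)

print(f"Base rate: {base_rate:.2f}")
print(f"Majority-class accuracy: {majority_accuracy:.2f}")
```

Applied to the notebook's data, the same calculation would use `train["Decision"]` in place of the toy series.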

### Explore relationships between response and predictors

In [None]:
sns.boxplot(data=train, x="Decision", y="Age")
plt.show()

In [None]:
# YOUR CODE HERE

In [None]:
sns.barplot(data=train, x="Gender", y="Decision")
plt.show()

In [None]:
# YOUR CODE HERE

## Logistic Regression

In [None]:
model_logit = smf.logit(formula='Decision ~ Age + Gender', data=train)
model_logit = model_logit.fit()

In [None]:
print(model_logit.summary())
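
The coefficients in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio, and the logistic function maps a log-odds value back to a probability. The sketch below illustrates both conversions with hypothetical coefficient values (they are not estimates from this notebook) and, for simplicity, ignores `Gender`:

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale, NOT estimates from this notebook.
intercept = -1.0
beta_age = 0.05

# Odds ratio: multiplicative change in the odds per one-year increase in age.
odds_ratio_age = np.exp(beta_age)

def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Predicted approval probabilities for two hypothetical applicants.
p_30 = logistic(intercept + beta_age * 30)
p_40 = logistic(intercept + beta_age * 40)

print(f"Odds ratio per year of age: {odds_ratio_age:.3f}")
print(f"P(approval | age 30) = {p_30:.3f}, P(approval | age 40) = {p_40:.3f}")
```

For the fitted model, `np.exp(model_logit.params)` should yield the corresponding odds ratios for all terms at once.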

In [None]:
pred_proba = model_logit.predict(test)
# Classify as approved (1) if the predicted probability is at least 0.5.
pred_label = (pred_proba >= 0.5).astype(int)

In [None]:
print(classification_report(test["Decision"], pred_label))
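
The precision and recall values in the report depend on the 0.5 cutoff used to turn probabilities into labels. The following sketch computes both metrics by hand for several cutoffs to show the trade-off. The labels and probabilities are toy values for illustration, and `precision_recall` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np

# Toy labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1])

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall of the positive class at a given cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, y_prob, threshold=t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the cutoff tends to raise recall at the expense of precision; which balance is appropriate depends on the relative cost of wrongly approving versus wrongly rejecting a loan.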

In [None]:
fpr, tpr, thresholds = roc_curve(test["Decision"], pred_proba)
auc_score = auc(fpr, tpr)
# Label the curve with the model actually fitted above (logistic regression).
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc_score, estimator_name='Logistic Regression')
roc_display.plot()
plt.show()
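
The AUC can be read as the probability that a randomly chosen approved applicant receives a higher predicted score than a randomly chosen rejected one. A minimal sketch of this pairwise interpretation on toy data; `auc_by_pairs` is a hypothetical helper written for illustration, not a scikit-learn function:

```python
import numpy as np

def auc_by_pairs(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy labels and scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

print(f"AUC = {auc_by_pairs(y_true, y_score):.4f}")
```

On the notebook's data, `auc_by_pairs(test["Decision"], pred_proba)` should agree with `auc_score` up to floating-point error, although the pairwise comparison is only practical for small samples.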