![display relevant image here](path/url/to/image)
- Banner/header image

# Title
- Relevant to Data and Business Context

## Overview
- BLUF (Bottom Line Up Front)
- One paragraph summary of final model performance and business implications
- Frame your 'story'

## Business Understanding

1. Begin by thoroughly analyzing the business context of FinTech Innovations' loan approval process. Write a short summary that:
- Describes the current manual process and its limitations
- Identifies key stakeholders and their needs
- Explains the implications of different types of model errors
- Justifies your choice between classification and regression approaches

2. Define your modeling goals and success criteria:
- Select appropriate evaluation metrics based on business impact
- You must use at least two different metrics
- Consider creating a custom metric
- Establish baseline performance targets
- Document your reasoning for each choice


#### Business Context ####

##### Current State #####

FinTech Innovations currently has a manual approval process for new loans. This is a common practice in the financial services industry, particularly for new or fast-growing companies that have not yet built automated loan processing infrastructure.

In the current process, individual loan officers review each loan application, making a go/no go decision based on company guidelines.

The current state has the following challenges and risks:

- Manual review is time-consuming and inefficient. Each application requires 5-10 minutes of loan officer time that is then unavailable for business development and customer service.
- Manual review is also inconsistent. Individual loan officers interpret lending guidelines differently, and the process is prone to simple human error.
- Loan processing takes days or sometimes weeks. 
    - This creates an opportunity cost for FinTech since delaying loan approval costs the company revenue.
    - This is also a suboptimal experience for FinTech's customers.
- Loan decisions are high-stakes events: each bad loan that is approved costs the company an average of $50,000, and each good loan that is not approved has an opportunity cost of $8,000.

We believe that reimagining the loan approval process and moving to a data-led approach will produce significant benefits to the company, our loan officers, and our customers.

##### Project Stakeholders #####

This work will impact a large number of people in and around FinTech Innovations, but three groups have the most significant stake in it.

1. FinTech Innovations senior leadership team. Rebuilding the loan approval process will require a large investment of time from a wide range of employees, and the senior leadership team will need:
    - Clearly articulated, measurable benefits to justify this change
    - A simple way to understand how the new data tools work, and how they are used
2. FinTech Innovations loan officers. These employees will be directly impacted by any changes we make, and are the company's link to our customers. They will need:
    - An understanding of how their work will change and why
    - Data tools that are easy for them to learn and use
    - Talking points and explanations on how the new loan approval process works for customers
3. FinTech Innovations customers (current and future). This group will be directly impacted by these changes; while they will not have input into our business operations, they will need:
    - A high-level understanding of the loan process and confidence in its reliability and fairness
    - An understanding of the benefits of this change, and what they can expect when they apply for loans.

##### Model Outcome Considerations #####

The goal of this work is a model that performs well across as many measures as possible, but I will prioritize avoiding the misclassification of high-risk applications as low-risk. This error is the most costly to the business: as noted above, a bad loan that is approved costs an average of $50,000, while a good loan that is not approved costs $8,000 in lost opportunity.
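The cost asymmetry can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the two per-loan figures quoted in this section; the function name and the example error counts are hypothetical and chosen purely for illustration.

```python
# Per-loan costs quoted in the Business Context section.
COST_BAD_LOAN_APPROVED = 50_000   # high-risk application approved, then defaults
COST_GOOD_LOAN_REJECTED = 8_000   # low-risk application rejected (lost opportunity)


def total_error_cost(bad_approved: int, good_rejected: int) -> int:
    """Return the total dollar cost of a batch of decision errors."""
    return (bad_approved * COST_BAD_LOAN_APPROVED
            + good_rejected * COST_GOOD_LOAN_REJECTED)


# How many wrongly rejected good loans cost as much as one approved bad loan?
cost_ratio = COST_BAD_LOAN_APPROVED / COST_GOOD_LOAN_REJECTED  # 6.25

# Hypothetical error counts, for illustration only.
example_cost = total_error_cost(bad_approved=10, good_rejected=40)  # 820000
```

On these figures, one approved bad loan costs as much as 6.25 rejected good loans, which is why errors in that direction get the most weight.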

##### My Approach #####

I will seek to build a model that predicts applicants' risk scores. This will give loan officers a tool to make their work more efficient, and we will also get meaningful feedback from the officers directly if/when it is rolled out. This will be a regression exercise to assign each applicant a score from a continuous range.

Although there is meaningful value in trying to predict loan approval decisions, I worry that such a model would codify biases and unfair practices hidden in the current state.
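Because this approach predicts a continuous risk score, one possible way to carry the asymmetric error costs into a regression metric is to weight underestimated risk more heavily than overestimated risk. The sketch below is an assumption, not a metric defined elsewhere in this notebook: `asymmetric_mae` is a hypothetical helper, and the 6.25 weight is simply the $50,000 / $8,000 cost ratio.

```python
import numpy as np

# Assumed weight: the 50,000 / 8,000 cost ratio from the business context.
UNDERESTIMATE_WEIGHT = 50_000 / 8_000  # 6.25


def asymmetric_mae(y_true, y_pred, under_weight=UNDERESTIMATE_WEIGHT):
    """Mean absolute error that penalizes underestimated risk more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    # A predicted risk below the true risk is the costly direction.
    weights = np.where(y_pred < y_true, under_weight, 1.0)
    return float(np.mean(weights * abs_err))


# Same absolute error (4 points), opposite directions.
score_under = asymmetric_mae([50.0], [46.0])  # 25.0
score_over = asymmetric_mae([50.0], [54.0])   # 4.0
```

A function like this could be passed to `sklearn.metrics.make_scorer` with `greater_is_better=False` so that it can be used during cross-validation.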

#### Modeling Goals and Success Criteria ####

##### Evaluation Metrics #####

##### Baseline Performance Targets #####

##### Decision Log #####

## Data Understanding
3. Conduct comprehensive exploratory data analysis:
- Describe basic data characteristics
- Examine distributions of all features and target variables
- Investigate relationships between features
- Create visualizations to help aid in EDA
- Document potential data quality issues and their implications

4. Develop feature understanding:
- Categorize features by type (numerical, categorical, ordinal)
- Identify features requiring special preprocessing
- Document missing value patterns and their potential meanings
- Note potential feature engineering opportunities


In [10]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

In [2]:
data = pd.read_csv('financial_loan_data.csv')

data.head()

Unnamed: 0,Age,AnnualIncome,CreditScore,EmploymentStatus,EducationLevel,Experience,LoanAmount,LoanDuration,MaritalStatus,NumberOfDependents,...,MonthlyIncome,UtilityBillsPaymentHistory,JobTenure,NetWorth,BaseInterestRate,InterestRate,MonthlyLoanPayment,TotalDebtToIncomeRatio,LoanApproved,RiskScore
0,45,"$39,948.00",617,Employed,Master,22,13152,48,Married,2,...,3329.0,0.724972,11,126928,0.199652,0.22759,419.805992,0.181077,0,49.0
1,38,"$39,709.00",628,Employed,Associate,15,26045,48,Single,1,...,3309.083333,0.935132,3,43609,0.207045,0.201077,794.054238,0.389852,0,52.0
2,47,"$40,724.00",570,Employed,Bachelor,26,17627,36,,2,...,3393.666667,0.872241,6,5205,0.217627,0.212548,666.406688,0.462157,0,52.0
3,58,"$69,084.00",545,Employed,High School,34,37898,96,Single,1,...,5757.0,0.896155,5,99452,0.300398,0.300911,1047.50698,0.313098,0,54.0
4,37,"$103,264.00",594,Employed,Associate,17,9184,36,Married,1,...,8605.333333,0.941369,5,227019,0.197184,0.17599,330.17914,0.07021,1,36.0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         20000 non-null  int64  
 1   AnnualIncome                20000 non-null  object 
 2   CreditScore                 20000 non-null  int64  
 3   EmploymentStatus            20000 non-null  object 
 4   EducationLevel              19099 non-null  object 
 5   Experience                  20000 non-null  int64  
 6   LoanAmount                  20000 non-null  int64  
 7   LoanDuration                20000 non-null  int64  
 8   MaritalStatus               18669 non-null  object 
 9   NumberOfDependents          20000 non-null  int64  
 10  HomeOwnershipStatus         20000 non-null  object 
 11  MonthlyDebtPayments         20000 non-null  int64  
 12  CreditCardUtilizationRate   20000 non-null  float64
 13  NumberOfOpenCreditLines     200

In [4]:
# EDA Code Here - Create New Cells As Needed

# Categorical fields and initial values

EmploymentStatus_values = data['EmploymentStatus'].unique()
# 'Employed', 'Self-Employed', 'Unemployed'
EducationLevel_values = data['EducationLevel'].unique()
# 'Master', 'Associate', 'Bachelor', 'High School', nan, 'Doctorate'
MaritalStatus_values = data['MaritalStatus'].unique()
# 'Married', 'Single', nan, 'Divorced', 'Widowed'
HomeOwnershipStatus_values = data['HomeOwnershipStatus'].unique()
# 'Own', 'Mortgage', 'Rent', 'Other'
BankruptcyHistory_values = data['BankruptcyHistory'].unique()
# 'No', 'Yes'
LoanPurpose_values = data['LoanPurpose'].unique() 
# 'Home', 'Debt Consolidation', 'Education', 'Other', 'Auto'
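The `data.info()` output above shows two things worth checking during EDA: `AnnualIncome` is stored as text (for example `$39,948.00`), and `EducationLevel` and `MaritalStatus` have 901 and 1,331 missing values respectively (20,000 rows minus the non-null counts). A minimal sketch of both checks follows, run on a tiny stand-in frame built from the rows shown by `data.head()`; the `sample` frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the loan data, using values from the rows shown by data.head().
sample = pd.DataFrame({
    'AnnualIncome': ['$39,948.00', '$39,709.00', '$40,724.00'],
    'CreditScore': [617, 628, 570],
    'EducationLevel': ['Master', 'Associate', 'Bachelor'],
    'MaritalStatus': ['Married', 'Single', np.nan],
})

# Strip the currency formatting so AnnualIncome can be treated as numeric.
sample['AnnualIncome'] = pd.to_numeric(
    sample['AnnualIncome'].str.replace(r'[$,]', '', regex=True)
)

# Split columns into numeric and non-numeric, then count missing values per column.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
missing_counts = sample.isna().sum()
```

The same three steps should apply to the full `data` frame, where the missing-value counts indicate which columns need an imputation strategy.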


## Data Preparation
5. Design your preprocessing strategy:
- Create separate preprocessing flows for different feature types
- Must utilize ColumnTransformer and Pipeline
- Consider using FeatureUnion as well
- Handle missing values appropriately for each feature
- Handle Categorical and Ordinal data appropriately
- Scale numeric values if model requires it (linear model)
- Document your reasoning for each preprocessing decision



In [5]:
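# Data Prep Code Here - Create New Cells As Needed

# AnnualIncome is read as text (e.g. "$39,948.00"); convert it to a float
# so the numeric pipeline can impute and scale it.
data['AnnualIncome'] = pd.to_numeric(
    data['AnnualIncome'].astype(str).str.replace(r'[$,]', '', regex=True)
)

numeric_features = [
    'AnnualIncome',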
    'CreditScore',
    'Experience',
    'LoanAmount',
    'LoanDuration',
    'NumberOfDependents',
    'MonthlyDebtPayments',
    'CreditCardUtilizationRate',
    'NumberOfOpenCreditLines',
    'NumberOfCreditInquiries',
    'DebtToIncomeRatio',
    'PreviousLoanDefaults',
    'PaymentHistory',  
    'LengthOfCreditHistory',
    'SavingsAccountBalance',
    'CheckingAccountBalance',
    'TotalAssets',
    'TotalLiabilities',
    'MonthlyIncome',
    'UtilityBillsPaymentHistory',
    'JobTenure',
    'NetWorth',
    'BaseInterestRate',
    'InterestRate',
    'MonthlyLoanPayment',
    'TotalDebtToIncomeRatio'
]

categorical_features = [
    'EmploymentStatus',
    'EducationLevel',
    'MaritalStatus',
    'HomeOwnershipStatus',
    'BankruptcyHistory',
    'LoanPurpose'
]

numeric_transformer = Pipeline(steps=[
    # Handle missing values with median 
    ('imputer', SimpleImputer(strategy='median')),
    # Then scale the features
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    # Fill missing values with 'missing' label
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    # Encode categorical variables, drop first to avoid multicollinearity
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
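The preprocessor above one-hot encodes every categorical column, but `EducationLevel` has a natural order and could instead be encoded ordinally. The sketch below shows one way to do that with `OrdinalEncoder`. The ordering in `education_order` is an assumption (lowest to highest attainment), and imputing missing values with the most frequent level is one option among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering, lowest to highest attainment.
education_order = ['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate']

ordinal_transformer = Pipeline(steps=[
    # Fill missing levels with the most common one before encoding.
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Map each level to its position in education_order (0 to 4).
    ('encoder', OrdinalEncoder(categories=[education_order])),
])

# Small illustrative frame; the real column comes from the loan data.
demo = pd.DataFrame(
    {'EducationLevel': ['Master', 'High School', np.nan, 'Master', 'Doctorate']},
    dtype=object,
)
encoded = ordinal_transformer.fit_transform(demo)
```

To use it, `EducationLevel` would move out of `categorical_features` and into its own `('ord', ordinal_transformer, ['EducationLevel'])` entry in the `ColumnTransformer`.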




## Modeling
6. Implement your modeling approach:
- Choose appropriate model algorithms based on your problem definition
- Set up validation strategy with chosen metrics
- Use a train test split and cross validation
- Create complete pipeline including any preprocessing and model
- Document your reasoning for each modeling decision

7. Optimize your model:
- Define parameter grid based on your understanding of the algorithms
- Implement GridSearchCV and/or RandomizedSearchCV with chosen metrics
- Consider tuning preprocessing steps
- Track and document the impact of different parameter combinations
- Consider the trade-offs between different model configurations

NOTE: Be mindful of time considerations - showcase “how to tune” 


In [11]:
#  Modeling Code Here - Create New Cells as Needed
from sklearn.ensemble import RandomForestRegressor

# RiskScore is the regression target. Drop both outcome columns from the
# features so that neither can leak into the model.
X = data.drop(['LoanApproved', 'RiskScore'], axis=1)
y = data['RiskScore']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RiskScore is continuous, so use a regressor rather than a classifier.
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])


In [None]:
rf_param_grid = {
    # Preprocessing parameters
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    
    # Model parameters (class_weight is omitted: it applies only to classifiers)
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [10, 20, None],
    'regressor__min_samples_leaf': [1, 2, 4]
}
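A parameter grid only takes effect once it is passed to a search object. Since the stated approach is regression on `RiskScore`, the sketch below shows how `GridSearchCV` could be run with two regression metrics at once. It uses synthetic data from `make_regression` and a deliberately tiny grid so that it runs quickly on its own; the data, grid values, and choice of RMSE as the refit metric are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Deliberately tiny grid to keep runtime short.
param_grid = {'n_estimators': [25, 50], 'max_depth': [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    # Track two metrics for every parameter combination.
    scoring={'rmse': 'neg_root_mean_squared_error', 'r2': 'r2'},
    refit='rmse',  # choose the final model by the best (least negative) RMSE score
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# With multi-metric scoring, score() uses the refit metric (negated RMSE here).
test_rmse = -search.score(X_test, y_test)
```

Per-combination results for both metrics are available in `search.cv_results_` under `mean_test_rmse` and `mean_test_r2`, which makes it possible to document the impact of each setting.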

## Evaluation and Conclusion
8. Conduct thorough evaluation of final model:
- Assess the model's test data performance using your defined metrics
- Analyze performance across different data segments
- Identify potential biases or limitations
- Visualize model performance
    - Classification: Confusion Matrix/ROC-AUC
    - Regression: Scatter Plot (Predicted vs. Actual values)

9. Extract and interpret feature importance/significance:
- Which features had the most impact on your model?
- Does this lead to any potential business recommendations?

10. Prepare your final deliverable:
- Technical notebook with complete analysis
- Executive summary for business stakeholders
- Recommendations for implementation
- Documentation of potential improvements
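
For the segment-level analysis in step 8, one simple option is to compute the error metric separately for each group of applicants. The sketch below does this with a pandas `groupby`. The `results` frame holds hypothetical actual and predicted risk scores, and `EmploymentStatus` is used as the segment only as an example.

```python
import pandas as pd

# Hypothetical actual and predicted risk scores for a handful of applicants.
results = pd.DataFrame({
    'EmploymentStatus': ['Employed', 'Employed', 'Self-Employed',
                         'Self-Employed', 'Unemployed'],
    'actual': [49.0, 52.0, 54.0, 36.0, 60.0],
    'predicted': [47.0, 53.0, 50.0, 40.0, 51.0],
})

results['abs_error'] = (results['actual'] - results['predicted']).abs()

# Mean absolute error and sample count per segment.
segment_mae = results.groupby('EmploymentStatus')['abs_error'].agg(['mean', 'count'])
```

Large differences in error between segments, especially segments with few samples, are a sign of potential bias or limited coverage that should be noted in the final write-up.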