<a href="https://colab.research.google.com/github/mayankbrn/9.9_LoanTap_LogisticRegression/blob/main/09_LoanTap_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LoanTap Logistic Regression

## About the Challenge

### Context

LoanTap is an online platform that delivers customized loan products to millennials. It innovates in an otherwise dull loan segment by offering instant, flexible loans on consumer-friendly terms to salaried professionals and businessmen.

The data science team at LoanTap is building an underwriting layer to determine the creditworthiness of MSMEs as well as individuals.

### How can we help LoanTap?

LoanTap deploys formal credit to salaried individuals and businesses through 4 main financial instruments:

- Personal Loan
- EMI Free Loan
- Personal Overdraft
- Advance Salary Loan

This case study focuses on the underwriting process behind Personal Loan only.

### Problem Statement

Given a set of attributes for an individual, determine whether a credit line should be extended to them. If so, state what the repayment terms should be as part of the business recommendations.


### Dataset

[DataSet Link](https://drive.google.com/file/d/19FfBWDPhDPlc-1FzA50HVfoiMXmYdBUH/view?usp=sharing)

### Column Profiling

- **loan_amnt**: The amount of the loan applied for by the borrower. If the credit department reduces the loan amount, it is reflected in this value.
- **term**: Number of payments on the loan in months (either 36 or 60).
- **int_rate**: Interest rate on the loan.
- **installment**: The monthly payment the borrower owes if the loan is originated.
- **grade**: Loan grade assigned by LoanTap.
- **sub_grade**: Loan subgrade assigned by LoanTap.
- **emp_title**: Job title provided by the borrower when applying for the loan.
- **emp_length**: Employment length in years, ranging from 0 (less than one year) to 10 (ten or more years).
- **home_ownership**: Home ownership status provided by the borrower during registration or from the credit report.
- **annual_inc**: Self-reported annual income provided by the borrower during registration.
- **verification_status**: Indicates if income was verified by LoanTap, not verified, or if the income source was verified.
- **issue_d**: The month the loan was funded.
- **loan_status**: Current status of the loan (Target Variable).
- **purpose**: Category of the loan request provided by the borrower.
- **title**: Loan title provided by the borrower.
- **dti**: Debt-to-income ratio, calculated using the borrower’s total monthly debt payments (excluding mortgage and the requested LoanTap loan), divided by the borrower’s self-reported monthly income.
- **earliest_cr_line**: Month the borrower's earliest reported credit line was opened.
- **open_acc**: Number of open credit lines in the borrower's credit file.
- **pub_rec**: Number of derogatory public records.
- **revol_bal**: Total revolving credit balance.
- **revol_util**: Revolving line utilization rate, or the amount of credit the borrower is using relative to available revolving credit.
- **total_acc**: Total number of credit lines currently in the borrower's credit file.
- **initial_list_status**: Initial listing status of the loan (values: W, F).
- **application_type**: Indicates whether the loan is an individual or joint application.
- **mort_acc**: Number of mortgage accounts.
- **pub_rec_bankruptcies**: Number of public record bankruptcies.
- **address**: Address of the individual.
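
The `dti` and `revol_util` definitions above are simple ratios. A minimal sketch of the arithmetic follows; the function names, the `total_revolving_limit` parameter, and the 0 to 100 percent scale are assumptions made for illustration, not confirmed details of how LoanTap computes these columns.

```python
def debt_to_income(monthly_debt_payments: float, annual_income: float) -> float:
    """Return DTI as a percentage of monthly income.

    Assumption: monthly_debt_payments already excludes the mortgage and the
    requested loan, and the ratio is reported on a 0-100 percent scale.
    """
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    monthly_income = annual_income / 12
    return 100 * monthly_debt_payments / monthly_income


def revolving_utilization(revol_bal: float, total_revolving_limit: float) -> float:
    """Return revolving credit used as a percentage of the available limit.

    total_revolving_limit is a hypothetical input; it is not a dataset column.
    """
    if total_revolving_limit <= 0:
        raise ValueError("total_revolving_limit must be positive")
    return 100 * revol_bal / total_revolving_limit


# A borrower earning 60,000 a year with 1,500 in monthly debt payments.
example_dti = debt_to_income(1500, 60000)
# A 4,500 revolving balance against a 10,000 limit.
example_util = revolving_utilization(4500, 10000)
print(example_dti, example_util)
```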


### Concepts Used

- Exploratory Data Analysis
- Feature Engineering
- Logistic Regression
- Precision Vs Recall Tradeoff


### What does good look like?

## Analysis and Modeling Steps

1. **Data Import & Initial Analysis**:
   - Import the dataset and perform exploratory data analysis, examining the structure and characteristics of the dataset.

2. **Target Variable Analysis**:
   - Analyze the dependency of the target variable (`loan_status`) on predictor variables using visualizations (count plots, box plots, heatmaps, etc.).

3. **Correlation Analysis**:
   - Check the correlation among independent variables to understand their interactions.

4. **Feature Engineering**:
   - Create binary flags for the following columns, set to 1 where the value is above 1.0 and to 0 otherwise:
     - **pub_rec**
     - **mort_acc**
     - **pub_rec_bankruptcies**

5. **Data Cleaning**:
   - Treat missing values and outliers.

6. **Scaling**:
   - Apply scaling using either MinMaxScaler or StandardScaler.

7. **Modeling**:
   - Use a Logistic Regression model from the scikit-learn or statsmodels library and explain the results.

8. **Results Evaluation**:
   - Evaluate the model using:
     - **Classification Report**
     - **ROC AUC Curve**
     - **Precision-Recall Curve**

9. **Tradeoff Considerations**:
   - **Balancing Detection and False Positives**: How can the model accurately detect defaulters while minimizing false positives? This is crucial to prevent lost financing opportunities.
   - **NPA Risks**: Given the risks of non-performing assets, it’s essential to minimize loans issued to likely defaulters to reduce financial risk.

10. **Insights & Recommendations**:
    - Provide actionable insights and recommendations based on the analysis and model results.
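
The flag creation in step 4 can be sketched with pandas as follows. The rows are made-up toy values, and the `_flag` column suffix is a hypothetical naming choice. The cutoff follows the step as written (strictly above 1.0); a different cutoff may suit the real data better.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "pub_rec": [0, 1, 2, 0],
    "mort_acc": [0.0, 3.0, 1.0, None],
    "pub_rec_bankruptcies": [0, 0, 2, 1],
})

FLAG_COLUMNS = ["pub_rec", "mort_acc", "pub_rec_bankruptcies"]

for col in FLAG_COLUMNS:
    # Values strictly above 1.0 become 1; everything else becomes 0.
    # A missing value compares as False, so it also becomes 0.
    df[f"{col}_flag"] = (df[col] > 1.0).astype(int)

print(df)
```

Because missing values silently map to 0 here, it may be safer to treat missing values before building the flags.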


### Evaluation Criteria (100 points):

### 1. Problem Definition and Exploratory Data Analysis (10 Points)
   - **Problem Statement**: Define the problem based on the provided statement, with additional perspectives if relevant.
   - **Data Overview**: Inspect the data shape and attribute data types, convert categorical attributes to the 'category' type (if needed), detect missing values, and generate a statistical summary.
   - **Univariate Analysis**: Plot distribution of continuous variables and use barplots/countplots for categorical variables.
   - **Bivariate Analysis**: Explore relationships between key variables.
   - **Insights**: Summarize key insights from EDA, including:
     - Attribute ranges and detection of outliers.
     - Variable distribution characteristics and relationships.
     - Observations from each univariate and bivariate plot.
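
The data overview items above might look like the sketch below. The DataFrame is a made-up stand-in for the real file, and the category labels (`"Fully Paid"`, `"Charged Off"`, `"RENT"`, `"MORTGAGE"`) are assumed values used only for illustration.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (values and labels are assumptions).
df = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0, 15600.0, 7200.0],
    "grade": ["B", "B", "C", "A"],
    "home_ownership": ["RENT", "MORTGAGE", "RENT", None],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
})

# Shape and data types.
print(df.shape)
print(df.dtypes)

# Convert categorical attributes to the 'category' type.
categorical_cols = ["grade", "home_ownership", "loan_status"]
df = df.astype({col: "category" for col in categorical_cols})

# Missing values per column, and a statistical summary of every column.
missing_counts = df.isna().sum()
summary = df.describe(include="all")
print(missing_counts)
print(summary)
```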

### 2. Data Preprocessing (20 Points)
   - **Duplicate Check**: Identify and remove duplicate values.
   - **Missing Values**: Treat missing values appropriately.
   - **Outlier Treatment**: Address outliers in the dataset.
   - **Feature Engineering**: Develop new features as needed.
   - **Data Preparation**: Prepare data for modeling.
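
One possible way to carry out the duplicate, missing-value, and outlier steps is sketched below on toy data. Median imputation, the `"Unknown"` placeholder, and clipping to the 1.5 × IQR fences are illustrative choices, not the treatment the case study prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the second and third rows are exact duplicates.
df = pd.DataFrame({
    "annual_inc": [40000.0, 52000.0, 52000.0, 61000.0, 75000.0, 900000.0],
    "mort_acc": [0.0, 1.0, 1.0, np.nan, 2.0, 3.0],
    "emp_title": ["Teacher", "Manager", "Manager", None, "Nurse", "Owner"],
})

# 1. Duplicate check: drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Missing values: median for a numeric column, placeholder for text.
df["mort_acc"] = df["mort_acc"].fillna(df["mort_acc"].median())
df["emp_title"] = df["emp_title"].fillna("Unknown")

# 3. Outlier treatment: clip income to the 1.5 * IQR fences.
q1, q3 = df["annual_inc"].quantile([0.25, 0.75])
iqr = q3 - q1
df["annual_inc"] = df["annual_inc"].clip(lower=q1 - 1.5 * iqr,
                                         upper=q3 + 1.5 * iqr)

print(df)
```

Clipping keeps every row, which matters when defaulters are already the minority class; dropping outlier rows is the alternative when the extreme values look like data errors.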

### 3. Model Building (10 Points)
   - **Logistic Regression Model**: Build the model and analyze model statistics.
   - **Coefficients**: Display model coefficients with respective column names.
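
A minimal scikit-learn sketch of fitting the model and listing coefficients by column name follows. The data is synthetic, the relationship between features and target is invented, and encoding a defaulter as 1 is an assumption. The features are standardized first so that coefficient magnitudes are comparable across columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features named after dataset columns (values are made up).
X = pd.DataFrame({
    "int_rate": rng.uniform(5, 30, n),
    "dti": rng.uniform(0, 40, n),
    "annual_inc": rng.uniform(20000, 150000, n),
})

# Invented relationship: higher rate and DTI raise default risk,
# higher income lowers it. Target: 1 = defaulter (assumed encoding).
logit = (0.15 * (X["int_rate"] - 17)
         + 0.05 * (X["dti"] - 20)
         - 0.00002 * (X["annual_inc"] - 85000))
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scale, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Pair each coefficient with its column name, largest magnitude first.
coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0],
    index=X.columns,
).sort_values(key=lambda s: s.abs(), ascending=False)
print(coefs)
```

For p-values and confidence intervals on each coefficient, the statsmodels `Logit` model is the usual alternative.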

### 4. Results Evaluation (50 Points)
   - **ROC AUC Curve & Analysis** (10 Points)
   - **Precision-Recall Curve & Analysis** (10 Points)
   - **Classification Report**: Generate and analyze the confusion matrix and other metrics (10 Points).
   - **Tradeoff Questions**:
     - **Defaulter Detection and False Positives**: How can the model detect actual defaulters while minimizing false positives? Reducing false positives is crucial to avoid missed financing opportunities (10 Points).
     - **NPA Risk Management**: Since NPAs (non-performing assets) pose a major risk, ensure the model minimizes loan disbursement to high-risk borrowers (10 Points).
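
The tradeoff questions above come down to where the decision threshold sits on the precision-recall curve. The sketch below uses synthetic data from `make_classification`, treats class 1 as the defaulter (an assumption), and picks the highest threshold that still reaches a hypothetical recall target of 0.80.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: about 20% positives (assumed: 1 = defaulter).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of default

auc = roc_auc_score(y_test, proba)
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# recall has one more entry than thresholds; drop the final (recall = 0) point.
target_recall = 0.80  # hypothetical business target
meets_target = recall[:-1] >= target_recall
threshold = thresholds[meets_target].max()

# Flag a borrower as a likely defaulter when the probability reaches the threshold.
y_pred = (proba >= threshold).astype(int)

print(f"ROC AUC: {auc:.3f}, chosen threshold: {threshold:.3f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["fully paid", "defaulter"]))
```

Lowering the threshold catches more defaulters (higher recall, lower NPA risk) but flags more good borrowers (lower precision, more missed financing opportunities); raising it does the opposite.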

### 5. Actionable Insights & Recommendations (10 Points)
   - Provide actionable insights based on the model results and suggest recommendations for improving model deployment, risk assessment, and lending strategies.

### Questionnaire


1. **Percentage of Fully Paid Loans**:
   - Calculate the percentage of customers who have fully paid their loan amount.

2. **Correlation Analysis**:
   - Comment on the correlation between the `loan_amnt` and `installment` features.

3. **Home Ownership**:
   - The majority of customers have home ownership as **[insert most common category here]**.

4. **Loan Grade and Payment Likelihood**:
   - People with grades ‘A’ are more likely to fully pay their loan. **(True/False)**

5. **Top Job Titles**:
   - The top 2 job titles among borrowers who were afforded a loan are **[insert top job title 1]** and **[insert top job title 2]**.

6. **Primary Focus Metric**:
   - From a bank’s perspective, the primary metric to focus on should be:
     - **ROC AUC**
     - **Precision**
     - **Recall**
     - **F1 Score**

7. **Precision-Recall Tradeoff**:
   - The gap in precision and recall affects the bank by **[explain how the difference impacts loan approval and risk assessment]**.

8. **Influential Features**:
   - The features that heavily affected the outcome are **[list key influential features]**.

9. **Geographical Impact**:
   - Will the results be affected by geographical location? **(Yes/No)**
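
Questions 1 to 3 are direct pandas computations. The sketch below uses a made-up five-row table, and the labels `"Fully Paid"`, `"Charged Off"`, `"RENT"`, and `"MORTGAGE"` are assumed values; the real answers come from the full dataset.

```python
import pandas as pd

# Toy rows (made up) with assumed category labels.
df = pd.DataFrame({
    "loan_amnt": [10000, 8000, 15600, 7200, 24375],
    "installment": [329.48, 265.68, 506.97, 220.65, 609.33],
    "home_ownership": ["RENT", "MORTGAGE", "RENT", "RENT", "MORTGAGE"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off"],
})

# Q1: percentage of customers who fully paid their loan.
pct_fully_paid = (df["loan_status"] == "Fully Paid").mean() * 100

# Q2: Pearson correlation between loan amount and installment.
amount_installment_corr = df["loan_amnt"].corr(df["installment"])

# Q3: most common home ownership category.
top_home_ownership = df["home_ownership"].mode()[0]

print(pct_fully_paid, amount_installment_corr, top_home_ownership)
```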


## Solution ▶