#### Bank Loan Performance Review

##### Project Objective
This project demonstrates the use of SQL and pandas for data analysis: each KPI is computed once with pandas and once with an equivalent SQL query.

##### Project requirements
##### Overview KPIs
    Sliced by All, loan status, term, Purpose, average Debt-to-Income Ratio (DTI), home ownership
1.	Total loan application
2.	Total loan amount
3.	Average loan amount

##### Why do customers apply for loans?
1.	Purpose
2.	Term

##### Good Loans KPIs
    Good loans (loans with status 'Fully Paid' or 'Current'): sliced by Purpose, Term, home ownership
1.	Count
2.	total amount and percentage of good loans
3.	average years of credit history
4.	years in current job
5.	average Debt-to-Income Ratio (DTI)
6.	average credit score
7.	average number of open accounts
8.	average number of credit problems
9.	average maximum open credit.

##### Bad Loans KPIs
    Bad loans (loans with status 'Charged Off'): sliced by Purpose, Term, home ownership
1.	Count
2.	total amount and percentage of bad loans
3.	average years of credit history
4.	years in current job
5.	average Debt-to-Income Ratio (DTI)
6.	average credit score
7.	average number of open accounts
8.	average number of credit problems
9.	average maximum open credit.
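
The good-loan and bad-loan definitions above can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual computation: the `loan_quality`, `dti`, and `pct_of_loans` names are hypothetical, and the DTI formula (monthly debt divided by monthly income) is an assumption, since the requirements do not define it.

```python
import pandas as pd

# Toy rows using the notebook's lower-cased column names; values are made up
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Current"],
    "current_loan_amount": [200_000, 100_000, 300_000, 400_000],
    "monthly_debt": [1_000.0, 2_000.0, 1_500.0, 500.0],
    "annual_income": [60_000.0, 48_000.0, 90_000.0, 120_000.0],
})

# Good loans are 'Fully Paid' or 'Current'; bad loans are 'Charged Off'
quality_map = {"Fully Paid": "good", "Current": "good", "Charged Off": "bad"}
loans["loan_quality"] = loans["loan_status"].map(quality_map)

# Assumed DTI definition: monthly debt divided by monthly income
loans["dti"] = loans["monthly_debt"] / (loans["annual_income"] / 12)

kpis = loans.groupby("loan_quality").agg(
    count=("loan_status", "size"),
    total_amount=("current_loan_amount", "sum"),
    average_dti=("dti", "mean"),
)
# Share of all loans that fall into each group, in percent
kpis["pct_of_loans"] = 100 * kpis["count"] / len(loans)
print(kpis)
```

The remaining averages in the list (credit score, open accounts, and so on) would be added as further entries in the same `agg` call.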


In [125]:
import pandas as pd
from sqlalchemy import create_engine
pd.set_option("display.max_columns", None)

In [126]:
# Establish a database connection
with open('project_secret.txt', 'r') as file:
    driver = file.readline().strip()
    server_name = file.readline().strip()
    database = file.readline().strip()
    username = file.readline().strip()
    password = file.readline().strip()
    Table_credit_train = file.readline().strip()
    Table_credit_test = file.readline().strip()

# To create a SQLAlchemy engine
engine = create_engine(f'mssql+pyodbc://{username}:{password}@{server_name}/{database}?driver={driver}')
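
If the password or driver name read from `project_secret.txt` contains characters such as `@`, `/`, or spaces (for example a driver named `ODBC Driver 17 for SQL Server`), interpolating them directly into the URL can break parsing. A minimal sketch of one way to guard against that, using only the standard library's `urllib.parse.quote_plus`; the helper name `build_mssql_url` and the sample values are hypothetical:

```python
from urllib.parse import quote_plus


def build_mssql_url(username, password, server_name, database, driver):
    """Return an mssql+pyodbc URL with the unsafe parts percent-encoded."""
    return (
        f"mssql+pyodbc://{quote_plus(username)}:{quote_plus(password)}"
        f"@{server_name}/{database}?driver={quote_plus(driver)}"
    )


# Hypothetical values, for illustration only
url = build_mssql_url(
    "analyst", "p@ss/word", "localhost", "bank", "ODBC Driver 17 for SQL Server"
)
print(url)
```

The resulting string could be passed to `create_engine` in place of the f-string used in the cell above.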


In [127]:
# To load table in pandas df
query = f"SELECT * FROM {Table_credit_train}"
df_initial = pd.read_sql(query, engine)

In [128]:
print(f"Number of data entries: {len(df_initial)}")

Number of data entries: 100000


In [129]:
# To load every row whose Loan_ID appears more than once, together with its occurrence count
query = f"""
with duplicate AS (
	Select Loan_ID, count(*) AS duplicateLoanIDCounter from {Table_credit_train}
	group by Loan_ID
	having count(*) > 1
)

select d.duplicateLoanIDCounter, t.* from {Table_credit_train} AS t
JOIN
duplicate d on d.loan_id = t.loan_id
"""
df_duplicated = pd.read_sql(query, engine)

In [130]:
print(df_duplicated['duplicateLoanIDCounter'].describe())

count    36002.0
mean         2.0
std          0.0
min          2.0
25%          2.0
50%          2.0
75%          2.0
max          2.0
Name: duplicateLoanIDCounter, dtype: float64


In [131]:
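# To reload the data without duplicate rows (one row per Loan_ID).
# Note: ORDER BY Loan_ID inside PARTITION BY Loan_ID ties every row in a partition,
# so which duplicate receives row_num = 1 is not guaranteed to be the same on every run.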
query = f"""
WITH numbered_rows AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Loan_ID ORDER BY Loan_ID) AS row_num
    FROM {Table_credit_train}
)
SELECT *
FROM numbered_rows
WHERE row_num = 1;
"""
df_non_duplicated = pd.read_sql(query, engine)
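
Because `ORDER BY Loan_ID` cannot distinguish rows that share a `Loan_ID`, the database is free to number them in any order. The sketch below shows one way to make the choice repeatable by ordering on a column that does differ between duplicates. It runs against an in-memory SQLite database from the standard library rather than SQL Server, and both the toy rows and the tie-breaker (keep the smaller `Current_Loan_Amount`) are assumptions made for illustration, not the notebook's rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_train (Loan_ID TEXT, Current_Loan_Amount INTEGER)")
conn.executemany(
    "INSERT INTO credit_train VALUES (?, ?)",
    [("a", 99999999), ("a", 250000), ("b", 120000)],  # made-up rows
)

dedup_sql = """
WITH numbered_rows AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Loan_ID
               ORDER BY Current_Loan_Amount  -- assumed tie-breaker
           ) AS row_num
    FROM credit_train
)
SELECT Loan_ID, Current_Loan_Amount
FROM numbered_rows
WHERE row_num = 1
ORDER BY Loan_ID;
"""
rows = conn.execute(dedup_sql).fetchall()
print(rows)
conn.close()
```

With a tie-breaker in place, repeated runs keep the same row for each `Loan_ID`, provided the tie-breaking column itself has no ties within a partition.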

In [183]:
print(f"Number of non-duplicated rows: {len(df_non_duplicated)}")

Number of non-duplicated rows: 81999


In [133]:
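# Sanity check: each duplicated Loan_ID appears exactly twice (min = max = 2 above),
# so half of the 36,002 duplicated rows are dropped
100000 - (36002/2)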

81999.0
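
The shortcut `100000 - 36002/2` works only because every duplicated `Loan_ID` occurs exactly twice. A more general form is: total rows minus (rows in duplicated groups minus the number of duplicated IDs). The following sketch checks that identity on a made-up frame in which one ID occurs three times; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the loan table: "b" appears twice, "d" three times
df = pd.DataFrame({"loan_id": ["a", "b", "b", "c", "d", "d", "d"]})

# Mask of rows whose loan_id occurs more than once
in_dup_group = df["loan_id"].duplicated(keep=False)
dup_rows = int(in_dup_group.sum())                       # rows in duplicated groups
dup_ids = df.loc[in_dup_group, "loan_id"].nunique()      # distinct duplicated IDs

# Rows expected to remain after keeping one row per loan_id
expected_unique = len(df) - (dup_rows - dup_ids)
actual_unique = len(df.drop_duplicates(subset="loan_id"))
print(expected_unique, actual_unique)
```

Applied to the figures in this notebook, 36,002 duplicated rows at two rows per ID give 18,001 duplicated IDs, so 100000 - (36002 - 18001) = 81999, which matches the row count printed above.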

In [134]:
df_final=df_non_duplicated.copy()

In [135]:
df_final.columns = [x.lower() for x in df_final.columns] # Change field name to lower case
df_final.head()

|   | loan_id | customer_id | loan_status | current_loan_amount | term | credit_score | annual_income | years_in_current_job | home_ownership | purpose | monthly_debt | years_of_credit_history | months_since_last_delinquent | number_of_open_accounts | number_of_credit_problems | current_credit_balance | maximum_open_credit | bankruptcies | tax_liens | row_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000757f-a121-41ed-b17b-162e76647c1f | dde79588-12f0-4811-bab0-e2b07f633fcd | Fully Paid | 258082 | Short Term | 746.0 | 950475.0 | 4 years | Rent | Debt Consolidation | 6748.419922 | 11.5 | NaN | 12 | 0 | 330429 | 815782.0 | 0.0 | 0.0 | 1 |
| 1 | 0000afa6-8902-4f8f-b870-25a8fdad0aeb | e49c1a82-a0f7-45e8-9f46-2f75c43f9fbc | Charged Off | 541486 | Long Term | NaN | NaN | 6 years | Rent | Business Loan | 10303.509766 | 17.6 | 73.0 | 7 | 0 | 268337 | 372988.0 | 0.0 | 0.0 | 1 |
| 2 | 00020fb0-6b8a-4b3a-8c72-9c4c847e8cb6 | c9decd06-16f7-44c3-b007-8776f2a9233d | Fully Paid | 99999999 | Short Term | 742.0 | 1230440.0 | 3 years | Home Mortgage | Debt Consolidation | 11073.959961 | 26.799999 | NaN | 11 | 0 | 168720 | 499642.0 | 0.0 | 0.0 | 1 |
| 3 | 00045ecd-59e9-4752-ba0d-679ff71692b3 | b7bce684-b4b0-4b29-af66-eae316bce573 | Fully Paid | 260986 | Short Term | 734.0 | 1314838.0 | 10+ years | Own Home | Debt Consolidation | 16325.94043 | 30.299999 | NaN | 7 | 0 | 189221 | 373890.0 | 0.0 | 0.0 | 1 |
| 4 | 0004f37b-5859-40f6-98d0-367aa3b3f3f1 | f662b062-5fa5-463d-b5c0-4e36d09fcab1 | Fully Paid | 301818 | Short Term | NaN | NaN | 1 year | Own Home | Home Improvements | 14770.219727 | 13.6 | 2.0 | 12 | 0 | 127680 | 1173370.0 | 0.0 | 0.0 | 1 |


In [185]:
len(df_final)

81999

In [137]:
df_final.describe()

|   | current_loan_amount | credit_score | annual_income | monthly_debt | years_of_credit_history | months_since_last_delinquent | number_of_open_accounts | number_of_credit_problems | current_credit_balance | maximum_open_credit | bankruptcies | tax_liens | row_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 81999.0 | 64939.0 | 64939.0 | 81999.0 | 81999.0 | 37378.0 | 81999.0 | 81999.0 | 81999.0 | 81997.0 | 81824.0 | 81991.0 | 81999.0 |
| mean | 12102130.0 | 1168.636366 | 1376561.0 | 18330.633109 | 18.296783 | 35.064236 | 11.114489 | 0.161441 | 293620.3 | 793535.8 | 0.113463 | 0.028064 | 1.0 |
| std | 32198250.0 | 1633.006359 | 1119818.0 | 12127.700799 | 7.043774 | 22.021222 | 4.981266 | 0.473148 | 372614.5 | 9208747.0 | 0.344674 | 0.254642 | 0.0 |
| min | 10802.0 | 585.0 | 76627.0 | 0.0 | 3.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 179234.0 | 711.0 | 847818.0 | 10117.595215 | 13.5 | 16.0 | 8.0 | 0.0 | 113316.0 | 280456.0 | 0.0 | 0.0 | 1.0 |
| 50% | 307912.0 | 732.0 | 1170590.0 | 16075.330078 | 17.0 | 32.0 | 10.0 | 0.0 | 209931.0 | 477774.0 | 0.0 | 0.0 | 1.0 |
| 75% | 519332.0 | 743.0 | 1649248.0 | 23811.370117 | 21.799999 | 51.0 | 14.0 | 0.0 | 366994.5 | 798490.0 | 0.0 | 0.0 | 1.0 |
| max | 100000000.0 | 7510.0 | 165557400.0 | 435843.28125 | 70.5 | 176.0 | 76.0 | 15.0 | 32878970.0 | 1539738000.0 | 7.0 | 15.0 | 1.0 |


In [138]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81999 entries, 0 to 81998
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   loan_id                       81999 non-null  object 
 1   customer_id                   81999 non-null  object 
 2   loan_status                   81999 non-null  object 
 3   current_loan_amount           81999 non-null  int64  
 4   term                          81999 non-null  object 
 5   credit_score                  64939 non-null  float64
 6   annual_income                 64939 non-null  float64
 7   years_in_current_job          81999 non-null  object 
 8   home_ownership                81999 non-null  object 
 9   purpose                       81999 non-null  object 
 10  monthly_debt                  81999 non-null  float64
 11  years_of_credit_history       81999 non-null  float64
 12  months_since_last_delinquent  37378 non-null  float64
 13  n


### KPIs

#### Overview KPIs
    Sliced by All, loan status, term, Purpose, average Debt-to-Income Ratio (DTI), home ownership
1.	Total loan application
2.	Total loan amount
3.	Average loan amount

#### Sliced by All

In [175]:
# Pandas Result
result = {
    "Total loan application": ['{:,.0f}'.format(len(df_final))],
    "Total loan amount": ['{:,.0f}'.format(df_final['current_loan_amount'].dropna().sum())],
    "Average loan amount": ['{:,.0f}'.format(df_final['current_loan_amount'].dropna().mean())]
}
pd.DataFrame(result)

|   | Total loan application | Total loan amount | Average loan amount |
|---|---|---|---|
| 0 | 81999 | 992362799131 | 12102133 |


In [178]:
# SQL Equivalent Result
q = f"""
WITH numbered_rows AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Loan_ID ORDER BY Loan_ID) AS row_num
    FROM {Table_credit_train}
),
clean_data AS (
    SELECT *
    FROM numbered_rows
    WHERE row_num = 1
)

SELECT 
    count(*) AS total_loan_application,
    SUM(CASE WHEN current_loan_amount IS NOT NULL THEN CAST(current_loan_amount AS BIGINT) ELSE 0 END) AS total_loan_amount,
    AVG(CASE WHEN current_loan_amount IS NOT NULL THEN CAST(current_loan_amount AS FLOAT) END) AS average_loan_amount

FROM clean_data;
"""

result = pd.read_sql(q, engine)
print(result)

   total_loan_application  total_loan_amount  average_loan_amount
0                   81999       990672421134         1.208152e+07



#### Sliced by Loan Status

In [141]:
# Pandas Result
status_count = df_final['loan_status'].value_counts()
status_list = status_count.index.tolist()

total_loan_application = []
total_loan_amount = []
average_loan_amount = []

for status in status_list:
    total_loan_application.append(status_count[status])

    # Loan amounts for this loan status (one mask reused for both aggregates)
    filt = df_final['loan_status'] == status
    amounts = df_final.loc[filt, 'current_loan_amount']

    # Total and average loan amount for this loan status
    total_loan_amount.append('{:,.0f}'.format(amounts.sum()))
    average_loan_amount.append('{:,.0f}'.format(amounts.mean()))


result = {
    "Loan status" : status_list,
    "Total loan application" : total_loan_application,
    "Total loan amount" : total_loan_amount,
    "Average loan amount" : average_loan_amount
}


pd.DataFrame(result)

|   | Loan status | Total loan application | Total loan amount | Average loan amount |
|---|---|---|---|---|
| 0 | Fully Paid | 59360 | 985005684971 | 16593762 |
| 1 | Charged Off | 22639 | 7357114160 | 324975 |


In [162]:
# Cross-check the 'Fully Paid' total with both pandas and NumPy
import numpy as np
filt = df_final['loan_status'] == 'Fully Paid'
print(df_final[filt]['current_loan_amount'].sum())
np.sum(df_final[filt]['current_loan_amount'])

985005684971


985005684971

In [165]:
# SQL Equivalent Result
q = f"""
WITH numbered_rows AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Loan_ID ORDER BY Loan_ID) AS row_num
    FROM {Table_credit_train}
),
clean_data AS (
    SELECT *
    FROM numbered_rows
    WHERE row_num = 1
)
SELECT
    loan_status,
    count(*) AS total_loan_application,
    sum(current_loan_amount) AS total_loan_amount,
    avg(current_loan_amount) AS average_loan_amount
FROM clean_data
GROUP BY loan_status
;
"""
result = pd.read_sql(q, engine)
print(result)

   loan_status  total_loan_application  total_loan_amount  average_loan_amount
0   Fully Paid                   59360       982915977010             16558557
1  Charged Off                   22639         7357114160               324975
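
The pandas loop in this section computes the same three aggregates that the SQL query obtains with `GROUP BY`. A minimal sketch of the pandas analogue using `groupby` with named aggregation is shown below; the helper name `kpi_by` and the toy values are hypothetical, and `"count"` counts non-null amounts, which equals the row count only when the amount column has no missing values (as `df_final.info()` reports here).

```python
import pandas as pd


def kpi_by(df, column):
    """Overview KPIs for each value of `column` (pandas analogue of GROUP BY)."""
    return (
        df.groupby(column)["current_loan_amount"]
        .agg(
            total_loan_application="count",
            total_loan_amount="sum",
            average_loan_amount="mean",
        )
        .sort_values("total_loan_application", ascending=False)
    )


# Toy data with made-up values
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off"],
    "term": ["Short Term", "Long Term", "Short Term"],
    "current_loan_amount": [100, 300, 200],
})
by_status = kpi_by(toy, "loan_status")
by_term = kpi_by(toy, "term")
print(by_status)
print(by_term)
```

The same helper would serve the other slices in the requirements (purpose, home ownership) by passing a different column name.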


#### Sliced by Loan term

In [149]:
# Pandas Result
term_count = df_final['term'].value_counts()
term_list = term_count.index.tolist()

total_loan_application = []
total_loan_amount = []
average_loan_amount = []

for term in term_list:
    total_loan_application.append(term_count[term])

    # Loan amounts for this loan term (one mask reused for both aggregates)
    filt = df_final['term'] == term
    amounts = df_final.loc[filt, 'current_loan_amount']

    # Total and average loan amount for this loan term
    total_loan_amount.append('{:,.0f}'.format(amounts.sum()))
    average_loan_amount.append('{:,.0f}'.format(amounts.mean()))


result = {
    "Loan term" : term_list,
    "Total loan application" : total_loan_application,
    "Total loan amount" : total_loan_amount,
    "Average loan amount" : average_loan_amount
}


pd.DataFrame(result)

|   | Loan term | Total loan application | Total loan amount | Average loan amount |
|---|---|---|---|---|
| 0 | Short Term | 61387 | 822613993234 | 13400459 |
| 1 | Long Term | 20612 | 169748805897 | 8235436 |


In [144]:
# SQL Equivalent Result
q = f"""
WITH numbered_rows AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Loan_ID ORDER BY Loan_ID) AS row_num
    FROM {Table_credit_train}
),
clean_data AS (
    SELECT *
    FROM numbered_rows
    WHERE row_num = 1
)
SELECT
    term,
    count(*) AS total_loan_application,
    sum(CAST(current_loan_amount AS BIGINT)) AS total_loan_amount,
    avg(CAST(current_loan_amount AS BIGINT)) AS average_loan_amount
FROM clean_data
GROUP BY term
;
"""
result = pd.read_sql(q, engine)
print(result)

         term  total_loan_application  total_loan_amount  average_loan_amount
0  Short Term                   61387       822717993955             13402153
1   Long Term                   20612       169153738083              8206565
