### CONTENT

1. Business Objective & Problem Statement

2. Importing Required Libraries

3. Data Review and Cleansing
    
4. Data Analysis
     
5. Conclusions & Recommendations

###  BUSINESS OBJECTIVE & PROBLEM STATEMENT

Problem Statement:
Identifying risky loan applicants allows such loans to be reduced, thereby cutting down the amount of credit loss. The aim of this case study is to identify such applicants using exploratory data analysis (EDA).

### IMPORTING REQUIRED LIBRARIES

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 500)

In [None]:
df = pd.read_csv("loan_data_2007_2014.csv", index_col=0, low_memory=False)

### DATA REVIEW & CLEANSING

In [None]:
df.head()

In [None]:
df.info(verbose=True)

In [None]:
df.describe()

Data Observations

Data Type Observations

1. term and zip_code are objects. On review of the data, there appears to be no value in
converting them to numeric variables. They can continue to be treated as categorical variables.
Clean-up is necessary to remove additional signs and words.

2. int_rate and emp_length are objects and will need to be converted to floats.

3. issue_d, earliest_cr_line, last_pymnt_d, next_pymnt_d and last_credit_pull_d are objects and will
   need to be converted to dates.
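
The date conversion mentioned in item 3 could look like the sketch below. It runs on a toy frame rather than the loan data, and it assumes the date strings follow a `Mon-YY` pattern such as `Dec-11`; that format, the sample values and the new column names are assumptions to verify against `df["issue_d"].head()` first.

```python
import pandas as pd

# Toy frame standing in for the loan data; the "Mon-YY" format is an assumption.
sample = pd.DataFrame({"issue_d": ["Dec-11", "Jan-12", None]})

# errors="coerce" turns missing or unparseable entries into NaT instead of raising.
sample["issue_date"] = pd.to_datetime(
    sample["issue_d"], format="%b-%y", errors="coerce"
)

# Derived parts are handy for grouping by month or year later on.
sample["issue_month"] = sample["issue_date"].dt.month
sample["issue_year"] = sample["issue_date"].dt.year
print(sample)
```

Two-digit years are typically resolved with the `%y` convention (69 to 99 become 1969 to 1999, 00 to 68 become 2000 to 2068), so very old `earliest_cr_line` values may need a manual correction after parsing.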

In [None]:
# Review the Null value %
round(df.isnull().sum() / len(df) * 100, 2)

### Null Value Observations

In [None]:
# Collect the columns in df where at least 75% of the values are null; they are dropped in the next cell
null_df = pd.DataFrame(df.isnull().sum() / len(df)).reset_index()
null_list = list(null_df["index"][null_df[0] >= 0.75])
null_list

In [None]:
df.drop(null_list, axis=1, inplace=True)

In [None]:
round(df.isnull().sum() / len(df) * 100, 2)
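
The threshold-based column drop above can be sketched on a small frame to see exactly which columns are caught. The toy data and column names below are made up for illustration; only the 0.75 cutoff comes from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column at the cutoff, one partly null, one complete.
toy = pd.DataFrame(
    {
        "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% null
        "some_null": [1.0, np.nan, 3.0, 4.0],          # 25% null
        "complete": [1, 2, 3, 4],                      # 0% null
    }
)

# isnull().mean() equals isnull().sum() / len(toy): the null fraction per column.
null_fraction = toy.isnull().mean()
threshold = 0.75  # same cutoff as used above

cols_to_drop = null_fraction[null_fraction >= threshold].index.tolist()
trimmed = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(trimmed.columns.tolist())
```

Because the comparison is `>=`, a column that is exactly 75% null is dropped as well.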

In [None]:
df.drop(["desc"], axis=1, inplace=True)

In [None]:
customer_behavior_var = [
    "delinq_2yrs",
    "earliest_cr_line",
    "inq_last_6mths",
    "open_acc",
    "pub_rec",
    "revol_bal",
    "revol_util",
    "total_acc",
    "out_prncp",
    "out_prncp_inv",
    "total_pymnt",
    "total_pymnt_inv",
    "total_rec_prncp",
    "total_rec_int",
    "total_rec_late_fee",
    "recoveries",
    "collection_recovery_fee",
    "last_pymnt_d",
    "last_pymnt_amnt",
    "last_credit_pull_d",
    "mths_since_last_delinq",
    "id",
    "url",
    "title",
]
df = df.drop(customer_behavior_var, axis=1)

In [None]:
df.shape

##### Data Type Review

In [None]:
df.info()

In [None]:
# Review term. It takes only two values, so it is kept as a categorical variable.
# Clean up the data by removing the word 'months' and any surrounding whitespace
df["term"] = df["term"].str.replace("months", "", regex=False).str.strip()
df["term"].value_counts()

In [None]:
# Review interest rate
df["int_rate"].value_counts()

In [None]:
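# Remove the words 'years' and 'year' and the '+' sign, and map '< 1' to 0.5.
# regex=False treats '+' as a literal character; as a regex pattern a bare '+' is invalid.
df["emp_length"] = df["emp_length"].str.replace("years", "", regex=False)
df["emp_length"] = df["emp_length"].str.replace("+", "", regex=False).str.replace("< 1", "0.5", regex=False)
df["emp_length"] = df["emp_length"].str.replace("year", "", regex=False)
df["emp_length"] = df["emp_length"].astype("float64")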

In [None]:
# emp_length is now numeric. There are 1075 missing values.
# These are replaced with the mode, which corresponds to '10+ years'
mode_emplength = df["emp_length"].mode()
df["emp_length"] = df["emp_length"].fillna(mode_emplength[0])
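
As an alternative to the chained replacements, the same clean-up and mode imputation can be sketched with a single regular-expression extraction. The helper name `parse_emp_length` is hypothetical, and the raw strings below are assumed examples of the format, not rows from the data.

```python
import pandas as pd

# Assumed examples of the raw format, including one missing value.
raw = pd.Series(["10+ years", "< 1 year", "3 years", "10+ years", "1 year", None])


def parse_emp_length(s: pd.Series) -> pd.Series:
    """Hypothetical helper: turn employment-length strings into floats."""
    # Treat "< 1 year" as 0.5, the same convention as in the cell above.
    s = s.str.replace("< 1", "0.5", regex=False)
    # Extract the first integer or decimal number; missing input stays NaN.
    return s.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype("float64")


parsed = parse_emp_length(raw)

# Fill the gaps with the most frequent value, as the notebook does.
filled = parsed.fillna(parsed.mode()[0])
print(filled.tolist())
```

Mode imputation is simple but inflates the most common category, which is worth keeping in mind when reading the `emp_length` charts later.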

In [None]:
df["issuemonth"] = df["issue_d"].str[:3]

In [None]:
df.issuemonth.value_counts()

In [None]:
df["pymnt_plan"].value_counts(dropna=False)

In [None]:
df["policy_code"].value_counts()

In [None]:
df["collections_12_mths_ex_med"].value_counts()

In [None]:
df["application_type"].value_counts()

In [None]:
df["initial_list_status"].value_counts()

In [None]:
# Creating a list of variables to drop from the dataframe
unique_col_list = [
    "policy_code",
    "pymnt_plan",
    "collections_12_mths_ex_med",
    "application_type",
    "initial_list_status",
    "acc_now_delinq",
]
df.drop(unique_col_list, axis=1, inplace=True)
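
Rather than inspecting `value_counts()` column by column, columns dominated by a single value can be flagged programmatically. The sketch below uses made-up data, and the 0.99 cutoff is an assumption, not a value taken from the notebook.

```python
import pandas as pd

# Made-up frame: two constant columns and one that varies.
toy = pd.DataFrame(
    {
        "policy_code": [1, 1, 1, 1],
        "application_type": ["INDIVIDUAL"] * 4,
        "grade": ["A", "B", "A", "C"],
    }
)

dominance_cutoff = 0.99  # assumed cutoff; tune as needed

# Share of rows taken by the most frequent value in each column (NaN counted too).
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)
low_variance_cols = top_share[top_share >= dominance_cutoff].index.tolist()
print(low_variance_cols)
```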

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df["emp_title"] = df["emp_title"].fillna("Unknown")

In [None]:
df.isnull().sum()

In [None]:
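# Keep only loans with a final outcome (the two statuses mapped to the target later on).
# .copy() avoids SettingWithCopyWarning when new columns are added to loan_new.
loan_new = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()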

In [None]:
loan_new

### DATA ANALYSIS

#### UNIVARIATE ANALYSIS

In [None]:
loan_new.info()

In [None]:
loan_new.describe()

In [None]:
# Function for numerical univariate analysis
def univariate_num(col, loan_new):
    sns.set(font_scale=2)
    fig, axes = plt.subplots(1, 2, figsize=(30, 10), dpi=50)
    sns.boxplot(ax=axes[0], x=loan_new[col])
    axes[0].set_title(col + " distribution")

    sns.histplot(ax=axes[1], x=loan_new[col], kde=True)
    axes[1].set_title(col + " distribution and density")

    plt.show()

In [None]:
univariate_num("loan_amnt", loan_new)

In [None]:
# Observations
# Loan amount
# 1. The median loan amount applied for is 10K, and 75% of values lie under 15K. We do notice outliers beyond 30K.
# 2. The distribution is right-skewed, with a mean of around 11K.
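
Observations like these can be checked numerically instead of being read off the chart. The sketch below uses synthetic amounts (not taken from the loan data) and the conventional 1.5 × IQR rule that box plots use by default to flag outliers.

```python
import pandas as pd

# Synthetic amounts for illustration only; not taken from the loan data.
amounts = pd.Series([2000, 5000, 8000, 10000, 12000, 15000, 16000, 35000])

q1, median, q3 = amounts.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Points above the upper fence are the ones a box plot draws as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = amounts[amounts > upper_fence]

# A mean above the median is a quick sign of right skew.
right_skewed = amounts.mean() > median

print(q1, median, q3, upper_fence)
print(outliers.tolist(), right_skewed)
```

On the real data the same lines would run with `loan_new["loan_amnt"]` in place of `amounts`.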

In [None]:
univariate_num("funded_amnt", loan_new)

In [None]:
univariate_num("funded_amnt_inv", loan_new)

In [None]:
univariate_num("int_rate", loan_new)

In [None]:
univariate_num("installment", loan_new)

In [None]:
univariate_num("annual_inc", loan_new)

In [None]:
# Reviewing the outliers in income
loan_new[loan_new["annual_inc"] >= 3000000]

In [None]:
# Remove outlier records above 3M. There seem to be additional outliers; that said, eliminating more high values would
# compromise our ability to evaluate loans for high-income customers
loan_new = loan_new[loan_new["annual_inc"] <= 3000000]

In [None]:
univariate_num("annual_inc", loan_new)

In [None]:
univariate_num("dti", loan_new)

In [None]:
univariate_num("emp_length", loan_new)

In [None]:
loan_new["annual_inc_cat"] = pd.qcut(loan_new["annual_inc"], q=5)

In [None]:
loan_new["emp_length_cat"] = pd.cut(loan_new["emp_length"], bins=5)

In [None]:
loan_new["int_rate_cat"] = pd.cut(loan_new["int_rate"], bins=5)

In [None]:
loan_new["loan_amnt_cat"] = pd.cut(loan_new["loan_amnt"], bins=5)
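
The binning cells above use `pd.qcut` for income and `pd.cut` for the other variables. The difference matters for skewed data, as this sketch on made-up incomes shows: `pd.cut` makes bins of equal width, while `pd.qcut` makes bins holding an equal number of rows.

```python
import pandas as pd

# Made-up incomes with one extreme value, mimicking a right-skewed column.
incomes = pd.Series(
    [20_000, 30_000, 40_000, 50_000, 60_000, 70_000, 80_000, 1_000_000]
)

equal_width = pd.cut(incomes, bins=4)  # 4 bins spanning equal value ranges
equal_count = pd.qcut(incomes, q=4)    # 4 bins holding equal numbers of rows

# sort=False keeps the bins in ascending order rather than sorting by count.
print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

With equal-width bins the single extreme income pushes almost every other row into the first bin, which is why quantile bins are the more informative choice for `annual_inc`.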

In [None]:
loan_new.head()

In [None]:
# extracting a list of objects
dtype_object = loan_new.select_dtypes(include=["object"]).columns.tolist()
dtype_object

In [None]:
# Building a function to generate univariate plots for categorical variables
def univariate_cat(col, loan_new):
    sns.set(font_scale=1)
    plt.figure(figsize=[10, 5])
    ax = sns.countplot(x=col, data=loan_new)
    plt.title(col + " distribution")
    # add_value_labels(ax)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
univariate_cat("loan_status", loan_new)

In [None]:
univariate_cat("term", loan_new)

In [None]:
univariate_cat("grade", loan_new)

In [None]:
univariate_cat("sub_grade", loan_new)

In [None]:
print(loan_new["emp_title"].value_counts())
# univariate_cat('emp_title',loan_new)

In [None]:
univariate_cat("verification_status", loan_new)

In [None]:
univariate_cat("home_ownership", loan_new)

In [None]:
print(loan_new["purpose"].value_counts())

In [None]:
univariate_cat("purpose", loan_new)

In [None]:
univariate_cat("issuemonth", loan_new)

In [None]:
loan_new["zip_code"].value_counts().nlargest(20).plot.bar()

In [None]:
# loan_new['addr_state'].value_counts().nlargest(20).plot.bar()#
univariate_cat("addr_state", loan_new)

In [None]:
univariate_cat("annual_inc_cat", loan_new)

In [None]:
univariate_cat("emp_length_cat", loan_new)

In [None]:
univariate_cat("int_rate_cat", loan_new)

In [None]:
univariate_cat("loan_amnt_cat", loan_new)

#### BIVARIATE ANALYSIS

In [None]:
# Define a function for bivariate analysis between the target (loan_status_new) and a chosen categorical variable
def plot_stats(feature, label_rotation=False, horizontal_layout=True):
    sns.set(font_scale=1)
    temp = loan_new[feature].value_counts()
    df1 = pd.DataFrame({feature: temp.index, "Number of Applications": temp.values})

    # Calculate the percentage of loan_status_new =1 per category value
    cat_perc = (
        loan_new[[feature, "loan_status_new"]].groupby([feature], as_index=False).mean()
    )
    cat_perc.sort_values(by="loan_status_new", ascending=False, inplace=True)
    sns.set_style("whitegrid")

    if horizontal_layout:
        sns.set(font_scale=1.5)
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 5))
    else:
        sns.set(font_scale=1.5)
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(20, 10))

    sns.set_color_codes("pastel")
    sns.set_style("whitegrid")

    s = sns.barplot(ax=ax1, x=feature, y="Number of Applications", data=df1)

    if label_rotation:
        s.set_xticklabels(s.get_xticklabels(), rotation=90)

    s = sns.barplot(ax=ax2, x=feature, y="loan_status_new", data=cat_perc)

    if label_rotation:
        s.set_xticklabels(s.get_xticklabels(), rotation=90)

    # The mean of the 0/1 target is a fraction between 0 and 1, not a percentage
    plt.ylabel("Charged-off rate (fraction)")

    plt.tick_params(axis="both", which="major")
    plt.subplots_adjust(wspace=0.2, top=0.9)
    plt.show()

In [None]:
loan_new.columns

In [None]:
# New column addition to display the target variable as a numeric value
loan_new["loan_status_new"] = loan_new["loan_status"].map(
    {"Charged Off": 1, "Fully Paid": 0}
)
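
Because `loan_status_new` is 1 for charged-off loans and 0 for fully paid ones, its mean within a group is that group's charge-off rate, which is the quantity `plot_stats` charts. A minimal sketch on made-up rows, without any plotting:

```python
import pandas as pd

# Made-up rows for illustration; grades and outcomes are not from the loan data.
toy = pd.DataFrame(
    {
        "grade": ["A", "A", "A", "A", "B", "B", "C", "C"],
        "loan_status": [
            "Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
            "Fully Paid", "Charged Off", "Charged Off", "Charged Off",
        ],
    }
)
toy["loan_status_new"] = toy["loan_status"].map({"Charged Off": 1, "Fully Paid": 0})

# count gives the number of applications, mean gives the charge-off rate.
summary = (
    toy.groupby("grade")["loan_status_new"]
    .agg(applications="count", charge_off_rate="mean")
    .sort_values("charge_off_rate", ascending=False)
)
print(summary)
```

Reading the rate next to the application count helps avoid over-interpreting categories that contain only a handful of loans.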

In [None]:
## Box plots of each numeric variable split by loan status, to compare the distributions and spot outliers
sns.set(style="whitegrid")
f, axs = plt.subplots(3, 3, figsize=(15, 20))
plt.subplot(3, 3, 1)
sns.boxplot(x="loan_status_new", y="loan_amnt", data=loan_new)
plt.title("loan_amnt")
plt.subplot(3, 3, 2)
sns.boxplot(x="loan_status_new", y="funded_amnt", data=loan_new)
plt.title("funded_amnt")
plt.subplot(3, 3, 3)
sns.boxplot(x="loan_status_new", y="funded_amnt_inv", data=loan_new)
plt.title("funded_amnt_inv")
plt.subplot(3, 3, 4)
sns.boxplot(x="loan_status_new", y="int_rate", data=loan_new)
plt.title("int_rate")

plt.subplot(3, 3, 5)
sns.boxplot(x="loan_status_new", y="installment", data=loan_new)
plt.title("installment")

plt.subplot(3, 3, 6)
sns.boxplot(x="loan_status_new", y="annual_inc", data=loan_new)
plt.title("annual_inc")

plt.subplot(3, 3, 7)
sns.boxplot(x="loan_status_new", y="dti", data=loan_new)
plt.title("dti")

plt.show()

In [None]:
# numericvariables = ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'installment', 'annual_inc', 'dti',
#        'emp_length','loan_status_new']
# sns.pairplot(data=loan_new[numericvariables],hue="loan_status_new")

In [None]:
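# numeric_only=True skips the object and categorical columns (required in pandas 2.x)
corr = loan_new.corr(numeric_only=True)
# Keep only the upper triangle so each pair appears once; np.bool was removed in NumPy 1.24, so use bool
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))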
corrdf0 = corr.unstack().reset_index()
corrdf0.columns = ["VAR1", "VAR2", "Correlation"]
corrdf0.dropna(subset=["Correlation"], inplace=True)
corrdf0["Correlation"] = round(corrdf0["Correlation"], 2)
# Pairs are ranked by strength regardless of sign, so we take the absolute value of the correlation
corrdf0["Correlation_abs"] = corrdf0["Correlation"].abs()
corrdf0.sort_values(by="Correlation_abs", ascending=False).head(40)
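
The upper-triangle step above keeps each variable pair once and drops the self-correlations on the diagonal. The sketch below shows the same idea end to end on a made-up frame; the values are invented, and the column names are borrowed from the notebook only for readability.

```python
import numpy as np
import pandas as pd

# Made-up values; installment is built to track loan_amnt closely.
toy = pd.DataFrame(
    {
        "loan_amnt": [1000, 2000, 3000, 4000, 5000],
        "installment": [35, 68, 101, 140, 170],
        "dti": [20, 5, 15, 10, 12],
        "grade": ["A", "B", "A", "C", "B"],  # non-numeric, skipped by numeric_only
    }
)

corr = toy.corr(numeric_only=True)

# True strictly above the diagonal: each pair once, no self-correlations.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# unstack() flattens the matrix; dropna() removes the masked cells.
pairs = corr.where(mask).unstack().dropna().reset_index()
pairs.columns = ["VAR1", "VAR2", "Correlation"]
pairs["Correlation_abs"] = pairs["Correlation"].abs()
pairs = pairs.sort_values("Correlation_abs", ascending=False).reset_index(drop=True)
print(pairs)
```

Three numeric columns yield three pairs, with the `loan_amnt` and `installment` pair at the top.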

In [None]:
plot_stats("loan_status", True)

In [None]:
plot_stats("term", True)

In [None]:
plot_stats("int_rate_cat", True)

In [None]:
plot_stats("grade", True)

In [None]:
plot_stats("sub_grade", True, False)

In [None]:
plot_stats("home_ownership", True)

In [None]:
plot_stats("verification_status", True)

In [None]:
plot_stats("loan_amnt_cat", True)

In [None]:
plot_stats("annual_inc_cat", True)

In [None]:
plot_stats("emp_length_cat", True)

In [None]:
plot_stats("issuemonth", True)

In [None]:
plot_stats("purpose", True)

In [None]:
plot_stats("addr_state", True, False)

### CONCLUSIONS & RECOMMENDATIONS

CONCLUSIONS

MAJOR FACTORS INFLUENCING CHARGE-OFFS

1. Interest rate
2. Loan amount
3. Annual income
4. Public record of bankruptcies
5. Verification status is unverified
6. Home ownership status of others, rent and mortgage
7. Grade: lower grades tend to have higher default rates
8. Term
9. Top 5 states of applicant pool
10. Purpose

RECOMMENDATIONS FOR LENDING CLUB

1. Prioritize high grade loans
2. Scrutinize purpose, state and public bankruptcy record 
3. Cap loan amounts above 20K, where the charge-off rate is higher. Similarly, cap interest rates in the default zone
4. Offer reduced repayment terms, such as 48 months
5. Ensure verification is complete for all loans disbursed
6. Engage more with employees who have more than 1 but fewer than 10 years of experience, to expand the pool by learning about their needs
7. Improve the ratio of loan amount to funded_amnt_inv by aligning with investor needs. We observe that not all investors choose to participate and some loans go unbacked
8. For seasonal spending, spend additional time reviewing applications that are likely to default based on further analysis
9. The subgrading system needs improvement to accurately show risk of default
