# Credit Card Default Predictor (DSCI Group Project Proposal)

## Introduction
A credit card default occurs when a bank's client fails to pay the credit card balance, causing the bank or lender to lose money. The chance that an applicant will default is therefore an important factor in deciding whether a bank should approve a credit card application and what credit limit to grant.

This project aims to build a classification model that predicts whether an account holder will default on their next credit card payment. The dataset is downloaded from [Kaggle](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset).

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. [[1]](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset)

We will use all columns except ID as predictors and default.payment.next.month as the response variable, for reasons explained in the data summarization and visualization step.


## Methods and Results

### Library Import 
I used suppressPackageStartupMessages to keep the output clean.

In [1]:
suppressPackageStartupMessages(library(tidyverse))
library(repr)
suppressPackageStartupMessages(library(tidymodels))
options(repr.matrix.max.rows = 6)
library(ggplot2)
suppressPackageStartupMessages(library(gridExtra))
set.seed(9999)

### Data Import

In [2]:
data_url <- url("https://raw.githubusercontent.com/mlool/dsci-100-2023W1-group-008-31/main/data/UCI_Credit_Card.csv")
credit_card_data <- read_csv(data_url, show_col_types = FALSE)
credit_card_data

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,⋯,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,20000,2,2,1,24,2,2,-1,-1,⋯,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,⋯,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,⋯,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
29998,30000,1,2,2,37,4,3,2,-1,⋯,20878,20582,19357,0,0,22000,4200,2000,3100,1
29999,80000,1,3,1,41,1,-1,0,0,⋯,52774,11855,48944,85900,3409,1178,1926,52964,1804,1
30000,50000,1,2,1,46,0,0,0,0,⋯,36535,32428,15313,2078,1800,1430,1000,1000,1000,1
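
All columns are read as `<dbl>`, including categorical codes such as SEX and EDUCATION, so unexpected codes are easy to miss in the printed preview. One way to spot them is to tabulate the distinct values of a column and compare them with the documented set. The sketch below uses a small invented data frame (`toy` and its values are assumptions for illustration, not the project data):

```r
library(dplyr)

# Toy stand-in for credit_card_data; the values are invented for illustration
toy <- data.frame(
  EDUCATION = c(1, 2, 2, 3, 0, 2, 6, 1)
)

# Frequency of each code that actually appears in the column
education_counts <- toy |>
  count(EDUCATION, name = "n_rows") |>
  arrange(EDUCATION)

# Codes present in the data but absent from the documented categories
documented_codes <- 1:6
undocumented <- setdiff(education_counts$EDUCATION, documented_codes)

print(education_counts)
print(undocumented)
```

The same pattern could be applied to any of the coded columns by swapping the column name and the documented set.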


### Cleaning and Wrangling
First, I remove every row that contains a value outside the set of possible values specified by the source (e.g., a 10 in a column that should only take values 0 to 5). I do this before splitting because those invalid entries would be removed anyway. Since they make up the majority of the entries (only 4000 out of 30000 are valid), splitting first and filtering afterwards could leave very few valid entries in the test data.

In [3]:
category_colnames <- c("SEX", "EDUCATION", "MARRIAGE", "default.payment.next.month")
sex_categories <- c(1, 2)
education_categories <- c(1, 2, 3, 4, 5, 6)
marriage_status <- c(1, 2, 3)
pay_status <- c(-1, 1, 2, 3, 4, 5, 6, 7, 8, 9)
credit_card_tidy <- credit_card_data |>
                        rename(PAY_1 = PAY_0) |> # Rename so the PAY_ columns are numbered from 1, like BILL_AMT and PAY_AMT
                        mutate(across(all_of(category_colnames), ~as_factor(.x))) |>
                        filter(SEX %in% sex_categories,
                               EDUCATION %in% education_categories,
                               MARRIAGE %in% marriage_status,
                               PAY_1 %in% pay_status,
                               PAY_2 %in% pay_status,
                               PAY_3 %in% pay_status,
                               PAY_4 %in% pay_status,
                               PAY_5 %in% pay_status,
                               PAY_6 %in% pay_status)
credit_card_tidy

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,⋯,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
<dbl>,<dbl>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
12,260000,2,1,2,51,-1,-1,-1,-1,⋯,8517,22287,13668,21818,9966,8583,22301,0,3640,0
22,120000,2,2,1,39,-1,-1,-1,-1,⋯,0,632,316,316,316,0,632,316,0,1
29,50000,2,3,1,47,-1,-1,-1,-1,⋯,2040,30430,257,3415,3421,2044,30430,257,0,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
29977,40000,1,2,2,47,2,2,3,2,⋯,51259,47151,46934,4000,0,2000,0,3520,0,1
29992,210000,1,2,1,34,3,2,2,2,⋯,2500,2500,2500,0,0,0,0,0,0,1
29995,80000,1,2,2,34,2,2,2,2,⋯,77519,82607,81158,7000,3500,0,7000,0,4000,1
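
How many rows survive the filter can be checked by comparing row counts before and after it. The following is a minimal sketch on an invented data frame (`toy`, `allowed_sex`, and `allowed_pay` are hypothetical names; the allowed values mirror sex_categories and pay_status above):

```r
library(dplyr)

# Toy stand-in for the raw data; values are invented for illustration
toy <- data.frame(
  SEX   = c(1, 2, 2, 1, 2),
  PAY_1 = c(-1, 0, 2, -2, 1)
)

# Allowed codes, mirroring sex_categories and pay_status in the cell above
allowed_sex <- c(1, 2)
allowed_pay <- c(-1, 1, 2, 3, 4, 5, 6, 7, 8, 9)

# Keep only rows whose codes are all in the allowed sets
toy_valid <- toy |>
  filter(SEX %in% allowed_sex, PAY_1 %in% allowed_pay)

rows_before <- nrow(toy)
rows_after  <- nrow(toy_valid)
cat(rows_after, "of", rows_before, "rows kept\n")
```

Running the same comparison on credit_card_data and credit_card_tidy would show how much of the original dataset remains for modelling.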


After removing the invalid entries, we split the data so that all exploratory analysis is performed on the training set only. This avoids violating the golden rule of machine learning: the test set must not influence how the model is built.

In [4]:
credit_card_split <- initial_split(credit_card_tidy, prop = 0.75, strata = default.payment.next.month)
credit_card_training <- training(credit_card_split)
credit_card_testing <- testing(credit_card_split)
credit_card_training

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,⋯,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
<dbl>,<dbl>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
12,260000,2,1,2,51,-1,-1,-1,-1,⋯,8517,22287,13668,21818,9966,8583,22301,0,3640,0
29,50000,2,3,1,47,-1,-1,-1,-1,⋯,2040,30430,257,3415,3421,2044,30430,257,0,0
31,230000,2,1,2,27,-1,-1,-1,-1,⋯,15339,14307,36923,17270,13281,15339,14307,37292,0,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
29977,40000,1,2,2,47,2,2,3,2,⋯,51259,47151,46934,4000,0,2000,0,3520,0,1
29992,210000,1,2,1,34,3,2,2,2,⋯,2500,2500,2500,0,0,0,0,0,0,1
29995,80000,1,2,2,34,2,2,2,2,⋯,77519,82607,81158,7000,3500,0,7000,0,4000,1
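
Because the split is stratified on default.payment.next.month, the share of defaults should be similar in the training and testing sets. A quick check is to compare class proportions after the split. The sketch below does this on synthetic data (the `toy` data frame and its `default` column are assumptions for illustration, not the project data):

```r
library(rsample)

set.seed(9999)

# Synthetic data: 160 non-defaults (0) and 40 defaults (1)
toy <- data.frame(
  x = rnorm(200),
  default = factor(rep(c(0, 1), times = c(160, 40)))
)

# Same call pattern as the project split: 75% training, stratified on the label
toy_split <- initial_split(toy, prop = 0.75, strata = default)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of defaults in each set; stratification should keep these close
train_default_share <- mean(toy_train$default == "1")
test_default_share  <- mean(toy_test$default == "1")

print(c(train = train_default_share, test = test_default_share))
```

If the two shares differed noticeably on the real data, that would suggest the strata argument was not applied as intended.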


### Data Summarization and Visualization