# I. Sample End-to-End Analysis
## A. Challenge
I want to look at loan (application process) info to predict who will default. Before I bias myself looking at the data that's available, I'm going to list out things I'd want to see where possible:
### 1. Info about the Borrower
- FICO score at time of application
- Employment status
- Household AGI (adjusted gross income) on last year's tax return
- Net worth
- Number of bankruptcies in last 20 years (though I don't know if records go beyond 10 years)
- Current or past home ownership
- Marital status
- Highest attained education level
- Age
- Race

### 2. Info about the Loan, Maybe with Borrower Info Baked In
- Loan size
- Interest rate
- Debt-to-income ratio
- Required payment-to-income ratio


## B. Finding the Data
### 1. Concerns About Data Sets:
- Too simplistic, not enough "features"
- Low sample size (under 10,000)
- Imbalanced outcomes (far fewer defaults than non-defaults), which compounds the low-sample-size problem above
- Lots of features, but too many null values which would render many useless
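
To make these concerns concrete, here is a rough screening sketch I could run on any candidate data set. The function name `dataset_red_flags`, the `max_null_share` cutoff of 0.3, and the toy frame are all my own placeholders; only the 10,000-row floor comes from the list above.

```python
import pandas as pd


def dataset_red_flags(df: pd.DataFrame, target: str, min_rows: int = 10_000,
                      max_null_share: float = 0.3) -> dict:
    """Summarize the screening concerns for a candidate data set."""
    # Share of each outcome class; a tiny minority share signals imbalance.
    class_shares = df[target].value_counts(normalize=True)
    # Fraction of missing cells per feature column.
    null_share = df.drop(columns=[target]).isnull().mean()
    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,
        "too_small": len(df) < min_rows,
        "minority_class_share": float(class_shares.min()),
        "mostly_null_features": sorted(null_share[null_share > max_null_share].index),
    }


# Tiny made-up frame, only to show the shape of the report.
toy = pd.DataFrame({
    "income": [50, 60, None, None, 55, 70],
    "age": [30, 41, 25, 38, 52, 47],
    "defaulted": [0, 0, 0, 0, 1, 0],
})
report = dataset_red_flags(toy, target="defaulted")
print(report)
```

The cutoffs are judgment calls, so I would treat the report as a prompt for a closer look rather than a pass/fail gate.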

### 2. Where To Look:
#### A. Kaggle - The LeetCode of ML/Data Science:
I heard about this site because it has competitions, a leaderboard, and curated data sets. What I didn't know was that it also has an unmoderated, GitHub-like free-for-all where anyone can post a data set without a README explaining what the data is or what outcomes were being measured, and nothing stops users from contributing poorly worded or even erroneous interpretations of the data set! That was a real letdown.

#### B. Hugging Face:
This appears to be a GitHub-like website for data sets, where I can search by row count, topic, data format, etc. My shoot-from-the-hip criticisms are:
- Unlike Kaggle, some of these appear aimless in nature. For example, I can find data on social media posts (in English) translated to their Arabic equivalent, with no upvote counts or any worthwhile "target" dimension offered.
- Lack of good documentation
- Cryptic titles, so site visitors are missing half the story until they click a link

#### C. Individual Schools/Research Facilities (What We Use This Time):
In the interest of expediency, I grabbed the first one that fit from **UC Irvine**:

https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients


# II. Generalized Order of Operations
This will be my 2nd rodeo since I got exposed to this process in the classic **Housing Price Prediction** challenge in Chapter 2.

## A. Sense Check Around the Data ("Data Exploration")
## B. Data Preparation
### 1. Cleaning: missing values, duplicates, potentially handling outliers
### 2. Transformation: scaling, normalization, one-hot encoding
### 3. Splitting
## C. Initial Run
### 1. K-Fold Cross-Validation: With strong preference for stratification
### 2. Evaluation of Initial Run
### 3. Narrowing Feature Selection
## D. Final Run and Evaluation
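
Since step C.1 leans on stratification, here is a minimal hand-rolled sketch of what it means: each fold should keep roughly the same class mix as the full data set. The helper `stratified_fold_ids` and the made-up labels are mine, purely for illustration; in practice I would expect to reach for scikit-learn's `StratifiedKFold` rather than this.

```python
import numpy as np


def stratified_fold_ids(y, n_splits=5, seed=0):
    """Assign each sample a fold id so every fold keeps roughly the same class mix."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal this class's samples out to the folds one at a time, like cards.
        fold_ids[idx] = np.arange(len(idx)) % n_splits
    return fold_ids


# Made-up labels with a 20% positive rate (not the real default rate).
y = np.array([0] * 80 + [1] * 20)
folds = stratified_fold_ids(y, n_splits=5)
for k in range(5):
    held_out = y[folds == k]
    print(k, len(held_out), held_out.mean())
```

With a plain random split, a small fold can end up with very few positives by chance; dealing each class out separately avoids that.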



# III. Start of End-to-End Analysis

## A. Data Exploration 

In [3]:
import pandas as pd
imported_credit_card_info = pd.read_csv("uc_irvine_credit_card_data.csv")
print(imported_credit_card_info.shape)

(30001, 25)


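The shape reports 30,001 rows and 25 columns. One of those rows turns out to hold the descriptive column names (see row 0 in the `head()` output below), so we really have 30,000 rows of data. Good start!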

In [2]:
imported_credit_card_info.head()

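| Unnamed: 0.1 | Unnamed: 0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
| 1 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |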
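
Row 0 in the `head()` output above holds the real column names, which also means every column is being read as text. Here is a sketch of one way to promote that row to the header and get numbers back. The in-memory CSV is a made-up stand-in with the same two-header-row layout, not the real file, and the variable names are my own.

```python
import io

import pandas as pd

# Stand-in for the real file: a generic header row followed by the descriptive one.
raw_csv = io.StringIO(
    ",X1,X2,Y\n"
    "ID,LIMIT_BAL,SEX,default payment next month\n"
    "1,20000,2,1\n"
    "2,120000,2,1\n"
)
df = pd.read_csv(raw_csv)

# Drop the row of names from the data, then use it as the header instead.
clean = df.iloc[1:].reset_index(drop=True)
clean.columns = df.iloc[0].tolist()

# Every value came in as text because of the stray name row; convert back to numbers.
clean = clean.apply(pd.to_numeric)

print(clean.shape)
print(clean.dtypes)
```

Passing `header=1` to `pd.read_csv` should get the same result at load time; I show the in-memory version because it works on a frame that is already loaded.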


UCI documentation shows that the most recent month covered is September 2005, so the "PAY_3" column means "3 months ago", in July 2005.

### A. Payment Code Explanation
-2 = No consumption

-1 = Paid in full

0 = Paid the minimum due

1 = Payment 1 month late

2 = Payment 2 months late
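
To keep these codes readable during exploration, a small lookup like the sketch below could translate them into labels. The labels are copied from my list above; treating codes above 2 as "N months late" is my assumption that the pattern continues, and the sample values are partly made up.

```python
import pandas as pd

# Labels copied from the list above.
PAY_CODE_LABELS = {
    -2: "No consumption",
    -1: "Paid in full",
    0: "Paid the minimum due",
    1: "Payment 1 month late",
    2: "Payment 2 months late",
}


def describe_pay_code(code: int) -> str:
    """Translate a repayment status code into a readable label."""
    if code in PAY_CODE_LABELS:
        return PAY_CODE_LABELS[code]
    if code > 2:
        # Assumption: higher codes keep counting months late.
        return f"Payment {code} months late"
    return "Unknown code"


# A few sample codes, partly made up, standing in for the PAY_0 column.
pay_0 = pd.Series([2, -1, 0, 0, -2, 3], name="PAY_0")
labelled = pay_0.map(describe_pay_code)
print(labelled.value_counts())
```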

### B. Bill Amount Explanation
BILL_AMT4 = "Four months ago, what was the outstanding balance on the credit card?"

In [4]:
imported_credit_card_info.isnull().sum()

Unnamed: 0    0
X1            0
X2            0
X3            0
X4            0
X5            0
X6            0
X7            0
X8            0
X9            0
X10           0
X11           0
X12           0
X13           0
X14           0
X15           0
X16           0
X17           0
X18           0
X19           0
X20           0
X21           0
X22           0
X23           0
Y             0
dtype: int64

### C. Gut-Check On Null Values
Not a single empty cell across 30K rows and 25 columns. Okay, so we can skip null-handling for this exercise.
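
One caveat I would want to rule out: `isnull()` only counts true missing values, so if a column was read as text, a placeholder such as "n/a" would slip through. The sketch below shows the idea on a made-up frame; the column values and placeholder strings are invented, not taken from the UCI file.

```python
import pandas as pd

# Made-up text columns, each hiding one non-numeric entry.
toy = pd.DataFrame({
    "X1": ["20000", "120000", "n/a", "90000"],
    "X5": ["24", "unknown", "34", "37"],
})

# Plain null check: placeholder strings are not NaN, so nothing is flagged.
plain_nulls = toy.isnull().sum().to_dict()
print(plain_nulls)

# Coerce to numbers first; anything that is not a number becomes NaN.
coerced = toy.apply(pd.to_numeric, errors="coerce")
hidden_nulls = coerced.isnull().sum().to_dict()
print(hidden_nulls)
```

If the coerced counts match the plain counts on the real data (apart from any row of column names sitting in the data), the "no nulls" conclusion holds.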

In [8]:
stats_high_level = imported_credit_card_info.describe().T