# 1. Load and Preprocess Data


In this section we prepare the dataset for training our machine learning models.

Let's load the data into a pandas DataFrame. Each row represents a sample (client), and each column represents a feature (such as Age, Debt, or Income) or the label (Approved).

We will also check for duplicate rows and remove them, because duplicates can bias the model.
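
As a minimal sketch of the duplicate check described above: the helper name `drop_duplicate_rows` and the toy values are illustrative assumptions, not part of the notebook. It relies only on the pandas calls `DataFrame.duplicated` and `DataFrame.drop_duplicates`.

```python
import pandas as pd


def drop_duplicate_rows(df):
    """Return a copy of df without exact duplicate rows, plus the number removed."""
    # duplicated() marks every row that is identical to an earlier row
    n_duplicates = int(df.duplicated().sum())
    deduplicated = df.drop_duplicates().reset_index(drop=True)
    return deduplicated, n_duplicates


# Toy data (hypothetical values): the third row repeats the first
toy = pd.DataFrame({"Age": [30.83, 58.67, 30.83], "Approved": [1, 1, 1]})
deduplicated, n_removed = drop_duplicate_rows(toy)
print(n_removed, len(deduplicated))
```

The same helper could be applied to the loaded DataFrame once it is available.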

Finally, we check for rows with NULL entries and clean them, since missing values can prevent the model from training.
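
The NULL check could look roughly like the sketch below. The helper name `drop_null_rows` and the toy values are assumptions made for illustration; dropping rows is only one way to clean missing values.

```python
import pandas as pd


def drop_null_rows(df):
    """Return per-column null counts and a copy of df without rows containing nulls."""
    null_counts = df.isnull().sum()
    cleaned = df.dropna().reset_index(drop=True)
    return null_counts, cleaned


# Toy data (hypothetical values): one missing Debt entry
toy = pd.DataFrame({"Debt": [0.0, None, 0.5], "Income": [0, 560, 824]})
null_counts, cleaned = drop_null_rows(toy)
print(null_counts)
print(len(cleaned))
```

The `info()` output later in this section reports 690 non-null values in every column, so for this dataset the call would leave all rows in place.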

In [2]:
import kagglehub
import pandas as pd
import os

# Download latest version
path = kagglehub.dataset_download("samuelcortinhas/credit-card-approval-clean-data")

print("Path to dataset files:", path)

df_application = pd.read_csv(os.path.join(path, 'clean_dataset.csv'))

Path to dataset files: /Users/henrywei/.cache/kagglehub/datasets/samuelcortinhas/credit-card-approval-clean-data/versions/2


In [3]:
df_application.head()

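| | Gender | Age | Debt | Married | BankCustomer | Industry | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30.83 | 0.0 | 1 | 1 | Industrials | White | 1.25 | 1 | 1 | 1 | 0 | ByBirth | 202 | 0 | 1 |
| 1 | 0 | 58.67 | 4.46 | 1 | 1 | Materials | Black | 3.04 | 1 | 1 | 6 | 0 | ByBirth | 43 | 560 | 1 |
| 2 | 0 | 24.5 | 0.5 | 1 | 1 | Materials | Black | 1.5 | 1 | 0 | 0 | 0 | ByBirth | 280 | 824 | 1 |
| 3 | 1 | 27.83 | 1.54 | 1 | 1 | Industrials | White | 3.75 | 1 | 1 | 5 | 1 | ByBirth | 100 | 3 | 1 |
| 4 | 1 | 20.17 | 5.625 | 1 | 1 | Industrials | White | 1.71 | 1 | 0 | 0 | 0 | ByOtherMeans | 120 | 0 | 1 |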


# Checking the dataset

In [4]:
df_application.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    int64  
 1   Age             690 non-null    float64
 2   Debt            690 non-null    float64
 3   Married         690 non-null    int64  
 4   BankCustomer    690 non-null    int64  
 5   Industry        690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefault    690 non-null    int64  
 9   Employed        690 non-null    int64  
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    int64  
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    int64  
 14  Income          690 non-null    int64  
 15  Approved        690 non-null    int64  
dtypes: float64(3), int64(10), object(3)
memory usage: 86.4+ KB


In [6]:
print(df_application['Industry'].unique())
print(df_application['Ethnicity'].unique())
print(df_application['Citizen'].unique())

['Industrials' 'Materials' 'CommunicationServices' 'Transport'
 'InformationTechnology' 'Financials' 'Energy' 'Real Estate' 'Utilities'
 'ConsumerDiscretionary' 'Education' 'ConsumerStaples' 'Healthcare'
 'Research']
['White' 'Black' 'Asian' 'Latino' 'Other']
['ByBirth' 'ByOtherMeans' 'Temporary']


In [7]:
df_application.columns

Index(['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'Industry',
       'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
       'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'Approved'],
      dtype='object')

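# Feature Engineering
We convert the categorical variables into numbers for machine learning. Depending on the nature of each variable, this can be done with one-hot encoding or ordinal encoding; the cell below applies one-hot encoding to Ethnicity and Industry, creating one 0/1 column per category. Citizen and ZipCode are not carried over into the feature table.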


In [11]:
def features(df):
    # Copy the selected columns so that the new columns are added to an
    # independent DataFrame instead of a slice of df. This avoids the
    # SettingWithCopyWarning that pandas raises otherwise.
    # Approved is the label; it is kept here alongside the features.
    X = df[['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 'Income', 'Approved']].copy()

    # One-hot encode Ethnicity and Industry: one 0/1 column per category,
    # named after the category and added in order of first appearance.
    for column in ['Ethnicity', 'Industry']:
        for category in df[column].unique():
            X[category] = (df[column] == category).astype(int)

    return X


clean_data = features(df_application)
clean_data.head()

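| | Gender | Age | Debt | Married | BankCustomer | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | ... | InformationTechnology | Financials | Energy | Real Estate | Utilities | ConsumerDiscretionary | Education | ConsumerStaples | Healthcare | Research |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30.83 | 0.0 | 1 | 1 | 1.25 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 58.67 | 4.46 | 1 | 1 | 3.04 | 1 | 1 | 6 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 24.5 | 0.5 | 1 | 1 | 1.5 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 27.83 | 1.54 | 1 | 1 | 3.75 | 1 | 1 | 5 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 20.17 | 5.625 | 1 | 1 | 1.71 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |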
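
For comparison, pandas also ships a built-in one-hot encoder, `pd.get_dummies`. The sketch below is an alternative to the hand-written function, not the notebook's method, and the toy values are made up for illustration. With the default settings the new columns are prefixed with the source column name (for example `Ethnicity_White`), so the resulting names differ from the ones used above.

```python
import pandas as pd

# Toy data (hypothetical values) with the two categorical columns
toy = pd.DataFrame({
    "Age": [30.83, 58.67, 24.5],
    "Ethnicity": ["White", "Black", "Black"],
    "Industry": ["Industrials", "Materials", "Materials"],
})

# Replace each listed column with one 0/1 integer column per category
encoded = pd.get_dummies(toy, columns=["Ethnicity", "Industry"], dtype=int)
print(encoded.columns.tolist())
```

Prefixed names avoid ambiguity when two categorical columns share a category label such as "Other".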


# Split the dataset into testing data and training data
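
One possible way to do the split, sketched with pandas only. The helper name `split_train_test`, the 80/20 ratio, the seed, and the toy values are all assumptions for illustration. Because the Approved label is stored in the same table as the features, the sketch also separates it from the feature columns.

```python
import pandas as pd


def split_train_test(df, label, test_fraction=0.2, seed=0):
    """Randomly split df into train/test features and labels."""
    # Draw the test rows at random, then keep the remaining rows for training
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return (train.drop(columns=label), train[label],
            test.drop(columns=label), test[label])


# Toy data (hypothetical values): 10 rows with a binary label
toy = pd.DataFrame({"Age": range(20, 30), "Approved": [0, 1] * 5})
X_train, y_train, X_test, y_test = split_train_test(toy, "Approved")
print(len(X_train), len(X_test))
```

Fixing the seed keeps the split reproducible between runs.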

# Train the model

# Evaluate the model