### Imports and Dataset Loading

We start by importing pandas for data handling and NumPy for numerical operations. The !wget shell command downloads the dataset from the UCI Machine Learning Repository and saves it as "cc_approvals.data" in the notebook's working directory. The file has no header row, so we load it into a pandas DataFrame with header=None and then assign descriptive column names.

In [1]:
import pandas as pd
import numpy as np

# Download the dataset from a URL and save it as cc_approvals.data
!wget -O cc_approvals.data https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data

# Load the dataset
df = pd.read_csv('cc_approvals.data', header=None)
cols = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 'Ethnicity',
        'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 'Citizen',
        'ZipCode', 'Income', 'ApprovalStatus']
df.columns = cols



7[1A[1G[27G[Files: 0  Bytes: 0  [0 B/s] Re]87[2A[1G[27G[https://archive.ics.uci.edu/ml]87[1S[3A[1G[0JSaving 'cc_approvals.data'

In [2]:
df.head()

index,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
df.isna().sum()

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
ApprovalStatus    0
dtype: int64
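
The all-zero counts above deserve a second look: isna() only recognizes real NaN/None values, while the preprocessing step further down treats '?' as the missing-value marker. A minimal sketch of counting such placeholders per column, using a small made-up frame (the `sample` values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
sample = pd.DataFrame({
    'Gender': ['b', '?', 'a'],
    'Age': ['30.83', '?', '24.50'],
    'Debt': [0.0, 4.46, 0.5],
})

# isna() finds nothing, because '?' is an ordinary string.
nan_counts = sample.isna().sum()

# Comparing against the placeholder reveals the hidden gaps.
placeholder_counts = sample.eq('?').sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on df before the cleaning step would show which columns carry placeholders.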

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefault    690 non-null    object 
 9   Employed        690 non-null    object 
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    object 
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    object 
 14  Income          690 non-null    int64  
 15  ApprovalStatus  690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
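
The object dtype reported for 'Age' and 'ZipCode' suggests those columns contain non-numeric text. If that text is the '?' placeholder, as the preprocessing step assumes, one option is to let read_csv convert it at load time through its na_values parameter. The sketch below uses inline sample text and a shortened, hypothetical column list in place of the real file:

```python
import io
import pandas as pd

# Two made-up rows in the same comma-separated layout, cut down to four columns.
raw = "b,30.83,0.0,+\na,?,4.46,-\n"

cols = ['Gender', 'Age', 'Debt', 'ApprovalStatus']  # shortened subset for illustration
sample = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values='?')

# 'Age' is now parsed as a float column, with NaN where '?' appeared.
print(sample.dtypes)
print(sample.isna().sum())
```

With this approach isna() reports the gaps directly, at the cost of deviating from the explicit replace step used in this notebook.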


### Data Preprocessing

We now clean the data. Missing values in this dataset are encoded as question marks ('?'), which is why isna() reported no missing values and why 'Age' was loaded as an object column. We replace every '?' with NaN using the replace method and then drop each row that contains a NaN. Finally, we convert 'Age' to float and 'ZipCode' to integer, and map 'ApprovalStatus' from '+'/'-' to 1/0 so that it can serve as a binary target for machine learning.

In [5]:
# Replace '?' with NaN and drop rows with missing values
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)

# Convert columns to appropriate data types
df['Age'] = df['Age'].astype(float)
df['ZipCode'] = df['ZipCode'].astype(int)
df['ApprovalStatus'] = df['ApprovalStatus'].map({'+': 1, '-': 0})


In [6]:
df.head()

index,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,1
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,1
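
The cleaning steps can also be wrapped in a function that returns a new frame instead of modifying df in place, which makes the cell safe to re-run. This is only a sketch: the name `clean_credit_data` and the three-column sample are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def clean_credit_data(frame):
    """Return a cleaned copy: '?' becomes NaN, incomplete rows are dropped, dtypes are fixed."""
    out = frame.replace('?', np.nan).dropna().copy()
    out['Age'] = out['Age'].astype(float)
    # Note: integer conversion discards leading zeros in zip codes.
    out['ZipCode'] = out['ZipCode'].astype(int)
    out['ApprovalStatus'] = out['ApprovalStatus'].map({'+': 1, '-': 0})
    return out

# Made-up sample with one missing age.
sample = pd.DataFrame({
    'Age': ['30.83', '?', '24.50'],
    'ZipCode': ['00202', '00043', '00280'],
    'ApprovalStatus': ['+', '-', '-'],
})

cleaned = clean_credit_data(sample)
print(cleaned)
```

The input frame is left untouched, so the raw data stays available for comparison.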


### Handling Outliers

We define a function, handle_outliers, to remove outliers from the dataset. It takes the DataFrame (df), the name of the column to examine (column), and the ApprovalStatus class, 0 or 1, whose outliers should be removed (target). The fences are computed from the whole column as Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR = Q3 - Q1 is the interquartile range. A row is dropped only if its value lies outside the fences and its ApprovalStatus equals target. Because the function is called once per class, the second call recomputes the fences on the already-filtered data.
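
Before the function itself, a small worked example may make the fence arithmetic concrete. The ages below are made up for illustration and are not taken from the dataset:

```python
import numpy as np

ages = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 80.0])  # illustrative values

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(ages, [25, 75])   # 23.5 and 30.5
iqr = q3 - q1                            # 7.0
lower_fence = q1 - 1.5 * iqr             # 13.0
upper_fence = q3 + 1.5 * iqr             # 41.0

# A value is an outlier if it falls outside either fence.
is_outlier = (ages < lower_fence) | (ages > upper_fence)
print(lower_fence, upper_fence)
print(ages[is_outlier])
```

Only the value 80.0 lies outside the interval from 13.0 to 41.0, so it is the single outlier in this sample.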

In [14]:
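def handle_outliers(df, column, target):
    """Drop rows of class `target` whose `column` value lies outside the 1.5*IQR fences."""
    # Quartiles are computed over every row in df, not only the rows of class `target`
    q1 = np.percentile(df[column], 25)
    q3 = np.percentile(df[column], 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = (df[column] > upper_fence) | (df[column] < lower_fence)
    # Keep every row except the outliers that belong to the requested class
    return df[~(outliers & (df['ApprovalStatus'] == target))]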

# Handle outliers for 'Age' based on ApprovalStatus
df = handle_outliers(df, 'Age', 0)
df = handle_outliers(df, 'Age', 1)
df.head()

index,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,1
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,1
5,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,1
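
Since handle_outliers derives its fences from the whole column, a value that is unusual within one class but ordinary overall is kept. If class-specific fences are wanted, one possible variant computes the quartiles within each class. This is an alternative sketch and not the notebook's method; the function name `drop_outliers_per_class` and the sample values are hypothetical.

```python
import pandas as pd

def drop_outliers_per_class(frame, column, target_col):
    """Drop rows outside the 1.5*IQR fences, with fences computed within each class."""
    grouped = frame.groupby(target_col)[column]
    # transform broadcasts each class's quartile back onto that class's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = frame[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return frame[keep]

# Made-up ages: 90 is extreme for class 0 but not for the pooled column.
sample = pd.DataFrame({
    'Age': [20, 22, 24, 26, 90, 50, 52, 54, 56, 58],
    'ApprovalStatus': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

result = drop_outliers_per_class(sample, 'Age', 'ApprovalStatus')
print(result)
```

In this sample the pooled fences would keep the age of 90, whereas the per-class fences for class 0 exclude it.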
