# Module 1 - Supervised Learning

In this module you will explore several classification techniques. By the end of this module you will be able to:

- Apply different classification models
- Identify different scoring metrics
- Compare the results from different classification models

In [48]:
# Before we start, let's import all the different packages that we are going to use for this module
import pandas as pd
import numpy as np
from sklearn import naive_bayes, svm
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

**Data**

For this module, we will use an open dataset that records the outcome of credit card applications. Using the features provided in the dataset, we will try to predict whether a customer's credit card application is approved.

In [54]:
# Let's start by creating our df and doing some data pre-processing
# NOTE: 
cc_df = pd.read_csv("datasets/cc_approvals.data", header=None)
cols = ['Gender','Age','Debt','Married','BankCustomer','EducationLevel','Ethnicity',
        'YearsEmployed','PriorDefault','Employed','CreditScore','DriversLicense','Citizen',
        'ZipCode','Income','ApprovalStatus']
cc_df.columns = cols

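# Removing rows with missing values (encoded as "?" in the raw file).
# reset_index returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# on the column assignments that follow.
mask = (cc_df == "?").any(axis=1)
cc_df = cc_df[~mask].reset_index(drop=True)

# Transform our target column into a binary column ("+" becomes 1, anything else 0)
cc_df['ApprovalStatus'] = np.where(cc_df['ApprovalStatus']=="+", 1,0)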

# Transforming Age from Object to Numerical
cc_df['Age'] = pd.to_numeric(cc_df['Age'])

# Transforming categorical columns into a numerical format
dummies = ['PriorDefault', 'Employed', 'DriversLicense']
for dummy in dummies:
    cc_df[dummy] = np.where(cc_df[dummy]=="t", 1,0)
cc_df['Gender'] = np.where(cc_df['Gender']=="a", 1,0)

categorical = cc_df.select_dtypes(include=["object"]).columns
cc_df = pd.get_dummies(cc_df, columns=categorical)

cc_df

| | Gender | Age | Debt | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Income | ApprovalStatus | ... | ZipCode_00583 | ZipCode_00600 | ZipCode_00640 | ZipCode_00680 | ZipCode_00711 | ZipCode_00720 | ZipCode_00760 | ZipCode_00840 | ZipCode_00980 | ZipCode_02000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 30.83 | 0.000 | 1.25 | 1 | 1 | 1 | 0 | 0 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 1 | 58.67 | 4.460 | 3.04 | 1 | 1 | 6 | 0 | 560 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 1 | 24.50 | 0.500 | 1.50 | 1 | 0 | 0 | 0 | 824 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0 | 27.83 | 1.540 | 3.75 | 1 | 1 | 5 | 1 | 3 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 0 | 20.17 | 5.625 | 1.71 | 1 | 0 | 0 | 0 | 0 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 648 | 0 | 21.08 | 10.085 | 1.25 | 0 | 0 | 0 | 0 | 0 | 0 | ... | False | False | False | False | False | False | False | False | False | False |
| 649 | 1 | 22.67 | 0.750 | 2.00 | 0 | 1 | 2 | 1 | 394 | 0 | ... | False | False | False | False | False | False | False | False | False | False |
| 650 | 1 | 25.25 | 13.500 | 2.00 | 0 | 1 | 1 | 1 | 1 | 0 | ... | False | False | False | False | False | False | False | False | False | False |
| 651 | 0 | 17.92 | 0.205 | 0.04 | 0 | 0 | 0 | 0 | 750 | 0 | ... | False | False | False | False | False | False | False | False | False | False |


In [59]:
# Data split: we split our df into training and testing sets, holding out 20% of the rows for testing

X = cc_df.drop('ApprovalStatus', axis=1)
y = cc_df[['ApprovalStatus']]

# Set the seed
seed = 40

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=seed)

## Classification

We will start by applying some classification algorithms and evaluating their performance.
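
Before fitting any model, it helps to see how the two metrics imported at the top of this module behave. The sketch below computes accuracy and a confusion matrix on a handful of made-up labels. `y_true_demo` and `y_pred_demo` are hypothetical and are not taken from the credit card data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, invented for illustration only (1 = approved, 0 = not approved)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: fraction of predictions that match the true label
demo_acc = accuracy_score(y_true_demo, y_pred_demo)

# Confusion matrix: rows are the true class, columns are the predicted class
demo_cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels {0, 1}, ravel() yields the counts in this order
tn, fp, fn, tp = demo_cm.ravel()

print(f"accuracy = {demo_acc:.2f}")
print(demo_cm)
```

Accuracy alone hides which kind of mistake a model makes. The confusion matrix separates false positives from false negatives, which matters when the two errors have different costs.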

### SVM

This algorithm attempts to find the hyperplane that best separates the two classes, that is, the one with the widest margin between them.
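
As a rough illustration, here is a minimal sketch of a linear SVM. It runs on synthetic data from `make_classification` rather than on `cc_df`, so the names `X_demo`, `y_demo` and `svm_clf` are assumptions made for this example only. The scaling step is also our own choice, since SVMs are sensitive to feature scale.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card features (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=40
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

# Standardize the features, then fit a linear-kernel support vector classifier
svm_clf = make_pipeline(StandardScaler(), svm.SVC(kernel="linear", C=1.0))
svm_clf.fit(X_tr, y_tr)

# With a linear kernel, the separating hyperplane is w . x + b = 0
w = svm_clf.named_steps["svc"].coef_
b = svm_clf.named_steps["svc"].intercept_

svm_pred = svm_clf.predict(X_te)
svm_acc = accuracy_score(y_te, svm_pred)
print(f"SVM test accuracy: {svm_acc:.3f}")
```

To try the same idea on the notebook's data, `X_train` and `X_test` would take the place of `X_tr` and `X_te`. Different kernels and values of `C` may give different results.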

## Regression

### Logistic Regression

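Now, let's proceed with Logistic Regression which, despite its name, is a classification model: it estimates the probability that an observation belongs to the positive class.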
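
A minimal sketch of this approach, again on synthetic data so that it runs on its own, could look like the following. The names `X_demo`, `y_demo` and `logreg` are hypothetical, and `max_iter=1000` is an assumption chosen only to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card features (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=40
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1
logreg_proba = logreg.predict_proba(X_te)[:, 1]

# predict() returns hard class labels (0 or 1)
logreg_pred = logreg.predict(X_te)

logreg_acc = accuracy_score(y_te, logreg_pred)
logreg_cm = confusion_matrix(y_te, logreg_pred)
print(f"Logistic Regression test accuracy: {logreg_acc:.3f}")
print(logreg_cm)
```

When fitting on the notebook's own split, note that `y_train` is a one-column DataFrame. Passing `y_train.values.ravel()` gives scikit-learn the 1-D target it expects and avoids a conversion warning.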

In [61]:
y_train

| | ApprovalStatus |
|---|---|
| 293 | 0 |
| 321 | 0 |
| 19 | 1 |
| 139 | 1 |
| 259 | 1 |
| ... | ... |
| 440 | 0 |
| 165 | 1 |
| 7 | 1 |
| 219 | 1 |
