# Predicting Credit Card Approvals with Logistic Regression

In this project we build a machine learning model, using the logistic regression algorithm, to predict whether a credit card application is approved or rejected. Applications are commonly rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. We use features like these to build an automatic credit card approval predictor.


![image](https://images.unsplash.com/photo-1609429019995-8c40f49535a5?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2069&q=80)

## Project Outline
- First, we will start off by loading and viewing the dataset.
- We will see that the dataset mixes numerical and non-numerical features, that its values span different ranges, and that it contains a number of missing entries.
- We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.
- After our data is in good shape, we will do some exploratory data analysis to build our intuitions.
- Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.


## Project Tasks
1. [Credit card applications](#1.-Credit-card-applications)
2. [Inspecting the applications](#2.-Inspecting-the-applications)
3. [Splitting the dataset into train and test sets](#3.-Splitting-the-dataset-into-train-and-test-sets)
4. [Handling the missing values](#4.-Handling-the-missing-values)
5. Preprocessing the data
6. Fitting a logistic regression model to the train set
7. Making predictions and evaluating performance
8. Grid searching and making the model perform better
9. Finding the best performing model


### 1. Credit card applications
First, we load our dataset into ```cc_apps``` using ```pandas```. The file has no header row, so we pass ```header=None``` and pandas labels the columns 0 to 15. In order, the columns are: Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally ApprovalStatus.

In [1]:
import pandas as pd
cc_apps = pd.read_csv('Dataset/cc_approvals.data', header=None)
cc_apps.head()

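|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |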
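
Because the columns are only numbered, it can help to attach readable names. The following is a minimal sketch, assuming the feature names listed above map to the columns from left to right. The two inline rows are copied from the output above, and `feature_names` and `sample` are illustrative names that the rest of this project does not use.

```python
import io
import pandas as pd

# Assumed column order: the feature names listed in the text, left to right.
feature_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the head() output above, standing in for the real file.
raw = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
)

# header=None says the data has no header row; names= supplies the labels.
sample = pd.read_csv(raw, header=None, names=feature_names)
print(sample[["Age", "Income", "ApprovalStatus"]])
```

The remaining cells keep the numeric labels, so this renaming is optional.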


### 2. Inspecting the applications
Now we inspect the structure of the dataset, its numerical summary, and some specific rows. The ```info()``` method of ```cc_apps``` reports the data type and non-null count of each column, ```describe()``` gives summary statistics for the numeric columns, and ```tail(17)``` prints the last 17 rows.

<a id='2._Inspecting_the_applications'></a>

In [2]:
print(cc_apps.info())
print('\n', cc_apps.describe())
print('\n', cc_apps.tail(17))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
None

                2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    10
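
Column 1 holds values such as 30.83, yet `info()` reports it as `object`, which suggests that some of its entries are not numbers. One way to surface such entries, sketched here on a small hypothetical Series rather than on the real column, is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Hypothetical stand-in for a column that mixes numbers with a '?' placeholder.
ages = pd.Series(["30.83", "58.67", "?", "24.5"])

# Values that cannot be parsed become NaN instead of raising an error.
parsed = pd.to_numeric(ages, errors="coerce")

# Entries that were present in the raw data but failed to parse as numbers.
bad_entries = ages[parsed.isna() & ages.notna()]
print(bad_entries.tolist())
```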

### 3. Splitting the dataset into train and test sets

Looking at the data, features such as ```DriversLicense``` and ```ZipCode``` are unlikely to be as informative for credit approval as the others, so we set them aside using the ```drop()``` method. Next, we split the data into a train set and a test set.


```sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)```

In [3]:
from sklearn.model_selection import train_test_split

cc_apps = cc_apps.drop([11, 13], axis=1)

cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
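
To make the effect of `test_size=0.33` concrete, here is a small sketch on a placeholder frame with the same number of rows as the dataset (690). The frame `toy` and its contents are made up for illustration; only the `drop()` and `train_test_split()` calls mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame: 690 rows, columns labelled 0..15 like the headerless file.
toy = pd.DataFrame(np.zeros((690, 16)))

# Drop columns 11 and 13 by label, as in the cell above.
toy = toy.drop([11, 13], axis=1)

# Hold out roughly a third of the rows; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(toy, test_size=0.33, random_state=42)

print(toy.shape[1])                   # number of remaining columns
print(len(toy_train), len(toy_test))  # rows in each split
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each split on every run.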


In [4]:
cc_apps

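|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.50 | 0.500 | u | g | q | h | 1.50 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.540 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | b | 21.08 | 10.085 | y | p | e | h | 1.25 | f | f | 0 | g | 0 | - |
| 686 | a | 22.67 | 0.750 | u | g | c | v | 2.00 | f | t | 2 | g | 394 | - |
| 687 | a | 25.25 | 13.500 | y | p | ff | ff | 2.00 | f | t | 1 | g | 1 | - |
| 688 | b | 17.92 | 0.205 | u | g | aa | v | 0.04 | f | f | 0 | g | 750 | - |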


### 4. Handling the missing values
No dataset is perfect, and this one is no exception. It contains a number of missing values, which are marked with '?'. We replace these question marks with ```np.nan``` from ```numpy```, which pandas recognises as a missing value, using the ```replace()``` method.

``` DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method=_NoDefault.no_default)```

In [5]:
import numpy as np
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)
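
The same replacement can also happen at load time. As a hedged sketch (the inline rows are invented for illustration and are not taken from the dataset), `pd.read_csv` accepts an `na_values` argument that marks '?' as missing while parsing, which also lets numeric columns load as floats instead of strings:

```python
import io
import numpy as np
import pandas as pd

# Invented rows standing in for the real file; '?' marks a missing entry.
raw = "b,30.83,u\na,?,u\n?,24.5,y\n"

# Option 1: load as-is, then replace the placeholder afterwards.
replaced = pd.read_csv(io.StringIO(raw), header=None).replace("?", np.nan)

# Option 2: treat '?' as missing while parsing.
parsed = pd.read_csv(io.StringIO(raw), header=None, na_values="?")

print(replaced.isna().sum().tolist())       # missing values per column
print(replaced[1].dtype, parsed[1].dtype)   # column 1 dtype under each option
```

With the second option, column 1 is parsed as a float column, so it would also take part in numeric operations such as `mean()`.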

Next, we impute the missing values with a strategy called **mean imputation**. It is a simple baseline rather than an ideal choice, because it ignores any correlations between the features.

In mean imputation, we replace each missing value with the mean of its column. To do this, we use the pandas ```fillna()``` method, passing it the column means computed with the ```mean()``` method. Note that this only imputes the columns containing **numeric** data types; missing values in the non-numeric columns are left untouched.

In [8]:

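# numeric_only=True limits the means to numeric columns; recent pandas versions raise an error without it
cc_apps_train.fillna(cc_apps_train.mean(numeric_only=True), inplace=True)
cc_apps_test.fillna(cc_apps_test.mean(numeric_only=True), inplace=True)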


print(cc_apps_train.isnull().sum())

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
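
The counts above show that only the numeric columns were filled; the `object` columns still contain missing values. The sketch below reproduces that behaviour on a tiny made-up frame. It also shows one common variant, which is an assumption here rather than what the cell above does: computing the means on the training set only and reusing them for the test set, so that both sets are imputed with the same values.

```python
import numpy as np
import pandas as pd

# Made-up train and test frames: one numeric and one categorical column.
train = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "married": ["u", np.nan, "y"]})
test = pd.DataFrame({"debt": [np.nan, 10.0], "married": ["u", "u"]})

# Means of the numeric columns, computed from the training set only.
train_means = train.mean(numeric_only=True)

# fillna with a Series fills each column whose label matches the Series index.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_filled.isnull().sum().to_dict())  # remaining missing values per column
print(test_filled["debt"].tolist())           # test values after imputation
```

The categorical column keeps its missing entry because `train_means` has no value for it.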


