## **1. Credit card applications**

Banks receive a lot of applications for credit cards. They reject many of them for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. To analyze these applications we will have to handle the missing values and preprocess the dataset, and finally build a credit card approval predictor using statistics and machine learning.
    
We will use the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of this notebook is as follows:

> * First, we will start off by loading and viewing the dataset.<br>
> * We will see that the dataset has a mixture of numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.<br>
> * We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.<br>
> * After our data is in good shape, we will do some exploratory data analysis to build our intuitions.<br>
> * Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.<br>
  
First, we load and peek at the dataset.

In [None]:

# Import pandas
import pandas as pd
# Load dataset
cc_apps = pd.read_csv("dataset/cc_approvals.data", header=None)
# Inspect data
cc_apps.head()

Clearly, this dataset has been modified to keep the data confidential. It is therefore pertinent to recall the information about the columns and the dataset provided by the UCI Machine Learning Repository at [http://archive.ics.uci.edu/ml/datasets/credit+approval](http://archive.ics.uci.edu/ml/datasets/credit+approval), together with what can be found online about the data.

At first, the dataset may seem confusing, but the attribute information below gives a probable meaning for each column.


### Attribute Information:

> - A1: **Gender (Male a)** b, a. <br>
> - A2: **Age in years** continuous. <br>
> - A3: **Debt in Thousands of Dollars** continuous. <br>
> - A4: **Married** u, y, l, t. <br>
> - A5: **Bank Customer** g, p, gg. <br>
> - A6: **Education Level** c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. <br>
> - A7: **Ethnicity** v, h, bb, j, n, z, dd, ff, o. <br>
> - A8: **Years Employed** continuous. <br>
> - A9: **Prior Default** t, f. <br>
> - A10: **Employed** t, f. <br>
> - A11: **Credit Score** continuous. <br>
> - A12: **DriversLicense** t, f. <br>
> - A13: **Citizen** g, p, s. <br>
> - A14: **Zip Code** continuous. <br>
> - A15: **Income** continuous. <br>
> - A16: **Approval Status** +,- (class attribute). <br> 
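
Because the file is loaded with `header=None`, the columns are labelled 0 to 15, so attribute A1 is column 0 and A16 is column 15. As a minimal sketch, the mapping could be made explicit with readable names. The names in `column_names` and the single row in `demo` are illustrative assumptions, and the rest of this notebook keeps the integer labels.

```python
import pandas as pd

# Hypothetical readable names for attributes A1..A16, following the list above
column_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# One illustrative row standing in for a frame loaded with header=None
row = ["a", "25.0", 1.5, "u", "g", "c", "v", 0.5, "f", "t", 2, "f", "g", "00100", 10, "-"]
demo = pd.DataFrame([row])

# Map the integer labels 0..15 to the readable names
named = demo.rename(columns=dict(enumerate(column_names)))
print(named.columns.tolist())
```

With this mapping, the columns dropped later in the notebook (11 and 13) correspond to `DriversLicense` and `ZipCode`.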


## **2. Inspecting the applications**

Now we will inspect the dataset a little more closely and try to identify some potential problems, missing values, and interesting points.

In [None]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)
print("\n")

# Print DataFrame information (info() prints directly and returns None)
cc_apps.info()
print("\n")

# Inspect missing values in the dataset
cc_apps.tail(20)
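
Looking at the last rows only reveals the missing entries that happen to be there. As a small sketch on made-up data, and assuming the missing entries are marked with `'?'` as in this dataset, the placeholders can be counted per column instead:

```python
import pandas as pd

# Toy frame with integer column labels, mimicking a file read with header=None
toy = pd.DataFrame({
    0: ["a", "?", "b", "a"],
    1: ["22.08", "?", "35.17", "?"],
    2: [1.5, 0.0, 2.25, 4.0],
})

# Comparing with '?' gives a boolean frame; summing counts the hits per column
question_marks = (toy == "?").sum()
print(question_marks)
```

A numeric-looking column that contains `'?'` is read as text, which is why such a column does not show up in `describe()`.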


## **3. Splitting the dataset into train and test sets**

Now, we will split our data into a train set and a test set to prepare for the two phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to preprocess the training data or to direct the training process of a machine learning model. Hence, we first split the data and then preprocess it.

It is also worth noting that some features in the dataset, like *DriversLicense* and *Zip Code*, may not be strongly related to the target variable *Approval Status*, so we will drop them.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split
# Drop the feature columns 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)
# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)

## **4. Handling the missing values**
The dataset has a mixture of numerical and non-numerical features. This is common, and we will deal with it during preprocessing. We start by checking for missing values and then continue with the rest of the preprocessing.

In [None]:
# Import numpy
import numpy as np
# Inspect missing values in the dataset
cc_apps.tail(20)
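# Replace the '?'s with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)
# Count the NaNs per column in the train set (cc_apps itself still contains '?')
cc_apps_train.isnull().sum()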

We replaced all the question marks with NaNs. This will help with the missing value treatment that follows.
Ignoring missing values is not a good idea because it can hurt the performance of the machine learning model, so we will use a strategy called mean imputation, which replaces the missing values with the mean of the column.

In [None]:
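# Implement mean imputation, using the train set means for both sets
# (numeric_only=True skips the non-numeric columns, which have no mean)
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train.fillna(train_means, inplace=True)
cc_apps_test.fillna(train_means, inplace=True)
# Verify that missing values in the numeric columns have been imputed
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
# Inspect the train set after imputation
cc_apps_train.tail(20)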
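
To make the mechanics of mean imputation concrete, here is a minimal sketch on made-up data. The column names `debt` and `grade` are hypothetical. The mean is computed on the train set only and then reused for the test set, so no information flows from test to train.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with one numeric and one non-numeric column
train = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "grade": ["a", "b", None]})
test = pd.DataFrame({"debt": [np.nan, 10.0], "grade": ["a", None]})

# Means of the numeric columns of the train set only
train_means = train.mean(numeric_only=True)

# fillna with a Series fills each column with the value of the same name
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_filled)
print(test_filled)
```

The non-numeric column `grade` is left untouched, because a mean is not defined for it.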

We have taken care of the missing values in the numeric columns. Now we will take care of the missing values in the non-numeric (object-typed) columns 0, 1, 3, 4, 5 and 6, which correspond to attributes A1, A2, A4, A5, A6 and A7 in the list above.
We will use a similar technique called mode imputation, where we replace the missing values with the most frequent value in the column.

In [None]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtypes == 'object':
        # Most frequent value of this column in the train set
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Impute only this column, in both sets, with that value
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
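
Mode imputation can also be written without an explicit loop. The sketch below uses made-up data with hypothetical column names and relies on `DataFrame.mode()`, whose first row holds the most frequent value of each column. As before, the modes come from the train set and are reused for the test set.

```python
import pandas as pd

# Toy train/test frames with missing categorical entries
train = pd.DataFrame({
    "married": ["u", "u", None, "y"],
    "citizen": ["g", None, "g", "s"],
})
test = pd.DataFrame({
    "married": [None, "y"],
    "citizen": ["s", None],
})

# First row of mode() holds the most frequent value of each train column
train_modes = train.mode().iloc[0]

# Each column is filled with its own mode, not with a single global value
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(train_filled)
print(test_filled)
```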



Now that we have dealt with all the missing values in the dataset, we can move on to the next step.

## **5. Preprocessing the data**

Now we want to:
> - Convert the non-numeric data into numeric.
> - Scale the feature values to a uniform range.

We convert the non-numeric data into numeric because it results in faster computation and because many machine learning models require strictly numeric data. We will use the get_dummies() function from pandas to convert the categorical columns into numeric ones.

In [None]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
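
The reindexing step matters because the two sets are encoded independently and may not contain the same categories. A small sketch on made-up data (the column names are hypothetical) shows what happens when a category appears in only one of the sets:

```python
import pandas as pd

# 'p' appears only in the train set, 's' only in the test set
train = pd.DataFrame({"citizen": ["g", "p", "g"], "income": [100, 0, 50]})
test = pd.DataFrame({"citizen": ["g", "s"], "income": [20, 30]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Align the test columns with the train columns:
# missing columns are added and filled with 0, unseen ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned.columns.tolist())
```

After alignment both frames have the same columns in the same order, which is what a fitted model expects at prediction time.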

Now we only have to scale the values to a uniform range. We will use the MinMaxScaler from sklearn to do this.
Scaling maps every feature to the interval $[0,1]$, so that features measured on very different scales, such as *Income* and *Years Employed*, become comparable. For example, since a higher *Credit Score* suggests a more trustworthy applicant, a scaled *Credit Score* close to $1$ corresponds to the most trustworthy applicants in the train set and a value close to $0$ to the least trustworthy ones.

In [None]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

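# Segregate features and labels into separate variables.
# get_dummies also one-hot encoded the label column 15, so the last two
# columns are 15_+ and 15_-: keep 15_- as the label and exclude both from X
X_train, y_train = cc_apps_train.iloc[:, :-2].values, cc_apps_train.iloc[:, -1].values
X_test, y_test = cc_apps_test.iloc[:, :-2].values, cc_apps_test.iloc[:, -1].values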

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
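
For a feature range of $(0, 1)$, min-max scaling computes $x' = (x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum are taken from the data the scaler was fitted on. The sketch below, on made-up values, illustrates that the test set is transformed with the train set statistics, so a test value can fall outside $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made-up values)
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [20.0]])

scaler_toy = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=0, max=10
test_scaled = scaler_toy.transform(X_test_toy)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 20 is twice the train maximum, so it is mapped to 2 rather than being clipped to 1.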

## **6. Fitting a logistic regression model to the train set**
According to *UCI*, predicting whether an individual will get a credit card or not is a classification problem. We will use a logistic regression model to solve this problem. Our dataset contains more denials than approvals, as we can see below.


In [None]:
# Find the number of occurrences of each value in the Approval column
print(cc_apps[15].value_counts())

Why did we pick this model? Because it is a simple model that is easy to interpret and a good baseline for a classification problem. We will use the LogisticRegression class from sklearn.linear_model to fit a logistic regression model to the train set.
We also assume that the features in this data are correlated with each other, and generalized linear models, of which logistic regression is one, tend to perform well in that situation.

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train,y_train)
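
As a brief illustration of what the fitted model computes, the sketch below trains a logistic regression on a made-up one-feature problem and checks that, in this binary setting with default parameters, the predicted probability of the positive class is the sigmoid of the linear score returned by `decision_function`. The toy arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: larger feature values belong to class 1
X_toy = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

# Linear score z = w * x + b for each sample
z = clf.decision_function(X_toy)

# Sigmoid of the score versus the probability reported for class 1
sigmoid = 1.0 / (1.0 + np.exp(-z))
proba = clf.predict_proba(X_toy)[:, 1]

print(np.allclose(sigmoid, proba))
```

This is also what makes the model easy to interpret: the sign and size of each coefficient show how a feature pushes the score up or down.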

## **7. Making predictions and evaluating performance**
We will now evaluate our model on the test set with respect to classification accuracy. We will also look at the confusion matrix, which shows how many correct and incorrect predictions the model made for each class, to judge whether the model can predict whether a card application will be approved or not.

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test,y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test,y_pred)
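
To help read the output, here is a small sketch with made-up labels. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions and the accuracy is the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Accuracy recovered from the matrix: correct predictions over all predictions
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true_toy, y_pred_toy))
```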

## **8. Grid searching for optimal hyperparameters**
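A run of this notebook reported an accuracy of $100\%$. A perfect score is usually a warning sign: it happens when a column derived from the target, such as the one-hot column `15_+`, is left among the features, so it should be treated with caution. In any case, we can perform a grid search to look for better hyperparameters for our model.
The scikit-learn implementation of logistic regression has several hyperparameters; here we will search over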
> - **tol** <br>
> - **max_iter** <br> 

where **tol** is the tolerance for the stopping criteria and **max_iter** is the maximum number of iterations taken for the solvers to converge.


In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are the corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)
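
A grid search tries every combination of the listed values. As a quick sketch, scikit-learn's `ParameterGrid` can enumerate them for a grid with the same values as above (rebuilt here as `param_grid_demo` so that the snippet runs on its own):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined above
param_grid_demo = dict(tol=[0.01, 0.001, 0.0001], max_iter=[100, 150, 200])

# Every (tol, max_iter) combination that a grid search would evaluate
combinations = list(ParameterGrid(param_grid_demo))
print(len(combinations))
print(combinations[0])
```

With three values for each of the two hyperparameters there are nine candidates, and each candidate is fitted once per cross-validation fold.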

## **9. Finding the best performing model**
We have defined the grid of hyperparameter values and converted them into a single dictionary, which is the format GridSearchCV() expects for one of its parameters. Now, we will begin the grid search to see which values perform best.
We will instantiate GridSearchCV() with our earlier logreg model and fit it to the rescaled train set. We will also instruct GridSearchCV() to perform five-fold cross-validation.
Finally, we will store the best score achieved and the corresponding hyperparameters, and evaluate the best model on the test set.

In [None]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test,y_test))
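
One possible refinement, shown here only as a sketch on synthetic data, is to put the scaler and the classifier into a single pipeline so that the scaler is refitted inside each cross-validation fold. This extends the earlier principle that no information from held-out data should influence preprocessing. The data from `make_classification` and all variable names are assumptions for the example, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed credit card data
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

# Scaling and model in one estimator; make_pipeline names steps after their class
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())

# Step-prefixed parameter names address the LogisticRegression step
grid = {
    "logisticregression__tol": [0.01, 0.001, 0.0001],
    "logisticregression__max_iter": [100, 150, 200],
}

search = GridSearchCV(estimator=pipe, param_grid=grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```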

In building this predictor we have gone through steps like cleaning the data, preprocessing, encoding the categorical features, and scaling, and we have used a simple model, logistic regression, to predict whether an individual will get a credit card or not. We have also used GridSearchCV to find the best hyperparameters for our model.

#### *This was collected and solved by jdpalmad. Suggestions are found at Datacamp and the Dataset is provided by UCI Machine Learning Repository.*