## 1. Bank Card Applications

Commercial banks receive a large number of applications for credit cards. Many of these applications are rejected for various reasons, such as high loan balances, low income levels, or a high number of inquiries on an individual's credit report. Manually analyzing these applications is a tedious, error-prone, and time-consuming task (and time is money!). Fortunately, this task can be automated with machine learning, and many commercial banks now do so. In this notebook, we will develop an automatic credit card approval predictor using machine learning techniques, just like real banks do.

- First, we will begin by loading and examining the dataset.
- The dataset contains a mix of both numerical and non-numerical features, with values from different ranges, and it also includes some missing entries.
- We will need to preprocess the dataset to ensure that the machine learning model we choose can make accurate predictions.
- Once our data is properly prepared, we will conduct exploratory data analysis to build our understanding and intuition.
- Finally, we will build a machine learning model capable of predicting whether an individual's credit card application will be accepted.

First, let's load and view the dataset. It is important to note that, due to the confidential nature of this data, the contributor of the dataset has anonymized the feature names.
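
As a rough sketch of this loading step, the snippet below reads a header-less CSV with pandas. The three rows are invented stand-ins for the real file (whose path is not shown here), and `header=None` is an assumption about the file layout that matches the anonymized, integer-labelled columns described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: three invented rows in a
# header-less, comma-separated layout with 16 fields each.
sample_csv = io.StringIO(
    "b,31.5,0.5,u,g,w,v,1.25,t,t,1,f,g,120,0,+\n"
    "a,45.0,4.0,u,g,q,h,3.0,t,t,6,f,g,43,560,+\n"
    "b,?,1.5,y,p,c,v,0.0,f,f,0,t,g,280,0,-\n"
)

# header=None tells pandas that the first row is data, not column names,
# so the columns are labelled 0..15.
cc_apps = pd.read_csv(sample_csv, header=None)

# Show the first rows to get a feel for the layout.
print(cc_apps.head())
```

With the real dataset, the `io.StringIO` object would simply be replaced by the path to the data file.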


In [None]:
# Import pandas
# ... YOUR CODE FOR TASK 1 ...

# Load dataset
cc_apps = ...

# Inspect data
# ... YOUR CODE FOR TASK 1 ...

## 2. Inspecting the Applications

The output may initially appear somewhat confusing, but let's uncover the key features of a credit card application. To protect privacy, the dataset's features have been anonymized. However, [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives a good overview of the probable features. In a typical credit card application, these are `Gender`, `Age`, `Debt`, `Married`, `BankCustomer`, `EducationLevel`, `Ethnicity`, `YearsEmployed`, `PriorDefault`, `Employed`, `CreditScore`, `DriversLicense`, `Citizen`, `ZipCode`, `Income`, and finally `ApprovalStatus`. This gives us a solid starting point for mapping these features to the columns in the output.

Upon our initial analysis, it becomes evident that the dataset contains a mix of numerical and non-numerical features. Preprocessing will be necessary to address this. Before proceeding with the preprocessing steps, let's conduct a more thorough examination of the dataset to identify any other potential issues that may require attention.
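
A minimal sketch of such an inspection is shown below on a small invented DataFrame (the column labels and values are assumptions chosen only for illustration). It uses `describe()`, `info()`, and `tail()`, and counts a `?` placeholder per column.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns;
# "?" stands in for a missing entry.
cc_apps = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    2: [0.0, 4.46, 1.5, 5.625],
    10: [1, 6, 0, 5],
    15: ["+", "+", "-", "-"],
})

# describe() summarises only the numeric columns by default.
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

# info() prints dtypes and non-null counts itself and returns None.
cc_apps.info()

# The last rows often reveal placeholder values such as "?".
print(cc_apps.tail())

# Count how often the placeholder occurs in each column.
question_marks = (cc_apps == "?").sum()
print(question_marks)
```

Note that `describe()` skips the text columns entirely, which is one quick way to see which features pandas treats as non-numeric.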



In [None]:
# Print summary statistics
cc_apps_description = ...
print(cc_apps_description)

print('\n')

# Print DataFrame information
cc_apps_info = ...
print(cc_apps_info)

print('\n')

# Inspect missing values in the dataset
# ... YOUR CODE FOR TASK 2 ...

## 3. Splitting the Dataset into Train and Test Sets

Now, it's time to split our data into two sets: the train set and the test set. This division is essential to prepare the data for two different phases of machine learning modeling: training and testing. It is crucial to ensure that no information from the test data is used in the preprocessing of the training data or during the training process of the machine learning model. By separating the data beforehand, we can maintain the integrity of our evaluation and avoid data leakage.

Additionally, we have identified that features such as `DriversLicense` and `ZipCode` are not as significant as the other features in the dataset when it comes to predicting credit card approvals. To gain a better understanding, we could measure their [statistical correlation](https://realpython.com/numpy-scipy-pandas-correlation-python/) with the labels of the dataset. However, for this project's scope, we will exclude these features and focus on designing our machine learning model using the most relevant features. In Data Science literature, this process is commonly referred to as *feature selection*.
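
The two steps described above can be sketched as follows. The toy DataFrame is invented, and treating columns 11 and 13 as `DriversLicense` and `ZipCode` is an assumption based on the feature order listed earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 16 integer-labelled columns (0..15),
# mimicking the shape of the raw data.
cc_apps = pd.DataFrame(
    [[row * 16 + col for col in range(16)] for row in range(12)]
)

# Drop columns 11 and 13 by label
# (assumed here to be DriversLicense and ZipCode).
cc_apps = cc_apps.drop([11, 13], axis=1)

# Hold out 33% of the rows; fixing random_state makes the split reproducible.
cc_apps_train, cc_apps_test = train_test_split(
    cc_apps, test_size=0.33, random_state=42
)

print(cc_apps_train.shape, cc_apps_test.shape)
```

Splitting before any imputation or scaling is what keeps test-set information out of the preprocessing steps that follow.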


In [None]:
# Import train_test_split
# ... YOUR CODE FOR TASK 3 ...

# Drop the features 11 and 13
cc_apps = ...

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(..., test_size=0.33, random_state=42)

## 4. Handling the missing values (part i)
<p>Now that we've split our data, we can handle some of the issues we identified when inspecting the DataFrame, including:</p>
<ul>
<li>Our dataset contains both numeric and non-numeric data (specifically, data of <code>float64</code>, <code>int64</code> and <code>object</code> types). Features 2, 7, 10 and 14 contain numeric values (of types <code>float64</code>, <code>float64</code>, <code>int64</code> and <code>int64</code> respectively), and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values. </li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output of the second task.</li>
</ul>
<p>Now, let's temporarily replace these missing value question marks with NaN.</p>
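<p>A minimal sketch of this replacement, using two small invented DataFrames in place of the real train and test sets:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical train and test frames in which "?" marks a missing entry.
cc_apps_train = pd.DataFrame({0: ["b", "?", "a"], 1: ["30.83", "?", "24.50"]})
cc_apps_test = pd.DataFrame({0: ["a", "b"], 1: ["?", "41.17"]})

# Replace every "?" with NaN so that pandas recognises it as missing.
cc_apps_train = cc_apps_train.replace("?", np.nan)
cc_apps_test = cc_apps_test.replace("?", np.nan)

# isnull().sum() now reports the missing entries per column.
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
```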

In [None]:
# Import numpy
# ... YOUR CODE FOR TASK 4 ...

# Replace the '?'s with NaN in the train and test sets
cc_apps_train = cc_apps_train.replace(...)
cc_apps_test = cc_apps_test.replace(...)

## 5. Handling the missing values (part ii)
<p>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.</p>
<p>An important question that arises here is <em>why are we giving so much importance to missing values</em>? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model: the model may miss out on information in the dataset that would be useful for its training. In addition, many models, such as Linear Discriminant Analysis (LDA), cannot handle missing values implicitly.</p>
<p>So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.</p>
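<p>One way to sketch mean imputation with pandas is shown below. The toy frames are invented, and reusing the training-set means for the test set is an assumption made here to stay consistent with the earlier point about avoiding data leakage.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frames: column 2 is numeric, column 0 is categorical.
cc_apps_train = pd.DataFrame({2: [1.0, np.nan, 3.0], 0: ["a", np.nan, "b"]})
cc_apps_test = pd.DataFrame({2: [np.nan, 10.0], 0: ["b", "a"]})

# Compute the column means on the training set only.
train_means = cc_apps_train.mean(numeric_only=True)

# fillna() with a Series fills each column using the value stored under
# that column's label; columns without an entry are left untouched.
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)

# Only the non-numeric column should still contain NaNs.
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
```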

In [None]:
# Impute the missing values with mean imputation


# Count the number of NaNs in the datasets and print the counts to verify
# ... YOUR CODE FOR TASK 5 ...

## 6. Handling the missing values (part iii)
<p>We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5 and 6. All of these columns contain non-numeric data, which is why the mean imputation strategy would not work here. They need a different treatment.</p>
<p>We are going to impute these missing values with the most frequent values as present in the respective columns. This is <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>
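<p>A minimal sketch of most-frequent-value imputation on invented data follows. Checking for non-numeric columns with <code>pd.api.types.is_numeric_dtype</code> and reusing the training-set mode for the test set are choices made for this sketch, not necessarily the notebook's intended solution.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with a few missing entries.
cc_apps_train = pd.DataFrame(
    {0: ["b", "b", np.nan, "a"], 3: ["u", np.nan, "u", "y"]}
)
cc_apps_test = pd.DataFrame({0: [np.nan, "a"], 3: ["y", np.nan]})

for col in cc_apps_train.columns:
    # Only non-numeric columns are treated here.
    if not pd.api.types.is_numeric_dtype(cc_apps_train[col]):
        # value_counts() sorts by frequency, so the first index entry
        # is the most frequent value in the training column.
        most_frequent = cc_apps_train[col].value_counts().index[0]
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# Verify that no NaNs remain.
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())
```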

In [None]:
# Iterate over each column of cc_apps_train

    # Check if the column is of object type

        # Impute with the most frequent value


# Count the number of NaNs in the dataset and print the counts to verify
# ... YOUR CODE FOR TASK 6 ...

## 7. Preprocessing the data (part i)
<p>The missing values are now successfully handled.</p>
<p>There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into two main tasks:</p>
<ol>
<li>Convert the non-numeric data into numeric.</li>
<li>Scale the feature values to a uniform range.</li>
</ol>
<p>First, we will convert all the non-numeric values into numeric ones. We do this not only because it results in faster computation, but also because many machine learning models, especially those developed using scikit-learn (and libraries such as XGBoost), require the data to be in a strictly numeric format. We will do this by using the <code>get_dummies()</code> function from pandas.</p>
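<p>The sketch below illustrates this conversion on two invented frames. It also shows why the test set is reindexed afterwards: a category that appears in only one of the two sets would otherwise leave them with different columns.</p>

```python
import pandas as pd

# Hypothetical frames: the train set contains categories "h" and "ff"
# that the test set lacks, and the test set contains an unseen "z".
cc_apps_train = pd.DataFrame(
    {0: ["a", "b", "a"], 2: [1.0, 2.5, 0.5], 6: ["v", "h", "ff"]}
)
cc_apps_test = pd.DataFrame({0: ["b", "a"], 2: [3.0, 0.0], 6: ["v", "z"]})

# One-hot encode the non-numeric columns of each set independently.
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Align the test columns with the train columns: indicator columns missing
# from the test set are added and filled with 0, and columns that only
# exist in the test set are dropped.
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)

print(cc_apps_train.columns.tolist())
print(cc_apps_test)
```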

In [None]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = ...
cc_apps_test = ...

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=..., fill_value=...)

## 8. Preprocessing the data (part ii)
<p>We are now left with one final preprocessing step, scaling, before we can fit a machine learning model to the data.</p>
<p>Let's try to understand what these scaled values mean in the real world, using <code>CreditScore</code> as an example. The credit score of a person is a measure of their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. Since we're rescaling all the values to the range of 0-1, a rescaled <code>CreditScore</code> of 1 corresponds to the highest score in the training data.</p>
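<p>A minimal sketch of min-max scaling on invented feature arrays is given below. The scaler is fitted on the training features only and then applied unchanged to the test features.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices with two columns on very different scales.
X_train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
X_test = np.array([[2.5, 200.0], [20.0, 500.0]])

# Learn each column's minimum and maximum from the training data only.
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)

# Apply the same transformation to the test data without refitting.
rescaledX_test = scaler.transform(X_test)

print(rescaledX_train)
print(rescaledX_test)
```

<p>Because the minimum and maximum come from the training set, a test value outside the training range (such as 20.0 in the first column above) is mapped outside the 0-1 interval.</p>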

In [None]:
# Import MinMaxScaler
# ... YOUR CODE FOR TASK 8 ...

# Segregate features and labels into separate variables
X_train, y_train = ...
X_test, y_test = ...

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = ...
rescaledX_train = ...
rescaledX_test = ...

## 9. Fitting a logistic regression model to the train set
<p>Essentially, predicting if a credit card application will be approved or not is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved. </p>
<p>This gives us a benchmark. A good machine learning model should clearly beat the 55.5% accuracy that would be obtained by always predicting "Denied".</p>
<p>Which model should we pick? A question to ask is: <em>are the features that affect the credit card approval decision process correlated with each other?</em> Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).</p>
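<p>As a rough illustration, the sketch below fits scikit-learn's <code>LogisticRegression</code> with default parameters to a tiny invented dataset. The single feature and the labels (1 for approved, 0 for denied) are assumptions made only for this example.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, already rescaled training data with a single feature.
rescaledX_train = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Instantiate the classifier with default parameter values and fit it.
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)

# Predict the class of two new, unseen points.
predictions = logreg.predict(np.array([[0.05], [0.95]]))
print(predictions)
```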

In [None]:
# Import LogisticRegression
# ... YOUR CODE FOR TASK 9 ...

# Instantiate a LogisticRegression classifier with default parameter values
logreg = ...

# Fit logreg to the train set
# ... YOUR CODE FOR TASK 9 ...

## 10. Making predictions and evaluating performance
<p>But how well does our model perform? </p>
<p>We will now evaluate our model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. We will also take a look at the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is important to see whether our machine learning model predicts approved and denied statuses equally well, in line with the frequency of these labels in our original dataset. If our model does not perform well in this respect, it might end up approving an application that should have been denied. The confusion matrix helps us view our model's performance from these aspects.</p>
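<p>To make the two metrics concrete, the sketch below computes them for a handful of invented labels and predictions (0 for denied, 1 for approved). In the notebook, <code>y_pred</code> would come from the fitted model instead.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions.
y_test = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Rows correspond to true classes and columns to predicted classes,
# so cm[0, 0] counts true negatives and cm[1, 1] counts true positives.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

<p>Here one denied application is wrongly predicted as approved, which shows up as the off-diagonal entry in the first row.</p>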

In [None]:
# Import confusion_matrix
# ... YOUR CODE FOR TASK 10 ...

# Use logreg to predict instances from the test set and store it
y_pred = ...

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", ...)

# Print the confusion matrix of the logreg model
# ... YOUR CODE FOR TASK 10 ...

## 11. Grid searching and making the model perform better
<p>Our model was pretty good! In fact, it yielded an accuracy score of 100%.</p>
<p>In the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances (denied applications) predicted correctly by the model. The last element of the second row denotes the true positives, meaning the number of positive instances (approved applications) predicted correctly by the model.</p>
<p>But if we hadn't got a perfect score, what could be done? We can perform a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.</p>
<p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn's implementation of logistic regression</a> exposes several hyperparameters, but we will grid search over the following two:</p>
<ul>
<li>tol</li>
<li>max_iter</li>
</ul>
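<p>A parameter grid for these two hyperparameters can be sketched as a plain dictionary. The candidate values below are assumptions picked for illustration, not values prescribed by the notebook.</p>

```python
# Hypothetical candidate values for each hyperparameter.
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Keys must match the parameter names of the estimator being tuned.
param_grid = dict(tol=tol, max_iter=max_iter)

# A grid search tries every combination of the listed values.
n_combinations = len(tol) * len(max_iter)
print(param_grid, n_combinations)
```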

In [None]:
# Import GridSearchCV
# ... YOUR CODE FOR TASK 11 ...

# Define the grid of values for tol and max_iter
tol = ...
max_iter = ...

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(..., ...)

## 12. Finding the best performing model
<p>We have defined the grid of hyperparameter values and converted them into a single dictionary format which <code>GridSearchCV()</code> expects as one of its parameters. Now, we will begin the grid search to see which values perform best.</p>
<p>We will instantiate <code>GridSearchCV()</code> with our earlier <code>logreg</code> model and the parameter grid, and then fit it to the data we have. We will also instruct <code>GridSearchCV()</code> to perform a five-fold <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a>.</p>
<p>We'll end the notebook by storing the best-achieved score and the respective best parameters.</p>
<p>While building this credit card predictor, we tackled some of the most widely known preprocessing steps, such as <strong>scaling</strong>, <strong>categorical encoding</strong>, and <strong>missing value imputation</strong>. We finished with some <strong>machine learning</strong> to predict whether a person's application for a credit card would be approved, given some information about that person.</p>
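<p>The whole search can be sketched end to end on invented data, as below. The toy features, the even/odd train-test split, and the grid values are all assumptions made for this example. <code>best_score_</code>, <code>best_params_</code> and <code>best_estimator_</code> are the attributes <code>GridSearchCV</code> exposes after fitting.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical data: 40 samples with two features in [0, 1];
# the label depends only on the first feature.
X = np.column_stack([np.linspace(0, 1, 40), np.tile([0.0, 1.0], 20)])
y = (X[:, 0] > 0.5).astype(int)

# Simple deterministic split: even rows for training, odd rows for testing.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# Hypothetical grid of candidate values.
param_grid = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}

# Five-fold cross-validated search over every parameter combination.
grid_model = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, cv=5)
grid_model_result = grid_model.fit(X_train, y_train)

# Best mean cross-validation score and the parameters that achieved it.
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# The best estimator is refitted on the whole training set by default.
best_model = grid_model_result.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Accuracy of logistic regression classifier: ", test_accuracy)
```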

In [None]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=..., param_grid=..., cv=...)

# Fit grid_model to the data
grid_model_result = grid_model.fit(...)

# Summarize results
best_score, best_params = grid_model_result...., grid_model_result....
print("Best: %f using %s" % (...))

# Extract the best model and evaluate it on the test set
best_model = ...
print("Accuracy of logistic regression classifier: ", ...)