# Predicting Credit Card Approvals

Build a machine learning model to predict if a credit card application will get approved.

## Project Description

Commercial banks receive a lot of applications for credit cards. Many of them get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this project, you will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.

The dataset used in this project is the <a href='http://archive.ics.uci.edu/ml/datasets/credit+approval'>Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

### Project Tasks

1. Credit card applications
2. Inspecting the applications
3. Handling the missing values (part i)
4. Handling the missing values (part ii)
5. Handling the missing values (part iii)
6. Preprocessing the data (part i)
7. Splitting the dataset into train and test sets
8. Preprocessing the data (part ii)
9. Fitting a logistic regression model to the train set
10. Making predictions and evaluating performance
11. Grid searching and making the model perform better
12. Finding the best performing model

# Task 1: Credit card applications

**Instructions**
Load and look at the dataset.

- Import the pandas library under the alias pd.
- Load the dataset, "datasets/cc_approvals.data", into a pandas DataFrame called cc_apps. Set the header argument to None.
- Print the first 5 rows of cc_apps using the head() method.
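
As a minimal sketch of what `header=None` does, the snippet below reads two made-up rows from an in-memory string instead of the real file. The values and the variable names `raw` and `sample` are illustrative assumptions, not part of the project data.

```python
import io

import pandas as pd

# Two made-up, header-less rows in a comma-separated layout (illustrative values only)
raw = io.StringIO("b,30.83,0.0,+\na,58.67,4.46,-\n")

# header=None tells pandas that the first line is data, not column names,
# so the columns are labeled with integers starting at 0
sample = pd.read_csv(raw, header=None)

print(sample.shape)
print(list(sample.columns))
```

Without `header=None`, pandas would treat the first application as the header row and the DataFrame would contain one row fewer.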

_________________________________________
**Good to know**

For this project, it is recommended that you know basic Python programming, the pandas and numpy packages, some data preprocessing, and a little bit of machine learning. Here are some resources that may be helpful throughout the project:

- For a quick introduction to Python:
    - <a href='https://www.datacamp.com/courses/introduction-to-python'>DataCamp's Introduction to Python course</a>
- For learning the basics of the pandas and numpy packages:
    - <a href='https://www.datacamp.com/courses/data-manipulation-with-pandas'>Data Manipulation with pandas</a>
    - <a href='https://www.datacamp.com/community/blog/python-pandas-cheat-sheet'>pandas Cheatsheet</a>
    - <a href='https://www.datacamp.com/community/blog/python-numpy-cheat-sheet'>NumPy Cheat Sheet</a>
- For data preprocessing:
    - <a href='https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-1-centering-scaling-and-knn'>Preprocessing in Data Science (Part 1)</a>
    - <a href='https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-2-centering-scaling-and-logistic-regression'>Preprocessing in Data Science (Part 2)</a>
    - <a href='https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-3-scaling-synthesized-data'>Preprocessing in Data Science (Part 3)</a>
- For machine learning:
    - <a href='https://developers.google.com/machine-learning/crash-course/'>Google's Machine Learning Crash Course</a>
    - <a href='https://www.datacamp.com/courses/supervised-learning-with-scikit-learn'>Supervised Learning with scikit-learn</a>

Apart from the above, we encourage you to use your preferred search engine to find other useful resources.
_________________________________________

> Commercial banks receive a lot of applications for credit cards. Many of them get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.

<img src='image/credit_card.jpg' width=50%/>

In [1]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
print(cc_apps.shape)
cc_apps.head()

(690, 16)


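|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |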


In [2]:
cc_apps[15].value_counts()

-    383
+    307
Name: 15, dtype: int64

# Task 2: Inspecting the applications

**Instructions**

Inspect the structure, numerical summary, and specific rows of the dataset.

- Extract the summary statistics of the data using the describe() method of cc_apps.
- Use the info() method of cc_apps to get more information about the DataFrame.
- Print the last 17 rows of cc_apps using the tail() method to display missing values.

_________________________________________
**Helpful links:**

pandas tail() method <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html'>documentation</a>
_________________________________________

>The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but <a href='http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html'>this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus. This gives us a pretty good starting point, and we can map these features to the columns in the output.
>
>As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.
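
One way to see the mix of numerical and non-numerical features is to ask pandas which columns it parsed as numbers. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration); it also shows why a column such as Age can end up non-numeric when a single placeholder string like `'?'` is present.

```python
import pandas as pd

# Made-up rows; the '?' in Age forces pandas to keep the whole column as text
toy = pd.DataFrame({
    "Gender": ["b", "a", "a"],
    "Age": ["30.83", "?", "24.50"],
    "Debt": [0.0, 4.46, 0.5],
    "CreditScore": [1, 6, 0],
})

# Columns pandas treats as numbers versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```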

### Attribute Information:
    A1:	b, a.
    A2:	continuous.
    A3:	continuous.
    A4:	u, y, l, t.
    A5:	g, p, gg.
    A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
    A7:	v, h, bb, j, n, z, dd, ff, o.
    A8:	continuous.
    A9:	t, f.
    A10:	t, f.
    A11:	continuous.
    A12:	t, f.
    A13:	g, p, s.
    A14:	continuous.
    A15:	continuous.
    A16: +,-

In [3]:
# Print summary statistics
cc_apps.columns = [
    'Gender', 'Age', 'Debt', 'Married', 
    'BankCustomer', 'EducationLevel', 'Ethnicity', 'YearsEmployed', 
    'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 
    'Citizen', 'ZipCode', 'Income', 'ApprovalStatus'
]
cc_apps_description = cc_apps.describe()
display(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
display(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.tail(17)

|   | Debt | YearsEmployed | CreditScore | Income |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefault    690 non-null    object 
 9   Employed        690 non-null    object 
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    object 
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    object 
 14  Income          690 non-null    int64  
 15  ApprovalStatus  690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


None





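|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |
| 680 | b | 19.5 | 0.29 | u | g | k | v | 0.29 | f | f | 0 | f | g | 280 | 364 | - |
| 681 | b | 27.83 | 1.0 | y | p | d | h | 3.0 | f | f | 0 | f | g | 176 | 537 | - |
| 682 | b | 17.08 | 3.29 | u | g | i | v | 0.335 | f | f | 0 | t | g | 140 | 2 | - |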


# Task 3: Handling the missing values (part i)

**Instructions**

Inspect the missing values in the dataset and replace the question marks with NaN.

- Import the numpy library under the alias np.
- Print the last 17 rows of the dataset.
- Replace the '?'s with NaNs using the replace() method.
- Print the last 17 rows of cc_apps using the tail() method to confirm that the replace() method performed as expected.

_________________________________________
**Helpful links:**

- pandas replace() method <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html'>documentation</a>
- NumPy data types for <a href='https://docs.scipy.org/doc/numpy-1.13.0/user/misc.html'>special values</a>
_________________________________________

> We've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:
>
>    - Our dataset contains both numeric and non-numeric data (of float64, int64 and object types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.
>    - The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 0 - 67, and some have a range of 0 - 100000. Apart from these, we can get useful statistical information (like mean, max, and min) about the features that have numerical values.
>    - Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.
>
>Now, let's temporarily replace these missing value question marks with NaN.
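
The replacement step can be sketched on a small made-up frame as follows. The frame `toy` and its values are illustrative assumptions; only the `replace()` call mirrors what the task asks for.

```python
import numpy as np
import pandas as pd

# Made-up rows where '?' stands for a missing entry
toy = pd.DataFrame({
    "Gender": ["b", "?", "a"],
    "ZipCode": ["00202", "00043", "?"],
    "Debt": [0.0, 4.46, 0.5],
})

# Swap every '?' for NaN so that pandas recognizes the entries as missing
cleaned = toy.replace("?", np.nan)

# isnull() only counts NaN, so the '?' entries are now visible as missing
print(cleaned.isnull().sum())
```

Once the placeholders are NaN, methods such as `isnull()` and `fillna()` can find and fill them, which a literal `'?'` string would not allow.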

<code>
# Import numpy
import numpy as np
# Filtering only non numerical columns
df_str_columns = cc_apps.select_dtypes(exclude=['int64', 'float64']) # exclude or include
#display(df_str_columns.head()) 
# How many fields with '?'
missing_values = df_str_columns.apply(lambda col: col.str.contains('?', regex=False)).sum()
display(missing_values[missing_values>0])
print(missing_values.sum().sum(), 'rows in total')
mask = np.column_stack([(cc_apps[col] == '?').values for col in df_str_columns])
cc_apps.loc[mask.any(axis=1)].head()
</code>

In [4]:
# Filtering only non numerical columns
df_str_columns = cc_apps.select_dtypes(exclude=['int64', 'float64']) # exclude or include

# How many fields with '?'
missing_values = (df_str_columns == '?').sum()
display(missing_values[missing_values>0])
print(missing_values.sum().sum(), 'rows in total')

#cc_apps[1] = cc_apps[1].astype(float)
cc_apps.loc[(df_str_columns == '?').any(axis=1)]

Gender            12
Age               12
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
ZipCode           13
dtype: int64

67 rows in total


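|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 71 | b | 34.83 | 4.0 | u | g | d | bb | 12.5 | t | f | 0 | t | g | ? | 0 | - |
| 83 | a | ? | 3.5 | u | g | d | v | 3.0 | t | f | 0 | t | g | 00300 | 0 | - |
| 86 | b | ? | 0.375 | u | g | d | v | 0.875 | t | f | 0 | t | s | 00928 | 0 | - |
| 92 | b | ? | 5.0 | y | p | aa | v | 8.5 | t | f | 0 | f | g | 00000 | 0 | - |
| 97 | b | ? | 0.5 | u | g | c | bb | 0.835 | t | f | 0 | t | s | 00320 | 0 | - |
| 202 | b | 24.83 | 2.75 | u | g | c | v | 2.25 | t | t | 6 | f | g | ? | 600 | + |
| 206 | a | 71.58 | 0.0 | ? | ? | ? | ? | 0.0 | f | f | 0 | f | p | ? | 0 | + |
| 243 | a | 18.75 | 7.5 | u | g | q | v | 2.71 | t | t | 5 | f | g | ? | 26726 | + |
| 248 | ? | 24.50 | 12.75 | u | g | c | bb | 4.75 | t | t | 2 | f | g | 00073 | 444 | + |
| 254 | b | ? | 0.625 | u | g | k | v | 0.25 | f | f | 0 | f | g | 00380 | 2010 | - |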


In [5]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset
display(cc_apps.tail(17))

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)

# Inspect the missing values again
cc_apps.tail(17)

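|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |
| 680 | b | 19.5 | 0.29 | u | g | k | v | 0.29 | f | f | 0 | f | g | 280 | 364 | - |
| 681 | b | 27.83 | 1.0 | y | p | d | h | 3.0 | f | f | 0 | f | g | 176 | 537 | - |
| 682 | b | 17.08 | 3.29 | u | g | i | v | 0.335 | f | f | 0 | t | g | 140 | 2 | - |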


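|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | NaN | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |
| 680 | b | 19.5 | 0.29 | u | g | k | v | 0.29 | f | f | 0 | f | g | 280 | 364 | - |
| 681 | b | 27.83 | 1.0 | y | p | d | h | 3.0 | f | f | 0 | f | g | 176 | 537 | - |
| 682 | b | 17.08 | 3.29 | u | g | i | v | 0.335 | f | f | 0 | t | g | 140 | 2 | - |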


# Task 4: Handling the missing values (part ii)

**Instructions**

Impute the NaN values with the mean imputation approach.

- For the numeric columns, impute the missing values (NaNs) with pandas method fillna().
- Verify if the fillna() method performed as expected by printing the total number of NaNs in each column.

Remember that you have already marked all the question marks as NaNs. pandas provides fillna() to help you impute missing values with different strategies, mean imputation being one of them. pandas also has a mean() method to calculate the mean of a DataFrame. As your dataset contains both numeric and non-numeric data, for this task you will only impute the missing values (NaNs) present in the columns having numeric data-types (columns 2, 7, 10 and 14).

_________________________________________
**Helpful links:**

- mean imputation <a href='https://machinelearningmastery.com/handle-missing-data-python/'>tutorial</a>
- pandas fillna() method <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html'>documentation</a>
- pandas mean() method <a href='https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.mean.html'>documentation</a>
- pandas isnull() method <a href='https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html'>documentation</a>
_________________________________________

>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.
>
>An important question arises here: why are we giving so much importance to missing values? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model. If we ignore them, our model may miss out on information about the dataset that would be useful for its training. In addition, many models, such as LDA, cannot handle missing values on their own.
>
>So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.
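
A minimal sketch of mean imputation on a made-up frame is shown below. The frame `toy` is an assumption for illustration. Passing `numeric_only=True` restricts the mean to numeric columns, which recent pandas versions require when text columns are present.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column, each with a gap
toy = pd.DataFrame({
    "Debt": [1.0, np.nan, 3.0],
    "Married": ["u", np.nan, "y"],
})

# Column means of the numeric columns only; NaN entries are skipped
col_means = toy.mean(numeric_only=True)

# fillna() matches the means to columns by name, so text columns are left alone
filled = toy.fillna(col_means)

print(filled)
```

The gap in `Debt` is filled with the mean of the observed values, while the gap in `Married` remains, which is why the non-numeric columns need a separate treatment.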

In [6]:
# Observing mean values per numerical columns
display(cc_apps.mean(numeric_only=True))

Debt                4.758725
YearsEmployed       2.223406
CreditScore         2.400000
Income           1017.385507
dtype: float64

In [7]:
# Impute the missing values with mean imputation
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)

# Count the number of NaNs in the dataset to verify
cc_apps.isnull().sum()

Gender            12
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
ApprovalStatus     0
dtype: int64

# Task 5: Handling the missing values (part iii)

**Instructions**

Impute the missing values in the non-numeric columns.

- Iterate over each column of cc_apps using a for loop.
- Check if the data-type of the column is of object type by using the dtypes attribute.
- Using the fillna() method, impute the column's missing values with the most frequent value of that column with the value_counts() method and index attribute and assign it to cc_apps.
- Finally, verify if there are any more missing values in the dataset that are left to be imputed by printing the total number of NaNs in each column.

_________________________________________
The column names of a pandas DataFrame can be accessed using columns attribute. The dtypes attribute provides the data type. In this part, object is the data type that you should be concerned about. The value_counts() method returns the frequency distribution of each value in the column, and the index attribute can then be used to get the most frequent value.

**Helpful links:**

- pandas value_counts() method documentation
- Accessing the index attribute in a tutorial
- Method chaining with pandas tutorial
_________________________________________

>We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data, and this is why the mean imputation strategy would not work here. This needs a different treatment.
>
>We are going to impute these missing values with the most frequent values as present in the respective columns. This is <a href='https://www.datacamp.com/community/tutorials/categorical-data'>good practice</a> when it comes to imputing missing values for categorical data in general.
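
Most-frequent-value imputation can be sketched per column as follows. The frame `toy` is made up, and the check via `pd.api.types.is_numeric_dtype` is an assumption chosen here so the sketch does not depend on how a given pandas version labels text columns; the project itself checks for the object dtype.

```python
import numpy as np
import pandas as pd

# Made-up frame: one text column with a gap, one complete numeric column
toy = pd.DataFrame({
    "Married": ["u", "u", np.nan, "y"],
    "Debt": [1.0, 2.0, 3.0, 4.0],
})

for col in toy.columns:
    # Only non-numeric columns get the most-frequent-value treatment
    if not pd.api.types.is_numeric_dtype(toy[col]):
        # value_counts() sorts by frequency and ignores NaN, so index[0] is the mode
        most_frequent = toy[col].value_counts().index[0]
        # Assign back to the same column so other columns are not touched
        toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

Filling column by column matters: calling `fillna()` on the whole DataFrame with one column's most frequent value would write that value into the gaps of every other column as well.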

<code>
# Iterate over each column of cc_apps
for col in cc_apps:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])
# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isnull().sum()
</code>

In [8]:
# Iterate over each column of cc_apps
cc_apps = cc_apps.apply(lambda col: col.fillna(col.value_counts().index[0]) if col.dtypes=='object' else col)

# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isnull().sum()

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
ApprovalStatus    0
dtype: int64

# Task 6: Preprocessing the data (part i)

**Instructions**

Convert the non-numeric values to numeric.

- Import the LabelEncoder class from sklearn.preprocessing module.
- Instantiate LabelEncoder() into a variable le.
- Iterate over all the values of each column cc_apps and check their data types using a for loop.
- If the data type is found to be of object type, label encode it to transform into numeric (such as int64) type.

_________________________________________
The values of each column of a pandas DataFrame can be accessed using .columns and .to_numpy(). The dtypes attribute provides the data type. In this part, object is the data type that you should be concerned about.

**Helpful links:**

- Checking data types of the columns in a DataFrame <a href='https://stackoverflow.com/questions/40353079/pandas-how-to-check-dtype-for-all-columns-in-a-dataframe'>Stack Overflow answer</a>
- sklearn LabelEncoder class <a href='http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html'>documentation</a>
_________________________________________

>The missing values are now successfully handled.
>
>There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:
>
>1. Convert the non-numeric data into numeric.
>2. Split the data into train and test sets.
>3. Scale the feature values to a uniform range.
>
>First, we will be converting all the non-numeric values into numeric ones. We do this not only because it results in faster computation but also because many machine learning models (like XGBoost, and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called <a href='http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html'>label encoding</a>.
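
To illustrate how label encoding assigns integers, here is a small sketch on a made-up list of approval symbols (the list `status` is an assumption for illustration). `LabelEncoder` sorts the distinct values and numbers them from 0.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up approval symbols in the same notation as the dataset
status = ["+", "-", "-", "+"]

le = LabelEncoder()
encoded = le.fit_transform(status)

# classes_ holds the sorted distinct values; a value's position is its code
print(list(le.classes_))
print(encoded.tolist())
```

Because `'+'` sorts before `'-'`, approved applications are encoded as 0 and denied ones as 1. This is worth keeping in mind when reading the confusion matrix later on.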

<code>
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Instantiate LabelEncoder
le = LabelEncoder()
# Iterate over all the values of each column and extract their dtypes
for col in cc_apps:
    # Compare if the dtype is object
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to do the numeric transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])
</code>

In [9]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
cc_apps = cc_apps.apply(lambda col: le.fit_transform(col) if col.dtypes=='object' else col)
cc_apps.head()

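|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 156 | 0.0 | 1 | 0 | 12 | 7 | 1.25 | 1 | 1 | 1 | 0 | 0 | 68 | 0 | 0 |
| 1 | 0 | 328 | 4.46 | 1 | 0 | 10 | 3 | 3.04 | 1 | 1 | 6 | 0 | 0 | 11 | 560 | 0 |
| 2 | 0 | 89 | 0.5 | 1 | 0 | 10 | 3 | 1.5 | 1 | 0 | 0 | 0 | 0 | 96 | 824 | 0 |
| 3 | 1 | 125 | 1.54 | 1 | 0 | 12 | 7 | 3.75 | 1 | 1 | 5 | 1 | 0 | 31 | 3 | 0 |
| 4 | 1 | 43 | 5.625 | 1 | 0 | 12 | 7 | 1.71 | 1 | 0 | 0 | 0 | 2 | 37 | 0 | 0 |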


# Task 7: Splitting the dataset into train and test sets

**Instructions**

Split the preprocessed dataset into train and test sets.

- Import train_test_split from the sklearn.model_selection module.
- Drop features 11 and 13 using the drop() method and convert the DataFrame to a NumPy array using .to_numpy().
- Segregate the features and labels into X and y (the column with index 13 is the label column).
- Using the train_test_split() method, split the data into train and test sets with a split ratio of 33% (test_size argument) and set the random_state argument to 42.

_________________________________________
A NumPy array can be segregated using array slicing. Before slicing, take note of the total number of columns that should be present in the array after dropping features 11 and 13.

Setting random_state ensures the dataset is split into the same sets of instances every time the code is run.

**Helpful links:**

- pandas drop() method <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html'>documentation</a>
- NumPy indexing and slicing <a href='https://www.tutorialspoint.com/numpy/numpy_indexing_and_slicing.htm'>tutorial</a>
- sklearn train_test_split() <a href='https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html'>method documentation</a>
_________________________________________

> We have successfully converted all the non-numeric values to numeric ones.
>
>Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.
>
>Also, features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection. 
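
As a rough sketch of what the split produces, the snippet below applies `train_test_split` to randomly generated placeholder data with the same number of rows as the dataset. The arrays `X` and `y` here are assumptions for illustration and carry no real information.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 690 rows like the dataset, 12 made-up feature columns
rng = np.random.default_rng(0)
X = rng.random((690, 12))
y = rng.integers(0, 2, size=690)

# Hold out 33% of the rows for testing; random_state fixes which rows those are
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 690 rows and `test_size=0.33`, the test set holds 228 rows and the training set 462. Any scaler or model is then fitted on the 462 training rows only.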

In [10]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop(['DriversLicense', 'ZipCode'], axis=1)
cc_apps = cc_apps.to_numpy()

# Segregate features and labels into separate variables
# Note: the slice 0:12 selects columns 0-11, so column 12 (Income) is not part of X
X, y = cc_apps[:, 0:12], cc_apps[:, 13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Task 8: Preprocessing the data (part ii)

**Instructions**

Rescale the training and test data.

- Import the MinMaxScaler class from the sklearn.preprocessing module.
- Instantiate MinMaxScaler class in a variable called scaler with the feature_range parameter set to (0,1).
- Fit the scaler to X_train and transform the data, assigning the result to rescaledX_train.
- Use the scaler to transform X_test, assigning the result to rescaledX_test.

_________________________________________
When a dataset has varying ranges, as in this credit card approvals dataset, a small change in one feature may not have a significant effect compared with the same change in another feature, which can cause a lot of problems in predictive modeling.

**Helpful links:**

- sklearn's MinMaxScaler class <a href='http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html'>documentation</a>
_________________________________________


>The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.
>
>Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1.
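
The rescaling can be sketched on a single made-up feature. `MinMaxScaler` maps each value x to (x - min) / (max - min), where min and max come from the data the scaler was fitted on. The arrays `train` and `test` below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up feature: the training values span 0 to 10
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform learns min and max from the training data only
train_scaled = scaler.fit_transform(train)

# transform reuses the training min and max, so no test information leaks in
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that a test value larger than the training maximum (12.0 here) is mapped above 1. That is expected when the scaler is fitted on the training set alone.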

In [11]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

# Task 9: Fitting a logistic regression model to the train set

**Instructions**

Fit a LogisticRegression classifier with rescaledX_train and y_train.

- Import LogisticRegression from the sklearn.linear_model module.
- Instantiate LogisticRegression into a variable named logreg with default values.
- Fit rescaledX_train and y_train to logreg using the fit() method.

_________________________________________
If a quick refresher on logistic regression's working mechanism is needed, check out this <a href='https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python'>tutorial</a>.

**Helpful links:**

- sklearn Logistic Regression <a href='https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>documentation</a>
_________________________________________

>Essentially, predicting if a credit card application will be approved or not is a <a href='https://en.wikipedia.org/wiki/Statistical_classification'>classification</a> task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.
>
>This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.
>
>Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).
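
One way to turn the class counts into a concrete benchmark is the majority-class baseline: the accuracy of a model that always predicts the more common class. Reading the benchmark this way is an assumption of this sketch, and the variable names are hypothetical.

```python
# Class counts reported for the dataset
denied, approved = 383, 307
total = denied + approved

# Accuracy of always predicting the more frequent class
baseline_accuracy = max(denied, approved) / total

print(total, round(baseline_accuracy, 3))
```

A classifier is only useful if it clearly beats this baseline of roughly 55.5% accuracy.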

In [12]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg = logreg.fit(rescaledX_train, y_train)

# Task 10: Making predictions and evaluating performance

**Instructions**

Make predictions and evaluate performance.

- Import confusion_matrix() from sklearn.metrics module.
- Use predict() on rescaledX_test (which contains instances of the dataset that logreg has not seen until now) and store the predictions in a variable named y_pred.
- Print the accuracy score of logreg using the score() method. Don't forget to pass rescaledX_test and y_test to the score() method.
- Call confusion_matrix() with y_test and y_pred to print the confusion matrix.

_________________________________________
**Helpful links:**

- sklearn confusion matrix <a href='https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html'>documentation</a>
_________________________________________

>But how well does our model perform?
>
>We will now evaluate our model on the test set with respect to <a href='https://developers.google.com/machine-learning/crash-course/classification/accuracy'>classification accuracy</a>. But we will also take a look at the model's <a href='http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/'>confusion matrix</a>. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict as denied the applications that originally got denied. If our model is not performing well in this aspect, then it might end up approving an application that should have been denied. The confusion matrix helps us to view our model's performance from these aspects.
>
><img src='image/confusion_matrix_simple2.png' width=30%>

In [13]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier:  0.8421052631578947


array([[94,  9],
       [27, 98]], dtype=int64)
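
The reported accuracy can be recomputed from the confusion matrix above. This is a small arithmetic sketch: rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. The variable names are hypothetical.

```python
import numpy as np

# Confusion matrix as printed above (rows: true class, columns: predicted class)
cm = np.array([[94, 9],
               [27, 98]])

correct = np.trace(cm)   # sum of the diagonal, i.e. correct predictions
total = cm.sum()         # size of the test set
accuracy = correct / total

# Share of each true class that was predicted correctly (row-wise recall)
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(correct, total, accuracy)
print(recall_per_class)
```

The 192 correct predictions out of 228 test rows reproduce the printed accuracy, and the row-wise figures show that one class is recognized noticeably more often than the other.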

# Task 11: Grid searching and making the model perform better

**Instructions**

Define the grid of parameter values for which grid searching is to be performed.

- Import GridSearchCV from the sklearn.model_selection module.
- Store the grid of values for the tol and max_iter parameters in lists named tol and max_iter, respectively.
- For tol, define the list with values 0.01, 0.001 and 0.0001. For max_iter, define the list with values 100, 150 and 200.
- Using dict(), create a dictionary where tol and max_iter are the keys and the lists of their values are the corresponding values. Name this dictionary param_grid.

_________________________________________
Grid search can be computationally expensive if the model is very complex and the dataset is extremely large. Luckily, that is not the case for this project.
_________________________________________

>Our model was pretty good! It was able to yield an accuracy score of about 84%.
>
>In the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances (denied applications) that the model predicted correctly. The last element of the second row denotes the true positives, meaning the number of positive instances (approved applications) that the model predicted correctly.
>
>Let's see if we can do better. We can perform a <a href='https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/'>grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.
>
><a href='http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>scikit-learn's implementation of logistic regression</a> has several hyperparameters, but we will grid search over the following two:
>
>- tol
>- max_iter

In [14]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)
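
To make the size of this grid concrete, the combinations can be enumerated by hand. This is a minimal sketch using itertools.product from the standard library rather than scikit-learn's own machinery; the names keys, combinations, n_folds and n_fits are illustrative, and the fold count of 5 is taken from the cv = 5 setting used in Task 12.

```python
from itertools import product

# Same grid as defined in the cell above
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = dict(tol=tol, max_iter=max_iter)

# Build every combination of one value per key, as a list of dictionaries
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per cross-validation fold
n_folds = 5  # assumption: matches the cv = 5 setting used in Task 12
n_fits = len(combinations) * n_folds

for combo in combinations:
    print(combo)
print("Combinations:", len(combinations), "| cross-validation fits:", n_fits)
```

The number of fits grows multiplicatively with each added parameter list, which is why larger grids become costly on complex models or large datasets.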

# Task 12: Finding the best performing model

**Instructions**

Find the best score and best parameters for the model using grid search.

- Instantiate GridSearchCV() with the arguments estimator = logreg, param_grid = param_grid and cv = 5, and store this instance in the grid_model variable.
- Use scaler (which you created in Task 8) to rescale X and assign the result to rescaledX.
- Fit rescaledX and y to grid_model and store the results in grid_model_result.
- Call the best_score_ and best_params_ attributes on the grid_model_result variable, then print both.

_________________________________________
Grid searching is the process of finding an optimal set of values for the parameters of a machine learning model. This is often known as hyperparameter optimization, which is an active area of research. Note that we have used the words parameters and hyperparameters interchangeably here, but they are not exactly the same: hyperparameters are set before training, while parameters are learned from the data during training.

**Helpful links:**

Hyperparameter Optimization in Machine Learning Models <a href='https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models?tap_a=5644-dce66f&tap_s=3575'>tutorial</a>
_________________________________________

>We have defined the grid of hyperparameter values and converted them into a single dictionary format which GridSearchCV() expects as one of its parameters. Now, we will begin the grid search to see which values perform best.
>
>We will instantiate GridSearchCV() with our earlier logreg model and fit it on all the data we have. Instead of passing train and test sets separately, we will supply X (scaled version) and y. We will also instruct GridSearchCV() to perform a <a href='https://www.dataschool.io/machine-learning-with-scikit-learn/'>cross-validation</a> of five folds.
>
>We'll end the notebook by storing the best-achieved score and the respective best parameters.
>
>While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as scaling, label encoding, and missing value imputation. We finished with some machine learning to predict if a person's application for a credit card would get approved or not given some information about that person.

In [15]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.852174 using {'max_iter': 100, 'tol': 0.01}
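
The cell above rescales all of X once before cross-validation. A common variant, shown here only as a sketch and not as the project's method, wraps the scaler and the model in a scikit-learn Pipeline so that scaling is refit on the training part of each fold. Everything specific in it is an assumption: the data is synthetic (generated with make_classification as a stand-in for X and y), MinMaxScaler stands in for the scaler from Task 8, and the names pipe, param_grid_demo and grid_demo are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the credit card features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Chain scaling and the classifier so both are refit inside every fold
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid_demo = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

# Same five-fold grid search as above, applied to the whole pipeline
grid_demo = GridSearchCV(estimator=pipe, param_grid=param_grid_demo, cv=5)
grid_demo.fit(X_demo, y_demo)

print("Best: %f using %s" % (grid_demo.best_score_, grid_demo.best_params_))
```

With this layout the held-out fold never influences the scaling step, at the cost of slightly more computation per fit.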



# Additional material

- DataCamp course:
    - https://learn.datacamp.com/projects/558
    - https://projects.datacamp.com/projects/558