# Using Machine Learning to Predict a Category (Part 2)

Recall:
<br>
You are an analyst for a credit card company, and you have obtained customer data that is stored on the Math@Work server.  You think that the first 23 columns of this dataset are good for predicting the last column, DEFAULT.  That is, you think you can use this data to predict whether a future customer will default on a credit card.

**Here is your data import:**

In [3]:
import pandas as pd
cc_default = pd.read_csv('https://mathatwork.org/DATA/cc-default.csv')
print(cc_default.head())

   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0      20000    2          2         1   24      2      2     -1     -1   
1     120000    2          2         2   26     -1      2      0      0   
2      90000    2          2         2   34      0      0      0      0   
3      50000    2          2         1   37      0      0      0      0   
4      50000    1          2         1   57     -1      0     -1      0   

   PAY_5   ...     BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  \
0     -2   ...             0          0          0         0       689   
1      0   ...          3272       3455       3261         0      1000   
2      0   ...         14331      14948      15549      1518      1500   
3      0   ...         28314      28959      29547      2000      2019   
4      0   ...         20940      19146      19131      2000     36681   

   PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  DEFAULT  
0         0         0         0         0        1 

**Here are your training and test subsets:**

In [4]:
from sklearn.model_selection import train_test_split
X = cc_default.loc[:,'LIMIT_BAL':'PAY_AMT6']
y = cc_default.loc[:,'DEFAULT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

**Here is your modeling using Logistic Regression:**

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)

y_predicted = logistic_regression.predict(X_test)
eval_metric = accuracy_score(y_pred = y_predicted, y_true = y_test)
print(eval_metric)

0.774166666667


Notice that the accuracy here may be a little different each time you run your model.  This is because we are not splitting the data the same way every time.  The 80/20 split into test and training subsets is done randomly.  Nonetheless, it is still close to what we had from part 1.
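If you would like the same split (and therefore the same accuracy) on every run, `train_test_split` accepts a `random_state` argument. The sketch below illustrates this on a small synthetic dataset rather than the credit card data; the names `X_demo` and `y_demo` and the seed value 42 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card data (assumption: 100 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Passing the same random_state twice reproduces the same 80/20 split.
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Element 1 of each result is the test portion of X.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)                       # True
print(split_a[0].shape, split_a[1].shape)   # (80, 3) (20, 3)
```

Leaving `random_state` out, as in the cell above, gives a fresh random split each time.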

To test another classification model, all we need to do is replace some of the code in the modeling step above.  We do NOT need to import the data again or re-split it into test and training subsets.
<br><br>
**STEP 1:**  Decide on an algorithm.
<br>
Suppose you want to compare the Logistic Regression model accuracy with that of a `Classification Tree` model. To do so, import the algorithm and then initialize it.

In [6]:
from sklearn.tree import DecisionTreeClassifier
class_tree = DecisionTreeClassifier()

**STEP 2:** Split your data into test and training subsets.
<br>
This was already done in the train/test split cell near the top of this notebook and therefore does not need to be done again.

**STEP 3:** Use the training data to train your model.

In [7]:
class_tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

**STEP 4:** Use the test data to evaluate the model.

In [9]:
y_predicted = class_tree.predict(X_test)
eval_metric2 = accuracy_score(y_pred = y_predicted, y_true = y_test)
print(eval_metric2)

0.727


The accuracy of the Classification Tree algorithm is 72.7%, which means that this model predicted credit card default correctly for 72.7% of the customers in the test subset.  This is a bit lower than the 77.4% accuracy of the Logistic Regression model.
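Because both models are trained and evaluated on the same split, the comparison can also be written as a loop. The following is only a sketch: it uses a synthetic dataset from `make_classification` in place of the credit card data, and the dictionary names `models` and `scores` are made up for this example, so the printed accuracies will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: 500 rows, 10 features).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Each model is fit on the same training rows and scored on the same test rows.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Classification Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_true=y_te, y_pred=model.predict(X_te))
    print(name, scores[name])
```

Keeping the split fixed means that any difference in accuracy comes from the algorithm, not from which customers happened to land in the test subset.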

### Exercise

Suppose for good measure you want to compare both Logistic Regression and Classification Tree model accuracies to that of a Naive Bayes model.  Run the following code to import and initialize your algorithm:

In [11]:
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()

The above code completed **STEP 1** of machine learning modeling.  Recall, we do not need to re-do **STEP 2**.
<br>
**STEP 3:** Use the training data to train your model. Do this in the next cell below.

**STEP 4:** Use the test data to evaluate the model.  Do this in the next cell below.

Which model had the highest accuracy? Explain below.

# Using the Best Model to Make a Prediction

**STEP 5:** Make a prediction.
<br>
This is the last step of this process.  Let's create data for a new customer using the same columns as in our X DataFrame.  We will then use that data to predict whether this new customer will default (column y).  
<br>
Here is our new customer:

In [29]:
data = [250000, 1, 1, 2, 29, 0, 0, 0, 0, 0, 0, 70887, 67060, 63561, 59696, 56875, 55512, 3000, 3000, 3000, 3000, 3000, 3000]

import numpy as np
customer = np.array(data).reshape((1,-1))
print(customer)

[[250000      1      1      2     29      0      0      0      0      0
       0  70887  67060  63561  59696  56875  55512   3000   3000   3000
    3000   3000   3000]]
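One thing to be aware of: the models above were trained on a pandas DataFrame with named columns, and recent versions of scikit-learn may print a warning about missing feature names when asked to predict on a plain NumPy array. A possible alternative is to build the new customer as a one-row DataFrame that reuses the training columns. The sketch below shows the idea on a tiny made-up table with only two of the columns; `X_small`, `y_small`, and the values in them are illustrative assumptions, not the real data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training table standing in for X and y (assumption, not real data).
X_small = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 250000, 30000],
    "AGE": [24, 26, 34, 37, 29, 45],
})
y_small = pd.Series([1, 1, 0, 0, 0, 1], name="DEFAULT")

model = LogisticRegression(max_iter=1000)
model.fit(X_small, y_small)

# Build the new customer with the same column names, in the same order.
new_values = [250000, 29]
customer_df = pd.DataFrame([new_values], columns=X_small.columns)
print(customer_df)

# The prediction is an array holding one label (0 or 1) for the one row.
prediction = model.predict(customer_df)
print(prediction)
```

With the full dataset, the same pattern would use all 23 values from `data` together with `X.columns`.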


We will assume that the Logistic Regression model was the best. Here is our prediction using Logistic Regression:

In [30]:
default = logistic_regression.predict(customer)
print(default)

[0]


The code above returned 0. Recall that in the DEFAULT column, 1 = Yes and 0 = No.  Therefore, our model predicts that a customer with this data will not default.  Keep in mind that the model was only about 77% accurate on the test data, so this prediction supports, but does not guarantee, a decision to provide a credit card to this individual.
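A predicted label of 0 or 1 hides how confident the model is. scikit-learn classifiers such as `LogisticRegression` also provide `predict_proba`, which returns an estimated probability for each class. The sketch below demonstrates it on synthetic data, so the numbers it prints say nothing about the customer above; `X_demo`, `y_demo`, and `one_row` are names chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (assumption: 300 rows, 5 features).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Take a single row, kept 2-D, to play the role of the new customer.
one_row = X_demo[:1]

label = model.predict(one_row)        # hard label: 0 or 1
proba = model.predict_proba(one_row)  # one probability per class, in classes_ order

print(model.classes_)
print(label, proba)
```

The columns of `proba` follow the order of `model.classes_`, and each row sums to 1. A probability close to 0.5 signals a much less certain prediction than one close to 0 or 1.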

### Exercise

Assume the Naive Bayes model had the best accuracy.  Use this model to make a prediction for the same customer.  In the cell below your analysis, explain what the model's result indicates about whether this customer is likely to default.