# Introduction to Scikit-learn for Classification Modeling

`Scikit-learn` is a Python library used for machine learning.  It features various classification and regression algorithms.  To use `Scikit-learn` to model data, you must import the desired algorithm and evaluation metric from the library. Evaluation metrics will be explained a bit later.
<br><br>
**For example:** Suppose we want to make a prediction using a Logistic Regression model with accuracy as our evaluation metric.  Then our imports would look like:

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

- The 1st import is a function for splitting our data randomly into train and test subsets.
- The 2nd import is our evaluation metric.
- The 3rd import is our Logistic Regression algorithm.

**Evaluation metrics** tell us how `good` our model is.  Though there are other types of evaluation metrics for classification models, we will only use `accuracy` as a measure of `goodness` of our model.
<br><br>
`Accuracy` is a measure of how often our model predicts correctly.  The higher the accuracy the better our model predicts. For example, if your accuracy measure is 0.778, this means:
<br>77.8 PERCENT OF THE TIME YOUR MODEL PREDICTS CORRECTLY.
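
To make the 0.778 figure concrete, here is a minimal sketch of how accuracy can be computed by hand. The labels below are made up for illustration (they are not from the credit card data), and the variable names are placeholders.

```python
# Hypothetical true labels and model predictions for 9 customers
# (1 = defaulted, 0 = did not default). These values are made up.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1, 0]

# Count the positions where the prediction matches the true label.
n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

# Accuracy = correct predictions / total predictions.
accuracy = n_correct / len(y_true)

print(n_correct, "of", len(y_true), "correct")  # 7 of 9 correct
print(round(accuracy, 3))                       # 0.778
```

The `accuracy_score` function imported earlier reports this same fraction, so you normally do not need to count matches yourself.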

It is important to note that once an algorithm is chosen and imported, it must be initialized before you can use it.  To initialize our Logistic Regression algorithm we do the following:

In [5]:
logistic_regression = LogisticRegression()

Here I am creating an instance of a Logistic Regression model and then assigning it to the variable  `logistic_regression` for use later.  There are arguments that can be passed into **LogisticRegression( )** to make your predictions better, but I will not discuss those.

Now let's go over an example problem to detail the steps of using machine learning for making predictions.

# Using Machine Learning to Predict a Category

You are an analyst for a credit card company. You obtain the following data regarding its customers:
<br>
- LIMIT_BAL: Amount of the credit given to a customer.
- SEX: Customer's gender (1 = male; 2 = female). 
- EDUCATION: Customer's highest level of education completed (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- MARRIAGE: Customer's marital status (1 = married; 2 = single; 3 = others). 
- AGE: Customer's age (year). 
- PAY_0, PAY_2 - PAY_6: Customer's payment history characteristics (1 = payment delay for one month; 2 = payment delay for two months; . . .; 9 = payment delay for nine months and above).
- BILL_AMT1 - BILL_AMT6: Amount of a customer's bill statement for 6 consecutive months. 
- PAY_AMT1 - PAY_AMT6: History of a customer's past credit card payments made towards the balance. 
- DEFAULT: Whether the customer defaulted on their payment (1 = Yes; 0 = No).
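
Since several of these columns store categories as number codes, it can help to translate the codes back into labels when reading the data. The sketch below does this on a tiny made-up DataFrame; the names `customers`, `decoded` and the `*_labels` dictionaries are placeholders, and the lookup tables are copied from the list above.

```python
import pandas as pd

# Toy rows that reuse the coding scheme listed above; the values are made up.
customers = pd.DataFrame({
    "SEX": [2, 1, 2],
    "EDUCATION": [2, 1, 3],
    "MARRIAGE": [1, 2, 3],
})

# Lookup tables copied from the column descriptions.
sex_labels = {1: "male", 2: "female"}
education_labels = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
marriage_labels = {1: "married", 2: "single", 3: "others"}

# Series.map() swaps each code for its label; unmapped codes would become NaN.
decoded = pd.DataFrame({
    "SEX": customers["SEX"].map(sex_labels),
    "EDUCATION": customers["EDUCATION"].map(education_labels),
    "MARRIAGE": customers["MARRIAGE"].map(marriage_labels),
})
print(decoded)
```

This decoding is only for reading the data. The model itself is trained on the numeric codes.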

You have stored this data as a CSV file on the Math@Work server.  Let's import this dataset into its own Pandas DataFrame named cc_default and then print the first 10 rows of our newly created DataFrame to the screen. Recall, we discussed importing data into a DataFrame in the Data Science workshop.

In [11]:
import pandas as pd
cc_default = pd.read_csv('https://mathatwork.org/DATA/cc-default.csv')
print(cc_default.head(10))

   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0      20000    2          2         1   24      2      2     -1     -1   
1     120000    2          2         2   26     -1      2      0      0   
2      90000    2          2         2   34      0      0      0      0   
3      50000    2          2         1   37      0      0      0      0   
4      50000    1          2         1   57     -1      0     -1      0   
5      50000    1          1         2   37      0      0      0      0   
6     500000    1          1         2   29      0      0      0      0   
7     100000    2          2         2   23      0     -1     -1      0   
8     140000    2          3         1   28      0      0      2      0   
9      20000    1          3         2   35     -2     -2     -2     -2   

   PAY_5   ...     BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  \
0     -2   ...             0          0          0         0       689   
1      0   ...          32

You decide that using the first 23 columns above will be sufficient to predict the last column, DEFAULT, for future customers.  That is, given this input data, we can create a model for predicting the credit card default of a brand new customer.  Will a future customer default?

**STEP 1:**  Decide on an algorithm.
<br>
You decide to make the prediction using a Logistic Regression model with accuracy as your evaluation metric. Note that Logistic Regression is an appropriate model because we want to predict a category: Default? Yes/No?
<br><br>
Here is your initial code:

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()

**STEP 2:** Split your data into test and training subsets.
<br><br>
We will do this using a handy function we imported from the `Scikit-learn` library, **train_test_split( )**. But first, we need to define X and y.  
- X is the set of features (columns) you are using to make the prediction
- y is the target, the column you are trying to predict
<br><br>
As stated above, your X is the first 23 columns and your y is the DEFAULT column.  Let's slice our DataFrame to get X.  Since this is a DataFrame, the process is a bit different from how you would slice a NumPy array (as discussed in the Data Science workshop):

In [16]:
X = cc_default.loc[:,'LIMIT_BAL':'PAY_AMT6']
print(X.head())

   LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0      20000    2          2         1   24      2      2     -1     -1   
1     120000    2          2         2   26     -1      2      0      0   
2      90000    2          2         2   34      0      0      0      0   
3      50000    2          2         1   37      0      0      0      0   
4      50000    1          2         1   57     -1      0     -1      0   

   PAY_5    ...     BILL_AMT3  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  \
0     -2    ...           689          0          0          0         0   
1      0    ...          2682       3272       3455       3261         0   
2      0    ...         13559      14331      14948      15549      1518   
3      0    ...         49291      28314      28959      29547      2000   
4      0    ...         35835      20940      19146      19131      2000   

   PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  
0       689         0         0       

We used **loc[ ]** on the `cc_default` DataFrame to slice all rows and all columns from LIMIT_BAL to PAY_AMT6 (both included) into a new DataFrame, X.
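
The main difference from NumPy slicing is that a label slice with **loc[ ]** includes both endpoints, while a position slice excludes the stop position. Here is a small sketch on a made-up four-column DataFrame (the name `df` and its values are placeholders, not the real data):

```python
import pandas as pd

# Toy DataFrame with four columns; the values are made up.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000],
    "AGE": [24, 26],
    "PAY_AMT6": [0, 2000],
    "DEFAULT": [1, 1],
})

# Label slicing with .loc includes BOTH endpoints.
by_label = df.loc[:, "LIMIT_BAL":"PAY_AMT6"]

# Position slicing with .iloc excludes the stop position, like a NumPy array.
by_position = df.iloc[:, 0:3]

print(list(by_label.columns))        # ['LIMIT_BAL', 'AGE', 'PAY_AMT6']
print(by_label.equals(by_position))  # True
```

Either form keeps DEFAULT out of X, which matters because the column we want to predict must not be used as an input.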

Let's slice our DataFrame to get y.  That is, use **loc[ ]** on the `cc_default` DataFrame to slice all rows and only the DEFAULT column. Because we select a single column, the result, y, is a pandas Series rather than a DataFrame.

In [17]:
y = cc_default.loc[:,'DEFAULT']
print(y.head())

0    1
1    1
2    0
3    0
4    0
Name: DEFAULT, dtype: int64
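
The `Name: DEFAULT, dtype: int64` footer is what a pandas Series prints. As a quick sketch on made-up data (the names `df`, `y_series` and `y_frame` are placeholders), selecting one column label returns a one-dimensional Series, while selecting a list of labels returns a two-dimensional DataFrame:

```python
import pandas as pd

# Toy data; the values are made up.
df = pd.DataFrame({"AGE": [24, 26, 34], "DEFAULT": [1, 1, 0]})

y_series = df.loc[:, "DEFAULT"]    # single label -> Series (1-D)
y_frame = df.loc[:, ["DEFAULT"]]   # list of labels -> DataFrame (2-D)

print(type(y_series).__name__, y_series.shape)  # Series (3,)
print(type(y_frame).__name__, y_frame.shape)    # DataFrame (3, 1)
```

For the target y, the one-dimensional Series is the shape the rest of this notebook works with.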


Now use the **train_test_split( )** function to split our data (X and y) into test and training subsets:

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Since `test_size=0.20` was passed into our function, it specifies that we want 20% of the data in a test set and consequently 80% of the data in a training set.  This data selection is done randomly by **train_test_split( )**, so the exact rows in each subset change from run to run.  In general, 80/20 is a good split.
<br><br>
We have done this for both X and y, which gives us X_train, X_test, y_train and y_test.
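
If you want the same split every time you run the notebook, one option is the `random_state` argument of **train_test_split( )**. The sketch below uses a made-up 10-row dataset, and the value 42 is an arbitrary choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, one feature, a 0/1 target. Values are made up.
X = pd.DataFrame({"AGE": [24, 26, 34, 37, 57, 37, 29, 23, 28, 35]})
y = pd.Series([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], name="DEFAULT")

# random_state fixes the shuffle so the same rows land in each subset every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(len(X_train), len(X_test))            # 8 2
# Rows stay paired: the X and y subsets share the same index labels.
print(X_train.index.equals(y_train.index))  # True
```

Another optional argument, `stratify=y`, keeps the share of each class similar in the training and test subsets.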

**STEP 3:**  Use the training data to train your model.
<br><br>
That is, we will use X_train and y_train as inputs so that our Logistic Regression algorithm can find a pattern on which to base its predictions.  This training is done by passing the training data into the **fit( )** method of the `logistic_regression` object defined above.

In [21]:
logistic_regression.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

The only output here is a summary of the model's settings, but know that our model is all trained and ready to be evaluated.
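
Once **fit( )** has run, the model object stores what it learned. The sketch below trains on synthetic data from `make_classification` (a stand-in, not the credit card data) and looks at a few of the fitted attributes. The `max_iter=1000` setting is an assumption made here to give the solver plenty of iterations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 4 numeric features, a 0/1 target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Attributes ending in "_" exist only after fit() has run.
print(model.classes_)          # the categories the model learned
print(model.coef_.shape)       # one weight per feature: (1, 4)
print(model.intercept_.shape)  # a single intercept: (1,)
```

Note that **fit( )** returns the fitted model itself, which is why a notebook cell ending in a `fit` call displays the model.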

**STEP 4:** Use the test data to evaluate the model.
<br>

That is, we will use X_test to make predictions about y and then compare y_test to these predicted y values.  Remember our evaluation metric will be `accuracy`.

In [25]:
y_predicted = logistic_regression.predict(X_test)
eval_metric = accuracy_score(y_pred = y_predicted, y_true = y_test)
print(eval_metric)

0.780666666667


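The result tells us that the Logistic Regression model predicts credit card default correctly about 78% of the time on the test data.  100% is the best possible score, so 78% is a good result, though how good depends on how often customers in the data actually default.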
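
One way to put an accuracy score in context is to compare it with a simple baseline: the accuracy you would get by always predicting the most common class. The sketch below uses made-up test labels (not the real data), and the names `class_share` and `baseline_accuracy` are placeholders.

```python
import pandas as pd

# Hypothetical test-set labels: 8 non-defaults and 2 defaults (made up).
y_test = pd.Series([0, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="DEFAULT")

# Share of each class in the test set.
class_share = y_test.value_counts(normalize=True)

# A "model" that always predicts the most common class gets this accuracy.
baseline_accuracy = class_share.max()

print(class_share.idxmax())  # 0, the most common class
print(baseline_accuracy)     # 0.8
```

A trained model is only adding value if its accuracy is above this baseline on the same test data.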

Before moving to the last step of making our prediction, remember that we want to pick the BEST model to make our prediction.  Here, the Logistic Regression model was very good, but is it the BEST?  To determine this, we have to re-do all the above steps for different classification models and then compare their accuracies.  Whichever one has the highest accuracy wins!
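
As a rough preview of that comparison, here is a minimal sketch that trains three classifiers on the same split and scores each with accuracy. It uses synthetic data from `make_classification` rather than the credit card data, and the choice of models and settings is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit card data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Candidate models, chosen here only for illustration.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Same split, same metric, one score per model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_true=y_test, y_pred=model.predict(X_test))

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```

Using one shared train/test split keeps the comparison fair, because every model is scored on exactly the same test rows.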
<br><br>
We will do this in part 2 of this notebook, `Scikit-Learn-Classification2.ipynb`.  See you there!