# Classification

**Dataset:** Spambase 



# IMPORT LIBRARIES

In [None]:
import pandas as pd                                   # For dataframes
import matplotlib.pyplot as plt                       # For plotting data
import seaborn as sns                                 # For plotting data
from sklearn.model_selection import train_test_split  # For train/test splits
from sklearn.model_selection import GridSearchCV     # For parameter optimization
from sklearn.neighbors import KNeighborsClassifier   # For kNN classification
from sklearn.metrics import ConfusionMatrixDisplay   # Evaluation measure

# LOAD AND PREPARE DATA
Many of the datasets for this course come from the Machine Learning Repository at the University of California, Irvine (UCI) at [https://archive.ics.uci.edu/](https://archive.ics.uci.edu/).

For this demonstration of classification techniques, we'll use the `Spambase Data Set`, which can be accessed via [https://archive.ics.uci.edu/ml/datasets/Spambase](https://archive.ics.uci.edu/ml/datasets/Spambase). We'll use the dataset saved in the file `spambase.data`.

The file is in `CSV` format but has no header row, so the variable names are missing, and `pd.read_csv` has to be called with `header=None`. If you download the file yourself, you'll need to add the `.csv` extension manually. The code below expects a local copy in the `data` folder of our Python directory.

## Import Data

- To read the dataset from a local CSV file, run the following cell. (This is the recommended approach.)

In [None]:
df = pd.read_csv('data/spambase_raw.csv', header=None)

- Alternatively, to read the data from the UCI ML Repository, uncomment the lines in the cell below and run them.

In [None]:

# df = pd.read_csv(
#     'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data',
#     header=None)

- Look at the data.

In [None]:
df.head()

## Rename Variables

- Assign a name to all attributes as `X0`, `X1`, ..., `X56`.
- Assign `y` to the class variable (the last column of df).
- Display the first 5 rows.

In [None]:
# Sequentially renames all attribute columns and renames the last column to 'y'
df.columns = ['X' + str(i) for i in range(0, len(df.columns) - 1)] + ['y']

# Shows the first few lines of the data
df.head()
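
As a quick check of the naming scheme, the sketch below builds the same list of names without loading the data. It assumes 58 columns in total (57 attributes plus the class), which is what the names `X0` through `X56` plus `y` imply. The names `n_columns` and `names` are illustrative only.

```python
# Assumption: 58 columns in total (57 attributes + 1 class column)
n_columns = 58

# Same pattern as above: 'X0' ... 'X56' for the attributes, then 'y' for the class
names = ['X' + str(i) for i in range(n_columns - 1)] + ['y']

# Shows the first three and the last two names
print(names[:3], '...', names[-2:])
```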

## Split Data
To prepare the dataset for classification, we have to split it into train and test sets.

- `train_test_split()` splits the data into train and test.
- In the arguments list, the data matrix consists of all attribute columns. Extract columns `X0`, `X1`, ..., `X56` with `df.filter(regex=r'\d')`. The filter keeps only the column names that contain a digit, which excludes `y`.
- Specify the target variable as `df.y`.
- Set up `trn` and `tst` dataframes.

In [None]:
# Specifies X by filtering all columns with a number in name
X_trn, X_tst, y_trn, y_tst = train_test_split(
    df.filter(regex=r'\d'),
    df.y, 
    test_size=0.30,
    random_state=1)

# Creates the training dataset, trn (a copy, so X_trn is left unchanged)
trn = X_trn.assign(y=y_trn)

# Creates the testing dataset, tst (a copy, so X_tst is left unchanged)
tst = X_tst.assign(y=y_tst)
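
The column filter used in the split keeps a label when the regular expression matches anywhere in it. The sketch below reproduces that rule with the standard `re` module on a few hypothetical column names, so it can be checked without pandas. The list `columns` is made up for illustration.

```python
import re

# Hypothetical column names in the style used above
columns = ['X0', 'X1', 'X56', 'y']

# Keep a name when the pattern r'\d' (any digit) matches somewhere in it;
# the raw string stops Python from treating '\d' as a string escape
kept = [name for name in columns if re.search(r'\d', name)]

print(kept)
```

Only `y` is dropped, because it is the only name without a digit.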

# EXPLORE TRAINING DATA

## Bar Plot of Class Variable

Use Seaborn's `countplot()` function to create a bar plot.

In [None]:
sns.countplot(x='y', data=trn)

## Explore Attribute Variables
Select four arbitrary features and draw pairwise plots (this takes a moment).

In [None]:
# Creates a grid using Seaborn's PairGrid()
g = sns.PairGrid(
    trn, 
    vars=['X5', 'X20', 'X25', 'X53'], 
    hue='y', 
    diag_sharey=False, 
    palette=['red', 'green'])

# Adds histograms on the diagonal
g.map_diag(plt.hist)

# Adds density plots above the diagonal
g.map_upper(sns.kdeplot)

# Adds scatterplots below the diagonal
g.map_lower(sns.scatterplot)

# Adds a legend
g.add_legend(title='Spam')


## PREPARE DATA

Separate the data matrix from the class variable.

In [None]:
# Separates the attributes X0-X56 into X_trn
X_trn = trn.filter(regex=r'\d')

# Separates the class variable into y_trn
y_trn = trn.y

# Separates the attributes X0-X56 into X_tst
X_tst = tst.filter(regex=r'\d')

# Separates the class variable into y_tst
y_tst = tst.y

# Class labels
spam = ['Not Spam','Spam']

In [None]:
trn.head()

## kNN: TRAIN MODEL
To train a kNN model, set up a `KNeighborsClassifier` object and fit it to the training data.



In [None]:
# Sets up a kNN model and fits it to data
knn = KNeighborsClassifier(n_neighbors=5) \
    .fit(X_trn, y_trn)
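
To make the idea behind kNN concrete, here is a minimal pure-Python sketch of the prediction step: find the `k` closest training points and take a majority vote. It assumes Euclidean distance and an unweighted vote. The function `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements the classifier internally.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predicts the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Labels of the k closest points
    nearest = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # The most common label among the neighbors wins
    return Counter(nearest).most_common(1)[0][0]

# Toy data (made up): two features, labels 0 = not spam, 1 = spam
toy_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, (0.1, 0.1), k=3))
print(knn_predict(toy_X, toy_y, (2.0, 2.0), k=3))
```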


### Calculate Mean Accuracy on Training Data

In [None]:
print(f'Accuracy on training data: {knn.score(X_trn, y_trn):.2%}')

### Optimize the kNN Model
The challenge in training a kNN model is to determine the optimal number of neighbors. To find it, a `GridSearchCV` object can be used: it scores each candidate value by cross-validation on the training data and keeps the best one.

In [None]:
# Sets up the kNN classifier object
knn = KNeighborsClassifier() 

# Search parameters
param = range(3, 15, 2)

# Sets up GridSearchCV object and stores it in grid variable
grid = GridSearchCV(
    knn,
    {'n_neighbors': param})

# Fits the grid object and gets the best model
best_knn = grid \
    .fit(X_trn,y_trn) \
    .best_estimator_

# Displays the optimum model
best_knn.get_params()
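
The search above can be pictured as a loop over candidate values, each scored by cross-validation. The sketch below is a simplified stand-in under stated assumptions: it uses plain contiguous folds on pre-shuffled toy data, whereas `GridSearchCV` by default uses stratified 5-fold splits for classifiers and refits the best model on all the training data. The helpers `knn_predict` and `cv_accuracy` and the toy data are hypothetical.

```python
from collections import Counter
import math
import random

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    distances = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of kNN with k neighbors over n_folds contiguous folds."""
    fold_size = len(X) // n_folds
    scores = []
    for f in range(n_folds):
        start, stop = f * fold_size, (f + 1) * fold_size
        # Held-out fold for validation; all other folds for training
        val_X, val_y = X[start:stop], y[start:stop]
        trn_X, trn_y = X[:start] + X[stop:], y[:start] + y[stop:]
        correct = sum(knn_predict(trn_X, trn_y, q, k) == label
                      for q, label in zip(val_X, val_y))
        scores.append(correct / len(val_X))
    return sum(scores) / n_folds

# Toy data (made up): two noisy clusters of 50 points each, shuffled
rng = random.Random(1)
points = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(50)] \
    + [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(50)]
rng.shuffle(points)
toy_X = [p for p, _ in points]
toy_y = [label for _, label in points]

# Same candidate values as the search above
candidates = range(3, 15, 2)
mean_scores = {k: cv_accuracy(toy_X, toy_y, k) for k in candidates}

# The candidate with the highest mean validation accuracy wins
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('Best n_neighbors:', best_k)
```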

### Plot the Accuracy by Neighbors Parameter
Once the optimal parameter is found, the accuracy for the different candidate values can be compared by plotting. The `grid` variable has an attribute `cv_results_`, a dictionary that stores, among other things, the mean cross-validation accuracy for each parameter value under the key `mean_test_score`.

In [None]:
# Plots mean_test_scores vs. total neighbors
plt.plot(
    param,
    grid.cv_results_['mean_test_score'])

# Adds labels to the plot
plt.xticks(param)
plt.ylabel('Mean CV Score')
plt.xlabel('n_neighbors')

# Draws a vertical line where the best model is
plt.axvline(
    x=best_knn.n_neighbors, 
    color='red', 
    ls='--')

## TEST MODEL
In this phase, we'll evaluate the accuracy of the trained kNN model on the test set. A good evaluation tool is the confusion matrix, which gives the counts (or, when normalized, the fractions) of true positives, true negatives, false positives, and false negatives.

### Visualize the Confusion Matrix
Normalize the scores to display as proportions across rows.

In [None]:
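# plot_confusion_matrix was removed in scikit-learn 1.2; use its replacement
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    best_knn, X_tst, y_tst,
    display_labels=spam,
    normalize='true')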
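
To see what normalizing across rows means, the sketch below builds a 2x2 confusion matrix by hand from made-up labels and divides each row by its total. Each row then shows how the samples of one true class were distributed over the predicted classes. The lists `toy_true` and `toy_pred` are illustrative, not results from the model above.

```python
# Toy labels (made up): 0 = Not Spam, 1 = Spam
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

# counts[i][j] = number of samples with true class i predicted as class j
counts = [[0, 0], [0, 0]]
for t, p in zip(toy_true, toy_pred):
    counts[t][p] += 1

# Row normalization: divide each row by the number of samples in that true class
proportions = [[c / sum(row) for c in row] for row in counts]

print(counts)
print(proportions)
```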

### Calculate Mean Accuracy on Testing Data

In [None]:
print(f'Accuracy on testing data: {best_knn.score(X_tst, y_tst):.2%}')