# Classification

In this notebook, I'll walk through classification workflows on the Iris dataset: binary classification first, then multiclass.

In [3]:
# Load libraries

# General DS Libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Pre-processing and Scoring
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, f1_score

# Render plot text with LaTeX (requires a working LaTeX installation)
matplotlib.rcParams['text.usetex'] = True
matplotlib.rc('font',**{'family':'sans-serif','sans-serif':['Helvetica']})

Here's the Iris dataset, loaded via seaborn, along with its column info and first few rows:

In [4]:
# Load Iris dataset and print out info
iris = sns.load_dataset('iris')
print(iris.info())
print(iris.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
None
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


## Part 1: Some Simple Binary Classification

There are many strategies I can use to build a binary classification model:

* Logistic Regression
* Naive Bayes
* K-Nearest Neighbors
* Simple Decision Tree
* Random Forest
* Gradient Boosting

First, I'll drop setosa so that only two classes remain (virginica and versicolor), then split the data into training and test sets:

In [6]:
# Remove Setosa
iris_binary = iris[iris.species != 'setosa']

# Split X and Y
X = iris_binary.drop(['species'], axis = 1, inplace=False)
Y = iris_binary['species']

# Split to train/test (80/20 split)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)
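As a quick sanity check before any plotting, the six strategies listed above can be compared with cross-validation on the training split. This is only a minimal sketch under a few assumptions of mine: Iris is loaded from scikit-learn (integer targets, 0 = setosa) so the block runs without a download, the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are illustrative, and the hyperparameters (`max_iter=1000`, fixed `random_state`) are not tuned.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: load Iris from scikit-learn (target 0 = setosa, 1 = versicolor, 2 = virginica)
frame = load_iris(as_frame=True).frame

# Keep only versicolor and virginica, mirroring the binary setup above
binary = frame[frame["target"] != 0]
X_demo = binary.drop(columns="target")
y_demo = binary["target"]

# Same 80/20 split and seed as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=12)

# One untuned instance of each strategy
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy, computed on the training split only
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name:>20s}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 80 training rows, the fold-to-fold spread is large, so small differences between models here shouldn't be over-interpreted.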

Next, I'll write a predictAndROC function that makes predictions with a model and plots the corresponding ROC curve.
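The function body isn't shown here, so what follows is only a sketch of what predictAndROC might look like. The signature, the `positive_label` and `ax` parameters, and the choice to fit the model inside the function are all my assumptions. The demo at the bottom loads Iris from scikit-learn (rather than seaborn) so it runs offline.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


def predictAndROC(model, X_train, Y_train, X_test, Y_test, positive_label, ax=None):
    """Fit `model`, predict on the test set, and draw the ROC curve.

    Hypothetical signature (the original is not shown).
    Returns the predicted labels and the test-set AUC.
    """
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)

    # Pick the predict_proba column that belongs to the positive class
    pos_col = list(model.classes_).index(positive_label)
    scores = model.predict_proba(X_test)[:, pos_col]

    # ROC points and area under the curve for the positive class
    fpr, tpr, _ = roc_curve(Y_test, scores, pos_label=positive_label)
    auc = roc_auc_score(Y_test == positive_label, scores)

    if ax is None:
        _, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"{type(model).__name__} (AUC = {auc:.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return Y_pred, auc


# Demo on versicolor vs. virginica (assumption: scikit-learn copy of Iris)
data = load_iris(as_frame=True)
species = data.target.map(dict(enumerate(data.target_names)))
keep = species != "setosa"
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data[keep], species[keep], test_size=0.2, random_state=12
)

Y_pred, auc = predictAndROC(
    LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te,
    positive_label="virginica",
)
print(f"Test AUC: {auc:.3f}")
```

Passing the same `ax` to several calls would overlay the curves of different models on one plot, which makes them easy to compare.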

## Part 2: Multiclass Classification

Multiclass classification is trickier because I now have to deal with all three species. I'll need to re-make my train and test sets, which is easy enough. Visualization is harder: a single ROC curve no longer applies directly, so I'll use a heat map of the confusion matrix instead.
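A minimal sketch of both steps, under stated assumptions: the three-class split uses scikit-learn's copy of Iris with stratification (my addition, so each species keeps 10 test rows), `plotConfusionHeatmap` is a hypothetical helper name, and the k-nearest-neighbors model is only a stand-in to produce some predictions to plot. The heat map is drawn with plain matplotlib; `sns.heatmap` would work just as well.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Re-make the train/test sets, this time keeping all three species
data = load_iris(as_frame=True)
X_multi = data.data
Y_multi = data.target.map(dict(enumerate(data.target_names)))
X_train_m, X_test_m, Y_train_m, Y_test_m = train_test_split(
    X_multi, Y_multi, test_size=0.2, random_state=12, stratify=Y_multi
)


def plotConfusionHeatmap(Y_true, Y_pred, labels, ax=None):
    """Draw the confusion matrix as a heat map and return the matrix.

    Hypothetical helper: rows are true species, columns are predictions.
    """
    cm = confusion_matrix(Y_true, Y_pred, labels=labels)
    if ax is None:
        _, ax = plt.subplots()
    image = ax.imshow(cm, cmap="Blues")
    ax.figure.colorbar(image, ax=ax)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted species")
    ax.set_ylabel("True species")
    # Write the raw count in each cell
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    return cm


# Stand-in model, used only to have predictions to visualize
stand_in = KNeighborsClassifier(n_neighbors=5).fit(X_train_m, Y_train_m)
labels = list(data.target_names)
cm = plotConfusionHeatmap(Y_test_m, stand_in.predict(X_test_m), labels)
print(cm)
```

Off-diagonal cells show which species get confused with which, something a single accuracy number hides.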

### 1: Logistic Regression

Basic logistic regression separates two classes, but here I have three; I'll get around that with a one-vs-rest approach, which fits one binary classifier per class:

In [None]:
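# Import the one-vs-rest wrapper
from sklearn.multiclass import OneVsRestClassifier

# Create the classifier object: one binary logistic regression per species
lr = OneVsRestClassifier(LogisticRegression())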

# 
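The cell above stops after creating the classifier. One plausible way to fit and evaluate it is sketched below; this is my assumption of a continuation, not the original code. The variable names, the stratified split, and `max_iter=1000` are illustrative, and Iris is loaded from scikit-learn so the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Assumption: scikit-learn copy of Iris, integer targets 0/1/2
data = load_iris(as_frame=True)
X3_train, X3_test, y3_train, y3_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=12, stratify=data.target
)

# One-vs-rest: one binary logistic regression is fitted per species
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X3_train, y3_train)
print("Binary classifiers fitted:", len(ovr.estimators_))

# Per-class probabilities (rows normalized to sum to 1) and hard predictions
probs = ovr.predict_proba(X3_test)
pred = ovr.predict(X3_test)

acc = accuracy_score(y3_test, pred)
print(f"Test accuracy: {acc:.3f}")
```

The predicted class is the one whose binary classifier gives the highest score, so the three sub-models never need to agree with each other.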