## Chapter 3: A Tour of Machine Learning Classifiers Using Scikit-Learn

Each classification algorithm has its own tradeoffs; there is no free lunch in machine learning.

No single classifier works for all situations.

You must experiment to see which classifier works best for your project, although a large body of published results and guidance can point you in the right direction and shortcut much of this work.

The performance of a classifier, both its computational cost and its predictive power, depends heavily on the data available for learning.

* Training
  * Selecting features and collecting labeled training examples
  * Choosing a performance metric
  * Choosing a learning algorithm and training a model
  * Evaluating the performance of the model
  * Changing the settings of the algorithm and tuning the model
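
The steps above can be sketched end to end in a few lines. This is only a minimal illustration, not the book's exact code: the choice of accuracy as the metric and the candidate eta0 values are assumptions made here for the example.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: select features (petal length and width) and collect labeled examples
iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Step 2: choose a performance metric (plain accuracy, an assumption for this sketch)
metric = accuracy_score

# Steps 3-5: train a model, evaluate it, then change one setting and repeat
scores = {}
for eta0 in (0.01, 0.1, 1.0):  # hypothetical candidate learning rates
    model = Perceptron(eta0=eta0, random_state=1)
    model.fit(X_train, y_train)
    scores[eta0] = metric(y_test, model.predict(X_test))

for eta0, score in scores.items():
    print(f'eta0={eta0}: test accuracy={score:.3f}')
```

Comparing settings on the test set keeps the sketch short, but in a real project the tuning step should use a separate validation set or cross-validation so the test set stays untouched until the end.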

Page 54

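Scikit-learn gives us an API to easily use optimized implementations of models like the perceptron (and linear models closely related to Adaline) for testing. It also includes many convenience functions for preprocessing data and for tuning and evaluating models.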

Let's train some more perceptron models with the Iris dataset.

In [1]:
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()  # sklearn.datasets ships with several sample datasets
X = iris.data[:, [2, 3]]  # all rows; columns 2 and 3 (petal length and petal width)
y = iris.target  # integer class labels, one per flower
# X holds the features (data); y holds the class labels (target)

# Print the distinct class labels
print('Class labels:', np.unique(y))  # np.unique returns the sorted unique values in y

Class labels: [0 1 2]


It is best practice to use integer values for class labels to avoid technical issues and improve computational performance.
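
If a dataset arrives with string labels, they can be mapped to integers first. The sketch below uses scikit-learn's LabelEncoder for this; the label array is made up for illustration and is not part of the Iris data loaded above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, invented for this example
labels = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])

encoder = LabelEncoder()
y_int = encoder.fit_transform(labels)  # classes are sorted, then numbered from 0

print(y_int)                             # integer-encoded labels
print(encoder.classes_)                  # position i holds the name for integer i
print(encoder.inverse_transform(y_int))  # map the integers back to the names
```

Keeping the fitted encoder around makes it possible to translate predictions back into readable class names later.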

Let's split the dataset into training and test sets. Scikit-learn provides a function for this, train_test_split, which we can import. We want to randomly split the X and y arrays into 30% test data (45 examples) and 70% training data (105 examples).

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
# stratify=y makes the training and test sets keep the same class-label proportions as the original dataset

print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
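
To see what stratification contributes, the same split can be run with and without stratify=y and the test-set counts compared. This is a small sketch that reloads the data so it runs on its own; the variable names y_test_strat and y_test_plain are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Same test size and seed as above, once with and once without stratification
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1)

# minlength=3 keeps one slot per class even if a class were missing from a split
print('Stratified test counts:  ', np.bincount(y_test_strat, minlength=3))
print('Unstratified test counts:', np.bincount(y_test_plain, minlength=3))
```

With stratification each class appears 15 times in the test set, matching the output above. Without it the counts depend on the random shuffle and are not guaranteed to be balanced.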
