# Training a machine learning model with scikit-learn


Watch this video: http://blog.kaggle.com/2015/04/30/scikit-learn-video-4-model-training-and-prediction-with-k-nearest-neighbors/

and this video: http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/

## Reviewing the iris dataset

In [1]:
from IPython.display import IFrame
IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)

- **How many observations aka samples?**
150

- **How many features? What are the features?**
4: sepal length, sepal width, petal length, and petal width (all in cm)

- **What is the target variable we're trying to predict?**
The iris species (setosa, versicolor, or virginica)

- Why is this called a **classification** problem?
Because the response is categorical: we are assigning an unknown iris to one of a fixed set of species.

- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
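The three steps above can be sketched in a few lines of NumPy. This is only an illustration, not how scikit-learn implements KNN: the helper name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: most common response value among the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up training set: two features, two classes
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
y_toy = [0, 0, 0, 1, 1]

# Step 1: pick a value for K
prediction = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
print(prediction)  # 0, since all three nearest neighbors belong to class 0
```

This sketch breaks ties by whichever label it happens to encounter first, so it should not be read as a description of scikit-learn's tie-breaking behavior.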

### Example training data

![Training data](assets/04_knn_dataset.png)

### KNN classification map (K=1)

![1NN classification map](assets/04_1nn_map.png)

### KNN classification map (K=5)

![5NN classification map](assets/04_5nn_map.png)

*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*

## Loading the data

In [3]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [4]:
# print the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)
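The "bunch" object also carries the feature and class names, which makes it easy to check the answers from the dataset review above. A minimal sketch (it reloads the data so that it runs on its own):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# names of the four feature columns in iris.data
print(iris.feature_names)

# species names; the integer codes in iris.target index into this array
print(iris.target_names)

# number of observations per species
class_counts = np.bincount(iris.target)
print(class_counts)
```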


## scikit-learn 5-step modeling pattern

**Step 1:** Import the class you plan to use

In [5]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [22]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

In [23]:
# print your instance of the model
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
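The printed representation depends on the installed scikit-learn version; recent releases show only the parameters that differ from their defaults. To see every parameter regardless of version, `get_params()` returns them as a dictionary. A small sketch, using a separate instance named `knn_demo` (a name chosen for this example) so that the notebook's `knn` is left untouched:

```python
from sklearn.neighbors import KNeighborsClassifier

knn_demo = KNeighborsClassifier(n_neighbors=1)

# dictionary of all hyperparameters, including those left at their defaults
params = knn_demo.get_params()
print(params["n_neighbors"])  # 1 (the value we specified)
print(params["weights"])      # uniform (a default)
```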


**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
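- Occurs in-place (the estimator object itself is updated, so there is no need to assign the result to a new variable)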

In [31]:
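# hold out 40% of the observations for testing (the split is random, so results vary between runs)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# fit (aka "train") the model on the training portion only
knn.fit(X_train, y_train)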

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
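The output above appears because `fit` returns the estimator itself, which the notebook then displays. A quick way to confirm that fitting happens in-place is sketched below; `knn_demo`, `X_demo`, and `y_demo` are names made up for this example, and the model is fit on the full dataset purely for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=1)

# fit() updates knn_demo and returns that same object
returned = knn_demo.fit(X_demo, y_demo)
print(returned is knn_demo)  # True

# after fitting, the estimator knows which class labels it has seen
print(knn_demo.classes_)
```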

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [32]:
# call your predict function here
knn.predict([[3, 5, 4, 2]])

array([1])

- Returns a NumPy array
- Can predict for multiple observations at once

In [33]:
# here's some new data for you
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# call your predict function on this
knn.predict(X_new)

array([1, 1])
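The predictions are integer codes. One way to turn them into species names is to index `iris.target_names` with the predicted array. The sketch below is self-contained and, unlike the cells above, fits on the full dataset for simplicity, so its predictions may not match the output shown here.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# integer class codes, one per row of X_new
codes = knn_demo.predict(X_new)

# NumPy fancy indexing maps each code to its species name
names = iris.target_names[codes]
print(codes, names)
```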

**Step 5:** Evaluate how good your predictions are

In [34]:
# compute classification accuracy 
from sklearn import metrics

y_pred = knn.predict(X_test)

# there's a function you can call to compute classification accuracy
metrics.accuracy_score(y_test, y_pred)

0.95
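Classification accuracy is simply the fraction of predictions that match the true labels. The small made-up arrays below (not taken from the iris data) show that computing it by hand gives the same number as `accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# hypothetical true labels and predictions: 4 of the 5 agree
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 1, 2, 1, 1])

# mean of a boolean array = proportion of True values
manual = np.mean(y_true_demo == y_pred_demo)
print(manual)  # 0.8

print(metrics.accuracy_score(y_true_demo, y_pred_demo))  # 0.8
```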

## Using a different value for K

In [35]:
# instantiate a new model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X_train, y_train)

# predict the response for new observations
y_pred = knn.predict(X_test)

# again, print your classification accuracy
metrics.accuracy_score(y_test, y_pred)

0.95
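To compare several values of K fairly, it helps to reuse one fixed train/test split. The sketch below does that in a loop; the candidate values of K and `random_state=4` are arbitrary choices for illustration, and the scores it prints will generally differ from the ones shown above because the split is different.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# fixing random_state makes the split (and therefore the scores) reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=4
)

scores = {}
for k in (1, 5, 10, 25):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = metrics.accuracy_score(y_te, model.predict(X_te))

for k, score in scores.items():
    print(k, round(score, 3))
```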

## Using a different classification model

In [36]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate a new model (using the default parameters)
log_reg = LogisticRegression()


# fit the model with data
log_reg.fit(X_train, y_train)

# predict the response for new observations
y_pred = log_reg.predict(X_test)

# yet again, print your classification accuracy
metrics.accuracy_score(y_test, y_pred)


0.9666666666666667
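Because every scikit-learn estimator exposes the same `fit` and `predict` methods, swapping models only changes the instantiation step. The sketch below runs both classifiers through one loop. Note two assumptions that differ from the cell above: `max_iter=1000` is passed to `LogisticRegression` to avoid a possible convergence warning in newer releases, and `random_state=4` fixes the split.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=4
)

# different model classes, identical fit/predict interface
models = {
    "knn_5": KNeighborsClassifier(n_neighbors=5),
    "log_reg": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = metrics.accuracy_score(y_te, model.predict(X_te))

print(results)
```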

## Resources

- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)
- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)
- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)
    - Classification Problems and K-Nearest Neighbors (Chapter 2)
    - Introduction to Classification (Chapter 4)
    - Logistic Regression and Maximum Likelihood (Chapter 4)