# Lecture 5: Notebook SK_00

## Sklearn Intro

![sklearn](http://scikit-learn.org/stable/_images/scikit-learn-logo-notext.png)



*scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.*

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. It is currently maintained by a team of volunteers.

[Github repository](https://github.com/scikit-learn/scikit-learn)

[Website](http://scikit-learn.org)

## Who is using scikit-learn?

![spotify](http://scikit-learn.org/stable/_images/spotify.png)


*Scikit-learn provides a toolbox with solid implementations of a bunch of state-of-the-art models and makes it easy to plug them into existing applications. We’ve been using it quite a lot for music recommendations at Spotify and I think it’s the most well-designed ML package I’ve seen so far.*


![evernote](http://scikit-learn.org/stable/_images/evernote.png)

*Building a classifier is typically an iterative process of exploring the data, selecting the features (the attributes of the data believed to be predictive in some way), training the models, and finally evaluating them. For many of these tasks, we relied on the excellent scikit-learn package for Python.*

## Typical ML questions

* How do I choose which attributes of my data to include in the model?
* How do I choose which model to use?
* How do I optimize this model for best performance?
* How do I ensure that I'm building a model that will generalize to unseen data?
* Can I estimate how well my model is likely to perform on unseen data?

## Benefits and drawbacks of scikit-learn

#### Benefits:
* Consistent interface to machine learning models
* Provides many tuning parameters but with sensible defaults
* Exceptional documentation
* Rich set of functionality for companion tasks
* Active community for development and support

#### Potential drawbacks:

Some users report:
* Harder (than R) to get started with machine learning
* Less emphasis (than R) on model interpretability

### Installing scikit-learn

**Option 1:** [Install scikit-learn library](http://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)

**Option 2:** [Install Anaconda distribution](https://www.anaconda.com/download/) of Python, which includes:

- Hundreds of useful packages (including scikit-learn)
- IPython and Jupyter Notebook
- conda package manager
- Spyder IDE

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)
- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**

## A first classification example

**Step 1:** Load the `wine` dataset from sklearn

Store the feature matrix in `X` and the response vector in `y`

Print the feature names and target names
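A minimal sketch of this step, assuming the dataset is loaded with `sklearn.datasets.load_wine`. It also shows that `X` and `y` meet the requirements listed above: separate, numeric NumPy arrays with a 2D feature matrix and a 1D response vector.

```python
from sklearn.datasets import load_wine

# Load the bundled wine dataset (a dictionary-like Bunch object)
wine = load_wine()

X = wine.data    # feature matrix: one row per observation, one column per feature
y = wine.target  # response vector: one class label per observation

print(wine.feature_names)
print(wine.target_names)
print(X.shape, y.shape)
```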

**Step 2:** Import the classifier class you plan to use and instantiate the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

#### K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown observation.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown.
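The three steps above can be sketched in plain NumPy. This is an illustration only, not scikit-learn's implementation: the helper `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Hypothetical helper: predict one label by majority vote of the k nearest points."""
    # Step 2: Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: the most common response among the k nearest neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy training data: two points near the origin (class 0), three near (1, 1) (class 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3))  # prints 1
```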

Create the classifier

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

[documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
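One possible way to instantiate the estimator. The variable name `knn` and the choice `n_neighbors=1` are assumptions for illustration; any other parameter keeps its default.

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the estimator with one hyperparameter set explicitly
knn = KNeighborsClassifier(n_neighbors=1)

# Printing the estimator shows how it was configured
print(knn)
```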

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place
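A short sketch of fitting, assuming the wine data and a 1-nearest-neighbor classifier. `fit` modifies the estimator object and returns that same object, so the result does not need to be assigned.

```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=1)
fitted = knn.fit(X, y)  # the estimator is updated in place

print(fitted is knn)  # prints True: fit returns the estimator itself
```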

**Step 4:** Predict

Create a random vector of 13 floats to test the classifier

Predict
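A sketch of both parts of this step. How the random vector is drawn is an assumption: here each of the 13 values is sampled uniformly within the observed range of its feature, with a fixed seed for repeatability. Note that `predict` expects a 2D array, so the vector is reshaped to one row.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Assumption: sample each feature uniformly between its observed min and max
rng = np.random.default_rng(seed=0)
x_new = rng.uniform(X.min(axis=0), X.max(axis=0)).reshape(1, -1)  # shape (1, 13)

# predict returns a NumPy array with one label per input row
pred = knn.predict(x_new)
print(pred)
```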

#### Using a different value for K

Use `n_neighbors=5`
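The same workflow with K set to 5, as a sketch. The name `knn5` is chosen here only to keep it distinct from the earlier classifier.

```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Same instantiate / fit / predict pattern, with a different hyperparameter value
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X, y)

# Predict the first three observations of the dataset
print(knn5.predict(X[:3]))
```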

#### Instantiate another classifier, e.g. LogisticRegression
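Because every estimator shares the same interface, swapping models changes only the import and the constructor. In this sketch, `max_iter=10000` is an assumption added to give the solver enough iterations on the unscaled wine features; it is not something the lecture specifies.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

# Assumption: a larger max_iter than the default helps the solver converge here
logreg = LogisticRegression(max_iter=10000)
logreg.fit(X, y)

# Same predict call as with KNN
print(logreg.predict(X[:3]))
```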

**Step 4 (Real one!):** Evaluate the classifier!

Goal is to estimate likely performance of a model on **out-of-sample data**

## Evaluation procedure #1: Train/test split

__Idea:__ Split the dataset into two pieces, so that the model can be trained and tested on different data

**Step 1 :** Import `train_test_split` from sklearn

**Step 2 :** Split `X` and `y` into training and testing sets
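Steps 1 and 2 might look like the following. The `test_size=0.25` and `random_state=4` values are assumptions for illustration; fixing `random_state` simply makes the split reproducible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out 25% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)
```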

What did this accomplish?

- Model can be trained and tested on **different data**
- Response values are known for the testing set, and thus **predictions can be evaluated**
- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance

Print the shapes of the new `X` objects

Print the shapes of the new `y` objects
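A sketch that prints the shapes of both the `X` and the `y` objects, assuming the same split parameters as in the previous sketch (`test_size=0.25`, `random_state=4`).

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

# Feature matrices stay 2D: (observations, features)
print(X_train.shape, X_test.shape)  # (133, 13) (45, 13)

# Response vectors stay 1D: (observations,)
print(y_train.shape, y_test.shape)  # (133,) (45,)
```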

**Step 3 :** Train the model on the training set
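A sketch of training on the training set only, assuming a 5-nearest-neighbor classifier and the same illustrative split parameters as before. The testing set is deliberately not passed to `fit`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

# Fit on the training portion only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
```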

**Step 4 :** Make predictions on the testing set

Compare actual response values `y_test` with predicted response values `y_pred`
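A sketch of predicting on the held-out data and comparing the first few values side by side. The classifier and split parameters are the same illustrative assumptions as above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Predict labels for observations the model has not seen
y_pred = knn.predict(X_test)

print("actual:   ", y_test[:10])
print("predicted:", y_pred[:10])
```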

**Step 5 :** Import metrics module

Print classification accuracy
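A sketch using `metrics.accuracy_score`, with the same illustrative classifier and split as above. The resulting number depends on the split, so no specific value is claimed here.

```python
from sklearn import metrics
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Fraction of test observations whose predicted label matches the true label
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
```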

#### Classification accuracy:

- **Proportion** of correct predictions
- Common **evaluation metric** for classification problems

### Downsides of train/test split?

- Provides a **high-variance estimate** of out-of-sample accuracy
- **K-fold cross-validation** overcomes this limitation
- But, train/test split is still useful because of its **flexibility and speed**

## Evaluation procedure #2: KFold

1. Split the dataset into K **equal** partitions (or "folds").
2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.
3. Calculate **testing accuracy**.
4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.
5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy.

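```python
from sklearn.model_selection import KFold

# Simulate splitting a dataset of 25 observations into 5 folds
kf = KFold(n_splits=5, shuffle=False).split(range(25))

# Print the iteration number, training indices, and testing indices
for iteration, (train_idx, test_idx) in enumerate(kf, start=1):
    print('{:^9} {} {:^25}'.format(iteration, train_idx, str(test_idx)))
```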

- Dataset contains **25 observations** (numbered 0 through 24)
- 5-fold cross-validation, thus it runs for **5 iterations**
- For each iteration, every observation is either in the training set or the testing set, **but not both**
- Every observation is in the testing set **exactly once**

Apply KFold to the wine dataset
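One way this could look, following the five steps listed earlier. Two choices here are assumptions: a 5-nearest-neighbor classifier, and `shuffle=True` with a fixed `random_state`. Shuffling matters for this dataset because its rows are grouped by class, so unshuffled folds would have very unbalanced class mixes.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Assumption: shuffle before splitting, since the rows are ordered by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, test on the remaining one
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    scores.append(knn.score(X[test_idx], y[test_idx]))

print(scores)
print(np.mean(scores))  # average testing accuracy across the folds
```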

## Comparing cross-validation to train/test split

Advantages of **cross-validation:**

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Advantages of **train/test split:**

- Runs roughly K times faster than K-fold cross-validation, because the model is fit once instead of K times
- Simpler to examine the detailed results of the testing process

## Cross-validation recommendations

1. K can be any number, but **K=10** is generally recommended
2. For classification problems, **stratified sampling** is recommended for creating the folds
    - Each response class should be represented in each of the K folds in approximately the same proportion as in the full dataset
    - scikit-learn's `cross_val_score` function does this by default for classifiers when `cv` is an integer

#### `cross_val_score` as alternative to `KFold`
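A minimal sketch of replacing the manual `KFold` loop with a single call. The classifier (`n_neighbors=5`) and `cv=10` are assumptions chosen to match the recommendations above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# One accuracy score per fold; splitting, fitting, and scoring happen internally
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")

print(scores)
print(scores.mean())
```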

Test `cross_val_score` with different values of `n_neighbors`
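A sketch of a simple search over K. The range 1 to 30 and the names `k_range` and `k_scores` are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean 10-fold cross-validated accuracy for this value of K
    scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
    k_scores.append(scores.mean())

best_k = k_range[k_scores.index(max(k_scores))]
print(best_k, max(k_scores))
```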

Plot the result
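A sketch of the plot, assuming matplotlib is available and using the same illustrative range of K values (1 to 30).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Mean 10-fold cross-validated accuracy for each value of K
k_range = list(range(1, 31))
k_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    for k in k_range
]

# Accuracy as a function of K
fig, ax = plt.subplots()
ax.plot(k_range, k_scores)
ax.set_xlabel("Value of K for KNN")
ax.set_ylabel("Cross-validated accuracy")
plt.show()
```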

### Notebook Credits:

[justmarkham](https://github.com/justmarkham)

[online sklearn course](https://github.com/justmarkham/scikit-learn-videos)