## Disclosure


Materials for this notebook were created by [Kevin Markham](https://github.com/justmarkham/scikit-learn-videos) ([Data School](http://www.dataschool.io/)) and
 [Jake Vanderplas](http://www.vanderplas.com) for [PyData Seattle 2015](https://github.com/jakevdp/sklearn_pydata2015/).

# Introduction to Scikit-Learn

This notebook will cover the basics of Scikit-Learn, a popular package containing a collection of tools for machine learning written in Python. See more at http://scikit-learn.org.

## What is Machine Learning?
* Algorithms that enable computers to "learn"
* Considered a sub-field of artificial intelligence
* Learning is based on recognition of patterns of input/output data (supervised learning)

## Utility of Machine Learning Methods
* Classification (Categorical Data)
* Regression (Continuous and Ordered Data)

### Classification Example

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs



def plot_sgd_separator():
    # create 100 separable points in two clusters
    X, Y = make_blobs(n_samples=100, centers=2,
                      random_state=0, cluster_std=0.60)
    

    # fit the model
    clf = SGDClassifier(loss="hinge", alpha=0.01, fit_intercept=True)
    clf.fit(X, Y)

    # plot the line, the points, and the nearest vectors to the plane
    xx = np.linspace(-1, 5, 10)
    yy = np.linspace(-1, 5, 10)

    X1, X2 = np.meshgrid(xx, yy)
    Z = np.empty(X1.shape)
    for (i, j), val in np.ndenumerate(X1):
        x1 = val
        x2 = X2[i, j]
        p = clf.decision_function([[x1, x2]])
        Z[i, j] = p[0]
    levels = [-1.0, 0.0, 1.0]
    linestyles = ['dashed', 'solid', 'dashed']
    colors = 'k'

    ax = plt.axes()
    ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
    ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)

    ax.axis('tight')

    
if __name__ == '__main__':
    plot_sgd_separator()
    plt.show()

### Regression Example

In [None]:
from sklearn.linear_model import LinearRegression


def plot_linear_regression():
    a = 0.2
    b = 1.0

    # x from 0 to 30
    rand = np.random.random(20)
    x = 30 * rand

    # y = a*x + b with noise
    y = a * x + b + np.random.normal(size=x.shape)
    
    
    # create a linear regression model
    clf = LinearRegression()
    clf.fit(x[:, None], y)

    # predict y from the data
    x_new = np.linspace(0, 30, 100)
    y_new = clf.predict(x_new[:, None])
   

    # plot the results
    ax = plt.axes()
    ax.scatter(x, y)
    ax.plot(x_new, y_new)

    ax.set_xlabel('x')
    ax.set_ylabel('y')

    ax.axis('tight')


if __name__ == '__main__':
    plot_linear_regression()
    plt.show()

## The Famous Iris Dataset
* Created by British biologist/statistician Ronald Fisher (1936)
* The dataset has been used in machine learning since the 1970s

* Image source: https://www.oreilly.com/library/view/python-artificial-intelligence/9781789539462/b2e092eb-6167-4853-af07-c7b6cac86d1a.xhtml

In [None]:
from IPython.display import Image 
# raw string so the Windows backslashes are not read as escape sequences
Image(r"C:\Users\Laptop\DroneCourse\ENVS333\Intro_SKlearn\images\iris.png")

## Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- A popular dataset for machine learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

In [None]:
# print the iris data
print(iris.data)

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)
- Each value we are predicting is the **target** (also known as: response, outcome, label, dependent variable)
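
As a minimal sketch of how these terms map onto the iris arrays, the cell below pulls out one observation (a row), one feature (a column), and one target value. The names `X_demo`, `y_demo`, and the other variables are assumptions chosen for this illustration only.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X_demo, y_demo = iris.data, iris.target  # hypothetical names for this sketch

# rows are observations, columns are features
n_observations, n_features = X_demo.shape

first_observation = X_demo[0]   # one row: the four measurements of one flower
first_feature = X_demo[:, 0]    # one column: the first feature for every flower
first_target = y_demo[0]        # the value to predict for the first flower

print(n_observations, n_features)
print(first_observation, first_target)
```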

### Loading the Iris Data with Scikit-Learn

Scikit-learn ships a small, clean copy of the data on these iris species.  The data consist of
the following:

- Features in the Iris dataset:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

- target classes to predict:

  1. Iris Setosa
  2. Iris Versicolour
  3. Iris Virginica
  
``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:

In [None]:
# print the four feature names
print(iris.feature_names)

In [None]:
# print integers representing the species of each observation
print(iris.target)

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
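
Because `iris.target_names` is a NumPy array, the integer codes can be turned back into species names by indexing. The sketch below shows one way to do that; the names `species` and `counts` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# use each integer code as an index into the array of names
species = iris.target_names[iris.target]  # hypothetical name

# count how many observations fall in each class
counts = np.bincount(iris.target)

print(species[:3], species[-3:])
print(counts)
```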

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**
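
One possible way to check these requirements in a single step is scikit-learn's `check_X_y` validation helper, which raises an error when the inputs are not numeric or their lengths disagree. This is only a sketch; the variable names are assumptions, and the cells that follow inspect the same properties by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import check_X_y

iris = load_iris()

# validates a 2-D numeric feature matrix and a 1-D target of the same length
X_checked, y_checked = check_X_y(iris.data, iris.target)

is_numeric = (np.issubdtype(X_checked.dtype, np.number)
              and np.issubdtype(y_checked.dtype, np.number))
shapes_agree = (X_checked.ndim == 2 and y_checked.ndim == 1
                and X_checked.shape[0] == y_checked.shape[0])

# a deliberately mismatched pair (100 rows vs. 150 targets) should be rejected
try:
    check_X_y(iris.data[:100], iris.target)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(is_numeric, shapes_agree, mismatch_rejected)
```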

In [None]:
# check the types of the features and target
print(type(iris.data))
print(type(iris.target))

In [None]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

In [None]:
# check the shape of the target (single dimension matching the number of observations)
iris.target.shape

In [None]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [None]:
X.shape

## Classification of Iris Dataset

### 1. K-nearest neighbors (KNN) classification

1. Choose a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the majority response value from the K nearest neighbors as the predicted response value for the unknown iris.
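
The three steps above can be sketched in plain NumPy. This is an illustrative implementation, not scikit-learn's own: it assumes Euclidean distance, and the function name `knn_predict_one` is hypothetical. Tie-breaking between classes may also differ from scikit-learn's.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris


def knn_predict_one(X_train, y_train, x_unknown, k):
    """Predict the class of one observation by majority vote of its k nearest neighbors."""
    # step 2: Euclidean distance from the unknown point to every training row
    distances = np.linalg.norm(X_train - x_unknown, axis=1)
    nearest = np.argsort(distances)[:k]
    # step 3: majority vote among the labels of the k nearest rows
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = load_iris()
unknown = np.array([3, 5, 4, 2])

# step 1: choose K (two values are tried here)
pred_k1 = knn_predict_one(iris.data, iris.target, unknown, k=1)
pred_k5 = knn_predict_one(iris.data, iris.target, unknown, k=5)
print(pred_k1, pred_k5)
```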

### Example training data

![Training data](images/04_knn_dataset.png)

### KNN classification map (K=1)

![1NN Classification Map](images/04_1nn_map.png)

### KNN classification map (K=5)

![5NN classification map](images/04_5nn_map.png)

### scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

In [None]:
print(knn)
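
Depending on the scikit-learn version, printing an estimator may show only the parameters that were changed. As a small sketch, `get_params()` lists every parameter with its current value, including the defaults; the name `knn_demo` is an assumption for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_demo = KNeighborsClassifier(n_neighbors=1)  # hypothetical name for this sketch

# dictionary of every parameter, whether set explicitly or left at its default
params = knn_demo.get_params()
for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```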

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

In [None]:
knn.fit(X, y)

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [None]:
knn.predict([[3, 5, 4, 2]])

- Returns a NumPy array
- Can predict for multiple observations at once

In [None]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

#### Using a different value for K

In [None]:
knn5 = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn5.fit(X, y)

# predict the response for new observations
knn5.predict(X_new)
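
To see why a different K can change the prediction, it may help to look at the vote itself. The sketch below uses `predict_proba`, which for the default uniform weighting reports the share of the K neighbors belonging to each class. The names `knn5_demo`, `X_new_demo`, and `vote_shares` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn5_demo = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

X_new_demo = [[3, 5, 4, 2], [5, 4, 3, 2]]

# one row per new observation, one column per class;
# each entry is the fraction of the 5 neighbors voting for that class
vote_shares = knn5_demo.predict_proba(X_new_demo)
print(vote_shares)
```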

### Scale Iris Dataset for Logistic Regression

In [None]:
from sklearn import preprocessing

In [None]:
# standardize each feature to zero mean and unit variance
X_scaled = preprocessing.scale(X)

In [None]:
# compare the original (X) and scaled (X_scaled) values to confirm the transformation
X

In [None]:
X_scaled
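
Rather than comparing the two arrays by eye, the effect of `preprocessing.scale` can be checked numerically. The sketch below confirms that each scaled feature has mean 0 and standard deviation 1, and that the result matches subtracting the column mean and dividing by the column standard deviation. The names `X_raw`, `X_std`, and `manual` are assumptions for this example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X_raw = load_iris().data
X_std = preprocessing.scale(X_raw)

# the same standardization written out by hand, column by column
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(X_std.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_std.std(axis=0).round(6))   # approximately 1 for every feature
print(np.allclose(X_std, manual))
```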

### Using Logistic Regression

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_scaled, y)

# predict the response for the same observations used for training
y_pred = logreg.predict(X_scaled)

In [None]:
y_pred.shape

In [None]:
# compute training accuracy for the logistic regression model (scored on the data it was fit to)
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
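
Classification accuracy is simply the fraction of predictions that match the true labels. As a sketch, the cell below recomputes it by hand and compares it with `metrics.accuracy_score`. The variable names are assumptions, and because the model is scored on the data it was trained on, the number is a training accuracy.

```python
import numpy as np
from sklearn import metrics, preprocessing
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_std = preprocessing.scale(iris.data)

model = LogisticRegression().fit(X_std, iris.target)
pred = model.predict(X_std)

# fraction of observations where the prediction equals the true label
manual_accuracy = np.mean(pred == iris.target)
library_accuracy = metrics.accuracy_score(iris.target, pred)

print(manual_accuracy, library_accuracy)
```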