#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to scikit-learn

<img height="20px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/220px-Scikit_learn_logo_small.svg.png" align="left" hspace="10px"> [Scikit-learn](https://scikit-learn.org) is a machine learning library for Python.

Scikit-learn is an approachable library that supports a wide variety of traditional machine learning models, such as linear models, decision trees, and clustering algorithms, rather than focusing on deep learning.

## Overview

### Learning Objectives

* Demonstrate the ability to do the following in scikit-learn:
  * load sample data
  * generate sample data
  * transform data
  * train a model
  * make predictions using a model

### Estimated Duration

75 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |


## Datasets

Scikit-learn contains functions for loading, fetching, and making (generating) data. These functions all live in the [datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) package. Most of them have `load`, `fetch`, or `make` in the name to let you know what the function is doing under the hood.

**Loading** functions bring static datasets into your program. The data comes pre-packaged with scikit-learn, so no network access is required.

**Fetching** functions also bring static datasets into your program. However, the data is pulled from the internet (and cached), so if you don't have network access these functions might fail.

**Generating** functions make dynamic datasets based on some equation.

These pre-packaged dataset functions exist for many popular/classic datasets such as the [MNIST digits dataset](https://en.wikipedia.org/wiki/MNIST_database) and the [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). The generation functions reference classic dataset "shape" formations such as [moons](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html#sklearn.datasets.make_moons) and [swiss rolls](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_swiss_roll.html#sklearn.datasets.make_swiss_roll).

These datasets are great for getting introduced to machine learning.
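
One way to see this naming convention for yourself is to list the functions in the `datasets` package by prefix. The sketch below is only an illustration: the variable names are ours, and the exact lists depend on the scikit-learn version you have installed.

```python
import sklearn.datasets as sk_datasets

# Group the public names in the datasets package by their prefix.
public_names = [name for name in dir(sk_datasets) if not name.startswith('_')]
loaders = sorted(name for name in public_names if name.startswith('load_'))
fetchers = sorted(name for name in public_names if name.startswith('fetch_'))
generators = sorted(name for name in public_names if name.startswith('make_'))

print('load:', loaders)
print('fetch:', fetchers)
print('make:', generators)
```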

### Loading

Let us first look at an example of loading data. We will load the iris flowers dataset using the [load_iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) function.

In [0]:
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_data

That's a lot to take in!

Let's examine this loaded data a little more closely. First we'll see what type the data is:

In [0]:
type(iris_data)

A `sklearn.utils.Bunch` is a class type that you'll see quite often when working with datasets built into scikit-learn. It is a dictionary-like container for feature and target data within a dataset.

You won't find much documentation about `Bunch` objects though because they are not really meant for usage beyond containing data returned by scikit-learn.
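
To make "dictionary-like" concrete, here is a minimal sketch showing that a `Bunch` can be read with dictionary keys or with attributes, and that both routes return the same object. The names `demo_bunch`, `by_key`, and `by_attribute` are illustrative only.

```python
from sklearn.datasets import load_iris

demo_bunch = load_iris()

# Dictionary-style access and attribute-style access return the same object.
by_key = demo_bunch['target_names']
by_attribute = demo_bunch.target_names
print(by_key is by_attribute)

# Like a dictionary, a Bunch can list its keys.
bunch_keys = sorted(demo_bunch.keys())
print(bunch_keys)
```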

Let's look at the attributes of our iris data bunch:

In [0]:
dir(iris_data)

`DESCR` is a description of the dataset.

In [0]:
print(iris_data['DESCR'])

`filename` is the file where the data is stored.

In [0]:
print(iris_data['filename'])

`feature_names` are the names of the feature columns.

In [0]:
print(iris_data['feature_names'])

`target_names`, however, are not the names of the target columns. There is only one column of targets.

Instead, `target_names` are the human-readable names of the classes in the target list within the bunch. In this case they are the names of the three species of iris in this dataset.

Note that the target names are in a list where:

  * setosa is the 0th element
  * versicolor is the 1st element
  * virginica is the 2nd element

In [0]:
print(iris_data['target_names'])

We can now examine `target` and see that it contains zeros, ones, and twos. These correspond to the target names 'setosa', 'versicolor', and 'virginica'.

In [0]:
print(iris_data['target'])
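
Because `target_names` is a NumPy array, the integer targets can be used directly as indices to look up the species name for each row. The sketch below reloads the data under a separate, illustrative name (`demo_iris`) so that it runs on its own.

```python
from sklearn.datasets import load_iris

demo_iris = load_iris()

# Indexing the array of names with the integer targets looks up
# the species name for every row at once.
species_names = demo_iris['target_names'][demo_iris['target']]

print(species_names[:3])
print(species_names[-3:])
```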

Lastly, we look at the `data` within the bunch. The data is an array of arrays. Each sub-array contains four values. These values match up with the `feature_names`. The first item in each sub-array is 'sepal length (cm)', the next is 'sepal width (cm)', and so on.

In [0]:
iris_data['data']

The number of target values should always equal the number of rows in the data.

In [0]:
print(len(iris_data['data']))
print(len(iris_data['target']))

`Bunch` objects are a perfectly fine container for data. They can be used directly to feed models.

`Bunch` objects are *not* very good for analyzing and manipulating your data.

We will typically convert `Bunch` objects into Pandas `DataFrame` objects to make analysis, data cleaning, and test/train splitting easier and more uniform.

To do this we will take the matrix of feature data and append the target data to it to create a single matrix of data. We also take the list of feature names and append the word 'species' to represent the target classes in the matrix.

An example of how to do this is below.

In [0]:
import pandas as pd
import numpy as np

iris_df = pd.DataFrame(
  data=np.append(
    iris_data['data'], 
    np.array(iris_data['target']).reshape(len(iris_data['target']), 1), 
    axis=1),
  columns=np.append(iris_data['feature_names'], ['species'])
)

iris_df.sample(n=10)

You might notice that the integer representation of species got converted to a floating point number along the way. We can change that back.

In [0]:
iris_df['species'] = iris_df['species'].astype('int64')

iris_df.sample(n=10)
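
Recent scikit-learn releases also offer a shortcut. Assuming you are on version 0.23 or later, `load_iris` accepts an `as_frame=True` argument, and the returned `Bunch` then carries a ready-made `DataFrame` under the `frame` key. In that frame the target column is named `target` rather than `species`, so this sketch renames it to match the column name used in this colab. The names `iris_bunch` and `iris_frame` are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to build the pandas objects for us.
iris_bunch = load_iris(as_frame=True)

# 'frame' holds the features plus a 'target' column; rename it to 'species'.
iris_frame = iris_bunch['frame'].rename(columns={'target': 'species'})

print(iris_frame.dtypes)
```

With this approach the species column keeps its integer type, so no conversion back from floating point is needed.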

### Fetching

Fetching is similar to loading. Scikit-learn will first see if it can find the dataset locally and if so will simply load the data. Otherwise, it will attempt to pull the data from the internet.

We can see fetching in action with the [fetch_california_housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing) function below. If you run the code block once you should see a message that the data is downloading. If you run it again you won't see that message because the data is already local to your running code.

In [0]:
from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()

type(housing_data)

The dataset is once again given to us as a `Bunch`.

If you followed the link to the [fetch_california_housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing) documentation, you may have noticed that the dataset is a **regression** dataset, as opposed to the iris dataset, which was a **classification** dataset.

We can see the difference in the dataset by checking out the attributes of the `Bunch`.

In [0]:
dir(housing_data)

We see that four of the attributes that we expect are present, but 'target_names' is missing. This is because our target is now a home price and not a discrete value like an iris species.

In [0]:
print(housing_data['target'][:10])

Converting a `Bunch` of regression data to a `DataFrame` is no different than converting a `Bunch` of classification data.

In [0]:
import pandas as pd
import numpy as np

housing_df = pd.DataFrame(
  data=np.append(
    housing_data['data'], 
    np.array(housing_data['target']).reshape(len(housing_data['target']), 1), 
    axis=1),
  columns=np.append(housing_data['feature_names'], ['price'])
)

housing_df.sample(n=10)

### Generating

In the example datasets we've seen so far in this colab, the data is static and loaded from a file. Sometimes it makes more sense to generate a dataset. For this we use one of the many [generator](https://scikit-learn.org/stable/modules/classes.html#samples-generator) functions.

`make_regression` is a generator that will create a dataset with an underlying regression that you can then attempt to discover using various machine learning models.

In the example below we create a dataset with 10 data points. For the sake of visualization we have only one feature per data point, but we could ask for more.

The return values are the X and y values for the regression. The X is a matrix of features. The y is a list of targets.

Since a generator uses randomness to generate data, we are going to set a random_state in this colab for reproducibility. **You won't do this in your production code.**

In [0]:
from sklearn.datasets import make_regression

features, targets = make_regression(n_samples=10, n_features=1, random_state=42)

features, targets
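
To illustrate what `random_state` buys us, the sketch below calls `make_regression` twice with the same seed and once with a different one (the seed `7` is an arbitrary choice for this example). It also prints the shapes of the returned arrays. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same random_state produce identical datasets.
first_X, first_y = make_regression(n_samples=10, n_features=1, random_state=42)
second_X, second_y = make_regression(n_samples=10, n_features=1, random_state=42)
print(np.array_equal(first_X, second_X), np.array_equal(first_y, second_y))

# A different seed gives different data.
other_X, other_y = make_regression(n_samples=10, n_features=1, random_state=7)
print(np.array_equal(first_X, other_X))

# X is a 2-D matrix (samples x features); y has one target per sample.
print(first_X.shape, first_y.shape)
```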

We can use a visualization library to plot the regression data.

In [0]:
import matplotlib.pyplot as plt

plt.plot(features, targets, 'b.')
plt.show()

That data has a very linear pattern!

If we want to make it more realistic, we can add some noise during data generation.

**Remember that random_state is for reproducibility only. Don't use this in your code unless you have a good reason to!**

In [0]:
from sklearn.datasets import make_regression

features, targets = make_regression(n_samples=10, n_features=1, random_state=42, noise=5.0)

plt.plot(features, targets, 'b.')
plt.show()
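
To see what the noise is doing, it can help to ask `make_regression` for the coefficient it used. The sketch below passes `coef=True`, which makes the function return the underlying coefficient as a third value. With no noise the targets are exactly the feature times that coefficient (the default `bias` is 0.0); with noise they scatter around that line. The variable names here are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression

# Without noise: every target is exactly feature * coefficient.
clean_X, clean_y, clean_coef = make_regression(
    n_samples=10, n_features=1, random_state=42, coef=True)
print(np.allclose(clean_y, clean_X[:, 0] * clean_coef))

# With noise: the targets deviate from the underlying line.
noisy_X, noisy_y, noisy_coef = make_regression(
    n_samples=10, n_features=1, random_state=42, noise=5.0, coef=True)
residuals = noisy_y - noisy_X[:, 0] * noisy_coef
print(residuals)
```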

There are dozens of dataset loaders and generators in the scikit-learn [datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) package. When you want to play with a new machine learning algorithm, they are a great source of data for getting started.

## Models



Machine learning involves training a model to gain insight and predictive power from a dataset. Scikit-learn has support for many different types of models, ranging from classic linear models to neural networks such as multi-layer perceptrons.

This section exists to survey some concepts that you will encounter when building and running models in scikit-learn.

### Estimators

Most of the models in scikit-learn are considered [estimators](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator). An estimator is expected to implement two methods: `fit` and `predict`.

`fit` is used to train the model. At a minimum it is passed the feature data used to train the model. In supervised models it is also passed the target data.

`predict` is used to get predictions from the model. This method is passed features and returns target predictions.

Let's see an example of this in action.

A linear regression is a simple model that you might have encountered in a statistics class in the past. The model attempts to draw a straight line through a set of data points so that the line is as close to as many points as possible.

We'll use scikit-learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class to fit a line to the regression data that we generated earlier in this colab. To do that we simply call `fit(features, targets)`.

After fitting, we can ask the model for predictions by calling `predict(features)`. In this case we ask for predictions on the same features that we used for training, so that we can draw the regression line over a scatter plot of the actual data.

In [0]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(features, targets)
predictions = regression.predict(features)

plt.plot(features, targets, 'b.')
plt.plot(features, predictions, 'r-')
plt.show()

At this point, don't worry too much about the details of what `LinearRegression` is doing. There is a deep-dive into regression problems coming up soon.

For now just note the `fit`/`predict` pattern for training estimators and know that you'll see it throughout our adventures with scikit-learn.
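
As a small illustration of the pattern, the sketch below fits a `LinearRegression` to hand-made data that lies exactly on the line y = 2x + 1, inspects what the model learned, and then predicts a value for a feature it has not seen. The data and variable names are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hand-made data that lies exactly on the line y = 2x + 1.
line_X = np.array([[0.0], [1.0], [2.0], [3.0]])
line_y = np.array([1.0, 3.0, 5.0, 7.0])

line_model = LinearRegression()
line_model.fit(line_X, line_y)

# After fitting, the learned slope and intercept are stored on the model.
print(line_model.coef_, line_model.intercept_)

# predict works on features the model has never seen.
new_prediction = line_model.predict(np.array([[4.0]]))
print(new_prediction)
```

Note that `predict` expects a two-dimensional array of features, even when asking about a single data point.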

### Transformers

In practice it is rare that you will get perfectly clean data that is ready to feed into your model for training (calling `fit`). Most of the time you will need to perform some type of cleaning on the data first.

You've got some hands-on experience doing this in our Pandas colabs. Scikit-learn can also be used to perform some data preprocessing tasks on your datasets.

Transformers are spread about within the scikit-learn library. Some are in the [preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module while others are in more specialized packages like [compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose), [feature_extraction](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction), [impute](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), and more.

All transformers implement `fit` and `transform` methods. The `fit` method calculates the parameters necessary to perform the data transformation. The `transform` method actually applies the transformation. There is also a convenience `fit_transform` method that performs both fitting and transformation in one method call.

Let's see a transformer in action.

We will use the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) to scale our feature data between zero and one.

Looking at our feature data now we can see values below zero and above one.

In [0]:
features

We will now create a `MinMaxScaler` and fit it to our feature data.

Each transformer has different information that it needs to perform a transformation. In the case of the `MinMaxScaler` the smallest and largest values in the data are needed.

In [0]:
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler()
transformer.fit(features)
transformer.data_min_, transformer.data_max_

You might notice that the values are stored in arrays. This is because transformers can operate on more than one feature. In this case we have only one.

Next we need to apply the transformation to our features.

After doing so, we can see that all of the feature values fall in the range from zero to one.

In [0]:
features = transformer.transform(features)
features
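
For intuition about what the scaler computes, here is a minimal sketch on a tiny made-up array. It uses `fit_transform`, compares the result with the min-max formula computed by hand, and then undoes the scaling with `inverse_transform`. The array values and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw_values = np.array([[-2.0], [0.0], [1.0], [6.0]])

# fit_transform is shorthand for fit followed by transform.
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(raw_values)

# The same result computed by hand: (x - min) / (max - min).
manual_values = (raw_values - raw_values.min(axis=0)) / (
    raw_values.max(axis=0) - raw_values.min(axis=0))
print(np.allclose(scaled_values, manual_values))

# inverse_transform maps the scaled values back to the original scale.
restored_values = scaler.inverse_transform(scaled_values)
print(np.allclose(restored_values, raw_values))
```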

### Pipelines

It isn't a coincidence that transformers have `fit` and `transform` methods and that models have `fit` methods. The common interface across classes allows scikit-learn to create pipelines for data processing and model building.

A [pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) is simply a series of transformers, often with an estimator at the end.

In the example below we use a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to perform min-max scaling of our feature data and then train a linear regression model using the scaled features.

In [0]:
from sklearn.pipeline import Pipeline

features, targets = make_regression(n_samples=10, n_features=1, random_state=42, noise=5.0)

pipeline = Pipeline([
  ('scale', MinMaxScaler()),
  ('regression', LinearRegression())
])

pipeline.fit(features, targets)

predictions = pipeline.predict(features)

plt.plot(features, targets, 'b.')
plt.plot(features, predictions, 'r-')
plt.show()
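
To show what the pipeline does on our behalf, the sketch below builds the same two-step pipeline and then repeats the steps by hand: scale the features, fit the regression on the scaled features, and predict from the scaled features. Both routes should give the same predictions. The `demo_` and `manual_` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_X, demo_y = make_regression(
    n_samples=10, n_features=1, random_state=42, noise=5.0)

# The pipeline: scale, then fit the regression on the scaled features.
demo_pipeline = Pipeline([
    ('scale', MinMaxScaler()),
    ('regression', LinearRegression()),
])
demo_pipeline.fit(demo_X, demo_y)
pipeline_predictions = demo_pipeline.predict(demo_X)

# The same two steps performed by hand.
manual_scaler = MinMaxScaler()
scaled_X = manual_scaler.fit_transform(demo_X)
manual_model = LinearRegression().fit(scaled_X, demo_y)
manual_predictions = manual_model.predict(scaled_X)

print(np.allclose(pipeline_predictions, manual_predictions))

# Individual steps stay reachable by the names given to the pipeline.
print(demo_pipeline.named_steps['regression'].coef_)
```

The benefit of the pipeline is that the scaling fitted during training is reapplied automatically whenever `predict` is called, so there is no separate scaler to keep track of.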

### Metrics

So far we have seen ways that scikit-learn can help you get data to perform machine learning, modify that data, train a model, and finally make predictions. But how good are the predictions?

Scikit-learn also comes with many functions for measuring model performance in the [metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) package.

To illustrate a metrics function in action we'll import the [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function and use it to find the mean squared error (MSE) between the target values that we used to train our linear regression model and the predicted values.

In [0]:
from sklearn.metrics import mean_squared_error

mean_squared_error(targets, predictions)

What does the resulting value mean in relation to our model? Is it good or bad?

In this case it doesn't have much meaning aside from being the mean of the squared distances between our actual target values and their predicted values. Since the data that we fit the regression to isn't related to any real-world quantity, the MSE is just a number.

As you learn more about machine learning and begin training models on real data, you'll learn how to interpret MSE and other metrics in the context of the data being analyzed and the problem being solved.
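
The definition of MSE can be checked by hand on a few made-up numbers. In this sketch the squared differences are 1, 0, and 4, so the mean is 5/3; the values and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual_values = np.array([3.0, 5.0, 7.0])
predicted_values = np.array([2.0, 5.0, 9.0])

# Mean of the squared differences: (1 + 0 + 4) / 3.
manual_mse = np.mean((actual_values - predicted_values) ** 2)
library_mse = mean_squared_error(actual_values, predicted_values)

print(manual_mse, library_mse)
```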

Estimators also come with a built-in metric, which can be obtained using the `score` method.

The `regression` object we created earlier can be scored, as can the `pipeline`.

In [0]:
print(regression.score(features, targets))
print(pipeline.score(features, targets))

The return value of the `score` method depends on the estimator being used. In the case of `LinearRegression` the score is the r-squared score, where scores closer to 1.0 are better. You can find the metric that `score` returns in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score) for the estimator that you are using.
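
As a sketch of what that r-squared value represents, the code below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the `r2_score` function from the metrics package and with the estimator's own `score` method. The `score_` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

score_X, score_y = make_regression(
    n_samples=10, n_features=1, random_state=42, noise=5.0)
score_model = LinearRegression().fit(score_X, score_y)
score_predictions = score_model.predict(score_X)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares).
ss_residual = np.sum((score_y - score_predictions) ** 2)
ss_total = np.sum((score_y - score_y.mean()) ** 2)
manual_r2 = 1.0 - ss_residual / ss_total

print(manual_r2)
print(r2_score(score_y, score_predictions))
print(score_model.score(score_X, score_y))
```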

## Closing

Scikit-learn is a massive library that contains scores of resources for performing machine learning tasks. In this colab we have only introduced some basic concepts that you will see repeated throughout your career in data science.

You are also encouraged to check out the [scikit-learn documentation](https://scikit-learn.org/stable/documentation.html) where you will find a user's guide, tutorials, and a full API reference.

Scikit-learn is an open source project. You can find it on [GitHub](https://github.com/scikit-learn/scikit-learn).

## Resources

* https://scikit-learn.org/stable/documentation.html
* https://en.wikipedia.org/wiki/Scikit-learn
* https://en.wikipedia.org/wiki/Estimator
* https://en.wikipedia.org/wiki/Mean_squared_error
* https://github.com/scikit-learn/scikit-learn

# Exercises

## Exercise 1

Load the [Boston house price dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) into a Pandas `DataFrame` called `boston_df`. Append the target values as the last column of the `DataFrame`. Name the target column 'PRICE'. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so this exercise requires an older release.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

In [0]:
from sklearn.datasets import load_boston

import pandas as pd
import numpy as np

boston_data = load_boston()

boston_df = pd.DataFrame(
  data=np.append(
    boston_data['data'], 
    np.array(boston_data['target']).reshape(len(boston_data['target']), 1), 
    axis=1),
  columns=np.append(boston_data['feature_names'], ['PRICE'])
)

## Exercise 2

Search the [scikit-learn datasets documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) and find a function to make a "Moons" dataset. Create a dataset with 75 samples. Use a random state of `42` and a noise of 0.08. Store the X return value in a variable called `features` and the y return value in a variable called `targets`.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

In [0]:
from sklearn.datasets import make_moons

features, targets = make_moons(n_samples=75, random_state=42, noise=0.08)

## Exercise 3

In Exercise 2 you created a "moons" dataset. In that dataset the features are (x,y)-coordinates that can be graphed in a scatterplot. The targets are zeros and ones that represent a binary classification.

Use matplotlib's [scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html) function to visualize the data as a scatterplot. Use the `c` argument of `scatter` to give the dots for each class a different color.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

In [0]:
from sklearn.datasets import make_moons

import matplotlib.pyplot as plt

features, targets = make_moons(n_samples=75, random_state=42, noise=0.08)

plt.scatter(features[:, 0], features[:, 1], c=targets)
plt.show()

## Challenge

Use the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to combine a data pre-processor and an estimator.

To accomplish this:

1. Find a [preprocessor](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) that uses the max absolute value for scaling.
1. Find a [linear_model](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) based on the Huber algorithm.
1. Combine this preprocessor and estimator into a pipeline.
1. Make a sample regression dataset with 200 samples, 1 feature. Use a random state of 85 and a noise of 5.0. Save the features in a variable called `features` and the targets in a variable called `targets`.
1. Fit the model.
1. Using the features that were created when the regression dataset was created, make predictions with the model and save them into a variable called `predictions`.
1. Plot the features and targets used to train the model on a scatterplot with blue dots.
1. Plot the features and predictions over the scatterplot as a red line.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import HuberRegressor
from sklearn.datasets import make_regression

import matplotlib.pyplot as plt

# Scale by the maximum absolute value, then fit a Huber regression.
pipeline = Pipeline([
  ('scale', MaxAbsScaler()),
  ('regression', HuberRegressor())
])

features, targets = make_regression(n_samples=200, n_features=1, random_state=85, noise=5.0)

pipeline.fit(features, targets)

predictions = pipeline.predict(features)

# Blue dots are the training data; the red line is the model's predictions.
plt.plot(features, targets, 'b.')
plt.plot(features, predictions, 'r-')
plt.show()