# scikit-learn

## What is it?

> scikit-learn (abbreviated `sklearn`) is a __high-level machine learning library__ containing:
> - machine learning algorithms
> - example datasets
> - data pre-processing & pipelines

Pair that with a simple API and we get a powerful, easy-to-use tool for getting the job done.

In [None]:
import sklearn

print(sklearn.__version__)

`scikit-learn` has been around for __more than 10 years__, reached its stable `1.0` release in 2021, and is used throughout the industry.

## Where it is used

- Fast prototyping and testing ideas
- __Part__ of more complicated pipelines
- Often as part of Machine Learning research (if possible)
- Widely in production for particular models such as decision trees and others that we will look at shortly

## Simple example

We will introduce `sklearn` as a simple tool which allows us to easily show you concepts without delving into details.

__Do not sweat over what those algorithms are right now; we will go over them in detail in the next chapters!__

## Data Loading

As mentioned, `sklearn` provides a few ready-made datasets for us to use. Data is returned either as `np.array`s or as `pd.DataFrame`s (we will stick to `np.array` though, as it's more common).
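
As a rough sketch of the two return types, the snippet below loads the `diabetes` dataset. It is used here only as a stand-in that ships with recent `sklearn` releases, and the `as_frame=True` part assumes `pandas` is installed.

```python
from sklearn import datasets

# Default behaviour: features and targets come back as NumPy arrays
X_arr, y_arr = datasets.load_diabetes(return_X_y=True)
print(type(X_arr).__name__, X_arr.shape, y_arr.shape)

# as_frame=True: the same data as pandas objects (needs pandas installed)
X_df, y_ser = datasets.load_diabetes(return_X_y=True, as_frame=True)
print(type(X_df).__name__, list(X_df.columns)[:3])
```

The `DataFrame` variant keeps the column names, which is handy for exploration, while the array variant is what most examples in this notebook pass to models.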

## Exercise

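Load [Boston Housing Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) as `np.array` (check arguments!). Note that `load_boston` was deprecated in `scikit-learn` 1.0 and removed in 1.2, so this exercise needs an older release.

Print the shape of both features and targets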

In [None]:
from sklearn import datasets

# np.array instances
X, y = datasets.load_boston(return_X_y=True)

X.shape, y.shape

This is the `Boston` house pricing dataset with `506` examples, `13` features and the respective `y` targets.

## Features

There are `13` features, among which we can find:
- crime rate in this part of town
- whether the area borders the Charles River
- the pupil-teacher ratio in this area

> As you can see, features may be really creative: some may not be related to our task, while others might be related in an unintuitive way.

__We should always perform data analysis when we want to solve the task!__

## Targets

`y` (targets) is simply the house price (the median home value in the area, in thousands of dollars) connected to those features, which we would like to predict at the end of this notebook

_You can read more about Boston Dataset [here](https://www.kaggle.com/c/boston-housing)_

In [None]:
y[:5]

As the targets are continuous, predicting them is a __regression task__.

In [None]:
X[:5]

Features are also floating point arrays. We will use them to train our algorithm.

## Model

Now that we have an example dataset, we can create a model which will learn to predict based on it.

In `sklearn` it is really simple (see [documentation](https://scikit-learn.org/stable/modules/classes.html#classical-linear-regressors)).

Here we will use a basic ML algorithm called __Linear Regression__, which you will learn more about later.

> When we have some features and want to predict a continuous variable (regression), linear regression is one algorithm we can use to do so
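
To make the idea a little more concrete: linear regression predicts `y` as a weighted sum of the features plus an intercept, and picks the weights that minimize the squared error. The sketch below checks this against plain NumPy least squares. It uses the `diabetes` dataset as a stand-in, and the names `X_with_bias` and `weights` are illustrative only.

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# sklearn's ordinary least squares model
model = linear_model.LinearRegression().fit(X, y)

# The same fit by hand: append a column of ones for the intercept
# and solve the least squares problem min ||X_with_bias @ weights - y||^2
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

# Both approaches should agree up to floating point error
print(np.allclose(weights[:-1], model.coef_, rtol=1e-4))
print(np.isclose(weights[-1], model.intercept_, rtol=1e-4))
```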

## Exercise

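Load [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and create an instance of it. Older releases also accepted `normalize=True` here, but that argument was deprecated in `scikit-learn` 1.0 and removed in 1.2, so we keep the defaults.

Also import the appropriate package from `sklearn` to do that:

In [None]:
from sklearn import linear_model

# `normalize=True` only works on scikit-learn < 1.2; the defaults work everywhere
model = linear_model.LinearRegression()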

## How to use sklearn's model API

`sklearn` machine learning algorithms are objects which usually follow this general convention:

- `__init__(*args, **kwargs)` - here you set up your algorithm (as seen above). The arguments control parts of its behaviour; usually those are hyperparameters (you will learn about them in the following lessons)
- `fit(X, [y])` - train the algorithm on `X` (features) and `y` (targets). Unsupervised algorithms take no `y`; we will also see that later
- `predict(X)` - pass data (previously unseen) to the algorithm after `fit` was called. This gives us predictions (`y_pred`); in our case, how much a house will cost.
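
The convention above can be sketched with a tiny estimator of our own. `MeanRegressor` and its `scale` argument are hypothetical names made up for this illustration; only `BaseEstimator` and `RegressorMixin` come from `sklearn`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy model: always predicts the (scaled) mean training target."""

    def __init__(self, scale=1.0):
        # __init__ only stores hyperparameters; nothing is learned here
        self.scale = scale

    def fit(self, X, y):
        # State learned from data is stored with a trailing underscore
        self.mean_ = float(np.mean(y))
        return self  # returning self allows model.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per row of X
        return np.full(len(X), self.scale * self.mean_)


X = np.array([[1.0], [2.0], [3.0]])
y = np.array([10.0, 20.0, 30.0])

model = MeanRegressor(scale=1.0)
y_pred = model.fit(X, y).predict(X)
print(y_pred)
print(model.get_params())
```

Inheriting from `BaseEstimator` is what provides `get_params` here; the three methods themselves are plain Python.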

Given that, we can do the following:

In [None]:
model.fit(X, y)
y_pred = model.predict(X)

print(y_pred[:5], "\n", y[:5])

## Evaluating our model

So our model predicts some values, but how well does it actually do? `sklearn` provides performance __metrics__ for us to use.

You can see `sklearn`'s metrics [here](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics). In this case we will use [Mean Squared Error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error).

We will go over it in detail later, but for now it is enough to understand that the smaller the error, the better.
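
For intuition, Mean Squared Error is just the average of the squared differences between targets and predictions. A minimal sketch on made-up numbers (`y_true` and `y_hat` are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

# Differences are 1, 0 and -2, so the squared errors are 1, 0 and 4
manual_mse = np.mean((y_true - y_hat) ** 2)
sklearn_mse = metrics.mean_squared_error(y_true, y_hat)

print(manual_mse, sklearn_mse)
```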

## Exercise

Import `sklearn.metrics.mean_squared_error` using the `from` import syntax and
display the error between the true targets and the predicted ones

In [None]:
from sklearn import metrics

metrics.mean_squared_error(y, y_pred)

## Model persistence

Training (`fitting`) is often quite expensive (in time and/or compute cost), while what we are after is the ability to predict on unseen data (we will see what "unseen data" exactly is in the next notebook).

We see our model works okay and we would like to save it for later use without the need to `train` on the data again.

> Model persistence means saving your machine learning algorithm currently held in RAM (Random Access Memory) to storage (usually a hard drive) from which it can be reinstantiated at any point in time

As per usual, it's simple with `sklearn`:

In [None]:
import joblib

joblib.dump(model, "model.joblib")
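
Getting the model back is the mirror operation, `joblib.load`. The sketch below is a self-contained round trip on a tiny made-up dataset; it writes to a temporary directory rather than the working directory, purely to keep the example tidy.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import linear_model

# Tiny made-up dataset, for illustration only
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = linear_model.LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)       # save the fitted model to disk
    restored = joblib.load(path)   # read it back, no re-training needed

# The restored model should give the same predictions as the original
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Since `joblib` files are pickle-based, only load files you trust, ideally with the same `sklearn` version that saved them.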

## Congratulations, BUT

You made your first machine learning model in roughly `5` lines of code.
Why would we need anything else?

### Downsides

As `sklearn` is very high level, it doesn't require much knowledge to use as is.
But __we have to know more__ in order to do machine learning well. What is missing here:

- Why and what for? There are many more (and far more correct) ways to do machine learning
- Knowledge of machine learning algorithms; we have to know which one to choose for which kind of problems
- Knowledge of possible pitfalls; machine learning can easily go wrong. We have to know more about it in order to improve our model's performance
- In-depth knowledge of the ideas; often it might be a good idea to implement major ideas on your own

__We will do all of the above__, but hopefully you can see how easy and definitely not scary it can be.

## Pipelines

`scikit-learn` offers other goodies you can use. `Pipelines` are a way to easily join multiple machine learning related steps into one.

Everything we have seen in the previous steps is employed here as well, and a pipeline (`pipe`) has a similar API.

## Exercise

Use [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and LinearRegression inside a [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).

- PCA comes first (method for reducing the number of features)
- Followed by LinearRegression (can have default arguments)
- Fit this pipeline to data, `predict` on `X`
- Display Mean Squared Error once more

You can also use [`sklearn.pipeline.make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)

In [None]:
from sklearn import decomposition
# We can import Pipeline directly
# Usually we should import subpackage though
from sklearn.pipeline import Pipeline

pca = decomposition.PCA(n_components=5)
linear = linear_model.LinearRegression()

# pipeline will run PCA model and linear model afterwards
pipe = Pipeline(steps=[('pca', pca), ('linear', linear)])

pipe.fit(X, y)
y_pred = pipe.predict(X)

print(f"Mean Squared Error: {metrics.mean_squared_error(y, y_pred)}")
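
For comparison, here is a sketch of the same kind of pipeline built with `make_pipeline`, which names the steps automatically after the lowercased class names. It uses the `diabetes` dataset as a stand-in so that it is self-contained.

```python
from sklearn import datasets, decomposition, linear_model, metrics
from sklearn.pipeline import make_pipeline

X, y = datasets.load_diabetes(return_X_y=True)

# No step names needed; they are derived from the class names
pipe = make_pipeline(
    decomposition.PCA(n_components=5),
    linear_model.LinearRegression(),
)
step_names = [name for name, _ in pipe.steps]
print(step_names)

pipe.fit(X, y)
y_pred = pipe.predict(X)
mse = metrics.mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse}")
```

Explicit names via `Pipeline(steps=[...])` are still useful when the same class appears twice or when you want to refer to a step by a meaningful name.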

## sklearn tips

- __Always try the easiest solution first__. Create a weak baseline algorithm and check how it performs. Do not go straight to the most complicated ones! This principle is known as [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor), in philosophy as well as in machine learning
- Some algorithms have attributes you might be interested in. Those are set during `fit` and are usually suffixed with an underscore (`_`), for example `my_algorithm.interesting_attribute_`
- Some `__init__` functions have __a lot of possible arguments__. Each of them influences how the algorithm works. But which are the most important and have the most influence? __As a rule of thumb, in `sklearn` those arguments usually come roughly in order from most influential to least__
- Many `sklearn` algorithms provide an `n_jobs` argument, which parallelizes `fit`, `predict` and other functions. You can use `n_jobs=-1` to use all available processors (it is often a reasonable amount), which can improve performance considerably.
- __Use idiomatic `sklearn`__ - search the documentation, use pipelines if possible
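
A small sketch of two of the tips above, on made-up data: reading the underscore-suffixed attributes after `fit`, and passing `n_jobs`. For a problem this small `n_jobs` will not make anything faster; it is included only to show where the argument goes.

```python
import numpy as np
from sklearn import linear_model

# Made-up data following y = 2 * x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

# n_jobs=-1 asks sklearn to use all available processors where it can
model = linear_model.LinearRegression(n_jobs=-1)
model.fit(X, y)

# Attributes learned during fit end with an underscore
print(model.coef_, model.intercept_)
```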

## Challenges

- Fit a `sklearn.pipeline.Pipeline` consisting of [`VarianceThreshold`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) and [`T-SNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) as the first and second algorithm instead of PCA, followed by [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). Do it on the [`diabetes`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes) and [`boston`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) datasets.
- Browse the `scikit-learn` documentation and learn more about it. Write your own, more in-depth notes

## Summary

- `sklearn` is a high-level library used to quickly prototype solutions
- It is not optimized for every task it handles; many can be done in a more efficient manner
- The `API` is consistent throughout the library and each object has similar methods like:
    - `__init__` (to setup algorithm)
    - `fit`
    - `predict`
- `sklearn.pipeline.Pipeline` is a powerful tool for chaining multiple operations in a readable manner