# Machine Learning & `sklearn` Basics

In [27]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


Scikit-learn (aka sklearn) is the standard library for machine learning in Python.  Like pandas, it is built on NumPy, and it offers a simple yet powerful interface that covers most of the day-to-day model R&D and productionization work that you will need to do.

In this module we'll go over sklearn's basics and walk through the process of building a model.

## Machine Learning workflow

At a high level, building a model is fairly simple. There are five main steps:
1. ingest and clean your data
2. transform / feature engineer
3. select model type and train the model(s)
4. predict on test data, and evaluate performance
5. iterate based on performance metrics

We've already seen step 1 in the data analysis module, and we will be going through the rest of the steps in the next few modules.
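As a rough preview, steps 1 to 4 can be sketched end to end in a few lines. This is only an illustration under assumptions: the synthetic data from `make_regression`, the choice of `StandardScaler` and `LinearRegression`, and the R² metric are placeholders picked for the sketch, not a recommended setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. ingest: synthetic data stands in for a real, already-cleaned data set
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. transform: learn the scaling on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. select a model type and train it
model = LinearRegression().fit(X_train_scaled, y_train)

# 4. predict on the test data and evaluate performance
r2 = r2_score(y_test, model.predict(X_test_scaled))
print(r2)

# 5. iterate: change features, model type, or hyperparameters and repeat
```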

## `sklearn` Design Principles

sklearn has three fundamental interfaces:
- the Estimator (a thing that builds a model)
- the Predictor (a thing that can predict an output using inputs)
- the Transformer (a thing that can take one or more rows of data, and augment the data in some way)

Since these are interfaces, a specific class can be one or more of these things.  For example, a class can be an estimator (it will train using training data) and a transformer (it will augment input data) at the same time.
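One informal way to see which interfaces a class implements is to check for the corresponding methods. The `interfaces` helper below is a hypothetical function written for this illustration, not part of sklearn:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def interfaces(obj):
    """Hypothetical helper: report which of the three interfaces obj exposes."""
    return {
        "estimator": hasattr(obj, "fit"),
        "predictor": hasattr(obj, "predict"),
        "transformer": hasattr(obj, "transform"),
    }


# LinearRegression can be fitted and can predict, but does not transform
print(interfaces(LinearRegression()))

# StandardScaler can be fitted and can transform, but does not predict
print(interfaces(StandardScaler()))
```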

In addition, we can compose estimators using a `Pipeline` to build complex data processing and modeling tasks easily.

### Estimators

Estimators in sklearn are any class that implements the `fit` method.  The `fit` method is the learning part of machine learning, where we will input training features and target variables, and allow the specific algorithm to fit a model based on the data and its hyperparameters.

Let's take a look at a very simple linear regression model:

In [1]:
from sklearn.linear_model import LinearRegression

In [4]:
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3


In [5]:
regression = LinearRegression(fit_intercept=True)
fitted = regression.fit(X, y)

In [6]:
fitted.coef_

array([1., 2.])

In the above example we did the following:
1. imported the `LinearRegression` model
2. created some training data
3. instantiated the regression
4. called the `fit` method, which trains our linear regression model using our features and targets, creating a fitted model

We can verify that the linear regression ran properly as our fitted coefficients are the same as the coefficients we used to generate the data.

Also, we can see that instantiating the Estimator and training the model are done in separate steps.  This means that you can configure an Estimator once and then fit it to different data sets (each call to `fit` replaces whatever was learned by the previous call).
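A minimal sketch of reusing one configuration across data sets is shown here. It uses `clone` from `sklearn.base` to get a fresh, unfitted copy per data set so that both fitted models can be kept side by side; the second set of targets (`y_b`) is made up for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# One unfitted estimator acts as the shared configuration
template = LinearRegression(fit_intercept=True)

X_a = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y_a = X_a @ np.array([1, 2]) + 3

# Hypothetical second data set with different true coefficients
X_b = X_a
y_b = X_b @ np.array([-1, 4]) + 10

# clone() copies the hyperparameters but none of the fitted state
model_a = clone(template).fit(X_a, y_a)
model_b = clone(template).fit(X_b, y_b)

print(model_a.coef_, model_a.intercept_)
print(model_b.coef_, model_b.intercept_)
```

Calling `template.fit(...)` twice would also work, but the second fit would overwrite the coefficients learned by the first.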

### Predictors

The main goal of training a model is using the fitted model to make predictions for test features.  In scikit-learn, any class that has a `predict(X, ...)` method is considered a predictor.  

We actually created a predictor in the example above - the fitted model is a predictor with the `predict` method, as we can see below:

In [7]:
fitted.predict(np.array([[3, 5], [4, 6]]))

array([16., 19.])

We can see above that after fitting the model, we're now able to make predictions for new feature samples that are passed to the model.
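For a linear regression, the prediction is just the dot product of the features with the fitted coefficients, plus the intercept. The sketch below refits the same toy model and reproduces `predict` by hand using the fitted `coef_` and `intercept_` attributes; `X_new` and `manual` are names chosen for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
fitted = LinearRegression().fit(X, y)

X_new = np.array([[3, 5], [4, 6]])

# y_hat = X_new . coef_ + intercept_
manual = X_new @ fitted.coef_ + fitted.intercept_

print(manual)
print(np.allclose(manual, fitted.predict(X_new)))
```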

**note**: the fitted model is still an estimator, in that we can call `.fit` on it again and fit new data if we wanted to, i.e.:

In [8]:
fitted2 = fitted.fit(X, y)
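This works because `fit` returns the estimator itself rather than a new object, so the names above all refer to the same instance. A small sketch of this, where the doubled targets are invented for the illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

regression = LinearRegression()
fitted = regression.fit(X, y)

# fit returns self, so both names point at the same object
same_object = fitted is regression

# Refitting on different targets overwrites the learned coefficients in place
fitted2 = fitted.fit(X, 2 * y)
print(same_object, fitted2 is regression, regression.coef_)
```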

### Transformers

Transformers are used to augment the input data in some way, and output the transformed data.  This could be for, among other things:
- preprocessing
- feature selection
- dimensionality reduction

Most of the time, transformers are estimators as well.  For example, we can use the StandardScaler:

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
data = [[0, 0], [0, 0], [1, 1], [1, 1]]

In [11]:
scaler = StandardScaler()

In [12]:
scaler.fit(data)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [13]:
scaler.mean_

array([0.5, 0.5])

In [14]:
scaler.transform(data)

array([[-1., -1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1.,  1.]])
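The transformed values can be reproduced by hand: with its default settings, `StandardScaler` subtracts each column's mean and divides by each column's population standard deviation. A minimal sketch (the name `manual` is chosen for this illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

scaler = StandardScaler().fit(data)

# z = (x - mean) / std, computed per column; np.std uses ddof=0 by default
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(manual)
print(np.allclose(manual, scaler.transform(data)))
```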

In [15]:
scaler.fit_transform(data)

array([[-1., -1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1.,  1.]])

`StandardScaler` is a class that inherits from `BaseEstimator` and `TransformerMixin`.  It defines its own `fit` and `transform` methods and gets `fit_transform` from `TransformerMixin`, which is why `fit_transform(data)` above returns the same result as calling `fit` followed by `transform`.
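The same two base classes can be used to write a custom transformer. The sketch below is a hypothetical `MeanCenterer` class invented for this illustration (it is not part of sklearn): it only defines `fit` and `transform`, and relies on `TransformerMixin` for `fit_transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts each column's training mean."""

    def fit(self, X, y=None):
        # Learn the per-column mean from the training data
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self lets calls be chained

    def transform(self, X):
        # Apply the learned mean to any data with the same columns
        return np.asarray(X, dtype=float) - self.mean_


data = [[0, 0], [0, 0], [1, 1], [1, 1]]

# fit_transform is not defined above; it comes from TransformerMixin
centered = MeanCenterer().fit_transform(data)
print(centered)
```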

### Pipelines

Because there are consistent interfaces across all Estimators, Transformers and Predictors, we can compose (i.e. string together) estimators so that data processing operations and model fitting / prediction operations are grouped into a single object.

sklearn allows us to do this easily by providing a `Pipeline`, which is a list of named Estimator objects: a series of Transformers (i.e. objects that have both `fit` and `transform` methods) followed by a final model Estimator, which only needs the `fit` method.  The pipeline has its own `fit` method, and exposes `predict` when its final step has one, so a Pipeline is itself an Estimator and a Predictor.
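Conceptually, fitting a Pipeline runs `fit_transform` on each Transformer in turn and then `fit` on the final Estimator, while predicting runs `transform` on each Transformer and then `predict` on the final step. The sketch below compares a two-step Pipeline against the same steps done by hand; the synthetic data and the `StandardScaler` plus `SVC` combination are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# Pipeline version: one object, one fit call
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())]).fit(X, y)

# Manual version: the same steps written out explicitly
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svc = SVC().fit(X_scaled, y)

# Both routes should produce identical predictions
same = np.array_equal(pipe.predict(X), svc.predict(scaler.transform(X)))
print(same)
```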

Let's take a look at an example:

In [16]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [17]:
X, y = make_classification(random_state=0)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [19]:
pipe = Pipeline([
    ('scaler', StandardScaler()), 
    ('svc', SVC())
])

**note**: for Pipelines, you must provide a name for each step along with the estimator object, and the names must be unique.

In [20]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svc',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [21]:
pipe.predict(X_test)

array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0])

In the above example, we have:
1. created a Pipeline with 2 Estimator objects: a `StandardScaler` and an `SVC` model (a support vector classifier)
2. fitted the pipeline against our training data
3. made a prediction with the fitted Pipeline
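To evaluate the predictions, the fitted Pipeline can be scored against the held-out labels. A minimal sketch, assuming accuracy is the metric of interest: for a classifier, `score` reports mean accuracy, which should match `accuracy_score` from `sklearn.metrics` applied to the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)

# score() scales X_test with the fitted scaler, predicts, and compares to y_test
accuracy = pipe.score(X_test, y_test)

# The same number computed explicitly from the predictions
explicit = accuracy_score(y_test, pipe.predict(X_test))
print(accuracy, explicit)
```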

We can also access the individual Estimators inside the pipeline:

In [22]:
pipe[0]

StandardScaler(copy=True, with_mean=True, with_std=True)

In [23]:
pipe['scaler']

StandardScaler(copy=True, with_mean=True, with_std=True)

In [24]:
pipe['svc']

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [25]:
pipe[1]

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Once we have the individual estimators, we can also read any fitted attributes we'd like from them, e.g.

In [26]:
pipe['scaler'].mean_

array([-0.11573584, -0.09293973,  0.02018342, -0.09238542, -0.13040859,
        0.16463763, -0.12202019, -0.02483696, -0.05495534, -0.03486058,
       -0.0366951 , -0.03897222, -0.14321947, -0.0654956 , -0.05421968,
        0.04105662,  0.2041835 , -0.08359841,  0.16394695, -0.03460228])
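The steps are also available through the Pipeline's `named_steps` attribute, which returns the same objects as indexing by name. The sketch below rebuilds the pipeline and reads fitted attributes from both steps; the variable names are chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())]).fit(X_train, y_train)

scaler = pipe.named_steps["scaler"]
svc = pipe.named_steps["svc"]

# named_steps and indexing by name return the same fitted object
same_step = scaler is pipe["scaler"]

# One mean per input feature, learned from the training data
n_features = scaler.mean_.shape[0]

# The SVC saw the scaled features, so its support vectors have the same width
n_support_features = svc.support_vectors_.shape[1]
print(same_step, n_features, n_support_features)
```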