# Scikit-Learn


[Scikit-learn](http://scikit-learn.org/stable/) is a Python-based machine learning library providing implementations of a great many algorithms for supervised and unsupervised learning. In large part, it builds upon the capabilities of NumPy, SciPy, matplotlib, and pandas.

In the context of supervised learning, the primary objects scikit-learn defines are called **estimators**. Each of these defines a `fit` method, which develops a model from provided training data, and a `predict` method, which uses the model to map a new instance to a suitable target value. Scikit-learn also defines multiple utilities for partitioning and manipulating data sets as well as evaluating models.

Below, we cover some of the basic steps needed to create a model in scikit-learn.  These notes are based on material appearing in the *scikit-learn tutorials*.

*  [Tutorial](http://scikit-learn.org/stable/tutorial/index.html)
*  [Cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)

## Datasets

Scikit-learn comes bundled with several pre-defined (typically small) `datasets` that users can explore.

    load_boston()	Load and return the Boston house-prices dataset (regression); removed in scikit-learn 1.2.
    load_iris()	Load and return the iris dataset (classification).
    load_diabetes()	Load and return the diabetes dataset (regression).
    load_digits()	Load and return the digits dataset (classification).
    load_linnerud()	Load and return the linnerud dataset (multivariate regression).
    load_wine()	Load and return the wine dataset (classification).
    load_breast_cancer()	Load and return the breast cancer wisconsin dataset (classification).

The iris dataset is loaded below, and a description of it is printed.

In [1]:
import numpy as np
import pandas as pd

# using 'from ... import ...' allows us to import submodules directly
from sklearn import (
    datasets,
    model_selection,
    linear_model,
    metrics,
    neighbors,
    tree,
    ensemble,
    preprocessing,
)

# alternatively, we can import the whole package as such
import sklearn

In [None]:
iris_dataset = (
    datasets.load_iris()
)  # sklearn.datasets.load_iris() works exactly the same

print(iris_dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

We can also use `iris_dataset.data` and `iris_dataset.target` to create our x & y (input & output) pairs that will be used for training and testing.

In [None]:
x = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
y = pd.DataFrame(iris_dataset.target, columns=["Labels"])

x

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


Alternatively, we can load a dataset into x & y directly (i.e. into input/output pairs) by setting the `return_X_y` parameter to `True`.

In [None]:
x, y = datasets.load_iris(return_X_y=True)

x.shape, y.shape

((150, 4), (150,))
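The two approaches above can also be combined. The sketch below assumes a reasonably recent scikit-learn release (the `as_frame` parameter was added in 0.23, as far as I know) and shows the loader returning pandas objects directly, so the `DataFrame` does not have to be built by hand.

```python
from sklearn import datasets

# as_frame=True asks the loader for pandas objects instead of NumPy arrays
iris = datasets.load_iris(as_frame=True)

x = iris.data    # DataFrame with one column per feature
y = iris.target  # Series holding the class labels

print(x.shape, y.shape)
# iris.frame holds the features and the target together in one DataFrame
print(iris.frame.columns.tolist())
```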

## Train/Test Split

In order to validate that our model can generalize to data that it wasn't trained on, it's necessary to create a separate **testing dataset** that will not be used in training.

Within the `model_selection` submodule of scikit-learn, there's the `train_test_split` function, which we can use to automatically split the data into training and testing pairs.

Here's an explanation of the different parameters, taken directly from the function's docstring:

#### **Parameters**

**arrays** : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse
    matrices or pandas dataframes.

**test_size** : float, int or None, optional (default=None)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If train_size is also None, it will
    be set to 0.25.

**train_size** : float, int, or None, (default=None)
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the train split. If
    int, represents the absolute number of train samples. If None,
    the value is automatically set to the complement of the test size.

**random_state** : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by np.random.

**shuffle** : boolean, optional (default=True)
    Whether or not to shuffle the data before splitting. If shuffle=False
    then stratify must be None.

**stratify** : array-like or None (default=None)
    If not None, data is split in a stratified fashion, using this as
    the class labels.
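To make the `test_size` and `random_state` descriptions concrete, here is a minimal sketch on a made-up array of 10 samples (the data is purely illustrative). It shows a float `test_size` being read as a proportion, an int being read as an absolute count, and a fixed seed reproducing the same split.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy data: 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# float test_size: proportion of the samples placed in the test set
x_tr, x_te, y_tr, y_te = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(len(x_tr), len(x_te))

# int test_size: absolute number of test samples
_, x_te_int, _, _ = model_selection.train_test_split(
    x, y, test_size=3, random_state=0
)
print(len(x_te_int))

# the same random_state gives the same split again
_, x_te_again, _, _ = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te_again))
```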






In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.1, random_state=42, stratify=y
)

Please note that the `stratify` parameter is only meaningful for classification tasks, where there is a fixed number of possible outputs/targets.
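As a quick check of what stratification does, the sketch below repeats the iris split and counts the samples per class on each side. Since iris has 50 samples in each of its three classes, a stratified 90/10 split should keep the classes balanced in both sets.

```python
import numpy as np
from sklearn import datasets, model_selection

x, y = datasets.load_iris(return_X_y=True)

_, _, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.1, random_state=42, stratify=y
)

# np.bincount counts how often each integer label (0, 1, 2) occurs
print(np.bincount(y_train))  # samples per class in the training set
print(np.bincount(y_test))   # samples per class in the testing set
```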

# Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

Here is a simple example where we fit a Linear Regression to some very basic data:

In [3]:
x = [[1, 2, 3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0.5, 0.1]  # target value of each sample

model = linear_model.LinearRegression()

model.fit(x,y)


In [5]:
pred = model.predict(x)  # predict targets for the training data
print(pred)
pred = model.predict([[4, 5, 6],
                      [14, 15, 16]])  # predict targets for new data
print(pred)

[0.5 0.1]
[ 0.38 -0.02]


The `fit` method generally accepts 2 inputs:

1. The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

2. The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.

Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator.
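The shapes described above, and the fact that a fitted estimator can be reused, can be sketched as follows. The data is made up for illustration: the targets follow `y = 2*x + 1` exactly, so the fitted model should recover that line.

```python
import numpy as np
from sklearn import linear_model

# Samples matrix X with shape (n_samples, n_features) = (4, 1)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
# Target vector y with shape (n_samples,), generated from y = 2*x + 1
y = np.array([1.0, 3.0, 5.0, 7.0])

model = linear_model.LinearRegression()
model.fit(X, y)  # the model is fitted once

# The same fitted model is reused for several batches of new data
X_new = np.array([[4.0], [10.0]])
pred_a = model.predict(X_new)
pred_b = model.predict(np.array([[-1.0]]))
print(pred_a, pred_b)
```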

# Linear Regression

In statistics, linear regression is a linear approach to modelling the relationship between a set of features and a desired output. The case of one input feature is called simple linear regression; for more than one, the process is called multiple linear regression.

Scikit-learn defines this algorithm in the `LinearRegression` class as part of the `linear_model` module.


First, we load the data

In [None]:
x, y = datasets.load_diabetes(return_X_y=True)
# scale the values of x and y
y_normalize = preprocessing.MinMaxScaler()
y_norm = y_normalize.fit_transform(y.reshape(-1, 1))  # rescale y to the [0, 1] range

x_normalize = preprocessing.StandardScaler()
x_norm = x_normalize.fit_transform(x)  # standardize x to zero mean and unit variance

print("Diabetes features/input shape:", x.shape)
print("Diabetes target/output shape:", y.shape)

Diabetes features/input shape: (442, 10)
Diabetes target/output shape: (442,)


Second, we split the data 90/10 into training and testing sets (90% of the data will be used for training, while 10% will be used for testing).

In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x_norm, y_norm.reshape(-1), test_size=0.1, random_state=42
)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((397, 10), (45, 10), (397,), (45,))

Third, we train (i.e. `fit`) the model using the training dataset (`x_train` as inputs, `y_train` as targets)

In [None]:
regressor = linear_model.LinearRegression()  # create the linear regression model
regressor.fit(x_train, y_train)  # train the model on the training data

# we can preview the learned coefficients (i.e. weights) and intercept (i.e. bias)

print("Weights:\n", regressor.coef_)
print("Bias:\n", regressor.intercept_)

Weights:
 [ 0.00295256 -0.03890481  0.07545094  0.04980218 -0.12584691  0.07115817
  0.01788275  0.03507649  0.10618591  0.01043328]
Bias:
 0.39477478088542883
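The learned weights and bias fully determine the model's predictions. The sketch below, which fits on the raw diabetes data for brevity rather than on the scaled split used above, checks that `predict` is the same as multiplying the inputs by `coef_` and adding `intercept_`.

```python
import numpy as np
from sklearn import datasets, linear_model

x, y = datasets.load_diabetes(return_X_y=True)
regressor = linear_model.LinearRegression().fit(x, y)

# A linear model predicts y_hat = X @ w + b
manual_pred = x @ regressor.coef_ + regressor.intercept_

print(regressor.coef_.shape)  # one weight per feature
print(np.allclose(manual_pred, regressor.predict(x)))
```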


Fourth, we'll feed the test set into the trained model

In [None]:
y_pred = regressor.predict(x_test)
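Because `y` was rescaled with `MinMaxScaler` before training, these predictions are on the scaled 0 to 1 range rather than in the original target units. A minimal sketch of mapping scaled values back with the scaler's `inverse_transform` method follows; the numbers are hypothetical stand-ins, not values from the diabetes run.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical raw targets, used only to fit the scaler
y_raw = np.array([25.0, 100.0, 346.0])
scaler = preprocessing.MinMaxScaler()
scaler.fit(y_raw.reshape(-1, 1))  # scalers expect a 2D column

# Stand-in for scaled model output such as regressor.predict(x_test)
y_pred_scaled = np.array([0.0, 0.5, 1.0])

# Map the scaled predictions back to the original units
y_pred_original = scaler.inverse_transform(
    y_pred_scaled.reshape(-1, 1)
).reshape(-1)
print(y_pred_original)
```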

Finally, we'll evaluate the predicted output against the ground-truth values in `y_test` using scikit-learn's `metrics` module.

One of the most used metrics for evaluating regression models is `mean_squared_error`, which has the following formula:

$$\frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i)^2$$

where $n$ is the total number of examples evaluated (in this case 45), $\hat y_i$ is the predicted value for example $i$ (here taken from `y_pred`), and $y_i$ is the corresponding ground-truth value (here taken from `y_test`).


In [None]:
metrics.mean_squared_error(y_test, y_pred)

0.02662901220648913
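The formula can be checked by hand on a small example. The sketch below uses four made-up value pairs and compares a direct NumPy computation with `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical ground-truth and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of the squared differences: (0.25 + 0.25 + 0 + 1) / 4
manual_mse = np.mean((y_hat - y_true) ** 2)
sklearn_mse = metrics.mean_squared_error(y_true, y_hat)

print(manual_mse, sklearn_mse)
```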

# Logistic Regression

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with a sum of one.


Scikit-learn defines this algorithm in the `LogisticRegression` class as part of the `linear_model` module.


First, we load the data

In [None]:
x, y = datasets.load_breast_cancer(return_X_y=True)
# normalize the values of x
x_normalize = preprocessing.StandardScaler()
x_norm = x_normalize.fit_transform(x)
print("Breast Cancer features/input shape:", x_norm.shape)
print("Breast Cancer target/output shape:", y.shape)

Breast Cancer features/input shape: (569, 30)
Breast Cancer target/output shape: (569,)


Second, we split the data 90/10 into training and testing sets (90% of the data will be used for training, while 10% will be used for testing).

Since this is a classification problem (we only have two possible outputs, 1 or 0), we can use the `stratify` parameter to ensure that the two possible output values are distributed proportionally between the training and testing sets and preserve the data's original distribution across the two sets.

In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x_norm, y, test_size=0.1, random_state=42, stratify=y
)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((512, 30), (57, 30), (512,), (57,))

Third, we train (i.e. `fit`) the model using the training dataset (`x_train` as inputs, `y_train` as targets)

In [None]:
classifier = linear_model.LogisticRegression()
classifier.fit(x_train, y_train)

# we can preview the learned coefficients (i.e. weights) and intercept (i.e. bias)
# for this binary problem, coef_ has shape (1, 30) and intercept_ has shape (1,)

print("Weights:\n", classifier.coef_)
print("Bias:\n", classifier.intercept_)


Fourth, we'll feed the test set into the trained model

In [None]:
y_pred = classifier.predict(x_test)
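`predict` returns hard class labels, but the logistic model is defined in terms of probabilities. As a sketch, the fitted classifier's `predict_proba` method exposes them: each row holds one probability per class and sums to one, and the predicted label is the class with the highest probability. The block repeats the loading and splitting steps so that it runs on its own.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection, preprocessing

x, y = datasets.load_breast_cancer(return_X_y=True)
x_norm = preprocessing.StandardScaler().fit_transform(x)
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x_norm, y, test_size=0.1, random_state=42, stratify=y
)

classifier = linear_model.LogisticRegression().fit(x_train, y_train)

# One row per sample, one column per class
proba = classifier.predict_proba(x_test)
print(proba.shape)
print(proba.sum(axis=1)[:3])  # each row sums to 1

# Picking the most probable class reproduces predict()
labels_from_proba = classifier.classes_[proba.argmax(axis=1)]
print(np.array_equal(labels_from_proba, classifier.predict(x_test)))
```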

Finally, we'll evaluate the predicted output against the ground-truth values in `y_test` using scikit-learn's `metrics` module.

One of the most used metrics for evaluating classification models is `accuracy_score`, which calculates the fraction of the examples that the trained classifier predicted correctly (a value between 0.0 and 1.0; multiply by 100 for a percentage).


In [None]:
metrics.accuracy_score(y_test, y_pred)

0.9649122807017544
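Accuracy is simple enough to compute by hand. The sketch below uses five made-up labels, four of which are predicted correctly, and compares the manual result with `metrics.accuracy_score`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical ground-truth labels and predictions (one mistake out of five)
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and ground truth agree
manual_accuracy = np.mean(y_true == y_hat)
sklearn_accuracy = metrics.accuracy_score(y_true, y_hat)

# normalize=False returns the number of correct predictions instead
n_correct = metrics.accuracy_score(y_true, y_hat, normalize=False)

print(manual_accuracy, sklearn_accuracy, n_correct)
```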

# Pipeline

Scikit-learn's `Pipeline` class is a useful tool for encapsulating multiple transformers alongside an estimator in one object, so that you only have to call the important methods (`fit()`, `predict()`, etc.) once.

In [None]:
# Import the sklearn pipeline
from sklearn.pipeline import Pipeline

In [None]:
# Load the dataset
x, y = datasets.load_breast_cancer(return_X_y=True)
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.1, random_state=42, stratify=y
)

A pipeline is built from a list of `(name, step)` tuples. Every step except the last must be a transformer, and the last step is usually an estimator. In the code below, the pipeline has two steps: a `StandardScaler` that standardizes the features, followed by a `LogisticRegression` classifier. Calling `fit` on the pipeline fits the scaler on the training data, transforms that data, and then fits the classifier on the result. For datasets with missing values or categorical columns, further transformers can be added in the same way, for example a `SimpleImputer` to fill in missing values (with the median, say) or a `OneHotEncoder` to turn categorical values into binary indicator columns. The convention is generally to create one transformer per variable type.

In [None]:
# Create the sklearn pipeline 
pipe = Pipeline([('scaler', preprocessing.StandardScaler()),
                 ('Logistic_R', linear_model.LogisticRegression())])

# fit the pipeline 
pipe.fit(x_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('Logistic_R', LogisticRegression())])
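To see what the pipeline does internally, the sketch below repeats the same two steps by hand and checks that both routes give the same predictions. It also shows that a fitted step can be inspected through `named_steps`. The block rebuilds the split so that it runs on its own.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection, preprocessing
from sklearn.pipeline import Pipeline

x, y = datasets.load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.1, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("Logistic_R", linear_model.LogisticRegression()),
])
pipe.fit(x_train, y_train)

# The same steps by hand: the scaler is fitted on the training data only,
# then reused to transform the test data
scaler = preprocessing.StandardScaler().fit(x_train)
clf = linear_model.LogisticRegression().fit(scaler.transform(x_train), y_train)

same_predictions = np.array_equal(
    pipe.predict(x_test), clf.predict(scaler.transform(x_test))
)
print(same_predictions)

# Fitted steps are available by name
print(pipe.named_steps["scaler"].mean_[:3])
```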

In [None]:
# Calculate the Accuracy of the model
pipe.score(x_test, y_test)

0.9649122807017544
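The breast cancer data is fully numeric, so a single scaler is enough. For data with mixed column types, a common pattern is one small pipeline per variable type, combined with `ColumnTransformer`. The sketch below is only an illustration of that pattern: the toy `DataFrame`, its column names, and the chosen imputation strategies are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, with gaps
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    "city": pd.Series(["A", "B", "A", np.nan, "B", "A"], dtype=object),
})
labels = np.array([0, 1, 0, 1, 1, 0])

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to the transformer for its type
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(df, labels)

preds = model.predict(df)
print(preds)
```

Other imputation strategies (for example `"mean"` or `"constant"`) may work better on a given dataset; the ones used here are just a starting point.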