# What we're covering in the Scikit-Learn Introduction
This notebook outlines the content covered in the Scikit-Learn Introduction.

It's a quick reference for the Scikit-Learn functions and modules used in each section.

The content follows the Scikit-Learn workflow shown in the diagram below.

<img src="../images/sklearn-workflow-title.png" />

## 0. Standard library imports
In most machine learning projects, you'll see these libraries (Matplotlib, NumPy and pandas) imported at the top.

```Python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
```

## 1. Getting our data ready to be used with machine learning
Three main things we have to do:
1. Split the data into features and labels (usually `X` & `y`)
2. Fill (also called imputing) or disregard missing values

   __Notes__: The practice of filling missing data is called __imputation__. And it's important to remember there's no perfect way to fill missing data. The techniques you use will depend heavily on your dataset. A good starting point is to search for "data imputation techniques".
  
   `SimpleImputer()` transforms data by filling missing values with a given strategy. And we can use it to fill the missing values in our DataFrame.
  
   We split the data into training and test sets first, so that missing values are filled in each set separately.
  
   We use `fit_transform()` on the training data and `transform()` on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
  
3. Convert non-numerical values to numerical values (also called feature encoding)

The key takeaways to remember are:

* Most datasets you come across won't be ready to use with machine learning models straight away. And some will take more preparation than others.

* For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as __feature engineering__ or __feature encoding__.

* Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as __data imputation__.

Keep these in mind:
* Always keep your training & test data separate
* Fill and transform training and test sets separately (this goes for filling data with pandas as well)
* Don't use data from the future (test set) to fill data from the past (training set)
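The three preparation steps above can be sketched on a tiny example. This is a minimal sketch, not the notebook's actual code: the `car_sales` DataFrame and its values are made up for illustration, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in both feature columns
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", np.nan, "BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "Odometer (KM)": [35000.0, np.nan, 120000.0, 60000.0, 87000.0, 43000.0, np.nan, 150000.0],
    "Price": [15000, 12000, 6000, 22000, 9000, 13500, 18000, 5000],
})

# 1. Split into features (X) and labels (y), then into training and test sets
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. Fill missing values: fit on the training set only, then reuse on the test set
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_train_num = num_imputer.fit_transform(X_train[["Odometer (KM)"]])
X_test_num = num_imputer.transform(X_test[["Odometer (KM)"]])
X_train_cat = cat_imputer.fit_transform(X_train[["Make"]])
X_test_cat = cat_imputer.transform(X_test[["Make"]])

# 3. Convert the categorical column to numbers with one-hot encoding
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_cat_enc = encoder.fit_transform(X_train_cat).toarray()
X_test_cat_enc = encoder.transform(X_test_cat).toarray()

# Combine the encoded and imputed columns into fully numerical arrays
X_train_ready = np.hstack([X_train_cat_enc, X_train_num])
X_test_ready = np.hstack([X_test_cat_enc, X_test_num])
print(X_train_ready.shape, X_test_ready.shape)
```

Note that the imputers and the encoder only ever call `fit_transform()` on the training data, so nothing about the test set leaks into the values used for filling or encoding.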

## 2. Pick a model/estimator (to suit your problem)
To pick a model we use the __[Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)__.
<img src="../images/sklearn-ml-map.png" />
__Notes:__ 
* Scikit-Learn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
   * Sometimes you'll see `clf` (short for classifier) used as a classification estimator
* Regression problem - predicting a number (the selling price of a car)

If you're working on a machine learning problem, want to use Scikit-Learn and aren't sure which model to use, refer to the [Sklearn Machine Learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

What if `Ridge` didn't work or the score didn't fit our needs? Try a different model. How about an ensemble model (an ensemble is a combination of smaller models that tries to make better predictions than a single model)?

Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html

Tidbit:
1. If you have structured data, use ensemble methods
2. If you have unstructured data, use deep learning or transfer learning (e.g. music, video, images, etc)
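Trying a different estimator usually means changing only the line that creates the model. The following is a minimal sketch that compares `Ridge` with `RandomForestRegressor` on synthetic data; the dataset and the `scores` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit each candidate estimator and record its default score (R^2 for regressors)
scores = {}
for name, estimator in [("Ridge", Ridge()),
                        ("RandomForestRegressor", RandomForestRegressor(random_state=42))]:
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)

print(scores)
```

Which estimator scores higher depends on the data, so treat the comparison as an experiment rather than a rule.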

## 3. Fit the model/algorithm on our data and use it to make predictions

### Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, targets, target variables

#### Random Forest model deep dive

These resources will help you understand what's happening inside the Random Forest models we've been using.

* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forest Wikipedia (simple version)](https://simple.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) by Will Koehrsen
* [An implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

### Make predictions using the machine learning model

Two ways to make predictions:

1. `predict()` returns the predicted labels
2. `predict_proba()` returns the predicted probability of each class (classifiers only)
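As a minimal sketch of fitting a model and using both prediction methods, the example below trains a `RandomForestClassifier` on synthetic data (the dataset and variable names are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model to the training data (X = features, y = labels)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# predict() returns one label per sample
y_preds = clf.predict(X_test)

# predict_proba() returns one probability per class for each sample
y_probs = clf.predict_proba(X_test)

print(y_preds[:5])
print(y_probs[:5])
```

Each row of `y_probs` sums to 1, and the label returned by `predict()` corresponds to the class with the highest probability in that row.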

## 4. Evaluating a machine learning model
Every Scikit-Learn model has a default metric, accessible through its `score()` method (mean accuracy for classifiers, R^2 for regressors).

However, there are a range of different evaluation metrics you can use depending on the model you're using.

Three ways to evaluate Scikit-Learn models/estimators:
1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions

A full list of evaluation metrics can be __[found in the documentation](https://scikit-learn.org/stable/modules/model_evaluation.html)__.
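The three evaluation approaches can be sketched side by side. This is a minimal sketch on synthetic data; the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 1. The estimator's built-in score() method (mean accuracy for classifiers)
score_method = clf.score(X_test, y_test)

# 2. The scoring parameter, here used with 5-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# 3. A problem-specific metric function from sklearn.metrics
metric_function = accuracy_score(y_test, clf.predict(X_test))

print(score_method, cv_scores.mean(), metric_function)
```

For a classifier, approaches 1 and 3 give the same number here because `score()` reports accuracy by default.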

### Classification model evaluation metrics

1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

#### Area under the receiver operating characteristic curve (AUC/ROC)

* Area under curve (AUC)
* ROC curve

ROC curves compare a model's true positive rate (tpr) with its false positive rate (fpr) at different classification thresholds.

* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
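A minimal sketch of computing the ROC curve and the AUC score with `roc_curve()` and `roc_auc_score()` follows. The synthetic dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# ROC curves need prediction probabilities for the positive class (column 1)
y_probs_positive = clf.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)

# Area under the ROC curve (1.0 would be a perfect model)
auc = roc_auc_score(y_test, y_probs_positive)
print(auc)
```

Plotting `fpr` against `tpr` (for example with `plt.plot(fpr, tpr)`) draws the curve itself.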

#### Confusion matrix

A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict.

In essence, it gives you an idea of where the model is getting confused.
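As a small illustration, the sketch below builds a confusion matrix from hand-written labels (the `y_true` and `y_preds` values are made up for this example).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_preds = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_true, y_preds)
print(cm)

# For binary labels, ravel() unpacks the four cells in this order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

Here the model makes one false positive and one false negative, which appear in the off-diagonal cells.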

To summarize classification metrics:
* __Accuracy__ is a good measure to start with if all classes are balanced (e.g. same number of samples which are labeled with 0 or 1)
* __Precision__ and __recall__ become more important when classes are imbalanced.
* If false positive predictions are worse than false negatives, aim for a higher precision.
* If false negative predictions are worse than false positives, aim for a higher recall.
* __F1-score__ is a combination of precision and recall.

### Regression model evaluation metrics

Model evaluation metrics [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)


The ones we're going to cover are:
1. R^2 (pronounced r-squared) or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)

#### R^2

What R-squared does: compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
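The special values of R^2 can be checked with `r2_score()` on a few hand-picked numbers (the arrays below are made up for illustration).

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# A model that always predicts the mean of the targets
y_mean = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean)

# A model that predicts every target perfectly
r2_perfect = r2_score(y_true, y_true)

# A model that is worse than predicting the mean (predictions reversed)
r2_poor = r2_score(y_true, y_true[::-1])

print(r2_mean, r2_perfect, r2_poor)  # 0.0 1.0 -3.0
```

The negative value shows that R^2 is not bounded below by 0: a model can do worse than simply predicting the mean.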

#### Mean absolute error (MAE)

MAE is the average of the absolute differences between predictions and actual values.

It gives you an idea of how wrong your model's predictions are.

#### Mean squared error (MSE)

MSE is the mean of the square of the errors between actual and predicted values.
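To make the two error metrics concrete, the sketch below computes MAE and MSE with Scikit-Learn and again by hand with NumPy. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_preds = np.array([2.5, 5.0, 9.0, 8.0])

# Scikit-Learn metric functions
mae = mean_absolute_error(y_true, y_preds)
mse = mean_squared_error(y_true, y_preds)

# The same calculations by hand
errors = y_preds - y_true
mae_by_hand = np.abs(errors).mean()
mse_by_hand = np.square(errors).mean()

print(mae, mse)  # 0.875 1.3125
```

The single error of 2.0 contributes 4.0 to the squared errors, which is why MSE is pulled up more than MAE by larger mistakes.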

Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.

### Classification Model Evaluation Metrics/Techniques

* __Accuracy__ - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.

* __[Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score)__ - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.

* __[Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score)__ - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.

* __[F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)__ - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.

* __[Confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)__ - Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all non-zero values in the matrix lie on the diagonal from top left to bottom right.

* __[Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)__ - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.

* __[Classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)__ - Sklearn has a built-in function called `classification_report()` which returns some of the main classification metrics such as precision, recall and f1-score.

* __[ROC Curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)__ - Also known as the [receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve, this is a plot of true positive rate versus false positive rate.

* __[Area Under Curve (AUC) Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)__ - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.
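Several of the metrics above can be computed in a few lines. The following is a minimal sketch using hand-written labels (the `y_true` and `y_preds` values are made up for illustration).

```python
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 2 true positives, 2 false negatives, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_preds = [1, 1, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_preds)    # 5 correct out of 8
precision = precision_score(y_true, y_preds)  # 2 / (2 + 1)
recall = recall_score(y_true, y_preds)        # 2 / (2 + 2)
f1 = f1_score(y_true, y_preds)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)

# classification_report() gathers precision, recall and f1-score per class
report = classification_report(y_true, y_preds)
print(report)
```

With these labels precision (about 0.67) is higher than recall (0.5), because the model misses more positives than it wrongly flags.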

### Which classification metric should you use?

* __Accuracy__ is a good measure to start with if all classes are balanced (e.g. same number of samples which are labelled with 0 or 1).

* __Precision__ and __recall__ become more important when classes are imbalanced.

* If false-positive predictions are worse than false-negatives, aim for higher precision.

* If false-negative predictions are worse than false-positives, aim for higher recall.

* __F1-score__ is a combination of precision and recall.

* A confusion matrix is always a good way to visualize how a classification model is going.

### Regression Model Evaluation Metrics/Techniques

* [R^2 (pronounced r-squared) or the coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.

* [Mean absolute error (MAE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.

* [Mean squared error (MSE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).

### Which regression metric should you use?

* __R^2__ is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your __R^2__ value is to 1.0, the better the model. But it doesn't tell you exactly how wrong your model is in terms of how far off each prediction is.

* __MAE__ gives a better indication of how far off each of your model's predictions are on average.

* As for __MAE__ versus __MSE__: because MSE squares the differences between predicted and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).

  * Pay more attention to MAE: When being \$10,000 off is __twice__ as bad as being \$5,000 off.

  * Pay more attention to MSE: When being \$10,000 off is __more than twice__ as bad as being \$5,000 off.
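The difference between the two metrics shows up in a small worked example. In this sketch (house prices made up for illustration), one set of predictions is off by \$5,000 on both houses, while the other is exactly right on one house and \$10,000 off on the other.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [200_000, 300_000]

# Off by 5,000 on each house
preds_even = [205_000, 305_000]

# Off by 0 on one house and 10,000 on the other
preds_one_big = [200_000, 310_000]

mae_even = mean_absolute_error(y_true, preds_even)
mae_one_big = mean_absolute_error(y_true, preds_one_big)
mse_even = mean_squared_error(y_true, preds_even)
mse_one_big = mean_squared_error(y_true, preds_one_big)

print(mae_even, mae_one_big)  # 5000.0 5000.0
print(mse_even, mse_one_big)  # 25000000.0 50000000.0
```

MAE treats the two sets of predictions as equally wrong, while MSE rates the single \$10,000 miss as twice as bad overall.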

For more on evaluating a machine learning model, check out the following resources:

* [Scikit-Learn documentation for metrics and scoring (quantifying the quality of predictions)](https://scikit-learn.org/stable/modules/model_evaluation.html)

* [Beyond Accuracy: Precision and Recall by Will Koehrsen](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)

* [Stack Overflow answer describing MSE (mean squared error) and RMSE (root mean squared error)](https://stackoverflow.com/a/37861832)

#### Data Modelling class to assist with evaluation

```Python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


class DataModel():    
    data = []
    train = ()
    valid = ()
    test = ()

    _TRAINING_SIZE = 0.7
    _VALIDATION_SIZE = 0.15
    _TEST_SIZE = 0.15
    _X = None
    _y = None

    def __init__(self, X, y, train_size=0.7, valid_size=0.15, test_size=0.15):
        # Compare with a small tolerance, since sums of floats may not equal 1 exactly
        total_size = train_size + valid_size + test_size
        if abs(total_size - 1) > 1e-9:
            raise ValueError(
                f"Combined split sizes must equal 1. Current split size={total_size}")

        self._TRAINING_SIZE = train_size
        self._VALIDATION_SIZE = valid_size
        self._TEST_SIZE = test_size
        self._X = X
        self._y = y

    def evaluate_preds(self, y_true, y_preds):
        """
        Compares y_true labels against y_preds labels for a binary classification model.
        """
        accuracy = accuracy_score(y_true, y_preds)
        precision = precision_score(y_true, y_preds)
        recall = recall_score(y_true, y_preds)
        f1 = f1_score(y_true, y_preds)
        return {"accuracy": round(accuracy, 2),
                "precision": round(precision, 2),
                "recall": round(recall, 2),
                "f1": round(f1, 2)}

    def train_test_validation_split(self, data):
        len_df = len(data)
        train_split = round(self._TRAINING_SIZE * len_df)
        valid_split = round(train_split + self._VALIDATION_SIZE * len_df)

        self.train = self._X[:train_split], self._y[:train_split]
        self.valid = self._X[train_split:valid_split], self._y[train_split:valid_split]
        self.test = self._X[valid_split:], self._y[valid_split:]

        return (self.train, self.valid, self.test)

    def train_test_split(self, test_size=None):
        if not test_size:
            test_size = self._TEST_SIZE
            
        return train_test_split(self._X, self._y, test_size=test_size)

```


## 5. Improve through experimentation
Two of the main ways to improve a model's baseline metrics (the first evaluation metrics you get) are from a data perspective and from a model perspective.

From a data perspective ask:

* Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
* Could we improve our data? This could mean filling in missing values or finding a better encoding (turning things into numbers) strategy.

From a model perspective ask:

* Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? (we saw an example of this when looking at the __[Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)__, ensemble methods are generally considered more complex models)
* Could we improve the current model? If the model you're using performs well straight out of the box, can the __hyperparameters__ be tuned to make it even better?

__Hyperparameters__ are like settings on a model that you can adjust to alter, and potentially improve, the way it finds patterns. Adjusting hyperparameters is referred to as hyperparameter tuning.

Hyperparameters vs. Parameters
* Parameters = patterns in the data that the model finds on its own
* Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns

Three ways to adjust hyperparameters:
1. By hand
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV
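The two automated approaches share almost the same interface. The sketch below runs Scikit-Learn's `RandomizedSearchCV` and `GridSearchCV` over the same small grid; the synthetic data, the grid values and the variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter values to try (2 x 2 x 2 = 8 combinations)
grid = {"n_estimators": [10, 50],
        "max_depth": [None, 5],
        "min_samples_split": [2, 4]}

# Randomly: try n_iter combinations sampled from the grid
rs_clf = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=grid,
                            n_iter=4,
                            cv=3,
                            random_state=42)
rs_clf.fit(X_train, y_train)

# Exhaustively: try every combination in the grid
gs_clf = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=grid,
                      cv=3)
gs_clf.fit(X_train, y_train)

print(rs_clf.best_params_, rs_clf.score(X_test, y_test))
print(gs_clf.best_params_, gs_clf.score(X_test, y_test))
```

The random search fits fewer models, so it is cheaper on large grids, while the grid search is guaranteed to try every combination listed.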

### 5.1 Fine tuning hyperparameters by hand

A very common way to split data when tuning hyperparameters is a 70/15/15 split:

* Model gets trained on the Training split
* Hyperparameters get tuned on the Validation split
* Model gets evaluated on the Test split

<img src="../images/Data-split-for-training-validation-and-testing.png" />
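One way to get a 70/15/15 split is to call `train_test_split()` twice. The sketch below does this and then tunes a single hyperparameter by hand on the validation split. The synthetic data, the candidate `max_depth` values and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# First hold out 30% of the data, then halve it into validation and test splits
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Train on the training split and compare settings on the validation split
valid_scores = {}
for max_depth in [2, 5, None]:
    clf = RandomForestClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train)
    valid_scores[max_depth] = clf.score(X_valid, y_valid)

best_max_depth = max(valid_scores, key=valid_scores.get)

# Evaluate the chosen setting once on the test split
final_clf = RandomForestClassifier(max_depth=best_max_depth, random_state=42)
final_clf.fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_max_depth, test_score)
```

Keeping the test split out of the tuning loop means the final score is measured on data that played no part in choosing the hyperparameters.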

## 6. Saving and loading machine learning trained models

Two ways to save and load machine learning models:

1. With Python's `pickle` module
2. With the `joblib` module
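Both approaches can be sketched in a few lines. The example below saves and reloads a small model in a temporary directory; the file names and the synthetic data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # 1. Save and load with Python's pickle module
    pickle_path = os.path.join(tmp_dir, "model.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(clf, f)
    with open(pickle_path, "rb") as f:
        pickled_clf = pickle.load(f)

    # 2. Save and load with the joblib module
    joblib_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(clf, joblib_path)
    joblib_clf = joblib.load(joblib_path)

# The reloaded models make the same predictions as the original
print((pickled_clf.predict(X) == clf.predict(X)).all())
print((joblib_clf.predict(X) == clf.predict(X)).all())
```

Only load model files you trust, since both `pickle` and `joblib` can execute arbitrary code when loading.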

## 7. Putting it all together!

Steps we want to do:
1. Fill missing data
2. Convert data to numbers
3. Build a model on the data

### Test Case

```Python

# Getting data ready
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv(CAR_SALES_MISSING_CSV)
data.dropna(subset=["Price"], inplace=True)

# Define different features and transformer pipeline
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy="constant", fill_value=4))
])

numeric_feature = ["Odometer (KM)"]
numeric_transformers = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy="mean"))
])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, categorical_features), 
                                               ("door", door_transformer, door_feature), 
                                               ("num", numeric_transformers, numeric_feature)
                                              ])

# Create a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                       ("model", RandomForestRegressor())
                       ])

# Split data
X = data.drop("Price", axis=1)
y = data["Price"]

data_model = DataModel(X, y)
X_train, X_test, y_train, y_test = data_model.train_test_split(test_size=0.2)

model.fit(X_train, y_train)
model.score(X_test, y_test) # score: 0.22188417408787875

# Use GridSearchCV (imported above) with our regression Pipeline
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    # "model__max_features": ["auto"],
    "model__min_samples_split": [2, 4]
}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=2)
gs_model.fit(X_train, y_train)

gs_model.score(X_test, y_test) # Improved score: 0.33876510771726487
```