# **Why Scikit-Learn for machine learning?**

Scikit-Learn, also known as sklearn, is Python’s premier general-purpose machine learning library. While you’ll find other packages that do better at certain tasks, Scikit-Learn’s versatility makes it the best starting place for most ML problems.

It’s also a fantastic library for beginners because it offers a high-level interface for many tasks (e.g. preprocessing data, cross-validation, etc.). This allows you to better practice the entire machine learning workflow and understand the big picture.

# Scikit-Learn Tutorial Contents
Here are the steps for building your first random forest model using Scikit-Learn:

1. Set up your environment.
2. Import libraries and modules.
3. Load red wine data.
4. Split data into training and test sets.
5. Declare data preprocessing steps.
6. Declare hyperparameters to tune.
7. Tune model using cross-validation pipeline.
8. Refit on the entire training set.
9. Evaluate model pipeline on test data.
10. Save model for future use.

### Step 1: Set up your environment.

Make sure the following are installed on your computer:

* Python 3 (recent Scikit-Learn releases, including the 0.22 series used here, no longer support Python 2)
* NumPy
* Pandas
* Scikit-Learn (a.k.a. sklearn)

I strongly recommend installing Python through Anaconda. It comes with all of the above packages already installed.

If you need to update any of the packages, it's as easy as typing `conda update <package>` in your command-line program (Terminal on macOS).

You can confirm Scikit-Learn was installed properly:

```
import sklearn
print(sklearn.__version__)
```

```
0.22.1
```


### Step 2: Import libraries and modules.
To begin, let's import NumPy, which provides support for efficient numerical computation:

```
import numpy as np
```

Next, we'll import Pandas, a convenient library that supports dataframes. Pandas is technically optional because Scikit-Learn can handle numerical matrices directly, but it'll make our lives easier:

```
import pandas as pd
```

Now it's time to start importing functions for machine learning. The first one will be the train_test_split() function from the model_selection module. As its name implies, this module contains many utilities that will help us choose between models.

```
from sklearn.model_selection import train_test_split
```

Next, we'll import the entire preprocessing module. This contains utilities for scaling, transforming, and wrangling data.

```
from sklearn import preprocessing
```

Next, let's import the families of models we'll need.

**What's the difference between model "families" and actual models?**

A model "family" is a broad type of model, such as random forests, SVMs, or linear regression models. Within each family, you get an actual model once you fit and tune its parameters on the data.

*Tip: Don't worry too much about this for now. It will make more sense once we get to Step 7.*

We can import the random forest family like so:

```
from sklearn.ensemble import RandomForestRegressor
```

For the scope of this tutorial, we'll only focus on training a random forest and tuning its parameters. We'll have another detailed tutorial for how to choose between model families.

For now, let's move on to importing the tools to help us perform cross-validation.

```
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
```
Next, let's import some metrics we can use to evaluate our model performance later.

```
from sklearn.metrics import mean_squared_error, r2_score
```
And finally, we'll import a way to persist our model for future use.

```
import joblib  # standalone package; sklearn.externals.joblib was removed in Scikit-Learn 0.23
```

Joblib is an alternative to Python's pickle package, and we'll use it because it's more efficient for storing large numpy arrays.

### Step 3: Load red wine data.

Alright, now we're ready to load our data set. The Pandas library that we imported comes with a whole suite of helpful input/output tools.

You can read data from CSV, Excel, SQL, SAS, and many other data formats.

The convenient tool we'll use today is the read_csv() function. Using this function, we can load any CSV file, even from a remote URL!

```
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)
```

Now let's take a look at the first 5 rows of data:
```
print(data.head())
```
Upon further inspection, you will realise that the CSV file is actually using semicolons to separate the data. That's annoying, but easy to fix:


```
data = pd.read_csv(dataset_url, sep=';')
```
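To see why the separator matters, here is a small sketch on an in-memory CSV. The two rows are made up for illustration and are not taken from the wine data; only the semicolon-delimited layout is the same.

```python
import io
import pandas as pd

# Two made-up rows in a semicolon-delimited layout (not real wine data).
raw = 'fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n'

# With the default comma separator, every line is read as one big column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)  # (2, 1)

# With sep=';' the three columns are split correctly.
right = pd.read_csv(io.StringIO(raw), sep=';')
print(right.shape)  # (2, 3)
print(list(right.columns))
```

A shape with a single column is usually the quickest hint that the wrong separator was used.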

Now, let's take a look at the data.

```
print(data.shape)
print(data.describe())
```

### Step 4: Split data into training and test sets.

Splitting the data into training and test sets at the beginning of your modeling workflow is crucial for getting a realistic estimate of your model's performance.

First, let's separate our target (y) features from our input (X) features:

```
y = data.quality
X = data.drop('quality', axis=1)
```
This allows us to take advantage of Scikit-Learn's useful train_test_split function:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
```

As you can see, we'll set aside 20% of the data as a test set for evaluating our model. We also set an arbitrary "random state" (a.k.a. seed) so that we can reproduce our results.

Finally, it's good practice to stratify your sample by the target variable. This will ensure your training set looks similar to your test set, making your evaluation metrics more reliable.
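If you'd like to check what stratifying actually does, here's a quick sketch. It uses a made-up, imbalanced target as a stand-in for the quality scores (an assumption for illustration, so it runs without downloading anything) and compares the class proportions in the two splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target standing in for wine quality scores.
y = pd.Series([5] * 60 + [6] * 30 + [7] * 10)
X = pd.DataFrame({'feature': np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# With stratify=y, class proportions match in both splits.
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare class could end up over- or under-represented in the test set purely by chance.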

### Step 5: Declare data preprocessing steps.
In Step 3, the summary statistics showed that our features are on very different scales, so we'll want to standardize them.

**WTF is standardization?**

Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations.

Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.
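The definition above is only a couple of lines of NumPy. Here's a minimal sketch on a tiny made-up matrix (the numbers are arbitrary, chosen so the two columns sit on very different scales):

```python
import numpy as np

# Made-up feature matrix: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Subtract each column's mean, then divide by each column's standard deviation.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_standardized)
print(X_standardized.mean(axis=0))  # each column is now centered at zero
print(X_standardized.std(axis=0))   # each column now has unit standard deviation
```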

Scikit-Learn makes data preprocessing a breeze. For example, it's pretty easy to simply scale a dataset:

```
X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled)
```

You can confirm that the scaled dataset is indeed centered at zero, with unit variance:

```
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```
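Note that the complete code at the end of this tutorial doesn't call `scale` directly. Instead, it wraps a `StandardScaler` and the random forest in a pipeline. Here's a sketch of the likely reason, on made-up data (an assumption for illustration): a fitted scaler remembers the training means and standard deviations, so the exact same transformation can be reused on the test set later.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Made-up training and test features (not the wine data).
rng = np.random.RandomState(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
X_test = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

# Fit the scaler on the training set only...
scaler = preprocessing.StandardScaler().fit(X_train)

# ...then apply the same means and standard deviations to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0))  # essentially zero
print(X_test_scaled.mean(axis=0))   # close to zero, but not exactly

# The same pipeline as in the complete code: scaling first, then the model.
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))
print([name for name, _ in pipeline.steps])
```

The step names printed on the last line are the prefixes used when declaring hyperparameters for the pipeline, e.g. `randomforestregressor__max_depth`.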

### Step 6: Declare hyperparameters to tune.

Now it's time to consider the hyperparameters that we'll want to tune for our model.

**WTF are hyperparameters?**

There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot.

Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.

As an example, let's take our random forest for regression:

Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are model parameters.

However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. The algorithm also cannot decide how many trees to include in the forest. These are examples of hyperparameters that the user must set.
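Here's a rough sketch of that distinction in code, using made-up data (an assumption for illustration). The values we pass to the constructor are hyperparameters; the split found at the root of a fitted tree is a model parameter learned from the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up regression data: the target depends mostly on the first feature.
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 3))
y = X[:, 0] + 0.1 * rng.normal(size=100)

# Hyperparameters: chosen by us before training.
model = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
params = model.get_params()
print(params['n_estimators'], params['max_depth'])

# Model parameters: learned from the data during fit().
model.fit(X, y)
first_tree = model.estimators_[0].tree_
# Root split of the first tree: which feature it uses and at what threshold.
print(first_tree.feature[0], first_tree.threshold[0])
```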

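We can list the tunable hyperparameters of our model pipeline (a StandardScaler followed by a RandomForestRegressor, as defined in the complete code at the end) like so:

```
print(pipeline.get_params())
```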

Now, let's declare the hyperparameters we want to tune through cross-validation.

```
# 1.0 uses all features (what 'auto' meant for regressors; 'auto' was removed in Scikit-Learn 1.3)
hyperparameters = { 'randomforestregressor__max_features' : [1.0, 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}
```

### Step 7: Tune model using a cross-validation pipeline.

Now we're almost ready to dive into fitting our models. But first, we need to spend some time talking about cross-validation.

This is one of the most important skills in all of machine learning because it helps you maximize model performance while reducing the chance of overfitting.

**WTF is cross-validation (CV)?**

Cross-validation is a process for reliably estimating the performance of a method for building a model by training and evaluating your model multiple times using the same method.

Practically, that "method" is simply a set of hyperparameters in this context.

These are the steps for CV:

1. Split your data into k equal parts, or "folds" (typically k=10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
4. Perform steps (2) and (3) k times, each time holding out a different fold.
5. Aggregate the performance across all k folds. This is your performance metric.
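The five steps above can be sketched by hand with `KFold`. This is only an illustration on made-up data (the data, the 20 trees, and the R² metric are assumptions picked to keep it short and fast), not the way we'll actually run CV in this tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Made-up regression data.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Step 1: split the data into k=10 folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, holdout_idx in kfold.split(X):
    # Step 2: train on the k-1 training folds.
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Step 3: evaluate on the hold-out fold.
    scores.append(r2_score(y[holdout_idx], model.predict(X[holdout_idx])))
    # Step 4: the loop repeats this with a different hold-out fold each time.

# Step 5: aggregate the performance across all folds.
print(len(scores))
print(np.mean(scores))
```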


**Why is cross-validation important in machine learning?**

Let's say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.

How can you decide?

That's where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness.

This allows you to keep your test set "untainted" and save it for a true hold-out evaluation when you're finally ready to select a model.

For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. Then, you still have the untainted test set to make your final selection between the model families!

**So WTF is a cross-validation "pipeline?"**

The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with influential data from your test fold.

Here's how the CV pipeline looks after including preprocessing steps:

1. Split your data into k equal parts, or "folds" (typically k=10).
2. Preprocess k-1 training folds.
3. Train your model on the same k-1 folds.
4. Preprocess the hold-out fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) - (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.
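Written out by hand, those steps might look like the sketch below. It uses made-up data and a small forest (both assumptions for illustration), and then checks the manual loop against `cross_val_score` run on an equivalent pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

manual_scores = []
for train_idx, holdout_idx in kfold.split(X):
    # Preprocess using statistics from the training folds only.
    scaler = StandardScaler().fit(X[train_idx])
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(scaler.transform(X[train_idx]), y[train_idx])
    # Apply the same transformation to the hold-out fold, then evaluate.
    preds = model.predict(scaler.transform(X[holdout_idx]))
    manual_scores.append(r2_score(y[holdout_idx], preds))

# A pipeline does the same bookkeeping for us inside each fold.
pipeline = make_pipeline(StandardScaler(),
                         RandomForestRegressor(n_estimators=20, random_state=0))
pipeline_scores = cross_val_score(pipeline, X, y, cv=kfold)

print(np.mean(manual_scores))
print(np.mean(pipeline_scores))
```

The key detail is that the scaler is fit inside the loop, so the hold-out fold never influences the means and standard deviations used for training.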

Fortunately, Scikit-Learn makes it stupidly simple to set this up:

```
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
 
# Fit and tune model
clf.fit(X_train, y_train)
```

Yes, it's really that easy. GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters.

It takes in your model (in this case, we're using a model pipeline), the hyperparameters you want to tune, and the number of folds to create.
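To get a feel for how much work that is, you can count the grid points with `ParameterGrid`. The sketch below uses a grid of the same shape as the one from Step 6 (with `1.0`, meaning "all features", as the first `max_features` value, which is an assumption made so the values are also valid on newer Scikit-Learn versions).

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the Step 6 grid: 3 values for max_features, 4 for max_depth.
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

grid = list(ParameterGrid(hyperparameters))
n_folds = 10

print(len(grid))            # 12 combinations (3 * 4)
print(len(grid) * n_folds)  # 120 fits during the search, before the final refit
print(grid[0])
```

Each extra hyperparameter multiplies the number of combinations, so grids grow quickly.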

Obviously, there's a lot going on under the hood. We've included the pseudo-code above, and we'll cover writing cross-validation from scratch in a separate guide.

Now, you can see the best set of parameters found using CV:

```
print(clf.best_params_)
```

### Step 8: Refit on the entire training set.
After you've tuned your hyperparameters appropriately using cross-validation, you can generally get a small performance improvement by refitting the model on the entire training set.

Conveniently, GridSearchCV from sklearn will automatically refit the model with the best set of hyperparameters using the entire training set.

This functionality is ON by default, but you can confirm it:

```
print(clf.refit)
```
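Here's a small self-contained sketch of what the refit gives you. The data, the tiny grid, and the name `search` are all made up for illustration so that it runs in a second or two; they are not the settings used in this tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data.
rng = np.random.RandomState(0)
X = rng.uniform(size=(120, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=120)

pipeline = make_pipeline(StandardScaler(),
                         RandomForestRegressor(n_estimators=10, random_state=0))
search = GridSearchCV(pipeline,
                      {'randomforestregressor__max_depth': [None, 1]},
                      cv=3)
search.fit(X, y)

print(search.refit)                           # True by default
print(search.best_params_)                    # winning combination from CV
print(type(search.best_estimator_).__name__)  # the refitted pipeline
print(search.predict(X[:3]))                  # predictions come from best_estimator_
```

Because of the refit, the search object exposes `best_estimator_` and can be used for predictions directly.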
Now, you can simply use the clf object as your model when applying it to other sets of data. That's what we'll be doing in the next step.

### Step 9: Evaluate model pipeline on test data.
Alright, we're in the home stretch!

This step is really straightforward once you understand that the clf object you used to tune the hyperparameters can also be used directly like a model object.

Here's how to predict a new set of data:

```
y_pred = clf.predict(X_test)
```

Now we can use the metrics we imported earlier to evaluate our model performance.
```
print(r2_score(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))
```
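If you're curious what these two metrics compute, here's a sketch that works them out by hand on a few made-up values (not actual model output) and compares the results with Scikit-Learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted quality scores.
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_hat = np.array([5.5, 6.0, 6.0, 5.0])

# Mean squared error: the average of the squared prediction errors.
mse = np.mean((y_true - y_hat) ** 2)

# R^2: one minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
```

Lower MSE is better, while R² is better the closer it gets to 1. An R² of 0 means the model does no better than always predicting the mean.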

### Step 10: Save model for future use.
Great job completing this tutorial!

You've done the hard part, and deserve another glass of wine. Maybe this time you can use your shiny new predictive model to select the bottle.

But before you go, let's save your hard work so you can use the model in the future. It's really easy to do so:
```
joblib.dump(clf, 'rf_regressor.pkl')
```

And that's it. When you want to load the model again, simply use this function:

```
clf2 = joblib.load('rf_regressor.pkl')
```

Then you can make predictions with the loaded model:
```
clf2.predict(X_test)
```
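As a sanity check, you can confirm that a reloaded model makes the same predictions as the original. The sketch below does this with a small model trained on made-up data and a temporary directory (both assumptions for illustration, so nothing is left behind on disk).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up regression data and a small model to persist.
rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 3))
y = X[:, 0] + 0.1 * rng.normal(size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Save and reload inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Keep in mind that a saved model is generally only safe to load with the same Scikit-Learn version that created it, and only from sources you trust.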

### The complete code, from start to finish.

```
# 2. Import libraries and modules
import numpy as np
import pandas as pd

import joblib  # standalone package; sklearn.externals.joblib was removed in Scikit-Learn 0.23
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# 3. Load red wine data.
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

# 4. Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

# 5. Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

# 6. Declare hyperparameters to tune
# 1.0 uses all features (what 'auto' meant for regressors; 'auto' was removed in Scikit-Learn 1.3)
hyperparameters = { 'randomforestregressor__max_features' : [1.0, 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}

# 7. Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

clf.fit(X_train, y_train)

# 8. Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)

# 9. Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

# 10. Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')
```