<center>
<img src="./images/EAN.jpg" style="width:1200px">
</center>

<center>
<img src="./images/0_intro_ml.jpg" style="width:1200px">
</center>

# Lecture 2: Supervised Learning

## Instructors:

>Leonardo A. Espinosa, PhD. Instructor.
(*email*: leonardo.espinosaleal@arcada.fi)

> Ruben D. Acosta, MSc. Instructor.
(*email*:  rdacostav@universidadean.edu.co)

# Goal for today
* Understand the principles of supervised learning.
* Identify the pros and cons of the main algorithms for regression and classification.

<center>
<img src="./images/ai_ml_dl.png" style="width:1400px">
</center>


# Statistical Learning

**A framework for machine learning drawing from the fields of *statistics* and *functional analysis*.**

* It deals with the problem of finding a predictive function (**f**) based on data.

* It is basically a set of tools for *understanding the data*.

>**Machine learning** is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.


### Types of Machine learning

>* Supervised learning
>* Unsupervised learning
>* Reinforcement learning

#### Others:
>* Semi-supervised learning
>* Active learning

# Supervised Learning (*Supervised Statistical Learning*)

>Building a statistical model for predicting, or estimating, an *output* based on one or more *inputs*. 

>Range of disciplines: business, medicine, astrophysics, public policy, social sciences and many more!



### Classification and Regression

* >Classification $\to$ qualitative or categorical variables.
* >Regression $\to$ continuous numerical quantity.

### Generalization, Overfitting and Underfitting

#### Generalization
Build a model on the training set and then be able to make predictions on **"new data"**.

#### Overfitting
Building a model that is too complex for the amount of available information.

#### Underfitting
Choosing a model that is too simple to capture the structure of the data.



## Diagnostics

> **Bias** ― The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points.

> **Variance** ― The variance of a model is the variability of the model prediction for given data points.

> **Bias/variance tradeoff** ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.
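
The trade-off can be checked numerically. The following is a minimal sketch, not part of the lecture code: it assumes a synthetic target $\sin(2\pi x)$ with Gaussian noise and polynomial models of degree 1, 3 and 9 (the function `bias_variance` and all settings are illustrative choices). Each model is refitted on many freshly drawn training sets, and the squared bias and the variance of its predictions are estimated on a fixed grid.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_repeats=200, n_samples=30, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial model (illustrative)."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.05, 0.95, 50)          # points where predictions are compared
    f_true = np.sin(2 * np.pi * x_grid)           # assumed "true" function
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # draw a fresh noisy training set and refit the model
        x = rng.uniform(0, 1, n_samples)
        y = np.sin(2 * np.pi * x) + rng.normal(0, noise, n_samples)
        preds[r] = Polynomial.fit(x, y, degree, domain=[0, 1])(x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # (expected prediction - truth)^2
    variance = np.mean(preds.var(axis=0))                 # spread across training sets
    return bias2, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print("degree {}: bias^2 = {:.3f}, variance = {:.3f}".format(d, b2, var))
```

In this setting the straight line should show the largest bias and the degree-9 polynomial the largest variance.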

<center>
<img src="./images/bias-variance.png" style="width:1200px">
Figure 1b. Example of Fitting-Underfitting models in classification and regression.
</center>

# Possible Remedies

>Underfitting:
* Increase the model complexity
* Add more features

>Overfitting:
* Perform regularization
* Get more data

<center>
<img src="./images/02-overfitting_underfitting.png" style="width:1000px">
Figure 1c: Trade-off of model complexity against *training* and *test* accuracy.
</center>


#### Relation of Model Complexity to Dataset Size
Having more data and building appropriately more complex models often improves the results.
However, data alone is not enough.

Always remember:

1. It is generalization that counts.

2. The *curse of dimensionality* and the *blessing of non-uniformity*.

3. No free-lunch theorem.
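
One face of the curse of dimensionality can be simulated directly. This is a minimal sketch under the assumption of uniformly distributed random points (the helper `distance_contrast` is hypothetical): as the number of dimensions grows, the nearest and the farthest neighbor of a query point end up at almost the same distance, which hurts distance-based methods. Real data are usually not uniform, which is where the blessing of non-uniformity comes in.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest neighbor of a random query."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print("{:>4} dimensions: contrast = {:.2f}".format(d, c))
```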

<center>
<img src="./images/cod.jpg" style="width:1000px">
Figure 1d: The curse of dimensionality and the blessing of non-uniformity.
</center>

<center>
<img src="./images/nflt.jpg" style="width:1000px">
Figure 1e: No free-lunch theorem
</center>

### Supervised Machine Learning Algorithms

In this Lecture we are going to explore, using examples, the main algorithms for supervised learning, following a taxonomic approach, including pros and cons.

1\. <a href="#/32/1">k-nearest neighbors (KNN)</a>:
   * for classification.
   * for regression.
   
2\. <a href="#/50/1">Linear Models</a>:
   * <a href="#/51/1">for Regression</a>:
       * Linear regression *aka* least squares.
       * Ridge.
       * Lasso.
       * Elastic Net.
   * <a href="#/65/1">for Classification</a>:
       * Logistic regression.
       * Linear Support Vector Machines.
       * Linear models for multiclass classification.

3\. <a href="#/83">Naive-Bayes Classifiers</a>
   
4\. <a href="#/89/1">Decision Trees</a>:
   * for classification.
   * for regression.

5\. <a href="#/99/1">Ensembles of Decision Trees</a>:
   * Random Forest.
   * Gradient Boosted Decision Trees.
       
6\. <a href="#/109/1">Kernelized Support Vector Machines</a>

7\. <a href="#/130/1">Conclusions</a>

### The Abalone dataset

<center>    
<img src="./images/02-abalon.jpg" style="width:1000px">
Figure 2: Abalones' picture.    
</center>



#### The main question:
Predict the age of abalone from physical measurements


#### Data Set Information:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

#### Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem. 
More info: [https://archive.ics.uci.edu/ml/datasets/Abalone](https://archive.ics.uci.edu/ml/datasets/Abalone)


Name |            Data Type|       Meas.|   Description
-|-|-|-
Sex           |  nominal    |           |  M, F, and I (infant)
Length        |  continuous |     mm    |   Longest shell measurement
Diameter      |  continuous |     mm    |  perpendicular to length
Height        |  continuous |     mm    |  with meat in shell
Whole weight  |  continuous |     grams |  whole abalone
Shucked weight|  continuous |     grams |  weight of meat
Viscera weight|  continuous |     grams |  gut weight (after bleeding)
Shell weight  |  continuous |     grams |  after being dried
Rings         |  integer    |           |  +1.5 gives the age in years


<center>
<img src="./images/tabla.png" style="width:1200px">
Table 1: List of the abalone's dataset features.
</center>

<center>
<img src="./images/scikit-learn-logo.png" style="width:1000px">
**http://scikit-learn.org**
</center>

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

>Users: Spotify, Evernote, Booking.com, OKCupid and many others!

<center>
<img src="./images/ml_map.png" style="width:1100px">
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
</center>

<center>
<img src="./images/ml_map-cr.png" style="width:1600px">
    Today's Lecture
</center>

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import mglearn

from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 25}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 5.0
plt.rcParams['lines.markersize'] = 15.0

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

In [None]:
file_path='https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
names = ['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
df = pd.read_csv(file_path,header=None,names=names)

In [None]:
# We add a Years column  
df['Years'] = df['Rings'] + 1.5

# We change the M,F and I categorical variables as numerical using 0,1 and 2.
replace_list = {"Sex" : {"M": 0, "F" : 1, "I": 2}}
df.replace(replace_list,inplace=True)
# If we want, we can inspect the dataset.


In [None]:
df

In [None]:
# Here we turn into numpy arrays
X = df.iloc[:,:8].values
y_cls = df.iloc[:,8].values
y_reg = df.iloc[:,9].values

# *k*-Nearest Neighbors 

The most intuitive algorithm. There are two versions:

1. ### *k*-Neighbors for classification 
2. ### *k*-Neighbors for regression

<center>    
<img src="./images/02-knns.png" alt="Drawing" style="width: 1200px;"/>
<strong>Figure 3:</strong> KNN examples. <strong>Top</strong>: K=1 and <strong>bottom</strong>: K=3. <strong>Left</strong>: for classification and <strong>right</strong>: for regression.        
</center>    
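
Both versions fit in a few lines. The following is a minimal sketch with NumPy, assuming Euclidean distance and made-up toy data (`knn_predict` is a hypothetical helper, not the scikit-learn implementation): classification takes a majority vote among the *k* nearest training points, regression averages their targets.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict one query point from its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    if task == "classification":
        # majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # regression: average of the neighbors' targets
    return float(np.mean(y_train[nearest]))

# toy data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 0, 1, 1])
targets = np.array([1.0, 1.2, 1.1, 3.0, 3.2])

query = np.array([0.15, 0.15])
pred_class = knn_predict(X_toy, labels, query, k=3)
pred_value = knn_predict(X_toy, targets, query, k=3, task="regression")
print(pred_class, pred_value)
```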



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_cls, random_state=0)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)

In [None]:
print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))

In [None]:
# Abalone dataset

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 49
neighbors_settings = list(range(1, 50))
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))

In [None]:
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.gca().invert_xaxis()
plt.legend()

In [None]:
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.ylim([0.2,0.35])
plt.gca().invert_xaxis()
plt.legend()

## Analyzing KNeighborsClassifier: Benchmark example

In [None]:
from sklearn.model_selection import train_test_split

def knn_test():
    Xt, yt = mglearn.datasets.make_forge()

    fig, axes = plt.subplots(1, 3)
    for n_neighbors, ax in zip([1, 3, 9], axes):
        # the fit method returns the object self, so we can instantiate
        # and fit in one line
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(Xt, yt)
        mglearn.plots.plot_2d_separator(clf, Xt, fill=True, eps=0.5, ax=ax, alpha=.4)
        mglearn.discrete_scatter(Xt[:, 0], Xt[:, 1], yt, ax=ax)
        ax.set_title("{} neighbor(s)".format(n_neighbors))
        ax.set_xlabel("feature 0")
        ax.set_ylabel("feature 1")
        axes[0].legend(loc=3)

In [None]:
knn_test()

In [None]:
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y_reg, random_state=0)
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train,y_train)

In [None]:
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))

In [None]:
# Abalone dataset

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 99
neighbors_settings = list(range(1, 100))
for n_neighbors in neighbors_settings:
    # build the model
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    # record training set score (R^2 for a regressor)
    training_accuracy.append(reg.score(X_train, y_train))
    # record generalization score (R^2 on the test set)
    test_accuracy.append(reg.score(X_test, y_test))

In [None]:
def plot_abalon_knn():
    plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
    plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
    plt.ylabel("Accuracy ($R^2$)")
    plt.xlabel("n_neighbors")
    plt.gca().invert_xaxis()
    plt.legend()

In [None]:
plot_abalon_knn()

In [None]:
def plot_abalon_knn_zoom():
    plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
    plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
    plt.ylabel("Accuracy ($R^2$)")
    plt.xlabel("n_neighbors")
    plt.ylim([0.4,0.65])
    plt.gca().invert_xaxis()
    plt.legend()

In [None]:
plot_abalon_knn_zoom()

## Analyzing KNeighborsRegressor: Benchmark example

In [None]:
def knn_regressor():
    X, y = mglearn.datasets.make_wave(n_samples=40)
    X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=0)

    fig, axes = plt.subplots(1, 3, figsize=(30, 10))
    line = np.linspace(-3, 3, 1000).reshape(-1, 1)
    for n_neighbors, ax in zip([1, 3, 20], axes):
        # make predictions using 1, 3, or 20 neighbors
        reg = KNeighborsRegressor(n_neighbors=n_neighbors)
        reg.fit(X_train2, y_train2)
        ax.plot(line, reg.predict(line))
        ax.plot(X_train2, y_train2, '^', c=mglearn.cm2(0), markersize=8)
        ax.plot(X_test2, y_test2, 'v', c=mglearn.cm2(1), markersize=8)

        ax.set_title("{} neighbor(s)\n train score: {:.2f}\n test score: {:.2f}".format(n_neighbors, 
                    reg.score(X_train2, y_train2),reg.score(X_test2, y_test2)))
        ax.set_xlabel("Feature")
        ax.set_ylabel("Target")

    axes[0].legend(["Model predictions", "Training data/target","Test data/target"], loc="best")


In [None]:
knn_regressor()

## Conclusions on the *KNN* algorithms

* Two important parameters: the number of neighbors and how you measure the distance between points. The default metric is the Minkowski distance with $p=2$, which is the Euclidean distance.

$$ d(\mathbf{x},\mathbf{y}) = \left[\sum_{i=1}^N |x_i - y_i|^p \right]^{\frac{1}{p}} $$

* The model is easy to understand, but its performance is poor on large datasets (either in number of features or in number of samples) and on sparse data.
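
The effect of $p$ in the Minkowski distance is easy to see on a small example. A minimal sketch in plain Python (`minkowski` is an illustrative helper): the differences are taken in absolute value, which matters for odd $p$; $p=1$ gives the Manhattan distance and $p=2$ the Euclidean one. In scikit-learn the same choice is made through the `p` parameter of `KNeighborsClassifier` and `KNeighborsRegressor`.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski(a, b, p=1)   # |3| + |4|
d_euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print(d_manhattan, d_euclidean)
```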





<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

## Exercise

Fit a K-NN model for the wine dataset for predicting:

  1. The level of alcohol, and
  2. The type of wine.
  
Test different values of neighbors.

# Linear Models

Linear models make a prediction using a linear function of the input features.


## Linear models for regression and linear models for classification.

$$\hat{y}(\mathbf{w},\mathbf{x}) = w_0 +  w_1 * x_1 + w_2 * x_2 + ... + w_p * x_p $$

## Linear Models for Regression

   * Ordinary Least squares
   $$ \underset{w}{min\,} {|| X w - y||_2}^2  $$
      
   * Ridge (L2 regularization)
   $$  \underset{w}{min\,} {{|| X w - y||_2}^2 + \alpha {||w||_2}^2} $$
   
   * Lasso (L1 regularization)
   $$ \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha ||w||_1} $$
   
   
   * Elastic Net (L2 and L1 regularization)
   $$ \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 + \frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2} $$

If $X$ is a matrix of size $(n, p)$, the ordinary least squares and ridge solutions have a cost of $O(n p^2)$, assuming that $n \geq p$.
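
The least squares and ridge objectives have closed-form solutions, which makes the shrinking effect of $\alpha$ easy to inspect. This is a minimal sketch on synthetic data, assuming no intercept term (`ridge_weights`, `X_demo` and `w_true` are illustrative names): setting $\alpha = 0$ recovers ordinary least squares, and a larger $\alpha$ pulls the weights toward zero. The L1 penalty of Lasso has no such closed form and is solved iteratively.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])               # assumed "true" weights
y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_weights(X_demo, y_demo, alpha=0.0)   # alpha = 0 is ordinary least squares
w_ridge = ridge_weights(X_demo, y_demo, alpha=10.0)
print("OLS  :", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
```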


<center>    
<img src="./images/lin-reg.png" alt="Drawing" style="width: 1400px;"/>
Figure 3: Regularization methods.        
</center>    

In [None]:
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y_reg, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

In [None]:
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

In [None]:
ridge01 = Ridge(alpha=0.01).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

In [None]:
def plot_ridge():
    plt.plot(ridge.coef_, 's', label="Ridge alpha=1")
    plt.plot(ridge10.coef_, '^', label="Ridge alpha=10")
    plt.plot(ridge01.coef_, 'v', label="Ridge alpha=0.01")
    plt.plot(lr.coef_, 'o', label="LinearRegression")
    # names[:-1] drops the target 'Rings', leaving one label per coefficient
    plt.xticks(range(len(names)-1), names[:-1], rotation=90)
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.hlines(0, 0, len(lr.coef_))
    plt.ylim(-25, 25)
    plt.legend()

In [None]:
plot_ridge()

In [None]:
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import learning_curve, KFold


def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20), cv=KFold(20, shuffle=True, random_state=1))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--',
                    label="training " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-',
             label="test " + estimator_name, c=line[0].get_color())
    plt.xlabel('Training set size')
    plt.ylabel('Score (R^2)')
    plt.ylim(0, 1.1)


def plot_ridge_n_samples(X,y,alpha=1):
    plot_learning_curve(Ridge(alpha=alpha), X, y)
    plot_learning_curve(LinearRegression(), X, y)
    plt.legend(loc=(0, 1.05), ncol=2, fontsize=18)

In [None]:
# Learning curves
plot_ridge_n_samples(X,y_reg,alpha=1)

In [None]:
X_b, y_b = mglearn.datasets.load_extended_boston()
plot_ridge_n_samples(X_b,y_b,alpha=1)

In [None]:
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X, y_reg, random_state=42)

lasso = Lasso(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

In [None]:
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso001.coef_ != 0)))

In [None]:
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

In [None]:
def plot_lasso_vs_ridge():
    plt.plot(lasso.coef_, 's', label="Lasso alpha=1")
    plt.plot(lasso001.coef_, '^', label="Lasso alpha=0.01")
    plt.plot(lasso00001.coef_, 'v', label="Lasso alpha=0.0001")
    plt.plot(ridge01.coef_, 'o', label="Ridge alpha=0.01")
    plt.xticks(range(len(names)-1), names[:-1], rotation=90)
    plt.hlines(0, 0, len(names)-1)
    plt.legend(ncol=2, loc=(0, 1.05))
    plt.ylim(-25, 25)
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")

In [None]:
plot_lasso_vs_ridge()

# Conclusions on  Linear Models for Regression
* Despite their simplicity, linear models for regression are widely used in industry.

<center>
<img src="./images/questions.jpg" style="width:1000px">
    ANY QUESTIONS?
</center>

## Linear Models for Classification

* Logistic Regression (with L1 or L2 regularization)
$$ \textrm{L1:}\quad \underset{w, c}{min\,} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) $$

$$ \textrm{L2:}\quad \underset{w, c}{min\,} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) $$

* Linear Support Vector Machines  (for $x_i \in \mathbb{R}^p, i=1,…, n,$ and $y \in \{1, -1\}^n$)
$$ \min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \textrm {subject to }\quad  y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,\\  \zeta_i \geq 0, i=1, ..., n $$
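
The two models differ mainly in how they penalize a point as a function of its margin $y_i (w^T x_i + b)$. A minimal sketch with NumPy (the function names are illustrative): logistic regression uses the smooth loss $\log(1 + e^{-\text{margin}})$, while at the optimum the SVM slack variable equals the hinge loss $\zeta_i = \max(0, 1 - \text{margin})$.

```python
import numpy as np

def logistic_loss(margin):
    """log(1 + exp(-margin)), with margin = y * (w.x + c) and y in {-1, +1}."""
    return np.logaddexp(0.0, -margin)

def hinge_loss(margin):
    """max(0, 1 - margin): the smallest slack zeta allowed by the SVM constraint."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([-2.0, 0.0, 1.0, 3.0])
print("margin  :", margins)
print("logistic:", np.round(logistic_loss(margins), 3))
print("hinge   :", hinge_loss(margins))
```

Points with a margin of at least 1 cost nothing under the hinge loss, whereas the logistic loss is never exactly zero.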



<center>    
<img src="./images/lr-svm.jpg" alt="Drawing" style="width: 1750px;"/>
Figure 4. Logistic Regression (*left*) and LSVM (*right*).
</center>  

In [None]:
# Reformulate the problem using the Abalone dataset, now binary Male or Female is the target.
#First remove the rows for Sex I (Infant) = 2.

df_bin = df[df.Sex !=2]

# Here we turn into numpy arrays
X_bin = df_bin.iloc[:,1:].values
y_bin = df_bin.iloc[:,0].values

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

In [None]:
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg100.score(X_test, y_test)))

In [None]:
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg001.score(X_test, y_test)))

In [None]:
new_names = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings','Years']

# Logistic Regression with L2 regularization

def plot_lg_l2():
    plt.plot(logreg.coef_.T, 'o', label="C=1")
    plt.plot(logreg100.coef_.T, '^', label="C=100")
    plt.plot(logreg001.coef_.T, 'v', label="C=0.01")
    plt.xticks(range(len(new_names)), new_names, rotation=90)
    plt.hlines(0, 0, len(new_names))
    plt.ylim(-5, 5)
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.legend()

In [None]:
plot_lg_l2()

In [None]:
# Logistic Regression with L1 regularization

def plot_lg_l1():
    for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
        # the L1 penalty needs a solver that supports it, e.g. liblinear
        lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
        #print("Training accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
            #C, lr_l1.score(X_train, y_train)))
        #print("Test accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
            #C, lr_l1.score(X_test, y_test)))
        plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))

    plt.xticks(range(len(new_names)), new_names, rotation=90)
    plt.hlines(0, 0, len(new_names))
    plt.xlabel("Coefficient index")
    plt.ylabel("Coefficient magnitude")
    plt.ylim(-5, 5)
    plt.legend(loc=3)

In [None]:
plot_lg_l1()

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Linear models for multiclass classification

* one-vs.-rest approach

>Caution: the linear one-vs.-rest classifiers read the target as scalar class labels [0, 1, 2, ...].

<center>    
<img src="./images/one-vs-rest.png" alt="Drawing" style="width: 1250px;"/>
**Figure 5**. Multiclass classification vs binary classification 
</center>  
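
The one-vs.-rest prediction rule can be written in a couple of lines. This is a minimal sketch with hypothetical weights (the arrays `W` and `b` are made up; a fitted multiclass `LinearSVC` stores the corresponding values in `coef_` and `intercept_`, one row per class): every binary classifier scores the point, and the class with the highest score wins.

```python
import numpy as np

# hypothetical weights of three binary "class k vs. rest" linear classifiers
W = np.array([[ 1.0,  0.0],     # class 0 vs. rest
              [-1.0,  1.0],     # class 1 vs. rest
              [ 0.0, -1.0]])    # class 2 vs. rest
b = np.array([0.0, 0.0, 0.0])

def predict_one_vs_rest(X, W, b):
    """Score every point with each binary classifier and pick the best class."""
    scores = X @ W.T + b          # shape (n_samples, n_classes)
    return scores.argmax(axis=1)

X_new = np.array([[2.0, 0.0], [-1.0, 2.0], [0.0, -3.0]])
ovr_pred = predict_one_vs_rest(X_new, W, b)
print(ovr_pred)
```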

In [None]:
# Here we include the M,F and I as 3 classes.
# Here we turn into numpy arrays
X_mul = df.iloc[:,1:].values
y_mul = df.iloc[:,0].values

In [None]:
features=['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings','Years']

n,m = 6,7
mglearn.discrete_scatter(X_mul[:, n], X_mul[:, m], y_mul)
plt.xlabel(features[n])
plt.ylabel(features[m])
plt.legend(["Male: 0", "Female: 1", "Infant: 2"])

In [None]:
linear_svm = LinearSVC().fit(X_mul[:,n:m+1], y_mul)
print("Coefficient shape: ", linear_svm.coef_.shape)
print("Intercept shape: ", linear_svm.intercept_.shape)

In [None]:
mglearn.discrete_scatter(X_mul[:, n], X_mul[:, m], y_mul)
line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,['b', 'r', 'g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.ylim(0, 30)
plt.xlim(0, 1)
plt.xlabel(features[n])
plt.ylabel(features[m])
plt.legend(['Male: 0', 'Female: 1', 'Infant: 2', 'Line class 0', 'Line class 1',
'Line class 2'], loc=(1.01, 0.3))

# Conclusions about Linear Models

* The main parameter of linear models is the regularization strength (`alpha` in the regression models, `C` in `LogisticRegression` and `LinearSVC`), together with the type of penalty (L1 or L2). If you assume that only a few of your features are actually important, you should use L1. Otherwise, you should use the default L2. L1 can also be useful if interpretability of the model is important.

* Linear models are very fast to train, and also fast to predict. They scale to very large datasets and work well with sparse data.

* Linear models often perform well when the number of features is large compared to the number of samples.

<center>
<img src="./images/questions.jpg" style="width:1000px">
    ANY QUESTIONS?
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

## Exercise

Fit three different linear models by testing different values of their respective parameters:

  1. One model for predicting the amount of alcohol,
  2. one for predicting the type of wine, and
  3. one for predicting the quality.

<center>
<img src="./images/break.png" style="width:1000px">
    15 min. Break
</center>

# Naive-Bayes Classifiers

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots x_n \mid y)} {P(x_1, \dots, x_n)}
 $$

Under the independence assumption,

$$ P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \\
\Downarrow \\
\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),
$$

## NB Family
*  **Gaussian Naive Bayes**:
$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right) $$

* **Multinomial Naive Bayes**:
   The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$,
$$ \hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n} $$
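   where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in the samples of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$. The smoothing prior $\alpha \geq 0$ prevents zero probabilities for features that do not appear in the training samples.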
   
* **Bernoulli Naive Bayes:** Multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.
$$P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)$$
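
The Gaussian variant is short enough to write out. The following is a minimal sketch with NumPy on synthetic data, not the scikit-learn implementation (`fit_gaussian_nb` and `predict_gaussian_nb` are illustrative names): one mean and one variance are estimated per class and per feature, and the prediction is the class that maximizes $\log P(y) + \sum_i \log P(x_i \mid y)$. Working with logarithms avoids numerical underflow of the product.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # avoid division by zero
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """argmax over y of log P(y) + sum_i log P(x_i | y)."""
    # broadcast to shape (n_samples, n_classes, n_features)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff2 / variances[None])
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[log_post.argmax(axis=1)]

# two synthetic, well separated classes (an assumption for illustration)
rng = np.random.default_rng(0)
X_nb = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(4.0, 1.0, size=(50, 2))])
y_nb = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X_nb, y_nb)
y_hat = predict_gaussian_nb(X_nb, *params)
print("Training accuracy: {:.2f}".format(np.mean(y_hat == y_nb)))
```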


<center>
<img src="./images/NB_scheme.png" style="width:1800px">
We estimate $P(x_{\alpha} \mid y)$ independently in each dimension (middle two images) and then obtain an estimate of the full data distribution by assuming conditional independence, $P(x \mid y)=\prod_{\alpha} P(x_{\alpha} \mid y)$ (rightmost image).
</center>

* Multinomial NB assumes that the data follow a multinomial distribution, e.g. the distribution of word counts in texts.
* Bernoulli NB assumes binary-valued features, e.g. text encoded as word presence or absence with the bag-of-words method.
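
The smoothed estimate $\hat{\theta}_{yi} = (N_{yi} + \alpha)/(N_y + \alpha n)$ of Multinomial NB can be computed by hand. A minimal sketch on made-up word counts (the array `counts` and the helper `smoothed_theta` are illustrative), using $\alpha = 1$, known as Laplace smoothing: a word that never appears in a class still receives a small non-zero probability.

```python
import numpy as np

# made-up word counts: rows are documents, columns are 4 vocabulary words
counts = np.array([[3, 0, 1, 0],
                   [2, 1, 0, 0],
                   [0, 0, 2, 3],
                   [0, 1, 1, 4]])
doc_class = np.array([0, 0, 1, 1])
alpha = 1.0                                    # Laplace smoothing

def smoothed_theta(counts, doc_class, y, alpha):
    """theta_yi = (N_yi + alpha) / (N_y + alpha * n) for class y."""
    N_yi = counts[doc_class == y].sum(axis=0)  # count of each feature in class y
    N_y = N_yi.sum()                           # total count over all features
    n = counts.shape[1]                        # number of features
    return (N_yi + alpha) / (N_y + alpha * n)

theta_0 = smoothed_theta(counts, doc_class, 0, alpha)
print(theta_0)
```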

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
y_pred = gnb.fit(X_mul,y_mul).predict(X_mul)

print("Number of mislabeled points out of a total {} points : {}".format(X_mul.shape[0],(y_mul != y_pred).sum()))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_mul, y_mul, random_state=42)
gnb_fit = gnb.fit(X_train, y_train)
print("Training set score: {:.3f}".format(gnb_fit.score(X_train, y_train)))
print("Test set score: {:.3f}".format(gnb_fit.score(X_test, y_test)))

# Conclusions about NB Classifiers

* GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text. MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large number of nonzero features (i.e., large documents).

* The naive Bayes models share many of the strengths and weaknesses of the linear models. They are very fast to train and to predict, and the training procedure is easy to understand.

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Decision Trees
Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision.

<center>    
<img src="./images/02-DT.png" alt="Drawing" style="width: 500px;"/>
**Figure 6**. A decision tree to distinguish among several animals
</center>    
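
How a tree picks each if/else question can be sketched in plain Python. This is a simplified illustration, assuming a single feature, made-up data, and the Gini impurity as the quality measure (Gini is the default `criterion` of scikit-learn's `DecisionTreeClassifier`; the helpers `gini` and `best_threshold` are hypothetical): every candidate threshold is tried, and the one that leaves the purest two groups is kept.

```python
def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Threshold on one feature that minimizes the weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:        # candidate questions "value <= t?"
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# toy feature and labels (made up for illustration)
shell_weight = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
is_adult = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(shell_weight, is_adult)
print(threshold, impurity)
```

A full tree repeats this search over all features and then recursively inside each of the two resulting groups.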

In [None]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X_bin,y_bin, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))


In [None]:
from sklearn.tree import export_graphviz

export_graphviz(tree, out_file="tree.dot", class_names=["Male", "Female"],
feature_names=new_names, impurity=False, filled=True)

In [None]:
# Note: you should install graphviz in your system. 
#! pip install graphviz
import graphviz
with open('tree.dot') as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

In [None]:
print("Feature importances:\n{}".format(tree.feature_importances_))

In [None]:
def plot_feature_importances(model):
    n_features = X_bin.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), new_names)
    plt.title('Feature importance for the Abalone dataset')
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

In [None]:
plot_feature_importances(tree)

In [None]:
# Decision trees for regression
ram_prices = pd.read_csv("data/ram_price.csv")
plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte")

In [None]:
from sklearn.tree import DecisionTreeRegressor
# use historical data to forecast prices after the year 2000
data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]
# predict prices based on date
X_train = data_train.date.to_numpy()[:, np.newaxis]
# we use a log-transform to get a simpler relationship of data to target
y_train = np.log(data_train.price)
tree = DecisionTreeRegressor().fit(X_train, y_train)
linear_reg = LinearRegression().fit(X_train, y_train)
# predict on all data
X_all = ram_prices.date.to_numpy()[:, np.newaxis]
pred_tree = tree.predict(X_all)
pred_lr = linear_reg.predict(X_all)
# undo log-transform
price_tree = np.exp(pred_tree)
price_lr = np.exp(pred_lr)

In [None]:
plt.semilogy(data_train.date, data_train.price, label="Training data")
plt.semilogy(data_test.date, data_test.price, label="Test data")
plt.semilogy(ram_prices.date, price_tree, label="Tree prediction")
plt.semilogy(ram_prices.date, price_lr, label="Linear prediction")
plt.legend()

# Conclusions about Decision Trees Classifier

* One of the main drawbacks of decision trees is the tendency to overfit and provide poor generalization performance.

* Usually, picking one of the pre-pruning strategies by setting either *max_depth* , *max_leaf_nodes* , or   *min_samples_leaf* is sufficient to prevent overfitting.

* The resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data.

* Decision trees do not have the ability to generate *new* responses, outside of what was seen in the training data. This shortcoming applies to all models based on trees.
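
Two of the points above can be checked on synthetic data. This is a minimal sketch, assuming data from `make_classification` and a made-up linear target rather than the Abalone data: limiting `max_depth` keeps the tree small instead of letting it memorize the training set, and a regression tree asked about a point far outside the training range simply returns a value it has already seen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# synthetic classification data (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
for name, model in [("unpruned", full_tree), ("max_depth=3", pruned_tree)]:
    print("{:>12}: depth={}, train={:.3f}, test={:.3f}".format(
        name, model.get_depth(), model.score(X_tr, y_tr), model.score(X_te, y_te)))

# a regression tree cannot predict beyond the range of targets seen in training
x_line = np.arange(0, 10, 0.5).reshape(-1, 1)
y_line = 2.0 * x_line.ravel()
reg_tree = DecisionTreeRegressor(random_state=0).fit(x_line, y_line)
far_prediction = reg_tree.predict([[100.0]])[0]
print("Prediction at x=100: {:.1f} (largest training target: {:.1f})".format(
    far_prediction, y_line.max()))
```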


<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Ensembles of Decision Trees

Ensembles are methods that combine multiple machine learning models to create more powerful models.

* Random forests

* Gradient boosted regression trees (gradient boosting machines)
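
The core idea behind a random forest can be sketched by hand. This is a simplified illustration on synthetic data from `make_moons` (an assumption, not the lecture data): each tree is trained on a bootstrap sample of the training set and the ensemble predicts by majority vote. A real random forest additionally considers only a random subset of the features at each split.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # bootstrap sample: draw n training points with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx]))

# majority vote over the 25 trees (labels are 0/1, so the mean vote is thresholded)
votes = np.array([t.predict(X_te) for t in trees])
y_vote = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_acc = np.mean(y_vote == y_te)
single_acc = trees[0].score(X_te, y_te)
print("single tree: {:.3f}, ensemble of 25 trees: {:.3f}".format(single_acc, ensemble_acc))
```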

In [None]:
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X_mul,y_mul, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))


In [None]:
# Feature Importance for Random Forest
plot_feature_importances(forest)

## Conclusions about Random Forest

* Random forests for regression and classification are currently among the most widely used machine learning methods.

* They are easy to parallelize, because each tree is built independently of the others.

* Random forests don’t tend to perform well on very high dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate.
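In scikit-learn, the parallelization point maps to the `n_jobs` parameter of the forest estimators. A minimal sketch on synthetic data follows; the dataset and the variable names are assumptions for illustration, and no timings are shown because they depend on the machine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (made up for this illustration)
X_par, y_par = make_classification(n_samples=1000, n_features=20, random_state=0)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_par, y_par, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available CPU cores to build the trees
parallel_forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
parallel_forest.fit(Xp_train, yp_train)

n_trees = len(parallel_forest.estimators_)
test_accuracy = parallel_forest.score(Xp_test, yp_test)
print("Number of trees:", n_trees)
print("Accuracy on test set: {:.3f}".format(test_accuracy))
```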

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Gradient boosted trees (gradient boosting machines)

* Gradient boosted trees are another ensemble method that combines multiple decision trees to create a more powerful model.

* They can be used for both regression and classification.

* Gradient boosting works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one.

* It combines many simple models (weak learners), such as shallow trees.
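The serial "correct the previous mistakes" idea can be sketched by hand for regression with squared error: each shallow tree is fitted to the current residuals and added with a small learning rate. This is only an illustration on made-up data; `GradientBoostingRegressor` in scikit-learn is more general, and all names below are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (made up for this illustration)
rng = np.random.RandomState(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb.ravel()) + 0.1 * rng.normal(size=200)

learning_rate = 0.1  # shrinks the contribution of each tree
n_stages = 50

# Stage 0: predict the mean of the targets everywhere
prediction = np.full(y_gb.shape, y_gb.mean())
mse_start = np.mean((y_gb - prediction) ** 2)

stumps = []
for _ in range(n_stages):
    residual = y_gb - prediction                 # mistakes of the current model
    stump = DecisionTreeRegressor(max_depth=1)   # a very simple (weak) model
    stump.fit(X_gb, residual)                    # learn to predict the mistakes
    prediction += learning_rate * stump.predict(X_gb)
    stumps.append(stump)

mse_end = np.mean((y_gb - prediction) ** 2)
print("Training MSE before boosting: {:.4f}".format(mse_start))
print("Training MSE after boosting:  {:.4f}".format(mse_end))
```

A lower learning rate makes each correction smaller, which is why it usually has to be balanced against the number of trees.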

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(X_mul,y_mul, random_state=42)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
plot_feature_importances(gbrt)


<center>    
<img src="./images/mistakes.jpg" alt="Drawing" style="width: 500px;"/>
How to be the Best of the Best (from Kaggle's point of view)
</center>  


 
* AdaBoost (Adaptive Boosting):

>AdaBoost learns from its mistakes by increasing the weight of misclassified data points.

* Gradient boosting:

>Gradient boosting learns from the mistakes (the residual errors) directly, rather than updating the weights of the data points.

Widely used gradient boosting libraries:

* XGBoost
* CatBoost
* LightGBM
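The difference between reweighting and residual fitting is easier to see with one round of the classic (discrete) AdaBoost weight update written out. This is a sketch under stated assumptions: the ten data points are made up, and scikit-learn's `AdaBoostClassifier` differs in implementation details.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny 1-D dataset (made up) that a single split cannot classify perfectly
X_ada = np.arange(10, dtype=float).reshape(-1, 1)
y_ada = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Start with uniform sample weights
weights = np.full(len(y_ada), 1.0 / len(y_ada))

# Weak learner: a decision stump trained with the current weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_ada, y_ada, sample_weight=weights)
miss = stump.predict(X_ada) != y_ada

# Weighted error and the "say" (alpha) of this weak learner
err = weights[miss].sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Increase the weight of misclassified points, decrease the rest, renormalize
new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
new_weights /= new_weights.sum()

print("Misclassified points:", X_ada[miss].ravel())
print("Weight of a misclassified point:", new_weights[miss].max())
print("Weight of a correctly classified point:", new_weights[~miss].max())
```

The next stump would be trained with `new_weights`, so it is pushed to get the previously misclassified points right.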

## Conclusions about Gradient boosted regression trees

* They are among the most powerful and widely used models for supervised learning. 

* Their main drawback is that they require careful tuning of the parameters and may take a long time to train.

* They are often among the winning methods in competitions such as those hosted on Kaggle.


<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

# Kernelized Support Vector Machines

Kernelized support vector machines (SVMs) are an extension of Linear Support Vector Machines that allows for more complex models that are not defined simply by hyperplanes in the input space.

In [None]:
from sklearn.datasets import make_blobs
plt.rcParams['lines.markersize'] = 25.0


X, y = make_blobs(centers=4, random_state=8)
y = y % 2
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

In [None]:
from sklearn.svm import LinearSVC

def plot_lin_svm():
    linear_svm = LinearSVC().fit(X, y)
    mglearn.plots.plot_2d_separator(linear_svm, X)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")

In [None]:
plot_lin_svm()

In [None]:
from mpl_toolkits.mplot3d import Axes3D, axes3d

X_new = np.hstack([X, X[:, 1:] ** 2])
mask = y == 0

def plot_3ln_svm():
    figure = plt.figure()
    # visualize in 3D; add_subplot registers the axes with the figure
    # (Axes3D(figure, ...) no longer does this in recent matplotlib versions)
    ax = figure.add_subplot(projection='3d')
    ax.view_init(elev=-152, azim=-26)
    # plot first all the points with y == 0, then all with y == 1
    ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
    ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r',
               marker='^', s=60)
    ax.set_xlabel("feature0")
    ax.set_ylabel("feature1")
    ax.set_zlabel("feature1 ** 2")

In [None]:

plot_3ln_svm()

In [None]:
linear_svm_3d = LinearSVC().fit(X_new, y)
coef, intercept = linear_svm_3d.coef_.ravel(), linear_svm_3d.intercept_

X_new = np.hstack([X, X[:, 1:] ** 2])
xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)
XX, YY = np.meshgrid(xx, yy)
ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]


def plot_3plane_svm():
    # show linear decision boundary
    figure = plt.figure()
    # add_subplot registers the 3D axes with the figure
    ax = figure.add_subplot(projection='3d')
    ax.view_init(elev=-152, azim=-26)
    ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)
    ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
    ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r',
               marker='^', s=60)
    ax.set_xlabel("feature0")
    ax.set_ylabel("feature1")
    # the third axis is the square of feature 1 (see the definition of X_new)
    ax.set_zlabel("feature1 ** 2")

In [None]:
plot_3plane_svm()

In [None]:
def plot_3proj_svm():

    ZZ = YY ** 2
    dec = linear_svm_3d.decision_function(np.c_[XX.ravel(), YY.ravel(), ZZ.ravel()])
    plt.contourf(XX, YY, dec.reshape(XX.shape), levels=[dec.min(), 0, dec.max()],
    cmap=mglearn.cm2, alpha=0.5)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")

In [None]:
plot_3proj_svm()

In [None]:
# radial basis function (RBF) kernel, also known as the Gaussian kernel.

from sklearn.svm import SVC

def plot_svc_rbf():
    X, y = mglearn.tools.make_handcrafted_dataset()
    svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
    mglearn.plots.plot_2d_separator(svm, X, eps=.5)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
    # plot support vectors
    sv = svm.support_vectors_
    # class labels of support vectors are given by the sign of the dual coefficients
    sv_labels = svm.dual_coef_.ravel() > 0
    mglearn.discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")

In [None]:
plot_svc_rbf()
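For reference, the RBF (Gaussian) kernel used above is $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, so `gamma` controls how quickly the similarity between two points decays with their distance. The sketch below computes it by hand and compares it with scikit-learn's `rbf_kernel`; the three points and the helper name `rbf_by_hand` are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_by_hand(A, B, gamma):
    """Hypothetical helper: RBF kernel matrix between the rows of A and B."""
    # squared Euclidean distance between every row of A and every row of B
    sq_dist = ((A[:, np.newaxis, :] - B[np.newaxis, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

# Three made-up points in 2-D
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])

K_hand = rbf_by_hand(points, points, gamma=0.1)
K_sklearn = rbf_kernel(points, points, gamma=0.1)

print(np.round(K_hand, 3))
print("Matches scikit-learn:", np.allclose(K_hand, K_sklearn))
```

Each point has similarity 1 with itself, and distant points get values close to 0; a larger `gamma` makes the kernel more local.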

In [None]:
def plot_mul_svc_rbf():
    fig, axes = plt.subplots(3, 3)
    plt.rcParams.update({'font.size': 15})
    for ax, C in zip(axes, [-1, 0, 3]):
        for a, gamma in zip(ax, range(-1, 2)):
            mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
    axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"],
    ncol=4, loc=(.25, 1.2),fontsize='25')

In [None]:
plot_mul_svc_rbf()

In [None]:
plt.rcParams.update({'font.size': 25})

X_train, X_test, y_train, y_test = train_test_split(X_mul,y_mul, random_state=42)

svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))


In [None]:
features=['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings','Years']

def plot_feats_svc_rbf():

    plt.plot(X_train.min(axis=0), 'o', label="min")
    plt.xticks(range(len(features)), features, rotation=90)
    plt.plot(X_train.max(axis=0), '^', label="max")
    plt.legend(loc=2)
    plt.xlabel("Feature index")
    plt.ylabel("Feature magnitude")

# For SVMs, preprocessing is very important.

In [None]:
plot_feats_svc_rbf()
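A common way to handle features with very different magnitudes is to rescale them before fitting the SVM. The following is a minimal sketch, assuming the breast cancer dataset bundled with scikit-learn as a stand-in (it is not the dataset used in this notebook) and a `MinMaxScaler` inside a pipeline, so the scaling is learned from the training data only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in dataset (an assumption for this illustration)
cancer = load_breast_cancer()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# SVM on the raw features
raw_svc = SVC().fit(Xc_train, yc_train)

# Same SVM, but every feature is first rescaled to the range [0, 1]
scaled_svc = make_pipeline(MinMaxScaler(), SVC()).fit(Xc_train, yc_train)

raw_acc = raw_svc.score(Xc_test, yc_test)
scaled_acc = scaled_svc.score(Xc_test, yc_test)
print("Test accuracy without scaling: {:.3f}".format(raw_acc))
print("Test accuracy with MinMaxScaler: {:.3f}".format(scaled_acc))
```

Putting the scaler in the pipeline means the test set is transformed with the minimum and maximum of the training set, which avoids leaking test information into the model.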

## Conclusions about Kernelized Support Vector Machines

* SVMs allow for complex decision boundaries, even if the data has only a few features.

* They work well on both low- and high-dimensional data.

* They scale quite poorly with the number of samples (more than 100k can be a headache).

* SVMs require very careful tuning of parameters and preprocessing of data.

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


def foooo(h=.02):  # h: step size in the mesh

    names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
             "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
             "Naive Bayes", "QDA"]

    classifiers = [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025),
        SVC(gamma=2, C=1),
        GaussianProcessClassifier(1.0 * RBF(1.0)),
        DecisionTreeClassifier(max_depth=5),
        RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
        MLPClassifier(alpha=1),
        AdaBoostClassifier(),
        GaussianNB(),
        QuadraticDiscriminantAnalysis()]

    X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                               random_state=1, n_clusters_per_class=1)
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)

    datasets = [make_moons(noise=0.3, random_state=0),
                make_circles(noise=0.2, factor=0.5, random_state=1),
                linearly_separable
                ]

    figure = plt.figure(figsize=(30, 12))
    i = 1
    # iterate over datasets
    for ds_cnt, ds in enumerate(datasets):
        # preprocess dataset, split into training and test part
        X, y = ds
        X = StandardScaler().fit_transform(X)
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=.4, random_state=42)

        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # just plot the dataset first
        cm = plt.cm.RdBu
        cm_bright = ListedColormap(['#FF0000', '#0000FF'])
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        if ds_cnt == 0:
            ax.set_title("Input data")
        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # Plot the testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
                   edgecolors='k')
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1

        # iterate over classifiers
        for name, clf in zip(names, classifiers):
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)

            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
            if hasattr(clf, "decision_function"):
                Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            else:
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

            # Plot the training points
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                       edgecolors='k')
            # Plot the testing points
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       edgecolors='k', alpha=0.6)

            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if ds_cnt == 0:
                ax.set_title(name)
            ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                    size=15, horizontalalignment='right')
            i += 1

    plt.tight_layout()
    plt.show()


In [None]:
plt.rcParams['lines.markersize'] = 12.0
plt.rcParams.update({'font.size': 15})

In [None]:
foooo()

# General Conclusion I

* **Nearest neighbors**: For small datasets, good as a baseline, easy to explain.
* **Linear models**: Go-to as a first algorithm to try, good for very large datasets, good for very high-dimensional data.
* **Naive Bayes**: Only for classification. Even faster than linear models, good for very large datasets and high-dimensional data. Often less accurate than linear models.

# General Conclusion II

* **Decision trees**: Very fast, don't need scaling of the data, can be visualized and easily explained. They cannot predict values outside the range seen in their training set.
* **Random forests**: Nearly always perform better than a single decision tree, very robust and powerful. Don’t need scaling of data. Not good for very high-dimensional sparse data.
* **Gradient boosted decision trees**: Often slightly more accurate than random forests. Slower to train but faster to predict than random forests, and smaller in memory. Need more parameter tuning than random forests.
* **Support vector machines**: Powerful for medium-sized datasets of features with similar meaning. Require scaling of data, sensitive to parameters.

# General Conclusion III

* Copying and pasting an algorithm for a given problem is only step zero, and even before that, check your data. The most important part is to practice and explore different methods as much as you can on different interesting problems.

* Simplicity is a virtue by itself. **Occam's razor**.

* The best way to stand out from the crowd is to try to understand the concepts (and the math) behind the algorithms.

# Extras: Model selection in competitive data science vs real world

* *Problem Definition*: The real world sucks! Use as much domain knowledge as possible.
* *Metrics*: In the real world they can be very problem dependent. Remember Goodhart's law.
* *Interpretability*: The people who pay for your time need to know what you do.
* *Data Quality*: This is the line that divides the real world from the Kaggle world.
* *Scalability*: Because benchmarking is nice, but real life is tough.

# Bibliography:
<center>
<img src="./images/biblio.png" alt="Drawing" style="width: 1000px;"/>
</center>

* http://shop.oreilly.com/product/0636920030515.do
* http://www-bcf.usc.edu/~gareth/ISL/
* https://web.stanford.edu/~hastie/ElemStatLearn/

# Other sources:
* https://coursera.org/learn/machine-learning
* https://www.kdnuggets.com/

## Recommended reading:
*A Few Useful Things to Know about Machine Learning* by Pedro Domingos (Communications of the ACM, Vol. 55 No. 10, Pages 78-87, 2012.):
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

<center>
<img src="./images/00_thats_all.jpg" style="width:1000px">
</center>