# Visualizing Overfitting

There is always a risk of "overfitting" the data when training machine learning algorithms. But what does overfitting actually mean, and how can we avoid it? In this project we will use some simple interactive widgets to give the reader a better understanding of:
- What does overfitting look like?
- When is it considered overfitting?
- How can we prevent overfitting?

Have a play with the widgets below and pay attention to how the background and bar graphs change!

## K Nearest Neighbours

In [13]:
# nbi:hide_in
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
import matplotlib.pyplot as plt
from matplotlib import gridspec
from ipywidgets import interact
import ipywidgets as widgets
import numpy as np
plt.style.use('ggplot')

In [21]:
# nbi:hide_in
@interact(
    test_data=widgets.Checkbox(value=False, description="Show test data only", disabled=False),
    neighbours=widgets.IntSlider(min=1, max=35, step=1),
)
def plot_decision_boundaries(neighbours, test_data):
    MESH_STEP_SIZE = 0.01
    iris = load_iris()
    X, y = iris.data[:, :2], iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = KNeighborsClassifier(n_neighbors=neighbours)
    clf.fit(X_train, y_train)
    min_x, max_x = X_train[:, 0].min() - 1.0, X_train[:, 0].max() + 1.0
    min_y, max_y = X_train[:, 1].min() - 1.0, X_train[:, 1].max() + 1.0
    # Evaluate the classifier on a dense grid to colour the decision regions.
    x_vals, y_vals = np.meshgrid(
        np.arange(min_x, max_x, MESH_STEP_SIZE),
        np.arange(min_y, max_y, MESH_STEP_SIZE),
    )
    output = clf.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    output = output.reshape(x_vals.shape)
    fig = plt.figure(figsize=(10, 12)) 
    gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1]) 
    ax0 = plt.subplot(gs[0])
    ax0.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.Pastel1)
    # Overlay either the held-out test points or the training points.
    X_plot, y_plot = (X_test, y_test) if test_data else (X_train, y_train)
    ax0.scatter(
        X_plot[:, 0], X_plot[:, 1], c=y_plot, s=20,
        edgecolors="black", linewidth=1, cmap=plt.cm.Set1,
    )
    ax0.set_xlim(x_vals.min(), x_vals.max())
    ax0.set_ylim(y_vals.min(), y_vals.max())
    ax1 = plt.subplot(gs[1])
    # Compare accuracy on the training and test splits.
    scores = [clf.score(X_train, y_train), clf.score(X_test, y_test)]
    names = ("Train", "Test")
    y_pos = np.arange(len(names))
    ax1.barh(y_pos, scores, align="center", alpha=0.5)
    ax1.set_yticks(y_pos)
    ax1.set_yticklabels(names)
    ax1.invert_yaxis()
    ax1.set_xlabel('Accuracy')
    # Annotate each bar with its accuracy as a percentage.
    for bar in ax1.patches:
        ax1.text(bar.get_width() - .1, bar.get_y() + .43,
                 str(round(bar.get_width() * 100, 2)) + '%',
                 fontsize=10, color='black')
    plt.tight_layout()


The slider above controls a hyperparameter specific to the nearest-neighbour algorithm: the number of neighbouring points the algorithm looks at to decide which category a particular point belongs to. As you may have guessed, a model with `neighbours=1` simply gives each point the category of its single closest training point. The background colours show the decision boundary learned by the algorithm, and each setting of neighbours leads to a very specific set of boundaries. As we increase the number of neighbours, the boundaries become smoother and the model becomes less prone to overfitting, but past a certain point it starts to miss some of the inherent patterns that separate the categories (underfitting). Can you tell which setting of neighbours gives the best overall training and testing accuracy?
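The same question can also be answered numerically rather than by eye. The sketch below is a minimal example, assuming the same two iris features and the same 80/20 split as the widget; the helper name `knn_accuracy_sweep` is hypothetical and not part of the original notebook. It fits one model per slider value and prints the train/test accuracy and the gap between them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as the widget: first two iris features, 80/20 split.
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def knn_accuracy_sweep(k_values):
    """Return {k: (train_accuracy, test_accuracy)} for each k (hypothetical helper)."""
    results = {}
    for k in k_values:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        results[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    return results


# Sweep the same range as the slider (1 to 35 neighbours).
sweep = knn_accuracy_sweep(range(1, 36))
for k, (train_acc, test_acc) in sweep.items():
    gap = train_acc - test_acc
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:+.3f}")

# The k with the highest test accuracy on this particular split.
best_k = max(sweep, key=lambda k: sweep[k][1])
print("best k by test accuracy:", best_k)
```

A train accuracy well above the test accuracy is the numerical counterpart of the jagged boundaries in the plot. Picking `best_k` from a single split is only a rough guide; cross-validation would be the more careful choice.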

## Support Vector Machines

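Support vector machines, or SVMs, are another popular classification algorithm. Note that for multi-class classification, scikit-learn's `SVC` trains one-vs-one (OvO) classifiers internally; its default `decision_function_shape="ovr"` only reshapes the decision scores into a one-vs-rest (OvR) layout.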
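The one-vs-rest (OvR) and one-vs-one (OvO) layouts of `SVC`'s decision scores are easiest to tell apart with more than three classes, because with the three iris classes both give three columns. The sketch below is a minimal example, assuming a synthetic four-class dataset from `make_blobs` (not part of the original notebook). Per the scikit-learn documentation, `decision_function_shape` only changes the shape of the returned scores, while training is always one-vs-one internally.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data: 4 classes, so the two layouts have different widths.
X_toy, y_toy = make_blobs(n_samples=200, centers=4, random_state=42)

ovr_clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X_toy, y_toy)
ovo_clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X_toy, y_toy)

# OvR layout: one column per class. OvO layout: one column per pair of classes.
ovr_shape = ovr_clf.decision_function(X_toy).shape
ovo_shape = ovo_clf.decision_function(X_toy).shape
print("ovr scores:", ovr_shape, " ovo scores:", ovo_shape)

# With default settings the predicted labels do not depend on the layout.
same_predictions = bool((ovr_clf.predict(X_toy) == ovo_clf.predict(X_toy)).all())
print("identical predictions:", same_predictions)
```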

In [28]:
# nbi:hide_in
@interact(
    test_data=widgets.Checkbox(value=False, description="Show test data only", disabled=False),
    kernel=widgets.ToggleButtons(
        options=["linear", "rbf", "poly"],
        description="Kernel selection",
        disabled=False,
    ),
    reg=widgets.FloatSlider(min=0.1, max=10, step=0.1),
)
def plot_decision_boundaries(reg, kernel, test_data):
    MESH_STEP_SIZE = 0.01
    iris = load_iris()
    X, y = iris.data[:, :2], iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = svm.SVC(C=reg, kernel=kernel, gamma="scale", random_state=42)
    
    clf.fit(X_train, y_train)
    min_x, max_x = X_train[:, 0].min() - 1.0, X_train[:, 0].max() + 1.0
    min_y, max_y = X_train[:, 1].min() - 1.0, X_train[:, 1].max() + 1.0
    # Evaluate the classifier on a dense grid to colour the decision regions.
    x_vals, y_vals = np.meshgrid(
        np.arange(min_x, max_x, MESH_STEP_SIZE),
        np.arange(min_y, max_y, MESH_STEP_SIZE),
    )
    output = clf.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    output = output.reshape(x_vals.shape)
    fig = plt.figure(figsize=(10, 12)) 
    gs = gridspec.GridSpec(2, 1, height_ratios=[3, 1]) 
    ax0 = plt.subplot(gs[0])
    ax0.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.Pastel1)
    # Overlay either the held-out test points or the training points.
    X_plot, y_plot = (X_test, y_test) if test_data else (X_train, y_train)
    ax0.scatter(
        X_plot[:, 0], X_plot[:, 1], c=y_plot, s=20,
        edgecolors="black", linewidth=1, cmap=plt.cm.Set1,
    )
    ax0.set_xlim(x_vals.min(), x_vals.max())
    ax0.set_ylim(y_vals.min(), y_vals.max())
    ax1 = plt.subplot(gs[1])
    # Compare accuracy on the training and test splits.
    scores = [clf.score(X_train, y_train), clf.score(X_test, y_test)]
    names = ("Train", "Test")
    y_pos = np.arange(len(names))
    ax1.barh(y_pos, scores, align="center", alpha=0.5)
    ax1.set_yticks(y_pos)
    ax1.set_yticklabels(names)
    ax1.invert_yaxis()
    ax1.set_xlabel('Accuracy')
    # Annotate each bar with its accuracy as a percentage.
    for bar in ax1.patches:
        ax1.text(bar.get_width() - .1, bar.get_y() + .43,
                 str(round(bar.get_width() * 100, 2)) + '%',
                 fontsize=10, color='black')
    plt.tight_layout()


The `reg` slider corresponds to `C` in the algorithm, and it tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of `C`, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of classifying all the training points correctly. Conversely, a very small value of `C` will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of `C`, you should expect misclassified examples, often even if your training data is linearly separable. If you want a deeper dive into the mechanics of SVMs, there are some pretty awesome links in the [resources](https://whobrokemycode.netlify.com/resources/) page.
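One rough way to watch this trade-off without the widget is to refit the model for a few values of `C` and compare the accuracies alongside the number of support vectors, which scikit-learn exposes as `n_support_`. The sketch below is a minimal example, assuming the same two iris features and split as the cells above; the helper name `svm_c_sweep` and the three `C` values are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and split as the widget: first two iris features, 80/20 split.
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def svm_c_sweep(c_values, kernel="rbf"):
    """Return (C, train_acc, test_acc, n_support_vectors) rows (hypothetical helper)."""
    rows = []
    for c in c_values:
        clf = SVC(C=c, kernel=kernel, gamma="scale").fit(X_train, y_train)
        rows.append((
            c,
            clf.score(X_train, y_train),
            clf.score(X_test, y_test),
            int(clf.n_support_.sum()),  # support vectors summed over all classes
        ))
    return rows


# Illustrative values spanning the slider's range (0.1 to 10).
c_rows = svm_c_sweep([0.1, 1.0, 10.0])
for c, train_acc, test_acc, n_sv in c_rows:
    print(f"C={c:5.1f}  train={train_acc:.3f}  test={test_acc:.3f}  support vectors={n_sv}")
```

Passing `kernel="linear"` or `kernel="poly"` to the helper mirrors the kernel toggle in the widget.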