### Training vs. testing  
Splitting the data into a training and a test dataset is difficult, and so is choosing the right model.  
If the training share is too small or the model is too complex: overfitting  
If the model is too simple: underfitting  
If the test share is too small, the test score becomes an unreliable estimate  
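
As a minimal sketch of such a split, the snippet below uses `train_test_split` on synthetic stand-in data. The arrays `x_demo` and `y_demo` and the 25% test share are assumptions for illustration only, not values taken from this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 samples, 2 features, binary label
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 2))
y_demo = (x_demo[:, 0] + x_demo[:, 1] > 0).astype(int)

# Hold out 25% for testing; stratify keeps the class ratio similar in both parts
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

print(x_train.shape, x_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same data.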

#### Underfitting / Bias
- Model is too simple / generic
- Trends are not reflected in model
  
<img src="../doc/decision_tree_flat.png" alt="Decision Tree" width="300"/>

#### Overfitting / Variance
- Model is too detailed
- Outliers in the training data are learned and distort the classification of new data
  
<img src="../doc/decision_tree_deep.png" alt="Decision Tree" width="300"/>
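
The contrast between a flat and a deep tree can be sketched by comparing train and test accuracy. The data here is synthetic and noisy (a hypothetical stand-in for the real dataset), and the two depths are chosen only to illustrate the extremes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): binary labels with added noise
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 2))
noise = rng.normal(scale=1.0, size=400)
y_demo = (x_demo[:, 0] + x_demo[:, 1] + noise > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)

scores = {}
for depth in (1, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    # Store (train accuracy, test accuracy) for each depth
    scores[depth] = (tree.score(x_train, y_train), tree.score(x_test, y_test))
    print(depth, scores[depth])
```

The unrestricted tree memorizes the training set, so its train accuracy is perfect. A large gap between its train and test accuracy is the typical sign of overfitting.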

### Accuracy vs. complexity
Problem  
- Left side of the plot (low complexity) is underfitting  
- Right side of the plot (high complexity) is overfitting  

#### Validation curve
Solution
- Tune the model complexity so that it lands in the 'middle', where the test score is highest
- This plot is called a validation curve  
<img src="../doc/29_accuracy_vs_complexity.png" alt="Accuracy vs. Complexity" width="300"/>
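
One way to find the 'middle' numerically, sketched here under assumptions, is to take the parameter value with the highest mean cross-validated score. The data, the candidate values in `k_values` and the choice of `cv=5` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical): noisy binary labels
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] + x_demo[:, 1]
          + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Candidate values for n_neighbors (small k = complex, large k = simple)
k_values = np.array([1, 3, 5, 10, 20, 40])
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), x_demo, y_demo,
    param_name="n_neighbors", param_range=k_values, cv=5)

# Average over the cross-validation folds and pick the best validation score
mean_val = val_scores.mean(axis=1)
best_k = k_values[np.argmax(mean_val)]
print("best k:", best_k)
```

Both score arrays have one row per parameter value and one column per fold, which is why the mean is taken along `axis=1`.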

#### Learning curve
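Does more data help my model? A learning curve plots the train and test score against the number of training samples. If the test score is still rising at the largest size, more data is likely to help.  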
<img src="../doc/29_learning_curve.png" alt="Learning Curve" width="500"/>

In [None]:
# imports
import graphviz
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import validation_curve
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [None]:
# Read CSV
df = pd.read_csv("../res/classification.csv")
df.head()

In [None]:
# Prepare data
x = df[["age", "interest"]].values
y = df["success"].values

In [None]:
# Create validation curve for k-Neighbors
param_range = np.array([40, 30, 20, 15, 10, 8, 7, 6, 5, 4, 3, 2, 1])

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    x,
    y,
    param_name = "n_neighbors",
    param_range = param_range)

In [None]:
# Plot performance of model with increasing neighbors k
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(param_range, np.mean(train_scores, axis = 1), label = 'train')
plt.plot(param_range, np.mean(test_scores, axis = 1), label = 'test')

plt.xlabel('Neighbors k')
plt.ylabel('Model performance (accuracy)')  # classifiers are scored by accuracy by default, not r2

plt.xlim(np.max(param_range), 0) # Switch x-axis as high k is more generic
plt.legend()
plt.show()

In [None]:
# Create validation curve for Decision tree
param_range = np.arange(1, 11)  # integer depths 1..10; max_depth must be an int, not a float
print(param_range)

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(criterion = 'entropy'),
    x,
    y,
    param_name = "max_depth",
    param_range = param_range)

In [None]:
# Plot performance of model with increasing tree depth
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(param_range, np.mean(train_scores, axis = 1), label = 'train')
plt.plot(param_range, np.mean(test_scores, axis = 1), label = 'test')

plt.xlabel('Depth of tree')
plt.ylabel('Model performance (accuracy)')  # classifiers are scored by accuracy by default, not r2

plt.legend()
plt.show()

In [None]:
# Create learning curve for k-Neighbors
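# Shuffle first: learning_curve takes training subsets in order, so sorted data would bias small subsets
x, y = shuffle(x, y, random_state=0)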
train_sizes_abs, train_scores, test_scores = learning_curve(KNeighborsClassifier(), x, y)

In [None]:
# Plot learning curve
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(train_sizes_abs, np.mean(train_scores, axis = 1), label = 'train')
plt.plot(train_sizes_abs, np.mean(test_scores, axis = 1), label = 'test')

plt.xlabel('Number of data samples')
plt.ylabel('Model performance (accuracy)')  # classifiers are scored by accuracy by default, not r2

plt.legend()
plt.show()