# Hyperparameters and Model Validation

In the introduction to scikit-learn notebook, we briefly explored supervised and unsupervised learning, and walked through a simple four-step plan for building an ML model. In this notebook, we're going to look more closely at the first two steps: choosing a class of model and choosing the model hyperparameters.

### Naive model validation

In [4]:
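# Load the iris data: X holds the features, y the class labels
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Fit a 1-nearest-neighbor classifier and predict on the SAME data it was trained on
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
y_model = model.predict(X)

# Fraction of predicted labels that match the true labels
from sklearn.metrics import accuracy_score as acc
acc(y, y_model)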

1.0

An accuracy of 100%? That seems a little *too* perfect, no? The problem is that our training and testing data are exactly the same, so the model is only being asked to recall labels it has already seen. This is bad practice, because the point of an ML model is to predict values for data it has not seen before.
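The perfect score can be checked directly. The sketch below is a minimal illustration, not part of the original notebook. It asks the fitted 1-nearest-neighbor model for the distance from each training point to its closest training point. Because every point is in the training set, that distance should be zero, which means the model is effectively looking up each point's own label. The names `distances` and `indices` are just illustrative variable names for the two arrays that `kneighbors` returns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# For every training point, find its single nearest neighbor in the training set.
# kneighbors returns two arrays of shape (n_samples, 1): distances and indices.
distances, indices = model.kneighbors(X, n_neighbors=1)

# Each point is its own nearest neighbor (or an exact duplicate of it),
# so the largest nearest-neighbor distance should be (numerically) zero.
print(distances.max())
```

Because the nearest neighbor is always at distance zero, evaluating on the training data says nothing about how the model handles new samples.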

So how do we counter this? We can create a *holdout set*: we hold back some of the data from training and use it only to test the model.

In [5]:
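from sklearn.model_selection import train_test_split
# Split the data in half: (X1, y1) for training, (X2, y2) held out for testing
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)

# Fit on the first half only, then score on the unseen second half
model.fit(X1, y1)
y2_model = model.predict(X2)
acc(y2, y2_model)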

0.9066666666666666

91% accuracy seems a little more realistic. But now we have another issue: the model is trained on only one half of the data and tested on only the other half. We can address this via *cross-validation*, a sequence of fits in which each subset of the data is used both as a training set and as a validation set.

In [6]:
# Two-fold cross-validation: train on one half, validate on the other, then swap
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
acc(y1, y1_model), acc(y2, y2_model)

(0.96, 0.9066666666666666)
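The two-fold scheme above splits the data by hand, but the same idea extends to more folds. As a minimal sketch, scikit-learn's `cross_val_score` helper can run the split-fit-score loop for us. The choice of `cv=5` here is an illustrative assumption, not something fixed by the notebook. The model is trained five times, each time holding out a different fifth of the data for validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# cv=5 is an assumed, illustrative number of folds.
# Each fold is used once for validation while the other four are used for training.
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # a single summary of performance across folds
```

Averaging the per-fold scores gives a steadier estimate than any single split, since every sample is used for validation exactly once.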