
# How can you tell your model is overfitting or underfitting the data?

#### Using cross-validation to get an estimate of a model's generalization performance.

In [None]:
# If a model performs well on the training data but generalizes poorly
# according to the cross-validation metrics, then your model is overfitting. If it performs
# poorly on both, then it is underfitting. 
# This is one way to tell when a model is too simple or too complex.
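
As a rough illustration of this check, the sketch below compares training error and cross-validation error for a simple model and a very flexible one, using scikit-learn's `cross_validate` with `return_train_score=True`. The synthetic quadratic data, the random seed, and the helper name `train_and_cv_rmse` are assumptions made for the example and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed toy data: a noisy quadratic (not from the original notebook)
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)


def train_and_cv_rmse(model, X, y, cv=5):
    """Return (training RMSE, validation RMSE), each averaged over the CV folds."""
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    # Scores are negated RMSE values, so flip the sign back
    return -scores["train_score"].mean(), -scores["test_score"].mean()


simple = LinearRegression()
flexible = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression(),
)

for name, model in [("linear", simple), ("degree-10 polynomial", flexible)]:
    train_rmse, val_rmse = train_and_cv_rmse(model, X, y)
    print(f"{name}: train RMSE = {train_rmse:.2f}, CV RMSE = {val_rmse:.2f}")
```

A training error well below the cross-validation error points toward overfitting, while two similarly high errors point toward underfitting.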

## Learning Curves

In [None]:
# Learning curves are plots of the model's performance
# on the training set and the validation set as a function of the training set size.
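
scikit-learn also provides a `learning_curve` helper in `sklearn.model_selection` that computes these scores with cross-validation. The following is a minimal sketch on assumed synthetic data (the data, seed, and choice of ten training sizes are illustrative, not taken from this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Assumed toy data: a noisy quadratic with one feature
rng = np.random.default_rng(0)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=200)

# Evaluate the model at 10 training set sizes, using 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE values with shape (n_sizes, n_folds); average over folds
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(f"{n:4d} instances: train RMSE = {tr:.2f}, val RMSE = {va:.2f}")
```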

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [None]:
def plot_learning_curves(model, X, y):
    # Hold out 20% of the data as a validation set
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    # Refit the model on the first m training instances, for m = 1 .. len(X_train)
    for m in range(1, len(X_train) + 1):
        model.fit(X_train[:m], y_train[:m])
        # Predict on the training subset used so far and on the full validation set
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        # Record the MSE on each set (argument order is y_true, y_pred)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    # Plot RMSE (square root of the MSE) against the training set size
    sizes = np.arange(1, len(X_train) + 1)
    plt.plot(sizes, np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(sizes, np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
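
The cells that follow call `plot_learning_curves` with `X` and `y`, which this notebook never defines. Below is a minimal sketch of one possible dataset, assuming noisy quadratic data with a single feature. The size, coefficients, and seed are illustrative choices, not values taken from the notebook.

```python
import numpy as np

np.random.seed(42)  # assumed seed, used only to make the example reproducible

m = 100                                          # number of instances (assumption)
X = 6 * np.random.rand(m, 1) - 3                 # one feature, uniform in [-3, 3)
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)   # quadratic trend plus Gaussian noise
```

Because the underlying relationship is quadratic, a straight line is too simple for this data and a 10th-degree polynomial is more flexible than it needs to be.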

### Learning curve of a linear regression model

In [None]:
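from sklearn.linear_model import LinearRegression
# X and y (features and targets) must already be defined
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)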

#### UNDERFIT!!
When both curves reach a plateau, close together and at a fairly high error,
the model is underfitting the training data. In that case, adding more training
examples will not help. You need to use a more complex model
or come up with better features.
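
To make this concrete, here is a small sketch on assumed synthetic quadratic data (the data, seed, and the helper name `val_rmse` are hypothetical). On data like this, one would expect the linear model's validation error to change little when it is given eight times more training instances, while adding a squared feature should lower it noticeably.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed toy data: a noisy quadratic with one feature
rng = np.random.default_rng(1)
X = 6 * rng.random((500, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=500)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)


def val_rmse(model, n_train):
    """Fit on the first n_train training instances; return RMSE on the validation set."""
    model.fit(X_train[:n_train], y_train[:n_train])
    return np.sqrt(mean_squared_error(y_val, model.predict(X_val)))


linear = LinearRegression()
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

print("linear, 50 instances:     ", val_rmse(linear, 50))
print("linear, 400 instances:    ", val_rmse(linear, 400))
print("quadratic, 400 instances: ", val_rmse(quadratic, 400))
```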

### Learning curve of 10th degree polynomial model

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Steps are passed as a list of (name, estimator) pairs
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)

#### OVERFIT!!
One way to improve an overfitting model is to feed it more training
data until the validation error reaches the training error.
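
The effect of extra data on the gap can be sketched as follows, assuming synthetic quadratic data and a 10th-degree polynomial model. The data generator `make_data`, the helper `rmse_gap`, the seed, and the training sizes are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)


def make_data(n):
    """Assumed toy data: noisy quadratic with one feature in [-3, 3)."""
    X = 6 * rng.random((n, 1)) - 3
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=n)
    return X, y


X_val, y_val = make_data(1000)  # fixed validation set shared by all runs


def rmse_gap(n_train):
    """Validation RMSE minus training RMSE for a degree-10 fit on n_train instances."""
    X_train, y_train = make_data(n_train)
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return val_rmse - train_rmse


gaps = {n: rmse_gap(n) for n in (20, 200, 2000)}
for n, gap in gaps.items():
    print(f"{n:5d} training instances: val RMSE - train RMSE = {gap:.3f}")
```

With only a handful of instances the model can fit the noise, so the gap is expected to be large; with many instances the two errors should move toward each other.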

### Differences between lin_reg and poly_reg

In [2]:
# • The error on the training data is much lower than with the Linear Regression
#   model.
# • There is a gap between the curves. This means that the model performs significantly
#   better on the training data than on the validation data, which is the hallmark
#   of an overfitting model. However, if you used a much larger training set,
#   the two curves would continue to get closer.