In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
from IPython.display import Image

# Logistic Regression and Overfitting / Underfitting

Sources:

- Maha K, **Overfitting and Underfitting: Simple Explanations and Python Examples,** https://medium.com/@maheshhkanagavell/overfitting-and-underfitting-simple-explanations-and-python-examples-74424b1076c3
- Ivan Zakharchuk, **Generalization, Overfitting, and Under-fitting in Supervised Learning,** https://ivanzakharchuk.medium.com/generalization-overfitting-and-underfitting-in-supervised-learning-a21f02ebf3df

In [1]:
Image(filename="figures/regularization_5.png", width=800)


- **Underfitting** occurs when the model is too simple to capture the underlying patterns in the data, even with sufficient training data.

- An underfit model performs poorly on both the training and the test sets.

- **Overfitting** occurs when the model is too complex and begins to fit noise or random fluctuations in the training data.

- An overfit model can reach very high accuracy on the training data yet perform poorly on the test data, which shows that it has failed to generalize to new, unseen data.

#### Example:

**Overfitting:**
- If the training set MSE is much lower than the test set MSE, the model has likely overfit the training data, i.e., it has learned the noise in the training data instead of the underlying pattern.
- This corresponds to high variance and low bias.

**Underfitting:**
- If the training set MSE and the test set MSE are both high, the model has likely underfit the data.
- The model is too simple to capture the underlying pattern.
- This corresponds to high bias and low variance.
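
The bias and variance terms used above can be made concrete with the standard decomposition of expected squared error. The notation is an assumption introduced here, not taken from the sources: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model fitted on a randomly drawn training set, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Read this way, an underfit model is dominated by the bias term, while an overfit model is dominated by the variance term. The noise term cannot be reduced by any choice of model.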

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic data: y = 2 + 3x plus standard normal noise
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the training set and calculate the mean squared error
y_train_pred = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)

# Predict on the test set and calculate the mean squared error
y_test_pred = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)

print(f"Training set MSE: {mse_train:.2f}")
print(f"Test set MSE: {mse_test:.2f}")
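
The linear model above matches the process that generated the data, so it is not expected to show a strong case of either failure mode. As a rough sketch of how the training/test MSE comparison behaves when model complexity changes, the cell below fits polynomials of increasing degree to noisy samples of a sine curve. The sine target, the noise level, the 50/50 split, and the degrees 1, 4, and 15 are all illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, nonlinear data (assumed for illustration): noisy sine curve
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=60)

# A small training set makes the effect of high-degree fits easier to see
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Map each polynomial degree to its (training MSE, test MSE)
results = {}
for degree in (1, 4, 15):
    # Polynomial feature expansion followed by ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (mse_train, mse_test)

    print(f"degree={degree:2d}  train MSE: {mse_train:.3f}  test MSE: {mse_test:.3f}")
```

When reading the output, a degree whose training and test MSE are both high suggests underfitting, while a degree whose training MSE is low but whose test MSE is much higher suggests overfitting. The exact numbers depend on the random seed and the split.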