## Session 1: Machine Learning Basics Notebook

This notebook is a part of the Nano Degree in Generative AI.

## Objective: Explore the Basics of Machine Learning

In this notebook, we introduce the basic concepts of machine learning, including the different types of machine learning algorithms. We also cover Exploratory Data Analysis (EDA), an important step in the machine learning workflow, along with feature engineering.

## Part 1: Understanding Underfitting and Overfitting

### Underfitting



**What is Underfitting?**

Underfitting occurs when a machine learning model cannot capture the underlying patterns and relationships in the data. This usually happens when the model is too simple for the complexity of the data, for example when a straight line is fitted to a curved relationship. Underfitting leads to poor performance on both training and test data.

**How to identify underfitting?**

There are several ways to identify underfitting in a machine learning model:
1. **Training and test accuracy:** If the model scores poorly on both the training data and the test data, and the two scores are close to each other, it is likely underfitting. (A training accuracy much higher than the test accuracy points to overfitting instead.)
2. **Visualization:** Plotting the model's predictions against the data can help identify underfitting. If the fitted curve misses an obvious trend, or the residuals show a systematic pattern rather than random scatter, the model is likely underfitting.
3. **Cross-validation:** Using cross-validation can help identify underfitting. If the model performs poorly on every fold, on the training portion as well as the validation portion, it may indicate that the model is underfitting.
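
The first and third checks can be sketched in a few lines. This is a minimal illustration, not a definitive recipe: the 25% test split, the 5 shuffled folds, and the use of R² as the score are assumptions made here, and the data follows the same sine-plus-noise recipe used in the cell below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic non-linear data (sine wave plus Gaussian noise)
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Hold out a test set (the 25% split is an arbitrary choice for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight line is too simple for a sine wave
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data

# Shuffle the folds: X is sorted, so contiguous folds would force extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2:  {test_r2:.3f}")
print(f"CV R^2 per fold: {np.round(cv_r2, 3)}")
```

When the training score is itself well below 1 and the test and fold scores are no better, the model is too simple for the data rather than too specialized to it.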

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate synthetic non-linear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Train a linear regression model to demonstrate underfitting
linear_model = LinearRegression()
linear_model.fit(X, y)

# Predict using the model
y_pred = linear_model.predict(X)

# Plotting the data and model prediction
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label='Linear Model Prediction')
plt.title('Demonstrating Underfitting with Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Evaluation - Mean Squared Error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Plot residuals to see the poor fit
plt.scatter(X, y - y_pred, color='green')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot - Underfitting')
plt.xlabel('X')
plt.ylabel('Residuals (y - y_pred)')
plt.show()

### Overfitting

**What is overfitting?**

Overfitting is a common problem in machine learning and statistics, where a model fits the training data too closely, including its noise and random fluctuations. The model then generalizes poorly to new, unseen data: it has memorized details specific to the training set instead of learning the underlying patterns or relationships.
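
A small experiment makes the symptom concrete. The sketch below is an illustration under assumptions chosen here (30 points, a 50/50 split, polynomial degrees 3 and 12, inputs rescaled to [-1, 1] for numerical stability); none of these values come from the notebook's own cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Small noisy sine dataset: few points make overfitting easy to provoke
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(30, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}
for degree in (3, 12):
    # Rescale X to [-1, 1] so high powers stay numerically well behaved
    model = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),
        PolynomialFeatures(degree),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The higher-degree model can always match the training points at least as well as the lower-degree one. If its test error is nevertheless much larger than its training error, that gap is the signature of overfitting.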

**How to avoid overfitting?**
There are several ways to avoid overfitting in machine learning and statistics:
1. **Regularization**: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that is minimized during training. This penalty term encourages the model to keep its weights small, which limits its effective complexity.
2. **Cross-validation**: Cross-validation is a technique used to estimate the performance of a model on unseen data. It splits the data into several folds, then repeatedly trains the model on all but one fold and evaluates it on the fold that was held out, averaging the results. Comparing these validation scores with the training scores can help identify whether the model is overfitting or underfitting the data.
3. **Early stopping**: Early stopping is a technique used to prevent overfitting by stopping the training process when the model starts to overfit the training data. This can be done by monitoring the performance of the model on a validation set and stopping the training process when the performance starts to degrade.
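
As an illustration of the cross-validation item, the sketch below uses 5-fold cross-validation to compare polynomial degrees on the same kind of sine data. The candidate degrees and the fold count are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data (sine wave plus Gaussian noise)
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Shuffled folds, because X is sorted
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that higher is always better
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree}  mean CV MSE={cv_mse[degree]:.4f}")

# Pick the degree with the lowest average validation error
best_degree = min(cv_mse, key=cv_mse.get)
print(f"Selected degree: {best_degree}")
```

Because every point is used for validation exactly once, the averaged score is less sensitive to a lucky or unlucky split than a single train/validation partition.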



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Generate synthetic non-linear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Apply polynomial features to the data to address underfitting
polynomial_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
polynomial_model.fit(X, y)

# Predict using the model
y_pred = polynomial_model.predict(X)

# Plotting the data and model prediction
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, y_pred, color='red', label='Polynomial Model Prediction')
plt.title('Mitigating Underfitting with Polynomial Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Evaluation - Mean Squared Error
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Plot residuals to see the fit
plt.scatter(X, y - y_pred, color='green')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot - Polynomial Regression')
plt.xlabel('X')
plt.ylabel('Residuals (y - y_pred)')
plt.show()

### Avoid Overfitting with Ridge Regression

A higher-degree polynomial is flexible enough to fit noise in the training data. To guard against this, we can combine polynomial regression with a regularization technique such as Ridge Regression or Lasso Regression. These methods add a penalty on the size of the model's weights, which reduces its effective complexity and helps prevent overfitting.
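
For reference, the two penalties differ only in the norm applied to the weights. The objectives can be sketched as follows, where the weight vector $w$, the polynomial features $\phi(x_i)$, and the regularization strength $\alpha$ are symbols introduced here for illustration; libraries may scale the data term differently.

$$
\text{Ridge:}\quad \min_{w}\ \sum_{i=1}^{n}\bigl(y_i - w^\top \phi(x_i)\bigr)^2 + \alpha \sum_{j} w_j^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \sum_{i=1}^{n}\bigl(y_i - w^\top \phi(x_i)\bigr)^2 + \alpha \sum_{j} \lvert w_j \rvert
$$

A larger $\alpha$ shrinks the weights more strongly. The Lasso penalty can drive some weights exactly to zero, while the Ridge penalty shrinks them smoothly toward zero.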

The following cell fits a degree-8 polynomial with Ridge Regression and evaluates it on a held-out test set:



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: Create a synthetic dataset with a non-linear relationship
np.random.seed(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)  # 100 points between 0 and 10
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])  # Sine wave with some noise

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Create a polynomial regression model with regularization (Ridge Regression)
degree = 8  # Degree of the polynomial
alpha = 1.0  # Regularization strength (1.0 is a moderate amount of regularization)

model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
model.fit(X_train, y_train)

# Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 5: Calculate the Mean Squared Error (MSE) to evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Step 6: Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, model.predict(X), color='red', label=f'Polynomial Model (Degree={degree}) with Ridge')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Ridge Regression to Avoid Overfitting')
plt.legend()
plt.show()
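
The regularization strength is a tunable knob, so it can help to see how the fit changes as `alpha` varies. The sketch below reuses the same data recipe and split. The three `alpha` values and the added `StandardScaler` step (so the penalty treats all polynomial terms on a comparable scale) are assumptions of this sketch and not part of the cell above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same data recipe as above: noisy sine wave on [0, 10]
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

degree = 8
alphas = (1e-3, 1.0, 1e3)  # weak, moderate, strong penalty (illustrative values)
train_mse, test_mse = {}, {}
for alpha in alphas:
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),  # assumption: standardize so the penalty acts evenly
        Ridge(alpha=alpha),
    )
    model.fit(X_train, y_train)
    train_mse[alpha] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[alpha] = mean_squared_error(y_test, model.predict(X_test))
    print(f"alpha={alpha:g}  train MSE={train_mse[alpha]:.4f}  "
          f"test MSE={test_mse[alpha]:.4f}")
```

Training error can only grow as `alpha` increases, because a stronger penalty restricts the model further. A very large `alpha` pushes the model back toward underfitting, so the value is usually chosen by comparing validation or test error across candidates.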
