
# **Classification Algorithms**

## **Step 1**

I will import the Iris dataset from the scikit-learn library.

> *It contains features for different iris flower species, which the model will learn to categorize.*

### **Loading & creating a dataframe for the Iris Dataset**

- I will run `from sklearn.datasets import load_iris`
- I will also import pandas, load the dataset, and create a dataframe

Then I will display the first few rows of the dataset.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# loading the dataset
iris = load_iris()

In [None]:
# creating a dataframe

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# I'm adding the target column to include the additional information
# (numerical values representing the species)

print(iris_df.head())

---
---


## **Step 2**
Carry out classification using five algorithms.

Each one follows the same pipeline: **train the model, make predictions, display the confusion matrix, and compute performance metrics**.

### **1. K Nearest Neighbour (KNN)**

**So what exactly is KNN?**
KNN is a **"lazy learner"**: it does not build an explicit model during training. Instead, it stores the training data and, at prediction time, assigns each new point the majority class among its $K$ closest neighbors.

Key consideration: use the **"Rule of Thumb"** ($K = \sqrt{n}$, where $n$ is the number of training samples) or **Cross-Validation** to select the best value for $K$.
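To make the idea concrete, here is a minimal sketch of KNN's prediction step together with the $\sqrt{n}$ rule of thumb. The helper names `rule_of_thumb_k` and `knn_predict`, the toy data, and the choice to round $K$ up to an odd number are illustrative assumptions; none of them come from scikit-learn or from the pipeline used later in this notebook.

```python
import math
from collections import Counter

import numpy as np


def rule_of_thumb_k(n_samples):
    """Pick an odd K near sqrt(n); preferring odd values is an added convention."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every stored point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count the neighbors' class labels
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(rule_of_thumb_k(120))                                  # K suggested for 120 training rows
print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # query close to the first cluster
```

Note that all the work happens inside `knn_predict`; nothing is learned ahead of time, which is what "lazy" refers to.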

Key steps:
1. import necessary libraries
2. split the data
3. create the KNN model
4. fit the model
5. make predictions
6. display the confusion matrix
7. compute performance metrics




#### **1.1 Import Libraries**
Import the libraries needed for data handling, model training, and evaluation.

- `from sklearn.model_selection import train_test_split`

  Purpose: This function is used to split a dataset into two subsets: one for training the model and another for testing it.
- `from sklearn.neighbors import KNeighborsClassifier`

  Purpose: This imports the K-Nearest Neighbors (KNN) classifier (a simple, instance-based learning algorithm used for classification and regression tasks).
- `from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay, precision_score, recall_score, f1_score`

  Purpose: These functions are used to evaluate the performance of the machine learning model.
- `import seaborn as sns`
  
  Purpose: Seaborn is a statistical data visualization library built on top of Matplotlib. It'll provide a high-level interface for drawing attractive and informative statistical graphics.
- `import matplotlib.pyplot as plt`

  Purpose: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy.

In [None]:
# Optional: upgrade scikit-learn first. It was already imported above, so the
# new version is only picked up after restarting the runtime.
!pip install --upgrade scikit-learn

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt

#### **1.2 Split the data**

Here we split the data into features (`X`) and target labels (`y`), then split those further into training and testing sets. This lets the model learn from one portion of the data while being evaluated on a separate, unseen portion.

My expected outcome:
- **X_train**: Contains 80% of the feature data for training the model.
- **X_test**: Contains 20% of the feature data for testing the model.
- **y_train**: Contains 80% of the target labels corresponding to X_train.
- **y_test**: Contains 20% of the target labels corresponding to X_test.
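As a quick sanity check on the expected outcome above, the sketch below repeats the split on a fresh copy of the Iris data and prints the resulting shapes. The `_demo` suffix on the variable names is an assumption added here so the example does not overwrite the notebook's own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Demo variables use a _demo suffix so they do not clash with the notebook's splits
X_demo, y_demo = load_iris(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_train_demo.shape, X_test_demo.shape)  # feature rows in each split
print(y_train_demo.shape, y_test_demo.shape)  # matching label counts
```

With 150 samples and `test_size=0.2`, this gives 120 training rows and 30 test rows.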

In [None]:
# Features
X = iris_df.drop('target', axis=1)

# Target variable
y = iris_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### **1.3 Train the KNN Model**

I will first use cross-validation to find the best number of neighbors for my data.

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Range of k values to test
k_values = range(1, 21)
scores = []

# Perform cross-validation for each k
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5)  # 5-fold cross-validation
    scores.append(np.mean(cv_scores))

# Find the best k
best_k = k_values[np.argmax(scores)]
print(f'The best number of neighbors is: {best_k}')

# Plotting the results
plt.plot(k_values, scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Cross-Validated Accuracy')
plt.title('KNN Hyperparameter Tuning')
plt.xticks(k_values)  # Show all k values on the x-axis
plt.grid()
plt.show()

I will then initialize the **classifier**.

The `n_neighbors` parameter specifies the number of nearest neighbors to consider when making predictions. Here it is set to 3, so the model looks at the 3 closest data points in the training set to determine the class of a new data point. To use the cross-validated value instead, pass `n_neighbors=best_k`.

In [None]:
# training the KNN Model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

#### **1.4 Making Predictions**

In [None]:
y_pred = knn.predict(X_test)

#### **Displaying the Confusion Matrix**

For a binary problem, the confusion matrix has four components:

- **True Positives (TP)**: The number of instances correctly predicted as positive.
- **True Negatives (TN)**: The number of instances correctly predicted as negative.
- **False Positives (FP)**: The number of instances incorrectly predicted as positive (also known as Type I error).
- **False Negatives (FN)**: The number of instances incorrectly predicted as negative (also known as Type II error).

The Iris dataset has three classes, so the matrix is 3×3: rows are the actual species, columns are the predicted species, and the diagonal holds the correct predictions. The four quantities above are then defined per class by treating that class as positive and the other two as negative.

The code below computes the confusion matrix for the KNN model's predictions, creates a visual representation of it, and displays the results.

This visualization will help in understanding how many instances were correctly or incorrectly classified for each class, providing valuable insights into the model's strengths and weaknesses.
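For a multi-class matrix, TP, FP, FN, and TN can be read off per class. The sketch below does this with NumPy on a hypothetical 3×3 matrix; `cm_demo` and its numbers are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm_demo = np.array([[10, 0, 0],
                    [0, 8, 2],
                    [0, 1, 9]])

tp = np.diag(cm_demo)                # correctly predicted as the class
fp = cm_demo.sum(axis=0) - tp        # predicted as the class, actually another
fn = cm_demo.sum(axis=1) - tp        # actually the class, predicted as another
tn = cm_demo.sum() - tp - fp - fn    # everything that does not involve the class

for label, values in zip(['TP', 'FP', 'FN', 'TN'], [tp, fp, fn, tn]):
    print(label, values)
```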

In [None]:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

#### **1.5 Compute Performance Metrics**

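I will first calculate the accuracy of the KNN model. For a binary problem it is

$\frac{TP + TN}{Total}$

and for a multi-class problem such as Iris it is the number of correct predictions (the diagonal of the confusion matrix) divided by the total number of predictions.

Then I will generate a detailed classification report that provides additional performance metrics.

Finally, I will print the model's accuracy and the detailed classification report.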
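To show what accuracy and the `average='weighted'` option compute, here is a hand calculation with NumPy on a hypothetical confusion matrix. The matrix `cm_demo` and the `_demo` names are assumptions for illustration only, not outputs of the KNN model.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm_demo = np.array([[10, 0, 0],
                    [0, 8, 2],
                    [0, 1, 9]])

correct = np.diag(cm_demo)      # correct predictions per class
support = cm_demo.sum(axis=1)   # number of actual samples per class

accuracy_demo = correct.sum() / cm_demo.sum()
precision_per_class = correct / cm_demo.sum(axis=0)
recall_per_class = correct / support
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

# 'weighted' averaging weights each class's score by its support
weighted_precision = np.average(precision_per_class, weights=support)
weighted_recall = np.average(recall_per_class, weights=support)
weighted_f1 = np.average(f1_per_class, weights=support)

print(f'Accuracy: {accuracy_demo:.4f}')
print(f'Weighted precision: {weighted_precision:.4f}')
print(f'Weighted recall: {weighted_recall:.4f}')
print(f'Weighted F1: {weighted_f1:.4f}')
```

One consequence worth knowing: support-weighted recall always equals accuracy, so the two numbers printed in the next cell will match.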

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-Score: {f1:.4f}')
print('Classification Report:\n', report)

---

### **2. Naive Bayes**
**Definition:** This is a family of probabilistic algorithms based on Bayes' theorem, used primarily for classification tasks. It is called "naive" because it makes the simplifying assumption that the features (or predictors) are conditionally independent given the class label. This means that, within a class, the value of one feature is assumed to tell us nothing about the value of another.

Steps for Naive Bayes Classification
1. **Import Libraries:** Import the necessary libraries for data manipulation, modeling, and evaluation.

2. **Load the Dataset:** I already loaded it into a data frame

3. **Split the Dataset:** I already split the data into training and testing sets

4. **Train the Model:** Instantiate the Naive Bayes classifier and fit it to the training data.

5. **Make Predictions:** Use the trained model to make predictions on the test set.

6. **Display the Confusion Matrix:** Generate and visualize the confusion matrix to evaluate the model's predictions.

7. **Compute Performance Metrics:** Calculate accuracy, precision, recall, F1-score, and any other relevant metrics.

#### **2.1 Import Libraries**

The only class we need to add is `GaussianNB`:

`GaussianNB` is an implementation of the Gaussian Naive Bayes algorithm, which is a probabilistic classifier based on Bayes' theorem. It assumes that the features follow a normal (Gaussian) distribution.
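The sketch below shows the idea behind Gaussian Naive Bayes in plain NumPy: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log-posterior, summing the per-feature log-likelihoods because the features are treated as independent. The function names, the toy data, and the small `1e-9` variance floor are assumptions for illustration; this is not how `GaussianNB` is implemented internally.

```python
import numpy as np


def gaussian_nb_fit(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # avoid division by zero
    return classes, priors, means, variances


def gaussian_nb_predict(X, classes, priors, means, variances):
    """Return the class with the highest log P(c) + sum_i log N(x_i | mean, var)."""
    X = np.asarray(X)[:, None, :]  # shape (n_samples, 1, n_features) for broadcasting
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)
                             + (X - means) ** 2 / variances).sum(axis=2)
    log_posterior = np.log(priors) + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]


# Toy data (made up for illustration): two clusters with two features each
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
                  [4.0, 5.0], [4.2, 5.1], [3.9, 4.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = gaussian_nb_fit(X_toy, y_toy)
print(gaussian_nb_predict(np.array([[1.1, 2.0], [4.1, 5.0]]), *params))
```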

In [None]:
# additional import of the Gaussian Naive Bayes classifier library
from sklearn.naive_bayes import GaussianNB

Since steps `2.2` & `2.3` have been completed in the KNN classification we move on to `2.4`

#### **2.4 Training the model**
- I will initialize a new model object `nb_model` that will hold the parameters and methods necessary for training and making predictions with the Naive Bayes algorithm

- Then we will train the Naive Bayes model using the training dataset

In [None]:
nb_model = GaussianNB()  # Instantiate the Naive Bayes classifier
nb_model.fit(X_train, y_train)  # Fit the model to the training data

#### **2.5 Making Predictions**

We will use the trained Naive Bayes model to make predictions on the test dataset

We can't skip this step because the previous `y_pred` variable stored the predictions for the KNN classification, so we need to regenerate the predictions for this model.

In [None]:
y_pred = nb_model.predict(X_test)  # Predict on the test set

#### **2.6 Displaying the Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

#### **2.7 Computing Performance Metrics**

In [None]:
# Compute Performance Metrics
accuracy = accuracy_score(y_test, y_pred) # Sum of Correct Predictions / Total Number of Predictions
report = classification_report(y_test, y_pred, target_names=iris.target_names) # summarizes the metrics for each class
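# Specificity per class (one-vs-rest): TN / (TN + FP), macro-averaged over the three classes
fp = cm.sum(axis=0) - np.diag(cm)  # predicted as the class but actually another
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)  # neither actual nor predicted as the class
specificity = np.mean(tn / (tn + fp))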
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', report)
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'Specificity: {specificity:.2f}')
print(f'F1-Score: {f1:.4f}')

---

### **3. Support Vector Machine (SVM)**

This is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates data points of different classes in a high-dimensional space.

SVM aims to maximize the margin between the closest points of each class (support vectors) and the hyperplane, which enhances the model's ability to generalize to unseen data. SVM can also use kernel functions to handle non-linear separations by transforming the input space into higher dimensions.
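As a small illustration of the hyperplane and margin, the sketch below fits a linear `SVC` on toy data and reads the hyperplane from `coef_` and `intercept_`; the margin width is then $2 / \lVert w \rVert$. The toy points and the large `C` value (used here to approximate a hard margin) are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin (an assumption made for this demo)
clf = SVC(kernel='linear', C=1000.0)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]                         # normal vector of the separating hyperplane
b = clf.intercept_[0]                    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries

print('Support vectors:\n', clf.support_vectors_)
print(f'Hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0')
print(f'Margin width: {margin_width:.3f}')
```

Only the support vectors determine the hyperplane; moving any other point (without crossing the margin) leaves the fit unchanged.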

Steps to Perform SVM on the Iris Dataset:
1. **Import Libraries:** Import the necessary libraries for data manipulation, modeling, and evaluation.

2. **Load the Dataset:** I already loaded it into a data frame

3. **Split the Dataset:** I already split the data into training and testing sets

4. **Train the Model:** Instantiate the SVM classifier and fit it to the training data.

5. **Make Predictions:** Use the trained model to make predictions on the test set.

6. **Display the Confusion Matrix:** Generate and visualize the confusion matrix to evaluate the model's predictions.

7. **Compute Performance Metrics:** Calculate accuracy, precision, recall, F1-score, and any other relevant metrics.

#### **3.1 Import Libraries**

- I will import the `SVC` class, since the rest have been imported in previous code blocks

In [None]:
from sklearn.svm import SVC

Since steps `3.2` & `3.3` have been completed in the KNN classification we move on to `3.4`

#### **3.4 Training the SVM model**
- I will create an instance of the Support Vector Classifier (SVC) from the sklearn.svm module.
- Then I will train the SVM model using the training dataset

But first let's run k-fold cross-validation to determine which kernel performs best for the model (linear, polynomial, radial basis function, or sigmoid).

Kernel Functions:
- **Linear Kernel:** This is used when the data is linearly separable in the feature space. It creates a hyperplane that separates the classes with a straight line (or flat hyperplane in higher dimensions).

- **Radial Basis Function (RBF) Kernel:** This is a popular choice for non-linear data. It maps input features into an infinite-dimensional space, allowing complex boundaries between classes.

- **Polynomial Kernel:** This kernel allows for polynomial decision boundaries and can model interactions between features.

- **Other kernels** can also be specified, such as 'sigmoid' and custom kernels, depending on the problem's requirements.
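The kernels listed above are just functions of two input vectors. The sketch below writes out the standard formulas in NumPy; the function names and the particular `gamma`, `coef0`, and `degree` values are assumptions for illustration and do not match `SVC`'s defaults.

```python
import numpy as np


def linear_kernel(x, z):
    """K(x, z) = <x, z>"""
    return float(np.dot(x, z))


def polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=2):
    """K(x, z) = (gamma * <x, z> + coef0) ** degree"""
    return float((gamma * np.dot(x, z) + coef0) ** degree)


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))


def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    """K(x, z) = tanh(gamma * <x, z> + coef0)"""
    return float(np.tanh(gamma * np.dot(x, z) + coef0))


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print('linear :', linear_kernel(x, z))
print('poly   :', polynomial_kernel(x, z))
print('rbf    :', rbf_kernel(x, z))
print('sigmoid:', sigmoid_kernel(x, z))
```

The RBF value shrinks toward 0 as the points move apart and equals 1 when they coincide, which is why it behaves like a similarity score.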

In [None]:
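# cross_val_score was already imported in a previous code block
# Note: X and y are the full dataset here, so the test rows also influence the kernel choice;
# pass X_train and y_train instead to keep the test set unseen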

kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for kernel in kernels:
    svm_model = SVC(kernel=kernel)
    scores = cross_val_score(svm_model, X, y, cv=5)  # 5-fold cross-validation
    print(f'Kernel: {kernel}, Accuracy: {scores.mean():.2f} ± {scores.std():.2f}')


We will go with the linear kernel

In [None]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

#### **3.5 Make Predictions**

In [None]:
y_pred = svm_model.predict(X_test)
print(y_pred)

#### **3.6 Displaying the confusion matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)

print('Confusion Matrix:\n', cm)

In [None]:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

#### **3.7 Evaluating Performance Metrics**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
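# Specificity per class (one-vs-rest): TN / (TN + FP), macro-averaged over the three classes
fp = cm.sum(axis=0) - np.diag(cm)  # predicted as the class but actually another
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)  # neither actual nor predicted as the class
specificity = np.mean(tn / (tn + fp))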
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', report)
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'Specificity: {specificity:.2f}')
print(f'F1-Score: {f1:.4f}')

### **4. Decision Tree**
A Decision Tree classifies by recursively splitting data into subsets based on the most significant feature.

**How it works:**

- At each node, select the feature that best separates classes (using Gini impurity or Information Gain/Entropy).

- Split data into branches.

- Repeat until pure nodes or stopping criteria (max depth, min samples) are met.

- Classify new data by traversing the tree to a leaf and outputting the majority class.
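The split criteria mentioned above can be computed by hand. This is a minimal sketch of Gini impurity, entropy, and information gain on small label lists; the function names and the example labels are assumptions for illustration.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children


mixed = [0, 0, 1, 1]  # evenly mixed node
print(f'Gini: {gini(mixed):.2f}')
print(f'Entropy: {entropy(mixed):.2f}')
print(f'Gain of a perfect split: {information_gain(mixed, [0, 0], [1, 1]):.2f}')
```

A pure node has impurity 0 under both measures, so the tree prefers splits that move the children as close to pure as possible.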

**Steps to perform Decision Tree Classification on the Iris Dataset:**
1. Importing Libraries
2. Creating and Training the Decision Tree model
3. Making Predictions
4. Displaying the Confusion Matrix
5. Evaluating the model's metrics
6. Visualizing the Decision Tree

#### **4.1 Importing Libraries**

- `DecisionTreeClassifier`
- `plot_tree`

In [None]:
from sklearn.tree import DecisionTreeClassifier # creates & trains model
from sklearn.tree import plot_tree # visualizes the trained tree structure

#### **4.2 Creating and Training the Decision Tree Model**

Let's first determine the best parameters to use using `GridSearchCV`

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt_model = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print(grid_search.best_params_)
best_model = grid_search.best_estimator_

In [None]:
best_dt_model = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=5,
    min_samples_leaf=4,
    min_samples_split=2,
    random_state=42
)
best_dt_model.fit(X_train, y_train)

#### **4.3 Make Predictions**

In [None]:
y_pred = best_dt_model.predict(X_test)


#### **4.4 Display Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)

print('Confusion Matrix:\n', cm)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

#### **4.5 Evaluating the model's metrics**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
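# Specificity per class (one-vs-rest): TN / (TN + FP), macro-averaged over the three classes
fp = cm.sum(axis=0) - np.diag(cm)  # predicted as the class but actually another
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)  # neither actual nor predicted as the class
specificity = np.mean(tn / (tn + fp))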
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:\n', report)
print(f'Specificity: {specificity:.2f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-Score: {f1:.4f}')

#### **4.6 Visualize the Decision Tree**

In [None]:
plt.figure(figsize=(12, 8))
plot_tree(best_dt_model, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title('Decision Tree Visualization')
plt.show()

### **5. Random Forest Classification**

This is an ensemble method combining multiple decision trees:

**How it works:**

- Creates many trees on bootstrap samples (random sampling with replacement)

- Each split considers random subset of features (feature bagging)

- Final prediction = majority vote (classification) or average (regression)
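The two ingredients described above, bootstrap sampling and majority voting, can be sketched in a few lines of NumPy. The seed, the sample count, and the hypothetical `tree_predictions` array are assumptions for illustration. As far as I know, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this follows the textbook description, not the library internals.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
n_samples = 10

# Bootstrap sample: draw n row indices WITH replacement, so some rows repeat
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
# Rows never drawn are "out of bag" for this tree
out_of_bag = np.setdiff1d(np.arange(n_samples), bootstrap_idx)
print('Bootstrap rows:', bootstrap_idx)
print('Out-of-bag rows:', out_of_bag)

# Hypothetical predictions from 3 trees (rows) for 3 test samples (columns)
tree_predictions = np.array([[0, 1, 1],
                             [0, 1, 2],
                             [1, 1, 2]])


def majority_vote(predictions):
    """Return the most frequent class label in each column."""
    return np.array([np.bincount(column).argmax() for column in predictions.T])


print('Forest prediction:', majority_vote(tree_predictions))
```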

**Steps to implement Random Forest**:
1. Import Libraries
2. Train the Random Forest Model
3. Make Predictions
4. Display confusion matrix
5. Compute Metrics


#### **5.1 Import Libraries**

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### **5.2 Train Random Forest Model**

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

#### **5.3 Make Predictions**

In [None]:
y_pred = rf_model.predict(X_test)

#### **5.4 Display Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

#### **5.5 Evaluate Metrics**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-Score: {f1:.4f}')

# **Regression Algorithms**

## **Step 1**

I will import the California Housing dataset and use it to implement the algorithms below.

> The official scikit-learn California Housing dataset contains **20,640 samples** with **8 numeric features** and **1 target** variable, **no missing values**

**Features (all per block group)** :
- `MedInc`: Median income
- `HouseAge`: Median house age  
- `AveRooms`: Average rooms per household
- `AveBedrms`: Average bedrooms per household
- `Population`: Block group population
- `AveOccup`: Average household members
- `Latitude` / `Longitude`: Geographic coordinates

**Target**: `MedHouseVal` - Median house value (**$100,000 units**)

Derived from the **1990 U.S. census**; regression task only (not classification).

### **Loading & making a dataframe for the California Housing Dataset**

- I will import the official scikit-learn California dataset together with libraries that will be used in this section
- I will create a dataframe for the dataset with target values
- I will display the first few rows

In [None]:
from sklearn.datasets import fetch_california_housing # importing the dataset
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score # for metrics evaluation

In [None]:
# Load dataset
housing = fetch_california_housing()

# Create dataframe
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['target'] = housing.target  # Add target column

print(housing_df.head())

## **Step 2**

We will carry out the following regression algorithms:
- Linear Regression

- Polynomial Regression

- Lasso Regression

- Ridge Regression

The following steps are carried out for each model:
- Train the model

- Make predictions

- Evaluate using MAE, MSE, RMSE, and R²
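For reference, the four metrics can be computed by hand. The sketch below uses NumPy on a tiny made-up example; `y_true_demo` and `y_pred_demo` are hypothetical values, not model outputs.

```python
import numpy as np

# Hypothetical actual and predicted values (made up for illustration)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))   # average absolute error
mse_demo = np.mean(errors ** 2)      # average squared error
rmse_demo = np.sqrt(mse_demo)        # back in the target's units

ss_res = np.sum(errors ** 2)                               # unexplained variation
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total variation
r2_demo = 1 - ss_res / ss_tot                              # share of variation explained

print(f'MAE: {mae_demo:.4f}')
print(f'MSE: {mse_demo:.4f}')
print(f'RMSE: {rmse_demo:.4f}')
print(f'R²: {r2_demo:.4f}')
```

MAE and RMSE are in the same units as the target (here, units of $100,000), while R² is unitless.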



### **1. Linear Regression**

**Definition**
So based on what I've found out, this algorithm models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It assumes the relationship is approximately linear and finds the optimal coefficients (weights) that minimize the sum of squared residuals between actual and predicted values. For a single feature, it's a line ($y = mx + b$); for multiple features, it's a hyperplane ($y = w_1x_1 + w_2x_2 + ... + w_nx_n + b$). It's simple, interpretable, and serves as a baseline for regression tasks, but assumes linearity, independence, homoscedasticity, and normality of errors.
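To see "minimizing the sum of squared residuals" in action, the sketch below solves a one-feature least-squares problem with `np.linalg.lstsq`. The noise-free data on the line $y = 2x + 1$ is made up for illustration, so the fit recovers the slope and intercept up to floating-point error.

```python
import numpy as np

# Made-up, noise-free data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_line = 2.0 * x + 1.0

# Design matrix with a column of ones so the intercept b is fitted too
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: the coefficients minimizing ||A @ coeffs - y||^2
coeffs, *_ = np.linalg.lstsq(A, y_line, rcond=None)
slope, intercept = coeffs

print(f'slope m = {slope:.4f}, intercept b = {intercept:.4f}')
```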

**Steps to implement Linear Regression:**
1. Import necessary libraries & prepare Features and Target
2. Split data
3. Train Linear Regression Model
4. Make Predictions
5. Evaluate the model
6. Cross-validation
7. View Coefficients
8. Show first few predictions

#### **1.1 Prepare features and target**

Separates features (predictors) from target (what we're predicting).

- `X`: Feature matrix - all columns used to make predictions (MedInc, HouseAge, etc.)

- `y`: Target vector - the value we want to predict (median house value)

In [None]:
# Step 1: Prepare features and target & import libraries
from sklearn.linear_model import LinearRegression

X = housing_df.drop('target', axis=1)
y = housing_df['target']

#### **1.2 Split the data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### **1.3 Train the Linear Regression Model**

In [None]:
# Step 3: Train Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

#### **1.4 Make predictions**

In [None]:
y_pred = lr.predict(X_test)

#### **1.5 Evaluate model**

In [None]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'RMSE: ${rmse*100000:.2f}')
print(f'MAE: ${mae*100000:.2f}')
print(f'R² Score: {r2:.4f}')

#### **1.6 Cross-validation**

Cross-validation is necessary because:

1. **More reliable estimate:** The single train/test split R² might be lucky or unlucky. CV averages performance across 5 different splits.

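2. **Detects overfitting:** If the CV score is much lower than the training score, the model is overfitting; if it is much lower than the single test score, that test split was probably a lucky one.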

3. **Better use of data:** Every sample gets used for both training and validation.

4. **Stability check:** The standard deviation tells you if performance varies wildly across different data subsets.
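The third point can be checked with a small sketch of how 5-fold splitting assigns rows. It uses plain NumPy on 10 row indices, with consecutive and unshuffled folds as an assumption for the demo.

```python
import numpy as np

n_samples, n_folds = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)  # consecutive, unshuffled folds

validation_counts = np.zeros(n_samples, dtype=int)
for i, val_idx in enumerate(folds):
    # Every fold except the current one forms the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    validation_counts[val_idx] += 1
    print(f'Fold {i + 1}: train={train_idx.tolist()} validate={val_idx.tolist()}')

print('Times each row was used for validation:', validation_counts.tolist())
```

Each row is validated exactly once and used for training in the other four folds.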

In [None]:
cv_scores = cross_val_score(lr, X, y, cv=5, scoring='r2')
print(f'Cross-val R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})')

This means:

- **Average R²:** ~55.3% of variance in house prices is explained by the model across the 5 folds

- **Stability:** ±0.0617 standard deviation - performance varies by about 6 percentage points depending on which data subset is used for validation

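#### **1.7 View Coefficients**

Shows how each feature affects the predicted house value. Because the features are on different scales, coefficient magnitudes are only directly comparable after standardizing the inputs.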

In [None]:
coef_df = pd.DataFrame({'Feature': housing.feature_names, 'Coefficient': lr.coef_})
print(coef_df.sort_values('Coefficient', key=abs, ascending=False))

#### **1.8 Show first few predictions vs actual**

In [None]:
# Step 8: Show first few predictions vs actual
results_df = pd.DataFrame({'Actual': y_test[:5], 'Predicted': y_pred[:5]})
print(results_df)

Despite potentially decent aggregate metrics, the model can be far off on individual cases.

Based on these results, I think linear regression is too rigid for this data. Possible next steps are adding missing features or predicting `log(price)` instead of the raw price.

### **2. Polynomial Regression Algorithm**

**Polynomial Regression** is a form of regression analysis that models the relationship between variables as an $n$-th degree polynomial.

**Steps to implement Polynomial Regression**:
1. Import Libraries
   - Load the California dataset (already done in the previous algorithm)
   - Split the dataset (already done in the previous algorithm)
2. Create Polynomial Features
3. Train the Model
4. Make predictions
5. Evaluate the model

#### **2.1 Import Libraries**

- import `PolynomialFeatures` class, that is essential for transforming input features into polynomial features, enabling polynomial regression
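To see what the transformation produces, the sketch below expands a single made-up row `[2, 3]` to degree 2 and then counts the output columns for 8 input features, the number of features in the housing data. The variable names `poly_demo`, `expanded`, and `wide` are assumptions for this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features a=2 and b=3
poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)  # columns: 1, a, b, a^2, a*b, b^2

# Number of output columns for 8 input features at degree 2
wide = PolynomialFeatures(degree=2).fit(np.zeros((1, 8)))
print(wide.n_output_features_)
```

Going from 8 to 45 columns shows how quickly the feature count grows with the degree, which is one reason higher degrees tend to overfit.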

In [None]:
from sklearn.preprocessing import PolynomialFeatures

#### **2.2 Create Polynomial Features**

In [None]:
degree = 2
poly_features = PolynomialFeatures(degree=degree)

x_train_poly = poly_features.fit_transform(X_train)
x_test_poly = poly_features.transform(X_test)  # reuse the transformer fitted on the training data

#### **2.3 Train the Model**
Create a linear regression model and fit it to the transformed training data.

In [None]:
model = LinearRegression()
model.fit(x_train_poly, y_train)

#### **2.4 Make Predictions**
I will use the trained model to make predictions on the test set.

In [None]:
y_pred = model.predict(x_test_poly)

#### **2.5 Evaluate the Model**
I will use Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the $R^2$ score.

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae:.2f}')
print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R²: {r2:.2f}')


### **3. Lasso Regression**
**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that incorporates L1 regularization. It adds a penalty equal to the absolute value of the magnitude of coefficients to the loss function, which helps prevent overfitting and can lead to sparse models by driving some coefficients to zero. This makes Lasso useful for feature selection in high-dimensional datasets.
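The sketch below illustrates why an L1 penalty produces exact zeros. It writes out the penalized objective $\frac{1}{2n}\lVert y - Xw \rVert^2 + \alpha \lVert w \rVert_1$ and the soft-thresholding step that solves it in the simplest case (orthonormal features). The function names and the numbers are assumptions for illustration; this is not the solver inside `Lasso`.

```python
import numpy as np


def lasso_objective(X, y, w, alpha):
    """(1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1"""
    n = len(y)
    residual = y - X @ w
    return residual @ residual / (2 * n) + alpha * np.sum(np.abs(w))


def soft_threshold(z, alpha):
    """Shrink each value toward zero by alpha; values within alpha of zero become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


# Made-up unpenalized coefficients
raw_coefficients = np.array([3.0, -0.5, 0.2, -2.0])
shrunk = soft_threshold(raw_coefficients, alpha=1.0)
print('Before:', raw_coefficients)
print('After :', shrunk)

# Objective value for a perfect fit: only the penalty term remains
X_small = np.eye(2)
y_small = np.array([1.0, 2.0])
print('Objective:', lasso_objective(X_small, y_small, np.array([1.0, 2.0]), alpha=0.5))
```

The two small coefficients are set exactly to zero while the large ones are only shrunk, which is the feature-selection effect described above.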

**Steps to implement Lasso Regression:**
1. Import Libraries
2. Train the model
3. Make Predictions
4. Evaluate the model

#### **3.1 Import Libraries**

- import `Lasso`: This class is used to implement Lasso Regression, which applies L1 regularization to linear regression models, helping to prevent overfitting and enabling feature selection by shrinking some coefficients to zero.

In [None]:
from sklearn.linear_model import Lasso

#### **3.2 Train the Model**
Create a Lasso regression model and fit it to the training data.

In [None]:
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)

#### **3.3 Make Predictions**
Use the trained Lasso model to make predictions on the test set.

In [None]:
y_pred = lasso_model.predict(X_test)

#### **3.4 Evaluate the Model**
Calculate evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² score

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae:.2f}')
print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R²: {r2:.2f}')