
# Polynomial Regression Assignment

---

In this assignment, you will explore **polynomial regression**, an extension of linear regression models that captures non-linear relationships between variables by adding powers of the original features.
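
As a rough sketch of what "adding powers of the original features" means, consider a single feature $x$ and a chosen degree $d$ (both symbols are introduced here for illustration only). The fitted model then has the form:

$$
\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d
$$

The model is non-linear in $x$ but still linear in the coefficients $\beta_0, \dots, \beta_d$, which is why it can be fitted with the same least-squares machinery as ordinary linear regression. Setting $d = 1$ recovers simple linear regression.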

## Objective
By the end of this assignment, you will:
- Understand the concept of polynomial regression and how it differs from linear regression.
- Learn how to implement polynomial regression in Python.
- Compare the performance of linear and polynomial models on a dataset to observe the impact of model complexity.

---


## About the Dataset
### Manufacturing Data Report
This report presents an analysis of a manufacturing dataset, which simulates real-world data collected from a manufacturing process. The dataset is designed to explore the relationships between various process parameters and product quality. It contains both feature variables that represent process conditions and a target variable that represents the quality rating of the manufactured items.

**This dataset is clean and does not require any preprocessing steps.**

### Features
- **Temperature (°C)**: The temperature during the manufacturing process, measured in degrees Celsius. Temperature plays a critical role in many manufacturing processes, influencing material properties and product quality.

- **Pressure (kPa)**: The pressure applied during the manufacturing process, measured in kilopascals (kPa). Pressure can affect the material transformation and the overall outcome of the manufacturing process.

- **Temperature x Pressure**: An interaction term between temperature and pressure, which captures the combined effect of these two process parameters.

- **Material Fusion Metric**: A derived metric calculated as the sum of the square of temperature and the cube of pressure. It represents a material fusion-related measurement during the manufacturing process.

- **Material Transformation Metric**: Another derived metric, calculated as the cube of temperature minus the square of pressure. It provides insight into material transformation dynamics.

- **Quality Rating**: The target variable. It represents the overall quality rating of the produced items and serves as the measure of the final product's quality.
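
The derived columns described above can be reproduced from temperature and pressure alone. The following is a minimal sketch under the assumption that the column names match the feature names listed in this section; the helper name `add_derived_features` and the input values are hypothetical and are not taken from the manufacturing dataset.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three derived columns added.

    Assumes df has the columns "Temperature (°C)" and "Pressure (kPa)".
    """
    out = df.copy()
    t = out["Temperature (°C)"]
    p = out["Pressure (kPa)"]
    out["Temperature x Pressure"] = t * p                  # interaction term
    out["Material Fusion Metric"] = t**2 + p**3            # T squared plus P cubed
    out["Material Transformation Metric"] = t**3 - p**2    # T cubed minus P squared
    return out


# Made-up values for illustration only; not rows from the real dataset.
example = pd.DataFrame(
    {"Temperature (°C)": [2.0, 200.0], "Pressure (kPa)": [3.0, 10.0]}
)
derived = add_derived_features(example)
print(derived)
```

Because three of the five features are deterministic functions of the other two, strong correlations between feature columns are to be expected when exploring the data.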


## Assignment Instruction Overview

#### 0. Preliminary steps
Set up the environment by importing necessary libraries (such as numpy, pandas, and matplotlib) and ensure the dataset is accessible for loading. This step ensures you have the required tools to work with data and perform polynomial regression.

#### 1. Loading and Visualizing the Data
Load the dataset into a DataFrame and plot it to observe the relationship between features and the target variable. This initial exploration helps determine if the data exhibits a non-linear pattern that may benefit from polynomial regression.

#### 2. Prepare the dataset
Prepare the data by selecting the feature(s) and target variable and splitting the data into training and testing sets. Because the dataset is already clean, handling missing values should not be necessary; scaling is optional. Preparing the dataset carefully is key for obtaining meaningful and consistent results from the model.

#### 3. Fitting Polynomial Regression Models
Transform the features into polynomial terms and fit a polynomial regression model to capture non-linear relationships. Begin with a lower degree and increase it as needed to balance model complexity and accuracy.

#### 4. Visual Comparison
Compare the fitted polynomial regression model with a simple linear regression model visually. Plotting both models against the data helps illustrate how polynomial regression captures patterns that linear regression cannot.


---

### 0. Preliminary steps
Install and import any libraries you may need.

In [None]:
# # Install core libraries for polynomial regression
# !pip install --quiet scikit-learn
# !pip install --quiet matplotlib
# !pip install --quiet pandas
# !pip install --quiet numpy
# !pip install --quiet seaborn

# %matplotlib inline

In [3]:
# Core library imports for polynomial regression on the manufacturing dataset.

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### 1. Loading and visualizing the data
In this step, we will load the dataset, inspect its structure, and create a visualization to understand the relationship between the feature(s) and the target variable.

1. **Load the Dataset**: Use `pandas.read_csv()` or another appropriate method to load the dataset into a DataFrame.

2. **Inspect the Data**: Print out the first few rows of the data using `.head()` to get a sense of its structure and identify any key columns.

3. **Visualize the Data**: Useful libraries for visualization include `matplotlib` and `seaborn`. You can use scatter plots, pair plots, histograms, heatmaps, distribution plots, boxplots, or other relevant plots to explore the data.


In [4]:
DATASET_PATH = Path("manufacturing.csv")

manufacturing_df = pd.read_csv(DATASET_PATH)
print(
    f"Loaded manufacturing dataset with {manufacturing_df.shape[0]:,} rows and {manufacturing_df.shape[1]} columns."
)


In [None]:
column_overview = pd.DataFrame(
    {
        "dtype": manufacturing_df.dtypes,
        "missing_values": manufacturing_df.isna().sum(),
    }
)

column_overview


In [None]:
manufacturing_df.head()


In [None]:
manufacturing_df.describe().T


In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    manufacturing_df.corr(numeric_only=True),
    annot=True,
    cmap="Blues",
    fmt=".2f",
)
plt.title("Correlation Heatmap for Manufacturing Features")
plt.tight_layout()
plt.show()


In [None]:
manufacturing_df.hist(figsize=(12, 8), bins=20)
plt.suptitle("Distribution of Manufacturing Features", y=1.02)
plt.tight_layout()
plt.show()


In [None]:
target_column = "Quality Rating"
feature_columns = [
    "Temperature (°C)",
    "Pressure (kPa)",
    "Temperature x Pressure",
    "Material Fusion Metric",
    "Material Transformation Metric",
]

fig, axes = plt.subplots(len(feature_columns), 1, figsize=(6, 18), sharey=True)

for ax, feature in zip(axes, feature_columns):
    sns.scatterplot(
        data=manufacturing_df,
        x=feature,
        y=target_column,
        ax=ax,
        alpha=0.6,
    )
    ax.set_title(f"{feature} vs. {target_column}")

fig.suptitle("Feature Relationships with Quality Rating", y=0.92)
plt.tight_layout()
plt.show()


### 2. Prepare the dataset

1. **Choose a feature**: Choose one column from the dataset to use as an independent feature for the regression model. Explain your choice based on the findings from the previous step.

2. **Split the Data**: Use `train_test_split()` from `sklearn.model_selection` to divide the data into training and testing sets. This allows us to assess the model's ability to generalize to new data.
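
The two steps above might look roughly like the following sketch. It uses synthetic data rather than the manufacturing dataset, so the feature values, the target formula, the 80/20 split ratio, and the `random_state` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one chosen feature and the target (illustration only).
rng = np.random.default_rng(42)
x = rng.uniform(100, 300, size=200)
y = 0.002 * (x - 200) ** 2 + rng.normal(0, 1, size=200)

# scikit-learn estimators expect a 2-D feature matrix of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# Hold out 20% of the rows for testing; fixing random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"train: {X_train.shape}, test: {X_test.shape}")
```

With a DataFrame, the same reshaping can be avoided by selecting the feature with double brackets (for example `df[["some column"]]`), which already yields a 2-D object.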

In [78]:
# TODO: add your code; add more cells as needed; add explanations

### 3. Fitting Polynomial Regression Models

1. **Polynomial features**: Transform your independent feature to include polynomial terms. Use `PolynomialFeatures` from `sklearn.preprocessing` to generate polynomial terms up to the chosen degree. 
2. **Find the best degree**: Find the best polynomial degree for the independent feature. Start with a low-degree polynomial (e.g., degree 2) and increase the degree to see how the fit changes; this is best done in a loop. There is usually a sweet spot where the MSE is lowest. Evaluate the fit by measuring the Mean Squared Error with `mean_squared_error` from `sklearn.metrics`. You can also experiment with other metrics such as `r2_score`.
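
One possible shape for the degree search is sketched below. It runs on synthetic data with a cubic ground truth (an assumption, not the manufacturing dataset), and it wraps `PolynomialFeatures` and `LinearRegression` in `make_pipeline` from `sklearn.pipeline` for convenience; the degree range 1 to 8 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship plus noise (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(0, 1.0, size=300)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one model per degree and record the MSE on the held-out split.
test_mse = {}
for degree in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: test MSE = {test_mse[degree]:.3f}")

# The degree with the lowest held-out MSE is the candidate "sweet spot".
best_degree = min(test_mse, key=test_mse.get)
print(f"best degree: {best_degree}")
```

Setting `include_bias=False` avoids a redundant constant column, since `LinearRegression` fits its own intercept. Comparing MSE on held-out data rather than on the training data matters here, because training error keeps shrinking as the degree grows.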

In [None]:
# TODO: add your code; add more cells as needed

### 4. Visual Comparison
1. **Fit Linear Regression Model**: First, fit a simple linear regression model on the same data as a baseline.

2. **Overlay Polynomial and Linear Predictions**: On a scatter plot of the data, plot the predictions from both the polynomial and linear models to compare.

3. **Observe Differences**: The polynomial model should capture non-linear patterns in the data that the linear model cannot. Keep in mind that raising the degree beyond the sweet spot found in the previous step tends to fit noise rather than the underlying pattern.
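
The overlay described above could be sketched as follows. The data is synthetic with a quadratic ground truth (an assumption for illustration), both models are fitted on all points for brevity, and the degree, colors, and axis labels are placeholders rather than choices prescribed by the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship plus noise (illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(0, 0.5, size=200)
X = x.reshape(-1, 1)

# Baseline linear model and a degree-2 polynomial model on the same data.
linear_model = LinearRegression().fit(X, y)
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

# Predict on an evenly spaced, sorted grid so the fitted curves draw smoothly.
x_grid = np.linspace(x.min(), x.max(), 200).reshape(-1, 1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, alpha=0.4, label="data")
ax.plot(x_grid, linear_model.predict(x_grid), color="tab:red", label="linear")
ax.plot(x_grid, poly_model.predict(x_grid), color="tab:green", label="polynomial (degree 2)")
ax.set_xlabel("feature (synthetic)")
ax.set_ylabel("target (synthetic)")
ax.set_title("Linear vs. polynomial fit")
ax.legend()
plt.show()
```

Predicting on a sorted grid instead of the raw, unsorted feature values keeps the line plots from zigzagging across the figure.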


In [83]:
# TODO: add your code; add more cells as needed