# Report: Wine Quality #
---

## Data Preprocessing ##
---

**Purpose:** Make sure that the model isn't learning garbage. <br>
**Check:** Missing values, outliers, scaling, skewed variables.

**Missing Values:**<br>
The wine dataset has no null/missing values. 

**Outliers:**<br>
The points outside the boxplot whiskers are values that lie more than 1.5 × IQR below the first quartile or above the third quartile. The boxplot flags these as potential outliers, but in this dataset most of them are not errors or anomalies. The wine chemistry variables (such as sulphates, residual sugar, and chlorides) are naturally skewed (their values are not evenly distributed around the center). Because of this skewness, the standard boxplot rule labels a large number of legitimate measurements as outliers.

The observations are retained to preserve the true variability of the data. Removing them could distort the relationships between variables and harm the regression analysis. The skewed distributions instead motivate transformed regression models (log, square-root, Box-Cox, or Yeo-Johnson transformations), which can stabilize variance and improve model fit without discarding valid data.
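
The 1.5 × IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not code from `project1_utils.py`: the function name `iqr_outliers`, the quartile convention, and the toy values are assumptions made for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (flagged values, (lower fence, upper fence)) under the k x IQR rule."""
    # Quartiles by linear interpolation. Plotting libraries may use a slightly
    # different quartile convention, so their fences can differ a little.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

# Toy right-skewed sample (made-up numbers, not the wine data).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged, fences = iqr_outliers(sample)
print("flagged:", flagged, "fences:", fences)
```

On a skewed variable the upper fence sits close to the bulk of the data, so a long right tail is flagged even when every value is a valid measurement.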

## Exploratory Data Analysis (EDA) ##
---

We use the EDA to justify our modeling choices later on.

**Check:** 
* Variable relationships with target:
    * Correlation checks showed that several predictors (particularly alcohol, volatile acidity, and sulphates) have clear relationships with wine quality. However, some relationships display curvature, suggesting violations of the linearity assumption.
* Data shape diagnostic:
    * QQ plots are used to check if each variable is normally distributed.
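    * Linear regression does not strictly require normally distributed predictors, but heavily skewed predictors often go along with nonlinear relationships and high-leverage points that strain the regression assumptions.
    * The QQ plots reveal skewness, which signals that a transformation may be needed.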
    * Histograms indicated multiple predictors are right-skewed, motivating the use of log and Box–Cox transformations.
* Predictor correlation:
    * Correlation matrices are used to check for multicollinearity; a predictor pair with |r| > 0.7 is treated as a warning sign.
    * The correlation matrix revealed substantial multicollinearity among acidity-related variables, justifying the application of Ridge and Lasso regression to stabilize estimates and perform feature selection.
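
The |r| > 0.7 screen described above can be sketched as follows. This is an illustrative version only, assuming the predictors are available as plain Python lists; `pearson`, `flag_collinear`, and the toy columns are hypothetical names and values, not taken from the project code or the wine data.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear(columns, threshold=0.7):
    """List (name_a, name_b, r) for every predictor pair with |r| > threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Toy columns (made-up values, not the wine data).
toy = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],  # nearly a multiple of x1
    "x3": [3, 1, 4, 1, 5],
}
for name_a, name_b, r in flag_collinear(toy):
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```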


### Quality of Fit
**Using project1_utils.py**

| Metric Type              | R² Value | What It Measures                                      | Verdict |
|--------------------------|----------|--------------------------------------------------------|---------|
| In-Sample R²             | 0.361    | Fit on the same data used to train the model           | Moderate fit; may be optimistic but not excessive |
| Validation R² (80–20)    | 0.4032   | Predictive performance on unseen test data             | Consistent with training → model generalizes reasonably |
| 5-Fold Cross-Validation  | 0.3424   | Average performance across multiple resampled splits   | Most reliable estimate; confirms only moderate explanatory power |

Because the three values fall within about 0.06 of each other, the model appears to generalize fairly well.
Together they indicate that the model explains roughly 34% to 40% of the variability in wine quality.
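
For reference, the 5-fold resampling behind the cross-validation row can be sketched as below. This is a generic outline, not the routine in `project1_utils.py`: the function name `k_fold_indices`, the shuffling, and the seed are assumptions. Each row lands in the test fold exactly once, and the reported score would be the average of the five per-fold R² values.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i, test in enumerate(folds):
        # Training set = every fold except the current test fold.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print(len(train), "training rows,", len(test), "test rows")
```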

### Feature Selection: 

Features (updated ranking):
* fixed acidity
* alcohol
* sulphates
* volatile acidity
* density
* total sulfur dioxide
* chlorides
* free sulfur dioxide
* pH

## Evaluate Models ##
---
**Using WineQualityRegression.scala**

### Linear Regression 

| Metric | In-Sample (Full) | Train-Test Split (80/20) | 5-Fold Cross-Validation (Avg) |
| :--- | :--- | :--- | :--- |
| **$R^2$** | 0.352067 | 0.352067 | 0.3381522 |
| **Adj. $R^2$** | 0.347987 | 0.347987 | 0.270089 |
| **RMSE** | 0.649844 | 0.649844 | 0.655629 |
| **MAPE** | 9.29% | 9.29% | 9.39% |

Linear Regression (LR) provides a baseline model linking wine chemistry to quality, but its moderate R² suggests that a linear structure captures the relationship only partially. Some predictors also add little explanatory value, which motivates regularization and feature-selection methods (Ridge, Lasso, and Stepwise) to test whether a simpler or more stable model can perform as well or better.
The R² indicates that approximately 35% of the variation in wine quality is explained by the measured chemical properties. That is a meaningful relationship, but it also means that most of the variation is left to unmeasured factors or noise.
Five-fold cross-validation performed similarly across folds, with an average R² of 0.338 against 0.352 in-sample, so the model generalizes fairly well and is not severely overfit.
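
The four metrics in the table follow standard definitions, sketched below. The function name `regression_metrics` is hypothetical, and the conventions (dividing by n in RMSE, n − p − 1 degrees of freedom in adjusted R²) are the common textbook ones; the library used in `WineQualityRegression.scala` may use slightly different conventions.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute R^2, adjusted R^2, RMSE and MAPE (in percent)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((a - mean_y) ** 2 for a in y_true)             # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(sse / n)
    # MAPE divides by the true value, so it needs y_true != 0
    # (quality scores are positive, so that holds here).
    mape = 100 * sum(abs((a - b) / a) for a, b in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mape": mape}

# Toy example: predicting the mean for every row gives R^2 = 0.
print(regression_metrics([4.0, 6.0, 8.0], [6.0, 6.0, 6.0], n_predictors=1))
```

Adjusted R² penalizes each extra predictor, which is why it sits below R² in every column of the table.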

### Ridge Regression 

| Model  | R²  | RMSE |
| ------ | --- | ---- |
| Linear | 0.352067 | 0.649844  |
| Ridge  | 0.351946 | 0.649905  |

Ridge Regression (RR) tests whether regularization reduces overfitting risk. It was applied to address potential multicollinearity by shrinking coefficient magnitudes with L2 regularization (λ = 0.1).
Predictive performance stayed essentially the same as the baseline linear regression (R² = 0.352, RMSE = 0.650), which indicates that multicollinearity was not substantially degrading model generalization. Regularization therefore provided little benefit for this dataset: the added bias was not offset by a meaningful reduction in variance.
Although a few predictor pairs were strongly correlated, the vast majority of variables are fairly distinct, so these RR results are plausible.
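
To make the L2 shrinkage concrete, here is a small sketch of the closed-form ridge solution (XᵀX + λI)b = Xᵀy for two predictors. It assumes centered predictors and no intercept, and it is not the implementation used in `WineQualityRegression.scala`, which may scale λ differently. The function name `ridge_2d` and the toy data are made up for the illustration.

```python
def ridge_2d(x1, x2, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for two centered predictors, no intercept."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to 0 when x1 and x2 are nearly collinear
    b1 = (t1 * s22 - t2 * s12) / det
    b2 = (t2 * s11 - t1 * s12) / det
    return b1, b2

# Toy, nearly collinear predictors (made-up values, not the wine data).
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.9, -1.1, 0.1, 0.9, 2.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

for lam in (0.0, 0.1, 10.0):
    print(lam, ridge_2d(x1, x2, y, lam))
```

Setting `lam=0.0` recovers ordinary least squares; larger values pull the coefficient vector toward zero and make the system better conditioned.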

### Lasso Regression

| Model  | R²       | RMSE     | # Predictors |
| ------ | -------- | -------- | ------------ |
| Linear | 0.352067 | 0.649844 | 11           |
| Ridge  | 0.351946 | 0.649905 | 11           |
| Lasso  | 0.351005 | 0.650376 | 8            |

Lasso regression applies L1 regularization (λ = 0.1), which can shrink coefficients to exactly zero and so performs automatic **feature selection**. Predictors whose coefficients reach zero are dropped because they add no useful signal beyond the remaining predictors.
The method removed residual sugar, free sulfur dioxide, and total sulfur dioxide from the model, reducing the predictor set from 11 to 8 with almost no change in predictive accuracy (R² = 0.351, RMSE = 0.650). This suggests that these variables contribute little information for predicting wine quality and that a simpler model can achieve comparable performance.
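
Why L1 regularization yields coefficients of exactly zero while L2 only shrinks them can be seen in the textbook special case of orthonormal predictors. Assuming the objectives ½‖y − Xb‖² + λ‖b‖₁ (lasso) and ½‖y − Xb‖² + (λ/2)‖b‖² (ridge), each lasso coefficient is the soft-thresholded least-squares coefficient, and each ridge coefficient is the least-squares coefficient divided by 1 + λ. The sketch below illustrates that special case only; the function names and sample coefficients are hypothetical, and the wine predictors are not orthonormal.

```python
def soft_threshold(b_ols, lam):
    """Lasso solution for one coefficient under orthonormal predictors."""
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0  # small coefficients are set to exactly zero

def ridge_shrink(b_ols, lam):
    """Ridge solution for one coefficient under orthonormal predictors."""
    return b_ols / (1.0 + lam)  # shrinks, but never reaches zero for b_ols != 0

# Made-up least-squares coefficients, for illustration only.
for b in (0.05, -0.08, 0.50, -1.20):
    print(b, soft_threshold(b, 0.1), ridge_shrink(b, 0.1))
```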

### Recap

| Model  | Question Answered                       |
| ------ | --------------------------------------- |
| Linear | What relationships exist?               |
| Ridge  | Does multicollinearity hurt prediction? |
| Lasso  | Which variables actually matter?        |

### Transformed Regression Comparison

| Model | R² | RMSE | MAPE | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Baseline (OLS)** | **0.352067** | **0.649844** | **9.29%** | **Best** |
| Sqrt Transform | 0.350789 | 0.650485 | 9.25% | Worse |
| Box-Cox | 0.351008 | 0.650375 | 9.25% | Worse |
| Yeo-Johnson | 0.351205 | 0.650276 | 9.26% | Worse |
| Log1p | 0.348866 | 0.651447 | 9.23% | Worse |


| Observation                                                       | Meaning                                                    |
| ----------------------------------------------------------------- | ---------------------------------------------------------- |
| All transformed models have very similar R² (about 0.349 to 0.351) | Transformations did **not improve explanatory power**      |
| RMSE values differ by only ~0.001                                 | Prediction accuracy is basically unchanged                 |
| MAPE values remain between 9.2–9.3%                               | Relative prediction error stayed the same                  |
| OLS already had the best R²                                       | The original scale already fit the linear assumptions well |

Several transformation strategies (square-root, log1p, Box–Cox, and Yeo–Johnson) were tested to address possible nonlinearity and skewness in the predictors. All transformed models performed almost identically to the original linear regression, with R² values around 0.35 and RMSE near 0.65; MAPE was marginally lower (9.23% to 9.26% versus 9.29%), but the difference is negligible. This suggests that the original variables already satisfied the assumptions of linear regression reasonably well and that transformation did not improve predictive accuracy in a meaningful way. The baseline OLS model is therefore preferred: it performs comparably and is easier to interpret.
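
For reference, the four transformations have the standard definitions sketched below for a single value and a fixed λ. This is a plain-Python illustration rather than the code used in the project; in practice λ for Box–Cox and Yeo–Johnson is usually estimated from the data, and the λ values shown here are arbitrary.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform; defined only for strictly positive x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform; extends Box-Cox to zero and negative x."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)

# Made-up right-skewed values, for illustration only.
values = [0.5, 1.0, 2.0, 4.0, 16.0]
print("sqrt:       ", [math.sqrt(v) for v in values])
print("log1p:      ", [math.log1p(v) for v in values])
print("Box-Cox:    ", [box_cox(v, 0.5) for v in values])
print("Yeo-Johnson:", [yeo_johnson(v, 0.5) for v in values])
```

All four compress large values more than small ones, which is what pulls in a long right tail.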

### Symbolic Ridge Regression

Symbolic Ridge Regression (SymRidge) was evaluated to capture potential nonlinear relationships through automatic feature expansion. However, the model was numerically unstable: it produced extremely large coefficient estimates, undefined standard errors (NaN), and a strongly negative R². These results suggest that the expanded feature space introduced extreme multicollinearity and ill-conditioned matrix calculations (most likely at the inversion step) that the regularization parameter could not control. SymRidge underperformed every simpler model, which shows that increasing model complexity without sufficient data or scaling can severely degrade performance and that the added complexity is not supported by this data.
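
One way feature expansion can create this kind of multicollinearity is sketched below: on an unscaled positive variable, a squared term is almost perfectly correlated with the original. This is only an illustration of the general mechanism. It does not reproduce SymRidge's actual feature set, and the evenly spaced toy values are made up, not taken from the wine data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy predictor on a narrow positive range (8.0, 8.5, ..., 15.0).
x = [8.0 + 0.5 * i for i in range(15)]
raw_r = pearson(x, [v ** 2 for v in x])

# Centering before squaring removes most of the overlap. Here it removes all
# of it because the toy values are symmetric around their mean; real data
# would only see a reduction.
mean_x = sum(x) / len(x)
xc = [v - mean_x for v in x]
centered_r = pearson(xc, [v ** 2 for v in xc])

print("corr(x, x^2) unscaled:", raw_r)
print("corr(x, x^2) centered:", centered_r)
```

When many such near-duplicate columns enter the design matrix, XᵀX becomes close to singular, which is consistent with the huge coefficients and NaN standard errors reported above.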