# Report #
---

## Data Preprocessing ##

**Missing Values:**<br>
The wine dataset has no null/missing values. 

**Outliers:**<br>
The points plotted beyond the whiskers of the boxplots are values that lie more than 1.5 × IQR below the first quartile or above the third quartile. The boxplot flags these as potential outliers, but in this dataset many of them are not errors or anomalies. The wine chemistry variables (such as sulphates, residual sugar, and chlorides) are naturally skewed (their values are not evenly distributed around the center). Because of this skewness, the standard boxplot rule flags a large number of observations even though they are legitimate measurements. These observations are retained to preserve the true variability of the data, since removing them could distort the relationships between variables and negatively affect the regression analysis. The skewed distributions instead motivate transformed regression models (log, square-root, Box-Cox, or Yeo-Johnson transformations), which help stabilize variance and can improve model fit without discarding valid data.
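
As a rough illustration of why skewness inflates the outlier count, the sketch below applies the 1.5 × IQR rule to a synthetic right-skewed sample and to its log-transformed values. The sample is simulated (it is not the wine data), and the helper name `iqr_outliers` is made up for this example.

```python
import math
import random
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Synthetic right-skewed sample (lognormal), standing in for a skewed
# chemistry variable; this is NOT the wine dataset.
random.seed(0)
skewed = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
logged = [math.log(v) for v in skewed]

n_raw = len(iqr_outliers(skewed))
n_log = len(iqr_outliers(logged))
print(f"flagged on raw scale: {n_raw}, flagged after log transform: {n_log}")
```

The same rows are present in both cases; only the scale changes, which is why the flagged points are treated here as a property of the distribution rather than as bad measurements.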

## Exploratory Data Analysis (EDA) ##

**Correlation Matrix:**<br>
A correlation matrix reports the pairwise linear correlation between variables; a correlation of ±1 indicates perfect collinearity. For the wine data, the following pairs show the strongest correlations.
<br>Strong positive relationships (all around 0.6):<br>
-Citric acid & fixed acidity<br>
-Density & fixed acidity<br>
-Free sulfur dioxide & total sulfur dioxide
<br>Strong inverse relationships (all fall around -0.6):<br>
-Fixed acidity & pH<br>
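
A short sketch of how such pairs can be pulled out of a correlation matrix programmatically. The data below is synthetic and only imitates the direction of the fixed acidity and pH relationship; the function name `strong_pairs` and the 0.6 cutoff are assumptions made for this example.

```python
import numpy as np

def strong_pairs(data, names, threshold=0.6):
    """List (name_i, name_j, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs

# Synthetic stand-ins (NOT the wine data): pH moves opposite to fixed
# acidity, while alcohol is generated independently of both.
rng = np.random.default_rng(0)
n = 500
fixed_acidity = rng.normal(8.0, 1.5, n)
ph = 3.3 - 0.08 * fixed_acidity + rng.normal(0.0, 0.1, n)
alcohol = rng.normal(10.0, 1.0, n)

names = ["fixed acidity", "pH", "alcohol"]
data = np.column_stack([fixed_acidity, ph, alcohol])
pairs = strong_pairs(data, names)
for a, b, r in pairs:
    print(f"{a} & {b}: r = {r}")
```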

## Regularized Regression Comparison

| Model              | R² Value | Effect on Coefficients                         | Interpretation | Verdict |
|--------------------|----------|------------------------------------------------|---------------|---------|
| Linear Regression  | 0.361    | No shrinkage                                   | Baseline model; susceptible to multicollinearity | Reference |
| Ridge Regression   | 0.2529   | Shrinks coefficients but keeps all variables    | Reduces instability caused by correlated predictors | More stable but less explanatory |
| Lasso Regression   | 0.0339   | Strong shrinkage; forces some features toward zero | Indicates many predictors carry overlapping information | Acts like feature selection |

The decrease in R² for Ridge and Lasso is expected, since regularization introduces bias in order to reduce variance and stabilize coefficient estimates under multicollinearity. Lasso forces the coefficients/weights of less important, noisy, or redundant features to 0, which effectively performs feature selection. Its substantial reduction in R² suggests that several predictors contribute redundant information, but a drop this large also reflects heavy shrinkage, which implies a high penalty parameter (lambda).
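
The shrinkage behaviour described above can be sketched with plain NumPy. This is a minimal illustration on synthetic, deliberately collinear data, not the code that produced the table; the penalty values and the hand-written `ridge` and `lasso` helpers are assumptions made for the example.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^-1 X'y for centered data."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ b + X[:, j] * b[j]  # residual ignoring feature j
            rho = X[:, j] @ partial / n
            # Soft-thresholding: small effects are set exactly to zero.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

# Synthetic data (NOT the wine data): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so the penalty treats columns equally
y = x1 + 0.5 * x3 + rng.normal(size=n)
y = y - y.mean()

fits = {
    "ols": ridge(X, y, 0.0),
    "ridge (lam=300)": ridge(X, y, 300.0),
    "lasso (lam=0.1)": lasso(X, y, 0.1),
    "lasso (lam=5.0)": lasso(X, y, 5.0),
}
for name, coef in fits.items():
    print(f"{name:16s} coef={np.round(coef, 3)} R2={r2(y, X @ coef):.3f}")
```

With a penalty larger than every feature's covariance with the response, Lasso zeroes all coefficients and R² collapses to 0, which is the limiting case of the pattern in the table: a very low Lasso R² is at least partly a statement about lambda, not only about the predictors.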

## Transformed Regression Comparison

| Transformation | R² Value | Purpose of Transformation                     | Interpretation | Verdict |
|----------------|----------|-----------------------------------------------|---------------|---------|
| None (Linear)  | 0.361    | Baseline                                      | Moderate fit   | Reference |
| √(y)           | 0.3524   | Reduce right-skewness                         | Slightly lower than baseline | Limited impact |
| log1p(y)       | 0.3455   | Stabilize variance                            | Similar performance | Skew not primary issue |
| Box–Cox        | 0.3546   | Data-driven normalization                      | Slight change only | Distribution not main limitation |
| Yeo–Johnson    | 0.3545   | Handles skew with nonpositive values allowed   | Nearly identical to Box–Cox | Confirms limited transformation benefit |

Response transformations produced only marginal changes in R², suggesting that non-normality was not the primary limitation of the linear model. Instead, predictor dependence (multicollinearity) appears to play a larger role in restricting model performance.
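
One caveat when comparing these numbers is that R² computed on a transformed response is measured on a different scale than the baseline. The sketch below, on synthetic data, fits the same linear model to y, √y, and log1p(y) and reports R² both on the transformed scale and after back-transforming the predictions. It is an assumption-laden illustration, not the procedure used for the table; Box-Cox and Yeo-Johnson are left out because they require an estimated lambda.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_predict(X, y):
    """OLS with intercept via least squares; returns fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

# Each entry is (forward transform, inverse transform).
transforms = {
    "none": (lambda v: v, lambda z: z),
    "sqrt": (np.sqrt, np.square),
    "log1p": (np.log1p, np.expm1),
}

# Synthetic response on a 3-8 scale, loosely imitating a quality score.
# This is NOT the wine data.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = np.clip(np.round(5.6 + 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.65, n)), 3, 8)

scores = {}
for name, (forward, inverse) in transforms.items():
    fitted = fit_predict(X, forward(y))
    scores[name] = {
        "transformed_scale": r2(forward(y), fitted),
        "original_scale": r2(y, inverse(fitted)),
    }
    print(name, {k: round(float(v), 4) for k, v in scores[name].items()})
```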


## Quality of Fit
(Results from project1_utils.py)

| Metric Type              | R² Value | What It Measures                                      | Verdict |
|--------------------------|----------|--------------------------------------------------------|---------|
| In-Sample R²             | 0.361    | Fit on the same data used to train the model           | Moderate fit; may be optimistic but not excessive |
| Validation R² (80–20)    | 0.4032   | Predictive performance on unseen test data             | Consistent with training → model generalizes reasonably |
| 5-Fold Cross-Validation  | 0.3424   | Average performance across multiple resampled splits   | Most reliable estimate; confirms only moderate explanatory power |

Because these three values are close together (about 0.34 to 0.40), the model appears to generalize fairly well.
Together they indicate that the model explains roughly 34% to 40% of the variability in wine quality.
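
For reference, the three estimates can be computed along the following lines. This is a self-contained NumPy sketch on synthetic data and is not the code in project1_utils.py; the helper names, the split seed, and the simulated coefficients are all assumptions.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for a linear model with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def r2(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def holdout_r2(X, y, test_frac=0.2, seed=0):
    """Fit on a random (1 - test_frac) share, score on the remainder."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    beta = ols_fit(X[train], y[train])
    return r2(y[test], ols_predict(beta, X[test]))

def kfold_r2(X, y, k=5, seed=0):
    """Return the k out-of-fold R² scores."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = ols_fit(X[train], y[train])
        scores.append(r2(y[fold], ols_predict(beta, X[fold])))
    return scores

# Synthetic data (NOT the wine data) with a moderate signal-to-noise ratio.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([0.6, -0.4, 0.2]) + rng.normal(0.0, 0.8, 400)

in_sample = r2(y, ols_predict(ols_fit(X, y), X))
holdout = holdout_r2(X, y)
cv = kfold_r2(X, y)
print(f"in-sample={in_sample:.3f} holdout={holdout:.3f} 5-fold mean={np.mean(cv):.3f}")
```

A single 80-20 split depends on which rows land in the test set, so it can come out above the in-sample value by chance; averaging over folds smooths that out, which is why the cross-validated figure is treated as the most reliable of the three.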

## Statistical Summaries and Interpretation

**Significance of Predictors (based on p-values):**

| Predictor                | p-value Significance | Verdict |
|--------------------------|----------------------|-----------------------|
| Volatile Acidity         | Significant          | Important predictor   |
| Chlorides                | Significant          | Important predictor |
| Total Sulfur Dioxide     | Significant          | Important predictor |
| Sulphates                | Significant          | Important predictor |
| Alcohol                  | Significant          | Strongest predictor |
| pH                       | Significant (weaker) | Moderately important |
| Free Sulfur Dioxide      | Marginally significant | Minor contributor |
| Fixed Acidity            | Not Significant      | Candidate for removal |
| Citric Acid              | Not Significant      | Redundant variable |
| Residual Sugar           | Not Significant      | Weak predictor |
| Density                  | Not Significant      | Multicollinearity indicator |

Each p-value answers the question: “Does this variable still matter when all the other variables are included?”
The regression p-values therefore identify variables with unique explanatory power beyond what the other predictors already provide.
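
The "unique explanatory power" reading can be illustrated with a small NumPy sketch. It fits OLS on synthetic data in which `x2` is almost a copy of `x1`, and computes standard errors and approximate two-sided p-values. The variable names are hypothetical, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import numpy as np

def ols_summary(X, y, names):
    """OLS with intercept; returns {name: (coef, std_err, p_value)}.

    p-values use a normal approximation to the t distribution, so they
    are only trustworthy when n is large relative to the predictor count.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    out = {}
    for name, b, s in zip(["intercept"] + list(names), beta, se):
        p = math.erfc(abs(b / s) / math.sqrt(2.0))  # two-sided normal tail
        out[name] = (float(b), float(s), p)
    return out

# Synthetic data (NOT the wine data): x2 is nearly collinear with x1,
# x3 is independent of both.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 0.3 * x1 + 0.5 * x3 + rng.normal(size=n)

summary = ols_summary(np.column_stack([x1, x2, x3]), y, ["x1", "x2", "x3"])
for name, (coef, se, p) in summary.items():
    print(f"{name:9s} coef={coef:+.3f} se={se:.3f} p={p:.3g}")
```

The collinear pair ends up with much larger standard errors than `x3`, so a predictor that is correlated with the response can still show a weak p-value once its near-duplicate is in the model. That is one way to read the "Not Significant" rows for variables such as density and citric acid above.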

## Feature Selection (Forward, Backward, Stepwise) ##
fill

## Discussion of Results ##
fill
