
# Comprehensive EDA Report

*(All recommendations are justified by the EDA results above.)*

---

## 0. Executive summary 

In the red-wine quality dataset, the quality classes are imbalanced, and **quality is driven by higher alcohol and balanced acidity** (low volatile acidity plus moderate fixed/citric acid), with supporting contributions from sulphates and density. Use **classification** models (multiclass), engineer **acidity-ratio** and **interaction** features, scale/transform skewed variables, treat outliers, and prefer tree-based classifiers for production-ready performance. All decisions below are tied to the EDA evidence.

---

## 1. Univariate findings

### Finding (evidence)

* Several variables are **right-skewed**: histograms and boxplots show skew in residual sugar, total SO₂, and chlorides.
* Boxplots show outliers in the sulfur-related and acidity measures.

### Recommendations & rationale

1. **Log-transform skewed variables** (e.g., `residual sugar`, `total sulfur dioxide`, `chlorides`).

   * *Rationale:* Skewed features exert asymmetric influence on distance-based and linear models and have unstable variance. The EDA showed these variables are right-skewed; log(1+x) reduces the influence of extremes and makes the distributions closer to Gaussian.
   * *Action:* `df['residual sugar_log'] = np.log1p(df['residual sugar'])` etc.

2. **Outlier handling (Winsorize or IQR-capping) for extreme acidity/sulphates**.

   * *Rationale:* Violin/boxplots show extreme tails. Outliers can distort many algorithms (especially KNN, SVM, linear models).
   * *Action:* Cap values at `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`, or Winsorize the top/bottom 1–2%.

3. **Scaling**

   * *Rationale:* Features have very different ranges (alcohol ~10–12, pH ~3.x, SO₂ in the tens to low hundreds). Distance-based and regularized models need features on a comparable scale.
   * *Action:* Use `StandardScaler` for models such as logistic regression or SVM, and `RobustScaler` if outliers remain. Tree models do not require scaling, but keeping engineered features on a common scale makes them easier to compare.

4. **Target handling: classification labels**

   * *Rationale:* `quality` is discrete (3–8) and has already been mapped to categories (Low/Medium/High), so a classification pipeline is appropriate.
   * *Action:* For multiclass classification, either use the integer labels (3–8) with a multiclass classifier, or use the grouped labels (`Low/Medium/High`) if a coarser prediction is preferred.

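The log transform and IQR capping described above can be sketched as follows. This is a minimal illustration, not the report's actual preprocessing code: the helper names `add_log_features` and `iqr_cap` are hypothetical, and the four-row frame holds made-up values. In a real workflow the capping bounds would presumably be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Columns the EDA flagged as right-skewed (names follow the red-wine CSV).
SKEWED_COLS = ["residual sugar", "total sulfur dioxide", "chlorides"]


def add_log_features(df: pd.DataFrame, cols=SKEWED_COLS) -> pd.DataFrame:
    """Return a copy with a log(1 + x) version of each listed column."""
    out = df.copy()
    for col in cols:
        out[f"{col}_log"] = np.log1p(out[col])
    return out


def iqr_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up illustration data, not values taken from the report.
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.2, 2.6, 15.5],
    "total sulfur dioxide": [34.0, 60.0, 54.0, 289.0],
    "chlorides": [0.076, 0.098, 0.092, 0.611],
    "sulphates": [0.56, 0.68, 0.65, 2.00],
})
toy = add_log_features(toy)
toy["sulphates_capped"] = iqr_cap(toy["sulphates"])
```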
---

## 2. Bivariate findings

### Key EDA evidence

* **Correlation**: `alcohol +0.48`, `volatile acidity −0.39`, `sulphates +0.25`, `citric acid +0.23`.
* **Visuals**: Box/violin plots showed medians shift with quality (alcohol medians increase; volatile acidity medians decrease).

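A correlation ranking like the one quoted above could be produced along these lines. The helper name `rank_by_target_corr` is hypothetical and the six rows are made-up illustration data, so the resulting numbers will not match the report's figures.

```python
import pandas as pd


def rank_by_target_corr(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each numeric feature with the target, sorted by |r|."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up illustration data, not rows from the actual dataset.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "volatile acidity": [0.88, 0.70, 0.60, 0.50, 0.45, 0.40],
    "residual sugar": [2.0, 2.6, 1.9, 2.4, 2.1, 2.3],
    "quality": [3, 4, 5, 6, 7, 8],
})
ranking = rank_by_target_corr(toy)
```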
### Recommendations & rationale

1. **Keep high-signal features**: `alcohol`, `volatile acidity`, `sulphates`, `citric acid`, `density`, `chlorides`.

   * *Rationale:* Ranked by correlation and supported by grouped means and visual separation.

2. **Evaluate and drop/deprioritize low-signal features**: `residual sugar`, `free sulfur dioxide` (very low correlations with quality: ~0.01 and −0.05).

   * *Rationale:* Features with a weak linear relationship to quality may add noise or encourage overfitting. Keep them only if domain knowledge says otherwise, or if new features involving them show predictive gain.

3. **Check multicollinearity** (VIF) among acidity measures and pH.

   * *Rationale:* `fixed acidity` and `pH` are chemically related; using them raw together can inflate VIF. If VIF > 5–10, consider replacing with ratios (see next section).
   * *Action:* Compute VIF and respond accordingly; if high, replace with engineered acidity ratios.

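The VIF check can be sketched with plain NumPy using the definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. The function name `compute_vif` is hypothetical, and the data are synthetic with an invented fixed acidity/pH relationship purely for demonstration; statsmodels also provides a `variance_inflation_factor` helper that could be used instead.

```python
import numpy as np
import pandas as pd


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on all other columns."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)


# Synthetic data: pH is constructed to depend on fixed acidity (an assumption
# made only to show a high VIF), while alcohol is independent of both.
rng = np.random.default_rng(0)
n = 200
fixed = rng.normal(8.3, 1.7, n)
synthetic = pd.DataFrame({
    "fixed acidity": fixed,
    "pH": 3.9 - 0.07 * fixed + rng.normal(0, 0.05, n),
    "alcohol": rng.normal(10.4, 1.0, n),
})
vif = compute_vif(synthetic)
```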
---

## 3. Trivariate & grouped multivariate findings 

### Core evidence

* **Interaction pattern:** high `alcohol` + low `volatile acidity` → highest quality (grouped means & trivariate visuals).
* **Group means:** `alcohol` increased ~9.96 → 12.09 from quality 3 → 8; `volatile acidity` decreased ~0.88 → 0.42.

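Grouped means of this kind are typically computed with a `groupby` on the quality column. The sketch below uses a six-row made-up frame, so its values only mimic the direction of the trend reported above.

```python
import pandas as pd

# Made-up illustration data; only three quality levels are included.
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 8, 8],
    "alcohol": [9.8, 10.1, 9.9, 10.3, 11.9, 12.3],
    "volatile acidity": [0.90, 0.86, 0.60, 0.56, 0.44, 0.40],
})

# Mean of each feature per quality level (rows are sorted by quality).
group_means = toy.groupby("quality")[["alcohol", "volatile acidity"]].mean()

# Change in each feature's mean between the lowest and highest quality level.
delta = group_means.iloc[-1] - group_means.iloc[0]
```

A positive `delta` for alcohol together with a negative `delta` for volatile acidity is the pattern the evidence above describes.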

### Feature engineering recommendations


1. **Acidity ratio: `fixed_to_volatile = fixed_acidity / volatile_acidity`**

   * *Rationale:* Grouped means and violin plots show that quality improves when stable acids (fixed) dominate over volatile (harsh) acids. The ratio captures that balance better than either feature alone.
   * *Action:* `df['fixed_to_volatile'] = df['fixed acidity'] / (df['volatile acidity'] + eps)`, where `eps` is a small constant that avoids division by zero.

2. **Citric balance: `citric_over_total_acid = citric_acid / (fixed_acidity + volatile_acidity + eps)`**

   * *Rationale:* Citric acid adds freshness, and the EDA showed citric acid rises with quality. Normalizing by total acid captures relative freshness.
   * *Action:* Create the ratio as defined above; `eps` is a small constant that avoids division by zero.

3. **Alcohol–Acidity interaction: `alcohol_acid_inter = alcohol / (volatile_acidity + eps)`**

   * *Rationale:* Trivariate analysis showed high alcohol only signals high quality when volatile acidity is low. This interaction explicitly models that synergistic effect.

4. **Fermentation efficiency: `alcohol_density_ratio = alcohol / density`**

   * *Rationale:* Grouped means showed density decreases while alcohol increases with quality — ratio captures fermentation completeness.

5. **Sulphate–citric interaction: `sulphates * citric_acid` or `citric_acid / sulphates`**

   * *Rationale:* EDA suggested sulphates and citric acid jointly correlate with perceived freshness and preservation. Create both multiplicative and ratio variants and validate.



**Implementation notes**

* When adding ratios, add a tiny `eps` (e.g., 1e-6) to the denominator to avoid division by zero.
* Scale these engineered features after creation (same scaler as raw numeric features).

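The engineered features listed above could be assembled in one helper, sketched here under a few assumptions: the function name `add_engineered_features` and the column names `sulphates_x_citric` and `citric_over_sulphates` are hypothetical, `eps` is added to every denominator, and the two rows are illustrative values only.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # small constant added to denominators to avoid division by zero


def add_engineered_features(df: pd.DataFrame, eps: float = EPS) -> pd.DataFrame:
    """Return a copy of df with the ratio and interaction features added."""
    out = df.copy()
    out["fixed_to_volatile"] = out["fixed acidity"] / (out["volatile acidity"] + eps)
    out["citric_over_total_acid"] = out["citric acid"] / (
        out["fixed acidity"] + out["volatile acidity"] + eps
    )
    out["alcohol_acid_inter"] = out["alcohol"] / (out["volatile acidity"] + eps)
    out["alcohol_density_ratio"] = out["alcohol"] / (out["density"] + eps)
    # Both variants of the sulphate-citric feature, to be validated later.
    out["sulphates_x_citric"] = out["sulphates"] * out["citric acid"]
    out["citric_over_sulphates"] = out["citric acid"] / (out["sulphates"] + eps)
    return out


# Illustrative rows only.
toy = pd.DataFrame({
    "fixed acidity": [7.4, 8.5],
    "volatile acidity": [0.70, 0.28],
    "citric acid": [0.00, 0.56],
    "sulphates": [0.56, 0.75],
    "density": [0.9978, 0.9969],
    "alcohol": [9.4, 10.5],
})
feats = add_engineered_features(toy)
```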
---

## 4. Model selection & evaluation

### Evidence behind the recommendation

* Nonlinear interactions discovered in trivariate analysis (alcohol × VA etc.) → simple linear separability is limited.
* Target class imbalance (Medium dominates) → need balanced metrics/strategies.

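To see why imbalance calls for a balanced metric, consider a classifier that always predicts the dominant class. The sketch below computes accuracy and macro-F1 by hand for that case. The label counts are made up for illustration (the report does not state the exact split), and `macro_f1` is a hypothetical helper that averages over the classes present in `y_true`.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up class counts in which Medium dominates.
y_true = ["Low"] * 5 + ["Medium"] * 80 + ["High"] * 15
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)  # always predict the dominant class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)  # far lower than accuracy for this predictor
```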
### Recommended pipeline 

1. **Prepare dataset**

   * Create engineered features above.
   * Log-transform skewed columns.
   * Handle outliers.
   * Scale numeric features (StandardScaler or RobustScaler), fitting the scaler on the training split only (for example inside a pipeline) to avoid leakage.

2. **Train/test split**

   * Use **stratified** split on the classification label (quality or quality_label) to preserve class distribution.

3. **Baseline classifier**

   * **Logistic Regression (multinomial)** or **LinearSVC**, to get an interpretable baseline whose coefficients show the direction of each feature's effect.
   * *Rationale:* fast, interpretable baseline against which to measure non-linear gains.

4. **Primary models**

   * **Random Forest Classifier** — robust, captures interactions, insensitive to scaling and moderate outliers.
   * **XGBoost / LightGBM** — usually gives best structured-data performance and handles non-linearity and feature interactions automatically.
   * *Rationale:* EDA showed interactions and nonlinear relationships; tree ensembles are ideal.

5. **If you want probability calibration** (for business decision thresholds), wrap the tree model in **CalibratedClassifierCV** (sigmoid or isotonic method).

6. **Class imbalance handling**

   * If using grouped three-class (`Low/Medium/High`) and Medium dominates:

     * Use **class_weight='balanced'** for logistic regression and for tree-based models that support it, or
     * Use **resampling** (SMOTE to oversample minority classes, or undersampling of the majority class) only **within the cross-validation folds** to avoid leakage.

7. **Cross-validation & hyperparameter tuning**

   * Use **StratifiedKFold** (k=5 or 10) for reliable multiclass estimates.
   * Use grid/random search with metric = **macro-F1** or **balanced accuracy** to prioritize performance across classes.

8. **Evaluation metrics**

   * **Primary:** macro-F1 (balances class-wise performance) or balanced accuracy.
   * **Secondary:** Confusion matrix to see which quality classes are confused. ROC-AUC per class can be used (one-vs-rest).

9. **Explainability**

   * For the final model, run **feature importance** (trees) and **SHAP** values to verify that the engineered acidity ratios and interactions actually contribute.

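The split, baseline, and cross-validation steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the data come from `make_classification` as a synthetic stand-in for the wine features (three imbalanced classes, eleven numeric columns), and only the logistic-regression baseline is shown. A tree model such as `RandomForestClassifier` could be swapped into the same pipeline. Because the scaler sits inside the pipeline, it is refit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 3 imbalanced classes, 11 numeric features.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_redundant=2,
    n_classes=3, weights=[0.10, 0.75, 0.15], random_state=0,
)

# Stratified split keeps the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: scaling + class-weighted logistic regression.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# 5-fold stratified CV scored with macro-F1 on the training split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="f1_macro")

# Final check on the held-out test split.
baseline.fit(X_train, y_train)
test_macro_f1 = f1_score(y_test, baseline.predict(X_test), average="macro")
```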
---

## 5. EDA evidence

| Recommendation                                                 | EDA evidence                                                                                       |
| -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Add `fixed_to_volatile` ratio                                  | Grouped means & violin plots: high quality had higher fixed and much lower volatile acidity.            |
| Add `alcohol / density`                                        | Grouped means: alcohol increased ~9.96→12.09 while density decreased with quality.                      |
| Log-transform `residual sugar`, `total SO₂`, `chlorides`       | Univariate histograms: strong right-skew.                                                               |
| Keep `alcohol`, `volatile acidity`, `sulphates`, `citric acid` | High correlation magnitudes: alcohol +0.48, volatile acidity −0.39, sulphates +0.25, citric acid +0.23. |
| Drop/deprioritize `residual sugar`, `free SO₂`                 | Low correlation with quality (~0.01, −0.05) and weak group mean variation.                              |
| Prefer Random Forest / XGBoost                                 | Trivariate visuals show nonlinear interactions (alcohol effectiveness depends on acidity level).        |
| Use stratified CV and macro-F1                                 | Target imbalance (Medium dominates) discovered in categorical univariate analysis.                      |

---

## 6. **Production-oriented suggestions**
*Based strictly on dataset patterns.*

> I am *not* presuming domain expertise beyond the dataset; these suggestions follow only from what the data shows. For real-world production changes, please consult an oenologist.

1. **Focus on achieving higher alcohol (within acceptable taste ranges)**: the dataset shows a mean alcohol of ~9.96 at quality 3 and ~12.09 at quality 8.

   * *Why:* Alcohol has the strongest correlation with quality in this dataset (+0.48).

2. **Minimize volatile acidity (VA)** — mean VA drops 0.88 → 0.42 as quality rises.

   * *Why:* VA has the second-strongest correlation (−0.39) and EDA visuals show lower VA in high-quality wines.
   * *Actionable:* Monitor fermentation to avoid acetic fermentation; keep VA within the range seen for high-quality wines in the data (approx. ≤0.4).

3. **Aim for balanced fixed-to-volatile acidity ratio** (engineer and track this ratio).

   * *Why:* EDA showed that balance (not just absolute fixed acid) correlates with higher quality. Ratio is a concise production metric.

4. **Moderate sulphates**: slightly higher sulphates are associated with higher quality (but not linearly).

   * *Why:* Sulphates correlate positively (+0.25) and grouped means rise; use moderate, not excessive addition.

5. **Encourage complete fermentation** (reduce density while maintaining alcohol).

   * *Why:* Density decreased while alcohol increased among higher-quality samples. Incomplete fermentation leaves more residual sugar and thus higher density, and higher density is associated with lower quality in this dataset.

