# Dataset B – Wine Quality
## Exploratory Data Analysis (EDA)

We examine the structure, target distribution, and key properties of the Wine Quality dataset to form testable expectations about how different learning algorithms will perform on it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(49)

### Data Loading

We load the Wine Quality dataset and inspect its basic structure.

In [2]:
DATA_PATH = "../wine.csv"
df = pd.read_csv(DATA_PATH)

df.head()

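|  | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | class | type | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | 5 |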


The variables quality, class, and type are categorical even though they are stored as integers. quality is the ordinal multi-class target and type is a binary categorical feature. class is a discrete label whose value counts, reported later in this notebook, match those of quality level for level; it may therefore duplicate the target and should be checked before it is used as a predictor. The remaining 11 columns are continuous physicochemical measurements.
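One way to make the categorical nature of these columns explicit is to cast them to pandas categorical dtypes. The sketch below is only an illustration: it rebuilds the three integer-coded columns from the five rows printed by df.head(), and it assumes that quality levels run from 1 to 8, as the value counts later in the notebook suggest. The name `sample` is hypothetical.

```python
import pandas as pd

# Integer-coded columns copied from the five rows shown by df.head().
sample = pd.DataFrame(
    {
        "class": [5, 5, 5, 6, 5],
        "type": [0, 0, 0, 0, 0],
        "quality": [5, 5, 5, 6, 5],
    }
)

# Assumption: quality levels range from 1 to 8 (taken from the value counts).
quality_levels = list(range(1, 9))

# An ordered categorical keeps the ranking information of the target.
sample["quality"] = pd.Categorical(
    sample["quality"], categories=quality_levels, ordered=True
)

# The binary feature has no meaningful order, so a plain category is enough.
sample["type"] = sample["type"].astype("category")

print(sample.dtypes)
print(sample["quality"].min(), sample["quality"].max())
```

With an ordered dtype, comparisons and min/max respect the quality ranking, whereas the plain integer encoding leaves the ordinal meaning implicit.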

### Dataset Overview

We examine dataset size, feature types, and missing values.

In [3]:
df.shape

(6497, 14)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  class                 6497 non-null   int64  
 12  type                  6497 non-null   int64  
 13  quality               6497 non-null   int64  
dtypes: float64(11), int64(3)
memory usage: 710.7 KB


In [5]:
df.isna().sum().sort_values(ascending=False)

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
class                   0
type                    0
quality                 0
dtype: int64

### Target Variable Distribution

We analyze class balance to motivate metric selection.

In [6]:
target_col = "quality"
df[target_col].value_counts()

4    2251
5    1561
3    1467
6     813
7     204
2     163
1      20
8      18
Name: quality, dtype: int64

In [7]:
df[target_col].value_counts(normalize=True)

4    0.346468
5    0.240265
3    0.225797
6    0.125135
7    0.031399
2    0.025089
1    0.003078
8    0.002771
Name: quality, dtype: float64
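To see why this imbalance matters for metric selection, the following sketch evaluates a trivial classifier that always predicts the most frequent quality level. It uses only the counts printed above; comparing plain accuracy with balanced accuracy (macro-averaged recall) is an illustrative choice here, not a metric prescribed by the notebook.

```python
import numpy as np

# Counts copied from the value_counts() output (quality level -> count).
counts = {4: 2251, 5: 1561, 3: 1467, 6: 813, 7: 204, 2: 163, 1: 20, 8: 18}
n = sum(counts.values())

# Baseline: always predict the most frequent level.
majority = max(counts, key=counts.get)
accuracy = counts[majority] / n

# Recall of that baseline is 1 for the majority level and 0 for all others,
# so balanced accuracy (mean per-class recall) collapses to 1 / n_classes.
recalls = np.array([1.0 if level == majority else 0.0 for level in counts])
balanced_accuracy = recalls.mean()

# Ratio between the most and least frequent levels.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"majority level:    {majority}")
print(f"accuracy:          {accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f}")
```

The gap between the two numbers shows that accuracy alone rewards ignoring the rare levels, which is the usual argument for reporting a class-averaged metric alongside it.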

### Attribute Distributions

We summarize the distribution of each attribute to assess its scale, spread, and extreme values, and to judge whether it is likely to be informative.

In [8]:
numeric_cols = df.drop(columns=["quality", "class", "type"]).columns

df[numeric_cols].describe()

|  | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 |
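The summary statistics already hint at extreme values. As a rough check, the sketch below applies the common 1.5 × IQR upper fence to three features, using quartiles and maxima copied from the describe() output. The 1.5 multiplier is a conventional rule of thumb and an assumption here; the notebook itself does not define an outlier criterion.

```python
import pandas as pd

# Quartiles and maxima copied from the describe() output for three features.
stats = pd.DataFrame(
    {
        "q1": [1.8, 0.038, 17.0],
        "q3": [8.1, 0.065, 41.0],
        "max": [65.8, 0.611, 289.0],
    },
    index=["residual_sugar", "chlorides", "free_sulfur_dioxide"],
)

# Upper fence: third quartile plus 1.5 times the interquartile range.
iqr = stats["q3"] - stats["q1"]
stats["upper_fence"] = stats["q3"] + 1.5 * iqr

# Flag features whose maximum lies beyond the fence.
stats["max_beyond_fence"] = stats["max"] > stats["upper_fence"]

print(stats)
```

For all three features the maximum lies well beyond the fence, which supports considering a scaling step that is less sensitive to extremes than min-max scaling.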


In [14]:
for col in ["class", "type"]:
    print(f"\n=== {col.upper()} ===")
    print(df[col].value_counts())
    print(df[col].value_counts(normalize=True))


=== CLASS ===
4    2251
5    1561
3    1467
6     813
7     204
2     163
1      20
8      18
Name: class, dtype: int64
4    0.346468
5    0.240265
3    0.225797
6    0.125135
7    0.031399
2    0.025089
1    0.003078
8    0.002771
Name: class, dtype: float64

=== TYPE ===
1    4898
0    1599
Name: type, dtype: int64
1    0.753886
0    0.246114
Name: type, dtype: float64
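The value counts for class coincide with those reported earlier for quality at every level. A quick way to test whether the two columns are identical row by row is sketched below on the five rows from df.head(); on the full data the same two lines would be run against df. The name `sample` is hypothetical.

```python
import pandas as pd

# class and quality values copied from the five rows shown by df.head().
sample = pd.DataFrame({"class": [5, 5, 5, 6, 5], "quality": [5, 5, 5, 6, 5]})

# True only if the two columns agree on every row.
identical = (sample["class"] == sample["quality"]).all()

# A cross-tabulation with counts only on the diagonal tells the same story.
crosstab = pd.crosstab(sample["class"], sample["quality"])

print("class identical to quality:", identical)
print(crosstab)
```

If the check also holds on the full dataset, class would leak the target and should be excluded from the feature set.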


### Initial Observations

- The Wine Quality dataset contains 6497 observations and 14 columns (11 physicochemical measurements,
  type, class, and the target quality), with no missing values.
- Apart from the binary type indicator, all explanatory variables are numerical physicochemical
  measurements, resulting in a homogeneous and low-sparsity feature space.
- Feature distributions exhibit heterogeneous scales (density ranges from about 0.987 to 1.039, while
  total_sulfur_dioxide ranges from 6 to 440) and extreme values (residual_sugar has a maximum of 65.8
  against a 75th percentile of 8.1), motivating feature normalization prior to distance-based or
  gradient-based learning.
- The target variable (wine quality) is an ordinal multi-class variable with strong class imbalance:
  about 81% of observations fall in quality levels 3–5, while the extreme levels 1 and 8 each account
  for about 0.3%.
- The binary categorical feature type separates the data into two wine groups (about 75% with type = 1
  and 25% with type = 0). Because it is already encoded as 0/1, it needs no one-hot expansion and adds
  no extra dimensions.


### Preliminary Hypothesis

Based on the numerical nature of the features, scale heterogeneity, and class imbalance, we hypothesize that:

- Distance-based methods (kNN) will perform competitively after proper feature scaling, as the
  feature space is continuous and moderately low-dimensional.
- Linear models may underperform due to non-linear relationships between physicochemical attributes
  and wine quality.
- Tree-based models are expected to effectively capture non-linear interactions between attributes,
  but may overfit minority quality classes without sufficient regularization.
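These hypotheses could be tested with a stratified cross-validation comparison along the following lines. This is a minimal sketch under several assumptions: it uses scikit-learn, a synthetic imbalanced dataset as a stand-in for the wine data, macro-averaged F1 as the metric, and illustrative hyperparameters (n_neighbors=15, min_samples_leaf=5, class_weight="balanced"). None of these choices come from the notebook, and the printed scores say nothing about the actual wine data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 12 numeric features and 4 imbalanced classes.
# This is NOT the wine data; replace X and y with the real features/target.
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    weights=[0.5, 0.3, 0.15, 0.05],
    random_state=49,
)

# Scaling lives inside the pipeline so it is fitted on training folds only.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    # min_samples_leaf acts as a simple regularizer against overfitting
    # rare classes; trees do not need feature scaling.
    "forest": RandomForestClassifier(
        n_estimators=100,
        min_samples_leaf=5,
        class_weight="balanced",
        random_state=49,
    ),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=49)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: macro-F1 = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

On the real data, the rarest quality levels have only about 20 observations each, so the number of folds should stay small enough for every fold to contain each level.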
