# Exploratory Data Analysis (EDA) and Classification Problem

This document explores different alternatives for handling the dataset.  
I chose to approach the problem as a classification task rather than a regression one, although the latter should not be entirely ruled out.

### Some Important Considerations

- There are two datasets available: one for white wine and one for red wine.  
  Both datasets share the same features, but differ significantly in the number of samples:  
  the red wine dataset contains approximately 1,600 rows, while the white wine dataset has around 4,900 rows—roughly three times as many.  
  This presents a key issue: deciding whether to merge the two datasets into one by adding a feature to indicate wine type, or to train two separate models for each dataset.

  - If we choose to merge the datasets, we must deal with the imbalance between the two wine types. One solution is to undersample the white wine dataset, resulting in a merged dataset with ~3,200 rows (relatively few). Alternatively, we can oversample the red wine dataset, obtaining a merged dataset with approximately 10,000 rows.

  - If we choose to work with the datasets separately, the resulting models might be more accurate, but we would have to manage two distinct models and two separate pipelines. Additionally, if the datasets are very similar, the resulting models might also be very similar, making it potentially unnecessary to maintain two separate ones.

The most reasonable approach is to try both strategies and compare their performance using both balanced accuracy and standard accuracy (with respect to balanced and imbalanced datasets), tested on basic classification models such as Decision Trees, Support Vector Machines, and Neural Networks, with no hyperparameter tuning and no preprocessing beyond feature standardization.  
This will allow us to determine which approach works best for our specific case.
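
Every experiment in this notebook repeats the same steps: split, standardize, fit three default models, and score them. A minimal sketch of a reusable helper is shown below. The name `evaluate_models` and the toy data from `make_classification` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, test_size=0.2, random_state=42):
    """Fit three default classifiers and return their test-set scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "SVC": SVC(random_state=random_state),
        "MLP": MLPClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        scores[name] = {
            "balanced_accuracy": balanced_accuracy_score(y_test, predictions),
            "accuracy": accuracy_score(y_test, predictions),
        }
    return scores


# Toy data standing in for the wine features; not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)
demo_scores = evaluate_models(X_demo, y_demo)
for name, metrics in demo_scores.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

With such a helper, each strategy reduces to building `X` and `y` and calling the function once, which keeps the comparisons consistent.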


In [None]:
import pandas as pd

red_wine_data = pd.read_csv('../winequality-red.csv', sep=';')
white_wine_data = pd.read_csv('../winequality-white.csv', sep=';')

print(red_wine_data.shape)
print(white_wine_data.shape)

(1599, 12)
(4898, 12)


In [46]:
# Concatenate the two DataFrames
wine_type = {'red': 0, 'white': 1}
wine_data = pd.concat([red_wine_data, white_wine_data], ignore_index=True)
wine_data['type'] = [wine_type['red']] * len(red_wine_data) + [wine_type['white']] * len(white_wine_data)

wine_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,0
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,1
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,1
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,1
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,1


In [None]:
# First method: concatenated dataset with Decision Tree, SVC and Neural Network
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Separate the features from the target (standardization happens after the split)

X = wine_data.drop(columns=['quality'])
y = wine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Create and train the classifiers
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

svc_classifier = SVC(random_state=42)
svc_classifier.fit(X_train, y_train)

mlp_classifier = MLPClassifier(random_state=42)
mlp_classifier.fit(X_train, y_train)

# Evaluate the classifiers
dt_predictions = dt_classifier.predict(X_test)
svc_predictions = svc_classifier.predict(X_test)
mlp_predictions = mlp_classifier.predict(X_test)

dt_balanced_accuracy = balanced_accuracy_score(y_test, dt_predictions)
svc_balanced_accuracy = balanced_accuracy_score(y_test, svc_predictions)
mlp_balanced_accuracy = balanced_accuracy_score(y_test, mlp_predictions)
dt_accuracy = accuracy_score(y_test, dt_predictions)
svc_accuracy = accuracy_score(y_test, svc_predictions)
mlp_accuracy = accuracy_score(y_test, mlp_predictions)

print(f"Decision Tree Balanced Accuracy: {dt_balanced_accuracy:.4f}")
print(f"SVC Balanced Accuracy: {svc_balanced_accuracy:.4f}")
print(f"MLP Balanced Accuracy: {mlp_balanced_accuracy:.4f}")

print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"SVC Accuracy: {svc_accuracy:.4f}")
print(f"MLP Accuracy: {mlp_accuracy:.4f}")




Decision Tree Balanced Accuracy: 0.3638
SVC Balanced Accuracy: 0.2258
MLP Balanced Accuracy: 0.2753
Decision Tree Accuracy: 0.6115
SVC Accuracy: 0.5608
MLP Accuracy: 0.5677


### Results - concatenated dataset
| Model                | Balanced Accuracy | Accuracy |
|------------------------|-------------------|----------|
| Decision Tree          | 0.3638            | 0.6115   |
| SVC                    | 0.2258            | 0.5608   |
| MLP                    | 0.2753            | 0.5677   |

The results obtained so far are relatively low, but this is not a major issue, since we have not applied any preprocessing or performed any hyperparameter tuning.  
Additionally, the models were trained on an imbalanced dataset, so low performance is to be expected.
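
The gap between accuracy and balanced accuracy in the table comes from how the two metrics treat rare classes: balanced accuracy is the unweighted mean of the per-class recalls. A small sketch on made-up labels (not taken from the wine data) shows the effect.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: class 5 dominates, class 8 is rare.
y_true = np.array([5] * 8 + [6] * 4 + [8] * 2)
# Predictions: 7 of 8 correct for class 5, 2 of 4 for class 6, 0 of 2 for class 8.
y_pred = np.array([5] * 7 + [6] + [6] * 2 + [5] * 2 + [5] * 2)

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"Accuracy:          {accuracy:.4f}")      # 9 / 14 = 0.6429
print(f"Balanced accuracy: {balanced:.4f}")      # (7/8 + 2/4 + 0/2) / 3 = 0.4583
print(f"Macro recall:      {macro_recall:.4f}")  # same value as balanced accuracy
```

A model that ignores the rare classes can therefore keep a reasonable accuracy while its balanced accuracy stays low, which matches the pattern in the results above.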

We will now evaluate the case where the same models are trained on the two datasets separately.  
By computing the average performance across both, we can assess whether this strategy leads to any improvements.

In [None]:
# Second method: Separate datasets with Decision Tree, SVC and Neural Network
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# The features of each dataset are standardized after its train/test split

# red wine dataset
X_red = red_wine_data.drop(columns=['quality'])
y_red = red_wine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X_red, y_red, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Create and train the classifiers
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

svc_classifier = SVC(random_state=42)
svc_classifier.fit(X_train, y_train)

mlp_classifier = MLPClassifier(random_state=42)
mlp_classifier.fit(X_train, y_train)

# Evaluate the classifiers
dt_predictions = dt_classifier.predict(X_test)
svc_predictions = svc_classifier.predict(X_test)
mlp_predictions = mlp_classifier.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_predictions)
svc_accuracy = accuracy_score(y_test, svc_predictions)
mlp_accuracy = accuracy_score(y_test, mlp_predictions)
dt_balanced_accuracy = balanced_accuracy_score(y_test, dt_predictions)
svc_balanced_accuracy = balanced_accuracy_score(y_test, svc_predictions)
mlp_balanced_accuracy = balanced_accuracy_score(y_test, mlp_predictions)

results_red = {
    'Decision Tree': dt_accuracy,
    'SVC': svc_accuracy,
    'MLP': mlp_accuracy,
    'Decision Tree balanced': dt_balanced_accuracy,
    'SVC balanced': svc_balanced_accuracy,
    'MLP balanced': mlp_balanced_accuracy
}

# white wine dataset
X_white = white_wine_data.drop(columns=['quality'])
y_white = white_wine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X_white, y_white, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the classifiers
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

svc_classifier = SVC(random_state=42)
svc_classifier.fit(X_train, y_train)

mlp_classifier = MLPClassifier(random_state=42)
mlp_classifier.fit(X_train, y_train)

# Evaluate the classifiers
dt_predictions = dt_classifier.predict(X_test)
svc_predictions = svc_classifier.predict(X_test)
mlp_predictions = mlp_classifier.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_predictions)
svc_accuracy = accuracy_score(y_test, svc_predictions)
mlp_accuracy = accuracy_score(y_test, mlp_predictions)
dt_balanced_accuracy = balanced_accuracy_score(y_test, dt_predictions)
svc_balanced_accuracy = balanced_accuracy_score(y_test, svc_predictions)
mlp_balanced_accuracy = balanced_accuracy_score(y_test, mlp_predictions)

results_white = {
    'Decision Tree': dt_accuracy,
    'SVC': svc_accuracy,
    'MLP': mlp_accuracy,
    'Decision Tree balanced': dt_balanced_accuracy,
    'SVC balanced': svc_balanced_accuracy,
    'MLP balanced': mlp_balanced_accuracy
}

# Now I compute the average of the scores for each classifier
average_results = {
    'Decision Tree': (results_red['Decision Tree'] + results_white['Decision Tree']) / 2,
    'SVC': (results_red['SVC'] + results_white['SVC']) / 2,
    'MLP': (results_red['MLP'] + results_white['MLP']) / 2,
    'Decision Tree balanced': (results_red['Decision Tree balanced'] + results_white['Decision Tree balanced']) / 2,
    'SVC balanced': (results_red['SVC balanced'] + results_white['SVC balanced']) / 2,
    'MLP balanced': (results_red['MLP balanced'] + results_white['MLP balanced']) / 2
}

print("Average Results:")
for classifier, accuracy in average_results.items():
    print(f"{classifier}: {accuracy:.4f}")

# Print also single datasets results
print("\nResults for Red Wine Dataset:")
for classifier, accuracy in results_red.items():
    print(f"{classifier}: {accuracy:.4f}")

print("\nResults for White Wine Dataset:")
for classifier, accuracy in results_white.items():
    print(f"{classifier}: {accuracy:.4f}")



Average Results:
Decision Tree: 0.5864
SVC: 0.5822
MLP: 0.5935
Decision Tree balanced: 0.3618
SVC balanced: 0.2745
MLP balanced: 0.3356

Results for Red Wine Dataset:
Decision Tree: 0.5625
SVC: 0.6031
MLP: 0.6156
Decision Tree balanced: 0.2858
SVC balanced: 0.2700
MLP balanced: 0.2912

Results for White Wine Dataset:
Decision Tree: 0.6102
SVC: 0.5612
MLP: 0.5714
Decision Tree balanced: 0.4379
SVC balanced: 0.2791
MLP balanced: 0.3800





### Results - separated datasets

| Model        | Average Balanced Accuracy | Average Accuracy | Red Balanced Accuracy | Red Accuracy | White Balanced Accuracy | White Accuracy |
| -------------- | ------------------------ | ---------------- | --------------------- | ------------ | ---------------------- | -------------- |
| Decision Tree  | 0.3618                   | 0.5864           | 0.2858                | 0.5625       | 0.4379                 | 0.6102         |
| SVC            | 0.2745                   | 0.5822           | 0.2700                | 0.6031       | 0.2791                 | 0.5612         |
| MLP            | 0.3356                   | 0.5935           | 0.2912                | 0.6156       | 0.3800                 | 0.5714         |

### Results - concatenated datasets
| Model                | Balanced Accuracy | Accuracy |
|------------------------|-------------------|----------|
| Decision Tree          | 0.3638            | 0.6115   |
| SVC                    | 0.2258            | 0.5608   |
| MLP                    | 0.2753            | 0.5677   |

Without any further preprocessing or fine-tuning, the concatenated dataset (imbalanced with respect to wine type) does not clearly beat the separate red and white datasets. The Decision Tree reaches almost the same balanced accuracy in both settings (0.3638 vs. 0.3618 on average), while the SVC and the MLP score lower on the concatenated data (0.2258 and 0.2753 vs. 0.2745 and 0.3356). Before throwing this method away, I would still like to balance the rows of the datasets with undersampling and oversampling so as to have approximately the same number of rows for red and white wine.

In general, the average balanced accuracy of the separate datasets is comparable to the balanced accuracy of the concatenated dataset, while the accuracy of the Decision Tree on the concatenated dataset (0.6115) is higher than any of the average accuracies of the separate models. An interesting behavior to note is that the Decision Tree performs best on the white wine dataset (accuracy 0.6102), whereas on the red wine dataset the SVC and the MLP outperform it (0.6031 and 0.6156 vs. 0.5625). This information could be reused when choosing a model, should we decide to train separate models for the two datasets.
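
The averages above weight the red and white models equally, although the white test split is about three times larger. A hedged sketch of a sample-weighted alternative follows. The test-set sizes (320 and 980) are an assumption derived from `test_size=0.2` applied to 1,599 and 4,898 rows. The weighting applies to plain accuracy only; balanced accuracy cannot be pooled from per-dataset scores in this way.

```python
# Accuracy values copied from the separate-dataset results above.
red_accuracy = {"Decision Tree": 0.5625, "SVC": 0.6031, "MLP": 0.6156}
white_accuracy = {"Decision Tree": 0.6102, "SVC": 0.5612, "MLP": 0.5714}

# Assumed test-set sizes: 20% of 1,599 red and 4,898 white rows, rounded up.
n_red_test, n_white_test = 320, 980

weighted_accuracy = {
    model: (red_accuracy[model] * n_red_test + white_accuracy[model] * n_white_test)
    / (n_red_test + n_white_test)
    for model in red_accuracy
}
for model, value in weighted_accuracy.items():
    print(f"{model}: {value:.4f}")
```

The weighted figure corresponds to the accuracy over all test wines when each wine is routed to the model for its type, so it is closer in meaning to the accuracy of a single merged model.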

In the following sections we will oversample the red wine dataset and undersample the white wine dataset.
- *Oversampling*: Plain random oversampling copies existing rows and appends them to the original dataset, and the resulting duplicates encourage overfitting. To avoid this, we decided to use SMOTE, which generates synthetic samples instead.
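
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_samples` is a hypothetical illustration written for this note, not the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k_neighbors=5, random_state=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(random_state)
    X_minority = np.asarray(X_minority, dtype=float)
    # Ask for k + 1 neighbours because each point is normally its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random base row and one of its k neighbours (column 0 is skipped).
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k_neighbors + 1, size=n_new)]
    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Toy minority class with 10 rows and 3 features; not wine data.
X_rare = np.random.default_rng(0).normal(size=(10, 3))
X_new = smote_like_samples(X_rare, n_new=20)
print(X_new.shape)  # (20, 3)
```

Because every synthetic row is a convex combination of two real rows, it stays inside the range of the observed minority samples rather than duplicating any of them exactly.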

In [None]:
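# Oversample the red wine dataset with SMOTE. With sampling_strategy='auto',
# every minority `quality` class is resampled up to the size of the largest class,
# which also brings the row count close to that of the white wine dataset.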
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=42)

# Fit and apply SMOTE to the red wine dataset
X_red_resampled, y_red_resampled = smote.fit_resample(X_red, y_red)
print(f"Resampled shape: {X_red_resampled.shape}")

X_red_resampled

Resampled shape: (4086, 11)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.400000,0.700000,0.000000,1.900000,0.076000,11.000000,34.000000,0.997800,3.510000,0.560000,9.400000
1,7.800000,0.880000,0.000000,2.600000,0.098000,25.000000,67.000000,0.996800,3.200000,0.680000,9.800000
2,7.800000,0.760000,0.040000,2.300000,0.092000,15.000000,54.000000,0.997000,3.260000,0.650000,9.800000
3,11.200000,0.280000,0.560000,1.900000,0.075000,17.000000,60.000000,0.998000,3.160000,0.580000,9.800000
4,7.400000,0.700000,0.000000,1.900000,0.076000,11.000000,34.000000,0.997800,3.510000,0.560000,9.400000
...,...,...,...,...,...,...,...,...,...,...,...
4081,7.460685,0.358786,0.319419,2.018466,0.074485,16.757260,25.577810,0.994567,3.253351,0.719419,11.569918
4082,8.293899,0.365820,0.393055,2.040515,0.059241,13.176834,29.000000,0.995526,3.159099,0.772154,10.996139
4083,7.729226,0.478521,0.326338,2.260916,0.075317,11.073933,19.390837,0.992978,3.213662,0.713169,12.519368
4084,8.128720,0.523680,0.157238,2.240233,0.067690,35.195346,49.333130,0.994221,3.388279,0.723564,12.565524


Now that the red wine dataset has a number of rows comparable to the white wine dataset, I merge the resampled dataset with the white one and re-run the accuracy and balanced accuracy tests on the three models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Standardize the features

# I first recreate the red wine dataset with X and y resampled
red_wine_resampled = pd.concat([pd.DataFrame(X_red_resampled, columns=X_red.columns), 
                                 pd.Series(y_red_resampled, name='quality')], axis=1)

wine_data_resampled = pd.concat([red_wine_resampled, white_wine_data], ignore_index=True)
wine_data_resampled['type'] = [wine_type['red']] * len(red_wine_resampled) + [wine_type['white']] * len(white_wine_data)

print(wine_data_resampled['type'].map({v: k for k, v in wine_type.items()}).value_counts())

wine_data_resampled

type
white    4898
red      4086
Name: count, dtype: int64


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,0
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8979,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,1
8980,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,1
8981,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,1
8982,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,1


The new concatenated dataset contains 4,898 white wine samples and 4,086 red wine samples, a ratio of about 1.2:1, so it is reasonably balanced with respect to wine type.

In [51]:

X = wine_data_resampled.drop(columns=['quality'])
y = wine_data_resampled['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Create and train the classifiers
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

svc_classifier = SVC(random_state=42)
svc_classifier.fit(X_train, y_train)

mlp_classifier = MLPClassifier(random_state=42)
mlp_classifier.fit(X_train, y_train)

# Evaluate the classifiers
dt_predictions = dt_classifier.predict(X_test)
svc_predictions = svc_classifier.predict(X_test)
mlp_predictions = mlp_classifier.predict(X_test)

dt_balanced_accuracy = balanced_accuracy_score(y_test, dt_predictions)
svc_balanced_accuracy = balanced_accuracy_score(y_test, svc_predictions)
mlp_balanced_accuracy = balanced_accuracy_score(y_test, mlp_predictions)

dt_accuracy = accuracy_score(y_test, dt_predictions)
svc_accuracy = accuracy_score(y_test, svc_predictions)
mlp_accuracy = accuracy_score(y_test, mlp_predictions)

print(f"Decision Tree Balanced Accuracy: {dt_balanced_accuracy:.4f}")
print(f"SVC Balanced Accuracy: {svc_balanced_accuracy:.4f}")
print(f"MLP Balanced Accuracy: {mlp_balanced_accuracy:.4f}")

print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"SVC Accuracy: {svc_accuracy:.4f}")
print(f"MLP Accuracy: {mlp_accuracy:.4f}")



Decision Tree Balanced Accuracy: 0.6452
SVC Balanced Accuracy: 0.5700
MLP Balanced Accuracy: 0.6009
Decision Tree Accuracy: 0.6956
SVC Accuracy: 0.6194
MLP Accuracy: 0.6450


### Results - artificially balanced dataset (oversampling)
| Model | Balanced Accuracy - SMOTE | Accuracy - SMOTE |
|---------|-------------------|-------------------|
| Decision Tree | 0.6452 | 0.6956 |
| SVC | 0.5700 | 0.6194 |
| MLP | 0.6009 | 0.6450 |

The results are much better than those obtained with the unbalanced dataset (with respect to the number of red and white wine samples), for both accuracy and balanced accuracy. In particular, the Decision Tree obtained the best performance, followed by the MLP and the SVC, which suggests that the Decision Tree could be the most suitable model for this balanced dataset. These results are also much better than those obtained with the separate datasets, which suggests that merging the datasets could be a good strategy to improve model performance. Two caveats apply: SMOTE was run before the train/test split, so the test set contains synthetic samples, and it also evened out the `quality` classes of the red wines. The scores are therefore not directly comparable with those of the earlier experiments.
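
In the cell above, SMOTE runs before `train_test_split`, so synthetic rows can end up in the test set. A minimal sketch of the alternative ordering (split first, then oversample only the training rows) follows. It uses toy data from `make_classification`, and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE to keep the example dependent on scikit-learn and pandas only. `oversample_to_majority` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def oversample_to_majority(X, y, random_state=42):
    """Randomly duplicate minority rows until every class matches the largest class."""
    frame = pd.DataFrame(X).assign(target=list(y))
    largest = frame["target"].value_counts().max()
    parts = []
    for _, group in frame.groupby("target"):
        parts.append(group)
        n_missing = largest - len(group)
        if n_missing > 0:
            parts.append(
                resample(group, replace=True, n_samples=n_missing,
                         random_state=random_state)
            )
    balanced = pd.concat(parts, ignore_index=True)
    return balanced.drop(columns="target").to_numpy(), balanced["target"].to_numpy()


# Imbalanced toy data standing in for the wine features; not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# 1. Split first, so the test set contains only original rows.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
# 2. Oversample the training split only.
X_train_bal, y_train_bal = oversample_to_majority(X_train, y_train)

model = DecisionTreeClassifier(random_state=42).fit(X_train_bal, y_train_bal)
score = balanced_accuracy_score(y_test, model.predict(X_test))
print(f"Balanced accuracy on untouched test rows: {score:.4f}")
```

Scores measured this way reflect performance on real samples only, which makes them comparable with the experiments that use no resampling.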

All that remains is to undersample the white wine dataset, in order to obtain a balanced dataset with a number of rows comparable to that of the red wine. In this case, the white wine dataset will be reduced to about 1,600 rows, while the red wine dataset will remain unchanged. Furthermore, since most of the values are concentrated in `quality` classes 5 and 6, we decided to undersample while preserving the class proportions, so as not to lose important information.

In [73]:
n_target = len(red_wine_data)

white_dist = white_wine_data['quality'].value_counts(normalize=True)

pd.DataFrame(white_dist)

quality,proportion
6,0.448755
5,0.297468
7,0.179665
8,0.035729
4,0.033279
3,0.004083
9,0.001021


As expected, classes $5$ and $6$ account for most of the quality values (about 75% combined). I want to maintain the class proportions in the undersampled dataset as well.

In [82]:
# Compute the desired number of samples for each class
n_per_class = (white_dist * n_target).round().astype(int)

white_wine_undersampled = pd.DataFrame()
for label, n_samples in n_per_class.items():
    subset = white_wine_data[white_wine_data['quality'] == label]
    sampled_subset = subset.sample(n=n_samples, random_state=42)
    white_wine_undersampled = pd.concat([white_wine_undersampled, sampled_subset], ignore_index=True)

white_wine_undersampled


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,6.2,0.15,0.46,1.60,0.039,38.0,123.0,0.99300,3.38,0.51,9.7,6
1,6.9,0.31,0.32,1.20,0.024,20.0,166.0,0.99208,3.05,0.54,9.8,6
2,6.0,0.28,0.34,1.60,0.119,33.0,104.0,0.99210,3.19,0.38,10.2,6
3,6.8,0.30,0.26,20.30,0.037,45.0,150.0,0.99727,3.04,0.38,12.3,6
4,6.2,0.30,0.26,13.40,0.046,57.0,206.0,0.99775,3.17,0.43,9.5,6
...,...,...,...,...,...,...,...,...,...,...,...,...
1595,8.6,0.55,0.35,15.55,0.057,35.5,366.5,1.00010,3.04,0.63,11.0,3
1596,10.3,0.17,0.47,1.40,0.037,5.0,33.0,0.99390,2.89,0.28,9.6,3
1597,7.1,0.49,0.22,2.00,0.047,146.5,307.5,0.99240,3.24,0.37,11.0,3
1598,6.6,0.36,0.29,1.60,0.021,24.0,85.0,0.98965,3.41,0.61,12.4,9
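
The proportional undersampling performed by the loop above can also be written with pandas' `groupby(...).sample`, which draws the same fraction from every class. This is a sketch on a made-up frame. The `toy` data and `n_target = 40` are illustrative assumptions, and the rows selected will differ from those picked by the loop.

```python
import pandas as pd

# Toy frame standing in for white_wine_data; only the `quality` column matters here.
toy = pd.DataFrame({
    "alcohol": range(100),
    "quality": [5] * 30 + [6] * 45 + [7] * 20 + [8] * 5,
})
n_target = 40  # hypothetical target size

# Sample the same fraction from each quality class to preserve the proportions.
undersampled = (
    toy.groupby("quality")
       .sample(frac=n_target / len(toy), random_state=42)
       .reset_index(drop=True)
)
print(undersampled["quality"].value_counts().sort_index())
```

Each class keeps its share of the data (here 12, 18, 8 and 2 rows out of 40), so the distribution of `quality` is unchanged by the reduction.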


In [87]:

# Now I create the final dataset with the undersampled white wine
wine_data_undersampled = pd.concat([red_wine_data, white_wine_undersampled], ignore_index=True)
wine_data_undersampled['type'] = [wine_type['red']] * len(red_wine_data) + [wine_type['white']] * len(white_wine_undersampled)

X = wine_data_undersampled.drop(columns=['quality'])
y = wine_data_undersampled['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Create and train the classifiers
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

svc_classifier = SVC(random_state=42)
svc_classifier.fit(X_train, y_train)

mlp_classifier = MLPClassifier(random_state=42)
mlp_classifier.fit(X_train, y_train)

# Evaluate the classifiers
dt_predictions = dt_classifier.predict(X_test)
svc_predictions = svc_classifier.predict(X_test)
mlp_predictions = mlp_classifier.predict(X_test)

dt_balanced_accuracy = balanced_accuracy_score(y_test, dt_predictions)
svc_balanced_accuracy = balanced_accuracy_score(y_test, svc_predictions)
mlp_balanced_accuracy = balanced_accuracy_score(y_test, mlp_predictions)
dt_accuracy = accuracy_score(y_test, dt_predictions)
svc_accuracy = accuracy_score(y_test, svc_predictions)
mlp_accuracy = accuracy_score(y_test, mlp_predictions)

print(f"Decision Tree Balanced Accuracy: {dt_balanced_accuracy:.4f}")
print(f"SVC Balanced Accuracy: {svc_balanced_accuracy:.4f}")
print(f"MLP Balanced Accuracy: {mlp_balanced_accuracy:.4f}")
print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"SVC Accuracy: {svc_accuracy:.4f}")
print(f"MLP Accuracy: {mlp_accuracy:.4f}")

Decision Tree Balanced Accuracy: 0.3704
SVC Balanced Accuracy: 0.2695
MLP Balanced Accuracy: 0.2803
Decision Tree Accuracy: 0.5391
SVC Accuracy: 0.5875
MLP Accuracy: 0.5609




### Results - artificially balanced dataset (undersampling)
| Model | Balanced Accuracy | Accuracy |
|---------|-------------------|-------------------|
| Decision Tree | 0.3704 | 0.5391 |
| SVC | 0.2695 | 0.5875 |
| MLP | 0.2803 | 0.5609 |

Both balanced accuracy and accuracy are much lower than those obtained with oversampling, which suggests that undersampling may not be the best strategy for this dataset. In particular, the Decision Tree achieved the best balanced accuracy, while the SVC achieved the best accuracy.

Here is a single table comparing the results obtained with the two balancing methods and the three classification models:

| Model        | Balanced Accuracy (Undersampling) | Accuracy (Undersampling) | Balanced Accuracy (Oversampling/SMOTE) | Accuracy (Oversampling/SMOTE) |
| -------------- | --------------------------------- | ------------------------ | -------------------------------------- | ----------------------------- |
| Decision Tree  | 0.3704                            | 0.5391                   | 0.6452                                 | 0.6956                        |
| SVC            | 0.2695                            | 0.5875                   | 0.5700                                 | 0.6194                        |
| MLP            | 0.2803                            | 0.5609                   | 0.6009                                 | 0.6450                        |


Oversampling with SMOTE produced significantly better results than undersampling, both in terms of balanced accuracy and accuracy. With SMOTE, the Decision Tree achieved the best performance on both metrics, followed by the MLP and the SVC; with undersampling, the Decision Tree had the best balanced accuracy but the lowest accuracy. This suggests that oversampling could be the best strategy for this dataset, as it keeps a larger number of samples and therefore preserves more useful information for classification.

# Conclusion

At this stage, we are left with two main options:  
- Apply SMOTE oversampling to the red wine dataset and then merge it with the white wine dataset, resulting in a balanced dataset with approximately 9,000 rows.  
- Train the models separately on the two datasets and compute the average balanced accuracy and average accuracy across both.

### Summary of Results

| Model          | Average Balanced Accuracy | Average Accuracy | Balanced Accuracy - SMOTE | Accuracy - SMOTE |
| -------------- | :------------------------: | :--------------: | :------------------------: | :--------------: |
| Decision Tree  |           0.3618           |      0.5864       |           0.6452           |      0.6956       |
| SVC            |           0.2745           |      0.5822       |           0.5700           |      0.6194       |
| MLP            |           0.3356           |      0.5935       |           0.6009           |      0.6450       |

### Explored Options:
- Concatenated dataset  
- Concatenated dataset with oversampling (SMOTE)  
- Concatenated dataset with undersampling  
- Separate datasets  

### Selected Option:
- Concatenated dataset with oversampling (SMOTE)

In conclusion, the SMOTE-based oversampling strategy led to better results compared to all other alternatives considered.  
However, it's important to note that the overall performance is still far from optimal and could be further improved through proper feature selection, hyperparameter tuning, and more advanced preprocessing techniques.
