## Data preprocessing

Source: https://archive.ics.uci.edu/dataset/186/wine+quality

### Imports

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

### Data load

In [None]:
df_red = pd.read_csv("data/winequality-red.csv", sep=";")
df_white = pd.read_csv("data/winequality-white.csv", sep=";")

### Data concatenation
The data have one categorical variable, _type_ (red or white). It is not stored as a column: there are two .csv files, one per wine type. To combine them into a single table, each file needs new column(s) that capture _type_. Because this single categorical variable has only two values, it can be one-hot encoded without noticeably increasing the size or dimensionality of the data.

In [None]:
# add new one-hot columns to each dataset
df_red["red"] = 1
df_red["white"] = 0

df_white["red"] = 0
df_white["white"] = 1

# concatenate both datasets
df = pd.concat([df_red, df_white], ignore_index=True)

# concatenation was successful, column names and row count are as expected
# print(df.info())
# print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  red                   6497 non-null   int64  
 13  white                 6497 non-null   int64  
dtypes: float64(11), int64(3)
memory usage: 710.7 KB
None
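
The same one-hot layout can also be produced from a single categorical column. The following is a minimal sketch on made-up toy tables (`toy_red`, `toy_white` and their values are illustrative, not taken from the dataset), using `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-ins for the red and white tables (values are made up for illustration).
toy_red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
toy_white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

# Tag each table with a single categorical column, then concatenate.
toy = pd.concat(
    [toy_red.assign(type="red"), toy_white.assign(type="white")],
    ignore_index=True,
)

# Expand the categorical column into one indicator column per value.
dummies = pd.get_dummies(toy["type"]).astype(int)
toy_onehot = pd.concat([toy.drop(columns="type"), dummies], axis=1)

print(toy_onehot)
```

With only two values, the two indicator columns always sum to 1, so one of them carries the full information.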


### Data preprocessing
Many preprocessing steps that are usually required are unnecessary for this dataset. There are no _date_ variables, strings, or other object-typed columns. The _info()_ output of the pandas DataFrame shows that every column has 6497 non-null entries, so there are no _None_ or _NaN_ values, and that all columns are of type _int_ or _float_.

In [62]:
# no column contains missing values
print(df.isna().sum())

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
red                     0
white                   0
dtype: int64
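
The two properties relied on here (numeric dtypes only, no missing values) can be checked in one place. A small sketch, assuming a hypothetical helper `check_clean` and a made-up toy frame that deliberately violates both properties:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clean(frame: pd.DataFrame) -> dict:
    """List non-numeric columns and count missing values in the frame."""
    non_numeric = [c for c in frame.columns if not is_numeric_dtype(frame[c])]
    missing = int(frame.isna().sum().sum())
    return {"non_numeric_columns": non_numeric, "missing_values": missing}


# Toy frame with one missing value and one string column (illustrative only).
toy = pd.DataFrame(
    {"pH": [3.51, 3.20, np.nan], "quality": [5, 6, 7], "note": ["a", "b", "c"]}
)
report = check_clean(toy)
print(report)
```

For a frame like the one in this notebook, both entries of the report would be expected to be empty or zero.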


I also think there is no need for any normalization, transformation, etc. The only thing that should be checked is the value range of each measurement column. The dataset was studied in https://doi.org/10.1016/j.dss.2009.05.016, so the provided ranges are taken to be correct based on that study.

In [63]:
# remove duplicates
print(df.shape)
df = df.drop_duplicates()
print(df.shape)

(6497, 14)
(5320, 14)
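
The shapes above show that 6497 - 5320 = 1177 rows were dropped as duplicates. `drop_duplicates()` compares all columns, including the one-hot ones, so a red and a white wine with identical measurements are both kept. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 has the same measurements but a different type.
toy = pd.DataFrame(
    {
        "pH": [3.5, 3.5, 3.5, 3.2],
        "alcohol": [9.4, 9.4, 9.4, 10.0],
        "red": [1, 1, 0, 0],
        "white": [0, 0, 1, 1],
    }
)

n_duplicates = int(toy.duplicated().sum())  # rows equal to an earlier row in every column
deduplicated = toy.drop_duplicates()        # keeps the first occurrence by default

print(n_duplicates, deduplicated.shape)
```

Note that the original row labels are kept after deduplication, which is why later outputs show a non-contiguous index.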


In [64]:
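# rescale both sulfur dioxide columns by a constant factor of 1/1000
df["free sulfur dioxide"] = df["free sulfur dioxide"] / 1000
df["total sulfur dioxide"] = df["total sulfur dioxide"] / 1000
# standardize the measurement columns (zero mean, unit variance);
# quality and the one-hot columns are left unchanged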
features = df.drop(columns=["quality", "red", "white"]).columns
scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])
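
`StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). This is a likely explanation for the `std` of 1.000094, rather than exactly 1, in the `describe()` output for 5320 rows. A small sketch on synthetic data (the column `x` and its random values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic column with the same number of rows as the deduplicated dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(loc=10.0, scale=2.0, size=5320)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

population_std = scaled["x"].std(ddof=0)  # the quantity StandardScaler normalizes to 1
sample_std = scaled["x"].std(ddof=1)      # the quantity describe() reports

print(population_std, sample_std, np.sqrt(5320 / 5319))
```

The ratio between the two estimates is sqrt(n / (n - 1)), which is about 1.000094 for n = 5320.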

In [65]:
# Scale all columns except the target `quality` to the range [0, 1]
print(df.describe())
features = df.drop(columns=["quality"]).columns
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])
print(df.describe())

       fixed acidity  volatile acidity   citric acid  residual sugar  \
count   5.320000e+03      5.320000e+03  5.320000e+03     5320.000000   
mean    3.205456e-16     -6.410912e-17  2.671213e-17        0.000000   
std     1.000094e+00      1.000094e+00  1.000094e+00        1.000094   
min    -2.588145e+00     -1.570028e+00 -2.164515e+00       -0.988604   
25%    -6.177717e-01     -6.784048e-01 -5.334545e-01       -0.721923   
50%    -1.630701e-01     -2.623138e-01 -5.772841e-02       -0.521912   
75%     3.674151e-01      3.915434e-01  5.539194e-01        0.544812   
max     6.581671e+00      7.346207e+00  9.116989e+00       13.501067   

          chlorides  free sulfur dioxide  total sulfur dioxide       density  \
count  5.320000e+03         5.320000e+03          5.320000e+03  5.320000e+03   
mean   1.282182e-16         8.547883e-17          2.564365e-16  1.857027e-14   
std    1.000094e+00         1.000094e+00          1.000094e+00  1.000094e+00   
min   -1.293816e+00        -1.6
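
Min-max scaling is unchanged by any earlier positive rescaling or shift of a column, so dividing by 1000 and standardizing beforehand should not affect the final [0, 1] values. A minimal sketch on synthetic data, with made-up column names `a` and `b`, that compares the chained pipeline with direct min-max scaling:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data; the columns and their ranges are illustrative only.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {"a": rng.uniform(0, 300, size=50), "b": rng.normal(3.2, 0.2, size=50)}
)

# Path 1: rescale by a constant, standardize, then min-max scale.
chained = MinMaxScaler().fit_transform(StandardScaler().fit_transform(toy / 1000))

# Path 2: min-max scale the raw values directly.
direct = MinMaxScaler().fit_transform(toy)

max_difference = float(np.abs(chained - direct).max())
print(max_difference)
```

The difference is expected to be at the level of floating-point rounding, which suggests that the `MinMaxScaler` step alone determines the scaled feature values.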

In [66]:
# # Standardize features
# features = df.drop(columns=["quality", "red", "white"]).columns
# scaler = StandardScaler()
# df[features] = scaler.fit_transform(df[features])

In [67]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

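# Flag rows where any column (including quality and the one-hot columns)
# lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]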
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)

# View outliers
outlier_rows = df[outliers]
# print(outlier_rows)
# print(outlier_rows.info())
cleaned_df = df[~outliers]
print(cleaned_df.info())
df = cleaned_df


# z_scores = (df - df.mean()) / df.std()

# # Threshold for outliers (e.g., Z > 3 or Z < -3)
# outliers = (z_scores.abs() > 3).any(axis=1)

# # View outliers
# outlier_rows = df[outliers]
# # print(outlier_rows)
# cleaned_df = df[~outliers]
# df = cleaned_df
# print(cleaned_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 4081 entries, 5 to 6496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4081 non-null   float64
 1   volatile acidity      4081 non-null   float64
 2   citric acid           4081 non-null   float64
 3   residual sugar        4081 non-null   float64
 4   chlorides             4081 non-null   float64
 5   free sulfur dioxide   4081 non-null   float64
 6   total sulfur dioxide  4081 non-null   float64
 7   density               4081 non-null   float64
 8   pH                    4081 non-null   float64
 9   sulphates             4081 non-null   float64
 10  alcohol               4081 non-null   float64
 11  quality               4081 non-null   int64  
 12  red                   4081 non-null   float64
 13  white                 4081 non-null   float64
dtypes: float64(13), int64(1)
memory usage: 478.2 KB
None
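
The IQR filter reduces the data from 5320 to 4081 rows, and it is applied to every column, including `quality` and the one-hot columns. As a possible variation (not what the notebook does), the rule can be restricted to chosen measurement columns. A sketch with a hypothetical helper `iqr_outlier_mask` and made-up toy values:

```python
import pandas as pd


def iqr_outlier_mask(frame: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.Series:
    """Return True for rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any given column."""
    subset = frame[columns]
    q1 = subset.quantile(0.25)
    q3 = subset.quantile(0.75)
    iqr = q3 - q1
    return ((subset < q1 - k * iqr) | (subset > q3 + k * iqr)).any(axis=1)


# Toy data: the last row has an extreme chlorides value and a rare quality score.
toy = pd.DataFrame(
    {
        "chlorides": [0.05, 0.06, 0.05, 0.07, 0.06, 0.60],
        "quality": [5, 6, 5, 6, 5, 9],
    }
)

# Only chlorides is checked; quality does not influence the mask.
mask = iqr_outlier_mask(toy, columns=["chlorides"])
print(toy[~mask])
```

Restricting the columns keeps rare target values from being removed only because they are rare.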


In [68]:
# move quality to the last column by swapping it with white
cols = list(df.columns)
cols[-3], cols[-1] = cols[-1], cols[-3]
df = df[cols]
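
The positional swap depends on `quality` being third from the end and leaves the order `white`, `red`, `quality`. A name-based variant, shown here as a sketch on a made-up one-row frame, does not depend on column positions and keeps the other columns in their original order:

```python
import pandas as pd

# Toy frame with the same trailing columns as the notebook's table.
toy = pd.DataFrame({"alcohol": [9.4], "quality": [5], "red": [1], "white": [0]})

target = "quality"
ordered = [c for c in toy.columns if c != target] + [target]
toy = toy[ordered]

print(list(toy.columns))
```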

### Save prepared data

In [69]:
# df = df.sample(frac = 1)

In [70]:
# save the prepared data without the index column
df.to_csv("data/wine_prepared.csv", index=False)

In [46]:
print(df.describe())
print(df.info())
print(df.head())

       fixed acidity  volatile acidity   citric acid  residual sugar  \
count   5.320000e+03      5.320000e+03  5.320000e+03    5.320000e+03   
mean    1.702898e-16      6.878374e-17 -3.906649e-17    4.674623e-17   
std     1.000094e+00      1.000094e+00  1.000094e+00    1.000094e+00   
min    -2.588145e+00     -1.570028e+00 -2.164515e+00   -9.886039e-01   
25%    -6.177717e-01     -6.784048e-01 -5.334545e-01   -7.219228e-01   
50%    -1.630701e-01     -2.623138e-01 -5.772841e-02   -5.219120e-01   
75%     3.674151e-01      3.915434e-01  5.539194e-01    5.448122e-01   
max     6.581671e+00      7.346207e+00  9.116989e+00    1.350107e+01   

          chlorides  free sulfur dioxide  total sulfur dioxide       density  \
count  5.320000e+03         5.320000e+03          5.320000e+03  5.320000e+03   
mean  -5.142086e-17        -6.945155e-17          3.606138e-17 -4.273941e-17   
std    1.000094e+00         1.000094e+00          1.000094e+00  1.000094e+00   
min   -1.293816e+00        -1.6