<h1 style = "color : #0EE071; text-align : center;"><em>Nata Project</em> - Data Preprocessing Notebook</h1>
<p style = "font-size : 16px; text-align: center;">This notebook prepares the <code>Nata_files/learn.csv</code> dataset for the learning model.</p>
<br>
<p style = "font-size : 12px; text-align: center;"><b>NOVA IMS</b></p>
<p style = "font-size : 10px; text-align: center;">Machine Learning I</p>
<p style = "font-size : 10px; text-align: center;">Diogo Gonçalves, João Marques, Juan Mendes, Gustavo Franco & Lucas Casimiro</p>
<br>

THINGS NOT TO FORGET IN NOTEBOOK 1
- show the unique strings in the origin and pastry_type columns
- outliers
- missing values
- column datatypes (none were corrected because we don't know whether that is needed)
- apply the IQR rule to identify outliers (for that reason, only the obvious ones have been cleaned so far)
- learn how to handle skewness, e.g. with log transformations
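
The IQR rule mentioned in the notes above could be sketched roughly as follows. This is only an illustration on hypothetical toy values; the helper name `iqr_outlier_mask` and the conventional multiplier `k = 1.5` are assumptions, not something taken from NB1.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical toy values, not taken from learn.csv
toy = pd.Series([180.0, 185.0, 190.0, 195.0, 200.0, 575.0], name="oven_temperature")
outliers = iqr_outlier_mask(toy)

# Show only the values flagged as outliers
print(toy[outliers])
```

Values flagged this way are only candidates for inspection; whether to mask, cap, or keep them remains a modelling decision.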



## <a class="anchor" id="0th-bullet">Table of Contents</a>


* [<b>1. Importing the dataset and needed libraries</b>](#1st-bullet)<br>
    
    
* [<b>2. Initial Data Cleaning</b>](#2nd-bullet)<br>
    * [2.1 Standardizing texts](#3rd-bullet)<br>
    * [2.2 Target Handling and Dropping Irrelevant Columns](#4th-bullet)<br>
    * [2.3 Outlier "Masking"](#5th-bullet)<br>


* [<b>3. Data Partitioning</b>](#16th-bullet)<br>


* [<b>4. Data Imputation</b>](#16th-bullet)<br>


* [<b>5. Encoding Categorical Variables</b>](#16th-bullet)<br>


* [<b>6. Exports</b>](#16th-bullet)<br>

<h2  style = "color : #0EE071;"> 1. Importing the dataset and needed libraries</h2>

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

Reading the dataset and making a copy to work on, instead of altering the original.

In [4]:
learn_data = pd.read_csv("Nata_files/learn.csv", sep = ",", index_col = 0)
learn_data_copy = learn_data.copy()

<hr style = "border: 3px solid #0EE071;">
<h2 style = "color : #0EE071;">2. Initial Data Cleaning </h2>
<p style = "font-size : 15px;">Handling any outliers or troublesome data points identified in NB1.</p>

<h3 style="color: #0EE071;">2.1 Standardizing texts</h3>

First, we convert the 'origin' and 'pastry_type' values to a standardized form (e.g., turning 'Lisboa', 'LISBOA', and 'lisboa' into the same value), since the inconsistent spellings were identified as a problem in NB1.

In [5]:
# Turning every string in the 'origin' and 'pastry_type' columns to lowercase and stripping whitespace
for col in ['origin', 'pastry_type']:
    # astype(str) turns missing values into the string 'nan', so they are restored to NaN afterwards
    learn_data_copy[col] = learn_data_copy[col].astype(str).str.lower().str.strip().replace('nan', np.nan)

# Merging the misspelled 'pastel nata' label into 'pastel de nata'
learn_data_copy['pastry_type'] = learn_data_copy['pastry_type'].replace({'pastel nata': 'pastel de nata'})
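
To see what the standardization does, here is a minimal sketch on hypothetical toy values (not rows from `learn.csv`). It calls the `.str` methods directly, without `astype(str)`, which is a slight variation on the cell above: on a pure string column these methods leave missing values as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values mimicking inconsistent spellings
toy = pd.DataFrame({"origin": ["Lisboa", " LISBOA", "lisboa ", "Porto", np.nan]})

# Before cleaning, each spelling counts as a separate category
print(toy["origin"].unique())

# .str methods skip missing values, so NaN stays NaN
cleaned = toy["origin"].str.lower().str.strip()

# After cleaning, only 'lisboa', 'porto' and the missing value remain
print(cleaned.value_counts(dropna=False))
```

Comparing the unique values before and after is a quick check that no unexpected spelling survived the cleaning.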

<h3 style="color: #0EE071;">2.2 Target Handling and Dropping Irrelevant Columns </h3>

There is a missing value in the 'quality_class' column, which is the target variable, so we will drop that whole row as it won't be used for the model.

In [6]:
learn_data_copy.dropna(subset=['quality_class'], inplace=True)

As seen in NB1, the 'notes_baker' column has no values, so it will be dropped.

In [7]:
learn_data_copy.drop('notes_baker', axis=1, inplace=True)

<h3 style="color: #0EE071;">2.3 Outlier "Masking"</h3>

We start by replacing the physically impossible values identified in NB1 with NaN, so they can be imputed later.

In [8]:
# Setting these values to NaN instead of removing the rows, to avoid losing too much data

# Sugar content cannot exceed 100g per 100g
learn_data_copy.loc[learn_data_copy['sugar_content'] > 100, 'sugar_content'] = np.nan

# Fat percentage cannot exceed 100%
learn_data_copy.loc[learn_data_copy['cream_fat_content'] > 100, 'cream_fat_content'] = np.nan

# Salt > 100g per kg is inedible 
learn_data_copy.loc[learn_data_copy['salt_ratio'] > 100, 'salt_ratio'] = np.nan

# Eggs cook at ~65C. 170C or 575C is impossible for raw egg addition
learn_data_copy.loc[learn_data_copy['egg_temperature'] > 100, 'egg_temperature'] = np.nan

# Oven/Final temp > 500C is likely an error 
learn_data_copy.loc[learn_data_copy['final_temperature'] > 500, 'final_temperature'] = np.nan
learn_data_copy.loc[learn_data_copy['oven_temperature'] > 500, 'oven_temperature'] = np.nan
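
The same masking can be written more compactly with a dictionary of upper bounds. The sketch below uses hypothetical toy rows and only three of the thresholds from the cell above; `upper_bounds` and `masked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Upper bounds copied from the thresholds used in the cell above
upper_bounds = {"sugar_content": 100, "egg_temperature": 100, "oven_temperature": 500}

# Hypothetical toy rows, each containing one impossible value in a different column
toy = pd.DataFrame({
    "sugar_content": [35.0, 120.0, 40.0],
    "egg_temperature": [20.0, 22.0, 170.0],
    "oven_temperature": [575.0, 250.0, 240.0],
})

masked = toy.copy()
for col, limit in upper_bounds.items():
    # Series.mask replaces the values where the condition is True with NaN
    masked[col] = masked[col].mask(masked[col] > limit)

# Every row is kept; only the impossible cells become NaN
print(masked.isna().sum())
```

Keeping the limits in one dictionary makes it easier to review or adjust them in a single place.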

<hr style = "border: 3px solid #0EE071;">
<h2 style = "color : #0EE071;">3. Data Partitioning</h2>
<p style = "font-size : 15px;">Before we perform any imputation or scaling, we must split our dataset. This is crucial to prevent <b>Data Leakage</b>: statistics such as means and medians must be computed on the training set only. We will also encode the target variable as 0/1 in this section, so that stratification is applied to the encoded values.</p>

In [9]:
# Define X (features)
X = learn_data_copy.drop('quality_class', axis=1)
# Define y (target), manually mapping it to ensure '1' is the positive class 'OK'
y = learn_data_copy['quality_class'].map({'OK': 1, 'KO' : 0})

# Split into training and testing sets, using stratify to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Checking that the class distribution is preserved in both sets
print("Training Set Class Distribution:")
print(y_train.value_counts(normalize=True))
print("\nTest Set Class Distribution:")
print(y_test.value_counts(normalize=True))

Training Set Class Distribution:
quality_class
1    0.635008
0    0.364992
Name: proportion, dtype: float64

Test Set Class Distribution:
quality_class
1    0.635577
0    0.364423
Name: proportion, dtype: float64


Note: We are performing a 2-way split (Train/Test). We will not create a validation set here. Instead, we'll use k-Fold Cross-Validation on the training set during the Modeling and Tuning phases. This ensures we maximize the data available for training while maintaining a robust evaluation protocol.
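
As a rough illustration of the k-Fold Cross-Validation mentioned in the note, the sketch below runs a stratified 5-fold evaluation on synthetic data. The model (`LogisticRegression`), the F1 metric, and the number of folds are assumptions chosen for the example, not decisions made in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training set (hypothetical data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold; each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

In the later notebooks, `X_train` and `y_train` would take the place of the synthetic arrays, while the test set stays untouched until the final evaluation.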

<hr style = "border: 3px solid #0EE071;">
<h2 style = "color : #0EE071;">4. Data Imputation</h2>
<p style = "font-size : 15px;">To deal with missing values, we will use either mean or median for numerical columns (depending on skewness), and use the mode for categorical variables.</p>

In [10]:
# Identifying the numerical columns
num_cols = X_train.select_dtypes(include=['float64']).columns

# Using mean or median to fill missing values in numerical columns based on skewness
for col in num_cols:
    skewness = X_train[col].skew()

    # If skewness is between -0.5 and 0.5, we consider it low and use the mean; otherwise, the median
    if abs(skewness) < 0.5:
        method = 'mean'
        fill_value = X_train[col].mean()
    else:
        method = 'median'
        fill_value = X_train[col].median()

    # The fill value is computed on the training set only and reused for the test set
    X_train[col] = X_train[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)

    # Clarifying which method was used for each column, along with the skewness value
    print(f'{col} - Skewness: {skewness}')
    print(f' - Missing values filled using {method}')

ambient_humidity - Skewness: 0.02024732522559659
 - Missing values filled using mean
baking_duration - Skewness: 1.5538202343798468
 - Missing values filled using median
cooling_period - Skewness: 0.3068104917121074
 - Missing values filled using mean
cream_fat_content - Skewness: -0.6072301918569015
 - Missing values filled using median
egg_temperature - Skewness: -0.02416030035571224
 - Missing values filled using mean
egg_yolk_count - Skewness: 0.5270916354134263
 - Missing values filled using median
final_temperature - Skewness: -0.05417159399359049
 - Missing values filled using mean
lemon_zest_ph - Skewness: 0.3401727412742592
 - Missing values filled using mean
oven_temperature - Skewness: -0.06937961360152138
 - Missing values filled using mean
preheating_time - Skewness: 1.627639439365064
 - Missing values filled using median
salt_ratio - Skewness: 0.7822309783435281
 - Missing values filled using median
sugar_content - Skewness: 1.005713226847118
 - Missing values filled usin
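
The "fit on train, apply to test" pattern used above can also be expressed with scikit-learn's `SimpleImputer`. This is shown only as an alternative sketch on hypothetical toy values; the notebook itself uses `fillna`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each set
train = pd.DataFrame({"baking_duration": [20.0, 22.0, np.nan, 60.0]})
test = pd.DataFrame({"baking_duration": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training data only
train_filled = imputer.fit_transform(train)

# transform reuses that training median, so no test information leaks in
test_filled = imputer.transform(test)

print(imputer.statistics_)
```

Because the imputer stores the learned statistic, the same object can later be applied to any new data without recomputing it.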

In [11]:
# Using the mode to fill missing values in categorical columns
cat_cols = ['origin', 'pastry_type']
for col in cat_cols:
    X_train[col] = X_train[col].fillna(X_train[col].mode()[0])
    X_test[col] = X_test[col].fillna(X_train[col].mode()[0])

<hr style = "border: 3px solid #0EE071;">
<h2 style = "color : #0EE071;">5. Encoding categorical variables </h2>
<p style = "font-size : 15px;">Here we turn the values of the categorical columns into numbers. In this case, we only have the 'origin' column (feature), and since it has only 2 values after the standardization ('lisboa' and 'porto'), we will use One-Hot Encoding with a single binary column, via <code>pd.get_dummies</code>.</p>

In [12]:
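# drop_first=True creates just one column instead of two
X_train = pd.get_dummies(data=X_train, columns = ['origin'], drop_first = True)
# Applying the same encoding to the test set and aligning its columns with the training set
X_test = pd.get_dummies(data=X_test, columns = ['origin'], drop_first = True).reindex(columns = X_train.columns, fill_value = 0)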
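
One point to watch with `pd.get_dummies` is that it only creates columns for the categories present in the data it receives, so train and test can end up with different columns. A minimal sketch on hypothetical toy data, using `reindex` to align them:

```python
import pandas as pd

# Hypothetical toy data: the test set happens to contain only 'lisboa'
train = pd.DataFrame({"origin": ["lisboa", "porto", "lisboa"], "sugar_content": [30.0, 35.0, 32.0]})
test = pd.DataFrame({"origin": ["lisboa", "lisboa"], "sugar_content": [31.0, 33.0]})

train_enc = pd.get_dummies(train, columns=["origin"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["origin"], drop_first=True)

# With a single category and drop_first=True, the test set gets no origin_* column at all,
# so we reindex it to the training columns and fill the missing dummy with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After the alignment both sets expose the same columns in the same order, which is what a fitted model expects at prediction time.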

<hr style = "border: 3px solid #0EE071;">
<h2 style = "color : #0EE071;">6. Exports</h2>
<p style = "font-size : 15px;">Exporting the training and testing datasets, so they can be used in the next notebooks.</p>

In [14]:
#Saving to the 'data_clean' folder inside 'Nata_files'
X_train.to_pickle('Nata_files/data_clean/X_train_clean.pkl')
X_test.to_pickle('Nata_files/data_clean/X_test_clean.pkl')
y_train.to_pickle('Nata_files/data_clean/y_train_clean.pkl')
y_test.to_pickle('Nata_files/data_clean/y_test_clean.pkl')
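
In the next notebooks, the exported files can be loaded back with `pd.read_pickle`. The sketch below demonstrates the round trip in a temporary folder with a hypothetical toy DataFrame, so it does not depend on the project files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for X_train
demo = pd.DataFrame({"sugar_content": [30.0, 35.0], "origin_porto": [False, True]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_train_clean.pkl")
    demo.to_pickle(path)
    # read_pickle restores the DataFrame together with its dtypes and index
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Unlike CSV, the pickle format preserves column dtypes and the index, so no re-parsing is needed after loading.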