## <font color="orange">**Step 3: Data Preparation**</font>

In this step, we preprocess the dataset to ensure it is clean, structured, and suitable for training a machine learning model.

### <font color="green">**3.1 Select Data (Handling Missing Values and Feature Selection)**</font>

In this step, we carefully select the relevant features from the dataset for further processing, handle the missing data identified earlier, and remove any irrelevant or redundant features. 

I also decided not to perform additional **feature engineering** in the Penguins Classification Project. Feature engineering can sometimes improve model performance by creating new variables, but in this case it is unlikely to add significant value for two reasons:

- **The dataset already contains strong predictive features.** The existing features (bill measurements, body mass, and categorical variables like island and sex) may already provide a clear distinction between the penguin species.
- **Earlier analysis supports this.** Prior visualizations (e.g., pairplots, correlation heatmaps, and decision tree analysis) indicate that these features are good indicators for classification.

The dataset is therefore already well structured for classification.

---

#### **3.1.1 Handling Missing Values**

There are two common methods for handling missing data:

- Removal: Drop the rows that contain missing values.
- Imputation: Fill in missing values using statistical methods.

**Dataset Size:**
- Removal would slightly reduce dataset size, still maintaining sufficient data.
- Imputation preserves all 344 records, slightly adjusting data distributions but retaining maximum information.

**Impact on Model Accuracy:**
- Given our relatively small proportion of missing data, both approaches would have a minimal impact. However, removing data could unintentionally introduce slight biases or reduce class representativeness.
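
The trade-off described above can be checked directly. The following is a minimal sketch on a toy frame with hypothetical values (not the real penguins data); the helper `class_share` is an illustrative name, not part of the project code. It shows how row removal can shrink or even eliminate a class, while imputation keeps every row.

```python
import pandas as pd

# Toy data with hypothetical values (not the real penguins dataset)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, None, 3250.0, 5000.0, 5200.0, None],
})

def class_share(frame: pd.DataFrame) -> pd.Series:
    """Return the proportion of rows per species (hypothetical helper)."""
    return frame["species"].value_counts(normalize=True).sort_index()

# Option 1: removal drops every row that has a missing value
dropped = toy.dropna()

# Option 2: imputation keeps every row and fills the gaps with the median
imputed = toy.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].median())

print("Rows after removal:", len(dropped), "| rows after imputation:", len(imputed))
print(class_share(dropped))
print(class_share(imputed))
```

In this toy case the only Chinstrap row is lost under removal, which is the kind of shift in class representation the decision below tries to avoid.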

**Final Decision: Imputation**

I chose imputation as the preferred method due to:
- Retaining maximum available data for training and validation.
- Maintaining representative class proportions for all penguin species.
- Minimizing the risk of potential bias and loss of information.

#### **Methodology for Handling Missing Values:**

**Using Median for Numerical Features**
- The missing values in bill length, bill depth, flipper length, and body mass will be replaced with the median because:
  - **Robustness to Outliers:** The median is less sensitive to outliers or extreme values, making it a safer choice given that our exploratory analysis identified potential outliers.
  - **Preserving Data Integrity:** Using the median maintains the natural central tendency of the data distribution without being significantly influenced by extreme data points.
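
To make the robustness argument concrete, here is a small sketch with hypothetical body-mass values (they are not taken from the dataset). A single extreme value shifts the mean noticeably but leaves the median at the middle observation.

```python
import pandas as pd

# Hypothetical body-mass values in grams; the last one is an extreme value
masses = pd.Series([3400.0, 3500.0, 3600.0, 3700.0, 6300.0])

mean_fill = masses.mean()      # pulled upward by the extreme value
median_fill = masses.median()  # stays at the middle observation

print(f"Fill value using mean:   {mean_fill:.0f} g")
print(f"Fill value using median: {median_fill:.0f} g")
```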

**Using Mode for Categorical Features**
- The sex column contains missing values, and since it is categorical, we use the mode (most frequent value) because:
  - With only a small proportion of values missing, it leaves the observed distribution of male and female categories largely unchanged.
  - It is deterministic, so we avoid the noise that assigning random values would introduce.



In [26]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("../data/raw/penguins_dataset_w_target.csv")

# Fill missing numerical values with the median of each column
df.loc[:, "bill_length_mm"] = df["bill_length_mm"].fillna(df["bill_length_mm"].median())
df.loc[:, "bill_depth_mm"] = df["bill_depth_mm"].fillna(df["bill_depth_mm"].median())
df.loc[:, "flipper_length_mm"] = df["flipper_length_mm"].fillna(df["flipper_length_mm"].median())
df.loc[:, "body_mass_g"] = df["body_mass_g"].fillna(df["body_mass_g"].median())

# Fill missing categorical values with mode
df.loc[:, "sex"] = df["sex"].fillna(df["sex"].mode()[0])

# Save the cleaned dataset to the processed data folder
df.to_csv("../data/processed/Secondcleaned_dataset_filling_miss_values.csv", index=False)

# Verify no missing values remain
print(df.isna().sum())

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64


#### **3.1.2 Initial Feature Selection**
Based on insights from Data Understanding (pairplots, heatmaps, bar charts, and boxplots), we retain and document the features that are valuable for classification.

**Features to Retain:**
- Bill Length (provides moderate species differentiation)
- Bill Depth (moderate differentiation, especially between Adelie & Chinstrap)
- Flipper Length (highly differentiating, especially for Gentoo penguins)
- Body Mass (strongly differentiating Gentoo penguins)
- Island (categorical, highly predictive of the presence of specific species, such as Gentoo penguins, which are found only on Biscoe Island)
- Sex (categorical, supportive feature)

No features are dropped at this stage: all available features show sufficient predictive potential or play a supporting role.

Note: Further feature refinement may occur after the modeling phase evaluations.

---

### <font color="green">**3.2 Clean Data (Outlier Detection and Handling)**</font>

In the Data Understanding phase (Section 2.3.5), we visually analyzed the feature distributions using boxplots, which allowed us to identify potential outliers.
The boxplots showed that each numerical feature (bill length, bill depth, flipper length, and body mass) has a few data points that appear as outliers.
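
The visual check can also be expressed numerically. The sketch below assumes the conventional 1.5 × IQR whisker rule (the default in matplotlib and seaborn boxplots); the function name `iqr_outliers` and the sample values are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical flipper lengths in mm (not taken from the dataset)
flippers = pd.Series([181.0, 186.0, 190.0, 193.0, 195.0, 197.0, 240.0])

flagged = iqr_outliers(flippers)
print(flagged)
```

Applied per numerical column, a helper like this would list the same points that appear beyond the whiskers in a default boxplot.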

**Decision: Retain Identified Outliers**

The reasons:
- **Natural Variability:** Outliers in the penguin dataset represent real, natural differences among penguins, not errors.
- **Limited Influence:** These few outliers do not strongly affect our analysis or distort the data. Keeping them preserves important natural differences, which should help the classification models reflect real penguins more faithfully.


---

### <font color="green">**3.3 Construct Data (Categorical Feature Encoding)**</font>

Categorical variables cannot be directly used by many machine learning algorithms. In this step, we convert these categorical features (island, sex) into numeric formats.

#### **Why This Step is Important:**
- Machine learning algorithms generally require numerical inputs.
- An appropriate encoding avoids implying relationships that do not exist in the data and keeps the resulting features interpretable.
  
**Categorical Features to Encode:**
- Island (three categories: Biscoe, Dream, Torgersen)
- Sex (two categories: male, female)

**Encoding Method: One-Hot Encoding**

We use one-hot encoding to represent these categorical features numerically because:

- One-Hot Encoding (OHE) assigns each category a separate binary column, ensuring that the model treats them independently rather than assigning them an arbitrary numerical value.
- Label Encoding (assigning 0,1,2) is not suitable here because it introduces an artificial ordinal relationship (e.g., "Biscoe" < "Dream" < "Torgersen"), which does not exist in reality.
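
The difference between the two encodings can be seen on a tiny example. This is a sketch on a hypothetical four-row frame, using only `pd.get_dummies` and pandas category codes; it is not the project's encoding cell.

```python
import pandas as pd

toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# Label-style codes: compact, but they imply Biscoe < Dream < Torgersen
label_codes = toy["island"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["island"])

# drop_first=True removes the first category (Biscoe); a row with all
# remaining island columns set to False then stands for Biscoe
one_hot_reduced = pd.get_dummies(toy, columns=["island"], drop_first=True)

print(label_codes.tolist())
print(one_hot.columns.tolist())
print(one_hot_reduced.columns.tolist())
```

Dropping the first category loses no information and avoids a redundant column, which matters mainly for linear models.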


In [5]:
df = pd.read_csv("../data/processed/Secondcleaned_dataset_filling_miss_values.csv")

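# Perform one-hot encoding; drop_first=True drops the first category of each
# feature (Biscoe, Female), which is implied when the remaining columns are False
df = pd.get_dummies(df, columns=["island", "sex"], drop_first=True)

# Save the encoded dataset
df.to_csv("../data/processed/cleaned_dataset_encoding2.csv", index=False)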

# Verify the encoding
df.head()

| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | island_Dream | island_Torgersen | sex_Male |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | False | True | True |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | False | True | False |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | False | True | False |
| 3 | Adelie | 44.45 | 17.3 | 197.0 | 4050.0 | False | True | True |
| 4 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | False | True | False |


---

### <font color="green">**3.4 Format Data (Feature Scaling/Normalization)**</font>

In this step, we scale numerical features to ensure each feature contributes equally to the machine learning model. Different features have different scales and ranges, which might bias our classification algorithms.

#### **Why Scaling/Normalization is Important?**

Feature scaling ensures that numerical features are on a similar scale, preventing models from giving more importance to features with larger values. Since bill length, bill depth, flipper length, and body mass have different units and ranges, scaling helps improve model performance.

#### **Chosen Approach: Standardization (Z-score Scaling)**

The reasons:  
- It centers data to have a mean of 0 and a standard deviation of 1, making features comparable.
- It works well with most machine learning models, especially distance-based models like KNN and SVM.
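- It is less distorted by outliers than Min-Max scaling, which compresses the remaining values into a narrow range whenever a single extreme value sets the minimum or maximum.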
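
As a quick check of what standardization computes, the sketch below applies the z-score formula, z = (x - mean) / std, by hand and compares it with `StandardScaler`. The five body-mass values are the first rows displayed earlier in this section; using only five rows is an illustrative simplification, so the results differ from the full-dataset output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Body mass (g) of the first five rows shown in the encoding output
values = np.array([[3750.0], [3800.0], [3250.0], [4050.0], [3450.0]])

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(3))
```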

In [13]:
df = pd.read_csv("../data/processed/cleaned_dataset_encoding2.csv")

# Select numerical features to scale
numerical_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Apply StandardScaler
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Save the scaled dataset, then display the first rows
df.to_csv("../data/processed/cleaned_dataset_scaling2.csv", index=False)
df.head()


| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | island_Dream | island_Torgersen | sex_Male |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | -0.887622 | 0.787289 | -1.420541 | -0.564625 | False | True | True |
| 1 | Adelie | -0.814037 | 0.126114 | -1.063485 | -0.50201 | False | True | False |
| 2 | Adelie | -0.666866 | 0.431272 | -0.420786 | -1.190773 | False | True | False |
| 3 | Adelie | 0.096581 | 0.075255 | -0.277964 | -0.188936 | False | True | True |
| 4 | Adelie | -1.329133 | 1.092447 | -0.563608 | -0.940314 | False | True | False |


### <font color="green">**3.5 Data Splitting (Train/Test Split)**</font>

#### **Why Do We Split the Data?**
To build a reliable machine learning model, we divide our dataset into training and testing sets. This lets us evaluate the model's performance on unseen data and detect overfitting.

#### **How Did We Split the Data?**
We used a train-test split, which separates the dataset into:
- **Training Set (X_train3, y_train3)**: Used to train the model.
- **Testing Set (X_test3, y_test3)**: Used to evaluate model performance.

#### **Why is this important?**
- **Detects Overfitting**: A model may memorize the training data instead of learning general patterns; a held-out test set reveals when this happens.
- **Ensures Generalization**: Testing on unseen data shows how well the model performs on real-world cases.
- **Allows Fair Model Evaluation**: If we test on the same data used for training, performance metrics will be misleading.
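
A related point, offered as a sketch and not as the workflow used in this notebook: a common way to keep the test set fully unseen is to fit the scaler on the training rows only and reuse its statistics for the test rows. The data and the names `X_demo`, `y_demo`, and `demo_scaler` below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with two balanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=4200.0, scale=800.0, size=(100, 1))
y_demo = np.repeat([0, 1], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn the mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(X_tr)

# Apply the same training statistics to both sets
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train mean after scaling:", round(float(X_tr_scaled.mean()), 6))
print("Test mean after scaling: ", round(float(X_te_scaled.mean()), 6))
```

The test mean will generally not be exactly zero, which is expected: the test rows did not contribute to the scaling statistics.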



In [27]:
# Reload the scaled dataset so the split uses the encoded, standardized features
df = pd.read_csv("../data/processed/cleaned_dataset_scaling2.csv")

# Define X (features) and y (target)
X = df.drop(columns=["species"])  # Drop target column
y = df["species"]

# Split data into training (80%) and test (20%) sets, stratified by species
X_train3, X_test3, y_train3, y_test3 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Confirm split sizes
print("Training set size:", X_train3.shape)
print("Testing set size:", X_test3.shape)

Training set size: (275, 7)
Testing set size: (69, 7)


#### **Why Use Stratification?**
Stratified splitting ensures each class (e.g., Adelie, Chinstrap, Gentoo) is represented in the training and test sets in the same proportions as in the full dataset. It does not remove class imbalance, but it keeps the imbalance consistent across both sets, which makes the evaluation fair.

For example:
- Without stratification, one species could be overrepresented in either the training or test set.
- Using stratify=y keeps the species proportions the same across both sets.
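
The effect of stratification can be verified by comparing class proportions after the split. This sketch uses hypothetical class counts (50/30/20) chosen so the arithmetic is easy to follow; `labels`, `features`, and the other names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 50 Adelie, 30 Gentoo, 20 Chinstrap
labels = pd.Series(["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20, name="species")
features = pd.DataFrame({"row_id": range(len(labels))})

_, _, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Proportion of each species in the two sets
train_share = labels_train.value_counts(normalize=True).sort_index()
test_share = labels_test.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Both columns show the same 50/30/20 proportions as the full set of labels; without `stratify`, the proportions would vary with the random seed.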

In [24]:
X_train3.to_csv("../data/processed/X_train3.csv", index=False)
X_test3.to_csv("../data/processed/X_test3.csv", index=False)
y_train3.to_csv("../data/processed/y_train3.csv", index=False)
y_test3.to_csv("../data/processed/y_test3.csv", index=False)