# <span style="color:gold;">Penguin Species Dataset – Exploratory Overview</span>

The **Palmer Penguins dataset** is a well-regarded resource in ecological and data science studies.  
It contains physical measurements of three penguin species, collected from the **Palmer Archipelago** in Antarctica:

- **Adélie**  
- **Chinstrap**  
- **Gentoo**

This dataset is often presented as an alternative to the *Iris dataset* for classification tasks, especially when introducing supervised learning concepts in Python.

---

## <span style="color:deepskyblue;">Dataset Structure</span>

The dataset contains the following key features:

| Feature              | Description |
|----------------------|-------------|
| **species**          | Target variable indicating penguin species. |
| **island**           | Island of observation (*Biscoe*, *Dream*, *Torgersen*). |
| **culmen_length_mm** | Length of the culmen (the upper ridge of the bill) in millimeters. |
| **culmen_depth_mm**  | Depth of the culmen in millimeters. |
| **flipper_length_mm**| Length of the flipper in millimeters. |
| **body_mass_g**      | Body mass in grams. |
| **sex**              | Biological sex (`MALE` / `FEMALE`). |

---

## <span style="color:seagreen;">Why This Dataset for Decision Trees?</span>

> A **Decision Tree Classifier** works by splitting the dataset into subsets based on feature values, aiming to separate the target classes as effectively as possible.

**Reasons this dataset is ideal:**
- Well-separated classes based on physical measurements.
- Contains a mix of **numerical** and **categorical** variables.
- Small dataset size → quick training and easy visualization.
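
To make "separating the target classes" concrete, here is a minimal sketch of how a candidate split can be scored with Gini impurity, the measure behind scikit-learn's default `criterion='gini'`. The helper names `gini` and `weighted_gini` and the toy labels are assumptions made for illustration; they are not part of this notebook or of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical toy labels, not taken from the penguin data.
parent = ["Adelie", "Adelie", "Gentoo", "Gentoo"]
good_split = (["Adelie", "Adelie"], ["Gentoo", "Gentoo"])
poor_split = (["Adelie", "Gentoo"], ["Adelie", "Gentoo"])

print(gini(parent))                # 0.5
print(weighted_gini(*good_split))  # 0.0
print(weighted_gini(*poor_split))  # 0.5
```

A tree prefers the split with the lowest weighted impurity, so in this toy case it would pick `good_split`, which leaves both children pure.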

---

## <span style="color:orangered;">Possible Challenges</span>

- Missing values in certain features (e.g., `sex`, `culmen_length_mm`).
- Measurement overlaps between species, leading to borderline classifications.
- Need for categorical encoding before feeding into a scikit-learn Decision Tree.

---

## <span style="color:purple;">Objective</span>

This notebook aims to:
1. Perform **Exploratory Data Analysis (EDA)** to explore feature relationships.
2. Preprocess data (handle missing values, encode categorical features).
3. Train a **Decision Tree Classifier** to predict species.
4. Evaluate the model and visualize the decision process.

By the end, we will not only have built a working model but also gained deeper insight into how classification algorithms learn from biological datasets.


## IMPORT IMPORTANT LIBRARIES

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML WORK/DECISION TREE ON PENGUINS_SIZE DATASET/penguins_size.csv')

In [None]:
df.head()

# EXPLORATORY DATA ANALYSIS (EDA)

### Missing Data

The goal is to build a model for future use, so rows that are missing crucial measurements will not help. We also assume that future observations will be recorded with all relevant features present.

In [None]:
df.info()

In [None]:
df.isna().sum()

There are missing values in the data.

In [None]:
# What percentage are we dropping?
100*(10/344)

**Handling Missing Values**

Upon inspection, approximately **2.9% of the rows** (10 of 344) contain missing values.  
Because this is a small share of the dataset, removing these rows is unlikely to introduce meaningful bias or noticeably change the overall distribution of the data.

Therefore, rather than imputing these values, we will **drop the rows containing null entries** to maintain dataset integrity and ensure cleaner preprocessing for the Decision Tree model.
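
As a rough sketch, the share of rows that `dropna()` would remove can be computed from the data instead of typed in by hand. The frame `toy` below is a hypothetical stand-in with made-up values, not the penguin data; the same expressions should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguin data.
toy = pd.DataFrame({
    "culmen_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

# A row counts as incomplete if any of its columns is null.
rows_with_na = toy.isna().any(axis=1).sum()
pct_dropped = 100 * rows_with_na / len(toy)
print(f"{rows_with_na} of {len(toy)} rows ({pct_dropped:.1f}%) would be dropped")

# dropna() with default arguments removes exactly those rows.
cleaned = toy.dropna()
```

Counting rows with `isna().any(axis=1)` matters because `isna().sum()` counts missing cells per column, and a single row can be missing several columns at once.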


In [None]:
# Drop all rows that contain null values
df = df.dropna()

In [None]:
#RECHECK THE DATA NOW
df.info()

The data now contains no null values.

In [None]:
df.head()

In [None]:
df['sex'].unique()

## **Data Anomaly in `sex` Feature**

During the exploratory analysis, an unusual category `"."` was detected in the `sex` column.  
This is **not a valid biological classification** for penguin sex and likely originates from one of the following causes:

- **Data entry error** during field recording.  
- **Placeholder or missing value marker** used by the original data collector.  
- **Parsing issue** during dataset compilation or export.

After reviewing the corresponding records and feature distributions, it was observed that these entries align more closely with the characteristics of female penguins in terms of body mass, flipper length, and bill measurements.

Therefore, for consistency and to preserve the data, these anomalous entries will be **reclassified under the `"FEMALE"` category** (the label used in the data) rather than being discarded.
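
One possible way to locate and relabel such entries is a boolean mask over the set of valid labels. This is only a sketch on a hypothetical toy frame with made-up values and index labels. It assumes null values were already dropped, because a null would also fail the `isin` check and be flagged.

```python
import pandas as pd

# Hypothetical toy frame; values and index labels are illustrative only.
toy = pd.DataFrame(
    {
        "species": ["Gentoo", "Gentoo", "Gentoo"],
        "body_mass_g": [5700.0, 4875.0, 4650.0],
        "sex": ["MALE", ".", "FEMALE"],
    },
    index=[10, 11, 12],
)

valid = {"MALE", "FEMALE"}

# True for every row whose label is not one of the valid categories.
invalid_mask = ~toy["sex"].isin(valid)
print(toy[invalid_mask])

# Relabel only the flagged rows, leaving all other rows untouched.
toy.loc[invalid_mask, "sex"] = "FEMALE"
```

Whether `"FEMALE"` is the right replacement is still a judgment call that should rest on inspecting the affected rows first.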


In [None]:
df[df.species =="Gentoo"].groupby("sex").describe().T

In [None]:
df.at[336, "sex"] = "FEMALE"

In [None]:
df.at[336, "sex"]

In [None]:
df['island'].unique()

# **Visualization**

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='sex', data=df, hue='sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.title('Distribution of Categories in "sex" column')

plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='species', data=df, hue='species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.title('Distribution of Categories in "species" column')

plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='island', data=df, hue='island')
plt.xlabel('Island')
plt.ylabel('Count')
plt.title('Distribution of Categories in "island" column')

plt.xticks(rotation=90)
plt.show()

In [None]:
import plotly.express as px
fig = px.scatter_3d(df,
                    x='culmen_length_mm',
                    y='flipper_length_mm',
                    z='culmen_depth_mm',
                    color='species')
fig.show();

In [None]:
sns.pairplot(df, hue='species');

In [None]:
sns.catplot(
    x='species',
    y='culmen_length_mm',
    hue='species',
    data=df,
    kind='box',
    col='sex',
   )

# FEATURE ENGINEERING

In [None]:
# The model cannot train on string data, so convert the categorical columns to numeric indicator columns with one-hot encoding
pd.get_dummies(df)

In [None]:
# Drop the first level of each categorical column to reduce the column count; the dropped level is implied when the remaining indicators are all 0
pd.get_dummies(df.drop('species',axis=1),drop_first=True)
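
The effect of `drop_first=True` is easiest to see on a small frame. The sketch below uses a hypothetical toy frame (not the penguin data) to compare the indicator columns produced with and without it.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

full = pd.get_dummies(toy)                      # one column per category level
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each column dropped

print(list(full.columns))
# ['island_Biscoe', 'island_Dream', 'island_Torgersen', 'sex_FEMALE', 'sex_MALE']
print(list(reduced.columns))
# ['island_Dream', 'island_Torgersen', 'sex_MALE']
```

A row with `island_Dream` and `island_Torgersen` both 0 must be from Biscoe, so no information is lost. Decision trees do not strictly need this reduction, but it keeps the feature table smaller.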

# Train | Test Split

In [None]:
X = pd.get_dummies(df.drop('species',axis=1),drop_first=True)
y = df['species']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# **MODEL TRAINING**
## Decision Tree Classifier

## Default Hyperparameters

In [None]:
from sklearn.tree import DecisionTreeClassifier,plot_tree

In [None]:
model = DecisionTreeClassifier()

In [None]:
model.fit(X_train,y_train)

In [None]:
base_pred = model.predict(X_test)

## MODEL EVALUATION

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, ConfusionMatrixDisplay

In [None]:
print(accuracy_score(y_test,base_pred))

In [None]:
print(classification_report(y_test,base_pred))

In [None]:
print(confusion_matrix(y_test,base_pred))

In [None]:
ConfusionMatrixDisplay.from_estimator(model,X_test,y_test)

## DECISION TREE PLOT

In [None]:
plot_tree(model,filled=True)

In [None]:
model.feature_importances_

In [None]:
pd.DataFrame(index=X.columns,data=model.feature_importances_,columns=['Feature Importance'])

In [None]:
sns.boxplot(x='species',y='body_mass_g',data=df)

# **UNDERSTANDING HYPERPARAMETER: MAX DEPTH**

In a Decision Tree, **MAX DEPTH** is a hyperparameter that controls the **maximum number of levels (or depth)** the tree can grow.

- **Shallow Tree (Low Max Depth):**  
  Limits the number of splits, resulting in a simpler model.  
  Pros: Less risk of overfitting, faster training.  
  Cons: May underfit if the tree is too shallow.

- **Deep Tree (High Max Depth or Unlimited):**  
  Allows more splits, capturing more patterns in the training data.  
  Pros: High accuracy on training data.  
  Cons: Increased risk of overfitting and poor generalization to unseen data.

**KEY POINTS:**
- Setting `max_depth` to `None` in scikit-learn means the tree will grow until all leaves are pure or contain fewer than `min_samples_split` samples.
- Choosing the right `max_depth` requires experimentation, often using cross-validation.

**IN THIS NOTEBOOK:**  
We will tune `max_depth` to find the balance between bias and variance, aiming for a model that generalizes well to new data.
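
As a minimal sketch of what cross-validated tuning of `max_depth` can look like, the example below runs scikit-learn's `GridSearchCV` over a few candidate depths. It uses synthetic data from `make_classification` so that it runs on its own. The names `X_demo`, `y_demo`, and `param_grid`, and the chosen depths, are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the penguin features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=101,
)

# Candidate depths; None lets the tree grow without a depth limit.
param_grid = {"max_depth": [1, 2, 3, 4, 5, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=101),
    param_grid,
    cv=5,                # 5-fold cross-validation for each candidate depth
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the penguin data, the same search would be fitted on `X_train` and `y_train` only, so that the test set stays untouched for the final evaluation.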


In [None]:
# Helper function that prints a classification report and plots the fitted tree
def report_model(model):
    model_preds = model.predict(X_test)
    print(classification_report(y_test,model_preds))
    print('\n')
    plt.figure(figsize=(12,8),dpi=150)
    plot_tree(model,filled=True,feature_names=X.columns);

In [None]:
help(DecisionTreeClassifier)

In [None]:
pruned_tree = DecisionTreeClassifier(max_depth=2)
pruned_tree.fit(X_train,y_train)

In [None]:
report_model(pruned_tree)

## Max Leaf Nodes

In [None]:
pruned_tree = DecisionTreeClassifier(max_leaf_nodes=3)
pruned_tree.fit(X_train,y_train)

In [None]:
report_model(pruned_tree)

## CRITERION

In [None]:
entropy_tree = DecisionTreeClassifier(criterion='entropy')
entropy_tree.fit(X_train,y_train)

In [None]:
report_model(entropy_tree)