# Classical Machine Learning with Scikit-learn
## Iris Species Classification Using Decision Tree

**Dataset:** Iris Species Dataset

**Goals:**
1. Preprocess the data (handle missing values, encode labels)
2. Train a decision tree classifier to predict iris species
3. Evaluate using accuracy, precision, and recall

---

## 1. Import Required Libraries

We'll import all necessary libraries for data manipulation, visualization, model training, and evaluation.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Dataset loading
from sklearn.datasets import load_iris

# Data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Machine learning model
from sklearn.tree import DecisionTreeClassifier

# Model evaluation metrics
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

# Visualization
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import seaborn as sns

# Seed NumPy's global random number generator for reproducibility
# (the scikit-learn calls below also pass random_state=42 explicitly)
np.random.seed(42)

print("All libraries imported successfully!")

## 2. Load the Iris Dataset

The Iris dataset is a classic dataset in machine learning. It contains four measurements (sepal length, sepal width, petal length, petal width) for 150 iris flowers, 50 from each of three species.

In [None]:
# Load the Iris dataset from scikit-learn
iris = load_iris()

# Convert to pandas DataFrame for easier manipulation
df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)

# Add the target column (species)
df['species'] = iris.target

# Display basic information about the dataset
print("Dataset loaded successfully!")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}")  # Excluding target column
print(f"\nFeature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")
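
As an alternative to building the DataFrame by hand, `load_iris` can return pandas objects directly through its `as_frame=True` option. The sketch below assumes a scikit-learn version that supports that option; the names `iris_bunch`, `iris_frame`, and `species_name` are chosen here for illustration and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects directly
iris_bunch = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer 'target' column
iris_frame = iris_bunch.frame.copy()

# Add a readable species name next to the integer code
# ('species_name' is a column name chosen for this sketch)
code_to_name = {code: str(name) for code, name in enumerate(iris_bunch.target_names)}
iris_frame["species_name"] = iris_frame["target"].map(code_to_name)

print(iris_frame.shape)
print(iris_frame[["target", "species_name"]].drop_duplicates())
```

Keeping the readable name alongside the integer code can make later printouts easier to interpret, while the model still trains on the integer column.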

## 3. Explore the Dataset

Let's examine the structure and characteristics of our data to understand what we're working with.

In [None]:
# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

print("\n" + "="*60 + "\n")

# Display statistical summary of the features
print("Statistical summary of features:")
print(df.describe())

print("\n" + "="*60 + "\n")

# Check data types
print("Data types:")
print(df.dtypes)

print("\n" + "="*60 + "\n")

# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

print("\n" + "="*60 + "\n")

# Check the distribution of target classes
print("Distribution of species:")
print(df['species'].value_counts().sort_index())
print("\nMapping: 0 = setosa, 1 = versicolor, 2 = virginica")

## 4. Preprocess the Data

We'll check for missing values (scikit-learn's copy of the Iris dataset has none) and separate the features from the target variable.

In [None]:
# Step 1: Handle missing values
# Check if there are any missing values
missing_count = df.isnull().sum().sum()

if missing_count > 0:
    print(f"Found {missing_count} missing values.")
    # For numerical features, we could fill with median or mean
    # For this dataset, we'll drop rows with missing values
    df_clean = df.dropna()
    print(f"Dropped rows with missing values. New shape: {df_clean.shape}")
else:
    print("No missing values found. Dataset is clean!")
    df_clean = df.copy()

print("\n" + "="*60 + "\n")

# Step 2: Separate features (X) and target (y)
# Features: all columns except 'species'
X = df_clean.drop('species', axis=1)

# Target: the 'species' column (already encoded as 0, 1, 2)
y = df_clean['species']

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

print("\n" + "="*60 + "\n")

# Step 3: Verify encoding
# The Iris dataset from sklearn already has encoded labels (0, 1, 2)
# If we had string labels, we would use LabelEncoder:
# le = LabelEncoder()
# y_encoded = le.fit_transform(y)

print("Target variable is already encoded:")
print(f"Unique values: {sorted(y.unique())}")
print(f"Class distribution:\n{y.value_counts().sort_index()}")

print("\nData preprocessing completed successfully!")
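
The Iris data needed neither imputation nor label encoding, so here is a minimal sketch of both steps on a small made-up table. The `toy` DataFrame and its values are hypothetical; `SimpleImputer` with `strategy="median"` stands in for the "fill with median" option mentioned in the comments above, and `LabelEncoder` is the same class referenced in Step 3.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one missing measurement and string labels
toy = pd.DataFrame({
    "petal length (cm)": [1.4, np.nan, 5.1, 4.7],
    "petal width (cm)": [0.2, 0.2, 1.8, 1.4],
    "species": ["setosa", "setosa", "virginica", "versicolor"],
})
feature_cols = ["petal length (cm)", "petal width (cm)"]

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy[feature_cols] = imputer.fit_transform(toy[feature_cols])

# Map string labels to integers; LabelEncoder sorts the classes alphabetically
le = LabelEncoder()
toy["species_code"] = le.fit_transform(toy["species"])

print(toy)
print("Classes:", list(le.classes_))
print("Decoded:", list(le.inverse_transform(toy["species_code"])))
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.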

## 5. Split the Data into Training and Testing Sets

We'll split the data using an 80-20 ratio (80% for training, 20% for testing) to evaluate our model on unseen data.

In [None]:
# Split the data: 80% training, 20% testing
# random_state=42 ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # For reproducibility
    stratify=y          # Maintain class distribution in both sets
)

print("Data split completed!")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

print("\n" + "="*60 + "\n")

# Verify class distribution in train and test sets
print("Class distribution in training set:")
print(y_train.value_counts().sort_index())

print("\nClass distribution in testing set:")
print(y_test.value_counts().sort_index())
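
To see what `stratify=y` contributes, the sketch below repeats the split with and without it and counts the test samples per class. It reloads the data so it can run on its own; the `_demo`, `_plain`, and `_strat` names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same test size and seed, once without and once with stratification
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# np.bincount counts how many test samples fall into each class (0, 1, 2)
print("Without stratify:", np.bincount(y_test_plain, minlength=3))
print("With stratify:   ", np.bincount(y_test_strat, minlength=3))
```

With stratification, each of the three equally sized classes contributes exactly 10 of the 30 test samples; without it, the counts depend on the random shuffle.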

## 6. Train a Decision Tree Classifier

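A decision tree classifies a sample by asking a sequence of threshold questions about its features (for example, whether petal width is at most some value) and following the answers down to a leaf that holds the predicted class.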

In [None]:
# Initialize the Decision Tree Classifier
# The values below are scikit-learn's defaults, written out explicitly so they are easy to tune later
dt_classifier = DecisionTreeClassifier(
    criterion='gini',      # Measure of split quality ('gini' or 'entropy')
    max_depth=None,        # Maximum depth of the tree (None = unlimited)
    min_samples_split=2,   # Minimum samples required to split a node
    min_samples_leaf=1,    # Minimum samples required at a leaf node
    random_state=42        # For reproducibility
)

print("Decision Tree Classifier initialized with parameters:")
print(dt_classifier.get_params())

print("\n" + "="*60 + "\n")

# Train the model on the training data
print("Training the Decision Tree Classifier...")
dt_classifier.fit(X_train, y_train)
print("Model training completed!")

print("\n" + "="*60 + "\n")

# Display tree information
print(f"Tree depth: {dt_classifier.get_depth()}")
print(f"Number of leaves: {dt_classifier.get_n_leaves()}")
# n_features_in_ counts the features seen during fit, not only those used in splits
print(f"Number of input features: {dt_classifier.n_features_in_}")
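
The `criterion='gini'` setting means candidate splits are compared by Gini impurity. The sketch below computes that quantity by hand for a node and for a split, using the standard definition (one minus the sum of squared class proportions). The helper functions and the class counts are illustrative and are not part of scikit-learn's API.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k."""
    counts = np.asarray(class_counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    proportions = counts / total
    return float(1.0 - np.sum(proportions ** 2))

def weighted_child_impurity(left_counts, right_counts):
    """Impurity after a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return ((n_left / n_total) * gini_impurity(left_counts)
            + (n_right / n_total) * gini_impurity(right_counts))

# Class counts are (setosa, versicolor, virginica); the numbers are illustrative
root = [40, 40, 40]
left, right = [40, 0, 0], [0, 40, 40]

print(f"Root impurity:        {gini_impurity(root):.4f}")
print(f"Impurity after split: {weighted_child_impurity(left, right):.4f}")
```

Here the split that isolates one class lowers the weighted impurity from about 0.667 to about 0.333. The tree prefers the candidate split with the largest such reduction.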

## 7. Make Predictions

Now we'll use our trained model to make predictions on the test set.

In [None]:
# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

print("Predictions made on test set!")
print(f"Number of predictions: {len(y_pred)}")

print("\n" + "="*60 + "\n")

# Display first 10 predictions vs actual values
print("First 10 predictions vs actual values:")
comparison_df = pd.DataFrame({
    'Actual': y_test.values[:10],
    'Predicted': y_pred[:10],
    'Match': y_test.values[:10] == y_pred[:10]
})
print(comparison_df)

print("\n" + "="*60 + "\n")

# Also get prediction probabilities for each class
# (a fully grown tree usually has pure leaves, so these are typically 0.0 or 1.0)
y_pred_proba = dt_classifier.predict_proba(X_test)
print("Prediction probabilities for first 5 samples:")
print("(Columns represent: setosa, versicolor, virginica)")
print(y_pred_proba[:5])
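
The predictions above are integer codes. A small sketch of turning them back into species names follows; it retrains a tree so it can run on its own, and the variable names (`iris_data`, `clf`, `pred_names`) are chosen for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target,
    test_size=0.2, random_state=42, stratify=iris_data.target
)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
pred_codes = clf.predict(X_te)

# target_names is a NumPy array, so the integer codes can index it directly
pred_names = iris_data.target_names[pred_codes]

# classes_ gives the column order used by predict_proba
print("Column order of predict_proba:", clf.classes_)
print("First 5 predicted species:", list(pred_names[:5]))
```

Checking `classes_` is a safer way to label the probability columns than assuming an order, especially once labels come from a `LabelEncoder`.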

## 8. Evaluate the Model

We'll evaluate our model using multiple metrics: accuracy, precision, recall, F1-score, and a confusion matrix.

In [None]:
# Calculate evaluation metrics

# 1. Accuracy: Overall correctness of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("Interpretation: Percentage of correct predictions out of all predictions")

print("\n" + "="*60 + "\n")

# 2. Precision: of the samples predicted as a class, the fraction that truly belong to it
# 'weighted' averages the per-class scores, weighting each class by its support
# (with the balanced, stratified Iris split this equals the macro average)
precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision (weighted): {precision:.4f}")
print("Interpretation: Of the samples predicted as each species, the share that truly belong to it")

print("\n" + "="*60 + "\n")

# 3. Recall: of the samples that truly belong to a class, the fraction predicted as that class
# 'weighted' averages the per-class scores by support; support-weighted recall equals accuracy
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall (weighted): {recall:.4f}")
print("Interpretation: Of the samples that truly belong to each species, the share the model identified")

print("\n" + "="*60 + "\n")

# 4. F1-Score: Harmonic mean of precision and recall
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score (weighted): {f1:.4f}")
print("Interpretation: Balance between precision and recall")

print("\n" + "="*60 + "\n")

# 5. Per-class metrics using classification report
print("Detailed Classification Report:")
print("=" * 60)
print(classification_report(
    y_test, 
    y_pred,
    target_names=iris.target_names,
    digits=4
))
print("\nNote: Support shows the number of actual occurrences of each class")
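
Because the Iris test set is balanced, the difference between `average='weighted'` and `average='macro'` does not show up in the numbers above. The sketch below uses a small, deliberately imbalanced set of made-up labels to show how the two averages are formed from the per-class scores.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical imbalanced labels: six samples of class 0, two each of classes 1 and 2
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2])

# average=None returns one precision value per class
per_class = precision_score(y_true_toy, y_pred_toy, average=None)
support = np.bincount(y_true_toy)  # number of true samples per class

macro = per_class.mean()                            # every class counts equally
weighted = np.average(per_class, weights=support)   # larger classes count more

print("Per-class precision:", np.round(per_class, 4))
print(f"Macro average:    {macro:.4f}")
print(f"Weighted average: {weighted:.4f}")
```

On these labels the weighted average is higher than the macro average because the largest class is also the one predicted most precisely.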

In [None]:
# 6. Confusion Matrix
# Shows the number of correct and incorrect predictions for each class
cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print("=" * 60)
print(cm)

print("\n" + "="*60 + "\n")

# Create a more readable confusion matrix using pandas
cm_df = pd.DataFrame(
    cm,
    index=[f"Actual {name}" for name in iris.target_names],
    columns=[f"Predicted {name}" for name in iris.target_names]
)
print("Confusion Matrix (detailed):")
print(cm_df)

print("\n" + "="*60 + "\n")

# Visualize the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    cm_df,
    annot=True,
    fmt='d',
    cmap='Blues',
    cbar=True,
    square=True,
    linewidths=1,
    linecolor='black'
)
plt.title('Confusion Matrix - Decision Tree Classifier\nIris Species Classification', 
          fontsize=14, fontweight='bold')
plt.ylabel('Actual Species', fontsize=12)
plt.xlabel('Predicted Species', fontsize=12)
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Diagonal elements: Correctly classified samples")
print("- Off-diagonal elements: Misclassified samples")
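
Per-class precision and recall can be read straight off a confusion matrix: the diagonal holds the correct predictions, column totals count the predictions per class, and row totals count the actual samples per class. The sketch below checks this against scikit-learn on made-up labels; the `_cm` and `_toy` names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels with two mistakes between classes 1 and 2
y_true_cm = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_cm = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

# Rows are actual classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_cm, y_pred_cm)
true_positives = np.diag(cm_toy)

precision_per_class = true_positives / cm_toy.sum(axis=0)  # divide by column totals
recall_per_class = true_positives / cm_toy.sum(axis=1)     # divide by row totals

print(cm_toy)
print("Precision per class:", np.round(precision_per_class, 4))
print("Recall per class:   ", np.round(recall_per_class, 4))

# Cross-check against scikit-learn's per-class scores
print("Matches sklearn precision:",
      np.allclose(precision_per_class, precision_score(y_true_cm, y_pred_cm, average=None)))
print("Matches sklearn recall:   ",
      np.allclose(recall_per_class, recall_score(y_true_cm, y_pred_cm, average=None)))
```

The division assumes every class is predicted at least once; a class with an empty column would need special handling.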

## 9. Visualize the Decision Tree

Let's visualize the decision tree to understand how it makes classification decisions.

In [None]:
# Visualize the decision tree structure
plt.figure(figsize=(20, 10))

# Plot the tree with detailed information
plot_tree(
    dt_classifier,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,              # Color nodes by class
    rounded=True,             # Rounded box corners
    fontsize=10,
    proportion=True           # Show proportions instead of counts
)

plt.title('Decision Tree Visualization - Iris Species Classification', 
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nHow to read the tree:")
print("=" * 60)
print("- Each box (node) shows:")
print("  * The decision rule (e.g., 'petal width <= 0.8')")
print("  * gini: Impurity measure (0 = pure, higher = more mixed)")
print("  * samples: Percentage of training samples reaching this node")
print("  * value: Class proportions among the samples at this node")
print("  * class: The majority class at this node")
print("\n- Colors represent the dominant class at each node")
print("- The tree splits samples based on feature values")
print("- Leaf nodes (nodes with no further split) contain the final predictions")
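
When a plot is inconvenient (for example in a terminal or a log file), the same tree structure can be printed as text with `sklearn.tree.export_text`. The sketch below fits a separate shallow tree on the full dataset; `max_depth=2` is chosen here only to keep the printout short and is not the setting used for `dt_classifier`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_txt = load_iris()

# A shallow tree, used only so the printed rules stay short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris_txt.data, iris_txt.target)

# export_text renders the fitted tree as indented if/else style rules
rules = export_text(small_tree, feature_names=list(iris_txt.feature_names))
print(rules)
```

Each indented line is one decision rule, and lines starting with `class:` are leaves showing the predicted class code.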

## 10. Feature Importance

Let's examine which features are most important for the classification.

In [None]:
# Get feature importances from the trained model
feature_importance = dt_classifier.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Feature Importance Ranking:")
print("=" * 60)
print(importance_df.to_string(index=False))

print("\n" + "="*60 + "\n")

# Visualize feature importance
plt.figure(figsize=(10, 6))
bars = plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')

# Add value labels on bars
for bar in bars:
    width = bar.get_width()
    plt.text(width, bar.get_y() + bar.get_height()/2, 
             f'{width:.4f}', 
             ha='left', va='center', fontsize=10, fontweight='bold')

plt.xlabel('Importance Score', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Feature Importance in Decision Tree Classifier', 
          fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3, linestyle='--')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Higher values indicate more important features for classification")
print("- These scores show how much each feature contributes to reducing impurity")
print(f"- The most important feature is: {importance_df.iloc[0]['Feature']}")
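
The scores above are impurity-based and are computed from the training data. As a cross-check, a different technique, permutation importance, measures how much the test accuracy drops when one feature's values are shuffled. The sketch below uses `sklearn.inspection.permutation_importance`; it retrains a tree so it can run on its own, and `n_repeats=20` is an arbitrary choice for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_pi = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_pi.data, iris_pi.target,
    test_size=0.2, random_state=42, stratify=iris_pi.target
)
tree_pi = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 20 times on the test set and record the drop in accuracy
result = permutation_importance(tree_pi, X_te, y_te, n_repeats=20, random_state=42)

ranked = sorted(
    zip(iris_pi.feature_names, result.importances_mean, result.importances_std),
    key=lambda item: item[1],
    reverse=True,
)
for name, mean, std in ranked:
    print(f"{name:<20} {mean:.4f} +/- {std:.4f}")
```

If both methods rank the same features highest, that agreement gives some extra confidence in the ranking; with only 30 test samples the permutation scores will be noisy.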

## Summary

### Key Findings:

1. **Data Preprocessing**: The Iris dataset was clean with no missing values, and labels were already encoded.

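2. **Model Performance**: The Decision Tree Classifier was evaluated on 30 held-out test samples (see Section 8 for the scores). With a test set this small, a single misclassification shifts accuracy by about 3.3 percentage points, so the scores are a rough estimate.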

3. **Evaluation Metrics**:
   - **Accuracy**: Measures overall correctness
   - **Precision**: Measures how many predicted species were correct
   - **Recall**: Measures how many actual species were correctly identified
   - **Confusion Matrix**: Shows detailed breakdown of correct and incorrect predictions

4. **Feature Importance**: The model identified which physical measurements (sepal length/width, petal length/width) were most important for classification.

### Next Steps:

- Try other classifiers (Random Forest, SVM, KNN)
- Tune hyperparameters for better performance
- Use cross-validation for more robust evaluation
- Add feature scaling when trying scale-sensitive algorithms such as SVM or KNN (decision trees do not need it)
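
As a starting point for these next steps, the sketch below compares the decision tree with a scaled KNN classifier using 5-fold stratified cross-validation. The choice of KNN, `n_neighbors=5`, and five folds are assumptions made for illustration, not recommendations from the analysis above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    # The pipeline refits the scaler inside each fold, which avoids data leakage
    "Scaled KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    print(f"{name:<18} mean={cv_scores[name].mean():.4f} std={cv_scores[name].std():.4f}")
```

Averaging over five folds uses every sample for testing once, so the estimate is less sensitive to one particular 30-sample split.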