# XGBoost Classification Example using the Iris Dataset

This notebook demonstrates the following steps:

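1. Load the Iris dataset.
2. Preprocess the data.
3. Train an XGBoost classifier.
4. Make predictions on the test set.
5. Evaluate the model.
6. Save the trained model.
7. Load the saved model and predict on a sample data point.

It closes with a look at the model's feature importances.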

## Dependencies
- `xgboost`
- `scikit-learn`
- `pandas`
- `numpy`
- `joblib`

To install the dependencies, use:
```bash
pip install xgboost scikit-learn pandas numpy joblib
```

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from xgboost import XGBClassifier
import joblib

## Step 1: Load the Iris dataset

In [2]:
# Load the Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print("First few rows of the Iris dataset:")
print(df.head())

First few rows of the Iris dataset:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [3]:
# Display basic statistics about the dataset
print("\nBasic statistics of the dataset:")
print(df.describe())


Basic statistics of the dataset:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [4]:
# Display the distribution of the target class
print("\nClass distribution:")
print(df['class'].value_counts())


Class distribution:
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64


## Step 2: Preprocess the data

In [5]:
# Encode the target labels (classes) to numeric values
df['class'] = df['class'].astype('category').cat.codes

# Define features (X) and target (y)
X = df.drop('class', axis=1)
y = df['class']

# Split the data into training and testing sets
# 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Display the size of training and testing sets
print("\nTraining set size:", X_train.shape)
print("Testing set size:", X_test.shape)


Training set size: (105, 4)
Testing set size: (45, 4)
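
The `cat.codes` encoding used above assigns integers in the sorted order of the category labels, so it can help to keep the code-to-label mapping around for decoding predictions later. A minimal sketch, assuming a small stand-in Series rather than the full DataFrame (the names `labels` and `code_to_label` are illustrative, not part of the notebook):

```python
import pandas as pd

# Stand-in for the raw 'class' column (illustrative data, not the full dataset)
labels = pd.Series(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
)

# Convert to a categorical; string categories are sorted alphabetically by default
categorical = labels.astype("category")
codes = categorical.cat.codes

# Build a code -> label lookup from the category index
code_to_label = dict(enumerate(categorical.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Applied to the notebook, the same lookup would have to be built before `df['class']` is overwritten with its codes.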


## Step 3: Create and train an XGBoost classifier

In [6]:
# Define the model with basic parameters
# num_class is not passed: the scikit-learn wrapper infers it from y during fit.
# use_label_encoder is not passed: it is deprecated and ignored by recent XGBoost releases.
model = XGBClassifier(
    objective='multi:softprob',  # Multi-class objective that outputs per-class probabilities
    eval_metric='mlogloss',  # Multi-class log loss as the evaluation metric
    random_state=42  # For reproducibility
)

# Train the model on the training data
print("\nTraining the XGBoost classifier...")
model.fit(X_train, y_train)


Training the XGBoost classifier...


## Step 4: Make predictions on the test set

In [7]:
# Make predictions on the test set
y_pred = model.predict(X_test)
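
With the `multi:softprob` objective, the model produces one probability per class for each row, and `predict` returns the index of the largest one. The sketch below reproduces that last step in plain Python. The raw scores are made-up numbers for illustration, not output from the trained model, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


raw_scores = [2.0, 0.5, -1.0]  # hypothetical scores for classes 0, 1, 2
probabilities = softmax(raw_scores)
predicted = predict_class(raw_scores)

print(probabilities)
print(predicted)
```

On the real model, `model.predict_proba(X_test)` returns the per-class probabilities directly.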

## Step 5: Evaluate the model

In [8]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

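# Generate the classification report
# target_names are listed in code order (0, 1, 2), which matches the alphabetical order used by cat.codes
class_report = classification_report(
    y_test, y_pred,
    target_names=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
)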

print("\nModel evaluation:")
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)


Model evaluation:
Accuracy: 0.98
Confusion Matrix:
[[15  0  0]
 [ 0 14  1]
 [ 0  0 15]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
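
The figures in the report can be checked by hand from the confusion matrix, where rows are true classes and columns are predicted classes. A small sketch in plain Python, using the matrix printed above:

```python
# Confusion matrix from the evaluation output (rows: true class, columns: predicted class)
conf_matrix = [
    [15, 0, 0],
    [0, 14, 1],
    [0, 0, 15],
]
n_classes = len(conf_matrix)

# Accuracy: correctly classified samples (the diagonal) over all samples
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: diagonal entry over its row sum (all samples that truly belong to the class)
recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(n_classes)]

# Precision: diagonal entry over its column sum (all samples predicted as the class)
precision = [
    conf_matrix[i][i] / sum(conf_matrix[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(f"Accuracy: {accuracy:.2f}")
print("Recall:", [round(r, 2) for r in recall])
print("Precision:", [round(p, 2) for p in precision])
```

The single off-diagonal entry is one Iris-versicolor sample predicted as Iris-virginica, which is why versicolor recall and virginica precision are the only values below 1.00.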


## Step 6: Save the trained model to a file

In [9]:
# Save the trained model to a file
# joblib writes a pickle-based binary file, so use a .joblib extension rather than .json
model_file = "xgboost_iris_model.joblib"
joblib.dump(model, model_file)
print(f"\nModel saved to {model_file}")


Model saved to xgboost_iris_model.joblib


## Step 7: Load the model and make a prediction on a sample data point

In [10]:
# Load the model from the file
loaded_model = joblib.load(model_file)
print("\nLoaded model from file.")

# Predict on a sample data point
sample_data = X_test.iloc[0:1]
predicted_class = loaded_model.predict(sample_data)[0]
actual_class = y_test.iloc[0]

print("\nPrediction on a sample data point:")
print(f"Sample features: {sample_data.values}")
print(f"Predicted class: {predicted_class}, Actual class: {actual_class}")


Loaded model from file.

Prediction on a sample data point:
Sample features: [[5.1 3.5 1.4 0.2]]
Predicted class: 0, Actual class: 0


## Feature Importances

In [11]:
# 'weight' importance: the number of times each feature is used to split, summed over all trees
print("\nFeature importances from the model:")
print(model.get_booster().get_score(importance_type='weight'))


Feature importances from the model:
{'sepal_length': 8, 'sepal_width': 5, 'petal_length': 20, 'petal_width': 10}
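
The `weight` scores are raw split counts, so they are easier to compare once normalized to shares of all splits. A short sketch using the counts printed above:

```python
# Split counts copied from the feature importance output
weights = {
    "sepal_length": 8,
    "sepal_width": 5,
    "petal_length": 20,
    "petal_width": 10,
}

# Normalize each count by the total number of splits
total_splits = sum(weights.values())
shares = {name: count / total_splits for name, count in weights.items()}

# Print features from most to least used
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name}: {share:.1%}")
```

Features that are never used in a split do not appear in the `get_score` dictionary at all, so a missing key means a count of zero.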
