# Explore XGBoost
I use my own reasoning to explore the XGBoost algorithm.

## Reduce epoch and depth

- `max_depth` is set to 3 instead of 4 to simplify the tree structure.
- `epochs` (the number of boosting rounds passed to `xgb.train`) is reduced to 5, so the boosting process stops after 5 rounds.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Load the Iris dataset
iris = load_iris()

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

# Convert data into DMatrix format for XGBoost
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

# Define parameters for XGBoost model
param = {
    'max_depth': 3,    # Try reducing max_depth
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3
} 

# Reduce the number of boosting rounds (epochs)
epochs = 5  # Try reducing the number of epochs

# Train the XGBoost model
model = xgb.train(param, train, epochs)

# Make predictions
predictions = model.predict(test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)


Accuracy: 1.0


In [13]:
# Define new parameters for XGBoost model
param = {
    'max_depth': 3,    
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3
} 

# Reduce the number of boosting rounds (epochs)
epochs = 1  
model = xgb.train(param, train, epochs)

predictions = model.predict(test)

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.7792207792207793



An accuracy of approximately 96.7% on the Iris dataset suggests good model performance. A test accuracy noticeably below the training accuracy would hint at overfitting, where the model may not generalize well to new data. The cells above report only test accuracy, so the training accuracy has to be computed as well before drawing that conclusion.

To address overfitting:

- Use techniques like cross-validation to evaluate performance on different data subsets and ensure consistent model performance.
- Experiment with simpler model configurations by reducing max_depth and adjusting other parameters to strike a balance between model complexity and generalization.

In [13]:
# Define parameters for XGBoost model
param = {
    'max_depth': 10,    # Set a high max_depth (deeper trees)
    'eta': 0.1,        # Lower the learning rate
    'objective': 'multi:softmax',
    'num_class': 3
} 

epochs = 1  # Use only 1 boosting round
model = xgb.train(param, train, epochs)

predictions = model.predict(test)

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 1.0


## Change split ratio

In [19]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.7, random_state=42)

# Convert data into DMatrix format for XGBoost
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

# Define parameters for XGBoost model
param = {
    'max_depth': 3,    # Try reducing max_depth
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3
} 

# Reduce the number of boosting rounds (epochs)
epochs = 1  # Try reducing the number of epochs

# Train the XGBoost model
model = xgb.train(param, train, epochs)

# Make predictions
predictions = model.predict(test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.9142857142857143



With `test_size=0.7`, only 45 of the 150 samples are used for training and 105 for testing. The accuracy of approximately 91.4% on this large test set suggests potential underfitting, caused by the small training set and the reduced model complexity (low `max_depth` and a single boosting round). Adjusting the parameters and using a more balanced split can help improve the model's generalization and accuracy.

## How to detect overfitting?
If `train_accuracy` is noticeably higher than `val_accuracy`, the model is probably overfitting: it has learned the training data too well and may not generalize to new data. A small gap is normal, so the size of the difference matters more than its sign.
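
A strict `train_accuracy > val_accuracy` comparison flags any gap, even one caused by a single misclassified sample. The sketch below shows one possible alternative that only flags gaps above a tolerance. The helper name `overfitting_gap`, the default `tolerance=0.05`, and the example accuracies are hypothetical choices for illustration.

```python
def overfitting_gap(train_accuracy, val_accuracy, tolerance=0.05):
    """Return (gap, flagged); flagged is True when the gap exceeds tolerance.

    `tolerance` is an assumed threshold, not a standard value.
    """
    gap = train_accuracy - val_accuracy
    return gap, gap > tolerance


# Made-up accuracies: a small gap is not flagged, a larger one is.
small_gap, small_flag = overfitting_gap(0.98, 0.96)
large_gap, large_flag = overfitting_gap(1.0, 0.9)

print(f"gap={small_gap:.2f}, flagged={small_flag}")
print(f"gap={large_gap:.2f}, flagged={large_flag}")
```

The right tolerance depends on the dataset size: on small validation sets, each sample shifts the accuracy by several points.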

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)  # 60% train, 20% validation, 20% test

# Convert data into DMatrix format for XGBoost
train = xgb.DMatrix(X_train, label=y_train)
val = xgb.DMatrix(X_val, label=y_val)
test = xgb.DMatrix(X_test, label=y_test)

# Define parameters for XGBoost model
param = {
    'max_depth': 3,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3
} 

# Train the XGBoost model on the training set
epochs = 100
model = xgb.train(param, train, epochs, evals=[(train, 'train'), (val, 'validation')], early_stopping_rounds=10, verbose_eval=False)

# Evaluate the model on the training set
train_predictions = model.predict(train)
train_accuracy = accuracy_score(y_train, train_predictions)

# Evaluate the model on the validation set
val_predictions = model.predict(val)
val_accuracy = accuracy_score(y_val, val_predictions)

# Evaluate the model on the test set (unseen data)
test_predictions = model.predict(test)
test_accuracy = accuracy_score(y_test, test_predictions)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Check whether training accuracy exceeds validation accuracy (any gap triggers the warning)
if train_accuracy > val_accuracy:
    print("Warning: Potential overfitting detected.")
else:
    print("Model is generalizing well.")


Training Accuracy: 1.0000
Validation Accuracy: 0.9000
Test Accuracy: 1.0000



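The training accuracy of 100% combined with a lower validation accuracy of 90% hints at potential overfitting, where the model memorizes the training data. However, the test accuracy is also 100%, so the model does generalize to the unseen test data. The validation and test sets each hold only 30 samples, so the 10-point gap corresponds to just 3 misclassified validation samples and is weak evidence either way.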
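
With a 60/20/20 split of the 150 Iris samples, the validation set contains 30 samples, so a validation accuracy of 0.9000 means 27 correct predictions. To get a feel for how uncertain such an estimate is, the sketch below computes a Wilson score interval for a proportion. Using this interval and the 95% level (`z=1.96`) are assumptions made for illustration, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return center - half_width, center + half_width


# 27 of 30 validation samples classified correctly (90% accuracy).
low, high = wilson_interval(27, 30)
print(f"Approximate 95% interval for 27/30: {low:.3f} to {high:.3f}")
```

A wide interval means that differences of a few points between training, validation, and test accuracy on sets this small should be read with caution.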

## Change dataset

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Load the diabetes dataset from CSV file
diabetes = pd.read_csv('../data/diabetes.csv')

# Split the dataset into features (X) and target (y)
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Split the dataset into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)  # 60% train, 20% validation, 20% test

# Convert data into DMatrix format for XGBoost
train = xgb.DMatrix(X_train, label=y_train)
val = xgb.DMatrix(X_val, label=y_val)
test = xgb.DMatrix(X_test, label=y_test)

# Define parameters for XGBoost model
param = {
    'max_depth': 3,
    'eta': 0.3,
    'objective': 'binary:logistic',  # Use binary classification objective
    'eval_metric': 'error'  # Specify evaluation metric as classification error
}

# Set the maximum number of boosting rounds (epochs)
epochs = 50  # Upper bound; early stopping may end training sooner

# Train the XGBoost model
model = xgb.train(param, train, epochs, evals=[(train, 'train'), (val, 'validation')], early_stopping_rounds=10, verbose_eval=False)

# Predict positive-class probabilities on the training, validation, and test sets
train_predictions = model.predict(train)
val_predictions = model.predict(val)
test_predictions = model.predict(test)

# Convert predicted probabilities to class labels (0 or 1)
train_pred_labels = [1 if pred > 0.5 else 0 for pred in train_predictions]
val_pred_labels = [1 if pred > 0.5 else 0 for pred in val_predictions]
test_pred_labels = [1 if pred > 0.5 else 0 for pred in test_predictions]

# Calculate accuracy scores for training, validation, and test sets
train_accuracy = accuracy_score(y_train, train_pred_labels)
val_accuracy = accuracy_score(y_val, val_pred_labels)
test_accuracy = accuracy_score(y_test, test_pred_labels)

# Print accuracy scores
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Check for potential overfitting (compare training accuracy with validation accuracy)
if train_accuracy > val_accuracy:
    print("Warning: Potential overfitting detected.")
else:
    print("Model is generalizing well.")


Training Accuracy: 0.8935
Validation Accuracy: 0.7532
Test Accuracy: 0.8312
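
With the `binary:logistic` objective, `predict` returns probabilities, so the cell above converts them to labels with a list comprehension. The same thresholding can be sketched in vectorized form with NumPy. The helper name `probabilities_to_labels` and the example probabilities are made up for illustration.

```python
import numpy as np


def probabilities_to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; values strictly above threshold become 1."""
    probabilities = np.asarray(probabilities, dtype=float)
    return (probabilities > threshold).astype(int)


# Made-up probabilities; note that exactly 0.5 maps to class 0,
# matching the `pred > 0.5` rule used in the cell above.
example_probs = [0.12, 0.5, 0.51, 0.97]
labels = probabilities_to_labels(example_probs)
print(labels.tolist())  # [0, 0, 1, 1]
```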


### Find best parameters

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Load the diabetes dataset from CSV file
diabetes = pd.read_csv('../data/diabetes.csv')

# Split the dataset into features (X) and target (y)
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Split the dataset into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)  # 60% train, 20% validation, 20% test

# Convert data into DMatrix format for XGBoost
train = xgb.DMatrix(X_train, label=y_train)
val = xgb.DMatrix(X_val, label=y_val)
test = xgb.DMatrix(X_test, label=y_test)

# Define a grid of parameters to search over
param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

best_params = None
best_val_accuracy = 0.0

# Try different combinations of parameters using nested loops
for max_depth in param_grid['max_depth']:
    for eta in param_grid['eta']:
        for subsample in param_grid['subsample']:
            for colsample_bytree in param_grid['colsample_bytree']:
                params = {
                    'max_depth': max_depth,
                    'eta': eta,
                    'subsample': subsample,
                    'colsample_bytree': colsample_bytree,
                    'objective': 'binary:logistic',  # Use binary classification objective
                    'eval_metric': 'error'  # Specify evaluation metric as classification error
                }
                
                # Train the XGBoost model with current parameters
                model = xgb.train(params, train, num_boost_round=100, evals=[(val, 'validation')], early_stopping_rounds=10, verbose_eval=False)
                
                # Evaluate the model on the validation set
                val_predictions = model.predict(val)
                val_pred_labels = [1 if pred > 0.5 else 0 for pred in val_predictions]
                val_accuracy = accuracy_score(y_val, val_pred_labels)
                
                # Check if current parameters yield better validation accuracy
                if val_accuracy > best_val_accuracy:
                    best_val_accuracy = val_accuracy
                    best_params = params

# Print the best parameters and their corresponding validation accuracy
print("Best Parameters:", best_params)
print("Best Validation Accuracy:", best_val_accuracy)

# Train the final model using the best parameters on the combined train + validation set
final_train = xgb.DMatrix(pd.concat([X_train, X_val]), label=pd.concat([y_train, y_val]))
final_test = xgb.DMatrix(X_test, label=y_test)

final_model = xgb.train(best_params, final_train, num_boost_round=100)

# Evaluate the final model on the test set
test_predictions = final_model.predict(final_test)
test_pred_labels = [1 if pred > 0.5 else 0 for pred in test_predictions]
test_accuracy = accuracy_score(y_test, test_pred_labels)

print("Final Test Accuracy:", test_accuracy)

# Check for potential overfitting (compare best validation accuracy with test accuracy)
if (best_val_accuracy > test_accuracy):
    print("Warning: Potential overfitting detected.")
else:
    print("Model is generalizing well.")


Best Parameters: {'max_depth': 5, 'eta': 0.1, 'subsample': 1.0, 'colsample_bytree': 1.0, 'objective': 'binary:logistic', 'eval_metric': 'error'}
Best Validation Accuracy: 0.7922077922077922
Final Test Accuracy: 0.8246753246753247
Model is generalizing well.
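
The four nested loops in the grid search can also be written as a single loop over all combinations. The sketch below uses `itertools.product` from the standard library to enumerate the same grid. The generator name `iter_param_combinations` is a hypothetical helper, and the training step is left out so that the example runs without the dataset.

```python
from itertools import product

param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Settings shared by every candidate, as in the grid search above.
fixed_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'error',
}


def iter_param_combinations(grid, fixed):
    """Yield one parameter dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        params.update(fixed)
        yield params


combinations = list(iter_param_combinations(param_grid, fixed_params))
print(len(combinations))  # 3 * 3 * 3 * 3 = 81
print(combinations[0])
```

Each yielded dict could be passed to `xgb.train` in place of the `params` dict built inside the nested loops, which keeps the search flat when more parameters are added to the grid.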
