# XGBoost (Extreme Gradient Boosting)
XGBoost is an optimized, efficient, and scalable implementation of Gradient Boosting. It has gained immense popularity in machine learning competitions, including Kaggle, due to its high performance and speed. XGBoost improves upon traditional Gradient Boosting by introducing several enhancements that make it more powerful, faster, and easier to use for both classification and regression tasks.

## Key Features of XGBoost:
* **Efficiency:** XGBoost uses advanced techniques like parallel processing and tree pruning to speed up training.
* **Regularization:** It includes L1 (Lasso) and L2 (Ridge) regularization, which helps prevent overfitting.
* **Handling Missing Data:** XGBoost handles missing values natively by learning a default branch direction for them at each split, so imputation is not required.
* **Tree Pruning:** XGBoost grows each tree up to max_depth and then prunes back any split whose loss reduction falls below the gamma (minimum split loss) threshold, which keeps the model from becoming too complex.
* **Scalability:** XGBoost is highly scalable, which makes it suitable for large datasets.
* **Flexibility:** It supports a wide range of loss functions for both regression and classification tasks.
* **Cross-validation:** XGBoost offers built-in support for cross-validation during training, which helps with hyperparameter tuning.
* **Early Stopping:** It supports early stopping, which helps prevent overfitting and saves training time.
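
To make the regularization and pruning bullets concrete, here is a minimal sketch of the split-gain formula described in the XGBoost documentation. It covers only the L2 term and gamma; the L1 term is left out for brevity. The function names `leaf_score` and `split_gain` are illustrative and not part of the XGBoost API. `G` and `H` stand for the sums of first- and second-order gradients of the examples in a node.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a single leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into a left and a right child.

    A split is only worth keeping when the returned value is positive,
    so a larger gamma prunes more aggressively.
    """
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# The same candidate split, evaluated under different settings (made-up numbers).
gain_default = split_gain(-4.0, 4.0, 4.0, 4.0)                   # kept: gain > 0
gain_high_gamma = split_gain(-4.0, 4.0, 4.0, 4.0, gamma=5.0)     # pruned: gain < 0
gain_high_lambda = split_gain(-4.0, 4.0, 4.0, 4.0, reg_lambda=10.0)

print(gain_default, gain_high_gamma, gain_high_lambda)
```

Raising gamma subtracts a fixed cost from every split, while raising lambda shrinks the score of each leaf, so both make the tree more conservative.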

### XGBoost Workflow:
* **Boosting Process:** Similar to traditional Gradient Boosting, but optimized for speed and efficiency. Trees are built sequentially, and each new tree fits the errors (residuals, or more generally the loss gradients) of the ensemble built so far.
* **Learning Rate (Eta):** A smaller learning rate helps prevent overfitting but requires more trees (iterations) to learn effectively.
* **Gradient and Hessian:** XGBoost uses second-order gradient information (Hessian) in the optimization process, leading to faster and more accurate convergence compared to using only the first-order gradient.
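
The role of the learning rate and of second-order information can be sketched with a deliberately tiny model: a single leaf fitted to binary labels with logistic loss. Each round takes a Newton step `-G / (H + lambda)` scaled by eta. This is a simplified illustration, not XGBoost's implementation, and `rounds_to_converge` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rounds_to_converge(y, eta, reg_lambda=0.0, tol=1e-3, max_rounds=10_000):
    """Count boosting rounds until the predicted probability matches mean(y)."""
    target = sum(y) / len(y)
    score = 0.0  # raw (log-odds) prediction shared by all examples
    for round_idx in range(1, max_rounds + 1):
        p = sigmoid(score)
        G = sum(p - yi for yi in y)          # first-order gradients of log loss
        H = len(y) * p * (1.0 - p)           # second-order gradients (Hessian)
        score += eta * (-G / (H + reg_lambda))  # shrunken Newton step
        if abs(sigmoid(score) - target) < tol:
            return round_idx
    return max_rounds


labels = [1, 1, 1, 0]
fast = rounds_to_converge(labels, eta=1.0)
slow = rounds_to_converge(labels, eta=0.1)
print(f"eta=1.0 needs {fast} rounds, eta=0.1 needs {slow} rounds")
```

The smaller learning rate takes many more rounds to reach the same fit, which is the trade-off the Learning Rate bullet describes.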

### Basic Steps in XGBoost:
* **Data Preprocessing:** Preprocess your data (e.g., encode categorical features). Imputing missing values and scaling features are usually optional for tree-based models such as XGBoost.
* **Set Hyperparameters:** Set XGBoost-specific parameters like learning_rate, n_estimators, max_depth, and subsample.
* **Train the Model:** Train the model on the dataset.
* **Evaluate the Model:** Use performance metrics like accuracy (classification) or RMSE (regression).
* **Tune Hyperparameters:** Fine-tune the hyperparameters using cross-validation or grid search.
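
As a small illustration of the evaluation step, the metrics mentioned above can be written in plain Python. The multi-class error rate (the `merror` metric used with XGBoost) is simply one minus accuracy. The function names and the toy labels are made up for this sketch.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def merror(y_true, y_pred):
    """Multi-class classification error rate: wrong predictions / all predictions."""
    return 1.0 - accuracy(y_true, y_pred)


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


class_acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])   # 3 of 4 correct
class_err = merror([0, 1, 2, 2], [0, 1, 1, 2])
reg_rmse = rmse([3.0, 5.0], [1.0, 5.0])
print(class_acc, class_err, reg_rmse)
```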


In [1]:
!pip install xgboost 


Collecting xgboost
  Downloading xgboost-2.1.4-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.4-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl (2.1 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.1.4


In [22]:
import sys
sys.path.append('./VENV/lib/python3.11/site-packages')

In [5]:
# Importing libraries
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data  # Features
y = data.target  # Target variable (class labels)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the datasets into DMatrix format (optimized for XGBoost)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define the parameters for XGBoost (basic setup)
params = {
    'objective': 'multi:softmax',  # Multi-class classification (softmax)
    'num_class': 3,               # Number of classes (3 for Iris dataset)
    'max_depth': 4,               # Maximum depth of trees
    'eta': 0.1,                   # Learning rate
    'eval_metric': 'merror'       # Evaluation metric (multi-class error rate)
}

# Train the XGBoost model with 100 boosting rounds
num_round = 100
bst = xgb.train(params, dtrain, num_round)

# Make predictions on the test set
y_pred = bst.predict(dtest)

# Convert predictions into integers (class labels)
y_pred = y_pred.astype(int)

# Evaluate the model using accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")


Accuracy: 100.00%


# AdaBoost vs XGBoost:

|Feature|AdaBoost|XGBoost|
|----|----|----|
|Boosting Mechanism|Focuses on misclassified data points|Focuses on residuals using gradient descent|
|Weak Learners|Typically uses decision stumps (shallow trees)|Uses shallow decision trees, more flexible|
|Regularization|None, prone to overfitting|Built-in L1 and L2 regularization to prevent overfitting|
|Handling Missing Data|Does not handle missing data directly|Can handle missing data naturally|
|Performance|Good for smaller datasets, less computationally intensive|High performance and scalability, optimized for large datasets|
|Hyperparameter Tuning|Few hyperparameters|More hyperparameters to tune|
|Overfitting|Prone to overfitting on noisy data|More robust to overfitting with regularization|
|Computational Complexity|Simpler and faster for small datasets|Requires more computational resources, but scales better|
|Interpretability|Easier to interpret, simple weak learners|More difficult to interpret, but can be analyzed with SHAP|
|Use Case|Simple classification tasks, small datasets|Complex, large datasets, high-performance tasks|
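
The "Boosting Mechanism" row can be illustrated with a short sketch. AdaBoost increases the weights of misclassified points after each round, whereas gradient boosting (shown here for squared loss) fits the next tree to the residuals. This is a textbook-style illustration with made-up data, not code from either library. It assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight alpha and new sample weights."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct points (t * p = +1) shrink, misclassified points (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def squared_loss_residuals(y_true, current_pred):
    """Targets for the next tree in gradient boosting with squared loss."""
    return [t - f for t, f in zip(y_true, current_pred)]


# Four equally weighted points; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_reweight(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, 1, -1, 1]
)
residuals = squared_loss_residuals([3.0, 5.0], [2.5, 5.5])
print(alpha, new_weights, residuals)
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.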

![Multiclass error rate definitions](images/Multiclass_ErrorRate.png 'Definitions')


# CatBoost

**CatBoost (Categorical Boosting)** is an open-source, high-performance gradient boosting algorithm developed by Yandex. It is designed to handle categorical features directly, without needing to manually encode them (like one-hot encoding or label encoding), which is one of its standout features. CatBoost is used for classification, regression, and ranking tasks, and it's known for being highly efficient, handling large datasets with high dimensionality, and providing state-of-the-art performance.

## Key Features of CatBoost:

**1. Handling Categorical Features:**

One of the major advantages of CatBoost is its ability to handle categorical features directly. Most machine learning models require categorical variables to be converted into numeric values first (e.g., with one-hot encoding or label encoding). CatBoost instead encodes them internally using ordered target statistics, so no manual preprocessing is needed.
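
The idea behind CatBoost's internal encoding can be sketched as an ordered target statistic: each row's category is replaced by a smoothed average of the targets of earlier rows with the same category, so a row never sees its own label. This is a simplified illustration, not CatBoost's actual implementation. The function name, the `prior` and `prior_weight` arguments, and the toy data are assumptions for the example, and the rows are assumed to already be in a random permutation order.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=1.0):
    """Encode each category using only the targets of rows that came before it."""
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of earlier targets; falls back to the prior when unseen.
        encoded.append((seen_sum + prior_weight * prior)
                       / (seen_count + prior_weight))
        # Only now add the current row, so it cannot leak into its own encoding.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


encoded = ordered_target_stats(
    ["a", "b", "a", "a", "b"], [1, 0, 0, 1, 1], prior=0.5
)
print(encoded)
```

Because the encoding depends on row order, the same category can map to different values in different rows, which is what prevents target leakage.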

**2. Gradient Boosting:**

Like other gradient boosting algorithms (e.g., XGBoost and LightGBM), CatBoost builds an ensemble of trees sequentially. Each new tree tries to correct the errors (residuals) made by the previous trees in the ensemble. This helps in improving the overall model performance.

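**3. Ordered Boosting:**

CatBoost uses ordered boosting: following a random permutation of the data, the residual for each training example is computed with a model that has not seen that example. This reduces overfitting and improves generalization, especially on smaller datasets, and the same permutation idea underlies its handling of categorical features.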

**4. Robust to Overfitting:**

Thanks to its ordered boosting and regularization techniques, CatBoost is robust to overfitting, making it a good choice for datasets that have complex patterns or noise.

**5. Supports Missing Values:**

CatBoost automatically handles missing values in numerical features during model training, so you don't need to perform any special preprocessing steps for them.

**6. Fast Training with Parallelization:**

CatBoost is optimized for speed, thanks to parallelization and its ability to efficiently use hardware resources (e.g., multi-core CPUs or GPUs).

**7. Great for Tabular Data:**

CatBoost is particularly powerful for tabular data, where categorical variables and interactions between features are important.

**8. Model Interpretation:**

CatBoost offers tools for model interpretation and understanding feature importance, making it easier to explain and debug the model.

### When to Use CatBoost:
* **Categorical Data:** CatBoost is ideal when you have categorical features and don’t want to manually encode them.
* **General Purpose:** It works well on a variety of tasks like classification, regression, and ranking.
* **Complex, Noisy Data:** If your data has noise or complex relationships, CatBoost’s robustness to overfitting is beneficial.
* **Scalability:** If you have large datasets and require fast, scalable solutions, CatBoost's parallelized training can be an advantage.

# Summary CatBoost

* CatBoost trains on random permutations of the dataset; by default it creates four of them. This randomness helps reduce overfitting.
* The randomness can be controlled further by tuning the bagging_temperature parameter.
* Ordered boosting is typically slower than plain boosting and pays off mainly on small datasets (roughly fewer than 50k samples). Inference is nonetheless very fast, because CatBoost uses a specific kind of tree called a symmetric tree.
* CatBoost and XGBoost can be slow to train on CPU, particularly on large datasets.
* CatBoost does not perform well on sparse datasets.
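
Why symmetric trees give fast inference can be seen in a short sketch. In a symmetric (oblivious) tree, every node on the same level uses the same feature and threshold, so a prediction reduces to computing one bit per level and indexing into a flat list of leaf values. The representation below (a list of `(feature_index, threshold)` pairs and the bit ordering) is an assumption chosen for illustration, not CatBoost's internal format.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Predict with a symmetric tree.

    splits: one (feature_index, threshold) pair per tree level.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    """
    leaf_index = 0
    for level, (feature, threshold) in enumerate(splits):
        # Each level contributes exactly one bit to the leaf index.
        if x[feature] > threshold:
            leaf_index |= 1 << level
    return leaf_values[leaf_index]


# A depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 2.5), (1, 1.0)]
leaf_values = [10.0, 20.0, 30.0, 40.0]

pred_a = oblivious_tree_predict([3.0, 0.5], splits, leaf_values)
pred_b = oblivious_tree_predict([1.0, 2.0], splits, leaf_values)
pred_c = oblivious_tree_predict([3.0, 2.0], splits, leaf_values)
print(pred_a, pred_b, pred_c)
```

There is no data-dependent branching down different subtrees: the same comparisons run for every input, which is easy to vectorize.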


In [19]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp311-cp311-macosx_11_0_universal2.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting numpy<2.0,>=1.16.0 (from catboost)
  Using cached numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl.metadata (61 kB)
Collecting plotly (from catboost)
  Downloading plotly-6.0.0-py3-none-any.whl.metadata (5.6 kB)
Collecting narwhals>=1.15.1 (from plotly->catboost)
  Downloading narwhals-1.28.0-py3-none-any.whl.metadata (10 kB)
Downloading catboost-1.2.7-cp311-cp311-macosx_11_0_universal2.whl (27.1 MB)
Using cached numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl (20.6 MB)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)

In [23]:
# Import necessary libraries
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the CatBoostClassifier
model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.05, cat_features=[])

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")


ImportError: dlopen(./VENV/lib/python3.11/site-packages/catboost/_catboost.so, 0x0002): symbol not found in flat namespace '_PyCMethod_New'

In [24]:
from sklearn.datasets import load_wine
from catboost import CatBoostClassifier, Pool, cv

ImportError: dlopen(./VENV/lib/python3.11/site-packages/catboost/_catboost.so, 0x0002): symbol not found in flat namespace '_PyCMethod_New'