
### 1. Introduction to XGBoost
   - **Definition**: XGBoost stands for eXtreme Gradient Boosting. It is a scalable implementation of gradient-boosted decision trees designed for predictive accuracy and computational speed. It is particularly effective on structured/tabular data and is widely used in machine learning competitions.
   - **Use Cases**:
     - Data science competitions (e.g., Kaggle)
     - Customer churn prediction
     - Anomaly detection
     - Risk management in finance

### 2. Installation and Setup
   - **Installation**:
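     ```bash
     pip install xgboost
     ```
     Installs the XGBoost library from PyPI. In a notebook cell, run `%pip install xgboost` instead.
   - **Setting up the environment**:
     Import the required libraries, load a dataset, and split it into training and test sets before passing the data to XGBoost (see Basic Usage).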

### 3. Key Features of XGBoost
   - **Regularization**:
     - Prevents overfitting by adding penalties to the loss function for increasing model complexity.
     - Parameters like `alpha` (L1 regularization term on weights) and `lambda` (L2 regularization term on weights) help control overfitting.
   - **Parallel Processing**:
     - Uses multiple CPU cores for training, which speeds up the computation significantly.
     - The `nthread` parameter (`n_jobs` in the scikit-learn wrapper) specifies the number of threads to use.
   - **Handling Missing Values**:
     - At each split, learns a default direction (left or right) for rows with missing values, chosen to minimize the training loss.
     - This is useful when dealing with incomplete datasets.
   - **Tree Pruning**:
     - Grows each tree up to `max_depth`, then prunes backward, removing splits whose loss reduction falls below `gamma` (also called `min_split_loss`).
     - The `max_depth` parameter caps the depth of each tree, which limits overfitting and memory consumption.
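
To make the regularization and pruning bullets concrete, here is a minimal sketch of the split-gain and leaf-weight formulas from the XGBoost objective, using only the L2 term (`lambda`) and the minimum split loss (`gamma`). The function names and the gradient sums are illustrative assumptions, not part of the library's API, and the L1 term (`alpha`) is left out for brevity.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value for summed gradients G and Hessians H: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one node into a left and a right child."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Toy gradient statistics, made up for illustration.
gain_no_penalty = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=0.0)
gain_with_l2 = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0)
gain_with_gamma = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=5.0)

print(gain_no_penalty, gain_with_l2, gain_with_gamma)
```

A larger `lambda` shrinks both the gain and the leaf weights, and a split whose gain is negative once `gamma` is subtracted is a candidate for pruning.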

### 4. Basic Usage
   - **Loading Dataset**:
     ```python
     import xgboost as xgb
     from sklearn.datasets import fetch_california_housing
     from sklearn.model_selection import train_test_split
     from sklearn.metrics import mean_squared_error

     # Load data
     housing = fetch_california_housing()
     X, y = housing.data, housing.target
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     ```
     - Fetches the California Housing dataset and splits it into training (80%) and test (20%) sets.
   - **DMatrix**:
     - A specialized data structure in XGBoost optimized for both memory efficiency and training speed.
     ```python
     dtrain = xgb.DMatrix(X_train, label=y_train)
     dtest = xgb.DMatrix(X_test, label=y_test)
     ```

### 5. Training the Model
   - **Parameters**:
     - Defines hyperparameters to control the training process.
     ```python
     params = {
         'objective': 'reg:squarederror',  # for regression task
         'max_depth': 6,
         'eta': 0.3,  # learning rate
         'subsample': 0.7,
         'colsample_bytree': 0.7
     }
     ```
   - **Training**:
     - Trains the model using the specified parameters and dataset.
     ```python
     num_round = 100
     bst = xgb.train(params, dtrain, num_round)
     ```
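
To show what `eta` and `num_round` control, the following is a simplified sketch of additive boosting in plain Python. It fits one-split trees (stumps) to the residuals of a squared-error loss and adds each one scaled by `eta`. This is ordinary gradient boosting on a made-up one-feature dataset, not XGBoost's actual algorithm: it omits second-order information, regularization, and column/row subsampling, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold whose two leaf means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, num_round, eta):
    """Additive training: each round fits a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # eta shrinks the contribution of every new tree.
        preds = [p + eta * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
error_1_round = mse(ys, boost(xs, ys, num_round=1, eta=0.3))
error_100_rounds = mse(ys, boost(xs, ys, num_round=100, eta=0.3))
print(error_1_round, error_100_rounds)
```

A smaller `eta` makes each round more conservative, so it usually needs a larger `num_round` to reach the same training error.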

### 6. Making Predictions
   - **Predicting and Evaluating**:
     - Generates predictions on the test dataset and evaluates the performance using Root Mean Squared Error (RMSE).
     ```python
     preds = bst.predict(dtest)
     # Take the square root of the MSE; the squared=False argument was removed in recent scikit-learn releases
     rmse = mean_squared_error(y_test, preds) ** 0.5
     print(f'RMSE: {rmse}')
     ```
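
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on made-up numbers; the helper name `rmse` is an illustrative assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Squared errors are 1, 0, 4, 0, so the mean is 1.25.
example_rmse = rmse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0])
print(example_rmse)
```

RMSE is expressed in the units of the target. For the California Housing dataset the target is the median house value in units of $100,000.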

### 7. Advanced Features
   - **Early Stopping**:
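     - Stops training when the evaluation metric on a validation set has not improved for a specified number of rounds. When `evals` contains several datasets, XGBoost monitors the last one, so the validation set goes last.
     ```python
     # The last entry in evals drives early stopping; ideally use a validation split separate from the final test set
     evallist = [(dtrain, 'train'), (dtest, 'eval')]
     bst = xgb.train(params, dtrain, num_round, evals=evallist, early_stopping_rounds=10)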
     ```
   - **Cross-Validation**:
     - Runs k-fold cross-validation to estimate generalization error, which helps in choosing hyperparameters and the number of boosting rounds.
     ```python
     cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5, metrics={'rmse'}, early_stopping_rounds=10)
     print(cv_results)
     ```
   - **Feature Importance**:
     - Visualizes the importance of each feature. By default, `plot_importance` ranks features by `weight`, the number of times a feature is used in a split.
     ```python
     import matplotlib.pyplot as plt
     xgb.plot_importance(bst)
     plt.show()
     ```
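
The early stopping rule described above can be sketched in plain Python. This is a simplified illustration for a metric where lower is better, with a made-up list of validation scores; the function name `best_iteration` is hypothetical and the real implementation lives inside XGBoost.

```python
def best_iteration(scores, early_stopping_rounds):
    """Return (best_round, rounds_run) for a metric where lower is better."""
    best_round = 0
    best_score = float("inf")
    for current_round, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop here.
            return best_round, current_round + 1
    return best_round, len(scores)


validation_rmse = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.59, 0.58]
impatient = best_iteration(validation_rmse, early_stopping_rounds=3)
patient = best_iteration(validation_rmse, early_stopping_rounds=5)
print(impatient, patient)
```

With a patience of 3 the loop stops before it sees the later improvement, which is the trade-off behind the choice of `early_stopping_rounds`.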

### 8. Hyperparameter Tuning
   - **Grid Search**:
     - Uses grid search to find the optimal hyperparameters for the model.
     ```python
     from sklearn.model_selection import GridSearchCV
     param_grid = {
         'max_depth': [3, 5, 7],
         'learning_rate': [0.01, 0.1, 0.3],
         'subsample': [0.5, 0.7, 1.0]
     }
     grid_search = GridSearchCV(
         estimator=xgb.XGBRegressor(objective='reg:squarederror'),
         param_grid=param_grid,
         scoring='neg_mean_squared_error',
         cv=5,
     )
     grid_search.fit(X_train, y_train)
     print(grid_search.best_params_)
     ```
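
Grid search cost grows multiplicatively with each parameter. The sketch below enumerates the same grid with `itertools.product` to count how many models the search trains; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.5, 0.7, 1.0],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per fold.
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` also trains one final model on the full training set using the best combination.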

### 9. Model Interpretation
   - **SHAP Values**:
     - SHAP (SHapley Additive exPlanations) values attribute each prediction to the individual input features. They are computed with the separate `shap` package.
     ```python
     import shap
     explainer = shap.Explainer(bst)
     shap_values = explainer(X_test)
     shap.summary_plot(shap_values, X_test)
     ```
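
To illustrate what a Shapley value is, the sketch below computes exact values for a tiny hypothetical model by averaging each feature's marginal contribution over all feature orderings. It is a brute-force illustration under simplifying assumptions (a single baseline point, two features), not the algorithm the `shap` library uses for tree models.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            # Switch this feature from its baseline value to its actual value.
            current[feature] = x[feature]
            output = model(current)
            totals[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(features):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * features[0] + features[1] + features[0] * features[1]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi))
```

The values sum to the difference between the model output at `x` and at the baseline, which is the additivity property that SHAP plots rely on.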

### 10. Saving and Loading Models
   - **Saving Model**:
     - Saves the trained model to a file for later use.
     ```python
     bst.save_model('xgboost_model.json')
     ```
   - **Loading Model**:
     - Loads a saved model from a file.
     ```python
     loaded_bst = xgb.Booster()
     loaded_bst.load_model('xgboost_model.json')
     ```

### 11. Integration with Other Libraries
   - **Scikit-learn Integration**:
     - Integrates XGBoost with scikit-learn for a more familiar interface and additional functionalities like pipelines.
     ```python
     from xgboost import XGBRegressor
     model = XGBRegressor(objective='reg:squarederror')
     model.fit(X_train, y_train)
     preds = model.predict(X_test)
     ```

### 12. Dealing with Imbalanced Datasets
   - **Parameter Adjustment**:
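     - Sets the `scale_pos_weight` parameter to the ratio of negative to positive examples so that errors on the minority (positive) class weigh more. This applies to binary classification only, so `y_train` must hold 0/1 labels rather than the regression target used in the earlier sections.
     ```python
     # Assumes y_train is a NumPy array of binary 0/1 labels and the objective is 'binary:logistic'
     params['scale_pos_weight'] = (y_train == 0).sum() / (y_train == 1).sum()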
     ```
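
As a worked example of the ratio, the sketch below computes it for a made-up label list with 90 negatives and 10 positives. The helper name `compute_scale_pos_weight` is hypothetical.

```python
def compute_scale_pos_weight(labels):
    """Ratio of negative to positive examples for 0/1 labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("No positive examples: scale_pos_weight is undefined.")
    return negatives / positives


toy_labels = [0] * 90 + [1] * 10
weight = compute_scale_pos_weight(toy_labels)
print(weight)
```

Here each positive example counts nine times as much as a negative one, which balances the total weight of the two classes.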

### 13. Custom Objective and Evaluation Functions
   - **Custom Objective**:
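     - Defines a custom objective function for specific needs. The function must return the gradient and the Hessian of the loss with respect to the predictions, one value per row.
     ```python
     import numpy as np

     def custom_obj(preds, dtrain):
         # Squared-error loss 0.5 * (pred - label)^2: gradient is (pred - label), Hessian is 1
         labels = dtrain.get_label()
         grad = preds - labels
         hess = np.ones_like(preds)
         return grad, hess

     bst = xgb.train(params, dtrain, num_round, obj=custom_obj)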
     ```
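
A custom evaluation function follows a different contract from a custom objective: it receives the predictions and the `DMatrix` and returns a `(name, value)` pair. The sketch below defines a mean absolute error metric in plain Python. `FakeDMatrix` is a hypothetical stand-in used only to exercise the function without XGBoost installed.

```python
def mae_metric(preds, dtrain):
    """Custom evaluation function: returns a (name, value) pair."""
    labels = dtrain.get_label()
    total = sum(abs(float(p) - float(y)) for p, y in zip(preds, labels))
    return "mae", total / len(labels)


class FakeDMatrix:
    """Stand-in for xgb.DMatrix that only provides get_label()."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


# Absolute errors are 0, 1 and 2, so the mean is 1.0.
name, value = mae_metric([1.0, 2.0, 4.0], FakeDMatrix([1.0, 3.0, 2.0]))
print(name, value)
```

In recent XGBoost releases such a function can be passed to `xgb.train` through the `custom_metric` argument (older releases used `feval`); check the documentation for the installed version.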

### 14. GPU Support
   - **Enabling GPU**:
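     - Uses GPU acceleration for faster training. This requires a CUDA-capable GPU and a GPU-enabled XGBoost build.
     ```python
     # XGBoost 2.0+ deprecates 'gpu_hist'; there, set tree_method='hist' and device='cuda' instead
     params['tree_method'] = 'gpu_hist'
     bst = xgb.train(params, dtrain, num_round)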
     ```

### 15. XGBoost in a Distributed Setting
   - **Running on Multiple Machines**:
     - Uses the Dask (`xgboost.dask`) or PySpark (`xgboost.spark`) interfaces to train on datasets that are partitioned across multiple machines and too large for a single machine's memory.
