# XGBoost

## 1. What is boosting?

- Boosting is an ensemble technique in machine learning that primarily reduces bias, and also variance, in supervised learning.

- It converts a set of weak learners into a strong learner.

- A weak learner is defined as a classifier that is only slightly correlated with the true classification, i.e. it labels examples only slightly better than random guessing.

## 2. Explain the following three models: AdaBoost, Gradient Boost and XGBoost (also discuss their similarities and differences)

**AdaBoost**

- In the AdaBoost algorithm, misclassified samples receive a higher weight than those that are correctly classified.

- The higher weight gives those samples more priority when the next weak learner is trained.

- The algorithm generates each weak classifier by training the next learner on the mistakes of the previous one.

- It minimises the upper bound of the training error by properly choosing the optimal weak classifier and its voting weight.

- AdaBoost minimises the exponential loss function, which can make the algorithm sensitive to outliers.
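The reweighting described above can be sketched in a few lines. This is a minimal illustration of one AdaBoost round, assuming labels in {-1, +1} and a weighted error strictly between 0 and 1; the function name `adaboost_round` and the toy data are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1} (illustrative sketch)."""
    # Weighted error of the current weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Voting weight of the weak classifier: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # mistakes are scaled up by exp(alpha) and correct samples scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Toy data: five samples, the weak classifier gets only sample 3 wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, which is why the next weak learner is pushed to get it right.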

**Gradient boost**

- Gradient boosting algorithms are some of the most powerful machine
learning algorithms for classification and regression

- It is a special case of boosting in which gradient descent is used to minimise a loss function by adding weak learners one at a time.

- In gradient boosting, the size of the constituent trees J, the number of boosting iterations M and the shrinkage (learning rate) can be tuned for better performance.

- Any differentiable loss function can be used with gradient boosting. With a suitable loss, gradient boosting is more robust to outliers than AdaBoost.

- Gradient boosting is more flexible than AdaBoost in searching for approximate solutions to the additive modelling problem.
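The roles of J, M and the shrinkage can be sketched with a tiny from-scratch booster. This is a simplified illustration, assuming squared-error loss (whose negative gradient is just the residual) and depth-one trees with J = 2 leaves; all function names and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a depth-one regression tree (J = 2 leaves) by least squares."""
    best = None
    xs = sorted(set(x))
    # Candidate thresholds lie halfway between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, xi):
    t, lv, rv = stump
    return lv if xi <= t else rv

def gradient_boost(x, y, M=50, shrinkage=0.1):
    """Run M boosting iterations, each scaled by the shrinkage factor."""
    f0 = sum(y) / len(y)  # initial constant model
    preds = [f0] * len(y)
    stumps = []
    for _ in range(M):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + shrinkage * predict_stump(stump, xi) for xi, pi in zip(x, preds)]
    return f0, stumps, preds

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

f0, stumps, preds = gradient_boost(x, y, M=50, shrinkage=0.1)
baseline_mse = mse(y, [f0] * len(y))
boosted_mse = mse(y, preds)
print(baseline_mse, boosted_mse)
```

A smaller shrinkage makes each tree contribute less, so more iterations M are needed to reach the same training error.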

**XGBoost**

- XGBoost follows the same design principles as gradient boosting: weak learners are combined into a strong learner, and new models are added to correct the errors made by the existing models.

- Like other gradient-boosted trees, XGBoost adds trees sequentially, so each new tree learns from the errors of the previous iterations. It is faster mainly because the construction of each individual tree (the search for the best split across features) is parallelised.
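As a rough summary of how new models are added, the additive update and the regularised objective XGBoost optimises can be written as follows. The notation is an assumption for illustration: $\eta$ is the learning rate, $T$ the number of leaves of a tree, $w$ its leaf weights, and the L1 term is omitted (as with `reg_alpha=0`).

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i), \qquad
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
$$

Here $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` arguments passed to `XGBClassifier` later in this notebook.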


## 3. What is a DMatrix?

- DMatrix is an internal data structure that is used by XGBoost

- It is optimized for both memory efficiency and training speed

- We can construct DMatrix from multiple different sources of data

## 4. Load the dataset attached with this HW (wholesale)

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,confusion_matrix,classification_report,roc_auc_score,average_precision_score
import xgboost as xgb

In [44]:
df=pd.read_csv('/content/wholesale-data.csv')
df.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


## 5. Change the dataset into a DMatrix

In [45]:
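# Features are every column after the first one (Channel); Channel is the target.
X = df.iloc[:, 1:]
y = df['Channel']
# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

# Wrap both splits in XGBoost's DMatrix structure.
train_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
test_dmatrix = xgb.DMatrix(data=X_test, label=y_test)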

(352, 7)
(88, 7)


## 6. What is an imbalanced dataset?

In [46]:
df.Channel.value_counts()

1    298
2    142
Name: Channel, dtype: int64

- Here Channel takes either 1 or 2 as its label.

- The labels 1 and 2 do not occur equally often (298 vs. 142 rows), so the data is imbalanced.

- A classifier tends to detect class 1 more accurately than class 2, because class 1 has about twice as many samples as class 2.

- So, imbalanced data refers to datasets where the target class has an uneven distribution of observations.
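The degree of imbalance can be quantified directly from the counts above. The sketch below also computes the negative-to-positive ratio that is commonly used as a starting value for XGBoost's `scale_pos_weight` parameter; treating class 2 as the positive class is an assumption made for this example.

```python
# Class counts taken from df.Channel.value_counts() above.
counts = {1: 298, 2: 142}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Assumption: class 2 (the minority) is treated as the positive class.
# A common heuristic is scale_pos_weight = (#negative) / (#positive).
scale_pos_weight = counts[1] / counts[2]

print({label: round(s, 3) for label, s in shares.items()})
print(round(scale_pos_weight, 2))
```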

## 6. Perform classification with "channel" as the target variable and XGBoost as the model

In [88]:
xgb_class = xgb.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, objective='binary:logistic', booster='gbtree', n_jobs=1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=42)
xgb_class.fit(X_train, y_train)
y_train_pred = xgb_class.predict(X_train)
y_test_pred = xgb_class.predict(X_test)

In [89]:
y_train_pred

array([2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1,
       1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1,
       1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2,
       2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1,
       1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2,
       1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1,
       1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2,
       1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
       2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1,

In [90]:
y_test_pred

array([1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
       2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2,
       2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1])

In [None]:
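# Caution: "reg:linear" is a regression objective (newer XGBoost releases call it "reg:squarederror"),
# so the cross-validation below reports RMSE on the 1/2 labels instead of a classification metric.
params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,'max_depth': 5, 'alpha': 10}
CV_training = xgb.cv(dtrain=train_dmatrix, params=params, nfold=3, num_boost_round=2,early_stopping_rounds=1, as_pandas=True, seed=123)
CV_test = xgb.cv(dtrain=test_dmatrix, params=params, nfold=3, num_boost_round=2,early_stopping_rounds=1, as_pandas=True, seed=123)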

In [80]:
CV_training.head()

Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,0.886079,0.020403,0.888743,0.041592
1,0.824922,0.018652,0.827549,0.042901


In [81]:
CV_test.head()

Unnamed: 0,train-rmse-mean,train-rmse-std,test-rmse-mean,test-rmse-std
0,0.828047,0.044423,0.825027,0.094567
1,0.784019,0.041457,0.780885,0.097133


## 7. Evaluate the model using accuracy, f1 score, AUROC and AUPRC and explain which one is the best evaluation metric for this model and why?

In [107]:
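# Metrics on the training set.
# Caveat: roc_auc_score and average_precision_score receive hard 1/2 predictions here rather than
# probabilities, and average_precision_score treats label 1 as the positive class (pos_label=1 by default).
print('confusion_matrix\n',confusion_matrix(y_train,y_train_pred))
print('classification_report\n',classification_report(y_train, y_train_pred))
print('roc_auc_score\n',roc_auc_score(y_train,y_train_pred))
print('average_precision_score\n',average_precision_score(y_train,y_train_pred))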

confusion_matrix
 [[231   2]
 [  5 114]]
classification_report
               precision    recall  f1-score   support

           1       0.98      0.99      0.99       233
           2       0.98      0.96      0.97       119

    accuracy                           0.98       352
   macro avg       0.98      0.97      0.98       352
weighted avg       0.98      0.98      0.98       352

roc_auc_score
 0.9746997511450933
average_precision_score
 0.656397994672192


In [108]:
# Metrics on the test set (computed from hard 1/2 predictions, not probabilities)
print('confusion_matrix\n',confusion_matrix(y_test,y_test_pred))
print('classification_report\n',classification_report(y_test, y_test_pred))
print('roc_auc_score\n',roc_auc_score(y_test, y_test_pred))
print('average_precision_score\n',average_precision_score(y_test, y_test_pred))

confusion_matrix
 [[60  5]
 [ 3 20]]
classification_report
               precision    recall  f1-score   support

           1       0.95      0.92      0.94        65
           2       0.80      0.87      0.83        23

    accuracy                           0.91        88
   macro avg       0.88      0.90      0.89        88
weighted avg       0.91      0.91      0.91        88

roc_auc_score
 0.8963210702341137
average_precision_score
 0.6972027972027972


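- Accuracy is the fraction of labels identified correctly. It does not take into account whether the dataset is balanced, so it can look high even when the minority class is predicted poorly.

- F1 score is the harmonic mean of precision and recall, so it is high only when both are high. In my opinion, when correct detection of one particular class is the priority, this metric should be given the highest priority.

- AUROC is not a good measure for imbalanced data, because it can look overly optimistic when one class dominates.

- AUPRC is used when true negatives are not our primary concern, since neither precision nor recall depends on them.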
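The test-set scores reported above can be reproduced by hand from the confusion matrix, which also shows what the two ranking metrics measure when they are given hard labels instead of probabilities. This is a sketch under two assumptions: class 2 is treated as the positive class for accuracy, F1 and AUROC, and `average_precision_score` uses its default `pos_label=1`. The helper name `ap_single_point` is hypothetical.

```python
# Test-set confusion matrix from the notebook: rows are true labels (1, 2),
# columns are predicted labels (1, 2).
tn, fp = 60, 5   # true class 1: predicted 1, predicted 2
fn, tp = 3, 20   # true class 2: predicted 1, predicted 2
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Precision, recall and F1 with class 2 (the minority) as the positive class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard labels as scores the ROC curve has a single interior point,
# so its area reduces to the mean of sensitivity and specificity.
auroc_hard = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def ap_single_point(hits, flagged, positives, n):
    """Average precision when the score takes only two distinct values."""
    rec = hits / positives   # recall at the higher score
    prec = hits / flagged    # precision at the higher score
    # At the lower score everything is flagged: recall 1, precision = prevalence.
    return rec * prec + (1 - rec) * (positives / n)

# Label 1 as positive while the score is the predicted label: the rows with
# the higher score are those predicted as 2, so the ranking is inverted.
ap_label1_positive = ap_single_point(hits=fp, flagged=fp + tp, positives=tn + fp, n=total)

# Class 2 as positive, same hard-label scores.
ap_label2_positive = ap_single_point(hits=tp, flagged=fp + tp, positives=fn + tp, n=total)

print(round(accuracy, 4), round(f1, 4))
print(auroc_hard)           # agrees with the reported roc_auc_score
print(ap_label1_positive)   # agrees with the reported average_precision_score
print(ap_label2_positive)
```

Because both ranking metrics collapse to a single operating point here, passing class probabilities (for example the positive-class column of `predict_proba`) would give a more informative AUROC and AUPRC.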