### Imports

In [5]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
some_data = pd.read_csv('diabetes.csv')

# Classification
### Little revision
- ROC AUC (area under the ROC curve) is one of the most versatile and widely used metrics for judging the quality of a binary classification model.
- For multi-class classification problems, it is common to use the accuracy score and the confusion matrix.
- Numeric features should generally be scaled (z-scored). Tree-based models are insensitive to feature scaling, so this mainly matters for linear and distance-based models.
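
To make the revision points concrete, here is a minimal pure-Python sketch of the three ideas: AUC computed as the fraction of (positive, negative) pairs that are ranked correctly, a confusion matrix with accuracy, and z-scoring. The helper names and the toy labels and scores are made up for illustration; in practice you would normally reach for the library implementations.

```python
def roc_auc(y_true, scores):
    """AUC = probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


# Toy binary problem (made-up data)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC:", auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75

# Toy three-class problem (made-up data)
y_true_mc = [0, 1, 2, 2, 1]
y_pred_mc = [0, 2, 2, 2, 1]
cm = confusion_matrix(y_true_mc, y_pred_mc, n_classes=3)
acc = accuracy(y_true_mc, y_pred_mc)
print("confusion matrix:", cm)
print("accuracy:", acc)

scaled = z_score([1.0, 2.0, 3.0])
print("z-scored:", scaled)
```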


### Introduction to XGBoost
XGBoost is an optimized gradient-boosting machine learning library known for its speed and strong performance. It builds a meta-model composed of many individual models (base learners) whose outputs are combined to give a final prediction.

In [2]:
X, y = some_data.iloc[:,:-1], some_data.iloc[:,-1]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)
# binary:logistic: logistic regression for binary classification, outputs probabilities
# n_estimators is the number of boosting rounds, i.e. the number of trees to build
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.798701


### Trees
XGBoost is usually used with trees as base learners; it uses `CART` (Classification and Regression Trees). Those are described in my pdf with notes.
- Base learner - an individual learning algorithm in an ensemble algorithm
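
As a rough illustration of how a CART-style classification tree chooses a split, the sketch below scans the candidate thresholds of a single feature and keeps the one with the lowest weighted Gini impurity. It is a one-level "stump" on made-up data (the `glucose` and `outcome` values are hypothetical, not taken from `diabetes.csv`), not the full tree-growing procedure.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= threshold."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up feature values and labels, perfectly separable for clarity
glucose = [85, 90, 100, 140, 150, 160]
outcome = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(glucose, outcome)
print("best threshold:", threshold)
print("weighted Gini after split:", impurity)
```

A full tree repeats this search over every feature and then recurses into the left and right partitions until a stopping rule such as `max_depth` is reached.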

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
dt_clf_4 = DecisionTreeClassifier(max_depth=4)
dt_clf_4.fit(X_train, y_train)
y_pred_4 = dt_clf_4.predict(X_test)
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)

accuracy: 0.7727272727272727


### Boosting
Boosting is an ensemble "meta-algorithm": a concept that can be applied to a set of ML models to convert a collection of weak learners into a strong learner.
- Weak learner - ML algorithm that is slightly better than chance (50% for a balanced binary problem)
- Strong learner - ML algorithm that can be tuned to achieve good performance
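
The idea of combining weak learners can be sketched with plain gradient boosting for squared error: each round fits a one-split regression stump to the current residuals and adds a shrunken copy of it to the ensemble. This is a simplified illustration on made-up data, not XGBoost's actual algorithm (which also uses second-order gradient information and regularization); the function names and the `learning_rate` value are assumptions chosen for the example.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Weak learner: a single-split regression stump fitted by least squares."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))  # training error after this round

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Made-up, roughly step-shaped data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 4.9]

predict, history = boost(xs, ys)
print("training MSE after 1 round:  ", history[0])
print("training MSE after 20 rounds:", history[-1])
```

A single stump can only output two distinct values, yet the training error keeps shrinking as more stumps are added, which is the sense in which many weak learners form a strong one.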


#### Cross-validation
We use a special data structure called `DMatrix`, which `xgb.cv` takes as its `dtrain` argument.

`params` is your parameter dictionary, `nfold` is the number of cross-validation folds, `num_boost_round` is the number of trees we want to build, and `metrics` is the metric you want to compute (here "error", which we convert to accuracy as 1 - error).
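
To show what the fold count controls, the following pure-Python sketch splits sample indices into folds and then converts per-fold error rates into an accuracy, mirroring the `1 - error` conversion described above. The helper `k_fold_indices` and the error values are hypothetical; this version uses contiguous folds without shuffling, which is a simplification of what real cross-validation routines typically do.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, validation_indices) pairs."""
    # Spread the remainder over the first folds so sizes differ by at most one
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, n_folds=4)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)

# Made-up per-fold error rates, standing in for the last boosting round
fold_errors = [0.25, 0.20, 0.30, 0.25]
mean_error = sum(fold_errors) / len(fold_errors)
cv_accuracy = 1 - mean_error
print("mean error:", mean_error, "accuracy:", cv_accuracy)
```

Every sample lands in exactly one validation fold, so each row is used for validation once and for training in all the other folds.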



In [19]:
some_dmatrix = xgb.DMatrix(data=X, label=y) 
params = {"objective":"binary:logistic","max_depth":4}
cv_results = xgb.cv(dtrain=some_dmatrix, params=params, nfold=4, 
                    num_boost_round=10, metrics="error", as_pandas=True, seed=123)
print("Accuracy {}".format(1-cv_results['test-error-mean'].iloc[-1]))

Accuracy 0.76171875


In [21]:
cv_results = xgb.cv(dtrain=some_dmatrix, params=params,nfold=3, num_boost_round=5,
                    metrics="auc", as_pandas=True,seed=123)
print(cv_results)
# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.869696       0.012520       0.773788      0.037565
1        0.903689       0.016147       0.803789      0.047307
2        0.912587       0.013909       0.814454      0.044036
3        0.919556       0.010974       0.818057      0.046373
4        0.924181       0.009099       0.815802      0.045803
0.8158021262367859


#### When should we use XGBoost?
- Large number of training examples: more than 1000 training samples and fewer than 100 features (in general, you should be fine as long as the number of features is smaller than the number of training samples)
- Mixture of categorical and numeric features or just numeric


#### When not to use XGBoost?
- Image recognition
- Computer vision
- NLP
- Very small training sets, or when the number of training samples is significantly smaller than the number of features

# Regression
To evaluate a regression model we usually use RMSE (root mean squared error) or MAE (mean absolute error).
### Objective (loss) functions and base learners
We obviously want to minimize the loss function.
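
As a small worked example of the two regression metrics, the sketch below computes RMSE and MAE by hand on made-up targets and predictions. The values are purely illustrative and are not taken from `boston.csv`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions; the residuals are 1, 0, -2, 0
y_true_demo = [3.0, 5.0, 2.0, 7.0]
y_pred_demo = [2.0, 5.0, 4.0, 7.0]

rmse_demo = rmse(y_true_demo, y_pred_demo)   # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
mae_demo = mae(y_true_demo, y_pred_demo)     # (1 + 0 + 2 + 0) / 4 = 0.75
print("RMSE:", rmse_demo)
print("MAE: ", mae_demo)
```

RMSE comes out larger than MAE here because squaring gives the single residual of size 2 more weight than the residual of size 1.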

![loss-fun.png](attachment:loss-fun.png)

Note: the `reg:linear` objective has been deprecated in favor of `reg:squarederror`.

In [6]:
boston_data = pd.read_csv('boston.csv')
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: {}".format(rmse))

RMSE: 3.7824431053497274


`booster` [default=`gbtree`]: which booster to use. Can be `gbtree`, `gblinear` or `dart`; `gbtree` and `dart` use tree-based models, while `gblinear` uses linear functions.

In [11]:
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)
params = {"booster":"gblinear", "objective":"reg:squarederror"}
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10) 
preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: {}".format(rmse))

RMSE: 6.162421747363111
