**Classification with XGBoost**
___
Introduction:
- Supervised learning
    - **has labeled data** - we already know the outcomes for past examples of what we are trying to predict
    - **Classification** - binary or multi-class
        - AUC is a common metric for binary classification
            - area under the receiver operating characteristic (ROC) curve
            - probability that a randomly chosen positive data point is ranked higher than a randomly chosen negative data point
            - the higher the AUC, the better the model
        - for multi-class problems, the confusion matrix and accuracy score are the usual evaluation tools
- features are either numeric or categorical
    - numeric features should be scaled for scale-sensitive models (e.g., SVMs)
    - categorical features should be encoded (one-hot)
- **Ranking** - predicting an ordering on a set of choices
- **Recommending** - recommending an item to a user based on consumption history and profile
___
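
The ranking interpretation of AUC noted above can be checked by hand. The following is a minimal sketch in plain Python, not how any library computes it: the function name `auc_by_pairs`, the tie-handling rule (a tie counts as half a win), and the labels and scores are all made up for illustration.

```python
from itertools import product

def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win
    (an assumption of this sketch).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, invented for this example.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = auc_by_pairs(labels, scores)
print(f"AUC: {auc:.3f}")
```

Seven of the nine positive/negative pairs are ordered correctly here, so the sketch reports 7/9. The pairwise loop is quadratic in the number of samples, which is fine for intuition but not for large datasets.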

**Introduction to XGBoost**
___
- optimized gradient boosting machine learning library
- written in C++
- fast
- parallelizable

In [None]:
#XGBoost: Fit/Predict

#It's time to create your first XGBoost model! As Sergey showed you
#in the video, you can use the scikit-learn .fit() / .predict()
#paradigm that you are already familiar with to build your XGBoost models,
#as the xgboost library has a scikit-learn compatible API!

#Here, you'll be working with churn data. This dataset contains imaginary
#data from a ride-sharing app with user behaviors over their first month
#of app usage in a set of imaginary cities as well as whether they used
#the service 5 months after sign-up. It has been pre-loaded for you into
#a DataFrame called churn_data - explore it in the Shell!

#Your goal is to use the first month's worth of data to predict whether
#the app's users will remain users of the service at the 5 month mark.
#This is a typical setup for a churn prediction problem. To do this,
#you'll split the data into training and test sets, fit a small xgboost
#model on the training set, and evaluate its performance on the test set
#by computing its accuracy.

#pandas and numpy have been imported as pd and np, and train_test_split
#has been imported from sklearn.model_selection. Additionally, the arrays
#for the features and the target have been created as X and y.

# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
#X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
#X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
#xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
#xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
#preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
#accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
#print("accuracy: %f" % (accuracy))

#################################################
#<script.py> output:
#    accuracy: 0.743300
#################################################

**What is a decision tree?**
___
- series of binary choices
- prediction happens at the "leaves" of the tree
- **base learner** - individual learning algorithm in an ensemble algorithm
- decision trees (including CART) are constructed iteratively, one split at a time, until a stopping criterion is met
- individual decision trees tend to overfit
    - low bias, high variance
    - generalize to new data poorly
- **Classification and Regression Trees (CART)**
    - each leaf *always* contains a real-valued score which can be later converted into categories
___
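
To make "a series of binary choices" concrete, here is a small sketch of how a single split might be chosen for one numeric feature. It assumes Gini impurity as the split criterion and uses midpoints between observed values as candidate thresholds; the helper names and the toy data are hypothetical, and real tree libraries are considerably more elaborate.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical one-feature dataset: a size measurement and a 0/1 outcome.
radius = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
malignant = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(radius, malignant)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each side of the split until a stopping criterion such as a maximum depth is reached, which is also where the tendency to overfit comes from.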

In [None]:
#Decision trees

#Your task in this exercise is to make a simple decision tree using
#scikit-learn's DecisionTreeClassifier on the breast cancer dataset
#that comes pre-loaded with scikit-learn.

#This dataset contains numeric measurements of various dimensions of
#individual tumors (such as perimeter and texture) from breast biopsies
#and a single outcome value (the tumor is either malignant, or benign).

#We've preloaded the dataset of samples (measurements) into X and the
#target values per tumor into y. Now, you have to split the complete
#dataset into training and testing sets, and then train a
#DecisionTreeClassifier. You'll specify a parameter called max_depth.
#Many other parameters can be modified within this model, and you can
#check all of them out at
#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
#dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
#dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
#y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
#accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
#print("accuracy:", accuracy)

#################################################
#<script.py> output:
#   accuracy: 0.9649122807017544
#################################################

**What is Boosting?**
___
- a concept that can be applied to a set of machine learning models
- ensemble meta-algorithm used to convert many weak learners into a strong learner
- **weak learner**
    - ML algorithm that is slightly better than chance (50/50)
- **strong learner**
    - any algorithm that can be tuned to achieve good performance
- iteratively learning a set of weak models on subsets of the data
- weighting each weak prediction according to each weak learner's performance
- combine weighted predictions to obtain a single weighted prediction
- **Cross-validation** is baked into XGBoost
___
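
The idea of turning weak learners into a strong one can be sketched with a tiny AdaBoost-style loop over one-feature threshold "stumps". This is an illustration only: XGBoost itself uses gradient boosting rather than AdaBoost, and this sketch reweights the training points instead of sampling subsets. The function names and the toy data are assumptions made for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Pick the (error, threshold, polarity) with the lowest weighted error."""
    best = None
    values = sorted(set(xs))
    thresholds = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - error) / error)  # better stumps get more say
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Combine the weighted stump votes into a single +1/-1 prediction."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data that no single threshold can separate (labels are +1 / -1).
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_stump = adaboost(xs, ys, rounds=1)
three_stumps = adaboost(xs, ys, rounds=3)
print(accuracy(one_stump, xs, ys), accuracy(three_stumps, xs, ys))
```

On this toy data the best single stump gets 7 of 10 training points right, while the weighted vote of three stumps classifies all 10 correctly.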

In [None]:
#Measuring accuracy

#You'll now practice using XGBoost's learning API through its baked
#in cross-validation capabilities. As Sergey discussed in the previous
#video, XGBoost gets its lauded performance and efficiency gains by
#utilizing its own optimized data structure for datasets called a
#DMatrix.

#In the previous exercise, the input datasets were converted into
#DMatrix data on the fly, but when you use the xgboost cv object,
#you have to first explicitly convert your data into a DMatrix. So,
#that's what you will do here before running cross-validation on
#churn_data.

# Create arrays for the features and the target: X, y
#X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
#churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
#cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
#                    nfold=3, num_boost_round=5,
#                    metrics="error", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Print the accuracy
#print(((1-cv_results["test-error-mean"]).iloc[-1]))

#################################################
#<script.py> output:
#       train-error-mean  train-error-std  test-error-mean  test-error-std
#    0           0.28232         0.002366          0.28378        0.001932
#    1           0.26951         0.001855          0.27190        0.001932
#    2           0.25605         0.003213          0.25798        0.003963
#    3           0.25090         0.001845          0.25434        0.003827
#    4           0.24654         0.001981          0.24852        0.000934
#    0.75148
#################################################

#Measuring AUC
#Now that you've used cross-validation to compute average out-of-sample
#accuracy (after converting from an error), it's very easy to compute
#any other metric you might be interested in. All you have to do is pass
#it (or a list of metrics) in as an argument to the metrics parameter
#of xgb.cv().

#Your job in this exercise is to compute another common metric used in
#binary classification - the area under the curve ("auc"). As before,
#churn_data is available in your workspace, along with the DMatrix
#churn_dmatrix and parameter dictionary params.

# Perform cross-validation: cv_results
#cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
#                  nfold=3, num_boost_round=5,
#                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Print the AUC
#print((cv_results["test-auc-mean"]).iloc[-1])

#################################################
#<script.py> output:
#       train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
#    0        0.768893       0.001544       0.767863      0.002820
#    1        0.790864       0.006758       0.789157      0.006846
#    2        0.815872       0.003900       0.814476      0.005997
#    3        0.822959       0.002018       0.821682      0.003912
#    4        0.827528       0.000769       0.826191      0.001937
#    0.826191
#################################################

**When should I use XGBoost?**
___
- any supervised learning example with:
    - large number of training samples (more than 1000 samples with fewer than 100 features)
    - number of features < number of training samples
    - a mixture of categorical and numeric features, or only numeric features
- do not use XGBoost for:
    - image recognition
    - computer vision
    - natural language processing/understanding
    - small training sets (fewer than roughly 1000 samples, or fewer samples than features)
___

**Regression Review**
___
- Common regression metrics
    - root mean squared error (RMSE)
        - square root of the mean of the squared differences between actual and predicted values
        - treats negative and positive errors equally
        - punishes larger differences between predicted and actual values more than smaller ones
    - mean absolute error (MAE)
        - mean of the absolute differences between actual and predicted values
- **Algorithms**
    - linear regression
    - decision trees
___
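
The difference between the two metrics is easiest to see on a small example. The sketch below uses plain Python and made-up numbers: both sets of predictions have the same MAE, but the one with a single large miss has a higher RMSE.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical targets and two hypothetical sets of predictions.
actual = [100.0, 200.0, 300.0, 400.0]
even_errors = [110.0, 190.0, 310.0, 390.0]    # every prediction is off by 10
one_big_error = [100.0, 200.0, 300.0, 440.0]  # a single prediction is off by 40

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Because the errors are squared before averaging, the single miss of 40 doubles the RMSE while leaving the MAE unchanged.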

**Objective (loss) functions and base learners**
___
- Quantifies how far off a prediction is from the actual result
- Measures the difference between estimated and true values for some collection of data
- **Goal**: find the model that yields the minimum value of the loss function
- in xgboost:
    - reg:linear - use for regression problems (deprecated in newer versions in favor of reg:squarederror)
    - reg:logistic - logistic regression; predictions are probabilities that can be thresholded into a decision
    - binary:logistic - logistic regression for binary classification; use when you want the predicted probability
- base learners and why we need them
    - we want base learners that when combined create a final prediction that is **non-linear**
    - each base learner should be good at distinguishing or predicting different parts of the dataset
- two kinds of base learners: tree and linear
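
As a rough illustration of what a loss function "quantifies", the sketch below computes two common losses by hand: mean squared error (the quantity behind a squared-error regression objective) and logistic loss (the quantity behind a logistic objective). The function names, the averaging over samples, and the example numbers are choices made for this sketch, not XGBoost internals.

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean logistic (cross-entropy) loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical labels and three hypothetical sets of predicted probabilities.
labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
unsure = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

for name, probs in [("confident_right", confident_right),
                    ("unsure", unsure),
                    ("confident_wrong", confident_wrong)]:
    print(name, round(log_loss(labels, probs), 4))
```

The loss grows as predictions move away from the truth, which is why minimizing it pushes the model toward better predictions.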

In [None]:
#Decision trees as base learners

#It's now time to build an XGBoost model to predict house prices -
#not in Boston, Massachusetts, as you saw in the video, but in Ames,
#Iowa! This dataset of housing prices has been pre-loaded into a
#DataFrame called df. If you explore it in the Shell, you'll see that
#there are a variety of features about the house and its location in
#the city.

#In this exercise, your goal is to use trees as base learners. By default,
#XGBoost uses trees as base learners, so you don't have to specify that you
#want to use trees here with booster="gbtree".

#xgboost has been imported as xgb and the arrays for the features and
#the target are available in X and y, respectively.

# Create the training and test sets
#X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
#xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10, seed=123)

# Fit the regressor to the training set
#xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
#preds = xg_reg.predict(X_test)

# Compute the rmse: rmse (mean_squared_error comes from sklearn.metrics)
#from sklearn.metrics import mean_squared_error
#rmse = np.sqrt(mean_squared_error(y_test, preds))
#print("RMSE: %f" % (rmse))

#################################################
#<script.py> output:
#    [18:40:17] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
#    RMSE: 78847.401758
#################################################

#Linear base learners

#Now that you've used trees as base models in XGBoost, let's use the
#other kind of base model that can be used with XGBoost - a linear
#learner. This model, although not as commonly used in XGBoost, allows
#you to create a regularized linear regression using XGBoost's powerful
#learning API. However, because it's uncommon, you have to use XGBoost's
#own non-scikit-learn compatible functions to build the model, such as
#xgb.train().

#In order to do this you must create the parameter dictionary that
#describes the kind of booster you want to use (similarly to how you
#created the dictionary in Chapter 1 when you used xgb.cv()). The
#key-value pair that defines the booster type (base model) you need is
#"booster":"gblinear".

#Once you've created the model, you can use the .train() and .predict()
#methods of the model just like you've done in the past.

#Here, the data has already been split into training and testing sets,
#so you can dive right into creating the DMatrix objects required by the
#XGBoost learning API.

# Convert the training and testing sets into DMatrixes: DM_train, DM_test
#DM_train = xgb.DMatrix(data=X_train, label=y_train)
#DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
#params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg
#xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
#preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
#rmse = np.sqrt(mean_squared_error(y_test,preds))
#print("RMSE: %f" % (rmse))

#################################################
#<script.py> output:
#    [18:52:04] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
#    RMSE: 40738.238504
#################################################

#Evaluating model quality
#It's now time to begin evaluating model quality.

#Here, you will compare the RMSE and MAE of a cross-validated XGBoost
#model on the Ames housing data. As in previous exercises, all necessary
#modules have been pre-loaded and the data is available in the DataFrame
#df.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary: params
#params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
#cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="mae", as_pandas=True, seed=123)

# Print cv_results
#print(cv_results)

# Extract and print the final boosting round metric
#print((cv_results["test-mae-mean"]).tail(1))

#results printed below for "rmse" first and "mae" second
#################################################
#train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
#    0    141767.531250      429.454591   142980.429688    1193.794436
#    1    102832.544922      322.474657   104891.394532    1223.158855
#    2     75872.619140      266.472468    79478.935547    1601.344218
#    3     57245.650390      273.624608    62411.922851    2220.149653
#    4     44401.297851      316.422372    51348.279297    2963.377719
#    4    51348.279297
#   Name: test-rmse-mean, dtype: float64

#train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
#    0   127343.476562     668.342129  127633.980469   2404.003469
#    1    89770.052735     456.962096   90122.501953   2107.915156
#    2    63580.791992     263.403452   64278.561524   1887.563452
#    3    45633.153321     151.884551   46819.167969   1459.819091
#    4    33587.092774      86.999100   35670.644531   1140.607997
#    4    35670.644531
#    Name: test-mae-mean, dtype: float64
#################################################

**Regularization and base learners in XGBoost**
___
- Regularization is a control on model complexity
- Want models that are both accurate and as simple as possible
- Regularization parameters in XGBoost
    - *gamma* - minimum loss reduction required for a split to occur (higher values mean fewer splits)
    - *alpha* - l1 regularization on leaf weights (higher values mean more regularization; many weights go to zero)
    - *lambda* - l2 regularization on leaf weights (a smoother penalty than l1; weights shrink gradually rather than going to zero)
- Linear base learner
    - sum of linear terms
    - boosted model is a weighted sum of linear models (and is therefore itself linear)
    - rarely used (you can get the same performance from a regularized linear model)
- Tree base learner
    - decision tree
    - boosted model is weighted sum of decision trees (nonlinear)
    - almost always used in XGBoost
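
To see how *lambda* and *gamma* act, the sketch below evaluates the leaf-weight and split-gain formulas from XGBoost's "Introduction to Boosted Trees" documentation on made-up numbers. It assumes squared-error loss (so each gradient is prediction minus label and each hessian is 1) and leaves *alpha* out for simplicity; the function names and the data are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Loss reduction from a split, minus the gamma penalty for adding a leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Hypothetical leaf holding four residuals (label minus current prediction).
residuals = [10.0, 12.0, 8.0, 10.0]
G = -sum(residuals)        # sum of gradients under squared-error loss
H = float(len(residuals))  # sum of hessians (1 per sample)

for lam in (0, 1, 10, 100):
    print(f"lambda={lam:>3}: leaf weight = {leaf_weight(G, H, lam):.3f}")

# Hypothetical split: left and right children pull in opposite directions.
for gam in (0, 200):
    gain = split_gain(-20.0, 2.0, 20.0, 2.0, reg_lambda=1, gamma=gam)
    print(f"gamma={gam:>3}: gain = {gain:.2f}")
```

Larger *lambda* values pull the leaf weight toward zero, and a split is only worth keeping when its gain stays positive after subtracting *gamma*.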

In [None]:
#Using regularization in XGBoost
#Having seen an example of l1 regularization in the video, you'll now
#vary the l2 regularization penalty - also known as "lambda" - and see
#its effect on overall model performance on the Ames housing dataset.

# Create the DMatrix: housing_dmatrix
#housing_dmatrix = xgb.DMatrix(data=X, label=y)

#reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
#params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
#rmses_l2 = []

# Iterate over reg_params
#for reg in reg_params:

    # Update l2 strength
#    params["lambda"] = reg

    # Pass this updated param dictionary into cv
#    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)

    # Append best rmse (final round) to rmses_l2
#    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
#print("Best rmse as a function of l2:")
#print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2","rmse"]))

#################################################
#Best rmse as a function of l2:
#        l2          rmse
#    0    1  52275.359375
#    1   10  57746.064453
#    2  100  76624.625000
#################################################

In [None]:
#Visualizing individual XGBoost trees
#Now that you've used XGBoost to both build and evaluate regression as
#well as classification models, you should get a handle on how to
#visually explore your models. Here, you will visualize individual trees
#from the fully boosted model that XGBoost creates using the entire
#housing dataset.

#XGBoost has a plot_tree() function that makes this type of
#visualization easy. Once you train a model using the XGBoost learning
#API, you can pass it to the plot_tree() function along with the index
#of the tree you want to plot using the num_trees argument.


![_images/11.1.svg](_images/11.1.svg)
![_images/11.2.svg](_images/11.2.svg)
![_images/11.3.svg](_images/11.3.svg)