# Extreme Gradient Boosting with XGBoost
Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. 

XGBoost is a very fast, scalable implementation of gradient boosting.

Summary
1. Classification with XGBoost
2. Regression with XGBoost
3. Fine-tuning XGBoost model
4. XGBoost in Pipelines

Reference: Sergey Fogelson, VP Analytics, Viacom, DataCamp

In [None]:
# note on adding image
<img src="image.png" width="500" />

## 1. Classification with XGBoost
Understand the basics of:
- supervised classification
- decision trees
- boosting

Supervised learning
- relies on labeled data - some understanding of past behavior
- 2 kinds of problems: regression and classification
- Classification problems
    - outcomes are binary or multi-class
    - AUC - metric for binary classification models
        - larger area under the ROC curve = better model, more sensitive
    - Accuracy score and confusion matrix - metric for multiclass
    - common algorithms: logistic regression and decision trees
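
As a rough illustration of the AUC bullet above: AUC can be read as the fraction of (positive, negative) pairs in which the model scores the positive example higher. The sketch below computes it that way on a tiny made-up set of labels and scores; `roc_auc` is a hypothetical helper written for this note, not a library function.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half a correct pair
    return correct / (len(positives) * len(negatives))

# Made-up labels and predicted scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic, so this is only meant for building intuition on small inputs.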

Other supervised learning considerations
- require a table of feature vectors
- Features can be either numeric or categorical
- Numeric features should be scaled (Z-scored)
- Categorical features should be encoded (one-hot)
- Other problems
    - Ranking - predicting an ordering on a set of choices
        - e.g. Google search
    - Recommendation
        - Recommending an item to a user
        - based on consumption history and profile
        - e.g. Netflix
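
A minimal sketch of the two preprocessing steps mentioned above (Z-scoring a numeric feature and one-hot encoding a categorical one), using plain NumPy. The column names `trip_distance` and `city` and their values are made up for illustration.

```python
import numpy as np

# Hypothetical toy columns: one numeric feature, one categorical feature
trip_distance = np.array([2.0, 4.0, 6.0, 8.0])
city = ["A", "B", "A", "C"]

# Z-score: subtract the mean, divide by the standard deviation
z_scored = (trip_distance - trip_distance.mean()) / trip_distance.std()

# One-hot: one 0/1 column per distinct category
categories = sorted(set(city))
one_hot = np.array([[1 if value == cat else 0 for cat in categories]
                    for value in city])

print(z_scored)
print(categories)
print(one_hot)
```

After scaling, the numeric column has mean 0 and standard deviation 1, and every row of `one_hot` contains exactly one 1.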

XGBoost introduction

What is XGBoost?
- Optimized gradient-boosting ML library
- originally written as a C++ command-line application
- Has APIs in several languages:
    - Python, R, Scala, Julia, Java
    
What makes XGBoost so popular?
- speed and performance
- core algorithm is parallelizable - GPUs and networks of computers
    - feasible to scale to 100s of millions of training examples
- the real draw is: consistently outperforms single-algorithm methods
    - state of the art performance in many ML tasks




### 1.1 XGBoost: Classification example

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load data
class_data = pd.read_csv("classification_data.csv")
X, y = class_data.iloc[:,:-1], class_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=123)

# instantiate XGBoost classifier
xg_cl = xgb.XGBClassifier(objective='binary:logistic',
                          n_estimators=10, seed=123)
# fit and predict
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

# evaluate accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
# 0.743300

### 1.2 XGBoost: Fit/Predict
It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!

Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.

In [None]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

# accuracy: 0.743300

### 1.3 What is a decision tree?
Decision trees as base learners
- Base learner - individual learning algorithm in an ensemble algorithm
    - note: XGB is an ensemble learning method that uses outputs of many models for a final prediction
- Composed of a series of binary questions/decisions - y/n, T/F
- Predictions happen at the "leaves" of the tree

Decision trees and CART (=classification and regression trees for ML)
- Constructed iteratively (one decision at a time)
    - Until a stopping criterion is met
- Individual decision trees tend to overfit
    - Low Bias and High Variance learning models
        - good at learning relationships but tend to overfit, so generalize poorly
- CART: Classification and Regression Trees
    - XGB uses this
    - Each leaf ALWAYS contains a real-valued score for classification or regression
    - Can later be converted into categories
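
To make the "series of binary questions" and "real-valued score at each leaf" ideas concrete, here is a hand-written depth-2 tree. The feature names, thresholds and leaf scores are all invented for illustration; a real tree would learn them from data. The sigmoid step assumes a binary classification setting where the score is treated as a log-odds value.

```python
import math

def tree_score(sample):
    """A hand-written depth-2 tree: two yes/no questions, then a leaf score."""
    if sample["trips_in_first_month"] < 3:
        if sample["avg_rating"] < 4.0:
            return -1.2
        return -0.3
    if sample["weekday_pct"] < 50.0:
        return 0.4
    return 1.5

def to_probability(score):
    """Squash a real-valued score into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical user record
user = {"trips_in_first_month": 5, "avg_rating": 4.8, "weekday_pct": 70.0}
score = tree_score(user)          # real-valued leaf score
prob = to_probability(score)      # converted to a probability
label = int(prob > 0.5)           # converted to a category
print(score, prob, label)
```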
        

### 1.4 Decision trees
Your task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign).

We've preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You'll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out here (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123)

# Instantiate the classifier: dt_clf_4
# :param max_depth of 4. This parameter specifies the maximum number 
#  of successive split points you can have before reaching a leaf node.
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)


### 1.5 What is Boosting?
Boosting overview
- Not a specific machine learning algorithm
- concept that can be applied to a set of ML models
    - "Meta-algorithm"
- Ensemble meta-algorithm used to convert many weak learners into a strong learner

Weak learners and strong learners
- Weak learner = ML algorithm that is slightly better than chance
    - Example: decision tree whose predictions are slightly better than 50%
- Boosting converts a collection of weak learners into a strong learner
- Strong learner = any algorithm that can be tuned to achieve good performance

How is boosting accomplished?
- Iteratively learning a set of weak models on subsets of the data
- Weighting each weak prediction according to each weak learner's performance
    - final prediction is much better than any individual predictions
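
The loop below is a toy sketch of the idea of combining many weak models into a stronger one. Note the assumption: it uses the residual-fitting (gradient boosting) variant with a fixed learning rate, rather than per-learner performance weights, and it is not XGBoost's actual algorithm. Each weak learner is a one-split "stump"; `fit_stump` and `predict_stump` are hypothetical helpers.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        fitted = np.where(left, left_value, right_value)
        error = np.sum((residual - fitted) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy 1-D regression problem
x = np.linspace(0.0, 6.0, 50)
y = np.sin(x)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = []
for _ in range(20):
    stump = fit_stump(x, y - prediction)             # weak learner on what is left
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(np.sqrt(np.mean((y - prediction) ** 2)))

print("RMSE after round 1: %f" % errors[0])
print("RMSE after round 20: %f" % errors[-1])
```

No single stump can follow a sine curve, but the training RMSE keeps shrinking as more stumps are added.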

Model evaluation through cross-validation
- Cross-validation = robust method for estimating the performance of a model on unseen data
    - generates many non-overlapping train/test splits on the training data
    - reports the average test set performance across all data splits
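
A bare-bones sketch of k-fold cross-validation in plain Python. The "model" here is a deliberately trivial stand-in that predicts the majority training label, and the labels are made up; `k_fold_indices` and `majority_class_accuracy` are hypothetical helpers. Real implementations usually shuffle the data first, which this sketch skips.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs with non-overlapping test folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def majority_class_accuracy(y, train_idx, test_idx):
    """Stand-in model: predict the most common training label."""
    train_labels = [y[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(y[i] == majority for i in test_idx) / len(test_idx)

# Made-up binary labels
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
scores = [majority_class_accuracy(y, train_idx, test_idx)
          for train_idx, test_idx in k_fold_indices(len(y), 5)]
mean_accuracy = sum(scores) / len(scores)
print(scores)
print(mean_accuracy)
```

Every sample lands in exactly one test fold, and the reported number is the average over the folds.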


#### 1.5.a Cross-validation in XGBoost

In [None]:
import xgboost as xgb
import pandas as pd
churn_data = pd.read_csv("classification_data.csv")

# DMatrix: features are all columns but the last, label is the churn column
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1],
                            label=churn_data.month_5_still_here)

# parameter dictionary to pass into cross validation
params={"objective":"binary:logistic","max_depth":4}

# use cv method and pass required dmatrix
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,
                   num_boost_round=10, metrics="error", as_pandas=True)

# output accuracy
print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))
# Accuracy: 0.88315

### 1.6 Measuring accuracy
You'll now practice using XGBoost's learning API through its baked in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgboost cv object, you have to first explicitly convert your data into a DMatrix. So, that's what you will do here before running cross-validation on churn_data.

In [None]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
# "error" metrics will be converted to accuracy
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="error", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

# output
       test-error-mean  test-error-std  train-error-mean  train-error-std
    0          0.28378        0.001932           0.28232         0.002366
    1          0.27190        0.001932           0.26951         0.001855
    2          0.25798        0.003963           0.25605         0.003213
    3          0.25434        0.003827           0.25090         0.001845
    4          0.24852        0.000934           0.24654         0.001981
    0.75148

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1-error. The final accuracy of around 75% is an improvement from earlier!

### 1.7 Measuring AUC
Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

Your job in this exercise is to compute another common metric used in binary classification - the area under the curve ("auc"). As before, churn_data is available in your workspace, along with the DMatrix churn_dmatrix and parameter dictionary params.

In [None]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="auc", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

<script.py> output:
       test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
    0       0.767863      0.002820        0.768893       0.001544
    1       0.789157      0.006846        0.790864       0.006758
    2       0.814476      0.005997        0.815872       0.003900
    3       0.821682      0.003912        0.822959       0.002018
    4       0.826191      0.001937        0.827528       0.000769
    0.826191

# An AUC of about 0.83 is quite strong. As you have seen, XGBoost's 
# learning API makes it very easy to compute any metric you may be 
# interested in.

### 1.8 When should I use XGBoost?
When to use XGBoost? Criteria:
- large number of training samples
    - > 1000 training samples and < 100 features
    - should be ok when number of features < number of training samples
- you have mixture of categorical and numeric features
- Or just numeric features

When to NOT use XGBoost? Criteria:
- other algorithms have already had more success on the problem type
    - Not suited for (deep learning is better):
        - image recognition
        - computer vision
        - NLP and NL understanding problems
- dataset size issues
    - < 100 training samples
    - when number of training samples is significantly smaller than the number of features
    

### 1.9 Using XGBoost
XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn't always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the below examples and select the one which would be the best use of XGBoost.

Possible Answers
- Visualizing the similarity between stocks by comparing the time series of their historical prices relative to each other.
- Predicting whether a person will develop cancer using genetic data with millions of genes, 23 examples of genomes of people that didn't develop cancer, 3 genomes of people who wound up getting cancer.
- Clustering documents into topics based on the terms used in them.
- Predicting the likelihood that a given user will click an ad from a very large clickstream log with millions of users and their web interactions.

Answer:
- D

## 2. Regression with XGBoost
Regression review
- Outcome is real-valued - e.g. height

Common regression metrics
- Root mean squared error (RMSE)
    - Total squared error - sum the squared differences b/n Actual and Predicted
    - Mean squared error - take mean
    - Root Mean Squared error - take square root
        - squaring allows us to treat negative and positive errors equally
        - but tends to punish larger differences b/n Actual and Predicted
- Mean absolute error (MAE)
    - Total absolute error - sum the absolute differences b/n Actual and Predicted
    - Mean absolute error - take mean
    - isn't affected by large differences as much as RMSE
    - but not frequently used b/c it lacks some mathematical properties
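
A small numeric sketch of the two metrics above, on made-up values where one prediction is far off. It shows how the single large miss pulls RMSE up more than MAE.

```python
import numpy as np

# Made-up values; the last prediction misses by 10
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 11.0, 15.0, 26.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the root
mae = np.mean(np.abs(errors))          # absolute value, then average

print("RMSE: %f" % rmse)
print("MAE: %f" % mae)
```

Three of the four errors have size 1, yet the RMSE lands above 5 while the MAE is 3.25.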

Common regression algorithms
- linear regression
- Decision trees


### Which of these is a regression problem?
Here are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a clear example of a regression problem.

Possible Answers
- Recommending a restaurant to a user given their past history of restaurant visits and reviews for a dining aggregator app.
- Predicting which of several thousand diseases a given person is most likely to have given their symptoms.
- Tagging an email as spam/not spam based on its content and metadata (sender, time sent, etc.).
- Predicting the expected payout of an auto insurance claim given claim properties (car, accident type, driver prior history, etc.).

Answer: D

### 2.1 Objective (loss) functions and base learners
Objective functions and why we use them
- Loss functions quantify how far off a prediction is from the actual result
- Measures the difference b/n estimated and true values for some collection of data
- Goal: Find the model that yields the minimum value of the loss function

Common Loss Functions in XGBoost
- reg:linear - use for regression problems (deprecated in newer XGBoost releases in favor of reg:squarederror)
- reg:logistic - use for classification problems when you want just decision, not probability
- binary:logistic - use when you want probability rather than just decision
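
For intuition, here are per-example versions of the two losses behind these objectives, written by hand. This assumes the regression objective is plain squared error and the logistic objectives use the usual log loss; the numbers are made up, and the functions are illustrative rather than XGBoost internals.

```python
import math

def squared_error(y_true, y_pred):
    """Per-example loss for a real-valued target."""
    return (y_true - y_pred) ** 2

def logistic_loss(y_true, prob):
    """Per-example log loss for a 0/1 target and a predicted probability."""
    return -(y_true * math.log(prob) + (1 - y_true) * math.log(1 - prob))

print(squared_error(200.0, 180.0))  # regression: prediction is off by 20
print(logistic_loss(1, 0.9))        # confident and right: small loss
print(logistic_loss(1, 0.1))        # confident and wrong: large loss
```

In both cases a lower value means the prediction is closer to the truth, which is what the training procedure tries to minimize.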

Base Learners and why we need them
- XGBoost involves creating a meta-model (ensemble learning method) that is composed of many individual models that combine to give a final prediction
- Individual models = Base Learners
- Want base learners that when combined create final prediction that is non-linear
- Each base learner should be good at distinguishing or predicting different parts of the dataset
    - goal of XGB is to have base learners that are slightly better than random guessing on certain subsets of training examples and uniformly bad on the remainder
    - when predictions are all combined, the uniformly bad predictions cancel out, and those slightly better than chance predictions combine into a very good prediction
- 2 kinds of base learners: tree and linear


#### 2.1.a Trees as Base Learners example: Scikit-learn API
- Boston housing dataset from the UCI repository

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston_data = pd.read_csv("boston_housing.csv")
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=123)
# regression, note objective parameter
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10,
                          seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))


#### 2.1.b Linear Base Learners example: Learning API Only

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston_data = pd.read_csv("boston_housing.csv")
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=123)
# note: difference here
# Learning API requires DMatrix
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)
params = {"booster":"gblinear","objective":"reg:linear"}
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10)
preds = xg_reg.predict(DM_test)

# same here
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

### 2.2.a Decision trees as base learners
It's now time to build an XGBoost model to predict house prices - not in Boston, Massachusetts, as you saw in the video, but in Ames, Iowa! This dataset of housing prices has been pre-loaded into a DataFrame called df. If you explore it in the Shell, you'll see that there are a variety of features about the house and its location in the city.

In this exercise, your goal is to use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

xgboost has been imported as xgb and the arrays for the features and the target are available in X and y, respectively.

In [None]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123)

# Instantiate the XGBRegressor: xg_reg
# :param n_estimators = 10, specifies 10 trees
# Note: You don't have to specify booster="gbtree" as this is the default.
xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

# RMSE: 78847.401758

### 2.2.b Linear base learners
Now that you've used trees as base models in XGBoost, let's use the other kind of base model that can be used with XGBoost - a linear learner. This model, although not as commonly used in XGBoost, allows you to create a regularized linear regression using XGBoost's powerful learning API. However, because it's uncommon, you have to use XGBoost's own non-scikit-learn compatible functions to build the model, such as xgb.train().

In order to do this you must create the parameter dictionary that describes the kind of booster you want to use (similarly to how you created the dictionary in Chapter 1 (https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost/classification-with-xgboost?ex=9) when you used xgb.cv()). The key-value pair that defines the booster type (base model) you need is "booster":"gblinear".

Once you've created the model with xgb.train(), you can use its .predict() method just like you've done in the past.

Here, the data has already been split into training and testing sets, so you can dive right into creating the DMatrix objects required by the XGBoost learning API.

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, 
                   dtrain=DM_train, 
                   num_boost_round =5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

# RMSE: 43965.314324
# Looks like the linear base learners performed better than the tree

### 2.3 Evaluating model quality
It's now time to begin evaluating model quality.

Here, you will compare the RMSE and MAE of a cross-validated XGBoost model on the Ames housing data. As in previous exercises, all necessary modules have been pre-loaded and the data is available in the DataFrame df.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, 
                    num_boost_round=5, metrics="rmse", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))

   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0   142980.433594    1193.791602    141767.531250      429.454591
1   104891.394532    1223.158855    102832.544922      322.469930
2    79478.937500    1601.344539     75872.615235      266.475960
3    62411.920899    2220.150028     57245.652343      273.625086
4    51348.279297    2963.377719     44401.298828      316.423666
4    51348.279297
Name: test-rmse-mean, dtype: float64

In [None]:
# Now adapt code to compute the "mae" instead of the "rmse"

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, 
                    num_boost_round=5, metrics="mae", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))

   test-mae-mean  test-mae-std  train-mae-mean  train-mae-std
0  127634.000000   2404.009898   127343.482421     668.308109
1   90122.501953   2107.912810    89770.056641     456.965267
2   64278.558594   1887.567576    63580.791016     263.404950
3   46819.168945   1459.818607    45633.155274     151.883420
4   35670.646484   1140.607452    33587.090820      86.999396
4    35670.646484
Name: test-mae-mean, dtype: float64

### 2.4 Regularization and base learners in XGBoost
- Regularization is a control on model complexity
    - penalize models as they get more complex
- Want models that are both accurate and as simple as possible
- Regularization parameters in XGBoost:
    - gamma - minimum loss reduction required for a split to occur
        - higher values lead to fewer splits
    - alpha - l1 regularization on leaf weights, larger values mean more regularization (which drives many leaf weights to 0)
    - lambda - l2 regularization on leaf weights
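
The sketch below shows how these three parameters act on a single leaf and a single candidate split. It follows the leaf-weight and gain formulas from XGBoost's introductory documentation, where `grad_sum` and `hess_sum` are the sums of first and second derivatives of the loss over the examples in a node. The handling of alpha via soft-thresholding is an assumption about how the L1 term is applied, and all numbers are made up.

```python
def soft_threshold(value, alpha):
    """Shrink value toward 0 by alpha; anything within [-alpha, alpha] becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, alpha=0.0):
    """Leaf score: lambda enlarges the denominator, alpha shrinks the numerator."""
    return -soft_threshold(grad_sum, alpha) / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; a split is only worthwhile if this is positive."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma

print(leaf_weight(-8.0, 4.0, reg_lambda=1.0))              # baseline weight
print(leaf_weight(-8.0, 4.0, reg_lambda=10.0))             # larger lambda: smaller weight
print(leaf_weight(-8.0, 4.0, reg_lambda=1.0, alpha=10.0))  # large alpha: weight is 0
print(split_gain(-6.0, 2.0, -2.0, 2.0, gamma=0.0))         # positive: split is kept
print(split_gain(-6.0, 2.0, -2.0, 2.0, gamma=1.0))         # negative: split is pruned
```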
    

#### 2.4.a L1 Regularization in XGBoost example

In [None]:
import xgboost as xgb
import pandas as pd
boston_data = pd.read_csv("boston_data.csv")
X,y = boston_data.iloc[:,:-1],boston_data.iloc[:,-1]
boston_dmatrix = xgb.DMatrix(data=X, label=y)
params={"objective":"reg:linear","max_depth":4}
l1_params = [1,10,1000]
# empty list to store rmse values
rmses_l1 = []

for reg in l1_params:
    # update the l1 penalty, then cross-validate with it
    params["alpha"] = reg
    cv_results = xgb.cv(dtrain=boston_dmatrix, params=params, nfold=4,
                        num_boost_round=10, metrics="rmse",
                        as_pandas=True, seed=123)
    # keep the final-round test RMSE
    rmses_l1.append(cv_results["test-rmse-mean"].tail(1).values[0])

print("Best rmse as a function of l1:")
print(pd.DataFrame(list(zip(l1_params, rmses_l1)),
                   columns=["l1", "rmse"]))

### 2.4.b Base Learners in XGBoost - 2 types
- Linear Base Learner:
    - Sum of linear terms
        - Like linear/logistic regression model
    - Boosted model is weighted sum of linear models (thus is itself linear)
    - Rarely used - since you don't get any non-linear combo of features
        - can get identical performance from a regularized linear model
- Tree Base Learner:
    - Decision Tree
    - Boosted model is weighted sum of decision trees (nonlinear)
    - Almost exclusively used in XGBoost
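
A quick numerical check of the claim that a weighted sum of linear models is itself linear. The three "base learners" below are made-up coefficient vectors and intercepts, not fitted models.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 made-up samples, 3 features

# Three hypothetical linear base learners: (coefficients, intercept)
learners = [(np.array([1.0, 0.0, 2.0]), 0.5),
            (np.array([-0.5, 1.0, 0.0]), -1.0),
            (np.array([0.2, 0.2, 0.2]), 0.1)]
weights = [0.3, 0.3, 0.3]

# Boosted prediction: weighted sum of each learner's output
boosted = sum(w * (X @ coef + b) for w, (coef, b) in zip(weights, learners))

# The same prediction from ONE linear model with combined parameters
combined_coef = sum(w * coef for w, (coef, _) in zip(weights, learners))
combined_intercept = sum(w * b for w, (_, b) in zip(weights, learners))
single = X @ combined_coef + combined_intercept

print(np.allclose(boosted, single))
```

Because the ensemble collapses to a single set of coefficients, stacking more linear learners adds no non-linearity, which is why tree base learners are the usual choice.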
    
Creating DataFrames from multiple equal-length lists
- pd.DataFrame(list(zip(list1,list2)),columns=["list1","list2"])
- zip creates a lazy iterator of parallel values:
    - list(zip([1,2,3],['a','b','c'])) = [(1,'a'),(2,'b'),(3,'c')]
    - the iterator needs to be completely instantiated before it can be used in DataFrame objects
- list() instantiates the full iterator, and passing that list into pd.DataFrame builds one row per tuple


In [None]:
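# Creating DataFrames from multiple equal-length lists
import pandas as pd
list1, list2 = [1, 2, 3], ['a', 'b', 'c']
pd.DataFrame(list(zip(list1, list2)), columns=["list1", "list2"])

# example of zip: list() turns the lazy iterator into a list of tuples
list(zip([1, 2, 3], ['a', 'b', 'c']))  # [(1, 'a'), (2, 'b'), (3, 'c')]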

### 2.4.c Using regularization in XGBoost
Having seen an example of l1 regularization in the video, you'll now vary the l2 regularization penalty - also known as "lambda" - and see its effect on overall model performance on the Ames housing dataset.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                             params=params, 
                             nfold=2, 
                             num_boost_round=5, 
                             metrics="rmse", 
                             as_pandas=True, 
                             seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

<script.py> output:
    Best rmse as a function of l2:
        l2          rmse
    0    1  52275.357421
    1   10  57746.064453
    2  100  76624.625000
    
# looks like as lambda increases, rmse increases

### 2.5 Visualizing individual XGBoost trees - plot_tree()
Now that you've used XGBoost to both build and evaluate regression as well as classification models, you should get a handle on how to visually explore your models. Here, you will visualize individual trees from the fully boosted model that XGBoost creates using the entire housing dataset.

XGBoost has a plot_tree() function that makes this type of visualization easy. Once you train a model using the XGBoost learning API, you can pass it to the plot_tree() function along with the index of the tree you want to plot using the num_trees argument.

Plot the first tree using xgb.plot_tree(). It takes in two arguments - the model (in this case, xg_reg, which is trained), and num_trees, which is 0-indexed. So to plot the first tree, specify num_trees=0.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

# Plot the last tree sideways
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
plt.show()

Have a look at each of the plots. They provide insight into how the model arrived at its final decisions and what splits it made to arrive at those decisions. This allows us to identify which features are the most important in determining house price. In the next exercise, you'll learn another way of visualizing feature importances.

<img src="images/xgb1.png" width="500" />
<img src="images/xgb2.png" width="500" />
<img src="images/xgb3.png" width="500" />

### 2.6 Visualizing feature importances: What features are most important in my dataset - plot_importance()
Another way to visualize your XGBoost models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a plot_importance() function that allows you to do exactly this, and you'll get a chance to use it in this exercise!
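
The counting idea can be sketched without XGBoost at all. Below, each inner list stands for the features one tree split on; the trees are invented for illustration, and apart from GrLivArea the feature names are only examples. As far as I know, this split-count notion corresponds to the default "weight" importance type of plot_importance().

```python
from collections import Counter

# Hypothetical split features, one inner list per boosted tree
trees = [
    ["GrLivArea", "OverallQual", "GrLivArea"],
    ["GrLivArea", "LotArea"],
    ["OverallQual", "GrLivArea", "YearBuilt"],
]

# Count how many times each feature is split on across all trees
split_counts = Counter(feature for tree in trees for feature in tree)

# Order features from most to least frequently used
ranked = split_counts.most_common()
for feature, count in ranked:
    print("%-12s %d" % (feature, count))
```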

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

It looks like GrLivArea is the most important feature.

<img src="images/xgb4.png" width="500" />

## 3. Fine-tuning your XGBoost model
Why tune your model?


<img src="images/xgb4.png" width="300" />

## 4. XGBoost in Pipelines
