# Extreme Gradient Boosting with XGBoost
Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. 

XGBoost is a very fast, scalable implementation of gradient boosting.

Summary
1. Classification with XGBoost
2. Regression with XGBoost
3. Fine-tuning XGBoost model
4. XGBoost in Pipelines

Reference: Sergey Fogelson, VP Analytics, Viacom, DataCamp

In [None]:
# note on adding image
<img src="image.png" width="500" />

## 1. Classification with XGBoost
Understand the basics of:
- supervised classification
- decision trees
- boosting

Supervised learning
- relies on labeled data - some understanding of past behavior
- 2 kinds of problems: regression and classification
- Classification problems
    - outcomes are binary or multi-class
    - AUC - metric for binary classification models
        - larger area under the ROC curve = better model, more sensitive
    - Accuracy score and confusion matrix - metric for multiclass
    - common algorithms: logistic regression and decision trees
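
The metrics listed above can be computed by hand on a toy example. The sketch below is only an illustration: the labels and probabilities are made-up values, and the helper names `auc` and `confusion_matrix` are hypothetical, not library functions. It uses the pairwise-ranking view of AUC (the fraction of positive/negative pairs in which the positive example gets the higher score).

```python
# Toy labels and predicted probabilities (made-up values for illustration)
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

def auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Threshold the probabilities at 0.5 to get hard class predictions
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
auc_value = auc(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred, n_classes=2)

print("AUC:", auc_value)
print("accuracy:", accuracy)
print("confusion matrix:", cm)
```

Note that AUC is computed from the raw probabilities, while accuracy and the confusion matrix depend on the chosen threshold.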

Other supervised learning considerations
- require a table of feature vectors
- Features can be either numeric or categorical
- Numeric features should be scaled (Z-scored)
- Categorical features should be encoded (one-hot)
- Other problems
    - Ranking - predicting an ordering on a set of choices
        - e.g. Google search
    - Recommendation
        - Recommending an item to a user
        - based on consumption history and profile
        - e.g. Netflix
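
As a minimal sketch of the two preprocessing steps mentioned above (Z-scoring numeric features and one-hot encoding categorical ones), the following uses only the standard library. The feature values and the helper names `z_score` and `one_hot` are assumptions made for illustration; in practice a library such as scikit-learn or pandas would usually do this.

```python
from statistics import mean, pstdev

# Hypothetical rows: one numeric feature and one categorical feature
ages = [20.0, 30.0, 40.0]
cities = ["paris", "tokyo", "paris"]

def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """One 0/1 column per distinct category, columns in sorted order."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

scaled_ages = z_score(ages)
categories, encoded_cities = one_hot(cities)

print(scaled_ages)
print(categories, encoded_cities)
```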

XGBoost introduction

What is XGBoost?
- Optimized gradient-boosting ML library
- originally written as a C++ command-line application
- Has APIs in several languages:
    - Python, R, Scala, Julia, Java
    
What makes XGBoost so popular?
- speed and performance
- core algorithm is parallelizable - GPUs and networks of computers
    - feasible to scale to 100s of millions of training examples
- the real draw is: consistently outperforms single-algorithm methods
    - state of the art performance in many ML tasks




### 1.1 XGBoost: Classification example

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load data
class_data = pd.read_csv("classification_data.csv")
X, y = class_data.iloc[:,:-1], class_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=123)

# instantiate XGBoost classifier
xg_cl = xgb.XGBClassifier(objective='binary:logistic',
                          n_estimators=10, seed=123)
# fit and predict
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

# evaluate accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
# 0.743300

### 1.2 XGBoost: Fit/Predict
It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!

Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.

In [None]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

# accuracy: 0.743300

### 1.3 What is a decision tree?
Decision trees as base learners
- Base learner - individual learning algorithm in an ensemble algorithm
    - note: XGB is an ensemble learning method that uses outputs of many models for a final prediction
- Composed of a series of binary questions/decisions - y/n, T/F
- Predictions happen at the "leaves" of the tree

Decision trees and CART (=classification and regression trees for ML)
- Constructed iteratively (one decision at a time)
    - Until a stopping criterion is met
- Individual decision trees tend to overfit
    - Low Bias and High Variance learning models
        - good at learning relationships but tend to overfit, so generalize poorly
- CART: Classification and Regression Trees
    - XGB uses this
    - Each leaf ALWAYS contains a real-valued score for classification or regression
    - Can later be converted into categories
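
To make the last two points concrete, here is a small sketch of how real-valued leaf scores could be turned into a category. It assumes a logistic objective, where the summed leaf scores act as log-odds, and it ignores details such as the initial base score. The leaf scores are hypothetical numbers, not output from a trained model.

```python
import math

def sigmoid(margin):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical leaf scores that three trees assign to one sample
leaf_scores = [0.4, -0.1, 0.3]

margin = sum(leaf_scores)          # trees contribute additively
probability = sigmoid(margin)      # real-valued score -> probability
label = int(probability >= 0.5)    # probability -> category

print(round(probability, 3), label)
```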
        

### 1.4 Decision trees
Your task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign).

We've preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You'll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out here (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123)

# Instantiate the classifier: dt_clf_4
# :param max_depth of 4. This parameter specifies the maximum number 
#  of successive split points you can have before reaching a leaf node.
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)


### 1.5 What is Boosting?
Boosting overview
- Not a specific machine learning algorithm
- concept that can be applied to a set of ML models
    - "Meta-algorithm"
- Ensemble meta-algorithm used to convert many weak learners into a strong learner

Weak learners and strong learners
- Weak learner = ML algorithm that is slightly better than chance
    - Example: decision tree whose predictions are slightly better than 50%
- Boosting converts a collection of weak learners into a strong learner
- Strong learner = any algorithm that can be tuned to achieve good performance

How is boosting accomplished?
- Iteratively learning a set of weak models on subsets of the data
- Weighting each weak prediction according to each weak learner's performance
    - final prediction is much better than any individual predictions
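
The idea can be sketched with a tiny gradient-boosting loop for regression. This is a simplified illustration, not XGBoost's actual algorithm: each weak learner is a one-split "stump" fitted to the residuals of the current ensemble, and every stump's contribution is scaled by a constant learning rate rather than by a per-learner performance weight. The data and the function names are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up 1-D data with three rough plateaus
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 3.9, 7.0, 7.2]

_, one_stump, preds_one = boost(xs, ys, n_rounds=1, learning_rate=1.0)
_, many_stumps, preds_many = boost(xs, ys, n_rounds=20)

print("single stump MSE:", mse(ys, preds_one))
print("20 boosted stumps MSE:", mse(ys, preds_many))
```

A single stump can only represent two levels, so it cannot fit three plateaus; the boosted ensemble of the same weak learners fits the training data much more closely.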

Model evaluation through cross-validation
- Cross-validation = robust method for estimating the performance of a model on unseen data by...
- Generating many non-overlapping train/test splits on training data
- Reports the average test set performance across all data splits
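
A bare-bones version of this procedure might look like the sketch below. It makes several simplifying assumptions: folds are taken in order without shuffling, the "model" is a majority-class baseline, and the labels are made up. The names `k_fold_indices` and `majority_baseline_accuracy` are hypothetical helpers.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs with non-overlapping test folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Made-up binary labels; the "model" just predicts the training majority class
labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

def majority_baseline_accuracy(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

folds = list(k_fold_indices(len(labels), n_folds=3))
fold_scores = [majority_baseline_accuracy(train, test) for train, test in folds]
mean_score = sum(fold_scores) / len(fold_scores)

print("test folds:", [test for _, test in folds])
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", mean_score)
```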


#### 1.5.a Cross-validation in XGBoost

In [None]:
import xgboost as xgb
import pandas as pd
churn_data = pd.read_csv("classification_data.csv")

# DMatrix
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1],
                            label=churn_data.month_5_still_here)

# parameter dictionary to pass into cross validation
params={"objective":"binary:logistic","max_depth":4}

# use cv method and pass required dmatrix
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,
                   num_boost_round=10, metrics="error", as_pandas=True)

# output accuracy
print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))
# Accuracy: 0.88315

### 1.6 Measuring accuracy
You'll now practice using XGBoost's learning API through its baked in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgboost cv object, you have to first explicitly convert your data into a DMatrix. So, that's what you will do here before running cross-validation on churn_data.

In [None]:
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
# "error" metrics will be converted to accuracy
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="error", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

# output
       test-error-mean  test-error-std  train-error-mean  train-error-std
    0          0.28378        0.001932           0.28232         0.002366
    1          0.27190        0.001932           0.26951         0.001855
    2          0.25798        0.003963           0.25605         0.003213
    3          0.25434        0.003827           0.25090         0.001845
    4          0.24852        0.000934           0.24654         0.001981
    0.75148

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1-error. The final accuracy of around 75% is an improvement from earlier!

### 1.7 Measuring AUC
Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

Your job in this exercise is to compute another common metric used in binary classification - the area under the curve ("auc"). As before, churn_data is available in your workspace, along with the DMatrix churn_dmatrix and parameter dictionary params.

In [None]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, 
                    num_boost_round=5, metrics="auc", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

<script.py> output:
       test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
    0       0.767863      0.002820        0.768893       0.001544
    1       0.789157      0.006846        0.790864       0.006758
    2       0.814476      0.005997        0.815872       0.003900
    3       0.821682      0.003912        0.822959       0.002018
    4       0.826191      0.001937        0.827528       0.000769
    0.826191

# An AUC of 0.83 is quite strong. As you have seen, XGBoost's 
# learning API makes it very easy to compute any metric you may be 
# interested in.

### 1.8 When should I use XGBoost?
When to use XGBoost? Criteria:
- large number of training samples
    - > 1000 training samples and < 100 features
    - should be ok when number of features < number of training samples
- you have mixture of categorical and numeric features
- Or just numeric features

When to NOT use XGBoost? Criteria:
- had success with other algorithms
    - Not suited for: (Deep learning is better)
        - image recognition
        - computer vision
        - NLP and NL understanding problems
- dataset size issues
    - < 100 training samples
    - when number of training samples is significantly smaller than the number of features
    

### 1.9 Using XGBoost
XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn't always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the below examples and select the one which would be the best use of XGBoost.

Possible Answers
- Visualizing the similarity between stocks by comparing the time series of their historical prices relative to each other.
- Predicting whether a person will develop cancer using genetic data with millions of genes, 23 examples of genomes of people that didn't develop cancer, 3 genomes of people who wound up getting cancer.
- Clustering documents into topics based on the terms used in them.
- Predicting the likelihood that a given user will click an ad from a very large clickstream log with millions of users and their web interactions.

Answer:
- D

## 2. Regression with XGBoost
Regression review
- Outcome is real-valued - ie. height

Common regression metrics
- Root mean squared error (RMSE)
    - Total Squared Error = square each difference b/n Actual and Predicted, then sum
    - Mean squared error - take mean
    - Root Mean Squared error - take square root
        - squaring treats negative and positive errors equally
        - but tends to punish larger differences b/n Actual and Predicted
- Mean absolute error (MAE)
    - Total Absolute Error = sum the absolute differences b/n Actual and Predicted
    - Mean Absolute Error - take mean
    - isn't affected by large differences as much as RMSE
    - but not frequently used b/c it lacks some mathematical properties (e.g. it is not differentiable at zero)
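
The difference between the two metrics shows up clearly on a small made-up example. In the sketch below, both prediction sets have the same total absolute error, but one spreads it evenly while the other concentrates it in a single large miss. The numbers are illustrative only.

```python
import math

def rmse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

actual = [10.0, 20.0, 30.0, 40.0]
small_misses = [12.0, 18.0, 32.0, 38.0]   # every prediction off by 2
one_big_miss = [10.0, 20.0, 30.0, 48.0]   # one prediction off by 8

print(rmse(actual, small_misses), mae(actual, small_misses))
print(rmse(actual, one_big_miss), mae(actual, one_big_miss))
```

MAE is 2.0 in both cases, while RMSE doubles from 2.0 to 4.0 for the single large miss.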

Common regression algorithms
- linear regression
- Decision trees


### Which of these is a regression problem?
Here are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a clear example of a regression problem.

Possible Answers
- Recommending a restaurant to a user given their past history of restaurant visits and reviews for a dining aggregator app.
- Predicting which of several thousand diseases a given person is most likely to have given their symptoms.
- Tagging an email as spam/not spam based on its content and metadata (sender, time sent, etc.).
- Predicting the expected payout of an auto insurance claim given claim properties (car, accident type, driver prior history, etc.).

Answer: D

### 2.1 Objective (loss) functions and base learners
Objective functions and why we use them
- A loss function quantifies how far off a prediction is from the actual result
- Measures the difference b/n estimated and true values for some collection of data
- Goal: Find the model that yields the minimum value of the loss function

Common Loss Functions in XGBoost
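- reg:linear - use for regression problems (squared error; newer XGBoost releases name this objective reg:squarederror and deprecate reg:linear)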
- reg:logistic - use for classification problems when you want just decision, not probability
- binary:logistic - use when you want probability rather than just decision
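
For intuition, the per-example losses behind these objectives can be written out directly. The sketch below assumes the usual definitions: squared error for the regression objective and log loss (negative log-likelihood) for the logistic objectives. The function names and input values are illustrative, not XGBoost internals.

```python
import math

def squared_error(y_true, y_pred):
    """Loss minimized by a squared-error regression objective."""
    return (y_true - y_pred) ** 2

def log_loss(y_true, prob):
    """Loss minimized by a logistic objective; prob is the predicted P(y = 1)."""
    return -(y_true * math.log(prob) + (1 - y_true) * math.log(1 - prob))

regression_loss = squared_error(3.0, 2.5)
confident_right = log_loss(1, 0.9)   # true label 1, predicted 0.9
confident_wrong = log_loss(1, 0.1)   # true label 1, predicted 0.1

print(regression_loss, confident_right, confident_wrong)
```

A confident wrong prediction is penalized far more heavily than a confident right one, which is what pushes the model toward well-calibrated probabilities.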

Base Learners and why we need them
- XGBoost involves creating a meta-model (ensemble learning method) that is composed of many individual models that combine to give a final prediction
- Individual models = Base Learners
- Want base learners that when combined create final prediction that is non-linear
- Each base learner should be good at distinguishing or predicting different parts of the dataset
    - goal of XGB is to have base learners that are slightly better than random guessing on certain subsets of training examples and uniformly bad on the remainder
    - when predictions are all combined, the uniformly bad predictions cancel out, and those slightly better than chance predictions combine into a very good prediction
- 2 kinds of base learners: tree and linear


#### 2.1.a Trees as Base Learners example: Scikit-learn API
- Boston housing dataset from the UCI repository

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston_data = pd.read_csv("boston_housing.csv")
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=123)
# regression, note objective parameter
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10,
                          seed=123)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))


#### 2.1.b Linear Base Learners example: Learning API Only

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston_data = pd.read_csv("boston_housing.csv")
X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=123)
# note: difference here
# Learning API requires DMatrix
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)
params = {"booster":"gblinear","objective":"reg:linear"}
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10)
preds = xg_reg.predict(DM_test)

# same here
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

### 2.2.a Decision trees as base learners
It's now time to build an XGBoost model to predict house prices - not in Boston, Massachusetts, as you saw in the video, but in Ames, Iowa! This dataset of housing prices has been pre-loaded into a DataFrame called df. If you explore it in the Shell, you'll see that there are a variety of features about the house and its location in the city.

In this exercise, your goal is to use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

xgboost has been imported as xgb and the arrays for the features and the target are available in X and y, respectively.

In [None]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123)

# Instantiate the XGBRegressor: xg_reg
# :param n_estimators = 10, specifies 10 trees
# Note: You don't have to specify booster="gbtree" as this is the default.
xg_reg = xgb.XGBRegressor(objective="reg:linear", n_estimators=10)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

# RMSE: 78847.401758

### 2.2.b Linear base learners
Now that you've used trees as base models in XGBoost, let's use the other kind of base model that can be used with XGBoost - a linear learner. This model, although not as commonly used in XGBoost, allows you to create a regularized linear regression using XGBoost's powerful learning API. However, because it's uncommon, you have to use XGBoost's own non-scikit-learn compatible functions to build the model, such as xgb.train().

In order to do this you must create the parameter dictionary that describes the kind of booster you want to use (similarly to how you created the dictionary in Chapter 1 (https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost/classification-with-xgboost?ex=9) when you used xgb.cv()). The key-value pair that defines the booster type (base model) you need is "booster":"gblinear".

Once you've created the model, you can use the .train() and .predict() methods of the model just like you've done in the past.

Here, the data has already been split into training and testing sets, so you can dive right into creating the DMatrix objects required by the XGBoost learning API.

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, 
                   dtrain=DM_train, 
                   num_boost_round =5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

# RMSE: 43965.314324
# Looks like the linear base learners performed better than the tree

### 2.3 Evaluating model quality
It's now time to begin evaluating model quality.

Here, you will compare the RMSE and MAE of a cross-validated XGBoost model on the Ames housing data. As in previous exercises, all necessary modules have been pre-loaded and the data is available in the DataFrame df.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, 
                    num_boost_round=5, metrics="rmse", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))

   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0   142980.433594    1193.791602    141767.531250      429.454591
1   104891.394532    1223.158855    102832.544922      322.469930
2    79478.937500    1601.344539     75872.615235      266.475960
3    62411.920899    2220.150028     57245.652343      273.625086
4    51348.279297    2963.377719     44401.298828      316.423666
4    51348.279297
Name: test-rmse-mean, dtype: float64

In [None]:
# Now adapt code to compute the "mae" instead of the "rmse"

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, 
                    num_boost_round=5, metrics="mae", as_pandas=True, 
                    seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))

   test-mae-mean  test-mae-std  train-mae-mean  train-mae-std
0  127634.000000   2404.009898   127343.482421     668.308109
1   90122.501953   2107.912810    89770.056641     456.965267
2   64278.558594   1887.567576    63580.791016     263.404950
3   46819.168945   1459.818607    45633.155274     151.883420
4   35670.646484   1140.607452    33587.090820      86.999396
4    35670.646484
Name: test-mae-mean, dtype: float64

### 2.4 Regularization and base learners in XGBoost
- Regularization is a control on model complexity
    - penalize models as they get more complex
- Want models that are both accurate and as simple as possible
- Regularization parameters in XGBoost:
    - gamma - minimum loss reduction required for a split to occur
        - higher values lead to fewer splits
    - alpha - l1 regularization on leaf weights, larger values mean more regularization (which pushes many leaf weights to 0)
    - lambda - l2 regularization on leaf weights
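
The effect of the three parameters can be sketched with the closed-form leaf weight and split gain commonly given for XGBoost's tree booster, where `g` and `h` are the sums of first and second derivatives of the loss over the samples in a leaf. This is a simplified illustration under that assumption, not the library's implementation, and the numeric inputs are made up.

```python
def soft_threshold(g, alpha):
    """Shrink the gradient sum toward 0 by alpha (the L1 effect)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(g, h, reg_lambda=1.0, alpha=0.0):
    """Leaf weight for gradient sum g and Hessian sum h."""
    return -soft_threshold(g, alpha) / (h + reg_lambda)

def leaf_score(g, h, reg_lambda, alpha):
    return soft_threshold(g, alpha) ** 2 / (h + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, alpha=0.0, gamma=0.0):
    """Loss reduction from a split; the split is kept only if this is positive."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda, alpha)
    children = (leaf_score(g_left, h_left, reg_lambda, alpha)
                + leaf_score(g_right, h_right, reg_lambda, alpha))
    return 0.5 * (children - parent) - gamma

baseline = leaf_weight(-4.0, 3.0)                    # default lambda = 1
stronger_l2 = leaf_weight(-4.0, 3.0, reg_lambda=7.0) # weight shrinks
strong_l1 = leaf_weight(-4.0, 3.0, alpha=5.0)        # weight is zeroed out

gain_no_gamma = split_gain(-4.0, 2.0, 4.0, 2.0)
gain_high_gamma = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=6.0)

print(baseline, stronger_l2, abs(strong_l1))
print(gain_no_gamma, gain_high_gamma)
```

Larger lambda shrinks the weight smoothly, alpha above the gradient magnitude sets it to exactly zero, and a large enough gamma turns a worthwhile split into one that is rejected.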
    

#### 2.4.a L1 Regularization in XGBoost example

In [None]:
import xgboost as xgb
import pandas as pd
boston_data = pd.read_csv("boston_data.csv")
X,y = boston_data.iloc[:,:-1],boston_data.iloc[:,-1]
boston_dmatrix = xgb.DMatrix(data=X, label=y)
params={"objective":"reg:linear","max_depth":4}
l1_params = [1,10,1000]
# empty list to store rmse values
rmses_l1 = []

for reg in l1_params:
    params["alpha"] = reg
    cv_results = xgb.cv(dtrain=boston_dmatrix, params=params,nfold=4,
                       num_boost_round=10,metrics="rmse",
                       as_pandas=True, seed=123)
    rmses_l1.append(cv_results["test-rmse-mean"] \
                    .tail(1).values[0])
    
print("Best rmse as a function of l1:")
print(pd.DataFrame(list(zip(l1_params,rmses_l1)), 
                   columns=["l1", "rmse"]))

### 2.4.b Base Learners in XGBoost - 2 types
- Linear Base Learner:
    - Sum of linear terms
        - Like linear/logistic regression model
    - Boosted model is weighted sum of linear models (thus is itself linear)
    - Rarely used - since you don't get any non-linear combo of features
        - can get identical performance from a regularized linear model
- Tree Base Learner:
    - Decision Tree
    - Boosted model is weighted sum of decision trees (nonlinear)
    - Almost exclusively used in XGBoost
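
The claim that a boosted sum of linear learners is itself linear can be checked numerically. In the sketch below, the two base learners, the learning rate, and the input are hypothetical values chosen for illustration; the point is that the ensemble collapses into one linear model with combined weights and bias.

```python
def linear_model(weights, bias):
    """Return a function computing weights . x + bias."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias

# Two hypothetical linear base learners as (weights, bias) pairs
learners = [([1.0, 2.0], 0.5), ([0.5, -1.0], 1.0)]
learning_rate = 0.3

def boosted_predict(x):
    """Weighted sum of the base learners' predictions."""
    return sum(learning_rate * linear_model(w, b)(x) for w, b in learners)

# Collapse the ensemble into a single linear model
combined_weights = [learning_rate * sum(w[i] for w, _ in learners)
                    for i in range(2)]
combined_bias = learning_rate * sum(b for _, b in learners)
single_model = linear_model(combined_weights, combined_bias)

x = [2.0, 3.0]
print(boosted_predict(x), single_model(x))
```

Both calls give the same prediction, which is why boosting adds no extra expressive power when the base learners are linear.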
    
Creating DataFrames from multiple equal-length lists
- pd.DataFrame(list(zip(list1,list2)),columns=["list1","list2"])
- zip creates an iterator of parallel values:
    - list(zip([1,2,3],['a','b','c'])) = [(1,'a'),(2,'b'),(3,'c')]
    - the iterator is fully materialized before it is passed to the DataFrame constructor
- list() materializes the iterator, and passing the resulting list of tuples to pd.DataFrame builds one row per tuple


In [None]:
# Creating DataFrames from multiple equal-length lists
pd.DataFrame(list(zip(list1,list2)),columns=["list1","list2"])

# example of zip
list(zip([1,2,3],['a','b','c']))  # [(1, 'a'), (2, 'b'), (3, 'c')]

### 2.4.c Using regularization in XGBoost
Having seen an example of l1 regularization in the video, you'll now vary the l2 regularization penalty - also known as "lambda" - and see its effect on overall model performance on the Ames housing dataset.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                             params=params, 
                             nfold=2, 
                             num_boost_round=5, 
                             metrics="rmse", 
                             as_pandas=True, 
                             seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

<script.py> output:
    Best rmse as a function of l2:
        l2          rmse
    0    1  52275.357421
    1   10  57746.064453
    2  100  76624.625000
    
# looks like as lambda increases, rmse increases

### 2.5 Visualizing individual XGBoost trees - plot_tree()
Now that you've used XGBoost to both build and evaluate regression as well as classification models, you should get a handle on how to visually explore your models. Here, you will visualize individual trees from the fully boosted model that XGBoost creates using the entire housing dataset.

XGBoost has a plot_tree() function that makes this type of visualization easy. Once you train a model using the XGBoost learning API, you can pass it to the plot_tree() function along with the number of trees you want to plot using the num_trees argument.

Plot the first tree using xgb.plot_tree(). It takes in two arguments - the model (in this case, xg_reg, which is trained), and num_trees, which is 0-indexed. So to plot the first tree, specify num_trees=0.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

# Plot the last tree sideways
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
plt.show()

Have a look at each of the plots. They provide insight into how the model arrived at its final decisions and what splits it made to arrive at those decisions. This allows us to identify which features are the most important in determining house price. In the next exercise, you'll learn another way of visualizing feature importances.

<img src="images/xgb1.png" width="500" />
<img src="images/xgb2.png" width="500" />
<img src="images/xgb3.png" width="500" />

### 2.6 Visualizing feature importances: What features are most important in my dataset - plot_importance()
Another way to visualize your XGBoost models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a plot_importance() function that allows you to do exactly this, and you'll get a chance to use it in this exercise!
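
The counting approach described above can be sketched without XGBoost at all. In the example below, each tree is represented simply as the list of features it splits on; these trees are made up, and apart from GrLivArea the feature names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical trees, each listed as the features it splits on
trees = [
    ["GrLivArea", "LotArea", "GrLivArea"],
    ["GrLivArea", "YearBuilt"],
    ["LotArea", "GrLivArea"],
]

# Count how many times each feature is split on across all trees
split_counts = Counter(feature for tree in trees for feature in tree)
ranking = split_counts.most_common()

# Text version of the bar graph, most frequently used feature first
for feature, count in ranking:
    print(f"{feature:<10} {'#' * count} ({count})")
```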

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

It looks like GrLivArea is the most important feature.

<img src="images/xgb4.png" width="500" />

## 3. Fine-tuning your XGBoost model
Why tune your model?
- Untuned Model vs Tuned Model Example

In [None]:
# Untuned Model Example
import pandas as pd
import xgboost as xgb
housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
X, y = housing_data.iloc[:,:-1], housing_data.iloc[:,-1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# most basic parameters for regression
untuned_params={"objective":"reg:linear"}

untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                                 params=untuned_params,nfold=4,
                                 metrics="rmse",
                                 as_pandas=True, seed=123)
    
# report the test RMSE from the final boosting round
print("Untuned rmse: %f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

# Untuned rmse: 34624.229980

In [None]:
# Tuned Model Example
import pandas as pd
import xgboost as xgb

housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
# Features are every column except the last; the target is the last column
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# tuned parameters - few xgb parameters that are important
tuned_params={"objective":"reg:linear", 'colsample_bytree':0.3,
             'learning_rate':0.1, 'max_depth':5}

# cross validation with 200 constructed trees
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                               params=tuned_params,nfold=4,
                               num_boost_round=200,
                               metrics="rmse",
                               as_pandas=True, seed=123)

# Report the mean test RMSE from the final boosting round
print("Tuned rmse: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

# Tuned rmse: 29812.683594
# about 14% reduction in rmse

### 3.1 When is tuning your model a bad idea?
Now that you've seen the effect that tuning has on the overall performance of your XGBoost model, let's turn the question on its head and see if you can figure out when tuning your model might not be the best idea. Given that model tuning can be time-intensive and complicated, which of the following scenarios would NOT call for careful tuning of your model?

Possible Answers
- You have lots of examples from some dataset and very many features at your disposal.
- You are very short on time before you must push an initial model to production and have little data to train your model on.
- You have access to a multi-core (64 cores) server with lots of memory (200GB RAM) and no time constraints.
- You must squeeze out every last bit of performance out of your xgboost model.

Answer: B - short on time, need initial model, little data

Need time to tune.

### 3.2 Tuning the number of boosting rounds - # of trees
- this example attempts to cherry pick the best possible number of boosting rounds

Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of your XGBoost model. You'll use xgb.cv() inside a for loop and build one model per num_boost_round parameter.

Here, you'll continue working with the Ames housing dataset. The features are available in the array X, and the target vector is contained in y.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:linear", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, 
                        num_boost_round=curr_num_rounds, metrics="rmse", 
                        as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail() \
                                .values[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,
                   columns=["num_boosting_rounds", "rmse"]))

# output
   num_boosting_rounds          rmse
0                    5  50903.299479
1                   10  34774.194010
2                   15  32895.098958

# increasing the number of boosting rounds decreases the RMSE over the values tried

### 3.3 Automated boosting round selection using early_stopping
Now, instead of attempting to cherry pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within xgb.cv(). This is done using a technique called early stopping.

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the hold-out metric keeps improving all the way up to num_boost_round, then early stopping does not occur.
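The stopping rule can be sketched in a few lines of plain Python. This is only an illustration of the logic, not how xgb.cv() is implemented: the function name, the `patience` argument (playing the role of early_stopping_rounds), and the hold-out numbers are all made up for the example.

```python
def early_stopping_round(metric_per_round, patience):
    """Return (best_round, stopped_early) for a metric where lower is better.

    Training halts once `patience` consecutive rounds have failed to
    improve on the best value seen so far.
    """
    best_value = float("inf")
    best_round = -1
    for round_idx, value in enumerate(metric_per_round):
        if value < best_value:
            best_value, best_round = value, round_idx
        elif round_idx - best_round >= patience:
            return best_round, True
    return best_round, False


# Hypothetical hold-out RMSE per boosting round (made-up numbers)
holdout_rmse = [50.0, 40.0, 35.0, 33.0, 33.5, 33.2, 33.4, 34.0, 34.5]

stopped = early_stopping_round(holdout_rmse, patience=3)
not_stopped = early_stopping_round(holdout_rmse, patience=10)
print(stopped)      # best round found, training cut short
print(not_stopped)  # patience never exhausted, every round is run
```

With a patience larger than the number of remaining rounds, the loop simply runs to the end, which mirrors the case where early stopping does not occur.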

Here, the DMatrix and parameter dictionary have been created for you. Your task is to use cross-validation with early stopping. Go for it!

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix,params=params,
                    early_stopping_rounds=10, num_boost_round=50, 
                    metrics="rmse",as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# output
    test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0    142640.656250     705.559400    141871.630208      403.632626
1    104907.664063     111.113862    103057.036458       73.769561
2     79262.059895     563.766991     75975.966146      253.726099
3     61620.136719    1087.694282     57420.529948      521.658354
4     50437.562500    1846.448017     44552.955729      544.170190
5     43035.658854    2034.471024     35763.949219      681.798925
6     38600.880208    2169.796232     29861.464844      769.571318
7     36071.817708    2109.795430     25994.675781      756.521419
8     34383.184896    1934.546688     23306.836588      759.238254
9     33509.139974    1887.375633     21459.770833      745.624404
10    32916.805990    1850.893363     20148.721354      749.612769
11    32197.832682    1734.456935     19215.382813      641.387376
12    31770.852865    1802.155484     18627.389323      716.256596
13    31482.782552    1779.123767     17960.695312      557.043568
14    31389.990234    1892.319927     17559.736979      631.412969
15    31302.883464    1955.166046     17205.712891      590.171393
16    31234.058594    1880.705796     16876.571940      703.631755
17    31318.347656    1828.860164     16597.662110      703.677609
18    31323.634766    1775.909567     16330.460937      607.274494
19    31204.135417    1739.076156     16005.972982      520.470911
20    31089.863932    1756.022575     15814.300781      518.604760
21    31047.998047    1624.672407     15493.405924      505.616658
22    31056.916667    1668.043013     15270.734375      502.018453
23    31024.983724    1548.985354     15086.382162      503.913199
24    30983.685547    1663.130510     14917.608399      486.206187
25    30989.477214    1686.668050     14709.589518      449.668010
26    30952.113932    1613.172643     14457.286133      376.787666
27    31066.902344    1648.534310     14185.567057      383.102691
28    31095.642578    1709.225327     13934.066732      473.465449
29    31103.887370    1778.880069     13749.644857      473.670886
30    30976.084635    1744.514164     13549.836589      454.898834
31    30938.469401    1746.052597     13413.484700      399.603323
32    30931.000000    1772.469510     13275.915364      415.408340
33    30929.056641    1765.541578     13085.878255      493.792778
34    30890.629557    1786.510976     12947.181315      517.790039
35    30884.493490    1769.729143     12846.027344      547.732372
36    30833.542318    1691.001567     12702.378581      505.523315
37    30856.688151    1771.445059     12532.243815      508.298162
38    30818.016927    1782.784630     12384.055013      536.225042
39    30839.392578    1847.327022     12198.443359      545.165562
40    30776.964844    1912.781000     12054.583659      508.841772
41    30794.702474    1919.674832     11897.036458      477.177568
42    30780.956380    1906.820987     11756.221354      502.992395
43    30783.754557    1951.259784     11618.846680      519.837469
44    30776.731120    1953.447693     11484.080404      578.428621
45    30758.543620    1947.454953     11356.552734      565.368794
46    30729.971354    1985.698867     11193.557943      552.299272
47    30732.662760    1966.997355     11071.315755      604.090310
48    30712.241536    1957.751573     10950.778320      574.862779
49    30720.854167    1950.511057     10824.865560      576.665674

### 3.4 Overview of XGBoost's hyperparameters
Tunable parameters in XGBoost
- Note: the tunable parameters differ significantly for each base learner

Common Tree tunable parameters (most frequently used)
- learning_rate (alias eta): shrinks the contribution of each new tree
- gamma: min loss reduction to create new tree split
- lambda: L2 reg on leaf weights
- alpha: L1 reg on leaf weights
- max_depth: max depth per tree, must be positive integer
- subsample: % samples used per tree, value b/n 0 and 1
    - low value >> fraction of training data used per boosting round would be low >> possible underfitting
    - high value >> may lead to overfitting
- colsample_bytree: % features used per tree, value b/n 0 and 1
    - large value >> almost all features used
    - small value >> small subset of features
    - generally, smaller value can provide additional regularization
    - large value may overfit
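To make subsample and colsample_bytree concrete, the sketch below draws the rows and features that each tree would be allowed to see. It assumes simple sampling without replacement; the helper name and the feature names f0..f9 are hypothetical, and XGBoost's internal sampling may differ in detail.

```python
import random


def sample_rows_and_columns(n_rows, feature_names, subsample,
                            colsample_bytree, rng):
    """Pick the row indices and features available to one tree."""
    n_rows_used = max(1, round(n_rows * subsample))
    n_cols_used = max(1, round(len(feature_names) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_rows_used))
    cols = sorted(rng.sample(feature_names, n_cols_used))
    return rows, cols


rng = random.Random(123)
feature_names = [f"f{i}" for i in range(10)]  # hypothetical feature names

# Three boosting rounds, each with its own random subset of rows and features
samples = [
    sample_rows_and_columns(10, feature_names, subsample=0.5,
                            colsample_bytree=0.3, rng=rng)
    for _ in range(3)
]
for tree_idx, (rows, cols) in enumerate(samples):
    print(f"tree {tree_idx}: rows={rows} features={cols}")
```

Because every tree is fit on a different slice of the data, no single tree can memorize the full training set, which is where the regularizing effect of small values comes from.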

Linear tunable parameters
- lambda: L2 reg on weights
- alpha: L1 reg on weights
- lambda_bias: L2 reg term on bias

Note: You can also tune the number of estimators used for both base model types (Tree and Linear)

#### 3.4.a Tuning eta - aka 'learning rate'
It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning the "eta", also known as the learning rate.

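The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales the contribution of each new tree,
- with lower values of "eta" shrinking each boosting step more strongly, which gives stronger regularization but requires more boosting rounds to reach the same fit.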
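A toy example helps show what "eta" does with a fixed, small number of rounds. The sketch below is an assumption-laden stand-in, not XGBoost's tree learner: the weak learner is just a constant (the mean residual), the function name is hypothetical, and the target values are made up. Each round adds eta times the weak learner's output to the current predictions.

```python
def boost_constant_learner(targets, eta, num_rounds):
    """Toy boosting loop: each round adds eta * (mean residual)."""
    predictions = [0.0] * len(targets)
    for _ in range(num_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        step = sum(residuals) / len(residuals)  # weak learner: a constant
        predictions = [p + eta * step for p in predictions]
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return (sum(squared_errors) / len(targets)) ** 0.5  # RMSE


targets = [100.0, 200.0, 300.0]  # made-up target values

rmse_by_eta = {
    eta: boost_constant_learner(targets, eta, num_rounds=10)
    for eta in (0.001, 0.01, 0.1)
}
for eta, rmse in rmse_by_eta.items():
    print(f"eta={eta:<5} rmse after 10 rounds: {rmse:.2f}")
```

With only 10 rounds, the smallest eta has barely moved the predictions away from their starting point, so its error is the largest. A small eta therefore usually needs to be paired with a larger num_boost_round.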

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:linear", "max_depth":3}

# Create list of eta values and empty list to store final round rmse 
# per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params,
                        nfold=3, early_stopping_rounds=5, num_boost_round=10,
                        metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), 
                   columns=["eta","best_rmse"]))

# output
     eta      best_rmse
0  0.001  195736.406250
1  0.010  179932.182292
2  0.100   79759.411458

#### 3.4.b Tuning max_depth
In this exercise, your job is to tune max_depth, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees. 
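One way to see why max_depth matters so much: XGBoost's trees are binary, and a binary tree of depth d has at most 2^d leaves, so the capacity of each tree grows exponentially with depth. The helper name below is hypothetical and the snippet is plain arithmetic rather than anything from the XGBoost API.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth


for depth in [2, 5, 10, 20]:
    print(f"max_depth={depth:>2} -> at most {max_leaves(depth):,} leaves")
```

At large depths a single tree can have far more leaves than a small dataset has rows, which is one reason very deep trees tend to overfit.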

In [None]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:linear"}

# Create list of max_depth values
max_depths = [2,5,10,20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix,params=params,nfold=2,
                        early_stopping_rounds=5,num_boost_round=10,
                        metrics="rmse",seed=123)
    
    
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),
                   columns=["max_depth","best_rmse"]))

# output
   max_depth     best_rmse
0          2  37957.468750
1          5  35596.599610
2         10  36065.546875
3         20  36739.576172


#### 3.4.c Tuning colsample_bytree - fraction of features, value 0-1
Now, it's time to tune "colsample_bytree". You may have seen a similar idea if you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, where it is called max_features. The two are related but not identical: sklearn's max_features limits the features considered at every split, whereas xgboost's colsample_bytree specifies the fraction of features sampled once for each tree (per-split sampling in xgboost is controlled by colsample_bynode). In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.

In [None]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:linear","max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1,0.5,0.8,1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), 
                   columns=["colsample_bytree","best_rmse"]))

# output
   colsample_bytree     best_rmse
0               0.1  45017.404296
1               0.5  36050.654297
2               0.8  35372.572266
3               1.0  35836.046875


#### Other parameters to tune...
There are several other individual parameters that you can tune, such as "subsample", which dictates the fraction of the training data that is used during any given boosting round. 

### 3.5 Review of Grid Search and Random Search
- find optimal hyperparameters simultaneously to get lowest loss possible

2 strategies: Grid Search and Random Search

#### 3.5.a Grid Search - Review
- Search exhaustively over a given set of hyperparameters, once per set of hyperparameters
- Number of models = number of distinct values per hyperparameter multiplied across each hyperparameter
    - ie. 2 hyperparameters and 4 values per each hyperparameter, would try 16 combos
- Pick final model hyperparameter values that give best cross-validated evaluation metric value
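The enumeration step can be sketched with the standard library. This is a simplified illustration, not GridSearchCV itself: `expand_grid` is a hypothetical helper, and `toy_cv_score` is a made-up stand-in for a cross-validated metric so the example runs without any data.

```python
from itertools import product

# 2 hyperparameters with 4 values each -> 4 * 4 = 16 combinations
param_grid = {
    "max_depth": [2, 5, 10, 20],
    "colsample_bytree": [0.1, 0.5, 0.8, 1.0],
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def toy_cv_score(params):
    # Hypothetical stand-in for a cross-validated RMSE (lower is better)
    return abs(params["max_depth"] - 5) + abs(params["colsample_bytree"] - 0.8)


candidates = list(expand_grid(param_grid))
best = min(candidates, key=toy_cv_score)
print(len(candidates), "candidates; best:", best)
```

In a real search, `toy_cv_score` would be replaced by fitting and cross-validating a model for each candidate, which is where the cost comes from.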

#### 3.5.b Grid Search Example

In [None]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV

housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
X, y = (housing_data[housing_data.columns.tolist()[:-1]],
        housing_data[housing_data.columns.tolist()[-1]])
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# 4 learning rate/eta values, 1 number of trees, 3 subsamples = 12 models
gbm_param_grid = {
    'learning_rate': [0.01,0.1,0.5,0.9],
    'n_estimators': [200],
    'subsample': [0.3, 0.5, 0.9]}

gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm,
                        param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
grid_mse.fit(X, y)

# best_score_ is a negated MSE, so take abs() before the square root
print("Best parameters found: ",grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

# output
Best parameters found: {'learning_rate': 0.1, 
'n_estimators': 200, 'subsample': 0.5}
Lowest RMSE found:  28530.1829341

#### 3.5.c Random Search - Review
- Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over
- Set the number of iterations you would like for the random search to continue
- During each iteration, randomly draw a value in the range of specified values for each hyperparameter searched over and train/evaluate a model with those hyperparameters
- After you've reached the maximum number of iterations, select the hyperparameter configuration with the best evaluated score
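The loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not RandomizedSearchCV: the function names are hypothetical, the ranges are arbitrary, and `toy_score` is a made-up stand-in for a cross-validated metric.

```python
import random


def random_search(sample_params, score, n_iter, seed):
    """Draw n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


def sample_params(rng):
    # Continuous ranges, so the set of possible configurations is infinite
    return {
        "learning_rate": rng.uniform(0.05, 1.0),
        "subsample": rng.uniform(0.05, 1.0),
    }


def toy_score(params):
    # Hypothetical stand-in for a cross-validated RMSE (lower is better)
    return ((params["learning_rate"] - 0.2) ** 2
            + (params["subsample"] - 0.6) ** 2)


best_params, best_score = random_search(sample_params, toy_score,
                                        n_iter=25, seed=123)
print(best_params, best_score)
```

The cost is fixed by n_iter regardless of how large the search space is. The trade-off is that nothing guarantees the best region of the space is ever sampled.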


#### 3.5.d Random Search Example

In [None]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
X, y = (housing_data[housing_data.columns.tolist()[:-1]],
        housing_data[housing_data.columns.tolist()[-1]])
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# 20 eta/learning rate, 20 values for subsample = 400 models to try
gbm_param_grid = {
    'learning_rate': np.arange(0.05,1.05,.05),
    'n_estimators': [200],
    'subsample': np.arange(0.05,1.05,.05)}

# try 25 random combos
gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm,
                                    param_distributions=gbm_param_grid, n_iter=25,
                                    scoring='neg_mean_squared_error', cv=4, verbose=1)
randomized_mse.fit(X, y)

# best_score_ is a negated MSE, so take abs() before the square root
print("Best parameters found: ",randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

# output
Best parameters found: {'subsample': 0.60000000000000009,
'n_estimators': 200, 'learning_rate': 0.20000000000000001}
Lowest RMSE found: 28300.2374291


#### 3.5.e Practice - Grid Search with XGBoost
Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's grid search and randomized search capabilities with internal cross-validation, via the GridSearchCV and RandomizedSearchCV classes. You will use these to find the best model from a collection of possible parameter values across multiple parameters simultaneously. Let's get to work, starting with GridSearchCV! 

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

# output
Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found:  {'n_estimators': 50, 'max_depth': 5,
                         'colsample_bytree': 0.7}
Lowest RMSE found:  30342.16964561695

#### 3.5.f Practice - Random Search with XGBoost
Often, GridSearchCV can be really time consuming, so in practice, you may want to use RandomizedSearchCV instead, as you will do in this exercise. The good news is you only have to make a few modifications to your GridSearchCV code to do RandomizedSearchCV. The key difference is you have to specify a param_distributions parameter instead of a param_grid parameter.

In [None]:
# Create the parameter grid: gbm_param_grid 
# range(2,12) - give values b/n 2 and 11
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, 
                                    estimator=gbm, 
                                    scoring='neg_mean_squared_error', 
                                    n_iter=5, cv=4, verbose=1)

# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

# output
Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found:  {'max_depth': 5, 'n_estimators': 25}
Lowest RMSE found:  36636.35808132903

### 3.6 Limits of Grid Search and Random Search
Grid Search
- if hyperparameter grid is small, you'll get an answer in a reasonable amount of time
- Number of models you must build with every additional new parameter grows very quickly
    
Random Search
- problem: as you add new hyperparameters, hyperparameter space to explore can be massive
- Randomly jumping throughout the space looking for a "best" result becomes a waiting game

Both approaches have significant limitations
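A little arithmetic makes the scaling concrete. The two helpers below are hypothetical names that just count model fits, assuming the same number of candidate values for every hyperparameter and one fit per cross-validation fold.

```python
def grid_search_fits(values_per_param, n_params, cv_folds):
    """Model fits for a full grid: every combination, once per CV fold."""
    return (values_per_param ** n_params) * cv_folds


def random_search_fits(n_iter, cv_folds):
    """Model fits for random search: fixed by the iteration budget."""
    return n_iter * cv_folds


print("n_params  grid_fits  random_fits")
for n_params in range(1, 7):
    grid = grid_search_fits(4, n_params, cv_folds=4)
    rand = random_search_fits(25, cv_folds=4)
    print(f"{n_params:>8}  {grid:>9}  {rand:>11}")
```

The grid column grows by a factor of 4 with each added hyperparameter. The random-search column stays flat, but the same 25 draws have to cover an ever larger space, which is the "waiting game" mentioned above.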


### 3.7 When Should you Use Grid Search and Random Search?
Now that you've seen some of the drawbacks of grid search and random search, which of the following most accurately describes why neither random search nor grid search is an ideal hyperparameter tuning strategy in every scenario?

Possible Answers
- Grid Search and Random Search both take a very long time to perform, regardless of the number of parameters you want to tune.
- Grid Search and Random Search both scale exponentially in the number of hyperparameters you want to tune.
- The search space size can be massive for Grid Search in certain cases, whereas for Random Search the number of hyperparameters has a significant effect on how long it takes to run.
- Grid Search and Random Search require that you have some idea of where the ideal values for hyperparameters reside.

Answer: C

## 4. XGBoost in Pipelines
- incorporate models into 2 end-to-end ML pipelines
- tune most important XGBoost hyperparameters in a pipeline
- more advanced preprocessing techniques

Pipelines Review using sklearn
- Takes a list of named 2-tuples (name, pipeline_step) as input
- Tuples can contain any arbitrary scikit-learn compatible estimator or transformer object
- Pipeline implements fit/predict methods
- Can be used as input estimator into grid/randomized search and cross_val_score methods
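The idea of chaining named steps can be sketched without scikit-learn. Everything below is a hypothetical, stripped-down imitation written for illustration: `MiniPipeline`, `MeanCenterer`, and `OneFeatureLinearModel` are made-up classes working on a single feature stored in a plain list, not sklearn's implementation.

```python
class MeanCenterer:
    """Toy transformer: subtracts the training mean from each value."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class OneFeatureLinearModel:
    """Toy estimator: least-squares line, assuming X is already centered."""

    def fit(self, X, y):
        self.intercept_ = sum(y) / len(y)
        numerator = sum(x * (t - self.intercept_) for x, t in zip(X, y))
        self.slope_ = numerator / sum(x * x for x in X)
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class MiniPipeline:
    """Runs every transformer in order, then the final estimator."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) 2-tuples

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit; never refit on new data
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipeline = MiniPipeline([("centerer", MeanCenterer()),
                         ("model", OneFeatureLinearModel())])
pipeline.fit([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
predictions = pipeline.predict([4.0, 5.0])
print(predictions)
```

Because the pipeline exposes the same fit/predict interface as a single estimator, it can be handed to cross-validation or a hyperparameter search as one object, and the preprocessing is then refit inside each fold.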


In [None]:
# sklearn pipeline example
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

names = ["crime","zone","industry","charles",
         "no","rooms","age", "distance",
         "radial","tax","pupil","aam","lower","med_price"]

data = pd.read_csv("boston_housing.csv",names=names)

X, y = data.iloc[:,:-1], data.iloc[:,-1]

# random forest pipeline: Pipeline takes a list of (name, step) tuples
rf_pipeline = Pipeline([("st_scaler", StandardScaler()),
                        ("rf_model", RandomForestRegressor())])

# neg_mean_squared_error is the negated MSE, so that higher scores are better
scores = cross_val_score(rf_pipeline, X, y,
                         scoring="neg_mean_squared_error", cv=10)

# Per-fold RMSE, averaged over the 10 folds
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("Final RMSE:", final_avg_rmse)

# output
Final RMSE: 4.54530686529

### Preprocessing I: LabelEncoder and OneHotEncoder
- LabelEncoder: Converts a categorical column of strings into integers
- OneHotEncoder: Takes the column of integers and encodes them as dummy variables
- This two-step encoding cannot be done within a pipeline - use DictVectorizer instead
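The two encoding steps can be illustrated in plain Python. This is a sketch of the idea only, not scikit-learn's LabelEncoder or OneHotEncoder: the function names and the `zoning` values are made up for the example.

```python
def label_encode(values):
    """Map each distinct string to an integer (sorted for a stable order)."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(codes, n_classes):
    """Turn each integer code into a row of 0/1 dummy variables."""
    return [[1 if code == j else 0 for j in range(n_classes)]
            for code in codes]


zoning = ["RL", "RM", "RL", "FV"]  # made-up categorical column

codes, classes = label_encode(zoning)
dummies = one_hot_encode(codes, len(classes))

print(classes)  # the category behind each integer code
print(codes)
for row in dummies:
    print(row)
```

The integer codes alone would suggest an ordering between categories that does not exist, which is why the second step spreads them into separate indicator columns.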

### Preprocessing II: DictVectorizer
- Traditionally used in text processing
- Converts lists of feature mappings into vectors
- Need to convert DataFrame into a list of dictionary entries
- Explore the scikit-learn documentation (http://scikit-learn.org/stable/documentation.html)
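The conversion from feature mappings to vectors can be sketched as follows. This is a simplified imitation of the idea rather than scikit-learn's DictVectorizer: `vectorize_records` is a hypothetical helper and the two records are made up. With pandas, a list of dictionaries of this shape can be obtained from a DataFrame with `df.to_dict("records")`.

```python
def vectorize_records(records):
    """Convert a list of {feature: value} dicts into rows of numbers.

    Numeric values keep a column of their own; string values become one
    indicator column per "feature=value" pair.
    """
    columns = set()
    for record in records:
        for name, value in record.items():
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    columns = sorted(columns)
    position = {name: i for i, name in enumerate(columns)}

    rows = []
    for record in records:
        row = [0.0] * len(columns)
        for name, value in record.items():
            if isinstance(value, str):
                row[position[f"{name}={value}"]] = 1.0
            else:
                row[position[name]] = float(value)
        rows.append(row)
    return columns, rows


# Made-up records, one dict per row of a hypothetical DataFrame
records = [
    {"LotArea": 8450, "MSZoning": "RL"},
    {"LotArea": 9600, "MSZoning": "RM"},
]

columns, rows = vectorize_records(records)
print(columns)
for row in rows:
    print(row)
```

A single pass handles numeric and categorical features together, which is what makes this approach convenient as one pipeline step.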

### 4.1 Exploratory data analysis
Before diving into the nitty gritty of pipelines and preprocessing, let's do some exploratory analysis of the original, unprocessed Ames housing dataset (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). When you worked with this data in previous chapters, we preprocessed it for you so you could focus on the core XGBoost concepts. In this chapter, you'll do the preprocessing yourself!

A smaller version of this original, unprocessed dataset has been pre-loaded into a pandas DataFrame called df. Your task is to explore df in the Shell and pick the option that is incorrect. The larger purpose of this exercise is to understand the kinds of transformations you will need to perform in order to be able to use XGBoost.

Possible Answers
- The DataFrame has 21 columns and 1460 rows.
- The mean of the LotArea column is 10516.828082.
- The DataFrame has missing values.
- The LotFrontage column has no missing values and its entries are of type float64.
- The standard deviation of SalePrice is 79442.502883

Answer: 

<img src="images/xgb4.png" width="300" />