In [1]:
# Import libraries
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

In [2]:
# Define global variables
SEED = 123

In [3]:
# Define global configuration
np.random.seed(SEED) 
pd.set_option("display.max_columns", 0)

In [4]:
# Read data
names = ["crime","zone","industry","charles","no","rooms", "age", "distance","radial","tax","pupil","aam","lower","med_price"]
boston_data = pd.read_csv("boston.csv", names=names, skiprows=1)
#print(boston_data.head())
#print(boston_data.info())

In [5]:
housing_unproc = pd.read_csv("ames_unprocessed_data.csv")
#print(housing_unproc.head())
#print(housing_unproc.info())

In [6]:
columns_name = ['age', 'bp'   , 'sg' , 'al' , 'su'   , 'rbc', 'pc', 'pcc', 'ba' , 'bgr', 
                'bu' , 'sc'   , 'sod', 'pot', 'hemo' , 'pcv', 'wc' , 'rc', 'htn', 'dm' ,
                'cad', 'appet', 'pe' , 'ane', 'class'] 
kideney = pd.read_csv("chronic_kidney_disease.csv", names=columns_name, na_values='?')
#print(kideney.head())
#print(kideney.info())

# 4. Using XGBoost in pipelines

Take your XGBoost skills to the next level by incorporating your models into two end-to-end machine learning pipelines. You'll learn how to tune the most important XGBoost hyperparameters efficiently within a pipeline, and get an introduction to some more advanced preprocessing techniques.

## 4.1 Review of pipelines using sklearn

1. Review of pipelines using sklearn
>Let's begin the final chapter in this course by reviewing how pipelines are used in scikit-learn. Refreshing our memory about how pipelines work will allow us to use XGBoost effectively in pipelines going forward. Before working through an example script using pipelines, let's briefly go over how they work.

2. Pipeline review
>Pipelines in sklearn are objects that take a list of (name, estimator) tuples as input. Each tuple must contain a string name as its first element and any scikit-learn compatible transformer or estimator object as its second. Each tuple in the pipeline is called a step, and the steps are executed in order once data is passed through the pipeline, using the fit/predict paradigm that is standard in scikit-learn. Finally, pipelines are especially useful because they can themselves be passed as estimators to other scikit-learn objects, the most useful of which are the cross_val_score function, which allows for efficient cross-validation and out-of-sample metric calculation, and the grid search and random search approaches for tuning hyperparameters.

3. Scikit-learn pipeline example
>Now that we've talked about how pipelines work, let's see them in action. In this example, we will use the Boston Housing dataset. As you've seen many times before, we first import all of the functionality we will need for the example. We will use a RandomForestRegressor model to predict housing prices, and will import Pipeline from sklearn's pipeline submodule. In lines 2-4, we load in our data and create our X feature matrix and y target vector. Lines 5-6 are the ones that do the real work here. In line 5, we create our pipeline, which contains a StandardScaler transformer followed by our RandomForestRegressor estimator. Line 6 takes the just-created pipeline estimator as an input along with our X matrix and y vector, performs 10-fold cross-validation, and outputs the neg_mean_squared_error as an evaluation metric once per fold. As a brief aside, neg_mean_squared_error is scikit-learn's way of reporting the mean squared error so that it fits the scoring API, where higher scores are always better. Negative mean squared errors don't actually exist, as squares of real numbers can never be negative.

4. Scikit-learn pipeline example
>Thus, in lines 7 and 8 we simply take the absolute value of the scores, take each of their square roots, and compute their mean to get a root mean squared error across all 10 cross-validation folds. We can see that on average our prediction was off by about 4.5 units. In the following exercises, because we will be working with the Ames housing dataset, which is more complex than the Boston housing dataset,

5. Preprocessing I: LabelEncoder and OneHotEncoder
>some additional preprocessing steps will be required. Specifically, we will do the same preprocessing steps in two different ways, only one of which can be done within a pipeline. The first approach involves using the LabelEncoder and OneHotEncoder classes of scikit-learn’s preprocessing submodule one after the other. LabelEncoder simply converts a categorical column of strings into integers that map onto those strings. OneHotEncoder takes a column of integers that are treated as categorical values, and encodes them as dummy variables, which you may already be familiar with. The problem with this two-step method is that it cannot currently be done within a pipeline. However, not all hope is lost. The second approach,

6. Preprocessing II: DictVectorizer
>which involves using a DictVectorizer, can accomplish both steps in one line of code. The DictVectorizer is a class found in scikit-learn’s feature extraction submodule, and is traditionally used in text processing pipelines to convert lists of feature mappings into vectors. With pandas DataFrames, we don’t initially have such a list. However, if we explicitly convert a DataFrame into a list of dictionary entries, then we have exactly what we need. For more details on these classes, I encourage you to explore the scikit-learn documentation.

7. Let's build pipelines!
>You will use both approaches in the next few exercises. I hope you have fun building pipelines!
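
The transcript notes that a pipeline can be passed to grid search or random search. When that happens, each hyperparameter is addressed as `<step name>__<parameter>`. The sketch below illustrates this naming on synthetic data. The data, the step names `scaler` and `model`, and the use of `Ridge` as the estimator are assumptions made for brevity and are not part of the course material.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for a real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Each step is a (name, estimator) tuple
demo_pipeline = Pipeline([("scaler", StandardScaler()),
                          ("model", Ridge())])

# Parameters of a step are addressed as "<step name>__<parameter>"
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(demo_pipeline,
                      param_grid,
                      scoring="neg_mean_squared_error",
                      cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Fitted final step:", type(search.best_estimator_.named_steps["model"]).__name__)
```

Because the scaler lives inside the pipeline, it is refitted on the training part of every fold, so no information from the held-out part leaks into preprocessing.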

In [7]:
# Scikit-learn pipeline example
#X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X, y = boston_data.drop('med_price', axis=1), boston_data.med_price

rf_pipeline = Pipeline([("st_scaler", StandardScaler()), 
                        ("rf_model", RandomForestRegressor(random_state=SEED))])

scores = cross_val_score(rf_pipeline,
                         X, y, 
                         scoring="neg_mean_squared_error", 
                         cv=10)

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print("Final RMSE:", final_avg_rmse)

Final RMSE: 4.163273962502389
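
The expression `np.mean(np.sqrt(np.abs(scores)))` used above averages the per-fold RMSE values. A minimal sketch of that arithmetic follows, using three hypothetical fold scores (invented for illustration, not taken from the Boston data). It also shows that taking the square root of the averaged MSE is a different, slightly larger quantity.

```python
import numpy as np

# Hypothetical neg_mean_squared_error values for three folds
fold_scores = np.array([-16.0, -25.0, -9.0])

# Flip the sign to recover the ordinary (non-negative) MSE per fold
fold_mse = -fold_scores

# What the notebook computes: RMSE per fold, then the mean
mean_of_fold_rmse = np.mean(np.sqrt(fold_mse))

# Alternative: average the MSE first, then take one square root
rmse_of_mean_mse = np.sqrt(np.mean(fold_mse))

print("Mean of per-fold RMSE:", mean_of_fold_rmse)
print("RMSE of pooled MSE:   ", rmse_of_mean_mse)
```

The two numbers agree only when every fold has the same MSE, so it helps to state which one is being reported.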


## 4.2 Exploratory data analysis

**Instructions**

Before diving into the nitty gritty of pipelines and preprocessing, let's do some exploratory analysis of the original, unprocessed Ames housing dataset. When you worked with this data in previous chapters, we preprocessed it for you so you could focus on the core XGBoost concepts. In this chapter, you'll do the preprocessing yourself!

A smaller version of this original, unprocessed dataset has been pre-loaded into a pandas DataFrame called df. Your task is to explore df in the Shell and pick the option that is incorrect. The larger purpose of this exercise is to understand the kinds of transformations you will need to perform in order to be able to use XGBoost.

**Possible Answers**

1. The DataFrame has 21 columns and 1460 rows.
2. The mean of the LotArea column is 10516.828082.
3. The DataFrame has missing values.
4. <font color=red>The LotFrontage column has no missing values and its entries are of type float64</font>. **Correct!**
5. The standard deviation of SalePrice is 79442.502883.

**Results**

<font color=darkgreen>Well done! The LotFrontage column actually does have missing values: 259, to be precise. Additionally, notice how columns such as MSZoning, PavedDrive, and HouseStyle are categorical. These need to be encoded numerically before you can use XGBoost. This is what you'll do in the coming exercises.</font>
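
The two checks this exercise relies on, counting missing values and spotting non-numeric columns, can be sketched on a toy frame. The four rows below are made up for illustration; only the column names are borrowed from the Ames data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one incomplete numeric column
# and one categorical column
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", "RM", "FV"],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that are not numeric and therefore need encoding
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```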

In [8]:
#print(housing_unproc.head())
#print(housing_unproc.info())
print(housing_unproc.describe())

        MSSubClass  LotFrontage        LotArea  OverallQual  ...  BedroomAbvGr   Fireplaces   GarageArea      SalePrice
count  1460.000000  1201.000000    1460.000000  1460.000000  ...   1460.000000  1460.000000  1460.000000    1460.000000
mean     56.897260    70.049958   10516.828082     6.099315  ...      2.866438     0.613014   472.980137  180921.195890
std      42.300571    24.284752    9981.264932     1.382997  ...      0.815778     0.644666   213.804841   79442.502883
min      20.000000    21.000000    1300.000000     1.000000  ...      0.000000     0.000000     0.000000   34900.000000
25%      20.000000    59.000000    7553.500000     5.000000  ...      2.000000     0.000000   334.500000  129975.000000
50%      50.000000    69.000000    9478.500000     6.000000  ...      3.000000     1.000000   480.000000  163000.000000
75%      70.000000    80.000000   11601.500000     7.000000  ...      3.000000     1.000000   576.000000  214000.000000
max     190.000000   313.000000  215245.

## 4.3 Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values - as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea (https://campus.datacamp.com/courses/supervised-learning-with-scikit-learn/preprocessing-and-pipelines?ex=1).

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that converts the values in each categorical column into integers. You'll practice using this here.

**Instructions**

1. Import LabelEncoder from sklearn.preprocessing.
2. Fill in missing values in the LotFrontage column with 0 using .fillna().
3. Create a boolean mask for categorical columns. You can do this by checking for whether df.dtypes equals object.
4. Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
5. Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.

**Results**

<font color=darkgreen>Well done! Notice how the entries in each categorical column are now encoded numerically. A BldgType of 1Fam is encoded as 0, while a HouseStyle of 2Story is encoded as 5.</font>

In [9]:
# Fill missing values with 0
housing_unproc.LotFrontage = housing_unproc.LotFrontage.fillna(0)

# Copy the dataframe to work with
df_sk = housing_unproc.copy(deep = True)

In [10]:
# Create a boolean mask for categorical columns
categorical_mask = (df_sk.dtypes == object)

# Get list of categorical column names
categorical_columns = df_sk.columns[categorical_mask].tolist()

# Print count distinct values in each categorical columns
print(df_sk[categorical_columns].nunique(axis=0),'\n')

# Print the head of the categorical columns
#print(df_sk[categorical_columns].head())

MSZoning         5
Neighborhood    25
BldgType         5
HouseStyle       8
PavedDrive       3
dtype: int64 



In [11]:
# Print the head of unlabeled categorical columns
print(df_sk[categorical_columns].head())

# Create LabelEncoder object: le
le = LabelEncoder()

# Apply LabelEncoder to categorical columns
df_sk[categorical_columns] = df_sk[categorical_columns].apply(lambda x: le.fit_transform(x))

# Print the head of the LabelEncoded categorical columns
print(df_sk[categorical_columns].head())

# Exploring a little bit more
#print('\nData in Neighborhood:')
#print('Before encoding: ', housing_unproc['Neighborhood'].unique())
#print('After  encoding: ', df_sk['Neighborhood'].unique())

  MSZoning Neighborhood BldgType HouseStyle PavedDrive
0       RL      CollgCr     1Fam     2Story          Y
1       RL      Veenker     1Fam     1Story          Y
2       RL      CollgCr     1Fam     2Story          Y
3       RL      Crawfor     1Fam     2Story          Y
4       RL      NoRidge     1Fam     2Story          Y
   MSZoning  Neighborhood  BldgType  HouseStyle  PavedDrive
0         3             5         0           5           2
1         3            24         0           2           2
2         3             5         0           5           2
3         3             6         0           5           2
4         3            15         0           5           2
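
To see where codes such as 0 for 1Fam come from, it can help to inspect the encoder's `classes_` attribute, which holds the labels in sorted order. The following is a minimal sketch on a short, made-up list of building types rather than the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of BldgType values
building_types = ["1Fam", "TwnhsE", "Duplex", "1Fam", "2fmCon"]

encoder = LabelEncoder()
codes = encoder.fit_transform(building_types)

# classes_ is sorted, and each label's position is its integer code
lookup = {str(label): code for code, label in enumerate(encoder.classes_)}

print("Codes: ", list(codes))
print("Lookup:", lookup)
print("Decoded:", list(encoder.inverse_transform([0, 3])))
```

Note that the cell above reuses a single `le` for every column, so after `.apply()` it only remembers the classes of the last column it was fitted on. Keeping one encoder per column is one way to retain every mapping.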


## 4.4 Encoding categorical columns II: OneHotEncoder

Okay - so you have your categorical columns encoded numerically. Can you now move on to using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.

As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables. You can do this using scikit-learn's OneHotEncoder.

**Instructions**

1. Import OneHotEncoder from sklearn.preprocessing.
2. Instantiate a OneHotEncoder object called ohe. Specify the keyword arguments categorical_features=categorical_mask and sparse=False.
3. Using its .fit_transform() method, apply the OneHotEncoder to df and save the result as df_encoded. The output will be a NumPy array.
4. Print the first 5 rows of df_encoded, and then the shape of df as well as df_encoded to compare the difference.

**Results**

<font color=darkgreen>Superb! As you can see, after one hot encoding, which creates binary variables out of the categorical variables, there are now 62 columns.</font>

### Using <font color=red>sklearn.preprocessing.OneHotEncoder</font> to encode

In [12]:
# Print the list of categorical column names
#print(df_sk.head(1))
#print(categorical_mask)
#print(df_sk.columns[categorical_mask].tolist(),'\n')

############################################################
## Without drop='first' and after labeling
############################################################
# Create OneHotEncoder: ohe
ohe = ColumnTransformer([("OneHotEncoder", OneHotEncoder(), categorical_mask)], 
                        remainder = 'passthrough',
                        sparse_threshold = 0,
                        verbose = True)

# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: housing_encoded
housing_encoded = ohe.fit_transform(df_sk)

# Print the shape of the original DataFrame
print("Before OneHotEncoder:", df_sk.shape)

# Print the shape of the transformed array
print("After OneHotEncoder:", housing_encoded.shape)

# Print first 2 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print('\nFirst 2 rows of the resulting dataset:')
print(housing_encoded[:2, :], '\n')

# Transforming to df
df_data = pd.DataFrame(data=housing_encoded, columns=ohe.get_feature_names())
print(df_data.head())
print(ohe.get_feature_names())

[ColumnTransformer] . (1 of 2) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] ..... (2 of 2) Processing remainder, total=   0.0s
Before OneHotEncoder: (1460, 21)
After OneHotEncoder: (1460, 62)

First 2 rows of the resulting dataset:
[[0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 6.000e+01 6.500e+01 8.450e+03
  7.000e+00 5.000e+00 2.003e+03 0.000e+00 1.710e+03 1.000e+00 0.000e+00
  2.000e+00 1.000e+00 3.000e+00 0.000e+00 5.480e+02 2.085e+05]
 [0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 

In [13]:
############################################################
## With drop='first' and not labeling
############################################################
# Create OneHotEncoder: ohe
ohe = ColumnTransformer([("OneHotEncoder", OneHotEncoder(drop='first'), categorical_mask)], 
                        remainder = 'passthrough',
                        sparse_threshold = 0,
                        verbose = True)

# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: housing_encoded_drop
housing_encoded_drop = ohe.fit_transform(housing_unproc)

# Print the shape of the original DataFrame
print("Before OneHotEncoder:", housing_unproc.shape)

# Print the shape of the transformed array
print("After OneHotEncoder:", housing_encoded_drop.shape)

# Print first 2 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print('\nFirst 2 rows of the resulting dataset:')
print(housing_encoded_drop[:2, :], '\n')

[ColumnTransformer] . (1 of 2) Processing OneHotEncoder, total=   0.0s
[ColumnTransformer] ..... (2 of 2) Processing remainder, total=   0.0s
Before OneHotEncoder: (1460, 21)
After OneHotEncoder: (1460, 57)

First 2 rows of the resulting dataset:
[[0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 6.000e+01
  6.500e+01 8.450e+03 7.000e+00 5.000e+00 2.003e+03 0.000e+00 1.710e+03
  1.000e+00 0.000e+00 2.000e+00 1.000e+00 3.000e+00 0.000e+00 5.480e+02
  2.085e+05]
 [0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+0

### Using <font color=red>pandas.get_dummies</font> to encode

In [14]:
#print(housing_unproc.head(1))
#print(housing_unproc.shape)

# Using pandas to encode categorical columns
print('\n\nEncoded with Pandas Without drop parameters')
df_pd = housing_unproc.copy(deep = True)
df_pd = pd.get_dummies(df_pd)
print(df_pd.head())
print('Shape:', df_pd.shape)

# Using pandas to encode categorical columns, dropping the first level of each
print('\n\nEncoded with Pandas With drop parameters')
df_pd = housing_unproc.copy(deep = True)
df_pd = pd.get_dummies(df_pd, drop_first=True)
print(df_pd.head())
print('Shape:', df_pd.shape)



Encoded with Pandas Without drop parameters
   MSSubClass  LotFrontage  LotArea  OverallQual  ...  HouseStyle_SLvl  PavedDrive_N  PavedDrive_P  PavedDrive_Y
0          60         65.0     8450            7  ...                0             0             0             1
1          20         80.0     9600            6  ...                0             0             0             1
2          60         68.0    11250            7  ...                0             0             0             1
3          70         60.0     9550            7  ...                0             0             0             1
4          60         84.0    14260            8  ...                0             0             0             1

[5 rows x 62 columns]
Shape: (1460, 62)


Encoded with Pandas With drop parameters
   MSSubClass  LotFrontage  LotArea  OverallQual  ...  HouseStyle_SFoyer  HouseStyle_SLvl  PavedDrive_P  PavedDrive_Y
0          60         65.0     8450            7  ...                  0 

**OneHotEncoder vs get_dummies**<br>
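<font color=red>For machine learning, you almost certainly want to use sklearn.preprocessing.OneHotEncoder, because a fitted encoder remembers the categories it saw and encodes new data into the same columns. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient.</font>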
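
One way to see the difference is to encode a training frame and then some new rows separately. The sketch below uses a made-up PavedDrive column, and the value "G" is a hypothetical category that never appears in training. `handle_unknown="ignore"` is a choice made here so that unseen categories become all-zero rows instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"PavedDrive": ["Y", "N", "P", "Y"]})
new_rows = pd.DataFrame({"PavedDrive": ["Y", "G"]})  # "G" is unseen (hypothetical)

# get_dummies derives columns from whatever values are present each time
dummies_train = pd.get_dummies(train)
dummies_new = pd.get_dummies(new_rows)
print(list(dummies_train.columns))
print(list(dummies_new.columns))

# OneHotEncoder learns the categories once and reuses them
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_new = encoder.transform(new_rows).toarray()
print(encoder.categories_)
print(encoded_new)
```

The get_dummies outputs have different columns for the two frames, while the fitted encoder always produces the three columns learned from the training data.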

## 4.5 Encoding categorical columns III: DictVectorizer

Alright, one final trick before you dive into pipelines. The two-step process you just went through - LabelEncoder followed by OneHotEncoder - can be simplified by using a DictVectorizer.

Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows you to get label encoding as well as one-hot encoding in one go.

Your task is to work through this strategy in this exercise!
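
Before running this on the full dataset, here is a minimal sketch of the idea on a two-row toy frame (values invented for illustration). It shows the list of dictionaries produced by `.to_dict("records")` and how DictVectorizer expands the string column into `name=value` features while leaving the numeric column untouched.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy frame: one numeric column and one categorical column
toy = pd.DataFrame({"LotArea": [8450.0, 9600.0],
                    "MSZoning": ["RL", "RM"]})

# One dictionary per row
records = toy.to_dict("records")
print(records)

vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(records)

# String values become "column=value" indicator features
print(vectorizer.feature_names_)
print(encoded)
```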

**Instructions**

1. Import DictVectorizer from sklearn.feature_extraction.
2. Convert df into a dictionary called df_dict using its .to_dict() method with "records" as the argument.
3. Instantiate a DictVectorizer object called dv with the keyword argument sparse=False.
4. Apply the DictVectorizer on df_dict by using its .fit_transform() method.
5. Hit 'Submit Answer' to print the resulting first five rows and the vocabulary.

**Results**

<font color=darkgreen>Fantastic! Besides simplifying the process into one step, DictVectorizer has useful attributes such as vocabulary_ which maps the names of the features to their indices. With the data preprocessed, it's time to move onto pipelines!</font>

In [15]:
# Convert df into a dictionary: df_dict
df_dict = housing_unproc.to_dict('records')

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse = False)

# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)

# Print the resulting first two rows
print('First 2 rows of the resulting dataset:')
print(df_encoded[:2,:], '\n')

# Print the vocabulary
print('Vocabulary:')
print(dv.vocabulary_)

# Transforming to df
print('\nTransforming to dataframe:')
df_data = pd.DataFrame(data=df_encoded, columns=dv.feature_names_)
print(df_data.head())

First 2 rows of the resulting dataset:
[[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 2.000e+00 5.480e+02 1.710e+03 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
  8.450e+03 6.500e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 2.085e+05 2.003e+03]
 [3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  1.000e+00 1.000e+00 2.000e+00 4.600e+02 1.262e+03 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  9.600e+03 8.000e+01 2.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.

In [16]:
cols = dict(sorted(dv.vocabulary_.items(), key=lambda w: w[1]))
print(cols)

{'BedroomAbvGr': 0, 'BldgType=1Fam': 1, 'BldgType=2fmCon': 2, 'BldgType=Duplex': 3, 'BldgType=Twnhs': 4, 'BldgType=TwnhsE': 5, 'BsmtFullBath': 6, 'BsmtHalfBath': 7, 'Fireplaces': 8, 'FullBath': 9, 'GarageArea': 10, 'GrLivArea': 11, 'HalfBath': 12, 'HouseStyle=1.5Fin': 13, 'HouseStyle=1.5Unf': 14, 'HouseStyle=1Story': 15, 'HouseStyle=2.5Fin': 16, 'HouseStyle=2.5Unf': 17, 'HouseStyle=2Story': 18, 'HouseStyle=SFoyer': 19, 'HouseStyle=SLvl': 20, 'LotArea': 21, 'LotFrontage': 22, 'MSSubClass': 23, 'MSZoning=C (all)': 24, 'MSZoning=FV': 25, 'MSZoning=RH': 26, 'MSZoning=RL': 27, 'MSZoning=RM': 28, 'Neighborhood=Blmngtn': 29, 'Neighborhood=Blueste': 30, 'Neighborhood=BrDale': 31, 'Neighborhood=BrkSide': 32, 'Neighborhood=ClearCr': 33, 'Neighborhood=CollgCr': 34, 'Neighborhood=Crawfor': 35, 'Neighborhood=Edwards': 36, 'Neighborhood=Gilbert': 37, 'Neighborhood=IDOTRR': 38, 'Neighborhood=MeadowV': 39, 'Neighborhood=Mitchel': 40, 'Neighborhood=NAmes': 41, 'Neighborhood=NPkVill': 42, 'Neighborhood=

## 4.6 Preprocessing within a pipeline

Now that you've seen what steps need to be taken individually to properly process the Ames housing data, let's use the much cleaner and more succinct DictVectorizer approach and put it alongside an XGBRegressor inside of a scikit-learn pipeline.

**Instructions**

1. Import DictVectorizer from sklearn.feature_extraction and Pipeline from sklearn.pipeline.
2. Fill in any missing values in the LotFrontage column of X with 0.
3. Complete the steps of the pipeline with DictVectorizer(sparse=False) for "ohe_onestep" and xgb.XGBRegressor() for "xgb_model".
4. Create the pipeline using Pipeline() and steps.
5. Fit the Pipeline. Don't forget to convert X into a format that DictVectorizer understands by calling the to_dict("records") method on X.

**Results**

<font color=darkgreen>Well done! It's now time to see what it takes to use XGBoost within pipelines.</font>

In [17]:
# Create arrays for the features and the target: X, y
#X, y = housing_unproc.iloc[:,:-1], housing_unproc.iloc[:,-1]
X, y = housing_unproc.drop('SalePrice', axis=1), housing_unproc.SalePrice

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse = False)),
         ("xgb_model", xgb.XGBRegressor(seed = SEED))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Fit the pipeline
xgb_pipeline.fit(X_train.to_dict('records'), y_train)

# Predict the labels of the test set: preds
preds = xgb_pipeline.predict(X_test.to_dict('records'))

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 28278.025825


## 4.7 Incorporating XGBoost into pipelines

1. Incorporating xgboost into pipelines
>Now that you've had some practice using pipelines in scikit-learn, let's see what it takes to use xgboost within pipelines.

2. Scikit-learn pipeline example with XGBoost
>This example is very similar to what was shown in the pipeline review that began this chapter. To get XGBoost to work within a pipeline, all that's really required is that you use XGBoost's scikit-learn API within a pipeline object. Let's see what that looks like in practice. As always, we first import everything we need for our purposes. We then proceed to load in the dataset and parse it into the matrix of features X and target vector y. At this point lies the only difference between using a scikit-learn native machine learning model and XGBoost. Specifically, we simply pass an instance of the XGBoost XGBRegressor object into the pipeline where a normal scikit-learn estimator would be. The rest of the script is exactly what you've seen in the past. You compute the cross-validated negative MSE using 10-fold cross-validation and then convert the 10-fold negative MSE into an average RMSE across all 10 folds. As you can see, without any hyperparameter tuning, the XGBoost model had a lower RMSE, of ~4.03 units, than the random forest model we started the chapter with, which had an RMSE around 4.5.

3. Additional components introduced for pipelines
>We wanted you to see how a simple case of pipelining with XGBoost works. However, in the final end-to-end example, we will take a dataset that involves significantly more wrangling before it can be used with XGBoost and put it through a pipeline as well. As a result, we will have to work with a library that is not part of the standard suite of scikit-learn tools, as well as work with parts of pipelines that you may not be familiar with. Sklearn_pandas is a separate library that attempts to bridge the gap between working with pandas and working with scikit-learn, as they don't always work seamlessly together. Specifically, sklearn_pandas has a generic class called DataFrameMapper that allows for easy conversion between scikit-learn aware objects, or pure numpy arrays, and the DataFrames that are the bread and butter of the pandas library. Additionally, we will use a class called CategoricalImputer that will allow us to impute missing categorical values directly, without having to first convert them to integers, as is the requirement in scikit-learn. We will also use some uncommon aspects of scikit-learn to accomplish our goals. Specifically, we will use the Imputer class from scikit-learn's preprocessing submodule (SimpleImputer in sklearn.impute in later releases), which allows us to fill in missing numerical values, and the FeatureUnion class found in scikit-learn's pipeline submodule. The FeatureUnion class allows us to combine separate pipeline outputs into a single pipeline output, as we would need to do, for example, if we had one set of preprocessing steps to perform on the categorical features of a dataset and a distinct set of preprocessing steps for the numeric features. The point is, we will introduce these topics all at once, but we don't want you to feel overwhelmed about what they are doing and how they can be used properly.

4. Let's practice!
>Hopefully, you just saw that it's not particularly difficult to incorporate XGBoost into pipelines. Now, it's your turn to practice what you just learned!
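
To make the FeatureUnion description more concrete, here is a minimal sketch in which two small pipelines each handle one column and the union places their outputs side by side. The array, the helper functions `first_column` and `second_column`, and the choice of imputation strategies are assumptions made for illustration. It uses SimpleImputer from sklearn.impute, which this notebook already imports.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data with one missing value in each column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 10.0],
                  [3.0, np.nan],
                  [5.0, 20.0]])


def first_column(data):
    """Hypothetical helper: keep only column 0 as a 2-D array."""
    return data[:, [0]]


def second_column(data):
    """Hypothetical helper: keep only column 1 as a 2-D array."""
    return data[:, [1]]


# Branch 1: fill column 0 with its mean
mean_branch = Pipeline([("select", FunctionTransformer(first_column)),
                        ("impute", SimpleImputer(strategy="mean"))])

# Branch 2: fill column 1 with its most frequent value
mode_branch = Pipeline([("select", FunctionTransformer(second_column)),
                        ("impute", SimpleImputer(strategy="most_frequent"))])

# The union runs both branches on the same input and concatenates the results
union = FeatureUnion([("mean_imputed", mean_branch),
                      ("mode_imputed", mode_branch)])

combined = union.fit_transform(X_toy)
print(combined)
```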

In [18]:
# Scikit-learn pipeline example with XGBoost
#X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X, y = boston_data.drop('med_price', axis=1), boston_data.med_price

# if you use Pipeline[] instead of Pipeline(), you get the error
# TypeError: 'ABCMeta' object is not subscriptable
xgb_pipeline = Pipeline([("st_scaler", StandardScaler()),
                         ("xgb_model",xgb.XGBRegressor(seed=SEED))])

scores = cross_val_score(xgb_pipeline, 
                         X, y,
                         scoring="neg_mean_squared_error",
                         cv = 10)

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("Final XGB RMSE:", final_avg_rmse)

Final XGB RMSE: 4.37678209879882


## 4.8 Cross-validating your XGBoost model

In this exercise, you'll go one step further by using the pipeline you've created to preprocess and cross-validate your model.

**Instructions**

1. Create a pipeline called xgb_pipeline using steps.
2. Perform 10-fold cross-validation using cross_val_score(). You'll have to pass in the pipeline, X (as a dictionary, using .to_dict("records")), y, the number of folds you want to use, and scoring ("neg_mean_squared_error").
3. Print the 10-fold RMSE.

**Results**

<font color=darkgreen>Great work!</font>

In [19]:
# Create arrays for the features and the target: X, y
#X, y = housing_unproc.iloc[:,:-1], housing_unproc.iloc[:,-1]
X, y = housing_unproc.drop('SalePrice', axis=1), housing_unproc.SalePrice

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror", seed=SEED))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline,
                                   X.to_dict('records'), y,
                                   cv = 10,
                                   scoring = "neg_mean_squared_error")

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))

10-fold RMSE:  27683.04157118635


## 4.9 Kidney disease case study I: Categorical Imputer

You'll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The <a href=https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease>chronic kidney disease dataset</a> contains both categorical and numeric features, but contains lots of missing values. The goal here is to predict who has chronic kidney disease given various blood indicators as features.

As Sergey mentioned in the video, you'll be introduced to a new library, <a href=https://github.com/scikit-learn-contrib/sklearn-pandas>sklearn_pandas</a>, that allows you to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, you'll be able to impute missing categorical values directly using the CategoricalImputer() class in sklearn_pandas, and use the DataFrameMapper() class to apply any arbitrary sklearn-compatible transformer to DataFrame columns, where the resulting output can be either a NumPy array or a DataFrame.

We've also created a transformer called a Dictifier that encapsulates converting a DataFrame using .to_dict("records") without you having to do it explicitly (and so that it works in a pipeline). Finally, we've also provided the list of feature names in kidney_feature_names, the target name in kidney_target_name, the features in X, and the target in y.

In this exercise, your task is to apply the CategoricalImputer to impute all of the categorical columns in the dataset. You can refer to how the numeric imputation mapper was created as a template. Notice the keyword arguments input_df=True and df_out=True? This is so that you can work with DataFrames instead of arrays. By default, the transformers are passed a numpy array of the selected columns as input, and as a result, the output of the DataFrame mapper is also an array. Scikit-learn transformers have historically been designed to work with numpy arrays, not pandas DataFrames, even though their basic indexing interfaces are similar.

**Instructions**

1. Apply the categorical imputer using DataFrameMapper() and CategoricalImputer(). CategoricalImputer() does not need any arguments to be passed in. The columns are contained in categorical_columns. Be sure to specify input_df=True and df_out=True, and use category_feature as your iterator variable in the list comprehension.
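
The solution cell in this notebook uses scikit-learn's SimpleImputer(strategy="most_frequent") for the categorical columns rather than CategoricalImputer(). As a rough sketch of what the two imputation strategies do, the example below applies them directly to a made-up two-column frame; the values are invented, and only the column names rbc and age come from the kidney dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows; dtype=object mirrors the `X.dtypes == object` mask
toy = pd.DataFrame({
    "rbc": pd.Series(["normal", np.nan, "abnormal", "normal"], dtype=object),
    "age": [48.0, np.nan, 62.0, 51.0],
})

# SimpleImputer expects 2-D input, hence the double brackets
cat_imputer = SimpleImputer(strategy="most_frequent")
num_imputer = SimpleImputer(strategy="median")

rbc_filled = cat_imputer.fit_transform(toy[["rbc"]])
age_filled = num_imputer.fit_transform(toy[["age"]])

print(rbc_filled.ravel())  # missing entry replaced by the most common label
print(age_filled.ravel())  # missing entry replaced by the column median
```

The 2-D requirement is also why a DataFrameMapper column selector is written as a one-element list, `[column_name]`, when the transformer is a SimpleImputer.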

**Results**

<font color=darkgreen>Great work!</font>

In [20]:
# Exploring the data
print('Exploring the data:')
print(kideney.info())

# Create arrays for the features and the target: X, y
df = kideney.copy(deep = True)
X, y = df.drop('class', axis=1), df['class']

# Apply LabelEncoder to target columns
le = LabelEncoder()
y = le.fit_transform(y)

# Check number of nulls in each feature column
print('\nNulls per column in kidney dataset: ')
print(X.isnull().sum())

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
    input_df=True, df_out=True
)

# Apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
    # Select each column as a one-element list so SimpleImputer receives the 2-D input it requires
    [([category_feature], SimpleImputer(strategy='most_frequent')) for category_feature in categorical_columns],
    input_df=True, df_out=True
)

Exploring the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     329 non-null    float64
 16  wc      294 non-null    float64
 17  rc      269 non-null    float64
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-nul

## 4.10 Kidney disease case study II: Feature Union

Having separately imputed numeric as well as categorical columns, your task is now to use scikit-learn's <a href=http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html>FeatureUnion</a> to concatenate their results, which are contained in two separate transformer objects - numeric_imputation_mapper, and categorical_imputation_mapper, respectively.

You may have already encountered FeatureUnion in <a href=https://campus.datacamp.com/courses/machine-learning-with-the-experts-school-budgets/improving-your-model?ex=7>Machine Learning with the Experts: School Budgets</a>. Just like with pipelines, you have to pass it a list of (string, transformer) tuples, where the first element of each tuple is the name of the transformer.

**Instructions**

1. Import FeatureUnion from sklearn.pipeline.
2. Combine the results of numeric_imputation_mapper and categorical_imputation_mapper using FeatureUnion(), with the names "num_mapper" and "cat_mapper" respectively.
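
As a minimal illustration of what FeatureUnion does with its transformers, the sketch below fits two of them on the same toy array and concatenates their outputs column-wise. The array and the step names "scaled" and "raw" are assumptions chosen for the example, not part of the exercise.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical 3x2 feature matrix
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

# Each (name, transformer) pair sees the full input;
# FunctionTransformer() with no function is an identity transform
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("raw", FunctionTransformer()),
])

# Outputs are stacked side by side: 2 scaled columns + 2 raw columns
combined = union.fit_transform(X_toy)
print(combined.shape)
print(combined)
```

Every transformer in a FeatureUnion receives the same input, so column selection has to happen inside each branch, which is the role the two DataFrameMapper objects play in this exercise.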

**Results**

<font color=darkgreen>Great work!</font>

In [21]:
# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
    ("num_mapper", numeric_imputation_mapper),
    ("cat_mapper", categorical_imputation_mapper)
])

## 4.11 Kidney disease case study III: Full pipeline

It's time to piece together all of the transforms along with an XGBClassifier to build the full pipeline!

Besides the numeric_categorical_union that you created in the previous exercise, there are two other transforms needed: the Dictifier() transform which we created for you, and the DictVectorizer().
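
The sketch below shows, on an invented two-row frame, how a Dictifier-style transformer and DictVectorizer(sort=False) fit together inside a Pipeline. The class is re-declared here only to keep the example self-contained, and the toy values are assumptions; age and htn are column names from the kidney dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class Dictifier(BaseEstimator, TransformerMixin):
    """Stateless transformer: DataFrame -> list of row dicts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.to_dict("records")


# Hypothetical rows
toy = pd.DataFrame({"age": [48.0, 62.0], "htn": ["yes", "no"]})

preprocess = Pipeline([
    ("dictifier", Dictifier()),
    # sort=False keeps features in order of first appearance
    ("vectorizer", DictVectorizer(sort=False, sparse=False)),
])

encoded = preprocess.fit_transform(toy)
feature_order = preprocess.named_steps["vectorizer"].feature_names_

print(feature_order)
print(encoded)
```

Wrapping `.to_dict("records")` in a transformer means the conversion is repeated automatically on every training and validation fold during cross-validation.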

After creating the pipeline, your task is to cross-validate it to see how well it performs.

**Instructions**

1. Create the pipeline using the numeric_categorical_union, Dictifier(), and DictVectorizer(sort=False) transforms, and the xgb.XGBClassifier() estimator with max_depth=3. Name the transforms "featureunion", "dictifier", and "vectorizer", and the estimator "clf".
2. Perform 3-fold cross-validation on the pipeline using cross_val_score(). Pass it the pipeline (pipeline), the features (kidney_data), and the outcomes (y). Also set scoring to "roc_auc" and cv to 3.

**Results**

<font color=darkgreen>Great work!</font>

In [22]:
# Custom transformer to convert Pandas DataFrame into Dict (needed for DictVectorizer)
class Dictifier(BaseEstimator, TransformerMixin):   
    """
    Encapsulates converting a DataFrame using .to_dict("records") without you having to do it explicitly 
    (and so that it works in a pipeline). 
    """
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.to_dict('records')

In [23]:
# Initialize an empty list for Simple Imputer details
transformers = []

# Apply numeric imputer
transformers.extend(
    [([numeric_feature], [SimpleImputer(strategy="median"), StandardScaler()]) for numeric_feature in non_categorical_columns]
)

# Apply categorical imputer
transformers.extend(
    [([category_feature], [SimpleImputer(strategy='most_frequent')]) for category_feature in categorical_columns]
)
# Combine the numeric and categorical transformations
numeric_categorical_union = DataFrameMapper(transformers, input_df=True, df_out=True)

# Create full pipeline
pipeline = Pipeline([("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, 
                                               use_label_encoder=False, eval_metric='error', 
                                               max_depth=3, seed=SEED))])

# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y, cv=3, scoring='roc_auc')

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))

3-fold AUC:  0.9985589978963473


## 4.12 Tuning XGBoost hyperparameters

1. Tuning XGBoost hyperparameters in a pipeline
>We are going to finish off this chapter, and the course, by seeing how automated hyperparameter tuning for an XGBoost model works within a scikit-learn pipeline. Once you have this down, you'll be able to make some of the most powerful well-tuned machine learning models today in an automated, reproducible manner.

2. Tuning XGBoost hyperparameters in a pipeline
>We will again use the Boston housing dataset to motivate our use of pipelines and hyperparameter tuning. As always, we first import what we will be using. The only difference is that we now also import RandomizedSearchCV from scikit-learn's model_selection submodule. We then load our data, create our feature matrix X and target vector y, and create a pipeline that includes both the standard scaling step and a base XGBRegressor object with all default parameters. At this point, you need to create the grid of parameters over which you will search. For the hyperparameters to be passed to the appropriate step, each key in the dictionary must consist of the name of the step being referenced, followed by two underscores, and then the name of the hyperparameter you want to iterate over. Since the XGBoost step is called xgb_model, all of our hyperparameter keys will start with xgb_model__. In the example, we will tune subsample, max_depth, and colsample_bytree, and give each parameter a range of possible values. We then pass the pipeline in as the estimator to RandomizedSearchCV and the parameter grid to param_distributions. Everything else is as you've seen before, with appropriate scoring and cross-validation parameters passed in as well. Once that's done, all you need to do is fit the RandomizedSearchCV object, passing in the X and y objects we created earlier.

3. Tuning XGBoost hyperparameters in a pipeline II
>Finally, once you've fit the RandomizedSearchCV object, you can inspect the best score it found and convert it to an RMSE. You can also inspect the best model it found and print it to the screen.

4. Let's finish this up!
>Ok, last coding exercise of the course, let's finish this up!
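
The `step__parameter` naming rule described in the second slide can be checked without running a search. The sketch below is an illustration under one stated assumption: scikit-learn's GradientBoostingRegressor stands in for xgb.XGBRegressor so that the example needs only scikit-learn. The naming mechanism belongs to Pipeline, so it applies to any estimator placed in the step.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in estimator (assumption): GradientBoostingRegressor instead of XGBRegressor
pipe = Pipeline([("st_scaler", StandardScaler()),
                 ("xgb_model", GradientBoostingRegressor(random_state=123))])

# Every tunable key is "<step name>__<parameter name>"
tunable = [name for name in pipe.get_params() if name.startswith("xgb_model__")]
print(tunable[:5])

# A search sets candidate values through the same keys
pipe.set_params(xgb_model__max_depth=5)
depth_after = pipe.named_steps["xgb_model"].max_depth

# A prefix that does not match a step name is rejected
try:
    pipe.set_params(xgboost_model__max_depth=5)
    wrong_prefix_rejected = False
except ValueError:
    wrong_prefix_rejected = True

print(depth_after, wrong_prefix_rejected)
```

Listing `pipe.get_params()` before writing a parameter grid is a quick way to catch a mistyped step name or hyperparameter.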

In [24]:
# Scikit-learn pipeline example
#X, y = boston_data.iloc[:,:-1], boston_data.iloc[:,-1]
X, y = boston_data.drop('med_price', axis=1), boston_data.med_price

# Tuning XGBoost hyperparameters in a pipeline
xgb_pipeline = Pipeline([("st_scaler", StandardScaler()), 
                         ("xgb_model",xgb.XGBRegressor(seed=SEED))])
gbm_param_grid = {
    'xgb_model__subsample': np.arange(.05, 1, .05),
    'xgb_model__max_depth': np.arange(3,20,1),
    'xgb_model__colsample_bytree': np.arange(.1,1.05,.05) 
}

randomized_neg_mse = RandomizedSearchCV(
    estimator=xgb_pipeline, 
    param_distributions=gbm_param_grid, 
    n_iter=10,
    scoring='neg_mean_squared_error', 
    cv=4,
    random_state=SEED
)
randomized_neg_mse.fit(X, y)

print("Best rmse: ", np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print("Best model: ", randomized_neg_mse.best_estimator_)

Best rmse:  4.825683665057551
Best model:  Pipeline(steps=[('st_scaler', StandardScaler()),
                ('xgb_model',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.9000000000000002, gamma=0,
                              gpu_id=-1, importance_type='gain',
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth=5, min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=100,
                              n_jobs=8, num_parallel_tree=1, random_state=123,
                              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                              seed=123, subsample=0.6500000000000001,
                              tree_method='exact', validate_parameters=1,
                  

## 4.13 Bringing it all together

Alright, it's time to bring together everything you've learned so far! In this final exercise of the course, you will combine your work from the previous exercises into one end-to-end XGBoost pipeline to really cement your understanding of preprocessing and pipelines in XGBoost.

Your work from the previous 3 exercises, where you preprocessed the data and set up your pipeline, has been pre-loaded. Your job is to perform a randomized search and identify the best hyperparameters.

**Instructions**

1. Set up the parameter grid to tune 'clf__learning_rate' (from 0.05 to 1 in increments of 0.05), 'clf__max_depth' (from 3 to 10 in increments of 1), and 'clf__n_estimators' (from 50 to 200 in increments of 50).
2. Using your pipeline as the estimator, perform 2-fold RandomizedSearchCV with an n_iter of 2. Use "roc_auc" as the metric, and set verbose to 1 so the output is more detailed. Store the result in randomized_roc_auc.
3. Fit randomized_roc_auc to X and y.
4. Compute the best score and best estimator of randomized_roc_auc.
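
One detail worth checking when building these grids: np.arange excludes its stop value. The short sketch below prints the candidate values that the ranges in the first instruction produce when written with the stop values as given. The `max_depths_inclusive` variant is an assumption about how one might include 10, not part of the exercise.

```python
import numpy as np

# Ranges as they would be written from the instruction text
learning_rates = np.arange(0.05, 1, 0.05)
max_depths = np.arange(3, 10, 1)
n_estimators = np.arange(50, 200, 50)

# The stop value is not part of the result
print("max_depth candidates:   ", max_depths)
print("n_estimators candidates:", n_estimators)
print("learning_rate range:    ", learning_rates.min(), "to", learning_rates.max())

# Hypothetical variant: raise the stop by one step to include 10
max_depths_inclusive = np.arange(3, 11, 1)
print("inclusive max_depth:    ", max_depths_inclusive)
```

With a floating-point step the generated values carry small rounding errors, which is why the fitted models in this notebook print values such as 0.6500000000000001.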

**Results**

<font color=darkgreen>Amazing work! This type of pipelining is very common in real-world data science and you're well on your way towards mastering it.</font>

In [25]:
###########################################################
## Step 1, ex.4.9
###########################################################
# Create arrays for the features and the target: X, y
df = kideney.copy(deep = True)
X, y = df.drop('class', axis=1), df['class']

# Apply LabelEncoder to target columns
le = LabelEncoder()
y = le.fit_transform(y)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

In [26]:
###########################################################
## Step 2, ex.4.11
###########################################################
# Initialize an empty list for Simple Imputer details
transformers = []

# Apply numeric imputer
transformers.extend(
    [([numeric_feature], [SimpleImputer(strategy="median"), StandardScaler()]) for numeric_feature in non_categorical_columns]
)

# Apply categorical imputer
transformers.extend(
    [([category_feature], [SimpleImputer(strategy='most_frequent')]) for category_feature in categorical_columns]
)

# Combine the numeric and categorical transformations
numeric_categorical_union = DataFrameMapper(transformers, input_df=True, df_out=True)

# Create full pipeline
pipeline = Pipeline([("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, 
                                               use_label_encoder=False, eval_metric='error', 
                                               max_depth=3, seed=SEED))])

In [27]:
###########################################################
## Step 3, ex.4.13
###########################################################
# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth'    : np.arange(3, 10, 1),
    'clf__n_estimators' : np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=gbm_param_grid,
    n_iter=2,
    cv=2,
    scoring='roc_auc',
    verbose=1,
    random_state=SEED
)

# Fit the estimator
randomized_roc_auc.fit(X, y)

# Compute metrics
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_estimator_)

Fitting 2 folds for each of 2 candidates, totalling 4 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.8s finished


0.9965333333333333
Pipeline(steps=[('featureunion',
                 DataFrameMapper(df_out=True, drop_cols=[],
                                 features=[(['age'],
                                            [SimpleImputer(strategy='median'),
                                             StandardScaler()]),
                                           (['bp'],
                                            [SimpleImputer(strategy='median'),
                                             StandardScaler()]),
                                           (['sg'],
                                            [SimpleImputer(strategy='median'),
                                             StandardScaler()]),
                                           (['al'],
                                            [SimpleImputer(strategy='median'),
                                             StandardScaler()]),
                                           (['su'],
                                            [Simple

## 4.14 Final Thoughts

1. Final Thoughts
>Congratulations on completing this course. Let's go over everything we've covered in this course, as well as where you can go from here with learning other topics related to XGBoost that we didn't have a chance to cover.

2. What We Have Covered And You Have Learned
>So, what have we been able to cover in this course? Well, we've learned how to use XGBoost for both classification and regression tasks. We've also covered all the most important hyperparameters that you should tune when creating XGBoost models, so that they are as performant as possible. And we just finished up how to incorporate XGBoost into pipelines, and used some more advanced functions that allow us to seamlessly work with Pandas DataFrames and scikit-learn. That's quite a lot of ground we've covered and you should be proud of what you've been able to accomplish.

3. What We Have Not Covered (And How You Can Proceed)
>However, although we've covered quite a lot, we didn't cover some other topics that would advance your mastery of XGBoost. Specifically, we never looked into how to use XGBoost for ranking or recommendation problems, which can be done by modifying the loss function you use when constructing your model. We also didn't look into more advanced hyperparameter selection strategies. The most powerful strategy, called Bayesian optimization, has been used with lots of success, and entire companies have been created specifically to use this method in tuning models (for example, the company SigOpt does exactly this). It's a powerful method, but it would take another entire DataCamp course to teach properly! Finally, we haven't talked about ensembling XGBoost with other models. Although XGBoost is itself an ensemble method, nothing stops you from combining the predictions you get from an XGBoost model with those of other models, as this is usually a very powerful additional way to squeeze the last bit of juice from your data. Learning about all of these additional topics will help you become an even more powerful user of XGBoost. Now that you know your way around the package, there's no reason for you to stop learning how to get even more benefits out of it.

4. Congratulations!
>I hope you've enjoyed taking this course on XGBoost as much as I have enjoyed teaching it. Please let us know if you've enjoyed the course, and definitely let me know how I can improve it. It's been a pleasure, and I hope you continue your data science journey from here!

# Additional material
- DataCamp course: https://learn.datacamp.com/courses/extreme-gradient-boosting-with-xgboost
- XGBoost documentation: https://xgboost.readthedocs.io/en/latest/
- sklearn.tree.DecisionTreeClassifier documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
- scikit-learn scoring metrics documentation: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter