# Classification with XGBoost
> This chapter will introduce you to the fundamental idea behind XGBoost: boosted learners. Once you understand how XGBoost works, you'll apply it to solve a common classification problem found in industry: predicting whether a customer will stop being a customer at some point in the future.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp, XGBoost]
- image: images/datacamp/___

> Note: This is a summary of the chapter 1 exercises from the DataCamp course "Extreme Gradient Boosting with XGBoost". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

plt.rcParams['figure.figsize'] = (8, 8)

## Introduction

### Which of these is a classification problem?

<p>Given below are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a classification problem.</p>

<pre>
Possible Answers

Given past performance of stocks and various other financial data, predicting the exact price of a given stock (Google) tomorrow.

Given a large dataset of user behaviors on a website, generating an informative segmentation of the users based on their behaviors.

<b>Predicting whether a given user will click on an ad given the ad content and metadata associated with the user.</b>

Given a user's past behavior on a video platform, presenting him/her with a series of recommended videos to watch next.

</pre>

**This is indeed a classification problem.**

### Which of these is a binary classification problem?

<p>Great! A classification problem involves predicting the category a given data point belongs to out of a finite set of possible categories. Depending on how many possible categories there are to predict, a classification problem can be either binary or multi-class. Let's do another quick refresher here. Your job is to pick the <strong>binary</strong> classification problem out of the following list of supervised learning problems.</p>

<pre>
Possible Answers

<b>Predicting whether a given image contains a cat.</b>

Predicting the emotional valence of a sentence (Valence can be positive, negative, or neutral).

Recommending the most tax-efficient strategy for tax filing in an automated accounting system.

Given a list of symptoms, generating a rank-ordered list of most likely diseases.

</pre>

**A binary classification problem involves picking between 2 choices.**
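
As a quick illustration of the distinction, the sketch below counts the distinct classes in a label column. The helper name `classification_kind` and the two label lists are made up for this example; they only mirror the cat and valence answers above.

```python
def classification_kind(labels):
    """Return 'binary' for exactly two distinct classes, 'multi-class' for more."""
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("need at least two distinct classes")
    return "binary" if n_classes == 2 else "multi-class"


# Hypothetical label columns mirroring two of the quiz answers above.
cat_labels = ["cat", "no cat", "cat", "cat", "no cat"]
valence_labels = ["positive", "negative", "neutral", "positive"]

print(classification_kind(cat_labels))      # binary
print(classification_kind(valence_labels))  # multi-class
```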

## Introducing XGBoost

### XGBoost: Fit/Predict

<div class=""><p>It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn <code>.fit()</code> / <code>.predict()</code> paradigm that you are already familiar with to build your XGBoost models, as the <code>xgboost</code> library has a scikit-learn compatible API!</p>
<p>Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called <code>churn_data</code> - explore it in the Shell!</p>
<p>Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small <code>xgboost</code> model on the training set, and evaluate its performance on the test set by computing its accuracy.</p>
<p><code>pandas</code> and <code>numpy</code> have been imported as <code>pd</code> and <code>np</code>, and <code>train_test_split</code> has been imported from <code>sklearn.model_selection</code>. Additionally, the arrays for the features and the target have been created as <code>X</code> and <code>y</code>.</p></div>

In [2]:
churn_data = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/05-extreme-gradient-boosting-with-xgboost/datasets/churn_data.csv')

Instructions
<ul>
<li>Import <code>xgboost</code> as <code>xgb</code>.</li>
<li>Create training and test sets such that 20% of the data is used for testing. Use a <code>random_state</code> of <code>123</code>.</li>
<li>Instantiate an <code>XGBClassifier</code> as <code>xg_cl</code> using <code>xgb.XGBClassifier()</code>. Specify <code>n_estimators</code> to be <code>10</code> estimators and an <code>objective</code> of <code>'binary:logistic'</code>. Do not worry about what this means just yet; you will learn about these parameters later in this course.</li>
<li>Fit <code>xg_cl</code> to the training set (<code>X_train, y_train</code>) using the <code>.fit()</code> method.</li>
<li>Predict the labels of the test set (<code>X_test</code>) using the <code>.predict()</code> method and hit 'Submit Answer' to print the accuracy.</li>
</ul>

In [5]:
# Import xgboost
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.743300


**Your model has an accuracy of around 74%. In Chapter 3, you'll learn about ways to fine-tune your XGBoost models. For now, let's refresh our memories on how decision trees work. See you in the next video!**
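
To make the `'binary:logistic'` objective and the accuracy calculation a little more concrete, here is a minimal sketch in plain Python. It assumes the usual reading of that objective: a raw score is passed through the logistic function to get a probability, which is then thresholded at 0.5 to give a label. The raw scores and labels are invented, and the helper names are hypothetical rather than part of `xgboost`.

```python
import math


def sigmoid(margin):
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_labels(margins, threshold=0.5):
    """Turn raw scores into 0/1 labels by thresholding the probability."""
    return [1 if sigmoid(m) > threshold else 0 for m in margins]


def accuracy(preds, truth):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == t for p, t in zip(preds, truth))
    return matches / len(truth)


# Made-up raw scores and true labels, for illustration only.
raw_scores = [2.0, -1.0, 0.3, -0.2, 1.5]
y_true = [1, 0, 0, 0, 1]

labels = to_labels(raw_scores)
print(labels)                    # [1, 0, 1, 0, 1]
print(accuracy(labels, y_true))  # 0.8
```

The `accuracy` helper does the same thing as the `np.sum(preds==y_test) / y_test.shape[0]` expression in the solution cell; scikit-learn's `accuracy_score` would be another way to get this number.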

## What is a decision tree?

### Decision trees

<div class=""><p>Your task in this exercise is to make a simple decision tree using scikit-learn's <code>DecisionTreeClassifier</code> on the <code>breast cancer</code> dataset that comes pre-loaded with scikit-learn. </p>
<p>This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign). </p>
<p>We've preloaded the dataset of samples (measurements) into <code>X</code> and the target values per tumor into <code>y</code>. Now, you have to split the complete dataset into training and testing sets, and then train a <code>DecisionTreeClassifier</code>. You'll specify a parameter called <code>max_depth</code>. Many other parameters can be modified within this model, and you can check all of them out <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier" target="_blank" rel="noopener noreferrer">here</a>.</p></div>

In [15]:
df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/05-extreme-gradient-boosting-with-xgboost/datasets/breast.csv')
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

Instructions
<ul>
<li>Import:<ul>
<li><code>train_test_split</code> from <code>sklearn.model_selection</code>.</li>
<li><code>DecisionTreeClassifier</code> from <code>sklearn.tree</code>.</li></ul></li>
<li>Create training and test sets such that 20% of the data is used for testing. Use a <code>random_state</code> of <code>123</code>.</li>
<li>Instantiate a <code>DecisionTreeClassifier</code> called <code>dt_clf_4</code> with a <code>max_depth</code> of <code>4</code>. This parameter specifies the maximum number of successive split points you can have before reaching a leaf node.</li>
<li>Fit the classifier to the training set and predict the labels of the test set.</li>
</ul>

In [16]:
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)

accuracy: 0.9736842105263158
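
To see what a single split point does, the sketch below searches one numeric feature for the threshold with the lowest weighted Gini impurity (Gini is the default criterion of `DecisionTreeClassifier`). This is only an illustration on made-up data: the function names are hypothetical, and scikit-learn's real implementation is far more optimized. `max_depth` limits how many such splits can be stacked on the way from the root to a leaf.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Try each midpoint between sorted unique values; return (threshold, impurity)."""
    xs = sorted(set(values))
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (thr, weighted)
    return best


# Toy single-feature data (made up): small values are class 0, large are class 1.
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(feature, target)
print(threshold, impurity)  # 6.5 0.0
```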


**It's now time to learn about what gives XGBoost its state-of-the-art performance: Boosting.**

## What is Boosting?

Boosting isn't a specific machine learning algorithm but a concept that can be applied to a set of machine learning models, so it's really a meta-algorithm. Specifically, it is an ensemble meta-algorithm used primarily to reduce bias (and also variance) by converting many weak learners into an arbitrarily strong learner.
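
The idea can be sketched in a few lines of plain Python. The example below is a generic gradient-boosting loop for squared error on made-up one-dimensional data, with one-split "stumps" as the weak learners. It is not XGBoost's actual algorithm (which adds regularization and second-order gradient information, among other things), and all names and the `learning_rate` value are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold, with a constant prediction on each side."""
    best = None
    ordered = sorted(set(xs))
    for lo, hi in zip(ordered, ordered[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    stumps = []
    preds = [0.0] * len(xs)
    history = []  # training mean squared error after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, history


# Made-up data with three rough "steps".
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

_, mse_per_round = boost(xs, ys, n_rounds=10)
print([round(m, 3) for m in mse_per_round])
```

No single stump can fit three steps, but the training error shrinks as more stumps are added, which is the sense in which many weak learners combine into a stronger one.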

### Measuring accuracy

<div class=""><p>You'll now practice using XGBoost's learning API through its built-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a <code>DMatrix</code>.</p>
<p>In the previous exercise, the input datasets were converted into <code>DMatrix</code> data on the fly, but when you use the <code>xgboost</code> <code>cv()</code> function, you first have to explicitly convert your data into a <code>DMatrix</code>. So, that's what you will do here before running cross-validation on <code>churn_data</code>.</p></div>

In [17]:
churn_data = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/05-extreme-gradient-boosting-with-xgboost/datasets/churn_data.csv')

Instructions
<ul>
<li>Create a <code>DMatrix</code> called <code>churn_dmatrix</code> from <code>churn_data</code> using <code>xgb.DMatrix()</code>. The features are available in <code>X</code> and the labels in <code>y</code>.</li>
<li>Perform 3-fold cross-validation by calling <code>xgb.cv()</code>. <code>dtrain</code> is your <code>churn_dmatrix</code>, <code>params</code> is your parameter dictionary, <code>nfold</code> is the number of cross-validation folds (<code>3</code>), <code>num_boost_round</code> is the number of trees we want to build (<code>5</code>), <code>metrics</code> is the metric you want to compute (this will be <code>"error"</code>, which we will convert to an accuracy).</li>
</ul>

In [18]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"binary:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

   train-error-mean  train-error-std  test-error-mean  test-error-std
0           0.28232         0.002366          0.28378        0.001932
1           0.26951         0.001855          0.27190        0.001932
2           0.25605         0.003213          0.25798        0.003963
3           0.25090         0.001845          0.25434        0.003827
4           0.24654         0.001981          0.24852        0.000934
0.75148


**cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1-error. The final accuracy of around 75% is an improvement from earlier!**
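
For intuition about where the mean and standard deviation columns come from, here is a small hand-rolled 3-fold cross-validation in plain Python. It is a sketch only: the labels are invented, a trivial majority-class predictor stands in for the model, the helper names are hypothetical, and the use of the population standard deviation is an assumption of this example rather than a statement about `xgb.cv()`.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds, seed):
    """Shuffle the row indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def majority_class(labels):
    """Stand-in 'model': always predict the most common training label."""
    return 1 if sum(labels) * 2 >= len(labels) else 0


def cross_validated_error(labels, n_folds=3, seed=123):
    """Return (mean, std) of the held-out error across folds."""
    folds = k_fold_indices(len(labels), n_folds, seed)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [y for i, y in enumerate(labels) if i not in held]
        prediction = majority_class(train)
        wrong = sum(labels[i] != prediction for i in held_out)
        errors.append(wrong / len(held_out))
    # Population standard deviation: an assumption made for this sketch.
    return statistics.mean(errors), statistics.pstdev(errors)


# Made-up labels: 8 positives and 4 negatives.
y = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
mean_error, std_error = cross_validated_error(y)
accuracy = 1 - mean_error  # same error-to-accuracy conversion as above
print(round(mean_error, 3), round(std_error, 3), round(accuracy, 3))
```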

### Measuring AUC

<div class=""><p>Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the <code>metrics</code> parameter of <code>xgb.cv()</code>. </p>
<p>Your job in this exercise is to compute another common metric used in binary classification - the area under the curve (<code>"auc"</code>). As before, <code>churn_data</code> is available in your workspace, along with the DMatrix <code>churn_dmatrix</code> and parameter dictionary <code>params</code>.</p></div>

Instructions
<ul>
<li>Perform 3-fold cross-validation with <code>5</code> boosting rounds and <code>"auc"</code> as your metric.</li>
<li>Print the <code>"test-auc-mean"</code> column of <code>cv_results</code>.</li>
</ul>

In [19]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.768893       0.001544       0.767863      0.002820
1        0.790864       0.006758       0.789157      0.006846
2        0.815872       0.003900       0.814476      0.005997
3        0.822959       0.002018       0.821682      0.003912
4        0.827528       0.000769       0.826191      0.001937
0.826191


**An AUC of about 0.83 is quite strong. As you have seen, XGBoost's learning API makes it very easy to compute any metric you may be interested in. In Chapter 3, you'll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it's time to learn a little about exactly when to use XGBoost.**
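
One way to read the AUC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs, with ties counted as half. The function and the four data points are illustrative only; this is not how `xgboost` computes the metric internally.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Tiny made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a score in the low 0.8s after only five boosting rounds is encouraging.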

## When should I use XGBoost?

### Using XGBoost

<p>XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn't always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the below examples and select the one which would be the best use of XGBoost.</p>

<pre>
Possible Answers

Visualizing the similarity between stocks by comparing the time series of their historical prices relative to each other.

Predicting whether a person will develop cancer using genetic data with millions of genes, 23 examples of genomes of people that didn't develop cancer, 3 genomes of people who wound up getting cancer.

Clustering documents into topics based on the terms used in them.

<b>Predicting the likelihood that a given user will click an ad from a very large clickstream log with millions of users and their web interactions.</b>
</pre>

**Way to end the chapter. Time to apply XGBoost to solve regression problems!**