# XGBoost Classifier
This notebook reviews the basic concepts you need to know in order to
successfully create an **XGBoost** classifier. We will be using **XGBClassifier** from **xgboost**. If you do not have it installed, run the following line in your terminal:

> conda install -c anaconda py-xgboost

In the following block of code we import the classifier and the helper functions we will use throughout this notebook.

In [1]:
#Import the classifier
from xgboost import XGBClassifier

#Import functions to create and split the data
from mlb_misc_functions import create_clf_table_1, prep_model_data

## Data Load and Prep

In the following block of code we create a modeling table (**model_data**) using the ```create_clf_table_1``` function. This is a dummy table that resembles the format of real data. It has 7 columns: the first is a unique row identifier, the next 5 are features, and the last is the target, which takes the values 0 and 1. See below:

In [2]:
#Call the function that creates the table. 10,000 rows
model_data = create_clf_table_1(10000)

#print the first 5 rows to screen
model_data.head(5)

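|   | row_id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | target |
|---|--------|-----------|-----------|-----------|-----------|-----------|--------|
| 0 | 0 | -7 | -1 | -9 | -8 | 8 | 0 |
| 1 | 1 | 4 | -2 | 2 | -6 | 4 | 1 |
| 2 | 2 | 0 | 5 | 0 | -6 | 3 | 1 |
| 3 | 3 | 4 | -8 | -7 | 3 | -9 | 0 |
| 4 | 4 | -7 | -6 | 2 | 7 | -4 | 0 |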


The next step is to split the data into a training set and a testing set, and to put it in a format that our classifier can read. In the block of code below we do that using a wrapper function called ```prep_model_data```. This function takes a pandas table and two lists: one with the name of the target column, and another with the names of the feature columns. **Note** that ```prep_model_data``` is a custom wrapper function that we import from the file **mlb_misc_functions**. We will talk about how to split a dataset into training and testing sets in another notebook.

In [3]:
#Target and feature names
target = ["target"]
features = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]

#Prep the data for training
train_x, train_y, test_x, test_y = prep_model_data(model_data, target, features)
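
The source of `prep_model_data` is not shown in this notebook, so the following is only a rough sketch of what such a wrapper might do, assuming it relies on scikit-learn's `train_test_split`. The function name `prep_model_data_sketch`, the `test_size` and `seed` parameters, and the random demo table are all hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prep_model_data_sketch(table, target, features, test_size=0.2, seed=0):
    """Hypothetical stand-in for prep_model_data; the real helper may differ."""
    # Feature columns become a 2D array, one row per observation
    x = table[features].to_numpy()
    # target is a one-element list, so take its single column as a 1D array
    y = table[target[0]].to_numpy()
    train_x, test_x, train_y, test_y = train_test_split(
        x, y, test_size=test_size, random_state=seed
    )
    # Return in the same order used by the call in this notebook
    return train_x, train_y, test_x, test_y


# Small random table with the same column layout as model_data (made-up values)
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 6)]
demo = pd.DataFrame(rng.integers(-10, 11, size=(100, 5)), columns=feature_names)
demo["target"] = rng.integers(0, 2, size=100)

tr_x, tr_y, te_x, te_y = prep_model_data_sketch(demo, ["target"], feature_names)
print(tr_x.shape, tr_y.shape, te_x.shape, te_y.shape)
```

Fixing the random seed makes the split reproducible, which helps when comparing models trained on the same data.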

Before we continue, let’s take a look at the first 5 elements of **train_x** and **train_y**. **train_x** is a two-dimensional array in which each row contains the feature values of one observation. **train_y**, on the other hand, is a one-dimensional array containing the targets (0 or 1). See below:

In [4]:
#Print the first 5 elements
print(train_x[:5])
print(train_y[:5])

[[  6  -2  -7   8   1]
 [ -4   2   6   5  -3]
 [-10   7   1   6   1]
 [  7   8 -10  -3   7]
 [  3  -1   4   8  -7]]
[1 1 0 1 0]



## Training

Now we will create our classifier and train it on the training data (**train_x** and **train_y**). See below:

In [5]:
#Create classifier
xgb_clf = XGBClassifier(n_estimators=100)

#Train the classifier
xgb_clf.fit(train_x, train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

Notice that we initialize our XGBoost classifier with ```n_estimators=100``` and use the default values for all the other hyperparameters. We will talk about how to tune hyperparameters in a different notebook. For now, you can read about the other hyperparameters [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html).

## Predictions and Probabilities

Once you have trained your XGBoost classifier, you can give it new data and obtain either predictions of the target (0s or 1s) or probabilities (numbers between 0 and 1).

The probabilities are obtained using the method ```predict_proba```, while the predictions are obtained using the method ```predict```. In the following block of code we obtain the predictions and probabilities for the testing set, and print the first 5 elements of each array.

In [6]:
predictions = xgb_clf.predict(test_x)
probabilities = xgb_clf.predict_proba(test_x)

In [7]:
print(predictions[0:5])

[0 1 0 0 0]


In [8]:
print(probabilities[:5])

[[0.53015244 0.46984753]
 [0.40298903 0.59701097]
 [0.9807196  0.01928039]
 [0.97738254 0.02261743]
 [0.56789935 0.43210062]]


While ```predictions``` is a one-dimensional array in which each element is 0 or 1, ```probabilities``` is a two-dimensional array in which each row contains 2 elements. The first element is the probability of belonging to class 0, and the second is the probability of belonging to class 1. The two numbers always sum to 1.

The probabilities carry more information than the predictions. In fact, the predictions can be constructed from the probabilities: choose a threshold, and if the probability for class 1 is larger than the threshold the prediction is 1; otherwise it is 0.
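
As a minimal sketch of this thresholding, the snippet below hard-codes the five probability rows printed earlier and applies a threshold to the class 1 column. The helper name `predict_with_threshold` and the alternative threshold of 0.45 are illustrative choices and not part of xgboost.

```python
# The five rows printed earlier: [P(class 0), P(class 1)]
first_five_probs = [
    [0.53015244, 0.46984753],
    [0.40298903, 0.59701097],
    [0.9807196, 0.01928039],
    [0.97738254, 0.02261743],
    [0.56789935, 0.43210062],
]


def predict_with_threshold(probs, threshold=0.5):
    """Return 1 where P(class 1) exceeds the threshold, otherwise 0."""
    return [1 if p1 > threshold else 0 for _, p1 in probs]


default_preds = predict_with_threshold(first_five_probs)
lenient_preds = predict_with_threshold(first_five_probs, threshold=0.45)
row_sums = [p0 + p1 for p0, p1 in first_five_probs]

print(default_preds)  # [0, 1, 0, 0, 0]
print(lenient_preds)  # [1, 1, 0, 0, 0]
```

For these five rows, a threshold of 0.5 reproduces the predictions printed earlier, while lowering it to 0.45 flips the first row to class 1. Each entry of `row_sums` equals 1 up to rounding.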

## Final words
We have covered the basic concepts needed to create and train an XGBoost classifier, and how to obtain predictions and probabilities. Now it is your turn to start coding! Try using some metrics (such as precision and recall) to evaluate the performance of your model, and try specifying other hyperparameters.
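
As a starting point for the evaluation exercise, here is a rough sketch that computes precision and recall for class 1 by hand. The predictions are the five printed earlier in this notebook; the true labels are made up for illustration and are not taken from **test_y**.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Predictions printed earlier for the first five test rows
example_pred = [0, 1, 0, 0, 0]
# Made-up true labels, for illustration only
example_true = [0, 1, 1, 0, 0]

precision, recall = precision_recall(example_true, example_pred)
print(precision, recall)  # 1.0 0.5
```

In practice you would pass **test_y** and the full `predictions` array instead of these toy lists. scikit-learn also provides `precision_score` and `recall_score` in `sklearn.metrics`, which compute the same quantities.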
