# Lab 2 - Training our first model using XGBoost

Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosting works by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
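To make the idea of "each model addresses the weaknesses of the previous ones" concrete, here is a minimal sketch of boosting for a regression problem with squared error. It is an illustration only, not how `xgboost` is implemented (which adds regularization and uses second-order gradient information). The synthetic data, the `learning_rate` of 0.1 and the 50 rounds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, non-linear toy data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                      # how much each simple model contributes
prediction = np.full_like(y, y.mean())   # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # The residuals are what the current ensemble still gets wrong
    residuals = y - prediction
    # Fit a very simple model (a depth-1 tree, or "stump") to those errors
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Add a small step in the direction that reduces the error
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE with constant model: {initial_mse:.3f}")
print(f"Training MSE after {len(trees)} stumps: {final_mse:.3f}")
```

Each stump on its own is a weak model, but because every round targets the remaining error, the training error of the combined model falls as rounds are added.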

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained directly in this notebook.

First, let's install `xgboost`:

In [None]:
!pip install xgboost

Now let's bring in the Python libraries that we'll use throughout the analysis:

In [None]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
from xgboost import XGBClassifier                 # Training our XGBoost model

# ensure graphs are displayed correctly inline in this notebook
%matplotlib inline

Load our training and test datasets, then separate the features from the target column `y_yes`:

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [None]:
X_train = train.drop(['y_yes'],axis=1)
y_train = train['y_yes']
X_test = test.drop(['y_yes'],axis=1)
y_test = test['y_yes']

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)

## Evaluating our model

Now that we have trained our model, it is time to evaluate it. To do this, we will use the test data, which the model has not seen during training. We will compare the model's predictions for the test set with the ground truth provided by the dataset.

In [None]:
from sklearn.metrics import accuracy_score

# get predicted class labels (0 or 1) for the test dataset
y_pred = model.predict(X_test)
# predict() already returns class labels, so this rounding leaves them unchanged
predictions = [round(value) for value in y_pred]
# calculate accuracy
accuracy_score(y_test, predictions)


A simple metric to calculate and understand is accuracy: the fraction of predictions that were correct. However, it can be misleading, especially if the class distribution in the dataset is highly skewed, as it is in our case.
A better metric for imbalanced classification problems is the F1-score. It is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. If you want to learn more, check out this excellent article: https://towardsdatascience.com/essential-things-you-need-to-know-about-f1-score-dbd973bf1a3
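To see why accuracy alone can mislead on skewed data, consider a small made-up example. The labels below are hypothetical (90 negatives, 10 positives), and the "model" simply predicts the majority class for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical, highly skewed labels: 90 negatives and 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A trivial "model" that always predicts the majority class (0)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positives are predicted
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"F1-score: {baseline_f1:.2f}")
```

The trivial model is right 90% of the time yet never finds a single positive case, which is exactly what the F1-score of 0 exposes.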

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, predictions)

The result no longer looks as good. But is it good enough? That is usually a business decision. Let's dig a little deeper and summarize the results in a confusion matrix:

In [None]:
pd.crosstab(index=y_test, columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

So, of the ~4000 potential customers, we predicted 112 would subscribe and 81 of them actually did.  There were also 317 customers who subscribed even though we predicted they would not.  This is less than desirable, but the model can (and should) be tuned to improve this.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](https://core.ac.uk/download/pdf/55631291.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
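The counts quoted above are enough to work out precision, recall and the F1-score by hand. The sketch below uses the numbers from the text (112 predicted subscribers, 81 of them correct, 317 missed subscribers); substitute the values from your own confusion matrix if they differ.

```python
# Counts taken from the text above; your own run may differ
predicted_positive = 112   # customers we predicted would subscribe
true_positive = 81         # predicted subscribers who actually subscribed
false_negative = 317       # subscribers we failed to predict
false_positive = predicted_positive - true_positive

# Precision: of those we targeted, how many subscribed?
precision = true_positive / (true_positive + false_positive)
# Recall: of those who subscribed, how many did we find?
recall = true_positive / (true_positive + false_negative)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

With these counts, precision is about 0.723 while recall is only about 0.204, giving an F1-score of about 0.318: the model is fairly reliable when it predicts a subscriber but misses most of them.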

---
Congratulations, you have trained your first model and made your first predictions. However, training a model on your laptop, or in this case in a notebook running on AWS, is usually not enough. In the next lab you will learn how to automate model training and deploy an endpoint in the cloud to start operationalizing your machine learning model.