## XGBoost

### Table of contents
* Load and Prepare Data
* Train the XGBoost Model
* Make Predictions with XGBoost Model
* Summary

### Load and Prepare Data

Start by importing classes/functions needed.

In [3]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load the CSV file with the NumPy function `loadtxt()`.

In [4]:
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

Separate the feature columns of the dataset into X and the target variable into Y using NumPy array slicing.

In [5]:
# split data into X and Y
X = dataset[:,0:8]
Y = dataset[:,8]
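
The slicing above can be checked on a small array. This is a minimal sketch using a hypothetical 3-row array (`toy`, `X_toy`, and `y_toy` are illustrative names, not part of the dataset) with the same layout of 8 feature columns followed by 1 label column.

```python
import numpy as np

# Hypothetical stand-in for the dataset: 3 rows, 8 feature columns + 1 label column.
toy = np.arange(27, dtype=float).reshape(3, 9)

X_toy = toy[:, 0:8]  # all rows, columns 0 through 7 (the stop index 8 is excluded)
y_toy = toy[:, 8]    # all rows, column 8 only, returned as a 1-D array

print(X_toy.shape)     # (3, 8)
print(y_toy.shape)     # (3,)
print(y_toy.tolist())  # [8.0, 17.0, 26.0]
```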

Split the data into training and test subsets.  The training set will be used to prepare the XGBoost model and the test set will be used to make new predictions, from which model performance may be determined.

To perform the split, use the `train_test_split()` function from scikit-learn, with a fixed random seed (1234 here) so that the split is reproducible.

In [6]:
# split data into train and test sets
seed = 1234
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
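
The effect of the fixed seed can be illustrated on synthetic data. The sketch below assumes hypothetical arrays `X_demo` and `y_demo` (100 random samples, not the diabetes data) and shows that two calls with the same `random_state` select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 8 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.integers(0, 2, size=100)

# Two splits with the same random_state pick exactly the same rows.
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1234)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1234)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(same)  # True
print(len(X_tr), len(X_te))
```

Changing `random_state` produces a different but equally valid split, so reported accuracy can shift slightly from seed to seed.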

### Train the XGBoost Model

XGBoost provides a wrapper class that allows models to be treated like classifiers or regressors in the scikit-learn framework.  As a result, scikit-learn utilities such as data splitting and evaluation metrics can be used with XGBoost models.

The XGBoost model for classification is called `XGBClassifier`. It can be created and fit to the training dataset. Models are fit through the scikit-learn API using the `model.fit()` function.

Parameters for training the model can be passed to the model in the constructor.  

In [7]:
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

XGBClassifier()

The parameters used in the trained model can be viewed by printing the model.

In [8]:
print(model)

XGBClassifier()


More information about the defaults for the `XGBClassifier` and `XGBRegressor` classes may be found in the [XGBoost Python scikit-learn API](http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn). 

More on the meaning of parameters and how they should be configured is available at the [XGBoost parameters page](http://xgboost.readthedocs.io/en/latest//parameter.html).

Now that the model is fit, it can be used to make predictions.

### Make Predictions with XGBoost Model

Using the fit model, predictions can be made on the test dataset.  Predictions are made through the scikit-learn API, using the function `model.predict()`.

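XGBoost's native API returns probabilities for a binary logistic objective, but the scikit-learn wrapper's `predict()` returns class labels (0 or 1) directly; the per-class probabilities are available from `predict_proba()`.  Since this is a binary classification problem, the `round()` call below leaves the predicted labels unchanged.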

In [9]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
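
When predictions do arrive as probabilities, for example from `predict_proba()`, they need to be converted to labels. The sketch below uses hypothetical probability values (`proba_pos` is an illustrative name) and compares an explicit 0.5 threshold with Python's built-in `round()`, which rounds halves to the nearest even integer.

```python
import numpy as np

# Hypothetical positive-class probabilities, standing in for the second
# column of model.predict_proba(X_test).
proba_pos = np.array([0.10, 0.50, 0.51, 0.93])

# Explicit threshold: label 1 when the probability is at least 0.5.
labels_threshold = (proba_pos >= 0.5).astype(int)

# Built-in round() uses round-half-to-even, so exactly 0.5 becomes 0.
labels_round = [round(float(p)) for p in proba_pos]

print(labels_threshold.tolist())  # [0, 1, 1, 1]
print(labels_round)               # [0, 0, 1, 1]
```

An explicit threshold makes the decision rule visible and easy to adjust, whereas `round()` hides the tie-breaking behaviour.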

With predictions made on new data, their quality can be evaluated by comparing them to the expected values.  The scikit-learn function `accuracy_score()` does this by reporting the fraction of predictions that match the true labels.

In [11]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 77.95%


This out-of-the-box (not optimized) XGBoost model reaches 77.95% accuracy on the test set, which is pretty good for default settings.
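
To make the metric concrete, accuracy can be computed by hand as the fraction of matching labels. This is a minimal sketch with hypothetical label arrays (`y_true_demo` and `y_pred_demo` are not taken from the diabetes data).

```python
import numpy as np

# Hypothetical labels for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy is the fraction of predictions that equal the true labels.
accuracy_demo = float(np.mean(y_true_demo == y_pred_demo))
print("Accuracy: %.2f%%" % (accuracy_demo * 100.0))  # Accuracy: 75.00%
```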

### Summary

Above, an XGBoost model was created, used to make predictions, and evaluated for accuracy on a held-out test set.