# XGBoost

Chapter 7 of our book covers Ensemble Learning and Random Forests. All the methods described in the chapter are worth knowing, but this notebook focuses on XGBoost.

### Install XGBoost
The first thing you need to do is install the XGBoost library. With Anaconda, you can install it using

    conda install -c conda-forge xgboost
    
If that doesn't work, search online for installation instructions that match your particular setup.

## Example using the Iris Data Set

### Load the dataset:


In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

iris = pd.read_csv('https://raw.githubusercontent.com/zacharski/machine-learning/master/data/iris.csv')

iris_train, iris_test = train_test_split(iris, test_size = 0.2)
train_X = iris_train[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
train_y = iris_train['Class']
test_X = iris_test[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
test_y = iris_test['Class']

### Create an instance of the XGBoost classifier

In [6]:
from xgboost import XGBClassifier
model = XGBClassifier()

### The simplest example: fitting the model to the data

In [9]:
model.fit(train_X, train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

### Evaluate the model

In [10]:
from sklearn.metrics import accuracy_score

iris_predictions = model.predict(test_X)
accuracy_score(test_y, iris_predictions)

  if diff:


0.9666666666666667

If you get a deprecation warning about an empty array, it comes from NumPy, whose developers state:

> The long and short is that truth-testing on empty arrays is dangerous, misleading, and not in any way useful, and should be deprecated.

Unfortunately, scikit-learn version 0.19.1 does exactly this. Version 0.19.2 fixes it, but it is not yet (at the time of making this notebook) available in Anaconda.
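
Until a scikit-learn release with the fix is installed, one option is to silence the warning using Python's standard `warnings` module. The following is a minimal sketch, not something this notebook does itself; `predict_quietly` is a hypothetical helper, and it assumes you are comfortable hiding every `DeprecationWarning` raised while predicting.

```python
import warnings


def predict_quietly(model, X):
    """Call model.predict(X) with DeprecationWarning suppressed (hypothetical helper)."""
    with warnings.catch_warnings():
        # The filter only applies inside this block; it is restored on exit,
        # and other warning categories are still shown.
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return model.predict(X)
```

With the model fitted above, `predict_quietly(model, test_X)` could stand in for `model.predict(test_X)`.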

# The task - The Adult Dataset

For this project we are going to use the [Adult Dataset](http://archive.ics.uci.edu/ml/datasets/Adult). The webpage describes the problem. We are trying to predict whether someone makes more than $50,000 a year based on a number of features. The data folder contains both training data `adult.data` and test data `adult.test`. 

## Prepare the data
1. Values that are missing are represented by a `?`. You should remove rows containing missing data.
2. In the training data the values of the column we are trying to predict (`wage_class`) are `<=50K` and `>50K`. Unfortunately, in the test data the values are `<=50K.` and `>50K.` (note the periods). You need to alter this so they are the same. You can use the Pandas DataFrame `.replace` method.
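
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the rows are made up and held in memory rather than read from `adult.data` and `adult.test`, and only three columns are used, of which `age` and `workclass` are illustrative names (only `wage_class` comes from the text above). It also assumes that values are separated by a comma followed by a space.

```python
import io

import pandas as pd

# Made-up rows standing in for adult.data and adult.test.
train_csv = io.StringIO(
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
)
test_csv = io.StringIO(
    "25, Private, <=50K.\n"
    "18, ?, >50K.\n"
    "40, Private, >50K.\n"
)
columns = ["age", "workclass", "wage_class"]


def load_adult(source):
    # skipinitialspace drops the blank after each comma so ' ?' is read as '?';
    # na_values turns '?' into NaN so dropna() can remove those rows.
    df = pd.read_csv(source, header=None, names=columns,
                     skipinitialspace=True, na_values="?")
    return df.dropna().reset_index(drop=True)


train = load_adult(train_csv)
test = load_adult(test_csv)

# Make the test labels match the training labels by removing the trailing period.
test["wage_class"] = test["wage_class"].replace({"<=50K.": "<=50K", ">50K.": ">50K"})
```

Passing a dictionary to `.replace` keeps both label fixes in one call and leaves any other values untouched.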

## Finding the best hyperparameters
We are interested in finding the best combination of hyperparameters:
* max_depth with values 3, 5, and 7
* min_child_weight with values 1, 3, 5

The hyperparameters of the XGBClassifier we are not adjusting include:

    'learning_rate': 0.1,
    'n_estimators': 1000,
    'seed': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic'
       
Finally, the hyperparameters of the grid search include:

    scoring = 'accuracy',
    cv = 5,
    n_jobs = -1
    
#### Here is what I got

    optimized_GBM.best_params_
    {'max_depth': 3, 'min_child_weight': 1}
    
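    cvres = optimized_GBM.cv_results_
    # Note: np.sqrt is applied to the mean accuracy, so each value printed below
    # is the square root of the accuracy (0.932 corresponds to roughly 0.869).
    for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
        print(np.sqrt(mean_score), params)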
         
    0.9320100122602681 {'max_depth': 3, 'min_child_weight': 1}
    0.9319922256399311 {'max_depth': 3, 'min_child_weight': 3}
    0.9313872784451372 {'max_depth': 3, 'min_child_weight': 5}
    0.92858873422616 {'max_depth': 5, 'min_child_weight': 1}
    0.9288386283598029 {'max_depth': 5, 'min_child_weight': 3}
    0.9280172949509803 {'max_depth': 5, 'min_child_weight': 5}
    0.9262113772874686 {'max_depth': 7, 'min_child_weight': 1}
    0.9250652086256056 {'max_depth': 7, 'min_child_weight': 3}
    0.926372443522455 {'max_depth': 7, 'min_child_weight': 5}
    
## Finally, run the best model on the test data

    model = optimized_GBM.best_estimator_
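
Scoring that model could look like the sketch below, which mirrors the Iris evaluation earlier in this notebook. `evaluate_on_test` is a hypothetical helper, and `test_X` and `test_y` are assumed to be the prepared Adult test features and labels.

```python
from sklearn.metrics import accuracy_score


def evaluate_on_test(model, test_X, test_y):
    """Return the accuracy of an already-fitted model on held-out data."""
    predictions = model.predict(test_X)
    return accuracy_score(test_y, predictions)


# Usage, assuming the grid search has been fitted:
# model = optimized_GBM.best_estimator_
# print(evaluate_on_test(model, test_X, test_y))
```

Because `GridSearchCV` refits the best combination on the whole training set by default, `best_estimator_` can be used for prediction without another call to `fit`.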