# Secure XGBoost Tutorial
#### RISE Camp tutorial on the Secure XGBoost project.

## Single Party XGBoost

Import the necessary libraries

In [12]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

Load in and examine our training data

In [13]:
training_data = pd.read_csv('/home/ubuntu/data/msd_training_data_split.csv', sep=",", header=None)
training_data.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2007 | 25.23214 | -232.77465 | -37.51542 | -40.34335 | 56.11564 | -55.94831 | 43.06882 | 15.46278 | -38.6737 | ... | -15.64058 | 257.69408 | 113.5974 | -90.14988 | -13.41911 | -72.59105 | -185.49959 | 1.16272 | -73.13128 | -6.89193 |
| 1 | 2007 | 27.96974 | -166.08713 | -11.19265 | -28.07397 | -56.10902 | -35.47258 | 23.35854 | 7.19973 | -36.81179 | ... | 21.49227 | 289.05914 | -34.75972 | -19.38242 | 2.44006 | -67.78591 | -46.62749 | 0.38383 | 98.98315 | 13.14364 |
| 2 | 2007 | 24.75152 | -97.45055 | -40.15226 | -43.39929 | -57.25665 | -33.93026 | -1.95605 | 0.93121 | 7.76578 | ... | -5.96584 | 573.94557 | 11.83355 | -107.81947 | -3.42495 | -141.79299 | -150.794 | 0.55715 | 148.7149 | -2.41587 |
| 3 | 2007 | 20.19082 | -162.50028 | -123.04788 | -71.11772 | -8.96605 | -51.72176 | 30.5383 | 15.27979 | -34.99486 | ... | -73.13628 | 18.76005 | 46.07843 | -309.69087 | -24.52842 | -35.79334 | -774.53143 | 3.34849 | -194.68101 | -41.23842 |
| 4 | 2007 | 25.10092 | -189.85543 | -28.69605 | -34.42398 | 24.64007 | -55.86989 | 63.91339 | 17.88235 | -3.39713 | ... | -3.70478 | 40.14964 | 95.55738 | -36.47506 | -8.63102 | -34.57157 | -13.6361 | 8.25615 | 108.42127 | 3.51335 |


In [14]:
y_train = training_data.iloc[:, 0]
y_train.head()

0    2007
1    2007
2    2007
3    2007
4    2007
Name: 0, dtype: int64

In [15]:
x_train = training_data.iloc[:, 1:]
x_train.head()

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25.23214 | -232.77465 | -37.51542 | -40.34335 | 56.11564 | -55.94831 | 43.06882 | 15.46278 | -38.6737 | -10.30987 | ... | -15.64058 | 257.69408 | 113.5974 | -90.14988 | -13.41911 | -72.59105 | -185.49959 | 1.16272 | -73.13128 | -6.89193 |
| 1 | 27.96974 | -166.08713 | -11.19265 | -28.07397 | -56.10902 | -35.47258 | 23.35854 | 7.19973 | -36.81179 | -7.84188 | ... | 21.49227 | 289.05914 | -34.75972 | -19.38242 | 2.44006 | -67.78591 | -46.62749 | 0.38383 | 98.98315 | 13.14364 |
| 2 | 24.75152 | -97.45055 | -40.15226 | -43.39929 | -57.25665 | -33.93026 | -1.95605 | 0.93121 | 7.76578 | 4.96972 | ... | -5.96584 | 573.94557 | 11.83355 | -107.81947 | -3.42495 | -141.79299 | -150.794 | 0.55715 | 148.7149 | -2.41587 |
| 3 | 20.19082 | -162.50028 | -123.04788 | -71.11772 | -8.96605 | -51.72176 | 30.5383 | 15.27979 | -34.99486 | -5.25631 | ... | -73.13628 | 18.76005 | 46.07843 | -309.69087 | -24.52842 | -35.79334 | -774.53143 | 3.34849 | -194.68101 | -41.23842 |
| 4 | 25.10092 | -189.85543 | -28.69605 | -34.42398 | 24.64007 | -55.86989 | 63.91339 | 17.88235 | -3.39713 | 4.55056 | ... | -3.70478 | 40.14964 | 95.55738 | -36.47506 | -8.63102 | -34.57157 | -13.6361 | 8.25615 | 108.42127 | 3.51335 |


Do the same with the test data

In [16]:
test_data = pd.read_csv('/home/ubuntu/data/msd_test_data_split.csv', sep=",", header=None)
y_test = test_data.iloc[:, 0]
x_test = test_data.iloc[:, 1:]

Train the model with the training data

In [19]:
model = xgb.XGBRegressor(n_estimators=20)
model.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=20,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

Get predictions and evaluate the model with the test data

In [20]:
preds = model.predict(x_test)
# scikit-learn's signature is mean_squared_error(y_true, y_pred)
np.sqrt(mean_squared_error(y_test, preds))

243.45763709390334
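
The number above is a root mean squared error (RMSE): the square root of the average squared difference between each true label and its prediction. As a minimal sketch of what that computes, here is a dependency-free version. The `rmse` helper and the sample years are made up for illustration and are not part of the tutorial's data or of any library.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical release years and predictions, for illustration only.
years = [2007, 2001, 1995]
predicted = [2005, 2001, 1999]
print(rmse(years, predicted))
```

Because each residual is squared, the metric is symmetric in its two arguments, and it is expressed in the same unit as the label (years, in this dataset).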

## Federated XGBoost
We will now discuss running XGBoost in the federated setting.

### Edit hosts.config
The `hosts.config` file should contain the IP address and port of every worker in the federation, one `ip:port` entry per line. After loading `hosts.config` into a cell with the `%load` magic, edit it to contain the IPs of your new friends! Then write the new addresses back to the file by adding this magic to the top of the cell:

`%%writefile hosts.config`

Make sure to delete the `# %load hosts.config` line from the cell before saving it. We'll be continually using the `%load` and `%%writefile` magics in this tutorial to edit files.

In [3]:
%%writefile hosts.config
35.167.132.178:22
34.222.205.126:22
34.222.177.218:22


Overwriting hosts.config
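
A typo in `hosts.config` is easy to make when copying addresses by hand. The following is a minimal sketch of a sanity check, assuming the one-`ip:port`-per-line layout shown in the cell above. `parse_hosts` is a hypothetical helper written for this example, not part of the federated XGBoost tooling.

```python
import ipaddress


def parse_hosts(text):
    """Parse 'ip:port' lines into (ip, port) tuples, skipping blank lines."""
    hosts = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        ip, sep, port = line.rpartition(":")
        if not sep:
            raise ValueError(f"line {line_number}: expected ip:port, got {line!r}")
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        port_number = int(port)   # raises ValueError on a non-numeric port
        if not 1 <= port_number <= 65535:
            raise ValueError(f"line {line_number}: port {port_number} out of range")
        hosts.append((ip, port_number))
    return hosts


# Same entries as the cell above; in the notebook you could instead pass
# open("hosts.config").read().
sample = """35.167.132.178:22
34.222.205.126:22
34.222.177.218:22
"""
hosts = parse_hosts(sample)
print(len(hosts), "hosts parsed")
```

The number of parsed entries is also a convenient thing to compare against the number of parties you expect in the federation.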


### Modifying the Training/Eval Script
We will now modify the script that will be run for training and evaluation. Load it in by running the following cell. The contents of the script should appear in the cell.

In [5]:
%%writefile tutorial.py
from FederatedXGBoost import FederatedXGBoost

# Instantiate a FederatedXGBoost instance
fxgb = FederatedXGBoost()

# Get number of federating parties
print(fxgb.get_num_parties())

# Load training data
fxgb.load_training_data('/home/ubuntu/data/msd_training_data_split.csv')

# Train a model
params = {'max_depth': 3, 'min_child_weight': 1.0, 'lambda': 1.0}
num_rounds = 50
fxgb.train(params, num_rounds)

# Save the model
fxgb.save_model("tutorial_model.model")

# Load the test data
fxgb.load_test_data('/home/ubuntu/data/msd_test_data_split.csv')

# Evaluate the model
print(fxgb.eval())

# Get predictions
ypred = fxgb.predict()

# Shutdown
fxgb.shutdown()


Overwriting tutorial.py


### Start Job
After modifying the script, we can start our job! We use the `start_job.sh` script with the given options to do so.

The following flags must be specified when running the script.

`./start_job.sh`

* `-m | --worker-memory` string, specified as `<memory>g`, e.g. `3g`
    * Amount of memory on each worker allocated to the job
* `-p | --num-parties` integer
    * Number of parties in the federation
* `-d | --dir` string
    * Path to the subdirectory you created that contains the job script, e.g. `/home/ubuntu/mc2/federated-xgboost/risecamp`
* `-j | --job` string
    * Path to the job script. This should be the path passed to the `--dir` option joined with the job script's file name, e.g. `/home/ubuntu/mc2/federated-xgboost/risecamp/tutorial.py`
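
The relationship between `--dir` and `--job` described above can be sketched in a few lines of Python. This only assembles the command line from the four documented flags; `build_start_job_command` is a hypothetical helper for illustration and the check on the memory string assumes the `<memory>g` form given in the flag list.

```python
import posixpath
import re
import shlex


def build_start_job_command(job_dir, script_name, num_parties, worker_memory):
    """Assemble the start_job.sh argument list from the documented flags."""
    if not re.fullmatch(r"\d+g", worker_memory):
        raise ValueError("worker memory must look like '3g'")
    # The --job path is the --dir path joined with the script's file name.
    job_path = posixpath.join(job_dir, script_name)
    return [
        "./start_job.sh",
        "-p", str(num_parties),
        "-m", worker_memory,
        "-d", job_dir,
        "-j", job_path,
    ]


command = build_start_job_command(
    "/home/ubuntu/mc2/federated-xgboost/risecamp", "tutorial.py", 3, "3g"
)
# shlex.join quotes each argument so the string is safe to paste into a shell.
print(shlex.join(command))
```

Deriving the job path from the directory, rather than typing both by hand, keeps the two flags from drifting apart.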

In [6]:
!./start_job.sh -p 3 -m 3g -d /home/ubuntu/mc2/federated-xgboost/risecamp/ -j /home/ubuntu/mc2/federated-xgboost/risecamp/tutorial.py

2019-09-18 08:31:27,924 INFO start listen on 172.31.41.140:9091
2019-09-18 08:31:27,932 INFO rsync /home/ubuntu/mc2/federated-xgboost/risecamp/ -> 35.167.132.178:/home/ubuntu/mc2/federated-xgboost/risecamp/
2019-09-18 08:31:27,932 INFO rsync /home/ubuntu/mc2/federated-xgboost/risecamp/ -> 34.222.205.126:/home/ubuntu/mc2/federated-xgboost/risecamp/
2019-09-18 08:31:27,932 INFO rsync /home/ubuntu/mc2/federated-xgboost/risecamp/ -> 34.222.177.218:/home/ubuntu/mc2/federated-xgboost/risecamp/
2019-09-18 08:31:30,319 INFO @tracker All of 3 nodes getting started
3
[08:31:46] Tree method is automatically selected to be 'approx' for distributed training.
[0]	eval-rmse:9.287228
3
[08:31:46] Tree method is automatically selected to be 'approx' for distributed training.
[0]	eval-rmse:9.287228
2019-09-18 08:32:10,584 INFO @tracker All nodes finishes j