## Multiparty XGBoost with Centralized Training
In this exercise, we'll demonstrate a workflow in which each party sends a copy of its own data to a central server. All of the training data therefore travels over the network to the central server, which collects it and trains a model locally on the combined data. The central server then broadcasts the trained model back to the parties, who load it and test it on their local test datasets.

![title](img/exercise2.png)


We will also measure the number of bytes sent over the network to show how much bandwidth this workflow requires.
Comparing the centrally trained model with your locally trained one also shows the benefit of training on as much data as possible to make the model more robust.

### Data Transfer
Import the necessary libraries

In [None]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from Utils import scp

Make sure you have set up your SSH credentials as described in Exercise 0. Then use `scp` to copy the training data you used in Exercise 1 to the aggregator (the central server), and note how many bytes are transferred over the network.

* Training data for the hospital dataset is at `/data/hospital/hospital_training_{party_id}.csv`
* Training data for the insurance dataset is at `/data/insurance/insurance_training_{party_id}.csv`

In [None]:
# Make sure you use the training data you used in exercise 1
training_data = "/path/to/training_data" # TODO: fill in the path to the training data
dest_ip = "aggregator_ip" # TODO: fill in the IP of the aggregator
dest_dir = "~/shared_data"
scp(training_data, dest_ip, dest_dir)
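To note how many bytes a transfer involves, one option is to check the size of the file before sending it. The snippet below is a minimal sketch using only the standard library; `payload_size_bytes` is a hypothetical helper (not part of `Utils`), and the temporary file stands in for your training CSV. The bytes actually sent over the wire will be somewhat higher because of SSH protocol overhead.

```python
import os
import tempfile


def payload_size_bytes(path):
    """Return the on-disk size, in bytes, of the file at `path`."""
    return os.path.getsize(os.path.expanduser(path))


# Stand-in for a training CSV (hypothetical contents), so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b,label\n1,2,0\n")
    tmp_path = tmp.name

size = payload_size_bytes(tmp_path)
print(f"{size} bytes to transfer")

os.remove(tmp_path)  # clean up the throwaway file
```

In the exercise you would call the helper on the same `training_data` path that you pass to `scp`.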

### Aggregate the Received Data
If you're the aggregator (the central server), load all the data that has been sent to your machine. For example, if three other parties sent you data, make four calls to `read_csv()`: one for your own data and three for the other parties' data.

**Do this only if you're the central server. If you're not the central server, it's time to take a break.**

In [None]:
# TODO: load in all the training data that the parties have sent
# The data should've been sent to the shared_data directory

header = # TODO: 0 if using insurance dataset, None if using hospital dataset

# TODO: modify the path to shared training data. The data will be in the `shared_data` directory
# e.g., shared_data/hospital_training_1.csv
aggregator_training_data = pd.read_csv('path/to/shared/data.csv', sep=",", header=header)
p2_training_data = pd.read_csv('path/to/shared/data.csv', sep=",", header=header)

Concatenate all the data in preparation for training

In [None]:
aggregated_training_data = pd.concat([]) # TODO: add training data to the argument passed to pd.concat()
aggregated_training_data.shape
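As a rough illustration of the concatenation step, the sketch below stacks two toy DataFrames row-wise. The CSV contents are made-up stand-ins for the parties' files, and `ignore_index=True` is an optional choice (not required by the exercise) that renumbers the rows instead of repeating each party's own index.

```python
import io

import pandas as pd

# Toy stand-ins for two parties' CSV files (hypothetical values, no header row).
party_1_csv = io.StringIO("1.0,2.0,0\n3.0,4.0,1\n")
party_2_csv = io.StringIO("5.0,6.0,1\n7.0,8.0,0\n9.0,1.0,1\n")

frames = [pd.read_csv(f, sep=",", header=None) for f in (party_1_csv, party_2_csv)]

# Stack the rows of every party's data into a single DataFrame.
# ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

The row count of the result should equal the sum of the row counts of the inputs, which is a quick sanity check that no party's data was dropped.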

In [None]:
# TODO: Split the aggregated training data into features and labels
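One common way to split a DataFrame into features and labels is positional slicing with `iloc`. The sketch below assumes, purely for illustration, that the label sits in the last column; check which column holds the label in your dataset (as you did in Exercise 1) before reusing it. The toy values are made up.

```python
import pandas as pd

# Toy data (hypothetical): two feature columns followed by a label column.
toy = pd.DataFrame([[1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 1]])

# Assumption: the label is the LAST column. Adjust the slices if it is not.
toy_features = toy.iloc[:, :-1]  # every column except the last
toy_labels = toy.iloc[:, -1]     # only the last column

print(toy_features.shape, toy_labels.shape)
```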

### Train a Model

In [None]:
# TODO: fit a model to the aggregated training data
multiparty_model = xgb.XGBClassifier()

### Broadcast the Trained Model
Save the trained model and send it to all parties in the federation

In [None]:
multiparty_model.save_model("ex2_model.model")

In [None]:
# If you're the central server, run this cell as many times as needed to send the saved model
# to all parties in the federation
model_path = "ex2_model.model"
dest_ip = "party_ip" # TODO: fill in the IP of the party you're sending the model to
dest_dir = "~"
scp(model_path, dest_ip, dest_dir)

If you're not the central server, your break is over. The aggregator should have sent the centrally trained model to you.

In [None]:
# If you're not the central server, ensure that you received the model and load it in
multiparty_model = xgb.XGBClassifier()
multiparty_model.load_model("ex2_model.model")

### Test Data Preprocessing

In [None]:
# TODO: load in your local test data and preprocess it to split it into features and labels

In [None]:
# TODO: evaluate the model on your local test data

arg1, arg2 = # TODO: set arg1 to the test features, arg2 to the test labels
preds = multiparty_model.predict(arg1)
print(accuracy_score(arg2, preds))
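For reference, the accuracy reported above is the fraction of test examples whose predicted label matches the true label. The plain-Python sketch below reproduces that calculation on made-up labels; `accuracy` is a hypothetical helper written for illustration, not a replacement for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions: three of the four predictions are correct.
toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.75
```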

Discuss the results with other members of your federation. How did the centrally trained model perform on your local test data compared with the locally trained model? Did adding more data help?

Once you're ready, please move to [Exercise 3](Exercise%203.ipynb).