## Multiparty XGBoost with Centralized Training
In this exercise, we'll demonstrate a workflow in which each party sends a copy of its own data to a central server. All the training data therefore travels over the network to the central server, which collects it and trains a model locally on the combined data. The central server then broadcasts the trained model back to the parties, who load it and test it on their local test datasets.

![title](img/exercise2.png)


We will also measure the number of bytes sent over the network to show how much bandwidth this workflow requires.
The exercise also illustrates the benefit of training on as much data as possible: a model trained on the combined data should be more robust than one trained on a single party's data.

Import the necessary libraries

In [None]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

Make sure you set up SSH credentials in Exercise 0, then `scp` the training data you used in Exercise 1 to the central server. Note how many bytes are transferred over the network; the verbose output enabled by `-v` reports the transfer totals near the end.

As a reminder, the training data is located at `/data/<insurance or hospital>/<insurance or hospital>_training_<party ID>.csv`. For example, if you're party 2 and your federation is using the hospital dataset, your training data is at `/data/hospital/hospital_training_2.csv`.

If you're the central server, you can also run this cell to see how much bandwidth sending the raw data over the network would take.

In [None]:
# Make sure you scp the same training data you used in Exercise 1

!scp -v -P 5522 -o StrictHostKeyChecking=no /data/hospital/hospital_training_1.csv <central server ip>:~/shared_data/
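
As a rough cross-check on the transfer statistics, you can compare them with the size of the file on disk. The sketch below is only an illustration: the helper name `payload_size_bytes` and the temporary demo file are assumptions, not part of the exercise. The total reported by `scp -v` will typically be somewhat larger than the file size because it also counts protocol overhead.

```python
import os
import tempfile


def payload_size_bytes(path):
    """Return the on-disk size of a file in bytes."""
    return os.path.getsize(path)


# Demonstrate on a tiny temporary CSV (made-up contents).
# In the exercise, pass the path of your own training file instead.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    demo_path = tmp.name

demo_size = payload_size_bytes(demo_path)
print(f"{demo_size} bytes would be sent as raw data")

os.remove(demo_path)
```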

If you're the central server, load all the data that has been sent to your machine. For example, if three other parties sent you data, make four calls to `read_csv()`: one for your own data and three for the other parties' data.

**Do this only if you're the central server. If you're not the central server, time to take a break**.

In [None]:
# TODO: load in all the training data that the parties have sent
# The data should've been sent to the ~/shared_data directory

Concatenate all the data in preparation for training

In [None]:
aggregated_training_data = pd.concat([]) # TODO: add training data to the argument passed to pd.concat()
aggregated_training_data.shape
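
One possible way to load and concatenate the files is sketched below. The helper `load_and_concat`, the file names, and the column names are hypothetical, and the demo uses a temporary directory rather than `~/shared_data`. If you adapt it, note that `glob` does not expand `~`, so wrap the directory in `os.path.expanduser()` first.

```python
import glob
import os
import tempfile

import pandas as pd


def load_and_concat(directory, pattern="*.csv"):
    """Read every CSV matching `pattern` in `directory` and stack the rows."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows so that indices from
    # different parties do not collide in the combined frame.
    return pd.concat(frames, ignore_index=True)


# Demonstration with two tiny made-up party files.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"x": [1, 2], "y": [0, 1]}).to_csv(
        os.path.join(demo_dir, "party_1.csv"), index=False
    )
    pd.DataFrame({"x": [3], "y": [1]}).to_csv(
        os.path.join(demo_dir, "party_2.csv"), index=False
    )
    demo_aggregated = load_and_concat(demo_dir)

print(demo_aggregated.shape)
```

This assumes every party's file has the same columns; `pd.concat` would otherwise fill the missing columns with `NaN`.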

In [None]:
# TODO: Split the aggregated training data into features and labels
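
A minimal sketch of the split is shown below. The label column name `"label"` and the feature columns `age` and `bmi` are placeholders, since the actual schema depends on which dataset your federation uses; substitute the target column you used in Exercise 1.

```python
import pandas as pd

# Hypothetical label column name; replace with your dataset's target column.
LABEL_COLUMN = "label"


def split_features_labels(df, label_column):
    """Return (features, labels), where features is df without the label column."""
    features = df.drop(columns=[label_column])
    labels = df[label_column]
    return features, labels


# Made-up rows, only to show the shapes involved.
demo_df = pd.DataFrame(
    {"age": [30, 45, 60], "bmi": [22.0, 27.5, 31.0], "label": [0, 1, 1]}
)
demo_X, demo_y = split_features_labels(demo_df, LABEL_COLUMN)
print(demo_X.shape, demo_y.shape)
```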

In [None]:
# TODO: fit a model to the aggregated training data
multiparty_model = xgb.XGBClassifier()

Save the trained model and send it to all parties in the federation

In [None]:
multiparty_model.save_model("ex2_model.model")

In [None]:
# If you're the central server, run this cell as many times as needed to send the saved model
# to all parties in the federation
!scp -v -P 5522 -o StrictHostKeyChecking=no ex2_model.model <party_ip>:~

If you're not the central server, your break is over. The central server should have sent the centrally trained model to you.

In [None]:
# If you're not the central server, ensure that you received the model and load it in
multiparty_model = xgb.XGBClassifier()
multiparty_model.load_model("ex2_model.model")

Preprocess your local test data

In [None]:
# TODO: load in your local test data and preprocess it to split it into features and labels

In [None]:
# TODO: evaluate the model on your local test data
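
The notebook imports `mean_absolute_error` and `mean_squared_error`, so one plausible evaluation is to compute both on your test predictions. The sketch below uses made-up values for `y_true` and `y_pred`; in the exercise, `y_pred` would come from calling `multiparty_model.predict()` on your local test features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up labels and predictions, only to show how the metrics are called.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |y_true - y_pred|
mse = mean_squared_error(y_true, y_pred)   # mean of (y_true - y_pred)^2

print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
```

Computing the same metrics for your locally trained model from Exercise 1 gives a like-for-like comparison for the discussion.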

Discuss the results with other members of your federation. How did the centrally trained model perform on your local test data compared with the locally trained model? Did adding more data help?

Once you're ready, please move to [Exercise 3](Exercise%203.ipynb).