## Multiparty XGBoost with Federated Training
We will now discuss running XGBoost in the federated setting. Unlike in the previous exercise, all data stays on its respective machine. This eliminates the need to transfer the data over the network, which incurs high overhead and requires significant bandwidth. Instead, in each iteration, each party sends a summary of the update made to its local model. The central server then aggregates these updates, applies the aggregated update to its model, and broadcasts the new model to all parties. The parties then train locally with the new model and send their updates back to the central server.

![title](img/exercise3.png)
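
The round described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the project's or XGBoost's implementation: the function names, the representation of an update as a list of floats, and the use of an elementwise sum for aggregation are all assumptions made for the example.

```python
from typing import List


def aggregate_updates(updates: List[List[float]]) -> List[float]:
    """Server side: sum the parties' update summaries element by element."""
    if not updates:
        raise ValueError("need at least one update")
    length = len(updates[0])
    if any(len(update) != length for update in updates):
        raise ValueError("all updates must have the same length")
    return [sum(values) for values in zip(*updates)]


def federated_round(model: List[float], local_updates: List[List[float]],
                    step: float = 1.0) -> List[float]:
    """Apply the aggregated update to the server's model.

    The returned list is the new model that would be broadcast to all parties.
    """
    aggregated = aggregate_updates(local_updates)
    return [weight + step * delta for weight, delta in zip(model, aggregated)]


# Two hypothetical parties, each sending a 3-element summary.
# Only these summaries cross the network; the raw training data does not.
model = [0.0, 0.0, 0.0]
party_updates = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
new_model = federated_round(model, party_updates)
print(new_model)  # [1.5, 2.5, 3.5]
```

The point of the sketch is the data flow: each party contributes a fixed-size summary, and only the server combines them.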

In our project, all this is abstracted away. The central server simply starts the training, and everything else is performed automatically.

Import some helper functions.

In [1]:
import pandas as pd
import subprocess
from start_job import start_job

### Edit hosts.config 
The `hosts.config` file should contain the IP and port of every worker in the federation, one `IP:port` entry per line. After loading the file into the cell below, edit it so that it lists all parties in the federation. Then write the new addresses back to the file by adding a magic to the top of the cell:

`%%writefile hosts.config`

Make sure to delete the `# %load hosts.config` line from the cell before saving it. We'll be continually using the `%load` and `%%writefile` magics in this tutorial to edit files.

In [None]:
# %load hosts.config
35.167.132.178:22
34.222.205.126:22
34.222.177.218:22
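
Before writing the file back, it can help to check that every entry is well formed. The helper below is a minimal sketch, not part of the project: `parse_hosts` is a hypothetical name, and it assumes IPv4 `IP:port` entries and that lines starting with `#` (such as a leftover `# %load` line) can be ignored.

```python
import ipaddress
from typing import List, Tuple


def parse_hosts(text: str) -> List[Tuple[str, int]]:
    """Parse 'IP:port' lines, skipping blank lines and lines starting with '#'."""
    hosts = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        ip_text, separator, port_text = entry.rpartition(":")
        if not separator:
            raise ValueError(f"line {line_number}: expected IP:port, got {entry!r}")
        ipaddress.ip_address(ip_text)  # raises ValueError for a malformed IP
        port = int(port_text)          # raises ValueError for a non-numeric port
        if not 0 < port < 65536:
            raise ValueError(f"line {line_number}: port {port} is out of range")
        hosts.append((ip_text, port))
    return hosts


sample = """# %load hosts.config
35.167.132.178:22
34.222.205.126:22
34.222.177.218:22
"""
hosts = parse_hosts(sample)
print(len(hosts))  # 3
```

The length of the returned list is the number of parties listed in the file.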

### Set Variables For Network Analysis
We'll also walk you through inspecting packets during this tutorial to verify that the network topology is indeed federated. For each variable below, fill in the corresponding IP address (the ordering of the worker nodes does not matter).

In [4]:
master = '0'
worker_1 = '1'
worker_2 = '2'
worker_3 = '3'

### Modifying the Training Script
We will now modify the script that will be run for federated training. Load it in by running the following cell. The contents of the script should appear in the cell. 

The central server controls the training. If you're the central server, you can play with the `params` argument passed into the `train()` function. A list of possible parameters and their descriptions can be found [here](https://xgboost.readthedocs.io/en/latest/parameter.html).

In [None]:
%load train_model.py
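
As a starting point for experimenting, a `params` dictionary might look like the sketch below. The keys are standard XGBoost parameters from the page linked above, but the values, the binary classification objective, and the exact shape that `train()` in `train_model.py` expects are assumptions.

```python
# Hypothetical starting point; whether train() expects exactly this shape is an assumption.
params = {
    "objective": "binary:logistic",  # assumes a binary classification task
    "max_depth": 3,                  # maximum depth of each tree
    "eta": 0.1,                      # learning rate (step size shrinkage)
    "eval_metric": "logloss",        # metric reported during training
}

# Try a deeper, slower-learning variant without modifying the original dictionary.
deeper_params = {**params, "max_depth": 6, "eta": 0.05}
print(deeper_params["max_depth"], params["max_depth"])  # 6 3
```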

### Using tcpdump to Capture Packets
Here, we will use `tcpdump` to monitor the network traffic during training. The cell below spawns a subprocess that records the traffic on the network interface and writes it to `capture.pcap`.

In [None]:
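# Capture on interface en0 (change it if yours differs) and write full packets to capture.pcap.
# An argument list without shell=True lets terminate() signal tcpdump itself, not a wrapper shell.
tcpdump_cmd = ['tcpdump', '-ni', 'en0', '-s0', '-w', 'capture.pcap']
tcpdump_process = subprocess.Popen(tcpdump_cmd)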

### Start Job
After modifying the script, we can start our job! We can use the `start_job()` helper function to do so.
`start_job(num_parties, memory, script_path)` takes in three parameters:
* `num_parties`: The number of parties in the federation. This should equal the number of entries in `hosts.config`.
* `memory`: The amount of memory to use for this job on each party's machine.
* `script_path`: The absolute path to the script we want to run.

After `start_job()` finishes, we terminate the `tcpdump` subprocess so that it finishes writing the `.pcap` file.

In [None]:
start_job(2, 3, "/home/$USER/train_model.py")
tcpdump_process.terminate()
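
Calling `terminate()` only sends the termination request; it does not wait for the process to exit. A slightly more careful shutdown, sketched below, waits for the process and falls back to `kill()` if it does not stop in time. `stop_capture` is a hypothetical helper, and a sleeping Python child process stands in for `tcpdump` so that the example runs anywhere.

```python
import subprocess
import sys


def stop_capture(process: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the capture process to exit, wait for it, and return its exit code."""
    process.terminate()  # polite request (SIGTERM on POSIX)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # force the process to stop if it ignored the request
        return process.wait()


# Stand-in for tcpdump: a child Python process that just sleeps.
dummy = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_capture(dummy)
print(dummy.poll() is not None)  # True
```

In the notebook, `stop_capture(tcpdump_process)` could take the place of the bare `terminate()` call.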

## Network Analysis
The goal of this section is to show that the workers do not communicate with each other at all during training. First, let's convert the `.pcap` file we created with `tcpdump` to a `.csv` file:

In [None]:
!tshark -r capture.pcap -T fields -e frame.number -e eth.src -e eth.dst -e ip.src -e ip.dst -e frame.len -E header=y -E separator=, > capture.csv

### Loading
In the cells below, we first load the `.csv` created by `tshark` into a pandas DataFrame and drop all rows that have `NaN` in either of the IP columns. Then, we replace the IPs of the members of the federation with readable labels.

In [13]:
capture = pd.read_csv('capture.csv', names=['Frame Number', 'Ethernet Source', 'Ethernet Destination', 
                                            'IP Source', 'IP Destination', 'Frame Length'], header=0)
capture.dropna(subset=['IP Source', 'IP Destination'], inplace=True)
capture.head()

| | Frame Number | Ethernet Source | Ethernet Destination | IP Source | IP Destination | Frame Length |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | f0:18:98:1d:21:88 | 01:00:5e:7f:ff:fa | 192.168.42.9 | 239.255.255.250 | 217 |
| 63 | 64 | f0:18:98:1d:21:88 | 01:00:5e:7f:ff:fa | 192.168.42.9 | 239.255.255.250 | 217 |
| 64 | 65 | 3c:90:66:7e:eb:d3 | ac:bc:32:aa:36:69 | 34.194.213.63 | 192.168.42.5 | 420 |
| 65 | 66 | ac:bc:32:aa:36:69 | 3c:90:66:7e:eb:d3 | 192.168.42.5 | 34.194.213.63 | 66 |
| 66 | 67 | 3c:90:66:7e:eb:d3 | ac:bc:32:aa:36:69 | 34.194.213.63 | 192.168.42.5 | 97 |


In [16]:
labels = {master: 'Master', worker_1: 'worker_1', worker_2: 'worker_2', worker_3: 'worker_3'}
capture.replace(labels, inplace=True)
capture.head()

| | Frame Number | Ethernet Source | Ethernet Destination | IP Source | IP Destination | Frame Length | IP Source to IP Destination |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | f0:18:98:1d:21:88 | 01:00:5e:7f:ff:fa | 192.168.42.9 | 239.255.255.250 | 217 | 192.168.42.9 -> 239.255.255.250 |
| 63 | 64 | f0:18:98:1d:21:88 | 01:00:5e:7f:ff:fa | 192.168.42.9 | 239.255.255.250 | 217 | 192.168.42.9 -> 239.255.255.250 |
| 64 | 65 | 3c:90:66:7e:eb:d3 | ac:bc:32:aa:36:69 | 34.194.213.63 | 192.168.42.5 | 420 | 34.194.213.63 -> 192.168.42.5 |
| 65 | 66 | ac:bc:32:aa:36:69 | 3c:90:66:7e:eb:d3 | 192.168.42.5 | 34.194.213.63 | 66 | 192.168.42.5 -> 34.194.213.63 |
| 66 | 67 | 3c:90:66:7e:eb:d3 | ac:bc:32:aa:36:69 | 34.194.213.63 | 192.168.42.5 | 97 | 34.194.213.63 -> 192.168.42.5 |


### Preprocessing
Create a new column, `Transmission`, that describes each packet as `source -> destination`, so that we can see the communications to and from this node. To get the total number of bytes sent for each transmission, group by this new column and sum the frame lengths. To get the number of packets for each transmission, count how many times that transmission occurs.

In [36]:
# Describe each packet as "source -> destination".
capture['Transmission'] = capture['IP Source'] + ' -> ' + capture['IP Destination']
# Sum only 'Frame Length' per transmission; indexing a groupby with several bare keys fails in recent pandas.
count_bytes = capture.groupby('Transmission', as_index=False)['Frame Length'].sum()
count_bytes.rename(columns={'Frame Length': 'Total Bytes Transmitted'}, inplace=True)

### Number of Bytes Sent

In [38]:
count_bytes

| | Transmission | Total Bytes Transmitted |
| --- | --- | --- |
| 0 | 151.101.189.140 -> 192.168.42.5 | 570386 |
| 1 | 157.240.22.19 -> 192.168.42.5 | 1529 |
| 2 | 157.240.22.35 -> 192.168.42.5 | 132 |
| 3 | 172.217.0.34 -> 192.168.42.5 | 178864 |
| 4 | 172.217.0.35 -> 192.168.42.5 | 66 |
| 5 | 172.217.0.46 -> 192.168.42.5 | 1905 |
| 6 | 172.217.164.98 -> 192.168.42.5 | 5171 |
| 7 | 172.217.164.99 -> 192.168.42.5 | 4750 |
| 8 | 172.217.5.102 -> 192.168.42.5 | 1155696 |
| 9 | 172.217.5.97 -> 192.168.42.5 | 127184 |


In [34]:
capture['Transmission'].value_counts().rename_axis('Transmission').reset_index(name='Number of Packets')

| | Transmission | Number of Packets |
| --- | --- | --- |
| 0 | 172.217.5.102 -> 192.168.42.5 | 819 |
| 1 | 192.168.42.5 -> 172.217.5.102 | 505 |
| 2 | 151.101.189.140 -> 192.168.42.5 | 451 |
| 3 | 192.168.42.5 -> 151.101.189.140 | 366 |
| 4 | 192.168.42.5 -> 23.203.221.142 | 230 |
| 5 | 23.203.221.142 -> 192.168.42.5 | 207 |
| 6 | 172.217.0.34 -> 192.168.42.5 | 151 |
| 7 | 172.217.5.97 -> 192.168.42.5 | 92 |
| 8 | 192.168.42.5 -> 172.217.0.34 | 92 |
| 9 | 192.168.42.5 -> 172.217.5.97 | 66 |
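
To turn the stated goal of this section into an explicit check, one can count the packets whose source and destination are both workers. The function below is a sketch under assumptions: `worker_to_worker_packets` is a hypothetical helper, the example pairs are made up rather than taken from the capture, and the labels match the ones assigned in the renaming step.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def worker_to_worker_packets(pairs: Iterable[Tuple[str, str]],
                             workers: Set[str]) -> Counter:
    """Count packets whose source and destination are two different workers."""
    counts: Counter = Counter()
    for source, destination in pairs:
        if source in workers and destination in workers and source != destination:
            counts[f"{source} -> {destination}"] += 1
    return counts


# Made-up (source, destination) pairs: every packet involves the master.
example_pairs = [
    ("Master", "worker_1"),
    ("worker_1", "Master"),
    ("Master", "worker_2"),
    ("worker_2", "Master"),
]
workers = {"worker_1", "worker_2", "worker_3"}
violations = worker_to_worker_packets(example_pairs, workers)
print(sum(violations.values()))  # 0
```

With the DataFrame from this notebook, the pairs could be supplied as `zip(capture['IP Source'], capture['IP Destination'])`. Keep in mind that a capture only contains packets visible on the machine where `tcpdump` ran.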


## Modifying the Evaluation Script
We'll now use the model we trained in the previous step to make predictions on our test data. Load the test script as in the previous step.

In [None]:
%load test_model.py

In [None]:
start_job(2, 3, "/home/$USER/test_model.py")