# MC<sup>2</sup>
This tutorial demonstrates how to use [MC<sup>2</sup>](https://github.com/mc2-project/mc2) (<b>M</b>ultiparty <b>C</b>ollaboration and <b>C</b>ompetition), our platform that enables collaborating parties to jointly perform analytics and train machine learning models on their sensitive data without sharing the contents of the data. In particular, this tutorial focuses on a module of MC<sup>2</sup> that supports gradient boosted decision tree learning, [Secure XGBoost](https://github.com/mc2-project/secure-xgboost).

Secure XGBoost leverages secure enclaves, e.g., Intel SGX, to perform computation in a secure environment. Parties can send their encrypted data to an untrusted server hosting Secure XGBoost, which will then load the data into an enclave before decrypting it. Since enclaves provide encrypted regions of memory, even the OS, hypervisor, and other (privileged) processes on the same machine won't be able to see the unencrypted data or intermediate results during computation.

Secure XGBoost's architecture is shown below. Clients send requests to a central, untrusted RPC orchestrator. The orchestrator queues the requests and relays a request to every enclave server only once all parties have submitted that same request. Computation then happens in a distributed manner across the enclave cluster.
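
The queue-then-relay behavior described above can be illustrated with a small toy model. This is only a sketch under our own assumptions: the `Orchestrator` class, its `submit` method, and the callables standing in for enclave servers are hypothetical names for illustration, not part of the Secure XGBoost API.

```python
from collections import defaultdict


class Orchestrator:
    """Toy stand-in for the RPC orchestrator (hypothetical, for illustration).

    It holds a request until every expected party has submitted the same
    request, then relays it to each enclave server.
    """

    def __init__(self, parties, enclaves):
        self.parties = set(parties)
        self.enclaves = list(enclaves)   # callables standing in for enclave servers
        self.pending = defaultdict(set)  # request -> parties that have submitted it

    def submit(self, party, request):
        if party not in self.parties:
            raise ValueError(f"unknown party: {party}")
        self.pending[request].add(party)
        if self.pending[request] != self.parties:
            return None  # still waiting on at least one party
        del self.pending[request]
        # Every party has made this request, so relay it to each enclave server.
        return [enclave(request) for enclave in self.enclaves]


orchestrator = Orchestrator(
    parties=["enthusiast1", "enthusiast2"],
    enclaves=[lambda r: "enclave-0 ran " + r, lambda r: "enclave-1 ran " + r],
)
first = orchestrator.submit("enthusiast1", "train")   # held back: one party missing
second = orchestrator.submit("enthusiast2", "train")  # relayed to both enclaves
print(first, second)
```

The point of the sketch is that no single party can trigger computation alone: a request only reaches the enclaves after every collaborator has asked for it.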

![Secure XGBoost architecture](figures/sys-arch.png)

If this tutorial were held in person, we'd break everyone into small groups. Each group would form a collaboration and work together to train a model on its pooled data. Pooling yields a larger training set, which generally makes the resulting model more robust.

Unfortunately, since everything is virtual, we cannot break everyone into small groups to form a physical collaboration. Instead, for this tutorial you will individually play the role of two different parties that want to collaborate without sharing the contents of their data.

In practice, all computation will occur on a central enclave cluster that no single party controls. For this tutorial, you will start the enclave server yourself. It enables clients to jointly orchestrate a training pipeline that runs inside an enclave, and all parties submit requests to execute the pipeline together.

MC<sup>2</sup> is open source and available on [GitHub](https://github.com/mc2-project/mc2).

## Mushroom Dataset
In this tutorial we'll be using the [Mushroom Dataset](https://archive.ics.uci.edu/ml/datasets/mushroom). This dataset contains 22 features, each of which represents a physical characteristic of a particular mushroom sample. Labels in this dataset are binary and represent whether a mushroom sample is edible. As a result, the dataset lends itself nicely to a binary classification task.

<img src="figures/mushroom.png" width="100"/>

Imagine that you're part of a mushroom enthusiast group, and have stumbled across some mushroom samples whose edibility is unknown even after much examination. You could of course decide to try eating them, but eating even one poisonous mushroom would lead to the end of your mushroom collection career. Instead, you decide to team up with a few other mushroom enthusiasts and combine your data to train a more robust mushroom edibility classification model.

However, collecting all your mushroom samples was hard work. You don't want other mushroom enthusiasts to have access to your hard-earned data, and consequently don't want to share your data in plaintext.

You will play the role of two distinct mushroom enthusiasts who will be working together to collectively train a model on their aggregated data. To do so, please open up a second copy of this notebook before modifying anything. Each copy will represent the workflow of one enthusiast.

## 1. User Setup
We'll first need to set up your user by inputting a username, generating a keypair, generating a certificate, and generating a symmetric key. 

**Make sure that you enter a unique username in each notebook**.

**TODO:** Create and enter a username. 

**Note**: The following cell may take ~10 seconds to run.

In [None]:
import securexgboost as mc2
from Utils import *

# TODO: Enter your username below as a string. Ensure that your username doesn't
# contain any spaces.
username = # ...
data_dir = "/home/mc2/risecamp/mc2/tutorial/data/"

In [None]:
# Run this cell to generate a keypair and a certificate
generate_certificate(username)
PUB_KEY = "config/{0}.pem".format(username)
CERT_FILE = "config/{0}.crt".format(username)

In [None]:
# Run this cell to generate a symmetric key
KEY_FILE = "{}_key.txt".format(username)
mc2.generate_client_key(KEY_FILE)

## 2. Data Preparation


Since you will play the role of two mushroom enthusiasts, we've prepared two sets of training and test data, one per enthusiast.

Training data for each is located at the following paths:
* enthusiast 1: `/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt`
* enthusiast 2: `/home/mc2/risecamp/mc2/tutorial/data/agaricus2.txt`

Test data for each is located at the following paths:
* enthusiast 1: `/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt.test`
* enthusiast 2: `/home/mc2/risecamp/mc2/tutorial/data/agaricus2.txt.test`


### Plaintext Data Examination
First, examine your training data -- check out the mushroom samples you've collected! 

Secure XGBoost reads data in LibSVM format. Each line is one sample: the first column is the sample label (whether the sample is edible), and each remaining `index:value` entry gives the value of one feature. All features are categorical and have been one-hot encoded, so each feature index corresponds to one category of one original feature. In particular, note that the data is in plaintext and is readable.
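
To make the format concrete, here is a minimal sketch of how one LibSVM record breaks down into a label and a set of active feature indices. The helper `parse_libsvm_line` is a hypothetical name of ours, and the sample record is made up for illustration rather than taken from the tutorial data.

```python
def parse_libsvm_line(line):
    """Split one LibSVM record into (label, {feature_index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(float(label)), features


# Made-up record for illustration, not a row from the tutorial data.
label, features = parse_libsvm_line("1 3:1 10:1 21:1")
print(label, sorted(features))  # prints: 1 [3, 10, 21]
```

With one-hot encoded data, each listed index marks the category that a sample takes for some feature, so the sorted index list is effectively a readable description of the sample.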

**TODO:** Fill in the path to the training data for the user of this notebook.

In [None]:
# TODO: fill in the path to your training data
# You should use `agaricus1.txt` in one notebook, `agaricus2.txt` in the other
!tail -n 10 # ...

### Data Encryption
Next, use the symmetric key generated above to encrypt your data. You've spent inordinate amounts of time collecting your mushroom samples and examining them, and don't want to share the fruits of your labor with anyone else.

**TODO:** Specify the paths to your training and test data.

In [None]:
# TODO: edit the `training_data` and `test_data` strings with the paths to your data
# You should use `agaricus1.txt` in one notebook, `agaricus2.txt` in the other
training_data = # ...
test_data = # ...

In [None]:
# Paths to output encrypted data
enc_training_data = data_dir + "{}_train.enc".format(username)
enc_test_data = data_dir + "{}_test.enc".format(username)

In [None]:
# Encrypt data
mc2.encrypt_file(training_data, enc_training_data, KEY_FILE)
mc2.encrypt_file(test_data, enc_test_data, KEY_FILE)

### Encrypted Data Examination
The encrypted data is at `/home/mc2/risecamp/mc2/tutorial/data/<username>_train.enc` and `/home/mc2/risecamp/mc2/tutorial/data/<username>_test.enc`. Let's take a look to confirm it's encrypted and that no one can see the characteristics of your samples.
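
Besides eyeballing the output of `tail`, one rough sanity check is to confirm that the lines no longer parse as LibSVM records. The sketch below is a heuristic of ours, not a Secure XGBoost feature: `fraction_libsvm` is a hypothetical helper, the sample lines are made up, and it assumes nothing about the actual layout of the encrypted file.

```python
import re

# A LibSVM record: a numeric label followed by zero or more index:value pairs.
LIBSVM_LINE = re.compile(r"^[+-]?\d+(\.\d+)?(\s+\d+:[+-]?\d+(\.\d+)?)*\s*$")


def fraction_libsvm(lines):
    """Return the fraction of non-empty lines that parse as LibSVM records."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    return sum(bool(LIBSVM_LINE.match(line)) for line in lines) / len(lines)


# Made-up samples for illustration only.
plaintext = ["1 3:1 10:1 21:1", "0 1:1 7:1"]
scrambled = ["q9Zt+3kLx/==", "xK2p/Lm90aQ="]
print(fraction_libsvm(plaintext), fraction_libsvm(scrambled))  # prints: 1.0 0.0
```

To apply it to a file, pass an open file object, for example `fraction_libsvm(open(enc_training_data, errors="replace"))`. A value near 0 only shows that the contents are no longer readable as LibSVM; it says nothing about the strength of the encryption.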

**TODO:** Fill in your username to specify the path to your encrypted data.

In [None]:
# TODO: fill in your username
!tail -n 10 # ...

In [None]:
# Run this cell to store variables for use in subsequent notebooks
%store username
%store PUB_KEY 
%store CERT_FILE 
%store KEY_FILE 
%store enc_training_data 
%store enc_test_data

**Once you've finished this step, wait for breakout rooms to reconverge.**

## 3. Enclave server setup
In practice, the enclave server would not be controlled by any single party. To complete this tutorial, however, you will launch the enclave server on behalf of one of the mushroom enthusiasts.

From the notebook of the user who will launch the enclave server, click [here](./Exercise 2.ipynb) to go to the next notebook, where you'll launch the enclave server. The enclave server must be set up before you can begin training.

From the other user's notebook (the one that does not launch the enclave server), click [here](./Exercise 2 - Mirror.ipynb).