# MC<sup>2</sup>
This tutorial demonstrates how to use [MC<sup>2</sup>](https://github.com/mc2-project/mc2) (<b>M</b>ultiparty <b>C</b>ollaboration and <b>C</b>ompetition), our platform that enables collaborating parties to jointly perform analytics and train machine learning models on their sensitive data without sharing the contents of the data. In particular, this tutorial focuses on a module of MC<sup>2</sup> that supports gradient boosted decision tree learning, [Secure XGBoost](https://github.com/mc2-project/secure-xgboost).

Secure XGBoost leverages secure enclaves, e.g. Intel SGX, to perform computation in a secure environment. Parties can send their encrypted data to an untrusted server hosting Secure XGBoost, which will then load the data into an enclave before decrypting it. Since enclaves provide encrypted regions of memory, even the OS, hypervisor, and other (privileged) processes on the same machine won't be able to see the unencrypted data or intermediate results during computation.

However, secure enclaves have been shown to be vulnerable to a range of side-channel attacks. To combat this, Secure XGBoost redesigns GBDT learning algorithms to be data-oblivious, i.e. to make memory access patterns independent of the input data. Using data-oblivious algorithms eliminates a large class of leakage that side-channel attacks rely on to extract information.
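
To make "memory accesses independent of input" concrete, here is a minimal sketch contrasting a direct lookup with an oblivious one. It is illustrative only: `direct_lookup` and `oblivious_lookup` are hypothetical helpers, not part of Secure XGBoost, and Python itself gives no guarantees about low-level access patterns or timing.

```python
def direct_lookup(values, secret_index):
    # Touches only one element, so the address accessed reveals secret_index.
    return values[secret_index]


def oblivious_lookup(values, secret_index):
    # Touches every element in the same order regardless of secret_index,
    # and selects the result with arithmetic instead of a branch.
    result = 0
    for i, v in enumerate(values):
        is_match = int(i == secret_index)  # 1 for the wanted slot, else 0
        result = is_match * v + (1 - is_match) * result
    return result


values = [10, 20, 30, 40]
print(oblivious_lookup(values, 2))  # same answer as direct_lookup(values, 2)
```

The oblivious version does more work (a full scan per lookup), which is the usual trade-off: the access pattern no longer depends on the secret, at the cost of extra computation.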

Secure XGBoost's architecture is shown below. Clients make requests to a central untrusted RPC orchestrator, which queues them and relays a given request to every enclave server only once all parties have submitted it. Computation happens in a distributed manner across the enclave cluster.
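
The queue-then-relay behavior described above can be sketched as a toy class. This is an assumption-laden illustration, not the actual orchestrator: `ToyOrchestrator`, `submit`, and the party and request names are all hypothetical, and real requests would also carry arguments and signatures.

```python
from collections import defaultdict


class ToyOrchestrator:
    """Hypothetical stand-in for the RPC orchestrator's queuing behavior."""

    def __init__(self, parties):
        self.parties = set(parties)
        self.pending = defaultdict(set)  # request name -> parties that sent it
        self.relayed = []                # requests forwarded to the enclaves

    def submit(self, party, request):
        """Record a request; return True if it was relayed to the enclaves."""
        if party not in self.parties:
            raise ValueError(f"unknown party: {party}")
        self.pending[request].add(party)
        # Relay only when every party has submitted the same request.
        if self.pending[request] == self.parties:
            del self.pending[request]
            self.relayed.append(request)
            return True
        return False


orch = ToyOrchestrator(["alice", "bob"])
print(orch.submit("alice", "train"))  # held until bob agrees
print(orch.submit("bob", "train"))    # now relayed
```

Requiring every party to submit a request before it runs means no single client can unilaterally start a computation on the joint data.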

![Secure XGBoost architecture](figures/sys-arch.png)

In this tutorial, we'll break everyone into small groups -- each group will collaborate to jointly train a decision tree model. In practice, the central enclave server would be controlled by no single party. For this tutorial, however, one member of each group will start the enclave server, which lets clients jointly orchestrate a training pipeline that runs inside an enclave. All group members will then submit requests to jointly execute the pipeline.

MC<sup>2</sup> is open source and available on [GitHub](https://github.com/mc2-project/mc2).

## Mushroom Dataset
In this tutorial we'll be using the [Mushroom Dataset](https://archive.ics.uci.edu/ml/datasets/mushroom). This dataset contains 22 features, each of which represents a physical characteristic of a particular mushroom sample. Labels in this dataset are binary, and represent whether a mushroom sample is edible. As a result, the dataset lends itself well to a binary classification task.

<img src="figures/mushroom.png" width="100"/>

Imagine that you're part of a mushroom enthusiast group, and have stumbled across some mushroom samples whose edibility is unknown even after much examination. You could of course decide to try eating them, but eating even one poisonous mushroom would lead to the end of your mushroom collection career. Instead, you decide to team up with a few other mushroom enthusiasts and combine your data to train a more robust mushroom edibility classification model.

However, collecting all your mushroom samples was hard work -- you don't want other mushroom enthusiasts to have access to your hard-earned data, and consequently don't want to share your data in plaintext.

## 1. User Setup
We'll first need to set up your user by inputting a username, generating a keypair, generating a certificate, and generating a symmetric key.

**TODO:** Create and enter a username.

In [1]:
import securexgboost as mc2
from Utils import *

# TODO: Enter your username below as a string. Ensure that your username doesn't
# contain any spaces.
username = "chief"
cwd = "/home/mc2/risecamp/mc2/tutorial/"

In [2]:
# Run this cell to generate a keypair and a certificate
generate_certificate(username)
PUB_KEY = "config/{0}.pem".format(username)
CERT_FILE = "config/{0}.crt".format(username)

Generating keypair
Generating RSA private key, 3072 bit long modulus (2 primes)
.......................................................................++++
.......++++
e is 3 (0x03)
Generating CSR
Signing CSR
Signature ok
subject=CN = chief
Getting CA Private Key


In [3]:
# Run this cell to generate a symmetric key
KEY_FILE = "key.txt"
mc2.generate_client_key(KEY_FILE)

## 2. Data Preparation


Since attendees have been split into groups of 4, we've prepared four sets of training data, one set for each person in the group. Coordinate who will be using which set.

Training data for each user is located at the following paths:
* user 1: `/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt`
* user 2: `/home/mc2/risecamp/mc2/tutorial/data/agaricus2.txt`
* user 3: `/home/mc2/risecamp/mc2/tutorial/data/agaricus3.txt`
* user 4: `/home/mc2/risecamp/mc2/tutorial/data/agaricus4.txt`

Test data for each user is located at the following paths:
* user 1: `/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt.test`
* user 2: `/home/mc2/risecamp/mc2/tutorial/data/agaricus2.txt.test`
* user 3: `/home/mc2/risecamp/mc2/tutorial/data/agaricus3.txt.test`
* user 4: `/home/mc2/risecamp/mc2/tutorial/data/agaricus4.txt.test`

### Plaintext Data Examination
First, examine your training data -- check out the mushroom samples you've collected! Secure XGBoost uses LibSVM format: the first column of each row is the sample label (whether the sample is edible), and the remaining columns are `index:value` pairs. Features have been one-hot encoded, so each index corresponds to one category of a feature. In particular, note that the data is in plaintext and is readable.

**TODO:** Fill in the path to your training data.

In [4]:
# TODO: fill in the path to your training data
!tail -n 10 /home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt

0 4:1 7:1 11:1 22:1 29:1 34:1 37:1 39:1 42:1 54:1 58:1 62:1 66:1 77:1 86:1 88:1 92:1 95:1 98:1 105:1 114:1 120:1
1 4:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 51:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 118:1 120:1
0 3:1 10:1 11:1 22:1 29:1 34:1 37:1 39:1 41:1 54:1 58:1 62:1 66:1 77:1 86:1 88:1 92:1 95:1 98:1 106:1 114:1 120:1
0 3:1 7:1 11:1 22:1 29:1 34:1 37:1 39:1 42:1 54:1 58:1 65:1 66:1 77:1 86:1 88:1 92:1 95:1 98:1 105:1 117:1 120:1
0 4:1 7:1 11:1 22:1 29:1 34:1 37:1 39:1 41:1 54:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 98:1 105:1 114:1 120:1
0 3:1 7:1 19:1 22:1 29:1 34:1 37:1 39:1 41:1 54:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 98:1 106:1 117:1 120:1
0 4:1 10:1 14:1 22:1 29:1 34:1 37:1 39:1 48:1 54:1 58:1 62:1 69:1 77:1 86:1 88:1 92:1 95:1 98:1 106:1 114:1 120:1
0 4:1 7:1 19:1 22:1 29:1 34:1 37:1 39:1 44:1 54:1 58:1 62:1 69:1 77:1 86:1 88:1 92:1 95:1 98:1 105:1 117:1 120:1
0 4:1 10:1 20:1 21:1 23:1 34:1 37:1 40:1 42:1 54:1 55:1 65:1 69:1 77:1 86:1 88:1 92:
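
To read rows like the ones above programmatically, a minimal parser might look like the following. `parse_libsvm_line` is a hypothetical helper written for this illustration, not part of Secure XGBoost; the sample row is the first complete line of the output above.

```python
def parse_libsvm_line(line):
    """Split one LibSVM row into (label, {feature_index: value})."""
    fields = line.split()
    label = int(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features


row = ("0 4:1 7:1 11:1 22:1 29:1 34:1 37:1 39:1 42:1 54:1 58:1 62:1 "
       "66:1 77:1 86:1 88:1 92:1 95:1 98:1 105:1 114:1 120:1")
label, features = parse_libsvm_line(row)
print(label, len(features))  # prints "0 22"
```

This row sets 22 indices, which matches the dataset's 22 features once each is one-hot encoded: one active category per original feature.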

### Data Encryption
Next, use the symmetric key generated above to encrypt your data. You've spent inordinate amounts of time collecting your mushroom samples and examining them, and don't want to share the fruits of your labor with anyone else.

**TODO:** Specify the paths to your training and test data.

In [5]:
# TODO: edit the `training_data` and `test_data` variables with the paths to your data
training_data = "/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt"
test_data = "/home/mc2/risecamp/mc2/tutorial/data/agaricus1.txt.test"

In [6]:
# Paths to output encrypted data
enc_training_data = cwd + "data/{}_train.enc".format(username)
enc_test_data = cwd + "data/{}_test.enc".format(username)

In [7]:
# Encrypt data
mc2.encrypt_file(training_data, enc_training_data, KEY_FILE)
mc2.encrypt_file(test_data, enc_test_data, KEY_FILE)

### Encrypted Data Examination
The encrypted data is at `/home/mc2/risecamp/mc2/tutorial/data/<username>_train.enc` and `/home/mc2/risecamp/mc2/tutorial/data/<username>_test.enc`. Let's take a look to confirm it's encrypted and that no one can see the characteristics of your samples.

**TODO:** Fill in your username to specify the path to your encrypted data.

In [8]:
# TODO: fill in your username
!tail -n 10 /home/mc2/risecamp/mc2/tutorial/data/chief_train.enc

1616,1625,5rZEv4VbSkXODySB,50XrrD52CszjJW/kk/kefQ==,fKzRF6aWliv/7+EnB+WiQNe4hoyvdBtm61sxuuYLN7NocBjPMJRgzxf38mL09/R3qgsIYgcjlyAXO7FPP60WJ1lBAHjGsnU1/LCd8a8iyaCC4ZULOBWX3sC4VaTotKPymM0dKXDV4B39p4QDtfYTxA==
1617,1625,p+Q7N/KBHp5tVAop,H3XgXuwwFsS4xoiv5SIztw==,Dlc91IPrKGsVAo2P4T+jJYKHVO2YDBi5k+MWmnOdDV4v+d/vwdM1S8+3qv8XfgDmtP6XoT1/CMljKoGYITpoWiKgdpddBzw0vGI5lxFUN9SaNBmjg1n6StuDeHGID8AIFRq8zj8ygjUlc7Q6k4lgqVvN
1618,1625,GRVJwWpFRwKPOigp,HAJYY2HdO70yQW+QxTozXQ==,jv5GmxEKtrCxlZWSBBObbMzbCqQey0myf4IFjWOe4DrYSWyOd6ydoX0CeV9ol6CRaeEmcIJhXcYkyikQkZg48doJXSLj0yfxXbAY2By0Xho9EANk338xx8R3BZwR9aXfFHoSQycbKkwrSY6TON/nlYA=
1619,1625,nYHFZ1gtN4/2ZtAF,S+ascsjrwdvUGRlsAsDrdw==,kiCrSXKOQ/DbzfbwE/5sqjqm2SsWTbSfZVqd3pDTcHLivn7cIMJIYkJRFUbuSsdhne7WcOQ93TPPSeHxPzCYXq+UvQ7xwgsz2RaC3nqetlkx1qUKX8uY5JBFZwlwsfII79Xj0yMblyhA7ag2lSoKfg==
1620,1625,YK8DVioaqVbAZ4+Z,yF7qtwEkiNtL2OoXkK6ABg==,bhZU+Oqz3tqFdA04erd0u0H3Wcc4S3faEaHOGyOM2w8FJmVl3zBsOR01EX7MlHWg5JMK8bEDMApgLdl4ehoxeKezmKrrh/ALQ79l0o0y+/Gb75MRjZK9okbASB9z
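
Each line of the output above has five comma-separated fields. The sketch below inspects their sizes. The field names used here (`index`, `total`, `nonce`, `tag`, `ciphertext`) are an assumed interpretation of the output rather than documented names, and `describe_encrypted_line` is a hypothetical helper. The first four fields of the sample are copied from the first line above; the ciphertext is a placeholder.

```python
import base64


def describe_encrypted_line(line):
    """Return the decoded byte lengths of the base64 fields in one line."""
    index, total, nonce, tag, ciphertext = line.strip().split(",")
    return {
        "index": int(index),
        "total": int(total),
        "nonce_bytes": len(base64.b64decode(nonce)),
        "tag_bytes": len(base64.b64decode(tag)),
        "ciphertext_bytes": len(base64.b64decode(ciphertext)),
    }


# Placeholder ciphertext: 32 zero bytes, base64-encoded (not real data).
placeholder = base64.b64encode(b"\x00" * 32).decode()
sample = "1616,1625,5rZEv4VbSkXODySB,50XrrD52CszjJW/kk/kefQ==," + placeholder
print(describe_encrypted_line(sample))
```

In the sample, the third and fourth fields decode to 12 and 16 bytes. Those sizes are consistent with an authenticated cipher such as AES-GCM, though this tutorial doesn't state which cipher is used. Either way, none of the `index:value` pairs from the plaintext are visible.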

In [9]:
# Run this cell to store variables for use in subsequent notebooks
%store username
%store PUB_KEY 
%store CERT_FILE 
%store KEY_FILE 
%store enc_training_data 
%store enc_test_data
%store cwd

Stored 'username' (str)
Stored 'PUB_KEY' (str)
Stored 'CERT_FILE' (str)
Stored 'KEY_FILE' (str)
Stored 'enc_training_data' (str)
Stored 'enc_test_data' (str)
Stored 'cwd' (str)


**Once you've finished this step, wait for breakout rooms to reconverge.**

## 3. Enclave server setup
In practice, the enclave server would be controlled by no single party. To complete this tutorial, however, one party in the collaboration will have to act as both a party and the enclave server. Designate one person in the collaboration to control the enclave server.

If you've been designated as the enclave server, click [here](./Exercise 1.2.ipynb) to go to the next notebook. You'll have to set up the enclave server before everyone can begin training. 

Otherwise, click [here](./Exercise 2.ipynb).