# Capstone Project - Credit Default Risk Shared Model

It is common in Brazil for companies to buy data from a credit bureau and combine it with their internal credit default indicators to build a risk model that drives credit concession. Credit default data is always siloed inside each company and, since it is sensitive, cannot be shared. On the other hand, controlling the default rate usually isn't the core business of these companies.

So what if there were a way for these companies to work together and achieve better default rates while preserving their clients' privacy? Enter Federated Learning and Encrypted Learning: this project aims to create a better credit default risk model by sharing a model between two or more companies, while still keeping customers' data private.

## Dependencies

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import torch as th
import syft as sy
import seaborn as sns
import math

from torch.utils.data import DataLoader, Dataset

In [3]:
hook = sy.TorchHook(th)

## Entities

**Companies**
- Shiny: A fictional company that wants to rate its customers by credit default risk.

- High: Another fictional company, also interested in rating its customers.

**Bureau**
- BestView: This is us, a fictional data bureau: a company that sells data and models, and that will be the central point of the shared model.

In [4]:
# creating virtual workers representing each party of our scenario
shiny = sy.VirtualWorker(hook, id='shiny')
high = sy.VirtualWorker(hook, id='high')
best_view = sy.VirtualWorker(hook, id="best_view")
secure_worker = sy.VirtualWorker(hook, id="secure_worker")

## Data

To simulate the intended scenario, we'll use data from the [Home Credit Default Risk Kaggle competition](https://www.kaggle.com/c/home-credit-default-risk/overview).

In this demonstration, we're only interested in training the model, so all testing steps are skipped.

In [5]:
# For the sake of simplicity, only a few features will be used, and they are all numeric.

cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 
        'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
        'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
        'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG','LIVINGAPARTMENTS_AVG',
        'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 
        'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 
        'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 
        'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 
        'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI',
        'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 
        'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 
        'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
        'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 
        'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 
        'AMT_REQ_CREDIT_BUREAU_YEAR', 'TARGET']

data = pd.read_csv('application_train.csv', usecols=cols)
target = data.pop('TARGET').to_numpy()

# shuffling and splitting data into two parts, one for each fictional company
# (a fresh permutation of row positions avoids shuffling the index's own array in place)
idx = np.random.permutation(len(data))

shiny_idx = idx[:10000]
high_idx = idx[10000:20000]

# filling nulls
data = data.fillna(data.mean())

# scaling
scaler = StandardScaler()
data = scaler.fit_transform(data)

## Local differential privacy

Let's suppose that one of the companies is not yet totally comfortable with this whole "shared model" idea and wishes to add an extra level of privacy protection for its customers.

One way to do this is to add "plausible deniability" to the data: in case of a data leak, a customer could argue that their record does not reflect the truth and was, in fact, set totally at random. We'll do this by adding local differential privacy:

First, we randomly choose a small percentage of the customers. For each of them, a coin flip then decides the target variable (that is, whether or not the customer is recorded as having defaulted).

In [6]:
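# 1 for the ~99% of customers who keep their true label, 0 for the ~1% who get a coin flip
small_percentage = (np.random.rand(len(target)) > 0.01).astype(int)
coin_flip = (np.random.rand(len(target)) > 0.5).astype(int)

# keep the true target where small_percentage == 1, otherwise use the coin flip
dp_target = (target * small_percentage + (1-small_percentage) * coin_flip)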
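
A useful property of this kind of randomization is that aggregate statistics can still be recovered. If a fraction `p` of labels is replaced by a fair coin flip, the expected observed default rate is `(1 - p) * true_rate + p * 0.5`, which can be inverted. The following is a minimal pure-Python sketch of that idea; the function names, the sample size and the 8% default rate are hypothetical values chosen only for illustration, not figures from the dataset.

```python
import random

def randomize_labels(labels, p, rng):
    """With probability p replace a label by a fair coin flip, else keep it."""
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.randint(0, 1))  # coin flip decides the label
        else:
            noisy.append(y)                  # true label is kept
    return noisy

def debias_rate(observed_rate, p):
    """Invert E[observed] = (1 - p) * true + p * 0.5 to estimate the true rate."""
    return (observed_rate - 0.5 * p) / (1 - p)

rng = random.Random(42)
# hypothetical population with roughly 8% defaults (assumption for this sketch)
true_labels = [1 if rng.random() < 0.08 else 0 for _ in range(100_000)]
noisy_labels = randomize_labels(true_labels, p=0.01, rng=rng)

true_rate = sum(true_labels) / len(true_labels)
observed_rate = sum(noisy_labels) / len(noisy_labels)
estimated_rate = debias_rate(observed_rate, p=0.01)
print(true_rate, observed_rate, estimated_rate)
```

Any single noisy label may be a coin flip, yet the estimated rate should stay close to the true one. A larger `p` gives each customer stronger deniability at the cost of a noisier estimate.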

In [7]:
# now we create the tensors and send each dataset to the corresponding company
shiny_X = th.from_numpy(data[shiny_idx])
shiny_y = th.from_numpy(dp_target[shiny_idx])
shiny_dataset = sy.BaseDataset(shiny_X, shiny_y).send(shiny)

high_X = th.from_numpy(data[high_idx])
high_y = th.from_numpy(dp_target[high_idx])
high_dataset = sy.BaseDataset(high_X, high_y).send(high)

In [8]:
# combining both datasets into a FederatedDataset
federated_train_dataset = sy.FederatedDataset([high_dataset, shiny_dataset])

# and then, creating our DataLoader
federated_train_loader = sy.FederatedDataLoader(federated_train_dataset, 
                                                shuffle=True, batch_size=256)

## Model

In [18]:
# now we define a model

from torch import nn, optim
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(65, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 2)
        
    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
    
model = Network()

In [19]:
# and also our criterion and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

To train our model, we'll iterate through the epochs and batches, and we'll keep sending the model to each company (worker) - since the data is located there.

In [20]:
for e in range(10):
    epoch_loss = 0
    for inputs, labels in federated_train_loader:
        # send the model to the worker that holds this batch
        worker = inputs.location
        model.send(worker)
        optimizer.zero_grad()
        output = model(inputs.float())
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step() 
        # bring the updated model (and the loss value) back
        model.get()
        epoch_loss += loss.get().item()
        
    # note: this is the sum of the batch losses, not their mean
    print('Epoch {} - Loss: {}'.format(e+1,epoch_loss))

Epoch 1 - Loss: 41.7953967154026
Epoch 2 - Loss: 30.085354417562485
Epoch 3 - Loss: 26.08218541741371
Epoch 4 - Loss: 24.370345383882523
Epoch 5 - Loss: 23.718352928757668
Epoch 6 - Loss: 23.366930544376373
Epoch 7 - Loss: 23.17524318397045
Epoch 8 - Loss: 23.099843487143517
Epoch 9 - Loss: 22.716980166733265
Epoch 10 - Loss: 22.870363876223564
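
The scheme used in the loop above (one model travels to whichever party holds the data, takes a gradient step there, and comes back) can be illustrated without PySyft. The following is a minimal pure-Python sketch using a one-feature logistic regression; the synthetic data, the learning rate and every function name are assumptions made for illustration, and only the worker names are borrowed from this notebook.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_step(weights, batch, lr):
    """One gradient step of logistic regression on a worker's private batch.

    Returns the updated weights and the mean loss measured before the update.
    """
    w, b = weights
    grad_w = grad_b = loss = 0.0
    for x, y in batch:
        p = sigmoid(w * x + b)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        grad_w += (p - y) * x
        grad_b += (p - y)
    n = len(batch)
    return (w - lr * grad_w / n, b - lr * grad_b / n), loss / n

rng = random.Random(0)

def make_data(n):
    """Synthetic private dataset: the label depends noisily on the sign of x."""
    data = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        y = 1 if x + rng.gauss(0, 0.5) > 0 else 0
        data.append((x, y))
    return data

# each worker keeps its own data; only the weights move between them
workers = {"shiny": make_data(200), "high": make_data(200)}

weights = (0.0, 0.0)
history = []
for epoch in range(20):
    epoch_loss = 0.0
    for name, private_data in workers.items():
        # "send" the model, train locally, "get" the updated model back
        weights, loss = local_step(weights, private_data, lr=0.5)
        epoch_loss += loss
    history.append(epoch_loss)

print(history[0], history[-1], weights)
```

In this sketch the raw records never leave the `workers` dictionary entries; only the pair of weights and a scalar loss are exchanged, which mirrors what `model.send(worker)` and `model.get()` do in the loop above.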


## Encrypted Model

Now that we have a trained model, let's say we want other companies to be able to score their customers too (selling it as Machine Learning as a Service), but we don't want to expose our model's parameters. On the other hand, these companies don't want us to have access to their data either.

One way to do that is to encrypt both our model and the data, share all of it, and run the whole scoring process while encrypted. We'll demonstrate it below:

In [21]:
# first, let's simulate this new company and its data
new_company = sy.VirtualWorker(hook, id='new_company')
new_company_idx = idx[20000:30000]

# and also, let's share the data between the Bureau and the company
new_company_X = th.from_numpy(data[new_company_idx]).fix_precision().share(new_company, best_view, crypto_provider=secure_worker, requires_grad=True)
new_company_y = th.from_numpy(dp_target[new_company_idx]).fix_precision().share(new_company, best_view, crypto_provider=secure_worker, requires_grad=True)

# we'll also share the already trained model between the same parties
model = model.fix_precision().share(new_company, best_view, crypto_provider=secure_worker, requires_grad=True)
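
Conceptually, `.fix_precision().share(...)` combines two ideas: floats are encoded as fixed-point integers, and each integer is split into random-looking additive shares, one per party. The following is a minimal pure-Python sketch of those two ideas, not PySyft's actual implementation; the ring size `Q`, the scale `SCALE` and all function names are assumptions made for illustration.

```python
import random

Q = 2**62        # size of the integer ring the shares live in (assumed)
SCALE = 10**3    # fixed-point scale: keep three decimal places (assumed)

def encode(x):
    """Turn a float into a fixed-point integer in the ring."""
    return int(round(x * SCALE)) % Q

def decode(v):
    """Map a ring element back to a float; the upper half encodes negatives."""
    if v > Q // 2:
        v -= Q
    return v / SCALE

def share(v, n_parties=2, rng=random):
    """Split v into n additive shares that sum to v modulo Q."""
    shares = [rng.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((v - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Only the sum of all shares reveals the encoded value."""
    return sum(shares) % Q

def add_shared(a, b):
    """Each party adds its own two shares locally; no communication needed."""
    return [(x + y) % Q for x, y in zip(a, b)]

feature = share(encode(1.25))   # e.g. held by new_company and best_view
bias = share(encode(-0.5))
total = add_shared(feature, bias)
result = decode(reconstruct(total))
print(result)
```

A single share on its own is just a uniformly random number, so neither party learns the other's values. Additions work share by share as shown; multiplications need extra correlated randomness, which is the kind of help a crypto provider such as `secure_worker` supplies.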

In [22]:
# and now we can get an output using an encrypted model and encrypted data
output = model(new_company_X)

In [23]:
# it's also easy to retrieve the predictions from the encrypted output
predictions = output.get().float_precision()
# softmax over the class dimension (dim=1), then pick the most likely class per customer
predictions = F.softmax(predictions, dim=1).argmax(dim=1)
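
The decrypted output has one row per customer and one column per class, so class probabilities come from normalizing each row over its classes. Because softmax preserves the ordering of the scores within a row, taking the argmax of the probabilities selects the same class as taking the argmax of the raw scores. The following is a small pure-Python sketch of that check; the score values and helper names are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of class scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest entry in the row."""
    return max(range(len(row)), key=row.__getitem__)

# hypothetical raw scores: one row per customer, columns = [no default, default]
logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2]]

probs = [softmax(row) for row in logits]
classes_from_probs = [argmax(row) for row in probs]
classes_from_logits = [argmax(row) for row in logits]
print(classes_from_probs, classes_from_logits)
```

The softmax step is therefore only needed when the probabilities themselves are of interest, for example to rank customers by risk rather than to assign a hard class.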

And that's it! We have: 
- Added Local Differential Privacy
- Trained a model using Federated Learning
- Encrypted a model and ran encrypted data through it!