In [10]:
from banker import BankerBase, run
from sklearn.ensemble import RandomForestClassifier
import pandas
import random
from sklearn.model_selection import cross_val_score
import numpy as np
import jdc

# Nicolabk_Kaiie_Banker Class

In [11]:
class Nicolabk_Kaiie_Banker(BankerBase):
    def __init__(self, interest_rate=0.005, epsilon=None):
        self.interest_rate = interest_rate
        if epsilon:
            self.epsilon = epsilon/20
        else:
            self.epsilon = epsilon

The class takes an optional `interest_rate`, which can be changed later with `set_interest_rate()`, and an optional `epsilon`, which, if set, makes new predictions differentially private. We divide `epsilon` by 20 so that each of the 20 attributes in the data set gets an equal share of the privacy budget.

In [13]:
%%add_to Nicolabk_Kaiie_Banker
def set_interest_rate(self, interest_rate):
    self.interest_rate = interest_rate

Allows redefinition of the `interest_rate`.

In [14]:
%%add_to Nicolabk_Kaiie_Banker
def fit(self, X, y):
    n_estimators = range(50, 125, 25)
    clfs = [RandomForestClassifier(n_estimators=n) for n in n_estimators]
    # The scorer is the helper method `_utility_scoring` defined below.
    scores = [np.mean(cross_val_score(clf, X, y, cv=10, scoring=self._utility_scoring, n_jobs=-1)) for clf in clfs]
    best_n = n_estimators[np.argmax(np.array(scores))]
    self.classifier = RandomForestClassifier(n_estimators=best_n, n_jobs=-1)
    if self.epsilon:
        self.sensitivity = {}
        for column in X.columns[X.dtypes == 'int64']:
            self.sensitivity[column] = X[column].max()-X[column].min()
    self.classifier.fit(X, y)

Selects the `n_estimators` hyperparameter for `RandomForestClassifier` from the candidates 50, 75 and 100 using 10-fold cross-validation, then fits the final classifier on all of `X`.
If `epsilon` is defined, it also calculates the sensitivity (maximum minus minimum) of each numerical column and saves it in `self.sensitivity` for later use when predicting.
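As a small illustration of the sensitivity step, the sketch below applies the same max-minus-min rule to a made-up frame. The column names and values are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy frame: two integer columns and one one-hot (uint8) column.
X = pd.DataFrame({
    "amount": pd.Series([1000, 2500, 4000], dtype="int64"),
    "duration": pd.Series([12, 24, 48], dtype="int64"),
    "purpose_A41": pd.Series([0, 1, 0], dtype="uint8"),
})

# Same rule as in fit(): sensitivity = max - min over each int64 column.
sensitivity = {
    column: int(X[column].max() - X[column].min())
    for column in X.columns[X.dtypes == "int64"]
}
print(sensitivity)  # {'amount': 3000, 'duration': 36}
```

The one-hot column is skipped because its dtype is not `int64`, which is how `fit()` tells numerical and categorical columns apart.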

In [16]:
%%add_to Nicolabk_Kaiie_Banker
def expected_utility(self, X):
    pr_1, pr_2 = self.predict_proba(X).T
    U_1 = X['amount']*( (1+self.interest_rate)**X['duration'] - 1)
    U_2 = -X['amount']
    return pr_1*U_1 + pr_2*U_2

Calculates the expected utility (money earned) given `X`.

Calls `predict_proba()` to get the predicted probabilities of the loan being repaid or defaulted on, then weights the gain from a repaid loan (the compound interest) and the loss from a default (the full amount) by these probabilities.
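To make the formula concrete, here is a worked example for a single loan. The amount, duration and probabilities are assumed values chosen for illustration, not model output.

```python
# Hypothetical loan: amount 1000, duration 12 periods, interest 0.5% per period.
amount, duration, interest_rate = 1000, 12, 0.005
pr_repaid, pr_default = 0.9, 0.1  # assumed probabilities, not real predictions

# Gain if the loan is repaid: compound interest over the whole duration.
gain_if_repaid = amount * ((1 + interest_rate) ** duration - 1)
# Loss if the loan defaults: the full amount.
loss_if_default = -amount

expected = pr_repaid * gain_if_repaid + pr_default * loss_if_default
print(round(gain_if_repaid, 2))  # 61.68
print(round(expected, 2))        # -44.49

# Repayment probability at which the expected utility is exactly zero.
break_even = amount / (amount + gain_if_repaid)
```

Because the possible loss is much larger than the possible gain, even a 90% chance of repayment gives a negative expected utility here.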

In [17]:
%%add_to Nicolabk_Kaiie_Banker
def predict_proba(self, X):
    if isinstance(X, pandas.core.series.Series):
        X_copy = X.copy().to_frame().transpose()
    else:
        X_copy = X.copy()

    if self.epsilon:
        X_copy = self._privacy(X_copy)
    return self.classifier.predict_proba(X_copy)

Returns two values per row: the probability of the loan being repaid and the probability of it being defaulted on.

Converts `X` to a `pandas.DataFrame` if it is a `pandas.Series` to remain compatible with both Christos Dimitrakakis' and Dirk Hesse's code.
Calls `_privacy()` on `X` if epsilon is defined, then calls our classifier's (`RandomForestClassifier`) `predict_proba()`.
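The `Series` to `DataFrame` conversion can be checked in isolation. The sketch below uses a hypothetical two-field row and shows that the result is a one-row frame whose columns are the original field names.

```python
import pandas as pd

# Hypothetical single applicant, stored as a Series named after its row label.
row = pd.Series({"amount": 1000, "duration": 12}, name=7)

# to_frame() gives one column named 7; transpose() turns it into one row.
frame = row.to_frame().transpose()
print(frame.shape)          # (1, 2)
print(list(frame.columns))  # ['amount', 'duration']
print(list(frame.index))    # [7]
```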

In [18]:
%%add_to Nicolabk_Kaiie_Banker
def get_best_action(self, X):
    if isinstance(X, pandas.core.series.Series):
        for exp_utility in self.expected_utility(X):
            if exp_utility > 0:
                return 1
            else:
                return 2
    else:
        actions = []
        for exp_utility in self.expected_utility(X):
            if exp_utility > 0:
                actions.append(1)
            else:
                actions.append(2)
    return pandas.Series(actions, index = X.index)

If `X` is a `pandas.Series`, we call `expected_utility()` and grant the loan (action 1) if we expect to earn money, otherwise we refuse it (action 2). If `X` is a `pandas.DataFrame`, we return a `pandas.Series` of decisions whose index matches that of `X`.
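The decision rule itself is a threshold at zero. The following sketch shows a vectorised way to express it with `numpy.where`; it is an alternative formulation for illustration, not the method used in the class, and the utilities are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical expected utilities for three applicants.
expected = pd.Series([12.5, -40.0, 0.0], index=["a", "b", "c"])

# 1 = grant the loan, 2 = refuse it. A utility of exactly 0 is refused,
# matching the strict `> 0` test in get_best_action().
actions = pd.Series(np.where(expected > 0, 1, 2), index=expected.index)
print(actions.tolist())  # [1, 2, 2]
```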

## Helper functions

In [19]:
%%add_to Nicolabk_Kaiie_Banker
def _utility_scoring(self, estimator, X, y):
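    # cross_val_score() passes an estimator already fitted on the training fold,
    # so we score on the held-out fold without refitting on it.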
    pr_1, pr_2 = estimator.predict_proba(X).T
    U_1 = X['amount']*( (1+self.interest_rate)**X['duration'] - 1)
    U_2 = -X['amount']
    return sum(pr_1*U_1 + pr_2*U_2)

Used as the `scoring` callable of `cross_val_score()` in `fit()` to score the models for different hyperparameters.

In [20]:
%%add_to Nicolabk_Kaiie_Banker
def _privacy(self, X):
    X_private = X.copy()
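    for column in self.sensitivity:
        # Add the noise to a float copy and assign it back, so the result does
        # not depend on whether X_private[column] is a view or a copy.
        noisy = X_private[column].astype(float)
        self._laplace_mechanism(noisy, self.sensitivity[column])
        X_private[column] = noisy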
    cat_col = sorted(list(set(X.columns) - set(self.sensitivity.keys())))
    cat_grouped = []
    prev = ""
    for cat in cat_col:
        split = cat.split('_')[0]
        if split == prev:
            cat_grouped[-1].append(cat)
        else:
            cat_grouped.append([cat])
        prev = split

    for cat in cat_grouped:
        self._exponential(X_private, cat)
    return X_private

Makes `X` differentially private, spending `self.epsilon` on each attribute.

Calls `_laplace_mechanism()` on numerical columns using the sensitivities calculated during the fit. Groups the categorical one-hot encoded columns together and calls `_exponential()` on them.
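The grouping step relies on one-hot column names sharing the text before the first underscore. The sketch below reproduces that logic as a standalone function; `group_by_prefix` and the column names are hypothetical and only serve to show the idea.

```python
def group_by_prefix(columns):
    """Group column names that share the text before the first underscore."""
    groups, prev = [], None
    # Sorting puts columns with the same prefix next to each other.
    for name in sorted(columns):
        prefix = name.split("_")[0]
        if prefix == prev:
            groups[-1].append(name)
        else:
            groups.append([name])
        prev = prefix
    return groups


cols = ["purpose_A41", "housing_A152", "purpose_A42", "housing_A153"]
groups = group_by_prefix(cols)
print(groups)
# [['housing_A152', 'housing_A153'], ['purpose_A41', 'purpose_A42']]
```

Note that this only works if the attribute name itself contains no underscore and no two attributes share a prefix.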

In [21]:
%%add_to Nicolabk_Kaiie_Banker
def _exponential(self, X, category):
    quality = np.zeros(len(category)+1)
    for i in range(len(category)):
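        if X[category[i]].iloc[0] == 1:
            quality[i+1] = 1
            # Write through X.iloc; the chained `X[col].iloc[0] = 0` may only
            # modify a temporary copy.
            X.iloc[0, X.columns.get_loc(category[i])] = 0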
            break

    if np.count_nonzero(quality) == 0:
        quality[0] = 1

    pr_cat = np.exp(self.epsilon*quality)/sum(np.exp(self.epsilon*quality))
    choice = np.random.choice(['filler'] + category, p=pr_cat)
    if choice != 'filler':
        X[choice] = 1

Uses the exponential mechanism to make the categorical variables of `X` differentially private.

Defines the quality function by checking which column of the given category holds a 1; if none of them does, the real value is the first level, which is not included in the one-hot encoding. We then calculate the selection probabilities with the exponential mechanism and draw one level at random according to them. Only the first row of `X` is inspected, so `X` is expected to hold a single row.
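The probabilities can be inspected on their own. The sketch below uses the same weights as `_exponential()`, namely `exp(epsilon * quality)` normalised to sum to one; textbook statements of the exponential mechanism often divide the exponent by twice the sensitivity of the quality function, which is not done here. The function name `exponential_probs`, the level names and the epsilon value are assumptions for illustration.

```python
import numpy as np


def exponential_probs(quality, epsilon):
    """Selection probabilities proportional to exp(epsilon * quality)."""
    weights = np.exp(epsilon * np.asarray(quality, dtype=float))
    return weights / weights.sum()


# Three hypothetical levels; the true value is the second one (quality 1).
levels = ["filler", "purpose_A41", "purpose_A42"]
quality = [0, 1, 0]
probs = exponential_probs(quality, epsilon=1.0)
print(np.round(probs, 4))  # [0.2119 0.5761 0.2119]

# Draw the reported level; the seed only makes the sketch repeatable.
rng = np.random.default_rng(0)
choice = rng.choice(levels, p=probs)
```

With a smaller epsilon the three probabilities move towards 1/3 each, so the reported level reveals less about the true one.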

In [22]:
%%add_to Nicolabk_Kaiie_Banker
def _laplace_mechanism(self, X, sensitivity):
    noise = np.random.laplace(scale=sensitivity/self.epsilon, size=X.size)
    X += noise

Uses the Laplace mechanism to add Laplace noise to `X` in place, with scale `sensitivity / self.epsilon`.
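A standalone sketch of the same idea is shown below. The values, sensitivity and epsilon are hypothetical, and it uses NumPy's `default_rng` generator rather than the global `np.random.laplace` call used in the class.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

values = np.array([12.0, 24.0, 48.0])  # hypothetical durations
sensitivity, epsilon = 36.0, 0.5

# Laplace noise with scale = sensitivity / epsilon, one draw per value.
scale = sensitivity / epsilon
noisy = values + rng.laplace(loc=0.0, scale=scale, size=values.size)
print(scale)  # 72.0
```

A smaller epsilon gives a larger scale and therefore noisier values, which is the usual trade-off between privacy and accuracy.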