# Example usage

This example shows how to use `fasterrisk` to generate sparse risk scoring systems.

## Download and Read Sample Data

### Imports

In [1]:
from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier
from fasterrisk.utils import download_file_from_google_drive
import os.path

import numpy as np
import pandas as pd
import time

### Download Sample Data

In [2]:
train_data_file_path = "../tests/adult_train_data.csv"
test_data_file_path = "../tests/adult_test_data.csv"

if not os.path.isfile(train_data_file_path):
    download_file_from_google_drive('1nuWn0QVG8tk3AN4I4f3abWLcFEP3WPec', train_data_file_path)
if not os.path.isfile(test_data_file_path):
    download_file_from_google_drive('1TyBO02LiGfHbatPWU4nzc8AndtIF-7WH', test_data_file_path)


### Read Sample Data

In [3]:
train_df = pd.read_csv(train_data_file_path)
train_data = np.asarray(train_df)
X_train, y_train = train_data[:, 1:], train_data[:, 0]

test_df = pd.read_csv(test_data_file_path)
test_data = np.asarray(test_df)
X_test, y_test = test_data[:, 1:], test_data[:, 0]
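
The slicing above treats column 0 as the label and every remaining column as a feature. The sketch below shows the same layout on a tiny, made-up table using only the standard library. The column names and values are hypothetical. The checks assume that labels are coded as -1/+1 (consistent with the predictions printed later) and that features are binary, which is an assumption about the sample data rather than something this page states.

```python
import csv
import io

# Hypothetical three-row table in the same layout as the sample files:
# the label comes first, followed by the feature columns.
toy_csv = """label,Married,NoHS
1,1,0
-1,0,1
-1,0,0
"""

rows = list(csv.reader(io.StringIO(toy_csv)))
header, body = rows[0], rows[1:]

feature_names = header[1:]                    # every column except the label
y = [int(r[0]) for r in body]                 # column 0 is the label
X = [[int(v) for v in r[1:]] for r in body]   # the rest are features

# Sanity checks under the stated assumptions (labels in {-1, +1}, binary features).
assert set(y) <= {-1, 1}
assert all(v in (0, 1) for row in X for v in row)

print(feature_names, y, X)
```

Running the same two checks on `y_train` and `X_train` before optimization is a cheap way to catch a mislabeled or shifted column.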

## Train Risk Score Models

### Create RiskScoreOptimizer and Perform Optimization

In [4]:
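sparsity = 5      # number of features allowed in each risk score (k)
parent_size = 10  # number of solutions retained during beam search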

RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size)

In [5]:
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))

Optimization takes 13.68 seconds.


## Get Risk Score Models

In [6]:
multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
print("We generate {} risk score models from the sparse diverse pool".format(len(multipliers)))

(26049, 50)
We generate 50 risk score models from the sparse diverse pool


### Access the first risk score model

In [7]:
model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]

### Use the first risk score model to make predictions

In [8]:
RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)

In [9]:
y_test_pred = RiskScoreClassifier_m.predict(X_test)
print("y_test are predicted to be {}".format(y_test_pred))

y_test are predicted to be [-1 -1 -1 ... -1 -1 -1]


In [10]:
y_test_pred_prob = RiskScoreClassifier_m.predict_prob(X_test)
print("The risk probabilities of having y_test to be +1 are {}".format(y_test_pred_prob))

The risk probabilities of having y_test to be +1 are [0.13308868 0.34872682 0.34872682 ... 0.04216029 0.34872682 0.04216029]
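
The two outputs above are consistent with a simple cutoff: every visible probability is below 0.5 and every visible prediction is -1. The sketch below applies that cutoff to the visible probabilities. The 0.5 threshold is an assumption about how `predict` relates to `predict_prob`, not something confirmed on this page, and `to_label` is a hypothetical helper.

```python
# Probabilities copied from the printed output above (visible entries only).
probs = [0.13308868, 0.34872682, 0.34872682, 0.04216029, 0.34872682, 0.04216029]


def to_label(p, threshold=0.5):
    """Map a risk probability to a +1 / -1 label (assumed 0.5 cutoff)."""
    return 1 if p > threshold else -1


labels = [to_label(p) for p in probs]
print(labels)  # [-1, -1, -1, -1, -1, -1]
```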


### Print the first model card

In [11]:
X_featureNames = list(train_df.columns[1:])

RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
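
To read the card by hand, add up the points of the features that apply and look up the total in the SCORE/RISK table. The sketch below does this for a hypothetical person and then recomputes the table with a logistic function. The values `INTERCEPT = -3` and `MULTIPLIER = 1.6` are not read from the model object: they are inferred from the table (a score of 3 maps to 50.0%, and the other entries fit a scale of roughly 1.6), and the logistic form itself is an assumption that happens to reproduce the printed percentages.

```python
import math

# Points copied from the model card above.
points = {
    "Age_22_to_29": -2,
    "HSDiploma": -2,
    "NoHS": -4,
    "Married": 4,
    "AnyCapitalGains": 3,
}

# Inferred from the SCORE/RISK table, not taken from the fitted model.
INTERCEPT = -3
MULTIPLIER = 1.6


def score(active_features):
    """Sum the points of the features that apply to one person."""
    return sum(points[name] for name in active_features)


def risk(total_score):
    """Assumed logistic link from total score to risk probability."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + total_score) / MULTIPLIER))


# Hypothetical person: married, with a high school diploma.
s = score(["Married", "HSDiploma"])
print(s, round(100 * risk(s), 1))  # 2 34.9

# Recompute a few table entries as percentages.
for table_score in (-8, 0, 3, 7):
    print(table_score, round(100 * risk(table_score), 1))
```

Because the multiplier is rounded, the recomputed values match the table to the one decimal place it prints, not to the full precision returned by `predict_prob`.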


### Print Top 10 Model Cards from the Pool and Their Performance Metrics

In [17]:
num_models = min(10, len(multipliers))

for model_index in range(num_models):
    multiplier = multipliers[model_index]
    intercept = sparseDiversePool_beta0_integer[model_index]
    coefficients = sparseDiversePool_betas_integer[model_index]

    RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)
    RiskScoreClassifier_m.reset_featureNames(X_featureNames)
    RiskScoreClassifier_m.print_model_card()

    train_loss = RiskScoreClassifier_m.compute_logisticLoss(X_train, y_train)
    train_acc, train_auc = RiskScoreClassifier_m.get_acc_and_auc(X_train, y_train)
    test_acc, test_auc = RiskScoreClassifier_m.get_acc_and_auc(X_test, y_test)

    print("The logistic loss on the training set is {}".format(train_loss))
    print("The training accuracy and AUC are {:.3f}% and {:.3f}".format(train_acc*100, train_auc))
    print("The test accuracy and AUC are {:.3f}% and {:.3f}\n".format(test_acc*100, test_auc))

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
The logistic loss on the training set is 9798.652346518873
The training accuracy and AUC are 82.575% and 0.862
The test accuracy and AUC are 81.787% and 0.856

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.                 Married      4 point(s) | + ...
4.    WorkHrsPerWeek_lt_40     -2 point(s) | + ...
5.  