## Privileged Logistic Regression with Partially Available Privileged Information

When privileged information is available for only some of the training samples, the privileged logistic regression model expects the training data in a specific order: the rows that have privileged information must come first, in the same order as the rows of the privileged data, followed by the rows without it. The following example shows how to prepare the data for this case and how to run the model.
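
The ordering requirement can be sketched with plain NumPy, independently of the model. The toy arrays and the names `x_base`, `x_priv`, `pi_idx`, `x_ordered`, and `y_ordered` below are illustrative assumptions, not part of the package.

```python
import numpy as np

# Illustrative toy data: 5 samples, 2 base features, 3 privileged features.
x_base = np.arange(10).reshape(5, 2)
x_priv = np.arange(15).reshape(5, 3)
y = np.array([0, 1, 0, 1, 1])

# Suppose privileged information is only available for rows 3, 0 and 4.
pi_idx = np.array([3, 0, 4])
no_pi_idx = np.setdiff1d(np.arange(len(y)), pi_idx)  # remaining rows: 1 and 2

# Privileged block: only the rows that have privileged information.
x_star = x_priv[pi_idx]
y_star = y[pi_idx]

# Base block: rows with privileged information first, the rest afterwards.
x_ordered = np.concatenate((x_base[pi_idx], x_base[no_pi_idx]), axis=0)
y_ordered = np.concatenate((y[pi_idx], y[no_pi_idx]), axis=0)

# The first len(x_star) base rows now describe the same samples as x_star.
assert np.array_equal(y_ordered[:len(y_star)], y_star)
```

Indexing both blocks with the same `pi_idx` is what keeps row `k` of the privileged data paired with row `k` of the base data.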

### 0. Install Necessary Packages

In [1]:
# !pip install --user -r requirements.txt

### 1. Load Model

In [2]:
# import the models
from privileged_lr import PrivilegedLogisticRegression

### 2. Prepare the Data for Learning Using Partially Available Privileged Information (LUPAPI)

In [3]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [23]:
import numpy as np
np.random.seed(0)

In [24]:
n_total, n_informative = 12, 6

# create a simulated dataset
X, y = make_classification(n_samples=2000, n_features=n_total, 
                           n_informative=n_informative, 
                           n_redundant=0, random_state=0)

# split the dataset into train, validation and test set based on the ratio
train_ratio, validation_ratio, test_ratio = 0.4, 0.3, 0.3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, random_state=0)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=validation_ratio/(train_ratio+validation_ratio), random_state=1)

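# use the first n_informative columns as privileged information
# (make_classification shuffles feature columns by default, so these are
# guaranteed to be the informative features only when shuffle=False)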
x_train_star = X_train[:, :(n_informative)]

# ---------------------------- LUPAPI ---------------------------- #
# randomly select 80% of the rows in privileged information to be available
idx = np.random.choice(x_train_star.shape[0], int(x_train_star.shape[0]*0.8), replace=False)
# identify the unselected idx
no_pi_idx = np.setdiff1d(np.arange(x_train_star.shape[0]), idx)

# only keep the selected rows in the privileged data and the label
x_train_star = x_train_star[idx]
y_train_star = y_train[idx]
# ---------------------------- LUPAPI ---------------------------- #

# the rest are used as base features
x_train = X_train[:, (n_informative):]

# ---------------------------- LUPAPI ---------------------------- #
# Reorder the training set so that it stays consistent with the privileged
# training data: rows in idx come first, followed by rows in no_pi_idx
x_train = np.concatenate((x_train[idx], x_train[no_pi_idx]), axis=0) 
y_train = np.concatenate((y_train[idx], y_train[no_pi_idx]), axis=0) 
# ---------------------------- LUPAPI ---------------------------- #

# in the val/test set, we only keep the base features
x_val = X_val[:, n_informative:]
x_test = X_test[:, n_informative:]
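
Before fitting, it can help to confirm that the arrays follow the required layout. The helper below is a hypothetical sketch (`check_lupapi_alignment` is not part of `privileged_lr`). It only compares row counts and labels, which is a necessary but not sufficient condition for correct alignment.

```python
import numpy as np

def check_lupapi_alignment(x, y, x_star, y_star):
    """Return the number of privileged rows if the layout looks consistent.

    Hypothetical helper: it checks shapes and labels only, so it cannot
    detect two misaligned rows that happen to share the same label.
    """
    n_star = x_star.shape[0]
    if y.shape[0] != x.shape[0]:
        raise ValueError("x and y have different numbers of rows")
    if y_star.shape[0] != n_star:
        raise ValueError("x_star and y_star have different numbers of rows")
    if n_star > x.shape[0]:
        raise ValueError("more privileged rows than base rows")
    if not np.array_equal(y[:n_star], y_star):
        raise ValueError("labels of the leading base rows do not match y_star")
    return n_star

# Toy demonstration: 4 base rows, privileged information for the first 2.
x_demo = np.zeros((4, 3))
y_demo = np.array([1, 0, 1, 1])
x_star_demo = np.ones((2, 5))
y_star_demo = np.array([1, 0])

n_privileged = check_lupapi_alignment(x_demo, y_demo, x_star_demo, y_star_demo)
```

In this notebook the equivalent call would be `check_lupapi_alignment(x_train, y_train, x_train_star, y_train_star)`.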

### 3. Running with PLR Model (`cvxpy` implementation) - Training, Hyper-parameter Selection and Testing

Learning Using Partially Available Privileged Information

In [25]:
import pandas as pd
import itertools
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_score, recall_score

In [26]:
# initialize the hyperparameter search grid
param_grid_plr = {
    'lambda_base': [0.01, 0.1, 1, 10],
    'lambda_star': [0.01, 0.1, 1, 10],
    'alpha': [0.01, 0.1, 1, 10],
    'xi_link': [0.01, 0.1, 1, 10],
    'penalty': ['l1']
    }
all_hyperparam_combinations = list(itertools.product(*map(param_grid_plr.get, list(param_grid_plr))))
# initialize the dataframe that stores the validation results
df_train_plr = pd.DataFrame(columns=['lambda_base', 'lambda_star', 'alpha', 'xi_link', 'penalty', 'auroc', 'f1'])
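
The grid expansion above can be seen in isolation on a smaller grid. The sketch below uses a hypothetical `toy_grid` purely for illustration. With the full grid in this notebook, the same expansion yields 4 × 4 × 4 × 4 × 1 = 256 combinations.

```python
import itertools

# Hypothetical, smaller grid used only to illustrate the expansion.
toy_grid = {
    'lambda_base': [0.1, 1],
    'alpha': [0.01, 10],
    'penalty': ['l1'],
}

# One tuple per combination, with values ordered like the dictionary keys.
combos = list(itertools.product(*toy_grid.values()))

# Pair each tuple with the keys to obtain keyword arguments for the model.
kwargs_list = [dict(zip(toy_grid, values)) for values in combos]

print(len(kwargs_list))  # 2 * 2 * 1 = 4
print(kwargs_list[0])    # {'lambda_base': 0.1, 'alpha': 0.01, 'penalty': 'l1'}
```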

In [27]:
# create a dictionary for each hyperparameter combination and iterate over it
for i, hyper_param_values in enumerate(all_hyperparam_combinations):
    kwarg = dict(zip(list(param_grid_plr.keys()), hyper_param_values))

    # initialize the model with the hyperparameters
    plr_model = PrivilegedLogisticRegression(**kwarg)
    
    # If the data is sparse, some hyperparameter combinations may be invalid
    # and the model will raise an error. In that case, wrap the fit call in
    # try/except and skip the offending combination.
    # ---------------------------- LUPAPI ---------------------------- #
    plr_model.fit(x_train, y_train, 
                  X_star=x_train_star, 
                  y_star=y_train_star) # <= be sure to use the privileged label here in LUPAPI case
    # ---------------------------- LUPAPI ---------------------------- #

    # obtain the prediction
    y_val_pred = plr_model.predict_proba(x_val)

    # calculate the AUROC and F1 score on the validation set
    auroc = roc_auc_score(y_val, y_val_pred[:, 1])
    f1 = f1_score(y_val, y_val_pred.argmax(axis=1))
    # store the validation results
    df_train_plr.loc[i] = list(hyper_param_values) + [auroc, f1]


In [22]:
# obtain the best hyperparameters (highest validation F1 score)
best_hyperparam = df_train_plr.sort_values(by='f1', ascending=False).iloc[0] 

# only keep the entries whose names appear in param_grid_plr.keys()
best_hyperparam = best_hyperparam[list(param_grid_plr.keys())]

# apply the best hyperparameters to the best_plr_model
best_plr_model = PrivilegedLogisticRegression(**best_hyperparam.to_dict())

# fit the best_plr_model
# ---------------------------- LUPAPI ---------------------------- #
best_plr_model.fit(x_train, y_train, 
                   X_star=x_train_star, 
                   y_star=y_train_star) # <= be sure to use the privileged label here in LUPAPI case
# ---------------------------- LUPAPI ---------------------------- #

# obtain the prediction
y_test_pred = best_plr_model.predict_proba(x_test)

# calculate the AUROC, accuracy, f1, precision and recall
auroc = roc_auc_score(y_test, y_test_pred[:, 1])
acc = accuracy_score(y_test, y_test_pred.argmax(axis=1))
f1 = f1_score(y_test, y_test_pred.argmax(axis=1))
precision = precision_score(y_test, y_test_pred.argmax(axis=1))
recall = recall_score(y_test, y_test_pred.argmax(axis=1))

print('(PLR model) AUROC on test set: {}'.format(auroc))
print('(PLR model) Accuracy on test set: {}'.format(acc))
print('(PLR model) F1 score on test set: {}'.format(f1))
print('(PLR model) Precision on test set: {}'.format(precision))
print('(PLR model) Recall on test set: {}'.format(recall))

(PLR model) AUROC on test set: 0.8721985799842221
(PLR model) Accuracy on test set: 0.795
(PLR model) F1 score on test set: 0.7946577629382304
(PLR model) Precision on test set: 0.7933333333333333
(PLR model) Recall on test set: 0.7959866220735786
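
The metrics above turn probabilities into hard labels with `argmax(axis=1)`. Assuming `predict_proba` returns a two-column array of class probabilities, as the indexing in this notebook suggests, this is equivalent to thresholding the class-1 probability at 0.5. The following is a minimal sketch with made-up probabilities:

```python
import numpy as np

# Made-up two-column probabilities: column 0 is class 0, column 1 is class 1.
proba = np.array([
    [0.9, 0.1],
    [0.4, 0.6],
    [0.5, 0.5],   # tie: argmax returns the first index, i.e. class 0
    [0.2, 0.8],
])

labels_argmax = proba.argmax(axis=1)
labels_threshold = (proba[:, 1] > 0.5).astype(int)

# Both rules give the same hard labels for two-class probabilities.
assert np.array_equal(labels_argmax, labels_threshold)
```

AUROC uses the class-1 probabilities directly and does not depend on this threshold, whereas accuracy, F1, precision, and recall do.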
