# Reproduce the AUC from example 1 of the Calf paper [1]

While calfpy yields an AUC of 0.875 in example 1 from the paper, calfcv produces an AUC of 0.9796875.

References:

[1] Jeffries, C.D., Ford, J.R., Tilson, J.L. et al. A greedy regression algorithm with coarse weights offers novel advantages. Sci Rep 12, 5440 (2022). https://doi.org/10.1038/s41598-022-09415-2

In [9]:
import pandas as pd
from sklearn.metrics import roc_auc_score
from calfcv import CalfCV

In [10]:
input_file = "../data/n2.csv"
df = pd.read_csv(input_file, header=0, sep=",")

# The input data is every column except the ctrl/case outcome column
X = df.loc[:, df.columns != 'ctrl/case']
# The outcomes (diagnoses) are in the first column, ctrl/case
Y = df['ctrl/case']

# The header row is the feature set
features = list(X.columns)

# label the outcomes
Y_names = Y.replace({0: 'non_psychotic', 1: 'pre_psychotic'})

# glmnet requires float64
x = X.to_numpy(dtype='float64')
y = Y.to_numpy(dtype='float64')


### Features

Here we look at the feature names, number of features, shape, category balance, and probability of choosing the positive category by chance.

In [11]:
features[0:5]

['ADIPOQ', 'SERPINA3', 'AMBP', 'A2M', 'ACE']

In [12]:
x.size

9720

In [13]:
x.shape

(72, 135)

### Category Balance

In [14]:
print(list(Y).count(1), list(Y).count(0))

32 40


In [15]:
len(y)

72
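
The probability of choosing the positive category by chance follows from the category counts. A minimal sketch, assuming the 32 positive and 40 negative labels printed above (the variable names are illustrative and not part of the notebook):

```python
# Counts taken from the category balance output: 32 positives, 40 negatives.
n_pos, n_neg = 32, 40

# Chance of picking a positive sample when drawing one sample at random.
p_pos = n_pos / (n_pos + n_neg)
print(f"{p_pos:.4f}")  # 0.4444
```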

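CalfCV improves on the calfpy AUC of 0.875 from example 1 of the paper. Note that the model below is fit and scored on the same 72 samples, so the AUC is an in-sample value.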

In [16]:
y_pred = CalfCV().fit(x, y).predict_proba(x)
roc_auc_score(y, y_pred[:, 1])

0.9796875
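
For reference, an AUC can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below computes that directly in plain Python. `pairwise_auc` is a hypothetical helper written for illustration, not part of calfcv or scikit-learn, and the toy labels and scores are made up.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Compare every positive score with every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not from n2.csv).
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_y, toy_scores))  # 0.75
```

With 32 positive and 40 negative samples there are 32 × 40 = 1280 such pairs, so an AUC of 0.9796875 corresponds to 1254 pairs counted as correctly ordered (ties counting as half).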