# Basic Usage
In this notebook you will find:
- How to get survival curves and neighbor predictions using xgbse
- How to validate your xgbse model using sklearn

## METABRIC

We will use the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset from [pycox](https://github.com/havakv/pycox#datasets) as the basis for this example.

In [1]:
from xgbse.converters import convert_to_structured
from pycox.datasets import metabric
import numpy as np

# getting data
df = metabric.read_df()

df.head()

|  | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | duration | event |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.603834 | 7.811392 | 10.797988 | 5.967607 | 1.0 | 1.0 | 0.0 | 1.0 | 56.84 | 99.333336 | 0 |
| 1 | 5.284882 | 9.581043 | 10.20462 | 5.66497 | 1.0 | 0.0 | 0.0 | 1.0 | 85.940002 | 95.73333 | 1 |
| 2 | 5.920251 | 6.776564 | 12.431715 | 5.873857 | 0.0 | 1.0 | 0.0 | 1.0 | 48.439999 | 140.233337 | 0 |
| 3 | 6.654017 | 5.341846 | 8.646379 | 5.655888 | 0.0 | 0.0 | 0.0 | 0.0 | 66.910004 | 239.300003 | 0 |
| 4 | 5.456747 | 5.339741 | 10.555724 | 6.008429 | 1.0 | 0.0 | 0.0 | 1.0 | 67.849998 | 56.933334 | 1 |


## Split and Time Bins

Split the data into train and test sets using the sklearn API. We also set up the TIME_BINS array, which will be used to fit the survival curve.

In [2]:
from xgbse.converters import convert_to_structured
from sklearn.model_selection import train_test_split

# splitting to X, T, E format
X = df.drop(['duration', 'event'], axis=1)
T = df['duration']
E = df['event']
y = convert_to_structured(T, E)

# splitting between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
TIME_BINS = np.arange(15, 315, 15)
TIME_BINS

array([ 15,  30,  45,  60,  75,  90, 105, 120, 135, 150, 165, 180, 195,
       210, 225, 240, 255, 270, 285, 300])
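
For readers unfamiliar with the structured format, the sketch below builds a comparable array by hand with NumPy. It assumes that the structured target pairs a boolean event flag with a float duration; the field names `event` and `duration` and the toy values are illustrative choices, not necessarily the names that `convert_to_structured` uses internally.

```python
import numpy as np

# Toy durations and event indicators (1 = event observed, 0 = censored).
durations = [99.3, 95.7, 140.2]
events = [0, 1, 0]

# Hypothetical field names, chosen for illustration only.
dtype = [("event", bool), ("duration", float)]
y_toy = np.array(list(zip(events, durations)), dtype=dtype)

# Each field can be read back as a plain array.
print(y_toy["event"])
print(y_toy["duration"])

# Time bins built the same way as TIME_BINS: 15, 30, ..., 300.
time_bins = np.arange(15, 315, 15)
print(len(time_bins), time_bins[0], time_bins[-1])
```

Keeping the event flag and the duration in a single array lets sklearn utilities such as `train_test_split` treat the survival target as one `y` object.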

## Fit and Predict

We will use the XGBSEDebiasedBCE estimator to fit the model and predict a survival curve for each point in our test data.

In [3]:
from xgbse import XGBSEDebiasedBCE

# fitting xgbse model
xgbse_model = XGBSEDebiasedBCE()
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS)

# predicting
y_pred = xgbse_model.predict(X_test)

print(y_pred.shape)
y_pred.head()

(635, 20)


|  | 15 | 30 | 45 | 60 | 75 | 90 | 105 | 120 | 135 | 150 | 165 | 180 | 195 | 210 | 225 | 240 | 255 | 270 | 285 | 300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.984175 | 0.951248 | 0.922238 | 0.900388 | 0.860692 | 0.794486 | 0.711412 | 0.684977 | 0.650408 | 0.612303 | 0.56863 | 0.51139 | 0.491728 | 0.428023 | 0.375019 | 0.30535 | 0.266715 | 0.220593 | 0.181524 | 0.140979 |
| 1 | 0.970462 | 0.915608 | 0.840342 | 0.707092 | 0.657995 | 0.554985 | 0.491562 | 0.363362 | 0.312353 | 0.300028 | 0.22773 | 0.193386 | 0.172858 | 0.145737 | 0.112273 | 0.089074 | 0.080625 | 0.057293 | 0.048271 | 0.035753 |
| 2 | 0.986725 | 0.957639 | 0.917272 | 0.888849 | 0.848961 | 0.771694 | 0.719471 | 0.644347 | 0.578128 | 0.527717 | 0.480682 | 0.446399 | 0.42448 | 0.382811 | 0.341006 | 0.277785 | 0.23846 | 0.184317 | 0.155098 | 0.115905 |
| 3 | 0.986631 | 0.955226 | 0.910726 | 0.856593 | 0.822179 | 0.763958 | 0.665907 | 0.6255 | 0.582875 | 0.536332 | 0.491294 | 0.439012 | 0.409672 | 0.37188 | 0.303486 | 0.232625 | 0.174503 | 0.138385 | 0.115504 | 0.0874 |
| 4 | 0.975897 | 0.940299 | 0.873138 | 0.799073 | 0.741586 | 0.637032 | 0.563963 | 0.530511 | 0.504359 | 0.469508 | 0.423137 | 0.398003 | 0.371561 | 0.2994 | 0.231172 | 0.196209 | 0.171285 | 0.123539 | 0.098354 | 0.071912 |
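
Each row of `y_pred` is one survival curve evaluated at the time bins. As an illustration of how to read such a curve, the sketch below estimates a median survival time as the first bin at which the predicted survival probability falls to 0.5 or below. The helper `median_survival_time` is hypothetical and not part of xgbse; the values are the first seven bins of row 1 in the table above.

```python
def median_survival_time(time_bins, survival):
    """Return the first time bin where survival <= 0.5, or None if never reached."""
    for t, s in zip(time_bins, survival):
        if s <= 0.5:
            return t
    return None


time_bins = [15, 30, 45, 60, 75, 90, 105]
row_1 = [0.970462, 0.915608, 0.840342, 0.707092, 0.657995, 0.554985, 0.491562]

# The curve first drops below 0.5 at the seventh bin.
print(median_survival_time(time_bins, row_1))          # 105

# With only the first three bins the curve never reaches 0.5.
print(median_survival_time(time_bins[:3], row_1[:3]))  # None
```

Because the curve is only known at the bin edges, the result is accurate to the bin width at best (15 time units here).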


Mean predicted survival curve for the test data:

In [4]:
y_pred.mean().plot.line();


## Neighbors

We can also use our model for querying comparables based on survivability.

In [5]:
neighbors = xgbse_model.get_neighbors(
    query_data=X_test,
    index_data=X_train,
    n_neighbors=5
)

print(neighbors.shape)
neighbors.head(5)

(635, 5)


|  | neighbor_1 | neighbor_2 | neighbor_3 | neighbor_4 | neighbor_5 |
|---|---|---|---|---|---|
| 829 | 1879 | 418 | 508 | 339 | 166 |
| 670 | 1846 | 1082 | 1297 | 1448 | 194 |
| 1064 | 416 | 1230 | 1392 | 739 | 1289 |
| 85 | 1558 | 1080 | 8 | 950 | 234 |
| 1814 | 105 | 1743 | 50 | 859 | 941 |


<b>Example</b>: selecting a data point from the query data (X_test) and checking its features

In [6]:
desired = neighbors.iloc[10]

X_test.loc[X_test.index == desired.name]

|  | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|---|
| 399 | 5.572504 | 7.367552 | 11.023443 | 5.406307 | 1.0 | 0.0 | 0.0 | 1.0 | 67.620003 |


... and finding its comparables from index data (X_train)

In [7]:
X_train.loc[X_train.index.isin(desired.tolist())]

|  | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|---|
| 726 | 5.635854 | 6.648942 | 10.889588 | 5.496374 | 1.0 | 1.0 | 0.0 | 1.0 | 70.860001 |
| 968 | 5.541239 | 7.058089 | 10.463409 | 5.396433 | 1.0 | 0.0 | 0.0 | 1.0 | 71.07 |
| 870 | 5.605712 | 7.309217 | 10.935708 | 5.542732 | 0.0 | 1.0 | 0.0 | 1.0 | 71.470001 |
| 1640 | 5.812605 | 7.646811 | 10.952687 | 5.516386 | 1.0 | 1.0 | 0.0 | 1.0 | 68.559998 |
| 234 | 5.78435 | 6.797296 | 11.025448 | 5.335426 | 1.0 | 1.0 | 0.0 | 1.0 | 68.489998 |
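
Note that filtering with `isin` returns the matching rows in the order they appear in `X_train`, not in neighbor rank order. If the ranking matters, passing the list of labels directly to `.loc` keeps it. The toy frame, index labels and neighbor ranking below are made up for illustration.

```python
import pandas as pd

# Toy index data with arbitrary index labels (illustrative values only).
index_data = pd.DataFrame({"x0": [5.6, 5.5, 5.8]}, index=[726, 968, 234])

# Hypothetical neighbor labels, ordered from first to last neighbor.
ranked = [234, 726, 968]

# A boolean mask keeps the row order of the frame being filtered.
by_mask = index_data.loc[index_data.index.isin(ranked)]

# Label-based selection follows the order of the list that was passed in.
by_label = index_data.loc[ranked]

print(by_mask.index.tolist())   # [726, 968, 234]
print(by_label.index.tolist())  # [234, 726, 968]
```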


## Score metrics

XGBSE implements the concordance index and the integrated Brier score. Both can be used to evaluate model performance.

In [8]:
# importing metrics
from xgbse.metrics import concordance_index, approx_brier_score

# running metrics
print(f"C-index: {concordance_index(y_test, y_pred)}")
print(f"Avg. Brier Score: {approx_brier_score(y_test, y_pred)}")

C-index: 0.6700403561591186
Avg. Brier Score: 0.17238947801133372
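
To make the concordance index concrete, here is a simplified pure-Python version of Harrell's C-index on toy data. It is an illustrative sketch, not xgbse's implementation: it takes a single risk score per sample (higher means an earlier expected event) and ignores pairs with tied durations.

```python
def harrell_c_index(durations, events, risks):
    """Fraction of comparable pairs whose risk ordering matches the outcome ordering."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    # Assumes at least one comparable pair exists.
    return concordant / comparable


durations = [5, 10, 15, 20]
events = [1, 0, 1, 1]          # the second sample is censored
risks = [0.9, 0.4, 0.1, 0.6]   # toy risk scores

print(harrell_c_index(durations, events, risks))  # 0.75
```

Here four pairs are comparable and three are ordered correctly, hence 0.75. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.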


## Cross Validation

We can also use sklearn's cross_val_score and make_scorer to cross-validate our model.

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

results = cross_val_score(xgbse_model, X, y, scoring=make_scorer(approx_brier_score))
results

array([0.1627031 , 0.14851301, 0.12848481, 0.15393882, 0.15409061])
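
One caveat when using a loss such as the Brier score for model selection: `make_scorer` assumes that higher is better unless told otherwise. The sketch below shows the effect of `greater_is_better=False` with a generic scikit-learn loss and a dummy model standing in for the survival setup; the data is synthetic.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to demonstrate the scorer sign.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = rng.normal(size=30)

# greater_is_better=False makes the scorer return the negated loss,
# so that "higher is better" holds for tools such as GridSearchCV.
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)
scores = cross_val_score(DummyRegressor(), X_toy, y_toy, scoring=neg_mse, cv=3)

print(scores)  # one non-positive value per fold
```

In the cell above the scores are only reported, so the sign does not matter there. It starts to matter once the scorer drives a search over hyperparameters.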