# Fitting a single model with drfsc

## Load, fit, predict, and score using drfsc and the WDBC example dataset
In this notebook we use the Breast Cancer Wisconsin (Diagnostic) dataset (WDBC), available from the UCI repository at 'https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29', to demonstrate how to use the drfsc package and some of its functionality.

This example dataset is quite small, so only a small number of partitions will be used. For larger datasets, the number of partitions can be increased as needed.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from drfsc import drfsc, utils
from sklearn.model_selection import train_test_split

## Loading data
We start by loading the dataset.

In [2]:
data = utils.load_wdbc("wdbc.data")
print(f"Shape: {data.shape}, Dimensions: {data.ndim}")

Shape: (569, 32), Dimensions: 2


As can be seen from the code below, the first column is an ID column (not used), the second column is the label, and the remaining columns are the features. We split the data accordingly.

In [3]:
data = pd.DataFrame(data)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302.0,1.0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517.0,1.0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903.0,1.0,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301.0,1.0,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402.0,1.0,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [4]:
X = data.loc[:, 2:]
X.columns = [f"x_{i}" for i in range(1, X.shape[1] + 1)] # renaming columns
X.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_21,x_22,x_23,x_24,x_25,x_26,x_27,x_28,x_29,x_30
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [None]:
# The heatmap below uses pandas' built-in Styler (requires matplotlib and jinja2);
# seaborn is not needed here and is not a drfsc dependency.

corr = X.corr()
display(corr.style.background_gradient(cmap='coolwarm'))

In [None]:

# Count how many features each feature is highly correlated with (level=0.8).
# A count of 1 means the feature is highly correlated only with itself.

df_CorrCount = utils.get_corr_df(X, level=0.8)

display(df_CorrCount)

# col2drop = ['x_1','x_8','x_23','x_24','x_28']
# X = X.drop(columns=col2drop)

In [None]:
Y = data.loc[:, 1]
Y.head()

The data then needs to be split into training/validation/testing partitions for use by DRFSC. This is done in the standard way using scikit-learn's train_test_split function.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42, stratify=Y_train)
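
Because the second split is applied to what remains after the first, the effective proportions differ from the `test_size` values passed in. The arithmetic can be checked with a short sketch (exact fractions are used only to avoid floating-point noise):

```python
from fractions import Fraction

test_size = Fraction(1, 5)   # test_size=0.2 in the first split
val_size = Fraction(1, 5)    # test_size=0.2 in the second split (applied to the remainder)

test_frac = test_size
val_frac = (1 - test_size) * val_size
train_frac = (1 - test_size) * (1 - val_size)

print(train_frac, val_frac, test_frac)  # 16/25 4/25 1/5
```

So roughly 64% of the samples end up in the training set, 16% in the validation set, and 20% in the test set.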

To initialize our model, we instantiate the DRFSC class. For this tutorial, we use 10 iterations (n_runs), 4 vertical partitions (n_vbins), and 2 horizontal partitions (n_hbins).

In this notebook we show the results for output='single', which returns a single model. The other option is output='ensemble', which creates an ensemble based on the number of horizontal partitions.

In [None]:
model = drfsc.DRFSC(n_vbins=4, n_hbins=2, n_runs=10, output='single', verbose=True)
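
To give an intuition for what the partition counts mean, here is a minimal sketch. It assumes that vertical partitions split the feature columns and horizontal partitions split the sample rows; the toy shape and the random assignment are illustrative choices, not a description of how DRFSC builds its partitions internally.

```python
import numpy as np

n_samples, n_features = 100, 30   # hypothetical toy shape
n_vbins, n_hbins = 4, 2

rng = np.random.default_rng(0)
# Vertical partitions: disjoint groups of feature (column) indices.
feature_bins = np.array_split(rng.permutation(n_features), n_vbins)
# Horizontal partitions: disjoint groups of sample (row) indices.
sample_bins = np.array_split(rng.permutation(n_samples), n_hbins)

# Every (row group, column group) pair defines one sub-problem.
blocks = [(rows, cols) for rows in sample_bins for cols in feature_bins]
print(len(blocks))                      # 8
print([len(b) for b in feature_bins])   # [8, 8, 7, 7]
```

With 4 vertical and 2 horizontal partitions there are 8 such blocks, which is why larger datasets can make use of more partitions.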

To load the data into the DRFSC model, we call the load_data() method, which preprocesses the data. We can also specify the degree of polynomial expansion; in this example we use polynomial=2, i.e. a degree-2 expansion.

In [None]:
X_train, X_val, Y_train, Y_val, X_test, Y_test = model.load_data(X_train, X_val, Y_train, Y_val, X_test, Y_test, polynomial=2)
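
A degree-2 expansion increases the number of candidate features considerably. The sketch below enumerates the monomials of a generic polynomial expansion; whether drfsc also adds a constant term, or names the expanded columns this way, is an assumption, and `polynomial_terms` is a hypothetical helper.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree=2):
    """Hypothetical helper: list every monomial of total degree 1..degree (no constant term)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

names = [f"x_{i}" for i in range(1, 31)]   # the 30 WDBC feature names used above
terms = polynomial_terms(names, degree=2)
print(len(terms))  # 495
```

The 30 original features yield 30 linear terms plus 465 squares and pairwise products, 495 terms in total under these assumptions.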

If we want to specify some initial mu values for the RFSC optimization, this can be done via the `DRFSC.set_initial_mu()` method. Here, we set the initial mu values to be 0.1 for all features. This is not necessary, and if not specified, the initial mu values will be set to 1/n_features for all features by default.

In [None]:
model.set_initial_mu(0.1)

We could instead set specific mu values for specific features by passing `DRFSC.set_initial_mu()` a dictionary, as shown below.

In [None]:
model.set_initial_mu({'x_2': 0.2, 'x_3': 0.3, 'x_9': 0.05})
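
The three ways of specifying initial mu values (default, single value, dictionary) can be summarised with a small sketch. `build_initial_mu` is a hypothetical helper written for illustration, not part of drfsc; in particular, letting unspecified features fall back to 1/n_features in the dictionary case is an assumption.

```python
def build_initial_mu(feature_names, mu=None):
    """Hypothetical helper mapping each feature name to an initial mu value."""
    default = 1.0 / len(feature_names)        # documented default: 1/n_features
    if mu is None:
        return {name: default for name in feature_names}
    if isinstance(mu, dict):
        unknown = set(mu) - set(feature_names)
        if unknown:
            raise KeyError(f"unknown features: {sorted(unknown)}")
        # Assumption: features missing from the dict keep the default value.
        return {name: mu.get(name, default) for name in feature_names}
    return {name: float(mu) for name in feature_names}   # one value for all features

names = ["x_1", "x_2", "x_3", "x_4"]
print(build_initial_mu(names))                 # every feature gets 0.25
print(build_initial_mu(names, 0.1))            # every feature gets 0.1
print(build_initial_mu(names, {"x_2": 0.2}))   # x_2 gets 0.2, the rest 0.25
```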

## Model fitting

To fit our DRFSC model, we simply call the fit() method and pass our training and validation data and labels. This will fit the model to the data and return the best model found.

In [None]:
model.fit(X_train, X_val, Y_train, Y_val)

Once a single model has been fit, we can view the features and coefficients of the final model through the features_ and coef_ attributes. If the input data is a numpy array and has no feature names, we can instead use the features_num attribute to access the indices of the model features.

In [None]:
model.features_

In [None]:
model.features_num

In [None]:
model.coef_

We can also access the underlying fitted model object itself through the `model` attribute.

In [None]:
model.model

After a model has been fit, we can use it to make predictions with the predict_proba() and predict() methods. Each method takes the test set data (X_test) as its sole argument; predict_proba() returns the predicted probabilities and predict() returns the predicted labels.

In [None]:
y_prob = model.predict_proba(X_test)
y_prob

In [None]:
y_pred = model.predict(X_test)
y_pred
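
The relationship between the two outputs can be illustrated with a generic logistic model. This is a sketch under assumptions: it presumes the final model is a logistic regression and that labels are obtained by thresholding probabilities at 0.5. The function names mirror the methods above but are standalone toy functions, not the drfsc implementation.

```python
import numpy as np

def predict_proba(X, coef):
    """Toy logistic model: probability of the positive class for each row."""
    return 1.0 / (1.0 + np.exp(-(X @ coef)))

def predict(X, coef, threshold=0.5):
    """Toy labels: 1 where the probability reaches the threshold, else 0."""
    return (predict_proba(X, coef) >= threshold).astype(int)

# Hypothetical data: first column is an intercept term, second is one feature.
X_toy = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 0.0]])
coef = np.array([0.0, 1.5])

print(predict_proba(X_toy, coef))  # approximately [0.953 0.047 0.5]
print(predict(X_toy, coef))        # [1 0 1]
```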

## Model evaluation

To find out how well our model performed, we can use the score() method. This method takes the test set data (X_test) and labels (Y_test) as arguments, plus an optional metric (here we use the default metric, accuracy), and returns the value of that metric on the test set.

In [None]:
model.score(X_test, Y_test)
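
For reference, accuracy is simply the fraction of samples whose predicted label matches the true label. A minimal standalone sketch (the `accuracy` function here is illustrative and independent of drfsc):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```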

## Model visualization

To visualize the results, we can use the feature_importance(), pos_neg_prediction(), and single_prediction() methods. 

The feature_importance() method takes no arguments and displays a histogram of the final model features and their coefficients.

The pos_neg_prediction() method takes as an argument the index of the sample we want to visualise. If we want to visualise a prediction on the test set, the test set should also be passed as an argument. This method displays the positive and negative contributions to the prediction for the sample, computed by multiplying the sample's feature values by the model coefficients and separating the positive and negative terms.
- E.g. usage: pos_neg_prediction(0, X_test) for visualising the prediction on the first sample in the test set.

The single_prediction() method takes the same arguments as pos_neg_prediction(), and displays the model coefficients weighted by the data for the sample. This is useful for visualising the model's prediction on a single sample.
- E.g. usage: single_prediction(0, X_test) for visualising the prediction on the first sample in the test set.
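
The computation behind these plots, as described above, amounts to an element-wise product of one sample with the coefficients. The sketch below shows that calculation on made-up numbers; `split_contributions` is a hypothetical helper, and the values of `x` and `coef` are purely illustrative rather than taken from the fitted model.

```python
import numpy as np

def split_contributions(x, coef):
    """Hypothetical helper: per-feature contributions x_j * coef_j, split by sign."""
    contrib = x * coef
    positive = np.where(contrib > 0, contrib, 0.0)   # terms pushing the score up
    negative = np.where(contrib < 0, contrib, 0.0)   # terms pushing the score down
    return positive, negative

x = np.array([2.0, 1.0, 4.0])        # one sample's feature values (illustrative)
coef = np.array([0.5, -3.0, 0.25])   # model coefficients (illustrative)

pos, neg = split_contributions(x, coef)
print(pos)                     # [1. 0. 1.]
print(neg)                     # [ 0. -3.  0.]
print(pos.sum() + neg.sum())   # -1.0
```

The positive and negative parts always sum to the linear score of the sample, so the plot shows how individual features pull the prediction in each direction.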

In [None]:
model.feature_importance()

In [None]:
model.pos_neg_prediction(0, X_test)

In [None]:
model.single_prediction(0, X_test)