# Evaluating Models on FS-Mol

`fs_mol.utils.test_utils.eval_model()` is a utility function that evaluates a model against the full set of FS-Mol test tasks, following a fixed evaluation scheme.

It requires the following arguments:
* `test_model_fn: Callable[[FSMolTaskSample, str, int], BinaryEvalMetrics]`:
  This is the core evaluation function. It takes a `FSMolTaskSample` (a sample of datapoints from a single task; see `notebooks/dataset.ipynb`), a temporary output folder in which to store scratch data, and the seed used for the sampling.
  It should return a `BinaryEvalMetrics` object containing all metrics calculated from the model output and the labels of the task sample.
* `dataset: FSMolDataset`:
  The actual dataset, which can be the full FS-Mol dataset as released, or a different set created directly through its constructor.
  This is particularly useful if you are planning to evaluate models on a set of tasks differing from FS-Mol's test set.

The returned results contain a list of evaluation metrics for each task, in a dictionary indexed by task name.
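
As a rough illustration of how such a result dictionary might be post-processed, the sketch below averages `avg_precision` per task and training set size. `StandInResult` and `summarize_avg_precision` are hypothetical names introduced only for this example; the stand-in class mimics just the `task_name`, `num_train` and `avg_precision` fields that appear in the example output at the end of this document, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class StandInResult:
    """Hypothetical stand-in for one per-sample result (only the fields used here)."""

    task_name: str
    num_train: int
    avg_precision: float


def summarize_avg_precision(
    results: Dict[str, List[StandInResult]]
) -> Dict[Tuple[str, int], Tuple[float, float]]:
    """Map (task name, train set size) to (mean, population std) of avg_precision."""
    grouped: Dict[Tuple[str, int], List[float]] = {}
    for task_name, task_results in results.items():
        for result in task_results:
            key = (task_name, result.num_train)
            grouped.setdefault(key, []).append(result.avg_precision)
    return {key: (mean(vals), pstdev(vals)) for key, vals in grouped.items()}


# Made-up values, shaped like the dictionary described above.
example = {
    "TASK_A": [StandInResult("TASK_A", 16, 0.5), StandInResult("TASK_A", 16, 0.75)],
    "TASK_B": [StandInResult("TASK_B", 16, 0.8)],
}
summary = summarize_avg_precision(example)
for (task_name, num_train), (avg, std) in sorted(summary.items()):
    print(f"{task_name} (num_train={num_train}): {avg:.3f} +/- {std:.3f}")
```

Because the helper only reads attributes, it should work on any result objects exposing those three fields, assuming they behave like the ones shown in the example output.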

`eval_model` can be further configured using the following optional arguments:
* `out_dir: Optional[str]`:
   If provided, a summary of the evaluation results will be written to this directory, as one CSV file per task.
* `seed: int`:
   A seed value used to make sampling and splitting of tasks deterministic.
   If set to anything but `0`, results of `eval_model` will be incomparable to the canonical evaluation runs.
* `num_samples: int`:
   The number of random splits to draw for the task undergoing evaluation.
* `train_set_sample_sizes: List[int]`:
   The sizes of training / support set data to consider. For each size, `num_samples` samples will be drawn for each test task, and `test_model_fn` will be called on each of them; the passed `FSMolTaskSample` will have the requested number of `train_samples`.
* `valid_size_or_ratio: Union[int, float]`:
   Number or ratio of `train_samples` in the drawn samples that will be split out into `validation_samples`. This is useful for models that require an explicit validation set during fine-tuning.
* `test_size_or_ratio: Optional[Union[int, float, Tuple[int, int]]]`:
   Number or ratio of `test_samples` that will be drawn from the original task. By default, all available samples will be used, but it may be useful to limit this when not using `eval_model` for full model evaluation.
* `fold: DataFold`:
   The fold of FS-Mol on which to perform evaluation; typically the test fold.
* `task_reader_fn: Optional[Callable[[List[RichPath], int], Iterable[FSMolTask]]]`:
   Callable allowing additional transformations on the data prior to its batching and passing through a model.
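
To get a feel for the cost of a run, the description of `train_set_sample_sizes` and `num_samples` above suggests a simple upper bound on how often `test_model_fn` is called. The helper below is a hypothetical sketch, not part of FS-Mol; it assumes every requested sample can actually be drawn, and the task count used in the example is an arbitrary placeholder.

```python
from typing import List


def max_test_model_fn_calls(
    num_tasks: int, train_set_sample_sizes: List[int], num_samples: int
) -> int:
    """Upper bound on test_model_fn calls: one per (task, train size, sample)."""
    return num_tasks * len(train_set_sample_sizes) * num_samples


# Hypothetical configuration: 100 tasks, three support set sizes, 10 samples each.
calls = max_test_model_fn_calls(
    num_tasks=100, train_set_sample_sizes=[16, 32, 64], num_samples=10
)
print(calls)
```

Restricting the configuration to a single size and a single sample per task, as in the example further down, reduces this to one call per task.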


## Example: Evaluating a Random Forest Model

```python
# Setting up local details:
import os
import sys

# This should be the location of the checkout of the FS-Mol repository:
FS_MOL_CHECKOUT_PATH = os.path.join(os.environ['HOME'], "Projects", "FS-Mol")
# This should be the location of the FS-Mol dataset:
FS_MOL_DATASET_PATH = os.path.join(os.environ['HOME'], "Datasets", "FS-Mol")

os.chdir(FS_MOL_CHECKOUT_PATH)
sys.path.insert(0, FS_MOL_CHECKOUT_PATH)
```

We first define a simple implementation of a `test_model_fn` that creates a random forest model, using the scikit-learn default implementation and parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

from fs_mol.data.fsmol_task import FSMolTaskSample
from fs_mol.utils.metrics import BinaryEvalMetrics, compute_binary_task_metrics

def test_model_fn(
    task_sample: FSMolTaskSample, temp_out_folder: str, seed: int
) -> BinaryEvalMetrics:
    train_data = task_sample.train_samples
    test_data = task_sample.test_samples

    # Convert the data into the form expected by scikit-learn:
    X_train = np.array([x.get_fingerprint() for x in train_data])
    X_test = np.array([x.get_fingerprint() for x in test_data])
    y_train = [float(x.bool_label) for x in train_data]
    y_test = [float(x.bool_label) for x in test_data]

    # Fit a random forest with default parameters on the support set:
    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # Compute test results from the predicted probability of the positive class:
    y_predicted_true_probs = model.predict_proba(X_test)[:, 1]
    test_metrics = compute_binary_task_metrics(y_predicted_true_probs, y_test)

    return test_metrics
```

Given this function, we can now evaluate the model on the test tasks in FS-Mol as follows, reducing the number of samples drawn from each task to speed up the run:

```python
from fs_mol.data.fsmol_dataset import FSMolDataset, DataFold
from fs_mol.utils.test_utils import eval_model

fsmol_dataset = FSMolDataset.from_directory(FS_MOL_DATASET_PATH)

results = eval_model(
    test_model_fn=test_model_fn,
    dataset=fsmol_dataset,
    # Restrict number of samples to one per task:
    train_set_sample_sizes=[16],
    num_samples=1,
)

results
```

{'CHEMBL1006005': [FSMolTaskSampleEvalResults(size=171, acc=0.5497076023391813, balanced_acc=0.5537766830870279, f1=0.6315789473684211, prec=0.528, recall=0.7857142857142857, roc_auc=0.5892172961138478, avg_precision=0.5489965911494983, kappa=0.10665581111337274, task_name='CHEMBL1006005', seed=0, num_train=16, num_test=171, fraction_pos_train=0.5, fraction_pos_test=0.49122807017543857)],
 'CHEMBL1066254': [FSMolTaskSampleEvalResults(size=125, acc=0.704, balanced_acc=0.7074795081967213, f1=0.7375886524822695, prec=0.65, recall=0.8524590163934426, roc_auc=0.8209528688524591, avg_precision=0.8158091691137052, kappa=0.4119516846789574, task_name='CHEMBL1066254', seed=0, num_train=16, num_test=125, fraction_pos_train=0.5, fraction_pos_test=0.488)],
 'CHEMBL1243967': [FSMolTaskSampleEvalResults(size=192, acc=0.6197916666666666, balanced_acc=0.6197916666666667, f1=0.6866952789699571, prec=0.583941605839416, recall=0.8333333333333334, roc_auc=0.6884223090277778, avg_precision=0.65188644087058