# Multiclass F1 investigation

Goal: work out which F1 score function the code actually uses, since the codebase has both a default `f1` and a multiclass variant, `multi_f1`.

In [None]:
%cd /content/drive/MyDrive/DLHProject/Danielgitrepo

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo


In [None]:
%%capture
! pip install -r requirements.txt

In [None]:
JOB_DIR = '/content/drive/MyDrive/DLHProject/jobs'

In [None]:
from transplant.evaluation import f1
from transplant.utils import read_predictions
test = read_predictions(f'{JOB_DIR}/finetune_random_cnn_original_data/test_predictions.csv')
y_true = test['y_true']
y_prob = test['y_prob']
print(f1(y_true, y_prob))

0.7180218152145887


Note that `f1` defaults to `multiclass=False`, so the call above used the non-multiclass branch!

Code for `f1`:

```python
def f1(y_true, y_prob, multiclass=False, threshold=None):
    # threshold may also be a 1d array of thresholds for each class
    if y_prob.ndim != 2:
        raise ValueError('y_prob must be a 2d matrix with class probabilities for each sample')
    if y_true.ndim == 1:  # we assume that y_true is sparse (consequently, multiclass=False)
        if multiclass:
            raise ValueError('if y_true cannot be sparse and multiclass at the same time')
        depth = y_prob.shape[1]
        y_true = _one_hot(y_true, depth)
    if multiclass:
        if threshold is None:
            threshold = 0.5
        y_pred = y_prob >= threshold
    else:
        y_pred = y_prob >= np.max(y_prob, axis=1)[:, None]
    return f1_score(y_true, y_pred, average='macro')
```
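The `_one_hot` helper is private and not shown in this notebook. As a rough sketch of what the sparse-label branch presumably does (the name `one_hot_sketch` and its body are my assumption, not the authors' code), integer class ids can be expanded into 0/1 rows like this:

```python
import numpy as np

def one_hot_sketch(labels, depth):
    """Hypothetical stand-in for the private `_one_hot` helper.

    Turns integer class ids into rows with a single 1 in the matching column.
    """
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(depth, dtype=np.uint8)[labels]

sparse_labels = np.array([2, 0, 1])          # toy class ids, not project data
dense_labels = one_hot_sketch(sparse_labels, depth=4)
print(dense_labels)
```

Our `y_true` is already one-hot (2-D), so this branch is never taken in the cells below.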

And for the multiclass variant, `multi_f1`:
```python
def multi_f1(y_true, y_prob):
    return f1(y_true, y_prob, multiclass=True, threshold=0.5)
```

Now we repeat the evaluation above with `multi_f1` instead of `f1`. It also runs, but gives a slightly lower score.



In [None]:
from transplant.evaluation import multi_f1
print(multi_f1(y_true, y_prob))

0.707702478743826
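One plausible explanation for the gap between the two scores, sketched on made-up probabilities (not the project predictions): the row-maximum rule always predicts exactly one class per sample (barring ties), while the fixed 0.5 threshold predicts nothing at all for samples where no class reaches 0.5.

```python
import numpy as np

# Toy softmax-like outputs; each row sums to 1.
toy_prob = np.array([
    [0.70, 0.10, 0.10, 0.10],  # confident: both rules pick class 0
    [0.40, 0.30, 0.20, 0.10],  # no class reaches 0.5
    [0.25, 0.45, 0.20, 0.10],  # no class reaches 0.5
])

# Rule used by f1(..., multiclass=False): mark the row maximum.
pred_rowmax = toy_prob >= np.max(toy_prob, axis=1)[:, None]

# Rule used by multi_f1: mark every class at or above the threshold.
pred_thresh = toy_prob >= 0.5

print(pred_rowmax.sum(axis=1))  # predicted classes per row under the row-max rule
print(pred_thresh.sum(axis=1))  # predicted classes per row under the 0.5 threshold
```

Rows with no predicted class count as misses for their true class, which would pull the macro F1 down. I have not checked how many such rows the real `y_prob` contains.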


Next, we look at the `is_multiclass` function, which the fine-tuning trainer uses when run with `--val-metric=f1`.

`is_multiclass` operates on the labels produced by the fine-tuning preprocessing step, so we load the preprocessed training data we generated earlier.
- Note that this training set was generated with the script the authors suggest in the finetuning readme.md

In [18]:
from transplant.utils import load_pkl

DATA_DIR = '/content/drive/MyDrive/DLHProject/data'
train = load_pkl(f"{DATA_DIR}/physionet_finetune/physionet_train.pkl")


In [22]:
from transplant.utils import is_multiclass

In [23]:
is_multiclass(train['y'])

False

`is_multiclass` returning `False` is unexpected at first, because this dataset has 4 classes.

In [25]:
print(type(train['y']))
print(train['y'].shape)

<class 'numpy.ndarray'>
(6822, 4)


In [26]:
train['y'][:5, ]

array([[0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0]], dtype=uint8)

Weird. Here's the code for `is_multiclass`:

```python
def is_multiclass(labels):
    """ Return true if this is a multiclass task otherwise false. """
    return labels.squeeze().ndim == 2 and any(labels.sum(axis=1) != 1)
```

In [27]:
train['y'].squeeze().ndim

2

In [29]:
# sum each row; the data is one-hot encoded, so every row sums to exactly 1
any(train['y'].sum(axis=1) != 1)

False

Conclusion: `is_multiclass` is a **MISNOMER**. What it actually detects is a:

> **Multi label classification task**

Note that what we call multi-class (exactly one label per sample) is a special case of multi-label classification.
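To see the behaviour directly, here is a small check on made-up label arrays (the arrays are illustrative; the function body is the expression quoted above):

```python
import numpy as np

def is_multiclass(labels):
    # Same expression as the repository helper quoted above.
    return labels.squeeze().ndim == 2 and any(labels.sum(axis=1) != 1)

# One label per sample, one-hot encoded (like our PhysioNet train['y']).
one_hot = np.array([[0, 0, 1, 0],
                    [1, 0, 0, 0]], dtype=np.uint8)

# First sample carries two labels at once: a multi-label task.
multi_label = np.array([[1, 0, 1, 0],
                        [0, 1, 0, 0]], dtype=np.uint8)

# Sparse integer class ids (1-D); the ndim check short-circuits here.
sparse = np.array([2, 0])

print(is_multiclass(one_hot))      # False
print(is_multiclass(multi_label))  # True
print(is_multiclass(sparse))       # False
```

The function only returns `True` when some row does not sum to 1, i.e. when a sample has zero or several labels, regardless of how many classes there are.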

Evidence for conclusion:

This piece of the fine-tuning trainer.py confused me a lot:

```python
        if is_multiclass(train['y']):
            activation = 'sigmoid'
            loss = tf.keras.losses.BinaryCrossentropy()
            accuracy = tf.keras.metrics.BinaryAccuracy(name='acc')
        else:
            activation = 'softmax'
            loss = tf.keras.losses.CategoricalCrossentropy()
            accuracy = tf.keras.metrics.CategoricalAccuracy(name='acc')
```

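Typically, we use sigmoid activation for a BINARY (0, 1) classification problem, so how can it be used for a multiclass problem? It isn't: a sigmoid on each output with binary cross-entropy treats every label as an independent yes/no decision, which is the usual setup for multi-label tasks, while softmax with categorical cross-entropy picks exactly one class.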
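To make the sigmoid-versus-softmax distinction concrete, here is a minimal NumPy sketch on made-up logits (plain NumPy rather than the Keras layers, so it runs without TensorFlow):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[2.0, 1.5, -1.0, -3.0]])  # toy values for one sample

p_softmax = softmax(logits)   # classes compete: the row sums to 1
p_sigmoid = sigmoid(logits)   # each class is scored independently

print(p_softmax.round(3), p_softmax.sum(axis=1))
print(p_sigmoid.round(3), (p_sigmoid >= 0.5).sum(axis=1))
```

With softmax only one class can dominate, while with sigmoid the first two classes both clear 0.5, which is what a multi-label task needs.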

I then searched the paper for 'sigmoid'. It is mentioned only once, on page 9, for a different downstream dataset (NOT the PhysioNet data we are working with), i.e. a different fine-tuning task.

This leads to the conclusion that 'multiclass' is a misnomer here, and that we should just use the `f1()` function for evaluation.

# F1 computation investigation

Goal: understand how the `f1()` function works, now that we know it is the one used for our task, not `multi_f1()`.

Why? Because in Table 1 of the paper, the authors report an average F1 score for each of the 4 classes, but `f1()` only returns a scalar, the macro average. The codebase does not appear to show how the per-class F1 is computed.

In [43]:
print(y_true.ndim)

2


Because `y_true` is already 2-D, `f1()` skips the one-hot conversion. And because `multiclass=False`, the predictions come from this line of `f1()`:

```python
y_pred = y_prob >= np.max(y_prob, axis=1)[:, None]
```

In [48]:
import numpy as np
y_pred = y_prob >= np.max(y_prob, axis=1)[:, None]
y_pred.astype(int)

array([[0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       ...,
       [0, 1, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0]])
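A quick sanity check on toy probabilities (not the project data) that this row-maximum mask is just a one-hot encoding of `argmax`, with one caveat: if a row has an exact tie for the maximum, the mask marks both classes.

```python
import numpy as np

toy_prob = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.20, 0.40, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# The expression used inside f1().
mask = toy_prob >= np.max(toy_prob, axis=1)[:, None]

# One-hot encoding of the argmax, built by indexing an identity matrix.
argmax_one_hot = np.eye(toy_prob.shape[1], dtype=bool)[np.argmax(toy_prob, axis=1)]

print(np.array_equal(mask, argmax_one_hot))  # True when there are no ties

# Caveat: an exact tie marks more than one class in the same row.
tied = np.array([[0.4, 0.4, 0.1, 0.1]])
tied_mask = tied >= np.max(tied, axis=1)[:, None]
print(tied_mask.sum(axis=1))
```

Exact ties are unlikely with floating-point softmax outputs, so in practice each row should have exactly one predicted class.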

Each row has exactly one predicted class, so this is a multiclass classification problem in the usual sense of the word (one class per sample), unlike the authors' usage. sklearn's `f1_score` receives it as one-hot indicator matrices and scores each column as its own class. [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)

To get the per-class scores, we need to call `f1_score` with `average=None`.

However, `f1()` does not expose this argument, so we write our own version below, simplified to just the part we need.

In [59]:
from sklearn.metrics import f1_score
def my_f1(y_true, y_prob, average='macro'):
    y_pred = y_prob >= np.max(y_prob, axis=1)[:, None]
    return f1_score(y_true, y_pred, average=average)

With default arguments, `my_f1` should match the authors' `f1`, which it does.

In [65]:
our_f1_result = my_f1(y_true, y_prob)
author_f1_result = f1(y_true, y_prob)
np.testing.assert_equal(our_f1_result, author_f1_result)

Next, we check that `my_f1` can output per-class F1 scores.

In [62]:
f1_by_class = my_f1(y_true, y_prob, average=None)
f1_by_class

array([0.64367816, 0.89545241, 0.68888889, 0.6440678 ])

In [55]:
train['classes']

array(['A', 'N', 'O', '~'], dtype=object)

In [63]:
dict(zip(train['classes'], f1_by_class))

{'A': 0.6436781609195402,
 'N': 0.8954524144397561,
 'O': 0.6888888888888889,
 '~': 0.6440677966101694}
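As a cross-check on these numbers: the macro average is the unweighted mean of the per-class scores, so the four values above should average to the scalar printed by `f1()` earlier. The sketch below also includes a hand-rolled per-class F1 (my own helper, `per_class_f1`, not sklearn's implementation) to show the formula F1 = 2·TP / (2·TP + FP + FN) on toy labels.

```python
import numpy as np

def per_class_f1(y_true, y_pred):
    """Hand-rolled per-class F1 for one-hot / indicator matrices (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)    # predicted and true
    fp = (~y_true & y_pred).sum(axis=0)   # predicted but not true
    fn = (y_true & ~y_pred).sum(axis=0)   # true but not predicted
    denom = 2 * tp + fp + fn
    # A class with no true and no predicted samples is scored 0 here.
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

# Toy example: true classes [0, 1, 1, 2], predicted classes [0, 1, 2, 2].
toy_true = np.eye(3, dtype=int)[[0, 1, 1, 2]]
toy_pred = np.eye(3, dtype=int)[[0, 1, 2, 2]]
toy_scores = per_class_f1(toy_true, toy_pred)
print(toy_scores, toy_scores.mean())

# Per-class values copied from the output above.
reported = np.array([0.6436781609195402, 0.8954524144397561,
                     0.6888888888888889, 0.6440677966101694])
reported_macro = reported.mean()
print(reported_macro)  # compare with the scalar from f1(y_true, y_prob)
```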

## Conclusion

We are able to write an F1 function that outputs the per-class F1 scores. The relative magnitudes of the scores also appear to mostly align with the baseline F1 scores in Table 1.

Note that "~" appears to refer to the Noisy class, as A (AF), N (Normal), O (Other) are already taken.

> We can probably paste this `my_f1` into the project report! Don't have to use author `f1` code!

# Model investigation

Goal: load the ResNet18 model and see what it looks like in summary form.

In [35]:
import tensorflow as tf

from finetuning.utils import ecg_feature_extractor

Here's the relevant snippet of the function definition

```python
def ecg_feature_extractor(arch=None, stages=None):
    if arch is None or arch == 'resnet18':
        resnet = ResNet(num_outputs=None,
                        blocks=(2, 2, 2, 2)[:stages],
                        kernel_size=(7, 5, 5, 3),
                        include_top=False)
    # ...                        
    feature_extractor = tf.keras.Sequential([
        resnet,
        tf.keras.layers.GlobalAveragePooling1D()
    ])
    return feature_extractor   
```

Now we manually copy the fine-tuning model construction from the finetune trainer.py here:

In [36]:
model = ecg_feature_extractor('resnet18')
num_classes = 4
activation = 'softmax'
model.add(tf.keras.layers.Dense(num_classes, activation=activation))
inputs = tf.keras.layers.Input(train['x'].shape[1:], dtype=train['x'].dtype)
model(inputs)

<KerasTensor: shape=(None, 4) dtype=float32 (created by layer 'sequential_4')>

In [40]:
model.count_params()

4494532

In [41]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 res_net_4 (ResNet)          (None, 512, 512)          4492480   
                                                                 
 global_average_pooling1d_4  (None, 512)               0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense (Dense)               (None, 4)                 2052      
                                                                 
Total params: 4494532 (17.15 MB)
Trainable params: 4484932 (17.11 MB)
Non-trainable params: 9600 (37.50 KB)
_________________________________________________________________
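The numbers in the summary can be checked with simple arithmetic. The sketch below uses only values printed above; the 4 bytes per parameter is my assumption (float32 weights, consistent with the `dtype=float32` output tensor).

```python
feature_dim = 512   # output width of GlobalAveragePooling1D, from the summary
num_classes = 4

# Dense layer: one weight per (feature, class) pair plus one bias per class.
dense_params = feature_dim * num_classes + num_classes

resnet_params = 4_492_480            # from the summary
total_params = resnet_params + dense_params

trainable, non_trainable = 4_484_932, 9_600  # from the summary

bytes_per_param = 4                  # assumption: float32
total_mib = total_params * bytes_per_param / 1024**2
non_trainable_kib = non_trainable * bytes_per_param / 1024

print(dense_params, total_params)
print(trainable + non_trainable == total_params)
print(round(total_mib, 2), non_trainable_kib)
```

The non-trainable parameters are presumably batch-normalization moving statistics inside the ResNet, though the summary does not break them down.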
