# Tutorial

## Install the library and basic imports

If `peerannot` is not installed yet, run the following cell first.

```bash
pip install peerannot
```

In [1]:
import numpy as np
from pathlib import Path

In [2]:
DIR = Path().cwd()
DIRc10h = (DIR / ".." / "datasets" / "cifar10H").resolve()
DIR_module = DIRc10h / "cifar10h.py"
print(DIRc10h)

/home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H


## Install dataset

In [3]:
! peerannot install -h

Usage: peerannot install [OPTIONS] PATH

  Install dataset from `.py` file

Options:
  -h, --help  Show this message and exit.

  Each dataset is a folder with:

      - name.py: python file containing how to download and format data
      - answers.json: json file containing each task voted labels
      - metadata.json: all metadata for dataset, at least the name, n_task and n_classes
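
To make the folder layout above concrete, here is a minimal sketch that writes a toy `answers.json` and `metadata.json` to a temporary directory. The nested `{task: {worker: label}}` layout is inferred from how `answers.json` is read later in this tutorial. The helper `write_toy_dataset` and the dataset name `toy-dataset` are hypothetical, and real datasets may store more metadata fields.

```python
import json
import tempfile
from pathlib import Path


def write_toy_dataset(folder):
    """Write a toy answers.json and metadata.json into `folder` (hypothetical helper)."""
    folder = Path(folder)
    # Assumed layout: task id -> {worker id -> voted label}
    answers = {
        "0": {"0": 2, "1": 2, "2": 1},
        "1": {"0": 0, "2": 0},
        "2": {"1": 1},
    }
    # The help message asks for at least the name, n_task and n_classes
    metadata = {
        "name": "toy-dataset",  # hypothetical name
        "n_task": len(answers),
        "n_classes": 3,
    }
    (folder / "answers.json").write_text(json.dumps(answers, indent=2))
    (folder / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return answers, metadata


with tempfile.TemporaryDirectory() as tmp:
    answers, metadata = write_toy_dataset(tmp)
    # Read the votes back to check that the file round-trips
    reloaded = json.loads((Path(tmp) / "answers.json").read_text())
    print(metadata)
    print(reloaded == answers)
```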


In [4]:
! peerannot install $DIR_module

Loading data folders at /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H
Files already downloaded and verified
Files already downloaded and verified
100%|███████████████████████████████████| 10000/10000 [00:09<00:00, 1098.74it/s]
100%|███████████████████████████████████| 50000/50000 [00:45<00:00, 1105.28it/s]
Created:
- train: /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/train
- val: /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/val
- test: /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/test
Handling crowdsourced labels
Task: 100%|███████████████████████████████| 10000/10000 [04:22<00:00, 38.13it/s]
Train crowd labels are in /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/answers.json
Train crowd labels (validation set) are in /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/answers_valid.json


# Aggregate labels with majority voting

Majority voting assigns to each task $i$ the label $k$ that receives the most votes among the workers $j$ who answered it:
$$\hat y_i = \arg\max_{k} \sum_{j:j \text{ answered }i} 1\!\!1\{y_{i}^{(j)}=k\}$$
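
The formula can be sketched in a few lines of NumPy. This is an illustration on toy data, not the code that `peerannot aggregate -s MV` runs: the function `majority_vote` and the `toy_answers` dictionary are hypothetical, and ties are broken here in favor of the smallest class index, which may differ from the library's behavior.

```python
import numpy as np


def majority_vote(answers, n_classes):
    """Return the most voted label of each task, tasks sorted by numeric id."""
    tasks = sorted(answers, key=int)
    counts = np.zeros((len(tasks), n_classes), dtype=int)
    for row, task in enumerate(tasks):
        for vote in answers[task].values():
            counts[row, int(vote)] += 1  # indicator 1{y_i^(j) = k}
    # np.argmax returns the smallest class index in case of a tie
    return counts.argmax(axis=1)


# Toy votes, assumed layout: task id -> {worker id -> label}
toy_answers = {
    "0": {"0": 2, "1": 2, "2": 1},
    "1": {"0": 0, "2": 0},
    "2": {"1": 1, "3": 0, "4": 1},
}
y_hat = majority_vote(toy_answers, n_classes=3)
print(y_hat)  # [2 0 1]
```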

In [5]:
! peerannot aggregate -h

Usage: peerannot aggregate [OPTIONS] [DATASET]

  Aggregate crowdsourced labels stored in the provided directory

Options:
  -s, --strategy TEXT   Aggregation strategy to compute estimated labels from
  --hard                Only consider hard labels even if the strategy
                        produces soft labels  [default: False]
  --metadata_path PATH  Path to the metadata of the dataset if different than
                        default
  --answers-file TEXT   Name (with json extension) of the path to the
                        crowdsourced labels
  --path-remove PATH    Path to file of index to prune from the training set
  -h, --help            Show this message and exit.

  All aggregated labels are stored in the associated dataset directory with
  the strategy name


In [6]:
! peerannot aggregate $DIRc10h -s MV

Running aggregation mv with options {}
Aggregated labels stored at /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/labels/labels_cifar-10h_mv.npy with shape (9500,)


## Dataset API loader

In [8]:
from peerannot.runners.train import load_all_data
labels_path = DIRc10h / "labels" / "labels_cifar-10h_mv.npy"
trainset, valset, testset = load_all_data(
    DIRc10h, labels_path, path_remove=None, labels=labels_path, img_size=32, data_augmentation=False)

Loading datasets
Accuracy on aggregation: 99.232%
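
The `Accuracy on aggregation` line presumably reports how often the aggregated training labels agree with the ground truth. That reading is an assumption about what `load_all_data` computes. Under it, the metric can be sketched as follows, with a hypothetical helper `aggregation_accuracy` and made-up labels; soft labels are reduced to hard ones with an argmax.

```python
import numpy as np


def aggregation_accuracy(aggregated, true_labels):
    """Fraction of tasks whose aggregated label equals the true label."""
    aggregated = np.asarray(aggregated)
    if aggregated.ndim == 2:  # soft labels of shape (n_task, n_classes)
        aggregated = aggregated.argmax(axis=1)
    return float(np.mean(aggregated == np.asarray(true_labels)))


hard = np.array([2, 0, 1, 1])
soft = np.array(
    [[0.1, 0.2, 0.7], [0.9, 0.1, 0.0], [0.3, 0.6, 0.1], [0.5, 0.4, 0.1]]
)
truth = np.array([2, 0, 1, 0])
print(aggregation_accuracy(hard, truth))  # 0.75
print(aggregation_accuracy(soft, truth))  # 1.0
```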


## Dataset CLI load and train

In [None]:
! peerannot train -h

In [9]:
labels_path = DIRc10h / "labels" / "labels_cifar-10h_mv.npy"
num_epochs = 10  # not used by the next command, which hardcodes --n-epochs=150

In [12]:
! peerannot train {DIRc10h} -o cifar10H_MV -K 10\
    --labels {labels_path} --model resnet18 --img-size=32\
    --n-epochs=150 --lr=0.1 --scheduler -m 50 -m 100 \
    --num-workers 8

Running the following configuration:
----------
- Data at /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H will be saved with prefix cifar10H_MV
- number of classes: 10
- labels: /home/tlefort/Documents/peerannot/peerannot/datasets/cifar10H/labels/labels_cifar-10h_mv.npy
- model: resnet18
- img_size: 32
- n_epochs: 150
- lr: 0.1
- scheduler: True
- milestones: (50, 100)
- num_workers: 8
- optimizer: SGD
- metadata_path: None
- data_augmentation: False
- path_remove: None
- pretrained: False
- momentum: 0.9
- decay: 0.0005
- n_params: 3072
- lr_decay: 0.1
- batch_size: 64
- freeze: False
----------
Loading datasets
Accuracy on aggregation: 99.232%
Train set: 9500 tasks
Test set: 50000 tasks
Validation set: 500 tasks
Using cache found in /home/tlefort/.cache/torch/hub/pytorch_vision_main
Using cache found in /home/tlefort/.cache/torch/hub/pytorch_vision_main
Using cache found in /home/tlefort/.cache/torch/hub/pytorch_vision_main
Successfully loaded resnet18 with n_classes=10

# Aggregate into soft labels with Dawid and Skene

In [None]:
! peerannot aggregate $DIRc10h -s DS

In [None]:
labels_path = DIRc10h / "labels" / "labels_cifar-10h_ds.npy"
num_epochs = 10

In [None]:
! peerannot train $DIRc10h -o cifar10H_DS -K 10\
    --labels $labels_path --model resnet18 --img-size=32\
    --n-epochs=$num_epochs --lr=0.1 --scheduler -m 50 -m 100 \
    --num-workers 8
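
For readers unfamiliar with the Dawid and Skene model, the sketch below shows a bare-bones EM loop that estimates one confusion matrix per worker and a soft label per task. It is a simplified illustration and not the `peerannot` implementation, which may differ in initialization, stopping rule and numerical details. The function `dawid_skene` and the toy `votes` are hypothetical, and the code assumes task and worker ids are contiguous integers starting at 0, with at least one vote per task.

```python
import numpy as np


def dawid_skene(votes, n_classes, n_iter=50):
    """Minimal EM for Dawid and Skene; votes is a list of (task, worker, label)."""
    votes = np.asarray(votes)
    n_task = votes[:, 0].max() + 1
    n_worker = votes[:, 1].max() + 1
    # counts[i, j, l] = 1 if worker j answered label l on task i
    counts = np.zeros((n_task, n_worker, n_classes))
    counts[votes[:, 0], votes[:, 1], votes[:, 2]] = 1.0
    # Initialize the soft labels T with normalized vote counts
    T = counts.sum(axis=1)
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prevalence rho and confusion matrix pi[j] of each worker
        rho = T.mean(axis=0)
        pi = np.einsum("ik,ijl->jkl", T, counts) + 1e-10
        pi /= pi.sum(axis=2, keepdims=True)
        # E-step: posterior over the true label, computed in log space
        log_T = np.log(rho + 1e-10) + np.einsum("ijl,jkl->ik", counts, np.log(pi))
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T /= T.sum(axis=1, keepdims=True)
    return T, pi


# Toy data: workers 0 and 1 always agree, worker 2 disagrees on tasks 0 and 3
votes = [
    (0, 0, 0), (0, 1, 0), (0, 2, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 1),
    (2, 0, 0), (2, 1, 0), (2, 2, 0),
    (3, 0, 1), (3, 1, 1), (3, 2, 0),
]
T, pi = dawid_skene(votes, n_classes=2)
print(T.argmax(axis=1))  # [0 1 0 1]
```

Each row of `T` is a soft label: it sums to one and can be passed to a training loss that accepts probability vectors, or reduced to a hard label with an argmax.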

# Task ambiguity identification

In [None]:
! peerannot identificationinfo

In [None]:
! peerannot identify -h

In [3]:
path_votes = DIRc10h / "answers.json"

In [None]:
! peerannot identify $DIRc10h -K 10 --method WAUM --labels $path_votes\
    --model resnet18 --n-epochs 0 --lr=0.1 --img-size=32 \
    --maxiter-DS=50

In [4]:
from peerannot.models.WAUM import WAUM
import json
with open(path_votes, "r") as f:
    answers = json.load(f)
waum = WAUM(answers, 10, n_workers=2571, n_epoch=1)

In [5]:
waum.run()

Dawid and Skene:   0%|          | 0/60 [00:00<?, ?it/s]

In [21]:
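def get_probas(waum):
    """Get the soft label distribution of each task.

    Each vote is weighted by the diagonal entry of the voting worker's
    confusion matrix; tasks listed in waum.too_hard keep an all-zero row.

    :return: Weighted label frequency for each task in D_pruned
    :rtype: numpy.ndarray(n_task, n_classes)
    """
    baseline = np.zeros((len(waum.answers), waum.n_classes))
    waum.answers = dict(sorted(waum.answers.items()))
    for task_id, tt in enumerate(list(waum.answers.keys())):
        if tt not in waum.too_hard[:, 1]:  # skip pruned tasks
            task = waum.answers[tt]
            for worker, vote in task.items():
                # Add the worker's diagonal confusion entry for the voted class
                baseline[task_id, int(vote)] += waum.pi[
                    waum.ds.converter.table_worker[int(worker)]
                ][int(vote), int(vote)]
    waum.baseline = baseline
    totals = baseline.sum(axis=1, keepdims=True)
    # Normalize only rows with a nonzero sum; this avoids the divide-by-zero
    # warning on pruned tasks and leaves their rows at zero
    return np.divide(
        baseline, totals, out=np.zeros_like(baseline), where=totals != 0
    )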
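
To see what `get_probas` computes, the following self-contained sketch applies the same weighting to toy data: each vote for class $k$ adds the diagonal entry $\pi^{(j)}_{k,k}$ of the voting worker's confusion matrix to the task's score, and the scores are then normalized per task. The names `weighted_soft_labels`, `toy_answers` and `toy_pi` are hypothetical and the confusion matrices are made up. Worker ids are assumed to index `toy_pi` directly, without the `table_worker` lookup used above.

```python
import numpy as np


def weighted_soft_labels(answers, pi, n_classes):
    """Soft labels from votes weighted by each worker's diagonal confusion entry."""
    tasks = sorted(answers, key=int)
    scores = np.zeros((len(tasks), n_classes))
    for row, task in enumerate(tasks):
        for worker, vote in answers[task].items():
            k = int(vote)
            scores[row, k] += pi[int(worker)][k, k]
    totals = scores.sum(axis=1, keepdims=True)
    # Rows without votes stay at zero instead of producing NaN
    return np.divide(scores, totals, out=np.zeros_like(scores), where=totals > 0)


toy_answers = {"0": {"0": 0, "1": 1}, "1": {"0": 1, "1": 1}, "2": {}}
toy_pi = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # worker 0
    [[0.6, 0.4], [0.3, 0.7]],  # worker 1
])
soft = weighted_soft_labels(toy_answers, toy_pi, n_classes=2)
print(soft)
```

On this toy input, task 0 gets scores 0.9 and 0.7, hence the soft label (0.5625, 0.4375); task 1 gets all its mass on class 1; task 2 has no votes and keeps a zero row.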

In [22]:
labs = get_probas(waum)
np.save(DIRc10h / "labels" / "labels_waum_0.01.npy", labs)


In [23]:
from peerannot.runners.train import load_all_data
labels_path = DIRc10h / "labels" / "labels_waum_0.01.npy"
trainset, valset, testset = load_all_data(
    DIRc10h, labels_path, path_remove=None, labels=labels_path, img_size=32, data_augmentation=False)

Loading datasets
Accuracy on aggregation: 98.389%
