# Blood cells classification
This project aims to solve an image classification task with Toloka.

### Challenge

We chose the [BCCD Dataset](https://github.com/Shenggan/BCCD_Dataset) for this task. This dataset is under [MIT license](https://github.com/Shenggan/BCCD_Dataset/blob/master/LICENSE).

BCCD Dataset is a small-scale dataset for blood cell detection. The cell types are Eosinophil, Lymphocyte, Monocyte, and Neutrophil. The microscope images are grouped into four folders, one per cell type.

Check out [this example](https://www.kaggle.com/brsdincer/classify-blood-cell-subtypes-all-process) of a machine learning model for cell classification based on this dataset. 

We will also implement cell classification on images using the Toloka crowdsourcing platform. We are going to use a slightly reworked BCCD [dataset](https://www.kaggle.com/paultimothymooney/blood-cells).


### Context
The diagnosis of blood-based diseases often involves identifying and characterizing patients' blood samples. Automated methods to detect and classify blood cell types have important medical applications.


### Description
To solve this task, we will create a project using [Toloka-Kit](https://toloka.github.io/toloka-kit). 

In this project, we will show an image of a blood cell and a brief instruction for Toloka performers. Then, we will ask performers to choose which type of cell they see on this image:

1 - Eosinophil

2 - Lymphocyte

3 - Monocyte

4 - Neutrophil
<table  align="center">
  <tr><td>
    <img src="./manual/img/eosinophil_ex2.jpg"
         alt="Sample blood cell image"  width="500">
  </td></tr>
  <tr><td align="center">
    <b>Picture 1.</b> Eosinophil
  </td></tr>
</table>


<table  align="center">
  <tr><td>
    <img src="./manual/manual.png"
         alt="Manual"  width="500">
  </td></tr>
  <tr><td align="center">
    <b>Picture 2.</b> Instruction
  </td></tr>
</table>

This task usually requires special domain knowledge about blood cells, but we can solve it with crowdsourcing.

To prepare our performers, we wrote <a href="./manual/manual.html" target="_blank">detailed instructions</a> and implemented the project pipeline: Training → Exam → Main tasks.

### Set up the environment

First, you'll need to register in Toloka as a requester. Learn more about this step in the [documentation](https://yandex.com/support/toloka-requester/concepts/access.html).

In our example, we are using the [production](https://toloka.yandex.com) environment of Toloka, but you can also use the [sandbox](https://yandex.com/support/toloka-requester/concepts/sandbox.html).

The second step is to obtain your [OAuth token](https://yandex.com/dev/toloka/doc/concepts/access.html#access__token).
A detailed description of these actions can be found in the example [learn the basics](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb).

The next step is to configure S3 storage for the images. For example, you can use [Yandex.Cloud](https://cloud.yandex.com) or [Amazon.AWS](https://aws.amazon.com/s3/).
You will also need to create an access key:
- In [Yandex](https://cloud.yandex.com/docs/iam/operations/sa/create-access-key)
- In [Amazon](https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateAccessKey.html)

> **Tip:** In [Yandex.Cloud](https://cloud.yandex.com) make sure your s3 service account is configured with an [admin role](https://cloud.yandex.com/en-ru/docs/iam/operations/sa/assign-role-for-sa), and you have the full rights to access it.

Then, register on the **Kaggle** platform and download the dataset:
1. Log in or sign up on [Kaggle](https://www.kaggle.com). 
2. Go to **Account**. 
3. Click **Create New API Token**. This action will generate a JSON file with your username and the key.

In [None]:
!pip install toloka-kit==0.1.5
!pip install crowd-kit==0.0.3
!pip install pandas
!pip install boto3
!pip install ipyplot
!pip install kaggle

In [None]:
import datetime
import os
import random
import time
import uuid
from zipfile import ZipFile

import boto3
import ipyplot
import pandas

import toloka.client as toloka
import toloka.client.project.template_builder as tb
from crowdkit.aggregation import MajorityVote, DawidSkene

Let's connect to s3 and create a bucket (a logical entity that stores objects). If you configured the bucket manually, just enter the bucket name into the following code:

In [None]:
key_id='' # Enter your key id
access_key=''  # Enter your access key
bucket_name = ''  # Enter your bucket name if you have one, or leave it empty
s3_url = 'https://storage.yandexcloud.net'

session = boto3.session.Session()
s3 = session.client(
    service_name='s3',
    endpoint_url=s3_url,
    aws_access_key_id=key_id,
    aws_secret_access_key=access_key
)

if bucket_name == '':
    bucket_name = f'blood-crowd-test-{uuid.uuid4().hex}'
    response = s3.create_bucket(ACL='public-read', Bucket=bucket_name)
    print(response['Location'])
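
The later cells build public links to uploaded objects by joining the endpoint, the bucket name, and the object key. Here is a minimal sketch of that convention. `object_url` is a hypothetical helper (it is not part of boto3), and it assumes path-style addressing on a publicly readable bucket. Unlike the plain f-strings used in this notebook, it also percent-encodes the key, which only matters if a file name contains spaces or other special characters.

```python
from urllib.parse import quote


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style public URL for an object (hypothetical helper)."""
    # Percent-encode the key but keep '/' so that folder-like prefixes survive.
    return f"{endpoint.rstrip('/')}/{bucket}/{quote(key, safe='/')}"


# 'my-bucket' and the key are made-up values for illustration.
print(object_url('https://storage.yandexcloud.net', 'my-bucket', 'train/cell 1.jpeg'))
```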

Let's set up a connection to the Toloka [production](https://toloka.yandex.com) or [sandbox](https://sandbox.toloka.yandex.com) environment.

In [None]:
token = ''  # Enter your production or sandbox OAuth-token

toloka_client = toloka.TolokaClient(token, 'PRODUCTION') # Or switch to 'SANDBOX' instance
requester = toloka_client.get_requester()
print(requester)

Let's download the dataset using the **Kaggle API** and unpack it.

In [None]:
os.environ['KAGGLE_USERNAME'] = ''  # "username" from kaggle.json
os.environ['KAGGLE_KEY'] = ''  # "key" from kaggle.json
!kaggle datasets download -d paultimothymooney/blood-cells
with ZipFile('blood-cells.zip', 'r') as archive:
    archive.extractall('archive')

Let's prepare the basic settings: 
- Where data is located.
- A list of possible cell types.
- How many pictures we want to annotate.

In [None]:
data_dir = './archive/dataset-master/dataset-master/JPEGImages/'
test_dir = './archive/dataset2-master/dataset2-master/images/TEST_SIMPLE/'

typecells = ['EOSINOPHIL', 'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHIL']
tests_count = 4 # The number of control tasks per cell type: 4 * 4 cell types = 16 in total
tasks_count = None # The number of tasks we want to annotate. If None is specified, we annotate everything

Let's prepare the UI text descriptions for all fields that we need.

In [None]:
# Toloka text setting
project_name = '🔬Blood cell type detection'
project_description = 'Look at the picture and determine which type of blood cell is presented.'
project_label = 'Which blood cell is presented on the photo?'
project_namecells = ['1 - EOSINOPHIL', '2 - LYMPHOCYTE', '3 - MONOCYTE', '4 - NEUTROPHIL']

exam_skill_name = f'{project_name} (Exam)'
exam_skill_description = f'How performer passed the exam on the project {project_name}'
quality_skill_name = f'{project_name} (Quality)'
quality_skill_description = f'How performer completed the tasks on the project {project_name}'

train_pool_name = 'Training'
exam_pool_name = 'Exam'
exam_public_description = 'Take the exam first to gain access to the paid tasks.'

Let's prepare some tips. We will use them to teach the performer how to identify cell types.

In [None]:
hint_dict = {
    'EOSINOPHIL': '1 - Eosinophil. The cell nucleus is divided into two parts by a pinch in the middle. Red and orange dots are visible in the pink layer.', 
    'LYMPHOCYTE': '2 - Lymphocyte. Large cell nucleus and no grain. Small in size.',
    'MONOCYTE': '3 - Monocyte. The cell nucleus is not divided into fragments. It is also dark, slightly elongated, and looks like a bean. Quite large in size.',
    'NEUTROPHIL': '4 - Neutrophil. The cell nucleus is divided into several (2-4) unequal parts connected by pinches. Granularity is well expressed within the pink layer.',
}

### Prepare the instructions

>**Note:** We strongly recommend checking the task interface and instructions every time you create a project. With an intuitive interface and clear instructions, performers will complete the task correctly, and you will get useful results.

In [None]:
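# Read the instruction and close the file right away
with open('manual/manual.html', encoding='utf-8') as manual_file:
    public_instruction = manual_file.read().strip()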

for image in os.listdir('manual/img/'):
    s3.upload_file(f'manual/img/{image}', bucket_name, f'manual/img/{image}')

public_instruction = public_instruction.replace('./img/', f'{s3_url}/{bucket_name}/manual/img/')

s3.upload_file('manual/manual.png', bucket_name, 'manual/manual.png')
manualpng = f'{s3_url}/{bucket_name}/manual/manual.png'

In [None]:
public_instruction = open('manual/manual.html').read().strip()
public_instruction = public_instruction.replace('./img/', f'{s3_url}/{bucket_name}/manual/img/')

manualpng = f'{s3_url}/{bucket_name}/manual/manual.png'

### Create a project

At this step we need to: 
- Configure how performers will see the task.
- Upload instructions.
- Define the format of input and output data.

Learn more about [projects](https://yandex.com/support/toloka-requester/concepts/project.html) in Toloka.

In [None]:
# How performers will see the task
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='result'),
    label=project_label,
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(label=cell_name, value=cell_type)
        for cell_name, cell_type in zip(project_namecells, typecells)        
    ]
)

project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                tb.view.ImageViewV1(url=tb.data.InputData(path='image'), max_width=500),
                tb.view.ImageViewV1(url=manualpng, max_width=500),
                radio_group_field,
            ]
        ),
        plugins=[
            tb.plugins.HotkeysPluginV1(
                **{
                    f'key_{i+1}': tb.actions.SetActionV1(data=tb.data.OutputData(path='result'),payload=cell_type)
                    for i, cell_type in enumerate(typecells)
                }
            ),
            tb.plugins.TolokaPluginV1(
                layout = tb.plugins.TolokaPluginV1.TolokaPluginLayout(
                    kind='scroll', 
                    task_width=500,
                )
            ),
        ]
    )
)

# Set up the project
markup_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=project_name,
    public_description=project_description,
    public_instructions=public_instruction,
    # Set up the task: view, input, and output parameters
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'image': toloka.project.field_spec.StringSpec()
        },
        output_spec={
            'result': toloka.project.field_spec.StringSpec(allowed_values=typecells)
        },
        view_spec=project_interface,
    ),
)

# Call the API to create a new project
markup_project = toloka_client.create_project(markup_project)
print(f'Created markup project with id {markup_project.id}')
print(f'To view the project, go to https://toloka.yandex.com/requester/project/{markup_project.id}')
# For Sandbox environment use:
# print(f'To view the project, go to https://sandbox.toloka.yandex.com/requester/project/{markup_project.id}')
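
The hotkeys plugin in the cell above receives its arguments through a dictionary comprehension, which can be hard to read at first glance. The sketch below reproduces only the shape of that comprehension, with plain strings standing in for the `SetActionV1` objects, to show which key ends up selecting which cell type.

```python
typecells = ['EOSINOPHIL', 'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHIL']

# Same comprehension shape as in the plugin configuration; the values here are
# plain strings instead of SetActionV1 objects.
hotkeys = {f'key_{i + 1}': cell_type for i, cell_type in enumerate(typecells)}

for key, cell_type in hotkeys.items():
    print(key, '->', cell_type)
```

So pressing `1` to `4` selects the options in the same order as they are listed in the radio group.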

### Set up skills for performers
A [skill](https://yandex.com/support/toloka-requester/concepts/nav.html) can be any characteristic of the performer. It is described by a number from 0 to 100. For example, you can record the percentage of correct responses in a skill.

We will use two skills in our project:
- **Quality skill** shows how well the performer answered control tasks in the main pool.

- **Exam skill** shows how well the performer passed the exam tasks.
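
To make the "percentage of correct responses" idea concrete, here is a minimal sketch of how such a value could be computed from a performer's answers to tasks with known solutions. `correct_answers_rate` is a hypothetical function written for illustration; Toloka computes the value on its side, and its exact rounding rules are not covered here.

```python
def correct_answers_rate(given, expected):
    """Return the share of matching answers on a 0-100 scale, or None if empty."""
    if len(given) != len(expected):
        raise ValueError('given and expected must have the same length')
    if not given:
        return None  # No answers yet, so there is nothing to measure
    matches = sum(g == e for g, e in zip(given, expected))
    return 100 * matches / len(given)


# Made-up example: 15 control tasks, 12 of them answered correctly.
expected = ['EOSINOPHIL'] * 15
given = ['EOSINOPHIL'] * 12 + ['MONOCYTE'] * 3
print(correct_answers_rate(given, expected))
```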


In [None]:
exam_skill = next(toloka_client.get_skills(name=exam_skill_name), None)
if exam_skill:
    print('Exam skill already exists')
else:
    print('Create new exam skill')
    exam_skill = toloka_client.create_skill(
        name=exam_skill_name,
        hidden=True,
        private_comment=exam_skill_description,
    )

quality_skill = next(toloka_client.get_skills(name=quality_skill_name), None)
if quality_skill:
    print('Quality skill already exists')
else:
    print('Create new quality skill')
    quality_skill = toloka_client.create_skill(
        name=quality_skill_name,
        hidden=True,
        private_comment=quality_skill_description,
    )

### Add a training
With [training pools](https://yandex.com/support/toloka-requester/concepts/train.html), performers can practice before getting started. While completing training, performers will see a hint if they answered incorrectly. 

You can also provide access to the main task pool only to those performers who passed the training.

Let's create a training pool:

In [None]:
train_pool = toloka.training.Training(
    project_id=markup_project.id,
    private_name=train_pool_name,
    may_contain_adult_content=False,
    assignment_max_duration_seconds=60*20,
    mix_tasks_in_creation_order=False,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=15,
    task_suites_required_to_pass=7,
    retry_training_after_days=1,
)

train_pool = toloka_client.create_training(train_pool)
print(f'Created "{train_pool.private_name}" training with id {train_pool.id}')

Let's now add tasks to the training. Don't forget to define the hints.

In [None]:
training_tasks = []

for cell_type in typecells:
    dir_path = f'{test_dir}/{cell_type}/'
    test_images_list = os.listdir(dir_path)
    random.shuffle(test_images_list)
    count = tests_count if len(test_images_list) > tests_count else len(test_images_list)
    for image in test_images_list[:count]:
        s3.upload_file(f'{dir_path}{image}', bucket_name, f'train/{image}')
        training_tasks.append(
            toloka.task.Task(
                input_values={'image': f'{s3_url}/{bucket_name}/train/{image}'},
                known_solutions = [toloka.task.BaseTask.KnownSolution(output_values={'result': cell_type})],
                message_on_unknown_solution = hint_dict[cell_type],
                pool_id=train_pool.id,
                infinite_overlap=True,
            )
        )
created_training_tasks = toloka_client.create_tasks(training_tasks, toloka.task.CreateTasksParameters(allow_defaults=True))
print(f'{len(created_training_tasks.items)} tasks added to the pool {train_pool.id}')

### Create an exam
Performers who answered at least 50% of the training tasks correctly are allowed to take the exam. In the exam, performers complete control tasks, and their exam skill is set from the share of correct answers. Performers with an exam skill of 80 or higher receive access to the main (paid) tasks.

In [None]:
from toloka.client.collectors import AssignmentSubmitTime, GoldenSet
from toloka.client.actions import RestrictionV2, SetSkillFromOutputField
from toloka.client.conditions import (
    FastSubmittedCount,
    GoldenSetCorrectAnswersRate,
    RuleConditionKey,
    TotalAnswersCount,
)

exam_pool = toloka.pool.Pool(
    project_id=markup_project.id,
    private_name=exam_pool_name,
    public_description=exam_public_description,
    may_contain_adult_content=False,
    type='EXAM',
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.00,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*10,
    defaults=toloka.pool.Pool.Defaults(
         default_overlap_for_new_task_suites=99,
         default_overlap_for_new_tasks=None,
    ),
)

# Set the number of tasks per page
exam_pool.set_mixer_config(
    real_tasks_count=0, 
    golden_tasks_count=15,
    training_tasks_count=0,
    min_golden_tasks_count=15,
    mix_tasks_in_creation_order=False,
    shuffle_tasks_in_task_suite=True,
)

exam_pool.filter = (
    toloka.filter.FilterOr([toloka.filter.Languages.in_('EN')]) &
    toloka.filter.FilterOr([
        toloka.filter.ClientType == 'BROWSER',
        toloka.filter.ClientType == 'TOLOKA_APP'
    ])
)

exam_pool.set_training_requirement(
    training_pool_id=train_pool.id,
    training_passing_skill_value=50
)

exam_pool.quality_control.add_action(
    collector=AssignmentSubmitTime(fast_submit_threshold_seconds=7),
    conditions=[FastSubmittedCount > 0],
    action=RestrictionV2(
        scope='PROJECT',
        duration_unit='PERMANENT',
        private_comment='Fast responses'
    )
)

exam_pool.quality_control.add_action(
    collector=GoldenSet(history_size=15),
    conditions=[TotalAnswersCount >= 14],
    action=SetSkillFromOutputField(
        skill_id=exam_skill.id,
        from_field=RuleConditionKey('correct_answers_rate')
    )
)

exam_pool = toloka_client.create_pool(exam_pool)
print(f'Created "{exam_pool.private_name}" pool with id {exam_pool.id}')

Let's now add tasks to the exam.

In [None]:
exam_tasks = []

for cell_type in typecells:
    dir_path = f'{test_dir}/{cell_type}/'
    test_images_list = os.listdir(dir_path)
    random.shuffle(test_images_list)
    count = tests_count if len(test_images_list) > tests_count else len(test_images_list)
    for image in test_images_list[:count]:
        s3.upload_file(f'{dir_path}{image}', bucket_name, f'exam/{image}')
        exam_tasks.append(
            toloka.task.Task(
                input_values={'image': f'{s3_url}/{bucket_name}/exam/{image}'},
                known_solutions = [
                    toloka.task.BaseTask.KnownSolution(output_values={'result': cell_type})
                ],
                pool_id=exam_pool.id,
                infinite_overlap=True,
            )
        )

created_exam_tasks = toloka_client.create_tasks(exam_tasks)
print(f'{len(created_exam_tasks.items)} tasks added to the pool {exam_pool.id}')

### Create a pool with main tasks
Trusted performers (who successfully completed training and exam) will annotate the real data in the main [pool](https://yandex.com/support/toloka-requester/concepts/pool-main.html).

In [None]:
blood_pool = toloka.pool.Pool(
    project_id=markup_project.id,
    private_name=project_name+' - test for adding tasks',
    may_contain_adult_content=False,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.01,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*20,
    defaults=toloka.pool.Pool.Defaults(
         default_overlap_for_new_task_suites=1,
         default_overlap_for_new_tasks=10,
    ),
)

# Set the number of tasks per page
blood_pool.set_mixer_config(
    real_tasks_count=8, 
    golden_tasks_count=2,
    training_tasks_count=0,
    min_real_tasks_count=1,
    min_golden_tasks_count=2
)

blood_pool.filter = (
    toloka.filter.FilterOr([toloka.filter.Languages.in_('EN')]) &
    toloka.filter.FilterOr([toloka.filter.Skill(exam_skill.id) >= 80]) &
    toloka.filter.FilterOr([
        toloka.filter.ClientType == 'BROWSER',
        toloka.filter.ClientType == 'TOLOKA_APP'
    ]) &
    toloka.filter.FilterOr([
        toloka.filter.Skill(quality_skill.id) >= 80,
        toloka.filter.Skill(quality_skill.id) == None
    ])
)

blood_pool.quality_control.add_action(
    collector=AssignmentSubmitTime(fast_submit_threshold_seconds=10),
    conditions=[FastSubmittedCount > 0],
    action=RestrictionV2(
        scope='PROJECT',
        duration_unit='PERMANENT',
        private_comment='Fast responses'
    )
)

blood_pool.quality_control.add_action(
    collector=GoldenSet(history_size=100),
    conditions=[GoldenSetCorrectAnswersRate < 80],
    action=RestrictionV2(
        scope='PROJECT',
        duration_unit='PERMANENT',
        private_comment='Wrong honeypot'
    )
)

blood_pool.quality_control.add_action(
    collector=GoldenSet(history_size=100),
    conditions=[TotalAnswersCount >= 3],
    action=SetSkillFromOutputField(
        skill_id=quality_skill.id,
        from_field=RuleConditionKey('correct_answers_rate')
    )
)

blood_pool = toloka_client.create_pool(blood_pool)
print(f'Created "{blood_pool.private_name}" pool with id {blood_pool.id}')
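
The skill part of the pool filter above can be read as a simple predicate. The sketch below restates it in plain Python; `can_do_main_tasks` is a hypothetical function, `None` stands for "the performer does not have this skill yet", and the language and client type conditions are left out. It assumes that a missing exam skill does not satisfy the `>= 80` condition.

```python
from typing import Optional


def can_do_main_tasks(exam_skill: Optional[float], quality_skill: Optional[float]) -> bool:
    """Mirror the skill conditions of the main pool filter (illustration only)."""
    # The exam must be passed with a skill of at least 80.
    if exam_skill is None or exam_skill < 80:
        return False
    # Newcomers have no quality skill yet; others must keep it at 80 or higher.
    return quality_skill is None or quality_skill >= 80


print(can_do_main_tasks(exam_skill=93, quality_skill=None))  # passed the exam, no main tasks yet
print(can_do_main_tasks(exam_skill=93, quality_skill=50))    # quality dropped below the threshold
```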

Let's add tasks to the main pool.

> **Note:** There are some corner cases in this dataset: when the image shows several cells or no cells at all. When doing your annotations, such cases must be kept in mind and handled separately. We'll drop them from the input for the sake of simplicity. 

Let's prepare a ground truth `pandas.DataFrame`. We will compare the final annotations with ground truths later on.

In [None]:
# Upload ground truth annotations
ground_truth_df = pandas.read_csv('archive/dataset-master/dataset-master/labels.csv', sep=',')
ground_truth_df = ground_truth_df[['Image', 'Category']]
ground_truth_df = ground_truth_df.rename(columns = {'Image':'task','Category':'ground_truth'})

prefix = f'{s3_url}/{bucket_name}/task/BloodImage_'
ground_truth_df['task'] = ground_truth_df['task'].apply(lambda x: f'{prefix}{str(x).zfill(5)}.jpg')

ground_truth_df.set_index('task', inplace=True)
print(ground_truth_df)
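
The cell above turns an image number from `labels.csv` into the URL of the uploaded image, and the next cell keeps only images whose label is exactly one of the four cell types. Here is a small sketch of both steps in plain Python. `task_url` and `is_single_known_cell` are hypothetical helpers, and the prefix and label values are made up for illustration.

```python
typecells = ['EOSINOPHIL', 'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHIL']


def task_url(prefix, image_number):
    """Pad the image number to five digits, as in BloodImage_00007.jpg."""
    return f'{prefix}{str(image_number).zfill(5)}.jpg'


def is_single_known_cell(category):
    """True only when the label is exactly one of the four cell types."""
    return category in typecells


prefix = 'https://example.invalid/my-bucket/task/BloodImage_'  # made-up prefix
print(task_url(prefix, 7))
print(is_single_known_cell('NEUTROPHIL'))              # a single known cell
print(is_single_known_cell('NEUTROPHIL, EOSINOPHIL'))  # several cells in one image
print(is_single_known_cell(None))                      # no label at all
```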

In [None]:
real_tasks = []

# Add golden tasks with known solutions
for cell_type in typecells:
    dir_path = f'{test_dir}/{cell_type}/'
    test_images_list = os.listdir(dir_path)
    for image in test_images_list:
        s3.upload_file(f'{dir_path}{image}', bucket_name, f'golden_task/{image}')
        real_tasks.append(
            toloka.task.Task(
                input_values={'image': f'{s3_url}/{bucket_name}/golden_task/{image}'},
                known_solutions = [toloka.task.BaseTask.KnownSolution(output_values={'result': cell_type})],
                pool_id=blood_pool.id,
                infinite_overlap=True,
            )
        )

# Add main tasks
images_list = os.listdir(f'{data_dir}')
count = len(images_list) if tasks_count is None or tasks_count > len(images_list) else tasks_count
for image in images_list[:count]:
    image_url = f'{s3_url}/{bucket_name}/task/{image}'
    if image_url not in ground_truth_df.index or ground_truth_df.loc[image_url, :]['ground_truth'] not in typecells:
        # Skip corner cases (several cells or no cells) instead of stopping the whole loop
        continue

    s3.upload_file(f'{data_dir}{image}', bucket_name, f'task/{image}')
    real_tasks.append(
        toloka.task.Task(
            input_values={'image': image_url},
            pool_id=blood_pool.id,
        )
    )
created_tasks = toloka_client.create_tasks(real_tasks, toloka.task.CreateTasksParameters(allow_defaults=True))
print(f'{len(created_tasks.items)} tasks added to the pool {blood_pool.id}')
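
Before opening the pool, it helps to estimate what the labeling will cost. The sketch below is a rough lower bound based on the pool settings used above: 8 real tasks per page, an overlap of 10, and a reward of 0.01 per page. `estimate_budget` is a hypothetical helper. It assumes every page is paid once per overlap, and it ignores the platform fee and any extra assignments issued for other reasons.

```python
import math


def estimate_budget(n_tasks, real_per_page=8, overlap=10, reward_per_assignment=0.01):
    """Return (number of paid assignments, total reward) for the main pool."""
    pages = math.ceil(n_tasks / real_per_page)  # the last page may be partially filled
    assignments = pages * overlap               # each page is shown to `overlap` performers
    return assignments, round(assignments * reward_per_assignment, 2)


# 400 is a made-up number of images, not the size of the dataset.
assignments, total = estimate_budget(400)
print(f'{assignments} assignments, about ${total} before fees')
```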

### Run the pools
We will do the following:
1. Run training, exam, and the main pool.
2. Wait for the main pool to complete.
3. Stop all the other pools in the project.
4. Get the results.

In [None]:
def wait_pool_for_close(pool, timeout_minutes=5):
    sleep_time = 60*timeout_minutes
    pool = toloka_client.get_pool(pool.id)
    while not pool.is_closed():
        print(
            f'{datetime.datetime.now().strftime("%H:%M:%S")} '
            f'Pool {pool.id} has status {pool.status}.'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)

toloka_client.open_pool(train_pool.id)
toloka_client.open_pool(exam_pool.id)
toloka_client.open_pool(blood_pool.id)

# Wait for the pools to close
print('\nWaiting for the main pool to close')
wait_pool_for_close(blood_pool)
print(f'Pool "{blood_pool.private_name}" is finally closed!')

toloka_client.close_pool(train_pool.id)
print(f'Pool "{train_pool.private_name}" is closed!')

toloka_client.close_pool(exam_pool.id)
print(f'Pool "{exam_pool.private_name}" is closed!')
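
Note that `wait_pool_for_close` keeps polling until the pool closes, however long that takes. If you prefer to give up after some time, the loop can be given a deadline. Below is a minimal, library-independent sketch. `wait_until` is a hypothetical helper, and the status check is passed in as a function, so with Toloka it could be `lambda: toloka_client.get_pool(blood_pool.id).is_closed()`.

```python
import time


def wait_until(is_done, poll_seconds, max_wait_seconds, sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the deadline passes.

    Returns True if the condition was met and False on timeout.
    """
    deadline = clock() + max_wait_seconds
    while not is_done():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True


# Demo with a stand-in for the pool status check: "closed" on the third poll.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), poll_seconds=0, max_wait_seconds=5))
```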

### Get the results

We will collect the answers in a `pandas.DataFrame`, aggregate them with the Dawid-Skene method, and print the result.

In [None]:
answers_df = toloka_client.get_assignments_df(blood_pool.id)

answers_df = answers_df[answers_df['GOLDEN:result'].isnull()].copy()
answers_df = answers_df[['INPUT:image','OUTPUT:result','ASSIGNMENT:worker_id']]
answers_df = answers_df.rename(columns = {'INPUT:image':'task','OUTPUT:result':'label','ASSIGNMENT:worker_id':'performer'})

# Dawid-Skene aggregation
ds_labels = DawidSkene(n_iter=20).fit_predict(answers_df)
# Join the aggregated labels with the ground truth so the two can be compared
result = pandas.concat([ground_truth_df, ds_labels], axis=1).rename(columns = {0:'ds_label'})

result = result.drop(result[result.ds_label.isnull()].index)
print(result)
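
To see what aggregation does with overlapping answers, here is a small self-contained sketch. It deliberately swaps in plain majority voting, which is simpler than the Dawid-Skene method used above, and then measures how often the aggregated label matches the ground truth. `majority_vote` and `accuracy` are hypothetical helpers, and the answers are made up.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """answers: iterable of (task, performer, label). Returns {task: label}."""
    votes = defaultdict(Counter)
    for task, _performer, label in answers:
        votes[task][label] += 1
    # On a tie, the label that was seen first wins.
    return {task: counter.most_common(1)[0][0] for task, counter in votes.items()}


def accuracy(predicted, ground_truth):
    """Share of tasks whose predicted label equals the ground truth label."""
    shared = [task for task in predicted if task in ground_truth]
    if not shared:
        return None
    return sum(predicted[t] == ground_truth[t] for t in shared) / len(shared)


answers = [
    ('img1', 'p1', 'NEUTROPHIL'), ('img1', 'p2', 'NEUTROPHIL'), ('img1', 'p3', 'EOSINOPHIL'),
    ('img2', 'p1', 'LYMPHOCYTE'), ('img2', 'p2', 'LYMPHOCYTE'), ('img2', 'p3', 'MONOCYTE'),
]
truth = {'img1': 'NEUTROPHIL', 'img2': 'MONOCYTE'}

labels = majority_vote(answers)
print(labels)
print(accuracy(labels, truth))
```

The same comparison can be done on the `result` DataFrame by checking where the `ground_truth` and `ds_label` columns agree.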

### Summary

This project shows how to perform image classification with the help of Toloka performers. For more examples, visit our [Toloka-Kit usage examples](https://github.com/Toloka/toloka-kit/tree/main/examples) page.