# Kili Tutorial: Create a project with the Imagenette dataset

In this tutorial, we will create a Kili Image Classification project and import the Imagenette dataset (https://github.com/fastai/imagenette).

For an overview of Kili, visit https://kili-technology.com. You can also check out the Kili documentation at https://cloud.kili-technology.com/docs.

The tutorial is divided into three parts:

1. Downloading the dataset
2. Creating the project and importing the assets
3. Importing the labels

The next cell connects the notebook to the Kili API. Before running it, set your `api_key` credential; the code reads it from the `KILI_USER_API_KEY` environment variable.

In [None]:
import os
#!pip install kili
from kili.client import Kili

api_key = os.getenv('KILI_USER_API_KEY')

kili = Kili(api_key=api_key)

# Path where the dataset will be extracted
data_path = "./data"
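When the `KILI_USER_API_KEY` environment variable is not set, `os.getenv` returns `None`. A minimal sketch of an explicit check that produces a clearer error message is shown here; the helper name `require_env` is hypothetical and not part of the Kili client.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes the key was exported before the notebook was started):
# api_key = require_env("KILI_USER_API_KEY")
```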

## Downloading the dataset

Depending on the image size you want, uncomment the corresponding URL. By default, images are downloaded at full size.

The '320 px' and '160 px' versions have their shortest side resized to that size, with their aspect ratio maintained.

In [None]:
import requests
import tarfile

#Full size images
url = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz"
#320px images
#url = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz"
#160px images
#url = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz"

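response = requests.get(url, stream=True)
response.raise_for_status()  # stop early if the download failed
os.makedirs(data_path, exist_ok=True)
# Stream the archive straight into tarfile, without saving the .tgz to disk
with tarfile.open(fileobj=response.raw, mode="r|gz") as archive:
    archive.extractall(path=data_path)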
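To make the "shortest side resized, aspect ratio maintained" rule for the 320 px and 160 px versions concrete, here is a small sketch of the arithmetic. The function name `resize_shortest_side` and the rounding to the nearest pixel are assumptions for illustration; the dataset authors may have rounded differently.

```python
def resize_shortest_side(width, height, target):
    """Scale (width, height) so the shortest side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


# A 500x375 landscape image in the 320 px version: the height (shortest side) becomes 320
print(resize_shortest_side(500, 375, 320))
# A 375x500 portrait image in the 160 px version: the width (shortest side) becomes 160
print(resize_shortest_side(375, 500, 160))
```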

## Creating the Kili project and importing the assets

First, we create a project with a Classification task and introduce the 10 classes present in the Imagenette dataset.

In [None]:
json_interface ={
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "TENCH": {"name": "Tench"},
                    "ENGLISH_SPRINGER": {"name": "English springer"},
                    "CASETTE_PLAYER": {"name": "Cassette player"},
                    "CHAIN_SAW": {"name": "Chain saw"},
                    "CHURCH": {"name": "Church"},
                    "FRENCH_HORN": {"name": "French horn"},
                    "GARBAGE_TRUCK": {"name": "Garbage truck"},
                    "GAS_PUMP": {"name": "Gas pump"},
                    "GOLF_BALL": {"name": "Golf ball"},
                    "PARACHUTE": {"name": "Parachute"}},
                "input": "radio"
            },
            "required": 1,
            "isChild": False,
            "instruction": "Category"
        }
    }
}

project_id = kili.create_project(
        title='Imagenette',
        description='Classification on 10 classes on the Imagenette dataset',
        input_type='IMAGE',
        json_interface=json_interface
)['id']
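The `categories` block above maps an upper-case key to a display name. As a sketch, the same structure can be generated from a list of display names. The helper `build_categories` is hypothetical, and the keys it produces must match the ones used later when uploading labels. Note that the tutorial spells one key `CASETTE_PLAYER`, which this helper would not reproduce.

```python
def build_categories(display_names):
    """Map display names such as 'French horn' to {'FRENCH_HORN': {'name': 'French horn'}}."""
    return {
        name.upper().replace(" ", "_"): {"name": name}
        for name in display_names
    }


categories = build_categories(["Tench", "English springer", "French horn"])
print(categories)
```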

In the next cell, we build the list of asset descriptions (external ID, file path, and metadata) that we will pass to the API to import the assets.

In [None]:
assets = []
path = os.path.join(data_path, 'imagenette2/train')

for dir in os.listdir(path):
    if dir.startswith('n'):
        complete_path = os.path.join(path, dir)
        for file in os.listdir(complete_path):
            if os.path.isfile(os.path.join(complete_path, file)):
                assets.append({
                    'externalId': file,
                    'content': os.path.join(complete_path, file),
                    'metadata': {},
                })
print(f"Total assets count: {len(assets)}")
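The same directory walk can be written as a reusable function with `pathlib`, which also makes it easy to try on a small folder first. This is a sketch, not the tutorial's own code; the function name `collect_assets` is hypothetical, and sorting is added only to make the order deterministic.

```python
from pathlib import Path


def collect_assets(train_dir):
    """List the files found in the class folders (names starting with 'n') of train_dir."""
    assets = []
    for class_dir in sorted(Path(train_dir).iterdir()):
        # Skip anything that is not a class folder such as 'n01440764'
        if not (class_dir.is_dir() and class_dir.name.startswith("n")):
            continue
        for image_path in sorted(class_dir.iterdir()):
            if image_path.is_file():
                assets.append({
                    "externalId": image_path.name,
                    "content": str(image_path),
                    "metadata": {},
                })
    return assets


# Usage, assuming the archive was extracted under ./data:
# assets = collect_assets("./data/imagenette2/train")
```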


We can now import all the assets into the Kili project. We do it in chunks of 10 images to avoid overloading the server.

In [None]:
from tqdm import tqdm
import time

CHUNK_SIZE = 10

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

for asset_chunk in tqdm(list(chunks(assets, CHUNK_SIZE))):
    external_id_array = [a.get('externalId') for a in asset_chunk]
    content_array = [a.get('content') for a in asset_chunk]
    json_metadata_array = [a.get('metadata') for a in asset_chunk]
    kili.append_many_to_dataset(
        project_id=project_id,
        content_array=content_array,
        external_id_array=external_id_array,
        json_metadata_array=json_metadata_array,
    )
    time.sleep(0.2)
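To see what the chunking does, the generator can be run on a plain list. This standalone sketch repeats the helper so it runs on its own; with 25 items and a chunk size of 10, the last chunk holds the 5 remaining items.

```python
def chunks(items, n):
    """Yield successive n-sized chunks from items; the last chunk may be shorter."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


sizes = [len(chunk) for chunk in chunks(list(range(25)), 10)]
print(sizes)  # [10, 10, 5]
```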

## Importing the labels

Now we add the corresponding label to each asset. First, we read the CSV file containing all the labels. The column *noisy_labels_0* holds the original ImageNet label, and the other columns hold labels with *x* percent of noise. For example, in *noisy_labels_50*, 50% of the labels have been randomly changed to a wrong one.

In [None]:
import pandas as pd

df = pd.read_csv(os.path.join(data_path, 'imagenette2/noisy_imagenette.csv'))
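# Keep only the training rows; copy to avoid pandas' SettingWithCopyWarning
df_train = df[df['is_valid'] == False].copy()
# The external ID is the file name, i.e. the last component of the path
df_train['externalId'] = df_train['path'].str.split('/').str[-1]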
df_train
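The sketch below illustrates on a few made-up rows how the external ID is derived from the path and how the noise level of a column can be measured. It uses the standard-library `csv` module rather than pandas so that it runs on its own; the file names and label values in `SAMPLE_CSV` are invented for illustration and only some of the CSV's columns are shown.

```python
import csv
import io

# Made-up sample rows following the column layout used in this tutorial
SAMPLE_CSV = """path,noisy_labels_0,noisy_labels_50,is_valid
train/n01440764/a.JPEG,n01440764,n01440764,False
train/n01440764/b.JPEG,n01440764,n03028079,False
train/n03028079/c.JPEG,n03028079,n03028079,False
val/n03028079/d.JPEG,n03028079,n03028079,True
"""

# Keep the training rows only (csv reads every value as a string)
rows = [r for r in csv.DictReader(io.StringIO(SAMPLE_CSV)) if r["is_valid"] == "False"]

# The external ID is the file name at the end of the path
external_ids = [r["path"].split("/")[-1] for r in rows]

# Fraction of training rows whose noisy label differs from the original one
changed = sum(r["noisy_labels_0"] != r["noisy_labels_50"] for r in rows)
noise_fraction = changed / len(rows)

print(external_ids, noise_fraction)
```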

In the next cell, we upload the labels to Kili. The variable *upload_noisy* is set to True here, meaning we upload two labels for each asset: the original one as a **Review** label and the one with a 50% chance of being wrong as a **Default** label. If it is set to False, only the original label is uploaded, as a **Default** label.

In [None]:
name_to_cat = {
  'n01440764': 'TENCH',
  'n02102040': 'ENGLISH_SPRINGER',
  'n02979186': 'CASETTE_PLAYER',
  'n03000684': 'CHAIN_SAW',
  'n03028079': 'CHURCH',
  'n03394916': 'FRENCH_HORN',
  'n03417042': 'GARBAGE_TRUCK',
  'n03425413': 'GAS_PUMP',
  'n03445777': 'GOLF_BALL',
  'n03888257': 'PARACHUTE',
}

upload_noisy = True

for index, row in df_train.iterrows():
  time.sleep(0.2)

  asset = kili.assets(project_id=project_id, fields=['id', 'externalId'], external_id_contains=[row['externalId']])[0]
  label_name = row['noisy_labels_0']
  label_name_noisy = row['noisy_labels_50']
  label = {
      "CLASSIFICATION_JOB": {
          "categories": [{"name": name_to_cat[label_name]}]
      }
  }
  if not upload_noisy:
    kili.append_to_labels(
        json_response=label,
        label_asset_id=asset['id'],
        label_type='DEFAULT'
    )
  else:
    label_noisy = {
        "CLASSIFICATION_JOB": {
            "categories": [{"name": name_to_cat[label_name_noisy]}]
        }
    }
    kili.append_to_labels(
        json_response=label,
        label_asset_id=asset['id'],
        label_type='REVIEW'
    )
    kili.append_to_labels(
        json_response=label_noisy,
        label_asset_id=asset['id'],
        label_type='DEFAULT'
    )
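The two label dictionaries built in the loop have the same shape and differ only in the class identifier. A small sketch of a helper that builds this payload is given below; the name `make_label` is hypothetical, and the mapping shown is a two-entry excerpt of `name_to_cat`.

```python
def make_label(wnid, wnid_to_category, job_name="CLASSIFICATION_JOB"):
    """Build the JSON response for one classification label from a class identifier."""
    return {job_name: {"categories": [{"name": wnid_to_category[wnid]}]}}


# Two-entry excerpt of the mapping used in the tutorial
mapping = {"n01440764": "TENCH", "n03028079": "CHURCH"}
label = make_label("n03028079", mapping)
print(label)
```

An identifier missing from the mapping raises a `KeyError`, which surfaces a mismatch between the CSV and the project's categories before anything is uploaded.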

You can now delete the *data* folder if you want.