# Installing Requirements + Setup
Goal of this Hackathon: **Classify digits (0-9) from a diverse dataset of handwritten, printed, and billboard text**

In [1]:
!pip install datasets transformers torch
!sudo apt install zstd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!cp drive/MyDrive/dataset.tar.zst ./
!tar --use-compress-program=unzstd -xvf dataset.tar.zst

Streaming output truncated to the last 5000 lines.
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/6483.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/4294.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/5834.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/2183.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/11902.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/7574.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/8647.png
Users/parin/PycharmProjects/ImageProcessingHackathon/hackathon-online-2022-image-processing/IM_Test/IM_Test/12984.p

# Preprocessing Dataset + Augmentation


In [None]:
import glob
import random
from PIL import Image, ImageOps

files = glob.glob("hackathon-online-2022-image-processing/**/*.png", recursive=True)

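size = 70

for _ in range(20):
    # Preview the transformations on a random sample of images
    choice = random.choice(files)
    print(choice)
    img = Image.open(choice)
    # Resize to fit within size x size while keeping the aspect ratio
    img = ImageOps.contain(img, (size, size), Image.Resampling.LANCZOS)
    img = ImageOps.autocontrast(img)

    # Pad to a square canvas, anchoring the image at the top-left corner
    img = ImageOps.pad(img, (size, size), centering=(0, 0))

    img = ImageOps.grayscale(img)

    img = ImageOps.autocontrast(img)
    display(img)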


## Rationale
The provided images differ in:
- size
- contrast
- color
- brightness

To improve the performance of our model, we make the images as uniform as possible:
- Make the sizes the same with a "contain" resize, which keeps the aspect ratio, then add padding so every image has the same dimensions.
- Adjust the contrast so that the digits stand out from their background.
- Convert the image to grayscale = **Image Version 1**
- Invert the grayscale image (black to white & white to black) = **Image Version 2**

The last two steps help with test-set images, where some end up with white text and others with black text.

Because we save both Image Version 1 and Image Version 2, the final result is a dataset that is both preprocessed and augmented.
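To make the contrast and inversion steps concrete, here is a minimal pure-Python sketch of what they do to a handful of grayscale pixel values. The helper names `autocontrast` and `invert` below are illustrative stand-ins written for this example, not Pillow's implementation; Pillow's `ImageOps.autocontrast` may round slightly differently.

```python
def autocontrast(pixels):
    """Linearly stretch values so the darkest becomes 0 and the brightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def invert(pixels):
    """Swap light and dark: 0 becomes 255 and 255 becomes 0."""
    return [255 - p for p in pixels]


# Hypothetical row of pixels: a dark digit stroke (60, 55) on a light background
dark_on_light = [200, 190, 60, 55, 195]

version_1 = autocontrast(dark_on_light)  # Image Version 1
version_2 = invert(version_1)            # Image Version 2

print(version_1)
print(version_2)
```

After the stretch, the background and the stroke sit at opposite ends of the 0-255 range, and the inverted copy covers the opposite polarity, so the model sees both dark-on-light and light-on-dark versions of every digit.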

In [None]:
IMG_SIZE = 224
failed = []


def transform_images(files):
    """Preprocess every image in place and save an inverted copy next to it."""
    files_num = len(files)
    for id_file, file in enumerate(files):
        try:
            image = Image.open(file)
            # Fit within IMG_SIZE x IMG_SIZE while keeping the aspect ratio
            image = ImageOps.contain(image, (IMG_SIZE, IMG_SIZE), Image.Resampling.LANCZOS)
            image = ImageOps.autocontrast(image)
            image = ImageOps.grayscale(image)
            image = ImageOps.autocontrast(image)
            # Image Version 2: swap light and dark
            inverted = ImageOps.invert(image)

            # Pad to a square canvas, anchoring the image at the top-left corner
            image = ImageOps.pad(image, (IMG_SIZE, IMG_SIZE), centering=(0, 0))
            inverted = ImageOps.pad(inverted, (IMG_SIZE, IMG_SIZE), centering=(0, 0))

            # The model expects three-channel input
            inverted = inverted.convert("RGB")
            image = image.convert("RGB")

            inverted.save(f"{file}_inverted.png")
            image.save(file)

        except ValueError:
            # Keep track of files that could not be processed
            failed.append(file)
        print(f"{id_file + 1} out of {files_num} File: {file.strip()}", end="\r")


transform_images(files)
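One thing to watch for: the glob pattern `**/*.png` also matches the `*_inverted.png` files this cell writes, so re-globbing and running the cell a second time would process those copies again. A small sketch of a filter that could guard against this is shown below; `INVERTED_SUFFIX`, `is_original`, and the sample paths are hypothetical names introduced only for this illustration.

```python
# Suffix appended by the f"{file}_inverted.png" naming used in transform_images
INVERTED_SUFFIX = "_inverted.png"


def is_original(path):
    """True for source images, False for inverted copies written earlier."""
    return not path.endswith(INVERTED_SUFFIX)


def inverted_path(path):
    """Path under which the inverted copy of `path` is saved."""
    return f"{path}{INVERTED_SUFFIX}"


# Hypothetical result of re-running glob after a first preprocessing pass
found = [
    "train/3/12.png",
    "train/3/12.png_inverted.png",
    "train/7/5.png",
]

originals = [f for f in found if is_original(f)]
print(originals)
```

Passing only `originals` to `transform_images` would keep a second run from inverting the inverted copies.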

# Loading Processed Dataset

In [2]:
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="./hackathon-online-2022-image-processing/train")

Resolving data files:   0%|          | 0/146514 [00:00<?, ?it/s]




In [3]:
data = dataset["train"].train_test_split(test_size=0.1)

In [4]:
data

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 131862
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 14652
    })
})
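The row counts above can be reproduced by hand. The sketch below assumes the split rounds the test share up to the next whole example (the convention scikit-learn uses); that assumption is consistent with the printed numbers but is not taken from the `datasets` source.

```python
import math

n_total = 146514   # files resolved by load_dataset
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to train
n_test = math.ceil(n_total * test_size)
n_train = n_total - n_test

print(n_train, n_test)
```

Because no seed is passed, the split is different on every run; `train_test_split` accepts a `seed` argument if a reproducible split is needed.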

# Initialize Training
I chose ViT (Vision Transformer) because it was one of the state-of-the-art image classification models at the time.
I picked the variant pretrained on ImageNet-21k in the hope of benefiting from transfer learning. Unfortunately, there was not enough time to validate this hypothesis.
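For orientation, the checkpoint name `vit-large-patch16-224` encodes the input geometry. The following is a small worked calculation, not code from the model, of how a 224x224 image becomes a token sequence; the hidden size of 1024 is the value reported in the model config.

```python
image_size = 224    # input resolution in pixels
patch_size = 16     # side length of each square patch
num_channels = 3    # RGB input
hidden_size = 1024  # embedding width of the "large" variant, per the config

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1  # one extra [CLS] token used for classification
values_per_patch = patch_size * patch_size * num_channels

print(patches_per_side, num_patches, sequence_length, values_per_patch)
```

This is also why the preprocessing step pads every image to exactly 224x224 and converts it to RGB.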

In [21]:
from transformers import ViTFeatureExtractor, ViTForImageClassification
model_name_or_path = 'google/vit-large-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name_or_path)

labels = data['train'].features['label'].names

model = ViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)}
)

loading configuration file drive/MyDrive/vit/checkpoint-20610/preprocessor_config.json
Feature extractor ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}

loading configuration file drive/MyDrive/vit/checkpoint-20610/config.json
Model config ViTConfig {
  "_name_or_path": "google/vit-large-patch16-224-in21k",
  "architectures": [
    "ViTForImageClassification"
  ],
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 1024,
  "id2label": {
    "0": "0",
    "1": "1",
    "2": "2",
    "3": "3",
    "4": "4",
    "5": "5",
    "6": "6",
    "7": "7",
    "8": "8",
    "9": "9"
  },
  "image_size": 224,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "0": "0",
    "1": "1",
    "2": 

In [10]:
labels

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [11]:
feature_extractor

ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}
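With `image_mean` and `image_std` both set to 0.5, normalization maps pixel values into the range [-1, 1]. The sketch below works through that arithmetic for a single channel value, assuming 8-bit pixels are first scaled to [0, 1]; `normalize` is a helper written for this example, not part of the library.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Scale an 8-bit pixel to [0, 1], then apply (x - mean) / std."""
    return (pixel / 255 - mean) / std


for pixel in (0, 186, 188, 251, 255):
    print(pixel, round(normalize(pixel), 4))
```

Under this reading, a value of -1.0000 in `pixel_values` corresponds to a black pixel, which is what the padding added during preprocessing would produce if it uses the default black fill.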

In [12]:
def process_example(example):
    inputs = feature_extractor(example['image'], return_tensors='pt')
    inputs['label'] = example['label']
    return inputs

In [13]:
process_example(data["train"][0])

{'pixel_values': tensor([[[[ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          [ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          [ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          ...,
          [ 0.9686,  0.9686,  0.9686,  ..., -1.0000, -1.0000, -1.0000],
          [ 1.0000,  1.0000,  1.0000,  ..., -1.0000, -1.0000, -1.0000],
          [ 1.0000,  1.0000,  1.0000,  ..., -1.0000, -1.0000, -1.0000]],

         [[ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          [ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          [ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          ...,
          [ 0.9686,  0.9686,  0.9686,  ..., -1.0000, -1.0000, -1.0000],
          [ 1.0000,  1.0000,  1.0000,  ..., -1.0000, -1.0000, -1.0000],
          [ 1.0000,  1.0000,  1.0000,  ..., -1.0000, -1.0000, -1.0000]],

         [[ 0.4588,  0.4745,  0.4745,  ..., -1.0000, -1.0000, -1.0000],
          [ 0

In [12]:
def transform(example_batch):
    # Take a list of PIL images and turn them to pixel values
    inputs = feature_extractor([x for x in example_batch['image']], return_tensors='pt')

    # Don't forget to include the labels!
    inputs['labels'] = example_batch['label']
    return inputs

In [13]:
prepared_data = data.with_transform(transform)

In [14]:
import torch
import numpy as np
from datasets import load_metric

def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['labels'] for x in batch])
    }


metric = load_metric("accuracy")
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
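As a sanity check on what `compute_metrics` reports, here is a dependency-free sketch of the same idea: take the index of the largest logit per row as the prediction, then count matches against the labels. The function names and the toy logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy batch: 4 examples, 3 classes
toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 1.5, 0.2],    # predicts class 1
    [0.0, 0.2, 0.9],    # predicts class 2
    [1.1, 0.4, 0.3],    # predicts class 0
]
toy_labels = [0, 1, 2, 2]

print(accuracy(toy_logits, toy_labels))
```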



In [16]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./vit-digit-recognition",
  per_device_train_batch_size=64,
  evaluation_strategy="epoch",
  save_strategy="epoch",
  num_train_epochs=15,
  fp16=True,
  logging_steps=10,
  learning_rate=2e-4,
  save_total_limit=2,
  warmup_ratio=0.1,
  remove_unused_columns=False,
  push_to_hub=False,
  metric_for_best_model="accuracy",
  report_to='tensorboard',
  load_best_model_at_end=True,
)
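These settings determine the step counts that appear in the training log and in the checkpoint folder names. The calculation below is a sketch that assumes a single device, no gradient accumulation, and that the warmup length is rounded up; it uses the 131862 training examples from the split.

```python
import math

num_examples = 131862
batch_size = 64
num_epochs = 15
warmup_ratio = 0.1

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumption: warmup steps are the ratio of total steps, rounded up
warmup_steps = math.ceil(total_steps * warmup_ratio)

# With save_strategy="epoch", checkpoint-N is written after N optimizer steps
checkpoint_epoch = {
    step: step // steps_per_epoch
    for step in (2061, 4122, 6183, 16488, 20610)
}

print(steps_per_epoch, total_steps, warmup_steps)
print(checkpoint_epoch)
```

Read this way, `checkpoint-16488` would be the end of epoch 8 and `checkpoint-20610` the end of epoch 10.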

In [22]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_data["train"],
    eval_dataset=prepared_data["test"],
    tokenizer=feature_extractor,
)

Using cuda_amp half precision backend


In [19]:
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** Running training *****
  Num examples = 131862
  Num Epochs = 15
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 30915


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2429 | 0.2457 | 0.928952 |
| 2 | 0.2314 | 0.231313 | 0.931272 |
| 3 | 0.2186 | 0.208114 | 0.936391 |
| 4 | 0.156 | 0.170679 | 0.95045 |
| 5 | 0.0761 | 0.162963 | 0.955228 |
| 6 | 0.0648 | 0.131258 | 0.963896 |
| 7 | 0.0948 | 0.139897 | 0.962531 |
| 8 | 0.0319 | 0.11136 | 0.971335 |
| 9 | 0.0284 | 0.111867 | 0.971471 |
| 10 | 0.0181 | 0.111137 | 0.973655 |


***** Running Evaluation *****
  Num examples = 14652
  Batch size = 8
Saving model checkpoint to ./vit-digit-recognition/checkpoint-2061
Configuration saved in ./vit-digit-recognition/checkpoint-2061/config.json
Model weights saved in ./vit-digit-recognition/checkpoint-2061/pytorch_model.bin
Feature extractor saved in ./vit-digit-recognition/checkpoint-2061/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 14652
  Batch size = 8
Saving model checkpoint to ./vit-digit-recognition/checkpoint-4122
Configuration saved in ./vit-digit-recognition/checkpoint-4122/config.json
Model weights saved in ./vit-digit-recognition/checkpoint-4122/pytorch_model.bin
Feature extractor saved in ./vit-digit-recognition/checkpoint-4122/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 14652
  Batch size = 8
Saving model checkpoint to ./vit-digit-recognition/checkpoint-6183
Configuration saved in ./vit-digit-recognition/checkpoint-6183/config.json
Model weights s

KeyboardInterrupt: ignored

In [20]:
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

NameError: ignored

In [23]:
metrics = trainer.evaluate(prepared_data["test"])

***** Running Evaluation *****
  Num examples = 14652
  Batch size = 8


In [24]:
trainer.log_metrics("eval", metrics)

***** eval metrics *****
  eval_accuracy           =     0.9737
  eval_loss               =     0.1111
  eval_runtime            = 0:01:52.45
  eval_samples_per_second =    130.288
  eval_steps_per_second   =      16.29


In [12]:
# Confusion Matrix
import torch
from sklearn import metrics

y_preds = []
y_trues = []
# Inference only, so gradients are not needed
with torch.no_grad():
    for index, val_item in enumerate(data["test"]):
        encoding = feature_extractor(val_item["image"], return_tensors="pt").to("cuda")
        outputs = model(**encoding)
        y_pred = outputs.logits.argmax(-1)
        y_true = val_item["label"]
        y_preds.append(y_pred)
        y_trues.append(y_true)
        print(f"{index} out of {len(data['test'])}")

0 out of 14652
1 out of 14652
2 out of 14652
3 out of 14652
4 out of 14652
5 out of 14652
6 out of 14652
7 out of 14652
8 out of 14652
9 out of 14652
10 out of 14652
11 out of 14652
12 out of 14652
13 out of 14652
14 out of 14652
15 out of 14652
16 out of 14652
17 out of 14652
18 out of 14652
19 out of 14652
20 out of 14652
21 out of 14652
22 out of 14652
23 out of 14652
24 out of 14652
25 out of 14652
26 out of 14652
27 out of 14652
28 out of 14652
29 out of 14652
30 out of 14652
31 out of 14652
32 out of 14652
33 out of 14652
34 out of 14652
35 out of 14652
36 out of 14652
37 out of 14652
38 out of 14652
39 out of 14652
40 out of 14652
41 out of 14652
42 out of 14652
43 out of 14652
44 out of 14652
45 out of 14652
46 out of 14652
47 out of 14652
48 out of 14652
49 out of 14652
50 out of 14652
51 out of 14652
52 out of 14652
53 out of 14652
54 out of 14652
55 out of 14652
56 out of 14652
57 out of 14652
58 out of 14652
59 out of 14652
60 out of 14652
61 out of 14652
62 out of 14652
63

KeyboardInterrupt: ignored

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
cm = metrics.confusion_matrix([int(x) for x in y_trues], [x.item() for x in y_preds], labels=[x for x in range(10)])
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
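For readers unfamiliar with the layout of the plot, here is a minimal pure-Python sketch of how a confusion matrix is tallied. It follows the same convention as scikit-learn, with rows for the true class and columns for the predicted class; the function and the toy labels are illustrative only.

```python
def confusion_matrix(y_true, y_pred, num_classes=10):
    """cm[t][p] counts examples whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Toy example with 3 classes: one "1" is mistaken for a "2"
toy_true = [0, 1, 1, 2, 2]
toy_pred = [0, 1, 2, 2, 2]

cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
for row in cm:
    print(row)
```

Correct predictions land on the diagonal, so off-diagonal cells show which digits are being confused with each other.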

In [30]:
!cp -r "vit-digit-recognition/checkpoint-16488" drive/MyDrive/vit/

In [31]:
!zip -r "checkpoint-16488.zip" "vit-digit-recognition/checkpoint-16488"

  adding: vit-digit-recognition/checkpoint-16488/ (stored 0%)
  adding: vit-digit-recognition/checkpoint-16488/trainer_state.json (deflated 87%)
  adding: vit-digit-recognition/checkpoint-16488/optimizer.pt (deflated 8%)
  adding: vit-digit-recognition/checkpoint-16488/training_args.bin (deflated 48%)
  adding: vit-digit-recognition/checkpoint-16488/scaler.pt (deflated 55%)
  adding: vit-digit-recognition/checkpoint-16488/scheduler.pt (deflated 49%)
  adding: vit-digit-recognition/checkpoint-16488/rng_state.pth (deflated 27%)
  adding: vit-digit-recognition/checkpoint-16488/preprocessor_config.json (deflated 46%)
  adding: vit-digit-recognition/checkpoint-16488/config.json (deflated 57%)
  adding: vit-digit-recognition/checkpoint-16488/pytorch_model.bin (deflated 7%)


# Get Results
A single test image can contain several digits, so we first crop each digit out using the bounding boxes provided with the test set.

In [5]:
import glob, json
test_images = glob.glob("hackathon-online-2022-image-processing/IM_Test/**/*.png", recursive=True)
# Use a context manager so the JSON file is closed after reading
with open("hackathon-online-2022-image-processing/IM_Test/IM_Test.json") as f:
    raw_bboxes = json.load(f)


In [6]:
file_to_bbox = dict()
for raw_bbox in raw_bboxes:
    file_to_bbox[raw_bbox["filename"]] = raw_bbox["bboxes"]
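As a sketch of how one entry is used later, the snippet below turns a bounding box into the `(left, upper, right, lower)` tuple that PIL's `Image.crop` expects and builds the `imageid_boxid` key used in the submission. `crop_box` and `submission_key` are hypothetical helper names; the sample entry is copied from the `2.png` record in the annotations.

```python
def crop_box(box):
    """Bounding box as the (left, upper, right, lower) tuple for Image.crop."""
    return (box["x1"], box["y1"], box["x2"], box["y2"])


def submission_key(filename, box):
    """Key of the form '<image id>_<bbox id>', e.g. '2_0'."""
    image_id = filename.split(".")[0]
    return f"{image_id}_{box['bbox_id']}"


sample = {
    "filename": "2.png",
    "bboxes": [
        {"bbox_id": 0, "x1": 99, "x2": 113, "y1": 5, "y2": 28},
        {"bbox_id": 1, "x1": 114, "x2": 122, "y1": 8, "y2": 31},
    ],
}

for box in sample["bboxes"]:
    left, upper, right, lower = crop_box(box)
    print(submission_key(sample["filename"], box), right - left, lower - upper)
```

Note that the order of the dictionary keys (`x1`, `x2`, `y1`, `y2`) differs from the order of the crop tuple, which is an easy place to introduce a mistake.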

In [7]:
file_to_bbox

{'1.png': [{'bbox_id': 0, 'x1': 43, 'x2': 62, 'y1': 7, 'y2': 37}],
 '2.png': [{'bbox_id': 0, 'x1': 99, 'x2': 113, 'y1': 5, 'y2': 28},
  {'bbox_id': 1, 'x1': 114, 'x2': 122, 'y1': 8, 'y2': 31},
  {'bbox_id': 2, 'x1': 121, 'x2': 133, 'y1': 6, 'y2': 29}],
 '3.png': [{'bbox_id': 0, 'x1': 61, 'x2': 72, 'y1': 6, 'y2': 22}],
 '4.png': [{'bbox_id': 0, 'x1': 32, 'x2': 46, 'y1': 6, 'y2': 23}],
 '5.png': [{'bbox_id': 0, 'x1': 97, 'x2': 116, 'y1': 28, 'y2': 56}],
 '6.png': [{'bbox_id': 0, 'x1': 40, 'x2': 47, 'y1': 11, 'y2': 34}],
 '7.png': [{'bbox_id': 0, 'x1': 44, 'x2': 53, 'y1': 7, 'y2': 28},
  {'bbox_id': 1, 'x1': 51, 'x2': 62, 'y1': 6, 'y2': 27},
  {'bbox_id': 2, 'x1': 62, 'x2': 72, 'y1': 6, 'y2': 27}],
 '8.png': [{'bbox_id': 0, 'x1': 62, 'x2': 76, 'y1': 16, 'y2': 39},
  {'bbox_id': 1, 'x1': 80, 'x2': 94, 'y1': 17, 'y2': 40}],
 '9.png': [{'bbox_id': 0, 'x1': 27, 'x2': 39, 'y1': 8, 'y2': 26},
  {'bbox_id': 1, 'x1': 40, 'x2': 53, 'y1': 5, 'y2': 23},
  {'bbox_id': 2, 'x1': 52, 'x2': 67, 'y1': 7, 

In [8]:
from transformers import ViTFeatureExtractor, ViTForImageClassification
model_name_or_path = "drive/MyDrive/vit/checkpoint-20610"
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name_or_path)

labels = data['train'].features['label'].names

model = ViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)}
).to("cuda")

In [28]:
!pip install pillow==9.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pillow==9.2.0
  Downloading Pillow-9.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: pillow
  Attempting uninstall: pillow
    Found existing installation: Pillow 7.1.2
    Uninstalling Pillow-7.1.2:
      Successfully uninstalled Pillow-7.1.2
Successfully installed pillow-9.2.0


In [9]:
from PIL import Image, ImageOps
import torch

answers_final = dict()

with torch.no_grad():
    for img_id, image_path in enumerate(test_images):
        key = image_path.split("/")[-1]
        bbox = file_to_bbox[key]

        # Open each test image once and crop every digit from it
        img = Image.open(image_path)

        for box in bbox:
            boxed = img.crop((box["x1"], box["y1"], box["x2"], box["y2"]))

            # Apply the same preprocessing as for the training images
            image = ImageOps.contain(boxed, (224, 224), Image.Resampling.LANCZOS)
            image = ImageOps.autocontrast(image)
            image = ImageOps.grayscale(image)
            image = ImageOps.autocontrast(image)
            image = ImageOps.pad(image, (224, 224), centering=(0, 0))
            image = image.convert("RGB")

            encoding = feature_extractor(image, return_tensors="pt").to("cuda")
            outputs = model(**encoding)
            pred = outputs.logits.argmax(-1).item()
            answers_final[f"{key.split('.')[0]}_{box['bbox_id']}"] = model.config.id2label[str(pred)]
        print(f"Image {img_id + 1} out of {len(test_images)}")

Streaming output truncated to the last 5000 lines.
Image 8069 out of 13068
Image 8070 out of 13068
Image 8071 out of 13068
Image 8072 out of 13068
Image 8073 out of 13068
Image 8074 out of 13068
Image 8075 out of 13068
Image 8076 out of 13068
Image 8077 out of 13068
Image 8078 out of 13068
Image 8079 out of 13068
Image 8080 out of 13068
Image 8081 out of 13068
Image 8082 out of 13068
Image 8083 out of 13068
Image 8084 out of 13068
Image 8085 out of 13068
Image 8086 out of 13068
Image 8087 out of 13068
Image 8088 out of 13068
Image 8089 out of 13068
Image 8090 out of 13068
Image 8091 out of 13068
Image 8092 out of 13068
Image 8093 out of 13068
Image 8094 out of 13068
Image 8095 out of 13068
Image 8096 out of 13068
Image 8097 out of 13068
Image 8098 out of 13068
Image 8099 out of 13068
Image 8100 out of 13068
Image 8101 out of 13068
Image 8102 out of 13068
Image 8103 out of 13068
Image 8104 out of 13068
Image 8105 out of 13068
Image 8106 out of 13068
Image 8107 out of 13068

In [10]:
answers_final

{'7923_0': '2',
 '7923_1': '1',
 '7923_2': '7',
 '7923_3': '6',
 '12513_0': '1',
 '12513_1': '1',
 '12513_2': '4',
 '5396_0': '2',
 '5396_1': '0',
 '5396_2': '9',
 '2734_0': '3',
 '2734_1': '1',
 '6510_0': '6',
 '6510_1': '2',
 '1106_0': '8',
 '1106_1': '6',
 '11086_0': '9',
 '4402_0': '7',
 '5061_0': '0',
 '7402_0': '7',
 '10098_0': '6',
 '10098_1': '7',
 '826_0': '1',
 '826_1': '3',
 '826_2': '3',
 '11841_0': '2',
 '11841_1': '6',
 '11841_2': '3',
 '8129_0': '2',
 '8129_1': '0',
 '8129_2': '4',
 '1667_0': '5',
 '1667_1': '2',
 '4531_0': '3',
 '4531_1': '2',
 '10096_0': '2',
 '1342_0': '1',
 '1342_1': '1',
 '1342_2': '3',
 '6236_0': '9',
 '6236_1': '0',
 '4240_0': '8',
 '4240_1': '7',
 '4240_2': '4',
 '10218_0': '3',
 '10218_1': '0',
 '5604_0': '1',
 '5604_1': '7',
 '5895_0': '1',
 '5895_1': '5',
 '104_0': '1',
 '11798_0': '2',
 '11798_1': '9',
 '11254_0': '2',
 '9364_0': '1',
 '9364_1': '3',
 '11717_0': '5',
 '11206_0': '5',
 '11206_1': '2',
 '430_0': '2',
 '430_1': '3',
 '4871_0': '

In [11]:
with open("solution_best.csv", "w") as f:
    f.write("imageid_boxid,class\n")
    for name in answers_final:
        f.write(f"{name},{answers_final[name]}\n")
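Before submitting, it can be worth checking that every bounding box received a prediction. The following is a small sketch of such a check, not part of the original pipeline; `expected_keys`, `missing_predictions`, and the toy data are hypothetical and only mirror the key format used above.

```python
def expected_keys(file_to_bbox):
    """All '<image id>_<bbox id>' keys implied by the bounding box annotations."""
    return {
        f"{name.split('.')[0]}_{box['bbox_id']}"
        for name, boxes in file_to_bbox.items()
        for box in boxes
    }


def missing_predictions(file_to_bbox, answers):
    """Sorted list of keys that have a bounding box but no predicted class."""
    return sorted(expected_keys(file_to_bbox) - set(answers))


# Toy data: the second box of 2.png has no prediction yet
toy_bboxes = {
    "1.png": [{"bbox_id": 0}],
    "2.png": [{"bbox_id": 0}, {"bbox_id": 1}],
}
toy_answers = {"1_0": "4", "2_0": "7"}

print(missing_predictions(toy_bboxes, toy_answers))
```

An empty list from `missing_predictions(file_to_bbox, answers_final)` would indicate that the CSV covers every annotated digit.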