# Preparing a document classification dataset for Donut

This notebook prepares the dataset used to fine-tune the Donut model for document classification.

It involves two steps:

1. Load an image classification dataset.
2. Convert it to the Donut format by adding a `ground_truth` column to the dataset. Each `ground_truth` value is a JSON string whose `gt_parse` key holds the target sequence, in the form `{"class" : {class_name}}`, for example `{"class" : "scientific_report"}` or `{"class" : "presentation"}`.
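
As a minimal sketch of the target format described in step 2, the snippet below builds a `ground_truth` string for a single class name with the standard `json` module. The helper name `make_ground_truth` is hypothetical and is not part of Donut or `datasets`. Note that `json.dumps` writes `"class": ...` without a space before the colon; both spellings parse to the same object.

```python
import json

def make_ground_truth(class_name):
    """Wrap a class name in a {"gt_parse": {"class": ...}} JSON string."""
    return json.dumps({"gt_parse": {"class": class_name}})

example = make_ground_truth("presentation")
print(example)  # {"gt_parse": {"class": "presentation"}}
```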

## Environment setup

In [None]:
!pip install -q datasets


In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Preparing the Donut format




We use 10 classes in total: 8 classes from Tobacco-3482 (ADVE, email, form, memo, news, note, report, and resume), one passport class from the Pardo dataset, and one receipt class from the SROIE dataset.

The image dataset must follow a `folder/<split>/<class_name>/<image_file>` layout, as in this example:

folder/train/dog/golden_retriever.png

folder/train/dog/german_shepherd.png

folder/train/dog/chihuahua.png

folder/train/cat/maine_coon.png

folder/train/cat/bengal.png

folder/train/cat/birman.png
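
To make the layout concrete, here is a small sketch that recovers the split and the class name from paths like the ones above. It only illustrates the naming convention; the helper `split_and_label` is hypothetical and is not how `datasets` is implemented internally.

```python
from pathlib import PurePosixPath

def split_and_label(path):
    """Return (split, class_name) for a path shaped like folder/<split>/<class_name>/<file>."""
    parts = PurePosixPath(path).parts
    # The last three components are: split, class name, file name
    return parts[-3], parts[-2]

for p in ["folder/train/dog/golden_retriever.png", "folder/train/cat/bengal.png"]:
    print(p, "->", split_and_label(p))
```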

We create the folder structure:

In [None]:
# Create the directory structure: /content/dataset/<split>/<class_name>
import os

classes = ["passport", "ADVE", "email", "form", "receipt", "memo", "news", "note", "report", "resume"]
for split in ("train", "test"):
  for class_name in classes:
    # exist_ok=True lets this cell be re-run without raising FileExistsError;
    # os.makedirs also creates the parent directories as needed
    os.makedirs(os.path.join("/content/dataset", split, class_name), exist_ok=True)

The images can now be uploaded to the folders just created.
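
Before loading the dataset, it can help to confirm that every class folder actually received its images. The following is a minimal sketch, assuming the `/content/dataset/<split>/<class_name>/` layout used in this notebook; the helper `count_images` is hypothetical.

```python
from collections import Counter
from pathlib import Path

def count_images(root):
    """Count files per (split, class_name) under root/<split>/<class_name>/."""
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        # Nothing to count if the dataset root does not exist
        return counts
    for file_path in root.glob("*/*/*"):
        if file_path.is_file():
            split, class_name = file_path.parts[-3], file_path.parts[-2]
            counts[(split, class_name)] += 1
    return counts

# Dataset root used in this notebook; adjust it for another environment.
for (split, class_name), n in sorted(count_images("/content/dataset").items()):
    print(f"{split}/{class_name}: {n} files")
```

An empty or missing class folder simply does not appear in the printed list, which makes gaps easy to spot.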

Load the dataset by specifying `imagefolder` and passing the dataset directory as `data_dir`:



In [None]:
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/content/dataset")




Downloading and preparing dataset imagefolder/default to /root/.cache/huggingface/datasets/imagefolder/default-a953ccf4e53ad573/0.0.0/0fc50c79b681877cc46b23245a6ef5333d036f48db40d53765a68034bc48faff...
                


Dataset imagefolder downloaded and prepared to /root/.cache/huggingface/datasets/imagefolder/default-a953ccf4e53ad573/0.0.0/0fc50c79b681877cc46b23245a6ef5333d036f48db40d53765a68034bc48faff. Subsequent calls will reuse this data.



In [None]:
dataset["train"][0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=L size=817x1089 at 0x7F74BE3E68D0>,
 'label': 0}

In [None]:
dataset["train"][-1]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=L size=1728x2292 at 0x7F74BE3E0D90>,
 'label': 9}

In [None]:
dataset["test"][-1]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=L size=1728x2292 at 0x7F74BE3E0BD0>,
 'label': 9}

In [None]:
template = '{"gt_parse": {"class" : '

In [None]:
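# Map the integer labels assigned by imagefolder to class names
id2label = {
  0: "ADVE",
  1: "email",
  2: "form",
  3: "memo",
  4: "news",
  5: "note",
  6: "passport",
  7: "receipt",
  8: "report",
  9: "resume",
}

def update_examples(examples):
  # With batched=True, `examples` is a dict of lists (one entry per example in the batch)
  ground_truths = []
  for label in examples['label']:
    # Build the JSON string, e.g. '{"gt_parse": {"class" : "ADVE"}}'
    ground_truths.append(template + '"' + id2label[label] + '"' + "}}")

  examples['ground_truth'] = ground_truths

  return examples

dataset = dataset.map(update_examples, batched=True)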
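
The hard-coded `id2label` mapping is consistent with sorting the class folder names, with the uppercase `ADVE` ordered before the lowercase names. The sketch below rebuilds the mapping that way. It assumes that `imagefolder` assigns label ids in sorted order of the folder names, which is worth confirming against `dataset["train"].features["label"].names` before relying on it.

```python
# Class folder names created earlier in this notebook
classes = ["passport", "ADVE", "email", "form", "receipt", "memo", "news", "note", "report", "resume"]

# Assumption: label ids follow the sorted order of the folder names.
# Python's default string sort places uppercase letters before lowercase ones.
id2label = dict(enumerate(sorted(classes)))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
```

Deriving the mapping from the folder names avoids having to keep a hand-written dictionary in sync when classes are added or renamed.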


In [None]:
test = dataset['train'][0]['ground_truth']
test

'{"gt_parse": {"class" : "ADVE"}}'

Let's verify that we can read it back as a dict:

In [None]:
import json

# The ground truth is a JSON string, so parse it with json.loads
test2 = json.loads(test)
test2['gt_parse']

{'class': 'ADVE'}
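
The same check can be applied to every example rather than a single one. The following is a minimal sketch of a validator for one ground-truth string, assuming the `{"gt_parse": {"class": ...}}` structure used in this notebook; the helper `check_ground_truth` is hypothetical.

```python
import json

def check_ground_truth(ground_truth):
    """Parse a ground-truth string and return its class name, raising if it is malformed."""
    parsed = json.loads(ground_truth)
    gt_parse = parsed["gt_parse"]
    if set(gt_parse) != {"class"}:
        raise ValueError(f"unexpected keys in gt_parse: {sorted(gt_parse)}")
    return gt_parse["class"]

print(check_ground_truth('{"gt_parse": {"class" : "ADVE"}}'))  # ADVE
```

It could be run over a column, for instance with `[check_ground_truth(gt) for gt in dataset["train"]["ground_truth"]]`, to catch malformed strings before uploading.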

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['image', 'label', 'ground_truth'],
        num_rows: 2800
    })
    test: Dataset({
        features: ['image', 'label', 'ground_truth'],
        num_rows: 700
    })
})

## Pushing the Donut dataset to the Hub

Finally, we upload this dataset to Hugging Face so that it can easily be reused, shared, and so on.

In [None]:
!pip install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git-lfs
  Downloading git_lfs-1.6-py2.py3-none-any.whl (5.6 kB)
Installing collected packages: git-lfs
Successfully installed git-lfs-1.6


In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


In [None]:
# you can simply add `private=True` in case you're using the private hub
dataset.push_to_hub("Mijavier/10_classes_custom_dataset_donut")


