<img src="https://fsdl.me/logo-720-dark-horizontal">

# Lab 06: Data Annotation & Synthesis

### What You Will Learn

- How the `IAMParagraphs` dataset is structured
- How to use Label Studio to set up a data annotation workflow
- Just how messy data really is

# Setup

If you're running this notebook on Google Colab,
the cell below will run full environment setup.

It should take about three minutes to run.

In [None]:
%env FSDL_REPO=fsdl-text-recognizer-2022

In [None]:
lab_idx = None  # CHANGE ME WHEN YOU COPY THE TEMPLATE OVER


if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    # needed for inline plots in some contexts
    %matplotlib inline

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

# `IAMParagraphs`: From annotated data to a PyTorch `Dataset`

We've used the `text_recognizer.data` submodule
to serve up PyTorch `Dataset`s that our
`DataLoader`s and `LightningDataModule`s can
turn into PyTorch `Tensor`s ready to train our DNNs.

These `Dataset`s are built on top of data in a much rawer format,
one that looks a lot like the raw data in other machine learning projects.

Let's walk through their processing in detail.

This class downloads the data --
we'll talk more about it later,
but we want to have the data present for the first part of the discussion.

In [None]:
from text_recognizer.data.iam import IAM

iam = IAM()
iam.prepare_data()

## Dataset structure on disk


The `IAM` dataset is downloaded as a zip file:

In [None]:
from text_recognizer.metadata.iam import DL_DATA_DIRNAME


iam_dir = DL_DATA_DIRNAME
!ls {iam_dir}

Inside that zip file are the following folders:

In [None]:
iamdb = iam_dir / "iamdb"

!du -h {iamdb}

There are >3000 files, almost all of which are `.xml` or `.jpg`:

In [None]:
!find {iamdb} | grep "\.jpg$\|\.xml$" | wc -l

And they are equal in number:

In [None]:
!find {iamdb}/xml | grep "\.xml$" | wc -l

In [None]:
!find {iamdb}/forms | grep "\.jpg$" | wc -l
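
Equal counts suggest a one-to-one pairing but do not prove it. The sketch below checks the pairing directly, assuming that a form and its XML file share a filename stem. The helper name `unpaired_stems` and the stand-in files are hypothetical, and a temporary directory takes the place of the real `forms` and `xml` folders.

```python
from pathlib import Path
import tempfile


def unpaired_stems(forms_dir, xml_dir):
    """Return (forms without XML, XML without forms), compared by filename stem."""
    form_stems = {p.stem for p in Path(forms_dir).rglob("*.jpg")}
    xml_stems = {p.stem for p in Path(xml_dir).rglob("*.xml")}
    return sorted(form_stems - xml_stems), sorted(xml_stems - form_stems)


# Tiny stand-in for the real folders, so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    forms, xml = Path(tmp) / "forms", Path(tmp) / "xml"
    forms.mkdir()
    xml.mkdir()
    for stem in ["a01-000u", "a01-003"]:
        (forms / f"{stem}.jpg").touch()
    (xml / "a01-000u.xml").touch()  # "a01-003" deliberately has no XML file
    missing_xml, missing_form = unpaired_stems(forms, xml)

print("forms without XML:", missing_xml)
print("XML without forms:", missing_form)
```

On the real data, the call would be something like `unpaired_stems(iamdb / "forms", iamdb / "xml")`, and both lists should come back empty.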

Where there are many small files in equal number, there are inputs and targets.

And indeed, an individual "datapoint" in `IAM` is a "form", because the humans whose hands wrote the data were writing on "forms", as below:

In [None]:
from IPython.display import Image


form_fn, = !find {iamdb}/forms | grep "\.jpg$" | sort | head -n 1

print(form_fn)
Image(filename=form_fn, width=360)

And the `xml` files indeed contain the targets:

In [None]:
xml_fn, = !find {iamdb}/xml | grep "\.xml$" | sort | head -n 1

!cat {xml_fn} | grep -A 100 "handwritten-part" | grep "<word"

But they also contain the metadata required to convert images of entire forms into more useful images, e.g. of lines or paragraphs of handwritten text:

In [None]:
xml_fn, = !find {iamdb}/xml | grep "\.xml$" | sort | head -n 1

!cat {xml_fn} | grep -A 25 "handwritten-part" | grep -A 5 "<word"
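
The same metadata can be read in Python. The sketch below uses the standard library's `xml.etree.ElementTree` on a hand-written sample. The sample assumes the layout seen in the output above, with `<word>` elements that hold a `text` attribute and one or more `<cmp>` children carrying `x`, `y`, `width`, and `height`. The helper name `word_boxes` and the coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed layout; the values are illustrative only.
SAMPLE_XML = """
<form id="a01-000u">
  <handwritten-part>
    <line id="a01-000u-00" text="A MOVE">
      <word id="a01-000u-00-00" text="A">
        <cmp x="408" y="768" width="27" height="51"/>
      </word>
      <word id="a01-000u-00-01" text="MOVE">
        <cmp x="507" y="766" width="100" height="50"/>
        <cmp x="610" y="770" width="103" height="48"/>
      </word>
    </line>
  </handwritten-part>
</form>
"""


def word_boxes(xml_text):
    """Return (text, (x1, y1, x2, y2)) for each word, merging its components."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("word"):
        cmps = word.findall("cmp")
        if not cmps:  # skip words that carry no position metadata
            continue
        x1 = min(int(c.get("x")) for c in cmps)
        y1 = min(int(c.get("y")) for c in cmps)
        x2 = max(int(c.get("x")) + int(c.get("width")) for c in cmps)
        y2 = max(int(c.get("y")) + int(c.get("height")) for c in cmps)
        boxes.append((word.get("text"), (x1, y1, x2, y2)))
    return boxes


boxes = word_boxes(SAMPLE_XML)
for text, box in boxes:
    print(text, box)
```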

There's a handful of other files full of metadata -- e.g. the training, validation, and test splits:

In [None]:
!find {iamdb} | grep "\\.txt$"

The `ascii` folder has metadata in `.txt` files in the ASCII format.

In [None]:
!ls -lh {iamdb}/ascii

## Extracting paragraphs from raw data

So from images of entire forms
and XML positional and label metadata,
we need to extract cropped images
of paragraphs and string labels.

In [None]:
import text_recognizer.util as util

form_id = "g01-031"
fn = iam.form_filenames_by_id[form_id]

print(fn)
Image(filename=fn, width=360)

This is handled by a utility function, `get_paragraph_crops_and_labels`:

In [None]:
from text_recognizer.data.iam_paragraphs import get_paragraph_crops_and_labels

p_crops, p_labels = get_paragraph_crops_and_labels(iam, split="val")

print(p_labels[form_id])
p_crops[form_id]

Loosely: we calculate paragraph regions
by joining over the line regions.

We pull line regions from the XML:

In [None]:
from text_recognizer.data.iam import _get_line_regions_from_xml_file

_get_line_regions_from_xml_file??
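
Joining line regions into a paragraph region amounts to taking the smallest box that contains every line. The sketch below shows that idea on made-up coordinates. The dictionary keys `x1`, `y1`, `x2`, `y2` and the helper name `union_region` are assumptions for illustration, not necessarily what the library uses.

```python
def union_region(line_regions):
    """Smallest axis-aligned box containing every line region."""
    return {
        "x1": min(r["x1"] for r in line_regions),
        "y1": min(r["y1"] for r in line_regions),
        "x2": max(r["x2"] for r in line_regions),
        "y2": max(r["y2"] for r in line_regions),
    }


# Three illustrative line boxes, in pixel coordinates of the full form.
lines = [
    {"x1": 395, "y1": 750, "x2": 2100, "y2": 900},
    {"x1": 380, "y1": 930, "x2": 2150, "y2": 1080},
    {"x1": 402, "y1": 1110, "x2": 1500, "y2": 1260},
]

paragraph = union_region(lines)
print(paragraph)
```

The resulting box could then be used as the crop rectangle for the paragraph image.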

We resize the crops so they take up less disk space.

We invert them, so that ink is bright and paper is dark,
because many NNs work better with positive features.
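
For 8-bit grayscale images, inversion is just `255 - pixel`. The sketch below applies it to a tiny made-up "scan", stored as nested lists so that it runs without any imaging library. On real `Image`s, the equivalent would be something like `PIL.ImageOps.invert`, though whether the library uses that call is an assumption here.

```python
def invert(pixels, max_value=255):
    """Flip intensities so dark ink becomes bright and light paper becomes dark."""
    return [[max_value - p for p in row] for row in pixels]


# A 3x3 stand-in for a scan: light paper (near 255) with a dark stroke (near 0).
scan = [
    [255, 255, 250],
    [255, 20, 245],
    [250, 15, 255],
]

inverted = invert(scan)
for row in inverted:
    print(row)
```

After inversion, most pixels are at or near zero and only the ink carries large values.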

## Structuring into a PyTorch dataset

Lastly, we convert to something we can use with PyTorch and `torchvision`: a PyTorch `Dataset`.

A basic `Dataset` just allows us to index into multiple sources of data
(e.g. the inputs and the targets) at the same time.

We want our targets to be `Tensor`s,
so we convert the strings:

In [None]:
from text_recognizer.data.util import convert_strings_to_labels
from text_recognizer.data import IAMParagraphs

iam_paragraphs = IAMParagraphs()

tensor_labels = convert_strings_to_labels(
    strings=list(p_labels.values()),  # p_labels is keyed by form ID; we want the text
    mapping=iam_paragraphs.inverse_mapping,
    length=iam_paragraphs.output_dims[0])
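
To make the conversion concrete, here is a plain-Python stand-in that returns lists rather than a `Tensor`. It is a sketch of the general technique, not the library's implementation. The name `strings_to_labels`, the toy vocabulary, and the `<S>`, `<E>`, and `<P>` start, end, and padding tokens are all assumptions for illustration.

```python
def strings_to_labels(strings, mapping, length, start="<S>", end="<E>", pad="<P>"):
    """Map each string to a fixed-length list of integer token indices."""
    labels = []
    for string in strings:
        tokens = [start, *string, end]  # wrap the characters in start/end tokens
        if len(tokens) > length:
            raise ValueError(f"{string!r} does not fit in {length} tokens")
        ids = [mapping[token] for token in tokens]
        ids += [mapping[pad]] * (length - len(ids))  # pad on the right
        labels.append(ids)
    return labels


# Toy vocabulary; a real mapping would cover every character in the dataset.
vocab = ["<P>", "<S>", "<E>", "a", "b", "c"]
mapping = {token: index for index, token in enumerate(vocab)}

labels = strings_to_labels(["cab", "a"], mapping, length=6)
print(labels)
```

Every row has the same length, which is what lets the labels be stacked into a single rectangular array.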

We do eventually want `Tensor`s out of our images,
but we want our `DataLoader` to do that work while the model is busy with its forward pass,
making use of the CPUs,
so we leave our inputs as a list of `Image`s.

In [None]:
list_crops = list(p_crops.values())

We combine them together with our `BaseDataset` class.

In [None]:
import wandb

from text_recognizer.data.util import BaseDataset


dataset = BaseDataset(list_crops, tensor_labels)

im, label = dataset[0]
wandb.Image(im).image
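
The core idea of such a dataset fits in a few lines. The sketch below is a hypothetical `PairedDataset`, not the library's `BaseDataset`. It deliberately avoids importing `torch` so that it runs anywhere, relying on the fact that a map-style dataset only needs `__len__` and `__getitem__`. The optional `transform` argument is an assumption about where image-to-`Tensor` conversion might happen.

```python
class PairedDataset:
    """Index into inputs and targets at the same time."""

    def __init__(self, data, targets, transform=None):
        if len(data) != len(targets):
            raise ValueError("data and targets must have the same length")
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        datum, target = self.data[index], self.targets[index]
        if self.transform is not None:
            datum = self.transform(datum)  # applied lazily, one item at a time
        return datum, target


# Strings stand in for images here; str.upper stands in for an image transform.
pairs = PairedDataset(["crop-0", "crop-1"], [[1, 3, 2], [1, 4, 2]], transform=str.upper)
first = pairs[0]
print(len(pairs), first)
```

Because the transform runs inside `__getitem__`, the stored inputs stay in their raw form and the conversion cost is paid only when an item is requested.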

## Synthesizing handwritten paragraphs from handwritten lines

In [None]:
from text_recognizer.data.iam_synthetic_paragraphs import IAMSyntheticParagraphs

# FSDL Handwriting Dataset: From images to an annotated dataset

Above, we relied on an existing dataset,
already nicely formatted with images and their annotations.

But data does not come to us like this.

Inputs are collected from the world somehow,
and annotations are often collected from humans.

Let's walk through how that's done.

We'll use a dataset of text prompts
and handwritten responses collected during the 2019 edition of FSDL.

## Handling Data with AWS S3

We begin a few steps after the beginning:
data has been collected from humans who were tasked with
writing out text prompts by hand on paper forms,
and those forms were scanned and digitized.

The digitized forms were placed in storage on Amazon Web Services'
Simple Storage Service, aka S3,
which is a form of object storage.

They are publicly accessible, so we can view them directly by inputting a URL:

In [None]:
from IPython.display import Image

idx = 117
img_url = f"https://fsdl-public-assets.s3.us-west-2.amazonaws.com/fsdl_handwriting_20190302/page-{str(idx).zfill(3)}.jpg"
print(img_url)
Image(url=img_url, width=360)

For programmatic access,
we use
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html),
the Python SDK for AWS.

It is named after the Portuguese term for
[river dolphins native to the Amazon river](https://en.wikipedia.org/wiki/Boto).

In [None]:
import boto3  # boto3: high-level API
from botocore import UNSIGNED  # botocore: lower-level API and components
from botocore.config import Config


s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

In [None]:
from text_recognizer.metadata.shared import DATA_DIRNAME


FSDL_RAW_DATA_DIRNAME = DATA_DIRNAME / "raw" / "fsdl_handwriting"
FSDL_DL_DATA_DIRNAME = DATA_DIRNAME / "downloaded" / "fsdl_handwriting" / "pages"

In [None]:
!mkdir -p {FSDL_DL_DATA_DIRNAME}

s3.download_file("fsdl-public-assets", "fsdl_handwriting_20190302/page-001.jpg", f"{FSDL_DL_DATA_DIRNAME}/page-001.jpg")

In [None]:
from IPython.display import Image

Image(filename=f"{FSDL_DL_DATA_DIRNAME}/page-001.jpg", height=720)

In [None]:
import os

s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """Download the contents of a folder on S3, recursively.

    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    # from https://stackoverflow.com/questions/49772151/download-a-folder-from-s3-using-boto3
    bucket = s3_resource.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)

In [None]:
download_s3_folder("fsdl-public-assets", "fsdl_handwriting_20190302", FSDL_DL_DATA_DIRNAME)
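
The least obvious step in `download_s3_folder` is how an object key becomes a local path. The sketch below isolates that step in a hypothetical helper, `local_target`, so it can be tried without any network access. The local directory `data/pages` is a made-up example.

```python
import os


def local_target(key, s3_folder, local_dir=None):
    """Map an S3 object key to the local path it would be downloaded to."""
    if local_dir is None:
        return key  # mirror the bucket layout relative to the working directory
    # Strip the folder prefix from the key, then re-root it under local_dir.
    return os.path.join(local_dir, os.path.relpath(key, s3_folder))


target = local_target(
    "fsdl_handwriting_20190302/page-001.jpg",
    "fsdl_handwriting_20190302",
    "data/pages",
)
nested = local_target(
    "fsdl_handwriting_20190302/extra/page-200.jpg",
    "fsdl_handwriting_20190302",
    "data/pages",
)
print(target)
print(nested)
```

Keys in subfolders keep their relative structure, which is why the function creates intermediate directories before downloading.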

In [None]:
!find {FSDL_DL_DATA_DIRNAME} | head -n 20

In [None]:
%%writefile {FSDL_RAW_DATA_DIRNAME}/manifest.csv
page

In [None]:
s3_bucket_name = "fsdl-public-assets"
s3_directory_path = "fsdl_handwriting_20190302/"
s3_url = f"https://{s3_bucket_name}.s3.us-west-2.amazonaws.com/{s3_directory_path}"

In [None]:
!find {FSDL_DL_DATA_DIRNAME} | grep "page-.*.jpg$" | sed "s\\{FSDL_DL_DATA_DIRNAME}/\\{s3_url}\\"| sort >> {FSDL_RAW_DATA_DIRNAME}/manifest.csv

In [None]:
!cat {FSDL_RAW_DATA_DIRNAME}/manifest.csv | head -n 10
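
The `find | sed | sort` pipeline above can also be written in Python, which some readers may find easier to follow. The sketch below is one possible version. The helper name `write_manifest` is hypothetical, it looks only at the top level of the directory rather than searching recursively, and it runs on a temporary directory with empty stand-in files.

```python
import csv
from pathlib import Path
import tempfile


def write_manifest(pages_dir, base_url, manifest_path):
    """Write a one-column CSV of page URLs, one row per local page image."""
    names = sorted(p.name for p in Path(pages_dir).glob("page-*.jpg"))
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["page"])  # header, matching the manifest above
        writer.writerows([base_url + name] for name in names)
    return len(names)


base_url = "https://fsdl-public-assets.s3.us-west-2.amazonaws.com/fsdl_handwriting_20190302/"

with tempfile.TemporaryDirectory() as tmp:
    pages = Path(tmp)
    for name in ["page-002.jpg", "page-001.jpg", "notes.txt"]:
        (pages / name).touch()
    manifest = pages / "manifest.csv"
    n_pages = write_manifest(pages, base_url, manifest)
    rows = manifest.read_text().splitlines()

print(n_pages)
print(rows)
```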

## Annotation with Label Studio

### Configuring and connecting to the web server

In [None]:
username = "fsdl@localhost"
password = "pancakes"

In [None]:
%env DJANGO_SETTINGS_MODULE=data.raw.fsdl_handwriting.labelstudio_settings
%env LABEL_STUDIO_USERNAME={username}
%env LABEL_STUDIO_PASSWORD={password}

In [None]:
import os
import getpass

from pyngrok import ngrok


if not os.path.exists(ngrok.conf.DEFAULT_NGROK_CONFIG_PATH):
    print("Enter your ngrok auth token, which can be copied from https://dashboard.ngrok.com/auth")
    ngrok.conf.get_default().auth_token = getpass.getpass()

In [None]:
LABEL_STUDIO_PORT = 8081

https_tunnel = ngrok.connect(LABEL_STUDIO_PORT, bind_tls=True)
print(https_tunnel)

We'll briefly install `label-studio` here.

It's not compatible with the rest of our environment,
so we'll clean it up at the end
(if running locally).

Label Studio is intended to be run inside a Docker container
or on a special-purpose server.

In [None]:
!pip install -qqq label-studio

In [None]:
%%script bash --bg --proc label_studio_proc

label-studio start

Give it about 30 seconds to start.
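
Instead of waiting a fixed time, one option is to poll until the server answers. The sketch below does this with the standard library only. The helper name `wait_for_server` is hypothetical, and it assumes the server answers plain HTTP on `localhost` at `LABEL_STUDIO_PORT`.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=30.0, interval=1.0):
    """Poll url until something answers over HTTP or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except OSError:
            pass  # connection refused or timed out: not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example usage, left commented out so this cell does not block:
# ready = wait_for_server(f"http://localhost:{LABEL_STUDIO_PORT}")
```

A `True` result only means that something is listening on the port, not that Label Studio has finished initializing.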

In [None]:
print(https_tunnel)

## Label Studio Cleanup

In [None]:
import sys

in_colab = "google.colab" in sys.modules
done_with_label_studio = True

if done_with_label_studio:
    !pkill -P {label_studio_proc.pid}
    if not in_colab:
        !make pip-tools

# Exercises

### 🌟 Do some data labeling yourself.

Label a handful of pages.
Notice the edge cases.
Incorporate them into labeling instructions.

Interesting ones: #24, #35, #97.

### 🌟🌟 Hook up S3 directly to Label Studio.

Create an AWS account and get your Access Key ID.

The full guide is [here](https://labelstud.io/guide/storage.html), but because our bucket allows public access, you can start
[here](https://labelstud.io/guide/storage.html#Set-up-connection-in-the-Label-Studio-UI).

You do not need pre-signed URLs or a Session Token. Our region is `us-west-2`.

The bucket name and bucket prefix are defined above. The files are all `.jpg`.