## Synthetic Dataset Generation

Here we demonstrate how to use our `genalog` package to generate synthetic documents with custom image degradation and upload the documents to Azure Blob Storage.

<p float="left">
  <img src="static/labeled_synthetic_pipeline.png" width="900" />
</p>


## Dataset file structure 

Our dataset follows this file structure:
```
<ROOT FOLDER>/                             #e.g. synthetic-image-root
     <SRC_DATASET_NAME>                    #e.g. CNN-Dailymail-Stories
        │
        │───shared/                        #common files shared across different dataset versions
        │     │───train/
        │     │     │───clean_text/  
        │     │     │     │─0.txt
        │     │     │     │─1.txt
        │     │     │     └─...
        │     │     └───clean_labels/
        │     │           │─0.txt
        │     │           │─1.txt
        │     │           └─...
        │     └───test/
        │           │───clean_text/*.txt
        │           └───clean_labels/*.txt
        │   
        └───<VERSION_NAME>/                #e.g. hyphens_blur_heavy
               │───train/
               │     │─img/*.png           #Degraded Images
               │     │─ocr/*.json          #json output files that are output of GROK
               │     │─ocr_text/*.txt      #text output retrieved from OCR Json Files
               │     └─ocr_labels/*.txt    #Aligned labels files in IOB format
               │───test/
               │     │─img/*.png           #Degraded Images
               │     │─ocr/*.json          #json output files that are output of GROK
               │     │─ocr_text/*.txt      #text output retrieved from OCR Json Files
               │     └─ocr_labels/*.txt    #Aligned labels files in IOB format
               │
               │───layout.json             #records page layout info (font-family,template name, etc)
               │───degradation.json        #records degradation parameters
               │───ocr_metric.csv          #records metrics on OCR noise across the dataset
               └───substitution.json       #records character substitution errors in the OCR'ed text.
```
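As a minimal sketch of how the folders in this layout relate to each other, the helper below builds the per-subset paths from the root folder, source dataset name, and version name. The function name `dataset_paths` is hypothetical and not part of `genalog`; the example values come from the comments in the tree above.

```python
from pathlib import Path


def dataset_paths(root, src_dataset_name, version_name, subset):
    """Return the folders used for one subset ("train" or "test")."""
    base = Path(root) / src_dataset_name
    shared = base / "shared" / subset        # shared across dataset versions
    version = base / version_name / subset   # specific to one degraded version
    return {
        "clean_text": shared / "clean_text",
        "clean_labels": shared / "clean_labels",
        "img": version / "img",
        "ocr": version / "ocr",
        "ocr_text": version / "ocr_text",
        "ocr_labels": version / "ocr_labels",
    }


paths = dataset_paths(
    "synthetic-image-root", "CNN-Dailymail-Stories", "hyphens_blur_heavy", "train"
)
for name, path in paths.items():
    print(f"{name:>12}: {path.as_posix()}")
```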

## Source NER Dataset

This pipeline is designed to work with standard NER datasets like CoNLL 2003 and CoNLL 2012. You can download the source dataset from DeepAI: [CoNLL-2003](https://deepai.org/dataset/conll-2003-english)

**NOTE:** the source dataset has three separate label columns per token (part-of-speech, syntactic chunk, and NER). We are only interested in the last column, the NER label:

```
       Source                                Desired (space-separated)

-DOCSTART- -X- -X- O                       -DOCSTART- O
SOCCER NN B-NP O                           SOCCER O
- : O O                                    - O
JAPAN NNP B-NP B-LOC                       JAPAN B-LOC
GET VB B-VP O                              GET O
LUCKY NNP B-NP O                           LUCKY O
WIN NNP I-NP O                             WIN O
, , O O                                    , O
CHINA NNP B-NP B-PER                       CHINA B-PER
IN IN B-PP O                               IN O
SURPRISE DT B-NP O                         SURPRISE O
DEFEAT NN I-NP O                           DEFEAT O
...                                        ...
```

Unfortunately, this preprocessing step is outside the scope of this pipeline.

**TODO:** Add support for this or share the preprocessed dataset.

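Until that support exists, the column reduction itself is simple to do by hand. The sketch below is one possible way to do it, not part of `genalog`: the helper name `keep_last_label` is hypothetical, and it assumes each non-empty source line holds whitespace-separated fields with the token first and the NER label last.

```python
def keep_last_label(line):
    """Reduce 'TOKEN POS CHUNK NER' to 'TOKEN NER'; blank lines pass through."""
    fields = line.split()
    if not fields:
        return ""
    return f"{fields[0]} {fields[-1]}"


source = [
    "SOCCER NN B-NP O",
    "- : O O",
    "JAPAN NNP B-NP B-LOC",
    "",
]
converted = [keep_last_label(line) for line in source]
print("\n".join(converted))
```

Applied line by line to the source file, this yields the two-column form shown in the "Desired" column above.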



## Source Dataset Split

Before we can generate analog documents, we need text to populate them. To get it, we split the source text into smaller fragments. The `genalog.text.splitter` script splits the CoNLL-2003 and CoNLL-2012 NER datasets in the following ways:

1. **Split dataset into smaller fragments**: each fragment is named as `<INDEX>.txt`
1. **Separate NER labels from document text**: NER labels will be stored in the `clean_labels` folder and text in the `clean_text` folder

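To make the two steps concrete, here is a small in-memory sketch of the idea. It is an illustration only and not the implementation of `genalog.text.splitter`: the function `split_documents` is hypothetical, and it assumes tab-separated `TOKEN<TAB>LABEL` lines with a `-DOCSTART-\tO` line marking each document boundary.

```python
def split_documents(lines, doc_sep="-DOCSTART-\tO", sep="\t"):
    """Split 'TOKEN<sep>LABEL' lines into per-document (tokens, labels) pairs."""
    docs, tokens, labels = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line == doc_sep:
            # A separator closes the current document, if it has any tokens
            if tokens:
                docs.append((tokens, labels))
            tokens, labels = [], []
        elif line.strip():
            token, label = line.split(sep)
            tokens.append(token)
            labels.append(label)
    if tokens:
        docs.append((tokens, labels))
    return docs


lines = ["-DOCSTART-\tO", "JAPAN\tB-LOC", "WIN\tO", "-DOCSTART-\tO", "CHINA\tB-PER"]
# Each fragment is keyed by its '<INDEX>.txt' name, with text and labels kept apart
fragments = {
    f"{index}.txt": {"clean_text": " ".join(tokens), "clean_labels": " ".join(labels)}
    for index, (tokens, labels) in enumerate(split_documents(lines))
}
print(fragments)
```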
In [None]:
INPUT_FILE_TEMPLATE = "/data/enki/datasets/CoNLL_2003_2012/CoNLL-<DATASET_YEAR>/CoNLL-<DATASET_YEAR>_<SUBSET>.txt"
OUTPUT_FOLDER_TEMPLATE = "/data/enki/datasets/synthetic_dataset/CoNLL_<DATASET_YEAR>_v3/shared/<SUBSET>/"
for year in ["2003", "2012"]:
    for subset in ["test", "train"]:
        # INPUT_FILE = "/data/enki/datasets/CoNLL_2003_2012/CoNLL-2012/CoNLL-2012_test.txt"
        INPUT_FILE = INPUT_FILE_TEMPLATE.replace("<DATASET_YEAR>", year).replace("<SUBSET>", subset)
        # OUTPUT_FOLDER = "/data/enki/datasets/synthetic_dataset/CoNLL_2012_v2/shared/test/"
        OUTPUT_FOLDER = OUTPUT_FOLDER_TEMPLATE.replace("<DATASET_YEAR>", year).replace("<SUBSET>", subset)
        
        print(f"Loading {INPUT_FILE} \nOutput to {OUTPUT_FOLDER}")
        if year == "2003":
            !python -m genalog.text.splitter $INPUT_FILE $OUTPUT_FOLDER --doc_sep="-DOCSTART-\tO"
        else:
            !python -m genalog.text.splitter $INPUT_FILE $OUTPUT_FOLDER

## Configurations
We will generate the synthetic dataset on your local disk first. You will need to specify the following CONSTANTS to locate where to store the dataset:

1. `ROOT_FOLDER`: root directory of the dataset, path can be relative to the location of this notebook.
1. `SRC_DATASET_NAME`: name of the source dataset that supplies the text used in the generation
1. `SRC_TRAIN_SPLIT_PATH`: path of the train-split of the source dataset
1. `SRC_TEST_SPLIT_PATH`: path of the test-split of the source dataset
1. `VERSION_NAME`: version name of the generated dataset

You will also have to define the styles and degradation effects you would like to apply to each generated document:
 
1. `STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generated per style combination). An example is shown below:

        STYLE_COMBINATIONS = {
            "language": ["en_US"],
            "font_family": ["Segoe UI"],
            "font_size": ["12px"],
            "text_align": ["left"],
            "hyphenate": [False],
        }
    
    You can extend the list for any style to generate more combinations.
    
    
2. `DEGRADATIONS`: a list defining the sequence of degradation effects applied to the synthetic images. Each element is a two-element tuple: the first element is one of the method names from `genalog.degradation.effect`, and the second is a dictionary of keyword arguments for that method.

        DEGRADATIONS = [
            ("blur", {"radius": 3}),
            ("bleed_through", {"alpha": 0.8}),
            ("morphology", {"operation": "open", "kernel_shape": (3,3), "kernel_type": "ones"}), 
        ]
    The example above will apply degradation effects to synthetic images in the sequence of: 
    
            blur -> bleed_through -> morphological operation (open)
    
   
3. `HTML_TEMPLATE`: name of the HTML template used to generate the synthetic images. The `genalog` package has the following default templates: 

    1. `columns.html.jinja` 
    2. `letter.html.jinja`
    3. `text_block.html.jinja`
    
            HTML_TEMPLATE = 'text_block.html.jinja'

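As an illustration of how the lists in `STYLE_COMBINATIONS` multiply, the sketch below expands a style dictionary into individual combinations with `itertools.product`. It assumes the combinations are the Cartesian product of the per-style lists. The extra font and size values are made up for the example, and this is not necessarily how `genalog` enumerates styles internally.

```python
from itertools import product

# Hypothetical style dictionary with more than one value for some styles
style_combinations = {
    "language": ["en_US"],
    "font_family": ["Segoe UI", "Times New Roman"],
    "font_size": ["11px", "12px"],
    "text_align": ["left", "justify"],
    "hyphenate": [False],
}

keys = list(style_combinations)
# One dictionary per combination: 1 * 2 * 2 * 2 * 1 = 8 combinations
expanded = [dict(zip(keys, values)) for values in product(*style_combinations.values())]
print(len(expanded))
print(expanded[0])
```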
In [None]:
from genalog.degradation.degrader import ImageState

ROOT_FOLDER = "/data/enki/datasets/synthetic_dataset/"
SRC_DATASET_NAME = "CoNLL_2003_v3"
VERSION_NAME = "hyphens_close_heavy"
SRC_TRAIN_SPLIT_PATH = ROOT_FOLDER + SRC_DATASET_NAME + "/shared/train/clean_text/"
SRC_TEST_SPLIT_PATH = ROOT_FOLDER + SRC_DATASET_NAME + "/shared/test/clean_text/"
DST_TRAIN_PATH = ROOT_FOLDER + SRC_DATASET_NAME + "/" + VERSION_NAME + "/train/"
DST_TEST_PATH = ROOT_FOLDER + SRC_DATASET_NAME + "/" + VERSION_NAME + "/test/"

STYLE_COMBINATIONS = {
    "language": ["en_US"],
    "font_family": ["Segoe UI"],
    "font_size": ["12px"],
    "text_align": ["justify"],
    "hyphenate": [True],
}

DEGRADATIONS = [
    # Stacking degradations
    ("morphology", {"operation": "open", "kernel_shape":(9,9), "kernel_type":"plus"}),
    ("morphology", {"operation": "close", "kernel_shape":(9,1), "kernel_type":"ones"}),
    ("salt", {"amount": 0.9}),
    ("overlay", {
        "src": ImageState.ORIGINAL_STATE,
        "background": ImageState.CURRENT_STATE,
    }),
    ("bleed_through", {
        "src": ImageState.CURRENT_STATE,
        "background": ImageState.ORIGINAL_STATE,
        "alpha": 0.95,
        "offset_x": -6,
        "offset_y": -12,
    }),
    ("pepper", {"amount": 0.001}),
    ("blur", {"radius": 5}),
    ("salt", {"amount": 0.1}),
]

HTML_TEMPLATE = "text_block.html.jinja"

IMG_RESOLUTION = 300 #dpi

print(f"Training set will be saved to: '{DST_TRAIN_PATH}'")
print(f"Testing set will be saved to: '{DST_TEST_PATH}'")

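Because `DEGRADATIONS` is an ordered list of `(name, kwargs)` pairs, the order of the entries matters. The sketch below illustrates that idea with toy effects on a list of pixel values. The functions `brighten`, `invert`, and `apply_effects` are hypothetical stand-ins, not the effects in `genalog.degradation.effect` and not how the `genalog` degrader is implemented.

```python
def brighten(pixels, amount=1):
    """Toy effect: add a constant, clipped at 255."""
    return [min(255, p + amount) for p in pixels]


def invert(pixels):
    """Toy effect: invert 8-bit values."""
    return [255 - p for p in pixels]


TOY_EFFECTS = {"brighten": brighten, "invert": invert}


def apply_effects(pixels, degradations):
    """Apply each (name, kwargs) pair in order, feeding each result to the next."""
    for name, kwargs in degradations:
        pixels = TOY_EFFECTS[name](pixels, **kwargs)
    return pixels


toy_degradations = [("brighten", {"amount": 10}), ("invert", {})]
result = apply_effects([0, 100, 250], toy_degradations)
reversed_result = apply_effects([0, 100, 250], toy_degradations[::-1])
print(result)           # brighten, then invert
print(reversed_result)  # invert, then brighten
```

The two results differ, which is why the sequence in `DEGRADATIONS` should be chosen deliberately.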
## Load in Text Documents

In [None]:
import glob
import os

train_text = sorted(glob.glob(SRC_TRAIN_SPLIT_PATH + "*.txt"))
test_text = sorted(glob.glob(SRC_TEST_SPLIT_PATH + "*.txt"))

print(f"Number of training text documents: {len(train_text)}")
print(f"Number of testing text documents: {len(test_text)}")

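Note that `sorted` on file paths orders them as strings, so `10.txt` comes before `2.txt`. If the numeric order of the `<INDEX>.txt` fragments matters for your use case, a sketch like the following sorts by the integer index instead. The helper `numeric_sort` is hypothetical and assumes every file stem is an integer.

```python
from pathlib import Path


def numeric_sort(paths):
    """Sort '<INDEX>.txt' paths by integer index rather than by string."""
    return sorted(paths, key=lambda p: int(Path(p).stem))


names = ["10.txt", "2.txt", "0.txt", "1.txt"]
print(sorted(names))        # string order
print(numeric_sort(names))  # numeric order
```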
## Document Sample 

In [None]:
from genalog.pipeline import AnalogDocumentGeneration
from IPython.display import Image, display
import timeit
import cv2

sample_file = test_text[0]
print(f"Sample Filename: {sample_file}")
doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION)
print(f"Available Templates: {doc_generation.list_templates()}")

start_time = timeit.default_timer()
img_array = doc_generation.generate_img(sample_file, HTML_TEMPLATE, target_folder=None)
elapsed = timeit.default_timer() - start_time
print(f"Time to generate 1 document: {elapsed:.3f} sec")

_, encoded_image = cv2.imencode('.png', img_array)
# cv2.imencode returns a NumPy buffer; pass raw bytes to IPython's Image
display(Image(data=encoded_image.tobytes(), width=600))

## Execute Generation

In [None]:
from genalog.pipeline import generate_dataset_multiprocess

# Generating test set
generate_dataset_multiprocess(
    test_text, DST_TEST_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, 
    resolution=IMG_RESOLUTION, batch_size=5
)

In [None]:
from genalog.pipeline import generate_dataset_multiprocess

# Generating training set
generate_dataset_multiprocess(
    train_text, DST_TRAIN_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, 
    resolution=IMG_RESOLUTION, batch_size=5
)

### Saving Dataset Configurations as .json

In [None]:
from genalog.pipeline import ImageStateEncoder
import json

layout_json_path = ROOT_FOLDER + SRC_DATASET_NAME + "/" + VERSION_NAME + "/layout.json"
degradation_json_path = ROOT_FOLDER + SRC_DATASET_NAME + "/" + VERSION_NAME + "/degradation.json"

layout = {
    "style_combinations": STYLE_COMBINATIONS,
    "img_resolution": IMG_RESOLUTION,
    "html_templates": [HTML_TEMPLATE],
}

layout_js_str = json.dumps(layout, indent=2)
degrade_js_str = json.dumps(DEGRADATIONS, indent=2, cls=ImageStateEncoder)

with open(layout_json_path, "w") as f:
    f.write(layout_js_str)
    
with open(degradation_json_path, "w") as f:
    f.write(degrade_js_str)
    
print(f"Writing configs to {layout_json_path}")
print(f"Writing configs to {degradation_json_path}")

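The `cls=` argument is needed because `DEGRADATIONS` contains enum members, which `json.dumps` cannot serialize by default. The sketch below shows the general pattern with a custom `json.JSONEncoder`. `DemoState` and `DemoStateEncoder` are hypothetical stand-ins for illustration and may differ from how `ImageState` and `ImageStateEncoder` are actually implemented.

```python
import json
from enum import Enum


class DemoState(Enum):
    """Hypothetical stand-in for an image-state enum."""
    ORIGINAL_STATE = "ORIGINAL_STATE"
    CURRENT_STATE = "CURRENT_STATE"


class DemoStateEncoder(json.JSONEncoder):
    def default(self, obj):
        # Serialize enum members by value; defer everything else to the base class
        if isinstance(obj, DemoState):
            return obj.value
        return super().default(obj)


demo_degradations = [("overlay", {"src": DemoState.ORIGINAL_STATE})]
encoded = json.dumps(demo_degradations, cls=DemoStateEncoder)
print(encoded)
```

Tuples are written as JSON arrays, so the `(name, kwargs)` pairs come back as two-element lists when the file is loaded again.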
## Setup Azure Blob Client

We will use Azure Cognitive Services to run OCR on these synthetic images, so we first need to upload the dataset to Blob Storage.

1. If you haven't already, set up new Azure resources 
    1. [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) (for storage)
    1. [Azure Cognitive Search](https://azure.microsoft.com/en-us/services/search/) (for OCR results)
1. Create a `.secrets` file with environment variables that include the names of the index, indexer, skillset, and datasource to create on the search service. Also include the keys for the blob storage that contains the documents you want to index, for the Cognitive Services resource, for your Computer Vision subscription, and for the search service. To index more than 20 documents, you must have a Cognitive Services subscription. An example `.secrets` file is shown below:

    ```bash

    SEARCH_SERVICE_NAME = "ocr-ner-pipeline"
    SKILLSET_NAME = "ocrskillset"
    INDEX_NAME = "ocrindex"
    INDEXER_NAME = "ocrindexer"
    DATASOURCE_NAME = "<BLOB STORAGE ACCOUNT NAME>"
    DATASOURCE_CONTAINER_NAME = "<BLOB CONTAINER NAME>"
    
    COMPUTER_VISION_ENDPOINT = "https://<YOUR ENDPOINT NAME>.cognitiveservices.azure.com/"
    COMPUTER_VISION_SUBSCRIPTION_KEY = "<YOUR SUBSCRIPTION KEY>"
    
    BLOB_NAME = "<YOUR BLOB STORAGE NAME>"
    BLOB_KEY = "<YOUR BLOB KEY>"
    SEARCH_SERVICE_KEY = "<YOUR SEARCH SERVICE KEY>"
    COGNITIVE_SERVICE_KEY = "<YOUR COGNITIVE SERVICE KEY>"
    ```

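Before creating the clients, it can help to confirm that the variables from the example file are actually set, since a missing key tends to surface later as an authentication error. The sketch below is a hypothetical pre-flight check using only the standard library. The list mirrors the example file above, and whether `genalog` reads every one of these variables is an assumption.

```python
import os

# Variable names taken from the example file above
REQUIRED_VARS = [
    "SEARCH_SERVICE_NAME", "SKILLSET_NAME", "INDEX_NAME", "INDEXER_NAME",
    "DATASOURCE_NAME", "DATASOURCE_CONTAINER_NAME",
    "COMPUTER_VISION_ENDPOINT", "COMPUTER_VISION_SUBSCRIPTION_KEY",
    "BLOB_NAME", "BLOB_KEY", "SEARCH_SERVICE_KEY", "COGNITIVE_SERVICE_KEY",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demo with an explicit mapping; call missing_vars() after load_dotenv() to check os.environ
print(missing_vars({"BLOB_NAME": "demo", "BLOB_KEY": "demo"}))
```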
In [None]:
from dotenv import load_dotenv
from genalog.ocr.blob_client import GrokBlobClient

# Setup variables and authenticate blob client
ROOT_FOLDER = "/data/enki/datasets/synthetic_dataset/"
SRC_DATASET_NAME = "CoNLL_2012_v3"

local_path = ROOT_FOLDER + SRC_DATASET_NAME 
remote_path = SRC_DATASET_NAME

print(f"Uploading from local_path: {local_path}")
print(f"Uploading to remote_path:  {remote_path}")

load_dotenv("../.secrets")

blob_client = GrokBlobClient.create_from_env_var()

## Upload Dataset to Azure Blob Storage

In [None]:
import time
# Python uploads can be slow.
# for very large datasets use azcopy: https://github.com/Azure/azure-storage-azcopy
start = time.time()
dest, res = blob_client.upload_images_to_blob(local_path, remote_path, use_async=True)
await res
print("time (mins): ", (time.time()-start)/60)

In [None]:
# Delete a remote folder on Blob
# blob_client.delete_blobs_folder("CoNLL_2003_v2_test")

## Run Indexer and Retrieve OCR results
Please note that this process can take a **long time**, but you can upload multiple datasets to Blob Storage and run the indexer once for all of them.

In [None]:
from genalog.ocr.rest_client import GrokRestClient
from dotenv import load_dotenv

load_dotenv("../.secrets")
grok_rest_client = GrokRestClient.create_from_env_var()
grok_rest_client.create_indexing_pipeline()
grok_rest_client.run_indexer()

# wait for indexer to finish
grok_rest_client.poll_indexer_till_complete()

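Waiting for a long-running job generally follows a poll-until-done pattern. The sketch below shows that pattern in isolation, with a timeout so that a stuck job does not block the notebook forever. The function `poll_until`, its parameters, and the status strings are hypothetical and are not the implementation of `poll_indexer_till_complete`.

```python
import time


def poll_until(get_status, done_states=("success",), interval_sec=0.0, timeout_sec=5.0):
    """Call get_status() until it returns a done state or the timeout elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_sec} sec")
        time.sleep(interval_sec)


# Simulated status feed standing in for a real service call
statuses = iter(["inProgress", "inProgress", "success"])
final_status = poll_until(lambda: next(statuses))
print(final_status)
```

For a real service, `interval_sec` would typically be several seconds to avoid flooding the endpoint with requests.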
## Download OCR Results

In [None]:
import os
# Download multiple datasets to local disk
remote_path = SRC_DATASET_NAME
local_path = ROOT_FOLDER + SRC_DATASET_NAME
versions = ["hyphens_all_heavy"]
version_prefix = ""
version_suffixes = [""]
print(f"Remote Path: {remote_path} \nLocal Path: {local_path} \nVersions: {versions}")

blob_img_paths_test = []
blob_img_paths_train = []
local_ocr_json_paths_test = []
local_ocr_json_paths_train = []
version_name = ""
for version in versions:
    for weight in version_suffixes:
        version_name = version_prefix + version + weight
        blob_img_paths_test.append(os.path.join(remote_path, version_name, "test", "img"))
        blob_img_paths_train.append(os.path.join(remote_path, version_name, "train", "img"))
        local_ocr_json_paths_test.append(os.path.join(local_path, version_name, "test", "ocr"))
        local_ocr_json_paths_train.append(os.path.join(local_path, version_name, "train", "ocr"))
print(f"Example Version Name: {version_name}")

In [None]:
# download OCR
for blob_path_test, blob_path_train, local_path_test, local_path_train in \
    zip(blob_img_paths_test, blob_img_paths_train, \
        local_ocr_json_paths_test, local_ocr_json_paths_train):
        
    print(f"Downloading \nfrom remote path:'{blob_path_test}' \n   to local path:'{local_path_test}'")
    await blob_client.get_ocr_json(blob_path_test, output_folder=local_path_test, use_async=True)
    print(f"Downloading \nfrom remote path:'{blob_path_train}' \n   to local path:'{local_path_train}'")
    await blob_client.get_ocr_json(blob_path_train, output_folder=local_path_train, use_async=True)

## Generate OCR Metrics

In [None]:
import os 

local_path = ROOT_FOLDER + SRC_DATASET_NAME
versions = ["hyphens_all_heavy"]
version_prefix = ""
version_suffixes = [""]
print(f"Local Path: {local_path} \nVersions: {versions}\n")

input_json_path_templates = []
output_metric_path = []
for version in versions:
    for suffix in version_suffixes:
        version_name = version_prefix + version + suffix
        # Location depends on the input dataset
        input_json_path_templates.append(os.path.join(local_path, version_name, "<test/train>/ocr"))
        output_metric_path.append(os.path.join(local_path, version_name))
        
clean_text_path_template = os.path.join(local_path, "shared/<test/train>/clean_text")
csv_metric_name_template = "<test/train>_ocr_metrics.csv"
subs_json_name_template = "<test/train>_substitutions.json"
avg_metric_name = "ocr_metrics.csv"

print(f"Loading \n'{clean_text_path_template}' \nand \n'{input_json_path_templates[0]}'...")
print(f"Saving to {output_metric_path}")

In [None]:
import os
import json
import pandas as pd
from genalog.ocr.metrics import get_metrics, substitution_dict_to_json

# Use a separate loop variable (output_dir) so the output_metric_path list
# is not overwritten and this cell can be re-run safely
for input_json_path_template, output_dir in zip(input_json_path_templates, output_metric_path):
    subsets = ["train", "test"]
    avg_stat = {subset: None for subset in subsets}
    for subset in subsets:
        clean_text_path = clean_text_path_template.replace("<test/train>", subset)
        ocr_json_path = input_json_path_template.replace("<test/train>", subset)
        csv_metric_name = csv_metric_name_template.replace("<test/train>", subset)
        subs_json_name = subs_json_name_template.replace("<test/train>", subset)

        output_csv_name = os.path.join(output_dir, csv_metric_name)
        output_json_name = os.path.join(output_dir, subs_json_name)

        print(f"Saving to '{output_csv_name}' \nand '{output_json_name}'")
        df, subs, actions = get_metrics(clean_text_path, ocr_json_path, use_multiprocessing=True)
        # Write the per-document metrics and the substitution counts
        df.to_csv(output_csv_name)
        with open(output_json_name, "w") as f:
            json.dump(substitution_dict_to_json(subs), f)
        # Average the metrics over the subset
        avg_stat[subset] = df.mean()

    # Save the average metrics, one column per subset
    avg_stat = pd.DataFrame(avg_stat)
    output_avg_csv = os.path.join(output_dir, avg_metric_name)
    avg_stat.to_csv(output_avg_csv)
    print(f"Saving average metrics to {output_avg_csv}")
    print(avg_stat[16:])

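To give an intuition for what a character substitution record captures, the sketch below counts one-to-one character replacements between a clean string and its OCR'ed version using `difflib.SequenceMatcher` from the standard library. This is an illustration only: `count_substitutions` is a hypothetical helper, and `genalog.ocr.metrics` uses its own alignment, which may count errors differently.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_substitutions(clean, ocr):
    """Count equal-length 'replace' spans as per-character (clean, ocr) substitutions."""
    counts = Counter()
    matcher = SequenceMatcher(None, clean, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly to character pairs
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(clean[i1:i2], ocr[j1:j2]))
    return counts


demo_subs = count_substitutions("hello world", "he1lo wor1d")
print(demo_subs)
```

Here the letter `l` is misread as the digit `1` twice, which is the kind of recurring confusion a substitution file makes visible.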
## Organize OCR'ed Text into IOB Format For Model Training Purpose

The last step in preparing the dataset is to format all the OCR'ed text and the NER labels into a usable format for training. Our model consumes data in IOB format, which is the same format used in the CoNLL datasets.

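As a reminder of what IOB tagging looks like, the sketch below converts entity spans over tokens into one tag per token, using the common convention where `B-` marks the first token of an entity, `I-` marks its continuation, and `O` marks tokens outside any entity. The helper `spans_to_iob` and the example sentence are hypothetical and unrelated to `genalog.text.conll_format`.

```python
def spans_to_iob(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"
        for index in range(start + 1, end):
            tags[index] = f"I-{entity_type}"
    return tags


tokens = ["New", "York", "beat", "Japan"]
tags = spans_to_iob(tokens, [(0, 2, "LOC"), (3, 4, "LOC")])
# One 'TOKEN TAG' pair per line, as in the CoNLL-style files
for token, tag in zip(tokens, tags):
    print(f"{token} {tag}")
```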
In [None]:
base_path = "/data/enki/datasets/synthetic_dataset/CoNLL_2012_v3"
versions = ["hyphens_all_heavy"]
version_prefix = ""
version_suffixes = [""]
version_names = []
for version in versions:
    for suffix in version_suffixes:
        version_names.append(version_prefix + version + suffix)
print(f"base_path: {base_path}\nversion_names: {version_names}")

In [None]:
for version in version_names:
    !python -m genalog.text.conll_format $base_path $version --train_subset

## [Optional] Re-upload Local Dataset to Blob 

We can re-upload the local copy of the dataset to Blob Storage to keep the two copies in sync.

In [None]:
import os

local_dataset_to_sync = os.path.join(local_path)
blob_path = os.path.join(remote_path)
print(f"local_dataset_to_sync: {local_dataset_to_sync}\nblob_path: {blob_path}")

In [None]:
import time
# Python uploads can be slow.
# for very large datasets use azcopy: https://github.com/Azure/azure-storage-azcopy
start = time.time()
dest, res = blob_client.upload_images_to_blob(local_dataset_to_sync, blob_path, use_async=True)
await res
print("time (mins): ", (time.time()-start)/60)