## Unsupervised Synthetic Dataset Generation

Here we demonstrate how to use our `genalog` package to generate synthetic analog documents with custom image degradation from unlabeled data, that is, documents rich in natural language text.
![genalog_class_diagram](static/unlabeled_synthetic_pipeline.png)

### Package Requirements
1. Have the `genalog` package installed in your virtual environment:
    - Install from source:
        1. `git clone https://msazure.visualstudio.com/DefaultCollection/Cognitive%20Services/_git/Tools-Synthetic-Data-Generator`
        1. `cd Tools-Synthetic-Data-Generator`
        1. `python -m venv .env`
        1. `source .env/bin/activate` or on Windows `.env/Scripts/activate.bat`
        1. `pip install -r requirements.txt`
        1. `pip install -e .`
    - Install from Azure Artifacts:
        1. Visit this [Azure Artifacts repo](https://msazure.visualstudio.com/DefaultCollection/Cognitive%20Services/_packaging?_a=package&feed=CognitiveServices&package=genalog&protocolType=PyPI&version=0.0.0) and download the latest version
        1. Relocate the `.whl` package if necessary
        1. Create your virtual environment `python -m venv .env`
        1. Activate the virtual environment `source .env/bin/activate` or on Windows `.env/Scripts/activate.bat`
        1. Run `pip install <GENALOG_WHEEL_NAME>` 
    
1. Download the TATK NER model (for generating NER labels for unlabeled data)
    1. Please see the [Cognitive Services Wiki](https://msazure.visualstudio.com/Cognitive%20Services/_wiki/wikis/Cognitive%20Services.wiki/35359/Local-Setup) for details.
    1. Specify the path and version of the TA model in the environment variables `MODEL_ROOT_PATH` and `MODEL_VERSION`
1. A collection of **preprocessed text files** from a source dataset. Each text file will be used to generate one synthetic image.
    
(**Skip** the following steps if you don't want to store documents in the cloud)
1. If you haven't already, set up new Azure resources
    1. [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) (for storage)
    1. [Azure Cognitive Search](https://azure.microsoft.com/en-us/services/search/) (for OCR results)
1. Create a `.secrets` file with the environment variables that name the index, indexer, skillset, and data source to create on the search service. Also include the keys for the blob storage that contains the documents you want to index, for the Cognitive Services resource, and for your Computer Vision subscription and search service. To index more than 20 documents, you must have a Cognitive Services subscription. An example `.secrets` file is shown below:

    ```bash
    MODEL_ROOT_PATH = <>
    MODEL_VERSION = <>
    
    SEARCH_SERVICE_NAME = "ocr-ner-pipeline"
    SKILLSET_NAME = "ocrskillsetcnn"
    INDEX_NAME = "ocrindexcnn"
    INDEXER_NAME = "ocrindexercnn"
    DATASOURCE_NAME = "enkidata"
    DATASOURCE_CONTAINER_NAME = <>
    PROJECTIONS_CONTAINER_NAME = <>
    COMPUTER_VISION_ENDPOINT = "https://enki-vision.cognitiveservices.azure.com/"
    COMPUTER_VISION_SUBSCRIPTION_KEY = <>
    
    BLOB_NAME = "enkidata"
    BLOB_KEY = <>
    SEARCH_QUERY_KEY = <>
    SEARCH_SERVICE_KEY = <>
    COGNITIVE_SERVICE_KEY = <>
    ```
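Before running the cloud steps, it can help to confirm that every variable from the example file is actually set. The following is a minimal sketch, assuming the variables have already been loaded into the process environment (for example with `load_dotenv`). `REQUIRED_VARS` and `missing_env_vars` are illustrative names and are not part of `genalog`.

```python
import os

# Variable names copied from the example .secrets file above.
REQUIRED_VARS = [
    "SEARCH_SERVICE_NAME", "SKILLSET_NAME", "INDEX_NAME", "INDEXER_NAME",
    "DATASOURCE_NAME", "DATASOURCE_CONTAINER_NAME", "PROJECTIONS_CONTAINER_NAME",
    "COMPUTER_VISION_ENDPOINT", "COMPUTER_VISION_SUBSCRIPTION_KEY",
    "BLOB_NAME", "BLOB_KEY",
    "SEARCH_QUERY_KEY", "SEARCH_SERVICE_KEY", "COGNITIVE_SERVICE_KEY",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical helper usage: report anything that still needs a value.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```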


## Dataset file structure 

Our dataset follows this file structure:
```
<ROOT FOLDER>/                             #eg. synthetic-image-root
     <SRC_DATASET_NAME>                    #eg. CNN-Dailymail-Stories
        │
        │───shared/                        #common files shared across different dataset versions
        │     │───train/
        │     │     │───clean_text/  
        │     │     │     │─0.txt
        │     │     │     │─1.txt
        │     │     │     └─...
        │     │     └───clean_labels/
        │     │           │─0.txt
        │     │           │─1.txt
        │     │           └─...
        │     └───test/
        │           │───clean_text/*.txt
        │           └───clean_labels/*.txt
        │   
        └───<VERSION_NAME>/                #e.g. hyphens_blur_heavy
               │───train/
               │     │─img/*.png           #Degraded Images
               │     │─ocr/*.json          #json output files that are output of GROK
               │     │─ocr_text/*.txt      #text output retrieved from OCR Json Files
               │     └─ocr_labels/*.txt    #Aligned labels files in IOB format
               │───test/
               │     │─img/*.png           #Degraded Images
               │     │─ocr/*.json          #json output files that are output of GROK
               │     │─ocr_text/*.txt      #text output retrieved from OCR Json Files
               │     └─ocr_labels/*.txt    #Aligned labels files in IOB format
               │
               │───layout.json             #records page layout info (font-family,template name, etc)
               │───degradation.json        #records degradation parameters
               │───ocr_metric.csv          #records metrics on OCR noise across the dataset
               └───substitution.json       #records character substitution errors in the OCR'ed text.
```
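If you want to inspect or pre-create this layout, the folder tree above can be sketched with `pathlib`. This is only an illustration of the structure; `make_dataset_skeleton` is a hypothetical helper, and whether the `genalog` tools create any of these folders themselves is not assumed here.

```python
from pathlib import Path


def make_dataset_skeleton(root, src_dataset_name, version_name, subsets=("train", "test")):
    """Create the empty folder layout shown above and return the dataset path."""
    dataset = Path(root) / src_dataset_name
    for subset in subsets:
        # Files shared across dataset versions
        for folder in ("clean_text", "clean_labels"):
            (dataset / "shared" / subset / folder).mkdir(parents=True, exist_ok=True)
        # Files specific to one degradation version
        for folder in ("img", "ocr", "ocr_text", "ocr_labels"):
            (dataset / version_name / subset / folder).mkdir(parents=True, exist_ok=True)
    return dataset
```

For example, `make_dataset_skeleton("synthetic-image-root", "CNN-Dailymail-Stories", "hyphens_blur_heavy")` would create the tree using the example names from the diagram.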

## Dataset Setup
We will generate the synthetic dataset on your local disk first. Put all natural-language-rich documents in a single folder. If your input is a single large file, split it into smaller chunks with the `split` command, e.g. `split -l 5 -a 7 -d big_file.txt`, because labels cannot be generated for large files due to memory constraints.

You will then need to specify the following constants to locate the input data and where to store the dataset:

1. `INPUT_DATA`: directory of the unlabeled input data you want to process.
1. `ROOT_FOLDER`: root directory of the dataset; the path can be relative to the location of this notebook.
1. `SRC_DATASET_NAME`: name of the dataset you are creating.
1. **Separate NER labels from document text**: NER labels will be stored in the `clean_labels` folder and text in the `clean_text` folder.
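If the `split` command is not available (for example on Windows), a large input file can be chunked from Python instead. The sketch below is a rough equivalent of `split -l 5`, assuming UTF-8 text. `split_text_file` and its output naming (`0.txt`, `1.txt`, ...) are illustrative choices, not part of `genalog`.

```python
import os
from itertools import islice


def split_text_file(path, out_dir, lines_per_chunk=5):
    """Write consecutive groups of lines from `path` to numbered files in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path, encoding="utf-8") as src:
        while True:
            # Read at most `lines_per_chunk` lines without loading the whole file
            lines = list(islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = os.path.join(out_dir, f"{len(chunk_paths)}.txt")
            with open(chunk_path, "w", encoding="utf-8") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

The returned list of chunk paths can then be used as the content of the `INPUT_DATA` folder.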

In [None]:
import os

INPUT_DATA = "/data/enki/datasets/CNN_Articles_Cased/demo"
ROOT_FOLDER = "/data/enki/datasets/synthetic_dataset"
SRC_DATASET_NAME = "cnn_stories_cased_demo"
CLEAN_TEXT_DIR = os.path.join(ROOT_FOLDER,SRC_DATASET_NAME,"shared","train","clean_text")
CLEAN_LABEL_DIR = os.path.join(ROOT_FOLDER,SRC_DATASET_NAME,"shared","train","clean_labels")

In [None]:
import os
import shutil
import tqdm

# Create the destination folders if they do not exist yet
os.makedirs(CLEAN_TEXT_DIR, exist_ok=True)
os.makedirs(CLEAN_LABEL_DIR, exist_ok=True)

# Copy clean text files to the new location, normalizing the extension to .txt
print(f"Copying clean text files \nfrom {INPUT_DATA} \nto {CLEAN_TEXT_DIR}\n")
for filename in tqdm.tqdm(os.listdir(INPUT_DATA)):
    base, _ = os.path.splitext(filename)
    shutil.copyfile(os.path.join(INPUT_DATA, filename), os.path.join(CLEAN_TEXT_DIR, base + ".txt"))

In [None]:
# Render example input
import os
example = sorted(os.listdir(CLEAN_TEXT_DIR))[2]
with open(os.path.join(CLEAN_TEXT_DIR, example)) as f:
    print(f.read()[:500])

## Generate Labels

If your dataset has no labels, you will need to generate them first. To do so, first download the TATK NER model, then add the model path and model version to your environment variable file, e.g.:

```
MODEL_ROOT_PATH = "/mnt/c/Users/dabanda/Downloads/entitygeneral/tatk"
MODEL_VERSION = "1.0.0.1"
```
We have a utility script to download the model from blob storage; see `.scripts/download_model.py`.

After downloading the model, run the label generator. This utility calls the model to get the NER labels. The TATK model supports file sizes up to a maximum of 50 MB. If your file is above this size, you can split it into smaller chunks with the `split` command, e.g. `split -l 5 -a 7 -d big_file.txt`.

usage:
```
python -m genalog.text.label_generator -h

usage: label_generator.py [-h] [--use_multiprocesssing USE_MULTIPROCESSSING]
                          [--batch_size BATCH_SIZE]
                          input_dir output_dir

positional arguments:
  input_dir             input folder containing text files
  output_dir            folder to place label tsv files

optional arguments:
  -h, --help              show this help message and exit
  --use_multiprocesssing  use multiprocessing
  --batch_size BATCH_SIZE batch size
```

Run the generator for train and test sets:

```
python -m genalog.text.label_generator <ROOT_FOLDER>/<SRC_DATASET_NAME>/shared/train/clean_text/ <ROOT_FOLDER>/<SRC_DATASET_NAME>/shared/train/clean_labels/ --batch_size 5
python -m genalog.text.label_generator <ROOT_FOLDER>/<SRC_DATASET_NAME>/shared/test/clean_text/ <ROOT_FOLDER>/<SRC_DATASET_NAME>/shared/test/clean_labels/ --batch_size 5
```
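The same command can also be issued once per subset from Python. The sketch below only builds and runs the command line documented in the usage text above; `label_generator_command` and `generate_labels` are hypothetical helper names, and the folder layout is assumed to follow the dataset file structure described earlier.

```python
import os
import subprocess
import sys


def label_generator_command(root_folder, src_dataset_name, subset, batch_size=5):
    """Build the label generator command line for one subset ("train" or "test")."""
    shared = os.path.join(root_folder, src_dataset_name, "shared", subset)
    return [
        sys.executable, "-m", "genalog.text.label_generator",
        os.path.join(shared, "clean_text"),    # input_dir
        os.path.join(shared, "clean_labels"),  # output_dir
        "--batch_size", str(batch_size),
    ]


def generate_labels(root_folder, src_dataset_name, subsets=("train", "test")):
    """Run the label generator for each subset, stopping on the first failure."""
    for subset in subsets:
        subprocess.run(label_generator_command(root_folder, src_dataset_name, subset), check=True)
```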

### Generate Labels from MT-LSTM Model file
We can rely on the TA NER model to provide NER labels for the input text. For the `label_generator` to work, we need the following additional dependencies installed in your virtual environment:
```
msft-tatk==1.0.122032a1 
torch==1.5.0
pytorch-pretrained-bert==0.6.2
```

Some of these are TA's dependencies; you can find instructions to install them from Azure Artifacts [here](https://msazure.visualstudio.com/Cognitive%20Services/_wiki/wikis/Cognitive%20Services.wiki/35359/Local-Setup?anchor=on-windows).

Please also remember to set up the following environment variables from above in a `.secrets` file:
```
MODEL_ROOT_PATH = <>
MODEL_VERSION = <>
```

In [None]:
import glob
from dotenv import load_dotenv
load_dotenv("../.secrets")

if len(os.listdir(CLEAN_LABEL_DIR)) != 0:
    print(f"CLEAN_LABEL_DIR: {CLEAN_LABEL_DIR} exists. Emptying the existing content")
    existing_files = glob.glob(CLEAN_LABEL_DIR + "/*.txt")
    for f in existing_files:
        os.remove(f)

!python -m genalog.text.label_generator $CLEAN_TEXT_DIR $CLEAN_LABEL_DIR

In [None]:
# Show after labeling
example = sorted(os.listdir(CLEAN_LABEL_DIR))[2]
with open(os.path.join(CLEAN_LABEL_DIR, example)) as f:
    print(f.read()[:203])

## Configurations
We will generate the synthetic dataset on your local disk first. You will need to specify the following CONSTANTS to locate where to store the dataset:

1. `SRC_TRAIN_SPLIT_PATH`: path of the train-split of the source dataset
1. `SRC_TEST_SPLIT_PATH`: path of the test-split of the source dataset
1. `VERSION_NAME`: version name of the generated dataset

You will also have to define the styles and degradation effects you would like to apply to each generated document:
 
1. `STYLE_COMBINATIONS`: a dictionary defining the combination of styles to generate per text document (i.e. a copy of the same text document is generated per style combination). An example is shown below:

        STYLE_COMBINATIONS = {
        "language": ["en_US"],
        "font_family": ["Segoe UI"],
        "font_size": ["12px"],
        "text_align": ["left"],
        "hyphenate": [False],
        }
    
    You can expand the list of each style for more combinations
    
    
2. `DEGRADATIONS`: a list defining the sequence of degradation effects applied onto the synthetic images. Each element is a two-element tuple of which the first element is one of the method names from  `genalog.degradation.effect` and the second element is the corresponding function keyword arguments.

        DEGRADATIONS = [
            ("blur", {"radius": 3}),
            ("bleed_through", {"alpha": 0.8}),
            ("morphology", {"operation": "open", "kernel_shape": (3,3), "kernel_type": "ones"}), 
        ]
    The example above will apply degradation effects to synthetic images in the sequence of: 
    
            blur -> bleed_through -> morphological operation (open)
    
   
3. `HTML_TEMPLATE`: name of html template used to generate the synthetic images. The `genalog` package has the following default templates: 

    1. `columns.html.jinja` 
    2. `letter.html.jinja`
    3. `text_block.html.jinja`
    
            HTML_TEMPLATE = 'text_block.html.jinja'
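To estimate how many images a given `STYLE_COMBINATIONS` dictionary will produce per text document, you can expand it yourself. The sketch below assumes the combinations are the full cartesian product of the per-style lists, as the description above suggests; it illustrates the counting only and is not taken from the `genalog` source. `expand_style_combinations` is a hypothetical helper.

```python
from itertools import product


def expand_style_combinations(style_combinations):
    """Return one dict per combination of the listed style values."""
    keys = list(style_combinations)
    value_lists = (style_combinations[key] for key in keys)
    return [dict(zip(keys, values)) for values in product(*value_lists)]


styles = {
    "language": ["en_US"],
    "font_family": ["Segoe UI", "Times New Roman"],
    "font_size": ["10px", "12px"],
    "text_align": ["left"],
    "hyphenate": [False, True],
}
combos = expand_style_combinations(styles)
print(len(combos))  # 1 * 2 * 2 * 1 * 2 = 8 copies per text document
```

Each additional value in a list multiplies the number of generated images, so the dataset size grows quickly as the lists are expanded.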

In [None]:
from genalog.degradation.degrader import ImageState

VERSION_NAME = "hyphens_all_heavy"
SRC_TRAIN_SPLIT_PATH = ROOT_FOLDER + "/" + SRC_DATASET_NAME + "/shared/train/clean_text/"
#SRC_TEST_SPLIT_PATH = ROOT_FOLDER +"/"+ SRC_DATASET_NAME + "/shared/test/clean_text/"
DST_TRAIN_PATH = ROOT_FOLDER + "/"+ SRC_DATASET_NAME + "/" + VERSION_NAME + "/train/"
#DST_TEST_PATH = ROOT_FOLDER + "/"+ SRC_DATASET_NAME + "/" + VERSION_NAME + "/test/"

STYLE_COMBINATIONS = {
    "language": ["en_US"],
    "font_family": ["Segoe UI"],
    "font_size": ["12px"],
    "text_align": ["justify"],
    "hyphenate": [True],
}

DEGRADATIONS = [
## Elementary Operations
#     ("blur", {"radius": 15}),
#     ("salt", {"amount": 0.7}),
#     ("pepper", {"amount": 0.005}),
#     ("bleed_through", {"alpha": 0.8, "offset_x": -5, "offset_y": -5,}),
#     ("morphology", {"operation": "open", "kernel_shape":(9,9), "kernel_type":"plus"}),
    
## Stacking Degradations
    ("morphology", {"operation": "open", "kernel_shape":(9,9), "kernel_type":"plus"}),
    ("morphology", {"operation": "close", "kernel_shape":(9,1), "kernel_type":"ones"}),
    ("salt", {"amount": 0.5}),
    ("overlay", {
        "src": ImageState.ORIGINAL_STATE,
        "background": ImageState.CURRENT_STATE,
    }),
    ("bleed_through", {
        "src": ImageState.CURRENT_STATE,
        "background": ImageState.ORIGINAL_STATE,
        "alpha": 0.8,
        "offset_x": -12,
        "offset_y": -8,
    }),
    ("pepper", {"amount": 0.015}),
    ("blur", {"radius": 11}),
    ("salt", {"amount": 0.15}),
]

HTML_TEMPLATE = "text_block.html.jinja"

IMG_RESOLUTION = 300 #dpi
print(f"Training set will be created from: '{SRC_TRAIN_SPLIT_PATH}'")
print(f"Training set will be saved to: '{DST_TRAIN_PATH}'")
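Because `DEGRADATIONS` is an ordered list of `(name, kwargs)` tuples, the order of the entries affects the result. The toy sketch below illustrates that idea on a tiny grayscale image stored as nested lists. The `brighten` and `threshold` functions are stand-ins invented for this example and are not functions from `genalog.degradation.effect`; the `ImageState` references used by `overlay` and `bleed_through` are not modeled.

```python
def brighten(img, amount=50, max_value=255):
    """Add `amount` to every pixel, clipping at `max_value`."""
    return [[min(max_value, px + amount) for px in row] for row in img]


def threshold(img, cutoff=128):
    """Map every pixel to 255 if it is at least `cutoff`, else 0."""
    return [[255 if px >= cutoff else 0 for px in row] for row in img]


# Hypothetical dispatch table from effect name to function
TOY_EFFECTS = {"brighten": brighten, "threshold": threshold}


def apply_effects(img, effects):
    """Apply each (name, kwargs) effect in sequence to the image."""
    for name, kwargs in effects:
        img = TOY_EFFECTS[name](img, **kwargs)
    return img


toy_image = [[0, 100], [200, 255]]
first = apply_effects(toy_image, [("brighten", {"amount": 50}), ("threshold", {"cutoff": 128})])
second = apply_effects(toy_image, [("threshold", {"cutoff": 128}), ("brighten", {"amount": 50})])
print(first)   # [[0, 255], [255, 255]]
print(second)  # [[50, 50], [255, 255]]
```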

## Load in Text Documents

In [None]:
import glob
import os

train_text = sorted(glob.glob(SRC_TRAIN_SPLIT_PATH + "*.txt"))
#test_text = sorted(glob.glob(SRC_TEST_SPLIT_PATH + "*.txt"))

print(f"Number of training text documents: {len(train_text)}")
#print(f"Number of testing text documents: {len(test_text)}")

## Document Sample 

In [None]:
from genalog.pipeline import AnalogDocumentGeneration
from IPython.display import Image, display
import timeit
import cv2

sample_file = train_text[0]
print(f"Sample Filename: {sample_file}")
doc_generation = AnalogDocumentGeneration(styles=STYLE_COMBINATIONS, degradations=DEGRADATIONS, resolution=IMG_RESOLUTION)
print(f"Available Templates: {doc_generation.list_templates()}")

# Time the generation of a single document
start_time = timeit.default_timer()
img_array = doc_generation.generate_img(sample_file, HTML_TEMPLATE, target_folder=None)
elapsed = timeit.default_timer() - start_time
print(f"Time to generate 1 document: {elapsed:.3f} sec")

# Encode the image array as PNG bytes for inline display
_, encoded_image = cv2.imencode('.png', img_array)
display(Image(data=encoded_image.tobytes(), width=600))

## Execute Generation

In [None]:
from genalog.pipeline import generate_dataset_multiprocess

# Generating training set
generate_dataset_multiprocess(
    train_text, DST_TRAIN_PATH, STYLE_COMBINATIONS, DEGRADATIONS, HTML_TEMPLATE, 
    resolution=IMG_RESOLUTION, batch_size=1
)

### Saving Dataset Configurations as .json

In [None]:
from genalog.pipeline import ImageStateEncoder
import json

layout_json_path = os.path.join(ROOT_FOLDER, SRC_DATASET_NAME, VERSION_NAME, "layout.json")
degradation_json_path = os.path.join(ROOT_FOLDER, SRC_DATASET_NAME, VERSION_NAME, "degradation.json")

layout = {
    "style_combinations": STYLE_COMBINATIONS,
    "img_resolution": IMG_RESOLUTION,
    "html_templates": [HTML_TEMPLATE],
}

layout_js_str = json.dumps(layout, indent=2)
degrade_js_str = json.dumps(DEGRADATIONS, indent=2, cls=ImageStateEncoder)

with open(layout_json_path, "w") as f:
    f.write(layout_js_str)
    
with open(degradation_json_path, "w") as f:
    f.write(degrade_js_str)
    
print(f"Writing configs to {layout_json_path}")
print(f"Writing configs to {degradation_json_path}")
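The `cls=ImageStateEncoder` argument is needed because enum members such as `ImageState.ORIGINAL_STATE` are not JSON serializable by default. The sketch below shows the general technique with a stand-in enum and a custom `json.JSONEncoder`. `ToyImageState` and `EnumEncoder` are hypothetical names, and the real `ImageStateEncoder` may serialize its values differently.

```python
import json
from enum import Enum


class ToyImageState(Enum):
    """Stand-in for genalog's ImageState, for illustration only."""
    ORIGINAL_STATE = "ORIGINAL_STATE"
    CURRENT_STATE = "CURRENT_STATE"


class EnumEncoder(json.JSONEncoder):
    """Serialize enum members by name; defer everything else to the base class."""

    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.name
        return super().default(obj)


toy_degradations = [("overlay", {"src": ToyImageState.ORIGINAL_STATE})]
encoded = json.dumps(toy_degradations, cls=EnumEncoder)
print(encoded)  # [["overlay", {"src": "ORIGINAL_STATE"}]]
```

Note that JSON has no tuple type, so the `(name, kwargs)` tuples come back as two-element lists when the file is loaded again.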

## Setup Azure Blob Client

In [None]:
from dotenv import load_dotenv
from genalog.ocr.blob_client import GrokBlobClient

# Load the storage credentials from the .secrets file
load_dotenv("../.secrets")

local_path = ROOT_FOLDER + "/" + SRC_DATASET_NAME
remote_path = SRC_DATASET_NAME
# dataset_version = VERSION_NAME
dataset_version = "hyphens_all_heavy"

print(f"Uploading from local_path: {local_path}")
print(f"Upload to remote_path:     {remote_path}")
print(f"dataset_version:           {dataset_version}")

blob_client = GrokBlobClient.create_from_env_var()

## Upload Dataset to Azure Blob Storage

In [None]:
import time
# Python uploads can be slow.
# for very large datasets use azcopy: https://github.com/Azure/azure-storage-azcopy
start = time.time()
dest, res = blob_client.upload_images_to_blob(local_path, remote_path, use_async=True)
await res
print("time (mins): ", (time.time()-start)/60)

## Run Indexer and Retrieve OCR results
Please note that this process can take a **long time**, but you can upload multiple datasets to Blob Storage and run this once for all of them.

In [None]:
from genalog.ocr.rest_client import GrokRestClient
from dotenv import load_dotenv

load_dotenv("../.secrets")
grok_rest_client = GrokRestClient.create_from_env_var()
grok_rest_client.create_indexing_pipeline()
grok_rest_client.run_indexer()

# wait for indexer to finish
grok_rest_client.poll_indexer_till_complete()

## Download OCR Results

In [None]:
import os

#blob_img_path_test = os.path.join(remote_path, dataset_version, "test", "img")
blob_img_path_train = os.path.join(remote_path, dataset_version, "train", "img")
#local_ocr_json_path_test = os.path.join(local_path, dataset_version, "test", "ocr")
local_ocr_json_path_train = os.path.join(local_path, dataset_version, "train", "ocr")
#print(f"Downloading \nfrom remote path:'{blob_img_path_test}' \n   to local path:'{local_ocr_json_path_test}'")
print(f"Downloading \nfrom remote path:'{blob_img_path_train}' \n   to local path:'{local_ocr_json_path_train}'")

In [None]:
# download OCR
import os

#await blob_client.get_ocr_json(blob_img_path_test, output_folder=local_ocr_json_path_test, use_async=True)
await blob_client.get_ocr_json(blob_img_path_train, output_folder=local_ocr_json_path_train, use_async=True)

In [None]:
# print OCR'ed document
import os
import json

example = sorted(os.listdir(CLEAN_TEXT_DIR))[2]

with open(os.path.join(CLEAN_TEXT_DIR, example)) as f:
    print("****Source text: ****\n")
    print(f.read()[:500])
    
# Swap only the file extension; str.replace would also change "txt" elsewhere in the name
with open(os.path.join(local_ocr_json_path_train, os.path.splitext(example)[0] + ".json")) as f:
    print("\n\n****OCR'ed text: ****\n")
    json_data = json.load(f)
    print(json_data[0]['text'][:500])
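For a quick sanity check before computing the full metrics, the source text and the OCR'ed text can be compared with Python's standard `difflib`. This is a rough sketch of a similarity score only and is not how `genalog.ocr.metrics` computes its metrics. `text_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def text_similarity(source_text, ocr_text):
    """Return a similarity ratio in [0, 1]; 1.0 means the texts are identical."""
    # autojunk=False disables the popular-character heuristic, which can
    # distort the ratio on long documents
    return SequenceMatcher(None, source_text, ocr_text, autojunk=False).ratio()


# Made-up example strings imitating typical OCR confusions
score = text_similarity("The quick brown fox", "The qu1ck brovvn fox")
print(f"similarity: {score:.2f}")
```

A heavily degraded image should yield a noticeably lower ratio than a lightly degraded one, which makes this a cheap way to spot-check a few documents.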

## Generate OCR Metrics

In [None]:
import os 

clean_text_path_template = os.path.join(local_path, "shared/<test/train>/clean_text")
ocr_json_path_template = os.path.join(local_path, dataset_version, "<test/train>/ocr")
output_metric_path = os.path.join(local_path, dataset_version)
csv_metric_name_template = "<test/train>_ocr_metrics.csv"
subs_json_name_template = "<test/train>_substitutions.json"
avg_metric_name = "ocr_metrics.csv"

print(f"Loading \n'{clean_text_path_template}' \nand \n'{ocr_json_path_template}'")

In [None]:
import sys
import json
import pandas as pd
from genalog.ocr.metrics import get_metrics, substitution_dict_to_json

subsets =  ["train"]
avg_stat = {subset: None for subset in subsets}

for subset in subsets:
    clean_text_path = clean_text_path_template.replace("<test/train>", subset)
    ocr_json_path = ocr_json_path_template.replace("<test/train>", subset)
    csv_metric_name = csv_metric_name_template.replace("<test/train>", subset)
    subs_json_name = subs_json_name_template.replace("<test/train>", subset)
    
    output_csv_name = output_metric_path + "/" + csv_metric_name
    output_json_name = output_metric_path + "/" + subs_json_name
    
    print(f"Saving to '{output_csv_name}' \nand '{output_json_name}'")
    
    df, subs, actions = get_metrics(clean_text_path, ocr_json_path, use_multiprocessing=True)
    df.to_csv(output_csv_name)
    # Use a context manager so the file is flushed and closed
    with open(output_json_name, "w") as f:
        json.dump(substitution_dict_to_json(subs), f)
    avg_stat[subset] = df.mean()

# Saving average metrics
avg_stat = pd.DataFrame(avg_stat)
output_avg_csv = os.path.join(output_metric_path, avg_metric_name)
avg_stat.to_csv(output_avg_csv)
print(f"Saving average metrics to {output_avg_csv}")
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
print(pd.concat([avg_stat[16:], avg_stat[:3]]))

## Organize OCR'ed Text into IOB Format For Model Training Purpose

The last step in preparing the dataset is to format all the OCR'ed text and the NER labels into a usable format for training. Our model consumes data in IOB format, which is the same format used in the CoNLL datasets.
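To make the IOB format concrete, the sketch below groups tagged tokens back into entities. It assumes one whitespace-separated `token tag` pair per line, with tags of the form `B-<TYPE>`, `I-<TYPE>`, or `O`. The sample tokens and the `PER`/`LOC` types are made up for illustration and may not match the label set produced by the TATK model. `iob_entities` is a hypothetical helper.

```python
def iob_entities(lines):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_tokens = [], None, []
    for line in lines:
        parts = line.split()
        # Blank or malformed lines are treated as outside any entity
        token, tag = (parts[0], parts[-1]) if len(parts) >= 2 else ("", "O")
        inside = tag.startswith(("B-", "I-"))
        continues = tag.startswith("I-") and tag[2:] == current_type
        # Close the open entity unless this token continues it
        if current_tokens and not continues:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if inside:
            current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


sample = ["Barack B-PER", "Obama I-PER", "visited O", "New B-LOC", "York I-LOC", ". O"]
print(iob_entities(sample))  # [('PER', 'Barack Obama'), ('LOC', 'New York')]
```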

In [None]:
base_path = local_path
degraded_folder = dataset_version
print(f"base_path: {base_path}\ndegraded_folder: {degraded_folder}")

In [None]:
!python -m genalog.text.conll_format $base_path $degraded_folder --train_subset

In [None]:
# Printing the final IOB-formatted OCR tokens
import os 

OCR_LABEL_PATH = os.path.join(base_path, degraded_folder, "train/ocr_labels")
example = sorted(os.listdir(OCR_LABEL_PATH))[2]
with open(os.path.join(OCR_LABEL_PATH, example)) as f:
    print("****Labeled OCR Tokens: ****\n")
    print(f.read()[:200])

In [None]:
import os 
# Reuse the OCR label filename so the same document is shown from the clean labels
example = sorted(os.listdir(OCR_LABEL_PATH))[2]
with open(os.path.join(CLEAN_LABEL_DIR, example)) as f:
    print("****Ground Truth Tokens: ****\n")
    print(f.read()[:200])

## [Optional] Re-upload Local Dataset to Blob 

We can re-upload the local copy of the dataset to Blob Storage to sync up the two copies

In [None]:
import os

local_dataset_to_sync = os.path.join(local_path)
blob_path = os.path.join(remote_path)
print(f"local_dataset_to_sync: {local_dataset_to_sync}\nblob_path: {blob_path}")

In [None]:
import time
# Python uploads can be slow.
# for very large datasets use azcopy: https://github.com/Azure/azure-storage-azcopy
start = time.time()
dest, res = blob_client.upload_images_to_blob(local_dataset_to_sync, blob_path, use_async=True)
await res
print("time (mins): ", (time.time()-start)/60)