# Model Factory Processor Development Example

### This notebook walks through, with code, how to use DataCompute to label documents for building custom document processors.

### Step 1: Install dependencies.

In [2]:
!pip install joblib google-cloud-documentai ratelimiter tabulate immutabledict

Collecting google-cloud-documentai
  Downloading google_cloud_documentai-1.2.1-py2.py3-none-any.whl (138 kB)
Collecting ratelimiter
  Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)
Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting immutabledict
  Downloading immutabledict-2.2.1-py3-none-any.whl (4.0 kB)
Installing collected packages: tabulate, ratelimiter, immutabledict, google-cloud-documentai
Successfully installed google-cloud-documentai-1.2.1 immutabledict-2.2.1 ratelimiter-1.2.0.post0 tabulate-0.8.9


### Step 2: Create a labeler pool.

The pool can be reused in the development of multiple processors. __Please replace values before invoking the code below.__

In [None]:
from model_factory import http_client

LABELER_POOL_DISPLAY_NAME = 'Labeler Pool Name'
LABELER_POOL_MANAGER_EMAILS = "Labeler Pool Manager email"

dai_client = http_client.DocumentAIClient()

lro_name = dai_client.create_labeler_pool(LABELER_POOL_DISPLAY_NAME, LABELER_POOL_MANAGER_EMAILS)
print('Creating labeler pool...\nThis could take a few seconds. Please wait.')

lro = dai_client.wait_for_lro(lro_name)
if 'response' in lro:
    labeler_pool = lro['response']['name']
    print(f'Labeler pool created: {labeler_pool}')
else:
    print(f'Failed to create labeler pool: {lro}')

After the labeler pool is created, the labeler pool managers should receive an email with a link to the manager dashboard, where they can manage labeling tasks and labelers.

## Processor Development

If you lose the connection to the notebook or interrupt the kernel session while working on the following steps, start again from Step 1; you can skip steps that have already completed. All state, including the processor config, imported documents, and labeled annotations, is persisted under the specified workspace in your GCS bucket. Use a different workspace path for each processor.

### Step 1: Create a processor.

The code below creates an Extraction processor. __If you want to create a Classification or a Splitting processor, use its respective code (commented out below) instead.__

In [None]:
from model_factory import http_client, processor

# Replace values below
WORKSPACE = 'gs://<your_bucket_name>/<path_to_the_workspace>'
new_processor = processor.ExtractionProcessor(WORKSPACE)
# new_processor = processor.ClassificationProcessor(WORKSPACE)
# new_processor = processor.SplittingProcessor(WORKSPACE)

### Step 2: Provide schema and labeling instructions.

__Please follow the playbook__ for detailed information about how to prepare the schema and labeling instructions.

In [None]:
from model_factory import http_client
from IPython.display import HTML, display
import tabulate

dai_client = http_client.DocumentAIClient()
response = dai_client.list_labeler_pools()

if 'labelerPools' not in response or not response['labelerPools']:
    print('Labeler pool not found.\nPlease follow the Prerequisites section to create a labeler pool.')
else:
    print('Please select one labeler pool from below before running the next code block.')
    table = [['Display Name', 'Labeler Pool', 'Managers']]

    for pool in response['labelerPools']:
        table.append([pool['displayName'], pool['name'], ', '.join(pool['managerEmails'])])
    display(HTML(tabulate.tabulate(table, tablefmt='html', headers='firstrow')))

In [None]:
# Replace values below

LABELER_POOL = 'projects/*/locations/*/labelerPools/*' # Use a labeler pool from the above table

SCHEMA = {
        'displayName': 'Schema name',
        'description': 'Schema description',
        'entityTypes': [
            {
                'type': 'type1',
                'baseType': 'money',
                'occurrenceType': 'OPTIONAL_ONCE',
            },
            {
                'type': 'type2',
                'baseType': 'datetime',
                'occurrenceType': 'OPTIONAL_ONCE',
            },
        ]
    }

INSTRUCTION_URI = 'gs://<your_bucket_name>/<path_to_the_instruction_pdf>' # PDF instructions to be shared with labeler manager.
new_processor.update_data_labeling_config(SCHEMA, INSTRUCTION_URI, LABELER_POOL)
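
Before sending a schema to the labeling configuration, a quick local check can catch simple mistakes such as a missing field or a repeated entity type. The sketch below is only an illustration: the list of required keys is an assumption inferred from the example schema above, not a documented requirement, and `find_schema_problems` is a hypothetical helper rather than part of `model_factory`.

```python
# Keys every entity type is assumed to need, inferred from the example schema.
REQUIRED_ENTITY_KEYS = ('type', 'baseType', 'occurrenceType')


def find_schema_problems(schema):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in ('displayName', 'entityTypes'):
        if not schema.get(key):
            problems.append(f'missing or empty top-level key: {key}')

    seen_types = set()
    for index, entity in enumerate(schema.get('entityTypes', [])):
        for key in REQUIRED_ENTITY_KEYS:
            if not entity.get(key):
                problems.append(f'entityTypes[{index}] is missing {key}')
        name = entity.get('type')
        if name and name in seen_types:
            problems.append(f'duplicate entity type: {name}')
        elif name:
            seen_types.add(name)
    return problems


example_schema = {
    'displayName': 'Schema name',
    'description': 'Schema description',
    'entityTypes': [
        {'type': 'type1', 'baseType': 'money', 'occurrenceType': 'OPTIONAL_ONCE'},
        {'type': 'type2', 'baseType': 'datetime', 'occurrenceType': 'OPTIONAL_ONCE'},
    ],
}

for problem in find_schema_problems(example_schema):
    print(problem)
```

The check deliberately does not validate the values of `baseType` or `occurrenceType`, because the set of allowed values is not listed in this notebook.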

### Step 3: Import training and test documents.

Please upload training documents and test documents to GCS under two separate folders. Expect at least 5 minutes for importing documents.

In [None]:
# Replace values below

TRAINING_SET_PATH = 'gs://<your_bucket_name>/<path_to_training_set>'
TEST_SET_PATH = 'gs://<your_bucket_name>/<path_to_test_set>'

new_processor.import_documents(TRAINING_SET_PATH, 'training')
new_processor.import_documents(TEST_SET_PATH, 'test')
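
Because an import takes several minutes, it can be worth checking the paths locally first. The following sketch verifies that each value looks like `gs://bucket/path`, that the `<...>` placeholders have been replaced, and that the training and test sets point to different folders. `parse_gcs_uri` and `check_dataset_paths` are hypothetical helpers written for this example; they do not contact GCS or confirm that the folders exist.

```python
import re

# gs://<bucket>/<path>, with a non-empty bucket and path.
_GCS_URI = re.compile(r'^gs://(?P<bucket>[^/]+)/(?P<path>.+)$')


def parse_gcs_uri(uri):
    """Split a gs:// URI into (bucket, path) or raise ValueError."""
    if '<' in uri or '>' in uri:
        raise ValueError(f'placeholder not replaced: {uri}')
    match = _GCS_URI.match(uri)
    if match is None:
        raise ValueError(f'not a gs://bucket/path URI: {uri}')
    return match.group('bucket'), match.group('path').rstrip('/')


def check_dataset_paths(training_uri, test_uri):
    """Parse both URIs and make sure they are not the same folder."""
    training = parse_gcs_uri(training_uri)
    test = parse_gcs_uri(test_uri)
    if training == test:
        raise ValueError('training and test documents must be in separate folders')
    return training, test


print(check_dataset_paths('gs://my-bucket/docs/training/',
                          'gs://my-bucket/docs/test/'))
```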

### Step 4: Label documents.

After running the code block below, go to the labeler manager console and assign the tasks to the corresponding labelers so that they can see them in the UI.

In [None]:
new_processor.label_dataset('training')
new_processor.label_dataset('test')

### Step 5: Train the processor.

This step could take a few hours depending on the size of training and test datasets.

In [None]:
# Replace the below value.

PROCESSOR_VERSION_DISPLAY_NAME = 'Version1' # Please use English letters, digits, underscore, hyphen only.

# If you are training an extraction processor and want to specify which algorithms to use, set the active algorithms first:
# new_processor.active_algorithms = ['eesf', 'clara', 'gbow-flee1', 'harvester']

# To lower the minimum dataset size thresholds from their default values (for example, to 5), set these options:
# new_processor.processing_options['min-ground-truth-documents'] = '5'
# new_processor.processing_options['min-ground-truth-documents-per-entity-type'] = '5'
# new_processor.processing_options['min-ground-truth-entities-per-entity-type'] = '5'

new_processor_version = new_processor.train('training', 'test', display_name=PROCESSOR_VERSION_DISPLAY_NAME)

print(f'Trained processor version: {new_processor_version}')
processor_name = '/'.join(new_processor.processor_name().split('/')[2:])
evaluation_uri = f'https://console.cloud.google.com/ai/document-ai/{processor_name}/evaluations'

from IPython.display import display, HTML
display(HTML(f'<a href="{evaluation_uri}">Evaluation page</a>'))
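
Training can run for hours, so it helps to check the version display name against the stated rule (English letters, digits, underscore, hyphen) before starting. The sketch below does that and also shows how the evaluation link is assembled from a resource name. It assumes `processor_name()` returns a name of the form `projects/<project>/locations/<location>/processors/<id>`, which this notebook does not confirm. `is_valid_display_name` and `evaluation_url` are hypothetical helpers.

```python
import re


def is_valid_display_name(name):
    """True if name uses only English letters, digits, underscore, or hyphen."""
    # An explicit character class is used because \w would also accept
    # non-English letters.
    return re.fullmatch(r'[A-Za-z0-9_-]+', name) is not None


def evaluation_url(resource_name):
    """Build the console link the same way the cell above does."""
    # Drop the leading 'projects/<project>' segments.
    relative_name = '/'.join(resource_name.split('/')[2:])
    return f'https://console.cloud.google.com/ai/document-ai/{relative_name}/evaluations'


print(is_valid_display_name('Version1'))
print(is_valid_display_name('Version 1'))
print(evaluation_url('projects/my-project/locations/us/processors/abc123'))
```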