# Annotating Raw Test Set 

Please read all comments in each code block before executing it.

## Prerequisites
### Step 1: Install dependencies. 

In [None]:
!pip install joblib google-cloud-documentai ratelimiter tabulate immutabledict

### Step 2: Create a new labeler pool.
__Please replace values before invoking the code below.__

In [2]:
from model_factory import http_client

# Replace values below
LABELER_POOL_DISPLAY_NAME = 'HITL Test Label'
LABELER_POOL_MANAGER_EMAILS = ['marinadeletic@google.com']


dai_client = http_client.DocumentAIClient()

lro_name = dai_client.create_labeler_pool(LABELER_POOL_DISPLAY_NAME, LABELER_POOL_MANAGER_EMAILS)

print('Creating labeler pool...\nThis could take a few seconds. Please wait.')

lro = dai_client.wait_for_lro(lro_name)
if 'response' in lro:
    labeler_pool = lro['response']['name']
    print(f'Labeler pool created: {labeler_pool}')
else:
    print(f'Failed to create labeler pool: {lro}')

Creating labeler pool...
This could take a few seconds. Please wait.
Labeler pool created: projects/639644730573/locations/us/labelerPools/8064214206855062731


After the labeler pool is created, the labeler pool managers should receive an email with a link to the manager dashboard, where they can manage labeling tasks and labelers. Please follow [the instructions](https://docs.google.com/document/d/11okb4o5-QRG1Dr-a2xtCa3eb4YAnoqcPzkQu7FkxMO0/edit?resourcekey=0-dIJXlmaKj8-T76zyhZX0kA#bookmark=id.1xgo3q3k1g6w) to add labelers to the pool.
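
The `wait_for_lro` call in Step 2 blocks until the long-running operation (LRO) finishes. How `model_factory` implements it is not shown in this notebook, so the following is only a minimal sketch of the general polling pattern. It assumes the operation is returned as a dict with the standard `done`, `response`, and `error` fields of a Google long-running operation. `poll_until_done`, `describe_result`, and the `fetch` callable are hypothetical names and are not part of `model_factory`.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Calls fetch() repeatedly until the operation reports done=True.

    fetch is a hypothetical zero-argument callable that returns the current
    state of the operation as a dict. Raises TimeoutError if the operation
    is still running once timeout_s seconds have passed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        operation = fetch()
        if operation.get('done'):
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Operation still running after {timeout_s} seconds.')
        sleep(interval_s)


def describe_result(operation: Dict[str, Any]) -> str:
    """Summarizes a finished operation using the same 'response' check as Step 2."""
    if 'response' in operation:
        return f"Succeeded: {operation['response'].get('name', '<unnamed>')}"
    return f"Failed: {operation.get('error', operation)}"
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; in a notebook the default `time.sleep` is usually what you want.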

## Processor Development

If you lose the connection to the notebook or interrupt the kernel session while working on the following steps, please start again from Step 1; you may skip steps that have already completed. All state, including the processor config, imported documents, and labeled annotations, is persisted under the specified workspace in your GCS bucket.

### Step 1: Create a processor.
In your document bucket, create a new folder called 'labeling' and specify its location below.


In [10]:
from model_factory import http_client, processor

# Replace values below
WORKSPACE = 'gs://<BUCKET_NAME>/labeling'

new_processor = processor.ExtractionProcessor(WORKSPACE) 

Creating new processor...
Processor name: projects/639644730573/locations/us/processors/6b342b7070428fec.
Display name: Processor (labeling_2).
Done.
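
Because `WORKSPACE` ships with a `<BUCKET_NAME>` placeholder, it is easy to run the cell without replacing it. The sketch below shows one way to check a `gs://` path before passing it to `processor.ExtractionProcessor`. The helper name `split_gcs_uri` is hypothetical and not part of `model_factory`, and the check only covers the URI shape, not whether the bucket or folder actually exists.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Splits 'gs://bucket/path' into (bucket, path).

    Raises ValueError if the URI does not start with 'gs://', has no bucket,
    or still contains an unreplaced '<...>' placeholder.
    """
    scheme = 'gs://'
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a path starting with 'gs://', got: {uri}")
    bucket, _, path = uri[len(scheme):].partition('/')
    if not bucket:
        raise ValueError(f'No bucket name found in: {uri}')
    if '<' in uri or '>' in uri:
        raise ValueError(f'Placeholder has not been replaced in: {uri}')
    return bucket, path.strip('/')
```

For example, `split_gcs_uri('gs://my-bucket/labeling')` returns `('my-bucket', 'labeling')`, where `my-bucket` is a made-up bucket name.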


### Step 2: Provide schema and labeling instructions.

Please follow the playbook for detailed information on how to prepare the schema and labeling instructions.

In [4]:
from model_factory import http_client
from IPython.display import HTML, display
import tabulate

dai_client = http_client.DocumentAIClient()
response = dai_client.list_labeler_pools()
if 'labelerPools' not in response or not response['labelerPools']:
    print('Labeler pool not found.\nPlease follow the Prerequisites section to create a labeler pool.')
else:
    print('Please select one labeler pool from below before running the next code block.')
    table = [['Display Name', 'Labeler Pool', 'Managers']]
    for pool in response['labelerPools']:
        table.append([pool['displayName'], pool['name'], ', '.join(pool['managerEmails'])])
    display(HTML(tabulate.tabulate(table, tablefmt='html', headers='firstrow')))

Please select one labeler pool from below before running the next code block.


Display Name,Labeler Pool,Managers
Debs Labeler Pool,projects/639644730573/locations/us/labelerPools/10635035648724695338,deboraelkin@google.com
Marinas Labeler Pool 2,projects/639644730573/locations/us/labelerPools/4100455005425702955,marinadeletic@google.com
Debs Labeler Pool 2,projects/639644730573/locations/us/labelerPools/6098442292745778478,deboraelkin@google.com
HITL Test Label,projects/639644730573/locations/us/labelerPools/8064214206855062731,marinadeletic@google.com
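
Instead of copying the resource name from the table by hand, the pool can be looked up by its display name. This is a minimal sketch that assumes the response has the shape used in the cell above (a `labelerPools` list whose items carry `displayName` and `name`). The helper `find_labeler_pool` is a hypothetical name, not a `model_factory` function.

```python
from typing import Any, Dict


def find_labeler_pool(response: Dict[str, Any], display_name: str) -> str:
    """Returns the resource name of the single pool with the given display name.

    Raises ValueError if no pool or more than one pool matches, since an
    ambiguous choice should be resolved by the user.
    """
    matches = [
        pool['name']
        for pool in response.get('labelerPools', [])
        if pool.get('displayName') == display_name
    ]
    if len(matches) != 1:
        raise ValueError(
            f'Expected exactly one pool named {display_name!r}, found {len(matches)}.')
    return matches[0]
```

With the `response` from the cell above, `find_labeler_pool(response, LABELER_POOL_DISPLAY_NAME)` would give a value suitable for `LABELER_POOL` in the next cell.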


In [11]:
# Replace values below
LABELER_POOL = 'projects/639644730573/locations/us/labelerPools/8064214206855062731' # Use a labeler pool from the above table

# Use the schema provided by Google for the document type
SCHEMA = {
    'displayName': 'Schema Labeling',
    'description': 'Schema description',
    'entityTypes': [
        {
            'type': 'name',
            'baseType': 'text',
            'occurrenceType': 'REQUIRED_ONCE',
        },
        {
            'type': 'address',
            'baseType': 'text',
            'occurrenceType': 'REQUIRED_ONCE',
        },
        {
            'type': 'date_of_birth',
            'baseType': 'text',
            'occurrenceType': 'REQUIRED_ONCE',
        },
        {
            'type': 'licence_no',
            'baseType': 'integer',
            'occurrenceType': 'REQUIRED_ONCE',
        },
    ]
}
INSTRUCTION_URI = 'gs://md-instructions/instructions.pdf' # PDF instructions to be shared with labeler manager.

new_processor.update_data_labeling_config(SCHEMA, INSTRUCTION_URI, LABELER_POOL)

Updating data labeling config...
Done.
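
A quick local check of `SCHEMA` before calling `update_data_labeling_config` can catch copy-paste mistakes early. The sketch below looks only for problems that are visible from the example above: entity types missing one of the three keys used there, and duplicate `type` values. Whether the service itself requires these keys or rejects duplicates is an assumption, and `find_schema_problems` is a hypothetical helper, not part of `model_factory`.

```python
from collections import Counter
from typing import Any, Dict, List

# Keys inferred from the example schema in this notebook (an assumption,
# not an official list of required fields).
REQUIRED_ENTITY_KEYS = ('type', 'baseType', 'occurrenceType')


def find_schema_problems(schema: Dict[str, Any]) -> List[str]:
    """Returns human-readable problems found in the schema (empty if none)."""
    problems = []
    entity_types = schema.get('entityTypes', [])
    if not entity_types:
        problems.append("Schema has no 'entityTypes'.")
    # Report every entity type that lacks one of the expected keys.
    for index, entity in enumerate(entity_types):
        for key in REQUIRED_ENTITY_KEYS:
            if not entity.get(key):
                problems.append(f"entityTypes[{index}] is missing '{key}'.")
    # Report entity type names that appear more than once.
    counts = Counter(e.get('type') for e in entity_types if e.get('type'))
    for name, count in counts.items():
        if count > 1:
            problems.append(f"Entity type '{name}' is defined {count} times.")
    return problems
```

In the notebook this could be used as `for problem in find_schema_problems(SCHEMA): print(problem)` just before updating the config.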


### Step 3: Import test documents.

Specify the path of the raw test set that was created when de-identifying the documents. Expect the import to take at least 5 minutes.

In [12]:
# Replace value below
TEST_SET_PATH = 'gs://deb-ai-test-vic_licence/input-raw/test' 

new_processor.import_documents(TEST_SET_PATH, 'test')

Read LRO states: 0it [00:00, ?it/s]
Create LROs:   0%|          | 0/1 [00:00<?, ?it/s]

Found 1 new documents to import.


Create LROs: 100%|██████████| 1/1 [00:02<00:00,  2.16s/it]
Wait for LROs: 100%|██████████| 1/1 [01:18<00:00, 78.79s/it]
Process LRO outputs: 100%|██████████| 1/1 [00:00<00:00, 4369.07it/s]


### Step 4: Label documents.

After you run the code block below, please go to the labeler manager console and assign the task to the corresponding labelers so that they can see it in the UI. See [this document](https://docs.google.com/document/d/11okb4o5-QRG1Dr-a2xtCa3eb4YAnoqcPzkQu7FkxMO0/edit?resourcekey=0-dIJXlmaKj8-T76zyhZX0kA#bookmark=id.uay738ifr6s) for detailed instructions on assigning tasks.

In [None]:
new_processor.label_dataset('test')

Read LRO names: 0it [00:00, ?it/s]
Create LROs: 100%|██████████| 1/1 [00:00<00:00, 6213.78it/s]


Labeling task has been created.
Please make sure the task has been assigned to raters / labelers.


Wait for LROs:   0%|          | 0/1 [00:00<?, ?it/s]