# Custom entity recognition with Comprehend
---
*Step 2: launch a custom entity recognition training job*

This series of notebooks walks through how to use Amazon Comprehend to recognize custom entities in documents. More details about the training process can be found here: https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html

## Initialization
---

In [None]:
%%sh
pip -q install --upgrade pip
pip -q install sagemaker awscli boto3 --upgrade

In [None]:
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import os
import pandas as pd
import numpy as np
import boto3
import botocore
import sagemaker
import sys
import datetime
import time

RAW_DATA = '../data/raw'
DATA = '../data/interim'
PROCESSED_DATA = '../data/processed'

# Specify S3 bucket and prefix that you want to use for the model data
# Feel free to specify a different bucket here if you wish.
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'comprehend_workshop'
execution_role = sagemaker.get_execution_role()
comprehend_client = boto3.client('comprehend')

In [None]:
# Check if the bucket exists
try:
    boto3.Session().client('s3').head_bucket(Bucket=bucket)
except botocore.exceptions.ParamValidationError as e:
    print('Hey! You either forgot to specify your S3 bucket'
          ' or you gave your bucket an invalid name!')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '403':
        print("Hey! You don't have permission to access the bucket, {}.".format(bucket))
    elif e.response['Error']['Code'] == '404':
        print("Hey! Your bucket, {}, doesn't exist!".format(bucket))
    else:
        raise
else:
    print('Training input/output will be stored in: s3://{}/{}'.format(bucket, prefix))

## Training
---

Let's prepare the input data configuration for the custom entity training job:

In [None]:
s3_train_documents = 's3://{}/{}/documents/'.format(bucket, prefix)
s3_train_annotations = 's3://{}/{}/annotations/'.format(bucket, prefix)

custom_entity_request = {
    "Documents": { "S3Uri": s3_train_documents },
    "Annotations": { "S3Uri": s3_train_annotations },
    "EntityTypes": [
        { "Type": "ANATOMY" },
        { "Type": "CHEMICALS" },
        { "Type": "DISORDERS" },
        { "Type": "LIVING_BEING" },
        { "Type": "PROCEDURE" },
        { "Type": "PHYSIOLOGY" },
        { "Type": "DEVICES" },
        { "Type": "OBJECT" }
    ]
}
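
The `Annotations` location is expected to contain a CSV file that records where each entity occurs in the training documents. As a minimal sketch, assuming the annotation layout described in the Comprehend documentation (columns `File`, `Line`, `Begin Offset`, `End Offset`, `Type`, with 0-based lines and an exclusive end offset), the following builds one annotation row. The `annotate` helper, the file name `train.txt` and the sample sentence are made up for illustration:

```python
import csv
import io

# Column layout assumed from the Comprehend annotations documentation.
ANNOTATION_HEADER = ["File", "Line", "Begin Offset", "End Offset", "Type"]

def annotate(file_name, line_number, line_text, mention, entity_type):
    """Build one annotation row for the first occurrence of `mention` in `line_text`."""
    begin = line_text.find(mention)
    if begin < 0:
        raise ValueError("mention not found in line")
    end = begin + len(mention)  # assumed exclusive end offset
    return [file_name, line_number, begin, end, entity_type]

# Hypothetical training line and mention, for illustration only.
sample_line = "Le patient souffre d'asthme."
row = annotate("train.txt", 0, sample_line, "asthme", "DISORDERS")

# Write the header and the row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(ANNOTATION_HEADER)
writer.writerow(row)
annotations_csv = buffer.getvalue()
print(annotations_csv)
```

Every `Type` value in the annotations should match one of the entity types declared in `custom_entity_request`.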

Launch the training job. The language code is mandatory and, at the time of writing, only accepts `en`. For custom entity training, however, the model learns from whatever language the dataset is written in:

In [None]:
unique_id = str(int(time.time()))  # epoch seconds; strftime('%s') is not portable
create_custom_entity_response = comprehend_client.create_entity_recognizer(
    RecognizerName = 'french-healthcare-entities-' + unique_id,
    DataAccessRoleArn = execution_role,
    InputDataConfig = custom_entity_request,
    LanguageCode = 'en'
)

Let's monitor this job while it's training:

In [None]:
job_arn = create_custom_entity_response['EntityRecognizerArn']

max_time = time.time() + 3*60*60
while time.time() < max_time:
    custom_recognizer_description = comprehend_client.describe_entity_recognizer(
        EntityRecognizerArn=job_arn
    )
    status = custom_recognizer_description['EntityRecognizerProperties']['Status']
    print('Custom entity recognizer: {}'.format(status))
    
    # Stop polling once the recognizer reaches a state it will not leave on its own
    if status in ('TRAINED', 'IN_ERROR', 'STOPPED'):
        break
        
    time.sleep(60)

In [None]:
custom_recognizer_description
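
Once training has finished, the description is expected to carry evaluation metrics. The sketch below assumes the response shape documented for `describe_entity_recognizer` (`RecognizerMetadata` holding overall `EvaluationMetrics` and a per-type `EntityTypes` list). `summarize_metrics` is a hypothetical helper, and the sample values are placeholders, not real results:

```python
def summarize_metrics(description):
    """Return (overall, per_type) evaluation metrics from a recognizer description."""
    properties = description["EntityRecognizerProperties"]
    metadata = properties.get("RecognizerMetadata", {})
    overall = metadata.get("EvaluationMetrics", {})
    per_type = {
        entity["Type"]: entity.get("EvaluationMetrics", {})
        for entity in metadata.get("EntityTypes", [])
    }
    return overall, per_type

# Made-up description with the assumed shape; the numbers are placeholders.
sample_description = {
    "EntityRecognizerProperties": {
        "Status": "TRAINED",
        "RecognizerMetadata": {
            "EvaluationMetrics": {"Precision": 80.0, "Recall": 80.0, "F1Score": 80.0},
            "EntityTypes": [
                {"Type": "ANATOMY",
                 "EvaluationMetrics": {"Precision": 75.0, "Recall": 70.0, "F1Score": 72.4}},
                {"Type": "DISORDERS",
                 "EvaluationMetrics": {"Precision": 85.0, "Recall": 90.0, "F1Score": 87.4}},
            ],
        },
    }
}

overall, per_type = summarize_metrics(sample_description)
print("Overall:", overall)
for entity_type, metrics in sorted(per_type.items()):
    print(entity_type, metrics)
```

In the notebook, `summarize_metrics(custom_recognizer_description)` would be the corresponding call. The `.get` lookups mean a recognizer that has not finished training simply yields empty results.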