# Custom entity recognition with Comprehend

*Step 1: preparing the training dataset for custom entity recognition*


This series of notebooks walks through how to use Amazon Comprehend to recognize custom entities in documents. More details about the training process can be found here: https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html

## Initialization
---
### Update SageMaker execution role credentials
Before you start, make sure that your SageMaker execution role has the following permissions:

* `ComprehendFullAccess`
* `ComprehendMedicalFullAccess`
* `TranslateFullAccess`
* `AmazonSageMakerFullAccess`
* Your SageMaker execution role should already have access to S3. If not, you can add the `AmazonS3FullAccess` policy.
* You will also need to add `iam:PassRole` as an inline policy:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "iam:PassRole"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
```

![Policies](../assets/iam-policies.png)

* The role also needs the following trust relationship:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sagemaker.amazonaws.com",
          "s3.amazonaws.com",
          "comprehend.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

![Trust Relationships](../assets/iam-trust-relationships.png)

### Library imports

In [None]:
import numpy as np
import os
import pandas as pd

from tqdm import tqdm

RAW_DATA = '../data/raw'
DATA = '../data/interim'
PROCESSED_DATA = '../data/processed'

os.makedirs(DATA, exist_ok=True)
os.makedirs(RAW_DATA, exist_ok=True)
os.makedirs(PROCESSED_DATA, exist_ok=True)

In [None]:
import boto3
import botocore
import sagemaker
import sys

# Specify S3 bucket and prefix that you want to use for the model data
# Feel free to specify a different bucket here if you wish.
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'comprehend_workshop'
execution_role = sagemaker.get_execution_role()

# Check if the bucket exists
try:
    boto3.Session().client('s3').head_bucket(Bucket=bucket)
except botocore.exceptions.ParamValidationError as e:
    print('Hey! You either forgot to specify your S3 bucket'
          ' or you gave your bucket an invalid name!')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '403':
        print("Hey! You don't have permission to access the bucket, {}.".format(bucket))
    elif e.response['Error']['Code'] == '404':
        print("Hey! Your bucket, {}, doesn't exist!".format(bucket))
    else:
        raise
else:
    print('Training input/output will be stored in: s3://{}/{}'.format(bucket, prefix))

### Downloading datasets
For this training, we will use the Quaero French Medical corpus, which can be downloaded from here: https://quaerofrenchmed.limsi.fr/QUAERO_FrenchMed_brat.zip. For this dataset, a selection of MEDLINE titles and EMEA documents was manually annotated. The annotation process was guided by concepts in the Unified Medical Language System (UMLS). Ten types of clinical entities, as defined by the following UMLS Semantic Groups (Bodenreider and McCray 2003), were annotated: **Anatomy**, **Chemical and Drugs**, **Devices**, **Disorders**, **Geographic Areas**, **Living Beings**, **Objects**, **Phenomena**, **Physiology** and **Procedures**. Amazon Comprehend Medical uses similar semantic categories, but it was trained on English-language content.

In [None]:
%%time

if os.path.exists(os.path.join(RAW_DATA, 'corpus')):
    print('Dataset was apparently already downloaded, nothing to do here.')
    
else:
    print('Dataset not found, downloading and unzipping content.')
    !wget https://quaerofrenchmed.limsi.fr/QUAERO_FrenchMed_brat.zip -O $RAW_DATA/QUAERO_FrenchMed_brat.zip
    !unzip -qq $RAW_DATA/QUAERO_FrenchMed_brat.zip -d $RAW_DATA
    !mv $RAW_DATA/QUAERO_FrenchMed/corpus $RAW_DATA
    !rm -Rf $RAW_DATA/QUAERO_FrenchMed
    !rm $RAW_DATA/QUAERO_FrenchMed_brat.zip

## Helper functions
---

In [None]:
def is_int(s):
    """
    Check whether the string argument can be parsed as an int.
    
    :param s: string whose content we want to test
    :return: True if int(s) succeeds, False otherwise
    """
    try:
        int(s)
        return True
    except ValueError:
        return False

The dictionary below maps each Quaero entity code to the entity type we will capture. The entity types **must** be in uppercase for custom entity training on Amazon Comprehend:

In [None]:
types = {
    'ANAT': 'ANATOMY',
    'CHEM': 'CHEMICALS',
    'DEVI': 'DEVICES',
    'DISO': 'DISORDERS',
    'GEOG': 'GEOGRAPHIC_AREA',
    'LIVB': 'LIVING_BEING',
    'OBJC': 'OBJECT',
    'PHEN': 'PHENOMENON',
    'PHYS': 'PHYSIOLOGY',
    'PROC': 'PROCEDURE'
}
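
Since the labels must be uppercase, a quick check before training can catch a typo early. The snippet below is a minimal sketch: the helper name `find_invalid_entity_types` and the sample mapping are hypothetical and only illustrate the uppercase rule stated above. It does not cover any other constraint Amazon Comprehend may apply to entity type names.

```python
def find_invalid_entity_types(mapping):
    """Return the target labels that are not fully uppercase."""
    return [label for label in mapping.values() if not label.isupper()]


# Hypothetical mapping with one deliberately wrong label ('Chemicals'):
example = {
    'ANAT': 'ANATOMY',
    'CHEM': 'Chemicals',
    'GEOG': 'GEOGRAPHIC_AREA',
}

invalid = find_invalid_entity_types(example)
print(invalid)  # ['Chemicals']
```

Running the same helper on the `types` dictionary defined above should return an empty list.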

## Data preparation
---
The Quaero corpus uses the BRAT format.

Document example in `filename.txt`:

```
La contraception par les dispositifs intra utérins
```

Associated annotations in `filename.ann`:
```
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149
```

Note that:
* Several annotations can be associated with each line.
* The begin and end offsets are **relative to the whole file**, not to the current line.
* Document samples can contain empty lines. Their `\n` characters must be counted when computing the offsets of the tagged entities.
* Some files contain several rows: we will treat each row as an individual document.
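
The offset conversion implied by these notes can be illustrated on a tiny in-memory file. This is only a sketch: the helper names `line_starts` and `to_line_relative` and the first line of the sample text are made up for the example, and it uses `str.splitlines` where the notebook uses `readlines`.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line of `text` begins."""
    starts, position = [], 0
    for line in text.splitlines(keepends=True):
        starts.append(position)
        position += len(line)  # the trailing '\n' is counted too
    return starts


def to_line_relative(starts, begin, end):
    """Map file-relative offsets to (row index, begin, end) relative to that row."""
    row = bisect_right(starts, begin) - 1
    return row, begin - starts[row], end - starts[row]


# Hypothetical file: one line, one empty line, then the sample sentence.
text = "Première ligne\n\nLa contraception par les dispositifs intra utérins\n"
starts = line_starts(text)
print(starts)  # [0, 15, 16]

# 'contraception' sits at offsets 19-32 in the whole file:
print(text[19:32])                       # contraception
print(to_line_relative(starts, 19, 32))  # (2, 3, 16)
```

The result `(2, 3, 16)` matches the `T1 PROC 3 16` annotation shown earlier, where the sentence is the only line of its file.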

In [None]:
%%time

# We will process the whole corpus data, as the Comprehend training already includes a train/test split procedure internally.
root_dir = os.path.join(RAW_DATA, 'corpus')
training_doc = pd.DataFrame(columns=['File', 'Line', 'Begin Offset', 'End Offset', 'Type'])
s3 = boto3.resource('s3')

# We walk throughout the directory and process all the .txt (document) and .ann (annotations) files
index = 0
for root, dirs, files in os.walk(top=root_dir):
    # Loops through the files of the current directory:
    for name in tqdm(files, desc=root):
        # We only process the document samples:
        ext = name[-4:]
        if ext == '.txt':
            # Read each line of the current documents:
            document_fname = os.path.join(root, name)
            with open(document_fname, 'r') as document:
                content = document.readlines()
                
            # The BRAT format records positions relative to the whole
            # file, whereas we need positions relative to each line.
            # Let's loop through each row of the document:
            last_position = 0
            df = pd.DataFrame(columns=['document', 'start_pos'])
            for index, row in enumerate(content):
                # Compute the starting position of the current row relative to the document:
                start = last_position
                length = len(row)
                last_position += length

                # This dataframe will record each row (individual documents) and 
                # its starting position relative to the file currently opened:
                # (DataFrame.append was removed in pandas 2.0, so we add the row with .loc;
                # the values follow the column order: document, start_pos)
                df.loc[index] = [row, start]

                # Ignore empty lines to filter the right documents to be sent to S3:
                if row != '\n':
                    # Each row contained in the file is sent as an individual document to S3.
                    new_document_fname = os.path.join(PROCESSED_DATA, name[:-4] + '_row{}.txt'.format(index))
                    with open(new_document_fname, 'w') as new_document:
                        new_document.write(row)    
                    s3.meta.client.upload_file(new_document_fname, bucket, prefix + '/documents/' + name[:-4] + '_row{}.txt'.format(index))
                    
            # We now have a dataframe recording every row of the
            # current file with its starting position. Let's process
            # the associated annotations file:
            annotation_fname = os.path.join(root, name[:-4] + '.ann')
            with open(annotation_fname, 'r') as f:
                annotations = f.readlines()
                
            # We only loop through annotation lines that do not start with #:
            annotations = [a for a in annotations if a[0] != '#']
            for a in annotations:
                # Split each annotation line following this scheme: 
                # TYPE BEGIN_OFFSET END_OFFSET EXPRESSION
                a = a.split('\t')[1].split(' ')

                # Some annotations have more complex scheme, 
                # but they are rare. We will discard them at 
                # this stage and only keep the ones with begin
                # and end offset stored as mere integers:
                if is_int(a[1]) and is_int(a[2]):
                    # Extract information from the current annotation line:
                    start = int(a[1])
                    end = int(a[2])
                    
                    # Where was this annotation positioned in the file?
                    doc_info = df[df['start_pos'] <= start].iloc[-1]
                    
                    # This is the row number:
                    doc_id = doc_info.name
                    
                    # Compute start and end of current entity relative to its line, not to the whole file:
                    entity_start = start - doc_info['start_pos']
                    entity_end = end - doc_info['start_pos']

                    # Build the current training sample
                    # (DataFrame.append was removed in pandas 2.0, so we add the row with .loc)
                    training_doc.loc[len(training_doc)] = [
                        name[:-4] + '_row{}.txt'.format(doc_id),  # File
                        0,                                        # Line
                        entity_start,                             # Begin Offset
                        entity_end,                               # End Offset
                        types[a[0]]                               # Type
                    ]

print(training_doc.shape)
training_doc.head(10)

The Quaero corpus **contains overlapping annotations**, which Amazon Comprehend custom entity training **does not accept**. Let's filter them out:

In [None]:
# Use 64-bit integers: int16 would overflow for offsets above 32,767.
training_doc['Begin Offset'] = training_doc['Begin Offset'].astype(np.int64)
training_doc['End Offset'] = training_doc['End Offset'].astype(np.int64)
nb_overlaps = 1
pass_index = 0

# Repeat this process until no more overlaps are found:
while nb_overlaps > 0:
    pass_index += 1
    training_doc['overlap'] = (training_doc.groupby(by='File').apply(lambda x: (x['End Offset'].shift() - x['Begin Offset']) > 0).reset_index(level=0, drop=True))
    nb_overlaps = training_doc[training_doc['overlap'] == True].shape[0]
    print('Pass {}: {} overlaps removed'.format(pass_index, nb_overlaps))
    training_doc = training_doc[training_doc['overlap'] == False]
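
To make the overlap rule concrete, here is a minimal pure-Python sketch applied to the three annotations of the sample sentence. It is a single-pass greedy variant, not the iterative pandas procedure used above, and `drop_overlaps` is a hypothetical helper name. It assumes spans are `(begin, end)` tuples for one document and treats a span as overlapping when it begins before the previously kept span ends.

```python
def drop_overlaps(spans):
    """Keep only the spans that do not overlap an earlier kept span.

    `spans` is a list of (begin, end) tuples for a single document.
    """
    kept = []
    for begin, end in sorted(spans):
        # Keep the span if it starts at or after the end of the last kept one.
        if not kept or begin >= kept[-1][1]:
            kept.append((begin, end))
    return kept


# Offsets of T1 (PROC), T2 (DEVI) and T3 (ANAT) from the sample annotations:
spans = [(3, 16), (25, 50), (43, 50)]
result = drop_overlaps(spans)
print(result)  # [(3, 16), (25, 50)]
```

The `ANAT` annotation (43 to 50) is dropped because it lies inside the `DEVI` annotation (25 to 50).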

Amazon Comprehend needs at least 200 samples for each entity type, so let's filter out the annotations of entity types that have too few samples before pushing our final dataset to S3. The dataset must have the following format (make sure to save the CSV file **without the dataframe index** and double-check the column names):

<img src="../assets/comprehend_training_format.png" width="370" />
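
As a text illustration of this layout, the sketch below writes two sample rows with the standard `csv` module. The column names are the ones used in this notebook, while the file name and rows are hypothetical values derived from the sample sentence.

```python
import csv
import io

# Hypothetical rows: (File, Line, Begin Offset, End Offset, Type)
rows = [
    ('filename_row0.txt', 0, 3, 16, 'PROCEDURE'),
    ('filename_row0.txt', 0, 25, 50, 'DEVICES'),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['File', 'Line', 'Begin Offset', 'End Offset', 'Type'])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
# File,Line,Begin Offset,End Offset,Type
# filename_row0.txt,0,3,16,PROCEDURE
# filename_row0.txt,0,25,50,DEVICES
```

Note that there is no leading index column, which is what `index=False` ensures when saving with pandas.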

In [None]:
# Only keep annotations for entity types with at least 200 samples:
counts = training_doc['Type'].value_counts()
types_list = counts[counts >= 200].index.tolist()
training_doc = training_doc[training_doc['Type'].isin(types_list)]
print(f'{training_doc.shape[0]} annotations left in the training dataset')

# Push the final training dataset to S3.
training_doc = training_doc[['File', 'Line', 'Begin Offset', 'End Offset', 'Type']]
training_doc.to_csv(os.path.join(PROCESSED_DATA, 'annotations.csv'), index=False)
s3.meta.client.upload_file(os.path.join(PROCESSED_DATA, 'annotations.csv'), bucket, prefix + '/annotations/' + 'annotations.csv')

# Print the number of annotations per retained entity type:
counts[counts >= 200]