## Ground Truth Text Labeling

In [19]:
import json
import os
import time
import boto3
import random
import sagemaker
import pandas as pd

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
s3 = boto3.client('s3')
sm_client = boto3.client('sagemaker')
bucket = sagemaker_session.default_bucket()
region = boto3.session.Session().region_name


# https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html
# https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis

### Set up a private work team

If you want to preview the worker task UI, create a private work team and add yourself as a worker. 

If you have already created a private workforce, follow the instructions in [Add or Remove Workers](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-private-console.html#add-remove-workers-sm) to add yourself to the work team you use to create a labeling job.

#### Create a private workforce and add yourself as a worker

To create and manage your private workforce, you can use the **Labeling workforces** page in the Amazon SageMaker console. When following the instructions below, you will have the option to create a private workforce by entering worker emails or importing a pre-existing workforce from an Amazon Cognito user pool. To import a workforce, see [Create a Private Workforce (Amazon Cognito Console)](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-cognito.html).

To create a private workforce using worker emails:

* Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

* In the navigation pane, choose **Labeling workforces**.

* Choose **Private**, then choose **Create private team**.

* Choose **Invite new workers by email**.

* Paste or type a list of up to 50 email addresses, separated by commas, into the email addresses box.

* Enter an organization name and contact email.

* Optionally choose an SNS topic to subscribe the team to so workers are notified by email when new Ground Truth labeling jobs become available. 

* Click the **Create private team** button.

After you create your private workforce, refresh the page. On the Private workforce summary page, you'll see your work team ARN. Enter this ARN in the following cell.

In [None]:
WORKTEAM_ARN = '<PUT YOUR WORKTEAM ARN HERE>'
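
A quick format check can catch a leftover placeholder or a mistyped ARN before the labeling job is submitted. The sketch below assumes a private work team ARN has the shape `arn:aws:sagemaker:<region>:<account-id>:workteam/private-crowd/<team-name>`, which is the shape of the ARN printed in the job request output later in this notebook. `looks_like_private_workteam_arn` is a hypothetical helper, not part of boto3 or the SageMaker SDK, and the example ARN is made up.

```python
import re

# Assumed shape (based on the example ARN printed later in this notebook):
# arn:aws:sagemaker:<region>:<12-digit account id>:workteam/private-crowd/<team name>
_WORKTEAM_ARN_RE = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:workteam/private-crowd/[A-Za-z0-9-]+$"
)


def looks_like_private_workteam_arn(arn: str) -> bool:
    """Return True if `arn` matches the assumed private work team ARN shape."""
    return bool(_WORKTEAM_ARN_RE.match(arn))


# Made-up ARN, for illustration only
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team"
print(looks_like_private_workteam_arn(example_arn))
print(looks_like_private_workteam_arn("<PUT YOUR WORKTEAM ARN HERE>"))
```

In the notebook, calling the helper on `WORKTEAM_ARN` right after the cell above would flag an unedited placeholder early.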


## Let's download a text dataset

In [5]:
!bash squad_download.sh


Downloading dataset for squad...
--2020-08-31 20:54:06--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘v1.1/train-v1.1.json’


2020-08-31 20:54:06 (224 MB/s) - ‘v1.1/train-v1.1.json’ saved [30288272/30288272]

--2020-08-31 20:54:06--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘v1.1/dev-v1.1.json’


2020-08-31 20:54:06 (118 MB/s) - ‘v1.1/dev-v1.1.json

In [13]:
with open('/home/ec2-user/SageMaker/v2.0/train-v2.0.json', 'r') as f:
    squad_data = json.load(f)

## Let's look at some of the entries

The dataset we downloaded is the Stanford Question Answering Dataset (SQuAD), a common natural language understanding dataset. We are going to use it for a couple of different text tasks.

To learn more about SQuAD, check out this page:

https://rajpurkar.github.io/SQuAD-explorer/
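
The SQuAD JSON is nested: `data` holds articles, each article holds `paragraphs`, and each paragraph holds a `context` plus a list of `qas`. As a rough sketch of that nesting, the hypothetical helper below flattens it into tuples. The `toy` record is made up and only mirrors the keys that the next cell reads.

```python
def iter_squad_examples(squad):
    """Yield (title, context, question, first_answer_text) for every question.

    Assumes the nesting data -> paragraphs -> qas -> answers.
    Questions with no answers yield None for the answer text.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                answer_text = answers[0]["text"] if answers else None
                yield article["title"], paragraph["context"], qa["question"], answer_text


# Tiny made-up record with the same nesting, for illustration only
toy = {"data": [{"title": "Example", "paragraphs": [
    {"context": "Paris is the capital of France.",
     "qas": [{"question": "What is the capital of France?",
              "answers": [{"text": "Paris", "answer_start": 0}]}]}
]}]}

for title, context, question, answer in iter_squad_examples(toy):
    print(title, "|", question, "->", answer)
```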

In [8]:
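# pick a random article; use the dataset length rather than a hard-coded upper bound
ind = random.randrange(len(squad_data['data']))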
sq = squad_data['data'][ind]
print('Paragraph title: ',sq['title'], '\n')
print(sq['paragraphs'][0]['context'],'\n')
print('Question:', sq['paragraphs'][0]['qas'][0]['question'])
print('Answer:', sq['paragraphs'][0]['qas'][0]['answers'][0]['text'])


Paragraph title:  Spectre_(2015_film) 

Spectre (2015) is the twenty-fourth James Bond film produced by Eon Productions. It features Daniel Craig in his fourth performance as James Bond, and Christoph Waltz as Ernst Stavro Blofeld, with the film marking the character's re-introduction into the series. It was directed by Sam Mendes as his second James Bond film following Skyfall, and was written by John Logan, Neal Purvis, Robert Wade and Jez Butterworth. It is distributed by Metro-Goldwyn-Mayer and Columbia Pictures. With a budget around $245 million, it is the most expensive Bond film and one of the most expensive films ever made. 

Question: Which company made Spectre?
Answer: Eon Productions


## Transform data 

To work with Ground Truth, we need to put our data in the proper format, and there are two ways to do that. We can give Ground Truth a CSV or TXT file and let it convert the file into a manifest through the console, or we can build the manifest ourselves and launch the job from the notebook. The manifest is a JSON Lines file in which each line is a JSON object holding the text we want to classify.
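
As a minimal sketch of the manifest shape, the helpers below write one JSON object per line under the `source` key and read the lines back. `to_manifest_lines` and `read_manifest_lines` are hypothetical names, and the sample texts are made up. One detail worth noticing is that `json.dumps` escapes embedded newlines, so a passage that contains a line break still occupies a single manifest line.

```python
import json


def to_manifest_lines(texts):
    """Serialize each text as one JSON object per line under the 'source' key."""
    return [json.dumps({"source": text}) for text in texts]


def read_manifest_lines(lines):
    """Parse manifest lines back into the list of source texts, skipping blank lines."""
    return [json.loads(line)["source"] for line in lines if line.strip()]


sample_texts = ["First passage.", "Second passage\nwith an embedded newline."]
lines = to_manifest_lines(sample_texts)

# json.dumps escapes the newline, so every record stays on a single line
assert all("\n" not in line for line in lines)
print(lines[1])
print(read_manifest_lines(lines) == sample_texts)
```

The plain TXT route has no such escaping, so a passage with an embedded newline would be split across two lines there.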

In [14]:
# put data in a list 
text_list = []
for data in squad_data['data']:
    text_list.append(data['paragraphs'][0]['context'])
    
os.makedirs('text_for_labeling', exist_ok=True)

# create txt file
with open('text_for_labeling/squad.txt','w') as f:
    for text in text_list:
        f.write(text)
        f.write('\n')

# create CSV file
text_frame = pd.DataFrame()
text_frame['txt'] = text_list
text_frame.to_csv('text_for_labeling/squad.csv',index=False)

# create manifests 
with open('text_for_labeling/squad.manifest','w') as f:
    for text in text_list:
        f.write(json.dumps({'source':text}))
        f.write('\n')

In [15]:
# send to s3
s3.upload_file(Filename='text_for_labeling/squad.csv', Bucket=bucket, Key='text_files/csv/squad.csv')
s3.upload_file(Filename='text_for_labeling/squad.txt', Bucket=bucket, Key='text_files/txt/squad.txt')
s3.upload_file(Filename='text_for_labeling/squad.manifest', Bucket=bucket, Key='text_files/text_manifests/squad.manifest')
INPUT_MANIFEST_S3_URI = f's3://{bucket}/text_files/text_manifests/squad.manifest'

## Label categories

We can create label categories through the Ground Truth console, or we can create them manually when using boto3. Here we create them manually, with each label given as an entry structured like this:

{"label": "YOUR LABEL"}
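
When the label set changes often, the list of `{"label": ...}` entries can be generated from plain label names. The sketch below uses a hypothetical helper, `build_label_config`, that covers only the `document-version`, `labels`, and `instructions` keys. The version string is copied from this notebook, and the duplicate-name check is a sanity check added for this sketch.

```python
def build_label_config(labels, short_instruction, full_instruction):
    """Build a label category dict with one {"label": name} entry per label name."""
    if len(set(labels)) != len(labels):
        raise ValueError("label names must be unique")
    return {
        "document-version": "2020-08-15",  # version string copied from this notebook
        "labels": [{"label": name} for name in labels],
        "instructions": {
            "shortInstruction": short_instruction,
            "fullInstruction": full_instruction,
        },
    }


config = build_label_config(
    ["Entity", "Location", "Animal"],
    "Classify the text using one of the following labels",
    "Some useful instruction",
)
print([entry["label"] for entry in config["labels"]])
```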

In [21]:
# create label categories

labelcats = {
    "document-version": "2020-08-15",
    "auditLabelAttributeName": "Text",
    "labels": [
        {
            "label": "Entity",
        },
        {
            "label": "Location",
        },
        {
            "label": "Animal",
        },
    ],
    "instructions": {
        "shortInstruction": "Classify the text using one of the following labels",
        "fullInstruction": "Some useful instruction"
    }
}

filename = '/home/ec2-user/SageMaker/text_for_labeling/text_categories.json'
with open(filename,'w') as f:
    json.dump(labelcats,f)

s3.upload_file(Filename=filename, Bucket=bucket, Key='text_files/text_manifests/text_categories.json')

LABEL_CATEGORIES_S3_URI = f's3://{bucket}/text_files/text_manifests/text_categories.json'

## Get our labeling template

Different templates are available for different labeling tasks. To see some examples, check out:

https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis

To use one of these UIs, clone the repo and then upload the template to S3 so Ground Truth can retrieve it.
We have provided two common templates for you to use: text classification and named entity recognition.

In [17]:
# how to use one of the ground truth UIs

# !git clone https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis.git
    
# !aws s3 cp amazon-sagemaker-ground-truth-task-uis/text/text-classification-multiselect.liquid.html s3://{bucket}/text_files/text_manifests/text-classification-multiselect.liquid.html

# try switching to named-entity-recognition.liquid for NER
template = 'text-classification-multiselect.liquid'

!aws s3 cp {template} s3://{bucket}/text_files/text_manifests/{template}

fatal: destination path 'amazon-sagemaker-ground-truth-task-uis' already exists and is not an empty directory.
upload: amazon-sagemaker-ground-truth-task-uis/text/text-classification-multiselect.liquid.html to s3://sagemaker-us-east-1-209419068016/text_files/text_manifests/text-classification-multiselect.liquid.html


## Launch labeling job from notebook

In [22]:

LABELING_JOB_NAME = 'text-multi-class-demo-large'
# reuse the template name from the upload cell so the URI matches the object copied to S3
UI_TEMPLATE_S3_URI = f's3://{bucket}/text_files/text_manifests/{template}'

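# NOTE: 432418664414 is the AWS-owned account that hosts these Lambda functions in us-east-1.
# Other regions use a different account ID; see the HumanTaskConfig API reference linked above.
createLabelingJob_request = {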
  "LabelingJobName": LABELING_JOB_NAME,
  "HumanTaskConfig": {
    "AnnotationConsolidationConfig": {
      "AnnotationConsolidationLambdaArn": f"arn:aws:lambda:{region}:432418664414:function:ACS-TextMultiClassMultiLabel"
    },
    "MaxConcurrentTaskCount": 200,
    "NumberOfHumanWorkersPerDataObject": 1,
    "PreHumanTaskLambdaArn": f"arn:aws:lambda:{region}:432418664414:function:PRE-TextMultiClassMultiLabel",
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "TaskDescription": "Classify text",
    "TaskKeywords": [
      "Text Classification",
      "Labeling"
    ],
    "TaskTimeLimitInSeconds": 800,
    "TaskTitle": LABELING_JOB_NAME,
    "UiConfig": {
      "UiTemplateS3Uri": UI_TEMPLATE_S3_URI
    },
    "WorkteamArn": WORKTEAM_ARN
  },
  "InputConfig": {
    "DataAttributes": {
      "ContentClassifiers": [
        "FreeOfPersonallyIdentifiableInformation",
        "FreeOfAdultContent"
      ]
    },
    "DataSource": {
      "S3DataSource": {
        "ManifestS3Uri": INPUT_MANIFEST_S3_URI
      }
    }
  },
  "LabelAttributeName": "Text",
  "LabelCategoryConfigS3Uri": LABEL_CATEGORIES_S3_URI,
  "OutputConfig": {
    "S3OutputPath": f"s3://{bucket}/text_files/text_job_output/"
  },
  "RoleArn": role,
  "StoppingConditions": {
    "MaxPercentageOfInputDatasetLabeled": 100
  }
}
print(createLabelingJob_request)
out = sm_client.create_labeling_job(**createLabelingJob_request)
print(out)

{'LabelingJobName': 'text-multi-class-demo-check', 'HumanTaskConfig': {'AnnotationConsolidationConfig': {'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-TextMultiClassMultiLabel'}, 'MaxConcurrentTaskCount': 200, 'NumberOfHumanWorkersPerDataObject': 1, 'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-TextMultiClassMultiLabel', 'TaskAvailabilityLifetimeInSeconds': 864000, 'TaskDescription': 'Classify text', 'TaskKeywords': ['Text Classification', 'Labeling'], 'TaskTimeLimitInSeconds': 800, 'TaskTitle': 'text-multi-class-demo-check', 'UiConfig': {'UiTemplateS3Uri': 's3://sagemaker-us-east-1-209419068016/text_files/text_manifests/text-classification-multiselect.liquid.html'}, 'WorkteamArn': 'arn:aws:sagemaker:us-east-1:209419068016:workteam/private-crowd/ijp-private-workteam'}, 'InputConfig': {'DataAttributes': {'ContentClassifiers': ['FreeOfPersonallyIdentifiableInformation', 'FreeOfAdultContent']}, 'DataSource': {'S3Data

## Describe labeling job

In [None]:
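describe = sm_client.describe_labeling_job(LabelingJobName=LABELING_JOB_NAME)
try:
    # the output location is only available once the job has produced output
    output_man = describe['LabelingJobOutput']['OutputDatasetS3Uri']
    print(output_man)
except KeyError:
    print('Job not finished yet!')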
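
Rather than re-running the cell by hand, the status can be polled. The sketch below is one possible approach, not part of the original notebook: `wait_for_labeling_job` is a hypothetical helper that takes the client as an argument and checks the `LabelingJobStatus` field of the `describe_labeling_job` response until it reaches `Completed`, `Failed`, or `Stopped`. The poll interval and the poll limit are arbitrary defaults.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(client, job_name, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll describe_labeling_job until the job reaches a terminal status.

    Returns the last describe response; raises TimeoutError if max_polls is exhausted.
    """
    for _ in range(max_polls):
        response = client.describe_labeling_job(LabelingJobName=job_name)
        if response["LabelingJobStatus"] in TERMINAL_STATUSES:
            return response
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Usage with the notebook's client (uncomment inside the notebook):
# describe = wait_for_labeling_job(sm_client, LABELING_JOB_NAME)
# if describe["LabelingJobStatus"] == "Completed":
#     output_man = describe["LabelingJobOutput"]["OutputDatasetS3Uri"]
```

With a private work team the job only finishes once workers have labeled the data, so the default limits may need to be raised considerably.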

In [79]:
!aws s3 cp {output_man} /home/ec2-user/SageMaker/text_for_labeling/output/output.manifest

download: s3://privisaa-bucket-2/text_files/text_manifests/text-example-multi-class-job/manifests/output/output.manifest to text_for_labeling/output/output.manifest
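
The same download can be done from Python instead of the AWS CLI. boto3's `download_file` takes a bucket and a key rather than a full URI, so the URI has to be split first. The sketch below uses a hypothetical helper, `split_s3_uri`, and a made-up URI. The commented lines show how it might be combined with the `s3` client and `output_man` defined earlier.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI, for illustration only
bucket_name, key = split_s3_uri("s3://example-bucket/text_files/output/output.manifest")
print(bucket_name, key)

# Inside the notebook (the local directory must exist before downloading):
# os.makedirs('text_for_labeling/output', exist_ok=True)
# out_bucket, out_key = split_s3_uri(output_man)
# s3.download_file(out_bucket, out_key, 'text_for_labeling/output/output.manifest')
```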


## View our results

In [80]:
output_list = []
with open('text_for_labeling/output/output.manifest','r') as file:
    for line in file:
        output_list.append(json.loads(line))

In [82]:
output_list[1]

{'source': '"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."',
 'text-example-multi-class-job': [1],
 'text-example-multi-class-job-metadata': {'job-name': 'labeling-job/text-example-multi-class-job',
  'confidence-map': {'1': 0},
  'class-map': {'1': 'topic'},
  'type': 'groundtruth/text-classifica
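

To summarize the results, the class names can be pulled out of each record. The sketch below assumes the layout visible in the entry above: the label attribute key holds the class ids, and the matching `-metadata` key holds a `class-map` from class id to class name. `labels_for_record` and `count_labels` are hypothetical helpers, and the records are abbreviated, made-up examples.

```python
from collections import Counter


def labels_for_record(record, label_attribute):
    """Return the class names recorded for one output-manifest record.

    Assumes `<label_attribute>-metadata` holds a 'class-map' dict that maps
    class ids (as strings) to class names.
    """
    metadata = record.get(f"{label_attribute}-metadata", {})
    return list(metadata.get("class-map", {}).values())


def count_labels(records, label_attribute):
    """Count how often each class name appears across all records."""
    counts = Counter()
    for record in records:
        counts.update(labels_for_record(record, label_attribute))
    return counts


# Abbreviated, made-up records that mirror the layout of the entry above
records = [
    {"source": "text a", "demo-job": [1],
     "demo-job-metadata": {"class-map": {"1": "topic"}, "confidence-map": {"1": 0}}},
    {"source": "text b", "demo-job": [0, 1],
     "demo-job-metadata": {"class-map": {"0": "other", "1": "topic"}}},
]
print(count_labels(records, "demo-job"))
```

In the notebook, the equivalent call would pass `output_list` and the label attribute name that appears as a key in the records.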