# Creating and Designing Crowdsourcing Tasks using MTurk

Rosni Vasu | 16.11.2022

### Table of Contents
* **Crowdsourcing using Amazon Mechanical Turk (MTurk)**
    * **GUI**
        * Platform, how to log in, requirements
        * How to create and design tasks (Refer to the document *ATAI Tutorial MTurk Platform.pdf*)
    * **Boto3 API**
        * How to set up
        * Creating and designing tasks (Complete Demo using API)
            * Components (HIT template, properties, Informed consent)
            * HIT designs examples (only template)
            * Qualification task (Question Answer XML) - Gold tasks
            * Publish Task, Show progress
            * Approve/reject assignments, Delete HITs
            * Retrieve the result

-----------------------------------------------------------------------------------------

### The humans inside the machine!

* A platform on which Workers (or "Turkers") complete Human Intelligence Tasks (HITs) for monetary compensation.
* HITs are posted by Requesters (that’s us!).
* Set up for large-scale experiments that require human intelligence.


![image info](img/mturk1.png)

#### Requester account for creating MTurk tasks


![image info](img/mturk2.png)

### Requester vs Developer

* Sandbox Environment (https://requestersandbox.mturk.com/ )
* Production Site (https://requester.mturk.com/ )

### Worker
* Sandbox Environment (https://workersandbox.mturk.com/ )
* Production Site (https://worker.mturk.com/ )

![image info](img/mturk3.png)

### MTurk Fee Structure


![image info](img/mturk4.png)

### Ethics and Informed Participants

* Amazon worker IDs are actually tied to their (public) Amazon account, and thus constitute identifiable information.  Anonymize data by redacting worker IDs (and other identifiable information).
* Ethical obligation to inform participants about a task to help them decide if they'd like to participate
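
One way to redact worker IDs before sharing data is to replace each ID with a keyed hash, so the same worker still maps to the same pseudonym but the original ID cannot be read off the file. This is a minimal sketch, not a complete anonymization procedure; the function names, the `worker_` prefix, and the record layout are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize_worker_id(worker_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a worker ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, worker_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only part of the digest so the pseudonym stays readable.
    return "worker_" + digest[:16]


def anonymize_records(records, secret_key: bytes):
    """Return copies of the records with 'WorkerId' replaced by a pseudonym."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the caller's data is not modified
        record["WorkerId"] = pseudonymize_worker_id(record["WorkerId"], secret_key)
        cleaned.append(record)
    return cleaned
```

Keep the secret key out of the shared dataset and out of version control; anyone who holds it can recompute the mapping for a known worker ID.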


### MTurk Platform - a mini tour!

### Steps:
1. If you do not have an AWS account already, visit https://aws.amazon.com and create an account you can use for your project.
2. Register as a requester (shown before). Visit https://requester.mturk.com and create a new account.
3. When logged into both the root of your AWS account and your MTurk account, visit https://requester.mturk.com/developer to link them together.

**Go to:** https://requestersandbox.mturk.com/

**Demo:** Create and Publish using MTurk GUI

------------------------------------------------------------------------------------------------------------------

### Boto3 API

Language: Python

Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mturk.html


![image info](img/mturk5.png)

----------------------------------------------------------------------------------------------------------------

## **Let's get started**

**Table of Contents**

- Required libraries/ Configuring Your Profile/ Load the libraries/ Client configuration
- HIT design
- Informed Consent
- Create an HTML template
- Calculating Cost and Budgeting
- Qualification Task (Gold Tasks)
- More Qualification Requirements
- Create your task
- HIT Progress
- Get the results
- Approve or Reject Assignments
- Delete the HITs
- Block and Unblock Workers

### Required libraries

Based on [MTurk account set up](https://blog.mturk.com/tutorial-mturk-using-python-in-jupyter-notebook-17ba0745a97f) 

In [21]:
!pip install awscli
!pip install boto3
!pip install xmltodict



### Configuring Your Profile

#### Steps:
To call MTurk you will need to configure your computer with a profile that has the right credentials.
1. Create a new AWS IAM User. 
2. Then select the Security Credentials tab and create a new Access Key, copy the Access Key and Secret Access Key for future use.
3. Run the command below to configure an `mturk` profile on your computer that you will use when calling the API. Run it in a terminal; it won't work in a Jupyter Notebook.

```bash
aws configure --profile mturk
```

4. When prompted, provide the `Access Key ID` and `Secret Access Key` you captured above. For the region, you can enter “us-east-1” and leave the output format as None.

### Load the libraries

* The boto3 package is an easy-to-use Python library for accessing AWS.
* The xmltodict library makes it much easier to work with the XML data returned by MTurk.

In [22]:
import boto3
import xmltodict
import json

### Client configuration

Use the [worker sandbox](https://workersandbox.mturk.com/) and [requester sandbox](https://requestersandbox.mturk.com/) to test your code before running it on the production site ([worker](https://worker.mturk.com/) and [requester](https://requester.mturk.com/)).

In [23]:
create_hits_in_production = False

environments = {
  "production": {
    "endpoint": "https://mturk-requester.us-east-1.amazonaws.com",
    "preview": "https://www.mturk.com/mturk/preview"
  },
  "sandbox": {
    "endpoint": 
          "https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    "preview": "https://workersandbox.mturk.com/mturk/preview"
  },
}

mturk_environment = environments["production"] if create_hits_in_production else environments["sandbox"]
session = boto3.Session(profile_name='mturk')
client = session.client(
    service_name='mturk',
    region_name='us-east-1',
    endpoint_url=mturk_environment['endpoint'],
)

print(client.get_account_balance()['AvailableBalance'])


10000.00


All the client methods are defined in the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mturk.html)

### HIT design

* Sentiment detection
* Knowledge graph completion

#### Task:
* Read data from a file
* The data below contains a few examples of entities that can be linked by some relation (these are later used as input triples for the knowledge completion models)

In [24]:
import pandas as pd

data = pd.read_csv("data/data.csv")
data.head()

for idx, row in data.iterrows():
    print(list(row))

['Barack Obama', 'the state of hawai', 'was born in', 'won the ', 'is or was married at', 'died in', 'was influenced by']
['Leonardo da Vinci', 'The Mona Lisa', 'was born in', 'won the ', 'is or was married at', 'painted', 'destroyed']
['Leonard Nimoy', 'Star Trek', 'won the', 'is or was married to', 'painted', 'starred in', 'character in']


### Informed Consent

* Ethical obligation to inform participants about a task to help them decide if they'd like to participate

> The first page of the online survey should be the consent document. The online consent will have all of the elements of a regular consent, but it will not require a signature. 


### Create an HTML template that is shown to workers for each input item


In lines 211 to 224 of the template:

` <p class="well">${content}</p> `


* You can read the HTML from a file and wrap it in `QUESTION_XML` (the question layout XML) required by MTurk

In [25]:
html_layout = open('data/HIT-KGCompletion.html', 'r').read()
QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
        <HTMLContent><![CDATA[{}]]></HTMLContent>
        <FrameHeight>650</FrameHeight>
        </HTMLQuestion>"""
question_xml = QUESTION_XML.format(html_layout)
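
The wrapping step can also be put into a small helper that guards against one easy mistake: the HTML is placed inside a CDATA section, so a literal `]]>` in the template would end that section early. The following is a sketch under that assumption; `build_html_question` and `QUESTION_XML_TEMPLATE` are hypothetical names, not part of boto3.

```python
QUESTION_XML_TEMPLATE = (
    '<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/'
    'AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">'
    "<HTMLContent><![CDATA[{html}]]></HTMLContent>"
    "<FrameHeight>{height}</FrameHeight>"
    "</HTMLQuestion>"
)


def build_html_question(html: str, frame_height: int = 650) -> str:
    """Wrap an HTML layout in the HTMLQuestion envelope used above."""
    if "]]>" in html:
        # The CDATA section would be closed by this sequence.
        raise ValueError("HTML layout must not contain the sequence ']]>'")
    if frame_height <= 0:
        raise ValueError("frame_height must be a positive number of pixels")
    # Only the template is a format string; braces inside `html` are left alone.
    return QUESTION_XML_TEMPLATE.format(html=html, height=frame_height)
```

Because the HTML is passed as an argument rather than used as the format string, CSS or JavaScript braces in the layout do not need escaping.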

### Calculating Cost and Budgeting



* We pay all participants a fair wage of `$15/hour`. If a study takes an hour, we pay `$15`; if it takes half an hour, we pay `$7.50`.
* Fairwork (refer to the lecture): a single-line JavaScript link you can add to your task to ensure that your workers are paid at least minimum wage.


#### An appropriate title, description, keywords are also provided to let Workers know what is involved in this task.
* Each response has a reward of `$0.05`, so with 5 assignments the total Worker reward per HIT is `$0.25`, plus `$0.05` in MTurk fees.

In [1]:
TaskAttributes = {
    'MaxAssignments': 5,           
    # How long the task will be available on MTurk (1 hour)     
    'LifetimeInSeconds': 60*60,
    # How long Workers have to complete each item (10 minutes)
    'AssignmentDurationInSeconds': 60*10,
    # The reward you will offer Workers for each response
    'Reward': '0.05',                     
    'Title': 'Find a suitable relation for the given entities',
    'Keywords': 'Knowledge graph, link, fact',
    'Description': 'Identify suitable relation for the given entities'
}
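
The arithmetic behind these numbers can be written down as a small budgeting sketch. The 20% fee rate is an assumption inferred from the `$0.25` reward and `$0.05` fee quoted above; check the fee structure figure for the rate that applies to your task. Both function names are made up for this example.

```python
from decimal import Decimal


def reward_for_duration(minutes, hourly_wage=Decimal("15")):
    """Reward (in dollars) that pays `hourly_wage` for a task of `minutes`."""
    reward = Decimal(minutes) / Decimal(60) * hourly_wage
    return reward.quantize(Decimal("0.01"))


def estimate_cost(reward, max_assignments, num_hits, fee_rate=Decimal("0.20")):
    """Estimate worker reward, fee and total for a batch of HITs.

    fee_rate is an assumed parameter, not a value read from MTurk.
    """
    worker_reward = Decimal(reward) * max_assignments * num_hits
    fee = worker_reward * fee_rate
    return {"worker_reward": worker_reward, "fee": fee, "total": worker_reward + fee}


# One HIT with the attributes above: 5 assignments at $0.05 each.
print(estimate_cost("0.05", max_assignments=5, num_hits=1))
```

`Decimal` is used instead of `float` so that cent amounts add up exactly.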

### Qualification Task (Gold Tasks)

In [39]:
questions = open(file="data/Test.xml", mode="r").read()
answers = open(file="data/AnswerKey.xml", mode="r").read()

Refer: https://requestersandbox.mturk.com/ 

In [41]:
client.update_qualification_type(
                      QualificationTypeId='3DCMS2BWIWGQB243W43QQNA4WGZCTH',
                      Description='This is a brief knowledge test for workers',
                      QualificationTypeStatus='Active',
                      Test=questions,
                      AnswerKey=answers,
                      TestDurationInSeconds=2400)

{'QualificationType': {'QualificationTypeId': '36EODA36TPPO3DK9F6W0O78Z0IE8BH',
  'CreationTime': datetime.datetime(2022, 11, 14, 9, 1, 23, tzinfo=tzlocal()),
  'Name': 'PreScreening',
  'Description': 'This is a brief knowledge test for workers',
  'QualificationTypeStatus': 'Active',
  'Test': '<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">\n  <Overview>\n    <Title>Qualification Test</Title>\n      <Text> \n    </Text>\n  </Overview>\n\n\n  <Question>\n    <QuestionIdentifier>question1</QuestionIdentifier>\n    <IsRequired>true</IsRequired>\n    <QuestionContent>\n      <Text>1. Find the subject, object and predicate of the following fact </Text>\n        <FormattedContent><![CDATA[\n            The theory of relativity proposed by Albert Einstein\n         ]]>\n    </FormattedContent>\n    </QuestionContent>\n    <AnswerSpecification>\n      <SelectionAnswer>\n        <StyleSuggestion>radiobutton</StyleSuggestion>
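
The contents of `data/AnswerKey.xml` are not shown in this notebook. As a rough sketch of what such a file might contain, the helper below builds an answer key for single-choice questions and maps the summed score to a percentage, which fits a qualification threshold expressed on a 0 to 100 scale. The element names follow the MTurk AnswerKey schema as best recalled, so verify them against the MTurk documentation; the question and selection identifiers in the usage line are hypothetical.

```python
from xml.sax.saxutils import escape

ANSWER_KEY_NS = (
    "http://mechanicalturk.amazonaws.com/"
    "AWSMechanicalTurkDataSchemas/2005-10-01/AnswerKey.xsd"
)


def build_answer_key(correct_answers, points_per_question=1):
    """Build AnswerKey XML from {question_id: correct_selection_id}."""
    parts = ['<AnswerKey xmlns="%s">' % ANSWER_KEY_NS]
    for question_id, selection_id in correct_answers.items():
        parts.append(
            "<Question>"
            "<QuestionIdentifier>%s</QuestionIdentifier>"
            "<AnswerOption>"
            "<SelectionIdentifier>%s</SelectionIdentifier>"
            "<AnswerScore>%d</AnswerScore>"
            "</AnswerOption>"
            "</Question>"
            % (escape(question_id), escape(selection_id), points_per_question)
        )
    # Map the summed score onto 0-100 so a requirement like "> 80" is meaningful.
    max_score = points_per_question * len(correct_answers)
    parts.append(
        "<QualificationValueMapping><PercentageMapping>"
        "<MaximumSummedScore>%d</MaximumSummedScore>"
        "</PercentageMapping></QualificationValueMapping>" % max_score
    )
    parts.append("</AnswerKey>")
    return "\n".join(parts)


# Hypothetical identifiers; use the ones from your own Test.xml.
answer_key_xml = build_answer_key({"question1": "option2", "question2": "option1"})
```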

### More Qualification Requirements

Create a task according to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mturk.html#MTurk.Client.create_hit)

In [42]:
TaskAttributes = { 
    'MaxAssignments': 5,           
    # How long the task will be available on MTurk (1 hour)     
    'LifetimeInSeconds': 60*60,
    # How long Workers have to complete each item (10 minutes)
    'AssignmentDurationInSeconds': 60*10,
    # The reward you will offer Workers for each response
    'Reward': '0.05',                     
    'Title': 'Find a suitable relation for the given entities',
    'Keywords': 'Knowledge graph, link, fact',
    'Description': 'Identify suitable relation for the given entities',
    ## More qualification
    'QualificationRequirements': [
                {
            'QualificationTypeId': '3DCMS2BWIWGQB243W43QQNA4WGZCTH',
            'Comparator': 'GreaterThan',
            'IntegerValues': [
                80,
            ]
        },
    ]
}
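
Custom qualifications can be combined with MTurk's built-in system qualifications, such as approval rate or worker locale. The sketch below builds such a list for the `QualificationRequirements` parameter. The two system qualification type IDs are quoted from the MTurk documentation as best recalled and should be verified before use; the thresholds and the country list are arbitrary example values, and `build_requirements` is a hypothetical helper.

```python
# System qualification type IDs (assumed; verify in the MTurk documentation).
PERCENT_ASSIGNMENTS_APPROVED_ID = "000000000000000000L0"
WORKER_LOCALE_ID = "00000000000000000071"


def build_requirements(custom_qualification_id, min_test_score=80,
                       min_approval_rate=95, countries=("US",)):
    """Return a list usable as TaskAttributes['QualificationRequirements']."""
    return [
        {   # Score on your own qualification test
            "QualificationTypeId": custom_qualification_id,
            "Comparator": "GreaterThan",
            "IntegerValues": [min_test_score],
        },
        {   # Share of the worker's past assignments that were approved
            "QualificationTypeId": PERCENT_ASSIGNMENTS_APPROVED_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval_rate],
        },
        {   # Restrict the task to workers in the given countries
            "QualificationTypeId": WORKER_LOCALE_ID,
            "Comparator": "In",
            "LocaleValues": [{"Country": code} for code in countries],
        },
    ]
```

For example, `TaskAttributes['QualificationRequirements'] = build_requirements('3DCMS2BWIWGQB243W43QQNA4WGZCTH')` would keep the test-score requirement used above and add the two system requirements.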

### Create your task/ HIT

Use `create_hit` to create a HIT/Task and pass the `TaskAttributes` and `Question` as parameters.
All the parameters are defined in the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mturk.html#MTurk.Client.create_hit)

In [43]:
import re

results = []
hit_type_id = ''

for idx, row in data.iterrows():
    temp_data = list(row)
    rep = {"${subject}":temp_data[0] , "${Object}": temp_data[1], "${Option1}": temp_data[2], "${Option2}": temp_data[3],\
           "${Option3}": temp_data[4], "${Option4}": temp_data[5], "${Option5}": temp_data[6] }
    
    rep = dict((re.escape(k), v) for k, v in rep.items()) 
    pattern = re.compile("|".join(rep.keys()))
    question = pattern.sub(lambda m: rep[re.escape(m.group(0))], question_xml)
    
    response = client.create_hit(
        **TaskAttributes,
        Question = question
    )
    hit_type_id = response['HIT']['HITTypeId']
    results.append({
        'triple': temp_data,
        'hit_id': response['HIT']['HITId']
    })
    
print("You can view the HITs here:")
print(mturk_environment['preview']+"?groupId={}".format(hit_type_id))
    

You can view the HITs here:
https://workersandbox.mturk.com/mturk/preview?groupId=31SXY3N7KD3XPS1HMH7J4X99BBC0BA


Go to the [worker sandbox](https://workersandbox.mturk.com/) and check if the HIT is created.
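
The placeholder substitution in the cell above can be isolated into a small function, which makes it easy to test without calling MTurk. This sketch differs from the cell in two ways that are choices of this example rather than requirements: it HTML-escapes the inserted values, and it raises an error when a placeholder has no value. `fill_template` is a hypothetical name.

```python
import html
import re

# Matches placeholders of the form ${name}; names are case-sensitive.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def fill_template(template: str, values: dict) -> str:
    """Replace every ${name} in the template with the escaped value for name."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("No value provided for placeholder ${%s}" % name)
        # Escape so that characters like & or < cannot break the HTML layout.
        return html.escape(str(values[name]))
    return PLACEHOLDER.sub(replace, template)


row = ["Barack Obama", "the state of hawai", "was born in"]
print(fill_template("${subject} ... ${Object}: ${Option1}",
                    {"subject": row[0], "Object": row[1], "Option1": row[2]}))
```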

### HIT Progress

Fetch the HIT using `get_hit` and check the status of the HIT. 


In [44]:
from collections import Counter

for idx, item in enumerate(results):
    counter = Counter()
    hit_id = item['hit_id']
    print('Checking HIT %d / %d' % (idx + 1, len(results)))
    try:
        hit = client.get_hit(HITId=hit_id)['HIT']
    except Exception:  # e.g. the HIT ID does not exist
        print('Can\'t find hit id: %s' % (hit_id))
        continue
    total = int(hit['MaxAssignments'])
    completed = 0
    paginator = client.get_paginator('list_assignments_for_hit')
    for a_page in paginator.paginate(HITId=hit_id, PaginationConfig={'PageSize': 100}):
        for a in a_page['Assignments']:
            if a['AssignmentStatus'] in ['Submitted', 'Approved', 'Rejected']:
                completed += 1
    counter.update([(completed, total)])

    for (completed, total), count in counter.most_common():
        print('%d / %d: %d' % (completed, total, count))

Checking HIT 1 / 3
0 / 5: 1
Checking HIT 2 / 3
0 / 5: 1
Checking HIT 3 / 3
0 / 5: 1
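
The loop above creates a fresh `Counter` for every HIT, so each printed line describes a single HIT. To see how many HITs share the same progress, one option is to collect the per-HIT counts first and tally them once at the end. The helpers below are a sketch of that idea on plain Python data; their names are made up for this example, and the assignment dictionaries only need the `AssignmentStatus` key.

```python
from collections import Counter

COMPLETED_STATUSES = {"Submitted", "Approved", "Rejected"}


def count_completed(assignments):
    """Count assignments that a worker has already handed in."""
    return sum(1 for a in assignments if a["AssignmentStatus"] in COMPLETED_STATUSES)


def summarize_progress(per_hit):
    """Tally (completed, total) pairs, one pair per HIT."""
    return Counter(per_hit)


# Example with made-up progress for three HITs of 5 assignments each.
summary = summarize_progress([(0, 5), (0, 5), (1, 5)])
for (completed, total), count in summary.most_common():
    print("%d / %d: %d HIT(s)" % (completed, total, count))
```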


### Congratulations, you have published your first MTurk task!

### Get the results




For each item in the results array we perform the following steps:

1. Get the current status of the HIT and store it in the results array.
2. Get a list of the Assignments that have been completed for each item and store the count of Assignments completed into the results array.
3. Loop through each Assignment and capture the details of the Assignment and the results to an array of answers.
4. Approve each Assignment so that the $0.05 reward will be distributed to Workers.
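
Step 3 depends on parsing the answer XML. One thing to watch for with `xmltodict` is that a form with a single field yields one dictionary instead of a list of answers, so iterating over it behaves differently. As an alternative sketch, the standard library parser below always returns a dictionary from question identifier to the submitted text. It assumes each `Answer` element carries a `QuestionIdentifier` and a `FreeText` child, as in the results shown in this notebook; `parse_answers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parse_answers(answer_xml: str) -> dict:
    """Map each QuestionIdentifier to its FreeText value."""
    root = ET.fromstring(answer_xml)
    result = {}
    for answer in root:
        # Strip the XML namespace, e.g. '{...}FreeText' -> 'FreeText'.
        fields = {child.tag.split("}")[-1]: (child.text or "") for child in answer}
        if "QuestionIdentifier" in fields:
            result[fields["QuestionIdentifier"]] = fields.get("FreeText", "")
    return result
```

With this helper, the selected option is the identifier whose value is `'true'`, and the free-text fields can be read directly by their identifier.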


In [45]:
for item in results:
    
    # Get the status of the HIT
    hit = client.get_hit(HITId=item['hit_id'])
    item['status'] = hit['HIT']['HITStatus']
    # Get a list of the Assignments that have been submitted
    assignmentsList = client.list_assignments_for_hit(
        HITId=item['hit_id'],
        AssignmentStatuses=['Submitted', 'Approved'],
        MaxResults=10
    )
    assignments = assignmentsList['Assignments']
    item['assignments_submitted_count'] = len(assignments)
    answers = []
    for assignment in assignments:
    
        # Retrieve the attributes for each Assignment
        worker_id = assignment['WorkerId']
        assignment_id = assignment['AssignmentId']
        
        # Retrieve the value submitted by the Worker from the XML
        answer_dict = xmltodict.parse(assignment['Answer'])
        for answer in answer_dict['QuestionFormAnswers']['Answer']:
            if answer['FreeText']=='true':
                print(answer['QuestionIdentifier'])
                answers.append(answer['QuestionIdentifier'])
            if answer['QuestionIdentifier'].startswith("wiki"):
                print(answer['FreeText'])
                answers.append(answer['FreeText'])
                
        #answer = answer_dict['QuestionFormAnswers']['Answer']['FreeText']
        #answers.append(int(answer))
        
        # Approve the Assignment (if it hasn't been already)
        if assignment['AssignmentStatus'] == 'Submitted':
            client.approve_assignment(
                AssignmentId=assignment_id,
                OverrideRejection=False
            )
    
    # Add the answers that have been retrieved for this item
    item['answers'] = answers
    #if len(answers) > 0:
    #    item['avg_answer'] = sum(answers)/len(answers)
print(json.dumps(results,indent=2))


option4.Option 4
wikipedia url
sentence form wikipedia
[
  {
    "triple": [
      "Barack Obama",
      "the state of hawai",
      "was born in",
      "won the ",
      "is or was married at",
      "died in",
      "was influenced by"
    ],
    "hit_id": "3JTPR5MTZSIDNBJTV8K8ZFW08VM5K1",
    "status": "Assignable",
    "assignments_submitted_count": 0,
    "answers": []
  },
  {
    "triple": [
      "Leonardo da Vinci",
      "The Mona Lisa",
      "was born in",
      "won the ",
      "is or was married at",
      "painted",
      "destroyed"
    ],
    "hit_id": "38VTL6WC4AJ87G8AZNLZ8NBB6AUY5O",
    "status": "Assignable",
    "assignments_submitted_count": 1,
    "answers": [
      "option4.Option 4",
      "wikipedia url",
      "sentence form wikipedia"
    ]
  },
  {
    "triple": [
      "Leonard Nimoy",
      "Star Trek",
      "won the",
      "is or was married to",
      "painted",
      "starred in",
      "character in"
    ],
    "hit_id": "3KQC8JMJGCYJ76VHB4TI7QIO

### Approve or Reject Assignments

#### Approve

Use the `approve_assignment()` method to approve an assignment. You can also provide a feedback message to the Worker.

In [46]:
import re

approve_ids = []
reject_ids = []
store_true = False

for idx, item in enumerate(results):
    hit_id = item['hit_id']
    paginator = client.get_paginator('list_assignments_for_hit')
    try:
        for a_page in paginator.paginate(HITId=hit_id, PaginationConfig={'PageSize': 100}):
            for a in a_page['Assignments']:
                if a['AssignmentStatus'] == 'Submitted':
                    try:
                        # Try to parse the answer from the assignment. If it is
                        # missing or isn't valid JSON, reject the assignment.
                        json.loads(re.search(r'<FreeText>(?P<answer>.*?)</FreeText>', a['Answer'])['answer'])
                        approve_ids.append(a['AssignmentId'])
                    except (ValueError, TypeError):
                        reject_ids.append(a['AssignmentId'])
                else:
                    print("hit %s has already been %s" % (str(hit_id), a['AssignmentStatus']))
    except client.exceptions.RequestError:
        continue

print('This will approve %d assignments and reject %d assignments with '
     'sandbox=%s' % (len(approve_ids), len(reject_ids), str(mturk_environment['endpoint'])))
print('Continue?')

if not store_true:
    s = input('(Y/N): ')
else:
    s = 'Y'
if s == 'Y' or s == 'y':
    print('Approving assignments')
    for idx, assignment_id in enumerate(approve_ids):
        print('Approving assignment %d / %d' % (idx + 1, len(approve_ids)))
        client.approve_assignment(AssignmentId=assignment_id)
    for idx, assignment_id in enumerate(reject_ids):
        print('Rejecting assignment %d / %d' % (idx + 1, len(reject_ids)))
        client.reject_assignment(AssignmentId=assignment_id, RequesterFeedback='Invalid results')
else:
    print('Aborting')


hit 38VTL6WC4AJ87G8AZNLZ8NBB6AUY5O has already been Approved
This will approve 0 assignments and reject 0 assignments with sandbox=https://mturk-requester-sandbox.us-east-1.amazonaws.com
Continue?
(Y/N): Y
Approving assignments


#### Rejecting Assignments

Use the `reject_assignment()` method to reject an assignment. You can also provide a feedback message to the Worker.

In [47]:
assignment_ids = []

for item in results:
    assignmentsList = client.list_assignments_for_hit(
        HITId=item['hit_id'],
        AssignmentStatuses=['Submitted', 'Approved'],
        MaxResults=10
    )          

    assignments = assignmentsList['Assignments']
    item['assignments_submitted_count'] = len(assignments)
    for assignment in assignments:
        assignment_ids.append(assignment['AssignmentId'])
        
print(assignment_ids)
print('This will reject %d assignments with '
     'sandbox=%s' % (len(assignment_ids), str(mturk_environment['endpoint'])))
print('Continue?')

s = input('(Y/N): ')
if s == 'Y' or s == 'y':
    print('Rejecting assignments')
    for idx, assignment_id in enumerate(assignment_ids): 
        try:
            print(idx,assignment_id)
            print('Rejecting assignment %d / %d' % (idx + 1, len(assignment_ids)))
            client.reject_assignment(AssignmentId=assignment_id, RequesterFeedback='Invalid results')
        except:
            print("Could not reject: %s" % (assignment_id))
else:
    print('Aborting')


['3S4AW7T80CO9XIU4S5I1Y7YGNVXL4W']
This will reject 1 assignments with sandbox=https://mturk-requester-sandbox.us-east-1.amazonaws.com
Continue?
(Y/N): Y
Rejecting assignments
0 3S4AW7T80CO9XIU4S5I1Y7YGNVXL4W
Rejecting assignment 1 / 1
Could not reject: 3S4AW7T80CO9XIU4S5I1Y7YGNVXL4W
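
The "Could not reject" message is consistent with the assignment having been approved in an earlier cell: as far as the MTurk documentation describes it, only assignments that are still in the `Submitted` state can be rejected. A small filter like the one below, applied before asking for confirmation, avoids sending requests that are expected to fail. It is a sketch that works on the assignment dictionaries returned by `list_assignments_for_hit`; the function name is hypothetical.

```python
def rejectable_assignment_ids(assignments):
    """Return the IDs of assignments that are still awaiting a decision."""
    return [
        a["AssignmentId"]
        for a in assignments
        if a["AssignmentStatus"] == "Submitted"
    ]


# Made-up example data with the same keys as the MTurk response.
example = [
    {"AssignmentId": "A1", "AssignmentStatus": "Approved"},
    {"AssignmentId": "A2", "AssignmentStatus": "Submitted"},
]
print(rejectable_assignment_ids(example))
```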


### Delete the HITs

In [48]:
from datetime import datetime

delete_all = False
hit_ids = []

if delete_all:
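    # boto3 has no get_all_hits(); page through list_hits instead
    for page in client.get_paginator('list_hits').paginate():
        for hit in page['HITs']:
            hit_ids.append(hit['HITId'])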
else:
    for item in results:
        hit_ids.append(item['hit_id'])

print('This will delete %d HITs with sandbox=%s'
         % (len(hit_ids), str(mturk_environment['endpoint'])))
print('Continue?')
s = input('(Y/N): ')
if s == 'Y' or s == 'y':
    for index, hit_id in enumerate(hit_ids):
        try:
            client.delete_hit(HITId=hit_id)
            print('disabling: %d / %d' % (index+1, len(hit_ids)))
        except:
            #print('Failed to delete: %s' % (hit_id))
            # Get HIT status
            status = client.get_hit(HITId=hit_id)['HIT']['HITStatus']
            print('HITStatus:', status)
            if status == 'Assignable':
                response = client.update_expiration_for_hit(HITId=hit_id,ExpireAt=datetime(2015, 1, 1))        
            # Delete the HIT
            try:
                client.delete_hit(HITId=hit_id)
            except:
                print('Not deleted')
            else:
                print('Deleted')
else:
    print('Aborting')

This will delete 3 HITs with sandbox=https://mturk-requester-sandbox.us-east-1.amazonaws.com
Continue?
(Y/N): Y
HITStatus: Assignable
Not deleted
HITStatus: Assignable
Deleted
HITStatus: Assignable
Not deleted
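
According to the boto3 documentation for `delete_hit`, a HIT can only be deleted when it is in the `Reviewable` state and all of its submitted assignments have been approved or rejected. That may explain the mixed "Deleted" and "Not deleted" results above, for example if a HIT had not yet switched to `Reviewable` after being expired. The sketch below turns that rule into a check that reports why a deletion would be refused; it is a local helper with a made-up name and does not call MTurk.

```python
def deletion_blockers(hit_status, assignment_statuses):
    """Return a list of reasons why delete_hit is expected to fail."""
    blockers = []
    if hit_status != "Reviewable":
        blockers.append("HIT status is %s, not Reviewable" % hit_status)
    pending = sum(1 for status in assignment_statuses if status == "Submitted")
    if pending:
        blockers.append("%d submitted assignment(s) not yet approved or rejected" % pending)
    return blockers


print(deletion_blockers("Assignable", ["Submitted"]))
```

An empty list means the documented preconditions are met; a non-empty list tells you whether to expire the HIT or review its assignments first.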


### Block and Unblock Workers


Use `create_worker_block` to block a worker from working on your HITs. You can also use `list_worker_blocks` to get a list of blocked workers and `delete_worker_block` to unblock a worker.

#### How to deal with malicious or bad workers?
* Quality issues discussed in lecture

#### Block Workers



In [7]:
worker_ids = ['AXMNSDDF']

# with open(args.worker_ids_file, 'r') as f:
#    worker_ids = [line.strip() for line in f]

print('This will block %d workers with IDs with sandbox=%s'
     % (len(worker_ids), str(mturk_environment['endpoint'])))
print('Continue?')

s = input('(Y/N): ')

if s == 'Y' or s == 'y':
    for worker_id in worker_ids:
        try:
            client.create_worker_block(WorkerId=worker_id, Reason='provided bad data')
        except:
            print('Failed to block: %s' % (worker_id))
else:
    print('Aborting')

This will block 1 workers with IDs with sandbox=https://mturk-requester-sandbox.us-east-1.amazonaws.com
Continue?
(Y/N):  y
Failed to block: AXMNSDDF


#### Unblock Workers

In [9]:
worker_ids = ['AXMNSDDF']

# with open(args.worker_ids_file, 'r') as f:
#    worker_ids = [line.strip() for line in f]

print('This will unblock %d workers with IDs with sandbox=%s'
     % (len(worker_ids), str(mturk_environment['endpoint'])))
print('Continue?')

s = input('(Y/N): ')

if s == 'Y' or s == 'y':
    for worker_id in worker_ids:
        try:
            client.delete_worker_block(WorkerId=worker_id, Reason='already provided data')
        except:
            print('Failed to unblock: %s' % (worker_id))
else:
    print('Aborting')

This will unblock 1 workers with IDs with sandbox=https://mturk-requester-sandbox.us-east-1.amazonaws.com
Continue?
(Y/N):  y
Failed to unblock: AXMNSDDF


#### Disassociate the qualification from worker

In [38]:
response = client.disassociate_qualification_from_worker(
    WorkerId='A22BXIPVQWKLM2',
    QualificationTypeId='36GZRO2Y1D7WY5RMP45HEIT2W6AF0B',
    Reason='Allow the worker to retake the qualification test'
)