Skip to content

humanprotocol/inception-human-protocol

main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

INCEpTION - HUMAN Edition

Settings

This section briefly explains the recommended settings for INCEpTION when deploying to interact with HUMAN Protocol.

General settings for unsupervised deployment

We assume that the INCEpTION instance should be deployed automatically.

Default admin password

By default, INCEpTION sets up an admin account with a standard password at the first start. It is likely that you do not want that account to be generated with the default password but rather with a more secure password of your choice. In order to do that, you can add the following line containing a BCrypt-encrypted password to the settings.properties file.

security.default-admin-password=BCRYPT_ENCRYPTED_PASSWORD
Enabling the INCEpTION remote API (optional)

Optionally, you can enable the INCEpTION remote API for the default admin user by adding these following lines to the settings.properties file. The communication with the HUMAN Protocol uses its own API endpoints, so enabling the standard remote API is optional. However, it can be useful if you want to programmatically perform actions not available via the HUMAN Protocol API endpoints.

security.default-admin-remote-access=true
remote-api.enabled=true
Limit concurrent users

Depending on the capacity of your server and the characteristics of the annotation projects you are running (e.g. document size, number of documents, number of projects), you may want to limit the number of users which can log in to the system at any given time.

login.max-concurrent-sessions=50
Enable crowd-sourcing annotation workflow

In the context of the HUMAN Protocol, a crowd-sourcing workflow is used where invites are send out to potential annotators and then assign them on-demand to annotate particular documents. Thus, add the following lines to your settings.properties file to enable dynamic workload management and the ability to generate invite links.

workload.dynamic.enabled=true
sharing.invites.enabled=true
sharing.invites.guests-enabled=true
sharing.invites.invite-base-url=URL_WHERE_USERS_CAN_REACH_INCEPTION

Settings for the HUMAN Protocol

This section highlights the settings required to communicate with services using the HUMAN Protocol.

Wallet-based login

In order to log in as an annotator via an invite link, the user needs to connect to a wallet account. Currently supported are browser-embedded solutions that are supported by web3modal such as Metamask. Additionally, it is possible to configure an Infura ID in the settings.properties file which then also enables a login via WalletConnect.

human-protocol.infura-id=dead1234beefDEAD5678BEEFdead1234

Receiving jobs from the HUMAN Protocol

To enable the submission of jobs from HUMAN to INCEpTION, it is necessary to configure INCEpTION as an exchange and to bind it to an instance of the HUMAN Protocol Job Flow. You may obtain the values for the following settings from a support person on the side of the HUMAN Protocol and add them to the settings.properties file.

human-protocol.exchange-key=UUID
human-protocol.exchange-id=NUMBER
human-protocol.job-flow-url=URL
human-protocol.job-flow-key=UUID

Publishing results back to the HUMAN Protocol

Warning
Results are currently published as a full INCEpTION project export which contains all information about the project setup and also the annotations themselves in UIMA CAS XMI format. This is likely going to change.

To enable the publishing of annotation task results back, you need to set up a bucket on an S3-compatible service and then configure INCEpTION to access that service using the following settings in the settings.properties file. The settings are targetted at the Amazon S3 service. When using another cloud storage service such as Google Cloud Storage, you may have to follow specific procedures to obain an AWS-compatible access key and secret access key. You will also have to override the endpoint URL. Setting an AWS region is mandatory, even if the selected cloud storage service does not provide the particular region - the URL override should take care of that.

human-protocol.s3-bucket=BUCKET_NAME
human-protocol.s3-endpoint=ENDPOINT_URL (e.g. https://storage.googleapis.com)
human-protocol.s3-region=REGION (e.g. us-west-2 - must be a valid AWS region name!)
human-protocol.s3-access-key-id=ACCESS-KEY
human-protocol.s3-secret-access-key=SECRET-ACCESS-KEY

Job descriptions

Expiration date

The expiration_date field can be used to determine that using an invite link is only possible up to the given date. If this field is set to 0, the invite link is accessible until all documents have been annotated.

Project title and description

As there is not default key for a project key in the HUMAN Protocol Manifest Specification, it can be set via the key projectTitle in the request_config section. If this key is not specified, the job address is used as the project title:

"request_config": {
  "projectTitle": "My annotation project"
}

The project description is visible to annotators on the project dashboard after they join a project. This description is put together from the requester_description and the requester_question keys of the manifest. Both fields are interpreted as markdown, so you can use that to format your instructions to the annotators. Line breaks can be added using \n.

Job types

The HUMAN Protocol Manifest Specification defines the general framework for describing labelling jobs that are then submitted to exchanges such as INCEpTION. INCEpTION is a text-oriented labelling tool and supports several specific settings relevant to text labelling tasks. We will briefly go through the different types of labelling jobs that are presently supported and how they can be configured in the manifest.

Word-level annotation tasks

Word-level annotation tasks require the annotator to mark one or more words that constitue some semantic entity and to label them. Optionally, it is possible that the relevant spans are already pre-marked so that the annotator only has to assign a label.

For example, the annotator might be instructed to identify words or sequences of words representing persons, organizations, and locations (the classical named entities). Note though that the choice of what to identify is completely up to you, so you could also chose to let your annotators label e.g. chemical compounds, medical conditions, machine parts, job skills - you name it.

"requester_question" : {
  "en" : "Please mark any *persons*, *organizations*, *locations* and *other types of entities* in the text.\n\nTo do so, left-click with the mouse on the first word that is part of the mention, then drag it to the last word and then release the mouse button. You do not have to aim exactly at the characters, clicking and releasing anywhere within a word will automatically include the entire word. If you want to quickly mark a single word, double-left-click on it."
},
"requester_restricted_answer_set": {
  "PER":  { "en": "Person" },
  "ORG":  { "en": "Organization" },
  "LOC":  { "en": "Location" },
  "MISC": { "en": "Other types of entities" }
},
"request_type" : "span_select",
"request_config": {
  "anchoring": "tokens",
  "overlap": "none",
  "crossSenence": false,
  "dataFormat" : "text"
}

Sentence-level annotation tasks

Sentence-level annotation tasks require the annotator to mark relevant sentences and to assign properties to them.

For example, the annotator might be instructed to mark sentences containing sentiment expressions and to assign a polarity to them (e.g. positive, neutral, negative)

"requester_question" : {
  "en" : "Please mark sentences that contain a sentiment statement and assign a polarity.\n\nTo do so, double-click anywhere within a sentence to mark it and then select the polarity on the right side of the screen."
},
"requester_restricted_answer_set": {
  "pos":  { "en": "Positive sentiment expression" },
  "neg":  { "en": "Negative sentiment expression" },
  "neut": { "en": "Neutral sentiment expression" }
},
"request_type" : "span_select",
"request_config": {
  "anchoring": "sentences",
  "overlap": "none",
  "crossSenence": false,
  "dataFormat" : "textlines"
}

Document-level annotation tasks

Document-level annotation tasks require the annotator to classify a document by assigning a label.

Many sentence-level annotation tasks can also be treated as document-level annotation tasks if the documents are structured such that the consist only of a single sentence (or statement). Thus, the difference to sentence-level annotation tasks is often simply that the annotator does not have to mark relevant sentences before assigning a label to them, thus saving valuable time. On the other hand, considering sentences as documents and treating them in isolation from each other also hides the context of the sentence from the annotator, making it potentially more complicated or even impossible to assign the correct labels.

For example, the annotator might be instructed to flag documents (e.g. tweets or forum comments) that contain inappropriate content.

"requester_question" : {
  "en" : "Please read the tweet/forum post and if it contains inappropriate content choose the type of inappropriate content."
},
"requester_restricted_answer_set": {
  "abusive-harmful":  { "en": "Abusive or harmful" },
  "sensitive-personal": { "en": "Exposes sensitive personal information" },
  "spam":  { "en": "Unsolicited advertisement or promotion of commercial activity" }
},
"request_type" : "document_classification",
"request_config": {
  "dataFormat" : "text"
}

Manifest settings

Table 1. Request types
Request type  Description

span_select

Labelling of spans of text. Use this e.g. for named entity annotation tasks or other tasks for which you could train a sequence labelling model.

document_classification

Labelling of the entire document. Use this e.g. for marking emails as spam or other tasks for which you could train a document classification model.

Table 2. Workload parameters (top-level manifest parameters)
Parameter  Request type Description

requester_min_repeats

any

Number of annotators that need to finish a labelling a document before the document is considered to be fully labelled (default: 1). The value must be 1 or greater.

requester_max_repeats

any

This parameter is currently ignored.

requester_accuracy_target

span_select

If set to a decimal value between 0.0 and 1.0, the annotators labels are merged automatically before the result submission. If unset (default), no automatic merging is performed as part of the results submisson. If there is more than one label assigned to a span, then this parameter controls how many annotators must have chosen the majority label over the second-best label in order for the majority label to be considered for auto-merging. If this parameter is 0, then the majority label is always merged except if there is a tie with the second-best label. If the parameter is 1 then the majority label is used only if all annotators assigned it unanimously (i.e. there is no second-best label).

document_classification

For document classification tasks, this parameter is currently ignored.

Table 3. Request configuration parameters
 Request configuration parameter Request type Description

projectTitle

any

The name/title of the project (if not specified, the job address is used as the default title).

dataFormat

any

The format use for the task data (default: text). See below for a list of supported formats.

anchoring

span_select

The granularity of the annotations. Possible values are characters, tokens and sentences (default: characters). If the selected data format does not provide token and sentence boundary information, a simple token and sentence boundary detection will be performed by INCEpTION. Note that this simple detection may have issues such as generating a sentence boundary when a dot is used to mark an abbreviation. It is typically the best idea to provide data to INCEpTION that is already tokenized and which has sentence boundary information.

 crossSenence

span_select

Whether annotations may cross sentence boundaries or not. Possible values are true and false (default: false). For more info on automatic sentence detection, see anchoring above.

overlap

span_select

Whether annotations may overlap. Possible values are none and any (default: none). Allowing overlapping annotations also allows placing multiple annotations at the same location (stacking) which is useful if the labelling task allows for ambiguities or multiple answers.

Task data

Task data can be included either directly in the job manifest using the taskdata key, or it an external task data specification file can be reference using the taskdata_uri field.

As mentioned above, the tasks operate e.g. on a word or sentence level. INCEpTION includes a basic mechanism for detecting words (tokens) and sentences, but you may have a better algorithm at hand for your specific data, or you might care to define a sentence differently (e.g. one tweet being once sentence, irrespective of any punctuation used in the tweet). Also you might care to reduce the effort for your annotators e.g. by already pre-marking spans which the annotators should then only assign labels to (marking the spans might have been a separate previous annotation job).

To support these different kinds of scenarions, INCEpTION supports various data formats. It is necessary to define in request_config section of the job description which data format the task data is using.

"request_config": {
  ...
  "dataFormat" : "textlines"
  ...
}

A few supported formats are given here. Additional formats may be found in the INCEpTION documentation.

Plain text (text)

Plain UTF-8 text files. Word and sentences boundaries are automatically determined by INCEpTION.

This is a simple example text. INCEpTION will detect this sentence as the second sentence. Abbreviatins like Molholand Dr. can easily throw the internal sentence splitter off track.
Plain text with one sentence per line (textlines)

Plain UTF-8 text files that have been pre-formatted to contain one sentence-like unit per line. INCEpTION will treat each line in the documents as a sentence and within these sentences automatically identify word boundaries.

In this format, every line is treated as a sentence.
We no longer have problems with abbreviating drive into dr in Molholand Dr. as we did with plain text.
However, we cannot have line
breaks within a single sentence anymore.
Plain text with one sentence per line and whitespace-separated tokens (pretokenized-textlines)

Plain UTF-8 text files that have been pre-formatted to contain one sentence-like unit per line. Additionally, it is expected that words are separated by spaces. INCEpTION will not try to automatically identify word boundaries but treats every space as a boundary.

In this format , every line is treated as a sentence .
Additionally , words ( tokens ) must be separated by a space character .
This provides e.g. the ability to ensure that abbreviation markers are not confused with sentence end markers .
UIMA CAS XMI (xmi)

The UIMA CAS XMI format is a flexible XML-based format able to represent complex annotations. This is the format of choice for scenarios that operate on pre-annotated data.