INCEpTION - HUMAN Edition
Settings
This section briefly explains the recommended settings for INCEpTION when deploying it to interact with the HUMAN Protocol.
General settings for unsupervised deployment
We assume that the INCEpTION instance should be deployed automatically.
By default, INCEpTION sets up an admin account with a standard password at the first start. You will likely want that account to be created with a more secure password of your choice instead. To do so, add the following line containing a BCrypt-encrypted password to the settings.properties file.
security.default-admin-password=BCRYPT_ENCRYPTED_PASSWORD
You can enable the INCEpTION remote API for the default admin user by adding the following lines to the settings.properties file. Communication with the HUMAN Protocol uses its own API endpoints, so enabling the standard remote API is optional. However, it can be useful if you want to programmatically perform actions not available via the HUMAN Protocol API endpoints.
security.default-admin-remote-access=true
remote-api.enabled=true
Depending on the capacity of your server and the characteristics of the annotation projects you are running (e.g. document size, number of documents, number of projects), you may want to limit the number of users which can log in to the system at any given time.
login.max-concurrent-sessions=50
In the context of the HUMAN Protocol, a crowd-sourcing workflow is used in which invites are sent out to potential annotators, who are then assigned on demand to annotate particular documents. Thus, add the following lines to your settings.properties file to enable dynamic workload management and the ability to generate invite links.
workload.dynamic.enabled=true
sharing.invites.enabled=true
sharing.invites.guests-enabled=true
sharing.invites.invite-base-url=URL_WHERE_USERS_CAN_REACH_INCEPTION
Settings for the HUMAN Protocol
This section highlights the settings required to communicate with services using the HUMAN Protocol.
Wallet-based login
In order to log in as an annotator via an invite link, the user needs to connect a wallet account. Currently, browser-embedded solutions supported by web3modal, such as MetaMask, can be used. Additionally, it is possible to configure an Infura ID in the settings.properties file, which then also enables login via WalletConnect.
human-protocol.infura-id=dead1234beefDEAD5678BEEFdead1234
Receiving jobs from the HUMAN Protocol
To enable the submission of jobs from HUMAN to INCEpTION, it is necessary to configure INCEpTION as an exchange and to bind it to an instance of the HUMAN Protocol Job Flow. You may obtain the values for the following settings from a support contact at the HUMAN Protocol and add them to the settings.properties file.
human-protocol.exchange-key=UUID
human-protocol.exchange-id=NUMBER
human-protocol.job-flow-url=URL
human-protocol.job-flow-key=UUID
Publishing results back to the HUMAN Protocol
Warning
Results are currently published as a full INCEpTION project export, which contains all information about the project setup as well as the annotations themselves in UIMA CAS XMI format. This is likely going to change.
To enable the publishing of annotation task results, you need to set up a bucket on an S3-compatible service and then configure INCEpTION to access that service using the following settings in the settings.properties file. The settings are targeted at the Amazon S3 service. When using another cloud storage service such as Google Cloud Storage, you may have to follow specific procedures to obtain an AWS-compatible access key and secret access key. You will also have to override the endpoint URL. Setting an AWS region is mandatory, even if the selected cloud storage service does not provide that particular region; the endpoint URL override should take care of that.
human-protocol.s3-bucket=BUCKET_NAME
human-protocol.s3-endpoint=ENDPOINT_URL (e.g. https://storage.googleapis.com)
human-protocol.s3-region=REGION (e.g. us-west-2 - must be a valid AWS region name!)
human-protocol.s3-access-key-id=ACCESS-KEY
human-protocol.s3-secret-access-key=SECRET-ACCESS-KEY
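For an unattended deployment, it can help to check the settings file for the HUMAN Protocol keys listed above before starting INCEpTION. The following is a minimal sketch in Python, not part of INCEpTION itself: the function names and the grouping into "job intake" and "result publishing" are illustrative assumptions, and only simple key=value lines are handled. The endpoint setting is left out of the required keys because it is only needed when a non-Amazon service is used.

```python
# Hypothetical pre-flight check; key names are taken from the settings above,
# everything else (function names, grouping) is an illustrative assumption.
REQUIRED_KEYS = {
    "job intake": [
        "human-protocol.exchange-key",
        "human-protocol.exchange-id",
        "human-protocol.job-flow-url",
        "human-protocol.job-flow-key",
    ],
    "result publishing": [
        "human-protocol.s3-bucket",
        "human-protocol.s3-region",
        "human-protocol.s3-access-key-id",
        "human-protocol.s3-secret-access-key",
    ],
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return {group: [keys that are absent or empty]} for incomplete groups."""
    report = {}
    for group, keys in REQUIRED_KEYS.items():
        absent = [key for key in keys if not settings.get(key)]
        if absent:
            report[group] = absent
    return report
```

A deployment script could read the file, call `missing_keys(parse_properties(text))`, and abort if the returned dictionary is not empty.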
Job descriptions
The expiration_date field can be used to restrict the use of an invite link to the period up to the given date. If this field is set to 0, the invite link remains accessible until all documents have been annotated.
Project title and description
As there is no default key for a project title in the HUMAN Protocol Manifest Specification, it can be set via the key projectTitle in the request_config section. If this key is not specified, the job address is used as the project title:
"request_config": {
"projectTitle": "My annotation project"
}
The project description is visible to annotators on the project dashboard after they join a project. This description is put together from the requester_description and the requester_question keys of the manifest. Both fields are interpreted as Markdown, so you can use that to format your instructions to the annotators. Line breaks can be added using \n.
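The fallback for the title and the assembly of the description can be sketched in Python as follows. This is only an illustration of the rules described above, not INCEpTION's actual code: the function names are hypothetical, the language-keyed lookup mirrors the "en" entries used in the manifest examples of this guide, and joining the two fields with a blank line is an assumption, since the exact way they are combined is not specified here.

```python
def project_title(manifest, job_address):
    """Use request_config.projectTitle if present, otherwise the job address."""
    request_config = manifest.get("request_config") or {}
    return request_config.get("projectTitle") or job_address


def project_description(manifest, language="en"):
    """Combine requester_description and requester_question into Markdown text.

    Assumption: the two parts are joined by a blank line.
    """
    parts = []
    for key in ("requester_description", "requester_question"):
        value = manifest.get(key)
        if isinstance(value, dict):
            # Language-keyed value such as {"en": "..."}
            value = value.get(language)
        if value:
            parts.append(value.strip())
    return "\n\n".join(parts)
```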
Job types
The HUMAN Protocol Manifest Specification defines the general framework for describing labelling jobs that are then submitted to exchanges such as INCEpTION. INCEpTION is a text-oriented labelling tool and supports several specific settings relevant to text labelling tasks. We will briefly go through the different types of labelling jobs that are presently supported and how they can be configured in the manifest.
Word-level annotation tasks
Word-level annotation tasks require the annotator to mark one or more words that constitute some semantic entity and to label them. Optionally, the relevant spans can already be pre-marked so that the annotator only has to assign a label.
For example, the annotator might be instructed to identify words or sequences of words representing persons, organizations, and locations (the classical named entities). Note though that the choice of what to identify is completely up to you, so you could also choose to let your annotators label e.g. chemical compounds, medical conditions, machine parts, job skills - you name it.
"requester_question" : {
"en" : "Please mark any *persons*, *organizations*, *locations* and *other types of entities* in the text.\n\nTo do so, left-click with the mouse on the first word that is part of the mention, then drag it to the last word and then release the mouse button. You do not have to aim exactly at the characters, clicking and releasing anywhere within a word will automatically include the entire word. If you want to quickly mark a single word, double-left-click on it."
},
"requester_restricted_answer_set": {
"PER": { "en": "Person" },
"ORG": { "en": "Organization" },
"LOC": { "en": "Location" },
"MISC": { "en": "Other types of entities" }
},
"request_type" : "span_select",
"request_config": {
"anchoring": "tokens",
"overlap": "none",
"crossSentence": false,
"dataFormat" : "text"
}
Sentence-level annotation tasks
Sentence-level annotation tasks require the annotator to mark relevant sentences and to assign properties to them.
For example, the annotator might be instructed to mark sentences containing sentiment expressions and to assign a polarity to them (e.g. positive, neutral, negative).
"requester_question" : {
"en" : "Please mark sentences that contain a sentiment statement and assign a polarity.\n\nTo do so, double-click anywhere within a sentence to mark it and then select the polarity on the right side of the screen."
},
"requester_restricted_answer_set": {
"pos": { "en": "Positive sentiment expression" },
"neg": { "en": "Negative sentiment expression" },
"neut": { "en": "Neutral sentiment expression" }
},
"request_type" : "span_select",
"request_config": {
"anchoring": "sentences",
"overlap": "none",
"crossSentence": false,
"dataFormat" : "textlines"
}
Document-level annotation tasks
Document-level annotation tasks require the annotator to classify a document by assigning a label.
Many sentence-level annotation tasks can also be treated as document-level annotation tasks if the documents are structured such that they consist only of a single sentence (or statement). The difference from sentence-level annotation tasks is then often simply that the annotator does not have to mark relevant sentences before assigning a label to them, which saves valuable time. On the other hand, considering sentences as documents and treating them in isolation from each other also hides the context of the sentence from the annotator, making it potentially more complicated or even impossible to assign the correct labels.
For example, the annotator might be instructed to flag documents (e.g. tweets or forum comments) that contain inappropriate content.
"requester_question" : {
"en" : "Please read the tweet/forum post and if it contains inappropriate content choose the type of inappropriate content."
},
"requester_restricted_answer_set": {
"abusive-harmful": { "en": "Abusive or harmful" },
"sensitive-personal": { "en": "Exposes sensitive personal information" },
"spam": { "en": "Unsolicited advertisement or promotion of commercial activity" }
},
"request_type" : "document_classification",
"request_config": {
"dataFormat" : "text"
}
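If job descriptions are generated by a script rather than written by hand, the fragments shown in the examples above can be assembled programmatically. The following Python sketch builds the document classification fragment and serializes it to JSON. The helper name `build_job_fragment` is hypothetical, and the result is only the part of the manifest discussed in this guide; a complete manifest needs the further fields defined by the HUMAN Protocol Manifest Specification.

```python
import json

# The two request types described in this guide.
SUPPORTED_REQUEST_TYPES = ("span_select", "document_classification")


def build_job_fragment(request_type, question, labels, request_config, language="en"):
    """Assemble the job-specific part of a manifest as a plain dictionary.

    labels maps answer codes to their human-readable text.
    """
    if request_type not in SUPPORTED_REQUEST_TYPES:
        raise ValueError(f"Unsupported request type: {request_type}")
    return {
        "requester_question": {language: question},
        "requester_restricted_answer_set": {
            code: {language: text} for code, text in labels.items()
        },
        "request_type": request_type,
        "request_config": dict(request_config),
    }


fragment = build_job_fragment(
    "document_classification",
    "Please read the tweet/forum post and if it contains inappropriate "
    "content choose the type of inappropriate content.",
    {
        "abusive-harmful": "Abusive or harmful",
        "sensitive-personal": "Exposes sensitive personal information",
        "spam": "Unsolicited advertisement or promotion of commercial activity",
    },
    {"dataFormat": "text"},
)
print(json.dumps(fragment, indent=2))
```

Keeping the label codes and texts in one dictionary makes it harder for the answer set and the annotator instructions to drift apart when jobs are generated in bulk.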
Manifest settings
Table 1. Request types
Request type | Description |
---|---|
span_select | Labelling of spans of text. Use this e.g. for named entity annotation tasks or other tasks for which you could train a sequence labelling model. |
document_classification | Labelling of the entire document. Use this e.g. for marking emails as spam or other tasks for which you could train a document classification model. |
Parameter | Request type | Description |
---|---|---|
… | any | Number of annotators that need to finish labelling a document before the document is considered to be fully labelled (default: …). |
… | any | This parameter is currently ignored. |
… | span_select | If set to a decimal value between … |
 | document_classification | For document classification tasks, this parameter is currently ignored. |
Request configuration parameter | Request type | Description |
---|---|---|
projectTitle | any | The name/title of the project (if not specified, the job address is used as the default title). |
dataFormat | any | The format used for the task data (default: …). |
anchoring | span_select | The granularity of the annotations. Possible values include tokens and sentences. |
crossSentence | span_select | Whether annotations may cross sentence boundaries or not. Possible values are true and false. |
overlap | span_select | Whether annotations may overlap. Possible values include none. |
Task data
Task data can be included either directly in the job manifest using the taskdata key, or an external task data specification file can be referenced using the taskdata_uri field.
As mentioned above, the tasks operate e.g. on a word or sentence level. INCEpTION includes a basic mechanism for detecting words (tokens) and sentences, but you may have a better algorithm at hand for your specific data, or you might want to define a sentence differently (e.g. one tweet being one sentence, irrespective of any punctuation used in the tweet). You might also want to reduce the effort for your annotators, e.g. by pre-marking spans to which the annotators then only have to assign labels (marking the spans might have been a separate previous annotation job).
To support these different kinds of scenarios, INCEpTION supports various data formats. The request_config section of the job description must state which data format the task data is using.
"request_config": {
...
"dataFormat" : "textlines"
...
}
A few supported formats are given here. Additional formats may be found in the INCEpTION documentation.
- Plain text (text): Plain UTF-8 text files. Word and sentence boundaries are automatically determined by INCEpTION.
This is a simple example text. INCEpTION will detect this sentence as the second sentence. Abbreviations like Molholand Dr. can easily throw the internal sentence splitter off track.
- Plain text with one sentence per line (textlines): Plain UTF-8 text files that have been pre-formatted to contain one sentence-like unit per line. INCEpTION will treat each line in the documents as a sentence and within these sentences automatically identify word boundaries.
In this format, every line is treated as a sentence.
We no longer have problems with abbreviating drive into dr in Molholand Dr. as we did with plain text.
However, we cannot have line breaks within a single sentence anymore.
- Plain text with one sentence per line and whitespace-separated tokens (pretokenized-textlines): Plain UTF-8 text files that have been pre-formatted to contain one sentence-like unit per line. Additionally, it is expected that words are separated by spaces. INCEpTION will not try to automatically identify word boundaries but treats every space as a boundary.
In this format , every line is treated as a sentence .
Additionally , words ( tokens ) must be separated by a space character .
This provides e.g. the ability to ensure that abbreviation markers are not confused with sentence end markers .
- UIMA CAS XMI (xmi): The UIMA CAS XMI format is a flexible XML-based format able to represent complex annotations. This is the format of choice for scenarios that operate on pre-annotated data.
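If you segment your data with your own sentence splitter or tokenizer, the results can be written out in the textlines or pretokenized-textlines layout described above. The following Python sketch shows one way to do this. The function names are hypothetical, and the choice to collapse line breaks inside a sentence and to reject tokens containing whitespace is an assumption made here to keep the output consistent with the one-line-per-sentence and space-as-boundary rules.

```python
def to_textlines(sentences):
    """Render sentences in the textlines layout: one sentence-like unit per line.

    Line breaks and repeated whitespace inside a sentence are collapsed to a
    single space, since a line break would start a new sentence.
    """
    lines = [" ".join(sentence.split()) for sentence in sentences]
    return "\n".join(line for line in lines if line)


def to_pretokenized_textlines(tokenized_sentences):
    """Render token lists in the pretokenized-textlines layout.

    Each sentence becomes one line and tokens are joined by single spaces.
    Tokens must not be empty or contain whitespace, because every space is
    treated as a token boundary.
    """
    lines = []
    for tokens in tokenized_sentences:
        for token in tokens:
            if not token or any(char.isspace() for char in token):
                raise ValueError(f"Invalid token: {token!r}")
        if tokens:
            lines.append(" ".join(tokens))
    return "\n".join(lines)
```

For example, `to_pretokenized_textlines([["Molholand", "Dr.", "is", "long", "."]])` keeps the abbreviation marker attached to its token, so it cannot be mistaken for a sentence end marker.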