TEES 2.1

jbjorne edited this page Apr 24, 2013 · 28 revisions

Updates

  • Apr 24th 2013: When citing TEES in the DDIExtraction 2013 task, please use the reference shown below. Please also note that if you ran TEES yourself, you may need to cite the programs it utilizes. For further information please see the page on licenses.
@inproceedings{bjorne2013ddi,
  title={UTurku: Drug Named Entity Detection and Drug-drug Interaction Extraction Using SVM Classification and Domain Knowledge},
  author={Bj\"{o}rne, J. and Kaewphan, S. and Salakoski, T.},
  booktitle={Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013)},
  year={2013}
}
  • Apr 15th 2013: Please do not use TEES for predicting the GRN13 task. The GRN13 task does not follow the common BioNLP'13 Shared Task file format, in which event triggers and events have the same type, so type information is lost in TEES predictions. Apologies for not noticing this earlier, and therefore not being able to provide a working model for this task in time. Time permitting, updated results may be provided tomorrow.
  • Apr 13th 2013: TEES has been updated to version 2.1.1, including support for processing the BioNLP'13 test sets. Precalculated analyses for the test sets can also be downloaded.
  • Apr 7th 2013: If you used TEES or TEES precalculated analyses for the DDIExtraction 2013 task, please check this page for updated information on citing before submitting your camera-ready paper.
  • Apr 7th 2013: BioNLP'13 test set analyses will be provided shortly after all test sets become available.
  • Mar 7th 2013: Test set analyses are available for registered DDIExtraction 2013 participants. For more information, see the announcement.

TEES 2.1 is a compatibility update that adds support for the new features and data formats of the BioNLP'13 and DDIExtraction 2013 Shared Tasks. It also introduces automated annotation scheme learning, enabling TEES to be adapted to new corpora with no additional programming.

The 2.1 update can be downloaded from the downloads page or the repository, and the new datasets and models installed with the normal configure.py installation program. This page contains instructions and caveats on using TEES when participating in the BioNLP'13 and DDIExtraction 2013 tasks. It is also possible to simply use TEES analyses without running the program: pre-calculated analyses are available for most of the BioNLP'13 tasks as well as the DDIExtraction 2013 task from https://sourceforge.net/projects/tees/files/analyses/. Corresponding analyses for the test sets will be provided once they become available.

Hopefully this TEES update will also, for its part, encourage new participants to try these interesting Shared Tasks. TEES can, for example, be used as a basis for a new event extraction system, to automate intermediate tasks so that you can focus on your core research interests, or as an additional predicted dataset for system combination. System combination has been shown to improve performance by several percentage points (Kano et al. 2011), and in the BioNLP'11 Shared Task the winning entry in the GE category used the combined output from two systems (McClosky et al. 2012).

TEES 2.1 changes

In version 2.1, TEES has been updated for compatibility with the 2013 versions of the BioNLP and DDIExtraction Shared Task file formats. In terms of performance, the system remains largely unchanged from the 2.0 release. While TEES can be used for most challenges in these Shared Tasks, certain features fall outside its scope, as explained in the relevant task sections.

A major new feature is automated annotation structure analysis, used to determine training settings for new corpora. With this update, programming is no longer required to re-train TEES for a new corpus. The system can automatically optimize itself for new annotation schemes when these are represented in the flexible interaction XML file format. For further information on training TEES for a new corpus, please see the Training wiki page.

In earlier versions of TEES, while the edge example builder could process any corpus, it was inefficient on new corpora, unless example generation filtering rules were defined by inheriting a new EdgeExampleBuilder-class. In TEES 2.1, optimal edge example filtering rules are determined automatically based on the annotation scheme analysis, so no programming is needed to efficiently use the system on a novel corpus.

In TEES 2.0, the event unmerging system was specific to the BioNLP'11 corpora. In TEES 2.1, the annotation scheme analysis is used to define task-specific unmerging settings, so the system can be used directly on new event corpora.

In the 2011 Shared Task, TEES processed the Site-arguments (Task 2) in different ways depending on the task. In the current update, Site-argument representation has been unified, so that Site-arguments are treated identically to other event arguments, as outgoing interactions of the event trigger node. Additional "SiteParent" relations are generated to link "Entity" and "Protein" type entities, and when converting to the BioNLP Shared Task file format, these relations are used to link Sites to core arguments in ambiguous cases.

Confidence scores

TEES predictions come from SVM classifications, so each prediction also has an estimate of certainty, or a confidence score. TEES produces these confidence scores for predicted entities (named entities and trigger words), interactions (relations and event arguments) and unmerging (an overall event confidence score). These are stored in both the interaction XML and the BioNLP Shared Task format output, with "conf" denoting entity or interaction confidence and "umConf" the unmerging confidence.

For each prediction, confidence scores are listed for all potential classes ("neg" is always the negative class). The element's confidence is the highest value in the list. It's important to note that this is not necessarily the value of the element's class, as TEES uses merged classes to predict overlapping elements. So, for a predicted "Phosphorylation" trigger, the confidence score can for example be the value given to the merged "Phosphorylation---Regulation" class.

The example below shows an event predicted for the BioNLP'11 GENIA development set with confidence scores:

<entity charOffset="85-97" conf="neg:-88.4885287,Positive_regulation:121.967042,Phosphorylation:21.858001,Gene_expression:12.79538,Regulation:-6.038156,Binding:4.346843,Localization:7.959493,Entity:-19.764132,Negative_regulation:-10.895809,Transcription:-20.964166,Protein_catabolism:21.707699,Gene_expression---Positive_regulation:-7.137787,Phosphorylation---Positive_regulation:-9.138881,Negative_regulation---Positive_regulation:1.708512,Gene_expression---Transcription:-19.613922,Regulation---Transcription:0.0,Negative_regulation---Phosphorylation:-3.344573,Localization---Regulation:-3.247587,Binding---Positive_regulation:-5.006682,Positive_regulation---Regulation:-4.413469,Positive_regulation---Protein_catabolism:2.04659,Entity---Positive_regulation:-5.878282,Negative_regulation---Regulation:-0.321749,Localization---Positive_regulation:-11.280628,Binding---Negative_regulation:-2.581345,Positive_regulation---Transcription:-4.340616,Gene_expression---Negative_regulation:7.646308" goldIds="GE11.d0.s0.e29" headOffset="85-97" id="GE11.d0.s0.e2" negation="False" speculation="False" text="upregulation" type="Positive_regulation" umConf="neg:10.481994,Positive_regulation:91.959339,Phosphorylation:1.23414,Gene_expression:-17.726275,Regulation:-23.131412,Binding:-24.167135,Entity:-15.250921,Negative_regulation:-1.097466,Transcription:-28.712631,Protein_catabolism:7.255721,Localization:-0.845356" />
<interaction conf="neg:27.421903,Theme:87.191324,Cause:34.394885,ToLoc:-38.10363,Cause---Theme:-4.655489,SiteParent:-24.587815,Site:-49.002859,AtLoc:-32.658392" directed="True" e1="GE11.d0.s0.e2" e1DuplicateIds="" e2="GE11.d0.s0.e1" e2DuplicateIds="" event="True" id="GE11.d0.s0.i0" type="Theme" />
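As a minimal sketch of how such a confidence string could be read, the Python snippet below splits a "conf" or "umConf" value into class/score pairs and picks the highest score, which the text above describes as the element's confidence. The function names parse_conf and element_confidence are hypothetical helpers for illustration and are not part of TEES.

```python
def parse_conf(conf):
    """Parse a string such as 'neg:-1.5,Theme:2.0' into {class name: score}."""
    scores = {}
    for item in conf.split(","):
        # Split on the last colon so merged class names such as
        # 'Cause---Theme' stay intact and negative values parse correctly.
        name, value = item.rsplit(":", 1)
        scores[name] = float(value)
    return scores


def element_confidence(conf):
    """Return (class name, score) for the highest score in the list."""
    scores = parse_conf(conf)
    best = max(scores, key=scores.get)
    return best, scores[best]


# The "conf" attribute of the interaction element in the example above.
conf = ("neg:27.421903,Theme:87.191324,Cause:34.394885,ToLoc:-38.10363,"
        "Cause---Theme:-4.655489,SiteParent:-24.587815,Site:-49.002859,"
        "AtLoc:-32.658392")
print(element_confidence(conf))  # ('Theme', 87.191324)
```

Note that the class with the highest score can be a merged class (for example "Phosphorylation---Regulation"), so the returned class name does not have to match the element's "type" attribute.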

BioNLP'13 Shared Task

TEES 2.1 is compatible with all BioNLP'13 event and relation extraction tasks. The BB13 subtask 1 (ontology term assignment) falls outside the current scope of TEES and is thus not directly supported. Even in the supported BB13 subtasks 2 and 3, the majority of relations cross sentence boundaries, and as TEES does not detect such cross-sentence relations, performance will be very limited.

Performance is hard to estimate due to the complexity of the tasks and lack of official evaluator programs. In general, TEES performance in terms of F-score should be in the 40-50% range for most tasks, and around 10% for BB. For the GRN task, for which an official evaluator is provided, TEES 2.1 devel set performance is at 47.71% using the Relaxed F-score metric.

The BioNLP'13 file format introduces several new aspects. TEES 2.1 has been updated to handle disjoint entities (entities with discontinuous character offsets), but as TEES represents entities with a single syntactic head token, these do not have much impact on the system. In GE13, relations can occur between entities, but TEES processes only relations linking named entities and triggers.

Confidence scores in the BioNLP Shared Task files

The trigger and event confidence scores are saved in the a1/a2-format output files. The file format produced by TEES is identical to the official BioNLP'13 format, with the addition of a new X-annotation type for arbitrary extended data. As with other a1/a2 annotations, the X lines can appear in an arbitrary order, and, as with the modifier lines, the number after the X must be unique but is not meaningful. Each X line contains a list of tab-separated triplets, where a single space separates the element identifier, variable name and variable value. The element identifier can be a T, an E or an R annotation id or, in the case of event arguments, an E-id concatenated with a period to the event argument definition. An arbitrary number of data elements can be stored in one X-line; in the TEES output, the confidence scores are grouped by trigger and event.

The example below shows an event predicted for the BioNLP'11 GENIA development set with confidence scores, corresponding to the XML representation shown earlier:

T30	Positive_regulation 85 97	upregulation
X1	T30 conf neg:-88.4885287,Positive_regulation:121.967042,Phosphorylation:21.858001,Gene_expression:12.79538,Regulation:-6.038156,Binding:4.346843,Localization:7.959493,Entity:-19.764132,Negative_regulation:-10.895809,Transcription:-20.964166,Protein_catabolism:21.707699,Gene_expression---Positive_regulation:-7.137787,Phosphorylation---Positive_regulation:-9.138881,Negative_regulation---Positive_regulation:1.708512,Gene_expression---Transcription:-19.613922,Regulation---Transcription:0.0,Negative_regulation---Phosphorylation:-3.344573,Localization---Regulation:-3.247587,Binding---Positive_regulation:-5.006682,Positive_regulation---Regulation:-4.413469,Positive_regulation---Protein_catabolism:2.04659,Entity---Positive_regulation:-5.878282,Negative_regulation---Regulation:-0.321749,Localization---Positive_regulation:-11.280628,Binding---Negative_regulation:-2.581345,Positive_regulation---Transcription:-4.340616,Gene_expression---Negative_regulation:7.646308
E1	Positive_regulation:T30 Theme:T2
X13	E1 umConf neg:10.481994,Positive_regulation:91.959339,Phosphorylation:1.23414,Gene_expression:-17.726275,Regulation:-23.131412,Binding:-24.167135,Entity:-15.250921,Negative_regulation:-1.097466,Transcription:-28.712631,Protein_catabolism:7.255721,Localization:-0.845356	E1:Theme:T2 conf neg:27.421903,Theme:87.191324,Cause:34.394885,ToLoc:-38.10363,Cause---Theme:-4.655489,SiteParent:-24.587815,Site:-49.002859,AtLoc:-32.658392
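The X lines can be read with a few lines of Python. The sketch below follows the description above (tab-separated triplets, each split by single spaces into element identifier, variable name and variable value) and treats the element identifier as an opaque string. The function name parse_x_line is a hypothetical helper, and the sample line is an abbreviated version of the X13 line shown above.

```python
def parse_x_line(line):
    """Split one X line into its id and a list of (element, name, value)."""
    fields = line.rstrip("\n").split("\t")
    x_id = fields[0]
    records = []
    for triplet in fields[1:]:
        # Element identifier, variable name and variable value are
        # separated by single spaces; the value itself has no spaces.
        element_id, name, value = triplet.split(" ", 2)
        records.append((element_id, name, value))
    return x_id, records


# Abbreviated version of the X13 line above (score lists shortened).
line = ("X13\tE1 umConf neg:10.481994,Positive_regulation:91.959339"
        "\tE1:Theme:T2 conf neg:27.421903,Theme:87.191324\n")
x_id, records = parse_x_line(line)
for element_id, name, value in records:
    print(x_id, element_id, name, value)
```

Each value is still a comma-separated list of class:score pairs, so it can be split further in the same way as the "conf" attribute in the interaction XML.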

DDIExtraction 2013 Shared Task (SemEval task 9)

TEES 2.1 provides a system compatible with the data format of the DDIExtraction 2013 Shared Task (SemEval task 9), Task 9.2: Extraction of drug-drug interactions. Analyses provided for this task are TEES drug-drug interaction predictions, BLLIP Penn tree-bank style parses (using the McClosky biomodel), Stanford dependency parses (in the collapsed CC-processed format) and syntactic head offsets for drug entities. Parsing failed for 71 out of the 6976 training sentences. This is normal, and such sentences are marked by their parse-elements having an empty "pennstring" attribute.
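Sentences with failed parses can be located by looking for the empty "pennstring" attribute. The Python sketch below assumes that each parse element is a descendant of its sentence element and that sentences carry an "id" attribute. The sample XML is a hypothetical, heavily reduced stand-in for the real analysis file, and failed_parse_sentences is an illustrative helper rather than a TEES function.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; real files contain many more attributes.
SAMPLE = """<corpus>
  <document id="d0">
    <sentence id="d0.s0"><parse pennstring="(S1 (S (NP (NN x))))"/></sentence>
    <sentence id="d0.s1"><parse pennstring=""/></sentence>
  </document>
</corpus>"""


def failed_parse_sentences(root):
    """Return ids of sentences that have a parse with an empty pennstring."""
    failed = []
    for sentence in root.iter("sentence"):
        for parse in sentence.iter("parse"):
            if parse.get("pennstring") == "":
                failed.append(sentence.get("id"))
                break  # one failed parse is enough to flag the sentence
    return failed


print(failed_parse_sentences(ET.fromstring(SAMPLE)))  # ['d0.s1']
```

For a gzip-compressed file, the root element could be obtained with ET.parse(gzip.open(path)).getroot() from the standard library before calling the function.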

All of these analyses are available in the interaction XML file DDI13-train-TEES-analyses-130224.xml.gz. The DDI13 documents have been concatenated into a single corpus. Document, sentence and entity identifiers remain unchanged, and original file names are also marked in the source-attributes of the document elements. Predicted interaction identifiers are not comparable to original gold interaction identifiers, so interactions should be compared by their "e1" and "e2" attributes. The output from TEES 2.1 is in an updated version of the interaction XML format, but the differences from the official DDI13 format are minimal. Namely, ddi-elements are interaction-elements, and disjoint character offsets (offsets with gaps in the span) use a comma as a span separator instead of a semicolon.
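A minimal Python sketch of such a comparison is given below. It collects ("e1", "e2") pairs from predicted interaction-elements and from gold elements and intersects them, ignoring the interaction ids, and it also converts a comma-separated disjoint offset to the semicolon form. The sample XML snippets, all identifiers in them, the gold tag name "ddi" (taken from the wording above) and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature inputs; ids and types are made up.
PREDICTED = """<corpus><document id="d0"><sentence id="d0.s0">
  <interaction id="d0.s0.i0" e1="d0.s0.e0" e2="d0.s0.e1" type="effect"/>
  <interaction id="d0.s0.i1" e1="d0.s0.e0" e2="d0.s0.e2" type="advise"/>
</sentence></document></corpus>"""

GOLD = """<document id="d0"><sentence id="d0.s0">
  <ddi id="d0.s0.d7" e1="d0.s0.e0" e2="d0.s0.e1" type="effect"/>
</sentence></document>"""


def entity_pairs(root, tag):
    """Collect (e1, e2) pairs from all elements with the given tag."""
    return {(el.get("e1"), el.get("e2")) for el in root.iter(tag)}


def to_semicolon_offset(char_offset):
    """Convert '10-15,20-25' (comma separator) to '10-15;20-25'."""
    return char_offset.replace(",", ";")


predicted = entity_pairs(ET.fromstring(PREDICTED), "interaction")
gold = entity_pairs(ET.fromstring(GOLD), "ddi")
# Pairs present in both sets, matched by entity ids rather than interaction ids.
print(sorted(predicted & gold))
print(to_semicolon_offset("10-15,20-25"))  # 10-15;20-25
```

The pairs are compared as ordered tuples here; if the order of "e1" and "e2" can differ between the two files, comparing frozensets of the two ids would be a more tolerant choice.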

The system used to produce these analyses is functionally almost identical to TEES 2.0 (the system used to produce the University of Turku entry in the 2011 DDI Shared Task). It should be noted that MetaMap features are not used this time, that thresholding is not used due to multiclass classification, and that no further optimization towards the specifics of the 2013 Shared Task has been done.

A word of caution

While TEES has been shown to work on several earlier shared tasks, a new dataset introduces new complexities, and there are no guarantees that the system is producing correct or good quality output. All of these resources are used at your own responsibility, and if possible, you should always use other methods to verify and evaluate the TEES output. TEES has not been optimized for these new tasks, so while the results produced by it can be helpful, they should only be considered an additional resource or a starting point.

License terms

TEES is GPL licensed and can be used freely. Use resulting in publications should be cited, and this page will later be updated with information on citing TEES 2.1 if used in the current tasks. For more information on the license terms and conditions of TEES and the other programs it utilizes, please see the wiki page Licenses.
