KAF structure overview
KAF is structured in layers. Each layer is usually the result of a particular analysis module of the chain. Some layers have pointers to elements of other layers through the elements' ids, which is done using "span" elements. Also, some layers (the more basic ones) are a prerequisite for the processes that produce other layers (the more advanced ones).
The main layers and elements are the following:
- KAF root element
- KAF Header
- WordForms
- Terms
- Dependencies
- Chunks
- Constituents
- Named entities
- Coreference
- [Features of sentiments](#features)
- Relations
- [Opinions](#opinions)
## KAF DTD
All KAF documents have a root element <KAF> which has the following attributes:
- xml:lang: language identifier.
- version: the version of KAF. For OpeNER, we use version "v1.opener".
Example:
<KAF xml:lang="en" version="v1.opener">
<!-- ... -->
</KAF>

KAF documents may have a header for describing information about the document,
such as its original name, URI or a list of the linguistic processors which
generated the KAF document. The KAF header is represented within the
<kafHeader> element, which is optional but highly recommended. The header
element has three sub-elements:
<fileDesc> is an empty element containing information about the computer
document itself. It has the following attributes:
- author (optional): the author of the document.
- title (optional): the title of the document
- creationtime (optional): when the document was created, in ISO 8601.
- filename (optional): the original file name.
- filetype (optional): the original format (PDF, HTML, DOC, etc).
- pages (optional): number of pages of the original document.
Example:
<fileDesc title="3.2012" author="casa400" filename="residence_hostal" filetype="PDF" pages="19" />

<public> is an empty element which stores public information about the
document, such as its URI. It has the following attributes:
- publicId (optional): a public identifier (for instance, the number inserted by the capture server).
- uri (optional): a public URI of the document.
Example:
<public publicId="3.3012" uri="http://casa400.com/docs/residence.pdf" />

The header also stores the information about which linguistic processors
produced the KAF document, described under <linguisticProcessors> elements.
There can be several <linguisticProcessors> elements, one per KAF layer. KAF
layers correspond to the top-level elements of the documents, such as "text",
"terms", "deps" etc. Each <linguisticProcessors> element contains one or
several <lp> elements, each one describing one specific linguistic processor.
<lp> elements have the following attributes:
- name: the name of the processor
- version: processor’s version
- timestamp: a timestamp, denoting the date/time at which the processor was launched. The timestamp follows the XML Schema xs:dateTime type.
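A minimal sketch of how a consumer might check an lp timestamp, assuming only the UTC form with a trailing "Z" that appears in this document's examples (the full xs:dateTime lexical space, with fractional seconds and time zone offsets, is wider). The function names are illustrative and not part of KAF.

```python
from datetime import datetime, timezone


def parse_lp_timestamp(value: str) -> datetime:
    """Parse a timestamp of the form YYYY-MM-DDThh:mm:ssZ as a UTC datetime.

    Assumption: only the 'Z' (UTC) form is handled, not every xs:dateTime value.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_valid_lp_timestamp(value: str) -> bool:
    """Return True if the value parses with parse_lp_timestamp."""
    try:
        parse_lp_timestamp(value)
    except ValueError:
        return False
    return True


print(is_valid_lp_timestamp("2012-06-25T10:05:00Z"))
```

A value that does not follow this shape, for example one that uses an underscore instead of the "T" separator, is rejected by this check.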
Example:
<linguisticProcessors layer="text">
<lp name="Freeling" version="2.1" timestamp="2012-06-25T10:05:00Z"/>
</linguisticProcessors>
<linguisticProcessors layer="terms">
<lp name="Freeling" version="2.1" timestamp="2012-06-25T10:10:19Z"/>
<lp name="ukb" version="0.1.2" timestamp="2012-06-25T16:10:19Z"/>
</linguisticProcessors>
<linguisticProcessors layer="namedEntities">
<lp name="Standfort_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
<lp name="kybot_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
</linguisticProcessors>

Full example of a KAF header:
<kafHeader>
<fileDesc title="3.3012" author="casa400" filename="residence_hostal" filetype="PDF" pages="19"/>
<public publicId="3.3012"
uri="http://casa400.com/docs/residence.pdf" />
<linguisticProcessors layer="text">
<lp name="Freeling" version="2.1" timestamp="2012-06-25T10:05:00Z"/>
</linguisticProcessors>
<linguisticProcessors layer="terms">
<lp name="Freeling" version="2.1" timestamp="2009-06-25T10:10:19Z"/>
<lp name="ukb" version="0.1.2" timestamp="2012-06-25T16:10:19Z"/>
</linguisticProcessors>
<linguisticProcessors layer="namedEntities">
<lp name="kybot_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
</linguisticProcessors>
</kafHeader>

After the tokenization step, all word forms are annotated within the <text> element, and each form is enclosed by a <wf> element. <wf> elements have the following attributes:
- wid: the id for the word form (unique in the file).
- sent (optional): sentence id of the token
- para (optional): paragraph id
- page (optional): page id
- offset (optional): The offset (in characters) of the original word form
- length (optional): The length (in characters) of the original word form
- xpath (optional): in case of source xml files, the xpath expression identifying the original word form
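The relation between offset, length and the original text can be sketched as follows. This is only an illustration: the three-token fragment and the raw string are reduced sample data, and check_offsets is a hypothetical helper, not part of any KAF tool.

```python
import xml.etree.ElementTree as ET

# Reduced sample data: a raw text and a matching <text> layer.
RAW = "John taught mathematics"
KAF_TEXT = """<text>
  <wf wid="w1" offset="0" length="4" sent="1" para="1">John</wf>
  <wf wid="w2" offset="5" length="6" sent="1" para="1">taught</wf>
  <wf wid="w3" offset="12" length="11" sent="1" para="1">mathematics</wf>
</text>"""


def check_offsets(text_layer: ET.Element, raw: str) -> list[str]:
    """Return the wids whose offset/length do not select the token in raw."""
    bad = []
    for wf in text_layer.iter("wf"):
        offset, length = wf.get("offset"), wf.get("length")
        if offset is None or length is None:
            continue  # both attributes are optional, so skip tokens without them
        start = int(offset)
        if raw[start:start + int(length)] != wf.text:
            bad.append(wf.get("wid"))
    return bad


mismatches = check_offsets(ET.fromstring(KAF_TEXT), RAW)
print(mismatches)  # prints []
```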
Examples of word level annotations:
<text>
<wf wid="w1" offset="0" length="4" sent="1" para="1">John</wf>
<wf wid="w2" offset="5" length="6" sent="1" para="1">taught</wf>
<wf wid="w3" offset="12" length="11" sent="1" para="1">mathematics</wf>
<wf wid="w4" offset="24" length="2" sent="1" para="1">20</wf>
<wf wid="w5" offset="27" length="7" sent="1" para="1">minutes</wf>
<wf wid="w6" offset="35" length="5" sent="1" para="1">every</wf>
<wf wid="w7" offset="41" length="6" sent="1" para="1">Monday</wf>
<wf wid="w8" offset="48" length="2" sent="1" para="1">in</wf>
<wf wid="w9" offset="51" length="3" sent="1" para="1">New</wf>
<wf wid="w10" offset="55" length="4" sent="1" para="1">York</wf>
<wf wid="w11" offset="59" length="1" sent="1" para="1">.</wf>
</text>

Terms refer to previous word forms (and group multiwords) and attach lemma,
part of speech, synset and named entity information. <term> elements have the
following attributes:
- tid: unique identifier, starting with "t"
- type: type of the term. Currently, 2 values are possible:
  - open: open class word
  - close: closed class word
- lemma: lemma of the term
- pos: part of speech (Section 3.4.1).
  - common noun (N)
  - proper noun (R)
  - pronoun (Q)
  - adjective (G)
  - verb (V)
  - preposition (P)
  - adverb (A)
  - conjunction (C)
  - determiner (D)
  - other (O)
- morphofeat (optional): morphosyntactic feature encoded as a single attribute.
- head: if the term is a compound, the id of the head component (Section 3.4.4).
- case (optional): declension case of the term.
<term> elements have the following sub-element:
- span: this element spans the target word. Target elements are used to refer to the target word, using word ids (wid). If the term is a multiword, multiple target elements are used.
The pos attribute must consist of one single letter from the following set:

| Code | Part of speech | Code | Part of speech |
|---|---|---|---|
| N | common noun | V | verb |
| R | proper noun | P | preposition |
| Q | pronoun | A | adverb |
| D | determiner | C | conjunction |
| G | adjective | O | other |
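The span mechanism described above (a term pointing at word forms through target ids) can be sketched as below. The two-token document is made-up sample data and term_surface_forms is a hypothetical helper; the sketch assumes every target id has a matching wf element.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one multiword term spanning two word forms.
KAF = """<KAF xml:lang="en" version="v1.opener">
  <text>
    <wf wid="w1">New</wf>
    <wf wid="w2">York</wf>
  </text>
  <terms>
    <term tid="t1" lemma="New_York" pos="R">
      <span><target id="w1"/><target id="w2"/></span>
    </term>
  </terms>
</KAF>"""


def term_surface_forms(root: ET.Element) -> dict[str, list[str]]:
    """Map each term id to the word forms its span points at, in span order."""
    words = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    forms = {}
    for term in root.iter("term"):
        target_ids = [target.get("id") for target in term.findall("span/target")]
        forms[term.get("tid")] = [words[wid] for wid in target_ids]
    return forms


print(term_surface_forms(ET.fromstring(KAF)))
```

The same lookup pattern applies to higher layers, whose spans point at term ids instead of word ids.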
The term layer represents sentiment information which is context-independent and can be found in a sentiment lexicon. It is related to concepts expressed by words/terms (e.g. beautiful) or multi-word expressions (e.g. out of order). Sentiment information can be stored at word level and at sense/synset level. In the latter case, the sentiment information is included in the "external_reference" section and a WSD process may identify the correct sense with its sentiment information. The extension contains the following information categories.
<sentiment> elements have the following attributes:
- Resource: identifier and reference to an external sentiment resource
- Polarity: refers to the property of a word to express positive, negative or no sentiment. These values are possible:
  - Positive
  - Negative
  - Neutral
  - Or a value on a numerical scale
- Strength: refers to the strength of the polarity
  - Weak
  - Average
  - Strong
  - Or a numerical value
- Subjectivity: refers to the property of a word to express an opinion (or not)
  - Subjective/Objective
  - Factual/opinionated
- Sentiment_semantic_type: refers to a sentiment-related semantic type
  - Aesthetics_evaluation
  - Moral_judgment
  - Emotion
  - etc.
- Sentiment_modifier: refers to words which modify the polarity of another word
  - Intensifier/weakener, polarity shifter
- Sentiment_marker: refers to words which do not carry polarity themselves, but act as vehicles of it
  - Find, think, in my opinion, according to...
- Sentiment_product_feature: refers to a domain; mainly used in feature-based sentiment analysis
  - Values are related to a specific domain. For the tourist domain, for example: staff, cleanliness, beds, bathroom, transportation, location, etc.
Beautiful polarity = positive; strength=average; subjectivity=subjective; sentiment_semantic_type=aesthetics_evaluation; sentiment_modifier=""; sentiment_product_feature=general;
Valley polarity=negative; strength=average; subjectivity=factual; sentiment_semantic_type=""; sentiment_modifier=""; sentiment_product_feature=beds;
Very polarity=""; strength=""; subjectivity=""; sentiment_semantic_type=""; sentiment_modifier="intensifier"; sentiment_product_feature=""
<term tid="t2" lemma="nice" pos="G" type="open">
<sentiment resource="VUA_polarityLexicon_word" polarity="positive" strength="average"
subjectivity="subjective"
sentiment_semantic_type="behaviour/trait"
sentiment_product_feature="" />
<span>
<target id="w2"/>
</span>
<externalReferences/>
</term>
<term tid="t5" lemma="warm" pos="G" type="open">
<sentiment/>
<span>
<target id="w5"/>
</span>
<externalReferences>
<externalRef resource="WN-ENG" reference="c_1009" confidence="0.38">
<sentiment resource="VUA_polarityLexicon_synset" polarity="positive" strength="average"
subjectivity="subjective"
sentiment_semantic_type="behaviour/traitEvaluation"
sentiment_product_feature="" />
</externalRef>
<externalRef resource="WN-ENG" reference="c_1008" confidence="0.31">
<sentiment resource="VUA_polarityLexicon_synset" polarity="positive" strength="average"
subjectivity="objective"
sentiment_semantic_type="temperature" sentiment_product_feature=""/>
</externalRef>
</externalReferences>
</term>

The optional <externalReferences> element is used to associate terms with external lexical or semantic resources, such as elements of a knowledge base: a semantic lexicon (like WordNet) or an ontology. It consists of several <externalRef> elements, one per association.
The <externalRef> elements may be nested, meaning that each externalRef refines the relation expressed by the parent element.
<externalRef> elements have the following attributes:
- resource (required): identifier of the resource referred to, in the format "resource_language_version", e.g. wn_en_15, wn_it_30, wn_de_18
- reference (required): code of the referred element. For example, if the element is a synset of some version of WordNet, it follows the pattern [0-9]+-[nvars] (e.g. 2345-n, 662345-v), a string composed of two fields separated by a dash:
  - Synset identifier composed of digits.
  - POS character:
    - n noun
    - v verb
    - a adjective
    - r adverb
  Examples of valid patterns are: "12345678-n", "017403-v", etc.
- reftype (optional): indicates the kind of relation the externalRef is expressing. The reftype attribute depends on the kind of external reference. For WordNet it may have values like 'sc_DomainOf', 'sc_SubclassOf', etc. An empty reftype indicates a direct relationship.
- status (optional): indicates the status of the relationship.
- source (optional): the name of the process which created the external reference.
- confidence (optional): a floating number between 0 and 1. Indicates the confidence weight of the association.
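The WordNet reference pattern and the confidence range given above can be checked with a few lines of code. This is a sketch under the assumption that the whole attribute value must match the pattern; the function names are illustrative.

```python
import re

# The pattern quoted in the attribute description: digits, a dash, one POS character.
SYNSET_REFERENCE = re.compile(r"[0-9]+-[nvars]")


def is_wordnet_reference(reference: str) -> bool:
    """True if the whole string matches [0-9]+-[nvars]."""
    return SYNSET_REFERENCE.fullmatch(reference) is not None


def is_valid_confidence(value: str) -> bool:
    """True if the attribute value is a number between 0 and 1 inclusive."""
    try:
        number = float(value)
    except ValueError:
        return False
    return 0.0 <= number <= 1.0


print(is_wordnet_reference("12345678-n"), is_valid_confidence("0.80"))
```

References written in another style, such as the eng-17-00861095-v form used in the term examples of this document, do not match this pattern, so a real validator would need to know which convention a given resource follows.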
Compound terms can be represented in KAF by including <component> elements within <term> elements. For example, the Dutch term landbouwbeleid (English: agriculture policy) would look like this:
<term tid="t7" head="t7.1" lemma="landbouwbeleid" pos="N" type="open">
<span><target id="w7"/></span>
<component id="t7.1" lemma="landbouw" pos="N">
<externalReferences>...</externalReferences>
</component>
<component id="t7.2" lemma="beleid" pos="N">
<externalReferences>...</externalReferences>
</component>
<externalReferences>...</externalReferences>
</term>

The <component> elements have the following attributes:
- id: unique identifier
- lemma: lemma of the term
- pos: part of speech
- case (optional): declension case
Components of compound terms should all have the term identifier of the whole word before the full stop, and an integer after it, increasing in linear order.
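The identifier rule for components can be sketched as a small check. It assumes that "increasing in linear order" means strictly increasing integers after the full stop; component_ids_are_valid is a hypothetical helper.

```python
def component_ids_are_valid(term_id: str, component_ids: list[str]) -> bool:
    """Check ids such as t7.1, t7.2 against the term id t7.

    Each id must be '<term_id>.<integer>' and the integers must strictly increase.
    """
    previous = 0
    for component_id in component_ids:
        prefix, separator, suffix = component_id.rpartition(".")
        if separator != "." or prefix != term_id or not suffix.isdecimal():
            return False
        number = int(suffix)
        if number <= previous:
            return False
        previous = number
    return True


print(component_ids_are_valid("t7", ["t7.1", "t7.2"]))  # prints True
```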
Multiword terms can be represented in KAF by including more than one <target>
element within <term> elements. Like compound terms, multiwords have
<component> elements.
For example, the two terms "Prime" and "Minister", before the analysis, are shown below:
<term tid="t5" lemma="Prime" pos="N" type="open">
<span>
<target id="w5"/>
</span>
<externalReferences>...</externalReferences>
</term>
<term tid="t6" lemma="Minister" pos="N" type="open">
<span>
<target id="w6"/>
</span>
<externalReferences>...</externalReferences>
</term>

After the multiword analysis, they become:
<term tid="t5" lemma="Prime_Minister" pos="N" head="t5.2" type="open">
<span>
<target id="w5"/>
<target id="w6"/>
</span>
<component id="t5.1" lemma="Prime">
<externalReferences>...</externalReferences>
</component>
<component id="t5.2" lemma="Minister">
<externalReferences>...</externalReferences>
</component>
<externalReferences>...</externalReferences>
</term>

All the layers referring to t5 and t6 will be updated.
The <component> elements have the following attributes:
- id: unique identifier
- lemma: lemma of the term
- pos: part of speech
- case (optional): declension case
Components of multiword terms should all have the term identifier of the whole word before the full stop, and an integer after it, increasing in linear order.
<terms>
<term tid="t1" lemma="John" pos="R">
<span><target id="w1"/></span>
</term>
<term tid="t2" type="open" lemma="teach" pos="V">
<span><target id="w2"/></span>
<externalReferences>
<externalRef resource="WN-1.7" reference="eng-17-00861095-v" confidence="0.80">
<externalRef resource="ontology" reference="Teach" reftype="SubClassOf">
<externalRef resource="ontology" reference="Human" reftype="agent"/>
<externalRef resource="ontology" reference="Human" reftype="patient"/>
</externalRef>
</externalRef>
<externalRef resource="WN-1.7" reference="eng-17-00859568-v" confidence="0.20"/>
</externalReferences>
</term>
<term tid="t3" type="open" lemma="mathematics" pos="N">
<span><target id="w3"/></span>
<externalReferences>
<externalRef resource="WN-1.7" reference="eng-17-04597590-n" confidence="0.99"/>
</externalReferences>
</term>
<term tid="t4" lemma="20" pos="N">
<span><target id="w4"/></span>
</term>
<term tid="t5" type="open" lemma="minute" pos="N">
<span><target id="w5"/></span>
</term>
<term tid="t6" type="close" lemma="every" pos="D">
<span><target id="w6"/></span>
</term>
<term tid="t7" lemma="Monday" pos="N">
<span><target id="w7"/></span>
<externalReferences>
<externalRef resource="WN-1.7" reference="eng-17-12557842-n" confidence="0.99"/>
</externalReferences>
</term>
<term tid="t8" type="close" lemma="in" pos="P">
<span><target id="w8"/></span>
</term>
<term tid="t9" lemma="New_York" pos="R">
<span>
<target id="w9"/>
<target id="w10"/>
</span>
</term>
</terms>

Dependencies represent dependency relations among terms. Each dependency is represented by an empty <dep> element that refers to previous terms. The <dep> element has the following attributes:
- from: term id of the source element
- to: term id of the target element
- rfunc: relational function. One of subj (grammatical subject), obj (objects and/or complements), or mod (modifier of noun or verb):
  - mod: indicates the word introducing the dependent in a head-modifier relation. For instance:
    - mod(by,gift,Peter): the gift of a book by Peter
    - mod(of,examination,patient): the examination of the patient
  - subj: indicates the subject in the grammatical relation Subject-Predicate. For instance:
    - subj(arrive,John,_): John arrived in Paris
    - subj(employ,Microsoft,_): Microsoft employed 10 C programmers
    - subj(employ,Paul,obj): Paul was employed by Microsoft
  - csubj, xsubj, ncsubj: the grammatical relations (GRs) csubj and xsubj may be used for clausal subjects, controlled from within, or without, respectively. ncsubj is a non-clausal subject. For instance:
    - xsubj(win,require,_): to win the America's Cup requires heaps of cash
  - dobj: indicates the object in the grammatical relation between a predicate and its direct object. For instance:
    - dobj(read,book,_): read books
  - iobj: the relation between a predicate and a non-clausal complement introduced by a preposition; type indicates the preposition introducing the dependent. For instance:
    - iobj(in,arrive,Spain): arrive in Spain
    - iobj(into,put,box): put the tools into the box
    - iobj(to,give,poor): give to the poor
  - obj2: the relation between a predicate and the second non-clausal complement in ditransitive constructions, written obj2(head,dependent). For instance:
    - obj2(give,present): give Mary a present
    - obj2(mail,contract): mail Paul the contract
- case (optional): declension case
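The from and to attributes define a directed graph over term ids. A minimal sketch of checking such a graph for directed cycles with a depth-first search is shown below; has_directed_cycle is a hypothetical helper and the two small layers are made-up sample data.

```python
import xml.etree.ElementTree as ET


def has_directed_cycle(deps_layer: ET.Element) -> bool:
    """Return True if the from -> to edges of the <dep> elements contain a cycle."""
    graph = {}
    for dep in deps_layer.iter("dep"):
        graph.setdefault(dep.get("from"), []).append(dep.get("to"))

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = ON_PATH
        for successor in graph.get(node, []):
            colour = state.get(successor, UNSEEN)
            if colour == ON_PATH:
                return True  # reached a node that is still on the current path
            if colour == UNSEEN and visit(successor):
                return True
        state[node] = DONE
        return False

    return any(state.get(node, UNSEEN) == UNSEEN and visit(node) for node in list(graph))


ACYCLIC = ET.fromstring(
    '<deps><dep from="t1" to="t2" rfunc="subj"/><dep from="t3" to="t2" rfunc="dobj"/></deps>'
)
CYCLIC = ET.fromstring(
    '<deps><dep from="t1" to="t2" rfunc="mod"/><dep from="t2" to="t1" rfunc="mod"/></deps>'
)
print(has_directed_cycle(ACYCLIC), has_directed_cycle(CYCLIC))  # prints False True
```

The recursive search is adequate for sentence-sized graphs; a very large graph would call for an iterative traversal instead.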
There is no requirement that the dependency graph be complete, but it should not have any directed cycles. Example of dependency relation annotations:
<deps>
<!-- subj(teach, John) -->
<dep from="t1" to="t2" rfunc="subj" />
<!-- dobj(teach, Mathematics) -->
<dep from="t3" to="t2" rfunc="dobj" />
<!-- iobj(teach, New_York) -->
<dep from="t9" to="t2" rfunc="iobj" />
</deps>

Chunks are noun, verb or prepositional phrases, spanning terms.
<chunk> elements have the following attributes:
- cid: unique identifier
- head: the chunk head’s term id
- phrase: type of the phrase
- case (optional): declension case
Example of chunk annotations:
<chunks>
<!-- John -->
<chunk cid="c1" head="t1" phrase="NP">
<span><target id="t1"/></span>
</chunk>
<!-- taught -->
<chunk cid="c2" head="t2" phrase="V">
<span><target id="t2"/></span>
</chunk>
<!-- Mathematics -->
<chunk cid="c3" head="t3" phrase="NP">
<span><target id="t3"/></span>
</chunk>
<!-- 20 minutes -->
<chunk cid="c5" head="t5" phrase="NP">
<span><target id="t4"/><target id="t5"/></span>
</chunk>
<!-- every -->
<chunk cid="c6" head="t6" phrase="R">
<span><target id="t6"/></span>
</chunk>
<!-- every Monday -->
<chunk cid="c7" head="t7" phrase="NP">
<span><target id="t6"/><target id="t7"/></span>
</chunk>
<!-- in New York -->
<chunk cid="c9" head="t9" phrase="PP">
<span><target id="t8"/><target id="t9"/></span>
</chunk>
</chunks>

This layer represents the output of constituency parsers, i.e., the full
syntactic tree of sentences. The top element of the layer is
<constituency>, and each sentence (parse tree) is represented by a
<tree> element. Inside each <tree>, there are three types
of elements:
- <nt> elements representing non-terminal nodes.
- <t> elements representing terminal nodes.
- <edge> elements representing in-tree edges.
The <nt> element represents the internal nodes of the parse
tree. It has the following attributes:
- id (required): unique identifier starting with the prefix "nter";
- label (required): the category label of the node (for instance, 'S', 'NP', 'VP', etc).
The <t> element represents the leaf nodes of the parse tree. It has the following attribute:
- id (required): unique identifier starting with the prefix "ter";
The <t> element contains a <span> element pointing to the term layer. Each <span> contains one or more <target> elements, with the following attribute:
- id (required): id of the target term.
The <edge> element links a child node with its parent. It has the following attributes:
- id (optional): unique identifier starting with the prefix "tre" (tree edge);
- from (required): id of the child node. The child node can be a terminal or a non-terminal.
- to (required): id of the parent node. The parent node is a non-terminal.
- head (optional): a "yes" value indicates that the child node is the head constituent of the sub-tree pointed by the parent node.
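How the three element types combine into a tree can be sketched by rebuilding a bracketed string from a <tree> element. The sketch assumes that the order of siblings is taken from the document order of the <edge> elements, and it prints term ids at the leaves because no text layer is available here. The small NP tree is made-up sample data and bracketed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TREE = """<tree>
  <nt id="nter0" label="NP"/>
  <nt id="nter1" label="DET"/>
  <nt id="nter2" label="NN"/>
  <t id="ter1"><span><target id="t1"/></span></t>
  <t id="ter2"><span><target id="t2"/></span></t>
  <edge id="tre1" from="nter1" to="nter0"/>
  <edge id="tre2" from="ter1" to="nter1"/>
  <edge id="tre3" from="nter2" to="nter0" head="yes"/>
  <edge id="tre4" from="ter2" to="nter2"/>
</tree>"""


def bracketed(tree: ET.Element) -> str:
    """Render a <tree> as a bracketed string, marking head children with '*'."""
    labels = {nt.get("id"): nt.get("label") for nt in tree.iter("nt")}
    leaves = {
        t.get("id"): " ".join(target.get("id") for target in t.findall("span/target"))
        for t in tree.iter("t")
    }
    children = {}
    has_parent = set()
    for edge in tree.iter("edge"):  # edge order decides the order of siblings
        child, parent = edge.get("from"), edge.get("to")
        children.setdefault(parent, []).append((child, edge.get("head") == "yes"))
        has_parent.add(child)

    def render(node, is_head=False):
        if node in leaves:
            return leaves[node]
        inner = " ".join(render(child, head) for child, head in children.get(node, []))
        return f"{'*' if is_head else ''}({labels[node]} {inner})"

    roots = [node for node in labels if node not in has_parent]
    return " ".join(render(root) for root in roots)


print(bracketed(ET.fromstring(TREE)))  # prints (NP (DET t1) *(NN t2))
```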
Note that the order in which the <edge> elements appear in the XML
document is important. Given the sentence:
The dog ate the cat.
In Penn Treebank format, its parse tree is the following (heads are marked with an asterisk):
(S (NP (DET The) *(NN dog)) *(VP *(V ate) (NP (DET the) *(NN cat))) (. .))
The corresponding constituency layer is:
<constituency>
<tree>
<!-- Non-terminals -->
<nt id="nter0" label="ROOT"/>
<nt id="nter1" label="S"/>
<nt id="nter2" label="NP"/>
<nt id="nter3" label="VP"/>
<nt id="nter4" label="V"/>
<nt id="nter5" label="NP"/>
<nt id="nter6" label="DET"/>
<nt id="nter7" label="NN"/>
<nt id="nter8" label="DET"/>
<nt id="nter9" label="NN"/>
<nt id="nter10" label="."/>
<!-- Terminals -->
<!-- The -->
<t id="ter1"><span><target id="t1"/></span></t>
<!-- dog -->
<t id="ter2"><span><target id="t2"/></span></t>
<!-- ate -->
<t id="ter3"><span><target id="t3"/></span></t>
<!-- the -->
<t id="ter4"><span><target id="t4"/></span></t>
<!-- cat -->
<t id="ter5"><span><target id="t5"/></span></t>
<!-- . -->
<t id="ter6"><span><target id="t6"/></span></t>
<!-- tree edges. Note: order is important! -->
<edge id="tre1" from="nter1" to="nter0"/> <!-- ROOT <- S -->
<edge id="tre2" from="nter2" to="nter1"/> <!-- S <- NP -->
<edge id="tre3" from="nter6" to="nter2"/> <!-- NP <- DET -->
<edge id="tre4" from="ter1" to="nter6"/> <!-- DET <- The -->
<edge id="tre5" from="nter7" to="nter2" head="yes"/> <!-- NP <- NN (head) -->
<edge id="tre6" from="ter2" to="nter7"/> <!-- NN <- dog -->
<edge id="tre7" from="nter3" to="nter1" head="yes"/> <!-- S <- VP (head) -->
<edge id="tre8" from="nter4" to="nter3" head="yes"/> <!-- VP <- V (head) -->
<edge id="tre9" from="ter3" to="nter4"/> <!-- V <- ate -->
<edge id="tre10" from="nter5" to="nter3"/> <!-- VP <- NP -->
<edge id="tre11" from="nter8" to="nter5"/> <!-- NP <- DET -->
<edge id="tre12" from="ter4" to="nter8"/> <!-- DET <- the -->
<edge id="tre13" from="nter9" to="nter5" head="yes"/> <!-- NP <- NN (head) -->
<edge id="tre14" from="ter5" to="nter9"/> <!-- NN <- cat -->
<edge id="tre15" from="ter6" to="nter10"/> <!-- . <- . -->
<edge id="tre16" from="nter10" to="nter1"/> <!-- S <- . -->
</tree>
</constituency>

A named entity is a term (or a multiword) that clearly identifies one item. The optional Named Entity layer is used to reference terms that are named entities. The Named Entity layer may be used either for referencing single mentioned entities or for referencing multi-mentioned entities: in the latter case, the layer creates clusters of term spans, which we call mentions, each mention referencing the same entity.
We currently use 8 entity types in total, drawn from the 4-class CoNLL model and the 7-class MUC model. The table below shows the two models. In OpeNER, we offer at least one 4-class CoNLL model for every language and, whenever possible and depending on training set availability, a 7-class MUC model too. A combination of these and other datasets is also possible.

| CoNLL (basic) | MUC (advanced) |
|---|---|
| Location | Location |
| Person | Person |
| Organization | Organization |
| Misc | Date |
| | Time |
| | Percent |
| | Money |
An <entity> element has the following attributes:
- eid: the id for the named entity
- type: type of the named entity. Currently, 8 values are possible:
- Person
- Organization
- Location
- Date
- Time
- Money
- Percent
- Misc
Every named entity may have other optional attributes. The following table shows some cases.

| Type | Optional attributes |
|---|---|
| ORGANIZATION | subtype = "company" |
| LOCATION | subtype = type of location (e.g. "street", "city", "country") |
| DATE | dateISO = "2012/12/31" |
| TIME | timeISO = "15:38:00" |
| MONEY | moneyISO = "100 EUR" |
| MISC | subtype = "car"; country: car registration code |
| MISC | subtype = "phone"; phonetype: type of the phone number (e.g. "imei", "mobile", "landline"); country: the country code |
| MISC | subtype = "personal"; cardtype: type of the personal card (e.g. "passport", "idcard", "driver_license"); country: the country code |
| MISC | subtype = "banking"; banktype: type of bank entity (e.g. "iban", "ccard", "account"); country: the country code |
| MISC | subtype = "internet"; nettype: type of the network entity (e.g. "ipaddress", "macaddress", "email", "url") |

An <entity> element has the following sub-elements:
- references: this element contains one or more span elements
- externalReferences (optional): this element contains one or more externalRef elements
A <span> sub-element can be used to reference the different occurrences of
the same named entity in the document or mentions of it, using the target
elements to refer to the target terms. If the entity is composed by multiple
words, multiple target elements are used. The <target> element has the
following attributes:
- id: tid that refers to the target term.
- head (optional): a "yes" value indicates that the term is the head of the mention.
The optional <externalRef> sub-element (see 3.4.3) is used to associate terms with external lexical or semantic resources, such as elements of a knowledge base. <externalRef> elements have the following attributes:
- resource (required): indicates the identifier of the resource referred to.
- reference (required): code of the referred element.
- confidence (optional): a floating number between 0 and 1. Indicates the confidence weight of the association.
Example:
<entities>
<entity type="person" eid="e1" >
....
</entity>
<entity type="organization" eid="e2">
....
</entity>
<entity type="date" eid="e3">
....
</entity>
</entities>

<entities>
<entity type="person" eid="e1">
<references>
<!-- John Smith -->
<span>
<target id="t1"/>
<target id="t2"/>
</span>
<!-- Him -->
<span>
<target id="t15"/>
</span>
</references>
<externalReferences>
<externalRef confidence="0.7" reference="13982343" resource="JRCNames"/>
<externalRef confidence="0.3" reference="834354" resource="JRCNames"/>
</externalReferences>
</entity>
<entity type="location" subtype="street" eid="e2">
<references>
<!-- London -->
<span>
<target id="t12" head="yes"/>
</span>
<!-- The capital city of England -->
<span>
<target id="t1" head="yes"/>
<target id="t2" head="yes"/>
<target id="t3" head="yes"/>
<target id="t4" head="yes"/>
<target id="t5" head="yes"/>
</span>
</references>
<externalReferences>
<externalRef reference="ref01011020" resource="GeoNames"/>
<externalRef reference="http://wwww.wikipedia.com/London" resource="Wikipedia"/>
</externalReferences>
</entity>
</entities>

After breakfast at the Elia Beach Hotel, I and my wife had a walk to Mykonos. There we were picked up and driven to Piraeus Port, where we had lunch with Mr. Vernicos at the Marine Club.
<entities>
<entity eid="e1" type="organization">
<!-- Elia Beach Hotel -->
<span>
<target id="t5"/>
<target id="t6"/>
<target id="t7"/>
</span>
</entity>
<entity eid="e2" type="location">
<!-- Mykonos -->
<span>
<target id="t16"/>
</span>
<!-- There -->
<span>
<target id="t17"/>
</span>
</entity>
<entity eid="e3" type="location">
<!-- Piraeus Port -->
<span>
<target id="t25"/>
<target id="t26"/>
</span>
</entity>
<entity eid="e4" type="person">
<!-- Mr. Vernicos -->
<span>
<target id="t32"/>
<target id="t33"/>
</span>
</entity>
<entity eid="e5" type="location">
<!-- Marine Club -->
<span>
<target id="t36"/>
<target id="t37"/>
</span>
</entity>
</entities>

The optional coreference layer creates clusters of term spans (which we call mentions) which share the same referent. Coreferences are intended to be used for clustering mentions that are not Named Entities.
A <coref> element represents a mention cluster, and within <coref> each
mention is represented by a <span> element (which groups term mentions
using <target> elements). Additionally, one <target> element within the
<span> may have an attribute "head" with value "yes" to represent the
fact that this particular term is the head of the mention.
The <coref> element has the following attribute:
- coid: unique id, starting with the prefix "co"

The <span> element has as many <target> elements as the mention has terms. The <target> element has the following attributes:
- id (optional): tid that refers to the target term.
- head (optional): a "yes" value indicates that the term is the head of the mention.
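Reading mention clusters and their heads out of this layer can be sketched as follows. The single-cluster fragment is reduced sample data and mention_clusters is a hypothetical helper; a mention without a head-marked target gets None as its head.

```python
import xml.etree.ElementTree as ET

COREFERENCES = """<coreferences>
  <coref coid="co2">
    <span><target id="t10"/><target id="t11" head="yes"/></span>
    <span><target id="t18"/></span>
  </coref>
</coreferences>"""


def mention_clusters(layer: ET.Element) -> dict:
    """Map each coid to its mentions, each with its term ids and optional head."""
    clusters = {}
    for coref in layer.iter("coref"):
        mentions = []
        for span in coref.findall("span"):
            targets = span.findall("target")
            heads = [t.get("id") for t in targets if t.get("head") == "yes"]
            mentions.append({
                "terms": [t.get("id") for t in targets],
                "head": heads[0] if heads else None,
            })
        clusters[coref.get("coid")] = mentions
    return clusters


print(mention_clusters(ET.fromstring(COREFERENCES)))
```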
<coreferences>
<coref coid="co1">
<!-- I -->
<span >
<target id="t8" head="yes"/>
</span>
<!-- we -->
<span>
<target id="t18"/>
</span>
</coref>
<coref coid="co2">
<!-- my wife -->
<span >
<target id="t10"/>
<target id="t11" head="yes"/>
</span>
<!-- we -->
<span>
<target id="t18"/>
</span>
</coref>
</coreferences>

After breakfast at the Elia Beach Hotel, I and my wife had a walk to Mykonos. There we were picked up and driven to Piraeus Port, where we had lunch with Mr. Vernicos at the Marine Club.
<coreferences>
<coref coid="co1">
<!-- I -->
<span >
<target id="t8" head="yes"/>
</span>
<!-- we -->
<span>
<target id="t18"/>
</span>
<!-- we -->
<span>
<target id="t28"/>
</span>
</coref>
<coref coid="co2">
<!-- my wife -->
<span >
<target id="t10"/>
<target id="t11" head="yes"/>
</span>
<!-- we -->
<span>
<target id="t18"/>
</span>
<!-- we -->
<span>
<target id="t28"/>
</span>
</coref>
</coreferences>

Features of hotels may be properties (like cleanliness, location, view) or categories (like rooms, bathrooms, staff). A property is an abstract feature (e.g. view and location), whereas a category is a more concrete feature (e.g. staff and rooms). A category may have a relation with a property (e.g. the view in the rooms is good). In that case, the relation will be included in the layer. Properties do not have relations with categories (e.g. "the room in the view" does not make sense).
The <features> element may contain a <properties> element and a <categories> element.
The <properties> element contains one or more <property> elements. A <property> element has the following attributes:
- pid: the unique identifier of the property
- lemma: lemma of the property
The <categories> element contains one or more <category> elements.
A <category> element has the following attributes:
- cid: the unique identifier of the category
- lemma: lemma of the category
<property> and <category> elements have the following sub-elements:
- references: this element contains one or more span elements
- externalReferences (optional): this element contains one or more externalRef elements
A <span> sub-element can be used to reference the different occurrences of the same property or category in the document, using target elements to refer to the target terms. If the feature is composed of multiple words, multiple target elements are used.
The optional <externalRef> sub-element (see 3.4.3) is used to associate terms
with external lexical or semantic resources, such as elements of a knowledge
base. <externalRef> elements have the following attributes:
- resource (required): indicates the identifier of the resource referred to.
- reference (required): code of the referred element.
- confidence (optional): a floating number between 0 and 1. Indicates the confidence weight of the association.
Example
Nothing special really. Comfortable and clean but very boring decor in comparison to other NH hotels. I stayed in NH in Brussels and Zurich and I really liked them because of their modern and stylish design and big rooms. This one was just like any other hotel. Basic rooms with basic and dull decor - bit disappointng. The customer service was average. The rate was very expensive and I stll had to pay for Internet and 20 euros for breakfast!!! It was good but way overpriced! The best thing about the hotel was the location - city centre, 2min from a metro stop.
<features>
<properties>
...
<property pid="p1" lemma="customer services">
<references>
<span>
<target id="t58" />
<target id="t59" />
</span>
</references>
</property>
<property pid="p2" lemma="rate">
<references>
<span>
<target id="t63" />
</span>
</references>
</property>
<property pid="p3" lemma="location">
<references>
<span>
<target id="t94" />
</span>
</references>
</property>
...
</properties>
<categories>
...
<category cid="c1" lemma="room">
<references>
<span>
<target id="t39" />
</span>
<span>
<target id="t49" />
</span>
</references>
</category>
<category cid="c2" lemma="internet">
<references>
<span>
<target id="t74" />
</span>
</references>
</category>
<category cid="c3" lemma="breakfast">
<references>
<span>
<target id="t79" />
</span>
</references>
</category>
...
</categories>
</features>
Relation between entities and/or features.
Two entities may have a relation with a degree of confidence. For example: "Bill Clinton lives in New York"
- Bill Clinton is a Person
- New York is a Location
There can be relations between entities and features with a degree of confidence. For example: "The rooms in the NH Hotel are expensive."
- NH Hotel is an Organization (with subtype company)
- Room is a Category (sub element of the features layer)
Furthermore, there can be relations between features, for instance between a property and a category. For example: "Services in the rooms are good."
- Service is a Property
- Room is a Category (sub element of the features layer)
The degree of confidence may be provided using different approaches and processors. For example, if two entities occur in the same sentence, a processor may assign a low confidence score. If they appear in the same chunk, a processor may assign a higher confidence score. If a linguistic processor detects a dependency relation, it may assign a higher score, as is the case in the example above for the named entities "Bill Clinton" and "New York".
A <relation> element contains these attributes:
- rid: the unique identifier of the relation between two entities
- from: entity/category/property id of the source element
- to: entity/category/property id of the target element
- confidence (optional): a floating-point number between 0 and 1. Indicates the confidence weight of the relation
Example
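Before the XML examples, here is a minimal Python sketch of the confidence heuristic described above (same sentence, same chunk, dependency relation). The three weights, the `Mention` class and the `relation_confidence` function are all hypothetical: the text only says "low" and "higher", so the concrete numbers are assumptions, as is the choice to rank a dependency relation above chunk co-occurrence.

```python
from dataclasses import dataclass

# Hypothetical weights; the text gives no concrete values.
SAME_SENTENCE = 0.3
SAME_CHUNK = 0.6
DEPENDENCY = 0.9


@dataclass
class Mention:
    """Where an entity, category or property occurs (ids are illustrative)."""
    element_id: str
    sentence: int
    chunk: str


def relation_confidence(a, b, dependency_pairs):
    """Return a confidence in [0, 1], or None if no relation is proposed."""
    if (a.element_id, b.element_id) in dependency_pairs:
        return DEPENDENCY
    if a.sentence != b.sentence:
        return None
    if a.chunk == b.chunk:
        return SAME_CHUNK
    return SAME_SENTENCE


clinton = Mention("e1", sentence=1, chunk="ch1")
new_york = Mention("e2", sentence=1, chunk="ch3")
deps = {("e1", "e2")}  # assume a dependency parser linked the two mentions

with_dependency = relation_confidence(clinton, new_york, deps)
without_dependency = relation_confidence(clinton, new_york, set())
print(with_dependency, without_dependency)
```

The returned value would fill the optional `confidence` attribute of a `<relation>` element.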
"Bill Clinton lives in New York"
<entities>
<entity type="person" eid="e1">
<!-- Bill Clinton -->
...
</entity>
<entity type="location" eid="e2">
<!-- New York -->
</entity>
...
</entities>
<relations>
<relation from="e1" to="e2" /> <!-- confidence = 100% -->
</relations>
"The rooms in the NH Hotel are expensive."
<entities>
<entity type="organization" subtype="company" eid="e1">
<!-- NH HOTEL -->
</entity>
</entities>
<features>
<categories>
<category cid="c1">
<!-- ROOMS -->
</category>
</categories>
</features>
<relations>
<relation from="c1" to="e1" /> <!-- confidence = 100% -->
</relations>
"Services in the rooms are good."
<features>
<properties>
<property pid="p1">
<!-- SERVICES -->
</property>
</properties>
<categories>
<category cid="c1">
<!-- ROOMS -->
</category>
</categories>
</features>
<relations>
<relation from="p1" to="c1" /> <!-- confidence = 100% -->
</relations>
The sentiment-related information in KAF is represented at two levels: the (already existing) term layer and a (new) opinion layer. The term layer represents information copied from an external source such as a sentiment lexicon. Information can be stored at both word and sense-synset level. Sentiment information in the term layer is the building block for the sentiment analysis tool, which then tries to generate the information needed to fill the slots (opinion holder, opinion target, opinion expression) of an opinion triplet. These triplets are represented in the opinion layer. Starting from these triplets, final sentiment analysis results can be calculated.
Basic forms of sentiment analysis, where polarity is aggregated at sentence or document level, may ignore the opinion layer triplets and make use of the term-level sentiment only. In these cases (for example, polarity classification of product reviews), the opinion target (i.e. the product) and the opinion holder (i.e. the reviewer) are known beforehand. However, when more complex sentiment analysis is performed, the opinion triplets are needed. One example of such an analysis is product feature-based sentiment analysis, where the target of the opinion is not just the product but some specific feature of the product. Another example is the analysis of news or blogs, where different people (i.e. opinion holders) may have different opinions about different targets. In those cases, aggregating the polarity values found in the text is not enough.
Each <opinion> element has one attribute:
- oid: the unique identifier of the opinion
Each <opinion> element consists of the following sub-elements:
- opinion_holder: whose opinion it is: the speaker or some actor in the text
- opinion_target: what the opinion is about
- opinion_expression: the expression
<opinion_holder> and <opinion_target> elements have the following sub-element:
- span: this element spans the target term. Target elements are used to refer to the target term, using term ids (tid). If the term is a multiword, multiple target elements are used.
<opinion_expression> has the following attributes:
- polarity: refers to the positive or negative orientation of the expression
- strength: refers to the strength of the expression
- subjectivity: refers to whether an expression is subjective or not
- sentiment_semantic_type: refers to sentiment related semantic types like emotion, judgment, belief, speculation
- sentiment_product_feature: refers to specific features of entities, to be used in feature/aspect-based sentiment analysis
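The element structure above maps directly onto a holder/target/expression triplet. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; it uses the sentence "They had a nightmare with Hilton Hotel Paris" from the example in this section. The `TERMS` table (one term id per word, numbered from `t1`) and the helper names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

OPINION_XML = """
<opinion oid="o1">
  <opinion_holder><span><target id="t1"/></span></opinion_holder>
  <opinion_target>
    <span><target id="t6"/><target id="t7"/><target id="t8"/></span>
  </opinion_target>
  <opinion_expression polarity="negative" strength="strong">
    <span><target id="t3"/><target id="t4"/></span>
  </opinion_expression>
</opinion>
"""

# Assumed term layer: one term per word, ids in sentence order.
TERMS = {"t1": "They", "t2": "had", "t3": "a", "t4": "nightmare",
         "t5": "with", "t6": "Hilton", "t7": "Hotel", "t8": "Paris"}


def span_ids(parent):
    """Term ids referenced by the <span> of a holder/target/expression."""
    if parent is None:
        return []
    return [target.get("id") for target in parent.findall("span/target")]


def read_opinion(opinion):
    """Turn one <opinion> element into a plain dictionary."""
    expression = opinion.find("opinion_expression")
    return {
        "oid": opinion.get("oid"),
        "holder": span_ids(opinion.find("opinion_holder")),
        "target": span_ids(opinion.find("opinion_target")),
        "expression": span_ids(expression),
        "polarity": expression.get("polarity"),
        "strength": expression.get("strength"),
    }


def text_of(term_ids):
    """Join the surface forms of the given term ids."""
    return " ".join(TERMS[term_id] for term_id in term_ids)


triplet = read_opinion(ET.fromstring(OPINION_XML))
print(text_of(triplet["holder"]), "|",
      text_of(triplet["expression"]), "|",
      text_of(triplet["target"]))
```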
The fillers of the three elements (holder, target, expression) can be found in one sentence or in different ones (cf. ex. 1). Moreover, one sentence can contain multiple opinion triplets (cf. ex. 2).
Example
(1)
Vicky Masing,"http://twitter.com/VickyMasing/statuses/245094342799286273",2012-09-10T10:13:24Z,"english","Fountain of Bellagio Las Vegas. One has got to witness that magical show. Definitely beautiful, definitely mesmerizing."
opinion expression: Definitely beautiful, definitely mesmerizing.
polarity (optional): positive
strength (optional): strong
sentiment_semantic_type (optional): evaluation
sentiment_product_feature (optional): ""
opinion holder (optional): Speaker/Writer
opinion target (optional): Fountain of Bellagio Las Vegas
(2) We stayed at The Toren as a transit hotel for our trip from saudi arabia to USA, the hotel was just amazing in all aspects, the service was great, the people working their were really helpful and nice, the food at the breakfast was delicious and fresh and the location is great, very near to tram stops, the central station, Shopping...
opinion expression: really helpful and nice
polarity: positive
strength: strong
sentiment_semantic_type: evaluation
sentiment_product_feature: staff
opinion holder: Speaker/Writer
opinion target: people working there
opinion expression: delicious
polarity: positive
strength: strong
sentiment_semantic_type: evaluation
sentiment_product_feature: breakfast
opinion holder: Speaker/Writer
opinion target: the food at the breakfast
(N.B. there are more opinion triplets in this fragment than the ones given here).
They had a nightmare with Hilton Hotel Paris
<opinion oid="o1">
<opinion_holder type="" >
<span>
<target id="t1"/>
</span>
</opinion_holder>
<opinion_target>
<span>
<target id="t6"/>
<target id="t7"/>
<target id="t8"/>
</span>
</opinion_target>
<opinion_expression polarity="negative" strength="strong"
subjectivity="subjectivity" sentiment_semantic_type="evaluation"
sentiment_product_feature="">
<span>
<target id="t3"/>
<target id="t4"/>
</span>
</opinion_expression>
</opinion>
The most shallow sentiment analysis process uses the sentiment information of the term layer, which comes from a polarity lexicon. The polarity information is processed, with or without WSD, applying polarity shifting and intensification rules. The results are not stored in any KAF layer but aggregated, depending on the end-user requirements, up to the level of e.g. chunk, sentence, paragraph, document, set of documents, etc. For example, "the hotel is not(shifter) expensive(negative) but not(shifter) clean(positive) and not(shifter) well-located(positive) either" results in +1(positive) and -2(negative) for this sentence.
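The arithmetic of this example can be reproduced with a small sketch. It is a simplification under stated assumptions, not the actual OpeNER processing: a shifter is assumed to flip only the next polar word, intensification rules are left out, and the token tags are written by hand instead of coming from a polarity lexicon. The function name `aggregate_polarity` is hypothetical.

```python
# Each token is (word, tag); tag is "shifter", "positive", "negative" or None.
SENTENCE = [
    ("the", None), ("hotel", None), ("is", None),
    ("not", "shifter"), ("expensive", "negative"),
    ("but", None),
    ("not", "shifter"), ("clean", "positive"),
    ("and", None),
    ("not", "shifter"), ("well-located", "positive"),
    ("either", None),
]

FLIP = {"positive": "negative", "negative": "positive"}


def aggregate_polarity(tokens):
    """Count positive and negative expressions after applying shifters."""
    counts = {"positive": 0, "negative": 0}
    shift_pending = False
    for _word, tag in tokens:
        if tag == "shifter":
            # Assumption: the shifter applies to the next polar word only.
            shift_pending = True
        elif tag in FLIP:
            polarity = FLIP[tag] if shift_pending else tag
            counts[polarity] += 1
            shift_pending = False
    return counts


counts = aggregate_polarity(SENTENCE)
print(f"+{counts['positive']} (positive), -{counts['negative']} (negative)")
# prints "+1 (positive), -2 (negative)"
```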
More complex sentiment analysis makes use of the term and opinion layers. Polarity information of the terms is processed, and shifting and intensification rules are applied. The sentiment analysis process then exploits, for example, syntactic rules to find the opinion holder and target. The results of this process are stored in the opinion layer. For example, "the hotel is not(shifter) expensive(negative) but not(shifter) clean(positive) and not(shifter) well-located(positive) either" results in 3 opinion triplets:
opinion1: [[holder: default] [expression(not expensive): [polarity: positive] [strength: average]] [target: hotel]].
opinion2: [[holder: default] [expression(not clean): [polarity: negative] [strength: average]] [target: hotel]].
opinion3: [[holder: default] [expression(not well-located): [polarity: negative] [strength: average]] [target: hotel]].
The relation between the word "hotel" in this sentence and a specific hotel in the entity layer is established by overlapping spans in the term layer. Again, how the information is further processed, e.g. aggregation of different opinions about one target or aggregation of different opinions from one holder, depends on the end-user requirements, and the results are not stored in KAF.
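One possible aggregation of the three "hotel" triplets is sketched below. Since the aggregation depends on end-user requirements, this is only an illustration under simple assumptions: each opinion counts as +1 or -1, strength is ignored, and the dictionary layout and the `aggregate` function are hypothetical.

```python
from collections import defaultdict

# The three triplets from the "hotel" sentence, as plain dictionaries.
OPINIONS = [
    {"holder": "default", "expression": "not expensive",
     "polarity": "positive", "strength": "average", "target": "hotel"},
    {"holder": "default", "expression": "not clean",
     "polarity": "negative", "strength": "average", "target": "hotel"},
    {"holder": "default", "expression": "not well-located",
     "polarity": "negative", "strength": "average", "target": "hotel"},
]

# Assumed scoring: one point per opinion, strength not taken into account.
SCORE = {"positive": 1, "negative": -1}


def aggregate(opinions, key):
    """Sum the polarity scores of all opinions sharing the same `key` value."""
    totals = defaultdict(int)
    for opinion in opinions:
        totals[opinion[key]] += SCORE[opinion["polarity"]]
    return dict(totals)


by_target = aggregate(OPINIONS, "target")
by_holder = aggregate(OPINIONS, "holder")
print(by_target)  # {'hotel': -1}
print(by_holder)  # {'default': -1}
```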
