KAF structure overview

Rodrigo Agerri edited this page Dec 12, 2013 · 20 revisions

Index

KAF is structured in layers. Each layer is usually the result of a particular analysis module in the processing chain. Some layers point to elements of other layers through those elements' ids, using "span" elements. The more basic layers are also a prerequisite for the processes that produce the more advanced ones.

The main layers and elements are the following:

  1. KAF root element
  2. KAF Header
    1. fileDesc element
  3. WordForms
  4. Terms
    1. Part-of-Speech codes
  5. Dependencies
  6. Chunks
  7. Constituents
  8. Named entities
  9. Coreference
  10. Features of sentiments
  11. Relations
  12. Opinions

##KAF full example

##KAF DTD

##KAF diagram, full size

KAF diagram


KAF root element

Back to index

All KAF documents have a root element <KAF> which has the following attributes:

  • xml:lang: language identifier.
  • version: the version of KAF. For OpeNER, we use version "v1.opener".

Example:

<KAF xml:lang="en" version="v1.opener">
  <!-- ... -->
</KAF>
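
As a minimal sketch of how these attributes can be read, the Python snippet below parses a KAF string with the standard-library xml.etree.ElementTree module. The sample string and the variable names are illustrative assumptions, not part of KAF. Note that ElementTree exposes xml:lang under the reserved XML namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal KAF document, mirroring the example above.
KAF_SAMPLE = '<KAF xml:lang="en" version="v1.opener"></KAF>'

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = ET.fromstring(KAF_SAMPLE)
language = root.get("{%s}lang" % XML_NS)  # value of xml:lang
version = root.get("version")
print(root.tag, language, version)
```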

KAF Header

Back to index

KAF documents may have a header for describing information about the document, such as its original name, URI or a list of the linguistic processors which generated the KAF document. The KAF header is represented within the <kafHeader> element, which is optional but highly recommended. The header element has three sub-elements:

fileDesc element

Back to index

<fileDesc> is an empty element containing information about the computer document itself. It has the following attributes:

  • author (optional): the author of the document.
  • title (optional): the title of the document.
  • creationtime (optional): when the document was created, in ISO 8601 format.
  • filename (optional): the original file name.
  • filetype (optional): the original format (PDF, HTML, DOC, etc).
  • pages (optional): number of pages of the original document.

Example:

<fileDesc title="3.2012" author="casa400" filename="residence_hostal" filetype="PDF" pages="19" />

public element

Back to index

<public> is an empty element which stores public information about the document, such as its URI. It has the following attributes:

  • publicId (optional): a public identifier (for instance, the number inserted by the capture server).
  • uri (optional): a public URI of the document.

Example:

<public publicId="3.3012" uri="http://casa400.com/docs/residence.pdf" />

Linguistic Processors

Back to index

The header also stores the information about which linguistic processors produced the KAF document, described under <linguisticProcessors> elements. There can be several <linguisticProcessors> elements, one per KAF layer. KAF layers correspond to the top-level elements of the documents, such as "text", "terms", "deps" etc. Each <linguisticProcessors> element contains one or several <lp> elements, each one describing one specific linguistic processor. <lp> elements have the following attributes:

  • name: the name of the processor.
  • version: the processor's version.
  • timestamp: a timestamp denoting the date/time at which the processor was launched. The timestamp follows the XML Schema xs:dateTime type.

Example:

<linguisticProcessors layer="text">
  <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:05:00Z"/>
</linguisticProcessors>
<linguisticProcessors layer="terms">
  <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:10:19Z"/>
  <lp name="ukb" version="0.1.2" timestamp="2012-06-25T16:10:19Z"/>
</linguisticProcessors>
<linguisticProcessors layer="namedEntities">
  <lp name="Standfort_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
  <lp name="kybot_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
</linguisticProcessors>

Header example

Back to index

Full example of a KAF header:

<kafHeader>
  <fileDesc title="3.3012" author="casa400" filename="residence_hostal" filetype="PDF" pages="19"/>
  <public publicId="3.3012"
    uri="http://casa400.com/docs/residence.pdf" />
  <linguisticProcessors layer="text">
    <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:05:00Z"/>
  </linguisticProcessors>
  <linguisticProcessors layer="terms">
    <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:10:19Z"/>
    <lp name="ukb" version="0.1.2" timestamp="2012-06-25T16:10:19Z"/>
  </linguisticProcessors>
  <linguisticProcessors layer="namedEntities">
    <lp name="kybot_NE" version="0.1" timestamp="2012-06-26T00:10:19Z"/>
  </linguisticProcessors>
</kafHeader>
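
The following Python sketch shows one way to list the linguistic processors recorded per layer. It assumes timestamps use the UTC "Z" form seen in the examples; xs:dateTime allows other forms that this sketch does not handle. The function name processors_by_layer is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened copy of the header example (fileDesc and public omitted).
HEADER_SAMPLE = """
<kafHeader>
  <linguisticProcessors layer="text">
    <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:05:00Z"/>
  </linguisticProcessors>
  <linguisticProcessors layer="terms">
    <lp name="Freeling" version="2.1" timestamp="2012-06-25T10:10:19Z"/>
    <lp name="ukb" version="0.1.2" timestamp="2012-06-25T16:10:19Z"/>
  </linguisticProcessors>
</kafHeader>
"""

def processors_by_layer(header_xml):
    """Return {layer: [(name, version, datetime), ...]} in document order."""
    header = ET.fromstring(header_xml)
    result = {}
    for group in header.findall("linguisticProcessors"):
        entries = result.setdefault(group.get("layer"), [])
        for lp in group.findall("lp"):
            # Assumption: only the "YYYY-MM-DDThh:mm:ssZ" form is supported here.
            stamp = datetime.strptime(lp.get("timestamp"), "%Y-%m-%dT%H:%M:%SZ")
            entries.append((lp.get("name"), lp.get("version"), stamp))
    return result

layers = processors_by_layer(HEADER_SAMPLE)
```

A malformed timestamp makes strptime raise ValueError, which is a cheap way to spot headers that do not follow xs:dateTime.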

Wordforms

Back to index

After the tokenization step, all word forms are annotated within the <text> element, and each form is enclosed in a <wf> element. <wf> elements have the following attributes:

  • wid: the id for the word form (unique in the file).
  • sent (optional): sentence id of the token
  • para (optional): paragraph id
  • page (optional): page id
  • offset (optional): The offset (in characters) of the original word form
  • length (optional): The length (in characters) of the original word form
  • xpath (optional): in case of source xml files, the xpath expression identifying the original word form

Examples of word level annotations:

<text>
  <wf wid="w1" offset="0" length="4" sent="1" para="1">John</wf>
  <wf wid="w2" offset="5" length="6" sent="1" para="1">taught</wf>
  <wf wid="w3" offset="12" length="11" sent="1" para="1">mathematics</wf>
  <wf wid="w4" offset="24" length="2" sent="1" para="1">20</wf>
  <wf wid="w5" offset="27" length="7" sent="1" para="1">minutes</wf>
  <wf wid="w6" offset="35" length="5" sent="1" para="1">every</wf>
  <wf wid="w7" offset="41" length="6" sent="1" para="1">Monday</wf>
  <wf wid="w8" offset="48" length="2" sent="1" para="1">in</wf>
  <wf wid="w9" offset="51" length="3" sent="1" para="1">New</wf>
  <wf wid="w10" offset="55" length="3" sent="1" para="1">York</wf>
  <wf wid="w11" offset="59" length="1" sent="1" para="1">.</wf>
</text>
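
To illustrate how offset and length relate to the original text, the Python sketch below rebuilds a sentence from its word forms. It is a minimal example under the assumption that gaps between tokens are single spaces, which the offsets alone cannot confirm. The function name rebuild_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# First three word forms of the example above.
TEXT_SAMPLE = """
<text>
  <wf wid="w1" offset="0" length="4" sent="1" para="1">John</wf>
  <wf wid="w2" offset="5" length="6" sent="1" para="1">taught</wf>
  <wf wid="w3" offset="12" length="11" sent="1" para="1">mathematics</wf>
</text>
"""

def rebuild_text(text_xml):
    """Place every word form at its offset, padding gaps with spaces."""
    chars = []
    for wf in ET.fromstring(text_xml).findall("wf"):
        offset, length = int(wf.get("offset")), int(wf.get("length"))
        if len(wf.text) != length:
            raise ValueError("length mismatch for " + wf.get("wid"))
        # Assumption: whatever lies between two tokens is rendered as spaces.
        chars.extend(" " * (offset - len(chars)))
        chars[offset:offset + length] = wf.text
    return "".join(chars)

rebuilt = rebuild_text(TEXT_SAMPLE)
print(rebuilt)
```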

Terms

Back to index

Terms refer to previous word forms (grouping them in the case of multiwords) and attach lemma, part of speech, synset and named entity information. <term> elements have the following attributes:

  • tid: unique identifier, starting with "t".
  • type: type of the term. Currently, 2 values are possible:
    • open: open class word
    • close: closed class word
  • lemma: lemma of the term.
  • pos: part of speech (see Part-of-speech codes):
    • common noun (N)
    • proper noun (R)
    • pronoun (Q)
    • adjective (G)
    • verb (V)
    • preposition (P)
    • adverb (A)
    • conjunction (C)
    • determiner (D)
    • other (O)
  • morphofeat (optional): morphosyntactic feature encoded as a single attribute.
  • head: if the term is a compound, the id of the head component (see Compound Terms).
  • case (optional): declension case of the term.

<term> elements have the following sub-element:

  • span: this element contains one <target> element per word form covered by the term, each referring to the word form by its word id (wid). If the term is a multiword, multiple <target> elements are used.

Part-of-speech codes

Back to index

The pos attribute must consist of one single letter from the following set:

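| Code | Part of speech | Code | Part of speech |
|------|----------------|------|----------------|
| N | common noun | V | verb |
| R | proper noun | P | preposition |
| Q | pronoun | A | adverb |
| D | determiner | C | conjunction |
| G | adjective | O | other |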
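
The code set above can be encoded as a small lookup table. The Python sketch below is only an illustration; the names POS_CODES and describe_pos are hypothetical and not part of KAF.

```python
# Single-letter KAF part-of-speech codes, as listed above.
POS_CODES = {
    "N": "common noun",
    "R": "proper noun",
    "Q": "pronoun",
    "D": "determiner",
    "G": "adjective",
    "V": "verb",
    "P": "preposition",
    "A": "adverb",
    "C": "conjunction",
    "O": "other",
}

def describe_pos(code):
    """Return the description of a pos code, rejecting anything outside the set."""
    if code not in POS_CODES:
        raise ValueError("not a valid KAF pos code: %r" % (code,))
    return POS_CODES[code]

print(describe_pos("G"))
```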

Sentiment features

Back to index

The term layer represents sentiment information that is context-independent and can be found in a sentiment lexicon. It relates to concepts expressed by words or terms (e.g. beautiful) or by multi-word expressions (e.g. out of order). Sentiment information can be stored at word level and at sense/synset level. In the latter case, the sentiment information is included in the <externalReferences> section, and a WSD process may identify the correct sense with its sentiment information. The extension contains the following information categories.

<sentiment> elements have the following attributes:

  • Resource: identifier and reference to an external sentiment resource
  • Polarity: Refers to the property of a word to express positive, negative or no sentiment. These values are possible:
    • Positive
    • Negative
    • Neutral
    • Or numerical value on a numerical scale
  • Strength: refers to the strength of the polarity
    • Weak
    • Average
    • Strong
    • Or Numerical value
  • Subjectivity: refers to the property of a word to express an opinion (or not)
    • Subjective/Objective
    • Factual/opinionated
  • Sentiment_semantic_type: refers to a sentiment-related semantic type
    • Aesthetics_evaluation
    • Moral_judgment
    • Emotion
    • etc
  • Sentiment_modifier: refers to words which modify the polarity of another word
    • Intensifier, weakener, polarity shifter
  • Sentiment_marker: refers to words which do not carry polarity themselves, but act as vehicles of it
    • Find, think, in my opinion, according to, ...
  • Sentiment_product_feature: refers to a domain; mainly used in feature-based sentiment analysis
    • Values are specific to the domain. For the tourist domain, for example: staff, cleanliness, beds, bathroom, transportation, location, etc.

Example

Beautiful: polarity=positive; strength=average; subjectivity=subjective; sentiment_semantic_type=aesthetics_evaluation; sentiment_modifier=""; sentiment_product_feature=general

Valley: polarity=negative; strength=average; subjectivity=factual; sentiment_semantic_type=""; sentiment_modifier=""; sentiment_product_feature=beds

Very: polarity=""; strength=""; subjectivity=""; sentiment_semantic_type=""; sentiment_modifier="intensifier"; sentiment_product_feature=""

KAF Example

<term tid="t2" lemma="nice" pos="G" type="open">
  <sentiment resource="VUA_polarityLexicon_word" polarity="positive" strength="average"
    subjectivity="subjective"
    sentiment_semantic_type="behaviour/trait"
    sentiment_product_feature="" />
  <span>
    <target id="w2"/>
  </span>
  <externalReferences/>
</term>

<term tid="t5" lemma="warm" pos="G" type="open">
  <sentiment/>
  <span>
    <target id="w5"/> 
  </span>
  <externalReferences>
    <externalRef resource="WN-ENG" reference="c_1009" confidence="0.38">
      <sentiment resource="VUA_polarityLexicon_synset" polarity="positive" strength="average"
        subjectivity="subjective"
        sentiment_semantic_type="behaviour/traitEvaluation"
        sentiment_product_feature="" />
    </externalRef>
    <externalRef resource="WN-ENG" reference="c_1008" confidence="0.31">
      <sentiment resource="VUA_polarityLexicon_synset" polarity="positive" strength="average"
        subjectivity="objective"
        sentiment_semantic_type="temperature" sentiment_product_feature=""/>
    </externalRef>
  </externalReferences>
</term>

External References

Back to index

The optional <externalReferences> element is used to associate terms with external lexical or semantic resources, such as elements of a knowledge base: a semantic lexicon (like WordNet) or an ontology. It consists of several <externalRef> elements, one per association. The <externalRef> elements may be nested, meaning that each externalRef refines the relation expressed by its parent element.

<externalRef> elements have the following attributes:

  • resource (required): identifier of the resource referred to, in the format "resource_language_version", e.g. wn_en_15, wn_it_30, wn_de_18.
  • reference (required): code of the referred element. For example, if the element is a synset of some version of WordNet, it follows the pattern [0-9]+-[nvars], e.g. 2345-n, 662345-v. This is a string composed of two fields separated by a dash:
    • Synset identifier, composed of digits.
    • POS character:
      • n noun
      • v verb
      • a adjective
      • r adverb
    Examples of valid patterns are "12345678-n", "017403-v", etc.
  • reftype (optional): indicates the kind of relation the externalRef is expressing. The reftype attribute depends on the kind of external reference. For WordNet it may have values like 'sc_DomainOf', 'sc_SubclassOf', etc. An empty reftype indicates a direct relationship.
  • status (optional): indicates the status of the relationship.
  • source (optional): the name of the process which created the external reference.
  • confidence (optional): a floating-point number between 0 and 1, indicating the confidence weight of the association.
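
The pattern given above for WordNet synset references can be checked with a regular expression. The Python sketch below uses the pattern exactly as stated in the text; the helper name is_wordnet_synset_ref is hypothetical.

```python
import re

# Pattern quoted from the description above: digits, a dash, one POS character.
SYNSET_REF = re.compile(r"[0-9]+-[nvars]")

def is_wordnet_synset_ref(reference):
    """True if the whole string matches the synset reference pattern."""
    return SYNSET_REF.fullmatch(reference) is not None

for candidate in ("12345678-n", "017403-v", "2345-x"):
    print(candidate, is_wordnet_synset_ref(candidate))
```

Note that references such as "eng-17-00861095-v", used in some examples on this page, do not match this pattern, so a real validator would probably need to be resource-specific.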

Compound Terms

Back to index

Compound terms can be represented in KAF by including <component> elements within <term> elements. For example, the Dutch term landbouwbeleid (English: agriculture policy) would look like this:

<term tid="t7" head="t7.1" lemma="landbouwbeleid" pos="N" type="open">
  <span><target id="w7"/></span>
  <component id="t7.1" lemma="landbouw" pos="N">
    <externalReferences>...</externalReferences>
  </component>
  <component id="t7.2" lemma="beleid" pos="N">
    <externalReferences>...</externalReferences>
  </component>
  <externalReferences>...</externalReferences>
</term>

The <component> elements have the following attributes:

  • id: unique identifier
  • lemma: lemma of the term
  • pos: part of speech
  • case: declension case (apparently optional)

Components of compound terms should all have the term identifier of the whole word before the full stop, and an integer after it, increasing in linear order.
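
The identifier convention for components can be checked mechanically. The Python sketch below assumes that the integers start at 1 and increase by one, as in the examples on this page; the text itself only requires increasing order. The function name component_id_problems is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the compound example (externalReferences omitted).
TERM_SAMPLE = """
<term tid="t7" head="t7.1" lemma="landbouwbeleid" pos="N" type="open">
  <span><target id="w7"/></span>
  <component id="t7.1" lemma="landbouw" pos="N"/>
  <component id="t7.2" lemma="beleid" pos="N"/>
</term>
"""

def component_id_problems(term):
    """Return a list of problems; an empty list means the ids look fine."""
    tid = term.get("tid")
    ids = [c.get("id") for c in term.findall("component")]
    # Assumption: components are numbered 1, 2, 3, ... after the term id.
    expected = ["%s.%d" % (tid, n) for n in range(1, len(ids) + 1)]
    problems = []
    if ids != expected:
        problems.append("expected %s but found %s" % (expected, ids))
    head = term.get("head")
    if head is not None and head not in ids:
        problems.append("head %s is not a component id" % head)
    return problems

problems = component_id_problems(ET.fromstring(TERM_SAMPLE))
print(problems)
```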

Multiword Terms

Back to index

Multiword terms can be represented in KAF by including more than one <target> element within <term> elements. Like compound terms, multiword terms have <component> elements.

For example, the two terms "Prime" and "Minister" are shown below as they appear before the analysis:

<term tid="t5" lemma="Prime" pos="N" type="open">
  <span>
    <target id="w5"/>
  </span>
  <externalReferences>...</externalReferences>
</term>
<term tid="t6" lemma="Minister" pos="N" type="open">
  <span>
    <target id="w6"/>
  </span>
  <externalReferences>...</externalReferences>
</term>

After the multiword analysis, they become:

<term tid="t5" lemma="Prime_Minister" pos="N" head="t5.2" type="open">
  <span>
    <target id="w5"/>
    <target id="w6"/>
  </span>
  <component id="t5.1" lemma ="Prime">
    <externalReferences>...</externalReferences>
  </component>
  <component id="t5.2" lemma="Minister">
    <externalReferences>...</externalReferences>
  </component>
  <externalReferences>...</externalReferences>
</term>

All the layers referring to t5 and t6 will be updated.

The <component> elements have the following attributes:

  • id: unique identifier
  • lemma: lemma of the term
  • pos: part of speech
  • case: declension case (apparently optional)

Components of multiword terms should all have the term identifier of the whole word before the full stop, and an integer after it, increasing in linear order.

Examples of term annotation

Back to index

<terms>
  <term tid="t1" lemma="John" pos="R">
    <span><target id="w1"/></span>
  </term>

  <term tid="t2" type="open"  lemma="teach" pos="V">
    <span><target id="w2"/></span>
    <externalReferences>
      <externalRef resource="WN-1.7" reference="eng-17-00861095-v" confidence="0.80">
        <externalRef resource="ontology" reference="Teach" reftype="SubClassOf">
          <externalRef resource="ontology" reference="Human" reftype="agent"/>
          <externalRef resource="ontology" reference="Human" reftype="patient"/>
        </externalRef>
      </externalRef>
      <externalRef resource="WN-1.7" reference="eng-17-00859568-v" confidence="0.20"/>
    </externalReferences>
  </term>

  <term tid="t3" type="open"  lemma="mathematics" pos="N">
    <span><target id="w3"/></span>
    <externalReferences>
      <externalRef resource="WN-1.7" reference="eng-17-04597590-n" confidence="0.99"/>
    </externalReferences>
  </term>

  <term tid="t4" lemma="20" pos="N">
    <span><target id="w4"/></span>
  </term>
  <term tid="t5" type="open" lemma="minute" pos="N">
    <span><target id="w5"/></span>
  </term>

  <term tid="t6" type="close" lemma="every" pos="D">
    <span><target id="w6"/></span>
  </term>

  <term tid="t7" lemma="Monday" pos="N">
    <span><target id="w7"/></span>
    <externalReferences>
      <externalRef resource="WN-1.7" reference="eng-17-12557842-n" confidence="0.99"/>
    </externalReferences>
  </term>

  <term tid="t8" type="close" lemma="in" pos="P">
    <span><target id="w8"/></span>
  </term>

  <term tid="t9" lemma="New_York" pos="R">
    <span>
      <target id="w9"/>
      <target id="w10"/>
    </span>
  </term>
</terms>
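
To make the pointer mechanism concrete, the Python sketch below resolves each term's <span> back to the word forms it covers. The sample document and its ids are illustrative, and the function name term_surface_forms is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative two-layer document with one multiword term.
KAF_SAMPLE = """
<KAF xml:lang="en" version="v1.opener">
  <text>
    <wf wid="w9">New</wf>
    <wf wid="w10">York</wf>
  </text>
  <terms>
    <term tid="t9" lemma="New_York" pos="R">
      <span><target id="w9"/><target id="w10"/></span>
    </term>
  </terms>
</KAF>
"""

def term_surface_forms(kaf_xml):
    """Return {tid: surface string} by following span targets to word forms."""
    root = ET.fromstring(kaf_xml)
    words = {wf.get("wid"): wf.text for wf in root.find("text").findall("wf")}
    forms = {}
    for term in root.find("terms").findall("term"):
        targets = term.find("span").findall("target")
        # A KeyError here would mean a target points at a missing word form.
        forms[term.get("tid")] = " ".join(words[t.get("id")] for t in targets)
    return forms

forms = term_surface_forms(KAF_SAMPLE)
print(forms)
```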

Dependencies

Back to index

Dependencies represent dependency relations among terms. Each dependency is represented by an empty <dep> element that refers to previous terms. The <dep> element has the following attributes:

  • from: term id of the source element
  • to: term id of the target element
  • rfunc: relational function. One of "subj" (grammatical subject), "obj" (objects and/or complements), or "mod" (modifier of noun or verb). In detail:
    • mod: indicates the word introducing the dependent in a head-modifier relation. For instance:
      • mod(by,gift,Peter): the gift of a book by Peter
      • mod(of,examination,patient): the examination of the patient
    • subj: indicates the subject in the grammatical relation Subject-Predicate. For instance:
      • subj(arrive,John,_): John arrived in Paris
      • subj(employ,Microsoft,_): Microsoft employed 10 C programmers
      • subj(employ,Paul,obj): Paul was employed by Microsoft
    • csubj, xsubj, ncsubj: the grammatical relations (GRs) csubj and xsubj may be used for clausal subjects, controlled from within or without, respectively. ncsubj is a non-clausal subject. For instance:
      • xsubj(win,require,_): to win the America's Cup requires heaps of cash
    • dobj: indicates the object in the grammatical relation between a predicate and its direct object. For instance:
      • dobj(read,book,_): read books
    • iobj: the relation between a predicate and a non-clausal complement introduced by a preposition; type indicates the preposition introducing the dependent. For instance:
      • iobj(in,arrive,Spain): arrive in Spain
      • iobj(into,put,box): put the tools into the box
      • iobj(to,give,poor): give to the poor
    • obj2: the relation between a predicate and the second non-clausal complement in ditransitive constructions, written obj2(head,dependent). For instance:
      • obj2(give,present): give Mary a present
      • obj2(mail,contract): mail Paul the contract
  • case (optional): declension case

There is no requirement that the dependency graph be complete, but it should not have any directed cycles. Example of dependency relation annotations:

<deps>
  <!-- subj(teach, John) -->
  <dep from="t1" to="t2" rfunc="subj" />
  <!-- dobj(teach, Mathematics) -->
  <dep from="t3" to="t2" rfunc="dobj" />
  <!-- iobj(teach, New_York) -->
  <dep from="t9" to="t2" rfunc="iobj" />
</deps>
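
The requirement that the dependency graph contain no directed cycles can be verified with a depth-first search. The Python sketch below is one possible check, not part of any KAF tool; it treats each <dep> as an edge from "from" to "to", and the function name has_directed_cycle is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two dependencies taken from the example above.
DEPS_SAMPLE = """
<deps>
  <dep from="t1" to="t2" rfunc="subj"/>
  <dep from="t3" to="t2" rfunc="dobj"/>
</deps>
"""

def has_directed_cycle(deps_xml):
    """Depth-first search with three colours; a GREY successor means a cycle."""
    graph = {}
    for dep in ET.fromstring(deps_xml).findall("dep"):
        graph.setdefault(dep.get("from"), []).append(dep.get("to"))

    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY  # node is on the current search path
        for nxt in graph.get(node, []):
            colour = state.get(nxt, WHITE)
            if colour == GREY:
                return True
            if colour == WHITE and visit(nxt):
                return True
        state[node] = BLACK  # fully explored, no cycle through this node
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(graph))

print(has_directed_cycle(DEPS_SAMPLE))
```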

Chunks

Back to index

Chunks are noun, verb or prepositional phrases, spanning terms. <chunk> elements have the following attributes:

  • cid: unique identifier
  • head: the chunk head’s term id
  • phrase: type of the phrase
  • case (optional): declension case

Example of chunk annotations:

<chunks>
  <!-- John -->
  <chunk cid="c1" head="t1" phrase="NP">
    <span><target id="t1"/></span>
  </chunk>
  <!-- taught -->
  <chunk cid="c2" head="t2" phrase="V">
    <span><target id="t2"/></span>
  </chunk>
  <!-- Mathematics -->
  <chunk cid="c3" head="t3" phrase="NP">
    <span><target id="t3"/></span>
  </chunk>
  <!-- 20 minutes -->
  <chunk cid="c5" head="t5" phrase="NP">
    <span><target id="t4"/><target id="t5"/></span>
  </chunk>
  <!-- every -->
  <chunk cid="c6" head="t6" phrase="R">
    <span><target id="t6"/></span>
  </chunk>
  <!-- every Monday -->
  <chunk cid="c7" head="t7" phrase="NP">
    <span><target id="t6"/><target id="t7"/></span>
  </chunk>
  <!-- in New York -->
  <chunk cid="c9" head="t9" phrase="PP">
    <span><target id="t8"/><target id="t9"/></span>
  </chunk>
</chunks>

Constituent Parsing

Back to index

This layer represents the output of constituency parsers, i.e., the full syntactic tree of each sentence. The top element of the layer is <constituency>, and each sentence (parse tree) is represented by a <tree> element. Inside each <tree>, there are three types of elements:

  • <nt> elements representing non-terminal nodes.
  • <t> elements representing terminal nodes.
  • <edge> elements representing in-tree edges.

The <nt> element represents the internal nodes of the parse tree. It has the following attributes:

  • id (required): unique identifier starting with the prefix "nter".
  • label (required): the category label of the node (for instance, 'S', 'NP', 'VP', etc.).

The <t> element represents the leaf nodes of the parse tree. It has the following attribute:

  • id (required): unique identifier starting with the prefix "ter".

The <t> element contains a <span> element pointing to the term layer. Each <span> contains one or more <target> elements, with the following attributes:

  • id (required): id of the target term.

The <edge> element links a child node with its parent. It has the following attributes:

  • id (optional): unique identifier starting with the prefix "tre" (tree edge).
  • from (required): id of the child node. The child node can be a terminal or a non-terminal.
  • to (required): id of the parent node. The parent node is a non-terminal.
  • head (optional): a "yes" value indicates that the child node is the head constituent of the sub-tree pointed by the parent node.

Note that the order in which the <edge> elements appear in the XML document is important. Given the sentence:

The dog ate the cat.

In a Penn Treebank format, its parse tree is the following (heads are marked with an asterisk):

(S (NP (DET The) *(NN dog)) *(VP *(V ate) (NP (DET the) *(NN cat))) (. .))

<constituency>
  <tree>
    <!-- Non-terminals -->
    <nt id="nter0"  label="ROOT"/>
    <nt id="nter1"  label="S"/>
    <nt id="nter2"  label="NP"/>
    <nt id="nter3"  label="VP"/>
    <nt id="nter4"  label="V"/>
    <nt id="nter5"  label="NP"/>
    <nt id="nter6"  label="DET"/>
    <nt id="nter7"  label="NN"/>
    <nt id="nter8"  label="DET"/>
    <nt id="nter9"  label="NN"/>
    <nt id="nter10" label="."/>
    <!-- Terminals -->
    <!-- The -->
    <t id="ter1"><span><target id="t1"/></span></t>
    <!-- dog -->
    <t id="ter2"><span><target id="t2"/></span></t>
    <!-- ate -->
    <t id="ter3"><span><target id="t3"/></span></t>
    <!-- the -->
    <t id="ter4"><span><target id="t4"/></span></t>
    <!-- cat -->
    <t id="ter5"><span><target id="t5"/></span></t>
    <!-- . -->
    <t id="ter6"><span><target id="t6"/></span></t>

    <!-- tree edges. Note: order is important! -->
    <edge id="tre1" from="nter1" to="nter0"/>             <!-- ROOT <- S -->
    <edge id="tre2" from="nter2" to="nter1"/>             <!-- S <- NP -->
    <edge id="tre3" from="nter6" to="nter2"/>             <!-- NP <- DET -->
    <edge id="tre4" from="ter1" to="nter6"/>              <!-- DET <- The -->
    <edge id="tre5" from="nter7" to="nter2" head="yes"/>  <!-- NP <- NN (head) -->
    <edge id="tre6" from="ter2" to="nter7"/>              <!-- NN <- dog -->
    <edge id="tre7" from="nter3" to="nter1" head="yes"/>  <!-- S  <- VP (head) -->
    <edge id="tre8" from="nter4" to="nter3" head="yes"/>  <!-- VP <- V (head) -->
    <edge id="tre9" from="ter3" to="nter4"/>              <!-- V  <- ate -->
    <edge id="tre10" from="nter5" to="nter3"/>            <!-- VP <- NP -->
    <edge id="tre11" from="nter8" to="nter5"/>            <!-- NP <- DET -->
    <edge id="tre12" from="ter4" to="nter8"/>             <!-- DET <- the -->
    <edge id="tre13" from="nter9" to="nter5" head="yes"/> <!-- NP <- NN (head) -->
    <edge id="tre14" from="ter5" to="nter9"/>             <!-- NN <- cat -->
    <edge id="tre15" from="ter6" to="nter10"/>            <!-- . <- . -->
    <edge id="tre16" from="nter10" to="nter1"/>           <!-- S <- . -->
  </tree>
</constituency>
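
As a rough illustration of why edge order matters, the Python sketch below rebuilds a bracketed string from a <tree> element. It assumes that the document order of <edge> elements gives the left-to-right order of siblings, which is one reading of the note above. The sample tree covers only the noun phrase "The dog", leaves are printed as term ids, and the function name to_brackets is hypothetical.

```python
import xml.etree.ElementTree as ET

# Small illustrative tree: (NP (DET The) *(NN dog)), with terms t1 and t2.
TREE_SAMPLE = """
<tree>
  <nt id="nter0" label="NP"/>
  <nt id="nter1" label="DET"/>
  <nt id="nter2" label="NN"/>
  <t id="ter1"><span><target id="t1"/></span></t>
  <t id="ter2"><span><target id="t2"/></span></t>
  <edge id="tre1" from="nter1" to="nter0"/>
  <edge id="tre2" from="ter1" to="nter1"/>
  <edge id="tre3" from="nter2" to="nter0" head="yes"/>
  <edge id="tre4" from="ter2" to="nter2"/>
</tree>
"""

def to_brackets(tree_xml):
    """Render a KAF <tree> as a bracketed string, marking heads with '*'."""
    tree = ET.fromstring(tree_xml)
    labels = {nt.get("id"): nt.get("label") for nt in tree.findall("nt")}
    leaves = {t.get("id"): " ".join(tg.get("id") for tg in t.iter("target"))
              for t in tree.findall("t")}
    children, has_parent = {}, set()
    for edge in tree.findall("edge"):
        # Assumption: edges appear in left-to-right sibling order.
        mark = "*" if edge.get("head") == "yes" else ""
        children.setdefault(edge.get("to"), []).append((mark, edge.get("from")))
        has_parent.add(edge.get("from"))

    def render(node):
        if node in leaves:
            return leaves[node]
        inner = " ".join(mark + render(child)
                         for mark, child in children.get(node, []))
        return "(%s %s)" % (labels[node], inner)

    roots = [n for n in labels if n not in has_parent]
    return " ".join(render(r) for r in roots)

bracketed = to_brackets(TREE_SAMPLE)
print(bracketed)
```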

Named Entities

Back to index

A named entity is a term (or a multiword) that clearly identifies one item. The optional Named Entity layer is used to reference terms that are named entities. The Named Entity layer may be used either for referencing single mentioned entities or for referencing multi-mentioned entities: in the latter case, the layer creates clusters of term spans, which we call mentions, each mention referencing the same entity.

We currently use 8 entity types in total, distributed between the 4-class model of CoNLL and the 7-class model of MUC. The lists below show the two models. In OpeNER, we offer at least one 4-class CoNLL-type model for every language and, whenever possible and depending on training set availability, a 7-class MUC-type model too. A combination of these and other datasets is also possible.

CoNLL basic list of entities:

  • Location
  • Person
  • Organization
  • Misc

MUC advanced list of entities:

  • Location
  • Person
  • Organization
  • Date
  • Time
  • Percent
  • Money

An <entity> element has the following attributes:

  • eid: the id of the named entity
  • type: type of the named entity. Currently, 8 values are possible:
    • Person
    • Organization
    • Location
    • Date
    • Time
    • Money
    • Percent
    • Misc

Every named entity may have other optional attributes. The following table shows some cases.

| Type | Optional attributes |
|------|---------------------|
| ORGANIZATION | subtype = "company" |
| LOCATION | subtype = type of location (e.g. "street", "city", "country") |
| DATE | dateISO = "2012/12/31" |
| TIME | timeISO = "15:38:00" |
| MONEY | moneyISO = "100 EUR" |
| MISC | subtype = "car"; country: car registration code |
| MISC | subtype = "phone"; phonetype: type of the phone number (e.g. "imei", "mobile", "landline"); country: the country code |
| MISC | subtype = "personal"; cardtype: type of the personal card (e.g. "passport", "idcard", "driver_license"); country: the country code |
| MISC | subtype = "banking"; banktype: type of bank entity (e.g. "iban", "ccard", "account"); country: the country code |
| MISC | subtype = "internet"; nettype: type of the network entity (e.g. "ipaddress", "macaddress", "email", "url") |

An <entity> element has the following sub-elements:

  • references: this element contains one or more span elements
  • externalReferences (optional): this element contains one or more externalRef elements

A <span> sub-element can be used to reference the different occurrences of the same named entity in the document, or mentions of it, using <target> elements to refer to the target terms. If the entity is composed of multiple words, multiple <target> elements are used. The <target> element has the following attributes:

  • id: tid that refers to the target term.
  • head (optional): a "yes" value indicates that the term is the head of the mention.

The optional <externalRef> sub-element (see External References) is used to associate terms with external lexical or semantic resources, such as elements of a knowledge base. <externalRef> elements have the following attributes:

  • resource (required): indicates the identifier of the resource referred to.
  • reference (required): code of the referred element.
  • confidence (optional): a floating number between 0 and 1. Indicates the confidence weight of the association.

Example:

<entities>
  <entity type="person" eid="e1" >
    ....
  </entity>
  <entity type="organization" eid="e2">
    ....
  </entity>
  <entity type="date" eid="e3">
    ....
  </entity>
</entities>

Examples of entity annotation

Example 1

<entities>
  <entity type="person" eid="e1">
    <references>
      <!-- John Smith -->
      <span>
        <target id="t1"/>
        <target id="t2"/>
      </span>
      <!-- Him -->
      <span>
        <target id="t15"/>
      </span>
    </references>
    <externalReferences>
      <externalRef confidence="0.7" reference="13982343" resource="JRCNames"/>
      <externalRef confidence="0.3" reference="834354" resource="JRCNames"/>
    </externalReferences>
  </entity>

  <entity type="location" subtype="street" eid="e2">
    <references>
      <!-- London -->
      <span>
        <target id="t12" head="yes"/>
      </span>
      <!-- The capital city of England -->
      <span>
        <target id="t1" head="yes"/>
        <target id="t2" head="yes"/>
        <target id="t3" head="yes"/>
        <target id="t4" head="yes"/>
        <target id="t5" head="yes"/>
      </span>
    </references>
    <externalReferences>
      <externalRef reference="ref01011020" resource="GeoNames"/>
      <externalRef reference="http://wwww.wikipedia.com/London" resource="Wikipedia"/>
    </externalReferences>
  </entity>
</entities>

Example 2

After breakfast at the Elia Beach Hotel, I and my wife had a walk to Mykonos. There we were picked up and driven to Piraeus Port, where we had lunch with Mr. Vernicos at the Marine Club.

<entities>
  <entity eid="e1" type="organization">
    <!-- Elia Beach Hotel -->
    <span>
      <target id="t5"/>
      <target id="t6"/>
      <target id="t7"/>
    </span>
  </entity>
  <entity eid="e2" type="location">
    <!-- Mykonos -->
    <span>
      <target id="t16"/>
    </span>
    <!-- There -->
    <span>
      <target id="t17"/>
    </span>
  </entity>
  <entity eid="e3" type="location">
    <!-- Piraeus Port -->
    <span>
      <target id="t25"/>
      <target id="t26"/>
    </span>
  </entity>
  <entity eid="e4" type="person">
    <!-- Mr. Vernicos -->
    <span>
      <target id="t32"/>
      <target id="t33"/>
    </span>
  </entity>
  <entity eid="e5" type="location">
    <!-- Marine Club -->
    <span>
      <target id="t36"/>
      <target id="t37"/>
    </span>
  </entity>
</entities>

Coreferences

Back to index

The optional coreference layer creates clusters of term spans (which we call mentions) which share the same referent. Coreferences are intended to be used for clustering mentions that are not Named Entities.

A <coref> element represents a mention cluster, and within <coref> each mention is represented by a <span> element (which groups the terms of the mention using <target> elements). Additionally, one <target> element within the <span> may have an attribute "head" with value "yes" to indicate that this particular term is the head of the mention. The <coref> element has the following attribute:

  • coid: unique id, starting with the prefix "co".

The <span> element contains one <target> element per term in the mention. The <target> element has the following attributes:

  • id (optional): tid that refers to the target term.
  • head (optional): a "yes" value indicates that the term is the head of the mention.

Example of coreference annotation

Example 1

<coreferences>
  <coref coid="co1">
    <!-- I -->
    <span >
      <target id="t8" head="yes"/>
    </span>
    <!-- we -->
    <span>
      <target id="t18"/>
    </span>
  </coref>
  <coref coid="co2">
    <!-- my wife -->
    <span >
      <target id="t10"/>
      <target id="t11" head="yes"/>
    </span>
    <!-- we -->
    <span>
      <target id="t18"/>
    </span>
  </coref>
</coreferences>

Example 2

After breakfast at the Elia Beach Hotel, I and my wife had a walk to Mykonos. There we were picked up and driven to Piraeus Port, where we had lunch with Mr. Vernicos at the Marine Club.

<coreferences>
  <coref coid="co1">
    <!-- I -->
    <span >
      <target id="t8" head="yes"/>
    </span>
    <!-- we -->
    <span>
      <target id="t18"/>
    </span>
    <!-- we -->
    <span>
      <target id="t28"/>
    </span>
  </coref>
  <coref coid="co2">
    <!-- my wife -->
    <span >
      <target id="t10"/>
      <target id="t11" head="yes"/>
    </span>
    <!-- we -->
    <span>
      <target id="t18"/>
    </span>
    <!-- we -->
    <span>
      <target id="t28"/>
    </span>
  </coref>
</coreferences>
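
A coreference layer can be read into plain Python structures as sketched below. The representation chosen here (a dictionary from coid to a list of mentions, each with its term ids and optional head) is an assumption made for illustration, and the function name mention_clusters is hypothetical.

```python
import xml.etree.ElementTree as ET

# Compact copy of Example 1 above.
COREF_SAMPLE = """
<coreferences>
  <coref coid="co1">
    <span><target id="t8" head="yes"/></span>
    <span><target id="t18"/></span>
  </coref>
  <coref coid="co2">
    <span><target id="t10"/><target id="t11" head="yes"/></span>
    <span><target id="t18"/></span>
  </coref>
</coreferences>
"""

def mention_clusters(coref_xml):
    """Return {coid: [(term_ids, head_id_or_None), ...]} in document order."""
    clusters = {}
    for coref in ET.fromstring(coref_xml).findall("coref"):
        mentions = []
        for span in coref.findall("span"):
            targets = span.findall("target")
            ids = [t.get("id") for t in targets]
            # The head is the first target flagged with head="yes", if any.
            head = next((t.get("id") for t in targets
                         if t.get("head") == "yes"), None)
            mentions.append((ids, head))
        clusters[coref.get("coid")] = mentions
    return clusters

clusters = mention_clusters(COREF_SAMPLE)
print(clusters["co2"])
```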

Features of sentiment analysis

Back to index

Features of hotels may be properties (like cleanliness, location, view) or categories (like rooms, bathrooms, staff). A property is an abstract feature (e.g. view and location), whereas a category is a more concrete and real feature (e.g. staff and rooms). A category may have a relation with a property (e.g. "the view in the rooms is good"). In that case, the relation will be included in the layer. Properties do not have relations with categories (e.g. "the room in the view" does not make sense).

The <features> element may contain a <properties> element and a <categories> element.

The <properties> element contains one or more <property> elements. A <property> element has the following attributes:

  • pid: the unique identifier of the property
  • lemma: lemma of the property

The <categories> element contains one or more <category> elements.

A <category> element has the following attributes:

  • cid: the unique identifier of the category
  • lemma: lemma of the category

<property> and <category> elements have the following sub-elements:

  • references: this element contains one or more reference elements
  • externalReferences (optional): this element contains one or more externalRef elements

A <span> sub-element can be used to reference the different occurrences of the same property or category in the document, using target elements to refer to the target terms. If the feature is composed of multiple words, multiple target elements are used.

The optional <externalRef> sub-element (see 3.4.3) is used to associate terms to external lexical or semantic resources, such as elements of a knowledge base. The <externalRef> element has the following attributes:

  • resource (required): indicates the identifier of the resource referred to.
  • reference (required): code of the referred element.
  • confidence (optional): a floating-point number between 0 and 1 that indicates the confidence weight of the association.

Example

Nothing special really. Comfortable and clean but very boring decor in comparison to other NH hotels. I stayed in NH in Brussels and Zurich and I really liked them because of their modern and stylish design and big rooms. This one was just like any other hotel. Basic rooms with basic and dull decor - bit disappointng. The customer service was average. The rate was very expensive and I stll had to pay for Internet and 20 euros for breakfast!!! It was good but way overpriced! The best thing about the hotel was the location - city centre, 2min from a metro stop.

<features>
  <properties>
    ...
    <property pid="p1" lemma="customer services">
      <references>
        <span>
          <target id="t58" />
          <target id="t59" />
        </span>
      </references>
    </property>
    <property pid="p2" lemma="rate">
      <references>
        <span>
          <target id="t63" />
        </span>
      </references>
    </property>
    <property pid="p3" lemma="location">
      <references>
        <span>
          <target id="t94" />
        </span>
      </references>
    </property>
    ...
  </properties>

  <categories>
    ...
    <category cid="c1" lemma="room">
      <references>
        <span>
          <target id="t39" />
        </span>
        <span>
          <target id="t49" />
        </span>
      </references>
    </category>
    <category cid="c2" lemma="internet">
      <references>
        <span>
          <target id="t74" />
        </span>
      </references>
    </category>
    <category cid="c3" lemma="breakfast">
      <references>
        <span>
          <target id="t79" />
        </span>
      </references>
    </category>
    ...
  </categories>
</features>
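
To make the structure above more concrete, here is a minimal Python sketch that reads a features layer and lists, for each property and category, its lemma and the term ids of each occurrence. It uses a shortened copy of the example (one property, one category) and the standard-library `xml.etree.ElementTree` module. The helper name `read_features` and the output layout are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the features example: one property and one category.
KAF_FEATURES = """
<features>
  <properties>
    <property pid="p1" lemma="customer services">
      <references>
        <span><target id="t58"/><target id="t59"/></span>
      </references>
    </property>
  </properties>
  <categories>
    <category cid="c1" lemma="room">
      <references>
        <span><target id="t39"/></span>
        <span><target id="t49"/></span>
      </references>
    </category>
  </categories>
</features>
"""


def read_features(xml_text):
    """Return {feature id: {kind, lemma, occurrences}} (hypothetical helper).

    Each occurrence is the list of term ids of one <span>, so a multiword
    feature yields a list with several ids.
    """
    root = ET.fromstring(xml_text)
    features = {}
    for tag, id_attr in (("property", "pid"), ("category", "cid")):
        for elem in root.iter(tag):
            occurrences = [
                [t.get("id") for t in span.findall("target")]
                for span in elem.findall("references/span")
            ]
            features[elem.get(id_attr)] = {
                "kind": tag,
                "lemma": elem.get("lemma"),
                "occurrences": occurrences,
            }
    return features


features = read_features(KAF_FEATURES)
for fid, info in features.items():
    print(fid, info)
```

Here the multiword property "customer services" produces a single occurrence with two term ids, while the category "room" produces two occurrences with one term id each.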

Relations

Back to index

This layer represents relations between entities and/or features.

Two entities may have a relation with a degree of confidence. For example: "Bill Clinton lives in New York"

  • Bill Clinton is a Person
  • New York is a Location

There can be relations between entities and features with a degree of confidence. For example: "The rooms in the NH Hotel are expensive."

  • NH Hotel is an Organization (with subtype company)
  • Room is a Category (sub element of the features layer)

Furthermore, there can be relations between features, that is, between a category and a property. For example: "Services in the rooms are good."

  • Service is a Property
  • Room is a Category (sub element of the features layer)

The degree of confidence may be provided using different approaches and processors. For example, if two entities occur in the same sentence, a processor may assign a low confidence score. If they appear in the same chunk, a processor may assign a higher confidence score. If a linguistic processor detects a dependency relation, it may assign a higher score, as happens in the previous example for the named entities "Bill Clinton" and "New York".

A <relation> element contains these attributes:

  • rid: the unique identifier of the relation between two entities
  • from: entity/category/property id of the source element
  • to: entity/category/property id of the target element
  • confidence (optional): a floating-point number between 0 and 1 that indicates the confidence weight of the relation

Example

"Bill Clinton lives in New York"

<entities>
  <entity type="person" eid="e1">
    <!-- Bill Clinton -->
    ...
  </entity>
  <entity type="location" eid="e2">
    <!-- New York -->
  </entity>

  ...
</entities>
<relations>
  <relation from="e1" to="e2"/> <!-- confidence = 100% -->
</relations>

"The rooms in the NH Hotel are expensive."

<entities>
  <entity type="organization" subtype="company" eid="e1">
    <!-- NH HOTEL -->
  </entity>
</entities>
<features>
  <categories>
    <category cid="c1">
      <!-- ROOMS -->
    </category>
  </categories>
</features>
<relations>
  <relation from="c1" to="e1"/> <!-- confidence = 100% -->
</relations>

"Services in the rooms are good.."

<features>
  <properties>
    <property pid="p1">
      <!-- SERVICES -->
    </property>
  </properties>
  <categories>
    <category cid="c1">
      <!-- ROOMS -->
    </category>
  </categories>
</features>
<relations>
  <relation from="p1" to="c1"/> <!-- confidence = 100% -->
</relations>
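
The following Python sketch shows one possible way to resolve the `from` and `to` attributes of each relation to the kind of element they point at (entity, category or property). It combines the last two examples into a single document and relies only on the standard-library `xml.etree.ElementTree` module. The `rid` values "r1" and "r2", the `lemma` values, the `confidence="1.0"` encoding of "100%", and the helper name `read_relations` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Combined sketch of the NH Hotel and "services in the rooms" examples.
# The rid values, lemma values and confidence encoding are assumed.
KAF_DOC = """
<KAF>
  <entities>
    <entity type="organization" subtype="company" eid="e1"/>
  </entities>
  <features>
    <properties><property pid="p1" lemma="service"/></properties>
    <categories><category cid="c1" lemma="room"/></categories>
  </features>
  <relations>
    <relation rid="r1" from="c1" to="e1" confidence="1.0"/>
    <relation rid="r2" from="p1" to="c1"/>
  </relations>
</KAF>
"""

# Attribute that carries the identifier for each kind of element.
ID_ATTRS = {"entity": "eid", "property": "pid", "category": "cid"}


def read_relations(xml_text):
    """Return a list of relations with resolved endpoint kinds."""
    root = ET.fromstring(xml_text)
    # Index every entity, property and category by its identifier.
    kinds = {}
    for tag, attr in ID_ATTRS.items():
        for elem in root.iter(tag):
            kinds[elem.get(attr)] = tag
    relations = []
    for rel in root.iter("relation"):
        conf = rel.get("confidence")  # optional attribute
        relations.append({
            "rid": rel.get("rid"),
            "from": (rel.get("from"), kinds.get(rel.get("from"))),
            "to": (rel.get("to"), kinds.get(rel.get("to"))),
            "confidence": float(conf) if conf is not None else None,
        })
    return relations


relations = read_relations(KAF_DOC)
for rel in relations:
    print(rel)
```

An endpoint whose id is not found in the document resolves to `None`, which a consumer could treat as a dangling reference.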

Opinions

Back to index

The sentiment-related information in KAF is represented at two levels: the (already existing) term layer and a (new) opinion layer. The term layer represents information copied from an external source such as a sentiment lexicon. Information can be stored at both word and sense-synset level. Sentiment information in the term layer is the building block for the sentiment analysis tool, which then tries to generate the information needed to fill the slots (opinion holder, opinion target, opinion expression) of an opinion triplet. These triplets are represented in the opinion layer. Starting from these triplets, final sentiment analysis results can be calculated.

Basic forms of sentiment analysis, where polarity is aggregated at sentence or document level, may ignore the opinion layer triplets and make use of the term-level sentiment only. In these cases (for example, polarity classification of product reviews), the opinion target (i.e. the product) and the opinion holder (i.e. the reviewer) are known beforehand.

However, when more complex sentiment analysis is performed, the opinion triplets are needed. One example is product feature-based sentiment analysis, where the target of the opinion is not just the product but some specific feature of the product. Another example is the analysis of news or blogs, where different people (i.e. opinion holders) may have different opinions about different targets. In those cases, aggregating the polarity values found in the text is not enough.

The <opinion> element has one attribute:

  • oid: the unique identifier of the opinion

The <opinion> element consists of the following sub-elements:

  • opinion_holder: whose opinion it is (the speaker or some actor in the text)
  • opinion_target: what the opinion is about
  • opinion_expression: the expression

<opinion_holder> and <opinion_target> elements have the following sub-element:

  • span: this element spans the target term. Target elements are used to refer to the target term, using term ids (tid). If the term is a multiword, multiple target elements are used.

<opinion_expression> has the following attributes:

  • polarity: refers to the positive or negative orientation of the expression
  • strength: refers to the strength of the expression
  • subjectivity: refers to whether an expression is subjective or not
  • sentiment_semantic_type: refers to sentiment related semantic types like emotion, judgment, belief, speculation
  • sentiment_product_feature: refers to specific features of entities, to be used in feature/aspect-based sentiment analysis

The fillers of the three elements (holder, target, expression) can be found in one sentence or in different ones (cf. ex. 1). Moreover, one sentence can contain multiple opinion triplets (cf. ex. 2).

Example

(1) Vicky Masing,"http://twitter.com/VickyMasing/statuses/245094342799286273",2012-09-10T10:13:24Z,"english","Fountain of Bellagio Las Vegas. One has got to witness that magical show. Definitely beautiful, definitely mesmerizing."

opinion expression: Definitely beautiful, definitely mesmerizing.
polarity (optional): positive
strength (optional): strong
sentiment_semantic_type (optional): evaluation
sentiment_product_feature (optional): ""
opinion holder (optional): Speaker/Writer
opinion target (optional): Fountain of Bellagio Las Vegas

(2) We stayed at The Toren as a transit hotel for our trip from saudi arabia to USA, the hotel was just amazing in all aspects, the service was great, the people working their were really helpful and nice, the food at the breakfast was delicious and fresh and the location is great, very near to tram stops, the central station, Shopping...

opinion expression: really helpful and nice
polarity: positive
strength: strong
sentiment_semantic_type: evaluation
sentiment_product_feature: staff
opinion holder: Speaker/Writer
opinion target: people working there

opinion expression: delicious
polarity: positive
strength: strong
sentiment_semantic_type: evaluation
sentiment_product_feature: breakfast
opinion holder: Speaker/Writer
opinion target: the food at the breakfast

(N.B. there are more opinion triplets in this fragment than the ones given here).

KAF Example

They had a nightmare with Hilton Hotel Paris

<opinion oid="o1">
  <opinion_holder type="" >
    <span>
      <target id="t1"/>
    </span>
  </opinion_holder>
  <opinion_target>
    <span>
      <target id="t6"/>
      <target id="t7"/>
      <target id="t8"/>
    </span>
  </opinion_target>
  <opinion_expression polarity="negative" strength="strong"
    subjectivity="subjectivity" sentiment_semantic_type="evaluation"
    sentiment_product_feature="">
    <span>
      <target id="t3"/>
      <target id="t4"/>
    </span>
  </opinion_expression>
</opinion>
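
As an illustration of how the triplet can be recovered from this layer, the Python sketch below parses the opinion above and maps its spans back to the words of the sentence. It assumes that the term ids t1 to t8 follow the token order of the sentence, which matches the example but is not guaranteed by this section. The names `TERMS`, `span_text` and `read_opinion` are hypothetical.

```python
import xml.etree.ElementTree as ET

KAF_OPINION = """
<opinion oid="o1">
  <opinion_holder type="">
    <span><target id="t1"/></span>
  </opinion_holder>
  <opinion_target>
    <span><target id="t6"/><target id="t7"/><target id="t8"/></span>
  </opinion_target>
  <opinion_expression polarity="negative" strength="strong"
      subjectivity="subjectivity" sentiment_semantic_type="evaluation"
      sentiment_product_feature="">
    <span><target id="t3"/><target id="t4"/></span>
  </opinion_expression>
</opinion>
"""

# Assumption: term ids t1..t8 follow the token order of the sentence.
SENTENCE = "They had a nightmare with Hilton Hotel Paris"
TERMS = {f"t{i}": tok for i, tok in enumerate(SENTENCE.split(), start=1)}


def span_text(parent):
    """Join the words referenced by the <target> elements under parent."""
    if parent is None:
        return None
    return " ".join(TERMS[t.get("id")] for t in parent.findall("span/target"))


def read_opinion(xml_text):
    """Return the holder/target/expression triplet of one <opinion>."""
    opinion = ET.fromstring(xml_text)
    expression = opinion.find("opinion_expression")
    return {
        "oid": opinion.get("oid"),
        "holder": span_text(opinion.find("opinion_holder")),
        "target": span_text(opinion.find("opinion_target")),
        "expression": span_text(expression),
        "polarity": expression.get("polarity"),
        "strength": expression.get("strength"),
    }


triplet = read_opinion(KAF_OPINION)
print(triplet)
```

In a full KAF document the `TERMS` lookup would be built from the terms and word forms layers rather than by splitting the sentence.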

Ways of working with the sentiment-related KAF layers

Shallow polarity aggregation

The most shallow sentiment analysis process uses the sentiment information of the term layer that comes from a polarity lexicon. The polarity information is processed, with or without word sense disambiguation (WSD), applying polarity shifting and intensification rules. The results are not stored in any KAF layer but aggregated, depending on the end-user requirements, up to the level of e.g. chunk, sentence, paragraph, document, set of documents, etc. For example, "the hotel is not(shifter) expensive(negative) but not(shifter) clean(positive) and not(shifter) well-located(positive) either" results in +1(positive) and -2(negative) for this sentence.
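
The counting in this example can be sketched in a few lines of Python. The sketch below is a deliberately simplified stand-in, not the actual processor: it assumes each token already carries a tag (`"shifter"`, `"positive"`, `"negative"` or `None`), lets a shifter flip the polarity of the next polar term, and ignores intensification and WSD entirely. The function name `aggregate_polarity` and the tag names are assumptions.

```python
def aggregate_polarity(tagged_tokens):
    """Count positive and negative terms in one sentence.

    A shifter flips the polarity of the next polar term. This is a
    simplified rule; intensifiers and shifter scope are not modelled.
    """
    flip = False
    positive = negative = 0
    for _token, tag in tagged_tokens:
        if tag == "shifter":
            flip = not flip
            continue
        if tag not in ("positive", "negative"):
            continue  # neutral token, nothing to count
        if flip:
            tag = "negative" if tag == "positive" else "positive"
            flip = False
        if tag == "positive":
            positive += 1
        else:
            negative += 1
    return positive, negative


# The example sentence, with the polarity tags given in the text.
SENTENCE = [
    ("the", None), ("hotel", None), ("is", None),
    ("not", "shifter"), ("expensive", "negative"),
    ("but", None),
    ("not", "shifter"), ("clean", "positive"),
    ("and", None),
    ("not", "shifter"), ("well-located", "positive"),
    ("either", None),
]

positive, negative = aggregate_polarity(SENTENCE)
print(positive, negative)  # 1 2
```

The result, one positive and two negative terms, corresponds to the +1 and -2 quoted for this sentence.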

Analysis with opinion expressions, targets and holders

More complex sentiment analysis makes use of the term and opinion layers. Polarity information of the terms is processed, and shifting and intensification rules are applied. The sentiment analysis process then exploits, for example, syntactic rules to find the opinion holder and target. The results of this process are stored in the opinion layer. For example, "the hotel is not(shifter) expensive(negative) but not(shifter) clean(positive) and not(shifter) well-located(positive) either" results in 3 opinion triplets:

  • opinion1: [[holder: default] [expression(not expensive): [polarity: positive] [strength: average]] [target: hotel]].
  • opinion2: [[holder: default] [expression(not clean): [polarity: negative] [strength: average]] [target: hotel]].
  • opinion3: [[holder: default] [expression(not well-located): [polarity: negative] [strength: average]] [target: hotel]].

The relation between the word "hotel" in this sentence and a specific hotel in the entity layer is established by overlapping spans in the term layer. Again, how the information is further processed, e.g. aggregation of different opinions about one target or of different opinions from one holder, depends on the end-user requirements, and the results are not stored in KAF.
