Classification

Benjamin Meyers edited this page May 22, 2017 · 2 revisions

★ Algorithms

Vincze et al.[1] trained a Linear Conditional Random Fields (CRF) model for detecting semantic uncertainty in English. The backbone of a Linear CRF is the principle of Maximum Entropy, which is also the basis for our equivalent Logistic Regression algorithm (from scikit-learn).
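As a rough illustration of the Logistic Regression side, the sketch below fits scikit-learn's `LogisticRegression` on a handful of binary feature vectors. The feature columns and the toy data are invented for this example and are not the project's real features or training set; the default hyperparameters are an assumption as well.

```python
from sklearn.linear_model import LogisticRegression

# Toy word-level feature vectors (1.0 = feature present, 0.0 = absent).
# The three columns are hypothetical features, not the project's real ones.
X = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
# One label per word: C = certain, U = uncertain.
y = ["U", "U", "C", "C"]

# Default settings; the project's actual hyperparameters are not stated here.
clf = LogisticRegression()
clf.fit(X, y)

predicted = list(clf.predict(X))       # one label per input row
probabilities = clf.predict_proba(X)   # columns follow clf.classes_
```

Here `predict_proba` returns one probability per class for each word, with columns ordered as in `clf.classes_`.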

★ Feature Set

The features used for training are explained in the aforementioned paper, and we provide in-depth explanations and examples on our Feature Set page. Features are word-based, with sentence-level context taken into consideration. Each word has between 13 features (in a sentence of length 1) and 28 features (in a sentence of length 5 or more). Features are then aggregated so that every word ends up with the same number of features. Each feature is binary: it is either present for a word (denoted by 1.0) or absent (0.0).
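One common way to give every word the same number of features is to collect all feature names into a shared vocabulary and encode each word as a fixed-length vector of 1.0 and 0.0 values. The sketch below shows that idea only; it is not necessarily how this project aggregates features, and the feature names (`word=may`, `pos=MD`, and so on) and helper functions are hypothetical.

```python
def build_vocabulary(words_features):
    """Collect every feature name seen for any word, in sorted order."""
    return sorted({name for feats in words_features for name in feats})


def to_vector(feats, vocabulary):
    """Encode one word as 1.0 (feature present) or 0.0 (absent) per vocabulary entry."""
    return [1.0 if name in feats else 0.0 for name in vocabulary]


# Hypothetical feature sets for the words of "may be raining".
words_features = [
    {"word=may", "pos=MD", "next_word=be"},
    {"word=be", "pos=VB", "prev_word=may", "next_word=raining"},
    {"word=raining", "pos=VBG", "prev_word=be"},
]

vocabulary = build_vocabulary(words_features)
vectors = [to_vector(feats, vocabulary) for feats in words_features]
```

Every vector in `vectors` has the same length as `vocabulary`, regardless of how many features the word originally had.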

★ Binary Classification

Our model supports the original binary classification from Vincze et al.[1], where a label of C indicates certainty and a label of U indicates uncertainty. In the same manner as the original, a sentence is considered uncertain if any of the words within that sentence are uncertain.
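The sentence-level rule described above can be sketched as a small helper. The function name `sentence_label_binary` is made up for this example, and the input is assumed to be a list of per-word labels, each either C or U.

```python
def sentence_label_binary(word_labels):
    """Return "U" if any word in the sentence is labeled uncertain, else "C"."""
    return "U" if any(label == "U" for label in word_labels) else "C"


# A single uncertain word makes the whole sentence uncertain.
mixed = sentence_label_binary(["C", "C", "U", "C"])
all_certain = sentence_label_binary(["C", "C", "C"])
```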

★ Multilabel Classification

Our work extends the previous work by breaking the U class into five distinct classes of semantic uncertainty:

  • E: Epistemic - the proposition is possible (based on our knowledge of the universe), but its truth-value cannot be determined.
    • Ex: It may be raining.
  • D: Doxastic - the proposition is assumed to be true or false, but its truth-value cannot be determined.
    • Ex: He believes the Earth is flat.
  • I: Investigation - the proposition is in the process of having its truth-value determined.
    • Ex: We examined the role of fire in the development of civilization.
  • N: Condition - the proposition is true or false based on the truth-value of another proposition.
    • Ex: If it rains, we will stay inside.

The fifth class, U, is a little different from the other four. When a sentence contains two or more of the labels E, D, I, or N, the label that occurs most often is assigned to the sentence. If several uncertainty classes tie for the most occurrences, the sentence is labeled as generally uncertain, U.
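The majority-with-tie rule above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function name `sentence_label_multi` is hypothetical, word labels are assumed to be drawn from C, E, D, I, and N, and a sentence with no uncertainty labels is assumed to be labeled C, as in the binary case.

```python
from collections import Counter

# The four specific uncertainty classes described above.
UNCERTAINTY_CLASSES = {"E", "D", "I", "N"}


def sentence_label_multi(word_labels):
    """Assign a sentence label from per-word labels using the majority rule."""
    counts = Counter(l for l in word_labels if l in UNCERTAINTY_CLASSES)
    if not counts:
        # Assumption: no uncertain words means the sentence is certain.
        return "C"
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    # A unique most frequent class wins; a tie falls back to general uncertainty.
    return winners[0] if len(winners) == 1 else "U"


majority = sentence_label_multi(["E", "C", "E", "D"])
tie = sentence_label_multi(["E", "C", "D"])
```

In this sketch `majority` is E, because E occurs twice and D once, while `tie` is U, because E and D each occur once.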

☕ Footnotes

📃 [1] Vincze, V. (2015). Uncertainty detection in natural language texts (Doctoral dissertation, SZTE).