louismullie edited this page Oct 21, 2012 · 35 revisions

##Configuration

###Encoding

Ruby 1.9 does not parse files with non-ASCII characters unless you specify the encoding. You can do so by adding a Ruby comment at the very top of the file, with the appropriate encoding, e.g.: # encoding: utf-8
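As a minimal illustration (the string is only an example, not something Treat requires), a source file containing accented characters could start like this:

```ruby
# encoding: utf-8
# The magic comment above must be the first line of the file
# (or the second, if the file starts with a shebang line).

quote = "Bienvenue au château!"

puts quote.encoding         # the encoding Ruby assigned to the literal
puts quote.valid_encoding?  # whether the bytes are valid in that encoding
```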

###Verbosity

| Option | Default | Description |
| --- | --- | --- |
| Treat.core.verbosity.silence | true | A boolean value indicating whether to silence the output of external libraries (e.g. Stanford tools, Enju, LDA) when they are used. |
| Treat.core.verbosity.debug | false | A boolean value indicating whether to print debugging information about the steps Treat is performing. |

###Languages

| Option | Default | Description |
| --- | --- | --- |
| Treat.core.language.detect | false | A boolean value indicating whether Treat should try to detect the language of newly input text. |
| Treat.core.language.default | 'english' | The language to default to when detection is off. |
| Treat.core.language.detect_at | :sentence | A symbol representing the finest level at which language detection should be performed if language detection is turned on. |

###Paths

| Option | Default | Description |
| --- | --- | --- |
| Treat.paths.tmp | '$GEM_FOLDER/tmp/' | A directory in which to create temporary files. |
| Treat.paths.files | '$GEM_FOLDER/files/' | A directory in which to store downloaded files. |
| Treat.paths.bin | '/$GEM_FOLDER/bin/' | The directory containing executables and JAR files. |
| Treat.paths.models | '/$GEM_FOLDER/models/' | The directory containing training models. |

###Databases

Currently, Treat only supports MongoDB, although support for more database formats is on the way. You can configure MongoDB as follows:

Treat.databases.mongo.db = 'your_database'
Treat.databases.mongo.host = 'localhost'
Treat.databases.mongo.port = '27017'

##Entities

Textual entities can be created by using the special "builder" methods available in the global namespace. These methods are aliases for the corresponding Treat::Entities::X.build(...) methods, so that Word('hello') is equivalent to Treat::Entities::Word.build('hello').

The DSL is no longer turned on by default. You must include it for the examples below to work:

include Treat::Core::DSL
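To make the aliasing idea concrete, here is a minimal, self-contained sketch of how a builder method could forward to a `build` factory once a DSL module is included. The `Sketch` module and its classes are hypothetical stand-ins written for this example; they are not Treat's actual implementation.

```ruby
# Illustrative stand-ins only; these are not Treat's real classes.
module Sketch
  module Entities
    class Word
      attr_reader :value

      def initialize(value)
        @value = value
      end

      # Factory method, playing the role of Treat::Entities::Word.build.
      def self.build(value)
        new(value)
      end
    end
  end

  # Mixing this module in makes Word(...) callable as a plain method.
  module DSL
    def Word(value)
      Sketch::Entities::Word.build(value)
    end
  end
end

include Sketch::DSL

word = Word('hello')
puts word.class  # Sketch::Entities::Word
puts word.value  # hello
```

Keeping the builders in a module that must be included explicitly avoids adding capitalized methods to the global namespace unless the caller asks for them.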

###Creating Textual Entities From Strings

# Create a word
word = Word('run')
	
# Create a phrase
phrase = Phrase('am running')

# Create a sentence
sentence = Sentence('Welcome to Treat!')

# Create a section
section = Section("A small text\nA factitious paragraph.")

###Creating Documents From Files or URLs

# If a filename is supplied, the file format and 
# the appropriate parser to use will be determined
# based on the file extension:
d = Document('text.extension')
# N.B. Supports .txt, .doc, .htm(l), .abw, .odt;
# .pdf can be parsed with poppler-utils installed
# .doc can be parsed with antiword installed
# .jpg, .gif or .png with ocropus installed
# Tip: `port install ocropus poppler antiword`

# If a URL is specified, the file will be downloaded
# and then parsed as a regular file:
d = Document('http://www.example.com/XX/XX')
# N.B. By default, files will be downloaded into the
# '/files/' folder of the gem's directory. This can
# be changed by modifying 'Treat.paths.files'.
# N.B. The format is assumed to be HTML if the URL
# does not have any file extension. Otherwise, it
# is determined based on the file extension.

# If a hash is provided, that hash will be used as
# a selector to retrieve a document from the DB.
# This can be done based on the ID of the document:
d = Document({id: 103757301323})
# Or through any given feature of a certain document:
d = Document({'features.file' => 'somefile.txt'})
# N.B. Currently, MongoDB is the only supported DB.
# You need to configure the database adapter before
# using this particular way of retrieving documents.
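
The extension-based format detection described in the comments above can be sketched roughly as follows. The `guess_format` helper, the `FORMATS_BY_EXTENSION` table and the format symbols are assumptions made for this example; Treat's own lookup may differ.

```ruby
require 'uri'

# Hypothetical mapping from file extension to a format symbol,
# based on the extensions listed above.
FORMATS_BY_EXTENSION = {
  'txt' => :txt, 'doc' => :doc, 'htm' => :html, 'html' => :html,
  'abw' => :abw, 'odt' => :odt, 'pdf' => :pdf,
  'jpg' => :image, 'gif' => :image, 'png' => :image
}.freeze

# Returns a format symbol for a file name or URL, or nil if unknown.
def guess_format(source)
  is_url = source.start_with?('http://', 'https://')
  path = is_url ? URI.parse(source).path.to_s : source
  extension = File.extname(path).delete('.').downcase
  # A URL without any extension is assumed to be an HTML page.
  return :html if is_url && extension.empty?
  FORMATS_BY_EXTENSION[extension]
end

puts guess_format('text.txt').inspect
puts guess_format('http://www.example.com/XX/XX').inspect
```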

###Creating Collections from Folders

A collection is a set of documents that are grouped together.

# If an existing folder is passed to the builder,
# that folder is recursively searched for files 
# with any of the supported formats, and these 
# files are loaded into the collection object:
c = Collection('existing_folder')

# If a non-existing folder is passed to the builder,
# that folder is created and the collection is opened:
c = Collection('new_folder')

# If a collection has been created with a folder,
# documents added to the collection are copied 
# to the newly created folder:
c = Collection('some_folder')
d = Document 'http://www.someurl.com'
c << d

# If a hash is passed to the builder, that hash
# will be used as a selector to retrieve documents
# from the DB, and a collection containing these 
# documents will be loaded. An empty hash loads 
# all documents in the DB:
c = Collection({})
# You can also use any arbitrary feature that 
# you have defined on the documents to create 
# a collection on-the-fly:
c = Collection({'features.topic' => 'news'})
# N.B. Currently, MongoDB is the only supported DB.
# You need to configure the database adapter before
# using this particular way of creating collections.
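
As a rough sketch of the recursive folder search mentioned above, the following lists every file with a supported extension under a folder. The `supported_files` helper and the extension list are assumptions for illustration, not Treat's code.

```ruby
require 'tmpdir'
require 'fileutils'

# Extensions treated as supported in this sketch (an assumption).
SUPPORTED_EXTENSIONS = %w[txt doc htm html abw odt pdf jpg gif png].freeze

# Recursively lists the files under `folder` that have a supported extension.
def supported_files(folder)
  Dir.glob(File.join(folder, '**', '*')).select do |path|
    extension = File.extname(path).delete('.').downcase
    File.file?(path) && SUPPORTED_EXTENSIONS.include?(extension)
  end.sort
end

# Build a throwaway folder to demonstrate the search.
Dir.mktmpdir do |folder|
  FileUtils.mkdir_p(File.join(folder, 'nested'))
  File.write(File.join(folder, 'a.txt'), 'A small text')
  File.write(File.join(folder, 'nested', 'b.html'), '<p>Hello</p>')
  File.write(File.join(folder, 'ignored.bin'), 'binary')
  puts supported_files(folder).map { |path| File.basename(path) }.inspect
end
```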

Visualizing Entities

Three useful ways to visualize entities (in addition, of course, to #inspect) are the tree, graph and standoff (tag-bracketed) formats.

| Format | Example | Description |
| --- | --- | --- |
| Tree | entity.visualize :tree | Outputs a tree representation of any kind of entity in a terminal-friendly format. |
| Graph | entity.visualize :dot, file: 'test.dot' | Outputs a DOT graph representation of the entity in Graphviz format. |
| Standoff | sentence.visualize :standoff | Outputs a tag-bracketed version of a tagged sentence (only works on sentences). |

Serializing Entities

Treat currently provides mongo, xml and yaml serialization. Deserialization methods are also shown in the examples below.

| Format | Serialization Example | Deserialization Example | Description |
| --- | --- | --- | --- |
| MongoDB | doc.serialize :mongo, db: 'testing' | doc = Document({id: your_doc_id}) | Serializes the entity and its whole subtree in a single document, in a collection with a name derived from the type of that entity (e.g. "documents"). See the Mongo configuration options for details. |
| XML | page.serialize :xml, file: 'test.xml' | page = Page('test.xml') | Serializes the entity to the Treat XML format. |
| YAML | sentence.serialize :yaml, file: 'test.yml' | sentence = Sentence('test.yml') | Serializes the entity to YAML format using Psych. |
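
To give a feel for what a YAML round trip involves, here is a small sketch using Ruby's standard `yaml` library (Psych). The hash layout is an assumption made up for this example and does not reflect the structure Treat actually writes.

```ruby
require 'yaml'

# A plain-hash stand-in for a sentence entity and its children.
sentence = {
  'type' => 'sentence',
  'value' => 'Welcome to Treat!',
  'children' => [
    { 'type' => 'word', 'value' => 'Welcome' },
    { 'type' => 'word', 'value' => 'to' },
    { 'type' => 'word', 'value' => 'Treat' },
    { 'type' => 'punctuation', 'value' => '!' }
  ]
}

yaml_text = YAML.dump(sentence)   # serialize to a YAML string
restored  = YAML.load(yaml_text)  # deserialize back into a hash

puts yaml_text
puts restored == sentence
```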

String-to-Entity and Number-to-Entity Casting

If a method defined by Treat is called on a plain old string object, that method will be caught through method_missing(), and Treat will attempt to cast the string object to the proper type of textual entity using String#to_entity. The method will then be called on the result of that casting. The advantage of this is that, in many cases, it is not even necessary to create a Treat object using one of the builders to access the desired information.

For example, let's say we do something like 'glass'.synonyms. The string 'glass' will be cast to an object of type Treat::Entities::Word, which happens to have a hook to respond to synonyms. Thus, the synonyms of 'glass' will be returned. In other words, "glass".synonyms is shorthand for "glass".to_entity.synonyms, which itself is equivalent to Word("glass").synonyms.

If a method defined by Treat is called on a Numeric object, that object will automatically be cast to a Treat::Entities::Number object.
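
The forwarding mechanism can be sketched with a few lines of plain Ruby. Everything here is hypothetical: `SketchWord`, its `shout` method and `CastableString` exist only for this example, and a String subclass is used instead of patching String itself.

```ruby
# Stand-in for a word entity; `shout` is a made-up method for the demo.
class SketchWord
  def initialize(text)
    @text = text
  end

  def shout
    @text.upcase + '!'
  end
end

# A String subclass is used so that the core String class is left untouched.
class CastableString < String
  def to_entity
    SketchWord.new(to_s)
  end

  # Unknown methods are forwarded to the entity when it supports them.
  def method_missing(name, *args, &block)
    entity = to_entity
    return entity.public_send(name, *args, &block) if entity.respond_to?(name)
    super
  end

  # Keeps respond_to? consistent with what method_missing can handle.
  def respond_to_missing?(name, include_private = false)
    to_entity.respond_to?(name, include_private) || super
  end
end

puts CastableString.new('glass').shout  # GLASS!
```

Methods that the entity does not support still raise NoMethodError through `super`, so typos are not silently swallowed.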

Casting Examples

Consider the following examples to further illustrate how casting works.

| Operation Performed | Resulting Type |
| --- | --- |
| "A syntactical phrase".to_entity | Treat::Entities::Phrase |
| "A super little sentence.".to_entity | Treat::Entities::Sentence |
| "Sentence number one. Sentence number two.".to_entity | Treat::Entities::Paragraph |
| "A title\nA short amount of text.".to_entity | Treat::Entities::Section |
| 20.to_entity | Treat::Entities::Number |
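
One way to picture the casting rules is as a small classifier. The heuristics in this sketch are assumptions chosen only to reproduce the examples above; the `guess_entity_type` helper is hypothetical and Treat's real rules are likely more careful.

```ruby
# Guesses an entity type from plain Ruby values. These rules are
# assumptions chosen to reproduce the casting examples, nothing more.
def guess_entity_type(value)
  return :number if value.is_a?(Numeric)

  text = value.to_s.strip
  return :section if text.include?("\n")

  # Count runs of sentence-ending punctuation followed by a space or the end.
  enders = text.scan(/[.!?]+(?=\s|\z)/).size
  return :paragraph if enders > 1
  return :sentence if enders == 1
  return :phrase if text.match?(/\s/)

  :word
end

puts guess_entity_type('A syntactical phrase')      # phrase
puts guess_entity_type('A super little sentence.')  # sentence
puts guess_entity_type(20)                          # number
```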

##Text Processing

About Text Processors

The first step once a textual entity has been created is usually to split it into smaller pieces that are more useful to work with. Treat lets you successively split a text into logical zones, sentences, syntactical phrases and, finally, tokens (which include words, numbers, punctuation, etc.). All text processors work destructively on the receiving object and return the modified object. They add the results of their operations to the @children hash of the receiving object. Note that each of these methods is only available on specific types of entities, which are named in each section below.

Workers for Text Processors

Also note that when called without any options, each of the text processing tasks will be done using the default worker. The default worker will be determined based on the format of the supplied file (for chunkers) or the language of the text (for segmenters, tokenizers and parsers). You can specify a non-default worker by passing it as a symbol to the method, e.g. paragraph.segment :punkt, sentence.parse :enju, sentence.tokenize :stanford, etc.

###Chunkers

Chunkers split a document into its logical sections (a page or block) and zones (a title, paragraph or list). A section contains at least one zone, and usually contains a title along with one or more paragraphs and/or lists.

d = Document('doc.html').chunk

###Segmenters

Segmenters split a zone of text (a title or a paragraph) into sentences.

p = Paragraph('A walk in the park. A trip on a boat.').segment

Available Processors

Name Description Reference
| Name | Description | Reference |
| --- | --- | --- |
| srx | Sentence segmentation based on a set of predefined rules defined in SRX (Segmentation Rules eXchange) format and developed by Marcin Milkowski. | Marcin Milkowski, Jaroslaw Lipski, 2009. Using SRX standard for sentence segmentation in LanguageTool, in: Human Language Technologies as a Challenge for Computer Science and Linguistics. |
| tactful | Sentence segmentation based on a Naive Bayesian statistical model. Trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English. | Dan Gillick. 2009. Sentence Boundary Detection and the Problem with the U.S. University of California, Berkeley. |
| punkt | Sentence segmentation based on a set of log-likelihood-based heuristics to infer abbreviations and common sentence starters from a large text corpus. Easily adaptable but requires a large (unlabeled) in-domain corpus for assembling statistics. | Kiss, Tibor and Strunk, Jan. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32:485-525. |
| stanford | Detects sentence boundaries by first tokenizing the text and deciding whether periods are sentence ending or used for other purposes (abbreviations, etc.). The obtained tokens are then grouped into sentences. | - |
| scalpel | Sentence segmentation based on a set of predefined rules that handle a large number of usage cases of sentence enders. The idea is to remove all cases of .!? being used for other purposes than marking a full stop before naively segmenting the text. | - |
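
To show why dedicated segmenters are worth having, here is a deliberately naive splitter written for this example. It is not one of the workers listed above; the `naive_segment` helper is hypothetical and simply cuts after every run of sentence-ending punctuation.

```ruby
# Splits text after every run of '.', '!' or '?'. No abbreviation handling.
def naive_segment(text)
  text.scan(/[^.!?]+[.!?]+/).map(&:strip)
end

puts naive_segment('A walk in the park. A trip on a boat.').inspect
# ["A walk in the park.", "A trip on a boat."]

# Abbreviations break the naive approach:
puts naive_segment('Dr. Smith arrived.').inspect
# ["Dr.", "Smith arrived."]
```

The second call shows the kind of mistake that rule-based or statistical segmenters are designed to avoid.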

###Tokenizers

Tokenizers split groups of words (sentences, phrases and fragments) into tokens.

s = Sentence('An uninteresting sentence, yes it is.').tokenize

Available Processors

| Name | Description | Reference |
| --- | --- | --- |
| ptb | Tokenization based on the tokenizer developed by Robert MacIntyre in 1995 for the Penn Treebank project. This tokenizer follows the conventions used by the Penn Treebank, except that by default it will not change double quotes to directional quotes. | Robert MacIntyre. 1995. Reference implementation for PTB tokenization. University of Pennsylvania. |
| stanford | Tokenization provided by the Stanford Penn Treebank-style tokenizer. Most punctuation is split from adjoining words, double quotes (") are changed to doubled single forward and backward quotes (`` and ''), verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. | - |
| tactful | Tokenization script lifted from the 'tactful-tokenizer' gem. | - |
| punkt | Tokenization script from the 'punkt-segmenter' Ruby gem. | - |
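
For intuition, a very rough tokenizer can be written in one line. This sketch is not any of the workers listed above; the `naive_tokenize` helper is hypothetical and ignores contractions, quotes and the other Penn Treebank conventions.

```ruby
# Splits a sentence into runs of word characters and single punctuation marks.
def naive_tokenize(sentence)
  sentence.scan(/\w+|[^\w\s]/)
end

puts naive_tokenize('An uninteresting sentence, yes it is.').inspect
# ["An", "uninteresting", "sentence", ",", "yes", "it", "is", "."]
```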

###Parsers

Parsers parse groups of words (sentences, phrases and fragments) into their syntactic tree.

s = Sentence('The prospect of an Asian arms race is genuinely frightening.').parse

###Chains

You can chain any number of processors using do. This allows rapid splitting of any textual entity down to the desired granularity. The tree of the receiver will be recursively searched for entities to which each of the supplied processors can be applied.

sect = Section  "A walk in the park\n"+
'Obama and Sarkozy met this friday to investigate ' +
'the possibility of a new rescue plan. The French ' +
'president Sarkozy is to meet Merkel next Tuesday.'

sect.do(:chunk, :segment, :parse)

Non-Default Text Processors

As when calling the text processors directly, you can specify a non-default text processor to use when chaining them. The name of the text processor becomes a hash key, pointing to a value which represents the worker to use:

sect.do(:chunk => :txt, :segment => :punkt, :parse => :stanford)

Note that you should not mix and match the "default worker" syntax with the "non-default worker" syntax, e.g.:

sect.do(:chunk => :txt, :segment, :tokenize)  # INCORRECT!

Instead, to keep the default workers for a text processing task, use the :default worker:

sect.do(:chunk => :txt, :segment => :default, :tokenize => :default)
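
The two calling styles can be pictured as being normalized into one list of task and worker pairs. The sketch below is an assumption about how such normalization might look; `normalize_tasks` is a hypothetical helper, not Treat's internals. Note also that in Ruby itself, hash-style arguments must come last in a call, so a bare symbol after a `key => value` pair is rejected by the parser.

```ruby
# Turns either calling style into an array of [task, worker] pairs.
def normalize_tasks(*tasks)
  tasks.flat_map do |task|
    case task
    when Hash   then task.map { |name, worker| [name, worker] }
    when Symbol then [[task, :default]]
    else raise ArgumentError, "unsupported task: #{task.inspect}"
    end
  end
end

puts normalize_tasks(:chunk, :segment, :parse).inspect
# [[:chunk, :default], [:segment, :default], [:parse, :default]]

puts normalize_tasks(:chunk => :txt, :segment => :default).inspect
# [[:chunk, :txt], [:segment, :default]]
```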

##Annotations

Annotations are a type of metadata which can be grafted to a textual entity to help in further classification tasks. All annotators work destructively on the receiving object, and store their result in the @features hash of that object. Each annotator is available only on specific types of entities (it makes no sense to get the synonyms of a sentence, or the topics of a word). This section is split by entity type, and lists the annotations that are available on each particular type of entity.

Note that in many of the following examples, transparent string-to-entity casting is used. Refer to the relevant section above for more information.

###Manual Annotations

You can set your own arbitrary annotations on any entity by using set, check if an annotation is defined on an entity by using has?, and retrieve an annotation by using get:

w = Word('hello')
w.set  :topic, "conversation"
w.has? :topic  # => true
w.get  :topic  # => "conversation"
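
Conceptually, these three methods behave like thin wrappers around a features hash. The `SketchEntity` class below is a hypothetical stand-in written to illustrate that idea; it is not Treat's entity class.

```ruby
# Minimal stand-in for an entity that stores annotations in a hash.
class SketchEntity
  attr_reader :value, :features

  def initialize(value)
    @value = value
    @features = {}
  end

  # Stores an annotation under the given feature name.
  def set(feature, value)
    @features[feature] = value
  end

  # Tells whether the annotation has been defined.
  def has?(feature)
    @features.key?(feature)
  end

  # Returns the annotation, or nil when it is not defined.
  def get(feature)
    @features[feature]
  end
end

entity = SketchEntity.new('hello')
entity.set :topic, 'conversation'
puts entity.has?(:topic)  # true
puts entity.get(:topic)   # conversation
```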

###Non-Specific Annotations

Language

The "language" annotation is available on all types of entities. Note that Treat.core.language.detect must be set to true for language detection to be performed when the language method is called. Otherwise, Treat.core.language.default will be returned regardless of the actual content of the entity.

    Treat.core.language.detect = true

    a = "I want to know God's thoughts; the rest are details. - Albert Einstein"
    b = "El mundo de hoy no tiene sentido, así que ¿por qué debería pintar cuadros que lo tuvieran? - Pablo Picasso"
    c = "Un bon Allemand ne peut souffrir les Français, mais il boit volontiers les vins de France. - Goethe"
    d = "Wir haben die Kunst, damit wir nicht an der Wahrheit zugrunde gehen. - Friedrich Nietzsche"

    puts a.language    # => :english
    puts b.language    # => :spanish
    puts c.language    # => :french
    puts d.language    # => :german

###Annotations Available on Words

Part of Speech Tags

'running'.tag                   # => "VBG"
'running'.category              # => "verb"
'inflection'.tag                # => "NN"
'inflection'.category           # => "noun"

Available Annotators

| Name | Description | Reference |
| --- | --- | --- |
| lingua | POS tagging using part-of-speech statistics from the Penn Treebank to assign POS tags to English text. The tagger applies a bigram (two-word) Hidden Markov Model to guess the appropriate POS tag for a word. | - |
| brill | POS tagging using a set of rules developed by Eric Brill. | Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the third conference on Applied natural language processing. |
| stanford | POS tagging using (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. | Toutanova, Manning, Klein and Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. |
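
The relationship between a Penn Treebank tag and a coarse word category can be sketched as a lookup table. The tags below are standard Penn Treebank tags, but the category names other than "noun" and the `category_for` helper are assumptions made for this example.

```ruby
# Partial mapping from Penn Treebank tags to coarse categories.
PTB_TAG_CATEGORIES = {
  'NN' => 'noun', 'NNS' => 'noun', 'NNP' => 'noun', 'NNPS' => 'noun',
  'VB' => 'verb', 'VBD' => 'verb', 'VBG' => 'verb',
  'VBN' => 'verb', 'VBP' => 'verb', 'VBZ' => 'verb',
  'JJ' => 'adjective', 'JJR' => 'adjective', 'JJS' => 'adjective',
  'RB' => 'adverb', 'RBR' => 'adverb', 'RBS' => 'adverb'
}.freeze

# Returns the coarse category for a tag, or 'unknown' if it is not listed.
def category_for(tag)
  PTB_TAG_CATEGORIES.fetch(tag, 'unknown')
end

puts category_for('VBG')  # verb
puts category_for('NN')   # noun
```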

Synonyms, Antonyms, Hypernyms and Hyponyms

'ripe'.synonyms
	# => ["mature", "ripe(p)", "good", "right", "advanced"]
'ripe'.antonyms
	# => ["green", "unripe", "unripened", "immature"]
'coffee'.hypernyms
	# => ["beverage", "drink",  [...], "drinkable", "potable"]
'juice'.hyponyms
	# => ["lemon_juice", "lime_juice", [...], "digestive_fluid"]

Word Stemming

'running'.stem			# => "run"
'inflection'.stem		# => "inflect"

Available annotators

| Name | Description | Reference |
| --- | --- | --- |
| porter | Stemming using a native Ruby implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer which is very widely used and is considered the de-facto standard algorithm used for English stemming. | Porter, 1980. An algorithm for suffix stripping. Program, vol. 14, no. 3, p. 130-137. |
| porter_c | Stemming using a wrapper for a C implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer which is very widely used and is considered the de-facto standard algorithm used for English stemming. | Porter, 1980. An algorithm for suffix stripping. Program, vol. 14, no. 3, p. 130-137. |
| uea | Stemming using the UEA algorithm, a stemmer that operates on a set of rules which are used as steps. There are two groups of rules: the first to clean the tokens, and the second to alter suffixes. | Jenkins, Marie-Claire, Smith, Dan, Conservative stemming for search and indexing, 2005. |

Noun and Adjective Declensions

'inflection'.plural		# => "inflections"
'inflections'.singular		# => "inflection"

Verb Inflections

'running'.infinitive		# => "run"
'run'.present_participle	# => "running"
'runs'.plural_verb		# => "run"

Ordinals & Cardinals

20.ordinal			# => "twentieth"
20.cardinal			# => "twenty"

Named Entity Tags

The annotation :name_tag lets you retrieve person, location and time expressions in texts.

    p = Paragraph "Obama and Sarkozy met on January 1st to investigate the possibility of a new rescue plan. " +
        "President Sarkozy is to meet Merkel next Tuesday in Berlin."

    p.do(:chunk, :segment, :tokenize, :name_tag)

###Annotations Available on Groups (Sentences, Phrases and Fragments)

Date and Time

The annotation :time lets you retrieve natural language expressions describing events in time.

    s = Section  "A bad day for leaders\n2011-12-23 - Obama and Sarkozy announced that they will start meeting every Tuesday."
   
    s.do(:chunk, :segment, :parse, :time)

Available Annotators

| Name | Description |
| --- | --- |
| chronic | Time/date extraction using a rule-based, pure Ruby natural language date parser. |
| ruby | Date extraction using Ruby's standard library DateTime.parse() method. |
| nickel | Time extraction using a pure Ruby natural language time parser. |

TF*IDF

The annotation :tf_idf lets you get the TF*IDF score of a word within its parent collection.

    c = Collection('economist')

    c.words[0].tf_idf
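
For readers unfamiliar with the measure, here is a small worked sketch of one common TF*IDF variant (raw term frequency times the natural log of the inverse document frequency). The exact weighting Treat uses is not stated here, so the formula, the helper names and the toy collection are all assumptions.

```ruby
# Share of the document's tokens that are equal to the term.
def term_frequency(term, document)
  document.count(term).to_f / document.size
end

# Log of (number of documents / number of documents containing the term).
def inverse_document_frequency(term, collection)
  containing = collection.count { |document| document.include?(term) }
  return 0.0 if containing.zero?
  Math.log(collection.size.to_f / containing)
end

def tf_idf(term, document, collection)
  term_frequency(term, document) * inverse_document_frequency(term, collection)
end

# Each document is represented as an array of tokens.
collection = [
  %w[markets fell as markets reacted],
  %w[the markets rose],
  %w[the election results]
]

puts tf_idf('markets', collection[0], collection)   # frequent here, found in 2 of 3 documents
puts tf_idf('election', collection[2], collection)  # found in only 1 of 3 documents
```

A term that appears in every document gets a score of zero under this variant, which is why very common words rarely surface as keywords.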

###Annotations Available on Documents, Sections and Zones

Keywords

The annotation :keywords lets you retrieve the keywords of a document, section or zone. Uses a naive TF*IDF approach, i.e. the relevant document/section/zone must

    c = Collection('economist')
    c.do(:chunk, :segment, :tokenize, :keywords)

General Topic

The annotation :topics lets you retrieve the general topic of a document, section or zone. Uses a model trained on a large set of Reuters articles.

    s = Paragraph 'Michigan, Ohio (Reuters) - Unfortunately, the RadioShack is closing.'
    s.do(:segment, :tokenize, :topics)

###Annotations Available on Collections

Topic Words

The annotation :topic_words allows you to retrieve clusters of topics within documents. Uses Latent Dirichlet Allocation (LDA).

    c = Collection('economist')
    c.do(:chunk, :segment, :tokenize)
    puts c.topic_words(
      :lda,
      :num_topics => 10,
      :words_per_topic => 5,
      :iterations => 20
    ).inspect

##Computers

About Computers

By contrast with processors and annotators, which modify the receiving object, computers perform operations that leave the receiving entity untouched. [Work in progress.]

Computers Available on All Entities

Serializers

Serializers let you persist entities to disk or to a database.

d = Document('index.html')
d.serialize :format

Computers Available on Collections

Indexers

This builds a searchable index for a collection.

c = Collection('folder')
c.index 

Searchers

This lets you retrieve documents inside a collection by searching through the index.

c.search(:q => 'some query').each do |doc|
  # Process the document of interest
end
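
To illustrate the general idea of indexing followed by searching, here is a tiny inverted index in plain Ruby. It is a sketch under simple assumptions (lowercased word matching, all query terms required); `build_index` and `search_index` are hypothetical helpers and do not describe the search backend Treat uses.

```ruby
# Maps each lowercased word to the names of the documents containing it.
def build_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each do |name, text|
    text.downcase.scan(/\w+/).uniq.each { |term| index[term] << name }
  end
  index
end

# Returns the names of the documents that contain every word of the query.
def search_index(index, query)
  terms = query.downcase.scan(/\w+/)
  return [] if terms.empty?
  terms.map { |term| index.fetch(term, []) }.reduce(:&)
end

documents = {
  'a.txt' => 'The markets fell sharply',
  'b.txt' => 'The election results are in'
}
index = build_index(documents)

puts search_index(index, 'the markets').inspect  # ["a.txt"]
puts search_index(index, 'the').inspect          # ["a.txt", "b.txt"]
```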

##Support for Other Languages

Current support for other languages is as follows, in theory (most of these models are untested):

  • Parsers: English, French, German, Arabic, Chinese.
  • Segmenters: Dutch, English, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish and Swedish.
  • Taggers: English, French, German, Arabic, Chinese.
  • Tag sets:
    • Penn Treebank for English
    • Stuttgart-Tübingen Tag Set for German
    • Paris7 Tag Set for French

Refer to configuration options above for how to set language defaults or automatic language detection appropriately.

Example: POS Tagging with Language Detection

Treat.core.language.detect = true

s1 = "I want to know God's thoughts; the rest are details - AE. "
s2 = "Bienvenue au château! Mettez-vous tous bien à l'aise - SI."
par = Paragraph(s1 + s2)

par.do :segment, :tag, :category

##Dynamically Extending Treat

Extending Treat with your own processors, annotators or computers is extremely simple. You can dynamically define your own pluggable workers by adding them in the right group. Once this is done, the algorithm will be available on the right types of entities. For example, if you add a stemmer, it will be callable on words or strings representing words. If you add a tokenizer, it will be callable on any phrase/sentence or a string representing one.

Stemmer Example

Here is a dummy stemmer that removes the last letter of a word:

Treat::Workers::Inflectors::Stemmers.add(:dummy) do |word, options={}| 
  word.to_s[0..-2]
end

'dummy'.stem(:dummy)     # => "dumm"

Tokenizer Example

Here is a tokenizer that naively splits on space characters:

Treat::Workers::Processors::Tokenizers.add(:dummy) do |sentence, options={}| 
  sentence.to_s.split(' ').each do |token|
    sentence << Treat::Entities::Token.from_string(token)
  end
end

s = Sentence('A sentence to tokenize.')
s.tokenize(:dummy)
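
The add-a-block pattern used in both examples can be pictured as a small registry that stores blocks by name and looks them up when a worker is requested. The `SketchStemmers` module below is a hypothetical stand-in for illustration; it makes no claim about how Treat stores its workers.

```ruby
# Minimal registry: workers are blocks stored under a symbolic name.
module SketchStemmers
  @workers = {}

  # Registers a block as the worker called `name`.
  def self.add(name, &block)
    @workers[name] = block
  end

  # Runs the named worker on a word, raising if the name is unknown.
  def self.call(name, word, options = {})
    worker = @workers.fetch(name) do
      raise ArgumentError, "unknown worker: #{name}"
    end
    worker.call(word, options)
  end
end

# Same dummy stemmer as above: drop the last letter of the word.
SketchStemmers.add(:dummy) do |word, options = {}|
  word.to_s[0..-2]
end

puts SketchStemmers.call(:dummy, 'dummy')  # dumm
```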