Version 0.4.0
pierrec committed Nov 6, 2012
1 parent 53ec9bd commit b7a0c55
Showing 21 changed files with 255 additions and 191 deletions.
8 changes: 4 additions & 4 deletions History.md
@@ -1,4 +1,4 @@
0.4.0 / 2012-10-30
0.4.0 / 2012-11-06
==================

* Major internal refactoring in defining and running rules and subrules:
@@ -8,12 +8,12 @@
* `Atok#write(data)`
* `data` (_String_ | _Buffer_): always pass either type. Use `Atok#setEncoding()` when using strings (default=utf-8).
* `Atok#addRule(pattern...type)`
* `pattern` (_String_ | _Buffer_): Buffers can be used instead of strings, except for {start,end} and {firstOf} patterns.
* `pattern` (_String_ | _Buffer_): Buffers can be used instead of strings, except for {start,end} and {firstOf} subrules.
* Behaviour changes:
* Subrules defined as a Number now do not apply the following ones to the matched token. Use another Atok instance to emulate previous behaviour.
* {firstOf} subrules cannot be in first position. Use addRule('', {firstOf}) instead.
* Custom subrules returning 0 __must__ set continue() properly to avoid potential infinite loops.
* All rules are cleared after a saveRuleSet()
* Function subrules returning 0 __must__ set continue() properly to avoid potential infinite loops.
* currentRule property is now a method
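The last entry changes how calling code reads the current rule set name. The sketch below illustrates the migration with a hypothetical stand-in object (`FakeTokenizer` is not part of atok); only the body of `currentRule()` is taken from this commit's `lib/tokenizer.js`.

```javascript
// Hypothetical stand-in for an Atok instance, reduced to the one field
// that currentRule() reads. Not atok's real constructor.
function FakeTokenizer () {
  this._firstRule = null
}

// 0.4.0 style: a method derived from the rule currently being run
FakeTokenizer.prototype.currentRule = function () {
  return this._firstRule ? this._firstRule.currentRule : null
}

var tok = new FakeTokenizer()
var before = tok.currentRule()            // no rule loaded yet
tok._firstRule = { currentRule: 'main' }  // simulate a loaded rule set
var after = tok.currentRule()

// Pitfall when migrating 0.3.x code: the bare property is now a function,
// so a comparison written for the old property never matches.
var stale = tok.currentRule === 'main'
```

Code written against 0.3.x that compares `atok.currentRule` to a string will keep running but silently stop matching, so call sites are worth searching for after upgrading.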

0.3.2 / 2012-09-16
==================
6 changes: 2 additions & 4 deletions README.md
@@ -3,24 +3,22 @@

## Overview

Atok is a fast, easy and flexible tokenizer designed for use with [node.js](http://nodejs.org). It is based around the [Stream](http://nodejs.org/docs/latest/api/streams.html) concept and is implemented as a read/write one.
Atok is a fast, easy and dynamic tokenizer designed for use with [node.js](http://nodejs.org). It is based around the [Stream](http://nodejs.org/docs/latest/api/streams.html) concept and is implemented as a read/write one.

It was originally inspired by [node-tokenizer](https://github.com/floby/node-tokenizer), but quickly grew into its own form as I wanted it to be RegExp agnostic so it could be used on node Buffer instances and more importantly *faster*.

Atok is built using [ekam](https://github.com/pierrec/node-ekam) as it abuses includes and dynamic method generation.

Atok is the foundation for the [atok-parser](https://github.com/pierrec/node-atok-parser), which provides the environment for quickly building efficient and easier to maintain parsers.

This is a work in progress as Buffer data is still converted into String before being processed. Removing this drawback is planned for the next version (0.4.0).


## Core concepts

First let's see some definitions. In atok's terms:

* a `subrule` is an atomic check against the current data. It can be represented by a user defined function (rarely), a string or a number, or an array of those, as well as specific objects defining a range of values for instance (e.g. { start: 'a', end: 'z' } is equivalent to /[a-z]/ in RegExp)
* a `rule` is an __ordered__ combination of subrules. Each subrule is evaluated in order and if any fails, the whole rule is considered failed. If all of them are valid, then the handler supplied at rule instantiation is triggered, or if none was supplied, a data event is emitted instead.
* a `ruleSet` is a list of `rules` that are saved under a given name. Using `ruleSets` is useful when writing a parser to break down its complexity into smaller, easier to solve chunks.
* a `ruleSet` is a list of `rules` that are saved under a given name. Using `ruleSets` is useful when writing a parser to break down its complexity into smaller, easier to solve chunks. RuleSets can be created or altered __on the fly__ by any of their handlers.
* a `property` is an option applicable to the current rules being created.
* properties are set using their own methods. For instance, a `rule` may load a different `ruleSet` upon match using `next()`
* properties are defined before the rules they need to be applied to. E.g. atok.next('rules2').addRule(...)
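The definitions above can be sketched as a toy model. This is not atok's implementation and does not use its API: `testSubrule`, `testRule`, `ruleSets` and `firstMatch` are hypothetical names, only string and `{ start, end }` subrules are modelled, and the matching semantics are deliberately simplified (every subrule must match immediately after the previous one).

```javascript
// Toy model of subrules, rules and rule sets; NOT atok's implementation.
// A subrule here is either a string or a { start, end } character range.
function testSubrule (subrule, data, offset) {
  if (typeof subrule === 'string') {
    // String subrule: must appear exactly at the current offset
    return data.substr(offset, subrule.length) === subrule ? subrule.length : -1
  }
  // Range subrule: one character between start and end (inclusive)
  var c = data[offset]
  return c !== undefined && c >= subrule.start && c <= subrule.end ? 1 : -1
}

// A rule is an ordered list of subrules: all must match, one after the other
function testRule (subrules, data, offset) {
  var matched = 0
  for (var i = 0; i < subrules.length; i++) {
    var n = testSubrule(subrules[i], data, offset + matched)
    if (n < 0) return -1 // one failing subrule fails the whole rule
    matched += n
  }
  return matched
}

// A rule set is a list of rules saved under a name and tried in order
var ruleSets = {
  main: [
    { subrules: [ 'id:', { start: 'a', end: 'z' } ], type: 'identifier' },
    { subrules: [ { start: '0', end: '9' } ], type: 'digit' }
  ]
}

function firstMatch (name, data, offset) {
  var rules = ruleSets[name]
  for (var i = 0; i < rules.length; i++) {
    var n = testRule(rules[i].subrules, data, offset)
    if (n >= 0) return { type: rules[i].type, length: n }
  }
  return null
}
```

With this model, `firstMatch('main', 'id:x', 0)` matches the first rule because both of its subrules succeed in order, while `firstMatch('main', 'id:9', 0)` matches nothing: the first rule fails on its second subrule and the second rule fails on its first.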
5 changes: 5 additions & 0 deletions TODO.md
@@ -1,5 +1,10 @@
# TODO

## 0.5.0

* Compile rule sets to JS code


## 0.4.0

* looping rules optimization
89 changes: 51 additions & 38 deletions doc/tokenizer.html

Large diffs are not rendered by default.

53 changes: 41 additions & 12 deletions doc/tokenizer.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions examples/csv.js
@@ -45,7 +45,7 @@ function stringHandler (token, idx) {
}
function rawStringHandler (token, idx) {
addLine(idx)
data[ data.length-1 ].push(token)
data[ data.length-1 ].push( token.toString() )
}
function emptyHandler (token, idx) {
addLine(idx)
@@ -55,7 +55,7 @@ function numberHandler (token, idx) {
addLine(idx)
var num = Number(token)
// Valid Number?
data[ data.length-1 ].push( isFinite(num) ? num : token )
data[ data.length-1 ].push( isFinite(num) ? num : token.toString() )
}

// Define the main parser rules
35 changes: 14 additions & 21 deletions lib/rule.js
@@ -20,22 +20,23 @@ module.exports = Rule
* @constructor
* @api private
*/
function Rule (subrules, type, handler, atok) {
function Rule (subrules, type, handler, props, groupProps, encoding) {
var self = this
var n = subrules.length

this.atok = atok
this.props = atok.getProps()
this.props = props

this.debug = false

// Used for cloning
this.subrules = subrules

// Required by Atok#_resolveRules
this.group = atok._group
this.groupStart = atok._groupStart
this.groupEnd = atok._groupEnd
for (var p in groupProps)
this[p] = groupProps[p]
// this.group = atok._group
// this.groupStart = atok._groupStart
// this.groupEnd = atok._groupEnd

// Runtime values for continue props
this.continue = this.props.continue[0]
@@ -55,7 +56,7 @@ function Rule (subrules, type, handler, atok) {

// First subrule
var subrule = this.first = n > 0
? SubRule.firstSubRule( subrules[0], this.props, atok._encoding )
? SubRule.firstSubRule( subrules[0], this.props, encoding )
// Special case: no rule given -> passthrough
: SubRule.emptySubRule

@@ -77,7 +78,7 @@ function Rule (subrules, type, handler, atok) {
var prev = subrule
// Many subrules or none
for (var i = 1; i < n; i++) {
subrule = SubRule.SubRule( subrules[i], this.props, atok._encoding )
subrule = SubRule.SubRule( subrules[i], this.props, encoding )
prev.next = subrule
prev = subrule
if (this.length < subrule.length) this.length = subrule.length
@@ -121,9 +122,8 @@ function wrapDebug (rule, id, atok) {
return rule._test(buf, offset)
}
}
Rule.prototype.setDebug = function (debug) {
Rule.prototype.setDebug = function (debug, atok) {
var self = this
var atok = this.atok

// Rule already in debug mode
if (this.debug === debug) return
@@ -173,15 +173,8 @@ Rule.prototype.setDebug = function (debug) {
*
* @api private
*/
Rule.prototype.clone = function () {
var self = this
// Instantiate a dummy rule
var rule = new Rule(this.subrules, this.type, this.handler, this.atok)

// Overwrite its props
Object.keys(self).forEach(function (k) {
rule[k] = self[k]
})

Rule.prototype.clone = function (name) {
var rule = new Rule(this.subrules, this.type, this.handler, this.props, this)
rule.currentRule = name
return rule
}
}
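The reworked `clone(name)` can be illustrated in isolation. The sketch below is a simplified stand-in for `lib/rule.js`, assuming only what the hunks above show: the constructor receives `props` and `groupProps` instead of the tokenizer, and `clone` passes the rule itself as `groupProps`. `SketchRule` and the group values are hypothetical placeholders, and subrule compilation is left out.

```javascript
// Simplified stand-in for lib/rule.js: only the parts relevant to cloning
function SketchRule (subrules, type, handler, props, groupProps) {
  this.props = props
  this.subrules = subrules
  this.type = type
  this.handler = handler
  // Copy the group data from whatever object was handed in
  for (var p in groupProps) this[p] = groupProps[p]
}

SketchRule.prototype.clone = function (name) {
  // Passing `this` as groupProps copies the rule's enumerable fields
  var rule = new SketchRule(this.subrules, this.type, this.handler, this.props, this)
  rule.currentRule = name // tag the copy with its rule set name
  return rule
}

// Placeholder group values, chosen arbitrarily for the example
var groupProps = Object.create(null)
groupProps.group = -1
groupProps.groupStart = 0
groupProps.groupEnd = 0

var original = new SketchRule(['a'], 'letter', null, { ignore: false }, groupProps)
var saved = original.clone('main')
```

Because `saveRuleSet()` now stores clones, tagging a saved rule with its rule set name leaves the rule still held in the working list untouched, whereas the previous `map` callback assigned `currentRule` on the shared rule object. In this sketch the `props` object is still shared between the original and its clone.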
2 changes: 2 additions & 0 deletions lib/subrule.js
@@ -444,6 +444,8 @@ exports.firstSubRule = function (rule, props, encoding) {
if (rule === null || rule === undefined)
throw new Error('Tokenizer#addRule: Invalid rule ' + rule + ' (function/string/integer/array only)')

// var loop = props.ignore && props.continue[0] === -1 && !props.next[0] ? '_loop' : ''
// var type = typeOf(rule) + loop
var type = typeOf(rule)

switch (type) {
83 changes: 39 additions & 44 deletions lib/tokenizer.js
@@ -94,8 +94,6 @@ function Atok (options) {
this._groupStartPrev = []


// this.currentRule = { get: function (ruleSet) { return this._firstRule.currentRule }, set: function () { throw new Error('Atok: Cannot set currentRule') } } // Name of the current rule
this.currentRule = null // Name of the current rule
this._rules = [] // Rules to be checked against
this._defaultHandler = null // Matched token default handler
this._savedRules = {} // Saved rules
@@ -122,6 +120,10 @@ function Atok (options) {
}
inherits(Atok, EV, Stream.prototype)

// Atok.prototype.__defineGetter__('currentRule', function () {
// return this._firstRule ? this._firstRule.currentRule : null
// })

Atok.prototype._error = function (err) {
this.readable = false
this.writable = false
@@ -161,8 +163,6 @@ Atok.prototype.clear = function (keepRules) {

if (!keepRules) {

// this.currentRule = { get: function (ruleSet) { return this._firstRule.currentRule }, set: function () { throw new Error('Atok: Cannot set currentRule') } } // Name of the current rule
this.currentRule = null // Name of the current rule
this._rules = [] // Rules to be checked against
this._defaultHandler = null // Matched token default handler
this._savedRules = {} // Saved rules
@@ -182,14 +182,6 @@ Atok.prototype.clear = function (keepRules) {
* @api public
*/
Atok.prototype.slice = function (start, end) {
// switch (arguments.length) {
// case 0:
// start = this.offset
// case 1:
// end = this.length
// }

// return this.buffer.substr(start, end - start)
return this.buffer.slice(start, end)
}
/**
@@ -241,12 +233,12 @@ Atok.prototype.debug = function (flag) {
this.debugMode = _debug

// Apply debug mode to all defined rules...
var self = this
this._rulesForEach(function (rule) {
rule.setDebug(_debug)
rule.setDebug(_debug, self)
})

// Apply debug mode to some methods
var self = this
;[ 'loadRuleSet' ].forEach(function (method) {
if (_debug) {
var prevMethod = self[method]
@@ -277,7 +269,15 @@ Atok.prototype._rulesForEach = function (fn) {
saved[ruleSet].rules.forEach(fn)
})
}
// include("methods_ruleprops.js")
/**
* Get the current rule set name
*
* @return {String} rule set name
* @api public
*/
Atok.prototype.currentRule = function () {
return this._firstRule ? this._firstRule.currentRule : null
}// include("methods_ruleprops.js")
/**
* Set the default handler.
* Triggered on all subsequently defined rules if the handler is not supplied
@@ -611,15 +611,22 @@ Atok.prototype.addRule = function (/*rule1, rule2, ... type|handler*/) {

if ( first === 0 )
this._error( new Error('Atok#addRule: invalid first subrule, must be > 0') )
else
else {
var groupProps = Object.create(null)
groupProps.group = this._group
groupProps.groupStart = this._groupStart
groupProps.groupEnd = this._groupEnd
this._rules.push(
new Rule(
args
, type
, handler
, this
, this.getProps()
, groupProps
, this._encoding
)
)
}

this._rulesToResolve = true

@@ -653,9 +660,9 @@ Atok.prototype.removeRule = function (/* name ... */) {
*/
Atok.prototype.clearRule = function () {
this.clearProps()
this._firstRule = null
this._rules = []
this._defaultHandler = null
this.currentRule = null
this._rulesToResolve = false

return this
@@ -671,18 +678,15 @@ Atok.prototype.saveRuleSet = function (name) {
if (arguments.length === 0 || name === null)
return this._error( new Error('Atok#saveRuleSet: invalid rule name supplied') )

this.currentRule = name
this._savedRules[name] = {
rules: this._rules.slice() // Make sure to make a copy of the list
// Clone the rules
// .map(function (rule) { return rule.clone() })
// Assign the current rule set name
.map(function (rule) { rule.currentRule = name; return rule })
rules: this._rules
.map(function (rule) { // Clone and assign the current rule set name
return rule.clone(name)
})
}

// Resolve and check continues
this._resolveRules(name)
this.clearRule()

return this
}
@@ -701,9 +705,7 @@ Atok.prototype.loadRuleSet = function (name, index) {

index = typeof index === 'number' ? index : 0

this.currentRule = name
this._rules = ruleSet.rules
this._rulesToResolve = false
// Set the rule index
this._firstRule = this._rules[index]
this._resetRule = true
@@ -719,8 +721,6 @@ Atok.prototype.loadRuleSet = function (name, index) {
*/
Atok.prototype.removeRuleSet = function (name) {
delete this._savedRules[name]
// Make sure no reference to the rule set exists
if (this.currentRule === name) this.currentRule = null

return this
}
@@ -740,10 +740,9 @@ Atok.prototype._resolveRules = function (name) {
var self = this
// Check and set the continue values
var rules = name ? this._savedRules[name].rules : this._rules
var groupStartPrev = this._groupStartPrev

function getErrorData (i) {
return ( self.currentRule ? '@' + self.currentRule : ' ' )
return ( self.currentRule() ? '@' + self.currentRule() : ' ' )
+ (arguments.length > 0
? '[' + i + ']'
: ''
@@ -1133,25 +1132,25 @@ Atok.prototype._tokenize = function () {
p = this._firstRule
this._resetRule = false

while ( p !== null && this.offset < this.length ) {
while ( p && this.offset < this.length ) {
props = p.props

// Return the size of the matched data (0 is valid!)
//TODO matched = p.first.test(this.buffer, this.offset) - this.offset
matched = p.test(this.buffer, this.offset)

if ( matched < 0 ) {
p = p.nextFail
// End of the rule set, end the loop
if (!p.nextFail) break

// Next rule exists, carry on
if (p) continue

// End of the rule set, end the loop
break
p = p.nextFail
continue
}

// Is the token to be processed?
if ( !props.ignore ) {
if ( props.ignore ) {
p = p.next
} else {
// Emit the data by default, unless the handler is set
token = props.quiet
? matched - (p.single ? 0 : p.last.length) - p.first.length
@@ -1173,12 +1172,8 @@ Atok.prototype._tokenize = function () {
} else {
p = p.next
}
// p = this._resetRule ? this._firstRule : p.next
} else {
p = p.next
}


this.offset += matched

// NB. `break()` prevails over `pause()`
@@ -1193,7 +1188,7 @@ Atok.prototype._tokenize = function () {
}

// Keep track of the rule we are at
this._firstRule = p || this._firstRule
if (p) this._firstRule = p

// Truncate the buffer if possible: min(offset, markedOffset)
if (this.markedOffset < 0) {
2 changes: 0 additions & 2 deletions src/Atok_properties.js
@@ -18,8 +18,6 @@
//if(keepRules)
if (!keepRules) {
//endif
// this.currentRule = { get: function (ruleSet) { return this._firstRule.currentRule }, set: function () { throw new Error('Atok: Cannot set currentRule') } } // Name of the current rule
this.currentRule = null // Name of the current rule
this._rules = [] // Rules to be checked against
this._defaultHandler = null // Matched token default handler
this._savedRules = {} // Saved rules
