Merge 63e017e into c44ad8a
marcellino-ornelas committed Nov 15, 2018
2 parents c44ad8a + 63e017e commit 82b72b0
Showing 4 changed files with 112 additions and 100 deletions.
86 changes: 60 additions & 26 deletions README.md
@@ -1,4 +1,5 @@
<!-- vim: set spelllang=en : -->

# Saxophone 🎷

Fast and lightweight event-driven streaming XML parser in pure JavaScript.
@@ -28,13 +29,13 @@ $ npm install --save saxophone

This benchmark compares the performance of four of the most popular SAX parsers against Saxophone’s performance while parsing a 21 KB document. Below are the results when run on an Intel® Core™ i7-7500U processor (2.70GHz, 2 physical cores with 2 logical cores each).

Library | Version | Operations per second (higher is better)
-------------------|--------:|----------------------------------------:
**Saxophone** | 0.5.0 | **6,840 ±1.48%**
**EasySax** | 0.3.2 | **7,354 ±1.16%**
node-expat | 2.3.17 | 1,251 ±0.60%
libxmljs.SaxParser | 0.19.5 | 1,007 ±0.81%
sax-js | 1.2.4 | 982 ±1.50%
| Library | Version | Operations per second (higher is better) |
| ------------------ | ------: | ---------------------------------------: |
| **Saxophone** | 0.5.0 | **6,840 ±1.48%** |
| **EasySax** | 0.3.2 | **7,354 ±1.16%** |
| node-expat | 2.3.17 | 1,251 ±0.60% |
| libxmljs.SaxParser | 0.19.5 | 1,007 ±0.81% |
| sax-js | 1.2.4 | 982 ±1.50% |

To run the benchmark by yourself, use the following commands:

@@ -69,14 +70,16 @@ const parser = new Saxophone();
// Called whenever an opening tag is found in the document,
// such as <example id="1" /> - see below for a list of events
parser.on('tagopen', tag => {
console.log(
`Open tag "${tag.name}" with attributes: ${JSON.stringify(Saxophone.parseAttrs(tag.attrs))}.`
);
console.log(
`Open tag "${tag.name}" with attributes: ${JSON.stringify(
Saxophone.parseAttrs(tag.attrs)
)}.`
);
});

// Called when we are done parsing the document
parser.on('finish', () => {
console.log('Parsing finished.');
console.log('Parsing finished.');
});

// Triggers parsing - remember to set up listeners before
@@ -104,14 +107,16 @@ const parser = new Saxophone();
// Called whenever an opening tag is found in the document,
// such as <example id="1" /> - see below for a list of events
parser.on('tagopen', tag => {
console.log(
`Open tag "${tag.name}" with attributes: ${JSON.stringify(Saxophone.parseAttrs(tag.attrs))}.`
);
console.log(
`Open tag "${tag.name}" with attributes: ${JSON.stringify(
Saxophone.parseAttrs(tag.attrs)
)}.`
);
});

// Called when we are done parsing the document
parser.on('finish', () => {
console.log('Parsing finished.');
console.log('Parsing finished.');
});

// stdin is '<root><example id="1" /><example id="2" /></root>'
@@ -146,7 +151,7 @@ Trigger the parsing of a whole document. This method will fire registered listen

Arguments:

* `xml` is a UTF-8 string or a `Buffer` containing the XML that you want to parse.
- `xml` is a UTF-8 string or a `Buffer` containing the XML that you want to parse.

This method returns the parser instance.

@@ -158,15 +163,15 @@ Parse a chunk of a XML document. This method will fire registered listeners so y

Arguments:

* `xml` is a UTF-8 string or a `Buffer` containing a chunk of the XML that you want to parse.
- `xml` is a UTF-8 string or a `Buffer` containing a chunk of the XML that you want to parse.

### `Saxophone#end(xml = "")`

Write an optional last chunk then close the stream. After the stream is closed, a final `finish` event is emitted and no other event will be emitted afterwards. No more data may be written into the stream after closing it.

Arguments:

* `xml` is a UTF-8 string or a `Buffer` containing a chunk of the XML that you want to parse.
- `xml` is a UTF-8 string or a `Buffer` containing a chunk of the XML that you want to parse.

### `Saxophone.parseAttrs(attrs)`

@@ -186,9 +191,9 @@ This ignores invalid entities, including unrecognized ones, leaving them as-is.

Emitted when an opening tag is parsed. This encompasses both regular tags and self-closing tags. An object is passed with the following data:

* `name`: name of the parsed tag.
* `attrs`: attributes of the tag (as a string). To parse this string, use `Saxophone.parseAttrs`.
* `isSelfClosing`: true if the tag is self-closing.
- `name`: name of the parsed tag.
- `attrs`: attributes of the tag (as a string). To parse this string, use `Saxophone.parseAttrs`.
- `isSelfClosing`: true if the tag is self-closing.

#### `tagclose`

@@ -214,18 +219,47 @@ Emitted when a comment (such as `<!-- contents -->`) is parsed. An object with t

Emitted when a parsing error is encountered while reading the XML stream such that the rest of the XML cannot be correctly interpreted:

* when a DOCTYPE node is found (not supported yet);
* when a comment node contains the `--` sequence;
* when opening and closing tags are mismatched or missing;
* when a tag name starts with white space;
* when nodes are unclosed (missing their final `>`).
- when a DOCTYPE node is found (not supported yet);
- when a comment node contains the `--` sequence;
- when opening and closing tags are mismatched or missing;
- when a tag name starts with white space;
- when nodes are unclosed (missing their final `>`).

Because this library's goal is not to provide accurate error reports, the passed error will only contain a short description of the syntax error (without giving the position, for example).

#### `finish`

Emitted after all events, without arguments.

## Transform Stream (marcellino-ornelas)

Saxophone can now be used as a transform stream that sends parsed XML to another stream. To keep memory usage low, Saxophone does not store any of the XML internally; instead, it fires events that carry the parsed data. To send data on to the next stream, call the `parser.push()` method from inside an event listener.

Example:

```xml
<root>
<page>Hey this is some text</page>
</root>
```

```javascript
const fs = require('fs');
const Saxophone = require('saxophone');
const parser = new Saxophone();

parser.on('text', function(data) {
    // Get the text from saxophone
    const text = data.contents;

    // Send the data to the next stream
    parser.push(text);
});

fs.createReadStream('file.xml')
    .pipe(parser)
    .pipe(process.stdout); // Hey this is some text
```
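
The example above needs the `saxophone` package and a `file.xml` on disk. As a dependency-free illustration of the same mechanism, here is a minimal sketch built on Node's built-in `stream` module. `ToyParser` is a hypothetical stand-in, not Saxophone: it assumes tags are never split across chunks and simply emits a `text` event for whatever lies between them. What it tries to show is that nothing reaches the readable side unless a listener calls `push()`.

```javascript
const { Transform } = require('stream');

// Hypothetical stand-in for a parser; NOT Saxophone's implementation.
class ToyParser extends Transform {
    constructor() {
        // Keep incoming strings as strings, as the real constructor does
        super({ decodeStrings: false });
    }

    _transform(chunk, encoding, callback) {
        // Assumption: every chunk contains only whole tags.
        // Split on anything that looks like a tag and emit the rest.
        for (const piece of String(chunk).split(/<[^>]*>/)) {
            if (piece !== '') {
                this.emit('text', { contents: piece });
            }
        }

        // Nothing is pushed here: output is left entirely to listeners
        callback();
    }
}

const parser = new ToyParser();

// The listener decides what (if anything) goes downstream
parser.on('text', data => {
    parser.push(data.contents.toUpperCase());
});

parser.write('<root><page>Hey this is some text</page></root>');

// Read back what the listener pushed to the readable side
const output = String(parser.read());
console.log(output); // HEY THIS IS SOME TEXT
```

Because `_transform` never pushes anything itself in this sketch, removing the `text` listener leaves the readable side empty.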

## Contributions

This is free and open source software. All contributions (even small ones) are welcome. [Check out the contribution guide to get started!](CONTRIBUTING.md)
116 changes: 47 additions & 69 deletions lib/Saxophone.js
@@ -1,5 +1,5 @@
const {Writable} = require('readable-stream');
const {StringDecoder} = require('string_decoder');
const { Transform } = require('readable-stream');
const { StringDecoder } = require('string_decoder');

/**
* Information about a text node.
@@ -110,7 +110,7 @@ const Node = {
comment: 'comment',
processingInstruction: 'processinginstruction',
tagOpen: 'tagopen',
tagClose: 'tagclose',
tagClose: 'tagclose'
};

/**
@@ -129,12 +129,12 @@ const Node = {
* @emits Saxophone#tagopen
* @emits Saxophone#tagclose
*/
class Saxophone extends Writable {
class Saxophone extends Transform {
/**
* Create a new parser instance.
*/
constructor() {
super({decodeStrings: false});
super({ decodeStrings: false });

// String decoder instance
const state = this._writableState;
@@ -209,19 +209,15 @@ class Saxophone extends Writable {
// We read a TEXT node but there might be some
// more text data left, so we stall
if (nextTag === -1) {
this._stall(
Node.text,
input.slice(chunkPos)
);
this._stall(Node.text, input.slice(chunkPos));
break;
}

// A tag follows, so we can be confident that
// we have all the data needed for the TEXT node
this.emit(
Node.text,
{contents: input.slice(chunkPos, nextTag)}
);
this.emit(Node.text, {
contents: input.slice(chunkPos, nextTag)
});

chunkPos = nextTag;
}
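
Note on the stalling logic in the hunk above: when no `<` follows the current text, the parser cannot know whether the text node is complete, so it sets the remainder aside and waits for the next chunk. The idea can be sketched in isolation as follows. `createTextCollector` is a hypothetical helper written for this note, not code from `lib/Saxophone.js`; it only reports text and skips over tags.

```javascript
// Hypothetical helper for illustration; not taken from lib/Saxophone.js.
function createTextCollector(onText) {
    let stalled = ''; // incomplete data held over from the previous chunk

    return {
        write(chunk) {
            // Prepend whatever was left unfinished last time
            const input = stalled + chunk;
            stalled = '';
            let pos = 0;

            while (pos < input.length) {
                if (input[pos] === '<') {
                    const close = input.indexOf('>', pos);
                    if (close === -1) {
                        // Unfinished tag: wait for more data
                        stalled = input.slice(pos);
                        break;
                    }
                    pos = close + 1;
                    continue;
                }

                const nextTag = input.indexOf('<', pos);
                if (nextTag === -1) {
                    // Text may continue in the next chunk: stall it
                    stalled = input.slice(pos);
                    break;
                }

                // A tag follows, so the text node is complete
                onText(input.slice(pos, nextTag));
                pos = nextTag;
            }
        }
    };
}

const texts = [];
const collector = createTextCollector(text => texts.push(text));

// The text node "Hello" is split across two chunks
collector.write('<a>Hel');
collector.write('lo</a>');

console.log(texts); // [ 'Hello' ]
```

Without the stalled buffer, the two writes would report `Hel` and `lo` as separate text nodes.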
@@ -246,17 +242,13 @@ class Saxophone extends Writable {
// Unclosed CDATA section, we need to wait for
// upcoming data
if (cdataClose === -1) {
this._stall(
Node.cdata,
input.slice(chunkPos - 9)
);
this._stall(Node.cdata, input.slice(chunkPos - 9));
break;
}

this.emit(
Node.cdata,
{contents: input.slice(chunkPos, cdataClose)}
);
this.emit(Node.cdata, {
contents: input.slice(chunkPos, cdataClose)
});

chunkPos = cdataClose + 3;
continue;
@@ -269,10 +261,7 @@ class Saxophone extends Writable {
// Unclosed comment node, we need to wait for
// upcoming data
if (commentClose === -1) {
this._stall(
Node.comment,
input.slice(chunkPos - 4)
);
this._stall(Node.comment, input.slice(chunkPos - 4));
break;
}

@@ -281,10 +270,9 @@ class Saxophone extends Writable {
return;
}

this.emit(
Node.comment,
{contents: input.slice(chunkPos, commentClose)}
);
this.emit(Node.comment, {
contents: input.slice(chunkPos, commentClose)
});

chunkPos = commentClose + 3;
continue;
@@ -309,10 +297,9 @@ class Saxophone extends Writable {
break;
}

this.emit(
Node.processingInstruction,
{contents: input.slice(chunkPos, piClose)}
);
this.emit(Node.processingInstruction, {
contents: input.slice(chunkPos, piClose)
});

chunkPos = piClose + 2;
continue;
@@ -322,10 +309,7 @@ class Saxophone extends Writable {
const tagClose = input.indexOf('>', chunkPos);

if (tagClose === -1) {
this._stall(
Node.tagOpen,
input.slice(chunkPos - 1)
);
this._stall(Node.tagOpen, input.slice(chunkPos - 1));
break;
}

@@ -340,10 +324,7 @@ class Saxophone extends Writable {
return;
}

this.emit(
Node.tagClose,
{name: tagName}
);
this.emit(Node.tagClose, { name: tagName });

chunkPos = tagClose + 1;
continue;
@@ -390,10 +371,8 @@ class Saxophone extends Writable {
* @param {function} callback Called when the chunk has been parsed, with
* an optional error argument.
*/
_write(chunk, encoding, callback) {
const data = encoding === 'buffer'
? this._decoder.write(chunk)
: chunk;
_transform(chunk, encoding, callback) {
const data = encoding === 'buffer' ? this._decoder.write(chunk) : chunk;

this._parseChunk(data, callback);
}
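
Note on `_transform`: when a chunk arrives as a `Buffer`, it goes through `this._decoder.write(chunk)` rather than `chunk.toString()`. A plausible reason, assuming `_decoder` is a `StringDecoder` as the import at the top of the file suggests, is that a multi-byte UTF-8 character can be split across two chunks. The standalone sketch below shows the difference using only Node's built-in `string_decoder` module; the sample input is made up for illustration.

```javascript
const { StringDecoder } = require('string_decoder');

// '<é/>' encodes to 5 bytes: 3c c3 a9 2f 3e ('é' takes two bytes)
const bytes = Buffer.from('<é/>', 'utf8');

// Split in the middle of 'é', as a chunk boundary might
const chunkA = bytes.subarray(0, 2);
const chunkB = bytes.subarray(2);

// Naive per-chunk decoding corrupts the split character
const naive = chunkA.toString('utf8') + chunkB.toString('utf8');

// StringDecoder holds back the incomplete byte until the rest arrives
const decoder = new StringDecoder('utf8');
const first = decoder.write(chunkA);
const second = decoder.write(chunkB);

console.log(naive === '<é/>');          // false
console.log(first + second === '<é/>'); // true
```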
@@ -414,34 +393,33 @@ class Saxophone extends Writable {

// Handle unclosed nodes
switch (this._stalled) {
case Node.text:
// Text nodes are implicitly closed
this.emit(
'text',
{contents: this._stall(null)}
);
break;
case Node.cdata:
callback(new Error('Unclosed CDATA section'));
return;
case Node.comment:
callback(new Error('Unclosed comment'));
return;
case Node.processingInstruction:
callback(new Error('Unclosed processing instruction'));
return;
case Node.tagOpen:
case Node.tagClose:
// We do not distinguish between unclosed opening
// or unclosed closing tags
callback(new Error('Unclosed tag'));
return;
default:
case Node.text:
// Text nodes are implicitly closed
this.emit('text', { contents: this._stall(null) });
break;
case Node.cdata:
callback(new Error('Unclosed CDATA section'));
return;
case Node.comment:
callback(new Error('Unclosed comment'));
return;
case Node.processingInstruction:
callback(new Error('Unclosed processing instruction'));
return;
case Node.tagOpen:
case Node.tagClose:
// We do not distinguish between unclosed opening
// or unclosed closing tags
callback(new Error('Unclosed tag'));
return;
default:
// Pass
}

if (this._tagStack.length !== 0) {
callback(new Error(`Unclosed tags: ${this._tagStack.join(',')}`));
callback(
new Error(`Unclosed tags: ${this._tagStack.join(',')}`)
);
return;
}

2 changes: 1 addition & 1 deletion package-lock.json

Some generated files are not rendered by default.
