Commit

Merge 30985a7 into a77b2d9
Justin Wilaby committed Feb 8, 2019
2 parents a77b2d9 + 30985a7 commit 691e8e6
Showing 20 changed files with 624 additions and 347 deletions.
2 changes: 2 additions & 0 deletions Cargo.lock

90 changes: 71 additions & 19 deletions README.md
@@ -30,7 +30,8 @@ const saxPath = require.resolve('sax-wasm/lib/sax-wasm.wasm');
const saxWasmBuffer = fs.readFileSync(saxPath);

// Instantiate
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag);
const options = {highWaterMark: 64 * 1024}; // 64k chunks
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag, options);
parser.eventHandler = (event, data) => {
if (event === SaxEventType.Attribute ) {
// process attribute
@@ -42,8 +43,11 @@ parser.eventHandler = (event, data) => {
// Instantiate and prepare the wasm for parsing
parser.prepareWasm(saxWasmBuffer).then(ready => {
if (ready) {
parser.write('<div class="modal"></div>');
parser.end();
const readable = fs.createReadStream(path.resolve(__dirname + '/path/to/document.xml'), options);
readable.on('data', (chunk) => {
parser.write(chunk);
});
readable.on('end', () => parser.end()); // call through the instance so `this` stays bound
}
});

@@ -56,7 +60,7 @@ import { SaxEventType, SAXParser } from 'sax-wasm';
async function loadAndPrepareWasm() {
const saxWasmResponse = await fetch('./path/to/wasm/sax-wasm.wasm');
const saxWasmbuffer = await saxWasmResponse.arrayBuffer();
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag);
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag, {highWaterMark: 64 * 1024}); // 64k chunks

// Instantiate and prepare the wasm for parsing
const ready = await parser.prepareWasm(new Uint8Array(saxWasmbuffer));
@@ -75,8 +79,21 @@ function processDocument(parser) {
// process open tag
}
}
parser.write('<div class="modal"></div>');
parser.end();

fetch('path/to/document.xml').then(async response => {
if (!response.ok) {
// fail in some meaningful way
}
// Get the reader to stream the document to sax-wasm
const reader = response.body.getReader();
while(true) {
const chunk = await reader.read();
if (chunk.done) {
return parser.end();
}
parser.write(chunk.value); // reader.read() resolves to { done, value }; value is the Uint8Array
}
});
}
```

@@ -88,10 +105,10 @@ when migrating:
1. No attempt is made to validate the document. sax-wasm reports what it sees. If you need strict mode or document validation, it may
be recreated by applying rules to the events that are reported by the parser.
1. Namespaces are reported in attributes. No special events dedicated to namespaces.
1. The parser is ready as soon as the promise is handled.
1. Streaming utf-8 encoded bytes in a `Uint8Array` is required.

## Streaming
Streaming is supported with sax-wasm by writing utf-8 encoded text to the parser instance. Writes can occur safely
Streaming is supported with sax-wasm by writing utf-8 encoded bytes (a `Uint8Array`) to the parser instance. Writes can occur safely
anywhere except within the `eventHandler` function or within the `eventTrap` (when extending `SAXParser` class).
Doing so anyway risks overwriting memory still in play.

@@ -107,29 +124,60 @@ Complete list of event/argument pairs:

|Event |Mask |Argument passed to handler |
|----------------------------------|--------------|---------------------------------------------|
|SaxEventType.Text |0b000000000001|text: [Text](src/js/saxWasm.ts#L42) |
|SaxEventType.Text |0b000000000001|text: [Text](src/js/saxWasm.ts#L91) |
|SaxEventType.ProcessingInstruction|0b000000000010|procInst: string |
|SaxEventType.SGMLDeclaration |0b000000000100|sgmlDecl: string |
|SaxEventType.Doctype |0b000000001000|doctype: string |
|SaxEventType.Comment |0b000000010000|comment: string |
|SaxEventType.OpenTagStart |0b000000100000|tag: [Tag](src/js/saxWasm.ts#L48) |
|SaxEventType.Attribute |0b000001000000|attribute: [Attribute](src/js/saxWasm.ts#L33)|
|SaxEventType.Attribute |0b000001000000|attribute: [Attribute](src/js/saxWasm.ts#L51)|
|SaxEventType.OpenTag |0b000010000000|tag: [Tag](src/js/saxWasm.ts#L48) |
|SaxEventType.CloseTag |0b000100000000|tag: [Tag](src/js/saxWasm.ts#L48) |
|SaxEventType.OpenCDATA |0b001000000000|start: [Position](src/js/saxWasm.ts#L28) |
|SaxEventType.OpenCDATA |0b001000000000|start: [Position](src/js/saxWasm.ts#L41) |
|SaxEventType.CDATA |0b010000000000|cdata: string |
|SaxEventType.CloseCDATA |0b100000000000|end: [Position](src/js/saxWasm.ts#L28) |
|SaxEventType.CloseCDATA |0b100000000000|end: [Position](src/js/saxWasm.ts#L41) |
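
As a rough illustration of how the masks in this table combine, the sketch below ORs two masks into a subscription bitmask and tests membership with a bitwise AND. The constants are copied from the table rather than imported from sax-wasm, and `isSubscribed` is a hypothetical helper, not part of the library.

```js
// Mask values copied from the table above (local stand-ins, not imported from sax-wasm).
const Attribute = 0b000001000000;
const OpenTag = 0b000010000000;
const CloseTag = 0b000100000000;

// Combine masks with bitwise OR to subscribe to several events at once.
const events = Attribute | OpenTag;

// Hypothetical helper: true when `mask` is part of the `events` bitmask.
function isSubscribed(events, mask) {
  return (events & mask) !== 0;
}

console.log(isSubscribed(events, OpenTag));  // true
console.log(isSubscribed(events, CloseTag)); // false
```

The resulting number is what the README examples pass as the first constructor argument, for example `SaxEventType.Attribute | SaxEventType.OpenTag`.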

## Speeding things up on large documents
The sax-wasm parser is very fast and can parse very large documents in the blink of an eye. Although
its performance out of the box is excellent, the JavaScript thread *must* be involved in transforming raw
bytes to human readable data, so slowdowns can occur if you're not careful. These are some of the
items to consider when top speed and performance are an absolute must:
1. Stream your document from its source as a `Uint8Array` - This is covered in the examples above. Things slow down
significantly when the document is loaded in JavaScript as a string, then encoded to bytes using `Buffer.from(document)` or
`new TextEncoder().encode(document)` before being passed to the parser. Encoding on the JavaScript thread adds a non-trivial
amount of overhead, so it's best to keep the data as raw bytes. Streaming often means the parser will already be done once
the document finishes downloading!
1. Keep the events bitmask to a bare minimum whenever possible - the more events that are required, the more work the
JavaScript thread must do once sax-wasm.wasm reports back.
1. Limit property reads on the reported data to only what's necessary - this includes things like stringifying the data to
json using `JSON.stringify()`. The first read of a property on a data object reported by the `eventHandler` will
retrieve the value from raw bytes and convert it to a `string`, `number` or `Position` on the JavaScript thread. This
conversion time becomes noticeable on very large documents with many elements and attributes. **NOTE:** After
the initial read, the value is cached and accessing it becomes faster.
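
The read-then-cache behavior described in the last item can be pictured with a minimal sketch. `LazyText` is a hypothetical stand-in written for this illustration, not the sax-wasm implementation; it only shows why the first property read costs a decode and later reads do not.

```js
// Hypothetical stand-in for a reported data object; not the sax-wasm implementation.
class LazyText {
  constructor(bytes) {
    this.bytes = bytes;   // raw utf-8 bytes, left undecoded until needed
    this.cache = {};      // decoded values are stored here after the first read
    this.decodeCount = 0; // counts how often decoding actually runs
  }

  get value() {
    if (!('value' in this.cache)) {
      this.decodeCount++;
      this.cache.value = new TextDecoder().decode(this.bytes);
    }
    return this.cache.value;
  }
}

const text = new LazyText(new Uint8Array([104, 105])); // the bytes for "hi"
const first = text.value;  // decodes the bytes
const second = text.value; // served from the cache
console.log(first, second, text.decodeCount); // hi hi 1
```

Under this model, a handler that never touches `value` never pays for the decode, which is the reason to limit property reads to what is needed.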

## SAXParser.js
## Constructor
`SAXParser([events: number, [options: SaxParserOptions]])`

Constructs a new `SAXParser` instance with the specified events bitmask and options.
### Parameters

- `events` - A number representing a bitmask of events that should be reported by the parser.
- `options` - When specified, the `highWaterMark` option is used to prepare the parser for the expected size of each chunk
provided by the stream. The parser will throw if chunks written to it are larger.
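
When the bytes are already in memory rather than arriving from a stream, one way to stay within `highWaterMark` is to slice the buffer before writing. The sketch below assumes a hypothetical helper, `splitIntoChunks`, and a deliberately tiny `highWaterMark`; it does not call sax-wasm itself, and it does not address whether a chunk boundary may fall inside a multi-byte character.

```js
// Hypothetical helper: split `bytes` into views no longer than `highWaterMark`.
function splitIntoChunks(bytes, highWaterMark) {
  const chunks = [];
  for (let start = 0; start < bytes.length; start += highWaterMark) {
    // subarray creates a view over the same memory, so no bytes are copied
    chunks.push(bytes.subarray(start, start + highWaterMark));
  }
  return chunks;
}

const options = { highWaterMark: 4 }; // tiny value for illustration only
// TextEncoder is used here only to produce sample bytes (11 of them).
const doc = new TextEncoder().encode('<a><b/></a>');
const chunks = splitIntoChunks(doc, options.highWaterMark);
console.log(chunks.map((c) => c.length)); // [ 4, 4, 3 ]
```

Each element of `chunks` could then be passed to `parser.write(chunk)` in order, followed by a single `parser.end()`.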

### Methods

- `prepareWasm(wasm: Uint8Array): Promise<boolean>` - Instantiates the wasm binary with reasonable defaults and stores
the instance as a member of the class. Always resolves to true or throws if something went wrong.

- `write(buffer: string): void;` - writes the supplied string to the wasm stream and kicks off processing.
The character and line counters are *not* reset.
- `write(chunk: Uint8Array, offset: number = 0): void;` - writes the supplied bytes to the wasm memory buffer and kicks
off processing. An optional offset can be provided if the read should occur at an index other than `0`. **NOTE:**
The `line` and `character` counters are *not* reset.

- `end(): void;` - Ends processing for the stream. The character and line counters are reset to zero and the parser is
- `end(): void;` - Ends processing for the stream. The `line` and `character` counters are reset to zero and the parser is
readied for the next document.

### Properties

- `events` - A bitmask containing the events to subscribe to. See the examples for creating the bitmask
@@ -139,17 +187,21 @@ readied for the next document.

## sax-wasm.wasm
### Methods

The methods listed here can be used to create your own implementation of the SaxWasm class when extending it or composing
it will not meet the needs of the program.
- `parser(events: u32)` - Prepares the parser struct internally and supplies it with the specified events bitmask. Changing
the events bitmask can be done at anytime during processing using this method.
the events bitmask can be done at *any time* during processing using this method.

- `write(ptr: *mut u8, length: usize)` - Supplies the parser with the location and length of the newly written bytes in the
stream and kicks off processing. The parser assumes that the bytes are valid utf-8. Writing non utf-8 bytes may cause
unpredictable behavior.
stream and kicks off processing. The parser assumes that the bytes are valid utf-8 grapheme clusters. Writing non utf-8 bytes may cause
unpredictable results but probably will not break.

- `end()` - resets the character and line counts but does not halt processing of the current buffer.
- `end()` - resets the `character` and `line` counts but does not halt processing of the current buffer.

## Building from source
### Prerequisites

This project requires rust v1.30+ since it contains the `wasm32-unknown-unknown` target out of the box.

Install rust:
2 changes: 2 additions & 0 deletions lib/sax-wasm.js
@@ -0,0 +1,2 @@
/* tslint:disable */

Binary file modified lib/sax-wasm.wasm
Binary file not shown.
69 changes: 42 additions & 27 deletions lib/saxWasm.d.ts
@@ -13,50 +13,65 @@ export declare class SaxEventType {
static CloseCDATA: number;
}
declare abstract class Reader<T> {
constructor(buf: Uint32Array, ptr?: number);
protected abstract read(buf: Uint32Array, ptr: number): void;
protected data: Uint8Array;
protected cache: {
[prop: string]: T;
};
protected ptr: number;
constructor(data: Uint8Array, ptr?: number);
abstract toJSON(): {
[prop: string]: T;
};
}
export declare class Position {
line: number;
character: number;
constructor(line: number, character: number);
}
export declare class Attribute extends Reader<string | number | Position> {
static BYTES_IN_DESCRIPTOR: number;
nameEnd: Position;
nameStart: Position;
valueEnd: Position;
valueStart: Position;
name: string;
value: string;
protected read(buf: Uint32Array, ptr: number): void;
readonly nameStart: Position;
readonly nameEnd: Position;
readonly valueStart: Position;
readonly valueEnd: Position;
readonly name: string;
readonly value: string;
toJSON(): {
[prop: string]: string | number | Position;
};
}
export declare class Text extends Reader<string | Position> {
static BYTES_IN_DESCRIPTOR: number;
end: Position;
start: Position;
value: string;
protected read(buf: Uint32Array, ptr: number): void;
readonly start: Position;
readonly end: Position;
readonly value: string;
toJSON(): {
[prop: string]: string | Position;
};
}
export declare class Tag extends Reader<Attribute[] | Text[] | Position | string | number | boolean> {
name: string;
attributes: Attribute[];
textNodes: Text[];
selfClosing: boolean;
openStart: Position;
openEnd: Position;
closeStart: Position;
closeEnd: Position;
protected read(buf: Uint32Array): void;
readonly openStart: Position;
readonly openEnd: Position;
readonly closeStart: Position;
readonly closeEnd: Position;
readonly selfClosing: boolean;
readonly name: string;
readonly attributes: Attribute[];
readonly textNodes: Text[];
toJSON(): {
[p: string]: Attribute[] | Text[] | Position | string | number | boolean;
};
}
export interface SaxParserOptions {
highWaterMark: number;
}
export declare class SAXParser {
static textDecoder: TextDecoder;
static textEncoder: TextEncoder;
events: number;
eventHandler: (type: SaxEventType, detail: Reader<any> | Position | string) => void;
private readonly options;
private wasmSaxParser;
constructor(events?: number);
write(value: string): void;
private writeBuffer;
constructor(events?: number, options?: SaxParserOptions);
write(chunk: Uint8Array, offset?: number): void;
end(): void;
prepareWasm(saxWasm: Uint8Array): Promise<boolean>;
protected eventTrap: (event: number, ptr: number, len: number) => void;