-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
36 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,36 @@ | ||
# mecab-web-worker | ||
# mecab-web-worker | ||
|
||
Using [MeCab](https://github.com/taku910/mecab) for Japanese segmentation in the browser. Inspired by [fugashi](https://github.com/polm/fugashi)'s API. | ||
|
||
**Only Chromium based browsers are supported since I'm using [Module Workers](https://developer.mozilla.org/en-US/docs/Web/API/Worker/Worker) and the [Compression Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Compression_Streams_API).** | ||
|
||
```javascript | ||
const worker = await MecabWorker.create(); | ||
const result = await worker.parse('和布蕪は、ワカメの付着器の上にある'); | ||
console.log(result); | ||
|
||
const nodes = await worker.parseToNodes('和布蕪は、ワカメの付着器の上にある'); | ||
for (let node of nodes) { | ||
console.log(node); | ||
} | ||
``` | ||
|
||
MeCab was compiled to WASM and runs in a background thread via the [Web Workers API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API). It's necessary to provide a dictionary. Currently, only Unidic is supported and is available here: <https://github.com/leyhline/mecab-web-worker/releases/tag/v0.1.0> By default, `MecabWorker` expects the `/unidic-3.1.0.zip` (around 500 MB!) at the page's root but it's possible to change this by supplying an options object to `MecabWorker.create`. After the first download the extracted files are persisted in the browser cache (using [CacheStorage](https://developer.mozilla.org/en-US/docs/Web/API/CacheStorage)) to avoid repeated downloads. | ||
|
||
## Motivation | ||
|
||
I want to build some interactive tool for aligning Japanese text and audio. Since interactivity is easier to accomplish in the browser I wanted to go full JS instead of putting e.g. Python in the mix. And since the functionality for segmentation is easy to separate I decided to create an NPM package that's hopefully as easy to use as Python's [fugashi](https://github.com/polm/fugashi) (a great wrapper around MeCab, check it out, cite it and sponsor Paul's work). | ||
|
||
My uninformed self did also draw from the knowledge he published on his blog, e.g. [An Overview of Japanese Tokenizer Dictionaries](https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html) and I use his [Unidic distribution](https://github.com/polm/unidic-py). Thanks a lot! I hope to build a better understanding of the theory behind all this at a later date. | ||
|
||
## Technical Background | ||
|
||
MeCab was compiled to WASM using [Emscripten](https://emscripten.org/) without wrapper code in C. See the corresponding [GitHub Action](https://github.com/leyhline/mecab-web-worker/blob/main/.github/workflows/build.yaml) for the compiler flags. | ||
|
||
However, for accessing a C struct from MeCab, I had to use *pointer arithmetic in JavaScript* (see [mecab-worker.js:MecabNode](https://github.com/leyhline/mecab-web-worker/blob/main/src/mecab-worker.ts) which isn't really elegant. I also wrote a simple unzip function using the [Compression Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Compression_Streams_API) which works (in Chrome at least) but is not completely correct. | ||
|
||
## TODO | ||
|
||
* Support different dictionaries; this isn't hard but wasn't a use case for me personally. | ||
* Wrap more of MeCab's functionality like returning nbest results. | ||
* Polyfills for APIs that are not widely supported by browsers. |