Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Tree: 2b13a8d629
Fetching contributors…

Cannot retrieve contributors at this time

645 lines (644 sloc) 24.6 KB
.TH IDSGREP 1 "@release_date@" "@PACKAGE_STRING@" "User Commands"
@PACKAGE@ \- match Extended Ideographic Description Sequences
.RI [ FILE .\|.\|.]
program parses the input files, or standard input if no filename is
specified, into Extended Ideographic Description Sequences
(EIDSes, described
in more detail below) separated by whitespace.
Any EIDS in the input that matches
is echoed through to standard output, along with its trailing whitespace
until the next EIDS in the input.
This is like
.BR grep (1)
except that it operates on EIDSes instead of text lines, and it follows the
rules of EIDS matching instead of regular expression matching.
The function of this program is deliberately general in nature, and it
could concievably be applied to many different purposes, but the
application motivating development is to descriptions of Han
characters (as used to write Chinese, Japanese, and some other languages)
in terms of their decomposition into smaller parts.
For instance, the character <U+840C> can be described as <U+8279> above
<U+660E>, which in turn can be described as <U+65E5> next to <U+6708>.
it is possible to match a partial description against those in a database;
such an operation may be useful in looking up an unfamiliar character or in
certain font development contexts.
Please note that
is somewhat unusual in being a command-line utility, for which a
.B man
page in English and ASCII would be the preferred form of documentation, whose
operation nonetheless revolves around certain non-ASCII characters, namely
the so-called Ideographic Description Characters in the range <U+2FF0> to
Those are not included in many fonts, and cannot be included in portable
.B man
Most users are likely to be native English speakers working with East
Asian languages,
and so Chinese or Japanese characters are also necessary to document
typical usage examples.
In this document, the notation <U+xxxx> denotes a Unicode character by
its hexadecimal code point; for a
more detailed description of
that shows the characters in their proper forms, consult the PDF
documentation in
.IR @flat_docdir@/@PACKAGE@.pdf .
.BI \-c "FMT\fR,\fP " "\-\^\-cooking=" FMT
Specify a mode for cooking of input and output.
parameter may be
.BR raw ", " rawnc ", " ascii ", " cooked ", " indent ", "
or a string of up to 12 decimal digits.
By default, or in
.B raw
.B rawnc
will write out for each matching tree the exact sequence of bytes that were
read and parsed to generate that tree.
In the case of invalid input (for instance, malformed UTF-8), that sequence
of bytes could be bad.
Even with perfectly valid input, the sequence could be unusual, because EIDS
syntax in general permits many different strings to have the same semantic
Setting a non-raw mode for output cooking will cause
to generate its own EIDS syntax, in a guaranteed consistent form, for each
matching tree.
.B rawnc
mode is identical to
.B raw
save that it disables canonicalization (ascii-to-symbolic) processing on
That modification to the syntax is also available in conjunction with output
cooking by means of the advanced decimal-string syntax.
See the PDF documentation for details of
.BR -c "."
.BI \-d "NAME\fR,\fP " "\-\^\-dictionary=" NAME
Choose one or more dictionaries from the default dictionary directory, using
shell glob pattern
To be precise, this will cause
to search all files that match
.RI \(lq@flat_dictdir@/ "NAME" *.eids\(rq.
Note that
.IR NAME ","
if non-empty, must be typed as part of the same argument as
.B \-d
(i.e. no space between them on the command line),
because if
.B \-d
is in an argument by itself, it will be interpreted as the valid and
commonly-used case of
being empty.
.BR \-h ", " \-\^\-help
Display a brief help message.
.BR \-V ", " \-\^\-version
Display the version and license information.
The Han character set is open-ended.
Although a few thousand characters suffice to write the most popular
Han-script languages most of the time,
popular standards define tens of thousands of less-popular characters,
and there are at least hundreds of thousands of rare characters known to
occur in names and historical contexts.
Computer text processing systems that use fixed lists of characters will
inevitably find themselves unable to represent some text.
As a result, there is a need to
.I describe
characters in a standard way that may have no standard code
points of their own.
A similar need for descriptions of characters arises when looking up
characters in a dictionary; a user may recognize some or all the visual
features of a character (such as its parts and the way they are laid out)
without knowing how to enter the character as a whole.
The Unicode standard offers a partial solution to some of these issues by
defining a series of \(lqIdeographic Descripion Characters\(rq (IDCs),
<U+2FF0> to <U+2FFB>, and a syntax for using them to construct
\(lqIdeographic Description Sequences\(rq (IDSes).
Here are the rules of Unicode IDSes:
.IP \(bu 4
A character from the the Unified Han or CJK Radical ranges (including
their extensions) is a complete IDS and simply represents itself.
Unicode 6.1 added permission for the use of private-use characters.
.IP \(bu 4
The IDC code points <U+2FF0>, <U+2FF1>, and <U+2FF4> through <U+2FFB>
are prefix binary operators.
One of these characters followed by two complete IDSes
forms another complete IDS, representing a character formed by joining the
two smaller characters in a way suggested by the name and graphical image
of the IDC.
.IP \(bu 4
The IDCs <U+2FF2> and <U+2FF3> are prefix ternary operators.
(Unicode uses the less-standard word \(lqtrinary.\(rq)
One of them can be followed by three complete IDSes to form an IDS that
describes a
character made of three parts, much in the same manner as the binary
.IP \(bu 4
As of Unicode 6.1, IDS length is unlimited.
Earlier versions specified that an IDS could not be more than 16 code
points long overall nor contain more than six consecutive non-operator
.IP \(bu 4
IDSes non-bindingly \(lqshould\(rq be as short as possible and should
reflect \(lqthe natural radical-phonetic division for an ideograph if it has
.IP \(bu 4
The Unicode standard does not mention variation selectors in any IDS-related
context, except that it offers the possibility of prefixing <U+303E>,
the \(lqideographic variation mark,\(rq to the entire sequence to indicate
a variation.
Such a prefix is explicitly defined not to be part of the IDS.
It is the opinion of the IDSgrep author that Unicode did not
intend to permit variation selectors
.I inside
IDS syntax.
To create a dictionary of character decompositions for
we need to be able to describe characters in a little more detail than
provided by standard Unicode IDSes, and in particular, we need to be able to
specify a code point for a
character or part of one while also specifying that code point's
further decomposition.
There is also a need for specifying
.I partial
descriptions, similar in spirit to
.BR grep (1)'s
regular expressions.
Both these needs are served by the
.I Extended
Ideographic Description Sequence (EIDS) syntax of
Thorough discussion of the syntax, with visual examples, is reserved for the
PDF documentation, but the following rules are an outline.
.IP \(bu 4
An EIDS represents a tree in which each node has an optional
.IR head ,
a required
.IR functor ,
and between zero and three
.IR children ,
each of which is a similar tree, recursively.
Heads and functors are nonempty unlimited-length strings of Unicode code
In the current implementation, no Unicode canonicalization is performed.
The number of children of a node is called its
.IR arity .
Nodes with arity zero through three are called
.IR nullary ,
.IR unary ,
.IR binary ,
.I ternary
.IP \(bu 4
To write an EIDS, write down the head if any of the root,
the functor of the root, and then the EIDSes for its children.
.IP \(bu 4
Heads and functors, in their most explicit form, are written as bracketed
strings, where the choice of
opening bracket indicates whether the string is a head or
a functor, and if a functor, the arity of the node.
.IP \(bu 4
A head may be opened by < (ASCII angle bracket) in which case it is closed
by >; it may be opened by <U+3010> and closed by <U+3011> (black lenticular
brackets); or it may be opened by <U+3016> and closed by <U+3017> (white
lenticular brackets).
.IP \(bu 4
The opening and closing brackets for the functor of a nullary node may be (
with ) (ASCII parentheses); <U+FF08> with <U+FF09> (fullwidth parentheses);
or <U+FF5F> with
<U+FF60> (fullwidth \(lqwhite\(rq [usually double] parentheses).
.IP \(bu 4
For the functor of a unary node, the closing bracket is always the same as
the opening bracket.
Three characters may be used to open and close a unary node's
functor: . (ASCII period); : (ASCII colon); and <U+30FB> (katakana middle dot).
.IP \(bu 4
The opening and closing brackets for the functor of a binary node may be [
with ] (ASCII square brackets);
<U+FF3B> with <U+FF3D> (fullwidth square brackets); or <U+301A> with
<U+301B> (white square brackets).
.IP \(bu 4
The opening and closing brackets for the functor of a ternary node may be {
with } (ASCII curly braces);
<U+3014> with <U+3015> (tortoiseshell brackets); or <U+3018> with
<U+3019> (white tortoiseshell brackets).
.IP \(bu 4
The closing bracket of a bracketed string must be the one that
corresponds to the opening bracket.
For instance, a head opened by < must be closed by >.
Any occurrence of <U+3011> is
taken literally and does not close the string.
.IP \(bu 4
Nested brackets are not detected nor specially processed.
For instance, in a nullary node's functor opened by (,
the first ) ends the string regardless of how many other copies of ( may
have been included in the string.
However, in no case may a bracketed string be empty.
If what would otherwise be a matching closing bracket
appears immediately after the opening bracket, then it is taken literally as
the first character of the string and does
.I not
close the string.
One frequently-used case is that three ASCII periods in a row are valid
syntax for a unary node functor containing a single ASCII period.
The first opens the string, the second is the literal first character, and
the third closes the string.
.IP \(bu 4
Backslash can be used for escape sequences.
The sequences \ea, \eb, \ee, \ef, \et, \en, and \er have the same meanings
as in the C programming language under Unix, with \en always corresponding
to <U+000A>, a single line feed, not a localized end-of-line sequence which
might be different from that.
The \ec sequence followed by a case-insensitive ASCII Latin letter (only)
corresponds to an ASCII control character, equivalent to typing Ctrl plus
that letter on a standard ASCII keyboard.
There are also three hexadecimal escapes:
.RI \ex HH
is two hexadecimal digits);
is four hexadecimal digits);
.RI \eX{ Hx }
.I Hx
is a variable-length sequence of hexadecimal digits).
These allow entering arbitrary code points.
In all backslash-letter escape sequences, the letter after the backslash is
case-sensitive but the parameter, if applicable, is not.
Backslash sequences may not be nested; parameters must be given as literal
.IP \(bu 4
Backslash, if not used with a letter to form one of the sequences above,
causes the next character (which could even be a second backslash) to be
taken literally and lose any special meaning.
Inside a bracketed string, it can be used for instance to escape a closing
bracket that would otherwise end the string.
Outside a bracketed string, a backslash-escaped character will always be
taken as a head with a syrupy semicolon (as described below) instead of
having sugary, opening-bracket, or skipped-whitespace behaviour.
That also applies to characters created by backslash-letter sequences, for
instance with \eX: after decoding the escape sequence, the result is always
literal inside a bracketed string and syrupy outside a bracketed string.
.IP \(bu 4
ASCII control characters and whitespace characters, <U+0000> through
<U+0020> (notably including <U+0000>), are ignored outside bracketed
strings and taken
literally inside bracketed strings.
Non-ASCII Unicode whitespace characters, such as <U+3000>, may be
treated this way in the future but currently are not.
.IP \(bu 4
Some characters have
.IR "sugary implicit brackets" .
That means that if one of these characters occurs where an opening bracket
would otherwise be expected, it will be treated as a one-character bracketed
string with brackets that depend on what character it is.
For instance, ASCII semicolon will be interpreted as if it appeared in ASCII
parentheses, and will thus become the functor of a nullary node.
The complete list of characters that have sugary implicit brackets, with
the brackets they imply, is:
(;) (?) .!. ./. .=. .*. .@. [&] [,] [|]
[<U+2FF0>] [<U+2FF1>] [<U+2FF4>] [<U+2FF5>] [<U+2FF6>] [<U+2FF7>]
[<U+2FF8>] [<U+2FF9>] [<U+2FFA>] [<U+2FFB>]
{<U+2FF2>} {<U+2FF3>}.
.IP \(bu 4
All remaining characters (those without other dispositions specified above,
and notably including the Unified Han characters) have
.IR "syrupy implicit semicolons" .
That means when one occurs outside a bracketed string,
it not only becomes a single-character head
(sugary implicit angle brackets) but it also receives arity zero and a
functor consisting of a single semicolon.
For instance, the lone character x is equivalent to <x>(;) or (because
semicolon itself has implicit parentheses) <x>;.
.IP \(bu 4
Characters, for the purposes of EIDS parsing, are strictly single Unicode
code points.
Such things as combining accents and variation selectors are parsed as
separate characters from the bases to which they may be applied.
The sugary and syrupy parsing rules apply only to single characters.
Thus, appropriate brackets are necessary whenever a sequence containing
more than one code point is to be treated as a single head or functor.
.IP \(bu 4
After parsing, EIDSes are subjected to a \(lqcanonicalization\(rq
transformation in which certain functor and arity combinations (generally
relatively verbose ASCII alphabetic strings) are replaced by
single-character forms.
The idea is to provide human-readable pure-ASCII alternate forms for
the IDCs and matching operators.
.B -c
command-line option makes it possibile to skip this transformation on
input, or perform its inverse on output.
The list of replacements is: (anything) to (?); .anywhere. to ...; [and] to
[&]; [or] to [|]; .not. to .!.; .regex. to ./.; .equal. to .=.; [lr] to
[<U+2FF0>]; [tb] to [<U+2FF1>]; {lcr} to {<U+2FF2>}; {tcb} to {<U+2FF3>};
[enclose] to [<U+2FF4>]; [wrapu] to [<U+2FF5>]; [wrapd] to [<U+2FF6>];
[wrapl] to [<U+2FF7>]; [wrapul] to [<U+2FF8>]; [wrapur] to [<U+2FF9>];
[wrapll] to [<U+2FFA>]; and [overlap] to [<U+2FFB>].
.IP \(bu 4
Total length, and number of consecutive nullary nodes (which are like
the non-operator characters in Unicode IDSes), are unlimited.
In the case of a dictionary entry it may be desirable to make the EIDS as
.I long
as possible (contrary to Unicode's recommendation)
in order to offer a detailed decomposition and many query strategies to
the user.
.IP \(bu 4
It is a consequence of these rules that all syntactically valid Unicode
IDSes are syntactically valid EIDSes, but the reverse is not true.
The matching pattern and the input are parsed as described above to generate
trees, and then the trees are matched as follows.
.IP \(bu 4
If the pattern and input both have heads, then they match if and only if the
heads are identical.
This overrides the other rules below.
.IP \(bu 4
If the pattern's functor and arity are (?), it always matches.
.IP \(bu 4
If the pattern is
.RI ... "x"
(unary, functor is a single ASCII period, and
.I x
is its only child), then it matches if and only if
.I x
matches some subtree of the input.
.IP \(bu 4
If the pattern is
.RI .*. "x" ,
then it matches if and only if some permutation of the children of
.I x
(at the root level only) will cause it to match the input; that is, the
children are allowed to match in any order.
.IP \(bu 4
If the pattern is
.RI .!. "x" ,
then it matches if and only if
.I x
.I not
match the input.
.IP \(bu 4
If the pattern is
.RI [&] "xy" ,
then it matches if and only if both
.I x
.I y
match the input.
.IP \(bu 4
If the pattern is
.RI [|] "xy" ,
then it matches if and only if
.I x
matches the input or
.I y
matches the input.
.IP \(bu 4
If the pattern is
.RI .=. "x"
and if
.I x
and the input both have heads, then it matches if and only if those heads
are identical.
.I x
and the input do not both have heads, then
.RI .=. "x"
matches if and only if
.I x
and the input have identical functors, identical arity, and all their
corresponding children match.
The effect of this operation is to ignore any
special matching semantics of
.IR x 's
functor, should it happen to be one of the special values mentioned in these
.IP \(bu 4
If the pattern is
.RI .@. "x" ,
its root will be matched in an associative way.
That means any children of
.I x
whose functor and arity match the root will be replaced in the sequence of
children by their own children, recursively as long as the functor and
arity continue to match.
This process could easily result in the root having more than three
The same thing is done to the root of the input pattern.
Finally, if the resulting functor, expanded arity, and all corresponding
children match, then the pattern is considered to match.
The effect is to ignore the parenthesization of a binary (or ternary, but
that makes less sence) operator that has an associative law: +a+bc
and ++abc will be treated the same.
.IP \(bu 4
If the pattern is
.RI ./. "x"
and if
.I x
and the input both have heads, then it matches if and only if the head of
.IR x ,
considered as a regular expression, matches the head of the input.
.I x
and the input do not both have heads, then it matches if and only if
.I x
and the input have identical arity, all their corresponding children match
recursively, and the functor of
.IR x ,
considered as a regular expression, matches the functor of the input.
The effect of this operation is to do the usual matching on trees except
with equality replaced (at the root) by a regular expression test.
Regular expression matching is performed using the PCRE library.
Patterns given to PCRE have already passed through
.BR idsgrep 's
parser and so an additional level of escaping may be necessary if
backslash escapes are desired in a pattern.
.IP \(bu 4
Otherwise, the pattern matches the input if and only if its functor and
arity are the same as the input's and all the children of the pattern match
the corresponding children of the input.
Individual sites may well have a different set of dictionaries installed,
but these are some popular ones.
The paths shown are those that were specified during configuration of
this package and built into the
.BR idsgrep (1)
binary; site admins who think they are too cool for
.B configure
may possibly have installed the files elsewhere.
.I @flat_dictdir@/chise.eids
Character decompositions from the CHISE EIDS database.
Coverage of approximately 130000 Han-script characters, spanning multiple
Non-Unicode characters are expressed using symbolic names apparently invented
by the CHISE project, or possibly by the affiliated UTF-2000 initiative.
This dictionary is generally of high quality because the original source
provides it more or less in IDS format already (actually an extended
IDS format of their own, distinct from IDSgrep's extended IDS); as a result
there is very little guesswork involved in the conversion to IDSgrep's EIDS.
Its broad coverage is hard to beat.
However, about 6% of the entries in the CHISE IDS database are
syntactically invalid, and are therefore excluded from IDSgrep's
converted dictionary.
What that implies about the quality of the remaining entries is an
open question.
Because this dictionary has GPL 2+ licensing, it can be bundled with the
IDSgrep source package (the majority of which is GPL 3).
.I @flat_dictdir@/edict.eids
Japanese-English dictionary of words and their meanings, based on
EDICT2 with character decompositions from one of the other dictionaries
selected at build time, most likely CHISE IDS.
This allows searching for multi-character words using partial descriptions
of the individual characters.
The EDICT2 entries are translated to EIDS format such that the EDICT2 head
(typically the word being defined) is the head of the EIDS; below that there
follows a chain of binary comma nodes each having a character of the word as
its left child and the rest of the chain as the right child.
These left children are decomposed according to whichever dictionary was used.
The final right child is a nullary node containing the remainder of the
EDICT2 entry, including pronunciation, part of speech, definition, and any
other tags.
For example, an entry that defined \(lqXYZ\(rq as \(lqdefinition\(rq might
be encoded as \(lq<XYZ>,X,Y,Z(definition)\(rq.
.I @flat_dictdir@/kanjivg.eids
Japanese kanji decompositions from the KanjiVG database.
Basically complete coverage of about 6700 characters.
However, KanjiVG decomposes glyphs according to stroke order and traditional
etymology, even to
the point of listing a single radical more than once in the decomposition if
its strokes are written non-consecutively,
and the format translator that generates this file only sort
of works.
As a result, decompositions in this database may be incomplete,
idiosyncratic, or even flat-out wrong.
.I @flat_dictdir@/tsukurimashou.eids
Japanese kanji decompositions from the Tsukurimashou font project.
These are relatively clean in terms of accurately reflecting the visual
construction of each character, but they only cover the glyphs included in
the fonts, and they are based on the visual appearance of the glyphs
(and, specifically, their appearance
.IR "in the Tsukurimashou fonts" )
rather than traditional etymology.
If specified, this will be used as the path for default dictionaries
as used by the -d option, instead of the value
.I @flat_dictdir@
which was specified at build time.
Note that locale-related environment variables are
.I not
Input and output are in UTF-8 regardless of locale.
Non-ASCII characters are essential to documenting this software, and
.B man
can't handle them portably, so this document is less useful than it should
PCRE, due to the use of C-strings in its API, cannot accept literal zero
bytes (<U+0000>) in matching patterns, even though EIDS syntax would
otherwise allow them.
Zero bytes are allowed in matching subjects, and may be matched
by writing them in the pattern with PCRE's escape syntax instead of
If used on a system that permits non-UTF-8 byte sequences as filenames, it
is possible that
may fail when it attempts to write out such a filename, because of
trying to pass the filename through escape-generation logic designed for
This issue is hard to test because all modern systems use UTF-8 or some
restriction thereof for filenames, but it will probably be revisited, and
if it's a problem, fixed, in some future version.
Matthew Skala <>
Copyright \(co
Matthew Skala
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program.
If not, see <>.
Please note that dictionaries prepared for use with IDSgrep may be subject to
their own copyright terms differing from those of IDSgrep itself.
In particular, the IDSgrep distribution contains code to build dictionaries
based on KanjiVG and EDICT2.
The input files for those are subject to the Creative Commons
Attribution-Share Alike 3.0 Licence, and their authors might make a
copyright claim on the resulting dictionaries.
The input files for the CHISE IDS dictionary are subject to GPL 2 or any
later version.
The Tsukurimashou project also builds an EIDS-format dictionary for
use with IDSgrep, but happens to use the same copyright and GPL 3 licensing
terms as IDSgrep anyway.
.BR grep (1),
.BR pcrepattern (3)
Jump to Line
Something went wrong with that request. Please try again.