readme

commit a021b95bc51bb18b110933d942449c26ce708bf8 1 parent 6d21b9a
Todd Ditchendorf authored

Showing 1 changed file with 38 additions and 20 deletions.

README.textile: 38 additions, 20 deletions
@@ -24,6 +24,7 @@ h2. Projects using ParseKit:
24 24 "Exedore":http://tr.im/exedore: XPath 1.0 implemented in Cocoa (ported from "Saxon":http://saxonica.com/)
25 25
26 26 h2. Xcode Project
  27 +
27 28 The ParseKit Xcode project consists of 6 targets:
28 29
29 30 **ParseKit** : the ParseKit Objective-C framework. The central feature/codebase of this project.
@@ -41,6 +42,7 @@ The API for tokenization is provided by the PKTokenizer class. Cocoa developers
41 42 Example usage:
42 43
43 44
  45 +<pre>
44 46 NSString *s = @"\"It's 123 blast-off!\", she said, // watch out!\n"
45 47 @"and <= 3.5 'ticks' later /* wince */, it's blast-off!";
46 48 PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
@@ -53,6 +55,7 @@ while ((tok = [t nextToken]) != eof) {
53 55
54 56 outputs:
55 57
  58 +<pre>
56 59 ("It's 123 blast-off!")
57 60 (,)
58 61 (she)
@@ -73,11 +76,13 @@ Each token produced is an object of class PKToken. PKTokens have a tokenType (Wo
73 76 More information about a token can be easily discovered using the -debugDescription method instead of the default -description. Replace the line containing NSLog above with this line:
74 77
75 78
  79 +<pre>
76 80 NSLog(@" (%@)", [tok debugDescription]);
77 81 </pre>
78 82
79 83 and each token's type will be printed as well:
80 84
  85 +<pre>
81 86 <Quoted String «"It's 123 blast-off!"»>
82 87 <Symbol «,»>
83 88 <Word «she»>
@@ -96,56 +101,68 @@ and each token's type will be printed as well:
96 101
97 102 As you can see from the output, PKTokenizer is configured by default to properly group characters into tokens including:
98 103
99   -single- and double-quoted string tokens
100   -common multiple character symbols (<=)
101   -apostrophes, dashes and other symbol chars that should not signal the start of a new Symbol token, but rather be included in the current Word or Num token (it's, blast-off, 3.5)
102   -silently ignoring C- and C++-style comments
103   -silently ignoring whitespace
  104 +* single- and double-quoted string tokens
  105 +* common multiple character symbols (<=)
  106 +* apostrophes, dashes and other symbol chars that should not signal the start of a new Symbol token, but rather be included in the current Word or Num token (it's, blast-off, 3.5)
  107 +* silently ignoring C- and C++-style comments
  108 +* silently ignoring whitespace
104 109
105 110 The PKTokenizer class is very flexible, and **all** of those features are configurable. PKTokenizer may be configured to:
106 111
107   -recognize more (or fewer) multi-char symbols. ex: p. [t.symbolState add:@"!="];</pre>
  112 +* recognize more (or fewer) multi-char symbols. ex:
  113 +
  114 +<pre>[t.symbolState add:@"!="];</pre>
  115 +
108 116 allows != to be recognized as a single Symbol token rather than two adjacent Symbol tokens
109 117
110   -add new internal symbol chars to be included in the current Word token OR recognize internal symbols like apostrophe and dash to actually signal a new Symbol token rather than being part of the current Word token. ex:
111   -p. [t.wordState setWordChars:YES from:'_' to:'_'];</pre>
  118 +* add new internal symbol chars to be included in the current Word token OR recognize internal symbols like apostrophe and dash to actually signal a new Symbol token rather than being part of the current Word token. ex:
  119 +
  120 +<pre>[t.wordState setWordChars:YES from:'_' to:'_'];</pre>
  121 +
112 122 allows Word tokens to contain internal underscores
113   -p. [t.wordState setWordChars:NO from:'-' to:'-'];</pre>
  123 +
  124 +<pre>[t.wordState setWordChars:NO from:'-' to:'-'];</pre>
  125 +
114 126 disallows Word tokens from containing internal dashes.
115 127
116   -change which chars signal the start of a token of any given type. ex:
117   -p. [t setTokenizerState:t.wordState from:'_' to:'_'];</pre>
  128 +* change which chars signal the start of a token of any given type. ex:
  129 +
  130 +<pre>[t setTokenizerState:t.wordState from:'_' to:'_'];</pre>
  131 +
118 132 allows Word tokens to start with underscore
119   -p. [t setTokenizerState:t.quoteState from:'*' to:'*'];</pre>
  133 +
  134 +<pre>[t setTokenizerState:t.quoteState from:'*' to:'*'];</pre>
120 135 allows Quoted String tokens to start with an asterisk, effectively making * a new quote symbol (like " or ')
121 136
122   -turn off recognition of single-line "slash-slash" (//) comments. ex:
123   -p. [t setTokenizerState:t.symbolState from:'/' to:'/'];</pre>
  137 +* turn off recognition of single-line "slash-slash" (//) comments. ex:
  138 +
  139 +<pre>[t setTokenizerState:t.symbolState from:'/' to:'/'];</pre>
  140 +
124 141 slash chars now produce individual Symbol tokens rather than causing the tokenizer to strip text until the next newline char or begin stripping for a multiline comment if appropriate (/*)
125 142
126   -turn on recognition of "hash" (#) single-line comments. ex:
  143 +* turn on recognition of "hash" (#) single-line comments. ex:
127 144
128   -[t setTokenizerState:t.commentState from:'#' to:'#'];
  145 +<pre>[t setTokenizerState:t.commentState from:'#' to:'#'];
129 146 [t.commentState addSingleLineStartSymbol:@"#"];</pre>
130 147
131 148
132   -turn on recognition of "XML/HTML" (<!-- -->) multi-line comments. ex:
  149 +* turn on recognition of "XML/HTML" (<!-- -->) multi-line comments. ex:
133 150
134 151 <pre>[t setTokenizerState:t.commentState from:'<' to:'<'];
135 152 [t.commentState addMultiLineStartSymbol:@"<!--" endSymbol:@"-->"];</pre>
136 153
137 154
138   -report (rather than silently consume) Comment tokens. ex:
  155 +* report (rather than silently consume) Comment tokens. ex:
139 156
140 157 <pre>t.commentState.reportsCommentTokens = YES; // default is NO</pre>
141 158
142 159
143   -report (rather than silently consume) Whitespace tokens. ex:
  160 +* report (rather than silently consume) Whitespace tokens. ex:
144 161
145 162 <pre>t.whitespaceState.reportsWhitespaceTokens = YES; // default is NO</pre>
146 163
147 164
148   -turn on recognition of any characters (say, digits) as whitespace to be silently ignored. ex:
  165 +* turn on recognition of any characters (say, digits) as whitespace to be silently ignored. ex:
149 166
150 167 <pre>[t setTokenizerState:t.whitespaceState from:'0' to:'9'];</pre>
151 168
@@ -153,6 +170,7 @@ turn on recognition of any characters (say, digits) as whitespace to be silently
153 170
154 171
155 172 h3. Parsing
  173 +
156 174 ParseKit also includes a collection of token parser subclasses (of the abstract PKParser class) including collection parsers such as PKAlternation, PKSequence, and PKRepetition as well as terminal parsers including PKWord, PKNum, PKSymbol, PKQuotedString, etc. Also included are parser subclasses which work in individual chars such as PKChar, PKDigit, and PKSpecificChar. These char parsers are useful for things like RegEx parsing. Generally speaking though, the token parsers will be more useful and interesting.
157 175 The parser classes represent a **Composite** pattern. Programs can build a composite parser, in **Objective-C** (rather than a separate language like with lex&yacc), from a collection of terminal parsers composed into alternations, sequences, and repetitions to represent an infinite number of languages.
158 176 Parsers built from ParseKit are **non-deterministic, recursive descent parsers**, which basically means they trade some performance for ease of user programming and simplicity of implementation.
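
The Composite pattern described in the Parsing section can be sketched roughly as follows. This is a minimal, unverified illustration and not part of the commit: it assumes the framework is imported as <ParseKit/ParseKit.h>, and the factory and matching methods used here (+alternation, +sequence, +repetitionWithSubparser:, +word, +num, +symbolWithString:, -add:, +assemblyWithString:, -completeMatchFor:) and the PKTokenAssembly class are recalled from the ParseKit headers rather than shown in this README, so check them against the version you build with. The grammar itself (a comma-separated list of words or numbers) is a made-up example.

<pre>
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumed umbrella header

int main(int argc, const char *argv[]) {
    @autoreleasepool {
        // item = Word | Num
        PKAlternation *item = [PKAlternation alternation];
        [item add:[PKWord word]];
        [item add:[PKNum num]];

        // tail = ',' item
        PKSequence *tail = [PKSequence sequence];
        [tail add:[PKSymbol symbolWithString:@","]];
        [tail add:item];

        // list = item tail*
        PKSequence *list = [PKSequence sequence];
        [list add:item];
        [list add:[PKRepetition repetitionWithSubparser:tail]];

        // Tokenize the input and ask the composite parser for a complete match.
        // The result is expected to be nil when the whole input cannot be consumed.
        PKAssembly *input = [PKTokenAssembly assemblyWithString:@"foo, 42, bar"];
        PKAssembly *result = [list completeMatchFor:input];
        NSLog(@"%@", result ? @"matched" : @"no match");
    }
    return 0;
}
</pre>

Building the grammar out of ordinary objects keeps everything in Objective-C, which is the trade-off the README describes: no separate grammar language as with lex and yacc, at some cost in parsing speed.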
