Skip to content

Commit

Permalink
Change from token type to tokenizer type (#2)
Browse files Browse the repository at this point in the history
* Update API

Switch internal terminology from use of `Token` to `Tokenizer`.
Change external API from `matches(:)` back to the original `tokens(:)`.
Add a convenience method `components(:)` that returns just substrings
rather than array of `Token`.

* Update file name.

* Update README to show `Token` type.

* Include `import Mustard` in code snippet.
  • Loading branch information
mathewsanders committed Jan 2, 2017
1 parent 56bfad7 commit 6cbc2d8
Show file tree
Hide file tree
Showing 19 changed files with 614 additions and 567 deletions.
82 changes: 39 additions & 43 deletions Documentation/Expressive matching.md
Original file line number Diff line number Diff line change
@@ -1,83 +1,79 @@
# Example: expressive matching

The results returned by `matches(from:)`returns an array tuples with the signature `(tokenizer: TokenType, text: String, range: Range<String.Index>)`
The results returned by `tokens(matchedWith:)`returns an array `Token` which in turn is a tuple with the signature `(tokenizer: TokenizerType, text: String, range: Range<String.Index>)`

To make use of the `tokenizer` element, you need to either use type casting (using `as?`) or type checking (using `is`) for the `tokenizer` element to be useful.

Maybe we want to filter out only tokens that are numbers:
Maybe we want to filter out only tokens that were matched with a number tokenizer:

````Swift
import Mustard

let messy = "123Hello world&^45.67"
let matches = messy.matches(from: .decimalDigits, .letters)
// matches.count -> 5
let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens.count -> 5

let numbers = matches.filter({ $0.tokenizer is NumberToken })
// numbers.count -> 0
let numberTokens = tokens.filter({ $0.tokenizer is NumberTokenizer })
// numberTokens.count -> 0

````

This can lead to bugs in your logic-- in the example above `numberTokens` will be empty because the tokenizers used were the character sets `.decimalDigits`, and `.letters`, so the filter won't match any of the tokens.
This can lead to bugs in your logic-- in the example above `numberTokens` will be empty because the tokenizers used were `CharacterSet.decimalDigits`, and `CharacterSet.letters`, so the filter won't match any of the tokens.

This may seem like an obvious error, but it's the type of unexpected bug that can slip in when we're using loosely typed results.

Thankfully, Mustard can return a strongly typed set of matches if a single `TokenType` is used:
Thankfully, Mustard can return a strongly typed set of matches if a single `TokenizerType` is used:

````Swift
import Mustard

let messy = "123Hello world&^45.67"

// call `matches()` method on string to get matching tokens from string
let numberMatches: [NumberToken.Match] = messy.matches()
// numberMatches.count -> 2
// call `tokens()` method on `String` to get matching tokens from the string
let numberTokens: [NumberTokenizer.Token] = "123Hello world&^45.67".tokens()
// numberTokens.count -> 2

````

Used in this way, this isn't very useful, but it does allow for multiple `TokenType` to be bundled together as a single `TokenType` by implementing a TokenType using an `enum`.
Used in this way, this isn't very useful, but it does allow for multiple `TokenizerType` to be bundled together as a single tokenizer by implementing with an `enum`.

An enum token type can either manage it's own internal state, or potentially act as a lightweight wrapper to existing tokenizers.
Here's an example `TokenType` that acts as a wrapper for word, number, and emoji tokenizers:
An enum tokenizer can either manage it's own internal state, or potentially act as a lightweight wrapper to other existing tokenizers.

````Swift
Here's an example `TokenizerType` that acts as a wrapper for word, number, and emoji tokenizers:

enum MixedToken: TokenType {
````Swift
enum MixedTokenizer: TokenizerType {

case word
case number
case emoji
case none // 'none' case not strictly needed, and
// in this implementation will never be matched

init() {
self = .none
}

static let wordToken = WordToken()
static let numberToken = NumberToken()
static let emojiToken = EmojiToken()
static let wordTokenizer = WordTokenizer()
static let numberTokenizer = NumberTokenizer()
static let emojiTokenizer = EmojiTokenizer()

func canAppend(next scalar: UnicodeScalar) -> Bool {
func tokenCanTake(_ scalar: UnicodeScalar) -> Bool {
switch self {
case .word: return MixedToken.wordToken.canAppend(next: scalar)
case .number: return MixedToken.numberToken.canAppend(next: scalar)
case .emoji: return MixedToken.emojiToken.canAppend(next: scalar)
case .word: return MixedTokenizer.wordTokenizer.tokenCanTake(scalar)
case .number: return MixedTokenizer.numberTokenizer.tokenCanTake(scalar)
case .emoji: return MixedTokenizer.emojiTokenizer.tokenCanTake(scalar)
case .none:
return false
}
}

func token(startingWith scalar: UnicodeScalar) -> TokenType? {
func token(startingWith scalar: UnicodeScalar) -> TokenizerType? {

if let _ = MixedToken.wordToken.token(startingWith: scalar) {
return MixedToken.word
if let _ = MixedTokenizer.wordTokenizer.token(startingWith: scalar) {
return MixedTokenizer.word
}
else if let _ = MixedToken.numberToken.token(startingWith: scalar) {
return MixedToken.number
else if let _ = MixedTokenizer.numberTokenizer.token(startingWith: scalar) {
return MixedTokenizer.number
}
else if let _ = MixedToken.emojiToken.token(startingWith: scalar) {
return MixedToken.emoji
else if let _ = MixedTokenizer.emojiTokenizer.token(startingWith: scalar) {
return MixedTokenizer.emoji
}
else {
return nil
Expand All @@ -90,25 +86,25 @@ Mustard defines a default typealias for `Token` that exposes the specific type i
results tuple.

````Swift
public extension TokenType {
typealias Match = (tokenizer: Self, text: String, range: Range<String.Index>)
public extension TokenizerType {
typealias Token = (tokenizer: Self, text: String, range: Range<String.Index>)
}
````

Setting your results array to this type gives you the option to use the shorter `matches()` method,
Setting your results array to this type gives you the option to use the shorter `tokens()` method,
where Mustard uses the inferred type to perform tokenization.

Since the matches array is strongly typed, you can be more expressive with the results, and the
Since the tokens array is strongly typed, you can be more expressive with the results, and the
complier can give you more hints to prevent you from making mistakes.

````Swift

// use the `matches()` method to grab matching substrings using a single tokenizer
let matches: [MixedToken.Match] = "123👩‍👩‍👦‍👦Hello world👶 again👶🏿 45.67".matches()
// matches.count -> 8
// use the `tokens()` method to grab matching substrings using a single tokenizer
let tokens: [MixedTokenizer.Token] = "123👩‍👩‍👦‍👦Hello world👶 again👶🏿 45.67".tokens()
// tokens.count -> 8

matches.forEach({ match in
switch (match.tokenizer, match.text) {
tokens.forEach({ token in
switch (token.tokenizer, token.text) {
case (.word, let word): print("word:", word)
case (.number, let number): print("number:", number)
case (.emoji, let emoji): print("emoji:", emoji)
Expand Down
58 changes: 29 additions & 29 deletions Documentation/Greedy tokens and tokenizer order.md
Original file line number Diff line number Diff line change
@@ -1,43 +1,43 @@
# Greedy tokens and tokenizer order

Tokenizers are greedy. The order that tokenizers are passed into the `matches(from: TokenType...)` will effect how substrings are matched.
Tokenizers are greedy. The order that tokenizers are passed into the `matches(from: TokenizerType...)` will effect how substrings are matched.

Here's an example using the `CharacterSet.decimalDigits` tokenizer and the custom tokenizer `DateToken` that matches dates in the format `MM/dd/yy` ([see example](Tokens with internal state.md) for implementation).
Here's an example using the `CharacterSet.decimalDigits` tokenizer and the custom tokenizer `DateTokenizer` that matches dates in the format `MM/dd/yy` ([see example](Tokens with internal state.md) for implementation).

````Swift
import Mustard

let numbers = "03/29/17 36"
let matches = numbers.matches(from: CharacterSet.decimalDigits, DateToken.tokenizer)
// matches.count -> 4
let tokens = numbers.tokens(matchedWith: CharacterSet.decimalDigits, DateTokenizer.defaultTokenizer)
// tokens.count -> 4
//
// matches[0].text -> "03"
// matches[0].tokenizer -> CharacterSet.decimalDigits
// tokens[0].text -> "03"
// tokens[0].tokenizer -> CharacterSet.decimalDigits
//
// matches[1].text -> "29"
// matches[1].tokenizer -> CharacterSet.decimalDigits
// tokens[1].text -> "29"
// tokens[1].tokenizer -> CharacterSet.decimalDigits
//
// matches[2].text -> "17"
// matches[2].tokenizer -> CharacterSet.decimalDigits
// tokens[2].text -> "17"
// tokens[2].tokenizer -> CharacterSet.decimalDigits
//
// matches[3].text -> "36"
// matches[3].tokenizer -> CharacterSet.decimalDigits
// tokens[3].text -> "36"
// tokens[3].tokenizer -> CharacterSet.decimalDigits
````

To get expected behavior, the `matches` method should be called with more specific tokenizers placed before more general tokenizers:
To get expected behavior, the `tokens` method should be called with more specific tokenizers placed before more general tokenizers:

````Swift
import Mustard

let numbers = "03/29/17 36"
let matches = numbers.matches(from: DateToken.tokenizer, CharacterSet.decimalDigits)
// matches.count -> 2
let tokens = numbers.tokens(matchedWith: DateTokenizer.defaultTokenizer, CharacterSet.decimalDigits)
// tokens.count -> 2
//
// matches[0].text -> "03/29/17"
// matches[0].tokenizer -> DateToken()
// tokens[0].text -> "03/29/17"
// tokens[0].tokenizer -> DateTokenizer()
//
// matches[1].text -> "36"
// matches[1].tokenizer -> CharacterSet.decimalDigits
// tokens[1].text -> "36"
// tokens[1].tokenizer -> CharacterSet.decimalDigits
````

If the more specific tokenizer fails to match a token, the more general tokens still have a chance to perform matches:
Expand All @@ -46,18 +46,18 @@ If the more specific tokenizer fails to match a token, the more general tokens s
import Mustard

let numbers = "99/99/99 36"
let matches = numbers.matches(from: DateToken.tokenizer, CharacterSet.decimalDigits)
// matches.count -> 4
let tokens = numbers.tokens(matchedWith: DateTokenizer.defaultTokenizer, CharacterSet.decimalDigits)
// tokens.count -> 4
//
// matches[0].text -> "99"
// matches[0].tokenizer -> CharacterSet.decimalDigits
// tokens[0].text -> "99"
// tokens[0].tokenizer -> CharacterSet.decimalDigits
//
// matches[1].text -> "99"
// matches[1].tokenizer -> CharacterSet.decimalDigits
// tokens[1].text -> "99"
// tokens[1].tokenizer -> CharacterSet.decimalDigits
//
// matches[2].text -> "99"
// matches[2].tokenizer -> CharacterSet.decimalDigits
// tokens[2].text -> "99"
// tokens[2].tokenizer -> CharacterSet.decimalDigits
//
// matches[3].text -> "36"
// matches[3].tokenizer -> CharacterSet.decimalDigits
// tokens[3].text -> "36"
// tokens[3].tokenizer -> CharacterSet.decimalDigits
````
12 changes: 6 additions & 6 deletions Documentation/Matching emoji.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,21 +6,21 @@ As an example, the character '👶🏿' is comprised by two scalars: '👶', and
The rainbow flag character '🏳️‍🌈' is again comprised by two adjacent scalars '🏳' and '🌈'.
A final example, the character '👨‍👨‍👧‍👦' is actually 7 scalars: '👨' '👨' '👧' '👦' joined by three ZWJs (zero-with joiner).

To create a TokenType that matches emoji we can instead check to see if a scalar falls within known range, or if it's a ZWJ.
To create a `TokenizerType` that matches emoji we can instead check to see if a scalar falls within known range, or if it's a ZWJ.

This isn't the most *accurate* emoji tokenizer because it would potentially matches an emoji scalar followed by 100 zero-width joiners, but for basic use it might be enough.

````Swift
struct EmojiToken: TokenType {
struct EmojiTokenizer: TokenizerType {

// (e.g. can't start with a ZWJ)
func canStart(with scalar: UnicodeScalar) -> Bool {
return EmojiToken.isEmojiScalar(scalar)
func tokenCanStart(with scalar: UnicodeScalar) -> Bool {
return EmojiTokenizer.isEmojiScalar(scalar)
}

// either in the known range for a emoji, or a ZWJ
func canTake(_ scalar: UnicodeScalar) -> Bool {
return EmojiToken.isEmojiScalar(scalar) || EmojiToken.isJoiner(scalar)
func tokenCanTake(_ scalar: UnicodeScalar) -> Bool {
return EmojiTokenizer.isEmojiScalar(scalar) || EmojiTokenizer.isJoiner(scalar)
}

static func isJoiner(_ scalar: UnicodeScalar) -> Bool {
Expand Down
114 changes: 0 additions & 114 deletions Documentation/TallyType protocol.md

This file was deleted.

0 comments on commit 6cbc2d8

Please sign in to comment.