New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative to BufferedGenerator #12

Closed
Westacular opened this Issue Aug 20, 2014 · 5 comments

Comments

Projects
None yet
2 participants
@Westacular
Copy link

Westacular commented Aug 20, 2014

I thought the whole

buffer.next()
while buffer.current != nil {
    if let unicode = buffer.current { // ... somewhere, buffer.next() is called

dance was kind of ugly: you're dealing with the overhead of using a generator, but receiving none of the benefits it provides (e.g. for in loops). Also, using a struct for your BufferedGenerator seems odd -- you end up using a class as a backing store anyway, and having it as a struct means using inout parameters all over the place. There's a discussion on the dev forums that argues the case why GeneratorTypes should, in general, just be reference types.

Anyway, I took a different view on the parsing issue -- it's not that you need access to the "current" element so much as it is that determining what sub-parser to use forces the generator to consume one too many characters. Really, if you could just rewind the generator back a character, everything would be simple. So I wrote a class that lets you do that:

/// Creates a `GeneratorType`/`SequenceType` that allows rewinding and can be passed around.
final public class RewindableGenerator<S : CollectionType where S.Index : BidirectionalIndexType> : GeneratorType, SequenceType {
    typealias Sequence = S

    private var currentIndex: Sequence.Index
    private let prestartIndex: Sequence.Index

    private let seq: Sequence

    // store the `current` element, but it's not really necessary
    private(set) var current: Sequence.Generator.Element?

    /// Initializes a new `RewindableGenerator<S>` with an underlying `CollectionType`.
    /// Requires that `CollectionType.Index` be a `BidirectionalIndexType`.
    ///
    /// :param: sequence the sequence that will be used to traverse the content.
    public init(_ sequence: Sequence) {
        self.seq = sequence
        self.currentIndex = sequence.startIndex
        self.prestartIndex = sequence.startIndex.predecessor()
    }

    /// Moves the `current` element to the next element if one exists.
    ///
    /// :return: The `current` element or `nil` if the element does not exist.
    public func next() -> Sequence.Generator.Element? {

        if currentIndex == seq.endIndex {
            return nil
        }

        currentIndex = currentIndex.successor()

        if currentIndex != seq.endIndex {
            self.current = seq[currentIndex]
        } else {
            self.current = nil
        }
        return self.current
    }

    /// Moves the `current` element to the previous element if one exists.
    ///
    /// :return: The `current` element or `nil` if the element does not exist.
    public func previous() -> Sequence.Generator.Element? {

        if currentIndex == self.prestartIndex {
            return nil
        }

        currentIndex = currentIndex.predecessor()

        if currentIndex != self.prestartIndex {
            self.current = seq[currentIndex]
        } else {
            self.current = nil
        }
        return self.current
    }

    public func generate() -> Self {
        self.previous()
        return self
    }
}

I don't have time to do a full pull request, but I believe this technique should enable you to use for scalar in buffer loops in all the various parsing functions; each new time you start to iterate through the JSON in one of the sub-parsers, there will implicitly be a call to generate() which will in turn automatically rewind the parsing by a character, revealing that lost character as the first item of iteration.

Here's a goofy example use from my testing, that just grabs the characters between paired brackets and adds them to an array:

var words: [String] = []
var unmatched: [String] = []

func parseBrackets(gen: RewindableGenerator<String.UnicodeScalarView>) {
    var sbuf = ""

    for (idx, c) in enumerate(gen) {
        if idx == 0 {
            if c == "[" {
                continue
            } else {
                // error or something
                return
            }
        }

        switch c {
        case "[":
            // nested brackets
            parseBrackets(gen)
        case "]":
            words.append(sbuf)
            return
        default:
            sbuf.append(c)
        }
    }
    unmatched.append(sbuf)
}


let s = "[This] [is ][a] weird [ne[st]e[d]] [\u{aaef}string!"
let g = RewindableGenerator(s.unicodeScalars)

for c in g {
    if c == "[" {
        parseBrackets(g)
    }
}

// words == ["This", "is ", "a", "st", "d", "nee"]
// unmatched == "ꫯstring!"
@owensd

This comment has been minimized.

Copy link
Owner

owensd commented Aug 20, 2014

Yes, you are correct. The original problem was that I was using String.UTF8View before I moved over to String.UnicodeScalarView. While it is true that String.UnicodeScalarView.Index implements BidirectionalIndexType, String.UTF8View.Index only implements ForwardIndexType. Since I was going back and forth between the two testing out some performance characters, I didn't even think to pursue what you've done.

My initial implementation of JSON parsing actually did the back-and-forth dance when I was just using the base String type. That was before I ran into the combining unicode characters problem.

I'll take a closer look at and you suggest above because I do think it is better, especially given the constraint that I should have been using String.UnicodeScalarView from the start.

Thanks!

@owensd

This comment has been minimized.

Copy link
Owner

owensd commented Aug 20, 2014

Ah yes... I remember why I stopped using the Index based approach in one of my earlier iterations... performance sucks!!

func test_baseline() {
    var string = ""

    self.measureBlock() {
        for scalar in self.largeJSON.unicodeScalars {
            scalar.writeTo(&string)
        }
    }
}

func test_generate() {
    var string = ""

    self.measureBlock() {
        var generator = RewindableGenerator(self.largeJSON.unicodeScalars)
        for scalar in generator {
            scalar.writeTo(&string)
        }
    }
}

Two code samples looping over the same string (about 688KB in size). The GeneratorType version takes 0.77s and the RewindableGenerator takes 1.77s to perform the task.

The problem is two fold:

  1. currentIndex = currentIndex.successor()
  2. collection[currentIndex]

Both of these methods are actually fairly slow. Maybe just a limitation in Beta 6, but I'm not sure...

@owensd

This comment has been minimized.

Copy link
Owner

owensd commented Aug 20, 2014

Here's a better BufferedGenerator implementation:

/// Creates a `GeneratorType` that is able to "replay" its previous value on the next `next()` call.
/// Useful for generator sequences in which you need to simply step-back a single element.
final public class ReplayableGenerator<S: SequenceType> : GeneratorType, SequenceType {
    typealias Sequence = S

    private var firstRun = true
    private var usePrevious = false
    private var previousElement: Sequence.Generator.Element? = nil
    private var generator: Sequence.Generator

    /// Initializes a new `ReplayableGenerator<S>` with an underlying `SequenceType`.
    ///
    /// :param: sequence the sequence that will be used to traverse the content.
    public init(_ sequence: Sequence) {
        self.generator = sequence.generate()
    }

    /// Moves the current element to the next element in the collection, if one exists.
    ///
    /// :return: The `current` element or `nil` if the element does not exist.
    public func next() -> Sequence.Generator.Element? {
        switch usePrevious {
        case true:
            usePrevious = false
            return previousElement

        default:
            previousElement = generator.next()
            return previousElement
        }
    }

    /// Ensures that the next call to `next()` will use the current value.
    public func replay() {
        usePrevious = true
        return
    }

    /// Creates a generator that can be used to traversed the content. Each call to
    /// `generate` will call `replay()`.
    ///
    /// :return: A iteratable collection backing the content.
    public func generate() -> ReplayableGenerator {
        switch firstRun {
        case true:
            firstRun = false
            return self

        default:
            self.replay()
            return self
        }
    }

    /// Determines if the generator is at the end of the collection's content.
    ///
    /// :return: `true` if there more content in the collection to view, `false` otherwise.
    public func atEnd() -> Bool {
        let element = next()
        replay()

        return element == nil
    }
}

This implementation gets my parser back down to 0.25s vs. 0.17s for the NSJSONSerialization parser.

@owensd

This comment has been minimized.

Copy link
Owner

owensd commented Aug 21, 2014

@owensd owensd closed this Aug 21, 2014

@Westacular

This comment has been minimized.

Copy link

Westacular commented Aug 21, 2014

Glad to help. Great job, and great post! Given the performance cost of indexing, I think I'd have taken the same approach as you did with ReplayableGenerator. The parsing code looks a lot cleaner now, I'd say.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment