Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to use this to decode JSON lines? #33

Closed
rj93 opened this issue Jul 25, 2022 · 10 comments
Closed

Is there a way to use this to decode JSON lines? #33

rj93 opened this issue Jul 25, 2022 · 10 comments
Assignees

Comments

@rj93
Copy link

rj93 commented Jul 25, 2022

I have a JSON line file and I'm trying to parse it, I'm wondering if this can be used?

Currently swift-json is ~4 times slower than the Foundation implementation, but I suspect this is because I am recreating the decoder each time:

swift-json (~97 seconds)

let (bytes, urlResponse) = try! await URLSession.shared.bytes(from: file)
var logs: [Log] = []
do {
    let startTime = CFAbsoluteTimeGetCurrent()
    for try await line in bytes.lines {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        if !trimmed.isEmpty {
            let decoder: JSON = try Grammar.parse(trimmed.utf8, as: JSON.Rule<String.Index>.Root.self)
            let log: Log = try .init(from: decoder)
            logs.append(log)
        }
    }
    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    print("Time elapsed: \(timeElapsed) s.")
} catch {
    print("Reading file failed with error \(error)")
}

Foundation (~24 seconds)

let decoder: JSONDecoder = JSONDecoder()
let (bytes, urlResponse) = try! await URLSession.shared.bytes(from: file)
var logs: [Log] = []
do {
    let startTime = CFAbsoluteTimeGetCurrent()
    for try await line in bytes.lines {
        let data = line.trimmingCharacters(in: .whitespaces).data(using: .utf8) ?? Data()
        if !data.isEmpty {
            let log: Log = try! decoder.decode(Log.self, from: data)
            logs.append(log)
        }
    }
    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    print("Time elapsed: \(timeElapsed) s.")
} catch {
    print("Reading file failed with error \(error)")
}

Tested on a ~100MB file with 140000 lines

@tayloraswift
Copy link
Owner

tayloraswift commented Jul 26, 2022

there are two reasons why this may be slow:

let trimmed = line.trimmingCharacters(in: .whitespaces)

this is not necessary. the Root rule skips leading whitespace anyway, and the foundation trimmingCharacters(in:) copies the string to new storage, which is wasteful.

if you know the root message is an object (and not an array), it may also be slightly faster to use the Object rule.


let log: Log = try .init(from: decoder)

this is probably the main reason why your decoding is slow, because it is using the standard library Codable hooks, which use existential dispatch. JSONDecoder also does this, but the reason swift-json can be faster than JSONDecoder is because it allows you to bypass the Codable interface by dropping down to the LintingDictionary level. (this usually gives like a 10x speedup)

alas i have not gotten a chance to write tutorials for LintingDictionary (it is a v0.3.0 API), but if you can share the format of your JSON messages, i’d be happy to point you in the right direction.

by the way, Grammar.parse(trimmed.utf8, as: JSON.Rule<String.Index>.Root.self) doesn’t really have “initialization overhead”, and swift-json doesn’t really “create” decoders the way Foundation does. however, creating the Decoder wrapper is rather expensive, which is why you usually want to decode with LintingDictionary. (or even switch on JSON variants directly, for really bare-metal performance)

finally, would you mind sharing your test data (and the toolchain version used to compile the test program)? a 4x slowdown seems extreme, even if you're using the Codable wrappers.

@rj93
Copy link
Author

rj93 commented Jul 26, 2022

Thanks for the quick reply @Kelvin13.

I tried removing let trimmed = line.trimmingCharacters(in: .whitespaces), let log: Log = try .init(from: decoder) and updated it to use Object rule as follows, but it still took 85 seconds:

let (bytes, urlResponse) = try! await URLSession.shared.bytes(from: file)
var logs: [Log] = []
do {
    let startTime = CFAbsoluteTimeGetCurrent()
    for try await line in bytes.lines {
        if !line.isEmpty {
            let decoder = try Grammar.parse(line.utf8, as: JSON.Rule<String.Index>.Object.self)
        }
    }
    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    print("Time elapsed: \(timeElapsed) s.")
} catch {
    print("Reading file failed with error \(error)")
}

Side note, how would you create a Codable struct from the [(key: String, value: JSON)] type when you use the Object rule? (Sorry if this is basic, I'm pretty new to swift)


I am unfortunately not able to share the file as it contains sensitive data, if you need one created I can do so, but for now heres an example file:

{"@timestamp":"2000-01-01T00:00:00.000+00:00","message":"An example debug message 1","logger_name":"com.example.ExampleLogger","thread_name":"main","level":"DEBUG"}
{"@timestamp":"2001-01-01T00:00:00.000+00:00","message":"An example info message 2","logger_name":"com.example.ExampleLogger","thread_name":"main","level":"INFO"}
{"@timestamp":"2002-01-01T00:00:00.000+00:00","message":"An example warn message 3","logger_name":"com.example.ExampleLogger","thread_name":"main","level":"WARN"}
{"@timestamp":"2003-01-01T00:00:00.000+00:00","message":"An example error message 4","logger_name":"com.example.ExampleLogger","thread_name":"main","level":"ERROR"}
{"@timestamp":"2004-01-01T00:00:00.000+00:00","message":"An example fatal message 5","logger_name":"com.example.ExampleLogger","thread_name":"main","level":"FATAL"}
{"@timestamp":"2005-01-01T00:00:00.000+00:00","message":"An example message 6","logger_name":"com.example.longer.ExampleLogger","thread_name":"main","level":"INFO"}

and the toolchain version used to compile the test program

I'm not sure if this is what you're looking for:

swift-driver version: 1.45.2 Apple Swift version 5.6.1 (swiftlang-5.6.0.323.66 clang-1316.0.20.12)
Target: arm64-apple-macosx12.0

@rj93
Copy link
Author

rj93 commented Jul 26, 2022

Also 0.3.0 doesn't seem to exist as a tag, but it's in the README so not sure if something has broken there?

@tayloraswift
Copy link
Owner

tayloraswift commented Jul 26, 2022

I tried removing let trimmed = line.trimmingCharacters(in: .whitespaces), let log: Log = try .init(from: decoder) and updated it to use Object rule as follows, but it still took 85 seconds:

it is very strange that is is taking 85 seconds even with no decoding taking place. can you please regenerate the input data so i can investigate this further?

by the way, you are using CFAbsoluteTimeGetCurrent, which is a wall-clock time, and you are running the timer across a suspension boundary (the await in the asynchronous loop). this means that you are not only measuring the execution time of the JSON decoders, but also arbitrary tasks run by the scheduler in between loop iterations.

i don’t know why this is affecting swift-json more than the Foundation implementation, but for the sake of getting an accurate measurement, can you please re-run the benchmarks in a synchronous loop?

also, i wouldn’t normally ask this, but you said you are new to swift, so are you compiling in production mode? (swift build -c release)? this is very important to swift-json and other SPM packages that are distributed as source, because they use the same optimization settings as the rest of the project. (unlike Foundation, which is distributed as a binary.)

Side note, how would you create a Codable struct from the [(key: String, value: JSON)] type when you use the Object rule?

the construction returned by the Object rule is just a normal swift array, and you can operate on it like any other swift array. however, it’s probably easier (and you would get better error-handling) if you use LintingDictionary like this:

init(from json:JSON) throws
{
    (self.timestamp, self.message, self.logger, self.thread, self.level) = 
        try json.lint
    {
        (
            timestamp: try $0.remove("@timestamp", as: String.self),
            message: try $0.remove("message", as: String.self),
            logger: try $0.remove("logger_name", as: String.self),
            thread: try $0.remove("thread_name", as: String.self),
            level: try $0.remove("level") { try $0.case(of: LoggingLevel.self) },
        )
    }
}

where LoggingLevel is some RawRepresentable type, which i am using with the JSON.case(of:) API.

enum LoggingLevel:String 
{
    case info = "INFO"
    case warn = "WARN"
    case debug = "DEBUG"
    ...
}

the easiest way to create the LintingDictionary is just to widen the object to a JSON variant (with the JSON.object(_:) case constructor) and then call one of its lint(_:) methods, but you can also dodge a miniscule amount of overhead by converting to a dictionary yourself, and passing it to LintingDictionary.init(_:)

Also 0.3.0 doesn't seem to exist as a tag, but it's in the README so not sure if something has broken there?

the README is correct, it shows the next planned release, which is 0.3.0, and does not have a tag yet. “0.3.0” right now simply refers to the current master trunk of the repository.

@rj93
Copy link
Author

rj93 commented Jul 28, 2022

Apologies for the delayed reply, been trying to figure out how to make the build work but I've done it now.

You can find a sample application at https://github.com/rj93/json-decode

You were right though, swift-json's performance improves when built using the swift build -c release command, previously I was just running it from xcode. But the difference is tiny (less than half a second between them).

I also tried the 0.3.0 API, but it didn't make any noticeable improvements

Any pointers to how to speed that up in my use case?

@tayloraswift
Copy link
Owner

tayloraswift commented Jul 29, 2022

it sounds like you were already compiling in release mode then, since going from debug to release mode should cause a way bigger difference than half a second.

based on what you described earlier, the performance problem is most likely not in the decoding stage, so switching to the 0.3.0 API would not have made a big difference, since that wasn’t the bottleneck.

based on what you have tried and what you’ve described, this is probably a bug in swift-json. i will try to reproduce it and investigate this further with the example project you linked. thanks for the report!

@tayloraswift tayloraswift self-assigned this Jul 29, 2022
@tayloraswift
Copy link
Owner

tayloraswift commented Jul 29, 2022

hi @rj93 , i was not able to reproduce your results on linux, in fact swift-json was about 28% faster than the Foundation implementation, even when using the default standard library Decodable hooks:

$ .build/release/json-decode aws-ipamd.jsonl 
fileUrl: aws-ipamd.jsonl -- file:///home/klossy/dev/json-decode-test/
Starting foundation read
foundation time elapsed: 12.463001012802124 s.
Starting swift-json read
swiftJson time elapsed: 8.935490012168884 s.
$ swift --version
Swift version 5.8-dev (LLVM e41a0a126d143ae, Swift 23e5143d22cb768)
Target: x86_64-unknown-linux-gnu
$ swiftenv version
DEVELOPMENT-SNAPSHOT-2022-07-25-a

the only change i made to the example program you provided was adding a import CoreFoundation to the top, to make it compile on linux. though it is extremely unlikely this has any effect on performance.

i will investigate adding your test data to a CI workflow, to see if there are any potential platform differences that could be causing the performance discrepancy.

@tayloraswift
Copy link
Owner

tayloraswift commented Jul 29, 2022

update: i have added your test case to the project’s CI benchmarks here:

https://github.com/kelvin13/swift-json/runs/7586938970?check_suite_focus=true#step:7:64

i was also able to significantly optimize your Foundation JSONDecoder usage, which brought the Foundation decoding time down from around 11s to around 6.7s.

swift-json decoding time (fast decoding api): 3.649686595 seconds
swift-json decoding time (compatibility api): 4.506052616 seconds
foundation decoding time:                     6.677785956 seconds

i will keep this issue open for now, as it is possible this only affects your particular macOS/Xcode configuration, and we do not yet have CI workflows set up to run benchmarks on macOS.

@rj93
Copy link
Author

rj93 commented Jul 31, 2022

@Kelvin13 I checked out project, updated a few things to make it run on macOS (removed the timing functions and used the CFAbsoluteTimeGetCurrent as thats only available in the current macOS beta, along with adding in some @available(macOS 11.0, *) annotations in the aws benchmark and your swift-system-extras library) and I got the following results:

swift run -c release benchmark-aws-ipam ./data/aws-ipamd.jsonl
Building for production...
Build complete! (0.07s)
swift-json: decoded 415901 logs
swift-json: decoded 415901 logs
foundation: decoded 415901 logs
swift-json decoding time (fast decoding api): 2.2130579948425293
swift-json decoding time (compatibility api): 2.8436840772628784
foundation decoding time:                     18.538437962532043

So this can probably be closed as it is pretty evident that swift-json is significantly quicker.

Thanks for taking the time to look into it.

@rj93
Copy link
Author

rj93 commented Jul 31, 2022

I tried implementing exactly what was in the benchmark code and it's still slower than the foundation implementation, so it must be something that I'm doing which is very wrong from with within my app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants