
add RecordParser #30

Merged · 4 commits · Jul 10, 2021
Conversation

@leandromoh (Contributor) commented Jun 29, 2021

Hi! I saw your post and I would like to add my parser to the benchmark once this PR is finished.

@leandromoh changed the title from "add implm" to "add RecordParser" on Jun 29, 2021
@leandromoh marked this pull request as draft on June 29, 2021
@joelverhagen (Owner) commented
Hey @leandromoh, thanks for taking the time to open a PR. It looks like the CI is complaining about some missing types. Perhaps we're missing the package reference in the .csproj to https://www.nuget.org/packages/recordparser?

@leandromoh (Contributor, Author) commented Jul 5, 2021

> Perhaps we're missing the package reference in the .csproj

Yes, I was working with a local reference while the PR was in draft, to do some experiments.

It is a bit funny that your benchmark tests exactly the opposite scenario from the one my library was designed to solve: getting all column values as strings before parsing them into specific types (e.g. int, DateTime, etc.). I designed RecordParser precisely to avoid unnecessary intermediate string allocations: if a field is a value type, such as int, I parse it without an intermediate string, going straight from ReadOnlySpan&lt;byte&gt; (from the file) to ReadOnlySpan&lt;char&gt; and finally to int. This avoids unnecessary GC pressure and memory allocations.
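For illustration, here is a minimal sketch of that idea (my own example, not RecordParser's actual code), assuming the field bytes are UTF-8 and small enough to decode on the stack:

using System;
using System.Globalization;
using System.Text;

static class SpanParsingSketch
{
    // Hypothetical helper: raw UTF-8 bytes -> chars on the stack -> int,
    // with no intermediate string allocated on the heap.
    public static int ParseIntField(ReadOnlySpan<byte> utf8Field)
    {
        Span<char> chars = stackalloc char[utf8Field.Length];  // assumes a small field
        int count = Encoding.UTF8.GetChars(utf8Field, chars);  // byte -> char, no string
        return int.Parse(chars[..count], NumberStyles.Integer, CultureInfo.InvariantCulture);
    }
}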

Fortunately, the library performed well even in this unanticipated scenario, at least on my PC.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.18363.1440 (1909/November2019Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.104
  [Host]     : .NET 5.0.4 (5.0.421.11614), X64 RyuJIT
  DefaultJob : .NET 5.0.4 (5.0.421.11614), X64 RyuJIT

| Method          | LineCount | Mean    | Error    | StdDev   | Median  | Gen 0      | Gen 1      | Gen 2     | Allocated |
|-----------------|-----------|---------|----------|----------|---------|------------|------------|-----------|-----------|
| RecordParser    | 1000000   | 2.978 s | 0.0591 s | 0.1450 s | 2.936 s | 58000.0000 | 21000.0000 | 3000.0000 | 345 MB    |
| CsvHelper       | 1000000   | 3.546 s | 0.4463 s | 1.2293 s | 2.960 s | 44000.0000 | 17000.0000 | 3000.0000 | 261 MB    |
| Cursively       | 1000000   | 2.214 s | 0.0406 s | 0.0733 s | 2.191 s | 58000.0000 | 21000.0000 | 3000.0000 | 345 MB    |
| Sylvan_Data_Csv | 1000000   | 2.911 s | 0.4912 s | 1.4252 s | 2.088 s | 44000.0000 | 17000.0000 | 3000.0000 | 261 MB    |

@leandromoh marked this pull request as ready for review on July 5, 2021
@leandromoh (Contributor, Author) commented
@joelverhagen PR is ready for review

@MarkPflug (Contributor) commented on the following snippet:

    {
        var key = string.GetHashCode(span);

        return cache.TryGetValue(key, out var text)

Using just the hashcode of the string as the key into the dictionary will occasionally result in two different strings (which happen to have the same hashcode) being considered equal. This might never occur with the test data set, since the number of unique strings is relatively small, but this code should be corrected. The default string hashing algorithm in .NET includes per-process randomization, so even with the limited test data set it will likely fail eventually.

You might consider using the Ben.StringIntern NuGet package for the string caching part; it is very high performance and (I assume) correct. I have an implementation of my own, but I'd probably swap to Ben's now; I would have used it originally, but it didn't exist at the time.
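For illustration, a minimal correct variant of a hashcode-keyed cache (hypothetical names, not necessarily what the PR ended up using): the hash only selects a bucket, and the actual characters are compared before a cached string is reused, so colliding hashes stay correct.

using System;
using System.Collections.Generic;

// Hypothetical pool: the hashcode chooses a bucket, equality is verified per candidate.
class StringPool
{
    private readonly Dictionary<int, List<string>> _buckets = new();

    public string GetOrAdd(ReadOnlySpan<char> span)
    {
        int hash = string.GetHashCode(span);

        if (_buckets.TryGetValue(hash, out var candidates))
        {
            foreach (var candidate in candidates)
                if (span.SequenceEqual(candidate))  // real equality check, not just the hash
                    return candidate;
        }
        else
        {
            candidates = new List<string>();
            _buckets[hash] = candidates;
        }

        var text = span.ToString();  // allocate only on a true miss
        candidates.Add(text);
        return text;
    }
}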

@leandromoh (Contributor, Author) replied:

@MarkPflug thanks for pointing that out. I didn't know about the possibility of hash collisions, but indeed it was mentioned in the method documentation. Thanks for suggesting Ben's lib too. Fixed in #b2cfd4f.

@leandromoh (Contributor, Author) commented:

By the way, I saw that your lib Sylvan_Data_Csv and CsvHelper allocate 261 MB while RecordParser and Cursively allocate 345 MB. I would really like to know how you allocate less than that, since each parsed object allocates nothing on the heap in the RecordParser project. Do you have any guess?

@MarkPflug (Contributor) replied:

> how do you allocate less than that

Interesting question. I don't know the answer. Let's take a look.

The first thing to observe is that there is a baseline allocation required by the benchmark: it requires us to construct a List<PackageAsset> containing 1 million instances. I've created a benchmark that measures that minimal allocation:

[Benchmark]
public void AllocationBaseline()
{
    // The minimum the benchmark requires: a List<PackageAsset> holding 1 million instances.
    var c = 1_000_000;
    var f = new List<PackageAsset>(c); // capacity given up front, so the list never grows
    for (int i = 0; i < c; i++)
    {
        f.Add(new PackageAsset());
    }
}

Running that benchmark reports an allocation baseline of ~257 MB. Is this the absolute minimum we could allocate? Practically, yes; technically, no. The reason that string pooling (deduping, caching, whatever you want to call it) is so successful in these benchmarks is the way that Joel constructs the test data. There is a 507 KB file containing only 1695 rows, and this file is duplicated until 1 million rows are provided. This means there are at most 1695 * 25 (fields per row) = 42,375 unique strings that could possibly be needed. Actually even fewer, because there is a lot of duplication even within those 1695 rows.

However, you might also observe that there are only 1695 unique rows; so could we get away with allocating only 1695 rows and just repeating those in the List? I think that goes too far and takes too much advantage of foreknowledge of the dataset. Honestly, I think Joel should forbid string caching too, as it is also a bit abusive.

So, with 257 MB as a baseline, it would seem that the fastest libraries are producing either (261 - 257) 4 MB or (345 - 257) 88 MB of additional allocations. I would estimate that a maximum of 1 MB is needed for the string allocations, since with pooling enabled the maximum size would be 507 KB (the size of the file) * 2 (decoding to UTF-16 strings). So, what accounts for the rest of the allocations? For the 261 MB class, the overhead is coming from List<T>: we don't specify a capacity when allocating the List, so it has to repeatedly grow to accommodate the 1 million items. Altering the above benchmark to not specify an initial capacity bumps the allocation to 260 MB. So the difference is about 1 MB, which is about what I would expect.
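For reference, the no-capacity variant described above would look like this (same hypothetical AllocationBaseline shape, just without the initial capacity):

[Benchmark]
public void AllocationBaselineNoCapacity()
{
    // Same as above, but the list starts empty and must repeatedly grow
    // (reallocating and copying its backing array) to hold the 1 million items.
    var c = 1_000_000;
    var f = new List<PackageAsset>();
    for (int i = 0; i < c; i++)
    {
        f.Add(new PackageAsset());
    }
}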

So, what about the 345 MB class? Both Cursively and RecordParser are hitting the same issue: a Func<int, string> is being allocated for every row in the data set.

Here is where it's happening in RecordParser:

public List<T> GetRecords<T>(MemoryStream stream) where T : ICsvReadable, new()
{
    var activate = ActivatorFactory.Create<T>(_activationMethod);
    var reader = BuildReader(activate);
    var result = ProcessStream<T>(stream, spanLine =>
    {
        var fields = reader.Parse(spanLine);
        var record = activate();

        record.Read(i => fields[i]); // <-- this is allocating a closure over fields

        return record;
    });

    return result;
}

The inner lambda is allocating a closure over fields for each call to Read (1mil). You can refactor this to avoid that inner closure:

public List<T> GetRecords<T>(MemoryStream stream) where T : ICsvReadable, new()
{
    var activate = ActivatorFactory.Create<T>(_activationMethod);
    var reader = BuildReader(activate);

    string[] fields = Array.Empty<string>(); // <-- move fields outside of the func.
    Func<int, string> f = i => fields[i];    // <-- closure over fields only allocated once.

    var result = ProcessStream(stream, spanLine =>
    {
        fields = reader.Parse(spanLine); // <-- one more closure allocated here to capture fields.
        var record = activate();

        record.Read(f); // <-- f accesses fields via its own closure over the same variable

        return record;
    });

    return result;
}

That takes RecordParser allocations down to 261 MB, which is pretty much the minimum that can be expected.

@leandromoh (Contributor, Author) replied:

@MarkPflug thank you for your analysis; I found it very detailed! I didn't realize that the closure was being created on every loop iteration, great catch! I also could never have imagined that a closure in a loop could allocate 84 MB.

Thank you very much for all the help!

@leandromoh (Contributor, Author) commented:

Just out of curiosity, I tried to allocate less than 261 MB and managed it with the following change. Basically, I use the first 1,000 records to calculate the average record size, then combine that with the length of the MemoryStream to estimate the total number of records. That way I can create the List<> with an initial capacity and avoid its growth. Allocations fall from 261 to 253 MB.

...
var reader = PipeReader.Create(stream);
var temp = new List<(long size, T item)>();

List<T> records = null;

while (true)
{
    ReadResult read = reader.ReadAsync().Result;
    ReadOnlySequence<byte> buffer = read.Buffer;
    while (TryReadLine(ref buffer, out ReadOnlySequence<byte> sequence))
    {
        var item = ProcessSequence(sequence, parser);

        if (records != null)
        {
            records.Add(item);
            continue;
        }

        // Buffer the first records together with their byte lengths until there
        // is enough of a sample to estimate the total record count.
        temp.Add((sequence.Length, item));

        if (temp.Count == 1_000)
        {
            var avgSize = (int)temp.Average(x => x.size);
            var estimateRecordCount = stream.Length / avgSize;
            records = new List<T>((int)estimateRecordCount); // pre-size the list to avoid repeated growth

            foreach (var t in temp)
                records.Add(t.item);

            temp = null;
        }
    }
...

@joelverhagen merged commit eb10884 into joelverhagen:main on Jul 10, 2021
@leandromoh deleted the record-parser branch on July 11, 2021