
Read in batches #94

Closed
opustecnica opened this issue Nov 15, 2021 · 5 comments

@opustecnica

Advance apologies if this is documented somewhere and I simply failed to locate the hint.

I am in the process of writing a PowerShell module leveraging your fast libraries. I am looking for a way to Read/ReadAsync in batches in order to process large (> 1 GB) CSV files without overtaxing the system. I have seen this (https://stackoverflow.com/questions/66570865/memorystream-data-getting-truncated), and the "RangeDataReader" approach would work. Is this still the advised approach? Which library contains the derived class?

Many thanks in advance

@MarkPflug
Owner

What is your PowerShell module intending to do? You shouldn't have to do any special batching at all. The CsvDataReader allocates a single working buffer internally (which is configurable in size), and that is all the memory it will ever use, regardless of how big the input file is. Conceptually, you can think of it as operating with a batch size of 1: each call to Read processes a single record.
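
A minimal sketch of that read loop using Sylvan's CsvDataReader (the BufferSize option name is an assumption and should be verified against the library version in use):

```csharp
using Sylvan.Data.Csv;

// Stream a large CSV one record at a time; only the internal working
// buffer is held in memory, regardless of the size of the input file.
var options = new CsvDataReaderOptions
{
    BufferSize = 0x10000 // assumed option name for the working buffer size
};

using var csv = CsvDataReader.Create("data.csv", options);
while (csv.Read())
{
    // Each call to Read() advances to the next record.
    var firstField = csv.GetString(0);
    // process the record here
}
```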

@opustecnica
Author

Perfect!
A little bit of history is due. I had been using the PowerShell built-in Import-Csv for all of my CSV needs: slow, but part of the default install. Suddenly, a few multi-gigabyte CSV sources appeared and … the default rapidly became too slow and too memory hungry. After a bit of googling, I ran into https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers and … the performance of your library made it an easy choice.
Mostly, the module will concentrate on importing data into DataTables that will then be inserted into a database, and likewise for export operations. The batching in these cases will happen at the DataTable level to avoid individual commits.
I have the core of the module already working and it is now time for cleanup and publishing. Will keep you posted. Thank you again.

@MarkPflug
Owner

If you are trying to load data into SQL Server (or any other relational database), I would suggest you avoid using an intermediate DataTable and just feed the CsvDataReader directly to SqlBulkCopy (other providers have similar capabilities). Here is an example:

public void SqlBulkLoadSample()

SqlBulkCopy.WriteToServer accepts a DbDataReader, so there's no need to load into a DataTable first. The key here is that the DbDataReader needs to provide a schema that conforms to the target SQL table. The linked example shows how to apply the schema of the SQL table to the CsvDataReader.
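
Roughly, that pattern looks like the sketch below. The connection string, table name, and file path are placeholders, and the CsvSchema/Schema option names are taken from Sylvan's schema support, so verify them against the version in use:

```csharp
using Microsoft.Data.SqlClient;
using Sylvan.Data.Csv;

// Sketch: stream a CSV directly into SQL Server via SqlBulkCopy, applying the
// destination table's schema to the CsvDataReader so the column types line up.
static void BulkLoadCsv(string connectionString)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Query zero rows just to obtain the column schema of the destination table.
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "select top 0 * from MyTable"; // placeholder table name
    using var schemaReader = cmd.ExecuteReader();
    var tableSchema = schemaReader.GetColumnSchema();
    schemaReader.Close();

    var options = new CsvDataReaderOptions
    {
        // CsvSchema/Schema are assumed from the library's schema support.
        Schema = new CsvSchema(tableSchema)
    };
    using var csv = CsvDataReader.Create("data.csv", options);

    using var bulkCopy = new SqlBulkCopy(conn);
    bulkCopy.DestinationTableName = "MyTable";
    bulkCopy.WriteToServer(csv); // takes any DbDataReader; no DataTable needed
}
```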

@opustecnica
Author

Copy that. SQL is definitely a target together with Postgres.
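
A rough sketch of the Postgres counterpart using Npgsql's binary COPY (table name, columns, and types below are placeholders; the importer is fed row by row rather than taking a DbDataReader):

```csharp
using Npgsql;
using Sylvan.Data.Csv;

// Sketch: stream a CSV into Postgres with Npgsql's binary COPY,
// pumping rows from the CsvDataReader into the importer one at a time.
static void BulkLoadCsvToPostgres(string connectionString)
{
    using var conn = new NpgsqlConnection(connectionString);
    conn.Open();

    using var csv = CsvDataReader.Create("data.csv");
    using var importer = conn.BeginBinaryImport(
        "COPY my_table (col_a, col_b) FROM STDIN (FORMAT BINARY)");

    while (csv.Read())
    {
        importer.StartRow();
        importer.Write(csv.GetString(0)); // col_a
        importer.Write(csv.GetInt32(1));  // col_b, assuming an integer column
    }
    importer.Complete();
}
```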

@MarkPflug
Owner

@opustecnica shoot me a notification when your PowerShell module is released. I'd be interested in reviewing it.
