
Read in batches #94

Closed
opustecnica opened this issue Nov 15, 2021 · 5 comments

@opustecnica

Advance apologies if this is documented somewhere and I simply failed to locate the hint.

I am in the process of writing a PowerShell module leveraging your fast libraries. I am looking for a way to Read/ReadAsync in batches in order to process large (> 1 GB) CSV files without overtaxing the system. I have seen this (https://stackoverflow.com/questions/66570865/memorystream-data-getting-truncated), and the "RangeDataReader" approach would work. Is this still the advised approach? Which library contains the derived class?

Many thanks in advance

@MarkPflug
Owner

What is your PowerShell module intending to do? You shouldn't have to do any special batching at all. The CsvDataReader allocates a single working buffer internally (which is configurable in size), and that is all the memory it will ever use, regardless of how big the input file is. Conceptually, you can think of it as operating with a batch size of 1: each call to Read processes a single record.
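
A minimal sketch of that read loop using Sylvan's CsvDataReader (the BufferSize option name is an assumption and should be verified against the library version in use):

```csharp
using Sylvan.Data.Csv;

// Stream a large CSV one record at a time; only the internal working
// buffer is held in memory, regardless of the size of the input file.
var options = new CsvDataReaderOptions
{
    BufferSize = 0x10000 // assumed option name for the working buffer size
};

using var csv = CsvDataReader.Create("data.csv", options);
while (csv.Read())
{
    // Each call to Read() advances to the next record.
    var firstField = csv.GetString(0);
    // process the record here
}
```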

@opustecnica
Author

Perfect!
A little bit of history is due. I had been using the PowerShell built-in Import-Csv for all of my CSV needs: slow, but part of the default install. Suddenly, a few multi-gigabyte CSV sources appeared and … the default rapidly became too slow and too memory hungry. After a bit of googling, I ran into https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers and … the performance of your library made it an easy choice.
Mostly, the module will concentrate on importing data into DataTables that will then be inserted into a database, and likewise for export operations. The batching in these cases will happen at the DataTable level to avoid individual commits.
I have the core of the module already working and it is now time for cleanup and publishing. Will keep you posted. Thank you again.

@MarkPflug
Owner

If you are trying to load data into SQL Server (or any other relational database), I would suggest you avoid using an intermediate DataTable and just feed the CsvDataReader directly to SqlBulkCopy (other providers have similar capabilities). Here is an example:

public void SqlBulkLoadSample()

SqlBulkCopy.WriteToServer accepts a DbDataReader, so there's no need to load into a DataTable first. The key here is that the DbDataReader needs to provide a schema that conforms to the target SQL table. The linked example shows how to apply the schema of the SQL table to the CsvDataReader.
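
Roughly, that pattern looks like the sketch below. The connection string, table name, and file path are placeholders, and the CsvSchema/Schema option names are taken from Sylvan's schema support, so verify them against the version in use:

```csharp
using Microsoft.Data.SqlClient;
using Sylvan.Data.Csv;

// Sketch: stream a CSV directly into SQL Server via SqlBulkCopy, applying the
// destination table's schema to the CsvDataReader so the column types line up.
static void BulkLoadCsv(string connectionString)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Query zero rows just to obtain the column schema of the destination table.
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "select top 0 * from MyTable"; // placeholder table name
    using var schemaReader = cmd.ExecuteReader();
    var tableSchema = schemaReader.GetColumnSchema();
    schemaReader.Close();

    var options = new CsvDataReaderOptions
    {
        // CsvSchema/Schema are assumed from the library's schema support.
        Schema = new CsvSchema(tableSchema)
    };
    using var csv = CsvDataReader.Create("data.csv", options);

    using var bulkCopy = new SqlBulkCopy(conn);
    bulkCopy.DestinationTableName = "MyTable";
    bulkCopy.WriteToServer(csv); // takes any DbDataReader; no DataTable needed
}
```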

@opustecnica
Author

Copy that. SQL is definitely a target together with Postgres.
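
A rough sketch of the Postgres counterpart using Npgsql's binary COPY (table name, columns, and types below are placeholders; the importer is fed row by row rather than taking a DbDataReader):

```csharp
using Npgsql;
using Sylvan.Data.Csv;

// Sketch: stream a CSV into Postgres with Npgsql's binary COPY,
// pumping rows from the CsvDataReader into the importer one at a time.
static void BulkLoadCsvToPostgres(string connectionString)
{
    using var conn = new NpgsqlConnection(connectionString);
    conn.Open();

    using var csv = CsvDataReader.Create("data.csv");
    using var importer = conn.BeginBinaryImport(
        "COPY my_table (col_a, col_b) FROM STDIN (FORMAT BINARY)");

    while (csv.Read())
    {
        importer.StartRow();
        importer.Write(csv.GetString(0)); // col_a
        importer.Write(csv.GetInt32(1));  // col_b, assuming an integer column
    }
    importer.Complete();
}
```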

@MarkPflug
Owner

@opustecnica shoot me a notification when your PowerShell module is released. I'd be interested in reviewing it.
