
# CSV reader

When compiled with the feature `io_csv`, you can use this crate to read CSV files. This crate makes minimal assumptions about how you want to read a CSV and offers a large degree of customization, along with useful defaults.

## Background

There are two CPU-intensive tasks in reading a CSV file:

* split the CSV file into rows, which includes parsing quotes and delimiters, and is necessary to seek to a given row.
* parse a set of CSV rows (bytes) into a `RecordBatch`.

Parsing bytes into values is more expensive than interpreting lines. As such, it is generally advantageous to have multiple readers of a single file that scan different parts of the file (within IO constraints).

This crate relies on the `csv` crate to scan and seek CSV files, and your code also needs that dependency. That said, arrow2 makes no assumptions as to how to efficiently read the CSV: as a single reader per file or as multiple readers.
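
For reference, a minimal sketch of the `Cargo.toml` entries this implies (the version numbers are assumptions; pick the releases you target):

```toml
[dependencies]
# enables arrow2's CSV reader; the version shown is an assumption
arrow2 = { version = "0.17", features = ["io_csv"] }
# used to scan and seek CSV files
csv = "1"
```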

As an example, the following infers the schema and reads a CSV by re-using the same reader:

```rust
{{#include ../../../examples/csv_read.rs}}
```
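
In outline, the example does the following (a sketch assuming a recent arrow2 release, where deserialization yields a `Chunk`; older versions returned a `RecordBatch`, and exact signatures vary between versions):

```rust
use arrow2::array::Array;
use arrow2::chunk::Chunk;
use arrow2::error::Result;
use arrow2::io::csv::read;

fn read_first_100_rows(path: &str) -> Result<Chunk<Box<dyn Array>>> {
    // create a `csv` reader (re-exported by arrow2) over the file
    let mut reader = read::ReaderBuilder::new().from_path(path)?;

    // infer the fields with the default inferer, which scans a sample of rows
    let (fields, _) = read::infer_schema(&mut reader, None, true, &read::infer)?;

    // pre-allocate row buffers; the length bounds how many rows are read
    let mut rows = vec![read::ByteRecord::default(); 100];

    // IO-intensive: split the file into rows; no deserialization happens here
    let rows_read = read::read_rows(&mut reader, 0, &mut rows)?;

    // CPU-intensive: deserialize the rows (bytes) into arrays
    read::deserialize_batch(&rows[..rows_read], &fields, None, 0, read::deserialize_column)
}
```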

## Orchestration and parallelization

Because `csv`'s API is synchronous, the functions above represent the minimal unit of synchronous work (IO and CPU). Note that the rows above are `Send`, which implies that it is possible to run the parsing on a separate thread, thereby maximizing IO throughput. The example below shows how to do just that:

```rust
{{#include ../../../examples/csv_read_parallel.rs}}
```
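
The idea, in a minimal sketch using `std::sync::mpsc` (the bundled example may use a different channel; the file path and batch size below are assumptions):

```rust
use std::sync::mpsc;
use std::thread;

use arrow2::error::Result;
use arrow2::io::csv::read;

fn main() -> Result<()> {
    let (tx, rx) = mpsc::channel();

    // IO-bound thread: split the file into `ByteRecord`s, never deserializing.
    let reader_thread = thread::spawn(move || -> Result<()> {
        let mut reader = read::ReaderBuilder::new().from_path("data.csv")?;
        let (fields, _) = read::infer_schema(&mut reader, None, true, &read::infer)?;
        loop {
            let mut rows = vec![read::ByteRecord::default(); 100];
            let rows_read = read::read_rows(&mut reader, 0, &mut rows)?;
            if rows_read == 0 {
                break;
            }
            rows.truncate(rows_read);
            // `ByteRecord` is `Send`, so batches can cross the thread boundary;
            // re-sending `fields` with every batch keeps the sketch simple.
            tx.send((fields.clone(), rows)).unwrap();
        }
        Ok(())
    });

    // CPU-bound work on this thread: deserialize each batch into arrays.
    for (fields, rows) in rx {
        let chunk =
            read::deserialize_batch(&rows, &fields, None, 0, read::deserialize_column)?;
        println!("deserialized a batch of {} rows", chunk.len());
    }

    reader_thread.join().unwrap()?;
    Ok(())
}
```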

## Async

This crate also supports reading a CSV asynchronously through the `csv-async` crate. The example below demonstrates this:

```rust
{{#include ../../../examples/csv_read_async.rs}}
```
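
A sketch of the shape of that example, assuming a tokio runtime and the `tokio-util` compatibility layer (exact names vary across arrow2 versions; the file path and batch size are assumptions):

```rust
use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

use arrow2::error::Result;
use arrow2::io::csv::read_async::*;

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<()> {
    // adapt tokio's file to the futures-based IO traits used by csv-async
    let file = File::open("data.csv").await?.compat();

    let mut reader = AsyncReaderBuilder::new().create_reader(file);

    // infer the schema and read the rows, awaiting the IO
    let (fields, _) = infer_schema(&mut reader, None, true, &infer).await?;

    let mut rows = vec![ByteRecord::default(); 100];
    let rows_read = read_rows(&mut reader, 0, &mut rows).await?;

    // deserialization itself is synchronous CPU work
    let chunk = deserialize_batch(&rows[..rows_read], &fields, None, 0, deserialize_column)?;
    println!("read {} rows", chunk.len());
    Ok(())
}
```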

Note that the deserialization should be performed on a separate thread so that it does not block the executor, which this example does not show.

## Customization

In the code above, `parser` and `infer` allow for customization: they declare how rows of bytes should be inferred (into a logical type) and processed (into a value of said type). They offer good defaults, but you can customize the inference and parsing to your own needs. You can also, of course, decide to parse everything into memory as `Utf8Array` and delay any data transformation, as the sketch below shows.
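
For instance, a custom inferer can be passed to `infer_schema` in place of the default (the function `infer_all_utf8` and the batch size here are made up for illustration; assuming a recent arrow2 release):

```rust
use arrow2::datatypes::DataType;
use arrow2::error::Result;
use arrow2::io::csv::read;

// hypothetical custom inferer: declare every column as Utf8,
// deferring any typed conversion to a later stage of the pipeline
fn infer_all_utf8(_bytes: &[u8]) -> DataType {
    DataType::Utf8
}

fn read_as_strings(path: &str) -> Result<()> {
    let mut reader = read::ReaderBuilder::new().from_path(path)?;

    // pass the custom inferer in place of the default `read::infer`
    let (fields, _) = read::infer_schema(&mut reader, None, true, &infer_all_utf8)?;

    let mut rows = vec![read::ByteRecord::default(); 100];
    let rows_read = read::read_rows(&mut reader, 0, &mut rows)?;

    // every column now deserializes into a `Utf8Array`
    let chunk =
        read::deserialize_batch(&rows[..rows_read], &fields, None, 0, read::deserialize_column)?;
    println!("{} columns, {} rows", chunk.arrays().len(), chunk.len());
    Ok(())
}
```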