- Author: Benjamin Du
- Date: 2023-01-01 18:28:47
- Modified: 2023-01-02 09:41:30
- Title: Read CSV Files Using Polars in Rust
- Slug: read-csv-files-using-polars-in-rust
- Category: Computer Science
- Tags: Computer Science, programming, Rust, Polars, CSV, CsvReader, LazyCsvReader, DataFrame, IO

**Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!**

## Tips and Traps

1. LazyCsvReader is more limited than CsvReader.
    CsvReader supports specifying a schema,
    while LazyCsvReader does not.

2. CsvReader does not support parsing columns into UInt8, Int8, UInt16, or Int16 at this time
    (even though the Python API `polars.read_csv` supports those types).
    Please refer to 
    [this issue](https://github.com/pola-rs/polars/issues/5214)
    for more discussion.

3. An empty field is parsed as `null` instead of an empty string by default,
    and there is no way to change this behavior at this time.
    Please refer to 
    [this issue](https://github.com/pola-rs/polars/issues/5984)
    for more discussion.
    Non-empty values (e.g., `NA`) are NOT parsed as `null` by default.
    However,
    parsing specific values as `null` is supported via the API `CsvReader::with_null_values`.
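
The null rules in item 3 can be modeled with a small stand-alone sketch. It does not use Polars and is not how Polars implements CSV parsing; it only mirrors the behavior described above. The names `parse_field`, `parse_line`, and `null_markers` are hypothetical, and splitting on commas is a simplification that ignores quoting.

```rust
/// Model of the null rules described above (no Polars involved).
/// `null_markers` plays the role of the values one would pass to
/// `CsvReader::with_null_values`; the name is made up for this sketch.
fn parse_field(raw: &str, null_markers: &[&str]) -> Option<String> {
    // An empty field always becomes null.
    if raw.is_empty() {
        return None;
    }
    // A non-empty field becomes null only if it is a configured marker.
    if null_markers.iter().any(|m| *m == raw) {
        return None;
    }
    // Everything else is kept as a string.
    Some(raw.to_string())
}

/// Split one line on commas (simplification: no quoting or escaping)
/// and apply `parse_field` to every field.
fn parse_line(line: &str, null_markers: &[&str]) -> Vec<Option<String>> {
    line.split(',')
        .map(|field| parse_field(field, null_markers))
        .collect()
}
```

With no markers configured, `parse_field("NA", &[])` keeps `NA` as a string while `parse_field("", &[])` yields `None`. Passing `&["NA"]` as the markers turns `NA` into `None` as well.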

In [2]:
:timing
:sccache 1
:dep polars = { version = "0.26.1", features = ["lazy", "parquet"] }

Timing: true
sccache: true


In [3]:
use polars::df;
use polars::prelude::*;
use polars::datatypes::DataType;
use std::fs::File;
use std::io::BufWriter;
use std::io::Write;

## CsvReader and DataFrame

In [4]:
let mut s = Schema::new();
s.with_column("column_1".into(), DataType::UInt32);
s.with_column("column_2".into(), DataType::UInt32);
s.with_column("column_3".into(), DataType::UInt32);
s.with_column("column_4".into(), DataType::UInt32);
s.with_column("column_5".into(), DataType::Utf8);
s

Schema:
name: column_1, data type: UInt32
name: column_2, data type: UInt32
name: column_3, data type: UInt32
name: column_4, data type: UInt32
name: column_5, data type: Utf8


In [13]:
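// Read a headerless CSV, overriding the inferred dtypes with the schema `s`.
let df = CsvReader::from_path("rank53_j0_j0.csv")?
            .has_header(false)
            .with_dtypes(Some(&s))
            // No extra null markers: only empty fields are parsed as null.
            .with_null_values(None)
            .finish()?;
df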

shape: (10, 5)
┌──────────┬──────────┬──────────┬──────────┬────────────────┐
│ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5       │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---            │
│ u32      ┆ u32      ┆ u32      ┆ u32      ┆ str            │
╞══════════╪══════════╪══════════╪══════════╪════════════════╡
│ 0        ┆ 1        ┆ 2        ┆ 0        ┆ 56229711839232 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 1        ┆ 57324928499712 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 2        ┆ 37744977903616 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 3        ┆ NA             │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...      ┆ ...      ┆ ...            │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 6    

In [14]:
df.filter(
    &df.column("column_5")?.equal("")?
)?

shape: (0, 5)
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5 │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ u32      ┆ u32      ┆ u32      ┆ str      │
╞══════════╪══════════╪══════════╪══════════╪══════════╡
└──────────┴──────────┴──────────┴──────────┴──────────┘

In [16]:
df.filter(
    &df.column("column_5")?.equal("NA")?
)?

shape: (1, 5)
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5 │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ u32      ┆ u32      ┆ u32      ┆ str      │
╞══════════╪══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 1        ┆ 2        ┆ 3        ┆ NA       │
└──────────┴──────────┴──────────┴──────────┴──────────┘

In [15]:
df.filter(
    &df.column("column_5")?.is_null()
)?

shape: (1, 5)
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5 │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ u32      ┆ u32      ┆ u32      ┆ str      │
╞══════════╪══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 1        ┆ 2        ┆ 7        ┆ null     │
└──────────┴──────────┴──────────┴──────────┴──────────┘

## LazyCsvReader and LazyFrame

In [20]:
let df: LazyFrame = LazyCsvReader::new("rank53_j0_j0.csv")
            .has_header(false)
            .with_null_values(None)
            .finish()?;
df.collect()?

shape: (10, 5)
┌──────────┬──────────┬──────────┬──────────┬────────────────┐
│ column_1 ┆ column_2 ┆ column_3 ┆ column_4 ┆ column_5       │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---            │
│ i64      ┆ i64      ┆ i64      ┆ i64      ┆ str            │
╞══════════╪══════════╪══════════╪══════════╪════════════════╡
│ 0        ┆ 1        ┆ 2        ┆ 0        ┆ 56229711839232 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 1        ┆ 57324928499712 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 2        ┆ 37744977903616 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 3        ┆ NA             │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...      ┆ ...      ┆ ...            │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ 1        ┆ 2        ┆ 6    