Parquet read skips a few rows at the end of the page #373

vincev · 2021-09-03T15:45:50Z

When reading a Parquet file with a string column that contains many NAs, both the parquet_read and parquet_read_record examples skip a few rows.

To reproduce, I ran the following script, which creates a data frame with an NA every 20 rows:

import sys
import pandas as pd

values = [f'{x:020}' if x % 20 != 0 else None
          for x in range(int(sys.argv[1]))]
df = pd.DataFrame({'values': values})
df.to_parquet('test.parquet', index=False, version='2.0')
print(f"{df.info()}")

Running the script with 100,000 rows:

> python gen.py 100000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   values  95000 non-null  object
dtypes: object(1)
memory usage: 781.4+ KB
None
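
As a cross-check on the expected counts, here is a minimal sketch that mirrors the generator's logic using only the standard library (no pandas). The helper name `expected_counts` is hypothetical and not part of the original script.

```python
def expected_counts(n_rows, null_every=20):
    """Return (total, nulls, non_null) for the generator's rule.

    Row x is None when x % null_every == 0; otherwise it is a
    20-digit zero-padded string, exactly as in gen.py.
    """
    values = [f'{x:020}' if x % null_every != 0 else None
              for x in range(n_rows)]
    nulls = sum(v is None for v in values)
    return len(values), nulls, len(values) - nulls


total, nulls, non_null = expected_counts(100_000)
print(total, nulls, non_null)  # 100000 5000 95000
```

This agrees with the pandas summary: 100,000 entries, of which 95,000 are non-null.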

In parquet_read I changed the code to print the number of rows:

    let array = read_column_chunk(file_path, row_group, column)?;
    println!("{}", array.len());

This prints an array length of 99992 instead of the expected 100000:

> cargo run --release --features io_parquet,io_parquet_compression  --example parquet_read test.parquet 0 0
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/parquet_read test.parquet 0 0`
99992

I also changed the loop in parquet_read_record to double-check the number of non-NA values:

    let reader = read::RecordReader::try_new(reader, None, None, Arc::new(|_, _| true), None)?;
    let mut num_values = 0;
    for maybe_batch in reader {
        let batch = maybe_batch?;
        let column = batch
            .column(0)
            .as_any()
            .downcast_ref::<Utf8Array<i32>>()
            .expect("Not a String array");
        for row in column {
            if let Some(_) = row {
                num_values += 1;
            }
        }
    }

    println!("Num values {}", num_values);

When I run it, 7 values are missing (there should be 95,000 non-null values):

> cargo run --release --features io_parquet,io_parquet_compression  --example parquet_read_record test.parquet
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/parquet_read_record test.parquet`
Num values 94993
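
To put the two observations side by side, here is a small sketch that uses only the numbers reported above. The function name `shortfall` is illustrative, and treating the two runs as dropping the same rows is an assumption that the output does not confirm.

```python
def shortfall(expected, observed):
    """Number of items the reader failed to return."""
    return expected - observed


# Numbers taken from the pandas summary and the two cargo runs above.
missing_rows = shortfall(100_000, 99_992)     # parquet_read: array length
missing_non_null = shortfall(95_000, 94_993)  # parquet_read_record: non-null count

print(missing_rows, missing_non_null)  # 8 7
```

If both examples skip the same rows (again, an assumption), one of the 8 skipped rows would be an NA and the other 7 would be non-null strings.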

There are no problems if the generator outputs a column of, say, Int64.

I am running the latest master:

> git rev-parse HEAD
dbb7b8a69a990a1f37c81b2d8dfeadaf3fba48a8

jorgecarleitao added the bug Something isn't working label Sep 3, 2021

jorgecarleitao mentioned this issue Sep 3, 2021

Fixed reading multiple parquet pages. #374

Merged

jorgecarleitao closed this as completed in #374 Sep 3, 2021

jorgecarleitao changed the title ~~Parquet read skips a few rows for a column with many NAs~~ Parquet read skips a few rows at the end Sep 7, 2021

jorgecarleitao changed the title ~~Parquet read skips a few rows at the end~~ Parquet read skips a few rows at the end of the page Sep 7, 2021
