This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Parquet read skips a few rows at the end of the page #373

Closed
vincev opened this issue Sep 3, 2021 · 0 comments · Fixed by #374
Labels
bug Something isn't working

vincev commented Sep 3, 2021

When reading a parquet file whose string column contains many NAs, both the parquet_read and parquet_read_record examples skip a few rows.

To reproduce, I run the following script, which creates a data frame with an NA every 20 rows:

import sys
import pandas as pd

values = [f'{x:020}' if x % 20 != 0 else None
          for x in range(int(sys.argv[1]))]
df = pd.DataFrame({'values': values})
df.to_parquet('test.parquet', index=False, version='2.0')
print(f"{df.info()}")

Running the script with 100,000 rows:

> python gen.py 100000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   values  95000 non-null  object
dtypes: object(1)
memory usage: 781.4+ KB
None
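
For reference, the expected shape of the column follows from the list comprehension alone, without pandas. The following is a minimal sketch; the helper name `make_values` is hypothetical and introduced only for illustration:

```python
def make_values(n):
    """Rebuild the generator's column: every 20th row (starting at row 0)
    is None, every other row is a 20-character zero-padded string."""
    return [f"{x:020}" if x % 20 != 0 else None for x in range(n)]


values = make_values(100_000)
null_count = sum(v is None for v in values)
non_null_count = len(values) - null_count

# 100,000 rows in total: 5,000 nulls and 95,000 non-null strings.
print(len(values), null_count, non_null_count)
```

A correct reader should therefore report 100,000 rows, of which 95,000 are non-null, which matches the pandas summary above.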

In parquet_read I changed the code to print the number of rows:

    let array = read_column_chunk(file_path, row_group, column)?;
    println!("{}", array.len());

This prints an array length of 99992, which is 8 rows short of the expected 100,000:

> cargo run --release --features io_parquet,io_parquet_compression  --example parquet_read test.parquet 0 0
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/parquet_read test.parquet 0 0`
99992

I also changed the loop in parquet_read_record to count the non-null values as a double check:

    let reader = read::RecordReader::try_new(reader, None, None, Arc::new(|_, _| true), None)?;
    let mut num_values = 0;
    for maybe_batch in reader {
        let batch = maybe_batch?;
        let column = batch
            .column(0)
            .as_any()
            .downcast_ref::<Utf8Array<i32>>()
            .expect("Not a String array");
        for row in column {
            // Count only the non-null entries.
            if row.is_some() {
                num_values += 1;
            }
        }
    }

    println!("Num values {}", num_values);

When I run it, 7 values are missing (there should be 95,000 non-null values):

> cargo run --release --features io_parquet,io_parquet_compression  --example parquet_read_record test.parquet
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/parquet_read_record test.parquet`
Num values 94993
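
The two counts can be cross-checked with a little arithmetic. This is only a sketch using the figures reported above, and it assumes that both examples drop the same rows, which the outputs do not prove:

```python
# Figures taken from the outputs above.
expected_rows = 100_000
expected_non_null = 95_000
observed_rows = 99_992      # array length printed by parquet_read
observed_non_null = 94_993  # count printed by parquet_read_record

missing_rows = expected_rows - observed_rows
missing_non_null = expected_non_null - observed_non_null

# Assumption: both examples lose the same rows, so the remainder are nulls.
missing_nulls = missing_rows - missing_non_null

print(missing_rows, missing_non_null, missing_nulls)  # prints "8 7 1"
```

Under that assumption, 8 rows are lost, 7 of them non-null and 1 null. Since any 8 consecutive rows contain at most one multiple of 20, the numbers are consistent with a single contiguous run of rows being dropped, though they do not rule out other patterns.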

The problem does not occur if the generator outputs a column of, say, Int64.

I am running the latest master:

> git rev-parse HEAD
dbb7b8a69a990a1f37c81b2d8dfeadaf3fba48a8
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Sep 3, 2021
@jorgecarleitao jorgecarleitao changed the title Parquet read skips a few rows for a column with many NAs Parquet read skips a few rows at the end Sep 7, 2021
@jorgecarleitao jorgecarleitao changed the title Parquet read skips a few rows at the end Parquet read skips a few rows at the end of the page Sep 7, 2021