invalid parquet version error for parquet files generated via python script #144

Open
saritvakrat opened this issue Jan 9, 2024 · 3 comments

saritvakrat commented Jan 9, 2024

Hi, I am trying to read Parquet files that are stored in S3 and were generated by a Python script.
I get the following error:
Error: thrown: "invalid parquet version"
When I read a similar file that was generated by Spark, the library parses and reads it without problems.

I am also able to parse the Python-generated file and open it in a Parquet viewer.

Any idea why? The file uses Parquet format version 2.
File metadata:
file written by pyarrow 11.0.0
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975
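
The `format_version: 2.6` line above may be the relevant difference. parquetjs appears to accept only files whose footer `version` field is 1, while pyarrow writes 2 for 2.x format versions, which would also explain why Spark-written files load. This is an assumption, not something confirmed in this thread. A minimal sketch for checking the footer version of a file's bytes is shown here. `readFooterVersion` and its helpers are hypothetical names, and the sketch assumes an unencrypted file whose Thrift compact-encoded footer begins with the `version` field (field 1, type i32).

```typescript
// Parquet files start and end with the ASCII magic "PAR1".
const MAGIC = [0x50, 0x41, 0x52, 0x31];

function hasMagic(view: DataView, offset: number): boolean {
  return MAGIC.every((byte, i) => view.getUint8(offset + i) === byte);
}

// Decode a zigzag varint (the Thrift compact protocol encoding for i32).
function readZigzagVarint(view: DataView, offset: number): number {
  let raw = 0;
  let shift = 0;
  let pos = offset;
  for (;;) {
    const byte = view.getUint8(pos++); // throws RangeError if truncated
    raw |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return (raw >>> 1) ^ -(raw & 1);
}

// Hypothetical helper: returns the `version` value stored in the footer.
function readFooterVersion(file: Uint8Array): number {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const n = file.byteLength;
  if (n < 12 || !hasMagic(view, 0) || !hasMagic(view, n - 4)) {
    throw new Error("not an unencrypted Parquet file (PAR1 magic missing)");
  }
  // Layout of the tail: <footer bytes> <4-byte little-endian footer length> "PAR1"
  const footerLength = view.getUint32(n - 8, true);
  const footerStart = n - 8 - footerLength;
  if (footerStart < 4) throw new Error("footer length exceeds file size");
  // 0x15 = field id delta 1, compact type 5 (i32): assumed to be `version`.
  if (view.getUint8(footerStart) !== 0x15) {
    throw new Error("footer does not start with the version field");
  }
  return readZigzagVarint(view, footerStart + 1);
}
```

In Node, the bytes could come from `readFileSync(tempFilePath)`, since a `Buffer` is a `Uint8Array`. A result of 2 would be consistent with the explanation above; in that case, rewriting the file from pyarrow with `version='1.0'` passed to `pq.write_table` might be worth trying.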

Full error:

(node:41711) V8: /Users/saritvakrat/Documents/automation/be_automation/node_modules/brotli/build/encode.js:34 Linking failure in asm.js: Unexpected stdlib member
(Use `node --trace-warnings ...` to show where the warning was created)
console.error
Error parsing Parquet file: invalid parquet version

  39 |         return records;
  40 |     } catch (error) {
> 41 |         console.error('Error parsing Parquet file:', error);
     |                 ^
  42 |         throw error; // Rethrow the error to be handled by the caller
  43 |     }
  44 | }
  
Packages:
"parquetjs": "^0.11.2",
"@types/parquetjs": "^0.10.6",

My function:
```ts
import { ParquetReader } from 'parquetjs';

export async function parseParquetFile(filePath: string): Promise<any[]> {
    try {
        // Create a new ParquetReader for the file on disk
        const reader = await ParquetReader.openFile(filePath) as any;
        // Create a cursor over all rows
        const cursor = reader.getCursor();
        const records: any[] = [];
        // Read records until the cursor is exhausted (next() returns null)
        let record = await cursor.next();
        while (record !== null) {
            records.push(record);
            record = await cursor.next();
        }
        await reader.close();
        return records;
    } catch (error) {
        console.error('Error parsing Parquet file:', error);
        throw error; // Rethrow the error to be handled by the caller
    }
}
```

```ts
// Imports assumed from usage; they were not shown in the original post.
import { GetObjectCommand } from '@aws-sdk/client-s3';
import { createWriteStream } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

// Method of a class that holds an S3 client in this.s3Client
async parseSingleParquetFromS3(bucketName: string, key: string | null | undefined): Promise<any[]> {
    if (!bucketName || !key) {
        throw new Error('Bucket name or key is not provided');
    }

    const getObjectCommand = new GetObjectCommand({
        Bucket: bucketName,
        Key: key
    });

    let objectResponse;
    try {
        objectResponse = await this.s3Client.send(getObjectCommand);
    } catch (error) {
        console.error(`Error fetching object from S3: ${error}`);
        throw error;
    }

    const objectData = objectResponse.Body;
    if (!(objectData instanceof Readable)) {
        throw new Error('Object data is not a readable stream');
    }

    // Download to a temp file, then parse it with parseParquetFile
    const fileName = key.split('/').pop() || 'temp.parquet';
    const tempFilePath = join(tmpdir(), fileName);

    try {
        await pipeline(objectData, createWriteStream(tempFilePath));
        return await parseParquetFile(tempFilePath);
    } catch (error) {
        console.error(`Error in streaming data to file: ${error}`);
        throw error;
    }
}
```
@WestenMichael

I have the same issue.

@WestenMichael

As a workaround, I used https://www.npmjs.com/package/@dsnp/parquetjs instead.

@saritvakrat
Author

@WestenMichael I tried that package as well, but it fails with a different error: "invalid encoding: RLE_DICTIONARY"
