Cannot read properties of undefined (reading 'fromIPCStream') #471

Closed
MaTiAtSIE opened this issue Mar 6, 2024 · 8 comments

@MaTiAtSIE

Setup:
TypeScript == 4.9.5
node == 20.0.0
theia == 1.45.0
Webpack == 5.90.3
parquet-wasm == 0.5.0
apache-arrow == 15.0.0

I tried the example code:

const ipcStream = arrow.tableToIPC(rainfall, 'stream');
// the following line crashes
const wasmTable = parquetWASM.Table.fromIPCStream(ipcStream);

and get the error message Cannot read properties of undefined (reading 'fromIPCStream'). Inspecting ipcStream at runtime shows that it is a Uint8Array with 24400 entries.

My imports:

import * as arrow from 'apache-arrow';
import * as parquetWASM from 'parquet-wasm';

Do you have any idea of what is going wrong?

@kylebarron
Owner

kylebarron commented Mar 6, 2024

The API changed after 0.5.0. In 0.5 the Table object doesn't exist. You can look at the 0.5.0 README for an example https://github.com/kylebarron/parquet-wasm/tree/v0.5.0?tab=readme-ov-file#example
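
At runtime, a namespace import yields undefined for any name the resolved build does not export, so the failure only surfaces at the first property access. A minimal sketch of a guard that reports the missing export explicitly, assuming a hypothetical helper `requireExport` (not part of parquet-wasm) and a stand-in module object in place of the real package:

```typescript
// Hypothetical helper (not part of parquet-wasm): look up a named export on a
// module namespace object and fail with a descriptive error if it is missing.
function requireExport<T>(
  mod: Record<string, unknown>,
  name: string,
  label: string
): T {
  const value = mod[name];
  if (value === undefined) {
    const available = Object.keys(mod).sort().join(", ") || "(none)";
    throw new Error(
      `${label} has no export "${name}". Available exports: ${available}`
    );
  }
  return value as T;
}

// Stand-in for a build that lacks `Table` (an assumption for illustration;
// the real export list depends on the installed version and entry point).
const fakeModule: Record<string, unknown> = {
  readParquet: () => undefined,
  writeParquet: () => undefined,
};

let message = "";
try {
  requireExport(fakeModule, "Table", "parquet-wasm");
} catch (err) {
  message = (err as Error).message;
}
console.log(message);
```

In real code the first argument would be the namespace object from `import * as parquetWASM from 'parquet-wasm'`; listing the available exports makes a version or entry-point mismatch visible immediately.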

@MaTiAtSIE
Author

Thanks for the hint (it is great that you respond to issues so quickly 👍). Now I have another problem with the writeParquet function. The following lines cause trouble:

const uintArr = tableToIPC(rainfall, 'stream');
const parquetBuffer = writeParquet(
  uintArr, // this should be a table
  writerProperties
);

as writeParquet is expecting a Table:

Argument of type 'Uint8Array' is not assignable to parameter of type 'Table'.
  Type 'Uint8Array' is missing the following properties from type 'Table': free, recordBatch, toFFI, intoFFI, and 3 more.

So I tried

let writerProperties = new WriterPropertiesBuilder();
writerProperties = writerProperties.setCompression(Compression.ZSTD);
const props = writerProperties.build();
const uintArr = tableToIPC(rainfall, 'stream');
const arrTable = Table.fromIPCStream(uintArr);
const parquetBuffer = writeParquet(
    arrTable,
    props
);

and imported from 'parquet-wasm/node/arrow1', which compiles. However, this produces an empty schema. So the question is: how do I call writeParquet with the return value of tableToIPC(rainfall, 'stream')?

BTW: I changed the apache-arrow version to 13.0.0, as this version is also used in parquet-wasm 0.5.0.

@kylebarron
Owner

The best suggestion is to use the TypeScript types to guide you.

This is working for me with 0.5.0:

import { tableFromArrays, tableToIPC } from "apache-arrow";
import * as parquet from "parquet-wasm/node/arrow1";
import { writeFileSync } from "fs";

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet
const writerProperties = new parquet.WriterPropertiesBuilder()
  .setCompression(parquet.Compression.ZSTD)
  .build();
const arrowWasmTable = parquet.Table.fromIPCStream(
  tableToIPC(rainfall, "stream")
);
const parquetBuffer = parquet.writeParquet(arrowWasmTable, writerProperties);
writeFileSync("out.parquet", parquetBuffer);

I can verify that the file loads correctly in Python.

@MaTiAtSIE
Author

Hello Kyle, thanks for your support and time :). Indeed, your code works, and it turned out that the code I posted earlier works as well. However, the schema is empty when I inspect the table by setting a breakpoint after calling 'tableFromIPC(readParquet(parquetBuffer));'.
[Screenshot: empty_schema]

@kylebarron
Owner

The entire table is empty. readParquet does not return a Uint8Array; it returns a Table object, so you need to call a method to convert that table to IPC bytes first. It should be something like tableFromIPC(readParquet().intoIPCStream()). The types will guide you.

@MaTiAtSIE
Author

Perfect, 'tableFromIPC(readParquet(parquetBuffer).intoIPCStream())' worked.
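
The fix comes down to handing tableFromIPC bytes rather than the Table wrapper. A minimal sketch of that normalisation step, using a hypothetical helper `toIpcBytes` and a stub class standing in for the parquet-wasm Table (only the `intoIPCStream()` method name is taken from this thread; everything else is an assumption for illustration):

```typescript
// Structural type for anything exposing intoIPCStream(), as the Table object
// discussed in this thread does. The stub below is NOT the real class.
interface IpcConvertible {
  intoIPCStream(): Uint8Array;
}

// Hypothetical helper: normalise a value to Arrow IPC stream bytes, so the
// consumer always receives a Uint8Array and never the wrapper object itself.
function toIpcBytes(value: Uint8Array | IpcConvertible): Uint8Array {
  return value instanceof Uint8Array ? value : value.intoIPCStream();
}

// Stand-in for the object returned by readParquet (assumption for illustration).
class StubTable implements IpcConvertible {
  constructor(private readonly bytes: Uint8Array) {}

  intoIPCStream(): Uint8Array {
    return this.bytes;
  }
}

const raw = new Uint8Array([1, 2, 3]);
const fromTable = toIpcBytes(new StubTable(raw));
const fromBytes = toIpcBytes(raw);
console.log(fromTable.length, fromBytes.length);
```

With the real library, the result of such a helper would be what gets passed to tableFromIPC, which matches the call that worked above.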

@MaTiAtSIE
Author

MaTiAtSIE commented Mar 12, 2024

BTW: Do you have an example of how to use 'readParquetStream'?
Background: I have a huge Parquet file (~500 MB) and I only want to read, e.g., the first row or the schema (I don't know whether 'readParquetStream' is the right function for that).

If I do this with the stream:

readParquetStream('file:///C:/Users/marcel.tiator/Projekte/IDE/IDETest4/example.parquet').then((value) =>
{
    console.log('test');
});

I get the following runtime error:

2024-03-12T09:04:47.226Z root ERROR RuntimeError: unreachable
    at wasm://wasm/0132be12:wasm-function[2356]:0x3158b9
    at wasm://wasm/0132be12:wasm-function[4456]:0x393d5b
    at wasm://wasm/0132be12:wasm-function[3105]:0x35a45b
    at wasm://wasm/0132be12:wasm-function[90]:0xbde8c
    at wasm://wasm/0132be12:wasm-function[2045]:0x2e9c24
    at wasm://wasm/0132be12:wasm-function[4892]:0x39bbde
    at __wbg_adapter_28 (...)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

@kylebarron
Owner

I think the right abstraction is a class like ParquetFile from the pyarrow world in Python that only reads the metadata. We don't have something like that today, but it might come in the next release.
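
For context on why a metadata-only reader is feasible: a Parquet file stores its schema and row-group metadata in a footer, and the file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1". A rough sketch of locating the footer from the last 8 bytes, using a hypothetical helper `footerMetadataLength` that is not part of parquet-wasm (decoding the Thrift-encoded metadata itself is out of scope here):

```typescript
// Unencrypted Parquet files end with:
//   <file metadata> <4-byte little-endian metadata length> "PAR1"
const PARQUET_MAGIC = [0x50, 0x41, 0x52, 0x31]; // ASCII "PAR1"

// Hypothetical helper: given the last 8 bytes of a file, return the byte
// length of the footer metadata, or throw if the trailing magic is missing.
function footerMetadataLength(tail: Uint8Array): number {
  if (tail.length !== 8) {
    throw new Error(`expected exactly 8 trailing bytes, got ${tail.length}`);
  }
  for (let i = 0; i < 4; i++) {
    if (tail[4 + i] !== PARQUET_MAGIC[i]) {
      throw new Error("trailing magic is not PAR1");
    }
  }
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  return view.getUint32(0, true); // true = little-endian
}

// Synthetic tail for illustration: metadata length 1234 followed by "PAR1".
const exampleTail = new Uint8Array(8);
new DataView(exampleTail.buffer).setUint32(0, 1234, true);
exampleTail.set(PARQUET_MAGIC, 4);
console.log(footerMetadataLength(exampleTail));
```

In practice the 8 bytes would come from a positioned read at the end of the file, and the metadata would then occupy the returned number of bytes immediately before them, so only a small part of a ~500 MB file needs to be touched.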
