BigQuery: 'load_table_from_dataframe' raises OSError with STRUCT / RECORD columns. #9024

Closed
dabasmoti opened this issue Aug 13, 2019 · 12 comments · Fixed by #9053
Labels
api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue.

Comments

@dabasmoti

pyarrow 0.14.0
pandas 0.24.2
Windows 10
Hi,
I am trying to load a dataframe into BigQuery that looks like this:

uid_first	agg_col
1001	[{'page_type': 1}, {'record_type': 1}, {'non_consectutive_home': 0}]

The agg_col column is a list of dicts.
I also tried a single dict.

Schema config:

schema = [
    bigquery.SchemaField("uid_first", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("agg_col", "RECORD", mode="NULLABLE", fields=[
        bigquery.SchemaField("page_type", "INTEGER", mode="NULLABLE"),
        bigquery.SchemaField("record_type", "INTEGER", mode="NULLABLE"),
        bigquery.SchemaField("non_consectutive_home", "INTEGER", mode="NULLABLE"),
    ]),
]

Load command:

dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('table')
table = bigquery.Table(table_ref,schema=schema)
table = client.load_table_from_dataframe(dff, table).result()

The error message:

Traceback (most recent call last):

  File "<ipython-input-167-60a73e366976>", line 4, in <module>
    table = client.load_table_from_dataframe(dff, table).result()

  File "C:\ProgramData\Anaconda3\envs\p37\lib\site-packages\google\cloud\bigquery\client.py", line 1546, in load_table_from_dataframe
    os.remove(tmppath)

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpcxotr6mb_job_5c94b06f.parquet'
@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels Aug 13, 2019
@tseaver
Contributor

tseaver commented Aug 13, 2019

@dabasmoti Is there another exception being raised when that os.remove() statement (which occurs in a finally: clause) raises this exception? Can you show the full traceback?

@tseaver tseaver changed the title Push Pandas DataFrame to BigQuery with nested column BigQuery: 'load_table_from_dataframe' raises OSError. Aug 13, 2019
@dabasmoti
Author

dabasmoti commented Aug 13, 2019

@tseaver - No. Only the client.load_table_from_dataframe() call comes before it in the traceback.

@peter765

I'm getting the same issue on mac.

Traceback (most recent call last):
  File "loadjob.py", line 19, in <module>
    job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
  File "/usr/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1567, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/_v/wj4pm45x4txg4vv02kptkl7c0000gn/T/tmpvsbi2rrx_job_1cb60ca1.parquet'

@tswast
Contributor

tswast commented Aug 14, 2019

Do you have write permissions to those temp directories?

We originally started using tempfiles because fastparquet does not support in-memory file objects, but I wonder if there are systems in which tempfiles cannot be created?

@tswast
Contributor

tswast commented Aug 14, 2019

Note: pyarrow-0.14.0 had some bad pip wheels, so this may be related to that.

@dabasmoti
Author

@tswast - What version should I use?

@dabasmoti
Author

I should mention that the error occurs only when a dataframe column contains dict values.

@dabasmoti
Author

I am running as admin

@tswast
Contributor

tswast commented Aug 14, 2019

> What version should I use?

0.14.1 and 0.13.0 are good releases of pyarrow.

> I should mention that the error occurs only when a dataframe column contains dict values.

Thank you for mentioning this. STRUCT / RECORD columns are not yet supported by the pandas connector. https://github.com/googleapis/google-cloud-python/issues/8191 Neither are ARRAY / REPEATED columns, unfortunately. https://github.com/googleapis/google-cloud-python/issues/8544 Those issues are currently blocked on improvements to the Parquet file serialization logic.

@plamut Can you investigate this further? Hopefully pyarrow can provide an exception that we can catch when trying to write a table with unsupported data types to a parquet file. If no exception is thrown, perhaps we need to check for these and raise a ValueError?
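
The "check for these and raise a ValueError" option could look roughly like the sketch below. It is only an illustration, not the library's code: `find_nested_columns` and `check_supported` are hypothetical names, and the input is assumed to be a plain mapping of column name to cell values (with pandas, something like `dict(df.items())` should produce such a mapping).

```python
def find_nested_columns(columns):
    """Return the names of columns that hold dict, list, or tuple cells.

    `columns` is assumed to map a column name to an iterable of plain
    Python values; this is an illustration, not a BigQuery client API.
    """
    nested = []
    for name, values in columns.items():
        if any(isinstance(value, (dict, list, tuple)) for value in values):
            nested.append(name)
    return nested


def check_supported(columns):
    """Raise ValueError up front instead of failing during serialization."""
    nested = find_nested_columns(columns)
    if nested:
        raise ValueError(
            "Columns with nested values are not supported: {}".format(
                ", ".join(nested)
            )
        )


# Data shaped like the dataframe in the original report.
rows = {
    "uid_first": ["1001", "1001", "1001"],
    "agg_col": [
        {"page_type": 1},
        {"record_type": 1},
        {"non_consectutive_home": 0},
    ],
}
```

With this data, `find_nested_columns(rows)` returns `["agg_col"]`, so `check_supported(rows)` would raise before any temporary file is touched.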

@tswast tswast changed the title BigQuery: 'load_table_from_dataframe' raises OSError. BigQuery: 'load_table_from_dataframe' raises OSError with STRUCT / RECORD columns. Aug 14, 2019
@plamut
Contributor

plamut commented Aug 15, 2019

TL;DR: pyarrow does not yet support serializing nested fields to parquet (there is an active PR for it, though), so for the time being we can catch these exceptions and propagate them to users in an informative way. Alternatively, we could detect nested columns ourselves without relying on pyarrow's exceptions.


I was able to reproduce the reported behavior. Using the posted code and the following dataframe:

import pandas

# The scalar uid_first is broadcast to all three rows; each agg_col cell is a dict.
data = {
    "uid_first": "1001",
    "agg_col": [
        {"page_type": 1},
        {"record_type": 1},
        {"non_consectutive_home": 0},
    ]
}
df = pandas.DataFrame(data=data)

I got the following traceback in Python 3.6:

Traceback (most recent call last):
  File "/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/client.py", line 1552, in load_table_from_dataframe
    dataframe.to_parquet(tmppath, compression=parquet_compression)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pandas/io/parquet.py", line 122, in write
    coerce_timestamps=coerce_timestamps, **kwargs)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1271, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/peter/workspace/google-cloud-python/venv-3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 427, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../reproduce/reproduce_9024.py", line 41, in <module>
    load_job = client.load_table_from_dataframe(df, table)
  File "/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/client.py", line 1568, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr7gxstqv_job_2f382186.parquet'

Trying the same with Python 2.7, I only got the second part of the traceback, i.e. the OSError about a missing file - seems like @dabasmoti is using Python 2.7.

That was with pandas==0.24.2 and pyarrow==0.14.1, and the root cause in both Python versions was an ArrowInvalid error: "Nested column branch had multiple children."

We could try catching this error in client.load_table_from_dataframe() and act upon it.
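
A minimal sketch of what "catching this error and acting upon it" might mean, under stated assumptions: `ArrowInvalid` and `to_parquet_stub` below are hypothetical stand-ins for `pyarrow.lib.ArrowInvalid` and `dataframe.to_parquet()` so that the example runs without pyarrow, and `serialize_for_load` is an invented name. Note that `raise ... from` is Python 3 syntax.

```python
class ArrowInvalid(Exception):
    """Hypothetical stand-in for pyarrow.lib.ArrowInvalid."""


def to_parquet_stub(columns, path):
    """Hypothetical stand-in for dataframe.to_parquet().

    It rejects dict cells the way the real call did above and writes nothing.
    """
    for values in columns.values():
        if any(isinstance(value, dict) for value in values):
            raise ArrowInvalid("Nested column branch had multiple children")


def serialize_for_load(columns, path):
    """Translate the low-level failure into an error the caller can act on."""
    try:
        to_parquet_stub(columns, path)
    except ArrowInvalid as exc:
        # Chain the original exception so the root cause stays visible.
        raise ValueError(
            "Could not serialize the dataframe to parquet; STRUCT / RECORD "
            "columns are not supported yet. Original error: {}".format(exc)
        ) from exc
```

Chaining keeps the ArrowInvalid message in the traceback while giving the caller an error that names the unsupported column type.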

Edit:
FWIW, trying the same with pyarrow==0.13.0 produces a different error:

Traceback (most recent call last):
    ...
    raise NotImplementedError(str(arrow_type))
NotImplementedError: struct<non_consectutive_home: int64, page_type: int64, record_type: int64>

More recent versions of pyarrow do not raise NotImplementedError anymore when determining the logical type of composite types, and instead return 'object' for them, hence the difference.

@dabasmoti
Author

@plamut - I am using python 3.7

@plamut
Contributor

plamut commented Aug 15, 2019

@dabasmoti I see, let me try with Python 3.7, too, just in case ... although the outcome should probably be the same.

Update:
The same error occurs:

pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

... which is then followed by the FileNotFoundError when trying to remove the temp .parquet file that was never created.
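
The masking described above, where cleanup in a finally: clause fails because the file is missing, can be avoided by guarding the removal. The sketch below is an illustration only: `run_with_guarded_cleanup`, `failing_writer` and `working_writer` are hypothetical names, and the temp path is built by hand so that no file exists until the writer creates one.

```python
import os
import tempfile
import uuid


def run_with_guarded_cleanup(write_func):
    """Call write_func(tmppath), then remove tmppath only if it exists."""
    # Hypothetical temp path; nothing exists there until write_func creates it.
    tmppath = os.path.join(
        tempfile.gettempdir(),
        "tmp_job_{}.parquet".format(uuid.uuid4().hex[:8]),
    )
    try:
        write_func(tmppath)
    finally:
        # An unconditional os.remove(tmppath) would raise FileNotFoundError
        # when write_func failed before creating the file, hiding its error.
        if os.path.exists(tmppath):
            os.remove(tmppath)
    return tmppath


def failing_writer(path):
    """Simulates a serializer that fails before writing anything."""
    raise RuntimeError("Nested column branch had multiple children")


def working_writer(path):
    """Simulates a serializer that successfully writes the file."""
    with open(path, "w") as handle:
        handle.write("placeholder")
```

With the guard in place, `run_with_guarded_cleanup(failing_writer)` surfaces the writer's own RuntimeError rather than a FileNotFoundError, and the successful path still cleans up its file.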
