500 server error when creating table using clustering #242
Comments
I can confirm the reported behavior: a column explicitly specified as type GEOGRAPHY in the schema triggers the error on the backend. @shollyman Any idea what error 3144498 represents?
This is a case where we need to improve our documentation. The docstrings in the REST reference carry this constraint: "Only top-level, non-repeated, simple-type fields are supported." But we're not doing a great job of describing what that means. You can see from the Standard SQL data types page that GEOGRAPHY types have problems with ordering and comparisons, which prevents the kinds of operations we need to perform to cluster data. I've filed internal issue 166457597 to improve this. Once we get better clarity in the reference, we can propagate that into the various client libraries.
OK, thanks, that makes sense. I'll keep this open for visibility until the docs are improved.
@shollyman that is odd. I actually wanted to try this because the documentation said that GEOGRAPHY columns supported clustering, which I thought was really cool. This section states the following:
If you try the code sample I shared, it also crashes if you select the GEOGRAPHY column as the clustering column.
It appears I've failed to read the details. Going to take a deeper look here, stay tuned.
So, please disregard my comments about the GEOGRAPHY type not being valid for clustering; that was in error, based on a stale recollection of the state of data clustering. We need to do a better job of authoritatively linking some of our docs, but I'll handle that in the internal ticket. I find I'm unable to reproduce this exactly using your code example. Running it against my own project, it completes successfully. The job metadata is here:
However, when I modify the clustering config and attempt to re-run, it does fail, due to the existing table having a specification that conflicts with the load specification. In that case, though, it surfaces as HTTP 400 rather than the 500 you're observing.
I'm wondering if there's something we're missing related to the current state of the resources in your project?
state of my env:
What do you mean by the current state of the resources? In my example the dataset
I'm unable to reproduce what you're observing. Perhaps try enabling debug logging for the underlying HTTP interactions; it may make it clearer what I'm missing. Add something like the following to the beginning of your repro and re-run:
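The maintainer's original snippet isn't preserved in this thread; as a hedged sketch, stdlib-level HTTP debug logging can be enabled like this (logger names are the usual ones for the `requests`/`urllib3` stack, not something specific to this library):

```python
# Sketch: turn on verbose HTTP tracing for the repro script.
import logging
import http.client

# Dump raw request/response lines (including headers) to stdout.
http.client.HTTPConnection.debuglevel = 1

# Route urllib3/requests debug messages through the root logger.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

With this in place, each API call the client makes is printed in full, which is also why the warning below about redacting authentication headers matters.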
Note that the dumps will include live authentication headers (Bearer tokens in particular), which could be used to impersonate you. Please redact those values before sharing. If you'd prefer not to post details publicly, you can alternatively send them directly to me (same username at google.com).
These are the logs:
Then the same traceback as in the example.
I'm not seeing the request that yields the 500 reply. We should just be polling job status, so perhaps grabbing the job response through other means may help. If you've got the gcloud SDK installed, try
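The exact command isn't preserved above; one plausible equivalent uses the `bq` CLI that ships with the Cloud SDK to fetch the raw job resource (`JOB_ID` is a placeholder for the failing load job's ID):

```shell
# Print the full job resource, including any error details the backend attached.
bq show --format=prettyjson -j JOB_ID
```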
Interesting. Mind sending me the project ID out of band (email)? It'll help me investigate the underlying error to see what's going wrong.
Current state: the backend team is investigating the failure.
@shollyman any news on your end?
Getting back to this (I was out on vacation). The issue appears to be related to the Parquet interpreter, specifically the format of the geography column (the dataframe integration serializes to Parquet when sending row data to BigQuery). Since then, there have been changes to the responsible Parquet readers, but it's not immediately clear whether they address your issue. Could you retry? If you're still seeing issues, the next step is to understand how your environment differs from mine, where I cannot reproduce. Possibly there's a difference in libraries like the shapely dependency that converts to well-known binary (WKB) representation. If you're still hitting errors, could you also provide another job ID where the ingestion is failing?
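For context, WKB is the compact binary geometry encoding that ends up in the Parquet column. A minimal stdlib-only sketch of what shapely would produce for a 2-D point (coordinates are illustrative, and real geometries of course get more complex):

```python
import struct

def point_to_wkb(lon: float, lat: float) -> bytes:
    """Encode a 2-D point as little-endian WKB.

    Layout: 1 byte-order flag (1 = little-endian), uint32 geometry
    type (1 = Point), then the X and Y coordinates as float64.
    """
    return struct.pack("<BIdd", 1, 1, lon, lat)

# Illustrative point; a WKB Point is always 1 + 4 + 8 + 8 = 21 bytes.
wkb = point_to_wkb(-122.0838, 37.3861)
assert len(wkb) == 21
```

If two environments produce differently shaped bytes here (e.g. different shapely versions or a WKT string instead of WKB bytes), the backend reader could plausibly choke on one and not the other.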
I have the same exact environment. I'm running Python 3.7.7 with conda on Ubuntu 20.
@charlielito Could you patch python-bigquery/google/cloud/bigquery/client.py (line 2228 at ae647eb) so we can capture the Parquet file it generates?
Peter did as much in #56 (comment), and it was helpful in narrowing down the backend issue (in that case a feature request to support DATETIME in Parquet).
@tswast Here is the zipped .parquet file generated |
UPDATE: Somehow this now works; maybe the backend was updated or fixed. It would be nice if a notification could be sent ;)
Environment details

- Python version: 3.7.8
- pip version: 20.1.1
- `google-cloud-bigquery` version: 1.27.2
Steps to reproduce
I'm creating a table with some columns, one of them of type `GEOGRAPHY`. When I try to create the table with some sample data and clustering enabled, I get the 500 error. I can create the table only if no clustering is used; I can also create it with clustering if I leave out the `GEOGRAPHY` column.
Code with a toy example to reproduce it:
Code example
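The original snippet isn't preserved in this thread. As a hedged sketch of the request shape involved (the actual repro used the Python client's `load_table_from_dataframe` with `LoadJobConfig(clustering_fields=[...])`), this is roughly the REST `Table` resource that results; the field names here are illustrative:

```python
import json

# Illustrative table resource with a clustered GEOGRAPHY column.
table_resource = {
    "schema": {
        "fields": [
            {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
            {"name": "geo", "type": "GEOGRAPHY", "mode": "NULLABLE"},
        ]
    },
    "clustering": {"fields": ["geo"]},
}
print(json.dumps(table_resource, indent=2))
```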
Stack trace
Thank you in advance!