500 server error when creating table using clustering #242

Closed
charlielito opened this issue Aug 25, 2020 · 20 comments
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@charlielito

charlielito commented Aug 25, 2020

Environment details

  • OS type and version: Pop!_OS 20.04 (Ubuntu-based)
  • Python version: 3.7.8
  • pip version: 20.1.1
  • google-cloud-bigquery version: 1.27.2

Steps to reproduce

I'm creating a table with several columns, one of which is of type GEOGRAPHY. When I try to create the table with some sample data and clustering enabled, I get a 500 error. I can create the table only if no clustering is used. I can also create the table with clustering if I leave out the GEOGRAPHY column.
Code with a toy example to reproduce it:

Code example

import time
import pandas as pd
from google.cloud import bigquery
from shapely.geometry import Point

client = bigquery.Client()
PROJECT_ID = ""  # set to your GCP project id
table_id = f"{PROJECT_ID}.data_capture.toy"

df = pd.DataFrame(
    dict(
        lat=[6.208969] * 100,
        lon=[-75.571696] * 100,
        logged_at=[int(time.time() * 1000) for _ in range(100)],
    )
)
df["point"] = df.apply(lambda row: Point(row["lon"], row["lat"]).wkb_hex, axis=1)

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("lon", "FLOAT64", "REQUIRED"),
        bigquery.SchemaField("lat", "FLOAT64", "REQUIRED"),
        bigquery.SchemaField("point", "GEOGRAPHY", "REQUIRED"),
        bigquery.SchemaField("logged_at", "TIMESTAMP", "REQUIRED"),
    ],
    write_disposition="WRITE_TRUNCATE",
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="logged_at",
    ),
    clustering_fields=["logged_at"],
)

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

Stack trace

Traceback (most recent call last):
  File "test.py", line 108, in <module>
    job.result()  # Wait for the job to complete.
  File "/home/charlie/data/kiwi/data-upload/.venv/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 812, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/charlie/data/kiwi/data-upload/.venv/lib/python3.7/site-packages/google/api_core/future/polling.py", line 130, in result
    raise self._exception
google.api_core.exceptions.InternalServerError: 500 An internal error occurred and the request could not be completed. Error: 3144498

Thank you in advance!

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Aug 25, 2020
@meredithslota meredithslota added the type: question Request for information or clarification. Not an issue. label Aug 25, 2020
@plamut
Contributor

plamut commented Aug 26, 2020

I can confirm the reported behavior: having a column explicitly specified as type GEOGRAPHY in the schema triggers the error on the backend.

@shollyman Any idea what the error 3144498 represents?
(Also, feel free to increase the priority to P1 if no good workaround exists and user experience is considerably affected.)

@plamut plamut added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed type: question Request for information or clarification. Not an issue. labels Aug 26, 2020
@shollyman
Contributor

This is a case where we need to improve our documentation.

The docstrings in the REST reference carry this constraint: 'Only top-level, non-repeated, simple-type fields are supported.' But we're not doing a great job of describing what that means.

You can see from the Standard SQL data types page that GEOGRAPHY types have problems with ordering and comparisons, which prevent the kinds of operations we need to do to cluster data.

I've filed internal issue 166457597 to improve this. Once we get better clarity in the reference, we can propagate that into the various client libraries.

@plamut
Contributor

plamut commented Aug 26, 2020

OK thanks, that makes sense. I'll still keep this open for visibility until the docs are improved.

@plamut plamut added type: docs Improvement to the documentation for an API. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 26, 2020
@charlielito
Author

charlielito commented Aug 26, 2020

@shollyman that is odd. Actually I wanted to try this because the documentation said that GEOGRAPHY columns supported clustering, which I thought was really cool. This section states the following:

  • Clustering columns must be top-level, non-repeated columns of one of the following types:
    • DATE
    • BOOL
    • GEOGRAPHY
    • INT64
    • NUMERIC
    • STRING
    • TIMESTAMP

If you try the code sample I shared, it also crashes if you select the GEOGRAPHY column as the clustering column.
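
For what it's worth, here is a minimal sketch (separate from the load-job repro above, not something I've run as-is; the table id is a placeholder) that exercises the documented claim directly by creating an empty table clustered on the GEOGRAPHY column via the tables API instead of a load job:

# Sketch only: create an empty table clustered on a GEOGRAPHY column.
# "PROJECT_ID.data_capture.toy_clustered" is a placeholder table id.
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "PROJECT_ID.data_capture.toy_clustered",
    schema=[
        bigquery.SchemaField("point", "GEOGRAPHY", "REQUIRED"),
        bigquery.SchemaField("logged_at", "TIMESTAMP", "REQUIRED"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="logged_at"
)
table.clustering_fields = ["point"]
client.create_table(table)  # should succeed if GEOGRAPHY clustering is supported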

@shollyman
Contributor

It appears I've failed to read the details. Going to take a deeper look here, stay tuned.

@shollyman
Contributor

So, please disregard my comments about the GEOGRAPHY type not being valid for clustering; that was in error and based on a stale recollection of the state of data clustering. We need to do a better job of authoritatively linking some of our docs, but I'll handle that in the internal ticket.

I find I'm unable to reproduce exactly using your code example. Running it against my own project, it completes successfully. The job metadata is here:

{
  "configuration": {
    "jobType": "LOAD",
    "load": {
      "clustering": {
        "fields": [
          "logged_at"
        ]
      },
      "destinationTable": {
        "datasetId": "data_capture",
        "projectId": "shollyman-demo-test",
        "tableId": "toy"
      },
      "schema": {
        "fields": [
          {
            "mode": "REQUIRED",
            "name": "lat",
            "type": "FLOAT"
          },
          {
            "mode": "REQUIRED",
            "name": "lon",
            "type": "FLOAT"
          },
          {
            "mode": "REQUIRED",
            "name": "logged_at",
            "type": "TIMESTAMP"
          },
          {
            "mode": "REQUIRED",
            "name": "point",
            "type": "GEOGRAPHY"
          }
        ]
      },
      "sourceFormat": "PARQUET",
      "timePartitioning": {
        "field": "logged_at",
        "type": "DAY"
      },
      "writeDisposition": "WRITE_TRUNCATE"
    }
  },
  "etag": "LmJB7x/0E+MJxSNeKl2/3w==",
  "id": "shollyman-demo-test:US.130f8487-abd2-4deb-ad30-e4e750cf09bc",
  "jobReference": {
    "jobId": "130f8487-abd2-4deb-ad30-e4e750cf09bc",
    "location": "US",
    "projectId": "shollyman-demo-test"
  },
  "kind": "bigquery#job",
  "selfLink": "https://bigquery.googleapis.com/bigquery/v2/projects/shollyman-demo-test/jobs/130f8487-abd2-4deb-ad30-e4e750cf09bc?location=US",
  "statistics": {
    "completionRatio": 1,
    "creationTime": "1598474653319",
    "endTime": "1598474662958",
    "load": {
      "badRecords": "0",
      "inputFileBytes": "1899",
      "inputFiles": "1",
      "outputBytes": "6400",
      "outputRows": "100"
    },
    "reservationUsage": [
      {
        "name": "default-pipeline",
        "slotMs": "67607"
      }
    ],
    "reservation_id": "default-pipeline",
    "startTime": "1598474653689",
    "totalSlotMs": "67607"
  },
  "status": {
    "state": "DONE"
  },
  "user_email": "test-robot@shollyman-demo-test.iam.gserviceaccount.com"
}

However, when I modify the clustering config and attempt to re-run, it does fail because the existing table's specification conflicts with the load specification. In this case, though, it surfaces as an HTTP 400 rather than the 500 you're observing.

Incompatible table partitioning specification. Expects partitioning specification interval(type:day,field:logged_at) clustering(logged_at), but input partitioning specification is interval(type:day,field:logged_at) clustering(point)

I'm wondering if there's something we're missing related to the current state of the resources in your project?
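
To rule that out, a sketch of what I mean by resetting the resource state (placeholder table id; this just drops the destination table, if present, so the load job recreates it with the spec from the job config):

from google.cloud import bigquery

client = bigquery.Client()
# Drop the destination table if it exists, then re-run the
# load_table_from_dataframe call from the original example.
client.delete_table("PROJECT_ID.data_capture.toy", not_found_ok=True)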

@shollyman
Contributor

shollyman commented Aug 26, 2020

state of my env:

$ pip freeze
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
google-api-core==1.22.1
google-auth==1.20.1
google-cloud-bigquery==1.27.2
google-cloud-bigquery-storage==1.0.0
google-cloud-core==1.4.1
google-crc32c==1.0.0
google-resumable-media==1.0.0
googleapis-common-protos==1.52.0
grpcio==1.31.0
idna==2.10
numpy==1.19.1
pandas==1.1.1
protobuf==3.13.0
pyarrow==1.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
python-dateutil==2.8.1
pytz==2020.1
requests==2.24.0
rsa==4.6
Shapely==1.7.1
six==1.15.0
tqdm==4.48.2
urllib3==1.25.10

@charlielito
Author

What do you mean by the current state of the resources? In my example the dataset data_capture already exists and the table toy does not. I just tested the code snippet in a fresh env with your versions and I still get the same 500.

@shollyman
Contributor

I'm unable to reproduce what you're observing.

Perhaps try enabling debug logging for the underlying HTTP interactions; it may make it clearer what I'm missing. Add something like the following to the beginning of your repro and re-run:

import http.client
http.client.HTTPConnection.debuglevel = 5

Note that the dumps will include live authentication headers (notably Bearer tokens), which could be used to impersonate you. Please redact those values before sharing.

If you'd prefer not to post details publicly, as an alternative you can also send them directly to me (same username at google.com).
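
If the http.client dump alone doesn't show the failing call, a slightly fuller sketch (assuming the default requests/urllib3 transport) that also surfaces library-level debug logs:

import http.client
import logging

# Print raw request/response lines from http.client to stdout.
http.client.HTTPConnection.debuglevel = 5

# Also emit urllib3 debug logs for the same requests.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)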

@charlielito
Author

These are the logs:

send: b'POST /token HTTP/1.1\r\nHost: oauth2.googleapis.com\r\nUser-Agent: python-requests/2.24.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\ncontent-type: application/x-www-form-urlencoded\r\nContent-Length: 859\r\n\r\n'
send: b'assertion=XXXXXXXXXXXXXXXX&grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:27:40 GMT
header: Server: scaffolding on HTTPServer2
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked
send: b'POST /upload/bigquery/v2/projects/PROJECT_ID/jobs?uploadType=resumable HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip, deflate\r\nAccept: application/json\r\nConnection: keep-alive\r\ncontent-type: application/json\r\nx-upload-content-type: */*\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\nContent-Length: 738\r\n\r\n'
send: b'{"jobReference": {"jobId": "171f09e7-2b61-4819-837c-db8619565f18", "projectId": "PROJECT_ID"}, "configuration": {"load": {"schema": {"fields": [{"mode": "REQUIRED", "name": "lat", "type": "FLOAT64", "description": null}, {"mode": "REQUIRED", "name": "lon", "type": "FLOAT64", "description": null}, {"mode": "REQUIRED", "name": "logged_at", "type": "TIMESTAMP", "description": null}, {"mode": "REQUIRED", "name": "point", "type": "GEOGRAPHY", "description": null}]}, "writeDisposition": "WRITE_TRUNCATE", "timePartitioning": {"type": "DAY", "field": "logged_at"}, "clustering": {"fields": ["logged_at"]}, "sourceFormat": "PARQUET", "destinationTable": {"projectId": "PROJECT_ID", "datasetId": "data_capture", "tableId": "toy"}}}}'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: text/plain; charset=utf-8
header: X-GUploader-UploadID: UPLOAD_ID
header: Location: https://bigquery.googleapis.com/upload/bigquery/v2/projects/PROJECT_ID/jobs?uploadType=resumable&upload_id=UPLOAD_ID
header: ETag: QRf43t5EtSKPbjwAJWMVOQ==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Length: 0
header: Date: Wed, 26 Aug 2020 23:27:41 GMT
header: Server: UploadServer
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
send: b'PUT /upload/bigquery/v2/projects/PROJECT_ID/jobs?uploadType=resumable&upload_id=UPLOAD_ID HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: python-requests/2.24.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\ncontent-type: */*\r\ncontent-range: bytes 0-1898/1899\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\nContent-Length: 1899\r\n\r\n'
send: b'PAR1\x15\x04\...XXXXXX...'
reply: 'HTTP/1.1 200 OK\r\n'
header: X-GUploader-UploadID: UPLOAD_ID
header: ETag: KnJI7lOFKFjxuLn+P8rytA==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Length: 1594
header: Date: Wed, 26 Aug 2020 23:27:42 GMT
header: Server: UploadServer
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
send: b'GET /bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip\r\nAccept: */*\r\nConnection: keep-alive\r\nX-Goog-API-Client: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ETag: KnJI7lOFKFjxuLn+P8rytA==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:27:43 GMT
header: Server: ESF
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked
send: b'GET /bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip\r\nAccept: */*\r\nConnection: keep-alive\r\nX-Goog-API-Client: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ETag: KnJI7lOFKFjxuLn+P8rytA==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:27:43 GMT
header: Server: ESF
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked
send: b'GET /bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip\r\nAccept: */*\r\nConnection: keep-alive\r\nX-Goog-API-Client: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ETag: zb3hm09cwfnAiOYjmzMmaw==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:27:45 GMT
header: Server: ESF
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked
send: b'GET /bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip\r\nAccept: */*\r\nConnection: keep-alive\r\nX-Goog-API-Client: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ETag: c3pJouaM/0o2KaV0DuaVnQ==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:27:46 GMT
header: Server: ESF
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked
send: b'GET /bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US HTTP/1.1\r\nHost: bigquery.googleapis.com\r\nUser-Agent: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nAccept-Encoding: gzip\r\nAccept: */*\r\nConnection: keep-alive\r\nX-Goog-API-Client: gl-python/3.7.7 grpc/1.31.0 gax/1.22.1 gapic/1.27.2 gccl/1.27.2\r\nauthorization: Bearer XXXXXXXXXXXXXXXX\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: ETag: 7ztKRvG3ECRYHY6fwL0rpg==
header: Content-Type: application/json; charset=UTF-8
header: Vary: Origin
header: Vary: X-Origin
header: Vary: Referer
header: Content-Encoding: gzip
header: Date: Wed, 26 Aug 2020 23:28:02 GMT
header: Server: ESF
header: Cache-Control: private
header: X-XSS-Protection: 0
header: X-Frame-Options: SAMEORIGIN
header: X-Content-Type-Options: nosniff
header: Alt-Svc: h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
header: Transfer-Encoding: chunked

Then the same traceback as in the example above.

@shollyman
Contributor

I'm not seeing the request that yields a 500 reply. We should just be polling job status, so perhaps grabbing the job response through another means may help?

If you've got the gcloud SDK installed, try bq show -j --format=prettyjson 171f09e7-2b61-4819-837c-db8619565f18
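
Roughly the same thing via the Python client, if that's easier (a sketch using the job id and location from your logs):

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("171f09e7-2b61-4819-837c-db8619565f18", location="US")
print(job.state)         # e.g. DONE
print(job.error_result)  # the server-side error, if any
print(job.errors)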

@charlielito
Author

{
  "configuration": {
    "jobType": "LOAD", 
    "load": {
      "clustering": {
        "fields": [
          "logged_at"
        ]
      }, 
      "destinationTable": {
        "datasetId": "data_capture", 
        "projectId": "PROJECT_ID", 
        "tableId": "toy"
      }, 
      "schema": {
        "fields": [
          {
            "mode": "REQUIRED", 
            "name": "lat", 
            "type": "FLOAT"
          }, 
          {
            "mode": "REQUIRED", 
            "name": "lon", 
            "type": "FLOAT"
          }, 
          {
            "mode": "REQUIRED", 
            "name": "logged_at", 
            "type": "TIMESTAMP"
          }, 
          {
            "mode": "REQUIRED", 
            "name": "point", 
            "type": "GEOGRAPHY"
          }
        ]
      }, 
      "sourceFormat": "PARQUET", 
      "timePartitioning": {
        "field": "logged_at", 
        "type": "DAY"
      }, 
      "writeDisposition": "WRITE_TRUNCATE"
    }
  }, 
  "etag": "7ztKRvG3ECRYHY6fwL0rpg==", 
  "id": "PROJECT_ID:US.171f09e7-2b61-4819-837c-db8619565f18", 
  "jobReference": {
    "jobId": "171f09e7-2b61-4819-837c-db8619565f18", 
    "location": "US", 
    "projectId": "PROJECT_ID"
  }, 
  "kind": "bigquery#job", 
  "selfLink": "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs/171f09e7-2b61-4819-837c-db8619565f18?location=US", 
  "statistics": {
    "completionRatio": 0, 
    "creationTime": "1598484462115", 
    "endTime": "1598484470223", 
    "reservation_id": "default-pipeline", 
    "startTime": "1598484462475", 
    "totalSlotMs": "12726"
  }, 
  "status": {
    "errorResult": {
      "message": "An internal error occurred and the request could not be completed. Error: 3144498", 
      "reason": "internalError"
    }, 
    "errors": [
      {
        "message": "An internal error occurred and the request could not be completed. Error: 3144498", 
        "reason": "internalError"
      }
    ], 
    "state": "DONE"
  }, 
  "user_email": "charlie-lenovo@PROJECT_ID.iam.gserviceaccount.com"
}

@shollyman
Contributor

Interesting. Mind sending me the project id out of band (email)? It'll help me investigate the underlying error to see what's going wrong.

@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: docs Improvement to the documentation for an API. labels Aug 27, 2020
@shollyman
Contributor

Current state: the backend team is investigating the failure.

@charlielito
Author

@shollyman any news on your end?

@shollyman
Contributor

Getting back to this (was out on vacation).

The issue appears to be related to the Parquet interpreter and the format of the geography column (the DataFrame integration serializes to Parquet when sending row data to BigQuery). Since then, there have been changes to the responsible Parquet readers, but it's not immediately clear whether they address your issue.

Could you retry? If you're still seeing issues, the next step is to understand how your environment differs from mine, where I cannot reproduce this. Possibly there's a difference in libraries such as the shapely dependency that converts to the well-known binary (WKB) representation.

If you're still getting errors, could you also provide another job ID where the ingestion is failing?
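
If it helps with comparing environments, a quick sketch (illustrative only) that prints the shapely WKB hex the repro puts into the point column:

from shapely import wkb
from shapely.geometry import Point

# The repro stores points as WKB hex; printing one makes it easy to diff
# the serialization across shapely versions.
hex_str = Point(-75.571696, 6.208969).wkb_hex
print(hex_str)                             # little-endian WKB hex string
print(wkb.loads(bytes.fromhex(hex_str)))   # round-trips to POINT (-75.571696 6.208969)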

@charlielito
Author

state of my env:

$ pip freeze
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
google-api-core==1.22.1
google-auth==1.20.1
google-cloud-bigquery==1.27.2
google-cloud-bigquery-storage==1.0.0
google-cloud-core==1.4.1
google-crc32c==1.0.0
google-resumable-media==1.0.0
googleapis-common-protos==1.52.0
grpcio==1.31.0
idna==2.10
numpy==1.19.1
pandas==1.1.1
protobuf==3.13.0
pyarrow==1.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
python-dateutil==2.8.1
pytz==2020.1
requests==2.24.0
rsa==4.6
Shapely==1.7.1
six==1.15.0
tqdm==4.48.2
urllib3==1.25.10

I have this exact same environment. I am running Python 3.7.7 with conda on Ubuntu 20.04.
The problem still persists. One failing job (tested today): 5a8b7f92-edf0-418f-af75-d378de4ebe68

@tswast
Contributor

tswast commented Sep 15, 2020

@charlielito Could you patch load_table_from_dataframe to write the parquet file it generates to your desktop / somewhere that won't get deleted, and post the resulting file here?

Peter did as much in #56 (comment) and it was helpful in narrowing down the backend issue (in that case a Feature Request to support DATETIME in Parquet)
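
If patching the library is awkward, something like this sketch (it approximates, but is not, the client's internal serialization path) should produce a comparable parquet file from the same DataFrame:

import pyarrow as pa
import pyarrow.parquet as pq

# df is the DataFrame from the original repro; the output filename is arbitrary.
arrow_table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(arrow_table, "test_parquet.parquet")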

@charlielito
Author

charlielito commented Sep 15, 2020

@tswast Here is the zipped .parquet file generated
test_parquet.zip

@plamut plamut removed their assignment Oct 1, 2020
@charlielito
Author

UPDATE: Somehow this works now; maybe the backend was updated or fixed. It would be nice if a notification could be sent ;)
Closing this issue now.
