Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autodetect for BQ happening automatically even with schema defined. #847

Closed
brummetj opened this issue Aug 2, 2021 · 12 comments
Closed
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@brummetj
Copy link

brummetj commented Aug 2, 2021

Issue

I'm currently using the python sdk to transform and load data into BQ. One of the fields I have is supposed to be a string but the data will come in as either a long float number (ex: 75257229615211513551680283362634662820), int number ( ex: 4 ), or string (ex: http://some-string-url ).

So the data representation looks like

 {
            "count": "6",
            "value": "75257229615211513551680283362634662820"
          },
          {
            "count": "4",
            "value": "5"
          },
          {
            "count": "4",
            "value": "http://some-string-url"
          }

Since BQ does the auto detection, these values can be recognized as either a FLOAT, INT, or STRING. I followed docs, and stack recommendations to assign a hard schema to the config which looks like

        table = bigquery.table.Table(table_ref)

        job_config = bigquery.LoadJobConfig()
        if table_name == self.incident_table_name:
            job_config.schema = self.incidents_schema

        if table_name == self.kpi_table_name:
            job_config.schema = self.kpi_schema
        
        job_config.autodetect = False  # should already be assigned to False in the SDK ( just for a hard assurance )

The schema is read in from a json file that looks like

  "fields": [
          {
            "mode": "NULLABLE",
            "name": "count",
            "type": "INTEGER"
          },
          {
            "mode": "NULLABLE",
            "name": "value",
            "type": "STRING"
          }
        ],

and parsed into the SchemaField() object..

    field_schema = bigquery.SchemaField(name=name,
       field_type=field_type,
       mode=mode,
       fields=subschema
   )

Just to be clear i do believe that this is apart of a subschema and assigned to the fields as it's an array of objects.

then i submit the data through the SDK via the load_table_from_json which looks like

            job = params.bq_client.load_table_from_json([incident],
                                                        job_config=job_config,
                                                        destination=table)
            result = job.result()

All fairly straightforward, and follows closely with what is documented and shared by others. BUT for whatever reason, the API request will fail and continue to try parse value as either a STRING, FLOAT or INT even with the schema defined.

So a couple questions here... is this a bug with BQ, or the SDK not accepting the schema? Am i doing something wrong here? Does the autodetect feature always do its thing even when turned off and a schema defined?

Any communication here would be greatly appreciated.

Cheers

info

sdk version: google-cloud-bigquery 2.20.0
os: MacOS
environment: Cloud Function
python version: 3.9

@busunkim96 busunkim96 transferred this issue from googleapis/google-cloud-python Aug 3, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Aug 3, 2021
@plamut plamut added the type: question Request for information or clarification. Not an issue. label Aug 3, 2021
@plamut plamut self-assigned this Aug 3, 2021
@tswast
Copy link
Contributor

tswast commented Aug 3, 2021

is this a bug with BQ, or the SDK not accepting the schema?

To determine this, it would help to see what is being sent in the job insert request. Would it be possible for you to add a breakpoint / print out the contents of the job resource here?

job_resource = load_job.to_api_repr()

@plamut
Copy link
Contributor

plamut commented Aug 3, 2021

BUT for whatever reason, the API request will fail and continue to try parse value as either a STRING, FLOAT or INT even with the schema defined.

Could you please elaborate? What error do you get when the request "fails"? Or is the error that type of the value column is not always detected correctly?

@brummetj
Copy link
Author

brummetj commented Aug 3, 2021

@tswast i'll go ahead and get you that job_resource values here shortly! I'll post back when i do.

@plamut It is the latter. The type of value is never detected correctly. Error response looks like

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table
value has changed type from STRING to FLOAT

@brummetj
Copy link
Author

brummetj commented Aug 3, 2021

@tswast here is a output of job_resource

{
  "jobReference": {
    "jobId": "b6c53e3e-65b9-461e-a952-3cf93cf67454",
    "projectId": "xxxx"
  },
  "configuration": {
    "load": {
      "schema": {
        "fields": [
          {
            "name": "period_count",
            "type": "INTEGER",
            "mode": "NULLABLE"
          },
          {
            "name": "anomalies",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "segments",
                "type": "RECORD",
                "mode": "REPEATED",
                "fields": [
                  {
                    "name": "segment_value",
                    "type": "STRING",
                    "mode": "NULLABLE"
                  },
                  {
                    "name": "segment_type",
                    "type": "STRING",
                    "mode": "NULLABLE"
                  }
                ]
              },
              {
                "name": "kpi",
                "type": "STRING",
                "mode": "NULLABLE"
              }
            ]
          },
          {
            "name": "all_segment_anomaly",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "anomalous",
                "type": "BOOLEAN",
                "mode": "NULLABLE"
              },
              {
                "name": "kpi",
                "type": "STRING",
                "mode": "NULLABLE"
              }
            ]
          },
          {
            "name": "incident_id",
            "type": "STRING",
            "mode": "NULLABLE"
          },
          {
            "name": "ts",
            "type": "FLOAT",
            "mode": "NULLABLE"
          },
          {
            "name": "features",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "event_value_counts",
                "type": "RECORD",
                "mode": "REPEATED",
                "fields": [
                  {
                    "name": "count",
                    "type": "INTEGER",
                    "mode": "NULLABLE"
                  },
                  {
                    "name": "value",
                    "type": "STRING",
                    "mode": "NULLABLE"
                  }
                ]
              },
              {
                "name": "non_event_feature_value",
                "type": "STRING",
                "mode": "NULLABLE"
              },
              {
                "name": "feature_name",
                "type": "STRING",
                "mode": "NULLABLE"
              },
              {
                "name": "importances",
                "type": "INTEGER",
                "mode": "NULLABLE"
              }
            ]
          },
          {
            "name": "partition_date",
            "type": "DATE",
            "mode": "NULLABLE"
          }
        ]
      },
      "schemaUpdateOptions": "ALLOW_FIELD_ADDITION",
      "createDisposition": "CREATE_IF_NEEDED",
      "writeDisposition": "WRITE_APPEND",
      "sourceFormat": "NEWLINE_DELIMITED_JSON",
      "autodetect": false,
      "timePartitioning": {
        "type": "DAY",
        "field": "partition_date"
      },
      "destinationTable": {
        "projectId": "xxxx",
        "datasetId": "xxxx",
        "tableId": "bbb_segment_incidents"
      }
    },
    "labels": {
      "product": "eai"
    }
  }
}

the value is under the features -> event_value_counts

@tswast
Copy link
Contributor

tswast commented Aug 3, 2021

Thanks, @brummetj That means the library is sending the schema. Double-checking with https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad I don't see any typos with regards to autodetect or schema properties.

This must be a backend issue / issue with the table creation.

@brummetj
Copy link
Author

brummetj commented Aug 3, 2021

@tswast Sounds good to me! ... Any steps you want me to take to help with the backend issue?

Should i be calling create_table statically instead of relying on the API to call create table for me?

@tswast
Copy link
Contributor

tswast commented Aug 3, 2021

Should i be calling create_table statically instead of relying on the API to call create table for me?

That could be a good workaround. If a table already exists, the backend API should use that.

Any steps you want me to take to help with the backend issue?

I don't have as much access to internal job logs and such. Filing a support request if you have it would get you to the right folks. If not, the backend team does monitor the public issue tracker here: https://issuetracker.google.com/issues/new?component=187149&template=1162659

@tswast
Copy link
Contributor

tswast commented Aug 3, 2021

Oh, I have another thought!

@brummetj Are those "long float number (ex: 75257229615211513551680283362634662820), int number ( ex: 4 )" values actual float / int values in Python? If so, it's likely that that the json.dumps call we make in the library is formatting these as JSON "number" objects (missing the quotes).

Would it be possible to convert these to strings before passing them to load_table_from_json?

I could see the library adding logic to convert values based on the BigQuery schema we have, but this would likely come at a pretty severe performance penalty. Alternatively, you could file that issue against the backend to see if they can convert those to strings for you a load time.

@brummetj
Copy link
Author

brummetj commented Aug 3, 2021

@tswast they are not actual float/int values. They are wrapped in quotes and python is interpreting them as a string.... e.i ... they are already strings :D

I'm probably just going to go ahead and file issue with the link you shared

@tswast
Copy link
Contributor

tswast commented Aug 3, 2021

Thanks. If you could comment here with a link to the issue you file, that'd be great. Then I can track it as well as anyone else who hits this issue.

@brummetj
Copy link
Author

brummetj commented Aug 3, 2021

@tswast here you go! https://issuetracker.google.com/issues/195444399

@plamut plamut added external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Aug 4, 2021
@plamut plamut removed their assignment Aug 4, 2021
@chalmerlowe
Copy link
Contributor

chalmerlowe commented Sep 13, 2022

This item went to the BQ backend team and was closed as Won't fix
Closing this item.
https://issuetracker.google.com/issues/195444399

@chalmerlowe chalmerlowe added the status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. label Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. external This issue is blocked on a bug with the actual product. priority: p2 Moderately-important priority. Fix may not be included in next release. status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants