
BigQuery: LoadJobConfig.schema setter should accept None #9074

Closed
tswast opened this issue Aug 21, 2019 · 3 comments · Fixed by #9077
Labels: api: bigquery (Issues related to the BigQuery API), type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

tswast commented Aug 21, 2019

Is your feature request related to a problem? Please describe.

There is no (public) way to unset a schema after you've set one. Even though the client doesn't distinguish between an unset schema and an empty schema, the backend API does.
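
For illustration, here is the dead end in code. Both failure modes show up in the tracebacks quoted below; treat this snippet as a sketch, not output from a real session:

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("title", "STRING")]

# Neither option gets us back to "no schema set":
job_config.schema = None  # TypeError: 'NoneType' object is not iterable
job_config.schema = []    # backend rejects the load job:
                          # "400 ... Empty schema specified for the load job."
```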

Describe the solution you'd like

When the schema is set to None, remove the whole schema property from the underlying _properties.
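
A minimal sketch of the change (the exact `_properties` layout and the `ValueError` message here are assumptions inferred from the traceback below; the real setter lives on `LoadJobConfig` in `job.py`):

```python
@schema.setter
def schema(self, value):
    if value is None:
        # Unset: drop the whole "schema" property so it is omitted from
        # the API request, instead of sending an empty schema.
        self._properties.get("load", {}).pop("schema", None)
        return
    if not all(hasattr(field, "to_api_repr") for field in value):
        raise ValueError("Schema items must be fields")
    self._properties.setdefault("load", {})["schema"] = {
        "fields": [field.to_api_repr() for field in value]
    }
```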

Describe alternatives you've considered

We could always omit the schema from the request whenever it is empty, but this wouldn't be right either, since an empty schema and an unset schema are not the same thing to the backend.

Additional context

Needed for #9064

tswast added the api: bigquery and type: feature request labels on Aug 22, 2019

tswast commented Aug 22, 2019

@plamut Should be a pretty small change to LoadJobConfig's @schema.setter in job.py. Let me know if you can't get to this, and I'll try to tackle it tomorrow.
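
For reference, a rough unit-test sketch of the expected round-trip behavior (the `to_api_repr()` shape used in the assertion is an assumption about the request body layout):

```python
def test_schema_setter_accepts_none():
    config = bigquery.LoadJobConfig()
    config.schema = [bigquery.SchemaField("title", "STRING")]
    config.schema = None

    # The "schema" key should be gone from the request body entirely,
    # not present as an empty object.
    assert "schema" not in config.to_api_repr().get("load", {})
```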


plamut commented Aug 22, 2019

@tswast On it, prioritizing to unblock the PR #9064.


tswast commented Aug 22, 2019

Just so we don't lose the additional context from #9064 (comment), I'm copying my comment here as well.

Since there is already a partial schema set, we have to unset it somehow. Otherwise, the number of columns in the load job does not match the number of columns in the file.

It turns out that setting an empty schema behaves differently in the backend than not setting a schema at all.

```
___________________ test_load_table_from_dataframe[pyarrow] ____________________
client = <google.cloud.bigquery.client.Client object at 0x7f919ef73450>
to_delete = [Dataset(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566428806902'))]
parquet_engine = 'pyarrow'
    @pytest.mark.skipif(pandas is None, reason="Requires `pandas`")
    @pytest.mark.parametrize("parquet_engine", ["pyarrow", "fastparquet"])
    def test_load_table_from_dataframe(client, to_delete, parquet_engine):
        if parquet_engine == "pyarrow" and pyarrow is None:
            pytest.skip("Requires `pyarrow`")
        if parquet_engine == "fastparquet" and fastparquet is None:
            pytest.skip("Requires `fastparquet`")
        pandas.set_option("io.parquet.engine", parquet_engine)
        dataset_id = "load_table_from_dataframe_{}".format(_millis())
        dataset = bigquery.Dataset(client.dataset(dataset_id))
        client.create_dataset(dataset)
        to_delete.append(dataset)
        # [START bigquery_load_table_dataframe]
        # from google.cloud import bigquery
        # import pandas
        # client = bigquery.Client()
        # dataset_id = 'my_dataset'
        dataset_ref = client.dataset(dataset_id)
        table_ref = dataset_ref.table("monty_python")
        records = [
            {"title": u"The Meaning of Life", "release_year": 1983},
            {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
            {"title": u"Life of Brian", "release_year": 1979},
            {"title": u"And Now for Something Completely Different", "release_year": 1971},
        ]
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = [u"Q24980", u"Q25043", u"Q24953", u"Q16403"]
        dataframe = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
>       job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
docs/snippets.py:2779:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
google/cloud/bigquery/client.py:1574: in load_table_from_dataframe
    job_config=job_config,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <google.cloud.bigquery.client.Client object at 0x7f919ef73450>
file_obj = <closed file '/tmp/tmpoL17bp_job_31ab0413.parquet', mode 'rb' at 0x7f919ef4a540>
destination = TableReference(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566428806902'), 'monty_python')
rewind = True, size = None, num_retries = 6
job_id = '31ab0413-37cd-4ff8-b417-446f356846fa', job_id_prefix = None
location = 'US', project = 'precise-truck-742'
job_config = <google.cloud.bigquery.job.LoadJobConfig object at 0x7f9194de2890>
    def load_table_from_file(
        self,
        file_obj,
        destination,
        rewind=False,
        size=None,
        num_retries=_DEFAULT_NUM_RETRIES,
        job_id=None,
        job_id_prefix=None,
        location=None,
        project=None,
        job_config=None,
    ):
        """Upload the contents of this table from a file-like object.
        Similar to :meth:`load_table_from_uri`, this method creates, starts and
        returns a :class:`~google.cloud.bigquery.job.LoadJob`.
        Arguments:
            file_obj (file): A file handle opened in binary mode for reading.
            destination (Union[ \
                :class:`~google.cloud.bigquery.table.Table`, \
                :class:`~google.cloud.bigquery.table.TableReference`, \
                str, \
            ]):
                Table into which data is to be loaded. If a string is passed
                in, this method attempts to create a table reference from a
                string using
                :func:`google.cloud.bigquery.table.TableReference.from_string`.
        Keyword Arguments:
            rewind (bool):
                If True, seek to the beginning of the file handle before
                reading the file.
            size (int):
                The number of bytes to read from the file handle. If size is
                ``None`` or large, resumable upload will be used. Otherwise,
                multipart upload will be used.
            num_retries (int): Number of upload retries. Defaults to 6.
            job_id (str): (Optional) Name of the job.
            job_id_prefix (str):
                (Optional) the user-provided prefix for a randomly generated
                job ID. This parameter will be ignored if a ``job_id`` is
                also given.
            location (str):
                Location where to run the job. Must match the location of the
                destination table.
            project (str):
                Project ID of the project of where to run the job. Defaults
                to the client's project.
            job_config (google.cloud.bigquery.job.LoadJobConfig):
                (Optional) Extra configuration options for the job.
        Returns:
            google.cloud.bigquery.job.LoadJob: A new load job.
        Raises:
            ValueError:
                If ``size`` is not passed in and can not be determined, or if
                the ``file_obj`` can be detected to be a file opened in text
                mode.
        """
        job_id = _make_job_id(job_id, job_id_prefix)
        if project is None:
            project = self.project
        if location is None:
            location = self.location
        destination = _table_arg_to_table_ref(destination, default_project=self.project)
        job_ref = job._JobReference(job_id, project=project, location=location)
        load_job = job.LoadJob(job_ref, None, destination, self, job_config)
        job_resource = load_job.to_api_repr()
        if rewind:
            file_obj.seek(0, os.SEEK_SET)
        _check_mode(file_obj)
        try:
            if size is None or size >= _MAX_MULTIPART_SIZE:
                response = self._do_resumable_upload(
                    file_obj, job_resource, num_retries
                )
            else:
                response = self._do_multipart_upload(
                    file_obj, job_resource, size, num_retries
                )
        except resumable_media.InvalidResponse as exc:
>           raise exceptions.from_http_response(exc.response)
E           BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/precise-truck-742/jobs?uploadType=resumable: Empty schema specified for the load job. Please specify a schema that describes the data being loaded.
google/cloud/bigquery/client.py:1439: BadRequest
=============================== warnings summary ===============================
docs/snippets.py::test_load_table_from_dataframe[pyarrow]
  /tmpfs/src/github/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:229: UserWarning: Unable to determine type of column 'title'.
    warnings.warn("Unable to determine type of column '{}'.".format(column))

To fix this, the correct thing to do is to unset the schema if one is set, which is why I filed #9074.

```
=================================== FAILURES ===================================
___________________ test_load_table_from_dataframe[pyarrow] ____________________
client = <google.cloud.bigquery.client.Client object at 0x7f9963974450>
to_delete = [Dataset(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566434996535'))]
parquet_engine = 'pyarrow'
    @pytest.mark.skipif(pandas is None, reason="Requires `pandas`")
    @pytest.mark.parametrize("parquet_engine", ["pyarrow", "fastparquet"])
    def test_load_table_from_dataframe(client, to_delete, parquet_engine):
        if parquet_engine == "pyarrow" and pyarrow is None:
            pytest.skip("Requires `pyarrow`")
        if parquet_engine == "fastparquet" and fastparquet is None:
            pytest.skip("Requires `fastparquet`")
        pandas.set_option("io.parquet.engine", parquet_engine)
        dataset_id = "load_table_from_dataframe_{}".format(_millis())
        dataset = bigquery.Dataset(client.dataset(dataset_id))
        client.create_dataset(dataset)
        to_delete.append(dataset)
        # [START bigquery_load_table_dataframe]
        # from google.cloud import bigquery
        # import pandas
        # client = bigquery.Client()
        # dataset_id = 'my_dataset'
        dataset_ref = client.dataset(dataset_id)
        table_ref = dataset_ref.table("monty_python")
        records = [
            {"title": u"The Meaning of Life", "release_year": 1983},
            {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
            {"title": u"Life of Brian", "release_year": 1979},
            {"title": u"And Now for Something Completely Different", "release_year": 1971},
        ]
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = [u"Q24980", u"Q25043", u"Q24953", u"Q16403"]
        dataframe = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
>       job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
docs/snippets.py:2779:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
google/cloud/bigquery/client.py:1535: in load_table_from_dataframe
    dataframe, job_config.schema
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <google.cloud.bigquery.job.LoadJobConfig object at 0x7f99632d7d90>
value = None
    @schema.setter
    def schema(self, value):
>       if not all(hasattr(field, "to_api_repr") for field in value):
E       TypeError: 'NoneType' object is not iterable
google/cloud/bigquery/job.py:1163: TypeError
=============================== warnings summary ===============================
docs/snippets.py::test_load_table_from_dataframe[pyarrow]
  /tmpfs/src/github/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:229: UserWarning: Unable to determine type of column 'title'.
    warnings.warn("Unable to determine type of column '{}'.".format(column))
