
BigQuery: LoadJobConfig.schema setter should accept None #9074

Closed
tswast opened this issue Aug 21, 2019 · 3 comments · Fixed by #9077
Labels: api: bigquery (Issues related to the BigQuery API), type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

tswast commented Aug 21, 2019

Is your feature request related to a problem? Please describe.

There is no (public) way to unset a schema after you've set one. Even though the client doesn't distinguish between an unset schema and an empty schema, the backend API does.
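
For illustration, here is the dead end in code. Both failure modes show up in the tracebacks quoted below; treat this snippet as a sketch, not output from a real session:

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("title", "STRING")]

# Neither option gets us back to "no schema set":
job_config.schema = None  # TypeError: 'NoneType' object is not iterable
job_config.schema = []    # backend rejects the load job:
                          # "400 ... Empty schema specified for the load job."
```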

Describe the solution you'd like

When the schema is set to None, remove the whole schema property from the underlying _properties.
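
A minimal sketch of the change (the exact `_properties` layout and the `ValueError` message here are assumptions inferred from the traceback below; the real setter lives on `LoadJobConfig` in `job.py`):

```python
@schema.setter
def schema(self, value):
    if value is None:
        # Unset: drop the whole "schema" property so it is omitted from
        # the API request, instead of sending an empty schema.
        self._properties.get("load", {}).pop("schema", None)
        return
    if not all(hasattr(field, "to_api_repr") for field in value):
        raise ValueError("Schema items must be fields")
    self._properties.setdefault("load", {})["schema"] = {
        "fields": [field.to_api_repr() for field in value]
    }
```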

Describe alternatives you've considered

We could always omit the schema from the request whenever it is empty, but this wouldn't be right either, since an empty schema and an unset schema are not the same thing to the backend.

Additional context

Needed for #9064

tswast added the api: bigquery and type: feature request labels on Aug 22, 2019

tswast commented Aug 22, 2019

@plamut Should be a pretty small change to LoadJobConfig's @schema.setter in job.py. Let me know if you can't get to this, and I'll try to tackle it tomorrow.
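
For reference, a rough unit-test sketch of the expected round-trip behavior (the `to_api_repr()` shape used in the assertion is an assumption about the request body layout):

```python
def test_schema_setter_accepts_none():
    config = bigquery.LoadJobConfig()
    config.schema = [bigquery.SchemaField("title", "STRING")]
    config.schema = None

    # The "schema" key should be gone from the request body entirely,
    # not present as an empty object.
    assert "schema" not in config.to_api_repr().get("load", {})
```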


plamut commented Aug 22, 2019

@tswast On it, prioritizing to unblock the PR #9064.


tswast commented Aug 22, 2019

Just so we don't lose the additional context from #9064 (comment), I'm copying my comment here as well.

Since there is already a partial schema set, we have to unset it somehow. Otherwise, the number of columns in the load job does not match the number of columns in the file.

It turns out that setting an empty schema behaves differently in the backend than not setting a schema at all.

```
___________________ test_load_table_from_dataframe[pyarrow] ____________________
client = <google.cloud.bigquery.client.Client object at 0x7f919ef73450>
to_delete = [Dataset(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566428806902'))]
parquet_engine = 'pyarrow'
    @pytest.mark.skipif(pandas is None, reason="Requires `pandas`")
    @pytest.mark.parametrize("parquet_engine", ["pyarrow", "fastparquet"])
    def test_load_table_from_dataframe(client, to_delete, parquet_engine):
        if parquet_engine == "pyarrow" and pyarrow is None:
            pytest.skip("Requires `pyarrow`")
        if parquet_engine == "fastparquet" and fastparquet is None:
            pytest.skip("Requires `fastparquet`")
        pandas.set_option("io.parquet.engine", parquet_engine)
        dataset_id = "load_table_from_dataframe_{}".format(_millis())
        dataset = bigquery.Dataset(client.dataset(dataset_id))
        client.create_dataset(dataset)
        to_delete.append(dataset)
        # [START bigquery_load_table_dataframe]
        # from google.cloud import bigquery
        # import pandas
        # client = bigquery.Client()
        # dataset_id = 'my_dataset'
        dataset_ref = client.dataset(dataset_id)
        table_ref = dataset_ref.table("monty_python")
        records = [
            {"title": u"The Meaning of Life", "release_year": 1983},
            {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
            {"title": u"Life of Brian", "release_year": 1979},
            {"title": u"And Now for Something Completely Different", "release_year": 1971},
        ]
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = [u"Q24980", u"Q25043", u"Q24953", u"Q16403"]
        dataframe = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
>       job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
docs/snippets.py:2779:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
google/cloud/bigquery/client.py:1574: in load_table_from_dataframe
    job_config=job_config,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <google.cloud.bigquery.client.Client object at 0x7f919ef73450>
file_obj = <closed file '/tmp/tmpoL17bp_job_31ab0413.parquet', mode 'rb' at 0x7f919ef4a540>
destination = TableReference(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566428806902'), 'monty_python')
rewind = True, size = None, num_retries = 6
job_id = '31ab0413-37cd-4ff8-b417-446f356846fa', job_id_prefix = None
location = 'US', project = 'precise-truck-742'
job_config = <google.cloud.bigquery.job.LoadJobConfig object at 0x7f9194de2890>
    def load_table_from_file(
        self,
        file_obj,
        destination,
        rewind=False,
        size=None,
        num_retries=_DEFAULT_NUM_RETRIES,
        job_id=None,
        job_id_prefix=None,
        location=None,
        project=None,
        job_config=None,
    ):
        """Upload the contents of this table from a file-like object.
        Similar to :meth:`load_table_from_uri`, this method creates, starts and
        returns a :class:`~google.cloud.bigquery.job.LoadJob`.
        Arguments:
            file_obj (file): A file handle opened in binary mode for reading.
            destination (Union[ \
                :class:`~google.cloud.bigquery.table.Table`, \
                :class:`~google.cloud.bigquery.table.TableReference`, \
                str, \
            ]):
                Table into which data is to be loaded. If a string is passed
                in, this method attempts to create a table reference from a
                string using
                :func:`google.cloud.bigquery.table.TableReference.from_string`.
        Keyword Arguments:
            rewind (bool):
                If True, seek to the beginning of the file handle before
                reading the file.
            size (int):
                The number of bytes to read from the file handle. If size is
                ``None`` or large, resumable upload will be used. Otherwise,
                multipart upload will be used.
            num_retries (int): Number of upload retries. Defaults to 6.
            job_id (str): (Optional) Name of the job.
            job_id_prefix (str):
                (Optional) the user-provided prefix for a randomly generated
                job ID. This parameter will be ignored if a ``job_id`` is
                also given.
            location (str):
                Location where to run the job. Must match the location of the
                destination table.
            project (str):
                Project ID of the project of where to run the job. Defaults
                to the client's project.
            job_config (google.cloud.bigquery.job.LoadJobConfig):
                (Optional) Extra configuration options for the job.
        Returns:
            google.cloud.bigquery.job.LoadJob: A new load job.
        Raises:
            ValueError:
                If ``size`` is not passed in and can not be determined, or if
                the ``file_obj`` can be detected to be a file opened in text
                mode.
        """
        job_id = _make_job_id(job_id, job_id_prefix)
        if project is None:
            project = self.project
        if location is None:
            location = self.location
        destination = _table_arg_to_table_ref(destination, default_project=self.project)
        job_ref = job._JobReference(job_id, project=project, location=location)
        load_job = job.LoadJob(job_ref, None, destination, self, job_config)
        job_resource = load_job.to_api_repr()
        if rewind:
            file_obj.seek(0, os.SEEK_SET)
        _check_mode(file_obj)
        try:
            if size is None or size >= _MAX_MULTIPART_SIZE:
                response = self._do_resumable_upload(
                    file_obj, job_resource, num_retries
                )
            else:
                response = self._do_multipart_upload(
                    file_obj, job_resource, size, num_retries
                )
        except resumable_media.InvalidResponse as exc:
>           raise exceptions.from_http_response(exc.response)
E           BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/precise-truck-742/jobs?uploadType=resumable: Empty schema specified for the load job. Please specify a schema that describes the data being loaded.
google/cloud/bigquery/client.py:1439: BadRequest
=============================== warnings summary ===============================
docs/snippets.py::test_load_table_from_dataframe[pyarrow]
  /tmpfs/src/github/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:229: UserWarning: Unable to determine type of column 'title'.
    warnings.warn("Unable to determine type of column '{}'.".format(column))

To fix this, the correct thing to do is to unset the schema if one is set, which is why I filed #9074.

```
=================================== FAILURES ===================================
___________________ test_load_table_from_dataframe[pyarrow] ____________________
client = <google.cloud.bigquery.client.Client object at 0x7f9963974450>
to_delete = [Dataset(DatasetReference(u'precise-truck-742', 'load_table_from_dataframe_1566434996535'))]
parquet_engine = 'pyarrow'
    @pytest.mark.skipif(pandas is None, reason="Requires `pandas`")
    @pytest.mark.parametrize("parquet_engine", ["pyarrow", "fastparquet"])
    def test_load_table_from_dataframe(client, to_delete, parquet_engine):
        if parquet_engine == "pyarrow" and pyarrow is None:
            pytest.skip("Requires `pyarrow`")
        if parquet_engine == "fastparquet" and fastparquet is None:
            pytest.skip("Requires `fastparquet`")
        pandas.set_option("io.parquet.engine", parquet_engine)
        dataset_id = "load_table_from_dataframe_{}".format(_millis())
        dataset = bigquery.Dataset(client.dataset(dataset_id))
        client.create_dataset(dataset)
        to_delete.append(dataset)
        # [START bigquery_load_table_dataframe]
        # from google.cloud import bigquery
        # import pandas
        # client = bigquery.Client()
        # dataset_id = 'my_dataset'
        dataset_ref = client.dataset(dataset_id)
        table_ref = dataset_ref.table("monty_python")
        records = [
            {"title": u"The Meaning of Life", "release_year": 1983},
            {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
            {"title": u"Life of Brian", "release_year": 1979},
            {"title": u"And Now for Something Completely Different", "release_year": 1971},
        ]
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = [u"Q24980", u"Q25043", u"Q24953", u"Q16403"]
        dataframe = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
>       job = client.load_table_from_dataframe(dataframe, table_ref, location="US")
docs/snippets.py:2779:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
google/cloud/bigquery/client.py:1535: in load_table_from_dataframe
    dataframe, job_config.schema
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <google.cloud.bigquery.job.LoadJobConfig object at 0x7f99632d7d90>
value = None
    @schema.setter
    def schema(self, value):
>       if not all(hasattr(field, "to_api_repr") for field in value):
E       TypeError: 'NoneType' object is not iterable
google/cloud/bigquery/job.py:1163: TypeError
=============================== warnings summary ===============================
docs/snippets.py::test_load_table_from_dataframe[pyarrow]
  /tmpfs/src/github/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:229: UserWarning: Unable to determine type of column 'title'.
    warnings.warn("Unable to determine type of column '{}'.".format(column))
