
Option to provide partition column and partition expiry time #313

Open
ShantanuKumar opened this issue Feb 14, 2020 · 7 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@ShantanuKumar
Contributor

ShantanuKumar commented Feb 14, 2020

While creating a new table using pandas, it would be nice if it could partition the table and set a partition expiry time. The Python BigQuery library already supports this:

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')

table_ref = dataset_ref.table("my_partitioned_table")
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
    bigquery.SchemaField("date", "DATE"),
]
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="date",  # name of column to use for partitioning
    expiration_ms=7776000000,  # 90 days
)

table = client.create_table(table)

print(
    "Created table {}, partitioned on column {}".format(
        table.table_id, table.time_partitioning.field
    )
)

https://cloud.google.com/bigquery/docs/creating-column-partitions

I can create a pull request if people feel it's something they'd find useful. At my work, at least, we create a lot of monitoring tables on BigQuery using pandas and push data to them. These tables keep growing, and since partitioning can't be set once a table has already been created, they just become too big, and expensive.

@tswast
Collaborator

tswast commented Feb 14, 2020

I'm open to a pull request that adds this.

Ideally, I'd like to see to_gbq gain a configuration parameter that takes a load job JSON configuration. (job configuration resource, load job configuration resource)

Example:

pandas_gbq.to_gbq(
  df,
  configuration={
    "load": {
      "timePartitioning": {
         "type": "DAY",
         "expirationMs": str(1000*60*60*24*30),  # 30 days
         "field": "my_timestamp_col"
      }
    }
  }
)

One problem with the configuration proposal is that currently pandas-gbq creates tables with calls to create_table if the table doesn't exist, rather than letting the load job create it. I'd like to refactor to_gbq to avoid this (unnecessary, IMO) step. Open to PRs to do that refactoring, but if you'd like to extract options from configuration when creating the table, that might be simpler short-term.
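As a rough sketch of the short-term option (extracting options from `configuration` before the `create_table` call), here is one way it could look. This is an illustration, not pandas-gbq code: the helper name `extract_time_partitioning` is hypothetical, and it only assumes a `configuration` dict shaped like the example above.

```python
def extract_time_partitioning(configuration):
    """Hypothetical helper: pull the timePartitioning sub-resource out of
    a load-job configuration dict so it can be applied when the table is
    created. Returns None when partitioning is not specified."""
    load = (configuration or {}).get("load", {})
    tp = load.get("timePartitioning")
    if tp is None:
        return None
    # expirationMs travels as a string in the REST representation;
    # normalize it to an int for client-side use.
    tp = dict(tp)
    if "expirationMs" in tp:
        tp["expirationMs"] = int(tp["expirationMs"])
    return tp

# Example with the configuration from the comment above:
config = {
    "load": {
        "timePartitioning": {
            "type": "DAY",
            "expirationMs": str(1000 * 60 * 60 * 24 * 30),  # 30 days
            "field": "my_timestamp_col",
        }
    }
}
print(extract_time_partitioning(config))
```

The resulting dict could then be translated into a `bigquery.TimePartitioning` object (or passed via `TimePartitioning.from_api_repr`) at table-creation time.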

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 14, 2020
@ShantanuKumar
Contributor Author

Yeah, I had something like that configuration approach in mind. Why does expirationMs need to be passed as a string, though?

@tswast
Collaborator

tswast commented Feb 18, 2020

> Why does expirationMs need to be passed as a string, though?

It's a historical artifact of the BigQuery REST endpoint using an older JSON parsing implementation that only had JavaScript Number (floating point) available. Encoding it as a string allows the BigQuery REST endpoint to interpret the value as a 64-bit integer without loss of precision.

I believe an integer will be accepted, but you might lose precision for large values.
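The precision issue can be demonstrated in plain Python: IEEE-754 doubles (the representation behind JavaScript Numbers) are only exact for integers up to 2**53, while a string round-trips any 64-bit value losslessly.

```python
# expirationMs is a 64-bit integer, but JSON numbers parsed as doubles
# are only exact up to 2**53. Encoding the value as a string avoids
# that limit.
exact_limit = 2 ** 53
print(float(exact_limit) == exact_limit)              # still exact
print(float(exact_limit + 1) == exact_limit + 1)      # precision lost
print(int(str(exact_limit + 1)) == exact_limit + 1)   # string round-trips
```

Practical expiry values (e.g. 7776000000 ms for 90 days) are far below 2**53, which is why passing a plain integer also works in practice.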

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jul 17, 2021
@MichailParaskevopoulos

> I'm open to a pull request that adds this. [...] One problem with the configuration proposal is that currently pandas-gbq creates tables with calls to create_table if the table doesn't exist, rather than letting the load job create it. I'd like to refactor to_gbq to avoid this (unnecessary, IMO) step. Open to PRs to do that refactoring, but if you'd like to extract options from configuration when creating the table, that might be simpler short-term.

Are you still open to a PR that does the refactoring? If so, I'd be interested to work on it.

@tswast
Collaborator

tswast commented Nov 17, 2021

I referenced this issue from #425, but will keep this open in case we don't decide to rely on the load job for table creation.

@fryck

fryck commented Mar 16, 2023

Any news about this feature? It could be very useful! Thanks!

@assem-ch

Is this feature implemented already?
