BigQuery: Load to table from dataframe without index #5572

Closed
brodiemackenzie opened this issue Jul 5, 2018 · 12 comments · Fixed by #9084
@brodiemackenzie

client.load_table_from_dataframe() results in the DataFrame index being loaded into the BigQuery table.

Could support for loading data to a table from a DataFrame without also loading the index be added?

@tseaver tseaver added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the BigQuery API. labels Jul 5, 2018
@tseaver tseaver changed the title [BigQuery] Load to table from dataframe always loads an index when using client.load_table_from_dataframe() BigQuery: Load to table from dataframe without index Jul 5, 2018
@max-sixty

Details for how to do this are here: googleapis/python-bigquery-pandas#133 (comment)

More than happy for this to be implemented here rather than pandas-gbq...

@sungchun12

Any updates on implementing this in the BigQuery API vs. the pandas-gbq module?

@mikeymezher

@sungchun12 I just tried the solution @max-sixty posted above with the BigQuery client API and it worked fine. Use a load job configuration and override the schema as suggested.
The schema should be a list of the form:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField("field1_name", "field1_type"),
    # ...
    bigquery.SchemaField("fieldn_name", "fieldn_type"),
]
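
For completeness, here's a minimal end-to-end sketch of that workaround (the table ID, field names, and sample data are placeholders; assumes pyarrow is installed and credentials are configured):

import pandas
from google.cloud import bigquery

# Assumes application default credentials are available.
client = bigquery.Client()

dataframe = pandas.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Listing only the DataFrame's columns in the schema means the serialized
# index is not declared, so it is not written to the destination table.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("score", "INTEGER"),
    ]
)

# "your-project.your_dataset.your_table" is a placeholder table ID.
job = client.load_table_from_dataframe(
    dataframe, "your-project.your_dataset.your_table", job_config=job_config
)
job.result()  # Waits for the load job to complete.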

@sungchun12

@mikeymezher, thanks for getting back to me! I'll try it out and let you know. Do you know if one is more performant than the other, in your hands-on experience?

@mikeymezher commented Mar 14, 2019

Haven't tested, but anecdotally I've noticed pandas-gbq being faster than the client library. There are instances where the client library is needed, though, such as writing to partitioned tables.

@tswast commented May 16, 2019

They are implementation details, but pandas-gbq uses CSV whereas google-cloud-bigquery uses parquet as the serialization format. The reason for this is to support STRUCT / ARRAY BigQuery columns (though these aren't supported in pandas, anyway).

Implementation-wise, I just noticed pandas provides a way to override the parquet engine's default behavior with an index argument. I'm open to adding a similar argument to google-cloud-bigquery.

df.to_parquet('test.parquet', index=False)

From https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
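
To see the effect of that argument, here's a quick round trip through parquet (a sketch only; requires a parquet engine such as pyarrow, and the file names are arbitrary):

import pandas

df = pandas.DataFrame(
    {"x": [1, 2, 3]},
    index=pandas.Index(["a", "b", "c"], name="id"),
)

# index=False drops the named index from the parquet file entirely,
# so reading it back yields only the "x" column with a default RangeIndex.
df.to_parquet("no_index.parquet", index=False)
print(pandas.read_parquet("no_index.parquet").index)
# RangeIndex(start=0, stop=3, step=1)

# With the default (index=None), the engine preserves the named index.
df.to_parquet("with_index.parquet")
print(pandas.read_parquet("with_index.parquet").index)
# Index(['a', 'b', 'c'], dtype='object', name='id')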

@sungchun12

@tswast, I would LOVE "index=False" functionality in the google-cloud-bigquery package, as it would let me drop the pandas-gbq imports and have a consistent API for working with BigQuery... at least for my use cases, without having to build schema overrides.

@cavindsouza

@tswast Has the index=False functionality been added to google-cloud-bigquery yet? Thanks.

@tswast commented Jun 3, 2019

@cavindsouza not yet. Right now you can avoid writing indexes by passing a job_config argument with the desired job_config.schema set. I agree this would still be a useful feature.

@tswast commented Aug 22, 2019

FYI: #9064 and #9049 are changing the index behavior, as a schema will be automatically populated in more cases now.

We might actually have a need to explicitly add indexes to the table. Currently, it's inconsistent when an index is added and when it isn't: it depends on whether the schema is populated and which parquet engine is used to serialize the DataFrame.

Preferred option

Check if index (or indexes if multi-index) name(s) are present in job_config.schema. If so, include the index(es) that are specified. If not, omit the indexes.

Edge case: What if index name matches that of a column name? Prefer serializing the column. Don't add the index in this case.

Alternative

Add a index=True / index=False option to the load_table_from_dataframe function.

This makes it explicit when to include indexes. This would allow the index dtype to be used to automatically determine the schema in some cases.

When the index dtype is object, we'll need to add the index to the job_config.schema, anyway, so this requires the same implementation as the preferred option.
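
For illustration, the preferred option's matching rule could be sketched like this (a hypothetical helper, not the library's actual implementation; the function name and arguments are made up):

import pandas

def index_names_to_serialize(dataframe, schema_field_names):
    """Sketch: pick the index levels to write, per the preferred option."""
    columns = set(dataframe.columns)
    selected = []
    # index.names has one entry per level, so this covers multi-indexes too.
    for name in dataframe.index.names:
        if name is None:
            continue  # unnamed levels can never match a schema field
        if name in columns:
            continue  # edge case: a column with the same name wins
        if name in schema_field_names:
            selected.append(name)  # explicitly requested in job_config.schema
    return selected

df = pandas.DataFrame(
    {"title": ["Life of Brian"]},
    index=pandas.Index(["Q24953"], name="wikidata_id"),
)
print(index_names_to_serialize(df, {"title", "wikidata_id"}))  # ['wikidata_id']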

@tswast commented Aug 22, 2019

Follow-up from #9064

When this feature is added,

  • Update test_load_table_from_dataframe snippets to show overriding schema for columns whose types can't be autodetected. Explicitly add indexes to the partial schema as well.

@tswast commented Aug 28, 2019

Once this feature is released, do the following to omit indexes:

  • Install the pyarrow library.
  • Specify the schema with LoadJobConfig.schema for any object dtype columns.

To include indexes:

  • Install the pyarrow library.
  • Construct your index with a name by passing in name="my_index_name" to the Index constructor.
  • Set the index when you construct the DataFrame with the index=my_index argument.
  • Specify the schema with LoadJobConfig.schema for any object dtype columns and any indexes you want to include.
  • Note: if an index has the same name as a column, the column is included, not the index.

Code sample:

from google.cloud import bigquery
import pandas

# TODO(developer): Construct a BigQuery client object.
# client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

records = [
    {"title": u"The Meaning of Life", "release_year": 1983},
    {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
    {"title": u"Life of Brian", "release_year": 1979},
    {"title": u"And Now for Something Completely Different", "release_year": 1971},
]
dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=["title", "release_year"],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
    index=pandas.Index(
        [u"Q24980", u"Q25043", u"Q24953", u"Q16403"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected.
        # For example the "title" column uses pandas dtype "object", so
        # its data type is ambiguous.
        bigquery.SchemaField("title", bigquery.enums.SqlTypeNames.STRING),
        # Indexes are written if included in the schema by name.
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config, location="US"
)
job.result()  # Waits for table load to complete.

table = client.get_table(table_id)
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
