BigQuery: Load to table from dataframe without index #5572

Closed
brodiemackenzie opened this issue Jul 5, 2018 · 12 comments · Fixed by #9084
@brodiemackenzie

client.load_table_from_dataframe() results in the DataFrame index being loaded into the BigQuery table.

Could support for loading data to a table from a DataFrame without also loading the index be added?

@tseaver tseaver added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the BigQuery API. labels Jul 5, 2018
@tseaver tseaver changed the title [BigQuery] Load to table from dataframe always loads an index when using client.load_table_from_dataframe() BigQuery: Load to table from dataframe without index Jul 5, 2018
@max-sixty

Details for how to do this are here: googleapis/python-bigquery-pandas#133 (comment)

More than happy for this to be implemented here rather than pandas-gbq...

@sungchun12

Any updates on implementing this in the BigQuery API vs. the pandas-gbq module?

@mikeymezher

@sungchun12 I just tried the solution @max-sixty posted above with the BigQuery client API and it worked fine. Use a load job configuration and override the schema as suggested.
The schema should be a list of the form:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField("field1_name", "field1_type"),
    # ...
    bigquery.SchemaField("fieldn_name", "fieldn_type"),
]
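
For completeness, here's a minimal end-to-end sketch of that workaround (the table ID, field names, and sample data are placeholders; assumes pyarrow is installed and credentials are configured):

import pandas
from google.cloud import bigquery

# Assumes application default credentials are available.
client = bigquery.Client()

dataframe = pandas.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Listing only the DataFrame's columns in the schema means the serialized
# index is not declared, so it is not written to the destination table.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("score", "INTEGER"),
    ]
)

# "your-project.your_dataset.your_table" is a placeholder table ID.
job = client.load_table_from_dataframe(
    dataframe, "your-project.your_dataset.your_table", job_config=job_config
)
job.result()  # Waits for the load job to complete.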

@sungchun12

@mikeymezher, thanks for getting back to me! I'll try it out and let you know. Do you know if one is more performant than the other, in your hands-on experience?

@mikeymezher commented Mar 14, 2019

Haven't tested, but anecdotally I've noticed pandas-gbq being faster than the client library. There are instances where the client library is needed, though, such as writing to partitioned tables.

@tswast commented May 16, 2019

They are implementation details, but pandas-gbq uses CSV whereas google-cloud-bigquery uses parquet as the serialization format. The reason for this is to support STRUCT / ARRAY BigQuery columns (though these aren't supported in pandas, anyway).

Implementation-wise, I just noticed pandas provides a way to override the parquet engine's default behavior with an index argument. I'm open to adding a similar argument to google-cloud-bigquery.

df.to_parquet('test.parquet', index=False)

From https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
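
To see the effect of that argument, here's a quick round trip through parquet (a sketch only; requires a parquet engine such as pyarrow, and the file names are arbitrary):

import pandas

df = pandas.DataFrame(
    {"x": [1, 2, 3]},
    index=pandas.Index(["a", "b", "c"], name="id"),
)

# index=False drops the named index from the parquet file entirely,
# so reading it back yields only the "x" column with a default RangeIndex.
df.to_parquet("no_index.parquet", index=False)
print(pandas.read_parquet("no_index.parquet").index)
# RangeIndex(start=0, stop=3, step=1)

# With the default (index=None), the engine preserves the named index.
df.to_parquet("with_index.parquet")
print(pandas.read_parquet("with_index.parquet").index)
# Index(['a', 'b', 'c'], dtype='object', name='id')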

@sungchun12

@tswast, I would LOVE "index=False" functionality in the google-cloud-bigquery package, as it would let me drop the pandas-gbq imports and have a consistent API for working with BigQuery... at least for my use cases, without having to build schema overrides.

@cavindsouza

@tswast Has the index=False functionality been added to google-cloud-bigquery yet? Thanks.

@tswast commented Jun 3, 2019

@cavindsouza not yet. Right now you can avoid writing indexes by passing a job_config argument with the desired job_config.schema set. I agree this would still be a useful feature.

@tswast commented Aug 22, 2019

FYI: #9064 and #9049 are changing the index behavior, as a schema will be automatically populated in more cases now.

We might actually have a need to explicitly add indexes to the table. Currently, it's inconsistent when an index is added and when it isn't: it depends on whether the schema is populated and which parquet engine is used to serialize the DataFrame.

Preferred option

Check if index (or indexes if multi-index) name(s) are present in job_config.schema. If so, include the index(es) that are specified. If not, omit the indexes.

Edge case: What if index name matches that of a column name? Prefer serializing the column. Don't add the index in this case.

Alternative

Add a index=True / index=False option to the load_table_from_dataframe function.

This makes it explicit when to include indexes. This would allow the index dtype to be used to automatically determine the schema in some cases.

When the index dtype is object, we'll need to add the index to the job_config.schema, anyway, so this requires the same implementation as the preferred option.
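
For illustration, the preferred option's matching rule could be sketched like this (a hypothetical helper, not the library's actual implementation; the function name and arguments are made up):

import pandas

def index_names_to_serialize(dataframe, schema_field_names):
    """Sketch: pick the index levels to write, per the preferred option."""
    columns = set(dataframe.columns)
    selected = []
    # index.names has one entry per level, so this covers multi-indexes too.
    for name in dataframe.index.names:
        if name is None:
            continue  # unnamed levels can never match a schema field
        if name in columns:
            continue  # edge case: a column with the same name wins
        if name in schema_field_names:
            selected.append(name)  # explicitly requested in job_config.schema
    return selected

df = pandas.DataFrame(
    {"title": ["Life of Brian"]},
    index=pandas.Index(["Q24953"], name="wikidata_id"),
)
print(index_names_to_serialize(df, {"title", "wikidata_id"}))  # ['wikidata_id']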

@tswast commented Aug 22, 2019

Follow-up from #9064

When this feature is added,

  • Update test_load_table_from_dataframe snippets to show overriding schema for columns whose types can't be autodetected. Explicitly add indexes to the partial schema as well.

@tswast commented Aug 28, 2019

Once this feature is released, do the following to omit indexes:

  • Install the pyarrow library.
  • Specify the schema with LoadJobConfig.schema for any object dtype columns.

To include indexes:

  • Install the pyarrow library.
  • Construct your index with a name by passing in name="my_index_name" to the Index constructor.
  • Set the index when you construct the DataFrame with the index=my_index argument.
  • Specify the schema with LoadJobConfig.schema for any object dtype columns and any indexes you want to include.
  • Note: if an index has the same name as a column, the column is included, not the index.

Code sample:

from google.cloud import bigquery
import pandas

# TODO(developer): Construct a BigQuery client object.
# client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

records = [
    {"title": u"The Meaning of Life", "release_year": 1983},
    {"title": u"Monty Python and the Holy Grail", "release_year": 1975},
    {"title": u"Life of Brian", "release_year": 1979},
    {"title": u"And Now for Something Completely Different", "release_year": 1971},
]
dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=["title", "release_year"],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
    index=pandas.Index(
        [u"Q24980", u"Q25043", u"Q24953", u"Q16403"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected.
        # For example the "title" column uses pandas dtype "object", so
        # its data type is ambiguous.
        bigquery.SchemaField("title", bigquery.enums.SqlTypeNames.STRING),
        # Indexes are written if included in the schema by name.
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config, location="US"
)
job.result()  # Waits for table load to complete.

table = client.get_table(table_id)
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
