
Change to_gbq() from stream to batch loading #14670

Closed
nicku33 opened this issue Nov 16, 2016 · 6 comments


@nicku33
Contributor

nicku33 commented Nov 16, 2016


Recently we discovered that streaming loads into Google BigQuery tables are best-effort and can be delayed by up to 90 minutes, with no transactional guarantees. As such, to_gbq() returning is no guarantee that the data is queryable.

I propose that we switch the loading from tabledata.insertAll() to a batch BigQuery load job, either via an HTTP upload or by pushing to a Cloud Storage bucket first. The latter is less attractive because it also requires bucket permissions, but I'm not sure about the size limits for the HTTP upload version.

However, atomicity seems important for the case where we immediately run a BigQuery query against the results just uploaded with to_gbq(), and that seems worth a tradeoff in load time.

At the least we could include an option.
Thoughts?

(not sure how to label this IO:google)
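For context, this is roughly the call shape of the streaming path that to_gbq() relies on today, written directly against google-api-python-client. This is a minimal sketch, not the actual pandas internals; the project/dataset/table names are placeholders.

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

# Streaming insert: rows go through tabledata.insertAll
rows = [{'json': {'a': i, 'b': i}} for i in range(3)]
response = bigquery_service.tabledata().insertAll(
    projectId='sample-gcp-project',  # placeholder
    datasetId='tmp',                 # placeholder
    tableId='tmp',                   # placeholder
    body={'rows': rows},
).execute()

# The call returns as soon as the rows are buffered, but they may not be
# visible to queries for a while, which is the problem described above.
print(response.get('insertErrors', []))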

Output of pd.show_versions()

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche
Member

cc @parthea

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Hey, just want to pick up this discussion again. We have had some more confirmed cases of streamed data not appearing in a timely manner. The delay is not deterministic; it depends on Google. Speaking to Jordan Tigrani, he said the only guarantee they provide is 90 minutes, though typical delays are much shorter, obviously.

I think most users would expect to_gbq() to be transactional. The ability to query directly after to_gbq() returns is probably a common use case. For example, one may use pandas to build a scoring table, push it to BigQuery, and then run a query over a huge dataset joined to that scoring table, where the large table is far too big to pull into local memory.
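As a concrete (hypothetical) example of that workflow, with project and table names invented for illustration:

import pandas as pd

# Push a small scoring table, then immediately run a server-side query that
# joins it against a table far too large to pull into local memory.
# All project and table names here are made up for illustration.
scores = pd.DataFrame({'user_id': [1, 2, 3], 'score': [0.1, 0.7, 0.4]})
pd.io.gbq.to_gbq(scores, 'tmp.scores', 'sample-gcp-project',
                 if_exists='replace')

# With streaming inserts, this query can silently see a stale or empty
# tmp.scores table even though to_gbq() has already returned.
query = """
SELECT e.user_id, e.event_type, s.score
FROM big_dataset.events AS e
JOIN tmp.scores AS s ON e.user_id = s.user_id
"""
result = pd.io.gbq.read_gbq(query, 'sample-gcp-project')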

At work we've written a helper function that uses the non-streaming load, and most people are avoiding to_gbq() because of this issue.

I see two tacks:
a) push Google to provide better bounds on streaming insertion
b) rewrite it as a BigQuery load job

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated configuration.

Thoughts?
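For concreteness, here is a rough sketch of what option (b) could look like with google-api-python-client, assuming the DataFrame has already been written as CSV to a bucket. All project, bucket, and table names are placeholders, and the extra bucket permissions are exactly the config burden mentioned above.

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

job_body = {
    'configuration': {
        'load': {
            # assumes the CSV was uploaded to this bucket beforehand
            'sourceUris': ['gs://some-bucket/scores.csv'],
            'sourceFormat': 'CSV',
            'destinationTable': {
                'projectId': 'sample-gcp-project',
                'datasetId': 'tmp',
                'tableId': 'tmp',
            },
            # WRITE_TRUNCATE behaves like if_exists='replace'
            'writeDisposition': 'WRITE_TRUNCATE',
            # schema autodetection keeps the sketch short; a real patch would
            # build the schema from the DataFrame dtypes instead
            'autodetect': True,
        },
    }
}
job = bigquery_service.jobs().insert(
    projectId='sample-gcp-project', body=job_body).execute()
print(job['jobReference']['jobId'])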

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Here's code to see the behaviour:

import time
import pandas as pd
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import random 

project_id = 'sample-gcp-project'
credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)
bq_tablename = 'tmp.tmp'

# Replace the table ad nauseam
k = 0
while True:

    n = random.randint(1000, 10000)
    df = pd.DataFrame({'a': xrange(n), 'b': xrange(n)})
    pd.io.gbq.to_gbq(df, bq_tablename, project_id, if_exists='replace')

    # Did it work?
    while True:
        check_query = 'SELECT COUNT(1) as ct FROM ' + bq_tablename
        df_check = pd.io.gbq.read_gbq(check_query, project_id)
        n_check = df_check['ct'].values[0]
        if n_check == len(df):
            print 'Success!'
            break
        else:
            print 'Crap, table length is {0} not {1}.'.format(n_check, len(df))
        k += 1
        print '(k = %s) Sleeping 10 seconds before retrying the check or uploading again.' % (k)
        time.sleep(10)

@max-sixty
Contributor

Agree in principle that a batch upload is a much better fit for this than a streaming load

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated configuration.

Where do you see that? While you could load into Google Cloud Storage and then from there into BigQuery, you can also do a single POST request: https://cloud.google.com/bigquery/loading-data-post-request
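Roughly along these lines (a hedged sketch with google-api-python-client, not the final pandas implementation): the CSV bytes go in the request body as a media upload, so no bucket is involved. Names and the schema are placeholders, and error handling is omitted.

import io
import time

import pandas as pd
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

df = pd.DataFrame({'a': range(5), 'b': range(5)})
csv_bytes = io.BytesIO(df.to_csv(index=False, header=False).encode('utf-8'))

job_body = {
    'configuration': {
        'load': {
            'destinationTable': {
                'projectId': 'sample-gcp-project',  # placeholder
                'datasetId': 'tmp',                 # placeholder
                'tableId': 'tmp',                   # placeholder
            },
            'sourceFormat': 'CSV',
            'schema': {'fields': [
                {'name': 'a', 'type': 'INTEGER'},
                {'name': 'b', 'type': 'INTEGER'},
            ]},
            'writeDisposition': 'WRITE_TRUNCATE',  # like if_exists='replace'
        },
    }
}
media = MediaIoBaseUpload(csv_bytes, mimetype='application/octet-stream',
                          resumable=True)
job = bigquery_service.jobs().insert(
    projectId='sample-gcp-project', body=job_body,
    media_body=media).execute()

# Poll until the load job is DONE; once it is, the table is fully queryable,
# which is the atomicity guarantee that streaming inserts lack.
while True:
    status = bigquery_service.jobs().get(
        projectId='sample-gcp-project',
        jobId=job['jobReference']['jobId']).execute()
    if status['status']['state'] == 'DONE':
        break
    time.sleep(1)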

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Ah, you're right. I misunderstood.
I will look into what their POST size limits are, to be safe.

If we agree that batch is better, does anyone object if I try to write this patch myself over the holidays? I've wanted to contribute to pandas in some way for years, and this would be a nice, easy way to get my feet wet.

@jreback
Contributor

jreback commented Feb 26, 2017

moved to googleapis/python-bigquery-pandas#7

@jreback jreback closed this as completed Feb 26, 2017
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Feb 26, 2017