
Change to_gbq() from stream to batch loading #14670

Closed
nicku33 opened this issue Nov 16, 2016 · 6 comments


@nicku33
Contributor

nicku33 commented Nov 16, 2016


Recently we discovered that streaming loads into Google BigQuery tables are best-effort and can be delayed by up to 90 minutes, with no transactional guarantees. As such, to_gbq() returning is no guarantee that the data is queryable.

I propose that we switch the loading from tabledata.insertAll() to a batch BigQuery load job, either via an HTTP upload or by pushing to a Cloud Storage bucket first. The latter is less attractive because it also requires bucket permissions, but I'm not sure about the size limits for the HTTP upload version.

However, atomicity seems important for the case where we immediately run a BigQuery query against the results just uploaded with to_gbq(), and that seems worth a tradeoff in load time.

At the least we could include an option.
Thoughts?

(not sure how to label this IO:google)
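For context, this is roughly the call shape of the streaming path that to_gbq() relies on today, written directly against google-api-python-client. This is a minimal sketch, not the actual pandas internals; the project/dataset/table names are placeholders.

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

# Streaming insert: rows go through tabledata.insertAll
rows = [{'json': {'a': i, 'b': i}} for i in range(3)]
response = bigquery_service.tabledata().insertAll(
    projectId='sample-gcp-project',  # placeholder
    datasetId='tmp',                 # placeholder
    tableId='tmp',                   # placeholder
    body={'rows': rows},
).execute()

# The call returns as soon as the rows are buffered, but they may not be
# visible to queries for a while, which is the problem described above.
print(response.get('insertErrors', []))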

Output of pd.show_versions()

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche
Member

cc @parthea

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Hey, just want to pick up this discussion again. We have had some more confirmed cases of streamed data not appearing in a timely manner. The delay is not deterministic; it depends on Google. Speaking to Jordan Tigrani, he said the only guarantee they provide is 90 minutes, though typical delays are much shorter, obviously.

I think most users would expect to_gbq() to be transactional. The ability to query directly after to_gbq() returns is probably a common use case. For example, one may use pandas to build a scoring table, push it to BigQuery, and then run a query over a huge dataset joined to that scoring table, where the large table is far too big to pull into local memory.
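As a concrete (hypothetical) example of that workflow, with project and table names invented for illustration:

import pandas as pd

# Push a small scoring table, then immediately run a server-side query that
# joins it against a table far too large to pull into local memory.
# All project and table names here are made up for illustration.
scores = pd.DataFrame({'user_id': [1, 2, 3], 'score': [0.1, 0.7, 0.4]})
pd.io.gbq.to_gbq(scores, 'tmp.scores', 'sample-gcp-project',
                 if_exists='replace')

# With streaming inserts, this query can silently see a stale or empty
# tmp.scores table even though to_gbq() has already returned.
query = """
SELECT e.user_id, e.event_type, s.score
FROM big_dataset.events AS e
JOIN tmp.scores AS s ON e.user_id = s.user_id
"""
result = pd.io.gbq.read_gbq(query, 'sample-gcp-project')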

At work we've written a helper function that uses the non-streaming load, and most people are avoiding to_gbq() because of this issue.

I see two tacks:
a) push Google to provide better bounds on streaming insertion
b) rewrite it as a BigQuery load job

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated configuration.

Thoughts?
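For concreteness, here is a rough sketch of what option (b) could look like with google-api-python-client, assuming the DataFrame has already been written as CSV to a bucket. All project, bucket, and table names are placeholders, and the extra bucket permissions are exactly the config burden mentioned above.

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

job_body = {
    'configuration': {
        'load': {
            # assumes the CSV was uploaded to this bucket beforehand
            'sourceUris': ['gs://some-bucket/scores.csv'],
            'sourceFormat': 'CSV',
            'destinationTable': {
                'projectId': 'sample-gcp-project',
                'datasetId': 'tmp',
                'tableId': 'tmp',
            },
            # WRITE_TRUNCATE behaves like if_exists='replace'
            'writeDisposition': 'WRITE_TRUNCATE',
            # schema autodetection keeps the sketch short; a real patch would
            # build the schema from the DataFrame dtypes instead
            'autodetect': True,
        },
    }
}
job = bigquery_service.jobs().insert(
    projectId='sample-gcp-project', body=job_body).execute()
print(job['jobReference']['jobId'])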

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Here's code to see the behaviour:

import time
import pandas as pd
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import random 

project_id = 'sample-gcp-project'
credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)
bq_tablename = 'tmp.tmp'

# Replace the table ad nauseam
k = 0
while True:

    n = random.randint(1000, 10000)
    df = pd.DataFrame({'a': xrange(n), 'b': xrange(n)})
    pd.io.gbq.to_gbq(df, bq_tablename, project_id, if_exists='replace')

    # Did it work?
    while True:
        check_query = 'SELECT COUNT(1) as ct FROM ' + bq_tablename
        df_check = pd.io.gbq.read_gbq(check_query, project_id)
        n_check = df_check['ct'].values[0]
        if n_check == len(df):
            print 'Success!'
            break
        else:
            print 'Crap, table length is {0} not {1}.'.format(n_check, len(df))
        k += 1
        print '(k = %s) Sleeping 10 seconds before retrying the check or uploading again.' % (k)
        time.sleep(10)

@max-sixty
Contributor

Agree in principle that a batch upload is a much better fit for this than a streaming load

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated configuration.

Where do you see that? While you could load into Google Cloud Storage and then from there into BigQuery, you can also do a single POST request: https://cloud.google.com/bigquery/loading-data-post-request
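Roughly along these lines (a hedged sketch with google-api-python-client, not the final pandas implementation): the CSV bytes go in the request body as a media upload, so no bucket is involved. Names and the schema are placeholders, and error handling is omitted.

import io
import time

import pandas as pd
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)

df = pd.DataFrame({'a': range(5), 'b': range(5)})
csv_bytes = io.BytesIO(df.to_csv(index=False, header=False).encode('utf-8'))

job_body = {
    'configuration': {
        'load': {
            'destinationTable': {
                'projectId': 'sample-gcp-project',  # placeholder
                'datasetId': 'tmp',                 # placeholder
                'tableId': 'tmp',                   # placeholder
            },
            'sourceFormat': 'CSV',
            'schema': {'fields': [
                {'name': 'a', 'type': 'INTEGER'},
                {'name': 'b', 'type': 'INTEGER'},
            ]},
            'writeDisposition': 'WRITE_TRUNCATE',  # like if_exists='replace'
        },
    }
}
media = MediaIoBaseUpload(csv_bytes, mimetype='application/octet-stream',
                          resumable=True)
job = bigquery_service.jobs().insert(
    projectId='sample-gcp-project', body=job_body,
    media_body=media).execute()

# Poll until the load job is DONE; once it is, the table is fully queryable,
# which is the atomicity guarantee that streaming inserts lack.
while True:
    status = bigquery_service.jobs().get(
        projectId='sample-gcp-project',
        jobId=job['jobReference']['jobId']).execute()
    if status['status']['state'] == 'DONE':
        break
    time.sleep(1)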

@nicku33
Contributor Author

nicku33 commented Dec 13, 2016

Ah, you're right. I misunderstood.
I will look into what their POST size limits are, to be safe.

If we agree that batch is better, does anyone object if I try to write this patch myself over the holidays? I've wanted to contribute to pandas in some way for years, and this would be a nice, easy way to get my feet wet.

@jreback
Contributor

jreback commented Feb 26, 2017

moved to googleapis/python-bigquery-pandas#7

@jreback jreback closed this as completed Feb 26, 2017
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Feb 26, 2017