
ENH: Google BigQuery IO Module #4140

Closed
wants to merge 2 commits

Conversation

@sean-schaefer commented Jul 5, 2013:

This module adds Google BigQuery functionality to the pandas project. The work is based on the existing sql.py and ga.py, providing methods to query a BigQuery database and read the results into a pandas DataFrame.

@cpcloud reviewed pandas/io/gbq.py (outdated):
credentials: SignedJwtAssertionCredentials object
"""
try:
creds_file = open(self.private_key, 'r')

cpcloud (Member) Jul 5, 2013:

you should use a context manager here. if the next line fails then the file opened from private_key will not be closed.
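A minimal sketch of the context-manager pattern suggested here (the function name and path handling are illustrative, not the actual gbq.py code):

```python
def read_private_key(path):
    # `with` guarantees the file handle is closed even if a later
    # statement raises; a bare open() would leak the handle on error.
    with open(path, 'r') as creds_file:
        return creds_file.read()
```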

@cpcloud reviewed pandas/io/gbq.py (outdated):
http = httplib2.Http()
http = credentials.authorize(http)
return build('bigquery','v2', http=http), http, credentials
except Exception:

cpcloud (Member) Jul 5, 2013:

also see if you can narrow down the Exception to something more useful than just alerting the user to an authentication error. see if you can give more detailed information about the error
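A sketch of what narrowing the exception might look like. The exception types and the `build_service` callable are stand-ins, not the real google-api-python-client signatures:

```python
def authorize(build_service, credentials):
    # Catch the concrete failures authentication can produce and turn
    # each into a distinct, informative error instead of a blanket
    # "authentication failed".
    try:
        return build_service(credentials)
    except IOError as exc:
        raise RuntimeError('could not read credentials: %s' % exc)
    except KeyError as exc:
        raise RuntimeError('credentials missing field: %s' % exc)
```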

@cpcloud reviewed pandas/io/gbq.py (outdated):
if('rows' in query_reply):
return self._parse_data(query_reply,index_col, col_order)

except Exception:

cpcloud (Member) Jul 5, 2013:

again this is going to hide errors. ideally you want exactly one statement inside a try: suite. in this case, the call to job_collection.query could generate an error, query_reply['jobReference'] could throw a KeyError, etc.
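A sketch of the one-statement-per-try idea. `execute_query` stands in for the real `job_collection.query(...)` call, and the exception types are illustrative:

```python
def query_once(execute_query, body):
    # Only the statement that can legitimately fail lives in the
    # try-suite; the reply handling sits outside, so a bug there raises
    # its own error instead of being reported as a query failure.
    try:
        reply = execute_query(body)
    except EnvironmentError as exc:
        raise RuntimeError('query failed: %s' % exc)
    return reply.get('rows', [])
```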

@cpcloud (Member) commented Jul 5, 2013:

Also, is there any way to test this?

@jreback (Contributor) commented Jul 5, 2013:

needs docs too (release, whatsnew, io)

@cpcloud FYI ...need ga docs too

since I don't use these, what exactly does this do?

@cpcloud (Member) commented Jul 5, 2013:

me neither, i created an issue a while ago to make ga docs...i guess low-prio for now.

bigquery is a relational db engine used for querying gigantic datasets. this looks like it puts a pandas wrapper around the google bigquery query language

@cpcloud (Member) commented Jul 5, 2013:

i would say that based on the experience with the historical finance APIs that we should be very careful about network-ish APIs, insofar as they are difficult to test reliably. just my 2c

@sean-schaefer (Author) commented Jul 5, 2013:

We have test cases, but they're rather specific to our own usage. The problem is that the tests require our credentials, as well as having known datasets to compare results with. There are sample datasets provided by Google, though you'll still need an account.

Thanks for the other suggestions, we'll work on these and update when we finish.

@sean-schaefer (Author) commented Jul 5, 2013:

We updated the script with some of your error handling suggestions. Could you be more specific about what you need for documentation?

@jreback (Contributor) commented Jul 5, 2013:

A usage example in here: http://pandas.pydata.org/pandas-docs/dev/io.html#data-reader (or prob create another section to include ga as well). can you add a code-block to show how to do it (which doesn't actually execute)?

also a blurb (in enhancements) for release notes (doc/source/release.rst)

@sean-schaefer (Author) commented Jul 9, 2013:

Unfortunately, we did not include use cases for ga as well because we are not familiar with it, but we did write documentation for gbq in the io.rst and release.rst files. We also refactored the original script to improve ease of testing and committed a test suite that we've been using. There are several test cases that do require BigQuery credentials, so they will be skipped unless you hardcode those values into the script. The other tests use CSVs that we've included in the appropriate data directory.

Please let us know any other suggestions / requirements you have.

@ghost commented Jul 12, 2013:

Does anyone have more thoughts on this? During testing, we've noticed there are occasionally problems with OSX using Google's Python API (in particular the OpenSSL/PyCrypto modules) - we're investigating, since this will also affect the GA module, but it seems this may be out of our control. Otherwise, it's been fairly stable.

@sean-schaefer (Author) commented Jul 17, 2013:

When using our gbq module internally, we found that there are OpenSSL issues across platforms using the authentication method we employed. Although it works well on my local Ubuntu 12.04 system, we've had difficulty getting it to work on Snow Leopard and Windows 7.

There is another form of authentication through oauth2client, but we rejected that option because it cannot run headless and we plan on running this on an EC2 instance (users are required to grant access to the Google API through a browser window that pops up during execution). However, for use through this library, do you feel it would be better to authenticate in that manner and make it more suitable for cross-platform use? This is a question we are currently considering for internal use as well.

@jtratner reviewed doc/source/io.rst (outdated):
@@ -820,7 +812,7 @@ rows will skip the interveaing rows.
print open('mi.csv').read()
pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)

-Note: The default behavior in 0.12 remains unchanged (``tupleize_cols=True``),
+Note: The default behavior in 0.11.1 remains unchanged (``tupleize_cols=True``),

jtratner (Contributor) Jul 18, 2013:

Did you force this to be your doc when rebasing? Looks like a number of version changes...

@ghost commented Jul 18, 2013:

Although not spelled out in our various dev docs, the established norm in pandas is to
avoid placing author names in source code. This applies to everyone from wesm on down.

Contributors large and small so far have accepted git log/blame as suitable credit, and
this can be hereby elevated to the level of a project dictum IMO.

I (and I believe other pandai) feel very strongly about religiously maintaining attribution and giving
proper credit in OSS projects. The accepted form for this is an acknowledgment in the release notes
by name or GH handle. We often do this unbidden when new contributors are involved to show
our gratitude for their contribution to the project.

So, please remove the Authors line from the code and feel free to add a "thanks to @sean-schaefer, Jacob Schaer"
note to the appropriate item in RELEASE.rst file.

@azbones commented Jul 18, 2013:

As far as whether to use OpenSSL or oauth2, I would vote for oauth2. The ga.py submodule uses this currently, many users are using pandas in IPython, and it is significantly easier to install given it doesn't have all the platform-specific OpenSSL dependencies. I spent quite a bit of time trying to make this work during testing and it was quite the hassle when using OpenSSL.

Also, I will volunteer to put the documentation together. I'll just need to research how to do that as I've never contributed documentation before...

@wesm (Member) commented Jul 18, 2013:

The test file you added is pretty large. Can you trim it down in size?

@cpcloud reviewed pandas/io/gbq.py (outdated):
raise

# If execution gets here, the file could not be opened
raise IOError

cpcloud (Member) Jul 21, 2013:

maybe put the comment text in the IOError message
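What moving the comment into the error message might look like; the retry loop is illustrative, the point is only that the explanation now travels inside the IOError:

```python
def open_credentials(path, attempts=3):
    # Try a few times, then fail with a message that actually reaches
    # the user instead of living in a source comment.
    for _ in range(attempts):
        try:
            return open(path)
        except IOError:
            pass
    raise IOError('credentials file %r could not be opened' % path)
```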

@cpcloud reviewed pandas/io/gbq.py (outdated):
raise
except Exception as ex:
print 'Unknown error occurred: ', ex
raise

cpcloud (Member) Jul 21, 2013:

is this a stray raise?

i don't think this is syntactically valid in the context of the rest of the code

also, if you leave it at the scope it looks like it's at, then any code below won't be executed, since the last exception will always be raised here

@cpcloud reviewed pandas/io/gbq.py (outdated):
job_reference = query_reply['jobReference']
except KeyError as ex:
print 'The query response does not include a jobReference: ', ex
raise

cpcloud (Member) Jul 21, 2013:

this will show too much output IMO

it will show your message
then the exception's repr
then it will raise the exception again printing a traceback, which in all likelihood will hide your printed message unless you have a huge monitor

if you must do this then just raise a new exception or remove the try suite and let the error bubble up

this is all over so maybe clean that up a bit and decide where you need to raise vs. where it's ok to let the exception bubble up
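The two alternatives suggested here, as a small sketch (function names are illustrative):

```python
def job_ref(query_reply):
    # (a) replace print-then-re-raise with one new, descriptive exception
    try:
        return query_reply['jobReference']
    except KeyError:
        raise RuntimeError('query response does not include a jobReference')

def job_ref_bubble(query_reply):
    # (b) or drop the try-suite entirely and let the KeyError bubble up
    # with its own, already-informative traceback
    return query_reply['jobReference']
```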

@cpcloud reviewed pandas/io/gbq.py (outdated):
value = int(value)
elif schema == 'TIMESTAMP':
if value is not None:
value = pd.to_datetime(datetime.datetime.fromtimestamp(int(float(value))).strftime('%Y-%m-%d %H:%M:%S'))

cpcloud (Member) Jul 21, 2013:

maybe a couple of temporaries here to shorten the line, but not absolutely critical
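The long TIMESTAMP line split into named temporaries, as suggested. This sketch is stdlib-only: the original code then hands the formatted string to pd.to_datetime, which is omitted here:

```python
import datetime

def timestamp_to_string(value):
    # BigQuery returns timestamps as strings like "1373000000.0"
    seconds = int(float(value))
    stamp = datetime.datetime.fromtimestamp(seconds)
    return stamp.strftime('%Y-%m-%d %H:%M:%S')
```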

@cpcloud reviewed pandas/io/gbq.py (outdated):
df.set_index(index_col, inplace=True)
col_names.remove(index_col)
else:
print 'Index column specified does not exist in DataFrame...'

cpcloud (Member) Jul 21, 2013:

i would remove the superfluous output all around

@cpcloud reviewed pandas/io/tests/test_gbq.py (outdated):
self.assertTrue((result_frame.columns == self.correct_frame_small.columns).all(),"The test case didn't match the results")

suite = unittest.TestLoader().loadTestsFromTestCase(test_gbq)
unittest.TextTestRunner(verbosity=2).run(suite)

cpcloud (Member) Jul 21, 2013:

the above 2 lines aren't necessary with nose

maybe stick them in an if __name__ == '__main__' clause so that if i want to import a test from here it won't run the tests, but you can still run the file if you want

@cpcloud reviewed doc/source/io.rst (outdated):
-can read in a multi-index for the columns. Specifying non-consecutive
-rows will skip the interveaing rows.
+can read in a ``MultiIndex`` for the columns. Specifying non-consecutive
+rows will skip the interveaning rows.

cpcloud (Member) Jul 21, 2013:

i think this should be "interleaving"

sean-schaefer (Author) Jul 22, 2013:

I'm not sure about this - it's not something we actually added. The problem is we were asked to update our documentation to match the current version, so this appears to have been a change by someone else that was committed.

@cpcloud (Member) commented Jul 21, 2013:

Docs on the other python libraries needed to use this would be great!

@sean-schaefer (Author) commented Jul 22, 2013:

We made a few changes as suggested by @cpcloud; thank you for the feedback. We're not sure where you want documentation on library dependencies, but we did add dependencies as a note in io.rst. It should be noted that there is presently a bug in BigQuery that is preventing our module from being 100% reliable, see:

http://stackoverflow.com/questions/17751944/google-bigquery-incomplete-query-replies-on-odd-attempts/

We are planning to implement pagination for the results so that datasets can be much larger and responses more reliable.

@jreback (Contributor) commented Jul 22, 2013:

is httplib2 kind of like the requests library?

these deps need go in: http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies

and you should also have tests that skip if these are not installed
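A minimal sketch of such a skip-if-missing guard (unittest/nose era; httplib2 is used as a representative optional dependency):

```python
import unittest

try:
    import httplib2  # optional dependency
    _HAVE_HTTPLIB2 = True
except ImportError:
    _HAVE_HTTPLIB2 = False

class TestGbqDeps(unittest.TestCase):
    def setUp(self):
        # skip, rather than fail, when the optional dep is absent
        if not _HAVE_HTTPLIB2:
            raise unittest.SkipTest('httplib2 not installed')

    def test_import(self):
        self.assertTrue(_HAVE_HTTPLIB2)
```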

@sean-schaefer (Author) commented Jul 22, 2013:

Yes, httplib2 is Google's extension of the original httplib library. I imagine requests could do the same thing, but the examples and recommendations for the BigQuery API were to use httplib2.

We added the third party libraries to the install.rst, and added the import skip tests.

@jreback (Contributor) commented Jul 22, 2013:

you might already have this, but the main api call, read_gbq, needs to raise if the deps are missing. generally, try the imports at the top of the module and set a variable to indicate whether they succeeded, but don't raise there (as the api files import your main module); only raise when the user actually tries to use the feature.

see core/expressions.py for an example of doing this (that one actually falls back to different features, but it's the same idea)

basically a user will try out your function and be told, hey, I need these deps.... (rather than having pandas auto install them or failing when pandas is imported; after all, they are not required)

@jreback (Contributor) commented Oct 8, 2013:

@jtratner @cpcloud any final comments?

@sean-schaefer @jacobschaer i'll rebase this when I put it in...

@jacobschaer (Contributor) commented Oct 8, 2013:

@jreback - Sorry, I'm a little slow when it comes to git. I am pretty sure it's the way you want it now. I did separate the docs and CI stuff from the rest, though.

@jreback (Contributor) commented Oct 8, 2013:

@jacobschaer yes...looks fine now....

@jreback (Contributor) commented Oct 8, 2013:

all I need to install is: easy_install bigquery right?

[sheep-jreback-~] bq
Traceback (most recent call last):
  File "/usr/local/bin/bq", line 8, in <module>
    load_entry_point('bigquery==2.0.15', 'console_scripts', 'bq')()
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 318, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2221, in load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1954, in load
  File "build/bdist.linux-x86_64/egg/bq.py", line 39, in <module>
  File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 26, in <module>
ImportError: cannot import name discovery

@jacobschaer (Contributor) commented Oct 9, 2013:

@jreback I thought that's all I installed. Per:
http://code.google.com/p/google-bigquery-tools/source/browse/bq/README.txt

easy_install bigquery

What interpreter are you using? This might be related to:
#5116

@jreback (Contributor) commented Oct 9, 2013:

got it sorted...had an old version installed.....

bombs away....

@jreback (Contributor) commented Oct 9, 2013:

merged via these commits:

390a2d6
2c60400
6ee748e

odd that github didn't close this PR....but in master now

@jreback jreback closed this Oct 9, 2013

@jreback (Contributor) commented Oct 9, 2013:

@sean-schaefer @jacobschaer

thanks for all of your nice work on this!

the pandas/big community will be happy!

and you get to support it in perpetuity!

check out docs (built tomorrow by 5 est).

also checkout master and test...if anything pls submit a followup PR

@cpcloud (Member) commented Oct 9, 2013:

one more step toward a pandopoly!

@jreback (Contributor) commented Oct 9, 2013:

Is this supposed to be tested under py3?
does it work in py3? I seem to remember you testing/saying it was working?

@jacobschaer (Contributor) commented Oct 9, 2013:

@jreback I have not tested under Python 3, and from some tests a while ago it seemed that it can't be supported in this version of bq.py. The issue appears to be related to their handling of unicode.

@cpcloud Soon...

@jreback (Contributor) commented Oct 9, 2013:

ok can u test and see if we need a nice failure message?

@jacobschaer (Contributor) commented Oct 9, 2013:

@jreback - I'll see what I can do. I've been using a mac and have been hesitant to do Python 3.

@azbones - Good news, Google is actively working on the 100k result bug. As I understand it, no changes will need to be made to our code as it's entirely backend. We can then uncomment the test I made for this situation.

http://stackoverflow.com/questions/19145587/bq-py-not-paging-results

@jacobschaer (Contributor) commented Oct 9, 2013:

@jreback Do we have directions for getting this up and running on Python 3? We kept running into troubles with cython and various dependencies.

@jreback (Contributor) commented Oct 9, 2013:

what do you mean: installing py3? pandas? or bq? you are on linux, right?

@jacobschaer (Contributor) commented Oct 9, 2013:

I was trying to test on Linux, yes. I had set up an Ubuntu virtual machine, and we got python 3 installed, but we were having some problems building our repository. We did something to the effect of:

apt-get install python3, python3-dev, python-pandas, git, cython
git clone https://github.com/sean-schaefer/pandas.git
python pandas/setup.py develop

@cpcloud (Member) commented Oct 9, 2013:

u need to be in the pandas directory...then run python setup.py develop

@jacobschaer (Contributor) commented Oct 9, 2013:

... sorry, that's what we did. I was just typing this up from memory. We were getting errors like in:
#2439 (comment)

We were also having trouble getting all the python3 versions of things from apt-get.

@cpcloud (Member) commented Oct 9, 2013:

is there maybe a cython3 that you need instead?

@jreback (Contributor) commented Oct 9, 2013:

@jacobschaer I use pip3 for most of the pandas installs (e.g. dateutil,cython)

@jacobschaer (Contributor) commented Oct 9, 2013:

On Ubuntu 12.04, we did roughly the following...

sudo apt-get install python3, python3-dev, cython, python3-setuptools
sudo easy_install3 pip
...
pip3-2 install git+https://github.com/sean-schaefer/pandas.git

We eventually got:

Exception: Cython-generated file 'pandas/index.c' not found ...

See:
https://gist.github.com/clayton/c658f4d9e20afc635e35

@jtratner (Contributor) commented Oct 9, 2013:

I believe you can actually compile with Python 2 Cython and use with Py 3

@jreback (Contributor) commented Oct 9, 2013:

@sean-schaefer @jacobschaer

docs are built: http://pandas.pydata.org/pandas-docs/dev/io.html#google-bigquery-experimental

  • need a little blurb/example for the v0.13.0 announcements (you can paste here or put in a new PR ....just need a short section in experimental, can be the same example in the docs, this is just to give users a quick taste.)
  • I would remove this part of the docs: "The general structure of this module and its provided functions are based loosely on those in pandas.io.sql."
  • I think you can add a to_gbq method in core/generic.py that just calls gbq.to_gbq(self....)...similar to how the other to_**** methods work (e.g. see to_hdf)
  • pls add the to/read gbq to doc/source/api.rst

all of these (plus any p3k changes) can be rolled into a single PR

thanks
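The to_gbq forwarding method suggested above might look like this sketch. All names are stand-ins: `gbq_to_gbq` replaces the real pandas.io.gbq.to_gbq so the sketch runs standalone, and in real pandas the io module would be imported lazily inside the method:

```python
def gbq_to_gbq(frame, destination_table):
    # placeholder for pandas.io.gbq.to_gbq; a real call would upload
    return ('uploaded', destination_table)

class NDFrame(object):
    def to_gbq(self, destination_table, **kwargs):
        # the generic method does nothing but forward, mirroring how
        # to_hdf and friends delegate to their io modules
        return gbq_to_gbq(self, destination_table, **kwargs)
```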

@jacobschaer (Contributor) commented Oct 10, 2013:

What kind of blurb/example would you want, and what file should it go in? I came up with a moderately interesting example:

query = """SELECT station_number as STATION, month as MONTH, AVG(mean_temp) as MEAN_TEMP FROM publicdata:samples.gsod
WHERE YEAR = 2000 
GROUP BY STATION, MONTH 
ORDER BY STATION, MONTH ASC"""

df = gbq.read_gbq(query)
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pandas.concat([df2.min(), df2.mean(), df2.max()], axis=1,keys=["Min Tem", "Mean Temp", "Max Temp"])

Yields the monthly min, mean, and max US Temperatures for the year 2000 using NOAA gsod data.

         Min Tem  Mean Temp    Max Temp
MONTH                                  
1     -53.336667  39.827892   89.770968
2     -49.837500  43.685219   93.437932
3     -77.926087  48.708355   96.099998
4     -82.892858  55.070087   97.317240
5     -92.378261  61.428117  102.042856
6     -77.703334  65.858888  102.900000
7     -87.821428  68.169663  106.510714
8     -89.431999  68.614215  105.500000
9     -86.611112  63.436935  107.142856
10    -78.209677  56.880838   92.103333
11    -50.125000  48.861228   94.996428
12    -50.332258  42.286879   94.396774

As far as a blurb, perhaps:
The gbq module provides a simple way to extract from and load data into Google's BigQuery Data Sets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets.

I'll make those doc changes and put them in a separate pull request. I still have not been able to successfully test pandas in python 3. I should be able to get the other two changes done pretty quickly today.

@jreback (Contributor) commented Oct 10, 2013:

@jacobschaer is that an example for a public/sample dataset? e.g. one that could in theory be reproduced by a user? (those are the best kind!) put a small example in doc/source/v0.13.0.txt in the experimental section (fyi: be sure to pull from master as there were some edits today)

you can put larger/examples edit docs at your leisure......

@jacobschaer (Contributor) commented Oct 10, 2013:

Will do. Those are public sample datasets provided on all GBQ accounts. As long as they have a BigQuery account that has API access (I think accounts not fully setup only get web access), this will take them through the auth process if they haven't already and then work perfectly.
