pandas.io.gbq Version 2 #6937

Merged
2 commits merged on Jun 30, 2014
+661 −948
@@ -4,7 +4,6 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
-bigquery==2.0.17
@jreback

jreback Apr 23, 2014

Contributor

don't you still need the bigquery package so that bq is installed? (or is that in the google-api-python-client?) What a horrible package name, google!

@jreback

jreback Apr 23, 2014

Contributor

also, if requirements are changing, please update install.rst as well

@jacobschaer

jacobschaer Apr 23, 2014

Contributor

bigquery is only required for the to_gbq() test suite, which can't be run in CI anyway due to the lack of a valid project id. Will update install.rst soon.

numexpr==1.4.2
sqlalchemy==0.7.1
pymysql==0.6.0
@@ -19,5 +19,7 @@ lxml==3.2.1
scipy==0.13.3
beautifulsoup4==4.2.1
statsmodels==0.5.0
-bigquery==2.0.17
boto==2.26.1
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2
@@ -112,7 +112,9 @@ Optional Dependencies
:func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
distributions will have xclip and/or xsel immediately available for
installation.
- * `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
+ * Google's `python-gflags` and `google-api-python-client`
@jreback

jreback Jun 5, 2014

Contributor

add httplib2 here as well

+ * Needed for :mod:`~pandas.io.gbq`
+ * `httplib2`
* Needed for :mod:`~pandas.io.gbq`
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:
@@ -3373,83 +3373,79 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape derived from the source table. Additionally,
-DataFrames can be uploaded into BigQuery datasets as tables
-if the source datatypes are compatible with BigQuery ones.
+DataFrame with a shape and data types derived from the source table.
+Additionally, DataFrames can be appended to existing BigQuery tables if
+the destination table is the same shape as the DataFrame.
For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
-As an example, suppose you want to load all data from an existing table
-: `test_dataset.test_table`
-into BigQuery and pull it into a DataFrame.
+As an example, suppose you want to load all data from an existing BigQuery
+table, `test_dataset.test_table`, into a DataFrame using the :func:`~pandas.io.read_gbq`
@jorisvandenbossche

jorisvandenbossche Jun 10, 2014

Owner

pandas.io.read_gbq -> pandas.read_gbq

+function.
.. code-block:: python
-
- from pandas.io import gbq
-
# Insert your BigQuery Project ID Here
- # Can be found in the web console, or
- # using the command line tool `bq ls`
+ # Can be found in the Google web console
projectid = "xxxxxxxx"
- data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+ data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
-The user will then be authenticated by the `bq` command line client -
-this usually involves the default browser opening to a login page,
-though the process can be done entirely from command line if necessary.
-Datasets and additional parameters can be either configured with `bq`,
-passed in as options to `read_gbq`, or set using Google's gflags (this
-is not officially supported by this module, though care was taken
-to ensure that they should be followed regardless of how you call the
-method).
+You will then be authenticated to the specified BigQuery account
+via Google's OAuth2 mechanism. In general, this is as simple as following the
+prompts in a browser window that will be opened for you. Should the browser not
+be available, or fail to launch, a code will be provided to complete the process
+manually. Additional information on the authentication mechanism can be found
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.
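+
+If you use multiple Google accounts, a fresh authentication flow can be
+forced with the ``reauth`` keyword. A minimal sketch; this flag is
+documented for ``to_gbq`` below and is assumed here to be accepted by
+``read_gbq`` as well:
+
+.. code-block:: python
+
+   # discard cached credentials and redo the OAuth2 flow,
+   # e.g. to switch to a different Google account
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
+                            project_id = projectid, reauth = True)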
-Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+You can define which column from BigQuery to use as an index in the
+destination DataFrame as well as a preferred column order as follows:
.. code-block:: python
- data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+ data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
index_col='index_column_name',
- col_order='[col1, col2, col3,...]', project_id = projectid)
-
-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+ col_order=['col1', 'col2', 'col3'], project_id = projectid)
+
+Finally, you can append data to a BigQuery table from a pandas DataFrame
+using the :func:`~pandas.io.to_gbq` function. This function uses the
+Google streaming API which requires that your destination table exists in
+BigQuery. Because the table must already exist, your DataFrame must
+match the destination table in column order, structure, and data types.
@jorisvandenbossche

jorisvandenbossche Jun 10, 2014

Owner

and is it then appended? (not fully clear to me, previously you had fail/replace/append, now only one default action?)

@jacobschaer

jacobschaer Jun 11, 2014

Contributor

The other actions were a benefit of relying on bq.py in the past. While possible to do strictly with the API, it's a lot of code for very little benefit. The data is strictly appended which was, at least in our experience, the most common use case.

@jorisvandenbossche

jorisvandenbossche Jun 11, 2014

Owner

Ah, I missed the rather obvious "you can append data using to_gbq()" part.

So OK, no problem here. But maybe add it more explicitly in the docstring of to_gbq as well?

+DataFrame indexes are not supported. By default, rows are streamed to
+BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+via the ``chunksize`` argument. You can also see the progress of your
+post via the ``verbose`` flag, which defaults to ``True``. The HTTP
+response code from Google BigQuery can indicate success (200) even if the
+append failed. For this reason, if there is a failure to append to the
+table, the complete error response from BigQuery is returned, which
+can be quite long given that it provides a status for each row. You may want
+to start with smaller chunks to verify that the size and types of your
+DataFrame match those of the destination table; this makes debugging simpler.
.. code-block:: python
df = pandas.DataFrame({'string_col_name' : ['hello'],
'integer_col_name' : [1],
'boolean_col_name' : [True]})
- schema = ['STRING', 'INTEGER', 'BOOLEAN']
- data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
- if_exists='fail', schema = schema, project_id = projectid)
-
-To add more rows to this, simply:
-
-.. code-block:: python
-
- df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
- 'integer_col_name' : [2],
- 'boolean_col_name' : [False]})
- data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
+ df.to_gbq('my_dataset.my_table', project_id = projectid)
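+
+The ``chunksize`` and ``verbose`` options described above can also be
+passed explicitly. A sketch (a smaller chunk size can make schema
+mismatches easier to debug):
+
+.. code-block:: python
+
+   # stream rows in chunks of 500 and report progress along the way
+   df.to_gbq('my_dataset.my_table', project_id = projectid,
+             chunksize = 500, verbose = True)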
-.. note::
+The BigQuery SQL query language has some oddities; see `here <https://developers.google.com/bigquery/query-reference>`__.
- A default project id can be set using the command line:
- `bq init`.
+While BigQuery uses SQL-like syntax, it has some important differences
+from traditional databases in functionality, in API limitations (on the size
+and quantity of queries or uploads), and in how Google charges for use of the
+service. You should refer to the Google documentation often, as the service
+is changing and evolving. BigQuery is best for analyzing large sets of
+data quickly, but it is not a direct replacement for a transactional database.
- There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
- see `here <https://developers.google.com/bigquery/query-reference>`__
-
- You can access the management console to determine project id's by:
- <https://code.google.com/apis/console/b/0/?noredirect>
+You can access the management console to determine project IDs at:
+<https://code.google.com/apis/console/b/0/?noredirect>
.. warning::
- To use this module, you will need a BigQuery account. See
- <https://cloud.google.com/products/big-query> for details.
-
- As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
- but any client changes will not make it into 0.13.1. See:
- http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+ To use this module, you will need a valid BigQuery account. See
+ <https://cloud.google.com/products/big-query> for details on the
+ service.
.. _io.stata:
@@ -154,14 +154,11 @@ Performance
Experimental
~~~~~~~~~~~~
-``pandas.io.data.Options`` has gained a ``get_all_data method``, and now consistently returns a multi-indexed ``DataFrame`` (:issue:`5602`). See :ref:`the docs<remote_data.yahoo_options>`
-
- .. ipython:: python
-
- from pandas.io.data import Options
- aapl = Options('aapl', 'yahoo')
- data = aapl.get_all_data()
- data.iloc[0:5, 0:5]
+- ``io.gbq.read_gbq`` and ``io.gbq.to_gbq`` were refactored to remove the
+ dependency on the Google ``bq.py`` command line client. This submodule
+ now uses ``httplib2`` and the Google ``apiclient`` and ``oauth2client`` API client
+ libraries, which should be more stable, and therefore more reliable, than
+ ``bq.py`` (:issue:`6937`).
@jorisvandenbossche

jorisvandenbossche Jun 10, 2014

Owner

Can you maybe add a more elaborate description of what actually changed in the API? So from the point of view of someone who was already using these functions: what do they have to adapt in their code? (maybe an example of a function call now)

@jacobschaer

jacobschaer Jun 11, 2014

Contributor

Would example code be appropriate in this file? If so, @azbones and I can come up with something. The comment about the API client was more of a reference to the back-end implementation, though, as you noted above, there are a few minor changes that pandas users will face.

@jorisvandenbossche

jorisvandenbossche Jun 11, 2014

Owner

Yes, you can certainly put some example code in the whatsnew file. And maybe also summarize the interface changes (some keywords removed, ..)

.. _whatsnew_0141.bug_fixes:
@@ -669,47 +669,43 @@ def to_dict(self, outtype='dict'):
else: # pragma: no cover
raise ValueError("outtype %s not understood" % outtype)
- def to_gbq(self, destination_table, schema=None, col_order=None,
- if_exists='fail', **kwargs):
+ def to_gbq(self, destination_table, project_id=None, chunksize=10000,
+ verbose=True, reauth=False):
"""Write a DataFrame to a Google BigQuery table.
- If the table exists, the DataFrame will be appended. If not, a new
- table will be created, in which case the schema will have to be
- specified. By default, rows will be written in the order they appear
- in the DataFrame, though the user may specify an alternative order.
+ THIS IS AN EXPERIMENTAL LIBRARY
+
+ If the table exists, the DataFrame will be written to the table using
+ the defined table schema and column types. For simplicity, this method
+ uses the Google BigQuery streaming API and sends the data in chunks of
+ 10,000 rows by default. Failures return the complete error
+ response, which can be quite long depending on the size of the insert.
+ There are several important limitations of the Google streaming API
+ which are detailed at:
+ https://developers.google.com/bigquery/streaming-data-into-bigquery.
Parameters
- ---------------
+ ----------
destination_table : string
- name of table to be written, in the form 'dataset.tablename'
- schema : sequence (optional)
- list of column types in order for data to be inserted, e.g.
- ['INTEGER', 'TIMESTAMP', 'BOOLEAN']
- col_order : sequence (optional)
- order which columns are to be inserted, e.g. ['primary_key',
- 'birthday', 'username']
- if_exists : {'fail', 'replace', 'append'} (optional)
- - fail: If table exists, do nothing.
- - replace: If table exists, drop it, recreate it, and insert data.
- - append: If table exists, insert data. Create if does not exist.
- kwargs are passed to the Client constructor
-
- Raises
- ------
- SchemaMissing :
- Raised if the 'if_exists' parameter is set to 'replace', but no
- schema is specified
- TableExists :
- Raised if the specified 'destination_table' exists but the
- 'if_exists' parameter is set to 'fail' (the default)
- InvalidSchema :
- Raised if the 'schema' parameter does not match the provided
- DataFrame
+ Name of table to be written, in the form 'dataset.tablename'
+ project_id : str
+ Google BigQuery Account project ID.
+ chunksize : int (default 10000)
+ Number of rows to be inserted in each chunk from the dataframe.
+ verbose : boolean (default True)
+ Show percentage complete
+ reauth : boolean (default False)
+ Force Google BigQuery to reauthenticate the user. This is useful
+ if multiple accounts are used.
+
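+ Examples
+ --------
+ A minimal sketch; assumes the destination table already exists in
+ BigQuery with a schema matching the DataFrame.
+
+ >>> projectid = "xxxxxxxx"  # your Google BigQuery project id
+ >>> df.to_gbq('my_dataset.my_table', project_id=projectid)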
"""
from pandas.io import gbq
- return gbq.to_gbq(self, destination_table, schema=None, col_order=None,
- if_exists='fail', **kwargs)
+ return gbq.to_gbq(self, destination_table, project_id=project_id,
+ chunksize=chunksize, verbose=verbose,
+ reauth=reauth)
@classmethod
def from_records(cls, data, index=None, exclude=None, columns=None,