
refactor(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally #9997

Merged · 1 commit · Jan 15, 2020

Conversation

@tswast (Contributor) commented Dec 18, 2019

…to_arrow + to_pandas, internally

Towards https://issuetracker.google.com/140579733
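Roughly, the idea behind the change (a minimal sketch of the approach, not the library's actual code; record_batches and schema are assumed to come from an Arrow-format read stream): rather than converting each Arrow record batch to a pandas DataFrame and concatenating the DataFrames, combine the batches into a single pyarrow Table and convert once.

import pandas as pd
import pyarrow as pa

def to_dataframe_per_batch(record_batches):
    # Previous-style path: convert each batch separately, then concatenate
    # at the pandas level (extra copies and per-column combination work).
    frames = [batch.to_pandas() for batch in record_batches]
    return pd.concat(frames, ignore_index=True)

def to_dataframe_via_arrow(record_batches, schema):
    # New-style path: combine at the Arrow level, then call to_pandas() once.
    table = pa.Table.from_batches(record_batches, schema=schema)
    return table.to_pandas()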

@tswast requested a review from a team on December 18, 2019 20:02
@googlebot added the "cla: yes" label on Dec 18, 2019
@tswast added the "api: bigquerystorage" label and removed the "cla: yes" label on Dec 18, 2019
@googlebot added the "cla: yes" label on Dec 18, 2019
@tswast requested a review from plamut on December 19, 2019 16:25
@tswast requested a review from shollyman on January 7, 2020 17:20
@plamut (Contributor) left a comment

Looks good to me code-wise, but I still need to verify the performance gain on a "big" table (ran out of time in this review pass 😃 ).

@plamut (Contributor) commented Jan 11, 2020

Unfortunately, I cannot confirm the speedup; the timings with and without this fix are about the same on my machine (i7, 16 GB RAM, ~25 Mbps network).

That is probably because I also wasn't able to reproduce the speed difference in the first place using the sample from the issue description (tested on a 2 × 10e6 table with random floats).

@tswast What data did you use in testing? Maybe I can try with that ...

@tswast (Contributor, Author) commented Jan 13, 2020

I'm getting a more modest speed-up than I was this past summer, too, but it's at least some speedup. The taxi cab data has a wider variety of data types. I wonder if the latest pandas does a better job of combining floating-point columns with concatenate now?

From a VM in us-central:

sudo apt-get install python3-pip git
pip3 install --upgrade pandas
pip3 install --upgrade pyarrow
pip3 install --upgrade google-cloud-core
pip3 install --upgrade google-api-core[grpc]
git clone https://github.com/tswast/google-cloud-python.git
cd google-cloud-python/bigquery_storage
git checkout b140579733-bq-to_dataframe
pip3 install -e .
cd ../..

benchmark_bqstorage.py

import sys
from google.cloud import bigquery_storage_v1beta1

# TODO(developer): Set the project_id variable.
# The read session is created in this project. This project can be
# different from that which contains the table.
project_id = 'swast-scratch'

client = bigquery_storage_v1beta1.BigQueryStorageClient()

# This example reads baby name data from the public datasets.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "swast-scratch"
table_ref.dataset_id = "to_dataframe_benchmark"
table_ref.table_id = "tlc_green_{}pct".format(sys.argv[1])

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID),
    requested_streams=1,
)  # API request.

reader = client.read_rows(
    bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
)

dataframe = reader.rows(session).to_dataframe()
print("Got {} rows.".format(len(dataframe.index)))

After this change:

swast@instance-1:~$ time python3 benchmark_bqstorage.py 6
Got 950246 rows.

real    0m7.219s
user    0m3.296s
sys     0m1.564s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 12_5
Got 1980854 rows.

real    0m10.859s
user    0m5.952s
sys     0m2.156s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 25
Got 3963696 rows.

real    0m22.156s
user    0m11.188s
sys     0m4.340s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 50
Got 7917713 rows.

real    0m44.529s
user    0m21.124s
sys     0m8.976s

Before this change:

cd google-cloud-python
git checkout master
cd ..

swast@instance-1:~$ time python3 benchmark_bqstorage.py 6
Got 950246 rows.

real    0m7.301s
user    0m6.572s
sys     0m1.148s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 12_5
Got 1980854 rows.

real    0m13.595s
user    0m12.584s
sys     0m2.056s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 25
Got 3963696 rows.

real    0m26.442s
user    0m24.552s
sys     0m3.904s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 50
Got 7917713 rows.

real    0m50.157s
user    0m48.556s
sys     0m7.876s

@tswast changed the title from "fix(bigquerystorage): to_dataframe on an arrow stream uses 2x faster …" to "fix(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" on Jan 13, 2020
@tswast changed the title from "fix(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" to "refactor(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" on Jan 13, 2020
@tswast requested a review from plamut on January 13, 2020 20:09
@tswast (Contributor, Author) commented Jan 13, 2020

I made the swast-scratch.to_dataframe_benchmark.tlc_green_6pct, etc. tables publicly readable, so you should be able to reproduce my results as well.

I've updated the PR title to remove the "2x", as it's a more modest 1.126x speedup.
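For reference, that figure comes from the 50pct wall-clock timings above; a quick calculation using the numbers already posted (nothing new measured):

# Speedup estimate from the 50pct timings above.
before_real, after_real = 50.157, 44.529  # wall-clock seconds: master vs. this PR
before_user, after_user = 48.556, 21.124  # CPU (user) seconds: master vs. this PR
print("wall-clock speedup: {:.3f}x".format(before_real / after_real))  # ~1.126x
print("CPU-time speedup:   {:.3f}x".format(before_user / after_user))  # ~2.30x

The wall-clock gain is modest because much of the run is spent on network transfer, but the user CPU time still drops by roughly 2.3x.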

@plamut (Contributor) commented Jan 15, 2020

@tswast Thanks for the data, and the benchmarking results!

I was finally able to reproduce a significant difference pretty consistently (~33 seconds vs. ~35 seconds for the 6pct test, for example).

As this was much slower than your results, it became apparent that network I/O dominated the timings on my end (I'm on a 50 Mbps fiber-optic connection). By shutting down everything that might have affected the average download speed, I was able to bring the variance down enough not to lose the signal in the noise (I initially didn't expect such measures would be necessary, based on the reported 2x gains 🙂).

@plamut (Contributor) left a comment

The speedup is indeed noticeable 👍

@OmarMAmin commented
@tswast
Sorry for asking here; I'm not sure if there's somewhere else to ask. Does this mean that the results from here are no longer valid?
https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171

Is using to_dataframe() equivalent to to_arrow().to_pandas() if pyarrow is installed, and is fastavro no longer a requirement?

@tswast (Contributor, Author) commented Dec 12, 2022

Yes, that's true. We should see similar performance for to_dataframe() compared to to_arrow().to_pandas() after this PR.
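For anyone reading later, the two paths look roughly like this (a sketch reusing the names from the benchmark script above; a read stream is consumed as you iterate it, so in practice you would pick one of the two calls rather than running both on the same stream):

rows = reader.rows(session)

# Direct conversion; after this PR it uses to_arrow() + to_pandas() internally
# when the session was created with the ARROW data format.
dataframe = rows.to_dataframe()

# Explicit two-step conversion via Arrow, which should now perform about the same.
# dataframe = rows.to_arrow().to_pandas()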
