
Provide doc on making the connector available in Jupyter #81

Closed
functicons opened this issue Nov 8, 2019 · 5 comments

Comments

@functicons

This was asked on SO.

@mkleinbort

mkleinbort commented Nov 11, 2019

I was the one that asked the SO question, and the notebook example helped me fix the issue.

It's probably enough to add something like:


To run this from within a Dataproc Jupyter instance, be sure to start a Python notebook (not a PySpark one) and run:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('EDA')\
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
  .getOrCreate()

You can then load data as a Spark DataFrame via:

s = spark.read \
  .format('bigquery') \
  .option('table', '{_project_}.{_database_}.{_table_name_}') \
  .load()
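For clarity, the `table` option above takes a fully qualified BigQuery table ID of the form `project.dataset.table`. A tiny helper (my own sketch, not part of the connector API) shows the expected format:

```python
def bigquery_table_id(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified table ID expected by .option('table', ...)."""
    return f"{project}.{dataset}.{table}"

# Example with made-up names:
print(bigquery_table_id("my_project", "all_data", "stores"))
# → my_project.all_data.stores
```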

@mkleinbort

That said, I'm seeing terrible performance when I try to run anything.

Even something like

s = spark.read.format('bigquery').option('table', 'my_project.all_data.stores').load()
s.take(1)

is taking minutes... and then crashing.

Also

s.columns

results in a crash.

That said, that has nothing to do with this issue. I'll do some googling and then ask again.

@mkleinbort

A correction to the above: I ran

%%time
s.columns

and got an answer, but it took ~5 minutes. The delay was not registered by the %%time magic, which clocked a wall time of 57.5 µs.

@mkleinbort

mkleinbort commented Nov 11, 2019

Just found that changing things to

s = spark.read \
  .format('bigquery') \
  .option('project', '{_project_name_}') \
  .option('table', '{_database_}.{_table_name_}') \
  .load()
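In other words, the workaround splits the fully qualified ID into a separate `project` option and a `dataset.table` string. A small helper (the name is hypothetical, not part of the connector) sketches that split:

```python
def split_table_id(qualified: str) -> tuple:
    """Split 'project.dataset.table' into ('project', 'dataset.table'),
    matching the separate .option('project', ...) / .option('table', ...) calls."""
    project, rest = qualified.split(".", 1)
    return project, rest

# Example with made-up names:
print(split_table_id("my_project.all_data.stores"))
# → ('my_project', 'all_data.stores')
```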

resolved many of my issues, though

s.take(1) 

results in

Server Connection Error
Invalid response: 504

Py4JJavaError: An error occurred while calling o119.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 1.0 [...]: java.lang.ClassNotFoundException:
com.google.cloud.spark.bigquery.direct.BigQueryPartition
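A `ClassNotFoundException` like the one above typically means the connector jar never reached the executors. As a quick sanity check (a sketch in plain Python; the helper name is my own), one could confirm the connector jar appears in the comma-separated value that `spark.conf.get("spark.jars")` would return:

```python
def has_bigquery_connector(spark_jars: str) -> bool:
    """Return True if any jar in a comma-separated spark.jars value
    looks like the spark-bigquery connector."""
    return any("spark-bigquery" in jar for jar in spark_jars.split(","))

# Example value as configured in the session above:
print(has_bigquery_connector("gs://spark-lib/bigquery/spark-bigquery-latest.jar"))
# → True
```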

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jan 29, 2020
Issue #81: Added documentation for using with Jupyter, and general documentation regarding the usage with different Scala versions
davidrabinowitz added a commit that referenced this issue Jan 29, 2020
* Issue #81: Added documentation for using with Jupyter, and general documentation regarding the usage with different Scala versions

* Changed latest jar to https URL rather than gs per @medb suggestion