
Provide doc on making the connector available in Jupyter #81

Closed
functicons opened this issue Nov 8, 2019 · 5 comments

Comments

@functicons

This was asked on SO.

@mkleinbort

mkleinbort commented Nov 11, 2019

I was the one that asked the SO question, and the notebook example helped me fix the issue.

It's probably enough to add something like:


To run this from within a Dataproc Jupyter instance, be sure to start a Python notebook (not a PySpark one) and run:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('EDA')\
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
  .getOrCreate()

You can then load data as a Spark DataFrame via:

s = spark.read \
  .format('bigquery') \
  .option('table', '{_project_}.{_database_}.{_table_name_}') \
  .load()
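For clarity, the `table` option above takes a fully qualified BigQuery table ID of the form `project.dataset.table`. A tiny helper (my own sketch, not part of the connector API) shows the expected format:

```python
def bigquery_table_id(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified table ID expected by .option('table', ...)."""
    return f"{project}.{dataset}.{table}"

# Example with made-up names:
print(bigquery_table_id("my_project", "all_data", "stores"))
# → my_project.all_data.stores
```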

@mkleinbort

That said, I'm seeing terrible performance when I try to run anything.

Even something like

s = spark.read.format('bigquery').option('table', 'my_project.all_data.stores').load()
s.take(1)

is taking minutes... and then crashing.

Also

s.columns

results in a crash.

That said, that has nothing to do with this issue. I'll do some googling and then ask again.

@mkleinbort

A correction to the above: I ran

%%time
s.columns

and got an answer, but it took ~5 minutes. The delay was not registered by the %%time magic, which clocked a wall time of 57.5 µs.

@mkleinbort

mkleinbort commented Nov 11, 2019

Just found that changing things to

s = spark.read \
  .format('bigquery') \
  .option('project', '{_project_name_}') \
  .option('table', '{_database_}.{_table_name_}') \
  .load()
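In other words, the workaround splits the fully qualified ID into a separate `project` option and a `dataset.table` string. A small helper (the name is hypothetical, not part of the connector) sketches that split:

```python
def split_table_id(qualified: str) -> tuple:
    """Split 'project.dataset.table' into ('project', 'dataset.table'),
    matching the separate .option('project', ...) / .option('table', ...) calls."""
    project, rest = qualified.split(".", 1)
    return project, rest

# Example with made-up names:
print(split_table_id("my_project.all_data.stores"))
# → ('my_project', 'all_data.stores')
```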

resolved many of my issues, though

s.take(1) 

results in

Server Connection Error
Invalid response: 504

Py4JJavaError: An error occurred while calling o119.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 1.0 [...]: java.lang.ClassNotFoundException:
com.google.cloud.spark.bigquery.direct.BigQueryPartition
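A `ClassNotFoundException` like the one above typically means the connector jar never reached the executors. As a quick sanity check (a sketch in plain Python; the helper name is my own), one could confirm the connector jar appears in the comma-separated value that `spark.conf.get("spark.jars")` would return:

```python
def has_bigquery_connector(spark_jars: str) -> bool:
    """Return True if any jar in a comma-separated spark.jars value
    looks like the spark-bigquery connector."""
    return any("spark-bigquery" in jar for jar in spark_jars.split(","))

# Example value as configured in the session above:
print(has_bigquery_connector("gs://spark-lib/bigquery/spark-bigquery-latest.jar"))
# → True
```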

davidrabinowitz added a commit to davidrabinowitz/spark-bigquery-connector that referenced this issue Jan 29, 2020
Issue #81: Added documentation for using with Jupyter, and general documentation regarding the usage with different Scala versions
davidrabinowitz added a commit that referenced this issue Jan 29, 2020
* Issue #81: Added documentation for using with Jupyter, and general documentation regarding the usage with different Scala versions

* Changed latest jar to https URL rather than gs per @medb suggestion