
How to use graphframes in a Jupyter notebook by referencing graphframes.jar #104

Open
slavakx opened this issue Aug 17, 2016 · 17 comments

@slavakx commented Aug 17, 2016

I'd like to use it locally in a Jupyter notebook. I've downloaded graphframes.jar and created a PYSPARK_SUBMIT_ARGS variable that references the jar.
The import from graphframes import * works, but the call g = GraphFrame(v, e) fails with:
Py4JJavaError: An error occurred while calling o57.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI

Operating system: Windows
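For reference, PYSPARK_SUBMIT_ARGS is read when the SparkContext's JVM is launched, so it has to be set before the context is created, and it must end with pyspark-shell. A minimal sketch of the pattern (the jar path is illustrative, not taken from this report):

import os
# Must be set before the SparkContext is created; "pyspark-shell" must close the string.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars C:/jars/graphframes.jar pyspark-shell"
import pyspark
sc = pyspark.SparkContext()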

@twesthead commented Aug 24, 2016

I'm having a similar problem while following two different guides:
Guide 1: https://www.mapr.com/blog/using-spark-graphframes-analyze-facebook-connections
Guide 2 (the most basic one): http://graphframes.github.io/quick-start.html

As soon as I call GraphFrame(v, e) I get:
java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI

The only way my setup differs from Guide 2 is that I don't have the graphframes.jar (I can't find it). I use https://github.com/graphframes/graphframes/archive/master.tar.gz instead, unpack it, and from graphframes import * works fine.

@leoneu commented Aug 30, 2016

Not pretty, but I managed to get graphframes running with Jupyter + Python:

mkdir ~/jupyter
cd ~/jupyter
wget https://github.com/graphframes/graphframes/archive/release-0.2.0.zip
unzip release-0.2.0.zip
cd graphframes-release-0.2.0
build/sbt assembly
cd ..

# Copy necessary files to root level so we can start pyspark. 
cp graphframes-release-0.2.0/target/scala-2.11/graphframes-release-0-2-0-assembly-0.2.0-spark2.0.jar .
cp -r graphframes-release-0.2.0/python/graphframes .

# Set environment to use Jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

# Launch the jupyter server.
pyspark --jars graphframes-release-0-2-0-assembly-0.2.0-spark2.0.jar
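
Once the notebook is up, a quick way to confirm the jar actually reached the JVM classpath is to look the class up through the py4j gateway (a sketch; sc is the SparkContext the pyspark shell creates):

# Raises a Py4JJavaError wrapping ClassNotFoundException if the jar is missing.
sc._jvm.java.lang.Class.forName("org.graphframes.GraphFramePythonAPI")
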
@jkbradley (Member) commented Sep 28, 2016

@slavakx Did this solution work for you?

@slavakx (Author) commented Oct 2, 2016

@jkbradley Unfortunately not quite.

I am getting the following exception on g = GraphFrame(vertices, edges):
java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logging

even though I am running pyspark as
pyspark --jars ../jars/graphframes-0.2.0-spark2.0-s_2.11.jar;../jars/scala-logging-slf4j_2.11-2.1.2.jar

(I thought that referencing scala-logging-slf4j would resolve the issue)
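
Note that --jars takes a comma-separated list; an unquoted semicolon is treated by the shell as a command separator, so the second jar above probably never reached Spark. A hedged sketch of the same launch from inside a notebook (set before the SparkContext exists):

import os
# Comma, not semicolon, between the jar paths.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars ../jars/graphframes-0.2.0-spark2.0-s_2.11.jar,"
    "../jars/scala-logging-slf4j_2.11-2.1.2.jar pyspark-shell"
)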

@sbromberger commented Oct 3, 2016

Possible relationship to #109?

@thunterdb (Contributor) commented Jan 10, 2017

@slavakx did the cleanup in #109 work for you?

@slavakx (Author) commented Jan 10, 2017

@thunterdb Nope. Even after installing the latest Spark (2.1.0) and the latest graphframes library, running

pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11

I still get:
Py4JJavaError: An error occurred while calling o69.newInstance.
: java.lang.NoClassDefFoundError: com/typesafe/scalalogging/slf4j/LazyLogging

@thunterdb added this to the 0.3.1 milestone Jan 10, 2017

@AbdealiJK commented Mar 2, 2017

I seem to have a similar issue where I do:

os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars /Users/ajk/spark/commons-csv-1.2.jar,/Users/ajk/spark/graphframes-0.3.0-spark2.0-s_2.10.jar,/Users/ajk/spark/spark-csv_2.10-1.5.0.jar pyspark-shell'

in a Jupyter notebook and then import pyspark. After that, I can use spark-csv correctly, but import graphframes does not work.

Using Spark 2.0.1 with graphframes-0.3.0-spark2.0-s_2.10.jar from spark-packages.

@AbdealiJK commented Mar 2, 2017

Follow-up:
I tried copying graphframes' python/graphframes folder, as mentioned by @leoneu in #104 (comment), and it worked.

This works with both Python 2.7 and Python 3.5.
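
Copying the folder works because it puts the graphframes package somewhere Python already searches (the notebook's working directory). An equivalent that avoids the copy, assuming the unpacked release tree from @leoneu's recipe (the path is illustrative):

import sys
# Point at the directory that CONTAINS the graphframes package, not the package itself.
sys.path.insert(0, "/Users/ajk/spark/graphframes-release-0.2.0/python")
from graphframes import GraphFrame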

@minicat commented Mar 4, 2017

I encountered similar errors with PySpark and was able to solve them by passing the jar with --py-files as well as with --jars, e.g.:

spark-submit --jars $JAR_PATH --py-files $JAR_PATH
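
The two flags cover the two halves of the problem: --jars puts the GraphFramePythonAPI class on the JVM classpath, and --py-files lets Python import the graphframes package bundled inside the same jar. Roughly the same pairing can be expressed when building the session (a sketch only; spark.jars and spark.submit.pyFiles are standard Spark configs, but submit-time configs set programmatically are not honored in every deployment, and the path is illustrative):

from pyspark.sql import SparkSession

JAR_PATH = "/path/to/graphframes.jar"  # illustrative
spark = (SparkSession.builder
    .config("spark.jars", JAR_PATH)            # JVM side
    .config("spark.submit.pyFiles", JAR_PATH)  # Python side
    .getOrCreate())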

@jkbradley (Member) commented Mar 27, 2017

@minicat Thanks for pointing out the need for the --py-files flag! @slavakx and @AbdealiJK, does that fix it for you?

@slavakx (Author) commented Mar 28, 2017

The updated version of graphframes finally works for me:

pyspark --packages graphframes:graphframes:0.4.0-spark2.1-s_2.11
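
--packages resolves the transitive dependencies the bare jar was missing (the com/typesafe/scalalogging classes from the errors above), which is plausibly why this finally worked. For a notebook that imports pyspark directly rather than going through the pyspark script, the same launch can be expressed through the environment (a sketch; set before the SparkContext is created):

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.4.0-spark2.1-s_2.11 pyspark-shell"
)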

@YuanEric88 commented Mar 5, 2018

@slavakx Hi, my system is Windows. Could you tell me how to install graphframes on a Windows machine? Thanks a lot!

@VeLKerr commented Sep 9, 2018

I did it in the following 3 steps:

  1. Built graphframes from source (as @leoneu advised).
  2. Started the Jupyter instance using the following command:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="*" --port=<PORT>' pyspark2 --jars /opt/my_path/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar --master=yarn --num-executors=<N>
  3. In the notebook, appended the jar file to the SparkContext:
sc.addPyFile('/opt/my_path/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar')
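
This works because a jar is a zip archive and the assembly bundles the graphframes Python sources, which sc.addPyFile then puts on Python's search path. The import has to come after the addPyFile call:

from graphframes import GraphFrame  # only works after sc.addPyFile(...) above
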
@Cyb3rWard0g commented Mar 8, 2019

If you are using a PySpark kernel for Jupyter, you can do it the following way. I am using:

  • Spark 2.4.0
  • Graphframes 0.7.0
  • notebook=5.7.5
  • jupyterlab=0.35.4
  • Miniconda3
  • path-to-spark: /opt/helk

Steps:

  1. Download the graphframes 0.7.0 jar:
wget https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11.jar -P /path-to-spark/spark/jars/
  2. Copy graphframes*.jar from your Spark jars folder to a zip file (do not delete the jar file; just copy it). A jar is technically a zip file, so you can give the copy a .zip extension:
cp /path-to-spark/spark/jars/graphframes*.jar /path-to-spark/spark/graphframes.zip
  3. Add graphframes.zip to your PYTHONPATH. I am using a kernel.json file for PySpark in Jupyter, and mine looks like this:
{
  "display_name": "PySpark_Python3",
  "language": "python",
  "argv": [
   "/opt/conda/bin/python3",
   "-m",
   "ipykernel_launcher",
   "-f",
   "{connection_file}"
  ],
  "env": {
   "SPARK_HOME": "/path-to-spark/spark/",
   "PYTHONPATH": "/path-to-spark/spark/python/:/path-to-spark/spark/python/lib/py4j-0.10.7-src.zip:/path-to-spark/spark/graphframes.zip",
   "PYSPARK_PYTHON": "/opt/conda/bin/python3"
  }
}
  4. Go to your notebook and select your PySpark kernel.
  5. Create a SparkSession instance:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("HELK Graphs") \
    .enableHiveSupport() \
    .getOrCreate()
  6. Import graphframes and test:
from graphframes import *
from pyspark.sql.functions import *

v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

g = GraphFrame(v, e)

g.edges.filter("relationship = 'follow'").count()
g.inDegrees.show()

That should work.
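
For reference, the expected results are deterministic from the three edges above (a derivation from the sample data, not captured output):

# edges: a->b (friend), b->c (follow), c->b (follow)
g.edges.filter("relationship = 'follow'").count()  # returns 2
g.inDegrees.show()  # b has inDegree 2 (from a and c), c has inDegree 1 (from b)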


@krisneupane commented Mar 11, 2019

Thanks @Cyb3rWard0g. What is the purpose of the *.zip file, since the *.jar is already available in the jars directory?

@Cyb3rWard0g commented Mar 11, 2019

Hey @krisneupane, that is a good question. I add the .zip copy to the PYTHONPATH in my kernel file to also enable the Python bindings needed for graphframes to work. It is similar to using the --py-files flag. If I manually download the jar and just add it to the spark.jars config or the ~/spark/jars folder, Python cannot find the graphframes package and the import fails.

If I add only the graphframes .zip file to the PYTHONPATH, I get the java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI error message.

However, with the jar file in the ~/spark/jars folder and its .zip copy on the Spark PYTHONPATH, everything works: Python finds the graphframes modules and the JVM finds the GraphFrame class. It is like using --jars and --py-files together.
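
One way to see why a single file can play both roles: a jar is a zip archive, and (as the --py-files behavior above implies) the release jar bundles the graphframes Python sources. A quick check, assuming the .zip copy from the earlier steps:

import zipfile
with zipfile.ZipFile("/path-to-spark/spark/graphframes.zip") as jar:
    # Should list the bundled Python modules, e.g. graphframes/graphframe.py
    print([n for n in jar.namelist() if n.endswith(".py")])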
