jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on Amazon s3 #127
Testing in docker on a single 32-core Amazon EC2 c3.8xlarge instance as:
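(The exact command isn't preserved above; a typical invocation for this image, along the lines of the docker-stacks README, would be:)

```bash
# assumed invocation: run the all-spark-notebook image, exposing Jupyter on 8888
docker run -d -p 8888:8888 jupyter/all-spark-notebook
```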
Start a new Python 2 notebook. Jupyter displays the new notebook.
The first cell runs fine:
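(A minimal reconstruction of that first cell, assuming the usual standalone SparkContext setup; the credential values are placeholders:)

```python
import pyspark

# local standalone context inside the container
sc = pyspark.SparkContext('local[*]')

# hand the AWS credentials to Hadoop's s3n connector (placeholders)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
```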
The second cell has an error:
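(The cell and traceback aren't quoted above; the classic form of this failure, with a hypothetical bucket and key, looks like:)

```python
# the read is lazy, so nothing actually touches S3 until the count() action
rdd = sc.textFile("s3n://some-bucket/some-key.csv")
rdd.count()
# -> java.io.IOException: No FileSystem for scheme: s3n
```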
Of course, the error points at the count, as Spark evaluates lazily and only reads the file once an action forces it.
AFAIK Hadoop provides Spark's S3 reader, and in 2015 there was an SO question about a similar error on Hadoop:
where the solution was to include an omitted jar file in the Hadoop classpath.
Looking around in the container with a shell:
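(The commands aren't quoted above; hypothetical inspection steps for the Spark 1.x layout would be:)

```bash
# open a shell in the running container (container id is a placeholder)
docker exec -it <container-id> bash

# search the bundled jars for the s3n filesystem class
for jar in $SPARK_HOME/lib/*.jar; do
  unzip -l "$jar" | grep -q NativeS3FileSystem && echo "$jar"
done
```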
I assume a workaround may be to run a full Spark cluster (say, with either the Amazon GUI or the scripts provided in Spark) and connect the Jupyter docker container to it using the provided instructions, but I have not had time to try that. It would be nice if the S3 reader worked in the standalone case.
The Spark 1.6.0 package used in the stack is prebuilt against Hadoop 2.6. It looks like someone filed a defect against Spark 1.4.1 about the missing jars/config:
It was closed with the statement that the problem is upstream in Hadoop:
which refers to this mailing list post:
So either the S3 classes are already in the image but not configured for use (e.g., not on the classpath), or they're not there at all and need to be added as well as configured.
That sounds consistent with some articles I found addressing use of s3 with spark.
A Stack Overflow answer from June 2015 suggests the issue lay with the answerer's Spark build against Hadoop 2.6, and that switching to a Spark built against Hadoop 2.4 fixed the problem for him.
I may not have time to look into this any time soon, but maybe this info will be useful to others.
I got this to work:
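(The original code block isn't preserved above; reconstructing from the rest of the thread, where the jars later show up in jovyan's Ivy cache, it was along these lines — the package versions are assumptions known to pair with a Hadoop 2.x build, not necessarily the ones actually used:)

```python
import os

# pull the AWS SDK and hadoop-aws jars in at context startup
# (versions here are assumptions)
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.7.4,'
    'org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext('local[*]')

# wire up the s3n filesystem implementation and credentials (placeholders)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

sc.textFile("s3n://some-bucket/some-key.csv").count()  # hypothetical path
```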
If you want to create a Docker image that loads this config automatically, you can do:
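(The Dockerfile isn't quoted above; assuming the PYSPARK_SUBMIT_ARGS approach from the previous comment, one way to bake it in would be:)

```dockerfile
FROM jupyter/all-spark-notebook

# assumption: bake the submit args into the image so every new kernel
# picks up the AWS jars without per-notebook setup
ENV PYSPARK_SUBMIT_ARGS="--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell"
```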
@RobinL I've been hitting my head against this issue for a few hours and decided to try your solution, since it seems to be the most robust approach. I built the image and got it running, but when I try to access anything on S3 I get a "Socket not created by this factory" error.
I was wondering whether your solution is still working for you or if you have found another way around this problem?
Another update on this:
In addition to running Spark in JupyterLab, I have been trying to set up two additional docker commands that let us run Spark scripts in the same environment, sketched below: (1) an interactive pyspark shell from a bash prompt, and (2) spark-submit of a standalone Python script.
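(The exact commands aren't preserved above; based on the rest of the comment they were along these lines — image name, start.sh entry point, and script path are assumptions:)

```bash
# (1) hypothetical: interactive pyspark shell in the same image
docker run -it --rm jupyter/all-spark-notebook start.sh pyspark

# (2) hypothetical: spark-submit of a standalone script in the same image
docker run -it --rm -v "$PWD":/tmp/work jupyter/all-spark-notebook \
  start.sh spark-submit /tmp/work/my_script.py
```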
This works in Jupyter but not for (1) and (2) above.
After lots of digging, I figured out why: the jars are installed under a path like /home/jovyan/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar.
Jupyter ‘knows’ about the jovyan folder, but the bash prompt and python script don’t, so (1) and (2) can’t see the files in /home/jovyan and therefore go and download them again.
The solution is to include this:
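(The snippet isn't quoted above; my reconstruction, assuming the fix is to point Spark's Ivy cache at jovyan's directory via spark.jars.ivy:)

```python
import os

# assumption: make non-Jupyter entry points resolve the already-downloaded
# jars from jovyan's Ivy cache instead of re-fetching them on every run
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--conf spark.jars.ivy=/home/jovyan/.ivy2 '
    '--packages com.amazonaws:aws-java-sdk:1.7.4,'
    'org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
)
```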
in any scripts which run outside of Jupyter.