docker-pyspark-dev


Dockerfile for developing Python applications and libraries for Apache Spark.

This Dockerfile was created to aid with local testing and continuous integration of applications that target Spark and Hadoop clusters running on Ubuntu machines. It is not intended for production use.

To get started, check out the example below.

Example usage

This example shows how to configure testing for a newly created Python app.

Create a new poetry project:

poetry new my-app
cd my-app

Add pyspark to your dependencies, and pytest and pytest-spark to your dev-dependencies, in pyproject.toml:

[tool.poetry.dependencies]
python = "^3.7"
pyspark = "^2.4.5"

[tool.poetry.dev-dependencies]
pytest = "^5.2"
pytest-spark = "^0.6.0"
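Alternatively, you can let poetry edit pyproject.toml for you. The commands below are a sketch assuming a poetry 1.x CLI, where --dev puts packages into dev-dependencies:

poetry add "pyspark@^2.4.5"
poetry add --dev "pytest@^5.2" "pytest-spark@^0.6.0"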

In tests/test_my_app.py, add a test that checks whether pyspark works correctly:

def test_spark(spark_context):
    # spark_context is a pytest fixture provided by pytest-spark plugin
    assert spark_context.parallelize(range(3)).collect() == [0, 1, 2]
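If you prefer to test against the DataFrame API, pytest-spark also provides a spark_session fixture. The test below is a minimal sketch using it (the data and column names are arbitrary):

def test_spark_dataframe(spark_session):
    # spark_session is a SparkSession fixture provided by the pytest-spark plugin
    df = spark_session.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    assert df.count() == 2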

Create a Dockerfile for your project:

FROM kowaalczyk/pyspark-dev:latest

WORKDIR /usr/src/my-app

# copy poetry settings and install dependencies first (so that this layer is cached)
# note: poetry.lock is created by `poetry install` / `poetry lock` / `poetry add`,
# so run one of these locally before building the image
ADD pyproject.toml poetry.lock ./
RUN poetry install --no-interaction --no-ansi

# add project source code and tests
# (`poetry new my-app` creates the package directory as my_app)
ADD tests ./tests
ADD my_app ./my_app
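To keep the build context small, you can also add a .dockerignore next to the Dockerfile. The entries below are only a sketch; adjust them to your project:

.git
.venv
__pycache__/
*.pyc
.pytest_cache/
dist/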

Build the image:

docker build -t my-app .

Run poetry commands in the docker container:

docker run my-app:latest poetry run pytest

You should see output from pytest in your terminal (hopefully saying that all tests passed).
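Any other poetry command can be run the same way. For example (a sketch; the exact test path depends on your project layout), to run a single test file verbosely or to open an interactive Python shell inside the container:

docker run my-app:latest poetry run pytest tests/test_my_app.py -v
docker run -it my-app:latest poetry run python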

The example is based on the spark-minimal-algorithms Python package; see that project's GitHub repo for examples of more advanced usage.
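Since the image is also meant for continuous integration, a CI job usually just repeats the build-and-test commands above. The snippet below is a minimal sketch of a GitHub Actions workflow; the image is not tied to any particular CI provider, and my-app is a placeholder name:

# .github/workflows/tests.yml
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: docker build -t my-app .
      - run: docker run my-app:latest poetry run pytest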

Details

  • Apache Spark version: 2.4.5 (latest stable version at the time of writing)
  • Hadoop version: 2.7 (the pre-built Hadoop version that Spark 2.4.5 ships with)
  • Python version: 3.7.3 (Spark 2.4.5 is not compatible with Python 3.8)

Legal note

This project is neither maintained by nor associated with the Apache Software Foundation.

Apache Hadoop, Hadoop, Apache, the Apache feather logo, and the Apache Hadoop project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.

Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.