magrathj/Pyspark-Docker-Image-With-Azure-Gen2-Connection


Docker Pyspark With Azure Gen2

Building a Docker image with PySpark and an Azure Gen2 storage connection, enabling local testing of your data lake with PySpark.

Docker Image CI

Build Docker image

    docker build -t sparklocal .
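The repository's Dockerfile is not reproduced here, but a minimal sketch of such an image might look like the following. The base image, the PySpark version, and the jar versions/paths are all assumptions to adapt — the key idea is installing PySpark plus the `hadoop-azure` (ABFS) and `azure-storage` jars so Spark can talk to ADLS Gen2:

```dockerfile
# Sketch only: versions are illustrative; match the jar versions to the
# Hadoop build your PySpark ships with (PySpark 3.1.x bundles Hadoop 3.2.0).
FROM python:3.8-slim

# Spark needs a JVM
RUN apt-get update && apt-get install -y --no-install-recommends default-jre && \
    rm -rf /var/lib/apt/lists/*

# PySpark itself
RUN pip install --no-cache-dir pyspark==3.1.2

# Drop the ADLS Gen2 (ABFS) connector jars into PySpark's jars directory
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.2.0/hadoop-azure-3.2.0.jar \
    /usr/local/lib/python3.8/site-packages/pyspark/jars/
ADD https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar \
    /usr/local/lib/python3.8/site-packages/pyspark/jars/

WORKDIR /app
COPY . /app
```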

Run docker image

    docker run -it sparklocal:latest /bin/bash

Run docker image in Visual Studio Code

Install Docker.

Install the extension that lets you develop inside the container in another window (Remote - Containers).

Test pyspark locally

Type pyspark into the terminal; you should get a Spark shell, with the Spark UI available at localhost:4040

    pyspark 

(Screenshot: PySpark running locally.)

Test docker image

Run the Python script to test the PySpark installation

    python test_spark.py
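The contents of test_spark.py are not shown in this README; a minimal sketch of such a smoke test (file name and contents assumed) would create a local SparkSession and round-trip a tiny DataFrame:

```python
# Hypothetical sketch of test_spark.py: verify that a local SparkSession
# starts inside the container and that a trivial DataFrame works.

def sample_rows():
    # Tiny in-memory dataset used for the smoke test
    return [("alice", 1), ("bob", 2), ("carol", 3)]


def main():
    # Import kept inside main so the helper above is usable without Spark installed
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # run Spark inside the container, no cluster needed
        .appName("test_spark")
        .getOrCreate()
    )
    df = spark.createDataFrame(sample_rows(), ["name", "value"])
    df.show()
    assert df.count() == 3
    spark.stop()


if __name__ == "__main__":
    main()
```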


Run the Python script to test the connection to Azure Data Lake Storage (ADLS) Gen2

    python test_adls.py
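The contents of test_adls.py are likewise not shown; a minimal sketch (the account name, container, file path, and environment variable below are all placeholders to substitute with your own) would configure the ABFS driver with a storage account key and read a file from the lake:

```python
# Hypothetical sketch of test_adls.py: point Spark at an ADLS Gen2 account
# using an account key (the simplest option for local testing), then read.

def abfss_path(container: str, account: str, path: str) -> str:
    # Build an abfss:// URI in the form the ABFS driver expects
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


def main():
    import os
    from pyspark.sql import SparkSession

    account = "mystorageaccount"    # placeholder storage account name
    container = "mycontainer"       # placeholder container (filesystem) name

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("test_adls")
        # spark.hadoop.* configs are passed through to the Hadoop filesystem layer
        .config(
            f"spark.hadoop.fs.azure.account.key.{account}.dfs.core.windows.net",
            os.environ["STORAGE_ACCOUNT_KEY"],   # placeholder env var for the key
        )
        .getOrCreate()
    )

    df = spark.read.csv(abfss_path(container, account, "raw/sample.csv"), header=True)
    df.show()
    spark.stop()


if __name__ == "__main__":
    main()
```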

(Screenshot: reading from the lake.)

References

Databricks Docker deployments

Apache Spark Docker deployments

Delta: connecting to ADLS Gen2

Connecting to ADLS Gen1 locally

Developing inside a Docker container using Visual Studio Code
