This is a demo that shows how to run Spark jobs on Azure Batch.
This code is NOT PRODUCTION READY!
The original code is from https://medium.com/datamindedbe/run-spark-jobs-on-azure-batch-using-azure-container-registry-and-blob-storage-10a60bd78f90 and the accompanying repo https://github.com/datamindedbe/spark_on_azure_batch_demo. All credit goes to the original author :)
- Clone this repo.
- Use the Deploy to Azure button to deploy a Batch account, a container registry, and a storage account to your subscription.
- After successful deployment, navigate to your storage account -> Containers -> titanic and upload train.csv from this repo.
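  If you prefer to script the upload, here is a minimal sketch using the `azure-storage-blob` package (v12); the package and the connection-string environment variable are assumptions, not part of this repo:

  ```python
  # Sketch: upload train.csv to the 'titanic' container with azure-storage-blob.
  # AZURE_STORAGE_CONNECTION_STRING is an assumed environment variable.
  import os
  from azure.storage.blob import BlobServiceClient

  service = BlobServiceClient.from_connection_string(
      os.environ["AZURE_STORAGE_CONNECTION_STRING"]
  )
  blob = service.get_blob_client(container="titanic", blob="train.csv")
  with open("train.csv", "rb") as f:
      blob.upload_blob(f, overwrite=True)
  ```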
- Create a Python environment and install the requirements from requirements.txt:

  ```
  pip install -r requirements.txt
  ```

- Install Jupyter in your environment (`conda install -y jupyter` or `pip install jupyter`).
- Copy template.config.py to config.py and add your credentials and resource names.
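  The authoritative field list is in template.config.py; the names below are only a hypothetical illustration of the kind of values you will fill in:

  ```python
  # config.py -- hypothetical field names for illustration only;
  # use the actual names from template.config.py.
  BATCH_ACCOUNT_NAME = "mybatchaccount"
  BATCH_ACCOUNT_KEY = "<batch-account-key>"
  BATCH_ACCOUNT_URL = "https://mybatchaccount.westeurope.batch.azure.com"
  STORAGE_ACCOUNT_NAME = "mystorageaccount"
  STORAGE_ACCOUNT_KEY = "<storage-account-key>"
  ACR_SERVER = "sparkonbatch.azurecr.io"
  ACR_USERNAME = "sparkonbatch"
  ACR_PASSWORD = "<acr-password>"
  ```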
- Adjust the code in titanic_analytics.py (this is the code that runs on each Batch node).
- Update `file` and `output` to match your storage account (e.g. `wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/train.csv`).
- Adjust the Spark query and logging to suit your needs.
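  For orientation, here is a minimal sketch of the kind of job titanic_analytics.py runs; the storage-key config and the Titanic column names are assumptions, and the script in this repo may differ:

  ```python
  # Minimal PySpark sketch: read train.csv from Blob Storage over wasbs://,
  # run a simple aggregation, and write the result back.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("titanic_spark_on_batch_demo")
      # Hadoop-Azure storage key config; account name and key are placeholders.
      .config(
          "fs.azure.account.key.YOURSTORAGEACCOUNTNAME.blob.core.windows.net",
          "<storage-account-key>",
      )
      .getOrCreate()
  )

  file = "wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/train.csv"
  output = "wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/output"

  df = spark.read.csv(file, header=True, inferSchema=True)
  # Example query: survival rate per passenger class (Titanic columns assumed).
  result = df.groupBy("Pclass").avg("Survived")
  result.write.mode("overwrite").csv(output)
  ```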
- Build your Docker image and push it to your registry (replace `sparkonbatch` with your ACR name; you may need to sign in first with `az acr login --name sparkonbatch`):

  ```
  docker build -t sparkonbatch/titanic_spark_on_batch_demo .
  docker tag sparkonbatch/titanic_spark_on_batch_demo:latest sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo:latest
  docker push sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo:latest
  ```
- Add the image name to your config.py (in this case: `sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo`).
- Run `titanic-demo.ipynb` in Jupyter Notebook.
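  Under the hood, the notebook submits a containerized task to your Batch pool. Here is a heavily simplified sketch using the `azure-batch` SDK; the account values and the pool/job/task ids are assumptions, not the notebook's exact code:

  ```python
  # Simplified sketch of submitting a containerized task to Azure Batch.
  # Account values and pool/job/task ids are placeholders for illustration.
  from azure.batch import BatchServiceClient
  from azure.batch.batch_auth import SharedKeyCredentials
  import azure.batch.models as batchmodels

  credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
  client = BatchServiceClient(
      credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
  )

  # Attach a job to an existing pool.
  client.job.add(
      batchmodels.JobAddParameter(
          id="titanic-job",
          pool_info=batchmodels.PoolInformation(pool_id="titanic-pool"),
      )
  )

  # Run the analytics script inside the image pushed to ACR.
  client.task.add(
      job_id="titanic-job",
      task=batchmodels.TaskAddParameter(
          id="titanic-task",
          command_line="python titanic_analytics.py",
          container_settings=batchmodels.TaskContainerSettings(
              image_name="sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo"
          ),
      ),
  )
  ```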
- (optional) Check the logs (stderr.txt) of a task in the Azure Portal.
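  You can also fetch the log programmatically; a sketch reusing the `client` and the ids from the snippet above:

  ```python
  # Sketch: stream a task's stderr.txt through the Batch SDK.
  stream = client.file.get_from_task("titanic-job", "titanic-task", "stderr.txt")
  print(b"".join(stream).decode())
  ```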
When you're done, scale the pool to zero (or delete it). You can scale the pool manually via the Azure Portal or by defining an autoscale formula.
You can delete jobs and pools via the Jupyter notebook (the last two lines) or the Azure Portal.
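For reference, both cleanup paths with the Batch SDK, reusing `client` and `batchmodels` from the submission sketch above (ids are the same assumptions):

```python
# Sketch: scale the pool down to zero dedicated nodes...
client.pool.resize(
    "titanic-pool",
    batchmodels.PoolResizeParameter(target_dedicated_nodes=0),
)
# ...or delete the job and pool outright.
client.job.delete("titanic-job")
client.pool.delete("titanic-pool")
```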
You don't need to delete the Azure resources if you might need them later on.