This is a demo that shows how to run Spark jobs on Azure Batch.
This code is NOT PRODUCTION READY!
The original code is from https://medium.com/datamindedbe/run-spark-jobs-on-azure-batch-using-azure-container-registry-and-blob-storage-10a60bd78f90 and the accompanying repo https://github.com/datamindedbe/spark_on_azure_batch_demo. All credit goes to the original author :)
- Clone this repo.
- Use the Deploy to Azure button to deploy a Batch account, a container registry, and a storage account to your subscription.
- After successful deployment, navigate to your storage account -> Containers -> titanic and upload train.csv from this repo.
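  If you prefer to script the upload, here is a minimal sketch using the `azure-storage-blob` package (v12); the package and the connection-string environment variable are assumptions, not part of this repo:

  ```python
  # Sketch: upload train.csv to the 'titanic' container with azure-storage-blob.
  # AZURE_STORAGE_CONNECTION_STRING is an assumed environment variable.
  import os
  from azure.storage.blob import BlobServiceClient

  service = BlobServiceClient.from_connection_string(
      os.environ["AZURE_STORAGE_CONNECTION_STRING"]
  )
  blob = service.get_blob_client(container="titanic", blob="train.csv")
  with open("train.csv", "rb") as f:
      blob.upload_blob(f, overwrite=True)
  ```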
- Create a Python environment and install the requirements from requirements.txt:

  ```
  pip install -r requirements.txt
  ```

- Install Jupyter in your environment (`conda install -y jupyter` or `pip install jupyter`).
- Copy template.config.py to config.py and add your credentials and resource names.
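  The authoritative field list is in template.config.py; the names below are only a hypothetical illustration of the kind of values you will fill in:

  ```python
  # config.py -- hypothetical field names for illustration only;
  # use the actual names from template.config.py.
  BATCH_ACCOUNT_NAME = "mybatchaccount"
  BATCH_ACCOUNT_KEY = "<batch-account-key>"
  BATCH_ACCOUNT_URL = "https://mybatchaccount.westeurope.batch.azure.com"
  STORAGE_ACCOUNT_NAME = "mystorageaccount"
  STORAGE_ACCOUNT_KEY = "<storage-account-key>"
  ACR_SERVER = "sparkonbatch.azurecr.io"
  ACR_USERNAME = "sparkonbatch"
  ACR_PASSWORD = "<acr-password>"
  ```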
- Adjust the code in titanic_analytics.py (this is the code that runs on each Batch node).
- Update `file` and `output` to match your storage account (e.g. `wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/train.csv`).
- Adjust the Spark query and logging to suit your needs.
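  For orientation, here is a minimal sketch of the kind of job titanic_analytics.py runs; the storage-key config and the Titanic column names are assumptions, and the script in this repo may differ:

  ```python
  # Minimal PySpark sketch: read train.csv from Blob Storage over wasbs://,
  # run a simple aggregation, and write the result back.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("titanic_spark_on_batch_demo")
      # Hadoop-Azure storage key config; account name and key are placeholders.
      .config(
          "fs.azure.account.key.YOURSTORAGEACCOUNTNAME.blob.core.windows.net",
          "<storage-account-key>",
      )
      .getOrCreate()
  )

  file = "wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/train.csv"
  output = "wasbs://containername@YOURSTORAGEACCOUNTNAME.blob.core.windows.net/output"

  df = spark.read.csv(file, header=True, inferSchema=True)
  # Example query: survival rate per passenger class (Titanic columns assumed).
  result = df.groupBy("Pclass").avg("Survived")
  result.write.mode("overwrite").csv(output)
  ```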
- Build your Docker image and push it to your registry (replace `sparkonbatch` with your ACR name; you may need to sign in first with `az acr login --name sparkonbatch`):

  ```
  docker build -t sparkonbatch/titanic_spark_on_batch_demo .
  docker tag sparkonbatch/titanic_spark_on_batch_demo:latest sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo:latest
  docker push sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo:latest
  ```
- Add the image name to your config.py (in this case: `sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo`).
- Run `titanic-demo.ipynb` in Jupyter Notebook.
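  Under the hood, the notebook submits a containerized task to your Batch pool. Here is a heavily simplified sketch using the `azure-batch` SDK; the account values and the pool/job/task ids are assumptions, not the notebook's exact code:

  ```python
  # Simplified sketch of submitting a containerized task to Azure Batch.
  # Account values and pool/job/task ids are placeholders for illustration.
  from azure.batch import BatchServiceClient
  from azure.batch.batch_auth import SharedKeyCredentials
  import azure.batch.models as batchmodels

  credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
  client = BatchServiceClient(
      credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
  )

  # Attach a job to an existing pool.
  client.job.add(
      batchmodels.JobAddParameter(
          id="titanic-job",
          pool_info=batchmodels.PoolInformation(pool_id="titanic-pool"),
      )
  )

  # Run the analytics script inside the image pushed to ACR.
  client.task.add(
      job_id="titanic-job",
      task=batchmodels.TaskAddParameter(
          id="titanic-task",
          command_line="python titanic_analytics.py",
          container_settings=batchmodels.TaskContainerSettings(
              image_name="sparkonbatch.azurecr.io/sparkonbatch/titanic_spark_on_batch_demo"
          ),
      ),
  )
  ```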
- (optional) Check the logs (stderr.txt) of a task in the Azure Portal.
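  You can also fetch the log programmatically; a sketch reusing the `client` and the ids from the snippet above:

  ```python
  # Sketch: stream a task's stderr.txt through the Batch SDK.
  stream = client.file.get_from_task("titanic-job", "titanic-task", "stderr.txt")
  print(b"".join(stream).decode())
  ```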
When you're done, scale the pool to zero (or delete it). You can scale the pool manually via the Azure Portal or by defining an autoscale formula.
You can delete jobs and pools via the Jupyter notebook (the last two lines) or the Azure Portal.
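For reference, both cleanup paths with the Batch SDK, reusing `client` and `batchmodels` from the submission sketch above (ids are the same assumptions):

```python
# Sketch: scale the pool down to zero dedicated nodes...
client.pool.resize(
    "titanic-pool",
    batchmodels.PoolResizeParameter(target_dedicated_nodes=0),
)
# ...or delete the job and pool outright.
client.job.delete("titanic-job")
client.pool.delete("titanic-pool")
```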
You don't need to delete the Azure resources if you might need them later on.