Apache Spark for a Machine Learning Engineer

This repository is a collection of introductory tutorials and code samples for Apache Spark. The code samples are written in Python, so they use PySpark.

The goals are to:

Build expertise with Spark DataFrames
Read from and write to AWS S3
Apply feature engineering in Spark to data read from AWS S3
Write the resulting features back to AWS S3
Learn to use AWS EMR to run all of the above steps
Become familiar with Spark MLlib
Become familiar with Spark Structured Streaming with Kafka
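
The S3 read, feature engineering, and write-back goals above can be sketched roughly as follows. This is a minimal sketch, not the code from the repository's notebooks. The bucket name, object keys, and the `price` and `quantity` columns are hypothetical. The `s3a://` scheme assumes the Hadoop S3A connector is configured; on EMR, `s3://` paths are typically used instead.

```python
def s3_uri(bucket, key, scheme="s3a"):
    """Build an S3 URI such as s3a://my-bucket/raw/test.csv."""
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def add_features(df):
    """Add two illustrative feature columns to a Spark DataFrame.

    Assumes hypothetical numeric columns `price` and `quantity`.
    """
    # Imported lazily so the path helper above works without Spark installed.
    from pyspark.sql import functions as F

    return (
        df.withColumn("revenue", F.col("price") * F.col("quantity"))
          .withColumn("log_price", F.log1p(F.col("price")))
    )


def run(bucket="my-example-bucket"):  # hypothetical bucket name
    """Read a CSV from S3, derive features, and write them back as Parquet."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-feature-engineering").getOrCreate()

    # Read a CSV file with a header row, letting Spark infer column types.
    raw = spark.read.csv(
        s3_uri(bucket, "raw/test.csv"), header=True, inferSchema=True
    )

    features = add_features(raw)

    # Write the features back to S3, replacing any previous output.
    features.write.mode("overwrite").parquet(s3_uri(bucket, "features/"))

    spark.stop()
```

On an EMR cluster, calling `run()` from a notebook cell or a `spark-submit` script would execute the whole flow. Parquet is chosen here only as a common columnar format for feature tables; CSV output would work the same way with `features.write.csv(...)`.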

Tools used:

Apache Spark 2.4 with PySpark
AWS S3 for data storage
AWS EMR (Elastic MapReduce)
Spark DataFrame
Spark MLlib (low priority)
Spark Structured Streaming with Kafka

Reference

Spark Structured Streaming with Kafka

If you are interested in Spark Structured Streaming with Kafka, follow this Databricks tutorial. Although the tutorial is written in Scala, it translates readily to Python once you have completed the steps above in Python.
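
As a Python starting point, here is a minimal sketch of reading a Kafka topic with Structured Streaming in PySpark. It is not taken from the Databricks tutorial. The broker address `localhost:9092` and the topic name `events` are hypothetical, and the sketch assumes the `spark-sql-kafka-0-10` package matching your Spark and Scala versions is available to the job (for example via `--packages`).

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Return the option map passed to Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def stream_to_console(bootstrap_servers="localhost:9092", topic="events"):
    """Print Kafka messages to the console as they arrive.

    The default broker and topic are hypothetical placeholders.
    """
    # Imported lazily so the option helper above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    stream = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Kafka delivers key and value as binary, so cast them to strings.
    messages = stream.select(
        col("key").cast("string"),
        col("value").cast("string"),
        col("timestamp"),
    )

    query = (
        messages.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )

    # Block until the streaming query is stopped or fails.
    query.awaitTermination()
```

The console sink is used only because it is convenient for experimenting; a real pipeline would more likely write to files on S3 or to another Kafka topic.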

