Spark, Python, AWS EMR, MLlib, Spark Streaming, Spark SQL

nahidalam/Spark

Apache Spark for a Machine Learning Engineer

This Git repo is a collection of introductory tutorials and code samples for Apache Spark. The code samples are written in Python, so they use PySpark.

The goals are to:

  • Build expertise with Spark DataFrames
  • Read from and write to AWS S3
  • Apply feature engineering in Spark to the data read from AWS S3
  • Write the features back to AWS S3
  • Learn to use AWS EMR to execute all of the above steps
  • Become familiar with Spark MLlib
  • Become familiar with Spark Structured Streaming with Kafka
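The first four goals can be sketched as a single small job. This is a minimal illustration, not code from this repo: the bucket name, the file paths, the `amount` column, and the two derived features are all assumptions chosen for the example. The functions take an existing `SparkSession`, so the module itself can be imported without Spark installed.

```python
def s3_uri(bucket, key, scheme="s3"):
    """Build an S3 URI.

    EMR resolves "s3://" through EMRFS; a self-managed Spark install with the
    hadoop-aws connector typically uses "s3a://" instead.
    """
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def add_features(df):
    """Hypothetical feature step on an assumed numeric `amount` column."""
    return df.selectExpr(
        "*",
        "log1p(amount) AS amount_log",             # compress a skewed range
        "CAST(amount > 1000 AS INT) AS is_large",  # simple threshold flag
    )


def run_job(spark, bucket):
    """Read raw CSV from S3, derive features, write Parquet back to S3."""
    raw = spark.read.csv(
        s3_uri(bucket, "raw/transactions.csv"), header=True, inferSchema=True
    )
    features = add_features(raw)
    features.write.mode("overwrite").parquet(
        s3_uri(bucket, "features/transactions")
    )
    return features


# To run on a cluster (requires pyspark and S3 credentials):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("s3-features").getOrCreate()
#   run_job(spark, "example-bucket")  # "example-bucket" is a placeholder
```

Writing the features as Parquet rather than CSV keeps the column types, which usually makes the later MLlib steps simpler.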

Tools used:

  • Apache Spark 2.4 with PySpark
  • AWS S3 for data storage
  • AWS EMR (Elastic MapReduce)
  • Spark DataFrames
  • Spark MLlib (low priority)
  • Spark Structured Streaming with Kafka
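One common way to run a PySpark script on an existing EMR cluster is to submit it as a step. The sketch below assumes the third-party `boto3` SDK is installed and configured with AWS credentials. The step name, script location, cluster ID, and region are placeholders, not values from this repo.

```python
def spark_submit_step(name, script_uri, script_args=()):
    """Build an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                *script_args,
            ],
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Submit the step to a running cluster and return the new step ID."""
    import boto3  # third-party AWS SDK, assumed installed and configured

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Example (placeholder values):
#   step = spark_submit_step(
#       "features", "s3://example-bucket/jobs/features.py", ["example-bucket"]
#   )
#   submit_step("j-XXXXXXXXXXXXX", step)
```

Keeping the step definition in its own function makes it easy to inspect or log before anything is sent to AWS.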

Reference

Spark Structured Streaming with Kafka

If you are interested in Spark Structured Streaming with Kafka, follow this Databricks tutorial. Although the tutorial is written in Scala, it translates readily to Python once you have completed the steps above in PySpark.
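
As a rough idea of what the Python version looks like, here is a minimal sketch of reading a Kafka topic with Structured Streaming and printing it to the console. It is not taken from the tutorial. The broker address and topic name are placeholders, and it assumes Spark was started with the Kafka connector package (for Spark 2.4, something like `--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0`, adjusted to your Scala and Spark versions).

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame with key and value decoded as strings."""
    stream = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary, so cast them before use.
    return stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )


def write_to_console(stream):
    """Start a query that prints each micro-batch; useful for debugging only."""
    return stream.writeStream.format("console").outputMode("append").start()


# Example (placeholder broker and topic; requires pyspark and a Kafka broker):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("kafka-demo").getOrCreate()
#   opts = kafka_source_options("localhost:9092", "events")
#   query = write_to_console(read_kafka_stream(spark, opts))
#   query.awaitTermination()
```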
