spark-on-lambda

This is boilerplate code to help build AWS Lambda based Spark jobs; jobs can be written in Java or Scala. We use this boilerplate to build ETL jobs that move hourly and daily data into a data lake. Before Lambda, we built these ETL jobs in AWS Glue, but due to the varying nature of a few data sources, those Glue jobs were underutilized by 60% to 95% even at just 2 DPUs. We wanted to move the underutilized jobs from AWS Glue to AWS Lambda. If a job on Lambda cannot handle its data workload, it can easily be migrated back to Glue, since the same Spark code can be deployed on both services with minimal changes.
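Because the job is plain Spark code, a job built on this boilerplate looks like any other Spark batch job. Below is a minimal, hypothetical Scala sketch (the object name, bucket paths, and transformation are placeholders, not part of this repo); on Lambda there is no cluster, so Spark runs in local mode inside the single container:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical example job; names and paths are illustrative only.
object HourlyDatalakeJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-datalake-etl")
      .master("local[*]") // no cluster on Lambda: run Spark in local mode
      .getOrCreate()

    spark.read
      .json("s3n://source-bucket/events/hourly/")      // placeholder source
      .filter(col("event_ts").isNotNull)               // placeholder transform
      .write
      .mode("overwrite")
      .parquet("s3n://datalake-bucket/events/hourly/") // placeholder sink

    spark.stop()
  }
}
```

Apart from the master setting, the same object can be submitted to Glue or EMR unchanged, which is what keeps the Lambda-to-Glue fallback cheap.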

Build the zip to deploy on Lambda

./gradlew buildZip

Upload the zip through S3, since the zip is 187 MB (larger than Lambda's 50 MB limit for direct uploads).
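For example, with the AWS CLI (the bucket and function names here are placeholders; the zip lands under build/distributions/ by Gradle's default Zip task convention):

```sh
# Upload the artifact to S3, then point the function at it.
aws s3 cp build/distributions/spark-on-lambda.zip s3://my-deploy-bucket/spark-on-lambda.zip
aws lambda update-function-code \
    --function-name my-spark-job \
    --s3-bucket my-deploy-bucket \
    --s3-key spark-on-lambda.zip
```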

Set the environment variables below:

 SPARK_LOCAL_IP = 127.0.0.1
 S3N_AWS_ACCESSKEY = <AWS_ACCESSKEY>
 S3N_AWS_SECRETKEY = <AWS_SECRETKEY>
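Presumably the boilerplate forwards the two key variables into Hadoop's S3N configuration; a sketch of that wiring (an assumption based on the variable names, not the repo's actual code) is below. SPARK_LOCAL_IP itself is read by Spark directly so the driver binds to the loopback interface inside the Lambda sandbox.

```scala
import org.apache.spark.sql.SparkSession

object S3Credentials {
  // Assumption: map the Lambda env vars onto the standard Hadoop s3n:// keys.
  def configure(spark: SparkSession): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.s3n.awsAccessKeyId", sys.env("S3N_AWS_ACCESSKEY"))
    conf.set("fs.s3n.awsSecretAccessKey", sys.env("S3N_AWS_SECRETKEY"))
  }
}
```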

Set the handler to com.sparkonlambda.lambda.LambdaJobInitiator::handleRequest
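This can be done in the Lambda console or via the CLI (the function name is a placeholder):

```sh
aws lambda update-function-configuration \
    --function-name my-spark-job \
    --handler com.sparkonlambda.lambda.LambdaJobInitiator::handleRequest
```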

Performance

| Test case | Data size | No. of objects | Time | Observation |
|---|---|---|---|---|
| One big file | 1.5 GB | 1 | 40 sec | Very good performance; can work up to ~2 GB depending on transformation complexity |
| Small files (KB-sized), large count | 365.8 MB | 51,000 | 900 sec | Failed due to timeout; S3 object/file listing is very costly in Hadoop |
| Small files (KB-sized), reasonable count | 33.7 MB | 5,891 | 245 sec | Very good performance; interpolating these stats suggests reads of up to ~15k objects are feasible |

Todo

  • Generate temporary credentials instead of using an access key & secret key, or better, use the Lambda function's IAM role (see the sketch after this list)
  • Reduce the artifact size by including only the minimally required dependency jars
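For the first item, one possible approach (a sketch under assumptions, not this repo's code: it requires the hadoop-aws s3a connector on the classpath, which the boilerplate may not currently bundle): inside Lambda, the execution role's temporary credentials are exposed through the standard AWS environment variables, so Hadoop's TemporaryAWSCredentialsProvider can consume them instead of static keys.

```scala
import org.apache.spark.sql.SparkSession

object RoleCredentials {
  // Lambda injects AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN
  // from the function's IAM role; hand them to the s3a connector.
  def configure(spark: SparkSession): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    conf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    conf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    conf.set("fs.s3a.session.token", sys.env("AWS_SESSION_TOKEN"))
  }
}
```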
