AWS S3 & Lambda CSV to Parquet using Golang and Spark Scala

This repository contains a sample that converts a CSV file uploaded to an AWS S3 bucket into Parquet format.

Uploading a CSV file to the S3 bucket triggers a Lambda function that converts the object to Parquet and writes the result to another prefix in the same bucket, as shown in the image below.
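As a rough illustration of the "another prefix" step, the sketch below maps the key of an uploaded CSV object to the key of the Parquet object to write. The prefix names `csv/` and `parquet/` and the function name `parquetKey` are assumptions made for this example; they are not taken from the repository code.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Assumed prefix names; the bucket layout in this repository may differ.
const (
	csvPrefix     = "csv/"
	parquetPrefix = "parquet/"
)

// parquetKey maps "csv/<name>.csv" to "parquet/<name>.parquet".
// It rejects keys outside the CSV prefix or without a .csv extension.
func parquetKey(csvKey string) (string, error) {
	if !strings.HasPrefix(csvKey, csvPrefix) {
		return "", fmt.Errorf("key %q is not under prefix %q", csvKey, csvPrefix)
	}
	rel := strings.TrimPrefix(csvKey, csvPrefix)
	ext := path.Ext(rel)
	if !strings.EqualFold(ext, ".csv") {
		return "", fmt.Errorf("key %q is not a CSV file", csvKey)
	}
	return parquetPrefix + strings.TrimSuffix(rel, ext) + ".parquet", nil
}

func main() {
	out, err := parquetKey("csv/2020/sales.csv")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Rejecting keys outside the input prefix also guards against the function reprocessing its own output, since both prefixes live in the same bucket.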

We have implemented this feature in two different programming languages.

Golang

For Golang, you have to:

- Build the binary for your module: GOARCH=amd64 GOOS=linux go build -gcflags='-N -l' -o . .
- Package the binary: zip function.zip binaryFile
- Sometimes you will also need to set the executable bit inside the zip file. There are several ways to do this; on Windows, you can run a Python script such as the one I found on Stack Overflow.
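One possible alternative to an external script, offered as a sketch and not as the method used in this repository, is to build the archive with Go's standard `archive/zip` package and set the Unix mode on the entry header. The entry name `main` and the helper names `zipExecutable` and `entryMode` are hypothetical.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io/fs"
)

// zipExecutable returns a zip archive with a single entry whose
// stored Unix permissions are 0755 (executable).
func zipExecutable(name string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)

	header := &zip.FileHeader{Name: name, Method: zip.Deflate}
	header.SetMode(0755) // stores Unix permission bits in the entry header

	w, err := zw.CreateHeader(header)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// entryMode reads the archive back and reports the permissions stored for name.
func entryMode(archive []byte, name string) (fs.FileMode, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return 0, err
	}
	for _, f := range zr.File {
		if f.Name == name {
			return f.Mode().Perm(), nil
		}
	}
	return 0, fmt.Errorf("entry %q not found", name)
}

func main() {
	// The content is a placeholder; a real build would read the compiled binary.
	archive, err := zipExecutable("main", []byte("placeholder binary"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	mode, err := entryMode(archive, "main")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stored mode:", mode)
}
```

Because the mode is written into the archive itself, this approach behaves the same on Windows, where the file system has no executable bit to copy from.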

Spark Scala

For Spark Scala:

- Package a JAR/ZIP file with sbt or Maven, including the dependencies.
- Pass this JAR file to the Lambda function.

AWS

- Create a role that allows Lambda to interact with the S3 bucket and to log to CloudWatch.
- Create the Lambda function.
- Create the bucket and the two folders (prefixes), one for CSV files and one for Parquet files.
- Create an event from the S3 bucket's properties window to trigger the Lambda function on upload.
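To make the trigger step more concrete, here is a minimal sketch of how a function could read the S3 event notification it receives and keep only newly created CSV objects. It uses only the standard library so that it stays self-contained; a real Lambda handler would more likely use the event types from the `aws-lambda-go` module. The `csv/` prefix, the bucket name in the sample payload and the names `csvUploads` and `upload` are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// s3Event mirrors only the fields of an S3 event notification used here.
type s3Event struct {
	Records []struct {
		EventName string `json:"eventName"`
		S3        struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// upload identifies one CSV object that should be converted.
type upload struct {
	Bucket string
	Key    string
}

// csvUploads returns the created objects under the assumed "csv/" prefix
// whose key ends in ".csv". Object keys arrive URL-encoded in the event,
// so they are decoded before filtering.
func csvUploads(payload []byte) ([]upload, error) {
	var ev s3Event
	if err := json.Unmarshal(payload, &ev); err != nil {
		return nil, err
	}
	var out []upload
	for _, r := range ev.Records {
		if !strings.HasPrefix(r.EventName, "ObjectCreated:") {
			continue
		}
		key, err := url.QueryUnescape(r.S3.Object.Key)
		if err != nil {
			return nil, err
		}
		if strings.HasPrefix(key, "csv/") && strings.HasSuffix(key, ".csv") {
			out = append(out, upload{Bucket: r.S3.Bucket.Name, Key: key})
		}
	}
	return out, nil
}

func main() {
	payload := []byte(`{"Records":[{"eventName":"ObjectCreated:Put",
		"s3":{"bucket":{"name":"my-bucket"},"object":{"key":"csv/red+flower.csv"}}}]}`)
	uploads, err := csvUploads(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range uploads {
		fmt.Printf("convert s3://%s/%s\n", u.Bucket, u.Key)
	}
}
```

The same prefix and suffix can also be set as filters on the S3 event itself, so that writes to the Parquet prefix never invoke the function.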

DevOps

We use Terraform as an infrastructure-as-code tool to automate the creation and configuration of the bucket and the Lambda function.
We use AWS CodeBuild for continuous integration to:

- Build the zip file containing the code of our Lambda function.
- Execute the Terraform scripts to build the infrastructure.

The Terraform state of our infrastructure is saved in the bucket.

Data in Depth

What if we have a large volume of CSV files? In that case, Lambda functions may not be the best choice. A Lambda invocation cannot run for more than 15 minutes, so a job that processes, for instance, 1 TB of data will time out. We then have to think about building an ETL to manage our data pipeline, for example Spark-based ETL processing running on Amazon EMR (Elastic MapReduce), AWS's managed Hadoop platform.
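The timeout argument can be checked with a back-of-the-envelope calculation. The sketch below estimates how long a single invocation would need for a given input size. The throughput of 50 MB/s is an assumed figure chosen only for illustration, not a measurement from this project, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lambdaMaxTimeout is the maximum duration of a single Lambda invocation.
const lambdaMaxTimeout = 15 * time.Minute

// estimateDuration returns the time needed to process sizeBytes
// at a sustained throughput of bytesPerSecond.
func estimateDuration(sizeBytes int64, bytesPerSecond float64) time.Duration {
	seconds := float64(sizeBytes) / bytesPerSecond
	return time.Duration(seconds * float64(time.Second))
}

// fitsInLambda reports whether the estimated run stays within the timeout.
func fitsInLambda(sizeBytes int64, bytesPerSecond float64) bool {
	return estimateDuration(sizeBytes, bytesPerSecond) <= lambdaMaxTimeout
}

func main() {
	const throughput = 50e6 // assumed: 50 MB/s, for illustration only

	sizes := []struct {
		label string
		bytes int64
	}{
		{"1 GB", 1e9},
		{"1 TB", 1e12},
	}
	for _, s := range sizes {
		fmt.Printf("%s: about %s, fits in one invocation: %t\n",
			s.label, estimateDuration(s.bytes, throughput), fitsInLambda(s.bytes, throughput))
	}
}
```

Under this assumption, 1 TB needs 20,000 seconds, roughly five and a half hours, which is far beyond the 15-minute limit, while a 1 GB file takes about 20 seconds.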

ETL CHOICE

The table below compares two ETL solutions, AWS Glue and Amazon EMR.

| Criteria | AWS Glue | Amazon EMR |
| --- | --- | --- |
| Deployment type | Serverless | Server platform (cluster) |
| Pricing | High | Low |
| Flexibility & scalability | Flexible | Harder to scale |
| ETL operations | Better | Not so good |
| Performance | Slower and less stable | Faster and more stable |

We can orchestrate the pipeline using AWS Data Pipeline.

Optimization (Thinking!!!!)

We can improve this project with other features:
- Add a pipeline that automatically deploys a new version of the Lambda function after each commit, using Travis CI, GitLab CI or CodeBuild.
- Handle failures: what happens if a Lambda function fails, or if a Spark job fails while processing on an EMR cluster?
- Change the way we process the data: use AWS Glue, the serverless ETL service of AWS, or an EMR cluster (Hadoop) under the hood.
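For the failure-handling idea, one common starting point is to retry a failed step with an increasing delay before giving up. The sketch below is a generic illustration of that pattern, not something implemented in this repository; the names `retry` and `errTransient` and the chosen delays are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary failure such as a throttled request.
var errTransient = errors.New("transient failure")

// retry runs op up to attempts times, doubling the delay after each failure.
// It returns nil on the first success, or the last error wrapped with context.
func retry(attempts int, baseDelay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(baseDelay << uint(i)) // baseDelay, 2x, 4x, ...
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated conversion step that fails twice before succeeding.
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", err)
}
```

A retry loop only helps with temporary errors; a file that can never be parsed should instead be reported or moved aside so that it does not block the pipeline.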

Resources:

https://www.rittmanmead.com/blog/page/13/

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-s3.html

https://www.terraform.io/docs/index.html

https://github.com/xitongsys/parquet-go

nael-fridhi/csv-to-parquet-aws

Folders and files

Latest commit

History

Repository files navigation

AWS S3 & Lambda CSV to Parquet using Golang and Spark Scala

Golang

Spark Scala

AWS

DevOps

Data in Depth

ETL CHOICE

Optimization (Thinking!!!!)

Resources:

About

Resources

Stars

Watchers

Forks

Languages