# Spark ETL with AWS (S3 bucket)

[1. Create connection with AWS S3 bucket](#1)

[2. Read data from S3 bucket and store into DataFrame](#2)

[3. Transform data](#3)

[4. Write data into Parquet file](#4)

[5. Write data into JSON file](#5)

## Load libraries

In [26]:
# load required libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


In [2]:
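# Start Spark session
# The hadoop-aws version should match the Hadoop version bundled with your Spark build
spark = SparkSession.builder.appName("S3-ETL") \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.6.3,org.apache.hadoop:hadoop-common:2.6.3') \
    .getOrCreate()

# spark is already a SparkSession, so reuse it for SQL queries instead of wrapping it again
sqlContext = spark

# Show only errors, not warnings
spark.sparkContext.setLogLevel("ERROR")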

spark

<a id = 1> </a>

## Create Connection with AWS S3 Bucket

In [3]:
# set access key and secret key to read and write data from AWS S3

spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key','Put Access Key Here')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key','Put Secret Access Key Here')
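
Hard-coded keys in a notebook are easy to leak. Below is a minimal sketch of one alternative, assuming the keys are exported under the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. The helper names `s3a_conf_from_env` and `apply_s3a_conf` are hypothetical; they only produce and apply the same two `fs.s3a.*` settings used above.

```python
import os


def s3a_conf_from_env(env=None):
    """Build the fs.s3a.* settings from environment variables.

    Assumes the standard AWS variable names; raises if either is missing.
    """
    env = os.environ if env is None else env
    mapping = {
        "fs.s3a.access.key": "AWS_ACCESS_KEY_ID",
        "fs.s3a.secret.key": "AWS_SECRET_ACCESS_KEY",
    }
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in mapping.items()}


def apply_s3a_conf(hadoop_conf, settings):
    """Apply each setting to a Hadoop Configuration-like object via .set()."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

In the notebook this could be called as `apply_s3a_conf(spark.sparkContext._jsc.hadoopConfiguration(), s3a_conf_from_env())`.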

<a id = 2> </a>

## Read data from S3 bucket and store into DataFrame

In [4]:
# pyspark-etl-s3 is the bucket where the data is stored
# csv-datasets is the folder name
# employee.csv is the file name

s3_df = spark.read.format("csv") \
    .option("header","true") \
    .load('s3a://pyspark-etl-s3/csv-datasets/employee.csv')
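
With only the `header` option set, Spark reads every CSV column as a string. A minimal sketch of supplying the types at read time instead, assuming the five columns that appear in this notebook's schema output. `EMPLOYEE_SCHEMA` and `read_employee_csv` are hypothetical names, and the chosen types are an assumption about the data. `DataFrameReader.schema` accepts a DDL-formatted string, so no extra imports are needed.

```python
# Column names come from the notebook's printSchema() output;
# the types are an assumption about how the data should be interpreted.
EMPLOYEE_SCHEMA = (
    "id STRING, first_name STRING, last_name STRING, "
    "salary FLOAT, department_id STRING"
)


def read_employee_csv(spark, path, schema=EMPLOYEE_SCHEMA):
    """Read a CSV file that has a header row, using an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(schema)
        .load(path)
    )
```

Called as `read_employee_csv(spark, 's3a://pyspark-etl-s3/csv-datasets/employee.csv')`, this would make a separate cast of `salary` unnecessary.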

In [32]:
# CSV columns are read as strings, so cast salary to float

s3_df = s3_df.withColumn('salary', col('salary').cast('float'))

In [31]:
s3_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- department_id: string (nullable = true)

In [33]:
s3_df.show()

+---+----------+---------+--------+-------------+
| id|first_name|last_name|  salary|department_id|
+---+----------+---------+--------+-------------+
|  1|      Todd|   Wilson|110000.0|         1006|
|  1|      Todd|   Wilson|106119.0|         1006|
|  2|    Justin|    Simon|128922.0|         1005|
|  2|    Justin|    Simon|130000.0|         1005|
|  3|     Kelly|  Rosario| 42689.0|         1002|
|  4|  Patricia|   Powell|162825.0|         1004|
|  4|  Patricia|   Powell|170000.0|         1004|
|  5|    Sherry|   Golden| 44101.0|         1002|
|  6|   Natasha|  Swanson| 79632.0|         1005|
|  6|   Natasha|  Swanson| 90000.0|         1005|
|  7|     Diane|   Gordon| 74591.0|         1002|
|  8|  Mercedes|Rodriguez| 61048.0|         1005|
|  9|   Christy| Mitchell|137236.0|         1001|
|  9|   Christy| Mitchell|140000.0|         1001|
|  9|   Christy| Mitchell|150000.0|         1001|
| 10|      Sean| Crawford|182065.0|         1006|
| 10|      Sean| Crawford|190000.0|         1006|


<a id = 3> </a>
## Transform data

In [34]:
print("Register df as SQL Temporary Source View")

s3_df.createOrReplaceTempView("temp_Source")

Register df as SQL Temporary Source View


In [38]:
print("Displaying top 10 employees with highest salary")

sqlContext.sql("SELECT first_name, last_name, salary FROM temp_Source ORDER BY salary DESC LIMIT 10").show()

Displaying top 10 employees with highest salary
+----------+---------+--------+
|first_name|last_name|  salary|
+----------+---------+--------+
|     Julie|  Sanchez|210000.0|
|     Julie|  Sanchez|200000.0|
|   Stephen|    Smith|194791.0|
|      Kara|    Smith|192838.0|
|      Sean| Crawford|190000.0|
|     Linda|    Clark|186781.0|
|     Julie|  Sanchez|185663.0|
|      Sean| Crawford|182065.0|
|   Richard|     Cole|180361.0|
|     Traci| Williams|180000.0|
+----------+---------+--------+



In [39]:
# Keep the top 10 result as a DataFrame so it can be written out
new_df = sqlContext.sql("SELECT first_name, last_name, salary FROM temp_Source ORDER BY salary DESC LIMIT 10")
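
The same top 10 selection can also be expressed with DataFrame methods instead of a SQL string. This is a sketch only; `top_paid` is a hypothetical helper name, and it assumes the DataFrame has the `first_name`, `last_name`, and `salary` columns used in the query.

```python
def top_paid(df, n=10):
    """Return the n rows with the highest salary, mirroring the SQL query."""
    return (
        df.select("first_name", "last_name", "salary")
        .orderBy("salary", ascending=False)
        .limit(n)
    )
```

For example, `top_paid(s3_df).show()` should list the same rows as the SQL version, without needing the temporary view.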

<a id = 4> </a>
## Write data into Parquet file

In [40]:
# Append the result to the parquetdata directory with snappy compression
new_df.write.format("parquet").option("compression","snappy").save("parquetdata",mode="append")
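
With `mode="append"`, running the cell again adds the same ten rows a second time. A minimal sketch of a small wrapper that makes the save mode an explicit choice; `write_snappy_parquet` and `SAVE_MODES` are hypothetical names, and defaulting to `overwrite` is an assumption made here so that reruns produce the same output, not what the cell above does.

```python
# Common save modes accepted by DataFrameWriter.mode()
SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_snappy_parquet(df, path, mode="overwrite"):
    """Write df as snappy-compressed Parquet, validating the save mode first."""
    if mode not in SAVE_MODES:
        raise ValueError("unsupported save mode: " + repr(mode))
    df.write.mode(mode).option("compression", "snappy").parquet(path)
```

Calling `write_snappy_parquet(new_df, "parquetdata", mode="append")` would behave like the cell above, while the default replaces any existing output at that path.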

<a id = 5> </a>
## Write data into JSON file

In [42]:
# Append the result as JSON files under the json_data directory
new_df.write.format("json").save("json_data",mode='append')
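
Spark writes JSON output as a directory of `part-*.json` files, each holding one JSON object per line. A minimal sketch for checking the result with plain Python, assuming the output sits on the local filesystem; `read_json_lines` is a hypothetical helper name.

```python
import glob
import json
import os


def read_json_lines(output_dir):
    """Load every record from the part-*.json files in a Spark JSON output directory."""
    records = []
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
        with open(part, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

For instance, `len(read_json_lines("json_data"))` gives the number of records written so far, which grows on each rerun because the cell above appends.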