# First ETL

ETL stands for Extract, Transform, and Load. It is a process that integrates data from various sources into a target database or data warehouse: data is extracted from the sources, transformed into the desired format, and loaded into the target system.

The ETL process involves the following steps:

- Extraction: Data is extracted from various sources, which may include databases, files, web services, etc.
- Transformation: The extracted data is transformed into a format that is suitable for analysis and storage. This may involve cleaning and filtering the data, merging it with other data, or performing calculations on it.
- Loading: The transformed data is loaded into the target database or data warehouse.
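
The three steps above can be sketched without any framework. The following is a minimal, illustrative sketch using only the Python standard library. The function names, the in-memory CSV text, the sample rows, and the fixed `updated_at` value are all assumptions made up for illustration, and a plain list stands in for the target system.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample input standing in for a real source file.
RAW_CSV = "name,roast\nhouse blend,2.5\ndark roast,4.0\n"


def extract(csv_text):
    """Extraction: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, updated_at):
    """Transformation: cast 'roast' to float and add an 'updated_at' field."""
    return [
        {"name": row["name"], "roast": float(row["roast"]), "updated_at": updated_at}
        for row in rows
    ]


def load(rows, target):
    """Loading: append the transformed rows to the target (here, a list)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a database or data warehouse
stamp = datetime(2024, 1, 1, 12, 0)  # fixed value so the run is repeatable
loaded = load(transform(extract(RAW_CSV), stamp), warehouse)
print(f"Loaded {loaded} rows")
```

Keeping each stage in its own function makes it possible to test the transformation on a handful of rows before pointing the pipeline at real data.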

The ETL process is a critical component of data integration and is used in various applications such as business intelligence, data warehousing, and data migration.

## PySpark ETL process

The following example demonstrates a simple ETL process using PySpark: it extracts data from a CSV file, transforms it by adding a new column, and loads the result into a Parquet file.

### Import Libraries

In [8]:
# Standard library
from datetime import datetime

# PySpark: session, column functions, and the types used in the schema
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

In [None]:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from datetime import datetime

# Function to create a Spark session

def create_spark_session():
    return SparkSession.builder.master("local[1]") \
        .appName("FirstETL") \
        .config("spark.driver.extraClassPath", "/home/jovyan/work/jars/*") \
        .getOrCreate()


# Function to extract data from CSV using specified schema

def extract_data(spark, input_path):
    custom_schema = StructType([
        StructField("name", StringType(), True),
        StructField("roast", DoubleType(), True)
    ])
    
    df = spark.read.format("csv") \
        .option("header", True) \
        .schema(custom_schema) \
        .load(input_path)
    
    new_df = df.select("name", "roast")
    return new_df

# Function to transform data by adding a timestamp column

def transform_data(data_frame):
    updated = datetime.today().replace(second=0, microsecond=0)
    transformed_df = data_frame.withColumn('updated_at', F.lit(updated))
    return transformed_df

# Function to load transformed data into Parquet with Snappy compression

def load_data(transformed_df, output_folder):
    today_date = datetime.now().strftime("%Y-%m-%d")
    output_path = f"{output_folder}/{today_date}_curated"

    try:
        # mode("overwrite") lets the job be re-run on the same day without failing
        transformed_df.write.mode("overwrite") \
            .option("compression", "snappy") \
            .parquet(output_path)
        print(f"Data written successfully to {output_path}.")
    except Exception as e:
        print("An error occurred:", str(e))
        raise  # re-raise so a failed load is not silently ignored

# Main function that orchestrates the ETL process

def main():
    input_path = "/home/jovyan/work/data/raw-coffee.csv"
    output_folder = "/home/jovyan/work/data"
    
    spark = create_spark_session()
    
    try:
        extracted_data = extract_data(spark, input_path)
        transformed_data = transform_data(extracted_data)
        load_data(transformed_data, output_folder)
    finally:
        spark.stop()

if __name__ == "__main__":
    main()
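
Two values in the script derive from the current time: the `updated_at` timestamp, truncated to the minute, and the dated output folder name. The sketch below reproduces both with the standard library only. The helper names and the fixed `now` value are hypothetical, introduced so the result is repeatable.

```python
from datetime import datetime


def truncate_to_minute(now):
    """Drop seconds and microseconds, as transform_data does."""
    return now.replace(second=0, microsecond=0)


def curated_path(output_folder, now):
    """Build the dated output path the same way load_data does."""
    return f"{output_folder}/{now.strftime('%Y-%m-%d')}_curated"


# Fixed, made-up moment in time used only for illustration.
now = datetime(2024, 3, 5, 14, 27, 43, 123456)

updated = truncate_to_minute(now)
path = curated_path("/home/jovyan/work/data", now)

print(updated)  # 2024-03-05 14:27:00
print(path)     # /home/jovyan/work/data/2024-03-05_curated
```

Because the folder name changes only once per day, a second run on the same day targets the same path. Whether that run fails or replaces the earlier output depends on the write mode.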

## Learner Version

### Initiate SparkSession

In [3]:
# Create SparkSession from builder
from pyspark.sql import SparkSession

# Create a SparkSession and set the extraClassPath configuration
spark = SparkSession.builder.master("local[1]") \
    .appName("FirstETL") \
    .config("spark.driver.extraClassPath", "/home/jovyan/work/jars/*") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
# Details of the Spark Session
spark

### Set Parameters

In [9]:
# Timestamp column value: current time truncated to the minute
UPDATED = datetime.today().replace(second=0, microsecond=0)

# Schema for the input CSV
custom_schema = StructType([
    StructField("name", StringType(), True),
    StructField("roast", DoubleType(), True)
])

### Extract Data

In [10]:
# Read the data with the specified schema and create a DataFrame
df = spark.read.format("csv") \
    .option("header", True) \
    .schema(custom_schema) \
    .load("/home/jovyan/work/data/raw-coffee.csv")

# Extract relevant columns and create a new DataFrame
new_df = df.select("name", "roast")

### Transform Data

In [11]:
# Transform the data by adding a new column with current timestamp
transformed_df = new_df.withColumn('updated_at', f.lit(UPDATED))


### Load Data

In [12]:
# Write to a new directory with a date suffix (YYYYMMDD)
timestamp = datetime.now().strftime("%Y%m%d")
output_dir = f"/home/jovyan/work/data/parq_curated_{timestamp}"
transformed_df.write.mode("overwrite").parquet(output_dir)

                                                                                