In [1]:
# run first. then have fun.
from pyspark.sql.functions import col, to_date, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, FloatType


## Generate the Schema (StructType) of the CSV Data (ecomm_behavior_data)

In [2]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, FloatType
schema = (StructType([
    StructField("event_time", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("product_id", IntegerType(), False),
    StructField("category_id", LongType(), False),
    StructField("category_code", StringType(), True),  # can be missing in the source data
    StructField("brand", StringType(), True),  # can be missing in the source data
    StructField("price", FloatType(), False),
    StructField("user_id", IntegerType(), False),
    StructField("user_session", StringType(), False),
]))

## Load the Dataset
> Note: The GitHub repo contains the `-sm.csv` data sampled from [Kaggle: Ecommerce Behavior Data Multi Category Store](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store). To follow along with the complete dataset, download it and drop the `2019-Oct.csv` and `2019-Nov.csv` files into the `datasets` directory in the project.

In [19]:
dataset_dir = '/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data'
# note: if you download the full dataset from https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store,
# just use the following and comment out the `-sm.csv` datasets.
#datasets = ['2019-Oct.csv','2019-Nov.csv']
datasets = ['2019-Oct-sm.csv','2019-Nov-sm.csv']

In [23]:
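# Read one month of CSV data using the explicit schema.
# datasets[1] is the November file; switch to datasets[0] and rerun to load October.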

ecomm_df = (
    spark.read.format("csv")
    .option("header", True)
    .schema(schema)
    .load(f"{dataset_dir}/{datasets[1]}")
)
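
The read above handles one month per run. `DataFrameReader.load` also accepts a list of paths, so both months could be read in a single pass. This is a minimal sketch, assuming the `spark` session and `schema` defined earlier in the notebook; `build_paths` and `load_all` are hypothetical helper names, not part of the guide.

```python
def build_paths(dataset_dir, datasets):
    """Join the dataset directory with each CSV file name."""
    return [f"{dataset_dir}/{name}" for name in datasets]


def load_all(spark, schema, paths):
    """Read every CSV in `paths` into one DataFrame (needs a live SparkSession)."""
    return (
        spark.read.format("csv")
        .option("header", True)
        .schema(schema)
        .load(paths)  # load() takes a single path or a list of paths
    )


paths = build_paths(
    "/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data",
    ["2019-Oct-sm.csv", "2019-Nov-sm.csv"],
)
print(paths)
```

In the notebook this would be called as `ecomm_df = load_all(spark, schema, paths)`.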

In [None]:
ecomm_df.show(20, truncate=False)

In [None]:
ecomm_df.count()

## Convert from CSV to Partitioned Parquet
CSV is simple to use, but it can be a problematic format to work with. Luckily, the ecomm dataset has already been preprocessed (cleaned).

**What we'll achieve**
1. We will do some minor post-processing, converting `event_time` from a `StringType` to a `TimestampType` with the `to_timestamp` function. Because we are parsing a string, we need to pass a format that matches it: `2019-10-01 00:00:00 UTC` is described by `yyyy-MM-dd HH:mm:ss z`.
2. Given the size of the data (~9GB for Nov, ~4GB for Oct), it also makes sense to partition by day to speed up local processing. To do that, we need to create a new column called `event_date` to store the partition information.
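
As a quick sanity check of the format described in step 1, the same sample value can be parsed in plain Python. This is only a sketch: `strptime` uses its own directives (`%Y-%m-%d %H:%M:%S %Z`), which stand in here for Spark's `yyyy-MM-dd HH:mm:ss z` pattern, and the variable names are illustrative.

```python
from datetime import datetime, timezone

sample = "2019-10-01 00:00:00 UTC"

# %Z matches the literal zone name "UTC". The parsed value carries no zone
# information of its own here, so attach UTC explicitly.
event_time = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S %Z").replace(tzinfo=timezone.utc)

# The partition column from step 2 is the calendar date of the timestamp.
event_date = event_time.date()

print(event_time.isoformat())
print(event_date.isoformat())
```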

In [None]:
from pyspark.sql.functions import col, to_date, to_timestamp
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

sink_dir = 'sm' if datasets[1].endswith('-sm.csv') else 'lg'

(ecomm_df
   .withColumn("event_time", to_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss z"))
   .withColumn("event_date", to_date(col("event_time")))
   .write
   .format("parquet")
   .partitionBy("event_date")
   .mode("append")
   .save(f"{dataset_dir}/parquet/{sink_dir}")
)

## Read Back our Parquet by Specific Date
> Note: Now that we have our schema (StructType) set, it will be encoded into the Parquet data. This simplifies reading back from our new Parquet location, as long as we don't break or modify the schema, since Parquet has no notion of schema enforcement. Schema enforcement is a plus of working with Delta Lake, which we'll see in the rest of the Hitchhiker's Guide.

In [None]:
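## Run a gut check on one of the days. See how things work. It should be fairly fast.
# `source_dir` follows the dataset selected above, so `sm` or `lg` is picked from the file name.
# Note: 2019-10-01 belongs to the October file, so convert datasets[0] as well before expecting rows here.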

source_dir = 'sm' if datasets[1].endswith('-sm.csv') else 'lg'
(spark.read
 .format("parquet")
 .load(f"{dataset_dir}/parquet/{source_dir}/")
 .where(col("event_date").eqNullSafe("2019-10-01"))
 .show(10)
)
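
`partitionBy("event_date")` writes one Hive-style directory per day (for example `event_date=2019-10-01/`), which is why a filter on `event_date` only needs to touch a single directory. The sketch below mimics that layout in a temporary directory with empty placeholder files. No Spark is involved, and the file names and the `partition_dir` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def partition_dir(root: Path, event_date: str) -> Path:
    """Return the Hive-style partition directory for one day (hypothetical helper)."""
    return root / f"event_date={event_date}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "parquet" / "sm"

    # Fake three daily partitions, each holding one empty placeholder file.
    for day in ["2019-10-01", "2019-10-02", "2019-10-03"]:
        target = partition_dir(root, day)
        target.mkdir(parents=True)
        (target / "part-00000.zstd.parquet").touch()

    # A filter on event_date maps to exactly one directory on disk.
    matched = sorted(p.name for p in partition_dir(root, "2019-10-01").iterdir())
    all_days = sorted(p.name for p in root.iterdir())

print(all_days)
print(matched)
```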

## Where to go Next?
> This notebook only exists to help read, post-process, and write partitioned data from the ecomm dataset. We will be using the `parquet` data for the actual hitchhiker's guide. 

Now that we've learned how to read our CSV data and write it back out as partitioned Parquet, it is time to actually tackle problems using Delta Lake. If you are new to Delta Lake, the easiest path is to head over to [First Steps](../101-first-steps/README.md) to learn how to create and modify Delta Lake tables (with the intention of going from zero-to-hero).