# Overview

This notebook writes a Delta Lake containing a single empty table to S3. Before running the cell below, confirm the Hadoop version that PySpark uses, because the `hadoop-aws` package must match it:

1. In a terminal, run `poetry shell` in the project directory.
2. Run `pyspark` to start an interactive PySpark shell.
3. Retrieve the Hadoop version:

    ```python
    spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
    ```

4. Set `HADOOP_VERSION` in the cell below to the version returned.
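
A typo in the version string tends to surface only later, as a dependency resolution failure, so it can help to validate the value before building the session. The sketch below is not part of the original notebook: `hadoop_aws_coordinate` is a hypothetical helper, and it assumes the version has a plain `major.minor.patch` form such as `3.3.4`.

```python
import re


def hadoop_aws_coordinate(version: str) -> str:
    """Hypothetical helper: validate a Hadoop version and build the hadoop-aws Maven coordinate."""
    version = version.strip()
    # Assumption: a plain major.minor.patch release such as '3.3.4'
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"Unexpected Hadoop version: {version!r}")
    return f"org.apache.hadoop:hadoop-aws:{version}"


print(hadoop_aws_coordinate("3.3.4"))
# org.apache.hadoop:hadoop-aws:3.3.4
```

The returned string has the same shape as the entry passed in `extra_packages` in the cell below.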

In [None]:
import boto3

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Must match the Hadoop version reported by the PySpark shell
HADOOP_VERSION = '3.3.4'

credentials = boto3.Session(profile_name='default').get_credentials()

# Configure a Spark session that can write Delta Lake tables to S3.
# The hadoop-aws package is added through extra_packages below, because
# configure_spark_with_delta_pip sets spark.jars.packages itself.
builder = SparkSession.builder.appName('s3-upload') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .config('spark.hadoop.fs.s3a.access.key', credentials.access_key) \
    .config('spark.hadoop.fs.s3a.secret.key', credentials.secret_key) \
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')

# Extra packages are downloaded to ~/.ivy2/jars
extra_packages = [f'org.apache.hadoop:hadoop-aws:{HADOOP_VERSION}']

spark = configure_spark_with_delta_pip(builder, extra_packages=extra_packages).getOrCreate()
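
The cell above passes only an access key and a secret key to S3A. If the `default` profile resolves to temporary credentials (for example an assumed role), the boto3 credentials object also carries a `token`, and S3A needs that as well. The following is a minimal sketch under that assumption: `s3a_credential_conf` is a hypothetical helper, and the `SimpleNamespace` object with placeholder values stands in for the real `credentials` object.

```python
from types import SimpleNamespace


def s3a_credential_conf(credentials) -> dict:
    """Hypothetical helper: map boto3-style credentials to S3A Hadoop settings."""
    conf = {
        "spark.hadoop.fs.s3a.access.key": credentials.access_key,
        "spark.hadoop.fs.s3a.secret.key": credentials.secret_key,
    }
    # Temporary credentials carry a session token that S3A also needs
    if getattr(credentials, "token", None):
        conf["spark.hadoop.fs.s3a.session.token"] = credentials.token
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
        )
    return conf


# Placeholder values for illustration only; in the notebook, pass `credentials`
example = SimpleNamespace(access_key="EXAMPLE_KEY", secret_key="EXAMPLE_SECRET", token=None)
print(sorted(s3a_credential_conf(example)))
```

Each key and value could then be applied to the builder with `builder.config(key, value)` before the session is created.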

Create a table with the following columns and types.

| Name | Type | 
| --- | --- |
| Year | bigint |
| Position | string |
| Player | string |
| FantasyPoints | double |
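
The same columns can also be written as a DDL-style schema string, which keeps the types visibly aligned with the table above. This is only a sketch: `COLUMNS` and `ddl_schema` are illustrative names, and it assumes the resulting string is handed to a Spark API that accepts DDL schema strings, such as the `schema` argument of `spark.createDataFrame`.

```python
# Column names and Spark SQL types, copied from the table above
COLUMNS = [
    ("Year", "bigint"),
    ("Position", "string"),
    ("Player", "string"),
    ("FantasyPoints", "double"),
]


def ddl_schema(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


print(ddl_schema(COLUMNS))
# Year bigint, Position string, Player string, FantasyPoints double
```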

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

schema = StructType([
    # LongType is Spark's bigint, matching the table above
    StructField("Year", LongType(), True),
    StructField("Position", StringType(), True),
    StructField("Player", StringType(), True),
    StructField("FantasyPoints", DoubleType(), True)
])

df = spark.createDataFrame([], schema=schema)

Write the table in Delta Lake format to S3. Set `BUCKET_NAME` to the name of your bucket before running the cell.

In [None]:
BUCKET_NAME=''

df.write.format('delta').mode('overwrite').save(f's3a://{BUCKET_NAME}/database/top_performers_delta')
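
Because `BUCKET_NAME` starts out as an empty string, running the cell unchanged produces an invalid `s3a://` path. A small guard can fail early with a clearer message. The sketch below is an illustration rather than part of the notebook: `delta_table_path` is a hypothetical helper and `my-example-bucket` is a placeholder bucket name.

```python
def delta_table_path(bucket_name: str, key: str = "database/top_performers_delta") -> str:
    """Hypothetical helper: build the s3a:// URI, rejecting an empty bucket name."""
    bucket_name = bucket_name.strip()
    if not bucket_name:
        raise ValueError("BUCKET_NAME is empty; set it before writing")
    return f"s3a://{bucket_name}/{key}"


print(delta_table_path("my-example-bucket"))
# s3a://my-example-bucket/database/top_performers_delta
```

The result could be passed to `save(...)` in place of the inline f-string.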