# PySpark Tutorial - Writing Data

Create the `SparkSession` that every PySpark program needs. By convention, most programs store it in a variable named `spark`.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import IntegerType, DateType

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

## Read the Source Data
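The following code creates one DataFrame each for the policies and the claims by reading CSV files. No schema is supplied, so every CSV column is read as a string and the numeric columns are then cast to integers.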

In [2]:
policyDF = spark.read.option("header",True).csv("./data/policy.csv") \
                .withColumn("sum_insured", col("sum_insured").cast(IntegerType())) \
                .withColumn("vehicle_age", col("vehicle_age").cast(IntegerType())) \
                .withColumn("premium", col("premium").cast(IntegerType()))
claimsDF = spark.read.option("header",True).csv("./data/claims.csv") \
                .withColumn("cost", col("cost").cast(IntegerType()))

In [3]:
def fix_dates(df):
    """Find all string columns named *_date and convert them to the Spark date type."""
    dtypes = dict(df.dtypes)  # column name -> type name, looked up once
    for column in df.columns:
        if column.endswith("_date") and dtypes[column] == "string":
            print("NOTE: Fixing date column '{}'.".format(column))
            # The pattern yyyyMMdd expects values such as 20180101.
            df = df.withColumn(column, to_date(df[column], "yyyyMMdd"))
    return df

policyDF = fix_dates(policyDF)
claimsDF = fix_dates(claimsDF)

NOTE: Fixing date column 'inception_date'.
NOTE: Fixing date column 'start_date'.
NOTE: Fixing date column 'end_date'.
NOTE: Fixing date column 'incident_date'.


Viewing the updated DataFrame shows that the date columns now use the Spark `date` type.

In [4]:
policyDF.printSchema()
policyDF.show(5)

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- inception_date: date (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)
 |-- premium: integer (nullable = true)

+-------+------+-----------+-----------+--------------+----------+----------+-------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|  end_date|premium|
+-------+------+-----------+-----------+--------------+----------+----------+-------+
|CAR0001|TOYOTA|          1|      15000|    2018-01-01|2018-01-01|2018-12-31|   1000|
|CAR0001|TOYOTA|          2|      13500|    2018-01-01|2019-01-01|2019-12-31|    900|
|CAR0001|TOYOTA|          3|      12000|    2018-01-01|2020-01-01|2020-12-31|    800|
|CAR0002|SUBARU|          2|      14000|    2020-02-10|2020-02-10|2021-02-09|    950|
|CAR0003|  FORD|          6|      10000|    2018-03-15|2018-03-15|20

## Writing Data

The following code writes the policy DataFrame in Parquet format. The `overwrite` mode replaces any output that already exists at the target path.

In [5]:
policyDF.write.mode('overwrite').parquet("./data/policy.parquet")

Spark typically splits data across several partitions, so it often writes more than one file. The path given above is therefore a folder name, and that folder holds all of the component files:

In [6]:
import glob

for file in sorted(glob.glob("./data/policy.parquet/*")):
    print(file)

./data/policy.parquet/_SUCCESS
./data/policy.parquet/part-00000-cfcdfc72-373b-4a99-bb73-645f3e2a939b-c000.snappy.parquet
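
The listing above mixes the `_SUCCESS` marker with the actual data file. If only the data files are of interest, one option is to filter on the `part-` prefix. This is a minimal sketch using only the standard library; the helper name `list_part_files` is hypothetical.

```python
import glob
import os


def list_part_files(folder):
    """Return the sorted part files in a Spark output folder, skipping markers such as _SUCCESS."""
    return sorted(glob.glob(os.path.join(folder, "part-*")))


for file in list_part_files("./data/policy.parquet"):
    print(file)
```

`len(list_part_files(folder))` then gives the number of data files that were written.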


The following example repartitions the claims DataFrame into 2 partitions before writing it:

In [7]:
claimsDF = claimsDF.repartition(2)
claimsDF.write.mode('overwrite').parquet("./data/claims.parquet")

Each partition is written as its own `part-` file, so the folder now contains two data files:

In [8]:
for file in sorted(glob.glob("./data/claims.parquet/*")):
    print(file)

./data/claims.parquet/_SUCCESS
./data/claims.parquet/part-00000-5c5878c3-7666-4f35-a879-586459e5a39a-c000.snappy.parquet
./data/claims.parquet/part-00001-5c5878c3-7666-4f35-a879-586459e5a39a-c000.snappy.parquet


## Reading Parquet Files

Parquet is a column-oriented format that is generally efficient for both storage and query performance. It also stores the schema alongside the data, so the date columns are read back as `date` without any further conversion.

In [9]:
parquetDF = spark.read.parquet("./data/policy.parquet")
parquetDF.printSchema()
parquetDF.show(5)

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- inception_date: date (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)
 |-- premium: integer (nullable = true)

+-------+------+-----------+-----------+--------------+----------+----------+-------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|  end_date|premium|
+-------+------+-----------+-----------+--------------+----------+----------+-------+
|CAR0001|TOYOTA|          1|      15000|    2018-01-01|2018-01-01|2018-12-31|   1000|
|CAR0001|TOYOTA|          2|      13500|    2018-01-01|2019-01-01|2019-12-31|    900|
|CAR0001|TOYOTA|          3|      12000|    2018-01-01|2020-01-01|2020-12-31|    800|
|CAR0002|SUBARU|          2|      14000|    2020-02-10|2020-02-10|2021-02-09|    950|
|CAR0003|  FORD|          6|      10000|    2018-03-15|2018-03-15|20

Calling `stop()` on the Spark session when you are finished frees the resources it holds. After stopping it, you cannot re-run any of the code above without first re-creating the `spark` variable.

In [10]:
spark.stop()