# PySpark Tutorial - Reading data

Create the `SparkSession` that every PySpark program needs. Most programs store it in a variable named `spark`.

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, length, lit, concat, to_date
from pyspark.sql.types import IntegerType, DateType, StringType

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

The following code creates a DataFrame by reading a CSV file.

In [10]:
df = spark.read.option("header",True).csv("./data/policy.csv")

The schema of a DataFrame can be viewed with `printSchema()`.
> Note that every column below is typed as string, even though several hold integers or dates. The CSV reader does not infer column types unless asked to. Later steps convert these columns to the correct types.

In [11]:
df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: string (nullable = true)
 |-- sum_insured: string (nullable = true)
 |-- inception_date: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)



The contents of a DataFrame can be printed with `show()`; the argument limits the number of rows displayed.

In [12]:
df.show(5)

+-------+------+-----------+-----------+--------------+----------+--------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|end_date|
+-------+------+-----------+-----------+--------------+----------+--------+
|CAR0001|TOYOTA|          1|      15000|      20180101|  20180101|20181231|
|CAR0001|TOYOTA|          2|      13500|      20180101|  20190101|20191231|
|CAR0001|TOYOTA|          3|      12000|      20180101|  20200101|20201231|
|CAR0002|SUBARU|          2|      14000|      20200210|  20200210|20210209|
|CAR0003|  FORD|          6|      10000|      20180315|  20180315|20190314|
+-------+------+-----------+-----------+--------------+----------+--------+
only showing top 5 rows



Use `cast()` to convert a column from string to integer. Note that if the data is read from a source that stores its own schema, such as Parquet, the columns are most likely in the correct type already.

In [13]:
df = df.withColumn("sum_insured", df["sum_insured"].cast(IntegerType())) \
       .withColumn("vehicle_age", df["vehicle_age"].cast(IntegerType()))

df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- inception_date: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)



The following code shows how a function can iterate over the columns of a DataFrame using `df.columns`. Each call to `withColumn()` appears to create a new DataFrame, but Spark uses lazy evaluation: the loop only builds up a chain of instructions, which Spark can optimise before running it when an action such as `show()` is called.

In [17]:
def fix_dates(df):
    """Find all columns named *_date and convert from string to Spark Date type."""
    dtypes = dict(df.dtypes)
    # The loop variable is `name`, not `col`, so the imported `col` function is not shadowed.
    for name in df.columns:
        if name.endswith("_date") and dtypes[name] == "string":
            print("NOTE: Fixing date column '{}'.".format(name))
            df = df.withColumn(name, to_date(df[name], "yyyyMMdd"))
    return df

df = fix_dates(df)
df.printSchema()
df.show(5)

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- inception_date: date (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)

+-------+------+-----------+-----------+--------------+----------+----------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|  end_date|
+-------+------+-----------+-----------+--------------+----------+----------+
|CAR0001|TOYOTA|          1|      15000|    2018-01-01|2018-01-01|2018-12-31|
|CAR0001|TOYOTA|          2|      13500|    2018-01-01|2019-01-01|2019-12-31|
|CAR0001|TOYOTA|          3|      12000|    2018-01-01|2020-01-01|2020-12-31|
|CAR0002|SUBARU|          2|      14000|    2020-02-10|2020-02-10|2021-02-09|
|CAR0003|  FORD|          6|      10000|    2018-03-15|2018-03-15|2019-03-14|
+-------+------+-----------+-----------+--------------+----------+----------+
only showing top 5 rows

The `withColumn()` method creates a new column in the DataFrame (or replaces an existing one of the same name), and the `when()` function builds conditional values in a similar way to the SQL `CASE` expression.

In [22]:
df = df.withColumn("status", when(df.start_date==df.inception_date, "New Business") \
                            .otherwise("Renewal"))
df.select(["policy", "start_date", "status", "make", "vehicle_age", "sum_insured"]).show()

+-------+----------+------------+-------+-----------+-----------+
| policy|start_date|      status|   make|vehicle_age|sum_insured|
+-------+----------+------------+-------+-----------+-----------+
|CAR0001|2018-01-01|New Business| TOYOTA|          1|      15000|
|CAR0001|2019-01-01|     Renewal| TOYOTA|          2|      13500|
|CAR0001|2020-01-01|     Renewal| TOYOTA|          3|      12000|
|CAR0002|2020-02-10|New Business| SUBARU|          2|      14000|
|CAR0003|2018-03-15|New Business|   FORD|          6|      10000|
|CAR0003|2019-03-15|     Renewal|   FORD|          7|       9000|
|CAR0003|2020-03-15|     Renewal|   FORD|          8|       8000|
|CAR0004|2019-04-02|New Business|  MAZDA|          4|      11000|
|CAR0004|2020-04-02|New Business|  MAZDA|          5|      10000|
|CAR0005|2020-05-16|New Business| HOLDEN|          9|       6000|
|CAR0006|2020-06-18|New Business| SUZUKI|          5|       4000|
|CAR0007|2020-07-13|New Business|    BMW|          4|      24000|
|CAR0008|2

The following example shows summarisation with `groupBy`:

In [23]:
summary = df.groupBy("status").sum("sum_insured") \
            .withColumnRenamed("sum(sum_insured)", "total_insured")
summary.show()

+------------+-------------+
|      status|total_insured|
+------------+-------------+
|     Renewal|       107500|
|New Business|       199000|
+------------+-------------+



Note that all of these steps can be chained into a single expression. Within the chain, columns are referenced with `col("varname")` rather than `df.varname`, because no variable holds the intermediate DataFrame at that point:

In [25]:
df2 = spark.read.option("header",True).csv("./data/policy.csv") \
           .withColumn("sum_insured", col("sum_insured").cast(IntegerType())) \
           .withColumn("vehicle_age", col("vehicle_age").cast(IntegerType())) \
           .withColumn("inception_date", to_date(col("inception_date"), "yyyyMMdd")) \
           .withColumn("start_date", to_date(col("start_date"), "yyyyMMdd")) \
           .withColumn("end_date", to_date(col("end_date"), "yyyyMMdd")) \
           .withColumn("status", when(col("start_date")==col("inception_date"), "New Business") \
                                .otherwise("Renewal")) \
           .groupBy("status").sum("sum_insured") \
           .withColumnRenamed("sum(sum_insured)", "total_insured")
df2.show()

+------------+-------------+
|      status|total_insured|
+------------+-------------+
|     Renewal|       107500|
|New Business|       199000|
+------------+-------------+



Calling `stop()` on the Spark session when you are finished frees its resources. Note that after doing so you cannot re-run any of the code above without first re-creating the `spark` variable.

In [10]:
spark.stop()