# PySpark Tutorial - Reading data

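Create the `SparkSession`, the entry point for any DataFrame-based PySpark program.  Most programs store it in a variable named `spark`; the older `SparkContext` (conventionally named `sc`) remains available as `spark.sparkContext`.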

In [38]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, length, lit, concat, to_date
from pyspark.sql.types import IntegerType, DateType

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

The following code creates a DataFrame by reading a CSV file.

In [26]:
df = spark.read.option("header",True).csv("./data/policy.csv")

The schema of a DataFrame can be viewed with `printSchema()`.
> Note that every column below is read as a string by default, even though several of them hold integers or dates.  The following steps convert them to the correct data types.

In [27]:
df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: string (nullable = true)
 |-- sum_insured: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)



The contents of a DataFrame can be printed as a text table using `show()`, which displays the first 20 rows unless a different number is passed.

In [28]:
df.show(5)

+-------+------+-----------+-----------+----------+--------+
| policy|  make|vehicle_age|sum_insured|start_date|end_date|
+-------+------+-----------+-----------+----------+--------+
|CAR0001|TOYOTA|          3|      12000|  20200101|20201231|
|CAR0002|SUBARU|          2|      14000|  20200210|20210209|
|CAR0003|  FORD|          8|       8000|  20200315|20210314|
|CAR0004| MAZDA|          4|      11000|  20200402|20210401|
|CAR0005|HOLDEN|          9|       6000|  20200516|20210515|
+-------+------+-----------+-----------+----------+--------+
only showing top 5 rows



Use `cast()` to convert a column from string to integer.  Data sources that carry their own schema, such as Parquet files or database tables, usually return correctly typed columns already, so this step is mostly needed for CSV.

In [29]:
df = df.withColumn("sum_insured", df["sum_insured"].cast(IntegerType())) \
       .withColumn("vehicle_age", df["vehicle_age"].cast(IntegerType()))

df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)



The following function iterates over the columns of a DataFrame using `df.columns`.  Each `withColumn()` call returns a new DataFrame, which looks wasteful, but Spark uses lazy evaluation: the loop only builds a chain of transformations, which Spark can optimise as a whole when an action finally runs.

In [39]:
def fix_dates(df):
    """Find all columns named *_date and convert from string to Spark Date type."""
    # The loop variable is called `name` so it does not shadow the imported col() function.
    for name in df.columns:
        if name.endswith("_date"):
            print("NOTE: Fixing date column '{}'.".format(name))
            df = df.withColumn(name, to_date(df[name], "yyyyMMdd"))
    return df

df = fix_dates(df)
df.printSchema()

NOTE: Fixing date column 'start_date'.
NOTE: Fixing date column 'end_date'.
root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)



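The `withColumn()` method can also be used to create new columns in the DataFrame.  Here `when()` and `otherwise()` derive a `class` column from `sum_insured`; the conditions are checked in order and the first one that matches determines the value.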

In [42]:
df = df.withColumn("class", when(df.sum_insured>50000, "expensive") \
                           .when(df.sum_insured>10000, "standard") \
                           .otherwise("cheap"))
df.select(["policy", "class", "make", "vehicle_age", "sum_insured"]).show()

+-------+---------+-------+-----------+-----------+
| policy|    class|   make|vehicle_age|sum_insured|
+-------+---------+-------+-----------+-----------+
|CAR0001| standard| TOYOTA|          3|      12000|
|CAR0002| standard| SUBARU|          2|      14000|
|CAR0003|    cheap|   FORD|          8|       8000|
|CAR0004| standard|  MAZDA|          4|      11000|
|CAR0005|    cheap| HOLDEN|          9|       6000|
|CAR0006|    cheap| SUZUKI|          5|       4000|
|CAR0007| standard|    BMW|          4|      24000|
|CAR0008| standard|   AUDI|          3|      28000|
|CAR0009|expensive|  TESLA|          2|      65000|
|CAR0010|    cheap|HYUNDAI|          6|       5000|
+-------+---------+-------+-----------+-----------+



Aggregations are lazy as well: `groupBy()` and `sum()` only describe the work to be done.  The call to `show()` on the new DataFrame forces execution of the aggregation and the summary rows are returned.

In [47]:
summary = df.groupBy("class").sum("sum_insured") \
            .withColumnRenamed("sum(sum_insured)", "total_insured")
summary.show()

+---------+-------------+
|    class|total_insured|
+---------+-------------+
|expensive|        65000|
|    cheap|        23000|
| standard|        89000|
+---------+-------------+



When you are finished, calling `stop()` on the Spark session releases its resources, as the final cell of this tutorial does.  Note that after stopping you will be unable to re-run any of the code above without first re-creating the `spark` variable.

All of these steps can also be chained into a single expression.  Because there is no intermediate DataFrame variable to index into, columns are referenced with `col("name")` instead:

In [52]:
df2 = spark.read.option("header",True).csv("./data/policy.csv") \
           .withColumn("sum_insured", col("sum_insured").cast(IntegerType())) \
           .withColumn("vehicle_age", col("vehicle_age").cast(IntegerType())) \
           .withColumn("start_date", to_date(col("start_date"), "yyyyMMdd")) \
           .withColumn("end_date", to_date(col("end_date"), "yyyyMMdd")) \
           .withColumn("class", when(col("sum_insured")>50000, "expensive") \
                               .when(col("sum_insured")>10000, "standard") \
                               .otherwise("cheap")) \
           .groupBy("class").sum("sum_insured") \
           .withColumnRenamed("sum(sum_insured)", "total_insured")
df2.show()

+---------+-------------+
|    class|total_insured|
+---------+-------------+
|expensive|        65000|
|    cheap|        23000|
| standard|        89000|
+---------+-------------+



In [8]:
spark.stop()