# PySpark Tutorial - Reading data

Create the `SparkSession` that every PySpark program needs. Most programs store it in a variable named `spark`.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

The following code creates a DataFrame by reading a CSV file.

In [2]:
# header=True uses the first line of the file as the column names.
df = spark.read.option("header", True).csv("./data/policy.csv")

The schema of a DataFrame can be viewed with `printSchema()`.
> Note that every column below is shown as string, because the CSV reader does not infer types by default, even though many of these columns hold integers or dates. The integer columns are converted in the steps that follow.

In [3]:
df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: string (nullable = true)
 |-- sum_insured: string (nullable = true)
 |-- inception_date: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- premium: string (nullable = true)



The contents of a DataFrame can be printed to standard output using `show()`. The argument limits the number of rows displayed.

In [4]:
df.show(5)

+-------+------+-----------+-----------+--------------+----------+--------+-------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|end_date|premium|
+-------+------+-----------+-----------+--------------+----------+--------+-------+
|CAR0001|TOYOTA|          1|      15000|      20180101|  20180101|20181231|   1000|
|CAR0001|TOYOTA|          2|      13500|      20180101|  20190101|20191231|    900|
|CAR0001|TOYOTA|          3|      12000|      20180101|  20200101|20201231|    800|
|CAR0002|SUBARU|          2|      14000|      20200210|  20200210|20210209|    950|
|CAR0003|  FORD|          6|      10000|      20180315|  20180315|20190314|    700|
+-------+------+-----------+-----------+--------------+----------+--------+-------+
only showing top 5 rows



It is possible to `cast()` data from string to integer. The example below shows three ways of referencing an existing column in a DataFrame: `df.columnname`, `df["columnname"]` and `col("columnname")`.

Note that data read from a source that stores its own schema, such as Parquet, usually arrives with the correct types already, so this step is mostly needed for CSV.

In [5]:
df = df.withColumn("sum_insured", df.sum_insured.cast(IntegerType())) \
       .withColumn("vehicle_age", df["vehicle_age"].cast(IntegerType())) \
       .withColumn("premium", col("premium").cast(IntegerType()))

df.printSchema()

root
 |-- policy: string (nullable = true)
 |-- make: string (nullable = true)
 |-- vehicle_age: integer (nullable = true)
 |-- sum_insured: integer (nullable = true)
 |-- inception_date: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- premium: integer (nullable = true)



Note that all of these steps can be chained into a single statement. In that case the columns must be referenced with `col("columnname")` rather than `df.columnname`, because the DataFrame has not yet been assigned to a variable when the casts are applied:

In [6]:
df2 = spark.read.option("header",True).csv("./data/policy.csv") \
           .withColumn("sum_insured", col("sum_insured").cast(IntegerType())) \
           .withColumn("vehicle_age", col("vehicle_age").cast(IntegerType())) \
           .withColumn("premium", col("premium").cast(IntegerType()))
df2.show(5)

+-------+------+-----------+-----------+--------------+----------+--------+-------+
| policy|  make|vehicle_age|sum_insured|inception_date|start_date|end_date|premium|
+-------+------+-----------+-----------+--------------+----------+--------+-------+
|CAR0001|TOYOTA|          1|      15000|      20180101|  20180101|20181231|   1000|
|CAR0001|TOYOTA|          2|      13500|      20180101|  20190101|20191231|    900|
|CAR0001|TOYOTA|          3|      12000|      20180101|  20200101|20201231|    800|
|CAR0002|SUBARU|          2|      14000|      20200210|  20200210|20210209|    950|
|CAR0003|  FORD|          6|      10000|      20180315|  20180315|20190314|    700|
+-------+------+-----------+-----------+--------------+----------+--------+-------+
only showing top 5 rows



Calling `stop()` on the Spark session when you are finished releases the resources it holds. Note that after this call none of the code above can be re-run without first re-creating the `spark` variable.

In [7]:
spark.stop()