# Day 7 - NULL Problems
Processing known values is already challenging with regard to types, scaling, context and so on. Processing unknown values, i.e. NULL values, needs further attention to avoid unexpected results. So today I want to figure out how Spark addresses this issue.

As I know from traditional relational databases, the metadata, i.e. the table schema, defines whether a column is allowed to have empty cells. When I define a column as NOT NULL, the RDBMS enforces this with a not-null constraint and rejects any incoming records that violate the rule.

I know from day 3 that Spark also has a schema concept, but how reliable is it?

## Schema Is No Option

In [1]:
import pyspark
from pyspark.sql import SparkSession


spark = SparkSession\
   .builder\
   .getOrCreate()

csvData = spark.read\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .format("csv")\
   .load("./data/retail-data/by-day/*.csv")

csvData.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



Spark schemas have a nullable attribute for each column. Presumably Spark observed rows without a value while inferring the schema from the data, although it flags every column as nullable here. In fact, I do find null values in the CustomerID column.

In [76]:
csvData.where("CustomerID is null").show(10)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   580546|    23406|      CHECK|     -21|2011-12-05 09:27:00|      0.0|      null|United Kingdom|
|   580547|    21201|        ???|    -390|2011-12-05 09:29:00|      0.0|      null|United Kingdom|
|   580549|   84876B|      found|      66|2011-12-05 09:54:00|      0.0|      null|United Kingdom|
|   580561|    22043|     dotcom|      -9|2011-12-05 10:25:00|      0.0|      null|United Kingdom|
|   580580|    21804|       null|      10|2011-12-05 10:33:00|      0.0|      null|United Kingdom|
|   580586|    21804|     dotcom|       4|2011-12-05 10:34:00|      0.0|      null|United Kingdom|
|   580588|    21808|       null|       5|2011-12-05 10:35:00|      0.0|      null|United Kingdom|
|  C580604

So what happens when I define a schema that explicitly marks the CustomerID column as not nullable?

In [77]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DoubleType, TimestampType

# Same columns and types as the inferred schema, but CustomerID is declared as NOT nullable
myOwnCsv = StructType([
    StructField("InvoiceNo",StringType(),True),
    StructField("StockCode",StringType(),True),
    StructField("Description",StringType(),True),
    StructField("Quantity",IntegerType(),True),
    StructField("InvoiceDate",TimestampType(),True),
    StructField("UnitPrice",DoubleType(),True),
    StructField("CustomerID",DoubleType(),False),
    StructField("Country",StringType(),True)
])

csvData2 = spark.read\
    .option("header", "true")\
    .format("csv")\
    .schema(myOwnCsv)\
    .load("./data/day-007/retail-data/by-day/*.csv")

csvData2.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



It seems that Spark does not care about anything like not-null constraints. Even though I have defined an explicit schema, I cannot rely on Spark enforcing it. This can be an issue, especially for key columns. In fact, the nullable attribute in the schema is just a hint for the query optimizer.

So I have to enforce not-null constraints myself, and I have several options to handle this problem.

## Option 1: Filtering out Rows Having Null Values
This is the most radical solution, and I would only use it if records missing a CustomerID have no value at all for my purpose, so that loading them just wastes storage.

In [78]:
filteredData = csvData.where("CustomerID is not null")

filteredData.where("CustomerID is null").show(10)

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+



With datasets that have many columns to check for null values, it would annoy me to list dozens of filter rules. The `dropna()` function makes this more convenient. If I need to ensure that every column always has a non-null value, I can do it like this:

In [79]:
dropedNA = csvData.dropna(how="any")
dropedNA.where("CustomerID is null").show(10)

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+



If I'm less strict and only want to get rid of completely empty records, I just need to set `how="all"`.

In [80]:
dropedNA = csvData.dropna(how="all")
dropedNA.where("CustomerID is null").show(10)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   580546|    23406|      CHECK|     -21|2011-12-05 09:27:00|      0.0|      null|United Kingdom|
|   580547|    21201|        ???|    -390|2011-12-05 09:29:00|      0.0|      null|United Kingdom|
|   580549|   84876B|      found|      66|2011-12-05 09:54:00|      0.0|      null|United Kingdom|
|   580561|    22043|     dotcom|      -9|2011-12-05 10:25:00|      0.0|      null|United Kingdom|
|   580580|    21804|       null|      10|2011-12-05 10:33:00|      0.0|      null|United Kingdom|
|   580586|    21804|     dotcom|       4|2011-12-05 10:34:00|      0.0|      null|United Kingdom|
|   580588|    21808|       null|       5|2011-12-05 10:35:00|      0.0|      null|United Kingdom|
|  C580604

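I also have the option to apply this logic to only a subset of columns where null values are an issue. With `how="all"`, a row is dropped only if all columns in the subset are null, so rows that merely lack a CustomerID survive.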

In [81]:
dropedNA = csvData.dropna(how="all", subset=["InvoiceNo", "CustomerID"])
dropedNA.where("CustomerID is null").show(10)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   580546|    23406|      CHECK|     -21|2011-12-05 09:27:00|      0.0|      null|United Kingdom|
|   580547|    21201|        ???|    -390|2011-12-05 09:29:00|      0.0|      null|United Kingdom|
|   580549|   84876B|      found|      66|2011-12-05 09:54:00|      0.0|      null|United Kingdom|
|   580561|    22043|     dotcom|      -9|2011-12-05 10:25:00|      0.0|      null|United Kingdom|
|   580580|    21804|       null|      10|2011-12-05 10:33:00|      0.0|      null|United Kingdom|
|   580586|    21804|     dotcom|       4|2011-12-05 10:34:00|      0.0|      null|United Kingdom|
|   580588|    21808|       null|       5|2011-12-05 10:35:00|      0.0|      null|United Kingdom|
|  C580604

## Option 2: Dropping Columns Having Null Values
This approach preserves all data rows, but I completely lose an attribute of my data. If I care about missing values in a column, that column is likely to be relevant to me; otherwise I wouldn't care.

In [82]:
droppedData = csvData.drop("CustomerID")

droppedData.show(10)

+---------+---------+--------------------+--------+-------------------+---------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|       Country|
+---------+---------+--------------------+--------+-------------------+---------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|United Kingdom|
|   580538|    21544|SKULLS  WATER TRA...|      48|2011-12-05 08:38:00|     0.85|United Kingdom|
|   580538|    23126|FELTCRAFT GIRL AM...|       8|2011-12-05 08:38:00|     4.95|United Kingdom|
|   580538|    21833|CAMOUFLAG

## Option 3: Replacing Nulls by Default Values

In [83]:
from pyspark.sql.functions import col, lit, coalesce

replacedData = csvData.select(
    "InvoiceNo",
    "StockCode",
    "Description",
    "Quantity",
    "InvoiceDate",
    "UnitPrice",
    coalesce(col("CustomerID"), lit(-1.0)).alias("CustomerID"),
    "Country")

replacedData.where(col("CustomerID") == -1.0).show(10)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   580546|    23406|      CHECK|     -21|2011-12-05 09:27:00|      0.0|      -1.0|United Kingdom|
|   580547|    21201|        ???|    -390|2011-12-05 09:29:00|      0.0|      -1.0|United Kingdom|
|   580549|   84876B|      found|      66|2011-12-05 09:54:00|      0.0|      -1.0|United Kingdom|
|   580561|    22043|     dotcom|      -9|2011-12-05 10:25:00|      0.0|      -1.0|United Kingdom|
|   580580|    21804|       null|      10|2011-12-05 10:33:00|      0.0|      -1.0|United Kingdom|
|   580586|    21804|     dotcom|       4|2011-12-05 10:34:00|      0.0|      -1.0|United Kingdom|
|   580588|    21808|       null|       5|2011-12-05 10:35:00|      0.0|      -1.0|United Kingdom|
|  C580604

Again, Spark gives me a way to apply the same default to many columns at once: `fillna()` instead of `coalesce()`.

In [84]:
filledData = csvData.fillna(-1, subset=["InvoiceNo", "StockCode", "CustomerID"])

filledData.where(col("CustomerID") == -1.0).show(10)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   580546|    23406|      CHECK|     -21|2011-12-05 09:27:00|      0.0|      -1.0|United Kingdom|
|   580547|    21201|        ???|    -390|2011-12-05 09:29:00|      0.0|      -1.0|United Kingdom|
|   580549|   84876B|      found|      66|2011-12-05 09:54:00|      0.0|      -1.0|United Kingdom|
|   580561|    22043|     dotcom|      -9|2011-12-05 10:25:00|      0.0|      -1.0|United Kingdom|
|   580580|    21804|       null|      10|2011-12-05 10:33:00|      0.0|      -1.0|United Kingdom|
|   580586|    21804|     dotcom|       4|2011-12-05 10:34:00|      0.0|      -1.0|United Kingdom|
|   580588|    21808|       null|       5|2011-12-05 10:35:00|      0.0|      -1.0|United Kingdom|
|  C580604