# 6.2 Deal with null values 

Null values are very common in data science, and all your code should handle them gracefully. There are three main strategies
for handling null values:

1. Keep the null values in the data frame. Every function that works with the data frame must then handle null values gracefully.
2. Remove all null values.
3. Use an imputation process to fill null values with non-null values.

In [6]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import lit, col, when, concat, udf
import os

In [4]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]").appName("RemoveDuplicates").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RemoveDuplicates") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1') \
                      .getOrCreate()

## 6.2.1 Handle null values at data frame creation
When you read source data from a file, or create a data frame from a list, you can use a schema to control how null values are handled.

In the schema definition, each column in a DataFrame has a **nullable property that can be set to True or False**. If nullable is set to False then the column cannot contain null values. 

In the following example, the columns id and zip_code are set to nullable=False. If your data contains a null in either of these two
columns, you cannot create a data frame from it.

In [10]:
data1 = [
        (1, None, "STANDARD", None, "PR", 30100),
        (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
        (3, 709, None, "BDA SAN LUIS", "PR", 3700),
        (4, 76166, "UNIQUE", "CINGULAR WIRELESS", "TX", 84000),
        (5, 76177, "STANDARD", None, "TX", None)
    ]

schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("zip_code", LongType(), False),
        StructField("type", StringType(), True),
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
        StructField("population", IntegerType(), True),
    ])
df1 = spark.createDataFrame(data1, schema=schema)

df1.printSchema()
df1.show()

ValueError: field zip_code: This field is not nullable, but got None

The error message is clear: a non-nullable column cannot contain null values.

In [11]:
# Now let's replace None with 703; the data frame is then created correctly
data2 = [
        (1, 703, "STANDARD", None, "PR", 30100),
        (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
        (3, 709, None, "BDA SAN LUIS", "PR", 3700),
        (4, 76166, "UNIQUE", "CINGULAR WIRELESS", "TX", 84000),
        (5, 76177, "STANDARD", None, "TX", None)
    ]

df2 = spark.createDataFrame(data2, schema=schema)
df2.printSchema()
df2.show()

root
 |-- id: integer (nullable = false)
 |-- zip_code: long (nullable = false)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|               null|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|               null|   TX|      null|
+---+--------+--------+-------------------+-----+----------+



## 6.2.2 Handle null values with built-in Spark functions
The built-in Spark functions handle null values gracefully: they do not fail when they meet a null, so we don't need to add special handling just to avoid errors.

- isNull(): called on a column; returns true if the value is null. It can be used in a filter to keep or remove rows with null values.

**The default behavior of most Spark built-in functions is to return null if either, or both, of the operand
columns are null.**

In [12]:
# we can check if a value of a column is null or not.
df_check_null=df2.withColumn("type_is_null",df2.type.isNull())
df_check_null.show()

+---+--------+--------+-------------------+-----+----------+------------+
| id|zip_code|    type|               city|state|population|type_is_null|
+---+--------+--------+-------------------+-----+----------+------------+
|  1|     703|STANDARD|               null|   PR|     30100|       false|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|        true|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|        true|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|       false|
|  5|   76177|STANDARD|               null|   TX|      null|       false|
+---+--------+--------+-------------------+-----+----------+------------+



In [13]:
# If either, or both, of the operand columns are null, concat returns null.

df2.withColumn("city_and_type", concat(df2.city, df2.type)).show()

+---+--------+--------+-------------------+-----+----------+--------------------+
| id|zip_code|    type|               city|state|population|       city_and_type|
+---+--------+--------+-------------------+-----+----------+--------------------+
|  1|     703|STANDARD|               null|   PR|     30100|                null|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|                null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|                null|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|CINGULAR WIRELESS...|
|  5|   76177|STANDARD|               null|   TX|      null|                null|
+---+--------+--------+-------------------+-----+----------+--------------------+



In [14]:
# We can override the default null handling by adding our own null tests.
# If city is null and type is not null, use " " in place of city
# If city is not null and type is null, use " " in place of type
# If both are null, return null
# If neither is null, concat normally
df2.withColumn("city_and_type",
                  when(df2.city.isNull() & ~df2.type.isNull(), concat(lit(" "), df2.type))
                  .when(~df2.city.isNull() & df2.type.isNull(), concat(df2.city, lit(" ")))
                  .when(df2.city.isNull() & df2.type.isNull(), None)
                  .otherwise(concat(df2.city, df2.type))).show(truncate=False)

+---+--------+--------+-------------------+-----+----------+-----------------------+
|id |zip_code|type    |city               |state|population|city_and_type          |
+---+--------+--------+-------------------+-----+----------+-----------------------+
|1  |703     |STANDARD|null               |PR   |30100     | STANDARD              |
|2  |704     |null    |PASEO COSTA DEL SUR|PR   |null      |PASEO COSTA DEL SUR    |
|3  |709     |null    |BDA SAN LUIS       |PR   |3700      |BDA SAN LUIS           |
|4  |76166   |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |CINGULAR WIRELESSUNIQUE|
|5  |76177   |STANDARD|null               |TX   |null      | STANDARD              |
+---+--------+--------+-------------------+-----+----------+-----------------------+



## 6.2.3 Equality check with null values
As the example below shows, if either, or both, of the operands are null, then == returns null, not a boolean.

In some cases, you want == to behave like this:
- When one value is null and the other is not null, return False
- When both values are null, return True

eqNullSafe(other): called on a column, takes another column as argument, and produces a boolean value following the rules above. When both values are non-null, it behaves like ==.

In [17]:
# you can notice the type_equality_test column contains null values
df2.withColumn("type_test", lit("STANDARD")) \
   .withColumn("type_equality_test", df2.type == col("type_test")) \
   .show()

+---+--------+--------+-------------------+-----+----------+---------+------------------+
| id|zip_code|    type|               city|state|population|type_test|type_equality_test|
+---+--------+--------+-------------------+-----+----------+---------+------------------+
|  1|     703|STANDARD|               null|   PR|     30100| STANDARD|              true|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null| STANDARD|              null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700| STANDARD|              null|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000| STANDARD|             false|
|  5|   76177|STANDARD|               null|   TX|      null| STANDARD|              true|
+---+--------+--------+-------------------+-----+----------+---------+------------------+



In [20]:
# Now with our null testing logic, we don't have null in column type_equality_test anymore

df3 = df2.withColumn("type_test", lit("STANDARD")) \
          .withColumn("type_equality_test", 
                         when(df2.type.isNull() & col("type_test").isNull(), True)
                         .when(df2.type.isNull() | col("type_test").isNull(), False)
                         .otherwise(df2.type == col("type_test")))

df3.show()

+---+--------+--------+-------------------+-----+----------+---------+------------------+
| id|zip_code|    type|               city|state|population|type_test|type_equality_test|
+---+--------+--------+-------------------+-----+----------+---------+------------------+
|  1|     703|STANDARD|               null|   PR|     30100| STANDARD|              true|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null| STANDARD|             false|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700| STANDARD|             false|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000| STANDARD|             false|
|  5|   76177|STANDARD|               null|   TX|      null| STANDARD|              true|
+---+--------+--------+-------------------+-----+----------+---------+------------------+

