# 6.2 Deal with null values 

Null values are very common in data science, and all your code should handle them gracefully. There are three main strategies for handling null values:

1. Keep the null values in the data frame. Every function that works with the data frame must then handle null values gracefully.
2. Remove all rows that contain null values.
3. Use imputation to replace null values with non-null values.

In [6]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import lit, col, when, concat, udf
import os

In [4]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]").appName("RemoveDuplicates").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RemoveDuplicates") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1') \
                      .getOrCreate()

## 6.2.1 Handle null in data frame creation
When you read source data from a file, or create a dataframe from a list, you can use a schema to control how null values are handled.

In the schema definition, each column of a DataFrame has a **nullable property that can be set to True or False**. If nullable is set to False, the column cannot contain null values.

In the following example, the columns id and zip_code are set to nullable=False. If your data contains null in either of these two columns, you can't create a dataframe from it.

In [37]:
data1 = [
        (1, None, "STANDARD", None, "PR", 30100),
        (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
        (3, 709, None, "BDA SAN LUIS", "PR", 3700),
        (4, 76166, "UNIQUE", "CINGULAR WIRELESS", "TX", 84000),
        (5, 76177, "STANDARD", None, "TX", None)
    ]

schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("zip_code", LongType(), False),
        StructField("type", StringType(), True),
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
        StructField("population", IntegerType(), True),
    ])
df1 = spark.createDataFrame(data1, schema=schema)

df1.printSchema()
df1.show()

ValueError: field zip_code: This field is not nullable, but got None

The error message is clear: a non-nullable column cannot contain a null value.
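
Because createDataFrame stops at the first offending value, it can help to locate every violation before building the dataframe. The following is a minimal plain-Python sketch, not part of the Spark API: `find_null_violations` is a hypothetical helper that scans the row tuples for None values in the columns declared non-nullable.

```python
from typing import Any, List, Sequence, Set, Tuple


def find_null_violations(
    rows: Sequence[Tuple[Any, ...]],
    column_names: Sequence[str],
    non_nullable: Set[str],
) -> List[Tuple[int, str]]:
    """Return (row_index, column_name) for every None found in a non-nullable column."""
    violations = []
    for row_index, row in enumerate(rows):
        for name, value in zip(column_names, row):
            if name in non_nullable and value is None:
                violations.append((row_index, name))
    return violations


# First two rows of data1, in the same column order as the schema
sample_rows = [
    (1, None, "STANDARD", None, "PR", 30100),
    (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
]
column_names = ["id", "zip_code", "type", "city", "state", "population"]

# Reports the missing zip_code in the first row (index 0)
print(find_null_violations(sample_rows, column_names, {"id", "zip_code"}))
```

With a real schema, the set of non-nullable columns could be derived with `{f.name for f in schema.fields if not f.nullable}` instead of being written by hand.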

In [11]:
# now let's replace None with 703; you will see that the dataframe is created correctly
data2 = [
        (1, 703, "STANDARD", None, "PR", 30100),
        (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
        (3, 709, None, "BDA SAN LUIS", "PR", 3700),
        (4, 76166, "UNIQUE", "CINGULAR WIRELESS", "TX", 84000),
        (5, 76177, "STANDARD", None, "TX", None)
    ]

df2 = spark.createDataFrame(data2, schema=schema)
df2.printSchema()
df2.show()

root
 |-- id: integer (nullable = false)
 |-- zip_code: long (nullable = false)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|               null|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|               null|   TX|      null|
+---+--------+--------+-------------------+-----+----------+



## 6.2.2 Handle null value with built in spark function
The built-in Spark functions handle null values gracefully, so we don't need to add null checks around them.

- isNull(): called on a column, returns true if the value is null. It can be used in a filter to select or remove rows with null values.

**The default behavior of the Spark built-in functions is that if either, or both, of the operand columns are null, the function returns null.**

In [12]:
# we can check if a value of a column is null or not.
df_check_null=df2.withColumn("type_is_null",df2.type.isNull())
df_check_null.show()

+---+--------+--------+-------------------+-----+----------+------------+
| id|zip_code|    type|               city|state|population|type_is_null|
+---+--------+--------+-------------------+-----+----------+------------+
|  1|     703|STANDARD|               null|   PR|     30100|       false|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|        true|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|        true|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|       false|
|  5|   76177|STANDARD|               null|   TX|      null|       false|
+---+--------+--------+-------------------+-----+----------+------------+



In [13]:
# if either, or both, of the operand columns are null, then the function returns null.

df2.withColumn("city_and_type", concat(df2.city, df2.type)).show()

+---+--------+--------+-------------------+-----+----------+--------------------+
| id|zip_code|    type|               city|state|population|       city_and_type|
+---+--------+--------+-------------------+-----+----------+--------------------+
|  1|     703|STANDARD|               null|   PR|     30100|                null|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|                null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|                null|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|CINGULAR WIRELESS...|
|  5|   76177|STANDARD|               null|   TX|      null|                null|
+---+--------+--------+-------------------+-----+----------+--------------------+



In [14]:
# We can override the default null handling by adding our own null-testing logic.
# if city is null and type is not null, we replace the city value with " "
# if city is not null and type is null, we replace the type value with " "
# if both are null, return null
# if both are not null, concat normally
df2.withColumn("city_and_type",
                  when(df2.city.isNull() & ~df2.type.isNull(), concat(lit(" "), df2.type))
                  .when(~df2.city.isNull() & df2.type.isNull(), concat(df2.city, lit(" ")))
                  .when(df2.city.isNull() & df2.type.isNull(), None)
                  .otherwise(concat(df2.city, df2.type))).show(truncate=False)

+---+--------+--------+-------------------+-----+----------+-----------------------+
|id |zip_code|type    |city               |state|population|city_and_type          |
+---+--------+--------+-------------------+-----+----------+-----------------------+
|1  |703     |STANDARD|null               |PR   |30100     | STANDARD              |
|2  |704     |null    |PASEO COSTA DEL SUR|PR   |null      |PASEO COSTA DEL SUR    |
|3  |709     |null    |BDA SAN LUIS       |PR   |3700      |BDA SAN LUIS           |
|4  |76166   |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |CINGULAR WIRELESSUNIQUE|
|5  |76177   |STANDARD|null               |TX   |null      | STANDARD              |
+---+--------+--------+-------------------+-----+----------+-----------------------+



## 6.2.3 Equality check with null values
In the example below, you can see that if either, or both, of the operands are null, == returns null, not a boolean.

In some cases, you want == to behave like this:
- When one value is null and the other is not null, return False
- When both values are null, return True

eqNullSafe(other): called on a column, it takes another column (or a value) as argument and returns a boolean following the rules above.

In [21]:
# you can notice the type_equality_test column contains null values
df2.withColumn("type_test", lit("STANDARD")) \
   .withColumn("type_equality_test", df2.type == col("type_test")) \
   .show()

+---+--------+--------+-------------------+-----+----------+---------+------------------+
| id|zip_code|    type|               city|state|population|type_test|type_equality_test|
+---+--------+--------+-------------------+-----+----------+---------+------------------+
|  1|     703|STANDARD|               null|   PR|     30100| STANDARD|              true|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null| STANDARD|              null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700| STANDARD|              null|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000| STANDARD|             false|
|  5|   76177|STANDARD|               null|   TX|      null| STANDARD|              true|
+---+--------+--------+-------------------+-----+----------+---------+------------------+



In [22]:
# Now with our null testing logic, we don't have null in column type_equality_test anymore

df3 = df2.withColumn("type_test", lit("STANDARD")) \
          .withColumn("type_equality_test", 
                         when(df2.type.isNull() & col("type_test").isNull(), True)
                         .when(df2.type.isNull() | col("type_test").isNull(), False)
                         .otherwise(df2.type == col("type_test")))

df3.show()

+---+--------+--------+-------------------+-----+----------+---------+------------------+
| id|zip_code|    type|               city|state|population|type_test|type_equality_test|
+---+--------+--------+-------------------+-----+----------+---------+------------------+
|  1|     703|STANDARD|               null|   PR|     30100| STANDARD|              true|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null| STANDARD|             false|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700| STANDARD|             false|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000| STANDARD|             false|
|  5|   76177|STANDARD|               null|   TX|      null| STANDARD|              true|
+---+--------+--------+-------------------+-----+----------+---------+------------------+
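
The when chain above reproduces what eqNullSafe already does, so `df2.type.eqNullSafe(col("type_test"))` could replace it. The plain-Python sketch below only models the two comparison rules on single values, so the difference is easy to see; `sql_equals` and `null_safe_equals` are illustrative names, not Spark functions.

```python
from typing import Any, Optional


def sql_equals(left: Any, right: Any) -> Optional[bool]:
    """Model of ==: a null operand makes the result null (None)."""
    if left is None or right is None:
        return None
    return left == right


def null_safe_equals(left: Any, right: Any) -> bool:
    """Model of eqNullSafe: always returns a boolean."""
    if left is None and right is None:
        return True
    if left is None or right is None:
        return False
    return left == right


# Compare a few values of the type column against "STANDARD" and against null
for type_value, other in [("STANDARD", "STANDARD"), (None, "STANDARD"), (None, None)]:
    print(type_value, other, sql_equals(type_value, other), null_safe_equals(type_value, other))

# In the notebook, the equivalent Spark expression would be:
# df2.withColumn("type_equality_test", df2.type.eqNullSafe(col("type_test")))
```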



## 6.2.4 Handle null value inside your udf

We have seen that the Spark built-in functions handle null values gracefully. What about the UDFs that you define yourself? By default they do not: a UDF receives None for every null input, and it is up to you to handle it.


In [23]:
# here we define a udf that takes a string and returns "hi " + the string
@udf(returnType=StringType())
def hi_city_bad(city_name: str) -> str:
    return "hi " + city_name

In [25]:
# Now we use the above udf to create a new column. It fails with a TypeError: the udf expects a string,
# but it receives None for the rows where city is null.
df2.withColumn("hi_city", hi_city_bad(df2.city)).show()

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 200, in _batched
    for item in iterator:
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-23-b900aee5ce1b>", line 4, in hi_city_bad
TypeError: can only concatenate str (not "NoneType") to str


In [26]:
# To improve the udf, we need to handle the None input. In the example below, we simply follow
# the default behaviour of the Spark built-in functions: if the input argument is null, we return null too.
@udf(returnType=StringType())
def hi_city(city_name: str) -> str:
    return None if city_name is None else "hi " + city_name

In [27]:
# This time, we get the new column hi_city. 
df2.withColumn("hi_city", hi_city(df2.city)).show()

+---+--------+--------+-------------------+-----+----------+--------------------+
| id|zip_code|    type|               city|state|population|             hi_city|
+---+--------+--------+-------------------+-----+----------+--------------------+
|  1|     703|STANDARD|               null|   PR|     30100|                null|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|hi PASEO COSTA DE...|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|     hi BDA SAN LUIS|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|hi CINGULAR WIRELESS|
|  5|   76177|STANDARD|               null|   TX|      null|                null|
+---+--------+--------+-------------------+-----+----------+--------------------+
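
When several UDFs need the same guard, the None check can be factored out. The sketch below assumes a hypothetical `null_safe` decorator (not part of PySpark) that returns None as soon as any argument is None, mirroring the default behaviour of the built-in functions. It is plain Python, so it can be tested without a Spark session and then wrapped with udf.

```python
from functools import wraps
from typing import Any, Callable, Optional


def null_safe(func: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Return None without calling func when any positional argument is None."""
    @wraps(func)
    def wrapper(*args: Any) -> Optional[Any]:
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@null_safe
def hi_city_plain(city_name: str) -> str:
    # The body can assume city_name is a real string
    return "hi " + city_name


print(hi_city_plain("BDA SAN LUIS"))
print(hi_city_plain(None))

# In the notebook, the guarded function could then be registered as a udf:
# hi_city_udf = udf(hi_city_plain, StringType())
```

Keeping the guard in a decorator leaves the function body free of null checks, at the cost of treating every argument the same way. A UDF that must accept null for some of its arguments still needs its own logic.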



## 6.2.5 Remove null values

Sometimes you don't want to keep null values in your dataframe, so you need to remove the rows that contain them. Spark provides two equivalent methods to do so.

- df.na.drop(how='any', thresh=None, subset=None): removes rows with null values in the DataFrame columns. It has three optional parameters:
  - how: takes the value 'any' or 'all'. With 'any', a row is dropped if it contains a null in any column. With 'all', a row is dropped only if all its columns are null. Default is 'any'.
  - thresh: takes an int. Rows that have fewer than thresh non-null values are dropped. When set, it overrides how. Default is None.
  - subset: the list of column names to check for null values. Default is None, which means all columns.
- df.dropna(): equivalent to df.na.drop()
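
How the three parameters interact is easier to see on plain tuples. The following sketch is a plain-Python model of the rule, written for illustration only (`drop_nulls` is a hypothetical name, and Spark does not execute it this way): a row is kept when it has at least thresh non-null values among the checked columns, where how='any' corresponds to requiring all of them and how='all' to requiring at least one.

```python
from typing import Any, List, Optional, Sequence, Tuple

Row = Tuple[Any, ...]


def drop_nulls(
    rows: Sequence[Row],
    columns: Sequence[str],
    how: str = "any",
    thresh: Optional[int] = None,
    subset: Optional[Sequence[str]] = None,
) -> List[Row]:
    """Keep the rows that have enough non-null values in the checked columns."""
    checked = list(subset) if subset is not None else list(columns)
    indexes = [columns.index(name) for name in checked]
    if thresh is None:
        # 'any' requires every checked column to be non-null, 'all' requires only one
        thresh = len(indexes) if how == "any" else 1
    return [
        row for row in rows
        if sum(row[i] is not None for i in indexes) >= thresh
    ]


columns = ["id", "zip_code", "type", "city", "state", "population"]
rows = [
    (1, 703, "STANDARD", None, "PR", 30100),
    (2, 704, None, "PASEO COSTA DEL SUR", "PR", None),
    (3, 709, None, "BDA SAN LUIS", "PR", 3700),
    (4, 76166, "UNIQUE", "CINGULAR WIRELESS", "TX", 84000),
    (5, 76177, "STANDARD", None, "TX", None),
]

# ids of the rows kept when at least one non-null value is required in population
kept_ids = [row[0] for row in drop_nulls(rows, columns, subset=["population"], thresh=1)]
print(kept_ids)
```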

In [28]:
# we use drop() without arguments, it removes all rows that have null values
# we only have one row left.    
df2.na.drop().show(truncate=False)

+---+--------+------+-----------------+-----+----------+
|id |zip_code|type  |city             |state|population|
+---+--------+------+-----------------+-----+----------+
|4  |76166   |UNIQUE|CINGULAR WIRELESS|TX   |84000     |
+---+--------+------+-----------------+-----+----------+



In [29]:
# With how="all", drop() removes only the rows where every column is null.
# No row qualifies here, so nothing is removed.
df2.na.drop(how="all").show(truncate=False)

+---+--------+--------+-------------------+-----+----------+
|id |zip_code|type    |city               |state|population|
+---+--------+--------+-------------------+-----+----------+
|1  |703     |STANDARD|null               |PR   |30100     |
|2  |704     |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709     |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166   |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177   |STANDARD|null               |TX   |null      |
+---+--------+--------+-------------------+-----+----------+



In [31]:
# We use drop() with subset arguments, it removes all rows that have null values on the selected columns
df2.na.drop(subset=["population", "type"]).show(truncate=False)

+---+--------+--------+-----------------+-----+----------+
|id |zip_code|type    |city             |state|population|
+---+--------+--------+-----------------+-----+----------+
|1  |703     |STANDARD|null             |PR   |30100     |
|4  |76166   |UNIQUE  |CINGULAR WIRELESS|TX   |84000     |
+---+--------+--------+-----------------+-----+----------+



In [33]:
# With thresh=1 and subset=["population"], a row is kept only if it has at least one non-null value
# among the selected columns, so the rows where population is null are dropped.
df2.na.drop(subset=["population"],thresh=1).show(truncate=False)

+---+--------+--------+-----------------+-----+----------+
|id |zip_code|type    |city             |state|population|
+---+--------+--------+-----------------+-----+----------+
|1  |703     |STANDARD|null             |PR   |30100     |
|3  |709     |null    |BDA SAN LUIS     |PR   |3700      |
|4  |76166   |UNIQUE  |CINGULAR WIRELESS|TX   |84000     |
+---+--------+--------+-----------------+-----+----------+



In [34]:
# note that dropna() is equivalent to df.na.drop()
# dropna() removes all rows that have null values
df2.dropna().show(truncate=False)


+---+--------+------+-----------------+-----+----------+
|id |zip_code|type  |city             |state|population|
+---+--------+------+-----------------+-----+----------+
|4  |76166   |UNIQUE|CINGULAR WIRELESS|TX   |84000     |
+---+--------+------+-----------------+-----+----------+



## 6.2.6 Imputation of null value
Sometimes we can't simply remove the rows that contain null values, because their non-null columns are too important to lose. In that case we replace each null value with a value that will not distort the statistics. We call this process imputation.

Spark provides two functions, **fillna and fill**, for imputation. df.fillna() and df.na.fill() are aliases of each other and return the same result.
- fillna(value, subset=None): replaces null/None with a specific value. It has two arguments:
  - value: an int, float, string, bool, or dict. This value replaces the null/None values.
  - subset: optional list of column names in which to replace null/None values.
- fill(value, subset=None): same arguments as fillna

Note that the most important part of imputation is choosing a value that will not affect your analysis. We do not address that problem here; we only show how to insert the imputation value into the dataframe correctly.

**The type of the imputation value must match the column type that we want to fill**

In [38]:
# We call fill(value=0) without column names: it replaces null with 0 in every numeric column.
# String columns are left untouched because the value type does not match.
df2.na.fill(value=0).show()


+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|               null|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|         0|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|               null|   TX|         0|
+---+--------+--------+-------------------+-----+----------+



In [40]:
# note: in the example below nothing happens to the city column, because 0 does not match its string type
df2.na.fill(value=0, subset=["city"]).show()

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|               null|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|               null|   TX|      null|
+---+--------+--------+-------------------+-----+----------+



In [41]:
# We use fill(subset=[...]) to specify which columns we want to fill.
# Note that the two operations below return the same result here, because type and city
# are the only string columns that contain null.
df2.na.fill(value=" ").show()
df2.na.fill(value=" ", subset=["type","city"]).show()

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|                   |   PR|     30100|
|  2|     704|        |PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|        |       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|                   |   TX|      null|
+---+--------+--------+-------------------+-----+----------+

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|                   |   PR|     30100|
|  2|     704|        |PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|        |       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|                   |   TX|      null|
+---+--------+--------+-------------------+-----+----------+

In [42]:
# we can use any string as the imputation value for a string column
df2.na.fill(value="unknown", subset=["city"]).show()

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|            unknown|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|            unknown|   TX|      null|
+---+--------+--------+-------------------+-----+----------+



In [43]:
# If you need a different imputation value for each column, you can fill several columns with different values.

# The first solution is to pass a dict as the value argument. Each key is a column name, and each value is
# the value you want to fill that column with.
df2.na.fill(value={"city": "unknown", "population": 0}).show()
    

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|            unknown|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|         0|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|            unknown|   TX|         0|
+---+--------+--------+-------------------+-----+----------+



In [44]:
# The second solution is less elegant. We chain two fill calls to achieve the same result as above.
df2.na.fill("unknown", ["city"]).na.fill(0, ["population"]).show()

+---+--------+--------+-------------------+-----+----------+
| id|zip_code|    type|               city|state|population|
+---+--------+--------+-------------------+-----+----------+
|  1|     703|STANDARD|            unknown|   PR|     30100|
|  2|     704|    null|PASEO COSTA DEL SUR|   PR|         0|
|  3|     709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|   76177|STANDARD|            unknown|   TX|         0|
+---+--------+--------+-------------------+-----+----------+
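
When a dataframe has many columns, the dict passed to fill can be built from the column types instead of being typed by hand. The sketch below is an illustration under assumptions: `build_fill_values` is a hypothetical helper, the default values are placeholders rather than recommended imputation values, and the (name, type) pairs are written out in the form that `df2.dtypes` returns.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_fill_values(
    dtypes: List[Tuple[str, str]],
    defaults_by_type: Dict[str, Any],
    skip: Iterable[str] = (),
) -> Dict[str, Any]:
    """Map each column to the default of its type, ignoring skipped or unlisted types."""
    skipped = set(skip)
    return {
        name: defaults_by_type[dtype]
        for name, dtype in dtypes
        if dtype in defaults_by_type and name not in skipped
    }


# Column names and type names as df2.dtypes would list them
dtypes = [
    ("id", "int"),
    ("zip_code", "bigint"),
    ("type", "string"),
    ("city", "string"),
    ("state", "string"),
    ("population", "int"),
]

# Placeholder defaults: "unknown" for strings, 0 for ints; id is left out on purpose
fill_values = build_fill_values(dtypes, {"string": "unknown", "int": 0}, skip=["id"])
print(fill_values)

# In the notebook, the dict could then be passed to fill:
# df2.na.fill(value=fill_values).show()
```

Because each value in the dict is chosen from the column's own type, the rule that the imputation value must match the column type is satisfied by construction.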

