# Missing Data
* Incomplete or missing values are common in large datasets.
* Three basic options are available:
    * Retain the missing data points.
    * Remove rows containing the missing data points.
    * Replace missing data points with specific value(s).
___

## Option A: Retain the missing data
* Some machine learning algorithms can handle missing values directly, in which case the nulls can be left in place.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("missingdata").getOrCreate()

In [10]:
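# Read the CSV with a header row; inferSchema=True lets Spark detect that Sales is a double
# Alternative without type inference (every column is read as a string):
# df = spark.read.format('csv').option('header','true').load('PySparkDataSets/datawithnull.csv')
df = spark.read.csv("PySparkDataSets/datawithnull.csv", header=True, inferSchema=True)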
df.show()
df.printSchema()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



___
## Option B: Remove the missing data
* Function syntax: `df.na.drop(how='any', thresh=None, subset=None)`
* Parameters:
    * how: 'any' or 'all', default = 'any'
        * 'any' = drop a row if it contains any nulls.
        * 'all' = drop a row only if all its values are null.
    * thresh: int, default = None
        * If specified, drop rows that have fewer than `thresh` non-null values (rows with at least `thresh` non-null values are kept).
        * This overrides the `how` parameter.
    * subset: optional list of column names, default = None
        * If specified, only these columns are checked for nulls.
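The way the three parameters interact can be sketched in plain Python. This is only an illustration of the rule described above, not Spark's implementation: the helper name `keep_row` is hypothetical, and the rows mirror the sample DataFrame shown earlier, with `None` standing in for null.

```python
# Plain-Python sketch of the row-dropping rule; `keep_row` is a hypothetical helper.
def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if a row (dict of column -> value) survives the drop."""
    # Only the columns in `subset` are inspected; default is every column
    columns = subset if subset is not None else list(row)
    non_null = sum(1 for col in columns if row[col] is not None)
    if thresh is not None:
        # thresh takes precedence over how: keep rows with enough non-null values
        return non_null >= thresh
    if how == "any":
        # Keep the row only if none of the inspected values is null
        return non_null == len(columns)
    if how == "all":
        # Keep the row unless every inspected value is null
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")

rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]

kept_any = [r["Id"] for r in rows if keep_row(r)]
kept_all = [r["Id"] for r in rows if keep_row(r, how="all")]
kept_thresh = [r["Id"] for r in rows if keep_row(r, thresh=2)]
kept_subset = [r["Id"] for r in rows if keep_row(r, subset=["Sales"])]

print(kept_any)     # ['emp4']
print(kept_all)     # ['emp1', 'emp2', 'emp3', 'emp4']
print(kept_thresh)  # ['emp1', 'emp3', 'emp4']
print(kept_subset)  # ['emp3', 'emp4']
```

Note that `emp2` survives `how='all'` because its `Id` is not null, but it is removed by `thresh=2` because it has only one non-null value.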

In [3]:
# Default behavior (how='any'): drop every row that contains at least one null
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [4]:
# Keep only rows with at least 2 non-null values (rows with fewer are dropped)
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [5]:
# Drop rows with any missing data
df.na.drop(how='any').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [6]:
# Drop a row only if every value in it is null (no such row here, so nothing is dropped)
df.na.drop(how='all').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [7]:
# Drop rows only when the specified column (Sales) is null
df.na.drop(subset=["Sales"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



___
## Option C: Replace the missing data
* `df.na.fill(value)` replaces nulls only in columns whose data type matches the type of `value`; columns of other types are left unchanged.
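The type-matching rule can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions, not Spark's implementation: the helper `fill_row` and the `SCHEMA` dictionary are hypothetical, only string and double columns are modeled, and the column types are taken from the schema printed earlier.

```python
# Plain-Python sketch of the type-matching rule; `fill_row` and SCHEMA are hypothetical.
# Column types mirror the printed schema: Id and Name are strings, Sales is a double.
SCHEMA = {"Id": "string", "Name": "string", "Sales": "double"}

def fill_row(row, value, subset=None):
    """Return a copy of `row` with None replaced by `value` in type-compatible columns."""
    # Simplification: anything that is not a str is treated as a numeric replacement
    value_kind = "string" if isinstance(value, str) else "double"
    columns = subset if subset is not None else list(row)
    filled = dict(row)
    for col in columns:
        if filled[col] is None and SCHEMA[col] == value_kind:
            # Numeric replacements are stored as doubles, so 0 becomes 0.0
            filled[col] = value if value_kind == "string" else float(value)
    return filled

row = {"Id": "emp2", "Name": None, "Sales": None}
print(fill_row(row, "EMPTY"))  # {'Id': 'emp2', 'Name': 'EMPTY', 'Sales': None}
print(fill_row(row, 0))        # {'Id': 'emp2', 'Name': None, 'Sales': 0.0}
```

A string replacement leaves the numeric `Sales` column untouched, and a numeric replacement leaves the string `Name` column untouched.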

In [8]:
# Replace missing values with a string (only string columns are affected)
df.na.fill('EMPTY').show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2|EMPTY| null|
|emp3|EMPTY|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [13]:
# Replace missing values with a number (only numeric columns are affected; 0 becomes 0.0 in the double column)
df.na.fill(0).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| null|  0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [14]:
# Restrict the replacement to specific columns with subset
df.na.fill('???', subset=['Name']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2|  ???| null|
|emp3|  ???|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



### Replace the missing data with COLUMN MEAN

In [17]:
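# Import the mean aggregate function
from pyspark.sql.functions import mean
# Compute the column mean; collect() returns a list containing a single Row
mean_row = df.select(mean(df['Sales'])).collect()
# Extract the scalar value from the first (and only) Row
Meansales = mean_row[0][0]
Meansales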

400.5
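The result is 400.5 because the `mean` aggregate skips nulls: only the two non-null `Sales` values contribute. The arithmetic can be checked with a plain-Python sketch, where the list `sales` is typed in by hand from the table above and `None` stands in for null.

```python
# Sales column copied from the sample data; None stands in for null
sales = [None, None, 345.0, 456.0]

# Nulls are excluded from both the sum and the count
non_null = [s for s in sales if s is not None]
mean_sales = sum(non_null) / len(non_null)

print(mean_sales)  # 400.5
```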

In [18]:
# Fill missing values in Sales with the column mean
df.na.fill(Meansales, subset=["Sales"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [19]:
# Combine both steps (compute the mean, then fill) into a single statement
df.na.fill(df.select(mean(df['Sales'])).collect()[0][0], subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



___