### 02_MissingData

Using `where`, `isNull`, `isnan`, `dropna`, and `fillna` to find, drop, and fill missing data in PySpark dataframes.

In [14]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName('PySparkmissing').getOrCreate()
import pandas as pd

In [15]:
spdfA = spark.read.csv('02_MockDataset.csv',inferSchema=True,header=True)
spdfA.printSchema()

root
 |-- id: integer (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- SPAM: integer (nullable = true)
 |-- numvar: double (nullable = true)



In [24]:
def count_nullsnan(df):
    """Count rows that are null or NaN in each non-string column."""
    null_counts = []
    for cname, ctype in df.dtypes:   # df.dtypes is a list of (column name, column type) pairs
        # Skip string columns: the NaN check is meant for numeric columns.
        # String columns can still hold nulls; they are simply not counted here.
        if ctype != 'string':
            nulls = df.where(df[cname].isNull() | isnan(df[cname])).count()
            null_counts.append((cname, nulls))   # (column name, null/NaN count)
    return null_counts

In [26]:
nullcount = count_nullsnan(spdfA)
nullcount

[('id', 0), ('SPAM', 7), ('numvar', 7)]

In [28]:
spdfA.count()

100


### Dropping Null Values

Now that we know what's in our dataframe, there are three things we can do with the null values: ignore them, drop them, or replace them.

Remember, PySpark dataframes are immutable, so we can't change the original dataframe in place.

Every operation returns a new dataframe, though we can rebind the existing name with `df = df.some_operation()`, which ends up functionally equivalent.

The `df.dropna()` method takes two arguments of interest here, `how` and `subset`. `how` can be `'any'` or `'all'`: the first drops a row if any value in it is null, the second drops a row only if all values are.

The `subset` argument takes a list of columns to check for null values. 
It does not actually subset the dataframe; it only checks the listed columns, 
then drops the entire row from the dataframe if that subset meets the criteria. 
It can be left off to check all columns for null.


In [30]:
df_drops = spdfA.dropna(how='all', subset=['SPAM', 'numvar'])
df_drops.count()

93

### Replacing Null Values

The line below replaces null values in the columns 'SPAM' and 'numvar' with the value we specify, in this case 0. 

To verify, we re-run the null-counting function from above on the new dataframe:

In [32]:
df_fill = spdfA.fillna(0, subset=['SPAM','numvar'])

count_nullsnan(df_fill)

[('id', 0), ('SPAM', 0), ('numvar', 0)]

In [33]:
## TODO: imputation