# Spark Imputer

`Imputer` is an estimator for completing missing values: fitting it on a DataFrame produces a model that fills in the missing entries.
Real-world datasets often contain missing values, encoded as nulls, blanks, NaNs, or other placeholders (in SQL, for example, a missing value is denoted by NULL).
There are several ways to handle missing values:

1) Delete every instance that has a missing feature. This is often a poor choice, because the valid values in the other features of those instances are lost as well.

2) For a missing feature, compute the average of the observed values of that feature and replace the missing values with it.

3) More generally, impute the missing values, i.e., infer them from the known part of the data. Option 2 is the simplest form of imputation.
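The cost of option 1 can be made concrete with a minimal plain-Python sketch (no Spark involved). It uses the same toy rows as the DataFrame built later in this notebook; the helper `is_missing` is a hypothetical name introduced only for this illustration.

```python
import math

# Same toy rows as used later in this notebook: (id, col1, col2).
rows = [
    (1, 12.0, 5.0),
    (2, 7.0, 10.0),
    (3, 10.0, 12.0),
    (4, 5.0, float("nan")),
    (5, 6.0, None),
    (6, float("nan"), float("nan")),
    (7, None, None),
]


def is_missing(value):
    """Hypothetical helper: treat both None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Option 1: keep only the rows in which every feature is present.
complete_rows = [r for r in rows if not any(is_missing(v) for v in r[1:])]
kept_ids = {r[0] for r in complete_rows}

# Count the valid feature values that are discarded with the dropped rows.
discarded_valid = sum(
    1
    for r in rows
    if r[0] not in kept_ids
    for v in r[1:]
    if not is_missing(v)
)

print(len(complete_rows), "of", len(rows), "rows kept")
print(discarded_valid, "valid feature values discarded")
```

Rows 4 and 5 each carry a usable `col1` value that deletion throws away, which is the information loss that imputation avoids.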

Imputer replaces each missing value with the mean or the median of the column in which it is located (newer Spark versions also support the mode).
The input columns should be of numeric type.

Currently Imputer does not support categorical features and may create incorrect values for a categorical feature.

Note that the mean/median/mode value is computed after filtering out missing values. Values equal to the `missingValue` parameter (NaN by default) are treated as missing, and so are all null values in the input columns, so both are imputed. The median is computed with `pyspark.sql.DataFrame.approxQuantile()` using a relative error of 0.001.
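The "filter first, then compute the statistic" rule can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the function names `observed` and `impute` are hypothetical, and the median here is exact, whereas Spark uses an approximate quantile.

```python
import math
from statistics import mean, median

# The two feature columns of the toy dataset used in this notebook.
col1 = [12.0, 7.0, 10.0, 5.0, 6.0, float("nan"), None]
col2 = [5.0, 10.0, 12.0, float("nan"), None, float("nan"), None]


def is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or math.isnan(value)


def observed(values):
    """Return only the values that are actually present."""
    return [v for v in values if not is_missing(v)]


def impute(values, strategy="mean"):
    """Replace missing entries with a statistic of the observed entries."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"unsupported strategy: {strategy}")
    stat = mean if strategy == "mean" else median
    surrogate = stat(observed(values))  # computed after filtering
    return [surrogate if is_missing(v) else v for v in values]


print(impute(col1, "mean"))
print(impute(col2, "median"))
```

Because the missing entries are excluded before the statistic is computed, they do not pull the mean or median toward any placeholder value.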

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-ml-imputer").master("local[*]").getOrCreate()

In [4]:
df = spark.createDataFrame([
    (1, 12.0, 5.0),
    (2, 7.0, 10.0),
    (3, 10.0, 12.0),
    (4, 5.0, float("nan")),
    (5, 6.0, None),
    (6, float("nan"), float("nan")),
    (7, None, None)
    ], ["id", "col1", "col2"])
    
df.show(truncate=False)

+---+----+----+
|id |col1|col2|
+---+----+----+
|1  |12.0|5.0 |
|2  |7.0 |10.0|
|3  |10.0|12.0|
|4  |5.0 |NaN |
|5  |6.0 |null|
|6  |NaN |NaN |
|7  |null|null|
+---+----+----+



In [5]:
from pyspark.ml.feature import Imputer
imputer = Imputer(inputCols=["col1", "col2"], outputCols=["col1_out", "col2_out"])
model = imputer.fit(df)
transformed = model.transform(df)
transformed.show(truncate=False)

+---+----+----+--------+--------+
|id |col1|col2|col1_out|col2_out|
+---+----+----+--------+--------+
|1  |12.0|5.0 |12.0    |5.0     |
|2  |7.0 |10.0|7.0     |10.0    |
|3  |10.0|12.0|10.0    |12.0    |
|4  |5.0 |NaN |5.0     |9.0     |
|5  |6.0 |null|6.0     |9.0     |
|6  |NaN |NaN |8.0     |9.0     |
|7  |null|null|8.0     |9.0     |
+---+----+----+--------+--------+



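# Strategy

The default strategy is `mean`, which produced the output above. `setStrategy` selects a different statistic; the next cell switches to `median` and refits the model.
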
In [6]:
imputer.setStrategy("median")
model = imputer.fit(df)
transformed = model.transform(df)
transformed.show(truncate=False)

+---+----+----+--------+--------+
|id |col1|col2|col1_out|col2_out|
+---+----+----+--------+--------+
|1  |12.0|5.0 |12.0    |5.0     |
|2  |7.0 |10.0|7.0     |10.0    |
|3  |10.0|12.0|10.0    |12.0    |
|4  |5.0 |NaN |5.0     |10.0    |
|5  |6.0 |null|6.0     |10.0    |
|6  |NaN |NaN |7.0     |10.0    |
|7  |null|null|7.0     |10.0    |
+---+----+----+--------+--------+

