## PySpark Handling Missing Values
- Dropping the Columns
- Dropping the Rows
- Parameters of the Drop Functions
- Handling missing values by Mean, Median and Mode

In [1]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('Practice').getOrCreate()

In [2]:
spark.read.csv('test2.csv', header=True, inferSchema=True)

DataFrame[NAME: string, AGE: int, EXPERIENCE: int, SALARY: int]

In [3]:
df_pyspark = spark.read.csv('test2.csv', header=True, inferSchema=True)
df_pyspark.show()

+-------+----+----------+------+
|   NAME| AGE|EXPERIENCE|SALARY|
+-------+----+----------+------+
|    ABI|  31|        10| 30000|
|BANERJI|  15|         8| 25000|
| CHARLI|  30|         8| 20000|
|   DOMC|  25|         6| 20100|
| HARSHA|  29|         3| 15000|
|   PAUL|  32|         5| 18250|
|  MAHAV|NULL|      NULL| 40500|
|   NULL|  31|         9| 12599|
|   NULL|  29|      NULL|  NULL|
+-------+----+----------+------+



- Drop the rows that contain null values

In [4]:
df_pyspark.na.drop().show()

+-------+---+----------+------+
|   NAME|AGE|EXPERIENCE|SALARY|
+-------+---+----------+------+
|    ABI| 31|        10| 30000|
|BANERJI| 15|         8| 25000|
| CHARLI| 30|         8| 20000|
|   DOMC| 25|         6| 20100|
| HARSHA| 29|         3| 15000|
|   PAUL| 32|         5| 18250|
+-------+---+----------+------+



In [6]:
### how="any" (the default): drop a row if any of its columns is null
df_pyspark.na.drop(how="any").show()

+-------+---+----------+------+
|   NAME|AGE|EXPERIENCE|SALARY|
+-------+---+----------+------+
|    ABI| 31|        10| 30000|
|BANERJI| 15|         8| 25000|
| CHARLI| 30|         8| 20000|
|   DOMC| 25|         6| 20100|
| HARSHA| 29|         3| 15000|
|   PAUL| 32|         5| 18250|
+-------+---+----------+------+



In [None]:
### thresh
df_pyspark.na.drop(how="any", thresh=2).show() # Keep rows with at least 2 non-null values; thresh overrides how

+-------+----+----------+------+
|   NAME| AGE|EXPERIENCE|SALARY|
+-------+----+----------+------+
|    ABI|  31|        10| 30000|
|BANERJI|  15|         8| 25000|
| CHARLI|  30|         8| 20000|
|   DOMC|  25|         6| 20100|
| HARSHA|  29|         3| 15000|
|   PAUL|  32|         5| 18250|
|  MAHAV|NULL|      NULL| 40500|
|   NULL|  31|         9| 12599|
+-------+----+----------+------+



In [8]:
### subset
df_pyspark.na.drop(how="any", subset=['EXPERIENCE']).show() # Drop only the rows where EXPERIENCE is null

+-------+---+----------+------+
|   NAME|AGE|EXPERIENCE|SALARY|
+-------+---+----------+------+
|    ABI| 31|        10| 30000|
|BANERJI| 15|         8| 25000|
| CHARLI| 30|         8| 20000|
|   DOMC| 25|         6| 20100|
| HARSHA| 29|         3| 15000|
|   PAUL| 32|         5| 18250|
|   NULL| 31|         9| 12599|
+-------+---+----------+------+
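
The three parameters shown above (how, thresh, subset) can be modeled in plain Python to see why each call keeps the rows it does. This is only a sketch of the rule, not PySpark's implementation: `COLUMNS`, `ROWS`, `keep_row` and `drop` are hypothetical names, and the rows are copied from the `test2.csv` output shown earlier.

```python
# Plain-Python model of the na.drop() row rule (a sketch, not PySpark code).
COLUMNS = ["NAME", "AGE", "EXPERIENCE", "SALARY"]
ROWS = [
    ("ABI", 31, 10, 30000),
    ("BANERJI", 15, 8, 25000),
    ("CHARLI", 30, 8, 20000),
    ("DOMC", 25, 6, 20100),
    ("HARSHA", 29, 3, 15000),
    ("PAUL", 32, 5, 18250),
    ("MAHAV", None, None, 40500),
    (None, 31, 9, 12599),
    (None, 29, None, None),
]


def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if the row survives the drop."""
    # Only the columns named in subset are inspected (all columns by default).
    cols = subset if subset is not None else COLUMNS
    values = [row[COLUMNS.index(c)] for c in cols]
    non_null = sum(v is not None for v in values)
    if thresh is not None:
        # thresh takes precedence over how: keep rows with enough non-null values.
        return non_null >= thresh
    if how == "any":
        # Drop the row if any inspected column is null.
        return non_null == len(values)
    if how == "all":
        # Drop the row only if every inspected column is null.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


def drop(rows, **kwargs):
    return [r for r in rows if keep_row(r, **kwargs)]


print(len(drop(ROWS)))                          # default how="any"
print(len(drop(ROWS, how="any", thresh=2)))     # thresh wins over how
print(len(drop(ROWS, subset=["EXPERIENCE"])))   # only EXPERIENCE is checked
```

The row counts from this model match the three tables above: 6 rows for the default, 8 rows with thresh=2, and 7 rows with subset=['EXPERIENCE'].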



- Filling the Missing Values

In [9]:
df_pyspark.na.fill('Missing Values').show() # A string value fills nulls in string columns only

+--------------+----+----------+------+
|          NAME| AGE|EXPERIENCE|SALARY|
+--------------+----+----------+------+
|           ABI|  31|        10| 30000|
|       BANERJI|  15|         8| 25000|
|        CHARLI|  30|         8| 20000|
|          DOMC|  25|         6| 20100|
|        HARSHA|  29|         3| 15000|
|          PAUL|  32|         5| 18250|
|         MAHAV|NULL|      NULL| 40500|
|Missing Values|  31|         9| 12599|
|Missing Values|  29|      NULL|  NULL|
+--------------+----+----------+------+
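
In the output above only NAME was filled, while the nulls in AGE, EXPERIENCE and SALARY remained. The reason is that the replacement value is applied only to columns of a compatible type. The plain-Python sketch below models that behaviour under a simplifying assumption (an exact type match per column); `SCHEMA` and `fill` are hypothetical names and not part of PySpark.

```python
# Plain-Python model of type-aware filling (a sketch, not PySpark code).
# Assumption: a column is filled only when its type equals the value's type.
SCHEMA = {"NAME": str, "AGE": int, "EXPERIENCE": int, "SALARY": int}


def fill(row, value):
    """Replace None only in columns whose declared type matches type(value)."""
    return {
        col: value if (v is None and SCHEMA[col] is type(value)) else v
        for col, v in row.items()
    }


# The last row of the sample data.
row = {"NAME": None, "AGE": 29, "EXPERIENCE": None, "SALARY": None}

filled_str = fill(row, "Missing Values")  # touches NAME only
filled_int = fill(row, 0)                 # touches the integer columns only

print(filled_str)
print(filled_int)
```

In PySpark itself, `na.fill` also accepts a dict that maps column names to replacement values, which allows a different value per column in a single call.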



- Handling the Missing Values with an Imputer

In [12]:
df_pyspark.show()

+-------+----+----------+------+
|   NAME| AGE|EXPERIENCE|SALARY|
+-------+----+----------+------+
|    ABI|  31|        10| 30000|
|BANERJI|  15|         8| 25000|
| CHARLI|  30|         8| 20000|
|   DOMC|  25|         6| 20100|
| HARSHA|  29|         3| 15000|
|   PAUL|  32|         5| 18250|
|  MAHAV|NULL|      NULL| 40500|
|   NULL|  31|         9| 12599|
|   NULL|  29|      NULL|  NULL|
+-------+----+----------+------+



In [15]:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['AGE', 'EXPERIENCE', 'SALARY'],
    outputCols=["{}_imputed".format(c) for c in ['AGE', 'EXPERIENCE', 'SALARY']]).setStrategy("mean")

In [16]:
## Add the imputed columns to the DataFrame
imputer.fit(df_pyspark).transform(df_pyspark).show()

+-------+----+----------+------+-----------+------------------+--------------+
|   NAME| AGE|EXPERIENCE|SALARY|AGE_imputed|EXPERIENCE_imputed|SALARY_imputed|
+-------+----+----------+------+-----------+------------------+--------------+
|    ABI|  31|        10| 30000|         31|                10|         30000|
|BANERJI|  15|         8| 25000|         15|                 8|         25000|
| CHARLI|  30|         8| 20000|         30|                 8|         20000|
|   DOMC|  25|         6| 20100|         25|                 6|         20100|
| HARSHA|  29|         3| 15000|         29|                 3|         15000|
|   PAUL|  32|         5| 18250|         32|                 5|         18250|
|  MAHAV|NULL|      NULL| 40500|         27|                 7|         40500|
|   NULL|  31|         9| 12599|         31|                 9|         12599|
|   NULL|  29|      NULL|  NULL|         29|                 7|         22681|
+-------+----+----------+------+-----------+------------------+--------------+

- Similarly, you can switch the strategy from mean to median or mode by calling `setStrategy("median")` or `setStrategy("mode")` on the Imputer.
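
The imputed values in the table above (27, 7 and 22681) can be checked by hand. The sketch below recomputes the column means from the non-null values in the sample data. It assumes the fractional part is dropped when the mean is written back to an integer column, which is consistent with the output shown; `mean_as_int` is a hypothetical helper.

```python
# Recompute the mean-imputed values from the non-null entries of each column.
age = [31, 15, 30, 25, 29, 32, 31, 29]
experience = [10, 8, 8, 6, 3, 5, 9]
salary = [30000, 25000, 20000, 20100, 15000, 18250, 40500, 12599]


def mean_as_int(values):
    """Mean of the values, with the fractional part dropped."""
    return int(sum(values) / len(values))


print(sum(age) / len(age))      # 27.75, the exact mean of AGE
print(mean_as_int(age))         # 27
print(mean_as_int(experience))  # 7
print(mean_as_int(salary))      # 22681
```

Because the columns are integers, the AGE mean of 27.75 appears as 27 in AGE_imputed. Casting the columns to a floating-point type before imputing would preserve the fractional part.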