# PySpark Handling Missing Values

1. Dropping Columns 
2. Dropping Rows
3. Parameters of the Drop Functions
4. Handling Missing Values by Mean, Median, and Mode


In [1]:
import pyspark 
from pyspark.sql import SparkSession

In [29]:
spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv("class-grades.csv",header=True,inferSchema=True)
df_pyspark.show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
|     8|      null|  105.49|   67.5|   99.07| null|
|     8|      83.7|    null|   30.0|   63.15|48.89|
|  null|     81.22|   96.06|  49.38|  105.93|80.56|
|     8|      null|   93.64|   95.0|  107.41|73.89|
|     7|      null|   92.58|  93.12|    null|68.06|
|  null|     95.05|  102.99|  56.25|   99.07| 50.0|
|     7|     72.85|   86.85|   60.0|    null|56.11|
|  null|     84.26|    null|   47.5|   18.52| null|
+------+----------+--------+-------+--------+-----+



In [10]:
import pandas as pd
data = pd.read_csv("class-grades.csv")
data.head()

   Prefix  Assignment  Tutorial  Midterm  TakeHome  Final
0     5.0       57.14     34.09    64.38     51.48   52.5
1     8.0         NaN    105.49     67.5     99.07    NaN
2     8.0        83.7       NaN     30.0     63.15  48.89
3     NaN       81.22     96.06    49.38    105.93  80.56
4     8.0         NaN     93.64     95.0    107.41  73.89


In [14]:
data.isnull().sum()

Prefix        3
Assignment    3
Tutorial      2
Midterm       0
TakeHome      2
Final         2
dtype: int64

## Handling Missing Values

1. **drop**: removes rows that contain null values.

    how = "any" or "all"
        any : drop the row if it contains at least one null
        all : drop the row only if every value in it is null

    thresh = k
              keep a row only if it has at least k non-null values;
              otherwise the row is dropped. When thresh is set, it overrides how.

    subset = list of column names
            look for nulls only in these columns when deciding whether to drop a row
2. **fill**: replaces null values with a value you supply.

    value = the replacement for nulls. It is applied only to columns whose type matches the value.
             (To fill with the mean, median, or mode instead, use the Imputer class:
             from pyspark.ml.feature import Imputer)
    subset = one or more columns to fill; all other columns are left untouched

In [16]:
df_pyspark.na.drop(how="any").show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
+------+----------+--------+-------+--------+-----+



In [18]:
df_pyspark.na.drop(how="all").show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
|     8|      null|  105.49|   67.5|   99.07| null|
|     8|      83.7|    null|   30.0|   63.15|48.89|
|  null|     81.22|   96.06|  49.38|  105.93|80.56|
|     8|      null|   93.64|   95.0|  107.41|73.89|
|     7|      null|   92.58|  93.12|    null|68.06|
|  null|     95.05|  102.99|  56.25|   99.07| 50.0|
|     7|     72.85|   86.85|   60.0|    null|56.11|
|  null|     84.26|    null|   47.5|   18.52| null|
+------+----------+--------+-------+--------+-----+



In [21]:
df_pyspark.na.drop(how="any",thresh=5).show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
|     8|      83.7|    null|   30.0|   63.15|48.89|
|  null|     81.22|   96.06|  49.38|  105.93|80.56|
|     8|      null|   93.64|   95.0|  107.41|73.89|
|  null|     95.05|  102.99|  56.25|   99.07| 50.0|
|     7|     72.85|   86.85|   60.0|    null|56.11|
+------+----------+--------+-------+--------+-----+
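
When `thresh` is given it takes precedence over `how`, which is why rows containing a single null still appear in this output. The rule can be sketched in plain Python with no Spark involved; `rows` mirrors the nine rows of the table and `keep_row` is a hypothetical helper written for this illustration, not a PySpark function.

```python
# Plain-Python illustration of the thresh rule; None stands in for null.
rows = [
    (5, 57.14, 34.09, 64.38, 51.48, 52.5),
    (8, None, 105.49, 67.5, 99.07, None),
    (8, 83.7, None, 30.0, 63.15, 48.89),
    (None, 81.22, 96.06, 49.38, 105.93, 80.56),
    (8, None, 93.64, 95.0, 107.41, 73.89),
    (7, None, 92.58, 93.12, None, 68.06),
    (None, 95.05, 102.99, 56.25, 99.07, 50.0),
    (7, 72.85, 86.85, 60.0, None, 56.11),
    (None, 84.26, None, 47.5, 18.52, None),
]


def keep_row(row, thresh):
    """Return True when the row has at least `thresh` non-null values."""
    return sum(value is not None for value in row) >= thresh


kept = [row for row in rows if keep_row(row, thresh=5)]
print(len(kept))  # 6, matching the six rows in the output above
```

Each row has six columns, so `thresh=5` tolerates at most one null per row.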



In [27]:
df_pyspark.na.fill(value=10000).show()

+------+----------+--------+-------+--------+-------+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|  Final|
+------+----------+--------+-------+--------+-------+
|     5|     57.14|   34.09|  64.38|   51.48|   52.5|
|     8|   10000.0|  105.49|   67.5|   99.07|10000.0|
|     8|      83.7| 10000.0|   30.0|   63.15|  48.89|
| 10000|     81.22|   96.06|  49.38|  105.93|  80.56|
|     8|   10000.0|   93.64|   95.0|  107.41|  73.89|
|     7|   10000.0|   92.58|  93.12| 10000.0|  68.06|
| 10000|     95.05|  102.99|  56.25|   99.07|   50.0|
|     7|     72.85|   86.85|   60.0| 10000.0|  56.11|
| 10000|     84.26| 10000.0|   47.5|   18.52|10000.0|
+------+----------+--------+-------+--------+-------+



## Filters in DataFrame

Suppose I want to find the rows where the midterm mark is greater than or equal to 60

In [32]:
df_pyspark.filter("Midterm>=60").show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
|     8|      null|  105.49|   67.5|   99.07| null|
|     8|      null|   93.64|   95.0|  107.41|73.89|
|     7|      null|   92.58|  93.12|    null|68.06|
|     7|     72.85|   86.85|   60.0|    null|56.11|
+------+----------+--------+-------+--------+-----+



In [34]:
df_pyspark.filter("Midterm>=60").select(["Midterm","Final"]).show()

+-------+-----+
|Midterm|Final|
+-------+-----+
|  64.38| 52.5|
|   67.5| null|
|   95.0|73.89|
|  93.12|68.06|
|   60.0|56.11|
+-------+-----+



Combining two conditions in a PySpark DataFrame filter: join them with `&` and wrap each condition in parentheses

In [36]:
df_pyspark.filter((df_pyspark["Final"]<=60) & 
                  (df_pyspark["Midterm"]>=60)).show()

+------+----------+--------+-------+--------+-----+
|Prefix|Assignment|Tutorial|Midterm|TakeHome|Final|
+------+----------+--------+-------+--------+-----+
|     5|     57.14|   34.09|  64.38|   51.48| 52.5|
|     7|     72.85|   86.85|   60.0|    null|56.11|
+------+----------+--------+-------+--------+-----+

