# Overview

Datasets often contain more information than is required for the immediate analysis. Filtering limits the records in a dataset to those that match specified parameters.

In [0]:
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType

In [0]:
sdf = spark.createDataFrame(
[
  ('Robin',  'Science', 95),
  ('Nathan', 'Science', 78),
  ('Anna',   'Science', 88),
  ('Sonia',  'Science', 87),
  ('Robin' , 'English', 95),
  ('Nathan', 'English', 80),
  ('Anna'  , 'English', 87),
  ('Sonia' , 'English', 91)
],
  ['student_name', 'subject_name', 'subject_score']
)

display(sdf)

student_name,subject_name,subject_score
Robin,Science,95
Nathan,Science,78
Anna,Science,88
Sonia,Science,87
Robin,English,95
Nathan,English,80
Anna,English,87
Sonia,English,91


# Filtering Spark DataFrames

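Filters are performed using the .where() or .filter() functions. Note: .where() is an alias for .filter(), so the two behave identically. In the examples here, the condition must reference the column as a Column object, either with f.col() or through the DataFrame itself: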

Using the .where() function:

In [0]:
# using the pyspark.sql.functions col() function
sdf.where(f.col('subject_score') > 90).show()  # f.col() turns the column name into a Column

# using attribute access on the DataFrame
sdf.where(sdf.subject_score > 90).show()

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     Science|           95|
|       Robin|     English|           95|
|       Sonia|     English|           91|
+------------+------------+-------------+

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     Science|           95|
|       Robin|     English|           95|
|       Sonia|     English|           91|
+------------+------------+-------------+



Alternatively, using the .filter() function:

In [0]:
# using the pyspark.sql.functions col() function
sdf.filter(f.col('subject_score') > 90).show()  # f.col() turns the column name into a Column

# using attribute access on the DataFrame
sdf.filter(sdf.subject_score > 90).show()

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     Science|           95|
|       Robin|     English|           95|
|       Sonia|     English|           91|
+------------+------------+-------------+

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     Science|           95|
|       Robin|     English|           95|
|       Sonia|     English|           91|
+------------+------------+-------------+



The .like() function filters on strings that match a SQL LIKE pattern, where % matches any sequence of characters:

In [0]:
sdf.filter(f.col("subject_name").like("%gli%")).show()

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     English|           95|
|      Nathan|     English|           80|
|        Anna|     English|           87|
|       Sonia|     English|           91|
+------------+------------+-------------+



The .isin() function enables filtering based on a specified list:

In [0]:
list_of_values = ['Sonia','Robin']

sdf.filter(f.col("student_name").isin(list_of_values)).show()

# prefixing the condition with "~" negates it, returning all records not in the list
sdf.filter(~f.col("student_name").isin(list_of_values)).show()

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|       Robin|     Science|           95|
|       Sonia|     Science|           87|
|       Robin|     English|           95|
|       Sonia|     English|           91|
+------------+------------+-------------+

+------------+------------+-------------+
|student_name|subject_name|subject_score|
+------------+------------+-------------+
|      Nathan|     Science|           78|
|        Anna|     Science|           88|
|      Nathan|     English|           80|
|        Anna|     English|           87|
+------------+------------+-------------+



Scripting for filtering based on Nulls can be found here: https://github.com/mattlibonati/PySpark-And-Databricks/blob/main/6.%20Handling%20Null%20Values.ipynb