
In [1]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8 (headless JDK).
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()
!pip install pyspark


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # only col is used below; avoids a wildcard import

spark = SparkSession \
      .builder \
      .appName('SelectColumns').getOrCreate()

df = spark.read.csv('/content/data2.csv', header=True, inferSchema=True)

df.show(1)

+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2020.09|     BDCQ.SED1RA|  Northland|      65520|              65904|             384|               0.6|           953|                   959|            6|            0.6|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
only showing top 1 row



In [3]:
## Filtering Rows Using the ‘filter’ Function

df1 = df.filter(col('region_name') == 'Northland')

df1.show(4)

+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2020.09|     BDCQ.SED1RA|  Northland|      65520|              65904|             384|               0.6|           953|                   959|            6|            0.6|
|2020.12|     BDCQ.SED1RA|  Northland|      67629|              68137|             508|               0.8|           974|                   982|            8|            0.8|
|2021.03|     BDCQ.SED1RA|  Northland|      67128|              67640|             512|               0.8|           950|    

In [4]:
## Filtering Rows Using the ‘where’ Function

## ‘where’ is an alias for ‘filter’, so the two can be used interchangeably.
## Like ‘filter’, it takes a boolean Column expression (or a SQL expression string)
## and returns a new DataFrame containing only the rows that satisfy the condition.

df1 = df.where(col('region_name').isin(['Northland']))

df1.show(4)

+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2020.09|     BDCQ.SED1RA|  Northland|      65520|              65904|             384|               0.6|           953|                   959|            6|            0.6|
|2020.12|     BDCQ.SED1RA|  Northland|      67629|              68137|             508|               0.8|           974|                   982|            8|            0.8|
|2021.03|     BDCQ.SED1RA|  Northland|      67128|              67640|             512|               0.8|           950|    

In [6]:
## Filtering Rows Using SQL Queries

df.createOrReplaceTempView('data')

df1 = spark.sql("SELECT count(*) FROM data WHERE region_name = 'Northland'")

df1.show()

+--------+
|count(1)|
+--------+
|      10|
+--------+



In [9]:
## Combining Multiple Filter Conditions

df1 = df.filter((col('region_name') == 'Northland') & (col('total_earnings') > 1000))

df1.show()

+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2021.06|     BDCQ.SED1RA|  Northland|      68288|              69064|             776|               1.1|          1001|                  1015|           14|            1.4|
|2021.09|     BDCQ.SED1RA|  Northland|      68905|              69401|             496|               0.7|          1012|                  1020|            8|            0.8|
|2021.12|     BDCQ.SED1RA|  Northland|      71012|              71494|             482|               0.7|          1093|    