## PySpark - filter / where

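- Keeps only the rows that match a condition.
- `filter` and `where` are the same function (`where` is an alias of `filter`), so use whichever you prefer.
- Because of Python operator precedence (`&` and `|` bind more tightly than comparisons such as `>` and `==`), make a habit of wrapping each condition in parentheses.

- Conditions can be written in either SQL style or Python style.

Note:
 Python style uses `&`; SQL style uses the string `AND`.
 
 Python style uses `|`; SQL style uses the string `OR`.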
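
The parenthesization advice above can be checked without Spark. The following is a minimal plain-Python sketch; `age` and `dept_is_it` are hypothetical stand-in integers, not Spark columns. It only shows how Python groups `&` with comparison operators. With real `Column` objects the unparenthesized form usually fails with an error rather than returning a wrong value, because Python ends up trying to convert a column into a bool.

```python
# Plain-Python illustration of why each condition needs its own parentheses.
# `&` and `|` bind more tightly than comparison operators such as `>` and `==`.

age, dept_is_it = 35, 1  # hypothetical stand-in values, not Spark columns

# Without parentheses Python parses this as: age > (30 & dept_is_it) == 1,
# which is the chained comparison (age > 0) and (0 == 1).
without_parens = age > 30 & dept_is_it == 1

# With parentheses each comparison is evaluated first, then combined with &.
with_parens = (age > 30) & (dept_is_it == 1)

print(without_parens)  # False
print(with_parens)     # True
```

The same grouping applies to `|`, so every comparison inside a Python-style filter should sit in its own pair of parentheses.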

**Null filtering**
- Use `isNull()` and `isNotNull()`.
- `== Null` and `== None` do not work (x).
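
Why `== None` returns no rows can be sketched with a toy model of SQL's three-valued logic. This is plain Python written for illustration, not Spark's implementation; the helper names `sql_equals`, `sql_is_null`, and `keep_row` and the sample values are all hypothetical.

```python
from typing import Optional


def sql_equals(left: Optional[int], right: Optional[int]) -> Optional[bool]:
    """Toy model of SQL `=`: any comparison involving NULL yields NULL (None)."""
    if left is None or right is None:
        return None
    return left == right


def sql_is_null(value: Optional[int]) -> bool:
    """Toy model of IS NULL: always answers True or False, never NULL."""
    return value is None


def keep_row(predicate: Optional[bool]) -> bool:
    """A filter keeps a row only when the predicate is exactly True."""
    return predicate is True


salaries = [5200, None, 6800]  # hypothetical sample values

# Comparing with NULL never yields True, so nothing survives the filter.
eq_rows = [s for s in salaries if keep_row(sql_equals(s, None))]

# IS NULL yields a definite True for the missing value.
null_rows = [s for s in salaries if keep_row(sql_is_null(s))]

print(eq_rows)    # []
print(null_rows)  # [None]
```

Under the same model, a comparison such as `age > 30` against a missing age is neither true nor false, which is consistent with null ages dropping out of ordinary comparison filters.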

In [1]:
from pyspark.sql import (
    Row,
    SparkSession)
import pyspark.sql.functions as F

In [2]:
spark=(
    SparkSession
    .builder
    .appName("filter_study")
    .master("spark://spark-master:7077")
    .getOrCreate()
)



In [3]:
df=spark.read.csv(
    "file:///workspace/data/users_info.csv",
    header=True,
    inferSchema=True
)
df.show()
df.printSchema()

                                                                                

+-------+----+-----+----+------+-------+
|user_id|name| dept| age|salary|   city|
+-------+----+-----+----+------+-------+
|      1| Kim|   IT|  29|  5200|  Seoul|
|      2| Lee|   IT|  35|  6800|  Busan|
|      3|Park|   HR|  41|  4500|  Seoul|
|      4|Choi|   HR|  28|  4000|Incheon|
|      5|Jung|Sales|  33|  6100|  Seoul|
|      6| Han|Sales|  39|  7300|  Busan|
|      7| Seo|   IT|  26|  4800|  Seoul|
|      8|Yoon|Sales|  30|  5900|Incheon|
|      9|Kang|   IT|NULL|  6200|  Seoul|
|     10| Lim|   HR|  34|  NULL|  Busan|
+-------+----+-----+----+------+-------+

root
 |-- user_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- city: string (nullable = true)



In [4]:
# filter

In [5]:
df.filter(F.col("age")>30).show()

+-------+----+-----+---+------+-----+
|user_id|name| dept|age|salary| city|
+-------+----+-----+---+------+-----+
|      2| Lee|   IT| 35|  6800|Busan|
|      3|Park|   HR| 41|  4500|Seoul|
|      5|Jung|Sales| 33|  6100|Seoul|
|      6| Han|Sales| 39|  7300|Busan|
|     10| Lim|   HR| 34|  NULL|Busan|
+-------+----+-----+---+------+-----+



In [6]:
df.filter(F.col("salary")>=6000).show()

+-------+----+-----+----+------+-----+
|user_id|name| dept| age|salary| city|
+-------+----+-----+----+------+-----+
|      2| Lee|   IT|  35|  6800|Busan|
|      5|Jung|Sales|  33|  6100|Seoul|
|      6| Han|Sales|  39|  7300|Busan|
|      9|Kang|   IT|NULL|  6200|Seoul|
+-------+----+-----+----+------+-----+



In [7]:
df.filter(F.col("dept")=="IT").show()

+-------+----+----+----+------+-----+
|user_id|name|dept| age|salary| city|
+-------+----+----+----+------+-----+
|      1| Kim|  IT|  29|  5200|Seoul|
|      2| Lee|  IT|  35|  6800|Busan|
|      7| Seo|  IT|  26|  4800|Seoul|
|      9|Kang|  IT|NULL|  6200|Seoul|
+-------+----+----+----+------+-----+



In [8]:
df.filter("age>30").show()

+-------+----+-----+---+------+-----+
|user_id|name| dept|age|salary| city|
+-------+----+-----+---+------+-----+
|      2| Lee|   IT| 35|  6800|Busan|
|      3|Park|   HR| 41|  4500|Seoul|
|      5|Jung|Sales| 33|  6100|Seoul|
|      6| Han|Sales| 39|  7300|Busan|
|     10| Lim|   HR| 34|  NULL|Busan|
+-------+----+-----+---+------+-----+



In [10]:
# String literals inside a SQL expression go in single quotes
df.filter("dept='IT'").show()

+-------+----+----+----+------+-----+
|user_id|name|dept| age|salary| city|
+-------+----+----+----+------+-----+
|      1| Kim|  IT|  29|  5200|Seoul|
|      2| Lee|  IT|  35|  6800|Busan|
|      7| Seo|  IT|  26|  4800|Seoul|
|      9|Kang|  IT|NULL|  6200|Seoul|
+-------+----+----+----+------+-----+



In [15]:
# Using multiple conditions
# Wrap each condition in parentheses!
# Python style uses &; SQL style uses the string AND
# Python style uses |; SQL style uses the string OR

In [14]:
df.filter((F.col("age")>30)&(F.col("dept")=="IT")).show()

+-------+----+----+---+------+-----+
|user_id|name|dept|age|salary| city|
+-------+----+----+---+------+-----+
|      2| Lee|  IT| 35|  6800|Busan|
+-------+----+----+---+------+-----+



In [16]:
df.filter("age>30 AND dept='IT'").show()

+-------+----+----+---+------+-----+
|user_id|name|dept|age|salary| city|
+-------+----+----+---+------+-----+
|      2| Lee|  IT| 35|  6800|Busan|
+-------+----+----+---+------+-----+



In [17]:
df.filter(
    (F.col("dept")=="IT") | (F.col("dept")=="HR")
).show()

+-------+----+----+----+------+-------+
|user_id|name|dept| age|salary|   city|
+-------+----+----+----+------+-------+
|      1| Kim|  IT|  29|  5200|  Seoul|
|      2| Lee|  IT|  35|  6800|  Busan|
|      3|Park|  HR|  41|  4500|  Seoul|
|      4|Choi|  HR|  28|  4000|Incheon|
|      7| Seo|  IT|  26|  4800|  Seoul|
|      9|Kang|  IT|NULL|  6200|  Seoul|
|     10| Lim|  HR|  34|  NULL|  Busan|
+-------+----+----+----+------+-------+



In [18]:
df.filter("dept='IT' OR dept='HR'").show()

+-------+----+----+----+------+-------+
|user_id|name|dept| age|salary|   city|
+-------+----+----+----+------+-------+
|      1| Kim|  IT|  29|  5200|  Seoul|
|      2| Lee|  IT|  35|  6800|  Busan|
|      3|Park|  HR|  41|  4500|  Seoul|
|      4|Choi|  HR|  28|  4000|Incheon|
|      7| Seo|  IT|  26|  4800|  Seoul|
|      9|Kang|  IT|NULL|  6200|  Seoul|
|     10| Lim|  HR|  34|  NULL|  Busan|
+-------+----+----+----+------+-------+



In [19]:
# Filtering out nulls

In [27]:
# No error is raised, but the result is empty: comparing with None evaluates to NULL, never true
df.filter(F.col("salary") == None).show()

+-------+----+----+---+------+----+
|user_id|name|dept|age|salary|city|
+-------+----+----+---+------+----+
+-------+----+----+---+------+----+



In [23]:
df.filter(F.col("salary").isNull()).show()

+-------+----+----+---+------+-----+
|user_id|name|dept|age|salary| city|
+-------+----+----+---+------+-----+
|     10| Lim|  HR| 34|  NULL|Busan|
+-------+----+----+---+------+-----+



In [24]:
df.filter(F.col("age").isNotNull()).show()

+-------+----+-----+---+------+-------+
|user_id|name| dept|age|salary|   city|
+-------+----+-----+---+------+-------+
|      1| Kim|   IT| 29|  5200|  Seoul|
|      2| Lee|   IT| 35|  6800|  Busan|
|      3|Park|   HR| 41|  4500|  Seoul|
|      4|Choi|   HR| 28|  4000|Incheon|
|      5|Jung|Sales| 33|  6100|  Seoul|
|      6| Han|Sales| 39|  7300|  Busan|
|      7| Seo|   IT| 26|  4800|  Seoul|
|      8|Yoon|Sales| 30|  5900|Incheon|
|     10| Lim|   HR| 34|  NULL|  Busan|
+-------+----+-----+---+------+-------+



In [26]:
# Age over 30, excluding nulls (age > 30 alone already drops null ages; isNotNull() makes the intent explicit)
df.filter(
    (F.col("age").isNotNull())&(F.col("age")>30)
).show()

+-------+----+-----+---+------+-----+
|user_id|name| dept|age|salary| city|
+-------+----+-----+---+------+-----+
|      2| Lee|   IT| 35|  6800|Busan|
|      3|Park|   HR| 41|  4500|Seoul|
|      5|Jung|Sales| 33|  6100|Seoul|
|      6| Han|Sales| 39|  7300|Busan|
|     10| Lim|   HR| 34|  NULL|Busan|
+-------+----+-----+---+------+-----+



In [28]:
# where

In [30]:
df.where(
    (F.col("age")>30)&(F.col("city")=="Busan")
).show()

+-------+----+-----+---+------+-----+
|user_id|name| dept|age|salary| city|
+-------+----+-----+---+------+-----+
|      2| Lee|   IT| 35|  6800|Busan|
|      6| Han|Sales| 39|  7300|Busan|
|     10| Lim|   HR| 34|  NULL|Busan|
+-------+----+-----+---+------+-----+



In [31]:
spark.stop()