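# Session 3: Data Types

> Hands-on practice with the data types used in Spark.

## Table of Contents
* [1. Literal types](#1.-Literal-types)
* [2. Working with Boolean types](#2.-Working-with-Boolean-types)
* [3. Working with numeric types](#3.-Working-with-numeric-types)
* [4. Working with string types](#4.-Working-with-string-types)
* [5. Regular expressions](#5.-Regular-expressions)
* [6. Working with date and timestamp types](#6.-Working-with-date-and-timestamp-types)
* [7. Working with null values](#7.-Working-with-null-values)
* [References](#References)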
 


In [1]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from IPython.display import display, display_pretty, clear_output, JSON

spark = (
    SparkSession
    .builder
    .config("spark.sql.session.timeZone", "Asia/Seoul")
    .getOrCreate()
)
# Configure the notebook to render DataFrames as tables
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # enable eager evaluation for display
spark.conf.set("spark.sql.repl.eagerEval.truncate", 100) # maximum number of characters shown per cell

In [2]:
""" Create the DataFrame """
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df.printSchema()
df.createOrReplaceTempView("retail")
df.show(5)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

## 1. Literal types

In [3]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))

5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0
5,five,5.0


## 2. Working with Boolean types
### 2.1 AND conditions

In [4]:
from pyspark.sql.functions import col

x1 = df.where(col("InvoiceNO") != 536365).select("InvoiceNO", "Description")
x2 = df.where("InvoiceNO <> 536365").select("InvoiceNO", "Description")
x3 = df.where("InvoiceNO = 536365").select("InvoiceNO", "Description")

x1.show(2)
x2.show(2)

+---------+--------------------+
|InvoiceNO|         Description|
+---------+--------------------+
|   536366|HAND WARMER UNION...|
|   536366|HAND WARMER RED P...|
+---------+--------------------+
only showing top 2 rows

+---------+--------------------+
|InvoiceNO|         Description|
+---------+--------------------+
|   536366|HAND WARMER UNION...|
|   536366|HAND WARMER RED P...|
+---------+--------------------+
only showing top 2 rows



### 2.2 OR conditions

In [5]:
from pyspark.sql.functions import instr
df.where("UnitPrice > 600 OR instr(Description, 'POSTAGE') >= 1").show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536370|     POST|       POSTAGE|       3|2010-12-01 08:45:00|     18.0|   12583.0|        France|
|   536403|     POST|       POSTAGE|       1|2010-12-01 11:27:00|     15.0|   12791.0|   Netherlands|
|   536527|     POST|       POSTAGE|       1|2010-12-01 13:04:00|     18.0|   12662.0|       Germany|
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



### 2.3 ISIN: whether a value is in a given list

In [6]:
# Use the IN clause in a Spark SQL expression
from pyspark.sql.functions import desc
df.select('StockCode').where("StockCode in ('DOT', 'POST', 'C2')").distinct().show()

+---------+
|StockCode|
+---------+
|      DOT|
|     POST|
|       C2|
+---------+



### 2.4 INSTR: whether a string contains a given substring

In [7]:
from pyspark.sql.functions import *
""" instr function """
df.withColumn("added", instr(df.Description, "POSTAGE")).where("added > 1").show() # 'POSTAGE' starts at the 8th character

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|added|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|    8|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|    8|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
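
The position semantics used above can be sketched in plain Python (this is an analogy, not Spark's implementation): `instr` reports a 1-based position and returns 0 when the substring is absent. The helper name `instr_like` is hypothetical, and null handling is not modeled.

```python
def instr_like(text, substring):
    """Return the 1-based position of the first match, or 0 if there is none."""
    # str.find is 0-based and returns -1 on a miss, so adding 1 gives
    # a 1-based position on a hit and 0 on a miss.
    return text.find(substring) + 1


for description in ["DOTCOM POSTAGE", "POSTAGE", "WHITE METAL LANTERN"]:
    print(description, "->", instr_like(description, "POSTAGE"))
```

This is why the filter `added > 1` keeps only the 'DOTCOM POSTAGE' rows: a description that starts with 'POSTAGE' has position 1, so a filter of `>= 1`, as in 2.2, is needed to keep those rows as well.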



### <font color=green>1. [Basic]</font> Read the CSV file stored at "data/retail-data/by-day/2010-12-01.csv" and
#### 1. Print the schema
#### 2. Print 10 rows
#### 3. Keep the rows whose invoice number (InvoiceNo) is '536365', and
#### 4. whose stock code (StockCode) is one of ('85123A', '84406B', '84029G', '84029E'), and
#### 5. whose unit price (UnitPrice) is at most 2.6 or at least 3.0, then print them

<details><summary>[Exercise 1] Check the expected answer </summary>

> Your answer is correct if it is written along the lines below

```python
df1 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df1.printSchema()
df1.show(10)
answer = df1.where("InvoiceNo = '536365'").where("StockCode in ('85123A', '84406B', '84029G', '84029E')").where("UnitPrice <= 2.6 or UnitPrice >= 3.0")
answer.show()
```

</details>


In [8]:
# Write and run your exercise code here (Shift+Enter)
df1 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df1.printSchema()
df1.show(10)
answer = df1.where("InvoiceNo = '536365'").where("StockCode in ('85123A', '84406B', '84029G', '84029E')").where("UnitPrice <= 2.6 or UnitPrice >= 3.0")
answer.show()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

## 3. Working with numeric types
### 3.1 Writing functions as expressions

In [9]:
from pyspark.sql.functions import expr, pow
df.selectExpr("CustomerID", "pow(Quantity * UnitPrice, 2) + 5 as realQuantity").show(2)

+----------+------------------+
|CustomerID|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows
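
To see where a value such as 239.08999999999997 comes from, the same arithmetic can be reproduced in plain Python, whose floats are IEEE 754 doubles like Spark's `double`. This is a sketch for the first two rows of the dataset (Quantity 6 with UnitPrice 2.55 and 3.39); `real_quantity` is a hypothetical helper name. It uses the `**` operator rather than `pow`, because the wildcard import from `pyspark.sql.functions` in this notebook shadows Python's built-in `pow`.

```python
def real_quantity(quantity, unit_price):
    """Plain-Python mirror of the expression pow(Quantity * UnitPrice, 2) + 5."""
    return (quantity * unit_price) ** 2 + 5


# Quantity and UnitPrice of the first two rows of the dataset
for quantity, unit_price in [(6, 2.55), (6, 3.39)]:
    print(quantity, unit_price, real_quantity(quantity, unit_price))
```

The small deviation from 239.09 is ordinary binary floating-point rounding rather than anything specific to Spark; round the result if it is meant for display.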



### 3.2 Rounding (round), rounding up (ceil), rounding down (floor)

In [10]:
from pyspark.sql.functions import *
df.selectExpr("round(2.5, 0)", "ceil(2.4)", "floor(2.6)").show(1)

+-------------+---------+----------+
|round(2.5, 0)|CEIL(2.4)|FLOOR(2.6)|
+-------------+---------+----------+
|            3|        3|         2|
+-------------+---------+----------+
only showing top 1 row
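
Spark's `round` resolves ties away from zero (HALF_UP), which is why `round(2.5, 0)` gives 3 above, while Spark's `bround` uses banker's rounding (HALF_EVEN), as Python's built-in `round` does. The sketch below models both modes in plain Python with the `decimal` module; the helper names are hypothetical. It calls `builtins.round` explicitly because `from pyspark.sql.functions import *` in this notebook shadows the built-in `round`.

```python
import builtins
import math
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_with_mode(value, scale, mode):
    """Round value to `scale` decimal places using the given decimal rounding mode."""
    quantum = Decimal(10) ** -scale  # scale=0 -> Decimal('1'), scale=2 -> Decimal('0.01')
    return Decimal(str(value)).quantize(quantum, rounding=mode)


def round_half_up(value, scale=0):
    """Ties go away from zero, like Spark's round."""
    return round_with_mode(value, scale, ROUND_HALF_UP)


def round_half_even(value, scale=0):
    """Ties go to the nearest even digit, like Spark's bround."""
    return round_with_mode(value, scale, ROUND_HALF_EVEN)


print(round_half_up(2.5), round_half_even(2.5), builtins.round(2.5))
print(math.ceil(2.4), math.floor(2.6))
```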



### 3.3 Summary statistics

In [11]:
df.describe().show()
df.describe("InvoiceNo").show() # restrict the summary to the given column

+-------+-----------------+------------------+--------------------+------------------+-------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|        InvoiceDate|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+-------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|               3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128|               null| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|               null|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             

### <font color=blue>2. [Intermediate]</font> Read the CSV file stored at "data/retail-data/by-day/2010-12-01.csv" and
#### 1. Print the schema
#### 2. Print 10 rows
#### 3. For the transactions whose invoice number (InvoiceNo) is '536367',
#### 4. add a TotalPrice column computed as total price (TotalPrice) = quantity (Quantity) * unit price (UnitPrice)
#### 5. When computing TotalPrice, discard the fractional part (round down)

<details><summary>[Exercise 2] Check the expected answer </summary>

> Your answer is correct if it is written along the lines below

```python
df2 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df2.printSchema()
df2.show(10)
answer = df2.where("InvoiceNo = '536367'").withColumn("TotalPrice", expr("floor(Quantity * UnitPrice)"))
display(answer)
```

</details>


In [12]:
# Write and run your exercise code here (Shift+Enter)
df2 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df2.printSchema()
df2.show(10)
answer = df2.where("InvoiceNo = '536367'").withColumn("TotalPrice", expr("floor(Quantity * UnitPrice)"))
display(answer)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPrice
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,2010-12-01 08:34:00,1.69,13047.0,United Kingdom,54
536367,22745,POPPY'S PLAYHOUSE BEDROOM,6,2010-12-01 08:34:00,2.1,13047.0,United Kingdom,12
536367,22748,POPPY'S PLAYHOUSE KITCHEN,6,2010-12-01 08:34:00,2.1,13047.0,United Kingdom,12
536367,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,8,2010-12-01 08:34:00,3.75,13047.0,United Kingdom,30
536367,22310,IVORY KNITTED MUG COSY,6,2010-12-01 08:34:00,1.65,13047.0,United Kingdom,9
536367,84969,BOX OF 6 ASSORTED COLOUR TEASPOONS,6,2010-12-01 08:34:00,4.25,13047.0,United Kingdom,25
536367,22623,BOX OF VINTAGE JIGSAW BLOCKS,3,2010-12-01 08:34:00,4.95,13047.0,United Kingdom,14
536367,22622,BOX OF VINTAGE ALPHABET BLOCKS,2,2010-12-01 08:34:00,9.95,13047.0,United Kingdom,19
536367,21754,HOME BUILDING BLOCK WORD,3,2010-12-01 08:34:00,5.95,13047.0,United Kingdom,17
536367,21755,LOVE BUILDING BLOCK WORD,3,2010-12-01 08:34:00,5.95,13047.0,United Kingdom,17


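## 4. Working with string types
### 4.1 Capitalizing the first letter of each word
* `initcap` capitalizes the first letter of every whitespace-separated word and lowercases the remaining letters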

In [13]:
from pyspark.sql.functions import initcap
df.select(initcap(col("Description"))).show(2, False)

+----------------------------------+
|initcap(Description)              |
+----------------------------------+
|White Hanging Heart T-light Holder|
|White Metal Lantern               |
+----------------------------------+
only showing top 2 rows
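
A plain-Python sketch of the behavior seen above, where "T-LIGHT" becomes "T-light": only the first letter of each space-separated word is capitalized. `initcap_like` is a hypothetical helper for illustration, not Spark's implementation.

```python
def initcap_like(text):
    """Capitalize the first letter of each space-separated word, lowercase the rest."""
    # str.capitalize upper-cases the first character and lower-cases the others.
    return " ".join(word.capitalize() for word in text.split(" "))


description = "WHITE HANGING HEART T-LIGHT HOLDER"
print(initcap_like(description))
print(description.title())  # str.title also capitalizes the letter after the hyphen
```

Python's `str.title` is therefore not a drop-in equivalent, because it treats the hyphen as a word boundary.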



### 4.2 Uppercase (upper) and lowercase (lower)

In [14]:
from pyspark.sql.functions import lower, upper
df.selectExpr("Description", "lower(Description)", "upper(Description)").show(2)

+--------------------+--------------------+--------------------+
|         Description|  lower(Description)|  upper(Description)|
+--------------------+--------------------+--------------------+
|WHITE HANGING HEA...|white hanging hea...|WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern| WHITE METAL LANTERN|
+--------------------+--------------------+--------------------+
only showing top 2 rows



### 4.3 Trimming and padding whitespace: lpad/ltrim/rpad/rtrim/trim

In [15]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
    ltrim(lit("   HELLO   ")).alias("ltrim"),
    rtrim(lit("   HELLO   ")).alias("rtrim"),
    trim(lit("   HELLO   ")).alias("trim"),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")
).show(2)

+--------+--------+-----+---+----------+
|   ltrim|   rtrim| trim| lp|        rp|
+--------+--------+-----+---+----------+
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
+--------+--------+-----+---+----------+
only showing top 2 rows
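
The `lp` column above shows that padding to a length shorter than the input truncates it ("HELLO" becomes "HEL"). A minimal plain-Python sketch of that behavior follows; the helper names are hypothetical, and a non-empty pad string is assumed.

```python
def lpad_like(text, length, pad):
    """Left-pad text with `pad` to exactly `length` characters, truncating if longer."""
    if len(text) >= length:
        return text[:length]  # keep only the leading characters
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad_like(text, length, pad):
    """Right-pad text with `pad` to exactly `length` characters, truncating if longer."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(repr(lpad_like("HELLO", 3, " ")), repr(rpad_like("HELLO", 10, " ")))
print(repr(lpad_like("71053", 8, "0")))
# Trimming only space characters, as in the ltrim/rtrim/trim columns above
print(repr("   HELLO   ".lstrip(" ")), repr("   HELLO   ".rstrip(" ")), repr("   HELLO   ".strip(" ")))
```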



### <font color=blue>3. [Intermediate]</font> Read the CSV file stored at "data/retail-data/by-day/2010-12-01.csv" and
#### 1. Print the schema
#### 2. Print 10 rows
#### 3. For the transactions whose invoice number (InvoiceNo) is '536365',
#### 4. print the stock code (StockCode) as an 8-character string, left-padded with 0
#### 5. The zero-padded column must keep the name StockCode
#### 6. The final output must contain only the "InvoiceNo", "StockCode", and "Description" columns

<details><summary>[Exercise 3] Check the expected answer </summary>

> Your answer is correct if it is written along the lines below

```python
df3 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df3.printSchema()
df3.show(10)
answer = df3.where("InvoiceNo = '536365'").select("InvoiceNo", lpad("StockCode", 8, "0").alias("StockCode"), "Description")
display(answer)
```

</details>


In [16]:
# Write and run your exercise code here (Shift+Enter)
df3 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df3.printSchema()
df3.show(10)
answer = df3.where("InvoiceNo = '536365'").select("InvoiceNo", lpad("StockCode", 8, "0").alias("StockCode"), "Description")
display(answer)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

InvoiceNo,StockCode,Description
536365,0085123A,WHITE HANGING HEART T-LIGHT HOLDER
536365,00071053,WHITE METAL LANTERN
536365,0084406B,CREAM CUPID HEARTS COAT HANGER
536365,0084029G,KNITTED UNION FLAG HOT WATER BOTTLE
536365,0084029E,RED WOOLLY HOTTIE WHITE HEART.
536365,00022752,SET 7 BABUSHKA NESTING BOXES
536365,00021730,GLASS STAR FROSTED T-LIGHT HOLDER


## 5. Regular expressions
### 5.1 Replacing words with regexp_replace

In [17]:
from pyspark.sql.functions import regexp_replace
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"), col("Description")).show(2, truncate=False)

+----------------------------------+----------------------------------+
|color_clean                       |Description                       |
+----------------------------------+----------------------------------+
|COLOR HANGING HEART T-LIGHT HOLDER|WHITE HANGING HEART T-LIGHT HOLDER|
|COLOR METAL LANTERN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows
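
A plain alternation also matches inside longer words. The sketch below uses Python's `re` module as a stand-in for illustration (Spark's regexp functions use Java regular expressions, which treat a simple alternation and `\b` the same way). The color list is illustrative and "COLOURED PENCILS" is a made-up description.

```python
import re

# Illustrative color list; the second pattern adds word boundaries
PLAIN = re.compile(r"BLACK|WHITE|RED|GREEN|BLUE")
WHOLE_WORD = re.compile(r"\b(?:BLACK|WHITE|RED|GREEN|BLUE)\b")

for description in ["WHITE METAL LANTERN", "COLOURED PENCILS"]:
    print(PLAIN.sub("COLOR", description), "|", WHOLE_WORD.sub("COLOR", description))
```

With the plain pattern, the "RED" inside "COLOURED" is replaced as well; wrapping the alternation in `\b...\b` limits the replacement to whole words.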



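## 6. Working with date and timestamp types
> If you need to set the time zone, use the Spark SQL setting `spark.sql.session.timeZone`, as the first cell of this notebook does <br>
> `TimestampType` stores microsecond precision, as the `now` column in 6.1 shows; if you need finer precision than that, a workaround such as keeping the value as a long is required <br>

### 6.1 Getting today's date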

In [18]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10) \
    .withColumn("today", current_date()) \
    .withColumn("now", current_timestamp())

dateDF.createOrReplaceTempView("dataTable")
dateDF.printSchema()

dateDF.show(3, False)

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)

+---+----------+--------------------------+
|id |today     |now                       |
+---+----------+--------------------------+
|0  |2021-08-01|2021-08-01 21:52:57.125036|
|1  |2021-08-01|2021-08-01 21:52:57.125036|
|2  |2021-08-01|2021-08-01 21:52:57.125036|
+---+----------+--------------------------+
only showing top 3 rows



### 6.2 Adding and subtracting days

In [19]:
from pyspark.sql.functions import date_sub, date_add
dateDF.select(
    date_sub(col("today"), 5),
    date_add(col("today"), 5)
).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2021-07-27|        2021-08-06|
+------------------+------------------+
only showing top 1 row



### 6.3 Converting strings to dates

In [20]:
from pyspark.sql.functions import to_date, lit

spark.range(5) \
    .withColumn("date", lit("2017-01-01")) \
    .select(to_date(col("date"))) \
    .show(1)

+---------------+
|to_date(`date`)|
+---------------+
|     2017-01-01|
+---------------+
only showing top 1 row



In [21]:
""" A case where a parsing failure returns null for the date """
dateDF.select(to_date(lit("2016-20-12")), to_date(lit("2017-12-11"))).show(1) # month and day are swapped

+---------------------+---------------------+
|to_date('2016-20-12')|to_date('2017-12-11')|
+---------------------+---------------------+
|                 null|           2017-12-11|
+---------------------+---------------------+
only showing top 1 row
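
The silent null above is easy to miss. As a plain-Python sketch of the same idea (an analogy, not Spark code), the hypothetical helper below returns `None` when the text does not fit the expected format, and parses correctly once the format matches the data. In Spark the comparable fix is to pass an explicit format as the second argument of `to_date`, for example `"yyyy-dd-MM"`.

```python
from datetime import date, datetime, timedelta


def to_date_or_none(text, fmt="%Y-%m-%d"):
    """Parse text as a date; return None instead of raising when it does not fit fmt."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(to_date_or_none("2016-20-12"))              # month 20 does not exist
print(to_date_or_none("2017-12-11"))
print(to_date_or_none("2016-20-12", "%Y-%d-%m"))  # explicit year-day-month format

# Day arithmetic comparable to date_sub / date_add in 6.2
today = date(2021, 8, 1)  # the date shown in the outputs above
print(today - timedelta(days=5), today + timedelta(days=5))
```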



### <font color=red>4. [Advanced]</font> Read the CSV file stored at "data/retail-data/by-day/2010-12-01.csv" and
#### 1. Print the schema
#### 2. Print 10 rows
#### 3. Add a load date (LoadDate) column holding the current date in 'yyyy-MM-dd' format
#### 4. Add a column (InvoiceDiff) holding the difference between today and the invoice date (InvoiceDate), using the expression `LoadDate - to_date(InvoiceDate)` (hint: withColumn("column_name", expr("expression")))
#### 5. Print the resulting schema

<details><summary>[Exercise 4] Check the expected answer </summary>

> Your answer is correct if it is written along the lines below

```python
df4 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df4.printSchema()
df4.show(10)
answer = df4.withColumn("LoadDate", current_date()).withColumn("InvoiceDiff", expr("LoadDate - to_date(InvoiceDate)"))
display(answer)
answer.printSchema()
```

</details>


In [22]:
# Write and run your exercise code here (Shift+Enter)
df4 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
)
df4.printSchema()
df4.show(10)
answer = df4.withColumn("LoadDate", current_date()).withColumn("InvoiceDiff", expr("LoadDate - to_date(InvoiceDate)"))
display(answer)
answer.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+----------+-----------------+
|InvoiceNo|StockCode|                        Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|  LoadDate|      InvoiceDiff|
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+----------+-----------------+
|   536365|   85123A| WHITE HANGING HEART T-LIGHT HOLDER|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|2021-08-01|10 years 8 months|
|   536365|    71053|                WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|2021-08-01|10 years 8 months|
|   536365|   84406B|     CREAM CUPID HEARTS COAT HANGER|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|2021-08-01|10 years 8 months|
|   536365|   84029G|KNITTED UNION FLAG HOT WATER BOTTLE|       6|2010-12-01 08:26:00|  

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- LoadDate: date (nullable = false)
 |-- InvoiceDiff: interval (nullable = true)



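## 7. Working with null values
+ When handling null values, being explicit is always better than being implicit
+ Declaring a column as non-nullable is not enforced
+ The nullable flag is only a hint that helps the Spark SQL optimizer handle the column
+ There are two ways to handle null values
    + Drop nulls explicitly
    + Fill nulls with a specific value, globally or per column

### 7.1 Null-handling functions on column values (ifnull, nullif, nvl, nvl2)
+ These are SQL functions and can also be used in DataFrame select expressions
    + ifnull(null, 'return_value') # returns the second value if the first is null, otherwise the first
    + nullif('value', 'value')     # returns null if the two values are equal, otherwise the first
    + nvl(null, 'return_value')    # returns the second value if the first is null, otherwise the first
    + nvl2('not_null', 'return_value', 'else_value') # returns the second value if the first is not null, otherwise the third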

In [23]:
spark.sql("""
SELECT
    ifnull(null, 'return_value'),
    nullif('value', 'value'),
    nvl(null, 'return_value'),
    nvl2('not null', 'return_value', 'else_value')
""").show()

+----------------------------+------------------------+-------------------------+----------------------------------------------+
|ifnull(NULL, 'return_value')|nullif('value', 'value')|nvl(NULL, 'return_value')|nvl2('not null', 'return_value', 'else_value')|
+----------------------------+------------------------+-------------------------+----------------------------------------------+
|                return_value|                    null|             return_value|                                  return_value|
+----------------------------+------------------------+-------------------------+----------------------------------------------+
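
The four functions can be read as small conditionals. The sketch below restates them in plain Python with `None` standing in for null; the `_like` helper names are hypothetical and are not part of any Spark API.

```python
def ifnull_like(value, default):
    """Return default when value is None, otherwise value."""
    return default if value is None else value


nvl_like = ifnull_like  # same behavior under a different name


def nullif_like(left, right):
    """Return None when both values are equal, otherwise the first value."""
    return None if left == right else left


def nvl2_like(value, if_not_null, if_null):
    """Return if_not_null when value is not None, otherwise if_null."""
    return if_not_null if value is not None else if_null


print(
    ifnull_like(None, "return_value"),
    nullif_like("value", "value"),
    nvl_like(None, "return_value"),
    nvl2_like("not null", "return_value", "else_value"),
)
```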



### 7.2 Dropping rows based on null values (na.drop)

In [24]:
df.na.drop()
df.na.drop("any").show(1) # drop a row if any of its columns is null
df.na.drop("all").show(1) # drop a row only if all of its columns are null

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---

In [25]:
# Pass column names as subset; with "all", a row is dropped only when every listed column is null
df.na.drop("all", subset=("StockCode", "InvoiceNo")).show(1)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row
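
The difference between "any", "all", and `subset` can be sketched in plain Python over a list of dictionaries. This is an illustration of the rules only: `drop_null_rows` is a hypothetical helper, and the third row is made up so that every column is null.

```python
def drop_null_rows(rows, how="any", subset=None):
    """Keep rows that survive a na.drop-style filter.

    how="any": drop a row if any inspected column is None.
    how="all": drop a row only if every inspected column is None.
    subset:    column names to inspect (all columns when omitted).
    """
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        is_null = [row[name] is None for name in columns]
        should_drop = any(is_null) if how == "any" else all(is_null)
        if not should_drop:
            kept.append(row)
    return kept


rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "CustomerID": 17850.0},
    {"InvoiceNo": "536544", "StockCode": "21773", "CustomerID": None},
    {"InvoiceNo": None, "StockCode": None, "CustomerID": None},  # made-up row
]

print(len(drop_null_rows(rows, "any")))
print(len(drop_null_rows(rows, "all")))
print(len(drop_null_rows(rows, "all", subset=["StockCode", "InvoiceNo"])))
```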



### 7.3 Filling null values (na.fill)

In [26]:
""" Create a DataFrame that contains nulls """
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, DoubleType

myManualSchema = StructType([
    StructField("string_null", StringType(), True),
    StructField("string2_null", StringType(), True),
    StructField("number_null", DoubleType(), True)
])

myRows = []
myRows.append(Row("Hello", None, float(5))) # null in the second string column (string2_null)
myRows.append(Row(None, "World", None))     # nulls in string_null and number_null

myDf = spark.createDataFrame(myRows, myManualSchema)
myDf.show()

myDf.na.fill( {"number_null": 5.0, "string_null": "not_null"} ).show()

+-----------+------------+-----------+
|string_null|string2_null|number_null|
+-----------+------------+-----------+
|      Hello|        null|        5.0|
|       null|       World|       null|
+-----------+------------+-----------+

+-----------+------------+-----------+
|string_null|string2_null|number_null|
+-----------+------------+-----------+
|      Hello|        null|        5.0|
|   not_null|       World|        5.0|
+-----------+------------+-----------+
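
In the output above, `string2_null` stays null because the fill dictionary has no entry for it. A plain-Python sketch of that per-column rule, using the same two rows; `fill_nulls` is a hypothetical helper, not the Spark implementation.

```python
def fill_nulls(rows, replacements):
    """Replace None only in columns that have an entry in `replacements`."""
    filled = []
    for row in rows:
        filled.append({
            name: replacements[name] if value is None and name in replacements else value
            for name, value in row.items()
        })
    return filled


rows = [
    {"string_null": "Hello", "string2_null": None, "number_null": 5.0},
    {"string_null": None, "string2_null": "World", "number_null": None},
]

for row in fill_nulls(rows, {"number_null": 5.0, "string_null": "not_null"}):
    print(row)
```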



### <font color=green>5. [Basic]</font> Read the CSV file stored at "data/retail-data/by-day/2010-12-01.csv" and
#### 1. Print the schema
#### 2. Print 10 rows
#### 3. Extract and print the rows where the customer ID (CustomerID) or the description (Description) is null
#### 4. Replace a null customer ID (CustomerID) with 0.0, and
#### 5. replace a null description (Description) with the value "NOT MENTIONED"
#### 6. Print the final schema and data

<details><summary>[Exercise 5] Check the expected answer </summary>

> Your answer is correct if it is written along the lines below

```python
df5 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
).where(expr("Description is null or CustomerID is null"))
df5.printSchema()
df5.show(10)
desc_custid_fill = {"Description":"NOT MENTIONED", "CustomerID":0.0}
answer = df5.na.fill(desc_custid_fill)
answer.printSchema()
display(answer)
```

</details>


In [27]:
# Write and run your exercise code here (Shift+Enter)
df5 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data/retail-data/by-day/2010-12-01.csv")
).where(expr("Description is null or CustomerID is null"))
df5.printSchema()
df5.show(10)
desc_custid_fill = {"Description":"NOT MENTIONED", "CustomerID":0.0}
answer = df5.na.fill(desc_custid_fill)
answer.printSchema()
display(answer)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536414|    22139|                null|      56|2010-12-01 11:52:00|      0.0|      null|United Kingdom|
|   536544|    21773|DECORATIVE ROSE B...|       1|2010-12-01 14:32:00|     2.51|      null|United Kingdom|
|   536544|    21774|DECORATIVE CATS B...|       2|2010-12-01 14:32:00|     2.51|      null|United Kingdom|
|   536544|    

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536414,22139,NOT MENTIONED,56,2010-12-01 11:52:00,0.0,0.0,United Kingdom
536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,2010-12-01 14:32:00,2.51,0.0,United Kingdom
536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,2010-12-01 14:32:00,2.51,0.0,United Kingdom
536544,21786,POLKADOT RAIN HAT,4,2010-12-01 14:32:00,0.85,0.0,United Kingdom
536544,21787,RAIN PONCHO RETROSPOT,2,2010-12-01 14:32:00,1.66,0.0,United Kingdom
536544,21790,VINTAGE SNAP CARDS,9,2010-12-01 14:32:00,1.66,0.0,United Kingdom
536544,21791,VINTAGE HEADS AND TAILS CARD GAME,2,2010-12-01 14:32:00,2.51,0.0,United Kingdom
536544,21801,CHRISTMAS TREE DECORATION WITH BELL,10,2010-12-01 14:32:00,0.43,0.0,United Kingdom
536544,21802,CHRISTMAS TREE HEART DECORATION,9,2010-12-01 14:32:00,0.43,0.0,United Kingdom
536544,21803,CHRISTMAS TREE STAR DECORATION,11,2010-12-01 14:32:00,0.43,0.0,United Kingdom


## References

#### 1. [Spark Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
#### 2. [PySpark SQL Modules Documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html)
#### 3. <a href="https://spark.apache.org/docs/3.0.1/api/sql/" target="_blank">PySpark 3.0.1 Builtin Functions</a>
#### 4. [PySpark Search](https://spark.apache.org/docs/latest/api/python/search.html)
#### 5. [Pyspark Functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#module-pyspark.sql.functions)