In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.7:4040
SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1640612567173)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@8752822


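This chapter covers the following topics:

- Running production applications with spark-submit
- Datasets: type-safe structured APIs
- Structured Streaming
- Machine learning and advanced analytics
- RDDs: Spark's low-level APIs
- SparkR
- The third-party package ecosystem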

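# 3.2 Datasets: Type-Safe Structured APIs
The Dataset is Spark's structured API designed to support statically typed code in Java and Scala.<br/>
Datasets provide type safety and are not available in the dynamically typed languages Python and R.<br/>

A DataFrame is a distributed collection of objects of type Row, which can hold tabular data of various types.<br/>
The Dataset API lets you assign the records of a DataFrame to a class you define in Java or Scala and work with them as a collection of typed objects, much like a Java ArrayList or a Scala Seq.<br/>
Because the Dataset API is type-safe, you cannot access the objects in a Dataset as a class other than the one used to initialize it.<br/>
This makes the Dataset API especially useful for large applications in which many software engineers interact through well-defined interfaces.<br/>

The following simple example shows how type-safe functions and DataFrame-style expressions can be combined to write business logic quickly.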

In [25]:
import spark.implicits._
case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)
val flightsDF = spark.read
  .parquet("Downloads/Spark-The-Definitive-Guide/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

import spark.implicits._
defined class Flight
flightsDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
flights: org.apache.spark.sql.Dataset[Flight] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


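One final advantage of Datasets is that calling collect or take returns objects of the type you specified as the Dataset's type parameter, not the Row objects that make up a DataFrame.<br/>
This gives you type safety without further code changes and lets you handle data safely in both local and distributed cluster environments.<br/>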
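
As a minimal sketch of the point above, the following builds a small typed Dataset in memory and filters it with a plain Scala lambda. The `SampleFlight` class, the `session` value, and the sample rows are hypothetical stand-ins for the `Flight` class and the parquet file used in the notebook; the field names are copied from the original example.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session in a notebook, or starts a local one otherwise.
val session = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import session.implicits._

// Hypothetical case class mirroring the fields of Flight.
case class SampleFlight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)

// Hypothetical rows standing in for the parquet data.
val sampleFlights = Seq(
  SampleFlight("United States", "Romania", BigInt(1)),
  SampleFlight("United States", "Canada", BigInt(8)),
  SampleFlight("Egypt", "United States", BigInt(24))
).toDS()

// The lambda receives a SampleFlight, so field access is checked at compile time.
val nonCanada: Array[SampleFlight] = sampleFlights
  .filter(f => f.ORIGIN_COUNTRY_NAME != "Canada")
  .take(5)

// take returns SampleFlight objects, not Row objects.
nonCanada.foreach(f => println(s"${f.ORIGIN_COUNTRY_NAME} -> ${f.DEST_COUNTRY_NAME}"))
```

Because `nonCanada` is an `Array[SampleFlight]`, a misspelled field name in the lambda or in the `println` would fail at compile time rather than at run time.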

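# 3.3 Structured Streaming

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2.<br/>
With Structured Streaming, operations written in batch mode with the structured APIs can be run in a streaming fashion, which reduces latency and enables incremental processing.<br/>

The example uses the retail dataset, which contains specific dates and times.<br/>
We use the files in the by-day directory, where each file holds one day of data.<br/>
Because this is retail data, assume that it is produced by retail stores and sent to a location where the Structured Streaming job can read it.<br/>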


In [27]:
val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("Downloads/Spark-The-Definitive-Guide/data/retail-data/by-day/*.csv")

staticDataFrame.createOrReplaceTempView("retail_data")
val staticSchema = staticDataFrame.schema

staticDataFrame: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,StringType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))


We created a DataFrame by reading the static dataset.<br/>
We also captured the schema of the static dataset.<br/>
How to infer a schema during stream processing is covered in detail in Part V (Chapters 20 to 23).<br/>
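
As an alternative to capturing the schema from a static read, the schema can also be written out by hand. The sketch below mirrors the fields shown in the `staticSchema` output above; the name `retailSchema` is hypothetical, and whether a hand-written schema is preferable depends on how stable the input files are.

```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Hand-written schema with the same fields and types as the inferred staticSchema.
val retailSchema = StructType(Seq(
  StructField("InvoiceNo", StringType, nullable = true),
  StructField("StockCode", StringType, nullable = true),
  StructField("Description", StringType, nullable = true),
  StructField("Quantity", IntegerType, nullable = true),
  StructField("InvoiceDate", StringType, nullable = true),
  StructField("UnitPrice", DoubleType, nullable = true),
  StructField("CustomerID", DoubleType, nullable = true),
  StructField("Country", StringType, nullable = true)
))

// Print the schema as a tree to check it against the inferred one.
println(retailSchema.treeString)
```

Such a value could be passed to `.schema(...)` on a reader in place of `staticSchema`, which avoids a full inference pass over the files.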


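We will add a total cost column and find the days on which each customer spent the most.<br/>
During aggregation, the window function builds windows over a time-series column so that each window contains all the data for a given day.<br/>
Windows are useful for working with dates and timestamps because the processing requirement can be stated as an interval.<br/>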

In [29]:
import org.apache.spark.sql.functions.{window, column, desc, col}
staticDataFrame
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))
  .sum("total_cost")
  .show(30)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   16057.0|{2011-12-05 09:00...|             -37.6|
|   14126.0|{2011-11-29 09:00...| 643.6300000000001|
|   13500.0|{2011-11-16 09:00...| 497.9700000000001|
|   17160.0|{2011-11-08 09:00...| 516.8499999999999|
|   15608.0|{2011-11-11 09:00...|             122.4|
|   15253.0|{2011-11-23 09:00...|             277.6|
|   15124.0|{2011-11-17 09:00...|             93.44|
|   12539.0|{2011-11-17 09:00...|           1050.66|
|   13658.0|{2011-11-30 09:00...| 542.4000000000001|
|   17396.0|{2011-10-31 09:00...|             495.0|
|   13576.0|{2011-11-10 09:00...| 543.3600000000001|
|   15111.0|{2011-11-10 09:00...|329.67999999999995|
|   17419.0|{2011-10-06 09:00...|465.54999999999995|
|   15749.0|{2011-04-18 09:00...|-1462.500000000001|
|   15769.0|{2011-04-18 09:00...|122.03999999999999|
|   18219.0|{2011-04-18 09:00...|            2

import org.apache.spark.sql.functions.{window, column, desc, col}
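
The `window` values in the output above start at 09:00 rather than at midnight. One plausible explanation, assuming the session time zone is UTC+9 (for example `Asia/Seoul`, which is an assumption and is not stated in the notebook), is that tumbling windows are aligned to the Unix epoch in UTC. The following sketch reproduces that arithmetic in plain Scala; `windowStart` is a hypothetical helper for illustration, not Spark's implementation.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

// Floor an epoch-second timestamp to the start of its tumbling window.
// Assumes non-negative timestamps and windows aligned to the Unix epoch.
def windowStart(epochSec: Long, windowSec: Long): Long =
  epochSec - (epochSec % windowSec)

val daySec = 24L * 60 * 60
val seoul = ZoneId.of("Asia/Seoul") // assumed session time zone (UTC+9)

// A purchase at 2010-12-01 14:30 local time in Seoul.
val purchase = LocalDateTime.of(2010, 12, 1, 14, 30).atZone(seoul).toEpochSecond

val startUtc = Instant.ofEpochSecond(windowStart(purchase, daySec))
val startLocal = startUtc.atZone(seoul).toLocalDateTime

println(startUtc)   // 2010-12-01T00:00:00Z
println(startLocal) // 2010-12-01T09:00
```

The window begins at midnight UTC, which is displayed as 09:00 in a UTC+9 session. Spark's `window` function also has an overload that takes a `startTime` offset, which can be used to shift window boundaries.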


The number of shuffle partitions is the number of partitions created after a shuffle.<br/>
The default is 200, but local mode does not have enough executors to need that many partitions, so we reduce the value to 5.<br/>

In [30]:
spark.conf.set("spark.sql.shuffle.partitions", "5")
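
To confirm that the setting took effect, the value can be read back and the partition count of a shuffled result can be inspected. This is a small sketch; `session`, `configured`, and `grouped` are hypothetical names, and the `range`-based data is only a stand-in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session in a notebook, or starts a local one otherwise.
val session = SparkSession.builder().master("local[*]").appName("shuffle-partitions-sketch").getOrCreate()

session.conf.set("spark.sql.shuffle.partitions", "5")
val configured = session.conf.get("spark.sql.shuffle.partitions")
println(s"spark.sql.shuffle.partitions = $configured")

// groupBy forces a shuffle, so the result is laid out in shuffle partitions.
// Note: the count may be lower than 5 if adaptive query execution coalesces
// partitions (assumed to be the default in Spark versions newer than the 3.1.2 used here).
val grouped = session.range(0, 1000).groupBy((col("id") % 10).as("bucket")).count()
println(s"partitions after shuffle: ${grouped.rdd.getNumPartitions}")
```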

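Now that we have seen how this works, let's look at the streaming code.<br/>
The code barely changes; the biggest difference is that it uses the readStream method instead of read.<br/>
We also set the maxFilesPerTrigger option.<br/>
This option specifies the number of files to read per trigger.<br/>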

In [31]:
val streamingDataFrame = spark.readStream
    .schema(staticSchema)
    .option("maxFilesPerTrigger", 1)
    .format("csv")
    .option("header", "true")
    .load("Downloads/Spark-The-Definitive-Guide/data/retail-data/by-day/*.csv")

streamingDataFrame: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]


In [32]:
streamingDataFrame.isStreaming // checks whether the DataFrame is a streaming DataFrame

res12: Boolean = true


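Let's apply the same business logic that we used on the static DataFrame.<br/>
The following code computes the total sales amount per customer for each one-day window.<br/>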

In [34]:
val purchaseByCustomerPerHour = streamingDataFrame
    .selectExpr(
        "CustomerId",
        "UnitPrice * Quantity as total_cost",
        "InvoiceDate")
    .groupBy(
        col("CustomerId"), window(col("InvoiceDate"), "1 day"))
    .sum("total_cost")

purchaseByCustomerPerHour: org.apache.spark.sql.DataFrame = [CustomerId: double, window: struct<start: timestamp, end: timestamp> ... 1 more field]
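
Since the transformation is identical for the static and the streaming DataFrame, it can be factored into a single function. The sketch below is one way to do that; `dailySpendByCustomer`, `session`, and the sample rows are hypothetical and only the column names are taken from the notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

// Reuses the running session in a notebook, or starts a local one otherwise.
val session = SparkSession.builder().master("local[*]").appName("shared-logic-sketch").getOrCreate()
import session.implicits._

// Hypothetical helper: the same logic works for batch and streaming input.
def dailySpendByCustomer(df: DataFrame): DataFrame =
  df.selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
    .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
    .sum("total_cost")

// Hypothetical sample rows with the columns the helper needs.
val sample = Seq(
  (1.0, 2.0, 3, "2010-12-01 10:01:00"),
  (1.0, 1.5, 2, "2010-12-01 10:05:00"),
  (2.0, 4.0, 1, "2010-12-02 10:01:00")
).toDF("CustomerId", "UnitPrice", "Quantity", "InvoiceDate")

val result = dailySpendByCustomer(sample)
result.show(truncate = false)
```

In the notebook, the same function could be called as `dailySpendByCustomer(staticDataFrame)` or `dailySpendByCustomer(streamingDataFrame)`, which keeps the batch and streaming versions from drifting apart.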


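This operation is also lazily evaluated, so a streaming action must be called to start the data flow.<br/>
Streaming actions differ somewhat from ordinary static actions such as count, because the results have to be written somewhere.<br/>
The streaming action used here writes to an in-memory table that is updated after each **trigger**.<br/>
In this example, each trigger corresponds to one file.<br/>
With the complete output mode, Spark rewrites the in-memory table with the full, updated aggregation result after each trigger, so the table always reflects the latest running totals.<br/>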

In [35]:
purchaseByCustomerPerHour.writeStream
    .format("memory") // memory = store the results in an in-memory table
    .queryName("customer_purchases") // name of the in-memory table
    .outputMode("complete") // complete = write the full aggregation result to the table on every trigger
    .start()

res13: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@65f5e745
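
`start()` returns a `StreamingQuery` handle, and a started query keeps running in the background until it is stopped. The following sketch shows one way to keep the handle, inspect it, and stop it. It uses Spark's built-in `rate` test source instead of the retail files so that it stands alone; `session`, `ticks`, and the table name `ticks_sketch` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session in a notebook, or starts a local one otherwise.
val session = SparkSession.builder().master("local[*]").appName("query-handle-sketch").getOrCreate()

// Built-in "rate" test source: continuously emits (timestamp, value) rows.
val ticks = session.readStream.format("rate").option("rowsPerSecond", "5").load()

// Keep the handle returned by start() so the query can be managed later.
val query = ticks.writeStream
  .format("memory")
  .queryName("ticks_sketch") // hypothetical in-memory table name
  .outputMode("append")
  .start()

// Wait up to 3 seconds; returns false if the query is still running.
query.awaitTermination(3000)
println(s"name=${query.name}, active=${query.isActive}")

// All active queries of this session are available through session.streams.
println(session.streams.active.map(_.name).mkString(", "))

// Stop the query once it is no longer needed.
query.stop()
println(s"active after stop=${query.isActive}")
```

Assigning the result of `start()` to a value, rather than leaving it as an anonymous `res` value, makes it easier to stop queries such as `customer_purchases` when the experiment is finished.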


Once the stream has started, we can query the in-memory table to see what the results look like.

In [36]:
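// Backticks refer to the aggregate column. With single quotes the name is a string
// literal, so ORDER BY sorts on a constant; the output captured below came from the
// single-quoted form and is therefore unsorted.
spark.sql("""
    SELECT *
    FROM customer_purchases
    ORDER BY `sum(total_cost)` DESC
    """)
    .show(20)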

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15237.0|{2011-12-08 09:00...|              83.6|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   16811.0|{2011-12-05 09:00...|             232.3|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14506.0|{2011-11-22 09:00...|496.91999999999996|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   13408.0|{2011-11-09 09:00...| 550.4399999999999|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            1
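
A generated column name such as `sum(total_cost)` has to be quoted with backticks in SQL. One way to avoid the quoting altogether is to alias the aggregate, as in the following sketch; `session`, `totals_sketch`, `total_spent`, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Reuses the running session in a notebook, or starts a local one otherwise.
val session = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
import session.implicits._

// Hypothetical sample rows: (CustomerId, total_cost).
val purchases = Seq((1.0, 10.0), (1.0, 5.0), (2.0, 40.0), (3.0, 20.0))
  .toDF("CustomerId", "total_cost")

// Alias the aggregate so that the column has a plain identifier as its name.
val totals = purchases
  .groupBy(col("CustomerId"))
  .agg(sum("total_cost").as("total_spent"))
totals.createOrReplaceTempView("totals_sketch")

// No backticks are needed for the aliased column.
val top = session.sql(
  "SELECT CustomerId, total_spent FROM totals_sketch ORDER BY total_spent DESC")
top.show()
```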

In [41]:
purchaseByCustomerPerHour.writeStream
    .format("console") // console = print the results to the console
    .queryName("customer_purchases_2")
    .outputMode("complete")
    .start()

res19: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@2586e207


-------------------------------------------
Batch: 0
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   12921.0|{2010-12-01 09:00...|             322.4|
|   16583.0|{2010-12-01 09:00...|233.45000000000002|
|   17897.0|{2010-12-01 09:00...|            140.39|
|   12748.0|{2010-12-01 09:00...|              4.95|
|   15350.0|{2010-12-01 09:00...|            115.65|
|   17809.0|{2010-12-01 09:00...|              34.8|
|   13747.0|{2010-12-01 09:00...|              79.6|
|   16250.0|{2010-12-01 09:00...|            226.14|
|   15983.0|{2010-12-01 09:00...|            440.89|
|   17511.0|{2010-12-01 09:00...|           1825.74|
|   14001.0|{2010-12-01 09:00...|            301.24|
|   17460.0|{2010-12-01 09:00...|              19.9|
|   18074.0|{2010-12-01 09:00...|             489.6|
|   12868.0|{2010-12-01 09:00...|             203.3|
| 


-------------------------------------------
Batch: 6
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   13329.0|{2010-12-08 09:00...|             304.2|
|   16250.0|{2010-12-01 09:00...|            226.14|
|   17460.0|{2010-12-01 09:00...|              19.9|
|   13491.0|{2010-12-02 09:00...|              98.9|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   14800.0|{2010-12-05 09:00...| 555.8399999999999|
|   15235.0|{2010-12-05 09:00...| 85.55000000000001|
|   15078.0|{2010-12-06 09:00...| 475.1499999999999|
|   18041.0|{2010-12-02 09:00...| 428.9399999999999|
|   12471.0|{2010-12-02 09:00...|             -17.0|
|   12433.0|{2010-12-08 09:00...|1867.9800000000002|
|   17949.0|{2010-12-03 09:00...|            1314.0|
|


-------------------------------------------
Batch: 12
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   17576.0|{2010-12-13 09:00...| 177.35000000000002|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17220.0|{2010-12-10 09:00...| 317.50000000000006|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12434.0|{2010-12-14 09:00...|-27.749999999999996|
|   18041.0|{2010-12-02 09:00...|  428.9399999999999|
|   17551.0|{2010-12-15 09:00...|             306.84|
|   16565.0|{2010-12-10 09:00...|              173.7|
|   17949.0|{2010-12-03 09:00...|    


-------------------------------------------
Batch: 18
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   17576.0|{2010-12-13 09:00...| 177.35000000000002|
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17220.0|{2010-12-10 09:00...| 317.50000000000006|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   13329.0|{2010-12-20 09:00...|-35.400000000000006|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12434.0|{2010-12-14 09:00...|-27.749999999999996|
|   18041.0|{2010-12-02 09:00...|  42


-------------------------------------------
Batch: 24
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   17576.0|{2010-12-13 09:00...| 177.35000000000002|
|   17368.0|{2011-01-06 09:00...|  563.1500000000001|
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17220.0|{2010-12-10 09:00...| 317.50000000000006|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   13329.0|{2010-12-20 09:00...|-35.400000000000006|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12434.0|{2010-12-14 09:00...|-27.


-------------------------------------------
Batch: 30
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12434.0|{2010-12-14 09:00...|-27.749999999999996|
|   13715.0|{2011-01-05 09:00...|  445.2200000000002|
|   16607.0|{2010-12-15 09:00...|             404.82|
|   13081.0|{2011-01-14 09:00...|-13.200000000000001|
|   15799.0|{2011-01-09 09:00...|    


-------------------------------------------
Batch: 36
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17504.0|{2011-01-21 09:00...| 441.15000000000003|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12727.0|{2011-01-23 09:00...|              514.5|
|   17602.0|{2011-01-19 09:00...|             767.05|
|   12434.0|{2010-12-14 09:00...|-27.749999999999996|
|   13715.0|{2011-01-05 09:00...|  44


-------------------------------------------
Batch: 42
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17504.0|{2011-01-21 09:00...| 441.15000000000003|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14334.0|{2011-01-24 09:00...| 352.41999999999996|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14256.0|{2010-12-10 09:00...|  523.8599999999999|
|   12727.0|{2011-01-23 09:00...|              514.5|
|   17602.0|{2011-01-19 09:00...|             767.05|
|   12434.0|{2010-12-14 09:00...|-27.


-------------------------------------------
Batch: 48
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   14627.0|{2011-02-01 09:00...|-21.849999999999998|
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   12373.0|{2011-02-01 09:00...|              364.6|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17504.0|{2011-01-21 09:00...| 441.15000000000003|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14334.0|{2011-01-24 09:00...| 352.41999999999996|
|   14800.0|{2010-12-05 09:00...|  555.8399999999999|
|   14606.0|{2011-02-01 09:00...| 157


-------------------------------------------
Batch: 54
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   14627.0|{2011-02-01 09:00...|-21.849999999999998|
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   12373.0|{2011-02-01 09:00...|              364.6|
|   16842.0|{2011-02-10 09:00...|  520.5699999999999|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17504.0|{2011-01-21 09:00...| 441.15000000000003|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14334.0|{2011-01-24 09:00...| 352.41999999999996|
|   12913.0|{2011-02-11 09:00...|    

-------------------------------------------
Batch: 60
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   14627.0|{2011-02-01 09:00...|-21.849999999999998|
|   14825.0|{2011-02-15 09:00...| 241.34000000000006|
|   15208.0|{2010-12-21 09:00...|               65.4|
|   15039.0|{2010-12-14 09:00...|  706.2500000000002|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   16250.0|{2010-12-01 09:00...|             226.14|
|   12373.0|{2011-02-01 09:00...|              364.6|
|   16842.0|{2011-02-10 09:00...|  520.5699999999999|
|   14594.0|{2010-12-01 09:00...| 254.99999999999997|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   17504.0|{2011-01-21 09:00...| 441.15000000000003|
|   14865.0|{2010-12-02 09:00...|               37.2|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   14334.0|{2011-01-24 09:00...| 352.


-------------------------------------------
Batch: 66
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|
|   12913.0|{2011-02-11 09:00...|             313.8|
|   17243.0|{2011-02-27 09:00...| 373.6499999999999|
|   18188.0|{2011-02-22 09:00...|             426.6|
|   17175.0|{2011-02-16 09:00...|            519.08|



-------------------------------------------
Batch: 72
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|
|   12913.0|{2011-02-11 09:00...|             313.8|
|   17243.0|{2011-02-27 09:00...| 373.6499999999999|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|



-------------------------------------------
Batch: 78
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|
|   12913.0|{2011-02-11 09:00...|             313.8|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|



-------------------------------------------
Batch: 84
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|
|   12913.0|{2011-02-11 09:00...|             313.8|



-------------------------------------------
Batch: 90
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|
|   12913.0|{2011-02-11 09:00...|             313.8|



-------------------------------------------
Batch: 96
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|


-------------------------------------------
Batch: 102
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|



-------------------------------------------
Batch: 108
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|


-------------------------------------------
Batch: 114
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|


-------------------------------------------
Batch: 120
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   14911.0|{2011-03-11 09:00...|               0.0|
|   15208.0|{2010-12-21 09:00...|              65.4|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   15694.0|{2011-03-16 09:00...|            584.76|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16842.0|{2011-02-10 09:00...| 520.5699999999999|
|   14594.0|{2010-12-01 09:00...|254.99999999999997|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   17504.0|{2011-01-21 09:00...|441.15000000000003|
|   14865.0|{2010-12-02 09:00...|              37.2|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   14334.0|{2011-01-24 09:00...|352.41999999999996|


-------------------------------------------
Batch: 126
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15290.0|{2011-02-22 09:00...|              -1.65|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   12921.0|{2011-03-30 09:00...| -87.30000000000001|
|   17652.0|{2011-03-03 09:00...|              222.3|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   17961.0|{2011-03-11 09:00...| 3.0999999999999996|
|   17841.0|{2011-02-28 09:00...| 227.95000000000005|
|   18188.0|{2011-02-22 09:00...|              426.6|
|   15068.0|{2011-03-28 09:00...| 239.95999999999995|
|   17175.0|{2011-02-16 09:00...|             519.08|
|   16837.0|{2011-04-20 09:00...|              102.0|
|   13184.0|{2011-02-22 09:00...| 212.51999999999998|
|   12748.0|{2011-05-10 09:00...| 21


-------------------------------------------
Batch: 132
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15290.0|{2011-02-22 09:00...|              -1.65|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   12921.0|{2011-03-30 09:00...| -87.30000000000001|
|   17652.0|{2011-03-03 09:00...|              222.3|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   17961.0|{2011-03-11 09:00...| 3.0999999999999996|
|   17841.0|{2011-02-28 09:00...| 227.95000000000005|
|   18188.0|{2011-02-22 09:00...|              426.6|
|   15068.0|{2011-03-28 09:00...| 239.95999999999995|
|   17175.0|{2011-02-16 09:00...|             519.08|
|   16837.0|{2011-04-20 09:00...|              102.0|
|   13184.0|{2011-02-22 09:00...| 212.51999999999998|
|   12748.0|{2011-05-10 09:00...| 21

-------------------------------------------
Batch: 138
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15290.0|{2011-02-22 09:00...|              -1.65|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   12921.0|{2011-03-30 09:00...| -87.30000000000001|
|   17652.0|{2011-03-03 09:00...|              222.3|
|   14292.0|{2011-05-25 09:00...|              -20.8|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   17961.0|{2011-03-11 09:00...| 3.0999999999999996|
|   17841.0|{2011-02-28 09:00...| 227.95000000000005|
|   18188.0|{2011-02-22 09:00...|              426.6|
|   15068.0|{2011-03-28 09:00...| 239.95999999999995|
|   17175.0|{2011-02-16 09:00...|             519.08|
|   16837.0|{2011-04-20 09:00...|              102.0|
|   13184.0|{2011-02-22 09:00...| 212


-------------------------------------------
Batch: 144
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15290.0|{2011-02-22 09:00...|              -1.65|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   12921.0|{2011-03-30 09:00...| -87.30000000000001|
|   17652.0|{2011-03-03 09:00...|              222.3|
|   14292.0|{2011-05-25 09:00...|              -20.8|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   17961.0|{2011-03-11 09:00...| 3.0999999999999996|
|   17841.0|{2011-02-28 09:00...| 227.95000000000005|
|   18188.0|{2011-02-22 09:00...|              426.6|
|   15068.0|{2011-03-28 09:00...| 239.95999999999995|
|   17175.0|{2011-02-16 09:00...|             519.08|
|   16837.0|{2011-04-20 09:00...|              102.0|
|   13184.0|{2011-02-22 09:00...| 21

-------------------------------------------
Batch: 150
-------------------------------------------
+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   15290.0|{2011-02-22 09:00...|              -1.65|
|   14911.0|{2011-01-31 09:00...|             797.77|
|   12921.0|{2011-03-30 09:00...| -87.30000000000001|
|   17652.0|{2011-03-03 09:00...|              222.3|
|   14292.0|{2011-05-25 09:00...|              -20.8|
|   15899.0|{2010-12-06 09:00...|              56.25|
|   18223.0|{2010-12-16 09:00...|  501.6899999999999|
|   17961.0|{2011-03-11 09:00...| 3.0999999999999996|
|   17841.0|{2011-02-28 09:00...| 227.95000000000005|
|   18188.0|{2011-02-22 09:00...|              426.6|
|   15068.0|{2011-03-28 09:00...| 239.95999999999995|
|   17175.0|{2011-02-16 09:00...|             519.08|
|   16837.0|{2011-04-20 09:00...|              102.0|
|   13184.0|{2011-02-22 09:00...| 212

-------------------------------------------
Batch: 156
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   18188.0|{2011-02-22 09:00...|             426.6|
|   15068.0|{2011-03-28 09:00...|239.95999999999995|
|   17175.0|{2011-02-16 09:00...|            519.08|


-------------------------------------------
Batch: 162
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   18188.0|{2011-02-22 09:00...|             426.6|
|   15068.0|{2011-03-28 09:00...|239.95999999999995|



-------------------------------------------
Batch: 168
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   18188.0|{2011-02-22 09:00...|             426.6|

-------------------------------------------
Batch: 174
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   18188.0|{2011-02-22 09:00...|             426.6|



-------------------------------------------
Batch: 180
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|

-------------------------------------------
Batch: 186
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|


-------------------------------------------
Batch: 192
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|


-------------------------------------------
Batch: 198
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|



-------------------------------------------
Batch: 204
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|

-------------------------------------------
Batch: 210
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|


-------------------------------------------
Batch: 216
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|


-------------------------------------------
Batch: 222
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   14825.0|{2011-06-23 09:00...|181.45000000000002|


-------------------------------------------
Batch: 228
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   18093.0|{2011-09-07 09:00...|             89.64|


-------------------------------------------
Batch: 234
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   15290.0|{2011-02-22 09:00...|             -1.65|
|   15036.0|{2011-06-13 09:00...|53.150000000000006|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12464.0|{2011-07-13 09:00...|45.599999999999994|
|   14224.0|{2011-07-17 09:00...|            368.18|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   14292.0|{2011-05-25 09:00...|             -20.8|
|   14243.0|{2011-07-22 09:00...|            214.62|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   15719.0|{2011-06-14 09:00...|342.14999999999975|
|   18093.0|{2011-09-07 09:00...|             89.64|


-------------------------------------------
Batch: 240
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|
|   15622.0|{2011-09-21 09:00...|             -2.08|


-------------------------------------------
Batch: 246
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|
|   15622.0|{2011-09-21 09:00...|             -2.08|



-------------------------------------------
Batch: 252
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|
|   15622.0|{2011-09-21 09:00...|             -2.08|

-------------------------------------------
Batch: 258
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|
|   15622.0|{2011-09-21 09:00...|             -2.08|


-------------------------------------------
Batch: 264
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|


-------------------------------------------
Batch: 270
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|


-------------------------------------------
Batch: 276
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|
|   17175.0|{2011-02-16 09:00...|            519.08|


-------------------------------------------
Batch: 282
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   13408.0|{2011-11-09 09:00...| 550.4399999999999|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|



-------------------------------------------
Batch: 288
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   13408.0|{2011-11-09 09:00...| 550.4399999999999|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|
|   13058.0|{2011-07-11 09:00...|47.550000000000004|


-------------------------------------------
Batch: 294
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14506.0|{2011-11-22 09:00...|496.91999999999996|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   13408.0|{2011-11-09 09:00...| 550.4399999999999|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|
|   15844.0|{2011-10-24 09:00...|            130.74|


-------------------------------------------
Batch: 300
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   18075.0|{2011-06-28 09:00...|            282.23|
|   14911.0|{2011-01-31 09:00...|            797.77|
|   16811.0|{2011-12-05 09:00...|             232.3|
|   12921.0|{2011-03-30 09:00...|-87.30000000000001|
|   17652.0|{2011-03-03 09:00...|             222.3|
|   14506.0|{2011-11-22 09:00...|496.91999999999996|
|   16609.0|{2011-07-27 09:00...|375.12999999999994|
|   12647.0|{2011-09-11 09:00...|356.10999999999996|
|   13408.0|{2011-11-09 09:00...| 550.4399999999999|
|   18093.0|{2011-09-07 09:00...|             89.64|
|   15899.0|{2010-12-06 09:00...|             56.25|
|   18223.0|{2010-12-16 09:00...| 501.6899999999999|
|   17961.0|{2011-03-11 09:00...|3.0999999999999996|
|   17841.0|{2011-02-28 09:00...|227.95000000000005|

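Note that the windows are built from event time, not from the time at which Spark processes the data.<br/>
This is how Structured Streaming addresses a shortcoming of the older Spark Streaming API.<br/>
Structured Streaming is covered in detail in Part V (Chapters 20-23).<br/>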
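
The event-time idea can be sketched without Spark. The snippet below is a minimal plain-Scala illustration, not Spark's implementation: `windowStart`, the sample timestamps, and the one-day window size are assumptions made for the example. It assigns a record to an epoch-aligned tumbling window using the time the event happened, so the time at which the record is processed does not affect the result.

```scala
import java.time.{Duration, Instant}

// Start of the epoch-aligned tumbling window that contains the given instant.
def windowStart(time: Instant, size: Duration): Instant = {
  val sizeMs = size.toMillis
  val ts = time.toEpochMilli
  Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs))
}

val oneDay = Duration.ofDays(1)

// Hypothetical record: the purchase happened in 2011 ...
val eventTime = Instant.parse("2011-02-22T13:45:00Z")
// ... but it is only read and processed years later.
val processingTime = Instant.parse("2021-12-27T10:00:00Z")

// The window is derived from the event time only.
val start = windowStart(eventTime, oneDay)
println(s"event-time window: [$start, ${start.plus(oneDay)})")
```

Spark's `window` function also aligns windows to the epoch in UTC by default, which would be consistent with the 09:00 boundaries in the output above if the session time zone is UTC+9 (an assumption about the environment, not something the notebook states).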

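# 3.4 Machine Learning and Advanced Analytics
With MLlib you can do preprocessing, munging (also called data wrangling: transforming or mapping raw data into another form), model training, and inference on large-scale data.<br/>
Prediction models trained with MLlib can also be used to make predictions in Structured Streaming.<br/>
To illustrate the machine learning API, we will perform clustering with the k-means algorithm.<br/>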

Spark provides a variety of methods for data preprocessing.<br/>
The following example defines transformations that put the raw data into the right format, trains a model, and then makes predictions.<br/>

In [42]:
staticDataFrame.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



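MLlib's machine learning algorithms require numeric data.<br/>
The example data contains a mix of types (timestamps, integers, strings, and so on), so it has to be converted to a numeric representation.<br/>
The following example uses a few DataFrame transformations to handle the date data.<br/>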

In [43]:
import org.apache.spark.sql.functions.date_format

val preppedDataFrame = staticDataFrame
    .na.fill(0)
    .withColumn("day_of_week", date_format($"InvoiceDate", "EEEE"))
    .coalesce(5)

import org.apache.spark.sql.functions.date_format
preppedDataFrame: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [InvoiceNo: string, StockCode: string ... 7 more fields]
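
To see what the `"EEEE"` pattern yields, here is a plain-JVM sketch using `java.time`. It is not Spark's own code path, and the sample `invoiceDate` value and its `yyyy-MM-dd HH:mm:ss` layout are assumptions for illustration. The pattern letter has the same meaning here: the full name of the day of the week.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical InvoiceDate value, assumed to use the "yyyy-MM-dd HH:mm:ss" layout.
val invoiceDate = "2010-12-01 08:26:00"

val parsed = LocalDateTime.parse(
  invoiceDate,
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

// "EEEE" formats the full day-of-week name.
val dayOfWeek = parsed.format(DateTimeFormatter.ofPattern("EEEE", Locale.ENGLISH))
println(dayOfWeek) // Wednesday
```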


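The data must be split into a training set and a test set.<br/>
In this example we split it manually, based on the date on which a purchase occurred.<br/>
MLlib's transformation APIs (TrainValidationSplit or CrossValidator) can also be used to create training and test sets.<br/>
These are covered in detail in Part VI (Advanced Analytics and Machine Learning; Chapters 24-31).<br/>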

In [44]:
val trainDataFrame = preppedDataFrame
    .where("InvoiceDate < '2011-07-01'")
val testDataFrame = preppedDataFrame
    .where("InvoiceDate >= '2011-07-01'")

trainDataFrame: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [InvoiceNo: string, StockCode: string ... 7 more fields]
testDataFrame: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [InvoiceNo: string, StockCode: string ... 7 more fields]
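
The schema above lists `InvoiceDate` as a string, so the `where` filters presumably compare strings rather than timestamps. The sketch below, with hypothetical sample values, shows why that still splits the data chronologically: zero-padded year-month-day strings sort in the same order as the dates they represent.

```scala
// Hypothetical InvoiceDate strings in "yyyy-MM-dd HH:mm:ss" form.
val invoiceDates = Seq(
  "2010-12-01 08:26:00",
  "2011-06-30 17:05:00",
  "2011-07-01 00:00:00",
  "2011-12-09 12:50:00"
)
val cutoff = "2011-07-01"

// Plain lexicographic comparison; rows on or after the cutoff go to the test side.
val (train, test) = invoiceDates.partition(_ < cutoff)

println(s"train = $train")
println(s"test  = $test")
```

This only works because the format is fixed-width and ordered from the most significant field to the least; a layout such as `12/1/2010` would not sort correctly as a string.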


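Now that the data is prepared, we call an action to actually perform the split.<br/>
The example data is a time-series dataset, so we split it at an arbitrary date.<br/>
The code above splits the dataset roughly in half.<br/>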

In [46]:
trainDataFrame.count()

res22: Long = 245903


In [47]:
testDataFrame.count()

res23: Long = 296006


DataFrame transformations are covered in detail in Part II (Structured APIs: DataFrames, SQL, and Datasets; Chapters 4-11).<br/>
Spark MLlib also provides a number of transformers that automate common transformations.<br/>
One of them is StringIndexer.<br/>

In [48]:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
    .setInputCol("day_of_week")
    .setOutputCol("day_of_week_index")

import org.apache.spark.ml.feature.StringIndexer
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_56cce65680ba
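
As a rough picture of what the indexer learns, the sketch below builds a label-to-index map in plain Scala. It assumes StringIndexer's default ordering (most frequent label first); the sample values and the alphabetical tie-break are choices made for this illustration, and the code is not Spark's implementation.

```scala
// Hypothetical day_of_week column values.
val days = Seq("Monday", "Saturday", "Monday", "Friday", "Monday", "Saturday")

// The most frequent label gets index 0.0; ties are broken alphabetically here.
val labelToIndex: Map[String, Double] =
  days.groupBy(identity)
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    .sortBy { case (label, count) => (-count, label) }
    .map { case (label, _) => label }
    .zipWithIndex
    .map { case (label, i) => label -> i.toDouble }
    .toMap

println(labelToIndex)
```

Under that ordering the indices reflect how often each day appears, not the calendar order of the days.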


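The StringIndexer above converts the day of the week into a numeric value.<br/>
e.g., Saturday becomes 6 and Monday becomes 1<br/>
However, this numbering scheme implicitly says that Saturday is greater than Monday, which is wrong.<br/>
To fix this, use OneHotEncoder to encode each value as its own column.<br/>
Each column then indicates, as a Boolean flag, whether a given day is that day of the week.<br/>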

In [49]:
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
    .setInputCol("day_of_week_index")
    .setOutputCol("day_of_week_encoded")

import org.apache.spark.ml.feature.OneHotEncoder
encoder: org.apache.spark.ml.feature.OneHotEncoder = oneHotEncoder_9315c8ddb5c5
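
One-hot encoding itself is simple enough to sketch by hand. The helper below is a hypothetical plain-Scala illustration, not the MLlib class: `oneHot` and its parameters are names made up for this example. It mirrors the `dropLast` behaviour that OneHotEncoder applies by default, where the last category is represented by the all-zero vector.

```scala
// One-hot encode a category index. With dropLast = true the last category
// maps to the all-zero vector, so the result has numCategories - 1 slots.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
  require(index >= 0 && index < numCategories, s"index $index out of range")
  val size = if (dropLast) numCategories - 1 else numCategories
  Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}

val first = oneHot(0, 3)                       // Vector(1.0, 0.0)
val last = oneHot(2, 3)                        // Vector(0.0, 0.0)
val lastFull = oneHot(2, 3, dropLast = false)  // Vector(0.0, 0.0, 1.0)
```

No slot is numerically larger than another, which is what removes the false ordering between days.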


The result of the example above is used as one of the columns that make up a vector.<br/>
All machine learning algorithms in Spark take a numeric vector type as input.<br/>

In [50]:
import org.apache.spark.ml.feature.VectorAssembler

val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("UnitPrice", "Quantity", "day_of_week_encoded"))
    .setOutputCol("features")

import org.apache.spark.ml.feature.VectorAssembler
vectorAssembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_359aae1dc1cc, handleInvalid=error, numInputCols=3
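
Conceptually, the assembler concatenates its input columns into one flat feature vector, in the order they are listed. A minimal plain-Scala sketch with hypothetical row values (the numbers and the six-slot encoding are assumptions for illustration):

```scala
// Hypothetical row: UnitPrice, Quantity, and an already one-hot-encoded day.
val unitPrice = 2.55
val quantity = 6
val dayOfWeekEncoded = Vector(0.0, 1.0, 0.0, 0.0, 0.0, 0.0)

// Scalars and vector entries are laid out side by side, in input-column order.
val features: Vector[Double] = Vector(unitPrice, quantity.toDouble) ++ dayOfWeekEncoded

println(features)
```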


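The example above uses three key features: price (UnitPrice), quantity (Quantity), and the day of the week of the given date (day_of_week_encoded).<br/>
Next, we set up a pipeline so that any data arriving later goes through the same transformation process.<br/>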

In [51]:
import org.apache.spark.ml.Pipeline

val transformationPipeline = new Pipeline()
    .setStages(Array(indexer, encoder, vectorAssembler))

import org.apache.spark.ml.Pipeline
transformationPipeline: org.apache.spark.ml.Pipeline = pipeline_150b2a060aee


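Preparing for training takes two steps.<br/>
First, the transformers have to be fit to the dataset.<br/>
As Part VI (Advanced Analytics and Machine Learning) explains in detail, StringIndexer needs to know how many distinct values it has to index.<br/>
If the number of distinct values is known, encoding is very easy; if it is not, every distinct value in the column has to be scanned and indexed.<br/>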

In [52]:
val fittedPipeline = transformationPipeline.fit(trainDataFrame)

fittedPipeline: org.apache.spark.ml.PipelineModel = pipeline_150b2a060aee
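
The fit-then-transform split can be pictured with a toy version. Everything below is a hypothetical stand-in written for this note (`Indexer`, `IndexerModel`, and the alphabetical index assignment are not MLlib APIs): `fit` scans the training data once to learn the distinct values, and the resulting model applies exactly that mapping to any later data.

```scala
// Hypothetical stand-in for a fitted model: it only holds what was learned.
final case class IndexerModel(labelToIndex: Map[String, Int]) {
  // Labels that were never seen during fit have no index.
  def transform(values: Seq[String]): Seq[Option[Int]] = values.map(labelToIndex.get)
}

// Hypothetical stand-in for an estimator: fit scans the data once.
object Indexer {
  def fit(values: Seq[String]): IndexerModel =
    IndexerModel(values.distinct.sorted.zipWithIndex.toMap)
}

val model = Indexer.fit(Seq("Monday", "Friday", "Monday"))

// Later data reuses the mapping learned from the training data.
val indexedTest = model.transform(Seq("Friday", "Sunday"))
println(indexedTest)
```

Because the mapping is frozen at fit time, training and test data are guaranteed to be encoded the same way.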


Once the transformers have been fit to the training set, we have a fitted pipeline ready for training.<br/>
It can be used to transform all data in a consistent and repeatable way.<br/>

In [53]:
val transformedTraining = fittedPipeline.transform(trainDataFrame)

transformedTraining: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 10 more fields]


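The pipeline used for model training is now in place<br/>
Model training could have been part of the pipeline, but it was left out in order to demonstrate data caching<br/>
Caching is covered in detail in Part IV (Production Applications; chapters 15-19)<br/>
Rather than repeating the same transformations over and over, some hyperparameter tuning is applied to the model instead<br/>
Caching stores a copy of the intermediate transformed dataset in memory, so it can be accessed repeatedly much faster than by rerunning the whole pipeline<br/>
To see how big the difference is, remove the caching line (transformedTraining.cache()) from the example and train the model without caching<br/>
Then add the line back and run it again; the difference in speed should be noticeable<br/>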


The code below differs from the book's code

- The *computeCost* method was deprecated in Spark 2.4 and removed in 3.0.0, so *ClusteringEvaluator* is used instead
- The book's code does not measure execution time, so timing was added

In [63]:
// Without caching

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val startTimeWithOutCache = System.nanoTime

val kmeansWithOutCache = new KMeans()
    .setK(20)
    .setSeed(1L)

// Train the model
val kmModelWithOutCache = kmeansWithOutCache.fit(transformedTraining)

// Make predictions for training data
val predictionsForTrainingWithOutCache = kmModelWithOutCache.transform(transformedTraining)

// Evaluate clustering
val evaluatorWithOutCache = new ClusteringEvaluator()

val scoreForTrainingWithOutCache = evaluatorWithOutCache.evaluate(predictionsForTrainingWithOutCache)
println(s"Score for training data = $scoreForTrainingWithOutCache")

val transformedTestWithOutCache = fittedPipeline.transform(testDataFrame)

// Make predictions for test data
val predictionsForTestWithOutCache = kmModelWithOutCache.transform(transformedTestWithOutCache)

val scoreForTestWithOutCache = evaluatorWithOutCache.evaluate(predictionsForTestWithOutCache)
println(s"Score for test data = $scoreForTestWithOutCache")

val durationWithOutCache = (System.nanoTime - startTimeWithOutCache) / 1e9d
println(s"duration time without cache = $durationWithOutCache")

Score for training data = 0.6842576726028763
Score for test data = 0.5427938390491535
duration time without cache = 30.250212307


import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
startTimeWithOutCache: Long = 100666407606868
kmeansWithOutCache: org.apache.spark.ml.clustering.KMeans = kmeans_dc6eeef29af8
kmModelWithOutCache: org.apache.spark.ml.clustering.KMeansModel = KMeansModel: uid=kmeans_dc6eeef29af8, k=20, distanceMeasure=euclidean, numFeatures=7
predictionsForTrainingWithOutCache: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 11 more fields]
evaluatorWithOutCache: org.apache.spark.ml.evaluation.ClusteringEvaluator = ClusteringEvaluator: uid=cluEval_805f50f6989c, metricName=silhouette, distanceMeasure=squaredEuclidean
scoreForTrainingWithOutCache: Double = 0.6842576726028763
transformedTestWithOutCache: org.apache.spark.sql.Dat...


In [64]:
// With caching

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Caching is applied here
transformedTraining.cache()

val startTimeWithCache = System.nanoTime

val kmeansWithCache = new KMeans()
    .setK(20)
    .setSeed(1L)

// Train the model
val kmModelWithCache = kmeansWithCache.fit(transformedTraining)

// Make predictions for training data
val predictionsForTrainingWithCache = kmModelWithCache.transform(transformedTraining)

// Evaluate clustering
val evaluatorWithCache = new ClusteringEvaluator()

val scoreForTrainingWithCache = evaluatorWithCache.evaluate(predictionsForTrainingWithCache)
println(s"Score for training data = $scoreForTrainingWithCache")

val transformedTestWithCache = fittedPipeline.transform(testDataFrame)

// Make predictions for test data
val predictionsForTestWithCache = kmModelWithCache.transform(transformedTestWithCache)

val scoreForTestWithCache = evaluatorWithCache.evaluate(predictionsForTestWithCache)
println(s"Score for test data = $scoreForTestWithCache")

val durationWithCache = (System.nanoTime - startTimeWithCache) / 1e9d
println(s"duration time with cache = $durationWithCache")

Score for training data = 0.6842576726028763
Score for test data = 0.5427938390491535
duration time with cache = 29.491759284


import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
startTimeWithCache: Long = 100704841383950
kmeansWithCache: org.apache.spark.ml.clustering.KMeans = kmeans_b65c5b418fd5
kmModelWithCache: org.apache.spark.ml.clustering.KMeansModel = KMeansModel: uid=kmeans_b65c5b418fd5, k=20, distanceMeasure=euclidean, numFeatures=7
predictionsForTrainingWithCache: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 11 more fields]
evaluatorWithCache: org.apache.spark.ml.evaluation.ClusteringEvaluator = ClusteringEvaluator: uid=cluEval_524b7474908e, metricName=silhouette, distanceMeasure=squaredEuclidean
scoreForTrainingWithCache: Double = 0.6842576726028763
transformedTestWithCache: org.apache.spark.sql.DataFrame = [InvoiceNo: ...


*Contrary to the book's explanation, there was little difference in execution time between the cached and uncached runs* (the reason is not yet clear)
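
The timing boilerplate in the two cells could be factored into a small helper, which makes it easier to time individual steps separately. A minimal plain-Scala sketch; `timed` is a hypothetical helper and the summation is only a stand-in workload.

```scala
// Hypothetical helper: runs a block, prints the elapsed seconds,
// and returns the block's result unchanged
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime
  val result = block
  val seconds = (System.nanoTime - start) / 1e9d
  println(s"$label took $seconds s")
  result
}

// Stand-in workload; in the notebook this would wrap the fit/transform/evaluate steps
val total = timed("sum") { (1L to 1000000L).sum }
println(total) // 500000500000
```

One possible factor behind the similar timings, offered as a guess rather than a confirmed cause: `cache()` is lazy, so the data is only stored when the first action runs, and the cached run still pays for computing the pipeline once inside the timed region. Calling an action such as `transformedTraining.count()` after `cache()` and before starting the timer would separate materialization cost from training cost.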

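According to the book, the example model still has a lot of room for improvement <br/>
Adding more preprocessing steps and tuning the hyperparameters should produce a better model<br/>
Ways to improve the model are covered in detail in Part VI (Advanced Analytics and Machine Learning)<br/>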

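# 3.5 Low-Level APIs
Through RDDs, Spark provides a range of basic functionality (low-level APIs) needed to work with Java and Python objects<br/>
Almost everything in Spark is built on top of RDDs<br/>
DataFrame operations are also built on RDDs and are compiled down to low-level instructions for convenient and efficient distributed execution<br/>
RDDs can be used to read or manipulate raw data, but in most cases the structured APIs are the better choice<br/>
That said, RDDs let you decide physical execution characteristics such as partitioning, so they allow finer-grained control than DataFrames<br/>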

RDDs can also be used to parallelize raw data held in the driver machine's memory<br/>
The following example parallelizes a few simple numbers to create an RDD<br/>
It then converts the RDD to a DataFrame so it can be used with other DataFrames<br/>

In [66]:
spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF()

res30: org.apache.spark.sql.DataFrame = [value: int]
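
As a small illustration of the partition-level control mentioned above, the sketch below sets the number of partitions explicitly when parallelizing. It assumes Spark is on the classpath; the names `session`, `numbers`, `numbersDF` and the column name `number` are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook's existing session if there is one
val session = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-partition-sketch")
  .getOrCreate()
import session.implicits._

// numSlices sets the number of partitions of the resulting RDD
val numbers = session.sparkContext.parallelize(Seq(1, 2, 3, 4), numSlices = 2)
println(numbers.getNumPartitions) // 2

// toDF also accepts a column name instead of the default "value"
val numbersDF = numbers.toDF("number")
numbersDF.printSchema()
```

A DataFrame offers `repartition` for similar purposes, but choosing the partition count at creation time like this is specific to the RDD entry point.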


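RDDs are available in Python as well as Scala, but the RDDs of the two languages are not equivalent<br/>
Unlike the DataFrame API, which offers the same execution characteristics regardless of language, RDDs differ in their underlying implementation details<br/>
Part IV (Production Applications) covers RDDs and the low-level APIs in detail<br/>
Unless you have to maintain code written for an older version of Spark, there is no need to write Spark code with RDDs<br/>
Recent versions of Spark do not use RDDs by default, but RDDs should be used when processing unstructured or uncleaned raw data<br/>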

# 3.8 Summary
This chapter looked at the various ways Spark can be applied to business and technical problems<br/>