## Chapter 2

In interactive mode, the Spark session is created implicitly. When you run Spark through a standalone application, you first need to create the SparkSession object yourself.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
spark.version

The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like
Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to
these cluster managers, which will grant resources to our application so that we can complete our
work.

You control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes
user-defined manipulations across the cluster. There is a one-to-one correspondence between a
SparkSession and a Spark Application. In Scala and Python, the variable is available as spark
when you start the console. 

Spark Applications consist of a driver process and a set of executor processes. The driver process
runs your main() function, sits on a node in the cluster, and is responsible for three things:
maintaining information about the Spark Application; responding to a user’s program or input;
and analyzing, distributing, and scheduling work across the executors (discussed momentarily).
The driver process is absolutely essential—it’s the heart of a Spark Application and maintains all
relevant information during the lifetime of the application. The executors are responsible for actually carrying out the work that the driver assigns them.
This means that each executor is responsible for only two things: executing code assigned to it
by the driver, and reporting the state of the computation on that executor back to the driver node.

The driver and executors are simply
processes, which means that they can live on the same machine or different machines. Spark employs a cluster manager that keeps track of the resources available. The driver process is responsible for executing the driver program’s commands across
the executors to complete a given task. The executors, for the most part, will always be running Spark code. However, the driver can be
“driven” from a number of different languages through Spark’s language APIs. 

Each language API maintains the same core concepts that we described earlier. There is a
SparkSession object available to the user, which is the entrance point to running Spark code.
When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you
write Python and R code that Spark translates into code that it then can run on the executor
JVMs.

Executors and the driver, like data nodes and name nodes, are pieces of software (processes), not physical machines.

Let’s now perform the simple task of creating a range of numbers. This range of numbers is just
like a named column in a spreadsheet.

In [None]:
myRange = spark.range(1000).toDF("number")

You just ran your first Spark code! We created a DataFrame with one column containing 1,000
rows with values from 0 to 999. This range of numbers represents a distributed collection. When
run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark
DataFrame.

A DataFrame in Spark is a table that can span several thousand computers. Its schema is a list that defines the columns and their types. You can also easily convert a pandas DataFrame into a Spark DataFrame, and back.

A DataFrame is split into multiple partitions so that the executors can process it in parallel. Each partition is a collection of rows that sits on one physical machine. If you have one partition, parallelism is 1 even if you have thousands of executors. If you have one executor and thousands of partitions, parallelism is still 1. With DataFrames you usually do not manipulate partitions manually; that is typically done through the low-level APIs.
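As a rough illustration of the two limiting cases above, the sketch below computes how many tasks can run at once. It assumes, as the paragraph does, that each executor works on one partition at a time; effective_parallelism is a hypothetical helper, not a Spark API.

```python
def effective_parallelism(num_partitions: int, num_executors: int) -> int:
    """Tasks that can run at the same time, assuming one task per executor."""
    return min(num_partitions, num_executors)

# One partition, many executors: only one executor has work to do.
print(effective_parallelism(num_partitions=1, num_executors=1000))   # 1
# Many partitions, one executor: partitions are processed one after another.
print(effective_parallelism(num_partitions=1000, num_executors=1))   # 1
# More partitions than executors: every executor is busy.
print(effective_parallelism(num_partitions=8, num_executors=4))      # 4
```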

Core data structures in Spark are immutable, like tuples in Python. You tell Spark how you want to modify them through transformations, which are only executed once an action is called.

In [None]:
divisBy2 = myRange.where("number % 2 = 0")

The command above produces no output, because no action has been specified. Transformations come in two types: (i) narrow and (ii) wide dependencies. A transformation is narrow when each input partition contributes to only one output partition, and wide when each input partition contributes to many output partitions. The latter is also referred to as a shuffle, because Spark exchanges partitions across the cluster. For narrow transformations, Spark performs an optimization called pipelining: if multiple filters are applied to a DataFrame, they are all performed in memory.
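The difference between the two dependency types can be sketched in plain Python, with lists standing in for partitions. This is only an analogy under that assumption; narrow_filter and wide_repartition are hypothetical helpers and say nothing about how Spark implements a shuffle.

```python
from collections import defaultdict

# Three input partitions, each a plain list of numbers (toy data).
partitions = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[x for x in part if predicate(x)] for part in parts]

def wide_repartition(parts, key, num_out):
    """Wide: rows are redistributed by key, so one input partition feeds many outputs."""
    out = defaultdict(list)
    for part in parts:
        for x in part:
            out[key(x) % num_out].append(x)
    return [out[i] for i in range(num_out)]

evens = narrow_filter(partitions, lambda x: x % 2 == 0)
print(evens)      # [[0, 2], [4, 6], [8, 10]]

shuffled = wide_repartition(partitions, key=lambda x: x, num_out=2)
print(shuffled)   # [[0, 2, 4, 6, 8, 10], [1, 3, 5, 7, 9, 11]]
```

In the narrow case the partition boundaries are untouched, so no data has to move between machines. In the wide case every input partition contributes rows to both output partitions.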

Lazy evaluation means that Spark waits until the last moment to execute the graph of computation instructions, i.e. the plan of transformations. This allows Spark to optimize the entire workflow from end to end. Predicate pushdown is one example: Spark pushes a filter defined at the end of the workflow down to the data source, so that less data has to be read.
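A minimal sketch of the idea in plain Python, assuming a toy class that only supports filtering and counting. LazyNumbers is hypothetical and is not how Spark is implemented; it only shows that a transformation extends a plan while an action runs it.

```python
class LazyNumbers:
    """Toy stand-in for a DataFrame: records transformations, runs them on an action."""

    def __init__(self, source, plan=()):
        self._source = source      # any re-iterable collection of numbers
        self._plan = plan          # tuple of predicates, applied in order

    def where(self, predicate):
        # Transformation: return a new object with a longer plan; do no work yet.
        return LazyNumbers(self._source, self._plan + (predicate,))

    def count(self):
        # Action: only now is the data scanned, with all filters applied in one pass.
        return sum(1 for x in self._source if all(p(x) for p in self._plan))

evaluated = []

def is_even(x):
    evaluated.append(x)            # record that the predicate really ran
    return x % 2 == 0

my_range = LazyNumbers(range(1000))
divis_by_2 = my_range.where(is_even)
print(len(evaluated))      # 0    (nothing has executed yet)
print(divis_by_2.count())  # 500
print(len(evaluated))      # 1000 (the action triggered the scan)
```

Note that where returns a new object instead of changing my_range, which mirrors the immutability of Spark's core data structures.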

In [None]:
divisBy2.count()

Transformations allow us to build up our logical plan, and actions trigger the computation. There are three types of actions: (i) actions to view data in the console, (ii) actions to collect data to native objects in the respective language, and (iii) actions to write to output data sources.

You can view the progress of a job in the Spark UI, available at localhost:4040 when running locally.

### An end-to-end Spark workflow

Show them the repository on GitHub: https://github.com/databricks/Spark-The-Definitive-Guide

In the follow-up session, apply the same operations to the other data files.

To read the data we use schema inference, which means that we want Spark to take a best guess at what
the schema of our DataFrame should be. We also want to specify that the first row is the header
in the file, so we’ll specify that as an option, too.

In [None]:
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("./../data/flight-data/csv/2015-summary.csv")

Spark peeks at only a few rows and tries to infer the schema from them. The resulting DataFrame has a set of columns but an unspecified number of rows, because reading data is a lazy operation. You only see the outcome of your command once you apply an action.
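The idea of inferring a schema from a small sample can be sketched with the standard csv module. This is an illustration only: infer_type and infer_schema are hypothetical helpers with a deliberately simple rule (int, then double, then string) and do not reproduce Spark's inference logic. The two sample rows are taken from the flight data shown in these notes.

```python
import csv
import io

def infer_type(values):
    """Pick the first type that fits every sampled value (toy rule: int, double, string)."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text, sample_rows=3):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                                  # first row is the header
    sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = list(zip(*sample))                           # transpose rows into columns
    return [(name, infer_type(vals)) for name, vals in zip(header, columns)]

data = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
"""
print(infer_schema(data))
# [('DEST_COUNTRY_NAME', 'string'), ('ORIGIN_COUNTRY_NAME', 'string'), ('count', 'int')]
```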

In [None]:
flightData2015.take(5)

In [None]:
flightData2015.sort("count")

In [None]:
flightData2015.sort("count").explain()

Note that sort is a wide transformation, so it triggers a shuffle. By default, a shuffle outputs 200 partitions. The next cell changes that setting.

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "1")

Having too many partitions means too many tasks and too much shuffling overhead, whereas too few partitions means some cores work hard while others sit idle.

In [None]:
flightData2015.sort("count").explain()

In [None]:
flightData2015.sort("count").take(1)

The logical plan also keeps a record of the steps taken from input to output (the lineage), so at any point Spark can recompute the same result, as long as the operations stay the same.

In [None]:
dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

To run the same query in SQL, you first need to register the DataFrame as a temporary view. In SQL, a view is a virtual table based on the result set of an SQL statement. A view contains rows and columns, just like a real table.

In [None]:
flightData2015.createOrReplaceTempView("flight_data_2015")

In [None]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

Each of these DataFrames (in Scala and Python) has a set of columns with an unspecified
number of rows.

In [None]:
sqlWay.explain()

In [None]:
dataFrameWay.explain()

Spark offers a wide range of functions and manipulations. The max below, for example, simply returns the largest value in the count column, which is just the row count for US-to-US flights.

In [None]:
spark.sql("SELECT max(count) from flight_data_2015").take(1)

In [None]:
from pyspark.sql.functions import max
flightData2015.select(max("count")).take(1)

In [None]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

In [None]:
from pyspark.sql.functions import desc

In [None]:
maxsql = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)

maxsql.show()

Now run the same DataFrame query again, this time ending with explain() to inspect the plan:

In [None]:
flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .explain()

This true execution plan (the one visible in explain) will differ from that shown in Figure 2-10
because of optimizations in the physical execution; however, the illustration is as good of a
starting point as any. This execution plan is a directed acyclic graph (DAG) of transformations,
each resulting in a new immutable DataFrame, on which we call an action to generate a result.

The first step is to read in the data. We defined the DataFrame previously but, as a reminder,
Spark does not actually read it in until an action is called on that DataFrame or one derived from
the original DataFrame.

The second step is our grouping; technically when we call groupBy, we end up with a
RelationalGroupedDataset, which is a fancy name for a DataFrame that has a grouping
specified but needs the user to specify an aggregation before it can be queried further. We
basically specified that we’re going to be grouping by a key (or set of keys) and that now we’re
going to perform an aggregation over each one of those keys.

Therefore, the third step is to specify the aggregation. Let’s use the sum aggregation method.
This takes as input a column expression or, simply, a column name. The result of the sum
method call is a new DataFrame. You’ll see that it has a new schema but that it does know the
type of each column. It’s important to reinforce (again!) that no computation has been
performed. This is simply another transformation that we’ve expressed, and Spark is simply able
to trace our type information through it.

The fourth step is a simple renaming. We use the withColumnRenamed method that takes two
arguments, the original column name and the new column name. Of course, this doesn’t perform
computation: this is just another transformation!

The fifth step sorts the data such that if we were to take results off of the top of the DataFrame,
they would have the largest values in the destination_total column.
You likely noticed that we had to import a function to do this, the desc function. You might also
have noticed that desc does not return a string but a Column. In general, many DataFrame
methods will accept strings (as column names) or Column types or expressions. Columns and
expressions are actually the exact same thing.

Penultimately, we’ll specify a limit. This just specifies that we only want to return the first five
values in our final DataFrame instead of all the data.

The last step is our action! Now we actually begin the process of collecting the results of our
DataFrame, and Spark will give us back a list or array in the language that we’re executing. To
reinforce all of this, look at the explain plan for the previous query.
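To make the sequence of steps concrete, here is the same group, sum, rename, sort and limit logic in plain Python on a handful of toy rows. The rows and their values are illustrative assumptions, and unlike Spark this version runs eagerly on a single machine.

```python
from collections import defaultdict

# Toy rows: (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count). Values are illustrative only.
rows = [
    ("United States", "Romania", 15),
    ("United States", "Croatia", 1),
    ("Canada", "United States", 7),
    ("Mexico", "United States", 3),
    ("Canada", "Mexico", 2),
]

# Steps 2 and 3: group by destination and sum the counts per group.
totals = defaultdict(int)
for dest, _origin, count in rows:
    totals[dest] += count

# Step 4: name the aggregated value destination_total.
result = [{"DEST_COUNTRY_NAME": d, "destination_total": t} for d, t in totals.items()]

# Step 5: sort descending by the total; step 6: keep only the top rows.
top = sorted(result, key=lambda r: r["destination_total"], reverse=True)[:2]
print(top)
# [{'DEST_COUNTRY_NAME': 'United States', 'destination_total': 16},
#  {'DEST_COUNTRY_NAME': 'Canada', 'destination_total': 9}]
```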

## Chapter 3

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [7]:
staticDataFrame = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .option("multiLine", "true")\
  .load("../data/retail-data/by-day/*.csv")

staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema

In [9]:
from pyspark.sql.functions import window, column, desc, col

staticDataFrame\
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")\
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
  .sum("total_cost")\
  .sort(desc("sum(total_cost)"))\
  .show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   17450.0|[2011-09-20 02:00...|          71601.44|
|      null|[2011-11-14 01:00...|          55316.08|
|      null|[2011-11-07 01:00...|          42939.17|
|      null|[2011-03-29 02:00...| 33521.39999999998|
|      null|[2011-12-08 01:00...|31975.590000000007|
+----------+--------------------+------------------+
only showing top 5 rows



In [None]:
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("csv")\
    .option("header", "true")\
    .load("../data/retail-data/by-day/*.csv")

In [None]:
purchaseByCustomerPerHour = streamingDataFrame\
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")\
  .groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
  .sum("total_cost")

In [None]:
purchaseByCustomerPerHour.writeStream\
    .format("memory")\
    .queryName("customer_purchases")\
    .outputMode("complete")\
    .start()

In [None]:
spark.sql("""
  SELECT *
  FROM customer_purchases
  ORDER BY `sum(total_cost)` DESC
  """)\
  .show(5)

In [None]:
from pyspark.sql.functions import date_format, col
preppedDataFrame = staticDataFrame\
  .na.fill(0)\
  .withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))\
  .coalesce(5)

In [None]:
trainDataFrame = preppedDataFrame\
  .where("InvoiceDate < '2011-07-01'")
testDataFrame = preppedDataFrame\
  .where("InvoiceDate >= '2011-07-01'")

In [None]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer()\
  .setInputCol("day_of_week")\
  .setOutputCol("day_of_week_index")

In [None]:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
  .setInputCol("day_of_week_index")\
  .setOutputCol("day_of_week_encoded")

In [None]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler()\
  .setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
  .setOutputCol("features")

In [None]:
from pyspark.ml import Pipeline

transformationPipeline = Pipeline()\
  .setStages([indexer, encoder, vectorAssembler])

In [None]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)

In [None]:
transformedTraining = fittedPipeline.transform(trainDataFrame)

In [None]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans()\
  .setK(20)\
  .setSeed(1)

In [None]:
kmModel = kmeans.fit(transformedTraining)

In [None]:
transformedTest = fittedPipeline.transform(testDataFrame)

In [None]:
from pyspark.sql import Row

spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()

## Chapter 4

In [None]:
df = spark.range(500).toDF("number")
df.select(df["number"] + 10)

In [None]:
spark.range(2).collect()

In [2]:
from pyspark.sql.types import *
b = ByteType()

## Chapter 5

Definitionally, a DataFrame consists of a series of records (like rows in a table), that are of type
Row, and a number of columns (like columns in a spreadsheet) that represent a computation
expression that can be performed on each individual record in the Dataset. Schemas define the
name as well as the type of data in each column. Partitioning of the DataFrame defines the
layout of the DataFrame or Dataset’s physical distribution across the cluster. The partitioning
scheme defines how that is allocated.

In [84]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

### 1. Define DF schema on read vs manual enforcing of schema

In [85]:
df = spark.read.format("json").load("./../data/flight-data/json/2015-summary.json")

In [13]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [33]:
df = spark.read.format("json").load("./../data/flight-data/json/2015-summary.json")

In [34]:
df.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

In [98]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello":"world"})
])
df = spark.read.format("json").schema(myManualSchema)\
  .load("./../data/flight-data/json/2015-summary.json")

In [99]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



StructType: represents values with the structure described by a sequence of StructFields.

StructField: a field in a StructType. Its metadata argument is a dict from string to simple type that can be serialized to JSON automatically.

### 2. Columns & expressions

In [19]:
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")

Column<b'someColumnName'>

In [40]:
a = col("new_col")
a

Column<b'new_col'>

In [42]:
df["DEST_COUNTRY_NAME"]

Column<b'DEST_COUNTRY_NAME'>

In [None]:
df.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

Expressions are operations for selecting, manipulating, and removing columns from DataFrames.

Columns are logical constructions that simply represent a value computed on a per-record basis by means of an expression.

An expression is
a set of transformations on one or more values in a record in a DataFrame. Think of it like a
function that takes as input one or more column names, resolves them, and then potentially
applies more expressions to create a single value for each record in the dataset. Importantly, this
“single value” can actually be a complex type like a Map or Array.

In [104]:
(((col("someCol") + 5) * 200) - 6) < col("otherCol")

Column<b'((((someCol + 5) * 200) - 6) < otherCol)'>

In [105]:
from pyspark.sql.functions import expr
expr("(((someCol + 5) * 200) - 6) < otherCol")

Column<b'((((someCol + 5) * 200) - 6) < otherCol)'>

The key takeaway here is that the underlying structure of the operations is represented in the same way, irrespective of whether we use SQL strings or DataFrame code.
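The toy sketch below shows why both routes can end up in the same place: operator overloading on a column object and parsing a string can build one and the same expression tree. Expr, toy_col and toy_expr are hypothetical names, and Python's own ast parser is used as a stand-in for a SQL parser; none of this is Spark's implementation.

```python
import ast

class Expr:
    """Toy expression node (hypothetical; not Spark's Column class)."""

    def __init__(self, text):
        self.text = text

    def _binary(self, op, other):
        # Wrap plain Python values so both operands are expression nodes.
        rhs = other if isinstance(other, Expr) else Expr(repr(other))
        return Expr(f"({self.text} {op} {rhs.text})")

    def __add__(self, other): return self._binary("+", other)
    def __sub__(self, other): return self._binary("-", other)
    def __mul__(self, other): return self._binary("*", other)
    def __lt__(self, other): return self._binary("<", other)
    def __repr__(self): return f"Expr<{self.text}>"

def toy_col(name):
    return Expr(name)

_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Lt: "<"}

def toy_expr(text):
    """Build the same tree from a string, using Python's parser in place of a SQL parser."""
    def build(node):
        if isinstance(node, ast.Name):
            return toy_col(node.id)
        if isinstance(node, ast.Constant):
            return Expr(repr(node.value))
        if isinstance(node, ast.BinOp):
            return build(node.left)._binary(_OPS[type(node.op)], build(node.right))
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            return build(node.left)._binary(_OPS[type(node.ops[0])], build(node.comparators[0]))
        raise ValueError("unsupported syntax in toy parser")
    return build(ast.parse(text, mode="eval").body)

from_code = (((toy_col("someCol") + 5) * 200) - 6) < toy_col("otherCol")
from_text = toy_expr("(((someCol + 5) * 200) - 6) < otherCol")
print(from_code)                         # Expr<((((someCol + 5) * 200) - 6) < otherCol)>
print(from_code.text == from_text.text)  # True
```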

### 3. Rows

Getting the first row of a DataFrame:

In [49]:
df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

Creating rows

The key takeaway here is that if you create a Row manually, you must specify the values in the same order as the schema of the DataFrame to which they might be appended.

In [106]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)

Accessing the elements of a row by position:

In [107]:
myRow[0]


'Hello'

In [108]:
myRow[2]

1

### 4. Data frames

One way to create a DataFrame is by reading a data file, e.g. JSON:

In [55]:
df = spark.read.format("json").load("./../data/flight-data/json/2015-summary.json")

Create temporary view

In [56]:
df.createOrReplaceTempView("dfTable")

A second way to create a DataFrame is to collect multiple rows into one, together with a schema:

In [109]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
  StructField("some", StringType(), True),
  StructField("col", StringType(), True),
  StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow , myRow], myManualSchema)
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
|Hello|null|    1|
+-----+----+-----+



The list passed to createDataFrame can contain any number of rows.

### 5. Select vs Expr for indexing dataframes in Spark

In [60]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [61]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



Notice below that expr, col, and column all do the same job when they refer to a plain column:

In [110]:
from pyspark.sql.functions import expr, col, column

df.select(
    expr("DEST_COUNTRY_NAME"),
    col("DEST_COUNTRY_NAME"),
    column("DEST_COUNTRY_NAME"))\
  .show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



In Scala, mixing Column objects and strings in select is a compile error. PySpark accepts the mix, so the following code runs without error:

In [114]:
df.select(col("DEST_COUNTRY_NAME"), "DEST_COUNTRY_NAME")

DataFrame[DEST_COUNTRY_NAME: string, DEST_COUNTRY_NAME: string]

expr is the most flexible reference that we can use. It can refer to a plain
column or a string manipulation of a column.

Selecting a column and renaming it at the same time:

In [73]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



Selecting a column, renaming it, and then changing the name back with alias:

In [77]:
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))\
  .show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



selectExpr = select + expr. It is a shorthand for the combination:

In [80]:
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



This combination provides a simple way to build up complex expressions that create new DataFrames. Any valid non-aggregating SQL statement can be used, as long as the columns resolve.

In [82]:
df.selectExpr(
  "*", # all original columns
  "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")\
  .show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



Try the same in SQL, using the temp view:

In [None]:
spark.sql("""
SELECT *, (DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry
FROM dfTable
LIMIT 2
""").show()

In [83]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



Try the same task in SQL, using the temp view:

In [None]:
spark.sql("SELECT avg(count), count(distinct(DEST_COUNTRY_NAME)) FROM dfTable LIMIT 2").show()

In [None]:
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)

In [None]:
df.withColumn("numberOne", lit(1)).show(2)

In [None]:
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
  .show(2)

In [None]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

In [None]:
dfWithLongColName = df.withColumn(
    "This Long Column-Name",
    expr("ORIGIN_COUNTRY_NAME"))

In [None]:
dfWithLongColName.selectExpr(
    "`This Long Column-Name`",
    "`This Long Column-Name` as `new col`")\
  .show(2)

In [None]:
dfWithLongColName.select(expr("`This Long Column-Name`")).columns

In [None]:
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia")\
  .show(2)

In [None]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

In [None]:
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

In [None]:
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

In [None]:
dataFrames = df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count() # False

In [None]:
from pyspark.sql import Row
schema = df.schema
newRows = [
  Row("New Country", "Other Country", 5),
  Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)

In [None]:
df.union(newDF)\
  .where("count = 1")\
  .where(col("ORIGIN_COUNTRY_NAME") != "United States")\
  .show()

In [None]:
df.sort("count").show(5)
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)

In [None]:
from pyspark.sql.functions import desc, asc
df.orderBy(desc("count")).show(2)  # expr("count desc") is parsed as an alias and would not sort descending
df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)

In [None]:
spark.read.format("json").load("./../data/flight-data/json/*-summary.json")\
  .sortWithinPartitions("count")

In [None]:
df.limit(5).show()

In [None]:
df.orderBy(col("count").desc()).limit(6).show()  # col(...).desc() gives a true descending sort

In [None]:
df.rdd.getNumPartitions() # 1

In [None]:
df.repartition(5)

In [None]:
df.repartition(col("DEST_COUNTRY_NAME"))

In [None]:
df.repartition(5, col("DEST_COUNTRY_NAME"))

In [None]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)

In [None]:
collectDF = df.limit(10)
collectDF.take(5) # take works with an Integer count
collectDF.show() # this prints it out nicely
collectDF.show(5, False)
collectDF.collect()

## Stopping at chapter 5