The cell below creates a DataFrame with one column containing 1,000 rows, with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor.

In [0]:
myRange = spark.range(1000).toDF("number")


The cell below returns no output because it specifies only an abstract transformation. Spark will not act on transformations until we call an action.

In [0]:
divisBy2 = myRange.where("number % 2 = 0")


To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations.

In [0]:
divisBy2.count()
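The split between transformations and actions can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: a generator expression stands in for the lazy `where` transformation, and consuming it stands in for the `count()` action. The names `calls` and `is_even` are hypothetical helpers introduced only for this sketch.

```python
# Plain-Python analogy for lazy evaluation (no Spark required).
calls = []  # records every time the filter predicate actually runs


def is_even(n):
    calls.append(n)
    return n % 2 == 0


numbers = range(1000)                        # plays the role of spark.range(1000)
evens = (n for n in numbers if is_even(n))   # "transformation": nothing is filtered yet
calls_before_action = len(calls)             # the predicate has not run at this point

even_count = sum(1 for _ in evens)           # "action": forces the computation
calls_after_action = len(calls)              # the predicate has now run on every row

print(calls_before_action, even_count, calls_after_action)
```

The predicate runs only once the result is requested, which mirrors how Spark defers work until an action such as `count()` is called.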

The cell below uses schema inference, which means that we want Spark to take a best guess at what the schema of our DataFrame should be. The number of rows is unspecified because reading data is a transformation, and is therefore a lazy operation. Spark peeked at only a couple of rows of data to try to guess what type each column should be.

In [0]:
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("/FileStore/tables/2015_summary.csv")
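To give a feel for what "taking a best guess" at a schema involves, here is a toy sketch in plain Python. It is not Spark's inference logic; it only shows the general idea of trying progressively looser types on a small sample of rows. The helper names (`guess_type`, `infer_schema`), the `ORIGIN_COUNTRY_NAME` column, and the sample rows are assumptions made up for illustration.

```python
import csv
import io


def guess_type(values):
    """Return 'int', 'double', or 'string' for a list of raw CSV strings."""
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for v in values:
                cast(v)          # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue             # try the next, looser type
    return "string"


def infer_schema(csv_text, sample_size=3):
    """Guess a column-name -> type mapping from the first few data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_size), reader)]
    columns = list(zip(*sample))  # transpose rows into columns
    return {name: guess_type(col) for name, col in zip(header, columns)}


# Made-up sample rows; only the header names resemble the flight data.
sample_csv = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
Egypt,United States,15
"""

schema = infer_schema(sample_csv)
print(schema)
```

Guessing from a sample is convenient for exploration, but a guess can be wrong when later rows do not look like the sampled ones, which is one reason to treat an inferred schema as provisional.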


# "/FileStore/tables/2015_summary.csv"

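The cell below registers the DataFrame as a temporary view so that it can be queried with SQL. The view exists only for the current Spark session.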

In [0]:
flightData2015.createOrReplaceTempView("flight_data_2015")


By looking at the explain plan, we can see that Spark is building up a plan for how it will execute this query across the cluster. Explain can be called on any DataFrame object to see the DataFrame’s lineage (how Spark will execute the query). Sorting our data is a wide transformation because rows need to be compared with one another.

In [0]:
flightData2015.sort("count").explain()

By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. Let’s set this value to 5 to reduce the number of output partitions from the shuffle:

In [0]:
spark.conf.set("spark.sql.shuffle.partitions","5")
flightData2015.sort("count").take(2)
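To make the idea of shuffle output partitions more concrete, here is a minimal plain-Python sketch of a range-partitioned sort. It is a simplification under stated assumptions, not Spark's implementation: the function name `range_partition_sort` is hypothetical, the sample counts are made up, and the boundaries are taken from a fully sorted copy of the data for brevity, whereas a real engine would estimate them.

```python
import bisect


def range_partition_sort(rows, num_partitions):
    """Sort rows by routing them into num_partitions ordered value ranges."""
    if not rows:
        return [[] for _ in range(num_partitions)]

    # Pick num_partitions - 1 boundary values.
    # Assumption: boundaries come from the fully sorted data to keep this short.
    ordered = sorted(rows)
    step = len(ordered) / num_partitions
    boundaries = [ordered[int(step * i)] for i in range(1, num_partitions)]

    # "Shuffle": every row moves to the partition that owns its value range,
    # so rows from any input partition may end up in any output partition.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row)].append(row)

    # Each output partition can now be sorted independently.
    return [sorted(p) for p in partitions]


counts = [15, 1, 344, 62, 1, 588, 2, 21, 4, 39]   # made-up sample values
parts = range_partition_sort(counts, 5)           # 5 output partitions, as configured above
flattened = [value for part in parts for value in part]
print(len(parts), flattened == sorted(counts))
```

Because each partition owns a contiguous range of values, reading the partitions in order yields globally sorted data. The number of output partitions only changes how the work is divided, not the result.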

In [0]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

sqlWay.explain()
dataFrameWay.explain()


In [0]:
from pyspark.sql.functions import max as spark_max  # alias avoids shadowing Python's built-in max

flightData2015.select(spark_max("count")).take(1)


In [0]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()


The DataFrame syntax below is semantically similar but slightly different in implementation and ordering:

In [0]:
from pyspark.sql.functions import desc

flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .show()


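In the explain plan, the aggregation happens in two phases, which show up as the partial_sum calls. This works because summing a list of numbers is commutative (and associative), so Spark can compute a partial sum within each partition and then combine the partial results.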

In [0]:
flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .explain()
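The two-phase aggregation can be sketched in plain Python. This is an illustration of the idea only, not Spark's execution code: the function names `partial_sum` and `final_sum`, the two hand-made partitions, and the row values are all assumptions made up for the example.

```python
from collections import defaultdict


def partial_sum(partition):
    """Phase 1: sum counts per destination within a single partition."""
    totals = defaultdict(int)
    for dest, count in partition:
        totals[dest] += count
    return dict(totals)


def final_sum(partials):
    """Phase 2: merge the per-partition subtotals into overall totals."""
    totals = defaultdict(int)
    for partial in partials:
        for dest, subtotal in partial.items():
            totals[dest] += subtotal
    return dict(totals)


# Made-up rows of (DEST_COUNTRY_NAME, count), split across two partitions.
partitions = [
    [("United States", 15), ("Egypt", 15), ("United States", 1)],
    [("Egypt", 5), ("United States", 344)],
]

partials = [partial_sum(p) for p in partitions]   # can run independently per partition
totals = final_sum(partials)                      # runs after the partial results are combined

# Equivalent of sort(desc("destination_total")).limit(5)
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```

Because addition gives the same answer regardless of grouping and order, summing within each partition first and then combining the subtotals matches a single pass over all rows, while moving far less data between partitions.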
