# Chapter 5. Basic Structured Operations
- DataFrame: A distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/Python.
- Row: An individual record in a DataFrame.
- Column: Represents a computation expression that is applied to each record.
- Schema: Provides a blueprint for the structure of the DataFrame, detailing column names and data types.
- Partitioning: Defines how the data is physically distributed across the nodes of the Spark cluster, which affects resource utilization and parallelism.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
  .appName("PySpark Basic Structured Operations")\
  .getOrCreate()


In [2]:
df = spark.read.format("json").load("../data/flight-data/json/2015-summary.json")

df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



                                                                                

Schemas tie everything together, so they’re worth belaboring.

## Schemas
- Schema Definition: A schema defines the column names and data types of a DataFrame.
- Schema-on-Read:
  - Spark infers the schema from the data while reading it.
  - Suitable for ad hoc analysis.
  - Works well with plain-text formats like CSV or JSON.
  - Can be slow and may cause precision issues, such as a long value being read as an integer.
- Explicit Schema Definition:
  - Manually define the schema before reading data.
  - Recommended for production ETL processes.
  - Ensures that each column gets the intended data type.
  - Avoids the precision issues of inference.
- Ad Hoc Analysis:
  - Schema-on-read is generally sufficient for quick, exploratory analysis.
- Precision Issues:
  - Automated schema inference may not always be accurate.
- ETL Processes:
  - An explicit schema definition benefits both performance and accuracy in structured workflows.

In [3]:
spark.read.format("json").load("../data/flight-data/json/2015-summary.json").schema

StructType([StructField('DEST_COUNTRY_NAME', StringType(), True), StructField('ORIGIN_COUNTRY_NAME', StringType(), True), StructField('count', LongType(), True)])

- Schema Composition:
  - A schema is a StructType made up of multiple StructFields.
  - Each StructField has:
    - A name.
    - A type.
    - A Boolean flag that specifies whether the column can contain null or missing values.
    - Optional metadata for additional information.
- Metadata:
  - Used to store information about columns.
  - Utilized by Spark's machine learning library.
- Nested StructTypes (Spark's complex types):
  - Schemas can contain other StructTypes.
- Runtime Type Checking:
  - Spark throws an error if the data types at runtime do not match the schema.


In [4]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

# Field names must match the keys in the JSON file; a misspelled name
# would produce a column of nulls.
myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello": "world"})
])

df = spark.read.format("json").schema(myManualSchema)\
  .load("../data/flight-data/json/2015-summary.json")

print(df)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]


## Columns and Expressions
- Column Similarity:
  - Columns in Spark are akin to columns in a spreadsheet, an R data frame, or a pandas DataFrame.
- Operations on Columns:
  - You can select, manipulate, and remove columns from DataFrames.
  - These operations are represented as expressions.
- Logical Constructions:
  - Columns are logical constructions that represent a value computed per record by means of an expression.
- Dependency on DataFrame:
  - A column's real value is tied to a row.
  - To have a row, you need a DataFrame.
  - Columns cannot be manipulated outside the context of a DataFrame.
- Spark Transformations:
  - Use Spark transformations within a DataFrame to modify the contents of a column.

### Columns
- The two simplest ways to construct and refer to columns are the col and column functions.

In [5]:
from pyspark.sql.functions import col, column

col("someColumnName")
column("someColumnName")

Column<'someColumnName'>

- Column Resolution:
  - Columns may or may not exist in a DataFrame; they are not resolved until they are compared against the catalog during the analyzer phase.
- Scala Syntactic Sugar:
  - $ and ' are shorthand ways to refer to columns.
  - $ designates a string as a special string that refers to an expression.
  - ' (the tick mark) creates a symbol, a Scala construct for referring to an identifier.
  - Both provide no performance improvement but offer shorthand notation.
- Explicit Column References:
  - Use the col method on a specific DataFrame for explicit references (df.col("count") in Scala; df["count"] in PySpark).
  - Useful for joining DataFrames with shared column names.
  - This avoids the need for Spark to resolve the column during the analyzer phase.

In [6]:
df.select(col("count"))

DataFrame[count: bigint]

### Expressions
- An expression is a set of transformations on one or more values in a record within a DataFrame.
- It functions like a computation that takes one or more column names as input, resolves them, and produces a single value per record.
- The single value can be a complex type like a Map or Array.
- Creating Expressions:
  - Simplest case: expr("someCol") is equivalent to col("someCol").
  - The expr function can parse transformations and column references from a string.
- Columns as Expressions:
  - Columns provide a subset of expression functionality.
  - Use col() when applying transformations to a column reference.
  - Example: expr("someCol - 5") is equivalent to col("someCol") - 5.
- Logical Plan:
  - Columns and transformations compile to the same logical plan as parsed expressions.
  - The logical plan is a directed acyclic graph.

In [7]:
from pyspark.sql.functions import expr

expr("(((someCol + 5) * 200) - 6) < otherCol")

Column<'((((someCol + 5) * 200) - 6) < otherCol)'>

- SQL Equivalence:
  - The expression passed to expr can also be valid SQL code.
  - SQL expressions and DataFrame code compile to the same logical tree, providing identical performance characteristics.
- Accessing DataFrame's Columns:
  - Use printSchema to see DataFrame's schema.
  - Use the columns property to programmatically access all columns of a DataFrame.

In [8]:
df = spark.read.format("json").load("../data/flight-data/json/2015-summary.json")
print(df.columns)

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']


## Records and Rows
- Records and Rows in Spark:
  - Each row in a DataFrame is a single record, represented as an object of type Row.
  - Spark manipulates Row objects using column expressions to produce usable values.
  - Row objects internally represent arrays of bytes.
- Byte Array Interface:
  - The byte array interface of Row objects is not exposed to users.
  - Users interact with column expressions to manipulate Row objects.
- Returning Rows:
  - Commands that return individual rows to the driver will return one or more Row types when working with DataFrames.
- Terminology Note:
  - "row" and "record" are used interchangeably.
  - A capitalized Row refers specifically to the Row object.

In [9]:
df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

### Creating Rows
- Creating Rows:
  - Rows can be instantiated manually using the Row class with a value for each column.
  - Only DataFrames have schemas; Rows do not.
  - When creating a Row manually, values must be specified in the same order as the DataFrame schema they will be appended to.

In [10]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)

- Accessing Data in Rows: specify the position of the value you want.

In [11]:
myRow[0]
myRow[2]

1

## DataFrame Transformations
- Core Parts of a DataFrame:
  - The sections above briefly defined the core parts of a DataFrame.
- Manipulating DataFrames:
  - This is the fundamental objective when working with DataFrames.
- Core Operations:
  - Add rows or columns.
  - Remove rows or columns.
  - Transform a row into a column or vice versa.
  - Change the order of rows based on the values in columns.
- Most of these can be expressed as simple transformations, the most common being those that take one column, change it row by row, and return the results.

### Creating DataFrames
- DataFrames can be created from raw data sources.
- Register DataFrames as temporary views for SQL queries and transformations.

In [12]:
df = spark.read.format("json")\
  .load("../data/flight-data/json/2015-summary.json")
# A session-scoped temporary view, so SQL can refer to it simply as dfTable.
df.createOrReplaceTempView("dfTable")

- Creating DataFrames on the fly:
  - Convert a set of rows into a DataFrame.
  - Define schema using StructType and StructField.

In [13]:
from pyspark.sql import Row

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False)
])

myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

                                                                                

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|            Hello|               NULL|    1|
+-----------------+-------------------+-----+



- Key Methods for DataFrames:
  - *select* method for working with columns or expressions
  - *selectExpr* method for working with expressions in strings.
  - Use functions from pyspark.sql.functions (org.apache.spark.sql.functions in Scala) for transformations not available as column methods.
- Transformation Tools:
  - With *select*, *selectExpr*, and the functions in this package, you can handle most DataFrame transformation challenges.


### select and selectExpr
- *select* and *selectExpr*:
  - Equivalent to SQL queries on a DataFrame.
- Using *select* and *selectExpr*:
  - Manipulate columns in DataFrames.
  - The simplest way is to use the *select* method and pass column names as strings.

In [14]:
# in Scala
# df.select("DEST_COUNTRY_NAME").show(2)
# in Python
df.select("DEST_COUNTRY_NAME").show(2)
# in SQL
# SELECT DEST_COUNTRY_NAME FROM dfTable LIMIT 2

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [15]:
# in Python
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)
# in SQL
#SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME FROM dfTable LIMIT 2

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



- Referring to Columns:
  - Columns can be referred to in multiple ways and used interchangeably.
- Methods to Refer to Columns:

In [16]:
from pyspark.sql.functions import expr, col, column

df.select(
  expr("DEST_COUNTRY_NAME"),
  col("DEST_COUNTRY_NAME"),
  column("DEST_COUNTRY_NAME")
).show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



- Common Error:
  - In Scala, mixing Column objects and strings in the same select call results in a compiler error. PySpark accepts the mix, as the output below shows:

In [17]:
df.select(col("DEST_COUNTRY_NAME"), "DEST_COUNTRY_NAME")

DataFrame[DEST_COUNTRY_NAME: string, DEST_COUNTRY_NAME: string]

- Using expr for Flexibility:
  - *expr* is the most flexible way to reference columns, as it can refer to a plain column or to a string manipulation of a column.

In [18]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

# -- in SQL
# SELECT DEST_COUNTRY_NAME as destination FROM dfTable LIMIT 2

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



In [19]:
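# The result of expr is itself a Column, so it can be manipulated further;
# here alias changes the name back to DEST_COUNTRY_NAME.
df.select(expr("DEST_COUNTRY_NAME AS destination").alias("DEST_COUNTRY_NAME"))\
  .show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows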



Because select followed by a series of expr calls is such a common pattern, Spark has a shorthand for it: *selectExpr*.

In [20]:
df.selectExpr("DEST_COUNTRY_NAME AS newColumnName", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



- Power of selectExpr:
  - Allows building complex expressions to create new DataFrames.
  - Can include any valid non-aggregating SQL statement, as long as the columns resolve.

In [21]:
df.selectExpr(
  "*",
  "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")\
    .show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



With *selectExpr*, we can also specify aggregations over the entire DataFrame by taking advantage of the functions that we have.

In [22]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



### Converting to Spark Types (Literals)
- Converting to Spark Types:
  - Literals are used to pass explicit values into Spark that are just values, not new columns.
  - Useful for constant values or for comparisons later.
- Using Literals:
  - Translate a literal value from a programming language to one that Spark understands.
  - Literals are expressions and can be used similarly.

In [23]:
from pyspark.sql.functions import lit

df.select(expr("*"), lit(1).alias("One")).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



### Adding Columns
- Use the *withColumn* method to add a new column to a DataFrame.

In [24]:
df.withColumn("numberOne", lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



- Adding Columns with Expressions:
  - Adding a column with a Boolean expression to check if the origin country is the same as the destination country.

In [25]:
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
    .show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



- *withColumn* Function:
  - Takes two arguments: the column name and the expression that creates the value for each row.
  - Can also "rename" a column by copying it under a new name; the original column is kept, as the output below shows.

In [26]:
df.withColumn("Destination", expr("DEST_COUNTRY_NAME")).columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count', 'Destination']

### Renaming Columns
- Renaming Columns:
  - Use the *withColumnRenamed* method to rename a column.
  - It renames the column named in the first argument to the string given in the second argument.

In [27]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "destination").columns

['destination', 'ORIGIN_COUNTRY_NAME', 'count']

### Reserved Characters and Keywords
- Use backticks (`) to escape column names that contain reserved characters like spaces or dashes.

No escape characters are needed in *withColumn*, because its first argument is just a string for the new column name.

In [28]:
dfWithLongColName = df.withColumn(
      "This Long Column-Name",
      expr("ORIGIN_COUNTRY_NAME"))
# show returns None, so assign the DataFrame first and display it separately.
dfWithLongColName.show(2)

+-----------------+-------------------+-----+---------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|This Long Column-Name|
+-----------------+-------------------+-----+---------------------+
|    United States|            Romania|   15|              Romania|
|    United States|            Croatia|    1|              Croatia|
+-----------------+-------------------+-----+---------------------+
only showing top 2 rows



Using backticks for reserved characters:

In [29]:
dfWithLongColName.selectExpr(
      "`This Long Column-Name`",
      "`This Long Column-Name` as `new col`")\
    .show(2)
dfWithLongColName.createOrReplaceTempView("dfTableLong")

# -- in SQL
#   SELECT `This Long Column-Name`, `This Long Column-Name` as `new col`
#   FROM dfTableLong LIMIT 2

+---------------------+-------+
|This Long Column-Name|new col|
+---------------------+-------+
|              Romania|Romania|
|              Croatia|Croatia|
+---------------------+-------+
only showing top 2 rows

### Case Sensitivity
By default, Spark is case insensitive; however, you can make Spark case sensitive by setting the configuration:

In [None]:
# -- in SQL
#   SET spark.sql.caseSensitive true
# in Python
spark.conf.set("spark.sql.caseSensitive", "true")

### Removing Columns
- We can remove columns by selecting only the ones we want to keep with *select*.
- There is also a dedicated method called *drop*.

In [None]:
df.drop("ORIGIN_COUNTRY_NAME").columns

['DEST_COUNTRY_NAME', 'count']

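In [None]:
# We can drop multiple columns by passing them as separate arguments.
# The long-named column is added inline so that this cell stands on its own.
df.withColumn("This Long Column-Name", expr("ORIGIN_COUNTRY_NAME"))\
  .drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")

DataFrame[count: bigint, This Long Column-Name: string]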

### Changing a Column’s Type (cast)
- We can do this by casting the column from one type to another.

In [None]:
df.withColumn("count2", col("count").cast("long"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: bigint]

### Filtering Rows
- Create an expression that evaluates to true or false to filter rows.
- Rows for which the expression evaluates to false are filtered out.
- Methods for Filtering:
  - *where*: Familiar from SQL.
  - *filter*: Also valid and performs the same operation.
  - Both methods accept the same argument types and work with DataFrames.

**Note**: When using the Dataset API from Scala or Java, filter accepts an arbitrary function to apply to each record in the Dataset.

- Equivalent filters: The results of using *filter* and *where* are the same.

In [None]:
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



- Multiple filters:
  - You can chain multiple filters sequentially.
  - Spark performs all filtering operations at the same time regardless of the filter ordering.

In [None]:
# in Python
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia").show(2)
# -- in SQL
#   SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != "Croatia"
#   LIMIT 2

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



### Getting Unique Rows
- Extract the unique or distinct values in a DataFrame using the *distinct* method.
- Deduplicate rows based on one or more columns.

In [None]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

256

In [None]:
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

125

### Random Samples
- Use the *sample* method on a DataFrame to extract random records.
- Specify the fraction of rows to extract and whether to sample with or without replacement.

In [None]:
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

138

### Concatenating and Appending Rows (Union)
- DataFrames Are Immutable:
  - You cannot append to a DataFrame directly, because that would change it.
  - To append, use the union method to concatenate the original DataFrame with the new one.
- Union Requirement:
  - Both DataFrames must have the same schema and number of columns.
  - The union operation will fail if the schemas or column counts do not match.
- **Warning:**
  - Unions are performed based on column position, not on the schema.
  - Columns may not line up as expected if their positions differ.

In [None]:
schema = df.schema
newRows = [
    Row("New Country", "Other Country", 5),
    Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)


In [None]:
df.union(newDF)\
    .where("count = 1")\
    .where(col("ORIGIN_COUNTRY_NAME") != "United States")\
    .show()

                                                                                

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|    United States|          Gibraltar|    1|
|    United States|             Cyprus|    1|
|    United States|            Estonia|    1|
|    United States|          Lithuania|    1|
|    United States|           Bulgaria|    1|
|    United States|            Georgia|    1|
|    United States|            Bahrain|    1|
|    United States|   Papua New Guinea|    1|
|    United States|         Montenegro|    1|
|    United States|            Namibia|    1|
|    New Country 2|    Other Country 3|    1|
+-----------------+-------------------+-----+



### Sorting Rows 
- Sort values in a DataFrame to place the largest or smallest values at the top.
- Two equivalent operations: sort and orderBy.
- Both methods accept column expressions, strings, and multiple columns.
- Default sort order is ascending.

In [None]:
df.sort("count").show(5)
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--

In [None]:
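from pyspark.sql.functions import desc, asc

# Caution: expr("count desc") parses "desc" as a column alias, so the first
# call sorts in ascending order, as the first table below shows. Use
# desc("count") or col("count").desc() to sort in descending order.
df.orderBy(expr("count desc")).show(2)
df.orderBy(expr("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)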


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows



In [None]:
# sortWithinPartitions sorts the rows inside each partition rather than globally.
spark.read.format("json").load("../data/flight-data/json/*-summary.json")\
  .sortWithinPartitions("count")

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Limit

In [None]:
df.limit(5).show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+



In [None]:
# Caution: expr("count desc") treats "desc" as an alias, so the rows below are in ascending order.
df.orderBy(expr("count desc")).limit(6).show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
|             Moldova|      United States|    1|
+--------------------+-------------------+-----+



### Repartition and Coalesce
- Repartition and Coalesce:
  - Important for optimizing data layout across the cluster.
  - Control the physical layout of data, including the partitioning scheme and the number of partitions.
- Repartition:
  - Incurs a full shuffle of the data, regardless of whether one is necessary.
  - Recommended when the future number of partitions is greater than the current number or when partitioning by specific columns.
- Coalesce:
  - Does not incur a full shuffle; it tries to combine existing partitions.

In [None]:
df.rdd.getNumPartitions()
df.repartition(5).show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|           Greece|      United States|   30|
|    United States|            Bermuda|  193|
|    United States|           Portugal|  134|
|    United States|Trinidad and Tobago|  217|
|          Romania|      United States|   14|
+-----------------+-------------------+-----+
only showing top 5 rows



Repartition by column

In [None]:
df.repartition(col("DEST_COUNTRY_NAME")).show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|         Anguilla|      United States|   41|
|           Russia|      United States|  176|
|         Paraguay|      United States|   60|
|          Senegal|      United States|   40|
|           Sweden|      United States|  118|
+-----------------+-------------------+-----+
only showing top 5 rows



In [None]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2).show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|             Moldova|      United States|    1|
|             Bolivia|      United States|   30|
|             Algeria|      United States|    4|
|Turks and Caicos ...|      United States|  230|
|            Pakistan|      United States|   12|
+--------------------+-------------------+-----+
only showing top 5 rows



### Collecting Rows to the Driver
- Collecting Rows to the Driver:
  - Spark maintains the state of the cluster in the driver.
  - Collect data to the driver for local manipulation.
- Methods for Collecting Data:
  - collect: Retrieves all data from the entire DataFrame.
  - take(N): Selects the first N rows.
  - show(N): Prints out a specified number of rows.

In [None]:
collectDF = df.limit(10)
collectDF.take(5) # take works with an Integer count
collectDF.show() # this prints it out nicely
collectDF.show(5, False)
collectDF.collect()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India         

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Grenada', count=62),
 Row(DEST_COUNTRY_NAME='Costa Rica', ORIGIN_COUNTRY_NAME='United States', count=588),
 Row(DEST_COUNTRY_NAME='Senegal', ORIGIN_COUNTRY_NAME='United States', count=40),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

In [None]:
collectDF.toLocalIterator()

<generator object _local_iterator_from_socket.<locals>.PyLocalIterable.__iter__ at 0x1196f9150>

- Warning on Collecting Data to the Driver:
  - Collecting data to the driver can be very expensive.
  - Large datasets may crash the driver when using collect.
  - Using toLocalIterator with large partitions can crash the driver node and lose the application state.
  - toLocalIterator can also be expensive because it processes the data one partition at a time instead of running computations in parallel.