<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Structured API (Ch 4: Spark Structured API and Basic Operations)</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

## Chapter 4 of Spark Definitive Guide

## Overview
- Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files.
- These APIs refer to three core types of distributed collection APIs:
 - Datasets
 - DataFrames
 - SQL tables and views
- The majority of the Structured APIs apply to both batch and streaming computation. 
- It should be simple to migrate from batch to streaming (or vice versa) with little to no effort. 



## DataFrames and Datasets
- Spark has two notions of structured collections: DataFrames and Datasets. 
- Although they have (nuanced) differences, both are (distributed) table-like collections with well-defined rows and columns. 
- DataFrames and Datasets represent immutable, lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output. 
- When we perform an action on a DataFrame, we instruct Spark to perform the actual transformations and return the result. 


## DataFrames and Datasets
- Datasets are available only in Java and Scala.
- DataFrames are Datasets of type Row.
- The “Row” type is Spark’s internal representation of its optimized in-memory format for computation. 
- This format makes for highly specialized and efficient computation.
- To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame and therefore we always operate on that optimized format.


## Data Types
- Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value.
- A row is nothing more than a record of data. 
- Each record in a DataFrame must be of type Row.


In [1]:
spark.version

'3.1.3'

In [2]:
spark.range(2).collect()

                                                                                

[Row(id=0), Row(id=1)]

- Spark has a large number of internal type representations: 

In [3]:
from pyspark.sql.types import *
b = ByteType()
b

ByteType


## Structured API Execution

- Understanding how code is actually executed across a cluster will help us understand (and potentially debug) the process of writing and executing code on clusters:
- An overview of the steps:
 - 1. Write DataFrame/Dataset/SQL Code.
 - 2. If valid code, Spark converts this to a Logical Plan.
 - 3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
 - 4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.


## API References
- Spark is a growing project, and any book is a snapshot in time. Find the latest Python API reference at
http://spark.apache.org/docs/latest/api/python/
- pyspark.sql.functions (org.apache.spark.sql.functions in Scala and Java) contains a variety of functions for a range of different data types. 
- The majority of these functions are ones that you will find in SQL and analytics systems.


### Create a DataFrame

In [6]:
df = spark.read.format("json").load("gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.json")

                                                                                

In [7]:
df

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [8]:
df.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

Show the schema

In [9]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



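The example that follows shows how to define a schema manually and apply it when reading a file.
Although the schema declares count as non-nullable, printSchema on the loaded DataFrame still
reports nullable = true, so do not rely on the reader to enforce that flag.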

In [10]:
# COMMAND ----------

from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello":"world"})
])

In [12]:
df = spark.read.format("json").schema(myManualSchema)\
  .load("gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.json")

In [13]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



### Columns
There are a lot of different ways to construct and refer to columns but the two simplest ways are by
using the col or column functions. To use either of these functions, you pass in a column name:

In [14]:
from pyspark.sql.functions import col, column

In [15]:
col("somecol")

Column<'somecol'>

In [16]:
column("someColumn")

Column<'someColumn'>

### Columns as expressions
Columns provide a subset of expression functionality. If you use col() and want to perform
transformations on that column, you must perform those on that column reference. When using an
expression, the expr function can actually parse transformations and column references from a string
and can subsequently be passed into further transformations. Let’s look at some examples.

In [17]:
from pyspark.sql.functions import expr

In [18]:
expr("(((someCol + 5) * 200) - 6) < otherCol")

Column<'((((someCol + 5) * 200) - 6) < otherCol)'>

### Records and Rows
In Spark, each row in a DataFrame is a single record. Spark represents this record as an object of
type Row. Spark manipulates Row objects using column expressions in order to produce usable values.
Row objects internally represent arrays of bytes. The byte array interface is never shown to users
because we only use column expressions to manipulate them.

In [19]:
from pyspark.sql import Row

In [20]:
myRow = Row("Hello", None, 1, False)

In [21]:
myRow[1]

Accessing data in rows is equally easy: you just specify the position that you would like. Position 1 holds None, so the cell above displays no output.

In [22]:
# COMMAND ----------

myRow[0]

'Hello'

In [23]:
myRow[2]

1

### Creating DataFrames
We can create DataFrames from raw data sources. We will also register this as a temporary view so that we can query it
with SQL and show off basic transformations in SQL.

In [24]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

In [25]:
mySchema = StructType([
    StructField("Destination", StringType(), True),
    StructField("Origin", StringType(), True),
    StructField("count", LongType(), False)
]
)

In [26]:
# COMMAND ----------

df = spark.read.format("csv").schema(mySchema).load("gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv")

In [27]:
df.printSchema()

root
 |-- Destination: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- count: long (nullable = true)



In [28]:
df.show()

+--------------------+-------------------+-----+
|         Destination|             Origin|count|
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United State

In [29]:
df.createOrReplaceTempView("dfTable")

In [30]:
df.show(5)

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
+-----------------+-------------------+-----+
only showing top 5 rows



We can also create DataFrames on the fly by taking a set of rows and converting them to a DataFrame.

In [31]:
# COMMAND ----------

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
  StructField("some", StringType(), True),
  StructField("col", StringType(), True),
  StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
+-----+----+-----+



                                                                                

First, check how many rows the DataFrame holds:

In [32]:
df.count()

257

Use the select method and pass in the names of the columns you would like to work with as strings.
You can select multiple columns by using the same style of query; just add more column name strings
to your select method call:

In [33]:
# COMMAND ----------

df.select("Destination", "Origin").show(2)

+-----------------+-------------------+
|      Destination|             Origin|
+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
|    United States|            Romania|
+-----------------+-------------------+
only showing top 2 rows



You can refer to columns in a number of different ways;
all you need to keep in mind is that you can use them interchangeably:

In [34]:
# COMMAND ----------

from pyspark.sql.functions import expr, col, column
df.select(
    expr("Destination"),
    col("Destination"),
    column("Destination"))\
  .show(2)

+-----------------+-----------------+-----------------+
|      Destination|      Destination|      Destination|
+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



As we’ve seen thus far, expr is the most flexible reference that we can use. It can refer to a plain
column or a string manipulation of a column. To illustrate, let’s rename the column by using the AS
keyword, and then rename it again by using the alias method on the column:

In [35]:
# COMMAND ----------

df.select(expr("Destination as DEST_COUNTRY_NAME")).show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|DEST_COUNTRY_NAME|
|    United States|
+-----------------+
only showing top 2 rows



This changes the column name to “DEST_COUNTRY_NAME.” You can further manipulate the result of your
expression as another expression; here the alias method renames it again, to “destination”:

In [36]:
# COMMAND ----------

df.select(expr("Destination as DEST_COUNTRY_NAME").alias("destination"))\
  .show(2)

+-----------------+
|      destination|
+-----------------+
|DEST_COUNTRY_NAME|
|    United States|
+-----------------+
only showing top 2 rows



In [37]:
df.select("Destination", col("Destination")).show(2)

+-----------------+-----------------+
|      Destination|      Destination|
+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
|    United States|    United States|
+-----------------+-----------------+
only showing top 2 rows



Because select followed by a series of expr is such a common pattern, Spark has a shorthand for
doing this efficiently: selectExpr. This is probably the most convenient interface for everyday use:

In [38]:
# COMMAND ----------

df.selectExpr("Destination as newColumnName", "Destination").show(2)

+-----------------+-----------------+
|    newColumnName|      Destination|
+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
|    United States|    United States|
+-----------------+-----------------+
only showing top 2 rows



This opens up the true power of Spark. We can treat selectExpr as a simple way to build up
complex expressions that create new DataFrames. In fact, we can add any valid non-aggregating SQL
statement, and as long as the columns resolve, it will be valid! Here’s a simple example that adds a
new column withinCountry to our DataFrame that specifies whether the destination and origin are
the same:

In [39]:
df.selectExpr("*", "(Destination=Origin) as withinCountry").show(2)

+-----------------+-------------------+-----+-------------+
|      Destination|             Origin|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|        false|
|    United States|            Romania|   15|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



With select expression, we can also specify aggregations over the entire DataFrame by taking
advantage of the functions that we have. These look just like what we have been showing so far:

In [51]:
from pyspark.sql.functions import avg

In [41]:
df.selectExpr("avg(count)").show(4)

+-----------+
| avg(count)|
+-----------+
|1770.765625|
+-----------+



In [42]:
df.selectExpr("count(distinct(Destination))").show(4)

+---------------------------+
|count(DISTINCT Destination)|
+---------------------------+
|                        133|
+---------------------------+



In [43]:
df.select("Destination").distinct().show(2)

+-----------+
|Destination|
+-----------+
|   Anguilla|
|     Russia|
+-----------+
only showing top 2 rows



In [46]:
df.createOrReplaceTempView("sqlTable")

In [47]:
spark.sql("select Destination as dest from sqlTable where count>1000").show()

+------------------+
|              dest|
+------------------+
|            Mexico|
|     United States|
|     United States|
|           Germany|
|            Canada|
|Dominican Republic|
|             Japan|
|     United States|
|     United States|
|    United Kingdom|
|     United States|
|     United States|
|     United States|
|       South Korea|
+------------------+



### Converting to Spark Types (Literals)
Sometimes, we need to pass explicit values into Spark that are just a value (rather than a new
column). This might be a constant value or something we’ll need to compare to later on. The way we
do this is through literals. This is basically a translation from a given programming language’s literal
value to one that Spark understands. Literals are expressions and you can use them in the same way:

In [48]:
# COMMAND ----------

from pyspark.sql.functions import lit

In [49]:
df.select("Destination", lit(1)).show(5)

+-----------------+---+
|      Destination|  1|
+-----------------+---+
|DEST_COUNTRY_NAME|  1|
|    United States|  1|
|    United States|  1|
|    United States|  1|
|            Egypt|  1|
+-----------------+---+
only showing top 5 rows



In [50]:
df.select("*", lit(1).alias("One")).show(3)

+-----------------+-------------------+-----+---+
|      Destination|             Origin|count|One|
+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|  1|
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 3 rows



### Adding Columns
There’s also a more formal way of adding a new column to a DataFrame, and that’s by using the
withColumn method on our DataFrame. For example, let’s add a column that just adds the number
one as a column:

In [51]:
# COMMAND ----------

df.withColumn("numberOne", lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|      Destination|             Origin|count|numberOne|
+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|        1|
|    United States|            Romania|   15|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



Let’s do something a bit more interesting and make it an actual expression. In the next example, we’ll
set a Boolean flag for when the origin country is the same as the destination country:

In [53]:
# COMMAND ----------

df.withColumn("withinCountry", expr("Origin == Destination"))\
  .show(2)

+-----------------+-------------------+-----+-------------+
|      Destination|             Origin|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|        false|
|    United States|            Romania|   15|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



Notice that the withColumn function takes two arguments: the column name and the expression that
will create the value for that given row in the DataFrame. Interestingly, we can also rename a column
this way, by passing the existing column as the expression, which adds a copy under the new name.
An alternative is to use the withColumnRenamed method, which renames the column named by the
string in the first argument to the string in the second argument.

### Removing Columns
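Now that we’ve seen how to add columns, let’s take a look at how we can remove columns from
DataFrames. You likely already noticed that we can do this by using select. However, there is also a
dedicated method called drop:
df.drop("ORIGIN_COUNTRY_NAME").columns
We can drop multiple columns by passing in multiple columns as arguments. Note that drop silently
ignores names that are not in the schema. The DataFrame used here has the columns Destination,
Origin, and count, so the call below removes nothing: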

In [55]:
dfWithLongColName = df

In [56]:
dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").columns

['Destination', 'Origin', 'count']

In [57]:
df.columns

['Destination', 'Origin', 'count']

### Filtering Rows
To filter rows, we create an expression that evaluates to true or false. You then filter out the rows
with an expression that is equal to false. The most common way to do this with DataFrames is to
create either an expression as a String or build an expression by using a set of column manipulations.
There are two methods to perform this operation: you can use where or filter and they both will
perform the same operation and accept the same argument types when used with DataFrames. We will
stick to where because of its familiarity to SQL; however, filter is valid as well.

In [59]:
# COMMAND ----------

df.where(col("count") < 2).where(col("Origin") != "Croatia")\
  .show(2)

+-------------+-------------+-----+
|  Destination|       Origin|count|
+-------------+-------------+-----+
|United States|    Singapore|    1|
|      Moldova|United States|    1|
+-------------+-------------+-----+
only showing top 2 rows



### Getting Unique Rows
A very common use case is to extract the unique or distinct values in a DataFrame. These values can
be in one or more columns. The way we do this is by using the distinct method on a DataFrame,
which allows us to deduplicate any rows that are in that DataFrame. For instance, let’s get the unique
origin and destination pairs in our dataset. This, of course, is a transformation that will return a new
DataFrame with only unique rows:

In [60]:
# COMMAND ----------

df.select("Origin", "Destination").distinct().show()

+----------------+--------------------+
|          Origin|         Destination|
+----------------+--------------------+
|         Romania|       United States|
|         Croatia|       United States|
|         Ireland|       United States|
|   United States|               Egypt|
|           India|       United States|
|       Singapore|       United States|
|         Grenada|       United States|
|   United States|          Costa Rica|
|   United States|             Senegal|
|   United States|             Moldova|
|    Sint Maarten|       United States|
|Marshall Islands|       United States|
|   United States|              Guyana|
|   United States|               Malta|
|   United States|            Anguilla|
|   United States|             Bolivia|
|        Paraguay|       United States|
|   United States|             Algeria|
|   United States|Turks and Caicos ...|
|       Gibraltar|       United States|
+----------------+--------------------+
only showing top 20 rows



### Random Splits
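Random splits can be helpful when you need to break up your DataFrame into random “splits” of
the original DataFrame. This is often used with machine learning algorithms to create training,
validation, and test sets. The first cell below draws a random sample with the sample method; the
cells after it use randomSplit with weights 0.25 and 0.75: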

In [61]:
# COMMAND ----------

seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

139

In [62]:
# COMMAND ----------

dataFrames = df.randomSplit([0.25, 0.75], seed)

In [63]:
dataFrames[1].show(2)

+-------------------+-------------+-----+
|        Destination|       Origin|count|
+-------------------+-------------+-----+
|           Anguilla|United States|   41|
|Antigua and Barbuda|United States|  126|
+-------------------+-------------+-----+
only showing top 2 rows



In [60]:
dataFrames[0].count() > dataFrames[1].count() # False

False

### Concatenating and Appending Rows (Union)
As you learned in the previous section, DataFrames are immutable. This means users cannot append
to DataFrames because that would be changing them. To append to a DataFrame, you must union the
original DataFrame along with the new DataFrame. This just concatenates the two DataFrames. To
union two DataFrames, you must be sure that they have the same schema and number of columns;
otherwise, the union will fail.

In [64]:
# COMMAND ----------

from pyspark.sql import Row
schema = df.schema
newRows = [
  Row("New Country", "Other Country", 5),
  Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)

In [65]:
newDF.show()

+-------------+---------------+-----+
|  Destination|         Origin|count|
+-------------+---------------+-----+
|  New Country|  Other Country|    5|
|New Country 2|Other Country 3|    1|
+-------------+---------------+-----+



In [66]:
# COMMAND ----------

df.union(newDF)\
  .where("count = 1")\
  .where(col("Origin") != "United States")\
  .show()

+-------------+----------------+-----+
|  Destination|          Origin|count|
+-------------+----------------+-----+
|United States|         Croatia|    1|
|United States|       Singapore|    1|
|United States|       Gibraltar|    1|
|United States|          Cyprus|    1|
|United States|         Estonia|    1|
|United States|       Lithuania|    1|
|United States|        Bulgaria|    1|
|United States|         Georgia|    1|
|United States|         Bahrain|    1|
|United States|Papua New Guinea|    1|
|United States|      Montenegro|    1|
|United States|         Namibia|    1|
|New Country 2| Other Country 3|    1|
+-------------+----------------+-----+



### Sorting Rows
When we sort the values in a DataFrame, we always want to sort with either the largest or smallest
values at the top of a DataFrame. There are two equivalent operations to do this, sort and orderBy,
that work the exact same way. They accept both column expressions and strings as well as multiple
columns. The default is to sort in ascending order:

In [67]:
# COMMAND ----------

df.sort("count").show(5)

+--------------------+-------------------+-----+
|         Destination|             Origin|count|
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|       United States|          Singapore|    1|
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



In [68]:
df.orderBy("count", "Destination").show(5)

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



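To specify sort direction explicitly, use the asc and desc functions, or the asc and desc methods on a
column. These allow you to specify the order in which a given column should be sorted. Note that
passing the string "count desc" to expr, as in the first cell below, does not sort in descending order:
its output is still ascending, most likely because Spark parses the string as the column count with
the alias desc. The column methods in the second cell give the intended result: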

In [69]:
# COMMAND ----------

from pyspark.sql.functions import desc, asc
df.orderBy(expr("count desc")).show(2)

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|            Malta|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [70]:
df.orderBy(col("count").desc(), col("Destination").asc()).show(2)

+-------------+-------------+------+
|  Destination|       Origin| count|
+-------------+-------------+------+
|United States|United States|370002|
|United States|       Canada|  8483|
+-------------+-------------+------+
only showing top 2 rows



For optimization purposes, it’s sometimes advisable to sort within each partition before another set of
transformations. You can use the sortWithinPartitions method to do this:

In [71]:
# COMMAND ----------

spark.read.format("json").load("gs://info323-ya45-spring2023/notebooks/jupyter/*-summary.json")\
  .sortWithinPartitions("count")

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Limit
Oftentimes, you might want to restrict what you extract from a DataFrame; for example, you might
want just the top ten of some DataFrame. You can do this by using the limit method:

In [72]:
# COMMAND ----------

df.limit(5).show()

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
+-----------------+-------------------+-----+



In [73]:
# COMMAND ----------

df.orderBy(expr("count desc")).limit(6).show()

+--------------------+-------------------+-----+
|         Destination|             Origin|count|
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|       United States|          Singapore|    1|
|Saint Vincent and...|      United States|    1|
|               Malta|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
+--------------------+-------------------+-----+

Note that this output is in ascending order: expr("count desc") does not produce a descending sort here.
To get the six largest counts, order by col("count").desc() before calling limit.



## Spark SQL

In [74]:
df.createOrReplaceTempView('df_table')

In [75]:
spark.sql("select * from df_table").show(3)

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 3 rows



## To Pandas

In [76]:
df.toPandas() #Why and when?

Unnamed: 0,Destination,Origin,count
0,DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,
1,United States,Romania,15.0
2,United States,Croatia,1.0
3,United States,Ireland,344.0
4,Egypt,United States,15.0
...,...,...,...
252,United States,Saint Kitts and Nevis,145.0
253,Uruguay,United States,43.0
254,United States,Haiti,225.0
255,"Bonaire, Sint Eustatius, and Saba",United States,58.0


In [78]:
df_pd = df.toPandas()

In [79]:
df_sp = spark.createDataFrame(df_pd)

In [80]:
df_sp.show()

+--------------------+-------------------+-----+
|         Destination|             Origin|count|
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|  NaN|
|       United States|            Romania| 15.0|
|       United States|            Croatia|  1.0|
|       United States|            Ireland|344.0|
|               Egypt|      United States| 15.0|
|       United States|              India| 62.0|
|       United States|          Singapore|  1.0|
|       United States|            Grenada| 62.0|
|          Costa Rica|      United States|588.0|
|             Senegal|      United States| 40.0|
|             Moldova|      United States|  1.0|
|       United States|       Sint Maarten|325.0|
|       United States|   Marshall Islands| 39.0|
|              Guyana|      United States| 64.0|
|               Malta|      United States|  1.0|
|            Anguilla|      United States| 41.0|
|             Bolivia|      United States| 30.0|
|       United State

### Repartition and Coalesce
Another important optimization opportunity is to partition the data according to some frequently
filtered columns, which control the physical layout of data across the cluster including the partitioning
scheme and the number of partitions.
Repartition will incur a full shuffle of the data, regardless of whether one is necessary. This means
that you should typically only repartition when the future number of partitions is greater than your
current number of partitions or when you are looking to partition by a set of columns:

In [81]:
# COMMAND ----------

df.rdd.getNumPartitions() # 1

1

In [82]:
# COMMAND ----------

df_5 =df.repartition(5)

In [83]:
df_5.rdd.getNumPartitions()

5

If you know that you’re going to be filtering by a certain column often, it can be worth repartitioning
based on that column:

In [84]:
# COMMAND ----------

df.repartition(col("Destination"))

DataFrame[Destination: string, Origin: string, count: bigint]

You can optionally specify the number of partitions you would like, too:

In [85]:
# COMMAND ----------

df.repartition(5, col("Destination"))

DataFrame[Destination: string, Origin: string, count: bigint]

Coalesce, on the other hand, will not incur a full shuffle and will try to combine partitions. The
following operation will shuffle your data into five partitions based on the destination country name,
and then coalesce them (without a full shuffle):

In [86]:
# COMMAND ----------

df.repartition(5, col("Destination")).coalesce(2)

DataFrame[Destination: string, Origin: string, count: bigint]

### Collecting Rows to the Driver
As discussed in previous chapters, Spark maintains the state of the cluster in the driver. There are
times when you’ll want to collect some of your data to the driver in order to manipulate it on your
local machine.
Thus far, we did not explicitly define this operation. However, we used several different methods for
doing so that are effectively all the same. collect gets all data from the entire DataFrame, take
selects the first N rows, and show prints out a number of rows nicely.

In [87]:
# COMMAND ----------

collectDF = df.limit(10)
collectDF.take(5) # take works with an Integer count
collectDF.show() # this prints it out nicely
collectDF.show(5, False)
collectDF.collect()

+-----------------+-------------------+-----+
|      Destination|             Origin|count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| null|
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
+-----------------+-------------------+-----+

+-----------------+-------------------+-----+
|Destination      |Origin             |count|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|null |
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States 

[Row(Destination='DEST_COUNTRY_NAME', Origin='ORIGIN_COUNTRY_NAME', count=None),
 Row(Destination='United States', Origin='Romania', count=15),
 Row(Destination='United States', Origin='Croatia', count=1),
 Row(Destination='United States', Origin='Ireland', count=344),
 Row(Destination='Egypt', Origin='United States', count=15),
 Row(Destination='United States', Origin='India', count=62),
 Row(Destination='United States', Origin='Singapore', count=1),
 Row(Destination='United States', Origin='Grenada', count=62),
 Row(Destination='Costa Rica', Origin='United States', count=588),
 Row(Destination='Senegal', Origin='United States', count=40)]
