In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Ch 3. Basic Structured Operations

In [2]:
df = spark.read.format("json")\
    .load("./2015-summary.json")

In [3]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



## Schemas

In [5]:
spark.read.format("json")\
    .load("./2015-summary.json").schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

In [34]:
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False),
])

In [35]:
df = spark.read.format("json")\
    .schema(myManualSchema)\
    .load("./2015-summary.json")

In [36]:
from pyspark.sql.functions import col, column

In [37]:
col("someColumnName")
column("someColumnName")

Column<b'someColumnName'>

## Explicit Column References

If you need to refer to a specific DataFrame's column, you can use the col method on that DataFrame (in PySpark, the equivalent is indexing, for example df["DEST_COUNTRY_NAME"]). This is useful when you are performing a join and need to refer to a column in one DataFrame that may share a name with a column in the other DataFrame. We will see this in the joins chapter. As an added benefit, Spark does not need to resolve this column itself because we already did that for Spark.

## Expressions

We mentioned that columns are expressions, so what is an expression? An expression is a set of transformations on one or more values in a record in a DataFrame. Think of it as a function that takes one or more column names as input, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset. Importantly, this "single value" can actually be a complex type such as a Map or an Array.

## Columns as Expressions

Columns provide a subset of expression functionality. If you use col() and wish to perform transformations on that column, you must perform them on that column reference. When using an expression, the expr function can parse transformations and column references from a string, and the result can subsequently be passed into further transformations. Let's look at some examples.

In [12]:
(((col("someCol") + 5) * 200) - 6) < col("otherCol")

Column<b'((((someCol + 5) * 200) - 6) < otherCol)'>

In [13]:
from pyspark.sql.functions import expr
expr("(((someCol +5) * 200) - 6) < otherCol")

Column<b'((((someCol + 5) * 200) - 6) < otherCol)'>

This is an extremely important point to reinforce. Notice how the previous expression is also valid SQL code, just like something you might put in a SELECT statement. That's because this SQL expression and the previous DataFrame code compile to the same underlying logical tree prior to execution. This means you can write your expressions as DataFrame code or as SQL expressions and get exactly the same benefits. You likely saw all of this in the first chapters of the book, and we covered it more extensively in the Overview of the Structured APIs chapter.

## Records and Rows

In Spark, each row in a DataFrame is a single record, and Spark represents that record as an object of type Row. Row objects are the objects that column expressions operate on to produce some usable value. Internally, Row objects represent physical byte arrays. The byte array interface is never shown to users because we only use column expressions to manipulate them.

In [14]:
df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

## Creating Rows

You can create rows by manually instantiating a Row object with the values that belong in each column. It's important to note that only DataFrames have schemas; rows themselves do not. This means that if you create a Row manually, you must specify the values in the same order as the schema of the DataFrame they may be appended to. We will see this when we discuss creating DataFrames.

In [21]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)

In [22]:
myRow[0]

'Hello'

In [23]:
myRow[2]

1

## DataFrame Transformations

Now that we have briefly defined the core parts of a DataFrame, we will move on to manipulating DataFrames. When working with individual DataFrames, there are some fundamental objectives. These break down into several core operations.

## Creating DataFrames

As we saw previously, we can create DataFrames from raw data sources. This is covered extensively in the Data Sources chapter; however, we will use them now to create an example DataFrame. For illustration purposes later in this chapter, we will also register this as a temporary view so that we can query it with SQL.

In [25]:
df.createOrReplaceTempView("dfTable")

In [27]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField("some", StringType(), True),
    StructField("col", StringType(), True),
    StructField("names", LongType(), False),
])

myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
+-----+----+-----+



In [28]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [32]:
df.describe()

DataFrame[summary: string, DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: string]

In [38]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)



+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



In [39]:
from pyspark.sql.functions import expr, col, column

df.select(
    expr("DEST_COUNTRY_NAME"),
    col("DEST_COUNTRY_NAME"),
    column("DEST_COUNTRY_NAME"))\
    .show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



In [40]:
df.select(expr("DEST_COUNTRY_NAME AS destination"))

DataFrame[destination: string]

In [41]:
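# alias() is applied to the column expression inside select(),
# so it renames the column back from "destination" to "DEST_COUNTRY_NAME".
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string]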

In [42]:
df.selectExpr(
    "DEST_COUNTRY_NAME  as newColumnName",
    "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



In [43]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))")

DataFrame[avg(count): double, count(DISTINCT DEST_COUNTRY_NAME): bigint]

## Converting to Spark Types

In [44]:
from pyspark.sql.functions import lit

df.select(
    expr("*"),
    lit(1).alias("One")
).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



## Adding Columns

In [45]:
df.withColumn("numberOne", lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



In [None]:
df.withColumn(
    "withincountry",
    expr("ORIGIN_COU"))