<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Structured API: Working with Different Types of Data including Nulls</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

## Let’s read in the DataFrame that we’ll be using for this analysis:

In [1]:
df = spark.read.format("json")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("gs://info323-ya45-spring2023/flights/tzcorr/all_flights-00010-of-00026")
df.printSchema()
df.createOrReplaceTempView("dfTable")

root
 |-- ARR_AIRPORT_LAT: double (nullable = true)
 |-- ARR_AIRPORT_LON: double (nullable = true)
 |-- ARR_AIRPORT_TZOFFSET: double (nullable = true)
 |-- ARR_DELAY: double (nullable = true)
 |-- ARR_TIME: string (nullable = true)
 |-- CANCELLED: boolean (nullable = true)
 |-- CRS_ARR_TIME: string (nullable = true)
 |-- CRS_DEP_TIME: string (nullable = true)
 |-- DEP_AIRPORT_LAT: double (nullable = true)
 |-- DEP_AIRPORT_LON: double (nullable = true)
 |-- DEP_AIRPORT_TZOFFSET: double (nullable = true)
 |-- DEP_DELAY: double (nullable = true)
 |-- DEP_TIME: string (nullable = true)
 |-- DEST: string (nullable = true)
 |-- DEST_AIRPORT_SEQ_ID: string (nullable = true)
 |-- DISTANCE: string (nullable = true)
 |-- DIVERTED: boolean (nullable = true)
 |-- FL_DATE: string (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- ORIGIN_AIRPORT_SEQ_ID: string (nullable = true)
 |-- TAXI_IN: double (nullable = true)
 |-- TAXI_OUT: double (nullable = true)
 |-- UNIQUE_CARRIER: string (nulla

In [3]:
df.show(3)

+---------------+---------------+--------------------+---------+-------------------+---------+-------------------+-------------------+---------------+---------------+--------------------+---------+-------------------+----+-------------------+--------+--------+----------+------+---------------------+-------+--------+--------------+-------------------+-------------------+
|ARR_AIRPORT_LAT|ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|ARR_DELAY|           ARR_TIME|CANCELLED|       CRS_ARR_TIME|       CRS_DEP_TIME|DEP_AIRPORT_LAT|DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET|DEP_DELAY|           DEP_TIME|DEST|DEST_AIRPORT_SEQ_ID|DISTANCE|DIVERTED|   FL_DATE|ORIGIN|ORIGIN_AIRPORT_SEQ_ID|TAXI_IN|TAXI_OUT|UNIQUE_CARRIER|         WHEELS_OFF|          WHEELS_ON|
+---------------+---------------+--------------------+---------+-------------------+---------+-------------------+-------------------+---------------+---------------+--------------------+---------+-------------------+----+-------------------+--------+---

### Converting to Spark Types
One thing you’ll see us do throughout this chapter is convert native types to Spark types. We do this
by using the first function that we introduce here, the lit function. This function converts a value in
another language to its corresponding Spark representation. Here’s how we can convert a couple of
different kinds of Python values to their respective Spark types:

In [4]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0)).show(2)

+---+----+---+
|  5|five|5.0|
+---+----+---+
|  5|five|5.0|
|  5|five|5.0|
+---+----+---+
only showing top 2 rows



### Working with Booleans
Booleans are essential when it comes to data analysis because they are the foundation for all filtering.
Boolean statements consist of four elements: and, or, true, and false. We use these simple structures
to build logical statements that evaluate to either true or false. These statements are often used as
conditional requirements: a row of data must pass the test (evaluate to true) or else it is
filtered out.
Let’s use our flights dataset to explore working with Booleans. We can specify equality as well as
less-than or greater-than:

In [5]:
from pyspark.sql.functions import col

In [31]:
df.where(col("ORIGIN") == 'PHL').count()

367

In [33]:
df.where(col("ORIGIN") == 'PHL').select("DEST", "ORIGIN", "UNIQUE_CARRIER").show(7, False)

+----+------+--------------+
|DEST|ORIGIN|UNIQUE_CARRIER|
+----+------+--------------+
|SLC |PHL   |US            |
|SLC |PHL   |US            |
|SLC |PHL   |DL            |
|SLC |PHL   |DL            |
|SLC |PHL   |AA            |
|SLC |PHL   |AA            |
|SLC |PHL   |AA            |
+----+------+--------------+
only showing top 7 rows



In [10]:
df.count()

28233

Although you can spell out and conditions explicitly in a single expression if you like, filters are
often easier to read when you specify them serially: chained where calls are combined as and.
In contrast, or conditions need to be specified in the same expression, using the | operator:
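
To make the and/or distinction concrete, here is a minimal plain-Python sketch (not Spark code). The sample rows and the helper names `is_from_phl`, `is_long_delay`, and `is_to_smf` are invented for illustration. Chaining two filters keeps only the rows that pass both, which is the same as a single `and`; an `or` has to live inside one predicate.

```python
# Hypothetical sample rows standing in for the flights DataFrame.
rows = [
    {"ORIGIN": "PHL", "DEST": "SLC", "DEP_DELAY": 120.0},
    {"ORIGIN": "PHL", "DEST": "SMF", "DEP_DELAY": 3.0},
    {"ORIGIN": "PHL", "DEST": "SLC", "DEP_DELAY": 5.0},
    {"ORIGIN": "OAK", "DEST": "SMF", "DEP_DELAY": 200.0},
]

def is_from_phl(row):
    return row["ORIGIN"] == "PHL"

def is_long_delay(row):
    return row["DEP_DELAY"] > 100

def is_to_smf(row):
    return "SMF" in row["DEST"]

# Serial filtering: each step narrows the previous result (logical AND).
step1 = [r for r in rows if is_from_phl(r)]
chained = [r for r in step1 if is_long_delay(r) or is_to_smf(r)]

# The same condition written as one expression.
combined = [r for r in rows
            if is_from_phl(r) and (is_long_delay(r) or is_to_smf(r))]

print(chained == combined)
```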

In [12]:
from pyspark.sql.functions import instr

In [44]:
depDelayFilter = col("DEP_DELAY") > 100

In [45]:
destFilter = instr(df.DEST, "SMF") >= 1

In [46]:
df.where(df.ORIGIN.isin("PHL")).where(destFilter | depDelayFilter).show()

+---------------+---------------+--------------------+---------+-------------------+---------+-------------------+-------------------+---------------+---------------+--------------------+---------+-------------------+----+-------------------+--------+--------+----------+------+---------------------+-------+--------+--------------+-------------------+-------------------+
|ARR_AIRPORT_LAT|ARR_AIRPORT_LON|ARR_AIRPORT_TZOFFSET|ARR_DELAY|           ARR_TIME|CANCELLED|       CRS_ARR_TIME|       CRS_DEP_TIME|DEP_AIRPORT_LAT|DEP_AIRPORT_LON|DEP_AIRPORT_TZOFFSET|DEP_DELAY|           DEP_TIME|DEST|DEST_AIRPORT_SEQ_ID|DISTANCE|DIVERTED|   FL_DATE|ORIGIN|ORIGIN_AIRPORT_SEQ_ID|TAXI_IN|TAXI_OUT|UNIQUE_CARRIER|         WHEELS_OFF|          WHEELS_ON|
+---------------+---------------+--------------------+---------+-------------------+---------+-------------------+-------------------+---------------+---------------+--------------------+---------+-------------------+----+-------------------+--------+---

Boolean expressions are not reserved for filters alone. You can also store one as a Boolean column
and then filter on that column:

In [47]:
originFilter = col("ORIGIN") == "PHL"

In [50]:
df.withColumn("FromPhil", originFilter & (depDelayFilter | destFilter)).where("FromPhil").select("ARR_DELAY").show()

+---------+
|ARR_DELAY|
+---------+
|    237.0|
|    315.0|
|    124.0|
|    105.0|
|    101.0|
|    152.0|
|     91.0|
|    118.0|
|    266.0|
|    136.0|
|    -19.0|
|    -28.0|
|    -11.0|
|    -12.0|
|     -8.0|
|    -12.0|
|    -13.0|
|     -9.0|
|    -10.0|
|     -7.0|
+---------+
only showing top 20 rows



If you’re coming from a SQL background, all of these statements should seem quite familiar. Indeed,
all of them can be expressed as a where clause. In fact, it’s often easier to just express filters as SQL
statements than using the programmatic DataFrame interface and Spark SQL allows us to do this
without paying any performance penalty.

## Working with Numbers

When working with big data, the second most common task you will do after filtering things is
counting things. For the most part, we simply need to express our computation, and that should be
valid assuming that we’re working with numerical data types.

In [52]:
from pyspark.sql.functions import expr, pow
fabricatedQuantity = pow(col("DEP_DELAY") * col("ARR_DELAY"), 2) + 5
df.select(expr("UNIQUE_CARRIER"), fabricatedQuantity.alias("realDelay")).show(2)

+--------------+---------+
|UNIQUE_CARRIER|realDelay|
+--------------+---------+
|            OO|    905.0|
|            DL|      9.0|
+--------------+---------+
only showing top 2 rows



Notice that we were able to multiply our columns together because they were both numerical.
Naturally we can add and subtract as necessary, as well. In fact, we can do all of this as a SQL
expression, as well:

In [53]:
df.selectExpr(
  "UNIQUE_CARRIER",
  "(POWER((DEP_DELAY * ARR_DELAY), 2.0) + 5) as realDelay").show(2)

+--------------+---------+
|UNIQUE_CARRIER|realDelay|
+--------------+---------+
|            OO|    905.0|
|            DL|      9.0|
+--------------+---------+
only showing top 2 rows



Another common numerical task is rounding. If you’d like to just round to a whole number, oftentimes
you can cast the value to an integer and that will work just fine. However, Spark also has more
detailed functions for performing this explicitly and to a certain level of precision. In the following
example, we round 2.5 to a whole number in two ways: round rounds half up (giving 3.0), whereas
bround rounds half to even (giving 2.0):

In [54]:
from pyspark.sql.functions import lit, round, bround

df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
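
The two rounding modes can be reproduced outside Spark. The following is a small sketch using Python's standard `decimal` module; the helper names `round_half_up` and `round_half_even` are made up here, and the code only mirrors the half-up versus half-even behavior, not Spark's implementation.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(value: str) -> Decimal:
    # Ties go away from zero for positive numbers, like round() in Spark.
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

def round_half_even(value: str) -> Decimal:
    # Ties go to the nearest even digit ("banker's rounding"), like bround().
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

for v in ["0.5", "1.5", "2.5", "3.5"]:
    print(v, round_half_up(v), round_half_even(v))
```

The two functions only disagree on exact ties such as 0.5 and 2.5; for every other input they return the same result.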



Another numerical task is to compute the correlation of two columns. For example, we can compute the
Pearson correlation coefficient of DEP_DELAY and ARR_DELAY to see whether flights that depart late
also tend to arrive late. We can do this through a function as well as through the DataFrame statistic
methods:

In [55]:
from pyspark.sql.functions import corr

In [58]:
df.stat.corr("DEP_DELAY", "ARR_DELAY")

0.9476274949356805

In [59]:
df.select(corr("DEP_DELAY", "ARR_DELAY")).take(1)[0][0]

0.9523863105799023
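
For reference, the Pearson coefficient itself is simple to compute by hand. The sketch below is plain Python on made-up numbers; the function name `pearson` is hypothetical, and skipping pairs with a missing value is an assumption of this sketch rather than a statement about how Spark treats nulls.

```python
import math

def pearson(xs, ys):
    # Keep only pairs where both values are present (assumption of this sketch).
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Covariance and variances share the same 1/n factor, so it cancels out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Invented delays in minutes; the fourth pair is skipped.
dep_delay = [0.0, 10.0, 20.0, None, 40.0]
arr_delay = [1.0, 12.0, 19.0, 5.0, 43.0]
print(pearson(dep_delay, arr_delay))
```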

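Another common task is to compute summary statistics for a column or set of columns. We can use the
describe method to achieve exactly this. This will take all numeric columns and calculate the count,
mean, standard deviation, min, and max. You should use this primarily for viewing in the console
because the schema might change in the future. In the Spark output further below, the rows appear in
that order (count, mean, standard deviation, min, max). Note that DISTANCE is stored as a string in
this dataset, so its min and max are compared as text rather than as numbers: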

In [60]:
df.toPandas().describe()

| | ARR_AIRPORT_LAT | ARR_AIRPORT_LON | ARR_AIRPORT_TZOFFSET | ARR_DELAY | DEP_AIRPORT_LAT | DEP_AIRPORT_LON | DEP_AIRPORT_TZOFFSET | DEP_DELAY | TAXI_IN | TAXI_OUT |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 28233.0 | 28233.0 | 28233.0 | 28012.0 | 28233.0 | 28233.0 | 28233.0 | 28068.0 | 28051.0 | 28057.0 |
| mean | 39.753249 | -116.732381 | -24408.798215 | 3.67596 | 37.102959 | -112.732 | -24565.380937 | 8.379364 | 5.382018 | 15.27298 |
| std | 1.046346 | 4.806332 | 3128.808671 | 31.943579 | 5.131649 | 13.1466 | 4152.777185 | 30.83905 | 2.649147 | 7.721619 |
| min | 38.695556 | -121.590833 | -28800.0 | -56.0 | 20.898611 | -157.920278 | -36000.0 | -24.0 | 1.0 | 1.0 |
| 25% | 38.695556 | -121.590833 | -25200.0 | -11.0 | 33.434167 | -121.929167 | -25200.0 | -5.0 | 4.0 | 10.0 |
| 50% | 40.788333 | -111.977778 | -25200.0 | -4.0 | 34.200556 | -117.601111 | -25200.0 | -1.0 | 5.0 | 14.0 |
| 75% | 40.788333 | -111.977778 | -21600.0 | 6.0 | 39.861667 | -112.011667 | -25200.0 | 8.0 | 6.0 | 18.0 |
| max | 40.788333 | -111.977778 | 0.0 | 674.0 | 47.45 | -71.006389 | 0.0 | 694.0 | 55.0 | 151.0 |


In [62]:
df.describe().select("DEP_DELAY", "DISTANCE", "ARR_DELAY").show(10, False)

+-----------------+------------------+------------------+
|DEP_DELAY        |DISTANCE          |ARR_DELAY         |
+-----------------+------------------+------------------+
|28068            |28233             |28012             |
|8.379364400741057|749.5296992880671 |3.6759603027274026|
|30.8390501393679 |481.00236948166906|31.943579377647556|
|-24.0            |1087.00           |-56.0             |
|694.0            |909.00            |674.0             |
+-----------------+------------------+------------------+



If you need these exact numbers, you can also perform this as an aggregation yourself by importing the
functions and applying them to the columns that you need:

In [6]:
from pyspark.sql.functions import count, mean, stddev_pop, min, max
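
As a rough illustration of what these aggregations compute, here is a plain-Python sketch over a small made-up list (not the flights data). The function name `summarize` is hypothetical. Missing values are skipped, which matches the counts above, where DEP_DELAY has fewer non-null values than there are rows.

```python
import math

def summarize(values):
    # Ignore missing entries, as column aggregations do with nulls.
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    # Population variance divides by n (a sample variance would use n - 1).
    variance_pop = sum((v - mean) ** 2 for v in present) / n
    return {
        "count": n,
        "mean": mean,
        "stddev_pop": math.sqrt(variance_pop),
        "min": min(present),
        "max": max(present),
    }

stats = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, None])
print(stats)
```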

There are a number of statistical functions available in the StatFunctions Package (accessible using
stat as we see in the code block below). These are DataFrame methods that you can use to calculate
a variety of different things. For instance, you can calculate either exact or approximate quantiles of
your data using the approxQuantile method:

In [64]:
colName = "ARR_DELAY"
quantileProbs = [0.5]
relError = 0.05
df.stat.approxQuantile(colName, quantileProbs, relError)

[-4.0]

In [65]:
df.select(mean("DEP_DELAY")).show()

+-----------------+
|   avg(DEP_DELAY)|
+-----------------+
|8.379364400741057|
+-----------------+



You can also use this to see a cross-tabulation or frequent item pairs (be careful, this output will be
large):

In [66]:
df.stat.crosstab("DEST", "ORIGIN").show()

+-----------+---+---+----+---+---+----+---+---+---+---+---+----+----+---+---+---+---+---+----+----+----+---+----+----+---+----+----+---+---+---+---+
|DEST_ORIGIN|ATL|BOS| BUR|CLT|DAL| DEN|DFW|HNL|IAD|IAH|JFK| LAS| LAX|LGB|MDW|MSP|OAK|OGG| ONT| ORD| PDX|PHL| PHX| SAN|SAT| SEA| SFO|SJC|SMF|SNA|STL|
+-----------+---+---+----+---+---+----+---+---+---+---+---+----+----+---+---+---+---+---+----+----+----+---+----+----+---+----+----+---+---+---+---+
|        SMF|286| 22|1136|113|141|1307|679|173|143|352|156|1382|3039|355|236|286|  0|188|1127| 279| 900| 37|1627|   0|  0|   0|   0|  0|  0|  0|  0|
|        SLC|  0|  0|   0|  0|  0|   0|  0|  0|  0|  0|  0|   0|   0|  0|  0|  0|218|  0|   0|1457|1268|330|2958|1201|291|1953|1800|916|791|822|264|
+-----------+---+---+----+---+---+----+---+---+---+---+---+----+----+---+---+---+---+---+----+----+----+---+----+----+---+----+----+---+---+---+---+
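
A cross-tabulation is just a count of how often each pair of values occurs together. The following plain-Python sketch builds the same kind of table from a handful of invented (DEST, ORIGIN) pairs; the variable names are hypothetical and the data is not taken from the flights file.

```python
from collections import Counter

# Invented (DEST, ORIGIN) pairs.
pairs = [
    ("SMF", "LAX"), ("SMF", "LAX"), ("SLC", "PHL"),
    ("SMF", "PHL"), ("SLC", "SEA"),
]

counts = Counter(pairs)
dests = sorted({dest for dest, _ in pairs})
origins = sorted({origin for _, origin in pairs})

# One row per DEST, one column per ORIGIN; missing pairs count as 0.
table = {d: {o: counts[(d, o)] for o in origins} for d in dests}

for d in dests:
    print(d, table[d])
```

The table has one cell per combination of distinct values, which is why the output grows quickly when either column has many distinct values.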



In [9]:
df.stat.freqItems(["DEST", "ORIGIN"]).show(2, False)

+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|DEST_freqItems|ORIGIN_freqItems                                                                                                                                           |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[SMF, SLC]    |[BUR, LGB, PHL, JFK, OAK, SMF, DFW, CLT, SJC, ORD, IAH, OGG, SNA, DEN, ONT, PDX, HNL, STL, IAD, MSP, LAS, DAL, ATL, SFO, SEA, SAT, MDW, PHX, SAN, BOS, LAX]|
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+



## Working with Strings
String manipulation shows up in nearly every data flow, and it’s worth explaining what you can do
with strings. You might be manipulating log files, performing regular expression extraction or
substitution, checking for simple string existence, or making all strings uppercase or lowercase.

Let’s begin with the last task because it’s the most straightforward. The initcap function will
capitalize every word in a given string when that word is separated from another by a space.

In [5]:
from pyspark.sql.functions import initcap, col
df.select(initcap(col("DEST"))).show()

+-------------+
|initcap(DEST)|
+-------------+
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
|          Slc|
+-------------+
only showing top 20 rows



As just mentioned, you can convert strings to uppercase and lowercase, as well:

In [10]:
from pyspark.sql.functions import lower, upper
df.select(col("ORIGIN"),
    lower(col("ORIGIN")),
    upper(lower(col("ORIGIN")))).show(2)

+------+-------------+--------------------+
|ORIGIN|lower(ORIGIN)|upper(lower(ORIGIN))|
+------+-------------+--------------------+
|   OAK|          oak|                 OAK|
|   OAK|          oak|                 OAK|
+------+-------------+--------------------+
only showing top 2 rows



Another trivial task is adding or removing spaces around a string. You can do this by using lpad,
ltrim, rpad, rtrim, and trim:

In [11]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
    ltrim(lit("    HELLO    ")).alias("ltrim"),
    rtrim(lit("    HELLO    ")).alias("rtrim"),
    trim(lit("    HELLO    ")).alias("trim"),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")).show(2)

+---------+---------+-----+---+----------+
|    ltrim|    rtrim| trim| lp|        rp|
+---------+---------+-----+---+----------+
|HELLO    |    HELLO|HELLO|HEL|HELLO     |
|HELLO    |    HELLO|HELLO|HEL|HELLO     |
+---------+---------+-----+---+----------+
only showing top 2 rows



Note that if lpad or rpad takes a number less than the length of the string, it will always remove
values from the right side of the string.
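
The truncation behavior is easy to see in a plain-Python imitation. The functions below are a sketch written for this note, not Spark's implementation; they reuse the names `lpad` and `rpad` only to mirror the calls above.

```python
def lpad(text: str, length: int, pad: str) -> str:
    # A target shorter than the string cuts characters off the right side.
    if len(text) >= length:
        return text[:length]
    needed = length - len(text)
    return (pad * needed)[:needed] + text

def rpad(text: str, length: int, pad: str) -> str:
    # Same truncation rule; padding goes on the right instead.
    if len(text) >= length:
        return text[:length]
    needed = length - len(text)
    return text + (pad * needed)[:needed]

print(repr(lpad("HELLO", 3, " ")))
print(repr(rpad("HELLO", 10, " ")))
```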

## Regular Expressions
Probably one of the most frequently performed tasks is searching for the existence of one string in
another or replacing all mentions of a string with another value. 

Spark takes advantage of the complete power of Java regular expressions. There are two key functions in Spark that you’ll need in order to
perform regular expression tasks: regexp_extract and regexp_replace. These functions extract
values and replace values, respectively.

Let’s explore how to use the regexp_replace function to replace several airport codes in our
ORIGIN column with a single value:

In [14]:
from pyspark.sql.functions import regexp_replace
regex_string = "PHL|LAX|OAK"
df.select(
  regexp_replace(col("ORIGIN"), regex_string, "PHIL").alias("Origin"),
  col("DEST")).show()

+------+----+
|Origin|DEST|
+------+----+
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
|  PHIL| SLC|
+------+----+
only showing top 20 rows
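
The same alternation pattern can be tried outside Spark. This sketch uses Python's `re` module on a few invented airport codes; Python and Java regular expressions agree on simple alternation like this, although the two dialects differ in some advanced features.

```python
import re

regex_string = "PHL|LAX|OAK"

# Invented sample values; "SEA" does not match and is left unchanged.
origins = ["PHL", "LAX", "SEA", "OAK"]
replaced = [re.sub(regex_string, "PHIL", code) for code in origins]

print(replaced)
```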



Another task might be to replace given characters with other characters. Building this as a regular
expression could be tedious, so Spark also provides the translate function to replace these values.
This is done at the character level and will replace all instances of a character with the indexed
character in the replacement string:
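
Character-level translation can be sketched with Python's built-in `str.translate`, which works on the same idea: the character at position i of the first string is replaced by the character at position i of the second. The sample strings here are invented, and this is plain Python rather than the Spark function.

```python
# Map L -> 1, E -> 3, T -> 7, position by position.
table = str.maketrans("LET", "137")

samples = ["LEET", "PHL", "SLC", "SEATTLE"]
translated = [s.translate(table) for s in samples]

print(translated)
```

Characters that do not appear in the first string, such as P or H, pass through unchanged.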

Sometimes, rather than extracting values, we simply want to check for their existence. We can do this
with the contains method on each column. This will return a Boolean declaring whether the value
you specify is in the column’s string. In Python and SQL, we can use the instr function:

In [16]:
from pyspark.sql.functions import instr
containsPH = instr(col("ORIGIN"), "PH") >= 1
containsLA = instr(col("ORIGIN"), "LA") >= 1
df.withColumn("fromPHLA", containsPH | containsLA)\
  .where("fromPHLA")\
  .select("ORIGIN").show(3, False)

+------+
|ORIGIN|
+------+
|PHL   |
|PHL   |
|PHL   |
+------+
only showing top 3 rows
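
The `>= 1` comparison works because instr returns a 1-based position and 0 when the substring is absent. A plain-Python sketch of that convention follows (the function is written for this note and is not the Spark implementation):

```python
def instr(text: str, substring: str) -> int:
    # str.find is 0-based and returns -1 when absent, so adding 1 yields
    # a 1-based position, or 0 for "not found".
    return text.find(substring) + 1

for code in ["PHL", "OAK", "ATLANTA"]:
    contains_ph = instr(code, "PH") >= 1
    contains_la = instr(code, "LA") >= 1
    print(code, contains_ph or contains_la)
```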



## Working with Dates and Timestamps

Spark will make a best effort to correctly identify
column types, including dates and timestamps, when we enable inferSchema.

Although Spark reads dates and times on a best-effort basis, sometimes there will be
no getting around working with strangely formatted dates and times. The key to understanding the
transformations that you are going to need to apply is to ensure that you know exactly what type and
format you have at each given step of the way. Another common “gotcha” is precision: Spark’s
TimestampType stores values with microsecond precision, which means that if you’re going to be
working with finer-grained values such as nanoseconds, you’ll need to work around this by potentially
operating on them as longs. Any more precision when coercing to a TimestampType will be
removed.

Spark can be a bit particular about what format you have at any given point in time. It’s important to
be explicit when parsing or converting to ensure that there are no issues in doing so. At the end of the
day, Spark is working with Java dates and timestamps and therefore conforms to those standards.

Let’s begin with the basics and get the current date and the current timestamps:

In [17]:
from pyspark.sql.functions import current_date, current_timestamp
dateDF = spark.range(10)\
  .withColumn("today", current_date())\
  .withColumn("now", current_timestamp())
dateDF.createOrReplaceTempView("dateTable")

In [18]:
dateDF.show(4)

+---+----------+--------------------+
| id|     today|                 now|
+---+----------+--------------------+
|  0|2023-05-14|2023-05-14 19:43:...|
|  1|2023-05-14|2023-05-14 19:43:...|
|  2|2023-05-14|2023-05-14 19:43:...|
|  3|2023-05-14|2023-05-14 19:43:...|
+---+----------+--------------------+
only showing top 4 rows



Now that we have a simple DataFrame to work with, let’s add and subtract five days from today.
These functions take a column and then the number of days to either add or subtract as the arguments:

In [19]:
from pyspark.sql.functions import date_add, date_sub
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2023-05-09|        2023-05-19|
+------------------+------------------+
only showing top 1 row
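
The same arithmetic can be checked with Python's standard `datetime` module. This is a plain-Python sketch with the date fixed to the one shown in the output above, so the result does not depend on when it is run.

```python
from datetime import date, timedelta

today = date(2023, 5, 14)  # fixed to match the output shown above

five_days_ago = today - timedelta(days=5)
five_days_ahead = today + timedelta(days=5)

print(five_days_ago, five_days_ahead)
```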



Another common task is to take a look at the difference between two dates. We can do this with the
datediff function that will return the number of days in between two dates. Most often we just care
about the days, and because the number of days varies from month to month, there also exists a
function, months_between, that gives you the number of months between two dates:

In [21]:
from pyspark.sql.functions import datediff, months_between, to_date
dateDF.withColumn("week_ago", date_sub(col("today"), 7))\
  .select(datediff(col("week_ago"), col("today"))).show(1)

+-------------------------+
|datediff(week_ago, today)|
+-------------------------+
|                       -7|
+-------------------------+
only showing top 1 row



In [22]:
dateDF.select(
    to_date(lit("2023-01-01")).alias("start"),
    to_date(lit("2023-05-10")).alias("end"))\
  .select(months_between(col("start"), col("end"))).show(1)

+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                     -4.29032258|
+--------------------------------+
only showing top 1 row
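
The fractional result comes from the documented rule for months_between: when the two dates fall on the same day of the month, or are both the last day of their months, the result is a whole number; otherwise the day difference is divided by 31, and the result is rounded to 8 digits by default. The sketch below re-implements that rule in plain Python for dates only (no time of day); the helper `is_last_day` is a name made up here.

```python
import calendar
from datetime import date

def is_last_day(d: date) -> bool:
    return d.day == calendar.monthrange(d.year, d.month)[1]

def months_between(date1: date, date2: date) -> float:
    # Whole months between the two dates (date1 minus date2).
    months = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    if date1.day == date2.day or (is_last_day(date1) and is_last_day(date2)):
        return float(months)
    # Otherwise every month is treated as 31 days long.
    return round(months + (date1.day - date2.day) / 31.0, 8)

def datediff(end: date, start: date) -> int:
    return (end - start).days

print(months_between(date(2023, 1, 1), date(2023, 5, 10)))
print(datediff(date(2023, 5, 7), date(2023, 5, 14)))
```

Both results are negative for the same reason as in the cells above: the earlier date is passed as the first argument.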



Notice that we introduced a new function: the to_date function. The to_date function allows you to
convert a string to a date, optionally with a specified format. We specify our format in the Java
SimpleDateFormat which will be important to reference if you use this function:

In [23]:
from pyspark.sql.functions import to_date, lit
spark.range(5).withColumn("date", lit("2023-01-01"))\
  .select(to_date(col("date"))).show(1)

+---------------+
|to_date(`date`)|
+---------------+
|     2023-01-01|
+---------------+
only showing top 1 row



# Working with Nulls

## drop
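The simplest function is drop, which removes rows that contain nulls. The default is to drop any row
in which any value is null. Passing "all" instead drops a row only if all of its values are null, and
the subset argument restricts the check to the listed columns, as in the call to df.na.drop below: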

In [2]:
df.toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28233 entries, 0 to 28232
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ARR_AIRPORT_LAT        28233 non-null  float64
 1   ARR_AIRPORT_LON        28233 non-null  float64
 2   ARR_AIRPORT_TZOFFSET   28233 non-null  float64
 3   ARR_DELAY              28012 non-null  float64
 4   ARR_TIME               28233 non-null  object 
 5   CANCELLED              28233 non-null  bool   
 6   CRS_ARR_TIME           28233 non-null  object 
 7   CRS_DEP_TIME           28233 non-null  object 
 8   DEP_AIRPORT_LAT        28233 non-null  float64
 9   DEP_AIRPORT_LON        28233 non-null  float64
 10  DEP_AIRPORT_TZOFFSET   28233 non-null  float64
 11  DEP_DELAY              28068 non-null  float64
 12  DEP_TIME               28233 non-null  object 
 13  DEST                   28233 non-null  object 
 14  DEST_AIRPORT_SEQ_ID    28233 non-null  object 
 15  DI

In [3]:
df.na.drop("all", subset=["DEP_DELAY", "ARR_DELAY"])

DataFrame[ARR_AIRPORT_LAT: double, ARR_AIRPORT_LON: double, ARR_AIRPORT_TZOFFSET: double, ARR_DELAY: double, ARR_TIME: string, CANCELLED: boolean, CRS_ARR_TIME: string, CRS_DEP_TIME: string, DEP_AIRPORT_LAT: double, DEP_AIRPORT_LON: double, DEP_AIRPORT_TZOFFSET: double, DEP_DELAY: double, DEP_TIME: string, DEST: string, DEST_AIRPORT_SEQ_ID: string, DISTANCE: string, DIVERTED: boolean, FL_DATE: string, ORIGIN: string, ORIGIN_AIRPORT_SEQ_ID: string, TAXI_IN: double, TAXI_OUT: double, UNIQUE_CARRIER: string, WHEELS_OFF: string, WHEELS_ON: string]
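
To see how the "any" and "all" modes and the subset argument interact, here is a plain-Python sketch over a few invented rows, with `None` standing in for null. The function `drop_nulls` is hypothetical and only imitates the semantics described above.

```python
def drop_nulls(rows, how="any", subset=None):
    kept = []
    for row in rows:
        # Check only the listed columns, or every column if none are given.
        cols = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in cols]
        should_drop = any(nulls) if how == "any" else all(nulls)
        if not should_drop:
            kept.append(row)
    return kept

rows = [
    {"ORIGIN": "PHL", "DEP_DELAY": 5.0, "ARR_DELAY": None},
    {"ORIGIN": "PHL", "DEP_DELAY": None, "ARR_DELAY": None},
    {"ORIGIN": "OAK", "DEP_DELAY": 1.0, "ARR_DELAY": 2.0},
]

delay_cols = ["DEP_DELAY", "ARR_DELAY"]
print(len(drop_nulls(rows, "all", delay_cols)))  # drops only the second row
print(len(drop_nulls(rows, "any", delay_cols)))  # drops the first two rows
```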

## fill
Using the fill function, you can fill one or more columns with a set of values. This can be done by
specifying a particular value and a set of columns. For example, to fill all null values in columns of type String, you might specify the following:
`df.na.fill("All Null values become this string")`
We could do the same for numeric columns by using `df.na.fill(5)` or
`df.na.fill(5.0)`. To specify columns, we just pass in a list of column names like we did
in the previous example:

In [4]:
df.na.fill(0, subset=["DEP_DELAY", "ARR_DELAY"])

DataFrame[ARR_AIRPORT_LAT: double, ARR_AIRPORT_LON: double, ARR_AIRPORT_TZOFFSET: double, ARR_DELAY: double, ARR_TIME: string, CANCELLED: boolean, CRS_ARR_TIME: string, CRS_DEP_TIME: string, DEP_AIRPORT_LAT: double, DEP_AIRPORT_LON: double, DEP_AIRPORT_TZOFFSET: double, DEP_DELAY: double, DEP_TIME: string, DEST: string, DEST_AIRPORT_SEQ_ID: string, DISTANCE: string, DIVERTED: boolean, FL_DATE: string, ORIGIN: string, ORIGIN_AIRPORT_SEQ_ID: string, TAXI_IN: double, TAXI_OUT: double, UNIQUE_CARRIER: string, WHEELS_OFF: string, WHEELS_ON: string]

We can also do this with a dictionary (a Map in Scala), where the key is the column name and the value is the
value we would like to use to fill null values:

In [5]:
# Both columns are doubles, so the fill values are numeric;
# a string fill value would not apply to a double column.
fill_cols_vals = {"DEP_DELAY": 5, "ARR_DELAY": 0}
df.na.fill(fill_cols_vals)

DataFrame[ARR_AIRPORT_LAT: double, ARR_AIRPORT_LON: double, ARR_AIRPORT_TZOFFSET: double, ARR_DELAY: double, ARR_TIME: string, CANCELLED: boolean, CRS_ARR_TIME: string, CRS_DEP_TIME: string, DEP_AIRPORT_LAT: double, DEP_AIRPORT_LON: double, DEP_AIRPORT_TZOFFSET: double, DEP_DELAY: double, DEP_TIME: string, DEST: string, DEST_AIRPORT_SEQ_ID: string, DISTANCE: string, DIVERTED: boolean, FL_DATE: string, ORIGIN: string, ORIGIN_AIRPORT_SEQ_ID: string, TAXI_IN: double, TAXI_OUT: double, UNIQUE_CARRIER: string, WHEELS_OFF: string, WHEELS_ON: string]
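
The per-column behavior can be sketched in plain Python. The rows and the function `fill_nulls` below are invented for illustration, with `None` standing in for null; only columns named in the dictionary are touched.

```python
def fill_nulls(rows, fill_values):
    # Replace None only in columns that have an entry in fill_values.
    return [
        {col: (fill_values[col] if val is None and col in fill_values else val)
         for col, val in row.items()}
        for row in rows
    ]

rows = [
    {"DEP_DELAY": None, "ARR_DELAY": 12.0, "TAXI_IN": None},
    {"DEP_DELAY": 3.0, "ARR_DELAY": None, "TAXI_IN": 4.0},
]

filled = fill_nulls(rows, {"DEP_DELAY": 5.0, "ARR_DELAY": 0.0})
print(filled)
```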

## replace
In addition to replacing null values like we did with drop and fill, there are more flexible options
that you can use with more than just null values. Probably the most common use case is to replace all
values in a certain column according to their current value. The only requirement is that this value be
the same type as the original value:

In [6]:
df.na.replace(["PHL"], ["UNKNOWN"], "DEST")

DataFrame[ARR_AIRPORT_LAT: double, ARR_AIRPORT_LON: double, ARR_AIRPORT_TZOFFSET: double, ARR_DELAY: double, ARR_TIME: string, CANCELLED: boolean, CRS_ARR_TIME: string, CRS_DEP_TIME: string, DEP_AIRPORT_LAT: double, DEP_AIRPORT_LON: double, DEP_AIRPORT_TZOFFSET: double, DEP_DELAY: double, DEP_TIME: string, DEST: string, DEST_AIRPORT_SEQ_ID: string, DISTANCE: string, DIVERTED: boolean, FL_DATE: string, ORIGIN: string, ORIGIN_AIRPORT_SEQ_ID: string, TAXI_IN: double, TAXI_OUT: double, UNIQUE_CARRIER: string, WHEELS_OFF: string, WHEELS_ON: string]
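
As a final illustration, value replacement amounts to a lookup on the current value of one column. The sketch below is plain Python on invented rows; `replace_values` is a hypothetical helper, not part of the Spark API.

```python
def replace_values(rows, mapping, column):
    # Swap the column's value when it appears in the mapping; keep it otherwise.
    return [{**row, column: mapping.get(row[column], row[column])}
            for row in rows]

rows = [
    {"ORIGIN": "OAK", "DEST": "PHL"},
    {"ORIGIN": "PHL", "DEST": "SLC"},
]

replaced_rows = replace_values(rows, {"PHL": "UNKNOWN"}, "DEST")
print(replaced_rows)
```

Only the DEST column is affected, so the PHL value in ORIGIN on the second row stays as it is.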