# Joining Data Sets

Let us understand how to join multiple data sets using Spark DataFrame APIs.

## Prepare Datasets for Joins
Let us prepare the data sets used in the joins so that we can get the airport details for flight origins and destinations.

* Make sure airport-codes is in HDFS.

In [1]:
%%sh
hdfs dfs -ls /public/airlines_all/airport-codes

Found 1 items
-rw-r--r--   2 hdfs supergroup      11411 2021-01-28 10:48 /public/airlines_all/airport-codes/airport-codes-na.txt


## Starting Spark Context

Let us start the Spark session for this notebook so that we can execute the code provided.

In [2]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

In [3]:
spark.conf.set('spark.sql.shuffle.partitions', '2')

* Analyze the data set to confirm whether it has a header and how the data is structured.

In [4]:
spark.read. \
    text("/public/airlines_all/airport-codes"). \
    show(truncate=False)

+-------------------------+
|value                    |
+-------------------------+
|City	State	Country	IATA  |
|Abbotsford	BC	Canada	YXX |
|Aberdeen	SD	USA	ABR      |
|Abilene	TX	USA	ABI       |
|Akron	OH	USA	CAK         |
|Alamosa	CO	USA	ALS       |
|Albany	GA	USA	ABY        |
|Albany	NY	USA	ALB        |
|Albuquerque	NM	USA	ABQ   |
|Alexandria	LA	USA	AEX    |
|Allentown	PA	USA	ABE     |
|Alliance	NE	USA	AIA      |
|Alpena	MI	USA	APN        |
|Altoona	PA	USA	AOO       |
|Amarillo	TX	USA	AMA      |
|Anahim Lake	BC	Canada	YAA|
|Anchorage	AK	USA	ANC     |
|Appleton	WI	USA	ATW      |
|Arviat	NWT	Canada	YEK    |
|Asheville	NC	USA	AVL     |
+-------------------------+
only showing top 20 rows



 * Data is tab separated.
 * There is a header for the data set.
 * The data set has 4 fields - **City, State, Country, IATA**
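
The observations above can be checked without Spark. The following is a minimal sketch, assuming a small in-memory sample that copies the header and the first two records from the output above; `sample` is a hypothetical stand-in for the real file.

```python
import csv
import io

# Hypothetical sample mimicking the first lines of the airport-codes file.
sample = (
    "City\tState\tCountry\tIATA\n"
    "Abbotsford\tBC\tCanada\tYXX\n"
    "Aberdeen\tSD\tUSA\tABR\n"
)

# DictReader treats the first line as the header and splits fields on tabs.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)  # ['City', 'State', 'Country', 'IATA']
print(rows[0]["IATA"])    # YXX
```

This mirrors what `sep="\t"` and `header=True` ask Spark to do when reading the file.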
    
    
Create the DataFrame airport_codes, using the header for the column names and inferring the schema.


In [5]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [6]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       )

* Preview and Understand the data.

In [7]:
airport_codes.show()

+-----------+-----+-------+----+
|       City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford|   BC| Canada| YXX|
|   Aberdeen|   SD|    USA| ABR|
|    Abilene|   TX|    USA| ABI|
|      Akron|   OH|    USA| CAK|
|    Alamosa|   CO|    USA| ALS|
|     Albany|   GA|    USA| ABY|
|     Albany|   NY|    USA| ALB|
|Albuquerque|   NM|    USA| ABQ|
| Alexandria|   LA|    USA| AEX|
|  Allentown|   PA|    USA| ABE|
|   Alliance|   NE|    USA| AIA|
|     Alpena|   MI|    USA| APN|
|    Altoona|   PA|    USA| AOO|
|   Amarillo|   TX|    USA| AMA|
|Anahim Lake|   BC| Canada| YAA|
|  Anchorage|   AK|    USA| ANC|
|   Appleton|   WI|    USA| ATW|
|     Arviat|  NWT| Canada| YEK|
|  Asheville|   NC|    USA| AVL|
|      Aspen|   CO|    USA| ASE|
+-----------+-----+-------+----+
only showing top 20 rows



* Get schema of **airport_codes**.

In [8]:
airport_codes.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- IATA: string (nullable = true)



* Preview the data
 * Get the count of records

In [9]:
airport_codes.count()

526

   * Get the count of unique IATA codes and see if it is the same as the total count.

In [10]:
airport_codes. \
    select("IATA"). \
    distinct(). \
    count()

524

 * If they are not equal, analyze the data and identify IATA codes which are repeated more than once.

In [11]:
from pyspark.sql.functions import lit, count

In [12]:
duplicate_iata_count = airport_codes. \
    groupBy("IATA"). \
    agg(count(lit(1)).alias("iata_count")). \
    filter("iata_count > 1")

In [13]:
duplicate_iata_count.show()

+----+----------+
|IATA|iata_count|
+----+----------+
| Big|         3|
+----+----------+
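
The group, count, and filter pattern used above can be sketched in plain Python. This is only an illustration on toy rows: the three `Big` rows are the ones this notebook finds in the data, and the two Albany rows come from the preview above.

```python
from collections import Counter

# Toy (City, State, Country, IATA) rows standing in for airport_codes.
airports = [
    ("Hilo", "HI", "USA", "Big"),
    ("Kailua-Kona", "Hawaii", "USA", "Big"),
    ("Kamuela", "Hawaii", "USA", "Big"),
    ("Albany", "GA", "USA", "ABY"),
    ("Albany", "NY", "USA", "ALB"),
]

# Equivalent of groupBy("IATA") followed by a count per group.
iata_counts = Counter(iata for _, _, _, iata in airports)

# Equivalent of filter("iata_count > 1").
duplicates = {iata: n for iata, n in iata_counts.items() if n > 1}

print(duplicates)  # {'Big': 3}
```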



 * Keep the most appropriate record among the duplicates and discard the others. Here the row with State = 'HI' is kept and the two rows with State = 'Hawaii' are dropped.

In [14]:
airport_codes. \
    filter("IATA = 'Big'"). \
    show()

+-----------+------+-------+----+
|       City| State|Country|IATA|
+-----------+------+-------+----+
|       Hilo|    HI|    USA| Big|
|Kailua-Kona|Hawaii|    USA| Big|
|    Kamuela|Hawaii|    USA| Big|
+-----------+------+-------+----+



In [15]:
airport_codes. \
    filter("!(State = 'Hawaii' AND IATA = 'Big')"). \
    show()

+-----------+-----+-------+----+
|       City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford|   BC| Canada| YXX|
|   Aberdeen|   SD|    USA| ABR|
|    Abilene|   TX|    USA| ABI|
|      Akron|   OH|    USA| CAK|
|    Alamosa|   CO|    USA| ALS|
|     Albany|   GA|    USA| ABY|
|     Albany|   NY|    USA| ALB|
|Albuquerque|   NM|    USA| ABQ|
| Alexandria|   LA|    USA| AEX|
|  Allentown|   PA|    USA| ABE|
|   Alliance|   NE|    USA| AIA|
|     Alpena|   MI|    USA| APN|
|    Altoona|   PA|    USA| AOO|
|   Amarillo|   TX|    USA| AMA|
|Anahim Lake|   BC| Canada| YAA|
|  Anchorage|   AK|    USA| ANC|
|   Appleton|   WI|    USA| ATW|
|     Arviat|  NWT| Canada| YEK|
|  Asheville|   NC|    USA| AVL|
|      Aspen|   CO|    USA| ASE|
+-----------+-----+-------+----+
only showing top 20 rows



 * Get number of airports (IATA Codes) for each state in the US. Sort the data in descending order by count.

In [16]:
from pyspark.sql.functions import col, lit, count

In [17]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [18]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country='USA'")

In [19]:
airport_count_per_state = airport_codes. \
    groupBy("Country", "State"). \
    agg(count(lit(1)).alias("IATACount")). \
    orderBy(col("IATACount").desc())

In [20]:
airport_count_per_state.show()

+-------+-----+---------+
|Country|State|IATACount|
+-------+-----+---------+
|    USA|   CA|       29|
|    USA|   TX|       26|
|    USA|   AK|       25|
|    USA|   NY|       18|
|    USA|   FL|       18|
|    USA|   MI|       18|
|    USA|   MT|       14|
|    USA|   PA|       13|
|    USA|   CO|       12|
|    USA|   IL|       12|
|    USA|   WY|       10|
|    USA|   NC|       10|
|    USA|   WI|        9|
|    USA|   NM|        9|
|    USA|   HI|        9|
|    USA|   KS|        9|
|    USA|   GA|        9|
|    USA|   WA|        9|
|    USA|   NE|        9|
|    USA|   MO|        8|
+-------+-----+---------+
only showing top 20 rows



## Joining Data Frames

Let us understand how to join Data Frames by using some problem statements. Use the January 2008 data.
* Get the number of flights that departed from each US airport.
* Get the number of flights that departed from each US state.
* Get the list of US airports from which no flights departed.
* Check whether there are any origins in the airlines data that do not have a record in airport-codes.
* Get the total number of flights from airports that do not have entries in airport-codes.
* Get the number of flights per airport for airports that do not have entries in airport-codes.

## Overview of Joins

Let us get an overview of joining Data Frames.
* Our data is rarely stored in one table. It is usually spread across multiple tables that are related to each other.
  * In transactional systems, we typically define tables based on normalization principles.
  * In data warehousing applications, we typically define tables using dimensional modeling.
  * With either approach, data is spread across multiple tables and relationships are defined between them.
  * Typically, tables are related by one to one, one to many, or many to many relationships.
* When we have 2 data sets that are related by a common key, we typically join them.
* There are different types of joins.
  * INNER JOIN
  * OUTER JOIN (LEFT or RIGHT)
  * FULL OUTER JOIN (a LEFT OUTER JOIN b UNION a RIGHT OUTER JOIN b)
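
To make the join types concrete, here is a minimal plain Python sketch. It assumes each key appears at most once on each side (a one to one relationship), and the dictionaries `a` and `b` and the helper functions are hypothetical names used only for illustration.

```python
# Two toy relations keyed on an airport code (values are illustrative only).
a = {"ATL": "flight-1", "SJU": "flight-2"}   # left side
b = {"ATL": "Atlanta", "ABR": "Aberdeen"}    # right side

def inner_join(left, right):
    # Keep only the keys present on both sides.
    return {k: (left[k], right[k]) for k in left if k in right}

def left_outer_join(left, right):
    # Keep every left key; a missing right value becomes None.
    return {k: (left[k], right.get(k)) for k in left}

def right_outer_join(left, right):
    # Keep every right key; a missing left value becomes None.
    return {k: (left.get(k), right[k]) for k in right}

def full_outer_join(left, right):
    # Union of the left and right outer joins; matched keys appear once.
    return {**left_outer_join(left, right), **right_outer_join(left, right)}

print(inner_join(a, b))       # {'ATL': ('flight-1', 'Atlanta')}
print(left_outer_join(a, b))  # {'ATL': ('flight-1', 'Atlanta'), 'SJU': ('flight-2', None)}
print(full_outer_join(a, b))
# {'ATL': ('flight-1', 'Atlanta'), 'SJU': ('flight-2', None), 'ABR': (None, 'Aberdeen')}
```

With one to many or many to many relationships, a matching key would produce one output row per matching pair, which dictionaries keyed this way cannot represent.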
 

## Solutions - Problem 1

Get the number of flights that departed from each US airport.

In [21]:
from pyspark.sql.functions import col, lit, count

In [22]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [23]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [24]:
airlines.show()

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|2008|    1|    

In [25]:
airlines.select("Year", "Month", "DayOfMonth", "Origin", "Dest", "CRSDepTime").show()

+----+-----+----------+------+----+----------+
|Year|Month|DayOfMonth|Origin|Dest|CRSDepTime|
+----+-----+----------+------+----+----------+
|2008|    1|        16|   BGR| CVG|      1735|
|2008|    1|        17|   SYR| CVG|      1701|
|2008|    1|        17|   SAV| BOS|      1225|
|2008|    1|        17|   CVG| GRR|      1530|
|2008|    1|        17|   STL| CVG|      1205|
|2008|    1|        18|   STL| JFK|      1150|
|2008|    1|        18|   MCI| CVG|      1009|
|2008|    1|        19|   TUL| CVG|       835|
|2008|    1|        20|   JFK| PHL|      1935|
|2008|    1|        20|   RDU| CVG|       830|
|2008|    1|        21|   CVG| DTW|      1640|
|2008|    1|        21|   MSY| LGA|      1204|
|2008|    1|        21|   JFK| PHL|      1935|
|2008|    1|        21|   DCA| JFK|      1830|
|2008|    1|        21|   HSV| DCA|       700|
|2008|    1|        22|   ORD| CVG|      1910|
|2008|    1|        22|   CVG| JFK|      1320|
|2008|    1|        23|   LGA| SAV|       908|
|2008|    1| 

In [26]:
airlines.count()

605659

In [27]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [28]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country='USA'")

In [29]:
airport_codes.count()

443

In [30]:
airlines. \
    join(airport_codes, col("IATA") == col("Origin")). \
    select("Year", "Month", "DayOfMonth", airport_codes["*"], "CRSDepTime"). \
    show()

+----+-----+----------+-------------+-----+-------+----+----------+
|Year|Month|DayOfMonth|         City|State|Country|IATA|CRSDepTime|
+----+-----+----------+-------------+-----+-------+----+----------+
|2008|    1|        16|       Bangor|   ME|    USA| BGR|      1735|
|2008|    1|        17|     Syracuse|   NY|    USA| SYR|      1701|
|2008|    1|        17|     Savannah|   GA|    USA| SAV|      1225|
|2008|    1|        17|   Cincinnati|   OH|    USA| CVG|      1530|
|2008|    1|        17|    St. Louis|   MO|    USA| STL|      1205|
|2008|    1|        18|    St. Louis|   MO|    USA| STL|      1150|
|2008|    1|        18|  Kansas City|   MO|    USA| MCI|      1009|
|2008|    1|        19|        Tulsa|   OK|    USA| TUL|       835|
|2008|    1|        20|     New York|   NY|    USA| JFK|      1935|
|2008|    1|        20|      Raleigh|   NC|    USA| RDU|       830|
|2008|    1|        21|   Cincinnati|   OH|    USA| CVG|      1640|
|2008|    1|        21|  New Orleans|   LA|    U

In [31]:
airlines. \
    join(airport_codes, airport_codes.IATA == airlines["Origin"]). \
    select("Year", "Month", "DayOfMonth", airport_codes["*"], "CRSDepTime"). \
    show()

+----+-----+----------+-------------+-----+-------+----+----------+
|Year|Month|DayOfMonth|         City|State|Country|IATA|CRSDepTime|
+----+-----+----------+-------------+-----+-------+----+----------+
|2008|    1|        16|       Bangor|   ME|    USA| BGR|      1735|
|2008|    1|        17|     Syracuse|   NY|    USA| SYR|      1701|
|2008|    1|        17|     Savannah|   GA|    USA| SAV|      1225|
|2008|    1|        17|   Cincinnati|   OH|    USA| CVG|      1530|
|2008|    1|        17|    St. Louis|   MO|    USA| STL|      1205|
|2008|    1|        18|    St. Louis|   MO|    USA| STL|      1150|
|2008|    1|        18|  Kansas City|   MO|    USA| MCI|      1009|
|2008|    1|        19|        Tulsa|   OK|    USA| TUL|       835|
|2008|    1|        20|     New York|   NY|    USA| JFK|      1935|
|2008|    1|        20|      Raleigh|   NC|    USA| RDU|       830|
|2008|    1|        21|   Cincinnati|   OH|    USA| CVG|      1640|
|2008|    1|        21|  New Orleans|   LA|    U

In [32]:
airlines.join?

Signature: airlines.join(other, on=None, how=None)
Docstring:
Joins with another :class:`DataFrame`, using the given join expression.

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
    ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
    ``left_semi``, and ``left_anti``.

The following performs a full outer join between ``df1`` and ``df2``.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height)

In [33]:
flight_count_per_airport = airlines. \
    join(airport_codes, airport_codes.IATA == airlines.Origin). \
    groupBy("Origin"). \
    agg(count(lit(1)).alias("FlightCount")). \
    orderBy(col("FlightCount").desc())

In [34]:
flight_count_per_airport.show()

+------+-----------+
|Origin|FlightCount|
+------+-----------+
|   ATL|      33897|
|   ORD|      29936|
|   DFW|      23861|
|   DEN|      19477|
|   LAX|      18945|
|   PHX|      17695|
|   IAH|      15531|
|   LAS|      15292|
|   DTW|      14357|
|   EWR|      12467|
|   SLC|      12401|
|   MSP|      11800|
|   SFO|      11573|
|   MCO|      11070|
|   CLT|      10752|
|   LGA|      10300|
|   JFK|      10023|
|   BOS|       9717|
|   BWI|       8883|
|   CVG|       8659|
+------+-----------+
only showing top 20 rows



## Solutions - Problem 2

Get the number of flights that departed from each US state.

In [35]:
from pyspark.sql.functions import col, lit, count

In [36]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [37]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [38]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [39]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country='USA'")

In [40]:
flight_count_per_state = airlines. \
    join(airport_codes, airport_codes.IATA == airlines.Origin). \
    groupBy("State"). \
    agg(count(lit(1)).alias("FlightCount")). \
    orderBy(col("FlightCount").desc())

In [41]:
flight_count_per_state.show()

+-----+-----------+
|State|FlightCount|
+-----+-----------+
|   CA|      72853|
|   TX|      63930|
|   FL|      41042|
|   IL|      39812|
|   GA|      35527|
|   NY|      28414|
|   CO|      23288|
|   AZ|      20768|
|   OH|      19209|
|   NC|      17942|
|   MI|      17824|
|   NV|      17763|
| null|      14090|
|   TN|      13549|
|   PA|      13491|
|   UT|      12709|
|   NJ|      12498|
|   MN|      12357|
|   MO|      11808|
|   WA|      10210|
+-----+-----------+
only showing top 20 rows
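
The solutions to Problems 1 and 2 share one pattern: an inner join on the airport code, then a count per group, sorted in descending order. Here is a minimal plain Python sketch of that pattern, assuming toy stand-ins (`origins` and `us_airports` are hypothetical, not the real data).

```python
from collections import Counter

# Toy stand-ins: flight origins and a US airport lookup (IATA -> State).
origins = ["ATL", "ATL", "ORD", "SJU", "ORD", "ATL"]
us_airports = {"ATL": "GA", "ORD": "IL", "ABR": "SD"}

# Inner join: keep only flights whose origin has a matching IATA code.
matched = [o for o in origins if o in us_airports]

# Count per airport (Problem 1) and per state (Problem 2).
flight_count_per_airport = Counter(matched)
flight_count_per_state = Counter(us_airports[o] for o in matched)

# most_common() sorts by count in descending order.
print(flight_count_per_airport.most_common())  # [('ATL', 3), ('ORD', 2)]
print(flight_count_per_state.most_common())    # [('GA', 3), ('IL', 2)]
```

The flight from `SJU` is dropped because the inner join keeps only matched rows, and `ABR` does not appear because no flight departs from it.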



## Solutions - Problem 3

Get the list of US airports from which no flights departed.

In [42]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [43]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [44]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [45]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country='USA'")

In [46]:
airport_codes.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- IATA: string (nullable = true)



In [47]:
airport_codes. \
    join(airlines, airport_codes.IATA == airlines.Origin, "left"). \
    select(airport_codes["*"], "Year", "Month",
           "DayOfMonth", "Origin", "CRSDepTime"). \
    show()

+--------+-----+-------+----+----+-----+----------+------+----------+
|    City|State|Country|IATA|Year|Month|DayOfMonth|Origin|CRSDepTime|
+--------+-----+-------+----+----+-----+----------+------+----------+
|Aberdeen|   SD|    USA| ABR|null| null|      null|  null|      null|
| Abilene|   TX|    USA| ABI|2008|    1|         5|   ABI|       540|
| Abilene|   TX|    USA| ABI|2008|    1|        17|   ABI|      1423|
| Abilene|   TX|    USA| ABI|2008|    1|        21|   ABI|       640|
| Abilene|   TX|    USA| ABI|2008|    1|         8|   ABI|       842|
| Abilene|   TX|    USA| ABI|2008|    1|        15|   ABI|      1423|
| Abilene|   TX|    USA| ABI|2008|    1|        13|   ABI|       540|
| Abilene|   TX|    USA| ABI|2008|    1|        18|   ABI|       842|
| Abilene|   TX|    USA| ABI|2008|    1|        25|   ABI|       640|
| Abilene|   TX|    USA| ABI|2008|    1|         2|   ABI|       640|
| Abilene|   TX|    USA| ABI|2008|    1|        30|   ABI|       640|
| Abilene|   TX|    

In [48]:
airports_not_used = airport_codes. \
    join(airlines, airport_codes.IATA == airlines.Origin, "left"). \
    filter(airlines.Origin.isNull()). \
    select('City', 'State', 'Country', 'IATA')

In [49]:
airports_not_used = airlines. \
    join(airport_codes, airport_codes.IATA == airlines.Origin, "right"). \
    filter("Origin IS NULL"). \
    select('City', 'State', 'Country', 'IATA')

In [50]:
airports_not_used.count()

173
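
The left outer join followed by a null filter can be sketched in plain Python as below. This is an illustration on toy data, with `flight_origins` as a hypothetical stand-in for the `Origin` column. The last step shows that the result matches an anti join, which corresponds to the `left_anti` value listed in the `join` docstring above.

```python
# Toy stand-ins for the two DataFrames.
airport_codes = [
    ("Aberdeen", "SD", "USA", "ABR"),
    ("Abilene", "TX", "USA", "ABI"),
]
flight_origins = ["ABI", "ABI", "ATL"]

# Left outer join from airports to flights:
# an airport with no flights is paired with None.
left_joined = []
for airport in airport_codes:
    matches = [o for o in flight_origins if o == airport[3]] or [None]
    left_joined.extend((airport, origin) for origin in matches)

# Keeping the rows where the flight side is None gives the unused airports.
airports_not_used = [airport for airport, origin in left_joined if origin is None]

# An anti join returns the same airports without building the joined rows.
used = set(flight_origins)
anti_joined = [airport for airport in airport_codes if airport[3] not in used]

print(airports_not_used)  # [('Aberdeen', 'SD', 'USA', 'ABR')]
```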

In [51]:
airport_codes.show()

+-------------+-----+-------+----+
|         City|State|Country|IATA|
+-------------+-----+-------+----+
|     Aberdeen|   SD|    USA| ABR|
|      Abilene|   TX|    USA| ABI|
|        Akron|   OH|    USA| CAK|
|      Alamosa|   CO|    USA| ALS|
|       Albany|   GA|    USA| ABY|
|       Albany|   NY|    USA| ALB|
|  Albuquerque|   NM|    USA| ABQ|
|   Alexandria|   LA|    USA| AEX|
|    Allentown|   PA|    USA| ABE|
|     Alliance|   NE|    USA| AIA|
|       Alpena|   MI|    USA| APN|
|      Altoona|   PA|    USA| AOO|
|     Amarillo|   TX|    USA| AMA|
|    Anchorage|   AK|    USA| ANC|
|     Appleton|   WI|    USA| ATW|
|    Asheville|   NC|    USA| AVL|
|        Aspen|   CO|    USA| ASE|
|       Athens|   GA|    USA| AHN|
|      Atlanta|   GA|    USA| ATL|
|Atlantic City|   NJ|    USA| ACY|
+-------------+-----+-------+----+
only showing top 20 rows



## Solutions - Problem 4

Check whether there are any origins in the airlines data that do not have a record in airport-codes.

In [52]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [53]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [54]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [55]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big')")

In [56]:
airlines. \
    join(airport_codes, airlines.Origin == airport_codes.IATA, "left"). \
    filter("IATA IS NULL"). \
    select("Origin"). \
    distinct(). \
    show()

+------+
|Origin|
+------+
|   HDN|
|   SJU|
|   ITO|
|   STT|
|   CEC|
|   CDC|
|   PSG|
|   ADK|
|   KOA|
|   OTZ|
|   BQN|
|   STX|
|   PMD|
|   PSE|
|   SCC|
|   SLE|
+------+



In [57]:
airlines. \
    join(airport_codes, airlines.Origin == airport_codes.IATA, "left"). \
    filter("IATA IS NULL"). \
    select("Origin"). \
    distinct(). \
    count()

16
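
Because only the distinct origins matter here, the same check can be pictured as a set difference. The following is a minimal sketch on toy data (`flight_origins` and `known_iata` are hypothetical stand-ins for the two DataFrames).

```python
# Toy stand-ins: origins seen in the flights data and codes in airport-codes.
flight_origins = ["SJU", "ATL", "SJU", "KOA", "ORD"]
known_iata = {"ATL", "ORD", "ABR"}

# Distinct origins that have no matching airport code.
unmatched_origins = sorted(set(flight_origins) - known_iata)

print(unmatched_origins)       # ['KOA', 'SJU']
print(len(unmatched_origins))  # 2
```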

## Solutions - Problem 5

Get the total number of flights from airports that do not have entries in airport-codes.

In [58]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [59]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [None]:
airport_codes_path = "/public/airlines_all/airport-codes"

In [None]:
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big')")

In [None]:
airlines. \
    join(airport_codes, airlines.Origin == airport_codes.IATA, "left"). \
    filter("IATA IS NULL"). \
    count()

## Solutions - Problem 6

Get the number of flights per airport for airports that do not have entries in airport-codes.

In [None]:
airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [None]:
airlines = spark. \
    read. \
    parquet(airlines_path)

In [None]:
airport_codes_path = "/public/airlines_all/airport-codes"
airport_codes = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       ). \
    filter("!(State = 'Hawaii' AND IATA = 'Big')")

In [None]:
airlines. \
    join(airport_codes, airlines.Origin == airport_codes.IATA, "left"). \
    filter("IATA IS NULL"). \
    groupBy("Origin"). \
    count(). \
    show()
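
Problems 5 and 6 differ only in the final aggregation: one counts all unmatched flights, the other counts them per origin. Here is a minimal plain Python sketch of both, assuming toy data (`flight_origins` and `known_iata` are hypothetical stand-ins, so the numbers are not the real results).

```python
from collections import Counter

# Toy stand-ins: one origin per flight, and the codes present in airport-codes.
flight_origins = ["SJU", "ATL", "SJU", "KOA", "ORD", "SJU"]
known_iata = {"ATL", "ORD", "ABR"}

# Flights whose origin has no entry in airport-codes
# (the rows kept by the left join plus the "IATA IS NULL" filter).
unmatched_flights = [o for o in flight_origins if o not in known_iata]

total_unmatched = len(unmatched_flights)          # Problem 5: overall count
unmatched_per_airport = Counter(unmatched_flights)  # Problem 6: count per origin

print(total_unmatched)              # 4
print(dict(unmatched_per_airport))  # {'SJU': 3, 'KOA': 1}
```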