# NYC Flights data

Reference : https://github.com/rich-iannone/so-many-pyspark-examples/blob/main/spark-dataframes.ipynb

In this notebook, we will extract some historical flight data for flights out of NYC between 1990 and 2000. The data is taken from [here](http://stat-computing.org/dataexpo/2009/the-data.html).

**Variable descriptions**

 Name	Description
 
 1.	`Year` 1987-2008
 2.	`Month` 1-12
 3. `DayofMonth` 1-31
 4.	`DayOfWeek` 1 (Monday) - 7 (Sunday)
 5.	`DepTime` actual departure time (local, hhmm)
 6.	`CRSDepTime` scheduled departure time (local, hhmm)
 7.	`ArrTime` actual arrival time (local, hhmm)
 8.	`CRSArrTime` scheduled arrival time (local, hhmm)
 9.	`UniqueCarrier` unique carrier code
 10. `FlightNum` flight number
 11. `TailNum` plane tail number
 12. `ActualElapsedTime` in minutes
 13. `CRSElapsedTime` in minutes
 14. `AirTime` in minutes
 15. `ArrDelay` arrival delay, in minutes
 16. `DepDelay` departure delay, in minutes
 17. `Origin` origin IATA airport code
 18. `Dest` destination IATA airport code
 19. `Distance` in miles
 20. `TaxiIn` taxi in time, in minutes
 21. `TaxiOut` taxi out time in minutes
 22. `Cancelled` was the flight cancelled?
 23. `CancellationCode` reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
 24. `Diverted` 1 = yes, 0 = no
 25. `CarrierDelay` in minutes
 26. `WeatherDelay` in minutes
 27. `NASDelay` in minutes
 28. `SecurityDelay` in minutes
 29. `LateAircraftDelay` in minutes

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .config("spark.cores.max", "4") \
        .appName("NYCFlights") \
        .master("spark://b2-120-gra11:7077") \
        .getOrCreate()


nycflights = spark.read.parquet("hdfs://localhost:54310/data/nycflights.parquet")
nycflights.show()

+-------------------+---------+-------+----------+-------+----------+-------------+---------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+--------+-------------------+
|               Date|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|Diverted|__null_dask_index__|
+-------------------+---------+-------+----------+-------+----------+-------------+---------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+--------+-------------------+
|1999-01-01 00:00:00|        5| 1526.0|      1515| 1838.0|      1849|           CO|     1923|            312.0|         334.0|  289.0|   -11.0|    11.0|   EWR| PHX|  2133.0|   7.0|   16.0|    false|       0|                  0|
|1999-01-02 00:00:00|        6| 1727.0|      1540| 2056.0|      1914|           CO|     

Let's take a look at the DataFrame schema

In [2]:
nycflights.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- DayOfWeek: long (nullable = true)
 |-- DepTime: double (nullable = true)
 |-- CRSDepTime: long (nullable = true)
 |-- ArrTime: double (nullable = true)
 |-- CRSArrTime: long (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: long (nullable = true)
 |-- ActualElapsedTime: double (nullable = true)
 |-- CRSElapsedTime: double (nullable = true)
 |-- AirTime: double (nullable = true)
 |-- ArrDelay: double (nullable = true)
 |-- DepDelay: double (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: double (nullable = true)
 |-- TaxiIn: double (nullable = true)
 |-- TaxiOut: double (nullable = true)
 |-- Cancelled: boolean (nullable = true)
 |-- Diverted: long (nullable = true)
 |-- __null_dask_index__: long (nullable = true)



Let's group and aggregate. `groupBy()` groups rows by one or more DataFrame columns and prepares them for aggregation functions

In [2]:
(nycflights
 .groupby('Origin') # creates 'GroupedData'
 .count() # creates a new column with aggregate `count` values
 .show())

+------+-------+
|Origin|  count|
+------+-------+
|   LGA|1003420|
|   EWR|1174131|
|   JFK| 434341|
+------+-------+



                                                                                

Use the `agg()` function to perform multiple aggregations

In [3]:
(nycflights
 .groupby('Origin')
 .agg({'DepDelay': 'avg', 'ArrDelay': 'avg'}) # note the new column names
 .show())

+------+------------------+-----------------+
|Origin|     avg(DepDelay)|    avg(ArrDelay)|
+------+------------------+-----------------+
|   LGA| 7.431141565915709|6.047335822109474|
|   EWR|10.295468607250333|9.565089035954916|
|   JFK|10.351298909519874|8.353392637878372|
+------+------------------+-----------------+



                                                                                

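With the dictionary syntax, you can't perform multiple aggregations on the same column: a Python dict keeps only the last value for a repeated key, so only the last aggregation is performed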

In [4]:
(nycflights
 .groupby('DayOfWeek')
 .agg({'DepDelay': 'min', 'DepDelay': 'max'})
 .show())

+---------+-------------+
|DayOfWeek|max(DepDelay)|
+---------+-------------+
|        7|       1429.0|
|        6|       1434.0|
|        5|       1435.0|
|        1|       1435.0|
|        3|       1435.0|
|        2|       1434.0|
|        4|       1435.0|
+---------+-------------+



                                                                                

Use `groupBy()` with a few columns, then aggregate

In [5]:
(
  nycflights
  .groupby(['DayOfWeek', 'Origin', 'Dest']) # group by these unique combinations
  .count()                              # perform a 'count' aggregation on the groups
  .orderBy(['DayOfWeek', 'count'],
           ascending = [1, 0])          # order by `DayOfWeek` ascending, `count` descending
  .show(40)
) 
     

+---------+------+----+-----+
|DayOfWeek|Origin|Dest|count|
+---------+------+----+-----+
|        1|   LGA| ORD|17121|
|        1|   EWR| ORD|15946|
|        1|   EWR| BOS|12157|
|        1|   JFK| LAX|10383|
|        1|   EWR| ATL| 9462|
|        1|   LGA| BOS| 8755|
|        1|   LGA| DFW| 8687|
|        1|   LGA| DCA| 8165|
|        1|   LGA| ATL| 7568|
|        1|   EWR| DTW| 7209|
|        1|   JFK| SFO| 6869|
|        1|   EWR| MCO| 6564|
|        1|   LGA| MIA| 6381|
|        1|   EWR| DFW| 6364|
|        1|   EWR| DCA| 6203|
|        1|   EWR| LAX| 5959|
|        1|   LGA| MCO| 5503|
|        1|   EWR| PIT| 5213|
|        1|   EWR| DEN| 5004|
|        1|   JFK| SJU| 4988|
|        1|   EWR| MIA| 4941|
|        1|   LGA| DTW| 4927|
|        1|   LGA| PIT| 4894|
|        1|   LGA| PBI| 4888|
|        1|   LGA| CLT| 4776|
|        1|   JFK| MIA| 4749|
|        1|   LGA| FLL| 4495|
|        1|   EWR| IAH| 4399|
|        1|   EWR| CLT| 4377|
|        1|   LGA| CLE| 4308|
|        1

                                                                                

Use `groupBy()` + `pivot()` + an aggregation function to make a pivot table!
Get a table of flight counts by day of week for each carrier

In [6]:
(
  nycflights
  .groupBy('DayOfWeek') # group the data for aggregation by day of week
  .pivot('UniqueCarrier') # provide columns of data by `carrier` abbreviation
  .count()          # create aggregations as a count of rows
  .show()
)
     

+---------+-----+------+-----+----+----+------+-----+------+-----+-----+-----+
|DayOfWeek|   AA|    CO|   DL|  EA|  HP|ML (1)|   NW|PA (1)|   TW|   UA|   US|
+---------+-----+------+-----+----+----+------+-----+------+-----+-----+-----+
|        7|51950| 94901|52191|2375|6240|   306|17846|  5856|29381|33861|59781|
|        6|46854| 85451|51611|2346|6429|   281|16335|  5891|28709|32328|44344|
|        5|54980|106813|54520|2346|6411|   363|19695|  6263|29862|36895|67396|
|        1|55358|106849|54824|2431|6413|   365|19783|  6254|30106|37294|68334|
|        3|55268|106974|54936|2337|6396|   376|19806|  6259|30076|37299|68315|
|        2|55472|107185|55061|2346|6411|   368|19893|  6206|30128|37311|68531|
|        4|55049|106488|54626|2361|6417|   366|19709|  6173|29811|37135|67981|
+---------+-----+------+-----+----+----+------+-----+------+-----+-----+-----+



                                                                                

## Column Operations


`Column` instances can be created by:

(1) Selecting a column from a DataFrame
- `df.colName`
- `df["colName"]`

The resulting `Column` can then be passed to DataFrame methods such as `df.select(df.colName)` or `df.withColumn("newName", df.colName)`.

(2) Creating one from an expression
- `df.colName + 1`
- `1 / df.colName`

Once you have a `Column` instance, you can apply a wide range of functions. Some of the functions covered here are:
- `format_number()`: apply formatting to a number, rounded to `d` decimal places, and return the result as a string
- `when()` & `otherwise()`: `when()` evaluates a list of conditions and returns one of multiple possible result expressions; if `otherwise()` is not invoked, `None` is returned for unmatched conditions
- `concat_ws()`: concatenates multiple input string columns together into a single string column, using the given separator
- `to_utc_timestamp()`: assumes the given timestamp is in given timezone and converts to UTC
- `year()`: extracts the year of a given date as integer
- `month()`: extracts the month of a given date as integer
- `dayofmonth()`: extracts the day of the month of a given date as integer
- `hour()`: extracts the hour of a given timestamp as integer
- `minute()`: extracts the minute of a given timestamp as integer

Perform 2 different aggregations, rename the new columns, then round the aggregate values


In [8]:
from pyspark.sql.functions import *

(
  nycflights
  .groupby('DayOfWeek')
  .agg({'DepDelay': 'avg', 'ArrDelay': 'avg'})
  .withColumnRenamed('avg(DepDelay)', 'mean_dep_delay')  # average departure delay
  .withColumnRenamed('avg(ArrDelay)', 'mean_arr_delay')  # average arrival delay
  .withColumn('mean_dep_delay', format_number('mean_dep_delay', 1))
  .withColumn('mean_arr_delay', format_number('mean_arr_delay', 1))
  .show()
)

+---------+--------------+--------------+
|DayOfWeek|mean_dep_delay|mean_arr_delay|
+---------+--------------+--------------+
|        7|           9.0|           4.8|
|        6|           7.8|           3.1|
|        5|          11.5|          11.8|
|        1|           8.1|           6.3|
|        3|           9.1|           9.8|
|        2|           8.1|           7.7|
|        4|          10.5|          11.5|
+---------+--------------+--------------+



                                                                                

Add a new column (`far_or_near`) with a string based on a comparison
on a numeric column; uses: `withColumn()`, `when()`, and `otherwise()`

In [8]:
from pyspark.sql.types import *  # Necessary for creating schemas
from pyspark.sql.functions import * # Importing PySpark functions

(
  nycflights
  .withColumn('far_or_near',
              when(nycflights.Distance > 1000, 'far') # the `if-then` statement
              .otherwise('near'))                     # the `else` statement
  .select(["Origin", "Dest", "far_or_near"])
  .distinct()
  .show()
)



+------+----+-----------+
|Origin|Dest|far_or_near|
+------+----+-----------+
|   LGA| MKE|       near|
|   LGA| BOS|       near|
|   LGA| CAE|       near|
|   JFK| FLL|        far|
|   LGA| DTW|       near|
|   JFK| IND|       near|
|   LGA| BNA|       near|
|   EWR| BNA|       near|
|   LGA| PHL|       near|
|   EWR| RSW|        far|
|   EWR| DCA|       near|
|   EWR| LAS|        far|
|   EWR| ORH|       near|
|   JFK| IAD|       near|
|   EWR| SEA|        far|
|   EWR| DAY|       near|
|   JFK| RDU|       near|
|   EWR| PIT|       near|
|   LGA| EWR|       near|
|   JFK| CVG|       near|
+------+----+-----------+
only showing top 20 rows



                                                                                

Perform a few numerical computations across columns

In [9]:
(
  nycflights
  .withColumn('dist_per_minute',
              nycflights.Distance / nycflights.AirTime) # create new column with division of values
  .withColumn('dist_per_minute',
              format_number('dist_per_minute', 2))       # format the new column as a string with 2 decimal places
  .select(["Origin", "Dest", "dist_per_minute"])
  .distinct()
  .show()
)



+------+----+---------------+
|Origin|Dest|dist_per_minute|
+------+----+---------------+
|   JFK| SFO|           6.66|
|   JFK| STL|           7.89|
|   LGA| TPA|           6.83|
|   LGA| MEM|           7.77|
|   LGA| MIA|           5.93|
|   LGA| MSP|           5.45|
|   LGA| MSY|           6.01|
|   EWR| PBI|           8.26|
|   EWR| LAX|           7.03|
|   LGA| BOS|           4.87|
|   LGA| DEN|           7.20|
|   EWR| PHX|           6.20|
|   EWR| GSP|           6.39|
|   EWR| BUF|           3.97|
|   EWR| CVG|           7.39|
|   LGA| STL|           7.45|
|   EWR| GSO|           4.80|
|   EWR| ATL|           4.16|
|   JFK| BQN|           7.40|
|   LGA| CLT|           9.54|
+------+----+---------------+
only showing top 20 rows



                                                                                

You can split the date into its components if needed. Use the `year()`, `month()`, `dayofmonth()`, `hour()`, and `minute()` functions with `withColumn()`

In [10]:
(
  nycflights
  .withColumn('Year', year(nycflights.Date))
  .withColumn('Month', month(nycflights.Date))
  .withColumn('Day', dayofmonth(nycflights.Date))
  .select(["Day", "Month", "Year"])
  .distinct()
  .show()
)



+---+-----+----+
|Day|Month|Year|
+---+-----+----+
| 10|    9|1995|
| 29|   11|1995|
| 29|    9|1990|
|  1|   11|1990|
| 24|    6|1994|
| 17|   10|1994|
|  6|    8|1995|
|  5|    2|1990|
| 29|    9|1994|
| 18|   11|1994|
|  2|   10|1995|
| 26|    7|1994|
| 31|    8|1994|
|  5|    5|1990|
|  8|    7|1990|
| 16|    7|1990|
|  1|    8|1990|
|  2|   10|1994|
|  2|    1|1995|
| 22|    4|1995|
+---+-----+----+
only showing top 20 rows



                                                                                

There are more time-based functions:
- `date_sub()`: subtract an integer number of days from a *Date* or *Timestamp*
- `date_add()`: add an integer number of days to a *Date* or *Timestamp*
- `datediff()`: get the difference between two dates
- `add_months()`: add an integer number of months
- `months_between()`: get the number of months between two dates
- `next_day()`: returns the first date later than the given date that falls on the specified day of the week
- `last_day()`: returns the last day of the month which the given date belongs to
- `dayofmonth()`: extract the day of the month of a given date as integer
- `dayofyear()`: extract the day of the year of a given date as integer
- `weekofyear()`: extract the week number of a given date as integer
- `quarter()`: extract the quarter of a given date

Let's apply two of these functions, `dayofyear()` and `weekofyear()`, to the timestamps in the first 10 records of `nycflights`

In [11]:
(
  nycflights
   .limit(10)
   .select('Date')
   .withColumn('dayofyear', dayofyear(nycflights.Date))
   .withColumn('weekofyear', weekofyear(nycflights.Date))
   .show()
   )

+-------------------+---------+----------+
|               Date|dayofyear|weekofyear|
+-------------------+---------+----------+
|1992-01-07 00:00:00|        7|         2|
|1992-01-08 00:00:00|        8|         2|
|1992-01-09 00:00:00|        9|         2|
|1992-01-11 00:00:00|       11|         2|
|1992-01-12 00:00:00|       12|         2|
|1992-01-13 00:00:00|       13|         3|
|1992-01-14 00:00:00|       14|         3|
|1992-01-15 00:00:00|       15|         3|
|1992-01-16 00:00:00|       16|         3|
|1992-01-17 00:00:00|       17|         3|
+-------------------+---------+----------+



In [12]:
spark.stop()