# Prepare the weather data

In [0]:
import datetime
from pyspark.sql.types import *
from pyspark.sql.functions import unix_timestamp
import math
from pyspark.sql import functions as F

To begin, take a look at the imported `flight_weather_with_airport_code` table to get a sense of the data we will be working with.

In [0]:
%sql
select * from flight_weather_with_airport_code

Year,Month,Day,Time,TimeZone,SkyCondition,Visibility,WeatherType,DryBulbFarenheit,DryBulbCelsius,WetBulbFarenheit,WetBulbCelsius,DewPointFarenheit,DewPointCelsius,RelativeHumidity,WindSpeed,WindDirection,ValueForWindCharacter,StationPressure,PressureTendency,PressureChange,SeaLevelPressure,RecordType,HourlyPrecip,Altimeter,AirportCode,DISPLAY_AIRPORT_NAME,LATITUDE,LONGITUDE
2013,4,1,56,-4,FEW018 SCT044 BKN070,10.0,-RA,76,24.4,74,23.3,73,22.8,90,13,080,,30.06,,,30.06,AA,T,30.07,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,156,-4,FEW037 SCT070,10.0,,76,24.4,73,22.5,71,21.7,85,10,090,,30.05,6.0,17.0,30.05,AA,,30.06,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,256,-4,FEW037 SCT070,10.0,,76,24.4,73,22.5,71,21.7,85,9,100,,30.03,,,30.03,AA,,30.04,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,356,-4,FEW025 SCT070,10.0,,76,24.4,72,22.2,70,21.1,82,9,100,,30.02,,,30.03,AA,,30.03,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,456,-4,FEW025,10.0,,76,24.4,72,22.2,70,21.1,82,7,110,,30.03,5.0,4.0,30.04,AA,,30.04,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,556,-4,FEW025 SCT080,10.0,,76,24.4,71,21.8,69,20.6,79,7,100,,30.04,,,30.05,AA,,30.05,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,656,-4,FEW028 BKN080,10.0,,77,25.0,71,21.7,68,20.0,74,9,110,,30.07,,,30.07,AA,,30.08,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,756,-4,FEW028 BKN080,10.0,,79,26.1,72,22.4,69,20.6,72,13,100,,30.09,3.0,20.0,30.10,AA,,30.1,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,856,-4,FEW030 BKN080,10.0,,82,27.8,73,22.9,69,20.6,65,14,100,21.0,30.11,,,30.11,AA,,30.12,SJU,Luis Munoz Marin International,18.43944444,-66.00222222
2013,4,1,956,-4,SCT035 BKN090,10.0,,83,28.3,74,23.0,69,20.6,63,16,090,23.0,30.11,,,30.12,AA,,30.12,SJU,Luis Munoz Marin International,18.43944444,-66.00222222


Next, count the number of records so we know how many rows we are working with.

In [0]:
%sql
select count(*) from flight_weather_with_airport_code

count(1)
406516


Observe that this data set has 406,516 rows and 29 columns. For this model, we are going to focus on predicting delays using WindSpeed (in MPH), SeaLevelPressure (in inches of Hg), and HourlyPrecip (in inches). We will focus on preparing the data for those features.

Let's start out by taking a look at the **WindSpeed** column. You may scroll through the values in the table above, but reviewing just the distinct values will be faster.

In [0]:
%sql
select distinct WindSpeed from flight_weather_with_airport_code

WindSpeed
7
51
15
11
29
3
30
34
8
22


Try clicking the **WindSpeed** column header to sort the list in ascending and then descending order. Observe that the values are all numbers, except for some `null` values and the string value `M` (for Missing). We will need to replace these missing values and convert WindSpeed to its proper numeric type.
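Sorting in the UI works, but the non-numeric markers can also be isolated programmatically. The sketch below is plain Python rather than Spark; the helper name `non_numeric_tokens` and the sample list are illustrative assumptions, not part of the notebook. It simply reports which distinct values would fail a numeric conversion.

```python
def non_numeric_tokens(values):
    """Return the sorted distinct values that cannot be parsed as a float.

    None stands in for SQL null and is reported as the string 'null'.
    """
    bad = set()
    for value in values:
        if value is None:
            bad.add("null")
            continue
        try:
            float(value)
        except ValueError:
            bad.add(value)
    return sorted(bad)


# Hypothetical sample of distinct WindSpeed values, held as strings.
sample = ["7", "51", "15", None, "M", "11"]
print(non_numeric_tokens(sample))
```

The same check applies unchanged to the SeaLevelPressure and HourlyPrecip columns, where it would surface markers such as `M` and `T`.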

Next, let's take a look at the **SeaLevelPressure** column in the same way, by listing its distinct values.

In [0]:
%sql
select distinct SeaLevelPressure from flight_weather_with_airport_code

SeaLevelPressure
29.68
29.45
30.43
29.58
30.59
30.13
30.66
29.39
30.17
29.61


As before, click the **SeaLevelPressure** column header to sort the values in ascending and then descending order. Observe that most of the values are numeric (e.g., 29.96, 30.01), but some contain the string value `M` (for Missing). We will need to replace `M` with a suitable numeric value so that we can convert this column to a numeric feature.

Finally, let's observe the **HourlyPrecip** feature by selecting its distinct values.

In [0]:
%sql
select distinct HourlyPrecip from flight_weather_with_airport_code

HourlyPrecip
0.55
0.07
0.75
0.59
1.53
0.32
0.03
0.11
1.23
0.60


Click the **HourlyPrecip** column header to sort the list in ascending and then descending order. Observe that this column contains mostly numeric values, but also `null` values and the value `T` (for a Trace amount of rain). We need to replace `T` with a suitable numeric value and convert this column to a numeric feature.

## Clean up weather data

To perform our data cleanup, we will execute a Python script that carries out the following tasks:

* WindSpeed: Replace missing values with 0.0, and “M” values with 0.005
* HourlyPrecip: Replace missing values with 0.0, and “T” values with 0.005
* SeaLevelPressure: Replace “M” values with 29.92 (standard sea-level pressure in inches of Hg, used here as the average pressure)
* Convert WindSpeed, HourlyPrecip, and SeaLevelPressure to numeric columns
* Round “Time” column down to the nearest hour, and add value to a new column named “Hour”
* Eliminate unneeded columns from the dataset
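The rules above can be sketched for a single record in plain Python before applying them to the whole DataFrame. This is a minimal illustration, not the notebook's Spark code: the function `clean_weather_record` and the dictionary-based row are hypothetical, and only the replacement values come from the list above.

```python
def clean_weather_record(row):
    """Apply the cleanup rules to one weather record given as a dict.

    Expects the keys AirportCode, Month, Day, Time, WindSpeed,
    SeaLevelPressure, and HourlyPrecip; None stands in for SQL null.
    """
    wind = row["WindSpeed"]
    precip = row["HourlyPrecip"]
    pressure = row["SeaLevelPressure"]

    # Missing (null) values become 0.0; marker strings become fixed numbers.
    if wind is None:
        wind = "0.0"
    elif wind == "M":
        wind = "0.005"

    if precip is None:
        precip = "0.0"
    elif precip == "T":
        precip = "0.005"

    # Only the "M" marker is replaced for pressure; nulls are left as-is.
    if pressure == "M":
        pressure = "29.92"

    return {
        "AirportCode": row["AirportCode"],
        "Month": row["Month"],
        "Day": row["Day"],
        # Time is an HHMM integer, so integer division by 100 yields the hour.
        "Hour": row["Time"] // 100,
        "WindSpeed": float(wind),
        "SeaLevelPressure": None if pressure is None else float(pressure),
        "HourlyPrecip": float(precip),
    }


raw = {"AirportCode": "SJU", "Month": 4, "Day": 1, "Time": 56,
       "WindSpeed": "13", "SeaLevelPressure": "30.06", "HourlyPrecip": "T"}
print(clean_weather_record(raw))
```

Working through one row by hand like this is a cheap way to confirm the intended behavior of each rule before running the transformation over all 406,516 rows.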

Let's begin by creating a new DataFrame from the table. While we're at it, we'll pare down the number of columns to just the ones we need (AirportCode, Month, Day, Time, WindSpeed, SeaLevelPressure, HourlyPrecip).

In [0]:
dfWeather = spark.sql("select AirportCode, cast(Month as int) Month, cast(Day as int) Day, cast(Time as int) Time, WindSpeed, SeaLevelPressure, HourlyPrecip from flight_weather_with_airport_code")

In [0]:
dfWeather.show()

+-----------+-----+---+----+---------+----------------+------------+
|AirportCode|Month|Day|Time|WindSpeed|SeaLevelPressure|HourlyPrecip|
+-----------+-----+---+----+---------+----------------+------------+
|        SJU|    4|  1|  56|       13|           30.06|           T|
|        SJU|    4|  1| 156|       10|           30.05|        null|
|        SJU|    4|  1| 256|        9|           30.03|        null|
|        SJU|    4|  1| 356|        9|           30.03|        null|
|        SJU|    4|  1| 456|        7|           30.04|        null|
|        SJU|    4|  1| 556|        7|           30.05|        null|
|        SJU|    4|  1| 656|        9|           30.07|        null|
|        SJU|    4|  1| 756|       13|           30.10|        null|
|        SJU|    4|  1| 856|       14|           30.11|        null|
|        SJU|    4|  1| 956|       16|           30.12|        null|
|        SJU|    4|  1|1056|       17|           30.12|        null|
+-----------+-----+---+----+---------+----------------+------------+

Review the schema of the `dfWeather` DataFrame. Note that WindSpeed, SeaLevelPressure, and HourlyPrecip are still strings at this point.

In [0]:
print(dfWeather.dtypes)

[('AirportCode', 'string'), ('Month', 'int'), ('Day', 'int'), ('Time', 'int'), ('WindSpeed', 'string'), ('SeaLevelPressure', 'string'), ('HourlyPrecip', 'string')]
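One practical consequence of the `string` types above is that ordering and comparison on these columns are lexicographic rather than numeric until they are cast. The plain-Python sketch below illustrates the effect on a few hypothetical WindSpeed values; the sample list is an assumption for illustration only.

```python
# Hypothetical WindSpeed values held as strings, as in the dfWeather schema.
wind_as_strings = ["9", "10", "13", "7"]

# Lexicographic order compares character by character, so "10" sorts before "7".
lexicographic = sorted(wind_as_strings)

# Converting to float first gives the order a reader would expect.
numeric = sorted(wind_as_strings, key=float)

print(lexicographic)
print(numeric)
```

This is one reason the cleanup step casts the three feature columns to `float` before they are used in the model.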


In [0]:

# Round Time (an HHMM integer) down to the start of its hour, since that is the hour for which we want to use flight data. Then, add the rounded value as a new column named "Hour" on the dfWeather DataFrame.
df = dfWeather.withColumn('Hour', F.floor(dfWeather['Time']/100))

# Replace any missing HourlyPrecip and WindSpeed values with 0.0
df = df.fillna('0.0', subset=['HourlyPrecip', 'WindSpeed'])

# Replace any WindSpeed values of "M" with 0.005
df = df.replace('M', '0.005', 'WindSpeed')

# Replace any SeaLevelPressure values of "M" with 29.92 (the average pressure)
df = df.replace('M', '29.92', 'SeaLevelPressure')

# Replace any HourlyPrecip values of "T" (trace) with 0.005
df = df.replace('T', '0.005', 'HourlyPrecip')

# Cast the WindSpeed, SeaLevelPressure, and HourlyPrecip columns to float while
# defining a new DataFrame that includes just the columns used by the model, including the new Hour feature
dfWeather_Clean = df.select('AirportCode', 'Month', 'Day', 'Hour', df['WindSpeed'].cast('float'), df['SeaLevelPressure'].cast('float'), df['HourlyPrecip'].cast('float'))


Now let's take a look at the new `dfWeather_Clean` DataFrame.

In [0]:
display(dfWeather_Clean)

AirportCode,Month,Day,Hour,WindSpeed,SeaLevelPressure,HourlyPrecip
SJU,4,1,0,13.0,30.06,0.005
SJU,4,1,1,10.0,30.05,0.0
SJU,4,1,2,9.0,30.03,0.0
SJU,4,1,3,9.0,30.03,0.0
SJU,4,1,4,7.0,30.04,0.0
SJU,4,1,5,7.0,30.05,0.0
SJU,4,1,6,9.0,30.07,0.0
SJU,4,1,7,13.0,30.1,0.0
SJU,4,1,8,14.0,30.11,0.0
SJU,4,1,9,16.0,30.12,0.0


Observe that the new DataFrame only has 7 columns. Also, the WindSpeed, SeaLevelPressure, and HourlyPrecip fields are all numeric and contain no missing values. To ensure they are indeed numeric, we can take a look at the DataFrame's schema.

In [0]:
print(dfWeather_Clean.dtypes)

[('AirportCode', 'string'), ('Month', 'int'), ('Day', 'int'), ('Hour', 'bigint'), ('WindSpeed', 'float'), ('SeaLevelPressure', 'float'), ('HourlyPrecip', 'float')]


Now let's save the cleaned weather data to a persistent global table.

In [0]:
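# Write explicitly in Delta format so the CREATE TABLE ... USING DELTA statement below can read this location
dfWeather_Clean.write.format("delta").mode("overwrite").save("/mnt/sparkcontainer/Silver/flight_weather_clean")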

In [0]:
%sql
DROP TABLE IF EXISTS flight_weather_clean;

CREATE TABLE flight_weather_clean
USING DELTA LOCATION '/mnt/sparkcontainer/Silver/flight_weather_clean'

In [0]:
# Confirm that no rows were lost during cleanup
dfWeather_Clean.count()

Out[15]: 406516