## Case Study: Analyzing Flight Delays

Now that you've learned how to use Dask to read and process large datasets in parallel, you'll combine these skills to search for correlations between flight delays and reported weather events at selected airports. You'll read files from multiple directories containing flight statistics for selected airports from the Bureau of Transportation Statistics and merge them with daily weather data from wunderground.com into a single Dask DataFrame.

### Delaying reading & cleaning
To work with this subset of the monthly flight information data efficiently, you'll need to do a bit of cleaning. Specifically, you'll need to replace zeros in the 'WEATHER_DELAY' column with NaN. This substitution makes counting delays much easier later, because aggregations such as count() and mean() skip NaN values. The cleaning requires you to build a delayed pipeline of pandas DataFrame manipulations. You will then convert the output to a Dask DataFrame in which each file is one partition.
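
To see why the substitution helps, here is a minimal sketch in plain pandas. The delay values are made up for illustration and do not come from the flight files.

```python
import numpy as np
import pandas as pd

# Hypothetical weather-delay minutes for six flights (made-up values)
delays = pd.Series([0, 15, 0, 0, 42, 0], name='WEATHER_DELAY')

# With zeros in place, count() counts every flight, delayed or not
count_with_zeros = delays.count()

# After the substitution, count() and mean() skip NaN,
# so they describe only the flights that were actually delayed
cleaned = delays.replace(0, np.nan)
n_delayed = cleaned.count()
mean_delay = cleaned.mean()

print(count_with_zeros, n_delayed, mean_delay)  # 6 2 28.5
```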

Your first job is to write a function that reads a single CSV file into a DataFrame. The returned DataFrame will hold pandas Timestamps in the 'FL_DATE' column and will have zeros replaced with np.nan in the 'WEATHER_DELAY' column. You can use the flightdelays-2016-1.csv file to verify that the function works as intended.

In [6]:
# Define @delayed-function read_flights
from dask import delayed
import pandas as pd
import numpy as np

@delayed
def read_flights(filename):

    # Read in the DataFrame: df
    df = pd.read_csv(filename, parse_dates=['FL_DATE'])

    # Replace 0s in df['WEATHER_DELAY'] with np.nan
    df['WEATHER_DELAY'] = df['WEATHER_DELAY'].replace(0, np.nan)

    # Return df
    return df

### Reading all flight data
A list called filenames is provided for you at the start of this exercise; it contains the strings "flightdelays-2016-1.csv" through "flightdelays-2016-5.csv". The delayed function read_flights() defined in the previous exercise is also provided, and NumPy and pandas have been imported for you.

Your task now is to iterate over the list filenames and to use the function read_flights to build a list of delayed objects. Finally, you'll concatenate them into a Dask DataFrame with dd.from_delayed() and print out the mean of the WEATHER_DELAY column.

In [7]:
# modules
import dask.dataframe as dd


# define filenames
filenames = ['flightdelays-2016-1.csv',
 'flightdelays-2016-2.csv',
 'flightdelays-2016-3.csv',
 'flightdelays-2016-4.csv',
 'flightdelays-2016-5.csv']

dataframes = []

# Loop over filenames with loop variable filename
for filename in filenames:
    # Apply read_flights to filename; append to dataframes
    dataframes.append(read_flights(filename))

# Build a Dask DataFrame from the delayed objects: flight_delays
flight_delays = dd.from_delayed(dataframes)

# Print average of 'WEATHER_DELAY' column of flight_delays
print(flight_delays['WEATHER_DELAY'].mean().compute())

51.29467680608365


### Deferring reading weather data
For this exercise, daily weather data is provided from 2016 for 5 US cities: Atlanta, Denver, Dallas-Fort Worth, Orlando, and Chicago. The weather data comes from Weather Underground and is found in separate CSV files labelled by airport code (e.g., ATL.csv). The list filenames contains the names of these 5 files. The ultimate goal is to correlate the flight delays with weather events from each day of 2016.

As with the flight-delays data, you'll need to clean the weather data as it is read in. Your job is to define a function that loads a DataFrame from a file, cleans the DataFrame's 'PrecipitationIn' column, and appends an 'Airport' column with the appropriate airport code for each record.
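
Both cleaning steps can be checked on a small in-memory sample. This is a sketch under assumptions: the CSV text is invented, 'T' stands in for whatever non-numeric marker the real column may contain, and the path data/ORD.csv is hypothetical. It also uses Path(...).stem rather than splitting on '.', which is a substitution that keeps working if the filename includes a directory.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical extract of a weather file; 'T' is a made-up non-numeric entry
csv_text = (
    "Date,PrecipitationIn\n"
    "2016-01-01,0.10\n"
    "2016-01-02,T\n"
    "2016-01-03,0.00\n"
)
filename = 'data/ORD.csv'  # hypothetical path, not from the exercise

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# errors='coerce' turns unparseable strings into NaN instead of raising
df['PrecipitationIn'] = pd.to_numeric(df['PrecipitationIn'], errors='coerce')

# Path(...).stem drops both the directory and the extension
df['Airport'] = Path(filename).stem

print(df)
```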

In [8]:
# Define @delayed-function read_weather
@delayed
def read_weather(filename):
    # Read in filename: df
    df = pd.read_csv(filename, parse_dates=['Date'])

    # Clean 'PrecipitationIn'
    df['PrecipitationIn'] = pd.to_numeric(df['PrecipitationIn'], errors='coerce')

    # Create the 'Airport' column
    df['Airport'] = filename.split('.')[0]

    # Return df
    return df

### Building a weather DataFrame
Your job now is to construct a Dask DataFrame using the function from the previous exercise. To do this, you will iterate over the list filenames provided and build up a list of delayed DataFrames. You'll then concatenate those delayed DataFrames into a Dask DataFrame with dd.from_delayed(), as you did with the flight information. Finally, you'll print the row with the largest 'Max TemperatureF' value.

The list filenames contains the names of the CSV files of weather data labelled by airport code for Atlanta, Denver, Dallas-Fort Worth, Orlando, and Chicago. The read_weather function from the previous exercise is also provided for you and dask.dataframe is imported as dd. Additionally, an empty list called weather_dfs has been created for you.
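
One way to reason about nlargest() on partitioned data is that the overall top row must also be the top row of its own partition. The plain-pandas sketch below illustrates that idea with made-up temperatures; it is a conceptual illustration, not a description of Dask's internal implementation.

```python
import pandas as pd

# Hypothetical per-airport partitions with made-up temperatures
partitions = [
    pd.DataFrame({'Airport': 'ATL', 'Max TemperatureF': [91, 98, 95]}),
    pd.DataFrame({'Airport': 'DFW', 'Max TemperatureF': [101, 107, 99]}),
    pd.DataFrame({'Airport': 'ORD', 'Max TemperatureF': [88, 93, 90]}),
]

# Take the top row of each partition, then the top row among those candidates
candidates = pd.concat([p.nlargest(1, 'Max TemperatureF') for p in partitions])
hottest = candidates.nlargest(1, 'Max TemperatureF')

# Same answer as concatenating everything first
direct = pd.concat(partitions).nlargest(1, 'Max TemperatureF')

print(hottest)
```

Note that the selected row keeps the index label it had inside its own partition, so index labels in the result are not unique row numbers across the whole dataset.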

In [9]:
# define filenames
filenames = ['ATL.csv', 'DEN.csv', 'DFW.csv', 'MCO.csv', 'ORD.csv']

weather_dfs = []

# Loop over filenames with filename
for filename in filenames:
    # Invoke read_weather on filename; append result to weather_dfs
    weather_dfs.append(read_weather(filename))

# Call dd.from_delayed() with weather_dfs: weather
weather = dd.from_delayed(weather_dfs)

# Print result of weather.nlargest(1, 'Max TemperatureF')
print(weather.nlargest(1, 'Max TemperatureF').compute())

          Date  Max TemperatureF  Mean TemperatureF  Min TemperatureF  \
224 2016-08-12               107                 93                79   

     Max Dew PointF  MeanDew PointF  Min DewpointF  Max Humidity  \
224              75              71             66            79   

     Mean Humidity  Min Humidity  ...  Mean VisibilityMiles  \
224             53            27  ...                     8   

     Min VisibilityMiles  Max Wind SpeedMPH  Mean Wind SpeedMPH  \
224                    0                 41                  10   

     Max Gust SpeedMPH  PrecipitationIn  CloudCover             Events  \
224               54.0             0.82           5  Rain-Thunderstorm   

     WindDirDegrees  Airport  
224             214      DFW  

[1 rows x 24 columns]


### Which city gets the most snow?
The Dask DataFrame weather from the previous exercise is provided here.

Your task now is to aggregate the total snowfall for each airport that experienced snow, measured here as the precipitation recorded on days with a snow event. You'll use the method .str.contains() to create a boolean Series identifying snowy days. Days with no reported event have a missing value in 'Events', and .str.contains() passes that missing value through, so you'll chain the method fillna(False) to turn those entries into False and make the Series usable for selection within the .loc[] accessor. After filtering the rows of weather that correspond to snowy days, you'll group the filtered DataFrame by airport code, extract the precipitation column, and compute the sum for each airport.
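
The effect of the missing values is easier to see on a small pandas DataFrame. The records below are hypothetical. The sketch also shows the na=False argument of pandas' .str.contains(), which is an alternative to chaining fillna(False) and is not what the exercise uses.

```python
import numpy as np
import pandas as pd

# Hypothetical records; NaN in 'Events' means no event was reported that day
toy = pd.DataFrame({
    'Airport': ['DEN', 'DEN', 'ORD', 'ATL'],
    'Events': ['Snow', np.nan, 'Fog-Snow', 'Rain'],
    'PrecipitationIn': [0.30, 0.00, 0.12, 0.50],
})

# Treat "no event reported" as "not snowy"
is_snowy = toy['Events'].str.contains('Snow').fillna(False)

# Alternative in pandas: handle missing values inside str.contains itself
is_snowy_alt = toy['Events'].str.contains('Snow', na=False)

# Filter to snowy days, then sum precipitation per airport
totals = toy.loc[is_snowy].groupby('Airport')['PrecipitationIn'].sum()
print(totals)
```

ATL does not appear in the result because none of its rows match 'Snow', which mirrors how airports without snow drop out of the real output.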

In [10]:
# Make cleaned Boolean Series from weather['Events']: is_snowy
is_snowy = weather['Events'].str.contains('Snow').fillna(False)

# Create filtered DataFrame with weather.loc & is_snowy: got_snow
got_snow = weather.loc[is_snowy]

# Groupby 'Airport' column; select 'PrecipitationIn'; aggregate sum(): result
result = got_snow.groupby('Airport')['PrecipitationIn'].sum()

# Compute & print the value of result
print(result.compute())

Airport
ATL    1.94
DEN    5.59
ORD    3.91
Name: PrecipitationIn, dtype: float64
