In [5]:
import pandas as pd

# Exercise 6.0: Importing the Honolulu Flights Data Set

For some of the exercises in this chapter, we will again work with a data set containing information about all flights arriving at and departing from the Honolulu airport (HNL), on the island of Oahu, in December 2015. This data set was introduced in chapter 3: `DataFrame` Attributes and Arithmetic.

Please run the following code cell, which parses the 'honolulu_flights.csv' file and builds the `DataFrame` `HNL_flights_df`, before trying the exercises in this chapter that use the Honolulu flights data set.

Please recall that this data set contains the following columns:

| Column |Description|
|:----------|-----------|
| `YEAR` | The year of the flight  |
| `MONTH` |  The month of the flight |
| `DAY` |  The day of the flight |
| `DAY_OF_WEEK` |  The day of the week of the flight |
| `FLIGHT_NUMBER` |  The flight number of the flight |
| `ORIGIN_AIRPORT` |  The origin airport of the flight  |
| `DESTINATION_AIRPORT` |  The destination airport of the flight |
| `DEPARTURE_DELAY` |  The departure delay of the flight  |
| `DISTANCE` |  The distance of the flight in miles |
| `AIR_TIME` |  The flight time without taxiing in minutes |
| `ARRIVAL_DELAY` |  The arrival delay of the flight  |

In [6]:
HNL_flights_df = pd.read_csv('Data/honolulu_flights.csv')

# Exercise 6.1: Group Specific Processing

Which of the following options will correctly group data in `HNL_flights_df` by the entries in the `ORIGIN_AIRPORT` column and save the results in a `GroupBy` object, `HNL_flights_by_origin`?

A:
```python
HNL_flights_by_origin = HNL_flights_df.groupby('ORIGIN_AIRPORT')
```

B:
```python
HNL_flights_df.groupby('ORIGIN_AIRPORT', inplace=True)
```

C:
```python
HNL_flights_by_origin.groupby('ORIGIN_AIRPORT')
```

D:
```python
HNL_flights_by_origin = pd.groupby(HNL_flights_df['ORIGIN_AIRPORT'])
```

**Correct Answer**

A:
```python
HNL_flights_by_origin = HNL_flights_df.groupby('ORIGIN_AIRPORT')
```

**Explanation**

A: This line of code correctly uses the `DataFrame` method `groupby()` to group the rows of `HNL_flights_df` by their values in the `ORIGIN_AIRPORT` column and saves the resulting `GroupBy` object in `HNL_flights_by_origin`.

B: The `groupby()` method has no optional `inplace` parameter because it does not modify the data; instead, it constructs and returns a new `GroupBy` object. This line results in a `TypeError` with a message such as: 'groupby() got an unexpected keyword argument 'inplace''.

C: This line of code attempts to call the `groupby()` method on a variable, `HNL_flights_by_origin`, that is not yet defined. This line results in a `NameError` with the message: 'name 'HNL_flights_by_origin' is not defined'.

D: This line of code calls the top-level `pandas` function `groupby()`, which does exist as of `pandas` version '0.23.4' but is deprecated and will be removed in future versions of `pandas`. Furthermore, the function is not used properly: the argument passed is only the `ORIGIN_AIRPORT` column of `HNL_flights_df` rather than the complete `DataFrame`, and the positional parameter `by` is missing.
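The difference between options A and B can be checked on a small example. The sketch below uses a made-up stand-in, `toy_flights_df`, rather than the real `HNL_flights_df`; the data values and the names `toy_by_origin`, `n_groups`, and `inplace_error` are assumptions introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for HNL_flights_df (values are made up).
toy_flights_df = pd.DataFrame({
    'ORIGIN_AIRPORT': ['LAX', 'HNL', 'LAX', 'SFO'],
    'DEPARTURE_DELAY': [5.0, -2.0, 11.0, 0.0],
})

# Option A's pattern: groupby() returns a new GroupBy object
# and leaves the original DataFrame unchanged.
toy_by_origin = toy_flights_df.groupby('ORIGIN_AIRPORT')
n_groups = toy_by_origin.ngroups  # one group per unique origin airport

# Option B's pattern: groupby() has no `inplace` parameter,
# so the call raises a TypeError.
try:
    toy_flights_df.groupby('ORIGIN_AIRPORT', inplace=True)
    inplace_error = None
except TypeError as err:
    inplace_error = err

print(n_groups)
print(type(inplace_error).__name__)
```

Because `groupby()` only describes how the rows are split, the result has to be assigned to a variable before it can be used in later steps.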

---

Building on the first part of this exercise, which of the following lines of code will extract the subset, or group, of rows in `HNL_flights_df` that share the `ORIGIN_AIRPORT` value `LAX` and save the result into the `DataFrame` `LAX_to_HNL_df`?


A:
```python
LAX_to_HNL_df = HNL_flights_by_origin.get_group(ORIGIN_AIRPORT = 'LAX')
```

B:
```python
LAX_to_HNL_df = HNL_flights_by_origin.loc[:,'LAX']
```

C:
```python
LAX_to_HNL_df = HNL_flights_by_origin['LAX']
```

D:
```python
LAX_to_HNL_df = HNL_flights_by_origin.get_group('LAX')
```


**Correct Answer**

D:
```python
LAX_to_HNL_df = HNL_flights_by_origin.get_group('LAX')
```

**Explanation**

A: This option misuses the `GroupBy` method `get_group()`. The method expects the name of the group, not a keyword argument named after the grouping column, so this call raises a `TypeError`. The proper usage of this method is explained in chapter 4 in the 'Group Specific Processing' cell.

B: A `GroupBy` object has no `loc` attribute, so this line raises an `AttributeError`.

C: `HNL_flights_by_origin` is a `GroupBy` object, and square-bracket indexing on a `GroupBy` object selects columns, not groups. Since `LAX` is not a column of `HNL_flights_df`, this line raises a `KeyError`.

D: This line of code correctly uses the `GroupBy` method `get_group()` to extract the subset, or group, of rows in `HNL_flights_df` that share the `ORIGIN_AIRPORT` value `LAX` and saves the result into the `DataFrame` `LAX_to_HNL_df`.
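Options C and D can be compared with a minimal sketch. The data here is a made-up stand-in (`toy_flights_df`), not the real flights data, and the names `toy_by_origin`, `lax_rows`, and `bracket_error` are assumptions used only for this illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for HNL_flights_df (values are made up).
toy_flights_df = pd.DataFrame({
    'ORIGIN_AIRPORT': ['LAX', 'HNL', 'LAX', 'SFO'],
    'DESTINATION_AIRPORT': ['HNL', 'OGG', 'HNL', 'HNL'],
})
toy_by_origin = toy_flights_df.groupby('ORIGIN_AIRPORT')

# Option D's pattern: get_group() takes the group name and returns
# the matching rows as a DataFrame.
lax_rows = toy_by_origin.get_group('LAX')

# Option C's pattern: square brackets on a GroupBy select a column,
# and 'LAX' is a group name, not a column name.
try:
    toy_by_origin['LAX']
    bracket_error = None
except KeyError as err:
    bracket_error = err

print(lax_rows)
print(type(bracket_error).__name__)
```

The rows returned by `get_group()` keep their original index labels, which makes it easy to trace them back to the source `DataFrame`.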


# Exercise 6.2: Aggregate

Using the code cell below, create a new `DataFrame` named `delay_by_origin` that is indexed by the unique origin airports in `HNL_flights_df` and contains the median departure and arrival delays for groups of flights with common origin airports. 

**One Possible Solution**

```python
delay_by_origin = HNL_flights_df.loc[:, ['ORIGIN_AIRPORT', 'DEPARTURE_DELAY', 'ARRIVAL_DELAY']].groupby('ORIGIN_AIRPORT').median()
```

In [8]:
delay_by_origin = HNL_flights_df.loc[:, ['ORIGIN_AIRPORT', 'DEPARTURE_DELAY', 'ARRIVAL_DELAY']].groupby('ORIGIN_AIRPORT').median()
delay_by_origin.head()

| ORIGIN_AIRPORT | DEPARTURE_DELAY | ARRIVAL_DELAY |
|:----------|-----------|-----------|
| ANC | -5.0 | -17.0 |
| ATL | -1.0 | -20.0 |
| BLI | -3.0 | -10.5 |
| DEN | 12.0 | -1.0 |
| DFW | 0.0 | 7.0 |
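As a variation on the solution above, the columns to aggregate can also be selected after grouping rather than before. The sketch below is only one way to write it, under the assumption of a made-up stand-in `toy_flights_df`; the names `delay_columns`, `toy_delay_by_origin`, and `same_result` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for HNL_flights_df (values are made up).
toy_flights_df = pd.DataFrame({
    'ORIGIN_AIRPORT': ['LAX', 'LAX', 'LAX', 'SFO', 'SFO'],
    'DEPARTURE_DELAY': [1.0, 5.0, 30.0, -2.0, 4.0],
    'ARRIVAL_DELAY': [-3.0, 0.0, 12.0, 6.0, 8.0],
})

delay_columns = ['DEPARTURE_DELAY', 'ARRIVAL_DELAY']

# Group first, then select the columns to aggregate.
toy_delay_by_origin = (
    toy_flights_df.groupby('ORIGIN_AIRPORT')[delay_columns].median()
)

# The pattern from the solution: select the columns first, then group.
same_result = (
    toy_flights_df.loc[:, ['ORIGIN_AIRPORT'] + delay_columns]
    .groupby('ORIGIN_AIRPORT')
    .median()
)

print(toy_delay_by_origin)
```

Both forms produce a `DataFrame` indexed by the unique origin airports. The median is less sensitive than the mean to a single very late flight, such as the 30.0 in the `LAX` group here.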


# Exercise 6.3: Transform

Using the code cell below:
* Create a new `DataFrame` named `distance_and_day_df` that is a subset of `HNL_flights_df` containing only the `DAY` and `DISTANCE` columns.
* Group `distance_and_day_df` by the `DAY` column and save the resulting `GroupBy` object in the variable `distance_by_day`.
* Transform the `DISTANCE` column by calculating, for each flight, its distance as a percentage of the total distance flown on the same day. Save the result in a new column of `HNL_flights_df` named `DISTANCE_PCT`. Use the function pre-defined in the cell to perform the transformation.

**One Possible Solution**
```python
def percent_of_total(x):
    # Express each value as a percentage of its group's total.
    return (x / x.sum()) * 100

distance_and_day_df = HNL_flights_df.loc[:, ['DAY', 'DISTANCE']]
distance_by_day = distance_and_day_df.groupby('DAY')
HNL_flights_df['DISTANCE_PCT'] = distance_by_day.transform(percent_of_total)
```

In [12]:
def percent_of_total(x):
    # Express each value as a percentage of its group's total.
    return (x / x.sum()) * 100

distance_and_day_df = HNL_flights_df.loc[:, ['DAY', 'DISTANCE']]
distance_by_day = distance_and_day_df.groupby('DAY')
HNL_flights_df['DISTANCE_PCT'] = distance_by_day.transform(percent_of_total)
HNL_flights_df.head()

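| | YEAR | MONTH | DAY | DAY_OF_WEEK | FLIGHT_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY | AIR_TIME | DISTANCE | ARRIVAL_DELAY | DISTANCE_PCT |
|:----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 2015 | 12 | 1 | 2 | 1730 | HNL | SFO | 3.0 | 276.0 | 2398 | -21.0 | 0.812884 |
| 1 | 2015 | 12 | 1 | 2 | 17 | LAS | HNL | 4.0 | 361.0 | 2762 | -2.0 | 0.936274 |
| 2 | 2015 | 12 | 1 | 2 | 102 | HNL | ITO | -4.0 | 36.0 | 216 | -8.0 | 0.073221 |
| 3 | 2015 | 12 | 1 | 2 | 108 | HNL | KOA | -6.0 | 28.0 | 163 | -9.0 | 0.055254 |
| 4 | 2015 | 12 | 1 | 2 | 206 | HNL | OGG | -4.0 | 22.0 | 100 | -1.0 | 0.033898 |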
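To see what the transformation does, it can help to run it on data small enough to check by hand. The sketch below assumes a made-up stand-in `toy_flights_df` and selects the `DISTANCE` column from the `GroupBy` object before calling `transform()`, which is a slight variation on the solution above; `pct_sum_by_day` is a hypothetical name used for the sanity check.

```python
import pandas as pd

def percent_of_total(x):
    # Express each value as a percentage of its group's total.
    return (x / x.sum()) * 100

# Hypothetical miniature stand-in for HNL_flights_df (values are made up).
toy_flights_df = pd.DataFrame({
    'DAY': [1, 1, 2, 2, 2],
    'DISTANCE': [100, 300, 50, 50, 100],
})

# transform() returns one value per original row, aligned on the
# original index, so the result can be assigned directly as a new column.
toy_flights_df['DISTANCE_PCT'] = (
    toy_flights_df.groupby('DAY')['DISTANCE'].transform(percent_of_total)
)

# Sanity check: the percentages within each day should add up to 100.
pct_sum_by_day = toy_flights_df.groupby('DAY')['DISTANCE_PCT'].sum()

print(toy_flights_df)
print(pct_sum_by_day)
```

Unlike an aggregation, which returns one row per group, a transformation returns a result with the same number of rows as the input.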


# Exercise 6.4: Filter

Using the code cell below, 
* Group `HNL_flights_df` by the `DAY` column and save the resulting `GroupBy` object in the variable `hnl_flights_by_day`. 
* Filter the flights by day according to whether the day's total `ARRIVAL_DELAY` was net positive: keep all flights from days whose arrival delays sum to a positive value, and filter out the flights from all other days. Save the resulting `DataFrame` into the variable `HNL_flights_delayed_days_df`. Use the function pre-defined in the cell to perform the filtering.

**One Possible Solution**
```python
def net_postive_arrival_delay(x):
    # True when the group's arrival delays sum to a positive value.
    return x.ARRIVAL_DELAY.sum() > 0

hnl_flights_by_day = HNL_flights_df.groupby('DAY')
HNL_flights_delayed_days_df = hnl_flights_by_day.filter(net_postive_arrival_delay)
```

In [13]:
def net_postive_arrival_delay(x):
    # True when the group's arrival delays sum to a positive value.
    return x.ARRIVAL_DELAY.sum() > 0

hnl_flights_by_day = HNL_flights_df.groupby('DAY')
HNL_flights_delayed_days_df = hnl_flights_by_day.filter(net_postive_arrival_delay)
HNL_flights_delayed_days_df.head()

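| | YEAR | MONTH | DAY | DAY_OF_WEEK | FLIGHT_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY | AIR_TIME | DISTANCE | ARRIVAL_DELAY | DISTANCE_PCT |
|:----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 2015 | 12 | 1 | 2 | 1730 | HNL | SFO | 3.0 | 276.0 | 2398 | -21.0 | 0.812884 |
| 1 | 2015 | 12 | 1 | 2 | 17 | LAS | HNL | 4.0 | 361.0 | 2762 | -2.0 | 0.936274 |
| 2 | 2015 | 12 | 1 | 2 | 102 | HNL | ITO | -4.0 | 36.0 | 216 | -8.0 | 0.073221 |
| 3 | 2015 | 12 | 1 | 2 | 108 | HNL | KOA | -6.0 | 28.0 | 163 | -9.0 | 0.055254 |
| 4 | 2015 | 12 | 1 | 2 | 206 | HNL | OGG | -4.0 | 22.0 | 100 | -1.0 | 0.033898 |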
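The filtering step keeps or drops whole groups, which is easier to see on a tiny example. The sketch below assumes a made-up stand-in `toy_flights_df`; the function name `has_net_positive_arrival_delay` and the variable `toy_delayed_days_df` are hypothetical and mirror, rather than reproduce, the solution above.

```python
import pandas as pd

def has_net_positive_arrival_delay(group):
    # Called once per group with that group's rows as a DataFrame;
    # returns a single True/False decision for the whole group.
    return group['ARRIVAL_DELAY'].sum() > 0

# Hypothetical miniature stand-in for HNL_flights_df (values are made up).
toy_flights_df = pd.DataFrame({
    'DAY': [1, 1, 2, 2],
    'ARRIVAL_DELAY': [10.0, -4.0, -8.0, 3.0],
})

# Day 1 sums to +6.0 and is kept in full, including its early flight;
# day 2 sums to -5.0 and is dropped in full, including its late flight.
toy_delayed_days_df = (
    toy_flights_df.groupby('DAY').filter(has_net_positive_arrival_delay)
)

print(toy_delayed_days_df)
```

Note that the decision is made per day, not per flight: an individual flight that arrived early still appears in the result if its day as a whole was net delayed.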
