# Week 9 Problem 3

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

from nose.tools import ok_, assert_equal

# Problem 1. XML.

In this problem, we will use Pandas to find the top 20 airports by traffic, and then parse XML to extract delay information for those airports.

We will use the `Dest` and `Origin` columns of the 2001 flight data `2001.csv`. Note that each airport is identified by its [IATA code](https://en.wikipedia.org/wiki/International_Air_Transport_Association_code).

In [2]:
dest_origin = pd.read_csv(
    '/home/data_scientist/data/2001.csv', # edit this path if necessary
    encoding='latin-1',
    usecols=('Dest', 'Origin')
)

print(dest_origin.head())

  Origin Dest
0    BWI  CLT
1    BWI  CLT
2    BWI  CLT
3    BWI  CLT
4    BWI  CLT


## 1. Function: get_total_flights()

Your first task is to add the number of departures and the number of arrivals in 2001 to find which 20 airports had the most flights.

Count the total number of departures from and arrivals to each airport.
  In other words, first count the number of times each airport appears in the `Dest` column
  to get
  
    Dest
    ABE      5262
    ABI      2567
    ABQ     36229
    ACT      2682
    ADQ       726
    
  (only the first 5 rows are shown).
  Then, count the number of times each airport appears in the `Origin` column to get
  
    Origin
    ABE        5262
    ABI        2567
    ABQ       36248
    ACT        2686
    ACY           1

  Finally, add them up to get the total number:
  
    ABE    10524
    ABI     5134
    ABQ    72477
    ACT     5368
    ACY        1

**Hint 1**: I would use `groupby(...).size()` on the `Dest` and `Origin` columns to get the number of arrivals and departures, respectively.

**Hint 2**: If you simply add the two results with `df1 + df2` (where `df1` is the result of doing `groupby().size()` on the `Dest` column and `df2` is the result of doing `groupby().size()` on the `Origin` column), the sum will have some rows with `NaN` values (try this!).

For example, note that `ACY` never appears in the `Dest` column, while exactly 1 flight originated from `ACY` according to the `Origin` column. In this case, the number of flights for `ACY` in `df1 + df2` will be `NaN`.

So, you need a way to handle these missing entries. I suggest that you use [pandas.DataFrame.add()](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.add.html) so that you can apply the `fill_value` parameter to fill the missing values with 0.

**Hint 3**: It seems that [pandas.DataFrame.add()](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.add.html) returns `float64` values by default. But we are only dealing with integer values here, so use [pandas.DataFrame.astype()](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.astype.html) to change the data type to `np.int32`.

**Hint 4**: As always, this is just a suggestion. If you see an easier or better approach, use it.
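
The behavior described in Hints 2 and 3 can be seen on a toy example. This is only a minimal sketch: the two small Series below (`arrivals` and `departures`) are made-up values that mimic the shape of the `groupby().size()` results, not the real 2001 counts.

```python
import numpy as np
import pandas as pd

# Hypothetical counts, shaped like the output of groupby(...).size().
arrivals = pd.Series([3, 2], index=['ABE', 'ABI'])
departures = pd.Series([3, 2, 1], index=['ABE', 'ABI', 'ACY'])

# Plain addition aligns on the index; ACY is missing from `arrivals`,
# so its total becomes NaN.
naive = arrivals + departures
print(naive)

# add() with fill_value=0 treats the missing entry as 0 instead.
# The result is typically floating point, so cast back to an integer type.
total = arrivals.add(departures, fill_value=0).astype(np.int32)
print(total)
```

Casting with `astype` only works once the `NaN` values are gone, which is why the fill has to happen before the conversion.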

In [3]:
def get_total_flights(df):
    '''
    Takes a dataframe that has two columns Dest and Origin
    and returns a Series of total flight counts
    (arrivals plus departures) indexed by IATA codes.
    
    Parameters
    ----------
    df: pandas.DataFrame
    
    Returns
    -------
    pandas.Series
    '''
    
    # YOUR CODE HERE
    # Count arrivals: how many times each airport appears in Dest
    df1 = df.groupby('Dest').size()
    # Count departures: how many times each airport appears in Origin
    df2 = df.groupby('Origin').size()
    # Add them up, treating missing airports as 0, and cast to np.int32
    result = df1.add(df2, fill_value=0).astype('int32')
    return result

Visually inspect the resulting Series.

In [4]:
flights = get_total_flights(dest_origin)
print(flights)

ABE     10524
ABI      5134
ABQ     72477
ACT      5368
ACY         1
ADQ      1452
AKN       568
ALB     32713
AMA     12267
ANC     42381
APF       725
ATL    503163
AUS     85809
AVL      3172
AVP      2893
AZO      5290
BDL     71983
BET      2306
BFL      3338
BGM       751
BGR      7417
BHM     37566
BIL      6249
BIS      2779
BMI      2869
BNA    112603
BOI     24152
BOS    266032
BPT      3481
BQN       518
        ...  
SHV     12011
SIT      2758
SJC    144653
SJT      4505
SJU     52957
SLC    152859
SMF     80394
SNA     86871
SPS      3985
SRQ      9044
STL    324477
STT      6723
STX      1817
SUX       546
SWF      2386
SYR     22281
TLH      2957
TOL      4483
TPA    137286
TRI      1095
TUL     45562
TUS     39101
TVC      5067
TXK      3475
TYR      6361
TYS     11131
VPS      3455
WRG      1452
XNA     11749
YAK      1450
dtype: int32


In [5]:
test1 = pd.DataFrame({
    'Dest': ['A', 'B', 'A', 'A', 'C'],
    'Origin': ['B', 'A', 'B', 'B', 'A']
    })

answer1 = pd.Series([5, 4, 1], index=['A', 'B', 'C'], dtype=np.int32)

test2 = pd.DataFrame({
    'Dest': ['A', 'B'],
    'Origin': ['C', 'D']
    })

answer2 = pd.Series([1, 1, 1, 1], index=['A', 'B', 'C', 'D'], dtype=np.int32)

ok_(get_total_flights(test1).equals(answer1))
ok_(get_total_flights(test2).equals(answer2))

To keep the problem simple, we will use only the top 20 airports.

In [6]:
top20 = flights.sort_values(ascending=False, inplace=False)[:20]
print(top20)

ORD    682636
DFW    624361
ATL    503163
LAX    450019
PHX    368631
STL    324477
DTW    297522
MSP    284955
LAS    272293
BOS    266032
DEN    265184
IAH    257193
CLT    256626
SFO    243473
EWR    241016
PHL    239390
LGA    232964
PIT    212738
SEA    205486
BWI    199674
dtype: int32


## 2. Function: is_delayed()

- Write a function named `is_delayed` that takes an XML string (str) and returns `None` if the airport is not delayed, or a tuple of `(MinDelay, MaxDelay)` (both strings) if the airport is delayed.

In [Problem 5.2](https://github.com/UI-DataScience/info490-fa15/blob/master/Week5/assignment/requests.ipynb), we used the [FAA airport service](http://services.faa.gov/docs/services/airport/), which lets us get the airport status, including known delays and weather data. We requested the response in JSON format in Problem 5.2, but now that we have learned about XML, we will request the response in XML.

From the XML response, use the [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml) module or the [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) module to parse the XML and extract the delay information. The XML files we are going to handle are similar to the sample XML response at the [FAA airport service](http://services.faa.gov/docs/services/airport/) webpage.
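
As a rough illustration of the parsing step, the sketch below reads a few tags with `xml.etree.ElementTree`. The `sample` string is a trimmed, made-up response that keeps only the delay-related tags; a real response contains many more elements.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical response with only the tags relevant to delays.
sample = '''<AirportStatus>
  <Delay>true</Delay>
  <IATA>ORD</IATA>
  <Status>
    <MinDelay>16 minutes</MinDelay>
    <MaxDelay>30 minutes</MaxDelay>
  </Status>
</AirportStatus>'''

root = ET.fromstring(sample)  # root is the <AirportStatus> element

# find() returns the first direct child with the given tag.
delay_flag = root.find('Delay').text

# findtext() accepts a path, so nested tags can be read in one step.
min_delay = root.findtext('Status/MinDelay')
max_delay = root.findtext('Status/MaxDelay')

print(delay_flag, min_delay, max_delay)
```

Note that `MinDelay` and `MaxDelay` are children of `Status`, not of the root, so they have to be reached through `Status` (either with a path as above or with two chained `find()` calls).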

In [7]:
def is_delayed(xml):
    '''
    Takes an XML string and returns the delay range, if any.
    
    Parameter
    ---------
    xml (str): an XML code to parse.

    Returns
    -------
    None if the 'Delay' tag in the XML is false, e.g. <Delay>false</Delay>
    If the 'Delay' tag is true, e.g. <Delay>true</Delay>, the function returns
    a tuple of two strings, 'MinDelay' and 'MaxDelay'.
    For example, when
    <Delay>true</Delay>
    <MinDelay>16 minutes</MinDelay>
    <MaxDelay>30 minutes</MaxDelay>
    the function returns ('16 minutes', '30 minutes').
    '''
    
    # YOUR CODE HERE
    # Parse the XML string; root is the top-level element
    root = ET.fromstring(xml)
    # If the airport is delayed
    if root.find('Delay').text == 'true':
        # MinDelay and MaxDelay are nested inside the Status element
        stat = root.find('Status')
        # Extract the two delay strings
        mindelay = stat.find('MinDelay').text
        maxdelay = stat.find('MaxDelay').text
        result = (mindelay, maxdelay)
    # If the airport is not delayed, return None
    else:
        result = None
    return result

The following code cell makes XML requests to the FAA service. The FAA service reports in real time, while the XML files on GitHub are static, so results may differ.

In [8]:
for airport in top20.index:
    url = 'http://services.faa.gov/airport/status/{}'.format(airport)
    payload = {
        'format': 'application/xml'
    }
    # Pass the payload so the service is asked for an XML response
    r = requests.get(url, params=payload)
    delay = is_delayed(r.text)
    if delay is None:
        print('{} is not delayed.'.format(airport))
    else:
        print('{} is delayed by {} to {}.'.format(airport, delay[0], delay[1]))

ORD is not delayed.
DFW is not delayed.
ATL is not delayed.
LAX is not delayed.
PHX is not delayed.
STL is not delayed.
DTW is not delayed.
MSP is not delayed.
LAS is not delayed.
BOS is not delayed.
DEN is not delayed.
IAH is not delayed.
CLT is not delayed.
SFO is not delayed.
EWR is not delayed.
PHL is not delayed.
LGA is not delayed.
PIT is not delayed.
SEA is not delayed.
BWI is not delayed.
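
Query parameters such as `format` reach the server as part of the URL's query string. As a rough sketch of what that string looks like, the standard library's `urlencode` can build it without making a network call; `requests` performs a comparable encoding when a dictionary is passed through its `params` argument. The airport code `ORD` is used here only as an example.

```python
from urllib.parse import urlencode

airport = 'ORD'  # example airport code
url = 'http://services.faa.gov/airport/status/{}'.format(airport)
payload = {'format': 'application/xml'}

# urlencode() percent-encodes the value, so '/' becomes '%2F'.
query = urlencode(payload)
full_url = '{}?{}'.format(url, query)
print(full_url)
```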


Your function should pass the tests in the following code cell without an error.

In [9]:
test1 = '''
<?xml version="1.0" encoding="UTF-8"?><AirportStatus> 
					<Delay>true</Delay>
				
					<IATA>ORD</IATA>
				
					<State>Illinois</State>
				
					<Name>Chicago OHare International</Name>
				
					<Weather><Visibility>10.00</Visibility>
				
					<Weather>Partly Cloudy</Weather>
				
					<Meta><Credit>NOAA&apos;s National Weather Service</Credit>
				
					<Updated>12:51 PM Local</Updated>
				
					<Url>http://weather.gov/</Url></Meta>
				
					<Temp>53.0 F (11.7 C)</Temp>
				
					<Wind>Northwest at 12.7mph</Wind></Weather>
				
					<ICAO>KORD</ICAO>
				
					<City>Chicago</City>
				
					<Status><Reason>VOL:Multi-taxi</Reason>
				
					<ClosureBegin></ClosureBegin>
				
					<EndTime></EndTime>
				
					<MinDelay>16 minutes</MinDelay>
				
					<AvgDelay></AvgDelay>
				
					<MaxDelay>30 minutes</MaxDelay>
				
					<ClosureEnd></ClosureEnd>
				
					<Trend>Increasing</Trend>
				
					<Type>Departure</Type></Status>
				</AirportStatus> 
'''.strip()

test2 = '''
<?xml version="1.0" encoding="UTF-8"?><AirportStatus> 
					<Delay>false</Delay>
				
					<IATA>SFO</IATA>
				
					<State>California</State>
				
					<Name>San Francisco International</Name>
				
					<Weather><Visibility>10.00</Visibility>
				
					<Weather>Partly Cloudy</Weather>
				
					<Meta><Credit>NOAA&apos;s National Weather Service</Credit>
				
					<Updated>12:56 PM Local</Updated>
				
					<Url>http://weather.gov/</Url></Meta>
				
					<Temp>68.0 F (20.0 C)</Temp>
				
					<Wind>North at 5.8mph</Wind></Weather>
				
					<ICAO>KSFO</ICAO>
				
					<City>San Francisco</City>
				
					<Status><Reason>No known delays for this airport.</Reason>
				
					<ClosureBegin></ClosureBegin>
				
					<EndTime></EndTime>
				
					<MinDelay></MinDelay>
				
					<AvgDelay></AvgDelay>
				
					<MaxDelay></MaxDelay>
				
					<ClosureEnd></ClosureEnd>
				
					<Trend></Trend>
				
					<Type></Type></Status>
				</AirportStatus> 
'''.strip()

assert_equal(is_delayed(test1), ('16 minutes', '30 minutes'))
assert_equal(is_delayed(test2), None)