# Week 14 Problem 2

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and CheckPoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [None]:
import numpy as np
import pandas as pd
import sqlite3 as sl
import requests
from bs4 import BeautifulSoup

from nose.tools import assert_equal

from IPython.display import SVG, display_svg

In this assignment, we will visualize the total number of flights and current temperature of the top 20 airports using the 2001 flight data. This week's assignment is one long, continuous problem, but I split it up into two sections for easier grading. Before you start coding, read the entire notebook first to understand the big picture.

This is where we are going:

In [None]:
# run this cell if there's no image
r = requests.get("https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week14/assignment/top20.svg")
SVG(r.content)

The circles are the top 20 airports in the U.S. The size of each circle is proportional to the total number of arrivals and departures in 2001. The redder the color, the higher the temperature; the bluer the color, the lower the temperature. Thus, we will visualize three pieces of information in one single plot: the location of major airports, the size of the airports, and the current temperature.

# Problem 1.

Recall that in [Problem 9.3](https://github.com/UI-DataScience/info490-fa16/blob/master/Week9/assignment/Problem_3.ipynb) we wrote a function named `get_total_flights()` that adds the number of departures and the number of arrivals in 2001:

```python
dest_origin = pd.read_csv("2001.csv", encoding='latin-1', usecols=('Dest', 'Origin'))
flights = get_total_flights(dest_origin)
```

We also found the 20 airports with the most flights:

```python
top20 = flights.sort_values(ascending=False)[:20]
```
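
As a reminder of the idea behind those two steps, here is a minimal sketch on made-up data. The frame `toy` and the helper `count_flights_sketch()` are hypothetical stand-ins, not the Problem 9.3 solution, and the sketch assumes the totals come back as a Pandas Series. It uses `Series.sort_values`, which current Pandas versions provide for sorting.

```python
import pandas as pd

# Made-up rows standing in for the 'Origin' and 'Dest' columns of 2001.csv.
toy = pd.DataFrame({
    "Origin": ["ORD", "ORD", "SFO", "LAX"],
    "Dest":   ["SFO", "LAX", "ORD", "ORD"],
})

def count_flights_sketch(df):
    """Hypothetical helper: departures plus arrivals per airport code."""
    departures = df["Origin"].value_counts()
    arrivals = df["Dest"].value_counts()
    # fill_value=0 keeps airports that appear in only one of the two columns.
    return departures.add(arrivals, fill_value=0).astype(int)

totals = count_flights_sketch(toy)

# Sort in descending order and keep the first two entries.
top2 = totals.sort_values(ascending=False)[:2]
print(top2)
```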

Suppose that we have stored the result (the data frame `top20`) in an SQL database named [top20.db](https://github.com/UI-DataScience/info490-fa15/blob/master/Week14/assignment/top20.db?raw=true). You can download `top20.db` from the course repository on GitHub:

In [None]:
!wget -O top20.db https://github.com/UI-DataScience/info490-fa15/blob/master/Week14/assignment/top20.db?raw=true

### Function: read\_top20()

Your first task is to

- Write a function named `read_top20()` that takes the file name (`top20.db`) and returns a Pandas data frame with the contents of the `top20` table in the database.
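
If you have not read an SQL table into Pandas before, the following sketch shows one possible pattern on a throwaway in-memory database. The table name `demo` and its columns are made up for illustration; this is a hint about the general approach, not the graded solution.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database; "demo" and its columns are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (code TEXT, n INTEGER)")
con.executemany("INSERT INTO demo VALUES (?, ?)", [("AAA", 3), ("BBB", 1)])
con.commit()

# read_sql_query runs the SELECT and returns the rows as a data frame.
df = pd.read_sql_query("SELECT * FROM demo", con)
con.close()

print(df)
```

Passing a file name instead of `":memory:"` to `sqlite3.connect()` opens a database stored on disk.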

In [None]:
def read_top20(db):
    """
    Takes the file name of an SQL database.
    Returns a Pandas data frame of all rows in the "top20" table of that database.
    
    Parameters
    ----------
    db (str): file name of SQL database.
    
    Returns
    -------
    A Pandas.DataFrame with all rows in "top20" table.
    """
    # YOUR CODE HERE
    return result

So, when we do

```python
top20 = read_top20("top20.db")
print(top20)
```

the output should be

```
   iata  flights
0   ORD   682636
1   DFW   624361
2   ATL   503163
3   LAX   450019
4   PHX   368631
5   STL   324477
6   DTW   297522
7   MSP   284955
8   LAS   272293
9   BOS   266032
10  DEN   265184
11  IAH   257193
12  CLT   256626
13  SFO   243473
14  EWR   241016
15  PHL   239390
16  LGA   232964
17  PIT   212738
18  SEA   205486
19  BWI   199674
```

In [None]:
top20 = read_top20("top20.db")
print(top20)

In [None]:
answer = {
    "iata": [
        'ORD', 'DFW', 'ATL', 'LAX', 'PHX',
        'STL', 'DTW', 'MSP', 'LAS', 'BOS',
        'DEN', 'IAH', 'CLT', 'SFO', 'EWR',
        'PHL', 'LGA', 'PIT', 'SEA', 'BWI'],
    "flights": [
        682636, 624361, 503163, 450019, 368631,
        324477, 297522, 284955, 272293, 266032,
        265184, 257193, 256626, 243473, 241016,
        239390, 232964, 212738, 205486, 199674]
}

answer_df = pd.DataFrame(answer)

np.testing.assert_array_equal(top20["iata"].values, answer_df["iata"].values)
np.testing.assert_array_equal(top20["flights"].values, answer_df["flights"].values)

Our airports are identified by IATA codes in the data frame, so we need to match the IATA codes with the city names.

The next problem is similar to `get_airport()` in [Problem 5.3](https://github.com/UI-DataScience/info490-fa16/blob/master/Week5/assignment/Problem_3.ipynb) and `is_delayed()` in [Problem 9.3](https://github.com/UI-DataScience/info490-fa16/blob/master/Week9/assignment/Problem_3.ipynb), but you should use [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) in this problem.

# Problem 2.

### Function: get\_city()

- Write a function named `get_city` that takes an XML document as a string (str) and returns the city name (str). See the unit tests below.
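
As a hint, here is a minimal sketch of pulling the text of one tag out of an XML string with Beautiful Soup. The document `toy_xml` and its tag names are made up, and the choice of the built-in `"html.parser"` is an assumption made so that the sketch runs without extra packages. HTML-oriented parsers lowercase tag names, which is why the lookup uses `"town"` rather than `"Town"`.

```python
from bs4 import BeautifulSoup

# Made-up XML; the tag names are hypothetical and unrelated to the unit tests.
toy_xml = """<?xml version="1.0" encoding="UTF-8"?><Station>
    <Code>XYZ</Code>
    <Town>Springfield</Town>
    <Reading><Level>12.5 ft (3.8 m)</Level></Reading>
</Station>"""

# "html.parser" ships with Python; it lowercases tag names while parsing.
soup = BeautifulSoup(toy_xml, "html.parser")

# find() returns the first matching tag; get_text() returns its text content.
town = soup.find("town").get_text()
print(town)
```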

In [None]:
def get_city(xml):
    """
    Takes an XML and returns the city name.
    
    Parameters
    ----------
    xml (str): An XML script.
    
    Returns
    -------
    A string.
    """
    # YOUR CODE HERE
    return result

In [None]:
test1 = '''
<?xml version="1.0" encoding="UTF-8"?><AirportStatus> 
					<Delay>true</Delay>
				
					<IATA>ORD</IATA>
				
					<State>Illinois</State>
				
					<Name>Chicago OHare International</Name>
				
					<Weather><Visibility>10.00</Visibility>
				
					<Weather>Partly Cloudy</Weather>
				
					<Meta><Credit>NOAA&apos;s National Weather Service</Credit>
				
					<Updated>12:51 PM Local</Updated>
				
					<Url>http://weather.gov/</Url></Meta>
				
					<Temp>53.0 F (11.7 C)</Temp>
				
					<Wind>Northwest at 12.7mph</Wind></Weather>
				
					<ICAO>KORD</ICAO>
				
					<City>Chicago</City>
				
					<Status><Reason>VOL:Multi-taxi</Reason>
				
					<ClosureBegin></ClosureBegin>
				
					<EndTime></EndTime>
				
					<MinDelay>16 minutes</MinDelay>
				
					<AvgDelay></AvgDelay>
				
					<MaxDelay>30 minutes</MaxDelay>
				
					<ClosureEnd></ClosureEnd>
				
					<Trend>Increasing</Trend>
				
					<Type>Departure</Type></Status>
				</AirportStatus> 
'''.strip()

test2 = '''
<?xml version="1.0" encoding="UTF-8"?><AirportStatus> 
					<Delay>false</Delay>
				
					<IATA>SFO</IATA>
				
					<State>California</State>
				
					<Name>San Francisco International</Name>
				
					<Weather><Visibility>10.00</Visibility>
				
					<Weather>Partly Cloudy</Weather>
				
					<Meta><Credit>NOAA&apos;s National Weather Service</Credit>
				
					<Updated>12:56 PM Local</Updated>
				
					<Url>http://weather.gov/</Url></Meta>
				
					<Temp>68.0 F (20.0 C)</Temp>
				
					<Wind>North at 5.8mph</Wind></Weather>
				
					<ICAO>KSFO</ICAO>
				
					<City>San Francisco</City>
				
					<Status><Reason>No known delays for this airport.</Reason>
				
					<ClosureBegin></ClosureBegin>
				
					<EndTime></EndTime>
				
					<MinDelay></MinDelay>
				
					<AvgDelay></AvgDelay>
				
					<MaxDelay></MaxDelay>
				
					<ClosureEnd></ClosureEnd>
				
					<Trend></Trend>
				
					<Type></Type></Status>
				</AirportStatus> 
'''.strip()

assert_equal(get_city(test1), "Chicago")
assert_equal(get_city(test2), "San Francisco")

# Problem 3.

### Function: get\_temp()

- Write a function named `get_temp` that takes an XML document as a string (str) and returns the current temperature in degrees Fahrenheit (float). See the unit tests below.
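
Once you have the text of the right tag, you still need to turn it into a number. The sketch below shows one way to do that for a string whose first whitespace-separated token is the number. The helper `leading_number()` and the sample string are hypothetical; other approaches, such as a regular expression, would work equally well.

```python
def leading_number(text):
    """Hypothetical helper: convert the first whitespace-separated token to float."""
    return float(text.split()[0])

# Made-up reading in the same "value unit (value unit)" shape.
value = leading_number("12.5 ft (3.8 m)")
print(value)
```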

In [None]:
def get_temp(xml):
    """
    Takes an XML and returns the temperature.
    
    Parameters
    ----------
    xml (str): An XML script.
    
    Returns
    -------
    A float.
    """
    # YOUR CODE HERE
    return result

In [None]:
assert_equal(get_temp(test1), 53.0)
assert_equal(get_temp(test2), 68.0)

Let's use `get_city()` and `get_temp()` to add the city names and temperatures to the data frame. Make sure that `get_city()` and `get_temp()` pass the unit tests before running the following code cell, because too many HTTP requests may lock you out for a while. You should get:

```python
print(top20)
```
```
   iata  flights             city  temperature
0   ORD   682636          Chicago           53
1   DFW   624361  Dallas-Ft Worth           81
2   ATL   503163          Atlanta           75
3   LAX   450019      Los Angeles           71
4   PHX   368631          Phoenix           91
5   STL   324477         St Louis           63
6   DTW   297522          Detroit           53
7   MSP   284955      Minneapolis           44
8   LAS   272293        Las Vegas           76
9   BOS   266032           Boston           61
10  DEN   265184           Denver           61
11  IAH   257193          Houston           86
12  CLT   256626        Charlotte           74
13  SFO   243473    San Francisco           66
14  EWR   241016           Newark           63
15  PHL   239390     Philadelphia           65
16  LGA   232964         New York           61
17  PIT   212738       Pittsburgh           55
18  SEA   205486          Seattle           58
19  BWI   199674        Baltimore           63
```

In [None]:
for idx, row in top20.iterrows():
    url = "https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week14/assignment/xml/{}"
    r = requests.get(url.format(row["iata"]))
    top20.loc[idx, "city"] = get_city(r.text)
    top20.loc[idx, "temperature"] = get_temp(r.text)
    
print(top20)
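
Because repeated requests can get you locked out, you may want to avoid fetching the same URL twice while you experiment. The following is a minimal sketch of a memoizing wrapper; `make_cached_fetch()` and `fake_fetch()` are hypothetical names, and the fake fetcher stands in for a real HTTP call so that the sketch runs offline.

```python
def make_cached_fetch(fetch):
    """Hypothetical helper: wrap a fetch function so each URL is fetched once."""
    cache = {}

    def cached(url):
        # Only call the underlying fetcher the first time a URL is seen.
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached

# Stand-in for a real HTTP request; it records every call it receives.
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<payload for {}>".format(url)

get = make_cached_fetch(fake_fetch)
first = get("xml/ORD")
second = get("xml/ORD")   # served from the cache, no second call
print(len(calls))
```

In the notebook, you could pass something like `lambda u: requests.get(u).text` in place of `fake_fetch`.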

For a final treat, here is code that creates a plot like the one shown above. This section will not be graded, but it might prove helpful for Problem_3.

In [None]:
resp = requests.get("https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week14/assignment/airports.svg")
usairports = resp.content
display_svg(usairports, raw=True)

In [None]:
resp = requests.get("https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week14/assignment/top20.svg")
SVG(resp.content)

In [None]:
# Parse the "usairports" file that we've requested above
soup = BeautifulSoup(usairports, "lxml")

# Search for all instances of "circle"
circles = soup.find_all('circle')

# Our color scheme from http://colorbrewer2.org
colors = ["#EFF3FF", "#C6DBEF", "#9ECAE1", "#6BAED6", "#3182BD", "#08519C"]

# Rewrite "r" variable and color code
for c in circles:
    # Rescale the radius so that it is proportional to number of flights in DataFrame "top20"
    c['r'] = float(top20.loc[top20['city'] == c['id'], 'flights'].iloc[0]) / 30000
    
    # Modify the colors
    if c['r']>20:
        c['fill'] = colors[5]
    elif c['r']>17:
        c['fill'] = colors[4]
    elif c['r']>14:
        c['fill'] = colors[3]
    elif c['r']>11:
        c['fill'] = colors[2]
    elif c['r']>8:
        c['fill'] = colors[1]
    else:
        c['fill'] = colors[0]

# Display the new svg graph
display_svg(soup.prettify(), raw=True)
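
The `if`/`elif` chain above maps a radius to one of six colors. As an optional alternative, the same mapping can be sketched as a threshold lookup with the standard `bisect` module. The helper `pick_color()` and the list `thresholds` are hypothetical names; the cut-offs are copied from the cell above.

```python
from bisect import bisect_left

# Same color scheme and cut-offs as in the cell above.
colors = ["#EFF3FF", "#C6DBEF", "#9ECAE1", "#6BAED6", "#3182BD", "#08519C"]
thresholds = [8, 11, 14, 17, 20]

def pick_color(radius):
    """Hypothetical helper: return the color of the bin that radius falls into."""
    # bisect_left counts the thresholds strictly below radius, which matches
    # the strict ">" comparisons in the if/elif chain.
    return colors[bisect_left(thresholds, radius)]

print(pick_color(682636 / 30000))  # radius computed for ORD
print(pick_color(199674 / 30000))  # radius computed for BWI
```

To color by temperature instead of radius, you would only need a different list of thresholds.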

In [None]:
!rm top20.db