# 05 - web scraping and data transformations

1. [The TSA posts passenger numbers](https://www.tsa.gov/coronavirus/passenger-throughput) in a table but there is no download or API option. We can use BeautifulSoup to parse this table.
1. Transform the TSA passenger data in two ways to create two different charts
1. Create two charts inside this notebook with [Matplotlib](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)

__Library reference__
- [BeautifulSoup]()
- [pandas]()
- [Matplotlib for pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
- [Datetime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)

1. Turn the TSA's html table into a dataframe
    1. Create a list of column names
    1. Create a 2d array of data
    1. Format the data into two columns: date and value
1. Transform the data in two different ways for two different charts
1. Create two charts

In [2]:
#### Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

# set display format for numbers
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 1. Turn the TSA's html table into a dataframe

In [3]:
# get html from the page
tsa_r = requests.get('https://www.tsa.gov/coronavirus/passenger-throughput')

In [4]:
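# create a BeautifulSoup object; naming the parser avoids a warning
# and keeps the parsed result the same across machines
tsa_bs = BeautifulSoup(tsa_r.text, 'html.parser')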

#### table tag
![table selected](../answers/assets/table.png)

### a. Create a list of column names

In [6]:
# turn thead into a column list
thead = tsa_bs.find('thead')

In [7]:
# then find all th elements (because there is only 1 row)
ths = thead.find_all('th')

In [10]:
# and loop through each th to extract the text for a list
tsa_col = []
for th in ths:
    tsa_col.append(th.text.strip())

In [11]:
# print the list
tsa_col

['Date',
 '2021 Traveler Throughput',
 '2020 Traveler Throughput',
 '2019 Traveler Throughput']

### b. Create a 2d array of data
![tbody example](../answers/assets/tbody.png)

In [12]:
# turn data into an array of arrays (2d array)
tbody = tsa_bs.find('tbody')

In [13]:
# turn tr tags into a list
trs = tbody.find_all('tr')

In [14]:
# create a list of td tags inside each tr list
tr_list = []
for tr in trs:
    tds = tr.find_all('td')
    td_list = []
    for td in tds:
        td_list.append(td.text.strip())
    tr_list.append(td_list)

In [15]:
# Check the length of the list and the first couple of items
len(tr_list), tr_list[0:2]

(365,
 [['7/6/2021', '1,889,911', '641,761', '2,506,859'],
  ['7/5/2021', '2,160,147', '755,555', '2,748,718']])
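
The two extraction steps above can also be tried without a network request. The following is a minimal sketch, assuming a trimmed-down table whose cell values are copied from the output above; `sample_html`, `sample_bs`, `sample_col` and `sample_rows` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# trimmed-down stand-in for the TSA page (assumption: same thead/tbody layout)
sample_html = """
<table>
  <thead>
    <tr><th> Date </th><th>2021 Traveler Throughput</th></tr>
  </thead>
  <tbody>
    <tr><td>7/6/2021</td><td> 1,889,911 </td></tr>
    <tr><td>7/5/2021</td><td>2,160,147</td></tr>
  </tbody>
</table>
"""

sample_bs = BeautifulSoup(sample_html, 'html.parser')

# column names: the stripped text of every th inside thead
sample_col = [th.text.strip() for th in sample_bs.find('thead').find_all('th')]

# 2d array: one inner list of td text for each tr inside tbody
sample_rows = [
    [td.text.strip() for td in tr.find_all('td')]
    for tr in sample_bs.find('tbody').find_all('tr')
]

print(sample_col)
print(sample_rows)
```

The list comprehensions do the same work as the explicit loops above; the loops are easier to step through, the comprehensions are shorter.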

### c. Format the data into two columns: date and value

In [47]:
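# create a function that generates the matching date in a preceding year
# column_year is the number of years to go back (0 for 2021, 1 for 2020, 2 for 2019)
# stepping back 52 weeks per year keeps the day of the week the same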
def format_date(d, column_year):
    date_f = datetime.strptime(d, '%m/%d/%Y')
    new_date = date_f - timedelta(weeks=column_year*52)
    return new_date

In [48]:
# this double loop could be combined with the loop above that generates tr_list,
# but text extraction is kept separate from formatting
passengers_per_day = []
# for each tr
for tr in tr_list:
    # the row holds one 2021 date; the 2020 and 2019 dates are 52 and 104 weeks before it
    date_list = [format_date(tr[0], years_back) for years_back in range(3)]

    # for each passenger column in tr[1:], index matches the order of date_list
    for (index, passenger_column) in enumerate(tr[1:]):
        # if the value exists, strip the commas and change it to an integer;
        # keep missing values as None so int() never sees an empty string
        value = int(passenger_column.replace(',', '')) if passenger_column else None
        # add each newly created dictionary to the passengers_per_day list
        passengers_per_day.append({'date': date_list[index], 'value': value})

# spot-check the 2020 dates generated for the first four rows
for tr in tr_list[0:4]:
    print(format_date(tr[0], 1))

2020-07-07 00:00:00
2020-07-06 00:00:00
2020-07-05 00:00:00
2020-07-04 00:00:00


In [32]:
# 365 rows x 3 passenger columns
len(passengers_per_day)

1095

In [39]:
tsa_df = pd.DataFrame(passengers_per_day)

tsa_df = tsa_df.sort_values('date', ascending=True)

tsa_df = tsa_df.drop_duplicates(subset=['date'])

In [40]:
len(tsa_df), len(tsa_df['date'].unique())

(1093, 1093)
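
The drop from 1095 to 1093 rows can be reproduced with dates alone. This is a sketch under one assumption: that the scraped table covers 365 consecutive days ending on 7/6/2021, the first date in the output above. The names `latest`, `table_dates`, `all_dates`, `seen` and `repeated` are hypothetical.

```python
from datetime import datetime, timedelta

# assumption: 365 consecutive days ending on the first date in the scraped table
latest = datetime(2021, 7, 6)
table_dates = [latest - timedelta(days=d) for d in range(365)]

# one date per passenger column: the row date, then 52 and 104 weeks earlier
all_dates = [
    row_date - timedelta(weeks=52 * years_back)
    for row_date in table_dates
    for years_back in range(3)
]

# collect the dates that appear more than once across the three columns
seen, repeated = set(), set()
for d in all_dates:
    if d in seen:
        repeated.add(d)
    else:
        seen.add(d)

print(len(all_dates), len(seen))
print(sorted(d.strftime('%Y-%m-%d') for d in repeated))
```

Because 52 weeks is 364 days, each column's date range overlaps the next one by a single day, which gives two repeated dates. `drop_duplicates` keeps the first row it meets for each of them.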

In [46]:
# check for days with a passenger count of 0
tsa_df[tsa_df['value'] == 0]

Unnamed: 0,date,value


In [23]:
# check if days of the week line up
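
One possible way to run this check is to compare weekday names across the three generated dates. The sketch below repeats the logic of `format_date` so it runs on its own, and uses the two row dates shown in the `tr_list` output; `sample_rows`, `weekday_table` and `days_line_up` are hypothetical names.

```python
from datetime import datetime, timedelta

def format_date(d, column_year):
    # same logic as the notebook's format_date: step back 52 weeks per year
    date_f = datetime.strptime(d, '%m/%d/%Y')
    return date_f - timedelta(weeks=column_year * 52)

# the two row dates shown in the tr_list output
sample_rows = ['7/6/2021', '7/5/2021']

weekday_table = []
for row_date in sample_rows:
    # %A formats a date as its full weekday name
    names = [format_date(row_date, years_back).strftime('%A') for years_back in range(3)]
    weekday_table.append(names)
    print(row_date, names)

# the days line up if every row contains a single distinct weekday
days_line_up = all(len(set(names)) == 1 for names in weekday_table)
print(days_line_up)
```

Subtracting whole weeks always lands on the same weekday, which is why the offset is 52 weeks rather than a calendar year.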

In [13]:
# print(tr_list[0][1:])

In [14]:
# turn passengers_per_day into a DataFrame with "date" "value" columns

# sort dates from earliest to latest

# delete duplicates

## 2. Transform the data in two different ways for two different charts
[What's moving average and why are they used? - Dallas FED](https://www.dallasfed.org/research/basics/moving.aspx)

### a. Calculate 7-day moving average

In [16]:
# display the last 7 rows

In [17]:
# write a function that takes the current date and 6 previous dates and averages them
def moving_average(row):
    
    return row

[Read up on pandas' apply method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [20]:
# calculate the 7-day moving average in a new column, starting 7 days in (note: see apply's result_type option)
# set the date as the index for Matplotlib
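
A minimal sketch of the moving-average step, assuming a small made-up frame in place of `tsa_df`. It shows the `apply` approach described above next to pandas' built-in `rolling`, which is a different technique that should give the same numbers. `sample_df`, `avg_apply`, `avg_rolling` and the extra `df` and `window` parameters are hypothetical.

```python
import pandas as pd

# hypothetical stand-in for tsa_df: ten consecutive days of made-up counts
sample_df = pd.DataFrame({
    'date': pd.date_range('2021-06-27', periods=10, freq='D'),
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}).set_index('date')

def moving_average(row, df=sample_df, window=7):
    # row.name is the row's index label, here the date
    end = row.name
    start = end - pd.Timedelta(days=window - 1)
    # label slicing on a sorted date index includes both start and end
    window_values = df.loc[start:end, 'value']
    # the first 6 days have an incomplete window, so leave them empty
    if len(window_values) < window:
        return float('nan')
    return window_values.mean()

# apply with axis=1 calls the function once per row
sample_df['avg_apply'] = sample_df.apply(moving_average, axis=1)

# built-in alternative: a rolling window of 7 rows
sample_df['avg_rolling'] = sample_df['value'].rolling(window=7).mean()

print(sample_df)
```

The `apply` version windows by calendar date, while `rolling(window=7)` windows by row count, so the two agree only when there is exactly one row per day with no gaps.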

### b. Group data by weeks

In [21]:
# create a function to get the date of the first day of the week
def weekday_start(row):
    
    return row

In [24]:
# create a new column that IDs the start date of the week

In [22]:
# group by week start, then turn the groupby object into a DataFrame
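
The weekly grouping could look something like the sketch below. It assumes weeks start on Monday, which the notebook does not specify, and uses made-up values; `sample_df`, `week_start` and `weekly_df` are hypothetical names.

```python
import pandas as pd
from datetime import timedelta

# hypothetical stand-in for tsa_df: two full weeks, starting on a Monday
sample_df = pd.DataFrame({
    'date': pd.date_range('2021-06-28', periods=14, freq='D'),
    'value': [1] * 7 + [2] * 7,
})

def weekday_start(row):
    # weekday() is 0 for Monday, so subtracting it rolls back to that week's Monday
    return row['date'] - timedelta(days=row['date'].weekday())

# new column that IDs the start date of the week
sample_df['week_start'] = sample_df.apply(weekday_start, axis=1)

# sum the daily values per week; reset_index turns the result into a DataFrame
weekly_df = sample_df.groupby('week_start')['value'].sum().reset_index()

print(weekly_df)
```

If the first or last week in the real data is incomplete, its total will look artificially low, so it may be worth dropping partial weeks before charting.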

## 3. Create two charts - one for 7-day moving average and one for week totals
Create a bar chart of the daily values for reference

In [25]:
# create a bar chart for daily values

### a. 7-day moving average

In [26]:
# plot a 7-day average line chart

### b. By weekly totals

In [28]:
# plot weekly totals as a line chart
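
The three chart cells above can be sketched with pandas' `plot` method, which draws with Matplotlib. This is an illustration on made-up data, not the notebook's answer. `sample_df`, `avg_7_day` and `weekly` are hypothetical names, and `resample('W')` (weeks ending on Sunday) is used here as a shortcut in place of a week-start column.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical stand-in for tsa_df: 28 days of made-up passenger counts
sample_df = pd.DataFrame({
    'date': pd.date_range('2021-06-07', periods=28, freq='D'),
    'value': [1000 + 50 * (i % 7) for i in range(28)],
}).set_index('date')

# 7-day moving average and weekly totals
sample_df['avg_7_day'] = sample_df['value'].rolling(window=7).mean()
weekly = sample_df['value'].resample('W').sum()

# one axes per chart, so the three plots do not draw over each other
fig, (ax_daily, ax_avg, ax_weekly) = plt.subplots(3, 1, figsize=(10, 9))

sample_df['value'].plot(kind='bar', ax=ax_daily, title='Daily passengers')
sample_df['avg_7_day'].plot(kind='line', ax=ax_avg, title='7-day moving average')
weekly.plot(kind='line', ax=ax_weekly, title='Weekly totals')

fig.tight_layout()
```

Passing `ax=` to each `plot` call keeps every chart on its own axes; without it, successive calls in the same cell draw onto the same figure.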