# Scraper.ipynb
In this notebook, we scrape passenger volume data from the TSA website using `requests`, a library that lets Python fetch pages from the internet, and `BeautifulSoup`, a library for extracting information from webpages.

### Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

### 2019
When web scraping, it's helpful to get your code working for one page before running it over multiple pages.

In [2]:
#Grab a webpage from the internet using requests
year = 2019
url = "https://www.tsa.gov/travel/passenger-volumes/" + str(year)
response = requests.get(url)

In [3]:
# Parse the webpage using BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')

If you navigate to the webpage we're looking at in your browser, you'll see a data table. We want to get the data out of that table and into a form we can analyze. To do that, we right-click on a cell in the table and choose 'inspect element'. This opens a part of your web browser you might not have used before, full of computery-looking text. That text is HTML: what the webpage looks like under the hood, in a form the computer can understand.


If we look at the highlighted section, we see `class="views-field views-field-field-travel-number-date views-align-center"`. It turns out that `views-field-field-travel-number-date` is a class name that is unique to the dates in the table. We can use this uniqueness in our code to grab the dates easily.

In [4]:
# Grab the dates, skipping the first match (the column header)
dates = [date.text.strip() for date in doc.select('.views-field-field-travel-number-date')[1:]]

Repeating the inspect element process for the values, we see that we can use a similar class name, `views-field-field-travel-number`, to get them. We also strip the commas out of each value so that Python can convert it to an integer.

In [5]:
numbers = [int(number.text.strip().replace(',', '')) for number in doc.select('.views-field-field-travel-number')[1:]]
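To see what these two selectors do without contacting the TSA site, here is a minimal sketch run against a tiny hand-written table. The HTML is a made-up stand-in that only mimics the class names described above; the real page's markup may differ in other respects, and the names `sample_html`, `sample_doc`, `sample_dates`, and `sample_numbers` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: one header row, then two data rows.
sample_html = """
<table>
  <tr>
    <th class="views-field views-field-field-travel-number-date">Date</th>
    <th class="views-field views-field-field-travel-number">Numbers</th>
  </tr>
  <tr>
    <td class="views-field views-field-field-travel-number-date views-align-center">1/1/2019</td>
    <td class="views-field views-field-field-travel-number views-align-center">2,201,765</td>
  </tr>
  <tr>
    <td class="views-field views-field-field-travel-number-date views-align-center">1/2/2019</td>
    <td class="views-field views-field-field-travel-number views-align-center">2,424,225</td>
  </tr>
</table>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# A leading "." in the selector means "any element with this class".
date_cells = sample_doc.select(".views-field-field-travel-number-date")
number_cells = sample_doc.select(".views-field-field-travel-number")

# [1:] drops the first match, which in this sample is the header cell.
sample_dates = [cell.text.strip() for cell in date_cells[1:]]
sample_numbers = [int(cell.text.strip().replace(",", "")) for cell in number_cells[1:]]

print(sample_dates)
print(sample_numbers)
```

A class selector matches whole class names, so `.views-field-field-travel-number` does not also pick up the date cells, even though their class name begins with the same text.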

Now all that's left to do is get this information into a form we can export. We use the library `pandas` to write data to csv files, and pandas can build a table from a list `[]` of dictionaries `{}`.

In [6]:
[{"date": date, "number": number} for (date, number) in zip(dates, numbers)][0:5]

[{'date': '1/1/2019', 'number': 2201765},
 {'date': '1/2/2019', 'number': 2424225},
 {'date': '1/3/2019', 'number': 2279384},
 {'date': '1/4/2019', 'number': 2230078},
 {'date': '1/5/2019', 'number': 2049460}]
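One thing worth knowing about `zip` is that it stops at the shorter of its inputs, so a mismatch between the two lists would silently drop rows. A small sketch of that behavior, using made-up lists and a hypothetical helper `pair_rows` that is not part of this notebook:

```python
def pair_rows(dates, numbers):
    """Pair each date with its number, refusing lists of different lengths."""
    if len(dates) != len(numbers):
        raise ValueError(f"{len(dates)} dates but {len(numbers)} numbers")
    return [{"date": d, "number": n} for d, n in zip(dates, numbers)]

# zip on its own quietly truncates to the shorter list.
truncated = list(zip(["1/1/2019", "1/2/2019"], [2201765]))

# With matching lengths, every date gets its number.
rows = pair_rows(["1/1/2019", "1/2/2019"], [2201765, 2424225])
print(rows)
```

Whether a check like this is worth adding depends on how much you trust the page layout to stay the same.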

### All the data
Great! It looks like it worked for one page, so now we just have to apply it to multiple pages. If we click around on the TSA website, we see that the URLs (the things you type into the top of your browser) for the data pages of previous years are all the same except for the year at the end ("https://www.tsa.gov/travel/passenger-volumes/2019", "https://www.tsa.gov/travel/passenger-volumes/2020", etc.).

We could use the code we wrote above and just change the year by hand every time, but we're busy people and coding allows us to do this automatically with a concept called a for loop. For loops in Python are easy. `for year in range(2019, 2024):` sets the variable `year` to the value `2019`, runs the indented code after the `:`, then repeats, changing the value of `year` to `2020`, then `2021`, and so on through `2023` (the end value, `2024`, is not included).

The indented code is exactly the same code we wrote before, except for `time.sleep(1)`, which tells Python to wait for one second between runs. This is considered polite when web scraping and can help prevent the website from blocking your computer.
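Before running the real loop, it can help to see which values it will visit. The sketch below only builds the URLs and prints them; it makes no requests. The names `base_url`, `years`, and `urls` are hypothetical and used only for this illustration.

```python
base_url = "https://www.tsa.gov/travel/passenger-volumes/"

# range(2019, 2024) starts at 2019 and stops before 2024.
years = list(range(2019, 2024))

# Build the URL for each year without requesting anything.
urls = [base_url + str(year) for year in years]

for url in urls:
    print(url)
```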

In [7]:
archive = []

In [8]:
for year in range(2019,2024):
    url = "https://www.tsa.gov/travel/passenger-volumes/" + str(year)
    response = requests.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    dates = [date.text.strip() for date in doc.select('.views-field-field-travel-number-date')[1:]]
    numbers = [int(number.text.strip().replace(',', '')) for number in doc.select('.views-field-field-travel-number')[1:]]
    archive.extend([{"date": date, "number": number} for (date, number) in zip(dates, numbers)])
    time.sleep(1)

In [9]:
len(archive)

1826
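That count is what we would expect if each yearly page holds exactly one row per calendar day, which is an assumption about the site rather than something the notebook verifies. A quick sketch of the arithmetic, with the hypothetical names `days_per_year` and `expected_rows`:

```python
import calendar

# Days in each year from 2019 through 2023; 2020 is a leap year.
days_per_year = {
    year: 366 if calendar.isleap(year) else 365
    for year in range(2019, 2024)
}

expected_rows = sum(days_per_year.values())
print(expected_rows)
```

Comparing `len(archive)` with `expected_rows` is one cheap way to notice a missing page or a change in the table layout.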

### Get this year's data
For the current year's data, the code is very similar to before, but the URL doesn't fit our previous pattern (it has no year at the end), so we do this step separately.

In [10]:
url = "https://www.tsa.gov/travel/passenger-volumes"
response = requests.get(url)
doc = BeautifulSoup(response.text, 'html.parser')
dates = [date.text.strip() for date in doc.select('.views-field-field-travel-number-date')[1:]]
numbers = [int(number.text.strip().replace(',', '')) for number in doc.select('.views-field-field-travel-number')[1:]]

In [11]:
archive.extend([{"date": date, "number": number} for (date, number) in zip(dates, numbers)])

## The End
And that's it! Congratulations, you have now scraped your first webpage. All that's left to do is feed the variable `archive`, which we used to store the list of dictionaries, into pandas and write it to a csv file, `tsa_volumes.csv`.

We will pick back up in `exploration.qmd` where we will use the R programming language to analyze this data and prototype a graphic.

In [12]:
df = pd.DataFrame(archive)

In [13]:
df.to_csv('tsa_volumes.csv', index=False)
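As a rough illustration of what the last two cells produce, the sketch below builds a DataFrame from a two-row stand-in for `archive` and writes it to an in-memory buffer instead of a file, so nothing on disk is touched. The names `sample_archive`, `sample_df`, `buffer`, and `csv_text` are hypothetical.

```python
import io

import pandas as pd

# Two-row stand-in for the real archive.
sample_archive = [
    {"date": "1/1/2019", "number": 2201765},
    {"date": "1/2/2019", "number": 2424225},
]

# Each dictionary becomes a row; the keys become column names.
sample_df = pd.DataFrame(sample_archive)

# index=False leaves out pandas' row labels, so only our two columns are written.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```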