<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Make-necessary-imports" data-toc-modified-id="Make-necessary-imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Make necessary imports</a></span></li><li><span><a href="#Scrape-www.massport.com-for-PDF-links" data-toc-modified-id="Scrape-www.massport.com-for-PDF-links-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Scrape <code>www.massport.com</code> for PDF links</a></span></li><li><span><a href="#Define-function-to-isolate-month-and-year" data-toc-modified-id="Define-function-to-isolate-month-and-year-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Define function to isolate <code>month</code> and <code>year</code></a></span></li><li><span><a href="#Capture-relevant-information-and-append-to-new-dataframe" data-toc-modified-id="Capture-relevant-information-and-append-to-new-dataframe-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Capture relevant information and append to new dataframe</a></span></li></ul></div>

## Make necessary imports

In [12]:
import pandas as pd
import numpy as np

import requests
import io
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileReader
from pdfreader import SimplePDFViewer
import urllib
import tabula

## Scrape `www.massport.com` for PDF links

The page at the URL below links to every PDF that we'd like to scrape for data. This step collects those links so that each PDF can be accessed individually.

In [20]:
# Request the statistics page, then parse it with an explicitly named parser
response = requests.get('https://www.massport.com/logan-airport/about-logan/airport-statistics/')
soup = BeautifulSoup(response.content, 'html.parser')

# Create empty list to house all PDF links
pdfs = []

# Loop through all <a> tags whose rel attribute includes 'noopener'; keep the hrefs that end in '.pdf'
for a in soup.find_all('a', rel = 'noopener'):
    href = a.get('href', '')
    if href.lower().endswith('.pdf'):
        pdfs.append(href)
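
The links collected above appear to be site-relative, since a later cell prefixes each one with `https://www.massport.com`. As a minimal sketch, assuming hrefs of that shape, `urllib.parse.urljoin` can build the absolute URLs and also copes with links that are already absolute. The helper name `absolute_pdf_links` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

def absolute_pdf_links(hrefs, base_url='https://www.massport.com'):
    """Return absolute URLs for the hrefs that point at PDF files."""
    links = []
    for href in hrefs:
        # Case-insensitive check so '.PDF' is also accepted
        if href.lower().endswith('.pdf'):
            # urljoin resolves relative paths and leaves absolute URLs untouched
            links.append(urljoin(base_url, href))
    return links

# Hypothetical hrefs, for illustration only
sample = ['/media/example-report.pdf', '/logan-airport/', 'https://example.com/other.PDF']
print(absolute_pdf_links(sample))
```

Compared with plain string concatenation, this avoids producing a malformed URL if the page ever mixes relative and absolute links.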

## Define function to isolate `month` and `year`

Extracting a PDF's text yields one long, messy string. The function below looks for a marker word in that string, in this case `Summary`, which the `month` and `year` always follow (this was determined after doing some testing and examining each PDF's string content).

Isolating the `month` and `year` gives each record a point in time to which the numbers of `domestic travelers` and `international travelers` can be tied.

In [14]:
def find_month_and_year(string):
    
    # Eliminate spaces and hyphens in the string
    string = string.replace(' ', '')
    string = string.replace('-', '')
    
    # Locate the word 'Summary'; return an empty string if it is absent
    start = string.find('Summary')
    if start == -1:
        return ''
    x = start + len('Summary')
    y = ''
    
    # Scan at most 30 characters after 'Summary' (slicing cannot run past the end of the string)
    for character in string[x:x + 30]:
        
        # Collect the month + year until the first non-alphanumeric character
        if character.isalnum():
            y += character
        else:
            break
    
    return y
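
The same idea can be expressed with a regular expression, which also separates the month from the year. This is only a sketch: it assumes the text after `Summary` is a month name followed by a four-digit year, possibly separated by spaces or hyphens. The name `parse_term` and the sample strings are hypothetical, not taken from the actual PDFs.

```python
import re

# 'Summary', optional separators, a month name, optional separators, a four-digit year
TERM_PATTERN = re.compile(r'Summary[\s-]*([A-Za-z]+)[\s-]*(\d{4})')

def parse_term(text):
    """Return (month, year) found after 'Summary', or None if there is no match."""
    match = TERM_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# Hypothetical page text, for illustration only
print(parse_term('Airport Traffic Summary - March 2020\nAircraft Operations'))
```

Returning `None` when nothing matches makes a malformed PDF easy to detect, instead of storing an arbitrary substring as the term.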

## Capture relevant information and append to new dataframe

We want to capture the `month + year`, `domestic passengers`, and `international passengers`. The dataframe will be simple with just 3 columns.

In [15]:
# Base URL to prepend to each relative PDF link (the dataframe itself is built in a later cell)
pdf_base_url = 'https://www.massport.com'

# Create 3 empty lists
terms = []
domestic = []
international = []

# Loop through PDF links
for individual_url in pdfs:
    pdf_url = pdf_base_url + individual_url
    
    # Download the PDF and extract the text of its first page
    response = requests.get(pdf_url)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        # Page index 0 is the first page
        numpage = 0
        page = pdf.getPage(numpage)
        page_content = page.extractText()
        
        # Call 'find_month_and_year' function
        term = find_month_and_year(page_content)
        terms.append(term)
    
    # Try to read the data in the PDF using Tabula library
    try:
        table = tabula.read_pdf(pdf_url, multiple_tables=False, encoding = 'latin1', pages ='all')
        df = table[0]
    
    # If that doesn't work, use the same library with a manually measured area
    # given as (top, left, bottom, right) in points
    except Exception:
        table = tabula.read_pdf(pdf_url, multiple_tables=True, guess = False,
                                area = (130.99, 30.64, 540.39, 533.67), stream = True, encoding = 'latin1', pages ='all')
    
    # Isolate dataframe and relevant data, append to appropriate list
    df = table[0]
    df = df.rename(columns={df.columns[0]: 'Flights', df.columns[1]: 'Current Month'})
    
    # Pull the 'Current Month' value from each totals row, strip thousands separators, convert to int
    domestic_row = df[df['Flights'] == 'Total Domestic Passengers']
    domestic_passengers = int(domestic_row['Current Month'].iloc[0].replace(',', ''))
    
    international_row = df[df['Flights'] == 'Total International Passengers']
    international_passengers = int(international_row['Current Month'].iloc[0].replace(',', ''))
    
    domestic.append(domestic_passengers)
    international.append(international_passengers)
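
Converting the extracted cell to an integer is the step most likely to fail if a PDF table has a blank cell or stray characters. A minimal sketch of a more defensive conversion follows. The helper name `parse_count` and the sample values are hypothetical, and returning `None` for unparseable cells is an assumed design choice rather than something the notebook does.

```python
def parse_count(value):
    """Convert a string such as '1,234,567' to an int, or None if it is not numeric."""
    # Drop thousands separators and surrounding whitespace
    cleaned = str(value).replace(',', '').strip()
    try:
        return int(cleaned)
    except ValueError:
        # Blank cells, 'n/a', NaN and similar values end up here
        return None

# Hypothetical cell values, for illustration only
print(parse_count('1,234,567'))
print(parse_count('n/a'))
```

Appending `None` for a bad cell would keep the three lists the same length, so a single malformed PDF would not stop the whole loop.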


In [None]:
# Define a new dataframe
flights = pd.DataFrame(columns = ['Term', 'Domestic Passengers', 'International Passengers'])

# Add data to the predefined columns in the dataframe
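# terms can hold one more entry than the passenger lists if the loop stopped partway through a PDF,
# so trim it to the same length
flights['Term'] = terms[:len(domestic)]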
flights['Domestic Passengers'] = domestic
flights['International Passengers'] = international

# Check work
flights

In [None]:
# Export to CSV
flights.to_csv('../datasets/logan_travel_data.csv')