## Beautiful Soup Documentation
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    https://www.dataquest.io/blog/web-scraping-tutorial-python/


### Useful Functions
    find_all()  : looks through a tag's descendants and returns a list of all descendants matching the filter
    find()      : returns the first descendant matching the filter
    find_next() : returns the first element matching the filter that appears after the current tag in the document (not necessarily a sibling)
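
To make the difference between the three functions concrete, here is a minimal sketch on a made-up HTML snippet. The markup, the airline name, and the variable names are illustrative assumptions, not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<table>
  <tr><th>Operator</th><td>Example Air</td></tr>
  <tr><th>Fatalities</th><td>1</td></tr>
</table>
<p><b><a href="/wiki/A">A</a></b> and <b><a href="/wiki/B">B</a></b></p>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every matching descendant, in document order
bold_tags = soup.find_all('b')
links = [b.find('a')['href'] for b in bold_tags]

# find() returns only the first match (or None if nothing matches)
first_th = soup.find('th')
missing = soup.find('h1')

# find_next() searches forward in the document from the current tag
first_td = first_th.find_next('td')

print(links)
print(first_th.text, '->', first_td.text)
```

Note that `find()` returning `None` is the case to guard against before chaining another call such as `.find_next()`.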

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

## Code Structure for Scrape.py
    (1) Define the starting point (the base webpage)
    (2) Use requests.get(<URL of the starting base webpage>) : to get an HTML document
    (3) Use BeautifulSoup(response.text, 'html.parser') : to parse the HTML document so data can be extracted from it
    (4) Use page.find_all('b')[:-1] to get a list of all bold tags except the last one
    (5) Loop over the list of bold tags, follow the link (URL) inside each bold tag, and retrieve the data on that page
            - Use for b in bold_tags: b.find('a')
            - In each iteration, call a function 'get_data' to follow the link and get the data (it should return a dictionary)
                  - Check what a.attrs gives you
                  - For each 'a' you will have to follow Steps (2) & (3) : get HTML document, use BS4 to extract data from it
                  - headers = ['Date', 'Operator', 'Flight origin', 'Destination', 'Fatalities']
                  - Insert the 'Name' of the accident in the dictionary to return under key = 'Name'
                  - Iterating over the list above, call another function 'get_accident_data'
                        - Use th = a_page.find('th', text=header) to find the table header cell (<th>) whose text matches the header
                        - Use th.find_next('td') to get the data cell (<td>) that follows it
            - Append the result of the 'get_data' function to a larger list (this will be a list of dictionaries)


    (6) Finally, once the Loop over bold_tags ends, convert your list of dictionaries into a DataFrame
    (7) Drop duplicates : Use df.drop_duplicates(inplace=True)
    (8) Write the DataFrame onto a CSV file (without the Index) : Use df.to_csv('accidents.csv', index=False)
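
Steps (6) to (8) can be tried without any scraping. The sketch below assumes a small made-up list of dictionaries standing in for the one built in step (5), and writes the CSV to an in-memory buffer instead of 'accidents.csv' so that it leaves no file behind.

```python
import io
import pandas as pd

# Made-up records standing in for the list that the loop in step (5) builds
records = [
    {'Name': 'Flight A', 'Date': 'January 1, 2000', 'Fatalities': '0'},
    {'Name': 'Flight B', 'Date': 'February 2, 2001', 'Fatalities': '3'},
    {'Name': 'Flight A', 'Date': 'January 1, 2000', 'Fatalities': '0'},  # duplicate
]

# Step 6: one row per dictionary, one column per key
df = pd.DataFrame(records)

# Step 7: drop rows that are exact duplicates of an earlier row
df.drop_duplicates(inplace=True)

# Step 8: write CSV without the index column; the StringIO buffer is a
# stand-in for the file path 'accidents.csv'
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```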

In [7]:
## Step 1
base = 'https://en.wikipedia.org'
path = '/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft'

In [12]:
## Step 2
response = requests.get(base + path)
#base + path
#response.text

In [32]:
## Step 3
page = BeautifulSoup(response.text, 'html.parser')

In [33]:
## Step 4
bold_tags = page.find_all('b')[:-1]

## Commands and Outputs below are rough work from the TA Session
### Just for reference, use it carefully :)

In [34]:
for b in bold_tags:
    a = b.find('a')
    get_accidents_data(a)
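
Not every bold tag necessarily contains a link, and `b.find('a')` returns `None` when it does not. A minimal sketch of a guarded loop, using made-up markup and collecting URLs rather than calling `get_accidents_data`:

```python
from bs4 import BeautifulSoup

base = 'https://en.wikipedia.org'

# Made-up markup: the second <b> has no link inside it
html = '<p><b><a href="/wiki/X" title="X">X</a></b> <b>no link</b></p>'
page = BeautifulSoup(html, 'html.parser')

urls = []
for b in page.find_all('b'):
    a = b.find('a')
    if a is None:
        # Nothing to follow for this bold tag
        continue
    # a.attrs is a dict of the tag's attributes, e.g. href and title
    urls.append(base + a.attrs['href'])

print(urls)
```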

In [62]:
'''
if bold_tags[0].find('a') is None:
    pass
else:
    <code>
'''
#request_2 = requests.get(base + a.attrs['href'])
a.attrs
b.find('a', href='/wiki/Southwest_Airlines_Flight_1380')

{'href': '/wiki/Southwest_Airlines_Flight_1380',
 'title': 'Southwest Airlines Flight 1380'}

In [40]:
page_2 = BeautifulSoup(request_2.text, 'html.parser')

In [48]:
page_2.find('th', text='Destination').find_next('td').text

'Dallas Love Field,\nDallas, Texas'

In [47]:
base + a.attrs['href']

'https://en.wikipedia.org/wiki/Southwest_Airlines_Flight_1380'

In [49]:
headers = ['Date', 'Operator', 'Flight origin', 'Destination', 'Fatalities']

In [51]:
d = {}
for header in headers:
    get_accidents_data(a, header)
    txt = page_2.find('th', text=header).find_next('td').text
    print(header, ':', txt)
    d[header] = txt

Date : April 17, 2018
Operator : Southwest Airlines
Flight origin : LaGuardia Airport,
New York City, New York
Destination : Dallas Love Field,
Dallas, Texas
Fatalities : 1


In [56]:
d

{'Date': 'April 17, 2018',
 'Destination': 'Dallas Love Field,\nDallas, Texas',
 'Fatalities': '1',
 'Flight origin': 'LaGuardia Airport,\nNew York City, New York',
 'Name': 'Southwest Airlines Flight 1380',
 'Operator': 'Southwest Airlines'}

In [55]:
d['Name'] = a.attrs['title']
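
The header loop and the 'Name' assignment above can be folded into two small functions. This is only a sketch under stated assumptions: `get_accident_data` follows the name used in the outline, `build_record` is a hypothetical helper, and the inline HTML is a cut-down stand-in for an accident page (fetching the page is left out). It uses `string=`, the newer spelling of the `text=` argument used above, and returns `None` for headers that are missing instead of raising an error.

```python
from bs4 import BeautifulSoup

headers = ['Date', 'Operator', 'Flight origin', 'Destination', 'Fatalities']

def get_accident_data(a_page, header):
    """Return the text of the <td> following the <th> labelled `header`, or None."""
    th = a_page.find('th', string=header)
    if th is None:
        return None
    td = th.find_next('td')
    return td.text.strip() if td is not None else None

def build_record(a_page, name):
    """Hypothetical helper: build one dictionary for an already-parsed page."""
    record = {'Name': name}
    for header in headers:
        record[header] = get_accident_data(a_page, header)
    return record

# Cut-down stand-in for an accident page; two of the five headers are absent
html = """
<table>
  <tr><th>Date</th><td>April 17, 2018</td></tr>
  <tr><th>Operator</th><td>Southwest Airlines</td></tr>
  <tr><th>Fatalities</th><td>1</td></tr>
</table>
"""
a_page = BeautifulSoup(html, 'html.parser')
record = build_record(a_page, 'Southwest Airlines Flight 1380')
print(record)
```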

In [None]:
def my_func(string):
    <code>
    return n

# Series.apply calls my_func once per value; it needs neither .str nor axis
df['Fatalities_Count'] = df['Fatalities'].apply(my_func)
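
One possible body for such a function is sketched below. It assumes that the count of interest is the first integer appearing in the cell text, which may not hold for every page; `count_fatalities` is a hypothetical name.

```python
import re

def count_fatalities(text):
    """Return the first integer found in `text`, or None if there is none."""
    if not isinstance(text, str):
        # Missing values (None, NaN) carry no count
        return None
    # Remove thousands separators so '1,234' is read as 1234
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else None

print(count_fatalities('1'))
print(count_fatalities('No survivors'))
```

It could then be applied column-wise with `df['Fatalities'].apply(count_fatalities)`.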