# Scraping with Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It gives us the ability to parse the HTML document tree, much like parsing the element tree of an XML document.

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Compared with Beautiful Soup 3, Beautiful Soup 4 is faster, has more features, and works with third-party parsers
like lxml and html5lib.
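
As a small illustration of this tolerance for invalid markup, the sketch below parses a fragment whose tags are never closed. It assumes the built-in `html.parser` backend (chosen only so that no extra parser has to be installed), and the fragment itself is made up for the example.

```python
from bs4 import BeautifulSoup

# A made-up fragment: neither <p> nor <b> is ever closed.
broken_html = "<p>Unclosed paragraph <b>bold text"

# 'html.parser' ships with Python; 'lxml' and 'html5lib' are alternatives
# that may repair the same markup slightly differently.
demo_soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a tree, so normal navigation works.
print(demo_soup.p.b.string)
print(demo_soup.get_text())
```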

In [1]:
from bs4 import BeautifulSoup
import requests
from pprint import pprint

## This [webpage](https://www.transtats.bts.gov/Data_Elements.aspx?Data=2) is used for web scraping with the help of Beautiful Soup. Choose the following on that page:

__Airport :__ Boston, MA: Logan International

__Carrier :__ Virgin America

## Data Wrangling Procedure
* Build list of carrier values.
* Build list of airport values.
* Make HTTP requests to download all data.
* Then parse the data files.
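
The procedure above can be sketched roughly as follows. This is only an outline under stated assumptions: `build_jobs` is a hypothetical helper, and the short carrier and airport lists are sample values standing in for the lists that are scraped from the page later in this notebook.

```python
from itertools import product

# Hypothetical sample values; in the notebook these lists are scraped
# from the <select> elements of the page.
carriers = ["VX", "AA"]
airports = ["BOS", "JFK"]

def build_jobs(carriers, airports):
    """Return one form-data dict per (carrier, airport) combination."""
    return [
        {"CarrierList": carrier, "AirportList": airport}
        for carrier, airport in product(carriers, airports)
    ]

jobs = build_jobs(carriers, airports)
print(len(jobs))   # one download per combination
print(jobs[0])
```

Each dict would then be merged with the hidden form fields and sent as one HTTP request.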

In [2]:
# Convert the HTML content into a BeautifulSoup object.
# 'lxml' selects the parser explicitly (and avoids the "no parser specified" warning).
with open("virgin_and_logan_airport.html") as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [3]:
soup.title

<title>
	Data Elements
</title>

In [4]:
soup.title.name

'title'

In [5]:
soup.title.string

'\n\tData Elements\n'

In [6]:
soup.title.parent.name

'head'

In [7]:
soup.p

<p>BUREAU OF TRANSPORTATION STATISTICS</p>

In [8]:
soup.option

<option selected="selected" value="All">All U.S. and Foreign Carriers</option>
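
The attribute-style navigation used in the cells above can be tried without the saved page. The sketch below uses a tiny made-up document and the built-in `html.parser`; both are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny made-up document standing in for the saved page.
html = (
    "<html><head><title>Data Elements</title></head>"
    "<body><p>First</p><p>Second</p></body></html>"
)
mini_soup = BeautifulSoup(html, "html.parser")

print(mini_soup.title.name)         # name of the tag
print(mini_soup.title.string)       # text inside the tag
print(mini_soup.title.parent.name)  # name of the enclosing tag
print(mini_soup.p)                  # attribute access returns only the first <p>
```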

## Beautiful Soup uses find_all(), whereas xml.etree.ElementTree uses findall()



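## __soup.find_all() :__ returns a list-like ResultSet of all elements with a known tag.

find_all() searches all descendants of the element it is called on and returns every match instead of just the first one.

_Note:_ find_all('option') called on the whole soup matches every option tag in the file, including those that belong to other select lists; call it on a specific element to restrict the search.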

In [9]:
option_list = soup.find_all('option') 
print(type(option_list))

<class 'bs4.element.ResultSet'>
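
To illustrate how the scope of a search changes its result, the sketch below compares a document-wide `find_all()` with one called on a single element. The markup is a made-up excerpt with two `<select>` lists, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup with two separate <select> elements.
html = """
<select id="CarrierList"><option value="VX">Virgin America</option></select>
<select id="AirportList">
  <option value="BOS">Boston</option>
  <option value="ORH">Worcester</option>
</select>
"""
two_lists = BeautifulSoup(html, "html.parser")

# Searching from the top of the tree finds the options of both lists.
all_options = two_lists.find_all("option")

# Searching from one element only looks at that element's descendants.
carrier_only = two_lists.find(id="CarrierList").find_all("option")

print(len(all_options), len(carrier_only))
```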


## __soup.find() :__ finds the first element that matches the search criteria (for example a tag name or an id), together with its children


When exactly one element matches, the only difference is that find_all() returns a list containing the single result, while find() returns the result itself.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:
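
A minimal sketch of these return values, using a made-up snippet and the built-in `html.parser`. The `None` check at the end is one common way to avoid an `AttributeError` when chaining calls on a `find()` result.

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

print(snippet.find("li"))         # first match only
print(snippet.find_all("li"))     # every match
print(snippet.find("table"))      # nothing found: None
print(snippet.find_all("table"))  # nothing found: empty list

# Guard against None before calling methods on a find() result.
table = snippet.find("table")
row_count = len(table.find_all("tr")) if table is not None else 0
print(row_count)
```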

In [10]:
carrier_list = soup.find(id = "CarrierList")
print(type(carrier_list))
carrier_list

<class 'bs4.element.Tag'>


<select class="slcBox" id="CarrierList" name="CarrierList" style="width:450px;">
<option selected="selected" value="All">All U.S. and Foreign Carriers</option>
<option value="AllUS">All U.S. Carriers</option>
<option value="AllForeign">All Foreign Carriers</option>
<option value="AS">Alaska Airlines </option>
<option value="G4">Allegiant Air</option>
<option value="AA">American Airlines </option>
<option value="5Y">Atlas Air </option>
<option value="DL">Delta Air Lines </option>
<option value="MQ">Envoy Air</option>
<option value="EV">ExpressJet Airlines </option>
<option value="F9">Frontier Airlines </option>
<option value="HA">Hawaiian Airlines </option>
<option value="B6">JetBlue Airways</option>
<option value="OO">SkyWest Airlines </option>
<option value="WN">Southwest Airlines </option>
<option value="NK">Spirit Air Lines</option>
<option value="UA">United Air Lines </option>
<option value="VX">Virgin America</option>
</select>

In [11]:
check = soup.find(value="VX")
check

<option value="VX">Virgin America</option>

## Extracting all the URLs found within a page's known tags:

## element.get('attribute') - accesses the value of the given attribute of an element

In [12]:
#  extracting all the URLs found within all the '<a>' tags:
page_links = []
for link in soup.find_all('a'):
    page_links.append(link.get('href'))
    
print("Total number of links present in this webpage : ",len(page_links))
page_links[0:10]

Total number of links present in this webpage :  136


['http://www.transportation.gov',
 'https://www.bts.gov/',
 'http://transportation.libanswers.com/form.php?queue_id=1810',
 'http://www.rita.dot.gov/bts/publications/',
 'https://www.bts.dot.gov/explore-topics-and-geography',
 'https://www.bts.dot.gov//topics/airlines-and-airports',
 'https://www.bts.dot.gov//topics/energy-and-environment',
 'https://www.bts.dot.gov//topics/freight-transportation',
 'https://www.bts.dot.gov//topics/infrastructure',
 'https://www.bts.dot.gov//topics/passenger-travel']
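
One detail worth knowing about `get()`: it returns `None` when the attribute is missing, whereas indexing the tag with square brackets raises a `KeyError`. The sketch below uses two made-up anchors to show this, along with one possible way to skip anchors that have no `href`.

```python
from bs4 import BeautifulSoup

# Two made-up anchors: only the first one has an href attribute.
links = BeautifulSoup(
    '<a href="https://www.bts.gov/">BTS</a><a name="top">Top</a>',
    "html.parser",
)

for a_tag in links.find_all("a"):
    print(a_tag.get("href"))  # None for the anchor without href

# href=True keeps only the tags that actually carry the attribute,
# so indexing with ['href'] is safe here.
hrefs = [a_tag["href"] for a_tag in links.find_all("a", href=True)]
print(hrefs)
```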

## Extracting all the text from a page:

## __soup.get_text() :__ returns all the text content in the given webpage

In [28]:
#print(soup.get_text())
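
Because the full text of the page is long, a small made-up snippet is easier to inspect. The sketch below also shows the optional `separator` and `strip` arguments of `get_text()`, which can help tidy the whitespace.

```python
from bs4 import BeautifulSoup

text_demo = BeautifulSoup("<div><p> One </p><p>Two</p></div>", "html.parser")

# Default: the text pieces are concatenated exactly as they appear.
raw_text = text_demo.get_text()

# strip=True trims each piece; separator is placed between the pieces.
clean_text = text_demo.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```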

## The soup object contains all of the HTML in the original document

__soup.prettify()__ : This method returns the HTML as a string formatted in a more organized and readable way, with indentation that reflects the nesting of the tags.

Beautiful Soup is essentially a set of wrapper functions that make it simple to select common HTML elements.

In [14]:
print(soup.prettify()[0:1000])  # Print only the first 1000 characters of the prettified HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Data Elements
  </title>
  <link href="styles/global.css" rel="stylesheet" type="text/css"/>
  <link href="styles/rita_main.css" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet" type="text/css"/>
  <link href="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone.css" rel="stylesheet"/>
  <link href="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone_pn.css" rel="stylesheet"/>
  <script src="//ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js" type="text/javascript">
  </script>
  <script src="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone.js">
  </script>
  <script language="javascript" type="text/javascript">
   function window_Carri

In [15]:
carrier_list = []
carrier_planes = soup.find(id = "CarrierList")   
for carriers in carrier_planes.find_all('option'):
    carrier_list.append(carriers['value'])

print("List of all the carriers : \n",carrier_list) 

List of all the carriers : 
 ['All', 'AllUS', 'AllForeign', 'AS', 'G4', 'AA', '5Y', 'DL', 'MQ', 'EV', 'F9', 'HA', 'B6', 'OO', 'WN', 'NK', 'UA', 'VX']


In [16]:
airport_list = []
airports = soup.find(id = 'AirportList')   # airports will have an html element as its value
type(airports)

bs4.element.Tag

In [17]:
airport_list_name = []
airports_name = soup.find(attrs = {'name': 'AirportList'})   
type(airports_name)

bs4.element.Tag

In [18]:
for airport_names in airports.find_all('option'):
    airport_list.append(airport_names['value'])
    
print("Total number of airports : \n",len(airport_list))
print("Some airports : \n",airport_list[-15:]) 

Total number of airports : 
 1209
Some airports : 
 ['WSM', 'OLF', 'ORH', 'WRL', 'WRG', 'YKM', 'YAK', 'XWC', 'WYB', 'YNG', 'A63', 'NYL', 'YUM', 'KZB', 'AK8']
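
The lists built above hold only the option values. If the readable names are needed as well, one possible approach is a dict that maps each code to its label. The sketch below uses a short excerpt shaped like the carrier list and assumes that values starting with "All" are aggregate entries rather than single carriers.

```python
from bs4 import BeautifulSoup

# Short excerpt shaped like the page's carrier <select>.
html = """
<select id="CarrierList">
<option value="All">All U.S. and Foreign Carriers</option>
<option value="AS">Alaska Airlines </option>
<option value="VX">Virgin America</option>
</select>
"""
select = BeautifulSoup(html, "html.parser").find(id="CarrierList")

# Assumption: values beginning with 'All' are aggregates, so they are skipped.
carrier_names = {
    option["value"]: option.get_text(strip=True)
    for option in select.find_all("option")
    if not option["value"].startswith("All")
}
print(carrier_names)
```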


To scrape a website, it is important to understand how the site expects requests. So the first step is to find out which URL to access and which HTTP method to use.

Here, the __HTTP method is POST__ and the __URL to access is Data_Elements.aspx?Data=2__ (i.e., we pass the parameter Data=2 to the URL Data_Elements.aspx). That is exactly the URL we accessed above.

<img src = "web2.PNG">
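
The method and target URL can also be read from the form element programmatically. The sketch below runs on made-up form markup, so the attributes of the real page may differ. `urljoin` from the standard library resolves the relative action against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up form markup; the real page's attributes may differ.
html = '<form method="post" action="Data_Elements.aspx?Data=2" id="form1"></form>'
form = BeautifulSoup(html, "html.parser").find("form")

# HTML forms default to GET when no method attribute is present.
http_method = form.get("method", "get").upper()

# The action is relative, so resolve it against the URL of the page.
page_url = "https://www.transtats.bts.gov/Data_Elements.aspx?Data=2"
action_url = urljoin(page_url, form.get("action"))

print(http_method, action_url)
```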

To mine this site, we need to learn how to programmatically construct the requests that pull each page of data we need. Each time, we will pass a carrier value and an airport value.

The best way to do this is to check how the browser makes requests to the site.

In the Network tab of the browser's developer tools (Inspect), we find that the HTTP request carries 8 parameters (including CarrierList and AirportList) in the __FORMDATA__. These 8 parameters are:
* `__EVENTTARGET`
* `__EVENTARGUMENT`
* `__VIEWSTATE`
* CarrierList
* AirportList
* `__VIEWSTATEGENERATOR`
* `__EVENTVALIDATION`
* Submit

These form elements, which are needed to make the request, are not part of the visible user interface. On inspection, they turn out to be hidden input fields inside a _div_ element.
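
As a sketch of how such hidden fields could be collected in one pass, the example below gathers every hidden input of a made-up form into a dict. The markup and the placeholder values are assumptions; the real tokens are long strings generated by the server.

```python
from bs4 import BeautifulSoup

# Made-up markup; the values are placeholders, not real tokens.
html = """
<form><div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="placeholder-viewstate" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="placeholder-validation" />
<input type="submit" name="Submit" value="Submit" />
</div></form>
"""
form_soup = BeautifulSoup(html, "html.parser")

# Select only the inputs whose type attribute is 'hidden'.
hidden_fields = {
    field["name"]: field.get("value", "")
    for field in form_soup.find_all("input", attrs={"type": "hidden"})
}
print(hidden_fields)
```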

Next, we make an HTTP request that includes this data.

In [19]:
r = requests.get("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2")
soup = BeautifulSoup(r.text,"lxml")
div_hidden = soup.find(id = "__EVENTVALIDATION")

In [20]:
eventvalidation = div_hidden['value']
type(eventvalidation)

str

In [21]:
div_hidden_view = soup.find(id = '__VIEWSTATE')
# Note: find_all(['value']) would search for <value> tags; the attribute is read by indexing the tag instead
viewstate = div_hidden_view['value']
type(viewstate)

str

These form parameters (`__VIEWSTATE`, `__EVENTTARGET`, `__EVENTARGUMENT`, `__EVENTVALIDATION`, Submit, etc.) are used to validate each incoming request.

In [22]:
r = requests.post("http://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
                    data={'AirportList': "BOS",  # filling the value for Boston airport
                          'CarrierList': "VX",   # filling the value for Virgin carrier
                          'Submit': 'Submit',
                          "__EVENTTARGET": "",
                          "__EVENTARGUMENT": "",
                          "__EVENTVALIDATION": eventvalidation,
                          "__VIEWSTATE": viewstate
                         })

In [23]:
# pprint(r.text)    # r.text is the response body as a string
print(type(r.text))

<class 'str'>


In [24]:
# Unfortunately, the HTML of the page being scraped has since been updated,
# so this scraping mechanism now fails. Conceptually, however, the approach is still correct.
with open("virgin_to_boston request2.html", 'w') as f:
    chars_written = f.write(r.text)   # the with block closes the file afterwards
chars_written

433606

Instead of the data we want, we get a syntax error.
Some good practices for solving this type of error are:
* Look at how a browser makes requests.
* Emulate it in code.
* If stuff blows up, look at your HTTP traffic.
* Return to the first step until it works.
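
One way to look at the traffic generated by our own code, without sending anything, is to prepare the request and print its parts. The sketch below uses the `Request` and `prepare()` API of `requests`; the token values are placeholders, not real ones.

```python
import requests

# Placeholder token values; real ones come from the page's hidden inputs.
form_data = {
    "CarrierList": "VX",
    "AirportList": "BOS",
    "Submit": "Submit",
    "__EVENTVALIDATION": "placeholder-validation",
    "__VIEWSTATE": "placeholder-viewstate",
}

request = requests.Request(
    "POST",
    "https://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
    data=form_data,
)
prepared = request.prepare()  # builds the request without sending it

print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

The printed body can then be compared field by field with the form data shown in the browser's Network tab.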

Since this video was recorded, some changes to the code from lines 13-20 are necessary to obtain the same functionality. First, the first argument of .post() should point to the secure (https) server. Second, the "data" parameter should be a tuple of tuples in a specific order, as in the session-based request in cell In [25].
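
To see what a tuple of tuples looks like once encoded, the sketch below uses `urlencode` from the standard library, which produces the same kind of form-encoded body. The values are placeholders, and only a subset of the fields is shown.

```python
from urllib.parse import urlencode

# Placeholder values stand in for the real hidden-field tokens.
ordered_fields = (
    ("__EVENTTARGET", ""),
    ("__EVENTARGUMENT", ""),
    ("__VIEWSTATE", "placeholder-viewstate"),
    ("CarrierList", "VX"),
    ("AirportList", "BOS"),
)

# The pairs are encoded in exactly the order in which they are listed.
encoded_body = urlencode(ordered_fields)
print(encoded_body)
```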

To see what the browser does differently, note that the request shows a cookie with some session data above the __FORMDATA__, so our code needs to maintain session state as well. We therefore use a session object for both the GET and the POST. The session cookie is then stored and passed along with every request made through that session.
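
The effect of a session can be observed offline. In the sketch below a cookie is set by hand (the name `session_id` and its value are hypothetical, since a real server chooses its own), and `prepare_request()` shows that the session attaches it to a later request.

```python
import requests

session = requests.Session()

# Hypothetical cookie; a real server would set its own name and value
# in the response to the first GET.
session.cookies.set("session_id", "abc123")

# prepare_request() merges the session state into the outgoing request
# without sending anything over the network.
follow_up = session.prepare_request(
    requests.Request("POST", "https://www.transtats.bts.gov/Data_Elements.aspx?Data=2")
)
print(follow_up.headers.get("Cookie"))
```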

In [25]:
s = requests.Session()
r = s.get("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2")
soup = BeautifulSoup(r.text,"lxml")
div_hidden = soup.find(id = "__EVENTVALIDATION")
eventvalidation = div_hidden['value']
div_hidden_view = soup.find(id = '__VIEWSTATE')
viewstate = div_hidden_view['value']

r = s.post("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
           data = (
                   ("__EVENTTARGET", ""),
                   ("__EVENTARGUMENT", ""),
                   ("__VIEWSTATE", viewstate),
                   ("__EVENTVALIDATION", eventvalidation),
                   ("CarrierList", "VX"),
                   ("AirportList", "BOS"),
                   ("Submit", "Submit")
                  ))

In [26]:
with open("virgin_to_boston request2.html", 'w') as f:
    chars_written = f.write(r.text)   # the with block closes the file afterwards
chars_written

344619

We now get the correct HTML page, generated with the requested data for Virgin America and Logan airport.