The cell below uses the `requests` library to download the file from the web. `raw_html` holds the contents of that file as raw bytes.

In the following cells, printing `raw_html` gives us everything that was in the file.

In [31]:
import requests
my_url = "https://www.panynj.gov/airports/en/statistics-general-info.html"
raw_html = requests.get(my_url).content
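A note on what `.content` holds: `requests` exposes the response body as raw bytes in `.content` and as a decoded string in `.text`. The sketch below uses a short, made-up byte string (`sample_bytes` is an assumption, not the real page) to show the difference without touching the network.

```python
# Stand-in for the bytes that .content returns (assumed sample, not the real page)
sample_bytes = b'<!DOCTYPE HTML>\n<html lang="en">\n<head><meta charset="UTF-8"/></head>\n</html>'

# Decoding turns bytes into a str; UTF-8 matches the charset declared in the <meta> tag
sample_text = sample_bytes.decode("utf-8")

print(type(sample_bytes))  # bytes: note the b'...' prefix when printed
print(type(sample_text))   # str: newlines print as real line breaks
print(sample_text)
```

Beautiful Soup accepts either form, so decoding by hand is optional here.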

In [32]:
print(type(raw_html))

<class 'bytes'>


In [33]:
print(raw_html)

b'\n<!DOCTYPE HTML>\n<html lang="en">\n    <head>\r\n    <meta charset="UTF-8"/>\r\n    <title>Port Authority of New York and New Jersey Airport Traffic Statistics </title>\r\n    \r\n    <meta name="description" content="Learn more About the Monthly Summaries of Airport Activities for JFK, Newark, LaGuardia, Stewart and Teterboro Airports"/>\r\n    <meta name="template" content="page-template"/>\r\n    \n\r\n    \r\n\r\n<meta name="viewport" content="width=device-width, initial-scale=1"/>\r\n\r\n\r\n\r\n<meta property="cq:pagemodel_root_url" content="/content/airports/en.model.json"/>\r\n\r\n    \n    \n<link rel="stylesheet" href="/etc.clientlibs/portauthority/clientlibs/portauthority-react.min.49aacb59651cc293d73a2fc06d500726.css" type="text/css">\n\n\n\r\n\r\n\r\n<script>\r\n\t;(function(w, d, s, l, i) {\r\n\t\tw[l] = w[l] || []\r\n\t\tw[l].push({\r\n\t\t\t\'gtm.start\': new Date().getTime(),\r\n\t\t\tevent: \'gtm.js\'\r\n\t\t})\r\n\t\tvar f = d.getElementsByTagName(s)[0],\r\n\t\t\

In [34]:
from bs4 import BeautifulSoup

In [35]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc))

<class 'bs4.BeautifulSoup'>


In [36]:
print(soup_doc.prettify())
#print(soup_doc)

<!DOCTYPE HTML>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Port Authority of New York and New Jersey Airport Traffic Statistics
  </title>
  <meta content="Learn more About the Monthly Summaries of Airport Activities for JFK, Newark, LaGuardia, Stewart and Teterboro Airports" name="description"/>
  <meta content="page-template" name="template"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="/content/airports/en.model.json" property="cq:pagemodel_root_url"/>
  <link href="/etc.clientlibs/portauthority/clientlibs/portauthority-react.min.49aacb59651cc293d73a2fc06d500726.css" rel="stylesheet" type="text/css"/>
  <script>
   ;(function(w, d, s, l, i) {
		w[l] = w[l] || []
		w[l].push({
			'gtm.start': new Date().getTime(),
			event: 'gtm.js'
		})
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s),
			dl = l != 'dataLayer' ? '&l=' + l : ''
		j.async = true
		j.src =
			'https://www.googletagmanager.com/

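When you are first getting comfortable with Beautiful Soup, it is wise to use the `.find()` notation. `find()` searches for the first instance of a tag and returns the whole element: the opening and closing tags along with everything between them. (Remember that `.string` strips away the tags.) The cell below uses `.select()` instead, which takes a CSS selector and returns a list of every match.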
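As a minimal sketch of the difference between a tag and its `.string`, the example below parses a one-line snippet modeled on the movie markup used later in this notebook. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# One movie entry, in the same shape as the movie list used later on
html = '<div><p><b> The Dark Knight</b> <span class="year">2008</span></p></div>'
soup = BeautifulSoup(html, "html.parser")

first_b = soup.find("b")       # first <b> tag in the document
print(first_b)                 # the tag, markup included
print(first_b.string)          # only the text inside the tag

missing = soup.find("table")   # no <table> here, so find() returns None
print(missing)
```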

In [43]:
links = soup_doc.select("li.title a")
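`select()` takes a CSS selector rather than a tag name and always returns a list. As a rough illustration, the sketch below runs the same selector, `li.title a`, against a small made-up snippet. The markup and file names are assumptions, not taken from the Port Authority page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <li> carries class="title"
html = (
    '<ul>'
    '<li class="title"><a href="/report-a.pdf">Report A</a></li>'
    '<li><a href="/report-b.pdf">Report B</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# "li.title a" means: every <a> nested inside an <li> whose class is "title"
links = soup.select("li.title a")
for link in links:
    print(link.get_text(), link["href"])
```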

Most often when we are scraping HTML, simply finding one tag will not do the trick. We need to navigate hierarchically down the tree of nested tags. We'll begin by searching the first list, which is contained in the first `<div>` tag. Since we are looking for the first occurrence, we can use `.find()`.

In [38]:
soup_doc.find('body')

<body class="page basicpage">
<noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-5DZH2R" style="display:none;visibility:hidden" title="googletagmanager" width="0"></iframe></noscript>
<div id="page"></div>
<script src="/etc.clientlibs/portauthority/clientlibs/portauthority-react.min.js"></script>
<script>
  function _reciteLoaded() {
    var html = document.querySelectorAll("html");
    for (var i = 0; i < html.length; i++) {
      if (html[i]) {
        html[i].setAttribute("lang", "en-us");
      }
    }
    var navBar = document.querySelectorAll(".NavGlobal-desktop-links li");
    for (var i = 0; i < navBar.length; i++) {
      if (navBar[i]) {
        navBar[i].setAttribute("style", "padding-top:25px !important");
      }
    }
      setTimeout(function () {
          var textPdfs =document.querySelectorAll(".Text a ");
          for (var i = 0 ; i < textPdfs.length; i++) {
              if (textPdfs[i] && textPdfs[i].href && textPdfs[i].href.indexOf(

Below, we navigate the tree: starting at the outer `<div>` tag, we use `find_all()` to get a list of every `<p>` tag nested inside. Then we loop through that list and pull the text out of each `<b>` tag, using `.string` to get just the names of the movies with no tags around them.

In [41]:
first_div = soup_doc.find('body')
# first_div holds all of the HTML inside the first matching tag (here, <body>)
page = first_div.find_all('div')
# .find_all() gives us a list,
# so to search the elements inside that list
# we need to loop through it:
#for movies in all_paragraphs:
    #print(movies.find('b').string)
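The commented-out loop above can be sketched on a small stand-in document. The two `<p>` entries below follow the movie markup that appears in this notebook's outputs, but the snippet itself is an assumption, since the full movie file is not shown here.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the first movie list
html = """
<div>
<p><b> The Dark Knight</b> <span class="year">2008</span></p>
<p><b> Avengers: Age Of Ultron</b> <span class="year">2015</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")             # the outer container
all_paragraphs = first_div.find_all("p")  # a list of every <p> inside it

titles = []
for movie in all_paragraphs:
    # .string gives the text of the <b> tag; strip() drops the leading space
    titles.append(movie.find("b").string.strip())

print(titles)
```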

**More on find_all()**

If a tag appears more than once, `find_all()` gives us a list of every match. We can use list notation to get a specific element in the list. The first cell below gives us a list of every single `<p>` tag in the document. In the following cell, we get a list of all of the `<p>` tags inside the first `<div>`. Try changing the index number `[0]` for each of these lists to see what you get.

In [20]:
soup_doc.find_all('p')[2]
# uncomment the line below to see the full list
#soup_doc.find_all('p')

<p><b> The Dark Knight</b> <span class="year">2008</span></p>

In [12]:
soup_doc.find('div').find_all('li')
#soup_doc.find('div').find_all('p')

[]

Here is a one-line search down the tree. Note that I specify the index number of the `<p>` tag I want to search further. Again, if I wanted to search all of the `<p>` tags, I would need to use a loop.

In [22]:
soup_doc.find('div').find_all('p')[2].find('b').string
#soup_doc.div.find_all('p')[2].b.string ##Same thing but shorter

' Avengers: Age Of Ultron'

These are two examples of searching the first list and pulling out the name and date of the third movie in the list (`[2]`). Try changing that index number to get other movies in the list.

In [23]:
that_movie = soup_doc.find('div').find_all('p')[2]
movie_name = that_movie.find('b')
movie_year = that_movie.find('span')
print(movie_name, movie_year)

<b> Avengers: Age Of Ultron</b> <span class="year">2015</span>


In [24]:
that_movie = soup_doc.find('div').find_all('p')[2]
movie_name = that_movie.b
movie_year = that_movie.find('span')
print(movie_name.string, "||", movie_year.string)

 Avengers: Age Of Ultron || 2015


Now we get the next list (the second `<div>` or list element [1]) and then pull out all of the names and dates by using a loop.

In [25]:
next_list = soup_doc.find_all('div')[1]
#print(next_list)
next_movies = next_list.find_all('p')
print(next_movies)

[<p><b> The Incredible Hulk</b> <span class="year">2008</span> <a href="https://www.boxofficemojo.com/release/rl2791015937/">more info</a></p>, <p><b> Wanted</b> <span class="year">2008</span></p>, <p id="favorite"><b> Superman</b> <span class="year">1978</span></p>, <p> The Wolverine<span class="year">2013</span></p>, <p><b> Hulk</b> <span class="year">2003</span></p>]


In [26]:
for movie in next_movies:
    movie_name = movie.b
    movie_year = movie.span
    print(movie_name.string, "||", movie_year.string)

 The Incredible Hulk || 2008
 Wanted || 2008
 Superman || 1978


AttributeError: 'NoneType' object has no attribute 'string'

Oh no! What just happened? I mean, yay, an error: something to learn from! What might have gone wrong? `'NoneType'` means that a search came back empty: Beautiful Soup returned `None` instead of a tag, and `None` has no `.string`. It worked all the way to Superman, but it broke on The Wolverine. Why? Think about it, and then look at the following way to deal with this problem.

In [None]:
for movie in next_movies:
    movie_name = movie.b
    if movie_name is None: 
        movie_name = 'Problem##!!!'  # placeholder for now; the next cell recovers the real name
    else:
        movie_name = movie_name.string
    movie_year = movie.find('span')
    print(movie_name, "||", movie_year.string)

Now the loop isn't breaking!  But what was the problem, and how do we fix it? Think about it. Or try doing the next thing and come back to this.

In [None]:
# edit that loop above to get the missing movie name in there.
for movie in next_movies:
    movie_name = movie.b
    if movie_name is None: 
        movie_name = movie.span.previous_element  # the text node parsed just before <span>
    else:
        movie_name = movie_name.string
    movie_year = movie.find('span')
    print(movie_name, "||", movie_year.string)
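One way to package that fallback is a small helper function. This is only a sketch: `movie_title` is a hypothetical name, and the two `<p>` tags are copied from the output shown earlier so that one has a `<b>` tag and one does not.

```python
from bs4 import BeautifulSoup

# Two entries from the list above: one with a <b> tag, one without
html = """
<div>
<p><b> Superman</b> <span class="year">1978</span></p>
<p> The Wolverine<span class="year">2013</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def movie_title(p_tag):
    """Return the movie name from a <p>, whether or not it is wrapped in <b>."""
    bold = p_tag.find("b")
    if bold is not None:
        return bold.get_text(strip=True)
    # No <b>: fall back to the text node parsed just before the <span>
    return str(p_tag.span.previous_element).strip()

titles = [movie_title(p) for p in soup.find_all("p")]
print(titles)
```

The helper still assumes every `<p>` has a `<span>`. A page that breaks that assumption would need another check.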

**get_text() vs find_all(string=True) vs stripped_strings**

Sometimes HTML tags are more trouble than they are worth. Especially when they are inconsistent, they can cause problems and get in the way. There are a few ways to grab just the text. The result is less structured, so we have less control over its format, which may cause different problems.

Here are three different ways to grab text out of the tags, each with its own fun complications.

In [27]:
# get_text() returns one string per tag
for movie in next_movies:
    movie_stuff = movie.get_text()
    #movie_stuff = movie.text
    #movie_stuff = movie.get_text('|',strip=True)
    print(movie_stuff)

 The Incredible Hulk 2008 more info
 Wanted 2008
 Superman 1978
 The Wolverine2013
 Hulk 2003


In [28]:
#list
for movie in next_movies:
    movie_stuff = movie.find_all(string=True)
    print(movie_stuff)

[' The Incredible Hulk', ' ', '2008', ' ', 'more info']
[' Wanted', ' ', '2008']
[' Superman', ' ', '1978']
[' The Wolverine', '2013']
[' Hulk', ' ', '2003']


In [29]:
#generator
for movie in next_movies:
    for string in movie.stripped_strings:
        print(string)

The Incredible Hulk
2008
more info
Wanted
2008
Superman
1978
The Wolverine
2013
Hulk
2003
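To see the three approaches side by side, here is a minimal sketch on the one entry that gave us trouble. It also tries the `get_text('|', strip=True)` variant that is commented out above.

```python
from bs4 import BeautifulSoup

# The entry with no <b> tag and no space before the <span>
html = '<p> The Wolverine<span class="year">2013</span></p>'
p = BeautifulSoup(html, "html.parser").p

plain = p.get_text()                    # every string glued together as-is
separated = p.get_text("|", strip=True)  # each string stripped, joined with "|"
as_list = p.find_all(string=True)        # list of the raw strings
stripped = list(p.stripped_strings)      # generator of stripped strings, made into a list

print(repr(plain))      # ' The Wolverine2013'
print(repr(separated))  # 'The Wolverine|2013'
print(as_list)
print(stripped)         # ['The Wolverine', '2013']
```

Passing a separator to `get_text()` keeps the name and year apart even when the HTML has no space between them.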


**next_element and previous_element / find_next and find_previous**

Sometimes things are less nicely defined and less easy to find, and you just want to look for next-door neighbors. These can be very handy, moving forward and backwards along the parsed elements.

To get the third list, we could have gotten `<div>` [2], but because it has a unique `<ul>` parent tag--we go straight for that.

Now try this: get 'Ghost in the Shell' out of that list. Hint: next/previous will help.

In [None]:
#third_list = soup_doc.find_all('div')[2].find('li')
third_list = soup_doc.ul
print(third_list)



In [None]:
#ghost = ??
third_list.find_all('li')[3].span.previous_element
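The sketch below shows how `previous_element`, `next_element`, and `find_previous()` behave on a tiny list. The markup is made up for illustration (the real third list is not reproduced here), with the second `<li>` deliberately missing its `<b>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list; the second item has no <b> around its title
html = (
    '<ul>'
    '<li><b> Hulk</b> <span class="year">2003</span></li>'
    '<li> Ghost in the Shell<span class="year">2017</span></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

span = soup.find_all("li")[1].span

before = span.previous_element       # whatever was parsed just before the <span>
after = span.next_element            # whatever was parsed just after it: its own text
prev_bold = span.find_previous("b")  # nearest earlier <b>, even in another <li>

print(repr(before))
print(repr(after))
print(prev_bold)
```

Note that `find_previous("b")` reaches back into the first `<li>`: these methods follow parse order, not the hierarchy.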

**Parents, children, and siblings**

So far we have navigated the DOM tree from parent to child-- div > p > b 

And we've seen that we can move side-to-side generally. But `next_element` and its relatives ignore hierarchy, which can be useful.

Sometimes you want to go in the opposite direction: find a unique identifier inside a container and then get everything in the container. For example, as we saw, the third list has a unique `<ul>` tag. If we wanted to get everything that is inside the same parent container (the `<div>`), we could do something like this:

In [None]:
soup_doc.find('ul').parent
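A minimal sketch of climbing up the tree, using made-up markup (the `id` value and heading text are assumptions). `.parent` goes up exactly one level, while `find_parent()` keeps climbing until it finds a matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical container holding a heading and a <ul>
html = '<div id="third"><h3>Third list</h3><ul><li>one</li><li>two</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

container = soup.find("ul").parent         # one level up from the <ul>
print(container.name, container["id"])

# Starting deeper, find_parent() climbs until it reaches a <div>
same_container = soup.find("li").find_parent("div")
print(same_container == container)
```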

You can also go sideways, meaning finding siblings--elements that are in the same container at the same level of the hierarchy. As you've seen, the fourth list is not in its own div. To get the fourth list we could use the unique `<h2>` to get all of the siblings that come after it.

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings()
last_list


We can even specify what kind of siblings we want to find. Notice how that final "that's all" showed up in the results of the previous cell. To leave it out, we can search for only the next `<p>` tags. (There is also a `find_previous_siblings()` function that goes backwards.)

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings('p')
last_list
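Here is a small sketch of the sibling searches on made-up markup. The headings, movie placeholders, and the trailing `<span>` are all assumptions standing in for the fourth list.

```python
from bs4 import BeautifulSoup

# Hypothetical container: an <h2> with siblings before and after it
html = (
    "<div>"
    "<p>Intro</p>"
    "<h2>Fourth list</h2>"
    "<p>Movie A</p>"
    "<p>Movie B</p>"
    "<span>that's all</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
head = soup.find("h2")

all_after = head.find_next_siblings()       # every tag after the <h2>, same level
only_p = head.find_next_siblings("p")       # just the <p> tags after it
before = head.find_previous_siblings("p")   # <p> tags that come before it

print([tag.name for tag in all_after])
print([tag.get_text() for tag in only_p])
print([tag.get_text() for tag in before])
```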

**Searching id and class**

Most websites these days use `id` and `class` attributes to style their pages and to hook code into them. These are some of the most helpful attributes to search on when you want certain types or groups of information.

In [31]:
#Finds all tags with the class "year"
all_years = soup_doc.find_all(class_="year")
print(all_years)
#Try printing out just the years without any tags around them

[<span class="year">2010</span>, <span class="year">2008</span>, <span class="year">2015</span>, <span class="year">2012</span>, <span class="year">2013</span>, <span class="year">2008</span>, <span class="year">2008</span>, <span class="year">1978</span>, <span class="year">2013</span>, <span class="year">2003</span>, <span class="year">1991</span>, <span class="year">1994</span>, <span class="year">1993</span>, <span class="year">2017</span>, <span class="year">2004</span>]


In [None]:
#Finds any tag that has an id attribute in it
fav = soup_doc.find_all(id=True)
print(fav)

In [None]:
fav = soup_doc.find_all(id='favorite')
print(fav)

You can also search for any kind of attribute beyond `id` and `class`, and you can specify which kind of tag the attribute must appear in. This is very helpful for zeroing in on specific parts of the webpage.

In [None]:
fav = soup_doc.find_all('p', attrs={'id': 'favorite'})
print(fav)
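The searches in this section can be tried on a two-entry snippet taken from the movie output shown earlier. This is a sketch rather than the notebook's own data file. The last line uses `select_one()` with a CSS selector as an alternative way of writing the same `id` search.

```python
from bs4 import BeautifulSoup

# Two entries copied from the list output earlier in this notebook
html = (
    '<div>'
    '<p><b> Wanted</b> <span class="year">2008</span></p>'
    '<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so .string has to be applied to each tag in turn
years = [span.string for span in soup.find_all(class_="year")]
print(years)

# Restrict the search to <p> tags whose id is "favorite"
fav = soup.find("p", attrs={"id": "favorite"})
print(fav.b.string.strip())

# The same search written as a CSS selector
fav_css = soup.select_one("p#favorite")
print(fav_css == fav)
```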

**Pulling out attributes**

Not only can you search by attributes, you can also pull out the information stored inside a tag's attributes. The most common information you will want to get is a link. Like in this tag:
`<a href="https://www.boxofficemojo.com/release/rl2791015937/">more info</a>`

URLs are found in `<a>` tags inside the `href` attribute.

In [None]:
first_link = soup_doc.find('a')
get_url = first_link['href']
#Note that this works just like a key in a dictionary
print(get_url)
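Because attribute access works like a dictionary, a missing attribute raises a `KeyError`. The sketch below (on a made-up paragraph that includes one `<a>` without an `href`) shows two ways to avoid that: `.get()`, and filtering with `href=True`.

```python
from bs4 import BeautifulSoup

# One real-looking link from the movie list plus a hypothetical <a> with no href
html = (
    '<p>'
    '<a href="https://www.boxofficemojo.com/release/rl2791015937/">more info</a> '
    '<a>no link here</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

first_url = links[0]["href"]       # dictionary-style access
missing = links[1].get("href")     # .get() returns None instead of raising KeyError

# href=True keeps only the <a> tags that actually have an href attribute
urls = [a["href"] for a in soup.find_all("a", href=True)]

print(first_url)
print(missing)
print(urls)
```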

**Using that link!**

Here I am taking a real link to Box Office Mojo and scraping a table with some basic information about the movie, then turning that info into a dictionary!

In [None]:
raw_html2 = requests.get(get_url).content
soup_doc2 = BeautifulSoup(raw_html2, "html.parser")
print(soup_doc2.prettify())

**Reading Structurally**

When you are reading HTML source for the purposes of scraping, you are looking for the structure of the DOM tree and for unique identifiers within the parent/child nodes that will let you zero in on specific information.

I want to pull out all of the box office information about Avengers. There is a whole lot on that page, and I just want the stuff in that table. After careful reading of the source, I noticed that the two main boundaries inside that table have unique class names, so I search for a tag with those classes.

In [None]:
left_info = soup_doc2.find(class_="a-section a-spacing-none mojo-performance-summary-table")

print(left_info)
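One caveat when passing several class names in a single `class_` string: Beautiful Soup then matches the exact value of the `class` attribute, so the names must appear in the same order as in the HTML. The sketch below demonstrates this on a stripped-down `<div>` that reuses the class names above; the inner content is a placeholder.

```python
from bs4 import BeautifulSoup

# Same class names as above; the inner <div> is a placeholder
html = (
    '<div class="a-section a-spacing-none mojo-performance-summary-table">'
    '<div>placeholder</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact attribute string: matches
exact = soup.find(class_="a-section a-spacing-none mojo-performance-summary-table")

# Same classes in a different order: no match
reordered = soup.find(class_="mojo-performance-summary-table a-section a-spacing-none")

# A single class name matches regardless of the others
single = soup.find(class_="mojo-performance-summary-table")

# A CSS selector matches the classes in any order
css = soup.select_one("div.mojo-performance-summary-table.a-section")

print(exact is not None, reordered is None, single == exact, css == exact)
```

If a site reorders its classes, searching on the one distinctive class name (or using a CSS selector) tends to be more robust.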

In [None]:
left_entry = left_info.find_all('div')
#each_entry[0].get_text(strip=True)
for entry in left_entry:
    print(entry.get_text(" ",strip=True).split(" "))
    #print(entry.get_text(strip=True))

In [None]:
right_info = soup_doc2.find(class_="a-section a-spacing-none mojo-summary-values mojo-hidden-from-mobile")
each_field = right_info.find_all('div')
# the loop below expects each <div> to hold a label <span> followed by a value <span>
for field in each_field:
    fields = field.find_all('span')
    print(fields[0].string)
    print(fields[1].get_text("||",strip=True))
    print("-----------")

Now I'm going to put all of those findings together, and make a **dictionary** out of the information posted on that page.

In [None]:
avengers_dict = {}
for entry in left_entry:
    data_string = entry.get_text(" ",strip=True)
    data_list = data_string.split(" ")
    category = data_list[0].strip().lower()
    value = data_list[-1].strip()
    avengers_dict[category] = value
for field in each_field:
    fields = field.find_all('span')
    category = fields[0].string.strip().lower().replace(' ','_')
    value = fields[1].get_text("||", strip=True)
    avengers_dict[category] = value
avengers_dict
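The label/value half of that dictionary-building pattern can be sketched offline. Everything in the snippet is hypothetical: the `summary-values` class name, the labels, and the values are placeholders, not data from Box Office Mojo. The sketch also skips any row that lacks a value `<span>`, a guard the cell above does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each inner <div> holds a label <span> and a value <span>
html = """
<div class="summary-values">
  <div><span>Genres</span><span>Action <br/>Adventure</span></div>
  <div><span>Running Time</span><span>1 hr 30 min</span></div>
  <div><span>Unlabeled</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find(class_="summary-values")

movie_info = {}
for field in container.find_all("div"):
    spans = field.find_all("span")
    if len(spans) < 2:
        continue  # skip rows that lack a value <span>
    # Normalize the label into a dictionary key, e.g. "Running Time" -> "running_time"
    key = spans[0].get_text(strip=True).lower().replace(" ", "_")
    # Join multi-part values with "||" so they can be split apart later
    value = spans[1].get_text("||", strip=True)
    movie_info[key] = value

print(movie_info)
```

A value such as `movie_info["genres"]` can then be split on `"||"` to get a list.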

In [20]:
avengers_dict['genres']

NameError: name 'avengers_dict' is not defined