## SCRAPING HTML
HTML ("Hypertext Markup Language") is the language of the World Wide Web. It is a very simple language that uses containers (like `<html></html>`) to tell a browser what to display and how to display it.

A note on browsers: from now on you should be using Chrome. If you do not have Chrome installed on your computer, install it now before you go any further. Chrome's developer tools are by far the best and most reliable.

This is how the Internet works (in a simplified way): you go to a page by typing a URL into the browser. The browser sends an HTTP request for the file at that URL to a server. The file that arrives at your browser is an HTML file--the browser reads the HTML and displays what is supposed to be displayed, and also runs some scripts in the background. Most often, the page you see in your browser is an HTML page. (There are many exceptions, like direct PDF files, as well as data accessed via APIs.)

HTML is the raw text source code of what you see in a browser. In Chrome you can view the raw HTML by either going to the menu bar and choosing--View: Developer: View Source -- or right-clicking (control-clicking) the mouse on a page and choosing View Source. Like so:

<img src="http://floatingmedia.com/columbia/viewsource.png">

Don't panic! While HTML can be very disorienting at first look, there are more targeted and helpful ways to investigate it. The best one is Chrome's "inspect" function. Right-click (or control-click) on the part of the page that interests you, and select "Inspect" or "Inspect Element"--and you get the much more friendly developer tools way of navigating through the DOM tree:

<img src="http://floatingmedia.com/columbia/inspect.png">

Did I say DOM tree? Yes, the DOM ([document object model](https://www.w3schools.com/js/js_htmldom.asp)) is a term for the hierarchical structure of HTML elements on a page. It is a tree because each element on a page is nested inside other HTML tags.

<img src="http://floatingmedia.com/columbia/treeStructure.png">

Here are the most common tags, and often the most helpful tags to use when navigating through an HTML page.

`<h1>`, `<h2>`, `<h3>` headers
`<p>` paragraph
`<b>`, `<i>`, `<strong>` styles, like bold, italics...
`<table><tr><td>` table elements including rows and cells
`<a href="url">` links
`<div>`, `<span>` larger element containers; these often have an id="name" and/or class="name" attribute attached to them.
`<ol>`,`<ul>`,`<li>` ordered and unordered lists

For example: `<p>This would be a paragraph</p>`
`<p>This would be a <b>paragraph</b></p>` Same thing but the word paragraph is bold

Sometimes important information is hidden inside these tags:


`<span class="year">`2010`</span>`

or 

`<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">`more info`</a>`

In this case the "class" attribute is likely adding styling information (see CSS), whereas the "href" attribute holds a hyperlinked URL. [For a more complete list of HTML tags click here](https://www.w3schools.com/tags/ref_byfunc.asp)


**Why does this matter to us?**

Note that each tag begins with `<tagname>` and ends with `</tagname>`. So these HTML tags are structuring the text. The reason for the structure is just to tell the browser how everything should look, but we can also use the structure of HTML to programmatically traverse the data on a webpage and scrape out the information we need. This is what scraping is. HTML is not a reliable data structure, but it is often consistent from page to page on a particular website. If you can learn how to navigate the DOM tree, you can turn information on a messy webpage into reliable and searchable data.
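To make the "tree" idea concrete before Beautiful Soup comes in, here is a minimal sketch that uses only Python's built-in `html.parser` module to print each tag indented by how deeply it is nested. The HTML snippet and the `TreePrinter` class name are made up for illustration; they are not part of the class page.

```python
from html.parser import HTMLParser

# Made-up snippet, shaped like the movie lists used later in these notes.
SAMPLE_HTML = '<div><h1>Top Movies</h1><p><b>Iron Man</b> <span class="year">2008</span></p></div>'


class TreePrinter(HTMLParser):
    """Collects one line per tag or piece of text, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag sits at the current depth, then we go one level deeper.
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Text lives inside whatever tag is currently open; skip pure whitespace.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
printer.close()
tree_lines = printer.lines
print("\n".join(tree_lines))
```

The indentation in the printed result is the same parent/child structure you see when you expand and collapse elements in Chrome's Inspect panel.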



## My notes::

#### on DOM: An object in javascript = a dictionary in python. everything is nested inside the keys.

the tag is the key. 
When you're in Chrome: 
Go to : View>>Developer>>View Source

Note on the following: 
If you're going to start a new python page for scraping, you should have the following code first:

You can get the ORIGINAL FILE here: https://github.com/jthirkield/LedeProgram/blob/master/beautiful_soupCOMMENTS.ipynb

**I've altered the original, so some of his notes may be out of order or not make sense. You can download his to change it by right-clicking on Raw and saving it, then opening it as a Python notebook.**

## This brings us to Beautiful Soup
Beautiful Soup is a Python library that parses HTML, allowing us to navigate through the elements of a webpage using the HTML tags embedded in it. [Here is the link to the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/); it has examples and extensions beyond what is demonstrated below.

Now it's time to install Beautiful Soup. Go to your terminal/shell/bash and type:

`pip3 install bs4`

(*we already did this last class so you should be set*)


We will begin by navigating a very simple HTML page I have posted on my website. [Please follow this link](http://floatingmedia.com/columbia/topfivelists.html) and try inspecting the HTML using Chrome. (P.S. The information on this page comes from [http://www.boxofficemojo.com/genres/chart/?id=comicbookadaptation.htm])

The cells below use Python's built-in `urllib` library to fetch the file from the web. `raw_html` holds the contents of that file.

Printing raw_html gives us all the text that was in the file.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
#import the libraries you need

After importing Beautiful Soup we create a new variable called soup_doc. In that variable we transform the text that we downloaded from the URL (the text in raw_html) into a Beautiful Soup object that we can investigate with built-in functions (like .find()).

In [None]:
raw_html = urlopen("http://floatingmedia.com/columbia/topfivelists.html").read()
#this is the first real action we're going to do. we're going to get our data.
soup_doc = BeautifulSoup(raw_html, "html.parser")
#here we're parsing it with html parser and saving it in the variable soup_doc
print(type(soup_doc))

In [None]:
print(type(raw_html))
#this tells us raw_html is a bytes object: the raw, unparsed contents of the file

In [None]:
print(raw_html)
#the raw HTML file. remember we saved the download in raw_html; it's just text, not yet something we can search by tag
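To see the difference between the raw download and the parsed object without touching the network, here is a minimal sketch. `sample_bytes` is a tiny made-up stand-in for `raw_html`, not the real class page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for what urlopen(...).read() gives back: plain bytes.
sample_bytes = b"<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"

# Parsing turns those bytes into an object we can search by tag.
sample_soup = BeautifulSoup(sample_bytes, "html.parser")

print(type(sample_bytes))                 # bytes: characters only, no structure
print(type(sample_soup))                  # a BeautifulSoup object with .find(), .find_all(), ...
print(sample_bytes.decode("utf-8")[:6])   # bytes can be decoded into a normal string
print(sample_soup.find("p").string)       # searching only works on the parsed object
```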

In [None]:
print(soup_doc.prettify())
#or: print(soup_doc)
#this just makes it look nice

Here is a very simple example of beautiful soup's built in functions. .title is shorthand for .find('title'). It finds the first title tag in the document. If we append .string we get the text inside the tag.

In [None]:
soup_doc.title

In [None]:
soup_doc.title.string

When you are first getting comfortable with beautiful soup, it is wise to use the .find() notation. find() searches for the first instance of a tag, and returns the contents of the tag as well as its tags. (Remember .string strips away the tags)

In [None]:
soup_doc.find('p')
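One thing worth trying here: `.string` only works when a tag holds a single piece of text (or a single child that does). On a tag with several children it gives back `None`, and `.get_text()` is the usual way to get all the text instead. A small sketch with a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Made-up paragraph in the same shape as the ones on the class page.
sample_html = '<p><b>Iron Man</b> <span class="year">2008</span></p>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_p = sample_soup.find("p")
print(first_p)               # the whole tag, including the tags nested inside it

whole_string = first_p.string        # None: <p> has more than one child
bold_string = first_p.b.string       # the <b> tag holds exactly one string
all_text = first_p.get_text()        # every piece of text inside <p>, joined together

print(whole_string)
print(bold_string)
print(all_text)
```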

Most often when we are scraping HTML simply finding one tag will not do the trick. We need to navigate hierarchically down the tree of nested tags. We'll begin by searching the first list that is contained in the first `<div>` tag. Since we are looking for the first occurrence, we can use .find()

In [None]:
soup_doc.find('div')
#soup_doc.div  # would also work

In [None]:
soup_doc.find_all('div')[0]
#
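The two cells above return the same tag, but the two functions behave differently when nothing matches: `find()` gives back `None`, while `find_all()` gives back an empty list. A short sketch on a made-up snippet (there is no `<table>` in it on purpose):

```python
from bs4 import BeautifulSoup

# Made-up snippet with two divs and no table.
sample_html = "<div><p>one</p></div><div><p>two</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_div = sample_soup.find("div")
all_divs = sample_soup.find_all("div")
same_tag = first_div == all_divs[0]   # find() is the first item of find_all()

missing_one = sample_soup.find("table")       # nothing matches: None
missing_all = sample_soup.find_all("table")   # nothing matches: empty list

print(same_tag, missing_one, len(missing_all))

# Because find() can return None, check before chaining another search onto it.
if missing_one is not None:
    print(missing_one.find("td"))
else:
    print("no table on this page")
```

Chaining `.find('table').find('td')` without that check is a common source of `AttributeError: 'NoneType' object has no attribute ...` errors when a page is missing the tag you expected.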

Below, we navigate the tree: starting at the outer `<div>` tag, we use find_all() to get every `<p>` tag nested inside. After that we loop through the `<p>` tags and pull out the text that is inside the `<b>` tag; using `.string` gets us just the names of the movies with no tags around them.

In [None]:
first_div = soup_doc.find('div')
#first_div Is a variable that contains all HTML in the first div 
all_paragraphs = first_div.find_all('p')
#.find_all() gives us a list
#so to search elements inside that list 
#we now need to loop through it
for movies in all_paragraphs:
    print(movies.find('b').string)
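The same pattern can be tried offline. This sketch uses a made-up two-list page (the movie names are placeholders) to show that searching inside `first_div` only returns paragraphs from that div, while searching the whole document returns everything.

```python
from bs4 import BeautifulSoup

# Made-up page: two divs, each holding its own list of movies.
sample_html = """
<div>
  <p><b>Movie A</b> <span class="year">2001</span></p>
  <p><b>Movie B</b> <span class="year">2002</span></p>
</div>
<div>
  <p><b>Movie C</b> <span class="year">2003</span></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Search only inside the first div.
first_div = sample_soup.find("div")
first_list_names = [p.find("b").string for p in first_div.find_all("p")]

# Search the whole document.
all_names = [b.string for b in sample_soup.find_all("b")]

print(first_list_names)
print(all_names)
```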

**More on find_all()**

find_all() always gives us a list of every matching tag, even if there is only one. We can use list notation to get a specific element in the list. The first cell below gives us a list of every single `<p>` tag in the document. In the following cell, we get a list of all of the `<p>` tags inside the first `<div>`. Try changing the index number `[0]` for each of these lists to see what you get.

In [None]:
soup_doc.find_all('p')[0]
# uncomment the line below to see the full list
#soup_doc.find_all('p')

In [None]:
soup_doc.find('div').find_all('p')[0]
#soup_doc.find('div').find_all('p')

Here is a one line search down the tree. Note that I specify the index number of the `<p>` tag I want to search further. Again, if I wanted to search all of the `<p>` tags, I would need to use a loop.

In [None]:
soup_doc.find('div').find_all('p')[2].find('b').string
#soup_doc.div.find_all('p')[2].b.string ##Same thing but shorter
#think about what this will get you before you return it. Why does it get you here? ((Avengers Age of Ultron.))

In [None]:
soup_doc.b
#same as writing soup_doc.find('b')
#this is just finding the FIRST ONE!
# docname.b >> returns only the first bold tag in the doc

In [None]:
soup_doc.find_all('b') 
# >> returns a list of every bold tag in the doc

These are two examples of searching the first list and pulling out the name and date of the third movie in the list [2] -- try changing that index number to get other movies in the list.

In [None]:
that_movie = soup_doc.find('div').find_all('p')[2]
#break this down and think about it. this is: in the first div, find all the paragraph tags and take [2], which is actually the third one
#save this in a variable that_movie
movie_name = that_movie.find('b')
#within that_movie, find the first 'b' (bold) tag. we know the names are in bold because we looked at the view source
#and noticed it.
movie_year = that_movie.find('span')
#find the first 'span' tag in that_movie, which we know holds the year from looking at it
print(movie_name, movie_year)

In [None]:
that_movie = soup_doc.find('div').find_all('p')[2]
#here we're saying: look in the first div tag. find all the p tags and take [2] (which is actually the third p!)
#save this in the variable that_movie
movie_name = that_movie.b
#here we're looking for the first bold tag in the variable that_movie
movie_year = that_movie.find('span')
print(movie_name.string, "||", movie_year.string)

What if you're missing bold (b) tags?

Now we get the next list (the second `<div>` or list element [1]) and then pull out all of the names and dates by using a loop.

In [None]:
next_list = soup_doc.find_all('div')[1]
print(next_list)
#here we print the whole second div before we search inside it for p tags
next_movies = next_list.find_all('p')
print(next_movies)

Here's another way to do this, breaking it down into steps.
**Note the commas between the tags (`</p>, <p>`): they mean you are looking at a list!**

In [None]:
next_list = soup_doc.find_all('div')
next_list[1]

In [None]:
next_movies = next_list[1].find_all('p')
print(next_movies)

In [None]:
for movie in next_movies:
    movie_name = movie.b
    movie_year = movie.find('span')
    print(movie_name.string, "||", movie_year.string)

In [None]:
for movie in next_movies:
    if movie.b is None:
        continue #skip any paragraph that has no bold tag
    movie_name = movie.b
    movie_year = movie.find('span')
    print(movie_name.string,"|",movie_year.string)

In [None]:
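for movie in next_movies:
    if movie.b is None:
        #no bold tag: the name is the loose text sitting right before the <span>
        movie_name = movie.find('span').previous_sibling
    else:
        movie_name = movie.b
    movie_year = movie.find('span')
    print(movie_name.string,"|",movie_year.string)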
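The fallback logic above can also be wrapped in a small helper so the loop stays short. This is only a sketch: the function name `name_and_year` and the sample paragraphs are made up, and it assumes the year always sits in a `<span class="year">`.

```python
from bs4 import BeautifulSoup

# Made-up list: one normal entry, one missing its <b> tag, one with no movie at all.
sample_html = (
    '<div>'
    '<p><b>Movie A</b> <span class="year">2001</span></p>'
    '<p>Movie B <span class="year">2002</span></p>'
    "<p>that's all</p>"
    '</div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")


def name_and_year(paragraph):
    """Return (name, year) for one <p>; empty strings when a piece is missing."""
    year_tag = paragraph.find("span", class_="year")
    if paragraph.b is not None:
        # Normal case: the name is inside the bold tag.
        name = paragraph.b.get_text()
    elif year_tag is not None and year_tag.previous_sibling is not None:
        # Fallback: the name is the loose text right before the year span.
        name = str(year_tag.previous_sibling)
    else:
        name = ""
    year = year_tag.get_text() if year_tag is not None else ""
    return name.strip(), year.strip()


results = [name_and_year(p) for p in sample_soup.find("div").find_all("p")]
print(results)
```

Returning empty strings instead of crashing is one possible design choice; skipping the paragraph with `continue`, as in the earlier cell, is just as reasonable if incomplete rows are not wanted.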

**FIND THE UNIQUE TAGS TO HELP YOU GET WHERE YOU NEED**

To get the third list, we could have gotten `<div>` [2], but because it has a unique `<ul>` parent tag--we go straight for that.

Try this: get 'Ghost in the Shell' out of that list.

In [None]:
#third_list = soup_doc.find_all('div')[2].find('li')
third_list = soup_doc.find('ul')
print(third_list)

In [None]:
#in class practice: try and get <li> Ghost In The Shell <span class="year">2017</span> out.
#it's the fourth one, so [3]. remember that list indexes start at 0.
third_list = soup_doc.find('ul')
ghost = third_list.find_all('li')[3]
print(ghost)


Here's a worse way to do this. Note that with find_all you're getting a list, so you have to specify the element again within that list (third_list[0]). This is confusing. That's why we don't do it like this. But if you can understand what's happening here it's helpful.

In [None]:
third_list = soup_doc.find_all('ul')
ghost = third_list[0].find_all('li')[3]
print(ghost)

In [None]:
#the movie name is not inside its own tag: it is loose text sitting right before the <span> that holds the year.
#so we find the span and step back one sibling with .previous_sibling to get that text.
third_list = soup_doc.find('ul')
ghost = third_list.find_all('li')[3]
ghost_name = ghost.find('span').previous_sibling
print(ghost_name)

**Parents, children, and siblings**

So far we have navigated the DOM tree from parent to child-- div > p > b

Sometimes you want to go the opposite direction: find a unique identifier inside a container and then get everything in the container. For example, as we saw, the third list has a unique `<ul>` tag. If we wanted to get everything that is inside the same parent container (the `<div>`) we could do something like this:

Remember that siblings are on the **same indent line** as one another. For example: look at the third div tag. Look at ul. h1 is the previous sibling of ul. All the ones below it are the next siblings. div is the parent.

In [None]:
soup_doc.find('ul').parent

You can also go sideways, meaning finding siblings--elements that are in the same container at the same level of the hierarchy. As you've seen, the fourth list is not in its own div. To get the fourth list we could use the unique `<h2>` to get all of the siblings that come after it.

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings()
last_list


We can even specify what kind of siblings we want to find--notice how that final "that's all" showed up above. We can search for only the next `<p>` tags. (There is also a find_previous_siblings() function that goes backwards.)

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings('p')
#this is asking us to find next siblings of h2 that are 'p' tags. note that it leaves off the h3 tag from above.
last_list

In [None]:
for movie_name in last_list:
    print(movie_name.string)
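Here is a compact, offline sketch of moving up (parent) and sideways (siblings). The snippet is made up but follows the layout described above: an `<h1>` and `<ul>`, then an `<h2>` followed by loose `<p>` tags and a closing `<h3>`, all in one `<div>`.

```python
from bs4 import BeautifulSoup

# Made-up container shaped like the third and fourth lists described above.
sample_html = (
    "<div>"
    "<h1>List three</h1>"
    "<ul><li>Movie A</li><li>Movie B</li></ul>"
    "<h2>List four</h2>"
    "<p>Movie C</p>"
    "<p>Movie D</p>"
    "<h3>that's all</h3>"
    "</div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

ul_tag = sample_soup.find("ul")
parent_name = ul_tag.parent.name                       # up: the container holding <ul>
previous_name = ul_tag.find_previous_sibling().name    # sideways, backwards: nearest tag before <ul>

h2_tag = sample_soup.find("h2")
after_h2 = [tag.name for tag in h2_tag.find_next_siblings()]           # every tag after <h2>
only_p = [tag.string for tag in h2_tag.find_next_siblings("p")]        # only the <p> tags after <h2>
before_h2 = [tag.name for tag in h2_tag.find_previous_siblings()]      # every tag before <h2>

print(parent_name, previous_name)
print(after_h2)
print(only_p)
print(before_h2)
```

On real pages the plain `.previous_sibling` and `.next_sibling` attributes often land on the whitespace between tags, because that whitespace counts as a text sibling. The `find_next_siblings()` and `find_previous_sibling()` functions skip over text and return tags only, which is why they are used here.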

## Searching id and class

Most websites these days use id and class attributes to style (and run code) on their webpages. These can be some of the most helpful attributes to search for to find certain types/groups of information.

**Classes** can be repeated, and are almost always used for styles. E.g. you might have five div tags with the same class. That means those five boxes on the website have the same font, background color, etc.

**IDs** are supposed to be unique; one thing has one ID. Mostly they're used to tell Javascript this is where you want script to happen. If you see a box with things changing, there's an ID in that box that is putting things in the box and changing it.

In [73]:
#Finds all tags with the class "year"
all_years = soup_doc.find_all(class_="year")
print(all_years)
#Try printing out just the years without any tags around them

[<span class="year">2010</span>, <span class="year">2008</span>, <span class="year">2015</span>, <span class="year">2012</span>, <span class="year">2013</span>, <span class="year">2008</span>, <span class="year">2008</span>, <span class="year">1978</span>, <span class="year">2013</span>, <span class="year">2003</span>, <span class="year">1991</span>, <span class="year">1994</span>, <span class="year">1993</span>, <span class="year">2017</span>, <span class="year">2004</span>]


In [74]:
#Finds any tag that has an id attribute in it
fav = soup_doc.find_all(id=True)
print(fav)

[<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>]


In [None]:
fav = soup_doc.find_all(id='favorite')
print(fav)

You can also search for any kind of **attribute beyond id and class**, and you can specify what kind of tag you want to look for that attribute in. This is very helpful for zoning in on specific parts of the webpage.

In [75]:
fav = soup_doc.find_all('p', attrs={'id': 'favorite'})
print(fav)
#find me the paragraph tag with the id attribute 'favorite' (this is in case there are some div tags with 'favorite' out there 
#as well, which would be bad web building but it happens)

[<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>]
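The searches in this section can be compared side by side on a small made-up snippet. The extra `<div class="year">` is there on purpose, as an assumption about a messy page, to show why naming the tag along with the class or id can matter.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two movies plus an unrelated div that reuses the "year" class.
sample_html = """
<p><b>Movie A</b> <span class="year">2001</span></p>
<p id="favorite"><b>Movie B</b> <span class="year">1978</span></p>
<div id="sidebar" class="year">not a year</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Any tag with class "year": this also picks up the div.
any_year = [tag.string for tag in sample_soup.find_all(class_="year")]

# Only <span> tags with class "year": just the years, converted to numbers.
span_years = [int(tag.string) for tag in sample_soup.find_all("span", class_="year")]

# Every tag that has an id attribute at all, and the value of each id.
all_ids = [tag["id"] for tag in sample_soup.find_all(id=True)]

# One specific id, limited to <p> tags.
favorite = sample_soup.find("p", attrs={"id": "favorite"})
favorite_name = favorite.b.string

print(any_year)
print(span_years)
print(all_ids)
print(favorite_name)
```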


## **Pulling out attributes**

Not only can you search by attributes but you can pull out the information hidden inside a tag. **The most common information you will want to get is a link.** Like in this tag: 
`<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a>`

**URLs are found in `<a>` tags inside the `href` attribute.**

In [84]:
first_link = soup_doc.find('a')
print(first_link)
#first, we find the 'a' tag.
#why do this instead of just finding the link? because if we want to use this for multiple links with a function (next week!)
#it'll be easier, rather than finding each link you want manually and then rewriting all the code.
#e.g. if there's one page with lots of links you want to scrape, you don't want to have to enter all the links manually
#each time. this way you can have it loop over a bunch of links and scrape them all.

<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a>


In [79]:
get_url = first_link['href']
#note: this is basically dictionary notation. turn those attributes into a dictionary
print(get_url)

http://www.boxofficemojo.com/movies/?id=avengers11.htm


In [82]:
first_link = soup_doc.find('a')
get_url = first_link['href']
print(get_url)
#altogether now

http://www.boxofficemojo.com/movies/?id=avengers11.htm
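When you collect every link on a page, two things tend to come up: some `<a>` tags have no `href` at all, and some `href` values are relative (they start with `/` and leave out the site name). The sketch below handles both using `.get('href')` and `urljoin` from Python's standard library. The snippet is made up, and `page_url` is assumed to be the address the HTML was downloaded from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: this is the address of the page the links were found on.
page_url = "http://www.boxofficemojo.com/movies/?id=avengers11.htm"

# Made-up snippet: a relative link, a full link, and an <a> tag with no href.
sample_html = (
    '<p><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></p>'
    '<p><a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>'
    '<p><a name="top">an anchor with no href</a></p>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

full_urls = []
for link in sample_soup.find_all("a"):
    # .get() returns None when the attribute is missing; link['href'] would raise a KeyError.
    href = link.get("href")
    if href is None:
        continue
    # urljoin fills in the site name for relative links and leaves full links alone.
    full_urls.append(urljoin(page_url, href))

print(full_urls)
```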


**Using that link!**

Here I am taking a real link to box office mojo and scraping a table with some basic information about the movie--and turning that info into a dictionary! I will cover this in class on Thursday.

## my notes:

### TABLES 
are basically rows and columns. Open http://www.boxofficemojo.com/movies/?id=avengers11.htm. Inspect the page. Look at the table tag and what the attributes are inside it (e.g. border, cellspacing, cellpadding, bgcolor, width are all attributes). Which attributes are unique? (another note: #dcdcdc is a hexadecimal color code.)

In [None]:
raw_html2 = urlopen(get_url).read()
#NOTE this NEW VARIABLE called raw_html2
soup_doc2 = BeautifulSoup(raw_html2, "html.parser")
#this is a NEW VARIABLE for soup doc, parsing our new url properly
print(soup_doc2.prettify())
#make it fancy like and look at it so you know there's nothing wrong.

In this next section, we're using the UNIQUE ATTRIBUTES of the table to find what we want. There are a couple of ways to do this, because this table seems to have several unique attributes: bgcolor, width, etc. Let's try it with a couple of them.

In [86]:
my_table = soup_doc2.find("table", attrs={"bgcolor": "#dcdcdc"})
print(my_table)

<table bgcolor="#dcdcdc" border="0" cellpadding="4" cellspacing="1" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2012-05-04&amp;p=.htm">May 4, 2012</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Action / Adventure</b></td><td valign="top">Runtime: <b>2 hrs. 22 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>PG-13</b></td><td valign="top">Production Budget: <b>$220 million</b></td></tr></table>


In [87]:
my_table = soup_doc2.find("table", attrs={"width": "95%"})
print(my_table)
#this is the SAME TABLE. we're just getting at it a different way, via the attribute width rather than bg color. it doesn't
#matter how you do it.
#save it in a new variable.

<table bgcolor="#dcdcdc" border="0" cellpadding="4" cellspacing="1" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2012-05-04&amp;p=.htm">May 4, 2012</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Action / Adventure</b></td><td valign="top">Runtime: <b>2 hrs. 22 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>PG-13</b></td><td valign="top">Production Budget: <b>$220 million</b></td></tr></table>


**tr** gives you the row, **td** has all the information (according to thirkield, who knows lots of things)

In [88]:
each_entry = my_table.find_all('td')
each_entry
#find all the tds in this new table that you found. this has all seven elements of the list.

[<td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td>,
 <td valign="top">Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td>,
 <td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2012-05-04&amp;p=.htm">May 4, 2012</a></nobr></b></td>,
 <td valign="top">Genre: <b>Action / Adventure</b></td>,
 <td valign="top">Runtime: <b>2 hrs. 22 min.</b></td>,
 <td valign="top">MPAA Rating: <b>PG-13</b></td>,
 <td valign="top">Production Budget: <b>$220 million</b></td>]

Examine what you just got above. It looks like everything that matters is in bold 'b' tags.

In [89]:
print(each_entry[0])

<td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td>


In [91]:
for entry in each_entry:
    the_data = entry.find('b')
    #having observed that important things are in bold, we look for the 'b' tag inside each entry
    the_category = the_data.previous_sibling
    #the label (e.g. "Genre: ") is the loose text sitting right before the <b> tag, so it is the previous sibling of the_data
    print(the_data.string)
    print(the_category)

$623,357,910
Domestic Total Gross: 
Buena Vista
Distributor: 
May 4, 2012
Release Date: 
Action / Adventure
Genre: 
2 hrs. 22 min.
Runtime: 
PG-13
MPAA Rating: 
$220 million
Production Budget: 


Hmm. This looks like a backwards **dictionary**. Let's make it into a forward dictionary and save it as a structured piece of data.

In [92]:
avengers_dict = {}
for entry in each_entry:
    the_data = entry.find('b')
    the_category = the_data.previous_sibling
    #the category is the key, the data is the value. for each one of these we're making a category based on what we 
    #scraped, and the value is the data string.
    data_string = the_data.string
    the_category = the_category[:-2].replace(' ','')
    #this is just cleaning it up, chopping off the last two characters and removing the spaces. comment it out and see 
    #what it looks like without it.
    avengers_dict[the_category] = data_string
avengers_dict


{'Distributor': 'Buena Vista',
 'DomesticTotalGross': '$623,357,910',
 'Genre': 'Action / Adventure',
 'MPAARating': 'PG-13',
 'ProductionBudget': '$220 million',
 'ReleaseDate': 'May 4, 2012',
 'Runtime': '2 hrs. 22 min.'}

In [95]:
avengers_dict['Genre']
#now we can treat it as a dictionary and pull things out like normal.

'Action / Adventure'
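Since the live page may change or disappear, here is an offline sketch of the same table-to-dictionary idea, using a shortened copy of the table printed earlier. The function name `table_to_dict` is made up, and cleaning the label with `strip()` and `rstrip(':')` is an alternative to slicing off the last two characters, not the method used in class. The last step shows one way to turn the scraped dollar string into a number.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table output shown above (three of the seven cells).
TABLE_HTML = (
    '<table bgcolor="#dcdcdc" width="95%">'
    '<tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">'
    'Domestic Total Gross: <b>$623,357,910</b></font></td></tr>'
    '<tr bgcolor="#ffffff"><td valign="top">Genre: <b>Action / Adventure</b></td>'
    '<td valign="top">Runtime: <b>2 hrs. 22 min.</b></td></tr>'
    '</table>'
)


def table_to_dict(table):
    """Build {label: value} from cells shaped like 'Label: <b>value</b>'."""
    result = {}
    for cell in table.find_all("td"):
        value_tag = cell.find("b")
        if value_tag is None:
            continue  # no bold value in this cell, nothing to record
        label = value_tag.previous_sibling
        if label is None:
            continue  # no label text right before the value
        # Trim whitespace, drop the trailing colon, remove inner spaces.
        key = str(label).strip().rstrip(":").replace(" ", "")
        result[key] = value_tag.get_text()
    return result


sample_soup = BeautifulSoup(TABLE_HTML, "html.parser")
sample_table = sample_soup.find("table", attrs={"bgcolor": "#dcdcdc"})
movie_info = table_to_dict(sample_table)
print(movie_info)

# Scraped values are strings; strip the symbols to do math with them.
gross_dollars = int(movie_info["DomesticTotalGross"].replace("$", "").replace(",", ""))
print(gross_dollars)
```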