## This brings us to Beautiful Soup
Beautiful Soup is a Python library that parses HTML, allowing us to navigate through the elements of a webpage using the HTML tags embedded in it. [The documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) has examples and extensions beyond what is demonstrated below.

Now it's time to install Beautiful Soup. Open your terminal (shell) and type:

`pip3 install bs4`

We will begin by navigating a very simple HTML page I have posted on my website. [Please follow this link](http://floatingmedia.com/columbia/topfivelists.html) and try inspecting the HTML using Chrome. (P.S. The information on this page comes from [http://www.boxofficemojo.com/genres/chart/?id=comicbookadaptation.htm](http://www.boxofficemojo.com/genres/chart/?id=comicbookadaptation.htm).)

The cell below uses the built-in `urllib` library to download the file from the web. `raw_html` holds the contents of that file.

In the cells after that, we check the type of `raw_html` and then print it, which shows us everything that was in the file.

In [None]:
from urllib.request import urlopen
raw_html = urlopen("http://floatingmedia.com/columbia/topfivelists.html").read()

In [None]:
print(type(raw_html))

In [None]:
print(raw_html)
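One detail worth knowing here: `read()` returns a `bytes` object rather than a `str`, which is why the printed output starts with a `b`. The sketch below uses a made-up byte string (`sample_bytes`, not the real page) to show how `.decode()` turns bytes into ordinary text. It assumes the content is UTF-8 encoded.

```python
# Hypothetical stand-in for the bytes that urlopen(...).read() returns.
sample_bytes = b"<html><head><title>Sample Page</title></head></html>"
print(type(sample_bytes))

# Decoding converts bytes to a regular string (assuming UTF-8 content).
sample_text = sample_bytes.decode("utf-8")
print(type(sample_text))
print(sample_text)
```

Beautiful Soup accepts either form, so decoding before parsing is optional.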

Now we import Beautiful Soup and create a new variable called `soup_doc`. In that variable we transform the text that we downloaded from the URL (the text in `raw_html`) into a Beautiful Soup object that we can investigate with built-in methods (like `.find()`).

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc))

In [None]:
print(soup_doc.prettify())
#print(soup_doc)

Here is a very simple example of Beautiful Soup's built-in methods. `.title` is shorthand for `.find('title')`: it finds the first `<title>` tag in the document. If we append `.string`, we get the text inside the tag.

In [None]:
soup_doc.title

In [None]:
soup_doc.title.string
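If you want to check the shorthand without downloading anything, here is a minimal sketch on a made-up HTML string (`sample_html` and its contents are invented for illustration).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not the page used in this notebook.
sample_html = "<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title          # shorthand
same_tag = sample_soup.find("title")   # long form

print(title_tag)              # the whole tag, including <title>...</title>
print(title_tag == same_tag)  # both notations find the same element
print(title_tag.string)       # just the text inside the tag
```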

When you are first getting comfortable with Beautiful Soup, it is wise to use the `.find()` notation. `find()` searches for the first instance of a tag and returns the whole element: its opening and closing tags along with everything between them. (Remember, `.string` strips away the tags.)

In [None]:
soup_doc.find('p')
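One thing to watch for: when `find()` cannot find the tag, it returns `None` instead of raising an error. The sketch below, on a made-up snippet of HTML, shows both outcomes and one possible way to guard against the missing case.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
sample_soup = BeautifulSoup("<div><p>First</p><p>Second</p></div>", "html.parser")

first_p = sample_soup.find("p")
print(first_p)   # only the first <p>, tags included

missing = sample_soup.find("table")
print(missing)   # there is no <table>, so find() gives back None

# Guard before chaining, so a missing tag does not raise an AttributeError.
if missing is None:
    print("No <table> tag in this document")
```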

Most often when we are scraping HTML, simply finding one tag will not do the trick. We need to navigate hierarchically down the tree of nested tags. We'll begin by searching for the first list, which is contained in the first `<div>` tag. Since we are looking for the first occurrence, we can use `.find()`.

In [None]:
soup_doc.find('div')
#soup_doc.div  # would also work

Below, we navigate the tree: we start at the outer `<div>` tag and then use `find_all()` to get every `<p>` tag nested inside. After that we loop through the `<p>` tags and pull out the text inside each `<b>` tag. Using `.string` gets us just the name of each movie with no tags around it.

In [None]:
first_div = soup_doc.find('div')
# first_div is a variable that contains all the HTML in the first div
all_paragraphs = first_div.find_all('p')
# .find_all() gives us a list,
# so to search the elements inside that list
# we now need to loop through it
for movie in all_paragraphs:
    print(movie.find('b').string)
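To see the kind of nesting this loop relies on, here is a sketch on made-up HTML with the same div > p > b structure. The movie names and years are placeholders, not the ones on the real page.

```python
from bs4 import BeautifulSoup

# Placeholder HTML that imitates the div > p > b nesting described above.
sample_html = """
<div>
  <p><b>Movie A</b> <span class="year">2001</span></p>
  <p><b>Movie B</b> <span class="year">2002</span></p>
</div>
<div>
  <p><b>Movie C</b> <span class="year">2003</span></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() stops at the first <div>, so Movie C is never visited.
sample_first_div = sample_soup.find("div")
names = [str(p.find("b").string) for p in sample_first_div.find_all("p")]
print(names)
```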

**More on find_all()**

If there is more than one of the same tag, `find_all()` gives us a list of all of them. We can use list notation to get a specific element in the list. The first cell below gives us a list of every single `<p>` tag in the document. In the following cell, we get a list of all of the `<p>` tags inside the first `<div>`. Try changing the index number `[0]` for each of these lists to see what you get.

In [None]:
soup_doc.find_all('p')[0]
# uncomment the line below to see the full list
#soup_doc.find_all('p')

In [None]:
soup_doc.find('div').find_all('p')[0]
#soup_doc.find('div').find_all('p')
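The difference between searching the whole document and searching inside one `<div>` is easy to see on a small made-up example (the HTML here is invented for illustration).

```python
from bs4 import BeautifulSoup

# Made-up HTML: two <p> tags in the first <div>, one in the second.
sample_html = (
    "<div><p>one</p><p>two</p></div>"
    "<div><p>three</p></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

every_p = sample_soup.find_all("p")                  # searches the whole document
first_div_p = sample_soup.find("div").find_all("p")  # searches only the first <div>

print(len(every_p), len(first_div_p))
print(every_p[2])        # the third <p> in the document
print(first_div_p[-1])   # the last <p> inside the first <div>
```

Just like a regular Python list, asking for an index that does not exist raises an `IndexError`.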

Here is a one-line search down the tree. Note that I specify the index number of the `<p>` tag I want to search further. Again, if I wanted to search all of the `<p>` tags, I would need to use a loop.

In [None]:
soup_doc.find('div').find_all('p')[2].find('b').string
#soup_doc.div.find_all('p')[2].b.string ##Same thing but shorter

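These are two examples of searching the first list and pulling out the name and year of the third movie in the list (index `[2]`). Try changing that index number to get other movies in the list. The first cell prints the tags themselves; the second uses `.string` to print only the text.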

In [None]:
that_movie = soup_doc.find('div').find_all('p')[2]
movie_name = that_movie.find('b')
movie_year = that_movie.find('span')
print(movie_name, movie_year)

In [None]:
that_movie = soup_doc.find('div').find_all('p')[2]
movie_name = that_movie.b
movie_year = that_movie.find('span')
print(movie_name.string, "||", movie_year.string)
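A caveat about `.string`: it only works when a tag has exactly one child. If a tag contains several pieces (like a `<p>` holding both a `<b>` and a `<span>`), `.string` gives back `None`, and `get_text()` is one alternative. A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML: the <p> has several children, the <b> has just one.
sample_soup = BeautifulSoup(
    '<p><b>Movie A</b> <span class="year">2001</span></p>', "html.parser"
)
paragraph = sample_soup.find("p")

print(paragraph.b.string)    # <b> has a single text child, so .string works
print(paragraph.string)      # <p> has several children, so this is None
print(paragraph.get_text())  # all of the text inside <p>, joined together
```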

Now we get the next list (the second `<div>`, which is element `[1]` of the list that `find_all('div')` returns) and then pull out all of the names and years by using a loop.

In [None]:
next_list = soup_doc.find_all('div')[1]
print(next_list)
next_movies = next_list.find_all('p')
print(next_movies)

In [None]:
for movie in next_movies:
    movie_name = movie.b
    movie_year = movie.find('span')
    print(movie_name.string, "||", movie_year.string)
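Printing is fine for exploring, but you will usually want to keep what you scrape. Here is one possible way to collect each name and year into a list of dictionaries, sketched on made-up HTML. The structure and values are placeholders, and the `records` name is my own.

```python
from bs4 import BeautifulSoup

# Placeholder HTML imitating one of the lists.
sample_html = """
<div>
  <p><b>Movie A</b> <span>2001</span></p>
  <p><b>Movie B</b> <span>2002</span></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for movie in sample_soup.find("div").find_all("p"):
    records.append({
        "name": str(movie.b.string),               # plain Python string
        "year": int(movie.find("span").string),    # convert the year to a number
    })
print(records)
```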

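To get the third list, we could have gotten `<div>` `[2]`, but because its items sit inside a `<ul>` tag that appears nowhere else on the page, we go straight for that. Note that `find_all('ul')` still returns a list, even when there is only one `<ul>`.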

Try this: get 'Ghost in the Shell' out of that list.

In [None]:
#third_list = soup_doc.find_all('div')[2].find('li')
third_list = soup_doc.find_all('ul')
print(third_list)

**Parents, children, and siblings**

So far we have navigated the DOM tree from parent to child: div > p > b.

Sometimes you want to go in the opposite direction: find a unique identifier inside a container and then get everything in the container. For example, as we saw, the third list has a unique `<ul>` tag. If we wanted to get everything that is inside the same parent container (the `<div>`), we could do something like this:

In [None]:
soup_doc.find('ul').parent
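Here is the same move on a small made-up snippet, so you can see exactly what `.parent` hands back. The `id` and the contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a <ul> sitting inside a <div> next to a heading.
sample_html = "<div id='box'><h3>Heading</h3><ul><li>Item 1</li><li>Item 2</li></ul></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Start at the <ul> and step up one level to its container.
container = sample_soup.find("ul").parent

print(container.name)               # the tag name of the parent
print(container["id"])              # the parent's id attribute
print(container.find("h3").string)  # other elements in the container are now reachable
```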

You can also go sideways, meaning finding siblings: elements that are in the same container at the same level of the hierarchy. As you've seen, the fourth list is not in its own `<div>`. To get the fourth list, we can use the unique `<h2>` and get all of the siblings that come after it.

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings()
last_list


We can even specify what kind of siblings we want to find. Notice how the final "that's all" showed up in the output above. To leave it out, we can search for only the `<p>` tags that follow. (There is also a `find_previous_siblings()` method that goes backwards.)

In [None]:
last_head = soup_doc.find('h2')
last_list = last_head.find_next_siblings('p')
last_list
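Sibling searches in both directions can be tried out on a small made-up snippet. The HTML below is invented; it only imitates a heading followed by a few elements at the same level.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a heading, two paragraphs, and a trailing <div>, all siblings.
sample_html = "<body><h2>Last list</h2><p>Movie A</p><p>Movie B</p><div>that's all</div></body>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
heading = sample_soup.find("h2")

all_after = heading.find_next_siblings()     # every tag after the heading
only_p = heading.find_next_siblings("p")     # only the <p> tags after it
print([tag.name for tag in all_after])
print([tag.name for tag in only_p])

# Going backwards: start at the <div> and collect the <p> tags before it.
# The results come back nearest-first.
last_div = sample_soup.find("div")
before = last_div.find_previous_siblings("p")
print([str(tag.string) for tag in before])
```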

**Searching id and class**

Most websites these days use `id` and `class` attributes to style their pages and to hook in code. These can be some of the most helpful attributes to search for when you want to find certain types or groups of information.

In [None]:
#Finds all tags that have the class "year"
all_years = soup_doc.find_all(class_="year")
print(all_years)
#Try printing out just the years without any tags around them

In [None]:
#Finds any tag that has an id attribute in it
fav = soup_doc.find_all(id=True)
print(fav)

In [None]:
fav = soup_doc.find_all(id='favorite')
print(fav)

You can also search for any kind of attribute beyond `id` and `class`, and you can specify what kind of tag you want to look for that attribute in. This is very helpful for zeroing in on specific parts of the webpage.

In [None]:
fav = soup_doc.find_all('p', attrs={'id': 'favorite'})
print(fav)
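The `attrs` dictionary is especially handy for attributes whose names contain a hyphen, since those cannot be written as Python keyword arguments. A sketch with a made-up `data-rank` attribute (both the attribute and the HTML are invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML using a hypothetical data-rank attribute.
sample_html = (
    '<p data-rank="1" class="movie">Movie A</p>'
    '<p data-rank="2" class="movie">Movie B</p>'
    '<span data-rank="1">not a paragraph</span>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only <p> tags whose data-rank is "1".
top_ranked = sample_soup.find_all("p", attrs={"data-rank": "1"})
print(top_ranked)

# Leaving out the tag name matches any tag with that attribute value.
any_rank_one = sample_soup.find_all(attrs={"data-rank": "1"})
print(len(any_rank_one))
```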

**Pulling out attributes**

Not only can you search by attributes, but you can also pull out the information stored inside a tag's attributes. The most common information you will want to get is a link, like in this tag:
`<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a>`

URLs are found in `<a>` tags inside the `href` attribute.

In [None]:
first_link = soup_doc.find('a')
get_url = first_link['href']
print(get_url)
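When you want every link on a page, not just the first, keep in mind that some `<a>` tags have no `href` at all, and `tag['href']` raises a `KeyError` in that case. Here is a sketch of two ways around that, using made-up links on `example.com`.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two real links and one <a> tag without an href.
sample_html = (
    '<a href="https://example.com/a">first</a>'
    '<a href="https://example.com/b">second</a>'
    '<a name="anchor-only">no link here</a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

for link in sample_soup.find_all("a"):
    # .get() returns None instead of raising KeyError when href is absent.
    print(link.get("href"), "|", link.string)

# href=True keeps only the <a> tags that actually have an href attribute.
links_with_href = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(links_with_href)
```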

**Using that link!**

Here I am taking a real link to Box Office Mojo and scraping a table with some basic information about the movie, then turning that info into a dictionary! I will cover this in class on Thursday.

In [None]:
raw_html2 = urlopen(get_url).read()
soup_doc2 = BeautifulSoup(raw_html2, "html.parser")
print(soup_doc2.prettify())

In [None]:
my_table = soup_doc2.find("table", attrs={"bgcolor": "#dcdcdc"})
print(my_table)

In [None]:
each_entry = my_table.find_all('td')
each_entry

In [None]:
print(each_entry[0])

In [None]:
for entry in each_entry:
    the_data = entry.find('b')
    the_category = the_data.previous_sibling
    print(the_data.string)
    print(the_category)

In [None]:
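avengers_dict = {}
for entry in each_entry:
    # the value we want sits inside the <b> tag
    the_data = entry.find('b')
    # the label is the text that comes just before the <b> tag
    the_category = the_data.previous_sibling
    data_string = the_data.string
    # drop the last two characters of the label and remove any spaces
    the_category = the_category[:-2].replace(' ','')
    avengers_dict[the_category] = data_string
avengers_dict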

In [None]:
avengers_dict['Genre']
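The dictionary-building pattern can also be tried offline. The sketch below runs it on a made-up table where each cell holds a label followed by a bold value; the labels and values are placeholders, not real Box Office Mojo data. It also swaps the `[:-2]` slice for `strip()` and `rstrip(':')`, which is one possible way to clean the label without counting characters, and it skips cells that have no `<b>` tag.

```python
from bs4 import BeautifulSoup

# Placeholder table: each <td> is "Label: <b>value</b>".
sample_table = (
    "<table><tr>"
    "<td>Genre: <b>Action</b></td>"
    "<td>Runtime: <b>2 hrs.</b></td>"
    "<td>no bold tag here</td>"
    "</tr></table>"
)
sample_soup = BeautifulSoup(sample_table, "html.parser")

info = {}
for cell in sample_soup.find_all("td"):
    value_tag = cell.find("b")
    # Skip cells without a <b> tag or without text in front of it.
    if value_tag is None or value_tag.previous_sibling is None:
        continue
    label = str(value_tag.previous_sibling).strip().rstrip(":")
    info[label] = value_tag.get_text(strip=True)

print(info)
print(info["Genre"])
```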