# Python Web Scraping Tutorial using BeautifulSoup

## The components of a web page


When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

- HTML: contains the main content of the page.
- CSS: adds styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

### HTML


HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

~~~
<html>
</html>
~~~

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

~~~
<html>
    <head>
    </head>
    <body>
    </body>
</html>
~~~

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:

~~~
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
~~~

Tags have commonly used names that depend on their position in relation to other tags:

- **child**: a child is a tag inside another tag. So the two p tags above are both children of the body tag.
- **parent**: a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
- **sibling**: a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add attributes to HTML tags that change their behavior:

~~~
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
</html>
~~~

In the above example, we added two `a` tags. The `a` tag defines a link and tells the browser to render a link to another web page. The `href` attribute of the tag determines where the link goes.

The `a` and `p` tags are extremely common HTML tags. Here are a few others:

- *div* — indicates a division, or area, of the page.
- *b* — bolds any text inside.
- *i* — italicizes any text inside.
- *table* — creates a table.
- *form* — creates an input form.


For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

Two special attributes, **class** and **id**, give HTML elements names and make them easier to interact with when we’re scraping.

- One element can have multiple classes, and a class can be shared between elements. 
- Each element can only have one id, and an id can only be used once on a page. 
- Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

~~~
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
~~~

## Apply this knowledge

Let's go to a webpage and change some of the HTML to see what happens. 

- Open up a webpage in Chrome.
- Right-click the page and select 'Inspect element'.
- Change the text in a tag.
- Change the class of a tag.
- Find a tag with an ID and remove it.



## Web scraping with Python

### The requests library

The first thing we’ll need to do to scrape a web page is to download the HTML of the page. We can do this using the Python `requests` library.

In [1]:
import requests
page = requests.get("https://www.the-numbers.com/bankability")
page

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [2]:
page.status_code


200
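Before parsing, it can help to confirm that the download actually succeeded. The sketch below wraps the request in a small helper. The name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook. It relies on `Response.raise_for_status()`, which raises an exception for 4xx and 5xx status codes.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# html_bytes = fetch_html("https://www.the-numbers.com/bankability")
```

Failing early this way keeps an error page from being handed to the parser as if it were the real content.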

We can print out the HTML content of the page using the content property:

In [3]:
page.content

b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n\r\n  gtag(\'config\', \'UA-1343128-1\');\r\n</script>\r\n<meta http-equiv="PICS-Label" content=\'(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))\'>\r\n<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta name="format-detection" content="telephone=no">   <!-- for apple mobile --> \r\n<meta property="fb:admins" content="521546213" />\r\n\r\n\r\n<meta name="viewport" content="initi

### Parsing a page with BeautifulSoup

We can use the BeautifulSoup library to parse this document, and extract the text from the `<p>` tag. 

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
</script>
<meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
<meta content="521546213" property="fb:admins">
<meta content="initial-scale=1" name="viewport"/>
<meta content="A monthly ranking of the most valuable

In [5]:
print(soup.prettify())


<!DOCTYPE html>
<html>
 <head>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
  </script>
  <meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
  <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="telephone=no" name="format-detection"/>
  <!-- for apple mobile -->
  <meta content="521546213" property="fb:admins">
   <meta content="initial-scale=1" name="viewport"/>
   <meta content="A mo

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the `children` property of `soup`.

In [6]:
list(soup.children)

['html', '\n', <html>
 <head>
 <!-- Global site tag (gtag.js) - Google Analytics -->
 <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
 <script>
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
   gtag('config', 'UA-1343128-1');
 </script>
 <meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
 <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
 <meta content="521546213" property="fb:admins">
 <meta content="initial-scale=1" name="viewport"/>
 <meta content="A monthly ranking of th

The output begins with the initial `<!DOCTYPE html>` declaration, a newline character (`\n`), and the `<html>` tag. Let’s check how many items the list holds and what the type of each one is:

In [7]:
print(len(list(soup.children)))
[type(item) for item in list(soup.children)]


32


[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.NavigableString]

In [8]:
list(soup.children)[2]

<html>
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
</script>
<meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
<meta content="521546213" property="fb:admins">
<meta content="initial-scale=1" name="viewport"/>
<meta content="A monthly ranking of the most valuable and influential 

The __`Tag` object__ allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various `BeautifulSoup` objects [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects).
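As a small illustration of these object types, the sketch below parses a short inline document (made up for this example, not taken from the page above) and keeps only the `Tag` children, skipping the doctype and whitespace strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny stand-in document; the real page is much larger.
html_doc = "<!DOCTYPE html>\n<html><head></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Show the type name of every top-level node.
for child in soup.children:
    print(type(child).__name__)

# Keep only real tags, dropping the doctype and newline strings.
tag_children = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in tag_children])
```

Filtering by type like this is less brittle than relying on a fixed list index such as `[2]`, which changes whenever the surrounding whitespace does.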

We can now select the html tag and its children by taking the third item in the list:



In [9]:
html = list(soup.children)[2]
for ctr, h in enumerate(html):
    print(h, ctr, 'NEXT \n')



 0 NEXT 

<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
</script>
<meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
<meta content="521546213" property="fb:admins">
<meta content="initial-scale=1" name="viewport"/>
<meta content="A monthly ranking of the most valuable and in

The `html` item we selected is a `Tag` object, so it has a `children` property of its own.

Now, we can find the `children` inside the `html` tag:

In [12]:
print(len(list(html.children)))
list(html.children)


4


['\n', <head>
 <!-- Global site tag (gtag.js) - Google Analytics -->
 <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
 <script>
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
   gtag('config', 'UA-1343128-1');
 </script>
 <meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
 <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
 <meta content="521546213" property="fb:admins">
 <meta content="initial-scale=1" name="viewport"/>
 <meta content="A monthly ranking of the most valuable 

As you can see above, there are two tags here, `head`, and `body`. We want to extract the text inside the `p` tag, so we’ll dive into the body:

In [13]:
body = list(html.children)[3]
sub_body = list(body)[3]
sub_body

<div id="wrap">
<div id="header">
<a href="/"><img alt="The Numbers - Where Data and Movies Meet" border="0" height="67" src="/images/the-numbers-banner.png" width="524"/>®
<br/>    Where Data and the Movie Business Meet
</a>
</div>
<div id="header_right" style="background: #663366; width:200px; height: 96px;">
<div style="float:right;">
Follow us on
<a href="https://www.facebook.com/TheNumbers" target="_blank"><img height="32" src="https://www.the-numbers.com/images/icons/facebook.png" style="border:none;" title="Follow The Numbers on Facebook" width="32"/></a>
<a href="https://www.twitter.com/MovieNumbers" target="_blank" title="Follow The Numbers on Twitter"><img height="32" src="https://www.the-numbers.com/images/icons/twitter.png" style="border:none;" width="32"/></a>
</div>
</div>
<div id="nav">
<ul>
<li><a href="/news">News</a>
<ul>
<li><a href="/news">Latest News</a></li>
<li><a href="/movies/release-schedule">Release Schedule</a></li>
<li><a href="/on-this-day">On This Day</a>

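We can now isolate a single tag by indexing into the children of `body`. On this page, the item at index 1 turns out to be a `script` tag rather than a `p` tag: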



In [14]:
p = list(body.children)[1]
p

<script>
  window.fbAsyncInit = function() {
    FB.init({
      appId            : '1585976611658631',
      autoLogAppEvents : true,
      xfbml            : true,
      version          : 'v2.9'
    });
    FB.AppEvents.logPageView();
  };

  (function(d, s, id){
     var js, fjs = d.getElementsByTagName(s)[0];
     if (d.getElementById(id)) {return;}
     js = d.createElement(s); js.id = id;
     js.src = "//connect.facebook.net/en_US/sdk.js";
     fjs.parentNode.insertBefore(js, fjs);
   }(document, 'script', 'facebook-jssdk'));
</script>

Once we’ve isolated the tag, we can use the `get_text` method to extract all of the text inside the tag:

In [15]:
p.get_text()


'\r\n  window.fbAsyncInit = function() {\r\n    FB.init({\r\n      appId            : \'1585976611658631\',\r\n      autoLogAppEvents : true,\r\n      xfbml            : true,\r\n      version          : \'v2.9\'\r\n    });\r\n    FB.AppEvents.logPageView();\r\n  };\r\n\r\n  (function(d, s, id){\r\n     var js, fjs = d.getElementsByTagName(s)[0];\r\n     if (d.getElementById(id)) {return;}\r\n     js = d.createElement(s); js.id = id;\r\n     js.src = "//connect.facebook.net/en_US/sdk.js";\r\n     fjs.parentNode.insertBefore(js, fjs);\r\n   }(document, \'script\', \'facebook-jssdk\'));\r\n'

### Finding all instances of a tag at once

Navigating one level at a time takes many commands to do something fairly simple. If we want to extract a particular kind of tag, we can instead use the `find_all` method, which finds all the instances of a tag on a page.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

Note that `find_all` returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [None]:
soup.find_all('p')[0].get_text()
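Looping over the result is often more convenient than indexing. The sketch below uses a two-paragraph inline document, invented for this example, so it runs without downloading anything.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded page.
html_doc = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <p> tag; strip=True trims surrounding whitespace.
texts = [p.get_text(strip=True) for p in soup.find_all("p")]
print(texts)
```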

If you instead only want to find the first instance of a tag, you can use the `find` method, which will return a single `Tag` object:

In [55]:
print(soup.find('p'))
print(soup.find('p').get_text())

<p>Here is some simple content for this page.</p>
Here is some simple content for this page.
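One detail worth keeping in mind: `find` returns `None` when nothing matches, so calling `get_text()` on the result directly can raise an `AttributeError`. A minimal guard might look like the sketch below; the helper name `first_text` is hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Here is some simple content for this page.</p>", "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the text of the first matching tag, or a default if none exists."""
    tag = soup.find(tag_name)
    # find() returns None when nothing matches, so check before calling get_text().
    return tag.get_text() if tag is not None else default


print(first_text(soup, "p"))
print(first_text(soup, "table", default="(no table found)"))
```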


### Searching for tags by class and id


We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to target the specific elements we want to scrape. Let's look at the following page:

~~~
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
~~~

In [16]:
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://www.the-numbers.com/bankability")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
</script>
<meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
<meta content="521546213" property="fb:admins">
<meta content="initial-scale=1" name="viewport"/>
<meta content="A monthly ranking of the most valuable

Now, we can use the `find_all` method to search for items by class or by id. In the below example, we’ll search for any `p` tag that has the class `outer-text`:

In [47]:
# Equivalent search on the sample page above:
# soup.find_all('p', class_='outer-text')

# On the bankability page, print every table cell that lists
# an actor's best-known acting roles, followed by a running count.
counter = 1
for cell in soup.find_all('td'):
    # Keep only cells whose bold label mentions acting roles.
    if cell.find('b', string=re.compile(".*Best-known acting roles:.*")):
        print(cell.text)
        print(counter)
        counter += 1

Best-known acting roles: 
Ethan Hunt (Mission: ImpossibleâFallout), Ethan Hunt (Mission: ImpossibleâRogue Nation), Ethan Hunt (Mission: ImpossibleâGhost Protocol), Ray Ferier (War of the Worlds), Famous Austin Powers (Austin Powers in Goldmember)
1
Best-known acting roles: 
Genie (Aladdin), Deadshot (Suicide Squad), Robert Neville (I am Legend), John Hancock (Hancock), Agent J (Men in Black 3)
2
Best-known acting roles: 
Tony Stark/Iron Man (Avengers: Endgame), Tony Stark/Iron Man (Avengers: Infinity War), Tony Stark/Iron Man (The Avengers), Tony Stark/Iron Man  (Avengers: Age of Ultron), Tony Stark/Iron Man (Captain America: Civil War)
3
Best-known acting roles: 
Scarlet Overkill (Minions), Ryan Stone (Gravity), Leigh Anne Tuohy (The Blind Side), Debbie Ocean (Oceanâs 8), Ashbum (The Heat)
4
Best-known acting roles: 
Herself (Contact)
5
Best-known acting roles: 
Earl Stone (The Mule), Walt Kowalski (Gran Torino), Frankie Dunn (Million Dollar Baby), Frank Corvin (Space Cowboys)

In the below example, we’ll look for any tag that has the class `outer-text`:



In [61]:
soup.find_all(class_="outer-text")


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by `id`:


In [49]:
# Count the elements on the page whose id is "col2mid".
print(len(soup.find_all(id="col2mid")))

50
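To try the class and id searches on the sample page shown earlier without downloading anything, the markup can be parsed from a string. This is a sketch that uses a compact copy of that sample page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Compact copy of the sample page from this section.
sample = """
<html>
  <head><title>A simple example page</title></head>
  <body>
    <div>
      <p class="inner-text first-item" id="first">First paragraph.</p>
      <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
    <p class="outer-text"><b>Second outer paragraph.</b></p>
  </body>
</html>
"""
soup = BeautifulSoup(sample, "html.parser")

# class_ matches any element that has the class, even alongside other classes.
outer = soup.find_all("p", class_="outer-text")
print(len(outer))

# id lookups also return a list, here with a single element.
first = soup.find_all(id="first")
print(first[0].get_text(strip=True))
```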


In [56]:
import re
from collections import defaultdict

# Map each rank (as a string) to the list of film titles found for it.
films_dict = defaultdict(list)
print('length = ', len(soup.find_all(id="col2mid")))
for rank, block in enumerate(soup.find_all(id="col2mid"), start=1):
    for title in re.findall(r'summary">(.*?)<', str(block)):
        # Patch mis-decoded punctuation that shows up as 'â' in the titles.
        pos = title.find('â')
        if pos != -1:
            if title[pos + 3] == 's':
                # The garbled sequence stood for an apostrophe, as in "Ocean's 8".
                title = title[:pos] + "'" + title[pos + 3:]
            else:
                # Otherwise treat it as a separator and substitute a colon.
                title = title[:pos] + ":" + title[pos + 2:]
                print(title)
        # Skip titles already recorded for this rank.
        if title not in films_dict[str(rank)]:
            films_dict[str(rank)].append(title)

for rank in films_dict:
    for film in films_dict[rank]:
        print(rank, film)


length =  50
Mission: Impossible:Fallout
Mission: Impossible:Rogue Nation
Mission: Impossible:Ghost Protocol
Mission: Impossible:Fallout
Mission: Impossible:Rogue Nation
Mission: Impossible:Ghost Protocol
1 Mission: Impossible:Fallout
1 Mission: Impossible:Rogue Nation
1 Mission: Impossible:Ghost Protocol
1 War of the Worlds
1 Austin Powers in Goldmember
1 Mission: Impossible 2
1 Mission: Impossible III
2 Aladdin
2 Suicide Squad
2 I am Legend
2 Hancock
2 Men in Black 3
2 Hitch
2 The Pursuit of Happyness
2 Annie
2 I, Robot
3 Avengers: Endgame
3 Avengers: Infinity War
3 The Avengers
3 Avengers: Age of Ultron
3 Captain America: Civil War
3 The Judge
3 A Guide to Recognizing Your Saints
4 Minions
4 Gravity
4 The Blind Side
4 Ocean's 8
4 The Heat
4 The Proposal
4 Miss Congeniality
4 Two Weeks Notice
4 Miss Congeniality 2: Armed and Fabulous
4 Hope Floats
5 Star Wars Ep. VII: The Force Awakens
5 Star Wars Ep. VIII: The Last Jedi
5 Rogue One: A Star Wars Story
5 Solo: A Star Wars Sto

### Using CSS Selectors


You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

- *p a* — finds all a tags inside of a p tag.
- *body p a* — finds all a tags inside of a p tag inside of a body tag.
- *html body* — finds all body tags inside of an html tag.
- *p.outer-text* — finds all p tags with a class of outer-text.
- *p#first* — finds all p tags with an id of first.
- *body p.outer-text* — finds any p tags with a class of outer-text inside of a body tag.

`BeautifulSoup` objects support searching a page via CSS selectors __using the `select` method__. We can use CSS selectors to find all the `p` tags in our page that are inside of a `div` like this:

In [23]:
soup.select("div p")

Note that the `select` method above returns a list of `Tag` objects, just like `find_all`.
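A few of the selectors listed in this section can be tried on the sample page from the class and id section. The sketch below parses a compact copy of that page from a string, so it does not depend on any live website.

```python
from bs4 import BeautifulSoup

# Compact copy of the sample page used in the class and id section.
sample = """
<html>
  <body>
    <div>
      <p class="inner-text first-item" id="first">First paragraph.</p>
      <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
    <p class="outer-text"><b>Second outer paragraph.</b></p>
  </body>
</html>
"""
soup = BeautifulSoup(sample, "html.parser")

# p tags nested anywhere inside a div.
inside_div = soup.select("div p")
# p tags carrying the class outer-text.
outer = soup.select("p.outer-text")
# The p tag whose id is first.
by_id = soup.select("p#first")

print(len(inside_div), len(outer), len(by_id))
print([p.get_text(strip=True) for p in inside_div])
```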

## Your Turn: Downloading weather data


We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this [page](https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.XJJFuhNKhTY).

With the page open in the browser's developer tools, we can scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a `div` tag with the id `seven-day-forecast`:

We now know enough to download the page and start parsing it. In the below code, we:

1. Download the web page containing the forecast.
2. Create a `BeautifulSoup` object to parse the page.
3. Find the `div` with id `seven-day-forecast`, and assign it to `seven_day`.
4. Inside `seven_day`, find each individual forecast item.
5. Extract and print the first forecast item.

In [23]:
# 1
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

In [24]:
#2
soup = BeautifulSoup(page.content, 'html.parser')

In [25]:
#3
seven_day = soup.find(id="seven-day-forecast")

In [29]:
#4
forecast_items = seven_day.find_all(class_="tombstone-container")

In [30]:
#5
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 72 °F
 </p>
</div>


### Extracting information from the page
As you can see, the forecast item `tonight` (which holds the **Today** forecast in this run) contains all the information we want. There are four pieces of information we can extract:

- The name of the forecast item: in this case, **Today**.
- The description of the conditions: this is stored in the `title` attribute of `img`.
- A short description of the conditions: in this case, **Sunny**.
- The temperature high: in this case, **72 °F**.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

Now, we can extract the `title` attribute from the `img` tag. To do this, we just treat the `Tag` object like a dictionary, and pass in the attribute we want as a key:

In [None]:
# your code here
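One possible approach, sketched on a condensed copy of the forecast item printed above so that it runs without a network connection. The variable names are illustrative, and the live page's markup may have changed since this output was captured.

```python
from bs4 import BeautifulSoup

# Condensed copy of the first forecast item shown in the output above.
item_html = """
<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon."/></p>
<p class="short-desc">Sunny</p>
<p class="temp temp-high">High: 72 °F</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser")

# The three text fields are all read the same way: find by class, then get_text().
period = item.find(class_="period-name").get_text()
short_desc = item.find(class_="short-desc").get_text()
temp = item.find(class_="temp").get_text()

# Attributes are read with dictionary-style indexing on the tag.
desc = item.find("img")["title"]

print(period)
print(short_desc)
print(temp)
print(desc)
```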

### Extracting all the information from the page
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we:

- Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
- Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [31]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday']

As you can see above, our technique gets us each of the period names, in order. Now you can apply the same technique to get the other 3 fields:

In [None]:
#your code here

