In [6]:
import requests
from bs4 import BeautifulSoup

In [8]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
print(content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


In [13]:
# Initialize the parser, and pass in the content grabbed earlier
parser = BeautifulSoup(content, 'html.parser')

# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Get the text inside the title tag
# Text is a property that gets the inside text of a tag
title = parser.head.title
title_text = title.text
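
Attribute access such as parser.body.p returns only the first matching tag below that point, and None when there is no such tag. A minimal sketch on an inline HTML string (the markup is made up for illustration and is not the page fetched above):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p              # attribute access yields the first <p> only
title_text = soup.head.title.text  # text inside the <title> tag
missing = soup.body.table          # there is no <table>, so this is None

print(first_p.text)
print(title_text)
print(missing)
```

Because only the first match comes back, the second paragraph is reachable only through a method that returns every match, such as find_all.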

In [35]:
type(title)
title.find_all()
parser.find_all()

# Get a list of all occurrences of the body tag in the document
body = parser.find_all('body')  # returns a list-like collection of Tag instances
print(body)
# Get the paragraph tags inside the first body tag
p = body[0].find_all('p')
print(p)
# Get the text of the first paragraph
print(p[0].text)

# Get the text inside the title tag
title_text = parser.find_all('head')[0].find_all('title')[0].text

[<body>
<p>Here is some simple content for this page.</p>
</body>]
[<p>Here is some simple content for this page.</p>]
Here is some simple content for this page.


In [36]:
title_text

'A simple example page'

In [37]:
type(parser.find_all('head')[0])

bs4.element.Tag
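
The find_all(...)[0] pattern above raises an IndexError when nothing matches. BeautifulSoup also has a find method that returns the first match or None. A small sketch comparing the two, using a made-up inline document rather than the fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph document, for illustration only.
html = "<html><body><p>only paragraph</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("p")       # list-like collection of Tag objects
first = soup.find("p")             # the first matching Tag
absent_all = soup.find_all("div")  # no <div> present: empty collection
absent_one = soup.find("div")      # no <div> present: None

print(len(matches), first.text)
print(len(absent_all), absent_one)
```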

### HTML uses the div tag to create a divider that splits the page into logical units, like a "box" that contains content.

In [38]:
# Get the page content and set up a new parser
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the id attribute to get only the element with that specific ID.
first_paragraph = parser.find_all('p', id='first')[0]
print(first_paragraph.text)


                First paragraph.
            


In [44]:
parser.find_all('p', id = 'first')[0].text

'\n                First paragraph.\n            '

In [53]:
a = parser.find_all('head')[0]
b = a.text
b

'\nA simple example page\n'

In [55]:
parser.find_all(id = 'first')[0].text

'\n                First paragraph.\n            '
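
The text values above keep the newlines and indentation from the HTML source. One way to trim them, sketched on an inline string that imitates the whitespace shown in the output (the markup is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that imitates the indentation seen in the output above.
html = '<p id="first">\n                First paragraph.\n            </p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find_all(id="first")[0]
raw = p.text                      # keeps the surrounding whitespace
clean_a = p.text.strip()          # plain Python string method
clean_b = p.get_text(strip=True)  # BeautifulSoup strips each text piece

print(repr(raw))
print(clean_a)
print(clean_b)
```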

In [57]:
second_paragraph_text = parser.find_all('p', id = 'second')[0].text
print(second_paragraph_text)



                Second paragraph.
            



In [58]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, "html.parser")

In [64]:
parser.find_all('p', class_ = 'inner-text')[0].text

'\n                First paragraph.\n            '
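
A tag can carry several classes at once, and class_ matches a tag if any one of its classes equals the value given. A sketch with made-up markup (loosely modeled on the pages used in these notes, not copied from them):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
  <p class="outer-text first-item" id="second">First outer paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches both paragraphs that include inner-text among their classes.
inner = soup.find_all("p", class_="inner-text")
inner_ids = [p.get("id") for p in inner]  # .get returns None when id is absent
classes = inner[0]["class"]               # class is multi-valued, so a list

print(inner_ids)
print(classes)
```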

## Cascading Style Sheets (CSS)
* A language for adding styles to HTML pages.
* CSS uses selectors to apply styles to the elements and classes of elements specified, such as background colors, text colors, borders, and padding.
* `p { color: red }` makes all text inside all paragraphs red.
* `p.inner-text { color: red }` selects a class with the dot symbol (`.`): text in any paragraph that has the class inner-text.
* `p#first { color: red }` selects an ID with the pound or hash symbol (`#`): text in any paragraph that has the ID first.
* `#first { color: red }` applies to any element with the ID first (not just paragraphs).
* `.inner-text { color: red }` applies to any element with the class inner-text.
   
### BeautifulSoup's .select method accepts CSS selectors.

In [65]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

In [72]:
# Select all of the elements that have the first-item class
first_items = parser.find_all(class_ = 'first-item')
first_items

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [75]:
first_items = parser.select(".first-item")
print(first_items[0].text)
first_items


                First paragraph.
            


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]

In [77]:
first_outer_text = parser.select(".outer-text")[0].text
second_text = parser.select("#second")[0].text
second_text

'\n\n                First outer paragraph.\n            \n'
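
The selector forms from the CSS notes can be tried side by side with .select. This is a sketch on a made-up inline document, counting how many elements each selector matches:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div and two paragraphs share the class inner-text.
html = """
<body>
  <div class="inner-text" id="box">
    <p class="inner-text" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text">Outer paragraph.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = ["p", "p.inner-text", ".inner-text", "p#first", "#first", "p#box"]
# Map each selector to the number of elements it matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel, n in counts.items():
    print(sel, n)
```

Here .inner-text matches one more element than p.inner-text because the div also has that class, and p#box matches nothing because the element with the ID box is not a paragraph.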

## We can nest CSS selectors similar to the way HTML nests tags. For example, we could use selectors to find all of the paragraphs inside the body tag.
* `div p`: any paragraph inside a div tag
* `div .first-item`: any item inside a div tag that has the class first-item
* `body div #first`: any item that's inside a div tag inside a body tag, but only if it also has the ID first
* `.first-item #first`: any item with the ID first that is inside any item with the class first-item
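
The nested selectors listed above can be checked on a small made-up document. A space between two selectors means "somewhere inside", so the last selector in this sketch matches nothing: the element with the ID first is itself the first-item element, not a descendant of one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<body>
  <div>
    <p class="first-item" id="first">Inside div.</p>
    <p>Also inside div.</p>
  </div>
  <p class="first-item">Outside div.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

in_div = soup.select("div p")                       # paragraphs inside a div
first_items_in_div = soup.select("div .first-item") # first-item elements inside a div
by_id = soup.select("body div #first")              # ID first, inside div, inside body
nested_id = soup.select(".first-item #first")       # ID first inside a first-item element

print(len(in_div), len(first_items_in_div), len(by_id), len(nested_id))
```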

In [78]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
parser = BeautifulSoup(response.content, 'html.parser')

In [87]:
# Find the number of turnovers the Seahawks committed
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
seahawks_turnovers_count

'1'

In [91]:
# Total plays for the New England Patriots (third cell in the row)
patriots_total_plays_count = parser.select("#total-plays")[0].select("td")[2].text
# Total yards for the Seahawks (second cell in the row)
total_yards = parser.select("#total-yards")[0]
seahawks_total_yards_count = total_yards.select("td")[1].text
seahawks_total_yards_count

'396'

In [97]:
parser.select('body #total-yards')

[<tr id="total-yards">
 <td>Total yards</td>
 <td>396</td>
 <td>377</td>
 </tr>]
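
Since each stats row has the same layout (label, Seahawks, Patriots), a whole row can be unpacked in one step and the counts converted to integers. A sketch on an inline copy of the row printed above (the surrounding table tag is an assumption added for illustration):

```python
from bs4 import BeautifulSoup

# Inline copy of the total-yards row shown above; the <table> wrapper is assumed.
html = """
<table>
  <tr id="total-yards">
    <td>Total yards</td>
    <td>396</td>
    <td>377</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

row = soup.select("#total-yards")[0]
cells = [td.get_text(strip=True) for td in row.select("td")]

# Cell order follows the notes above: label, Seahawks, Patriots.
label = cells[0]
seahawks_yards = int(cells[1])
patriots_yards = int(cells[2])

print(label, seahawks_yards, patriots_yards)
```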

In [101]:
response = requests.get("https://guides.github.com/activities/hello-world/")
parser = BeautifulSoup(response.content, 'html.parser')

In [109]:
para = parser.select("body .content-body p")
para

[<p><a class="toc-item" id="intro" title="Intro"></a></p>,
 <p>The <strong>Hello World</strong> project is a time-honored tradition in computer programming. It is a simple exercise that gets you started when learning something new. Let’s get started with GitHub!</p>,
 <p><strong>You’ll learn how to:</strong></p>,
 <p><a class="toc-item" id="what" title="What is GitHub?"></a></p>,
 <p>GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.</p>,
 <p>This tutorial teaches you GitHub essentials like <em>repositories</em>, <em>branches</em>, <em>commits</em>, and <em>Pull Requests</em>. You’ll create your own Hello World repository and learn GitHub’s Pull Request workflow, a popular way to create and review code.</p>,
 <p>To complete this tutorial, you need a <a href="http://github.com">GitHub.com account</a> and Internet access. You don’t need to know how to code, use the command line, or install Git (the vers

In [114]:
[t.text for t in para if len(t.text) > 0]

['The Hello World project is a time-honored tradition in computer programming. It is a simple exercise that gets you started when learning something new. Let’s get started with GitHub!',
 'You’ll learn how to:',
 'GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.',
 'This tutorial teaches you GitHub essentials like repositories, branches, commits, and Pull Requests. You’ll create your own Hello World repository and learn GitHub’s Pull Request workflow, a popular way to create and review code.',
 'To complete this tutorial, you need a GitHub.com account and Internet access. You don’t need to know how to code, use the command line, or install Git (the version control software GitHub is built on).',
 'Tip: Open this guide in a separate browser window (or tab) so you can see it while you complete the steps in the tutorial.',
 'A repository is usually used to organize a single project. Repositories can co