This notebook demonstrates how to use the Beautiful Soup package to scrape data from web pages.

# Imports

In [15]:
import requests
from bs4 import BeautifulSoup

# Grab data using the requests package

A response with status code 200 means the request was successful.

In [12]:
example_page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
# example_page = requests.get("https://fbref.com/en/comps/9/Premier-League-Stats")

Check whether the page was downloaded successfully:

In [13]:
example_page.status_code

200
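Checking status_code by hand works well in a notebook, but in a script it can be easier to let requests raise an error on failure. The sketch below is one way to do that, not the only one: the helper name fetch_html and the 10-second timeout are assumptions made for illustration, while raise_for_status is part of the requests API. The hand-built Response at the end only simulates a failed download so the behaviour can be seen without network access.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.content


# Offline illustration: a manually built Response with a 404 status code
fake = requests.Response()
fake.status_code = 404

try:
    fake.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(failed)
```

In real use you would call fetch_html with the page URL and pass the returned bytes to BeautifulSoup.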

Examine the content of the page:

In [14]:
example_page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

# Parsing a page with BeautifulSoup

Create an instance of the BeautifulSoup class to parse the page content.

In [16]:
soup = BeautifulSoup(example_page.content, 'html.parser')

Use the prettify method to see the structure of the HTML more clearly:

In [18]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


List children of our web page.

In [19]:
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

Check the type of each child of the page.

We will see they're all Beautiful Soup objects:
* The first is a Doctype object, which contains information about the type of the document.
* The second is a NavigableString, which represents text found in the HTML document.
* The final item is a Tag object, which contains other nested tags. This is the most important of the three.

In [21]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
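Since only the Tag objects are usually of interest, one option is to filter the children by type rather than rely on their position. The following is a minimal sketch that assumes an inline copy of the example page's HTML, so it runs without a download.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page (assumed to match the downloaded HTML)
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children, skipping the Doctype and whitespace strings
tag_children = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in tag_children]
print(tag_names)
```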

### html tag

Grab html tag

In [22]:
html = list(soup.children)[2] # grabbing tag object

Look at the children of the html tag

In [28]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

As we can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we’ll dive into the body:

In [29]:
body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

In [31]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

In [32]:
p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [68]:
p.get_text()

'Here is some simple content for this page.'
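Indexing into list(...children) depends on where the whitespace strings happen to fall. As a shorter alternative, Beautiful Soup lets you use a tag name as an attribute to reach the first matching descendant. The sketch below assumes an inline copy of the example page rather than the downloaded one.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page (assumed to match the downloaded HTML)
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.html.body.p walks html -> body -> first p tag
p = soup.html.body.p
text = p.get_text()
print(text)
```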

### Finding all instances of a tag at once

Use the find_all method, which will find all the instances of a tag on a page.

In [34]:
soup = BeautifulSoup(example_page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [43]:
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


Note that find_all returns a list, so we’ll have to loop through it or use list indexing to extract text:

In [35]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
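When a page has more than one matching tag, looping is usually more convenient than indexing. Here is a small sketch using a made-up two-paragraph snippet (not one of the pages downloaded in this notebook):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two paragraphs
html_doc = "<body><p>First paragraph.</p><p>Second paragraph.</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every p tag returned by find_all
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)
```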

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single Tag object (or None if nothing matches):

In [36]:
soup.find('p')

<p>Here is some simple content for this page.</p>
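One thing to watch for is that find returns None when no tag matches, so calling get_text on the result directly can fail. A minimal sketch of a guard, using a made-up snippet that contains no table tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a paragraph but no table
soup = BeautifulSoup("<body><p>Only a paragraph here.</p></body>", "html.parser")

table = soup.find("table")  # nothing matches, so this is None

# Guard against None before calling methods on the result
table_text = table.get_text() if table is not None else ""
print(repr(table_text))
```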

### Searching for tags by class and id

Create a BeautifulSoup object for a page whose tags have class and id attributes:

In [37]:
id_classes_page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

soup = BeautifulSoup(id_classes_page.content, 'html.parser')

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

In [39]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the below example, we’ll look for any tag that has the class outer-text:

In [38]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by id:

In [42]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
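A tag can carry several classes at once, and class_ matches if any one of them equals the search value. The sketch below illustrates this with a compact inline version of the ids-and-classes markup (an assumption, so that it runs without a download).

```python
from bs4 import BeautifulSoup

# Compact inline version of the ids-and-classes page body
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches any single class in a multi-class attribute
first_items = soup.find_all("p", class_="first-item")
ids = [p["id"] for p in first_items]

# The class attribute itself is exposed as a list of class names
classes_of_first = soup.find(id="first")["class"]

print(ids)
print(classes_of_first)
```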

### Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

* p a — finds all a tags inside of a p tag.
* body p a — finds all a tags inside of a p tag inside of a body tag.
* html body — finds all body tags inside of an html tag.
* p.outer-text — finds all p tags with a class of outer-text.
* p#first — finds all p tags with an id of first.
* body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. Note that select returns a list of Tag objects, just like find_all (find, by contrast, returns a single object). We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [44]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
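A few of the selectors listed earlier can be tried in the same way. This is a sketch against a compact inline version of the ids-and-classes markup (an assumption, to keep it runnable offline); select_one is the single-result counterpart of select.

```python
from bs4 import BeautifulSoup

# Compact inline version of the ids-and-classes page body
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

outer = soup.select("p.outer-text")   # p tags with class outer-text
first = soup.select_one("p#first")    # the p tag with id first
inner = soup.select("div p")          # p tags nested inside a div

print(len(outer), first.get_text(), len(inner))
```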

# Scraping data example

## Extracting information from the page

Grab the data for the first forecast item (Overnight):

* Download the web page containing the forecast.
* Create a BeautifulSoup object to parse the page.
* Find the div with id seven-day-forecast, and assign it to seven_day.
* Inside seven_day, find each individual forecast item.
* Extract and print the first forecast item.

In [45]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")

tonight = forecast_items[0]  # first forecast item (the current period)
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
 </p>
 <p class="temp temp-low">
  Low: 54 °F
 </p>
</div>


As we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:

* The name of the forecast item, in this case Overnight.
* The description of the conditions, which is stored in the title attribute of the img tag.
* A short description of the conditions, in this case Mostly Cloudy.
* The temperature low, in this case 54 degrees.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

In [46]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Overnight
Mostly Cloudy
Low: 54 °F


Now, we can extract the title attribute from the img tag. To do this, we just treat the Tag object like a dictionary, and pass in the attribute we want as a key:

In [70]:
tonight.find("img")

<img alt="Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. "/>

In [51]:
img = tonight.find("img")
desc = img['title']
print(desc)

Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. 
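Dictionary-style access raises a KeyError if the attribute is missing. Where an attribute may be absent, the get method of a Tag returns None (or a default you supply) instead. The img tag below is a made-up example without a title attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical img tag that has alt text but no title attribute
soup = BeautifulSoup(
    '<img alt="Overnight: Mostly cloudy" src="newimages/medium/nbkn.png"/>',
    "html.parser",
)
img = soup.find("img")

title = img.get("title")                         # None, no KeyError raised
fallback = img.get("title", img.get("alt", ""))  # fall back to the alt text

print(title)
print(fallback)
```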


## Extracting all the information from the page

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we will:

* Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
* Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [53]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Overnight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight']

As we can see above, our technique gets us each of the period names, in order.
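Names such as 'TuesdayNight' run together because the two words are separated only by a br tag in the HTML, and get_text joins the pieces with no separator by default. A minimal sketch of one way around this, using a made-up period-name tag:

```python
from bs4 import BeautifulSoup

# Hypothetical period-name tag with a line break between the two words
soup = BeautifulSoup('<p class="period-name">Tuesday<br/>Night</p>', "html.parser")
tag = soup.find(class_="period-name")

joined = tag.get_text()                           # pieces joined directly
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined with a space

print(joined)
print(spaced)
```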

We can apply the same technique to get the other three fields:

In [56]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy']
['Low: 54 °F', 'High: 66 °F', 'Low: 51 °F', 'High: 66 °F', 'Low: 50 °F', 'High: 66 °F', 'Low: 50 °F', 'High: 64 °F', 'Low: 51 °F']
['Overnight: Mostly cloudy, with a low around 54. West wind around 10 mph. ', 'Tuesday: Mostly sunny, with a high near 66. West wind 5 to 10 mph increasing to 11 to 16 mph in the afternoon. Winds could gust as high as 21 mph. ', 'Tuesday Night: Mostly clear, with a low around 51. West wind 7 to 16 mph, with gusts as high as 20 mph. ', 'Wednesday: Sunny, with a high near 66. Northwest wind 7 to 17 mph, with gusts as high as 21 mph. ', 'Wednesday Night: Mostly clear, with a low around 50. West wind 5 to 15 mph, with gusts as high as 20 mph. ', 'Thursday: Sunny, with a high near 66.', 'Thursday Night: Mostly clear, with a low around 50.', 'Friday: Mostly sunny, with a high near 64.', 'Friday Night: Partly cloudy, with a low aroun
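Building four separate lists works as long as every forecast item contains every field; if one item lacked, say, a temperature, the lists would fall out of alignment. An alternative sketch is to loop over the containers and build one record per item. The markup below is a simplified, made-up stand-in for the forecast page, and text_or_none is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the forecast markup; the second item lacks some fields
html_doc = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Overnight</p>
    <p><img title="Overnight: Mostly cloudy, with a low around 54."/></p>
    <p class="short-desc">Mostly Cloudy</p>
    <p class="temp temp-low">Low: 54 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tuesday</p>
    <p class="short-desc">Mostly Sunny</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
seven_day = soup.find(id="seven-day-forecast")


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None (hypothetical helper)."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for item in seven_day.select(".tombstone-container"):
    img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

print(rows)
```

A list of dictionaries like rows can be passed straight to pd.DataFrame, with missing fields showing up as empty values rather than shifting the other columns.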

# Combining our data into a Pandas Dataframe

In [59]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Overnight | Mostly Cloudy | Low: 54 °F | Overnight: Mostly cloudy, with a low around 54... |
| 1 | Tuesday | Mostly Sunny | High: 66 °F | Tuesday: Mostly sunny, with a high near 66. We... |
| 2 | TuesdayNight | Mostly Clear | Low: 51 °F | Tuesday Night: Mostly clear, with a low around... |
| 3 | Wednesday | Sunny | High: 66 °F | Wednesday: Sunny, with a high near 66. Northwe... |
| 4 | WednesdayNight | Mostly Clear | Low: 50 °F | Wednesday Night: Mostly clear, with a low arou... |
| 5 | Thursday | Sunny | High: 66 °F | Thursday: Sunny, with a high near 66. |
| 6 | ThursdayNight | Mostly Clear | Low: 50 °F | Thursday Night: Mostly clear, with a low aroun... |
| 7 | Friday | Mostly Sunny | High: 64 °F | Friday: Mostly sunny, with a high near 64. |
| 8 | FridayNight | Partly Cloudy | Low: 51 °F | Friday Night: Partly cloudy, with a low around... |


We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:


In [63]:
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)  # raw string so \d is not treated as a string escape
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    54
1    66
2    51
3    66
4    50
5    66
6    50
7    64
8    51
Name: temp_num, dtype: object
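The dtype shown above is object because str.extract returns the matched digits as strings; they only become numbers after the astype conversion. A small self-contained sketch of the same step, using three of the temperature strings from the table:

```python
import pandas as pd

# Three sample values taken from the temp column
temps = pd.Series(["Low: 54 °F", "High: 66 °F", "Low: 51 °F"], name="temp")

# The named group gives the resulting Series its name; astype(int) converts the strings
temp_num = temps.str.extract(r"(?P<temp_num>\d+)", expand=False).astype(int)

print(temp_num.tolist())
print(temp_num.name)
```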

We could then find the mean of all the high and low temperatures:

In [64]:
weather["temp_num"].mean()

57.55555555555556

We could also only select the rows that happen at night:

In [65]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool

In [66]:
weather[is_night]

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | Overnight | Mostly Cloudy | Low: 54 °F | Overnight: Mostly cloudy, with a low around 54... | 54 | True |
| 2 | TuesdayNight | Mostly Clear | Low: 51 °F | Tuesday Night: Mostly clear, with a low around... | 51 | True |
| 4 | WednesdayNight | Mostly Clear | Low: 50 °F | Wednesday Night: Mostly clear, with a low arou... | 50 | True |
| 6 | ThursdayNight | Mostly Clear | Low: 50 °F | Thursday Night: Mostly clear, with a low aroun... | 50 | True |
| 8 | FridayNight | Partly Cloudy | Low: 51 °F | Friday Night: Partly cloudy, with a low around... | 51 | True |
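The same boolean mask can be combined with the numeric column, for example to compare the average night-time low with the average daytime high. The sketch below rebuilds only the two columns it needs from the values shown in the table above.

```python
import pandas as pd

# temp and temp_num values copied from the weather table
weather = pd.DataFrame({
    "temp": ["Low: 54 °F", "High: 66 °F", "Low: 51 °F", "High: 66 °F", "Low: 50 °F",
             "High: 66 °F", "Low: 50 °F", "High: 64 °F", "Low: 51 °F"],
    "temp_num": [54, 66, 51, 66, 50, 66, 50, 64, 51],
})
weather["is_night"] = weather["temp"].str.contains("Low")

# Select rows with the mask (or its negation) and average the numeric column
night_mean = weather.loc[weather["is_night"], "temp_num"].mean()
day_mean = weather.loc[~weather["is_night"], "temp_num"].mean()

print(night_mean, day_mean)
```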
