In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

A status_code of *200* means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

In [7]:
page.status_code 

200
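
The rule of thumb above about leading digits can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name and is not part of requests.

```python
def describe_status(code: int) -> str:
    """Classify an HTTP status code by its first digit."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"  # 1xx and 3xx codes are not covered in this notebook


for code in (200, 404, 503):
    print(code, describe_status(code))
```

In practice, requests can do this check for you: `page.ok` is True for non-error responses, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.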

In [8]:
page.text

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [9]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [11]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [14]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

The output shows two top-level items of interest: the initial <!DOCTYPE html> declaration and the <html> tag. The list also contains a newline character (\n) between them.

In [17]:
for item in list(soup.children):
    print(type(item))

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>


All of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document.

The second is a NavigableString, which represents text found in the HTML document.

The final item is a Tag object, which contains other nested tags. It allows us to navigate through an HTML document and extract other tags and text.
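
Since only Tag objects can be navigated further, one option is to filter the children by type instead of relying on list positions. A minimal sketch, assuming an inline copy of the example page rather than a downloaded one (`html_doc` and `top_level_tags` are illustrative names):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline stand-in for the downloaded page, so no network access is needed
html_doc = (
    "<!DOCTYPE html>\n"
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects, skipping the Doctype and the newline string
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in top_level_tags]
print(tag_names)
```

This avoids hard-coded indexes, which can shift when the whitespace in the page changes.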

Select the html tag and its children by taking the third item in the list.

In [21]:
html = list(soup.children)[2]
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
body = list(html.children)[3]
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

Isolate the p tag.

In [25]:
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'

### Finding all instances of a tag at once

In [26]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

**find_all** returns a list, so use list indexing to pick an item before extracting its text.

In [35]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
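
When only the first match is needed, `find` can be used in place of `find_all(...)[0]`. The sketch below uses a small made-up HTML string (not one of the pages above) to compare the two:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs, for illustration only
html_doc = "<html><body><p>First.</p><p>Second.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = soup.find_all("p")  # every match, as a list-like result
first_paragraph = soup.find("p")     # only the first match
missing = soup.find("table")         # None when nothing matches

texts = [p.get_text() for p in all_paragraphs]
print(texts)
print(first_paragraph.get_text())
print(missing)
```

Because `find` returns None when there is no match, it is worth checking the result before calling `get_text()` on it.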

### Searching for tags by class and id

In [37]:
page = requests.get('http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')

In [38]:
soup.find_all('p', class_ = 'outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [39]:
soup.find_all(class_ = 'outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [40]:
soup.find_all(id = 'first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

### Using CSS Selectors

In [41]:
soup.select('div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
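
A few more selector forms, shown on a simplified inline stand-in for the ids_and_classes page. The HTML below is an assumption reconstructed from the outputs above, not the real page source:

```python
from bs4 import BeautifulSoup

# Simplified stand-in modeled on the outputs shown earlier
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# "div p": every p tag nested anywhere inside a div
inside_div = [p.get_text(strip=True) for p in soup.select("div p")]

# "p.outer-text": p tags that carry the class outer-text
by_class = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# "#first": the element with id="first"; select_one returns a single tag
by_id = soup.select_one("#first").get_text(strip=True)

print(inside_div)
print(by_class)
print(by_id)
```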

In [49]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a low around 50. West wind 15 to 20 mph decreasing to 8 to 13 mph after midnight. Winds could gust as high as 25 mph.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. " class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a low around 50. West wind 15 to 20 mph decreasing to 8 to 13 mph after midnight. Winds could gust as high as 25 mph.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. "/>
 </p>
 <p class="short-desc">
  Showers
  <br/>
  Likely
 </p>
 <p class="temp temp-low">
  Low: 50 °F
 </p>
</div>


### Extracting information from the page
There are 4 pieces of information we can extract:

- The name of the forecast item, in this case Tonight.
- The description of the conditions, which is stored in the title property of img.
- A short description of the conditions, in this case Showers Likely.
- The temperature low, in this case 50 degrees.

In [52]:
period = tonight.find(class_ = 'period-name').get_text()
short_desc = tonight.find(class_ = 'short-desc').get_text()
temp = tonight.find(class_ = 'temp').get_text()

print(period)
print(short_desc)
print(temp)

Tonight
ShowersLikely
Low: 50 °F
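
The short description prints as `ShowersLikely` because `get_text()` joins the text on either side of the `<br/>` tag with no separator. One possible fix, sketched on an inline copy of that paragraph, is to pass a separator:

```python
from bs4 import BeautifulSoup

# Inline copy of the short-desc paragraph from the forecast item
snippet = '<p class="short-desc">Showers<br/>Likely</p>'
tag = BeautifulSoup(snippet, "html.parser").find(class_="short-desc")

joined = tag.get_text()                           # pieces joined directly
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined by a space

print(joined)
print(spaced)
```

The same arguments could be used in the list comprehensions later in this notebook if readable labels are preferred.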


Extract the title attribute from the img tag. Treat the BeautifulSoup object like a dictionary and pass in the attribute we want as a key.

In [55]:
img = tonight.find("img")
desc = img['title']

print(desc)

Tonight: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a low around 50. West wind 15 to 20 mph decreasing to 8 to 13 mph after midnight. Winds could gust as high as 25 mph.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. 
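
Dictionary-style access raises a KeyError when the attribute is absent, so `get` may be a safer choice for optional attributes. A small sketch on an inline img tag (the title is shortened here for readability):

```python
from bs4 import BeautifulSoup

# Inline img tag with a shortened title and no alt attribute
snippet = '<img class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely">'
img = BeautifulSoup(snippet, "html.parser").find("img")

title = img["title"]     # raises KeyError if the attribute is missing
alt = img.get("alt")     # returns None if the attribute is missing
classes = img["class"]   # class is multi-valued, so this is a list

print(title)
print(alt)
print(classes)
```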


### Extracting all the information from the page
- Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
- Use a list comprehension to call the get_text method on each BeautifulSoup object.


In [57]:
period_tags = seven_day.select('.tombstone-container .period-name')
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight']

In [58]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['ShowersLikely', 'Heavy Rainand Windy', 'Heavy Rainthen ChanceShowers', 'ChanceShowers', 'ChanceShowers', 'ChanceShowers', 'Showers', 'ChanceShowers', 'Slight ChanceShowers']
['Low: 50 °F', 'High: 56 °F', 'Low: 51 °F', 'High: 58 °F', 'Low: 54 °F', 'High: 59 °F', 'Low: 53 °F', 'High: 58 °F', 'Low: 50 °F']
['Tonight: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a low around 50. West wind 15 to 20 mph decreasing to 8 to 13 mph after midnight. Winds could gust as high as 25 mph.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ', 'Sunday: Showers, mainly after 10am. The rain could be heavy at times.  High near 56. Windy, with a south southeast wind 14 to 19 mph increasing to 25 to 30 mph in the afternoon. Winds could gust as high as 39 mph.  Chance of precipitation is 90%. New precipitation amounts between a half and three quarters of an inch possible. ', 'Sunday Night: Showers,

### Combining our data into a pandas DataFrame



In [64]:
import pandas as pd
weather = pd.DataFrame({
    "period" : periods,
    "short_desc": short_descs,
    "temp":temps,
    "desc" : descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | ShowersLikely | Low: 50 °F | Tonight: Showers likely and possibly a thunder... |
| 1 | Sunday | Heavy Rainand Windy | High: 56 °F | Sunday: Showers, mainly after 10am. The rain c... |
| 2 | SundayNight | Heavy Rainthen ChanceShowers | Low: 51 °F | Sunday Night: Showers, mainly before 10pm. The... |
| 3 | Monday | ChanceShowers | High: 58 °F | Monday: A 40 percent chance of showers. Cloud... |
| 4 | MondayNight | ChanceShowers | Low: 54 °F | Monday Night: A 30 percent chance of showers. ... |
| 5 | Tuesday | ChanceShowers | High: 59 °F | Tuesday: A 50 percent chance of showers. Clou... |
| 6 | TuesdayNight | Showers | Low: 53 °F | Tuesday Night: Showers. Low around 53. Chance... |
| 7 | Wednesday | ChanceShowers | High: 58 °F | Wednesday: A chance of showers. Mostly cloudy... |
| 8 | WednesdayNight | Slight ChanceShowers | Low: 50 °F | Wednesday Night: A slight chance of showers. ... |


Use `Series.str.extract` to pull out the numeric temperature values.

In [69]:
temp_num = weather['temp'].str.extract(r"(?P<temp_num>\d+)", expand = False)
weather['temp_num']= temp_num.astype('int')
temp_num

0    50
1    56
2    51
3    58
4    54
5    59
6    53
7    58
8    50
Name: temp_num, dtype: object
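
Note that the extracted values are still strings (dtype object) until they are converted. The following sketch repeats the extraction on a three-row Series built by hand, so it runs without the scraped data:

```python
import pandas as pd

# Hand-built sample in the same format as the scraped temp column
temps = pd.Series(["Low: 50 °F", "High: 56 °F", "Low: 51 °F"], name="temp")

# The named group (?P<temp_num>...) captures the first run of digits;
# with expand=False and a single group, the result is a Series
extracted = temps.str.extract(r"(?P<temp_num>\d+)", expand=False)

# Convert the captured strings to integers before doing arithmetic
temp_num = extracted.astype(int)

print(temp_num.tolist())
print(temp_num.name)
```

The raw-string prefix `r` keeps `\d` from being treated as a string escape sequence.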

In [70]:
weather['temp_num'].mean()

54.333333333333336

In [71]:
is_night = weather['temp'].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool
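
A boolean Series like `is_night` can be used as a mask to select rows. The sketch below shows one way to do that on a three-row DataFrame built by hand (`night_rows` and `night_mean` are illustrative names):

```python
import pandas as pd

# Hand-built sample mirroring three rows of the weather DataFrame
weather = pd.DataFrame({
    "period": ["Tonight", "Sunday", "SundayNight"],
    "temp": ["Low: 50 °F", "High: 56 °F", "Low: 51 °F"],
    "temp_num": [50, 56, 51],
})

# True for rows whose temp string contains "Low"
is_night = weather["temp"].str.contains("Low")

# Indexing with the mask keeps only the rows where it is True
night_rows = weather[is_night]
night_mean = night_rows["temp_num"].mean()

print(night_rows["period"].tolist())
print(night_mean)
```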