# Webscraping using Python and BeautifulSoup -- Tutorial reference notes (Justin M. Olds)

#### Tutorial: https://www.dataquest.io/blog/web-scraping-tutorial-python/
#### BS documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
---
## html tag notes:
* child - a child is a tag inside another tag. For example, if a body tag contains two p tags, both p tags are children of the body tag.
* parent - a parent is the tag another tag is inside. For example, the html tag is the parent of the body tag.
* sibling - a sibling is a tag nested inside the same parent as another tag. For example, head and body are siblings, since they're both inside html. Two p tags inside the same body are also siblings.

**a** and **p** are extremely common html tags. Here are a few others:

* **div** - indicates a division, or area, of the page.
* **b** - bolds any text inside.
* **i** - italicizes any text inside.
* **table** - creates a table.
* **form** - creates an input form.

---
## html properties notes: 

Properties give HTML elements names and make them easier to interact with when scraping.
* **classes** - One element can have multiple classes, and a class can be shared between elements.
* **id** - Each element can only have one id, and an id can only be used once on a page.

---

## The Requests library
### Use the requests package (get function) to download html pages

In [5]:
import IPython

from IPython.display import HTML
from IPython.display import display

import requests

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

### A status code in the 200s denotes success. Codes in the 400s or 500s denote an error.
---
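As a rough illustration of the status-code ranges described above, the sketch below classifies a code by its range using only the standard library. The helper name `describe_status` is hypothetical and not part of requests. On a real requests Response, `page.ok` and `page.raise_for_status()` cover the same check.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Classify an HTTP status code by its range (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```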
### The status code is saved as the status_code attribute of the Response object returned by requests.get

In [6]:
page.status_code

200

### The html is saved within the 'content' attribute

In [7]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
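The leading `b` in the output above means `page.content` is a bytes object, not a string. The sketch below shows decoding such bytes by hand. The UTF-8 encoding and the shortened page body are assumptions for illustration. requests also exposes the decoded string directly as `page.text`.

```python
# Bytes in the same shape as page.content for the simple example page (shortened)
raw = b"<!DOCTYPE html>\n<html>\n<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>"

# Decode to a str; UTF-8 is assumed here
html_text = raw.decode("utf-8")

print(type(raw).__name__)         # bytes
print(type(html_text).__name__)   # str
print(html_text.splitlines()[0])  # first line of the decoded page
```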

## Beautiful Soup
### Soupify (i.e., parse) the page object

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [9]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


### Here are some simple ways to navigate that data structure:

In [11]:
soup.title

<title>A simple example page</title>

In [12]:
soup.title.name

'title'

In [13]:
soup.title.string

'A simple example page'

In [14]:
soup.title.parent.name

'head'

In [15]:
soup.p

<p>Here is some simple content for this page.</p>
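The same attribute-style navigation also reaches parents and siblings. The sketch below uses a small inline page (an assumption, written without whitespace between tags so that no newline strings appear among the siblings) instead of the downloaded one.

```python
from bs4 import BeautifulSoup

# Inline page for illustration; not the page downloaded above
html_doc = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>First.</p><p>Second.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p  # attribute access returns the first matching tag
parent_name = first_p.parent.name                      # tag that contains the p
next_p_text = first_p.find_next_sibling("p").get_text()  # next p under the same parent
head_sibling = soup.head.next_sibling.name             # head and body share the html parent

print(parent_name, next_p_text, head_sibling)
```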

---
### Examine all elements at the top level of the page using the children property of soup.
##### Note: the children property returns an iterator, not a list, so we call the list function on it to see its items.

In [16]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

#### The output shows that there are two items of interest at the top level of the page
* the !DOCTYPE html declaration
* the html tag
* There is a newline (\n) in the list as well

---
### Examine the type of element in the top-level-tags list

In [17]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

#### "bs4." denotes that each element in the list is a BeautifulSoup object
* Doctype - information about the type of document 
* NavigableString - represents text found in the html document
* Tag - the important one ;) often contains other nested tags. 
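Since only the Tag items usually matter, one option is to filter the children by type instead of counting list positions. This is a minimal sketch on an inline page (an assumption); it is not part of the original tutorial.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline page for illustration, with newlines between the tags
html_doc = (
    "<html>\n<head><title>A simple example page</title></head>\n"
    "<body><p>Some content.</p></body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children of html, dropping the '\n' NavigableString items
tag_children = [child for child in soup.html.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in tag_children]
print(tag_names)
```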
---
### Select the html tag and its children by taking the third item in the list
##### Note: list indices begin at 0, so the third item has index 2

In [18]:
html = list(soup.children)[2]

### Find the children nested within the html tag

In [19]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [20]:
[type(item) for item in list(html.children)]

[bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

### There are two tags here
* head 
* body 
---
### We can find the p tag by finding the children of the body tag

In [21]:
body = list(html.children)[3]

In [22]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [23]:
[type(item) for item in list(body.children)]

[bs4.element.NavigableString, bs4.element.Tag, bs4.element.NavigableString]

### Isolate the p tag

In [24]:
p = list(body.children)[1]

### Once isolated, extract all of the text using get_text method

In [25]:
p.get_text()

'Here is some simple content for this page.'

## Finding all instances of a tag

In [26]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

##### Note: find_all returns a list, so we have to loop through it or use list indexing to extract text. 
---
#### The find method returns a single Tag object for the first instance of a tag (or None if nothing matches)

In [27]:
soup.find('p')

<p>Here is some simple content for this page.</p>
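The difference between the two methods is easiest to see on a page with more than one match and on a tag that is absent. The inline page below is an assumption for illustration.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every match, so loop (or index) to reach the text
texts = [p.get_text() for p in soup.find_all("p")]

# find returns only the first match
first_text = soup.find("p").get_text()

# When nothing matches, find gives None and find_all gives an empty result
missing = soup.find("table")
missing_all = soup.find_all("table")

print(texts, first_text, missing, len(missing_all))
```

Because find can return None, checking the result before calling get_text avoids an AttributeError on pages that lack the tag.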

## Searching for tags by class and id
##### Note: These properties are used to uniformly apply certain styles of formatting for related parts of a webpage

In [28]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


#### search for any p tag that has the class outer-text

In [29]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

#### search for all elements with the id first

In [30]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
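The sketch below shows how class and id look once an element is found, using an inline copy of part of the page above (shortened, so treat it as an assumption). A class lookup matches any one of an element's classes.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="first")
first_classes = first["class"]  # class can hold several values, so it comes back as a list
first_id = first["id"]          # id holds one value, so it comes back as a string

# class_ matches elements that have first-item among their classes
first_item_ids = [tag["id"] for tag in soup.find_all(class_="first-item")]

print(first_classes, first_id, first_item_ids)
```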

## Searching a page via CSS selectors using the select method
Reference: https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors

CSS selectors allow developers to specify which html tags a style applies to. Some examples:
* **p a** - finds all a tags inside of a p tag
* **body p a** - finds all a tags inside of a p tag inside of a body tag
* **html body** - finds all body tags inside of an html tag
* **p.outer-text** - finds all p tags with a class of outer-text
* **p#first** - finds all p tags with an id of first
* **body p.outer-text** - finds any p tags with a class of outer-text inside of a body tag
---
### Find all the p tags in our page that are inside of a div 


In [31]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
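The other selector forms listed earlier work the same way. The sketch below tries a few of them on an inline copy of the ids_and_classes page (whitespace simplified, so the text output differs slightly from the downloaded page). `select_one` returns only the first match.

```python
from bs4 import BeautifulSoup

html_doc = """<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

outer = [p.get_text() for p in soup.select("p.outer-text")]  # tag plus class
by_id = soup.select_one("p#first").get_text()                # tag plus id
nested = soup.select("body p.outer-text")                    # class inside a body tag
in_div = soup.select("div p")                                # p tags inside a div

print(outer, by_id, len(nested), len(in_div))
```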

##### Note: the select method returns a list of BeautifulSoup objects, just like find_all.
---
## Downloading weather data 
National Weather Service: http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168 

### Target data to scrape: Extended Forecast

## Exploring page structure with Chrome DevTools
#### The elements panel within DevTools shows all html tags. In this example, highlighting the extended forecast reveals a div tag with the id seven-day-forecast. Further exploration of this page's elements reveals that each forecast item (e.g., Tonight, Thursday, Thursday Night) is contained in a div with the class tombstone-container
---
### Next...
* Download the web page containing the forecast
* Create a BeautifulSoup object to parse the page
* Find the div with id seven-day-forecast, and assign it to seven_day
* Inside seven_day, find each individual forecast item
* Extract and print the first forecast item

In [32]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')  
seven_day = soup.find(id='seven-day-forecast')
forecast_items = seven_day.find_all(class_='tombstone-container')
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  This
  <br/>
  Afternoon
 </p>
 <p>
  <img alt="This Afternoon: Sunny, with a high near 67. West southwest wind 9 to 16 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/few.png" title="This Afternoon: Sunny, with a high near 67. West southwest wind 9 to 16 mph, with gusts as high as 21 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 67 °F
 </p>
</div>


### Extract information from the page
* Name of the forecast item (here, This Afternoon)
* Weather condition description (title property of img)
* Short description (here, Sunny)
* Temperature (here, High: 67 °F)

In [33]:
period = tonight.find(class_='period-name').get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

ThisAfternoon
Sunny
High: 67 °F
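The period prints as ThisAfternoon because the two words are separated only by a br tag, and get_text joins the pieces with nothing in between by default. Passing a separator is one way around this. The snippet below is a minimal sketch on an inline copy of that one tag.

```python
from bs4 import BeautifulSoup

# Inline copy of the period-name tag, with the words split by a br tag
snippet = '<p class="period-name">This<br/>Afternoon</p>'
tag = BeautifulSoup(snippet, "html.parser").find(class_="period-name")

joined = tag.get_text()                # pieces joined with no separator
spaced = tag.get_text(separator=" ")   # pieces joined with a space

print(joined)
print(spaced)
```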


In [34]:
img = tonight.find("img")
desc = img['title']

print(desc)

This Afternoon: Sunny, with a high near 67. West southwest wind 9 to 16 mph, with gusts as high as 21 mph. 


### Extract all the page info
* Select all items with the class period-name inside of an item with the class tombstone-container in seven_day
* Use a list comprehension to call the get_text method on each BeautifulSoup object

In [35]:
period_tags = seven_day.select('.tombstone-container .period-name') 
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday']

In [36]:
[type(item) for item in list(period_tags)]

[bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.Tag]

In [37]:
short_descs = [sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
temps = [t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d["title"] for d in seven_day.select('.tombstone-container img')]

print(short_descs)

['Sunny', 'Partly Cloudy', 'DecreasingClouds', 'IncreasingClouds', 'Mostly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny']


In [38]:
print(temps)

['High: 67 °F', 'Low: 54 °F', 'High: 70 °F', 'Low: 55 °F', 'High: 68 °F', 'Low: 54 °F', 'High: 66 °F', 'Low: 54 °F', 'High: 66 °F']


In [39]:
print(descs)

['This Afternoon: Sunny, with a high near 67. West southwest wind 9 to 16 mph, with gusts as high as 21 mph. ', 'Tonight: Partly cloudy, with a low around 54. West southwest wind 13 to 18 mph decreasing to 7 to 12 mph after midnight. Winds could gust as high as 24 mph. ', 'Friday: Mostly cloudy, then gradually becoming sunny, with a high near 70. West southwest wind 7 to 12 mph increasing to 13 to 18 mph in the afternoon. Winds could gust as high as 25 mph. ', 'Friday Night: Increasing clouds, with a low around 55. West southwest wind 14 to 18 mph, with gusts as high as 24 mph. ', 'Saturday: Mostly sunny, with a high near 68. West southwest wind 9 to 15 mph, with gusts as high as 20 mph. ', 'Saturday Night: Mostly cloudy, with a low around 54.', 'Sunday: Mostly sunny, with a high near 66.', 'Sunday Night: Partly cloudy, with a low around 54.', 'Monday: Mostly sunny, with a high near 66.']
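Building four separate lists works when every forecast item has every element. If one item lacked, say, a temp tag, the lists would differ in length and no longer line up. An alternative sketch is to loop over the items and build one record per item. The inline HTML (including the second item with missing elements) and the helper name `text_or_none` are assumptions for illustration, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Made-up snippet in the shape of the forecast markup; the second item
# deliberately lacks the img and temp elements
html_doc = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img title="Tonight: Partly cloudy, with a low around 54."/></p>
    <p class="short-desc">Partly Cloudy</p>
    <p class="temp temp-low">Low: 54 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def text_or_none(parent, selector):
    """Return the text of the first match for selector, or None (hypothetical helper)."""
    match = parent.select_one(selector)
    if match is None:
        return None
    return match.get_text(separator=" ", strip=True)


rows = []
for item in soup.select("#seven-day-forecast .tombstone-container"):
    img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can also be passed directly to `pd.DataFrame`.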


### Importing data into a Pandas Dataframe
* call the DataFrame class
* pass each list of items in as part of a dictionary
* Each dictionary key will become a column in the DataFrame

In [40]:
import pandas as pd
weather = pd.DataFrame({
    'period': periods, 
    'short_desc': short_descs,
    'temp': temps, 
    'desc': descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | ThisAfternoon | Sunny | High: 67 °F | This Afternoon: Sunny, with a high near 67. We... |
| 1 | Tonight | Partly Cloudy | Low: 54 °F | Tonight: Partly cloudy, with a low around 54. ... |
| 2 | Friday | DecreasingClouds | High: 70 °F | Friday: Mostly cloudy, then gradually becoming... |
| 3 | FridayNight | IncreasingClouds | Low: 55 °F | Friday Night: Increasing clouds, with a low ar... |
| 4 | Saturday | Mostly Sunny | High: 68 °F | Saturday: Mostly sunny, with a high near 68. W... |
| 5 | SaturdayNight | Mostly Cloudy | Low: 54 °F | Saturday Night: Mostly cloudy, with a low arou... |
| 6 | Sunday | Mostly Sunny | High: 66 °F | Sunday: Mostly sunny, with a high near 66. |
| 7 | SundayNight | Partly Cloudy | Low: 54 °F | Sunday Night: Partly cloudy, with a low around... |
| 8 | Monday | Mostly Sunny | High: 66 °F | Monday: Mostly sunny, with a high near 66. |


### Extract the temperature values as numeric values for analyses

In [44]:
temp_nums = weather['temp'].str.extract(r'(?P<temp_num>\d+)', expand=False)
weather['temp_num'] = temp_nums.astype('int')
temp_nums

0    67
1    54
2    70
3    55
4    68
5    54
6    66
7    54
8    66
Name: temp_num, dtype: object
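The pattern passed to str.extract can be tried on its own with the standard re module. The sketch below applies the same named group to two of the temp strings. Note that the matched digits are text until converted, which is why the Series above has dtype object before astype('int') is applied.

```python
import re

# Same pattern as above: a named group capturing the first run of digits
pattern = re.compile(r"(?P<temp_num>\d+)")

temps = ["High: 67 °F", "Low: 54 °F"]
matched = [pattern.search(t).group("temp_num") for t in temps]  # strings
temp_nums = [int(m) for m in matched]                           # integers

print(matched)
print(temp_nums)
```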

### Find the mean of all the high and low temperatures:

In [45]:
weather['temp_num'].mean()

61.55555555555556

### Flag the nighttime rows (those with a Low temp value)

In [48]:
is_night = weather['temp'].str.contains('Low') 
weather['is_night'] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [49]:
weather

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | ThisAfternoon | Sunny | High: 67 °F | This Afternoon: Sunny, with a high near 67. We... | 67 | False |
| 1 | Tonight | Partly Cloudy | Low: 54 °F | Tonight: Partly cloudy, with a low around 54. ... | 54 | True |
| 2 | Friday | DecreasingClouds | High: 70 °F | Friday: Mostly cloudy, then gradually becoming... | 70 | False |
| 3 | FridayNight | IncreasingClouds | Low: 55 °F | Friday Night: Increasing clouds, with a low ar... | 55 | True |
| 4 | Saturday | Mostly Sunny | High: 68 °F | Saturday: Mostly sunny, with a high near 68. W... | 68 | False |
| 5 | SaturdayNight | Mostly Cloudy | Low: 54 °F | Saturday Night: Mostly cloudy, with a low arou... | 54 | True |
| 6 | Sunday | Mostly Sunny | High: 66 °F | Sunday: Mostly sunny, with a high near 66. | 66 | False |
| 7 | SundayNight | Partly Cloudy | Low: 54 °F | Sunday Night: Partly cloudy, with a low around... | 54 | True |
| 8 | Monday | Mostly Sunny | High: 66 °F | Monday: Mostly sunny, with a high near 66. | 66 | False |


### Select only nighttime rows

In [50]:
weather[is_night]

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | Partly Cloudy | Low: 54 °F | Tonight: Partly cloudy, with a low around 54. ... | 54 | True |
| 3 | FridayNight | IncreasingClouds | Low: 55 °F | Friday Night: Increasing clouds, with a low ar... | 55 | True |
| 5 | SaturdayNight | Mostly Cloudy | Low: 54 °F | Saturday Night: Mostly cloudy, with a low arou... | 54 | True |
| 7 | SundayNight | Partly Cloudy | Low: 54 °F | Sunday Night: Partly cloudy, with a low around... | 54 | True |
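The same boolean mask can split the mean computed earlier into a nighttime and a daytime figure. This sketch rebuilds only the temp column from the values shown above, so it runs without the scraped page. The names `night_mean` and `day_mean` are introduced here for illustration.

```python
import pandas as pd

# temp values copied from the DataFrame shown above
weather = pd.DataFrame({
    "temp": [
        "High: 67 °F", "Low: 54 °F", "High: 70 °F", "Low: 55 °F", "High: 68 °F",
        "Low: 54 °F", "High: 66 °F", "Low: 54 °F", "High: 66 °F",
    ]
})
weather["temp_num"] = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False).astype("int")
is_night = weather["temp"].str.contains("Low")

# Use the mask to pick nighttime rows, and ~mask for the remaining (daytime) rows
night_mean = weather.loc[is_night, "temp_num"].mean()
day_mean = weather.loc[~is_night, "temp_num"].mean()

print(night_mean, day_mean)
```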
