# Python BeautifulSoup Web Scraping Tutorial
Learn to scrape data from the web using the Python BeautifulSoup bs4 library.  
BeautifulSoup makes it easy to parse useful data out of an HTML page.  
First install the bs4 library on your system by running this at the command line:  
*pip install beautifulsoup4*  
See [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the complete set of functions.

### Import requests so we can fetch the HTML content of the webpage
You can see our example page has about 28k characters.

In [61]:
import requests
r = requests.get('https://www.usclimatedata.com/climate/united-states/us')
print(len(r.text))

28556
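
A request can hang or come back with an error page, so it may help to set a timeout and check the status before parsing. This is a minimal sketch: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, and the `get` parameter exists only so the function can be tried without a network connection.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error codes."""
    response = get(url, timeout=timeout)  # timeout is in seconds
    response.raise_for_status()           # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (needs network access):
# html = fetch_html('https://www.usclimatedata.com/climate/united-states/us')
# print(len(html))
```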


### Import BeautifulSoup, and convert your HTML into a bs4 object
Now we can access specific HTML tags on the page using dot notation (for example soup.title), much like properties of a JavaScript object.

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.title)
print(soup.title.string)

<title>Climate United States - normals and averages</title>
Climate United States - normals and averages
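
Dot access can be tried on a small inline page without fetching anything. The sketch below uses Python's built-in `html.parser`; the HTML string is a made-up sample that reuses the title shown above. Note that asking for a tag that is not on the page gives `None` rather than an error.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Climate United States - normals and averages</title>'
        '</head><body></body></html>')

# Naming the parser keeps results the same across machines.
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.name)    # title
print(soup.title.string)  # Climate United States - normals and averages
print(soup.h1)            # None, because this sample has no <h1> tag
```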


### Drill into the bs4 object to access page contents
soup.p gives you the first paragraph tag on the page, and soup.p.text gives just its text.  
soup.a gives you the first anchor (link) tag on the page.  
Get the value of an attribute inside an HTML tag using square brackets, for example soup.a['title'].  
Use .parent to get the parent object, and .next_sibling to get the next peer object.  
**Use your browser's *inspect element* feature to find the tag for the data you want.**

In [13]:
print(soup.p)
print(soup.p.text)
print(soup.a)
print(soup.a['title'])
print()
print(soup.p.parent)

<p class="breadcrumbs">You are here: <a href="/climate/united-states/us" title="Climate United States">United States</a></p>
You are here: United States
<a href="https://www.facebook.com/yourweatherservice" title="US Climate Data on Facebook">
<img alt="US Climate Data on Facebook" height="16" src="/images/icons/facebook.png" width="16"/>
</a>
US Climate Data on Facebook

<ul>
<li>
<a class="summary" href="#summary">Monthly</a>
</li>
<!-- Breadcrumbs -->
<p class="breadcrumbs">You are here: <a href="/climate/united-states/us" title="Climate United States">United States</a></p>
<!-- end Breadcrumbs -->
</ul>
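
The navigation calls above can be checked against a small snippet. This sketch uses a condensed copy of the markup printed above (whitespace and comments removed, which is an assumption that changes what `.next_sibling` returns: on a real page the next sibling is often a whitespace string rather than a tag).

```python
from bs4 import BeautifulSoup

html = ('<ul><li><a class="summary" href="#summary">Monthly</a></li>'
        '<p class="breadcrumbs">You are here: '
        '<a href="/climate/united-states/us" title="Climate United States">'
        'United States</a></p></ul>')
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a
print(first_link['href'])         # #summary
print(first_link.get('title'))    # None: .get() returns None for a missing attribute
# first_link['title'] would raise KeyError here instead

print(soup.p.a['title'])          # Climate United States
print(soup.p.text)                # You are here: United States
print(soup.p.parent.name)         # ul
print(soup.li.next_sibling.name)  # p
```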


### prettify() is handy for formatted printing
Note that it works only on bs4 objects, not on strings, dicts or lists. For dicts and lists, import pprint instead.

In [14]:
print(soup.p.parent.prettify())

<ul>
 <li>
  <a class="summary" href="#summary">
   Monthly
  </a>
 </li>
 <!-- Breadcrumbs -->
 <p class="breadcrumbs">
  You are here:
  <a href="/climate/united-states/us" title="Climate United States">
   United States
  </a>
 </p>
 <!-- end Breadcrumbs -->
</ul>
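
To make the difference concrete, here is a small sketch that uses `prettify()` on a tag and `pprint` on a plain dict. The HTML and the dict values are shortened samples, not scraped output.

```python
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li><a href="#summary">Monthly</a></li></ul>', 'html.parser')

pretty = soup.ul.prettify()  # returns an indented str, one tag or text node per line
print(pretty)

# Plain Python containers have no prettify(); pprint handles those.
sample = {'Ohio': ['36', '40', '52'], 'Maine': ['28', '32', '40']}
pprint(sample, width=40)
```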



### We need all the state links on this page
First we use find_all to get every anchor tag, and print out each one's href attribute, which is the actual link URL.   
But we see the result includes some links we don't want, so we need to filter those out.

In [7]:
for link in soup.find_all('a'):
    print(link.get('href'))

https://www.facebook.com/yourweatherservice
https://twitter.com/usclimatedata
http://www.usclimatedata.com
/climate/united-states/us
#summary
/climate/united-states/us
#
#
/climate/alabama/united-states/3170
/climate/kentucky/united-states/3187
/climate/north-dakota/united-states/3204
/climate/alaska/united-states/3171
/climate/louisiana/united-states/3188
/climate/ohio/united-states/3205
/climate/arizona/united-states/3172
/climate/maine/united-states/3189
/climate/oklahoma/united-states/3206
/climate/arkansas/united-states/3173
/climate/maryland/united-states/1872
/climate/oregon/united-states/3207
/climate/california/united-states/3174
/climate/massachusetts/united-states/3191
/climate/pennsylvania/united-states/3208
/climate/colorado/united-states/3175
/climate/michigan/united-states/3192
/climate/rhode-island/united-states/3209
/climate/connecticut/united-states/3176
/climate/minnesota/united-states/3193
/climate/south-carolina/united-states/3210
/climate/delaware/united-states/31

### Filter URLs using string functions
We add an *if* that keeps only links containing '/climate/' and skips the national page '/climate/united-states/us', then add the good ones to a list.  
In the end we get 51 state links, including Washington DC.

In [15]:
base_url = 'https://www.usclimatedata.com'
state_links = []
for link in soup.find_all('a'):
    url = link.get('href')
    if url and '/climate/' in url and '/climate/united-states/us' not in url:
        state_links.append(url)
print(len(state_links))

51
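
The same filter can be tried on a handful of links. This sketch uses a made-up HTML sample with hrefs copied from the listing above, passes `href=True` so anchors without an href are skipped, and uses `urljoin` from the standard library to build absolute URLs (a swap for the string concatenation used in this tutorial).

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = """
<a href="https://twitter.com/usclimatedata">Twitter</a>
<a href="/climate/united-states/us">United States</a>
<a href="#">Top</a>
<a>No href at all</a>
<a href="/climate/alabama/united-states/3170">Alabama</a>
<a href="/climate/ohio/united-states/3205">Ohio</a>
"""
base_url = 'https://www.usclimatedata.com'
soup = BeautifulSoup(html, 'html.parser')

state_urls = []
for link in soup.find_all('a', href=True):  # only anchors that have an href
    href = link['href']
    if href.startswith('/climate/') and href != '/climate/united-states/us':
        state_urls.append(urljoin(base_url, href))
print(state_urls)
```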


### Test getting the data for one state
We fetch the page for one state from our list, then print the title for that page.

In [18]:
r = requests.get(base_url + state_links[5])
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string)

Climate Ohio - temperature, rainfall and average


### The data we need is in *tr* tags
But look, there are 58 tr tags on the page, and we only want 2 of them - the *Average high* rows.

In [37]:
rows = soup.find_all('tr')
print(len(rows))

58


### Filter rows, and add temp data to a list
We use a list comprehension to keep only the rows whose HTML contains 'Average high'.  
Then we have only 2 rows left.  
We iterate through those 2 rows, and add six temps from the data cells (td) of each row into a list, giving 12 monthly highs.

In [50]:
rows = [row for row in rows if 'Average high' in str(row)]
print(len(rows))

high_temps = []
for row in rows:
    tds = row.find_all('td')
    for i in range(1, 7):  # cells 1-6, matching the full loop below
        high_temps.append(tds[i].text)
print(high_temps)

2
['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']
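
Row filtering can also be done on the label cell alone, which avoids matching text elsewhere in the row. In this sketch the table is made up for illustration: it assumes the label sits in the first td of each row, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up table; the values in the 'Average low' row are invented sample data.
html = """
<table>
<tr><td>Average high in F</td><td>36</td><td>40</td><td>52</td></tr>
<tr><td>Average low in F</td><td>20</td><td>23</td><td>31</td></tr>
<tr><td>Average high in F</td><td>85</td><td>84</td><td>77</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

high_temps = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # Compare against the label cell only, rather than the whole row's HTML.
    if cells and cells[0].get_text(strip=True).startswith('Average high'):
        high_temps.extend(int(td.get_text(strip=True)) for td in cells[1:])
print(high_temps)  # [36, 40, 52, 85, 84, 77]
```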


### Get the name of the State
As a first attempt, we split the title string into a list and grab the second word.  
But that doesn't work for 2-word states like New York and North Carolina.   
So instead we slice the string from the first space to the hyphen.

In [56]:
state = soup.title.string.split()[1]
print(state)
s = soup.title.string
state = s[s.find(' '):s.find('-')].strip()
print(state)

Wyoming
Wyoming
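
The two approaches can be compared on a two-word state. In this sketch, `state_from_title` is a hypothetical helper, and the North Carolina title is an assumed example that follows the same pattern as the Ohio title above.

```python
def state_from_title(title):
    """Extract the state name from a title like 'Climate Ohio - ...'."""
    before_dash = title.split(' - ', 1)[0]       # 'Climate North Carolina'
    return before_dash.split(' ', 1)[1].strip()  # drop the leading word


title = 'Climate North Carolina - temperature, rainfall and average'
print(title.split()[1])         # North (the naive approach loses 'Carolina')
print(state_from_title(title))  # North Carolina
```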


### Add state name and temp list to the data dictionary
For a single state, this is what our scraped data looks like.  
In this example we only got monthly highs by state, but you could drill into cities, and could get lows and precipitation. 

In [51]:
data = {}
data[state] = high_temps
print(data)

{'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']}
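
The scraped temps are strings, so they need converting before any arithmetic. A short sketch using the Ohio values printed above:

```python
from statistics import mean

data = {'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']}

highs = [int(t) for t in data['Ohio']]  # '36' -> 36, and so on
print(max(highs))   # 85
print(mean(highs))  # 62.5
```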


### Put it all together and iterate 51 states
We loop through our 51-state list, and get high temp data for each state, and add it to the data dict.  
This combines all our work above into a single for loop.  
The result is a dict with 51 states and a list of monthly highs for each.

In [59]:
data = {}
for state_link in state_links:
    url = base_url + state_link
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    rows = soup.find_all('tr')
    rows = [row for row in rows if 'Average high' in str(row)]
    high_temps = []
    for row in rows:
        tds = row.find_all('td')
        for i in range(1,7):
            high_temps.append(tds[i].text)
    s = soup.title.string
    state = s[s.find(' '):s.find('-')].strip()
    data[state] = high_temps
print(data)

{'Alabama': ['57', '62', '70', '77', '84', '90', '92', '92', '87', '78', '69', '60'], 'Kentucky': ['40', '45', '55', '66', '75', '83', '87', '86', '79', '68', '55', '44'], 'North Dakota': ['23', '28', '40', '57', '68', '77', '85', '83', '72', '58', '40', '26'], 'Alaska': ['23', '27', '34', '44', '56', '63', '65', '64', '55', '40', '28', '25'], 'Louisiana': ['62', '65', '72', '78', '85', '89', '91', '91', '87', '80', '72', '64'], 'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41'], 'Arizona': ['67', '71', '77', '85', '95', '104', '106', '104', '100', '89', '76', '66'], 'Maine': ['28', '32', '40', '53', '65', '74', '79', '78', '70', '57', '45', '33'], 'Oklahoma': ['50', '55', '63', '72', '80', '88', '94', '93', '85', '73', '62', '51'], 'Arkansas': ['51', '55', '64', '73', '81', '89', '92', '93', '86', '75', '63', '52'], 'Maryland': ['42', '46', '54', '65', '75', '85', '89', '87', '80', '68', '58', '46'], 'Oregon': ['48', '52', '56', '61', '68', '74', '82', '8
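
One possible refinement is to move the parsing into its own function, so it can be tried on saved or sample HTML without hitting the site. This is a sketch under assumptions: `parse_state_page` is a hypothetical name, and the sample page is made up (only the title and the first six Ohio highs come from the output above).

```python
from bs4 import BeautifulSoup


def parse_state_page(html):
    """Return (state, high_temps) for one state page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string
    state = title[title.find(' '):title.find('-')].strip()
    high_temps = []
    for row in soup.find_all('tr'):
        if 'Average high' in str(row):
            tds = row.find_all('td')
            # Cells 1-6, as in the loop above.
            high_temps.extend(td.text for td in tds[1:7])
    return state, high_temps


# Made-up sample page; the 'Average low' values are invented.
sample = """<html><head><title>Climate Ohio - temperature, rainfall and average</title></head>
<body><table>
<tr><td>Average high in F</td><td>36</td><td>40</td><td>52</td><td>63</td><td>73</td><td>82</td></tr>
<tr><td>Average low in F</td><td>20</td><td>23</td><td>31</td><td>41</td><td>51</td><td>60</td></tr>
</table></body></html>"""
print(parse_state_page(sample))
```

With this split, the loop over state_links only needs to fetch each page and pass r.text to the function.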

### Save to CSV file
Lastly, we might want to write all this data to a CSV file.  
Here's a quick easy way to do that.

In [60]:
import csv

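# newline='' stops the csv module from writing blank lines between rows on Windows
with open('high_temps.csv', 'w', newline='') as f:
    w = csv.writer(f)
    # one row per state: the state name followed by its monthly highs
    w.writerows([state] + temps for state, temps in data.items())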
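
A header row makes the file easier to load elsewhere. The sketch below writes a shortened sample dict to a temporary directory and reads it back; the column names `m1` to `m3` are hypothetical placeholders.

```python
import csv
import os
import tempfile

data = {'Ohio': ['36', '40', '52'], 'Maine': ['28', '32', '40']}  # shortened sample

path = os.path.join(tempfile.mkdtemp(), 'high_temps.csv')
with open(path, 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['state', 'm1', 'm2', 'm3'])  # hypothetical column names
    for state, temps in data.items():
        w.writerow([state] + temps)          # one cell per value

# Read the file back to confirm its shape.
with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```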