The first thing we'll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

In [36]:
import requests
page = requests.get("http://wavecast.com/stateofsurf/")
page

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [37]:
page.status_code

200

A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
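To make the rule of thumb above concrete, here is a minimal sketch that sorts a status code into a category by its first digit. The helper name describe_status is made up for this illustration and is not part of requests.

```python
def describe_status(code):
    """Classify an HTTP status code by its hundreds digit (illustrative helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Try the helper on a few common codes
for code in (200, 404, 503):
    print(code, describe_status(code))
```

In practice we would rarely write this ourselves: the Response object also has a raise_for_status() method, which raises an exception when the code starts with a 4 or a 5.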

We can print out the HTML content of the page using the content property:



In [38]:
page.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<!-- saved from url=(0100)http://www.wetsand.com/page.asp?locationid=5&resourceid=5109&ProdId=0&CatId=852&TabID=849&SubTabID=0 -->\n<HTML><HEAD><TITLE>Wetsand.com : State or our Surf</TITLE>\n<META http-equiv=Content-Type content="text/html; charset=windows-1252">\n<LINK href="gr_gw_article_files/styles.css" type=text/css rel=stylesheet>\n\n<!--\n<LINK href="/css/layout.css" type=text/css rel=stylesheet>\n//-->\n\n<script language="JavaScript">\n<!--\n\nfunction SymError()\n{\n  return true;\n}\n\nwindow.onerror = SymError;\n\nvar SymRealWinOpen = window.open;\n\nfunction SymWinOpen(url, name, attributes)\n{\n  return (new Object());\n}\n\nwindow.open = SymWinOpen;\n\n//-->\n</script>\n\n<SCRIPT LANGUAGE="JavaScript">\n<!--\nflybycounter = 99;\nvar randomnumber = Math.random() ; \n\nfunction newWinWithUrl(pageUrl, width, height) {\n    theWin=window.open(pageUrl, "", "toolbar=1,location=1,directories=0,status=1,menubar=1

As you can see above, we now have downloaded an HTML document.
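The b prefix at the start of the output shows that content holds raw bytes rather than a string. The sketch below uses a short byte string copied from the output above (not a live download) and decodes it with the windows-1252 charset that the page declares in its meta tag.

```python
# A short slice of the downloaded bytes, copied from the output above
raw = b'<TITLE>Wetsand.com : State or our Surf</TITLE>'

# Decode the bytes into a regular string using the charset the page declares
text = raw.decode("windows-1252")

print(type(raw).__name__)
print(type(text).__name__)
print(text)
```

The Response object also offers a text property, which returns the body already decoded to a string.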

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. 

In [39]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

In [40]:
print(soup.prettify())


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0100)http://www.wetsand.com/page.asp?locationid=5&resourceid=5109&ProdId=0&CatId=852&TabID=849&SubTabID=0 -->
<html>
 <head>
  <title>
   Wetsand.com : State or our Surf
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="gr_gw_article_files/styles.css" rel="stylesheet" type="text/css"/>
  <!--
<LINK href="/css/layout.css" type=text/css rel=stylesheet>
//-->
  <script language="JavaScript">
   <!--

function SymError()
{
  return true;
}

window.onerror = SymError;

var SymRealWinOpen = window.open;

function SymWinOpen(url, name, attributes)
{
  return (new Object());
}

window.open = SymWinOpen;

//-->
  </script>
  <script language="JavaScript">
   <!--
flybycounter = 99;
var randomnumber = Math.random() ; 

function newWinWithUrl(pageUrl, width, height) {
    theWin=window.open(pageUrl, "", "toolbar=1,location=1,directories=0,status=1,menubar=1,scrollbars=
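The real page is long, so the effect of prettify is easier to see on a tiny document. This is a small sketch on a made-up one-line HTML string (sample_soup is just an illustrative name), showing how each nested tag ends up on its own indented line.

```python
from bs4 import BeautifulSoup

# A made-up, single-line document to keep the output short
sample_soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# prettify() returns a string with one tag or text node per line
pretty = sample_soup.prettify()
print(pretty)
```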

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it:

In [42]:
list(soup.children)


['HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"',
 '\n',
 ' saved from url=(0100)http://www.wetsand.com/page.asp?locationid=5&resourceid=5109&ProdId=0&CatId=852&TabID=849&SubTabID=0 ',
 '\n',
 <html><head><title>Wetsand.com : State or our Surf</title>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <link href="gr_gw_article_files/styles.css" rel="stylesheet" type="text/css"/>
 <!--
 <LINK href="/css/layout.css" type=text/css rel=stylesheet>
 //-->
 <script language="JavaScript">
 <!--
 
 function SymError()
 {
   return true;
 }
 
 window.onerror = SymError;
 
 var SymRealWinOpen = window.open;
 
 function SymWinOpen(url, name, attributes)
 {
   return (new Object());
 }
 
 window.open = SymWinOpen;
 
 //-->
 </script>
 <script language="JavaScript">
 <!--
 flybycounter = 99;
 var randomnumber = Math.random() ; 
 
 function newWinWithUrl(pageUrl, width, height) {
     theWin=window.open(pageUrl, "", "toolbar=1,location=1,directories=0,status=1,menubar=1,scr

Let's see what the type of each element in the list is:



In [43]:
[type(item) for item in list(soup.children)]


[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. NavigableString objects represent text found in the HTML document, and the Comment object holds an HTML comment. Tag objects contain other nested tags. The most important object type, and the one we'll deal with most often, is the Tag object.



The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects in the BeautifulSoup documentation.
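Since Tag objects are the ones we usually care about, it can help to filter the top-level nodes by type. The following is a minimal sketch on a made-up document (sample_html and sample_soup are illustrative names, not part of the page we downloaded).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small, well-formed stand-in for a downloaded page
sample_html = """<!DOCTYPE html>
<!-- a comment -->
<html><head><title>Sample</title></head>
<body><p>First paragraph.</p></body></html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Show the class of every top-level node
for node in sample_soup.children:
    print(type(node).__name__)

# Keep only the Tag objects, skipping the doctype, comments and whitespace strings
top_level_tags = [node for node in sample_soup.children if isinstance(node, Tag)]
print([tag.name for tag in top_level_tags])
```

Filtering by type like this avoids having to count positions in the list by hand.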



Each item in the list returned by the children property is also a BeautifulSoup object, so we can read the children property on a Tag as well. Here we select one of the top-level Tag objects by indexing into the list (index 7, the eighth item), store it in the variable html, and then list its children:

In [76]:
html = list(soup.children)[7]


In [77]:
list(html.children)


['\\n");\n   flyby_windowhandle.document.write (str);\n   flyby_windowhandle.document.write("\\n',
 <p>For more detailed information, check out chapter " + chapter);
    flyby_windowhandle.document.write(" in our ");
    flyby_windowhandle.document.write(" <a href='\"javascript:openURL()\"'>");
    flyby_windowhandle.document.write("WetSand WaveCast Guide to Surf Forecasting</a>");
    flyby_windowhandle.document.write("\n</p>]

We can now isolate the p tags easily using the find_all method, which will find all the instances of a tag on a page:



In [88]:
soup.find_all('p')


[<p>For more detailed information, check out chapter " + chapter);
    flyby_windowhandle.document.write(" in our ");
    flyby_windowhandle.document.write(" <a href='\"javascript:openURL()\"'>");
    flyby_windowhandle.document.write("WetSand WaveCast Guide to Surf Forecasting</a>");
    flyby_windowhandle.document.write("\n</p>, <p>
 
 By: <a href="http://nathancool.com">Nathan Cool</a>, Chief Forecaster, WetSand.com
 <br style="word-wrap:break-word"/>May 22, 2007
 <!--
 <br style=word-wrap:break-word>See also:
 <br>&nbsp;&nbsp;&nbsp;* <a href="http://www.wavecast.com/greenroom/hurricane2006.shtml">Hurricanes 2006</a>
 <br>&nbsp;&nbsp;&nbsp;* <a href="http://wavecast.com/greenroom">Global Warming Percolates Long Range Surf Forecasts</a>
 //-->
 <p></p>
  
 <p></p>
 <!--
 <hr>
 <table cellspacing=3 cellpadding=3 border=0><tr>
 <td valign=top align=left><a href="http://www.greenhousetruth.com/purchase_jump.shtml?from=stateofsurf" target=_blank><img border=0 src="http://www.greenhousetr

Note that find_all returns a list, so we'll have to loop through it, or use list indexing, to extract text:

In [96]:
import re

x = soup.find_all('p')[10].get_text()
# get_text() leaves newlines in place; drop any run of newlines that precedes a capital letter
x = re.sub(r'\n+([A-Z])', r'\1', x)
print(x)

Taking into consideration the state of our planet's climate and weather, this change into summer has some interesting twists and turns. This report will first discuss the primary factors influencing weather and surf for the coming season with a brief synopsis. What follows are sections discussing these effects on California, the East Coast of the U.S., and popular travel destinations as well.
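The cell above uses list indexing. The other option, looping through the results, might look like the sketch below. It runs on a made-up document (sample_html is an illustrative name) and passes strip=True to get_text, which trims the whitespace around each piece of text.

```python
from bs4 import BeautifulSoup

# A made-up document with two paragraphs, one padded with newlines
sample_html = """
<html><body>
<p>First paragraph.</p>
<p>
Second paragraph.
</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Loop over every p tag and collect its text with surrounding whitespace removed
paragraphs = [p.get_text(strip=True) for p in sample_soup.find_all("p")]
print(paragraphs)
```

On simple paragraphs like these, strip=True may remove the need for a regular expression, although text nested inside several tags can end up joined without spaces.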



Done with this practice!