# COMS W1002 : Computing in Context
## Lecture 23: Beautiful Soup 

Beautiful Soup is a Python library that makes it easy to extract information from web pages. Documentation can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

**Example 1**

In [1]:
#example of use

import bs4
import requests

#store the URL of the page we want to scrape
loc='http://www.cs.columbia.edu/~cannon/exampleIndex.html'

#create the response object using requests.get
r=requests.get(loc)
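
A request can fail (the server may be down, or the page may have moved), so it can help to check the response before parsing it. The sketch below is one way to do that; the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the lecture code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as bytes, raising an exception on failure.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# page_bytes = fetch_html('http://www.cs.columbia.edu/~cannon/exampleIndex.html')
```

The returned bytes can be handed to `bs4.BeautifulSoup` in the same way as `r.content`.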




In [2]:
r.text

'<html>\n<head>\n<title>Adam Cannon\'s Home Page</title>\n</head>\n\n<body bgcolor="gray">\n<h2 align= center>Adam Cannon</br>\n</br>\n<img align=left src = "pic.jpg" width="225" height="175"></h2>\n<font size=2>\n<b>Mailing Address:</b></br>\nDepartment of Computer Science </br>\nColumbia University</br>\n1214 Amsterdam Avenue, MC 0401</br>\nNew York, NY 10027</br>\n</br>\n<b>office:</b> 459 Computer Science Building</br>\n<b>tel:</b> (212) 939-7016</br>\n<b>fax:</b> (212) 666-0140</br>\n<font size=3>\n<b>e-mail: </b><a\nhref="mailto:cannon@cs.columbia.edu"><tt>cannon@cs.columbia.edu</tt></a></br></b>\n<font size=2>\n<hr>\n<ul>\n<p>\n\nI am a faculty member in the \n<a\nhref="http://www.cs.columbia.edu">Department of Computer Science</a> at <a\nhref="http://www.columbia.edu">Columbia University</a>. \n</br>\n\n</ul>\n<p>\n<h3> Education </h3>\n<ul>\n<li>Ph.D. Applied Mathematics, <a \nhref="http://www.jhu.edu">Johns Hopkins University</a>.</br>\n<li>M.A. Applied Mathematics, Johns Hop

In [3]:
#create the soup object
soup=bs4.BeautifulSoup(r.content,'html.parser')

# now let's check out some methods, starting with prettify()
print(soup.prettify())

<html>
 <head>
  <title>
   Adam Cannon's Home Page
  </title>
 </head>
 <body bgcolor="gray">
  <h2 align="center">
   Adam Cannon
   <img align="left" height="175" src="pic.jpg" width="225"/>
  </h2>
  <font size="2">
   <b>
    Mailing Address:
   </b>
   Department of Computer Science
   Columbia University
   1214 Amsterdam Avenue, MC 0401
   New York, NY 10027
   <b>
    office:
   </b>
   459 Computer Science Building
   <b>
    tel:
   </b>
   (212) 939-7016
   <b>
    fax:
   </b>
   (212) 666-0140
   <font size="3">
    <b>
     e-mail:
    </b>
    <a href="mailto:cannon@cs.columbia.edu">
     <tt>
      cannon@cs.columbia.edu
     </tt>
    </a>
    <font size="2">
     <hr/>
     <ul>
      <p>
       I am a faculty member in the
       <a href="http://www.cs.columbia.edu">
        Department of Computer Science
       </a>
       at
       <a href="http://www.columbia.edu">
        Columbia University
       </a>
       .
      </p>
     </ul>
     <p>
      <h3>
       E

In [4]:
#find and find_all
print(soup.find('title').get_text())

Adam Cannon's Home Page
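
To make the difference between `find` and `find_all` concrete, here is a small sketch that runs on an inline HTML string rather than a live page. The sample markup and variable names are made up for illustration.

```python
import bs4

# A tiny made-up page so the example runs without a network connection
html = """
<html><head><title>Sample Page</title></head>
<body>
<a href="http://www.columbia.edu">Columbia</a>
<a href="http://www.cs.columbia.edu">CS</a>
</body></html>
"""
sample = bs4.BeautifulSoup(html, 'html.parser')

first_link = sample.find('a')          # a single Tag: the first <a> only
all_links = sample.find_all('a')       # a list-like collection of every <a>
no_tables = sample.find_all('table')   # empty when nothing matches

print(first_link.get_text())   # Columbia
print(len(all_links))          # 2
print(len(no_tables))          # 0
```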


In [5]:
for l in soup.find_all('a'):
    print(l.get('href'))
    

mailto:cannon@cs.columbia.edu
http://www.cs.columbia.edu
http://www.columbia.edu
http://www.jhu.edu
http://www.ucla.edu
http://www.cs.columbia.edu/~cannon/
http://www.cs.columbia.edu
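
Not every `<a>` tag has an `href`, and not every `href` is a web address (the first one above is a `mailto:` link). The sketch below shows one way to keep only the web links; the HTML string is invented for the example.

```python
import bs4

# Made-up markup: one anchor without href, one mailto link, one web link
html = """
<p><a name="top">no href here</a>
<a href="mailto:someone@example.com">mail</a>
<a href="http://www.columbia.edu">Columbia</a></p>
"""
sample = bs4.BeautifulSoup(html, 'html.parser')

# .get('href') returns None for an anchor that has no href attribute
all_hrefs = [a.get('href') for a in sample.find_all('a')]

# href=True keeps only anchors that actually carry the attribute;
# the startswith check then drops the mailto: link
web_links = [a['href'] for a in sample.find_all('a', href=True)
             if a['href'].startswith('http')]

print(all_hrefs)   # [None, 'mailto:someone@example.com', 'http://www.columbia.edu']
print(web_links)   # ['http://www.columbia.edu']
```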


**Example 2** 
Let's scrape the current status of the lifts at a local(ish) ski resort.

In [6]:
# same as before but a different url
loc2='https://whiteface.com/mountain/conditions/'
r2=requests.get(loc2)
soup2=bs4.BeautifulSoup(r2.content,'html.parser')

print(soup2.prettify())


<!DOCTYPE html>
<!--[if lt IE 7]><html lang="en-US" class="no-js lt-ie9 lt-ie8 lt-ie7"><![endif]-->
<!--[if (IE 7)&!(IEMobile)]><html lang="en-US" class="no-js lt-ie9 lt-ie8"><![endif]-->
<!--[if (IE 8)&!(IEMobile)]><html lang="en-US" class="no-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" itemscope="itemscope" itemtype="https://schema.org/WebPage" lang="en-US">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link href="https://whiteface.com/wp-content/themes/plate-orda/favicon.png" rel="icon"/>
  <meta content="kwj19izoa4awymr9y245l2fw8ubbgw" name="facebook-domain-verification">
   <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
    <!-- This site is optimized with the Yoast SEO plugin v17.5 - https://yoast.com/wordpress/plugins/seo/ -->
    <title>
    

In [7]:
#inspect the page source to find the class names of the elements you want
#I want to know the number of lifts open, the number of trails open, and the new snow

lifts=soup2.find(class_='report-box lifts')
trails=soup2.find(class_='report-box trails')
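
A note on how `class_` matching behaves when an element has more than one class, as `report-box trails` does. The sketch below uses a one-line stand-in for the real page; `select_one` with a CSS selector is shown as an alternative that does not depend on the order of the classes.

```python
import bs4

# Stand-in markup modeled on the trails box; not the live page
html = '<div class="report-box trails"><span class="primary">7</span></div>'
sample = bs4.BeautifulSoup(html, 'html.parser')

exact = sample.find(class_='report-box trails')      # the exact attribute string
single = sample.find(class_='trails')                # any one of the classes
reordered = sample.find(class_='trails report-box')  # same classes, other order
css = sample.select_one('div.trails.report-box')     # CSS selector, order-free

print(exact is not None)    # True
print(single is not None)   # True
print(reordered is None)    # True: the string no longer matches
print(css is not None)      # True
```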



In [8]:
print(lifts.get_text())


Lifts

2
of 11





In [9]:
print(trails.get_text())


Trails

7
of 90
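
The blank lines in this output come from the newlines between the nested tags. If you only want the text, `get_text` accepts a separator and a `strip` flag, and `stripped_strings` yields the pieces one at a time. The sketch below reuses the markup of the trails box as an inline string so it runs offline; `trails_box` is an illustrative name.

```python
import bs4

# The trails box, copied into a string so no network request is needed
html = """<div class="report-box trails">
<div class="title">Trails</div>
<div class="main-detail">
<span class="primary">7</span>
<span class="secondary">of 90</span>
</div>
<div class="sub-detail"></div>
</div>"""
trails_box = bs4.BeautifulSoup(html, 'html.parser').find(class_='report-box trails')

# Join the non-empty pieces of text with a single space
print(trails_box.get_text(' ', strip=True))   # Trails 7 of 90

# Or collect the pieces as a list
print(list(trails_box.stripped_strings))      # ['Trails', '7', 'of 90']
```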





What if I just want the numbers 2 and 7 and not all the rest? Let's have a look at what *lifts* and *trails* actually are.

In [10]:
type(trails)

bs4.element.Tag

So *trails* is what's called a Tag object. Let's look at it in the raw:

In [11]:
trails

<div class="report-box trails">
<div class="title">Trails</div>
<div class="main-detail">
<span class="primary">7</span>
<span class="secondary">of 90</span>
</div>
<div class="sub-detail"></div>
</div>

Notice there are many tags within the "report-box trails" tag. How do we get at these? One way is to use the *contents* attribute of a bs4 Tag. This gives us a list of the tag's direct children.

In [12]:
trails.contents

['\n',
 <div class="title">Trails</div>,
 '\n',
 <div class="main-detail">
 <span class="primary">7</span>
 <span class="secondary">of 90</span>
 </div>,
 '\n',
 <div class="sub-detail"></div>,
 '\n']
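
The `'\n'` entries in this list are the whitespace between the tags, stored as strings rather than tags. If the newlines get in the way, one option is to keep only the children that are tags. This is a sketch on an offline copy of the markup; `trails_box` and `child_tags` are illustrative names.

```python
import bs4

html = """<div class="report-box trails">
<div class="title">Trails</div>
<div class="main-detail">
<span class="primary">7</span>
<span class="secondary">of 90</span>
</div>
<div class="sub-detail"></div>
</div>"""
trails_box = bs4.BeautifulSoup(html, 'html.parser').find(class_='report-box trails')

# Keep only the direct children that are tags, skipping the '\n' strings
child_tags = [c for c in trails_box.contents if isinstance(c, bs4.element.Tag)]
print([c['class'] for c in child_tags])
# [['title'], ['main-detail'], ['sub-detail']]

# find_all(True, recursive=False) selects the same direct child tags
direct_children = trails_box.find_all(True, recursive=False)
print(len(direct_children))   # 3
```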

Notice that the fourth element of *trails.contents* (index 3) is another tag, and that tag also has child tags. To get at those we can use contents again, and finally we can recover the text.

In [13]:
trails.contents[3].contents[1].get_text()

'7'
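
Indexing into `contents` depends on the exact position of every newline and tag, so it breaks easily. As an alternative sketch, you can search inside the tag by class name instead. The markup below is the trails box copied into a string, and the variable names are illustrative.

```python
import bs4

html = """<div class="report-box trails">
<div class="title">Trails</div>
<div class="main-detail">
<span class="primary">7</span>
<span class="secondary">of 90</span>
</div>
<div class="sub-detail"></div>
</div>"""
trails_box = bs4.BeautifulSoup(html, 'html.parser').find(class_='report-box trails')

# find works on any Tag, not just on the whole soup
open_trails = int(trails_box.find('span', class_='primary').get_text())

# The secondary span holds text like 'of 90'; take the last word as the total
total_text = trails_box.find('span', class_='secondary').get_text()
total_trails = int(total_text.split()[-1])

print(open_trails, total_trails)   # 7 90
```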

Now what if we want more detail? That is, suppose I want a list of all of the open lifts. Once again we need to return to the page's source code and find the class names we are interested in.

In [14]:
status=soup2.find_all(class_='lift-icon data-icon')

In [15]:
#notice the structure is a little more complex this time
type(status[0].contents[0])

bs4.element.Tag

In [16]:
for element in status:
    print(element.contents[0]['title'])

Cloudsplitter Gondola is open
Lookout Mountain Triple Chair is closed
Little Whiteface Lift is closed
Freeway Lift is closed
Mountain Run Lift is closed
Face Lift is open
Bear Lift is closed
Summit Quad is closed
Falcon Flyer Quad is closed
Cub Carpet is closed
Coyote Carpet is closed
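
To get from these strings to a list of just the open lifts, one approach is to split each title on `' is '`. The sketch below assumes markup shaped like the page above; the inner `<span>` tag is a guess, since the notes only show that the first child is a tag with a `title` attribute.

```python
import bs4

# Hypothetical markup: the real page's inner tag name is not shown in the notes
html = (
    '<div class="lift-icon data-icon"><span title="Cloudsplitter Gondola is open"></span></div>'
    '<div class="lift-icon data-icon"><span title="Bear Lift is closed"></span></div>'
    '<div class="lift-icon data-icon"><span title="Face Lift is open"></span></div>'
)
sample = bs4.BeautifulSoup(html, 'html.parser')

lift_status = {}
for element in sample.find_all(class_='lift-icon data-icon'):
    title = element.contents[0]['title']
    # Split on the last ' is ' so the lift name stays in one piece
    name, _, state = title.rpartition(' is ')
    lift_status[name] = state

open_lifts = [name for name, state in lift_status.items() if state == 'open']
print(open_lifts)   # ['Cloudsplitter Gondola', 'Face Lift']
```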


**Scraping is involved**: Scraping data from web pages is a relatively ad hoc procedure. You need to experiment with the page you're interested in, and always keep in mind that pages change, so code you write today may well stop working in the near future.
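
Because `find` returns `None` when nothing matches, a changed page usually shows up as a confusing `AttributeError` a few lines later. One defensive pattern is a small wrapper that fails with a clear message instead. This is a sketch; `require` is a hypothetical helper name, and the markup is a stand-in for a real page.

```python
import bs4


def require(soup, class_name):
    """Return the first element with the given class, or raise a clear error.

    `require` is a hypothetical helper, not part of Beautiful Soup.
    """
    tag = soup.find(class_=class_name)
    if tag is None:
        raise LookupError(
            f"no element with class {class_name!r}; has the page layout changed?"
        )
    return tag


# Stand-in markup for demonstration
html = '<div class="report-box lifts"><span class="primary">2</span></div>'
sample = bs4.BeautifulSoup(html, 'html.parser')

print(require(sample, 'report-box lifts').get_text(strip=True))   # 2

try:
    require(sample, 'report-box snow')
except LookupError as err:
    print('Scrape failed:', err)
```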