In [1]:
import requests 
from bs4 import BeautifulSoup

This tutorial shows how to use Beautiful Soup to scrape web pages. First, we use the `requests` library to fetch the raw HTML from a URL we specify.

In [5]:
result = requests.get("https://www.google.com/")

It is wise to check that our request succeeded. We can do so by checking the `status_code` attribute of the response object; a value of 200 means the request succeeded.

In [6]:
print(result.status_code)

200
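As a complement to printing the status code, `requests` can raise an exception for error statuses. The sketch below builds a `Response` object by hand (an assumption made only so that the example runs without network access) and shows how `ok` and `raise_for_status()` behave for a 404.

```python
import requests

# Hand-built response so the example runs offline; in the notebook
# you would call these on `result` from requests.get(...) instead.
fake = requests.Response()
fake.status_code = 404

# `ok` is False for 4xx and 5xx status codes.
print(fake.ok)

# raise_for_status() turns an error status into an exception.
try:
    fake.raise_for_status()
except requests.HTTPError as err:
    print("request failed:", err)
```

In a script, calling `result.raise_for_status()` right after `requests.get(...)` stops execution early instead of parsing an error page by mistake.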


We can also print other information, such as the HTTP headers.

In [7]:
print(result.headers)

{'Date': 'Thu, 13 Aug 2020 16:35:35 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2020-08-13-16; expires=Sat, 12-Sep-2020 16:35:35 GMT; path=/; domain=.google.com; Secure, NID=204=rWuhyEKIlegFTimOAGKBt53dZL-lt2a0U4yGHgFCogkEJzPWU-a0uzXAMFoFutMSDYFxYBXbNntFsQ3BN1F7lGH-iNoaXPh2kOvQN6og2qFj0sTVmSvtFplcqzoURy0pmu6S6mv83gDWn8dBHJtr297J5PKc9KSkB0B-ywTtMCs; expires=Fri, 12-Feb-2021 16:35:35 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'Transfer-Encoding': 'chunked'}


Note that the `domain` attribute of each cookie in the `Set-Cookie` header is `.google.com`, which matches the site we requested.

Now let's store the `content` of the page (the raw bytes of the response body) in a variable so that it is easier to work with.

In [11]:
src = result.content

**Using Beautiful Soup to interact with content**

In [12]:
soup = BeautifulSoup(src, 'lxml')

In [17]:
# examine all the links
links = soup.find_all('a')
for link in links: 
    print(link.text)

Images
Maps
Play
YouTube
News
Gmail
Drive
More »
Web History
Settings
Sign in
Advanced search
Take a Password Checkup
Advertising Programs
Business Solutions
About Google
Privacy
Terms
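Printing `link.text` shows only the visible label. A minimal sketch of pairing each label with its target follows; it uses a small hand-written HTML snippet (hypothetical, not the real Google page) and the built-in `html.parser` so that it runs without a network connection or `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = """
<a href="/intl/en/about.html">About Google</a>
<a name="top">No link target</a>
<a href="https://mail.google.com/">Gmail</a>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None when the attribute is missing,
# whereas link.attrs['href'] would raise a KeyError.
pairs = [(a.get_text(strip=True), a.get("href")) for a in sample_soup.find_all("a")]
for text, href in pairs:
    print(text, "->", href)
```

Using `.get("href")` is a defensive choice: not every `<a>` tag on a real page carries an `href`.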


In [21]:
for link in links: 
    if "About" in link.text: 
        print(link, '\n') 
        print(link.attrs['href'])

<a href="/intl/en/about.html">About Google</a> 

/intl/en/about.html
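The `href` above is relative to the site root, so it cannot be requested as is. One way to turn it into an absolute URL is `urljoin` from the standard library; the variable names below are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://www.google.com/"   # the page we requested
relative_href = "/intl/en/about.html"  # the href extracted above

# urljoin resolves the relative path against the base URL.
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

The resulting URL can then be passed to `requests.get` like any other.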


# Let's try a more elaborate example

Task: extract every link on the White House briefings page that points to a briefing or statement.

In [22]:
url = "https://www.whitehouse.gov/briefings-statements/"
result = requests.get(url)

In [23]:
print(result.status_code)

200


In [24]:
src = result.content

In [25]:
soup = BeautifulSoup(src, 'lxml') 

In [26]:
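urls = []
# The briefing links sit inside <h2> headings, so walk every <h2> tag
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    # find() returns None when a heading has no link, so skip those
    if a_tag is not None and a_tag.get('href'):
        urls.append(a_tag['href'])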
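
The same extraction can also be sketched with a CSS selector. The markup below is a hypothetical stand-in for the briefings page (the real page structure may differ), parsed with the built-in `html.parser` so that the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; example.com URLs are placeholders.
sample_html = """
<h2><a href="https://example.com/briefing-1/">Briefing one</a></h2>
<h2>Heading without a link</h2>
<h2><a href="https://example.com/briefing-2/">Briefing two</a></h2>
<p><a href="https://example.com/unrelated/">Not inside an h2</a></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "h2 a[href]" matches <a> tags that have an href and sit inside an <h2>,
# so headings without links and links outside headings are ignored.
sample_urls = [a["href"] for a in sample_soup.select("h2 a[href]")]
print(sample_urls)
```

The selector folds the "find each `h2`, then find its `a`" loop into one expression and skips headings that have no link.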
    

In [28]:
for url in urls: 
    print(url, '\n')

https://www.whitehouse.gov/briefings-statements/president-donald-j-trump-secured-historic-deal-israel-united-arab-emirates-advance-peace-prosperity-region/ 

https://www.whitehouse.gov/briefings-statements/joint-statement-united-states-state-israel-united-arab-emirates/ 

https://www.whitehouse.gov/briefings-statements/remarks-president-trump-press-briefing-081320/ 

https://www.whitehouse.gov/briefings-statements/statement-press-secretary-regarding-safe-reopening-americas-schools/ 

https://www.whitehouse.gov/briefings-statements/remarks-president-trump-kids-first-getting-americas-children-safely-back-school/ 

https://www.whitehouse.gov/briefings-statements/president-trump-announces-presidential-delegation-dominican-republic-attend-inauguration-excellency-luis-abinader/ 

https://www.whitehouse.gov/briefings-statements/president-donald-j-trump-supporting-americas-students-families-encouraging-safe-reopening-americas-schools/ 

https://www.whitehouse.gov/briefings-statements/remarks-p