# Requests and BeautifulSoup
Accessing and retrieving text data from webpages

# Contents
- [Requests](#req) 
- [BeautifulSoup](#bs) 

In [1]:
# example webpage: Chapter 1 of Alice's Adventures in Wonderland
url = 'http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-1/'

# Requests <a name="req"></a>

In [2]:
import requests

In [3]:
# send a GET request to the webpage; the call returns a Response object
requests.get(url)

<Response [200]>

In [4]:
# create response object
response = requests.get(url)

In [5]:
# raise an HTTPError if the request returned a 4xx or 5xx status; do nothing otherwise
response.raise_for_status()

In [6]:
# HTTP status code of the response (200 means OK)
response.status_code

200
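
The request, the error check, and the status handling above can be combined into one helper. This is a minimal sketch rather than part of the original notebook: the function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails.

    `fetch_html` and the default timeout are illustrative choices.
    """
    try:
        # timeout stops the call from waiting indefinitely on a slow server
        response = requests.get(url, timeout=timeout)
        # raises requests.exceptions.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for errors raised by Requests
        print(f"Request failed: {err}")
        return None
    return response.text
```

Calling `fetch_html(url)` with the `url` defined earlier would return the page as a string, or `None` if the download fails.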

In [7]:
# raw response body as bytes
response.content

b'<!doctype html>\r\n\r\n\r\n<!--[if lt IE 7 ]> <html class="ie ie6 ie-lt10 ie-lt9 ie-lt8 ie-lt7 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 7 ]>    <html class="ie ie7 ie-lt10 ie-lt9 ie-lt8 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 8 ]>    <html class="ie ie8 ie-lt10 ie-lt9 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 9 ]>    <html class="ie ie9 ie-lt10 no-js" lang="en"> <![endif]-->\r\n<!--[if gt IE 9]><!--><html class="no-js" lang="en"><!--<![endif]-->\r\n<!-- the "no-js" class is for Modernizr. --> \r\n<head>\r\n\t<meta charset="utf-8">\t\r\n\t<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">\r\n        <script src="//cdn.optimizely.com/js/4120621558.js"></script>\r\n\t\t<title>\r\n\t\t\tChapter 1: Down the Rabbit-Hole - Alice-in-Wonderland.net\t</title>\r\n\r\n\r\n\r\n\t\n<!-- This site is optimized with the Yoast SEO plugin v11.9 - https://yoast.com/wordpress/plugins/seo/ -->\n<meta name="description" content="Chapter I: Down the Rabbit-Hole; from Alice&#039;s Ad

In [8]:
# webpage content as string (first 500 characters)
response.text[0:500]

'<!doctype html>\r\n\r\n\r\n<!--[if lt IE 7 ]> <html class="ie ie6 ie-lt10 ie-lt9 ie-lt8 ie-lt7 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 7 ]>    <html class="ie ie7 ie-lt10 ie-lt9 ie-lt8 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 8 ]>    <html class="ie ie8 ie-lt10 ie-lt9 no-js" lang="en"> <![endif]-->\r\n<!--[if IE 9 ]>    <html class="ie ie9 ie-lt10 no-js" lang="en"> <![endif]-->\r\n<!--[if gt IE 9]><!--><html class="no-js" lang="en"><!--<![endif]-->\r\n<!-- the "no-js" class is for Modernizr. --> \r\n<hea'
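
The difference between `response.content` and `response.text` is that the first is a `bytes` object and the second is roughly those bytes decoded into a `str`, using the encoding Requests infers for the response (exposed as `response.encoding`). The sketch below uses a short hand-written byte string in place of a real response, so the sample data is an assumption.

```python
# Sample bytes standing in for response.content (assumed to be UTF-8 encoded).
raw = "<p>what is the use of a book,’ thought Alice</p>".encode("utf-8")

# Decoding the bytes yields the kind of string that response.text returns.
decoded = raw.decode("utf-8")

print(type(raw).__name__, len(raw))          # bytes: the curly quote takes several bytes
print(type(decoded).__name__, len(decoded))  # str: the curly quote is one character
```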

In [9]:
# HTTP response headers (a case-insensitive dictionary)
response.headers

{'Date': 'Fri, 13 Sep 2019 15:13:50 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=da06042c135a43222d2fb8ac6e5164f751568387630; expires=Sat, 12-Sep-20 15:13:50 GMT; path=/; domain=.alice-in-wonderland.net; HttpOnly, PHPSESSID=g6erbcrsu1is0gq6lbdq138r41; path=/', 'Cache-Control': 'no-store, no-cache, must-revalidate', 'Cf-Railgun': 'direct (waiting for pending WAN connection)', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Link': '<http://www.alice-in-wonderland.net/wp-json/>; rel="https://api.w.org/", <http://www.alice-in-wonderland.net/?p=591>; rel=shortlink', 'Pragma': 'no-cache', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-Powered-By': 'PHP/7.0.33', 'X-Turbo-Charged-By': 'LiteSpeed', 'Server': 'cloudflare', 'CF-RAY': '515b113fda9dcf64-IAD', 'Content-Encoding': 'gzip'}
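
Individual headers can be looked up by name, and the lookup ignores case because `response.headers` is a `CaseInsensitiveDict`. The sketch below builds such a dictionary by hand from two of the values shown above, so it runs without a network request.

```python
from requests.structures import CaseInsensitiveDict

# Hand-built stand-in for response.headers, using values from the output above.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=UTF-8',
    'Server': 'cloudflare',
})

# Lookups ignore the case of the header name.
content_type = headers['content-type']

# .get() returns None for a header that is absent (the name here is made up).
missing = headers.get('X-Not-Present')

print(content_type)
print(missing)
```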

# BeautifulSoup <a name="bs"></a>

In [10]:
import requests
from bs4 import BeautifulSoup

In [11]:
# create response object
response = requests.get(url)

In [12]:
# parse the HTML into a BeautifulSoup object using Python's built-in parser
soup = BeautifulSoup(response.content, 'html.parser')

In [16]:
# CSS selector; select() returns a list of all matching elements
soup.select('header h1') # page title

[<h1>Chapter 1: Down the Rabbit-Hole</h1>]
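
When only the first match is needed, `select_one()` returns a single element (or `None`) instead of a list. A minimal sketch on a hand-written HTML fragment, which is an assumption standing in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page header shown above.
demo_html = "<header><h1>Chapter 1: Down the Rabbit-Hole</h1></header>"
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# select_one() returns the first matching element, or None if nothing matches.
title_tag = demo_soup.select_one('header h1')
title = title_tag.get_text() if title_tag is not None else None

print(title)
```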

In [13]:
# return first element in list of all specified tags
soup.find_all('p')[0]

<p><strong>A</strong>lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,’ thought Alice `without pictures or conversation?’</p>

In [14]:
# return text of element
soup.find_all('p')[0].get_text()

'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,’ thought Alice `without pictures or conversation?’'
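
The same call can be applied to every paragraph to collect the text of a whole page. The sketch below uses a short made-up fragment rather than the downloaded chapter.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph fragment; the real chapter has many more.
sample_html = (
    "<p><strong>A</strong>lice was beginning to get very tired</p>"
    "<p>So she was considering in her own mind</p>"
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get_text() drops the tags; strip() trims surrounding whitespace.
paragraphs = [p.get_text().strip() for p in sample_soup.find_all('p')]

# Join the paragraphs into one string, separated by blank lines.
full_text = "\n\n".join(paragraphs)
print(full_text)
```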

In [15]:
# return text of the first <ul> element with the specified class
soup.find_all('ul', class_='sub-menu-ul')[0].get_text()

'Alice\'s Adventures in Wonderland Poem: "All in the golden afternoon" Chapter 1: Down the Rabbit-Hole Chapter 2: The Pool of Tears Chapter 3: A Caucus-Race and a long Tale Chapter 4: The Rabbit sends in a little Bill Chapter 5: Advice from a Caterpillar Chapter 6: Pig and Pepper Chapter 7: A Mad Tea-Party Chapter 8: The Queen\'s Croquet-Ground Chapter 9: The Mock Turtle\'s Story Chapter 10: The Lobster Quadrille Chapter 11: Who stole the Tarts? Chapter 12: Alice\'s Evidence An Easter Greeting to every child who loves Alice Christmas Greetings Through the Looking Glass Alice\'s Adventures Under Ground The Nursery "Alice" The lost chapter: a Wasp in a Wig Alice in Wonderland quotes Summaries Disney movie script '

In [17]:
# return chapter menu elements
chapters = soup.find(class_="sub-menu-ul").select('ul li')

In [18]:
# print each element as text
for chapter in chapters:
    print(chapter.get_text())

Poem: "All in the golden afternoon" 
Chapter 1: Down the Rabbit-Hole 
Chapter 2: The Pool of Tears 
Chapter 3: A Caucus-Race and a long Tale 
Chapter 4: The Rabbit sends in a little Bill 
Chapter 5: Advice from a Caterpillar 
Chapter 6: Pig and Pepper 
Chapter 7: A Mad Tea-Party 
Chapter 8: The Queen's Croquet-Ground 
Chapter 9: The Mock Turtle's Story 
Chapter 10: The Lobster Quadrille 
Chapter 11: Who stole the Tarts? 
Chapter 12: Alice's Evidence 
An Easter Greeting to every child who loves Alice 
Christmas Greetings 
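
The menu mixes numbered chapters with other entries, such as the poem and the greetings. One way to keep only the chapters is to filter on the text of each item. This is a sketch on a hand-written list, so the markup is an assumption and may differ from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the chapter menu.
menu_html = """
<ul>
  <li>Poem: "All in the golden afternoon"</li>
  <li>Chapter 1: Down the Rabbit-Hole</li>
  <li>Chapter 2: The Pool of Tears</li>
  <li>Christmas Greetings</li>
</ul>
"""
menu = BeautifulSoup(menu_html, 'html.parser')

# Text of every list item, with surrounding whitespace removed.
items = [li.get_text().strip() for li in menu.select('ul li')]

# Keep only the entries whose text starts with "Chapter ".
chapter_titles = [item for item in items if item.startswith('Chapter ')]
print(chapter_titles)
```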


In [19]:
# return chapter link elements
links = soup.find(class_="sub-menu-ul").select('ul li a')

In [20]:
# print each element as hyperlink
for link in links:
    print(link['href'])

http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/all-in-the-golden-afternoon/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-1/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-2/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-3/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-4/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-5/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-6/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-7/
http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-8/
http://www.alice-in-wonderland.net/resources/c