## Beginner's Guide to Scraping (Python 3 version)

### Pre-requisite System Packages:
- python3
- python3-pip

*Most newer distributions already come with pip included.*

### Required Python Packages:
- requests
- beautifulsoup4 (Beautiful Soup, imported as bs4)
- lxml

### Optional Python Packages:
- html5lib

---

### Step 1. Import the packages to be used

In [1]:
import requests
from bs4 import BeautifulSoup

### Step 3. Page retrieval (automated browsing)

In [2]:
page = requests.get("http://quotes.toscrape.com")
# send an HTTP GET request and keep the server's response in `page`

In [3]:
print(page.headers)
# header info - useful for debugging the page retrieval process

{'Transfer-Encoding': 'chunked', 'Date': 'Tue, 22 Aug 2017 06:43:32 GMT', 'Content-Encoding': 'gzip', 'X-Upstream': 'spidyquotes-master_web', 'Connection': 'keep-alive', 'Server': 'nginx/1.10.1', 'Content-Type': 'text/html; charset=utf-8'}
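
Individual headers can be read from `page.headers` like dictionary entries. The object `requests` returns there is a `CaseInsensitiveDict`, so the lookup ignores letter case. The sketch below rebuilds a few of the headers printed above by hand (an assumption made so the example runs without a network connection) and reads one of them.

```python
from requests.structures import CaseInsensitiveDict

# A few of the headers printed above, rebuilt by hand for this sketch
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Content-Encoding": "gzip",
    "Server": "nginx/1.10.1",
})

# Lookups ignore the case of the header name
content_type = headers["content-type"]
print(content_type)  # text/html; charset=utf-8

# .get() avoids a KeyError when the server did not send a header
last_modified = headers.get("Last-Modified", "not sent")
print(last_modified)  # not sent
```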


In [4]:
print(page.status_code)
# HTTP status code (200 means success) - another useful call to debug your scraper

200
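
The status check can be folded into a small helper so that a failed request stops the scraper early instead of passing an error page to the parser. This is a minimal sketch and not part of the original notebook: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, while `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`, or raise if the request failed.

    `fetch_html` is a hypothetical helper name, and the default timeout
    is an arbitrary value picked for this sketch.
    """
    # The timeout keeps the call from waiting forever on a slow server
    page = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    page.raise_for_status()
    return page.text


# Example call (needs network access):
# html = fetch_html("http://quotes.toscrape.com")
```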


In [5]:
print(page.content)
# print the raw response body as bytes (note the b'...' prefix in the output)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\
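
The `b'...'` prefix and the `\xe2\x80\x9c` sequence in the output show that `page.content` holds undecoded bytes. The short sketch below decodes the first bytes of the quote by hand, using the UTF-8 charset announced in the `Content-Type` header. The variable names are illustrative only.

```python
# First bytes of the quote text as they appear in the page.content output
raw = b"\xe2\x80\x9cThe world as we have created it"

# Decode with the charset announced in the Content-Type header
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text[0] == "\u201c")  # True: three bytes become one opening curly quote
```

`page.text` performs this decoding step automatically, which is why the parsing step works with `page.text` rather than `page.content`.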

### Step 4. Parse the page

In [6]:
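# Alternative parsers (uncomment one line to try it):
# soup = BeautifulSoup(page.text, "html.parser")  # built into Python, no extra install
# soup = BeautifulSoup(page.text, "html5lib")     # optional package, most lenient
soup = BeautifulSoup(page.text, "lxml")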
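
If `lxml` might not be installed on every machine that runs the scraper, the parser choice can fall back to Python's built-in `html.parser`. The sketch below is one possible way to do that. The helper name `make_soup` is hypothetical, while `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse `markup` with lxml if available, else with html.parser.

    `make_soup` is a hypothetical helper name used for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>Quotes to Scrape</title></head></html>")
print(soup.title.string)  # Quotes to Scrape
```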

In [7]:
soup.head

<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>

In [8]:
soup.title

<title>Quotes to Scrape</title>
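
Attribute access such as `soup.title` returns only the first matching tag. To collect repeated elements such as the quotes, `find_all` can search by tag name and CSS class. The sketch below runs on a small inline snippet rather than the live page. The `quote`, `text` and `tag` classes appear in the page source shown earlier, but the `author` class on the `small` element and the simplified nesting are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the downloaded page (structure partly assumed)
html = """
<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author">Albert Einstein</small></span>
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
</div>
<div class="quote">
    <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
    <span>by <small class="author">J.K. Rowling</small></span>
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>
    <a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

quotes = []
# class_ (with a trailing underscore) filters on the HTML class attribute
for block in soup.find_all("div", class_="quote"):
    quotes.append({
        "text": block.find("span", class_="text").get_text(strip=True),
        "author": block.find("small", class_="author").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in block.find_all("a", class_="tag")],
    })

for quote in quotes:
    print(quote["author"], quote["tags"])
```

Each quote becomes one dictionary, which is easy to write to CSV or JSON later.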

In [9]:
for string in soup.strings:
    print(repr(string))

'\n'
'\n'
'\n'
'Quotes to Scrape'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'Quotes to Scrape'
'\n'
'\n'
'\n'
'\n'
'\n'
'Login'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
'\n'
'by '
'Albert Einstein'
'\n'
'(about)'
'\n'
'\n'
'\n            Tags:\n            '
'\n'
'change'
'\n'
'deep-thoughts'
'\n'
'thinking'
'\n'
'world'
'\n'
'\n'
'\n'
'\n'
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”'
'\n'
'by '
'J.K. Rowling'
'\n'
'(about)'
'\n'
'\n'
'\n            Tags:\n            '
'\n'
'abilities'
'\n'
'choices'
'\n'
'\n'
'\n'
'\n'
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'
'\n'
'by '
'Albert Einstein'
'\n'
'(about)'
'\n'
'\n'
'\n            Tags:\n            '
'\n'
'inspirational'
'\n'
'life'
'\n'
'live'
'\n'
'miracle'
'\n'
'miracles'
'\n'
'\n'
'\n'
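
Most of the strings printed above are whitespace between tags. Beautiful Soup also offers `stripped_strings`, which skips whitespace-only strings and trims the rest. The sketch below shows it on a small inline snippet whose structure is a simplified assumption, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one quote block (structure assumed for this sketch)
html = """
<div class="quote">
    <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
    <span>by <small>J.K. Rowling</small></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Whitespace-only strings are dropped and the others are stripped
clean = list(soup.stripped_strings)
for string in clean:
    print(repr(string))
```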


### Step 5. Crawling other pages

In [11]:
soup.find_all('a')

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href
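
The `href` values in this output are relative paths, so they have to be joined with the site address before they can be requested. The sketch below uses `urljoin` from the standard library and a set to skip duplicates. The `hrefs` list is copied by hand from the output above, and the names `seen` and `to_visit` are illustrative.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com"

# A few href values copied from the find_all('a') output above
hrefs = [
    "/",
    "/login",
    "/author/Albert-Einstein",
    "/tag/change/page/1/",
    "/author/Albert-Einstein",  # the same link can appear more than once
]

seen = set()   # absolute URLs already queued
to_visit = []  # queue of pages to crawl, in first-seen order
for href in hrefs:
    absolute = urljoin(base_url, href)
    if absolute not in seen:
        seen.add(absolute)
        to_visit.append(absolute)

for url in to_visit:
    print(url)
```

With the `soup` object from Step 4, the list of paths could be built with `[a.get("href") for a in soup.find_all("a")]` instead of being typed by hand.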

#### References:

- http://docs.python-guide.org/en/latest/scenarios/scrape/
- http://www.crummy.com/software/BeautifulSoup/bs4/doc/