## Beginner's Guide to Scraping (Python 3 version)

### Prerequisite System Packages:
- python3
- python3-pip

*Most newer distributions already ship with pip.*

### Required Python Packages:
- requests
- lxml

### Optional Python Packages:
- html5lib
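
Before starting, it can help to confirm that the packages listed above are importable. The sketch below is one possible check using only the standard library; the helper name `missing_packages` is made up for this guide. Anything it reports as missing can usually be installed with pip.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["requests", "lxml"]
optional = ["html5lib"]

# An empty list means every package in that group was found.
print("missing required:", missing_packages(required))
print("missing optional:", missing_packages(optional))
```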

---

### Step 1. Import the packages to be used

In [16]:
import requests
from lxml import html, etree

### Step 2. Page retrieval (automated browsing)

In [18]:
page = requests.get("http://quotes.toscrape.com")
# send an HTTP GET request; the returned response object holds the status, headers, and body
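
A plain `requests.get` call waits indefinitely if the server never answers and does not complain about error responses. As a minimal sketch, assuming a 10-second timeout is acceptable, the retrieval step could be wrapped like this; the function name `fetch_page` is hypothetical and not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10.0):
    """Fetch url and return the response, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response


# Example call (performs a real network request, so it is left commented out):
# page = fetch_page("http://quotes.toscrape.com")
```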

In [5]:
print(page.headers)
# header info - useful for debugging the page retrieval process

{'Content-Type': 'text/html; charset=utf-8', 'Server': 'nginx/1.10.1', 'Date': 'Tue, 22 Aug 2017 06:04:00 GMT', 'Connection': 'keep-alive', 'X-Upstream': 'spidyquotes-master_web', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip'}
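
The `Content-Type` header above carries both the media type and the character set of the body. With requests, `page.headers` can be indexed case-insensitively and `page.encoding` holds the charset inferred from the headers. The sketch below shows one way to split such a header value by hand using the standard library; `parse_content_type` is a hypothetical helper, and the `headers` dict is copied from part of the output above.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


# A subset of the headers printed above.
headers = {"Content-Type": "text/html; charset=utf-8", "Content-Encoding": "gzip"}
media_type, charset = parse_content_type(headers["Content-Type"])
print(media_type, charset)
```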


In [6]:
print(page.status_code)
# HTTP status code of the response (200 means success); another useful check when debugging your scraper

200
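
When a scraper misbehaves, it is often easier to read a status code with its standard phrase and general class attached. The helper below is an illustrative sketch built on the standard library's `http.HTTPStatus`; the name `describe_status` and the class labels are assumptions made for this guide.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that HTTPStatus does not define
        phrase = "Unknown"
    if 100 <= code < 200:
        outcome = "informational"
    elif 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```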


In [7]:
print(page.content)
# print the raw response body as bytes; page.text gives the same body decoded to a string

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\
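
The `b'...'` prefix and the `\xe2\x80\x9c` sequences in the output appear because `page.content` is raw bytes, here UTF-8 encoded. The short sketch below decodes a fragment copied from that output to show that those three bytes are a single left curly quotation mark; with requests, `page.text` performs this decoding for you.

```python
# A fragment of the raw body shown above, as bytes.
raw = b"\xe2\x80\x9cThe world as we have created it is a process of our thinking."

# Decode using the charset announced in the Content-Type header.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
print(text)
```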

### Step 3. Parse the page

In [19]:
tree = html.fromstring(page.content)
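
`html.fromstring` accepts the raw bytes directly and returns the root element of the parsed tree. It is also forgiving of imperfect markup, which matters for real-world pages. The sketch below uses a small hand-written document with unclosed `<p>` tags (not taken from the site) to illustrate this; the variable names are arbitrary.

```python
from lxml import html

# Hand-written sample: neither <p> tag is closed.
broken = b"<html><body><p>First paragraph<p>Second paragraph</body></html>"

sample_tree = html.fromstring(broken)

# The parser closes the first <p> when it meets the second one.
paragraphs = sample_tree.xpath("//p/text()")
print(sample_tree.tag, paragraphs)
```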

In [20]:
page_head = tree.xpath('//head')
# xpath() returns a list of matching elements; serialize each one to inspect its markup
for item in page_head:
    print(etree.tostring(item))

b'<head>\n\t<meta charset="UTF-8"/>\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css"/>\n    <link rel="stylesheet" href="/static/main.css"/>\n</head>\n'


In [21]:
tree.xpath('//title/text()')

['Quotes to Scrape']
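
The same XPath approach extends from the title to the quotes themselves. The sketch below runs on a small hand-written document modeled on the markup visible in the outputs of this guide (`div` elements with class `quote`, a `span` with class `text`, and `a` elements with class `tag`); the second quote is a placeholder, and `extract_quotes` is a hypothetical helper. The real page may nest these elements differently.

```python
from lxml import html

# Hand-written sample, not the real page.
sample = """
<html><body>
  <div class="quote">
    <span class="text">The world as we have created it is a process of our thinking.</span>
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
  <div class="quote">
    <span class="text">Placeholder quote for illustration.</span>
    <a class="tag" href="/tag/life/page/1/">life</a>
  </div>
</body></html>
"""


def extract_quotes(tree):
    """Return one dict per quote block with its text and tag names."""
    results = []
    for block in tree.xpath('//div[@class="quote"]'):
        # Paths starting with ".//" search only inside the current block
        text = block.xpath('.//span[@class="text"]/text()')
        tags = block.xpath('.//a[@class="tag"]/text()')
        results.append({"text": text[0].strip() if text else "", "tags": list(tags)})
    return results


quotes = extract_quotes(html.fromstring(sample))
for quote in quotes:
    print(quote["text"], quote["tags"])
```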

### Step 4. Crawling other pages

In [22]:
page_links = tree.xpath('//a')
for item in page_links:
    print(etree.tostring(item))

b'<a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                '
b'<a href="/login">Login</a>\n                \n                '
b'<a href="/author/Albert-Einstein">(about)</a>\n        '
b'<a class="tag" href="/tag/change/page/1/">change</a>\n            \n            '
b'<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            '
b'<a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            '
b'<a class="tag" href="/tag/world/page/1/">world</a>\n            \n        '
b'<a href="/author/J-K-Rowling">(about)</a>\n        '
b'<a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            '
b'<a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        '
b'<a href="/author/Albert-Einstein">(about)</a>\n        '
b'<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            '
b'<a class="tag" href="/tag/life/page/1/">life</a>\n       
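
The `href` values above are relative, so they must be resolved against the site's base URL before they can be requested. The sketch below shows one way to do that with the standard library, skipping duplicates and links to other hosts. The helper `absolute_links` is hypothetical, and the `hrefs` list is copied from part of the output above; in the notebook it could be collected with `tree.xpath('//a/@href')`.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "http://quotes.toscrape.com"


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping same-host links once each, in order."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)
        # Skip links that leave the site and links already collected
        if urlparse(url).netloc != base_host or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result


hrefs = [
    "/",
    "/login",
    "/author/Albert-Einstein",
    "/tag/change/page/1/",
    "/author/Albert-Einstein",
]
links = absolute_links(BASE_URL, hrefs)
for link in links:
    print(link)
```

Each resolved URL can then be fetched and parsed with the same steps used for the first page.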

#### References:

- http://docs.python-guide.org/en/latest/scenarios/scrape/
