## Beginner's Guide to Scraping (Python 2 version)

### Pre-requisite System Packages:
- python
- python-pip

*Most newer distributions already include pip.*

### Required Python Packages:
- mechanize
- beautifulsoup4 (Beautiful Soup, imported as bs4)
- lxml

### Optional Python Packages:
- requests
- html5lib

---

### Step 1. Import the packages to be used

In [1]:
import mechanize
from lxml import etree
from bs4 import BeautifulSoup

### Step 2. Create the browser object

In [2]:
browser = mechanize.Browser()


### Step 3. Page retrieval (automated browse)

In [3]:
# mechanize obeys robots.txt by default; uncomment the next line to skip that check.
# browser.set_handle_robots(False)
# Establish the connection and fetch the page.
response = browser.open("http://quotes.toscrape.com")

In [12]:
# Response headers: useful for debugging the page retrieval process.
print response.info()

Server: nginx/1.10.1
Date: Tue, 22 Aug 2017 02:27:04 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 11053
Connection: close
X-Upstream: spidyquotes-master_web



In [21]:
# The HTTP status code is another useful check when debugging your scraper.
print response.code

200


In [25]:
# Get the page data and print it.
print response.read()

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

In [4]:
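# If the response has already been read, rewind it first (mechanize responses are seekable).
response.seek(0)
page = response.read()
print type(page)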

<type 'str'>
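
In Python 2 a `str` is a byte string, so `page` holds undecoded bytes. Beautiful Soup detects the encoding on its own in Step 4, but if decoded text is needed earlier, the charset advertised in the `Content-Type` header shown above can be used. The following is a minimal sketch using only the standard library; `charset_from_content_type` is a hypothetical helper, and the header value and byte string are stand-ins for `response.info()` and `page`:

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

# Stand-in for the bytes returned by response.read().
raw = b"<title>Quotes to Scrape</title>"

# Header value copied from the response.info() output above.
charset = charset_from_content_type("text/html; charset=utf-8")
text = raw.decode(charset)
print(charset)
```

Parsing the header with `email.message.Message` avoids hand-splitting on `;` and `=`, and the same code runs on Python 2 and Python 3.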


### Step 4. Parse the page

In [5]:
# soup = BeautifulSoup(page)
# soup = BeautifulSoup(page,"html5lib")
soup = BeautifulSoup(page, "lxml")

In [7]:
soup.head

<head>\n<meta charset="unicode-escape"/>\n<title>Quotes to Scrape</title>\n<link href="/static/bootstrap.min.css" rel="stylesheet"/>\n<link href="/static/main.css" rel="stylesheet"/>\n</head>

In [8]:
soup.title

<title>Quotes to Scrape</title>
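
The same navigation extends to the quotes themselves. Below is a minimal sketch, assuming the markup follows the fragment printed in Step 3 (a `div` with class `quote` containing a `span` with class `text`, a `small` with class `author`, and `a` links with class `tag`). `SAMPLE_HTML` is a hand-written, abridged fragment and `extract_quotes` is an illustrative name, neither is part of the original notebook. The built-in `html.parser` is used only so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the page fragment printed in Step 3.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small>
    <a href="/author/J-K-Rowling">(about)</a></span>
  <div class="tags">
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>
    <a class="tag" href="/tag/choices/page/1/">choices</a>
  </div>
</div>
"""

def extract_quotes(soup):
    """Return one dict per <div class="quote"> block found in the soup."""
    quotes = []
    for block in soup.find_all("div", class_="quote"):
        quotes.append({
            "text": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
            "tags": [a.get_text(strip=True)
                     for a in block.find_all("a", class_="tag")],
        })
    return quotes

quotes = extract_quotes(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(quotes)
```

On the real page, calling `extract_quotes(soup)` with the `soup` object from Step 4 should work the same way, provided the site's markup has not changed.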

In [10]:
for string in soup.strings:
    print(repr(string))

u'\n'
u'\n'
u'\n'
u'Quotes to Scrape'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'Quotes to Scrape'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'Login'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\n'
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
u'\n'
u'by '
u'Albert Einstein'
u'\n'
u'(about)'
u'\n'
u'\n'
u'\n            Tags:\n            '
u'\n'
u'change'
u'\n'
u'deep-thoughts'
u'\n'
u'thinking'
u'\n'
u'world'
u'\n'
u'\n'
u'\n'
u'\n'
u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'
u'\n'
u'by '
u'J.K. Rowling'
u'\n'
u'(about)'
u'\n'
u'\n'
u'\n            Tags:\n            '
u'\n'
u'abilities'
u'\n'
u'choices'
u'\n'
u'\n'
u'\n'
u'\n'
u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d'
u'\n'
u'by '
u'Albert Einstein'
u'\n'
u'(about)'
u'\n'
u'\n'
u'\n            Tags:\n 
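
Most of the strings printed above are whitespace between tags. Beautiful Soup also provides `stripped_strings`, which skips whitespace-only strings and trims the rest. Here is a small sketch on a hand-written fragment (`SAMPLE_HTML` and `sample_soup` are illustrative names, and `html.parser` is used so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on one quote block from the page.
SAMPLE_HTML = """
<div class="quote">
    <span>by <small class="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# stripped_strings is a generator; list() materializes it for inspection.
texts = list(sample_soup.stripped_strings)
for text in texts:
    print(repr(text))
```

The result contains only the six meaningful strings (`by`, the author name, `(about)`, `Tags:`, and the two tag names), with no `'\n'` entries.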

### Step 5. Crawling other pages

In [11]:
soup.find_all('a')

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href
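
The `href` values in this output are relative paths, so they need to be resolved against the site's address before they can be passed to `browser.open()`. The sketch below uses only the standard library; `absolute_links` is a hypothetical helper, and the `hrefs` list is copied by hand from the output above (with a soup object, the values could instead be collected with `a.get("href")` for each tag returned by `find_all('a')`):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

BASE_URL = "http://quotes.toscrape.com"

# A few href values copied from the find_all('a') output above.
hrefs = [
    "/",
    "/login",
    "/author/Albert-Einstein",
    "/tag/change/page/1/",
    "/author/Albert-Einstein",  # duplicate: the same author appears twice
]

def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

links = absolute_links(BASE_URL, hrefs)
for url in links:
    print(url)
```

Deduplicating before crawling avoids fetching the same author or tag page more than once.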

#### References:

- http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize
- http://www.crummy.com/software/BeautifulSoup/bs4/doc/