# Python Web Scraper

An exploration of web scraping to automate the collection of data from public websites using Python while ensuring compliance with site owner policies and common web standards. Scraped data is stored in sqlite databases to ensure portability.

## Objective 1: Inspect gutenberg.org

This site hosts an extensive library of ancient texts crying out for experimentation with Natural Language Processing (NLP). Let's check it out!

The first thing to scrape is the site's robots.txt file. Site owners with a scraping policy post their preferences at url_base/robots.txt. For example, to check the scraping policy for www.mywebsite.org, the robots.txt file can be found at www.mywebsite.org/robots.txt.
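As a small sketch of that rule, the robots.txt location can be derived from any page url on the same site. The helper name `robots_url` is hypothetical, and the sketch assumes the url includes its scheme (`https://`), since the standard library cannot find the host otherwise.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the web root
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.gutenberg.org/ebooks/bookshelf/"))
# prints https://www.gutenberg.org/robots.txt
```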

In [1]:
import requests
from bs4 import BeautifulSoup
import datetime
import sqlite3
import pprint 
import os
from os.path import basename

In [2]:
url = "https://www.gutenberg.org/robots.txt"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

In [3]:
print(soup.text)

User-agent: Googlebot-Mobile
Disallow: /

User-agent: AdsBot-Google
Disallow: /

User-agent: Yahoo Pipes 2.0
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: asterias
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-Agent: Oracle Secure Enterprise Search
Disallow: /

# User-agent: Baiduspider
# Disallow: /

# User-agent: Yandex
# Disallow: /

User-Agent: Mail.RU_Bot
Disallow: /

User-agent: YisouSpider
Disallow: /

User-agent: EasouSpider
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Riddler
Disallow: /

User-agent: Daumoa
Disallow: /

User-agent: Exabot
Disallow: /

User-agent: NerdyBot
Disallow: /

User-agent: 008
Disallow: /	

User-agent: ccbot
Disallow: /	

User-agent: discobot
Disallow: /	

User-agent: OmegaSeek
Disallow: /	

User-agent: discoverybot
Disallow: / 

User-agent: MJ12bot
Disallow: /

User-agent: wotbox
Disallow: /

User-agent: yacy
Disallow: /

User-agent: Twitterbot
Disallow:

User-agent: Blekkobot
Disallow: /

User-agent: Abonti
Disal

Nice! We just scraped our first page.

Based on this result (truncated here for length), the site owner of https://www.gutenberg.org has disallowed scraping by numerous named crawlers, many of them "big data" collection orgs, and has also disallowed scraping of several urls. Unlisted user agents should also observe a crawl delay of 5 seconds per page. As no additional restrictions are listed, it appears that Project Gutenberg is friendly to respectful web scrapers like us. Good news!
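Rather than reading the rules by eye, the standard library's `urllib.robotparser` can answer "may I fetch this?" directly. The sketch below parses a short hand-written sample instead of the live file. The first two entries mirror the output above, while the wildcard entry, the `/private/` path, and the `my-scraper` agent name are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: the wildcard entry and /private/ path are assumed
SAMPLE_RULES = """\
User-agent: AhrefsBot
Disallow: /

User-agent: Twitterbot
Disallow:

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

AGENT = "my-scraper"  # hypothetical user-agent name for our scraper

blocked_bot = parser.can_fetch("AhrefsBot", "https://www.gutenberg.org/")
our_access = parser.can_fetch(AGENT, "https://www.gutenberg.org/ebooks/")
our_private = parser.can_fetch(AGENT, "https://www.gutenberg.org/private/page.html")
our_delay = parser.crawl_delay(AGENT)

print(blocked_bot, our_access, our_private, our_delay)
```

To check the live site instead, `set_url()` followed by `read()` would replace the `parse()` call, at the cost of one extra request.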

Next, let's take a look at the site root to see what's available for download. This time we'll use prettify to make the html easier to read, and we'll record a timestamp alongside the request so we have a record of the date/time this data was collected.

In [4]:
url = "https://www.gutenberg.org"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
date = datetime.datetime.now()

In [5]:
# Print the first 1000 characters
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Free eBooks | Project Gutenberg
  </title>
  <link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>
  <link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>
  <link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>
  <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">
   <meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>
   <meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>
   <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
    <link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">
     <meta content="Project Gutenberg" property="og:title"/>
     <meta content="website" property="og:type"/>
     <meta cont

Success!

After looking around more extensively with a web browser, I found a lot of excellent content to explore! I would like to have a look at several child pages from here, but also want to ensure that I'm only downloading pages once to minimize outgoing requests. This means that I need to save these pages for later while labelling them to ensure that I can navigate the same page structure from my local storage as desired.
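One way to label saved pages so the local copy mirrors the site's structure is to derive a file path from each url. This is only a sketch: the helper name `local_path_for` and the choice to store directory-style urls as `index.html` are assumptions, not something the site prescribes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def local_path_for(url, root="gutenberg.org"):
    """Map a page url to a file path under root that mirrors the site layout."""
    path = urlsplit(url).path.lstrip("/")
    # Treat the site root and directory-style urls as that folder's index.html
    if path == "" or path.endswith("/"):
        path += "index.html"
    return PurePosixPath(root) / path

print(local_path_for("https://www.gutenberg.org"))
print(local_path_for("https://www.gutenberg.org/about/"))
print(local_path_for("https://www.gutenberg.org/policy/robot_access.html"))
```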

## Objective 2: Save the Data

Next, let's save our collected data in a local folder named "gutenberg.org":

In [6]:
# Make the folder
# Note: exist_ok=True makes this cell safe to re-run if the folder already exists
os.makedirs("gutenberg.org", exist_ok=True)

In [7]:
# Generate a local HTML file from our soup object
with open("./gutenberg.org/index.html", "w", encoding = 'utf-8') as file:
    file.write(soup.prettify())

That worked well! If desired, we could rebuild the entire site using local folders like this, but a database is a better fit for storing and querying many pages. Let's take a look at storing this data using sqlite3.

In [8]:
# This creates a database called 'gutenberg_texts.db' if one doesn't already exist.
# If the database already exists, this command establishes a connection to the existing database
db = './gutenberg.org/gutenberg_texts.db'
conn = sqlite3.connect(db)

# Generate a cursor to navigate our new database
c = conn.cursor()

In [9]:
# Create a table for storing scraped web data
# Note: this command should only be run once; it will generate an error if the table already exists
c.execute('''CREATE TABLE webpages(server_loc TEXT, url TEXT, content TEXT, scrapedate TEXT);''')

<sqlite3.Cursor at 0x7f7a60445b20>
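If re-running the notebook is likely, SQLite's `IF NOT EXISTS` clause is one way to sidestep the "table already exists" error. The sketch below uses a throwaway in-memory database (an assumption for the demo) so it does not touch the real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, used only for this demo
create_sql = """CREATE TABLE IF NOT EXISTS webpages(
    server_loc TEXT, url TEXT, content TEXT, scrapedate TEXT);"""

# Running the statement twice is harmless thanks to IF NOT EXISTS
conn.execute(create_sql)
conn.execute(create_sql)

tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print(tables)  # prints [('webpages',)]
conn.close()
```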

Now we need to set up Python variables to store the data for insertion into our webpages table.

In [10]:
server_loc = '/' # This signifies that we scraped this page from the web root
content = soup.prettify()
scrapedate = str(date) # Store the timestamp as text to match the TEXT column

c.execute('''INSERT INTO webpages VALUES(?,?,?,?);''', (server_loc, url, content, scrapedate))
conn.commit()

Awesome! No output from the cell above indicates no problems in saving this data. Just to be certain, let's check by querying the database.

In [11]:
# Show all tables
c.execute('''SELECT * FROM sqlite_master  
  WHERE type='table';''')
c.fetchall()

[('table',
  'webpages',
  'webpages',
  2,
  'CREATE TABLE webpages(server_loc TEXT, url TEXT, content TEXT, scrapedate TEXT)')]

In [12]:
# Show the contents of the webpages table
c.execute("SELECT * FROM webpages")
c.fetchall()

[('/',
  'https://www.gutenberg.org',
  '<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Free eBooks | Project Gutenberg\n  </title>\n  <link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>\n  <link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>\n  <link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>\n  <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">\n   <meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>\n   <meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>\n   <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">\n    <link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">\n     <meta content="Project Gutenberg" property="og:title"/>\n     <

Congratulations! If you're following along, you just scraped a web page and stored the HTML in a local database. This should be useful for future exploration and experimentation. 
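Since the goal stated earlier is to download each page only once, a lookup against the webpages table can act as a guard before any new request is sent. The helper name `already_scraped` is hypothetical, and the demo row (including its placeholder date) lives in an in-memory database rather than the real file.

```python
import sqlite3

def already_scraped(conn, url):
    """Return True if url already has a row in the webpages table."""
    row = conn.execute(
        "SELECT 1 FROM webpages WHERE url = ? LIMIT 1;", (url,)).fetchone()
    return row is not None

# Demo against an in-memory copy of the table layout used in this post
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE webpages(server_loc TEXT, url TEXT, content TEXT, scrapedate TEXT);")
demo.execute(
    "INSERT INTO webpages VALUES(?,?,?,?);",
    ("/", "https://www.gutenberg.org", "<html></html>", "2024-01-01 00:00:00"))

print(already_scraped(demo, "https://www.gutenberg.org"))         # prints True
print(already_scraped(demo, "https://www.gutenberg.org/about/"))  # prints False
```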

Now that these first steps have been taken, let's have a look at all links from the homepage.

In [13]:
# For all links on the homepage, print link text followed by the url
pprint.pprint([(a.text, a.get_attribute_list('href')) for a in soup.find_all('a')])

[('\n\n', ['/']),
 ('About\n          ▾\n', ['/about/']),
 ('About Project Gutenberg', ['/about/']),
 ('Collection Development', ['/policy/collection_development.html']),
 ('Contact Us', ['/about/contact_information.html']),
 ('History & Philosophy', ['/about/background/']),
 ('Permissions & License', ['/policy/permission.html']),
 ('Privacy Policy', ['/policy/privacy_policy.html']),
 ('Terms of Use', ['/policy/terms_of_use.html']),
 ('Search and Browse\n      \t  ▾\n', ['/ebooks/']),
 ('Book Search', ['/ebooks/']),
 ('Bookshelves', ['/ebooks/bookshelf/']),
 ('Frequently Downloaded', ['/browse/scores/top']),
 ('Offline Catalogs', ['/ebooks/offline_catalogs.html']),
 ('Help\n          ▾\n', ['/help/']),
 ('All help topics →', ['/help/']),
 ('Copyright Procedures', ['/help/copyright.html']),
 ('Errata, Fixes and Bug Reports', ['/help/errata.html']),
 ('File Formats', ['/help/file_formats.html']),
 ('Frequently Asked Questions', ['/help/faq.html']),
 ('Policies →', ['/policy/']),
 ('Publi
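The hrefs above are relative to the site root, so they need a base url before they can be requested. A minimal sketch using the standard library's `urljoin` follows. The helper name `absolute_links` is hypothetical, and it assumes the hrefs were collected as plain strings (for example with `a.get('href')`, which returns `None` when an anchor has no href).

```python
from urllib.parse import urljoin

BASE = "https://www.gutenberg.org"

def absolute_links(hrefs, base=BASE):
    """Turn relative hrefs into absolute urls, dropping duplicates but keeping order."""
    seen = {}
    for href in hrefs:
        if href:  # skip anchors that have no href attribute
            seen[urljoin(base, href)] = None
    return list(seen)

print(absolute_links(["/about/", "/ebooks/", "/about/", None]))
# prints ['https://www.gutenberg.org/about/', 'https://www.gutenberg.org/ebooks/']
```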

## Objective 3: Functionalization

Following the DRY philosophy (Don't Repeat Yourself), I think it's time to functionalize these steps so we can scrape additional pages more easily.

In [14]:
def scrape(db, delay, urls):
    '''
    Pass scrape a list containing at least 1 url. This function scrapes a list of urls and adds
    the html to our local sqlite3 database.
    
    db:            a string representing the local database, e.g. "./gutenberg.org/gutenberg_texts.db"
    delay:         the number of seconds to delay between scrape requests
    urls:          a Python list of url strings to be scraped
    
    Returns a dict that maps each scraped url to its soup object and scrape date.
    '''
    
    import requests
    from bs4 import BeautifulSoup
    import datetime
    import time
    import sqlite3
    
    values = {}
    # Connect once, using the db argument rather than a hard-coded path
    conn = sqlite3.connect(db)
    c = conn.cursor()
    for url in urls:
        try:
            page = requests.get(url)
        except Exception as scrape_except:
            print(f"Page could not be scraped at {url}:\n{scrape_except}")
            time.sleep(delay)  # still wait so the next request respects the delay
            continue           # nothing to parse or store for this url
        
        soup = BeautifulSoup(page.content, "html.parser")
        date = datetime.datetime.now()
        server_loc = "/".join(url.split("/")[3:-1])
        content = soup.prettify()
        
        db_error = None
        try:
            # str(date) stores the timestamp as text to match the TEXT column
            c.execute('''INSERT INTO webpages VALUES(?,?,?,?);''', (server_loc, url, content, str(date)))
            conn.commit()
        except Exception as exception:
            # Keep a reference: Python clears the "as" name after the except block
            db_error = exception
        
        values[url] = {"soup": soup, "date": date}
        
        print(f"page scraped at {url}")
        if db_error is not None:
            print(f"Database add failed with the following exception:\n{db_error}.")
        time.sleep(delay)
        print(f"Waited {delay} seconds")
    conn.close()
    return values

def plinks(soup):
    '''
    Prints all of the links found in a soup object
    
    '''
    pprint.pprint([(a.text, a.get_attribute_list('href')) for a in soup.find_all('a')])
    return

def dbshow(db):
    '''
    Show the full contents of the webpages table in a local sqlite3 database
    
    db:            a string representing the local database, e.g. "./gutenberg.org/gutenberg_texts.db"
    
    '''
    
    conn = sqlite3.connect(db)
    c = conn.cursor()
    c.execute("SELECT * FROM webpages")
    rows = c.fetchall()
    conn.close()  # release the database file once the rows are in memory
    return rows
    

In [15]:
# Test the scrape function

urls = ["https://www.gutenberg.org/policy/robot_access.html#how-to-get-all-ebook-files",
            "https://www.gutenberg.org/policy/robot_access.html"]

scrape(db, 5, urls)

page scraped at https://www.gutenberg.org/policy/robot_access.html#how-to-get-all-ebook-files
Waited 5 seconds
page scraped at https://www.gutenberg.org/policy/robot_access.html
Waited 5 seconds


{'https://www.gutenberg.org/policy/robot_access.html#how-to-get-all-ebook-files': {'soup': <!DOCTYPE html>
  
  <html class="client-nojs" dir="ltr" lang="en">
  <head>
  <meta charset="utf-8"/>
  <title>Robot Access to Pages | Project Gutenberg</title>
  <link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>
  <link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>
  <link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>
  <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">
  <meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>
  <meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>
  <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
  <link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">
  <meta content="Project Gutenb
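Note that the two test urls differ only by their `#fragment`, which is resolved by the browser rather than the server, so the same page was requested twice. One possible guard, sketched here with the hypothetical helper `unique_pages`, is to strip fragments and drop duplicates before passing the list to a scraper.

```python
from urllib.parse import urldefrag

def unique_pages(urls):
    """Strip #fragments and drop urls that point at the same page, keeping order."""
    return list(dict.fromkeys(urldefrag(u).url for u in urls))

urls = ["https://www.gutenberg.org/policy/robot_access.html#how-to-get-all-ebook-files",
        "https://www.gutenberg.org/policy/robot_access.html"]

print(unique_pages(urls))
# prints ['https://www.gutenberg.org/policy/robot_access.html']
```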

In [16]:
# Test the dbshow function
dbshow(db)

[('/',
  'https://www.gutenberg.org',
  '<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Free eBooks | Project Gutenberg\n  </title>\n  <link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>\n  <link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>\n  <link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>\n  <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">\n   <meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>\n   <meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>\n   <link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">\n    <link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">\n     <meta content="Project Gutenberg" property="og:title"/>\n     <
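Dumping every row prints the full html of each page, which gets unwieldy quickly. As an alternative sketch, a summary query can report the length of each page's content instead of the content itself. The helper name `dbsummary` is hypothetical, it takes an open connection rather than a path so it can be tried on an in-memory database, and the demo row uses placeholder values.

```python
import sqlite3

def dbsummary(conn):
    """Return (server_loc, url, content length, scrapedate) for every stored page."""
    return conn.execute(
        "SELECT server_loc, url, length(content), scrapedate FROM webpages;").fetchall()

# Demo against an in-memory copy of the table layout used in this post
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE webpages(server_loc TEXT, url TEXT, content TEXT, scrapedate TEXT);")
demo.execute(
    "INSERT INTO webpages VALUES(?,?,?,?);",
    ("/", "https://www.gutenberg.org", "<html></html>", "2024-01-01 00:00:00"))

summary = dbsummary(demo)
print(summary)
# prints [('/', 'https://www.gutenberg.org', 13, '2024-01-01 00:00:00')]
```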

In [17]:
# Test the plinks function
plinks(soup)

[('\n\n', ['/']),
 ('About\n          ▾\n', ['/about/']),
 ('About Project Gutenberg', ['/about/']),
 ('Collection Development', ['/policy/collection_development.html']),
 ('Contact Us', ['/about/contact_information.html']),
 ('History & Philosophy', ['/about/background/']),
 ('Permissions & License', ['/policy/permission.html']),
 ('Privacy Policy', ['/policy/privacy_policy.html']),
 ('Terms of Use', ['/policy/terms_of_use.html']),
 ('Search and Browse\n      \t  ▾\n', ['/ebooks/']),
 ('Book Search', ['/ebooks/']),
 ('Bookshelves', ['/ebooks/bookshelf/']),
 ('Frequently Downloaded', ['/browse/scores/top']),
 ('Offline Catalogs', ['/ebooks/offline_catalogs.html']),
 ('Help\n          ▾\n', ['/help/']),
 ('All help topics →', ['/help/']),
 ('Copyright Procedures', ['/help/copyright.html']),
 ('Errata, Fixes and Bug Reports', ['/help/errata.html']),
 ('File Formats', ['/help/file_formats.html']),
 ('Frequently Asked Questions', ['/help/faq.html']),
 ('Policies →', ['/policy/']),
 ('Publi

With this project, we built a Python web scraper, added the scraped data to a local database, and functionalized the process so in the future we can easily scrape a set of pages by passing a list of urls to the scrape() function.

Stay tuned! Future posts will cover techniques for using the scraped data in our local database and, since we have only focused on text scraping so far, scraping images to accompany the scraped text data.

Enjoy!

~Kevin L. Freeman