# Data Wrangling

## Extracting the Top 100 eBooks from Project Gutenberg

This example uses BeautifulSoup to parse the HTML and a regular expression to identify the file numbers of the Top 100 eBooks.

In [3]:
# Import the necessary libraries, including regex and BeautifulSoup

In [4]:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

In [5]:
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)

In [6]:
# Function to check the status of the web request
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [7]:
# Check the status of the response
status_check(response)

Success!


1
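The 1/-1 return values work, but a boolean is easier to use in an `if` statement. Below is a minimal sketch of that variation, not part of the original notebook. The name `is_ok` and the stand-in response objects are hypothetical; the stand-ins only mimic the `status_code` attribute so the example runs without a network call.

```python
from http import HTTPStatus
from types import SimpleNamespace


def is_ok(response):
    """Return True when the response carries HTTP status 200 (OK)."""
    return response.status_code == HTTPStatus.OK


# Hypothetical stand-ins for real responses (assumption: only status_code matters here)
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

print(is_ok(fake_ok))       # True
print(is_ok(fake_missing))  # False
```

With a real `requests` response, the same function could be called as `is_ok(response)` before any parsing is attempted.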

In [8]:
# Decode the response and pass it on to BeautifulSoup for HTML parsing
# (fall back to UTF-8 if the server does not declare an encoding)
contents = response.content.decode(response.encoding or 'utf-8')
soup = BeautifulSoup(contents, 'html.parser')

In [9]:
# Find all the href tags and store them in the list of links

In [10]:
# Empty list to hold all the http links in the HTML page
lst_links=[]
# Find all the href tags and store them in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

In [11]:
# Check what the list looks like: print the first 30 elements
lst_links[:30]


['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 'pretty-pictures',
 '#books-last1',
 '#authors-last1']

In [12]:
# Initialize the empty list to hold the file numbers
booknum=[]

In [13]:
# Indices 19 to 118 in the original list of links are expected to hold the top 100 eBooks' numbers.

In [14]:
# Loop over the appropriate range and use a regex to find the numeric digits in each link href. Use the findall method.
for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast to an integer
        booknum.append(int(n[0]))

In [15]:
# Print the file numbers
print("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1, 1, 7, 7, 30, 30, 26184, 25558, 84, 2701, 1513, 1342, 11, 64317, 2542, 100, 145, 2641, 1952, 37106, 67979, 16389, 76, 1080, 345, 174, 25344, 844, 6761, 5200, 394, 43, 1400, 2160, 6593, 4085, 5197, 2554, 1259, 408, 50150, 1260, 57426, 75279, 3207, 1232, 98, 1727, 41445, 7370, 2000, 205, 6130, 1661, 75281, 1497, 768, 23, 1184, 16328, 15464, 219, 1998, 28054, 75282, 15399, 16119, 46, 2650, 19942, 4300, 132, 2591, 2600, 42324, 75285, 45, 55, 2814, 41, 75283, 3296, 45502, 4363, 74, 36034, 2148, 244, 996, 5740, 27761]
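The output above starts with 1, 1, 7, 7, 30, 30 and holds fewer than 100 entries. A likely cause is that the fixed index range 19 to 118 also covers in-page anchors such as `'#books-last1'` and `'#authors-last1'` (indices 28 and 29 in the list of links), and the pattern `[0-9]+` picks up the digits in those anchors. A sketch of a less position-dependent approach follows: keep only hrefs of the form `/ebooks/<number>` and drop repeats. The function name, the `limit` parameter, and the sample hrefs are hypothetical, and the sketch assumes that the first distinct book links on the page belong to the first Top 100 list.

```python
import re

# Hypothetical sample of hrefs, shaped like the entries of lst_links
sample_links = ['/about/', '/ebooks/', '#books-last1', '/ebooks/84',
                '/ebooks/2701', '/ebooks/84', None, '/ebooks/author/65']

# Matches only links that point directly at a numbered eBook page
EBOOK_HREF = re.compile(r'^/ebooks/(\d+)$')


def extract_book_numbers(links, limit=100):
    """Collect up to `limit` distinct eBook numbers, in order of first appearance."""
    seen = set()
    numbers = []
    for href in links:
        if not href:                 # skip anchors that have no href attribute
            continue
        match = EBOOK_HREF.match(href.strip())
        if match is None:            # skip navigation links and in-page anchors
            continue
        num = int(match.group(1))
        if num not in seen:
            seen.add(num)
            numbers.append(num)
        if len(numbers) == limit:
            break
    return numbers


print(extract_book_numbers(sample_links))  # [84, 2701]
```

Calling `extract_book_numbers(lst_links)` on the real list would avoid relying on where the book links happen to sit in the page.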


In [16]:
# Use the .text attribute to print the first 2000 characters
print(soup.text[:2000])





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright How-To
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Ways to donate







To determine the ranking we count the times each file gets downloaded.
Both HTTP and FTP transfers are counted.
Only transfers from ibiblio.org are counted as we have no access to our mirrors log files.
Multiple downloads from the same IP address on the same day count as one download.
IP addresses that download more than 100 files a day are considered
robots and are not considered.
Books made out o

In [17]:
# Temp empty list of Ebook names
lst_titles_temp=[]

In [18]:
# Create a starting index. It should point at the line 'Top 100 EBooks yesterday'.
# The splitlines method of soup.text splits the text of the soup object into lines.
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')


In [19]:
# Split the text once, then add the 100 lines that follow the heading to this temporary list
lines = soup.text.splitlines()
for i in range(100):
    lst_titles_temp.append(lines[start_idx+2+i])
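The two cells above depend on the heading being present and on a fixed offset from it. `list.index` raises a `ValueError` when the heading is missing, and blank lines between entries end up in the result. Below is a minimal sketch of one way to guard against both, assuming the titles are the first non-empty lines after the heading. The helper name and the sample text are hypothetical.

```python
def lines_after_heading(text, heading, count):
    """Return up to `count` non-empty lines that follow `heading` in `text`."""
    lines = [ln.strip() for ln in text.splitlines()]
    try:
        start = lines.index(heading)
    except ValueError:
        # Heading not found: return an empty list instead of raising
        return []
    following = [ln for ln in lines[start + 1:] if ln]
    return following[:count]


# Hypothetical sample text standing in for soup.text
sample_text = ("Intro\nTop 100 EBooks yesterday\n\n"
               "Book A by X (10)\nBook B by Y (9)\n\nTop 100 Authors yesterday\n")

print(lines_after_heading(sample_text, 'Top 100 EBooks yesterday', 2))
# ['Book A by X (10)', 'Book B by Y (9)']
```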

In [20]:
# Use regex to extract only the leading letters and spaces from the name strings and append them to an empty list. Use match and span to find the indices
lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
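The pattern `^[a-zA-Z ]*` stops at the first character that is not a letter or a space, so any title or author name containing an apostrophe, period, comma, or digit is cut short. If the goal is only to drop a trailing count in parentheses, a narrower pattern may keep more of the title. The sketch below assumes each raw line looks like `<title> by <author> (<number>)`. The sample lines and their numbers are made up for illustration, and `clean_title` is a hypothetical helper.

```python
import re

# Hypothetical raw lines (assumed format: "<title> by <author> (<number>)")
sample_lines = [
    "Alice's Adventures in Wonderland by Lewis Carroll (12345)",
    "The Great Gatsby by F. Scott Fitzgerald (6789)",
    "Middlemarch by George Eliot (4321)",
]

# Matches a trailing "(digits)" group together with surrounding whitespace
TRAILING_COUNT = re.compile(r'\s*\(\d+\)\s*$')


def clean_title(line):
    """Remove a trailing "(number)" from a raw title line."""
    return TRAILING_COUNT.sub('', line).strip()


# The letters-and-spaces pattern truncates at the apostrophe
print(re.match('^[a-zA-Z ]*', sample_lines[0]).group())  # Alice

# Stripping only the trailing count keeps punctuation in the title
for line in sample_lines:
    print(clean_title(line))
```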

In [21]:
# Print the list of titles
for i in lst_titles:
    print(i)

Simple Sabotage Field Manual by United States

Frankenstein
Moby Dick
Romeo and Juliet by William Shakespeare 
Pride and Prejudice by Jane Austen 
Alice
The Great Gatsby by F
A Doll
The Complete Works of William Shakespeare by William Shakespeare 
Middlemarch by George Eliot 
A Room with a View by E
The Yellow Wallpaper by Charlotte Perkins Gilman 
Little Women
The Blue Castle
The Enchanted April by Elizabeth Von Arnim 
Adventures of Huckleberry Finn by Mark Twain 
A Modest Proposal by Jonathan Swift 
Dracula by Bram Stoker 
The Picture of Dorian Gray by Oscar Wilde 
The Scarlet Letter by Nathaniel Hawthorne 
The Importance of Being Earnest
The Adventures of Ferdinand Count Fathom 
Metamorphosis by Franz Kafka 
Cranford by Elizabeth Cleghorn Gaskell 
The Strange Case of Dr
Great Expectations by Charles Dickens 
The Expedition of Humphry Clinker by T
History of Tom Jones
The Adventures of Roderick Random by T
My Life 
Crime and Punishment by Fyodor Dostoyevsky 
Twenty years after by Ale