# Activity 9: Top 100 ebooks' name extraction from Gutenberg.org

## What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize and archive cultural works and to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the **oldest digital library.** The project releases books that have entered the public domain, which can be freely read or downloaded in various electronic formats.

## What is this activity all about?
* **This activity scrapes Project Gutenberg's Top 100 ebooks page (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and regular expressions to identify the Top 100 ebook file numbers.**
* **You can use those book ID numbers to download the books to your local drive if you want.**

### Import the necessary libraries, including regex and BeautifulSoup

In [5]:
import urllib.request, urllib.parse, urllib.error # standard-library urllib modules that can place HTTP requests and receive data
import requests # Requests library, which avoids dealing with HTTP methods at a lower level
from bs4 import BeautifulSoup # HTML parser package that builds a tree of all tags and markup within the page
import ssl # provides access to Transport Layer Security encryption and peer authentication facilities
import re # provides full support for Perl-like regular expressions

### Ignore SSL errors (this code will be given)

In [6]:
# Use SSL library to ignore any SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
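
Note that `requests.get` does not accept an `ssl.SSLContext`, so the `ctx` object above only has an effect when the page is fetched with `urllib`. A minimal sketch of how it could be passed along, assuming a hypothetical helper named `fetch_html` that is not part of the original activity:

```python
import ssl
import urllib.request

# Same context as in the cell above: certificate checks are switched off
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def fetch_html(url, context=ctx):
    """Hypothetical helper: fetch a page with urllib using the given SSL context."""
    with urllib.request.urlopen(url, context=context) as resp:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# Example call (needs network access, so it is left commented out):
# html = fetch_html('https://www.gutenberg.org/browse/scores/top')
```

With Requests, the comparable switch is the `verify=False` argument of `requests.get`. Disabling verification is only reasonable for exercises like this one.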

### Read the HTML from the URL

In [11]:
response = requests.get('https://www.gutenberg.org/browse/scores/top') # request Project Gutenberg's Top 100 page

### Write a small function to check the status of web request

In [13]:
def status_check(r): # check whether the web request was successful
    if r.status_code==200: # HTTP status code 200 means "OK"
        print("Success!")
        return 1
    else: # any other status code is treated as a failure
        print("Failed!")
        return -1

In [15]:
status_check(response) # pass the response object returned by requests.get to status_check

Success!


1
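
As a variation on the check above, the function could return a boolean instead of `1`/`-1`, which reads more naturally in an `if` statement. This is only a sketch: `is_success` is a hypothetical name, and `SimpleNamespace` stands in for a real response object so that the example runs without a network call.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True when the response carries HTTP status 200 ("OK")."""
    return r.status_code == 200

# Stand-ins for real responses; only the status_code attribute is needed here
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

print(is_success(fake_ok))
print(is_success(fake_missing))
```

Requests also provides `Response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.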

### Decode the response and pass on to `BeautifulSoup` for HTML parsing

In [16]:
contents = response.content.decode(response.encoding) # decode the raw response bytes into a string using the response's encoding
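
`response.encoding` can be `None` when the server does not declare a character set, and `bytes.decode(None)` raises a `TypeError`. A small sketch of a fallback, assuming UTF-8 is an acceptable default (`decode_body` is a hypothetical helper, shown here on plain bytes rather than a live response):

```python
def decode_body(raw, encoding=None):
    """Decode raw bytes, falling back to UTF-8 when no encoding is given."""
    return raw.decode(encoding or "utf-8", errors="replace")

print(decode_body("café".encode("utf-8")))        # no encoding supplied
print(decode_body(b"Top 100", encoding="ascii"))  # explicit encoding
```

Requests performs a similar decoding step itself when you read `response.text`.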

In [17]:
soup = BeautifulSoup(contents, 'html.parser') # parse the decoded HTML with Python's built-in html.parser and store the tree in soup

### Find all the anchor (`a`) tags and store their _href_ values in a list of links. Check what the list looks like by printing the first 30 elements

In [18]:
lst_links=[] # empty list to hold all the links (href values) in the HTML page

In [19]:
# Find all the anchor tags and store their href values in the list of links
for link in soup.find_all('a'): # loop over every <a> tag in the HTML page
    #print(link.get('href'))
    lst_links.append(link.get('href')) # append the href value (None if the tag has no href) to lst_links

In [21]:
lst_links[:30] # print out the first 30

['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 '#books-last1',
 '#authors-last1',
 '#books-last7']
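
Some `a` tags have no `href` attribute, in which case `link.get('href')` returns `None`. A compact sketch that skips those tags, using a small inline HTML string (made up for illustration) instead of the live page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page
sample_html = """
<a href="/ebooks/84">Frankenstein</a>
<a name="top">no href here</a>
<a href="#books-last1">Top 100 yesterday</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
sample_links = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(sample_links)
```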

### Use regular expressions to find the numeric digits in these links. <br>These are the file numbers for the Top 100 books.

#### Initialize empty list to hold the file numbers

In [22]:
booknum=[] # empty list to hold all the file numbers

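* Elements 19 to 118 of the original list of links are expected to hold the Top 100 ebooks' numbers. These positions depend on the page layout: in the listing above, elements 27 to 29 are in-page anchors such as `#books-last1`, and their digits are matched as well.
* Loop over the appropriate range and use regex to find the numeric digits in the link (href) string.
* Hint: Use the `findall()` method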

In [25]:
for i in range(19,119): # iterate over elements 19 to 118 of the original list of links
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall(r'[0-9]+',link) # list of all runs of digits in the link
    if len(n)==1: # keep the link only if it contains exactly one number
        # Append the file number, cast to an integer, to the booknum list
        booknum.append(int(n[0]))
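
Hard-coded positions break whenever the page layout changes. A sketch of an alternative that selects links by their shape instead, assuming book links have the form `/ebooks/<number>` (the helper name `extract_book_ids` and the sample list are illustrative, not part of the original activity):

```python
import re

def extract_book_ids(links, limit=100):
    """Collect up to `limit` book IDs from hrefs shaped like /ebooks/<number>."""
    pattern = re.compile(r"^/ebooks/(\d+)$")
    ids = []
    for href in links:
        if not href:  # skip None and empty strings
            continue
        m = pattern.match(href.strip())
        if m:
            ids.append(int(m.group(1)))
        if len(ids) == limit:
            break
    return ids

# Made-up sample mixing navigation links, in-page anchors, and book links
sample = ["/about/", "#books-last1", None, "/ebooks/",
          "/ebooks/84", " /ebooks/1342 ", "/ebooks/bookshelf/"]
print(extract_book_ids(sample))
```

Because anchors such as `#books-last1` do not match the pattern, their digits are not mistaken for file numbers.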

#### Print the file numbers

In [26]:
print("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70) # header line followed by a rule of dashes
print(booknum) # print the file numbers of the top 100 ebooks (booknum list)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1, 1, 7, 7, 30, 30, 84, 1342, 25344, 41, 345, 11, 1952, 2542, 174, 5200, 2701, 43, 1080, 1661, 98, 219, 1260, 64317, 844, 408, 46, 160, 1232, 1727, 205, 2591, 76, 4980, 16328, 1250, 23, 6130, 2814, 3825, 55, 2554, 74, 3207, 2852, 66609, 7370, 4300, 215, 4517, 768, 66613, 514, 996, 203, 2148, 16, 66612, 902, 66608, 1400, 45, 2500, 66607, 42884, 1184, 10007, 2600, 1497, 120, 3600, 19942, 5740, 779, 32449, 15399, 36, 58585, 829, 209, 1429, 1251, 22381, 66601, 512, 852, 20203, 11030, 135, 158, 35, 49018, 1, 1, 7, 7, 30, 30, 84, 1342, 25344, 41, 345, 11, 1952, 2542, 174, 5200, 2701, 43, 1080, 1661, 98, 219, 1260, 64317, 844, 408, 46, 160, 1232, 1727, 205, 2591, 76, 4980, 16328, 1250, 23, 6130, 2814, 3825, 55, 2554, 74, 3207, 2852, 66609, 7370, 4300, 215, 4517, 768, 66613, 514, 996, 203, 2148, 16, 66612, 902, 66608, 1400, 45, 2500, 66607, 42884, 1184, 

### What does the `soup` object's text look like? Use the `.text` attribute and print only the first 2000 characters (i.e. do not print the whole thing; it is long).

You will notice a lot of empty spaces and blank lines here and there. Ignore them. They are part of the HTML page markup and its whimsical nature!

In [28]:
print(soup.text[:2000]) # print the first 2000 characters of the soup object's text





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright Procedures
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Donation







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2021-10-26178310
last 7 days1168075
last 30 days4889924



Top 100 EBooks yesterd

### Search the text extracted from the `soup` object (using regular expressions) to find the names of the Top 100 ebooks (yesterday's ranking)

In [29]:
lst_titles_temp=[] # empty temporary list to hold the ebook name strings

#### Create a starting index. It should point at the text _"Top 100 EBooks yesterday"_. Hint: Use the `splitlines()` method of `soup.text`. It splits the text of the `soup` object into lines.

In [32]:
# start_idx is the index of the line that reads 'Top 100 EBooks yesterday'
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday') # split soup's text into lines and locate the heading

#### Loop over the next 100 lines and add their strings to this temporary list. Hint: `splitlines()` method

In [34]:
for i in range(100): # loop over the 100 books
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i]) # append the strings of the next 100 lines to lst_titles_temp
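
Calling `soup.text.splitlines()` inside the loop re-splits the whole page on every iteration. A sketch that splits once and slices instead, using a short made-up text in place of `soup.text` (the offset of 2 mirrors the loop above, and `lines_after` is a hypothetical helper):

```python
def lines_after(text, header, count, offset=2):
    """Return `count` lines starting `offset` lines after the `header` line."""
    lines = text.splitlines()    # split the text only once
    start = lines.index(header)  # raises ValueError if the header is missing
    return lines[start + offset : start + offset + count]

# Made-up text standing in for soup.text
sample_text = ("intro\nTop 100 EBooks yesterday\n\n"
               "Book A (10)\nBook B (9)\nBook C (8)\nfooter")
print(lines_after(sample_text, "Top 100 EBooks yesterday", 3))
```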

#### Use a regular expression to extract only the text from the name strings and append it to an empty list
* Hint: Use `match` and `span` to find the indices and use them

In [35]:
lst_titles=[] # empty list to hold the cleaned titles
for i in range(100): # loop over the 100 name strings
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span() # span of the leading run of letters and spaces (matching stops at the first other character)
    lst_titles.append(lst_titles_temp[i][id1:id2]) # append the matched slice to the lst_titles list
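
The pattern `^[a-zA-Z ]*` stops at the first character that is not a letter or a space, so titles containing apostrophes, periods, or semicolons are cut short. A sketch of an alternative that removes only a trailing count in parentheses, assuming each line ends with something like `(1234)` (this format, the counts, and the sample strings are assumptions for illustration):

```python
import re

def clean_title(line):
    """Strip a trailing '(digits)' group and surrounding whitespace from a line."""
    return re.sub(r"\s*\(\d+\)\s*$", "", line).strip()

# Made-up sample lines; the numbers in parentheses are placeholders
samples = [
    "Alice's Adventures in Wonderland by Lewis Carroll (1234)",
    "The Great Gatsby by F. Scott Fitzgerald (987)",
    "Beowulf: An Anglo-Saxon Epic Poem (55)",
]
for s in samples:
    print(clean_title(s))
```

Lines without a trailing count pass through unchanged, so the helper is safe to apply to every line.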

#### Print the list of titles

In [36]:
for l in lst_titles: # for loop to print the list of titles in the lst_titles list that was populated above
    print(l)

Top 
Top 
Top 
Top 


Top 

Frankenstein
Pride and Prejudice by Jane Austen 
The Scarlet Letter by Nathaniel Hawthorne 
The Legend of Sleepy Hollow by Washington Irving 
Dracula by Bram Stoker 
Alice
The Yellow Wallpaper by Charlotte Perkins Gilman 
A Doll
The Picture of Dorian Gray by Oscar Wilde 
Metamorphosis by Franz Kafka 
Moby Dick
The Strange Case of Dr
A Modest Proposal by Jonathan Swift 
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
A Tale of Two Cities by Charles Dickens 
Heart of Darkness by Joseph Conrad 
Jane Eyre
The Great Gatsby by F
The Importance of Being Earnest
The Souls of Black Folk by W
A Christmas Carol in Prose
The Awakening
The Prince by Niccol
The Odyssey by Homer 
Walden
Grimms
Adventures of Huckleberry Finn by Mark Twain 
Old Granny Fox by Thornton W
Beowulf
Anthem by Ayn Rand 
Narrative of the Life of Frederick Douglass
The Iliad by Homer 
Dubliners by James Joyce 
Pygmalion by Bernard Shaw 
The Wonderful Wizard of Oz by L
Crime and Punishment by