# Lesson 7 Activity 1: Top 100 ebooks' name extraction from Gutenberg.org

## What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the **oldest digital library**. This longest-established ebook project releases books that have entered the public domain, which can be freely read or downloaded in various electronic formats.

## What is this activity all about?
* **This activity scrapes Project Gutenberg's Top 100 ebooks page (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and regular expressions to identify the Top 100 ebook file numbers.**
* **You can use those book ID numbers to download the books to your local drive if you want.**

### Import the necessary libraries, including regex and BeautifulSoup

In [112]:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re
import pandas as pd

### Ignore SSL errors (this code will be given)

In [2]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
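Note that `requests.get` does not take an `ssl.SSLContext`; the `ctx` object above only has an effect when it is passed to `urllib`. The following is a minimal sketch of how it could be used, assuming you fetch the page with `urllib.request.urlopen` instead of `requests`. The helper names `make_unverified_context` and `fetch_with_context` are hypothetical and not part of the activity.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips hostname and certificate checks."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # must be disabled before CERT_NONE is set
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_with_context(url, ctx):
    """Fetch a URL with urllib, passing the context explicitly (hypothetical helper)."""
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read()


ctx = make_unverified_context()
# Example call (requires network access, so it is left commented out):
# html_bytes = fetch_with_context('https://www.gutenberg.org/browse/scores/top', ctx)
```

Disabling verification is acceptable for a classroom exercise, but it removes the protection that certificates provide, so it should not be carried over to production code.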

### Read the HTML from the URL

In [45]:
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)

### Write a small function to check the status of the web request

In [4]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [24]:
status_check(response)

Success!


1
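`status_check` treats only the status code 200 as success. As a possible variation (a sketch, not the activity's required solution), the check can accept any 2xx code and return a boolean, which is easier to use in an `if` statement. The name `status_ok` is hypothetical, and the stand-in objects below only mimic the `status_code` attribute of a real response.

```python
from types import SimpleNamespace


def status_ok(r):
    """Return True for any 2xx status code, False otherwise."""
    return 200 <= r.status_code < 300


# Stand-ins for response objects so the sketch runs without network access
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

print(status_ok(fake_ok), status_ok(fake_missing))   # prints: True False
```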

In [27]:
"TR" in str(response.content)

False

In [25]:
print(response.content)



### Decode the response and pass on to `BeautifulSoup` for HTML parsing

In [46]:
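# Fall back to UTF-8 if the server did not declare an encoding (response.encoding is None)
contents = response.content.decode(response.encoding or 'utf-8')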

In [47]:
soup = BeautifulSoup(contents, 'html.parser')

### Find all the links (`<a>` tags) and store their _href_ values in a list. Check what the list looks like - print the first 30 elements

In [48]:
# Empty list to hold all the http links in the HTML page
lst_links=[]

In [49]:
# Find all the <a> tags and store their href values in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

In [50]:
print("\n ~~~ ".join(lst_links))

/
 ~~~ /about/
 ~~~ /about/
 ~~~ /policy/collection_development.html
 ~~~ /about/contact_information.html
 ~~~ /about/background/
 ~~~ /policy/permission.html
 ~~~ /policy/privacy_policy.html
 ~~~ /policy/terms_of_use.html
 ~~~ /ebooks/
 ~~~ /ebooks/
 ~~~ /ebooks/bookshelf/
 ~~~ /browse/scores/top
 ~~~ /ebooks/offline_catalogs.html
 ~~~ /help/
 ~~~ /help/
 ~~~ /help/copyright.html
 ~~~ /help/errata.html
 ~~~ /help/file_formats.html
 ~~~ /help/faq.html
 ~~~ /policy/
 ~~~ /help/public_domain_ebook_submission.html
 ~~~ /help/submitting_your_own_work.html
 ~~~ /help/mobile.html
 ~~~ /attic/
 ~~~ /donate/
 ~~~ /donate/
 ~~~ #books-last1
 ~~~ #authors-last1
 ~~~ #books-last7
 ~~~ #authors-last7
 ~~~ #books-last30
 ~~~ #authors-last30
 ~~~ /ebooks/1342
 ~~~ /ebooks/84
 ~~~ /ebooks/6133
 ~~~ /ebooks/11
 ~~~ /ebooks/1661
 ~~~ /ebooks/2701
 ~~~ /ebooks/174
 ~~~ /ebooks/98
 ~~~ /ebooks/64317
 ~~~ /ebooks/4300
 ~~~ /ebooks/43
 ~~~ /ebooks/345
 ~~~ /ebooks/57775
 ~~~ /ebooks/1260
 ~~~ /ebooks/1952
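The hrefs printed above are relative (`/ebooks/1342`) or in-page anchors (`#books-last1`). If you later want to visit or download a book, you need absolute URLs. A minimal sketch using the standard library's `urljoin`, with a few hrefs copied from the output above as sample data:

```python
from urllib.parse import urljoin

# The page that was scraped; relative links are resolved against it
base_url = 'https://www.gutenberg.org/browse/scores/top'

# A few hrefs copied from the list printed above
sample_hrefs = ['/about/', '#books-last1', '/ebooks/1342']

absolute = [urljoin(base_url, href) for href in sample_hrefs]
for url in absolute:
    print(url)
# https://www.gutenberg.org/about/
# https://www.gutenberg.org/browse/scores/top#books-last1
# https://www.gutenberg.org/ebooks/1342
```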


### Use regular expressions to find the numeric digits in these links. <br>These are the file numbers for the Top 100 books.

#### Initialize empty list to hold the file numbers

In [51]:
booknum=[]

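* The range 19 to 118 assumes that the Top 100 ebook links occupy those positions in the list of links. Check this against the list printed above: there, the first book link (`/ebooks/1342`) sits at index 33, and the anchors `#books-last1` to `#authors-last30` at indices 27 to 32 also contain digits. Adjust the range to the page you actually scraped.
* Loop over the appropriate range and use a regex to find the numeric digits in the link (href) string.
* Hint: Use the `findall()` method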

In [53]:
for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast to an integer
        booknum.append(int(n[0]))

In [54]:
booknum

[1,
 1,
 7,
 7,
 30,
 30,
 1342,
 84,
 6133,
 11,
 1661,
 2701,
 174,
 98,
 64317,
 4300,
 43,
 345,
 57775,
 1260,
 1952,
 2591,
 65601,
 20228,
 16,
 2600,
 1184,
 46,
 2542,
 205,
 5200,
 844,
 74,
 1232,
 65605,
 65597,
 65608,
 45,
 6130,
 25344,
 1400,
 2554,
 2852,
 65606,
 902,
 514,
 47629,
 996,
 76,
 27827,
 158,
 35899,
 219,
 135,
 55,
 5739,
 244,
 58585,
 1497,
 30254,
 203,
 4014,
 1080,
 120,
 863,
 65604,
 43453,
 766,
 5740,
 28054,
 308,
 236,
 100,
 1727,
 1399,
 829,
 1998,
 36,
 2148,
 3600,
 768,
 16328,
 65573,
 730,
 35,
 20203,
 2814,
 815,
 4363,
 33283,
 65596,
 3090]
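The list above begins with 1, 1, 7, 7, 30, 30. These digits come from the anchors `#books-last1` to `#authors-last30`, not from book links, and because of the fixed index range the list holds fewer than 100 book numbers. One way to avoid depending on positions is to keep only hrefs that look exactly like `/ebooks/<digits>`. The sketch below assumes that the first 100 such links on the page belong to yesterday's ranking, as the link order printed earlier suggests. The function name `extract_book_ids` and the sample list are illustrative.

```python
import re


def extract_book_ids(links, limit=100):
    """Collect numeric IDs from hrefs of the exact form /ebooks/<digits>."""
    ids = []
    for href in links:
        if href is None:                 # <a> tags without an href yield None
            continue
        m = re.fullmatch(r'/ebooks/(\d+)', href.strip())
        if m:
            ids.append(int(m.group(1)))
        if len(ids) == limit:            # stop once enough IDs are collected
            break
    return ids


# Sample hrefs in the same shapes as the list printed earlier
sample_links = ['/donate/', '#books-last1', '#authors-last30',
                '/ebooks/', '/ebooks/1342', '/ebooks/84', None, '/ebooks/6133']

print(extract_book_ids(sample_links))   # prints: [1342, 84, 6133]
```

With the real data, the call would be `extract_book_ids(lst_links)`.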

#### Print the file numbers

In [14]:
print("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200, 43, 16328, 76, 74, 1952, 6130, 2591, 1661, 41, 174, 23, 1260, 1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158, 44881, 1322, 205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901, 28054, 829, 33, 2814, 4300, 100, 55, 160, 1404, 786, 58267, 3600, 19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825, 779, 58262, 203, 730, 20203, 35, 1250, 45, 161, 30360, 7370, 58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270, 521, 58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]


### What does the `soup` object's text look like? Use the `.text` attribute and print only the first 2000 characters (i.e. do not print the whole thing, it is long).

You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature!

In [55]:
print(soup.text[:2000])





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright Procedures
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Donation







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2021-06-13129179
last 7 days980206
last 30 days4418175



Top 100 EBooks yesterda

### Search the text extracted from the `soup` object (using regular expressions) to find the names of the Top 100 ebooks (yesterday's ranking)

In [74]:
# Temp empty list of Ebook names
lst_titles_temp=[]

In [57]:
type(soup.text)

str

#### Create a starting index. It should point at the text _"Top 100 Ebooks yesterday"_. Hint: Use `splitlines()` method of the `soup.text`. It splits the lines of the text of the `soup` object.

In [75]:
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')

#### Loop 1-100 to add the strings of the next 100 lines to this temporary list. Hint: `splitlines()` method

In [76]:
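lines = soup.text.splitlines()  # split once instead of on every iteration
# The offset of 8 skips the lines between the heading and the first title (layout-dependent)
for i in range(100):
    lst_titles_temp.append(lines[start_idx+8+i])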

In [102]:
lst_titles_temp

['Pride and Prejudice by Jane Austen (1284)',
 'Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (723)',
 'The Extraordinary Adventures of Arsene Lupin, Gentleman-Burglar by Maurice Leblanc (711)',
 "Alice's Adventures in Wonderland by Lewis Carroll (622)",
 'The Adventures of Sherlock Holmes by Arthur Conan Doyle (615)',
 'Moby Dick; Or, The Whale by Herman Melville (539)',
 'The Picture of Dorian Gray by Oscar Wilde (512)',
 'A Tale of Two Cities by Charles Dickens (493)',
 'The Great Gatsby by F. Scott  Fitzgerald (389)',
 'Ulysses by James Joyce (379)',
 'The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson (372)',
 'Dracula by Bram Stoker (351)',
 'Le jardin des supplices by Octave Mirbeau (345)',
 'Jane Eyre: An Autobiography by Charlotte Brontë (342)',
 'The Yellow Wallpaper by Charlotte Perkins Gilman (338)',
 "Grimms' Fairy Tales by Jacob Grimm and Wilhelm Grimm (338)",
 'The Trail of Black Hawk by Paul G.  Tomlinson (336)',
 'Noli Me Tan
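The fixed offset of 8 in the loop above depends on how many lines the page places between the heading and the first title. A sketch of an alternative that skips blank lines instead of counting them is shown below. It assumes that only blank lines separate the heading from the titles; if the page puts other text there, that text would be picked up too. The function name `lines_after_heading` and the sample text are made up for illustration.

```python
def lines_after_heading(text, heading, count):
    """Return the first `count` non-blank lines that follow `heading`."""
    lines = text.splitlines()
    start = lines.index(heading)        # raises ValueError if the heading is absent
    non_blank = [ln.strip() for ln in lines[start + 1:] if ln.strip()]
    return non_blank[:count]


# Small stand-in for soup.text, using titles from the output above
sample_text = (
    "Downloaded Books\n"
    "\n"
    "Top 100 EBooks yesterday\n"
    "\n"
    "\n"
    "Pride and Prejudice by Jane Austen (1284)\n"
    "Dracula by Bram Stoker (351)\n"
    "\n"
    "Top 100 EBooks last 7 days\n"
)

titles = lines_after_heading(sample_text, 'Top 100 EBooks yesterday', 2)
print(titles)
```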

#### Use a regular expression to extract only the text from the name strings and append it to an empty list
* Hint: Use `match` and `span` to find indices and use them

In [100]:
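lst_titles=[]
for entry in lst_titles_temp:
    # Span of the leading run of letters and spaces; it stops at the first
    # digit or punctuation mark, so "Frankenstein; Or, ..." yields "Frankenstein"
    id1,id2=re.match(r'^[a-zA-Z ]*',entry).span()
    # All runs of digits in the string; only the first one is used
    id3 = re.findall(r'[0-9]+',entry)
    kv = { entry[id1:id2] : int(id3[0]) }
    lst_titles.append(kv)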

In [118]:
# Define a dictionary with an empty list for each column
book_dict={'Title':[],'ID':[]}

for i in range(len(lst_titles_temp)):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    id3 = re.findall('[0-9]+',lst_titles_temp[i])
    book_dict['Title'].append(lst_titles_temp[i][id1:id2])
    # First run of digits found in the string (kept as a string here)
    book_dict['ID'].append(id3[0])

df= pd.DataFrame(book_dict)
df

| | Title | ID |
|---|---|---|
| 0 | Pride and Prejudice by Jane Austen | 1284 |
| 1 | Frankenstein | 723 |
| 2 | The Extraordinary Adventures of Arsene Lupin | 711 |
| 3 | Alice | 622 |
| 4 | The Adventures of Sherlock Holmes by Arthur Co... | 615 |
| ... | ... | ... |
| 93 | Anthem by Ayn Rand | 129 |
| 94 | Meditations by Emperor of Rome Marcus Aurelius | 129 |
| 95 | Leviathan by Thomas Hobbes | 126 |
| 96 | The Awakening | 125 |


In [110]:
(lst_titles[0].keys())

"dict_keys(['Pride and Prejudice by Jane Austen '])"
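The pattern `^[a-zA-Z ]*` stops at the first character that is not a letter or a space, so a title containing a semicolon, apostrophe, period, or accented letter gets cut short. The number in parentheses also appears to be the count shown on the page rather than the book's file number (compare 1284 with the link `/ebooks/1342` for the first entry). The sketch below is one possible alternative, not the activity's intended solution: it splits each entry into title, author, and count. The function name `parse_entry` and the key names are hypothetical. It assumes that every entry ends in ` (digits)` and that the last ` by ` separates the title from the author.

```python
import re


def parse_entry(entry):
    """Split 'Title by Author (123)' into its parts; return None if no match."""
    m = re.match(r'^(.*) \((\d+)\)$', entry.strip())
    if m is None:
        return None
    name, count = m.group(1), int(m.group(2))
    # rpartition splits at the LAST ' by '; sep is '' when there is no author
    title, sep, author = name.rpartition(' by ')
    if not sep:
        title, author = name, ''
    return {'Title': title, 'Author': author, 'Count': count}


# Entries copied from the list printed earlier in this activity
print(parse_entry("Alice's Adventures in Wonderland by Lewis Carroll (622)"))
print(parse_entry('Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (723)'))
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to build a table with one column per key.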

#### Print the list of titles

In [85]:
for l in lst_titles:
    print(l)

Pride and Prejudice by Jane Austen 
Frankenstein
The Extraordinary Adventures of Arsene Lupin
Alice
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
Moby Dick
The Picture of Dorian Gray by Oscar Wilde 
A Tale of Two Cities by Charles Dickens 
The Great Gatsby by F
Ulysses by James Joyce 
The Strange Case of Dr
Dracula by Bram Stoker 
Le jardin des supplices by Octave Mirbeau 
Jane Eyre
The Yellow Wallpaper by Charlotte Perkins Gilman 
Grimms
The Trail of Black Hawk by Paul G
Noli Me Tangere by Jos
Peter Pan by J
War and Peace by graf Leo Tolstoy 
The Count of Monte Cristo
A Christmas Carol in Prose
A Doll
Walden
Metamorphosis by Franz Kafka 
The Importance of Being Earnest
The Adventures of Tom Sawyer
The Prince by Niccol
Indians of Lassen Volcanic National Park and Vicinity by Paul E
Tales of the Wild and the Wonderful 
The Great Green Diamond by Inspector Stark 
Anne of Green Gables by L
The Iliad by Homer 
The Scarlet Letter by Nathaniel Hawthorne 
Great Expectations by Char