<a href="https://colab.research.google.com/github/manreddyr/Web_Scrapping_Projects_on_goodreads.com/blob/main/web_scrapping_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web scraping the top 100 best selling books of all time from the `goodreads.com` website using Python**

_Goodreads_ is the world’s largest site for readers and book recommendations. Launched in January 2007, it is an American social cataloging website and a subsidiary of Amazon that lets individuals search its database of books, annotations, quotes, and reviews. Users can sign up and register books to generate library catalogs and reading lists.

![img_banner](https://i.imgur.com/86OmQ5F.png)

The page 'https://www.goodreads.com/list/show/33934.Best_Selling_Books_of_All_Time' provides a list of books on Goodreads. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.

Here's an outline of the steps we'll follow:

- Download the webpage using Requests
- Parse the HTML source code using Beautiful Soup
- Extract the book title, author name, book URL, and ratings from the page
- Compile the extracted information into Python lists and dictionaries
- Extract and combine data from multiple pages
- Save the extracted information to a CSV file

By the end of the project, we'll create a CSV file in the following format:

[Book_title,Author_name,Book_url,Book_ratings]

**Example :**

[A Tale of Two Cities,Charles Dickens,https://www.goodreads.com/book/show/1953.A_Tale_of_Two_Cities,3.86 avg rating — 892967 ratings]


##  What is Web Scraping?

Web scraping lets us collect data from a source ourselves, in the format we want. The source may impose some limitations, but we keep greater control because we decide what data to scrape from it and how.
In this project, we work with the _best selling books_ data on goodreads.com.
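
To make the idea concrete, here is a minimal sketch that pulls link text and addresses out of a small, made-up HTML snippet. It uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without any installs. The snippet, the `TitleLinkParser` class, and the assumption that titles sit in `<a class="bookTitle">` elements are illustrative only.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; a real page is far larger.
SAMPLE_HTML = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1">First Book</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2">Second Book</a></td></tr>
</table>
"""

class TitleLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a class="bookTitle"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "bookTitle" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Beautiful Soup, used in the rest of this notebook, offers the same kind of lookup through `find` and `find_all` with far less code.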

![](https://imgur.com/ZsmNfrG.jpeg)

# Outline of the Project:

1. Understand and identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
2. Install and import the required libraries.
3. Download each list page and extract the book rows using Beautiful Soup.
4. Access the book data and build each book's URL.
5. Parse each book into 5 columns using helper functions.
6. Store the extracted data in a dictionary.
7. Compile all the data into a DataFrame using pandas and save it to a CSV file.

## Steps followed:
We have broken the work down into 3 steps:

1. First, we download each Goodreads list page from its URL.

2. We navigate the tags and scrape the book rows of each list one by one using the helper functions.

3. With the row tags in hand, we extract the details of each book into a dictionary by calling the helper functions, save the records to a DataFrame, and finally export them to a CSV file.

### Installing and importing required libraries.

In [None]:
# installing necessary libraries - requests and beautifulsoup
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
# creating list for urls
list_of_urls = [
    '181599.Best_books_of_November_2022',
    '33934.Best_Selling_Books_of_All_Time'
]

In [None]:
#importing libraries
import requests
from bs4 import BeautifulSoup

# defining helper function to get list of books from the url
def get_topics_page(urls):
    topics_url = 'https://www.goodreads.com/list/show/' + urls
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    all_tags = doc.find("table",class_="tableList js-dataTooltip")
    all_book_tags = all_tags.find_all("tr")
    
    return all_book_tags
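
`get_topics_page` builds each address by concatenating a base URL and a list slug. The outline also mentions combining data from multiple pages, so a minimal sketch of how paged URLs could be built is given here. The `build_list_url` helper is hypothetical, and the `page` query parameter is an assumption about how Goodreads paginates lists that should be checked against the live site.

```python
from urllib.parse import urlencode

BASE_LIST_URL = 'https://www.goodreads.com/list/show/'

def build_list_url(slug, page=1):
    """Return the URL for one page of a list (hypothetical helper)."""
    url = BASE_LIST_URL + slug
    # Assumption: later pages are selected with a ?page=N query string.
    if page > 1:
        url += '?' + urlencode({'page': page})
    return url

urls = [build_list_url('33934.Best_Selling_Books_of_All_Time', page=n)
        for n in range(1, 3)]
print(urls)
```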

In [None]:
# assigning book tags in a variable by using the urls in list
books_tr_tags = [get_topics_page(url) for url in list_of_urls]

In [None]:
len(books_tr_tags)

2

### Defining a function to extract book records and store them in a dictionary

In [None]:
# defining helper function to get 5 columns of needed format
def book_details(book, search_type):
    book_title = book.find("a",class_="bookTitle")
    book_author_name = book.find("a",class_="authorName")
    base_url = 'https://www.goodreads.com'
    book_url = base_url + book.find("a",class_ = "bookTitle")["href"]
    book_ratings = book.find("span",class_="minirating")
    
    # creating a dictionary and storing the book column elements
    book_dict = {
        
        'Book_title':book_title.text.strip().replace(","," "),
        'Author_name':book_author_name.text,
        'Book_url':book_url,
        'Book_ratings':book_ratings.text.strip().replace(",", ""),
        'search_type': search_type
    }
    return book_dict
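
The cell that builds `all_books_done`, which the following cells use, is not shown. A minimal sketch of one way to assemble it follows, assuming one label per list and skipping rows the extractor cannot handle (for example a row without a title link). `collect_books`, the labels, and the stand-in extractor are hypothetical names, not the notebook's own code.

```python
def collect_books(tr_tag_groups, labels, extract):
    """Flatten groups of row tags into one list of record dictionaries.

    tr_tag_groups: one sequence of rows per list page
    labels:        one search_type label per group (assumed, same order)
    extract:       a function such as book_details(row, label)
    """
    records = []
    for rows, label in zip(tr_tag_groups, labels):
        for row in rows:
            try:
                records.append(extract(row, label))
            except (AttributeError, TypeError, KeyError):
                # Row lacks an expected element (e.g. a header row); skip it.
                continue
    return records

# Stand-in data so the sketch runs without network access.
fake_groups = [['row-a', 'row-b'], ['row-c']]
fake_labels = ['List one', 'List two']
fake_extract = lambda row, label: {'Book_title': row, 'search_type': label}

all_books_demo = collect_books(fake_groups, fake_labels, fake_extract)
print(len(all_books_demo))
```

In the notebook, a call such as `collect_books(books_tr_tags, labels, book_details)` would play the same role, with `labels` chosen to match the order of `list_of_urls`.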

In [None]:
len(all_books_done)

200

200 book records have been scraped

In [None]:
# reading a sample
all_books_done[:2]

[{'Book_title': 'A Light in the Flame (Flesh and Fire  #2)',
  'Author_name': 'Jennifer L. Armentrout',
  'Book_url': 'https://www.goodreads.com/book/show/59449896-a-light-in-the-flame',
  'Book_ratings': '4.53 avg rating — 44359 ratings',
  'search_type': 'Best Selling Books'},
 {'Book_title': 'Heist (Valenshek Legacy  #1)',
  'Author_name': 'Tate James',
  'Book_url': 'https://www.goodreads.com/book/show/63001132-heist',
  'Book_ratings': '4.58 avg rating — 2072 ratings',
  'search_type': 'Best Selling Books'}]
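
Each `Book_ratings` value packs two numbers into one string: the average rating, then the number of ratings. If numeric columns are wanted, a small sketch like the following could split them. `parse_ratings` is a hypothetical helper, and the pattern assumes the shape seen in the sample records above.

```python
import re

# Matches e.g. '4.53 avg rating <dash> 44359 ratings'; \D+? skips the separator.
RATINGS_RE = re.compile(r'(\d+(?:\.\d+)?)\s+avg rating\D+?([\d,]+)\s+ratings?')

def parse_ratings(text):
    """Return (average, count) parsed from a Book_ratings string, or None."""
    match = RATINGS_RE.search(text)
    if match is None:
        return None
    average = float(match.group(1))
    count = int(match.group(2).replace(',', ''))
    return average, count

# \u2014 is the dash character used in the scraped strings.
sample = '4.53 avg rating \u2014 44359 ratings'
print(parse_ratings(sample))
```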

### Defining a function to write the scraped raw data to a CSV file

In [None]:
# defining function to write book records data into csv file
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w', encoding='utf-8') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
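
`write_csv` joins values with commas by hand, which is why `book_details` strips commas from titles first. A sketch of an alternative using the standard library's `csv` module, which quotes such values instead of requiring them to be altered, follows. `write_csv_quoted` is a hypothetical name and the sample record is made up.

```python
import csv
import io

def write_csv_quoted(items, stream):
    """Write a list of dictionaries to an open text stream as CSV."""
    if not items:
        return
    writer = csv.DictWriter(stream, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)

# Made-up record whose title contains a comma.
demo_items = [{'Book_title': 'Title, With Comma', 'Author_name': 'Some Author'}]

buffer = io.StringIO()
write_csv_quoted(demo_items, buffer)
print(buffer.getvalue())

# Reading it back recovers the title with its comma intact.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

When writing to a real file instead of an in-memory buffer, the `csv` documentation recommends opening it with `newline=''`, for example `open(path, 'w', newline='', encoding='utf-8')`.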

In [None]:
write_csv(all_books_done , 'books.csv')

### Installing pandas to load and store the data as a DataFrame

In [None]:
!pip install pandas --upgrade --quiet

In [None]:
import pandas as pd

In [None]:
books_df = pd.read_csv("books.csv")

In [None]:
books_df

| Unnamed: 0 | Book_title | Author_name | Book_url | Book_ratings | search_type |
| --- | --- | --- | --- | --- | --- |
| 0 | A Light in the Flame (Flesh and Fire #2) | Jennifer L. Armentrout | https://www.goodreads.com/book/show/59449896-a... | 4.53 avg rating — 44359 ratings | Best Selling Books |
| 1 | Heist (Valenshek Legacy #1) | Tate James | https://www.goodreads.com/book/show/63001132-h... | 4.58 avg rating — 2072 ratings | Best Selling Books |
| 2 | Sweetest Secret | Lucy Darling | https://www.goodreads.com/book/show/62994973-s... | 4.02 avg rating — 993 ratings | Best Selling Books |
| 3 | God of Wrath (Legacy of Gods #3) | Rina Kent | https://www.goodreads.com/book/show/61100797-g... | 4.20 avg rating — 17659 ratings | Best Selling Books |
| 4 | Dukes of Peril (The Royals of Forsyth Universi... | Angel Lawson | https://www.goodreads.com/book/show/61348581-d... | 4.51 avg rating — 3274 ratings | Best Selling Books |
| ... | ... | ... | ... | ... | ... |
| 195 | The Prophet | Kahlil Gibran | https://www.goodreads.com/book/show/2547.The_P... | 4.23 avg rating — 283460 ratings | Best books of November 2022 |
| 196 | The Exorcist | William Peter Blatty | https://www.goodreads.com/book/show/179780.The... | 4.20 avg rating — 219249 ratings | Best books of November 2022 |
| 197 | The Gruffalo (Gruffalo #1) | Julia Donaldson | https://www.goodreads.com/book/show/1013383.Th... | 4.45 avg rating — 40501 ratings | Best books of November 2022 |
| 198 | Catch-22 | Joseph Heller | https://www.goodreads.com/book/show/168668.Cat... | 3.99 avg rating — 806318 ratings | Best books of November 2022 |
