# Web Scraping Popular Books on Goodreads using Python

In [1]:
# importing required libraries
import requests
import os
from bs4 import BeautifulSoup
import pandas as pd

from IPython.display import display, Image

In [2]:
base_url = "https://www.goodreads.com"
genres_url = "https://www.goodreads.com/genres"

In [3]:
response = requests.get(genres_url)
response.status_code

200

The status code reports the outcome of the request. A successful request returns status code 200. For the other status codes, see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
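
To make the meaning of a status code concrete, here is a minimal sketch that uses only Python's standard `http.HTTPStatus` enum. The helper names `describe_status` and `is_success` are hypothetical and are not part of the notebook; they only illustrate how a numeric code maps to a reason phrase and how the 2xx range can be treated as success.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '200 OK' for a known HTTP status code."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code):
    """Return True for any status code in the 2xx (success) range."""
    return 200 <= code < 300


# Illustrative codes only; in the notebook the value would come from
# response.status_code.
for code in (200, 404, 503):
    print(describe_status(code), "| success:", is_success(code))
```

In the notebook, `response.status_code` could be passed to either helper. The requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.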

## Extracting genres
First, I will extract the book genres from the genres page and store them in a dictionary that maps each genre name to the URL of its most popular books.

In [5]:
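doc = BeautifulSoup(response.text, 'html.parser')
# All links inside the first "rightContainer" div
genre_tags = doc.find_all('div', class_="rightContainer")[0].find_all('a')
# Drop the first and last links, keeping only the genre entries
genre_names = [tag.text for tag in genre_tags][1:-1]
genre_links = [tag['href'] for tag in genre_tags][1:-1]
# link[7:] drops the first 7 characters of each href before appending it to the shelf URL
genre_most_popular_dict = {name: base_url + "/shelf/show" + link[7:] for (name, link) in zip(genre_names, genre_links)}
genre_most_popular_dict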

{'Art': 'https://www.goodreads.com/shelf/show/art',
 'Biography': 'https://www.goodreads.com/shelf/show/biography',
 'Business': 'https://www.goodreads.com/shelf/show/business',
 'Chick Lit': 'https://www.goodreads.com/shelf/show/chick-lit',
 "Children's": 'https://www.goodreads.com/shelf/show/children-s',
 'Christian': 'https://www.goodreads.com/shelf/show/christian',
 'Classics': 'https://www.goodreads.com/shelf/show/classics',
 'Comics': 'https://www.goodreads.com/shelf/show/comics',
 'Contemporary': 'https://www.goodreads.com/shelf/show/contemporary',
 'Cookbooks': 'https://www.goodreads.com/shelf/show/cookbooks',
 'Crime': 'https://www.goodreads.com/shelf/show/crime',
 'Ebooks': 'https://www.goodreads.com/shelf/show/ebooks',
 'Fantasy': 'https://www.goodreads.com/shelf/show/fantasy',
 'Fiction': 'https://www.goodreads.com/shelf/show/fiction',
 'Gay and Lesbian': 'https://www.goodreads.com/shelf/show/gay-and-lesbian',
 'Graphic Novels': 'https://www.goodreads.com/shelf/show/graphic
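
The URL construction above relies on the slice `link[7:]`. As a rough sketch of what that slice does, the example below assumes each genre href has the form `/genres/<slug>` (an assumption inferred from the output, since `/genres` is exactly 7 characters long). The helper `shelf_url` and the sample hrefs are hypothetical.

```python
base_url = "https://www.goodreads.com"

# Hypothetical hrefs, assumed to look like the ones on the genres page.
sample_links = ["/genres/art", "/genres/science-fiction"]


def shelf_url(genre_link, prefix="/genres"):
    """Turn a '/genres/<slug>' href into the matching '/shelf/show/<slug>' URL."""
    if not genre_link.startswith(prefix):
        raise ValueError(f"unexpected genre link: {genre_link!r}")
    # Slicing by len(prefix) is equivalent to link[7:] for the '/genres' prefix.
    return base_url + "/shelf/show" + genre_link[len(prefix):]


for link in sample_links:
    print(link, "->", shelf_url(link))
```

Checking the prefix before slicing makes the assumption explicit: if the site changes its link format, the helper fails loudly instead of silently building a wrong URL.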

In [6]:
len(genre_most_popular_dict)

40

There are 40 genres in total. Let's pick any one of them and start exploring its most popular books.

I will pick my favorite, the science-fiction genre.

In [7]:
sf_url = genre_most_popular_dict['Science Fiction']
sf_url

'https://www.goodreads.com/shelf/show/science-fiction'

In [8]:
# Get the content from this url using a get request
response = requests.get(sf_url)

The `get` method fetches the HTML of the page at the given URL.

In [15]:
import pprint
# pprint.pprint(response.text)

The raw HTML is hard to read on its own, so we parse it using the bs4 library.

In [16]:
# Parsing the html using beautiful soup
doc = BeautifulSoup(response.text, 'html.parser')

This doc contains all the information we need.

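By inspecting the HTML, we find that the book details sit inside `div` tags with the class `left`. The text of the first such tag looks like this: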

In [19]:
doc.find_all('div', class_='left')[0].text.strip().split('\n')

['Dune (Dune, #1)',
 '',
 'by',
 '',
 '',
 'Frank Herbert',
 '',
 '',
 '',
 '',
 '              (shelved 18366 times as science-fiction)     ',
 '',
 '              ',
 '',
 '                avg rating 4.26 —',
 '                1,228,957 ratings  —',
 '                published 1965']
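
The list above is mostly whitespace, so a natural next step is to reduce it to named fields. The following is a minimal sketch, assuming every book entry follows the same layout as this single sample. The function `parse_book_lines` and the field names are hypothetical, and the dash characters from the output are written as `\u2014` escapes.

```python
import re

# The split lines shown above, copied from the notebook output.
raw_lines = [
    'Dune (Dune, #1)', '', 'by', '', '', 'Frank Herbert', '', '', '', '',
    '              (shelved 18366 times as science-fiction)     ',
    '', '              ', '',
    '                avg rating 4.26 \u2014',
    '                1,228,957 ratings  \u2014',
    '                published 1965',
]


def parse_book_lines(lines):
    """Extract book fields from the split text of one 'left' div (assumed layout)."""
    # Strip padding and drop the empty or whitespace-only entries.
    parts = [line.strip() for line in lines if line.strip()]
    text = " ".join(parts)
    return {
        "title": parts[0],
        # The author is assumed to be the entry right after the literal "by".
        "author": parts[parts.index("by") + 1],
        "shelved": int(re.search(r"shelved (\d+) times", text).group(1)),
        "avg_rating": float(re.search(r"avg rating (\d+\.\d+)", text).group(1)),
        "ratings": int(re.search(r"([\d,]+) ratings", text).group(1).replace(",", "")),
        "published": int(re.search(r"published (\d{4})", text).group(1)),
    }


book = parse_book_lines(raw_lines)
print(book)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`. Entries that lack one of the fields (for example, a missing publication year) would make `re.search` return `None`, so real pages would need extra handling.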