## Web scraping with the Python libraries requests and BeautifulSoup

Goodreads is a website that lets users around the world rate books and write reviews. Every year, Goodreads holds its Choice Awards, in which users vote for their favorite book in each genre. In each genre, the winner is decided by the number of votes each book receives.

Here, I use Python to scrape the Goodreads website and collect details of the 2021 winners: the category, the title of the book, the number of votes it received, and its author.

### Importing the needed libraries

In [1]:
import requests 
from bs4 import BeautifulSoup as bs
import pandas as pd

### Getting the page using the requests library

In [2]:
page=requests.get("https://www.goodreads.com/choiceawards/best-books-2021")
soup=bs(page.content,'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage">
 <head>
  <title>
   Best Books 2021 — Goodreads Choice Awards
  </title>
  <meta content="telephone=no" name="format-detection"/>
  <link href="https://www.goodreads.com/choiceawards/best-books-2021" rel="canonical"/>
  <meta content="Best books 2021, top books 2021, 2021 Goodreads Choice Awards, votes, ratings, book reviews" property="keywords">
   <meta content="2415071772" property="fb:app_id"/>
   <meta content="Announcing the Winners of the 2021 Goodreads Choice Awards!" property="og:title"/>
   <meta content="https://s.gr-assets.com/assets/award/2021/choice-logo-square-winners.png" property="og:image"/>
   <meta content="The Goodreads Choice Awards are the only major book awards decided by readers. View the winners across all 17 categories now!" property="og:description"/>
   <meta content="https://www.goodreads.com/choiceawards/best-books-2021" property="og:url"/>
   <meta content="Goodreads" property="og:site_na

### Getting required fields for one record

In [3]:
sample_data=soup.find(class_="category clearFix")
print(sample_data.prettify())

<div class="category clearFix">
 <a href="/choiceawards/best-fiction-books-2021">
  <h4 class="category__copy">
   Fiction
  </h4>
  <div class="category__winnerImageContainer">
   <img alt="Beautiful World, Where Are You" class="category__winnerImage" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1618329605l/56597885.jpg"/>
  </div>
 </a>
 <div class="wtrButtonContainer wtrSignedOut" id="1_book_56597885">
  <div class="wtrUp wtrLeft">
   <form accept-charset="UTF-8" action="/shelf/add_to_shelf" method="post">
    <input name="utf8" type="hidden" value="✓"/>
    <input name="authenticity_token" type="hidden" value="aBnKFTxsXbXU8a8h5J+Yr+evGX5e6CuN3WGMy4tnk7l8fShESjEQ5Pekoo7VXivcqs09QkDQGMohEO+lMYfUMQ=="/>
    <input id="book_id" name="book_id" type="hidden" value="56597885"/>
    <input id="name" name="name" type="hidden" value="to-read"/>
    <input id="unique_id" name="unique_id" type="hidden" value="1_book_56597885"/>
    <input id="wtr_new" name="wtr_ne

In [4]:
sample_cat=sample_data.find("h4").get_text()
print(sample_cat)


Fiction



In [5]:
sample_title=sample_data.find("img")["alt"]
print(sample_title)

Beautiful World, Where Are You


In [6]:
sample_a=sample_data.find("a")
print(sample_a.prettify())

<a href="/choiceawards/best-fiction-books-2021">
 <h4 class="category__copy">
  Fiction
 </h4>
 <div class="category__winnerImageContainer">
  <img alt="Beautiful World, Where Are You" class="category__winnerImage" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1618329605l/56597885.jpg"/>
 </div>
</a>



In [7]:
print(sample_a["href"])

/choiceawards/best-fiction-books-2021
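
The `href` printed above is a relative path, so it has to be combined with the site's base URL before it can be requested. A minimal sketch of one way to do that with the standard library's `urljoin`; the names `BASE_URL`, `relative_href` and `child_url` are only illustrative.

```python
from urllib.parse import urljoin

# Base URL of the site; the href value is the one printed above.
BASE_URL = "https://www.goodreads.com"
relative_href = "/choiceawards/best-fiction-books-2021"

# urljoin resolves a relative path against the base URL and
# leaves an already-absolute URL unchanged.
child_url = urljoin(BASE_URL, relative_href)
print(child_url)
```

Plain string concatenation also works for this page, because every category link starts with `/`.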


### Getting the child page from the parent page for one record

In [8]:
sample_page=requests.get("https://www.goodreads.com"+sample_a["href"])
sample_soup=bs(sample_page.content,'html.parser')
print(sample_soup.prettify())

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage">
 <head>
  <title>
   Best Fiction 2021 — Goodreads Choice Awards
  </title>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="Best books 2021, top books 2021, 2021 Goodreads Choice Awards, votes, ratings, book reviews" property="keywords">
   <meta content="2415071772" property="fb:app_id"/>
   <meta content="Announcing the Goodreads Choice Winner in Best Fiction!" property="og:title"/>
   <meta content="https://s.gr-assets.com/assets/award/2021/choice-logo-square-winners.png" property="og:image"/>
   <meta content="Congratulations to our winners in 17 categories! The Goodreads Choice Awards are the only major book awards decided by readers." property="og:description"/>
   <meta content="https://www.goodreads.com/choiceawards/best-fiction-books-2021" property="og:url"/>
   <meta content="Goodreads" property="og:site_name"/>
   <!-- * Copied from https://info.analytics.a2z.com/#/docs/data_collectio

In [9]:
sample_votes=sample_soup.find(class_="greyText gcaNumVotes").get_text()
print(sample_votes)


69,770
votes



In [10]:
sample_author=sample_soup.find(class_="authorName__container")
print(sample_author.prettify())

<div class="authorName__container">
 <a class="authorName" href="https://www.goodreads.com/author/show/15860970.Sally_Rooney" itemprop="url">
  <span itemprop="name">
   Sally Rooney
  </span>
 </a>
 <span class="greyText">
  (Goodreads Author)
 </span>
</div>



In [11]:
sample_author_name=sample_author.find("span").get_text()
print(sample_author_name)

Sally Rooney


In [12]:
print(sample_cat,sample_title,sample_votes,sample_author_name)


Fiction
 Beautiful World, Where Are You 
69,770
votes
 Sally Rooney


### Getting all the records using CSS selectors

In [13]:
categories=[t.get_text().strip("\n") for t in soup.select(".clearFix.category h4")]
categories

['Fiction',
 'Mystery & Thriller',
 'Historical Fiction',
 'Fantasy',
 'Romance',
 'Science Fiction',
 'Horror',
 'Humor',
 'Nonfiction',
 'Memoir & Autobiography',
 'History & Biography',
 'Graphic Novels & Comics',
 'Poetry',
 'Debut Novel',
 'Young Adult Fiction',
 'Young Adult Fantasy',
 "Middle Grade & Children's"]

In [14]:
titles=[t["alt"] for t in soup.select(".category__winnerImageContainer img")]
titles

['Beautiful World, Where Are You',
 'The Last Thing He Told Me',
 'Malibu Rising',
 'A \u200bCourt of Silver Flames (A Court of Thorns and Roses, #4)',
 'People We Meet on Vacation',
 'Project Hail Mary',
 'The Final Girl Support Group',
 'Broken (In the Best Possible Way)',
 'The Anthropocene Reviewed',
 'Crying in H Mart',
 'Empire of Pain: The Secret History of the Sackler Dynasty',
 'Lore Olympus: Volume One (Lore Olympus, #1)',
 'The Hill We Climb: An Inaugural Poem for the Country',
 'The Spanish Love Deception',
 "Firekeeper's Daughter",
 'Rule of Wolves (King of Scars, #2)',
 'Daughter of the Deep']

In [15]:
child_soups=[]
child_a=soup.select(".clearFix.category a")
child_href=[a["href"] for a in child_a]
print(child_href)

['/choiceawards/best-fiction-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-mystery-thriller-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-historical-fiction-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-fantasy-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-romance-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-science-fiction-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-horror-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-humor-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-nonfiction-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-memoir-autobiography-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-history-biography-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-graphic-novels-comics-2021', '#', '#', '#', '#', '#', '/choiceawards/best-poetry-books-2021', '#', '#', '#', '#', '#', '/choiceawards/best-debut-novel-2021', '#', '#', '#', '#', '#', '/choiceawards/best

In [16]:
child_href=[i for i in child_href if i!='#']
print(child_href)

['/choiceawards/best-fiction-books-2021', '/choiceawards/best-mystery-thriller-books-2021', '/choiceawards/best-historical-fiction-books-2021', '/choiceawards/best-fantasy-books-2021', '/choiceawards/best-romance-books-2021', '/choiceawards/best-science-fiction-books-2021', '/choiceawards/best-horror-books-2021', '/choiceawards/best-humor-books-2021', '/choiceawards/best-nonfiction-books-2021', '/choiceawards/best-memoir-autobiography-books-2021', '/choiceawards/best-history-biography-books-2021', '/choiceawards/best-graphic-novels-comics-2021', '/choiceawards/best-poetry-books-2021', '/choiceawards/best-debut-novel-2021', '/choiceawards/best-young-adult-fiction-books-2021', '/choiceawards/best-young-adult-fantasy-books-2021', '/choiceawards/best-childrens-books-2021']
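
The two steps above (collect every link, then drop the `'#'` placeholders) could probably be merged by letting the CSS selector do the filtering. The sketch below runs on a small hand-written HTML snippet, not on the real page: the snippet, the "Want to Read" link text, and the names `demo_soup` and `demo_hrefs` are assumptions made for illustration. The selector `a[href^="/choiceawards/"]` keeps only anchors whose `href` starts with that prefix.

```python
from bs4 import BeautifulSoup

# Toy HTML that imitates the structure seen above: each category block
# holds one real link and some "#" placeholder links.
html = """
<div class="category clearFix">
  <a href="/choiceawards/best-fiction-books-2021"><h4>Fiction</h4></a>
  <a href="#">Want to Read</a>
</div>
<div class="category clearFix">
  <a href="/choiceawards/best-horror-books-2021"><h4>Horror</h4></a>
  <a href="#">Want to Read</a>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# [href^="..."] matches attributes that begin with the given prefix,
# so the "#" links never make it into the result.
links = demo_soup.select('.category a[href^="/choiceawards/"]')
demo_hrefs = [a["href"] for a in links]
print(demo_hrefs)
```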


In [17]:
print(len(child_href))

17


In [18]:
print(len(titles))

17
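
The two length checks above are compared by eye. A small helper can make the same check explicit and fail early if the lists ever get out of step. This is only a sketch: `check_same_length` is a hypothetical helper, and the lists below are short stand-ins for the scraped data.

```python
def check_same_length(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Toy data standing in for the scraped lists
example_titles = ["Beautiful World, Where Are You", "The Last Thing He Told Me"]
example_hrefs = ["/choiceawards/best-fiction-books-2021",
                 "/choiceawards/best-mystery-thriller-books-2021"]

print(check_same_length(titles=example_titles, hrefs=example_hrefs))
```

In the notebook the call would be something like `check_same_length(titles=titles, hrefs=child_href)`.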


In [19]:
votes=[]
authors=[]
# Visit each category page and collect the winner's vote count and author
for href in child_href:
    child_page=requests.get("https://www.goodreads.com"+href)
    child_soup=bs(child_page.content,'html.parser')
    # find() returns the first match, taken here to be the winner's entry
    no_votes=child_soup.find(class_="greyText gcaNumVotes").get_text()
    votes.append(no_votes)
    author=child_soup.find("span",itemprop="name").get_text()
    authors.append(author)
print(len(votes))
print(len(authors))

17
17
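
The loop above sends 17 requests back to back. When scraping many pages it is usually considerate to pause between requests. Below is a rough sketch of that idea. The helper `fetch_all`, its `delay` parameter and the stand-in `fake_fetch` are all hypothetical; the fetch function is passed in as an argument so the example runs without network access.

```python
import time

def fetch_all(paths, fetch, base_url="https://www.goodreads.com", delay=1.0):
    """Call fetch(url) for each path, sleeping `delay` seconds between calls."""
    results = []
    for i, path in enumerate(paths):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first
        results.append(fetch(base_url + path))
    return results

# Stand-in fetch function so the sketch runs offline.
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = fetch_all(["/a", "/b"], fake_fetch, delay=0)
print(pages)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`, with `child_href` passed as `paths`.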


In [20]:
print(votes)

['\n69,770\nvotes\n', '\n58,406\nvotes\n', '\n104,854\nvotes\n', '\n111,498\nvotes\n', '\n88,755\nvotes\n', '\n92,831\nvotes\n', '\n45,960\nvotes\n', '\n26,788\nvotes\n', '\n41,649\nvotes\n', '\n51,361\nvotes\n', '\n19,969\nvotes\n', '\n53,686\nvotes\n', '\n49,251\nvotes\n', '\n55,621\nvotes\n', '\n35,648\nvotes\n', '\n48,212\nvotes\n', '\n24,836\nvotes\n']


In [21]:
# Removing every newline also covers the leading and trailing ones
votes=[t.replace("\n","") for t in votes]
print(votes)

['69,770votes', '58,406votes', '104,854votes', '111,498votes', '88,755votes', '92,831votes', '45,960votes', '26,788votes', '41,649votes', '51,361votes', '19,969votes', '53,686votes', '49,251votes', '55,621votes', '35,648votes', '48,212votes', '24,836votes']
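
The cleaned values are still strings such as `'69,770votes'`, which cannot be sorted or summed as numbers. If numeric vote counts are wanted, one possible approach is to pull out the digits with a regular expression. `parse_votes` is a hypothetical helper, not part of the notebook above.

```python
import re

def parse_votes(text):
    """Extract the first number in text and return it as an int."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    # Drop the thousands separators before converting
    return int(match.group().replace(",", ""))

print(parse_votes("69,770votes"))
print(parse_votes("\n104,854\nvotes\n"))
```

Because the pattern ignores surrounding text, it works on both the raw and the cleaned strings, for example `[parse_votes(v) for v in votes]`.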


In [22]:
print(authors)

['Sally Rooney', 'Laura Dave', 'Taylor Jenkins Reid', 'Sarah J. Maas', 'Emily Henry', 'Andy Weir', 'Grady Hendrix', 'Jenny  Lawson', 'John Green', 'Michelle Zauner', 'Patrick Radden Keefe', 'Rachel  Smythe', 'Amanda Gorman', 'Elena  Armas', 'Angeline Boulley', 'Leigh Bardugo', 'Rick Riordan']
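
Some of the scraped strings carry stray whitespace: a few author names contain a double space (`'Jenny  Lawson'`), and one title contains a zero-width space (`\u200b`). A minimal sketch of a cleanup step, assuming a hypothetical helper named `clean_text`:

```python
def clean_text(text):
    """Remove zero-width spaces and collapse runs of whitespace."""
    # str.split() does not treat U+200B as whitespace, so remove it explicitly
    text = text.replace("\u200b", "")
    # split() with no arguments splits on any whitespace run and trims the ends
    return " ".join(text.split())

print(clean_text("Jenny  Lawson"))
print(clean_text("A \u200bCourt of Silver Flames"))
```

It could be applied to the scraped lists with a comprehension such as `[clean_text(a) for a in authors]`.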


### Creating a dataframe

In [23]:
winners=pd.DataFrame({"category":categories,"title":titles,"votes":votes,"author":authors})
print(winners)

                     category  \
0                     Fiction   
1          Mystery & Thriller   
2          Historical Fiction   
3                     Fantasy   
4                     Romance   
5             Science Fiction   
6                      Horror   
7                       Humor   
8                  Nonfiction   
9      Memoir & Autobiography   
10        History & Biography   
11    Graphic Novels & Comics   
12                     Poetry   
13                Debut Novel   
14        Young Adult Fiction   
15        Young Adult Fantasy   
16  Middle Grade & Children's   

                                                title         votes  \
0                      Beautiful World, Where Are You   69,770votes   
1                           The Last Thing He Told Me   58,406votes   
2                                       Malibu Rising  104,854votes   
3   A ​Court of Silver Flames (A Court of Thorns a...  111,498votes   
4                          People We Meet on Vacat

In [24]:
winners.head()

| | category | title | votes | author |
|---|---|---|---|---|
| 0 | Fiction | Beautiful World, Where Are You | 69,770votes | Sally Rooney |
| 1 | Mystery & Thriller | The Last Thing He Told Me | 58,406votes | Laura Dave |
| 2 | Historical Fiction | Malibu Rising | 104,854votes | Taylor Jenkins Reid |
| 3 | Fantasy | A ​Court of Silver Flames (A Court of Thorns a... | 111,498votes | Sarah J. Maas |
| 4 | Romance | People We Meet on Vacation | 88,755votes | Emily Henry |


### Saving the dataframe as a .csv file 

In [29]:
winners.to_csv("Goodreads choice awards.csv")