# BOOK DATA SCRAPING USING BEAUTIFUL SOUP
In this project, I scrape book data from the Waterstones website using BeautifulSoup. The data will be used for later analysis.

For scraping data from this website, I'll perform the following tasks:

[**Task 1**](#task1): Importing the libraries

[**Task 2**](#task2): Creating the base url and choosing the header

[**Task 3**](#task3): Extracting product links on the first page

[**Task 4**](#task4): Extracting product links on all the pages

[**Task 5**](#task5): Extracting information of the first product

[**Task 6**](#task6): Extracting information of all the products

<a id='task1'></a>
# Task 1: Importing the libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

<a id='task2'></a>
# Task 2: Creating the base url and choosing the header

In [2]:
base_url = 'https://www.waterstones.com'
header =  {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
}
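The header above only takes effect when it is passed along with a request. One way to avoid repeating it is a `requests.Session`. The sketch below is an assumption about how the setup could be organised: the `fetch` helper and the 10-second timeout are hypothetical choices, not part of the original notebook.

```python
import requests

base_url = 'https://www.waterstones.com'
header = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
}

# A session sends the same headers with every request made through it
session = requests.Session()
session.headers.update(header)

def fetch(url, timeout=10):
    """Hypothetical helper: GET a page and raise on HTTP errors (4xx/5xx)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```

Calling `fetch(base_url + '/category/fiction/facet/347/page/1')` would then return the response for the first listing page, or raise an exception if the server answers with an error status.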

<a id='task3'></a>
# Task 3: Extracting product links on the first page

In [3]:
source = requests.get('https://www.waterstones.com/category/fiction/facet/347/page/1', headers = header)
soup = BeautifulSoup(source.content, 'lxml')

In [4]:
productlist = soup.find_all('div', class_='book-preview book-preview-grid-item span3 tablet-span6 mobile-span6')

In [5]:
for item in productlist:
    for link in item.find_all('a', href = True, class_ = 'text-gold'):
        print(link['href'])

/book/the-thursday-murder-club/richard-osman/9780241425442
/book/girl-a/abigail-dean/2928377050276
/book/girl-a/abigail-dean/9780008389055
/book/hamnet/maggie-ofarrell/9781472285522
/book/where-the-crawdads-sing/delia-owens/9781472154668
/book/shuggie-bain/douglas-stuart/9781529019278
/book/girl-woman-other/bernardine-evaristo/9780241984994
/book/the-dark-remains/william-mcilvanney/ian-rankin/9781838854102
/book/troubled-blood/robert-galbraith/9780751579932
/book/the-beekeeper-of-aleppo/christy-lefteri/9781838770013
/book/the-sentinel/lee-child/andrew-child/9781787633612
/book/agent-running-in-the-field/john-le-carre/9780241986547
/book/beautiful-world-where-are-you/sally-rooney/2928377053314
/book/beautiful-world-where-are-you/sally-rooney/9780571370054
/book/pine/francine-toon/9781784164829
/book/those-who-are-loved/victoria-hislop/9781472223227
/book/the-midnight-library/matt-haig/9781786892706
/book/the-giver-of-stars/jojo-moyes/9780718183219
/book/a-single-thread/tracy-chevalier/9

In [6]:
productlinks = []
for item in productlist:
    for link in item.find_all('a', href = True, class_ = 'text-gold'):
        productlinks.append(base_url + link['href'])
print(productlinks)

['https://www.waterstones.com/book/the-thursday-murder-club/richard-osman/9780241425442', 'https://www.waterstones.com/book/girl-a/abigail-dean/2928377050276', 'https://www.waterstones.com/book/girl-a/abigail-dean/9780008389055', 'https://www.waterstones.com/book/hamnet/maggie-ofarrell/9781472285522', 'https://www.waterstones.com/book/where-the-crawdads-sing/delia-owens/9781472154668', 'https://www.waterstones.com/book/shuggie-bain/douglas-stuart/9781529019278', 'https://www.waterstones.com/book/girl-woman-other/bernardine-evaristo/9780241984994', 'https://www.waterstones.com/book/the-dark-remains/william-mcilvanney/ian-rankin/9781838854102', 'https://www.waterstones.com/book/troubled-blood/robert-galbraith/9780751579932', 'https://www.waterstones.com/book/the-beekeeper-of-aleppo/christy-lefteri/9781838770013', 'https://www.waterstones.com/book/the-sentinel/lee-child/andrew-child/9781787633612', 'https://www.waterstones.com/book/agent-running-in-the-field/john-le-carre/9780241986547', 
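Passing a multi-class string to `class_` matches only when the `class` attribute is exactly that string, so the lookup can miss a card whose classes appear in a different order. A CSS selector matches on individual classes instead. The sketch below illustrates the difference on a made-up HTML fragment (`sample_html` is an assumption, not the real page markup) and joins the links with `urljoin` rather than string concatenation.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://www.waterstones.com'

# Made-up fragment: the second card lists the same classes in another order
sample_html = """
<div class="book-preview book-preview-grid-item span3">
  <a class="text-gold" href="/book/girl-a/abigail-dean/9780008389055">Girl A</a>
</div>
<div class="book-preview span3 book-preview-grid-item">
  <a class="text-gold" href="/book/hamnet/maggie-ofarrell/9781472285522">Hamnet</a>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the class attribute finds only the first card
exact = soup.find_all('div', class_='book-preview book-preview-grid-item span3')
print(len(exact))  # 1

# The CSS selector matches any div carrying the book-preview class
links = [urljoin(base_url, a['href'])
         for a in soup.select('div.book-preview a.text-gold[href]')]
print(links)
```

The selector is shorter than the full class string and keeps working if layout classes such as `span3` change, although it still depends on the site keeping the `book-preview` and `text-gold` names.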

<a id='task4'></a>
# Task 4: Extracting product links on all the pages

In [7]:
productlinks = []
for i in range(1,50):
    source = requests.get(f'https://www.waterstones.com/category/fiction/facet/347/page/{i}', headers = header)
    soup = BeautifulSoup(source.content, 'lxml')
    productlist = soup.find_all('div', class_='book-preview book-preview-grid-item span3 tablet-span6 mobile-span6')
    for item in productlist:
        for link in item.find_all('a', href = True, class_ = 'text-gold'):
            productlinks.append(base_url + link['href'])
            print('Saving book links on page', i)
print(len(productlinks))

Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 1
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on page 2
Saving book links on

Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 14
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book links on page 15
Saving book li

Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 26
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book links on page 27
Saving book li

Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 38
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book links on page 39
Saving book li
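The loop above prints one line per link and always visits a fixed range of pages. A possible variation, sketched below under stated assumptions, reports once per page, stops at the first page that returns no links, and skips links it has already seen. The names `collect_links` and `get_page_links` are hypothetical; in the notebook, `get_page_links` would wrap the request and the `find_all` calls from this task. Here it is replaced by a fake lookup so the example runs without network access.

```python
import time

def collect_links(get_page_links, max_pages=50, delay=0.0):
    """Gather unique links page by page until a page comes back empty."""
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        links = get_page_links(page)
        if not links:
            break  # no more results, stop early
        for link in links:
            if link not in seen:
                seen.add(link)
                ordered.append(link)
        print(f'Page {page}: {len(links)} links found')
        time.sleep(delay)  # optional pause between requests
    return ordered

# Fake pages standing in for the real site
fake_pages = {1: ['/book/a', '/book/b'], 2: ['/book/b', '/book/c']}
result = collect_links(lambda page: fake_pages.get(page, []))
print(result)  # ['/book/a', '/book/b', '/book/c']
```

Setting `delay` to a second or so would space the requests out, which is a common courtesy when scraping many pages.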

<a id='task5'></a>
# Task 5: Extracting information of the first product

In [8]:
testlink = 'https://www.waterstones.com/book/the-thursday-murder-club/richard-osman/9780241425442'
r = requests.get(testlink, headers = header)
soup = BeautifulSoup(r.content, 'lxml')

In [9]:
name = soup.find('h1', class_ = 'title').text
name = name[:-11]
print(name)

The Thursday Murder Club - The Thursday Murder Club 1


In [10]:
author = soup.find('span', class_ = 'contributors')
span = author.find_next('span')
author_name = span.string
print(author_name)

Richard Osman


In [11]:
original_price = soup.find('b', class_ = 'price-rrp').text
print(original_price)

£14.99


In [12]:
new_price = soup.find('div', class_ = 'price')
pri = new_price.find_next('b')
main_price = pri.string
print(main_price)

£14.99
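Both prices are stored as strings such as `'£14.99'`. For later analysis they will probably need to be numbers. A minimal sketch of that conversion follows; `parse_price` is a hypothetical helper, and treating anything unparseable as `None` is an assumption.

```python
def parse_price(text):
    """Turn a string like '£14.99' into a float, or None if that fails."""
    cleaned = text.replace('£', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price('£14.99'))     # 14.99
print(parse_price('Not given'))  # None
```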


In [13]:
no_of_pages = soup.find('div', class_ = 'info')
no = no_of_pages.find_next('span').find_next('span').find_next('span')
nop = no.string
print(nop)


400


In [14]:
stockp = soup.find('div', class_ = 'perk-container')
sto = stockp.find_next('span')
stock = sto.string
print(stock)


In stock


In [15]:
pubdate = soup.find('div', class_ ='info').meta
print(pubdate['content'])

2020-09-03


In [16]:
genre = soup.find('div', class_ = 'breadcrumbs span12').a.text
print(genre)

Crime, Thrillers & Mystery


In [17]:
# The link text ends with the word 'Reviews'; slicing off the last 7 characters leaves the count.
# An equivalent alternative is .text.replace('Reviews', '')
no_of_reviews = soup.find('a', class_ = 'reviews-trigger').text[:-7]
print(no_of_reviews)

290 
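Slicing leaves a trailing space (`'290 '`) and depends on the label being exactly seven characters long. A regular expression that picks out the digits is one alternative. The helper name `parse_review_count` is hypothetical, and the example strings are assumptions about what the link text can look like.

```python
import re

def parse_review_count(text):
    """Return the first integer found in the text, or None if there is none."""
    match = re.search(r'\d+', text.replace(',', ''))
    return int(match.group()) if match else None

print(parse_review_count('290 Reviews'))  # 290
print(parse_review_count('Not given'))    # None
```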


In [18]:
publisher = soup.find('p', class_ = 'spec').i.span.text
print(publisher)

Penguin Books Ltd


In [19]:
Cover_type = soup.find('div', class_ = 'info')
span = Cover_type.find_next('span')
cover = span.string.strip()
print(cover)

Hardback


In [20]:
book = {
    'Name':name,
    'Author_name': author_name,
    'Original_price': original_price,
    'New_price': main_price,
    'No_of_pages': nop,
    'Stock availability':stock,
    'Publication date': pubdate['content'],
    'Genre': genre,
    'No. of reviews': no_of_reviews,
    'Publisher': publisher,  
    'Cover':cover
}
print(book)

{'Name': 'The Thursday Murder Club - The Thursday Murder Club 1', 'Author_name': 'Richard Osman', 'Original_price': '£14.99', 'New_price': '£14.99', 'No_of_pages': '400', 'Stock availability': 'In stock', 'Publication date': '2020-09-03', 'Genre': 'Crime, Thrillers & Mystery', 'No. of reviews': '290 ', 'Publisher': 'Penguin Books Ltd', 'Cover': 'Hardback'}


<a id='task6'></a>
# Task 6: Extracting information of all the products

In [21]:
booklist = []
for link in productlinks:
    r = requests.get(link, headers = header)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find('h1', class_ = 'title').text[:-11]

    author = soup.find('span', class_ = 'contributors')
    author_name = author.find_next('span').string

    # Not every product page has an RRP element
    try:
        original_price = soup.find('b', class_ = 'price-rrp').text
    except AttributeError:
        original_price = 'Not given'

    # When an RRP is present, the selling price is read from the second <b> after the price container
    price_tag = soup.find('div', class_ = 'price').find_next('b')
    if original_price != 'Not given':
        price_tag = price_tag.find_next('b')
    main_price = price_tag.string

    # Assign the fallback to nop, the variable stored in the dictionary below
    try:
        info = soup.find('div', class_ = 'info')
        nop = info.find_next('span').find_next('span').find_next('span').string
    except AttributeError:
        nop = 'Not given'

    stockp = soup.find('div', class_ = 'perk-container')
    stock = stockp.find_next('span').string

    pubdate = soup.find('div', class_ = 'info').meta

    genre = soup.find('div', class_ = 'breadcrumbs span12').a.text

    # Assign the fallback to no_of_reviews, the variable stored in the dictionary below
    try:
        no_of_reviews = soup.find('a', class_ = 'reviews-trigger').text[:-7]
    except AttributeError:
        no_of_reviews = 'Not given'

    publisher = soup.find('p', class_ = 'spec').i.span.text

    cover_type = soup.find('div', class_ = 'info')
    cover = cover_type.find_next('span').string.strip()

    book = {
        'Name': name,
        'Author_name': author_name,
        'Original_price': original_price,
        'New_price': main_price,
        'No_of_pages': nop,
        'Stock availability': stock,
        'Publication date': pubdate['content'],
        'Genre': genre,
        'No. of reviews': no_of_reviews,
        'Publisher': publisher,
        'Cover': cover
    }
    booklist.append(book)
    print('saving:', book['Name'])

        

saving: The Thursday Murder Club - The Thursday Murder Club 1
saving: Girl A: Signed Exclusive Edition
saving: Girl A
saving: Hamnet: Exclusive Edition
saving: Where the Crawdads Sing 
saving: Shuggie Bain
saving: Girl, Woman, Other 
saving: The Dark Remains
saving: Troubled Blood
saving: The Beekeeper of Aleppo 
saving: The Sentinel - Jack Reacher 25
saving: Agent Running in the Field 
saving: Beautiful World, Where Are You: Signed Exclusive Edition
saving: Beautiful World, Where Are You: Exclusive Edition
saving: Pine 
saving: Those Who Are Loved 
saving: The Midnight Library
saving: The Giver of Stars 
saving: A Single Thread 
saving: A Song for the Dark Times
saving: A Burning: Exclusive Edition
saving: The Girl with the Louding Voice 
saving: A Court of Silver Flames: Exclusive Edition - A Court of Thorns and Roses
saving: The Testaments 
saving: Troy: Our Greatest Story Retold - Stephen Fry's Greek Myths
saving: The Foundling 
saving: The Confession 
saving: The Mermaid of Black 

saving: False Value: Exclusive Edition 
saving: Nine Perfect Strangers 
saving: Klara and the Sun: Exclusive Edition
saving: Ready Player One 
saving: Home Stretch: Exclusive Edition
saving: Death in the East - Sam Wyndham 
saving: Dominicana: Exclusive Edition 
saving: The Friendship Book 2021
saving: The Thursday Murder Club - The Thursday Murder Club 1 
saving: The Seven Sisters - The Seven Sisters 
saving: The Emperor's Exile (Eagles of the Empire 19)
saving: Nothing Ventured - William Warwick Novels 
saving: Killing Floor: (Jack Reacher 1) - Jack Reacher 
saving: The Devil and the Dark Water
saving: This Is Happiness 
saving: The Windsor Knot
saving: A Confederacy of Dunces - Penguin Modern Classics 
saving: And Then There Were None: The World's Favourite Agatha Christie Book 
saving: The Offing 
saving: A Net for Small Fishes: Exclusive Edition
saving: The Goldfinch 
saving: Tokyo Ueno Station 
saving: Bring Up the Bodies - The Wolf Hall Trilogy 
saving: The Haunting of Hill Hous

saving: Angela Carter's Book Of Fairy Tales
saving: The Five People You Meet In Heaven - Heaven 
saving: I Wish It Could Be Christmas Every Day
saving: The Guernsey Literary and Potato Peel Pie Society 
saving: Grief Is the Thing with Feathers 
saving: The Stolen Sisters 
saving: The Unbearable Lightness of Being 
saving: A Christmas Wish for the Shipyard Girls - The Shipyard Girls Series 
saving: Interior Chinatown: WINNER OF THE NATIONAL BOOK AWARDS 2020 
saving: A Christmas Memory
saving: The Fault in Our Stars 
saving: Winter - Seasonal Quartet 
saving: The Accomplice 
saving: To Calais, In Ordinary Time 
saving: Purple Hibiscus 
saving: Miss Benson's Beetle
saving: The Trouble With Peace: Book Two Signed Edition - The Age of Madness
saving: The Midnight Library: Signed Exclusive Edition
saving: The Remains of the Day 
saving: The Secret Life of Bees 
saving: Siddhartha - Penguin Modern Classics 
saving: Exhalation 
saving: Scar Tissue: The Debut Thriller from the No.1 Bestselling 

saving: Brighton Rock 
saving: Down Cemetery Road: Zoe Boehm Thrillers 1 - Zoe Boehm Thrillers 
saving: The Left Hand of Darkness - S.F. MASTERWORKS 
saving: The Pull of the Stars
saving: The Poisonwood Bible 
saving: Rebecca - Virago Modern Classics
saving: Holding 
saving: Transcription 
saving: The Heart is a Lonely Hunter - Penguin Modern Classics 
saving: In the Time We Lost 
saving: The Betrayals: Signed Exclusive Edition
saving: Those People 
saving: Pet Sematary 
saving: Oathbringer Part One: The Stormlight Archive Book Three - Stormlight Archive 
saving: The Rosie Project - The Rosie Project Series 
saving: The Carer 
saving: Norse Mythology 
saving: False Value - A Rivers of London novel 
saving: Execution - Giordano Bruno Book 6
saving: My Cousin Rachel - Virago Modern Classics 
saving: Who They Was
saving: If It Bleeds
saving: The Complete Sherlock Holmes
saving: One Shot: (Jack Reacher 9) - Jack Reacher 
saving: Oathbringer Part Two: The Stormlight Archive Book Three - STO

saving: The Night Circus: Exclusive Edition
saving: Fake Accounts
saving: A Diamond from Tiffany's 
saving: Gallowglass 
saving: A Stranger City 
saving: Zero 22: Danny Black Thriller 8
saving: A Promise of Ankles: A 44 Scotland Street Novel
saving: Woman on the Edge of Time 
saving: The Dispossessed - S.F. Masterworks 
saving: Hercule Poirot: the Complete Short Stories 
saving: As I Lay Dying 
saving: The Heavens 
saving: Go Tell it on the Mountain - Penguin Modern Classics 
saving: The Cheater's Guide to Love: Faber Stories - Faber Stories 
saving: Jeeves and the Leap of Faith
saving: Fearless Fairy Tales
saving: Half a World Away 
saving: Emma - Penguin Clothbound Classics
saving: Cold Earth - Shetland 
saving: Jane Eyre - The Penguin English Library 
saving: The House at Sea's End: The Dr Ruth Galloway Mysteries 3 - The Dr Ruth Galloway Mysteries 
saving: Colorless Tsukuru Tazaki and His Years of Pilgrimage 
saving: Empire of the Sun - 4th Estate Matchbook Classics 
saving: The Gar

saving: Dominion 
saving: The Lonely Londoners - Penguin Modern Classics 
saving: Possession: A Romance 
saving: In His Father's Footsteps 
saving: The Looking Glass War - Penguin Modern Classics 
saving: A Closed and Common Orbit: Wayfarers 2 - Wayfarers 
saving: The Ghost Tree 
saving: Mr Salary - Faber Stories 
saving: The First Woman
saving: Harry Potter and the Goblet of Fire - Gryffindor Edition
saving: Daughters of Cornwall
saving: Escape to the French Farmhouse 
saving: Exciting Times
saving: All Quiet on the Western Front 
saving: Unquiet 
saving: Love: Signed Edition 
saving: Washington Black 
saving: Burnt Sugar: Signed Bookplate Edition 
saving: Great Expectations - The Penguin English Library 
saving: A Long Petal of the Sea
saving: Earthlings: Signed Bookplate Edition
saving: Flashman - The Flashman Papers Book 1 
saving: Smoke and Ashes: Sam Wyndham Book 3 - Sam Wyndham 
saving: A Death in the Family: My Struggle Book 1 - My Struggle 
saving: The Amazing Adventures of Ka
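Several fields in the loop follow the same pattern: look up a tag, read its text, and fall back to `'Not given'` when the tag is missing. That pattern can be folded into one small function. The sketch below uses a hypothetical helper, `safe_text`, on a made-up fragment. Note that `get_text(strip=True)` also trims surrounding whitespace, which differs slightly from the plain `.text` used in the notebook.

```python
from bs4 import BeautifulSoup

def safe_text(soup, name, class_, default='Not given'):
    """Return the stripped text of the first matching tag, or a default."""
    tag = soup.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else default

# Made-up fragment containing an RRP but no reviews link
sample = BeautifulSoup('<div><b class="price-rrp"> £14.99 </b></div>', 'html.parser')
print(safe_text(sample, 'b', 'price-rrp'))        # £14.99
print(safe_text(sample, 'a', 'reviews-trigger'))  # Not given
```

Because the helper checks for `None` explicitly, it does not need a `try`/`except` block and cannot hide unrelated errors.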

# FINAL DATA 

In [22]:
df = pd.DataFrame(booklist)
df.tail(20)

| | Name | Author_name | Original_price | New_price | No_of_pages | Stock availability | Publication date | Genre | No. of reviews | Publisher | Cover |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1156 | A Ration Book Childhood - Ration Book series | Jean Fullerton | Not given | £7.99 | 400 | We can order this from the publisher | 2019-10-03 | Fiction | 24 | Atlantic Books | Paperback |
| 1157 | The Pale Horseman - The Last Kingdom Series Bo... | Bernard Cornwell | Not given | £8.99 | 432 | 10+ in stock | 2007-07-08 | Fiction | 4 | HarperCollins Publishers | Paperback |
| 1158 | Oliver Twist - Penguin Clothbound Classics | Charles Dickens | Not given | £14.99 | 608 | In stock | 2009-10-01 | Fiction | 2 | Penguin Books Ltd | Hardback |
| 1159 | Less | Andrew Sean Greer | Not given | £8.99 | 272 | 10+ in stock | 2018-05-22 | Fiction | 10 | Little, Brown Book Group | Paperback |
| 1160 | Miss Marple and Mystery: The Complete Short St... | Agatha Christie | Not given | £18.99 | 736 | 10+ in stock | 2008-09-15 | Crime, Thrillers & Mystery | 3 | HarperCollins Publishers | Paperback |
| 1161 | Come Rain or Come Shine: Faber Stories - Faber... | Kazuo Ishiguro | Not given | £3.50 | 80 | 10+ in stock | 2019-01-03 | Fiction | 1 | Faber & Faber | Paperback |
| 1162 | A Theatre for Dreamers | Polly Samson | Not given | £14.99 | 368 | 10+ in stock | 2020-04-02 | Fiction | 12 | Bloomsbury Publishing PLC | Hardback |
| 1163 | The Godmother | Hannelore Cayre | Not given | £8.99 | 10+ in stock | 10+ in stock | 2019-10-15 | Crime, Thrillers & Mystery | 1 | Old Street Publishing | Paperback |
| 1164 | Indelicacy | Amina Cain | Not given | £9.99 | 168 | 10+ in stock | 2020-09-03 | Fiction | 1 | Daunt Books | Paperback |
| 1165 | The Court of Miracles - The Court of Miracles ... | Kester Grant | Not given | £12.99 | 464 | 10+ in stock | 2020-06-18 | Science Fiction, Fantasy & Horror | 46 | HarperCollins Publishers | Hardback |
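Every column in this table is still text, and one row carries `10+ in stock` in the page-count column. Before analysis the columns would likely be converted to proper types. The sketch below shows one way to do that on a two-row sample copied from the output above. `errors='coerce'` turns values that cannot be parsed into `NaN` or `NaT` instead of raising.

```python
import pandas as pd

# Two rows copied from the output above, limited to the columns being converted
sample = pd.DataFrame({
    'New_price': ['£8.99', '£14.99'],
    'No_of_pages': ['10+ in stock', '368'],
    'Publication date': ['2019-10-15', '2020-04-02'],
    'No. of reviews': ['1', '12'],
})

clean = sample.copy()
clean['New_price'] = pd.to_numeric(
    clean['New_price'].str.replace('£', '', regex=False), errors='coerce')
clean['No_of_pages'] = pd.to_numeric(clean['No_of_pages'], errors='coerce')
clean['No. of reviews'] = pd.to_numeric(
    clean['No. of reviews'].str.strip(), errors='coerce')
clean['Publication date'] = pd.to_datetime(
    clean['Publication date'], errors='coerce')

print(clean.dtypes)
print(clean)
```

Rows that end up with `NaN` in `No_of_pages` point to product pages where the chained `find_next('span')` lookup landed on a different element, so they are worth inspecting by hand.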


# Convert to XLSX file

In [23]:
df.to_excel("output.xlsx", sheet_name = 'Book_data')
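By default pandas writes the row index as an extra column, and reading the file back then yields a column named `Unnamed: 0`. Passing `index=False` avoids this, and `to_excel` accepts the same argument. The sketch below demonstrates the effect with CSV and an in-memory buffer so it runs without writing files; the one-row `sample` frame is made up for illustration.

```python
import io
import pandas as pd

sample = pd.DataFrame({'Name': ['Less'], 'Cover': ['Paperback']})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'Name', 'Cover']

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Name', 'Cover']
```

Writing `.xlsx` files with `to_excel` also requires an Excel engine such as `openpyxl` to be installed.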