In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Exercise

Go to http://books.toscrape.com/. Using what you have learned, create a CSV file that contains all the books found on the website. The CSV file should contain the following fields:

- Title
- Price
- Description
- Availability

Code guides are provided to help you build the web scraper. The function `get_title_links_and_next_page` below returns two things: the book URLs on a page and the link to the next page. The idea is to first collect every book link on the website and store the links in the `title_links` variable. **(5 points)**

In [2]:
base_url = "http://books.toscrape.com/"

def get_title_links_and_next_page(page_url):
    """Return (book URLs found on page_url, relative next-page link or None)."""
    # this is where we store the links to each title
    list_links = []
    # get the HTML for the URL that was given
    page = requests.get(page_url)
    # parse the HTML so BeautifulSoup can query it
    soup = BeautifulSoup(page.text, 'html.parser')
    # inspecting the page, we notice that each book sits under an
    # article tag, so we get all articles
    for article in soup.find_all('article'):
        # the article tag contains an anchor tag; find it and get its href
        href = article.find("a")['href']
        # hrefs on pages after the first omit "catalogue/", so add it back
        if "catalogue" not in href:
            url = base_url + "catalogue/" + href
        else:
            url = base_url + href
        # add the title URL to our list of links
        list_links.append(url)

    # check whether the page has a "next" button
    next_item = soup.find('li', class_="next")
    # if there is none, the next-page link is None
    next_url = next_item.a['href'] if next_item else None

    return (list_links, next_url)
    


#initial set up to crawl the book links and next page
res = get_title_links_and_next_page(base_url)
title_links = res[0]

# keep crawling for book links as long as there is a next-page link
while res[1]:
    # the next-page link on later pages omits "catalogue/", so we add it
    # to build a URL that can be crawled
    if "catalogue" not in res[1]:
        page_url = base_url + "catalogue/" + res[1]
    else:
        page_url = base_url + res[1]
    res = get_title_links_and_next_page(page_url)
    title_links += res[0]  # extend the list of title links
    
    
title_links


['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990
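
The `"catalogue" not in ...` checks in the cell above patch up relative links by hand. As an alternative sketch (not the approach the code guide takes), the standard library's `urllib.parse.urljoin` resolves a relative `href` against the URL of the page it was found on, which handles both cases. The helper name `resolve` and the book path `some-book_123/index.html` are hypothetical and only illustrate the shape of the links.

```python
from urllib.parse import urljoin

def resolve(page_url, href):
    """Resolve a relative href against the page it was scraped from."""
    return urljoin(page_url, href)

front_page = "http://books.toscrape.com/"

# The front page links to the next page with the "catalogue/" prefix...
second_page = resolve(front_page, "catalogue/page-2.html")

# ...while later pages link without it, relative to their own folder.
third_page = resolve(second_page, "page-3.html")

# Book links on later pages are relative in the same way (illustrative path).
book_url = resolve(second_page, "some-book_123/index.html")

print(second_page)
print(third_page)
print(book_url)
```

With this approach the caller only needs to remember which page each `href` came from; no string test for `"catalogue"` is required.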

Once you have the list of all available book links, loop through it and use the four functions `get_title`, `get_price`, `get_description`, and `get_availability` to retrieve each book's information. **(10 points)**

In [3]:
def get_title(soup):
    return soup.h1.text.strip()

def get_price(soup):
    return soup.find('p',class_="price_color").text
    
def get_description(soup):
    # the description is the fourth <p> inside the product_page article
    desc = soup.find('article', class_='product_page').find_all('p')
    return desc[3].text
    
def get_availability(soup):
    return soup.find('p',class_="instock availability").text.strip()

book_data = []
for title_link in title_links:
    page = requests.get(title_link)
    # the pages are UTF-8; set the encoding explicitly so that characters
    # such as '£' are not mis-decoded
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')

    title = get_title(soup)
    price = get_price(soup)
    description = get_description(soup)
    availability = get_availability(soup)

    book_data.append([title, price, description, availability])
    print(title)  # check progress every iteration


A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
In Her Wake
How Music Works
Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More
Chase Me (Paris Nights #2)
Black Dust
Birdsong: A Story in Pictures
A
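
Picking the description by position (the fourth `<p>`) depends on every product page having the same paragraphs in the same order. A more defensive sketch is shown here, under the assumption that the description paragraph directly follows a `div` with the id `product_description`. The function name `get_description_safe` and the sample HTML are hypothetical stand-ins, not a copy of the live site.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for a product page (an assumption about the markup).
SAMPLE_HTML = """
<article class="product_page">
  <h1>Sample Book</h1>
  <p class="price_color">£10.00</p>
  <p class="instock availability">
    In stock (3 available)
  </p>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>A short description.</p>
</article>
"""

# A page without a description block, to exercise the fallback.
NO_DESCRIPTION_HTML = "<article class='product_page'><h1>Sample Book</h1></article>"

def get_description_safe(soup):
    """Return the description text, or None when the page has none."""
    header = soup.find("div", id="product_description")
    if header is None:
        return None
    # The description is assumed to be the next <p> sibling of the header div.
    paragraph = header.find_next_sibling("p")
    return paragraph.text.strip() if paragraph else None

description = get_description_safe(BeautifulSoup(SAMPLE_HTML, "html.parser"))
missing = get_description_safe(BeautifulSoup(NO_DESCRIPTION_HTML, "html.parser"))
print(description)
print(missing)
```

Returning `None` for a missing field lets the crawl continue and leaves an empty cell in the CSV file instead of raising an `IndexError` partway through the loop.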

In [6]:
book_data[600]  # check contents

['The Dream Thieves (The Raven Cycle #2)',
 '£34.50',
 'If you could steal things from dreams, what would you take?Ronan Lynch has secrets. Some he keeps from others. Some he keeps from himself.One secret: Ronan can bring things out of his dreams.And sometimes he\'s not the only one who wants those things.Ronan is one of the raven boys—a group of friends, practically brothers, searching for a dead king named Glendower, who they If you could steal things from dreams, what would you take?Ronan Lynch has secrets. Some he keeps from others. Some he keeps from himself.One secret: Ronan can bring things out of his dreams.And sometimes he\'s not the only one who wants those things.Ronan is one of the raven boys—a group of friends, practically brothers, searching for a dead king named Glendower, who they think is hidden somewhere in the hills by their elite private school, Aglionby Academy. The path to Glendower has long lived as an undercurrent beneath town. But now, like Ron
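
If scraped text contains sequences such as `Â£` where `£` is expected, a likely cause is that UTF-8 bytes were decoded as ISO-8859-1, which `requests` falls back to for `text/*` responses whose `Content-Type` header names no charset. The sketch below reproduces the effect with plain strings, with no network access; the sample price is only an illustration.

```python
price = "£51.77"

# The bytes a UTF-8 page would carry for this text.
raw_bytes = price.encode("utf-8")

# Decoding with the wrong codec turns the two-byte '£' into two characters.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec recovers the original text.
right = raw_bytes.decode("utf-8")

# Text that was already mis-decoded can be repaired by reversing the mistake.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repr(repaired))
```

In a scraper, setting `page.encoding = 'utf-8'` before reading `page.text`, or passing the raw `page.content` bytes to `BeautifulSoup`, avoids the problem at the source.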

In [7]:
df = pd.DataFrame(data = book_data)
df.columns = ['title', 'price', 'description', 'availability']
display(df.head())

#save to csv file 
df.to_csv('scrape.csv')

Unnamed: 0,title,price,description,availability
0,A Light in the Attic,£51.77,It's hard to imagine a world without A Light i...,In stock (22 available)
1,Tipping the Velvet,£53.74,"""Erotic and absorbing...Written with starling ...",In stock (20 available)
2,Soumission,£50.10,"Dans une France assez proche de la nôtre, un ...",In stock (20 available)
3,Sharp Objects,£47.82,"WICKED above her hipbone, GIRL across her hear...",In stock (20 available)
4,Sapiens: A Brief History of Humankind,£54.23,From a renowned historian comes a groundbreaki...,In stock (20 available)
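
Calling `df.to_csv('scrape.csv')` with the defaults also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of one way to avoid that, and to store the price as a number, is shown here. The single sample row and the `price_gbp` column name are made up for illustration, and an in-memory buffer stands in for the file.

```python
import io

import pandas as pd

# One made-up row in the same shape as book_data.
rows = [["Sample Book", "£10.00", "A short description.", "In stock (3 available)"]]
df = pd.DataFrame(rows, columns=["title", "price", "description", "availability"])

# Strip the currency symbol so the price can be used in calculations.
df["price_gbp"] = df["price"].str.lstrip("£").astype(float)

# index=False leaves the row index out of the CSV output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the CSV text back shows only the named columns.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(round_trip.columns))
```

When writing to a real file, passing `encoding='utf-8-sig'` to `to_csv` is one option if the file will be opened in a spreadsheet program that otherwise misreads the `£` sign.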
