# Web Scraping Setup and Data Collection
This notebook scrapes book data from books.toscrape.com. It performs the following tasks:

1. Imports the required libraries (pandas, requests, BeautifulSoup)
2. Defines functions that collect, for each book:
   - Title
   - Stock availability
   - Price
   - Rating (1-5 stars)
   - Link to the book's page (the raw `href` value)
3. Scrapes all 50 catalogue pages of the website
4. Saves the collected data to a CSV file

The code demonstrates web scraping with BeautifulSoup and tabular data handling with pandas.


In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# One row per scraped book; get_data_from_url() appends to this global DataFrame
data_set = pd.DataFrame(columns=['book_name', 'instock', 'price', 'book_score', 'link'])


# Parsing Functions Documentation

The next cell defines the two functions used to scrape the book data:

1. `get_book_score(book)`:
   - Takes a book HTML element as input
   - Returns a rating from 1 to 5 based on the element's `star-rating` class
   - Returns None if no valid rating is found

2. `get_data_from_url(page_number)`:
   - Takes a page number as input
   - Makes an HTTP request to books.toscrape.com for that page
   - Parses the HTML using BeautifulSoup to extract:
     - Book title
     - Stock status
     - Price
     - Rating (using `get_book_score`)
     - Link (the `href` of the title anchor)
   - Adds each book's data as a new row to the global `data_set` DataFrame
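
As a hedged alternative to checking each rating class in turn, the rating word can be looked up once in a dictionary. The sketch below assumes the rating is stored as a second CSS class next to `star-rating` (for example `star-rating Three`), which is what the class names used in this notebook suggest. `RATING_WORDS` and `rating_from_classes` are illustrative names, not part of the notebook.

```python
# Illustrative helper (not part of the notebook): map a rating word to a number.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_from_classes(css_classes):
    """Return 1-5 for a class list such as ['star-rating', 'Three'], else None."""
    if 'star-rating' not in css_classes:
        return None
    for name in css_classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


print(rating_from_classes(['star-rating', 'Three']))  # 3
print(rating_from_classes(['price_color']))           # None
```

BeautifulSoup exposes a tag's `class` attribute as a list of names, so a call along the lines of `rating_from_classes(tag.get('class', []))` would fit this helper.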


In [6]:
def get_book_score(book):
    """Return the star rating (1-5) of a book element, or None if it has none."""
    # A multi-word class_ string matches the exact value of the class attribute
    if book.find('p', class_='star-rating One'):
        return 1
    elif book.find('p', class_='star-rating Two'):
        return 2
    elif book.find('p', class_='star-rating Three'):
        return 3
    elif book.find('p', class_='star-rating Four'):
        return 4
    elif book.find('p', class_='star-rating Five'):
        return 5
    else:
        return None


def get_data_from_url(page_number):
    """Scrape one catalogue page and append its books to the global data_set."""
    response = requests.get(
        f'http://books.toscrape.com/catalogue/page-{page_number}.html',
        timeout=10,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    for book in books:
        book_name = book.h3.a['title']  # title attribute of the anchor
        instock = book.find('p', class_='instock availability').text.strip()
        price = book.find('p', class_='price_color').text.strip()
        book_score = get_book_score(book)
        link = book.h3.a['href']  # href as written in the page, not resolved
        # Append the book as the next row of the DataFrame
        data_set.loc[len(data_set)] = [book_name, instock, price, book_score, link]
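
The `link` column stores each `href` exactly as it appears in the page. If those values are relative (an assumption here; the sample path below is made up for illustration), they can be resolved against the URL of the page they came from with the standard library's `urljoin`. `CATALOGUE_PAGE` and `absolute_link` are hypothetical names introduced for this sketch.

```python
from urllib.parse import urljoin

# Same URL pattern that get_data_from_url requests
CATALOGUE_PAGE = 'http://books.toscrape.com/catalogue/page-{}.html'


def absolute_link(page_number, href):
    """Resolve an href found on a catalogue page against that page's URL."""
    return urljoin(CATALOGUE_PAGE.format(page_number), href)


# 'some-book_1/index.html' is a made-up relative path used only for illustration.
print(absolute_link(1, 'some-book_1/index.html'))
# http://books.toscrape.com/catalogue/some-book_1/index.html
```

An `href` that is already absolute passes through `urljoin` unchanged, so the helper is safe to apply to every row.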


# Scraping 50 Pages of Book Data

The next cells scrape all 50 pages of books.toscrape.com and collect:

1. Book Information
   - Title
   - Price
   - Stock availability status
   - Rating (1-5 stars)
   - Link to the book's page

2. Scraping Process:
   - Iterates through pages 1-50
   - Each page contains 20 books
   - Total books scraped: 50 × 20 = 1000
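
The loop over pages 1-50 can be made more forgiving by pausing between requests and recording pages that fail rather than stopping at the first error. This is only a sketch: `scrape_pages`, its `delay` parameter, and the idea of passing the per-page function in as an argument are assumptions added here, not something the notebook does.

```python
import time


def scrape_pages(scrape_one, first=1, last=50, delay=0.5):
    """Call scrape_one(page) for each page; return the pages that raised."""
    failed = []
    for page_number in range(first, last + 1):
        try:
            scrape_one(page_number)
        except Exception as exc:  # broad on purpose: any failure skips the page
            print(f'page {page_number} failed: {exc}')
            failed.append(page_number)
        time.sleep(delay)  # small pause between requests
    return failed


# Demo with a stand-in function so the sketch runs without network access.
def fake_scrape(page_number):
    if page_number == 3:
        raise ValueError('simulated failure')


# Reports the simulated failure for page 3, then prints the failed pages.
print(scrape_pages(fake_scrape, first=1, last=5, delay=0))  # [3]
```

With the notebook's own function, the call might look like `failed = scrape_pages(get_data_from_url)`, after which the pages in `failed` could be retried.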


In [7]:
# there are 50 pages
for page_number in range(1, 51):
    get_data_from_url(page_number)

In [8]:
data_set.to_csv('data/raw/books.csv')
data_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   book_name   1000 non-null   object
 1   instock     1000 non-null   object
 2   price       1000 non-null   object
 3   book_score  1000 non-null   int64 
 4   link        1000 non-null   object
dtypes: int64(1), object(4)
memory usage: 46.9+ KB
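
The `info()` output shows `price` stored as `object`, which here means the values are still text. A minimal sketch of converting them to numbers follows, assuming each string contains a single decimal amount next to a currency symbol (the sample values are invented). `parse_price` is a hypothetical helper, not part of the notebook.

```python
import re

# First run of digits with an optional decimal part
_AMOUNT = re.compile(r'\d+(?:\.\d+)?')


def parse_price(text):
    """Return the first decimal number in text as a float, or None if absent."""
    match = _AMOUNT.search(text)
    return float(match.group()) if match else None


print(parse_price('£51.77'))  # 51.77
print(parse_price('n/a'))     # None
```

Because the pattern ignores everything around the digits, stray characters next to the currency symbol do not affect the result. Applied to the scraped data, something like `data_set['price'].map(parse_price)` should yield a numeric column.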
