
This project exercises some web scraping skills using Python.


📚 1. Business Context

The Coffee&Cookies Books Club is a startup in the UK that wants to build a subscription-based book club for lovers of coffee and books.

Coffee&Cookies picks only hard-to-find, highly rated books. The company packs them with a unique reading guide and a wellness set that includes exotic coffee varieties and specialty candies.

Their market research showed that customers keep their subscriptions longer when they receive a mix of casual and business books from the following categories:

  • Classics
  • Science Fiction
  • Humor
  • Business

Coffee&Cookies needs to decide which books to deliver to their subscribers next, and asked me to put together a dataset including the following info:

  • Book name
  • Price in GBP
  • Customer Rating
  • Stock availability

💭 2. Development Strategy

Desired Output

The customer asked for the raw information, as it will be used later for further analysis.

  • *.csv file with the information

A table will be created containing the following information:

| Column | Description |
| --- | --- |
| scrap_date | the date the scraping was run |
| book_title | the title of the book |
| book_category | the category page the book was listed under |
| book_upc | the book's Universal Product Code, a unique identifier for each title |
| book_price | price excl. tax, in GBP |
| book_stars | rating from 1 to 5 |
| book_in_stock | whether the book is available |
| nr_available | how many copies are in stock |

🛠 Process

Tools used:

  • Jupyter Notebook
  • Python 3.9
  • BeautifulSoup4
  • Pandas
  • Requests

STEP 01: use a browser to inspect the website's tag structure and find where the data we need is embedded in the HTML.
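
Dev-tools inspection can be cross-checked from Python. A minimal sketch, assuming the category URL pattern books.toscrape.com uses (the exact slug, `classics_6` here, should be copied from the site's sidebar):

```python
import requests
from bs4 import BeautifulSoup

# Fetch one category page and pretty-print the HTML to study its tag structure.
# The category slug is an assumption; copy the real one from the site's sidebar.
url = "https://books.toscrape.com/catalogue/category/books/classics_6/index.html"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:2000])  # the first ~2000 characters are enough to spot the patterns
```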

STEP 02: scrape the catalogue pages to build a dataset of available titles (URLs) with their respective categories.
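
A sketch of this step, assuming each listing sits in an `<article class="product_pod">` element as on books.toscrape.com; the category slugs in the dictionary are assumptions and should be verified against the site's sidebar:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/category/books/"

# Hypothetical slugs for the four requested categories; verify them on the site.
CATEGORIES = {
    "Classics": "classics_6/index.html",
    "Science Fiction": "science-fiction_16/index.html",
    "Humor": "humor_30/index.html",
    "Business": "business_35/index.html",
}

def list_book_urls(category_page: str) -> list[str]:
    """Return absolute URLs for every book listed on one category page."""
    page_url = urljoin(BASE, category_page)
    response = requests.get(page_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Each book is an <article class="product_pod">; its title link holds a
    # relative href to the book's detail page.
    return [urljoin(page_url, a["href"])
            for a in soup.select("article.product_pod h3 a")]

book_urls = {cat: list_book_urls(page) for cat, page in CATEGORIES.items()}
```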

STEP 03: use the list of URLs created in step 02 to access each book's page and scrape the remaining information needed.
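
A possible parser for one book page, assuming the layout books.toscrape.com uses: a `product_main` block holding the title and a star rating encoded as a CSS class, plus a "Product Information" table mapping labels to values:

```python
import re
import requests
from bs4 import BeautifulSoup

WORD_TO_STARS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def scrape_book(url: str) -> dict:
    """Scrape title, UPC, price, rating and availability from one book page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    main = soup.select_one("div.product_main")

    # The product information table maps labels (th) to values (td), row by row.
    info = {row.th.get_text(): row.td.get_text() for row in soup.select("table tr")}

    # The star rating is encoded as a class name, e.g. <p class="star-rating Three">.
    rating_word = main.select_one("p.star-rating")["class"][1]

    availability = info.get("Availability", "")
    count = re.search(r"(\d+)", availability)

    return {
        "book_title": main.h1.get_text(),
        "book_upc": info.get("UPC"),
        "book_price": info.get("Price (excl. tax)"),
        "book_stars": WORD_TO_STARS.get(rating_word),
        "book_in_stock": "In stock" in availability,
        "nr_available": int(count.group(1)) if count else 0,
    }
```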

STEP 04: join the scraped data into a single table.
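
Continuing the sketch: each record from step 03 gets stamped with its category and the scrape date, then everything is collected into a pandas DataFrame (`book_urls` and `scrape_book` come from the sketches above):

```python
from datetime import date

import pandas as pd

rows = []
for category, urls in book_urls.items():
    for url in urls:
        record = scrape_book(url)
        record["book_category"] = category
        record["scrap_date"] = date.today().isoformat()
        rows.append(record)

books = pd.DataFrame(rows)
```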

STEP 05: transform and clean the data, export it to a *.csv file, and deliver it to Coffee&Cookies.
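
A final pass might pull the numeric part out of the price string (sidestepping any encoding quirks around the £ sign), order the columns to match the schema above, and write the CSV; the output filename is an assumption:

```python
# Keep only the numeric portion of the price, e.g. "£51.77" -> 51.77.
books["book_price"] = (
    books["book_price"].str.extract(r"(\d+\.\d{2})", expand=False).astype(float)
)

# Match the column order agreed with Coffee&Cookies and export.
columns = ["scrap_date", "book_title", "book_category", "book_upc",
           "book_price", "book_stars", "book_in_stock", "nr_available"]
books[columns].to_csv("books_catalogue.csv", index=False)
```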

⏩ Inputs

I will use https://books.toscrape.com as the data source. It is a website built for scraping practice and can be used free of charge.

🚀 Next Steps

  • Add pagination support for categories with more than 20 books. The categories requested by Coffee&Cookies don't need it yet, but it would be nice to have it ready (see the sketch after this list).
  • Add an interactive filter to select the categories.
  • Deploy this to production for easier use, possibly using Streamlit and Heroku.
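
For the pagination item above, one possible approach follows the "next" link (`<li class="next">`) that books.toscrape.com renders on category pages holding more than 20 titles:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def list_all_book_urls(category_url: str) -> list[str]:
    """Walk a category's pages via the 'next' link and collect every book URL."""
    urls, page_url = [], category_url
    while page_url:
        soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
        urls += [urljoin(page_url, a["href"])
                 for a in soup.select("article.product_pod h3 a")]
        next_link = soup.select_one("li.next a")
        page_url = urljoin(page_url, next_link["href"]) if next_link else None
    return urls
```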
