# Cleaning web-scraped data using Python

Author: [Masud Rahman](masud90.github.io)

This project scrapes a website (with permission) for a dataset, cleans the data, and saves the result to a CSV file.

## Initialize setup

In [41]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Get data

In [43]:
# URL of the website to scrape
url = "https://books.toscrape.com/"

# Send a GET request to the URL (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the book items on the page
    books = soup.find_all('article', class_='product_pod')
    
    # Initialize an empty list to store book data
    books_data = []
    
    # Iterate over each book item and extract the required information
    for book in books:
        # Extract book title
        title = book.h3.a['title']
        
        # Extract book price
        price = book.find('p', class_='price_color').text
        
        # Extract availability
        availability = book.find('p', class_='instock availability').text.strip()
        
        # Add the extracted data to the list
        books_data.append((title, price, availability))
    
    # Print or process the extracted data
    for book in books_data:
        print(f"Title: {book[0]}, Price: {book[1]}, Availability: {book[2]}")
else:
    print(f"Failed to retrieve the webpage (status code: {response.status_code})")

Title: A Light in the Attic, Price: £51.77, Availability: In stock
Title: Tipping the Velvet, Price: £53.74, Availability: In stock
Title: Soumission, Price: £50.10, Availability: In stock
Title: Sharp Objects, Price: £47.82, Availability: In stock
Title: Sapiens: A Brief History of Humankind, Price: £54.23, Availability: In stock
Title: The Requiem Red, Price: £22.65, Availability: In stock
Title: The Dirty Little Secrets of Getting Your Dream Job, Price: £33.34, Availability: In stock
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, Price: £17.93, Availability: In stock
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, Price: £22.60, Availability: In stock
Title: The Black Maria, Price: £52.15, Availability: In stock
Title: Starving Hearts (Triangular Trade Trilogy, #1), Price: £13.99, Availability: In stock
Title: Shakespeare's Sonnets, Price: £20.66, Availability: In stock
Title: Set
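
The extraction loop above can also be written as a function that takes an HTML string, which makes it possible to try the parsing logic without a network request. The following is a minimal sketch, not the notebook's actual method: `parse_books` and `sample_html` are hypothetical names, and the sample markup is an assumption modelled on the tags and classes the cell above searches for.

```python
from bs4 import BeautifulSoup


def parse_books(html):
    """Return a list of (title, price, availability) tuples from listing-page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        title_link = book.find("a", title=True)
        price_tag = book.find("p", class_="price_color")
        stock_tag = book.find("p", class_="availability")
        # Skip entries that are missing any of the three fields
        if title_link is None or price_tag is None or stock_tag is None:
            continue
        rows.append((
            title_link["title"],
            price_tag.get_text(strip=True),
            stock_tag.get_text(strip=True),
        ))
    return rows


# Hypothetical markup (an assumption), shaped like the elements searched for above
sample_html = """
<article class="product_pod">
  <h3><a href="#" title="Example Book">Example ...</a></h3>
  <p class="price_color">£9.99</p>
  <p class="instock availability">
    In stock
  </p>
</article>
<article class="product_pod">
  <h3><a href="#" title="Incomplete Entry">Incomplete ...</a></h3>
</article>
"""

parsed = parse_books(sample_html)
print(parsed)
```

Skipping incomplete entries is one possible design choice; recording them with missing values instead would keep the row count aligned with the page.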

## Generate the dataset
We now load the scraped data into a pandas DataFrame.

In [45]:
# Create a pandas DataFrame
df = pd.DataFrame(books_data, columns=["Title", "Price", "Availability"])
df.head()

| | Title | Price | Availability |
|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock |
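
Because the scraped prices still carry the currency symbol, the `Price` column holds text rather than numbers at this stage. A small sketch of how that could be confirmed before cleaning, using two rows copied from the output above (`sample_df` and `price_is_numeric` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Two rows copied from the scraped output above
rows = [
    ("A Light in the Attic", "£51.77", "In stock"),
    ("Tipping the Velvet", "£53.74", "In stock"),
]
sample_df = pd.DataFrame(rows, columns=["Title", "Price", "Availability"])

# Prices scraped as text are not a numeric dtype until they are converted
price_is_numeric = pd.api.types.is_numeric_dtype(sample_df["Price"])

print(sample_df.dtypes)
print(f"Price column is numeric: {price_is_numeric}")
```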


## Data cleaning

In [47]:
# Convert the price from string to float
df['Price'] = df['Price'].str.replace('£', '', regex=False).astype(float)
df.head()

| | Title | Price | Availability |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | In stock |
| 1 | Tipping the Velvet | 53.74 | In stock |
| 2 | Soumission | 50.1 | In stock |
| 3 | Sharp Objects | 47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | In stock |
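
The conversion above assumes every price is a `£` sign followed by a valid number. If the page were decoded with the wrong encoding, or a price were missing, `astype(float)` would raise an error. The sketch below shows one more tolerant alternative, under the assumption that prices only ever contain digits and a decimal point once stray characters are removed. `clean_price` is a hypothetical helper and the sample values are made up for illustration.

```python
import pandas as pd


def clean_price(prices: pd.Series) -> pd.Series:
    """Drop everything except digits and the decimal point, then convert to float."""
    stripped = prices.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # Values that cannot be parsed become NaN instead of raising an error
    return pd.to_numeric(stripped, errors="coerce")


# Made-up inputs: a clean price, a mis-decoded symbol, stray whitespace, a missing value
raw = pd.Series(["£51.77", "Â£53.74", " £50.10 ", "n/a"])
cleaned = clean_price(raw)
print(cleaned)
```

Coercing to `NaN` keeps the notebook running, but it also hides bad rows, so it is worth counting the missing values afterwards.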


## Basic checks
Next, we compute a few summary statistics: the number of books, the average price, and the most and least expensive titles.

In [49]:
print("\nBasic Statistics:")
print(f"Total number of books: {df.shape[0]}")
print(f"Average price of books: £{df['Price'].mean():.2f}")
print(f"Most expensive book: {df.loc[df['Price'].idxmax(), 'Title']} (£{df['Price'].max():.2f})")
print(f"Least expensive book: {df.loc[df['Price'].idxmin(), 'Title']} (£{df['Price'].min():.2f})")


Basic Statistics:
Total number of books: 20
Average price of books: £38.05
Most expensive book: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991 (£57.25)
Least expensive book: Starving Hearts (Triangular Trade Trilogy, #1) (£13.99)
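
Summary statistics are easier to trust once the data has passed a few quality checks. The sketch below collects some common ones (missing values, duplicate titles, non-positive prices, availability counts) into a hypothetical `basic_checks` helper. The column names match the DataFrame above, and the sample rows are made up to trigger each check.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame) -> dict:
    """Return simple data-quality counts for a books DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_titles": int(df["Title"].duplicated().sum()),
        "non_positive_prices": int((df["Price"] <= 0).sum()),
        "availability_counts": df["Availability"].value_counts().to_dict(),
    }


# Made-up rows containing one duplicate title, one missing price, and one zero price
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book A"],
    "Price": [10.0, None, 0.0],
    "Availability": ["In stock", "In stock", "Out of stock"],
})

report = basic_checks(sample)
for name, value in report.items():
    print(f"{name}: {value}")
```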


## Save the data

In [51]:
# Save the DataFrame to a CSV file
df.to_csv('books_data.csv', index=False)

print("Data saved to books_data.csv")

Data saved to books_data.csv
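
A quick way to confirm the export is to read the file back and compare it with the original DataFrame. The following sketch does this with a temporary file and two rows taken from the cleaned output above. The file name `books_sample.csv` and the explicit `encoding="utf-8"` argument are assumptions added for illustration.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    "Title": ["A Light in the Attic", "Tipping the Velvet"],
    "Price": [51.77, 53.74],
    "Availability": ["In stock", "In stock"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books_sample.csv")
    # index=False keeps the row index out of the file, so no extra column appears on reload
    sample.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

# Raises an AssertionError if the reloaded data differs from the original
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip OK")
```

Writing without `index=False` and then reloading is what typically produces an extra `Unnamed: 0` column.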
