# Web Scraping

In this tutorial we will:
1. Make an HTTP request to fetch the web page using `requests`
2. Parse the HTML using `BeautifulSoup`, then extract and clean the data
3. Load the data into a `pandas` DataFrame

### Import libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# URL of the web page
url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'

In [3]:
# Fetch the web page
response = requests.get(url)
print(response)


<Response [200]>


#### Accessing Response Content:
* `response.status_code`: To get the status code (e.g., 200).
* `response.text`: To get the HTML content of the page as a string.
* `response.json()`: If the response body is JSON, this method parses it into Python objects (typically a dictionary or a list).
* `response.headers`: To access the headers returned by the server.
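
The attributes above can be combined into a small fetch helper. This is a minimal sketch rather than part of the original notebook: the name `fetch_html`, the 10-second timeout, and the `Content-Type` check are illustrative assumptions. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses, so an error page is never parsed by mistake.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    # The Content-Type header says what kind of body the server sent
    content_type = response.headers.get('Content-Type', '')
    if 'html' not in content_type:
        raise ValueError(f'Expected HTML, got {content_type!r}')
    return response.text
```

In the notebook this could stand in for the bare `requests.get(url)` call, for example `html = fetch_html(url)`.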

In [4]:
# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
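# Extract data (example: book titles, prices, and availability)
# Each book sits in an <article class="product_pod"> element
books = soup.find_all('article', class_='product_pod')
# Equivalent lookup using an attrs dictionary (same result as the line above)
books = soup.find_all('article', attrs={'class':'product_pod'})
print(books)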

[<article class="product_pod">
<div class="image_container">
<a href="../../a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../../../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="../../a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>, <article class="product_pod">
<div class="image_container">
<a href="../../tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../../../media/ca
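
The two `find_all` calls in the cell above select the same elements. The sketch below checks this on a tiny inline document, which is made up for illustration and is not the real page. It also shows `select`, Beautiful Soup's CSS-selector method, as a third equivalent spelling.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (an assumption for illustration, not the real page)
demo_html = """
<section>
  <article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
  <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
  <article class="other"><h3><a title="Not a book">Not a book</a></h3></article>
</section>
"""
demo_soup = BeautifulSoup(demo_html, 'html.parser')

by_keyword = demo_soup.find_all('article', class_='product_pod')
by_attrs = demo_soup.find_all('article', attrs={'class': 'product_pod'})
by_css = demo_soup.select('article.product_pod')

# All three lookups find the same two <article> tags
print(len(by_keyword), len(by_attrs), len(by_css))
```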

In [6]:
# Initialize lists to hold extracted data
titles = []
prices = []
availabilities = []
ratings = []

for book in books:
    # Title
    title = book.find('h3').get_text()
    titles.append(title)
    
    # Price
    price = book.find('p', class_='price_color').text
    prices.append(price)
    
    # Availability
    availability = book.find('p', class_='instock availability').text.strip()
    availabilities.append(availability)

    # Rating: second class name of <p class="star-rating Three">
    rating = book.find('p', class_='star-rating')['class'][1]
    ratings.append(rating)

# Create a DataFrame
df = pd.DataFrame({
    'Title': titles,
    'Price': prices,
    'Availability': availabilities,
    'Rating': ratings
})
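
The link text inside `<h3>` is truncated on the page ("A Light in the ..."), while the link's `title` attribute carries the full name, as the HTML printed earlier shows. The sketch below gathers the four lookups into one helper that reads the full title. The name `parse_book` is hypothetical, and the sample markup is an abbreviated copy of the first `<article>` from that output.

```python
from bs4 import BeautifulSoup

# Markup of the first book, abbreviated from the output printed earlier
sample_html = """
<article class="product_pod">
<p class="star-rating Three"></p>
<h3><a href="../../a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>

        In stock

</p>
</div>
</article>
"""


def parse_book(book):
    """Extract one book's fields from a product_pod <article> tag."""
    link = book.find('h3').find('a')
    rating_tag = book.find('p', class_='star-rating')
    return {
        # The title attribute is not truncated, unlike the link text
        'Title': link['title'],
        'Price': book.find('p', class_='price_color').get_text(strip=True),
        'Availability': book.find('p', class_='availability').get_text(strip=True),
        # class is a multi-valued attribute, e.g. ['star-rating', 'Three']
        'Rating': rating_tag['class'][1],
    }


sample_book = BeautifulSoup(sample_html, 'html.parser').find('article')
print(parse_book(sample_book))
```

Returning a dictionary per book also means a list of these dictionaries can be passed straight to `pd.DataFrame`, instead of maintaining four parallel lists.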

#### Scrape all the pages

In [7]:
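# The pager text reads "Page 1 of 50", surrounded by whitespace
limit = soup.find('li', class_='current').text.strip()
print(limit)
# The last word is the total number of pages
limit = int(limit.split()[-1])
print(f'Total pages: {limit}')

Page 1 of 50
Total pages: 50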


In [8]:
# Pages 2..limit follow the page-{i}.html pattern; page 1 was scraped above
for i in range(2, limit+1):
    url = f'http://books.toscrape.com/catalogue/category/books_1/page-{i}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    books = soup.find_all('article', class_='product_pod')
    # Append each book's fields to the lists started in the first-page loop
    for book in books:
        title = book.find('h3').get_text()
        titles.append(title)
        price = book.find('p', class_='price_color').text
        prices.append(price)
        availability = book.find('p', class_='instock availability').text.strip()
        availabilities.append(availability)
        rating = book.find('p', class_='star-rating')['class'][1]
        ratings.append(rating)
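
The loop above sends one request per page back to back and mixes URL building, fetching, and parsing. One possible restructuring is sketched below, under stated assumptions: the names `page_urls` and `scrape_pages` are hypothetical, the 0.5-second delay is an arbitrary courtesy pause, and the fetch and parse steps are passed in as functions so the flow can be tried without touching the network.

```python
import time


def page_urls(last_page):
    """Yield the URLs of pages 2..last_page, following the pattern used above."""
    base = 'http://books.toscrape.com/catalogue/category/books_1/page-{}.html'
    for i in range(2, last_page + 1):
        yield base.format(i)


def scrape_pages(urls, fetch, parse, delay=0.5):
    """Fetch each URL and collect the rows that parse() returns.

    fetch(url) -> HTML string and parse(html) -> list of rows are supplied
    by the caller; delay (seconds, an assumed value) spaces out requests.
    """
    rows = []
    for url in urls:
        rows.extend(parse(fetch(url)))
        time.sleep(delay)
    return rows


# Offline demonstration with stand-in callables instead of real requests
def fake_fetch(url):
    return url


def fake_parse(html):
    return [html.rsplit('/', 1)[-1]]


demo_rows = scrape_pages(page_urls(4), fake_fetch, fake_parse, delay=0)
print(demo_rows)
```

In the notebook, `fetch` could wrap `requests.get(url).text` and `parse` could build the soup and read each `product_pod` article.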

In [9]:
# Create a DataFrame
df = pd.DataFrame({
    'Title': titles,
    'Price': prices,
    'Availability': availabilities,
    'Rating': ratings
})

In [10]:
df

|     | Title | Price | Availability | Rating |
| --- | --- | --- | --- | --- |
| 0 | A Light in the ... | £51.77 | In stock | Three |
| 1 | Tipping the Velvet | £53.74 | In stock | One |
| 2 | Soumission | £50.10 | In stock | One |
| 3 | Sharp Objects | £47.82 | In stock | Four |
| 4 | Sapiens: A Brief History ... | £54.23 | In stock | Five |
| ... | ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's ... | £55.53 | In stock | One |
| 996 | Ajin: Demi-Human, Volume 1 ... | £57.06 | In stock | Four |
| 997 | A Spy's Devotion (The ... | £16.97 | In stock | Five |
| 998 | 1st to Die (Women's ... | £53.98 | In stock | One |
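
The `Price` and `Rating` columns are still strings at this point, so they cannot be sorted or averaged numerically. A minimal cleaning sketch follows, run on a few rows copied from the output above. The `rating_map` dictionary is an assumption: the output shows the words One, Three, Four, and Five, and Two is presumed to complete the scale.

```python
import pandas as pd

# A few rows copied from the output above
sample = pd.DataFrame({
    'Title': ['A Light in the ...', 'Tipping the Velvet', 'Sharp Objects'],
    'Price': ['£51.77', '£53.74', '£47.82'],
    'Availability': ['In stock', 'In stock', 'In stock'],
    'Rating': ['Three', 'One', 'Four'],
})

# Assumed word-to-number scale for the star ratings
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

cleaned = sample.assign(
    # Drop the currency symbol and convert to a number
    Price=sample['Price'].str.replace('£', '', regex=False).astype(float),
    # Replace the rating word with its integer value
    Rating=sample['Rating'].map(rating_map),
)
print(cleaned.dtypes)
```

The same two transformations could be applied to the full `df` once all pages are loaded.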
