Web scraping is the process of automatically extracting information from websites.
Instead of a human manually copying and pasting data, a script (or "bot") fetches the web page and pulls out the required information.

#### Why is it useful?
* **Market research:** tracking product prices and reviews.
* **Lead generation:** finding potential customers ("leads") who may be interested in your product or service, e.g., collecting contact info from directories.
* **Data analysis:** gathering sports statistics or stock prices.

#### The Golden Rule: Scraping Responsibly
This is the most important part of web scraping. Always be a good web citizen.

* Check `robots.txt`: Most websites publish a file at `www.example.com/robots.txt` that tells bots which pages they may and may not visit. Always respect it.

* Read the Terms of Service (ToS): The website's ToS may explicitly forbid scraping.

* Don't Overwhelm the Server: Send requests at a reasonable rate. A human clicks a link every few seconds; your script should do the same. Use `time.sleep()` between requests.

* Scrape for Public Data: Only scrape data that is publicly visible and not behind a login wall, unless you have explicit permission.
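The `robots.txt` check can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses a made-up `robots.txt` so that it runs without network access; the rules, the user agent string, and the `www.example.com` URLs are illustrative assumptions, not taken from any real site. For a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "My Web Scraper 1.0"


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our bot may visit each URL.
allowed = parser.can_fetch(USER_AGENT, "http://www.example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "http://www.example.com/private/data.html")

# Some sites also request a minimum pause between requests.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

If `crawl_delay` returns a number, it is a reasonable value to pass to `time.sleep()` between requests; if it returns `None`, the site did not state a preference and a pause of a second or more is still polite.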

Scraping follows a simple process:
* `Request`: Send an HTTP request to the website's server to get the page's content.
* `Parse`: Interpret the raw HTML content into a structured format.
* `Extract`: Find and pull the specific data you need from the parsed structure.
* `Store`: Save the extracted data in a useful format (like CSV or a database).

#### Our Tools:
* `requests`: A fantastic Python library for making HTTP requests.
* `beautifulsoup4`: The master tool for parsing messy HTML and XML.
* `lxml`: A very fast parser that BeautifulSoup can use as its backend. We select it by passing `'lxml'` when creating the soup.
* `pandas`: The go-to library for data analysis and manipulation, perfect for storing our data.
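Before importing, it can help to confirm that all four libraries are installed. The following is a small sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are names introduced here for illustration. Note that the pip package `beautifulsoup4` is imported under the module name `bs4`.

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in `import` statements.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pandas": "pandas",
}


def missing_packages(required):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```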

In [29]:
import requests
import time 
from bs4 import BeautifulSoup
import pandas as pd 
print("Libraries imported successfully!")

Libraries imported successfully!


In [30]:
URL="http://books.toscrape.com/"
headers={
    "User-Agent":"My Web Scraper 1.0 - for educational purposes"
}
#Making request (the timeout stops the call from hanging if the server never answers)
response=requests.get(URL,headers=headers,timeout=10)

In [31]:
response

<Response [200]>

In [32]:
#check the status code 
if response.status_code==200:
    print("Success! request was successful.")
    print("First 500 characters of the page:")
    print(response.text[:500])
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Success! request was successful.
First 500 characters of the page:
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /
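The check above treats only `200` as success. When a request fails, it helps to know what the status code means. The sketch below uses the standard library's `http.HTTPStatus` to turn a code into a readable label; `describe_status` is a helper name introduced here, and the list of codes is just a sample. With `requests` you can also call `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus only knows the registered status codes.
        return f"{code} (non-standard status code)"


# A few codes a scraper commonly runs into.
for code in (200, 404, 429, 503):
    print(describe_status(code))
```

A `429` or `503` usually means the script should slow down or try again later, while a `404` means the URL itself is wrong.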


In [33]:
#create a beautifulSoup object
soup=BeautifulSoup(response.text,'lxml')

print(soup)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

In [34]:
#we can use .prettify() to see the HTML in a nicely indented format
print(soup.prettify())

#--Finding Elements--
#Find the first element of a certain tag (eg., the page title)

page_title=soup.find('title')
print(f"Page Title: {page_title.text}")

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [35]:
#Find the first H1 tag
main_header = soup.find('h1')
print(f"Main Header: {main_header.text}")

#Find an element by its class
#note: "class" is a reserved keyword in python, so we use 'class_'

a_book_price = soup.find('p', class_ ='price_color')
print(f"Price of the first book: {a_book_price.text}")

Main Header: All products
Price of the first book: Â£51.77
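The `Â` in front of the pound sign is an encoding artifact, not part of the price. A likely cause (an assumption, since the response headers are not shown here) is that the server does not declare a charset, so `requests` falls back to ISO-8859-1 when building `response.text`, while the page itself is UTF-8, as its `<meta>` tag states. The sketch below reproduces the effect with plain strings and shows how to reverse it.

```python
# The price as the server sends it: "£" takes two bytes in UTF-8.
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec reproduces the garbled output.
garbled = raw_bytes.decode("iso-8859-1")
print(garbled)

# Decoding with the right codec gives the intended text.
fixed = raw_bytes.decode("utf-8")
print(fixed)

# An already-garbled string can be repaired by reversing the wrong step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == fixed)
```

In the scraper itself, setting `response.encoding = 'utf-8'` before reading `response.text`, or passing the raw bytes in `response.content` to `BeautifulSoup`, should avoid the problem at the source.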


In [36]:
all_h1_tags=soup.find_all('h1')
all_prices = soup.find_all('p',class_='price_color')

In [37]:
for h1_tags in all_h1_tags:
    print(h1_tags.text)

All products


In [38]:
all_h2_tags=soup.find_all('h2')
all_prices = soup.find_all('p',class_='price_color')

In [39]:
for h2_tags in all_h2_tags:
    print(h2_tags.text)

In [40]:
for price in all_prices:
    print(price.text)

Â£51.77
Â£53.74
Â£50.10
Â£47.82
Â£54.23
Â£22.65
Â£33.34
Â£17.93
Â£22.60
Â£52.15
Â£13.99
Â£20.66
Â£17.46
Â£52.29
Â£35.02
Â£57.25
Â£23.88
Â£37.59
Â£51.33
Â£45.17
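These prices are still strings, so they cannot be summed or sorted numerically. One way to convert them is to pull out the numeric part with a regular expression, which also sidesteps the stray currency characters. This is a minimal sketch: `parse_price` is a helper name introduced here, and it assumes prices without thousands separators, as in the output above.

```python
import re


def parse_price(text):
    """Extract the numeric part of a price string and return it as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return float(match.group())


# The first three prices from the output above.
prices = [parse_price(p) for p in ["Â£51.77", "Â£53.74", "Â£50.10"]]
print(prices)
print(round(sum(prices), 2))
```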


In [41]:
#Extracting data from the page
#Our goal is to get the title, price, and star rating for every book on the first page.

In [42]:
#Find all the book containers

books=soup.find_all('article',class_='product_pod')
print(f"Found {len(books)} books on the page. ")

Found 20 books on the page. 


In [43]:
books_data=[]

In [44]:
for book in books:
    #get the title from the <a> tag inside the <h3> tag
    title=book.find('h3').find('a')['title']
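    #get the price text from the <p> tag with class "price_color"
    price=book.find('p',class_='price_color').text
    #the tag is <p class="star-rating Three">, so we want the second class 
    rating = book.find('p',class_='star-rating')['class'][1]
    #get the availability of the book
    #note: the tag is <p class="instock availability">, so index 1 is the class name
    #'availability', not the stock text; use .get_text(strip=True) to read "In stock"
    stock=book.find('p',class_='instock')['class'][1]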

    #store the extracted data in a dictionary
    book_info={
        'Title':title,
        'Price':price,
        'Stock':stock,
        'Rating':f"{rating} out of Five"
    }

    #Add the dictionary to our list
    books_data.append(book_info)
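The rating is stored as a word ("Three") because that is how it appears in the tag's class list. For sorting or averaging, a number is more convenient. The sketch below maps the words to integers; `RATING_WORDS` and `rating_to_int` are names introduced here, and the two sample records reuse titles and ratings from this page.

```python
# Lookup table from the class name used on the site to a number of stars.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Convert a star-rating class name such as 'Three' to the integer 3.

    Returns None for words that are not in the lookup table.
    """
    return RATING_WORDS.get(word)


sample = [
    {"Title": "A Light in the Attic", "Rating": "Three"},
    {"Title": "Tipping the Velvet", "Rating": "One"},
]

# Replace each rating word with its numeric value.
for book in sample:
    book["Rating"] = rating_to_int(book["Rating"])

print(sample)
```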

In [45]:
books_data

[{'Title': 'A Light in the Attic',
  'Price': 'Â£51.77',
  'Stock': 'availability',
  'Rating': 'Three out of Five'},
 {'Title': 'Tipping the Velvet',
  'Price': 'Â£53.74',
  'Stock': 'availability',
  'Rating': 'One out of Five'},
 {'Title': 'Soumission',
  'Price': 'Â£50.10',
  'Stock': 'availability',
  'Rating': 'One out of Five'},
 {'Title': 'Sharp Objects',
  'Price': 'Â£47.82',
  'Stock': 'availability',
  'Rating': 'Four out of Five'},
 {'Title': 'Sapiens: A Brief History of Humankind',
  'Price': 'Â£54.23',
  'Stock': 'availability',
  'Rating': 'Five out of Five'},
 {'Title': 'The Requiem Red',
  'Price': 'Â£22.65',
  'Stock': 'availability',
  'Rating': 'One out of Five'},
 {'Title': 'The Dirty Little Secrets of Getting Your Dream Job',
  'Price': 'Â£33.34',
  'Stock': 'availability',
  'Rating': 'Four out of Five'},
 {'Title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  'Price': 'Â£17.93',
  'Stock': 'availability',
  'Rating

In [46]:
print("\n---Data for the first 3 books---")
for i in range(3):
    print(books_data[i])


---Data for the first 3 books---
{'Title': 'A Light in the Attic', 'Price': 'Â£51.77', 'Stock': 'availability', 'Rating': 'Three out of Five'}
{'Title': 'Tipping the Velvet', 'Price': 'Â£53.74', 'Stock': 'availability', 'Rating': 'One out of Five'}
{'Title': 'Soumission', 'Price': 'Â£50.10', 'Stock': 'availability', 'Rating': 'One out of Five'}


In [47]:
#Storing data with pandas

In [48]:
##convert the list of dictionaries to a pandas dataframe

In [49]:
df=pd.DataFrame(books_data)

In [50]:
print("---Pandas Dataframe---")
display(df.head())

---Pandas Dataframe---


|   | Title | Price | Stock | Rating |
|---|-------|-------|-------|--------|
| 0 | A Light in the Attic | Â£51.77 | availability | Three out of Five |
| 1 | Tipping the Velvet | Â£53.74 | availability | One out of Five |
| 2 | Soumission | Â£50.10 | availability | One out of Five |
| 3 | Sharp Objects | Â£47.82 | availability | Four out of Five |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | availability | Five out of Five |


In [51]:
#Save the dataframe to CSV file
#index=False means we don't write the dataframe's row numbers to the file

output_filename='books_day1.csv'
df.to_csv(output_filename, index=False)
print(f"\nData successfully saved to {output_filename}!")


Data successfully saved to books_day1.csv!
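CSV is one storage option; the process outline also mentions a database. As a minimal sketch of that alternative, the block below writes book records to SQLite using the standard-library `sqlite3` module. The table layout, the `save_books` helper, and the in-memory database are assumptions made for illustration, and the sample rows assume prices and ratings have already been converted to numbers.

```python
import sqlite3

# Sample rows in the same shape as the scraped dictionaries.
books = [
    {"Title": "A Light in the Attic", "Price": 51.77, "Rating": 3},
    {"Title": "Tipping the Velvet", "Price": 53.74, "Rating": 1},
]


def save_books(connection, rows):
    """Create the table if needed and insert one row per book."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS books (title TEXT, price REAL, rating INTEGER)"
    )
    # Named placeholders are filled from the keys of each dictionary.
    connection.executemany(
        "INSERT INTO books (title, price, rating) VALUES (:Title, :Price, :Rating)",
        rows,
    )
    connection.commit()


# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
save_books(conn, books)

row_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
cheapest = conn.execute(
    "SELECT title, price FROM books ORDER BY price LIMIT 1"
).fetchone()
print(row_count, cheapest)
```

A database becomes more useful than CSV once the data is collected over several runs, because new rows can be appended and queried without reloading the whole file.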


#### Advanced Techniques

* Handling Pagination: Scraping data across multiple pages.
* The Challenge of Dynamic Content: When requests isn't enough.
* Introduction to Selenium: Automating a real web browser.
* Hands-On with Selenium: Scraping a JavaScript-powered website.
* Best Practices: Error handling, waits, and putting it all together.

#### Handling Pagination

Strategy:
* Scrape the current page.
* Find the "Next" button's link.
* If a "Next" button exists, form the URL for the next page and repeat the process.
* If there's no "Next" button, we've reached the last page, so we stop.
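The strategy above can be sketched without any network access by replacing the website with a small dictionary. Everything in `FAKE_SITE` (the `example.com` URLs, the item names, the link targets) is hypothetical, and `crawl` is a helper name introduced here; only the loop structure and the use of `urljoin` mirror the real scraper.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": each URL maps to its items and the
# relative href of its "next" link (None on the last page).
FAKE_SITE = {
    "http://example.com/catalogue/page-1.html": (["Book A", "Book B"], "page-2.html"),
    "http://example.com/catalogue/page-2.html": (["Book C"], "page-3.html"),
    "http://example.com/catalogue/page-3.html": (["Book D"], None),
}


def crawl(start_url, site, max_pages=50):
    """Follow 'next' links from start_url and collect every item seen."""
    items, visited = [], []
    url = start_url
    while url and len(visited) < max_pages:
        page_items, next_href = site[url]
        items.extend(page_items)
        visited.append(url)
        # Resolve the relative link against the page it was found on.
        url = urljoin(url, next_href) if next_href else None
    return items, visited


items, visited = crawl("http://example.com/catalogue/page-1.html", FAKE_SITE)
print(len(visited), items)
```

Resolving each link against the URL of the page it came from keeps working even if a "next" link points into a different directory. The `max_pages` cap is a safety net against following links forever.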

In [55]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time 
from urllib.parse import urljoin 

base_url='http://books.toscrape.com/catalogue/'
current_page_url=base_url + "page-1.html"
headers={"User-Agent":"My Web Scraper 2.0 - for educational purposes"}
all_books_data=[]
page_count=0
max_pages=50

while current_page_url and page_count<max_pages:
    page_count+=1
    print(f"Scraping page {page_count}: {current_page_url}")
    #timeout stops the loop from hanging if the server never answers
    response=requests.get(current_page_url,headers=headers,timeout=10)
    soup=BeautifulSoup(response.text,'lxml')

    books = soup.find_all('article',class_='product_pod')

    for book in books:
        title = book.find('h3').find('a')['title']
        price=book.find('p',class_='price_color').text
        rating=book.find('p',class_='star-rating')['class'][1]

        all_books_data.append({
            'Title':title,
            'Price':price,
            'Rating':f"{rating} out of five"
        })

    #find the 'next' button link
    next_button = soup.find('li',class_='next')
    if next_button:
        #the link is relative, so we use urljoin to create the full, absolute url
        next_page_relative_url=next_button.find('a')['href']
        current_page_url=urljoin(base_url,next_page_relative_url)
        time.sleep(1)

    else:
        current_page_url=None

print(f"\nFinished scraping. Total books found: {len(all_books_data)}")

Scraping page 1: http://books.toscrape.com/catalogue/page-1.html
Scraping page 2: http://books.toscrape.com/catalogue/page-2.html
Scraping page 3: http://books.toscrape.com/catalogue/page-3.html
Scraping page 4: http://books.toscrape.com/catalogue/page-4.html
Scraping page 5: http://books.toscrape.com/catalogue/page-5.html
Scraping page 6: http://books.toscrape.com/catalogue/page-6.html
Scraping page 7: http://books.toscrape.com/catalogue/page-7.html
Scraping page 8: http://books.toscrape.com/catalogue/page-8.html
Scraping page 9: http://books.toscrape.com/catalogue/page-9.html
Scraping page 10: http://books.toscrape.com/catalogue/page-10.html
Scraping page 11: http://books.toscrape.com/catalogue/page-11.html
Scraping page 12: http://books.toscrape.com/catalogue/page-12.html
Scraping page 13: http://books.toscrape.com/catalogue/page-13.html
Scraping page 14: http://books.toscrape.com/catalogue/page-14.html
Scraping page 15: http://books.toscrape.com/catalogue/page-15.html
Scraping page
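A loop over this many pages will sooner or later hit a temporary network failure, and a single failed request would end the whole run. One common remedy is to retry with a growing pause. The sketch below is not part of the scraper above: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fetcher is passed in as a plain callable so the example runs without a network connection. In the real loop, the callable could be a small wrapper around `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url), retrying on any exception with growing pauses.

    `fetch` is any callable that returns the page content or raises on
    failure. `backoff` is the base pause in seconds.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to network errors in real code
            last_error = error
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait longer after each failure
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error


# Hypothetical fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


# backoff=0 keeps the demonstration instant.
page = fetch_with_retries(flaky_fetch, "http://example.com/page-1.html", backoff=0)
print(calls["count"], page)
```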

In [56]:
all_books_data

[{'Title': 'A Light in the Attic',
  'Price': 'Â£51.77',
  'Rating': 'Three out of five'},
 {'Title': 'Tipping the Velvet',
  'Price': 'Â£53.74',
  'Rating': 'One out of five'},
 {'Title': 'Soumission', 'Price': 'Â£50.10', 'Rating': 'One out of five'},
 {'Title': 'Sharp Objects', 'Price': 'Â£47.82', 'Rating': 'Four out of five'},
 {'Title': 'Sapiens: A Brief History of Humankind',
  'Price': 'Â£54.23',
  'Rating': 'Five out of five'},
 {'Title': 'The Requiem Red', 'Price': 'Â£22.65', 'Rating': 'One out of five'},
 {'Title': 'The Dirty Little Secrets of Getting Your Dream Job',
  'Price': 'Â£33.34',
  'Rating': 'Four out of five'},
 {'Title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  'Price': 'Â£17.93',
  'Rating': 'Three out of five'},
 {'Title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  'Price': 'Â£22.60',
  'Rating': 'Four out of five'},
 {'Title': 'The Black Maria', 'Price': 

In [57]:
#save all the data to a new csv

df_all_pages=pd.DataFrame(all_books_data)
df_all_pages.to_csv('books_all_pages.csv',index=False)
print("Data from all pages saved to books_all_pages.csv")
display(df_all_pages.head())
display(df_all_pages.tail())

Data from all pages saved to books_all_pages.csv


|   | Title | Price | Rating |
|---|-------|-------|--------|
| 0 | A Light in the Attic | Â£51.77 | Three out of five |
| 1 | Tipping the Velvet | Â£53.74 | One out of five |
| 2 | Soumission | Â£50.10 | One out of five |
| 3 | Sharp Objects | Â£47.82 | Four out of five |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | Five out of five |


|   | Title | Price | Rating |
|---|-------|-------|--------|
| 995 | Alice in Wonderland (Alice's Adventures in Won... | Â£55.53 | One out of five |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Â£57.06 | Four out of five |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Â£16.97 | Five out of five |
| 998 | 1st to Die (Women's Murder Club #1) | Â£53.98 | One out of five |
| 999 | 1,000 Places to See Before You Die | Â£26.08 | Five out of five |
