# **BeautifulSoup**

bs4 is the import name of the Beautiful Soup 4 package, a Python library used for web scraping.

Beautiful Soup parses HTML and XML documents into a tree of Python objects and provides a convenient way to navigate, search, and modify the elements of a web page or an XML document.

Here are some of the key functions and methods available in Beautiful Soup:

**Parsing Functions:**

1. BeautifulSoup(): This constructor creates a Beautiful Soup object from an HTML or XML document. It takes the document content and the name of a parser (e.g., "html.parser") as arguments.
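As a minimal sketch of the parsing step, the snippet below builds a Beautiful Soup object from a short inline string. The `html_doc` content is a made-up example, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
html_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# "html.parser" is the parser bundled with Python, so nothing extra is installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)   # BeautifulSoup
print(soup.title.string)     # Demo page
```

The same call accepts bytes as well as text, which is why the notebook cells further down can pass a response body straight to it.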

**Navigation Functions:**

1. find(): Used to find the first element that matches a given tag or filter criterion.

2. find_all(): Finds all elements that match a given tag or filter criterion and returns them as a list.

3. find_parent(): Finds the closest ancestor of an element that matches a given tag or filter criterion.
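The three methods can be tried on a small fragment. This is only a sketch: the markup below is a simplified, hand-written imitation of one price block, so the class names are assumptions about the page layout.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written fragment used only for illustration.
html_doc = """
<div class="product_price">
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")                       # first <p> in the document
all_p = soup.find_all("p")                     # every <p>, in document order
price = soup.find("p", class_="price_color")   # filter on a CSS class
parent = price.find_parent("div")              # nearest enclosing <div>

print(first_p.text)      # £51.77
print(len(all_p))        # 2
print(parent["class"])   # ['product_price']
```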


**Filtering Functions:**

1. select(): Allows you to select elements using CSS selectors. This is particularly useful for more complex filtering and selection.

2. select_one(): Similar to select(), but returns only the first matching element.
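A short sketch of both selector methods follows. The fragment is a trimmed, hand-written imitation of one book entry, so treat the markup as an assumption rather than the exact page source.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written fragment used only for illustration.
html_doc = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="frankenstein_20/index.html" title="Frankenstein">Frankenstein</a></h3>
  <p class="price_color">£38.00</p>
</article>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.select("article.product_pod h3 a")   # list of all matches
price = soup.select_one(".price_color")           # first match only
missing = soup.select_one(".star-rating.Five")    # no match gives None

print(links[0]["title"])   # Frankenstein
print(price.get_text())    # £38.00
print(missing)             # None
```

Because `select_one()` returns `None` when nothing matches, it is worth checking the result before reading attributes from it.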

**Accessing Tag Attributes:**

1. get(): Retrieves the value of a specified attribute of a tag.

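**Extracting Data:**

1. text: Returns all the text inside an element, including the text of its descendants, joined into one string.

2. string: Returns the tag's string when the tag contains exactly one piece of text, or None when the tag has several children.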
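The difference between `get()`, `text`, and `string` is easiest to see on a tag with nested children. The markup here is a made-up example.

```python
from bs4 import BeautifulSoup

# Made-up fragment with nested tags, used only for illustration.
html_doc = '<p id="blurb">Read <a href="/catalogue/page-1.html">the <b>catalogue</b></a> today</p>'
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find("a")
print(link.get("href"))    # /catalogue/page-1.html
print(link.get("title"))   # None, a missing attribute does not raise
print(soup.p.text)         # Read the catalogue today
print(soup.b.string)       # catalogue
print(soup.p.string)       # None, because <p> has several children
```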

**Manipulation Functions:**

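1. insert(): Inserts content into a tag at a given position among its children.

2. replace_with(): Replaces a tag or string with the specified content.

3. extract(): Removes a tag or a tree of tags from the document and returns it.

4. append(), extend(), insert_before(), insert_after(): Methods for adding content at the end of a tag (append, extend) or immediately before or after it (insert_before, insert_after).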
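The following sketch applies a few of these methods to a tiny list. The markup and the variable names (`third`, `zero`, `replacement`) are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

third = soup.new_tag("li")
third.string = "three"
ul.append(third)                 # add as the last child

zero = soup.new_tag("li")
zero.string = "zero"
ul.insert(0, zero)               # add at child position 0

# Children are now zero, one, two, three; detach "two" and keep it.
removed = ul.find_all("li")[2].extract()

replacement = soup.new_tag("li")
replacement.string = "ONE"
ul.find("li", string="one").replace_with(replacement)

print(removed)   # <li>two</li>
print(ul)        # <ul><li>zero</li><li>ONE</li><li>three</li></ul>
```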

**Modifying Attributes:**

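1. tag["name"] = value: Tags behave like dictionaries, so you can add or change an attribute by assignment, remove it with del tag["name"], and read all attributes from tag.attrs.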
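In Beautiful Soup a tag's attributes are read and written with dictionary-style access. A minimal sketch, using a made-up link:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="page-1.html" class="old">Next</a>', "html.parser")
link = soup.a

link["href"] = "page-2.html"    # overwrite an existing attribute
link["title"] = "Next page"     # add a new attribute
del link["class"]               # remove an attribute

print(link.attrs)   # {'href': 'page-2.html', 'title': 'Next page'}
print(link)         # <a href="page-2.html" title="Next page">Next</a>
```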

**Encoding and Decoding Functions:**

1. encode(): Renders the document or tag as a bytestring in a specified character encoding (UTF-8 by default).

2. decode(): Renders the document or tag as a Unicode string.
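A small sketch of the two output forms: in Beautiful Soup, `encode()` produces bytes and `decode()` produces a Python string. The fragment is a made-up price element.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="price_color">£51.77</p>', "html.parser")

as_bytes = soup.encode("utf-8")   # bytes, with "£" stored as two UTF-8 bytes
as_text = soup.decode()           # str

print(type(as_bytes).__name__, type(as_text).__name__)   # bytes str
print(as_bytes)   # b'<p class="price_color">\xc2\xa351.77</p>'
print(as_text)    # <p class="price_color">£51.77</p>
```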

**Utility Functions:**

1. prettify(): Returns a nicely formatted string representation of the document, making it easier to read.

# **Requests module**


The requests module in Python is a popular and widely used library for making HTTP requests to web services and websites. It simplifies the process of sending HTTP requests and handling responses, making it easier to interact with web APIs, scrape web pages, and perform other HTTP-related tasks.

Here are some of the key functions and methods provided by the requests module:

**Sending HTTP Requests:**

1. requests.get(url, params=None, **kwargs): Sends an HTTP GET request to the specified URL.

2. requests.post(url, data=None, json=None, **kwargs): Sends an HTTP POST request to the specified URL, optionally sending data as form data or JSON.

3. requests.put(url, data=None, **kwargs): Sends an HTTP PUT request to the specified URL.

4. requests.delete(url, **kwargs): Sends an HTTP DELETE request to the specified URL.


5. requests.request(method, url, **kwargs): Sends a custom HTTP request with the specified method.
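To see what these functions put on the wire without contacting a server, a request can be built and prepared but not sent. This is only a sketch: the `sort` query parameter and the `User-Agent` value are hypothetical and not something the site is known to use.

```python
import requests

# Build the request object without sending it, so the example runs offline.
req = requests.Request(
    "GET",
    "http://books.toscrape.com/catalogue/page-1.html",
    params={"sort": "title"},                     # hypothetical query parameter
    headers={"User-Agent": "demo-scraper/0.1"},   # hypothetical header value
)
prepared = req.prepare()

print(prepared.method)                  # GET
print(prepared.url)                     # http://books.toscrape.com/catalogue/page-1.html?sort=title
print(prepared.headers["User-Agent"])   # demo-scraper/0.1
```

Calling `requests.get()` with the same `params` and `headers` arguments builds an equivalent request and sends it in one step.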

**Handling Responses:**

Once a request is sent, you can access the response and its properties using attributes like status_code, headers, and text.

1. response.content: Returns the raw binary content of the response.

2. response.json(): Parses the response content as JSON (if applicable).

3. response.raise_for_status(): Raises an exception if the request resulted in an HTTP error status code.

**Headers and Cookies:**

You can set custom headers in your requests using the headers parameter.

The requests module also lets you send cookies with the cookies parameter and read the cookies a server sets from response.cookies, a CookieJar-like object.
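One way to combine these pieces is a small helper that sends headers and cookies, then checks the status before returning the body. The helper name `fetch_html`, the header value, and the cookie are all hypothetical. The second half builds a `Response` by hand purely to show `raise_for_status()` offline, which is not how responses normally come about.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the response body as bytes, or raise on an HTTP error status."""
    response = requests.get(
        url,
        headers={"User-Agent": "demo-scraper/0.1"},   # hypothetical header value
        cookies={"visited": "yes"},                   # hypothetical cookie
        timeout=timeout,
    )
    response.raise_for_status()   # a 4xx or 5xx status raises requests.HTTPError
    return response.content

# Offline illustration only: a hand-built Response with an error status.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    outcome = "ok"
except requests.HTTPError:
    outcome = "http error"

print(outcome)   # http error
```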



In [None]:
import requests as rs
from bs4 import BeautifulSoup as bs

In [None]:
link="http://books.toscrape.com/"
page=rs.get(link)
print(page)

<Response [200]>


In [None]:
print(page.content)



In [None]:
a=bs(page.content,"html.parser")
print(a.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [None]:
product_price=a.find_all("div",class_="product_price")
print(product_price)

[<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>, <div class="product_price">
<p class="price_color">£53.74</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>, <div class="product_price">
<p class="price_color">£50.10</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>, <div class="product_price">
<p class="price_color">£47.82</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>

In [None]:
Book_price=[]
for i in product_price:
    Book_price.append(i.text)
print(Book_price)

['\n£51.77\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£53.74\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£50.10\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£47.82\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£54.23\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£22.65\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£33.34\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£17.93\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£22.60\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£52.15\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£13.99\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£20.66\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£17.46\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£52.29\n\n\n    \n        In stock\n    \n\n\nAdd to basket\n\n', '\n£35.02\n\n\n    \n        In s

In [None]:
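# strip() treats its argument as a set of characters, which is fragile here,
# so read each price straight from its <p class="price_color"> element instead.
Book_price=[i.find("p",class_="price_color").text for i in product_price]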
print(Book_price)

['£51.77', '£53.74', '£50.10', '£47.82', '£54.23', '£22.65', '£33.34', '£17.93', '£22.60', '£52.15', '£13.99', '£20.66', '£17.46', '£52.29', '£35.02', '£57.25', '£23.88', '£37.59', '£51.33', '£45.17']


In [None]:
len(Book_price)

20
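The prices are still strings at this point. If numeric values are needed, one possible follow-up step, which is not part of the original notebook, is to pull the number out with a regular expression. The helper name `price_to_float` is made up for this sketch.

```python
import re

def price_to_float(price_text):
    """Pull the first decimal number out of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())

# First three values from the list printed above.
sample_prices = ['£51.77', '£53.74', '£50.10']
numeric_prices = [price_to_float(p) for p in sample_prices]

print(numeric_prices)   # [51.77, 53.74, 50.1]
```

Matching on the digits, rather than removing the currency sign, also copes with prices that carry extra characters in front of the number.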

In [None]:
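import requests
from bs4 import BeautifulSoup
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
book_list=[]
for i in range(1,51):  # pages 1 through 50
    scrape_url = base_url.format(i)

    result = requests.get(scrape_url)
    # Pass the raw bytes so Beautiful Soup detects the page encoding itself
    soup = BeautifulSoup(result.content, 'lxml')
    books = soup.select('.product_pod')

    for book in books:
        # Keep only books whose rating element carries the class "Two"
        if book.select('.star-rating.Two'):
            # The first <a> wraps the cover image; the second holds the title
            title = book.select('a')[1]['title']
            book_list.append(title)
            print(title)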

Starving Hearts (Triangular Trade Trilogy, #1)
Libertarianism for Beginners
It's Only the Himalayas
How Music Works
Maude (1883-1993):She Grew Up with the country
You can't bury them all: Poems
Reasons to Stay Alive
Without Borders (Wanderlove #1)
Soul Reader
Security
Saga, Volume 5 (Saga (Collected Editions) #5)
Reskilling America: Learning to Labor in the Twenty-First Century
Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics
Obsidian (Lux #1)
My Paris Kitchen: Recipes and Stories
Masks and Shadows
Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)
Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)
Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)
I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)
Giant Days, Vol. 2 (Giant Days #5-8)
Everydata: The Misinformation Hidden in the Little Data You Consume Every 

In [None]:
print(book_list)

['Starving Hearts (Triangular Trade Trilogy, #1)', 'Libertarianism for Beginners', "It's Only the Himalayas", 'How Music Works', 'Maude (1883-1993):She Grew Up with the country', "You can't bury them all: Poems", 'Reasons to Stay Alive', 'Without Borders (Wanderlove #1)', 'Soul Reader', 'Security', 'Saga, Volume 5 (Saga (Collected Editions) #5)', 'Reskilling America: Learning to Labor in the Twenty-First Century', 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics', 'Obsidian (Lux #1)', 'My Paris Kitchen: Recipes and Stories', 'Masks and Shadows', 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)', 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)', 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)', 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)', 'Giant Days, Vol. 2 (Giant Days #5-8)', 'Everydata

In [None]:
print(len(book_list))

196
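The filter in the page loop relies on two details of each book entry: the rating is stored as a second CSS class on the `star-rating` element, and the title sits on the second link. The sketch below checks both on a shortened copy of one entry. The `src` value is a placeholder, and reading the rating word from the class list is an illustration rather than something the notebook does.

```python
from bs4 import BeautifulSoup

# Shortened copy of one book entry; the image path is a placeholder.
html_doc = """
<article class="product_pod">
<div class="image_container">
<a href="frankenstein_20/index.html"><img alt="Frankenstein" class="thumbnail" src="cover.jpg"/></a>
</div>
<p class="star-rating Two"></p>
<h3><a href="frankenstein_20/index.html" title="Frankenstein">Frankenstein</a></h3>
</article>
"""
book = BeautifulSoup(html_doc, "html.parser").select_one(".product_pod")

# ".star-rating.Two" matches an element that has both classes.
is_two_star = bool(book.select(".star-rating.Two"))

links = book.select("a")
title = links[1]["title"]   # links[0] wraps the image and has no title attribute

# The class attribute is a list, so the rating word is its second item.
rating_word = book.select_one(".star-rating")["class"][1]

print(is_two_star, title, rating_word)   # True Frankenstein Two
```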


In [None]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
for i in range (1,51):
    scrape_url = base_url.format(i)
    print(scrape_url)

http://books.toscrape.com/catalogue/page-1.html
http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html
http://books.toscrape.com/catalogue/page-5.html
http://books.toscrape.com/catalogue/page-6.html
http://books.toscrape.com/catalogue/page-7.html
http://books.toscrape.com/catalogue/page-8.html
http://books.toscrape.com/catalogue/page-9.html
http://books.toscrape.com/catalogue/page-10.html
http://books.toscrape.com/catalogue/page-11.html
http://books.toscrape.com/catalogue/page-12.html
http://books.toscrape.com/catalogue/page-13.html
http://books.toscrape.com/catalogue/page-14.html
http://books.toscrape.com/catalogue/page-15.html
http://books.toscrape.com/catalogue/page-16.html
http://books.toscrape.com/catalogue/page-17.html
http://books.toscrape.com/catalogue/page-18.html
http://books.toscrape.com/catalogue/page-19.html
http://books.toscrape.com/catalogue/page-20.html
http://books.toscrape.com/cat

In [None]:
# scrape_url still holds the last URL built by the loop above (page 50)
result = requests.get(scrape_url)
# Parse the raw bytes so the "£" sign is decoded as UTF-8
soup = BeautifulSoup(result.content, 'lxml')
books = soup.select('.product_pod')
print(books)

[<article class="product_pod">
<div class="image_container">
<a href="frankenstein_20/index.html"><img alt="Frankenstein" class="thumbnail" src="../media/cache/00/25/0025515e987a1ebd648773f9ac70bfe6.jpg"/></a>
</div>
<p class="star-rating Two">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="frankenstein_20/index.html" title="Frankenstein">Frankenstein</a></h3>
<div class="product_price">
<p class="price_color">£38.00</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>, <article class="product_pod">
<div class="image_container">
<a href="forever-rockers-the-rocker-12_19/index.html"><img alt="Forever Rockers (The Rocker #12)" class="thumbnail" src="../media/cache/7f/b0/7fb03a053c270000667a50dd8d594843.jpg"/>

In [None]:
import requests
from bs4 import BeautifulSoup

base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

for i in range(1, 51):
    scrape_url = base_url.format(i)

    result = requests.get(scrape_url)
    # Parse the raw bytes so the "£" sign is decoded as UTF-8
    soup = BeautifulSoup(result.content, 'lxml')
    books = soup.select('.product_pod')

    for book in books:
        # Keep only the books rated two stars
        if book.select('.star-rating.Two'):
            title = book.select('a')[1]['title']
            price = book.select('.price_color')[0].get_text()
            print(f"Title: {title} | Price: {price}")

Title: Starving Hearts (Triangular Trade Trilogy, #1) | Price: £13.99
Title: Libertarianism for Beginners | Price: £51.33
Title: It's Only the Himalayas | Price: £45.17
Title: How Music Works | Price: £37.32
Title: Maude (1883-1993):She Grew Up with the country | Price: £18.02
Title: You can't bury them all: Poems | Price: £33.63
Title: Reasons to Stay Alive | Price: £26.41
Title: Without Borders (Wanderlove #1) | Price: £45.07
Title: Soul Reader | Price: £39.58
Title: Security | Price: £39.25
Title: Saga, Volume 5 (Saga (Collected Editions) #5) | Price: £51.04
Title: Reskilling America: Learning to Labor in the Twenty-First Century | Price: £19.83
Title: Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics | Price: £36.28
Title: Obsidian (Lux #1) | Price: £14.86
Title: My Paris Kitchen: Recipes and Stories | Price: £33.37
Title: Masks and Shadows | Price: £56.

In [None]:
import requests
from bs4 import BeautifulSoup

base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
book_prices = []

for i in range(1, 51):
    scrape_url = base_url.format(i)

    result = requests.get(scrape_url)
    # Parse the raw bytes so the "£" sign is decoded as UTF-8
    soup = BeautifulSoup(result.content, 'lxml')
    books = soup.select('.product_pod')

    for book in books:
        # Keep only the books rated two stars
        if book.select('.star-rating.Two'):
            price = book.select('.price_color')[0].get_text()
            book_prices.append(price)

print(book_prices)

['£13.99', '£51.33', '£45.17', '£37.32', '£18.02', '£33.63', '£26.41', '£45.07', '£39.58', '£39.25', '£51.04', '£19.83', '£36.28', '£14.86', '£33.37', '£56.40', '£46.91', '£19.92', '£53.90', '£29.17', '£22.11', '£54.35', '£37.97', '£49.46', '£37.92', '£28.09', '£21.04', '£36.00', '£43.54', '£38.21', '£37.60', '£55.02', '£23.15', '£10.93', '£55.99', '£26.12', '£12.23', '£23.99', '£16.85', '£19.15', '£22.16', '£37.80', '£27.43', '£11.11', '£36.50', '£15.48', '£14.36', '£19.60', '£46.31', '£45.95', '£22.08', '£25.38', '£27.12', '£43.04', '£54.21', '£40.67', '£52.26', '£13.82', '£25.48', '£18.28', '£21.95', '£38.39', '£37.61', '£51.51', '£54.07', '£40.12', '£47.72', '£46.23', '£32.01', '£34.20', '£10.90', '£16.73', '£56.48', '£40.79', '£11.53', '£33.26', '£56.54', '£48.05', '£20.55', '£48.39', '£27.26', '£21.15', '£13.34', '£17.28', '£24.57', '£11.82', '£16.68', '£47.44', '£35.92', '£35.79', '£39.36'

In [None]:
print(len(book_prices))

196


In [None]:
import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()

# Display the empty DataFrame
print(df)

Empty DataFrame
Columns: []
Index: []


In [None]:
df['BOOK_TITLE']=book_list

In [None]:
df

Unnamed: 0,BOOK_TITLE
0,"Starving Hearts (Triangular Trade Trilogy, #1)"
1,Libertarianism for Beginners
2,It's Only the Himalayas
3,How Music Works
4,Maude (1883-1993):She Grew Up with the country
...,...
191,Of Mice and Men
192,My Perfect Mistake (Over the Top #1)
193,Meditations
194,Frankenstein


In [None]:
# Guard against mis-decoded prices such as 'Â£13.99' (a UTF-8 page read as Latin-1)
df['BOOK_PRICE']=[p.replace('Â','') for p in book_prices]
df

Unnamed: 0,BOOK_TITLE,BOOK_PRICE
0,"Starving Hearts (Triangular Trade Trilogy, #1)",£13.99
1,Libertarianism for Beginners,£51.33
2,It's Only the Himalayas,£45.17
3,How Music Works,£37.32
4,Maude (1883-1993):She Grew Up with the country,£18.02
...,...,...
191,Of Mice and Men,£47.11
192,My Perfect Mistake (Over the Top #1),£38.92
193,Meditations,£25.89
194,Frankenstein,£38.00


In [None]:
# Save the DataFrame to a CSV file
df.to_csv('books.csv')

In [None]:
import requests
from bs4 import BeautifulSoup
import csv

# Download HTML content
url = 'https://www.w3schools.com/sql/sql_in.asp'
response = requests.get(url)
# Use the raw bytes so Beautiful Soup can detect the page encoding itself
html_content = response.content

# Parse HTML with Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
# Find the table
table = soup.find('table')
print(table)

<table class="ws-table-all notranslate">
<tr>
<th>CustomerID</th><th>CustomerName</th><th>ContactName</th><th>Address</th><th>City</th><th>PostalCode</th><th>Country</th>
</tr>
<tr>
<td>1</td><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Obere Str. 57</td><td>Berlin</td><td>12209</td><td>Germany</td>
</tr>
<tr>
<td>2</td><td>Ana Trujillo Emparedados y helados</td><td>Ana Trujillo</td><td>Avda. de la Constitución 2222</td><td>México D.F.</td><td>05021</td><td>Mexico</td>
</tr>
<tr>
<td>3</td><td>Antonio Moreno Taquería</td><td>Antonio Moreno</td><td>Mataderos 2312</td><td>México D.F.</td><td>05023</td><td>Mexico</td>
</tr>
<tr>
<td>4</td><td>Around the Horn</td><td>Thomas Hardy</td><td>120 Hanover Sq.</td><td>London</td><td>WA1 1DP</td><td>UK</td>
</tr>
<tr>
<td>5</td><td>Berglunds snabbköp</td><td>Christina Berglund</td><td>Berguvsvägen 8</td><td>Luleå</td><td>S-958 22</td><td>Sweden</td>
</tr>
<tr>
<td>6</td><td>Blauer See Delikatessen</td><td>Hanna Moos</td><td>Forsters

In [None]:
# Extract table rows
rows = table.find_all('tr')
print(rows)

[<tr>
<th>CustomerID</th><th>CustomerName</th><th>ContactName</th><th>Address</th><th>City</th><th>PostalCode</th><th>Country</th>
</tr>, <tr>
<td>1</td><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Obere Str. 57</td><td>Berlin</td><td>12209</td><td>Germany</td>
</tr>, <tr>
<td>2</td><td>Ana Trujillo Emparedados y helados</td><td>Ana Trujillo</td><td>Avda. de la Constitución 2222</td><td>México D.F.</td><td>05021</td><td>Mexico</td>
</tr>, <tr>
<td>3</td><td>Antonio Moreno Taquería</td><td>Antonio Moreno</td><td>Mataderos 2312</td><td>México D.F.</td><td>05023</td><td>Mexico</td>
</tr>, <tr>
<td>4</td><td>Around the Horn</td><td>Thomas Hardy</td><td>120 Hanover Sq.</td><td>London</td><td>WA1 1DP</td><td>UK</td>
</tr>, <tr>
<td>5</td><td>Berglunds snabbköp</td><td>Christina Berglund</td><td>Berguvsvägen 8</td><td>Luleå</td><td>S-958 22</td><td>Sweden</td>
</tr>, <tr>
<td>6</td><td>Blauer See Delikatessen</td><td>Hanna Moos</td><td>Forsterstr. 57</td><td>Mannheim</td><td>68

In [None]:
# Extract data from rows and save as CSV
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for row in rows:
        # The header row uses <th> cells and the data rows use <td> cells
        cols = row.find_all(['th', 'td'])
        row_data = [col.text.strip() for col in cols]
        csvwriter.writerow(row_data)
print("The data has been saved into 'data.csv' file.")

The data has been saved into 'data.csv' file.
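The row-by-row extraction can be checked without a network connection by running it on a small inline table. This is a sketch under stated assumptions: the table is a cut-down copy of the one printed earlier, and an in-memory `io.StringIO` buffer stands in for the file on disk.

```python
import csv
import io
from bs4 import BeautifulSoup

# Cut-down copy of the customers table, used only for illustration.
html_doc = """
<table>
<tr><th>CustomerID</th><th>CustomerName</th><th>City</th></tr>
<tr><td>1</td><td>Alfreds Futterkiste</td><td>Berlin</td></tr>
<tr><td>4</td><td>Around the Horn</td><td>London</td></tr>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").find("table")

buffer = io.StringIO()   # stands in for a file opened with open(...)
writer = csv.writer(buffer)
for row in table.find_all("tr"):
    # Collect header cells (<th>) and data cells (<td>) alike.
    cells = row.find_all(["th", "td"])
    writer.writerow([cell.get_text(strip=True) for cell in cells])

csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])   # CustomerID,CustomerName,City
print(csv_lines[1])   # 1,Alfreds Futterkiste,Berlin
```

Searching for `th` and `td` together keeps the header row; looking for `td` alone would write an empty first line, because the header cells are `th` elements.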
