# Web Scraping 1


## Introduction to Web Scraping

Web scraping is the process of extracting data from websites. It is widely used for data analysis, automation, and research. 
However, before scraping, it is essential to check the site's `robots.txt` file to understand its scraping policies.

### Ethical Considerations:
- Always respect `robots.txt` rules.
- Avoid overloading the server with frequent requests.
- Use scraping responsibly and avoid collecting sensitive/private data.

## Checking `robots.txt` Before Scraping

A site's `robots.txt` file sits at the root of the domain and lists the paths that crawlers are asked to stay out of. The cell below downloads it and prints its contents.


In [1]:
import requests
from bs4 import BeautifulSoup
import json

# Checking robots.txt
robots_url = 'https://webscraper.io/robots.txt'
robots_response = requests.get(robots_url)

if robots_response.status_code == 200:
    print("Robots.txt content:\n", robots_response.text[:5000])  # Display up to the first 5000 characters
else:
    print("Failed to retrieve robots.txt")


Robots.txt content:
 User-agent: *
Disallow: /test-sites/e-commerce/
Disallow: /test-sites/tables

Sitemap: https://webscraper.io/sitemap.xml
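
The rules printed above can also be evaluated in code rather than read by eye. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules are pasted in as a string so the example runs offline, and the helper name `is_allowed` is only illustrative.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt output above, so no network call is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test-sites/e-commerce/
Disallow: /test-sites/tables

Sitemap: https://webscraper.io/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


for url in (
    "https://webscraper.io/",
    "https://webscraper.io/test-sites/tables",
    "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops",
):
    print(url, "->", "allowed" if is_allowed(url) else "disallowed")
```

For a live check, `parser.set_url(robots_url)` followed by `parser.read()` fetches and parses the file in one step.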



In [2]:
response =  requests.get('https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops')

In [3]:
print(response.status_code)

200


### HTTP Request Methods


- `GET`: retrieves information from the server for the given URI.
- `POST`: asks the server to accept the data enclosed in the body of the request, most often to store it.
- `PUT`: asks the server to store the enclosed entity under the supplied URI. If the URI refers to an existing resource, that resource is modified; if it does not, the server can create the resource at that URI.
- `DELETE`: deletes the specified resource.
- `HEAD`: asks for a response identical to that of a `GET` request, but without the response body.
- `PATCH`: applies a partial modification. The request only needs to contain the changes to the resource, not the complete resource.
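
To see how these methods differ on the wire, the sketch below builds one request per method with `requests.Request(...).prepare()`, which constructs the request without sending it. The endpoint URL and the payloads are hypothetical and exist only for illustration.

```python
import json
import requests

# Hypothetical endpoint used only for illustration; nothing is sent.
URL = "https://example.com/api/items/1"

# Request(...).prepare() builds the request without sending it,
# so the method, URL, headers, and body can be inspected offline.
prepared = {
    "GET": requests.Request("GET", URL, params={"page": 2}).prepare(),
    "POST": requests.Request("POST", URL, json={"name": "laptop"}).prepare(),
    "PUT": requests.Request("PUT", URL, json={"name": "laptop", "price": 299}).prepare(),
    "PATCH": requests.Request("PATCH", URL, json={"price": 279}).prepare(),
    "DELETE": requests.Request("DELETE", URL).prepare(),
    "HEAD": requests.Request("HEAD", URL).prepare(),
}

for method, req in prepared.items():
    # Requests with a JSON payload carry it in the body; the others have none.
    body = json.loads(req.body) if req.body else None
    print(f"{req.method:6} {req.url}  body={body}")
```

In everyday use the shortcuts `requests.get`, `requests.post`, `requests.put`, `requests.patch`, `requests.delete`, and `requests.head` build and send the request in one call.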

### Status code list
The full list of HTTP status codes is documented on MDN: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
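
Status codes fall into five classes according to their first digit. As a rough sketch, the standard library's `http.HTTPStatus` can turn a numeric code into its reason phrase; the helper name `describe_status` is an assumption made for this example.

```python
from http import HTTPStatus

# First digit of the status code -> class of response.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not one of the registered statuses known to the standard library.
        phrase = "Unknown"
    category = CATEGORIES.get(code // 100, "unknown class")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 429, 503):
    print(describe_status(code))
```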

In [4]:
print(response.url)

https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops


In [5]:
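print(response.json)  # No parentheses, so this prints the method object; calling response.json() would raise an error here because the body is HTML, not JSON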

<bound method Response.json of <Response [200]>>


In [6]:
print(response)

<Response [200]>


In [7]:
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<!-- Google Tag Manager -->
<script nonce="vxZpLm8JyoNCPB0vnwzP93SV3UHq8yEE">(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
	<title>Allinone | Web Scraper Test Sites</title>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">

	<meta name="keywords"
		  content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper"/>
	<meta name="description"
		  content="Test Web Scraper&#039;s features and performance on mock e-commerce sites. Extract product data, prices, and categories in a controll

In [8]:
print(response.headers)

{'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Date': 'Mon, 24 Feb 2025 14:04:26 GMT', 'Cache-Control': 'max-age=300, public', 'Content-Security-Policy': "report-uri /report-csp;default-src 'self';connect-src 'self' https:;script-src 'self' https://*.googletagmanager.com 'nonce-vxZpLm8JyoNCPB0vnwzP93SV3UHq8yEE';style-src 'self' 'unsafe-inline' https://fonts.googleapis.com;font-src 'self' https://fonts.gstatic.com;frame-src 'self' https://www.youtube-nocookie.com https://*.googletagmanager.com;img-src 'self' data: https:;report-to report-to-csp-group", 'Report-To': '{"group":"report-to-csp-group","max_age":432000,"endpoints":[{"url":"https:\\/\\/webscraper.io\\/report-csp"}]}', 'Set-Cookie': 'test=original_toggled_annually; expires=Wed, 26 Mar 2025 14:04:26 GMT; Max-Age=2592000; path=/; domain=.webscraper.io; secure; samesite=lax', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains

In [9]:
print(response.request.headers)

{'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
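
The output above shows that `requests` identifies itself as `python-requests/...` by default. One possible way to send a more descriptive identity is to set headers on a `requests.Session`. This is a sketch only: the User-Agent string and contact address below are made up, and `prepare_request` is used so that nothing is actually sent.

```python
import requests

session = requests.Session()
# Hypothetical identifier and contact address; replace with your own.
session.headers.update({
    "User-Agent": "webscraping-tutorial/0.1 (contact: you@example.com)",
    "Accept-Language": "en",
})

# prepare_request merges the session headers into the request without sending it.
request = requests.Request(
    "GET", "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
)
prepared = session.prepare_request(request)
print(prepared.headers)
```

Every call made through `session.get(...)` afterwards carries the same headers, so they only need to be set once.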


In [10]:
soup = BeautifulSoup(response.content, "html.parser")

In [11]:
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Google Tag Manager -->
<script nonce="vxZpLm8JyoNCPB0vnwzP93SV3UHq8yEE">(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
<title>Allinone | Web Scraper Test Sites</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords">
<meta content="Test Web Scraper's features and performance on mock e-commerce sites. Extract product data, prices, and categories in a controlled environment." name="description">


In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Google Tag Manager -->
  <script nonce="vxZpLm8JyoNCPB0vnwzP93SV3UHq8yEE">
   (function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');
  </script>
  <!-- End Google Tag Manager -->
  <title>
   Allinone | Web Scraper Test Sites
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords">
   <meta content="Test Web Scraper's features and performance on mock e-commerce sites. Extract product data, prices, and categories in a controlled env

In [13]:
print(soup.get_text())







Allinone | Web Scraper Test Sites


































Toggle navigation















Web Scraper





Cloud Scraper





Pricing






								Learn
								





Documentation


Video Tutorials


Test Sites


Forum




Install


Cloud Login













Test Sites











Home



					Computers
					




							Laptops
						



							Tablets
						





					Phones
					







Computers / Laptops






$295.99

Asus VivoBook...

Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd


14 reviews














$299

Prestigio Smar...

Prestigio SmartBook 133S Dark Grey, 13.3" FHD IPS, Celeron N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam


8 reviews













$299

Prestigio Smar...

Prestigio SmartBook 133S Gold, 13.3" FHD IPS, Celeron N3350 1.1GHz, 4GB, 32GB, Windows 10 Pro + Office 365 1 gadam


12 reviews















$306.99

Aspire E1-510

15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux



In [14]:
print(soup.title)

<title>Allinone | Web Scraper Test Sites</title>


In [15]:
print(soup.title.get_text())

Allinone | Web Scraper Test Sites


In [16]:
price = soup.find("h4", {"class": "price float-end card-title pull-right"})
print(price)

<h4 class="price float-end card-title pull-right">$295.99</h4>


In [17]:
content = str(price)

print(content)

<h4 class="price float-end card-title pull-right">$295.99</h4>


In [18]:
print(price.get_text())

$295.99
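
Passing the full class string, as in the `find` call above, matches only elements whose `class` attribute is exactly that string in that order. Matching on the single class `price` is less brittle. The sketch below compares the two on inline HTML; the first `<h4>` is copied from the output above, while the second, with fewer classes in a different order, is invented for the comparison.

```python
from bs4 import BeautifulSoup

html = """
<h4 class="price float-end card-title pull-right">$295.99</h4>
<h4 class="card-title price">$299</h4>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class-string match: only the first element qualifies.
exact = soup.find_all("h4", {"class": "price float-end card-title pull-right"})

# Single-class match: any <h4> whose class list contains "price".
by_class = soup.find_all("h4", class_="price")

# CSS selector equivalent of the single-class match.
by_selector = soup.select("h4.price")

print(len(exact), len(by_class), len(by_selector))
```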


In [19]:
allPrices = soup.find_all("h4", {"class": "price float-end card-title pull-right"})


In [20]:
print(allPrices)

[<h4 class="price float-end card-title pull-right">$295.99</h4>, <h4 class="price float-end card-title pull-right">$299</h4>, <h4 class="price float-end card-title pull-right">$299</h4>, <h4 class="price float-end card-title pull-right">$306.99</h4>, <h4 class="price float-end card-title pull-right">$321.94</h4>, <h4 class="price float-end card-title pull-right">$356.49</h4>, <h4 class="price float-end card-title pull-right">$364.46</h4>, <h4 class="price float-end card-title pull-right">$372.7</h4>, <h4 class="price float-end card-title pull-right">$379.94</h4>, <h4 class="price float-end card-title pull-right">$379.95</h4>, <h4 class="price float-end card-title pull-right">$391.48</h4>, <h4 class="price float-end card-title pull-right">$393.88</h4>, <h4 class="price float-end card-title pull-right">$399</h4>, <h4 class="price float-end card-title pull-right">$399.99</h4>, <h4 class="price float-end card-title pull-right">$404.23</h4>, <h4 class="price float-end card-title pull-right"

In [21]:
for price in allPrices:
    print(price.get_text())

$295.99
$299
$299
$306.99
$321.94
$356.49
$364.46
$372.7
$379.94
$379.95
$391.48
$393.88
$399
$399.99
$404.23
$408.98
$409.63
$410.46
$410.66
$416.99
$433.3
$436.29
$436.29
$439.73
$454.62
$454.73
$457.38
$465.95
$468.56
$469.1
$484.23
$485.9
$487.8
$488.64
$488.78
$494.71
$497.17
$498.23
$520.99
$564.98
$577.99
$581.99
$609.99
$679
$679
$729
$739.99
$745.99
$799
$809
$899
$999
$1033.99
$1096.02
$1098.42
$1099
$1099
$1101.83
$1102.66
$1110.14
$1112.91
$1114.55
$1123.87
$1123.87
$1124.2
$1133.82
$1133.91
$1139.54
$1140.62
$1143.4
$1144.2
$1144.4
$1149
$1149
$1149.73
$1154.04
$1170.1
$1178.19
$1178.99
$1179
$1187.88
$1187.98
$1199
$1199
$1199.73
$1203.41
$1212.16
$1221.58
$1223.99
$1235.49
$1238.37
$1239.2
$1244.99
$1259
$1260.13
$1271.06
$1273.11
$1281.99
$1294.74
$1299
$1310.39
$1311.99
$1326.83
$1333
$1337.28
$1338.37
$1341.22
$1347.78
$1349.23
$1362.24
$1366.32
$1381.13
$1399
$1399
$1769
$1769
$1799
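
The prices above are still strings, so they cannot be sorted or averaged numerically yet. A minimal sketch of converting them with `decimal.Decimal` follows. The helper name `parse_price` is illustrative, and the comma handling is an assumption, since none of the prices shown contain thousands separators.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a string such as '$295.99' or '$299' to a Decimal."""
    # Strip whitespace and the dollar sign; drop commas in case they appear.
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


# A few values copied from the output above.
raw_prices = ["$295.99", "$299", "$372.7", "$1033.99", "$1799"]
prices = [parse_price(p) for p in raw_prices]

print("cheapest:", min(prices))
print("most expensive:", max(prices))
print("average:", sum(prices) / len(prices))
```

`Decimal` is used rather than `float` so that monetary values are not subject to binary rounding error.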


In [22]:
items = soup.find_all("a", {"class": "title"})


In [23]:
print(items)

[<a class="title" href="/test-sites/e-commerce/allinone/product/60" title="Asus VivoBook X441NA-GA190">Asus VivoBook...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/61" title="Prestigio SmartBook 133S Dark Grey">Prestigio Smar...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/62" title="Prestigio SmartBook 133S Gold">Prestigio Smar...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/32" title="Aspire E1-510">Aspire E1-510</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/63" title="Lenovo V110-15IAP">Lenovo V110-15...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/64" title="Lenovo V110-15IAP">Lenovo V110-15...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/65" title="Hewlett Packard 250 G6 Dark Ash Silver">Hewlett Packar...</a>, <a class="title" href="/test-sites/e-commerce/allinone/product/66" title="Acer Aspire 3 A315-31 Black">Acer Aspire 3...</a>, <a class="ti

In [24]:
count = 0
for item in items:
    print(item.get_text())
    count = count + 1

Asus VivoBook...
Prestigio Smar...
Prestigio Smar...
Aspire E1-510
Lenovo V110-15...
Lenovo V110-15...
Hewlett Packar...
Acer Aspire 3...
Acer Aspire A3...
Acer Aspire ES...
Acer Aspire 3...
Acer Aspire 3...
Asus VivoBook...
Asus VivoBook...
Lenovo ThinkPa...
Acer Aspire 3...
Lenovo V110-15...
Acer Aspire ES...
Asus VivoBook...
Packard 255 G2
Asus EeeBook R...
Acer Aspire 3...
Acer Aspire ES...
Acer Extensa 1...
Acer Aspire ES...
Lenovo V110-15...
Acer Aspire A3...
Lenovo V110-15...
Asus VivoBook...
Acer Aspire ES...
Lenovo V510 Bl...
Acer Aspire ES...
Lenovo V510 Bl...
Acer Swift 1 S...
Dell Vostro 15
Acer Aspire 3...
Dell Vostro 15...
Lenovo V510 Bl...
HP 250 G3
Acer Spin 5
HP 350 G1
Aspire E1-572G
Pavilion
Acer Aspire A5...
Dell Inspiron...
Asus VivoBook...
ProBook
Inspiron 15
Asus ROG STRIX...
Acer Nitro 5 A...
Asus ROG STRIX...
Lenovo ThinkPa...
ThinkPad Yoga
Lenovo ThinkPa...
Dell Inspiron...
MSI GL72M 7RDX
MSI GL72M 7RDX
Asus ROG Strix...
Dell Latitude...
Dell Latitude...
Lenovo

In [25]:
print(count)

117
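
The link text printed above is truncated (`Asus VivoBook...`), but the tags shown by `print(items)` carry the full name in their `title` attribute and a relative link in `href`. The sketch below reads both from two anchors copied from that output and uses `urljoin` to build absolute URLs. The dictionary keys are arbitrary names chosen for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

# Two anchors copied from the print(items) output above.
html = """
<a class="title" href="/test-sites/e-commerce/allinone/product/60" title="Asus VivoBook X441NA-GA190">Asus VivoBook...</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/32" title="Aspire E1-510">Aspire E1-510</a>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for link in soup.find_all("a", class_="title"):
    products.append({
        "short_name": link.get_text(strip=True),   # visible, possibly truncated text
        "full_name": link.get("title", ""),        # full name from the title attribute
        "url": urljoin(PAGE_URL, link["href"]),    # relative href -> absolute URL
    })

for product in products:
    print(product)
```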



## Handling Errors in Web Scraping

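When making HTTP requests, errors such as timeouts, connection failures, and invalid responses can occur. Setting a `timeout` and calling `raise_for_status()` surfaces these as exceptions, which can then be handled one type at a time.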


In [26]:

try:
    response = requests.get('https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops', timeout=5)
    response.raise_for_status()  # Raises an error for 4xx and 5xx status codes
    print("Request successful:", response.status_code)
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except requests.exceptions.Timeout as timeout_err:  # Checked before ConnectionError so that connect timeouts are reported as timeouts
    print(f"Timeout error occurred: {timeout_err}")
except requests.exceptions.ConnectionError as conn_err:
    print(f"Connection error occurred: {conn_err}")
except requests.exceptions.RequestException as req_err:
    print(f"An error occurred: {req_err}")


Request successful: 200
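
Transient failures such as timeouts often succeed on a second attempt, so catching the exception is commonly paired with a retry. The following is a generic sketch of retrying with exponential backoff, not something the cell above does. The function name `fetch_with_retries` and its parameters are assumptions, and a simulated failure stands in for a real request so the example runs offline.

```python
import time


def fetch_with_retries(fetch, retry_on=(Exception,), attempts=3,
                       base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` is exhausted.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as err:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.2f}s")
            sleep(delay)


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "page content"


result = fetch_with_retries(flaky_fetch, retry_on=(ConnectionError,), base_delay=0.01)
print(result)
```

In this notebook, `fetch` could be `lambda: requests.get(url, timeout=5)` with `retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`.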



## Handling Pagination

Many websites display data across multiple pages. To scrape multiple pages, we iterate over paginated URLs.


In [27]:
import requests
import json
import csv
from bs4 import BeautifulSoup

# Base URL for paginated products
base_url = 'https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page='
all_products = []

# Loop through pages
for page in range(1, 20):  # Pages 1-19; adjust the range for more pages
    url = base_url + str(page)
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract product names and prices
        titles = soup.find_all("a", {"class": "title"})
        prices = soup.find_all("h4", {"class": "price float-end card-title pull-right"})

        # Store extracted data
        for title, price in zip(titles, prices):
            all_products.append({"product": title.text, "price": price.text})

        print(f"Scraped page {page} successfully.")
    else:
        print(f"Failed to scrape page {page}")

# Save data as JSON
with open('scraped_products.json', 'w', encoding='utf-8') as json_file:
    json.dump(all_products, json_file, indent=4)

# Save data as CSV
with open('scraped_products.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(all_products)

print("Data saved successfully in JSON and CSV formats!")




Scraped page 1 successfully.
Scraped page 2 successfully.
Scraped page 3 successfully.
Scraped page 4 successfully.
Scraped page 5 successfully.
Scraped page 6 successfully.
Scraped page 7 successfully.
Scraped page 8 successfully.
Scraped page 9 successfully.
Scraped page 10 successfully.
Scraped page 11 successfully.
Scraped page 12 successfully.
Scraped page 13 successfully.
Scraped page 14 successfully.
Scraped page 15 successfully.
Scraped page 16 successfully.
Scraped page 17 successfully.
Scraped page 18 successfully.
Scraped page 19 successfully.
Data saved successfully in JSON and CSV formats!


In [28]:
import requests
import json
import csv
from bs4 import BeautifulSoup

base_url = 'https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page='
all_products = []

for page in range(1, 20):  # Adjust range as needed
    response = requests.get(base_url + str(page))
    if response.status_code != 200:
        print(f"Failed to scrape page {page}")
        continue

    soup = BeautifulSoup(response.text, 'html.parser')
    products = soup.select('.title')  # Select by class
    prices = soup.select('.price')  # Select by class

    all_products.extend({"product": p.text.strip(), "price": pr.text.strip()} for p, pr in zip(products, prices))
    print(f"Scraped page {page} successfully.")

# Save to JSON
with open('scraped_products.json', 'w', encoding='utf-8') as json_file:
    json.dump(all_products, json_file, indent=4)

# Save to CSV
with open('scraped_products.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(all_products)

print("Data saved in JSON and CSV!")


Scraped page 1 successfully.
Scraped page 2 successfully.
Scraped page 3 successfully.
Scraped page 4 successfully.
Scraped page 5 successfully.
Scraped page 6 successfully.
Scraped page 7 successfully.
Scraped page 8 successfully.
Scraped page 9 successfully.
Scraped page 10 successfully.
Scraped page 11 successfully.
Scraped page 12 successfully.
Scraped page 13 successfully.
Scraped page 14 successfully.
Scraped page 15 successfully.
Scraped page 16 successfully.
Scraped page 17 successfully.
Scraped page 18 successfully.
Scraped page 19 successfully.
Data saved in JSON and CSV!
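
Both loops above use a fixed `range(1, 20)` and send requests back to back. An alternative sketch is to keep requesting pages until one comes back empty, pausing between requests. It rests on the assumption that a page past the end of the listing yields no products, which should be verified for the target site. The names `scrape_pages` and `fake_fetch` are illustrative, and an in-memory stand-in replaces the HTTP calls so the example runs offline.

```python
import time


def scrape_pages(fetch_products, max_pages=50, delay=1.0, sleep=time.sleep):
    """Collect products page by page until a page comes back empty.

    fetch_products(page) must return a list of product dicts for that page.
    """
    all_products = []
    for page in range(1, max_pages + 1):
        products = fetch_products(page)
        if not products:
            break  # an empty page is taken to mean the listing has ended
        all_products.extend(products)
        sleep(delay)  # pause between requests to avoid overloading the server
    return all_products


# Stand-in data source with three pages, used instead of real HTTP calls.
FAKE_SITE = {
    1: [{"product": "Laptop A", "price": "$295.99"},
        {"product": "Laptop B", "price": "$299"}],
    2: [{"product": "Laptop C", "price": "$306.99"}],
    3: [{"product": "Laptop D", "price": "$321.94"}],
}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


products = scrape_pages(fake_fetch, delay=0)
print(len(products), "products collected")
```

With the real site, `fetch_products` would wrap the request and parsing steps from the cells above and return the list of `{"product": ..., "price": ...}` dictionaries for one page.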
