<img src="LaeCodes.png" 
     align="center" 
     width="100" />

# Web Scraping

Web scraping in Python involves extracting data from websites by making HTTP requests to the web pages and then parsing the HTML content to retrieve the desired information. Python offers several libraries that make web scraping efficient and straightforward. Two of the most popular libraries for web scraping in Python are BeautifulSoup and Scrapy. <br>

**Key Steps in Web Scraping** <br>
1.	Send an HTTP request to the website <br>
2.	Retrieve the HTML content of the page <br>
3.	Parse the HTML content to extract the required data <br>
4.	Store the extracted data in a suitable format <br>
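
As a minimal sketch of step 4, assuming the extracted data has already been collected into a list of dictionaries, the standard library can write it to JSON or CSV. The function names and the record fields in the usage note are hypothetical and chosen only for illustration.

```python
import csv
import json


def save_as_json(records, path):
    """Write a list of dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


def save_as_csv(records, path):
    """Write a list of dictionaries to a CSV file, one row per record."""
    if not records:
        return  # Nothing to write
    with open(path, "w", encoding="utf-8", newline="") as handle:
        # Column names are taken from the keys of the first record
        writer = csv.DictWriter(handle, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
```

For example, `save_as_json([{"title": "First page", "url": "https://example.com/1"}], "pages.json")` would write a single record. JSON suits nested data, while CSV is convenient when every record has the same flat set of fields.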

**Using BeautifulSoup** <br>
BeautifulSoup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. <br>
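
To make "iterating, searching, and modifying the parse tree" concrete, here is a small sketch that works on an inline HTML string, so it needs no network access. The HTML snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page
HTML = """
<html><body>
  <h1 id="title">Products</h1>
  <ul>
    <li class="item"><a href="/a">Apple</a></li>
    <li class="item"><a href="/b">Banana</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Iterating: loop over every <li class="item"> and collect its text
item_names = [li.get_text(strip=True) for li in soup.find_all("li", class_="item")]

# Searching: use a CSS selector to collect the link targets
links = [a["href"] for a in soup.select("li.item a")]

# Modifying: replace the text of the heading in the parse tree
soup.find("h1", id="title").string = "Fruit"
heading = soup.h1.get_text()

print(item_names, links, heading)
```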

**Installation** <br>
First, install BeautifulSoup (published on PyPI as `beautifulsoup4`) and requests, for example with `pip install beautifulsoup4 requests`:
![image-3.png](attachment:image-3.png)

**Basic Example**

```python
import requests
from bs4 import BeautifulSoup

# Step 1: Send an HTTP request to the website
url = 'http://google.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on 4xx/5xx responses

# Step 2: Retrieve the HTML content of the page
html_content = response.text

# Step 3: Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Step 4: Extract data
title = soup.title.text
print("Title:", title)

# Example: Extracting all paragraphs
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)
```

Output:

```
Title: Google
© 2024 - Privacy - Termini
```


**Using Scrapy** <br>
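Scrapy is an open-source and collaborative web crawling framework for Python. Unlike BeautifulSoup, which only parses HTML, Scrapy also handles sending requests, following links, and exporting results, which makes it better suited to large or multi-page scraping tasks. <br>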

**Installation** <br>
To install Scrapy, run `pip install scrapy`:
![image.png](attachment:image.png) <br>

**Basic Example** <br>
Create a Scrapy project with `scrapy startproject`; the spider path used below implies the project is named `example`:
![image-3.png](attachment:image-3.png) <br>

**Generate a spider:** <br>
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages. <br>
![image-5.png](attachment:image-5.png) <br>

**Edit the spider (example/spiders/test.py):** <br>
![image-7.png](attachment:image-7.png) <br>

**Run the spider:** <br>
![image-9.png](attachment:image-9.png) <br>
This will create a file named output.json with the scraped data. <br>

**Handling Dynamic Content** <br>
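Many modern websites load content dynamically using JavaScript. To handle such sites, you might need a browser automation tool like Selenium, which can drive a real browser (optionally in headless mode) so that the JavaScript runs before you extract the data.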

**Using Selenium** <br>
Selenium automates browsers and can be used to interact with dynamic web pages. <br>

**Installation** <br>
Install Selenium with pip, for example `pip install selenium`: <br>
![image-3.png](attachment:image-3.png) <br>

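Selenium drives the browser through a WebDriver (e.g., ChromeDriver for Google Chrome) whose version must match your browser. Selenium 4.6 and later can download a matching driver automatically through Selenium Manager; with older versions you need to download it yourself. <br>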

**Basic Example** <br>
Here is an example of using Selenium to scrape a dynamically loaded website: <br>
![image-2.png](attachment:image-2.png)

**Best Practices**
<br>

1. Respect website policies: Always check the website’s robots.txt file to see which parts of the site are allowed to be scraped. <br>
2. Rate limiting: Implement delays between requests to avoid overloading the website's server. <br>
3. User-Agent: Use a legitimate User-Agent header to mimic a real browser request. <br>
4. Error handling: Implement error handling to manage different HTTP response statuses and other potential issues. <br>
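
The four practices above can be combined in one small helper. This is a sketch under stated assumptions, not a complete client: it uses the standard library's `urllib` instead of requests so that it runs without extra installs, and the `USER_AGENT` value, the one-second delay, and the function names are hypothetical choices.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/0.1)"  # Hypothetical value
MIN_DELAY_SECONDS = 1.0  # Assumed pause before each request


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_politely(url, rules, opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page body as text, or None if disallowed or the request fails."""
    # 1. Respect website policies: skip URLs that robots.txt disallows
    if not rules.can_fetch(USER_AGENT, url):
        return None

    # 2. Rate limiting: pause before every request
    sleep(MIN_DELAY_SECONDS)

    # 3. User-Agent: send an explicit header instead of the library default
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

    # 4. Error handling: report HTTP errors and connection problems
    try:
        with opener(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        print(f"HTTP {error.code} for {url}")
    except urllib.error.URLError as error:
        print(f"Could not reach {url}: {error.reason}")
    return None
```

In real use, the rules would usually be loaded from the site itself with `RobotFileParser.set_url(...)` followed by `read()`. The `opener` and `sleep` parameters exist only so the function can be exercised without network access or real delays.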

Web scraping is a powerful tool for extracting information from the web, but it should be used responsibly. Python, with libraries like BeautifulSoup, Scrapy, and Selenium, provides robust options to perform web scraping tasks efficiently.
