# Web Scraping Tutorial

This tutorial covers the basics of web scraping with Python.

## What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It can be done using various tools and libraries in Python, such as BeautifulSoup and Scrapy.

## Installing Required Libraries

For this tutorial, we will use `requests` to fetch web pages and `BeautifulSoup` (installed as the `beautifulsoup4` package) to parse HTML content. Install both libraries using pip:

```bash
pip install requests
pip install beautifulsoup4
```

## Fetching a Web Page

Use the `requests` library to fetch the content of a web page.

In [None]:
import requests

# URL of the web page you want to scrape
url = 'https://en.wikipedia.org/wiki/Artificial_intelligence'

# Fetch the content of the web page (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Print the status code to check if the request was successful (200 means OK)
print(response.status_code)

# Print the first 500 characters of the content
print(response.text[:500])

## Parsing HTML Content

Use `BeautifulSoup` to parse the HTML content of the web page.

In [None]:
from bs4 import BeautifulSoup

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(response.text, 'html.parser')

# Print the text of the page's <title> element
# (printing soup.title alone would show the whole tag, including the markup)
print(soup.title.string)

# Print all paragraph tags
for p in soup.find_all('p'):
    print(p.text)

## Extracting Specific Data

Extract specific data from the web page using BeautifulSoup.

In [None]:
# Extract all links from the web page
links = soup.find_all('a')

# Print the URLs of the links
for link in links:
    print(link.get('href'))
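
Many `href` values on a page are relative paths or in-page fragments rather than full URLs. The following is a minimal sketch of resolving them against the page URL with the standard library's `urljoin`. The sample `href` values and the helper name `absolute_links` are made up for illustration; in practice the values would come from `link.get('href')` as in the loop above.

```python
from urllib.parse import urljoin

# Assumed page URL (the one fetched earlier) and a made-up sample of href values
base_url = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
sample_hrefs = ['/wiki/Machine_learning', '#History', 'https://example.com/page', None]


def absolute_links(base, hrefs):
    """Resolve href values against base, skipping missing and fragment-only ones."""
    resolved = []
    for href in hrefs:
        # Skip <a> tags without an href and links that only jump within the page
        if not href or href.startswith('#'):
            continue
        resolved.append(urljoin(base, href))
    return resolved


for link_url in absolute_links(base_url, sample_hrefs):
    print(link_url)
```

Absolute URLs pass through `urljoin` unchanged, so the helper can be applied to every link without checking its form first.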

## Saving Data to a File

Save the extracted data to a file for later use.

In [None]:
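# Write one URL per line; an explicit encoding avoids platform-dependent defaults
with open('links.txt', 'w', encoding='utf-8') as file: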
    for link in links:
        href = link.get('href')
        if href is not None:
            file.write(href + '\n')
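
Pages often link to the same URL several times, so the saved file can contain duplicates. One possible refinement, sketched here with made-up sample data and a hypothetical helper named `save_unique_links`, is to drop repeats while keeping the order in which links first appeared. The output path in the temporary directory is only an assumption for the example.

```python
import os
import tempfile


def save_unique_links(hrefs, path):
    """Write each distinct non-empty href to path, one per line, in first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key and preserves order
    unique = list(dict.fromkeys(href for href in hrefs if href))
    with open(path, 'w', encoding='utf-8') as file:
        for href in unique:
            file.write(href + '\n')
    return unique


# Made-up sample standing in for the values returned by link.get('href')
sample_hrefs = ['/wiki/A', None, '/wiki/B', '/wiki/A']
output_path = os.path.join(tempfile.gettempdir(), 'unique_links_example.txt')

saved = save_unique_links(sample_hrefs, output_path)
print(f'Saved {len(saved)} links to {output_path}')
```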

## Handling Pagination

Some websites split their data across multiple pages. Handle pagination by requesting each page in turn, usually by changing a page number in the URL.

In [None]:
# Loop through the first 5 pages
for page in range(1, 6):
    # Placeholder URL pattern: replace it with the paginated URL of the site you are scraping
    url = f'https://en.wikipedia.org/wiki/{page}'
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all titles and paragraphs
    titles = soup.find_all('h1')  # Assuming titles are in <h1> tags
    paragraphs = soup.find_all('p')
    
    # Print titles
    print(f"Page {page} Titles:")
    for title in titles:
        print(title.get_text())
        
    print(f"\nPage {page} Paragraphs:")
    for paragraph in paragraphs:
        print(paragraph.get_text())
    print("\n" + "="*50 + "\n")
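
Many sites expose the page number as a query parameter rather than as part of the path. The sketch below builds such URLs with the standard library; the base URL `https://example.com/articles` and the parameter name `page` are assumptions, so check the real site's URLs before reusing them.

```python
from urllib.parse import urlencode


def page_urls(base_url, first_page, last_page, param='page'):
    """Yield one URL per page number, e.g. https://example.com/articles?page=2."""
    for number in range(first_page, last_page + 1):
        # urlencode escapes the parameter name and value safely
        yield f'{base_url}?{urlencode({param: number})}'


# Hypothetical listing URL used only for illustration
for page_url in page_urls('https://example.com/articles', 1, 3):
    print(page_url)
```

Each generated URL can then be passed to `requests.get` in place of the placeholder URL in the loop above.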

## Extracting Data from Tables

Web pages often contain data in HTML tables. You can extract this data using BeautifulSoup.

In [None]:
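# Extract data from the first table on the page
table = soup.find('table')

if table is None:
    # soup.find returns None when the page has no <table> element
    print('No table found on this page')
else:
    for row in table.find_all('tr'):
        # Header rows use <th> cells, data rows use <td> cells
        cols = row.find_all(['th', 'td'])
        print([col.get_text(strip=True) for col in cols])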
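
To keep table data for later use, the text of each row's cells can be written out as CSV. This is a minimal sketch using the standard `csv` module; the `rows` list is made-up sample data standing in for cell text collected from the `<tr>` elements, and `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample: one inner list per table row, one string per cell
rows = [
    ['Item', 'Count'],
    ['apples', '3'],
    ['pears, green', '5'],
]


def rows_to_csv(rows):
    """Return the rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(rows))
```

To write straight to disk instead, pass a file opened with `newline=''` and an explicit encoding to `csv.writer`.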

## Handling JavaScript-Rendered Content

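Some web pages build their content with JavaScript after the initial HTML has loaded, so the response returned by `requests` does not contain the data. Scraping such pages requires a tool that can execute JavaScript, such as Selenium or Scrapy with Splash.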

In [None]:
# Example using Selenium to handle JavaScript-rendered content
# Install Selenium using pip: pip install selenium
from selenium import webdriver

# Set up the WebDriver (e.g., using Chrome)
driver = webdriver.Chrome()

try:
    # Navigate to the web page
    driver.get('http://example.com')

    # Get the rendered page source and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Extract data as usual
    print(soup.title.text)
finally:
    # Close the WebDriver even if an error occurred above
    driver.quit()

## Using Scrapy for Advanced Web Scraping

Scrapy is a powerful web scraping framework for Python. It allows you to define spiders to crawl and scrape websites.

In [None]:
# Example of a simple Scrapy spider
# Save this code in a file named my_spider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

# Run the spider using the command: scrapy runspider my_spider.py

## Avoiding Getting Blocked

When scraping websites, it’s important to follow best practices to avoid getting blocked. Here are some tips:

1. **Respect `robots.txt`:** Check the website's `robots.txt` file to see which parts of the site are allowed to be scraped.
2. **Throttle your requests:** Do not overload the server with too many requests in a short period. Use delays and random intervals between requests.
3. **Use User-Agent headers:** Some websites block requests that do not come from a browser. Set the User-Agent header to mimic a real browser.

In [None]:
# Example of setting headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
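
The first two tips can be sketched with the standard library alone. The example below parses an inline, made-up `robots.txt` so that it runs offline; against a real site you would call `set_url()` and `read()` on the parser instead. The user agent name `my-scraper`, the `example.com` URLs, and the helper `polite_delay` are all assumptions for illustration.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used so the example runs without network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random delay in seconds to wait between requests."""
    return random.uniform(min_seconds, max_seconds)


parser = build_parser(ROBOTS_TXT)
for path in ['/wiki/Python', '/private/data']:
    candidate = 'https://example.com' + path
    if parser.can_fetch('my-scraper', candidate):
        print('allowed:', candidate)
        # In a real scraper, wait before sending the request:
        # time.sleep(polite_delay())
    else:
        print('disallowed:', candidate)
```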

## Error Handling

Implement error handling to make your scraper more robust and handle exceptions gracefully.

In [None]:
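# Example of error handling
try:
    response = requests.get(url, timeout=10)
    # Raise an HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(f'HTTP error occurred: {err}')
except requests.exceptions.RequestException as err:
    # Covers connection errors, timeouts, and other request failures
    print(f'Request failed: {err}')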
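
Temporary failures such as dropped connections often succeed on a second attempt. One common way to build on the error handling above is to retry with an increasing delay. This is a sketch rather than a definitive implementation: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the stand-in fetcher simulates failures so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff if it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as err:
            # Give up and re-raise once the last attempt has failed
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f'Attempt {attempt} failed ({err}); retrying in {delay:.1f}s')
            sleep(delay)


# Stand-in fetcher that fails twice and then succeeds
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return f'content of {url}'


# sleep is replaced with a no-op so the example finishes immediately
result = fetch_with_retries(flaky_fetch, 'http://example.com', sleep=lambda seconds: None)
print(result)
```

With `requests`, the `fetch` argument could be a small function that calls `requests.get(url, timeout=10)` followed by `raise_for_status()`.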

## Summary

In this tutorial, we've covered the basics of web scraping using Python, including fetching web pages, parsing HTML content, extracting specific data, saving data to a file, handling pagination, extracting data from tables, handling JavaScript-rendered content, using Scrapy for advanced scraping, avoiding getting blocked, and implementing error handling.