# Tutorial Web Scraping 1: A first web scraper
This tutorial will guide you through the basics of web scraping using Python. You’ll learn how to:

- Request HTML from a webpage
- Use BeautifulSoup to parse the HTML
- Extract specific content, like titles and paragraphs

By the end of this tutorial, you’ll be able to build a working web scraper and handle common errors that occur when scraping.


## What is web scraping about?

Web scraping is the process of automatically extracting data from websites. With Python, we can do this efficiently using libraries like `urllib` and `BeautifulSoup`.

The basic workflow for scraping a web page consists of:

- Sending a request to the server to retrieve the web page.
- Reading the content of the page.
- Parsing the HTML to extract the desired information.

Let’s break down the process with an example.

Note on the Jupyter environment:
- Markdown cells like this are used for explanations.
- Code cells (gray boxes like the ones below) are where you write and run Python code. Edit the code in a cell, then run the whole cell by clicking the Run button (the icon below the menu) or by pressing Shift + Enter.
- To clear a cell’s output after running it, choose Cell > Current Outputs > Clear in the menu.



### Step 1: Fetching HTML content using urlopen
The `urlopen` function from Python’s `urllib.request` module allows us to send an HTTP request and retrieve the HTML content of a page. Try to retrieve the content of “page1” by executing the code in the cell below.

In [None]:
from urllib.request import urlopen

# Send a request to the server and fetch the HTML content of the page
html = urlopen('http://pythonscraping.com/pages/page1.html')

# Print the raw HTML content
print(html.read())

**Explanation:**
- `urlopen` sends a request to the provided URL and returns a response object for the web page.
- `html.read()` returns the HTML content in its raw form: a `bytes` object containing the full markup of the page, which is why the printed output starts with `b'`.
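
Because `html.read()` returns `bytes` rather than a `str`, the raw content usually has to be decoded before it can be treated as text. The sketch below illustrates this without touching the network: the `raw` value is a hypothetical stand-in for what `html.read()` might return, and UTF-8 is an assumed encoding, not something the tutorial page guarantees.

```python
# Hypothetical stand-in for the bytes returned by html.read()
# (hard-coded here so the example runs without a network connection)
raw = b"<html><head><title>A Useful Page</title></head><body><h1>Caf\xc3\xa9</h1></body></html>"

print(type(raw).__name__)   # bytes

# Decode the bytes into a regular Python string (assumes the page is UTF-8)
text = raw.decode("utf-8")

print(type(text).__name__)  # str
print(text)
```

With a real response, the encoding declared by the server (if any) can be read with `html.headers.get_content_charset()`. BeautifulSoup accepts the raw bytes directly, so decoding by hand is mainly useful when you want to inspect or print the page yourself.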

### Step 2: Parsing HTML with BeautifulSoup
The BeautifulSoup library makes it easy to parse and navigate HTML or XML files. Let’s use it to extract specific data from the HTML.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the HTML content of the page
html = urlopen('http://pythonscraping.com/pages/page1.html')

# Parse the HTML using BeautifulSoup
bs = BeautifulSoup(html.read(), 'html.parser')

# Extract and print the content of the <h1> tag
print(bs.h1)

**Question:** What do you find in the variable `bs`?

**Explanation:**
- `BeautifulSoup` is used to parse the raw HTML content and convert it into a structured format.
- `bs.h1` accesses the first `<h1>` tag on the page.
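
To see what the parsed object offers, it can help to experiment on a small, hard-coded page. The sketch below is only an illustration: `sample_html` is a made-up document, not the content of page1, and it contains two `<h1>` tags to show that `bs.h1` returns only the first one.

```python
from bs4 import BeautifulSoup

# Made-up HTML used instead of a downloaded page (assumption for illustration)
sample_html = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <p>First paragraph.</p>
    <h1>A Second Heading</h1>
  </body>
</html>
"""

bs = BeautifulSoup(sample_html, "html.parser")

first_h1 = bs.h1
print(type(bs).__name__)       # the parsed document is a BeautifulSoup object
print(first_h1)                # the whole tag, markup included
print(first_h1.name)           # the tag name
print(first_h1.get_text())     # only the text inside the tag
print(len(bs.find_all("h1")))  # find_all returns every match, not just the first
```

Printing `bs.h1` shows the tag with its markup, while `get_text()` returns only the text inside it.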

## Exercise 1: Modify the scraping code
In this exercise, you’ll build on the existing code:

- Change the URL to scrape a different page (try using Wikipedia).
- Modify the code to extract the content of the `<title>` tag instead of the `<h1>` tag.
- Print the first `<p>` tag as well.

*Hint:* You can access the title tag using `bs.title`.

In [None]:
# Import the required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the HTML content of a different page (modify the URL as needed)
html = urlopen('http://example.com')

# Parse the HTML using BeautifulSoup
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs)

# Print the content of the <title> tag
print("Title:", bs.title.get_text())

# Print the content of the first <p> tag
print("First paragraph:", bs.p.get_text())


### Explaining errors and error handling
When scraping websites, it's common to encounter errors:

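- `HTTPError`: Raised when the server responds with an error status code (e.g., 404 when the page can’t be found, or 500 for an internal server error).
- `URLError`: Raised when the server can’t be reached at all (e.g., the server is down or the domain name is incorrect).

`HTTPError` is a subclass of `URLError`, so the `except HTTPError` clause has to come before `except URLError`; otherwise the more general clause would catch both kinds of error.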

Let’s explore how to handle these errors gracefully.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Error handling with try-except blocks
try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    print("The server returned an HTTP error:", e)
except URLError as e:
    print("The server could not be found:", e)
else:
    print("HTML content retrieved successfully!")
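
Both error paths can also be exercised without relying on a real server. The sketch below is one way to do that, under stated assumptions: `fetch_status`, `fake_404`, and `fake_unreachable` are hypothetical helpers written for this illustration, and the `opener` parameter exists only so that a fake function can stand in for `urlopen`.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_status(url, opener=urlopen):
    """Try to open url and return a short description of what happened."""
    try:
        opener(url)
    except HTTPError as e:
        # The server answered, but with an error status code
        return f"HTTP error {e.code}"
    except URLError as e:
        # The server could not be reached at all
        return f"URL error: {e.reason}"
    else:
        return "ok"

# Hypothetical stand-ins for urlopen that simulate the two failure modes
def fake_404(url):
    raise HTTPError(url, 404, "Not Found", None, None)

def fake_unreachable(url):
    raise URLError("name resolution failed")

print(fetch_status("http://example.com/missing", opener=fake_404))
print(fetch_status("http://unreachable.invalid", opener=fake_unreachable))

# HTTPError derives from URLError, which is why it is caught first
print(issubclass(HTTPError, URLError))
```

Called without the `opener` argument, `fetch_status` uses the real `urlopen` and performs an actual request.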


### Writing functions to reuse code
To make your scraping process more efficient, you can write functions that encapsulate the scraping logic. This way, you can reuse the same code for multiple pages without rewriting it.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# Function to retrieve the <h1> title of a webpage
def getTitle(url):
    # Return None if the page cannot be fetched (HTTP error or unreachable server)
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # bs.body is None when the page has no <body>, so bs.body.h1 raises AttributeError
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        return None
    # title is None here if the body contains no <h1> tag
    return title

# Call the function with a URL
title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print("Title:", title.get_text())
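
One possible refinement, sketched here as an assumption and not as part of the tutorial's code, is to keep the parsing step in its own function so that it can be reused and tried out on plain strings. The name `extract_h1` and the sample pages are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_h1(html_text):
    """Return the text of the first <h1> in html_text, or None if there is none."""
    bs = BeautifulSoup(html_text, "html.parser")
    heading = bs.find("h1")
    if heading is None:
        return None
    # strip=True removes surrounding whitespace from the extracted text
    return heading.get_text(strip=True)

# Made-up pages standing in for downloaded HTML
pages = {
    "with_heading": "<html><body><h1> An Interesting Title </h1></body></html>",
    "without_heading": "<html><body><p>No heading here.</p></body></html>",
}

# The same function is reused for every page
for name, html_text in pages.items():
    print(name, "->", extract_h1(html_text))
```

Because the function only needs a string, the same logic works on content fetched with `urlopen` or on HTML saved to disk.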


## Exercise 2: Scrape a Wikipedia page
Now that you know the basics of error handling and how to use BeautifulSoup, let’s scrape a more complex page!

Task:

- Use the URL of any Wikipedia page (e.g., the article on split screen) and scrape its title.
- Extract and print the first paragraph (the first `<p>` tag) on that page.

In [None]:
# Your task: Modify the code below to scrape a new URL and extract the <title> and <p> tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the HTML content of a Wikipedia page (modify the URL as needed)
html = urlopen('https://en.wikipedia.org/wiki/Split_screen_(video_production)')

# Parse the HTML using BeautifulSoup
bs = BeautifulSoup(html.read(), 'html.parser')

# Print the content of the <title> tag
print("Title:", bs.title.get_text())

# Print the content of the first <p> tag
print("First paragraph:", bs.p.get_text())
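
On more complex pages, the first `<p>` tag is not always the first paragraph a reader sees; it may be empty or contain only whitespace. The sketch below shows one way to skip such paragraphs. The helper `first_nonempty_paragraph` is hypothetical, and `sample_html` (including its class name) is made up for illustration, not copied from Wikipedia.

```python
from bs4 import BeautifulSoup

# Made-up HTML in which the first <p> tag contains only whitespace
sample_html = """
<div id="content">
  <p class="mw-empty-elt">
  </p>
  <p>This is the first paragraph with actual text.</p>
  <p>This is the second paragraph.</p>
</div>
"""

def first_nonempty_paragraph(bs):
    """Return the text of the first <p> that contains visible text, or None."""
    for p in bs.find_all("p"):
        text = p.get_text(strip=True)
        if text:
            return text
    return None

bs = BeautifulSoup(sample_html, "html.parser")

print(repr(bs.p.get_text()))         # only whitespace
print(first_nonempty_paragraph(bs))  # the first paragraph with real content
```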


### Conclusion
In this tutorial, you’ve learned:

- How to make web requests using `urlopen`.
- How to parse HTML with `BeautifulSoup`.
- How to handle common web scraping errors.
- How to extract specific content, like titles and paragraphs.

Feel free to experiment with scraping different websites and extracting different types of information!

### Additional exercises
- Modify your scraper to find all image tags on a page using `bs.find_all('img')` and print their URLs.
- Try scraping a different Wikipedia page or another website of your choice.
- Write a function to check if an element exists before trying to scrape it.
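
As a possible starting point for the first and last exercises, the sketch below works on a hard-coded page so it runs offline. The helper `element_exists` and the sample HTML are assumptions made for illustration. Note that `img.get("src")` returns `None` when the attribute is missing, whereas `img["src"]` would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second image has no src attribute
sample_html = """
<html><body>
  <img src="/img/logo.png" alt="Logo">
  <img alt="No source attribute">
  <img src="https://example.com/photo.jpg">
</body></html>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# Collect the src of every image, skipping images without one
image_urls = [img.get("src") for img in bs.find_all("img") if img.get("src")]
print(image_urls)

def element_exists(bs, tag_name):
    """Return True if at least one tag with the given name is in the document."""
    return bs.find(tag_name) is not None

print(element_exists(bs, "img"))
print(element_exists(bs, "table"))
```

Relative URLs such as `/img/logo.png` still need to be combined with the address of the page (for example with `urllib.parse.urljoin`) before they can be downloaded.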