<a href="https://colab.research.google.com/github/misupova/BSSDH-2023/blob/main/3_web_scraping_intro_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with Python

Web scraping is a method used to extract data from websites. This process involves making HTTP requests to the specific URLs of the websites you are interested in, and then parsing the HTML data that is returned to extract the information you need.

In this notebook, we will be using Python to scrape data from several Wikipedia pages. We will guide you through the process of identifying the HTML tags and attributes that contain the data you want, and then writing a Python script to extract this data and save it to a CSV file.

## **Your main task is to run cells one by one, fill in input fields, and check whether the script is indeed performing scraping.**

As a reminder, basic HTML structure looks like this:

![HTML_structure](https://stuyhsdesign.files.wordpress.com/2015/09/basic-structure.png)

And HTML tag structure looks like this:

![HTML_tag](https://tutorial.techaltum.com/images/element.png)
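
To connect these diagrams to code, here is a minimal sketch that parses a small, hand-written HTML snippet with BeautifulSoup (one of the libraries introduced below) and reads a tag's name, attributes, and text. The HTML string and variable names are illustrative assumptions, not taken from the pages scraped later.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (illustrative only)
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main-heading">Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")
print(heading.name)    # the tag name
print(heading.attrs)   # the attributes, as a dictionary
print(heading.text)    # the text between the opening and closing tags
```

The same three pieces (tag name, attributes, text) are what the input fields later in this notebook ask you to identify.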

Before we start, we need to import the Python libraries that our web scraping script uses. If any of them is not installed, you can install it by running `!pip install package_name`.

Here are the libraries we will be using:

- `requests`: This library allows us to send HTTP requests.
- `BeautifulSoup`: This library is used for parsing the HTML data returned by our requests. It is installed as the `beautifulsoup4` package and imported from `bs4`.
- `pandas`: We will use this library to handle our data and save it to a CSV file.

Let's go ahead and import these libraries.

In [None]:
%%capture
# Install libraries
!pip install beautifulsoup4

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from ipywidgets import widgets
from ipywidgets import Layout
from IPython.display import display

Now that we have imported our libraries, we can move on to the next step: identifying the HTML tags and attributes that contain the data we want to scrape. In the following sections, you will be provided with input fields where you can specify these tags and attributes.


# Inputs

In this section, we will specify the input parameters for our web scraping script. These include the URLs we want to scrape data from, the HTML tags that contain the data we're interested in, and the name of the CSV file we want to save our data to.

You have interactive input fields for these parameters. These fields will allow you to easily change the input parameters without having to modify the code.

In HTML, an attribute is a property that provides additional information about an element. Attributes appear in the opening tag as name-value pairs of the general form `attribute_name="attribute_value"`. For example, in the tag `<img src="image.jpg" alt="My Image">`, `src` and `alt` are attributes, with `image.jpg` and `My Image` as their respective values.

In the context of web scraping, attributes can be very useful for narrowing down the elements that you want to extract data from.
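
As a rough illustration of narrowing a search by attribute, the sketch below compares looking up a tag by name alone with looking it up by name plus an attribute. The HTML snippet and the `id` values are made up for this example.

```python
from bs4 import BeautifulSoup

# Two <div> elements that differ only in their id attribute (illustrative only)
html = """
<div id="toc">Contents: History, Usage</div>
<div id="footer">Last edited today</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name alone: returns the first <div> in the document
first_div = soup.find("div")

# Tag name plus attribute: returns only the <div> whose id is "footer"
footer = soup.find("div", attrs={"id": "footer"})

print(first_div.text)
print(footer.text)
```

The table of contents inputs below work the same way: a tag name, an attribute name, and an attribute value together select one specific element.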

We have the following input fields:

- **URL List**: A comma-separated list of URLs that you want to scrape data from.
- **Title Tag**: The HTML tag that contains the title of the page.
- **Intro Tag**: The HTML tag that contains the introduction or summary of the page.
- **Intro Paragraphs**: The number of non-empty introduction paragraphs to scrape from the page.
- **Image Tags**: The HTML tag used for images. Separate multiple tags with commas.
- **Link Tags**: The HTML tag used for hyperlinks. Separate multiple tags with commas.
- **ToC Tag**: The HTML tag that contains the table of contents.
- **ToC Attribute Name**: The HTML attribute used to identify the table of contents tag.
- **ToC Attribute Value**: The value of the HTML attribute used to identify the table of contents tag.
- **CSV File Name**: The name of the CSV file to save the scraped data to.

In [None]:

# URL list input
url_list_input = widgets.Textarea(
    placeholder='Enter URLs, separated by commas',
    description='URL List:',
    layout=Layout(width='auto', height='100px')
)
url_list_input.style.description_width = 'initial'

# Title tag input
title_tag_input = widgets.Text(
    placeholder='Enter HTML tag for title',
    description='Title Tag:',
    layout=Layout(width='auto')
)
title_tag_input.style.description_width = 'initial'

# Intro tag input
intro_tag_input = widgets.Text(
    placeholder='Enter HTML tag for intro',
    description='Intro Tag:',
    layout=Layout(width='auto')
)
intro_tag_input.style.description_width = 'initial'

# Number of intro paragraphs
intro_paragraphs_input = widgets.IntSlider(
    min=0,
    max=10,
    step=1,
    description='Intro Paragraphs:',
)
intro_paragraphs_input.style.description_width = 'initial'

# Image tags input
image_tags_input = widgets.Text(
    placeholder='Enter HTML tag for images',
    description='Image Tags:',
    layout=Layout(width='auto')
)
image_tags_input.style.description_width = 'initial'

# Link tags input
link_tags_input = widgets.Text(
    placeholder='Enter HTML tag for links',
    description='Link Tags:',
    layout=Layout(width='auto')
)
link_tags_input.style.description_width = 'initial'

# Table of Contents Tag input
toc_tag_input = widgets.Text(
    placeholder='Enter HTML tag for table of contents',
    description='ToC Tag:',
    layout=Layout(width='auto')
)
toc_tag_input.style.description_width = 'initial'

# Table of Contents Attribute input
toc_attribute_input = widgets.Text(
    placeholder='Enter attribute name for ToC tag',
    description='ToC Attribute Name:',
    layout=Layout(width='auto')
)
toc_attribute_input.style.description_width = 'initial'

# Table of Contents Attribute Value input
toc_attribute_value_input = widgets.Text(
    placeholder='Enter attribute value for ToC tag',
    description='ToC Attribute Value:',
    layout=Layout(width='auto')
)
toc_attribute_value_input.style.description_width = 'initial'

# CSV file name input
csv_file_name_input = widgets.Text(
    value='scraped_data.csv',
    placeholder='Enter name of the CSV file',
    description='CSV File Name:',
    layout=Layout(width='auto')
)
csv_file_name_input.style.description_width = 'initial'

# Display the widgets
display(url_list_input, title_tag_input, intro_tag_input, intro_paragraphs_input,
        image_tags_input, link_tags_input,
        toc_tag_input, toc_attribute_input, toc_attribute_value_input,
        csv_file_name_input)

Now you can fill in the input fields above with your desired parameters. Once you're done, we can move on to the next section where we'll define the functions for our web scraping script.
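
Because several fields accept comma-separated values, stray spaces or a trailing comma can slip into the input. The sketch below shows one possible way to turn such a string into a clean list. The helper name `split_comma_separated` and the sample URLs are assumptions made for illustration.

```python
def split_comma_separated(raw_value):
    """Split a comma-separated string into a list of clean, non-empty items."""
    return [item.strip() for item in raw_value.split(",") if item.strip()]


# Sample input with extra spaces and a trailing comma (illustrative only)
raw_urls = "https://en.wikipedia.org/wiki/Web_scraping, https://en.wikipedia.org/wiki/HTML ,"
urls = split_comma_separated(raw_urls)
print(urls)

# An empty field yields an empty list rather than a list holding one empty string
print(split_comma_separated(""))
```

A plain `"".split(",")` returns `['']`, so filtering out blank items avoids passing an empty URL or tag name to the scraper.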

# Function Definitions

In this section, we will define several Python functions that will be used in our web scraping script.

These functions will be responsible for:

1. Sending a GET request to a URL and returning the response.
2. Parsing the HTML content of the response using BeautifulSoup.
3. Extracting the required data from the parsed HTML.
4. Writing the scraped data to a CSV file.

Let's go ahead and define these functions.
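
For the first step, a more defensive variant of the request might look like the sketch below. It is only one possible approach, not the function used in this notebook: the name `fetch_url_safely` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_url_safely(url, timeout_seconds=10):
    """Return the response for `url`, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout_seconds)
        response.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
        return response
    except requests.RequestException as error:
        print(f"Request failed for {url!r}: {error}")
        return None


# A string without a scheme is rejected by requests before any network traffic
result = fetch_url_safely("not-a-valid-url")
print(result)
```

Returning `None` on failure lets the calling code skip a bad URL and continue with the rest of the list.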


In [None]:

# Function to send a GET request to a URL
def fetch_url(url):
    # A timeout keeps the notebook from hanging if a server never responds
    response = requests.get(url, timeout=10)
    return response

# Function to parse HTML content
def parse_html(content):
    soup = BeautifulSoup(content, 'html.parser')
    return soup

# Function to extract data from parsed HTML
def extract_data(url, soup, title_tag, intro_tag, intro_paragraphs, image_tags, link_tags,
                 toc_tag, toc_attribute, toc_attribute_value):
    data = {}

    # Extract URL first
    data['url'] = url

    # Extract title
    title = soup.find(title_tag)
    if title:
        data['title'] = title.text.strip()

    # Extract table of contents
    toc = soup.find(toc_tag, attrs={toc_attribute: toc_attribute_value})
    if toc:
        data['toc'] = toc.text.strip()

    # Extract intro paragraphs
    intro_paras = soup.find_all(intro_tag)
    # Filter out empty paragraphs or paragraphs only containing whitespace
    intro_paras = [para for para in intro_paras if para.text.strip() != '']
    # Slicing is safe even when fewer paragraphs exist than were requested
    intro_paras = intro_paras[:intro_paragraphs]
    if intro_paras:
        data['intro'] = ' '.join([para.text.strip() for para in intro_paras])

    # Extract image URLs, skipping tags that have no src attribute
    image_urls = []
    for tag in image_tags:
        for img in soup.find_all(tag):
            src = img.get('src')
            if src:
                image_urls.append(src)
    data['image_urls'] = image_urls

    # Extract links, skipping anchors that have no href attribute
    links = []
    for tag in link_tags:
        for anchor in soup.find_all(tag):
            href = anchor.get('href')
            if href:
                links.append(href)
    data['links'] = links

    return data

# Function to write data to CSV
def write_to_csv(data, file_name):
    df = pd.DataFrame(data)
    df.to_csv(file_name, index=False)


# Web Scraping

Now that we've defined our input parameters and functions, we can move on to the main part of our script: the web scraping.

In this section, we will use the functions defined above to:

1. Send a GET request to each URL in our URL list.
2. Parse the HTML content of each response.
3. Extract the required data from the parsed HTML.
4. Save the scraped data to a CSV file.

Let's start the web scraping.


In [None]:
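# Convert the string input values into correct data types.
# Comma-separated fields are split, stripped of whitespace, and cleared of blank entries.
url_list = [url.strip() for url in url_list_input.value.split(',') if url.strip()]
title_tag = title_tag_input.value.strip()
intro_tag = intro_tag_input.value.strip()
intro_paragraphs = intro_paragraphs_input.value
image_tags = [tag.strip() for tag in image_tags_input.value.split(',') if tag.strip()]
link_tags = [tag.strip() for tag in link_tags_input.value.split(',') if tag.strip()]
csv_file_name = csv_file_name_input.value.strip()
toc_tag = toc_tag_input.value.strip()
toc_attribute = toc_attribute_input.value.strip()
toc_attribute_value = toc_attribute_value_input.value.strip()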

# Initialize an empty list to store the scraped data
data_list = []

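# Iterate over each URL in the URL list
for url in url_list:
    # Fetch the URL; report network errors instead of stopping the whole run
    try:
        response = fetch_url(url)
    except requests.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        continue
    if response.status_code == 200:
        # Parse the HTML content of the response
        soup = parse_html(response.content)
        # Extract the required data
        data = extract_data(url, soup, title_tag, intro_tag, intro_paragraphs, image_tags, link_tags, toc_tag, toc_attribute, toc_attribute_value)
        # Add the data to our data list
        data_list.append(data)
    else:
        # Make skipped pages visible instead of dropping them silently
        print(f"Skipping {url}: HTTP status {response.status_code}")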

# Convert the data list into a pandas DataFrame
df = pd.DataFrame(data_list)

# Display the DataFrame
display(df)

# Write the scraped data to a CSV file using the helper defined earlier
write_to_csv(data_list, csv_file_name)

print(f"Data saved to {csv_file_name}")



The web scraping process is now complete! The data has been written to a CSV file named according to the input you provided.
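
One detail worth knowing: the `image_urls` and `links` columns hold Python lists, and a CSV file stores each list as its text representation. The sketch below, which uses made-up rows and an in-memory buffer instead of a real file, shows one possible way to turn that text back into lists after loading. The names `df_example` and `df_loaded` are assumptions for illustration.

```python
import ast
import io

import pandas as pd

# Made-up rows shaped like the scraper's output (illustrative only)
df_example = pd.DataFrame([
    {"url": "https://example.org/a", "links": ["/wiki/X", "/wiki/Y"]},
    {"url": "https://example.org/b", "links": []},
])

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
df_example.to_csv(buffer, index=False)
buffer.seek(0)
df_loaded = pd.read_csv(buffer)

print(type(df_loaded.loc[0, "links"]))  # a string, no longer a list

# Convert the text representation back into real Python lists
df_loaded["links"] = df_loaded["links"].apply(ast.literal_eval)
print(df_loaded.loc[0, "links"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.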

In the next section, we'll discuss potential next steps and extensions to this web scraping script.


# Conclusion and Next Steps

Congratulations! You've successfully scraped a web page using Python.

In this notebook, we covered:

- The process of making HTTP requests to retrieve data from a website.
- How to parse the HTML response and identify the tags and attributes that contain the information we're interested in.
- How to extract this information and save it to a CSV file.

Remember, this is just the beginning. There's so much more you can do with web scraping. Here are a few ideas for next steps:

- Explore other HTML tags and attributes. The web is filled with all sorts of interesting data, and you never know what you might find!
- Improve the output format: clean the data, sort it by tags or attributes, or save it in different formats.
- Try scraping different websites. Each website is different, and you'll need to adjust your scraping strategy to match the structure of each site.
- Learn about handling more complex scenarios, like JavaScript-heavy sites, dealing with cookies and sessions, and handling different data formats like JSON and XML.
- Always remember to respect the terms of service of the sites you're scraping. Some sites may have rules against scraping, or require you to identify your bot in a specific way.
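
As one hedged illustration of the last point, many sites publish crawling rules in a `robots.txt` file, and Python's standard library can evaluate them. The sketch below parses a hypothetical `robots.txt` locally (no network access) and builds a descriptive `User-Agent` header. The rules, the bot name, and the contact address are all made up, and checking `robots.txt` does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network access is needed
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

bot_name = "my-course-scraper"  # hypothetical bot name

# Ask whether this bot may fetch each URL under the rules above
allowed_page = parser.can_fetch(bot_name, "https://example.org/wiki/Web_scraping")
blocked_page = parser.can_fetch(bot_name, "https://example.org/private/data")
print(allowed_page, blocked_page)

# A descriptive User-Agent header that could be passed as requests.get(url, headers=headers)
headers = {"User-Agent": f"{bot_name}/0.1 (contact: you@example.org)"}
print(headers)
```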



# Optional: Data Cleaning

After extracting data from the web, it's common to find that the data is not in the exact format we want. It might include extra spaces, irrelevant characters, HTML tags, or other elements that we don't need for our analysis or data processing. This is where data cleaning comes in.

Data cleaning is the process of identifying and correcting (or removing) errors in a dataset. It is an important and often necessary step in the data preprocessing pipeline, especially when dealing with unstructured data such as web-scraped content.

In the context of our web scraping task, data cleaning can involve processes such as:

- Removing HTML tags from the scraped text.
- Replacing multiple spaces with a single space.
- Removing leading and trailing whitespace.
- Standardizing or removing special characters.
- Ensuring links are complete and well-formed.

Keep in mind that the extent of data cleaning required can vary greatly depending on the data and the specific task or analysis you're planning to do. Therefore, this step is optional and customizable based on your needs. Here is a simple function to clean our scraped data.

In [None]:
import re

def clean_text(text):
    """Clean text by removing HTML tags, collapsing runs of whitespace into a single space, and trimming leading/trailing whitespace."""
    # Remove HTML tags (raw string avoids invalid-escape warnings)
    cleaned_text = re.sub(r'<[^<]+?>', '', text)
    # Replace runs of whitespace (spaces, tabs, newlines) with a single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    # Remove leading and trailing whitespace
    cleaned_text = cleaned_text.strip()
    return cleaned_text
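
The cleaning steps listed above also mention making links complete. Scraped `href` and `src` values are often relative to the page they came from, so one possible approach is to resolve them against the page URL with `urljoin` from the standard library. The sample links below are made up for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from
page_url = "https://en.wikipedia.org/wiki/Web_scraping"

# Made-up examples of the link forms commonly found in scraped HTML
raw_links = [
    "/wiki/HTML",                          # relative to the site root
    "//upload.wikimedia.org/example.png",  # protocol-relative
    "#History",                            # fragment on the same page
    "https://www.python.org/",             # already absolute
]

# Resolve every link against the page URL
complete_links = [urljoin(page_url, link) for link in raw_links]
for link in complete_links:
    print(link)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every scraped value.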