# Crawling Web Data
In this tutorial, you will learn how to write a web crawler to collect data from the Web.

A **web crawler**, also known as a spider or web spider, is an automated software program or bot that systematically browses the World Wide Web, typically for the purpose of indexing and gathering information from web pages. Web crawlers start by visiting one or more seed URLs, and then follow hyperlinks on those pages to discover and fetch additional URLs. This process continues recursively, allowing the crawler to traverse the web and build a comprehensive index of web pages.

Web crawlers play a crucial role in powering search engines like Google, Bing, and Yahoo, as well as other web-based applications such as web archiving, content aggregation, and data mining. They enable users to discover and access information across the vast expanse of the World Wide Web.

We will show, step by step, how to write a web crawler that collects Wikipedia pages related to a given topic.

### Import Libraries

In [2]:
import requests # A powerful HTTP library that allows you to send HTTP requests easily and efficiently.
from bs4 import BeautifulSoup # A Python library used for parsing HTML and XML documents and extracting data from them. 

### Sending an HTTP Request

In [3]:
# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Data science - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feat

### Use BeautifulSoup to Extract Information from the Webpage

In [4]:
# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract and print the title of the web page
    title = soup.title # Returns a node in the HTML document tree
    print(title) # html code of the node
    print(title.text) # .text extracts the text of the node

    # Find and print all links on the web page
    links = soup.find_all('a')
    for link in links:
        print(link) # Each link is an <a> node
        try:
            print(link['href']) # The href attribute is the destination URL
        except KeyError:
            pass # Some <a> tags have no href attribute
        print()
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)

<title>Data science - Wikipedia</title>
Data science - Wikipedia
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
#bodyContent

<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>
/wiki/Main_Page

<a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>
/wiki/Wikipedia:Contents

<a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>
/wiki/Portal:Current_events

<a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>
/wiki/Special:Random

<a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>
/wiki/Wikipedia:About

<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>
//en.wikipedia.org/wiki/Wikipedia:Contact_us

<a href="https://donate.wikimedia
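To make concrete what collecting the `<a>` nodes and their `href` values involves, here is a minimal sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. The `LinkCollector` class and the sample HTML are illustrative assumptions, not part of the tutorial's crawler.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip <a> tags without an href
                self.hrefs.append(href)


# Made-up snippet modeled on the kind of links printed above
sample_html = (
    '<a class="mw-jump-link" href="#bodyContent">Jump to content</a>'
    '<a href="/wiki/Main_Page"><span>Main page</span></a>'
    '<a name="no-href">No destination</a>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

The third `<a>` tag has no `href`, so it is skipped, which mirrors the missing-attribute case handled in the cell above.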

### Filter Links
Some links on a Wikipedia page point to other Wikipedia articles.

For example, the hyperlink `<a href="/wiki/Data_analysis" title="Data analysis">data analysis</a>` points to the Wikipedia article on data analysis.

Therefore, we can use the pattern "/wiki/{keyword}" to keep only the links that point to other Wikipedia pages.

In [5]:
# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and print all links on the web page
    links = soup.find_all('a', href=True) # href=True keeps only <a> tags that have an href attribute
    for link in links:
        if link['href'].startswith('/wiki/'):
            print(link['href']) # The href attribute is the destination URL
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Data_science
/wiki/Talk:Data_science
/wiki/Data_science
/wiki/Data_science
/wiki/Special:WhatLinksHere/Data_science
/wiki/Special:RecentChangesLinked/Data_science
/wiki/Wikipedia:File_Upload_Wizard
/wiki/Special:SpecialPages
/wiki/Information_science
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/Comet_NEOWISE
/wiki/Astronomical_survey
/wiki/Space_telescope
/wiki/Wide-field_Infrared_Survey_Explorer
/wiki/Interdisciplinary
/wiki/Statistics
/wiki/Scientific_computing
/wiki/Scientific_method
/wiki/Algorithm
/wiki/Knowledge
/wiki/Unstructured_data
/wiki/Statistics
/wiki/Data_analysis
/wiki/Informatics
/wiki/Scientific_method
/wiki

Many of these links contain a colon (:). They point to special pages in namespaces such as "Special:", "Help:", and "File:" rather than to articles, so we need to filter them out.

There is also a link, "/wiki/Main_Page", that points to the Wikipedia homepage. We don't need it either.
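The filtering rules described so far can be collected in one small predicate, which makes them easy to check on a handful of sample links. This is a sketch; the function name `is_article_link` and the sample list are assumptions made for illustration.

```python
def is_article_link(href):
    """Return True if href looks like a link to a regular Wikipedia article."""
    return (
        href.startswith("/wiki/")      # internal wiki link
        and ":" not in href            # not a special page such as /wiki/Special:Random
        and "Main_Page" not in href    # not the homepage
    )


# Sample hrefs taken from the outputs shown earlier
candidates = [
    "/wiki/Data_analysis",
    "/wiki/Special:Random",
    "/wiki/Main_Page",
    "#bodyContent",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
]
article_links = [href for href in candidates if is_article_link(href)]
print(article_links)
```

Note that the colon rule is a heuristic: it also discards the occasional article whose title itself contains a colon.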

In [6]:
# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and print all links on the web page
    links = soup.find_all('a', href=True) # href=True keeps only <a> tags that have an href attribute
    for link in links:
        if link['href'].startswith('/wiki/') and ':' not in link['href'] and 'Main_Page' not in link['href']:
            print(link['href']) # The href attribute is the destination URL
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)

/wiki/Data_science
/wiki/Data_science
/wiki/Data_science
/wiki/Information_science
/wiki/Comet_NEOWISE
/wiki/Astronomical_survey
/wiki/Space_telescope
/wiki/Wide-field_Infrared_Survey_Explorer
/wiki/Interdisciplinary
/wiki/Statistics
/wiki/Scientific_computing
/wiki/Scientific_method
/wiki/Algorithm
/wiki/Knowledge
/wiki/Unstructured_data
/wiki/Statistics
/wiki/Data_analysis
/wiki/Informatics
/wiki/Scientific_method
/wiki/Phenomena
/wiki/Data
/wiki/Mathematics
/wiki/Computer_science
/wiki/Information_science
/wiki/Domain_knowledge
/wiki/Computer_science
/wiki/Turing_Award
/wiki/Jim_Gray_(computer_scientist)
/wiki/Empirical_research
/wiki/Basic_research
/wiki/Computational_science
/wiki/Information_technology
/wiki/Information_explosion
/wiki/Interdisciplinarity
/wiki/Academic_discipline
/wiki/Knowledge_extraction
/wiki/Big_data
/wiki/Data_set
/wiki/Problem-solving
/wiki/Analysis
/wiki/Data_visualization
/wiki/Information_visualization
/wiki/Data_sonification
/wiki/Data_integration
/wik

Now each link points to a Wikipedia article. However, these are relative links, i.e., they lack the scheme and domain name. We can prepend "https://en.wikipedia.org" to turn them into full URLs.
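As an alternative to string concatenation, the standard library's `urllib.parse.urljoin` resolves a link against the URL of the page it was found on. The sketch below applies it to a few hrefs taken from the earlier output; the variable names are illustrative.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from
page_url = "https://en.wikipedia.org/wiki/Data_science"

hrefs = [
    "/wiki/Data_analysis",                           # root-relative link
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # scheme-relative link
    "#bodyContent",                                  # fragment within the same page
]

full_urls = [urljoin(page_url, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)
```

Unlike prepending the domain, `urljoin` also handles scheme-relative links (those starting with `//`) and in-page fragments.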

In [7]:
# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and print all links on the web page
    links = soup.find_all('a', href=True) # href=True keeps only <a> tags that have an href attribute
    for link in links:
        if link['href'].startswith('/wiki/') and ':' not in link['href'] and 'Main_Page' not in link['href']:
            print(f"https://en.wikipedia.org{link['href']}") # Full link
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)

https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Information_science
https://en.wikipedia.org/wiki/Comet_NEOWISE
https://en.wikipedia.org/wiki/Astronomical_survey
https://en.wikipedia.org/wiki/Space_telescope
https://en.wikipedia.org/wiki/Wide-field_Infrared_Survey_Explorer
https://en.wikipedia.org/wiki/Interdisciplinary
https://en.wikipedia.org/wiki/Statistics
https://en.wikipedia.org/wiki/Scientific_computing
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Algorithm
https://en.wikipedia.org/wiki/Knowledge
https://en.wikipedia.org/wiki/Unstructured_data
https://en.wikipedia.org/wiki/Statistics
https://en.wikipedia.org/wiki/Data_analysis
https://en.wikipedia.org/wiki/Informatics
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Phenomena
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Mathematics
https:/
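The output above lists some URLs more than once (for example, the Data_science page appears three times). One way to drop the repeats while keeping the order in which links were first seen is sketched below on a short, hand-picked list.

```python
# A few hrefs in the order they appear on the page, with repeats
hrefs = [
    "/wiki/Data_science",
    "/wiki/Data_science",
    "/wiki/Information_science",
    "/wiki/Statistics",
    "/wiki/Scientific_method",
    "/wiki/Statistics",
]

# dict keys are unique and preserve insertion order, so this keeps
# only the first occurrence of each href
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```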

### Save a Webpage
After a webpage is crawled, we can use the following code to save it to our local disk.

In [8]:
import os

# Send a GET request to a URL
response = requests.get("https://en.wikipedia.org/wiki/Data_science")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title of the web page
    title = soup.title.text 

    # Create a folder to store webpages
    if not os.path.exists("pages"): 
        os.mkdir("pages")
    
    file_name = f"{title}.html"
    file_path = os.path.join("pages", file_name)

    # Save the webpage
    with open(file_path, 'w', encoding='utf-8', errors="ignore") as f:
        f.write(response.text)

    
else:
    # Print an error message if the request failed
    print('Error:', response.status_code)
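The cell above uses the page title directly as the file name. A title that contains a character such as "/" would produce an invalid path, so it can help to sanitize the title first. The helper below is a minimal sketch; the name `safe_file_name` and the choice of "_" as the replacement character are assumptions.

```python
import re


def safe_file_name(title, extension=".html"):
    """Build a file name from a page title (hypothetical helper)."""
    # Replace characters that are not allowed in file names on Windows or POSIX
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a placeholder if nothing usable is left
    return (cleaned or "untitled") + extension


print(safe_file_name("Data science - Wikipedia"))
print(safe_file_name("AC/DC - Wikipedia"))
```

The result can then be passed to `os.path.join("pages", ...)` in place of `f"{title}.html"`.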

### Use a Queue to Crawl Webpages Continuously
The code above shows how to download and save a single Wikipedia page and how to extract the links in it that point to other Wikipedia pages. Now we will use a queue to keep downloading and saving wiki pages by following those links. The major steps are:

1. Initialize the queue with a seed URL.
2. Pop a URL from the queue.
3. Download and save the wiki page at that URL.
4. Extract all URLs in the downloaded page that link to other wiki pages.
5. Push those URLs into the queue.
6. Repeat steps 2-5 until a certain number of wiki pages have been downloaded.
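Before running these steps against the real Web, they can be tried on a toy in-memory "web" with no network access. The sketch below is illustrative only: the `toy_web` link graph and the `crawl_toy` function are made up, and it uses `collections.deque`, whose `popleft()` removes the first element in constant time.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to
toy_web = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def crawl_toy(seed, max_pages):
    queue = deque([seed])                 # Step 1: initialize the queue with a seed
    seen = set()
    visited = []                          # Pages in the order they were "downloaded"
    while queue and len(visited) < max_pages:  # Step 6: repeat until the limit
        page = queue.popleft()            # Step 2: pop a page from the front
        if page in seen:                  # Skip pages that were already visited
            continue
        seen.add(page)
        visited.append(page)              # Step 3: stand-in for download and save
        for link in toy_web[page]:        # Step 4: extract outgoing links
            queue.append(link)            # Step 5: push them into the queue
    return visited


print(crawl_toy("A", max_pages=4))
```

Because pages are taken from the front of the queue, they are visited in breadth-first order: pages closest to the seed come first.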

In [9]:
queue = ["https://en.wikipedia.org/wiki/Data_science"] # Use a Python list as a queue

max_pages = 100 # max number of wiki pages we need to download

visited_urls = set()

if not os.path.exists("pages"):
    os.makedirs("pages")

while queue and len(visited_urls) < max_pages:
    url = queue.pop(0) # Pop the first element of the list (first in, first out)

    # Avoid revisiting visited urls. 
    if url in visited_urls:
        continue

    visited_urls.add(url) 

    try:
        response = requests.get(url)
        response.raise_for_status() # Raise an exception on an HTTP error so error pages are not saved
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the title of the web page
        title = soup.title.text 

        file_name = f"{title}.html"
        file_path = os.path.join("pages", file_name)

        # Save the webpage
        with open(file_path, 'w', encoding='utf-8', errors="ignore") as f:
            f.write(response.text)
        
        # Find links to other Wikipedia pages
        links = soup.find_all('a', href=True) 
        for link in links:
            if link['href'].startswith('/wiki/') and ':' not in link['href'] and 'Main_Page' not in link['href']:
                queue.append(f"https://en.wikipedia.org{link['href']}") # Push new links into the queue
        
    except Exception as e:
        print(f"Error occurred while crawling {url}: {e}")

print("Crawling complete!")

Crawling complete!


### Wrap up the Crawler
Now we will wrap the code above in a single function that crawls and downloads wiki pages.

In [10]:
import requests 
from bs4 import BeautifulSoup 
import os
import time

def crawl_wiki(seed_urls, max_pages, destination_folder):

    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    queue = list(seed_urls) # Copy the seed URLs into the queue
    visited_urls = set()

    while queue and len(visited_urls) < max_pages:
        url = queue.pop(0)
        if url in visited_urls:
            continue
        visited_urls.add(url) 

        try:
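            response = requests.get(url)
            time.sleep(1) # Politeness delay: wait one second between requests to reduce the load on the Wikipedia server.
            response.raise_for_status() # Raise an exception on an HTTP error so error pages are not saved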

            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the title of the web page
            title = soup.title.text 

            file_name = f"{title}.html"
            file_path = os.path.join(destination_folder, file_name)

            # Save the webpage
            with open(file_path, 'w', encoding='utf-8', errors="ignore") as f:
                f.write(response.text)
                print(f"{url} saved.")
            
            # Find links to other Wikipedia pages
            links = soup.find_all('a', href=True) 
            for link in links:
                href = link['href']
                if href.startswith('/wiki/') and ':' not in href and 'Main_Page' not in href:
                    queue.append(f"https://en.wikipedia.org{href}") # Push new links into the queue
            
        except Exception as e:
            print(f"Error occurred while crawling {url}: {e}")

    print("Crawling complete!")
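Besides the one-second delay, a polite crawler usually also checks a site's robots.txt rules before fetching a URL. The sketch below shows how such a check could look with the standard library's `urllib.robotparser`. The rules are made up for illustration and parsed from a string so that the example runs offline; a real crawler would load the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules (an assumption for illustration only)
sample_rules = """\
User-agent: *
Disallow: /w/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # Parse the rules from a list of lines

# can_fetch(useragent, url) reports whether the rules allow fetching the URL
article_allowed = parser.can_fetch("*", "https://en.wikipedia.org/wiki/Data_science")
script_allowed = parser.can_fetch("*", "https://en.wikipedia.org/w/index.php")
print(article_allowed, script_allowed)
```

Inside `crawl_wiki`, a check like this could be placed just before `requests.get(url)` so that disallowed URLs are skipped.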

In [11]:
# Test the crawling function
seed_urls = ["https://en.wikipedia.org/wiki/Data_science"]
max_pages = 500
destination_folder = "pages"
crawl_wiki(seed_urls, max_pages, destination_folder)

https://en.wikipedia.org/wiki/Data_science saved.
https://en.wikipedia.org/wiki/Information_science saved.
https://en.wikipedia.org/wiki/Comet_NEOWISE saved.
https://en.wikipedia.org/wiki/Astronomical_survey saved.
https://en.wikipedia.org/wiki/Space_telescope saved.
https://en.wikipedia.org/wiki/Wide-field_Infrared_Survey_Explorer saved.
https://en.wikipedia.org/wiki/Interdisciplinary saved.
https://en.wikipedia.org/wiki/Statistics saved.
https://en.wikipedia.org/wiki/Scientific_computing saved.
https://en.wikipedia.org/wiki/Scientific_method saved.
https://en.wikipedia.org/wiki/Algorithm saved.
https://en.wikipedia.org/wiki/Knowledge saved.
https://en.wikipedia.org/wiki/Unstructured_data saved.
https://en.wikipedia.org/wiki/Data_analysis saved.
https://en.wikipedia.org/wiki/Informatics saved.
https://en.wikipedia.org/wiki/Phenomena saved.
https://en.wikipedia.org/wiki/Data saved.
https://en.wikipedia.org/wiki/Mathematics saved.
https://en.wikipedia.org/wiki/Computer_science saved.
ht