
# **Objective: Use a web crawler to acquire data from a specific domain.**


Task:
1. Choose a government website with primarily textual data.
2. Use web crawling tools or libraries (such as BeautifulSoup, Scrapy, or others) to extract data from the chosen website (at least 10 HTML pages).
3. Pre-process each HTML page using regex and other techniques to remove all HTML tags, special characters, and advertisements or extra links.

***
***
***

## **What is Web Crawling?**


- Web crawling is a technique used to systematically navigate websites, indexing their content for search engines or other purposes.

- Unlike web scraping, which targets specific data, web crawling focuses on indexing and discovering information.

> ## **Python and Web Crawling**
Python has cemented its place as a go-to language for web scraping and crawling, thanks to its rich ecosystem of libraries. **BeautifulSoup**, in conjunction with requests, forms a formidable duo for building web crawlers. BeautifulSoup simplifies HTML parsing and data extraction, while requests handles the complexities of making HTTP requests.

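To make the difference between crawling and scraping concrete, the sketch below shows the discovery loop at the heart of a crawler: a queue of URLs to visit, a set of URLs already seen, and a same-host filter. It is a minimal illustration and not the method used later in this notebook. The `crawl` function, the `get_links` callback, and the `example.gov` link graph are hypothetical names introduced here so the logic runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links, max_pages=10):
    """Breadth-first discovery of pages on the same host as start_url.

    get_links is any callable returning the href values found on a page
    (hypothetical here; in practice it could wrap requests + BeautifulSoup).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links against the current page
            link = urljoin(url, href)
            # Stay on the starting host and skip URLs already queued
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory link graph standing in for real pages
fake_site = {
    "https://example.gov/index.html": ["about.html", "https://other.org/x"],
    "https://example.gov/about.html": ["index.html", "policy.html"],
    "https://example.gov/policy.html": [],
}

pages = crawl("https://example.gov/index.html", lambda u: fake_site.get(u, []))
print(pages)
```

The external link to `other.org` is never visited, and `index.html` is not queued a second time. A scraper, by contrast, would start from a fixed list of known URLs, which is the approach taken in the steps that follow.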
***

## **STEP 01: WEBSITE SELECTION**
- I have chosen the State Bank of Pakistan (SBP) website because it contains a lot of textual content related to the country’s monetary policies and regulations. This data is essential for understanding how the economy functions in Pakistan.

- Website URL: https://www.sbp.org.pk/index.html
***

## **STEP 02: IMPORTING NECESSARY LIBRARIES**

In [None]:
# requests library will handle HTTP requests and fetch web page content
import requests

# BeautifulSoup from bs4 is used to parse HTML and extract data from web pages
from bs4 import BeautifulSoup

# re library for regular expressions to preprocess and clean text data
import re

# pandas for data manipulation and analysis, especially for creating DataFrames and saving data to CSV
import pandas as pd


***
## **STEP 03: Creating Dictionary of URLs and their corresponding page names**

- A dictionary called page_names maps specific URLs from the State Bank of Pakistan website to their corresponding descriptive page names.
- By organizing the URLs in this way, the code can efficiently iterate through each page, making it easier to manage and extract content from the selected web pages during the crawling process.

- **The ten chosen pages from the SBP website are:**
    1. 'https://www.sbp.org.pk/about/Intro.asp': 'Introduction To SBP'
    2. 'https://www.sbp.org.pk/about/Lf.asp': 'Legal Framework'
    3. 'https://www.sbp.org.pk/about/Func.asp': 'Function of SBP'
    4. 'https://www.sbp.org.pk/about/Govr.asp': 'Governance'
    5. 'https://www.sbp.org.pk/about/BOD.asp': 'Powers of the Board'
    6. 'https://www.sbp.org.pk/about/MPC.asp': 'Monetary Policy Committee (MPC)'
    7. 'https://www.sbp.org.pk/m_policy/mpf-01.asp': 'Monetary Policy Framework'
    8. 'https://www.sbp.org.pk/departments/ifpd.htm': 'Islamic Finance Policy Department'
    9. 'https://www.sbp.org.pk/departments/ifdd.htm': 'Islamic Finance Development Department'
    10. 'https://www.sbp.org.pk/departments/ITOD.htm': 'Information Technology Operations Department'

In [None]:
# Defining the URLs and their corresponding page names
page_names = {
    'https://www.sbp.org.pk/about/Intro.asp': 'Introduction To SBP',
    'https://www.sbp.org.pk/about/Lf.asp': 'Legal Framework',
    'https://www.sbp.org.pk/about/Func.asp': 'Function of SBP',
    'https://www.sbp.org.pk/about/Govr.asp': 'Governance',
    'https://www.sbp.org.pk/about/BOD.asp': 'Powers of the Board',
    'https://www.sbp.org.pk/about/MPC.asp': 'Monetary Policy Committee (MPC)',
    'https://www.sbp.org.pk/m_policy/mpf-01.asp': 'Monetary Policy Framework',
    'https://www.sbp.org.pk/departments/ifpd.htm': 'Islamic Finance Policy Department',
    'https://www.sbp.org.pk/departments/ifdd.htm': 'Islamic Finance Development Department',
    'https://www.sbp.org.pk/departments/ITOD.htm': 'Information Technology Operations Department'
}

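Because the page names are later used as dictionary keys and as file names, a duplicate name or a mistyped URL would silently overwrite or skip a page. The following is a small, optional sanity check under that assumption. The `check_pages` helper and the `sample` mapping are hypothetical and not part of the original notebook; the second and third sample entries are deliberately made-up bad cases.

```python
from urllib.parse import urlparse


def check_pages(pages, expected_host):
    """Return a list of human-readable problems found in a URL -> name mapping."""
    problems = []
    seen_names = set()
    for url, name in pages.items():
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            problems.append(f"{url}: unsupported scheme")
        if parts.netloc != expected_host:
            problems.append(f"{url}: not on {expected_host}")
        if name in seen_names:
            problems.append(f"{url}: duplicate page name '{name}'")
        seen_names.add(name)
    return problems


# Sample in the same shape as page_names (entries 2 and 3 are invented errors)
sample = {
    "https://www.sbp.org.pk/about/Intro.asp": "Introduction To SBP",
    "https://example.com/about/Lf.asp": "Legal Framework",
    "https://www.sbp.org.pk/about/Func.asp": "Introduction To SBP",
}

for problem in check_pages(sample, "www.sbp.org.pk"):
    print(problem)
```

Calling `check_pages(page_names, "www.sbp.org.pk")` on the dictionary above should return an empty list if every URL is on the SBP host and every name is unique.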
***
## **STEP 04: CREATING FUNCTION TO FETCH DATA FROM EACH OF 10 PAGES**


- Inside the function, a try block handles any errors that occur while the request is made.
- requests.get(url, timeout=10) sends an HTTP GET request to the specified URL and gives up if the server does not respond within 10 seconds.
- response.raise_for_status() checks the response status and raises an exception if the server returned an unsuccessful status code (4xx or 5xx).

- If the request is successful, the function returns response.content, which contains the raw content of the response (the HTML as bytes).

- If a request error occurs (for example, if the URL is invalid or the server is down), the except block prints an error message naming the URL that failed, together with the exception message for debugging.

- In that case the function returns None, indicating that the data could not be fetched. This allows the program to continue running without crashing.



In [None]:
# Creating Function to fetch data from each URL

def fetch_data(url):
    try:
        # Sends a GET request to the specified URL
        # (the 10-second timeout is an arbitrary choice that keeps the call from hanging)
        response = requests.get(url, timeout=10)
        # Checks for HTTP errors (raises on 4xx/5xx status codes)
        response.raise_for_status()
        # If the request is successful, return the content of the response
        return response.content
    # Catches connection errors, timeouts, invalid URLs, and HTTP errors raised by requests
    except requests.RequestException as e:
        # Print an error message if the request fails
        print(f"Error fetching {url}: {e}")
        return None

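Temporary network failures are common when fetching several pages in a row, so one possible extension is to retry a failed request a few times before giving up. The sketch below is an illustration under stated assumptions, not part of the original notebook: `fetch_with_retries`, its parameters, `FakeResponse`, and `flaky_get` are hypothetical names. The `get` argument is any callable with the same shape as `requests.get`, which lets the logic be exercised offline.

```python
import time


def fetch_with_retries(url, get, retries=3, backoff=1.0, timeout=10, sleep=time.sleep):
    """Call get(url, timeout=...) up to `retries` times; return bytes or None."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()
            return response.content
        except Exception as exc:  # with requests, narrow this to requests.RequestException
            print(f"Attempt {attempt} failed for {url}: {exc}")
            if attempt < retries:
                # Wait a little longer after each failure
                sleep(backoff * attempt)
    return None


class FakeResponse:
    """Stand-in for a requests response, used only to run the sketch offline."""
    content = b"<html>ok</html>"

    def raise_for_status(self):
        pass


calls = []


def flaky_get(url, timeout):
    """Simulated getter that fails twice and then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return FakeResponse()


result = fetch_with_retries("https://example.gov/page", flaky_get, sleep=lambda s: None)
print(result)
```

With the real library, the call would look like `fetch_with_retries(url, requests.get)`. The function keeps the same contract as fetch_data: bytes on success, None on failure.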
***
## **STEP 05: CREATING FUNCTION TO PRE-PROCESS HTML CONTENT**

- preprocess_html function -> takes raw HTML content from a webpage for processing.

- **BeautifulSoup library** with the html.parser option -> creates a soup object that makes it easy to extract text from the HTML structure.

- soup.get_text(separator='\n') method -> extracts all the text, with newlines separating different elements.

- Regex re.sub(r'<[^>]+>', '', text) -> **eliminates any remaining HTML tags** from the extracted text.

- Regex re.sub(r'\[[0-9]*\]', '', clean_text) -> removes reference markers formatted as numbers in brackets, such as [1].

- Regex re.sub(r'\W+', ' ', clean_text) -> **removes any special characters** by replacing each run of non-word characters with a single space. This also strips ordinary punctuation, so "SBP's" becomes "SBP s" in the output.

- Regex re.sub(r'\s+', ' ', clean_text).strip() -> collapses multiple spaces into a single space and trims any leading or trailing spaces from the text.

- The phrases "State Bank of Pakistan" and "About Us" are removed from the start of the text using regex, with case-insensitivity enabled for broader matching.

- A regex pattern is also used to remove common footer content: everything from the first matching footer keyword (such as "Home" or "Publications") to the end of the text is dropped. Because the pattern is not anchored to the end of the text, a keyword that appears in the body would also cut off the content that follows it.

In [None]:
# Creating Function to preprocess the HTML content


def preprocess_html(html_content):
    # Parsing the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extracting text from the parsed HTML, separating elements with newlines
    text = soup.get_text(separator='\n')

    # Cleaning up text with regex
    clean_text = re.sub(r'<[^>]+>', '', text)  # Removing HTML tags
    clean_text = re.sub(r'\[[0-9]*\]', '', clean_text)  # Removing reference links
    clean_text = re.sub(r'\W+', ' ', clean_text)  # Removing special characters
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()  # Replacing multiple spaces

    # Removing leading "State Bank of Pakistan" and "About Us" if present at the start of each page
    clean_text = re.sub(r'^(State Bank of Pakistan\s*)?(About Us\s*)?', '', clean_text, flags=re.IGNORECASE).strip()

    # Removing known footer content patterns at the end of the text
    # (punctuation was replaced by spaces above, so "What's New" appears as "What s New")
    footer_pattern = re.compile(r"(Home|About SBP|Publications|Economic Data|Press Releases|"
                                r"Circulars|Notifications|Laws Regulations|What s New|Library|"
                                r"Screen Resolution \d{3,4} \d{3,4}).*$", re.IGNORECASE)

    # Trimming the footer text using regex to search for these patterns at the end
    clean_text = footer_pattern.sub('', clean_text).rstrip()

    return clean_text

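To see what each regex step does, the walk-through below applies the same substitutions to a short, made-up string shaped like the output of soup.get_text(separator='\n'). The sample text and the shortened footer pattern are assumptions chosen for illustration; only the regex calls themselves come from the function above.

```python
import re

# Made-up sample, shaped like text returned by soup.get_text(separator='\n')
raw = (
    "State Bank of Pakistan\nAbout Us\nLegal Framework<br>\n"
    "SBP's functions are governed by the SBP Act, 1956 [1].\n"
    "Home | About SBP | Publications"
)

text = re.sub(r'<[^>]+>', '', raw)        # drops the leftover <br>
text = re.sub(r'\[[0-9]*\]', '', text)    # drops the [1] reference marker
text = re.sub(r'\W+', ' ', text)          # punctuation and newlines become spaces
text = re.sub(r'\s+', ' ', text).strip()  # collapses spaces, trims both ends
text = re.sub(r'^(State Bank of Pakistan\s*)?(About Us\s*)?', '', text,
              flags=re.IGNORECASE).strip()

# Shortened version of the footer pattern, for illustration only
footer = re.compile(r"(Home|About SBP|Publications|Library).*$", re.IGNORECASE)

cleaned = footer.sub('', text).rstrip()
print(cleaned)
# -> Legal Framework SBP s functions are governed by the SBP Act 1956

# The pattern is unanchored, so a keyword inside the body truncates the text
truncated = footer.sub('', "The SBP Library holds reports on monetary policy").rstrip()
print(truncated)
# -> The SBP
```

The second result shows the trade-off of this approach: it removes the navigation footer reliably, but any page whose main text mentions one of the footer keywords loses everything after that word.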
***
## **STEP 06: STORING THE EXTRACTED CONTENT AFTER CLEANING**
- A dictionary called data is created to hold the cleaned content for each webpage, using the page names as keys for easy reference.

- The code loops through the page_names dictionary to get each page's URL and name, keeping track of the page number starting from one.

- For each page, a message is displayed to show which one is being fetched, and the fetch_data function retrieves the HTML content for that page.

- If the HTML content is fetched successfully, the preprocess_html function cleans the raw HTML, and the cleaned text is saved.

- Finally, the cleaned text, page number, and original HTML content are all stored in the data dictionary.

In [None]:
# Initializing the data dictionary for storing extracted and cleaned content
data = {}

# Iterating through the pages, getting both the URL and the corresponding page name
for page_num, (url, name) in enumerate(page_names.items(), start=1):

    # Printing the current page being fetched with its number
    print(f"Fetching: {name} (Page {page_num})")

    # Fetching the HTML content from the URL
    html_content = fetch_data(url)

    # Checking if the HTML content was successfully retrieved
    if html_content:

        # Cleaning the raw HTML with the preprocess_html function
        clean_content = preprocess_html(html_content)

        # Storing the cleaned content along with the page number and original HTML content
        # (errors='replace' avoids a UnicodeDecodeError if a page is not valid UTF-8)
        data[name] = {
            'page_number': page_num,
            'clean_content': clean_content,
            'html_content': html_content.decode('utf-8', errors='replace')
        }


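The loop above skips a page silently when fetch_data returns None, and it sends the requests back to back. A possible variation, sketched below, records which pages failed and pauses between requests to be polite to the server. The `collect_pages` function, its `delay` parameter, and the `fake_*` stand-ins are hypothetical names introduced for this example; the fetch and clean steps are passed in so the sketch runs offline.

```python
import time


def collect_pages(pages, fetch, clean, delay=1.0, sleep=time.sleep):
    """Fetch and clean every page; return (collected, failed_names)."""
    collected, failed = {}, []
    for page_num, (url, name) in enumerate(pages.items(), start=1):
        html = fetch(url)
        if html is None:
            # Remember the failure instead of dropping the page silently
            failed.append(name)
        else:
            collected[name] = {
                "page_number": page_num,
                "clean_content": clean(html),
            }
        # Pause between requests, but not after the last one
        if page_num < len(pages):
            sleep(delay)
    return collected, failed


# Offline stand-ins for page_names, fetch_data and preprocess_html
fake_pages = {"https://example.gov/a": "Page A", "https://example.gov/b": "Page B"}


def fake_fetch(url):
    return b"<p>hello</p>" if url.endswith("/a") else None


def fake_clean(html):
    return html.decode("utf-8").replace("<p>", "").replace("</p>", "")


collected, failed = collect_pages(fake_pages, fake_fetch, fake_clean, sleep=lambda s: None)
print(collected)
print(failed)
```

In the notebook itself the call would be `collect_pages(page_names, fetch_data, preprocess_html)`, and the `failed` list could be printed at the end to show which pages need another attempt.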
***
## **STEP 07: Saving Extracted Data to CSV and Text/HTML Files**

In [None]:
# Creating a DataFrame from the extracted data using a list comprehension.
# The DataFrame will have columns for Page Name, Page Number, and Content.
df = pd.DataFrame([(name, info['page_number'], info['clean_content']) for name, info in data.items()],
                  columns=['Page Name', 'Page Number', 'Content'])



# Saving the dataset to CSV
# Setting index=False to avoid saving the DataFrame index as a separate column in the CSV.
df.to_csv('sbp_data.csv', index=False)



# Saving the clean text and HTML content into separate files
# Iterating over the items in the data dictionary to save each page's content.
for name, info in data.items():
    # Opening a new text file with the name of the page, and writing the cleaned content to it.
    # Using encoding='utf-8' to ensure proper handling of special characters.
    with open(f"{name}.txt", "w", encoding='utf-8') as file:
        file.write(info['clean_content'])

    # Opening a new HTML file with the name of the page, and writing the original HTML content to it.
    with open(f"{name}.html", "w", encoding='utf-8') as file:
        file.write(info['html_content'])



print("Data extraction, preprocessing, and saving complete.")


Fetching: Introduction To SBP (Page 1)
Fetching: Legal Framework (Page 2)
Fetching: Function of SBP (Page 3)
Fetching: Governance (Page 4)
Fetching: Powers of the Board (Page 5)
Fetching: Monetary Policy Committee (MPC) (Page 6)
Fetching: Monetary Policy Framework (Page 7)
Fetching: Islamic Finance Policy Department (Page 8)
Fetching: Islamic Finance Development Department (Page 9)
Fetching: Information Technology Operations Department (Page 10)
Data extraction, preprocessing, and saving complete.

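The files above are named directly after the page names, which works for the ten names used here but would fail for a name containing a character such as "/" or ":". One way to guard against that, assuming a simple lowercase-and-underscore naming scheme is acceptable, is a small helper like the hypothetical `safe_filename` below.

```python
import re


def safe_filename(name, suffix):
    """Turn a page name into a lowercase, underscore-separated file name."""
    # Replace every run of characters other than letters and digits with "_"
    stem = re.sub(r'[^A-Za-z0-9]+', '_', name).strip('_').lower()
    # Fall back to a generic stem if nothing usable is left
    return f"{stem or 'page'}{suffix}"


print(safe_filename("Monetary Policy Committee (MPC)", ".txt"))
# -> monetary_policy_committee_mpc.txt
```

In the saving loop this would replace `f"{name}.txt"` with `safe_filename(name, ".txt")`, and likewise for the `.html` files.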

***
## **STEP 08: DISPLAYING THE FINAL CLEANED CONTENT EXTRACTED FROM 10 PAGES OF THE WEBSITE**

In [None]:
df

| | Page Name | Page Number | Content |
|---|---|---|---|
| 0 | Introduction To SBP | 1 | Introduction The State Bank of Pakistan SBP is... |
| 1 | Legal Framework | 2 | Legal Framework SBP s Functions are mainly gov... |
| 2 | Function of SBP | 3 | Functions of SBP Like a Central Bank in any de... |
| 3 | Governance | 4 | Governance Board of Directors Board of Directo... |
| 4 | Powers of the Board | 5 | Powers of the Board The SBP Board derives its ... |
| 5 | Monetary Policy Committee (MPC) | 6 | Monetary Policy Committee MPC Powers and Funct... |
| 6 | Monetary Policy Framework | 7 | Monetary Policy Monetary Policy Objectives Mon... |
| 7 | Islamic Finance Policy Department | 8 | Departments Islamic Finance Policy Department ... |
| 8 | Islamic Finance Development Department | 9 | Departments Islamic Finance Development Depart... |
| 9 | Information Technology Operations Department | 10 | Departments Information Technology Operations ... |
