# Web Scraping

**Welcome to the Web Scraping Notebook!**

This notebook shows how Python can be used to scrape data from the internet. We will use the `beautifulsoup4` library throughout the session and exercises.

**Note: This is not a definitive guide.**

## What is Web Scraping?
So much information is generated on the internet today that accessing it has become relatively easy. The main difficulty lies in collecting this data in bulk, then organizing and analyzing it.

Web scraping is a method widely used by organizations to gather data in bulk. A web scraper automatically extracts data and presents it in a chosen format, e.g. CSV or XLSX.

### Web Scraping and AI
Machine learning models require a lot of data. Web scraping (or data scraping) provides a way to collect and aggregate the data used to train a machine learning model. As an example, the language models behind `ChatGPT` were trained in part on large amounts of text scraped from the public web.

## Prerequisites
Before we start, you will need a basic knowledge of the following technologies:

- [HTML](https://www.w3schools.com/html/default.asp)
- [Python](https://www.python.org/)

## Primary Tool
The primary library that we will be using is `beautifulsoup4`. This library allows us to parse `HTML` and retrieve the values that we want. Python also supports a number of other scraping tools and libraries, such as `scrapy`, which acts as a full scraping framework.

## Anatomy of a Web Scraper
Any web scraper has a basic anatomy: a retriever, a parser, a transformer, and a storer.

### Retriever
Retrievers are responsible for fetching information from a website. They are not responsible for extracting or parsing that information. When a retriever fetches a page, it returns a blob of HTML that is then passed to the parser/extractor.

The retriever's logic is limited to building the request for the URL to be retrieved, e.g. passing query arguments or authentication parameters.
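
As a rough illustration of the URL-building part of a retriever, here is a minimal sketch that uses only the standard library. The helper name `build_url` and the `example.com` endpoint are made up for this example and do not come from the tutorial.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_url(base_url, params=None):
    """Return base_url with params appended as a query string (hypothetical helper)."""
    if not params:
        return base_url
    parts = urlsplit(base_url)
    # Keep any query string already present and append the new parameters.
    query = "&".join(q for q in (parts.query, urlencode(params)) if q)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Example: a made-up search endpoint with two query parameters.
search_url = build_url("https://example.com/search", {"q": "web scraping", "page": 2})
print(search_url)
```

The resulting string is what the retriever would hand to its HTTP client. Libraries such as `requests` can also build the query string for you through the `params` argument of `requests.get`.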

### Parser/Extractor
Parsers are responsible for going through the information fetched by the retriever. Logic such as "extract all tables" or "extract all images" belongs here. Once a value is extracted, it is passed to the transformer, which changes its form or structure.
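
To make the "extract all tables or images" idea concrete, here is a small sketch that runs `beautifulsoup4` on an inline HTML string instead of a live page. The HTML snippet and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a retrieved page.
html = """
<html><body>
  <img src="/a.png"><img src="/b.png">
  <table><tr><th>Name</th></tr><tr><td>Luzon</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "Extract all images": collect the src attribute of every <img> tag.
image_sources = [img["src"] for img in soup.find_all("img")]

# "Extract all tables": collect the cell text of every row of every table.
tables = [
    [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
     for row in table.find_all("tr")]
    for table in soup.find_all("table")
]

print(image_sources)
print(tables)
```

Here each table becomes a list of rows, and each row a list of cell strings. The values are still raw text at this point; converting them is left to the transformer.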

### Transformer
Transformers are responsible for changing the form or structure of the extracted value, e.g. converting the text `1,234` to the number `1234`.
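
A transformer can be as small as a single function. The sketch below assumes the scraped cells are number-like strings with thousands separators and stray whitespace; `to_number` is a hypothetical helper name, not part of any library.

```python
def to_number(raw):
    """Convert thousands-separated text to a float (hypothetical helper)."""
    # Drop surrounding whitespace and newlines, then the thousands separators.
    cleaned = raw.strip().replace(",", "")
    return float(cleaned)


# Made-up raw cell values as they might come out of a parser.
raw_cells = ["1,234.50\n", " 87 ", "2,000,000"]
numbers = [to_number(cell) for cell in raw_cells]
print(numbers)
```

Real pages often contain cells that are not numeric at all (footnote markers, dashes), so a production transformer would likely need to handle `ValueError` as well.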

### Storer
The transformed value can then be stored in a database or in a file such as a CSV. Python has a wide ecosystem of database connectors and of libraries for generating custom formats.
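
As one possible database-backed storer, the sketch below writes a few rows to SQLite with the standard-library `sqlite3` module. The table name `grdp`, its columns, and the sample rows are assumptions made for this example only.

```python
import sqlite3

# Made-up rows standing in for transformed scraper output.
rows = [("Region A", 100.0), ("Region B", 250.5)]

# An in-memory database keeps the sketch self-contained; pass a file path to persist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grdp (region TEXT PRIMARY KEY, value REAL)")
# Parameter placeholders (?) let sqlite3 handle quoting of the values.
conn.executemany("INSERT INTO grdp (region, value) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT region, value FROM grdp ORDER BY region").fetchall()
print(stored)
conn.close()
```

A database is worth considering when the scraper runs repeatedly and the results need to be queried or deduplicated; for a one-off export, a CSV file is usually enough.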

## Hands-on
For this tutorial, we will be scraping the Wikipedia page [Economy of the Philippines](https://en.wikipedia.org/wiki/Economy_of_the_Philippines). From this page, we will extract the **Regional Accounts** table. Start by inspecting the page in your browser to understand its structure.

### Import Libraries
We will import the `requests` and `bs4` libraries. 

In [1]:
import requests
from bs4 import BeautifulSoup

### Retriever
To implement the retriever step, we will use the `requests` library, which allows us to fetch the contents of a website. The HTML content is available as plain text through the response's `text` attribute.

In [2]:
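# retriever: fetch the page; note that `url` holds the HTTP response object, not the URL string
url = requests.get('https://en.wikipedia.org/wiki/Economy_of_the_Philippines')
url.raise_for_status()  # stop early if the request failed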


### Parser
Once we have retrieved the contents as plain text, we will parse them with the `beautifulsoup4` library, imported as `bs4`. This library provides functions that abstract the parsing and filtering of HTML content.

In [3]:
source = BeautifulSoup(url.text, 'html.parser')

In [4]:
# retrieve all the tables inside this webpage
all_tables = source.find_all('table', class_="wikitable")

In [5]:
# fetch the regional table
regional_table = all_tables[1]
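
Picking a table by position works only as long as the page layout stays the same. As a hedged alternative, the sketch below selects a table by the text of one of its header cells. It runs on made-up inline HTML, and `find_table_by_header` is a hypothetical helper, not part of `bs4`.

```python
from bs4 import BeautifulSoup


def find_table_by_header(soup, header_text):
    """Return the first table with a <th> containing header_text, or None (hypothetical helper)."""
    for table in soup.find_all("table"):
        for th in table.find_all("th"):
            if header_text in th.get_text():
                return table
    return None


# Made-up HTML with two tables; only the second has a "Region" header.
html = """
<table class="wikitable"><tr><th>Year</th></tr><tr><td>2020</td></tr></table>
<table class="wikitable"><tr><th>Region</th></tr><tr><td>Region A</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
table = find_table_by_header(soup, "Region")
first_cell = table.find("td").get_text()
print(first_cell)
```

On a real page, more than one table may share a header, so the match may need to be more specific than a single word.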

In [6]:
# extract the top headers from the first row that contains <th> cells
attribute_headers = []
rows = regional_table.find_all('tr')
for row in rows:
    headers = row.find_all('th')
    if not headers: continue  # find_all returns an empty list, never None
    attribute_headers = [header.text.replace("\n", "") for header in headers]
    break
attribute_headers

['Region',
 'GRDP(PHP, thousands)',
 'Agriculture(PHP, thousands)',
 'Industry(PHP, thousands)',
 'Services(PHP, thousands)',
 'GRDP per capita(PHP)']

In [8]:
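# extract the column values, keyed by region name
data = {}
rows = regional_table.find_all('tr')
for row in rows:
    # data rows carry the region name in a <th> cell; skip rows without one
    header = row.find('th')
    if header is None: continue
    region = header.text.replace("\n", "")
    columns = row.find_all('td')
    for idx, col in enumerate(columns):
        # keep only the even-indexed cells; odd-indexed cells are skipped
        if idx % 2 != 0: continue
        # strip newlines and thousands separators before converting to a number
        value = float(col.text.replace("\n", "").replace(",", ""))
        data.setdefault(region, []).append(value)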


### Transformer
Finally, we will do some post-processing: combine the extracted header attributes and values into a list of rows.

In [9]:
csv_data = [attribute_headers]
for d in data:
    tmp_list = [d] + data[d]
    csv_data.append(tmp_list)


### Storer
We will then store the transformed data in a CSV file. Python has the `csv` module as part of its standard library.

In [10]:
import csv
import os
os.makedirs("data", exist_ok=True)
with open("data/wiki_economics.csv", "w", newline="", encoding="utf-8") as econ_csv:
    data_writer = csv.writer(econ_csv, delimiter=",")
    data_writer.writerows(csv_data)
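
A quick way to check a storer is to read the file back. The sketch below does a write-then-read round trip in a temporary directory, using made-up rows in the same shape as `csv_data`; the file name `sample.csv` is arbitrary.

```python
import csv
import os
import tempfile

# Made-up rows in the same shape as csv_data: a header row, then one row per region.
sample_rows = [["Region", "GRDP"], ["Region A", 100.0], ["Region B", 250.5]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(sample_rows)
    with open(path, newline="", encoding="utf-8") as f:
        read_back = list(csv.reader(f))

# csv.reader returns every field as a string, so numbers need converting again.
print(read_back)
```

Note that the numeric values come back as text. Anything that loads the CSV later has to convert them to numbers again.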

## Conclusion

We have successfully scraped the web page. Becoming proficient in web scraping requires hundreds of hours of practice and exposure to a variety of page structures. In the next session, we will clean and visualize this data.