# Web Scraping

**Welcome to the Web Scraping Notebook!**

This notebook shows how to use Python to scrape data from the web. We will use the `beautifulsoup4` library throughout the session and exercises.

**Note: This is not a definitive guide.**

## What is Web Scraping?
So much information is published on the internet today that accessing it is relatively easy. The harder problem is collecting that data in bulk, then organizing and analyzing it.

Web scraping is a method many organizations use to gather data in bulk. A web scraper automatically extracts data and presents it in a chosen format, e.g. CSV or XLSX.

### Web Scraping and AI
Machine learning models require a lot of data. Web scraping (or data scraping) provides a way to collect and aggregate that data so it can be used to train a model. As an example, large language models such as `ChatGPT` were trained in part on large amounts of text collected from the web.

## Prerequisites
Before we start, you will need a basic knowledge of the following technologies:

- [HTML](https://www.w3schools.com/html/default.asp)
- [Python]()

## Primary Tool
The primary library that we will be using is `beautifulsoup4`. It lets us parse `HTML` and retrieve the values that we want.
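
As a first taste of what parsing looks like, here is a minimal sketch that runs `beautifulsoup4` on a small hand-written HTML string instead of a real web page. The HTML content and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (illustrative only, not from a real site).
html = """
<html>
  <body>
    <h1 id="title">Hello, scraper!</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Build a searchable tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text content.
title = soup.find("h1").get_text()

# find_all() returns every matching tag.
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(title)
print(paragraphs)
```

`find()` and `find_all()` are the two calls used most in the rest of this notebook.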

## Anatomy of a Web Scraper
Most web scrapers share the same basic anatomy: a retriever, a parser and a transformer.

### Retriever
Retrievers are responsible for fetching information from a website. They are not responsible for extracting or parsing it. When a retriever fetches a page, it returns a blob of HTML that is then passed to the parser/extractor.

The retriever's logic is limited to building the request for the URL to be retrieved, e.g. passing query arguments or authentication parameters.
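
To make the "generating the URL" step concrete, here is a small sketch that builds a URL with query arguments using only the standard library. The helper name `build_url`, the `example.com` address and the parameter names are placeholders, not part of any real site's API.

```python
from urllib.parse import urlencode


def build_url(base_url, params=None):
    """Return base_url with params appended as an encoded query string."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


# Placeholder URL and parameters, for illustration only.
search_url = build_url(
    "https://example.com/search",
    {"q": "philippine economy", "page": 2},
)
print(search_url)
```

When using `requests`, the same dictionary can be passed as the `params` argument of `requests.get`, which encodes the query string for you.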

### Parser/Extractor
Parsers are responsible for going through the information fetched by the retriever. Logic such as "extract all tables" or "extract all images" belongs here. Once a value is extracted, it is passed to the transformer, which changes its form or structure.
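
The "extract all tables or images" logic could look roughly like the sketch below. It works on a hand-written HTML snippet, and the function names `extract_image_sources` and `extract_tables` are hypothetical helpers, not part of `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for what a retriever would return.
html = """
<div>
  <img src="/static/logo.png" alt="Logo">
  <img src="/static/chart.png" alt="Chart">
  <table class="wikitable"><tr><td>1</td></tr></table>
  <table><tr><td>2</td></tr></table>
</div>
"""


def extract_image_sources(soup):
    """Return the src attribute of every <img> tag that has one."""
    return [img["src"] for img in soup.find_all("img") if img.has_attr("src")]


def extract_tables(soup, css_class=None):
    """Return all <table> tags, optionally only those with a given CSS class."""
    if css_class is None:
        return soup.find_all("table")
    return soup.find_all("table", class_=css_class)


soup = BeautifulSoup(html, "html.parser")
image_sources = extract_image_sources(soup)
every_table = extract_tables(soup)
wiki_tables = extract_tables(soup, "wikitable")

print(image_sources)
print(len(every_table), len(wiki_tables))
```

Keeping each extraction rule in its own small function makes it easy to add or remove rules without touching the retriever.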

### Transformer
Transformers are responsible for changing the form or structure of the extracted value. The transformed value can then be stored in a database or in a file such as a CSV.
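
A transformer can be as simple as cleaning up strings and writing them out as CSV. The sketch below uses only the standard library. The helper names `clean_number` and `rows_to_csv` are hypothetical, and the input rows are made-up placeholder values, not real scraped data.

```python
import csv
import io


def clean_number(text):
    """Convert a string such as ' 1,234.5 ' into the float 1234.5."""
    return float(text.replace(",", "").strip())


def rows_to_csv(header, rows):
    """Serialize a header and a list of rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up placeholder values standing in for text pulled out by a parser.
raw_rows = [["2001", "1,234.5"], ["2002", " 2,000 "]]

# Change the form of each value: strings become an int and a float.
clean_rows = [[int(year), clean_number(value)] for year, value in raw_rows]

csv_text = rows_to_csv(["year", "value"], clean_rows)
print(csv_text)
```

To save the result to disk instead, the same `csv.writer` can be given a file opened with `open("output.csv", "w", newline="")`.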

### Import Libraries

In [4]:
import requests
from bs4 import BeautifulSoup

### Retriever

In [5]:
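# retriever: fetch the page; note that `url` holds the HTTP response object, not the URL string
url = requests.get('https://en.wikipedia.org/wiki/Economy_of_the_Philippines')
url.text  # the raw HTML of the page as a single string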

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Economy of the Philippines - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )
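
The call above assumes the request always succeeds. A slightly more defensive retriever might look like the following sketch. The function name `fetch_html`, the timeout value and the `User-Agent` string are assumptions chosen for illustration, not requirements of `requests` or of Wikipedia.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML, raising an exception on failure."""
    # Hypothetical User-Agent string; many sites expect scrapers to identify themselves.
    headers = {"User-Agent": "web-scraping-notebook/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://en.wikipedia.org/wiki/Economy_of_the_Philippines")
```

Setting a timeout keeps the notebook from hanging indefinitely on a slow server, and `raise_for_status()` stops an error page from being silently passed to the parser.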

### Parser

In [11]:
# parser: build a searchable tree from the raw HTML using Python's built-in parser
source = BeautifulSoup(url.text, 'html.parser')

In [15]:
# retrieve every table on this page that has the "wikitable" class
all_tables = source.find_all('table', class_="wikitable")
all_tables[0]  # the first matching table

<table class="wikitable sortable" style="text-align:right; font-size:90%;">
<tbody><tr>
<th rowspan="2" scope="col">Year
</th>
<th rowspan="2" scope="col">GDP growth<sup class="reference" id="cite_ref-growthrate_43-0"><a href="#cite_note-growthrate-43">[a]</a></sup>
</th>
<th colspan="3" scope="col">GDP, current prices
</th>
<th colspan="2" scope="col">GDP, PPP
</th>
<th rowspan="2" scope="col">PHP:USD<br/>exchange rate<sup class="reference" id="cite_ref-44"><a href="#cite_note-44">[b]</a></sup>
</th></tr>
<tr>
<th data-sort-type="number" scope="col">(<a href="/wiki/Philippine_peso" title="Philippine peso">PHP</a>, billions)
</th>
<th data-sort-type="number" scope="col">(<a class="mw-redirect" href="/wiki/US_dollar" title="US dollar">USD</a>, billions)
</th>
<th data-sort-type="number" scope="col">Per capita <br/>(USD)
</th>
<th data-sort-type="number" scope="col">(USD, billions)
</th>
<th data-sort-type="number" scope="col">Per capita, <br/>(USD)
</th></tr>
<tr>
<th scope="row">2023<s

In [20]:
all_tables[1]

[]
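
Once a table tag such as `all_tables[0]` is in hand, a natural next step is to walk its rows and collect the text of each cell. The sketch below shows one way to do that on a small hand-written table. The helper name `table_to_rows` and the cell values are placeholders. It does not expand `rowspan` or `colspan` cells, so merged header cells like the ones in the Wikipedia table above would need extra handling.

```python
from bs4 import BeautifulSoup

# Hand-written table with placeholder values (not real data).
html = """
<table class="wikitable">
  <tr><th>Year</th><th>Value</th></tr>
  <tr><th scope="row">Year A</th><td>1.0</td></tr>
  <tr><th scope="row">Year B</th><td>2.5</td></tr>
</table>
"""


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both kept, in order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


sample_table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")
rows = table_to_rows(sample_table)

for row in rows:
    print(row)
```

The resulting list of lists is a convenient hand-off point to the transformer, which can clean the strings and write them to a CSV file.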