# Web Scraping and APIs

## Motivation
Web scraping and APIs are essential tools for extracting data from the web. They let developers gather information from websites and services for data analysis, automation, and integration with other applications. Regulators, researchers, and firms can use them to collect data from many sources on a regular basis.

## Tools we will use
1. Web scraping: used to extract data from HTML pages
2. APIs: used to interact with web services and retrieve structured data


# Part I: Web Scraping with BeautifulSoup

## What is Web Scraping?
Web scraping is the process of extracting data from websites. It involves fetching the HTML content of a web page and parsing it to extract the desired information.

### Legal and Ethical Considerations
Before scraping a website, check its `robots.txt` file and terms of service to confirm that scraping is allowed. Respect the site's rules, keep your request rate low enough that you do not overload its servers, and do not bypass logins or CAPTCHAs. Warning: scraping certain websites may be illegal or against their terms of service.
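
As a minimal sketch of the `robots.txt` check, Python's standard library includes `urllib.robotparser`. The rules and the user agent name `my-research-bot` below are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written here only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "my-research-bot"  # hypothetical user agent name

# Ask whether this agent may fetch each URL under the rules above.
can_public = parser.can_fetch(agent, "https://example.com/data/page.html")
can_private = parser.can_fetch(agent, "https://example.com/private/page.html")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay(agent)

print(can_public, can_private, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string. The terms of service still have to be read by a person.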

## HTML Basics
HTML (HyperText Markup Language) is the standard language for creating web pages. It consists of elements represented by tags, such as `<div>`, `<p>`, `<a>`, etc. Understanding the structure of HTML is crucial for effective web scraping.

### HTML Structure Example
HTML has a tree-like structure with nested elements. For example:
```html
<html>
  <head>
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example.com</a>
  </body>
</html>
```

HTML pages can also contain tables, such as:

```html
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Row 1, Cell 1</td>
    <td>Row 1, Cell 2</td>
  </tr>
  <tr>  
    <td>Row 2, Cell 1</td>
    <td>Row 2, Cell 2</td>
  </tr>
</table>
``` 

You can see these structures by inspecting the HTML of a webpage in your browser.

## BeautifulSoup Library
BeautifulSoup is a Python library used for parsing HTML and XML documents. It creates a parse tree that makes it easy to navigate and search for specific elements.
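
A short sketch of navigating the parse tree, using the sample page from the HTML Basics section as an inline string so that no network access is needed:

```python
from bs4 import BeautifulSoup

# The sample page from the HTML Basics section, kept as a string.
html = """
<html>
  <head>
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example.com</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching tag.
page_title = soup.title.get_text()

# find() returns the first match; find_all() returns every match.
heading = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Tag attributes such as href are read like dictionary keys.
link = soup.find("a")
link_text = link.get_text(strip=True)
link_url = link["href"]

print(page_title, heading, paragraphs, link_text, link_url, sep="\n")
```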

### Loading HTML with BeautifulSoup

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


In [11]:
url = 'https://phlpost.gov.ph/zip-code-locator/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    
response = requests.get(url, headers=headers)                    # Send a GET request to the URL
# response.status_code is 200 when the request was successful

In [12]:
content = response.content                                       # Get the raw bytes of the response body
soup = BeautifulSoup(content, 'html.parser')                     # Parse the HTML into a searchable tree

In [13]:
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>500 - Internal server error.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;} 
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;} 
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;} 
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;
background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header"><h1>Server Error</h1></div>
<div id="content">
<div class="content-container"><fieldset>
<h2>5
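
The output above is an HTTP 500 error page, not the zip code table, so the later `soup.find("table")` call has nothing to find. One way to catch this earlier is to check the status before parsing. The helper names `html_or_raise` and `fetch_html` and the 30-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def html_or_raise(response):
    """Return the response body as text, or raise if the server reported an error."""
    response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    return response.text


def fetch_html(url, headers=None, timeout=30):
    """Send a GET request and return the HTML only when the request succeeded."""
    response = requests.get(url, headers=headers, timeout=timeout)
    return html_or_raise(response)
```

Calling `fetch_html(url, headers=headers)` in place of the bare `requests.get` would stop the notebook at the failed request instead of at the missing table.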

In [7]:
# Find the first table on the page
table = soup.find("table")
if table is None:
    raise RuntimeError("No <table> found on the page. Check the page structure.")

RuntimeError: No <table> found on the page. Check the page structure.

In [56]:
# Extract header row
header_cells = table.find("thead").find_all("th")
columns = [h.get_text(strip=True) for h in header_cells]
print("Columns:", columns)

Columns: ['Region', 'Provinces', 'City/Municipality', 'Zip Code']


In [57]:
# Extract data rows
data_rows = []
for row in table.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue
    values = [c.get_text(strip=True) for c in cells]
    data_rows.append(values)

print("Last 5 data rows:", data_rows[-5:])

Last 5 data rows: [['Region 1 (Ilocos Region)', 'Pangasinan', 'Natividad', '2446'], ['Region 1 (Ilocos Region)', 'Pangasinan', 'Mapandan', '2429'], ['Region 1 (Ilocos Region)', 'Pangasinan', 'Mangatarem', '2413'], ['Region 1 (Ilocos Region)', 'Pangasinan', 'Mangaldan', '2432'], ['Region 10 (Northern Mindanao)', 'Agusan del Norte', 'Kitaotao', '8716']]


In [58]:
# Build DataFrame
df_zip = pd.DataFrame(data_rows, columns=columns)

In [59]:
df_zip

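|      | Region                   | Provinces  | City/Municipality | Zip Code |
| ---- | ------------------------ | ---------- | ----------------- | -------- |
| 0    |                          |            |                   |          |
| 1    |                          |            |                   |          |
| 2    |                          |            |                   |          |
| 3    |                          |            |                   |          |
| 4    |                          |            |                   |          |
| ...  | ...                      | ...        | ...               | ...      |
| 1395 | Region 1 (Ilocos Region) | Pangasinan | Natividad         | 2446     |
| 1396 | Region 1 (Ilocos Region) | Pangasinan | Mapandan          | 2429     |
| 1397 | Region 1 (Ilocos Region) | Pangasinan | Mangatarem        | 2413     |
| 1398 | Region 1 (Ilocos Region) | Pangasinan | Mangaldan         | 2432     |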


In [60]:
# replace empty strings with NaN, then drop rows where every column is missing
df_zip = df_zip.replace("", np.nan)
df_zip = df_zip.dropna(how="all")
df_zip.info()

<class 'pandas.core.frame.DataFrame'>
Index: 960 entries, 438 to 1399
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Region             960 non-null    object
 1   Provinces          960 non-null    object
 2   City/Municipality  960 non-null    object
 3   Zip Code           960 non-null    object
dtypes: object(4)
memory usage: 37.5+ KB
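
The cleaning cell uses `dropna(how="all")`, which removes a row only when every column is missing. The toy frame below is invented for illustration. It contrasts that with `how="any"` and adds the whitespace-stripping step that the cell itself does not perform.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is blank, row 1 is complete, row 2 is missing its zip code.
toy = pd.DataFrame({
    "Region": ["  ", "NCR (National Capital Region) ", "CAR (Cordillera Administrative Region)"],
    "Zip Code": ["", " 1011", ""],
})

# Strip leading and trailing whitespace in every (string) column.
stripped = toy.apply(lambda col: col.str.strip())

# Turn empty strings into NaN so pandas treats them as missing.
cleaned = stripped.replace("", np.nan)

dropped_all = cleaned.dropna(how="all")  # drops only the fully empty row
dropped_any = cleaned.dropna(how="any")  # drops every row with a missing value

print(len(dropped_all), len(dropped_any))
```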


In [61]:
df_zip

|      | Region                                 | Provinces    | City/Municipality | Zip Code |
| ---- | -------------------------------------- | ------------ | ----------------- | -------- |
| 438  | NCR (National Capital Region)          | Metro Manila | Pandacan          | 1011     |
| 440  | CAR (Cordillera Administrative Region) | Apayao       | Santa Marcela     | 3811     |
| 441  | CAR (Cordillera Administrative Region) | Kalinga      | Tabuk City        | 3800     |
| 442  | CAR (Cordillera Administrative Region) | Kalinga      | Tanudan           | 3805     |
| 443  | CAR (Cordillera Administrative Region) | Kalinga      | Tinglayan         | 3804     |
| ...  | ...                                    | ...          | ...               | ...      |
| 1395 | Region 1 (Ilocos Region)               | Pangasinan   | Natividad         | 2446     |
| 1396 | Region 1 (Ilocos Region)               | Pangasinan   | Mapandan          | 2429     |
| 1397 | Region 1 (Ilocos Region)               | Pangasinan   | Mangatarem        | 2413     |
| 1398 | Region 1 (Ilocos Region)               | Pangasinan   | Mangaldan         | 2432     |


BeautifulSoup is not the only library for web scraping, but it is one of the most popular and easiest to use. Alternatives include Scrapy (a crawling framework) and Selenium (browser automation, useful for pages rendered by JavaScript).

The advantages of BeautifulSoup include its simplicity, ease of use, and ability to handle poorly formatted HTML. However, it may not be as fast as other libraries for large-scale scraping tasks.
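
Because a live page can be unavailable or change its layout, it helps to rehearse the table-to-DataFrame workflow offline. The sketch below applies the same steps to the small table from the HTML Basics section. That table has no `<thead>` or `<tbody>`, so the header is read from the first `<tr>`; the name `df_sample` is introduced here for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# The sample table from the HTML Basics section, kept as a string.
html = """
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Row 1, Cell 1</td>
    <td>Row 1, Cell 2</td>
  </tr>
  <tr>
    <td>Row 2, Cell 1</td>
    <td>Row 2, Cell 2</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# The first row holds the column names in <th> cells.
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# The remaining rows hold data in <td> cells; skip rows without any.
data_rows = []
for row in rows[1:]:
    cells = row.find_all("td")
    if not cells:
        continue
    data_rows.append([c.get_text(strip=True) for c in cells])

df_sample = pd.DataFrame(data_rows, columns=columns)
print(df_sample)
```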

# Part II: Using APIs

## What is an API?
An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other. APIs provide a way to access data and services from web applications in a structured manner.

## Why use APIs?
APIs are preferred over web scraping for several reasons:
1. Structured Data: APIs provide data in structured formats like JSON or XML, making it easier to parse and use.
2. Reliability: APIs are designed for data access, while web scraping relies on the structure of web pages, which can change frequently.
3. Efficiency: APIs often provide more efficient access to data, reducing the need for complex parsing logic.

| Topic             | Scraping      | API         |
| ----------------- | ------------- | ----------- |
| Data structure    | Messy HTML    | Clean JSON  |
| Reliability       | Changes often | Stable      |
| Legal status      | Ambiguous     | Clear terms |
| Query flexibility | Limited       | High        |


## Google Maps API 

The Google Maps API allows developers to access various services provided by Google Maps, such as geocoding, directions, and places information. To use the Google Maps API, you need to sign up for an API key and follow the usage guidelines provided by Google.

APIs most relevant to economic analysis:
- Geocoding API: converts an address to coordinates
- Distance Matrix API: returns travel times (key for defining geographic markets)
- Places API: lists businesses (supermarkets, pharmacies, etc.)
- Directions API: returns driving routes
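
As a rough sketch of how Distance Matrix results could feed a geographic-market analysis, the code below builds a request URL and converts a response into a matrix of travel times in minutes. The helper names, the place names, and the sample payload are made up for illustration. The payload follows the general shape of the API's JSON response but is not a real result.

```python
from urllib.parse import urlencode

DISTANCE_MATRIX_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_distance_matrix_url(origins, destinations, api_key, mode="driving"):
    """Build the request URL; multiple places are separated by '|'."""
    params = {
        "origins": "|".join(origins),
        "destinations": "|".join(destinations),
        "mode": mode,
        "key": api_key,
    }
    return f"{DISTANCE_MATRIX_URL}?{urlencode(params)}"


def travel_minutes(payload):
    """Return one list per origin with minutes to each destination (None if no route)."""
    matrix = []
    for row in payload["rows"]:
        matrix.append([
            element["duration"]["value"] / 60 if element.get("status") == "OK" else None
            for element in row["elements"]
        ])
    return matrix


# Hand-written payload for illustration only; durations are in seconds.
sample_payload = {
    "status": "OK",
    "rows": [
        {"elements": [
            {"status": "OK", "duration": {"value": 1800}, "distance": {"value": 12000}},
            {"status": "ZERO_RESULTS"},
        ]},
    ],
}

url = build_distance_matrix_url(["Quezon City"], ["Makati", "Cebu City"], "YOUR_API_KEY")
minutes = travel_minutes(sample_payload)
print(url)
print(minutes)
```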

## Setting Up Google Maps API
1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
2. Create a new project.
3. Enable the Google Maps APIs you need (e.g., Geocoding API, Distance Matrix API).
4. Generate an API key and restrict its usage to your project.
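
Once a key exists, a Geocoding request could look roughly like the sketch below. The environment variable name `GOOGLE_MAPS_API_KEY`, the helper names, and the sample payload (including its coordinates) are assumptions for illustration, not output from the API.

```python
import os
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Hypothetical variable name: read the key from the environment, not from the notebook.
API_KEY = os.environ.get("GOOGLE_MAPS_API_KEY", "YOUR_API_KEY")


def build_geocode_url(address, api_key=API_KEY):
    """Build the Geocoding request URL for a single address."""
    return f"{GEOCODE_URL}?{urlencode({'address': address, 'key': api_key})}"


def extract_coordinates(payload):
    """Return (lat, lng) of the first result, or None when nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written payload with placeholder coordinates, for illustration only.
sample_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 14.65, "lng": 121.07}}}],
}

geocode_url = build_geocode_url("Diliman, Quezon City", "TEST_KEY")
coordinates = extract_coordinates(sample_payload)
print(geocode_url)
print(coordinates)
```

With a valid key, `requests.get(geocode_url, timeout=30).json()` would return the payload to pass to `extract_coordinates`. Keeping the key out of the notebook avoids publishing it by accident.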

## Other Useful APIs
- OpenWeatherMap API: Provides weather data for locations worldwide.
- Twitter API: Access tweets and user data for social media analysis.

In short, an API lets you send a request to a server and get data back in a structured format such as JSON or XML. You can then access the data without scraping web pages, which can be unreliable and may violate the terms of service of some websites.
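
To show why structured responses are easier to work with than HTML, here is a minimal sketch using the standard `json` module. The payload is invented for illustration and does not come from any real API.

```python
import json

# Hypothetical JSON body, as an API might return it (illustrative values only).
raw = """
{
  "city": "Manila",
  "stations": [
    {"name": "A", "temp_c": 31.5},
    {"name": "B", "temp_c": 29.0}
  ]
}
"""

data = json.loads(raw)  # JSON text becomes nested dicts and lists

# Fields are addressed by name, with no tag searching or text cleanup.
temps = {station["name"]: station["temp_c"] for station in data["stations"]}
average_temp = sum(temps.values()) / len(temps)

print(data["city"], temps, average_temp)
```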

APIs are widely used in various applications, from mobile apps to web services, to provide dynamic content and functionality. For academic research, APIs can be invaluable for accessing large datasets and integrating data from multiple sources.

# Extracting Tables from PDF

In [None]:
import camelot

ic_path = '/Users/moxballo/Documents/GitHub/ds4upse-2526-s1/03_data/IC Data.pdf'

In [42]:
# read the PDF file and extract all tables into a list of DataFrames
dfs = camelot.read_pdf(ic_path, pages="1-end", flavor="lattice")
df = dfs[0].df  # get the first table as a DataFrame

In [43]:
df

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| - | - | - | - | - | - | - | - | - | - | - |
| 0 | Premium Income of Life Insurance Companies\nas... | | | | | | | | | |
| 1 | Name of Company | FIRST YEAR | | SINGLE | | RENEWAL | | TOTAL | | GRAND TOTAL |
| 2 | | Traditional | Variable | Traditional | Variable | Traditional | Variable | Traditional | Variable | |
| 3 | | | | | | | | | | |
| 4 | 1 \n. Sun Life of Canada (Philippines), Inc. | 4,603,960,375\n₱ | 3343023258 | 2684863725 | 7406929900 | 13493048167 | 25605339832 | 20781872267 | 36355292990 | 57,137,165,257\n₱ |
| 5 | 2 \n. Pru Life Insurance Corporation of U.K. | 631721628 | 8978954883 | 1233375407 | 976976257 | 152874208 | 36178519833 | 2017971242 | 46134450974 | 48152422216 |
| 6 | 3 \n. FWD Life Insurance Corporation | 1364308054 | 3677803818 | 594670 | 26976174586 | 2569870096 | 5263175562 | 3934772820 | 35917153966 | 39851926786 |
| 7 | 4 \n. Allianz PNB Life Insurance, Inc. | 236263937 | 974126382 | 1786997385 | 26918323513 | 610472376 | 1660694680 | 2633733698 | 29553144576 | 32186878273 |
| 8 | 5 \n. AXA Philippines Life and General Insu... | 1810800459 | 983122923 | 1232222262 | 9611441954 | 5136254233 | 7776731621 | 8179276954 | 18371296498 | 26550573452 |
| 9 | 6 \n. BDO Life Assurance Company, Inc. | 4408273408 | 186324485 | 820117 | 940366361 | 11759063575 | 2395739030 | 16168157101 | 3522429875 | 19690586976 |
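
The extracted table still needs cleaning: some amounts carry thousands separators and a trailing `\n₱`, and company names start with a rank such as `1 \n. `. The sketch below shows one possible cleanup on two rows copied from the output above. The column names and the helpers `clean_company` and `clean_amount` are hypothetical, chosen here for readability.

```python
import pandas as pd

# Two rows copied from the extracted table; column names are hypothetical.
raw = pd.DataFrame({
    "company": [
        "1 \n. Sun Life of Canada (Philippines), Inc.",
        "2 \n. Pru Life Insurance Corporation of U.K.",
    ],
    "first_year_traditional": ["4,603,960,375\n₱", "631721628"],
    "grand_total": ["57,137,165,257\n₱", "48152422216"],
})


def clean_company(series):
    """Remove the leading rank, for example '1 \\n. ', from company names."""
    return series.str.replace(r"^\d+\s*\.\s*", "", regex=True).str.strip()


def clean_amount(series):
    """Keep digits, dots, and minus signs, then convert to numbers."""
    digits = series.str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")


clean = pd.DataFrame({
    "company": clean_company(raw["company"]),
    "first_year_traditional": clean_amount(raw["first_year_traditional"]),
    "grand_total": clean_amount(raw["grand_total"]),
})
print(clean)
```

In the full table, the same functions could be applied column by column after dropping the title and header rows (rows 0 to 3 in the output above).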
