## Introduction
- Web scraping is the automated process of extracting data from the internet. 
- The Python libraries **Requests** and **Beautiful Soup** are powerful tools for the job.
- Test URL: https://realpython.github.io/fake-jobs/

## Step 1: Inspect data source
- Developer tools in Safari: Develop -> Show Web Inspector
- Developer tools in Chrome: View -> Developer -> Developer Tools

<p align="left">
  <img src="https://realpython.com/cdn-cgi/image/width=1440,format=auto/https://files.realpython.com/media/bs4-devtools.f0a236ca5fa3.png" 
  alt="Web Inspector"
  width="640" 
  height="360">
</p>

## Step 2: Scrape HTML content from a page

- Open the URL
- Print the HTML content
- Copy and paste the cell output into an HTML formatter site (https://htmlformatter.com) to tidy up the formatting

In [None]:
import requests

URL = "https://realpython.github.io/fake-jobs/"
response = requests.get(URL)
print("Status code: ", response.status_code)
print("Elapsed Time:", response.elapsed)
print(response.text)
# If needed, paste the response.text output into an HTML formatter
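
The cell above prints the status code but carries on even when the request fails. A more defensive fetch might look like the minimal sketch below. The helper name `fetch_html` and the 10-second default timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw response body, raising on network or HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(URL)` inside a `try`/`except requests.RequestException` block would then catch connection errors, timeouts, and bad status codes in one place.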

Analyzing the HTML with an HTML formatter site or Safari's Web Inspector (Inspect Element) shows:
- class="title is-5": job title
- class="subtitle is-6 company": company
- class="location": location
- class="is-small has-text-grey": post date

## Step 3: Parse HTML code with Beautiful Soup
- Find elements by ID
- Find elements by class name
- Extract text from HTML elements
- Find elements by class name and text content


In [None]:
from bs4 import BeautifulSoup
# Pass the raw bytes (response.content) rather than response.text
# so that Beautiful Soup can work out the character encoding itself.
sp = BeautifulSoup(response.content, "html.parser")
print(sp.title)
print(sp.title.text)

#### 3.1 Find elements by ID

In [None]:
# find elements by ID
# In HTML, the id attribute is used to uniquely identify an element within the document.
results = sp.find(id="ResultsContainer")
print(results.prettify()) 
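
`find()` returns `None` when no element matches, so calling `.prettify()` on the result fails if the page layout changes. The sketch below shows that behaviour on a tiny stand-in document. The inline markup is an assumption for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page (assumed markup).
html = """
<div id="ResultsContainer">
  <div class="card-content">First card</div>
  <div class="card-content">Second card</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchContainer")  # no match, so this is None

if missing is None:
    print("Container not found; the page layout may have changed")

# The same lookup written as a CSS selector.
same_results = soup.select_one("#ResultsContainer")
print(same_results["id"])
```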

#### 3.2 Find elements by class name

In [None]:
# find elements by tag name and class name
job_cards = results.find_all("div", class_="card-content")
print(len(job_cards), '\n')
print(job_cards[0].prettify()[:600], '\n') 
for job_card in job_cards:
    print(job_card.prettify()[:600], end="\n" * 2)
    print("=" * 60, end="\n" * 2)

In [None]:
# print all job listings
print('JOB LISTINGS')
for i, job_card in enumerate(job_cards):
    title_element = job_card.find("h2", class_="title is-5")
    company_element = job_card.find("h3", class_="subtitle is-6 company")
    location_element = job_card.find("p", class_="location")
    print(f"Job {i}:")
    print(title_element.prettify())
    print(company_element.prettify())
    print(location_element.prettify())
    print('-' * 80)
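
The `class_` argument behaves differently for a single class name than for a multi-word string. The sketch below, which uses a one-line stand-in for a job title element, compares the variants. The CSS selector is offered as an alternative rather than as what the notebook uses.

```python
from bs4 import BeautifulSoup

html = '<h2 class="title is-5">Senior Python Developer</h2>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any element that carries that class.
by_one_class = soup.find_all("h2", class_="title")

# A multi-word string is compared with the whole class attribute, order included.
by_exact_string = soup.find_all("h2", class_="title is-5")
by_swapped_order = soup.find_all("h2", class_="is-5 title")

# A CSS selector matches both classes regardless of their order.
by_selector = soup.select("h2.is-5.title")

print(len(by_one_class), len(by_exact_string), len(by_swapped_order), len(by_selector))
```

Searching on one distinctive class (for example `class_="company"`) is therefore less brittle than repeating the full attribute string.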

#### 3.3 Extract Text From HTML Elements

In [None]:
# Extract Text From HTML Elements
for i, job_card in enumerate(job_cards):
    title_element = job_card.find("h2", class_="title is-5")
    company_element = job_card.find("h3", class_="subtitle is-6 company")
    location_element = job_card.find("p", class_="location")
    print(f"Job {i}:")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

#### 3.4 Print all Python-related job titles

In [None]:
# Find all Python-related jobs
# Find elements by class name and text content
# Passing a string (e.g. "Python") to the string argument matches only elements whose text is exactly that string
python_jobs = results.find_all("h2", string="Python")
print(len(python_jobs))
senior_python_developer_jobs = results.find_all("h2", string="Senior Python Developer")
print(len(senior_python_developer_jobs))

# Passing a lambda function to the string argument returns every element for which the function returns True.
python_jobs = results.find_all("h2", string=lambda text: "python" in text.lower())
print(len(python_jobs))
for job_card in python_jobs:
    print(job_card.text.strip()[:22], '\t\t', job_card)
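
The three ways of filtering by text can be compared side by side on a small stand-in document. The job titles below are illustrative. The `None` guard is a precaution: an `<h2>` without a single text node (for example one with nested tags) may hand `None` to the function, depending on the Beautiful Soup version.

```python
import re

from bs4 import BeautifulSoup

html = """
<h2 class="title is-5">Senior Python Developer</h2>
<h2 class="title is-5">Energy engineer</h2>
<h2 class="title is-5">Python Programmer (Entry-Level)</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: no heading consists of the word "Python" alone.
exact = soup.find_all("h2", string="Python")


def mentions_python(text):
    # Guard against None before calling .lower().
    return text is not None and "python" in text.lower()


by_function = soup.find_all("h2", string=mentions_python)

# A compiled regular expression is searched within the text.
by_regex = soup.find_all("h2", string=re.compile("python", re.IGNORECASE))

print(len(exact), len(by_function), len(by_regex))
```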

#### 3.5 Print all Python-related job titles, companies, and locations

In [None]:
python_jobs = results.find_all("h2", string=lambda text: "python" in text.lower())
python_job_cards = [h2_element.parent.parent.parent for h2_element in python_jobs]
for job_card in python_job_cards:
    title_element = job_card.find("h2", class_="title is-5")
    company_element = job_card.find("h3", class_="subtitle is-6 company")
    location_element = job_card.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
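
Chaining `.parent.parent.parent` depends on the exact nesting depth of the page. One alternative, sketched here on assumed markup with made-up company and location values, is to climb by name and class with `find_parent()`.

```python
from bs4 import BeautifulSoup

# Assumed card structure; names and places are placeholders.
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title is-5">Senior Python Developer</h2>
      <h3 class="subtitle is-6 company">Example Corp</h3>
    </div>
  </div>
  <div class="content">
    <p class="location">Example City</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2")

# Three hops up: media-content -> media -> card-content.
by_hops = h2_element.parent.parent.parent

# Nearest ancestor <div> that carries the card-content class.
by_name = h2_element.find_parent("div", class_="card-content")

print(by_hops is by_name)
print(by_name.find("p", class_="location").text.strip())
```

If a wrapper element is later added or removed, the `find_parent()` version should still land on the card, whereas the hop count would need updating.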

#### 3.6 Extract attributes from HTML elements

In [None]:
for job_card in python_job_cards:
    link_url = job_card.find_all("a")[1]["href"]
    print(f"Apply here: {link_url}\n")        
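
Indexing with `[1]` assumes the apply link is always the second `<a>` in the card. A sketch of a more explicit lookup follows. It selects the link by its text and resolves a relative `href` against the page URL. The footer markup and the relative path are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://realpython.github.io/fake-jobs/"

# Assumed footer markup with a made-up relative link.
html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="jobs/example-job.html" class="card-footer-item">Apply</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# Select the link by its text instead of by position.
apply_link = soup.find("a", string="Apply")

# .get() returns None for a missing attribute instead of raising KeyError.
href = apply_link.get("href")
target = apply_link.get("target")

# urljoin leaves absolute URLs alone and resolves relative ones.
absolute_url = urljoin(BASE_URL, href)
print(f"Apply here: {absolute_url}")
```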

## Step 4: Assemble code in a script

In [None]:
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
response = requests.get(URL)

sp = BeautifulSoup(response.content, "html.parser")
results = sp.find(id="ResultsContainer")

python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_cards = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]

for job_card in python_job_cards:
    title_element = job_card.find("h2", class_="title")
    company_element = job_card.find("h3", class_="company")
    location_element = job_card.find("p", class_="location")
    link_url = job_card.find_all("a")[1]["href"]
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print(f"Apply here: {link_url}\n")
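
For reuse, the parsing part of the script could be separated from the download and the printing. The sketch below is one possible arrangement, not the notebook's own design. The function name `parse_jobs`, its `keyword` parameter, and the sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup


def parse_jobs(html, keyword="python"):
    """Return a list of dicts for cards whose title contains `keyword`."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="card-content"):
        title = card.find("h2", class_="title")
        company = card.find("h3", class_="company")
        location = card.find("p", class_="location")
        # Skip cards without a title or without the keyword in it.
        if title is None or keyword not in title.get_text().lower():
            continue
        jobs.append({
            "title": title.get_text(strip=True),
            "company": company.get_text(strip=True) if company is not None else "",
            "location": location.get_text(strip=True) if location is not None else "",
        })
    return jobs


# Made-up sample data standing in for the downloaded page.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Example Corp</h3>
  <p class="location">Example City</p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Another Corp</h3>
  <p class="location">Another City</p>
</div>
"""
jobs = parse_jobs(sample_html)
for job in jobs:
    print(job["title"], "|", job["company"], "|", job["location"])
```

Passing `response.content` from the script above in place of `sample_html` would apply the same parsing to the live page.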

## Lab

- Python.org job board: https://www.python.org/jobs/
- find job title and company name

In [None]:
# Open the URL and check that the status code is 200
import requests
from bs4 import BeautifulSoup

URL = "https://www.python.org/jobs/"
response = requests.get(URL)
print("Status code: ", response.status_code)
print("Elapsed Time:", response.elapsed)
print(response.text)

In [None]:
# Parse the HTML content and check that parsing succeeded
sp = BeautifulSoup(response.content, "html.parser")
print(sp.title)
print(sp.title.text)

In [None]:
# scoping the job container
results = sp.find("ol", class_="list-recent-jobs list-row-container menu")
if results:
    print(len(results.find_all("li")))  # Count the number of <li> elements inside the <ol>
else:
    print("No job cards found")
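
On this page the company name appears to share a `<span>` with the job title link, which makes plain `.text` awkward to split. The sketch below uses `stripped_strings` instead. The markup is an assumed shape of one listing with placeholder values, and the live page may differ or change.

```python
from bs4 import BeautifulSoup

# Assumed shape of one listing; title and company are placeholders.
html = """
<ol class="list-recent-jobs list-row-container menu">
  <li>
    <h2 class="listing-company">
      <span class="listing-company-name">
        <a href="/jobs/0000/">Example Python Engineer</a><br/>
        Example Company
      </span>
    </h2>
  </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

listings = []
for item in soup.find("ol", class_="list-recent-jobs").find_all("li"):
    name_span = item.find("span", class_="listing-company-name")
    title = item.find("a").get_text(strip=True)
    # stripped_strings yields each text fragment with whitespace removed;
    # in this assumed layout the company name is the last fragment.
    company = list(name_span.stripped_strings)[-1]
    listings.append((title, company))

print(listings)
```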

In [133]:
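# Print all job listings
# Collect one <li> per job from the <ol> scoped above; without this line,
# job_cards would still hold the cards from the fake-jobs page.
job_cards = results.find_all("li")
print('JOB LISTINGS')
for i, job_card in enumerate(job_cards):
    title_element = job_card.find("a").text
    # The company name is the last line of text inside the span
    company_element = job_card.find("span", class_="listing-company-name").text.replace("\t", "").strip().split("\n")[-1]
    print(f"Job {i}:")
    print(title_element.strip())
    print(company_element.strip())
    print('-' * 80)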

JOB LISTINGS
Job 0:
Python Software Engineer
HypothesisBase
--------------------------------------------------------------------------------
Job 1:
Python Lead Developer
SenecaGlobal
--------------------------------------------------------------------------------
Job 2:
Full Stack Python Developer
TeraLumen Solutions Pvt Ltd
--------------------------------------------------------------------------------
Job 3:
Senior Back-End Python Engineer
ActivePrime, Inc.
--------------------------------------------------------------------------------
Job 4:
Senior Python Engineer
Kazang a company part of the Lesaka Technologies Group
--------------------------------------------------------------------------------
Job 5:
Software Engineer - OpenStack Swift Object Storage
Red Hat, Inc.
--------------------------------------------------------------------------------
Job 6:
Porting code from Rust to Python on GPU
NJB Brands LLC
-------------------------------------------------------------------------