## 1111 Job Listing Scraper

This notebook is my practice at scraping job listings from the 1111.com.tw website. I chose 1111 as the target because its page structure is relatively straightforward and it does not appear to enforce strict anti-scraping restrictions, which makes it suitable both for my own practice and for illustrating web scraping techniques.

The HTML class structures on the 1111 website are also clear and consistent, simplifying the process of identifying and extracting relevant job information such as titles, companies, locations, and salaries.
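
To make the idea of class-based extraction concrete, here is a minimal sketch on a made-up HTML snippet. The markup and class names (`job-card`, `job-title`, `bold`) are assumptions for illustration only and are not the real 1111 class names. It also shows one detail worth knowing: in BeautifulSoup, a multi-class string passed to `class_` must match the whole class attribute exactly, while a CSS selector matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are invented for illustration.
html = """
<div class="job-card">
  <h2 class="job-title bold">Data Analyst</h2>
  <span class="salary">Negotiable</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="job-card")

# A single class name matches any tag that carries it among its classes.
title_tag = card.find("h2", class_="job-title")

# A multi-class string must equal the full class attribute, order included,
# so this lookup finds nothing.
reordered = card.find("h2", class_="bold job-title")

# A CSS selector matches the same tag regardless of class order.
by_css = card.select_one("h2.bold.job-title")

# Guard against missing tags, as the notebook's get_job_info does.
title = title_tag.get_text(strip=True) if title_tag else ""
salary_tag = card.find("span", class_="salary")
salary = salary_tag.get_text(strip=True) if salary_tag else ""
```

If the site ever reorders or adds utility classes, selectors of the second kind would likely keep working where an exact multi-class string would not.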

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

In [2]:
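def get_job_site(url):
  # Return the parsed page, or None if the request fails (the caller checks for this).
  try:
    res = requests.get(url, timeout=30)
    res.raise_for_status()
  except requests.RequestException:
    return None
  return BeautifulSoup(res.text, 'html.parser')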

In [3]:
def get_job_info(job_info):
  url_tags = job_info.select("a")
  title_tag = job_info.find("h2", class_ = "text-[18px] leading-[1.5] font-medium whitespace-wrap break-all")
  company_tag = job_info.find("h2", class_ = "inline")
  location_tag = job_info.find("a", class_ = "job-card-condition__text cursor-pointer hover:underline underline-offset-2")
  salary_tag = job_info.find("h4", class_ = "job-card-condition__text")

  job_url = url_tags[0]["href"] if url_tags and "href" in url_tags[0].attrs else ""
  job_title = title_tag.get_text(strip=True) if title_tag else ""
  company = company_tag.get_text(strip=True) if company_tag else ""
  location = location_tag.get_text(strip=True) if location_tag else ""
  salary = salary_tag.get_text(strip=True) if salary_tag else ""

  return {
      'url': job_url,
      'title': job_title,
      'company': company,
      'location': location,
      'salary': salary
  }

In [4]:
def scrape_all_pages(start_url, max_pages):
    all_records = []
    page = 1

    # The initial URL contains page=1.
    # For 1111 website, 'page' parameter is always in the format 'page=X' in the URL.
    base_url_template = start_url.replace("page=1", "page={}")

    while True:
        current_url = base_url_template.format(page)

        print(f"Fetching page {page} from {current_url}")
        soup = get_job_site(current_url)
        time.sleep(random.randint(5, 15)) # Add a random delay between 5 and 15 seconds

        if not soup:
            print(f"Failed to retrieve content for page {page}. Stopping.")
            break

        job_listings_on_page = soup.find_all("div", class_="flex flex-col lg:gap-4 lg:flex-row")

        # If no job listings are found, it might indicate the end of available pages
        if not job_listings_on_page:
            print(f"No job listings found on page {page}. Stopping.")
            break

        df_page = build_records(soup)
        if not df_page.empty:
            all_records.append(df_page)

        page += 1
        if max_pages and page > max_pages:
            print(f"Reached maximum page limit ({max_pages}). Stopping.")
            break

    if all_records:
        full_df = pd.concat(all_records, ignore_index=True)
        return full_df
    else:
        print("No records found after scraping.")
        return pd.DataFrame() # Return an empty DataFrame if no records are found
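
The `replace("page=1", "page={}")` step above works for the search URL used in this notebook, but it would also rewrite a URL that starts at `page=10` or that has another parameter ending in `page=1`. As an alternative, here is a minimal sketch that sets the `page` query parameter with the standard library's `urllib.parse`. The helper name `with_page` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page):
    """Return url with its 'page' query parameter set to the given page number."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "page" for key, _ in pairs):
        # Replace the existing value and keep the parameter order.
        pairs = [(k, str(page) if k == "page" else v) for k, v in pairs]
    else:
        # No 'page' parameter yet, so append one.
        pairs.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

start_url = "https://www.1111.com.tw/search/job?page=1&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD"
page_3_url = with_page(start_url, 3)
```

Note that `urlencode` re-encodes the query string, so values containing spaces would come out with `+` rather than `%20`; for the URL in this notebook the result keeps the same form as the original.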

In [5]:
def build_records(soup):
  records = []
  for job_info in soup.find_all("div", class_="flex flex-col lg:gap-4 lg:flex-row"):
    record = get_job_info(job_info)
    records.append(record)
  df = pd.DataFrame(records)
  return df

In [6]:
def create_file(df):
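  url_prefix = "https://www.1111.com.tw"

  # Work on a copy so repeated calls do not prefix the URLs twice.
  df = df.copy()
  # Add the prefix only to non-empty links; empty strings stay empty.
  df['url'] = df['url'].apply(lambda u: url_prefix + u if u else u)
  df.to_excel('job_listings.xlsx', index=False)  # writing .xlsx requires openpyxl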
  print('DataFrame successfully saved to job_listings.xlsx')
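
Prefixing works as long as every scraped `href` is a relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles relative and already-absolute links alike. The helper name `to_absolute` and the example path `/job/123` are assumptions for illustration, not values taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.1111.com.tw"

def to_absolute(href, base=BASE_URL):
    """Turn a scraped href into an absolute URL; keep empty values empty."""
    if not href:
        return ""
    # urljoin leaves an already-absolute href unchanged.
    return urljoin(base, href)

relative_link = to_absolute("/job/123")  # hypothetical relative path
absolute_link = to_absolute("https://www.1111.com.tw/job/123")
missing_link = to_absolute("")
```

Inside `create_file`, this could be applied with `df['url'].apply(to_absolute)` in place of the string concatenation.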

In [7]:
# Main execution
url = "https://www.1111.com.tw/search/job?page=1&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD"

df_jobs = scrape_all_pages(url, max_pages=None) # Scrape all available pages

if not df_jobs.empty:
    create_file(df_jobs)
    print(f"Successfully processed and saved {len(df_jobs)} job listings from multiple pages.")
else:
    print("No job listings found or processed.")

Fetching page 1 from https://www.1111.com.tw/search/job?page=1&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD
Fetching page 2 from https://www.1111.com.tw/search/job?page=2&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD
Fetching page 3 from https://www.1111.com.tw/search/job?page=3&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD
Fetching page 4 from https://www.1111.com.tw/search/job?page=4&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD
Fetching page 5 from https://www.1111.com.tw/search/job?page=5&col=da&sort=desc&ks=%E8%9B%8B%E7%99%BD
No job listings found on page 5. Stopping.
DataFrame successfully saved to job_listings.xlsx
Successfully processed and saved 40 job listings from multiple pages.
