# INFORMATION
- 4-Week Goal: Learn data scraping from basic to mid-level, and be able to scrape Google Maps successfully
- Session format: 30 minutes reviewing last week’s challenge, then 60 minutes of practice

# WEEK 1: Data Scraping Basics

## What is Data Scraping
Data scraping, also known as web scraping, is the automated process of extracting large amounts of data from websites or other digital sources using software tools, scripts, or bots. This technique involves parsing the underlying HTML or structured data of web pages to collect specific information, such as prices, contact details, or user reviews, which can then be stored in databases, spreadsheets, or used for analysis. While it enables efficient data gathering for purposes like market research, competitive analysis, or machine learning training, data scraping must comply with legal and ethical guidelines, as it can violate terms of service, infringe on copyrights, or overload servers if not done responsibly.
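
One common way to act on the "done responsibly" point is to check a site's robots.txt before requesting pages. The sketch below uses Python's standard `urllib.robotparser` on an inline sample, so it makes no network request. The rules, the bot name `course-bot`, and the `example.com` URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "course-bot" is a made-up user agent name for this sketch.
allowed_public = parser.can_fetch("course-bot", "https://example.com/articles/1")
allowed_private = parser.can_fetch("course-bot", "https://example.com/private/report")
delay_seconds = parser.crawl_delay("course-bot")

print(allowed_public, allowed_private, delay_seconds)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would load the real file instead of the inline sample. Note that robots.txt is a convention, not a substitute for reading the site's terms of service.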

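## Core Concepts
- HTML: The main markup language that defines a page's structure
- CSS: Used for styling, but its selectors (e.g., .class or #id) are also how a scraper picks specific elements on a page
- JS (JavaScript): Many modern sites load data dynamically via JavaScript (AJAX), so the data may not be in the initial HTML
- API (Application Programming Interface): An endpoint that returns structured data (often JSON), which is usually easier to consume than parsing HTML
- Cookies: Small pieces of data stored by the browser, often holding session tokens for authentication or tracking
- Sitemap: An XML file that lists the pages a site wants crawlers to find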
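
As a small illustration of the sitemap concept above, the sketch below extracts page URLs from a sitemap using only the standard library. The sample XML and the `example.com` URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/articles/1</loc></url>
</urlset>
"""


def extract_urls(sitemap_xml: bytes) -> list:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = extract_urls(SAMPLE_SITEMAP)
print(urls)
```

With a real site, the bytes would typically come from `requests.get(...).content` for that site's sitemap URL.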

More Advanced Concepts:
- CAPTCHA: A challenge-response test (e.g., image recognition or puzzles) designed to distinguish humans from bots
- Honeypots: Hidden traps on websites, like invisible form fields that bots might fill out but humans won't see
- Proxies and IP Rotation: Using a pool of proxy servers to switch IP addresses randomly, which helps avoid rate limiting, bans, or detection when making repeated requests
- GraphQL: A query language for APIs that lets the client specify exactly which fields it wants returned
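
The proxy rotation idea can be sketched as follows. This version cycles through the pool in round-robin order rather than picking randomly, which is a simplification. The proxy addresses are hypothetical placeholders, and no request is actually sent.

```python
from itertools import cycle

# Hypothetical proxy addresses, for illustration only.
PROXY_POOL = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
    "http://proxy-c.example.com:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxy mappings, cycling through the pool forever."""
    for address in cycle(pool):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)

# Take four proxies to show that the rotation wraps around after the third.
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

Each yielded mapping has the shape that `requests.get` accepts through its `proxies=` parameter, so a scraper could call `next(rotation)` before every request.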

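## Data Scraping Levels
- Basic: Static websites (the data is already in the HTML the server returns)
- Mid-level: Dynamic websites (the data is rendered by JavaScript after the page loads)
- Advanced: Any website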
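
A rough heuristic for telling the first two levels apart is to check whether text you can see in the browser is already present in the raw HTML returned by the server. The sketch below illustrates that check on two hypothetical responses; the function name and sample pages are assumptions made for this example.

```python
# Hypothetical server responses, for illustration only.
STATIC_PAGE = "<html><body><h1>Price list</h1><p>Widget: $10</p></body></html>"
DYNAMIC_SHELL = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)


def appears_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if text seen in the browser is already in the server's HTML."""
    return expected_text.casefold() in raw_html.casefold()


static_ok = appears_in_raw_html(STATIC_PAGE, "Widget: $10")
dynamic_ok = appears_in_raw_html(DYNAMIC_SHELL, "Widget: $10")
print(static_ok, dynamic_ok)
```

If the text is found, a plain HTTP client plus an HTML parser is usually enough. If it is not, the page probably fills in its content with JavaScript, and a browser automation tool such as Selenium is the more likely fit.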

# Install Requirements

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.





## Wikipedia Data Scraping (Beautiful Soup 4)
https://en.wikipedia.org/wiki/Database 

Task: Retrieve all articles mentioned in the “See Also” section.

Output: Titles and URLs

In [10]:
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Database"

# Send a browser-like User-Agent; some sites reject the default one used by requests.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)

soup = BeautifulSoup(response.text, "html.parser")

# On this page, the first multi-column list (div.div-col) is the "See also" list.
container = soup.select("div.div-col")[0]

results = []
for a in container.select("li a[href^='/wiki/']"):
    # The title attribute holds the article name; fall back to the link text if it is missing.
    # Do not split on "-": that truncates titles such as "Database-as-IPC".
    title = (a.get("title") or a.get_text()).strip()
    # urljoin avoids the double slash that plain string concatenation produces.
    link = urljoin(url, a.get("href"))
    results.append({"title": title, "link": link})
    print(f"{title}: {link}")

csv_name = "url_see_also.csv"

with open(csv_name, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(results)

Comparison of database tools: https://en.wikipedia.org/wiki/Comparison_of_database_tools
Comparison of object database management systems: https://en.wikipedia.org/wiki/Comparison_of_object_database_management_systems
Comparison of object–relational database management systems: https://en.wikipedia.org/wiki/Comparison_of_object%E2%80%93relational_database_management_systems
Comparison of relational database management systems: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
Data bank: https://en.wikipedia.org/wiki/Data_bank
Data hierarchy: https://en.wikipedia.org/wiki/Data_hierarchy
Data store: https://en.wikipedia.org/wiki/Data_store
Database testing: https://en.wikipedia.org/wiki/Database_testing
Database theory: https://en.wikipedia.org/wiki/Database_theory
Database-as-IPC: https://en.wikipedia.org/wiki/Database-as-IPC
Database-centric architecture: https://en.wikipedia.org/wiki/Database-centric_architecture
Datalog: https://en.wikipedia.org/wiki/Datalog
DBOS: ht
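
A quick way to confirm that an export like the one above worked is to read the CSV back and compare it with the rows in memory. The sketch below does this round trip in a temporary directory so it does not touch the real output file. The helper names and the file name `url_see_also_check.csv` are assumptions made for this example; the two sample rows mirror entries from the "See also" list.

```python
import csv
import tempfile
from pathlib import Path

FIELDNAMES = ["title", "link"]


def write_rows(path, rows):
    """Write a list of {"title", "link"} dicts to a CSV file with a header row."""
    with open(path, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back into a list of dicts keyed by the header row."""
    with open(path, mode="r", newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))


sample_rows = [
    {"title": "Data bank", "link": "https://en.wikipedia.org/wiki/Data_bank"},
    {"title": "Datalog", "link": "https://en.wikipedia.org/wiki/Datalog"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "url_see_also_check.csv"
    write_rows(csv_path, sample_rows)
    loaded_rows = read_rows(csv_path)

print(len(loaded_rows), loaded_rows[0]["title"])
```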

## Halodoc Data Scraping (Selenium)
https://www.halodoc.com/kesehatan

Task: Scrape all health dictionary topic titles on Halodoc that start with the letter A.

Output: Titles and URLs

In [16]:
import csv
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Use a lowercase name so the Options class is not shadowed by its own instance.
options = Options()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(service=Service(), options=options)

# JavaScript snippet that highlights each link in the browser as it is scraped.
indicator = """
    arguments[0].style.outline = "4px solid red";
    arguments[0].style.backgroundColor = "red";
    arguments[0].style.color = "white";
    arguments[0].style.fontWeight = "bold";
    arguments[0].style.padding = "2px 6px";
    arguments[0].style.borderRadius = "4px";
"""

results = []
try:
    driver.get("https://www.halodoc.com/kesehatan")
    # Fixed pause so the JavaScript-rendered content can load;
    # an explicit WebDriverWait would be more robust on slow connections.
    time.sleep(4)

    topics = driver.find_elements(By.CLASS_NAME, "category-section")
    for topic in topics:
        for a in topic.find_elements(By.TAG_NAME, "a"):
            driver.execute_script(indicator, a)
            title = a.text.strip()
            link = a.get_attribute("href")
            results.append({"title": title, "link": link})
            print(f"{title}: {link}")
            time.sleep(0.5)  # short pause so the highlight is visible
finally:
    # quit() ends the whole browser session; close() only closes the current window.
    driver.quit()

csv_name = "halodoc.csv"

with open(csv_name, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(results)

Abdominal Migrain: https://www.halodoc.com/kesehatan/abdominal-migrain
Abilify: https://www.halodoc.com/kesehatan/abilify
Ablasi Retina: https://www.halodoc.com/kesehatan/ablasi-retina
Abses: https://www.halodoc.com/kesehatan/abses
Abses Gigi: https://www.halodoc.com/kesehatan/abses-gigi
Abses Otak: https://www.halodoc.com/kesehatan/abses-otak
Abses Payudara: https://www.halodoc.com/kesehatan/abses-payudara
Abses Peritonsil: https://www.halodoc.com/kesehatan/abses-peritonsil
Abses Hati: https://www.halodoc.com/kesehatan/abses-hati
Abses Paru: https://www.halodoc.com/kesehatan/abses-paru
Abses Perianal: https://www.halodoc.com/kesehatan/abses-perianal
Aceruloplasminemia: https://www.halodoc.com/kesehatan/aceruloplasminemia
Acetylcysteine: https://www.halodoc.com/kesehatan/acetylcysteine
Acetylsalicylic Acid: https://www.halodoc.com/kesehatan/acetylsalicylic-acid
Achondroplasia: https://www.halodoc.com/kesehatan/achondroplasia
Acifar: https://www.halodoc.com/kesehatan/acifar
Acitral: htt
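
The scraper above collects every link inside the `category-section` elements and does not itself check the first letter. If the page ever shows more than the "A" topics, a small filter like the sketch below could enforce the task's requirement. The function name is an assumption made for this example, the first two sample rows come from the output above, and the last two are hypothetical.

```python
def filter_by_initial(rows, letter):
    """Keep only rows whose title starts with the given letter (case-insensitive)."""
    wanted = letter.casefold()
    return [
        row for row in rows
        if row["title"].strip().casefold().startswith(wanted)
    ]


sample_rows = [
    {"title": "Abses", "link": "https://www.halodoc.com/kesehatan/abses"},
    {"title": "Acifar", "link": "https://www.halodoc.com/kesehatan/acifar"},
    # Hypothetical rows, for illustration only.
    {"title": "Example Topic B", "link": "https://example.com/topic-b"},
    {"title": "", "link": "https://example.com/empty-anchor"},
]

a_topics = filter_by_initial(sample_rows, "A")
a_titles = [row["title"] for row in a_topics]
print(a_titles)
```

Rows with empty link text are dropped as a side effect, which also removes anchors that wrap only an icon or image.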

## Challenge (Selenium)
https://www.halodoc.com/kesehatan

Task: Scrape all health dictionary topic titles on Halodoc, for every letter.

Output: Titles and URLs
