<hr style="border: 2px solid #8E7B6B; margin-top: 10px;">

<br>
<h1 style="font-family:verdana; font-size:36px; font-weight:bold"> <center>~ Notebook Scraper Paris 2024 Olympic ~</center> </h1>
<p style = "font-size:16px; font-family:verdana"><center>By: Izhar Alif Akbar / 18223129 </center></p>

<br>

<hr style="border: 2px solid #8E7B6B; margin-top: 10px;">

# ~ Notebook Contents ~

1. [**Medal Scraper**](#1)

2. [**Sport Scraper**](#2)

3. [**Athlete Scraper**](#3)

4. [**Data Pre-processing**](#4)

<hr style="border: 2px solid #8E7B6B; margin-top: 10px;">

# Initialization <a name="initialization"></a>

> Note: "Run All" on this notebook takes a long time, about 10 hours, because of the scraping process (10k+ records expected).

<hr style="border: 2px solid #8E7B6B; margin-top: 10px;">

## Libraries

In [1]:
import os
import json
import logging
import requests
import re
import unicodedata
import time
import pandas as pd

from bs4 import BeautifulSoup
from bs4.element import Tag
from tqdm import tqdm

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

## 1. Medal Scraper <a name="1"></a>

In [None]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s', datefmt='%H:%M:%S')

class MedalScraper:
    """
    Scraper that collects the medal table from the Olympics website.
    """
    URL = "https://www.olympics.com/en/olympic-games/paris-2024/medals"
    OUTPUT_FILENAME = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_medals.json")
    SCRAPPER_IDENTITY = 'Izhar Alif Akbar/18223129@std.stei.itb.ac.id'
    
    def __init__(self):
        """Initialize the scraper."""
        self.scraped_data = []
        self.headers = {
            # SCRAPPER_IDENTITY is only logged below; requests are sent with this browser User-Agent.
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        logging.info(f"Scraper diinisialisasi dengan User-Agent: {self.SCRAPPER_IDENTITY}")

    def _get_soup(self) -> BeautifulSoup | None:
        """Fetch the page and return its content as a BeautifulSoup object."""
        try:
            logging.info(f"Mengakses URL: {self.URL}")
            response = requests.get(self.URL, headers=self.headers, timeout=30)
            response.raise_for_status()
            logging.info("Halaman berhasil diakses. Mem-parsing HTML...")
            return BeautifulSoup(response.content, 'lxml')
        except requests.exceptions.RequestException as e:
            logging.error(f"Gagal mengambil halaman: {e}")
            return None

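    def _get_text_from_selector(self, parent: Tag | None, selector: str, default: str = "0") -> str:
        """Safely read the text of the first element matching `selector`, or return `default`."""
        # The caller may pass None when the parent element itself was not found.
        if parent is None:
            return default
        element = parent.select_one(selector)
        return element.get_text(strip=True) if element else default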

    def _parse_row(self, row_tag: Tag, table_soup: BeautifulSoup) -> dict:
        """Parse one country row of the medal table."""
        row_id = row_tag.get('data-row-id', '')
        row_number = row_id.split('-')[-1]

        # Find the country-name element that shares this row_id
        country_name_tag = table_soup.find('div', class_='sc-26c0a561-4', attrs={'data-row-id': row_id})
        country_name = self._get_text_from_selector(country_name_tag, 'span.sc-26c0a561-6', "N/A")

        # Look up the medal counts by row_number
        gold = self._get_text_from_selector(table_soup, f'div[data-medal-id="gold-medals-row-{row_number}"]')
        silver = self._get_text_from_selector(table_soup, f'div[data-medal-id="silver-medals-row-{row_number}"]')
        bronze = self._get_text_from_selector(table_soup, f'div[data-medal-id="bronze-medals-row-{row_number}"]')
        total = self._get_text_from_selector(table_soup, f'div[data-medal-id="total-medals-row-{row_number}"]')

        return {
            "country": country_name,
            "gold": int(gold) if gold.isdigit() else 0,
            "silver": int(silver) if silver.isdigit() else 0,
            "bronze": int(bronze) if bronze.isdigit() else 0,
            "total": int(total) if total.isdigit() else 0
        }

    def scrape(self):
        """Run the main scraping process."""
        soup = self._get_soup()
        if not soup:
            return

        table_div = soup.find("div", {"data-cy": "table-content"})
        if not table_div:
            logging.error("Container tabel utama ('table-content') tidak ditemukan.")
            return

        # Detect the country rows
        country_rows = table_div.find_all('div', class_='sc-26c0a561-2')
        logging.info(f"Ditemukan {len(country_rows)} negara di dalam tabel.")

        # TQDM (progress bar)
        for row in tqdm(country_rows, desc="Scraping Medals"):
            country_data = self._parse_row(row, table_div)
            if country_data["country"] != "N/A":
                self.scraped_data.append(country_data)

        logging.info("Scraping semua negara selesai.")

    def save_to_json(self):
        """Save the scraped data to a JSON file."""
        if not self.scraped_data:
            logging.warning("Tidak ada data untuk disimpan.")
            return

        logging.info(f"Menyimpan {len(self.scraped_data)} data ke file {self.OUTPUT_FILENAME}...")
        # Create the output directory first so open() does not fail when it is missing.
        os.makedirs(os.path.dirname(self.OUTPUT_FILENAME), exist_ok=True)
        with open(self.OUTPUT_FILENAME, 'w', encoding='utf-8') as f:
            json.dump(self.scraped_data, f, ensure_ascii=False, indent=4)
        logging.info("Penyimpanan data selesai.")

if __name__ == "__main__":
    scraper = MedalScraper()
    scraper.scrape()
    scraper.save_to_json()





13:23:44 [INFO] Scraper diinisialisasi dengan User-Agent: Izhar Alif Akbar/18223129@std.stei.itb.ac.id
13:23:44 [INFO] Mengakses URL: https://www.olympics.com/en/olympic-games/paris-2024/medals
13:23:45 [INFO] Halaman berhasil diakses. Mem-parsing HTML...
13:23:45 [INFO] Ditemukan 92 negara di dalam tabel.
Scraping Medals: 100%|██████████| 92/92 [00:01<00:00, 59.12it/s] 
13:23:46 [INFO] Scraping semua negara selesai.
13:23:46 [INFO] Menyimpan 92 data ke file c:\Izhar\Lab Basdat\Seleksi-2025-Tugas-1\Data Scraping\data\raw_medal_test.json...
13:23:46 [INFO] Penyimpanan data selesai.
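
Because `_parse_row` falls back to `0` whenever a medal selector does not match, a changed page layout can produce rows that look valid but are wrong. One way to catch this is to check that the three medal counts add up to `total`. The sketch below assumes records shaped like the ones `save_to_json` writes; `find_inconsistent_rows` is a hypothetical helper that is not part of the notebook, and the sample rows are invented placeholders, not scraped values.

```python
def find_inconsistent_rows(rows: list[dict]) -> list[str]:
    """Return the countries whose gold + silver + bronze differs from 'total'."""
    bad = []
    for row in rows:
        parts = row["gold"] + row["silver"] + row["bronze"]
        if parts != row["total"]:
            bad.append(row["country"])
    return bad


# Placeholder rows in the same shape as raw_medals.json (values are invented).
sample_rows = [
    {"country": "Country A", "gold": 2, "silver": 1, "bronze": 0, "total": 3},
    {"country": "Country B", "gold": 1, "silver": 0, "bronze": 0, "total": 0},
]

print(find_inconsistent_rows(sample_rows))  # ['Country B']
```

An empty result suggests the selectors still line up with the page; any listed country is worth inspecting by hand.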


## 2. Sport Scraper <a name="2"></a>

In [None]:
import time
import json
import re
import os
import unicodedata
import logging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
from tqdm import tqdm

# ==============================================================================
# LOGGING CONFIGURATION
# ==============================================================================
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s', datefmt='%H:%M:%S')

class OlympicScraper:
    """
    Comprehensive scraper for Paris 2024 Olympic results, covering team,
    doubles, and individual events, with a checkpoint mechanism.
    """
    
    BASE_URL = "https://www.olympics.com/en/olympic-games/paris-2024/results/"
    OUTPUT_FILENAME = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_sports.json")

    def __init__(self):
        """Initialize the scraper, load the checkpoint, and set up the WebDriver."""
        self.scraped_data = self._load_checkpoint()
        self.driver = self._setup_driver()
        logging.info("OlympicScraper berhasil diinisialisasi.")

    def _load_checkpoint(self) -> list:
        """Load data from the JSON checkpoint file if it exists."""
        if os.path.exists(self.OUTPUT_FILENAME):
            try:
                with open(self.OUTPUT_FILENAME, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                    logging.info(f"Checkpoint ditemukan. {len(data)} data berhasil dimuat.")
                    return data
            except (json.JSONDecodeError, IOError) as e:
                logging.error(f"Gagal memuat checkpoint: {e}. Memulai dari awal.")
                return []
        logging.info("Tidak ada checkpoint ditemukan. Memulai sesi scraping baru.")
        return []

    def _update_checkpoint(self):
        """Save or update the data in the JSON file."""
        try:
            with open(self.OUTPUT_FILENAME, 'w', encoding='utf-8') as f:
                json.dump(self.scraped_data, f, ensure_ascii=False, indent=4)
        except IOError as e:
            logging.error(f"Gagal menyimpan checkpoint: {e}")

    def _setup_driver(self) -> webdriver.Chrome:
        """Configure the Selenium WebDriver instance."""
        service = Service(ChromeDriverManager().install())
        options = webdriver.ChromeOptions()
        options.add_argument('--log-level=3')
        # options.add_argument('--headless') # Enable to run in the background
        
        driver = webdriver.Chrome(service=service, options=options)
        driver.set_window_size(760, 800)
        logging.info("WebDriver berhasil di-setup.")
        return driver
    
    def _handle_cookie_popup(self):
        """Dismiss the cookie pop-up if it appears."""
        try:
            cookie_accept_button = WebDriverWait(self.driver, 10).until(
                EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))
            )
            cookie_accept_button.click()
            tqdm.write("[*] Pop-up cookie ditemukan dan diterima.")
            time.sleep(1)
        except Exception:
            pass

    @staticmethod
    def slugify(text: str) -> str:
        """Convert text into URL 'slug' format."""
        text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('utf-8')
        text = text.lower()
        text = re.sub(r"\'s\b", '', text)
        text = re.sub(r"s\'\b", '', text)
        text = text.replace('&', ' and ')
        text = text.replace('+', ' plus ')
        text = re.sub(r'[()]', '', text)
        text = re.sub(r'[\s.,_]+', '-', text)
        text = re.sub(r'[^a-z0-9-]', '', text)
        text = re.sub(r'-+', '-', text)
        text = text.strip('-')
        return text

    def get_all_sports(self) -> list[dict]:
        """Fetch the list of all sports from the main results page."""
        self.driver.get(self.BASE_URL)
        self._handle_cookie_popup()
        
        try:
            WebDriverWait(self.driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'section[data-cy="disciplines-list"]'))
            )
        except Exception as e:
            logging.error(f"Gagal memuat daftar olahraga utama: {e}")
            return []

        soup = BeautifulSoup(self.driver.page_source, 'lxml')
        sport_links = []
        
        discipline_section = soup.find('section', attrs={'data-cy': 'disciplines-list'})
        if not discipline_section:
            logging.warning("Section 'disciplines-list' tidak ditemukan.")
            return []

        links = discipline_section.find_all('a', attrs={'data-cy': 'disciplines-item'})
        for link in links:
            href = link.get('href')
            name_element = link.find('p')
            if href and name_element:
                full_url = f"https://www.olympics.com{href}"
                sport_links.append({"name": name_element.text.strip(), "url": full_url})
        
        logging.info(f"Ditemukan {len(sport_links)} total cabang olahraga di situs.")
        return sport_links

    def get_events_for_sport(self, sport_url: str) -> list[dict]:
        """Fetch all events for one sport."""
        self.driver.get(sport_url)
        events_data = []
        
        dropdown_selector = 'button[data-cy="event-select"]'
        button_selector = 'div[data-cy="inline-wizard-events"] button[data-cy="event-button"]'

        try:
            dropdown_button = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, dropdown_selector)))
            self.driver.execute_script("arguments[0].click();", dropdown_button)
            time.sleep(1)

            WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, button_selector)))
            event_buttons = self.driver.find_elements(By.CSS_SELECTOR, button_selector)
            
            sport_slug = sport_url.rstrip('/').split('/')[-1]
            base_event_url = f"{self.BASE_URL}{sport_slug}/"

            for btn in event_buttons:
                event_name = btn.find_element(By.TAG_NAME, 'p').text.strip()
                event_slug = self.slugify(event_name)
                events_data.append({"name": event_name, "url": f"{base_event_url}{event_slug}"})
        
        except Exception:
            page_title = self.driver.title
            event_name = page_title.split('-')[1].strip() if '-' in page_title else "Main Event"
            events_data.append({"name": event_name, "url": sport_url})
            
        return events_data

    def _scrape_event_page(self, sport: dict, event: dict) -> list[dict]:
        """Scrape a single event page and return its results."""
        self.driver.get(event['url'])
        event_results = []
        try:
            WebDriverWait(self.driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-cy="table-content"]'))
            )
            time.sleep(3)
        except Exception:
            tqdm.write(f"  [!] Gagal memuat konten tabel untuk event: {event['name']}")
            return event_results

        result_rows = self.driver.find_elements(By.CSS_SELECTOR, 'div[data-row-id]')
        
        for row in result_rows:
            row_id = row.get_attribute('data-row-id')
            row_cy_type = row.get_attribute('data-cy')
            
            if not row_id or not row_cy_type: continue

            row_number = row_id.split('-')[-1]
            
            rank = self._get_element_text(row, f'div[data-cy="medal-row-{row_number}"] span[data-cy="ocs-text-module"]')
            
            name_to_use = "N/A"
            members_list = []
            event_type = "N/A"

            if row_cy_type == 'team-result-row':
                event_type = "Team"
                name_to_use = self._get_element_text(row, f'div[data-cy="country-name-row-{row_number}"] span.iAyztF')
                members_list = self._scrape_team_members(row, row_number, name_to_use)
            
            elif row_cy_type == 'doubles-result-row':
                event_type = "Team"
                name_to_use = self._get_element_text(row, f'div[data-cy="flag-row-{row_number}"] span')
                members_list = self._scrape_doubles_event(row)

            elif row_cy_type == 'single-athlete-result-row':
                event_type = "Individual"
                name_to_use, members_list = self._scrape_individual_athlete(row)
            
            else:
                continue
                
            event_results.append({
                "sport": sport['name'],
                "event": event['name'],
                "event_url": event['url'],
                "event_type": event_type,
                "rank": rank,
                "team_or_athlete_name": name_to_use,
                "members": members_list
            })
            time.sleep(0.2)
        return event_results

    def _get_element_text(self, parent: WebElement, selector: str, default: str = "N/A") -> str:
        """Safely read the text of a child element, returning `default` on failure."""
        try:
            return parent.find_element(By.CSS_SELECTOR, selector).text.strip()
        except Exception:
            return default

    def _scrape_team_members(self, row_element: WebElement, row_number: str, country_name: str) -> list[dict]:
        """Scrape the team members from one result row (large team events)."""
        members_list = []
        arrow_selector = f'span[data-cy="arrow-row-{row_number}"]'
        try:
            arrow_element = row_element.find_element(By.CSS_SELECTOR, arrow_selector)
            self.driver.execute_script("arguments[0].click();", arrow_element)
            
            team_container_selector = f'div[data-cy="team-members-row-{row_number}"].open a[data-cy="team-member"]'
            WebDriverWait(self.driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, team_container_selector)))
            
            member_elements = self.driver.find_elements(By.CSS_SELECTOR, team_container_selector)
            for member in member_elements:
                name = self._get_element_text(member, 'span')
                profile_url = member.get_attribute('href')
                members_list.append({"name": name, "profile_url": profile_url})
            
            self.driver.execute_script("arguments[0].click();", arrow_element)
        except Exception as e:
            tqdm.write(f"  [!] Gagal mengambil anggota tim untuk {country_name}: {e}")
        return members_list

    def _scrape_doubles_event(self, row_element: WebElement) -> list[dict]:
        """Scrape data for a doubles (pairs) event."""
        members_list = []
        try:
            athlete_elements = row_element.find_elements(By.CSS_SELECTOR, 'div[data-cy="athlete-image-name"]')
            for athlete_element in athlete_elements:
                name = self._get_element_text(athlete_element, 'h3[data-cy="athlete-name"]')
                profile_url = athlete_element.find_element(By.CSS_SELECTOR, 'a[data-cy="link"]').get_attribute('href')
                members_list.append({"name": name, "profile_url": profile_url})
        except Exception as e:
            tqdm.write(f"  [!] Gagal mengambil data pemain ganda: {e}")
        return members_list

    def _scrape_individual_athlete(self, row_element: WebElement) -> tuple[str, list]:
        """Scrape data for an individual athlete."""
        name = "N/A"
        members_list = []
        try:
            name = self._get_element_text(row_element, 'h3[data-cy="athlete-name"]')
            profile_url = row_element.find_element(By.CSS_SELECTOR, 'a[data-cy="link"]').get_attribute('href')
            members_list.append({"name": name, "profile_url": profile_url})
        except Exception as e:
            tqdm.write(f"  [!] Gagal mengambil data pemain individual: {e}")
        return name, members_list

    def run(self):
        """Run the whole scraping process from start to finish."""
        try:
            sports_to_scrape = self.get_all_sports()
            if not sports_to_scrape:
                logging.error("Tidak ada cabang olahraga yang ditemukan. Proses dihentikan.")
                return

            # Event URLs already stored in the checkpoint; these are skipped below
            scraped_event_urls = {item['event_url'] for item in self.scraped_data}
            
            sport_pbar = tqdm(sports_to_scrape, desc="Total Progress", unit="sport")
            
            for sport in sport_pbar:
                sport_pbar.set_description(f"Processing {sport['name']}")
                
                events = self.get_events_for_sport(sport['url'])
                if not events:
                    tqdm.write(f"  [-] Tidak ada event untuk {sport['name']}.")
                    continue
                
                new_data_for_this_sport = []
                event_pbar = tqdm(events, desc=f"  -> Events", unit="event", leave=False)
                for event in event_pbar:
                    
                    # Skip events already covered by the checkpoint
                    if event['url'] in scraped_event_urls:
                        continue 

                    event_pbar.set_description(f"  -> Scraping {event['name'][:30]}...")
                    results = self._scrape_event_page(sport, event)
                    new_data_for_this_sport.extend(results)

                # Save a checkpoint after each sport (safety)
                if new_data_for_this_sport:
                    self.scraped_data.extend(new_data_for_this_sport)
                    self._update_checkpoint()
                    tqdm.write(f"[✔] Checkpoint untuk {sport['name']} berhasil diperbarui.")

        except Exception as e:
            logging.critical(f"Terjadi kesalahan fatal: {e}", exc_info=True)
        finally:
            self.driver.quit()
            logging.info("Browser telah ditutup. Proses selesai.")

if __name__ == "__main__":
    scraper = OlympicScraper()
    scraper.run()
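
Event URLs are built from the event names shown in the dropdown, so the slug rules decide whether a page is found at all. The sketch below repeats the same substitutions as `OlympicScraper.slugify` in a standalone function, so it runs without Selenium. The sample event names are illustrative inputs chosen to exercise each rule; they are assumptions, not names taken from the scraped data.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Standalone copy of the slug rules used by OlympicScraper.slugify."""
    # Strip accents: decompose characters, then drop everything non-ASCII.
    text = unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("utf-8")
    text = text.lower()
    text = re.sub(r"'s\b", "", text)   # possessive: "men's" -> "men"
    text = re.sub(r"s'\b", "", text)   # "s'" directly followed by a word character
    text = text.replace("&", " and ").replace("+", " plus ")
    text = re.sub(r"[()]", "", text)           # drop parentheses, keep their content
    text = re.sub(r"[\s.,_]+", "-", text)      # separators become a single hyphen
    text = re.sub(r"[^a-z0-9-]", "", text)     # remove anything else
    return re.sub(r"-+", "-", text).strip("-")


# Illustrative inputs; the third one is an accented spelling of "Epee".
samples = [
    "Men's 100m",
    "Women's +81kg",
    "\u00c9p\u00e9e Team (Mixed)",
    "4 x 100m Relay, Men",
]
slugs = [slugify(name) for name in samples]
for slug in slugs:
    print(slug)
```

If a generated slug does not match the real page URL, `_scrape_event_page` times out waiting for the table and the event is skipped, so spot-checking a few slugs before a long run can save time.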


## 3. Athlete Scraper <a name="3"></a>

In [None]:


logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s', datefmt='%H:%M:%S')

class AthleteScraperPipeline:
    
    SPORTS_DATA_INPUT_FILE = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_sports.json")
    ATHLETES_DATA_OUTPUT_FILE = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_athletes.json")

    def __init__(self):
        """
        Initialize the pipeline by loading the required data and setting up the scraper.
        """
        self.athlete_urls_to_scrape = self._get_unique_athlete_urls()
        self.scraper = AthleteProfileScraper()

    def _get_unique_athlete_urls(self) -> list[str]:
        """
        Read the input file, extract the unique profile URLs, and sort them.
        """
        if not os.path.exists(self.SPORTS_DATA_INPUT_FILE):
            logging.error(f"File input '{self.SPORTS_DATA_INPUT_FILE}' tidak ditemukan. Jalankan sport_scraper.py terlebih dahulu.")
            return []

        try:
            with open(self.SPORTS_DATA_INPUT_FILE, 'r', encoding='utf-8') as f:
                sports_data = json.load(f)
            
            all_urls = set()
            for item in sports_data:
                if "members" in item and item["members"]:
                    for member in item["members"]:
                        if "profile_url" in member and member["profile_url"]:
                            all_urls.add(member["profile_url"])
            
            logging.info(f"Ditemukan {len(all_urls)} URL atlet unik dari {len(sports_data)} entri data olahraga.")
            return sorted(all_urls)

        except (json.JSONDecodeError, IOError) as e:
            logging.error(f"Gagal membaca atau mem-parsing file '{self.SPORTS_DATA_INPUT_FILE}': {e}")
            return []

    def run(self):
        """
        Run the whole athlete-profile scraping process.
        """
        if not self.athlete_urls_to_scrape:
            logging.warning("Tidak ada URL atlet untuk di-scrape. Proses dihentikan.")
            return

        try:
            self.scraper.scrape_all(self.athlete_urls_to_scrape, self.ATHLETES_DATA_OUTPUT_FILE)
        except Exception as e:
            logging.critical(f"Terjadi kesalahan fatal selama proses pipeline: {e}", exc_info=True)
        finally:
            self.scraper.close()

class AthleteProfileScraper:
    """
    Dedicated scraper for athlete profile pages.
    """
    CHECKPOINT_INTERVAL = 50  # Save a checkpoint every 50 records

    def __init__(self):
        self.driver = self._setup_driver()
        self.scraped_data = []

    def _setup_driver(self) -> webdriver.Chrome:
        service = Service(ChromeDriverManager().install())
        options = webdriver.ChromeOptions()
        options.add_argument('--log-level=3')
        driver = webdriver.Chrome(service=service, options=options)
        driver.set_window_size(1024, 768)
        return driver

    def _load_checkpoint(self, filename: str) -> list:
        if os.path.exists(filename):
            try:
                with open(filename, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                    logging.info(f"Checkpoint atlet ditemukan. {len(data)} data berhasil dimuat.")
                    return data
            except (json.JSONDecodeError, IOError): return []
        return []

    def _update_checkpoint(self, filename: str):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.scraped_data, f, indent=4, ensure_ascii=False)

    def _get_profile_section(self, url: str) -> Tag | None:
        self.driver.get(url)
        try:
            WebDriverWait(self.driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'section[data-cy="athlete-profile"]'))
            )
            time.sleep(2)
            soup = BeautifulSoup(self.driver.page_source, 'html.parser')
            return soup.find('section', attrs={'data-cy': 'athlete-profile'})
        except Exception:
            return None

    def _parse_medals(self, profile_section: Tag) -> dict:
        medals = {'gold': 0, 'silver': 0, 'bronze': 0}
        medal_elements = profile_section.select('div[data-cy="medal-module"]')
        for medal in medal_elements:
            try:
                count = int(medal.select_one('span[data-cy="medal-main"]').get_text(strip=True))
                medal_type_char = medal.select_one('span[data-cy="medal-additional"]').get_text(strip=True)
                if medal_type_char == 'S': medals['silver'] = count
                elif medal_type_char == 'B': medals['bronze'] = count
                elif medal_type_char == 'G': medals['gold'] = count
            except (ValueError, AttributeError): continue
        return medals

    def scrape_all(self, urls: list[str], output_filename: str):
        self.scraped_data = self._load_checkpoint(output_filename)
        scraped_urls = {item['url'] for item in self.scraped_data}
        
        # Keep only URLs that have not been scraped yet (needed for the checkpoint)
        urls_to_process = [url for url in urls if url not in scraped_urls]
        
        if not urls_to_process:
            logging.info("Semua data atlet sudah lengkap sesuai checkpoint. Tidak ada yang perlu di-scrape.")
            return

        logging.info(f"Akan memproses {len(urls_to_process)} atlet baru.")
        
        new_entries_since_last_checkpoint = 0
        url_pbar = tqdm(urls_to_process, desc="Scraping Athlete Profiles", unit="athlete")

        for url in url_pbar:
            profile_section = self._get_profile_section(url)
            if not profile_section:
                tqdm.write(f"[!] Gagal memuat profil untuk: {url}")
                continue
            
            def get_text(selector: str, default: str = "N/A") -> str:
                element = profile_section.select_one(selector)
                return element.get_text(strip=True) if element else default

            data = {
                "url": url,
                "name": get_text('div[data-cy="display-name"] h1'),
                "country": get_text('div[data-cy="nocs"] span'),
                "discipline": get_text('div[data-cy="disciplines"] span'),
                "game_participations": get_text('span[data-cy="games-participations"]'),
                "first_olympic_games": get_text('span[data-cy="first-olympic-game"]'),
                "year_of_birth": get_text('span[data-cy="year-of-birth"]'),
                "olympic_medals": self._parse_medals(profile_section)
            }
            
            url_pbar.set_description(f"Processing {data['name']}")
            self.scraped_data.append(data)
            new_entries_since_last_checkpoint += 1

            # Save a checkpoint after every 50 successfully scraped records
            if new_entries_since_last_checkpoint >= self.CHECKPOINT_INTERVAL:
                self._update_checkpoint(output_filename)
                tqdm.write(f"[✔] Checkpoint disimpan ({len(self.scraped_data)} total atlet).")
                new_entries_since_last_checkpoint = 0
            
            time.sleep(1)

        if new_entries_since_last_checkpoint > 0:
            self._update_checkpoint(output_filename)
            logging.info(f"Penyimpanan final dilakukan. Total data atlet: {len(self.scraped_data)}")

        logging.info("Proses scraping semua atlet selesai.")

    def close(self):
        if self.driver:
            self.driver.quit()

if __name__ == "__main__":
    pipeline = AthleteScraperPipeline()
    pipeline.run()
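
The resume logic in `scrape_all` can be tried without a browser. The sketch below reproduces the same pattern under simplifying assumptions: load the saved records, skip URLs that are already present, and write the file every few new records plus once at the end. The function names, the `fake_fetch` stand-in, and the URLs `u1` to `u4` are hypothetical, and the interval is lowered to 2 so the demo stays small.

```python
import json
import os
import tempfile

CHECKPOINT_INTERVAL = 2  # small value for the demo; the notebook uses 50


def load_checkpoint(path: str) -> list[dict]:
    """Return previously saved records, or an empty list if none can be read."""
    if not os.path.exists(path):
        return []
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        return []


def save_checkpoint(path: str, records: list[dict]) -> None:
    """Overwrite the checkpoint file with all records collected so far."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)


def process_all(urls: list[str], path: str, fetch) -> list[dict]:
    """Fetch every URL missing from the checkpoint, saving at a fixed interval."""
    records = load_checkpoint(path)
    done = {record["url"] for record in records}
    pending = 0
    for url in urls:
        if url in done:
            continue  # already scraped in an earlier run
        records.append(fetch(url))
        pending += 1
        if pending >= CHECKPOINT_INTERVAL:
            save_checkpoint(path, records)
            pending = 0
    if pending:
        save_checkpoint(path, records)  # final save for the leftover records
    return records


calls = []  # records which URLs were actually fetched


def fake_fetch(url: str) -> dict:
    """Stand-in for the Selenium profile fetch; returns a placeholder record."""
    calls.append(url)
    return {"url": url, "name": "placeholder"}


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "athletes.json")
    first = process_all(["u1", "u2", "u3"], checkpoint, fake_fetch)
    second = process_all(["u1", "u2", "u3", "u4"], checkpoint, fake_fetch)

print(calls)  # ['u1', 'u2', 'u3', 'u4']
```

The second call fetches only `u4`, which is why an interrupted run can be restarted without repeating finished profiles. A larger interval means fewer disk writes but more lost work if the run crashes between saves.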


## 4. Data Pre-processing <a name="4"></a>

In [None]:
import json
import pandas as pd
import logging
import os

# ==============================================================================
# CONFIGURATION
# ==============================================================================
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s', datefmt='%H:%M:%S')

# --- File configuration ---
SPORTS_INPUT_FILE = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_sports.json")
ATHLETES_INPUT_FILE = os.path.join(os.path.dirname(os.getcwd()), "data", "raw_athletes.json")
OUTPUT_DIR = os.path.join(os.path.dirname(os.getcwd()), "data", "cleaned_data")

# ==============================================================================
# TRANSFORMATION FUNCTIONS
# ==============================================================================

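def transform_rank(rank: str) -> str | None:
    """
    Convert the rank format according to these rules:
    - G/S/B -> emas/perak/perunggu (gold/silver/bronze)
    - Number (e.g., '5', '=9') -> #number (e.g., '#5', '#9')
    - Anything else (a non-string value or a string without digits) -> None
    """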
    if not isinstance(rank, str):
        return None
    
    rank = rank.lower().strip()
    
    if rank == 'g': return 'emas'
    if rank == 's': return 'perak'
    if rank == 'b': return 'perunggu'
    
    rank_numeric = ''.join(filter(str.isdigit, rank))
    if rank_numeric:
        return f"#{rank_numeric}"
        
    return None

# ==============================================================================
# MAIN PROCESS
# ==============================================================================

def preprocess_data():
    """
    Main function that loads, cleans, transforms, and saves the data
    from all sources into normalized CSV files.
    """
    try:
        logging.info(f"Membaca file input: {SPORTS_INPUT_FILE} dan {ATHLETES_INPUT_FILE}")
        df_sports_raw = pd.read_json(SPORTS_INPUT_FILE)
        df_sports = pd.json_normalize(df_sports_raw.to_dict('records'), 'members', 
                                          ['sport', 'event', 'event_type', 'rank', 'team_or_athlete_name'])
        
        df_athletes_raw = pd.read_json(ATHLETES_INPUT_FILE)
        logging.info(f"Data berhasil dimuat. Total {len(df_sports)} partisipasi dan {len(df_athletes_raw)} profil atlet.")
        
    except FileNotFoundError as e:
        logging.error(f"Error: File tidak ditemukan -> {e}. Pastikan kedua file JSON ada.")
        return
    except Exception as e:
        logging.error(f"Error saat memuat data: {e}")
        return

    # Initial cleaning and transformation
    logging.info("Memulai proses cleaning dan transformasi...")
    
    # --- Process df_sports ---
    df_sports = df_sports.drop(columns=['profile_url'])
    for col in ['sport', 'event', 'event_type', 'team_or_athlete_name', 'name']:
        df_sports[col] = df_sports[col].str.lower()
    df_sports['medali'] = df_sports['rank'].apply(transform_rank)
    df_sports = df_sports.drop(columns=['rank'])

    df_sports.loc[df_sports['name'] == 'n/a', 'name'] = 'unknown'
    df_sports.loc[df_sports['team_or_athlete_name'] == 'team', 'team_or_athlete_name'] = 'unknown'
    logging.info("Mengubah nama 'n/a' dan negara 'team' menjadi 'unknown' di data sports.")
    
    # --- Process df_athletes_raw ---
    df_athletes_raw = df_athletes_raw.drop(columns=['url', 'olympic_medals'])
    for col in ['name', 'country', 'discipline', 'first_olympic_games']:
        df_athletes_raw[col] = df_athletes_raw[col].str.lower()
    
    df_athletes_raw.loc[df_athletes_raw['name'] == 'n/a', 'name'] = 'unknown'
    df_athletes_raw.loc[df_athletes_raw['country'] == 'team', 'country'] = 'unknown'
    df_athletes_raw['country'] = df_athletes_raw['country'].replace({
        'hong kong, china': 'hong kong',
        'n/a': 'unknown',
        'virgin islands, british': 'unknown'
    })
    logging.info("Standardizing country names (hong kong, n/a, virgin islands).")
    
    original_rows = len(df_athletes_raw)
    df_athletes_raw.dropna(subset=['country'], inplace=True)
    if original_rows > len(df_athletes_raw):
        logging.warning(f"Dropped {original_rows - len(df_athletes_raw)} rows from the athlete data because they have no country data.")

    for col in ['game_participations', 'year_of_birth']:
        df_athletes_raw[col] = pd.to_numeric(df_athletes_raw[col], errors='coerce')
    
    df_athletes_raw['game_participations'] = df_athletes_raw['game_participations'].fillna(1)
    df_athletes_raw['year_of_birth'] = df_athletes_raw['year_of_birth'].fillna(0)
    
    for col in ['game_participations', 'year_of_birth']:
        df_athletes_raw[col] = df_athletes_raw[col].astype(int)
    # Note: a year_of_birth of 0 is a sentinel for "unknown", not a real year.
    logging.info("Filling null values for participations (1) and year of birth (0).")

    # --- Build the entity tables ---
    
    logging.info("Creating entity tables...")
    all_countries = df_athletes_raw['country'].dropna().unique()
    df_negara = pd.DataFrame(all_countries, columns=['nama_negara'])
    df_negara['benua'] = None

    df_pertandingan = df_sports[['sport', 'event', 'event_type']].drop_duplicates().reset_index(drop=True)
    df_pertandingan = df_pertandingan.rename(columns={'sport': 'nama_olahraga', 'event': 'nama_pertandingan', 'event_type': 'jenis_pertandingan'})
    df_pertandingan.insert(0, 'ID', df_pertandingan.index + 1)
    
    df_atlet = df_athletes_raw.rename(columns={
        'name': 'nama',
        'country': 'nama_negara',
        'discipline': 'cabang_olahraga',
        'game_participations': 'jumlah_partisipasi_olimpiade',
        'first_olympic_games': 'olimpiade_pertama',
        'year_of_birth': 'tahun_lahir',
    })
    # Reset the index so IDs stay sequential even if rows were dropped by dropna() above.
    df_atlet = df_atlet.reset_index(drop=True)
    df_atlet.insert(0, 'ID', df_atlet.index + 1)

    # --- Build the relation table (participation) ---
    logging.info("Creating the participation table...")
    
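    # Attach the event ID to every participation row (inner join on the three event fields).
    df_partisipasi = pd.merge(df_sports, df_pertandingan, 
                              left_on=['sport', 'event', 'event_type'], 
                              right_on=['nama_olahraga', 'nama_pertandingan', 'jenis_pertandingan'])

    # Match each participation to an athlete profile by lowercased name.
    # This is an inner join, so rows without a matching profile are dropped.
    df_partisipasi = pd.merge(df_partisipasi, df_atlet, 
                              left_on='name',
                              right_on='nama')

    # Both sides carry an 'ID' column, so pandas adds suffixes:
    # ID_x is the event ID (left side) and ID_y is the athlete ID (right side).
    df_partisipasi = df_partisipasi[['ID_y', 'ID_x', 'medali']].rename(columns={'ID_y': 'ID_atlet', 'ID_x': 'ID_pertandingan'})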
    
    original_count = len(df_partisipasi)
    df_partisipasi.drop_duplicates(subset=['ID_atlet', 'ID_pertandingan'], inplace=True)
    if len(df_partisipasi) < original_count:
        logging.warning(f"Dropped {original_count - len(df_partisipasi)} duplicate participation rows.")
    
    logging.info(f"Participation table created successfully with {len(df_partisipasi)} unique entries.")

    # --- Save all tables to CSV files ---
    logging.info("Saving all tables to CSV files...")
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    df_negara[['nama_negara', 'benua']].to_csv(f"{OUTPUT_DIR}/negara.csv", index=False)
    df_pertandingan[['ID', 'nama_olahraga', 'nama_pertandingan', 'jenis_pertandingan']].to_csv(f"{OUTPUT_DIR}/pertandingan.csv", index=False)
    df_atlet[['ID', 'nama_negara', 'nama', 'cabang_olahraga', 'jumlah_partisipasi_olimpiade', 'olimpiade_pertama', 'tahun_lahir']].to_csv(f"{OUTPUT_DIR}/atlet.csv", index=False)
    df_partisipasi.to_csv(f"{OUTPUT_DIR}/partisipasi.csv", index=False)
    
    logging.info(f"All files have been saved in the folder '{OUTPUT_DIR}'.")

if __name__ == "__main__":
    preprocess_data()



19:52:15 [INFO] Reading input files: c:\Izhar\Lab Basdat\Seleksi-2025-Tugas-1\Data Scraping\data\raw_sports.json and c:\Izhar\Lab Basdat\Seleksi-2025-Tugas-1\Data Scraping\data\raw_athletes.json
19:52:15 [INFO] Data loaded successfully. Total 13978 participations and 5028 athlete profiles.
19:52:15 [INFO] Starting the cleaning and transformation process...
19:52:16 [INFO] Replacing name 'n/a' and country 'team' with 'unknown' in the sports data.
19:52:16 [INFO] Standardizing country names (hong kong, n/a, virgin islands).
19:52:16 [INFO] Filling null values for participations (1) and year of birth (0).
19:52:16 [INFO] Creating entity tables...
19:52:16 [INFO] Creating the participation table...
19:52:16 [INFO] Participation table created successfully with 7331 unique entries.
19:52:16 [INFO] Saving all tables to CSV files...
19:52:16 [INFO] All files have been saved in the folder 'c:\Izhar\Lab Basdat\Seleksi-2025-Tugas-1\Data Scraping\data\cleaned_data'.
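
The flattening step at the top of `preprocess_data()` relies on `pd.json_normalize` with a nested `members` list. The sketch below shows that behavior in isolation on a tiny, made-up input. The records, names, and URLs are hypothetical and only assume the shape that the preprocessing code reads (event-level fields plus a `members` list with `name` and `profile_url`); the real `raw_sports.json` may contain more fields.

```python
import pandas as pd

# Hypothetical records, shaped like what preprocess_data() expects (not real data).
records = [
    {
        "sport": "Archery", "event": "Men's Team", "event_type": "team",
        "rank": "1", "team_or_athlete_name": "Team A",
        "members": [
            {"name": "Athlete One", "profile_url": "https://example.com/one"},
            {"name": "Athlete Two", "profile_url": "https://example.com/two"},
        ],
    },
    {
        "sport": "Archery", "event": "Men's Individual", "event_type": "individual",
        "rank": "2", "team_or_athlete_name": "Athlete Three",
        "members": [
            {"name": "Athlete Three", "profile_url": "https://example.com/three"},
        ],
    },
]

# Event-level fields that are copied onto every member row.
meta_cols = ["sport", "event", "event_type", "rank", "team_or_athlete_name"]

# One output row per entry in 'members'; the meta columns are repeated per member.
flat = pd.json_normalize(records, record_path="members", meta=meta_cols)
print(flat[["name", "event", "rank"]])
```

Each team event therefore contributes one row per team member, which is why the number of participation rows is much larger than the number of events.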

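The log above reports 13978 participation rows and 5028 athlete profiles, but only 7331 unique participation entries. One possible explanation, not verified here, is that the name-based inner join drops participants who have no scraped profile, while athletes who share a name produce extra matches. The sketch below shows one way to inspect both cases using the `indicator` option of `DataFrame.merge`. The two small frames and their values are hypothetical stand-ins for `df_sports` and `df_atlet`.

```python
import pandas as pd

# Hypothetical stand-ins for the participation rows and the athlete table.
participations = pd.DataFrame({
    "name": ["athlete a", "athlete b", "athlete x"],
    "ID_pertandingan": [1, 1, 2],
})
athletes = pd.DataFrame({
    "ID_atlet": [1, 2, 3],
    "nama": ["athlete a", "athlete b", "athlete b"],  # "athlete b" appears twice
})

# A left join with indicator=True keeps unmatched rows and labels them in '_merge'.
merged = participations.merge(
    athletes, left_on="name", right_on="nama", how="left", indicator=True
)

# Participation rows whose name has no athlete profile (an inner join would drop these).
unmatched = merged[merged["_merge"] == "left_only"]

# Athlete names that are not unique, so a name-only join matches more than one profile.
ambiguous = athletes[athletes.duplicated("nama", keep=False)]

print(f"unmatched participation rows: {len(unmatched)}")
print(f"athlete rows with a non-unique name: {len(ambiguous)}")
```

If the ambiguous set turns out to be non-empty on the real data, joining on an additional key such as country or discipline may reduce false matches, at the cost of dropping rows where those fields disagree between the two sources.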