![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Introduction to Text Mining

### Homework 1

### Group Members: Deepak Malik, Noemi Lucchi, Tirdod Behbehani


Booking.com Scraping Pipeline

This script performs the following tasks:
  - Starts a headless Firefox browser with custom preferences.
  - Navigates to Booking.com, closes popups, rejects cookies, and searches for hotels in specified cities and date ranges.
  - Scrapes hotel data (Hotel Name, URL, Rating, Number of Ratings, Tiers, Price) from the results pages.
  - Concurrently fetches hotel descriptions.
  - Saves the scraped data into CSV files for each city/date combination.

Before running, update:
  - download_folder: a local directory for temporary downloads.
  - geko_path: the path to your geckodriver executable.
  - city_date_ranges: the list of search parameters.
  



### Step 1: Importing Required Modules

This script uses several Python modules to create a robust web scraping pipeline:

- **os**: Interacts with the operating system.
- **time**: Manages delays to ensure smooth automation.
- **re**: Handles pattern matching with regular expressions.
- **logging**: Records events and helps debug issues.
- **pandas**: Structures scraped data and exports it to CSV files.
- **requests**: Fetches HTML content from web pages.
- **BeautifulSoup (from bs4)**: Parses HTML content to extract information.
- **concurrent.futures**: Executes tasks concurrently to speed up operations.
    - **ThreadPoolExecutor**: Manages concurrent execution.
    - **as_completed**: Handles task completion.
- **selenium**: Automates browser actions and handles dynamic web content.
    - **webdriver**: Controls the browser.
    - **Service**: Manages the browser service.
    - **Options**: Configures browser settings.
    - **WebDriverWait**: Waits for elements to load.
    - **expected_conditions**: Defines conditions to wait for.
    - **By**: Selects the locator strategy (XPath or CSS selector).
    - **TimeoutException**: Handles timeouts.
    - **StaleElementReferenceException**: Manages stale elements.

Selenium drives the browser for the dynamic search pages, while `requests` and BeautifulSoup fetch and parse the static hotel pages; pandas writes the results to CSV.

In [1]:
import os
import time
import re
import logging
import pandas as pd
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException


### Step 2: Logging Configuration

The logging configuration sets up a centralized logging system for the script. By calling `logging.basicConfig()`, we define the minimum level of log messages to capture (in this case, `INFO`) and specify a format that includes the timestamp, log level, and message. This setup is critical for debugging and monitoring the scraping process, as it provides real-time feedback on the script’s execution and helps trace any issues that arise.

In [2]:
# -----------------------------------------------------------------------------
# Logging configuration
# -----------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s'
)
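
To make the format string concrete, here is a minimal sketch that renders one record with the same format into an in-memory buffer. The helper name `format_demo` and the logger name are illustrative assumptions, not part of the original script.

```python
import io
import logging

def format_demo(message: str) -> str:
    """Render one INFO record with the notebook's format string (hypothetical helper)."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(message)s'))
    logger = logging.getLogger("format_demo")  # separate logger, so the root config is untouched
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()

line = format_demo("Search results loaded.")
print(line)  # timestamp, then "[INFO] Search results loaded."
```

The timestamp uses the default `asctime` layout (date, time, milliseconds), which is the shape of the log lines at the end of this notebook.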

### Step 3: Browser Configuration and Initialization

Here we define two essential functions for configuring and launching a Firefox browser instance tailored for web scraping:

1. **`ffx_preferences` Function**:
    - **Purpose**: Sets up a Firefox profile with custom download settings and an option to run in headless mode (operates without a graphical user interface to boost performance).
    - **Configurations**:
        - **Download Directory**: Specifies where files should be saved automatically.
        - **File Types**: Lists the MIME types to save without prompting.
        - **PDF Viewer**: Optionally disables the built-in PDF viewer for direct downloads.

2. **`start_up` Function**:
    - **Purpose**: Ensures the download directory exists, retrieves the custom preferences, and starts the Firefox browser using the provided geckodriver path.
    - **Process**:
        - **Directory Check**: Ensures the specified download directory exists.
        - **Browser Launch**: Starts Firefox with the custom settings.
        - **Navigation**: Navigates to a specified URL and includes a brief pause to allow the page to load fully.

Together, these functions establish a controlled, automated browser environment essential for reliable web scraping.

In [3]:
# -----------------------------------------------------------------------------
# Browser Preferences and Startup
# -----------------------------------------------------------------------------
def ffx_preferences(download_folder: str, download: bool = False, headless: bool = True) -> Options:
    """
    Configure Firefox preferences including download settings and headless mode.
    """
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.dir", download_folder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference(
        "browser.helperApps.neverAsk.saveToDisk",
        "application/msword,application/rtf,application/csv,text/csv,image/png,image/jpeg,application/pdf,text/html,text/plain,application/octet-stream"
    )
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
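    if headless:
        # Run without a GUI for speed. The `headless` attribute was deprecated
        # in Selenium 4, so pass the Firefox command-line flag instead.
        options.add_argument("-headless")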
    return options

def start_up(link: str, download_folder: str, geko_path: str, download: bool = False, headless: bool = True) -> webdriver.Firefox:
    """
    Start Firefox with the custom preferences and navigate to the given link.
    """
    os.makedirs(download_folder, exist_ok=True)
    options = ffx_preferences(download_folder, download, headless)
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    browser.get(link)
    time.sleep(2)  # Allow a short time for the page to load.
    return browser
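
Since `start_up` depends on a valid `geko_path`, a pre-flight check can surface a wrong path before Selenium raises a less obvious error. The sketch below is one possible approach using only the standard library; `resolve_geckodriver` is a hypothetical helper and the fallback to `PATH` is an assumption about how you might want it to behave.

```python
import os
import shutil

def resolve_geckodriver(geko_path: str) -> str:
    """Return a usable geckodriver path or raise FileNotFoundError.

    Hypothetical helper, not part of the original script.
    """
    # Accept the configured path if it is an executable file.
    if os.path.isfile(geko_path) and os.access(geko_path, os.X_OK):
        return geko_path
    # Otherwise fall back to any `geckodriver` found on PATH.
    found = shutil.which("geckodriver")
    if found:
        return found
    raise FileNotFoundError(f"geckodriver not found at {geko_path!r} or on PATH")
```

It could be called once at the top of `main`, for example `geko_path = resolve_geckodriver(geko_path)`, before any browser is started.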



### Step 4: Handling Genius Popup

The `close_genius_popup` function ensures smooth automation by managing the Genius popup that may appear on the Booking.com website. Here's how it works:

1. **Popup Detection**:
    - Uses `WebDriverWait` to wait up to 3 seconds for the Genius popup's close button to become clickable.
    - Identifies the button using its XPath (`path_genius`).

2. **Popup Closure**:
    - If the button is found, clicks it to dismiss the popup.
    - Pauses briefly to ensure the action is registered.

3. **Logging**:
    - Logs the detection and closure of the popup.
    - If no popup is detected within the timeout, logs that no popup was found.

This function helps maintain the flow of automated browsing by preventing interruptions from unexpected popups.

In [4]:
# -----------------------------------------------------------------------------
# Popup and Cookie Handling
# -----------------------------------------------------------------------------
def close_genius_popup(browser: webdriver.Firefox, path_genius: str):
    """
    Close the Genius popup if it appears.
    """
    try:
        logging.info("Checking for Genius popup...")
        genius_button = WebDriverWait(browser, 3).until(
            EC.element_to_be_clickable((By.XPATH, path_genius))
        )
        logging.info("Genius popup detected; closing it.")
        genius_button.click()
        time.sleep(1)
    except TimeoutException:
        logging.info("No Genius popup detected.")


### Step 5: Searching for City and Dates with `search_city_and_dates`

1. **Load Homepage**: Open the Booking.com homepage and wait for the page to stabilize.
2. **Dismiss Pop-ups**: Close any pop-ups and cookie consent banners using explicit waits.
3. **Enter City Name**: Type the desired city name into the search field.
4. **Open Calendar**: Click to open the calendar widget.
5. **Select Dates**: Use the `select_date` helper function to:
   - Scroll through the calendar.
   - Click the appropriate check-in and check-out date elements.
   - Move to the next month if necessary.
6. **Check Search Parameters**: If the city input, the calendar, or either date cannot be set, log an error and stop before searching.
7. **Click Search Button**: Click the search button to initiate the search.
8. **Wait for Results**: Wait for the hotel listings to load and appear on the page.

In [5]:
# -----------------------------------------------------------------------------
# Search for City and Dates
# -----------------------------------------------------------------------------
def search_city_and_dates(
    browser: webdriver.Firefox,
    city: str,
    from_date: str,
    to_date: str,
    path_cookies: str = '//*[@id="onetrust-reject-all-handler"]',
    calendar_button_css: str = 'button.ebbedaf8ac:nth-child(2) > span:nth-child(1)',
    path_date_selection: str = '//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa ef091eb985"]',
    path_load_dates: str = 'button[aria-label="Mes siguiente"]',
    path_search: str = '//div[@id="indexsearch"]//div[@class="ffb9c3d6a3 b3b8f00b52 c9a7790c31 e691439f9a"]//button[@class="a83ed08757 c21c56c305 a4c1805887 f671049264 a2abacf76b c082d89982 cceeb8986b b9fd3c6b3c"]',
    path_results: str = '//div[contains(@class,"c82435a4b8") and contains(@class,"a178069f51")]',
    path_genius: str = '//div[@class="f0c216ee26 c676dd76fe b5018b639f"]//button[@class="a83ed08757 c21c56c305 f38b6daa18 d691166b09 ab98298258 f4552b6561"]'
) -> None:
    """
    Perform the search by:
      - Navigating to the Booking.com homepage.
      - Closing popups and rejecting cookies.
      - Entering the city.
      - Opening the calendar and selecting check-in and check-out dates.
      - Clicking the search button.
    """
    booking_home = "https://www.booking.com/index.es.html"
    browser.get(booking_home)
    time.sleep(2)

    # Close any popups.
    close_genius_popup(browser, path_genius)

    # Reject cookies if present.
    try:
        cookie_button = WebDriverWait(browser, 5).until(
            EC.element_to_be_clickable((By.XPATH, path_cookies))
        )
        logging.info("Rejecting cookies...")
        cookie_button.click()
        time.sleep(1)
    except TimeoutException:
        logging.info("No cookie popup found.")

    # Enter the city.
    try:
        logging.info(f"Entering city: {city} ...")
        city_input = WebDriverWait(browser, 5).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id=":rh:"]'))
        )
        city_input.clear()
        city_input.send_keys(city)
        time.sleep(1)
    except TimeoutException:
        logging.error("City input box not found.")
        return

    # Open the calendar.
    try:
        calendar_button = WebDriverWait(browser, 5).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, calendar_button_css))
        )
        logging.info("Opening calendar...")
        calendar_button.click()
        time.sleep(1)
    except TimeoutException:
        logging.error("Calendar button not found.")
        return

    # Helper function: Select a date.
    def select_date(target_date: str) -> bool:
        while True:
            try:
                dates = WebDriverWait(browser, 5).until(
                    EC.presence_of_all_elements_located((By.XPATH, path_date_selection))
                )
                for date in dates:
                    if date.get_attribute("data-date") == target_date:
                        logging.info(f"Selecting date: {target_date}")
                        browser.execute_script("arguments[0].scrollIntoView(true);", date)
                        browser.execute_script("arguments[0].click();", date)
                        time.sleep(1)
                        return True
                logging.info(f"Date {target_date} not found on current view. Clicking 'Next Month'.")
                load_more_button = WebDriverWait(browser, 5).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, path_load_dates))
                )
                browser.execute_script("arguments[0].click();", load_more_button)
                time.sleep(1)
            except TimeoutException:
                logging.error(f"Failed to find date {target_date}.")
                return False

    if not select_date(from_date):
        logging.error("Check-in date selection failed.")
        return
    if not select_date(to_date):
        logging.error("Check-out date selection failed.")
        return

    # Click the search button.
    try:
        search_button = WebDriverWait(browser, 5).until(
            EC.element_to_be_clickable((By.XPATH, path_search))
        )
        logging.info("Clicking search button...")
        browser.execute_script("arguments[0].scrollIntoView(true);", search_button)
        search_button.click()
    except TimeoutException:
        logging.error("Search button not clickable.")
        return

    # Wait for results to load.
    try:
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, path_results))
        )
        logging.info("Search results loaded.")
    except TimeoutException:
        logging.warning("Search results may not have loaded properly.")

    close_genius_popup(browser, path_genius)
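
Because `select_date` compares each cell's `data-date` attribute with the target string, the dates must be ISO-formatted (`YYYY-MM-DD`). A malformed date would otherwise only fail after the calendar has been paged through. The following sketch shows one way to validate the inputs up front; `validate_stay` is a hypothetical helper, not part of the original script.

```python
from datetime import date

def validate_stay(from_date: str, to_date: str) -> tuple:
    """Parse ISO dates and check that check-out falls after check-in."""
    check_in = date.fromisoformat(from_date)   # raises ValueError on bad input
    check_out = date.fromisoformat(to_date)
    if check_out <= check_in:
        raise ValueError(f"check-out {to_date} must be after check-in {from_date}")
    return check_in, check_out

check_in, check_out = validate_stay("2025-02-24", "2025-02-27")
print((check_out - check_in).days)  # number of nights
```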

### Step 5: Scraping Hotel Data from Search Results

1. **Parse Hotel Listings**:
    - Use Selenium to locate hotel listing blocks on the page.
    - Extract HTML and use BeautifulSoup to parse key details (hotel name, URL, rating, number of ratings, tiers, and price).
    - Derive a "BaseURL" for each listing to help in deduplication.

2. **Scroll Down**:
    - Programmatically scroll the webpage to load additional hotel listings.
    - Check if new content appears by comparing the document's height before and after scrolling.

3. **Scrape All Results**:
    - Iteratively gather all hotel listings by parsing the page, scrolling down, and clicking the "Load More" button when available.
    - Maintain a set of seen URLs to avoid duplicates.
    - Continue the process until no new hotels are loaded or a maximum number of attempts is reached.
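
The deduplication step can be sketched in isolation. The URLs below are made-up placeholders on `example.com`, and `base_url` and `merge_new` are hypothetical helper names. Unlike the loop in the script, this variant also removes duplicates that occur within a single batch.

```python
def base_url(url: str) -> str:
    """Drop the query string so the same hotel maps to the same key."""
    return url.split("?", 1)[0]

def merge_new(all_hotels: list, seen_urls: set, current_hotels: list) -> int:
    """Append unseen hotels to `all_hotels` and return how many were added."""
    added = 0
    for hotel in current_hotels:
        key = base_url(hotel["URL"])
        if key not in seen_urls:
            seen_urls.add(key)
            all_hotels.append(hotel)
            added += 1
    return added

all_hotels, seen_urls = [], set()
first_pass = [
    {"Hotel Name": "A", "URL": "https://example.com/hotel/a.html?checkin=2025-02-24"},
    {"Hotel Name": "B", "URL": "https://example.com/hotel/b.html?checkin=2025-02-24"},
]
# After scrolling, the page is parsed again: A reappears with a different query string.
second_pass = [
    {"Hotel Name": "A", "URL": "https://example.com/hotel/a.html?checkin=2025-02-24&page=2"},
    {"Hotel Name": "C", "URL": "https://example.com/hotel/c.html?checkin=2025-02-24"},
]
added_first = merge_new(all_hotels, seen_urls, first_pass)
added_second = merge_new(all_hotels, seen_urls, second_pass)
print(added_first, added_second, len(all_hotels))
```

Keying on the URL without its query string matters because each parse re-reads every listing on the page, and tracking parameters may differ between passes.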

In [6]:
# -----------------------------------------------------------------------------
# Scraping Hotel Data from Results
# -----------------------------------------------------------------------------
def parse_hotels_on_page(browser: webdriver.Firefox) -> list:
    """
    Extract hotel details from the current page using BeautifulSoup.
    """
    hotel_blocks = browser.find_elements(By.XPATH, '//div[contains(@class,"c82435a4b8") and contains(@class,"a178069f51")]')
    hotels_data = []
    for block in hotel_blocks:
        try:
            block_html = block.get_attribute("outerHTML")
        except StaleElementReferenceException:
            continue
        soup = BeautifulSoup(block_html, "html.parser")
        def text_or_na(selector: str) -> str:
            """Return the stripped text of the first match, or "NA" if absent."""
            element = soup.select_one(selector)
            return element.get_text(strip=True) if element else "NA"

        link = soup.select_one('a.a78ca197d0')
        data = {
            "Hotel Name": text_or_na('div.f6431b446c.a15b38c233'),
            "URL": link.get("href", "NA") if link else "NA",
        }
        # Extract Rating (the Spanish site uses a decimal comma, e.g. "8,5").
        rating_match = re.search(r'\d+,\d+', text_or_na('div.a3b8729ab1.d86cee9b25'))
        data["Rating"] = rating_match.group() if rating_match else "NA"
        data["Number of Ratings"] = text_or_na('div.abf093bdfe.f45d8e4c32.d935416c47')
        data["Tiers"] = text_or_na('div.a3b8729ab1.e6208ee469.cb2cbb3ccb')
        data["Price"] = text_or_na('[data-testid="price-and-discounted-price"]')
        # Create a BaseURL for deduplication.
        data["BaseURL"] = data["URL"].split("?", 1)[0]
        hotels_data.append(data)
    return hotels_data

def scroll_down(browser: webdriver.Firefox, pause_time: float = 1.5) -> bool:
    """
    Scroll down the page and return True if new content is loaded.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause_time)
    new_height = browser.execute_script("return document.body.scrollHeight")
    return new_height > last_height

def scrape_all_results(browser: webdriver.Firefox, max_attempts: int = 100) -> list:
    """
    Iteratively scroll and/or click a "Load More" button to gather all hotel results.
    """
    all_hotels = []
    seen_urls = set()
    attempts = 0
    HOTEL_BLOCK_XPATH = '//div[contains(@class,"c82435a4b8") and contains(@class,"a178069f51")]'

    while attempts < max_attempts:
        current_hotels = parse_hotels_on_page(browser)
        new_hotels = [h for h in current_hotels if h["BaseURL"] not in seen_urls]
        for hotel in new_hotels:
            seen_urls.add(hotel["BaseURL"])
            all_hotels.append(hotel)
        logging.info(f"Iteration {attempts + 1}: {len(new_hotels)} new hotels (total: {len(all_hotels)})")
        
        old_count = len(browser.find_elements(By.XPATH, HOTEL_BLOCK_XPATH))
        if scroll_down(browser, pause_time=1.5):
            try:
                WebDriverWait(browser, 5).until(lambda d: len(d.find_elements(By.XPATH, HOTEL_BLOCK_XPATH)) > old_count)
                logging.info("Additional hotels loaded after scrolling.")
                attempts += 1
                continue
            except TimeoutException:
                logging.info("No new hotels loaded after scrolling.")
        
        try:
            load_more_button = WebDriverWait(browser, 5).until(
                EC.element_to_be_clickable((By.XPATH, "//button[.//span[contains(text(),'Cargar más resultados')]]"))
            )
            logging.info("Clicking 'Cargar más resultados' button...")
            browser.execute_script("arguments[0].scrollIntoView(true);", load_more_button)
            load_more_button.click()
            time.sleep(1.5)
            attempts += 1
        except TimeoutException:
            logging.info("No 'Cargar más resultados' button found; stopping iteration.")
            break

    logging.info(f"Total unique hotels scraped: {len(all_hotels)}")
    return all_hotels

### Step 6: Adding Hotel Descriptions

Here we enrich each hotel record with the description taken from the hotel's own page.

1. **`fetch_description` function**:
    - Takes a single hotel record.
    - Uses an HTTP GET request to get the hotel's webpage.
    - Parses the HTML to find and extract the description from a specific paragraph element.
    - Returns "NA" if the URL is missing or the description paragraph is not found, and an `"Error: ..."` string if the request fails.

2. **`add_descriptions_to_hotels` function**:
    - Uses Python's `ThreadPoolExecutor` (up to 20 worker threads) to issue the HTTP requests concurrently.
    - Because the requests are I/O-bound, overlapping them is much faster than fetching pages one at a time.
    - Submits one task per hotel and writes the description into the hotel record as each task completes.
    - Logs progress periodically.

The descriptions add free-text detail to the dataset, and fetching them concurrently cuts the total waiting time compared with sequential requests.

In [7]:
# -----------------------------------------------------------------------------
# Concurrently Fetching Hotel Descriptions
# -----------------------------------------------------------------------------
def fetch_description(hotel: dict, headers: dict) -> str:
    """
    Fetch the hotel's page and extract a description.
    """
    url = hotel.get("URL", "")
    if not url or url == "NA":
        return "NA"
    try:
        response = requests.get(url, headers=headers, timeout=8)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        desc_tag = soup.find('p', class_='a53cbfa6de b3efd73f69')
        return desc_tag.get_text(strip=True) if desc_tag else "NA"
    except Exception as e:
        return f"Error: {str(e)}"

def add_descriptions_to_hotels(hotels_list: list) -> list:
    """
    Concurrently fetch the description for each hotel and add it to the hotel data.
    """
    headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/75.0.3770.142 Safari/537.36')
    }
    total = len(hotels_list)
    logging.info(f"Fetching descriptions for {total} hotels concurrently...")
    
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_to_hotel = {
            executor.submit(fetch_description, hotel, headers): hotel for hotel in hotels_list
        }
        for idx, future in enumerate(as_completed(future_to_hotel), 1):
            hotel = future_to_hotel[future]
            hotel["Description"] = future.result()
            if idx % 25 == 0:
                logging.info(f"Processed {idx} / {total} hotel descriptions.")
    
    logging.info("Finished fetching descriptions for all hotels.")
    return hotels_list
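
The future-to-record mapping can be tried without any network access. In this sketch `fake_fetch` is a hypothetical stand-in for `fetch_description`, and the records are invented sample data.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(record: dict) -> str:
    """Stand-in for fetch_description: no HTTP request is made."""
    if record["URL"] == "NA":
        return "NA"
    return f"Description of {record['Hotel Name']}"

def enrich(records: list, max_workers: int = 4) -> list:
    """Attach a Description to every record, filling them in as tasks finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its record so results can arrive in any order.
        future_to_record = {executor.submit(fake_fetch, r): r for r in records}
        for future in as_completed(future_to_record):
            record = future_to_record[future]
            try:
                record["Description"] = future.result()
            except Exception as exc:  # keep going if one task fails
                record["Description"] = f"Error: {exc}"
    return records

records = [
    {"Hotel Name": "A", "URL": "https://example.com/a"},
    {"Hotel Name": "B", "URL": "NA"},
]
enriched = enrich(records)
print([r["Description"] for r in enriched])
```

Tasks may complete in any order, but each result is written into the dictionary it belongs to, so the list keeps its original order.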

### Step 7: Main Function Overview

The `main` function orchestrates the entire scraping pipeline with the following steps:

1. **Configuration Settings**:
    - **Download Folder**: Specifies the local directory for temporary downloads.
    - **Geckodriver Path**: Path to the geckodriver executable.
    - **Headless Mode**: Option to run the browser in headless mode for performance.
    - **Booking.com URL**: The URL of the Booking.com homepage.

2. **Search Parameters**:
    - Defines city names, date ranges, and output CSV filenames.

3. **Execution Flow**:
    - Logs the start of a new search.
    - Initializes a Firefox browser with custom settings using `start_up`.
    - Performs the hotel search with `search_city_and_dates`.
    - Collects hotel listings using `scrape_all_results`.
    - Enriches each listing with descriptions via `add_descriptions_to_hotels`.
    - Converts the data into a pandas DataFrame and saves it as a CSV file.

4. **Error Handling**:
    - Logs any issues encountered during the scraping process.
    - Ensures the browser is properly closed in a `finally` block.

5. **Completion**:
    - Logs the completion of all searches.
    - The `if __name__ == "__main__":` block ensures the `main` function runs when the script is executed directly.
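
The search parameters are written out by hand in `main`. As an alternative sketch, the same four-tuples could be generated from a list of cities and a list of stays. `build_searches` is a hypothetical helper, and its filename pattern is an assumption that differs from the hand-written names used in the script.

```python
from itertools import product

def build_searches(cities: list, stays: list) -> list:
    """Return (city, from_date, to_date, filename) tuples for every city/stay pair."""
    searches = []
    for city, (from_date, to_date) in product(cities, stays):
        # Assumed naming scheme: lowercase city plus the ISO dates.
        filename = f"{city.lower().replace(' ', '_')}_{from_date}_{to_date}.csv"
        searches.append((city, from_date, to_date, filename))
    return searches

city_date_ranges = build_searches(
    ["Barcelona", "Madrid"],
    [("2025-02-24", "2025-02-27"), ("2025-03-03", "2025-03-06")],
)
for entry in city_date_ranges:
    print(entry)
```

The result has the same shape and ordering as the list in `main`, so the loop over `city_date_ranges` would work unchanged.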

In [8]:
# -----------------------------------------------------------------------------
# Main Pipeline
# -----------------------------------------------------------------------------
def main():
    # Update these paths according to your system.
    download_folder = "/"     
    geko_path = "/Users/noemilucchi/Desktop/Term2/Text Mining/geckodriver"
    headless = True                               
    booking_url = "https://www.booking.com/index.es.html"

    # Define the city, date ranges, and output CSV filenames.
    city_date_ranges = [
        ("Barcelona", "2025-02-24", "2025-02-27", "barcelona_feb24_feb27.csv"),
        ("Barcelona", "2025-03-03", "2025-03-06", "barcelona_march3_march6.csv"),
        ("Madrid", "2025-02-24", "2025-02-27", "madrid_feb24_feb27.csv"),
        ("Madrid", "2025-03-03", "2025-03-06", "madrid_march3_march6.csv")
    ]

    for city, from_date, to_date, filename in city_date_ranges:
        logging.info(f"\nStarting search for hotels in {city} from {from_date} to {to_date}...")
        browser = start_up(booking_url, download_folder, geko_path, download=False, headless=headless)
        try:
            search_city_and_dates(browser, city, from_date, to_date)
            hotels_data = scrape_all_results(browser, max_attempts=100)
            hotels_data = add_descriptions_to_hotels(hotels_data)
            df_final = pd.DataFrame(hotels_data)
            df_final.to_csv(filename, index=False)
            logging.info(f"Saved {len(df_final)} hotels to {filename}")
        except Exception as exc:
            logging.error(f"Error for {city} ({from_date} to {to_date}): {exc}")
        finally:
            browser.quit()
    logging.info("All searches completed!")

if __name__ == "__main__":
    main()


2025-02-05 09:07:55,671 [INFO] 
Starting search for hotels in Barcelona from 2025-02-24 to 2025-02-27...
2025-02-05 09:08:04,654 [INFO] Checking for Genius popup...
2025-02-05 09:08:07,724 [INFO] No Genius popup detected.
2025-02-05 09:08:07,740 [INFO] Rejecting cookies...
2025-02-05 09:08:08,971 [INFO] Entering city: Barcelona ...
2025-02-05 09:08:10,040 [INFO] Opening calendar...
2025-02-05 09:08:11,299 [INFO] Selecting date: 2025-02-24
2025-02-05 09:08:12,351 [INFO] Selecting date: 2025-02-27
2025-02-05 09:08:13,373 [INFO] Clicking search button...
2025-02-05 09:08:15,910 [INFO] Search results loaded.
2025-02-05 09:08:15,910 [INFO] Checking for Genius popup...
2025-02-05 09:08:15,921 [INFO] Genius popup detected; closing it.
2025-02-05 09:08:17,272 [INFO] Iteration 1: 28 new hotels (total: 28)
2025-02-05 09:08:18,795 [INFO] Additional hotels loaded after scrolling.
2025-02-05 09:08:19,155 [INFO] Iteration 2: 25 new hotels (total: 53)
2025-02-05 09:08:20,684 [INFO] Clicking 'Cargar m