Project Idea:
* This project scrapes data from Booking.com, a website where travellers book hotels in cities worldwide.

* The scraped data covers hotel attributes such as name, room type and location. We then use it to train a machine learning model that learns from these features and predicts hotel prices.

1. Import the necessary libraries

In [24]:
#Install libraries
!pip install selectorlib requests beautifulsoup4 pandas scikit-learn tqdm




In [25]:
#Import the libraries for the project
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
from selectorlib import Extractor
import time
import random
from tqdm import tqdm
import json
import os
from urllib.parse import urljoin, urlencode, urlparse, parse_qs
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

2. Create the YAML extractor for SelectorLib

* A YAML file (with the .yaml or .yml extension) stores data in a human-readable serialization format. It serves the same purpose as JSON or XML but is simpler and easier to read. SelectorLib reads its field names and CSS selectors from such a file.
* Note: Booking.com HTML changes frequently. You will likely need to tweak selectors for your target region / page layout. Use the browser inspector to verify CSS paths.


In [26]:
# write the YAML file
selectors_yaml = """
# Selectors for Booking.com - NOTE: these are approximate and may need adjusting.
# Every top-level key is one field, so the parsing code can read data['hotels'],
# data['amenities'], data['room_types'], etc. directly.
# Field types follow SelectorLib's documented names (Text, Attribute, ...);
# `multiple: true` asks for every match instead of only the first.

# Selectors for the search (list) page
hotels:
  css: "div[data-testid='property-card']"
  multiple: true
  type: Text
  children:
    name:
      css: "div[data-testid='title']"
      type: Text
    url:
      css: "a[data-testid='title-link']"
      type: Attribute
      attribute: href
    rating:
      css: "div[aria-label*='Scored']"
      type: Text
    price:
      css: "span[data-testid='price-and-discounted-price']"
      type: Text
    review_count:
      css: "div[data-testid='reviews-number']"
      type: Text
    location_snippet:
      css: "span[data-testid='distance']"
      type: Text

# Selectors for hotel details page
name:
  css: "h2[data-testid='title']"
  type: Text
address:
  css: "span[data-testid='address']"
  type: Text
overall_review_score:
  css: "div[data-testid='review-score-component'] div[aria-hidden='false']"
  type: Text
review_count:
  css: "div[data-testid='review-score-component'] span"
  type: Text
star_rating:
  css: "span[class*='bd73d']"    # may need updating
  type: Text
amenities:
  css: "div[data-testid='hotel-facilities'] li"
  multiple: true
  type: Text
  children:
    amenity:
      css: "div"
      type: Text
room_types:
  css: "table[class*='hprt-table'] tr"
  multiple: true
  type: Text
  children:
    room_type:
      css: "td[class*='hprt-roomtype']"
      type: Text
price_from:
  css: "div[data-testid='price-and-discounted-price']"
  type: Text
latitude:
  css: "meta[property='booking:location:latitude']"
  type: Attribute
  attribute: content
longitude:
  css: "meta[property='booking:location:longitude']"
  type: Attribute
  attribute: content
"""
with open("booking_selectors.yml", "w", encoding="utf-8") as f:
    f.write(selectors_yaml)
print("Wrote booking_selectors.yml")


Wrote booking_selectors.yml


3. Basic HTTP session with polite headers, retry and delay

In [27]:
SESSION = requests.Session()
# rotate between a few user agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36",
]
HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": random.choice(USER_AGENTS),
}

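def polite_get(url, params=None, max_retries=3, sleep_range=(2,5)):
    """GET a URL with a rotating User-Agent; return the HTML text or raise after max_retries."""
    for attempt in range(max_retries):
        try:
            # pick a fresh User-Agent for every attempt
            headers = HEADERS.copy()
            headers["User-Agent"] = random.choice(USER_AGENTS)
            resp = SESSION.get(url, params=params, headers=headers, timeout=15)
            if resp.status_code == 200:
                # small random delay so successive requests are spaced out
                time.sleep(random.uniform(*sleep_range))
                return resp.text
            # non-200 response: wait a little longer on each attempt
            time.sleep(2 + attempt)
        except requests.RequestException:
            # network error or timeout: same linear backoff, then retry
            time.sleep(2 + attempt)
    raise Exception(f"Failed to GET {url} after {max_retries} tries")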
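
polite_get waits 2 + attempt seconds after a failed attempt, so the delays grow linearly (2, 3, 4 seconds). A common alternative is exponential backoff with random jitter. The sketch below only computes such a schedule; the helper name `backoff_delays` and its parameters are hypothetical and not part of the notebook.

```python
import random

def backoff_delays(max_retries=3, base=2.0, factor=2.0, jitter=0.5, rng=None):
    """Return one wait time in seconds per retry: base * factor**attempt plus jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = base * (factor ** attempt)
        # jitter spreads retries out so they do not all arrive at the same moment
        delays.append(delay + rng.uniform(0, jitter))
    return delays

# With jitter disabled the schedule is deterministic.
print(backoff_delays(max_retries=3, jitter=0.0))
```

Each value could replace the `time.sleep(2 + attempt)` call for the matching attempt.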


4. Function to extract hotels from a search/list page

In [28]:
from selectorlib import Extractor
extractor = Extractor.from_yaml_file('booking_selectors.yml')

def parse_search_page(html):
    data = extractor.extract(html)
    # data['hotels'] should be a list of hotel blocks
    hotels = data.get('hotels') or []
    # make relative hotel URLs absolute; all other fields are returned as scraped
    normalized = []
    for h in hotels:
        # ensure URL is absolute
        url = h.get('url') or ''
        if url and url.startswith('/'):
            url = urljoin("https://www.booking.com", url)
        h['url'] = url
        normalized.append(h)
    return normalized
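
parse_search_page returns price, rating and review_count as raw strings. A model needs numbers, so a later cleaning step could look roughly like the sketch below. The helper `parse_number` is hypothetical, and it assumes a comma is a thousands separator and a dot is the decimal point, which will not hold for every locale Booking.com serves.

```python
import re

def parse_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if not text:
        return None
    # digits with optional thousands commas, then an optional decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", ""))

# Made-up sample strings, shaped like what the selectors might return.
for raw in ["KES 12,500", "Scored 8.7", "1,234 reviews", None]:
    print(repr(raw), "->", parse_number(raw))
```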


5. Function to parse detail page (hotel page)

In [29]:
def parse_detail_page(html):
    data = extractor.extract(html)  # the same extractor holds the detail rules too
    # drop the list-page key so it does not end up as a column in the results
    detail = {k: v for k, v in data.items() if k != 'hotels'}
    # Clean/normalize a few fields
    if 'amenities' in detail and isinstance(detail['amenities'], list):
        # items may be dicts like {'amenity': ...} because the YAML rule has children
        names = []
        for a in detail['amenities']:
            value = a.get('amenity') if isinstance(a, dict) else a
            names.append(str(value or '').strip())
        detail['amenities'] = [n for n in names if n]
    if 'room_types' in detail and isinstance(detail['room_types'], list):
        # Extract text only; a row without a room type yields None, hence the `or ''`
        rooms = []
        for r in detail['room_types']:
            value = r.get('room_type') if isinstance(r, dict) else r
            rooms.append(str(value or '').strip())
        detail['room_types'] = [r for r in rooms if r]
    return detail
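
The project idea is to feed hotel features into a model, and a list of amenity names cannot be used directly for that. One possible encoding is a 0/1 indicator per amenity, sketched below. The function `amenity_features` and the sample vocabulary are illustrative assumptions, not values taken from Booking.com.

```python
def amenity_features(amenities, vocabulary):
    """Map a hotel's amenity list to 0/1 indicator columns, one per vocabulary entry."""
    # compare case-insensitively and ignore surrounding whitespace
    present = {a.strip().lower() for a in amenities or []}
    return {
        "has_" + name.lower().replace(" ", "_"): int(name.lower() in present)
        for name in vocabulary
    }

vocab = ["Free WiFi", "Parking", "Swimming pool"]  # hypothetical vocabulary
print(amenity_features(["Free WiFi", " swimming pool "], vocab))
```

The resulting dictionaries can be merged into each hotel record before building the DataFrame.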


6. Putting it together: scrape N pages of search results for a city

In [30]:
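def build_search_url(city, checkin=None, checkout=None, page=0):
    # Basic Booking search URL with query param ss
    base = "https://www.booking.com/searchresults.html"
    params = {"ss": city, "offset": page*25}  # offset assumes 25 results per page
    # optional stay dates as strings; the parameter names and the
    # YYYY-MM-DD format are assumptions, verify them in the browser
    if checkin:
        params["checkin"] = checkin
    if checkout:
        params["checkout"] = checkout
    return base + "?" + urlencode(params)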

def scrape_city(city, pages=2, max_hotels=None):
    results = []
    for p in range(pages):
        url = build_search_url(city, page=p)
        print("Fetching list page:", url)
        html = polite_get(url)
        hotels = parse_search_page(html)
        print(f"Found {len(hotels)} hotels on page {p}")
        for h in hotels:
            # attempt to fetch hotel detail page if URL exists
            hotel_data = dict(h)  # start with list-level fields
            detail_url = h.get('url')
            if detail_url:
                try:
                    detail_html = polite_get(detail_url)
                    detail_parsed = parse_detail_page(detail_html)
                    hotel_data.update(detail_parsed)
                except Exception as e:
                    print("Failed to fetch detail:", e)
            results.append(hotel_data)
            if max_hotels and len(results) >= max_hotels:
                return results
        # tiny pause between pages
        time.sleep(random.uniform(5,10))
    return results

# Example run (small)
# hotels_data = scrape_city("Nairobi", pages=1, max_hotels=10)
# pd.DataFrame(hotels_data).head()
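
To see how the `offset` parameter pages through the results, the sketch below builds the list-page URLs without sending any request. The helper `search_urls` is hypothetical, and the page size of 25 is the same assumption the notebook makes in build_search_url.

```python
from urllib.parse import urlencode

def search_urls(city, pages, page_size=25):
    """Build one search URL per result page, stepping the offset by page_size."""
    base = "https://www.booking.com/searchresults.html"
    return [
        base + "?" + urlencode({"ss": city, "offset": p * page_size})
        for p in range(pages)
    ]

for url in search_urls("Nairobi", pages=3):
    print(url)
```

urlencode also escapes the city name, so a value with spaces is safe to pass in.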


7. Save results to CSV

In [32]:
def save_results(data, filename):
    """Saves a list of dictionaries to a CSV file."""
    if not data:
        print("No data to save.")
        return None
    df = pd.DataFrame(data)
    df.to_csv(f"{filename}.csv", index=False)
    print(f"Saved {len(data)} records to {filename}.csv")
    return df

# Scrape some data (example: 1 page of results for Nairobi)
hotels_data = scrape_city("Nairobi", pages=1)

# Save the data to a CSV file
df = save_results(hotels_data, "booking_nairobi_results")

# List files to show the CSV was created
!ls

Fetching list page: https://www.booking.com/searchresults.html?ss=Nairobi&offset=0


Exception: Failed to GET https://www.booking.com/searchresults.html?ss=Nairobi&offset=0 after 3 tries

 NOTE:
 * The code could not fetch the search page: all 3 attempts in polite_get failed, so it raised the exception above.
 * The most likely cause is that Booking.com is blocking the requests. polite_get does not record the status code or error of each attempt, so the output alone does not confirm this.
 * Websites often implement measures to prevent automated scraping.
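
Because polite_get discards the status code of each failed attempt, the exception does not say why the request failed. A small helper like the one below could turn a status code into a readable message for logging. The name `describe_status` and the hint texts are assumptions based on general HTTP semantics, not on anything Booking.com documents.

```python
from http import HTTPStatus

# Hypothetical hints; they describe what these codes usually mean in general.
HINTS = {
    403: "the server refused the request, as often happens to automated clients",
    429: "too many requests in a short time, so slow down and retry later",
    503: "service unavailable, which may be temporary",
}

def describe_status(status_code):
    """Return 'CODE Phrase', followed by a short hint when one is known."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # not a status code the standard library knows
        phrase = "Unknown status"
    hint = HINTS.get(status_code)
    return f"{status_code} {phrase}" + (f": {hint}" if hint else "")

for code in (200, 403, 429):
    print(describe_status(code))
```

Printing `describe_status(resp.status_code)` inside the retry loop would show whether the failures are refusals, rate limits or something else.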