# CAS ADS 24-25 - M6 Project
Author: **Marcel Grosjean**

### **/!\ The scraping part will not work properly on Google Colab /!\**

I started this project in Google Colab, but most of my requests from Colab to homegate.ch were blocked. I therefore ran everything locally, which works much better.

The purpose of this project is to scrape the entire homegate.ch website to collect all rental listings and generate statistics based on the data. The focus will exclusively be on rental apartments, excluding other types of rental listings such as parking spaces, houses, commercial properties, etc. Sales listings will not be considered.

We will only perform the scraping once due to the technical complexity involved. There may be instances where we do not obtain data for all municipalities, and the pricing in the listings may not accurately represent market averages, as we are only capturing data from a single point in time.

This approach will provide a snapshot of the rental market, but we must acknowledge its limitations regarding comprehensiveness and accuracy.

## 1. Project setup
First, I install the necessary packages.
The fake-useragent package is used to generate fake user agents.

In [None]:
%pip install fake-useragent geopandas matplotlib beautifulsoup4 bs4 redis openai selenium selenium-wire --break-system-packages > /dev/null

Import the libraries:

In [None]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import time
import math
import redis
import random
import json
import re
from fake_useragent import UserAgent
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import openai

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import TimeoutException

import subprocess



Let's create the helper functions we need:

In [None]:
# OpenAI API key
client = openai.OpenAI(api_key="")

# Set up Selenium with ChromeDriver
chrome_options = Options()
# Chrome must not run headless, otherwise homegate detects it
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("--no-startup-window")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--start-minimized")

# client for our redis database
redis_client = redis.Redis(host='localhost', port=6380, decode_responses=True)

ua = UserAgent()

# Sleeping 10 seconds before each request is enough to stay under Cloudflare's rate limit
def sleep(s = 10):
  time.sleep(s)

# This function is called for every request to generate new HTTP request headers
def generate_headers():
  # List of User-Agent strings to rotate
  user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
  ]
  headers = {
    "User-Agent": random.choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1", 
  }
  return headers

# function to create a requests session with retry logic
def create_requests_session_with_retries():
  session = requests.Session()
  retries = Retry(
    total = 5,
    backoff_factor = 1,
    status_forcelist = [500, 502, 503, 504],  # Retry on these status codes
    allowed_methods = ["GET"],
  )
  adapter = HTTPAdapter(max_retries=retries)
  session.mount("http://", adapter)
  session.mount("https://", adapter)
  return session

# Dictionary mapping of canton abbreviations to full names
def get_canton_name(abbreviation):
  canton_map = {
    'AG': 'Aargau', 'AR': 'Appenzell Ausserrhoden', 'AI': 'Appenzell Innerrhoden',
    'BL': 'Basel-Landschaft', 'BS': 'Basel-Stadt', 'BE': 'Bern', 'FR': 'Fribourg',
    'GE': 'Genève', 'GL': 'Glarus', 'GR': 'Graubünden', 'JU': 'Jura', 'LU': 'Lucerne',
    'NE': 'Neuchâtel', 'NW': 'Nidwalden', 'OW': 'Obwalden', 'SZ': 'Schwyz',
    'SH': 'Schaffhausen', 'SO': 'Solothurn', 'SG': 'Sankt Gallen', 'TG': 'Thurgau',
    'TI': 'Ticino', 'UR': 'Uri', 'VS': 'Valais', 'VD': 'Vaud', 'ZG': 'Zug', 'ZH': 'Zürich'
  }

  # Return the corresponding full canton name, or None if not found
  return canton_map.get(abbreviation.upper(), None)

# 2. Query homegate.ch
**You do not need to run chapter 2 in Google Colab. You can jump straight to chapter 3 and run all the code from there; all the data needed is provided in chapter 3.**
## 2.1 Query the cantons
Making requests to homegate.ch is quite challenging. It's not possible to do this on Google Colab because Google's IPs are blocked. As a result, I had to run the process on a local Jupyter notebook instance.

Additionally, homegate aggressively applies rate limiting through Cloudflare's proxy. If too many requests are made, Cloudflare blocks access for 12 hours.

After some trial and error, I found that waiting 10 seconds between each request prevents getting blocked.
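The fixed 10-second pause can also be expressed as a small throttle that only sleeps for the time still missing since the previous request. This is a minimal sketch, not the code used in this notebook: the `Throttle` class and its `clock` and `sleeper` parameters are hypothetical names introduced for illustration, and the 10-second default is simply the value found by trial and error above.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval=10.0, clock=time.monotonic, sleeper=time.sleep):
        # clock and sleeper are injectable (an assumption of this sketch)
        # so the class can be exercised without really waiting.
        self.min_interval = min_interval
        self._clock = clock
        self._sleeper = sleeper
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough so that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleeper(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each `session.get(...)` would then keep requests at least 10 seconds apart without adding delay when parsing has already taken that long.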

We begin by searching for all the links to the listings for each canton.

In [None]:
session = create_requests_session_with_retries()

cantons_url = "https://www.homegate.ch/rent/apartment/switzerland"
cantons_results = session.get(cantons_url, headers=generate_headers())
cantons_results.encoding = 'utf-8'

print(f"Get all the cantons from {cantons_url}")
print(f"Status code: {cantons_results.status_code}")

cantons_page = BeautifulSoup(cantons_results.text, 'html.parser')
cantons = []
for arr in cantons_page.select('[class^="row GeoDrillDownLocationsSection_spacer_"]'):
  # get the links to all the cantons
  main_div = arr.find()
  link_tag = main_div.find()

  # region name for homegate.ch
  region_name = link_tag['href'].split("/")[-1]

  # region full name
  canton = link_tag.text

  # we skip regions that are not Swiss cantons: the link text must contain the string "Canton"
  if "Canton" not in canton:
    continue

  # only keep the canton name
  canton = canton.split("Canton")[1].strip()

  cantons.append({
    "canton": canton, 
    "url": {
      "appartments": f"https://www.homegate.ch/rent/apartment/{region_name}/matching-list?ipd=true", 
      "houses": f"https://www.homegate.ch/rent/house/{region_name}/matching-list?ep=1&tab=list&ipd=true"
    }
  })

#print(f"NB Cantons: {len(cantons)}")
#print(f"First canton: {cantons[0]}")
for canton in cantons:
  print(f"{canton['canton']}")
  print(f"  - Apartments: {canton['url']['appartments']}")
  print(f"  - Houses: {canton['url']['houses']}")

## 2.2 Query the listings ids for each canton


In [None]:

start_time = time.time()
session = create_requests_session_with_retries()

for canton in cantons: 
  print(f"############################")

  for url in canton['url'].values():
    print(f"Querying {url} ")
    
    # now we want to find out the number of listings and pages for a given region
    infos_page = session.get(url, headers=generate_headers())
    infos_page.encoding = 'utf-8'
    infos = BeautifulSoup(infos_page.text, 'html.parser')

    # number of listings for this region
    nb = int(infos.find_all(class_=re.compile(r"^ResultListHeader_locations_bold_"))[0].text.split(' ')[0])

    # number of result pages for this region (20 listings per page)
    pages = math.ceil(nb/20)
    if pages > 50: # we can handle at most 50 pages with the homegate ui
      pages = 50

    print(f"Query canton: {canton['canton']}")
    print(f"Pages: {pages}, listings: {nb}")
    print(f"Querying page: ", end='')

    # Query all the listings, page by page. 
    # There might be up to 50 pages of listings.
    for i in range(1, pages+1):
      print(f"{i}, ", end='')

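      # build the page url from the base url so that ep parameters do not accumulate
      page_url = f"{url}&ep={i}" # query page by page
      listings = session.get(page_url, headers=generate_headers())
      listings.encoding = 'utf-8'
      listings = BeautifulSoup(listings.text, 'html.parser')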

      # each card contains the information about one listing
      cards = listings.find_all("a", class_="HgCardElevated_content_uir_2")
      for card in cards:
        # we will get the listing id from the href
        # we can go directly to the listing with the url
        # https://www.homegate.ch/rent/{listing_id}
        listing_id = card.get("href").split("/")[-1]

        # Ok we have the id of the listing but we won't query the listing yet.
        # We store the id in redis with a null payload if it does not already exist
        redis_key = f"listing:{listing_id}"
        # Use SETNX to add the key only if it doesn't exist
        redis_client.setnx(redis_key, "")

      #break
      # sleep before querying the next page
      sleep()

    print(f"done for canton {canton['canton']}")


    #break
    sleep()

print()
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Done in {elapsed_time:.2f} seconds.")
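The pagination arithmetic from the cell above can be isolated into two small functions, which makes it easy to check. This is a sketch under the assumptions stated in that cell (20 listings per page, at most 50 pages reachable through the homegate UI); `page_count` and `page_urls` are hypothetical helper names, not part of the notebook.

```python
import math


def page_count(nb_listings, per_page=20, max_pages=50):
    """Number of result pages to query, capped at what the site UI exposes."""
    return min(math.ceil(nb_listings / per_page), max_pages)


def page_urls(base_url, pages):
    """Build one URL per page, always starting from the unchanged base URL."""
    return [f"{base_url}&ep={i}" for i in range(1, pages + 1)]
```

Because every page URL is derived from the same `base_url`, the `ep` parameter never piles up from one iteration to the next. A region with more than 1000 listings is truncated by the 50-page cap, which is one reason the collected data may be incomplete.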

## 2.3 Get all the listings details
We now fetch the details of every listing whose Redis payload is still empty.
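Storing an empty string for each newly discovered id turns Redis into a simple resumable work queue: a listing is pending as long as its value is empty, so the job can be interrupted and restarted without redoing finished work. The selection step can be sketched as follows. `pending_listing_ids` and `FakeStore` are hypothetical names; `FakeStore` is only an in-memory stand-in for the two Redis calls involved (`scan_iter` and `get`) so that the sketch runs without a Redis server, and it assumes string values as returned with `decode_responses=True`.

```python
def pending_listing_ids(store, pattern="listing:*"):
    """Yield the ids of listings whose stored payload is still empty."""
    for key in store.scan_iter(pattern):
        if store.get(key) == "":
            yield key.split(":")[-1]


class FakeStore:
    """Tiny in-memory stand-in for the Redis client (illustration only)."""

    def __init__(self, data):
        self._data = dict(data)

    def scan_iter(self, pattern):
        # Only supports patterns of the form "prefix*", which is all we need here.
        prefix = pattern.rstrip("*")
        return (key for key in self._data if key.startswith(prefix))

    def get(self, key):
        return self._data.get(key)
```

With the real client, `pending_listing_ids(redis_client)` would yield the same ids that the loop in this section processes.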

In [175]:
start_time = time.time()
i = 0

# loop through all the keys in redis
for key in redis_client.scan_iter("listing:*"):
  value = redis_client.get(key)
  
  # we only process the keys that have an empty value
  if value != "":
    continue

  listing_id = key.split(":")[-1]
  print(f"Query listing: {listing_id}")

  # query the listing page for each listing id
  # we're going to extract only the text needed for the analysis
  # we use selenium to create, launch and quit a whole new browser for each listing,
  # otherwise homegate.ch sends us bad data... -______-
  listing_url = f"https://www.homegate.ch/rent/{listing_id}"
  print(f"Query listing : {listing_url}")

  # launch a new chrome browser and go to the listing page
  service = Service('/opt/homebrew/bin/chromedriver')
  driver = webdriver.Chrome(service=service, options=chrome_options)
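  # NOTE: request_interceptor is a selenium-wire hook; it only takes effect if webdriver is imported from seleniumwire
  driver.request_interceptor = lambda request: [request.headers.update(generate_headers())]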

  # Minimize the Chrome window using AppleScript (macOS only)
  subprocess.run(['osascript', '-e', 'tell application "Google Chrome" to set minimized of front window to true'])


  driver.get(listing_url)

  # Wait for the page to load and the desired element to be present
  try:
    WebDriverWait(driver, 10).until(
      EC.presence_of_element_located((By.CLASS_NAME, "hg-listing-details"))
    )
  except TimeoutException:
    # if we get here it means that the page is no longer available
    print("Timed out waiting for page to load")
    driver.quit()
    redis_client.delete(key) # remove the listing
    sleep()
    continue


  # Get the page source and parse it with BeautifulSoup
  listing_page = driver.page_source
  listing = BeautifulSoup(listing_page, 'html.parser')
  listing = listing.find_all(class_="hg-listing-details")[0].text

  ### EXTRACTING THE DATA WITH CHATGPT ###
  prompt = f"""
  You are an assistant that extracts information from an unstructured real estate listing and converts it into a structured JSON format.
  
  Here is the JSON structure you must follow:
  {{
      "title": "string (NO_TITLE if missing)",
      "rent": int (0 if missing),
      "rooms": float (0 if missing),
      "living_space": int (0 if missing, can also be called "space living" or "surface" or "floor space" in the listing),
      "floor": int (0 if missing),
      "type": "Apartment" | "House",
      "address": {{
        "canton": "string (2 uppercase letters)",
        "municipality": "string",
        "postal_code": int,
        "street": "string"
      }},
      "features": {{
        "wheelchair_access": boolean,
        "elevator": boolean,
        "balcony_terrace": boolean,
        "pets_allowed": boolean,
        "new_building": boolean,
        "parking_garage": boolean,
        "old_building": boolean,
        "minergie": boolean,
        "swimming_pool": boolean
      }}
  }}
  
  You have to decide if the listing is either apartment or house. 
  If you can't find the canton name in the listing, try to find out the canton name based on the postal code and if you can't, use "NO_CANTON".
  All features are false by default and must be set to true if they appear in the listing.
  
  Here is the real estate listing:
  =========
  {listing}
  =========
  
  Return only the JSON object without any additional text.
  The JSON object must be directly parsable by python without any modifications.
  """
  
  
  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are an assistant specialized in data extraction."},
      {"role": "user", "content": prompt}
    ]
  )
  
  structured_data = response.choices[0].message.content.strip()
  # Remove Markdown code block delimiters if present
  if structured_data.startswith("```json"):
      structured_data = structured_data[len("```json"):].strip()
  if structured_data.endswith("```"):
      structured_data = structured_data[:-len("```")].strip()
  # suppress control characters
  structured_data = re.sub(r"[\x00-\x1F\x7F]", "", structured_data)
  
  print(structured_data)
  structured_data = json.loads(structured_data)

  print(structured_data)
  
  # Save the structured data in redis
  redis_client.set(key, json.dumps(structured_data))

  # procedure at the end of each listing page request
  driver.quit() # we need to close the browser each time
  i += 1
  print()
  sleep()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Done in {elapsed_time:.2f} seconds.")
print(f"Total listings processed: {i}")

Query listing: 4001975309
Query listing : https://www.homegate.ch/rent/4001975309
Timed out waiting for page to load
Query listing: 4001849816
Query listing : https://www.homegate.ch/rent/4001849816
{    "title": "Superbe appartement près de la gare",    "rent": 1180,    "rooms": 2.5,    "living_space": 45,    "floor": 3,    "type": "Apartment",    "address": {        "canton": "NE",        "municipality": "La Chaux-de-Fonds",        "postal_code": 2300,        "street": "Parc 85"    },    "features": {        "wheelchair_access": false,        "elevator": false,        "balcony_terrace": false,        "pets_allowed": false,        "new_building": false,        "parking_garage": false,        "old_building": false,        "minergie": false,        "swimming_pool": false    }}
{'title': 'Superbe appartement près de la gare', 'rent': 1180, 'rooms': 2.5, 'living_space': 45, 'floor': 3, 'type': 'Apartment', 'address': {'canton': 'NE', 'municipality': 'La Chaux-de-Fonds', 'postal_code': 230

KeyboardInterrupt: 
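The cleanup applied to the model reply in section 2.3 (removing an optional Markdown code fence, then parsing the JSON) can be factored into one helper that is easy to test on its own. This is a minimal sketch: `parse_model_json` is a hypothetical name, and instead of deleting control characters with a regular expression it passes `strict=False` to `json.loads`, which tolerates control characters inside strings. That is a swapped-in variation, not what the notebook does.

```python
import json
import re

# Three backticks, built programmatically to keep this example readable.
FENCE = "`" * 3


def parse_model_json(text):
    """Parse a JSON object from a model reply that may be wrapped in a code fence."""
    cleaned = text.strip()
    # Drop an opening fence, with or without a language tag such as "json".
    cleaned = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", cleaned)
    # Drop a closing fence.
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    # strict=False allows raw control characters inside JSON strings.
    return json.loads(cleaned, strict=False)
```

A reply that is still not valid JSON raises `json.JSONDecodeError`; catching it and leaving the Redis value empty would let the listing be retried on the next run.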