# CAS ADS 2024 - M1 Project
Author: **Marcel Grosjean**

### /!\ This page was created in a local Jupyter notebook. It may not work properly on Google Colab /!\



I started this project in Google Colab, but homegate.ch blocked most of the requests coming from Colab. So I moved everything to a local notebook, which works much better.

### 1. Project setup
First I need to install the necessary packages. 
The fake-useragent package is used to generate random User-Agent strings.

In [4]:
!pip install fake-useragent



Import some libraries:

In [22]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import time
import math
import random
from fake_useragent import UserAgent

Let's create some constants and functions: 

In [237]:
ua = UserAgent()

# Called before every request to build a fresh set of HTTP headers
def generate_headers():
    return {
        'User-Agent': ua.random,  # random user agent for each request
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'TE': 'Trailers'
    }

# function to create a requests session with retry logic
def create_requests_session_with_retries():
    session = requests.Session()
    retries = Retry(
        total = 5,  
        backoff_factor = 1,  
        status_forcelist = [500, 502, 503, 504],  # Retry on these status codes
        allowed_methods = ["GET"],  
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# extract number from a string
def extract_number_from_string(input_string):
    numeric_string = ''.join([char for char in input_string if char.isdigit()])
    if numeric_string:
        return int(numeric_string)
    else:
        return None

# replace every comma in a string with a period, except the last one
def replace_commas(s):
    parts = s.split(',')
    
    if len(parts) <= 1:
        return s
    
    return '.'.join(parts[:-1]) + ',' + parts[-1]
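
As a quick check of the two string helpers above, the sketch below re-declares them so that it runs on its own, then applies them to made-up inputs. The sample strings are illustrative assumptions, not actual homegate.ch output.

```python
def extract_number_from_string(input_string):
    # keep only the digits and convert them to an int (None if there are none)
    numeric_string = ''.join(char for char in input_string if char.isdigit())
    return int(numeric_string) if numeric_string else None

def replace_commas(s):
    # turn every comma except the last one into a period
    parts = s.split(',')
    if len(parts) <= 1:
        return s
    return '.'.join(parts[:-1]) + ',' + parts[-1]

# hypothetical sample strings
print(extract_number_from_string("1,234 results"))
print(extract_number_from_string("Price on request"))
print(replace_commas("Street 1, Building A, 5000 Aarau"))
```

Note that the digit filter concatenates every digit it finds, so a string such as "3.5" would yield 35. The helper is therefore only suitable for whole numbers.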

### 2. Query homegate.ch
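So far we have set up everything we need to query homegate.ch. 

Querying homegate.ch is tricky for two reasons. First, the site often answers with **429 Too Many Requests**, so the requests below are spaced out with pauses. Second, the UI changes frequently and its markup is complex, so the CSS class names used below may stop matching.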
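
One common way to cope with 429 responses is to wait longer after each failed attempt. The sketch below is an assumption about how this could be done, not what this notebook does: `backoff_delays`, `fetch_with_backoff` and the `fetch` callable (returning a status code and a body) are hypothetical names, and the `sleep` parameter is injectable so the logic can be tried without actually waiting.

```python
import time

def backoff_delays(max_retries, base=10.0, cap=300.0):
    """Exponential delays: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def fetch_with_backoff(fetch, max_retries=5, base=10.0, sleep=time.sleep):
    """Call `fetch()` until it stops returning 429 or the retries run out.

    `fetch` is a hypothetical zero-argument callable returning (status, body).
    """
    status, body = fetch()
    for delay in backoff_delays(max_retries, base):
        if status != 429:
            break
        sleep(delay)  # wait before trying again
        status, body = fetch()
    return status, body

# simulated server: two 429 responses, then a success
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

With the `requests` session used in this notebook, another option would be to add 429 to the `status_forcelist` of urllib3's `Retry`, so that the existing retry logic also covers rate limiting.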


We start by querying all the cantons: 

In [28]:
session = create_requests_session_with_retries()

cantons_url = "https://www.homegate.ch/rent/apartment/switzerland"
cantons_results = session.get(cantons_url, headers=generate_headers())
cantons_results.encoding = 'utf-8'

print(f"Get all the cantons from {cantons_url}")
print(f"Status code: {cantons_results.status_code}")

cantons_page = BeautifulSoup(cantons_results.text)
cantons = []
for arr in cantons_page.select('[class^="row GeoDrillDownLocationsSection_spacer_"]'):
    # get the links to all the cantons
    main_div = arr.findChild()
    link_tag = main_div.findChild()
    
    # region full name
    canton = link_tag.text
    
    # skip regions that are not Swiss cantons: the link text must contain "Canton"
    if "Canton" not in canton:
        continue
    
    # only keep the canton name
    canton = canton.split("Canton")[1].strip()
    # single quotes inside the f-string keep it valid on Python versions before 3.12
    cantons.append({"canton": canton, "url": f"https://www.homegate.ch{link_tag['href']}"})

print(f"NB Cantons: {len(cantons)}")
print(f"First canton: {cantons[0]}")

Get all the cantons from https://www.homegate.ch/rent/apartment/switzerland
Status code: 200
NB Cantons: 26
First canton: {'canton': 'Aargau', 'url': 'https://www.homegate.ch/rent/apartment/canton-aargau'}


The *cantons* list has the following structure: 
[{
    canton: string, 
    url: string
}]

where url is the link to the page listing all the districts of the canton.
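
The filtering rule used above (keep only links whose text contains "Canton", then keep what follows that word) can be isolated in a small function. `parse_canton_name` is a hypothetical helper, and the sample labels are assumptions about what the link text looks like.

```python
def parse_canton_name(link_text):
    """Return the canton name, or None if the text is not a canton link."""
    if "Canton" not in link_text:
        return None
    # keep whatever follows the word "Canton"
    return link_text.split("Canton")[1].strip()

# hypothetical link texts
for text in ["Canton Aargau", "Canton St. Gallen", "Liechtenstein"]:
    print(repr(text), "->", parse_canton_name(text))
```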

There is a tricky part now: a canton page lists either regions (districts) or cities, but not both. 
We need to check which kind of listing it is and build the links accordingly. 
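
The link construction performed in the next cell can be sketched in isolation. `to_listing_url` and `BASE_URL` are hypothetical names, and the sample `href` values are assumptions about what the region and city links look like.

```python
BASE_URL = "https://www.homegate.ch"

def to_listing_url(href):
    """Build the listing URL for a region or city link."""
    link = f"{BASE_URL}{href}"
    if "/matching-list" not in link:
        link += "/matching-list"  # region/city page -> listing page
    # as in this notebook: only keep listings that have a price
    return link + "?ipd=true"

# hypothetical href values
print(to_listing_url("/rent/apartment/region-aarau"))
print(to_listing_url("/rent/apartment/city-appenzell/matching-list"))
```

Both kinds of link end up in the same form, so the scraping loop does not need to know whether it is dealing with a region or a city.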


In [32]:
session = create_requests_session_with_retries()

listings_list_pages = []
# we try the smallest listing possible per canton
for canton in cantons:
    canton_page = session.get(canton["url"], headers=generate_headers())
    canton_page.encoding = 'utf-8'

    # parse the page
    canton_links_page = BeautifulSoup(canton_page.text)
    
    # get a parent tag and navigate to the links
    parent = canton_links_page.select('[class^="GeoDrilldownPage_rootTitle_zNMQr"]')    
    parent = parent[0].nextSibling.nextSibling.find_all()
    regions = parent[1]

    # extract the links and store it in the list
    links = []
    for region in regions:
        # single quotes inside the f-string keep it valid on Python versions before 3.12
        link = f"https://www.homegate.ch{region.find('a')['href']}"
        if "/matching-list" not in link: # not yet a listing page
            link += "/matching-list" 
        link += "?ipd=true" # we only keep listings that have a price
        links.append(link)

    listings_list_pages.append({
        "canton": canton["canton"],
        "links": links
    })    
    print(f"Done for {canton['canton']}")
    time.sleep(10)
    

for ct in listings_list_pages:
    print(ct["canton"])
    for link in ct["links"]:
        print(link)
    print()
#print(listings_list_pages)


Done for Aargau
Done for Appenzell Ausserrhoden
Done for Appenzell Innerrhoden
Done for Basel-Land
Done for Basel-Stadt
Done for Bern
Done for Fribourg
Done for Geneva
Done for Glarus
Done for Graubünden
Done for Jura
Done for Lucerne
Done for Neuchâtel
Done for Nidwalden
Done for Obwalden
Done for Schaffhausen
Done for Schwyz
Done for Solothurn
Done for St. Gallen
Done for Thurgau
Done for Ticino
Done for Uri
Done for Valais
Done for Vaud
Done for Zug
Done for Zurich
Aargau
https://www.homegate.ch/rent/apartment/region-aarau/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-baden/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-bremgarten/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-brugg/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-kulm/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-laufenburg/matching-list?ipd=true
https://www.homegate.ch/rent/apartment/region-lenzburg/matchi

Now we get the number of listings and pages for each region or city, and scrape every page.
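
The page arithmetic used in the next cell is simple enough to check on its own. The sketch below assumes, as this notebook does, 20 listings per page and a limit of 50 pages; `page_count` is a hypothetical helper name.

```python
import math

LISTINGS_PER_PAGE = 20  # page size used in this notebook
MAX_PAGES = 50          # page limit of the homegate UI, according to this notebook

def page_count(nb_listings):
    """Number of result pages to query for a region or city."""
    return min(MAX_PAGES, math.ceil(nb_listings / LISTINGS_PER_PAGE))

for nb in (145, 261, 8, 0, 5000):
    print(nb, "->", page_count(nb))
```

One consequence of the cap is that at most 50 × 20 = 1000 listings can be reached for a single region or city, so larger result sets would be truncated.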

In [104]:
start_time = time.time()
session = create_requests_session_with_retries()

# contains all the listings
listings = []

# we also write the results to "listings.txt" because the process may break at any time
with open("listings.txt", "w", encoding="utf-8") as file:
    # for each canton, we'll get all the listings
    for infos in listings_list_pages:
        print(f"{infos['canton']}:")
    
        for link in infos["links"]:
            # find out the number of listings and pages for this region;
            # use the session so that the retry logic applies here too
            infos_page = session.get(link, headers=generate_headers())
            infos_page.encoding = 'utf-8'
    
            # parse the page (named region_page so it does not shadow the loop variable)
            region_page = BeautifulSoup(infos_page.text)
            # number of listings for this region
            nb = extract_number_from_string(region_page.find_all(class_="ResultListHeader_locations_bold_OhksP")[0].text)
            # number of pages for this listing
            pages = math.ceil(nb/20)
            if pages > 50: # we can handle at most 50 pages with the homegate ui
                pages = 50
    
            print(f" {link.split('/')[5]} {nb} listings for {pages} pages")
            
            if nb == 0: # no listing for this district
                continue
    
            # query all the pages
            print("  Page: ", end="")
            for i in range(1, pages + 1):
                # get the specific page
                target_page_link = f"{link}&page={i}"
                target_page = session.get(target_page_link, headers=generate_headers())
                target_page.encoding = 'utf-8'
    
                target_page = BeautifulSoup(target_page.text)
    
                # each card contains the information about one listing
                cards = target_page.find_all(class_="HgListingCard_info_RKrwz")
                for card in cards:
                    # get the price, space and address
                    listing_informations = {
                        "price": card.find_all(class_="HgListingCard_price_JoPAs")[0].text.strip(),
                        "space": card.find_all(class_="HgListingRoomsLivingSpace_roomsLivingSpace_GyVgq")[0].text.strip(),
                        "address": card.find_all(class_="HgListingCard_address_JGiFv")[0].text.strip()
                    }
                    #print(listing_informations)
                    # store in the listings list
                    listings.append(listing_informations)
                    # and store it in the file
                    file.write(f"{listing_informations}\n")
                    
    
                # DO NOT REMOVE THIS !!! The pause helps avoid 429 Too Many Requests
                time.sleep(10)
    
                print(f"{i} ", end="")
                #break
    
            print()
    
            # DO NOT REMOVE THIS !!! The pause helps avoid 429 Too Many Requests
            time.sleep(10)
            #break
            
        
        print()
    
        #break

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Done in {elapsed_time:.2f} seconds.")

Aargau:
 region-aarau 145 listings for 8 pages
  Page: 1 2 3 4 5 6 7 8 
 region-baden 261 listings for 14 pages
  Page: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
 region-bremgarten 135 listings for 7 pages
  Page: 1 2 3 4 5 6 7 
 region-brugg 101 listings for 6 pages
  Page: 1 2 3 4 5 6 
 region-kulm 113 listings for 6 pages
  Page: 1 2 3 4 5 6 
 region-laufenburg 95 listings for 5 pages
  Page: 1 2 3 4 5 
 region-lenzburg 125 listings for 7 pages
  Page: 1 2 3 4 5 6 7 
 region-muri 64 listings for 4 pages
  Page: 1 2 3 4 
 region-rheinfelden 133 listings for 7 pages
  Page: 1 2 3 4 5 6 7 
 region-zofingen 203 listings for 11 pages
  Page: 1 2 3 4 5 6 7 8 9 10 11 
 region-zurzach 86 listings for 5 pages
  Page: 1 2 3 4 5 

Appenzell Ausserrhoden:
 region-hinterland 77 listings for 4 pages
  Page: 1 2 3 4 
 region-mittelland 51 listings for 3 pages
  Page: 1 2 3 
 region-vorderland 53 listings for 3 pages
  Page: 1 2 3 

Appenzell Innerrhoden:
 city-appenzell 8 listings for 1 pages
  Page: 1 
 

Let's check how many results we got:

In [112]:
print(f"We managed to scrape {len(listings)} results")

We managed to scrape 24379 results
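
Because each listing is also written to "listings.txt" as one Python dict literal per line, the list can in principle be rebuilt from that file, for example after a kernel restart when the `listings` variable no longer exists. The sketch below is one possible way to do that, assuming the file was written as UTF-8 text; `load_listings` is a hypothetical helper name.

```python
import ast

def load_listings(path="listings.txt", encoding="utf-8"):
    """Rebuild the listings list from a file with one dict literal per line."""
    listings = []
    with open(path, encoding=encoding) as file:
        for line in file:
            line = line.strip()
            if line:  # skip empty lines
                # literal_eval only parses Python literals, it does not run code
                listings.append(ast.literal_eval(line))
    return listings
```

A call such as `listings = load_listings()` before the cleaning step would then restore the scraped data without querying the site again.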


We'll parse all the results and do some cleaning:


In [2]:
listings_clean = []
for listing in listings:
    price = extract_number_from_string(listing["price"][4:][:-10])
    space = listing["space"].strip()
    address = replace_commas(listing["address"])
    
    # we filter out apartments that have no price
    if price is None:
        continue
        
    # we filter out apartments that have no rooms or living space
    if "rooms" not in space or "space" not in space:
        continue

    # there is always a postal code (cp) and a municipality, but not always a street address
    if "," in address:
        street = address.split(",")[0].strip()
        cp = int(address.split(",")[1].strip().split(" ")[0].strip())
        municipality = address.split(",")[1].strip()[5:]
    else:
        street = "-"
        cp = int(address.split(" ")[0].strip())
        municipality = address.strip()[5:]
    
    rooms = float(space.split("rooms")[0].strip())
    surface = int(space.split("rooms")[1].strip()[:-15])
    
    listings_clean.append({
        "price": price, 
        "surface": surface,
        "rooms": rooms,
        "cp": cp,
        "municipality": municipality
    })

print(f"There are {len(listings_clean)} cleaned listings")
listings_clean


NameError: name 'listings' is not defined
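
The address handling in the cleaning cell relies on fixed slice offsets. As a hedged alternative, the same split into street, postal code (cp) and municipality could be done with a regular expression. The sketch assumes addresses of the form "street, 1234 Municipality" or "1234 Municipality" with a four-digit postal code; `parse_address`, `ADDRESS_RE` and the sample addresses are hypothetical.

```python
import re

# optional "street, " prefix, then a 4-digit postal code and the municipality
ADDRESS_RE = re.compile(
    r"^(?:(?P<street>.*),\s*)?(?P<cp>\d{4})\s+(?P<municipality>.+)$"
)

def parse_address(address):
    """Return (street, cp, municipality), or None if the format is unexpected."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        return None
    street = (match.group("street") or "-").strip()  # "-" when there is no street
    return street, int(match.group("cp")), match.group("municipality").strip()

# hypothetical sample addresses
print(parse_address("Bahnhofstrasse 1, 5000 Aarau"))
print(parse_address("9050 Appenzell"))
print(parse_address("no postal code here"))
```

Returning None for unexpected formats lets the cleaning loop skip such rows instead of raising a ValueError on `int(...)`.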