## ImmoScout24 Scraper - 2024

## Introduction

This Jupyter Notebook is a prototype for a Python project that scrapes real estate listings from ImmoScout24. The site serves dynamic content and uses anti-scraping measures such as CAPTCHAs, so conventional scraping methods often fall short. This prototype instead drives a real browser with Selenium, a browser automation tool, and pauses so that the user can solve the CAPTCHA by hand.

#### Key Highlights of the Prototype:

- **Selenium WebDriver Implementation**: Leverages Selenium for browser automation, facilitating interaction with the website's dynamic content.
- **Manual CAPTCHA Resolution**: Pauses so that the user can solve the CAPTCHA by hand in the browser window; the prototype does not attempt to bypass it.
- **Preliminary Data Extraction**: Extracts basic details such as titles, addresses, rental prices, living spaces, and room counts from real estate listings.
- **SQLite Database Integration**: Demonstrates how extracted data can be stored in a SQLite database, although this implementation is in its early stages.

#### Prototype Objectives:

- To evaluate the feasibility of automated real estate data extraction from ImmoScout24.
- To establish a foundational framework that can be refined and expanded in future iterations of the project.

#### Intended Use:

This prototype is designed for developers, data analysts, and researchers interested in real estate data aggregation and analysis. It provides a starting point for more complex and robust implementations.

### Imports

In [13]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd
import logging
from datetime import datetime
import sqlite3
import os

### Setup Variables

**Important Note:** The *SEARCH_PATH* variable is pre-configured to retrieve apartment rental listings in Ulm, Germany, filtered by the *rooms*, *price*, and *livingspace* variables. To target a different location or different criteria, modify *SEARCH_PATH* and these variables accordingly.

In [24]:
# Path variables
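output_path = r"d:\immoscrape"
# The directory must exist before logging.basicConfig opens the log file
os.makedirs(output_path, exist_ok=True)
# os.path.join avoids the invalid "\s" escape sequence of the original f-string
log_path = os.path.join(output_path, "scraping_log.log")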

# Search parameters
rooms = "1.5-"
price = "-600.0"
livingspace = "30.0-60.0"

# Constants
BASE_URL = "https://www.immobilienscout24.de"
SEARCH_PATH = f"/Suche/de/baden-wuerttemberg/ulm/wohnung-mieten?numberofrooms={rooms}&price={price}&livingspace={livingspace}&pricetype=rentpermonth&enteredFrom=result_list"

logging.basicConfig(filename=log_path, level=logging.INFO)
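
To target another location without editing the URL string by hand, the path can be assembled from its parts. The following is a minimal sketch, not part of the original notebook: `build_search_path` is a hypothetical helper, and it assumes that other locations follow the same `/Suche/de/<region>/<city>/wohnung-mieten` pattern as the Ulm path above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.immobilienscout24.de"


def build_search_path(region, city, rooms, price, livingspace):
    """Hypothetical helper: assemble the search path from a location and filter ranges."""
    # Assumption: other locations use the same /Suche/de/<region>/<city>/ pattern as Ulm
    query = urlencode({
        "numberofrooms": rooms,
        "price": price,
        "livingspace": livingspace,
        "pricetype": "rentpermonth",
        "enteredFrom": "result_list",
    })
    return f"/Suche/de/{region}/{city}/wohnung-mieten?{query}"


# Reproduces the Ulm search configured in this notebook
ulm_path = build_search_path("baden-wuerttemberg", "ulm", "1.5-", "-600.0", "30.0-60.0")
print(BASE_URL + ulm_path)
```

Called with the notebook's values, the helper yields the same path as the hard-coded *SEARCH_PATH*; slugs for other regions and cities would need to be checked against the site.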

### Database Connection

The functions below manage the project's SQLite database. They open a connection, create the table that stores the real estate listings, check whether a listing has already been stored for the current extraction date, and insert new rows. Together they keep repeated runs from storing the same listing twice.

In [26]:
# Function to establish database connection
def create_connection(db_file):
    """ Create a database connection to the SQLite database specified by db_file """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except Exception as e:
        print(e)
    return conn


# Function to create table
def create_table(conn):
    """ Create table if it doesn't exist and set data_id as the primary key """
    try:
        sql_create_table = """CREATE TABLE IF NOT EXISTS listings (
                                data_id text PRIMARY KEY,
                                title text,
                                address text,
                                kaltmiete real,
                                living_space real,
                                rooms real,
                                date text
                            );"""
        cur = conn.cursor()
        cur.execute(sql_create_table)
    except Exception as e:
        print(e)
        
        
def check_listing_exists(conn, data_id, extraction_date):
    """ Check if a listing with the same ID and date already exists """
    cur = conn.cursor()
    cur.execute("SELECT data_id FROM listings WHERE data_id = ? AND date = ?", (data_id, extraction_date))
    return cur.fetchone() is not None


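# Function to insert data into table
def insert_listing(conn, listing):
    """ Insert a new listing; return True if a row was added, False if data_id already existed """
    query = ''' INSERT OR IGNORE INTO listings(data_id,title,address,kaltmiete,living_space,rooms,date)
              VALUES(?,?,?,?,?,?,?) '''
    cur = conn.cursor()
    cur.execute(query, listing)
    # data_id is the primary key, so OR IGNORE silently skips a listing that was
    # already stored on an earlier date; rowcount is 0 in that case
    return cur.rowcount == 1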
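
How the table behaves when the same listing shows up on a later day can be checked without touching the real database file. The following is a minimal sketch using an in-memory database and made-up sample rows; it repeats the schema and the `INSERT OR IGNORE` statement from the cell above so that it runs on its own.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS listings (
                    data_id text PRIMARY KEY,
                    title text,
                    address text,
                    kaltmiete real,
                    living_space real,
                    rooms real,
                    date text
                );""")

insert_sql = """INSERT OR IGNORE INTO listings
                (data_id, title, address, kaltmiete, living_space, rooms, date)
                VALUES (?, ?, ?, ?, ?, ?, ?)"""

# Made-up sample rows: the same data_id seen on two different days
first_seen = ("123", "Sample flat", "Sample street, Ulm", 550.0, 45.0, 2.0, "2024-01-01")
seen_again = ("123", "Sample flat", "Sample street, Ulm", 550.0, 45.0, 2.0, "2024-01-02")

inserted_first = conn.execute(insert_sql, first_seen).rowcount  # new row stored
inserted_again = conn.execute(insert_sql, seen_again).rowcount  # primary key clash, row ignored

stored_dates = [row[0] for row in
                conn.execute("SELECT date FROM listings WHERE data_id = ?", ("123",))]
print(inserted_first, inserted_again, stored_dates)
conn.close()
```

Because `data_id` alone is the primary key, the second insert is ignored even though its date differs, so the table keeps only the first sighting of each listing. Tracking a listing across days would need a different key, for example a composite of `data_id` and `date`; that is a possible design change, not something the prototype does.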

### Web Scraping
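
The scraping cell in this section converts German-formatted values such as `1.234,56 €` into floats with chained `replace` calls. The following is a minimal sketch of that conversion as a standalone function; `parse_german_number` is a hypothetical helper that does not appear in the notebook, and it assumes the number is the first whitespace-separated token of the text.

```python
def parse_german_number(text):
    """Hypothetical helper: convert text like '1.234,56 €' or '45,5 m²' to a float."""
    # Keep only the first token, dropping a trailing unit such as '€' or 'm²'
    number = text.strip().split()[0]
    # German notation: '.' groups thousands, ',' marks decimals
    return float(number.replace(".", "").replace(",", "."))


print(parse_german_number("1.234,56 €"))  # 1234.56
print(parse_german_number("45,5 m²"))     # 45.5
print(parse_german_number("2,5"))         # 2.5
```

Empty or non-numeric text raises an exception here (`IndexError` or `ValueError`), so a caller would still need a `try`/`except` around it, as the scraping loop has.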

In [30]:
def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    start_time = time.time()
    
    # Get current date
    extraction_date = datetime.now().strftime("%Y-%m-%d")
    
    # Database setup
    os.makedirs(output_path, exist_ok=True)
    db_path = os.path.join(output_path, 'listings.db')
    print(db_path)
    conn = create_connection(db_path)
    if conn is None:
        # Without a database there is nowhere to store results, so stop early
        driver.quit()
        return

    # Create table if not exists
    create_table(conn)

    driver.get(BASE_URL + SEARCH_PATH)
    time.sleep(5)  # Wait for the page to load

    # The site may show a CAPTCHA; pause unconditionally so the user can solve it in the browser window
    input("Please solve the CAPTCHA and then press Enter to continue...")

    # Access the result-list-content div
    result_list_content = driver.find_element(By.ID, "result-list-content")
    #print(result_list_content)

    # Find all listings within this div
    listing_elements = result_list_content.find_elements(By.CLASS_NAME, "result-list__listing")
    #print(len(listing_elements))
    
    # List to hold valid listings
    valid_listings = []

    # Keep only the listings that contain a title element; entries without one are reported and skipped
    for listing_element in listing_elements:
        try:
            title_element = listing_element.find_element(By.CSS_SELECTOR, 'h2.result-list-entry__brand-title')
            title = title_element.text
            # print(f"Listing Title: {title}") # Title was found!
            valid_listings.append(listing_element)  # Add to valid listings if title found
        except Exception as e:
            print(f"Error finding title in listing: {e}")

    # print(f"Number of valid listings: {len(valid_listings)}")
    
    # For all valid_listings, get location, price, number of rooms, number of squaremetres
    for listing_element in valid_listings:
        try:
            # Extracting the data-id
            data_id = listing_element.get_attribute('data-id')

            # Extracting the title (as done previously)
            title_element = listing_element.find_element(By.CSS_SELECTOR, 'h2.result-list-entry__brand-title')
            title = title_element.text

            # Extracting the address
            address_element = listing_element.find_element(By.CSS_SELECTOR, 'div.result-list-entry__address')
            address = address_element.text
            
            # Extract rent, living space, and number of rooms
            criteria_elements = listing_element.find_elements(By.CSS_SELECTOR, 'dl.result-list-entry__primary-criterion')
            kaltmiete = living_space = rooms = None

            for criteria in criteria_elements:
                label = criteria.find_element(By.TAG_NAME, 'dt').text.strip()
                value = criteria.find_element(By.TAG_NAME, 'dd').text.strip()

                if 'Kaltmiete' in label:
                    # Strip ' €', drop '.' thousands separators, use '.' as the decimal mark
                    kaltmiete = value.replace(' €', '').replace('.', '').replace(',', '.')
                    kaltmiete = float(kaltmiete)
                elif 'Wohnfläche' in label:
                    # Strip ' m²' and use '.' as the decimal mark
                    living_space = value.replace(' m²', '').replace(',', '.')
                    living_space = float(living_space)
                elif 'Zi.' in label:
                    rooms = value.replace(',', '.')
                    rooms = float(rooms)
            
            # Print the extracted fields for verification
            print(f"Object ID: {data_id}")
            print(f"Title: {title}")
            print(f"Address: {address}")
            print(f"Kaltmiete: {kaltmiete}")
            print(f"Wohnfläche: {living_space}")
            print(f"Zi.: {rooms}")            
            
            # Check if the listing already exists
            if not check_listing_exists(conn, data_id, extraction_date):
                # Insert into the database
                listing = (data_id, title, address, kaltmiete, living_space, rooms, extraction_date)
                insert_listing(conn, listing)
            
        except Exception as e:
            print(e)
         
    # Commit changes and close connection
    conn.commit()
    conn.close()
    driver.quit()

    logging.info("Scraping completed in %.1f s.", time.time() - start_time)
    print("Scraping completed.")

if __name__ == "__main__":
    main()

d:\immoscrape\listings.db


Please solve the CAPTCHA and then press Enter to continue... 


Error finding title in listing: Message: no such element: Unable to locate element: {"method":"css selector","selector":"h2.result-list-entry__brand-title"}
  (Session info: chrome=120.0.6099.130); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x004F6EE3+174339]
	(No symbol) [0x00420A51]
	(No symbol) [0x00136FF6]
	(No symbol) [0x00169876]
	(No symbol) [0x00169C2C]
	(No symbol) [0x00162631]
	(No symbol) [0x00187054]
	(No symbol) [0x001625B0]
	(No symbol) [0x00187414]
	(No symbol) [0x0019A104]
	(No symbol) [0x00186DA6]
	(No symbol) [0x00161034]
	(No symbol) [0x00161F8D]
	GetHandleVerifier [0x00594B1C+820540]
	sqlite3_dbdata_init [0x006553EE+653550]
	sqlite3_dbdata_init [0x00654E09+652041]
	sqlite3_dbdata_init [0x006497CC+605388]
	sqlite3_dbdata_init [0x00655D9B+656027]
	(No symbol) [0x0042FE6C]
	(No symbol) [0x004283B8]
	(No symbol) [0x004284DD]
	(No symbol) 