# 🏠 Sarajevo Flats Scraper
This notebook demonstrates how to collect real estate data (flats in Sarajevo Canton) from **NEKRETNINE.ba**, a popular Bosnian classifieds platform.

The goal is to:
- Collect key property details (title, price, area, location, rooms, equipment, description)
- Store them in a structured dataset (`data/sarajevo_flats_nekretnine.csv`)
- Prepare the dataset for future analysis or machine learning (e.g. AI price estimation)

We'll use **Selenium** for dynamic page loading and **BeautifulSoup** for parsing HTML.


In [1]:
import sys
import os

print("=" * 80)
print("PYTHON ENVIRONMENT INFO")
print("=" * 80)
print(f"\nPython executable: {sys.executable}")
print(f"Python version: {sys.version}")
print(f"\nPython path (where packages are loaded from):")
for i, path in enumerate(sys.path[:5], 1):
    print(f"  {i}. {path}")
print("\n" + "=" * 80)

PYTHON ENVIRONMENT INFO

Python executable: /home/msinanovic/Desktop/IUS/VIIsemester/IntroductionToMachineLearning/EE418-Introduction-to-Machine-Learning-Project/venv/bin/python
Python version: 3.14.0 (main, Oct 17 2025, 00:00:00) [GCC 15.2.1 20251022 (Red Hat 15.2.1-3)]

Python path (where packages are loaded from):
  1. /usr/lib64/python314.zip
  2. /usr/lib64/python3.14
  3. /usr/lib64/python3.14/lib-dynload
  4. 
  5. /home/msinanovic/Desktop/IUS/VIIsemester/IntroductionToMachineLearning/EE418-Introduction-to-Machine-Learning-Project/venv/lib64/python3.14/site-packages



In [2]:
import os
import time
import csv
import re
import random
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.common.exceptions import WebDriverException, TimeoutException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

In [3]:
# Firefox + Geckodriver setup
firefox_binary = "/usr/bin/firefox"
geckodriver_binary = "/home/mustafasinanovic/miniforge3/bin/geckodriver"

# Scraper settings
BASE_URL = "https://nekretnine.ba/listing.php?lang=ba&sel=nekretnine&grad=65&naselje=&kat=3&subjekt=2&cij1=&cij2=&pov1=&pov2=&spr1=&spr2=&firma=&page={}"
OUTPUT_CSV = "data/sarajevo_flats_nekretnine.csv"
MAX_PAGES = 88
REQUEST_DELAY = (2, 5)

# Multithreading settings
MAX_WORKERS = 3  # Number of parallel browser instances (don't set too high to avoid blocking)

os.makedirs("data", exist_ok=True)

The scraper will fetch up to 88 pages of listings from the NEKRETNINE.ba search results for *Sarajevo Canton flats*.  
All results are stored in `data/sarajevo_flats_nekretnine.csv`.  
We use randomized delays between requests to reduce the risk of blocking.
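
The randomized delay mentioned above can be sketched in isolation. This is a minimal illustration rather than the notebook's own helper: `polite_sleep` and its injectable `sleep` parameter are hypothetical names introduced here so the logic can be exercised without actually waiting. Only `REQUEST_DELAY` comes from the settings cell.

```python
import random
import time

REQUEST_DELAY = (2, 5)  # (min, max) seconds, same tuple as in the settings cell


def polite_sleep(delay_range=REQUEST_DELAY, sleep=time.sleep):
    """Sleep for a random duration drawn uniformly from delay_range.

    `sleep` is injectable so the helper can be tried without waiting.
    Returns the chosen delay in seconds.
    """
    low, high = delay_range
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Dry run: record the delays instead of actually sleeping.
recorded = []
for _ in range(3):
    polite_sleep(sleep=recorded.append)
print([round(d, 2) for d in recorded])
```

Drawing each pause from `random.uniform` keeps the request rhythm irregular; the bounds themselves are a judgment call and may need tuning for the target site.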

In [4]:
def clean_text(s):
    return " ".join(s.split()).strip() if s else None

def extract_price(text):
    if not text:
        return None
    cleaned = re.sub(r"[^0-9]", "", text)
    return int(cleaned) if cleaned else None

def extract_number(text):
    if not text:
        return None
    m = re.search(r"(\d+)", text)
    return int(m.group(1)) if m else None
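
A quick way to see what these helpers return is to call them on a few strings. The block below repeats the two functions so it runs on its own; the input strings are made-up examples, not values taken from the site.

```python
import re


def extract_price(text):
    # Same logic as the helper above: keep digits only.
    if not text:
        return None
    cleaned = re.sub(r"[^0-9]", "", text)
    return int(cleaned) if cleaned else None


def extract_number(text):
    # Same logic as the helper above: first run of digits.
    if not text:
        return None
    m = re.search(r"(\d+)", text)
    return int(m.group(1)) if m else None


print(extract_price("339.000 KM"))      # 339000
print(extract_price("1.250,50 KM"))     # 125050 (the decimal separator is dropped too)
print(extract_price("Na upit"))         # None
print(extract_number("3 sobe, 82 m2"))  # 3
```

Because `extract_price` strips every non-digit character, a price written with decimals would be inflated by a factor of 100, which is worth keeping in mind when inspecting outliers later.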

In [5]:
def fetch_page_source(url, driver, short_wait=10):
    """
    Loads a given URL, waits `short_wait` seconds, and returns whatever HTML has rendered by then.
    Does NOT wait for the page to fully load (useful for slow or problematic websites).

    Parameters:
        url (str): URL to load
        driver (webdriver): Selenium WebDriver instance
        short_wait (int or float): seconds to wait after opening page before returning source

    Returns:
        str or None: HTML source (may be partially loaded)
    """
    try:
        print(f"[+] Attempting to load URL quickly: {url}")
        driver.get(url)
        time.sleep(short_wait)  # minimal wait to let some content render
        html = driver.page_source
        if html:
            print(f"[+] HTML fetched (may be partial): {url}")
        else:
            print(f"[!] No HTML returned for {url}")
        return html
    except (TimeoutException, WebDriverException, OSError) as e:
        print(f"[!] Failed to load page: {url} → {e}")
        return None
    except Exception as e:
        print(f"[!] Unexpected error loading page: {url} → {e}")
        return None


This function uses Selenium to load pages dynamically.
If a page fails (timeout, network error, etc.), we log the issue but continue scraping.
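
The function makes a single attempt per URL. If retries were wanted, one option (an assumption, not something the notebook does) is a small wrapper around any fetcher that returns HTML or `None`. `fetch_with_retries`, `flaky_fetch`, and the example URL are hypothetical names used only for this sketch.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times; return the first non-empty result.

    `fetch` is any callable that returns HTML or None on failure. The wait
    doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    """
    for attempt in range(1, attempts + 1):
        html = fetch(url)
        if html:
            return html
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Stub fetcher that fails twice and then succeeds, so no browser is needed.
calls = []


def flaky_fetch(url):
    calls.append(url)
    return "<html>ok</html>" if len(calls) >= 3 else None


waits = []
result = fetch_with_retries(flaky_fetch, "https://example.com/listing", sleep=waits.append)
print(result, len(calls), waits)
```

In the notebook, `fetch` could be a `lambda u: fetch_page_source(u, driver)`; whether retrying is worth the extra load on the site is a separate decision.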

In [6]:
def parse_detail_page(url, driver):
    html = fetch_page_source(url, driver)
    if not html:
        return None

    try:
        soup = BeautifulSoup(html, "lxml")

        # Extract title
        title_elem = soup.select_one("div.listing-titlebar-title h2")
        if title_elem:
            # Remove the tag span from title
            tag_span = title_elem.find("span", class_="listing-tag")
            if tag_span:
                tag_span.decompose()
            title = clean_text(title_elem.get_text())
        else:
            title = None

        # Extract municipality (address/location)
        municipality_elem = soup.select_one("a.listing-address")
        municipality = clean_text(municipality_elem.get_text()) if municipality_elem else None

        # Extract price
        price_elem = soup.select_one("span.re-slidep")
        price_numeric = extract_price(price_elem.get_text()) if price_elem else None

        # Extract property type
        property_type_elem = soup.find("b", string="TIP")
        property_type = clean_text(property_type_elem.find_next("div").get_text()) if property_type_elem else None

        # Extract ad type (subject - prodaja/izdavanje)
        ad_type_elem = soup.find("b", string="SUBJEKT")
        ad_type = clean_text(ad_type_elem.find_next("div").get_text()) if ad_type_elem else None

        # Extract rooms
        rooms_elem = soup.find("b", string="BROJ SOBA")
        rooms = clean_text(rooms_elem.find_next("div").get_text()) if rooms_elem else None

        # Extract square meters
        square_m2_elem = soup.find("b", string="POVRŠINA")
        if square_m2_elem:
            area_text = square_m2_elem.find_next("div").get_text(strip=True)
            # Extract number and convert to float
            area_match = re.search(r'([\d,\.]+)', area_text)
            if area_match:
                area_str = area_match.group(1).replace(',', '.')
                try:
                    square_m2 = float(area_str)
                except ValueError:
                    # e.g. "1.250,5" becomes "1.250.5", which float() rejects
                    square_m2 = None
            else:
                square_m2 = None
        else:
            square_m2 = None

        # Extract description
        description_head = soup.find("h3", string=re.compile("Opis nekretnine"))
        description = clean_text(description_head.find_next("p").get_text(" ")) if description_head else None

        # Extract equipment/amenities
        equipment_list = [clean_text(li.get_text()) for li in soup.select("ul.listing-features li")]
        equipment = ", ".join([e for e in equipment_list if e])  # Filter out None values

        details = {
            "title": title,
            "url": url,
            "price_numeric": price_numeric,
            "municipality": municipality,
            "property_type": property_type,
            "ad_type": ad_type,
            "rooms": rooms,
            "square_m2": square_m2,
            "equipment": equipment,
            "description": description
        }

        print("Parsed:", details)
        return details
    except Exception as e:
        print(f"[!] Failed to parse details for {url} → {e}")
        return None
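
The area-parsing step inside `parse_detail_page` can be tried on its own. The sketch below mirrors the same regex and comma-to-dot replacement; `parse_area` is a hypothetical helper name and the input strings are invented examples.

```python
import re


def parse_area(area_text):
    """Return the first number in area_text as a float, or None.

    Takes the first run of digits, commas and dots, then treats commas
    as decimal points, as the detail-page parser does.
    """
    match = re.search(r"([\d,\.]+)", area_text or "")
    if not match:
        return None
    try:
        return float(match.group(1).replace(",", "."))
    except ValueError:
        # e.g. "1.250,5" becomes "1.250.5", which is not a valid float
        return None


print(parse_area("82 m2"))       # 82.0
print(parse_area("73,5 m2"))     # 73.5
print(parse_area("1.250,5 m2"))  # None
print(parse_area("cca. 80 m2"))  # None (the stray "." is matched first)
```

The last two cases show where this pattern gives up: thousands separators and punctuation before the number both yield `None` instead of a value.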


In [7]:
def create_driver():
    print("[*] Initializing Firefox WebDriver...")
    try:
        options = Options()
        options.binary_location = firefox_binary
        options.add_argument("--headless")

        # ✅ New way to set pageLoadStrategy (Selenium 4+)
        options.set_capability("pageLoadStrategy", "none")

        service = Service(executable_path=geckodriver_binary)
        driver = webdriver.Firefox(service=service, options=options)
        driver.set_page_load_timeout(10)
        print("[+] WebDriver started successfully.")
        return driver
    except Exception as e:
        print(f"[!] Failed to start Firefox driver: {e}")
        return None


## Multithreaded Scraping Functions

We'll use ThreadPoolExecutor to run multiple Selenium instances in parallel. Each thread gets its own WebDriver instance to avoid conflicts.
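
The batching and fan-out pattern can be tried without a browser. In this sketch `make_batches` and `worker` are hypothetical stand-ins: a real worker would create its own WebDriver, scrape each URL in its batch, and quit the driver at the end. The URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 3


def make_batches(items, n_workers):
    """Split items into at most n_workers contiguous batches."""
    if not items:
        return []
    batch_size = len(items) // n_workers + 1
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def worker(batch):
    # Stand-in for the real worker, which would own one WebDriver instance.
    return [url.upper() for url in batch]


urls = [f"listing-{i}" for i in range(10)]  # placeholder URLs
batches = make_batches(urls, MAX_WORKERS)

results = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(worker, batch) for batch in batches]
    for future in as_completed(futures):
        results.extend(future.result())

print(len(batches), len(results))
```

Results are gathered in the main thread as each future completes, so no lock is needed while collecting them; the order of `results` depends on which batch finishes first.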

In [8]:
def scrape_listing(link, driver):
    """
    Scrape a single listing and return the data.
    Each thread will call this function with its own driver instance.
    """
    try:
        data = parse_detail_page(link, driver)
        if data:
            print(f"      ✔ Scraped: {link}")
        else:
            print(f"      ✖ Failed: {link}")
        time.sleep(random.uniform(*REQUEST_DELAY))
        return data
    except Exception as e:
        print(f"[!] Error scraping {link}: {e}")
        return None


def scrape_page_listings(page_num, driver):
    """
    Scrape all listings from a single search results page.
    Returns a list of listing URLs found on that page.
    """
    print(f"\n[+] Fetching search page {page_num}: {BASE_URL.format(page_num)}")
    html = fetch_page_source(BASE_URL.format(page_num), driver)
    
    if not html:
        print(f"[!] No HTML for page {page_num}, skipping.")
        return []
    
    try:
        soup = BeautifulSoup(html, "lxml")
        links = [urljoin("https://nekretnine.ba/", a["href"]) 
                for a in soup.find_all("a", href=re.compile(r"^real-estate\.php\?lang=ba&sel=nekretnine&view="))]
        
        print(f"  → Found {len(links)} listings on page {page_num}")
        
        if not links:
            print(f"[!] No links found on page {page_num}. Possible structure change?")
        
        return links
    except Exception as e:
        print(f"[!] Failed to parse search page {page_num} → {e}")
        return []


def scrape_with_threading():
    """
    Multithreaded scraping function.
    Creates multiple WebDriver instances and processes listings in parallel.
    """
    fieldnames = ["title", "url", "price_numeric", "municipality", "property_type", "ad_type", "rooms", "square_m2", "equipment", "description"]
    
    # Lock for the CSV write (a precaution: the write runs in the main thread after all workers finish)
    csv_lock = threading.Lock()
    
    # Create main driver for collecting listing URLs
    print("[*] Creating main driver for collecting listing URLs...")
    main_driver = create_driver()
    if not main_driver:
        print("[!] Failed to create main driver. Exiting.")
        return
    
    # Collect all listing URLs first
    print(f"[*] Collecting listing URLs from {MAX_PAGES} pages...")
    all_listing_urls = []
    
    for page in range(1, MAX_PAGES + 1):
        links = scrape_page_listings(page, main_driver)
        all_listing_urls.extend(links)
        time.sleep(random.uniform(1, 2))  # Small delay between pages
    
    main_driver.quit()
    print(f"\n[+] Collected {len(all_listing_urls)} total listings to scrape.")
    
    if not all_listing_urls:
        print("[!] No listings found. Exiting.")
        return
    
    # Prepare CSV file
    write_header = not os.path.exists(OUTPUT_CSV)
    
    def worker_scrape(url_batch):
        """Worker function that each thread will execute"""
        driver = create_driver()
        if not driver:
            print("[!] Failed to create worker driver")
            return []
        
        results = []
        for url in url_batch:
            data = scrape_listing(url, driver)
            if data:
                results.append(data)
        
        driver.quit()
        return results
    
    # Split listings into batches for each worker
    batch_size = len(all_listing_urls) // MAX_WORKERS + 1
    url_batches = [all_listing_urls[i:i + batch_size] for i in range(0, len(all_listing_urls), batch_size)]
    
    print(f"\n[*] Starting multithreaded scraping with {MAX_WORKERS} workers...")
    print(f"[*] Processing {len(url_batches)} batches...")
    
    # Use ThreadPoolExecutor for parallel scraping
    all_results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit all batches to thread pool
        futures = {executor.submit(worker_scrape, batch): i for i, batch in enumerate(url_batches)}
        
        # Process results as they complete
        for future in as_completed(futures):
            batch_num = futures[future]
            try:
                batch_results = future.result()
                all_results.extend(batch_results)
                print(f"[+] Batch {batch_num + 1}/{len(url_batches)} completed. Scraped {len(batch_results)} listings.")
            except Exception as e:
                print(f"[!] Batch {batch_num + 1} failed: {e}")
    
    # Write all results to CSV at once, from the main thread after every worker has finished
    print(f"\n[*] Writing {len(all_results)} results to CSV...")
    with csv_lock:
        with open(OUTPUT_CSV, "a" if not write_header else "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            if write_header:
                writer.writeheader()
                print(f"[+] Created new CSV file: {OUTPUT_CSV}")
            
            for data in all_results:
                writer.writerow(data)
    
    print(f"\n✅ Finished scraping. Data saved to: {OUTPUT_CSV}")
    print(f"✅ Total listings scraped: {len(all_results)}/{len(all_listing_urls)}")


def scrape():
    """Original single-threaded scraping function (kept for reference)"""
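    driver = create_driver()
    if not driver:
        # create_driver() returns None on failure; stop before touching the CSV
        print("[!] Failed to create driver. Exiting.")
        return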

    fieldnames = ["title", "url", "price_numeric", "municipality", "property_type", "ad_type", "rooms", "square_m2", "equipment", "description"]

    write_header = not os.path.exists(OUTPUT_CSV)
    with open(OUTPUT_CSV, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
            print(f"[+] Created new CSV file: {OUTPUT_CSV}")
        else:
            print(f"[+] Appending to existing CSV: {OUTPUT_CSV}")

        print(f"[*] Starting scraping of up to {MAX_PAGES} pages...")

        for page in range(1, MAX_PAGES + 1):
            print(f"\n[+] Fetching search page {page}: {BASE_URL.format(page)}")
            html = fetch_page_source(BASE_URL.format(page), driver)
            if not html:
                print(f"[!] No HTML for page {page}, skipping.")
                continue

            try:
                soup = BeautifulSoup(html, "lxml")

                links = [urljoin("https://nekretnine.ba/", a["href"]) for a in soup.find_all("a", href=re.compile(r"^real-estate\.php\?lang=ba&sel=nekretnine&view="))]
                
                print(f"  → Found {len(links)} listings on page {page}")

                if not links:
                    print(f"[!] No links found on page {page}. Possible structure change?")
                    continue

                for i, link in enumerate(links, start=1):
                    print(f"    [{i}/{len(links)}] Scraping listing: {link}")
                    try:
                        data = parse_detail_page(link, driver)
                        if data:
                            writer.writerow(data)
                            print("      ✔ Saved listing data to CSV.")
                        else:
                            print("      ✖ No data parsed, skipping.")
                        time.sleep(random.uniform(*REQUEST_DELAY))
                    except Exception as e:
                        print(f"[!] Error scraping {link}: {e}")
            except Exception as e:
                print(f"[!] Failed to parse search page {page} → {e}")

    driver.quit()
    print(f"\n✅ Finished scraping. Data saved to: {OUTPUT_CSV}")

## Run the Scraper

Choose which scraper to run:
- `scrape_with_threading()` - **Multithreaded version** (faster, uses 3 parallel browsers)
- `scrape()` - Single-threaded version (slower, but more stable)

In [9]:
# if __name__ == "__main__":
#     # Use multithreaded version for faster scraping
#     scrape_with_threading()
    
    # Or use single-threaded version (comment above, uncomment below)
    # scrape()

## 📊 Data Inspection

Let's load and inspect the scraped data from the CSV file.

In [10]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('../data/sarajevo_flats_nekretnine_cleaned.csv')

# Display basic information
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)
print(f"Total records: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print(f"\nColumn names: {list(df.columns)}")
print("\n" + "=" * 80)

DATASET OVERVIEW
Total records: 1545
Total columns: 9

Column names: ['title', 'url', 'price_numeric', 'municipality', 'rooms', 'square_m2', 'equipment', 'description', 'price_per_m2']



In [11]:
# Display first few rows
print("FIRST 5 ROWS:")
print("=" * 80)
df.head()

FIRST 5 ROWS:


Unnamed: 0,title,url,price_numeric,municipality,rooms,square_m2,equipment,description,price_per_m2
0,Sarajevo,https://nekretnine.ba/real-estate.php?lang=ba&...,,,Dvosoban,82.0,"Garaža, Balkon, Centralno grijanje, Telefonski...",Agencija za nekretnine Stanpromet.ba izdvaja p...,
1,"Sarajevo, Sarajevo – Stari grad",https://nekretnine.ba/real-estate.php?lang=ba&...,339000.0,,Četverosoban,94.0,"Plin, Telefonski priključak, Struja, Namješten...",Rental prodaje troiposoban salonski stan od 94...,3606.382979
2,Sarajevo,https://nekretnine.ba/real-estate.php?lang=ba&...,333000.0,,Dvosoban,73.0,"Centralno grijanje, Telefonski priključak, Str...","Realno, za ponudu najboljih nekretnina treba V...",4561.643836
3,Sarajevo,https://nekretnine.ba/real-estate.php?lang=ba&...,,,Dvosoban,81.0,"Garaža, Balkon, Centralno grijanje, Telefonski...",Stanpromet.ba agencija za nekretnine najavljuj...,
4,Sarajevo,https://nekretnine.ba/real-estate.php?lang=ba&...,,,Dvosoban,75.0,"Garaža, Balkon, Centralno grijanje, Telefonski...",Stanpromet.ba agencija za nekretnine najavljuj...,


In [12]:
# Display data types and missing values
print("DATA TYPES AND MISSING VALUES:")
print("=" * 80)
df.info()

DATA TYPES AND MISSING VALUES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1545 entries, 0 to 1544
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1545 non-null   object 
 1   url            1545 non-null   object 
 2   price_numeric  1033 non-null   float64
 3   municipality   817 non-null    object 
 4   rooms          1545 non-null   object 
 5   square_m2      1545 non-null   float64
 6   equipment      1474 non-null   object 
 7   description    1408 non-null   object 
 8   price_per_m2   1029 non-null   float64
dtypes: float64(3), object(6)
memory usage: 108.8+ KB


In [13]:
# Statistical summary of numeric columns
print("STATISTICAL SUMMARY (Numeric Columns):")
print("=" * 80)
df.describe()

STATISTICAL SUMMARY (Numeric Columns):


Unnamed: 0,price_numeric,square_m2,price_per_m2
count,1033.0,1545.0,1029.0
mean,258976.8,74.684142,3864.43122
std,212309.9,40.530484,6386.076703
min,1.0,0.0,0.006667
25%,126000.0,50.0,2093.333333
50%,201725.0,66.0,3100.0
75%,330000.0,87.0,4518.072289
max,1600000.0,300.0,192000.0


In [14]:
# Check for missing values per column
print("MISSING VALUES PER COLUMN:")
print("=" * 80)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage.round(2)
})
print(missing_df[missing_df['Missing Count'] > 0])
print("\n" + "=" * 80)

MISSING VALUES PER COLUMN:
               Missing Count  Percentage
price_numeric            512       33.14
municipality             728       47.12
equipment                 71        4.60
description              137        8.87
price_per_m2             516       33.40



In [15]:
# Price analysis
print("PRICE ANALYSIS:")
print("=" * 80)
print(f"Average Price: {df['price_numeric'].mean():.2f} KM")
print(f"Median Price: {df['price_numeric'].median():.2f} KM")
print(f"Min Price: {df['price_numeric'].min():.2f} KM")
print(f"Max Price: {df['price_numeric'].max():.2f} KM")
print(f"Standard Deviation: {df['price_numeric'].std():.2f} KM")
print("\n" + "=" * 80)

PRICE ANALYSIS:
Average Price: 258976.75 KM
Median Price: 201725.00 KM
Min Price: 1.00 KM
Max Price: 1600000.00 KM
Standard Deviation: 212309.90 KM



In [16]:
# Square meters analysis
print("SQUARE METERS ANALYSIS:")
print("=" * 80)
print(f"Average Area: {df['square_m2'].mean():.2f} m²")
print(f"Median Area: {df['square_m2'].median():.2f} m²")
print(f"Min Area: {df['square_m2'].min():.2f} m²")
print(f"Max Area: {df['square_m2'].max():.2f} m²")
print(f"Standard Deviation: {df['square_m2'].std():.2f} m²")
print("\n" + "=" * 80)

SQUARE METERS ANALYSIS:
Average Area: 74.68 m²
Median Area: 66.00 m²
Min Area: 0.00 m²
Max Area: 300.00 m²
Standard Deviation: 40.53 m²



In [17]:
# Filter out unrealistic property sizes (> 300 m²)
print("FILTERING UNREALISTIC PROPERTY SIZES:")
print("=" * 80)
print(f"Records before filtering: {len(df)}")

# Show properties that will be removed
large_properties = df[df['square_m2'] > 300]
if len(large_properties) > 0:
    print(f"\n⚠️ Found {len(large_properties)} properties with area > 300 m²:")
    # 'property_type' is not a column of the cleaned CSV loaded above, so it is not selected here
    print(large_properties[['title', 'square_m2', 'url']])
    
    # Remove properties with area > 300 m²
    df = df[df['square_m2'] <= 300]
    print(f"\n✅ Filtered out {len(large_properties)} properties")
else:
    print("\n✅ No properties with area > 300 m² found")

print(f"Records after filtering: {len(df)}")
print("\n" + "=" * 80)

FILTERING UNREALISTIC PROPERTY SIZES:
Records before filtering: 1545

✅ No properties with area > 300 m² found
Records after filtering: 1545



In [18]:
# Optionally save the cleaned data to a new CSV file
print("SAVING CLEANED DATA:")
print("=" * 80)
output_file = '../data/sarajevo_flats_nekretnine_cleaned.csv'
df.to_csv(output_file, index=False)
print(f"✅ Cleaned data saved to: {output_file}")
print(f"Total records saved: {len(df)}")
print("\n" + "=" * 80)

SAVING CLEANED DATA:
✅ Cleaned data saved to: ../data/sarajevo_flats_nekretnine_cleaned.csv
Total records saved: 1545



In [19]:
# Price per square meter analysis (filter out invalid data first)
print("PRICE PER SQUARE METER ANALYSIS:")
print("=" * 80)

# Check for zero or null square_m2 values
print(f"Properties with square_m2 = 0 or NaN: {((df['square_m2'] == 0) | df['square_m2'].isna()).sum()}")
print(f"Properties with price_numeric = 0 or NaN: {((df['price_numeric'] == 0) | df['price_numeric'].isna()).sum()}")

# Filter out properties with invalid data for price per m² calculation
valid_df = df[(df['square_m2'] > 0) & (df['price_numeric'] > 0) & df['square_m2'].notna() & df['price_numeric'].notna()].copy()

print(f"\nValid properties for price/m² analysis: {len(valid_df)}/{len(df)}")
print("=" * 80)

# Calculate price per m² only on valid data
valid_df['price_per_m2'] = valid_df['price_numeric'] / valid_df['square_m2']

print(f"\nAverage Price per m²: {valid_df['price_per_m2'].mean():.2f} KM/m²")
print(f"Median Price per m²: {valid_df['price_per_m2'].median():.2f} KM/m²")
print(f"Min Price per m²: {valid_df['price_per_m2'].min():.2f} KM/m²")
print(f"Max Price per m²: {valid_df['price_per_m2'].max():.2f} KM/m²")
print(f"Standard Deviation: {valid_df['price_per_m2'].std():.2f} KM/m²")

# Add price_per_m2 back to main dataframe
df['price_per_m2'] = df.apply(
    lambda row: row['price_numeric'] / row['square_m2'] 
    if (row['square_m2'] > 0 and row['price_numeric'] > 0) 
    else None, 
    axis=1
)

print("\n" + "=" * 80)

PRICE PER SQUARE METER ANALYSIS:
Properties with square_m2 = 0 or NaN: 5
Properties with price_numeric = 0 or NaN: 512

Valid properties for price/m² analysis: 1029/1545

Average Price per m²: 3864.43 KM/m²
Median Price per m²: 3100.00 KM/m²
Min Price per m²: 0.01 KM/m²
Max Price per m²: 192000.00 KM/m²
Standard Deviation: 6386.08 KM/m²
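
The minimum (0.01 KM/m²) and maximum (192000.00 KM/m²) suggest placeholder prices or mistyped areas among the valid rows. One possible follow-up, sketched here in plain Python so it runs without the dataset, is a simple range check. The helper name `is_plausible` and the bounds `low=500.0` and `high=15000.0` are illustrative assumptions, not values from the notebook; the sample list reuses the rounded minimum, quartiles, and maximum of `price_per_m2` from the statistical summary.

```python
def is_plausible(price_per_m2, low=500.0, high=15000.0):
    """True when the value lies inside [low, high]; the bounds are illustrative."""
    return price_per_m2 is not None and low <= price_per_m2 <= high


# Rounded min, 25%, 50%, 75% and max of price_per_m2 from the summary table
sample = [0.01, 2093.33, 3100.0, 4518.07, 192000.0]
kept = [v for v in sample if is_plausible(v)]
print(kept)  # [2093.33, 3100.0, 4518.07]
```

With pandas the same idea would be a boolean mask over the `price_per_m2` column; the thresholds would need to be chosen from knowledge of the local market.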



In [20]:
# Check for duplicate records
print("DUPLICATE RECORDS CHECK:")
print("=" * 80)
duplicates = df.duplicated(subset=['url']).sum()
print(f"Number of duplicate URLs: {duplicates}")

if duplicates > 0:
    print("\nDuplicate URLs found:")
    print(df[df.duplicated(subset=['url'], keep=False)][['title', 'url', 'price_numeric']])
else:
    print("No duplicate URLs found!")
    
print("\n" + "=" * 80)

DUPLICATE RECORDS CHECK:
Number of duplicate URLs: 0
No duplicate URLs found!



## 📋 Complete Categorical Values Analysis

Let's examine all possible values for each categorical column in the dataset.

In [21]:
# Identify categorical columns (object dtype or low cardinality numeric columns)
print("=" * 80)
print("ALL UNIQUE VALUES FOR CATEGORICAL COLUMNS")
print("=" * 80)

# Get all columns
all_columns = df.columns.tolist()

# Separate numeric and non-numeric columns
categorical_cols = []
numeric_cols = []

for col in all_columns:
    if df[col].dtype == 'object':
        categorical_cols.append(col)
    elif df[col].dtype in ['int64', 'float64']:
        # Check if it's a low cardinality numeric column (might be categorical)
        unique_count = df[col].nunique()
        if unique_count <= 20:  # Threshold for categorical numeric columns
            categorical_cols.append(col)
        else:
            numeric_cols.append(col)
    else:
        categorical_cols.append(col)

# remove identifier columns from categorical list
categorical_cols = [c for c in categorical_cols if c not in ('title', 'url', 'description', 'equipment')]
print(f"\nFound {len(categorical_cols)} categorical columns")
print(f"Categorical columns: {categorical_cols}")
print(f"\nNumeric columns (excluded): {numeric_cols}")
print(f"\nText columns (excluded): title, url, description, equipment")
print("\n" + "=" * 80)

ALL UNIQUE VALUES FOR CATEGORICAL COLUMNS

Found 2 categorical columns
Categorical columns: ['municipality', 'rooms']

Numeric columns (excluded): ['price_numeric', 'square_m2', 'price_per_m2']

Text columns (excluded): title, url, description, equipment



In [22]:
# Display all unique values for each categorical column
print("=" * 80)
print("ALL POSSIBLE VALUES FOR EACH CATEGORICAL COLUMN")
print("=" * 80)

for col in categorical_cols:
    print(f"\n{'='*80}")
    print(f"📌 COLUMN: {col.upper()}")
    print(f"{'='*80}")
    
    # Get unique values (excluding NaN)
    unique_values = df[col].dropna().unique()
    unique_count = len(unique_values)
    
    print(f"Total unique values: {unique_count}")
    print(f"\nAll possible values:")
    print("-" * 80)
    
    # Sort values for better readability
    if df[col].dtype in ['int64', 'float64']:
        sorted_values = sorted(unique_values)
    else:
        sorted_values = sorted(unique_values, key=lambda x: str(x))
    
    # Display all unique values in a clean list
    for i, value in enumerate(sorted_values, 1):
        print(f"{i:3d}. {value}")

print("\n" + "=" * 80)
print("✅ ANALYSIS COMPLETE")
print("=" * 80)

ALL POSSIBLE VALUES FOR EACH CATEGORICAL COLUMN

📌 COLUMN: MUNICIPALITY
Total unique values: 655

All possible values:
--------------------------------------------------------------------------------
  1. -
  2. 12.mart
  3. A.B.
  4. AVDE SMAJLOVIÄ†A 23
  5. Adema Buće
  6. Adema Buče 106
  7. Adija Mulabegovica
  8. Adija Mulabegovića
  9. Adžemovića
 10. Ahatovićka
 11. Ahmeda Bošnjaka
 12. Ajdinovići
 13. Akademika Petra Mandića
 14. Aleksandra Puskina
 15. Aleksandra Puskina 29
 16. Alifakovac
 17. Alipasina
 18. Alipasino polje, faza B
 19. Alipašina 11
 20. Anrdreja Andrejevića
 21. Ante Babića
 22. Ante Babića 3-5
 23. Antuna Branka Simica 10
 24. Antuna Branka Šimića
 25. Antuna Hangija
 26. Antuna Hangija 3
 27. Apartman Bjelašnica, novogradnja, 20m2
 28. Apartman na dvije etaže Jahoirina-Dvorišta, 53m2
 29. Apartman na dvije etaže u novogradnji Bjelašnica, 66m2
 30. Apartman sa dvije spavaće sobe Ravna planina, 42m2
 31. Apartman sa garažom u novo

## 🗺️ Municipality Standardization

Let's check if addresses or neighborhoods in the data can be mapped to the 9 standard municipalities.
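
One way such a mapping could be applied is a case-insensitive, diacritic-insensitive substring lookup. This is a sketch under stated assumptions: the helper names `normalize` and `guess_municipality` are hypothetical, substring matching is only one possible strategy, and the three mapping entries are a small excerpt chosen for illustration.

```python
import unicodedata


def normalize(text):
    """Lowercase and strip diacritics so 'Mejtaš' and 'MEJTAS' compare equal.

    'đ' has no Unicode decomposition, so it is replaced by 'd' explicitly
    (an assumption; some listings may spell it 'dj' instead).
    """
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Small illustrative excerpt of a neighborhood-to-municipality mapping
NEIGHBORHOODS = {
    "Mejtaš": "Sarajevo - Centar",
    "Grbavica": "Sarajevo - Novo Sarajevo",
    "Alipašino Polje": "Sarajevo - Novi Grad",
}
LOOKUP = {normalize(name): target for name, target in NEIGHBORHOODS.items()}


def guess_municipality(address):
    """Return the municipality whose neighborhood name occurs in address, else None."""
    haystack = normalize(address)
    # Try longer names first so a specific key wins over a shorter one.
    for key in sorted(LOOKUP, key=len, reverse=True):
        if key in haystack:
            return LOOKUP[key]
    return None


print(guess_municipality("MEJTAS 12"))                # Sarajevo - Centar
print(guess_municipality("Alipasino polje, faza B"))  # Sarajevo - Novi Grad
print(guess_municipality("Stupska bb"))               # None
```

Normalizing both sides reduces the need to list spelling variants of the same neighborhood separately, at the cost of occasional false matches when a short key happens to appear inside an unrelated street name.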

In [23]:
# Step 1: Analyze current municipality values and look for patterns
print("=" * 80)
print("CURRENT MUNICIPALITY ANALYSIS")
print("=" * 80)

if 'municipality' in df.columns:
    print(f"\nTotal unique municipality values: {df['municipality'].nunique()}")
    print(f"Missing municipality values: {df['municipality'].isnull().sum()}")
    
    print("\n" + "-" * 80)
    print("ALL MUNICIPALITY VALUES (with counts):")
    print("-" * 80)
    
    municipality_counts = df['municipality'].value_counts()
    for idx, (municipality, count) in enumerate(municipality_counts.items(), 1):
        print(f"{idx:3d}. {municipality:<50} ({count} records)")
    
    # Define target municipalities
    target_municipalities = [
        'Hadžići',
        'Ilidža',
        'Ilijaš',
        'Sarajevo - Centar',
        'Sarajevo - Novi Grad',
        'Sarajevo - Novo Sarajevo',
        'Sarajevo - Stari Grad',
        'Trnovo',
        'Vogošća'
    ]
    
    print("\n" + "=" * 80)
    print("TARGET MUNICIPALITIES (from reference):")
    print("=" * 80)
    for idx, municipality in enumerate(target_municipalities, 1):
        count = (df['municipality'] == municipality).sum()
        print(f"{idx:3d}. {municipality:<30} ({count} records in dataset)")
    
    # Check which values don't match target municipalities
    non_matching = df[~df['municipality'].isin(target_municipalities) & df['municipality'].notna()]
    
    if len(non_matching) > 0:
        print("\n" + "=" * 80)
        print(f"⚠️ FOUND {len(non_matching)} RECORDS WITH NON-STANDARD MUNICIPALITY VALUES")
        print("=" * 80)
        print("\nThese need to be mapped to one of the 9 target municipalities:")
        non_matching_counts = non_matching['municipality'].value_counts()
        for municipality, count in non_matching_counts.items():
            print(f"  - {municipality:<50} ({count} records)")
    else:
        print("\n✅ All municipality values already match the target municipalities!")
        
else:
    print("❌ No 'municipality' column found")

print("\n" + "=" * 80)

CURRENT MUNICIPALITY ANALYSIS

Total unique municipality values: 655
Missing municipality values: 728

--------------------------------------------------------------------------------
ALL MUNICIPALITY VALUES (with counts):
--------------------------------------------------------------------------------
  1. Breka                                              (14 records)
  2. -                                                  (10 records)
  3. Semira Fraste                                      (6 records)
  4. Put Mladih Muslimana 2                             (6 records)
  5. Skenderpašina 20                                   (6 records)
  6. Barska                                             (6 records)
  7. Stupska bb                                         (6 records)
  8. Ferhadija                                          (5 records)
  9. Grbavička                                          (5 records)
 10. Olimpijska                                         (5 records)
 11. Himze P

In [24]:
# Step 2: Create a comprehensive mapping of neighborhoods to municipalities
print("=" * 80)
print("CREATING NEIGHBORHOOD-TO-MUNICIPALITY MAPPING")
print("=" * 80)

# Define comprehensive mapping based on Sarajevo Canton geography
# This maps neighborhoods, areas, and alternate names to their parent municipalities
neighborhood_to_municipality = {
    # Sarajevo - Centar (Central Sarajevo)
    'Centar': 'Sarajevo - Centar',
    'Marijin Dvor': 'Sarajevo - Centar',
    'Skenderija': 'Sarajevo - Centar',
    'Mejtas': 'Sarajevo - Centar',
    'Mejtaš': 'Sarajevo - Centar',
    'mejtaš': 'Sarajevo - Centar',
    'Džidžikovac': 'Sarajevo - Centar',
    'Bjelave': 'Sarajevo - Centar',
    'Čobanija': 'Sarajevo - Centar',
    'Šip': 'Sarajevo - Centar',
    'Pearl-Šip': 'Sarajevo - Centar',
    'Parl-Šip': 'Sarajevo - Centar',
    'Koševo': 'Sarajevo - Centar',
    'Koševsko brdo': 'Sarajevo - Centar',
    'Drvenija': 'Sarajevo - Centar',
    'Ferhadija': 'Sarajevo - Centar',
    'Breka': 'Sarajevo - Centar',
    'Soukbunar': 'Sarajevo - Centar',
    
    # Sarajevo - Stari Grad (Old Town)
    'Stari Grad': 'Sarajevo - Stari Grad',
    'Baščaršija': 'Sarajevo - Stari Grad',
    'Alifakovac': 'Sarajevo - Stari Grad',
    'Jekovac': 'Sarajevo - Stari Grad',
    'Kovači': 'Sarajevo - Stari Grad',
    'Vratnik': 'Sarajevo - Stari Grad',
    'Sedrenik': 'Sarajevo - Stari Grad',
    'Hrid': 'Sarajevo - Stari Grad',
    'Bistrik': 'Sarajevo - Stari Grad',
    
    # Sarajevo - Novo Sarajevo (New Sarajevo)
    'Novo Sarajevo': 'Sarajevo - Novo Sarajevo',
    'Grbavica': 'Sarajevo - Novo Sarajevo',
    'Dolac Malta': 'Sarajevo - Novo Sarajevo',
    'Ciglane': 'Sarajevo - Novo Sarajevo',
    'Hrasno': 'Sarajevo - Novo Sarajevo',
    'Velešići': 'Sarajevo - Novo Sarajevo',
    'Kovačići': 'Sarajevo - Novo Sarajevo',
    'Kovacici': 'Sarajevo - Novo Sarajevo',
    'Vraca': 'Sarajevo - Novo Sarajevo',
    'Zmaja od Bosne': 'Sarajevo - Novo Sarajevo',
    'Pofalići': 'Sarajevo - Novo Sarajevo',
    'Socijalno': 'Sarajevo - Novo Sarajevo',
    'Robot Socijalno': 'Sarajevo - Novo Sarajevo',
    'Sarajevo Tower': 'Sarajevo - Novo Sarajevo',
    
    # Sarajevo - Novi Grad (New City)
    'Novi Grad': 'Sarajevo - Novi Grad',
    'Švrakino Selo': 'Sarajevo - Novi Grad',
    'Alipašino Polje': 'Sarajevo - Novi Grad',
    'Alipašino': 'Sarajevo - Novi Grad',
    'Čengić Vila': 'Sarajevo - Novi Grad',
    'Zabrđe': 'Sarajevo - Novi Grad',
    'Stupsko Brdo': 'Sarajevo - Novi Grad',
    'Buća Potok': 'Sarajevo - Novi Grad',
    'Vojničko polje': 'Sarajevo - Novi Grad',
    'Mali Bosmal': 'Sarajevo - Novi Grad',
    'Fra Antuna Kneževića': 'Sarajevo - Novi Grad',
    'Aerodromsko naslje': 'Sarajevo - Novi Grad',
    'Miljacka': 'Sarajevo - Novi Grad',
    'Bulevar': 'Sarajevo - Novi Grad',
    'Teheranski trg': 'Sarajevo - Novi Grad',
    'Mojmilo': 'Sarajevo - Novi Grad',
    'Dobrinja': 'Sarajevo - Novi Grad',
    'Otoka': 'Sarajevo - Novi Grad',
    'Aneks': 'Sarajevo - Novi Grad',
    
    
    # Ilidža
    'Ilidža': 'Ilidža',
    'Butmir': 'Ilidža',
    'Sokolović Kolonija': 'Ilidža',
    'Otes': 'Ilidža',
    'Pejton': 'Ilidža',
    'Hrasnica': 'Ilidža',
    'Blažuj': 'Ilidža',
    'Lužani': 'Ilidža',
    'Luzani': 'Ilidža',
    'luzani': 'Ilidža',
    'Pijacna': 'Ilidža',
    'Stup': 'Ilidža',
    'Gray Residence': 'Ilidža',

    # Hadžići
    'Hadžići': 'Hadžići',
    'Pazarić': 'Hadžići',
    'Tarčin': 'Hadžići',
    'Toviš': 'Hadžići',

    # Vogošća
    'Vogošća': 'Vogošća',
    'Semizovac': 'Vogošća',
    'Kobilja Glava': 'Vogošća',
    'Vogoscanskih odreda': 'Vogošća',
    'Vogošćanskih odreda': 'Vogošća',
    'Hotonj': 'Vogošća',

    # Ilijaš
    'Ilijaš': 'Ilijaš',
    'Podlugovi': 'Ilijaš',

    # Trnovo
    'Trnovo': 'Trnovo',
    'Trnovo - Bjelašnica': 'Trnovo',
    'Bjelašnica': 'Trnovo',
    'Artes Bjelašnica': 'Trnovo',
    
    # NOTE: Some locations are outside Sarajevo Canton:
    # - Lukavica (East Sarajevo - not in Sarajevo Canton)
    # - Jahorina (Pale municipality - not in Sarajevo Canton)
    # - Ravna planina (part of Jahorina area)
    # - Makarska (coastal city in Croatia)
    # - Visoko (separate municipality, not in Sarajevo Canton)
    # These will remain unmapped as they don't belong to the 9 target municipalities
    
    # Street names that can be mapped based on known locations:
    'Džemala Bijedića': 'Sarajevo - Novi Grad',
    'Dr. Silve Rizvanbegović': 'Sarajevo - Centar',
    'Dr.Silve Rizvanbegovic': 'Ilidža',
    'Silve Rizvanbegovic': 'Ilidža',
    'Josipa Slavenskog': 'Ilidža',
    'Mesa Selimovic': 'Sarajevo - Novi Grad',
    'Svetozara Ćorovića': 'Sarajevo - Centar',
    'Semira Fraste': 'Sarajevo - Novi Grad',
    'F. Becirbegovica': 'Sarajevo - Novo Sarajevo',
    'Tome Mendesa': 'Vogošća',
    'Ante Babića': 'Sarajevo - Novi Grad',
    'Ibrahima Ljubovica': 'Ilidža',
    'Samira Catovica Kobre': 'Ilidža',
    'Ramiza Jasara': 'Ilidža',
    'ramiza jasara': 'Ilidža',
    'Karla Malya': 'Ilidža',
    'Trg solidarnosti': 'Sarajevo - Novi Grad',
    'Nikole Sopa': 'Ilidža',
    'Skendera Kulenovica': 'Sarajevo - Stari Grad',
    'Hifzi Bjelavca': 'Ilidža',
    'Bešarevića': 'Sarajevo - Centar',
    'Samin gaj': 'Ilidža',
    'Slatina': 'Sarajevo - Centar',
    'Kod OHR-a': 'Sarajevo - Centar',
    'Latička': 'Ilidža',
    'Stupska': 'Ilidža',
    'p.o.zvijezda': 'Vogošća',
}

print(f"\n✅ Created mapping with {len(neighborhood_to_municipality)} neighborhood entries")
print(f"   Mapping to {len(set(neighborhood_to_municipality.values()))} municipalities")

print("\n" + "-" * 80)
print("SAMPLE MAPPINGS:")
print("-" * 80)
for i, (neighborhood, municipality) in enumerate(list(neighborhood_to_municipality.items())[:10], 1):
    print(f"{i:3d}. {neighborhood:<30} → {municipality}")

print("\n" + "=" * 80)

CREATING NEIGHBORHOOD-TO-MUNICIPALITY MAPPING

✅ Created mapping with 116 neighborhood entries
   Mapping to 9 municipalities

--------------------------------------------------------------------------------
SAMPLE MAPPINGS:
--------------------------------------------------------------------------------
  1. Centar                         → Sarajevo - Centar
  2. Marijin Dvor                   → Sarajevo - Centar
  3. Skenderija                     → Sarajevo - Centar
  4. Mejtas                         → Sarajevo - Centar
  5. Mejtaš                         → Sarajevo - Centar
  6. mejtaš                         → Sarajevo - Centar
  7. Džidžikovac                    → Sarajevo - Centar
  8. Bjelave                        → Sarajevo - Centar
  9. Čobanija                       → Sarajevo - Centar
 10. Šip                            → Sarajevo - Centar
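The mapping dictionary lists several spellings of the same place by hand (`Mejtas`, `Mejtaš`, `mejtaš`; `Lužani`, `Luzani`, `luzani`). One possible way to reduce that maintenance, assuming matching should ignore both case and diacritics, is to normalize the keys and the searched text with the same function. This is only a sketch: `normalize_place`, `sample_mapping`, and `lookup` are hypothetical names that do not appear in the notebook, and the three entries are a toy subset of the real dictionary.

```python
import unicodedata


def normalize_place(text: str) -> str:
    """Lowercase a place name and strip diacritics (hypothetical helper)."""
    # 'đ' and 'Đ' have no Unicode decomposition, so map them by hand first
    text = text.replace('đ', 'd').replace('Đ', 'D')
    # NFKD splits letters such as 'š' into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


# Toy subset of the mapping, for illustration only
sample_mapping = {
    'Mejtaš': 'Sarajevo - Centar',
    'Lužani': 'Ilidža',
    'Zabrđe': 'Sarajevo - Novi Grad',
}

# Normalize the keys once; the municipality labels stay untouched
normalized_mapping = {normalize_place(k): v for k, v in sample_mapping.items()}


def lookup(name: str):
    """Return the municipality for a place name, or None if it is unknown."""
    return normalized_mapping.get(normalize_place(name))


print(lookup('mejtas'))
print(lookup('LUZANI'))
```

With this approach one canonical key per place would be enough, at the cost of treating names that differ only in diacritics as identical.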



In [25]:
# Step 3: Check if title or municipality contains neighborhood keywords
print("=" * 80)
print("SEARCHING FOR NEIGHBORHOODS IN MUNICIPALITY AND TITLE COLUMNS")
print("=" * 80)

if 'municipality' in df.columns:
    # Create a copy of the original municipality column for comparison
    df['municipality_original'] = df['municipality'].copy()
    
    found_mappings = []
    unmapped_records = []
    
    # The set of target municipalities does not change, so build it once up front
    target_municipalities = set(neighborhood_to_municipality.values())
    
    for idx, row in df.iterrows():
        municipality_value = str(row['municipality']) if pd.notna(row['municipality']) else ''
        title_value = str(row['title']) if pd.notna(row['title']) else ''
        
        # Combine both for searching
        search_text = f"{municipality_value} {title_value}".lower()
        
        # Try to find a matching neighborhood (substring match; the first dictionary key found wins)
        matched = False
        for neighborhood, target_municipality in neighborhood_to_municipality.items():
            if neighborhood.lower() in search_text:
                # Found a match!
                if pd.isna(row['municipality']) or row['municipality'] != target_municipality:
                    found_mappings.append({
                        'index': idx,
                        'original_municipality': row['municipality'],
                        'title': title_value,  # always a string, so later slicing cannot fail on NaN
                        'found_neighborhood': neighborhood,
                        'mapped_to': target_municipality
                    })
                    # Update the municipality
                    df.at[idx, 'municipality'] = target_municipality
                matched = True
                break
        
        # If no match found and municipality is not one of the target 9
        if not matched and pd.notna(row['municipality']) and row['municipality'] not in target_municipalities:
            unmapped_records.append({
                'index': idx,
                'municipality': row['municipality'],
                'title': title_value
            })
    
    print(f"\n✅ FOUND {len(found_mappings)} records with neighborhood keywords")
    
    if len(found_mappings) > 0:
        print("\n" + "-" * 80)
        print("SAMPLE OF MAPPED RECORDS (first 10):")
        print("-" * 80)
        for i, mapping in enumerate(found_mappings[:10], 1):
            print(f"\n{i}. Found '{mapping['found_neighborhood']}' → Mapped to '{mapping['mapped_to']}'")
            print(f"   Original: {mapping['original_municipality']}")
            print(f"   Title: {mapping['title'][:80]}...")
    
    if len(unmapped_records) > 0:
        print("\n" + "=" * 80)
        print(f"⚠️ STILL HAVE {len(unmapped_records)} UNMAPPED RECORDS")
        print("=" * 80)
        print("\nThese don't match any known neighborhood:")
        
        # Show unique unmapped municipalities
        unique_unmapped = {}
        for record in unmapped_records:
            mun = record['municipality']
            if mun not in unique_unmapped:
                unique_unmapped[mun] = []
            unique_unmapped[mun].append(record['title'])
        
        for mun, titles in unique_unmapped.items():
            print(f"\n  {mun} ({len(titles)} records)")
            print(f"    Sample title: {titles[0][:80]}...")
    else:
        print("\n✅ All records successfully mapped!")

else:
    print("❌ No 'municipality' column found")

print("\n" + "=" * 80)

SEARCHING FOR NEIGHBORHOODS IN MUNICIPALITY AND TITLE COLUMNS

✅ FOUND 881 records with neighborhood keywords

--------------------------------------------------------------------------------
SAMPLE OF MAPPED RECORDS (first 10):
--------------------------------------------------------------------------------

1. Found 'Stari Grad' → Mapped to 'Sarajevo - Stari Grad'
   Original: nan
   Title: Sarajevo, Sarajevo – Stari grad...

2. Found 'Centar' → Mapped to 'Sarajevo - Centar'
   Original: nan
   Title: Sarajevo, Sarajevo – Centar...

3. Found 'Novi Grad' → Mapped to 'Sarajevo - Novi Grad'
   Original: nan
   Title: Sarajevo, Sarajevo – Novi grad...

4. Found 'Centar' → Mapped to 'Sarajevo - Centar'
   Original: nan
   Title: Sarajevo, Sarajevo – Centar...

5. Found 'Centar' → Mapped to 'Sarajevo - Centar'
   Original: nan
   Title: Sarajevo, Sarajevo – Centar...

6. Found 'Centar' → Mapped to 'Sarajevo - Centar'
   Original: nan
   Title: Sarajevo, Sarajevo –
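The search in this step uses plain substring matching (`neighborhood.lower() in search_text`) and stops at the first dictionary key that matches, so short keys such as `Stup` or `Hrid` can also fire inside longer, unrelated words. A stricter variant, sketched here as an assumption rather than as the notebook's method, compiles all keys into one regular expression with word boundaries and tries longer keys first. `build_matcher` and the three-entry `toy_mapping` are hypothetical.

```python
import re


def build_matcher(mapping):
    """Return a function that maps free text to a municipality (hypothetical helper)."""
    # Longest keys first, so 'Stupsko Brdo' is tried before 'Stup' at the same position
    keys = sorted(mapping, key=len, reverse=True)
    alternatives = '|'.join(re.escape(k.lower()) for k in keys)
    # (?<!\w) and (?!\w) require the key to stand alone as a word or phrase
    regex = re.compile(r'(?<!\w)(' + alternatives + r')(?!\w)')
    lowered = {k.lower(): v for k, v in mapping.items()}

    def match(text):
        found = regex.search(text.lower())
        return lowered[found.group(1)] if found else None

    return match


# Toy subset of the mapping, for illustration only
toy_mapping = {
    'Stup': 'Ilidža',
    'Stupsko Brdo': 'Sarajevo - Novi Grad',
    'Hrid': 'Sarajevo - Stari Grad',
}
match = build_matcher(toy_mapping)

print(match('Dvosoban stan, Stupsko brdo, 54 m2'))
print(match('Visok stupanj opremljenosti'))
```

The trade-off is that inflected forms (for example a locative such as "na Stupu") no longer match, so they would need their own keys.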

In [26]:
# Step 4: Show final municipality distribution after mapping
print("=" * 80)
print("FINAL MUNICIPALITY DISTRIBUTION AFTER MAPPING")
print("=" * 80)

if 'municipality' in df.columns:
    print("\nMunicipality Value Counts:")
    print("-" * 80)
    
    final_counts = df['municipality'].value_counts()
    for idx, (municipality, count) in enumerate(final_counts.items(), 1):
        percentage = (count / len(df)) * 100
        print(f"{idx:3d}. {municipality:<35} {count:5d} records ({percentage:5.2f}%)")
    
    # Check how many were successfully mapped
    target_municipalities = [
        'Hadžići', 'Ilidža', 'Ilijaš', 'Sarajevo - Centar',
        'Sarajevo - Novi Grad', 'Sarajevo - Novo Sarajevo',
        'Sarajevo - Stari Grad', 'Trnovo', 'Vogošća'
    ]
    
    mapped_count = df[df['municipality'].isin(target_municipalities)].shape[0]
    unmapped_count = df[~df['municipality'].isin(target_municipalities) & df['municipality'].notna()].shape[0]
    missing_count = df['municipality'].isnull().sum()
    
    print("\n" + "=" * 80)
    print("MAPPING SUMMARY:")
    print("=" * 80)
    print(f"✅ Mapped to target municipalities: {mapped_count} ({mapped_count/len(df)*100:.2f}%)")
    print(f"⚠️  Still unmapped (non-standard):   {unmapped_count} ({unmapped_count/len(df)*100:.2f}%)")
    print(f"❌ Missing municipality data:       {missing_count} ({missing_count/len(df)*100:.2f}%)")
    print(f"\n📊 Total records: {len(df)}")
    
    # Show what changed; rows that are missing in both columns are not counted,
    # because a plain `!=` treats NaN as different from NaN
    if 'municipality_original' in df.columns:
        both_missing = df['municipality'].isna() & df['municipality_original'].isna()
        changed = ((df['municipality'] != df['municipality_original']) & ~both_missing).sum()
        print(f"\n🔄 Updated {changed} records with neighborhood mapping")
    
else:
    print("❌ No 'municipality' column found")

print("\n" + "=" * 80)

FINAL MUNICIPALITY DISTRIBUTION AFTER MAPPING

Municipality Value Counts:
--------------------------------------------------------------------------------
  1. Sarajevo - Centar                     326 records (21.10%)
  2. Sarajevo - Novi Grad                  180 records (11.65%)
  3. Sarajevo - Novo Sarajevo              136 records ( 8.80%)
  4. Ilidža                                113 records ( 7.31%)
  5. Sarajevo - Stari Grad                  59 records ( 3.82%)
  6. Trnovo                                 26 records ( 1.68%)
  7. Vogošća                                25 records ( 1.62%)
  8. Hadžići                                12 records ( 0.78%)
  9. Ilijaš                                  4 records ( 0.26%)
 10. Trosoban apartman u sklopu hotela Vučko Jahorina, 59m2, #58     1 records ( 0.06%)
 11. Trosoban stan u novogradnji Lukavica, 96 m2, #13530     1 records ( 0.06%)
 12. Apartman sa spavaćom sobom Ravna planina, 36 m2     1 records ( 0.06%)
 13. Dvosoban apa
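A note on counting updated records: when two columns both contain missing values, an element-wise `!=` in pandas reports `NaN != NaN` as `True`, so rows that are missing before and after would be counted as changed. The small example below, using made-up values and a hypothetical `count_changed` helper, shows the difference between the naive and the NaN-aware count.

```python
import numpy as np
import pandas as pd


def count_changed(before: pd.Series, after: pd.Series) -> int:
    """Count rows whose value differs, ignoring rows missing in both series."""
    both_missing = before.isna() & after.isna()
    differs = before.ne(after) & ~both_missing
    return int(differs.sum())


# Made-up values: row 0 is missing in both, rows 1 and 2 really changed
before = pd.Series([np.nan, 'Breka', np.nan, 'Ilidža'])
after = pd.Series([np.nan, 'Sarajevo - Centar', 'Trnovo', 'Ilidža'])

naive = int((before != after).sum())   # also counts the NaN/NaN row
robust = count_changed(before, after)  # counts only real changes

print(f"naive: {naive}, robust: {robust}")
```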

In [27]:
# Step 5: Remove records outside Sarajevo Canton
print("=" * 80)
print("REMOVING RECORDS OUTSIDE SARAJEVO CANTON")
print("=" * 80)

# Define keywords for locations outside Sarajevo Canton
outside_canton_keywords = [
    'lukavica', 'lukavici', 'jahorina', 'jahoirina', 'ravna planina', 
    'makarska', 'visoko', 'dvorišta', 'vučko', 'pahulja', 
    'olovske luke', 'ajdinovići', 'azapovići', 'ponijeri', 'podvisoki', 
    'homolj', 'prhinje', 'boška jugovića', 'dr. džananovića', 
    'spasovdanska', 'srpskih vladara', 'bukova ravan', 'tušnjići', 
    'bešagića visoko'
]

print(f"\nSearching for {len(outside_canton_keywords)} location keywords...")
print(f"Keywords: {', '.join(outside_canton_keywords[:5])}... (and {len(outside_canton_keywords) - 5} more)")

# Track which records to remove
records_to_remove = []
keyword_counts = {keyword: [] for keyword in outside_canton_keywords}

initial_count = len(df)

# Search in both municipality and title columns (case-insensitive)
for idx, row in df.iterrows():
    municipality_text = str(row['municipality']).lower() if pd.notna(row['municipality']) else ''
    title_text = str(row['title']).lower() if pd.notna(row['title']) else ''
    search_text = f"{municipality_text} {title_text}"
    
    for keyword in outside_canton_keywords:
        if keyword in search_text:
            records_to_remove.append(idx)
            keyword_counts[keyword].append({
                'index': idx,
                'municipality': row['municipality'],
                'title': str(row['title'])[:80]  # str() keeps slicing safe if the title is NaN
            })
            break  # Only count each record once

print("\n" + "-" * 80)
print("FOUND RECORDS TO REMOVE:")
print("-" * 80)

# Show breakdown by keyword
records_found = 0
for keyword, records in keyword_counts.items():
    if len(records) > 0:
        records_found += len(records)
        print(f"\n🔍 Keyword '{keyword}': {len(records)} records")
        for i, record in enumerate(records[:3], 1):  # Show first 3 examples
            print(f"   {i}. {record['municipality']} - {record['title']}")
        if len(records) > 3:
            print(f"   ... and {len(records) - 3} more")

if records_found > 0:
    # Remove the records
    indices_to_drop = list(set(records_to_remove))  # Remove duplicates
    df.drop(indices_to_drop, inplace=True)
    df.reset_index(drop=True, inplace=True)
    
    final_count = len(df)
    removed_count = initial_count - final_count
    
    print("\n" + "=" * 80)
    print("REMOVAL SUMMARY:")
    print("=" * 80)
    print(f"✅ Removed {removed_count} records outside Sarajevo Canton")
    print(f"📊 Records before: {initial_count}")
    print(f"📊 Records after:  {final_count}")
    print(f"📈 Removed: {(removed_count/initial_count)*100:.2f}%")
else:
    print("\n✅ No records found outside Sarajevo Canton")

print("\n" + "=" * 80)

REMOVING RECORDS OUTSIDE SARAJEVO CANTON

Searching for 24 location keywords...
Keywords: lukavica, lukavici, jahorina, jahoirina, ravna planina... (and 19 more)

--------------------------------------------------------------------------------
FOUND RECORDS TO REMOVE:
--------------------------------------------------------------------------------

🔍 Keyword 'lukavica': 10 records
   1. Trosoban stan u novogradnji Lukavica, 96 m2, #13530 - Sarajevo
   2. Dvosoban namješten stan novogradnja Lukavica, 39 m2 - Sarajevo
   3. Četvorosoban stan Lukavica, 85 m2 - Sarajevo
   ... and 7 more

🔍 Keyword 'lukavici': 1 records
   1. Trosoban stan u Lukavici, 102 m2 - Sarajevo

🔍 Keyword 'jahorina': 9 records
   1. Dvosoban apartman Apart Hotel "Pahulja" Jahorina, 39m2 - Sarajevo
   2. Trosoban apartman u sklopu hotela Vučko Jahorina, 59m2, #58 - Sarajevo
   3. Dvosoban apartman Apart Hotel "Pahulja" Jahorina, 36m2 - Sarajevo
   ... and 6 more

🔍 Keyword 'jahoirina': 1 records
   1.
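The keyword filter in this step can also be written as a single vectorized mask instead of a row-by-row loop. The sketch below is an assumption about how that might look, not the notebook's implementation: `outside_canton_mask` and its `ignore_phrases` parameter are hypothetical. The extra parameter is there because some keywords are also ordinary words; `visoko`, for instance, is an adjective as well as a town name and appears in phrases such as "visoko prizemlje" (raised ground floor), which a bare keyword match would flag.

```python
import re

import pandas as pd


def outside_canton_mask(df, keywords, ignore_phrases=()):
    """Return a boolean Series marking rows that mention an out-of-canton keyword."""
    # Combine both text columns; fillna/astype keep missing values from breaking the join
    text = (df['municipality'].fillna('').astype(str) + ' '
            + df['title'].fillna('').astype(str)).str.lower()
    # Blank out phrases known to be false positives before searching
    for phrase in ignore_phrases:
        text = text.str.replace(phrase.lower(), ' ', regex=False)
    # Non-capturing group with word boundaries around every keyword
    alternatives = '|'.join(re.escape(k.lower()) for k in keywords)
    pattern = r'(?<!\w)(?:' + alternatives + r')(?!\w)'
    return text.str.contains(pattern, regex=True)


# Made-up rows, for illustration only
sample = pd.DataFrame({
    'municipality': [None, None, 'Sarajevo - Novo Sarajevo'],
    'title': [
        'Trosoban stan Lukavica, 96 m2',
        'Stan Visoko, 60 m2',
        'Dvosoban stan, visoko prizemlje, Grbavica',
    ],
})

mask = outside_canton_mask(sample, ['lukavica', 'visoko'],
                           ignore_phrases=('visoko prizemlje',))
print(mask.tolist())
```

The rows could then be dropped with `df[~mask].reset_index(drop=True)`; reviewing the flagged titles before dropping them remains advisable.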

## 📝 Extract Municipalities from Descriptions

For records with missing municipality data, let's search their descriptions for mentions of neighborhoods or municipalities.

In [28]:
# Step 6: Extract municipalities from descriptions for records with missing municipality data
print("=" * 80)
print("EXTRACTING MUNICIPALITIES FROM DESCRIPTIONS")
print("=" * 80)

if 'municipality' in df.columns and 'description' in df.columns:
    # Find records that lack a municipality but do have a description
    missing_municipality = df['municipality'].isnull()
    has_description = df['description'].notna()
    
    records_to_check = missing_municipality & has_description
    records_count = records_to_check.sum()
    
    print(f"\n📊 Records with missing municipality: {missing_municipality.sum()}")
    print(f"📊 Records with description available: {has_description.sum()}")
    print(f"🔍 Records to check (missing municipality + has description): {records_count}")
    
    if records_count > 0:
        print("\n" + "-" * 80)
        print("SEARCHING DESCRIPTIONS FOR MUNICIPALITY KEYWORDS...")
        print("-" * 80)
        
        # Use the same neighborhood mapping dictionary
        found_from_description = []
        
        for idx, row in df[records_to_check].iterrows():
            description_text = str(row['description']).lower() if pd.notna(row['description']) else ''
            title_text = str(row['title']).lower() if pd.notna(row['title']) else ''
            
            # Combine description and title for better matching
            search_text = f"{description_text} {title_text}"
            
            # Try to find a matching neighborhood in the description (the first key found wins)
            for neighborhood, target_municipality in neighborhood_to_municipality.items():
                if neighborhood.lower() in search_text:
                    # Found a match!
                    found_from_description.append({
                        'index': idx,
                        'title': row['title'],
                        'found_neighborhood': neighborhood,
                        'mapped_to': target_municipality,
                        'description_snippet': row['description'][:100] if pd.notna(row['description']) else ''
                    })
                    # Update the municipality
                    df.at[idx, 'municipality'] = target_municipality
                    break
        
        if len(found_from_description) > 0:
            print(f"\n✅ SUCCESS! Found {len(found_from_description)} municipalities from descriptions!")
            print("\n" + "-" * 80)
            print("SAMPLE OF EXTRACTED MUNICIPALITIES (first 10):")
            print("-" * 80)
            
            for i, extraction in enumerate(found_from_description[:10], 1):
                print(f"\n{i}. Found '{extraction['found_neighborhood']}' → Mapped to '{extraction['mapped_to']}'")
                print(f"   Title: {extraction['title'][:70]}...")
                print(f"   Description: {extraction['description_snippet']}...")
            
            if len(found_from_description) > 10:
                print(f"\n   ... and {len(found_from_description) - 10} more")
            
            # Show updated statistics
            print("\n" + "=" * 80)
            print("UPDATED MUNICIPALITY STATISTICS:")
            print("=" * 80)
            
            target_municipalities = [
                'Hadžići', 'Ilidža', 'Ilijaš', 'Sarajevo - Centar',
                'Sarajevo - Novi Grad', 'Sarajevo - Novo Sarajevo',
                'Sarajevo - Stari Grad', 'Trnovo', 'Vogošća'
            ]
            
            mapped_count = df[df['municipality'].isin(target_municipalities)].shape[0]
            missing_count = df['municipality'].isnull().sum()
            
            print(f"\n✅ Mapped to target municipalities: {mapped_count} ({mapped_count/len(df)*100:.2f}%)")
            print(f"❌ Still missing municipality data: {missing_count} ({missing_count/len(df)*100:.2f}%)")
            print(f"📈 Improvement: +{len(found_from_description)} records mapped from descriptions")
            print(f"\n📊 Total records: {len(df)}")
            
            # Show final distribution by municipality
            print("\n" + "-" * 80)
            print("MUNICIPALITY DISTRIBUTION (after description extraction):")
            print("-" * 80)
            
            final_counts = df['municipality'].value_counts()
            for idx, (municipality, count) in enumerate(final_counts.items(), 1):
                if municipality in target_municipalities:
                    percentage = (count / len(df)) * 100
                    print(f"{idx:3d}. {municipality:<35} {count:5d} records ({percentage:5.2f}%)")
        else:
            print("\n⚠️  No additional municipalities found in descriptions")
            print("    The descriptions may not contain explicit neighborhood/municipality names")
    else:
        print("\n✅ No records need description-based extraction")
        print("    All records either have municipality data or lack descriptions")
else:
    print("❌ Required columns not found")

print("\n" + "=" * 80)

EXTRACTING MUNICIPALITIES FROM DESCRIPTIONS

📊 Records with missing municipality: 624
📊 Records with description available: 1383
🔍 Records to check (missing municipality + has description): 624

--------------------------------------------------------------------------------
SEARCHING DESCRIPTIONS FOR MUNICIPALITY KEYWORDS...
--------------------------------------------------------------------------------

✅ SUCCESS! Found 481 municipalities from descriptions!

--------------------------------------------------------------------------------
SAMPLE OF EXTRACTED MUNICIPALITIES (first 10):
--------------------------------------------------------------------------------

1. Found 'Sarajevo Tower' → Mapped to 'Sarajevo - Novo Sarajevo'
   Title: Sarajevo...
   Description: Agencija za nekretnine Stanpromet.ba izdvaja prodaju trosobnog stana na 3. spratu zgrade Sarajevo To...

2. Found 'Ilidža' → Mapped to 'Ilidža'
   Title: Sarajevo...
   Description: Stanpromet.ba agencija
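Descriptions are longer than titles and may mention several places at once (nearby landmarks, the agency's own address), so taking the first dictionary key that matches can be fragile. One alternative, offered only as a sketch, is to count how many distinct keys point to each municipality and accept the result only when there is a clear winner. `vote_municipality` and `toy_mapping` are hypothetical, and the tie rule (return `None`) is an assumption.

```python
import re
from collections import Counter


def vote_municipality(text, mapping):
    """Pick the municipality supported by the most distinct keys, or None on a tie."""
    text = text.lower()
    votes = Counter()
    for name, municipality in mapping.items():
        # Whole-word match of the key inside the text
        pattern = r'(?<!\w)' + re.escape(name.lower()) + r'(?!\w)'
        if re.search(pattern, text):
            votes[municipality] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: leave the record for manual review
    return ranked[0][0]


# Toy subset of the mapping, for illustration only
toy_mapping = {
    'Grbavica': 'Sarajevo - Novo Sarajevo',
    'Hrasno': 'Sarajevo - Novo Sarajevo',
    'Centar': 'Sarajevo - Centar',
    'Ilidža': 'Ilidža',
}

print(vote_municipality('Stan Grbavica, blizu Hrasno i centar grada', toy_mapping))
```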

In [29]:
# # Update the cleaned CSV file without property_type and ad_type columns
# print("UPDATING CLEANED DATA FILE:")
# print("=" * 80)
# output_file = '../data/sarajevo_flats_nekretnine_cleaned_1.csv'
# df.to_csv(output_file, index=False)
# print(f"✅ Updated cleaned data saved to: {output_file}")
# print(f"Total columns: {len(df.columns)}")
# print(f"Column names: {list(df.columns)}")
# print("\n" + "=" * 80)

## 🔍 Dataset Comparison Analysis

Compare the `sarajevo_flats_olx` and `sarajevo_flats_nekretnine_cleaned` datasets to see whether they can be merged.

In [15]:
import pandas as pd

# Load both datasets
print("=" * 80)
print("LOADING DATASETS")
print("=" * 80)

df_olx = pd.read_csv('../data/sarajevo_flats_olx.csv')
df_nekretnine = pd.read_csv('../data/sarajevo_flats_nekretnine.csv')

print(f"\n📊 OLX Dataset: {len(df_olx)} records, {len(df_olx.columns)} columns")
print(f"📊 Nekretnine Dataset: {len(df_nekretnine)} records, {len(df_nekretnine.columns)} columns")

print("\n" + "-" * 80)
print("OLX COLUMNS:")
print("-" * 80)
for i, col in enumerate(df_olx.columns, 1):
    print(f"{i:2d}. {col}")

print("\n" + "-" * 80)
print("NEKRETNINE COLUMNS:")
print("-" * 80)
for i, col in enumerate(df_nekretnine.columns, 1):
    print(f"{i:2d}. {col}")

print("\n" + "=" * 80)

LOADING DATASETS

📊 OLX Dataset: 1806 records, 15 columns
📊 Nekretnine Dataset: 1506 records, 10 columns

--------------------------------------------------------------------------------
OLX COLUMNS:
--------------------------------------------------------------------------------
 1. title
 2. url
 3. price_numeric
 4. municipality
 5. condition
 6. ad_type
 7. property_type
 8. rooms
 9. square_m2
10. equipment
11. level
12. heating
13. price_per_m2
14. latitude
15. longitude

--------------------------------------------------------------------------------
NEKRETNINE COLUMNS:
--------------------------------------------------------------------------------
 1. title
 2. url
 3. price_numeric
 4. municipality
 5. rooms
 6. square_m2
 7. equipment
 8. description
 9. price_per_m2
10. municipality_original



In [16]:
# Find common and unique columns
print("=" * 80)
print("COLUMN COMPARISON")
print("=" * 80)

olx_cols = set(df_olx.columns)
nekretnine_cols = set(df_nekretnine.columns)

common_cols = olx_cols.intersection(nekretnine_cols)
olx_only = olx_cols - nekretnine_cols
nekretnine_only = nekretnine_cols - olx_cols

print(f"\n✅ COMMON COLUMNS ({len(common_cols)}):")
print("-" * 80)
for i, col in enumerate(sorted(common_cols), 1):
    print(f"{i:2d}. {col}")

print(f"\n🔵 OLX-ONLY COLUMNS ({len(olx_only)}):")
print("-" * 80)
for i, col in enumerate(sorted(olx_only), 1):
    print(f"{i:2d}. {col}")

print(f"\n🟢 NEKRETNINE-ONLY COLUMNS ({len(nekretnine_only)}):")
print("-" * 80)
for i, col in enumerate(sorted(nekretnine_only), 1):
    print(f"{i:2d}. {col}")

print("\n" + "=" * 80)

COLUMN COMPARISON

✅ COMMON COLUMNS (8):
--------------------------------------------------------------------------------
 1. equipment
 2. municipality
 3. price_numeric
 4. price_per_m2
 5. rooms
 6. square_m2
 7. title
 8. url

🔵 OLX-ONLY COLUMNS (7):
--------------------------------------------------------------------------------
 1. ad_type
 2. condition
 3. heating
 4. latitude
 5. level
 6. longitude
 7. property_type

🟢 NEKRETNINE-ONLY COLUMNS (2):
--------------------------------------------------------------------------------
 1. description
 2. municipality_original



In [17]:
# Analyze data compatibility for common columns
print("=" * 80)
print("DATA COMPATIBILITY ANALYSIS FOR COMMON COLUMNS")
print("=" * 80)

for col in sorted(common_cols):
    print(f"\n{'='*80}")
    print(f"📌 COLUMN: {col.upper()}")
    print(f"{'='*80}")
    
    # Data types
    olx_dtype = df_olx[col].dtype
    nekretnine_dtype = df_nekretnine[col].dtype
    print(f"Data Type - OLX: {olx_dtype} | Nekretnine: {nekretnine_dtype}")
    
    # Missing values
    olx_missing = df_olx[col].isnull().sum()
    nekretnine_missing = df_nekretnine[col].isnull().sum()
    olx_missing_pct = (olx_missing / len(df_olx)) * 100
    nekretnine_missing_pct = (nekretnine_missing / len(df_nekretnine)) * 100
    
    print(f"Missing - OLX: {olx_missing} ({olx_missing_pct:.1f}%) | Nekretnine: {nekretnine_missing} ({nekretnine_missing_pct:.1f}%)")
    
    # Unique values (for non-text columns)
    if col not in ['title', 'url', 'description', 'equipment']:
        olx_unique = df_olx[col].dropna().nunique()
        nekretnine_unique = df_nekretnine[col].dropna().nunique()
        print(f"Unique Values - OLX: {olx_unique} | Nekretnine: {nekretnine_unique}")
        
        # Show sample values for categorical-like columns
        if olx_unique <= 20 or nekretnine_unique <= 20:
            print(f"\nSample Values from OLX: {list(df_olx[col].dropna().unique()[:5])}")
            print(f"Sample Values from Nekretnine: {list(df_nekretnine[col].dropna().unique()[:5])}")

print("\n" + "=" * 80)

DATA COMPATIBILITY ANALYSIS FOR COMMON COLUMNS

📌 COLUMN: EQUIPMENT
Data Type - OLX: object | Nekretnine: object
Missing - OLX: 0 (0.0%) | Nekretnine: 70 (4.6%)

📌 COLUMN: MUNICIPALITY
Data Type - OLX: object | Nekretnine: object
Missing - OLX: 0 (0.0%) | Nekretnine: 143 (9.5%)
Unique Values - OLX: 9 | Nekretnine: 10

Sample Values from OLX: ['Sarajevo - Novi Grad', 'Sarajevo - Novo Sarajevo', 'Vogošća', 'Trnovo', 'Sarajevo - Centar']
Sample Values from Nekretnine: ['Sarajevo - Novo Sarajevo', 'Sarajevo - Stari Grad', 'Ilidža', 'Sarajevo - Centar', 'Sarajevo - Novi Grad']

📌 COLUMN: PRICE_NUMERIC
Data Type - OLX: float64 | Nekretnine: float64
Missing - OLX: 378 (20.9%) | Nekretnine: 501 (33.3%)
Unique Values - OLX: 490 | Nekretnine: 570

📌 COLUMN: PRICE_PER_M2
Data Type - OLX: float64 | Nekretnine: float64
Missing - OLX: 378 (20.9%) | Nekretnine: 505 (33.5%)
Unique Values - OLX: 966 | Nekretnine: 812

📌 COLUMN: ROOMS
Data Type - OLX: float64 | Nekretnine: object
Missi
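The report shows that `rooms` is `float64` in the OLX data but `object` in the Nekretnine data, so the two columns would end up as a mixed-type column after concatenation. A possible harmonization step is sketched below. It assumes the `object` column holds numbers stored as text, possibly with a decimal comma or a non-numeric label; the actual values are not shown in the output, so the sample series is made up and `to_numeric_rooms` is a hypothetical helper.

```python
import pandas as pd


def to_numeric_rooms(series: pd.Series) -> pd.Series:
    """Convert a text-like rooms column to floats; unparseable values become NaN."""
    cleaned = (series.astype(str)
                     .str.strip()
                     .str.replace(',', '.', regex=False))  # decimal comma to point
    return pd.to_numeric(cleaned, errors='coerce')


# Made-up values, for illustration only
rooms_raw = pd.Series(['2', '3,5', 'Garsonjera', None])
rooms_clean = to_numeric_rooms(rooms_raw)
print(rooms_clean)
```

Applied before the merge, for example as `df_nekretnine['rooms'] = to_numeric_rooms(df_nekretnine['rooms'])`, this would give both sources a numeric `rooms` column; labels that carry meaning would need an explicit lookup instead of being coerced to NaN.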

In [18]:
# Merge strategy and execution
print("=" * 80)
print("DATASET MERGE STRATEGY")
print("=" * 80)

print("\n📝 MERGE APPROACH:")
print("-" * 80)
print("1. Keep all common columns")
print("2. Add 'source' column to track origin (OLX vs Nekretnine)")
print("3. Handle missing columns:")
print("   - OLX-only columns: Set to NaN for Nekretnine records")
print("   - Nekretnine-only columns: Set to NaN for OLX records")
print("4. Concatenate both datasets vertically")

# Add source column
df_olx_copy = df_olx.copy()
df_nekretnine_copy = df_nekretnine.copy()

df_olx_copy['source'] = 'OLX'
df_nekretnine_copy['source'] = 'Nekretnine'

# Get all columns from both datasets
all_columns = sorted(set(df_olx_copy.columns) | set(df_nekretnine_copy.columns))

print(f"\n📊 COMBINED DATASET WILL HAVE {len(all_columns)} COLUMNS")
print(f"   - {len(common_cols)} common columns")
print(f"   - {len(olx_only)} OLX-only columns")
print(f"   - {len(nekretnine_only)} Nekretnine-only columns")
print(f"   - 1 source identifier column")

print(f"\n📈 TOTAL RECORDS: {len(df_olx)} (OLX) + {len(df_nekretnine)} (Nekretnine) = {len(df_olx) + len(df_nekretnine)}")

print("\n" + "=" * 80)

DATASET MERGE STRATEGY

📝 MERGE APPROACH:
--------------------------------------------------------------------------------
1. Keep all common columns
2. Add 'source' column to track origin (OLX vs Nekretnine)
3. Handle missing columns:
   - OLX-only columns: Set to NaN for Nekretnine records
   - Nekretnine-only columns: Set to NaN for OLX records
4. Concatenate both datasets vertically

📊 COMBINED DATASET WILL HAVE 18 COLUMNS
   - 8 common columns
   - 7 OLX-only columns
   - 2 Nekretnine-only columns
   - 1 source identifier column

📈 TOTAL RECORDS: 1806 (OLX) + 1506 (Nekretnine) = 3312



In [19]:
# Perform the merge
print("=" * 80)
print("MERGING DATASETS")
print("=" * 80)

# Concatenate the datasets
df_merged = pd.concat([df_olx_copy, df_nekretnine_copy], ignore_index=True)

print(f"\n✅ Merge Complete!")
print(f"📊 Total records: {len(df_merged)}")
print(f"📊 Total columns: {len(df_merged.columns)}")

print("\n" + "-" * 80)
print("COLUMN LIST:")
print("-" * 80)
for i, col in enumerate(df_merged.columns, 1):
    print(f"{i:2d}. {col}")

# Check source distribution
print("\n" + "-" * 80)
print("SOURCE DISTRIBUTION:")
print("-" * 80)
source_counts = df_merged['source'].value_counts()
for source, count in source_counts.items():
    percentage = (count / len(df_merged)) * 100
    print(f"{source:<15} {count:5d} records ({percentage:5.2f}%)")

print("\n" + "=" * 80)

MERGING DATASETS

✅ Merge Complete!
📊 Total records: 3312
📊 Total columns: 18

--------------------------------------------------------------------------------
COLUMN LIST:
--------------------------------------------------------------------------------
 1. title
 2. url
 3. price_numeric
 4. municipality
 5. condition
 6. ad_type
 7. property_type
 8. rooms
 9. square_m2
10. equipment
11. level
12. heating
13. price_per_m2
14. latitude
15. longitude
16. source
17. description
18. municipality_original

--------------------------------------------------------------------------------
SOURCE DISTRIBUTION:
--------------------------------------------------------------------------------
OLX              1806 records (54.53%)
Nekretnine       1506 records (45.47%)



In [20]:
# Analyze missing data in merged dataset
print("=" * 80)
print("MISSING DATA ANALYSIS - MERGED DATASET")
print("=" * 80)

missing_data = df_merged.isnull().sum()
missing_percentage = (missing_data / len(df_merged)) * 100

missing_df = pd.DataFrame({
    'Column': missing_data.index,
    'Missing Count': missing_data.values,
    'Percentage': missing_percentage.values
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print(f"\nColumns with missing data: {len(missing_df)}/{len(df_merged.columns)}")
print("\n" + "-" * 80)
print(f"{'Column':<30} {'Missing Count':>15} {'Percentage':>12}")
print("-" * 80)
for _, row in missing_df.iterrows():
    print(f"{row['Column']:<30} {int(row['Missing Count']):>15} {row['Percentage']:>11.2f}%")

print("\n" + "=" * 80)

MISSING DATA ANALYSIS - MERGED DATASET

Columns with missing data: 14/18

--------------------------------------------------------------------------------
Column                           Missing Count   Percentage
--------------------------------------------------------------------------------
municipality_original                     2534       76.51%
latitude                                  2137       64.52%
longitude                                 2137       64.52%
description                               1929       58.24%
ad_type                                   1506       45.47%
condition                                 1506       45.47%
heating                                   1506       45.47%
level                                     1506       45.47%
property_type                             1506       45.47%
price_per_m2                               883       26.66%
price_numeric                              879       26.54%
municipality                               1
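
The overall percentages above mix two different causes: columns that only one source provides (the 45.47% figure equals the share of Nekretnine records, 1506 of 3312) and values that are missing within a source. Splitting the missing rate by `source` separates the two. This is a rough sketch on a toy frame with hypothetical values; the name `missing_by_source` is introduced here for illustration only.

```python
import pandas as pd

# Toy frame with hypothetical values, shaped like the merged dataset.
toy = pd.DataFrame({
    "source": ["OLX", "OLX", "Nekretnine", "Nekretnine"],
    "heating": ["gas", "central", None, None],      # OLX-only style column
    "description": [None, None, "text", None],      # Nekretnine-only style column
    "price_numeric": [100000.0, None, 90000.0, None],
})

# Percentage of missing values per column, computed separately for each source.
missing_by_source = (
    toy.drop(columns="source")
       .isna()
       .groupby(toy["source"])
       .mean()
       .mul(100)
       .round(2)
)

print(missing_by_source)
```

A column that shows 100% for one source and a low value for the other is structurally absent from that source, whereas similar rates in both rows point to gaps in the listings themselves.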

In [21]:
# Save merged dataset
print("=" * 80)
print("SAVING MERGED DATASET")
print("=" * 80)

output_file = '../data/sarajevo_flats_merged.csv'
df_merged.to_csv(output_file, index=False)

print(f"\n✅ Merged dataset saved to: {output_file}")
print(f"📊 Total records: {len(df_merged)}")
print(f"📊 Total columns: {len(df_merged.columns)}")
print(f"\n🔵 OLX records: {(df_merged['source'] == 'OLX').sum()}")
print(f"🟢 Nekretnine records: {(df_merged['source'] == 'Nekretnine').sum()}")

print("\n" + "=" * 80)
print("MERGE SUMMARY")
print("=" * 80)
print("\n✅ YES, the datasets CAN be merged!")
print("\nCommon columns (will have data from both sources):")
for col in sorted(common_cols):
    print(f"  • {col}")
print("\nOLX-only columns (will be empty for Nekretnine records):")
for col in sorted(olx_only):
    print(f"  • {col}")
print("\nNekretnine-only columns (will be empty for OLX records):")
for col in sorted(nekretnine_only):
    print(f"  • {col}")

print("\n" + "=" * 80)

SAVING MERGED DATASET

✅ Merged dataset saved to: ../data/sarajevo_flats_merged.csv
📊 Total records: 3312
📊 Total columns: 18

🔵 OLX records: 1806
🟢 Nekretnine records: 1506

MERGE SUMMARY

✅ YES, the datasets CAN be merged!

Common columns (will have data from both sources):
  • equipment
  • municipality
  • price_numeric
  • price_per_m2
  • rooms
  • square_m2
  • title
  • url

OLX-only columns (will be empty for Nekretnine records):
  • ad_type
  • condition
  • heating
  • latitude
  • level
  • longitude
  • property_type

Nekretnine-only columns (will be empty for OLX records):
  • description
  • municipality_original
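
After saving, it can be useful to confirm that the CSV reloads with the same shape and that the columns left empty for one source come back as NaN. The sketch below performs that round trip on a toy frame with hypothetical values. It writes to an in-memory buffer instead of `../data/sarajevo_flats_merged.csv` so that it runs without touching the file system.

```python
import io

import pandas as pd

# Toy frame with hypothetical values; None marks a value the source lacks.
toy = pd.DataFrame({
    "title": ["Flat A", "Flat B"],
    "price_numeric": [250000.0, None],
    "heating": ["gas", None],
    "source": ["OLX", "Nekretnine"],
})

# Write to an in-memory text buffer the same way the notebook writes to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Empty CSV fields are read back as NaN by default.
reloaded = pd.read_csv(buffer)

same_shape = reloaded.shape == toy.shape
same_missing = reloaded.isna().sum().equals(toy.isna().sum())
print(f"Shape preserved: {same_shape}")
print(f"Missing counts preserved: {same_missing}")
```

The same comparison should work against the real file by replacing the buffer with the saved path, assuming the file is read with the default `read_csv` settings.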

