# 01 – Data Collection

This notebook collects all raw data required for the ARM project.
We combine open government data from the Swiss Federal Statistical Office (BFS)
with web-scraped rental listings to form the basis for our analysis pipeline.

### Contents

1. Load official BFS Bauperiode categories  
2. Store BFS categories in a SQLite database  
3. Scrape rental listings from Homegate (Zürich, Bern, Luzern)  
4. Inspect the scraped dataset  
5. Save raw data to CSV and SQLite

---

The output of this notebook provides the complete raw dataset that will be cleaned 
and prepared in **Notebook 02 – Data Cleaning & Exploratory Data Analysis**.




### 1. Load BFS Bauperiode Categories

We load the cleaned CSV file containing the official BFS construction-period
categories, the standardized *Bauperiode* labels used in Swiss building
statistics. The loaded table contains 10 labels, from *Vor 1919 erbaut* to
*Zwischen 2011 und 2015 erbaut*.


In [2]:
import pandas as pd

df_bauperiode = pd.read_csv("../Data/bfs_bauperiode_categories.csv")
df_bauperiode


Unnamed: 0,Bauperiode
0,Vor 1919 erbaut
1,Zwischen 1919 und 1945 erbaut
2,Zwischen 1946 und 1960 erbaut
3,Zwischen 1961 und 1970 erbaut
4,Zwischen 1971 und 1980 erbaut
5,Zwischen 1981 und 1990 erbaut
6,Zwischen 1991 und 2000 erbaut
7,Zwischen 2001 und 2005 erbaut
8,Zwischen 2006 und 2010 erbaut
9,Zwischen 2011 und 2015 erbaut
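
Because each label encodes a year range, a construction year can be assigned to its category by parsing the label text. The following is a minimal sketch of that idea and not part of the original pipeline; the helper names `parse_label` and `year_to_bauperiode` are hypothetical, and the labels are copied from the table above.

```python
import re

# Labels copied from the table above.
BAUPERIODE_LABELS = [
    "Vor 1919 erbaut",
    "Zwischen 1919 und 1945 erbaut",
    "Zwischen 1946 und 1960 erbaut",
    "Zwischen 1961 und 1970 erbaut",
    "Zwischen 1971 und 1980 erbaut",
    "Zwischen 1981 und 1990 erbaut",
    "Zwischen 1991 und 2000 erbaut",
    "Zwischen 2001 und 2005 erbaut",
    "Zwischen 2006 und 2010 erbaut",
    "Zwischen 2011 und 2015 erbaut",
]


def parse_label(label):
    """Return (first_year, last_year); first_year is None for 'Vor ...' labels."""
    match = re.fullmatch(r"Zwischen (\d{4}) und (\d{4}) erbaut", label)
    if match:
        return int(match.group(1)), int(match.group(2))
    match = re.fullmatch(r"Vor (\d{4}) erbaut", label)
    if match:
        # "Vor 1919" means built before 1919, so the last covered year is 1918.
        return None, int(match.group(1)) - 1
    raise ValueError(f"Unrecognised label: {label!r}")


def year_to_bauperiode(year, labels=BAUPERIODE_LABELS):
    """Return the label whose range contains `year`, or None if none matches."""
    for label in labels:
        first, last = parse_label(label)
        if (first is None or year >= first) and year <= last:
            return label
    return None  # year lies outside the listed categories


print(year_to_bauperiode(1984))  # Zwischen 1981 und 1990 erbaut
```

Years after 2015 return `None` here because the table above lists no later category.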


### 2. Save BFS Categories to SQLite

We store the BFS Bauperiode categories in a SQLite database so they can be
reliably accessed in later steps. Keeping all project data in one database file
gives later notebooks a single source to read from, and `if_exists="replace"`
makes the cell safe to re-run.




In [3]:
import sqlite3

conn = sqlite3.connect("../Data/apartment_database.db")
df_bauperiode.to_sql("bfs_bauperiode_categories", conn, if_exists="replace", index=False)
conn.close()
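
To confirm that the table was written as intended, it can be read back and compared with the DataFrame. This is a minimal sketch under stated assumptions: it uses an in-memory database (`":memory:"`) and a three-row stand-in `df_demo` instead of the project file and `df_bauperiode`, so it can run without touching `apartment_database.db`.

```python
import sqlite3

import pandas as pd

# Stand-in for df_bauperiode: the first three labels from section 1.
df_demo = pd.DataFrame(
    {
        "Bauperiode": [
            "Vor 1919 erbaut",
            "Zwischen 1919 und 1945 erbaut",
            "Zwischen 1946 und 1960 erbaut",
        ]
    }
)

# ":memory:" creates a temporary database that disappears when closed.
conn = sqlite3.connect(":memory:")
try:
    df_demo.to_sql("bfs_bauperiode_categories", conn, if_exists="replace", index=False)
    df_check = pd.read_sql_query("SELECT * FROM bfs_bauperiode_categories", conn)
finally:
    conn.close()

# The round trip should preserve both the values and their order.
roundtrip_ok = df_check["Bauperiode"].tolist() == df_demo["Bauperiode"].tolist()
print(roundtrip_ok)
```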


### 3. Web Scraping (Homegate)

We scraped rental listings from Homegate.ch for Zürich, Bern, and Luzern using
the Web Scraper Chrome extension. This method is appropriate for ARM because the
extension runs in the browser and can therefore extract structured fields from
pages whose content is loaded dynamically.

For each listing, we collected the following fields:

- Rent (CHF)
- Area (m²)
- Number of rooms
- Address
- Canton
- Construction period (if available)

The resulting dataset will be cleaned and enriched in Notebook 02.
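
The scraped export keeps raw strings such as `134 m2` and `2,530.–` (columns `area_m2_raw` and `rent_chf_raw` in the preview in section 4) next to numeric versions. How the numeric columns were derived is not shown in this notebook; the following is a minimal sketch of one possible conversion, with hypothetical helper names `parse_area_m2` and `parse_rent_chf`.

```python
import re

# Matches an integer or decimal number such as "134" or "2530.50".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def parse_area_m2(raw):
    """Convert a string like '134 m2' to 134.0; return None if no number is found."""
    match = NUMBER_PATTERN.search(raw)
    return float(match.group()) if match else None


def parse_rent_chf(raw):
    """Convert a string like '2,530.–' to 2530.0; return None if no number is found."""
    # Assumption: "," and apostrophes are thousands separators, never decimals.
    cleaned = raw.replace(",", "").replace("'", "").replace("’", "")
    match = NUMBER_PATTERN.search(cleaned)
    return float(match.group()) if match else None


print(parse_area_m2("134 m2"))
print(parse_rent_chf("2,530.–"))
```

Returning `None` for strings without a number keeps missing values explicit, so they can be handled during cleaning rather than silently converted.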



### 4. Load Scraped Rental Listings

We load the raw scraped dataset into Python to inspect its structure and verify 
that all relevant variables have been captured before storing the data in SQLite.




In [4]:
df_scraped = pd.read_csv("../Data/homegate_scraped_raw.csv")
df_scraped.head()


Unnamed: 0,web_scraper_order,id,web_scraper_start_url,link_pages,link_listings,area_m2_raw,area_m2,rooms_raw,rent_chf_raw,rent_chf,year_built_raw,address,canton
0,1764272678-4,1,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/4002530569,134 m2,134,5.5,"2,530.–",2530,1984,"Kesslernmattstr. 14, 8965 Berikon",Zurich
1,1764272681-5,2,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/4002542635,98 m2,98,3.5,"2,255.–",2255,1989,"Im Spitzler 21, 8902 Urdorf",Zurich
2,1764272684-6,3,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/4002551340,68 m2,68,3.5,"1,945.–",1945,1973,"Ferdinand Hodler-Str. 14, 8049 Zürich",Zurich
3,1764272687-7,4,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/4002567338,134 m2,134,5.5,"5,590.–",5590,1906,"Weinbergstrasse 72, 8006 Zürich",Zurich
4,1764272699-10,5,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/real-estate/city-...,https://www.homegate.ch/rent/4002572292,94 m2,94,4.5,"2,360.–",2360,1971,"Bahnhofstr. 73, 8957 Spreitenbach",Zurich
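
Beyond `head()`, a few quick checks help verify that the relevant variables were captured. The sketch below is an illustration only: `df_demo` is a stand-in built from the five rows shown above (a subset of the columns), and `expected_columns` is an assumed list of the variables of interest.

```python
import pandas as pd

# Stand-in for df_scraped, using values from the five rows shown above.
df_demo = pd.DataFrame(
    {
        "area_m2": [134, 98, 68, 134, 94],
        "rooms_raw": [5.5, 3.5, 3.5, 5.5, 4.5],
        "rent_chf": [2530, 2255, 1945, 5590, 2360],
        "year_built_raw": [1984, 1989, 1973, 1906, 1971],
        "canton": ["Zurich"] * 5,
    }
)

# Assumed list of variables that later notebooks rely on.
expected_columns = ["area_m2", "rooms_raw", "rent_chf", "year_built_raw", "canton"]

# 1. Are any expected columns absent from the export?
missing_columns = [col for col in expected_columns if col not in df_demo.columns]

# 2. How many missing values does each expected column contain?
missing_values = df_demo[expected_columns].isna().sum()

# 3. How many listings were scraped per canton?
listings_per_canton = df_demo["canton"].value_counts()

print(df_demo.shape)
print(missing_columns)
print(missing_values)
print(listings_per_canton)
```

On the full dataset, the per-canton counts show whether listings for all three regions were actually captured.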


### 5. Save Raw Scraped Data to SQLite

Finally, we store the scraped rental listings in the project’s SQLite database.
The raw CSV file loaded in section 4 (`homegate_scraped_raw.csv`) stays in
`../Data/` as a backup. This ensures consistent access in later notebooks and
prepares the dataset for cleaning and analysis in Notebook 02.



In [5]:
conn = sqlite3.connect("../Data/apartment_database.db")
df_scraped.to_sql("rental_listings_raw", conn, if_exists="replace", index=False)
conn.close()
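
After writing, the database contents can be checked by listing each table with its row count. This is a minimal sketch using only the standard library; the helper `table_row_counts` is hypothetical, and the in-memory database with a few sample rows stands in for `apartment_database.db`.

```python
import sqlite3
from contextlib import closing


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master, so quoting them is sufficient here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# closing() guarantees the connection is closed even if a statement fails.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE bfs_bauperiode_categories (Bauperiode TEXT)")
    conn.execute("CREATE TABLE rental_listings_raw (rent_chf INTEGER, canton TEXT)")
    conn.executemany(
        "INSERT INTO bfs_bauperiode_categories VALUES (?)",
        [("Vor 1919 erbaut",), ("Zwischen 1919 und 1945 erbaut",)],
    )
    conn.executemany(
        "INSERT INTO rental_listings_raw VALUES (?, ?)",
        [(2530, "Zurich"), (2255, "Zurich"), (1945, "Zurich")],
    )
    counts = table_row_counts(conn)

print(counts)  # {'bfs_bauperiode_categories': 2, 'rental_listings_raw': 3}
```

Against the real database, the counts should match `len(df_bauperiode)` and `len(df_scraped)`.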


### Summary

This notebook completes the data-collection stage of the ARM project:

- BFS Bauperiode categories were loaded and stored in the SQLite database.
- Rental listings from Zürich, Bern, and Luzern were scraped and imported.
- The raw dataset was inspected and is available in both CSV and SQLite formats.

We now proceed to **Notebook 02**, where the dataset is cleaned, enriched with 
BFS metadata, and explored through descriptive statistics and visualizations.
