# Data extraction

This notebook shows how to extract data from [centris.ca](https://www.centris.ca) using [Selenium](https://www.selenium.dev) and a custom library, `paginate.py`, which implements an iterator over the result pages up to the last one.

> The extraction runs on the `Summary` view, which allows more detailed data extraction.

Because this is only an extractor, the data is stored as raw HTML, without treatment or enrichment. Each property is persisted in a text file and stored inside a structured data directory that includes the year, month, and day, in the format `<data_dir>/yyyy/mm/dd` (_e.g. data/2025/01/27_).

## Import Libraries

In [1]:
# Library for file persistence
import os
import re

# Selenium libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Use chrome driver manager
from webdriver_manager.chrome import ChromeDriverManager as DriverManager

# Custom paginate library
from paginate import Paginate

## Parameter Selection

Select the parameters to apply to the crawler. They define the region the crawler searches to obtain the data, and the output directory where the extraction results are stored.

In [2]:
# The search Region
search_region = 'montreal-lasalle'

# The output directory
output_dir = 'data/html'
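
The introduction describes a dated layout, `<data_dir>/yyyy/mm/dd`, whereas `output_dir` here is a fixed path. One way the dated path could be derived is sketched below. The helper name `dated_output_dir` and its parameters are hypothetical and not part of this notebook or of `paginate.py`.

```python
import os
from datetime import date
from typing import Optional


def dated_output_dir(data_dir: str, day: Optional[date] = None) -> str:
    """Return `<data_dir>/yyyy/mm/dd` for `day` (today when omitted)."""
    day = day or date.today()
    # Zero-padded month and day keep directories sorted chronologically.
    return os.path.join(data_dir, f"{day:%Y}", f"{day:%m}", f"{day:%d}")


# Example with a fixed date so the result is reproducible.
print(dated_output_dir("data", date(2025, 1, 27)))
```

Passing the result as `output_dir` would place each run in its own day directory without any change to the persistence step.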

There are also fixed parameters, which are used to build the URL that the crawler requests.

In [3]:
base_url = 'https://www.centris.ca/en/properties'
page_view = 'Summary'
operation = 'for-sale'

# Prepare the URL to request. A separate variable leaves `search_region`
# untouched, so re-running this cell does not prepend a second "~".
region_part = f"~{search_region}" if search_region else ''
url = f"{base_url}~{operation}{region_part}?view={page_view}"

url

'https://www.centris.ca/en/properties~for-sale~montreal-lasalle?view=Summary'

## Initialise Selenium

In this section the [selenium web driver](https://www.selenium.dev/documentation/webdriver/) is initialized, then the [centris.ca](https://www.centris.ca) URL is requested.

In [4]:
# initialize webdriver
driver = webdriver.Chrome(service=Service(DriverManager().install()))

# request the created URL
driver.get(url)

Once the URL is requested, a consent popup appears. It must be dismissed before the site elements become clickable, so find the popup's agree button by its id and click it.

In [5]:
el = driver.find_element(By.ID, 'didomi-notice-agree-button')
el.click()
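
The agree button may not be in the page yet at the moment `find_element` runs, in which case the call raises and the cell fails. Selenium ships `WebDriverWait` and `expected_conditions` for this situation. As a dependency-free illustration of the same polling idea, here is a minimal sketch. `click_when_ready` and its parameters are hypothetical names, not part of Selenium or of this notebook.

```python
import time
from typing import Callable


def click_when_ready(find: Callable[[], object],
                     timeout: float = 10.0,
                     poll: float = 0.25) -> bool:
    """Poll `find` until it returns an element that can be clicked.

    Returns True once the click succeeds, or False if `timeout` seconds
    pass first. `find` is expected to raise while the element is absent,
    as Selenium's `find_element` does.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            find().click()
            return True
        except Exception:
            # Broad on purpose for this sketch: it covers both a missing
            # element and a click that is not possible yet. Real code
            # should catch the specific Selenium exceptions instead.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
```

With the notebook's driver, the call could look like `click_when_ready(lambda: driver.find_element(By.ID, 'didomi-notice-agree-button'))`.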

## Start site pagination 

The `Paginate` class walks through the [centris.ca](https://www.centris.ca) pages. It implements a generator method, which allows it to be used as an iterator.

In [6]:
paginator = Paginate(driver)
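
The source of `paginate.py` is not shown in this notebook. To illustrate what "a generator method used as an iterator" can look like, here is a minimal sketch under stated assumptions. `PageIterator` and the `go_to_next` callback are hypothetical and do not reproduce the real `Paginate` class. The only driver attribute used is `page_source`, which Selenium drivers provide.

```python
from typing import Callable, Iterator


class PageIterator:
    """Hypothetical stand-in for `Paginate`; not the real paginate.py."""

    def __init__(self, driver, go_to_next: Callable[[object], bool]):
        self.driver = driver
        # `go_to_next(driver)` is assumed to move the browser to the next
        # page and to return False when there is no further page.
        self.go_to_next = go_to_next

    def __iter__(self) -> Iterator[str]:
        # Because this method uses `yield`, it is a generator, so any
        # instance can be used directly in a `for` loop.
        while True:
            # Yield the HTML of the page currently loaded in the browser.
            yield self.driver.page_source
            if not self.go_to_next(self.driver):
                return
```

Keeping the "next page" logic behind a callback keeps the loop independent of the site's markup. In a real crawler that callback would click the site's next-page control and report whether it could advance.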

## Persistence definition

This function persists the data to the filesystem. To store the data elsewhere, reimplement it with a different destination.

In [7]:
def persistence(page: str, page_id: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)

    file_name = f"centris_{page_id}.html"
    file_path = os.path.join(output_dir, file_name)

    # Write the HTML content to the file
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(page)
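
As an illustration of reimplementing the persistence step, the sketch below takes the same arguments but writes gzip-compressed files, which can save space when many HTML pages are stored. The name `persistence_gzip` and the `.html.gz` suffix are assumptions made for this example.

```python
import gzip
import os


def persistence_gzip(page: str, page_id: str, output_dir: str) -> str:
    """Same inputs as `persistence`, but writes a gzip-compressed file."""
    os.makedirs(output_dir, exist_ok=True)

    file_path = os.path.join(output_dir, f"centris_{page_id}.html.gz")

    # Text mode ("wt") lets gzip handle the UTF-8 encoding of the HTML.
    with gzip.open(file_path, "wt", encoding="utf-8") as file:
        file.write(page)

    # Returning the path is an addition that makes the result easy to check.
    return file_path
```

Because the arguments match, this function could be called in the extraction loop in place of `persistence` with no other change.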

## Perform extraction

In [8]:
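# Iterate through the pages, always releasing the browser at the end
try:
    for page_number, page in enumerate(paginator, start=1):
        # `page` is the HTML of the current page, as yielded by the paginator.
        # The zero-padded id keeps files sorted, e.g. centris_0000001.html
        page_id = str(page_number).zfill(7)
        persistence(page, page_id, output_dir)
finally:
    # quit() ends the whole WebDriver session, not only the current window
    driver.quit()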