# **Data Acquisition**


This notebook downloads the data required for state-of-the-art Local Climate Zone (LCZ) classification:

1. Set up the project by defining the city of interest
2. City boundary polygon download from OpenStreetMap
3. Sentinel-2 imagery download using AWS S3
4. ALOS DSM 30m download from Google Earth Engine
5. Impervious Surface Area download from Google Earth Engine 
6. Tree Canopy Height download from Google Earth Engine
7. Building Height download from NRCan

## **1. Project Setup**


### 1.1. Import Libraries

This code block imports the necessary Python libraries and adds the project root to `sys.path` so that the `lcz_classification` package can be imported.

In [1]:
%load_ext autoreload
# Mode 2 reloads all modules automatically before each cell is executed
%autoreload 2

import sys
import os

# Add the module's parent directory to sys.path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from lcz_classification.dataset import get_matching_scenes, download_tiles, ee_download
from lcz_classification.util import tiles_from_bbox
import osmnx as ox
import boto3
from botocore import UNSIGNED
from botocore.client import Config
import geopandas as gpd
import arrow
import rasterio as rio
import matplotlib.pyplot as plt
import ee
import geemap
import numpy as np
from shapely.geometry import box
import pandas as pd


### 1.2. Setup Project Parameters

In [2]:
from lcz_classification.config import *

## USER INPUT ##
# ===============================================================================================================================================================
# Set up geographic bounds
CITY = "Greater Vancouver"  # City name
COUNTRY = "Canada"  # Country name
# Set up date range
START_DATE = '2021-07-14'  # Start date in YYYY-MM-DD
END_DATE = '2021-07-16'  # End date in YYYY-MM-DD
CITY_BOUNDARY_PATH = f"{STUDY_AREA}/study_area.geojson"

# Sentinel-2 bands to download: B02 to B07, B8A, B11 and B12
# (provided at 20 m spatial resolution by Copernicus)
SENT2_BANDS = ["B02", "B03", "B04", "B05", "B06", "B07", "B8A", "B11", "B12"]

# ===============================================================================================================================================================


# Parse the date strings into arrow objects
DATE_RANGE = [arrow.get(START_DATE, 'YYYY-MM-DD'), arrow.get(END_DATE, 'YYYY-MM-DD')]
SENT2_TILES_FP = f"{EXTERNAL}/sentinel2/sentinel_2_tiles.geojson"  # GeoJSON of Sentinel-2 tile names and bounds



### 1.3. Authenticate Google Earth Engine

In [3]:
# ee.Authenticate() returns True when valid credentials are already stored;
# otherwise force a fresh authentication flow before initializing.
if not ee.Authenticate():
    ee.Authenticate(force=True)
ee.Initialize(project=EE_PROJECT_NAME)

## **2. Get City Boundary Polygon**

To define the area of interest for the analysis, we need a polygon delineating the outer bounds of the city. Here, OSMnx is used to retrieve the boundary polygon for Greater Vancouver, British Columbia, Canada, which is then reprojected to its estimated UTM coordinate reference system.

In [4]:

# Get city boundary polygon
city_boundary = ox.geocode_to_gdf(f"{CITY}, {COUNTRY}")
city_boundary.to_crs(city_boundary.estimate_utm_crs(), inplace=True)


# Save as GeoJSON

# city_boundary.to_file(CITY_BOUNDARY_PATH, driver="GeoJSON")
# gpd.GeoDataFrame(geometry=[box(*city_boundary.to_crs(4326).total_bounds)], crs="EPSG:4326").to_file("../data/external/study_area/bbox.shp")

# # Plot the boundary
# city_boundary.explore(style={
#                 "fill": False,
#                 "color": "red"
# })
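
`estimate_utm_crs()` selects a projected CRS suited to where the data lies. The underlying idea of UTM zones can be sketched in plain Python. This is an illustration under simplifying assumptions, not GeoPandas' implementation: `utm_epsg_from_lonlat` is a hypothetical helper, the coordinates are only an approximate centre of Vancouver, and the special zone exceptions around Norway and Svalbard are ignored.

```python
def utm_epsg_from_lonlat(lon: float, lat: float) -> int:
    """Return the EPSG code of the WGS 84 / UTM zone containing (lon, lat).

    Hypothetical helper for illustration only; it ignores the irregular
    zones around Norway and Svalbard.
    """
    if not -180.0 <= lon <= 180.0 or not -90.0 <= lat <= 90.0:
        raise ValueError("lon must be in [-180, 180] and lat in [-90, 90]")
    # UTM zones are 6 degrees wide and numbered 1..60 starting at 180 W.
    zone = min(int((lon + 180.0) // 6) + 1, 60)
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Approximate centre of Vancouver (illustrative coordinates).
print(utm_epsg_from_lonlat(-123.1, 49.25))  # 32610
```

Zone 10 is consistent with the `10UDA` tile name that appears in the Sentinel-2 scene IDs for this study area.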

## **3. Sentinel-2 Download**

Using the `download_tiles` function, Sentinel-2 data is downloaded from the AWS S3 bucket with boto3. The city boundary polygon and date range are used to find matching scenes within the Sentinel-2 archive.

https://registry.opendata.aws/sentinel-2/
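
Scene IDs in this bucket look like `S2A_10UDA_20210714_0_L2A`, which appears to encode the platform, tile, acquisition date, a sequence number, and the processing level. The sketch below shows, with the standard library only, how such IDs could be parsed and reduced to the earliest acquisition date inside a date range. It is a simplified illustration: `parse_scene_id` and `earliest_scenes` are hypothetical helpers, the field meanings are inferred from the ID pattern, and the third ID is made up.

```python
from datetime import date, datetime
from typing import List, NamedTuple


class Scene(NamedTuple):
    scene_id: str
    tile_id: str
    acquired: date
    tile_no: int


def parse_scene_id(scene_id: str) -> Scene:
    """Split an ID like 'S2A_10UDA_20210714_0_L2A' into its parts."""
    _platform, tile_id, day, tile_no, _level = scene_id.split("_")
    acquired = datetime.strptime(day, "%Y%m%d").date()
    return Scene(scene_id, tile_id, acquired, int(tile_no))


def earliest_scenes(scene_ids: List[str], start: date, end: date) -> List[str]:
    """Keep scenes inside [start, end], then only those from the earliest date."""
    scenes = [parse_scene_id(s) for s in scene_ids]
    in_range = [s for s in scenes if start <= s.acquired <= end]
    if not in_range:
        return []
    first_day = min(s.acquired for s in in_range)
    return sorted(s.scene_id for s in in_range if s.acquired == first_day)


ids = [
    "S2A_10UDA_20210714_0_L2A",
    "S2A_10UDA_20210714_1_L2A",
    "S2B_10UDA_20210716_0_L2A",  # hypothetical later acquisition
]
print(earliest_scenes(ids, date(2021, 7, 14), date(2021, 7, 16)))
# ['S2A_10UDA_20210714_0_L2A', 'S2A_10UDA_20210714_1_L2A']
```

Restricting the download to a single acquisition date keeps all tiles of the mosaic radiometrically comparable, at the cost of ignoring later scenes that may be less cloudy.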


In [18]:

s2_tiles = gpd.read_file(SENT2_TILES_FP)  # Read GeoJSON of Sentinel-2 tile names and bounds
# Clip Sentinel-2 tiles to the city boundary, reprojected to the tiles' CRS because clip() needs matching CRSs
s2_tiles_clipped = s2_tiles.clip(city_boundary.to_crs(s2_tiles.crs))
tiles = s2_tiles_clipped.Name.to_list()  # Sentinel-2 tile names within the clipped boundary


# Retrieve scene IDs that match the requested tile names and date range
matching_scenes=get_matching_scenes( 
    bucket='sentinel-cogs',
    tiles=tiles,
    date_range=DATE_RANGE,
    prefix="sentinel-s2-l2a-cogs")



# Build a pandas DataFrame from the matching scenes so they can be filtered before downloading
scene_df = pd.DataFrame(
    data=dict(
        scene_id=matching_scenes,
        date=[scene_id.split("_")[2] for scene_id in matching_scenes],
        tile_id=[scene_id.split("_")[1] for scene_id in matching_scenes],
        tile_no=[scene_id.split("_")[3] for scene_id in matching_scenes],
    )
)

# Keep one scene per (tile_id, tile_no) pair, preferring the earliest date,
# then keep only the scenes acquired on that earliest date
scene_df = scene_df.sort_values(["date", "tile_id"]).drop_duplicates(["tile_id","tile_no"], keep="first")
date=scene_df.date.iloc[0]
scene_df=scene_df[scene_df.date == date]
download_scenes = scene_df.scene_id.to_list()

# Download Sentinel-2 scenes from AWS S3 Bucket
download_tiles(download_scenes,
               SENT2_BANDS, 
               "sentinel-s2-l2a-cogs",
               out_dir=f"{EXTERNAL}/sentinel2")




- Sentinel-2 -- Downloading Bands B02, B03, B04, B05, B06, B07, B8A, B11, B12 from Scene S2A_10UDA_20210714_0_L2A
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B02.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B03.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B04.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B05.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B06.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B07.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B8A.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B11.tif
../data/external/sentinel2/S2A_10UDA_20210714_0_L2A/B12.tif
- Sentinel-2 -- Downloading Bands B02, B03, B04, B05, B06, B07, B8A, B11, B12 from Scene S2A_10UDA_20210714_1_L2A
../data/external/sentinel2/S2A_10UDA_20210714_1_L2A/B02.tif
../data/external/sentinel2/S2A_10UDA_20210714_1_L2A/B03.tif
../data/external/sentinel2/S2A_10UDA_20210714_1_L2A/B04.tif
../data/external/sentinel2/S2A_10UDA_20210714_1_L2A/

## **4. DSM from ALOS 30m**

In [None]:
# Download ALOS DSM Data from Google Earth Engine

ee_download(
            asset_id="JAXA/ALOS/AW3D30/V3_2",
            bands=["DSM"],
            bbox = list(city_boundary.to_crs(4326).total_bounds),  # lon/lat bounds, as in the canopy height download
            output_dir=f"{EXTERNAL}/alos_dsm_30m",
            scale=30
            )



## **5. Impervious Surface Area**

In [None]:
# Download Impervious Surface Area from Google Earth Engine

ee_download(
            asset_id="projects/sat-io/open-datasets/GISD30_1985_2020",
            bands=["b1"],
            bbox = list(city_boundary.to_crs(4326).total_bounds),  # lon/lat bounds, as in the canopy height download
            output_dir=f"{EXTERNAL}/impervious_area",
            scale=30,
            )

IMAGE
Downloading b1 from projects/sat-io/open-datasets/GISD30_1985_2020
Generating URL ...
Downloading data from https://earthengine.googleapis.com/v1/projects/21883930632/thumbnails/c4ea089287536d302fae0888db2ea659-0f60e88454714d1de38b069dce89e546:getPixels
Please wait ...
Data downloaded to d:\GeoAI\projects\LCZ_Classification\data\external\impervious_area\GISD30_1985_2020_b1.tif


## **6. Tree Canopy Height**

In [None]:
# Download Canopy Height from Google Earth Engine


ee_download(
            asset_id="users/nlang/ETH_GlobalCanopyHeight_2020_10m_v1",
            bands=["b1"],
            bbox = list(city_boundary.to_crs(4326).total_bounds), 
            output_dir=f"{EXTERNAL}/canopy_height",
            scale=10,
            tile_dims=(2,2)
            )
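
The `tile_dims=(2,2)` argument suggests that the 10 m request is split into a grid of smaller downloads, presumably to keep each Earth Engine request small. How `ee_download` does this is not shown in this notebook, so the following is only a sketch of the general idea: `split_bbox` is a hypothetical helper, and reading `tile_dims` as (rows, columns) is an assumption.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def split_bbox(bbox: BBox, tile_dims: Tuple[int, int]) -> List[BBox]:
    """Split bbox into rows x cols equal sub-boxes, row by row from the south-west.

    Hypothetical helper; tile_dims is assumed to mean (rows, cols).
    """
    rows, cols = tile_dims
    if rows < 1 or cols < 1:
        raise ValueError("tile_dims must contain positive integers")
    min_x, min_y, max_x, max_y = bbox
    width = (max_x - min_x) / cols
    height = (max_y - min_y) / rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((
                min_x + c * width,
                min_y + r * height,
                min_x + (c + 1) * width,
                min_y + (r + 1) * height,
            ))
    return tiles


# Toy bounding box, not the real study area.
for tile in split_bbox((0.0, 0.0, 4.0, 2.0), (2, 2)):
    print(tile)
```

Each sub-box can then be requested separately and the resulting rasters mosaicked afterwards.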

## **7. Building Height Data**

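Download building height data for British Columbia from the NRCan website: https://ftp.maps.canada.ca/pub/nrcan_rncan/extraction/auto_building/fgdb/

The cell below scrapes the directory listing for `.zip` files whose names contain "BC" and streams each one to disk.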

In [None]:
import requests
from bs4 import BeautifulSoup
import os
from tqdm import tqdm

# Base URL of the directory
BASE_URL = "https://ftp.maps.canada.ca/pub/nrcan_rncan/extraction/auto_building/fgdb/"

# Create a session
session = requests.Session()

def get_bc_links(url):
    """Return the URLs of all .zip files in the directory listing whose names contain "BC"."""
    print(f"Fetching links from {url}")
    response = session.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and "BC" in href and href.endswith('.zip'):
            full_url = url + href
            links.append(full_url)
    return links

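def download_file(url, dest_folder="."):
    """Stream the file at `url` into `dest_folder`, showing a progress bar."""
    os.makedirs(dest_folder, exist_ok=True)  # Ensure the destination folder exists
    local_filename = os.path.join(dest_folder, url.split("/")[-1])
    print(f"Downloading: {url}")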
    with session.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('content-length', 0))
        with open(local_filename, 'wb') as f, tqdm(
            desc=local_filename,
            total=total,
            unit='B',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    bar.update(len(chunk))
    print(f"Saved to {local_filename}")

if __name__ == "__main__":
    bc_links = get_bc_links(BASE_URL)
    print(f"Found {len(bc_links)} BC-related files.")
    for link in bc_links:
        download_file(link, dest_folder="../data/external/building_height")
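
The link-filtering step can be checked offline before contacting the server. The sketch below reproduces the same rule (keep `href` values that contain "BC" and end in `.zip`) using only the standard library. The HTML snippet, the file names in it, and the `example.com` base URL are invented for illustration and do not reflect the real NRCan listing; `LinkCollector` and `filter_zip_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def filter_zip_links(html: str, base_url: str, keyword: str = "BC") -> List[str]:
    """Return absolute URLs of .zip links whose href contains `keyword`."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        urljoin(base_url, href)  # resolves relative hrefs against the listing URL
        for href in parser.hrefs
        if keyword in href and href.endswith(".zip")
    ]


# Made-up directory listing for illustration only.
sample_html = """
<a href="../">Parent Directory</a>
<a href="Autobuilding_BC_sample.zip">Autobuilding_BC_sample.zip</a>
<a href="Autobuilding_ON_sample.zip">Autobuilding_ON_sample.zip</a>
<a href="readme_BC.txt">readme_BC.txt</a>
"""
base = "https://example.com/fgdb/"
print(filter_zip_links(sample_html, base))
# ['https://example.com/fgdb/Autobuilding_BC_sample.zip']
```

Using `urljoin` instead of string concatenation also handles listings whose links are absolute paths or where the base URL lacks a trailing slash.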
