# User story 2
As a user, I want to stream AR50 so that I can filter the data as needed.

Functionality: It should be possible to fetch AR50 directly from file. The process involves streaming with "partial read", which lets the user filter out data.

Acceptance criteria:
- I should be able to fetch AR50 data directly from files with geographic filtering and column filtering.
- I should be able to filter the data according to my own needs.

Goal:
Streaming AR50 should be an efficient and user-friendly process. The user should be able to filter data according to their own needs. The process should also use "partial read", so that data is updated automatically and performance is optimized.
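
The combination of column filtering and geographic filtering while streaming can be sketched in plain Python. This is only an illustration of the idea, not the notebook's implementation: the CSV layout, the `lon`/`lat` column names, the sample rows, and the function `stream_filtered` are all assumptions made for the example.

```python
import csv
import io
from typing import Iterable, Iterator


def stream_filtered(lines: Iterable[str], columns: list,
                    bbox: tuple) -> Iterator[dict]:
    """Yield rows one at a time, keeping only `columns` for rows inside `bbox`.

    bbox is (min_lon, min_lat, max_lon, max_lat). The "lon" and "lat"
    column names are assumptions for this sketch.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    for row in csv.DictReader(lines):
        lon, lat = float(row["lon"]), float(row["lat"])
        # Geographic filter: skip rows outside the bounding box
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            # Column filter: return only the requested columns
            yield {name: row[name] for name in columns}


# Hypothetical sample standing in for a file with AR50-like records
SAMPLE = io.StringIO(
    "gml_Id,Treslag,lon,lat\n"
    "a,31,8.01,58.17\n"
    "b,33,10.75,59.91\n"
    "c,31,8.00,58.16\n"
)

rows = list(stream_filtered(SAMPLE, ["gml_Id", "Treslag"], (7.9, 58.1, 8.1, 58.2)))
print(rows)
```

Because the function is a generator, rows are processed one at a time and never all held in memory, which is the property the user story asks for.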


# Packages

In [0]:
from sedona.spark import *
import requests
import os
import json
from pyspark.sql import Row
import pyspark.sql.functions as F
from pyspark.sql.functions import expr, col, to_timestamp, hour, date_format, lit, lower, when, desc
from pyspark.sql.types import IntegerType, DoubleType, FloatType, LongType, TimestampType, DateType, StringType
from pyspark.sql import DataFrame
from pyspark.sql.utils import AnalysisException
import pandas as pd
import geopandas as gpd
from shapely import wkb
from datetime import datetime
from shapely.geometry import Point
import ipywidgets as widgets
from IPython.display import display, clear_output



# Constants

In [0]:
A50_TABLE = "land_techtroll_dev.bronze.ar50"
EPSG = 4326
OPERATOR = ["==", "!=", "<", "<=", ">", ">=", "LIKE"]
AREALTYPE = ["10", "20", "30", "50", "60", "70", "81", "82"]
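
The `OPERATOR` list is defined here but not used in the cells shown in this notebook. One possible use, sketched below as an assumption, is to validate a user-selected operator before building a filter expression string. The helper `build_filter` and its escaping rule are hypothetical and are not part of the notebook.

```python
OPERATOR = ["==", "!=", "<", "<=", ">", ">=", "LIKE"]


def build_filter(column: str, operator: str, value) -> str:
    """Return a SQL-style filter expression, rejecting unknown operators."""
    if operator not in OPERATOR:
        raise ValueError(f"Unsupported operator: {operator!r}")
    if isinstance(value, str):
        # Simple backslash escaping of quotes; assumed sufficient for this sketch
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        literal = f"'{escaped}'"
    else:
        literal = str(value)
    # Backticks allow column names with non-ASCII characters such as områdeId
    return f"`{column}` {operator} {literal}"


expression = build_filter("Arealtype", "==", 10)
print(expression)
```

A string built this way could then be passed to `DataFrame.filter`, which accepts a SQL expression string.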

# Config
Define `dev_config` for the development environment and `prod_config` for the production environment.<br><br>
Each configuration contains:
- catalog_name: the name of the catalog.
- landing_zone_prefix: file path to the "landing zone" where new data is typically placed initially.
- location_prefix: file path to where static data is stored in the cloud.
- static_data_prefix: general file path to static data.

In [0]:
%run ./config

{'catalog_name': 'land_techtroll_dev',
 'landing_zone_prefix': '/Volumes/land_techtroll_dev/external_dev/landing_zone',
 'location_prefix': '/Volumes/land_techtroll_dev/external_dev/static_data/cloudFiles',
 'static_data_prefix': '/Volumes/land_techtroll_dev/external_dev/static_data',
 'env': 'dev'}
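
The `./config` notebook itself is not shown. A minimal sketch of how the two configurations could be built is given below. The `dev` values reproduce the output above; the `prod` values and the helper `make_config` are assumptions that simply mirror the dev layout.

```python
def make_config(env: str) -> dict:
    """Build a configuration dictionary for the given environment name."""
    catalog = f"land_techtroll_{env}"
    base = f"/Volumes/{catalog}/external_{env}"
    return {
        "catalog_name": catalog,
        "landing_zone_prefix": f"{base}/landing_zone",
        "location_prefix": f"{base}/static_data/cloudFiles",
        "static_data_prefix": f"{base}/static_data",
        "env": env,
    }


dev_config = make_config("dev")
# Assumed naming for production; not confirmed by this notebook
prod_config = make_config("prod")
print(dev_config)
```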

# Set catalog
In this notebook, the catalog is set to `land_techtroll_dev`, as this is where the relevant databases are located. Setting the catalog ensures that all subsequent queries reference the correct data environment, making it easier to access and manage the necessary tables for our testing and development.


In [0]:
spark.sql(f'USE CATALOG {spark.conf.get("conf.catalog_name")}')

DataFrame[]

# Get table from database
The database follows the Medallion Architecture, which organizes data into layers:
- Bronze: Raw or minimally processed data (used in this case).
- Silver: Cleaned and structured data.
- Gold: Aggregated, business-ready data.

This notebook uses only the bronze layer, working with raw data for early-stage testing and development.

Important note! <i>The Spark table stores geometry data as binary, so it needs to be converted.</i>

In [0]:
# Read AR50 data from the database
AR50_DF = spark.table(A50_TABLE)

# Clean the table
Replace placeholder values with null in the following columns:
- Arealtype
- Jordbruk
- Skogbonitet
- Treslag
- vegetasjonsdekke

The following values are replaced:
- 99: "Ikke kartlagt" (not mapped)
- 98: "Ikke definert" (not defined)

In [0]:
# Placeholder values to replace with null; add more here if needed
null_values = [98, 99]

# Columns in which the placeholder values are replaced
columns_to_clean = ["vegetasjonsdekke", "Arealtype", "Treslag", "Skogbonitet", "Jordbruk"]

for column_name in columns_to_clean:
    AR50_DF = AR50_DF.withColumn(
        column_name,
        when(col(column_name).isin(null_values), None).otherwise(col(column_name))
    )

# GIS query

In [0]:
def create_buffer_radius(lon, lat, radius_m=500):
    """
    Create a buffer area (radius in meters) around a given longitude and latitude.

    Args:
        lon (float): Longitude of the center point.
        lat (float): Latitude of the center point.
        radius_m (int, optional): Radius of the buffer in meters. Defaults to 500.

    Returns:
        DataFrame: A Spark DataFrame containing the buffer geometry as GeoJSON.
    """
    # Create a DataFrame with a single point (id, longitude, latitude)
    data = [(1, lon, lat)]
    df_raw = spark.createDataFrame(data, ["id", "lon", "lat"])
    df_raw.createOrReplaceTempView("geom_points")

    # Build the point with the SRID given by EPSG, buffer it by radius_m meters
    # (spheroid-based buffer), and return the result as GeoJSON
    query = f"""
        SELECT 
            id,
            st_asgeojson(
                ST_Transform(
                    ST_SetSRID(
                        ST_Buffer(
                            ST_SetSRID(ST_Point(lon, lat), {EPSG}),
                            {radius_m}, True
                        ),
                        {EPSG}
                    ),
                    'EPSG:{EPSG}'
                )
            ) AS geom
        FROM geom_points
    """
    buffer_df = spark.sql(query)
    return buffer_df


def get_geomtry_within(df, lon, lat, radius_m=500):
    """
    Find geometries within a given radius from a specified point.

    Args:
        df (DataFrame): Spark DataFrame containing geometries to query against.
        lon (float): Longitude of the center point.
        lat (float): Latitude of the center point.
        radius_m (int, optional): Search radius in meters. Defaults to 500.

    Returns:
        Tuple[DataFrame, DataFrame]: 
            - DataFrame with geometries that intersect the buffer area.
            - DataFrame with the buffer geometry itself.
    """
    # Create a buffer around the given point, using the requested radius
    buffer_df = create_buffer_radius(lon, lat, radius_m)
    
    # Register the input DataFrame as a temporary view
    df.createOrReplaceTempView("Within_view")

    # Query to select geometries that intersect the buffer
    query = f"""
        SELECT
            gml_Id,
            Treslag,
            geometry
        FROM Within_view
        WHERE ST_Intersects(
            ST_GeomFromText(geometry),
            (SELECT 
                ST_Transform(
                    ST_SetSRID(
                        ST_Buffer(
                            ST_SetSRID(ST_Point(lon, lat), {EPSG}),
                            {radius_m}, True
                        ),
                        {EPSG}
                    ),
                    'EPSG:{EPSG}'
                ) AS geom
            FROM geom_points)
        )
    """
    within_df = spark.sql(query)
    return within_df, buffer_df
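
A result from the spatial query can be sanity-checked without Spark by computing the great-circle distance between the center and a candidate point. The sketch below uses the haversine formula on a spherical Earth, which is an approximation and not the same computation as the spheroid-based `ST_Buffer` used above. The function names are hypothetical; the coordinates are the two example points that appear in this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_within_radius(center_lon, center_lat, lon, lat, radius_m=500) -> bool:
    """True if (lon, lat) lies within radius_m meters of the center point."""
    return haversine_m(center_lon, center_lat, lon, lat) <= radius_m


# Widget default point and the example coordinates from the use case text
center = (8.01122450699552, 58.1743128719827)
other = (8.003180405023354, 58.16368151324909)
distance = haversine_m(*center, *other)
print(f"{distance:.0f} m apart; within 500 m: {is_within_radius(*center, *other)}")
```

Note that the notebook's query tests whether polygons intersect the buffer, so a polygon can match even when most of it lies outside the radius; the point check here is only a rough plausibility test.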


# Map visualization


In [0]:
def prepare_for_kepler(df: DataFrame) -> DataFrame:
    """
    Cleans a DataFrame for visualization in SedonaKepler.
    
    - Converts DecimalType to FloatType
    - Converts DateType to String ("yyyy-MM-dd")

    Args:
        df (DataFrame): The input PySpark DataFrame

    Returns:
        DataFrame: Cleaned DataFrame ready for Kepler.gl
    """
    # Convert decimal columns to float
    decimal_cols = [f.name for f in df.schema.fields if "decimal" in str(f.dataType).lower()]
    for col_name in decimal_cols:
        df = df.withColumn(col_name, col(col_name).cast("float"))

    # Convert all DateType columns to string (yyyy-MM-dd)
    date_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DateType)]
    for col_name in date_cols:
        df = df.withColumn(col_name, date_format(col(col_name), "yyyy-MM-dd"))

    return df
    
def visualize_map(filter_df: DataFrame, name: str = None):
    """
    Visualizes a filtered PySpark DataFrame on a Kepler-style interactive map.

    Parameters:
    filter_df (DataFrame): The filtered DataFrame containing spatial data (geometry or coordinates).
    name (str, optional): A name or label to assign to the map layer. Defaults to None.

    Returns:
    KeplerGl: The map object. The notebook renders it when it is the last
    expression in a cell or when it is passed to display().
    """
    # Preprocess the DataFrame to make it compatible with Kepler visualization
    filtered_df_clean = prepare_for_kepler(filter_df)

    # Create an empty interactive map using SedonaKepler
    kepler_map = SedonaKepler.create_map()

    # Add the cleaned DataFrame to the map, with an optional layer name
    SedonaKepler.add_df(kepler_map, filtered_df_clean, name=name)

    # Return the map so the caller can display it in the notebook
    return kepler_map

# Optimization

In [0]:
def difference_between_rows(df: DataFrame, filter_df: DataFrame) -> int:
    """
    Calculates and prints the difference in row counts between the original and filtered DataFrames.

    Parameters:
    df (DataFrame): The original (unfiltered) DataFrame.
    filter_df (DataFrame): The filtered version of the DataFrame.

    Returns:
    int: The number of rows removed during filtering.
    """
    # Count rows in both DataFrames
    original_count = df.count()
    filtered_count = filter_df.count()

    # Print summary
    print(f"The original DataFrame has {original_count} rows.")
    print(f"The filtered DataFrame has {filtered_count} rows.")
    print(f"The difference is {original_count - filtered_count} rows.")

    # Return the difference in row count
    return original_count - filtered_count

In [0]:
def difference_between_size(filter_df: DataFrame) -> float:
    """
    Compares the disk size of the original Delta table and a filtered DataFrame.

    This function:
    - Writes the filtered DataFrame to a temporary table
    - Compares table sizes using DESCRIBE DETAIL
    - Cleans up the temporary table afterward
    - Returns the size difference in kilobytes

    Parameters:
    filter_df (DataFrame): The filtered DataFrame to compare.

    Returns:
    float: The difference in size (original - filtered) in kilobytes,
    or 0.0 if the comparison fails.
    """
    tmp_table = "bronze.ar50_tmp"

    try:
        # Save the filtered DataFrame as a temporary table
        filter_df.write.mode("overwrite").saveAsTable(tmp_table)

        # Get size of the original table
        original_info = spark.sql(f"DESCRIBE DETAIL {A50_TABLE}")
        original_size_bytes = original_info.select("sizeInBytes").collect()[0][0]
        original_size_mb = original_size_bytes / (1024 * 1024)
        original_size_kb = original_size_bytes / 1024

        # Get size of the temporary (filtered) table
        tmp_info = spark.sql(f"DESCRIBE DETAIL {tmp_table}")
        tmp_size_bytes = tmp_info.select("sizeInBytes").collect()[0][0]
        print(tmp_size_bytes)
        tmp_size_kb = tmp_size_bytes / 1024

        # Calculate and print the size difference (in KB)
        size_diff_kb = original_size_kb - tmp_size_kb
        print(f"Original table size: {original_size_mb:.2f} MB")
        print(f"Filtered table size: {tmp_size_kb:.2f} KB")
        print(f"Difference in size: {size_diff_kb:.2f} KB")

        return size_diff_kb

    except AnalysisException as e:
        print(f"[ERROR] Table not found or query failed: {e}")
        return 0.0
    except Exception as e:
        print(f"[ERROR] Unexpected error: {e}")
        return 0.0
    finally:
        # Always drop the temporary table, even if something fails
        try:
            spark.sql(f"DROP TABLE IF EXISTS {tmp_table}")
            print(f"Temporary table '{tmp_table}' dropped.")
        except Exception as drop_err:
            print(f"[WARNING] Could not drop temporary table: {drop_err}")
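
The size report mixes MB and KB in its printed lines. A small formatting helper, sketched below, would pick the unit from the value itself. The name `format_size` is hypothetical and the helper is not part of the notebook; it uses binary units (1 KB = 1024 bytes), matching the divisions in `difference_between_size`.

```python
def format_size(size_bytes: int) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    size = float(size_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        # Stop at the first unit where the value is below 1024, or at GB
        if size < 1024 or unit == "GB":
            return f"{size:.2f} {unit}"
        size /= 1024


print(format_size(34154))
```

For the byte count 34154 printed by the first use case further down, this yields `33.35 KB`, the same figure the notebook reports for the filtered table.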


# Use case examples

## Displaying tree species within a given area

As a land-use planner in Kristiansand municipality, I want to retrieve and analyze AR50 data within a 500-meter radius around a selected point, so that I can find which tree species are present in the area and assess green structure in urban planning.

Acceptance criteria:
- The user should be able to enter coordinates and a buffer radius
- The system should return occurrences of tree species within the area
- The tree species should be shown both in a table and on a map
<br>

Input:
- Coordinates (e.g. 8.003180405023354, 58.16368151324909)
- A radius in meters (e.g. 500)

Output:
- Table
- Map


In [0]:
# Clean up previous widgets
dbutils.widgets.removeAll()

# Create widgets
dbutils.widgets.dropdown("column", AR50_DF.columns[12], AR50_DF.columns, "Kolonne")
dbutils.widgets.text("latitude", "58.1743128719827", "Latitude")
dbutils.widgets.text("longitude", "8.01122450699552", "Longitude")
dbutils.widgets.text("radius", "500", "Radius (m)")


In [0]:
from pyspark.sql.functions import col
from pyspark.sql.utils import AnalysisException
from IPython.display import display

try:
    # Read inputs
    col_name = dbutils.widgets.get("column")
    lat = float(dbutils.widgets.get("latitude"))
    lon = float(dbutils.widgets.get("longitude"))
    radius = int(dbutils.widgets.get("radius"))

    print(f"🔍 Kjører søk for kolonne: {col_name}")
    print(f"📍 Koordinater: lat={lat}, lon={lon}, radius={radius} m")

    # Filter rows where selected column is not null
    filtered_df = AR50_DF.filter(col(col_name).isNotNull())

    # Spatial filter
    df, buffer_df = get_geomtry_within(filtered_df, lon, lat, radius)

    print(f"✅ Fant {df.count()} rader etter spatial filtrering.")
    display(df.limit(10).toPandas())

    # Show map
    print("🗺️ Viser kart...")
    map_df = SedonaKepler.create_map()
    map_df.add_data(df.toPandas(), "points")
    map_df.add_data(buffer_df.toPandas(), "buffer")
    display(map_df)

    # Metrics
    difference_between_rows(AR50_DF, df)
    difference_between_size(df)

except AnalysisException as ae:
    print(f"❌ Analysefeil: {ae}")
except Exception as e:
    print(f"❌ Feil: {e}")


🔍 Kjører søk for kolonne: Treslag
📍 Koordinater: lat=58.1743128719827, lon=8.01122450699552, radius=500 m
✅ Fant 16 rader etter spatial filtrering.


| | gml_Id | Treslag | geometry |
|---|---|---|---|
| 0 | idabd9cde9-8983-400f-b1f0-a24e8bd1a1e9 | 33 | POLYGON ((8.018049048850225 58.1766941662483, ... |
| 1 | id253f2d8b-6d3e-4ba3-8841-591680448c3c | 31 | POLYGON ((8.008672667297759 58.177530142458366... |
| 2 | id47c56cba-5ed8-4354-a602-d21dc33c8479 | 33 | POLYGON ((8.016456205298631 58.175254837853906... |
| 3 | id3590952b-a51c-4ac9-86a5-220651b8a7d7 | 31 | POLYGON ((8.009175778962911 58.16719429992018,... |
| 4 | idaad7410f-fb36-46cd-b532-695eb9ef7145 | 31 | POLYGON ((8.000602836808843 58.16915524813237,... |
| 5 | idd64d9bb8-75d4-47f1-9ddb-46a14c1d107c | 31 | POLYGON ((8.017746660139368 58.176497308991216... |
| 6 | idd1e215e9-f900-490a-99ba-5e6194b87093 | 31 | POLYGON ((8.016305020923003 58.17515640762687,... |
| 7 | id33a1874a-c9d6-46cf-a5d1-5ee5760cf9ca | 31 | POLYGON ((8.01676545121825 58.17455090519182, ... |
| 8 | id81fa1a38-56d3-4aae-b3b7-089a47741e57 | 31 | POLYGON ((7.993487747219627 58.17840561671637,... |
| 9 | id3ef6a438-d33b-4e38-a29c-1e9fbe0393f9 | 31 | POLYGON ((8.008040221825588 58.17208825841706,... |


🗺️ Viser kart...
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'points': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'columns': ['gml_Id…

The original DataFrame has 179790 rows.
The filtered DataFrame has 16 rows.
The difference is 179774 rows.
34154
Original table size: 326.65 MB
Filtered table size: 33.35 KB
Difference in size: 334454.92 KB
Temporary table 'bronze.ar50_tmp' dropped.



## Analysis of area type 10 (built-up area) in Kristiansand

User story

As a land-use planner in Kristiansand municipality, I want to retrieve and analyze AR50 data for a given area code and area type, so that I can see where the largest occurrences of that area type, such as built-up areas, are located.

Acceptance criteria:
- The user should be able to specify an area code and an area type.
- The system should return the 10 largest occurrences based on area.

Input:
- An area code (e.g. 4204)
- An area type (e.g. 10)

Output:

- Table with the top 10 occurrences
- Map highlighting these areas
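
The "10 largest occurrences based on area" criterion amounts to ranking features by polygon area and keeping the top entries. The sketch below shows that ranking in plain Python with the shoelace formula. It is an illustration under stated assumptions: the feature dictionaries, the `ring` key, and the sample shapes are hypothetical, and a planar area computed on lon/lat degrees is not a true ground area. In Spark, an area function from Sedona (for example `ST_Area`) would be the more likely choice.

```python
import heapq


def shoelace_area(ring: list) -> float:
    """Planar area of a ring given as [(x, y), ...] using the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def top_n_by_area(features: list, n: int = 10) -> list:
    """Return the n features with the largest ring area, largest first."""
    return heapq.nlargest(n, features, key=lambda f: shoelace_area(f["ring"]))


# Hypothetical features; real AR50 geometries would be parsed from the table
features = [
    {"id": "unit", "ring": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"id": "big", "ring": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    {"id": "tri", "ring": [(0, 0), (1, 0), (0, 1)]},
]

largest = top_n_by_area(features, n=2)
print([f["id"] for f in largest])
```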

 



In [0]:
# Remove any previous widgets
dbutils.widgets.removeAll()

# Static list of area types
dbutils.widgets.dropdown("arealtype", "10", AREALTYPE, "Arealtype")

# Dynamic area codes from the DataFrame
arealkoder_rows = AR50_DF.select("områdeId").distinct().collect()
arealkoder = sorted([str(row["områdeId"]) for row in arealkoder_rows])
default_code = "4204" if "4204" in arealkoder else arealkoder[0]

dbutils.widgets.dropdown("arealkode", default_code, arealkoder, "Arealkode")

In [0]:
try:
    # Read widget values
    selected_type = int(dbutils.widgets.get("arealtype"))
    selected_kode = int(dbutils.widgets.get("arealkode"))

    print(f"🔍 Filtrerer etter arealtype = {selected_type}, områdeId = {selected_kode}")

    # Filter DataFrame
    # Note: limit(10) keeps 10 matching rows without sorting them by area
    filtered_df = AR50_DF.filter(
        (AR50_DF["arealtype"] == selected_type) &
        (AR50_DF["områdeId"] == selected_kode)
    ).limit(10)

    # Display DataFrame
    display(filtered_df.toPandas())

    # Map
    print("🗺️ Viser kart…")
    map_user = SedonaKepler.create_map()
    map_user.add_data(filtered_df.toPandas(), "filtered")
    display(map_user)

    # Show metrics
    difference_between_rows(AR50_DF, filtered_df)
    difference_between_size(filtered_df)

except Exception as e:
    print(f"❌ Feil under kjøring: {e}")


🔍 Filtrerer etter arealtype = 10, områdeId = 4204


| | gml_Id | Arealtype | lokalId | Navnerom | Informasjon | Jordbruk | Kartstandard | Kopiddato | områdeId | OriginalDatavert | Oppdateringsdato | Skogbonitet | Treslag | vegetasjonsdekke | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id8d05ad6f-5a2a-40c5-81c8-532c56668585 | 10 | 61189 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((7.763553214566632 58.06779879134709,... |
| 1 | id2eb1a453-9dc7-4029-b45e-8d48eb66877c | 10 | 61132 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((7.586751265363879 58.30451304519466,... |
| 2 | id01c5dba5-20b2-425b-aa87-0a8238a2cfd4 | 10 | 61012 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((7.846671887376022 58.136600943781154... |
| 3 | id8beb8163-976f-4be0-8498-f3caf5821cc3 | 10 | 60805 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.096493050631215 58.14088276661436,... |
| 4 | iddae31640-20d1-4bb7-b63a-1ad9e5cca5db | 10 | 60742 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.114464931284035 58.17384736220532,... |
| 5 | idd754cc6f-5091-4d6a-8fdf-4de6603d6b63 | 10 | 60731 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.104954686452327 58.168375447799754... |
| 6 | idb50f32df-8de9-49bf-a44e-df62a6ef451f | 10 | 60679 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.10152469172866 58.17548807992795, ... |
| 7 | idac3f34a1-4fca-4be3-adb9-db173c2cf659 | 10 | 60675 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.155746862104245 58.186170144129306... |
| 8 | id30af6d18-e28c-45fd-bc75-4e1d45760709 | 10 | 60527 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.120850988474084 58.22267162197809,... |
| 9 | id6ae5fd1c-3c44-40e6-8b8b-951f9c4e63c7 | 10 | 60519 | NO_NIBIO_AR50_2022_01 | AR50 fra AR5 årsversjon 2021. ARFJELL2 og N50 ... | | AR50 | 2024-11-17 | 4204 | NIBIO | 2022-01-18 | | | | POLYGON ((8.117765861097938 58.217458558665626... |


🗺️ Viser kart…
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


Out of range float values are not JSON compliant
Supporting this message is deprecated in jupyter-client 7, please make sure your message is JSON-compliant
  content = self.pack(content)


KeplerGl(data={'filtered': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'columns': ['gml_Id', 'Arealtype', 'lokal…

The original DataFrame has 179790 rows.
The filtered DataFrame has 10 rows.
The difference is 179780 rows.
25388
Original table size: 326.65 MB
Filtered table size: 24.79 KB
Difference in size: 334463.48 KB
Temporary table 'bronze.ar50_tmp' dropped.
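
The "Out of range float values are not JSON compliant" warning in the output above is commonly triggered by NaN or infinite floats in the data sent to the map widget. Whether that is the cause here is an assumption, although the empty Jordbruk, Skogbonitet, Treslag and vegetasjonsdekke cells make it plausible. A minimal sketch of one way to sanitize records before serialization follows; `sanitize_records` and the sample rows are hypothetical.

```python
import json
import math


def sanitize_records(records: list) -> list:
    """Replace NaN and infinite floats with None so the records are JSON compliant."""
    def clean(value):
        if isinstance(value, float) and not math.isfinite(value):
            return None
        return value

    return [{key: clean(value) for key, value in record.items()} for record in records]


# Hypothetical rows; NaN stands in for a missing numeric value
records = [
    {"gml_Id": "a", "Treslag": float("nan")},
    {"gml_Id": "b", "Treslag": 31.0},
]

clean_records = sanitize_records(records)
# allow_nan=False raises ValueError on NaN, so this call succeeding shows the data is compliant
payload = json.dumps(clean_records, allow_nan=False)
print(payload)
```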
