## Modelling

In this notebook we start modelling with data from our database.

- To do this, we connect to our local database using the `duckdb` library.
- Once a connection has been made, we can start retrieving data from it.


### Setup


In [1]:
# Import the right libraries
import duckdb
import polars as pl
from IPython.display import display

In [6]:
# Load the SQL magic extension first, then configure it
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = True
%config SqlMagic.displaycon = False

# Connect to the local DuckDB file and register the connection with the SQL magic
conn = duckdb.connect(database="../dsp-dagster/data_systems_project.duckdb")
%sql conn --alias duckdb
# Show all available tables
%sql SHOW ALL TABLES;

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


Unnamed: 0,database,schema,name,column_names,column_types,temporary


In [3]:
# We can use SQL magic to retrieve data from our DB like so:
# %sql res << SELECT * FROM joined.deployment_incident_vehicles_weather
# res

In [4]:
# Or the more Pythonic way:

# Retrieve the table that combines KNMI weather data with Fire Department data
df = conn.execute(
    """
    SELECT *
    FROM joined.incident_deployments_vehicles_weather
    """
).pl()

# Close the database connection
conn.close()

CatalogException: Catalog Error: Table with name incident_deployments_vehicles_weather does not exist!
Did you mean "temp.pg_catalog.pg_views"?
LINE 2:     SELECT * FROM joined.incident_deployments_vehicles_w...
                          ^

In [None]:
df.head()

Station_code,Date,Hour,Dd,Fh,Ff,Fx,T,T10n,Td,Sq,Q,Dr,Rh,P,Vv,N,U,Ww,Ix,M,R,S,O,Y,Incident_ID,Incident_Starttime,Incident_Endtime,Incident_Duration,Incident_Priority,Service_Area,Municipality,Damage_Type,LON,LAT,Incident_Endtime_Hour,Incident_Duration_Hour,Incident_Starttime_Minute,Incident_Endtime_Minute,Incident_Duration_Minute,Deployment_ID,Vehicle_Type,Vehicle_Role,Fire_Station,Fire_Station_Service_Status,Driving_Time_To_Incident,Vehicle
i64,date,i8,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,time,time,time,i64,str,str,str,f64,f64,i8,i8,i8,i8,i8,i64,str,str,str,str,str,str
240,2005-01-01,1,260,40,30,60,68,,57,0,0,0,0,10246,57,8,93,10,7,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,
240,2005-01-01,2,230,30,30,60,65,,52,0,0,0,0,10244,58,8,91,10,7,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,
240,2005-01-01,3,230,40,30,50,43,,34,0,0,0,0,10241,40,1,94,10,7,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,
240,2005-01-01,4,220,40,40,50,38,,32,0,0,0,0,10239,12,0,96,10,7,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,
240,2005-01-01,5,230,40,40,50,38,,34,0,0,0,0,10237,14,3,97,10,7,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,


### Feature Selection


In [5]:
# Select all rows where an incident happened
selected_df = df.filter(pl.col("Incident_ID").is_not_null())
display(selected_df.head())
print(selected_df.columns)

NameError: name 'df' is not defined