In [1]:
# import pandas, duckdb, and Path
import pandas as pd
import duckdb
from pathlib import Path

In [2]:
sql_query = """
SHOW TABLES;
"""

In [3]:
# create data folder if it doesn't exist
data_path = Path("data")
data_path.mkdir(parents=True, exist_ok=True)

with duckdb.connect("data/nyc_parking_violations.db") as conn:
    display(conn.execute(sql_query).df())

Unnamed: 0,name


In [4]:
# import data into duckdb
sql_query_import_parking_violation_codes = """
CREATE OR REPLACE TABLE parking_violation_codes AS
SELECT *
FROM read_csv_auto(
    'data/parking_violation_codes.csv',
    normalize_names=True
);
"""

sql_query_import_parking_violations_2023 = """
CREATE OR REPLACE TABLE parking_violations_2023 AS
SELECT *
FROM read_csv_auto(
    'data/parking_violations_2023.csv',
    normalize_names=True
);
"""

In [6]:
# let's connect to DuckDB and run the import queries
with duckdb.connect("data/nyc_parking_violations.db") as conn:
    conn.execute(sql_query_import_parking_violation_codes)
    conn.execute(sql_query_import_parking_violations_2023)
    display(conn.execute("SHOW TABLES;").df())

Unnamed: 0,name
0,parking_violation_codes
1,parking_violations_2023


In [7]:
# let's see what is in parking_violation_codes
with duckdb.connect("data/nyc_parking_violations.db") as conn:
    display(conn.execute("SELECT * FROM parking_violation_codes LIMIT 5;").df())

In [8]:
# let's see what columns are in the parking_violations_2023 table
with duckdb.connect("data/nyc_parking_violations.db") as conn:
    display(conn.execute("SELECT * FROM parking_violations_2023 LIMIT 5;").df())

Unnamed: 0,summons_number,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,vehicle_expiration_date,violation_location,...,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_number,feet_from_curb,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation
0,9010912681,CA,PAS,2022-10-11,17,SUBN,FORD,T,20220788,50.0,...,0700A,0400P,BLACK,,0,,0,,,
1,4858762841,NY,PAS,2023-08-21,36,4DSD,HONDA,V,0,,...,,,GY,,2003,,0,,,
2,4854645684,FL,PAS,2023-07-26,36,UT,BMW,V,0,,...,,,WHI,,2022,,0,,,
3,9044582707,NY,PAS,2023-04-10,21,SUBN,SUBAR,T,20231217,79.0,...,0900A,1030A,GY,,2017,,0,,,
4,9041503330,NY,PAS,2023-03-21,21,4DSD,CHEVR,T,20250320,26.0,...,1100A,1230A,BK,,2018,,0,,,


The `dbt init` command creates a new dbt project. Running it generates the files and directories you need to get started.

Run the following command in your terminal to create a new dbt project:

```bash
dbt init nyc_parking_violations
```

After you run the command, dbt prints the following output. Enter `1` to select duckdb as the database:

```bash
(venv) rellika@mackbook ~/D/C/DTB (main)> dbt init nyc_parking_violations
05:47:22  Running with dbt=1.11.2
05:47:22  Creating dbt configuration folder at /Users/rellika/.dbt
05:47:22  
Your new dbt project "nyc_parking_violations" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!

05:47:22  Setting up your profile.
Which database would you like to use?
[1] duckdb

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
```

It will create a new directory called `nyc_parking_violations` with the following structure:

```
nyc_parking_violations/
    analyses/
    macros/
    models/
    seeds/
    snapshots/
    tests/
    .gitignore
    dbt_project.yml
    README.md
```

- dbt_project.yml: The project configuration file, written in YAML (a recursive acronym for "YAML Ain't Markup Language", originally "Yet Another Markup Language"). It tells dbt where to find each component of the project and how to build it: the project name and version, the profile to connect with, the paths for models, tests, seeds, and macros, and how models are materialized. Every dbt project needs this file.
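As a quick sanity check after `dbt init`, the layout listed above can be verified from the notebook. This is a minimal sketch using only the standard library; `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names introduced here, and the list simply mirrors the structure shown above, which may differ between dbt versions.

```python
from pathlib import Path

# Entries taken from the directory listing above (assumption: your dbt
# version generates the same scaffold).
EXPECTED_ENTRIES = (
    "analyses",
    "macros",
    "models",
    "seeds",
    "snapshots",
    "tests",
    ".gitignore",
    "dbt_project.yml",
    "README.md",
)


def missing_entries(project_dir):
    """Return the expected files/folders that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("nyc_parking_violations")` from the folder where `dbt init` was run should return an empty list if the scaffold matches the listing.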

We also need to set up the `profiles.yml` file so dbt can connect to duckdb. `cd` into `nyc_parking_violations` and create a file called `profiles.yml` in that folder with the following content:

```yaml
default:
    outputs:
        dev:
            type: duckdb
    target: dev
```
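The profile above does not set a database file. To my understanding, dbt-duckdb then falls back to an in-memory database, so a `path` key is likely needed if dbt should write into the same DuckDB file the notebook queries. The sketch below generates such a profile from Python. `render_profile` and `write_profile` are hypothetical helpers, the `path` value is an assumption about where your database lives relative to the project folder, and the profile name has to match the `profile:` entry in `dbt_project.yml`.

```python
from pathlib import Path


def render_profile(profile_name, db_path, target="dev"):
    """Build profiles.yml text for a single dbt-duckdb target."""
    lines = [
        f"{profile_name}:",
        "    outputs:",
        f"        {target}:",
        "            type: duckdb",
        # Assumption: `path` points at the DuckDB file dbt should use.
        f"            path: {db_path}",
        f"    target: {target}",
    ]
    return "\n".join(lines) + "\n"


def write_profile(project_dir, profile_name, db_path):
    """Write profiles.yml into project_dir and return the file path."""
    profile_file = Path(project_dir) / "profiles.yml"
    profile_file.write_text(render_profile(profile_name, db_path))
    return profile_file
```

For example, `write_profile("nyc_parking_violations", "default", "../data/nyc_parking_violations.db")` would point the `dev` target at the database file created earlier, assuming the `data` folder sits next to the project folder.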

**DBT COMMANDS TO KNOW**

- `dbt debug`: Tests the connection between dbt and your data warehouse. It checks that dbt can connect to the database specified in `profiles.yml` and that the configuration is valid, so run it before running models or other dbt commands.
- `dbt compile`: Compiles your models into executable SQL without running them against the database. It resolves the Jinja templating in each model file and generates the final SQL, which lets you inspect it before execution.
- `dbt run`: Executes the compiled models against your data warehouse, creating or updating the tables and views defined in your dbt project. This is the primary command for building and transforming data with dbt.
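The three commands above are typically run in that order. As a rough sketch, they could be chained from Python with `subprocess`, stopping at the first failure. `build_dbt_command` and `run_dbt_steps` are hypothetical helpers, and the sketch assumes the `dbt` executable is on your `PATH` and that `profiles.yml` lives in the project folder.

```python
import subprocess

# Order taken from the list above: check the setup, compile, then build.
DBT_STEPS = ("debug", "compile", "run")


def build_dbt_command(step, project_dir=".", profiles_dir="."):
    """Return the argument list for one dbt subcommand."""
    return [
        "dbt", step,
        "--project-dir", project_dir,
        "--profiles-dir", profiles_dir,
    ]


def run_dbt_steps(project_dir=".", profiles_dir=".", runner=subprocess.run):
    """Run each step in order; return the first failing step, or None."""
    for step in DBT_STEPS:
        result = runner(build_dbt_command(step, project_dir, profiles_dir))
        if result.returncode != 0:
            return step  # stop early so later steps don't run on a bad setup
    return None
```

The `runner` parameter exists only so the function can be exercised without dbt installed; in normal use, `run_dbt_steps("nyc_parking_violations", "nyc_parking_violations")` would call the real CLI.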

In [10]:
# select * from ref_model
sql_query_ref_model = """
SELECT * FROM ref_model;
"""
with duckdb.connect("data/nyc_parking_violations.db") as conn:
    display(conn.execute(sql_query_ref_model).df())

Unnamed: 0,violation_count
0,97
