**Dataset & API**:

- Base API URL: https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api
- Data format: Paginated JSON (1,000 records per page)
- API Pagination: Stop when an empty page is returned
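
The stop rule above (keep requesting pages until one comes back empty) can be sketched without any network access. This is a minimal illustration, not the homework solution: `iterate_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real API with tiny pages instead of 1,000 records per page.

```python
from typing import Callable, Dict, Iterator, List


def iterate_pages(
    fetch_page: Callable[[int], List[dict]], first_page: int = 1
) -> Iterator[List[dict]]:
    """Yield pages in order until the fetcher returns an empty page."""
    page = first_page
    while True:
        records = fetch_page(page)
        if not records:  # an empty page signals the end of the data
            break
        yield records
        page += 1


# Hypothetical stand-in for the API: three small pages, then nothing.
_FAKE_PAGES: Dict[int, List[dict]] = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 3}, {"id": 4}],
    3: [{"id": 5}],
}


def fake_fetch_page(page: int) -> List[dict]:
    """Return the records for a page, or an empty list past the last page."""
    return _FAKE_PAGES.get(page, [])


pages = list(iterate_pages(fake_fetch_page))
all_records = [record for page in pages for record in page]
print(len(pages), len(all_records))
```

In a real run, `fetch_page` would issue an HTTP request with the page number; the loop itself would stay the same.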

**All code in homework should be run on Google Colab**

**Question 1**: dlt Version

!pip install "dlt[duckdb]"  
!dlt --version

=> 1.9.0

**Define & Run the Pipeline (NYC Taxi API)**

Steps:
1. Use the `@dlt.resource` decorator to define the API source
2. Implement automatic pagination using dlt's built-in REST client
3. Load the extracted data into DuckDB for querying

In [None]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

@dlt.resource(name="rides")
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
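        # The API does not report a total page count, so total_path is None
        # and pagination stops at the first empty page.
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )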
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page


pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)

load_info = pipeline.run(ny_taxi)
print(load_info)

**Question 2**: How many tables were created?

=> 4

In [None]:
import duckdb
from google.colab import data_table
data_table.enable_dataframe_formatter()

# The pipeline created '<pipeline_name>.duckdb' in the working directory, so connect to it directly
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset; keep the result so it can be displayed after the connection is closed
tables = conn.sql("DESCRIBE").df()

conn.close()

tables

**Question 3**: What is the total number of records extracted?

=> 10000

In [None]:
# Load the rides table into a pandas DataFrame; its length is the record count
df = pipeline.dataset(dataset_type="default").rides.df()
print(len(df))
df

**Question 4**: What is the average trip duration?
  
=> 12.3049 minutes

In [None]:
with pipeline.sql_client() as client:
    res = client.execute_sql(
        """
        SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
        FROM rides;
        """
    )
    # res is a list of row tuples; the single row holds the average duration in minutes
    print(res)
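
To make the arithmetic behind that query concrete, here is a small pure-Python sketch of the same calculation. It assumes DuckDB's `date_diff('minute', ...)` counts the minute boundaries crossed between two timestamps rather than rounding elapsed seconds. The function names and the two sample trips are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime
from typing import List, Tuple


def minute_boundaries(start: datetime, end: datetime) -> int:
    """Count the minute boundaries crossed going from start to end."""
    start_minute = start.replace(second=0, microsecond=0)
    end_minute = end.replace(second=0, microsecond=0)
    return int((end_minute - start_minute).total_seconds() // 60)


def average_duration_minutes(trips: List[Tuple[datetime, datetime]]) -> float:
    """Average the per-trip minute differences, like AVG(date_diff(...))."""
    durations = [minute_boundaries(pickup, dropoff) for pickup, dropoff in trips]
    return sum(durations) / len(durations)


# Hypothetical (pickup, dropoff) pairs, not real records from the API.
sample_trips = [
    (datetime(2009, 6, 14, 23, 23, 0), datetime(2009, 6, 14, 23, 48, 0)),
    (datetime(2009, 6, 18, 17, 35, 30), datetime(2009, 6, 18, 17, 43, 10)),
]

average = average_duration_minutes(sample_trips)
print(average)
```

Under that assumption, a trip from 00:00:59 to 00:01:00 counts as one minute even though only one second elapsed, which is why the result can differ slightly from an average computed on exact elapsed time.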