# Homework - Module 03

The goal of this homework is to create a simple training pipeline that uses MLflow to track experiments and register the best model, and to run that pipeline with an orchestration tool.

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), specifically the **Yellow** taxi data for March 2023.

### Question 1. Select the Tool

We will use the same orchestrator we used when completing the module: `Prefect`.

### Question 2. Version

The version of our orchestrator is:

In [1]:
# Prefect version
!prefect --version

3.4.6


### Question 3. Creating a pipeline

Let's download and read the March 2023 Yellow taxi trips data.

In [2]:
# Download March data
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 53.5M  100 53.5M    0     0  1197k      0  0:00:45  0:00:45 --:--:-- 1183k


In [3]:
# List data file
!ls -lh yellow*

-rw-r--r--@ 1 cm-mboulou-mac  staff    54M Jun 13 06:37 yellow_tripdata_2023-03.parquet


In [4]:
# Necessary imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

# Python version
!python --version

Python 3.9.18


In [5]:
# Read the data
df = pd.read_parquet("yellow_tripdata_2023-03.parquet")
# Number of observations
print(f"Number of rows: {len(df)}.")

Number of rows: 3403766.



### Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously. 

This is what we used (adjusted for yellow dataset):

```python
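def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips lasting 1 to 60 minutes; copy so the assignment below
    # does not trigger a SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Treat location IDs as categorical (string) features
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df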
```
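To make the duration filter concrete, here is a minimal sketch of the same rule using plain `datetime` objects instead of a DataFrame. The four trips are made up for illustration, and `duration_minutes` is a hypothetical helper, not part of the homework code.

```python
from datetime import datetime


def duration_minutes(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in minutes, mirroring the computation in read_dataframe."""
    return (dropoff - pickup).total_seconds() / 60


# Made-up (pickup, dropoff) pairs, for illustration only
trips = [
    (datetime(2023, 3, 1, 8, 0, 0), datetime(2023, 3, 1, 8, 0, 30)),     # 0.5 min
    (datetime(2023, 3, 1, 9, 0, 0), datetime(2023, 3, 1, 9, 15, 0)),     # 15 min
    (datetime(2023, 3, 1, 10, 0, 0), datetime(2023, 3, 1, 11, 0, 0)),    # 60 min
    (datetime(2023, 3, 1, 12, 0, 0), datetime(2023, 3, 1, 13, 30, 0)),   # 90 min
]

durations = [duration_minutes(pickup, dropoff) for pickup, dropoff in trips]

# Same bounds as the DataFrame filter: both ends are inclusive
kept = [minutes for minutes in durations if 1 <= minutes <= 60]
print(kept)  # [15.0, 60.0]
```

Both bounds are inclusive, so a trip of exactly 60 minutes is kept while the 30-second and 90-minute trips are dropped.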

To run our workflow with Prefect, we will:

- Launch `MLflow`:
```sh
mlflow server \
    --backend-store-uri sqlite:///mlflow.db
```
- Launch `Prefect`:
```sh
prefect server start
```
- Configure Prefect locally:
```sh
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
```
- Run the orchestration script:
```sh
python homework.py --year=2023 --month=3
```
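The contents of `homework.py` are not shown in this notebook. As a rough sketch of how the `--year` and `--month` flags might be turned into the data URL, assuming the script uses `argparse` (an assumption) and the same CloudFront base URL as the download cell above; `build_url` and `parse_args` are hypothetical names:

```python
import argparse

# Base URL taken from the curl command used earlier in this notebook
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_url(year: int, month: int) -> str:
    """Build the Yellow taxi parquet URL; the month is zero-padded to two digits."""
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


def parse_args(argv=None):
    """Parse --year and --month (hypothetical CLI; the real script may differ)."""
    parser = argparse.ArgumentParser(description="Train a taxi trip duration model.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    return parser.parse_args(argv)


# Same flags as in the command above, passed explicitly for the demo
args = parse_args(["--year=2023", "--month=3"])
url = build_url(args.year, args.month)
print(url)
```

In a real script, `parse_args()` would be called without arguments so that it reads `sys.argv`.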

In [6]:
# Run our orchestration script
!python homework.py --year=2023 --month=3

22:35:50.224 | INFO    | Flow run 'sociable-donkey' - Beginning flow run 'sociable-donkey' for flow 'main-flow'
22:35:50.226 | INFO    | Flow run 'sociable-donkey' - View at http://127.0.0.1:4200/runs/flow-run/dc814097-6d49-44fc-b850-68b1f5bd0970
2025/06/14 22:35:50 INFO mlflow.tracking.fluent: Experiment with name 'nyc-taxi-experiment' does not exist. Creating a new experiment.
Number of rows before preprocessing: 3403766.
Number of rows after preprocessing: 3316216.
22:36:42.797 | INFO    | Task run 'read_dataframe-551' - Finished in state Completed()
🏃 View run adorable-chimp-52 at: http://127.0.0.1:5000/#/experiments/1/runs/5c67f7a6b3a84f469fff5d245c993ce0
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/1
22:37:24.374 | INFO    | Task run 'train_model-1ce' - Finished in state Completed()
Model Intercept: 24.78
Successfully registered model 'nyc-yellow-taxi-regressor'.
2025

After preprocessing, the dataset has `3,316,216` rows.
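As a quick sanity check on the two row counts printed in the log, the share of trips removed by the duration filter can be worked out directly:

```python
# Row counts as printed by the pipeline run
rows_before = 3_403_766
rows_after = 3_316_216

removed = rows_before - rows_after
removed_pct = 100 * removed / rows_before

print(f"Removed {removed:,} rows ({removed_pct:.2f}%).")
# Removed 87,550 rows (2.57%).
```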

### Question 5. Train a model

We now train a linear regression model in our pipeline, using the same code as in homework 1:

* Fit a dict vectorizer.
* Train a linear regression with default parameters.
* Use pickup and drop-off locations as separate features (do not combine them).
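To illustrate what using the two locations separately means, the sketch below mimics in pure Python how a dict vectorizer one-hot encodes string-valued fields: one column per `field=value` pair. It is a simplified stand-in written for illustration, not scikit-learn's implementation, and the rows are made up.

```python
# Made-up rows; location IDs are strings, as produced by read_dataframe
rows = [
    {"PULocationID": "161", "DOLocationID": "237"},
    {"PULocationID": "237", "DOLocationID": "161"},
    {"PULocationID": "161", "DOLocationID": "161"},
]

# One column per distinct (field, value) pair, named "field=value" and sorted
feature_names = sorted({f"{key}={value}" for row in rows for key, value in row.items()})
column_index = {name: i for i, name in enumerate(feature_names)}


def one_hot(row):
    """Encode one row as a dense 0/1 vector over feature_names."""
    vector = [0.0] * len(feature_names)
    for key, value in row.items():
        vector[column_index[f"{key}={value}"]] = 1.0
    return vector


matrix = [one_hot(row) for row in rows]

print(feature_names)
for vector in matrix:
    print(vector)
```

Because pickup and drop-off are separate fields, location `161` gets two distinct columns (`PULocationID=161` and `DOLocationID=161`), so the model learns a separate coefficient for each role.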

The intercept of the model is approximately `24.77` (the run above logged `24.78`).

### Question 6. Register the model 

The model is trained and saved with MLflow.

In the logged model's `MLmodel` file, the `model_size_bytes` field is `4,534`.
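One way to read this field without opening the file by hand is to scan the `MLmodel` file for the `model_size_bytes` key. The sketch below assumes the key sits on its own top-level line. `read_model_size_bytes` is a hypothetical helper, and the demo writes an abbreviated, made-up `MLmodel` file to a temporary directory rather than reading a real artifact path.

```python
import tempfile
from pathlib import Path


def read_model_size_bytes(mlmodel_path):
    """Return the integer value of 'model_size_bytes', or None if the key is absent."""
    for line in Path(mlmodel_path).read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model_size_bytes":
            return int(value.strip())
    return None


# Demo on an abbreviated, made-up MLmodel file (a real one has more fields)
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "MLmodel"
    demo_file.write_text(
        "flavors:\n"
        "  python_function:\n"
        "    loader_module: mlflow.sklearn\n"
        "model_size_bytes: 4534\n"
    )
    size = read_model_size_bytes(demo_file)

print(size)  # 4534
```

A YAML parser would be more robust for nested or quoted values; the line scan keeps the example dependency-free.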

---