# Train a Regression Model

In this notebook we will train a Random Forest regression model on the Yellow Taxi dataset. We will use only one month of data for training and keep only a small number of features.

We want the model to predict the duration of a trip. This can be useful for taxi drivers to plan their trips, for customers to know how long a trip will take, and for taxi companies to plan their fleet. The first two use cases need real-time predictions because the duration of a trip is not known in advance. The last one could run in batch mode, as it is more of an analytical task that doesn't need to be done in real time.

Additionally, we will use MLflow to track the model training and log the model artifacts.

In [1]:
import os
from dotenv import load_dotenv
import mlflow

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.pipeline import make_pipeline

In [2]:
year = 2021
month = 1
color = "yellow"

In [3]:
# Download the data
if not os.path.exists(f"./data/{color}_tripdata_{year}-{month:02d}.parquet"):
    os.system(f"wget -P ./data https://d37ci6vzurychx.cloudfront.net/trip-data/{color}_tripdata_{year}-{month:02d}.parquet")
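
If `wget` is not available (for example on Windows), the same download could be sketched with the standard library instead. The helper names `trip_data_filename` and `download_if_missing` are hypothetical and not part of the original notebook; the URL pattern is the one used in the cell above.

```python
import os
import urllib.request

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_filename(color: str, year: int, month: int) -> str:
    """Build the parquet file name, e.g. yellow_tripdata_2021-01.parquet."""
    return f"{color}_tripdata_{year}-{month:02d}.parquet"


def download_if_missing(color: str, year: int, month: int, data_dir: str = "./data") -> str:
    """Download one month of trip data unless the file already exists locally."""
    filename = trip_data_filename(color, year, month)
    path = os.path.join(data_dir, filename)
    if not os.path.exists(path):
        os.makedirs(data_dir, exist_ok=True)
        urllib.request.urlretrieve(f"{BASE_URL}/{filename}", path)
    return path
```

Calling `download_if_missing(color, year, month)` would then return the local path that can be passed to `pd.read_parquet`.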

In [4]:
# Load the data

df = pd.read_parquet(f"./data/{color}_tripdata_{year}-{month:02d}.parquet")

In [5]:
df.shape

(1369769, 19)

Now we will set up the connection to MLflow. For that we have to create a `.env` file that contains the URI of the MLflow server running on GCP (this will be `http://<external-ip>:5000`). You can simply run:

```bash
echo "MLFLOW_TRACKING_URI=http://<external-ip>:5000" > .env
```

We will also create an experiment to track the model and its metrics.

In [6]:
load_dotenv()

MLFLOW_TRACKING_URI=os.getenv("MLFLOW_TRACKING_URI")
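
Note that `os.getenv` returns `None` when a variable is missing, which is easy to overlook. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

After `load_dotenv()`, one could then write `MLFLOW_TRACKING_URI = require_env("MLFLOW_TRACKING_URI")` so that a missing or empty entry in `.env` is reported immediately.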

In [7]:
# Set up the connection to MLflow
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Setup the MLflow experiment 
mlflow.set_experiment("yellow-taxi-duration")

<Experiment: artifact_location='gs://mlflow-artifacts-2023/taxi/3', creation_time=1689371108011, experiment_id='3', last_update_time=1689371108011, lifecycle_stage='active', name='yellow-taxi-duration', tags={}>

If everything went well, you should now see the experiment in the MLflow UI at `http://<external-ip>:5000`.

Let's start by taking a look at the data:

In [8]:
df.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1.0 | 2.1 | 1.0 | N | 142 | 43 | 2 | 8.0 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | |
| 1 | 1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1.0 | 0.2 | 1.0 | N | 238 | 151 | 2 | 3.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 4.3 | 0.0 | |
| 2 | 1 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1.0 | 14.7 | 1.0 | N | 132 | 165 | 1 | 42.0 | 0.5 | 0.5 | 8.65 | 0.0 | 0.3 | 51.95 | 0.0 | |
| 3 | 1 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 0.0 | 10.6 | 1.0 | N | 138 | 132 | 1 | 29.0 | 0.5 | 0.5 | 6.05 | 0.0 | 0.3 | 36.35 | 0.0 | |
| 4 | 2 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 1.0 | 4.94 | 1.0 | N | 68 | 33 | 1 | 16.5 | 0.5 | 0.5 | 4.06 | 0.0 | 0.3 | 24.36 | 2.5 | |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1369769 entries, 0 to 1369768
Data columns (total 19 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   VendorID               1369769 non-null  int64         
 1   tpep_pickup_datetime   1369769 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  1369769 non-null  datetime64[ns]
 3   passenger_count        1271417 non-null  float64       
 4   trip_distance          1369769 non-null  float64       
 5   RatecodeID             1271417 non-null  float64       
 6   store_and_fwd_flag     1271417 non-null  object        
 7   PULocationID           1369769 non-null  int64         
 8   DOLocationID           1369769 non-null  int64         
 9   payment_type           1369769 non-null  int64         
 10  fare_amount            1369769 non-null  float64       
 11  extra                  1369769 non-null  float64       
 12  mta_tax                13697

In [10]:
# Look for missing values
df.isnull().sum()

VendorID                       0
tpep_pickup_datetime           0
tpep_dropoff_datetime          0
passenger_count            98352
trip_distance                  0
RatecodeID                 98352
store_and_fwd_flag         98352
PULocationID                   0
DOLocationID                   0
payment_type                   0
fare_amount                    0
extra                          0
mta_tax                        0
tip_amount                     0
tolls_amount                   0
improvement_surcharge          0
total_amount                   0
congestion_surcharge       98352
airport_fee              1369764
dtype: int64

Nearly all features seem to have the correct type, and missing values occur only in features that we will not use for the model training. For predicting the duration of a trip, we will use the following features:

- `PULocationID`: The pickup location ID
- `DOLocationID`: The dropoff location ID
- `trip_distance`: The distance of the trip in miles

But first we have to calculate the duration of the trip in minutes because it is our target. For that we will use the `tpep_pickup_datetime` and `tpep_dropoff_datetime` columns. To remove outliers, we will also drop all trips that are shorter than 1 minute or longer than 1 hour.

Additionally, we will convert `DOLocationID` and `PULocationID` to categorical (string) features and combine them into a new feature `trip_route` that contains the route of the trip.
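
As a quick sanity check of the duration logic, the first two trips shown in the `df.head()` output above can be worked through by hand with the standard library. This is only an illustration of the arithmetic and of the 1 to 60 minute filter; the notebook itself performs the same computation vectorized in pandas. The function names `duration_minutes` and `keep_trip` are hypothetical.

```python
from datetime import datetime


def duration_minutes(pickup: str, dropoff: str) -> float:
    """Trip duration in minutes from two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 60


def keep_trip(minutes: float) -> bool:
    """Keep trips lasting between 1 and 60 minutes (inclusive)."""
    return 1 <= minutes <= 60


# First row of the dataset: 6 minutes and 2 seconds
first = duration_minutes("2021-01-01 00:30:10", "2021-01-01 00:36:12")
# Second row: 59 seconds, so the filter drops it
second = duration_minutes("2021-01-01 00:51:20", "2021-01-01 00:52:19")

print(round(first, 6), keep_trip(first))    # 6.033333 True
print(round(second, 6), keep_trip(second))  # 0.983333 False
```

The second trip lasts less than a minute, so it is removed by the filter.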

In [11]:
features = ["PULocationID", "DOLocationID", "trip_distance"]
target = 'duration'

In [12]:
# Calculate the trip duration in minutes and drop trips shorter than 1 minute or longer than 60 minutes
def calculate_trip_duration_in_minutes(df):
    df["trip_duration_minutes"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60
    df = df[(df["trip_duration_minutes"] >= 1) & (df["trip_duration_minutes"] <= 60)]
    return df   

In [13]:
def preprocess(df):
    df = df.copy()
    df = calculate_trip_duration_in_minutes(df)
    categorical_features = ["PULocationID", "DOLocationID"]
    df[categorical_features] = df[categorical_features].astype(str)
    df['trip_route'] = df["PULocationID"] + "_" + df["DOLocationID"]
    df = df[['trip_route', 'trip_distance', 'trip_duration_minutes']]
    return df

In [14]:
df_processed = preprocess(df)


In [15]:
df_processed.head()

| | trip_route | trip_distance | trip_duration_minutes |
|---|---|---|---|
| 0 | 142_43 | 2.1 | 6.033333 |
| 2 | 132_165 | 14.7 | 27.6 |
| 3 | 138_132 | 10.6 | 15.216667 |
| 4 | 68_33 | 4.94 | 16.533333 |
| 5 | 224_68 | 1.6 | 8.016667 |


Now that we have the dataframe that we want to train our model on, we need to split it into a train and a test set. We will use 80% of the data for training and 20% for testing.

In [16]:
y=df_processed["trip_duration_minutes"]
X=df_processed.drop(columns=["trip_duration_minutes"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

In [17]:
X_train = X_train.to_dict(orient="records")
X_test = X_test.to_dict(orient="records")

Now we can train the model and track the run with MLflow. We will set tags on the run to make it easier to find later:

- `model`: `random forest regressor`
- `dataset`: `yellow-taxi`
- `developer`: `your-name`
- `train_size`: The size of the train set
- `test_size`: The size of the test set
- `features`: The features that we used for training
- `target`: The target that we want to predict
- `year`: The year of the data
- `month`: The month of the data

We could also log the model's hyperparameters, such as `max_depth` and `min_samples_leaf`, with `mlflow.log_params`, but we skip that here.

And finally we will log the metrics:

- `rmse`: The root mean squared error
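
For reference, the RMSE is the square root of the mean squared difference between actual and predicted durations, so it is expressed in minutes like the target. A small dependency-free sketch with made-up numbers (the values below are illustrative and not taken from the dataset; the function name `rmse` is only used for this example):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up durations in minutes: the absolute errors are 3, 4, 0 and 0
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [13.0, 16.0, 30.0, 40.0]
print(rmse(actual, predicted))  # 2.5
```

Because the errors are squared before averaging, a few badly mispredicted trips influence the RMSE more than many small errors.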

We will also log the model artifacts. For that, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` needs to point to the service account JSON key that we downloaded earlier.

In [18]:
SA_KEY=os.getenv("GOOGLE_SA_KEY")
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = SA_KEY

We will now pass the `trip_distance` and the `trip_route` of each trip as a dictionary to the [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) from `sklearn`, which transforms the records into a sparse matrix: the categorical `trip_route` is one-hot encoded and the numeric `trip_distance` is kept as it is.

In [18]:
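with mlflow.start_run():

    # Tags make the run easier to find and filter in the MLflow UI
    tags = {
        "model": "random forest regressor",
        "developer": "Johann",
        "dataset": f"{color}-taxi",
        "year": year,
        "month": month,
        "train_size": len(X_train),
        "test_size": len(X_test),
        "features": features,
        "target": target
    }
    mlflow.set_tags(tags)

    # Vectorize the feature dictionaries, then fit the random forest
    pipeline = make_pipeline(
        DictVectorizer(),
        RandomForestRegressor(max_depth=10, random_state=42, min_samples_leaf=3)
    )
    pipeline.fit(X_train, y_train)

    # Evaluate on the held-out test set. Taking the square root of the MSE
    # avoids the `squared=False` argument, which is deprecated in newer
    # scikit-learn releases.
    y_pred = pipeline.predict(X_test)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    mlflow.log_metric("rmse", rmse)

    # Store the fitted pipeline (vectorizer and model) as a run artifact
    mlflow.sklearn.log_model(pipeline, "model")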

Now you should see a new run with a new run ID under the experiment in MLflow. You can also see the pipeline and the model in the UI under `Artifacts`.

In [19]:
pipeline = make_pipeline(DictVectorizer(),RandomForestRegressor(max_depth=10,random_state=42,min_samples_leaf=3))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

In [20]:
test = X_test[0]
test

{'trip_route': '142_237', 'trip_distance': 1.12}

In [21]:
val = pipeline.predict(test)
val

array([6.52514782])

In [22]:
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE without the deprecated `squared=False` argument
r2_scr = r2_score(y_test, y_pred)

print('rmse: ', rmse)
print('R2 score: ', r2_scr)

rmse:  4.4331545356692965
R2 score:  0.7504935422197069


In [24]:
def data_split():
    df_processed = preprocess(df)
    y=df_processed["trip_duration_minutes"]
    X=df_processed.drop(columns=["trip_duration_minutes"])
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
    return X_train, X_test, y_train, y_test


In [27]:
X_train, X_test, y_train, y_test = data_split()
y_test.shape

(268651,)

You should now see your run in the MLflow UI under the created experiment. You can also see the logged tags, the metric and the saved model.

![mlflow-ui](./images/mlflow-run.png)

The UI also shows the code needed to load the model in an API or script, which works as long as the application has access to MLflow.

## Optuna

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

In [None]:
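from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import optuna

# The pipeline expects a list of feature dictionaries, as in the training above
X_train_dicts = X_train.to_dict(orient="records")

# Define the objective function for optimization
def objective(trial):
    # Define the search space for hyperparameters.
    # The n_estimators range is an assumption; the logged trials below
    # only show that multiples of 10 were sampled.
    n_estimators = trial.suggest_int('n_estimators', 10, 100, step=10)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

    # Create the random forest model with the suggested hyperparameters
    rf = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf
    )
    # One-hot encode trip_route before the forest sees the data
    model = make_pipeline(DictVectorizer(), rf)

    # Perform 5-fold cross-validation (the default score for regressors is R^2)
    scores = cross_val_score(model, X_train_dicts, y_train, cv=5)
    score_mean = scores.mean()

    return score_mean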

# Create an Optuna study object
study = optuna.create_study(direction='maximize')

# Optimize the objective function
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and best score
print("Best Hyperparameters:", study.best_params)
print("Best Score:", study.best_value)

[I 2023-07-14 18:24:35,479] A new study created in memory with name: no-name-e79c8ef0-7e7e-40be-ac2d-6d03ce9f9b50


[I 2023-07-14 18:26:51,870] Trial 0 finished with value: 0.758359387392949 and parameters: {'n_estimators': 80, 'max_depth': 5, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.758359387392949.
[I 2023-07-14 18:29:17,012] Trial 1 finished with value: 0.7890012376470517 and parameters: {'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 7}. Best is trial 1 with value: 0.7890012376470517.
[I 2023-07-14 18:31:16,747] Trial 2 finished with value: 0.7634984538513665 and parameters: {'n_estimators': 60, 'max_depth': 6, 'min_samples_leaf': 9}. Best is trial 1 with value: 0.7890012376470517.
[I 2023-07-14 18:31:59,550] Trial 3 finished with value: 0.7474513035119085 and parameters: {'n_estimators': 30, 'max_depth': 4, 'min_samples_leaf': 1}. Best is trial 1 with value: 0.7890012376470517.
[I 2023-07-14 18:32:45,095] Trial 4 finished with value: 0.7697038361240376 and parameters: {'n_estimators': 20, 'max_depth': 7, 'min_samples_leaf': 9}. Best is trial 1 with value: 0.78900123764705

KeyboardInterrupt: 