### Predicting the duration of a ride, using linear regression. Train on Jan '21 data, validate with Feb '21.

Week 1 Homework for MLops-zoomcamp, DataTalks.Club https://github.com/DataTalksClub/mlops-zoomcamp

Code created by Joshua Harvey, 2022
joshuasharvey@gmail.com

https://github.com/hirschland

Running on AWS EC2, ubuntu

Data from NYC Trip records, https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [11]:
import pandas as pd
import numpy as np

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

# Q1. Download the data (Jan/Feb 2021 For-Hire Vehicle Trip Records)

In [5]:
janfile = '~/notebooks/data/fhv_tripdata_2021-01.parquet'
dfjan = pd.read_parquet(janfile)

In [7]:
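dfjan.info  # no parentheses: this echoes the bound method's repr (a preview of the frame) instead of calling info()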

<bound method DataFrame.info of         dispatching_base_num     pickup_datetime    dropOff_datetime  \
0                     B00009 2021-01-01 00:27:00 2021-01-01 00:44:00   
1                     B00009 2021-01-01 00:50:00 2021-01-01 01:07:00   
2                     B00013 2021-01-01 00:01:00 2021-01-01 01:51:00   
3                     B00037 2021-01-01 00:13:09 2021-01-01 00:21:26   
4                     B00037 2021-01-01 00:38:31 2021-01-01 00:53:44   
...                      ...                 ...                 ...   
1154107               B03266 2021-01-31 23:43:03 2021-01-31 23:51:48   
1154108               B03284 2021-01-31 23:50:27 2021-02-01 00:48:03   
1154109      B03285          2021-01-31 23:13:46 2021-01-31 23:29:58   
1154110      B03285          2021-01-31 23:58:03 2021-02-01 00:17:29   
1154111               B03321 2021-01-31 23:39:00 2021-02-01 00:15:00   

         PUlocationID  DOlocationID SR_Flag Affiliated_base_number  
0                 NaN           Na

### There are 1,154,112 records for January

# Q2. Computing duration

In [12]:
dfjan['duration'] = dfjan.dropOff_datetime - dfjan.pickup_datetime
# convert timedelta to total minutes (seconds / 60)
dfjan.duration = dfjan.duration.apply(lambda td: td.total_seconds() / 60)

np.mean(dfjan.duration)

19.1672240937939

### Average trip duration in January is about 19.17 mins
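The same conversion can be done without a row-by-row `apply`. The sketch below is an alternative, not what the cell above ran: it uses the `.dt.total_seconds()` accessor on a toy frame whose two rows copy the timestamps of records 0 and 3 from the January preview. The name `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the FHV data.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:27:00", "2021-01-01 00:13:09"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:44:00", "2021-01-01 00:21:26"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts the whole column to float seconds at once.
toy["duration"] = (toy.dropOff_datetime - toy.pickup_datetime).dt.total_seconds() / 60

print(toy.duration.tolist())
```

On a frame with over a million rows, the vectorized accessor is usually much faster than calling a Python lambda per row.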

# Data preparation

In [15]:
# how many records have a trip duration outside the 1 to 60 minute range?
dfjan[(dfjan.duration < 1) | (dfjan.duration > 60)].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44286 entries, 2 to 1154031
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   dispatching_base_num    44286 non-null  object        
 1   pickup_datetime         44286 non-null  datetime64[ns]
 2   dropOff_datetime        44286 non-null  datetime64[ns]
 3   PUlocationID            13027 non-null  float64       
 4   DOlocationID            29973 non-null  float64       
 5   SR_Flag                 0 non-null      object        
 6   Affiliated_base_number  44174 non-null  object        
 7   duration                44286 non-null  float64       
dtypes: datetime64[ns](2), float64(3), object(3)
memory usage: 3.0+ MB


### Dropping 44,286 records

In [17]:
dfjan = dfjan[(dfjan.duration >= 1) & (dfjan.duration <= 60)]
dfjan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1109826 entries, 0 to 1154111
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1109826 non-null  object        
 1   pickup_datetime         1109826 non-null  datetime64[ns]
 2   dropOff_datetime        1109826 non-null  datetime64[ns]
 3   PUlocationID            182818 non-null   float64       
 4   DOlocationID            961919 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1109053 non-null  object        
 7   duration                1109826 non-null  float64       
dtypes: datetime64[ns](2), float64(3), object(3)
memory usage: 76.2+ MB
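A quick way to sanity-check a filter like this is to build the boolean mask once and count both sides of it. The sketch below uses made-up durations and `Series.between`, which is inclusive on both ends by default and so matches the `>= 1` and `<= 60` conditions above. The last line checks the record counts reported in this notebook.

```python
import pandas as pd

# Made-up durations in minutes, including both boundary values.
durations = pd.Series([0.5, 1.0, 17.0, 60.0, 110.0])

keep = durations.between(1, 60)   # True where 1 <= duration <= 60
n_kept = int(keep.sum())          # rows that survive the filter
n_dropped = int((~keep).sum())    # rows outside the range

print(n_kept, n_dropped)

# Counts from the notebook: total records minus dropped equals kept.
assert 1_154_112 - 44_286 == 1_109_826
```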


# Q3. Missing values

In [19]:
dfjan = dfjan.fillna(-1)
np.mean(dfjan.PUlocationID < 0)

0.8352732770722617

### About 83.5% of pickup location IDs were missing (and replaced with -1)
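The missing share can also be measured directly with `isna()` before any filling, which avoids relying on the sentinel value. This is a minimal sketch on a made-up frame (the name `toy` and its values are illustrative only). It shows that both approaches give the same fraction as long as -1 never occurs as a real ID.

```python
import numpy as np
import pandas as pd

# Made-up location IDs; NaN marks a missing value.
toy = pd.DataFrame({
    "PUlocationID": [np.nan, np.nan, 71.0, np.nan],
    "DOlocationID": [np.nan, 72.0, 61.0, 71.0],
})

# Share of missing pickup IDs, measured before filling.
missing_share = toy.PUlocationID.isna().mean()

# Same share measured the notebook's way: fill with -1, then count the sentinel.
filled = toy.fillna(-1)
sentinel_share = (filled.PUlocationID < 0).mean()

print(missing_share, sentinel_share)
```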

# Q4. One-hot encoding

In [24]:
categorical = ['PUlocationID','DOlocationID'] # categorical features (to be one-hot encoded)
numerical = ['duration'] # target column (what we will predict), not a feature

dfjan[categorical] = dfjan[categorical].astype(str)

dv = DictVectorizer()

train_dicts = dfjan[categorical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)


In [25]:
np.shape(X_train)

(1109826, 525)

### The feature matrix has 525 columns, one per distinct pickup or drop-off location value (the -1 placeholder counts as its own value)
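To see where the column count comes from, here is a small sketch of how `DictVectorizer` treats string values: each distinct `feature=value` pair becomes its own 0/1 column. The three rows and the name `dv_toy` are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records; IDs are strings, as after .astype(str) in the notebook.
rows = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "71.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(rows)   # sparse matrix, one column per feature=value pair

print(dv_toy.feature_names_)         # learned column names
print(X_toy.toarray())               # each row has exactly two 1s: one pickup, one drop-off
```

Two distinct pickup values plus two distinct drop-off values give four columns here. By the same logic, the 525 columns above are the distinct pickup values plus the distinct drop-off values seen in January.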

# Q5. Training a model

In [26]:
target = 'duration'
y_train = dfjan[target].values

In [27]:
lr = LinearRegression()
lr.fit(X_train, y_train) # train model

LinearRegression()

In [30]:
y_pred = lr.predict(X_train) # predicting
mean_squared_error(y_train, y_pred, squared=False) # evaluation

10.5285191072072

### RMSE on train is about 10.53
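RMSE is the square root of the mean squared difference between targets and predictions, so it can be cross-checked with plain NumPy. This is useful because the `squared=False` argument depends on the installed scikit-learn version (recent releases deprecated it). The helper name `rmse` and the numbers below are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -2, 2 and -3 minutes, so the mean squared error is 17/3.
example = rmse([10, 20, 30], [12, 18, 33])
print(example)
```

In the notebook this would be called as `rmse(y_train, y_pred)`.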

# Q6. Evaluating the model

In [32]:
# get validation data (February)
febfile = '~/notebooks/data/fhv_tripdata_2021-02.parquet'
dffeb = pd.read_parquet(febfile)
dffeb['duration'] = dffeb.dropOff_datetime - dffeb.pickup_datetime
dffeb.duration = dffeb.duration.apply(lambda td: td.total_seconds() / 60)
dffeb = dffeb[(dffeb.duration >= 1) & (dffeb.duration <= 60)]
dffeb = dffeb.fillna(-1)
categorical = ['PUlocationID','DOlocationID'] # features for categorical
dffeb[categorical] = dffeb[categorical].astype(str)


In [33]:
val_dicts = dffeb[categorical].to_dict(orient='records') # features only; the target is not passed to the vectorizer
X_val = dv.transform(val_dicts) # N.B. NOT 'fit_transform': reuse the columns learned on January
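The reason for calling `transform` rather than `fit_transform` on the validation data is that the columns must line up with the ones the model was trained on. A fitted `DictVectorizer` silently ignores anything it did not see during fitting. The sketch below, with made-up values and the name `dv_toy`, shows both an unseen location ID and an extra key being dropped.

```python
from sklearn.feature_extraction import DictVectorizer

dv_toy = DictVectorizer()
dv_toy.fit([{"PUlocationID": "-1.0"}, {"PUlocationID": "71.0"}])

X_new = dv_toy.transform([
    {"PUlocationID": "71.0"},                    # seen during fit
    {"PUlocationID": "999.0"},                   # unseen value: row of zeros
    {"PUlocationID": "-1.0", "duration": 12.5},  # unseen key 'duration' is ignored
])

print(X_new.toarray())
```

The output keeps the two columns learned during `fit`, so February locations that never appeared in January contribute nothing to the prediction beyond the intercept.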

In [34]:
target = 'duration'
y_train = dfjan[target].values
y_val = dffeb[target].values

In [35]:
lr = LinearRegression()
lr.fit(X_train, y_train) # refit on January; same data as cell 27, so the model is unchanged

y_pred = lr.predict(X_val) # predicting

mean_squared_error(y_val, y_pred, squared=False) # evaluation

11.014283163400654

### RMSE on validation is 11.01
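The January and February preprocessing steps are the same, so they could be wrapped in one function to keep the two months consistent. This is a sketch, not the notebook's code: the names `prepare` and `CATEGORICAL` are hypothetical, and unlike the cells above it fills missing values only in the two location columns rather than in the whole frame.

```python
import numpy as np
import pandas as pd

CATEGORICAL = ["PUlocationID", "DOlocationID"]  # hypothetical constant name

def prepare(df):
    """Add duration in minutes, keep 1 to 60 minute trips, encode IDs as strings."""
    df = df.copy()
    df["duration"] = (df.dropOff_datetime - df.pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    # Missing IDs become the string "-1.0", matching fillna(-1) followed by astype(str).
    df[CATEGORICAL] = df[CATEGORICAL].fillna(-1).astype(str)
    return df

# Made-up trips of 17 minutes, 110 minutes and 30 seconds.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-02-01 00:00:00"] * 3),
    "dropOff_datetime": pd.to_datetime(
        ["2021-02-01 00:17:00", "2021-02-01 01:50:00", "2021-02-01 00:00:30"]
    ),
    "PUlocationID": [np.nan, 71.0, np.nan],
    "DOlocationID": [72.0, 61.0, np.nan],
})

out = prepare(toy)
print(out[CATEGORICAL + ["duration"]])
```

With such a helper, the notebook would call `prepare` once per month and pass the results to `dv.fit_transform` and `dv.transform` respectively.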