In [21]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


Read the data for January. How many records are there?

In [4]:
df_jan = pd.read_parquet('../input/fhv_tripdata_2021-01.parquet')

In [6]:
df_jan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1154112 entries, 0 to 1154111
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1154112 non-null  object        
 1   pickup_datetime         1154112 non-null  datetime64[ns]
 2   dropOff_datetime        1154112 non-null  datetime64[ns]
 3   PUlocationID            195845 non-null   float64       
 4   DOlocationID            991892 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1153227 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 61.6+ MB


What's the average trip duration in January?

In [7]:
(df_jan.dropOff_datetime - df_jan.pickup_datetime).dt.total_seconds().mean()/60

19.167224093791013

Let's remove the outliers and keep only the records where the duration was between 1 and 60 minutes (inclusive). How many records did you drop?


In [8]:
df_jan['ride_minutes'] = (df_jan.dropOff_datetime - df_jan.pickup_datetime).dt.total_seconds()/60

In [9]:
mask_outliers = (df_jan['ride_minutes'] >= 1) & (df_jan['ride_minutes'] <= 60)

In [10]:
df_jan = df_jan[mask_outliers] 

In [11]:
sum(~mask_outliers)

44286
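The same inclusive filter can be written with `Series.between`, which includes both endpoints by default. A minimal sketch on made-up durations (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy ride durations in minutes (made-up values for illustration)
ride_minutes = pd.Series([0.5, 1.0, 12.3, 60.0, 61.2, 480.0])

# between() is inclusive on both ends by default,
# so it matches (x >= 1) & (x <= 60)
keep = ride_minutes.between(1, 60)

# Count dropped rows from the mask itself, then apply the filter
n_dropped = int((~keep).sum())
kept = ride_minutes[keep]

print(n_dropped, len(kept))  # 3 3
```

Counting from the mask, as the notebook does, keeps the answer correct regardless of whether the frame has already been overwritten by the filtered copy.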

What's the fraction of missing values for the pickup location ID? I.e. the fraction of "-1"s after you filled the NAs.

In [12]:
df_jan[['PUlocationID', 'DOlocationID']] = df_jan[['PUlocationID', 'DOlocationID']].fillna(-1)

In [13]:
sum(df_jan.PUlocationID == -1)/df_jan.PUlocationID.shape[0]

0.8352732770722617
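Since the mean of a boolean Series is the share of `True` values, the fraction can also be read off with `.mean()`. A small sketch on a toy column (values are made up):

```python
import pandas as pd

# Toy pickup IDs with missing entries (made-up values)
pu = pd.Series([float("nan"), 71.0, float("nan"), float("nan")])

filled = pu.fillna(-1)

# Mean of a boolean mask = fraction of rows where it is True
frac_missing = (filled == -1).mean()

# Same number, computed before filling
frac_isna = pu.isna().mean()

print(frac_missing)  # 0.75
```

This only equals the NA fraction when -1 never occurs as a genuine location ID, which is the assumption behind using it as the fill value.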

One-hot encoding. What's the dimensionality of this matrix? (The number of columns).

In [16]:
df_train = df_jan.copy(deep=True)
df_train[['PUlocationID', 'DOlocationID']] = df_train[['PUlocationID', 'DOlocationID']].astype(str)
dv = DictVectorizer()
X_train = dv.fit_transform(df_train[['PUlocationID', 'DOlocationID']].to_dict(orient='records'))
y_train = df_train['ride_minutes']

In [17]:
X_train.shape[1]

525
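To see where the column count comes from, here is a toy sketch of how `DictVectorizer` treats string-valued features: each distinct `name=value` pair becomes one column. The records and the `dv_toy` / `X_toy` names are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records shaped like the notebook's to_dict(orient='records') output.
# IDs are strings such as "-1.0" because the float column was cast with astype(str).
records = [
    {"PUlocationID": "-1.0", "DOlocationID": "72.0"},
    {"PUlocationID": "71.0", "DOlocationID": "72.0"},
    {"PUlocationID": "-1.0", "DOlocationID": "61.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)  # sparse matrix, one row per record

# One column per distinct (feature, value) pair
n_pu = len({r["PUlocationID"] for r in records})
n_do = len({r["DOlocationID"] for r in records})

print(X_toy.shape)           # (3, 4)
print(dv_toy.feature_names_)
```

Here 2 distinct pickup values plus 2 distinct drop-off values give 4 columns. By the same logic, the 525 above should be the number of distinct pickup IDs plus the number of distinct drop-off IDs in the filtered January data, with the "-1.0" fill value counted in each.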

Now let's use the feature matrix from the previous step to train a model. Train a plain linear regression model with default parameters, then calculate the RMSE of the model on the training data. What's the RMSE on train?

In [19]:
LR = LinearRegression()
LR.fit(X_train, y_train)

LinearRegression()



Evaluating the model: RMSE using training set

In [22]:
# sklearn's signature is (y_true, y_pred); RMSE is symmetric, so the value is unchanged
mean_squared_error(y_train, LR.predict(X_train), squared=False)

10.528519107212672
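For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal NumPy sketch on made-up numbers (not model output):

```python
import numpy as np

# Toy targets and predictions in minutes (made-up values)
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3 -> squares 4, 4, 9 -> mean 17/3 -> sqrt
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Swapping the arguments gives the same value because the error is squared
rmse_swapped = np.sqrt(np.mean((y_pred - y_true) ** 2))

print(round(rmse, 4))  # 2.3805
```

Computing it by hand like this also avoids depending on the `squared` argument, which, as far as I know, newer scikit-learn releases have deprecated.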

Evaluating the model: RMSE using validation set



In [23]:
df_feb = pd.read_parquet('../input/fhv_tripdata_2021-02.parquet')
df_feb['ride_minutes'] = (df_feb.dropOff_datetime - df_feb.pickup_datetime).dt.total_seconds()/60
mask_outliers = (df_feb['ride_minutes'] >= 1) & (df_feb['ride_minutes'] <= 60)
df_feb = df_feb[mask_outliers]
df_feb[['PUlocationID', 'DOlocationID']] = df_feb[['PUlocationID', 'DOlocationID']].fillna(-1)


In [24]:
df_val = df_feb[['PUlocationID', 'DOlocationID', 'ride_minutes']].dropna()
df_val[['PUlocationID', 'DOlocationID']] = df_val[['PUlocationID', 'DOlocationID']].astype(str)
X_val = dv.transform(df_val[['PUlocationID', 'DOlocationID']].to_dict(orient='records'))
y_val = df_val['ride_minutes']

In [25]:
# Same (y_true, y_pred) order as above
mean_squared_error(y_val, LR.predict(X_val), squared=False)


11.014283206926969
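The January and February cells repeat the same cleaning steps, so one way to keep them in sync is a small helper. This is only a sketch: `prepare_trips` is a hypothetical name, and the toy frame below stands in for the real parquet data.

```python
import pandas as pd

def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to a raw trips frame (hypothetical helper)."""
    df = df.copy()
    # Duration in minutes
    df["ride_minutes"] = (
        df["dropOff_datetime"] - df["pickup_datetime"]
    ).dt.total_seconds() / 60
    # Keep rides between 1 and 60 minutes, inclusive
    df = df[df["ride_minutes"].between(1, 60)].copy()
    # Fill missing location IDs with -1, then cast to str for one-hot encoding
    cols = ["PUlocationID", "DOlocationID"]
    df[cols] = df[cols].fillna(-1).astype(str)
    return df

# Toy stand-in for a raw month of data (made-up rows)
raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2021-02-01 00:00:00", "2021-02-01 01:00:00", "2021-02-01 02:00:00"]),
    "dropOff_datetime": pd.to_datetime(
        ["2021-02-01 00:10:00", "2021-02-01 03:00:00", "2021-02-01 02:00:30"]),
    "PUlocationID": [float("nan"), 5.0, 7.0],
    "DOlocationID": [72.0, float("nan"), 9.0],
})

clean = prepare_trips(raw)
print(len(clean))  # 1 (the 120-minute and 0.5-minute rides are dropped)
```

With such a helper, both months would go through identical preprocessing before `dv.fit_transform` on the training frame and `dv.transform` on the validation frame.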