# Refactor the Taxi Fare Prediction Problem with a Pipeline

We will refactor the model you built for the Taxi Fare Prediction Problem using:
- Custom encoders for the distance and time features
- `OneHotEncoder` to encode the hour and day-of-week features
- `SimpleImputer` to fill missing values
- A simple linear regression
- A pipeline to put it all together

Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle
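
As a minimal sketch of how the components listed above can be chained (toy numbers rather than the taxi data; the step names `imputer` and `model` and the variables `X_toy`, `y_toy` are illustrative choices, not part of the challenge), a `Pipeline` can fill missing values with a `SimpleImputer` before the regression sees them:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy feature matrix with one missing value (np.nan); not the taxi data.
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # replaces NaN with the column mean
    ("model", LinearRegression()),
])

# fit() runs every step in order; predict() reuses the fitted imputer.
toy_pipe.fit(X_toy, y_toy)
toy_pred = toy_pipe.predict(np.array([[3.0], [np.nan]]))
print(toy_pred.shape)
```

The same fitted object handles missing values at prediction time, which is the main reason to keep the imputer inside the pipeline rather than applying it by hand.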

## First pipeline

In [26]:
# import the train dataset (first 1000 rows)
import pandas as pd
df = pd.read_csv("../01-Kaggle-Taxi-Fare/data/train.csv", nrows=1000)



In [27]:

df.head(10)

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
5,2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
6,2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
7,2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
8,2012-12-03 13:10:00.000000125,9.0,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.993078,40.731628,1
9,2009-09-02 01:11:00.00000083,8.9,2009-09-02 01:11:00 UTC,-73.980658,40.733873,-73.99154,40.758138,2


In [28]:
# Hold out (train/test split)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns='fare_amount'),df['fare_amount'])

In [29]:
X_train.head()

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
414,2010-04-10 16:56:00.000000115,2010-04-10 16:56:00 UTC,-73.966422,40.767587,-73.973363,40.75827,2
731,2012-09-08 05:19:00.0000007,2012-09-08 05:19:00 UTC,-73.996222,40.738397,-73.945933,40.792682,2
816,2014-02-16 09:05:34.0000002,2014-02-16 09:05:34 UTC,-73.996905,40.756151,-73.954566,40.779943,1
223,2012-01-06 07:05:08.0000002,2012-01-06 07:05:08 UTC,-73.991863,40.754275,-73.983796,40.753213,1
504,2009-04-10 17:32:00.000000186,2009-04-10 17:32:00 UTC,-73.993962,40.735152,-74.00358,40.732207,1


### Custom transformers

With the Taxi Fare Prediction Challenge data, using `BaseEstimator` and `TransformerMixin`, implement:

- a transformer that computes the haversine distance between the pickup and dropoff locations
- a custom encoder that extracts the time features from `pickup_datetime`
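
Before writing the two transformers, it may help to see the bare pattern on a toy case. The sketch below is only an illustration: `ColumnDifference` and its column names `a` and `b` are hypothetical and unrelated to the taxi data. The point is that `fit` returns `self`, `transform` returns the new features, and the two base classes provide `fit_transform` and `get_params` for free.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDifference(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: returns the difference of two columns."""

    def __init__(self, left="a", right="b"):
        # Store constructor arguments unchanged so that get_params() works.
        self.left = left
        self.right = right

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless.
        return self

    def transform(self, X, y=None):
        return pd.DataFrame({"difference": X[self.left] - X[self.right]}, index=X.index)


toy = pd.DataFrame({"a": [5, 7], "b": [2, 3]})
diff = ColumnDifference().fit_transform(toy)  # fit_transform comes from TransformerMixin
params = ColumnDifference().get_params()      # get_params comes from BaseEstimator
print(diff)
print(params)
```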

In [30]:
import numpy as np

def haversine_vectorized(df, 
                         start_lat="pickup_latitude",
                         start_lon="pickup_longitude",
                         end_lat="dropoff_latitude",
                         end_lon="dropoff_longitude"):
    """ 
        Calculates the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df.
        Computes the distance in kms.
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c
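
A quick sanity check of the formula, written as a scalar re-implementation for this illustration only (`haversine_km` is a hypothetical helper, not part of the notebook): one degree of longitude along the equator should measure 6371 × π / 180 km, and a point should be at distance zero from itself.

```python
import numpy as np


def haversine_km(lat_1, lon_1, lat_2, lon_2, radius_km=6371):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat_1, lon_1, lat_2, lon_2 = map(np.radians, (lat_1, lon_1, lat_2, lon_2))
    a = (np.sin((lat_2 - lat_1) / 2.0) ** 2
         + np.cos(lat_1) * np.cos(lat_2) * np.sin((lon_2 - lon_1) / 2.0) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))


one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)           # along the equator
same_point = haversine_km(40.75, -73.98, 40.75, -73.98)
print(round(float(one_degree), 2), float(same_point))
```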

In [31]:
# create a DistanceTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class DistanceTransformer(BaseEstimator, TransformerMixin):
    """
        Computes the haversine distance between two GPS points.
        Returns a copy of the DataFrame X with only one column: 'distance'.
    """

    def __init__(self,
                 start_lat="pickup_latitude",
                 start_lon="pickup_longitude",
                 end_lat="dropoff_latitude",
                 end_lon="dropoff_longitude"):
        self.start_lat = start_lat
        self.start_lon = start_lon
        self.end_lat = end_lat
        self.end_lon = end_lon

    def fit(self, X, y=None):
        # nothing to learn: the transformer is stateless
        return self

    def transform(self, X, y=None):
        X2 = X.copy()
        X2['distance'] = haversine_vectorized(X2, self.start_lat, self.start_lon, self.end_lat, self.end_lon)
        return X2[['distance']]

In [32]:
# test the DistanceTransformer

dist_trans = DistanceTransformer()
distance = dist_trans.fit_transform(X_train, y_train)
distance.head(500)

Unnamed: 0,distance
414,1.189552
731,7.373807
816,4.439835
223,0.689690
504,0.874056
...,...
647,3.208400
932,1.521630
85,1.652961
619,1.043607


In [33]:
# create a TimeFeaturesEncoder

class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):
    """
        Extracts the day of week (dow), the hour, the month and the year from a time column.
        Returns a DataFrame with only four columns: 'dow', 'hour', 'month', 'year'.
    """

    def __init__(self, time_column, fuseau_horaire='America/New_York'):
        # time_column: name of the datetime column in X
        # fuseau_horaire: name of the time zone the timestamps are converted to
        self.time_column = time_column
        self.fuseau_horaire = fuseau_horaire

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # parse the UTC timestamps, then convert them to local time
        timestamps = pd.to_datetime(X[self.time_column], format='%Y-%m-%d %H:%M:%S UTC', utc=True)
        timestamps = timestamps.dt.tz_convert(self.fuseau_horaire)
        # keep the index of X so that the features stay aligned with the other columns
        return pd.DataFrame({'dow': timestamps.dt.weekday,
                             'hour': timestamps.dt.hour,
                             'month': timestamps.dt.month,
                             'year': timestamps.dt.year},
                            index=X.index)

In [34]:
# test the TimeFeaturesEncoder

time_enc = TimeFeaturesEncoder('pickup_datetime')
time_features = time_enc.fit_transform(X_train, y_train)
time_features.head()

Unnamed: 0,dow,hour,month,year
414,5,12,4,2010
731,5,1,9,2012
816,6,4,2,2014
223,4,2,1,2012
504,4,13,4,2009


### Preprocessing pipeline

In [35]:
# visualizing pipelines in HTML
from sklearn import set_config
set_config(display='diagram')

#### Distance pipeline

Create a pipeline for distances:
- convert the pickup and dropoff coordinates into distances with the DistanceTransformer
- standardize these distances
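
To make the standardization step concrete, here is a small sketch on made-up distances (the array `toy_distances` is illustrative only): `StandardScaler` subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up distances in km, shaped as a single-column feature matrix.
toy_distances = np.array([[1.0], [2.0], [3.0]])

scaled = StandardScaler().fit_transform(toy_distances)
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```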

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# create distance pipeline dist_pipeline
dist_pipeline = Pipeline([('distanceTransformer', DistanceTransformer()),
                          ('standardScaler', StandardScaler())])

# display distance pipeline
dist_pipeline


#### Time features pipeline

Create a pipeline for time features
- extract time features from pickup datetime with the TimeFeaturesEncoder
- encode these categorical time features with the OneHotEncoder
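
As an illustration of the encoding step, the sketch below one-hot encodes a tiny made-up table of time features (`toy_time` is not taken from the dataset). Setting `handle_unknown="ignore"` is an assumption about what is convenient here: a category that was not seen during `fit` is then encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up time features: two distinct days of week and two distinct hours.
toy_time = pd.DataFrame({"dow": [0, 5, 5], "hour": [8, 8, 17]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy_time).toarray()  # the encoder returns a sparse matrix

# A day of week (3) that was not present during fit gets an all-zero block.
unseen = encoder.transform(pd.DataFrame({"dow": [3], "hour": [8]})).toarray()

print(encoded)
print(unseen)
```

Each input column expands into one output column per category, so two categorical columns with two values each give four columns.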

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# create time pipeline time_pipeline
# handle_unknown='ignore' avoids errors on categories unseen during training
time_pipeline = Pipeline([('TimeFeaturesEncoder', TimeFeaturesEncoder('pickup_datetime')),
                          ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'))])

# display time pipeline
time_pipeline


#### Preprocessing pipeline

Wrap up the distance pipeline and the time pipeline into a preprocessing pipeline.
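
One common way to combine two branches that read different columns is scikit-learn's `ColumnTransformer`, which runs each branch on its own columns and concatenates the results. The sketch below uses a made-up table; the names `toy`, `toy_preproc`, `num` and `cat` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up table with one numeric and one categorical column.
toy = pd.DataFrame({"distance": [1.0, 2.0, 3.0], "hour": [8, 17, 8]})

toy_preproc = ColumnTransformer([
    ("num", StandardScaler(), ["distance"]),                    # 1 output column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["hour"]),  # 2 output columns
])

features = toy_preproc.fit_transform(toy)
print(features.shape)
```

Chaining the two branches in a plain `Pipeline` would not work here, because the second branch would receive the output of the first instead of the original columns.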

In [38]:
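from sklearn.compose import ColumnTransformer
# create preprocessing pipeline pre_pipeline: the two pipelines run side by side on their own columns
pre_pipeline = ColumnTransformer([
    ('dist_pipeline', dist_pipeline, ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']),
    ('time_pipeline', time_pipeline, ['pickup_datetime'])])
# display preprocessing pipeline
pre_pipeline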


### Model pipeline

Create a pipeline containing the preprocessing and the regression model of your choice.
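
A minimal sketch of what such a pipeline does once fitted, on made-up data that follows fare = 2 × distance + 1 exactly (the names `toy_X`, `toy_y` and `toy_model` are illustrative): the preprocessing step is applied automatically at prediction time, so raw values can be passed to `predict`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data following fare = 2 * distance + 1.
toy_X = pd.DataFrame({"distance": [1.0, 2.0, 3.0, 4.0]})
toy_y = pd.Series([3.0, 5.0, 7.0, 9.0])

toy_model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
toy_model.fit(toy_X, toy_y)

# The new row is scaled with the statistics learned during fit, then predicted.
new_fare = toy_model.predict(pd.DataFrame({"distance": [5.0]}))[0]
print(round(float(new_fare), 6))
```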

In [39]:
from sklearn.linear_model import LinearRegression

# Add the model of your choice to the pipeline named mod_pipeline
mod_pipeline = Pipeline([('standardScaler', StandardScaler()), ('linearRegression', LinearRegression())])

# display the pipeline with model
mod_pipeline


### Training and performance

Train the pipelined model and compute the prediction on the test set:

In [40]:
# train the pipelined model
X_train = X_train.assign(distance=haversine_vectorized(X_train))
X_test = X_test.assign(distance=haversine_vectorized(X_test))
mod_pipeline.fit(X_train[['passenger_count', 'distance']], y_train)

# compute y_pred on the test set
y_pred = mod_pipeline.predict(X_test[['passenger_count', 'distance']])

Use the RMSE to evaluate the performance of the model:
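
As a reminder, the root mean squared error over n samples is the square root of the mean of the squared errors: RMSE = sqrt((1/n) × Σ (ŷᵢ − yᵢ)²). A vectorized sketch with NumPy could look like this (the function name `rmse` and the toy values are illustrative only):

```python
import numpy as np


def rmse(y_pred, y_true):
    """Root mean squared error: square root of the mean of the squared errors."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Errors of 3 and 4 on two samples: sqrt((9 + 16) / 2)
toy_rmse = rmse([3.0, 4.0], [0.0, 0.0])
print(toy_rmse)
```

Dividing by the number of samples is what keeps the metric comparable across test sets of different sizes.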

In [41]:
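import math

def compute_rmse(y_pred, y_true):
    '''returns the root mean squared error between y_pred and y_true'''
    tot = 0
    n = 0
    for pred, true in zip(y_pred, y_true):
        tot += (pred - true) ** 2
        n += 1
    # divide by the number of samples before taking the square root
    return math.sqrt(tot / n)

In [42]:
# call compute_rmse
round(compute_rmse(y_pred, y_test), 2)

1155.67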

## Complete the workflow with a pipeline

Here we will implement the whole workflow for our Taxi Fare Kaggle challenge.

To do so, we will refactor the code into functions for more clarity.

Implement the following functions:
- `get_data()` to fetch the data
- `clean_data()` to clean the data
- `set_pipeline()` to build the pipeline defined earlier
- `train()` to train our model
- `evaluate()` to evaluate our model on test data

In [43]:
# implement get_data() function
def get_data(nrows=10000):
    return pd.read_csv("../01-Kaggle-Taxi-Fare/data/train.csv", nrows=nrows)

In [44]:
# implement clean_data() function
def clean_data(df, test=False):
    '''returns a DataFrame without outliers and missing values
    (expects a precomputed 'distance' column, in km)'''
    df = df[df.fare_amount > 0]
    df = df[df.distance < 100]
    df = df[df.passenger_count < 9]
    df = df[df.passenger_count > 0]
    # TODO: to be completed
    return df

In [45]:
# implement set_pipeline() function
def set_pipeline():
    pipe=Pipeline([('standardScaler',StandardScaler()),('linearRegression',LinearRegression())])
    return pipe

In [46]:
# implement train() function
def train(X_train, y_train, pipeline):
    '''returns a trained pipelined model'''
    pipeline.fit(X_train,y_train)
    return pipeline

In [47]:
# implement evaluate() function
from sklearn.metrics import mean_squared_error

def evaluate(X_test, y_test, pipeline):
    '''returns the value of the RMSE'''
    y_pred = pipeline.predict(X_test)
    # the RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

### Test the complete workflow

Use the above functions to test the complete workflow.

In [51]:
# store the data in a DataFrame
df = get_data()
df['distance'] = haversine_vectorized(df)
df = clean_data(df)

# set X and y
y = df["fare_amount"]
X = df[['distance']]

# hold out
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)

# build pipeline
pipeline = set_pipeline()

# train the pipeline
train(X_train, y_train, pipeline)

# evaluate the pipeline
rmse = evaluate(X_val, y_val, pipeline)
print(round(rmse, 2))

# R² score on the validation set
pipeline.score(X_val, y_val)

7.01


0.5097427884399979
### Congrats!

Now we are ready to convert this complete workflow into packaged code 🚀