# Refactor the Taxi Fare Prediction Problem with a Pipeline

Refactor the model you built for the Taxi Fare Prediction Problem using:
- Custom encoders you have to write for distance and time features
- OneHotEncoder to encode the hour and day-of-week features
- SimpleImputer to fill missing values
- A simple linear regression
- A pipeline to put it all together


Then: 
- train this pipeline
- apply the pipeline on test data
- generate predictions and submit these new predictions to Kaggle

## First pipeline

In [1]:
# import the dataset from s3 bucket 
import pandas as pd
url = "s3://wagon-public-datasets/taxi-fare-train.csv"

# Select only 10 000 rows while creating the DataFrame


In [2]:
# prepare X and y


In [3]:
# Hold out 
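
As a rough sketch of the three cells above, assuming the target column is `fare_amount` (its name in the Kaggle data). The snippet reads a small in-memory CSV in place of the S3 file, so the values, the 25% test size and the `random_state` are illustrative assumptions, not the expected answer.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny in-memory CSV standing in for the real file (values are made up).
csv_text = """fare_amount,pickup_datetime,passenger_count
4.5,2009-06-15 17:26:21 UTC,1
16.9,2010-01-05 16:52:16 UTC,1
5.7,2011-08-18 00:35:00 UTC,2
7.7,2012-04-21 04:30:42 UTC,1
5.3,2010-03-09 07:51:00 UTC,1
"""

# nrows caps how many rows are parsed; with the real data this would be
# pd.read_csv(url, nrows=10_000).
df = pd.read_csv(io.StringIO(csv_text), nrows=4)

# The target is the fare; the features are every other column.
y = df["fare_amount"]
X = df.drop(columns="fare_amount")

# Hold out 25% of the rows for testing (both values are arbitrary choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```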


### Custom transformers

With the Taxi Fare Prediction Challenge data, using `BaseEstimator` and `TransformerMixin`, implement:

- a transformer that computes haversine distance between pickup and dropoff location
- a custom encoder that extracts time features from `pickup_datetime`

In [5]:
import numpy as np

def haversine_vectorized(df, 
         start_lat="pickup_latitude",
         start_lon="pickup_longitude",
         end_lat="dropoff_latitude",
         end_lon="dropoff_longitude"):

    """ 
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df
        Computes distance in kms
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c

In [4]:
# Implement the `transform` method of the DistanceTransformer
from sklearn.base import BaseEstimator,  TransformerMixin 

class DistanceTransformer(BaseEstimator, TransformerMixin):
    """Compute the haversine distance between two GPS points."""

    def __init__(self, 
                 start_lat="pickup_latitude",
                 start_lon="pickup_longitude", 
                 end_lat="dropoff_latitude", 
                 end_lon="dropoff_longitude"):
        self.start_lat = start_lat
        self.start_lon = start_lon
        self.end_lat = end_lat
        self.end_lon = end_lon

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        """Returns a copy of the DataFrame X with only one column: 'distance'"""
        pass
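
One possible way to fill in `transform`, offered as a sketch and not as the reference solution. The class name `DistanceSketch` and the helper `haversine_km` are hypothetical; the helper is a compact restatement of `haversine_vectorized` so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2.0) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    )
    return 6371 * 2 * np.arcsin(np.sqrt(a))


class DistanceSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DistanceTransformer."""

    def __init__(self,
                 start_lat="pickup_latitude",
                 start_lon="pickup_longitude",
                 end_lat="dropoff_latitude",
                 end_lon="dropoff_longitude"):
        self.start_lat = start_lat
        self.start_lon = start_lon
        self.end_lat = end_lat
        self.end_lon = end_lon

    def fit(self, X, y=None):
        # Nothing to learn: the distance depends only on the row itself.
        return self

    def transform(self, X, y=None):
        distance = haversine_km(
            X[self.start_lat], X[self.start_lon], X[self.end_lat], X[self.end_lon]
        )
        # Build a new one-column DataFrame so the input X is left untouched.
        return pd.DataFrame({"distance": distance}, index=X.index)


# Made-up sample: one degree of longitude along the equator, then a zero-length trip.
sample = pd.DataFrame({
    "pickup_latitude": [0.0, 40.7],
    "pickup_longitude": [0.0, -74.0],
    "dropoff_latitude": [0.0, 40.7],
    "dropoff_longitude": [1.0, -74.0],
})
distance = DistanceSketch().fit_transform(sample)
```

Returning a fresh DataFrame that reuses `X.index` keeps the rows aligned with `y` and avoids mutating the caller's data.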

In [None]:
# test the DistanceTransformer
dist_trans = DistanceTransformer()
distance = dist_trans.fit_transform(X_train, y_train)
distance.head()

In [5]:
# Implement the `transform` method of the TimeFeaturesEncoder
class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):
    """Extract the day of week (dow), the hour, the month and the year from a time column."""

    def __init__(self, time_column, time_zone_name='America/New_York'):
        self.time_column = time_column
        self.time_zone_name = time_zone_name
        
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        """Returns a copy of the DataFrame X with only four columns: 'dow', 'hour', 'month', 'year'"""
        pass
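
A similar sketch for the time encoder, again under assumptions: the class name `TimeFeaturesSketch` is hypothetical, and the timestamps are assumed to be UTC strings that should be converted to local time before the features are extracted.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TimeFeaturesSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TimeFeaturesEncoder."""

    def __init__(self, time_column, time_zone_name="America/New_York"):
        self.time_column = time_column
        self.time_zone_name = time_zone_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Parse as UTC, then shift to local time so that `hour` and `dow`
        # describe the moment as riders experienced it.
        local_time = (
            pd.to_datetime(X[self.time_column], utc=True)
            .dt.tz_convert(self.time_zone_name)
        )
        return pd.DataFrame(
            {
                "dow": local_time.dt.weekday,  # Monday is 0
                "hour": local_time.dt.hour,
                "month": local_time.dt.month,
                "year": local_time.dt.year,
            },
            index=X.index,
        )


# Made-up sample: 17:26 UTC on a Monday in June is 13:26 in New York.
sample = pd.DataFrame({"pickup_datetime": ["2009-06-15 17:26:21 UTC"]})
time_features = TimeFeaturesSketch("pickup_datetime").fit_transform(sample)
```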

In [None]:
# test the TimeFeaturesEncoder
time_enc = TimeFeaturesEncoder('pickup_datetime')
time_features = time_enc.fit_transform(X_train, y_train)
time_features.head()

### Preprocessing pipeline

In [10]:
# visualizing pipelines in HTML
from sklearn import set_config; set_config(display='diagram')

#### Distance pipeline

Create a pipeline for distances:
- convert pickup and dropoff coordinates into distances with the DistanceTransformer
- standardize these distances

In [8]:
# create distance pipeline

# display distance pipeline


#### Time features pipeline

Create a pipeline for time features:
- extract time features from pickup datetime with the TimeFeaturesEncoder
- encode these categorical time features with the OneHotEncoder
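
To illustrate the encoding step in isolation, here is a small sketch on made-up `dow` and `hour` values. Passing `handle_unknown="ignore"` is a suggestion, not a requirement of the exercise: it lets the encoder cope with a category that appears in the test data but never in the training data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up time features, as a time encoder might produce them.
train_time = pd.DataFrame({"dow": [0, 1, 1], "hour": [8, 8, 17]})
test_time = pd.DataFrame({"dow": [6], "hour": [8]})  # dow=6 was never seen in training

# Unknown categories are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train_time).toarray()
encoded_test = encoder.transform(test_time).toarray()

# Columns are ordered dow=0, dow=1, hour=8, hour=17.
print(encoded_test)
```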

In [9]:
# create time pipeline

# display time pipeline


#### Preprocessing pipeline

Combine the distance pipeline and the time pipeline into a preprocessing pipeline.

In [10]:
# create preprocessing pipeline

# display preprocessing pipeline


### Model pipeline

Create a pipeline containing the preprocessing and the regression model of your choice.
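
The overall structure might look like the sketch below. To keep the block self-contained, the two custom transformers are replaced by hypothetical stand-in functions wrapped in `FunctionTransformer` (`rough_distance` uses a plain Euclidean distance in degrees, not the haversine formula), and the data is made up. The step names `preproc` and `linear_model` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

COORDS = ["pickup_latitude", "pickup_longitude",
          "dropoff_latitude", "dropoff_longitude"]


def rough_distance(X):
    """Stand-in for DistanceTransformer: Euclidean distance in degrees."""
    X = np.asarray(X, dtype=float)
    return np.hypot(X[:, 2] - X[:, 0], X[:, 3] - X[:, 1]).reshape(-1, 1)


def hour_and_dow(X):
    """Stand-in for TimeFeaturesEncoder: local hour and day of week."""
    ts = pd.to_datetime(X["pickup_datetime"], utc=True).dt.tz_convert("America/New_York")
    return pd.DataFrame({"dow": ts.dt.weekday, "hour": ts.dt.hour})


# One sub-pipeline per group of columns.
dist_pipe = make_pipeline(FunctionTransformer(rough_distance), StandardScaler())
time_pipe = make_pipeline(FunctionTransformer(hour_and_dow),
                          OneHotEncoder(handle_unknown="ignore"))

# ColumnTransformer routes each column group to its sub-pipeline
# and concatenates the results; other columns are dropped by default.
preproc = ColumnTransformer([
    ("distance", dist_pipe, COORDS),
    ("time", time_pipe, ["pickup_datetime"]),
])

pipe = Pipeline([("preproc", preproc), ("linear_model", LinearRegression())])

# Made-up rides, only to show that the pipeline fits and predicts.
X = pd.DataFrame({
    "pickup_datetime": ["2012-01-02 08:15:00 UTC", "2012-01-02 18:40:00 UTC",
                        "2012-01-03 08:05:00 UTC", "2012-01-03 23:30:00 UTC",
                        "2012-01-04 13:00:00 UTC", "2012-01-04 18:20:00 UTC"],
    "pickup_latitude": [40.75, 40.76, 40.70, 40.73, 40.78, 40.71],
    "pickup_longitude": [-73.99, -73.98, -74.01, -73.95, -73.96, -74.00],
    "dropoff_latitude": [40.77, 40.70, 40.75, 40.74, 40.65, 40.72],
    "dropoff_longitude": [-73.97, -74.00, -73.98, -73.99, -73.79, -73.99],
})
y = pd.Series([7.5, 12.0, 10.5, 9.0, 45.0, 5.5])

pipe.fit(X, y)
y_pred = pipe.predict(X)
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are fitted only on the data passed to `fit`, which helps avoid leaking information from the test set.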

In [11]:
# Add the model of your choice to the pipeline

# display the pipeline with model


<details>
    <summary>
       💡 Hint
    </summary>
The pipeline should look like this:
<img src='img/pipeline.png'>
</details>

### Training and performance

Train the pipelined model and compute predictions on the test set:

In [12]:
# Train the pipelined model

# compute y_pred on the test set


Use the RMSE to evaluate the model's performance:

In [14]:
def compute_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true)**2).mean())
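
As a quick sanity check of the formula, here is a worked example on made-up fares. The function is restated so the block runs on its own.

```python
import numpy as np


def compute_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true) ** 2).mean())


# Made-up values: the errors are 2, 0, -3 and -1.
y_true = np.array([10.0, 12.0, 8.0, 20.0])
y_pred = np.array([12.0, 12.0, 5.0, 19.0])

# Squared errors 4, 0, 9, 1 have mean 3.5, so the RMSE is sqrt(3.5), about 1.87.
rmse = compute_rmse(y_pred, y_true)
print(round(rmse, 2))
```

The RMSE is expressed in the same unit as the target, so here it reads as a typical error in dollars per ride.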

In [13]:
# call compute_rmse


## Complete workflow with a pipeline

Here we will implement the whole workflow for our Taxi Fare Kaggle challenge.

To do so, we will refactor the code into functions for clarity.

Implement the following functions:
- `get_data()` to fetch data from the S3 bucket
- `clean_data()` to clean data
- `set_pipeline()` to get the pipeline defined earlier
- `train()` to train our model
- `evaluate()` to evaluate our model on test data
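
As an illustration of the cleaning step only, here is one way it could be sketched. The function name `clean_data_sketch` is hypothetical, and every threshold (fares between 0 and 4000, 1 to 8 passengers, dropping rows with a `(0, 0)` coordinate) is an assumption to adapt after exploring the data. The `test` flag is assumed to mean that the target column is absent.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(df, test=False):
    """Drop rows with missing values or implausible records (assumed thresholds)."""
    df = df.dropna(how="any")
    # A (0, 0) coordinate usually signals a missing GPS reading.
    df = df[(df["pickup_latitude"] != 0) | (df["pickup_longitude"] != 0)]
    df = df[(df["dropoff_latitude"] != 0) | (df["dropoff_longitude"] != 0)]
    if not test:
        # Only the training data carries the target column.
        df = df[df["fare_amount"].between(0, 4000)]
    df = df[df["passenger_count"].between(1, 8)]
    return df


# Made-up rows: one valid ride, one missing value, one (0, 0) pickup, one negative fare.
raw = pd.DataFrame({
    "fare_amount": [7.5, 9.0, 6.0, -5.0],
    "pickup_latitude": [40.75, 40.76, 0.0, 40.73],
    "pickup_longitude": [-73.99, -73.98, 0.0, -73.95],
    "dropoff_latitude": [40.77, 40.70, 40.75, 40.74],
    "dropoff_longitude": [-73.97, -74.00, -73.98, -73.99],
    "passenger_count": [1, np.nan, 2, 1],
})
cleaned = clean_data_sketch(raw)
```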

In [18]:
# implement get_data() function
def get_data(nrows=10000):
    '''returns a DataFrame with `nrows` rows from the S3 bucket'''
    pass

In [19]:
# implement clean_data() function
def clean_data(df, test=False):
    '''returns a DataFrame without outliers and missing values'''
    pass

In [20]:
# implement set_pipeline() function
def set_pipeline():
    '''returns a pipelined model'''
    pass

In [21]:
# implement train() function
def train(X_train, y_train, pipeline):
    '''returns a trained pipelined model'''
    pass

In [24]:
# implement evaluate() function
def evaluate(X_test, y_test, pipeline):
    '''prints and returns the value of the RMSE'''
    pass

### Test the complete workflow

Use the above functions to test the complete workflow.

In [23]:
# store the data in a DataFrame

# set X and y

# hold out

# build pipeline

# train the pipeline

# evaluate the pipeline


### Congrats!

Now we are ready to convert this complete workflow into packaged code 🚀