# New York City Taxi Fare Prediction

This is a notebook I used for self-study and to get familiar with real data and real problems. I don't go too deep, since building a good model for this problem is complex and would take a lot of time and effort. If you liked my approach, please UPVOTE! As always, I'd be grateful for any suggestions or corrections.

Frederico M. Chaves

## Table of Contents

<ol>
    <li style=""><a id="load-packages-toc" href="#load-packages" style="font-size: 15px; text-decoration: none; color: black;">Load Packages</a></li>
    <li><a id="read-exploration-toc" href="#read-exploration" style="font-size: 15px; text-decoration: none; color: black;">Read Data and Pre-exploration</a></li>
    <ol>
        <li><a id="types-toc" href="#types" style="font-size: 15px; text-decoration: none; color: black;">Types</a></li>
        <li><a id="null-values-toc" href="#null-values" style="font-size: 15px; text-decoration: none; color: black;">Null Values</a></li>
        <li><a id="stat-info-toc" href="#stat-info" style="font-size: 15px; text-decoration: none; color: black;">Statistical Information</a></li>
    </ol>
    <li><a id="bounding-box-toc" href="#bounding-box" style="font-size: 15px; text-decoration: none; color: black;">New York City Bounding Box</a></li>
    <li><a id="clear-fare-toc" href="#clear-fare" style="font-size: 15px; text-decoration: none; color: black;">Removing Low and High Values of Taxi Fare</a></li>
    <li><a id="features-engineering-toc" href="#features-engineering" style="font-size: 15px; text-decoration: none; color: black;">Features Engineering</a></li>
    <ol>
        <li><a id="date-toc" href="#date" style="font-size: 15px; text-decoration: none; color: black;">Year, Month, Day of Month, Day of Week and Hour</a></li>
        <li><a id="distance-toc" href="#distance" style="font-size: 15px; text-decoration: none; color: black;">Displaced Distance</a></li>
        <li><a id="directions-toc" href="#directions" style="font-size: 15px; text-decoration: none; color: black;">Directions</a></li>
    </ol>
    <li><a id="visualizations-toc" href="#visualizations" style="font-size: 15px; text-decoration: none; color: black;">Features Visualizations</a></li>
    <ol>
        <li><a id="correlations-toc" href="#correlations" style="font-size: 15px; text-decoration: none; color: black;">Correlations</a></li>        
        <li><a id="new-york-city-map-toc" href="#new-york-city-map" style="font-size: 15px; text-decoration: none; color: black;">New York City Map</a></li>
        <li><a id="fare-distribution-toc" href="#fare-distribution" style="font-size: 15px; text-decoration: none; color: black;">Fare's Distribution</a></li>
        <li><a id="linear-regression-toc" href="#linear-regression" style="font-size: 15px; text-decoration: none; color: black;">How Does the Linear Regression of Fare as a Function of Distance Behave in Relation to the other Features?</a></li>        
    </ol>
    <li><a id="model-toc" href="#model" style="font-size: 15px; text-decoration: none; color: black;">Model</a></li>
    <ol>
        <li><a id="model-features-toc" href="#model-features" style="font-size: 15px; text-decoration: none; color: black;">Features</a></li>
        <li><a id="model-split-toc" href="#model-split" style="font-size: 15px; text-decoration: none; color: black;">Split</a></li>
        <li><a id="model-pipeline-toc" href="#model-pipeline" style="font-size: 15px; text-decoration: none; color: black;">Pipeline</a></li>
        <li><a id="model-tunning-toc" href="#model-tunning" style="font-size: 15px; text-decoration: none; color: black;">Parameter Tuning</a></li>
        <li><a id="model-predictions-toc" href="#model-predictions" style="font-size: 15px; text-decoration: none; color: black;">Predictions</a></li>        
    </ol>
</ol>



<a id="load-packages" href="#load-packages-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Load Packages</strong></a>

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from time import time

from warnings import filterwarnings
filterwarnings(action='ignore')

sns.set_style('whitegrid')
sns.set_palette('viridis')

%matplotlib inline

<a id="read-exploration" href="#read-exploration-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Read Data and Pre-exploration</strong></a>

In [None]:
df_train = pd.read_csv('../input/train.csv', nrows=50_000)
df_test = pd.read_csv('../input/test.csv')

<a id="types" href="#types-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Types</strong></a>

In [None]:
df_train.info()

<a id="null-values" href="#null-values-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Null Values</strong></a>

In [None]:
nan_train = pd.DataFrame(data=df_train.isnull().sum(), columns=['Train NaN'])
nan_test = pd.DataFrame(data=df_test.isnull().sum(), columns=['Test NaN'])
nan_test.loc['fare_amount'] = 0
pd.concat([nan_train, nan_test], axis=1, sort=False)

Since there are some missing values in the training set, we'll drop them.

In [None]:
df_train.dropna(inplace=True)

<a id="stat-info" href="#stat-info-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Statistical Information</strong></a>

In [None]:
df_train.describe()

There are some points to consider in the table above:

* The minimum fare is negative, which makes no sense and suggests that the data was collected poorly.
* The maximum fare is very high and probably corresponds to trips that were long in both time and distance. These points are hard to work with because the taxi may not follow a direct route, and we have no information about the path actually taken.
* Latitude ranges from -90° to 90°, so the minimum and maximum values shown in the table above are not valid.
* The same happens with longitude, which must lie between -180° and 180°, while the table above shows values outside this interval.
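
The last two points are easy to check directly. Here is a minimal sketch that flags coordinates outside the valid ranges. The sample values are made up for illustration and are not taken from `train.csv`.

```python
import numpy as np

# Hypothetical sample values, not taken from train.csv
latitude = np.array([40.75, -91.2, 401.08, 0.0])
longitude = np.array([-73.98, -74.01, 2140.6, 0.0])

# Valid geographic coordinates satisfy |latitude| <= 90 and |longitude| <= 180
valid = (np.abs(latitude) <= 90) & (np.abs(longitude) <= 180)

print(valid)
print(int((~valid).sum()), 'rows with impossible coordinates')
```

Note that the point (0, 0) passes this check even though it is nowhere near New York, so a range check alone is not enough to clean the coordinates.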

<a id="bounding-box" href="#bounding-box-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>New York City Bounding Box</strong></a>

In this step, I'll remove those odd coordinates by applying a bounding box around New York City. You can get the limits of the box from the following website: https://boundingbox.klokantech.com/

<img src="bounding-box.png" width=500px;>

In [None]:
def bouding_box(df):
    # Bounding box limits for New York City
    latitude_min, latitude_max = (40.4774, 40.9162)
    longitude_min, longitude_max = (-74.2591, -73.7002)
    # Keep only trips whose pickup and dropoff both fall inside the box
    true_coordinates = df['pickup_latitude'].between(latitude_min, latitude_max)
    true_coordinates &= df['pickup_longitude'].between(longitude_min, longitude_max)
    true_coordinates &= df['dropoff_latitude'].between(latitude_min, latitude_max)
    true_coordinates &= df['dropoff_longitude'].between(longitude_min, longitude_max)
    return df[true_coordinates]

In [None]:
df_train = bouding_box(df_train)

<a id="clear-fare" href="#clear-fare-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Removing Low and High Values of Taxi Fare</strong></a>

A quick Google search on taxi fares tells us that there is a base fee of $\$2.50$. Monday to Friday from 8:00pm to 6:00am, and all day on Saturday and Sunday, the base fee is $\$3.00$. Of course, this fee has not stayed the same over the years, but I'll use it as a starting point to clean the data. The limits below keep fares between $\$1.50$ and $\$100$.

In [None]:
def clear_fare(df):
    
    # Fare interval
    min_fare, max_fare = 1.50, 100
    # Applying the limits
    true_fare = df['fare_amount'].between(min_fare, max_fare)
    return df[true_fare]

In [None]:
df_train = clear_fare(df_train)

<a id="features-engineering" href="#features-engineering-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Features Engineering</strong></a>

<a id="date" href="#date-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Year, Month, Day of Month, Day of Week and Hour</strong></a>

A good set of features that we can get from the original data is the day of the week, day of the month, month, year and time the trip happened.

In [None]:
def split_datetime(df):          
    # Split datetime column
    datetime = pd.to_datetime(df['pickup_datetime'])
    df['day_of_week'] = datetime.dt.dayofweek
    df['day_of_month'] = datetime.dt.day
    df['month'] = datetime.dt.month
    df['year'] = datetime.dt.year
    df['hour'] = datetime.dt.hour + datetime.dt.minute/60
    return df

In [None]:
df_train = split_datetime(df_train)
df_test = split_datetime(df_test)

<a id="distance" href="#distance-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Displaced Distance</strong></a>

Because taxi fares are directly related to the distance traveled, we can use the pickup and dropoff coordinates to estimate it. Here I'll use the great-circle (haversine) distance in kilometers.

In [None]:
def haversine(coordinates):
    
    from math import pi, sqrt, sin, cos, atan2
    
    lat1 = coordinates[0]
    long1 =  coordinates[1]
    lat2 = coordinates[2]
    long2 = coordinates[3]

    degree_to_rad = float(pi / 180.0)

    d_lat = (lat2 - lat1) * degree_to_rad
    d_long = (long2 - long1) * degree_to_rad

    a = pow(sin(d_lat / 2), 2) + cos(lat1 * degree_to_rad) * cos(lat2 * degree_to_rad) * pow(sin(d_long / 2), 2)
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    km = 6367 * c

    return km

def distance(df):
    # Compute the change in latitude and longitude
    df['delta_latitude'] = (df['dropoff_latitude'] - df['pickup_latitude'])
    df['delta_longitude'] = (df['dropoff_longitude'] - df['pickup_longitude'])
    # Compute the great-circle distance in kilometers
    # Note: the disabled line below treats the angles as 2D plane coordinates instead
    #df['displacement_degree'] = np.linalg.norm(df[['delta_latitude', 'delta_longitude']], axis=1)
    df['distance_km'] = df[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']].apply(haversine, axis=1, raw=True)    
    return df

In [None]:
df_train = distance(df_train)
df_test = distance(df_test)
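
Applying `haversine` row by row is fine for 50,000 rows but gets slow on larger samples. As a possible alternative, here is a vectorized sketch with NumPy. The name `haversine_np` and the sample coordinates are my own assumptions for illustration; the formula and the Earth radius of 6367 km are the same as in the function above (written with `arcsin`, which is mathematically equivalent to the `atan2` form).

```python
import numpy as np

def haversine_np(lat1, lon1, lat2, lon2, radius_km=6367.0):
    """Great-circle distance in km between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    a = np.sin(d_lat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_lon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Hypothetical trips: one degree of latitude apart, and a zero-length trip
lat1 = np.array([40.0, 40.75])
lon1 = np.array([-74.0, -73.98])
lat2 = np.array([41.0, 40.75])
lon2 = np.array([-74.0, -73.98])

dist = haversine_np(lat1, lon1, lat2, lon2)
print(dist)
```

The same function should also accept pandas columns, for example `haversine_np(df['pickup_latitude'], df['pickup_longitude'], df['dropoff_latitude'], df['dropoff_longitude'])`.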

<a id="directions" href="#directions-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Directions</strong></a>

Another set of features that can play an important role is the direction of travel. Here I'll use the change in latitude and longitude to get the four cardinal directions of the compass rose (N, S, W and E), and then combine them to get the intercardinal directions (NW, NE, SW, SE). I'll also define an unknown direction for trips that begin and end at practically the same place. As you can see in the code below, I'm setting trips shorter than $0.1km$ as unknown (for God's sake, no one would take a taxi to travel such a short distance!).

In [None]:
def move_directions(df):
    
    new_df = pd.DataFrame()
    # Creates a column with true values for travel going north
    new_df['north'] = ((df['delta_latitude'] > 0) & (df['distance_km'] >= 0.1)).astype('int')
    # Creates a column with true values for travel going south
    new_df['south'] = ((df['delta_latitude'] < 0) & (df['distance_km'] >= 0.1)).astype('int')
    # Creates a column with true values for travel going west
    new_df['west'] = ((df['delta_longitude'] < 0) & (df['distance_km'] >= 0.1)).astype('int')
    # Creates a column with true values for travel going east
    new_df['east'] = ((df['delta_longitude'] > 0) & (df['distance_km'] >= 0.1)).astype('int')
    # Creates a column with true values for trips shorter than 0.1 km (practically the same start and end point)
    new_df['unknown'] = (df['distance_km'] < 0.1).astype('int')
    return new_df

def wind_rose(row):
    name = ''
    directions = {0: 'n', 1: 's', 2: 'w', 3: 'e', 4: 'unknown'}
    for idx, value in enumerate(row):
        if value:
            name += directions[idx]
    return name

In [None]:
directions_train = move_directions(df_train)
directions_test = move_directions(df_test)

df_train['wind_rose'] = directions_train[['north', 'south', 'west', 'east', 'unknown']].apply(wind_rose, axis=1, raw=True)
df_test['wind_rose'] = directions_test[['north', 'south', 'west', 'east', 'unknown']].apply(wind_rose, axis=1, raw=True)
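
To make the labelling rule easier to follow, here is the same logic written for a single trip in plain Python. This is only a sketch: `direction_label` is a hypothetical helper that is not used elsewhere in the notebook, and it assumes that trips strictly shorter than `min_km` are the unknown ones.

```python
def direction_label(delta_latitude, delta_longitude, distance_km, min_km=0.1):
    """Return a compass label such as 'n', 'sw' or 'unknown' for one trip."""
    # Very short trips have no meaningful direction
    if distance_km < min_km:
        return 'unknown'
    label = ''
    # North/south comes first, so the combined labels read 'nw', 'ne', 'sw', 'se'
    if delta_latitude > 0:
        label += 'n'
    elif delta_latitude < 0:
        label += 's'
    if delta_longitude < 0:
        label += 'w'
    elif delta_longitude > 0:
        label += 'e'
    return label

# Hypothetical trips: (delta_latitude, delta_longitude, distance_km)
print(direction_label(0.01, -0.02, 2.5))
print(direction_label(-0.01, 0.0, 1.2))
print(direction_label(0.0001, 0.0001, 0.02))
```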

<a id="visualizations" href="#visualizations-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Features Visualizations</strong></a>

<a id="correlations" href="#correlations-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Correlations</strong></a>

The heatmap below shows that fare is strongly linearly correlated with distance, as we would expect. The linear correlation with the other features is weak, which suggests that a more complex, non-linear model is needed to make use of them.

In [None]:
plt.figure(figsize=(20,2))
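# Restrict to numeric columns: recent pandas versions raise an error if text columns reach corr()
sns.heatmap(df_train.select_dtypes('number').corr()[['fare_amount']].sort_values('fare_amount', ascending=False).iloc[1:].T, annot=True, cmap='viridis', vmax=0.88, vmin=-0.21)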

<a id="new-york-city-map" href="#new-york-city-map-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>New York City Map</strong></a>

Let's see where people are taking the taxi.

In [None]:
fig = plt.figure(figsize=(10,8))

ax1 = fig.add_axes([0, 0, 1, 1])
ax2 = fig.add_axes([0.15, 0.5, 0.4, 0.4])
ax1.scatter(x='pickup_latitude', y='pickup_longitude', data=df_train.loc[0:10000], color='red', s=0.5)
ax2.scatter(x='pickup_latitude', y='pickup_longitude', data=df_train.loc[0:10000], color='blue', s=0.5)

ax1.set_xlabel('Latitude', fontsize=15)
ax1.set_ylabel('Longitude', fontsize=15)
ax1.set_title('Pickup Coordinates', fontsize=15)
ax2.set_xlabel('Latitude', fontsize=15)
ax2.set_ylabel('Longitude', fontsize=15)
ax2.set_title('Pickup Coordinates - Zoom', fontsize=15)

ax1.set_xlim((40.64, 40.825))
ax1.set_ylim((-74.05, -73.75))
ax2.set_xlim((40.725, 40.775))
ax2.set_ylim((-74.0, -73.95))

<a id="fare-distribution" href="#fare-distribution-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Fare's Distribution</strong></a>

From the graph below we see that the distribution is right-skewed. As many Kagglers who have studied this data have reported, there are some values between forty and sixty dollars that correspond to fixed-price trips, such as rides to the airport. I won't handle these cases here; I'll use the fare just the way it is.

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df_train['fare_amount'], bins=80, kde=False)
sns.despine(top=True, bottom=True, left=True, right=True)

<a id="linear-regression" href="#linear-regression-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>How Does the Linear Regression of Fare as a Function of Distance Behave in Relation to the other Features?</strong></a>

From the graphs below, we see that some features, such as year, hour and direction, play an important role, because their regression lines differ. The other features don't seem to make much difference, but this conclusion is superficial, since it is based only on a linear model.

In [None]:
plot_data = df_train.copy()
plot_data['hour'] = plot_data['hour'].astype('int')

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 18), sharex=True, sharey=True)
sns.set_palette('viridis')
axes = list(axes.ravel())
fig.delaxes(axes[-1])

y_min, y_max = (plot_data['fare_amount'].min()-2, plot_data['fare_amount'].max()+2)

for feature, ax  in zip(['passenger_count', 'year', 'month', 'day_of_month', 'day_of_week', 'hour', 'wind_rose'], axes):
    uniques = sorted(plot_data[feature].unique())
    for feature_value in uniques:
        my_df = plot_data[plot_data[feature] == feature_value]
        sns.regplot(x='distance_km', y='fare_amount', label=str(feature_value), data=my_df, ci=0,  ax=ax)        
        ax.set_title(feature)
        ax.set_xlim((-2, 30))
        ax.set_ylim((y_min, y_max))
        ncols = 2 if len(uniques) > 12 else 1
        ax.legend(title=str(feature), ncol=ncols, loc='best', bbox_to_anchor=(1.12, 1))
fig.tight_layout()

If we look at the bar plots below, we get a clearer view of how fare and distance are related. The average distance for each feature value, excluding hour and direction, is between $3.0 km$ and $3.5 km$, and the corresponding average fare is between $\$10$ and $\$12$. Hour and direction show more variation.

In [None]:
fig1, axes = plt.subplots(nrows=7, ncols=2, figsize=(15, 20))

for feature, ax in zip(['passenger_count', 'year', 'month', 'day_of_month', 'day_of_week', 'hour', 'wind_rose'], axes):
    # One row per feature: mean fare on the left, mean distance on the right.
    # barplot already groups by feature value, so no inner loop is needed.
    sns.barplot(x=feature, y='fare_amount', data=plot_data, ci=None, palette='viridis', ax=ax[0])
    sns.barplot(x=feature, y='distance_km', data=plot_data, ci=None, palette='viridis', ax=ax[1])
sns.despine(top=True, bottom=True, left=True, right=True)
fig1.tight_layout()

To get an idea of what a linear model between fare and distance would look like for each year, we can look at the table below. For example, in 2011 the price per kilometer was approximately $\$2.1$ and the base fee was $\$3.4$, which is not absurd, since the current price per kilometer is $\$1.56$ and the base fee is between $\$2.50$ and $\$3.0$, depending on the time and day of the week (<a href="https://www.theawl.com/2012/07/how-much-more-do-taxi-fares-cost-today">New York Fares</a>).

In [None]:
my_X = df_train[['distance_km', 'year']]
my_X = pd.get_dummies(my_X, columns=['year'])
my_y = df_train['fare_amount']

def step_features(df, base_feature, dummies):
    
    new_df = pd.DataFrame()
    for dummy_feature in dummies:
        new_df[base_feature + ' | ' + dummy_feature] = df[base_feature] * df[dummy_feature]
    
    return new_df

new_df = step_features(my_X, 'distance_km', my_X.drop('distance_km', axis=1).columns)
my_X = pd.concat([my_X.drop('distance_km', axis=1), new_df], axis=1)

lr = LinearRegression(fit_intercept=False).fit(my_X, my_y)

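# get_dummies orders the year columns ascending, so the table labels must be sorted too
years = sorted(df_train['year'].unique())
n_years = len(years)

# First come the year dummies (base fee), then the distance x year terms (price per km)
base_fee = lr.coef_[:n_years]
price_per_km = lr.coef_[n_years:]

pd.DataFrame(data=[base_fee, price_per_km], columns=years, index=['Base Fee', '$/km']).round(1)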
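
With one dummy and one distance interaction per year, and no global intercept, the regression above amounts to fitting a separate straight line for each year: the dummy coefficient is the intercept (base fee) and the interaction coefficient is the slope (price per kilometer). A minimal sketch of that idea for a single year, using synthetic noise-free data that I made up for illustration:

```python
import numpy as np

# Synthetic trips for one hypothetical year: fare = 2.5 + 1.6 * km (no noise)
distance_km = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
fare_amount = 2.5 + 1.6 * distance_km

# A degree-1 polynomial fit returns (slope, intercept)
price_per_km, base_fee = np.polyfit(distance_km, fare_amount, deg=1)

print(round(float(price_per_km), 1), round(float(base_fee), 1))
```

On real data the fit will not recover round numbers like these, since fares also depend on waiting time, tolls and surcharges.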

<a id="model" href="#model-toc" style="font-size: 20px; text-decoration: none; color: black;"><strong>Model</strong></a>

Due to the complexity of the data, I'll use a random forest regressor.

In [None]:
df_train = pd.get_dummies(df_train, columns=['wind_rose'], prefix='', prefix_sep='')
df_test = pd.get_dummies(df_test, columns=['wind_rose'], prefix='', prefix_sep='')

<a id="model-features" href="#model-features-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Features</strong></a>

The following piece of code defines which features will be used.

In [None]:
features_on_off = {'key': False,
                   'fare_amount': False,
                   'pickup_datetime': False,
                   'pickup_longitude': False,
                   'pickup_latitude': False,
                   'dropoff_longitude': False,
                   'dropoff_latitude': False,
                   'passenger_count': True,
                   'day_of_week': True,
                   'day_of_month': True,
                   'month': True,
                   'year': True,
                   'hour': True,                   
                   'delta_latitude': True,
                   'delta_longitude': True,
                   'distance_km': True,
                   'n': True,
                   's': True,
                   'w': True,
                   'e': True,
                   'ne': True,
                   'nw': True,
                   'se': True,
                   'sw': True,
                   'unknown': True}

In [None]:
features_on = [key for key, status in features_on_off.items() if status]

<a id="model-split" href="#model-split-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Split</strong></a>

Splitting the data into training and test sets.

In [None]:
X = df_train[features_on]
y = df_train['fare_amount']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

<a id="model-pipeline" href="#model-pipeline-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Pipeline</strong></a>

The pipeline is composed of a standardization step and a random forest regressor (the step is named `classifier` in the code).

In [None]:
clf = Pipeline([('std', StandardScaler()),
                #('pca', PCA()),
                ('classifier', RandomForestRegressor())])

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_train, y_train)

<a id="model-tunning" href="#model-tunning-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Parameter Tuning</strong></a>

Here I'll use grid search to find the best parameters for the regressor.

In [None]:
t0 = time()
param_grid = {'classifier__n_estimators': [100],
              'classifier__max_depth': [10, 15, 20, 25],
              'classifier__min_samples_split': [2, 3, 4, 5],
              'classifier__min_samples_leaf': [1, 2, 3, 4, 5]
             }

grid_search = GridSearchCV(clf, param_grid, cv=2)
grid_search.fit(X_train, y_train)

print(f'Running time: {time()-t0:.2f}s')
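
To get a feel for the cost of this search, here is a small sketch that counts the candidates using only the standard library. It copies `param_grid` and `cv=2` from the cell above; keep in mind that `GridSearchCV` also refits the best candidate once at the end by default.

```python
from itertools import product

# Same grid and number of folds as in the search above
param_grid = {'classifier__n_estimators': [100],
              'classifier__max_depth': [10, 15, 20, 25],
              'classifier__min_samples_split': [2, 3, 4, 5],
              'classifier__min_samples_leaf': [1, 2, 3, 4, 5]}
cv = 2

# Every combination of parameter values is one candidate model
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv

print(f'{len(combinations)} candidates x {cv} folds = {n_fits} fits')
```

Adding one more value to any list multiplies the total, so the grid grows quickly.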

In [None]:
grid_search.best_params_

In [None]:
grid_search.score(X_train, y_train)

Let's compare some statistical information between the actual and predicted fare in the test set.

In [None]:
y_pred = grid_search.predict(X_test)

In [None]:
pd.DataFrame({'Real Fare': y_test, 'Predicted Fare': y_pred}).describe().T.drop('count', axis=1)

Let's see some specific predictions.

In [None]:
pd.DataFrame({'Real Fare': y_test, 'Predicted Fare': y_pred}).head(10).T

The model performs well on some points and fails on others, as expected for such a complex problem! What about the mean squared error?

In [None]:
mean_squared_error(y_test, y_pred).round()

Well, the mean squared error is big! 🤔
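
Part of the reason the number looks big is that the mean squared error is measured in squared dollars. Taking the square root gives the root mean squared error (RMSE), which is in dollars and easier to read as a typical error per trip. A minimal sketch with made-up fares (these are not the predictions from the model above):

```python
from math import sqrt

# Hypothetical actual and predicted fares, in dollars
y_true = [6.5, 12.0, 8.1, 45.0]
y_hat = [7.0, 10.5, 8.1, 40.0]

# Mean of the squared errors (squared dollars), then its square root (dollars)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)
rmse = sqrt(mse)

print(f'MSE: {mse:.3f}  RMSE: {rmse:.3f}')
```

In this toy example a single large miss on the $\$45$ trip dominates both numbers, which is typical for squared-error metrics.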

<a id="model-predictions" href="#model-predictions-toc" style="font-size: 15px; text-decoration: none; color: black;"><strong>Predictions</strong></a>

Let's calculate the predictions for the test set.

In [None]:
# Add any direction columns missing from the test set as zeros, without overwriting the ones that exist
X = df_test.reindex(columns=features_on, fill_value=0)

In [None]:
y_pred = grid_search.predict(X)

In [None]:
submission = pd.DataFrame({'key': df_test['key'], 'fare_amount': y_pred})

In [None]:
submission.to_csv('submission.csv', index=False)