> **Problem overview**

In this playground competition, hosted in partnership with Google Cloud and Coursera, you are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. While you can get a basic estimate based on just the distance between the two points, this will result in an RMSE of $5-$8, depending on the model used (see the starter code for an example of this approach in Kernels). Your challenge is to do better than this using Machine Learning techniques!
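
As a rough illustration of the distance-only baseline and the RMSE metric mentioned above, here is a minimal sketch. The helper `rmse` and the toy arrays are made up for this example and are not part of the competition data or starter code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy data: trip distances in miles and fares in dollars.
toy_distance = np.array([1.0, 2.0, 3.0, 4.0])
toy_fare = np.array([5.0, 7.5, 10.0, 12.5])

# Fit fare ~ slope * distance + intercept with ordinary least squares.
slope, intercept = np.polyfit(toy_distance, toy_fare, deg=1)
toy_rmse = rmse(toy_fare, slope * toy_distance + intercept)
print(slope, intercept, toy_rmse)
```

On real trips the relationship is far noisier than in this toy data, which is why a distance-only model leaves a sizeable error.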

To learn how to handle large datasets with ease and solve this problem using TensorFlow, consider taking the Machine Learning with TensorFlow on Google Cloud Platform specialization on Coursera -- the taxi fare problem is one of several real-world problems that are used as case studies in the series of courses. To make this easier, head to Coursera.org/NEXTextended to claim this specialization for free for the first month!

In [None]:
# import libraries
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# import data preprocessing from sklearn
from sklearn.preprocessing import RobustScaler

# import model function from sklearn
from sklearn.ensemble import RandomForestRegressor

# import model selection from sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# import model evaluation regression metrics from sklearn
from sklearn.metrics import mean_squared_error

> **Acquiring training and testing data**

We start by acquiring the training and testing datasets into Pandas DataFrames.

In [None]:
# load the first 2,000,000 training rows and the full testing data
train_df = pd.read_csv('../input/train.csv', nrows=2000000, parse_dates=['pickup_datetime'])
test_df = pd.read_csv('../input/test.csv', parse_dates=['pickup_datetime'])

In [None]:
# visualize head of the training data
train_df.head(n=3)

In [None]:
# visualize tail of the testing data
test_df.tail(n=3)

In [None]:
# convert training dataframe fare amount to log(1 + fare amount)
train_df['fare_amount'] = np.log1p(train_df['fare_amount'])
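
The log transform above is reversible, which matters later when predictions are converted back to dollars. A small sketch of the round trip, using made-up fare values:

```python
import numpy as np

toy_fares = np.array([0.0, 2.5, 10.0, 52.0])  # hypothetical fares in dollars
log_fares = np.log1p(toy_fares)               # log(1 + fare), defined at fare = 0
restored = np.expm1(log_fares)                # exp(x) - 1 undoes log1p
print(log_fares, restored)
```

Because the model is trained on `log1p(fare)`, any error it reports is on that log scale until `np.expm1` is applied.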

In [None]:
# drop rows with missing values (including fares below -1, whose log1p is NaN)
train_df = train_df.dropna()

In [None]:
# combine training and testing dataframe
train_df['datatype'], test_df['datatype'] = 'training', 'testing'
test_df.insert(1, 'fare_amount', 0)
data_df = pd.concat([train_df, test_df])
data_df.head(n=3)

> **Feature exploration, engineering and cleansing**

Here we generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution, and explore some of the data.

In [None]:
# describe training and testing data
data_df.describe(include='all')

In [None]:
# haversine (great-circle) distance in miles between two latitude/longitude points
def distance(lat1, lon1, lat2, lon2):
    angle = 0.017453292519943295  # math.pi / 180, converts degrees to radians
    x = 0.5 - np.cos((lat2 - lat1) * angle) / 2 + np.cos(lat1 * angle) * np.cos(lat2 * angle) * (1 - np.cos((lon2 - lon1) * angle)) / 2
    # 12742 km is the Earth's diameter; 0.6213712 converts kilometres to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(x))
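
The function above is the haversine formula written with cosines. For comparison, here is a sketch of the more familiar sine form; `haversine_miles` and `EARTH_RADIUS_MILES` are names introduced only for this example, and the radius is an approximate assumed value.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumed value)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Sanity checks: identical points are 0 miles apart, and one degree of
# latitude is a bit under 70 miles.
same_point = haversine_miles(40.7128, -74.0060, 40.7128, -74.0060)
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)
city_to_jfk = haversine_miles(40.7128, -74.0060, 40.6413, -73.7781)
print(same_point, one_degree, city_to_jfk)
```

Checks like these are a cheap way to confirm that the units (miles) and the degree-to-radian conversion are right before building features on top of the distance.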

In [None]:
# feature extraction: date and time components of the pickup datetime
data_df['year'] = data_df['pickup_datetime'].dt.year
data_df['quarter'] = data_df['pickup_datetime'].dt.quarter
data_df['month'] = data_df['pickup_datetime'].dt.month
# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
data_df['weekofyear'] = data_df['pickup_datetime'].dt.isocalendar().week.astype(int)
data_df['weekday'] = data_df['pickup_datetime'].dt.weekday
data_df['dayofweek'] = data_df['pickup_datetime'].dt.dayofweek  # same values as weekday
data_df['hour'] = data_df['pickup_datetime'].dt.hour
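
To see what these calendar features look like, here is a small sketch on a made-up two-row frame. The frame `toy` and its timestamps are hypothetical; the `.dt` accessors are standard pandas.

```python
import pandas as pd

# Hypothetical pickups: a Thursday morning and a Sunday night.
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2015-01-01 08:30:00', '2015-06-14 23:05:00'])
})
dt = toy['pickup_datetime'].dt
toy['year'] = dt.year
toy['month'] = dt.month
toy['weekofyear'] = dt.isocalendar().week.astype(int)  # ISO 8601 week number
toy['weekday'] = dt.weekday        # Monday = 0 ... Sunday = 6
toy['dayofweek'] = dt.dayofweek    # alias of weekday
toy['hour'] = dt.hour
print(toy)
```

Since `weekday` and `dayofweek` always agree, one of the two columns carries no extra information for the model.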

In [None]:
# feature extraction: distance (despite its name, distance_euclidean holds the haversine distance in miles)
data_df['distance_euclidean'] = distance(data_df['pickup_latitude'], data_df['pickup_longitude'],
                                         data_df['dropoff_latitude'], data_df['dropoff_longitude'])
data_df['distance_latitude'] = data_df['dropoff_latitude'] - data_df['pickup_latitude']
data_df['distance_longitude'] = data_df['dropoff_longitude'] - data_df['pickup_longitude']

In [None]:
# feature extraction: distance to specific location
nyc = (40.7128, -74.0060)
jfk = (40.6413, -73.7781)
ewr = (40.6895, -74.1745)
data_df['distance_pickup_to_nyc'] = distance(data_df['pickup_latitude'], data_df['pickup_longitude'], nyc[0], nyc[1])
data_df['distance_pickup_to_jfk'] = distance(data_df['pickup_latitude'], data_df['pickup_longitude'], jfk[0], jfk[1])
data_df['distance_pickup_to_ewr'] = distance(data_df['pickup_latitude'], data_df['pickup_longitude'], ewr[0], ewr[1])
data_df['distance_dropoff_to_nyc'] = distance(data_df['dropoff_latitude'], data_df['dropoff_longitude'], nyc[0], nyc[1])
data_df['distance_dropoff_to_jfk'] = distance(data_df['dropoff_latitude'], data_df['dropoff_longitude'], jfk[0], jfk[1])
data_df['distance_dropoff_to_ewr'] = distance(data_df['dropoff_latitude'], data_df['dropoff_longitude'], ewr[0], ewr[1])

In [None]:
# feature extraction: fare amount per mile (fare_amount is already log1p-transformed here)
data_df['fare_per_mile'] = data_df['fare_amount'] / data_df['distance_euclidean']
# zero-distance trips produce inf or NaN; set both to 0
data_df['fare_per_mile'] = data_df['fare_per_mile'].replace([np.inf, -np.inf], 0)
data_df['fare_per_mile'] = data_df['fare_per_mile'].fillna(0)
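
The division above yields `inf` when the distance is zero and `NaN` when both fare and distance are zero. A minimal sketch of that behaviour on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a normal trip, a zero-distance trip, and a 0/0 case.
toy = pd.DataFrame({'fare_amount': [2.0, 3.0, 0.0], 'distance': [1.0, 0.0, 0.0]})
raw_ratio = toy['fare_amount'] / toy['distance']           # 2.0, inf, NaN
clean_ratio = raw_ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(raw_ratio.tolist(), clean_ratio.tolist())
```

Setting these cases to 0 is a convenience for plotting; whether 0 is a sensible stand-in for an undefined ratio depends on how the column is used afterwards.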

In [None]:
# scatter plot between distance and fare amount
fig, ax = plt.subplots(figsize=(20, 5))
sns.scatterplot(data=data_df[data_df['datatype'] == 'training'], x='distance_euclidean', y='fare_amount')

In [None]:
# scatter plot between distance and fare amount, restricted to trips under 50 miles
fig, ax = plt.subplots(figsize=(20, 5))
sns.scatterplot(data=data_df[(data_df['datatype'] == 'training') & (data_df['distance_euclidean'] < 50)], x='distance_euclidean', y='fare_amount')

In [None]:
# scatter plot between distance and fare per mile, restricted to trips over 1 mile
fig, ax = plt.subplots(figsize=(20, 5))
sns.scatterplot(data=data_df[(data_df['datatype'] == 'training') & (data_df['distance_euclidean'] > 1)], x='distance_euclidean', y='fare_per_mile')

In [None]:
# scatter plot between distance and fare per mile, restricted to trips between 1 and 50 miles
fig, ax = plt.subplots(figsize=(20, 5))
sns.scatterplot(data=data_df[(data_df['datatype'] == 'training') & (data_df['distance_euclidean'] > 1) & (data_df['distance_euclidean'] < 50)], x='distance_euclidean', y='fare_per_mile')

In [None]:
# average fare amount by year and month
groupby = data_df[data_df['datatype'] == 'training'].groupby(['year', 'month'])
# select the column before aggregating so non-numeric columns are never averaged
groupby = groupby['fare_amount'].mean().reset_index()
groupby.columns = ['year', 'month', 'fare_amount_average']

fig, ax = plt.subplots(figsize=(20, 5))
pointplot = sns.pointplot(data=groupby, hue='year', x='month', y='fare_amount_average')

In [None]:
# average fare amount by year and hour
groupby = data_df[data_df['datatype'] == 'training'].groupby(['year', 'hour'])
# select the column before aggregating so non-numeric columns are never averaged
groupby = groupby['fare_amount'].mean().reset_index()
groupby.columns = ['year', 'hour', 'fare_amount_average']

fig, ax = plt.subplots(figsize=(20, 5))
pointplot = sns.pointplot(data=groupby, hue='year', x='hour', y='fare_amount_average')

In [None]:
# encode datatype as an integer
data_df['datatype'] = data_df['datatype'].map({'testing': 0, 'training': 1, 'excluded': 2})

In [None]:
data_df.head(n=3)

After extracting all features, we need to convert the categorical features to numeric features, a format suitable for feeding into our Machine Learning models.
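
As a small illustration of this conversion, the sketch below encodes a made-up object column as integer category codes. The frame `toy` is hypothetical and only mirrors the pattern used in the cells that follow.

```python
import pandas as pd

# Hypothetical frame with one object (string) column and one numeric column.
toy = pd.DataFrame({
    'key': pd.Series(['b', 'a', 'b'], dtype=object),
    'hour': [1, 2, 3],
})
col_obj = toy.select_dtypes(['object']).columns
# Categories are sorted, so 'a' becomes code 0 and 'b' becomes code 1.
toy[col_obj] = toy[col_obj].astype('category').apply(lambda s: s.cat.codes)
print(toy)
```

Category codes are arbitrary integers, so they impose an ordering the original labels may not have; that is harmless here only if the encoded column is not used as a model feature.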

In [None]:
# check which columns have dtype object
data_df.info()

In [None]:
# convert dtypes object to category
col_obj = data_df.select_dtypes(['object']).columns
data_df[col_obj] = data_df[col_obj].astype('category')
data_df.info()

In [None]:
# convert dtypes category to category codes
col_cat = data_df.select_dtypes(['category']).columns
data_df[col_cat] = data_df[col_cat].apply(lambda x: x.cat.codes)
data_df.info()

In [None]:
data_df.head(n=3)

> **Analyze and identify patterns by visualizations**

Let us generate some correlation plots of the features to see how related one feature is to the next. To do so, we will use the Seaborn plotting package, which makes this kind of plot very convenient.

The Pearson correlation plot shows how strongly the features are correlated with one another. If no features are strongly correlated, there isn't much redundant or superfluous data in our training data. This plot is also useful for determining which features are correlated with the observed value.
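
A minimal sketch of how a Pearson correlation matrix is computed on numeric columns only, using a made-up frame in which `fare` is an exact linear function of `distance`:

```python
import pandas as pd

# Hypothetical frame: two numeric columns plus a datetime column.
toy = pd.DataFrame({
    'distance': [1.0, 2.0, 3.0, 4.0],
    'fare': [4.0, 6.5, 9.0, 11.5],
    'pickup_datetime': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04']),
})
# Keep numeric columns so the datetime column is left out of the matrix.
toy_corr = toy.select_dtypes('number').corr()
print(toy_corr)
```

A coefficient near 1 or -1 signals a strong linear relationship, while a value near 0 only rules out a linear one, not every kind of dependence.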

In [None]:
# compute pairwise correlation of the numeric columns, excluding NA/null values, and present it as a heat map
corr = data_df[data_df['datatype'] == 1].select_dtypes('number').corr()
fig, ax = plt.subplots(figsize=(20, 15))
heatmap = sns.heatmap(corr, annot=True, cmap=plt.cm.RdBu, fmt='.1f', square=True);

Pair plots are also useful for observing how the training data is distributed from one feature to another.

In [None]:
# plot pairwise relationships in a dataset
#pairplot = sns.pairplot(data_df[data_df['datatype'] == 1], diag_kind='kde', diag_kws=dict(shade=True), hue='fare_amount')

Pivot tables and other visualizations are also useful for observing how features affect one another.

In [None]:
# pivot table: mean fare per mile by hour (rows) and year (columns)
pivottable = pd.pivot_table(data_df[data_df['datatype'] == 1], aggfunc='mean',
                            columns=['year'], index=['hour'], values='fare_per_mile')
pivottable.style.background_gradient(cmap='Blues')
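
To make the shape of this pivot table concrete, here is a sketch on four made-up rows. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical trips: two in 2014 at hour 8, and two in 2015 at hours 8 and 20.
toy = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015],
    'hour': [8, 8, 8, 20],
    'fare_per_mile': [1.0, 3.0, 4.0, 5.0],
})
toy_table = pd.pivot_table(toy, aggfunc='mean', columns=['year'],
                           index=['hour'], values='fare_per_mile')
print(toy_table)
```

Each cell is the mean of the rows sharing that hour and year; combinations with no rows, such as hour 20 in 2014 here, come out as NaN.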

> **Model, predict and solve the problem**

Now, it is time to feed the features to Machine Learning models.

In [None]:
# select all features
x = data_df[data_df['datatype'] == 1].drop(['key', 'pickup_datetime', 'fare_amount', 'datatype', 'fare_per_mile'], axis=1)
y = data_df[data_df['datatype'] == 1]['fare_amount']

In [None]:
x.head(n=3)

In [None]:
# scale the features with RobustScaler (centers on the median, scales by the interquartile range)
scaler = RobustScaler()
x = scaler.fit_transform(x)

In [None]:
# perform train-test (validate) split
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state=0, test_size=0.25)
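
The cells above fit the scaler on all training rows before splitting. One possible variation, sketched here on synthetic data, fits the scaler on the training split only so that the validation rows do not influence the scaling statistics. All names prefixed with `toy_` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 3))
toy_y = rng.normal(size=200)

toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_x, toy_y, random_state=0, test_size=0.25)

# Learn the median and interquartile range from the training split only,
# then apply the same transformation to both splits.
toy_scaler = RobustScaler().fit(toy_x_train)
toy_x_train_scaled = toy_scaler.transform(toy_x_train)
toy_x_val_scaled = toy_scaler.transform(toy_x_val)
print(toy_x_train_scaled.shape, toy_x_val_scaled.shape)
```

For tree-based models such as random forests the scaling step has little effect on the fit, so this mostly matters if a scale-sensitive model is swapped in later.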

In [None]:
# random forest model prediction
forestreg = RandomForestRegressor(max_depth=20, min_samples_split=5, n_estimators=10, random_state=0).fit(x_train, y_train)
forestreg_ypredict = forestreg.predict(x_validate)
# root mean squared error on the log1p(fare) scale, since y was log-transformed earlier
forestreg_rmse = mean_squared_error(y_validate, forestreg_ypredict) ** 0.5
# cross_val_score returns the negated MSE per fold; take the absolute value and square root to get the per-fold RMSE
forestreg_cvscores = np.sqrt(np.abs(cross_val_score(forestreg, x, y, cv=5, scoring='neg_mean_squared_error')))
print('random forest regression\n  root mean squared error: %0.4f, cross validation score: %0.4f (+/- %0.4f)' %(forestreg_rmse, forestreg_cvscores.mean(), 2 * forestreg_cvscores.std()))
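
The cross-validation line converts scikit-learn's negated MSE scores into RMSE values. A small sketch of that conversion on synthetic data, where every `toy_` name is made up for this example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=(120, 2))
toy_y = 2.5 * toy_x[:, 0] + rng.normal(scale=0.1, size=120)

toy_model = RandomForestRegressor(n_estimators=10, random_state=0)
# Scores are negated MSE values (higher is better), one per fold.
toy_neg_mse = cross_val_score(toy_model, toy_x, toy_y, cv=5,
                              scoring='neg_mean_squared_error')
toy_cv_rmse = np.sqrt(-toy_neg_mse)
print(toy_cv_rmse.mean(), toy_cv_rmse.std())
```

Reporting the mean together with the spread across folds gives a sense of how stable the error estimate is, which a single train/validation split cannot show.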

> **Supply or submit the results**

Our submission to the competition site Kaggle is ready. Any suggestions to improve our score are welcome.

In [None]:
# model selection
model = forestreg

# prepare testing data and compute the predicted value (expm1 converts log fares back to dollars)
x_test = data_df[data_df['datatype'] == 0].drop(['key', 'pickup_datetime', 'fare_amount', 'datatype', 'fare_per_mile'], axis=1)
x_test = scaler.transform(x_test)
y_test = pd.DataFrame(np.expm1(model.predict(x_test)), columns=['fare_amount'])

In [None]:
# submit the results
out = pd.DataFrame({'key': test_df['key'], 'fare_amount': y_test['fare_amount']})
out.to_csv('submission.csv', index=False)
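
Building the submission from two pandas objects relies on their indexes lining up. A hedged sketch of a variant that sidesteps index alignment by using plain arrays, with made-up keys and predictions:

```python
import numpy as np
import pandas as pd

# Hypothetical test keys with a non-default index, and log-scale predictions.
toy_test = pd.DataFrame({'key': ['k1', 'k2', 'k3']}, index=[10, 11, 12])
toy_log_pred = np.log1p(np.array([5.0, 12.0, 7.5]))

# Converting to NumPy arrays pairs rows by position rather than by index label.
toy_out = pd.DataFrame({
    'key': toy_test['key'].to_numpy(),
    'fare_amount': np.expm1(toy_log_pred),  # back from log1p scale to dollars
})
print(toy_out)
```

A quick check that the output has one row per test key and no missing fares catches most alignment mistakes before uploading.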