From the Kaggle competition description:

"In this playground competition, hosted in partnership with Google Cloud and Coursera, you are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. While you can get a basic estimate based on just the distance between the two points, this will result in an RMSE of $5-$8, depending on the model used (see the starter code for an example of this approach in Kernels). Your challenge is to do better than this using Machine Learning techniques!"



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
#from sklearn.model_selection import StratifiedShuffleSplit

First issue: train.csv is massive (5.7 GB), so ideally the file would be read and processed in chunks.
However, pd.read_csv(PATH, chunksize=X) returns a 'TextFileReader' object rather than a DataFrame.

I do not yet know how to work with the chunksize argument, so I'll just use a limited number of rows (5,000,000) for now.
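
For reference, a minimal sketch of how the 'TextFileReader' object can be consumed: it is an iterator that yields one ordinary DataFrame per chunk, so running totals can be accumulated without holding the whole file in memory. The tiny in-memory CSV and the variable names here are stand-ins of my own, not the real train.csv.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical values) so the sketch runs anywhere.
csv_text = "fare_amount,passenger_count\n4.5,1\n16.9,1\n5.7,2\n7.7,1\n5.3,1\n"

total_rows = 0
fare_sum = 0.0

# With chunksize set, read_csv returns an iterator; each item is a normal DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    fare_sum += chunk["fare_amount"].sum()

mean_fare = fare_sum / total_rows
print("rows:", total_rows, "mean fare:", round(mean_fare, 2))
```

The same loop should work on the real file by passing its path instead of the StringIO object and using a much larger chunksize.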


In [3]:
#trainRawChunks = pd.read_csv("/home/mitchell/Desktop/kaggleData/taxiFareKaggle/all/train.csv", chunksize=500)
#testRaw = pd.read_csv("/home/mitchell/Desktop/kaggleData/taxiFareKaggle/all/test.csv")

#count = 0
#for chunk in trainRawChunks:
#    print(chunk.shape)
#    count += 1
    
#print("There are: ", 500 * count, "rows")
    
#trainRawChunks[0].head()

In [4]:
numberOfSamples = 5000000
trainRawSegment = pd.read_csv("/home/mitchell/Desktop/kaggleData/taxiFareKaggle/all/train.csv", nrows=numberOfSamples)


My Lenovo T440 seems to be able to handle 5,000,000 lines of data quite easily. Using only a portion of the supplied data puts the model at a disadvantage, but it also leaves room for improvement if needed.

In [5]:
print(trainRawSegment.shape)
trainRawSegment.head()

(5000000, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


First, the raw data will be shuffled, even though the rows do not appear to follow any particular order. This is so that, when the validation set is separated from the training set, each has a fair representation of the data. scikit-learn's stratified shuffle split could be used instead to ensure this; it also separates the two datasets automatically based on the arguments passed to it.

For this implementation, a simple shuffle of the dataset followed by the head() and tail() functions will be used to create an 80/20 split of the raw data.
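
As an alternative sketch of the same shuffle-and-split idea, pandas' own DataFrame.sample can do the shuffle, and computing the split position once guarantees the two parts never overlap or leave a gap. The small DataFrame and the names df, train_part and validation_part are hypothetical stand-ins, not the notebook's variables.

```python
import pandas as pd

# Hypothetical stand-in for the raw training segment.
df = pd.DataFrame(
    {"fare_amount": [4.5, 16.9, 5.7, 7.7, 5.3, 9.0, 7.5, 17.0, 7.0, 12.1]}
)

# sample(frac=1) returns every row in random order; random_state makes it repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Compute the boundary once so the two slices are exactly complementary.
split_at = int(len(shuffled) * 0.8)
train_part = shuffled.iloc[:split_at]
validation_part = shuffled.iloc[split_at:]

print(train_part.shape, validation_part.shape)
```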

In [6]:
trainRawSegment.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [11]:
np.random.seed(0)
#trainRawSegment = trainRawSegment.reindex(np.random.permutation(trainRawSegment.index))
trainRawSegment = trainRawSegment.iloc[np.random.permutation(np.arange(len(trainRawSegment)))]
trainRawSegment.head()

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
3501095,2010-01-09 01:07:34.0000001,4.5,2010-01-09 01:07:34 UTC,-73.96821,40.755522,-73.978814,40.745056,1
4041967,2013-02-17 15:57:25.0000005,7.0,2013-02-17 15:57:25 UTC,-73.978982,40.752687,-73.985815,40.757773,1
937063,2014-02-12 13:58:00.000000155,17.0,2014-02-12 13:58:00 UTC,-73.985948,40.767917,-73.968107,40.756287,1
141827,2012-09-29 03:42:00.000000119,9.0,2012-09-29 03:42:00 UTC,-73.987697,40.722717,-74.00792,40.71052,1
4266855,2013-03-14 05:31:53.0000001,7.5,2013-03-14 05:31:53 UTC,-73.985218,40.760248,-74.0011,40.736655,1


In [12]:
# head() and tail() need an integer row count, so cast the float product to int.

train = trainRawSegment.head(int(numberOfSamples*0.8))
validation = trainRawSegment.tail(int(numberOfSamples*0.2))

In [14]:
print(train.shape)
print(validation.shape)

(4000000, 8)
(1000000, 8)


In [15]:
trainLabels = train["fare_amount"]
trainData = train.drop("fare_amount", axis=1)
validationLabels = validation["fare_amount"]
validationData = validation.drop("fare_amount", axis=1)

Looking through the data, the dates and times in the "key" and "pickup_datetime" columns stand out. These could be split into separate features and checked for correlation with "fare_amount". The pickup and dropoff longitude/latitude pairs could be made into cross features, and time could be bucketed into morning, afternoon and evening categories.
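
A rough sketch of the date/time idea: parse "pickup_datetime", pull out a few parts, and bucket the hour. The bucket edges (0-12, 12-18, 18-24) are my own assumption, and the timestamps are in UTC, so buckets based on New York local time would be shifted. The DataFrame name sample is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical sample in the same string format as the pickup_datetime column.
sample = pd.DataFrame({
    "pickup_datetime": [
        "2009-06-15 17:26:21 UTC",
        "2010-01-05 16:52:16 UTC",
        "2011-08-18 00:35:00 UTC",
        "2012-04-21 04:30:42 UTC",
        "2010-03-09 07:51:00 UTC",
    ]
})

# An explicit format avoids per-row format inference; " UTC" is matched as literal text.
pickup = pd.to_datetime(sample["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC")

# Split the timestamp into separate numeric features.
sample["year"] = pickup.dt.year
sample["month"] = pickup.dt.month
sample["hour"] = pickup.dt.hour

# Assumed bucket edges: [0, 12) morning, [12, 18) afternoon, [18, 24) evening.
sample["time_of_day"] = pd.cut(
    sample["hour"],
    bins=[0, 12, 18, 24],
    labels=["morning", "afternoon", "evening"],
    right=False,  # left-closed intervals, so hour 0 falls in the first bucket
)

print(sample[["hour", "time_of_day"]])
```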

In [16]:
trainData["passenger_count"].value_counts()

1      2767840
2       590358
5       282865
3       175418
4        84836
6        84614
0        14057
208          7
9            2
129          1
51           1
7            1
Name: passenger_count, dtype: int64

Inspection of the "passenger_count" column shows some crazy errors, specifically the taxis that were able to fit 51 to 208 passengers in a single trip. In all seriousness, these look like input errors or error codes. The trips recorded with 7 and 9 passengers are more of a mystery: they could be data entry errors, or those taxis may really have carried that many passengers.

These outliers (passenger_count > 6) will be removed from the dataset, as they make up an insignificant fraction of the data: 12 of the 4,000,000 training rows counted above.
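
One way the removal might look, sketched on made-up data: build a boolean mask once and apply it to both the features and the labels so the two stay aligned. The names data, labels, data_clean and labels_clean are hypothetical stand-ins for the notebook's training data and labels.

```python
import pandas as pd

# Hypothetical stand-ins for the training features and labels, sharing one index.
data = pd.DataFrame(
    {"passenger_count": [1, 2, 208, 5, 9, 6, 0, 51]},
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)
labels = pd.Series([4.5, 16.9, 5.7, 7.7, 5.3, 9.0, 7.5, 17.0], index=data.index)

# True for rows to keep (6 passengers or fewer).
keep = data["passenger_count"] <= 6

# Apply the same mask to both objects so features and labels stay aligned.
data_clean = data[keep]
labels_clean = labels[keep]

print(len(data) - len(data_clean), "rows removed")
```

The validation set would need the same filter applied to its own data and labels.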