This notebook extends the taxi models notebook. In that notebook we did some exploratory analysis and linear fits to get baseline models of price as a function of distance and time for the two datasets. Those baselines leave out plenty of potentially useful information, such as time of day, shared rides, and location. In this notebook we do the following:

1. Clean the data and prune appropriate variables with a pipeline, and potentially save a clean copy.
 - Convert certain columns into useful derived variables, e.g. ride speed, hour, day of week.
 - Delete unused columns.
 - Consider variable scaling. Do we want to just use raw lat/long, or perhaps normalize relative to the center of Chicago?
 - Prepare data in X, Y format for ML algorithms and do a train/test split.
 - I think surge pricing is the only useful way in which the taxi and TNP datasets differ.
 
2. Apply a few ML algorithms.
 - We'll use the taxi data first.
      - The taxi pricing model is very clear and there is no surge pricing.
      - There is more taxi data, and the model is known to be consistent since Jan 2016 (see the previous notebook).
 - Linear model with the extra categorical variables included. 
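
As a rough illustration of step 1, the sketch below derives ride speed, hour, and day of week, centers the pickup coordinates on a reference point, and makes a train/test split. The helper name `add_derived_features`, the new column names, the approximate center coordinates, and the toy rows are all assumptions made for illustration; they are not taken from the actual pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed approximate center of Chicago as (lat, lon); illustrative only.
CHICAGO_CENTER = (41.8781, -87.6298)

def add_derived_features(df, center=CHICAGO_CENTER):
    """Return a copy of df with speed, hour, weekday, and centered pickup coordinates."""
    out = df.copy()
    out['speed_mph'] = out['Trip Miles'] / (out['Trip Seconds'] / 3600.0)
    ts = out['Trip Start Timestamp']
    out['hour'] = ts.dt.hour
    out['day_of_week'] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out['pickup_dlat'] = out['Pickup Centroid Latitude'] - center[0]
    out['pickup_dlon'] = out['Pickup Centroid Longitude'] - center[1]
    return out

# Toy rows (made up) that reuse the taxi column names.
toy = pd.DataFrame({
    'Trip Start Timestamp': pd.to_datetime([
        '2018-12-01 19:00:00', '2018-12-02 08:30:00',
        '2018-12-03 12:15:00', '2018-12-04 23:45:00']),
    'Trip Seconds': [1800.0, 600.0, 900.0, 1200.0],
    'Trip Miles': [5.0, 1.5, 3.0, 4.0],
    'Pickup Centroid Latitude': [41.8809944707] * 4,
    'Pickup Centroid Longitude': [-87.6327464887] * 4,
    'Fare': [15.0, 6.5, 10.0, 12.5],
})
features = add_derived_features(toy)

# X, Y format plus a reproducible train/test split.
X = features[['Trip Seconds', 'Trip Miles', 'hour', 'day_of_week',
              'pickup_dlat', 'pickup_dlon']]
y = features['Fare']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```

Subtracting a fixed center only shifts the coordinates; whether that helps depends on the model, since a plain linear fit is unaffected by such a shift apart from its intercept.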
 

In [1]:
import pandas as pd
from pandas.plotting import scatter_matrix
from IPython.display import Image
#from tqdm.auto import tqdm  # for progress bars

import numpy as np
import random
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


# Plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [38]:
# generates easy-to-load ~2gb files from ~10gb of input data.
%run -i 'inital_preprocessing.py'

taxi = pd.read_hdf(taxi_out_path, 'df')
tnp = pd.read_hdf(tnp_out_path, 'df')
# info() prints its summary and returns None, so call it directly instead of wrapping it in print()
taxi.info()
tnp.info()

Nothing to be done, h5 files exist.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25860156 entries, 0 to 25860155
Data columns (total 14 columns):
Trip Start Timestamp          datetime64[ns]
Trip Seconds                  float32
Trip Miles                    float32
Pickup Community Area         float32
Dropoff Community Area        float32
Fare                          float32
Tips                          float32
Tolls                         float32
Extras                        float32
Trip Total                    float32
Pickup Centroid Latitude      float64
Pickup Centroid Longitude     float64
Dropoff Centroid Latitude     float64
Dropoff Centroid Longitude    float64
dtypes: datetime64[ns](1), float32(9), float64(4)
memory usage: 2.0 GB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17432011 entries, 0 to 17432010
Data columns (total 14 columns):
Trip Start Timestamp          datetime64[ns]
Trip Seconds                  float32
Trip Miles                    float32
Pic

In [18]:
useful_taxi_cols = ['Trip Start Timestamp',
       'Trip Seconds', 'Trip Miles', 'Pickup Community Area','Dropoff Community Area',
       'Fare', 'Tips', 'Tolls', 'Extras','Trip Total',
       'Pickup Centroid Latitude','Pickup Centroid Longitude',
       'Dropoff Centroid Latitude', 'Dropoff Centroid Longitude']

taxi_dtypes = ['float32','float32','float32','float32',
               'float32','float32','float32','float32','float32',
               'float64','float64','float64','float64']

useful_tnp_cols = ['Trip Start Timestamp', 
       'Trip Seconds','Trip Miles','Pickup Community Area', 'Dropoff Community Area', 
       'Fare', 'Tip', 'Additional Charges', 'Trip Total', 'Shared Trip Authorized',
       'Pickup Centroid Latitude', 'Pickup Centroid Longitude', 
       'Dropoff Centroid Latitude', 'Dropoff Centroid Longitude']

tnp_dtypes = ['float32','float32','float32','float32',
               'float32','float32','float32','float32','bool',
               'float64','float64','float64','float64']

def load_and_clean(cols, dtypes, fname):
    """Load a CSV with the given column subset and dtypes; assumes the timestamp is the first element of cols."""
    column_types = dict(zip(cols[1:], dtypes))  # skip the date column, which is parsed separately

    return pd.read_csv(
        fname,
        usecols=cols,
        dtype=column_types,
        parse_dates=[cols[0]],
        infer_datetime_format=True,
    )
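
Because `zip` silently truncates to the shorter input, a column list and a dtype list that drift out of sync would leave some columns without an explicit dtype. The sketch below shows one way to guard against that and exercises the same `read_csv` arguments on an in-memory CSV. The helper `build_dtype_map` and the demo data are hypothetical, and `infer_datetime_format` is left out here because it is deprecated in recent pandas releases.

```python
import io
import pandas as pd

def build_dtype_map(cols, dtypes):
    """Pair every non-timestamp column with a dtype, failing loudly on a length mismatch."""
    if len(cols) - 1 != len(dtypes):
        raise ValueError(f"expected {len(cols) - 1} dtypes, got {len(dtypes)}")
    return dict(zip(cols[1:], dtypes))

# Made-up miniature CSV; the 'Unused' column is dropped by usecols.
demo_cols = ['Trip Start Timestamp', 'Trip Seconds', 'Trip Miles']
demo_dtypes = ['float32', 'float32']
demo_csv = io.StringIO(
    "Trip Start Timestamp,Trip Seconds,Trip Miles,Unused\n"
    "12/01/2018 07:00:00 PM,600,2.5,x\n"
    "12/01/2018 07:15:00 PM,900,4.0,y\n"
)
demo = pd.read_csv(
    demo_csv,
    usecols=demo_cols,
    dtype=build_dtype_map(demo_cols, demo_dtypes),
    parse_dates=[demo_cols[0]],
)
```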

In [24]:
tnp.count()

Trip Start Timestamp          17432011
Trip Seconds                  17429617
Trip Miles                    17432007
Pickup Community Area         16400220
Dropoff Community Area        16270531
Fare                          17431903
Tip                           17432011
Additional Charges            17431903
Trip Total                    17431903
Shared Trip Authorized        17432011
Pickup Centroid Latitude      16410805
Pickup Centroid Longitude     16410805
Dropoff Centroid Latitude     16280471
Dropoff Centroid Longitude    16280471
dtype: int64

Trip Start Timestamp          25860156
Trip Seconds                  25856546
Trip Miles                    25859849
Pickup Community Area         22764517
Dropoff Community Area        22303632
Fare                          25859786
Tips                          25859786
Tolls                         25859786
Extras                        25859786
Trip Total                    25859786
Pickup Centroid Latitude      22764928
Pickup Centroid Longitude     22764928
Dropoff Centroid Latitude     22353899
Dropoff Centroid Longitude    22353899
dtype: int64

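## Memory Optimization

The raw dataset is about 10 GB, so we downcast column dtypes where possible; see https://www.dataquest.io/blog/pandas-big-data/

The only columns with many significant digits are the lat/long values, so those are kept as 64-bit floats; the other numeric columns are stored as 32-bit floats.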
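
To make the precision trade-off concrete, the sketch below round-trips one longitude through each float width. A float32 has a 24-bit significand, so for magnitudes between 64 and 128 adjacent representable values are 2^-17 (about 7.6e-6) degrees apart, whereas the float64 round trip is exact. The variable names are illustrative.

```python
import numpy as np

lon = -87.6327464887  # a pickup centroid longitude that appears in the data

# Round-trip the value through each dtype and measure what is lost.
err32 = abs(float(np.float32(lon)) - lon)
err64 = abs(float(np.float64(lon)) - lon)

print(f"float32 round-trip error: {err32:.2e} degrees")
print(f"float64 round-trip error: {err64:.2e} degrees")
```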

In [17]:
def mem_usage(pandas_obj):
    """Return the deep memory usage of a DataFrame or Series as a string in MB."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

In [20]:
gl = tnp

gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)

5.32 MB
0.67 MB


| | before | after |
|---|---|---|
| uint8 | | 2.0 |
| int64 | 2.0 | |


In [21]:
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)

34.60 MB
17.30 MB


| | before | after |
|---|---|---|
| float32 | | 13.0 |
| float64 | 13.0 | |
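
The integer and float downcasting steps above can be folded into one helper. This is a minimal sketch, not the notebook's actual code: the function name `downcast_numeric`, its `skip` parameter (intended for columns such as lat/long that should stay at 64 bits), and the demo frame are assumptions.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def downcast_numeric(df, skip=()):
    """Return a copy of df with numeric columns downcast, leaving `skip` columns untouched."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        if is_integer_dtype(out[col]):
            # 'unsigned' only shrinks columns without negative values
            out[col] = pd.to_numeric(out[col], downcast='unsigned')
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Made-up demo frame: one integer column, one float column, one latitude column.
demo = pd.DataFrame({
    'n_riders': [1, 2, 3],
    'Trip Miles': [1.5, 2.5, 3.25],
    'Pickup Centroid Latitude': [41.8809944707] * 3,
})
small = downcast_numeric(demo, skip=('Pickup Centroid Latitude',))
print(small.dtypes)
```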


In [22]:
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()

| | Trip ID | Trip Start Timestamp | Trip End Timestamp | Pickup Centroid Location | Dropoff Centroid Location |
|---|---|---|---|---|---|
| count | 348802 | 348802 | 348802 | 328519 | 325882 |
| unique | 348802 | 5856 | 5857 | 860 | 858 |
| top | 129f952b7a5d22f8e5812488f7078a61ba94bae2 | 12/01/2018 07:00:00 PM | 12/08/2018 07:15:00 PM | POINT (-87.6327464887 41.8809944707) | POINT (-87.6327464887 41.8809944707) |
| freq | 1 | 155 | 160 | 12936 | 15132 |




## Initial Pipeline

In [None]:
def preprocess_trip_data(df, unused_cols, max_fare=200):
    """Clean a trip DataFrame in place: drop unused columns, NaN rows, and outlier trips."""
    df.columns = df.columns.str.replace(' ', '_')  # for dot notation
    # Columns are renamed first, so unused_cols must use the underscore names
    df.drop(unused_cols, axis=1, inplace=True)
    df.dropna(inplace=True)
    # Drop trips with outlier distances or times
    df.drop(df[(df.Trip_Miles <= 0) | (df.Trip_Miles > 100)].index, inplace=True)
    df.drop(df[(df.Trip_Seconds <= 0) | (df.Trip_Seconds > 1e4)].index, inplace=True)
    # Drop unusually large or non-positive fares (max_fare was previously hard-coded as 200)
    df.drop(df[(df.Fare > max_fare) | (df.Fare <= 0)].index, inplace=True)
    # Convert datetimes
    df['Trip_Start_Timestamp'] = pd.to_datetime(df['Trip_Start_Timestamp'], infer_datetime_format=True)

In [None]:
class TripPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, max_fare=200.0, max_miles=100.0):
        self.max_fare = max_fare
        self.max_miles = max_miles

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        # TODO: apply the same filtering as preprocess_trip_data, using self.max_fare and self.max_miles
        raise NotImplementedError
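
The `transform` method above has no body yet. One way it might be filled in, mirroring the thresholds in `preprocess_trip_data`, is sketched below under a different, hypothetical class name (`TripOutlierFilter`). It assumes the columns have already been renamed with underscores; the `max_seconds` parameter and the toy rows are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TripOutlierFilter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops NaN rows and outlier trips."""

    def __init__(self, max_fare=200.0, max_miles=100.0, max_seconds=1e4):
        self.max_fare = max_fare
        self.max_miles = max_miles
        self.max_seconds = max_seconds

    def fit(self, X, y=None):
        return self  # stateless: the thresholds are fixed hyperparameters

    def transform(self, X, y=None):
        X = X.dropna()  # returns a new frame, so the caller's data is not modified
        keep = (
            (X['Trip_Miles'] > 0) & (X['Trip_Miles'] <= self.max_miles)
            & (X['Trip_Seconds'] > 0) & (X['Trip_Seconds'] <= self.max_seconds)
            & (X['Fare'] > 0) & (X['Fare'] <= self.max_fare)
        )
        return X[keep]

# Made-up rows: only the first one passes every filter.
toy = pd.DataFrame({
    'Trip_Miles':   [2.0, 0.0, 3.0, 1.0, np.nan],
    'Trip_Seconds': [600.0, 300.0, 900.0, 20000.0, 500.0],
    'Fare':         [8.0, 5.0, 250.0, 6.0, 7.0],
})
clean = TripOutlierFilter().fit_transform(toy)
```

Since this transformer removes rows, it changes the number of samples; it is probably safest to apply it before separating the target column, so that X and Y stay aligned.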
    

In [7]:
from path import Path

In [10]:
datap = Path('data')

In [21]:
tp = datap/'taxi.h5'

In [22]:
tp.isfile()

True

In [31]:
str(tp.name)

'taxi.h5'