# Data Processing and Even Sampling

To answer the project's question, we create equally sized samples of the Taxi and TNP data so that neither dataset skews the trip preferences and features. We use the Taxi dataset as the base because it has the smaller sample size.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Taxi Dataset

In [3]:
taxi = ('/mnt/processed/private/msds2021/lt6/chicago-dataset/sample/'
        'taxi_weather.csv')
df = pd.read_csv(taxi, low_memory=False).drop_duplicates()

In [4]:
print("Shape:", df.shape)
df.head()

Shape: (317914, 27)


Unnamed: 0,company,dropoff_census_tract,dropoff_centroid_latitude,dropoff_centroid_location.coordinates,dropoff_centroid_location.type,dropoff_centroid_longitude,dropoff_community_area,extras,fare,payment_type,...,tips,tolls,trip_end_timestamp,trip_id,trip_miles,trip_seconds,trip_start_timestamp,trip_total,Type,Severity
0,Chicago Carriage Cab Corp,,41.901207,"[-87.6763559892, 41.90120699410001]",Point,-87.676356,24.0,1.0,12.25,Cash,...,0.0,0.0,2019-01-01 00:15:00,073b50b66398bf053747a5029ed98ab900040172,3.43,705.0,2019-01-01 00:00:00,13.25,Rain,Light
1,24 Seven Taxi,17031830000.0,41.879067,"[-87.657005027, 41.8790669938]",Point,-87.657005,28.0,2.0,9.0,Credit Card,...,2.0,0.0,2019-01-01 00:15:00,5a18ed944588e71220a19bde7bb8fea00ab9a1e3,1.67,772.0,2019-01-01 00:00:00,13.5,Rain,Light
2,Flash Cab,,41.90007,"[-87.7209182385, 41.9000696026]",Point,-87.720918,23.0,0.0,9.5,Cash,...,0.0,0.0,2019-01-01 00:15:00,f8662b747e43cdec0b545671c6f7be27ef6dbcd3,2.56,831.0,2019-01-01 00:00:00,9.5,Rain,Light
3,Flash Cab,17031320000.0,41.877406,"[-87.6219716519, 41.8774061234]",Point,-87.621972,32.0,0.0,15.75,Cash,...,0.0,0.0,2019-01-01 00:15:00,0f02393355c7bb00064194a4deed1ae950693bd2,5.04,996.0,2019-01-01 00:00:00,15.75,Rain,Light
4,Flash Cab,17031840000.0,41.880994,"[-87.6327464887, 41.8809944707]",Point,-87.632746,32.0,1.0,16.75,Cash,...,0.0,,2019-01-01 00:30:00,3bfa0f260d521734d295fdd5e25c508b17a8ffec,4.9,1219.0,2019-01-01 00:15:00,17.75,Rain,Light


## Clean-up

Drop columns that won't be used.

In [5]:
df.drop(columns=['dropoff_census_tract', 'pickup_census_tract',
                 'dropoff_centroid_location.coordinates',
                 'pickup_centroid_location.coordinates',
                 'dropoff_centroid_location.type', 
                 'pickup_centroid_location.type',
                 'taxi_id'],
        inplace=True, errors='ignore')

Rename `tips` to `tip` so that it matches the TNP column name.

In [6]:
df.rename(columns={'tips': 'tip'}, inplace=True)

Drop rows with missing or invalid values.

At this stage we drop erroneous records, such as trips with no distance, duration, fare, or total. We also consider only travel within the city, so we remove trips that start or end outside the city limits (denoted by null pick-up/drop-off centroids, i.e. null latitudes/longitudes).

In [7]:
# Drop erroneous data: zero or missing miles, seconds, fare, total, or location
df.drop(df[(df['trip_miles'] == 0) | (df['trip_miles'].isna()) |
           (df['trip_seconds'] == 0) | (df['trip_seconds'].isna()) |
           (df['fare'] == 0) | (df['fare'].isna()) |
           (df['trip_total'] == 0) | (df['trip_total'].isna()) |
           (df['pickup_centroid_latitude'] == 0) |
           (df['pickup_centroid_latitude'].isna()) |
           (df['dropoff_centroid_latitude'] == 0) |
           (df['dropoff_centroid_longitude'].isna()) |
           (df['pickup_community_area'].isna()) |
           (df['dropoff_community_area'].isna())
          ].index, inplace=True)
df.shape

(240304, 20)
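
The same kind of filter can be expressed as a keep-mask inside a small function, which makes it easier to apply one rule to both datasets. The following is a minimal sketch on a toy DataFrame with made-up values; the helper name `drop_invalid_trips` and its parameters are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_invalid_trips(frame, nonzero_cols, required_cols):
    """Keep rows where `nonzero_cols` are non-null and non-zero and
    `required_cols` are non-null. Hypothetical helper for illustration."""
    checked = frame[nonzero_cols]
    valid = (checked.notna() & (checked != 0)).all(axis=1)
    valid &= frame[required_cols].notna().all(axis=1)
    return frame[valid].copy()


# Toy data: only the first row passes every check.
toy = pd.DataFrame({
    'trip_miles': [3.4, 0.0, 1.2, 2.0],
    'fare': [12.25, 9.0, np.nan, 7.5],
    'pickup_community_area': [8.0, 8.0, 22.0, np.nan],
})
clean = drop_invalid_trips(toy,
                           nonzero_cols=['trip_miles', 'fare'],
                           required_cols=['pickup_community_area'])
print(clean.shape)
```

Returning a copy rather than dropping in place keeps the original frame available for comparing row counts before and after filtering.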

## Impute

Check for null values.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 240304 entries, 0 to 317913
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   company                     240304 non-null  object 
 1   dropoff_centroid_latitude   240304 non-null  float64
 2   dropoff_centroid_longitude  240304 non-null  float64
 3   dropoff_community_area      240304 non-null  float64
 4   extras                      240304 non-null  float64
 5   fare                        240304 non-null  float64
 6   payment_type                240304 non-null  object 
 7   pickup_centroid_latitude    240304 non-null  float64
 8   pickup_centroid_longitude   240304 non-null  float64
 9   pickup_community_area       240304 non-null  float64
 10  tip                         240304 non-null  float64
 11  tolls                       236049 non-null  float64
 12  trip_end_timestamp          240304 non-null  object 
 13  trip_id       

Set null `tolls` to 0, since we can't assume that these trips incurred any tolls.

In [9]:
df.loc[df.tolls.isna(), 'tolls'] = 0

Regroup the `payment_type` values. The raw data has many payment types, but according to the documentation some are equivalent (e.g. Prcard, Pcard and Prepaid), so we re-map them to a smaller set. We also create `payment_is_cashless` as a new feature.

In [10]:
payment_map = {'Cash': 'cash',
               'Credit Card': 'credit_card',
               'Mobile': 'mobile',
               'Unknown': 'unknown',
               'Prcard': 'prepaid_card',
               'No Charge': 'unknown',
               'Dispute': 'unknown',
               'Pcard': 'prepaid_card',
               'Prepaid': 'prepaid_card'}
cashless_map = {'cash': 'cash',
                'credit_card': 'cashless',
                'mobile': 'cashless',
                'prepaid_card': 'cashless',
                'unknown': 'unknown'}
df.loc[:, 'payment_type'] = df.payment_type.map(payment_map)
df.loc[:, 'payment_is_cashless'] = df.payment_type.map(cashless_map)
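
Note that `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a payment type missing from the mapping would silently become null. Below is a minimal sketch of a guard against that, using a toy Series and a shortened, illustrative mapping (`payment_map_demo` is a stand-in name, and `'Bitcoin'` is a made-up value used only to trigger the check).

```python
import pandas as pd

# Shortened, illustrative mapping (not the full one used in the notebook).
payment_map_demo = {'Cash': 'cash',
                    'Credit Card': 'credit_card',
                    'Prcard': 'prepaid_card'}

# 'Bitcoin' is a made-up value that is deliberately absent from the mapping.
payments = pd.Series(['Cash', 'Prcard', 'Bitcoin', 'Credit Card'])

mapped = payments.map(payment_map_demo)

# Raw values that had no entry in the mapping and therefore became NaN.
unmapped = sorted(payments[mapped.isna()].unique())
print(unmapped)
```

Running a check like this before overwriting `payment_type` would reveal any category that the mapping does not cover.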

## Check Final Taxi Dataset

Let's check whether any column still contains null values (`True` means the column has none).

In [11]:
(~df.isna()).all()

company                        True
dropoff_centroid_latitude      True
dropoff_centroid_longitude     True
dropoff_community_area         True
extras                         True
fare                           True
payment_type                   True
pickup_centroid_latitude       True
pickup_centroid_longitude      True
pickup_community_area          True
tip                            True
tolls                          True
trip_end_timestamp             True
trip_id                        True
trip_miles                     True
trip_seconds                   True
trip_start_timestamp           True
trip_total                     True
Type                          False
Severity                      False
payment_is_cashless            True
dtype: bool
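
The boolean summary shows that `Type` and `Severity` still contain nulls, but not how many. One way to count them per column is sketched below on a toy DataFrame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the weather columns have nulls.
toy = pd.DataFrame({
    'fare': [12.25, 9.0, 9.5],
    'Type': ['Rain', np.nan, np.nan],
    'Severity': ['Light', np.nan, 'Light'],
})

# Count nulls per column and keep only the columns that have any.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```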

In [12]:
print("Shape:", df.shape)
df.head()

Shape: (240304, 21)


Unnamed: 0,company,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_community_area,extras,fare,payment_type,pickup_centroid_latitude,pickup_centroid_longitude,pickup_community_area,...,tolls,trip_end_timestamp,trip_id,trip_miles,trip_seconds,trip_start_timestamp,trip_total,Type,Severity,payment_is_cashless
0,Chicago Carriage Cab Corp,41.901207,-87.676356,24.0,1.0,12.25,cash,41.899602,-87.633308,8.0,...,0.0,2019-01-01 00:15:00,073b50b66398bf053747a5029ed98ab900040172,3.43,705.0,2019-01-01 00:00:00,13.25,Rain,Light,cash
1,24 Seven Taxi,41.879067,-87.657005,28.0,2.0,9.0,credit_card,41.892042,-87.631864,8.0,...,0.0,2019-01-01 00:15:00,5a18ed944588e71220a19bde7bb8fea00ab9a1e3,1.67,772.0,2019-01-01 00:00:00,13.5,Rain,Light,cashless
2,Flash Cab,41.90007,-87.720918,23.0,0.0,9.5,cash,41.922761,-87.699155,22.0,...,0.0,2019-01-01 00:15:00,f8662b747e43cdec0b545671c6f7be27ef6dbcd3,2.56,831.0,2019-01-01 00:00:00,9.5,Rain,Light,cash
3,Flash Cab,41.877406,-87.621972,32.0,0.0,15.75,cash,41.921778,-87.651062,7.0,...,0.0,2019-01-01 00:15:00,0f02393355c7bb00064194a4deed1ae950693bd2,5.04,996.0,2019-01-01 00:00:00,15.75,Rain,Light,cash
4,Flash Cab,41.880994,-87.632746,32.0,1.0,16.75,cash,41.922083,-87.634156,7.0,...,0.0,2019-01-01 00:30:00,3bfa0f260d521734d295fdd5e25c508b17a8ffec,4.9,1219.0,2019-01-01 00:15:00,17.75,Rain,Light,cash


We store the cleaned dataset in its own DataFrame and use its row count as the target sample size for the TNP dataset.

In [13]:
taxi_df = df.copy()
target_size = df.shape[0]

# TNP Dataset

In [14]:
tnp = ('/mnt/processed/private/msds2021/lt6/chicago-dataset/sample/'
       'tnp_weather.csv')
df = pd.read_csv(tnp, low_memory=False).drop_duplicates()

In [15]:
print("Shape:", df.shape)
df.head()

Shape: (2199163, 19)


Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tip,additional_charges,trip_total,shared_trip_authorized,trips_pooled,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude,Type,Severity
0,d3f1ee5a5668dc4910897cb80706e025ac788379,2019-01-01 00:00:00,2019-01-01 00:15:00,616.0,4.652589,6.0,77.0,7.5,2.0,2.5,12.0,0,1,41.938232,-87.646782,41.993441,-87.657418,Rain,Light
1,daccb5cbeae6589358e2f1c3ed0a025827c23cc5,2019-01-01 00:00:00,2019-01-01 00:15:00,235.0,0.736279,6.0,7.0,2.5,0.0,2.5,5.0,0,1,41.938232,-87.646782,41.930579,-87.642206,Rain,Light
2,df41ec8759ea2358ba09cd53e8ce911290841ce3,2019-01-01 00:00:00,2019-01-01 00:15:00,762.0,4.660277,1.0,6.0,10.0,2.0,2.5,14.5,0,1,42.001698,-87.673574,41.94649,-87.647114,Rain,Light
3,e5c8cde09aa4afe214337aa1cb179fc6d120adae,2019-01-01 00:00:00,2019-01-01 00:45:00,2466.0,4.602994,28.0,8.0,17.5,0.0,2.5,20.0,0,1,41.863118,-87.67292,41.890922,-87.618868,Rain,Light
4,e8fc3a3a1a49d66d1666e643e5bc9a799caa4784,2019-01-01 00:00:00,2019-01-01 00:30:00,1348.0,7.968732,9.0,,15.0,0.0,2.5,17.5,0,1,42.005608,-87.813098,,,Rain,Light


## Clean-up

Drop columns that won't be used.

In [16]:
df.drop(columns=['dropoff_census_tract', 'pickup_census_tract',
                 'dropoff_centroid_location.coordinates',
                 'pickup_centroid_location.coordinates',
                 'dropoff_centroid_location.type', 
                 'pickup_centroid_location.type',
                 'trips_pooled', 'shared_trip_authorized'],
        inplace=True, errors='ignore')

Drop rows with missing or invalid values, using the same kind of filter as for the taxi dataset.

In [17]:
# Drop erroneous data: zero or missing miles, seconds, fare, total, or location
df.drop(df[(df['trip_miles'] == 0) | (df['trip_miles'].isna()) |
           (df['trip_seconds'] == 0) | (df['trip_seconds'].isna()) |
           (df['fare'] == 0) | (df['fare'].isna()) |
           (df['trip_total'] == 0) | (df['trip_total'].isna()) |
           (df['pickup_centroid_latitude'] == 0) |
           (df['pickup_centroid_longitude'].isna()) |
           (df['dropoff_centroid_latitude'] == 0) |
           (df['dropoff_centroid_longitude'].isna()) |
           (df['pickup_community_area'].isna()) |
           (df['dropoff_community_area'].isna())
          ].index, inplace=True)
df.shape

(1883589, 17)

## Impute

Check for null values (`True` means the column has none).

In [18]:
(~df.isna()).all()

trip_id                        True
trip_start_timestamp           True
trip_end_timestamp             True
trip_seconds                   True
trip_miles                     True
pickup_community_area          True
dropoff_community_area         True
fare                           True
tip                            True
additional_charges             True
trip_total                     True
pickup_centroid_latitude       True
pickup_centroid_longitude      True
dropoff_centroid_latitude      True
dropoff_centroid_longitude     True
Type                          False
Severity                      False
dtype: bool

In [21]:
print("Shape:", df.shape)
df.head()

Shape: (1883589, 17)


Unnamed: 0,trip_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tip,additional_charges,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude,Type,Severity
0,d3f1ee5a5668dc4910897cb80706e025ac788379,2019-01-01 00:00:00,2019-01-01 00:15:00,616.0,4.652589,6.0,77.0,7.5,2.0,2.5,12.0,41.938232,-87.646782,41.993441,-87.657418,Rain,Light
1,daccb5cbeae6589358e2f1c3ed0a025827c23cc5,2019-01-01 00:00:00,2019-01-01 00:15:00,235.0,0.736279,6.0,7.0,2.5,0.0,2.5,5.0,41.938232,-87.646782,41.930579,-87.642206,Rain,Light
2,df41ec8759ea2358ba09cd53e8ce911290841ce3,2019-01-01 00:00:00,2019-01-01 00:15:00,762.0,4.660277,1.0,6.0,10.0,2.0,2.5,14.5,42.001698,-87.673574,41.94649,-87.647114,Rain,Light
3,e5c8cde09aa4afe214337aa1cb179fc6d120adae,2019-01-01 00:00:00,2019-01-01 00:45:00,2466.0,4.602994,28.0,8.0,17.5,0.0,2.5,20.0,41.863118,-87.67292,41.890922,-87.618868,Rain,Light
5,f71c2bde16576fe46495ba9b38b5d6823bcd18f1,2019-01-01 00:00:00,2019-01-01 00:15:00,823.0,8.963933,15.0,28.0,12.5,0.0,2.5,15.0,41.954028,-87.763399,41.874005,-87.663518,Rain,Light


# Downsampling TNP

We'll randomly downsample the TNP dataset to the same number of rows as the cleaned taxi dataset.

*Note: Because no random state is set for this code, each run returns a different sample.*

In [22]:
idx = np.random.choice(df.shape[0], size=target_size, replace=False)

In [23]:
df = df.iloc[idx].copy()

In [24]:
df.shape

(240304, 17)
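
If a reproducible sample is preferred, one option is `DataFrame.sample` with a fixed `random_state`. The sketch below shows that alternative on a toy frame; it is not what the notebook ran, and the seed value 42, the frame `tnp_toy`, and `target_size_demo` are arbitrary placeholders.

```python
import pandas as pd

# Toy stand-in for the cleaned TNP data.
tnp_toy = pd.DataFrame({'trip_id': list('abcdefghij'),
                        'fare': range(10)})
target_size_demo = 4  # placeholder for the taxi row count

# Sampling twice with the same seed yields the same rows.
sample_a = tnp_toy.sample(n=target_size_demo, replace=False, random_state=42)
sample_b = tnp_toy.sample(n=target_size_demo, replace=False, random_state=42)
print(sample_a.equals(sample_b))
```

Fixing the seed would make the downstream clustering input identical across runs, at the cost of always analysing the same subset.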

# Merging the Two Datasets

## New columns

Create a `TransportType` column to identify the kind of trip. We set `payment_type` to mobile for TNP trips because, as indicated by Chicago regulation, TNPs cannot accept cash transactions and are essentially cashless.

In [25]:
taxi_df.loc[:, 'TransportType'] = 'taxi'
df.loc[:, 'TransportType'] = 'tnp'
df.loc[:, 'payment_type'] = 'mobile'
df.loc[:, 'payment_is_cashless'] = 'cashless'

Make an `additional_charges` column for taxi trips by summing `tolls` and `extras`. This is the equivalent of the TNP `additional_charges` column.

In [26]:
taxi_df['additional_charges'] = taxi_df['tolls'] + taxi_df['extras']
taxi_df.drop(columns=['tolls', 'extras'], inplace=True)
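
This sum relies on `tolls` having been imputed earlier, because adding a null to a number yields a null. A small illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values; the second row has a missing toll.
toy = pd.DataFrame({'tolls': [0.0, np.nan, 1.5],
                    'extras': [1.0, 2.0, 0.0]})

# Without imputation the missing toll propagates into the sum.
naive = toy['tolls'] + toy['extras']

# With the toll set to 0 first, every row gets a total.
filled = toy['tolls'].fillna(0) + toy['extras']
print(naive.isna().sum(), filled.tolist())
```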

## Concatenate DataFrames

Concatenate the two DataFrames and write the result to a file for easier reference. This is the final dataset that will be used for clustering.

In [27]:
pd.concat([taxi_df, df], ignore_index=True).to_csv('taxi_tnp_weather.csv.gz',
                                                   index=False)
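
Because the two frames do not share all columns (for example, `company` exists only in the taxi data), `pd.concat` fills the missing cells with `NaN`. The sketch below uses toy frames with made-up values to show this, along with a quick check that both transport types are equally represented.

```python
import pandas as pd

# Toy stand-ins for the two cleaned frames.
taxi_toy = pd.DataFrame({'company': ['Flash Cab', '24 Seven Taxi'],
                         'fare': [9.5, 9.0],
                         'TransportType': 'taxi'})
tnp_toy = pd.DataFrame({'fare': [7.5, 2.5],
                        'TransportType': 'tnp'})

combined = pd.concat([taxi_toy, tnp_toy], ignore_index=True)

# Each transport type should contribute the same number of rows.
counts = combined['TransportType'].value_counts().to_dict()
print(counts)

# Columns present in only one frame are null for the other frame's rows.
print(combined['company'].isna().sum())
```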