### CH3/09 Solution: Taxi data mean speed

# Calculating Mean Speed (miles per hour) from `taxi.parquet`

## Overview
This script loads a Parquet file containing taxi trip data and calculates the mean speed of trips in miles per hour.

1. **Download the Data**
   Run `download_data.py` to obtain the `taxi.parquet` file.

2. **Load the Dataset**  
   Read the Parquet file into a pandas DataFrame:

In [11]:
import pandas as pd
df = pd.read_parquet('taxi.parquet')

In [13]:
print(df.head(3))

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2020-08-01 00:02:53   2020-08-01 00:28:54              1.0   
1         2  2020-08-01 00:08:11   2020-08-01 00:16:28              1.0   
2         2  2020-08-01 00:22:14   2020-08-01 00:22:20              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0          13.20         1.0                  N           237            16   
1           2.83         1.0                  N           146           137   
2           0.00         1.0                  N           264           264   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0             2         36.5    3.0      0.5         0.0           0.0   
1             1         10.5    0.5      0.5         1.5           0.0   
2             2          2.5    0.5      0.5         0.0           0.0   

   improvement_surcharge  total_amount  congestion_surcharge  airport_fee  
0        

3. **Filter Out Invalid Entries**  
   Check that each drop-off time is later than its pick-up time, then keep only the rows that pass:

In [42]:
print(df['tpep_dropoff_datetime'] > df['tpep_pickup_datetime'])

0          True
1          True
2          True
3          True
4          True
           ... 
1007281    True
1007282    True
1007283    True
1007284    True
1007285    True
Length: 1006450, dtype: bool


In [23]:
mask = df['tpep_dropoff_datetime'] > df['tpep_pickup_datetime']
df = df[mask]
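
The strict `>` comparison matters here: it removes trips that end before they start and also trips with zero duration, which would otherwise lead to a division by zero when speed is computed. A minimal sketch on hypothetical toy data (the `toy`, `valid`, `kept`, and `dropped` names are illustrative and not part of the original script):

```python
import pandas as pd

# Hypothetical toy data: one normal trip, one zero-duration trip,
# and one trip whose drop-off is recorded before its pick-up.
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2020-08-01 00:00:00', '2020-08-01 01:00:00', '2020-08-01 02:00:00',
    ]),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2020-08-01 00:30:00', '2020-08-01 01:00:00', '2020-08-01 01:45:00',
    ]),
})

# Boolean mask: True only where the drop-off is strictly later than the pick-up.
valid = toy['tpep_dropoff_datetime'] > toy['tpep_pickup_datetime']
kept = toy[valid]
dropped = len(toy) - len(kept)
print(f'kept {len(kept)} rows, dropped {dropped}')
```

Counting the dropped rows this way is a cheap sanity check before overwriting `df`.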

4. **Calculate Trip Duration in Hours**  
   Compute the time difference between drop-off and pick-up, then convert it to hours:

In [38]:
times = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
times_hour = times / pd.Timedelta(1, 'hour')

In [40]:
print(times_hour)

0          0.433611
1          0.138056
2          0.001667
3          0.088611
4          0.047500
             ...   
1007281    0.103611
1007282    0.150000
1007283    0.112778
1007284    0.483333
1007285    0.400000
Length: 1006450, dtype: float64
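
Dividing a timedelta Series by a one-hour `Timedelta` yields plain floats measured in hours. The sketch below illustrates this on three hypothetical durations and compares it with an equivalent route through `.dt.total_seconds()`; the one-hour unit is written as `pd.Timedelta(hours=1)` here, and the variable names are illustrative only:

```python
import pandas as pd

# Hypothetical durations: 30 minutes, 90 minutes, and 6 seconds.
durations = pd.Series(pd.to_timedelta(['30min', '90min', '6s']))

# Timedelta / Timedelta gives a dimensionless float: the number of hours.
hours_by_division = durations / pd.Timedelta(hours=1)

# Equivalent alternative: convert to seconds, then divide by 3600.
hours_by_seconds = durations.dt.total_seconds() / 3600

print(hours_by_division.tolist())
print(hours_by_seconds.tolist())
```

Both approaches should agree up to floating-point rounding; the 6-second case corresponds to roughly 0.001667 hours, the same value shown for row 2 above.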


5. **Compute Speed**  
   Divide the trip distance by the trip duration to get speed in miles per hour:

In [45]:
speed = df['trip_distance'] / times_hour

In [47]:
print(speed)

0          30.442024
1          20.498994
2           0.000000
3          26.407524
4          27.789474
             ...    
1007281    13.608579
1007282    15.200000
1007283    14.187192
1007284    18.827586
1007285    32.250000
Length: 1006450, dtype: float64
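
As a check on the arithmetic, the first row can be recomputed by hand from the values printed by `df.head(3)`: 13.20 miles between 00:02:53 and 00:28:54. This is only a worked example with scalar values, and the variable names are illustrative:

```python
import pandas as pd

# Values copied from row 0 of the df.head(3) output.
pickup = pd.Timestamp('2020-08-01 00:02:53')
dropoff = pd.Timestamp('2020-08-01 00:28:54')
distance_miles = 13.20

# 26 minutes 1 second = 1561 seconds, expressed in hours.
hours = (dropoff - pickup) / pd.Timedelta(hours=1)
speed_mph = distance_miles / hours

print(round(hours, 6), round(speed_mph, 6))
```

The rounded values agree with the 0.433611 hours and 30.442024 mph shown for row 0 in the outputs above.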


6. **Calculate Mean Speed**  
   Finally, find the mean speed across all trips:

In [50]:
speed.mean()

17.179585517575283

The result, about 17.18, is the mean of the per-trip speeds in miles per hour. Zero-distance trips such as row 2 are included and each contributes a speed of 0.
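
Note that `speed.mean()` weights every trip equally, which is not the same as dividing total distance by total time. The sketch below wraps the steps from this section in a hypothetical helper, `mean_speed_mph` (a name introduced here for illustration, not taken from `mean_speed.py`), and contrasts the two definitions on made-up data:

```python
import pandas as pd


def mean_speed_mph(trips: pd.DataFrame) -> float:
    """Mean of per-trip speeds in mph, following the steps in this section."""
    # Keep only trips whose drop-off is strictly later than the pick-up.
    valid = trips['tpep_dropoff_datetime'] > trips['tpep_pickup_datetime']
    trips = trips[valid]
    # Trip duration in hours, then per-trip speed.
    hours = (trips['tpep_dropoff_datetime']
             - trips['tpep_pickup_datetime']) / pd.Timedelta(hours=1)
    return float((trips['trip_distance'] / hours).mean())


# Hypothetical data: 15 miles in 0.5 h (30 mph) and 20 miles in 2 h (10 mph).
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(
        ['2020-08-01 00:00:00', '2020-08-01 01:00:00']),
    'tpep_dropoff_datetime': pd.to_datetime(
        ['2020-08-01 00:30:00', '2020-08-01 03:00:00']),
    'trip_distance': [15.0, 20.0],
})

per_trip_mean = mean_speed_mph(toy)  # (30 + 10) / 2

# Alternative definition: total distance divided by total time.
total_hours = ((toy['tpep_dropoff_datetime'] - toy['tpep_pickup_datetime'])
               / pd.Timedelta(hours=1)).sum()
overall = toy['trip_distance'].sum() / total_hours  # 35 / 2.5

print(per_trip_mean, overall)
```

On this toy data the per-trip mean is 20 mph while the distance-weighted figure is 14 mph, so it is worth stating which definition a reported "mean speed" uses.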

[Context_Python_Scientific_Stack](./../../Context_Python_Scientific_Stack.md)