In [7]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [8]:
filename = '../data/nyc_taxi_2019-01.csv'

df = pd.read_csv(filename,
                usecols=['passenger_count',
                         'trip_distance', 'total_amount'])
                # dtype={'total_amount':np.float128})

df.head()

index,passenger_count,trip_distance,total_amount
0,1,1.5,9.95
1,1,2.6,16.3
2,3,0.0,5.8
3,5,0.0,7.55
4,5,0.0,55.55


# Beyond 1

In which five rides did people pay the most per mile?

In [9]:
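# First, remove 0-length trips. The .copy() makes the result an independent
# data frame, so adding a column below doesn't raise SettingWithCopyWarning.
df = df[df['trip_distance'] != 0].copy()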

# Create a new column, in which we calculate the cost per mile
df['cost_per_mile'] = df['total_amount'] / df['trip_distance']

# Now sort the data frame by that column, and get the 5 highest values
df.sort_values('cost_per_mile').tail(5)

# Obviously, the data is a bit messed up, given that some trips went 0.01 
# miles and at least one trip cost $623,261!

index,passenger_count,trip_distance,total_amount,cost_per_mile
4136499,1,0.01,273.96,27396.0
6403254,1,0.01,322.3,32230.0
7099014,4,0.01,415.3,41530.0
478791,1,0.1,6667.45,66674.5
2499600,1,2.4,623261.66,259692.358333
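
As a cross-check on the sort-then-`tail` approach, here is a minimal sketch that gets the same kind of answer with `nlargest`. The `toy` data frame and its values are made up for illustration and are not taken from the taxi data.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi CSV
toy = pd.DataFrame({
    'trip_distance': [1.5, 0.0, 0.01, 2.4, 10.0],
    'total_amount': [9.95, 5.80, 27.40, 12.00, 40.00],
})

# Drop zero-length trips; .copy() gives an independent frame to modify
trips = toy[toy['trip_distance'] != 0].copy()
trips['cost_per_mile'] = trips['total_amount'] / trips['trip_distance']

# nlargest returns the rows with the highest values, largest first
top2 = trips.nlargest(2, 'cost_per_mile')
print(top2)
```

Unlike `sort_values(...).tail(5)`, which lists the largest value last, `nlargest` lists it first, and it doesn't need to sort the whole data frame.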


# Beyond 2

Let's assume that multi-passenger rides are split evenly among the passengers. Given that assumption, in which 10 rides did each individual pay the greatest amount? And again, how far did they travel?

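### Note: the result depends on whether you reloaded the data frame after Beyond 1, which removed all zero-distance trips and added `cost_per_mile`. The output below comes from the data frame as Beyond 1 left it.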

In [13]:
df.shape

(2083565, 5)

In [11]:
# Remove trips with <2 passengers; .copy() avoids a SettingWithCopyWarning
# when we add a column to the result
df = df[df['passenger_count'] >= 2].copy()

# Create a new column based on these values
df['payment_per_person'] = df['total_amount'] / df['passenger_count']

# Find the highest per-person payment
df.sort_values('payment_per_person').tail(10)

index,passenger_count,trip_distance,total_amount,cost_per_mile,payment_per_person
5031491,2,64.3,343.32,5.339347,171.66
4563340,2,0.4,350.3,875.75,175.15
4202883,2,60.23,369.06,6.127511,184.53
4751745,2,100.78,403.5,4.003771,201.75
5726185,2,65.05,416.82,6.407686,208.41
149362,2,17.2,426.8,24.813953,213.4
7593395,2,83.61,449.32,5.373998,224.66
3842620,2,110.04,515.82,4.687568,257.91
3014027,2,16.6,560.76,33.780723,280.38
2972145,2,19.9,589.96,29.646231,294.98
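
Because each step above reassigns `df`, the answer depends on which cells ran earlier. One way to sidestep that, sketched here on made-up data, is to chain the filter, the new column, and the selection without modifying the original data frame. The `toy` frame and the name `per_person` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi CSV
toy = pd.DataFrame({
    'passenger_count': [1, 2, 4, 2, 3],
    'trip_distance': [1.0, 5.0, 8.0, 0.4, 2.0],
    'total_amount': [50.0, 30.0, 100.0, 90.0, 12.0],
})

per_person = (
    toy[toy['passenger_count'] >= 2]            # multi-passenger rides only
    .assign(payment_per_person=lambda d:        # new column on a copy
            d['total_amount'] / d['passenger_count'])
    .nlargest(2, 'payment_per_person')          # highest per-person payments
)

# Show how far those rides went, alongside what each person paid
print(per_person[['trip_distance', 'payment_per_person']])
```

Here `toy` keeps its original rows and columns, so rerunning the cell gives the same result regardless of what ran before it.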


In [12]:
df.shape

(2083565, 5)

# Beyond 3

In the exercise solution, I showed that we needed to use `iloc` or `head`/`tail` to retrieve the first/last 20 rows, because the index was all scrambled after our sort operation. But you can pass `ignore_index=True` to `sort_values`, and then the resulting data frame will have a numeric index, starting at 0. Use this option, and `loc`, to get the mean `total_amount` for the 20 longest trips.

In [7]:
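# loc slices by label and includes both endpoints, so on the fresh
# 0-based index, loc[:19] selects exactly 20 rows. (loc[:20] selects 21,
# and is what produced the mean shown below.)
df.sort_values('trip_distance',
               ascending=False,
               ignore_index=True)['total_amount'].loc[:19].mean()


253.65904761904761955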
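
The difference between `loc` and `iloc` slicing is easy to get wrong, so here is a small sketch on made-up data. The `toy` frame and the variable names are hypothetical; only `sort_values`, `ignore_index`, `loc`, and `iloc` come from the exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi CSV
toy = pd.DataFrame({
    'trip_distance': [3.0, 9.0, 1.0, 7.0, 5.0],
    'total_amount': [10.0, 40.0, 5.0, 30.0, 20.0],
})

# Longest trips first; ignore_index=True relabels the rows 0, 1, 2, ...
by_distance = toy.sort_values('trip_distance',
                              ascending=False,
                              ignore_index=True)

n = 3
amounts = by_distance['total_amount']

with_loc = amounts.loc[:n - 1]   # label-based: end label is included
with_iloc = amounts.iloc[:n]     # position-based: end position is excluded
one_too_many = amounts.loc[:n]   # labels 0..n, so n + 1 rows

print(len(with_loc), len(with_iloc), len(one_too_many))
print(with_loc.mean())
```

With a fresh 0-based index, `loc[:n - 1]` and `iloc[:n]` pick the same rows, while `loc[:n]` quietly includes one extra.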