In [29]:
import pandas as pd


You are a Business Analyst on the **Uber** Pool Product Team working to optimize driver compensation. The team aims to understand how trip characteristics impact driver earnings. Your goal is to develop data-driven recommendations that maximize driver earnings potential.

In [30]:
# Load data
fct_trips = pd.read_csv('fct_trips.csv')

# Convert `trip_date` to datetime
fct_trips['trip_date'] = pd.to_datetime(fct_trips['trip_date'])

# Display the dataset
print(fct_trips)

    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
1       102          1  UberPool 2024-07-15            2             8.0   
2       103          2  UberPool 2024-08-10            4            15.0   
3       104          3     UberX 2024-07-20            1             5.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
6       107          4  UberPool 2024-10-01            3             9.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
9       110          2  UberPool 2024-07-07            2             7.0   
10      111          3  UberPool 2024-08-05            4            13.0   
11      112          5     UberX 2024-09-10            1             4.0   
12      113 

### Question 1 of 3

What is the average driver earnings per completed UberPool ride with more than two riders between July 1st and September 30th, 2024? This analysis will help isolate trips that meet specific rider thresholds to understand their impact on driver earnings.

In [31]:
# Average earnings per UberPool ride (>2 riders) between 2024-07-01 and 2024-09-30
# Assumption: fct_trips holds only completed trips, so no status filter is applied
mask = (
    fct_trips['ride_type'].str.strip().str.lower().eq('uberpool') &
    (fct_trips['rider_count'] > 2) &
    fct_trips['trip_date'].between('2024-07-01', '2024-09-30') &
    fct_trips['total_earnings'].notna()
)

subset = fct_trips.loc[mask, ['trip_id', 'total_earnings']]

if subset.empty:
    print("No matching trips found for the specified criteria.")
else:
    avg_earnings = subset['total_earnings'].mean()
    trip_count = len(subset)
    print(f"Average driver earnings per UberPool ride (>2 riders) from 2024-07-01 to 2024-09-30: ${avg_earnings:.2f} (n={trip_count} trips)")


Average driver earnings per UberPool ride (>2 riders) from 2024-07-01 to 2024-09-30: $36.05 (n=10 trips)
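The date filter above relies on `Series.between`, which includes both endpoints by default. That is sufficient here because `trip_date` appears to hold dates with no time of day. As a minimal sketch, assuming a hypothetical series whose timestamps do carry times (these values are not from `fct_trips`), a half-open comparison against the following day would keep late trips on September 30th:

```python
import pandas as pd

# Hypothetical timestamps (not from fct_trips); one falls late on the last day.
ts = pd.Series(pd.to_datetime([
    "2024-07-01 00:00",
    "2024-09-30 00:00",
    "2024-09-30 18:30",
    "2024-10-01 00:00",
]))

# between() is inclusive on both ends, but '2024-09-30' is parsed as midnight,
# so the 18:30 trip on the last day falls outside the range.
inclusive_mask = ts.between("2024-07-01", "2024-09-30")

# Half-open range: on or after July 1st and strictly before October 1st.
half_open_mask = (ts >= "2024-07-01") & (ts < "2024-10-01")

print(inclusive_mask.tolist())
print(half_open_mask.tolist())
```

With date-only values the two masks agree, so the choice only matters if the source column ever includes a time component.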


### Question 2 of 3

For completed UberPool rides between July 1st and September 30th, 2024, derive a new column calculating earnings per mile (`total_earnings` divided by `total_distance`) and then compute the average earnings per mile for rides with more than two riders. This calculation will reveal efficiency metrics for driver compensation.

In [32]:
# Derive earnings_per_mile and compute average for UberPool rides with >2 riders (2024-07-01 to 2024-09-30)

# New column: earnings per mile; rows with non-positive distance become NaN instead of inf
fct_trips['earnings_per_mile'] = (
    (fct_trips['total_earnings'] / fct_trips['total_distance'])
    .where(fct_trips['total_distance'] > 0)
)

mask = (
    fct_trips['ride_type'].str.strip().str.lower().eq('uberpool') &
    fct_trips['trip_date'].between('2024-07-01', '2024-09-30') &
    (fct_trips['rider_count'] > 2) &
    fct_trips['earnings_per_mile'].notna()
)

subset = fct_trips.loc[mask, ['trip_id', 'earnings_per_mile']]

if subset.empty:
    print("No matching trips found for the specified criteria.")
else:
    avg_epm = subset['earnings_per_mile'].mean()
    trip_count = len(subset)
    print(f"Average earnings per mile for UberPool rides (>2 riders) from 2024-07-01 to 2024-09-30: ${avg_epm:.2f}/mile (n={trip_count} trips)")

# Display the updated dataset
print("\nUpdated dataset with earnings per mile:")
print(fct_trips)

# Display the subset for further analysis
print("\nSubset of trips for further analysis:")
print(subset)


Average earnings per mile for UberPool rides (>2 riders) from 2024-07-01 to 2024-09-30: $2.39/mile (n=10 trips)

Updated dataset with earnings per mile:
    trip_id  driver_id ride_type  trip_date  rider_count  total_distance  \
0       101          1  UberPool 2024-07-05            3            10.5   
1       102          1  UberPool 2024-07-15            2             8.0   
2       103          2  UberPool 2024-08-10            4            15.0   
3       104          3     UberX 2024-07-20            1             5.0   
4       105          2  UberPool 2024-09-01            3            12.0   
5       106          4  UberPool 2024-09-15            5            20.0   
6       107          4  UberPool 2024-10-01            3             9.0   
7       108          5  UberPool 2024-08-25            4            11.0   
8       109          1  UberPool 2024-09-30            3             6.0   
9       110          2  UberPool 2024-07-07            2             7.0   
10      111
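The average computed in the cell above is a mean of per-trip ratios, which matches the question as worded and gives every trip equal weight. A distance-weighted alternative (total earnings divided by total miles) can give a different figure when trip lengths vary. The sketch below uses a hypothetical two-trip frame, not the real data, purely to illustrate the difference:

```python
import pandas as pd

# Hypothetical trips (not from fct_trips): one short, one long.
toy = pd.DataFrame({
    "total_earnings": [12.0, 30.0],
    "total_distance": [3.0, 15.0],
})

# Per-trip average: every trip counts equally (the approach used in the cell above).
per_trip_avg = (toy["total_earnings"] / toy["total_distance"]).mean()

# Distance-weighted average: total earnings divided by total miles driven.
weighted_avg = toy["total_earnings"].sum() / toy["total_distance"].sum()

print(f"per-trip: ${per_trip_avg:.2f}/mile, distance-weighted: ${weighted_avg:.2f}/mile")
```

Which version is appropriate depends on the decision being made: the per-trip mean describes a typical ride, while the weighted version describes earnings per mile actually driven.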

### Question 3 of 3

Identify the combination of rider count and total distance that results in the highest average driver earnings per UberPool ride between July 1st and September 30th, 2024. This analysis will inform recommendations on which trip combinations maximize driver earnings.

In [33]:
# Identify best (rider_count, total_distance bin) combo by average earnings (2024-07-01 to 2024-09-30)

# Filter: UberPool in range with valid earnings and distance
mask = (
    fct_trips['ride_type'].str.strip().str.lower().eq('uberpool') &
    fct_trips['trip_date'].between('2024-07-01', '2024-09-30') &
    fct_trips['total_earnings'].notna() &
    fct_trips['total_distance'].notna() &
    (fct_trips['total_distance'] > 0)
)
# .copy() so that adding distance_bin below does not trigger SettingWithCopyWarning
df = fct_trips.loc[mask, ['trip_id', 'rider_count', 'total_distance', 'total_earnings']].copy()

if df.empty:
    print("No matching trips found for the specified criteria.")
else:
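    # Bin distances into actionable ranges; right=False makes each bin left-closed,
    # e.g. [10, 20), so a trip of exactly 20.0 miles lands in '20+ mi'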
    bins = [0, 2, 5, 10, 20, float('inf')]
    labels = ['0-2 mi', '2-5 mi', '5-10 mi', '10-20 mi', '20+ mi']
    df['distance_bin'] = pd.cut(df['total_distance'], bins=bins, labels=labels, right=False, include_lowest=True)

    # Aggregate by rider_count and distance_bin
    agg = (
        df.groupby(['rider_count', 'distance_bin'], dropna=False, observed=True)
          .agg(avg_earnings=('total_earnings', 'mean'),
               trip_count=('trip_id', 'count'))
          .reset_index()
          .sort_values(['avg_earnings', 'trip_count'], ascending=[False, False])
    )

    # Prefer combinations with at least MIN_N trips; if none qualifies,
    # fall back to all groups (the printed note flags this small-sample case)
    MIN_N = 5
    strong = agg[agg['trip_count'] >= MIN_N]

    result_table = strong if not strong.empty else agg
    top = result_table.head(10)

    print("Top combinations by average earnings per ride (descending):")
    print(top.to_string(index=False, formatters={'avg_earnings': lambda x: f"${x:,.2f}"}))

    best_row = result_table.iloc[0]
    print(
        f"\nBest combination: rider_count={int(best_row['rider_count'])}, "
        f"distance={best_row['distance_bin']} -> "
        f"avg_earnings={best_row['avg_earnings']:.2f} (n={int(best_row['trip_count'])})"
        + ("" if not strong.empty else " [note: fewer than MIN_N trips in some groups]")
    )


Top combinations by average earnings per ride (descending):
 rider_count distance_bin avg_earnings  trip_count
           5       20+ mi       $55.00           2
           3       20+ mi       $45.00           1
           4     10-20 mi       $34.25           4
           3     10-20 mi       $26.25           2
           2      5-10 mi       $16.50           2
           3      5-10 mi       $16.00           1

Best combination: rider_count=5, distance=20+ mi -> avg_earnings=55.00 (n=2) [note: fewer than MIN_N trips in some groups]
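The distance bins in the cell above are left-closed because of `right=False`, which is why a 20.0-mile trip is reported under `20+ mi` rather than `10-20 mi`. The following sketch reuses the same `bins` and `labels` on a few hypothetical distances (chosen to sit on or near bin edges, not taken from the dataset) to show where boundary values land:

```python
import pandas as pd

bins = [0, 2, 5, 10, 20, float("inf")]
labels = ["0-2 mi", "2-5 mi", "5-10 mi", "10-20 mi", "20+ mi"]

# Hypothetical distances chosen to sit on or near bin edges.
distances = pd.Series([1.9, 2.0, 10.0, 19.9, 20.0])

# right=False makes each interval left-closed: [0, 2), [2, 5), ..., [20, inf).
binned = pd.cut(distances, bins=bins, labels=labels, right=False)

print(binned.astype(str).tolist())
```

Because the top-ranked groups here rest on one or two trips each, the ranking is best read as indicative rather than conclusive.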
