# NYC Taxi Tipping Behavior

## 1. Research Question

Question 1: "Do credit-card trips tip more than cash?"

Result/limitation: In this dataset, `tip_amount` is effectively only observed for credit-card trips, so a credit-vs-cash tipping comparison using this field would measure recording differences, not true tipping behavior.

**Decision: Subsequent analyses model tipping outcomes using credit-card trips only.**

Question 2: "What predicts tip amount?"

Question 3: "Among card-paid trips, does the probability of leaving no tip (`tip_amount` = 0) differ between short and long trips (e.g., ≤ 5 miles vs > 5 miles)?"

In this study, “tip more” is defined primarily in terms of the absolute tip amount rather than tip as a percentage of the fare. The absolute tip amount directly reflects customer tipping behavior and is easier to interpret in practical terms. While tip percentage is related to the absolute tip, it is not the primary focus of this analysis.

The analysis is intended to capture typical tipping behavior rather than extreme or unusual cases. Trips involving unusually long distances or atypical travel patterns may naturally result in higher tips due to higher fares, but these cases are not the main focus of this study. Instead, the goal is to understand average tipping behavior across standard taxi trips and to examine how tipping differs by trip characteristics, with the payment-method comparison subject to the recording limitation noted under Question 1.

## 2. Dataset Description

Dataset: NYC TLC Trip Record Data (yellow cabs) (PARQUET)

The dataset used in this study is the NYC TLC Yellow Taxi Trip Record Data, obtained from the official NYC government website (nyc.gov). 

The data are provided in Parquet format and include detailed trip-level information for yellow taxi rides in NYC.

Since the data is uploaded with a two-month delay, the most recent year of data is incomplete. 

Each row represents a single trip.

The January 2024 file loaded below contains approximately 3 million observations (2,964,624 rows) and 19 variables. Key variables relevant to this analysis include the following:

- Pick-Up and Drop-Off Time
- Passenger Count
- Trip Distance
- Rate Code
- Payment Type
- Fare Amount
- Total Amount
- Tip Amount

Due to the large size of the NYC TLC trip records (approximately 3–4 million observations per month), loading an entire year of data simultaneously can be computationally inefficient. To balance statistical robustness with practical constraints, the analysis is conducted on a month-by-month basis for the 2024 calendar year. Each month is processed using the same data cleaning and analysis pipeline, and the results are then combined to produce year-level summaries. This approach allows for scalable analysis while preserving a sufficiently large and representative sample.
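The month-by-month approach can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the helper names `monthly_summary` and `combine_months` are hypothetical, and the toy frames stand in for cleaned monthly data. Each month is reduced to counts and sums, which can be added across months to give year-level figures that match what a single pass over the full year would produce.

```python
import pandas as pd


def monthly_summary(df: pd.DataFrame, label: str) -> dict:
    """Reduce one month of (already cleaned) trips to additive counts and sums."""
    return {
        "month": label,
        "n_trips": len(df),
        "tip_sum": float(df["tip_amount"].sum()),
        "n_zero_tip": int((df["tip_amount"] == 0).sum()),
    }


def combine_months(summaries: list) -> dict:
    """Year-level figures from per-month counts and sums.

    Dividing total sums by total counts weights each month by its size,
    unlike a plain average of the twelve monthly means.
    """
    table = pd.DataFrame(summaries)
    n = int(table["n_trips"].sum())
    return {
        "n_trips": n,
        "mean_tip": float(table["tip_sum"].sum()) / n,
        "zero_tip_rate": float(table["n_zero_tip"].sum()) / n,
    }


# Toy stand-ins for two cleaned months (illustrative values only).
jan = pd.DataFrame({"tip_amount": [0.0, 2.0, 4.0]})
feb = pd.DataFrame({"tip_amount": [3.0]})

year = combine_months([
    monthly_summary(jan, "2024-01"),
    monthly_summary(feb, "2024-02"),
])
print(year)
```

With real data, the loop would read one Parquet file at a time (for example `yellow_tripdata_2024-01.parquet`, then the following months, assuming they follow the same naming pattern), clean it, append its summary, and discard the frame before loading the next month.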

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_parquet("yellow_tripdata_2024-01.parquet")

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2964624 entries, 0 to 2964623
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int32         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int32         
 8   DOLocationID           int32         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  Airport_fee           

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2024-01-01 00:57:55 | 2024-01-01 01:17:43 | 1.0 | 1.72 | 1.0 | N | 186 | 79 | 2 | 17.7 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 22.7 | 2.5 | 0.0 |
| 1 | 1 | 2024-01-01 00:03:00 | 2024-01-01 00:09:36 | 1.0 | 1.8 | 1.0 | N | 140 | 236 | 1 | 10.0 | 3.5 | 0.5 | 3.75 | 0.0 | 1.0 | 18.75 | 2.5 | 0.0 |
| 2 | 1 | 2024-01-01 00:17:06 | 2024-01-01 00:35:01 | 1.0 | 4.7 | 1.0 | N | 236 | 79 | 1 | 23.3 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 31.3 | 2.5 | 0.0 |
| 3 | 1 | 2024-01-01 00:36:38 | 2024-01-01 00:44:56 | 1.0 | 1.4 | 1.0 | N | 79 | 211 | 1 | 10.0 | 3.5 | 0.5 | 2.0 | 0.0 | 1.0 | 17.0 | 2.5 | 0.0 |
| 4 | 1 | 2024-01-01 00:46:51 | 2024-01-01 00:52:57 | 1.0 | 0.8 | 1.0 | N | 211 | 148 | 1 | 7.9 | 3.5 | 0.5 | 3.2 | 0.0 | 1.0 | 16.1 | 2.5 | 0.0 |



## 3. Data Cleaning & Processing

- What is my outcome variable?
    - tip amount

- What variables are predictors?
    - payment_type: Q1 - two-sample inference
    - trip_distance: Q2 - regression

- What variables help me filter invalid trips?
    - trip_distance missing
    - payment_type missing - cannot classify - problematic
    - fare amount or pickup/drop-off time missing or negative - logically impossible


The primary outcome variable in this study is the absolute tip amount (`tip_amount`). Payment type (`payment_type`) is included as a key explanatory variable for comparing tipping behavior between credit card and cash transactions. In addition, trip distance (`trip_distance`) is included as a continuous predictor to examine whether trip characteristics are associated with tip amounts.

Initial Validity checks

In [None]:
df.describe()

Suspicious Extreme Values
- Trip distance = 312,000
- Fare/Total = $5,0000
- Negative Components

In [None]:
df[
    (df["trip_distance"] <= 0) |
    (df["fare_amount"] <= 0) |
    (df["tip_amount"] < 0)
]

1. Trip Distance $\le$ 0
2. Fare Amount $\le$ 0
3. Tip amount $\lt$ 0

These records do not represent completed, positive-fare taxi trips and are therefore not meaningful for analyzing tipping behavior.



#### Cleaning Rules

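The following records are excluded from the analysis:
1. Trips with non-positive fare amounts, as these do not represent completed, metered taxi trips.
2. Trips with non-positive trip distances, which are physically implausible.
3. Trips with negative tip amounts, which reflect adjustments or refunds rather than tipping behavior.
4. Trips with payment types other than credit card (`payment_type` = 1) or cash (`payment_type` = 2), as these do not represent voluntary payment decisions relevant to the research question.
5. Trips whose pickup time falls outside the month being processed (for the January file, before 2024-01-01 or on or after 2024-02-01).
6. Trips whose pickup time is not strictly earlier than the drop-off time, which implies a zero or negative duration.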

These rules are applied uniformly across all months of data.
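Before filtering, it can help to see how many rows each rule would remove on its own. The sketch below is a hypothetical diagnostic covering the fare, distance, tip, and payment-type rules. The helper name `rule_failures` and the toy data are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd


def rule_failures(df: pd.DataFrame) -> pd.Series:
    """Count rows violating each rule; one row can violate several rules."""
    rules = {
        "fare_amount <= 0": df["fare_amount"] <= 0,
        "trip_distance <= 0": df["trip_distance"] <= 0,
        "tip_amount < 0": df["tip_amount"] < 0,
        "payment_type not in {1, 2}": ~df["payment_type"].isin([1, 2]),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in rules.items()})


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "fare_amount": [10.0, -3.0, 8.0, 0.0],
    "trip_distance": [1.2, 0.5, 0.0, 2.0],
    "tip_amount": [2.0, -1.0, 0.0, 0.0],
    "payment_type": [1, 4, 2, 1],
})

failures = rule_failures(toy)
print(failures)
```

Because the counts overlap, they will generally sum to more than the total number of rows removed. For the three numeric rules, a missing value compares as False, so rows with missing fields are not counted by this diagnostic even though a keep-filter written with `> 0` or `>= 0` would drop them.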

In [None]:
df_raw = df.copy()
df_clean = df_raw.copy()

start = pd.Timestamp("2024-01-01")
end = pd.Timestamp("2024-02-01")

df_clean = df_clean[
    (df_clean["fare_amount"] > 0) &
    (df_clean["trip_distance"] > 0) &
    (df_clean["tip_amount"] >= 0) &
    (df_clean["payment_type"].isin([1, 2])) &
    (df_clean["tpep_pickup_datetime"] >= start) &
    (df_clean["tpep_pickup_datetime"] < end) &
    (df_clean["tpep_pickup_datetime"] < df_clean["tpep_dropoff_datetime"])
].copy()  # explicit copy, so adding derived columns later does not raise SettingWithCopyWarning

print(f"Rows before cleaning: {len(df_raw)}")
print(f"Rows after cleaning: {len(df_clean)}")
print(f"Rows removed: {len(df_raw) - len(df_clean)}")

#### Derived Variables
- Trip duration: Dropoff time - Pickup time (minutes)
    - Validates physical plausibility
    - Potential control variable

- Tip percentage: Tip amount / Fare amount × 100 (fare > 0 only)
    - Normalizes tipping across fares
    - Used for robustness checks

- Credit vs cash indicator (`is_credit`): 1 = credit card, 0 = cash
    - Simplifies group comparisons
    - Aligns with research question
    

In [None]:
df_clean["trip_duration_min"] = (
    df_clean["tpep_dropoff_datetime"] - df_clean["tpep_pickup_datetime"]
).dt.total_seconds() / 60

df_clean["tip_percentage"] = (
    df_clean["tip_amount"] / df_clean["fare_amount"] * 100
)

df_clean["is_credit"] = (df_clean["payment_type"] == 1).astype(int)

df_clean

In [None]:
df_clean[df_clean["tip_percentage"] > 2000].iloc[0]


## 4. Exploratory Analysis

### Sample Overview

In [None]:
df_clean.shape

In [None]:
df_clean["payment_type"].value_counts()

After applying the cleaning rules, 2,721,070 rows remain for January 2024, which is 91.78% of the raw data. The remaining trips are limited to two payment types: 1 (credit card) and 2 (cash).

### Distribution of key variables

- `tip_amount` (primary outcome)

In [None]:
df_clean["tip_amount"].describe()

In [None]:
plt.figure()
plt.hist(df_clean["tip_amount"] + 0.01, bins=50, log=True)
plt.xlabel("Tip Amount ($)")
plt.ylabel("Frequency")
plt.title("Distribution of Tip Amount (Log Scale)")
plt.show()

`tip_amount` is heavily right-skewed. Most observations fall between 0 and 100 dollars, with counts dropping sharply above 100 dollars. Tips in the 100 to 200 dollar range may be plausible for high-fare trips, but tips above about 300 to 400 dollars appear unusually large and will be investigated as potential anomalies.

In [None]:
df_clean["tip_percentage"].describe()

Half of all trips had a tip of about 25% of the fare or less (the median), which suggests that ~25% is a typical tip in the data and is broadly consistent with common tipping ranges.

The maximum tip percentage is extremely large (about 1,400,000%), which is clearly not realistic. Extreme outliers like this inflate the standard deviation (`sd = 882.85`) and distort the summary statistics, so the distribution looks far more spread out than typical tipping behavior would suggest.

For readability, the histogram is truncated using an IQR-based upper bound `(Q3 + 1.5 * IQR)`. Values above this bound are retained in the dataset but not shown in the plot.

In [None]:
q1 = df_clean["tip_percentage"].quantile(0.25)
q3 = df_clean["tip_percentage"].quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
upper_bound

In [None]:
plt.figure()
plt.hist(df_clean.loc[df_clean["tip_percentage"] <= upper_bound, "tip_percentage"] + 0.01, bins=50, log=True)
plt.xlabel("Tip Percentage (%)")
plt.ylabel("Frequency")
plt.title("Distribution of Tip Percentage (Log Scale)")
plt.show()

In [None]:
df_clean[
    (df_clean["payment_type"] == 2) &
    (df_clean["tip_amount"] != 0)
]

In [None]:
print((
    (df_clean["payment_type"] == 1) & (df_clean["tip_amount"] == 0)
).sum() / (df_clean["payment_type"] == 1).sum()*100)
print((
    (df_clean["payment_type"] == 2) & (df_clean["tip_amount"] == 0)
).sum() / (df_clean["payment_type"] == 2).sum()*100)

`payment_type == 1`: Credit Card

`payment_type == 2`: Cash

- No tip when paid by credit card: 4.68%
- No tip when paid by cash: 99.99%

### Recording Limitation: tips for cash payments
- In the dataset, `tip_amount` is effectively observed only for credit-card trips.
- Cash trips show a recorded tip of 0 in ~99.99% of rows, meaning cash tips are likely not captured in this field.
- Therefore, comparing "tipping behavior" across payment types using `tip_amount` would reflect data recording, not true behavior.
- From this point onward, tipping models focus on credit-card trips only.

In [None]:
df_card = df_clean[df_clean["is_credit"] == 1]
df_card["payment_type"].value_counts()

After excluding cash trips, 2,298,339 rows remain for January 2024. The working dataset now contains credit-card trips only.

In [None]:
plt.figure()
plt.hist(df_card["tip_amount"] + 0.01, bins=50, log=True)
plt.xlabel("Tip Amount ($)")
plt.ylabel("Frequency")
plt.title("Distribution of Tip Amount - Credit Card (Log Scale)")
plt.show()

In [None]:
q1 = df_card["tip_percentage"].quantile(0.25)
q3 = df_card["tip_percentage"].quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
upper_bound

In [None]:
plt.figure()
plt.hist(df_card.loc[df_card["tip_percentage"] <= upper_bound, "tip_percentage"] + 0.01, bins=50, log=True)
plt.xlabel("Tip Percentage (%)")
plt.ylabel("Frequency")
plt.title("Distribution of Tip Percentage - Credit Card (Log Scale)")
plt.show()

In [None]:
df_plot = df_card.sample(n=50000, random_state=42)

x_max = df_card["trip_distance"].quantile(0.99)
y_max = df_card["tip_amount"].quantile(0.99)

plt.figure(figsize=(8,5))
plt.scatter(df_plot["trip_distance"], df_plot["tip_amount"], alpha=0.05, s=5)
plt.xlim(0, x_max)
plt.ylim(0, y_max)
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Tip Amount ($)")
plt.title("Trip Distance vs Tip Amount - Credit Card (sampled scatter)")
plt.show()

In [None]:
x_max = df_card["fare_amount"].quantile(0.99)
y_max = df_card["tip_amount"].quantile(0.99)

plt.figure(figsize=(8,5))
plt.scatter(df_plot["fare_amount"], df_plot["tip_amount"], alpha=0.05, s=5)
plt.xlim(0, x_max)
plt.ylim(0, y_max)
plt.xlabel("Fare Amount ($)")
plt.ylabel("Tip Amount ($)")
plt.title("Fare Amount vs Tip Amount - Credit Card (sampled scatter)")
plt.show()

In [None]:
df_card[["trip_distance","tip_amount"]].corr()

In [None]:
df_card[["trip_distance","tip_amount"]].corr(method="spearman")

In [None]:
x_max = df_card["trip_distance"].quantile(0.99)

plt.figure(figsize=(8,5))
plt.scatter(df_plot["trip_distance"], df_plot["tip_percentage"], alpha=0.05, s=5)
plt.xlim(0, x_max)
plt.ylim(0, 50)
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Tip Percentage (%)")
plt.title("Trip Distance vs Tip Percentage (sampled scatter)")
plt.show()

Using a 50,000-row random sample, tip percentage is plotted against trip distance. The x-axis is capped at the 99th percentile of distance and the y-axis at 0-50% to focus on typical ranges and reduce the influence of extreme values.

Tip percentage concentrates around ~20-30% across most trip distances. Variability is higher for very short trips, likely because small fares make the percentage sensitive to fixed-dollar tips.

Overall, no clear monotonic trend between distance and tip percentage is visible in this plot.

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(df_plot["trip_distance"], df_plot["tip_percentage"], alpha=0.05, s=5)
plt.xlim(0, 5)
plt.ylim(0, 50)
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Tip Percentage (%)")
plt.title("Trip Distance vs Tip % (0–5 miles)")
plt.show()

In [None]:
df_long = df_plot[df_plot["trip_distance"] >= 5]

plt.figure(figsize=(8,5))
plt.scatter(df_long["trip_distance"], df_long["tip_percentage"], alpha=0.05, s=5)
plt.xlim(5, 20)
plt.ylim(0, 50)
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Tip Percentage (%)")
plt.title("Trip Distance vs Tip % (5–20 miles)")
plt.show()

To avoid the dense concentration of short trips overwhelming the visualization, tip percentage vs. trip distance is shown in two ranges: 0–5 miles (short trips) and 5–20 miles (longer trips). 

In the 0–5 mile range, tip percentage has much higher dispersion, including many high tip% cases, consistent with small denominators (fare) making tip% more variable on short trips.

In the 5–20 mile range, tip percentage becomes more tightly concentrated around ~20–25%, with fewer extreme high tip% values and a generally stable band across distance. 

Overall, the split view suggests distance is not strongly associated with higher tip percentage, but short trips exhibit greater variability in tip% than longer trips.

In [None]:
df_card[["trip_distance","tip_percentage"]].corr()

In [None]:
df_card[["trip_distance","tip_percentage"]].corr(method="spearman")

Pearson correlation between trip distance and tip percentage is near zero (little linear relationship), while Spearman correlation is moderately negative, suggesting a monotonic tendency for tip% to decrease as distance increases. 

This likely reflects discrete tipping behavior (common preset tip percentages and fixed-dollar tips) and fare-size effects rather than a smooth linear association.

In [None]:
distance_cutoff = df_card["trip_distance"].quantile(0.99)

df_plot["distance_bin"] = pd.cut(
    df_plot["trip_distance"],
    bins=np.arange(0, distance_cutoff + 1, 5)
)

binned = (
    df_plot
    .groupby("distance_bin", observed=True)["tip_percentage"]
    .mean()
    .reset_index()
)

binned["midpoint"] = binned["distance_bin"].apply(
    lambda x: x.mid
)


plt.figure(figsize=(8,5))

plt.plot(binned["midpoint"],
         binned["tip_percentage"],
         linewidth=3)

plt.xlabel("Trip Distance (miles)")
plt.ylabel("Average Tip Percentage (%)")
plt.title("Binned Average Tip % by Trip Distance")

plt.show()

Trip distance was binned into 5-mile intervals to reduce noise from heavy-tailed tip behavior and to make the average tip percentage trend easier to interpret across distance ranges. The binned curve shows a gradual downward pattern: as trip distance increases, the mean tip percentage tends to decrease slightly. This visual trend is directionally consistent with the negative Spearman correlation observed earlier, suggesting a weak monotonic relationship rather than a strong linear effect.

However, because bin counts were not incorporated here, the stability of the mean tip% at longer distances may vary depending on how many trips fall into each bin. Therefore, this plot should be interpreted as a descriptive summary of the overall pattern, not as evidence of a precise effect size or a reliable trend in sparsely populated distance ranges.
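One way to address the bin-count caveat is to report, for each distance bin, the number of trips and the standard error of the mean alongside the mean itself. The sketch below is illustrative only: the helper name `binned_tip_pct`, the `min_count` threshold, and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def binned_tip_pct(df: pd.DataFrame, width: float = 5.0,
                   max_distance: float = 20.0, min_count: int = 30) -> pd.DataFrame:
    """Mean tip percentage per distance bin, with bin size and standard error."""
    edges = np.arange(0, max_distance + width, width)
    bins = pd.cut(df["trip_distance"], bins=edges).rename("distance_bin")
    out = (
        df.groupby(bins, observed=True)["tip_percentage"]
        .agg(mean="mean", n="size", sem="sem")
        .reset_index()
    )
    # Flag bins with too few trips for the mean to be considered stable.
    out["sparse"] = out["n"] < min_count
    return out


# Toy data (illustrative values only).
toy = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 7.0, 8.0, 12.0],
    "tip_percentage": [20.0, 30.0, 25.0, 22.0, 18.0, 15.0],
})

summary = binned_tip_pct(toy, min_count=2)
print(summary)
```

Plotting the mean with error bars of about two standard errors, or simply dropping bins flagged as `sparse`, would make it clearer where the downward pattern is well supported and where it rests on only a few trips.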

## 5. Statistical Inference (Card-only)

RQ3: Among card-paid trips, does the probability of leaving no tip (`tip_amount` = 0) differ between short and long trips (e.g., ≤ 5 miles vs > 5 miles)?

In [None]:
cutoff = 5

df_card = df_card.copy()
df_card["distance_group"] = pd.cut(
    df_card["trip_distance"],
    bins=[0, cutoff, np.inf],
    labels=["short", "long"],
    include_lowest=True
)

df_card["distance_group"].value_counts()

In [None]:
df_card["is_zero_tip"] = (df_card["tip_amount"] == 0).astype(int)
df_card["is_zero_tip"].value_counts()

### 5.4 Test A: Zero-tip rate difference (two-proportion z-test)

In [None]:
dfA = df_card.loc[:, ["trip_distance", "tip_amount"]].copy()

dfA["is_short"] = dfA["trip_distance"] <= 5
dfA["is_zero_tip"] = dfA["tip_amount"] == 0

group_stats = dfA.groupby("is_short")["is_zero_tip"].agg(
    n="size",
    zero="sum",
    p="mean"
)

group_stats.assign(p_pct=group_stats["p"]*100)

#### Test Hypothesis
Let $p_S$ be the proportion of zero-tip trips among short trips and $p_L$ among long trips.
- Null hypothesis ($H_0$): $p_S = p_L$
- Alternative hypothesis ($H_1$): $p_S \neq p_L$
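For reference, the textbook pooled two-proportion z statistic behind this test is $z = (\hat p_S - \hat p_L) / \sqrt{\hat p (1 - \hat p)(1/n_S + 1/n_L)}$, where $\hat p$ is the pooled zero-tip proportion. The sketch below computes it by hand using only the standard library. The function name `two_proportion_ztest` and the toy counts are hypothetical, and it is meant as a cross-check rather than a replacement for a library routine.

```python
import math


def two_proportion_ztest(count1: int, n1: int, count2: int, n2: int):
    """Pooled two-proportion z-test with a two-sided normal p-value."""
    p1, p2 = count1 / n1, count2 / n2
    p_pool = (count1 + count2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy counts (illustrative only): 30 zero tips in 1,000 short trips
# versus 60 zero tips in 1,000 long trips.
z, p = two_proportion_ztest(30, 1000, 60, 1000)
print(round(z, 3), p)
```

With millions of trips per group, |z| can become so large that the p-value underflows to exactly 0.0 in floating point. In that case it is more informative to report it as below a small threshold (for example $p < 0.001$) and to focus on effect sizes.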

In [None]:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([
    group_stats.loc[True, "zero"],
    group_stats.loc[False, "zero"]
])

nobs = np.array([
    group_stats.loc[True, "n"],
    group_stats.loc[False, "n"]
])

z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")
z_stat, p_value

In [None]:
p_short = group_stats.loc[True, "p"]
p_long  = group_stats.loc[False, "p"]

rd = p_short - p_long
rd

In [None]:
a = count[0]                  # short, zero
b = nobs[0] - count[0]         # short, non-zero
c = count[1]                  # long, zero
d = nobs[1] - count[1]         # long, non-zero

or_ = (a/b) / (c/d)
or_

In [None]:
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_high = confint_proportions_2indep(
    count1=count[0], nobs1=nobs[0],
    count2=count[1], nobs2=nobs[1],
    method="wald"
)
ci_low, ci_high

#### Test Result Summary

- The proportion of zero tips was compared between short trips (≤5 miles) and long trips (>5 miles) using a two-proportion z-test.
- The observed zero-tip rates were $p_S$ = 0.0373 and $p_L$ = 0.0990.
- The difference in proportions (risk difference) was $RD = p_S - p_L = -0.0617$ (95% CI: [-0.0627, -0.0606]).
- The z-test returned $z$ = -160.1328, with a p-value that underflows to 0.0 in floating point (that is, $p < 0.001$).
- The odds ratio for leaving no tip (short vs. long trips) was $OR$ = 0.353, meaning the odds of a zero tip on short trips were about 65% lower than on long trips.
- Given the very large sample size, we emphasize the magnitude of RD/OR rather than the p-value alone.
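Since the summary emphasizes effect sizes, it may be useful to attach an interval to the odds ratio as well as to the risk difference. The sketch below uses the standard Wald interval for RD and the log-odds (Woolf) interval for OR, computed with the standard library only. The function names and the toy counts are hypothetical and are not the January 2024 counts.

```python
import math


def risk_difference_ci(zero1: int, n1: int, zero2: int, n2: int, z_crit: float = 1.96):
    """Wald confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = zero1 / n1, zero2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z_crit * se, rd + z_crit * se


def odds_ratio_ci(a: int, b: int, c: int, d: int, z_crit: float = 1.96):
    """Odds ratio (a/b)/(c/d) with a CI built on the log-odds scale.

    a, b = zero / non-zero tips in group 1; c, d = the same in group 2.
    All four cells must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z_crit * se_log)
    high = math.exp(math.log(odds_ratio) + z_crit * se_log)
    return odds_ratio, low, high


# Toy counts (illustrative only): 30/1,000 zero tips (short) vs 60/1,000 (long).
rd, rd_low, rd_high = risk_difference_ci(30, 1000, 60, 1000)
or_hat, or_low, or_high = odds_ratio_ci(30, 970, 60, 940)
print(rd, rd_low, rd_high)
print(or_hat, or_low, or_high)
```

An OR interval that lies entirely below 1, together with an RD interval entirely below 0, would indicate that zero tips are less common on short trips. With samples this large the intervals will be narrow, so the practical size of the difference matters more than whether it excludes the null value.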

### 5.5 Test B: Tip amount difference among tippers (two-sample t-test / Welch)
### 5.6 Effect sizes + confidence intervals
### 5.7 Summary (interpretation in plain academic English)

## 6. Regression Analysis

## 7. Conclusions & Limitations