# NYC Taxi Tipping Behavior

## 1. Research Question

Question 1: "Do credit-card trips tip more than cash?"

Question 2: "What predicts tip amount?"

In this study, “tip more” is defined primarily in terms of the absolute tip amount rather than tip as a percentage of the fare. The absolute tip amount directly reflects customer tipping behavior and is easier to interpret in practical terms. While tip percentage is related to the absolute tip, it is not the primary focus of this analysis.
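To make the distinction between the two measures concrete, the sketch below computes both for three rows taken from the January 2024 file. The helper `add_tip_measures` and the column name `tip_pct` are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd


def add_tip_measures(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the tip as a percentage of the fare (hypothetical `tip_pct`)."""
    out = trips.copy()
    # Non-positive fares become missing so the division cannot blow up.
    valid_fare = out["fare_amount"].where(out["fare_amount"] > 0)
    out["tip_pct"] = 100 * out["tip_amount"] / valid_fare
    return out


# Three rows copied from the January 2024 data
sample = pd.DataFrame(
    {
        "fare_amount": [17.7, 10.0, 23.3],
        "tip_amount": [0.0, 3.75, 3.0],
    }
)

print(add_tip_measures(sample))
```

The $3.75 and $3.00 tips are close in absolute terms, yet they correspond to 37.5% and roughly 13% of their fares. This is why the choice of measure has to be fixed before any comparison is made.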

The analysis is intended to capture typical tipping behavior rather than extreme or unusual cases. Trips involving unusually long distances or atypical travel patterns may naturally result in higher tips due to higher fares, but these cases are not the main focus of this study. Instead, the goal is to understand average tipping behavior across standard taxi trips and to compare how tipping differs by payment method and trip characteristics.
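One way this comparison could look in code is sketched below. Several pieces are assumptions rather than the method used in this study: the payment codes (1 for credit card, 2 for cash) follow the TLC data dictionary, the 99th-percentile distance cutoff is an arbitrary stand-in for "typical" trips, and the helper name `typical_tip_by_payment` is hypothetical. One point worth checking against the TLC data dictionary: `tip_amount` is documented as being populated for credit-card tips, with cash tips not included, which affects how a cash average should be read.

```python
import pandas as pd

# Assumed mapping, following the TLC data dictionary codes
PAYMENT_LABELS = {1: "credit_card", 2: "cash"}


def typical_tip_by_payment(trips: pd.DataFrame, distance_quantile: float = 0.99) -> pd.DataFrame:
    """Summarize tips by payment method after dropping the longest trips."""
    # Keep only credit-card and cash trips
    subset = trips[trips["payment_type"].isin(list(PAYMENT_LABELS))].copy()
    # Illustrative cutoff: discard trips above the chosen distance quantile
    cutoff = subset["trip_distance"].quantile(distance_quantile)
    subset = subset[subset["trip_distance"] <= cutoff]
    subset["payment"] = subset["payment_type"].map(PAYMENT_LABELS)
    return subset.groupby("payment")["tip_amount"].agg(["count", "mean", "median"])


# Five rows copied from the January 2024 data
preview = pd.DataFrame(
    {
        "payment_type": [2, 1, 1, 1, 1],
        "trip_distance": [1.72, 1.8, 4.7, 1.4, 0.8],
        "tip_amount": [0.0, 3.75, 3.0, 2.0, 3.2],
    }
)

print(typical_tip_by_payment(preview))
```

Reporting the median next to the mean gives a quick check on whether a few large tips are driving the average.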

## 2. Dataset Description

Dataset: NYC TLC Trip Record Data (yellow cabs) (PARQUET)

The dataset used in this study is the NYC TLC Yellow Taxi Trip Record Data, obtained from the official NYC government website (nyc.gov). 

The data are provided in Parquet format and include detailed trip-level information for yellow taxi rides in NYC.

Because the data are published with a two-month delay, the most recent year of data is incomplete.

Each row represents a single trip.

A single monthly file contains roughly 3 million observations (2,964,624 in January 2024) and 19 variables. Key variables relevant to this analysis include the following:

- Pick-Up and Drop-Off Time
- Passenger Count
- Trip Distance
- Rate Code
- Payment Type
- Fare Amount
- Total Amount
- Tip Amount
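The variables listed above correspond to the following column names in the Parquet file. A minimal sketch, assuming only these columns are needed, reads them selectively to reduce memory use. The constant `KEY_COLUMNS` and the helper `load_key_columns` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Column names as they appear in the yellow-taxi Parquet schema
KEY_COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "RatecodeID",
    "payment_type",
    "fare_amount",
    "total_amount",
    "tip_amount",
]


def load_key_columns(path: str) -> pd.DataFrame:
    """Read only the analysis columns from one monthly Parquet file."""
    return pd.read_parquet(path, columns=KEY_COLUMNS)


# Example (not run here):
# trips = load_key_columns("yellow_tripdata_2024-01.parquet")
```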

Due to the large size of the NYC TLC trip records (approximately 3–4 million observations per month), loading an entire year of data simultaneously can be computationally inefficient. To balance statistical robustness with practical constraints, the analysis is conducted on a month-by-month basis for the 2024 calendar year. Each month is processed using the same data cleaning and analysis pipeline, and the results are then combined to produce year-level summaries. This approach allows for scalable analysis while preserving a sufficiently large and representative sample.
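The month-by-month approach could be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the file names are assumed to follow the pattern of the January file, and only tip counts and sums by payment type are aggregated.

```python
import pandas as pd


def summarize_month(trips: pd.DataFrame) -> pd.DataFrame:
    """Per-payment-type trip counts and tip sums for one month."""
    return trips.groupby("payment_type")["tip_amount"].agg(n_trips="count", tip_sum="sum")


def combine_months(monthly: list[pd.DataFrame]) -> pd.DataFrame:
    """Add up monthly counts and sums, then derive the year-level mean tip."""
    total = pd.concat(monthly).groupby(level=0).sum()
    total["mean_tip"] = total["tip_sum"] / total["n_trips"]
    return total


def summarize_year(year: int = 2024) -> pd.DataFrame:
    """Process one file at a time so only a single month is in memory."""
    # Assumed naming pattern, e.g. yellow_tripdata_2024-01.parquet
    paths = [f"yellow_tripdata_{year}-{month:02d}.parquet" for month in range(1, 13)]
    return combine_months(
        [
            summarize_month(pd.read_parquet(path, columns=["payment_type", "tip_amount"]))
            for path in paths
        ]
    )


# Small in-memory example with two made-up "months"
jan = pd.DataFrame({"payment_type": [1, 1, 2], "tip_amount": [2.0, 4.0, 0.0]})
feb = pd.DataFrame({"payment_type": [1, 2], "tip_amount": [3.0, 1.0]})
print(combine_months([summarize_month(jan), summarize_month(feb)]))
```

Carrying counts and sums forward, rather than monthly means, keeps the year-level mean weighted by the number of trips in each month. Statistics such as the median cannot be combined this way and would need a different strategy.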

In [3]:
!pip install pyarrow fastparquet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
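import pandas as pd

# Load the January 2024 yellow-taxi trips (requires a Parquet engine such as pyarrow)
df = pd.read_parquet("yellow_tripdata_2024-01.parquet")

# Inspect the column dtypes, then preview the first five rows
df.info()
df.head()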

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2964624 entries, 0 to 2964623
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int32         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int32         
 8   DOLocationID           int32         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  Airport_fee           

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2024-01-01 00:57:55 | 2024-01-01 01:17:43 | 1.0 | 1.72 | 1.0 | N | 186 | 79 | 2 | 17.7 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 22.7 | 2.5 | 0.0 |
| 1 | 1 | 2024-01-01 00:03:00 | 2024-01-01 00:09:36 | 1.0 | 1.8 | 1.0 | N | 140 | 236 | 1 | 10.0 | 3.5 | 0.5 | 3.75 | 0.0 | 1.0 | 18.75 | 2.5 | 0.0 |
| 2 | 1 | 2024-01-01 00:17:06 | 2024-01-01 00:35:01 | 1.0 | 4.7 | 1.0 | N | 236 | 79 | 1 | 23.3 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 31.3 | 2.5 | 0.0 |
| 3 | 1 | 2024-01-01 00:36:38 | 2024-01-01 00:44:56 | 1.0 | 1.4 | 1.0 | N | 79 | 211 | 1 | 10.0 | 3.5 | 0.5 | 2.0 | 0.0 | 1.0 | 17.0 | 2.5 | 0.0 |
| 4 | 1 | 2024-01-01 00:46:51 | 2024-01-01 00:52:57 | 1.0 | 0.8 | 1.0 | N | 211 | 148 | 1 | 7.9 | 3.5 | 0.5 | 3.2 | 0.0 | 1.0 | 16.1 | 2.5 | 0.0 |



## 3. Data Cleaning & Processing

## 4. Exploratory Analysis

## 5. Statistical Inference

## 6. Regression Analysis

## 7. Conclusions & Limitations