# Statistical Analysis - Automatidata project 

In this project, I applied descriptive statistics and hypothesis testing in Python to analyze experimental data and extract business insights. The focus is on evaluating whether the **method of payment (credit card vs. cash)** impacts the **fare amount** in the context of taxi rides.

**Objective** The main goal of this project is to demonstrate the ability to design, implement, and analyze an A/B test using Python. Specifically, the test evaluates whether the fare amount differs by payment method, and how that insight could help **increase revenue for taxi drivers**.

**Experimental Context** For the purpose of this analysis, the dataset is assumed to come from a randomized experiment where:
- Group A: Customers are required to pay with **credit card**
- Group B: Customers are required to pay with **cash**

This assumption of random assignment is what permits **causal conclusions** about how payment method affects fare amount; without it, the test can only show an association.
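As a minimal sketch of what random assignment could look like, the snippet below shuffles a list of ride IDs and splits it in half. The helper `assign_groups` and the ride IDs are hypothetical and not part of the dataset or the original analysis.

```python
import random

def assign_groups(ride_ids, seed=0):
    """Randomly split ride IDs into two groups (hypothetical helper)."""
    rng = random.Random(seed)          # seeded so the split is reproducible
    shuffled = list(ride_ids)
    rng.shuffle(shuffled)              # random order removes any systematic pattern
    half = len(shuffled) // 2
    return {"credit_card": shuffled[:half], "cash": shuffled[half:]}

# Ten made-up ride IDs, used only to make the sketch runnable
groups = assign_groups(range(10))
print(len(groups["credit_card"]), len(groups["cash"]))
```

Because each ride is equally likely to land in either group, trip length, time of day, and other characteristics should balance out on average between the two groups.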
  
*This activity has three parts:*

**Part 1:** Imports and data loading

**Part 2:** Conduct EDA and hypothesis testing

**Part 3:** Conclusion and Reflection

# **Conduct an A/B test**

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [2]:
import pandas as pd
from scipy import stats

In [None]:
# Load dataset into a dataframe
taxi_data = pd.read_csv("../Data/HR_dataset.csv")

# Display first few rows of the dataframe
taxi_data.head()

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
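The integer codes can be turned into readable labels before grouping. The sketch below assumes the mapping listed in the note; the small `sample` frame and the `payment_label` column are made up for illustration and stand in for `taxi_data`.

```python
import pandas as pd

# Code-to-label mapping taken from the note above
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute", 5: "Unknown"}

# Tiny made-up frame standing in for taxi_data
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 4],
    "fare_amount": [13.0, 9.5, 6.5, 10.0],
})

# Series.map looks up each integer code in the dictionary
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample["payment_label"].tolist())
```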



In [4]:
# descriptive stats code for EDA
taxi_data.describe(include='all')

| statistic | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22699.0 | 22699 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 | 22699.0 |
| unique | | 22687 | 22688 | | | | 2 | | | | | | | | | | |
| top | | 07/03/2017 3:45:19 PM | 10/18/2017 8:07:45 PM | | | | N | | | | | | | | | | |
| freq | | 2 | 2 | | | | 22600 | | | | | | | | | | |
| mean | 1.556236 | | | 1.642319 | 2.913313 | 1.043394 | | 162.412353 | 161.527997 | 1.336887 | 13.026629 | 0.333275 | 0.497445 | 1.835781 | 0.312542 | 0.299551 | 16.310502 |
| std | 0.496838 | | | 1.285231 | 3.653171 | 0.708391 | | 66.633373 | 70.139691 | 0.496211 | 13.243791 | 0.463097 | 0.039465 | 2.800626 | 1.399212 | 0.015673 | 16.097295 |
| min | 1.0 | | | 0.0 | 0.0 | 1.0 | | 1.0 | 1.0 | 1.0 | -120.0 | -1.0 | -0.5 | 0.0 | 0.0 | -0.3 | -120.3 |
| 25% | 1.0 | | | 1.0 | 0.99 | 1.0 | | 114.0 | 112.0 | 1.0 | 6.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 8.75 |
| 50% | 2.0 | | | 1.0 | 1.61 | 1.0 | | 162.0 | 162.0 | 1.0 | 9.5 | 0.0 | 0.5 | 1.35 | 0.0 | 0.3 | 11.8 |
| 75% | 2.0 | | | 2.0 | 3.06 | 1.0 | | 233.0 | 233.0 | 2.0 | 14.5 | 0.5 | 0.5 | 2.45 | 0.0 | 0.3 | 17.8 |


I'm interested in understanding whether the **type of payment** (credit card vs. cash) has an effect on the **fare amount** paid by the customer. 

A logical first step is to examine the **average fare amount for each payment type**. By comparing these averages, I can begin to explore whether there's an observable difference that might justify a formal hypothesis test.

In [None]:
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

Based on the average fare amounts by payment type, it appears that **customers who pay with a credit card tend to pay higher fares** than those who pay in cash. 

However, this observed difference could simply be due to **random variation** in the sample rather than a true underlying difference in the population.

To determine whether this difference is **statistically significant**, I will perform a **hypothesis test** comparing the two groups.
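Before testing, it can help to look at group sizes and spread alongside the means, since the t-test depends on all three. The sketch below is an illustration on a small made-up frame (`sample` is an assumption, not the real data) and restricts the comparison to payment types 1 and 2.

```python
import pandas as pd

# Made-up fares; payment_type 1 = credit card, 2 = cash, 3 = no charge
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2, 3],
    "fare_amount": [14.0, 9.0, 16.0, 8.0, 12.0, 10.0, 7.0],
})

# Keep only the two groups being compared
two_groups = sample[sample["payment_type"].isin([1, 2])]

# Count, mean, and sample standard deviation of the fare per payment type
summary = two_groups.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

The same `agg` call could be applied to `taxi_data` to see how many rides sit behind each average.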


### Task 3. Hypothesis testing




The next step is to conduct a **two-sample t-test** to determine whether the difference in average fare amounts between payment types is statistically significant.

The process follows these standard steps:

1. **State the hypotheses**
   - **Null hypothesis (H₀):** The average fare amount is the same for customers who pay by credit card and customers who pay in cash.
   - **Alternative hypothesis (H₁):** The average fare amount differs between customers who pay by credit card and customers who pay in cash.

2. **Choose a significance level (α)**  
   I will use a common significance level of **0.05**.

3. **Calculate the test statistic and p-value**  
   I will perform an independent two-sample t-test that does not assume equal variances (Welch's t-test).

4. **Make a decision**  
   Based on the p-value, I will decide whether to **reject** or **fail to reject** the null hypothesis.
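To make step 3 concrete, the sketch below computes the unequal-variance (Welch) t statistic and its Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The function `welch_t` and the two short lists are illustrative assumptions; the analysis itself relies on `scipy.stats.ttest_ind`.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error of each sample mean: sample variance / n
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    # Difference in means scaled by the combined standard error
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Degrees of freedom adjusted for unequal variances
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Made-up fares, only to make the sketch runnable
credit_card = [14.0, 9.0, 16.0, 13.0]
cash = [8.0, 12.0, 10.0, 10.0]
t_stat, dof = welch_t(credit_card, cash)
print(round(t_stat, 3), round(dof, 3))
```

The p-value is then the two-sided tail probability of a t distribution with `dof` degrees of freedom, which is the part SciPy handles.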


With the hypotheses and the 5% significance level (α = 0.05) fixed, I ran the two-sample t-test to compare the average fare amounts between the two payment groups.

In [8]:
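# Hypothesis test (A/B test) at a significance level of 0.05
alpha = 0.05

# Fare amounts for credit card (payment_type 1) and cash (payment_type 2) rides
credit_card = taxi_data.loc[taxi_data['payment_type'] == 1, 'fare_amount']
cash = taxi_data.loc[taxi_data['payment_type'] == 2, 'fare_amount']

# Welch's t-test: independent samples, unequal variances
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)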

Ttest_indResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12)

Since the **p-value is much smaller than the 5% significance level**, I **reject the null hypothesis**.

> 🔍 *Note:* The `e-12` suffix is scientific notation for ×10⁻¹², so the p-value is about 6.8 × 10⁻¹². A difference this large would be very unlikely if the two groups truly had the same average fare.

Therefore, I conclude that there is a **statistically significant difference** in the average fare amount between customers who pay with **credit card** and those who pay with **cash**.
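Statistical significance says little about the size of the difference, so a confidence interval is a useful companion to the p-value. The sketch below is a large-sample normal approximation, not the exact Welch interval. The helper `diff_ci` is hypothetical, and the data are simulated with arbitrary parameters rather than taken from the taxi dataset.

```python
import math
import random
import statistics

def diff_ci(a, b, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for mean(a) - mean(b)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    # Two-sided critical value, about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

# Simulated fares with arbitrary means and spread, for illustration only
rng = random.Random(42)
credit_card = [rng.gauss(13.4, 5.0) for _ in range(500)]
cash = [rng.gauss(12.2, 5.0) for _ in range(500)]

low, high = diff_ci(credit_card, cash)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```

With thousands of rides per group, as in this dataset, the normal approximation and the t-based interval should be nearly identical.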

### Task 4. Conclusion and Reflection

To wrap up the analysis, I reflected on the following key questions:

1. **What business insight(s) can I draw from the result of the hypothesis test?**

   The main insight is that **customers who pay with credit cards tend to pay higher fares**, about 1.22 more on average (13.43 vs. 12.21), and this difference is **statistically significant**. From a business perspective, and under the random-assignment assumption, this suggests that encouraging or facilitating credit card payments could help **increase revenue for taxi drivers**.

2. **What assumptions had to be made, and how realistic is this A/B test?**

   This project relies on the assumption that **customers were randomly assigned to pay either with credit or cash**, and that they complied with this requirement. In reality, the data was likely collected from naturally occurring behavior, not from a randomized experiment. To simulate an A/B test, I had to **assume random group assignment**, which may not reflect how payment decisions are actually made.

   Additionally, the dataset does not account for potential **confounding variables** or for reverse causality. For example, customers taking longer or more expensive trips might be more likely to pay with a credit card simply out of convenience. In that case, **fare amount may influence payment method**, not the other way around. This limitation is important to keep in mind when interpreting the results.
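One way to probe this concern is to compare fares within bands of similar trip distance. The sketch below uses a made-up frame in which the overall gap between payment types disappears once trips are grouped by distance. The `distance_band` column and its cut points are assumptions chosen for illustration, not results from the taxi data.

```python
import pandas as pd

# Made-up rides: credit card (1) rides skew longer than cash (2) rides
sample = pd.DataFrame({
    "payment_type":  [1, 1, 1, 1, 2, 2, 2, 2],
    "trip_distance": [0.8, 1.5, 4.0, 6.0, 0.9, 1.2, 1.8, 5.0],
    "fare_amount":   [6.0, 8.0, 16.0, 22.0, 6.0, 7.0, 8.0, 19.0],
})

# Overall average fare per payment type (ignores trip distance)
overall = sample.groupby("payment_type")["fare_amount"].mean()

# Bucket trips into short (0 to 2] and long (2 to 10] distance bands
sample["distance_band"] = pd.cut(
    sample["trip_distance"], bins=[0, 2, 10], labels=["short", "long"]
)

# Average fare per payment type within each distance band
by_band = (
    sample.groupby(["distance_band", "payment_type"], observed=True)["fare_amount"]
    .mean()
    .unstack()
)
print(overall)
print(by_band)
```

If a similar pattern appeared in the real data, the fare difference would be explained by trip length rather than by payment method.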
