# Statistical Analysis:
# Is there a relationship between fare amount and payment type?

**The purpose** of this project is to prepare, create, and analyze A/B tests. Its results should help identify ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups:
1) customers who are required to pay with credit card,
2) customers who are required to pay with cash.

Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. Specifically, this A/B test uses sample data to analyze whether there is a relationship between payment type and fare amount, for example, whether customers who use credit cards pay higher fare amounts than customers who use cash.
  

# Conduct an A/B test

### Imports and data loading

In [1]:
import pandas as pd
from scipy import stats



In [2]:
# Load the trip data; the first CSV column is used as the index
df = pd.read_csv("/kaggle/input/dataset/2017_Yellow_Taxi_Trip_Data.csv", index_col=0)

# Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

**Note:** In the dataset, `payment_type` is encoded as integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
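
The integer codes above can be hard to read in grouped output. A minimal sketch, assuming a small made-up frame in place of the real trip data, shows one way to attach readable labels with `Series.map`. The names `PAYMENT_LABELS`, `toy`, and `payment_label` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the note above.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute", 5: "Unknown"}

# Tiny made-up frame standing in for the real trip data (illustrative values only).
toy = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 4],
    "fare_amount": [12.0, 8.5, 20.0, 6.0, 9.0],
})

# Add a readable column without overwriting the original integer codes.
toy["payment_label"] = toy["payment_type"].map(PAYMENT_LABELS)

# Count how many trips fall under each label.
label_counts = toy["payment_label"].value_counts()
print(label_counts)
```

Keeping the original `payment_type` column intact means later filters such as `payment_type == 1` continue to work unchanged.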

In [3]:
# Summary statistics for every column, numeric and non-numeric
df.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [4]:
df.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

>Based on the averages, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay in cash.


However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, conduct a hypothesis test.
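
A group mean on its own says nothing about how much it could vary from sample to sample. As a rough sketch, the group size, standard deviation, and standard error of the mean can be reported alongside each average. The data below is made up for illustration and is not drawn from the taxi dataset; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up fares for illustration only; these are not values from the taxi dataset.
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2, 2, 2, 2],
    "fare_amount": [10.0, 14.0, 12.0, 16.0, 9.0, 11.0, 10.0, 14.0],
})

# Per-group size, mean, and sample standard deviation of the fare.
summary = toy.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])

# Standard error of each group mean: std / sqrt(n).
summary["sem"] = summary["std"] / summary["count"] ** 0.5
print(summary)
```

If the gap between two group means is small relative to their standard errors, sampling noise alone could plausibly explain it, which is what the hypothesis test formalizes.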

# Hypothesis testing

**Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash. 

**Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.

Our goal in this step is to conduct a two-sample t-test.

Steps for conducting a hypothesis test: 
1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 
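
The four steps above can be sketched as a small helper. This is only an illustration under stated assumptions: `welch_decision` is a hypothetical function name, the samples are made up, and the default `alpha=0.05` stands in for whatever significance level is chosen in step 2.

```python
from scipy import stats


def welch_decision(group_a, group_b, alpha=0.05):
    """Hypothetical helper: run a two-sample t-test and compare the p-value to alpha.

    Step 1 (the hypotheses) is implicit: H0 says the two population means are equal.
    """
    # Step 3: find the p-value (equal_var=False selects Welch's t-test).
    result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
    # Step 4: reject H0 only if the p-value falls below the chosen level.
    reject_null = bool(result.pvalue < alpha)
    return float(result.statistic), float(result.pvalue), reject_null


# Made-up samples for illustration only.
a = [10.0, 14.0, 12.0, 16.0, 13.0, 15.0]
b = [9.0, 11.0, 10.0, 14.0, 8.0, 12.0]

# Step 2: the significance level is passed in explicitly.
t_stat, p_value, reject_null = welch_decision(a, b, alpha=0.05)
print(t_stat, p_value, reject_null)
```

Passing `alpha` as a parameter keeps the significance level a deliberate choice made before looking at the p-value, rather than something adjusted afterwards.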


**Note:** For the purpose of this exercise, the hypothesis test is the main component of our A/B test.

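Stated formally, with $\mu$ denoting the population mean fare amount of each group:

$H_0$: $\mu_{\text{credit}} = \mu_{\text{cash}}$. There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: $\mu_{\text{credit}} \neq \mu_{\text{cash}}$. There is a difference in the average fare amount between customers who use credit cards and customers who use cash.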

Significance level: 5%

In [5]:
# Two-sample t-test for the A/B test
# equal_var=False runs Welch's t-test, which does not assume equal variances

credit_card = df[df['payment_type'] == 1]['fare_amount']
cash = df[df['payment_type'] == 2]['fare_amount']
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

Ttest_indResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12)

Since the p-value is far smaller than the 5% significance level, we reject the null hypothesis.

*Notice the 'e-12' at the end of the `pvalue` result: this is scientific notation, so the p-value is approximately $6.8 \times 10^{-12}$.*

There is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
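
To make explicit what `equal_var=False` asks scipy to compute, the sketch below recomputes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The function `welch_t` and the sample values are assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
from scipy import stats


def welch_t(x, y):
    """Hypothetical helper: Welch's t statistic, degrees of freedom, two-sided p-value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared standard error of each group mean (sample variance / n).
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    # Difference in means divided by its estimated standard error.
    t_stat = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    # Two-sided p-value from the t distribution's survival function.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value


# Made-up samples for illustration only.
x = [10.0, 14.0, 12.0, 16.0, 13.0, 15.0]
y = [9.0, 11.0, 10.0, 14.0, 8.0, 12.0]

manual_t, manual_dof, manual_p = welch_t(x, y)
reference = stats.ttest_ind(a=x, b=y, equal_var=False)

# The hand computation should agree with scipy up to floating-point error.
print(np.isclose(manual_t, reference.statistic), np.isclose(manual_p, reference.pvalue))
```

Welch's version is a reasonable default when the two groups may have different variances or sizes, since it does not pool the variances.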

# A/B test results
There is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash. Customers who used credit cards had a higher average fare amount (about 13.43) than customers who used cash (about 12.21).
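
A p-value indicates whether a difference is detectable, not how large it is. As a complementary sketch, a confidence interval for the difference in means can be built from the same Welch quantities. The helper `welch_ci`, the 95% default, and the sample values are assumptions made for illustration only and are not results from the taxi data.

```python
import numpy as np
from scipy import stats


def welch_ci(x, y, confidence=0.95):
    """Hypothetical helper: Welch confidence interval for mean(x) - mean(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = x.mean() - y.mean()
    # Squared standard error of each group mean.
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom.
    dof = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    # Critical value for a two-sided interval at the requested confidence.
    margin = stats.t.ppf((1 + confidence) / 2, dof) * se
    return diff - margin, diff + margin


# Made-up samples for illustration only.
a = [10.0, 14.0, 12.0, 16.0, 13.0, 15.0]
b = [9.0, 11.0, 10.0, 14.0, 8.0, 12.0]

low, high = welch_ci(a, b)
print(low, high)
```

Applied to the real data, the call would be `welch_ci(credit_card, cash)`. An interval that excludes zero is consistent with rejecting the null hypothesis at the matching significance level.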

# Next Steps

Under the random-assignment assumption stated at the start, the New York City TLC can encourage customers to pay with credit cards and create strategies to promote credit card payments.

For example, the New York City TLC can install signs that read `Credit card payments are preferred` in their cabs, and implement a protocol that requires cab drivers to verbally inform customers that credit card payments are preferred.