In [17]:
# CAPSTONE PROJECT NUMBER 1: A/B TESTING AND HYPOTHESIS TESTING
# The purpose of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests.
# The A/B test results should aim to find ways to generate more revenue for taxi cab drivers.
# The customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card 
# 2) customers who are required to pay with cash. 


In [10]:
# The goal of this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount.
# For example, do customers who use credit cards pay higher fare amounts than customers who use cash?
# The research question for this data project: “Is there a relationship between total fare amount and payment type?”

In [None]:
# This activity has four parts
# Part 1: Package imports and data loading


In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [18]:
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv")

In [None]:
# Part 2: Conducting EDA and hypothesis testing
# In general, descriptive statistics are useful because they let you quickly explore and understand large amounts of data.
# In this case, computing descriptive statistics helps me quickly compare the average fare amount across payment types.
# The next cell uses descriptive statistics to begin the exploratory data analysis (EDA).

In [19]:
# descriptive stats code for EDA
taxi_data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,,22687,22688,,,,2,,,,,,,,,,
top,,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,,2,2,,,,22600,,,,,,,,,,
mean,56758490.0,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,32744930.0,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,12127.0,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,28520560.0,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,56731500.0,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,85374520.0,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [None]:
# Note: In the dataset, payment_type is encoded as integers:
#   1: Credit card
#   2: Cash
#   3: No charge
#   4: Dispute
#   5: Unknown
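
In [None]:
# A minimal sketch of how the integer codes in the note above could be turned into readable labels.
# The dictionary name PAYMENT_TYPE_LABELS, the column name payment_label, and the small `sample`
# DataFrame are illustrative assumptions; the rows are made up and are not real trip records.

```python
import pandas as pd

# Hypothetical lookup table built from the payment_type note above.
PAYMENT_TYPE_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Made-up rows standing in for taxi_data.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 4],
    "fare_amount": [13.0, 9.5, 20.0, 6.0, 7.5],
})

# Add a label column; the original integer codes are left untouched.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_TYPE_LABELS)
print(sample)
```

# Keeping the integer column alongside the label means later filters such as payment_type == 1 still work.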


In [None]:
# You are interested in the relationship between payment type and the fare amount the customer pays.
# One approach is to look at the average fare amount for each payment type.

In [20]:
taxi_data.groupby('payment_type')['fare_amount'].mean()
# Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay with cash.
# However, this difference might arise from random sampling, rather than being a true difference in fare amount.
# To assess whether the difference is statistically significant, you conduct a hypothesis test.

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64
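
In [None]:
# The group means above do not show how many trips sit behind each average, and payment type 5
# does not appear in the output at all. A minimal sketch of reporting the mean together with the
# group size follows. The `sample` DataFrame is made up for illustration; with the real data the
# same call would be applied to taxi_data.

```python
import pandas as pd

# Made-up rows standing in for taxi_data.
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 4],
    "fare_amount": [10.0, 20.0, 30.0, 8.0, 12.0, 5.0],
})

# Mean fare and number of trips for each payment type.
summary = sample.groupby("payment_type")["fare_amount"].agg(["mean", "count"])
print(summary)
```

# A mean computed from very few trips (such as payment type 4 in this made-up sample) deserves less weight.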

In [None]:
# Hypothesis testing
# Null hypothesis: There is no difference in average fare between customers who use credit cards and customers who use cash.
# Alternative hypothesis: There is a difference in average fare between customers who use credit cards and customers who use cash.
# Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:
#   1. State the null hypothesis and the alternative hypothesis
#   2. Choose a significance level
#   3. Find the p-value
#   4. Reject or fail to reject the null hypothesis
# Note: For the purpose of this exercise, your hypothesis test is the main component of your A/B test.

In [None]:
#a 𝐻0: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.
#b 𝐻𝐴: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.
# You choose 5% as the significance level and proceed with a two-sample t-test.

In [None]:
#hypothesis test, A/B test
#significance level

In [21]:
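# Fare amounts for credit card (payment_type 1) and cash (payment_type 2) trips
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']
# Two-sample t-test; equal_var=False selects Welch's t-test, which does not assume equal variances
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)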

TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)
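
In [None]:
# With equal_var=False, the test above is Welch's t-test, which is why the reported df is not a whole number.
# A minimal sketch of the underlying arithmetic follows, using only the standard library.
# The helper name welch_t and both fare lists are illustrative assumptions; the values are made up.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each sample mean (sample variance uses n - 1).
    se2_a = statistics.variance(a) / n_a
    se2_b = statistics.variance(b) / n_b
    # Difference in means divided by its estimated standard error.
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Made-up fares, not taken from the taxi dataset.
card_fares = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0]
cash_fares = [8.0, 11.0, 7.5, 14.0, 9.5]

t_stat, df = welch_t(card_fares, cash_fares)
print(t_stat, df)
```

# On the real data, the output of welch_t(credit_card, cash) could be compared with the statistic and df reported above.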

In [None]:
# Since the p-value (about 6.8e-12) is far smaller than the significance level of 5% (0.05), you reject the null hypothesis.
# You conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
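
In [None]:
# The decision step can be written out explicitly. This is a minimal sketch: the p-value is copied
# from the TtestResult output above, and the variable names alpha, p_value, and reject_null are illustrative.

```python
# Significance level chosen earlier in this notebook (5%).
alpha = 0.05

# p-value copied from the TtestResult output above.
p_value = 6.797387473030518e-12

# Reject the null hypothesis only when the p-value is below alpha.
reject_null = p_value < alpha
print("Reject H0" if reject_null else "Fail to reject H0")
```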

In [None]:
# Part 4: Communicating insights with stakeholders

In [None]:
# Responses:
# 1. The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers.
# 2. This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it.
# 3. The data was not collected this way, so the analysis had to assume that trips were randomly assigned to the two payment groups in order to treat it as an A/B test.
# 4. This dataset does not account for other likely explanations. For example, riders might not carry much cash, so it's easier to pay for longer trips with a credit card.
# 5. In other words, it's far more likely that fare amount determines payment type, rather than vice versa.
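
In [None]:
# One way to probe the alternative explanation in the responses above would be to check whether
# credit card trips are also longer, since the dataset includes a trip_distance column.
# A minimal sketch on made-up rows follows; the `sample` DataFrame and its values are
# assumptions for illustration, not results from the taxi data.

```python
import pandas as pd

# Made-up rows standing in for taxi_data (1 = credit card, 2 = cash).
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [5.0, 7.0, 6.0, 1.0, 2.0, 3.0],
    "fare_amount": [18.0, 24.0, 21.0, 6.0, 8.0, 10.0],
})

# Average distance and fare per payment type, side by side.
by_type = sample.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
print(by_type)
```

# If card trips turned out to be longer on the real data, trip length could explain part of the fare gap.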