# **Automatidata Project**
**Coursera - The Power of Statistics**

* **The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. The A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

* **The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.

## Problem statements
* Is there a relationship between total fare amount and payment type?

### 1. Import packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats

In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

### 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
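
The integer codes are hard to read in grouped output, so it can help to attach labels. The following is a minimal sketch, assuming the code-to-label mapping in the note above. The `sample` frame and its values are hypothetical stand-ins so that the snippet runs without the CSV file.

```python
import pandas as pd

# Code-to-label mapping copied from the note above.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical stand-in rows; the real data lives in taxi_data.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 4],
    "fare_amount": [13.0, 16.5, 6.5, 9.0],
})

# Add a readable label column; codes missing from the mapping become NaN.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

The same `.map(PAYMENT_LABELS)` call could be applied to `taxi_data['payment_type']` if labelled output is preferred.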



In [3]:
taxi_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [4]:
taxi_data.shape

(22699, 17)

In [5]:
taxi_data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [6]:
pd.DataFrame(round(taxi_data.groupby('payment_type')['fare_amount'].mean(), 2))

payment_type,fare_amount
1,13.43
2,12.21
3,12.19
4,9.91


In [7]:
# Number of trips per payment type (counts are integers, so no rounding is needed)
pd.DataFrame(taxi_data.groupby('payment_type')['VendorID'].count())

payment_type,VendorID
1,15265
2,7267
3,121
4,46


Based on the averages shown, it appears that customers who pay by credit card tend to pay a higher fare amount than customers who pay in cash. However, this difference might arise from random sampling rather than being a true difference in fare amount. To assess whether the difference is statistically significant, let's conduct a hypothesis test.

### 3. Hypothesis testing

The goal in this step is to conduct a two-sample t-test. Here are the steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
        * **Null**: there is no difference in average fare between customers who use credit cards and customers who use cash
        * **Alternative**: there is a difference in average fare between customers who use credit cards and customers who use cash
2.   Choose a significance level
        * We use 5% as the significance level and proceed with a two-sample t-test.
3.   Find the p-value
        * We compute the t-statistic from the two-sample t-test and derive the p-value from it
4.   Reject or fail to reject the null hypothesis

**Note:** For the purpose of this exercise, the hypothesis test is the main component of the A/B test.

In [8]:
# payment_type 1 = credit card, 2 = cash
user_credit = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
user_cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

# Two-sample t-test without assuming equal variances
test_statistic, p_val = stats.ttest_ind(a=user_credit,
                                        b=user_cash,
                                        equal_var=False)

print(f'T-score: {test_statistic}')
print(f'P-value: {p_val}')

T-score: 6.866800855655372
P-value: 6.797387473030518e-12


The p-value is **6.797387473030518e-12**, which is far below our 5% significance level. We therefore reject the null hypothesis and conclude that ***there is a statistically significant difference*** in the average fare amount between customers who use credit cards and customers who use cash.
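
The decision in step 4 comes down to comparing the p-value with the chosen significance level. A minimal sketch of that rule follows, using the p-value printed above; the function name `decide` is hypothetical.

```python
ALPHA = 0.05  # 5% significance level chosen in step 2
p_val = 6.797387473030518e-12  # p-value printed by the t-test above


def decide(p_value, alpha=ALPHA):
    """Return the hypothesis-test decision for a given p-value."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(p_val))
```

Note that the rule only addresses statistical significance. It says nothing about how large the difference in average fare is or what causes it.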

### 4. Communicate insights

1. What business insight(s) can you draw from the result of your hypothesis test?
    * We can infer that **encouraging customers to pay by credit card** may help **generate more revenue** for taxi cab drivers.
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this educational project.
    * The conclusion assumes that both payment types are available to every passenger on every trip. This is not always the case: for example, when the driver does not carry enough cash for change, the passenger is forced to pay by credit card. For this exercise, we treat the data as if it had been collected in a way consistent with that assumption.
    * In practice, it is far more likely that the fare amount determines the payment type, rather than the other way around.
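
One way to probe the last point is to check whether the share of credit card payments changes with the size of the fare. The sketch below is only an illustration: the `trips` frame holds hypothetical values rather than rows from the dataset, and the fare band edges (10, 20, 60) are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical trips; payment_type 1 = credit card, 2 = cash.
trips = pd.DataFrame({
    "payment_type": [2, 2, 1, 2, 1, 1, 2, 1, 1, 1],
    "fare_amount": [4.5, 6.0, 7.5, 9.0, 12.0, 14.5, 18.0, 25.0, 40.0, 52.0],
})

# Keep only credit card and cash trips, as in the hypothesis test.
trips = trips[trips["payment_type"].isin([1, 2])].copy()

# Group fares into arbitrary bands: (0, 10], (10, 20], (20, 60].
trips["fare_band"] = pd.cut(
    trips["fare_amount"], bins=[0, 10, 20, 60], labels=["low", "mid", "high"]
)

# Share of credit card payments within each fare band.
trips["is_credit"] = trips["payment_type"].eq(1)
credit_share = trips.groupby("fare_band", observed=True)["is_credit"].mean()
print(credit_share)
```

If a similar pattern showed up in `taxi_data`, with the credit card share rising as fares rise, it would be consistent with fare amount influencing payment type. It would not prove the direction of causation.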