# Automatidata project 
**Course 4 - The Power of Statistics**

You are a data professional in a data consulting firm, called Automatidata. The current project for their newest client, the New York City Taxi & Limousine Commission (New York City TLC) is reaching its midpoint, having completed a project proposal, Python coding work, and exploratory data analysis.

You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about a new request from the New York City TLC: to analyze the relationship between fare amount and payment type. A follow-up email from Luana includes your specific assignment: to conduct an A/B test. 


# Course 4 End-of-course project: Statistical analysis

In this activity, we will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. We will explore the data provided and conduct A/B and hypothesis testing.  
<br/>   

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. The A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, we assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct EDA and hypothesis testing
* How did computing descriptive statistics help you analyze your data? 

* How did you formulate your null hypothesis and alternative hypothesis? 

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your A/B test?

* What business recommendations do you propose based on your results?

# **Conduct an A/B test**


# **PACE stages**


## PACE: Plan 

In this stage, consider the following questions where applicable:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


*Ans.* Do passengers who use credit cards pay higher fares than those who pay with cash?
Is the difference in payment statistically significant?
 

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Before you begin, recall the following Python packages and functions that may be useful:

*Main functions*: stats.ttest_ind(a, b, equal_var)

*Other functions*: mean() 

*Packages*: pandas, scipy.stats

</details>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

In [3]:
df = pd.read_csv("../Automatidata_Tableau_dataset.csv")

In [4]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_duration,dur_mins,abs_mins
0,1,2017-01-10 01:40:22,2017-01-10 01:44:41,1,0.4,1,N,186,170,1,4.5,0.0,0.5,1.05,0.0,0.3,6.35,0 days 00:04:19,4.32,4.32
1,1,2017-01-09 08:54:14,2017-01-09 08:59:45,1,1.1,1,N,43,239,1,6.0,0.5,0.5,1.8,0.0,0.3,9.1,0 days 00:05:31,5.52,5.52
2,2,2017-01-07 07:56:59,2017-01-07 08:15:04,6,2.2,1,N,237,100,2,12.5,0.0,0.5,0.0,0.0,0.3,13.3,0 days 00:18:05,18.08,18.08
3,1,2017-01-15 12:23:24,2017-01-15 12:34:15,1,1.2,1,N,211,45,2,8.5,0.0,0.5,0.0,0.0,0.3,9.3,0 days 00:10:51,10.85,10.85
4,2,2017-01-20 01:53:32,2017-01-20 02:09:57,2,4.84,1,N,148,181,1,17.0,0.5,0.5,2.0,0.0,0.3,20.3,0 days 00:16:25,16.42,16.42


## PACE: **Analyze and Construct**

In this stage, consider the following questions where applicable:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


*Ans.* Descriptive statistics summarize a dataset with a few numbers, such as the mean for each group, so it is much easier and faster to draw conclusions than by inspecting every individual record. They therefore support decisions while using less time and fewer resources.

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
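The integer codes are easier to read in summaries once they are mapped to labels. The following is a minimal sketch, assuming a small hypothetical DataFrame named `toy` in place of the real dataset; the `PAYMENT_LABELS` dictionary and the `payment_label` column are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping copied from the note above.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({"payment_type": [1, 2, 1, 3, 4]})

# Map integer codes to labels; codes missing from the mapping become NaN.
toy["payment_label"] = toy["payment_type"].map(PAYMENT_LABELS)
print(toy["payment_label"].value_counts())
```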



In [5]:
df.groupby('payment_type').size()

payment_type
1    32993
2    16739
3      200
4       68
dtype: int64

We are interested in the relationship between payment type and the total fare amount the customer pays. One approach is to look at the average total fare amount for each payment type. 

In [6]:
df.groupby('payment_type')['total_amount'].mean()

payment_type
1    16.662774
2    12.638223
3    12.494650
4    11.901912
Name: total_amount, dtype: float64

We observe a substantial difference of about $4 in the average total fare between credit card and cash payments.
Now it's time to conduct a hypothesis test to determine whether this difference is statistically significant or could be due to chance.
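A group mean on its own says nothing about group size or spread, both of which matter for the test that follows. As a rough sketch, assuming a hypothetical toy DataFrame rather than the TLC data, several descriptive statistics can be computed per payment type in one call:

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the full TLC dataset.
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "total_amount": [10.0, 20.0, 30.0, 8.0, 12.0, 16.0],
})

# Count, mean, median, and sample standard deviation per payment type.
summary = toy.groupby("payment_type")["total_amount"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```

The same `agg` call should work on `df` from the cells above, since it only relies on the `payment_type` and `total_amount` columns.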


### Task 3. Hypothesis testing

Before conducting the hypothesis test, consider the hypotheses for this project, listed below.

$H_0$: There is no difference in the average total fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average total fare amount between customers who use credit cards and customers who use cash.



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 



Let's choose 5% as the significance level and proceed with a two-sample t-test.
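The steps listed above can be sketched as a small helper. This is only an illustration: `two_sample_decision` is a hypothetical function name, and the samples are synthetic draws from a normal distribution, not the TLC data.

```python
import numpy as np
from scipy import stats


def two_sample_decision(a, b, alpha=0.05):
    """Run a two-sided two-sample t-test and report the decision at level alpha."""
    # Step 3: compute the test statistic and the p-value.
    result = stats.ttest_ind(a, b)
    # Step 4: reject H0 only when the p-value falls below alpha.
    reject = result.pvalue < alpha
    return result.statistic, result.pvalue, reject


# Hypothetical synthetic samples, not the TLC data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=16.0, scale=5.0, size=500)
group_b = rng.normal(loc=12.0, scale=5.0, size=500)

t_stat, p_value, reject = two_sample_decision(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject}")
```

Steps 1 and 2 (stating the hypotheses and choosing the significance level) happen before any code runs, which is why `alpha` is passed in rather than chosen after seeing the p-value.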

In [7]:
credit_df = df[df.payment_type == 1].total_amount
cash_df = df[df.payment_type == 2].total_amount

print("Mean total fare for Credit Card Payments is ${cc:.2f} and for Cash Payments is ${ca:.2f}".format(
    cc = credit_df.mean(), ca = cash_df.mean()))

Mean total fare for Credit Card Payments is $16.66 and for Cash Payments is $12.64


In [8]:
stats.ttest_ind(credit_df, cash_df)

Ttest_indResult(statistic=32.98506024602936, pvalue=4.742886813606497e-236)

The p-value for the two samples is about 4.7e-236, which is effectively zero and far below the significance level of 5%. Hence we reject the null hypothesis.
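By default, `stats.ttest_ind` uses `equal_var=True`, which assumes both groups share the same population variance. Because the two groups here differ considerably in size, one possible robustness check is Welch's t-test (`equal_var=False`). The sketch below compares both variants on hypothetical synthetic samples; it does not reproduce the notebook's results.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples with unequal sizes and spreads.
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=16.0, scale=12.0, size=2000)
sample_b = rng.normal(loc=12.0, scale=6.0, size=1000)

# Student's t-test (pooled variance), the default used in the cell above.
pooled = stats.ttest_ind(sample_a, sample_b, equal_var=True)
# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)

print(f"pooled: t = {pooled.statistic:.2f}, p = {pooled.pvalue:.3g}")
print(f"welch:  t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
```

If both variants lead to the same decision, the conclusion does not hinge on the equal-variance assumption.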

The problem with this A/B test is that the collected data includes tips paid by credit card but not tips paid in cash, which inflates `total_amount` for credit card payments. To get a fair, unbiased comparison, we should repeat the same analysis on total amounts with tips excluded.

### Hypothesis testing on totals without tips

In [9]:
df['Tipless Total'] = df.total_amount - df.tip_amount

In [10]:
df.groupby('payment_type')['Tipless Total'].mean()

payment_type
1    14.080708
2    12.638223
3    12.494650
4    11.901912
Name: Tipless Total, dtype: float64

We observe that a difference of about $1.44 remains in the average total (excluding tips) between credit card and cash payments.
Now it's time to conduct a hypothesis test to determine whether this difference is statistically significant or could be due to chance.

In [11]:
credit_df_tipless = df.loc[df.payment_type == 1, 'Tipless Total']
cash_df_tipless = df.loc[df.payment_type == 2, 'Tipless Total']

print("Mean total fare for Credit Card Payments is ${cc:.2f} and for Cash Payments is ${ca:.2f}".format(
    cc = credit_df_tipless.mean(), ca = cash_df_tipless.mean()))

Mean total fare for Credit Card Payments is $14.08 and for Cash Payments is $12.64


In [12]:
stats.ttest_ind(credit_df_tipless, cash_df_tipless)

Ttest_indResult(statistic=13.373460965510725, pvalue=1.0164391434311328e-40)

The p-value for the two samples is about 1.0e-40, which is effectively zero and far below the significance level of 5%. Hence we reject the null hypothesis.

Thus the difference of about $1.44 in total amount (excluding tips) between credit card and cash payments is statistically significant, and under this project's assumptions credit card payment should be encouraged.
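With tens of thousands of rows, even a small difference can be statistically significant, so it can help to report the size of the difference alongside the p-value. The sketch below is one possible way to do that, using a Welch-style confidence interval and Cohen's d. The helper names `mean_diff_ci` and `cohens_d` are hypothetical, and the samples are synthetic, not the TLC data.

```python
import numpy as np
from scipy import stats


def mean_diff_ci(a, b, confidence=0.95):
    """Welch-style confidence interval for mean(a) - mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard error contributed by each sample.
    var_a, var_b = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    diff = a.mean() - b.mean()
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a**2 / (a.size - 1) + var_b**2 / (b.size - 1)
    )
    margin = stats.t.ppf((1 + confidence) / 2, dof) * se
    return diff, diff - margin, diff + margin


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Hypothetical synthetic samples, not the TLC data.
rng = np.random.default_rng(7)
sample_a = rng.normal(loc=14.0, scale=10.0, size=3000)
sample_b = rng.normal(loc=12.6, scale=10.0, size=1500)

diff, low, high = mean_diff_ci(sample_a, sample_b)
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print(f"Cohen's d = {cohens_d(sample_a, sample_b):.2f}")
```

Applied to `credit_df_tipless` and `cash_df_tipless`, the interval would express the difference in dollars, which may be easier for stakeholders to interpret than a p-value.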

## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4. Communicate insights with stakeholders

1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this educational project.

*Ans.* 
1. Based on the result of the hypothesis test, the difference in average total fare between credit card and cash payments is statistically significant. Under the random-assignment assumption made for this project, this suggests a relationship between payment type and total fare: on average, customers who pay with credit card pay more.

2. The first A/B test was not realistic, because the collected data includes tips paid by credit card but not tips paid in cash. The analysis also relies on the assumption that customers were randomly assigned to pay by credit card or cash; without it, the result cannot be read as causal.

3. In the second A/B test, where tip amounts were excluded from the total amount, credit card payments were still significantly higher than cash payments, by about $1.44 on average.

4. Thus credit card payments should be encouraged. Strategies should be created to promote credit card payments. 

5. For example, the New York City TLC can install signs that read “Credit card payments are preferred” in their cabs and implement a protocol that requires cab drivers to verbally inform customers that credit card payments are preferred. 

6. Another recommendation is to request data for cash tips if available.