# Automatidata project

You are a data professional at a data consulting firm called Automatidata. The current project for their newest client, the New York City Taxi & Limousine Commission (New York City TLC), is reaching its midpoint, having completed a project proposal, Python coding work, and exploratory data analysis.

You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about a new request from the New York City TLC: to analyze the relationship between fare amount and payment type. A follow-up email from Luana includes your specific assignment: to conduct an A/B test.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

### Course 4 End-of-course project: Statistical analysis

In this activity, you will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. You will explore the data provided and conduct A/B and hypothesis testing.

The purpose of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. Your A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

Note: For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

The goal is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.

This activity has three parts:

Part 1: Imports and data loading

What data packages will be necessary for hypothesis testing?

Part 2: Conduct EDA and hypothesis testing

How did computing descriptive statistics help you analyze your data?

How did you formulate your null hypothesis and alternative hypothesis?

Part 3: Communicate insights with stakeholders

What key business insight(s) emerged from your A/B test?

What business recommendations do you propose based on your results?



# Research question

The research question for this data project: “Is there a relationship between total fare amount and payment type?”

### Task 1: Import and data loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
from scipy import stats

In [2]:
import warnings

# Code where you want to ignore warnings
warnings.filterwarnings("ignore")

# Your code here

# Restore warnings
warnings.filterwarnings("default")

In [3]:
df = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


### Exploratory data analysis

In [4]:
# Drop the leftover index column for a cleaner table
df = df.drop('Unnamed: 0', axis = 1)

In [5]:
# Count fully duplicated rows
df.duplicated().sum()

0

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 15  im

In [7]:
# Count missing values in each column
df.isna().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
dtype: int64

We are interested in the relationship between payment type and the total fare amount the customer pays. One approach is to look at the average total fare amount for each payment type.

In [8]:
# let's extract the relevant columns for this project
data = df[['payment_type', 'fare_amount', 'total_amount']]
data.head()

Unnamed: 0,payment_type,fare_amount,total_amount
0,1,13.0,16.56
1,1,16.0,20.8
2,1,6.5,8.75
3,1,20.5,27.69
4,2,16.5,17.8


Note: In the dataset, payment_type is encoded as integers:

1: Credit card

2: Cash

3: No charge

4: Dispute

5: Unknown

In [9]:
data['payment_type'].unique()

array([1, 2, 3, 4], dtype=int64)

In [10]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
data = data.copy()
data['payment_type'] = data['payment_type'].replace({1: 'Credit card', 2: 'Cash'})


In [15]:
data.groupby(['payment_type'])['total_amount'].mean().to_frame().sort_values(by = 'total_amount', ascending = False)

payment_type,total_amount
Credit card,17.663577
3,13.579669
Cash,13.545821
4,11.238261


Based on the averages shown, customers who pay by credit card appear to pay a larger total fare amount than customers who pay in cash. However, this difference might arise from random sampling rather than from a true difference in total fare amount. To assess whether the difference is statistically significant, you conduct a hypothesis test.
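
Before testing, it can help to look at each group's size and spread alongside its mean, since those determine how much the means could vary from sample to sample. The following is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from the taxi data); only the column names `payment_type` and `total_amount` come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only; not taken from the taxi dataset.
toy = pd.DataFrame({
    "payment_type": ["Credit card", "Credit card", "Credit card", "Cash", "Cash", "Cash"],
    "total_amount": [10.0, 20.0, 30.0, 8.0, 12.0, 16.0],
})

# Per-group sample size, mean, and sample standard deviation (ddof=1 by default).
summary = toy.groupby("payment_type")["total_amount"].agg(["count", "mean", "std"])

# Standard error of each group mean: std / sqrt(n).
summary["sem"] = summary["std"] / np.sqrt(summary["count"])

print(summary)
```

A large gap between means relative to the standard errors is what the hypothesis test below formalizes.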

## Task 3. Hypothesis testing

$H_0$: There is no difference in the average total fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average total fare amount between customers who use credit cards and customers who use cash.
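
In symbols, the two hypotheses can be written as follows. The notation is an assumption introduced here for illustration: $\mu_{\text{card}}$ and $\mu_{\text{cash}}$ denote the population mean total fare amounts for credit card and cash customers.

$$
H_0:\ \mu_{\text{card}} = \mu_{\text{cash}}
\qquad
H_A:\ \mu_{\text{card}} \neq \mu_{\text{cash}}
$$

Because the alternative is stated as "not equal", the test is two-sided.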

With the hypotheses stated, the remaining steps are:

Choose a significance level

Find the p-value

Reject or fail to reject the null hypothesis

In [12]:
credit_card = data[data['payment_type'] == 'Credit card' ] 
cash = data[data['payment_type'] == 'Cash' ]

In [16]:
stats.ttest_ind(a=credit_card['total_amount'], b=cash['total_amount'], equal_var=False)

Ttest_indResult(statistic=20.34644022783838, pvalue=4.5301445359736376e-91)

Since the p-value is extremely small (much smaller than the significance level of 5%), you reject the null hypothesis. You conclude that there is a statistically significant difference in the average total fare amount between customers who use credit cards and customers who use cash.
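
Passing `equal_var=False` to `stats.ttest_ind` requests Welch's t-test, which does not assume the two groups share a variance. To make the calculation less of a black box, the sketch below recomputes the statistic and two-sided p-value by hand and compares them with SciPy's result. The helper name `welch_t` and the toy samples are hypothetical and are not part of the original notebook.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic and two-sided p-value for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance / n).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution.
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p

# Hypothetical toy samples for illustration only.
card = [10.0, 20.0, 30.0, 25.0, 15.0]
cash = [8.0, 12.0, 16.0, 9.0]

t_manual, p_manual = welch_t(card, cash)
result = stats.ttest_ind(a=card, b=cash, equal_var=False)

print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

The two printed lines should agree up to floating-point error, which confirms that the reported statistic is the difference in sample means divided by its estimated standard error.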

# Conclusion

The key business insight is that encouraging customers to pay with credit cards will likely generate more revenue for taxi cab drivers.

This project assumes that passengers were required to pay one way or the other and that, once informed of this requirement, they always complied. The data was not collected this way, so random assignment had to be assumed in order to treat the data as an A/B test. The dataset also does not account for other likely explanations. For example, riders might not carry much cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that fare amount determines payment type than the reverse.

The difference between the average card payment and the average cash payment is also inflated because total_amount is the variable being compared. Cash fares all have tip values of $0, while card payments have non-zero tip values. A possible reason is that cash tips are not declared, which means tips are captured in one group but not in the other. Instead, one could compare the fare_amount column
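
The alternative comparison mentioned above could be sketched as follows, under the assumption that the same Welch's t-test is simply rerun on a different column. The helper `compare_payment_groups` and the toy rows are hypothetical, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd
from scipy import stats

def compare_payment_groups(frame, value_col):
    """Welch's t-test of value_col between 'Credit card' and 'Cash' rows."""
    card = frame.loc[frame["payment_type"] == "Credit card", value_col]
    cash = frame.loc[frame["payment_type"] == "Cash", value_col]
    return stats.ttest_ind(a=card, b=cash, equal_var=False)

# Hypothetical toy rows; values are illustrative only.
toy = pd.DataFrame({
    "payment_type": ["Credit card"] * 4 + ["Cash"] * 4,
    "fare_amount": [13.0, 16.0, 6.5, 20.5, 16.5, 9.0, 7.5, 12.0],
    "total_amount": [16.56, 20.8, 8.75, 27.69, 17.8, 9.8, 8.3, 12.8],
})

# Run the same test on the tip-free fare and on the tip-inclusive total.
for col in ["fare_amount", "total_amount"]:
    res = compare_payment_groups(toy, col)
    print(col, res.statistic, res.pvalue)
```

On the real data, the same call would be made on `data` after the payment type labels have been replaced.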