# Lab 26 - Permutation tests

In this lab, we'll use hypothesis testing to compare the means of two groups to see if they are different.  We will use a type of hypothesis test called a *permutation test* that can be used to test many different kinds of hypotheses.

We will test if trips paid for by credit card are longer on average than trips paid for by cash.

This lab uses the February 4, 2020 (note the different data) green taxi trip data.

Data URL: [https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/Feb4_2020_Green_Taxi_Trip_Data.csv](https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/Feb4_2020_Green_Taxi_Trip_Data.csv)

### Section 1:  Loading and cleaning the data

First, let's import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

Load the data from the file into a DataFrame called `taxi`.

Create a new DataFrame called `taxi2` containing only the columns `trip_distance` and `payment_type`.  

Drop any rows with missing data in `taxi2`. 

How many rows were dropped?  Do you think dropping this many rows could cause problems in the analysis?

Remove all rows with a trip distance <= 0.  Alternatively, make a new DataFrame with only rows with a positive trip distance.
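The cleaning steps above can be sketched on a tiny made-up DataFrame.  This is only an illustration under stated assumptions: the values below and the `fare_amount` column are invented stand-ins for the real file, and `rows_before` and `rows_dropped` are hypothetical variable names.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real taxi data (values are invented for illustration).
taxi = pd.DataFrame({
    "trip_distance": [1.2, 0.0, 3.5, np.nan, 2.1, -0.4],
    "payment_type": [1, 2, 1, 2, np.nan, 1],
    "fare_amount": [7.0, 3.0, 14.5, 9.0, 8.0, 2.5],
})

# Keep only the two columns of interest; .copy() gives an independent DataFrame.
taxi2 = taxi[["trip_distance", "payment_type"]].copy()

# Drop rows with missing values and count how many were removed.
rows_before = len(taxi2)
taxi2 = taxi2.dropna()
rows_dropped = rows_before - len(taxi2)

# Keep only trips with a positive distance.
taxi2 = taxi2[taxi2["trip_distance"] > 0]

print("Rows dropped for missing data:", rows_dropped)
print("Rows remaining:", len(taxi2))
```

Comparing `rows_dropped` to `rows_before` is one way to judge whether the dropped rows are a large share of the data.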

What is the mean trip distance for each payment type?  Hint:  Use groupby(), which was introduced in section 4 of Lab 2.

The payment types are:
* 1 = credit card
* 2 = cash
* 3 = no charge
* 4 = dispute
* 5 = unknown
* 6 = voided trip

Which payment type has the longest mean trip distance?  The smallest?
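As a rough sketch of the `groupby()` hint, the example below computes the mean trip distance per payment type on a small made-up DataFrame (the values are invented, and `mean_by_type` is a hypothetical variable name).  The `idxmax()` and `idxmin()` calls return the payment type with the largest and smallest mean.

```python
import pandas as pd

# Made-up data for illustration only.
taxi2 = pd.DataFrame({
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 6.0],
    "payment_type": [1, 1, 2, 2, 3],
})

# Mean trip distance for each payment type.
mean_by_type = taxi2.groupby("payment_type")["trip_distance"].mean()
print(mean_by_type)

# Payment types with the longest and shortest mean trip distance.
print("Longest mean:", mean_by_type.idxmax())
print("Shortest mean:", mean_by_type.idxmin())
```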

Since we are only interested in trips paid for with credit card (type 1) or cash (type 2), remove all other trips (or alternatively, create a DataFrame containing only credit card and cash trips).
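One possible way to keep only the credit card and cash trips is `isin()`.  The sketch below uses a small made-up DataFrame, so the values are assumptions rather than the real data.

```python
import pandas as pd

# Made-up data for illustration only.
taxi2 = pd.DataFrame({
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 6.0],
    "payment_type": [1, 2, 3, 1, 5],
})

# Keep rows whose payment type is 1 (credit card) or 2 (cash).
taxi2 = taxi2[taxi2["payment_type"].isin([1, 2])].copy()
print(taxi2)
```

The `.copy()` makes `taxi2` an independent DataFrame, which helps avoid pandas warnings when a new column is added to it later.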

### Section 2: Hypothesis testing step 1

Is there a difference in the mean trip distance for trips paid for by credit card (payment type 1) vs. trips paid for by cash (payment type 2)?

We will test this hypothesis.  The null hypothesis is given below.

**Null hypothesis:** The mean trip distance for trips paid for by credit card is the same as the mean trip distance for trips paid for with cash.

What do you think the alternative hypothesis is?  Remember, the null and alternative hypotheses must cover all possibilities.

<details><summary>**Alternative hypothesis:**</summary>
The mean trip distance for trips paid for by credit card is different from the mean trip distance for trips paid for with cash.
</details>

### Section 3: Hypothesis testing step 2

Our test statistic will be the difference in mean trip distances between trips paid for by credit card and trips paid for with cash.  To calculate the test statistic for the data:

1. Compute the mean trip distance for trips paid by credit card.
2. Compute the mean trip distance for trips paid by cash.
3. Subtract mean 1 from mean 2 and take the absolute value.

As we will eventually be computing this test statistic for simulated data, we do not want to do any of this computation manually (e.g., by looking at the groupby means above and subtracting one from the other).

First compute the mean trip distance for trips paid by credit card (step 1), and store the result in the variable `data_credit_card_mean`. 

Next compute the mean trip distance for trips paid by cash (step 2), and store the result in the variable `data_cash_mean`.

Finally find the absolute value of the difference between the two means (step 3), and store it in the variable `data_test_statistic`.
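The three steps might look like the sketch below.  It runs on a small made-up DataFrame (invented values), and selecting rows with `.loc` and a boolean condition is just one of several ways to do this.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
taxi2 = pd.DataFrame({
    "trip_distance": [1.0, 3.0, 2.0, 6.0],
    "payment_type": [1, 1, 2, 2],
})

# Step 1: mean trip distance for credit card trips (payment type 1).
data_credit_card_mean = taxi2.loc[taxi2["payment_type"] == 1, "trip_distance"].mean()

# Step 2: mean trip distance for cash trips (payment type 2).
data_cash_mean = taxi2.loc[taxi2["payment_type"] == 2, "trip_distance"].mean()

# Step 3: absolute value of the difference between the two means.
data_test_statistic = np.abs(data_credit_card_mean - data_cash_mean)
print(data_test_statistic)
```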

### Section 4: Hypothesis testing step 3
Step 3 is to simulate the test statistic assuming the null hypothesis is true.

We will do this by permuting (randomly shuffling) the payment type column in the DataFrame, without changing the trip distance column.  If the payment type doesn't matter, then shuffling it should give differences in means similar to the one we saw in the data.

The following code will permute the `payment_type` column and store the permutation in a new column called `permuted_payment`.

In [None]:
taxi2["permuted_payment"] = np.random.permutation(taxi2['payment_type'])
taxi2.head()

Compare the first few rows of `permuted_payment` with the first few rows of `payment_type`.  Some of the values should be different.  Try re-running the above line of code several times.  What happens?  Does this make sense?
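To see what a permutation does and does not change, here is a small sketch on a made-up Series (the values are invented, and `permuted` is a hypothetical variable name).  The order of the values changes from run to run, but the number of 1s and 2s stays the same, and the original Series is left untouched.

```python
import numpy as np
import pandas as pd

# Made-up payment types for illustration only.
payment_type = pd.Series([1, 1, 1, 2, 2])

# np.random.permutation returns a shuffled copy as a NumPy array.
permuted = np.random.permutation(payment_type)
print("Original:", payment_type.tolist())
print("Permuted:", permuted.tolist())

# The same values appear the same number of times, only in a different order.
print("Same values:", sorted(permuted.tolist()) == sorted(payment_type.tolist()))
```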

Let's compute the test statistic for the simulated data.  In this case, the code will be similar to the code from Section 3, but should use the `permuted_payment` column instead of the original `payment_type` column.

First compute the mean trip distance for trips paid by credit card (step 1) *according to the `permuted_payment` column*, and store the result in the variable `sim_credit_card_mean`. 

Next compute the mean trip distance for trips paid by cash (step 2) *according to the `permuted_payment` column*, and store the result in the variable `sim_cash_mean`.

Finally compute the absolute value of the difference between the two means (step 3), and store it in the variable `sim_test_statistic`.

In [None]:
sim_test_statistic = np.abs(sim_credit_card_mean - sim_cash_mean)
sim_test_statistic

To find the distribution of the test statistic when the null hypothesis is true, we have to repeatedly permute the `payment_type` column and compute the test statistic using this permuted data.  We'll store these test statistics in a list to be able to plot them next.

Can you figure out how to do this?  Remember, use a small number of iterations to test your code, so it is faster.

<details> <summary>Hint:</summary>
The pseudo-code is:
<code>
create an empty list
loop 10,000 times:
    randomly permute the payment_type column and store in permuted_payment column 
    compute the mean trip distance for credit card trips, using the permuted_payment column
    compute the mean trip distance for cash trips, using the permuted_payment column
    compute the absolute difference between the two means
    store the difference (test statistic) in your list
</code>
</details>
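One way the pseudo-code could be turned into Python is sketched below.  It uses a small made-up DataFrame and only 200 iterations so it runs quickly.  The names `num_iterations` and `sim_test_statistics` are hypothetical, and the call to `np.random.seed` is optional and only there so the sketch gives the same result each time.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # optional: makes this sketch reproducible

# Made-up data for illustration only.
taxi2 = pd.DataFrame({
    "trip_distance": [1.0, 3.0, 2.0, 6.0, 4.5, 0.8, 2.2, 5.1],
    "payment_type": [1, 1, 2, 2, 1, 2, 1, 2],
})

num_iterations = 200  # small for testing; the hint suggests 10,000
sim_test_statistics = []  # one simulated test statistic per iteration

for _ in range(num_iterations):
    # Shuffle the payment types while leaving the trip distances in place.
    taxi2["permuted_payment"] = np.random.permutation(taxi2["payment_type"])

    # Group means according to the permuted labels.
    sim_credit_card_mean = taxi2.loc[taxi2["permuted_payment"] == 1, "trip_distance"].mean()
    sim_cash_mean = taxi2.loc[taxi2["permuted_payment"] == 2, "trip_distance"].mean()

    # Absolute difference between the two means is the test statistic.
    sim_test_statistics.append(np.abs(sim_credit_card_mean - sim_cash_mean))

print(len(sim_test_statistics))
```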

Graph the histogram of the test statistics that you computed assuming the null hypothesis is true.

### Section 5: Hypothesis testing step 4
Compare the data test statistic with the histogram of the test statistics computed from the simulations that assume the null hypothesis is true.  We computed this histogram at the end of Section 4.

Does your data test statistic look like it comes from the histogram distribution?

Reject or fail to reject the null hypothesis.
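A common way to make this comparison more concrete, though not required by the lab, is to compute the proportion of simulated test statistics that are at least as large as the data test statistic (often called an empirical p-value).  The sketch below uses made-up numbers in place of your own results, and `proportion_as_extreme` is a hypothetical variable name.

```python
import numpy as np

# Made-up values standing in for your own simulation results.
sim_test_statistics = [0.1, 0.5, 0.2, 0.9, 0.3]
data_test_statistic = 0.8

# Fraction of simulated statistics at least as large as the one from the data.
proportion_as_extreme = np.mean(np.array(sim_test_statistics) >= data_test_statistic)
print(proportion_as_extreme)
```

A small proportion means the data test statistic is unusual under the null hypothesis, which is evidence for rejecting it.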

### Optional challenge questions:
* Create and test another hypothesis for the green taxi trip data.