# Investigating helicopter fall time

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
helicopter = Table.read_table("data/DSSI23_helicopter_data.csv")

In [None]:
good_helicopter = helicopter.where('Time', are.not_equal_to("nan")).where('Anomaly', are.equal_to("No"))

In [None]:
good_helicopter

## Determine the observed statistic

We'll use the difference between the average fall time for the long rotor and the average fall time for the short rotor as our statistic. A few Table methods let us compute this value quickly.

### Use `.group`

Use the `.group` method to easily calculate the average time for each rotor length group.

In [None]:
good_helicopter.group('Rotor Length', np.nanmean)

### Use `.column`

You can select the two average times from this table by using `.column` to create an array with the two values of interest.

In [None]:
good_helicopter.group('Rotor Length', np.nanmean).column("Time nanmean")

### Use `.item`

The `.item` method will select a single item out of an array.

In [None]:
good_helicopter.group('Rotor Length', np.nanmean).column("Time nanmean").item(0)

In [None]:
good_helicopter.group('Rotor Length', np.nanmean).column("Time nanmean").item(1)

### Putting it all together

Combining `.group`, `.column`, and `.item` gives a single expression that computes the difference of means we want to use as our statistic.

In [None]:
good_helicopter.group('Rotor Length', np.nanmean).column("Time nanmean").item(0) - good_helicopter.group('Rotor Length', np.nanmean).column("Time nanmean").item(1)
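
To see what the chained expression is doing, it may help to picture the same steps in plain Python. The sketch below uses made-up fall times and hypothetical names (`rows`, `group_means`), not the helicopter data, and only the standard library. It illustrates the idea rather than how `.group` is actually implemented, and it does not handle missing values the way `np.nanmean` does.

```python
# Made-up (rotor length, fall time) pairs; not the real helicopter data
rows = [("Long", 2.5), ("Short", 1.5), ("Long", 3.5), ("Short", 2.5)]

def group_means(rows):
    """Return {label: mean time}, loosely mirroring .group(label, mean)."""
    times_by_label = {}
    for label, time in rows:
        times_by_label.setdefault(label, []).append(time)
    return {label: sum(times) / len(times) for label, times in times_by_label.items()}

means = group_means(rows)
# Take the labels in sorted order, so "Long" comes before "Short"
first_label, second_label = sorted(means)
difference = means[first_label] - means[second_label]
print(difference)
```

Here the "Long" mean is 3.0 and the "Short" mean is 2.0, so the difference is 1.0.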

## Write a function to compute the statistic

Since our simulation will require us to run the same calculation over and over to compute the statistic of interest, it would be helpful to write a function that computes it for us. A function allows you to reuse the same logic and calculations for different configurations of our Table. The function below performs the same calculation as earlier, but on any Table provided as an input.

In [None]:
def difference_of_means(table_input, group_label):
    """Return the mean "Time" of the first group minus the mean "Time" of the second group."""
    # Group once and reuse the array of per-group means
    means = table_input.group(group_label, np.nanmean).column("Time nanmean")
    return means.item(0) - means.item(1)

We can confirm that this function obtains the same result as our original commands above by providing it the same Table and same group label as earlier.

In [None]:
difference_of_means(good_helicopter, "Rotor Length")

Let's save this value to `observed_difference` so we can reference it again later.

In [None]:
observed_difference = difference_of_means(good_helicopter, "Rotor Length")

Now we have a function we can use to easily calculate our statistic for any Table we create!

## Write a function to shuffle the labels

We'll want to shuffle the observations between the groups, "Long" and "Short", to simulate under our null hypothesis: that any difference between the average fall times is simply due to chance. Put another way, the null hypothesis says that the true difference between the average fall times of these groups is 0. The code below achieves this by effectively reassigning the labels "Long" and "Short" at random to the rows in the Table, and then computes the difference of means for the shuffled labels.

In [None]:
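def shuffle_table(table_input, group_label):
    # np.random.permutation returns a shuffled copy, so table_input's own column is left unchanged
    # (np.random.shuffle would reorder that array in place)
    labels = np.random.permutation(table_input.column(group_label))
    table_with_shuffled_labels = table_input.with_column("shuffled group labels", labels)
    # Return the statistic computed from the shuffled labels
    return difference_of_means(table_with_shuffled_labels, "shuffled group labels")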

In [None]:
shuffle_table(good_helicopter, "Rotor Length")
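
One property worth checking is that shuffling only changes which row gets which label: the number of "Long" and "Short" labels stays the same, and the original labels are untouched. The sketch below shows this with made-up labels, using the standard library's `random.sample` as a stand-in for the shuffle. The variable names are illustrative and not part of the notebook.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

labels = ["Long", "Long", "Long", "Short", "Short", "Short"]
# random.sample with k=len(labels) returns a new shuffled list and leaves `labels` alone
shuffled = rng.sample(labels, k=len(labels))

print(shuffled)
print(Counter(labels) == Counter(shuffled))  # same number of each label
```

Because the group sizes never change, every shuffled table compares two groups of the same sizes as the original experiment.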

## Simulate the process

Now all we need to do is simulate the shuffling many, many times. A loop is a programming concept that allows for the same set of operations to be run several times in sequence. The loop below will shuffle the table, compute the statistic for the shuffled table, and then append the statistic to an array named `statistics`.

In [None]:
repetitions = 200
statistics = make_array()

for i in np.arange(repetitions):
    # Append each new simulated statistic to the end of the array
    statistics = np.append(statistics, shuffle_table(good_helicopter, "Rotor Length"))

In [None]:
statistics
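
The simulate-and-collect loop can also be sketched end to end without the Table machinery. Everything here is an assumption for illustration: the fall times are invented, and `mean_difference` and `simulated` are hypothetical names that do not appear elsewhere in this notebook.

```python
import random

rng = random.Random(42)  # fixed seed so repeated runs match

# Invented fall times (seconds); not the real helicopter data
times = [2.9, 3.1, 3.0, 3.2, 2.1, 2.0, 2.3, 2.2]
labels = ["Long"] * 4 + ["Short"] * 4

def mean_difference(times, labels):
    """Mean time of the "Long" rows minus mean time of the "Short" rows."""
    long_times = [t for t, lab in zip(times, labels) if lab == "Long"]
    short_times = [t for t, lab in zip(times, labels) if lab == "Short"]
    return sum(long_times) / len(long_times) - sum(short_times) / len(short_times)

repetitions = 200
simulated = []
for _ in range(repetitions):
    shuffled = rng.sample(labels, k=len(labels))  # shuffled copy of the labels
    simulated.append(mean_difference(times, shuffled))

print(len(simulated))
```

Each pass through the loop produces one simulated difference, so `simulated` ends up with exactly `repetitions` values.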

## Visualize the distribution

Let's put our array into a Table so we can create a histogram.

In [None]:
result = Table().with_column("Difference of means", statistics)

In [None]:
result

In [None]:
result.hist()
plots.scatter(observed_difference, -.2, color = 'red', s = 60, zorder = 2, marker="^");

### Compute the p-value

We can use some array functions to quickly count how many of our simulated statistics were at least as large as our observed statistic.

In [None]:
np.count_nonzero(statistics >= observed_difference)

Dividing this count by the number of shuffles that took place gives the p-value.

In [None]:
np.count_nonzero(statistics >= observed_difference) / repetitions
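
The arithmetic behind the p-value can be checked by hand on a small example. The numbers below are made up, and the names `simulated` and `observed` are placeholders rather than the notebook's variables. The first calculation mirrors the one-sided comparison used above. The second is a two-sided variant, offered only as an assumption about what one might do if differences in either direction counted as extreme.

```python
# Made-up simulated statistics and observed value, for illustration only
simulated = [-0.5, 0.5, 0.0, 1.0, -1.0, 0.25, 0.75, -0.25]
observed = 0.75

# One-sided: how many simulated values are at least as large as the observed one
count_at_least = sum(s >= observed for s in simulated)
p_one_sided = count_at_least / len(simulated)

# Two-sided variant: compare sizes, ignoring sign
count_as_extreme = sum(abs(s) >= abs(observed) for s in simulated)
p_two_sided = count_as_extreme / len(simulated)

print(p_one_sided, p_two_sided)
```

With these numbers, 2 of the 8 simulated values are at least 0.75, giving a one-sided p-value of 0.25, and 3 of the 8 are at least 0.75 in absolute value, giving 0.375.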