# Investigating helicopter data

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Loading the data
To load the data from a `.csv` file into a Table, you can use the `read_table` function as shown below. Run the code below to store the entire dataset into a Table named `helicopter`.

In [None]:
helicopter = Table.read_table("data/helicopters2024.csv")

## Investigating the data

You will likely want to learn a bit of summary information regarding your data. There are several commands to help you do this quickly.


### Inspecting the data
You can start by looking at the first 10 rows of the Table just by running a code cell with the Table's name.

In [None]:
helicopter

### Size of your data
You can determine the dimensions of your data using the `.num_rows` and `.num_columns` attributes.

In [None]:
helicopter.num_rows

In [None]:
helicopter.num_columns

### Summary statistics

You may wish to calculate some summary statistics on your data. You can use the `numpy` library, which contains many statistical functions. First, choose the function you wish to use, then select the column you wish to use in your computation with the `.column()` method. The `nan` variants listed below ignore missing (`NaN`) values instead of returning `nan`.

Common `numpy` statistics functions are:

* **Arithmetic Mean**: `np.mean` / `np.nanmean`
* **Median**: `np.median` / `np.nanmedian`
* **Standard Deviation**: `np.std` / `np.nanstd`
* **Variance**: `np.var` / `np.nanvar`
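
The paired names above differ in how they treat missing values. A minimal sketch on a made-up array (the `times` values are hypothetical, not taken from the helicopter data):

```python
import numpy as np

# Hypothetical flight times in seconds; one trial was not recorded.
times = np.array([1.5, 2.0, np.nan, 2.5])

plain_mean = np.mean(times)    # any NaN in the input makes the result NaN
nan_mean = np.nanmean(times)   # skips NaN entries: (1.5 + 2.0 + 2.5) / 3

print(plain_mean, nan_mean)
```

If a plain `np.mean` on one of your columns comes back as `nan`, that is usually a sign the column contains missing values.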

In [None]:
np.mean(helicopter.column('time'))

In [None]:
long_mean = np.mean(helicopter.where('rotor','L').column('time'))

In [None]:
short_mean = np.mean(helicopter.where('rotor','S').column('time'))

### Grouping your data

You'll often want to perform calculations on a particular subgroup of your dataset. You can use the `.group` function to help you perform such a task.

The `.group` function takes as its argument the label of the column that contains the categories. By default it returns a table of counts of rows in each category.
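
To make "a table of counts of rows in each category" concrete, here is a small sketch of the same counting done directly with `numpy`. The `rotor` array is made up, and this only illustrates the idea; it is not how the `datascience` library implements `.group`.

```python
import numpy as np

# Hypothetical rotor labels for six trials ('L' = long, 'S' = short).
rotor = np.array(['L', 'S', 'L', 'L', 'S', 'L'])

# np.unique returns the distinct labels in sorted order and, with
# return_counts=True, how many times each label appears.
labels, counts = np.unique(rotor, return_counts=True)

for label, count in zip(labels, counts):
    print(label, count)
```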

In [None]:
helicopter.group('rotor')

In [None]:
helicopter.group(['rotor','weight'])

## Filtering the data

You may want to use only part of your dataset at a time. You can use the `where` function to filter down to just the rows/observations you're interested in. **Note:** This does not modify the original table at all; it creates a new table that contains only the requested rows.

Common actions used in filtering are:

* `are.equal_to(x)`
* `are.not_equal_to(x)`
* `are.above(x)`
* `are.above_or_equal_to(x)`
* `are.below(x)`
* `are.below_or_equal_to(x)`
* `are.between(x, y)`
* `are.between_or_equal_to(x,y)`
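
Each predicate above amounts to a comparison applied to every value in a column. The sketch below mimics three of them with `numpy` boolean masks on a made-up array. It assumes that `are.between(x, y)` includes `x` and excludes `y`, while `are.between_or_equal_to(x, y)` includes both ends; check the `datascience` documentation if the boundary cases matter for your data.

```python
import numpy as np

# Hypothetical flight times in seconds.
times = np.array([1.0, 1.5, 2.0, 2.5, 3.0])

# Comparable to are.above(2.0): strictly greater than 2.0.
above = times[times > 2.0]

# Comparable to are.between(1.5, 2.5): lower bound included, upper excluded.
between = times[(times >= 1.5) & (times < 2.5)]

# Comparable to are.between_or_equal_to(1.5, 2.5): both bounds included.
between_or_equal = times[(times >= 1.5) & (times <= 2.5)]

print(above, between, between_or_equal)
```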

In [None]:
helicopter.where('rotor','L')

In [None]:
# Does the same thing as .where('rotor','L') because are.equal_to is the default predicate
helicopter.where('rotor',are.equal_to('L'))

In [None]:
# chaining .where calls keeps only the trials with long rotors that were also weighted
helicopter.where('rotor','L').where('weight','Y')

If you want to save the result of a filter, you need to assign it a new name. It is a best practice not to overwrite any tables in a notebook, but instead create a new table with a new name to store any filtered or otherwise modified data that you intend to use again later in the notebook.

The code below will create a new Table `good_helicopter` that contains only the observations that have numerical times and were not labeled as having an anomaly in the flight.

In [None]:
good_helicopter = helicopter.where('time', are.not_equal_to("nan")).where('irregular', are.not_equal_to("Y"))
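
One caveat worth checking: if the `time` column was read in as floating-point numbers, missing values are stored as `np.nan` rather than as the string `"nan"`, and `NaN` compares unequal to everything, including itself. Whether this applies depends on how the column is stored in your file, which is an assumption here. The sketch below uses a made-up array to show the comparison and a mask built with `np.isnan`.

```python
import numpy as np

# Hypothetical flight times; the middle trial is missing.
times = np.array([1.5, np.nan, 2.0])

# NaN is never equal to anything, so an equality test cannot find it.
found_by_equality = bool(times[1] == np.nan)

# np.isnan marks the missing entries; ~ flips the mask to keep the rest.
keep = ~np.isnan(times)
clean_times = times[keep]

print(found_by_equality, clean_times)
```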

## Visualizing the data

You can easily create bar charts, scatter plots, line plots, and histograms depending on the data you are hoping to visualize.

### Bar charts

To create a bar chart you need a Table that contains a frequency count of a categorical variable. We can use the `.group` function to create such a table, and then generate the bar chart using the `bar()` function. In both cases, be sure to specify which column you wish to group by and visualize. **Note:** When creating frequency bar charts, it's usually best to sort the data before creating the chart. You can use the `.sort` function to sort by frequency in descending order before creating the bar chart.

In [None]:
helicopter.group('rotor',np.mean)

In [None]:
good_helicopter.group(['rotor','weight'],np.mean)

In [None]:
# sort the counts in descending order before drawing the bars
helicopter.group('rotor').sort('count', descending=True).bar('rotor')

### Histograms

To investigate a distribution of numerical values, use a histogram! The `.hist()` function can handle this quickly. Specify the numerical column you wish to create a histogram from.

In [None]:
good_helicopter.hist('time',bins=np.arange(1,3,.25),unit="second",group='rotor')
plots.scatter(long_mean, -0.01, color = 'red', s = 60, marker="^", zorder = 2);
plots.scatter(short_mean, -0.01, color = 'green', s = 60, marker="^", zorder = 2);

In [None]:
# percentage of trials that fall in each time bin
good_helicopter.bin('time', bins=np.arange(1,3,.25)).column('time count')/good_helicopter.num_rows*100

# A/B Testing

To perform an A/B test, we'll need to compute the same statistic for each of the many simulations we run. Writing a function that returns a test statistic is a good way to save time. Write a function named `find_test_stat` that takes the arguments `table`, `labels_col`, and `values_col` and calculates the test statistic required for A/B testing.

The `table` passed into this function will be a permutation of our original table and structured the same way. `labels_col` will be passed a string that matches the column label in `table` that contains the labels of the categories you'll be grouping by. `values_col` will be passed a string that matches the column label that contains the values that you'll be using to compute the test statistic.

When you've written this function, you should be able to pass it any table and two column labels, and it should compute the test statistic required for an A/B test, not just for this problem but for any problem. For example, running `find_test_stat(helicopter, "rotor", "time")` should return the exact same test statistic you generated in an earlier question, and running `find_test_stat(helicopter, "weight", "time")` should compute the test statistic from the `"time"` values grouped by the `"weight"` column.

In [None]:
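def find_test_stat(table, labels_col, values_col):
    # average of values_col within each category of labels_col
    average_values = table.select(labels_col, values_col).group(labels_col, np.average).column(1)
    # test statistic: first group's mean minus second group's mean, in the order .group returns them
    return average_values.item(0) - average_values.item(1)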

observed_difference = find_test_stat(helicopter, "rotor", "time")

In [None]:
# our test statistic is the difference of means between long rotors and short rotors
long_mean - short_mean

In [None]:
find_test_stat(helicopter,"weight","time")

Write a function `simulate_and_test_statistic` to compute one trial of our A/B test. Your function should run a simulation and return a test statistic.

**Hint:** You can "shuffle" the labels by using `.sample(with_replacement = False)` on the entire Table, and then select the column that contains the newly shuffled labels. Then, you can either overwrite the existing labels, **or**, extend the table with a new column labeled something similar to "shuffled labels". Just make sure you pass the correct label on to the `find_test_stat` function!

In [None]:
def simulate_and_test_statistic(table, labels_col, values_col):
    shuffled_table = table.with_column("Shuffled labels",table.sample(with_replacement=False).column(labels_col))
    # print (shuffled_table)
    return find_test_stat(shuffled_table,"Shuffled labels",values_col)
    
simulate_and_test_statistic(helicopter, "rotor", "time")

In [None]:
# simulate again
simulate_and_test_statistic(helicopter, "rotor", "time")
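
The idea behind one simulated trial can also be sketched without a Table: keep the values in place, shuffle only the labels, and recompute the difference of group means. The arrays and the helper `difference_of_means` below are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical data: rotor labels and flight times for six trials.
labels = np.array(['L', 'L', 'L', 'S', 'S', 'S'])
times = np.array([2.4, 2.6, 2.5, 1.8, 1.9, 2.0])

def difference_of_means(labels, values):
    """Mean of the 'L' group minus mean of the 'S' group."""
    return values[labels == 'L'].mean() - values[labels == 'S'].mean()

observed = difference_of_means(labels, times)

# Shuffle only the labels; the times stay in their original order.
shuffled_labels = rng.permutation(labels)
simulated = difference_of_means(shuffled_labels, times)

print(observed, simulated)
```

Because shuffling only reorders the labels, each group keeps its original size; only the assignment of times to groups changes.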

## Multiple simulations
Use the `simulate_and_test_statistic` function to simulate 5,000 trials of our A/B test and assign an array of the test statistics to `differences`.

In [None]:
differences = make_array()
for i in np.arange(5000):
    differences = np.append(differences,simulate_and_test_statistic(helicopter,"rotor","time"))
differences
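
`np.append` returns a new array on every call, which is fine for 5,000 trials. As an alternative sketch, the results can be written into an array that is created once at its final size. Here `one_trial` is a hypothetical stand-in for `simulate_and_test_statistic(helicopter, "rotor", "time")` so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_trial():
    """Hypothetical stand-in for one call to simulate_and_test_statistic."""
    return rng.normal()

num_trials = 5000
results = np.zeros(num_trials)   # one slot per trial, allocated once
for i in range(num_trials):
    results[i] = one_trial()

print(results[:5])
```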

Run the cell below to view a histogram of your simulated test statistics plotted with your observed test statistic. Think about what this might imply about the p-value and if there is sufficient evidence to reject the null hypothesis.

In [None]:
Table().with_column('Difference Between Group Means', differences).hist()
plots.scatter(observed_difference, -0.01, color = 'red', s = 60, marker="^", zorder = 2);

In [None]:
empirical_p = np.mean(differences>=observed_difference)
empirical_p
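
The last cell computes a one-sided empirical p-value: the proportion of simulated differences that are at least as large as the observed one. The sketch below repeats that calculation on made-up numbers and, for comparison, shows a two-sided version that counts differences at least as far from zero in either direction. Which version is appropriate depends on your alternative hypothesis.

```python
import numpy as np

# Hypothetical simulated differences and observed test statistic.
simulated = np.array([-0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.5, -0.5, 0.05, 0.6])
observed = 0.4

# One-sided: share of simulations at least as large as the observed value.
one_sided_p = np.mean(simulated >= observed)

# Two-sided: share of simulations at least as far from zero as the observed value.
two_sided_p = np.mean(np.abs(simulated) >= abs(observed))

print(one_sided_p, two_sided_p)
```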