# This is a notebook

A notebook is made up of cells. Each cell can either:
* contain text and images (a markdown cell)
* contain code that can be executed (a code cell)

You can set a cell's type by:
1. selecting the cell by clicking on it
2. using the drop-down menu at the top of the notebook

## Python uses libraries

Most programming languages come with a standard set of features and functions. People can add to this standard set of functions using libraries or modules. Some common libraries for working with data are:
* `numpy`
* `pandas`
* `matplotlib`
* `plotly`

These allow for statistical analysis, visualization, and more.

These libraries are designed for folks in industry and in research to conduct high-level data science work. As such, they are not always the most user-friendly for beginners. UC Berkeley has created a library called `datascience` that attempts to simplify the syntax for writing code and hides a lot of the technical bits from the user/student. We'll make use of this library throughout the week. If you're interested in learning more about how the library works, you can visit its website at [https://www.data8.org/datascience/](https://www.data8.org/datascience/).

The code cell below loads the necessary features and functions from the `datascience` library so that the code later in the notebook runs correctly. Run the cell by clicking into it, then either pressing the "play" button near the top of the notebook or pressing `Shift+Enter` on your keyboard.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Loading the data
To load the data from a `.csv` file into a Table, you can use the `read_table` function as shown below. Run the code below to store the entire dataset into a Table named `helicopter`.

In [None]:
helicopter = Table.read_table("data/DSSI23_helicopter_data.csv")

## Investigating the data

You will likely want to learn a bit of summary information regarding your data. There are several commands to help you do this quickly.


### Inspecting the data
You can start by looking at the first 10 rows of the Table just by running a code cell with the Table's name.

In [None]:
helicopter

This allows you to see the variable names, and the types of data they contain.

You can see additional rows by appending the `show()` function to the Table reference after a period `.` and specifying the number of rows to display between the parentheses. The code below will show 30 rows of the Table.

In [None]:
helicopter.show(30)

### Size of your data
You can determine the dimensions of your data using the `.num_rows` and `.num_columns` attributes.

In [None]:
helicopter.num_rows

In [None]:
helicopter.num_columns

### Summary statistics

You may wish to calculate some summary statistics on your data. You can use the `numpy` library, which contains many statistical functions. First, choose the function you wish to use. Then select the column you wish to use in your computation with the `.column()` command.

For example, if I wished to compute the mean time it took for helicopters to fall, I would select the `np.nanmean` function (compute the mean ignoring missing, or nan, values) and then provide the `"Time"` column from the `helicopter` Table as the input.

In [None]:
np.nanmean( helicopter.column("Time") )

Common `numpy` statistics functions are:

* **Arithmetic Mean**: `np.mean` / `np.nanmean`
* **Median**: `np.median` / `np.nanmedian`
* **Standard Deviation**: `np.std` / `np.nanstd`
* **Variance**: `np.var` / `np.nanvar`
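
To see why the `nan` variants matter, here is a minimal sketch on a small made-up array (the values are illustrative and are not taken from the helicopter data). A single missing value makes `np.mean` return `nan`, while `np.nanmean` skips it.

```python
import numpy as np

# Made-up drop times in seconds; np.nan marks a missing measurement.
times = np.array([1.5, 2.0, np.nan, 2.5])

plain_mean = np.mean(times)       # nan, because one value is missing
nan_mean = np.nanmean(times)      # mean of the three valid values: 2.0
nan_median = np.nanmedian(times)  # median of the three valid values: 2.0

print(plain_mean, nan_mean, nan_median)
```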

### Grouping your data

You'll often want to perform calculations on a particular subgroup of your dataset. You can use the `.group` function to help you perform such a task.

The `.group` function takes as its argument the label of the column that contains the categories. By default, it returns a table of counts of rows in each category. Thus `group` creates a distribution table that shows how the individuals (helicopter drop observations) are distributed among the categories (Rotor Length).

In [None]:
helicopter.group("Rotor Length")

You can optionally provide a second input to the `group` function that will apply a summary function to the remaining variables within each group. For example, specifying the `np.nanmean` function will apply `np.nanmean` to the "Time", "Color", "Group", "Anomaly", and "Helicopter ID" columns within each group found in the "Rotor Length" column.

In [None]:
helicopter.group("Rotor Length", np.nanmean)

This was really only helpful for the "Time" column, since it was the only column that contained numerical data.
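
The two uses of `group` above can be sketched in plain NumPy. This is only an illustration of the idea on made-up values (the arrays `rotor_length` and `time` are hypothetical and are not read from the CSV file), not a description of how the `datascience` library implements grouping.

```python
import numpy as np

# Made-up data: one rotor length and one drop time per observation.
rotor_length = np.array(["short", "long", "short", "long", "long"])
time = np.array([1.0, 2.0, np.nan, 3.0, 4.0])

# Count the rows in each category, similar to group with one input.
categories, counts = np.unique(rotor_length, return_counts=True)

# Average the times within each category, similar to group with np.nanmean.
group_means = {
    str(category): float(np.nanmean(time[rotor_length == category]))
    for category in categories
}

print(dict(zip(categories.tolist(), counts.tolist())))
print(group_means)
```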

## Filtering the data

You may want to use only part of your dataset at a time. You can use the `where` function to specify how to filter down to just the rows/observations you're interested in using. **Note:** This does not modify the original table at all; it just creates a new table that contains the requested rows.

The code below will only retain rows of `helicopter` where the value in the "Time" column is not equal to "nan".

In [None]:
helicopter.where('Time', are.not_equal_to("nan"))

You can chain multiple filters together to fine-tune your selection process.

In [None]:
helicopter.where('Time', are.not_equal_to("nan")).where('Anomaly', are.equal_to("No"))

Other common actions used in filtering are:

* `are.above(x)`
* `are.above_or_equal_to(x)`
* `are.below(x)`
* `are.below_or_equal_to(x)`
* `are.between(x, y)`
* `are.between_or_equal_to(x, y)`
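
As a rough guide to what these conditions select, the sketch below writes a few of them as NumPy comparisons on made-up numbers. It assumes, based on my reading of the `datascience` documentation, that `are.between(x, y)` includes `x` but excludes `y`, while `are.between_or_equal_to(x, y)` includes both ends. It is worth confirming that against the library's documentation.

```python
import numpy as np

# Made-up values to filter.
values = np.array([1, 2, 3, 4, 5])

# Roughly what are.above(2) keeps: strictly greater than 2.
above_2 = values[values > 2]

# Assumed behavior of are.between(2, 4): 2 is included, 4 is not.
between_2_4 = values[(values >= 2) & (values < 4)]

# Assumed behavior of are.between_or_equal_to(2, 4): both ends included.
between_2_4_inclusive = values[(values >= 2) & (values <= 4)]

print(above_2, between_2_4, between_2_4_inclusive)
```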

If you want to save the result of a filter, you need to assign it a new name. It is a best practice not to overwrite any tables in a notebook, but instead create a new table with a new name to store any filtered or otherwise modified data that you intend to use again later in the notebook.

The code below will create a new Table named `good_helicopter` that contains only the observations that have numerical times and were not labeled as containing an anomaly in the flight.

In [None]:
good_helicopter = helicopter.where('Time', are.not_equal_to("nan")).where('Anomaly', are.equal_to("No"))
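
The same two-step filter can be sketched with NumPy boolean masks. This is an illustration on made-up arrays, not how `where` is implemented. One caveat is worth checking in your own data: if missing times are stored as floating-point NaN rather than as the text "nan" (an assumption about the file, not something confirmed here), a comparison against the string may not remove them, because a NaN is not equal to anything. In that case `np.isnan` is the reliable test.

```python
import numpy as np

# Made-up columns standing in for "Time" and "Anomaly".
time = np.array([1.5, np.nan, 2.5, 3.0])
anomaly = np.array(["No", "No", "Yes", "No"])

# A floating-point NaN is not equal to the text "nan".
print(float("nan") != "nan")

# Build one True/False mask per condition.
has_time = ~np.isnan(time)
no_anomaly = anomaly == "No"

# Combining the masks mirrors chaining two filters.
good_time = time[has_time & no_anomaly]

print(good_time)
print(len(time))  # the original array still has all of its values
```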

## Visualizing the data

You can easily create bar charts, scatter plots, line plots, and histograms depending on the data you are hoping to visualize.

### Bar charts

To create a bar chart, you need a Table that contains a frequency count of a categorical variable. We can use the `.group` function to create such a table, and then generate the bar chart using the `barh()` function, which draws horizontal bars. In both cases, be sure to specify the column you wish to group by and visualize.

In [None]:
good_helicopter.group("Group").barh("Group")

**Note:** When creating frequency bar charts, it's usually best to sort the data before creating the chart. The code below will use the `.sort` function to sort by the frequency in descending order before creating the bar chart.

In [None]:
good_helicopter.group("Group").sort("count", descending=True).barh("Group")

### Histograms

To investigate a distribution of numerical values, use a histogram! The `.hist()` function can handle this quickly. Specify the numerical column you wish to create a histogram from.

In [None]:
good_helicopter.hist("Time")

You might wish to see how the distribution of the variable differs among groups. You can specify an optional input to the `.hist()` function that specifies which column contains your grouping variable. For example, the code below will group by "Rotor Length" and then show the distribution of the "Time" variable for each group.

In [None]:
good_helicopter.hist("Time", group="Rotor Length")
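
Behind any histogram is a simple count of how many values fall into each bin. The sketch below does that counting with `np.histogram` on made-up times. The bin edges are chosen only for illustration, and `.hist()` may pick different bins and scale the bar heights differently.

```python
import numpy as np

# Made-up drop times in seconds.
times = np.array([1.0, 1.2, 1.8, 2.1, 2.9])

# Two bins: from 1 up to (not including) 2, and from 2 up to and including 3.
counts, edges = np.histogram(times, bins=[1.0, 2.0, 3.0])

print(counts)  # number of values in each bin
print(edges)   # the bin boundaries that were used
```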