# This is a notebook

A notebook is made up of cells. Those cells can either:
* contain text and images (a markdown cell)
* contain code that can be executed (a code cell)

You can set the type of a cell by:
1. selecting the cell by clicking on it
2. using the drop-down menu at the top of the notebook

## Python uses libraries

Most programming languages come with a standard set of features and functions. People can add to this standard set using libraries or modules. Some common libraries for working with data are:
* `numpy`
* `pandas`
* `matplotlib`
* `plotly`

These allow for statistical analysis, visualization, and more.

These libraries are designed for folks in industry and in research to conduct high-level data science work. As such, they are not always the most user-friendly for beginners. UC Berkeley has created a library called `datascience` that attempts to simplify the syntax for writing code and hides a lot of the technical bits from the user/student. We'll make use of this library throughout the week. If you're interested in learning more about how the library works, you can visit its website at [https://www.data8.org/datascience/](https://www.data8.org/datascience/).

The code cell below loads the necessary features and functions from the `datascience` library so that the code later in the notebook executes correctly. Run the cell by clicking into it, then either pressing the "play" button near the top of the notebook or pressing `shift+enter` on your keyboard.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Loading the data
To load the data from a `.csv` file into a Table, you can use the `read_table` function as shown below. Run the code below to store the entire dataset into a Table named `world_data`.

In [None]:
world_data = Table.read_table("data/world_data.csv")

## Investigating the data

You will likely want to learn a bit of summary information regarding your data. There are several commands to help you do this quickly.


### Inspecting the data
You can start by looking at the first 10 rows of the Table just by running a code cell with the Table's name.

In [None]:
world_data

This allows you to see the variable names, and the types of data they contain.

You can see additional rows by appending the `show()` function to the Table reference after a period `.` and specifying the number of rows to display between the parentheses. The code below will show 30 rows of the Table.

In [None]:
world_data.show(30)

### Size of your data
You can determine the dimensions of your data using the `.num_rows` and `.num_columns` commands.

In [None]:
world_data.num_rows

In [None]:
world_data.num_columns

### Summary statistics

You may wish to calculate some summary statistics on your data. You can use the `numpy` library, which contains many statistical functions. First, select the function you wish to use, then select the column you wish to use in your computation using the `.column()` command.

For example, if I wished to compute the mean value of the data in the column labeled `"life_expectancy_years"`, I would select the `np.nanmean` function (compute the mean ignoring missing, or nan, values) and then provide the `"life_expectancy_years"` column from the `world_data` Table as the input.

In [None]:
np.nanmean( world_data.column("life_expectancy_years") )

Common `numpy` statistics functions are:

* **Arithmetic Mean**: `np.mean` / `np.nanmean`
* **Median**: `np.median` / `np.nanmedian`
* **Standard Deviation**: `np.std` / `np.nanstd`
* **Variance**: `np.var` / `np.nanvar`
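To see why each function has a `nan` variant, the short sketch below compares `np.mean` and `np.nanmean` on a made-up array that contains one missing value. The array `values` is an assumption for illustration, not data from `world_data`.

```python
import numpy as np

# Made-up values with one missing entry (np.nan).
values = np.array([70.0, 80.0, np.nan, 90.0])

plain_mean = np.mean(values)    # a single missing value makes the result nan
nan_mean = np.nanmean(values)   # missing values are skipped before averaging

print(plain_mean)
print(nan_mean)
```

If a column has no missing values, both versions give the same answer.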

### Grouping your data

You'll often want to perform calculations on a particular subgroup of your dataset. You can use the `.group` function to help you perform such a task.

The `.group` function takes as its argument the label of the column that contains the categories. By default it returns a table of counts of rows in each category. `group` creates a distribution table that shows how the individuals are distributed among the categories found in the indicated column.

In [None]:
world_data.group("region")

You can optionally provide a second input to the `group` function that will apply a summary function to the remaining variables within each group. For example, specifying the `np.nanmean` function will apply `np.nanmean` to all other numerical columns in the dataset. If there is non-numerical data in those columns, the result will be blank.

In [None]:
world_data.group("region", np.nanmean)

## Filtering the data

You may want to use only part of your dataset at a time. You can use the `where` function to filter down to just the rows/observations you're interested in. **Note:** This does not modify the original table at all; it creates a new table that contains the requested rows.

The code below will only retain rows of `world_data` where the value in the "year" column is equal to 2021.

In [None]:
world_data.where('year', are.equal_to(2021))

You can chain multiple filters together to fine-tune your selection process.

In [None]:
world_data.where('year', are.equal_to(2021)).where('region', are.equal_to("asia"))

Other common actions used in filtering are:

* `are.not_equal_to(x)`
* `are.above(x)`
* `are.above_or_equal_to(x)`
* `are.below(x)`
* `are.below_or_equal_to(x)`
* `are.between(x, y)`
* `are.between_or_equal_to(x,y)`

If you want to save the result of a filter, you need to assign it a new name. It is a best practice not to overwrite any tables in a notebook, but instead create a new table with a new name to store any filtered or otherwise modified data that you intend to use again later in the notebook.

The code below will create a new Table `world_data_2021` that only contains the observations that occurred in the year 2021.

In [None]:
world_data_2021 = world_data.where('year', are.equal_to(2021))

## Visualizing the data

You can easily create bar charts, scatter plots, line plots, and histograms depending on the data you are hoping to visualize.

### Bar charts

To create a bar chart, you need a Table that contains a frequency count of categorical variables. We can use the `.group` function to create such a table, and then generate the bar chart using the `barh()` function.

Recall that the `group` function creates a new two-column table.

In [None]:
world_data_2021.group("region")

After grouping, you can chain the `.barh()` function to the resulting table, specifying the column that contains the categorical variable.

In [None]:
world_data_2021.group("region").barh("region")

**Note:** When creating frequency bar charts, it's usually best to sort the data before creating the chart. The code below will use the `.sort` function to sort by the frequency in descending order before creating the bar chart.

In [None]:
world_data_2021.group("region").sort("count", descending=True).barh("region")

### Scatter plot

When investigating pairs of quantitative data, a scatter plot is a great tool. You can use `.scatter()` on a Table to create one quickly. Specify the labels of the columns that contain the data.

In [None]:
world_data_2021.scatter("fertility_cpw", "child_mortality_dpk")

You may wish to see how your data differs by category. You can assign each categorical group a different color on your scatter plot by including an optional `group` argument to the function.

In [None]:
world_data_2021.scatter("fertility_cpw", "child_mortality_dpk", group="region")

### Line plots

For time-series data, a line plot helps to illustrate changes over time. Use the `.plot()` function to create a line plot.

First, let's create a Table that has all the years of information just for the United States:

In [None]:
united_states = world_data.where("name", are.equal_to("United States"))

Then, let's look at how child mortality has changed over time.

In [None]:
united_states.plot("year", "child_mortality_dpk")

### Histograms

To investigate a distribution of numerical values, use a histogram! The `.hist()` function can handle this quickly. Specify the numerical column you wish to create a histogram from.

In [None]:
world_data_2021.hist("child_mortality_dpk")
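To get a feel for what a histogram summarizes, the sketch below uses `np.histogram` from `numpy` to count how many values fall into each bin. The values and bin edges are made up for illustration. A chart from `.hist()` is built on the same kind of binning, though the bar heights it draws may be scaled rather than shown as raw counts.

```python
import numpy as np

# Made-up values and bin edges, for illustration only.
values = np.array([1, 2, 2, 3, 9])
bin_edges = [0, 5, 10]

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, edges = np.histogram(values, bins=bin_edges)

print(counts)  # four values fall in the first bin, one in the second
```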

You might wish to see how the distribution of the variable differs among groups. You can specify an optional input to the `.hist()` function that specifies which column contains your grouping variable. 

For example, the code below will group by "region" and then show the distribution of the "life_expectancy_years" variable for each group, with each histogram in a different color.

In [None]:
world_data_2021.hist("life_expectancy_years", group="region")