# ~~First~~ Fast Python Notebook

An accelerated guide to analyzing data with the [Python](https://www.python.org/) programming language and a [Jupyter](https://jupyter.org/) notebook
   
By [Ben Welsh](https://palewi.re/who-is-ben-welsh/)

First developed in 2016, ["First Python Notebook"](https://palewi.re/docs/first-python-notebook/) is a tutorial that guides students through a data-driven investigation of money in California politics. It is most commonly taught as a six-hour, in-person class. This document is an abbreviated spinoff intended to be taught online in two to three hours.

You will learn just enough of the Python computer programming language to work with the [pandas](https://pandas.pydata.org/) library, a popular open-source tool for analyzing data. The course will teach you how to read, filter, join, group, aggregate and rank structured data by recreating a helicopter accident analysis [published by the Los Angeles Times](https://github.com/datadesk/helicopter-accident-analysis).

## What is a Jupyter notebook?

<img src="https://palewi.re/docs/first-python-notebook/_static/img/labpreview.webp" style="max-width:640px;">

A [Jupyter](https://jupyter.org/) notebook is a browser-based interface where you can write, run, remix and republish code. It is free software that anyone can install and run.

[Scientists](https://nbviewer.jupyter.org/github/robertodealmeida/notebooks/blob/master/earth_day_data_challenge/Analyzing%20whale%20tracks.ipynb), [scholars](https://nbviewer.jupyter.org/github/nealcaren/workshop_2014/blob/master/notebooks/5_Times_API.ipynb), [investors](https://github.com/rsvp/fecon235/blob/master/nb/fred-debt-pop.ipynb) and [corporations](https://netflixtechblog.com/notebook-innovation-591ee3221233) use Jupyter to create and share their research. It is also used by journalists to develop stories and show their work. Examples include:

*   [“The Tennis Racket”](https://github.com/BuzzFeedNews/2016-01-tennis-betting-analysis/blob/master/notebooks/tennis-analysis.ipynb) by BuzzFeed and the BBC
*   [“Machine bias”](https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb) by ProPublica
*   [“As Opioid Crisis Ramped Up, Pills Flowed Into Vermont by the Millions”](https://github.com/asuozzo/arcos-opioid-analysis-vt) by Seven Days
*   [More than 35 different notebooks](https://github.com/datadesk/notebooks) published by the Los Angeles Times

There are numerous ways to install and configure Jupyter notebooks. This class is taught using [JupyterLite](https://jupyterlite.readthedocs.io/), a lightweight distribution that runs entirely in your web browser. For instructions on how to install a more powerful version on your computer, consult [the full edition of "First Python Notebook."](https://palewi.re/docs/first-python-notebook/jupyter_desktop.html)

Once you have this notebook up and running, you're ready to write Python in a code cell. Do not stress. There is nothing too fancy about it. You can start by just doing a little simple math.

Select the box below, then hit the play button in the toolbar above the notebook or hit `SHIFT+ENTER` on your keyboard.

In [None]:
2+2

There. You have just run your first Python code. You have entered two integers and added them together using the plus sign operator.

Not so bad, right? Now try writing in your own math problem in the next cell. Maybe `2+3` or `2+200`. Whatever strikes your fancy. After you've typed it in, hit the play button or `SHIFT+ENTER`.

This to-and-fro of writing Python code in a cell and then running it is the rhythm of working in a notebook. If you get an error after you run a cell, look carefully at your code and see that it exactly matches what’s been written in the example.
    
Here's an example of an error that I've added intentionally:

In [None]:
2+2+

Don’t worry. Code crashes are a normal part of life for computer programmers. They’re usually caused by small typos that can be quickly corrected.

In [None]:
2+2+2

Over time you will gradually stack cells to organize an analysis that runs from top to bottom. The cells can contain variables, functions and other Python tools.

<div class="alert alert-block alert-warning">
<p>If you’ve never written code before, we recommend <a href="https://docs.python.org/3/tutorial/introduction.html">&quot;An Informal Introduction to Python&quot;</a> and subsequent sections of python.org’s tutorial.</p>
</div>

A simple example would be storing your number in a variable in one cell:

In [None]:
number = 2

Then adding it to another number in the next:

In [None]:
number + 3

Change the `number` value to 3 and run both cells again. Instead of 5, it should now output 6.

In [None]:
number = 3

In [None]:
number + 3

Now try defining your own numeric variable and doing some math with it. You can name it whatever you want. Want to try some other math operations? The `-` sign does subtraction. Multiplication is `*`. Division is `/`.
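As a minimal sketch of those operators in action, the cell below uses a made-up variable name, `flight_hours`, chosen only for illustration. Any name and any numbers would work the same way.

```python
# Hypothetical variable, chosen only for this illustration
flight_hours = 12

# Subtraction, multiplication and division
difference = flight_hours - 2
product = flight_hours * 3
quotient = flight_hours / 4  # division returns a decimal number, even when it divides evenly

print(difference, product, quotient)  # 10 36 3.0
```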

Once you’ve got the hang of making the notebook run, you’re ready to introduce pandas, a powerful Python analysis library that can do a whole lot more than add a few numbers together.

## What is pandas? 
  
<img src="https://palewi.re/docs/first-python-notebook/_static/img/pandas-pypi.png" style="max-width:640px;">

Lucky for us, Python is filled with functions to do almost anything you’d want to do with a programming language: [navigate the web](http://docs.python-requests.org/), [parse data](https://docs.python.org/2/library/csv.html), [interact with a database](http://www.sqlalchemy.org/), [run fancy statistics](https://www.scipy.org/), [build a pretty website](https://www.djangoproject.com/) and [so](https://www.crummy.com/software/BeautifulSoup/) [much](http://www.nltk.org/) [more](https://pillow.readthedocs.io/en/stable/).

Creative people have put these tools to work to get a [wide range of things](https://www.python.org/about/success/) done in the academy, the laboratory and even in outer space.

Some of those tools are included in a toolbox that comes with the language, known as the standard library. Others have been built by members of Python’s developer community and need to be separately downloaded and installed. One third-party tool that’s important for this class is called [pandas](https://pandas.pydata.org/). Invented by programmers at a [financial investment firm](https://www.aqr.com/), it has become a leading open-source library for accessing and analyzing data.

Here’s how to use pandas yourself. Run the following:
   

In [None]:
import pandas

If nothing happens, that’s good. It means you have it installed and ready to use.

Since pandas is created by a third party independent from the core Python developers, it may not be available by default if you manually installed Python and Jupyter. It’s available here because JupyterLite’s developers have curated a list of common utilities to include with their distribution. Consult our [advanced installation guide](https://palewi.re/docs/first-python-notebook/appendix/index.html) if the cell above threw an error.

Now let's run the same code again, but with a small addition.

In [None]:
import pandas as pd

This will alias the pandas library at the shorter variable name of `pd`. This is standard practice in the pandas community. You will frequently see examples of pandas code online using pd as shorthand. It’s not required, but it’s good to get in the habit so that your code will be better understood by other computer programmers.

Those two little letters contain dozens of data analysis tools that we’ll use in future lessons. They can import massive data files, compute advanced statistics, filter, sort, rank and do just about anything else you’d want to do.

We’ll get to all of that soon enough, but let’s start out with something simple. Let's run some simple stats.

## Calculating descriptive statistics

Start by making a list of numbers in a new notebook cell. To keep things simple, we'll start with all of the even numbers between zero and ten. Note the variable name I've assigned. Then press play.

In [None]:
my_list = [2, 4, 6, 8]

If you’re a skilled Python programmer, you can do some cool stuff with any list, including run statistics. But if you hand it over to pandas instead, you’ll be impressed by how easily you can analyze the data without much computer code.

In this case, it’s as simple as converting that plain Python list into what pandas calls a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). Here’s how to make it happen:

In [None]:
my_series = pd.Series(my_list)

Once the data becomes a `Series`, you can immediately run a wide range of <a href="https://en.wikipedia.org/wiki/Descriptive_statistics">descriptive statistics</a>. Let’s try a few. First, let’s sum all the numbers.

In [None]:
my_series.sum()

Then find the maximum value.

In [None]:
my_series.max()

The minimum value.

In [None]:
my_series.min()

How about the average, which is also known as the mean?

In [None]:
my_series.mean()

The median?

In [None]:
my_series.median()

The standard deviation?

In [None]:
my_series.std()

Finally, all of the above, plus a little more about the distribution, in one simple command.

In [None]:
my_series.describe()

Before you move on, go back to the `my_list` variable and change the list. Maybe add a few more values. Or switch to odds. Then rerun all the cells above. You'll see all the statistics update to reflect the different dataset.

Substitute in a series of 10 million records and your notebook would calculate all the same statistics without you needing to write any more code. Once your data, however large or complex, is imported into pandas, simple statistics become a snap.
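For comparison, here is a rough sketch of how the same figures could be computed from a plain list without pandas, using Python's built-in functions and the standard library's `statistics` module. The variable names on the left are made up for this illustration.

```python
import math
import statistics

my_list = [2, 4, 6, 8]

# Built-in functions cover the simplest cases
list_sum = sum(my_list)
list_max = max(my_list)
list_min = min(my_list)

# The statistics module handles the rest
list_mean = statistics.mean(my_list)
list_median = statistics.median(my_list)
# Sample standard deviation, the same default that the Series std method uses
list_std = statistics.stdev(my_list)

print(list_sum, list_max, list_min, list_mean, list_median)
```

Each statistic needs its own call here, whereas `describe` returned them all at once.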

## Introducing DataFrames
    
Now it’s time to get our hands on some real data. In 2018, the Los Angeles Times published an investigation headlined, <a href="https://www.latimes.com/projects/la-me-robinson-helicopters/">"The Robinson R44, the world’s best-selling civilian helicopter, has a long history of deadly crashes".</a>
    
It reported that the Robinson R44 led all major models with the highest fatal accident rate from 2006 to 2016. The analysis was <a href="https://github.com/datadesk/helicopter-accident-analysis">published on GitHub</a> as a series of Jupyter notebooks. 

The analysis was based on two key datasets: 
   
1. The National Transportation Safety Board's <a href="https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx">Aviation Accident Database</a>
2. The Federal Aviation Administration's <a href="https://www.faa.gov/data_research/aviation_data_statistics/general_aviation/">General Aviation and Part 135 Activity Survey</a>

After a significant amount of work gathering and cleaning the source data, the number of accidents for each helicopter model was normalized using the flight hours estimates in the survey. For the purposes of this demonstration, we will read in tidied versions of each file that are ready for analysis.
    
The data are structured in rows of comma-separated values. This is known as a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). It is the most common way you will find data published online.

The pandas library is able to read in files from a variety of formats, including CSV. In our next cell, we'll use pandas' `read_csv` function to read in `ntsb-accidents.csv`.

In [None]:
pd.read_csv("ntsb-accidents.csv")

You should see a big table like the one above. It is a DataFrame where pandas has structured the CSV data into rows and columns, just like Excel or other spreadsheet software might.

A major advantage of Jupyter over spreadsheets is that rather than manipulating the data through a haphazard series of clicks and keypunches we will be gradually grinding it down using a computer programming script that is transparent and reproducible.

In order to do more with your DataFrame, we need to store it so it can be reused in subsequent cells. We can do this by saving it in a variable, which is a fancy computer programming word for a named shortcut where we save our work as we go.

In [None]:
accident_list = pd.read_csv("ntsb-accidents.csv")

After you run it, you shouldn’t see anything. That’s a good thing. It means our DataFrame has been saved under the name `accident_list`, which we can now begin interacting with in the cells that follow.

We can do this by calling “methods” that pandas makes available to all DataFrames. You may not have known it at the time, but `read_csv` works in much the same way, though it belongs to the pandas library itself rather than to a DataFrame. There are dozens of DataFrame methods that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.

To preview the first few rows of the dataset, try the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html">`head`</a> method.

In [None]:
accident_list.head()

It does the first five by default. If you want a different number, submit it as an input.

In [None]:
accident_list.head(1)

To get a look at all of the columns and what type of data they store, try the [`info`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method.

In [None]:
accident_list.info()

Look carefully at the results and you'll see we have 163 fatal accidents.

## Inspecting columns

To see the contents of a column separate from the rest of the DataFrame, add the column’s name to the DataFrame’s variable following a period. We’ll begin with the `latimes_make_and_model` column, which records the standardized name of the helicopter that crashed.

In [None]:
accident_list.latimes_make_and_model

That will list the column out as a `Series`, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.

<div class="alert alert-block alert-warning">    
    <p>You can also access columns a second way, like this: accident_list['latimes_make_and_model'].</p><p>This method isn’t as pretty, but it’s required if your column has a space in its name, which would break the simpler dot-based method.</p>
</div>
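A small sketch of the two access styles, using a made-up DataFrame rather than the NTSB file. The `total fatalities` column name, with its space, is invented here to show where dot access stops working.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the NTSB file
example = pd.DataFrame({
    "state": ["CA", "TX", "CA"],
    "total fatalities": [1, 2, 3],  # note the space in the column name
})

by_dot = example.state          # dot access works for simple names
by_bracket = example["state"]   # bracket access returns the same Series

# Only brackets work when the name contains a space
fatalities = example["total fatalities"]

print(fatalities.sum())  # 6
```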
    
In this case, the column is filled with characters. So we don’t want to calculate statistics like the median and average, as we did before.

There’s another built-in pandas tool that will total up the frequency of values in a column. The method is called `value_counts` and it’s just as easy to use as `sum`, `min` or `max`. All you need to do is add a period after the column name and chain it on the tail end of your cell.

Run the code and you should see the helicopter models ranked by their number of fatal accidents.

In [None]:
accident_list.latimes_make_and_model.value_counts()

Congratulations, you've made your first finding. With that little line of code, you've calculated an important fact: During the period being studied, the Robinson R44 had more fatal accidents than any other helicopter.
    
You may notice that even though the result has two columns, pandas did not return a clean-looking table in the same way as `head` did for our DataFrame. That’s because our column, a Series, acts a little bit differently than the DataFrame created by `read_csv`.

In most instances, if you have an ugly Series generated by a method like `value_counts` and you want to convert it into a pretty DataFrame, you can do so by tacking on the `reset_index` method on the end.

In [None]:
accident_list.latimes_make_and_model.value_counts().reset_index()
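To see what is going on under the hood, here is a sketch on a handful of made-up values. It assumes nothing about the real file. The labels pandas assigns to the two resulting columns differ between pandas versions, so the sketch checks only the type and shape.

```python
import pandas as pd

# Made-up values for illustration only
models = pd.Series(["R44", "R22", "R44", "B206", "R44"])

counts = models.value_counts()
print(type(counts).__name__)  # Series
print(counts.index[0])        # R44, because the unique values are stored in the index

# reset_index moves the index into an ordinary column
table = counts.reset_index()
print(type(table).__name__)   # DataFrame
print(table.shape)            # (3, 2)
```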

Why does `Series` behave differently than a dataframe? Why does `reset_index` have such a weird name?

Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few.

As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved you should consider <a href="https://pandas.pydata.org/pandas-docs/stable/development/contributing.html">contributing</a> to the Python code that operates the library.
    
Before we move on to the next chapter, here's a challenge. See if you can answer a few more questions a journalist might ask about our dataset. All four of the questions below can be answered using only tricks we've covered thus far. See if you can do it.

1. What was the total number of fatalities?

2. Which helicopter maker had the most accidents?

3. What was the total number of helicopter accidents by year?

4. What state had the most helicopter accidents?

## Filtering down the dataset

The most common way to filter a DataFrame is to pass an expression as an “index” that can be used to decide which records should be kept and which discarded. You write the expression by combining a column on your DataFrame with an <a href="https://en.wikipedia.org/wiki/Operator_(computer_programming)">“operator”</a> like `==` or `>` or `<` and a value to compare against each row.

<div class="alert alert-block alert-warning">
    <p>If you are familiar with writing <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to manipulate databases, pandas’ filtering system is somewhat similar to a WHERE query. The <a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#where">official pandas documentation</a> offers direct translations between the two.</p>
</div>
    
Let's try filtering against the `state` field. Save one of the values listed above into a variable. This will allow us to reuse it later.

In [None]:
my_state = "IA"

In the next cell we will ask pandas to narrow down our list of accidents to just those in the state we’re interested in. We will create a filter expression and place it between two square brackets following the DataFrame we wish to filter.

In [None]:
accident_list[accident_list.state == my_state]

Now we should save the results of that filter into a new variable separate from the full list we imported from the CSV file. Since it includes only the accidents for the state we’re interested in, let’s call it `my_accidents`.

In [None]:
my_accidents = accident_list[accident_list.state == my_state]

To check our work and find out how many accidents are left after the filter, let’s run the DataFrame inspection commands we learned earlier.

First `head`.

In [None]:
my_accidents.head()

Then `info`.

In [None]:
my_accidents.info()
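It can help to look at the filter expression on its own. The sketch below uses a few made-up rows, not the NTSB data, to show that the comparison produces a Series of `True` and `False` values, and that other operators like `>` slot in the same way.

```python
import pandas as pd

# Made-up rows for illustration only
example = pd.DataFrame({
    "state": ["IA", "CA", "IA", "TX"],
    "total_fatalities": [1, 3, 2, 1],
})

# The comparison by itself returns one True or False per row
mask = example.state == "IA"
print(mask.tolist())  # [True, False, True, False]

# Passing that Series inside square brackets keeps only the True rows
iowa = example[mask]

# Other operators work the same way
multiple_deaths = example[example.total_fatalities > 1]

print(len(iowa), len(multiple_deaths))  # 2 2
```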

## Pivoting with `groupby`

The [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the <a href="https://en.wikipedia.org/wiki/Pivot_table">"pivot table"</a> feature found in most spreadsheets.

Let's use it to total up the accidents by make and model. You start by passing the field you want to group on to the function.

In [None]:
accident_list.groupby("latimes_make_and_model")

A nice start but you’ll notice you don’t get much back. The data’s been grouped, but we haven’t chosen what to do with it yet. If we wanted the total by model, we would use the `size` method.

In [None]:
accident_list.groupby("latimes_make_and_model").size()

The result is much like `value_counts`, but we're allowed to run all kinds of statistical operations on the group, like `sum`, `mean` and `std`. For instance, we could sum the total number of fatalities for each make and model by stringing that field on the end, followed by the statistical method.

In [None]:
accident_list.groupby("latimes_make_and_model").total_fatalities.sum()

Again our data has come back as an ugly Series. To reformat it as a pretty DataFrame use the `reset_index` method again.

In [None]:
accident_list.groupby("latimes_make_and_model").size().reset_index()

Now save that as a variable.

In [None]:
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index()

You can clean up the `0` column name assigned by pandas with the `rename` method. The `inplace` option, found on many pandas methods, will save the change to your variable automatically.

In [None]:
accident_counts.rename(columns={0: "accidents"}, inplace=True)

The result is a DataFrame with the accident totals we'll want to merge with the FAA survey data to calculate rates.

In [None]:
accident_counts.head()
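One possible shortcut, sketched here on made-up rows: when `reset_index` is called on a Series, it accepts a `name` option that labels the value column directly, which would skip the separate `rename` step. This is an alternative to the approach above, not a correction of it.

```python
import pandas as pd

# Made-up rows for illustration only
example = pd.DataFrame({
    "latimes_make_and_model": ["R44", "R22", "R44", "B206"],
    "total_fatalities": [1, 2, 3, 1],
})

# Count the rows in each group and name the count column in one step
counts = (
    example.groupby("latimes_make_and_model")
    .size()
    .reset_index(name="accidents")
)

# Sum a column within each group; the Series keeps its name, so no renaming is needed
deaths = (
    example.groupby("latimes_make_and_model")
    .total_fatalities.sum()
    .reset_index()
)

print(counts.columns.tolist())  # ['latimes_make_and_model', 'accidents']
```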

## Merging two dataframes together

Next we'll cover how to merge two DataFrames together into a combined table. Before we can do that, we need to read in a second file. We'll pull `faa-survey.csv`, which contains annual estimates of how many hours each type of helicopter was in the air. It was acquired via a Freedom of Information Act request with the FAA.

We can read it in the same way as the NTSB accident list, with `read_csv`.

In [None]:
survey = pd.read_csv("faa-survey.csv")

When joining two tables together, the first step is to look carefully at the columns in each table to find a common column that can be joined. We can do that with the `info` command we learned earlier.

In [None]:
accident_counts.info()

In [None]:
survey.info()

You can see that each table contains the `latimes_make_and_model` column. We can therefore join the two files using the pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html">`merge`</a> method.

<div class="alert alert-block alert-warning"><p>If you are familiar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN statement. If you dig into <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html">merge’s documentation</a> you will see it has many of the same options.</p></div>

Merging two DataFrames is as simple as passing both to pandas' built-in `merge` method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, which I'm going to call `merged_list`.

In [None]:
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

That new DataFrame can be inspected like any other.

In [None]:
merged_list.head()

By looking at the columns you can check how many rows survived the merge.

In [None]:
merged_list.info()

You can also see that the dataframe now contains the same number of records as `accident_counts`. That's good. It means that every record in each dataframe found a match in the other. It's a good idea to do a check like this every time you merge.
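If the row counts ever disagree, one way to track down the stragglers is an outer merge with the `indicator` option, which adds a `_merge` column saying where each row came from. The sketch below uses two made-up tables with one deliberate mismatch on each side. It is an illustration, not part of the published analysis.

```python
import pandas as pd

# Made-up tables for illustration only
counts = pd.DataFrame({
    "latimes_make_and_model": ["R44", "R22", "B206"],
    "accidents": [5, 3, 2],
})
hours = pd.DataFrame({
    "latimes_make_and_model": ["R44", "R22", "AS350"],
    "total_hours": [1000, 2000, 3000],
})

# The default merge keeps only the rows that match in both tables
inner = pd.merge(counts, hours, on="latimes_make_and_model")
print(len(inner))  # 2

# An outer merge keeps everything, and indicator=True labels each row's origin
check = pd.merge(
    counts, hours, on="latimes_make_and_model", how="outer", indicator=True
)
unmatched = check[check["_merge"] != "both"]
print(len(unmatched))  # 2
```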

## Computing new columns

Here's how you can create a new column based on the data in other columns, a process sometimes known as “computing.” In this case, computing can help us calculate a rate by dividing the accidents by flight hours.

In many cases, it's no more complicated than combining two series with a mathematical operator.

In [None]:
merged_list.accidents / merged_list.total_hours

The resulting series can be added to your dataframe by assigning it to a new column.

In [None]:
merged_list['per_hour'] = merged_list.accidents / merged_list.total_hours

In [None]:
merged_list.head()

In this case, the result is in scientific notation. As is common when calculating per capita statistics, you can multiply all results by a common number to make the numbers more legible.
    
That's as easy as tacking on the multiplication at the end of a computation.

In [None]:
merged_list['per_100k_hours'] = (merged_list.accidents / merged_list.total_hours) * 100000
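To make the arithmetic concrete, here is a sketch with made-up figures that are easy to check by hand: five accidents over one million flight hours works out to 0.5 accidents per 100,000 hours.

```python
import pandas as pd

# Made-up figures for illustration only
example = pd.DataFrame({
    "accidents": [5, 2],
    "total_hours": [1_000_000, 500_000],
})

# The raw rate is tiny and hard to read
example["per_hour"] = example.accidents / example.total_hours

# Scaling by 100,000 makes it legible
example["per_100k_hours"] = example.per_hour * 100_000

print(example.per_100k_hours.round(2).tolist())  # [0.5, 0.4]
```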

## Sorting dataframes

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the highest and lowest members of the group according to a particular column. The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html">`sort_values`</a> method is how pandas does it.

In [None]:
merged_list.sort_values("per_100k_hours")

Note that it returns the DataFrame resorted in ascending order from lowest to highest. That is pandas' default way of sorting. Here's how you reverse it to show the largest values first.

In [None]:
merged_list.sort_values("per_100k_hours", ascending=False)
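Sorting pairs naturally with the `head` method from earlier to pull out a top-ranked list. A quick sketch on made-up figures, with placeholder model names:

```python
import pandas as pd

# Made-up figures and placeholder names for illustration only
example = pd.DataFrame({
    "latimes_make_and_model": ["A", "B", "C", "D"],
    "per_100k_hours": [0.4, 1.6, 0.9, 1.1],
})

# Sort from highest to lowest, then keep the first two rows
top_two = example.sort_values("per_100k_hours", ascending=False).head(2)

print(top_two.latimes_make_and_model.tolist())  # ['B', 'D']
```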

## Further reading

Congratulations. With that, we've recreated the analysis published in the Los Angeles Times and covered most of the basic skills necessary to access and analyze data with pandas. If you'd like to learn more, consult <a href="https://palewi.re/docs/first-python-notebook/">the full edition of "First Python Notebook"</a>.