## Learning outcomes for this notebook

* How to import pandas and matplotlib to do data analysis and plotting in Python.
* How to read in a dataset stored in a CSV file using pandas and print it out to examine it.
* How to plot a histogram using the matplotlib Python library.
* How to add annotations to a matplotlib graph (i.e., axis labels, titles).
* What unimodal and bimodal distributions are.
* Why it is important to present units in tables and graphs.

## Displaying data

The human eye is a natural pattern detector, adept at spotting trends and exceptions in visual displays. For this reason, biologists spend hours creating and examining visual summaries of their data - graphs and, to a lesser extent, tables. Effective graphs enable visual comparisons of measurements between groups, and they expose relationships between different variables. They are also the principal means of communicating results to a wider audience.

Florence Nightingale (1858) was one of the first people to put graphs to good use.
![](images/night.jpg)
In her famous wedge diagrams, redrawn in the figure above, she visualised the causes of death of British troops during the Crimean War. The number of cases is indicated by the area of a wedge, and the cause of death by colour. The diagrams showed convincingly that disease was the main cause of soldier deaths during the war, not wounds or other causes. With these vivid graphs, she successfully campaigned for military and public health measures that saved many lives.

Effective graphs are a prerequisite for good data analysis, revealing general patterns in the data that bare numbers cannot. Therefore, the first step in any data analysis is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.

Graphs are vital tools for analysing data. They are also used to communicate patterns in data to a wider audience in the form of reports, slide shows, and web content. The two purposes, analysis and presentation, are largely coincident because the most revealing displays will be the best both for identifying patterns in the data and for communicating these patterns to others. Both purposes require displays that are clear, honest, and efficient.

Over the next few notebooks we will look at how different types of datasets are commonly displayed in tables and graphs. The types of datasets we will look at are:
* A single numerical variable
* A single categorical variable
* Two numerical variables
* Two categorical variables
* A categorical variable and a numerical variable
* A categorical variable and two numerical variables
* Two categorical variables and a numerical variable

But before we do that we need to discuss data analysis software.

## Plotting and analysing data in Python

There are many software tools and packages for plotting and analysing data. You've probably come across Microsoft Excel, a favourite of biologists. R is another commonly used package and is free to download. All have their pros and cons, and for basic analysis and plotting it doesn't really matter which one you use. But in this course, because we have already been learning Python, it seems sensible to carry on using Python to do data analysis.

The core Python language, however, is not designed for data analysis and plotting. We have to import modules into Python to help us do these things.

The main aims of this notebook are:
1. To introduce you to **pandas**, a Python module designed for performing data analysis
2. To show you how to set up Jupyter notebooks for plotting graphs in Python using the module **matplotlib**

We'll use a simple dataset of body masses of Alaskan sockeye salmon to demonstrate these two aims. 

## Body mass of Alaskan sockeye salmon

![](images/salmon.jpg)


In the CSV file [`alaskan_salmon.csv`](alaskan_salmon.csv) are the body masses (in kg) of 228 female sockeye salmon sampled from Pick Creek in Alaska (Hendry *et al.* 1999).

If you click on the link to this file you can download it and examine it in an Excel spreadsheet.

The first few lines of `alaskan_salmon.csv` look like this:
```csv
mass
3.09
2.91
3.06
2.69
... and so on
```

The first line is called a header and contains the name of the variable. In this case the variable is called `'mass'`. Each line thereafter contains the mass (in kg) of an individual salmon.

<div class="alert alert-info">

**Pandas: data analysis library**<br>

In this course we are going to use a Python package specifically designed for data analysis called pandas. Pandas provides lots of functions for reading in, analysing, manipulating and describing data. The official pandas website is http://pandas.pydata.org. This website provides a lot more information on the use of this library than can be covered in these workshops.

To use pandas we must include the following code once in each notebook.

```python
import pandas as pd
```

This imports the pandas library and gives it the shorthand name `pd`.
</div>
<br>

### Reading a dataset from file

In an earlier Python workshop you learnt how to open and read a CSV file line by line using Python commands. Now we are going to use pandas to read the file, and that makes life a lot simpler as pandas automatically parses the data into a useable format for us in a single line of code.

To read in the Alaskan salmon file and to call the dataset something sensible we use the pandas function `pd.read_csv('filename.csv')` like so:

```python
salmon_masses = pd.read_csv('alaskan_salmon.csv')
```
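
To see what `pd.read_csv` produces without needing the file on disk, here is a minimal sketch that reads the first few lines listed earlier from an in-memory string. The names `sample_csv` and `sample_masses` and the use of `io.StringIO` are illustrative assumptions; in the notebook itself you would pass the filename as shown above.

```python
import io
import pandas as pd

# The first few lines of alaskan_salmon.csv, as listed earlier in this notebook.
# `sample_csv` is a hypothetical stand-in for the real file.
sample_csv = "mass\n3.09\n2.91\n3.06\n2.69\n"

# pd.read_csv accepts a filename or any file-like object.
# The first line is used as the column header.
sample_masses = pd.read_csv(io.StringIO(sample_csv))

print(sample_masses)
```

The header line becomes the column name and every following line becomes one row of numbers.
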
<div class="alert alert-danger">

In the code cell below:
* import pandas
* read in the dataset from the file `alaskan_salmon.csv` and call the dataset `salmon_masses`
</div>
<br>

### Examining the dataset

`salmon_masses` is a Python variable of type **DataFrame**, in the same sense that the Python variable `a = 7` is an **integer**, `b = 'Hello, World'` is a **string** and `c = [1, 2, 3]` is a **list**. If you think of a DataFrame as a table of data in which each column represents a measurable variable, you can't go wrong.

In this case `salmon_masses` is a DataFrame with one column called `mass` containing the masses of 228 Alaskan salmon.
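
As a quick check of the analogy above, the built-in `type()` function reports the type of any variable. This is only a sketch: the DataFrame `d` is a small hand-made table (a hypothetical name, built from the first four masses listed earlier) rather than the full dataset.

```python
import pandas as pd

a = 7
b = 'Hello, World'
c = [1, 2, 3]
# A small hand-made DataFrame with one column called 'mass'.
d = pd.DataFrame({'mass': [3.09, 2.91, 3.06, 2.69]})

# Print the name of each variable's type.
for value in (a, b, c, d):
    print(type(value).__name__)
```

This prints `int`, `str`, `list` and `DataFrame` in turn.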

Having read in a dataset it's always a good idea to have a look at it to see how it is structured. To print the DataFrame type 

```python
print(salmon_masses)
```

or to pretty-print it just type

```python
salmon_masses
```

<div class="alert alert-danger">
Try these in the above code cell
</div>

There are two columns. The first column is an index, or row number, for each salmon. Recall that Python indices start from 0 and not from 1. The second column contains the masses of individual salmon.

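As this is a very long DataFrame, pandas has printed only the first and last rows and replaced the middle rows with `...`. How many rows are shown depends on your pandas version and display settings: older versions print the first and last 30 rows (omitting index 30 to index 197), whereas recent versions print fewer.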

Also notice that the shape of the DataFrame has been printed at the bottom, in this case 228 rows and 1 column (the index column is ignored).
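
How much of a long DataFrame gets printed is controlled by pandas display options. The sketch below, which uses a hypothetical DataFrame called `placeholder` filled with the numbers 0 to 227 (not the salmon data), shows how you might inspect the `display.max_rows` option and change it temporarily.

```python
import pandas as pd

# A hypothetical 228-row DataFrame of placeholder values (not the salmon data).
placeholder = pd.DataFrame({'value': range(228)})

# The current limit on the number of rows pandas prints in full.
print(pd.get_option('display.max_rows'))

# Temporarily lower the limit; the setting reverts when the block ends.
with pd.option_context('display.max_rows', 10):
    print(placeholder)   # abbreviated view with the middle rows omitted
```

Using `pd.option_context` keeps the change local, so other cells in the notebook continue to print as before.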

If you wanted to print just the first 10 lines, say, use the `head()` method like so

```python
print(salmon_masses.head(10))
``` 

or the last 7 lines, say, use the `tail()` method like so

```python
print(salmon_masses.tail(7))
``` 

<div class="alert alert-danger">
Try these in the above code cell.
</div>
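
As a small illustration of `head()` and `tail()`, the sketch below uses a hypothetical toy DataFrame called `toy` (just the numbers 0 to 19) so that it is easy to see which rows come back.

```python
import pandas as pd

# Hypothetical toy DataFrame; the values are simply the numbers 0 to 19.
toy = pd.DataFrame({'value': range(20)})

first_three = toy.head(3)   # rows with index 0, 1 and 2
last_two = toy.tail(2)      # rows with index 18 and 19

print(first_three)
print(last_two)

# Called with no argument, both methods return 5 rows.
print(len(toy.head()), len(toy.tail()))
```

Both methods return a new, shorter DataFrame and leave the original unchanged.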

To print the shape of the DataFrame only (i.e., the number of rows and columns) use

```python
print(salmon_masses.shape)
```

In addition 

```python
print(len(salmon_masses))
```

prints just the number of rows.

If you want to print a list of the variable names (i.e., column headers) you can use the `columns` attribute

```python
print(salmon_masses.columns.values)
```
<div class="alert alert-danger">
Try all of these in the above code cell to see how they work.
</div>
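
The sketch below pulls these three commands together on a small hand-made DataFrame (the hypothetical name `sample`, built from the first four masses listed earlier), so the results are easy to check by eye.

```python
import pandas as pd

sample = pd.DataFrame({'mass': [3.09, 2.91, 3.06, 2.69]})

# shape is a tuple of (number of rows, number of columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# len() gives the number of rows only.
print(len(sample))

# The column headers, converted to an ordinary Python list.
print(list(sample.columns))
```

Note that `shape` and `columns` are attributes, so they are written without brackets, whereas `len()` is a function.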

### Plotting variables

<div class="alert alert-info">

**Matplotlib library**<br>
The Python programming language itself does not have functions for plotting graphs. We have to use an additional library to do this. Matplotlib is a popular Python library for plotting graphs. The official matplotlib website is http://matplotlib.org, which has a gallery of possible graph types. <br><br>


To use matplotlib we must include the following code once in each notebook.

```python
%matplotlib inline
import matplotlib.pyplot as plt
```

The line `%matplotlib inline` allows us to display matplotlib-generated graphs within Jupyter notebooks. The line `import matplotlib.pyplot as plt` loads matplotlib's plotting module, `pyplot`, so we can use its plotting functions. In addition we give it the shorthand name `plt` for convenience (otherwise we would have to write `matplotlib.pyplot` every time we wanted to plot or change something in the graph).
</div>
<br>

To plot a histogram of Alaskan salmon masses we could use either 

```python
plt.hist(salmon_masses['mass'])
```

or equivalently

```python
plt.hist('mass', data=salmon_masses)
```

The second of these methods of plotting looks a bit long-winded compared to the first method. But as we will see in later notebooks when we are plotting several variables simultaneously, the second method is more compact.
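
One way to convince yourself that the two forms are equivalent is to compare the numbers that `plt.hist` returns (the bar heights and the bin edges). The sketch below does this on a small hand-made DataFrame, the hypothetical `sample`, rather than the full dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sample = pd.DataFrame({'mass': [3.09, 2.91, 3.06, 2.69]})

# Form 1: pass the column itself.
counts_a, edges_a, _ = plt.hist(sample['mass'])
plt.close()   # close the figure; here we only want the returned numbers

# Form 2: pass the column name and tell matplotlib where to look it up.
counts_b, edges_b, _ = plt.hist('mass', data=sample)
plt.close()

print(np.array_equal(counts_a, counts_b))
print(np.array_equal(edges_a, edges_b))
```

Both comparisons should print `True`, since the same values are binned in the same way.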

<div class="alert alert-danger">

In the code cell below:
* Import the matplotlib library and include `%matplotlib inline`.
* Plot a histogram of Alaskan salmon masses
</div>
<br>

Notice that this histogram shows two distinct peaks. Such a distribution is described as **bimodal**, as in having two modes (peaks). In contrast, a distribution that has just one peak is called **unimodal**.
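
To make the difference between the two shapes concrete, here is a sketch that plots one unimodal and one bimodal histogram side by side. The values are synthetic random numbers generated purely for illustration, not salmon measurements. The sketch also uses `plt.subplots` and the `bins` argument, which are not covered in this notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Synthetic values for illustration only; these are NOT salmon measurements.
one_peak = rng.normal(loc=5.0, scale=1.0, size=500)
two_peaks = np.concatenate([
    rng.normal(loc=3.0, scale=0.5, size=250),   # first cluster of values
    rng.normal(loc=7.0, scale=0.5, size=250),   # second cluster of values
])

# Two plots side by side in one figure.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.hist(one_peak, bins=20)
ax_left.set_title('unimodal')
ax_right.hist(two_peaks, bins=20)
ax_right.set_title('bimodal');
```

A bimodal histogram often hints that the sample is a mixture of two groups with different typical values.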

<div class="alert alert-danger">

Can you think of a reason why these salmon have a bimodal distribution in mass, and why the peak at 3 kg is lower than the peak at 1.75 kg? Write your answer below.
</div>

> Write your answer here

<div class="alert alert-info">

**Notes**<br>
If you place a semicolon at the end of the last plotting command in a code cell like so:
```python
plt.hist(salmon_masses['mass']);
```
the printing of the arrays of numbers that `plt.hist` returns is suppressed, making the output cleaner.
<br><br>

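If the name of a variable is a valid Python name (for example, it contains no space characters) and does not clash with an existing DataFrame method or attribute, pandas allows you to dispense with the square brackets and quotation marks. For example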
```python
plt.hist(salmon_masses.mass);
```
will plot a histogram of salmon masses. 

If, for example, your variable was called "leg length" then you would need to use the brackets and quotes; otherwise Python raises a syntax error because of the space.


</div>
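
The following sketch illustrates the two ways of selecting a column. The DataFrame `legs` and its "leg length" values are made up for illustration only.

```python
import pandas as pd

# Hypothetical DataFrame: one simple column name and one containing a space.
# The 'leg length' values are invented for this example.
legs = pd.DataFrame({'mass': [3.09, 2.91], 'leg length': [10.0, 12.0]})

# Dot notation and bracket notation select the same column.
print(legs.mass.equals(legs['mass']))

# A column name containing a space only works with brackets and quotes.
print(legs['leg length'])

# Writing legs.leg length would be a SyntaxError because of the space.
```

Bracket notation works for every column name, so it is the safer choice when in doubt.
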
<br>

### Label your graphs

As with all graphs, the one we plotted above needs to be labelled fully and clearly so that someone else can look at it and know immediately what it is presenting. We need the following:
1. Labels on the $x$ and $y$ axes
2. A title

We add $x$ and $y$ axes labels with the functions
```python
plt.xlabel('mass (kg)')
plt.ylabel('frequency')
```
and a title with the function
```python
plt.title('Masses of Alaskan sockeye salmon')
```
It's worth pointing out that the unit of mass is included in the $x$-axis label. This means someone else immediately knows what scale the masses are in. If the units were missing, the reader would have to guess whether the masses are in grams, kilograms or even pounds or ounces. Try to make life as easy as possible for other people to understand what you are presenting by including relevant information in your graphs and tables.
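
Putting the pieces together, a complete labelled histogram might look like the sketch below. It uses a small hand-made DataFrame (the hypothetical `sample`, built from the first four masses listed earlier) rather than the full dataset. In a notebook you would also run `%matplotlib inline` once beforehand.

```python
import pandas as pd
import matplotlib.pyplot as plt

sample = pd.DataFrame({'mass': [3.09, 2.91, 3.06, 2.69]})

plt.hist(sample['mass'])
plt.xlabel('mass (kg)')          # include the unit in the axis label
plt.ylabel('frequency')
plt.title('Masses of Alaskan sockeye salmon');
```

All of the labelling commands must be in the same code cell as the `plt.hist` call so that they apply to the same graph.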

<div class="alert alert-danger">
Add axis labels and a title to the above histogram. 
</div>

So now you know how to read in a dataset from a file using pandas, print it and plot a labelled histogram using matplotlib. In the next few notebooks you will go on to look at the different types of graphs we normally use to present different types of data.

#### References

Nightingale, F. (1858). *Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army.* London, Harrison and Sons.

Hendry, A. P., *et al.* (1999). Condition dependence and adaptation-by-time: breeding date, life history, and energy allocation in a population of salmon. *Oikos* **85**:499-514.