# Visualising data

<div class="alert alert-warning">

**In this notebook you will learn how to plot data to see what they look like. Numerical data are plotted in a histogram and categorical data are plotted in a bar plot**.
    
</div>

Reading in a dataset is only the first step toward analysing your data. The next step is to get a feel for your data by plotting them.

Python itself does not have built-in functions for plotting graphs, so we have to use another module to do this.

There are several modules we can use to visualise data in Python. A common one used in statistics and data science is called **seaborn**. Seaborn provides easy-to-use tools for visualising datasets. See the [official seaborn website](http://seaborn.pydata.org) for more information.

To use seaborn in our code we include the following
```python
import seaborn as sns
```
which imports seaborn and creates the shortcut name sns.

## Visualising numerical data in a histogram

<div>
<img src="attachment:salmon.jpg" width='70%' title="Bureau of Land Management CC BY 2.0"/>
</div>

The masses of the sample of 228 salmon that we read in during the last notebook are numerical data. The histogram is the main type of graph for visualising a single set of numerical data.

To plot a histogram of the sample of 228 salmon body masses we use the seaborn function `displot()` like so

```python
g = sns.displot(salmon_masses['mass'])
```

which says "plot a histogram of the `mass` values contained in the DataFrame `salmon_masses` and store the graph in a variable called `g`". We'll see why `g` is needed in a moment.

<div class="alert alert-info">
Run the following code to plot a histogram of salmon masses.
</div>

In [None]:
import pandas as pd
import seaborn as sns

# Read in the salmon mass dataset.
salmon_masses = pd.read_csv('Datasets/alaskan salmon.csv')

# Plot a histogram of the salmon masses.
g = sns.displot(salmon_masses['mass'])

We can now see what the data look like. There are two peaks, one around 1.75 kg and another just below 3 kg.

These peaks correspond to salmon of two ages: two-year-olds and three-year-olds. The younger salmon are smaller and more numerous, and the older salmon are larger (because they've had more time to grow) and less numerous (because many of them have died in the meantime).

## Annotate graphs to help readers understand their content

As with all graphs, the one we plotted above needs to be annotated fully and clearly so that someone else can look at it and know immediately what it is presenting. We need the following:
1. More descriptive labels on the $x$ and $y$ axes
2. A title

We add $x$- and $y$-axis labels like so
```python
g.ax.set_xlabel('Body mass (kg)')
g.ax.set_ylabel('Number of salmon')
```
and a title like so
```python
g.ax.set_title('Masses of 228 Alaskan sockeye salmon');
```

Notice that we are using the graph's name `g` to set the labels and titles.

It's worth pointing out that the unit of mass (kg) is included in the $x$-axis label. This means a reader immediately knows what units mass has. If the units were missing, the reader would have to guess whether mass is in grams, kilograms or even pounds or ounces. Try to make life as easy as possible for other people to understand what you are presenting by including relevant information in your graphs and tables.


<div class="alert alert-info">
In the code cell below, annotate your histogram with axis labels and a title to make it more understandable.
</div>

In [None]:
g = sns.displot(salmon_masses['mass'])

# Add some useful annotation to help others understand what the graph contains
g.ax.set_xlabel('Body mass (kg)')
g.ax.set_ylabel('Number of salmon')
g.ax.set_title('Masses of 228 Alaskan sockeye salmon');

Notice the semi-colon at the end of
```python
g.ax.set_title('Masses of 228 Alaskan sockeye salmon');
```

The semi-colon suppresses the text output of this statement. Try removing it in the above code cell to see the difference in the output. 
    
You don't need to use the semi-colon, but it makes your output tidier.

## Visualising categorical data in a bar plot

The file [Datasets/blood groups.csv](Datasets/blood%20groups.csv) contains the blood groups of 100 blood donors. It has a single column whose header is `group`.

Seaborn uses the same function `displot()` to plot numerical and categorical data. 

<div class="alert alert-info">
Run the following code to plot a bar plot of blood groups.
</div>

In [None]:
import pandas as pd
import seaborn as sns

blood_groups = pd.read_csv('Datasets/blood groups.csv')

# Plot a bar plot of blood groups
g = sns.displot(blood_groups['group'])

# Add some useful annotation
g.ax.set_xlabel('Blood group')
g.ax.set_ylabel('Number of donors')
g.ax.set_title('Blood groups of 100 blood donors');

It's clear that the O+ blood group is the most frequent.

Bar plots of categorical data usually have spaces between the vertical bars to distinguish them from histograms which don't. We can add a space between the bars by shrinking the widths of the bars. This is done by adding `shrink` to the call to `displot()` like so:

```python
g = sns.displot(blood_groups['group'], shrink=0.8)
```

This shrinks the width of each bar to 80% of its original width.

<div class="alert alert-info">
    
Try adding `shrink=0.8` to the above call to `displot()` and rerun to see the results.
</div>

In some datasets the category names can be long (we'll see an example in the Exercises) and tend to overlap, making them difficult to read. In that case it might be better to plot the bars horizontally rather than vertically so that the category names don't overlap.

To plot horizontal bars we add `y=` to the call to `displot()` like so

```python
g = sns.displot(y=blood_groups['group'], shrink=0.8)
```

<div class="alert alert-info">
    
Run the code cell below to plot horizontal bars.
</div>


In [None]:
# Plot a bar plot of blood groups
g = sns.displot(y=blood_groups['group'], shrink=0.8)

# Add some useful annotation
g.ax.set_xlabel('Number of donors')
g.ax.set_ylabel('Blood group')
g.ax.set_title('Blood groups of 100 blood donors');

## Frequency and relative frequency distributions

The above histogram and bar plot show the frequency of values (e.g., 34 people with the O+ blood group). These plots are called **frequency distributions**. That is, they show how often each value occurs in a dataset.

Instead we can plot the **relative frequency distributions** of values. Remember that relative frequency is the same as proportion (e.g., the proportion of people with blood group O+ is 0.34).
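
The link between frequency and relative frequency can be checked directly with pandas. The sketch below uses a small made-up list of ten blood groups (not the donor dataset) and the pandas method `value_counts()`. The variable names are illustrative.

```python
import pandas as pd

# A made-up sample of ten blood groups, for illustration only
# (not the donor dataset used in this notebook).
example_groups = pd.Series(
    ['O+', 'A+', 'O+', 'B+', 'O+', 'A+', 'AB+', 'O+', 'A+', 'O-']
)

# Frequency: how many times each blood group occurs.
frequencies = example_groups.value_counts()

# Relative frequency: each frequency divided by the sample size.
relative_frequencies = example_groups.value_counts(normalize=True)

print(frequencies)
print(relative_frequencies)
```

In this made-up sample O+ occurs 4 times out of 10, so its relative frequency is 0.4. Relative frequencies always add up to 1.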

We can use `displot()` to plot relative frequency distributions. We just have to tell seaborn this is what we want to do by adding `stat='proportion'` to the call to `displot()` like so
```python
g = sns.displot(blood_groups['group'], stat='proportion', shrink=0.8)
```

<div class="alert alert-info">

Look at the code below to try and understand what it does. Then run it.
</div>

In [None]:
blood_groups = pd.read_csv('Datasets/blood groups.csv')

# Plot a relative frequency distribution of blood groups
g = sns.displot(blood_groups['group'], stat='proportion', shrink=0.8)

# Add some useful annotation
g.ax.set_xlabel('Blood group')
g.ax.set_title('Blood groups of 100 blood donors');

Notice that the shape of the distribution hasn't changed. Only the values on the *y*-axis have changed and the *y*-axis label is now "Proportion".

If you prefer percentages to decimals you can use the following:

```python
g = sns.displot(blood_groups['group'], stat='percent', shrink=0.8)
```

<div class="alert alert-info">

Try this in the above code cell.
</div>

## Saving graphs to file

If you want to use a graph in a report or a poster or a presentation then you need to save it to a file. This is easily done with the command 
```python
g.savefig(filename)
```
You can save graphs in various formats including png, jpg, pdf, tiff, etc. Seaborn hands the file over to matplotlib, which works out the format you want from the filename extension you give it.

<div class="alert alert-info">

Run the following code cell to save the last graph you created to a jpeg file called blood_groups.jpg.
</div>

In [None]:
g.savefig('blood_groups.jpg')

If you now go to the browser tab with all the notebooks in it, you will find a file called blood_groups.jpg.

If you click on the filename a new browser tab opens with the graph displayed.

To download the file on to your laptop select blood_groups.jpg by clicking on the square box to the left of it so that a tick appears.

![Screenshot_20220623_102250.png](attachment:Screenshot_20220623_102250.png)

Then click on "Download" at the top of the page.

![Screenshot_20220623_102424.png](attachment:Screenshot_20220623_102424.png)

You can then insert the downloaded graph into a Word document or a PowerPoint presentation.

## Exercise Notebook

[Visualising data](../Exercise%20Notebooks/2.7%20-%20Visualising%20data.ipynb)