# Descriptive statistics: proportion

<div class="alert alert-warning">

**In this notebook you will learn how to describe categorical data using proportions.**
    
</div>

Last week we looked at visualising a set of data. Visualising data collected from an experiment or from observations is very important and almost always the first step in statistically analysing your data. But visualising data is not enough. We need to say something concrete about it. And that means describing it with numbers. 

These numbers are called **descriptive or summary statistics of a sample**.

This sounds complicated, but it's not. 

The most important descriptive statistic of categorical data is the **proportion**. For example, last week we found that 34 out of one hundred blood donors had blood group O+. Therefore, the proportion of donors with blood group O+ was 34/100 = 0.34, or 34%.

Sample proportions are usually written as $\hat{p}$ (pronounced p-hat).
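
Written as a formula, the O+ example works out as follows. The symbols $x$ (the number of donors in the category) and $n$ (the total number of donors in the sample) are introduced here only for illustration:

$$\hat{p} = \frac{x}{n} = \frac{34}{100} = 0.34$$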

We mentioned in the notebook "Reading data from files" that pandas provides lots of functions for reading in, analysing, manipulating and describing data. Among other things, it can calculate the proportion of each category in a column of categorical data stored in a DataFrame.

Let's read in the blood group data again and calculate the proportions of each group. 

<div class="alert alert-info">

First let's plot it to remind ourselves of what it looks like.
</div>

In [None]:
import pandas as pd 
import seaborn as sns

blood_groups = pd.read_csv('Datasets/blood groups.csv')

g = sns.displot(blood_groups['group'], shrink=0.8)

# Add some useful annotation
g.ax.set_xlabel('Blood group')
g.ax.set_ylabel('Number of donors')
g.ax.set_title('Frequency distribution of blood groups of 100 blood donors');

## Frequency table

The pandas method `value_counts()` produces a table of frequencies of each category, sorted from the most common category to the least common.

<div class="alert alert-info">

Run the following code to see this.
</div>

In [None]:
print( blood_groups['group'].value_counts() )

This table corresponds to the bar plot. Thirty-four people have blood group O+, etc. (Ignore the `dtype: int64` bit. That's just telling us that the counts are integers.)
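
If you want to see how a frequency table behaves without the data file, here is a minimal sketch on a tiny made-up sample. The `toy` data below is hypothetical and is not the donors dataset. It shows that you can look up the count of a single category by its label and that the counts add up to the number of observations.

```python
import pandas as pd

# Hypothetical toy data (NOT the donors dataset): six made-up blood groups
toy = pd.Series(['O+', 'A+', 'O+', 'B+', 'O+', 'A+'], name='group')

counts = toy.value_counts()   # frequency table, most common category first
print(counts)

# Look up the count for one category by its label
print(counts['O+'])

# The counts always add up to the number of observations
print(counts.sum() == len(toy))
```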

## Relative frequency table

To get the proportions we include `normalize=True` in `value_counts()` like so:
```python
blood_groups['group'].value_counts(normalize=True)
```

<div class="alert alert-info">

Run the following code to see this.
</div>

In [None]:
print( blood_groups['group'].value_counts(normalize=True) )

That's it. These proportions are descriptive statistics of blood groups of one hundred donors. (Ignore the `dtype: float64` bit. That's just telling us that the proportions are floats.)
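
To see what `normalize=True` is doing, here is a small sketch that calculates the proportions by hand (each count divided by the total) and compares them with the built-in result. The `toy` data is again a hypothetical, made-up sample and not the donors dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT the donors dataset): ten made-up blood groups
toy = pd.Series(['O+', 'A+', 'O+', 'B+', 'A+', 'O+', 'AB+', 'O+', 'A+', 'B+'],
                name='group')

counts = toy.value_counts()                  # frequency of each category
manual = counts / counts.sum()               # each count divided by the total
builtin = toy.value_counts(normalize=True)   # pandas does the same division

print(manual)
print(builtin)

# Proportions of all the categories always add up to 1
print(manual.sum())
```

Multiplying the proportions by 100 (for example `builtin * 100`) gives percentages instead.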

## Exercise Notebook

[Descriptive statistics: proportion](../Exercise%20Notebooks/3.1%20-%20Descriptive%20statistics:%20proportion.ipynb)