# Descriptive statistics in *pandas*

In this Notebook you will see examples of a range of basic statistics available in *pandas*.

We can only scratch the surface of what is available in *pandas* and in the other libraries that extend it. *pandas* was designed as a data analysis toolset, so a rich collection of statistical analysis tools and packages should not be surprising.

There are two types of statistical functions we'll cover - the aggregate (where a set is summarised with a single value) and the additive (where the set size is retained, with values changed or augmented by additional values).

Aggregate functions include:
- `count`: number of non-null observations
- `sum`: sum of values
- `mean`: mean of values
- `mad`: mean absolute deviation (removed in *pandas* 2.0)
- `median`: arithmetic median of values
- `min`: minimum
- `max`: maximum
- `mode`: mode (most frequent value or values)
- `abs`: absolute value (applied element by element, so the result keeps the size of the input)
- `prod`: product of values
- `std`: unbiased (sample) standard deviation
- `var`: unbiased (sample) variance
- `sem`: unbiased standard error of the mean
- `quantile`: sample quantile (value at %)
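
The "unbiased" label on `std`, `var` and `sem` refers to the `ddof=1` default, which divides by n - 1 rather than n. The following is a minimal sketch of what that means in practice; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 4, 5, 1, -4, 2, 8])

# pandas divides by n - 1 by default (ddof=1), giving the sample estimate.
sample_std = values.std()

# Passing ddof=0 divides by n instead, which matches NumPy's default.
population_std = values.std(ddof=0)

# The standard error of the mean is the sample std divided by sqrt(n).
sem_by_hand = sample_std / np.sqrt(values.count())

print(sample_std, population_std)
print(values.sem(), sem_by_hand)
```

If you compare results with NumPy, remember that `np.std` uses `ddof=0` unless told otherwise, so the two libraries give slightly different answers on the same data.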

Additive functions include:
- `cumsum`: cumulative sum
- `cumprod`: cumulative product
- `cummax`: cumulative maximum
- `cummin`: cumulative minimum
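
As a quick illustration of the difference between the two kinds of function, the sketch below applies one of each to the same small Series. The names `total` and `running_total` are just for this example.

```python
import pandas as pd

values = pd.Series([1, 2, 4, 5, 1, -4, 2, 8])

# Aggregate: the whole Series is summarised by a single value.
total = values.sum()

# Cumulative: the result has one value per element of the original Series.
running_total = values.cumsum()

print(total)
print(running_total.tolist())
print(len(values), len(running_total))
```

The last element of the cumulative sum is the same as the aggregate sum, which is a handy sanity check.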


In [1]:
import pandas as pd
import numpy as np

# Data samples to use as examples

We are going to need a range of data samples to use throughout this Notebook, so we'll set them up in the following cells.

## Series datasets

In [2]:
# A numeric series.
aseries = pd.Series([1, 2, 4, 5, 1, -4, 2, 8])

# A numeric series with at least one NaN.
aseriesnan = pd.Series([30, 42, 24, 75, 82, np.nan, 20, 34])

# A non-numeric series.
nonnumeric_series = pd.Series(['Andy', 'Arnold', 'Bill', 'Anne', 'Xavier',
                               'Walter', 'Angelina', 'Anne'])     

## DataFrame datasets

In [11]:
# A DataFrame with two numeric columns.
adf = pd.DataFrame({'size':[1, 2, 4, 5, 1, 4, 2, 8], 
                    'cost':[30, 42, 24, 75, 82, 50, 20, 34]})

# A DataFrame with two numeric columns, one with at least one NaN.
adfnan = pd.DataFrame({'size':[1, 2, 4, 5, 1, 4, 2, 8], 
                       'cost':[30, 42, 24, 75, 82, np.nan, 20, 34]})

# A DataFrame with a non-numeric column.
adfnonnumeric = pd.DataFrame({'size':[1, 2, 4, 5, 1, 4, 2, 8],  
                              'cost':[30, 42, 24, 75, 82, np.nan, 20, 34],  
                       'owner':['Bill', 'Arnold', 'Bill', 'Ann', 'Xavier',
                                'Walter', 'Bill', 'Anne']})
adf

   cost  size
0    30     1
1    42     2
2    24     4
3    75     5
4    82     1
5    50     4
6    20     2
7    34     8


# Aggregate functions
An aggregate function will be applied to the values of a Series if it is compatible with their data type, and to all the compatible columns of a DataFrame:

In [4]:
# Summing the values in a Series.
print(aseries.sum())
print(aseriesnan.sum())
print(nonnumeric_series.sum())

# Note, it is always worth checking how functions behave in the presence of NaNs and unusual types.
# The output of the sum applied to the nonnumeric_series was a little unexpected!

19
307.0
AndyArnoldBillAnneXavierWalterAngelinaAnne


In [5]:
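# numeric_only=True limits the mean to the numeric columns; recent versions
# of pandas raise a TypeError if the string column is included.
adfnonnumeric.mean(numeric_only=True)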

cost    43.857143
size     3.375000
dtype: float64

If the aggregate can be applied to the type of values in a column it will be.

So, `max` can be applied as a numeric or string maximum depending on the type of values in the set.  

(Also, in a set with mixed types of values in a column there are ways to make the aggregate functions only 'see' specific types of value within the data, but we don't cover them in this module.)

In [6]:
adfnonnumeric.max()

cost         82
owner    Xavier
size          8
dtype: object

## Aggregation by row or by column in a DataFrame
In a DataFrame you can choose the direction in which the function is applied by specifying the `axis` value. The default is `axis=0`, which works down the rows and gives one result per column; setting `axis=1` works across the columns and gives one result per row.

In [7]:
adf.sum(axis=1)

0    31
1    44
2    28
3    80
4    83
5    54
6    22
7    42
dtype: int64
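
To see both directions side by side, here is a small sketch on a cut-down, four-row version of the example data. The DataFrame `df` is defined only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'size': [1, 2, 4, 5], 'cost': [30, 42, 24, 75]})

# axis=0 collapses the rows: one total per column.
per_column = df.sum(axis=0)

# axis=1 collapses the columns: one total per row.
per_row = df.sum(axis=1)

# The string aliases 'index' and 'columns' can be used instead of 0 and 1.
per_row_alias = df.sum(axis='columns')

print(per_column)
print(per_row)
```

One way to remember it is that `axis` names the dimension that disappears from the result.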

## Include or exclude NaNs
By default any `NaN`s encountered are ignored, but if they are important they can be included in the aggregation by setting `skipna=False`.

In [8]:
adf.sum(skipna=True)

cost    357
size     27
dtype: int64

In [9]:
adfnan.sum(skipna=False)

cost   NaN
size    27
dtype: float64
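
The same switch applies to the other aggregates. The sketch below uses `mean()` on a Series like `aseriesnan`, and also counts the missing values so you can judge whether skipping them is reasonable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cost = pd.Series([30, 42, 24, 75, 82, np.nan, 20, 34])

# Default behaviour: the NaN is skipped, so this is the mean of 7 values.
mean_ignoring_nan = cost.mean()

# With skipna=False the NaN propagates and the result is itself NaN.
mean_with_nan = cost.mean(skipna=False)

# Counting the missing values shows how much data is being skipped.
missing = cost.isna().sum()

print(mean_ignoring_nan, mean_with_nan, missing)
```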

##  Locating the min or max in a Series or DataFrame
It is sometimes useful to know the index label for the minimum or maximum value in a Series or DataFrame.

The `idxmin()` and `idxmax()` methods on Series and DataFrame return the index labels at which the minimum and maximum values occur.

In [14]:
aseries.idxmin()

5

In [12]:
adf.idxmin(axis=0)

cost    6
size    0
dtype: int64

In [None]:
adfnan.idxmin(axis=1, skipna=False)

When there are multiple rows (or columns) matching the minimum or maximum value, `idxmin()` and `idxmax()` return the **first** matching index _(this also works where we supply names for the index labels, overriding the default index numbering)_:

In [None]:
# Here we create a DataFrame with five values in a single column,
# the row index values are e d c b a.
df2 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
df2['A'].idxmin()
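
If you need every label that shares the minimum, rather than just the first, one possible approach is to filter on the minimum value yourself. This is a sketch, not the only way to do it, and the Series `s` is made up for the example.

```python
import pandas as pd

s = pd.Series([2, 1, 1, 3], index=list('edcb'))

# idxmin() reports only the first label at which the minimum occurs.
first_min_label = s.idxmin()

# Selecting the values equal to the minimum recovers all matching labels.
all_min_labels = s[s == s.min()].index.tolist()

print(first_min_label)
print(all_min_labels)
```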

# The additive functions
With these functions, the dataset is treated one row at a time, in sequence, so the order of the elements in the dataset makes a difference to the output. The examples we will look at are the cumulative functions, in which each value depends on the values in the rows up to and including its own.

In [None]:
pd.DataFrame({'The series': aseries, 'The cumulative sum': aseries.cumsum(), 
              'The cumulative max': aseries.cummax(), 'The cumulative min': aseries.cummin()})

The cumulative functions preserve the location of the `NaN`s.

In [None]:
adfnonnumeric.cummax()
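
A smaller sketch makes the `NaN` behaviour easier to see. The four-value Series below is made up for the illustration.

```python
import numpy as np
import pandas as pd

cost = pd.Series([30, 42, np.nan, 20])

# By default the NaN stays in its original position and the running
# total carries on from the last valid value.
running = cost.cumsum()

# With skipna=False, everything from the first NaN onwards becomes NaN.
running_strict = cost.cumsum(skipna=False)

print(running.tolist())
print(running_strict.tolist())
```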

# A quick collection of descriptive stats
Sometimes it's useful just to get a quick impression of a dataset.

The `describe()` method throws a handful of useful functions at a dataset all in one method.

*Warning*: `describe()` skips NaNs, and this cannot be changed.

In [15]:
aseriesnan.describe()

count     7.000000
mean     43.857143
std      24.768452
min      20.000000
25%      27.000000
50%      34.000000
75%      58.500000
max      82.000000
dtype: float64

When applied to non-numeric Series we get a different collection of functions applied:
a simple summary of the count of the elements, the number of unique values, the most frequently occurring value and its frequency of occurrence.

In [16]:
nonnumeric_series.describe()

count        8
unique       7
top       Anne
freq         2
dtype: object

*Warning*: simply applying `describe()` to a mixed DataFrame will generate statistics for each *numeric* column: it will ignore the non-numeric columns!

In [17]:
adfnonnumeric.describe()

            cost      size
count        7.0       8.0
mean   43.857143     3.375
std    24.768452  2.386719
min         20.0       1.0
25%         27.0      1.75
50%         34.0       3.0
75%         58.5      4.25
max         82.0       8.0


But, adding `include='all'` will retain each column from the original and apply the required functions to each.   

(The `include` value can be a list of `dtypes` to include, and you can have an `exclude` parameter giving dtypes to exclude from the result.)

In [18]:
adfnonnumeric.describe(include='all')

             cost owner      size
count         7.0     8       8.0
unique        NaN     6       NaN
top           NaN  Bill       NaN
freq          NaN     3       NaN
mean    43.857143   NaN     3.375
std     24.768452   NaN  2.386719
min          20.0   NaN       1.0
25%          27.0   NaN      1.75
50%          34.0   NaN       3.0
75%          58.5   NaN      4.25
max          82.0   NaN       8.0
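
As a sketch of the `include` and `exclude` parameters with dtypes rather than `'all'`, the example below selects on `np.number`. The three-row DataFrame `df` is made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'size': [1, 2, 4],
                   'cost': [30.0, np.nan, 24.0],
                   'owner': ['Bill', 'Arnold', 'Bill']})

# Keep only the numeric columns in the summary.
numeric_summary = df.describe(include=[np.number])

# Drop the numeric columns, leaving a summary of the text column.
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

Selecting by dtype keeps the two kinds of summary in separate tables, which avoids the `NaN` padding you get with `include='all'`.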


Are you wondering about those rows with a percentage index label? They are the 25th, 50th and 75th percentiles: the values below which roughly 25%, 50% and 75% of the values in the dataset fall.

So, in the above result half of the values in the `cost` column are at or below 34.
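
The percentile rows can be reproduced one at a time with the `quantile()` aggregate listed at the start of this Notebook. The sketch below checks this against `describe()` for a Series like `aseriesnan`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cost = pd.Series([30, 42, 24, 75, 82, np.nan, 20, 34])

# quantile() takes a fraction between 0 and 1 and skips NaNs.
lower_quartile = cost.quantile(0.25)
median = cost.quantile(0.5)

# describe() reports the same values under the '25%' and '50%' labels.
summary = cost.describe()

# Count how many values fall strictly below the median.
below_median = (cost < median).sum()

print(lower_quartile, median, below_median)
```

With seven valid values, three fall strictly below the median and one is equal to it, which is why "at or below" is the safer wording for small datasets.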

It is possible to supply `describe()` with a list of the percentiles you want displayed, each given as a fraction between 0.0 and 1.0 (note that you always seem to get the 50th percentile as well):

In [19]:
aseries.describe(percentiles=[.05, .25, .75, .95])

count    8.00000
mean     2.37500
std      3.50255
min     -4.00000
5%      -2.25000
25%      1.00000
50%      2.00000
75%      4.25000
95%      6.95000
max      8.00000
dtype: float64

In [20]:
adf.describe(percentiles=[.05, .25, .75, .95])

            cost      size
count        8.0       8.0
mean      44.625     3.375
std    23.033749  2.386719
min         20.0       1.0
5%          21.4       1.0
25%         28.5      1.75
50%         38.0       3.0
75%        56.25      4.25
95%        79.55      6.95
max         82.0       8.0


# What next

If you are working through this Notebook as part of the module activities, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `04.3 Simple visualisations in pandas`.