# Calculating Descriptive Statistics
## Variability

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Measures of Variability
The measures of central tendency aren’t sufficient to describe data. You’ll also need the measures of variability that quantify the spread of data points. In this section, you’ll learn how to identify and calculate the following variability measures:

* Variance
* Standard deviation
* Skewness
* Percentiles
* Ranges

# Variance
The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean. You can express the sample variance of the dataset 𝑥 with 𝑛 elements mathematically as

 𝑠² = Σᵢ(𝑥ᵢ − mean(𝑥))² / (𝑛 − 1), where 𝑖 = 1, 2, …, 𝑛 and mean(𝑥) is the sample mean of 𝑥

In [3]:
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
var_

123.19999999999999

In [4]:
var_ = statistics.variance(x)
var_

123.2
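The divisor 𝑛 − 1 is what makes this the sample variance. As a minimal sketch, the cell below contrasts it with the population variance, which divides by 𝑛 instead. The names sample_var and population_var are illustrative, and statistics.pvariance() is a standard-library function that this section doesn't otherwise use.

```python
import statistics

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n

# Sum of squared deviations from the mean, shared by both definitions
squared_deviations = sum((item - mean_) ** 2 for item in x)

sample_var = squared_deviations / (n - 1)   # divisor n - 1 (sample)
population_var = squared_deviations / n     # divisor n (population)

print(sample_var, statistics.variance(x))
print(population_var, statistics.pvariance(x))
```

The sample variance is always the larger of the two, and the gap shrinks as 𝑛 grows.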

# Standard Deviation
The sample standard deviation is another measure of data spread. It’s connected to the sample variance, as standard deviation, 𝑠, is the positive square root of the sample variance. The standard deviation is often more convenient than the variance because it has the same unit as the data points. 

Once you get the variance, you can calculate the standard deviation with pure Python:

In [5]:
std_ = var_ ** 0.5
std_

11.099549540409287

In [6]:
std_ = statistics.stdev(x)
std_

11.099549540409287
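NumPy can compute the same quantity, but np.std() divides by 𝑛 unless you pass ddof=1. The sketch below, with illustrative variable names, shows both behaviors and how np.nanstd() skips the nan value in x_with_nan.

```python
import math
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

sample_std = np.std(x, ddof=1)      # divisor n - 1, matches statistics.stdev(x)
population_std = np.std(x)          # default ddof=0, divisor n

std_with_nan = np.std(x_with_nan, ddof=1)         # nan propagates
std_ignoring_nan = np.nanstd(x_with_nan, ddof=1)  # nan is skipped

print(sample_std, population_std)
print(std_with_nan, std_ignoring_nan)
```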

# Skewness
The sample skewness measures the asymmetry of a data sample.

There are several mathematical definitions of skewness. One common expression to calculate the skewness of the dataset 𝑥 with 𝑛 elements is 

(𝑛² / ((𝑛 − 1)(𝑛 − 2))) (Σᵢ(𝑥ᵢ − mean(𝑥))³ / (𝑛𝑠³)). 

A simpler expression is 

Σᵢ(𝑥ᵢ − mean(𝑥))³ 𝑛 / ((𝑛 − 1)(𝑛 − 2)𝑠³), where 𝑖 = 1, 2, …, 𝑛 and mean(𝑥) is the sample mean of 𝑥.

In [7]:
x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x)
         * n / ((n - 1) * (n - 2) * std_**3))
skew_

1.9470432273905929
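The same coefficient should also be available from SciPy. The sketch below assumes that scipy.stats.skew() with bias=False applies the same sample-size correction as the formula above. With the default bias=True, it returns the smaller, uncorrected value.

```python
import numpy as np
import scipy.stats

x = [8.0, 1, 2.5, 4, 28.0]
y = np.array(x)

skew_corrected = scipy.stats.skew(y, bias=False)  # corrected for statistical bias
skew_uncorrected = scipy.stats.skew(y)            # default bias=True

print(skew_corrected)
print(skew_uncorrected)
```

A positive value indicates a longer right tail, which fits this dataset because of the single large value 28.0.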

In [8]:
z = pd.Series(x)
z_with_nan = pd.Series(x_with_nan)

In [9]:
z

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64

In [10]:
z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

In [11]:
z.skew()

1.9470432273905924

In [12]:
z_with_nan.skew()

1.9470432273905924
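Both calls return the same value because pandas skips nan values by default. As a short sketch, passing skipna=False makes the missing value propagate into the result instead:

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])

skew_skipping_nan = z_with_nan.skew()                 # default skipna=True
skew_propagating_nan = z_with_nan.skew(skipna=False)  # result is nan

print(skew_skipping_nan)
print(skew_propagating_nan)
```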

# Percentiles
The sample 𝑝 percentile is the element in the dataset such that 𝑝% of the elements in the dataset are less than or equal to that value. Also, (100 − 𝑝)% of the elements are greater than or equal to that value. If there are two such elements in the dataset, then the sample 𝑝 percentile is their arithmetic mean. Each dataset has three quartiles, which are the percentiles that divide the dataset into four parts:

* The first quartile is the sample 25th percentile. It divides roughly 25% of the smallest items from the rest of the dataset.
* The second quartile is the sample 50th percentile or the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.
* The third quartile is the sample 75th percentile. It divides roughly 25% of the largest items from the rest of the dataset.

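Each part has approximately the same number of items. If you want to divide your data into several intervals, then you can use statistics.quantiles(), available in Python 3.8 and later. It returns the n − 1 cut points that split the data into n intervals: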

In [13]:
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=2)

[8.0]

In [14]:
statistics.quantiles(x, n=4, method='inclusive')

[0.1, 8.0, 21.0]
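The method argument changes the result. As a sketch on the same data, the default method='exclusive' treats the data as a sample from a population that may contain more extreme values. With method='inclusive', the minimum and maximum of the data are treated as the extremes.

```python
import statistics

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# Three cut points (quartiles) under each interpolation method
exclusive = statistics.quantiles(x, n=4)  # default method='exclusive'
inclusive = statistics.quantiles(x, n=4, method='inclusive')

print(exclusive)
print(inclusive)
```

The median is the same under both methods here, while the first and third quartiles sit further out with the exclusive method.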

You can also use np.percentile() to determine any sample percentile in your dataset. For example, this is how you can find the 5th and 95th percentiles:

In [15]:
y = np.array(x)

In [16]:
np.percentile(y, 5)

-3.44

In [17]:
np.percentile(y, 95)

34.919999999999995
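np.percentile() also accepts a sequence of percentiles, so you can get several values in one call. The sketch below computes the three quartiles this way and repeats the calculation with np.quantile(), which takes fractions between 0 and 1 instead of percentages.

```python
import numpy as np

y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])

quartiles = np.percentile(y, [25, 50, 75])          # percentages
same_quartiles = np.quantile(y, [0.25, 0.5, 0.75])  # fractions

print(quartiles)
print(same_quartiles)
```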

# Ranges

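The range of data is the difference between the maximum and minimum elements in the dataset. You can get it with the function np.ptp(). In the cells below, y is the NumPy array built from the nine-element x of the percentiles section, while z is the Series built earlier from the original five-element x, so the two results differ: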


In [18]:
np.ptp(y)

46.0

In [19]:
np.ptp(z)

27.0
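As a sketch, you can get the same range directly from the maximum and minimum. The cell also computes the interquartile range, the difference between the third and first quartiles, as a related measure that is less sensitive to extreme values. The names range_ and iqr are illustrative.

```python
import numpy as np

y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])

range_ = y.max() - y.min()           # same value as np.ptp(y)

q1, q3 = np.percentile(y, [25, 75])  # first and third quartiles
iqr = q3 - q1                        # interquartile range

print(range_, np.ptp(y))
print(iqr)
```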

# Summary of Descriptive Statistics

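SciPy and Pandas offer useful routines to quickly get descriptive statistics with a single function or method call. You can use scipy.stats.describe(), passing the dataset as the first argument. In the call below, ddof=1 is the default and only affects the variance, while bias=False corrects the skewness and kurtosis for statistical bias.

describe() returns an object that holds the following descriptive statistics: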

* nobs: the number of observations or elements in your dataset
* minmax: the tuple with the minimum and maximum values of your dataset
* mean: the mean of your dataset
* variance: the variance of your dataset
* skewness: the skewness of your dataset
* kurtosis: the kurtosis of your dataset

In [20]:
result = scipy.stats.describe(y, ddof=1, bias=False)
result

DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)
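The returned object behaves like a named tuple, so you can read each statistic as an attribute. A minimal sketch on the same data:

```python
import numpy as np
import scipy.stats

y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])
result = scipy.stats.describe(y, ddof=1, bias=False)

print(result.nobs)       # number of observations
print(result.minmax[0])  # minimum
print(result.minmax[1])  # maximum
print(result.mean)
print(result.variance)   # sample variance, because ddof=1
print(result.skewness)
print(result.kurtosis)
```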

Pandas has similar, if not better, functionality. Series objects have the method .describe(), which returns a new Series that holds the following:

* count: the number of elements in your dataset
* mean: the mean of your dataset
* std: the standard deviation of your dataset
* min and max: the minimum and maximum values of your dataset
* 25%, 50%, and 75%: the quartiles of your dataset


In [21]:
result = z.describe()
result

count     5.00000
mean      8.70000
std      11.09955
min       1.00000
25%       2.50000
50%       4.00000
75%       8.00000
max      28.00000
dtype: float64

In [22]:
result['mean']

8.7
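.describe() also leaves out missing values. As a sketch with the Series that contains nan, count reports five items even though the Series has six entries, and the remaining statistics match those of z:

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])
summary = z_with_nan.describe()

print(len(z_with_nan))   # 6 entries, including the nan
print(summary['count'])  # only the non-missing values are counted
print(summary['mean'])
```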

# Working With 2D Data

In [23]:
a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
a

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])

In [24]:
np.mean(a)

5.4

In [25]:
a.mean()

5.4

In [26]:
np.median(a)

2.0

In [27]:
a.var()

49.84000000000001

As you can see, you get statistics (like the mean, median, or variance) across all data in the array a. Sometimes, this behavior is what you want, but in some cases, you’ll want these quantities calculated for each row or column of your 2D array.

The functions and methods you’ve used so far have one optional parameter called axis, which is essential for handling 2D data. axis can take on any of the following values:

* axis=None says to calculate the statistics across all data in the array. The examples above work like this. This behavior is often the default in NumPy.
* axis=0 says to calculate the statistics across all rows, that is, for each column of the array. This behavior is often the default for SciPy statistical functions.
* axis=1 says to calculate the statistics across all columns, that is, for each row of the array.

In [28]:
a.mean(axis=0)

array([6.2, 8.2, 1.8])

In [29]:
a.mean(axis=1)

array([ 1.,  2.,  5., 13.,  6.])
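The axis parameter works the same way with NumPy functions such as np.median(). A short sketch on the same array, with illustrative variable names:

```python
import numpy as np

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])

column_medians = np.median(a, axis=0)  # one value per column (3 values)
row_medians = np.median(a, axis=1)     # one value per row (5 values)

print(column_medians)
print(row_medians)
```

The length of each result matches the axis that remains: three columns for axis=0 and five rows for axis=1.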

# DataFrames

In [30]:
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'C']
df = pd.DataFrame(a, index=row_names, columns=col_names)
df

         A   B  C
first    1   1  1
second   2   3  1
third    4   9  2
fourth   8  27  4
fifth   16   1  1


In [31]:
df.mean()

A    6.2
B    8.2
C    1.8
dtype: float64

In [32]:
df.var()

A     37.2
B    121.2
C      1.7
dtype: float64

What you get is a new Series that holds the results. In this case, the Series holds the mean and variance for each column. If you want the results for each row, then just specify the parameter axis=1:

In [33]:
df.mean(axis=1)

first      1.0
second     2.0
third      5.0
fourth    13.0
fifth      6.0
dtype: float64
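The variances from df.var() differ from what a.var(axis=0) would give. The reason is the default divisor: pandas uses ddof=1 (sample variance), while NumPy uses ddof=0 (population variance). The sketch below shows that the two agree once ddof is set explicitly.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'C']
df = pd.DataFrame(a, index=row_names, columns=col_names)

pandas_default = df.var()             # ddof=1 by default
numpy_default = a.var(axis=0)         # ddof=0 by default

pandas_population = df.var(ddof=0)    # matches the NumPy default
numpy_sample = a.var(axis=0, ddof=1)  # matches the pandas default

print(pandas_default.to_numpy(), numpy_sample)
print(pandas_population.to_numpy(), numpy_default)
```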

You can isolate each column of a DataFrame like this:


In [34]:
df['A']

first      1
second     2
third      4
fourth     8
fifth     16
Name: A, dtype: int32

In [35]:
df['A'].mean()

6.2

Sometimes, you might want to use a DataFrame as a NumPy array and apply some function to it. It’s possible to get all data from a DataFrame with .values or .to_numpy():

In [36]:
df.values

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])

In [37]:
df.to_numpy()

array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])
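Once you have the array, any NumPy function applies as usual. As a small sketch, this computes the range of each column with np.ptp(); the name column_ranges is illustrative:

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'C'])

# Range (max - min) of each column, computed on the underlying array
column_ranges = np.ptp(df.to_numpy(), axis=0)
print(column_ranges)
```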

Like Series, DataFrame objects have the method .describe() that returns another DataFrame with the statistics summary for all columns:

In [38]:
df.describe()

              A          B        C
count   5.00000   5.000000  5.00000
mean    6.20000   8.200000  1.80000
std     6.09918  11.009087  1.30384
min     1.00000   1.000000  1.00000
25%     2.00000   1.000000  1.00000
50%     4.00000   3.000000  1.00000
75%     8.00000   9.000000  2.00000
max    16.00000  27.000000  4.00000


The summary contains the following results:

* count: the number of items in each column
* mean: the mean of each column
* std: the standard deviation of each column
* min and max: the minimum and maximum values of each column
* 25%, 50%, and 75%: the quartiles of each column

If you want the resulting DataFrame object to contain other percentiles, then you should specify the value of the optional parameter percentiles as a list of fractions between 0 and 1.
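As a minimal sketch, this requests the 5th and 95th percentiles; the choice of these two values is only an illustration:

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'C'])

# Percentiles are given as fractions and appear as rows labeled '5%' and '95%'
summary = df.describe(percentiles=[0.05, 0.95])
print(summary)
```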

You can access each item of the summary like this:

In [39]:
df.describe().at['mean', 'A']

6.2