In [2]:
import numpy as np

# Statistical procedures

A substantial proportion of real-world applications in computational modelling require statistical procedures. `numpy` provides a wide variety of efficient statistical functions that you can apply to an array.  This section explores the simple, commonly used functions and then develops our own statistics for a **running** sample mean and standard deviation.

<div class="alert alert-block alert-info"><b>Tip:</b> We will explore statistical programming for health data science in a lot more detail in Part 2 using `pandas` and other important libraries.  It is well worth learning the `numpy` capabilities, however, as converting from a `np.ndarray` to a `pandas.DataFrame` during a computational procedure can be expensive. </div>



## Simple data analysis example

### ED attendance data

We will first use the data held in `minor_illness_ed_attends.csv`.  This is a synthetic time series dataset reporting the number of patients registered at a GP surgery who attend ED each week.  The data are standardised to attendances per 10,000 registered patients.

#### Loading the dataset

Let's first open the data and then construct some summary statistics.

In [9]:
file_name = 'data/minor_illness_ed_attends.csv'
ed_data = np.loadtxt(file_name, skiprows=1, delimiter=',')
print(ed_data.shape)

(74,)


Here's a peek at the first 10 elements in `ed_data`.

In [6]:
ed_data[:10]

array([2.11927795, 3.49057545, 3.98922908, 2.36860477, 3.24124863,
       2.8672584 , 3.11658522, 2.74259499, 3.61523885, 3.61523885])

#### Calculate summary statistics

* `numpy` makes it easy to calculate means, standard deviations and other summary statistics of an `ndarray`.

Here we will create a class to act as a convenient container for a dataset.  We will override the `__repr__` method so that a summary of the dataset is displayed when the object is echoed in a notebook cell or passed to `print`.

In [41]:
class AttendanceSummary:
    
    def __init__(self, data, decimal_places=2):
        """
        Stores the mean, stdev, min, max and 95th percentile of ED data

        Keyword arguments:
        data -- 1d numpy.ndarray containing data to analyse
        decimal_places -- number of decimal places shown in the summary
        """
        self.n = len(data)
        self.mean = data.mean()
        # ndarray.std() defaults to ddof=0 (population standard deviation)
        self.std = data.std()
        self.min_attends = data.min()
        self.max_attends = data.max()
        self.per_95 = np.percentile(data, 95)
        self.dps = decimal_places
        
    def __repr__(self):
        # format every statistic to the requested number of decimal places
        dps = self.dps
        to_print = f'Mean:\t{self.mean:.{dps}f}' \
             + f'\nStdev:\t{self.std:.{dps}f}' \
             + f'\nMin:\t{self.min_attends:.{dps}f}' \
             + f'\nMax:\t{self.max_attends:.{dps}f}' \
             + f'\n95th:\t{self.per_95:.{dps}f}'
        
        return to_print

In [43]:
x = AttendanceSummary(ed_data)
x

Mean:	2.92
Stdev:	0.71
Min:	1.62
Max:	5.11
95th:	3.99
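
One detail worth checking when calculating a standard deviation is the `ddof` argument. The sketch below uses a small made-up sample (an assumption for illustration, not the ED dataset) to show the difference between the default population standard deviation and the sample standard deviation, and how `np.percentile` can return several percentiles in one call.

```python
import numpy as np

# Small illustrative sample (hypothetical values, not the ED dataset)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = sample.std()            # ddof=0 (default): divides by n
sample_std = sample.std(ddof=1)   # ddof=1: divides by n - 1

# np.percentile accepts a sequence, so both tails come back together
p5, p95 = np.percentile(sample, [5, 95])

print(f'Population stdev:\t{pop_std:.2f}')
print(f'Sample stdev:\t{sample_std:.2f}')
print(f'5th / 95th:\t{p5:.2f} / {p95:.2f}')
```

The sample standard deviation is always slightly larger than the population version, and the gap shrinks as the number of observations grows.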

#### Frequency histogram

* NumPy has a `histogram` function.
* You need to specify **bins** (frequency ranges) and supply the data

In [44]:
mybins = np.linspace(start=1.5, stop=5.5, num=9)
freq, bins = np.histogram(ed_data, bins=mybins, density=False)
print(bins)
print(freq)

[1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5]
[ 7 19 18 13 14  2  0  1]


In [45]:
print(mybins)

[1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5]
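
To make the relationship between bin edges and frequencies concrete, here is a minimal sketch using made-up values (an assumption for illustration) that sit on and between the edges. It shows that `np.histogram` returns one count per bin, so there is one fewer count than edges, and that every bin is half-open except the last, which also includes its right-hand edge.

```python
import numpy as np

# Hypothetical values chosen to sit on and between the bin edges
values = np.array([1.5, 1.9, 2.0, 2.4, 3.0])
edges = np.array([1.5, 2.0, 2.5, 3.0])

counts, returned_edges = np.histogram(values, bins=edges)

# Bins are [left, right), except the final bin which is [left, right]
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.1f} to {right:.1f}: {count}')

# One count per bin: one fewer count than edges
print(len(counts), len(returned_edges))
```

This is why the 9 edges produced by `np.linspace` above give 8 frequencies.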


## A note of caution: working with running statistics

In many computational modelling procedures you will need an estimate of statistics as the code executes. For example, you may need to track the mean or standard deviation of a performance measure as a multi-stage algorithm or a simulation model of a healthcare system executes. 

As we have seen, `numpy` provides highly efficient functions for calculating a mean or standard deviation from data held in an array.  I'm always tempted to make use of these built-in procedures. They are indeed fast and incredibly easy to use.  The downside is that you waste computation by repeatedly iterating over the whole array each time a new observation arrives.  The other option, which requires more careful thought, is a running estimate of your statistics.  In general, that's implemented in standard Python.  Let's look at an example where we compare recalculation using `numpy` functions with a running calculation of the mean and standard deviation in Python.

We will first modify the `AttendanceSummary` class to include an `update()` method.  It will accept a `np.ndarray` that it will use to recalculate the mean and standard deviation.

In [144]:
class AttendanceSummary:
    
    def __init__(self, data=None, decimal_places=2):
        """
        Summarises ED data using the mean and sample standard deviation

        Params:
        -------
        data: np.ndarray, optional (default = None) 
            Contains an initial data sample.
            
        decimal_places: int, optional (default=2)
            Summary decimal places.
        """
        if isinstance(data, np.ndarray):
            self.n = len(data)
            self.mean = data.mean()
            self.std = data.std(ddof=1)
        else:
            self.n = 0
            self.mean = None
            self.std = None
            
        self.dps = decimal_places
        
    def update(self, data):
        '''
        Update the mean and standard deviation using complete recalculation.
        
        Params:
        ------
        data: np.ndarray
            Vector of data
        '''
        self.n = len(data)
        
        # update the mean and std.
        self.mean = data.mean()
        self.std = data.std(ddof=1)
        
    
    def __str__(self):
        to_print = f'Mean:\t{self.mean:.2f}' \
             + f'\nStdev:\t{self.std:.2f}'
        
        return to_print

In [153]:
def test_complete_recalculation(data, start=10):
    summary = AttendanceSummary(data[:start])

    for i in range(start, len(data)+1):
        summary.update(data[:i])
    return summary

In [154]:
summary = test_complete_recalculation(ed_data)
print(summary)
print(summary.n)

Mean:	2.92
Stdev:	0.71
74


In [148]:
len(ed_data)

74

In [115]:
%timeit summary = test_complete_recalculation(ed_data)

1.14 ms ± 8.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
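
To get a feel for where the time goes, the sketch below gives a rough count of how many array elements complete recalculation has to read. `elements_read_full_recalculation` is a hypothetical helper written for this illustration, and the count is an approximation: it treats each call to `update` as a single pass over `data[:i]`, whereas the mean and the standard deviation each make their own pass.

```python
def elements_read_full_recalculation(n, start=10):
    """Total elements read when recomputing over data[:i] for i = start..n."""
    return sum(range(start, n + 1))


n_obs = 74  # number of observations in the ED dataset

full = elements_read_full_recalculation(n_obs)
running = n_obs  # a running update reads each observation exactly once

print(f'Complete recalculation:\t{full} element reads')
print(f'Running update:\t{running} element reads')
```

The number of reads under complete recalculation grows roughly with the square of the number of observations, while a running update grows linearly.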


In [180]:
class AttendanceSummary:
    
    def __init__(self, data=None, decimal_places=2):
        """
        Tracks a running mean and sample standard deviation of ED data

        Params:
        -------
        data: np.ndarray, optional (default = None) 
            Contains an initial data sample.
            
        decimal_places: int, optional (default=2)
            Summary decimal places.
        """
        if isinstance(data, np.ndarray):
            self.n = len(data)
            self.mean = data.mean()
            # sum of squared deviations from the mean
            self._sq = np.sum(np.square(data - self.mean))
        else:
            self.n = 0
            self.mean = None
            self._sq = None
            
        self.dps = decimal_places
    
    @property
    def variance(self):
        return self._sq / (self.n - 1)
    
    @property
    def std(self):
        return np.sqrt(self.variance)
    
    def update(self, x):
        '''
        Running update of mean and variance
        
        Params:
        ------
        x: float
            A single new observation
        '''
        self.n += 1
        
        # we need to do more work ourselves for running stats!
        if self.n == 1:
            self.mean = x
            self._sq = 0
        else:
            # update the mean
            next_mean = self.mean + ((x - self.mean) / self.n)
        
            # update the sum of squared deviations from the mean
            self._sq += (x - self.mean) * (x - next_mean)
            
            self.mean = next_mean
    
    def __str__(self):
        to_print = f'Mean:\t{self.mean:.5f}' \
             + f'\nStdev:\t{self.std:.5f}'
        
        return to_print
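
The `update` method above appears to follow the running-update scheme commonly attributed to Welford. As a sketch, write $\bar{x}_n$ for the mean after $n$ observations and $M_n$ for the running sum of squared deviations from the mean. These symbols are introduced here for illustration; in the code they are stored as `mean` and `_sq`. Starting from $\bar{x}_1 = x_1$ and $M_1 = 0$, each new observation $x_n$ gives:

$$
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}, \qquad
M_n = M_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \qquad
s_n^2 = \frac{M_n}{n - 1}
$$

Note that $M_n$ holds squared deviations from the mean rather than a raw sum of squares, which is why dividing it by $n - 1$ yields the sample variance directly.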

In [181]:
def test_online_calculation(data):
    summary = AttendanceSummary()

    for observation in data:
        summary.update(observation)
    return summary

In [182]:
summary = test_online_calculation(ed_data)
print(summary)

Mean:	2.91948
Stdev:	0.71092


In [183]:
%timeit summary = test_online_calculation(ed_data)

44.2 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [184]:
ed_data.mean()

2.919482262743243

In [170]:
ed_data.std(ddof=1)

0.7109174286360931
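
Rather than comparing printed values by eye, the agreement between the two approaches can be checked programmatically. The following is a minimal sketch under stated assumptions: `running_mean_std` is a hypothetical standalone function that mirrors the running update logic, and the data are synthetic values drawn from a normal distribution, not the ED dataset.

```python
import numpy as np


def running_mean_std(data):
    """Return (mean, sample stdev) from a single pass of running updates."""
    n = 0
    mean = 0.0
    sq = 0.0  # running sum of squared deviations from the mean
    for x in data:
        n += 1
        next_mean = mean + (x - mean) / n
        sq += (x - mean) * (x - next_mean)
        mean = next_mean
    # sample variance needs at least two observations
    return mean, np.sqrt(sq / (n - 1))


# Synthetic data for illustration only
rng = np.random.default_rng(42)
synthetic = rng.normal(loc=3.0, scale=0.7, size=74)

mean, std = running_mean_std(synthetic)

# Compare with numpy's array-based results, allowing for floating point error
print(np.isclose(mean, synthetic.mean()))
print(np.isclose(std, synthetic.std(ddof=1)))
```

Using `np.isclose` instead of `==` allows for the small floating point differences that arise because the two methods accumulate their sums in a different order.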