# 4.18 Intro to Data Science: Measures of Dispersion
* We've considered the measures of central tendency: mean, median and mode.
* These help us categorize typical values in a group.
    * You might use ***mean*** to determine the average height of a class.
    * You might use ***mode*** to determine the most frequently purchased car in a state.
* An entire group is called a **population**. 
* Sometimes a population is quite large, such as the people likely to vote in the next U.S. presidential election, which is a number in excess of 100,000,000 people. 
* For practical reasons, the polling organizations trying to predict who will become the next president work with carefully selected small subsets of the population known as **samples**. 
* Here we introduce **measures of dispersion** (also called **measures of variability**) that help you understand how **spread out** the values are.
* We’ll calculate each measure of dispersion both by hand and with functions from the module `statistics`, using the following population of 10 six-sided die rolls:
> 1, 3, 4, 2, 6, 5, 3, 4, 5, 2

### Variance 
* To determine population variance, begin with the mean (average) of these values—3.5. 
* Next, subtract the mean from every die value:
> -2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5
* Then, square each of these results (yielding only positives):
> 6.25, 0.25, 0.25, 2.25, 6.25, 2.25, 0.25, 0.25, 2.25, 2.25
* Finally, calculate the mean of these squares, which is 2.25 (22.5 / 10). This is the **population variance**.
* Squaring the difference between each die value and the mean of all die values emphasizes **outliers**—the values that are farthest from the mean—which can be important in data analysis.
* In data analytics, sometimes we want to pay careful attention to outliers (a large transaction on a credit card) and sometimes we want to ignore them.
* The following code uses the `statistics` module’s `pvariance` function to confirm our manual result:

In [1]:
import statistics

In [2]:
statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])

2.25
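
The manual steps above can also be sketched directly in Python. This is a minimal illustration, not part of the original notes; the variable names (`rolls`, `deviations`, `squares`) are chosen here for readability.

```python
import statistics

# The population of 10 six-sided die rolls used throughout this section.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Step 1: the mean of the values.
mean = sum(rolls) / len(rolls)

# Step 2: subtract the mean from every die value.
deviations = [value - mean for value in rolls]

# Step 3: square each difference (yielding only non-negative values).
squares = [deviation ** 2 for deviation in deviations]

# Step 4: the mean of the squares is the population variance.
variance = sum(squares) / len(squares)

print(mean)      # 3.5
print(variance)  # 2.25
print(variance == statistics.pvariance(rolls))  # True
```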

### Standard Deviation
* The standard deviation is the square root of the variance (in this case, 1.5), which tones down the effect of the outliers.
* The smaller the variance and standard deviation are, the closer the data values are to the mean and the less overall dispersion (that is, spread) there is between the values and the mean. 
* The following code calculates the ***population standard deviation*** with the `statistics` module’s `pstdev` function, confirming our manual result:

In [3]:
statistics.pstdev([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])

1.5

In [4]:
import math

In [5]:
math.sqrt(statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2]))


1.5

### Summary
* A variance of zero indicates that all of the data values are identical. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another.
* The standard deviation measures the typical amount of variability in your data set. It tells you, roughly, how far a typical value lies from the mean.
* If the data points are further from the mean, there is a higher deviation within the data set; thus, the more spread out the data, the higher the standard deviation.
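
As a small illustration of these three cases, the sketch below compares made-up data sets with the same mean-like center but different spreads. The lists `identical`, `close` and `spread` are hypothetical values chosen for this example, not data from the notes.

```python
import statistics

# Hypothetical data sets chosen to show zero, small and high variance.
identical = [3, 3, 3, 3]   # every value is the same
close = [3, 3, 4, 4]       # values sit near the mean of 3.5
spread = [1, 1, 6, 6]      # values sit far from the mean of 3.5

for name, data in [('identical', identical), ('close', close), ('spread', spread)]:
    variance = statistics.pvariance(data)
    stdev = statistics.pstdev(data)
    print(f'{name:>9}: variance={variance}, standard deviation={stdev}')
```

The identical values give a variance of zero, while the more spread-out values give both a higher variance (6.25 versus 0.25) and a higher standard deviation (2.5 versus 0.5).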

### Advantage of Population Standard Deviation vs. Population Variance
* Suppose you’ve recorded the March Fahrenheit temperatures in your area. 
* You might have 31 numbers such as 19, 32, 28 and 35. 
* The units for these numbers are degrees.
* When you square the differences between your temperatures and their mean to calculate the population variance, the units of the population variance become **“degrees squared.”**
* When you take the square root of the population variance to calculate the population standard deviation, the units once again become **degrees**, which are the same units as your temperatures.
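
To make the units concrete, here is a brief sketch that uses only the four example temperatures mentioned above (an assumption for illustration; a real March data set would have 31 values).

```python
import statistics

# Only the four sample temperatures named in the text, in degrees Fahrenheit.
temperatures = [19, 32, 28, 35]

variance = statistics.pvariance(temperatures)  # units: degrees squared
stdev = statistics.pstdev(temperatures)        # units: degrees

print(f'population variance: {variance} degrees squared')
print(f'population standard deviation: {stdev:.2f} degrees')
```

Because the standard deviation is in degrees, it can be compared directly with the temperatures themselves, which is harder to do with a value in degrees squared.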

### Population Standard Deviation vs. Sample Standard Deviation
The **population** standard deviation is a parameter: a fixed value calculated from every individual in the population. A **sample** standard deviation is a statistic: it is calculated from only some of the individuals in a population, so its value varies from sample to sample. A given sample's standard deviation may be smaller or larger than the population's. Dividing by n - 1 rather than n, as described in the steps that follow, compensates for a sample's tendency to underestimate the spread of the population.

The calculations for both of these standard deviations begin with the same steps:

* Calculate the mean.
* Subtract the mean from each value to obtain deviations from the mean.
* Square each of the deviations.
* Add together all of these squared deviations.

How the calculation of these standard deviations differs:

* If we are calculating the population standard deviation, we divide the sum of the squared deviations by n, the number of data values. This quotient is the population variance.
* If we are calculating the sample standard deviation, we divide the sum of the squared deviations by n - 1, one less than the number of data values. This quotient is the sample variance.
* The final step in either case is to take the square root of the quotient from the previous step, which gives the population standard deviation or the sample standard deviation.

The larger the value of n, the closer the population and sample standard deviations computed from the same data will be.
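
The difference between the two calculations can be sketched with the die rolls from earlier, treating them as a sample purely for comparison. The manual variable names are illustrative; `statistics.variance` and `statistics.stdev` are the module's sample counterparts of `pvariance` and `pstdev`.

```python
import math
import statistics

rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]
n = len(rolls)
mean = sum(rolls) / n

# Shared steps: sum of the squared deviations from the mean.
sum_of_squares = sum((value - mean) ** 2 for value in rolls)

# Divide by n for the population, by n - 1 for a sample.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# The final step in both cases: take the square root.
population_stdev = math.sqrt(population_variance)
sample_stdev = math.sqrt(sample_variance)

print(population_variance, statistics.pvariance(rolls))  # 2.25 2.25
print(sample_variance, statistics.variance(rolls))       # 2.5 2.5
print(population_stdev, statistics.pstdev(rolls))        # 1.5 1.5
print(math.isclose(sample_stdev, statistics.stdev(rolls)))  # True

# For the same data, sample stdev / population stdev = sqrt(n / (n - 1)),
# which approaches 1 as n grows.
ratios = {size: math.sqrt(size / (size - 1)) for size in (10, 100, 1000)}
for size, ratio in ratios.items():
    print(f'n={size}: ratio={ratio:.4f}')
```

The ratio shrinks toward 1 as n increases, which is why the two results are nearly indistinguishable for large data sets.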

In [5]:
help(statistics)

Help on module statistics:

NAME
    statistics - Basic statistics module.

MODULE REFERENCE
    https://docs.python.org/3.7/library/statistics
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides functions for calculating statistics of data, including
    averages, variance, and standard deviation.
    
    Calculating averages
    --------------------
    
    Function            Description
    mean                Arithmetic mean (average) of data.
    harmonic_mean       Harmonic mean of data.
    median              Median (middle value) of data.
    median_low          Low median of data.
    median_high         High median of data.
    median_grouped      Median, or 50th per

[Self Check 4.18](./snippets_ipynb/04_18selfcheck.ipynb)

------
&copy;1992&ndash;2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 4 of the book [**Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud**](https://amzn.to/2VvdnxE).

DISCLAIMER: The authors and publisher of this book have used their 
best efforts in preparing the book. These efforts include the 
development, research, and testing of the theories and programs 
to determine their effectiveness. The authors and publisher make 
no warranty of any kind, expressed or implied, with regard to these 
programs or to the documentation contained in these books. The authors 
and publisher shall not be liable in any event for incidental or 
consequential damages in connection with, or arising out of, the 
furnishing, performance, or use of these programs.                  