In [None]:
%matplotlib inline
%load_ext rpy2.ipython
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
from numpy import mean, median, sqrt, std

# stats60 specific
from code.week1 import (stylized_density,
                        SD_rule_of_thumb_normal, 
                        SD_rule_of_thumb_skewed)
from code.week2 import pearson_lee
from code.week1 import find_percentile
from code.utils import sample_density
figsize = (8,8)

## Numeric summaries

In the last set of notes, we saw *histograms* as a useful
way to summarize a *sample* of numbers.

In this set of notes, we will look at numeric summaries that boil
down the histogram to a set of a few numbers.

In particular, we will look at:
    
- **average** or **mean**;
- **SD** and **SD+**;
- **median**.
    
Along the way, we will discuss **standardized units.**

## Average

- **The average of a list of numbers equals their sum divided by how many there are. The average is also called the *mean*.**

- <font color="#820000">Example: Compute the average of the sample: [1,4,6,7,8].
</font>

- <font color="blue">
The answer is (1+4+6+7+8)/5 = 26/5 = 5.2
</font>


In [None]:
mean([1,4,6,7,8])

### Summation notation

We will sometimes use summation notation, which is written with the Greek
letter ["Sigma"](http://en.wikipedia.org/wiki/Summation).

* Call our list $X=[X_1, \dots, X_n]$.
* We often write the sum $$X_1 + X_2 + \dots + X_n = \sum_{i=1}^n X_i.$$
* The mean of a list $X = [X_1, \dots, X_n]$ is often written as $$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$$
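
As a small sketch, the notation above maps directly onto Python's built-in `sum`. The variable names (`total`, `X_bar`) are illustrative only and are not used elsewhere in these notes.

```python
# Sketch: the Sigma notation written out in plain Python.
X = [1, 4, 6, 7, 8]        # the sample from the earlier example
n = len(X)                 # how many entries there are
total = sum(X)             # X_1 + X_2 + ... + X_n
X_bar = total / float(n)   # the mean: (1/n) times the sum of the X_i
print('sum = %s, mean = %s' % (total, X_bar))
```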

## Average

The average can be thought of as the "balancing point" of the sample.

In [None]:
%%capture
# Unnormalized counts are matplotlib's default, so no normalization flag is set here
hist_opts=dict(align='mid', color='#820000', 
       histtype='stepfilled')

with plt.xkcd():
    average_fig = plt.figure(figsize=(7,7))
    a = plt.gca()
    data = [1,4,6,7,8]
    a.hist(data, bins=np.linspace(0,10,101), **hist_opts) 
    annotation = a.annotate('Balance point', xy=(np.mean(data), 0),
                  arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.2),
                  fontsize=30,
                  horizontalalignment='center')
    a.set_yticks([])
    a.set_ylim([0,1.2])

In [None]:
average_fig

### Average as balancing point

- Let's try to balance the sample [1,4,6,7,8] at some point $m$. 

- For each $m$, some of the sample points are greater than or equal
to $m$, while the rest are less than $m$. Call these lists $U(m), L(m)$ respectively.

- <font color="#820000">
Example: Suppose $m=6.5$,  then $U(6.5) = [7,8]$ and $L(6.5) = [1,4,6]$.
</font>

- We say the sample is balanced at $m$ if 
<font color="#820000">
    sum([v-m for v in U(m)]) = sum([m-v for v in L(m)])
</font>

### Average as balancing point

- **Example: Suppose `m=6.5`**  then 
   * sum([v-m for v in U(m)]) = sum([7-6.5, 8-6.5]) = 2
   * sum([m-v for v in L(m)]) = sum([6.5-1, 6.5-4, 6.5-6]) = 8.5

- **Example: Suppose `m=5.2`** then 
   * sum([v-m for v in U(m)]) = sum([6-5.2, 7-5.2, 8-5.2]) = 5.4
   * sum([m-v for v in L(m)]) = sum([5.2-4, 5.2-1]) = 5.4
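
The two examples can be checked with a short sketch. The helper name `balance_sides` is made up for illustration; it simply builds $U(m)$ and $L(m)$ and returns the two sums from the balance condition.

```python
def balance_sides(sample, m):
    """Return (sum of v-m over U(m), sum of m-v over L(m))."""
    upper = [v for v in sample if v >= m]   # U(m): entries at or above m
    lower = [v for v in sample if v < m]    # L(m): entries below m
    right = sum(v - m for v in upper)
    left = sum(m - v for v in lower)
    return right, left

sample = [1, 4, 6, 7, 8]
for m in [6.5, 5.2]:
    right, left = balance_sides(sample, m)
    # rounding only hides floating point noise in the printout
    print('m = %s: right = %s, left = %s' % (m, round(right, 10), round(left, 10)))
```

The two sides agree only at $m=5.2$, which is the average of the sample.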


In [None]:
annotation.set_text('Average')
average_fig

### Average of [1,4]

In [None]:
%%capture
with plt.xkcd():
    average_fig2 = plt.figure(figsize=figsize)
    a = plt.gca()
    data = [1,4]
    a.hist(data, bins=np.linspace(0,10,101), **hist_opts)
    a.annotate('Average', xy=(np.mean(data), 0),
                  arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.2),
                  fontsize=20,
                  horizontalalignment='center')
    a.set_yticks([])
    a.set_ylim([0,1.2])
    a.set_xlim([0,5])

In [None]:
average_fig2

### Average of [1,1,1,1,4,4]

In [None]:
%%capture
with plt.xkcd():
    average_fig3 = plt.figure(figsize=figsize)
    a = plt.gca()
    data = [1]*4+[4]*2
    a.hist(data, bins=np.linspace(0,10,101), **hist_opts)
    a.annotate('Average', xy=(np.mean(data), 0),
                  arrowprops=dict(facecolor='black'), 
                  xytext=(np.mean(data),-0.7),
                  fontsize=20,
                  horizontalalignment='center')
    a.set_yticks([])
    a.set_ylim([0,4.3])
    a.set_xlim([0,5])

In [None]:
average_fig3

## Average balances the list
 
- Another way to say that the average balances the list is:
**the sum of the deviations from average is always zero.**


### Example: [1,1,1,1,4,4]

- The average is 2.
- Deviations from average are [-1,-1,-1,-1,2,2].
- Sum of deviations from average: 0. **This is always true.**
- In $\Sigma$ notation: for any list $[X_1, \dots, X_n]$:
$$\sum_{i=1}^n (X_i - \bar{X}) = 0.$$
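
One way to see why this always holds uses only the definition of $\bar{X}$, which says that $\sum_{i=1}^n X_i = n \bar{X}$:
$$\sum_{i=1}^n (X_i - \bar{X}) = \sum_{i=1}^n X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0.$$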


## Average balances the list

In [None]:
data = [1,1,1,1,4,4]
data_average = mean(data)
data_average

In [None]:
deviations = [d - data_average for d in data]
deviations

In [None]:
sum(deviations)

### Lists with repeated entries

* With repeats, we can think of having a "weight" of 4/6 at 1 and 2/6 at 4.

* Each weight is the area of the histogram near that value.
    
* The average is the weighted sum $4/6 \times 1 + 2/6 \times 4 = 2$.

* The deviations from average are $-1=(1-2)$ at 1 and $2=(4-2)$ at 4.

* Weighted sum of deviations from average: $$4/6 \times (-1) + 2/6 \times 2 = 0.$$
* This is how we can compute the average of a histogram.
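
As a rough sketch of the weighted calculation, the weights can be built from the counts of each distinct value. The names `weights`, `weighted_average` and `weighted_deviations` are chosen here for illustration.

```python
from collections import Counter

data = [1, 1, 1, 1, 4, 4]
counts = Counter(data)                 # how often each distinct value occurs
n = float(len(data))
# weight of a value = fraction of the list equal to it (4/6 at 1, 2/6 at 4)
weights = dict((value, c / n) for value, c in counts.items())

weighted_average = sum(w * value for value, w in weights.items())
weighted_deviations = sum(w * (value - weighted_average)
                          for value, w in weights.items())
print('weighted average: %s' % weighted_average)
print('weighted sum of deviations: %s' % weighted_deviations)
```

Up to floating point rounding, the weighted average matches the ordinary average and the weighted deviations sum to zero.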


## Average of a histogram

- Each bar in a histogram has a midpoint $M$ and an area $A$.

- The average of the histogram is 
          sum([M*A for M, A in histogram])
    
- When the bars get narrower and narrower, this sum looks like the
[midpoint rule](http://en.wikipedia.org/wiki/Rectangle_rule) for integrating $u \cdot f(u)$, where $f(u)$ is the height of the histogram at $u$:
$$
\int_{-\infty}^{\infty} u \cdot f(u) \; du.
$$
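
The sum over bars can be sketched with `np.histogram`, under the assumption that the histogram is drawn on the density scale (total area 1). The simulated sample and the seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)     # fixed seed so the sketch is repeatable
sample = rng.beta(2, 6, 10000)     # simulated data, for illustration only

# density=True rescales the bar heights so that the total area equals 1
heights, edges = np.histogram(sample, bins=np.linspace(0, 1, 21), density=True)
midpoints = (edges[:-1] + edges[1:]) / 2.0   # M for each bar
areas = heights * np.diff(edges)             # A = height times width

hist_average = np.sum(midpoints * areas)
print('average of the histogram: %0.4f' % hist_average)
print('average of the sample:    %0.4f' % sample.mean())
```

The two numbers are close but generally not identical, because the histogram only records which bar each entry fell into.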

In [None]:
%%capture
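# `normed` was removed from recent matplotlib releases; `density` is its replacement
hist_opts.pop('normed', None)
hist_opts['density'] = True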
hist_fig = plt.figure(figsize=figsize)
a = plt.gca()
data = np.random.beta(2,6,10000)
a.hist(data, bins=np.linspace(0,1,21), **hist_opts)
a.annotate('Average', xy=(np.mean(data), 0),
           arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.5),
           fontsize=20)
a.set_yticks([])
hist_ax = a

In [None]:
hist_fig

## Median

- This is another summary of a sample or a histogram.

- Like the average, it gives a quantitative *center* or *location* to the sample.


**The median of a histogram is the number with half the area to the left and half the area to the right.**
 

In [None]:
hist_ax.annotate('Median', xy=(np.median(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.median(data)+0.1,2.5),
               fontsize=20)

In [None]:
hist_fig

### California population by age

Age group | Count | Percentage
--- | --- | ---
0-20 | 10000000 | 29%
20-55 | 17500000 | 52%
55-75 | 4500000 | 13%
75+ | 2000000 | 6%
**Total** | 34000000 | 100%

Each percentage is the count divided by the total; for example, 17500000 / 34000000 ≈ 52%.

In [None]:
%%capture

def CAdensity(figsize=(8,8)):
    bins = [0,20,55,75,100]
    count = [29,52,13,6]
    hist_fig = plt.figure(figsize=figsize)  # use the size passed by the caller
    data = np.array([10]*29 + [30]*52 + [60]*13 + [80.]*6)
    hist_plot, dens, CDF = sample_density(data, bins=bins, alpha=0.5, ax=hist_fig.gca(),
                            facecolor='gray')
    hist_plot.set_ylabel('Percentage per year (%/year)', fontsize=20)
    hist_plot.set_xlabel('Age (years)', fontsize=20)
    hist_plot.set_title('California population by age groups', fontsize=22)
    def area(a, b):
        return np.round(100*(CDF(b) - CDF(a)), 1)
    return hist_fig, dens, area

CAhist, CAdens, CAarea = CAdensity(figsize=figsize)

In [None]:
CAhist

In [None]:
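# 29% of the area lies below age 20, so the median needs another 21%
# out of the 52% in the 20-55 bar (ages are spread evenly within a bar).
CAmedian = 20 + (0.21 / 0.52) * (55-20) 
print('The median is: %0.1f' % CAmedian)
print('The gray area between 0 and the median is: %0.1f' % CAarea(0, CAmedian))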

In [None]:
%%capture
ax = CAhist.gca()
interval = np.linspace(0.,CAmedian,501)
ax.fill_between(interval, 0*interval, CAdens(interval), 
                facecolor='blue', hatch='/')

In [None]:
CAhist

According to this histogram, the median age of Californians is 34.1 years.
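
The same calculation can be sketched for any histogram given as bin edges and percentages. The function name `histogram_median` is hypothetical, and the sketch assumes that values are spread evenly within each bar.

```python
def histogram_median(edges, percents):
    """Median of a histogram from its bin edges and the percent in each bin.

    Assumes the data are spread evenly inside each bin.
    """
    cumulative = 0.0
    for left, right, pct in zip(edges[:-1], edges[1:], percents):
        if cumulative + pct >= 50.0:
            # fraction of this bar needed to bring the area up to 50%
            fraction = (50.0 - cumulative) / pct
            return left + fraction * (right - left)
        cumulative += pct
    raise ValueError('percents should add up to at least 50')

print('%0.1f' % histogram_median([0, 20, 55, 75, 100], [29, 52, 13, 6]))
```

For the California table this reproduces the value of about 34.1 years.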



### Examples of medians of histograms


- **When the histogram is symmetric, the average and the median are equal.**

- Let's look at a more skewed example.


In [None]:
%%capture
hist_fig2 = plt.figure(figsize=figsize)
a = plt.gca()
data = np.random.beta(5,2.2,25000) + np.random.standard_exponential(25000)
a.hist(data, bins=np.linspace(0,5,21), **hist_opts)
a.annotate('Average', xy=(np.mean(data), 0),
           arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.1),
           fontsize=20)
a.set_yticks([])
a.annotate('Median', xy=(np.median(data), 0),
           arrowprops=dict(facecolor='black'), xytext=(np.median(data)+0.2,0.6),
           fontsize=20)

In [None]:
hist_fig2

### Median of a list of numbers

Defined similarly to the median of a histogram: put half the data on the left and half the
data on the right.

1. Sort the numbers from smallest to largest.
2. If the length of the list is odd, the median is the middle entry of the sorted values.
3. Else, the median is the average of the two middle entries.
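
The three steps can be sketched as a small function. The name `list_median` is chosen here for illustration; in practice `numpy.median` does the same job.

```python
def list_median(values):
    """Median of a list, following the three steps above."""
    ordered = sorted(values)          # step 1: sort from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # step 2: odd length, take the middle entry
        return ordered[middle]
    # step 3: even length, average the two middle entries
    return (ordered[middle - 1] + ordered[middle]) / 2.0

print(list_median([2, 10, 4]))
print(list_median([2, 10, 4, 6]))
```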

<h3>
<font color="#820000">
Example: median of [1,4,2,9,8]
</font>
</h3>

1. The sorted values are [1,2,4,8,9].

2. Since the length of the list is 5, the median is the middle entry of the sorted values. **The median is 4.**

<h3>
<font color="#820000">
Example: median of [1,11,3,7,8,3]
</font>
</h3>

1. The sorted values are [1,3,3,7,8,11].
2. Since the length of the list is 6, the median is the average of the middle entries. **The median is (3+7)/2=5.**

In [None]:
sorted([1,4,2,9,8]), median([1,4,2,9,8])

In [None]:
sorted([1,11,3,7,8,3]), median([1,11,3,7,8,3])

### Medians of lists vs. medians of histograms

- **The median of the list of ages of all Californians might be different from 34.1. Why?**

- Below is an example using the heights of mothers from Pearson's study and
a rather coarse histogram.

In [None]:
%%capture
np.random.seed(10)
height = pearson_lee()
mother = height.M #+ np.random.sample(height.M.size)
mother_fig = plt.figure(figsize=figsize)
mother_dens = sample_density(mother, bins=5)[1]
median_mother_hist = find_percentile(mother_dens, 0.5)
ax = mother_fig.gca()
ax.set_xlabel('Height (inches)', fontsize=15)
ax.set_ylabel('Percentage per inch (%/inch)', fontsize=15)
ax.set_title('Median of the histogram is %0.2f' % median_mother_hist, fontsize=15)

In [None]:
mother_fig

But the median of the list of mothers' heights is slightly different.

In [None]:
median(mother)

### Median and average

- **What is the mean of [3,7,4,11,5]? The median?**

- **What is the mean of [3,37,4,41,5]? The median?**

- **What do these examples tell us about the mean and median?**

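These examples tell us that the mean is more sensitive than the median to changes far from
the center: moving 7 and 11 out to 37 and 41 raises the mean from 6 to 18 but leaves the median at 5. Statisticians say the median is more **robust** than the mean.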
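
A quick check of the two lists, as a sketch using NumPy's `mean` and `median`:

```python
import numpy as np

original = [3, 7, 4, 11, 5]
modified = [3, 37, 4, 41, 5]   # two entries moved far away from the center

for sample in (original, modified):
    print('%s: mean = %s, median = %s'
          % (sample, np.mean(sample), np.median(sample)))
```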

### Symmetric histogram

In [None]:
%%capture
with plt.xkcd():
    symmetric = plt.figure(figsize=figsize)
    a = plt.gca()
    data = np.random.standard_normal(25000) + 3
    a.hist(data, bins=np.linspace(-1,7,21), **hist_opts)
    a.annotate('Average', xy=(np.mean(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.08),
               fontsize=20)
    a.set_yticks([])
    a.annotate('Median', xy=(np.median(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.median(data)+1.5,0.2),
               fontsize=20)
    a.set_title("Symmetric histogram: average = median", fontsize=20, color='red')

In [None]:
symmetric

In [None]:
%%capture
with plt.xkcd():
    skewed_right = plt.figure(figsize=figsize)
    a = plt.gca()
    data = np.random.beta(5,2.2,25000) + np.random.standard_exponential(25000)
    a.hist(data, bins=np.linspace(0,5,21), **hist_opts)
    a.annotate('Average', xy=(np.mean(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.1),
               fontsize=20)
    a.set_yticks([])
    a.annotate('Median', xy=(np.median(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.median(data)+0.2,0.6),
               fontsize=20)
    a.set_title("Skewed right histogram: average > median", fontsize=20, color='red')

In [None]:
skewed_right

In [None]:
%%capture
with plt.xkcd():
    skewed_left = plt.figure(figsize=figsize)
    a = plt.gca()
    data = np.random.beta(5,2.2,25000) - 0.5 * np.random.standard_exponential(25000)
    a.hist(data, bins=np.linspace(-5,2, 41), **hist_opts)
    a.annotate('Average', xy=(np.mean(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.mean(data),-0.2),
               fontsize=20)
    a.set_yticks([])
    a.annotate('Median', xy=(np.median(data), 0),
               arrowprops=dict(facecolor='black'), xytext=(np.median(data)-2.2,0.6),
               fontsize=20)
    a.set_title("Skewed left histogram: average < median", fontsize=20, color='red')

In [None]:
skewed_left

## Scale or spread of a data set

- Both average and median summarize the *center* or **location** of a sample.

- Is this everything there is to say about a sample?

In [None]:
%%capture
with plt.xkcd():
    wide_fig = plt.figure(figsize=figsize)
    X = np.random.standard_normal(5000) + 1
    a1 = sample_density(X, bins=50, histtype='stepfilled')[0] 
    a1.set_yticks([])
    a1.set_xlim([-2,4])
    a1.set_title('Histogram of a sample', fontsize=20)

In [None]:
wide_fig

In [None]:
%%capture
with plt.xkcd():
    narrow = plt.figure(figsize=figsize)
    Y = (X-1)*0.5 + 1
    a = sample_density(Y, bins=50, histtype='stepfilled',
                     facecolor='#1C2045')[0] 
    a.set_yticks([])
    a.set_xlim([-2,4])
    a.set_title('A sample with less spread than the previous one', fontsize=20)

In [None]:
narrow

In [None]:
%%capture
with plt.xkcd():
    f = plt.figure(figsize=figsize)
    Y = (X-1)*0.5 + 1
    a = sample_density(Y, bins=50, histtype='stepfilled',
                     facecolor='#1C2045', ax=a1, alpha=0.5)[0] 
    a.set_yticks([])
    a.set_xlim([-2,4])
    a.set_title('Both samples have the same mean...', fontsize=20)


In [None]:
wide_fig

In [None]:
%%capture
a1.set_title('But their spread is different...', fontsize=20)


In [None]:
wide_fig

## Standard deviation

### Concept:

* A measure of **spread**. The larger the SD the more **spread out** the sample is.
* The SD says how far numbers on a list are away from their average.
* Its units are in the original units of the list.
* It is never negative, and it is zero only when every entry on the list is the same.
* Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.


### Rules of thumb based on SD

* **Roughly 68 % of the entries on a list are within one SD of the average.**
* **Roughly 95 % of the entries on a list are within two SDs of the average.**
* These rules hold for many data sets, not all.
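
The rules of thumb can be tried on simulated data. This is only a sketch: the helper `fraction_within`, the two simulated samples and the seed are choices made for this illustration.

```python
import numpy as np

def fraction_within(sample, k):
    """Fraction of entries within k SDs of the average."""
    sample = np.asarray(sample, dtype=float)
    center, sd = sample.mean(), sample.std()
    return np.mean(np.abs(sample - center) <= k * sd)

rng = np.random.RandomState(1)                # fixed seed for repeatability
bell = rng.standard_normal(20000)             # bell-shaped sample
skewed = rng.standard_exponential(20000)      # right-skewed sample

for name, sample in [('bell-shaped', bell), ('skewed', skewed)]:
    print('%s: within 1 SD = %0.3f, within 2 SDs = %0.3f'
          % (name, fraction_within(sample, 1), fraction_within(sample, 2)))
```

The bell-shaped sample follows the rules closely, while the skewed sample has noticeably more than 68% of its entries within one SD of the average.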


### Rules of thumb

In [None]:
%%capture
sd_fig1 = plt.figure(figsize=figsize)
ax_sd1 = SD_rule_of_thumb_normal(1, facecolor='gray', alpha=0.5)

In [None]:
sd_fig1

In [None]:
%%capture
sd_fig2 = plt.figure(figsize=figsize)
ax_sd2 = SD_rule_of_thumb_normal(2, facecolor='gray', alpha=0.5)


In [None]:
sd_fig2

In [None]:
%%capture
sd_fig3 = plt.figure(figsize=figsize)
ax_sd1 = SD_rule_of_thumb_skewed(1, facecolor='gray', alpha=0.5)


In [None]:
sd_fig3

In [None]:
%%capture
sd_fig4 = plt.figure(figsize=figsize)
ax_sd1 = SD_rule_of_thumb_skewed(2, facecolor='gray', alpha=0.5)


In [None]:
sd_fig4

Let's try with some of our earlier examples of continuous
histograms.

In [None]:
%%capture

with plt.xkcd():
    skew_left = plt.figure(figsize=figsize)
    sample = list(np.random.beta(2,1, size=15000)) + list(np.random.beta(10,1.5, size=15000)) 
    stylized_density(sample, ax=skew_left.gca(), alpha=0.5, mult=1)

In [None]:
skew_left

In [None]:
%%capture
with plt.xkcd():
    skew_right = plt.figure(figsize=figsize)
    sample = list(np.random.beta(1, 2.2, size=25000)) + list(np.random.beta(1,1, size=5000)) 
    stylized_density(sample, ax=skew_right.gca(), mult=2)

In [None]:
skew_right

In [None]:
%%capture
with plt.xkcd():
    bimodal = plt.figure(figsize=figsize)
    sample = (list(np.random.standard_t(40, size=50000)) + 
              list(np.random.standard_t(30, size=20000) + 5))
    stylized_density(sample, ax=bimodal.gca(), mult=1)


In [None]:
bimodal

### Root Mean Square


    r.m.s.(list) = square_root(
                    mean([value^2 for 
                          value in list]))


Example: `r.m.s.([0, 5, 8, -3])`

1. First compute `mean([0,5^2,8^2,(-3)^2]) = mean([0,25,64,9]) = 98/4 = 24.5`.

2. Take square root: $\sqrt{24.5} \approx 4.9$.

3. **The answer is 4.9.**

In [None]:
sqrt(mean([v**2 for v in [0,5,8,-3]]))

### Root Mean Square in summation notation

* Call our list $X=[X_1, \dots, X_n]$.
* Then, using our summation notation:
$$\text{r.m.s.}(X) = \sqrt{\frac{1}{n}\sum_{i=1}^n X_i^2} = \left(\frac{1}{n}\sum_{i=1}^n X_i^2 \right)^{1/2}.$$

### Computing the SD

Now that we've learnt about the rule of thumb, let's compute the
SD of a list.

* Given a list, define 

      deviations from average(list) = \
        [entry - average(list) for entry in list]
    
    
* The SD is computed as 

      SD(list) = r.m.s.(deviations from average(list))
  

### Computing the SD


* In $\Sigma$ notation: 
$$
\text{SD}([X_1,\dots,X_n]) = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2}.$$

## Example: Compute SD([20,30,25,25])

1. Compute the average: 

    average([20,30,25,25]) = (20+30+25+25)/4 = 100/4 = 25
    
2. Compute deviations from average:

    deviations from average([20,30,25,25]) = [20-25, 30-25, 25-25, 25-25] = [-5, 5, 0, 0]
    
3. Compute the root mean square of this last list:
$$
    \text{r.m.s.}([-5, 5, 0, 0]) = \sqrt{((-5)^2 + 5^2 + 0^2 + 0^2)/4} = \sqrt{50/4} \approx 3.5 $$
    
4. **The answer is 3.5.**

### Calculation in table form

#### Step 1: Compute the average

Entry | Data | Deviation | Deviation$^2$
------|-----|-----------|---------------
1     |  20 |           |   
2     |  30 |           |   
3     |  25 |           | 
4     |  25 |           | 
Total |  100 |           |

**The average is 100/4 = 25.**

#### Step 2: Compute the deviations and the squared deviations

Entry | Data | Deviation | Deviation$^2$
------|-----|-----------|---------------
1     |  20 |    -5     |    25
2     |  30 |     5     |    25
3     |  25 |     0     |    0
4     |  25 |     0     |    0
Total | 100 |    (not needed, but always 0)       |    50

#### Step 3: Compute the root mean square

The mean of the squared deviations is 50/4, so the root
mean square is $\sqrt{50/4}\approx 3.5$.

In [None]:
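data = [20,30,25,25]
data_average = mean(data)
print('average is: %s' % data_average)
# deviation of each entry from the average
deviations = [value - data_average for value in data]
print('deviations are: %s' % deviations)
# SD is the root mean square of the deviations; numpy's std should agree
SD = sqrt(mean([value**2 for value in deviations]))
print('SD is: %s (numpy std gives %s)' % (SD, std(data)))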

### SD versus SD$^+$

- Some calculators (and software) compute a different version of SD than we will use.
- The difference depends on the length of the list.
- If the length of the list is $n$, then 
$$
\text{SD(list)} = \sqrt{\frac{n-1}{n}} \cdot \text{SD$^+$(list)}.
$$

- In $\Sigma$ notation
$$
\text{SD$^+$}([X_1, \dots, X_n]) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2}.
$$


### Example: Compute SD$^+$([20,30,25,25]) 

- We already saw that 
$\text{SD}([20,30,25,25])=\sqrt{50/4}$.

- So, by our new definition $\text{SD}^+([20,30,25,25])=\sqrt{50/4} \cdot \sqrt{4/3} = \sqrt{50/3} \approx 4.1$.

- In small samples, there is a big difference between the two, but as the sample
gets larger, the two quantities get closer and closer.
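
In NumPy, the two versions correspond to the `ddof` argument of `np.std`: the default `ddof=0` divides by $n$ (our SD), while `ddof=1` divides by $n-1$ (SD$^+$). A short sketch:

```python
import numpy as np

data = [20, 30, 25, 25]
n = len(data)
sd = np.std(data)                # divides by n (ddof=0 is the default)
sd_plus = np.std(data, ddof=1)   # divides by n - 1
print('SD = %0.2f, SD+ = %0.2f' % (sd, sd_plus))
print('sqrt((n-1)/n) * SD+ = %0.2f' % (np.sqrt((n - 1.0) / n) * sd_plus))

# The ratio SD / SD+ = sqrt((n-1)/n) approaches 1 as the list gets longer.
for size in [4, 40, 400, 4000]:
    print('n = %d: ratio = %0.4f' % (size, np.sqrt((size - 1.0) / size)))
```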

### Changing location and scale

It is useful to see how `average` and `SD` change when we switch units.

- Suppose you are told that the average maximum temperature in Palo Alto on April 1 over the last 20 years is 70F with an SD of 6F.
The rule for converting Fahrenheit to Celsius is 
$$
T_C = \frac{5}{9} ( T_F - 32).
$$

**What would the average and SD be if you used C (Celsius) instead?**



### Example solution

1. Subtracting 32F lowers the mean of our list by 32, but does not change the SD. (Check this)

2. Multiplying this new list by  5/9 changes the SD by a factor of 5/9. (Check this)

3. **The mean in Celsius is $(70-32) \times 5/9 \approx 21$ C with an SD of $6 \times 5/9 \approx 3.3$ C.**
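
The two "Check this" steps can be tried numerically. The temperatures below are simulated purely for illustration (they are not real Palo Alto data), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(2)
# Made-up Fahrenheit temperatures, for illustration only
temps_F = 70 + 6 * rng.standard_normal(20)

shifted = temps_F - 32           # step 1: subtract 32
temps_C = shifted * 5.0 / 9.0    # step 2: multiply by 5/9

for name, values in [('Fahrenheit', temps_F), ('minus 32', shifted),
                     ('Celsius', temps_C)]:
    print('%-10s mean = %6.2f, SD = %5.2f'
          % (name, np.mean(values), np.std(values)))
```

Subtracting 32 moves the mean but leaves the SD untouched; multiplying by 5/9 rescales both.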

## Mean and SD under change of units

- Suppose we have a list of 50 measurements in "old units"

       [m_1, ..., m_50]

- We want to convert to new units

       [t_1, ..., t_50]

- The transformation of units can be represented as:

       t = a * m + b
       

## Mean and SD under change of units

- The average transforms like

       mean([t[i] for i in range(50)]) = 
                a * mean([m[i] for i in range(50)]) + b
       
- The SD transforms like:

       SD([t[i] for i in range(50)]) =
                |a| * SD([m[i] for i in range(50)])
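
A short derivation of both rules, written for a list of general length $n$ with $t_i = a \cdot m_i + b$. For the average,
$$
\bar{t} = \frac{1}{n}\sum_{i=1}^n (a \cdot m_i + b) = a \cdot \frac{1}{n}\sum_{i=1}^n m_i + b = a \cdot \bar{m} + b.
$$
The shift $b$ cancels in each deviation, $t_i - \bar{t} = a \cdot (m_i - \bar{m})$, so
$$
\text{SD}([t_1,\dots,t_n]) = \sqrt{\frac{1}{n}\sum_{i=1}^n a^2 (m_i-\bar{m})^2} = |a| \cdot \text{SD}([m_1,\dots,m_n]).
$$
The absolute value appears because $\sqrt{a^2} = |a|$, which keeps the SD from becoming negative when $a$ is negative.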