# 3 Introduction to Summary Statistics


***
# 3.3 Variance and standard deviation
---

Once again, let's look at the 2008 swing state data on the county level and think about other summary statistics we can calculate.

<img src="img/swarm_mean_ex.png" width=500>

In this bee swarm plot, I also show the mean of each state with a horizontal line.

In looking at this plot, the mean seems to capture the magnitude of the data, but **what about the variability, or the spread, of the data?**

Just by looking at the swarm plot,
> * Florida seems to have more county-to-county variability than Pennsylvania or Ohio. 

### Variance
We can quantify this spread with the variance. The **variance** is the average of the squared distances of the data points from the mean.

$$ variance = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \overline{x} \right)^{2} $$

Informally, the variance is a measure of the spread of the data. Let's parse that more carefully with a graphical example, looking specifically at Florida.

<img src="img/variance.png" width=900>

For each data point, we square the distance from the mean, and then take the average of all these values.
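
To make this recipe concrete, here is a minimal sketch on a small made-up array (the values are purely illustrative and are not taken from the election data):

```python
import numpy as np

# Hypothetical toy data, chosen so the arithmetic is easy to follow by hand
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_x = x.mean()                   # (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8
sq_dist = (x - mean_x) ** 2         # squared distance of each point from the mean
variance_by_hand = sq_dist.mean()   # average of the squared distances

print(mean_x, variance_by_hand)     # prints: 5.0 4.0
```

The squared distances are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32, so their average over the 8 points is 4.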

### ```variance``` with NumPy
Calculation of the variance is implemented in the ```np.var()``` function.
```python
In [1]: ma_state_FL = pd_data['state'] == "FL"
In [2]: dem_share_FL = pd_data[ma_state_FL]['dem_share']

In [3]: var_dem_share_FL = np.var( dem_share_FL )
In [4]: var_dem_share_FL
Out[4]: 147.44278618846067
```
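
One detail is worth checking when comparing results against other tools. With its default ```ddof=0```, ```np.var()``` divides by $n$, which matches the formula above. Passing ```ddof=1``` divides by $n-1$ instead, which is also the default of the pandas ```Series.var()``` method. A small sketch on made-up numbers shows the difference:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])   # toy values with mean 2.5

var_population = np.var(data)           # divides by n (ddof=0, the default)
var_sample = np.var(data, ddof=1)       # divides by n - 1

print(var_population, var_sample)
```

Here the squared distances from the mean sum to 5, so the first result is 5/4 and the second is 5/3.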

<div class="alert alert-block alert-info">
<b>Standard Deviation.</b> Note that because the calculation of the variance involves squared quantities, it does not have the same units as what we have measured (or as the mean, the median, or the percentiles). We are therefore interested in the square root of the variance, which is called the standard deviation.
</div>

$$ \sigma = \sqrt{variance} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \overline{x} \right)^{2} } $$
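
The units argument can be checked numerically. In this sketch (toy lengths, assumed to be measured in cm), converting the data to mm multiplies the standard deviation by 10, just like the data themselves, while the variance is multiplied by 100:

```python
import numpy as np

lengths_cm = np.array([4.0, 4.5, 5.0, 5.5, 6.0])   # hypothetical measurements in cm
lengths_mm = lengths_cm * 10                        # the same measurements in mm

# Ratio of the spread measures after and before the change of units
ratio_std = np.std(lengths_mm) / np.std(lengths_cm)
ratio_var = np.var(lengths_mm) / np.var(lengths_cm)

print(ratio_std, ratio_var)   # approximately 10 and 100
```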

### ```std``` with NumPy

```python
In [1]: std_dem_share_FL = np.sqrt( var_dem_share_FL )
In [2]: std_dem_share_FL
Out[2]: 12.142602117687158

In [3]: np.std( dem_share_FL )
Out[3]: 12.142602117687158
```

The result is the same whether we take the square root of the variance or use the NumPy function ```np.std()```.
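
Rather than comparing printed digits by eye, the two routes can also be compared programmatically. The following sketch uses ```np.isclose()``` on a made-up array (the values are illustrative, not the Florida data):

```python
import numpy as np

values = np.array([48.2, 51.7, 39.4, 60.1, 44.8])   # hypothetical percentages

std_from_var = np.sqrt(np.var(values))   # square root of the variance
std_direct = np.std(values)              # standard deviation computed directly

# True when the two floats agree within a small tolerance
print(np.isclose(std_from_var, std_direct))
```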


**Now, when we look at the previous plot, we see that the standard deviation is a reasonable metric for the typical spread of the data.**

# Let's practice!
***

<div class="alert alert-block alert-success">
<b>Loading data.</b> The following IPython cell loads the data set needed for this section.
</div>

In [13]:
# Import the packages used in this section
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

# Loading data in the namespace
# columns info: row,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),species
iris = np.genfromtxt( "data/iris.csv", delimiter=",", skip_header=1)

# Select the petal length (column 1) for each iris species
# species info (column 5):
#       0 for versicolor
#       1 for setosa
#       2 for virginica
versicolor = iris[:, 5] == 0
versicolor_petal_length = iris[versicolor, 1]

setosa = iris[:, 5] == 1
setosa_petal_length = iris[setosa, 1]

virginica = iris[:, 5] == 2
virginica_petal_length = iris[virginica, 1]

<font color=green>
# Exercise 3.1 Computing the variance
</font>
It is important to have some understanding of what commonly used functions are doing under the hood. Though you may already know how to compute variances, this is a beginner course that does not assume so. In this exercise, we will explicitly compute the variance of the petal length of Iris versicolor using the equations discussed in the videos. We will then use ```np.var()``` to compute it.

#### Instructions
> - Create an array called ```differences``` that is the difference between the petal lengths (```versicolor_petal_length```) and the mean petal length. The variable ```versicolor_petal_length``` is already in your namespace as a NumPy array, so you can take advantage of NumPy's vectorized operations.
> - Square each element in this array. For example, ```x**2``` squares each element in the array ```x```. Store the result as ```diff_sq```.
> - Compute the mean of the elements in ```diff_sq``` using ```np.mean()```. Store the result as ```variance_explicit```.
> - Compute the variance of ```versicolor_petal_length``` using ```np.var()```. Store the result as ```variance_np```.
> - Print both ```variance_explicit``` and ```variance_np``` in one print call to make sure they are consistent.


In [14]:
# Array of differences to mean: differences
differences = versicolor_petal_length -  versicolor_petal_length.mean() 

# Square the differences: diff_sq
diff_sq = differences**2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean( diff_sq )

# Compute the variance using NumPy: variance_np
variance_np = np.var( versicolor_petal_length )

# Print the results
print("\n\t variance:  explicit {0:.6f} vs numpy.var function {1:.6f} \n".format(variance_explicit, variance_np))


	 variance:  explicit 0.121764 vs numpy.var function 0.121764 



<font color=green>
# Exercise 3.2 The standard deviation and the variance
</font>
As mentioned before, the _standard deviation_ is the square root of the variance. You will see this for yourself by computing the standard deviation using ```np.std()``` and comparing it to what you get by computing the variance with ```np.var()``` and then computing the square root.

#### Instructions
> - Compute the variance of the data in the ```versicolor_petal_length``` array using ```np.var()``` and store it in a variable called ```variance```.

> - Print the square root of this value.

> - Print the standard deviation of the data in the ```versicolor_petal_length``` array using ```np.std()```.

In [15]:
# Compute the variance: variance
variance = np.var( versicolor_petal_length )

# Print the square root of the variance
print("\n\t Square root of the variance: {0:.12f}".format(np.sqrt(variance)))

# Print the standard deviation
print("\n\t SD from np.std(): {0:.12f}".format(np.std(versicolor_petal_length)))


	 Square root of the variance: 0.348946987378

	 SD from np.std(): 0.348946987378
