# Introduction to Summary Statistics

***
# 1. The sample ```mean``` and ```median```
---


We have seen that histograms, bee swarm plots, and ECDFs provide effective summaries of data. But in data science we often need to summarize data even more succinctly, say in one or two numbers. These numerical summaries are by no stretch a substitute for the graphical methods we have been employing, but they do take up a lot less real estate.


Let's go back to the election data from the swing states again. If we could summarize the percentage of the votes for Obama at the county level in Pennsylvania in one number, what would we choose?

<img src="img/beeswarm_plot.png" width="500">

The first number that pops into my mind is **the mean**. The mean is the sum of all the data, divided by the number $n$ of data points:


$$\overline{x} = \frac{1}{n} \left( \sum_{i=1}^{n} x_{i} \right) = \frac{x_1+x_2+x_3+\dots+x_n}{n}$$

So, the mean for a given state is just the average percentage of votes over the counties. If we add the means as horizontal lines to the bee swarm plot, we see that they are a reasonable summary of the data.

<img src="img/swarm_mean_ex.png" width="500">

### ```mean``` with NumPy
To compute the mean of a set of data, we call the ```np.mean()``` function. Here it is used to compute the mean county-level vote share for Obama in Pennsylvania:

```python
In [1]: import numpy as np
In [2]: np.mean( dem_share_PA )
Out[2]: 45.476417910447765
```

If ```dem_share_PA``` is an ndarray, ```mean``` is also available as a method of the array:

```python
In [3]: dem_share_PA.mean()
Out[3]: 45.476417910447765
```

### ```mean``` remark
- The mean is a useful statistic and easy to calculate, but a major problem is that <font color="red">it is heavily influenced by outliers</font> (i.e. data points whose values are far greater or far smaller than most of the rest of the data).
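
To make this sensitivity concrete, here is a minimal sketch. The vote shares below are hypothetical values invented for illustration, not taken from the election data. Appending a single extreme value shifts the mean well away from where most of the data sit.

```python
import numpy as np

# Hypothetical county-level vote shares (in %), invented for illustration
shares = np.array([40.0, 42.0, 44.0, 46.0, 48.0])

# The same data with one extreme county appended
shares_with_outlier = np.append(shares, 95.0)

mean_without = np.mean(shares)
mean_with = np.mean(shares_with_outlier)

print("mean without the outlier:", mean_without)
print("mean with the outlier:   ", mean_with)
```

In this toy case the mean with the outlier ends up larger than five of the six values, so it no longer describes a typical county.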

### Solution: the ```median```
We might like a summary statistic that is immune to extreme data. The median provides exactly that.

**The median is the middle value of a data set.** It is defined by how it is calculated: sort the data and choose the datum in the middle (for an even number of points, the usual convention is to average the two middle values). Because it is derived from the ranking of the sorted data, and not from the values themselves, the median is immune to data that take on extreme values.

Here it is displayed on the bee swarm plot:

<img src="img/swarm_median_ex.png" width="500">

It is not tugged up by the counties with a large fraction of votes for Obama (these make up only 17% of the data).
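
The sort-and-pick-the-middle rule can be sketched on a few numbers. The arrays below are hypothetical and invented for illustration; the even-length case uses the common convention of averaging the two middle values, which is also what ```np.median()``` does.

```python
import numpy as np

# Hypothetical values, invented for illustration
odd_data = np.array([7.0, 1.0, 5.0, 3.0, 9.0])
even_data = np.array([7.0, 1.0, 5.0, 3.0])

# Odd number of points: the median is the single middle value after sorting
sorted_odd = np.sort(odd_data)                     # [1, 3, 5, 7, 9]
median_odd = sorted_odd[len(sorted_odd) // 2]

# Even number of points: average the two middle values after sorting
sorted_even = np.sort(even_data)                   # [1, 3, 5, 7]
mid = len(sorted_even) // 2
median_even = (sorted_even[mid - 1] + sorted_even[mid]) / 2

# Replace the largest value with an extreme one: the ranking is unchanged,
# so the median stays put while the mean moves a lot
extreme = odd_data.copy()
extreme[extreme.argmax()] = 1000.0
median_extreme = np.median(extreme)
mean_extreme = np.mean(extreme)

print(median_odd, median_even, median_extreme, mean_extreme)
```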


### ```median``` with NumPy
To compute the **median** of a set of data, we use the ```np.median()``` function. Here it is used to compute the median county-level vote share for Obama in Utah:


```python
In [1]: np.median( dem_share_UT )
Out[1]: 22.469999999999999
```

# Let's see an example
---

Let's take the data from the 2008 elections in all states.

As we said in the last section, a CSV file can be loaded with several functions. For this exercise, I will use the ```pandas``` package and read the file by calling the **read_csv** function.

In [1]:
import pandas as pd

pd_data = pd.read_csv("data/2008_all_states.csv")

pd_data.tail(10)

| Unnamed: 0 | state | county | total_votes | dem_votes | rep_votes | other_votes | dem_share | east_west |
|---|---|---|---|---|---|---|---|---|
| 3143 | OH | Athens County | 31098 | 20722 | 9742 | 634 | 68.02 | east |
| 3144 | OH | Butler County | 173777 | 66030 | 105341 | 2406 | 38.53 | east |
| 3145 | OH | Clinton County | 19305 | 6558 | 12409 | 338 | 34.58 | east |
| 3146 | OH | Cuyahoga County | 665352 | 458422 | 199880 | 7050 | 69.64 | east |
| 3147 | OH | Franklin County | 560325 | 334709 | 218486 | 7130 | 60.5 | east |
| 3148 | OH | Hamilton County | 425086 | 225213 | 195530 | 4343 | 53.53 | east |
| 3149 | OH | Highland County | 19186 | 6856 | 11907 | 423 | 36.54 | east |
| 3150 | OH | Hocking County | 12961 | 6259 | 6364 | 338 | 49.58 | east |
| 3151 | OH | Licking County | 82356 | 33932 | 46918 | 1506 | 41.97 | east |
| 3152 | OH | Madison County | 17454 | 6532 | 10606 | 316 | 38.11 | east |


Consider the county-level votes for Utah in the 2008 election only

In [37]:
# select the Utah rows with a boolean mask (the results are pandas Series)
mask_state_UT = pd_data['state']=="UT"

state_UT = pd_data[mask_state_UT]['state']
dem_share_UT = pd_data[mask_state_UT]['dem_share']
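
As a rough illustration of how the boolean mask works, here is a minimal sketch on a tiny table. The DataFrame ```toy``` and its values are hypothetical and invented for this example; only the column names mirror the election file.

```python
import pandas as pd

# Hypothetical miniature of the election table, invented for illustration
toy = pd.DataFrame({
    "state": ["UT", "OH", "UT", "PA"],
    "dem_share": [20.0, 60.0, 30.0, 45.0],
})

# Comparing a column with a value yields a boolean Series, one entry per row
mask_UT = toy["state"] == "UT"

# .loc selects the rows where the mask is True and one column by label
dem_share_toy_UT = toy.loc[mask_UT, "dem_share"]

print(mask_UT.tolist())
print(dem_share_toy_UT.mean())
```

Using ```.loc``` with the mask and the column label does the row and column selection in a single step; chaining ```toy[mask_UT]["dem_share"]``` gives the same values here.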

The mean of the Democratic vote share in Utah is

In [38]:
dem_share_UT.mean()

27.611034482758622

In [39]:
import numpy as np

# it can also be computed by calling the np.mean function
np.mean( dem_share_UT )

27.611034482758622

# Let's practice!
***

<div class="alert alert-block alert-success">
<b>Loading data.</b> The following IPython cell loads the data set needed for this section.
</div>

In [71]:
# all packages are already loaded
# import numpy as np
# from matplotlib import pyplot as plt
# import seaborn as sns
# sns.set()

# Loading data in the namespace
# columns info: row,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),species
iris = np.genfromtxt( "data/iris.csv", delimiter=",", skip_header=1)

# Select features for the versicolor type of iris
# species info: 
#       0 for versicolor
#       1 for setosa
#       2 for virginica
versicolor = iris[:,5]== 0
versicolor_petal_length = iris[versicolor,1]

setosa = iris[:,5]== 1
setosa_petal_length = iris[setosa,1]

virginica = iris[:,5]== 2
virginica_petal_length = iris[virginica,1]
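
The loading and masking pattern used in this cell can be tried on a small in-memory CSV. This is only a sketch: the text in ```csv_text``` and its column names are hypothetical and invented for illustration, and ```io.StringIO``` stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical three-row CSV with a header line, invented for illustration
csv_text = "row,length,species\n0,1.5,0\n1,4.5,1\n2,2.5,0\n"

# genfromtxt accepts a file-like object; skip_header=1 drops the header line
toy = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# Boolean mask built from the last column, then column 1 of the matching rows
is_species_0 = toy[:, 2] == 0
lengths_species_0 = toy[is_species_0, 1]

print(toy.shape)
print(lengths_species_0)
```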

<font color=green>
# Exercise 1.1 Means and medians
</font>
Which one of the following statements is true about means and medians?

#### Possible Answers
> - An outlier can significantly affect the value of both the mean and the median.
> - An outlier can significantly affect the value of the mean, but not the median.
> - Means and medians are in general both robust to single outliers.
> - The mean and median are equal if there is an odd number of data points.

In [72]:
# check it by yourself!


<font color=green>
# Exercise 1.2 1D array: computing means
</font>
The mean of all measurements gives an indication of the typical magnitude of a measurement.

#### Instructions
> - Compute the mean petal length of the three Iris species: versicolor_petal_length, setosa_petal_length and virginica_petal_length (already provided in your namespace). Assign the mean to mean_length_vers, mean_length_seto and mean_length_virg, respectively.
> - Hit submit to print the result.

In [73]:
# Compute the mean: mean_length_vers
mean_length_vers = np.mean( versicolor_petal_length )
mean_length_seto = np.mean( setosa_petal_length )
mean_length_virg = np.mean( virginica_petal_length )

# Print the result with some nice formatting
print('\t Mean. versicolor: {0} , setosa: {1}, and virginica {2} [in cm]'.format(
    mean_length_vers, mean_length_seto, mean_length_virg))


	 Mean. versicolor: 5.006 , setosa: 5.936, and virginica 6.588 [in cm]


<font color=green>
# Exercise 1.3 ND array: computing means (I)
</font>
Compute the mean petal length of the **versicolor** iris species (i.e. value 0), but now using the ```ndarray``` called **iris** (already in your namespace).

#### Remember
> - iris is a matrix of 150 rows with 6 columns
> - the feature **petal length** is stored in the second column
> - the feature **species** in the last column

#### Instructions
> - Create a mask to select only those rows that are classified as **versicolor**. Assign the mask to ma_versicolor
> - Compute the mean petal length. Assign the mean to mean_length_petal_vers
> - Print the result from this exercise and from the previous one, to compare them

In [74]:
# select a subset of rows: column 5 equal to 0
ma_versicolor = iris[:,5] == 0

# compute the petal length mean of this subset
mean_length_petal_vers = iris[ma_versicolor, 1].mean()

# print out
print("\t Mean for the versicolor {0} and from last exercise {1} ".format(
    mean_length_petal_vers, mean_length_vers))

	 Mean for the versicolor 5.006 and from last exercise 5.006 


In [75]:
# alternative way to compute it
np.mean( iris[ma_versicolor, 1] )

5.0060000000000002

<font color=green>
# Exercise 1.3 ND array: computing means (II)
</font>
<div class="alert alert-block alert-info">
<b>Note:</b> Using the function ```np.mean()``` you can compute the mean of all the attributes of this subset (or any other subset) in only one step.
</div>

```python
np.mean( iris[ma_versicolor], axis=0 )

```
#### Instructions
> - Copy this code in the ipython cell and try to understand it. 
> - Do the same, but now for the setosa and virginica Iris species (i.e. values 1 and 2, respectively)

In [76]:
# in one step you get the mean of all features/attributes for this species
np.mean( iris[ma_versicolor], axis=0 )

array([ 24.5  ,   5.006,   1.464,   0.244,   3.418,   0.   ])
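
To see what ```axis=0``` does, here is a minimal sketch on a small array. The array ```a``` is hypothetical and invented for illustration: rows play the role of samples and columns the role of features.

```python
import numpy as np

# Hypothetical 3x2 array: 3 rows (samples), 2 columns (features)
a = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = np.mean(a, axis=0)   # average over the rows: one mean per column
row_means = np.mean(a, axis=1)   # average over the columns: one mean per row
overall = np.mean(a)             # no axis given: one mean over all entries

print(col_means)
print(row_means)
print(overall)
```

With ```axis=0``` the result has one entry per column, which is why the call on the iris subset returns one mean per attribute.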

<font color=green>
# Exercise 1.4 Computing medians
</font>
The median is the middle value of a data set.

#### Instructions
> - Compute the median petal length of the three Iris species: versicolor_petal_length, setosa_petal_length and virginica_petal_length (already provided in your namespace). Assign the median to median_length_vers, median_length_seto and median_length_virg, respectively.
> - Hit submit to print the result.

In [77]:
# Compute the median: median_length_vers
median_length_vers = np.median( versicolor_petal_length )
median_length_seto = np.median( setosa_petal_length )
median_length_virg = np.median( virginica_petal_length )

# Print the result with some nice formatting
print('\t Median. versicolor: {0} , setosa: {1}, and virginica {2} [in cm]'.format(
    median_length_vers, median_length_seto, median_length_virg))


	 Median. versicolor: 5.0 , setosa: 5.9, and virginica 6.5 [in cm]


<font color=green>
# Exercise 1.5 Computing means without using ```np.mean```
</font>
Design your own code to compute the mean petal length of the three Iris species:


#### Instructions
> - Compute the mean petal length of the three Iris species: versicolor_petal_length, setosa_petal_length and virginica_petal_length (already provided in your namespace).
> - Print the result and compare it with the results from Exercise 1.2.

In [78]:
def my_mean(data):
    
    n = float(len(data))
    _mean = 0
    for i in data:
        _mean += i
    
    return _mean/n

print("\t petal length mean of versicolor ", my_mean( versicolor_petal_length ), " in cm")
print("\t petal length mean of setosa ", my_mean( setosa_petal_length ), " in cm")
print("\t petal length mean of virginica ", my_mean( virginica_petal_length ), " in cm")

	 petal length mean of versicolor  5.006  in cm
	 petal length mean of setosa  5.936  in cm
	 petal length mean of virginica  6.588  in cm


<font color=green>
# Exercise 1.6 Computing median without using ```np.median```
</font>
Design your own code to compute the median petal length of the three Iris species:


#### Instructions
> - Compute the median petal length of the three Iris species: versicolor_petal_length, setosa_petal_length and virginica_petal_length (already provided in your namespace). 
> - Print the result.

In [79]:
def my_median(data):

    n = len(data)
    halfn = n // 2  # integer division, so halfn can be used as an index

    # before searching for the value in the middle of the array, we must sort it!
    data = np.sort(data)
    if n % 2:
        # odd number of points: take the single middle value
        _median = data[halfn]
    else:
        # even number of points: average the two middle values
        _median = np.mean([ data[halfn-1], data[halfn] ])

    return _median

print("\t petal length median of versicolor ", my_median( versicolor_petal_length ), " in cm")
print("\t petal length median of setosa ", my_median( setosa_petal_length ), " in cm")
print("\t petal length median of virginica ", my_median( virginica_petal_length ), " in cm")

	 petal length median of versicolor  5.0  in cm
	 petal length median of setosa  5.9  in cm
	 petal length median of virginica  6.5  in cm
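
One way to gain confidence in a hand-written median is to compare it with ```np.median()``` on arrays of both odd and even length. The sketch below does that on random data; the helper ```median_by_sorting``` is a hypothetical name introduced only for this check, and the sample sizes are arbitrary.

```python
import numpy as np

def median_by_sorting(data):
    """Hypothetical helper: median computed by sorting, for comparison only."""
    values = np.sort(data)
    n = len(values)
    half = n // 2
    if n % 2:
        # odd length: single middle value
        return values[half]
    # even length: average of the two middle values
    return (values[half - 1] + values[half]) / 2

# Seeded generator so the check is reproducible
rng = np.random.default_rng(0)
for n in (1, 2, 5, 50, 51):
    sample = rng.normal(size=n)
    assert np.isclose(median_by_sorting(sample), np.median(sample))

print("hand-written median agrees with np.median for all tested sizes")
```

Testing both parities matters because the two branches are easy to swap by mistake, and data with many repeated values can hide such a mistake.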
