In [2]:
import numpy as np

### Mean
The first statistical concept we'll explore is the mean, commonly called the average. The mean is a useful measure of the center of a dataset. NumPy has a built-in function to calculate the mean of an array: `np.mean`.


In [4]:
a = np.array([2, 5, 8, 3, 4, 10, 15, 5])
a_avg = np.mean(a)
a_avg

6.5
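
As a quick check, the mean is the sum of the values divided by how many values there are. The sketch below recomputes it by hand; the name `manual_avg` is introduced here only for illustration.

```python
import numpy as np

a = np.array([2, 5, 8, 3, 4, 10, 15, 5])

# Sum of the values divided by the number of values
manual_avg = a.sum() / a.size

print(manual_avg)                          # 6.5
print(np.isclose(manual_avg, np.mean(a)))  # True
```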

### Mean and Logical Operations
As we know, a logical operator evaluates each item in an array against the specified condition. An item that matches the condition evaluates to `True`, which NumPy treats as 1 in arithmetic. An item that does not match evaluates to `False`, which is treated as 0.

In [6]:
np.mean(a > 8)

0.25

The logical statement `a > 8` checks which values are greater than 8 and marks each of them as `True` (1). `np.mean` adds up all of the 1s and divides the total by the length of `a`. The resulting output tells us that 25% of the values are greater than 8.
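
To make the intermediate steps visible, the following sketch stores the comparison result in a variable (called `mask` here purely for illustration) and counts the matches before dividing.

```python
import numpy as np

a = np.array([2, 5, 8, 3, 4, 10, 15, 5])

mask = a > 8                  # boolean array, True only where the value is 10 or 15
print(mask.sum())             # 2 values are greater than 8
print(mask.sum() / a.size)    # 0.25, the same result as np.mean(a > 8)
```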

### Calculating the Mean of 2D Arrays
If we have a two-dimensional array, `np.mean` can calculate the mean of the whole array as well as the means of its individual rows or columns.

Let's imagine a game of ring toss at a carnival. In this game, you have three different chances to get all three rings onto a stick. In our `ring_toss` array, each interior array (the arrays within the larger array) is one try, and each number is one ring toss. 1 represents a successful toss, 0 represents a fail.

First, we can use `np.mean` to find the mean across all the arrays:

In [8]:
ring_toss = np.array([[1, 0, 0], 
                      [0, 0, 1], 
                      [1, 0, 1]])
np.mean(ring_toss)

0.44444444444444442

In [10]:
#To find the means of each interior array, we specify axis 1 (the "rows"):
np.mean(ring_toss, axis=1)

array([ 0.33333333,  0.33333333,  0.66666667])

In [11]:
#To find the means of each index position
#(i.e., mean of all 1st tosses, mean of all 2nd tosses, ...), we specify axis 0 (the "columns"):
np.mean(ring_toss, axis=0)

array([ 0.66666667,  0.        ,  0.66666667])
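
It is easy to mix up which axis is which. One way to check the direction, sketched below with illustrative variable names, is to recompute one row and one column by hand and compare them with the `axis=1` and `axis=0` results.

```python
import numpy as np

ring_toss = np.array([[1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])

per_try = np.mean(ring_toss, axis=1)    # one mean per row (each try)
per_toss = np.mean(ring_toss, axis=0)   # one mean per column (each toss position)

# Recompute the first row and the first column by hand
first_try_mean = ring_toss[0, :].sum() / ring_toss.shape[1]
first_toss_mean = ring_toss[:, 0].sum() / ring_toss.shape[0]

print(np.isclose(per_try[0], first_try_mean))     # True
print(np.isclose(per_toss[0], first_toss_mean))   # True
```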

### Outliers
As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of those values is significantly different from the rest?

Values that don’t fit within the majority of a dataset are known as outliers. It’s important to identify outliers because if they go unnoticed, they can skew our data and lead to error in our analysis (like determining the mean). They can also be useful in pointing out errors in our data collection.

When we're able to identify outliers, we can then determine if they were due to an error in sample collection or whether or not they represent a significant but real deviation from the mean.

Suppose we want to determine the average height for 3rd graders. We measure several students at the local school, but accidentally measure one student in centimeters rather than in inches. If we're not paying attention, our dataset could end up looking like this:
```
[50, 50, 51, 49, 48, 127]
```
In this case, 127 would be an outlier.

One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can glance at the beginning or end of the array to see if some values lie far beyond the expected range. We can use the NumPy function `np.sort` to sort our data.

In [12]:
heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])
np.sort(heights)

array([ 46.9,  47. ,  47.2,  48.3,  48.7,  49.7,  62. ])
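
To see how much a single outlier can pull the mean, the sketch below compares the mean of `heights` with and without its largest value. Dropping the value is done here only to illustrate the effect; whether an outlier should actually be removed depends on why it is there.

```python
import numpy as np

heights = np.array([49.7, 46.9, 62, 47.2, 47, 48.3, 48.7])

sorted_heights = np.sort(heights)
without_largest = sorted_heights[:-1]   # every value except the suspicious 62.0

print(np.mean(heights))           # mean with the outlier included (close to 50)
print(np.mean(without_largest))   # mean without it (close to 48)
```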

### Median
Another key metric that we can use in data analysis is the median. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest).

In [13]:
my_array = np.array([50, 38, 291, 59, 14])
np.median(my_array)

50.0
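
The sketch below shows why 50 is the result, and how the median compares with the mean when one value (291) is much larger than the rest. When the array has an even number of values, `np.median` returns the average of the two middle values; `even_array` is a made-up example to show this.

```python
import numpy as np

my_array = np.array([50, 38, 291, 59, 14])

print(np.sort(my_array))     # the middle element of the sorted array is 50
print(np.median(my_array))   # 50.0
print(np.mean(my_array))     # 90.4, pulled upward by 291

even_array = np.array([1, 2, 3, 4])   # hypothetical values, even count
print(np.median(even_array))          # 2.5, the average of 2 and 3
```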

### Percentiles
The Nth percentile is defined as the point below which N% of the samples lie. For example, the point below which 40% of the samples lie is called the 40th percentile. Percentiles are useful measurements because they can tell us where a particular value is situated within the greater dataset.

In [18]:
d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])
np.percentile(d, 40)

4.0

Some percentiles have specific names:

- The 25th percentile is called the first quartile.
- The 50th percentile is called the median.
- The 75th percentile is called the third quartile.

The minimum, first quartile, median, third quartile, and maximum of a dataset are called the *five-number summary*. This set of numbers is a great thing to compute when we get a new dataset.
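
One possible way to get all five numbers at once is to pass a list of percentiles to `np.percentile`, relying on the fact that the 0th and 100th percentiles are the minimum and maximum. The variable name `five_number` is an illustrative choice.

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

# Percentiles 0 and 100 correspond to the minimum and maximum
five_number = np.percentile(d, [0, 25, 50, 75, 100])

# minimum 1, first quartile 3.5, median 4, third quartile 6.5, maximum 8
print(five_number)
```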

The difference between the third and first quartiles is a value called the interquartile range. For example, say we have the following array:

```
d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]
```

We can calculate the 25th and 75th percentiles using `np.percentile`:

```python
np.percentile(d, 25)  # 3.5
np.percentile(d, 75)  # 6.5
```
Then to find the interquartile range, we subtract the value of the 25th percentile from the value of the 75th:

```
6.5 - 3.5 = 3
```

About 50% of the dataset lies within the interquartile range, that is, between the first and third quartiles. The interquartile range gives us an idea of how spread out our data is. The smaller the interquartile range, the more tightly the middle of the data is clustered. The greater the value, the more spread out the data.
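
The same calculation can be done entirely in code. The sketch below also counts how many values fall between the two quartiles; with a small dataset that contains repeated values, the share is only roughly one half. The names `q1`, `q3`, `iqr`, and `inside` are illustrative.

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

q1, q3 = np.percentile(d, [25, 75])   # first and third quartiles
iqr = q3 - q1
print(iqr)   # 3.0

# Fraction of values falling between the two quartiles (inclusive)
inside = (d >= q1) & (d <= q3)
print(np.mean(inside))   # 5 of the 11 values, a little under one half
```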

### Standard Deviation
Similar to the interquartile range, the standard deviation tells us the spread of the data. The larger the standard deviation, the more spread out our data is from the center. The smaller the standard deviation, the more the data is clustered around the mean.

We can find the standard deviation of a dataset using the NumPy function `np.std`:

In [19]:
nums = np.array([65, 36, 52, 91, 63, 79])
np.std(nums)

17.716909687891082
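
By default, `np.std` computes the population standard deviation: the square root of the mean squared distance from the mean. The sketch below reproduces that by hand (`deviations` and `manual_std` are illustrative names) and shows the `ddof=1` argument, which gives the sample standard deviation instead.

```python
import numpy as np

nums = np.array([65, 36, 52, 91, 63, 79])

deviations = nums - np.mean(nums)                # distance of each value from the mean
manual_std = np.sqrt(np.mean(deviations ** 2))   # root of the mean squared distance

print(np.isclose(manual_std, np.std(nums)))      # True
print(np.std(nums, ddof=1))                      # sample standard deviation, slightly larger
```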

### Review

In [20]:
rainfall = np.array([5.21, 3.76, 3.27, 2.35, 1.89, 1.55, 0.65, 1.06, 1.72, 3.35, 4.82, 5.11])

# Center of the data
rain_mean = np.mean(rainfall)
rain_median = np.median(rainfall)

# Spread of the data: quartiles, interquartile range, and standard deviation
first_quartile = np.percentile(rainfall, 25)
third_quartile = np.percentile(rainfall, 75)
interquartile_range = third_quartile - first_quartile
rain_std = np.std(rainfall)

print(rain_mean)
print(rain_median)
print(first_quartile)
print(third_quartile)
print(interquartile_range)
print(rain_std)

2.895
2.81
1.6775
4.025
2.3475
1.52673125773
