## A scenario to keep in mind about statistics

You’re a citizen scientist who has started collecting data about rising water in the river next to where you live. After the river flooded during a recent hurricane, you became interested in tracking its height. Every day for the past month, you have walked to the river, measured the height of the water, and entered the reading into a notebook. But at the end of it, what exactly do you have? What can all this data tell you?

Let's look at how you can use NumPy functions to analyze your dataset.

First, we'll import the NumPy module, so we can use its statistical calculation functions.

In [1]:
import numpy as np
water_height = np.array([4.01, 4.03, 4.27, 4.29, 4.19,
                         4.15, 4.16, 4.23, 4.29, 4.19,
                         4.00, 4.22, 4.25, 4.19, 4.10,
                         4.14, 4.03, 4.23, 4.08, 14.20,
                         14.03, 11.20, 8.19, 6.18, 4.04,
                         4.08, 4.11, 4.23, 3.99, 4.23])

Let's use the function np.mean() to find the average water height:

In [2]:
np.mean(water_height)

5.251

But wait! Before trusting that number, we should sort our data to check for any measurements that could be throwing off the mean:

In [3]:
np.sort(water_height)

array([ 3.99,  4.  ,  4.01,  4.03,  4.03,  4.04,  4.08,  4.08,  4.1 ,
        4.11,  4.14,  4.15,  4.16,  4.19,  4.19,  4.19,  4.22,  4.23,
        4.23,  4.23,  4.23,  4.25,  4.27,  4.29,  4.29,  6.18,  8.19,
       11.2 , 14.03, 14.2 ])

Looks like that hurricane might have affected the average height! Let's find the median to see if it's more representative of the dataset:

In [4]:
np.median(water_height)

4.19

The median marks the midpoint of our data: half of the measurements fall below it. Now let's look at a value closer to the upper end of the dataset. We can use percentiles to pick a position in the sorted data and get the value at that position:

In [5]:
np.percentile(water_height, 75)

4.265

So far, we've gotten a good idea about specific values. But what about the spread of our data? Let's calculate the standard deviation to understand how far the data points tend to fall from the mean:

In [6]:
np.std(water_height)

2.784585367099861

## NumPy and Mean

The first statistical concept we’ll explore is the mean, also commonly referred to as the average. The mean is a useful measure of the center of a dataset. NumPy has a built-in function to calculate the mean of an array: np.mean.

![p](https://i.imgur.com/DRJ0O1P.jpg)
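As a minimal sketch of what np.mean computes, the example below compares it with adding up the values and dividing by the count. The `scores` array is hypothetical and exists only for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
scores = np.array([5, 9, 4, 7, 8])

# The mean is the sum of the values divided by how many there are.
manual_mean = scores.sum() / scores.size
numpy_mean = np.mean(scores)

print(manual_mean, numpy_mean)
```

Both expressions should give the same result, since np.mean performs the same sum-and-divide calculation.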

## Mean and Logical Operations

We can also use np.mean to calculate the fraction of array elements that have a certain property (multiply by 100 to express it as a percent).

As we know, a logical operator evaluates each item in an array against the specified condition. If the item matches the condition, it evaluates to True, which counts as 1. If it does not match, it evaluates to False, which counts as 0.

When np.mean is applied to a logical statement, the resulting mean is equal to the number of True items divided by the total array length.

![p](https://i.imgur.com/lrAbeos.jpg)
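Here is a small sketch of the idea, assuming a hypothetical `readings` array and an arbitrary threshold of 5. Neither comes from the text above.

```python
import numpy as np

# Hypothetical readings, chosen only for illustration.
readings = np.array([4.0, 4.2, 14.2, 4.1, 8.2, 4.3, 4.0, 11.2])

# The comparison produces a boolean array: True where the condition holds.
above_five = readings > 5

# True counts as 1 and False as 0, so the mean is (number of True) / (length).
fraction_above = np.mean(above_five)

print(above_five.sum(), "of", readings.size, "readings are above 5")
print(fraction_above)
```

Three of the eight readings exceed 5, so the result is 3 / 8 = 0.375, or 37.5%.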

## Calculating the Mean of 2D Arrays

If we have a two-dimensional array, np.mean can calculate the mean of the whole array as well as the means of its rows or columns, selected with the axis argument.

Let’s imagine a game of ring toss at a carnival. In this game, you have three different chances to get all three rings onto a stick. In our ring_toss array, each interior array (the arrays within the larger array) is one try, and each number is one ring toss. 1 represents a successful toss, 0 represents a miss.

![p](https://i.imgur.com/62H2XeV.jpg)

![p](https://i.imgur.com/VzWKrS7.jpg)
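The sketch below shows the three ways of averaging a 2D array. The values in this `ring_toss` array are made up for illustration and may differ from the ones pictured above.

```python
import numpy as np

# Made-up results: each row is one try, each column is one ring position.
ring_toss = np.array([[1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])

# No axis: the mean of every value in the array.
overall = np.mean(ring_toss)

# axis=1: average across each row, giving one mean per try.
per_try = np.mean(ring_toss, axis=1)

# axis=0: average down each column, giving one mean per ring position.
per_position = np.mean(ring_toss, axis=0)

print(overall)
print(per_try)
print(per_position)
```

With these values, the third try has the highest success rate (two of three rings), and the middle ring position never lands.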

## Outliers

As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of those values is significantly different from the rest?

Values that don’t fit within the majority of a dataset are known as outliers. It’s important to identify outliers because if they go unnoticed, they can skew our results and lead to errors in our analysis (for example, a misleading mean). They can also be useful in pointing out errors in our data collection.

When we’re able to identify outliers, we can then determine whether they were due to an error in sample collection or whether they represent a significant but real deviation from the mean.

Suppose we want to determine the average height for 3rd graders. We measure several students at the local school, but accidentally measure one student in centimeters rather than in inches. If we’re not paying attention, our dataset could end up looking like this:

[50, 50, 51, 49, 48, 127]

In this case, 127 would be an outlier.

Some outliers aren’t the result of a mistake. For instance, suppose that one of our 3rd graders had skipped a grade and was actually a year younger than everyone else in the class:

[50, 50, 51, 49, 48, 45]

She is significantly shorter at 45”, so her height would still be an outlier even though it was measured correctly.

Suppose that another student was just unusually tall for his age:

[50, 50, 51, 49, 48, 58.5]

His height of 58.5” would also be an outlier.
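To see how much a single outlier can move the mean, here is a short sketch using the first list of heights above (the one containing 127). The cutoff of 100 inches used to drop the bad measurement is an assumption made for this example, not a general rule.

```python
import numpy as np

# Heights in inches from the first example; 127 was recorded in centimeters.
heights = np.array([50, 50, 51, 49, 48, 127])

# Sorting pushes the outlier to the end, where it is easy to spot.
sorted_heights = np.sort(heights)

mean_all = np.mean(heights)        # pulled upward by the outlier
median_all = np.median(heights)    # barely affected by the outlier

# Assumed cutoff for illustration: treat anything over 100 inches as a unit error.
cleaned = heights[heights < 100]
mean_cleaned = np.mean(cleaned)

print(sorted_heights)
print(mean_all, median_all, mean_cleaned)
```

With these six values the mean is 62.5 while the median is 50.0. After dropping 127, the mean falls to 49.6, which is much closer to a typical height in the list.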

### Sorting and Outliers

![p](https://i.imgur.com/4gJbPrb.jpg)

![p](https://i.imgur.com/9GntxLW.jpg)

![p](https://i.imgur.com/tHxa4Tj.jpg)

![p](https://i.imgur.com/3FINDzd.jpg)

![p](https://i.imgur.com/4yvrqCS.jpg)

![p](https://i.imgur.com/YsI9nIt.jpg)

## Percentiles

As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of the samples are below, and 60% of the samples are above?

This type of point is called a percentile. The Nth percentile is defined as the point below which N% of the samples lie. So the point where 40% of samples are below is called the 40th percentile. Percentiles are useful measurements because they can tell us where a particular value is situated within the greater dataset.

![p](https://i.imgur.com/bnEoSS1.jpg)

![p](https://i.imgur.com/60aGrii.jpg)
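A minimal sketch of np.percentile follows. The array `d` is hypothetical and may not match the one pictured above.

```python
import numpy as np

# Hypothetical dataset with 11 values, chosen only for illustration.
d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

# The second argument is the percentile to find, on a 0-100 scale.
p40 = np.percentile(d, 40)

# The 50th percentile is the same value np.median returns.
p50 = np.percentile(d, 50)

print(p40, p50, np.median(d))
```

By default, np.percentile interpolates linearly between neighboring values when the requested percentile falls between two data points, so the result is not always a value that appears in the array.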

Some percentiles have specific names:

1. The 25th percentile is called the first quartile
2. The 50th percentile is called the median
3. The 75th percentile is called the third quartile

The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set of numbers is a great thing to compute when we get a new dataset.

The difference between the first and third quartiles is a value called the interquartile range. 50% of the dataset lies between the first and third quartiles. The interquartile range gives us an idea of how spread out our data is. The smaller the interquartile range, the less spread there is in the middle of our dataset. The greater the value, the larger the spread.

![p](https://i.imgur.com/QGJkbfk.jpg)
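As a sketch, the five-number summary and the interquartile range can be computed with a single np.percentile call. This example reuses the water_height measurements from the scenario at the start; the variable names for the results are only illustrative.

```python
import numpy as np

# River measurements from the opening scenario.
water_height = np.array([4.01, 4.03, 4.27, 4.29, 4.19,
                         4.15, 4.16, 4.23, 4.29, 4.19,
                         4.00, 4.22, 4.25, 4.19, 4.10,
                         4.14, 4.03, 4.23, 4.08, 14.20,
                         14.03, 11.20, 8.19, 6.18, 4.04,
                         4.08, 4.11, 4.23, 3.99, 4.23])

# Passing a list of percentiles returns all of them at once.
minimum, q1, median, q3, maximum = np.percentile(water_height, [0, 25, 50, 75, 100])

# Interquartile range: the width of the middle 50% of the data.
iqr = q3 - q1

print(minimum, q1, median, q3, maximum)
print(iqr)
```

Note that the interquartile range here is small (about 0.18) even though the maximum is far above the median, because the few flood readings sit outside the middle half of the data.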

## NumPy and Standard Deviation

While the mean and median can tell us about the center of our data, they do not reflect how spread out the data is. That’s where standard deviation comes in.

Similar to the interquartile range, the standard deviation tells us the spread of the data. The larger the standard deviation, the more spread out our data is from the center. The smaller the standard deviation, the more the data is clustered around the mean.

![p](https://i.imgur.com/03kQbmj.jpg)

![p](https://i.imgur.com/YNp1PAV.jpg)

![p](https://i.imgur.com/LmwtgYF.jpg)
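The sketch below applies np.std to the water_height data from the opening scenario and compares it with the same data after the flood readings are left out. The cutoff of 5 used to separate "calm" days is an assumption made for this example.

```python
import numpy as np

# River measurements from the opening scenario.
water_height = np.array([4.01, 4.03, 4.27, 4.29, 4.19,
                         4.15, 4.16, 4.23, 4.29, 4.19,
                         4.00, 4.22, 4.25, 4.19, 4.10,
                         4.14, 4.03, 4.23, 4.08, 14.20,
                         14.03, 11.20, 8.19, 6.18, 4.04,
                         4.08, 4.11, 4.23, 3.99, 4.23])

# Standard deviation of all readings, flood days included.
std_all = np.std(water_height)

# The same quantity written out: square root of the mean squared deviation.
manual_std = np.sqrt(np.mean((water_height - np.mean(water_height)) ** 2))

# Assumed cutoff for illustration: readings below 5 count as calm days.
calm_days = water_height[water_height < 5]
std_calm = np.std(calm_days)

print(std_all, manual_std)
print(std_calm)
```

By default np.std divides by the number of values (the population standard deviation); passing ddof=1 gives the sample standard deviation instead. The calm-day readings are tightly clustered, so their standard deviation is far smaller than the value for the full month.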

## Quiz on These Concepts

![p](https://i.imgur.com/OrfSFgd.jpg)

![p](https://i.imgur.com/CNFMaeO.jpg)

![p](https://i.imgur.com/or8sI4v.jpg)

![p](https://i.imgur.com/zPrV8K0.jpg)

![p](https://i.imgur.com/X5I2nRG.jpg)

![p](https://i.imgur.com/jtTqA80.jpg)

![p](https://i.imgur.com/drnZN0L.jpg)

![p](https://i.imgur.com/iStCait.jpg)