# Intro to Statistics with NumPy

In [None]:
"""
- Mean
- Median
- Percentiles
- Interquartile Range
- Outliers
- Standard Deviation
"""

In [1]:
import numpy as np

In [None]:
"""
After the river in your town flooded during a recent hurricane, you've become interested in collecting data about its height.
Every day for the past month, you have walked to the river, measured the height of the water, and entered this information into a 
notebook. 
"""

In [2]:
water_height = np.array([4.01, 4.03, 4.27, 4.29, 4.19,
                         4.15, 4.16, 4.23, 4.29, 4.19,
                         4.00, 4.22, 4.25, 4.19, 4.10,
                         4.14, 4.03, 4.23, 4.08, 14.20,
                         14.03, 11.20, 8.19, 6.18, 4.04,
                         4.08, 4.11, 4.23, 3.99, 4.23])

In [4]:
#np.mean:
average_height = np.mean(water_height)
average_height

5.251

In [8]:
# But wait! We should sort our data to see whether any measurements could be throwing off the mean, 
# or represent a real deviation from it:
np.sort(water_height)

array([ 3.99,  4.  ,  4.01,  4.03,  4.03,  4.04,  4.08,  4.08,  4.1 ,
        4.11,  4.14,  4.15,  4.16,  4.19,  4.19,  4.19,  4.22,  4.23,
        4.23,  4.23,  4.23,  4.25,  4.27,  4.29,  4.29,  6.18,  8.19,
       11.2 , 14.03, 14.2 ])

In [7]:
# Looks like that thunderstorm might have impacted the average height! Let's measure the median to see if 
# it's more representative of the dataset:
np.median(water_height)

4.19

In [9]:
# While the median tells us where the middle of our data lies, let's look at a value closer to the end of the dataset. 
# We can use percentiles to specify a data point's relative position and get its value:
np.percentile(water_height, 75)

4.265

In [10]:
# So far, we've gotten a good idea about specific values. But what about the spread of our data? Let's calculate 
# the standard deviation to understand how similar or how different each data point is:
np.std(water_height)

2.784585367099861

In [11]:
# mean and logical statements:
# when np.mean is applied to a logical statement (which produces an array of True and False values), the resulting 
# mean is the number of True items divided by the total array length, i.e. the fraction of items that satisfy the condition.
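A minimal sketch of the idea above, reusing the river measurements from earlier. The threshold of 5 is an arbitrary value chosen only for illustration, and the variable names are not from the original exercise.

```python
import numpy as np

# Same river measurements as in the water_height cell.
water_height = np.array([4.01, 4.03, 4.27, 4.29, 4.19,
                         4.15, 4.16, 4.23, 4.29, 4.19,
                         4.00, 4.22, 4.25, 4.19, 4.10,
                         4.14, 4.03, 4.23, 4.08, 14.20,
                         14.03, 11.20, 8.19, 6.18, 4.04,
                         4.08, 4.11, 4.23, 3.99, 4.23])

# Comparing an array to a number yields a Boolean array.
above_five = water_height > 5  # 5 is an illustrative threshold (assumption)

# True counts as 1 and False as 0, so the mean is the fraction of True items.
fraction_above_five = np.mean(above_five)

# The same fraction computed explicitly: count of True items / array length.
manual_fraction = np.sum(above_five) / len(water_height)

print(fraction_above_five, manual_fraction)
```

Both values agree because five of the thirty measurements exceed the threshold.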

In [None]:
"""
Five participants were asked to rate how drowsy the medication made them once a day for three days on a scale of one 
(least drowsy) to ten (most drowsy). Use np.mean to find the average level of drowsiness across all the trials and save 
the result to the variable total_mean. Use np.mean to find the average level of drowsiness for each day of the 
experiment and save it to the variable trial_mean. Use np.mean to find the average level of drowsiness for each 
individual patient, to see if some were more sensitive to the drug than others, and save it to the variable patient_mean.
"""

In [12]:
allergy_trials = np.array([[6, 1, 3, 8, 2], 
                           [2, 6, 3, 9, 8], 
                           [5, 2, 6, 9, 9]])

total_mean = np.mean(allergy_trials)
trial_mean = np.mean(allergy_trials, axis=1) # mean of rows
patient_mean = np.mean(allergy_trials, axis=0) # mean of columns
print(trial_mean)
print(patient_mean)

In [None]:
"""
outliers: As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean 
is highly influenced by the specific values in our data set. What happens when one of those values is significantly different
from the rest? Values that don’t fit within the majority of a dataset are known as outliers. It’s important to identify 
outliers because if they go unnoticed, they can skew our data and lead to error in our analysis (like determining the mean). 
They can also be useful in pointing out errors in our data collection. When we’re able to identify outliers, we can then 
determine if they were due to an error in sample collection or whether or not they represent a significant but real deviation
from the mean.

One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can glance at the beginning 
or end of the array to see if some values lie far beyond the expected range.
"""

In [21]:
# median: the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest).

# Calculate the median without using np.median, and save the value to the variable small_set_median:
dataset = np.array([10100, 35500, 105000, 85000, 25500, 40500, 65000])
sorted_dataset = np.sort(dataset)
print(sorted_dataset)
# Use integer division: a float index such as 7 / 2 = 3.5 raises an error in Python 3.
small_set_median = sorted_dataset[len(dataset) // 2]
print(small_set_median)
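Picking the single middle element of the sorted array, as the cell above does, only gives the median when the dataset has an odd number of values. For an even number of values, the median is the average of the two middle elements. The sketch below covers both cases; `manual_median` and `even_set` are hypothetical names introduced for illustration.

```python
import numpy as np

def manual_median(values):
    """Median via sorting (hypothetical helper, not part of NumPy)."""
    ordered = np.sort(values)
    middle = len(ordered) // 2  # integer division gives a valid index
    if len(ordered) % 2 == 1:
        # Odd length: one middle element.
        return ordered[middle]
    # Even length: average the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_set = np.array([10100, 35500, 105000, 85000, 25500, 40500, 65000])
even_set = odd_set[:-1]  # drop the last value to get an even-length example

# Each line prints the manual result next to np.median for comparison.
print(manual_median(odd_set), np.median(odd_set))
print(manual_median(even_set), np.median(even_set))
```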

In [22]:
# Unlike the mean, the median is barely affected by outliers. This becomes important in skewed datasets, 
# datasets whose values are not distributed symmetrically around the center. 
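To see how differently the two statistics react to an outlier, the sketch below takes the first five river readings and appends one flood reading. The subset and the variable names are chosen here for illustration.

```python
import numpy as np

calm_days = np.array([4.01, 4.03, 4.27, 4.29, 4.19])  # first five river readings
with_flood = np.append(calm_days, 14.20)               # same data plus one flood reading

# How far each statistic moves once the outlier is included.
mean_shift = np.mean(with_flood) - np.mean(calm_days)
median_shift = np.median(with_flood) - np.median(calm_days)

print("mean shift:", mean_shift)
print("median shift:", median_shift)
```

The mean moves by more than a full unit, while the median moves by only a few hundredths.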

In [23]:
# percentiles: As we know, the median is the middle of a dataset - it is the number for which 50% of the samples are below, 
# and 50% of the samples are above. But what if we wanted to find a point at which 40% of the samples are below, and 60% of 
# the samples are above? This type of point is called a percentile. The Nth percentile is defined as the point below which 
# N% of the samples lie. So the point where 40% of samples are below is called the 40th percentile. 
# Percentiles are useful measurements because they can tell us where a particular value is situated within the greater dataset.

In [26]:
d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7,  8, 8])
np.percentile(d, 40)

4.0

In [27]:
patrons = np.array([ 2, 6, 14, 4, 3, 9, 1, 11, 4, 2, 8])

thirtieth_percentile = np.percentile(patrons, 30)
seventieth_percentile = np.percentile(patrons, 70)
print(thirtieth_percentile, seventieth_percentile)

3.0 8.0


In [None]:
"""
Some percentiles have specific names:
The 25th percentile is called the "first quartile"
The 50th percentile is called the "median"
The 75th percentile is called the "third quartile"

The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set 
of numbers is a great thing to compute when we get a new dataset.

The difference between the third and first quartiles is called the interquartile range: 
interquartile range = third quartile (75th percentile) - first quartile (25th percentile). 
The middle 50% of the dataset lies between the first and third quartiles. The interquartile range gives us an idea of 
how spread out the data is. The smaller the interquartile range, the less spread out the middle of our dataset is. 
The greater the value, the more spread out it is.
"""
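As a sketch, the five-number summary and the interquartile range can be computed in one pass with np.percentile, which accepts a list of percentiles. The example reuses the patrons array from the earlier cell; the names `five_number_summary` and `iqr` are introduced here for illustration.

```python
import numpy as np

patrons = np.array([2, 6, 14, 4, 3, 9, 1, 11, 4, 2, 8])

# The 0th and 100th percentiles are the minimum and maximum.
five_number_summary = np.percentile(patrons, [0, 25, 50, 75, 100])
minimum, first_quartile, median, third_quartile, maximum = five_number_summary

# Interquartile range: third quartile minus first quartile.
iqr = third_quartile - first_quartile

print(five_number_summary)
print(iqr)
```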

In [28]:
# standard deviation: While the mean and median can tell us about the center of our data, they do not reflect how spread 
# out the data is. That’s where standard deviation comes in. Similar to the interquartile range, the standard deviation 
# measures the spread of the data. The larger the standard deviation, the more spread out our data is from the mean. 
# The smaller the standard deviation, the more the data is clustered around the mean.

In [29]:
nums = np.array([65, 36, 52, 91, 63, 79])
np.std(nums)

17.716909687891082
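The value above can be reproduced by hand, which shows what np.std measures: the square root of the average squared distance from the mean. By default np.std divides by the number of values (the population standard deviation). The names `deviations` and `manual_std` are introduced here for illustration.

```python
import numpy as np

nums = np.array([65, 36, 52, 91, 63, 79])

# Distance of each value from the mean.
deviations = nums - np.mean(nums)

# Square the distances, average them, then take the square root.
manual_std = np.sqrt(np.mean(deviations ** 2))

print(manual_std, np.std(nums))
```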

In [31]:
# Find the average weight for each competition and save them to the variables pumpkin_avg and acorn_squash_avg.
pumpkin = np.array([68, 1820, 1420, 2062, 704, 1156, 1857, 1755, 2092, 1384])

acorn_squash = np.array([20, 43, 99, 200, 12, 250, 58, 120, 230, 215])

pumpkin_avg = np.mean(pumpkin)
acorn_squash_avg = np.mean(acorn_squash)

# Find how representative the mean values are of the submissions as a whole. Calculate the standard deviation
# of each dataset and save the results to the variables pumpkin_std and acorn_squash_std. 
# Determine the squash dataset that has the greater standard deviation and save it to the variable winner.
pumpkin_std = np.std(pumpkin)
acorn_squash_std = np.std(acorn_squash)

print("Pumpkin:", pumpkin_std, "Acorn Squash:", acorn_squash_std)
winner = pumpkin

Pumpkin: 611.3183785884406 Acorn Squash: 87.22505374031019


In [32]:
rainfall = np.array([5.21, 3.76, 3.27, 2.35, 1.89, 1.55, 0.65, 1.06, 1.72, 3.35, 4.82, 5.11])

rain_mean = np.mean(rainfall)
rain_median = np.median(rainfall)
first_quarter = np.percentile(rainfall, 25)
third_quarter = np.percentile(rainfall, 75)
interquartile_range = third_quarter - first_quarter
rain_std = np.std(rainfall)