# An Introduction to Descriptive Statistics!

In [4]:
# Some Handy Dandy Import Statements to Start Off Our Handy Dandy Little Tutorial
from scipy import stats
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

---

At this point in our course, we've had plenty of time, experience, and practice in manipulating our data. 

However, to really _understand_ our data and underlying patterns across it, we need to dive a layer deeper into the mathematics behind data mechanics and dynamics.

In order to be able to draw conclusions from our data, we need to be able to **describe and interpret our data**.

This will become essential for more advanced data science concepts in data visualization, machine learning, and deep learning. 

## Statistics involves collecting, interpreting, describing behaviors, and inferring trends across data.

We generally describe the realm of _statistics_ to be broken up into **two** major fields:

### 1. Descriptive Statistics

Descriptive statistics involves describing, presenting, summarizing, and organizing the data you have in hand (treating it as the whole population of interest), either through numerical calculations or data visualization methods (e.g. graphs, tables). 

### 2. Inferential Statistics

Inferential Statistics allows us to infer trends and make assumptions/assertions about a population based on a study of a sample taken from it. 

Generally, the more inferential a statistical analysis becomes, the deeper and more complex we get with our mathematics. 

### At this point in the course, we'll be focusing primarily on descriptive statistics in order to describe patterns, trends, distributions, and behaviors across our data.

Let's take a dive into applying descriptive statistics to explain our data!

---

## Measures of Central Tendency

In statistics, we often find that describing data by its "averages" lets us make stronger assertions about the data as a whole. 

We often use **three key measures of central tendency** to help describe the center of our data:
- **Mean**
- **Median**
- **Mode**

### The mean is the raw average value across our data. 

Calculating the mean is simple: _compute the sum of all values across our data and divide by the total number of values in our dataset_.

We've been using the mean for years and years, but this simple arithmetic calculation turns out to be surprisingly powerful for critically understanding and breaking down complex datasets! 
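To make the sum-then-divide recipe concrete, here is a minimal sketch on a small made-up list. The name `mean_of` and the toy values are assumptions for illustration only, not part of the exercise below.

```python
def mean_of(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    # Sum every value, then divide by how many values there are.
    return sum(values) / len(values)

toy_scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data
print(mean_of(toy_scores))  # 40 / 8 = 5.0
```

Raising an error on empty input is one possible design choice; returning `None` would be another.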

### _Write a function to compute the mean from an arbitrary dataset._

In [66]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return the average value of our dataset.
def compute_mean(dataset):
    """ Main function that calculates the average value across our data. """
    return

compute_mean(data)

### The median is the "middle value" or midpoint across our data.

Determining the median is as simple as it sounds: _ascertain the data value lying in the exact middle of our dataset_.

One critical exception occurs when our data has an even number of values and thus has **two values** at its center: _in these cases, take the **mean** of the two middle values to obtain the true median across our data_. 

And remember: the median can only be read off from _sorted data_!

If data is distributed in a non-normal manner (for example, skewed by a few extreme values), the median often describes a typical value better than the mean does, because it is not pulled toward those extremes. 
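The odd and even cases can be sketched on tiny made-up lists as follows. The helper name `median_of` and the sample values are hypothetical, chosen only to show the sort-then-pick-the-middle idea.

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)          # the middle is only meaningful once sorted
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]        # odd count: one value sits in the middle
    # Even count: average the two values that share the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_of([7, 1, 5]))            # 5
print(median_of([7, 1, 5, 3]))         # (3 + 5) / 2 = 4.0
print(median_of([1, 2, 3, 4, 100]))    # 3, even though the mean is 22.0
```

The last call hints at why the median is handy for skewed data: the single extreme value moves the mean a lot but leaves the median alone.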

### _Write a function to compute the median from an arbitrary dataset._

In [67]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return the exact true median value of our dataset.
# HINT: Consider sorting the data and using array indexing or slicing to help with identifying the correct median value(s).
def compute_median(dataset):
    """ Main function that determines the median value across our data. """
    count = len(dataset)
    
    if count < 1:
        # TODO: Complete this if-statement
        return
    if count % 2 == 1:
        # TODO: Complete this if-statement
        return
    else:
        # TODO: Complete this if-else statement
        return
    
compute_median(data)

### The mode is the most commonly occurring value or feature across our data.

Determining the mode is relatively simple: _find the value that occurs most frequently across our data_.

Remember that if all values across our data are unique and only occur once, then our data **has no mode**!

The mode is also an interesting measure of central tendency in that it can be applied towards categorical (non-numerical) data; one can find frequently occurring categories without running any calculations. 
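One way to picture this is to count how often each value appears and keep the most frequent one(s). The sketch below uses the standard library's `collections.Counter`; the function name `modes_of` and the toy inputs are assumptions for illustration, and it follows the convention above that all-unique data has no mode.

```python
from collections import Counter

def modes_of(values):
    """Return a list of the most frequent value(s), or [] when no value repeats."""
    counts = Counter(values)          # maps each value to how often it occurs
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value is unique, so there is no mode
    return [value for value, count in counts.items() if count == highest]

print(modes_of([1, 2, 2, 3, 3]))                # [2, 3] (two values tie)
print(modes_of(["red", "blue", "red"]))         # ['red'] (works on categories too)
print(modes_of([1, 2, 3]))                      # []
```

Returning a list is a deliberate choice here, since a dataset can have more than one mode.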

### _Write a function to compute the mode from an arbitrary dataset._

In [68]:
# NOTE: Trickier than it looks!
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return the mode across our dataset.
# HINT: Remember histograms and tokenization from CS 1.2? How might they help you here? 
def compute_mode(dataset):
    """ Main function that determines the mode value across our data. """
    return

compute_mode(data)

There we have it!

Three measures of central tendency that are critically important to understanding the distribution of our data. 

In future classes on distributions and introductory inferential statistics, we'll talk more about exactly **why** these measures are so important. 

---

## Measures of Relative Position

Measures of relative position are slightly different from our friends in the central tendency community, in that they do roughly the opposite of what central tendency statistics do!

We see that *measures of central tendency* describe our data as a whole by way of central data points.

By contrast, *measures of relative position* describe our data very specifically, using relative data points to partition and divide up our data for internal investigation!

In this course, there are **two key measures of relative position** that we are mainly worried about:
- **Percentile**
- **Quartile**

### The percentile is the value across your dataset below which a given percentage of data points fall.

In other words, your percentile is a value between 0 and 100 that tells you what percentage of your data is **to the left** of your percentile line.

For measures of relative position, we see that it's very helpful to visually depict what we're talking about!

Let's take a look at how a percentile looks across our data with an example!

<img src="https://cdn-images-1.medium.com/max/1600/1*Nu_ZqWswZ0YxxFpBwk7vYA.png" />

This is a great example explaining percentiles in a visually pleasing manner (that we *totally* did not find and steal from Google Images). 

In this example, we see that percentile is not necessarily a direct value from our data. 

Rather, a percentile represents an **index** from 0 to 100 that partitions our data into a proportionate amount based on the value of the index. 

We've actually been using percentile indices all our lives.

In school, we measure student skill levels in assessments/courses by measuring who fell into the top percentiles (99th, 90th, 80th, etc.).

Fortune 500 companies often fall into the top percentiles for company revenue/performance.

Commonly used biometric health data – like BMI, height, and weight – is often measured relative to other people your age so that it can accurately depict what your percentile is!

All these measurements are done with one purpose in mind: **to describe where a specific data point falls relative to the population**. 
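As a rough illustration of "percentage of data to the left", the sketch below counts how many points fall strictly below a value. The name `percentile_rank` and the height figures are made up for this example, and note that conventions differ: some definitions count values at or below the point instead.

```python
def percentile_rank(values, value):
    """Return the percentage of data points that are strictly below `value`."""
    if len(values) == 0:
        raise ValueError("cannot rank a value against an empty dataset")
    below = sum(1 for v in values if v < value)   # points to the left of `value`
    return 100 * below / len(values)

heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]  # hypothetical data
print(percentile_rank(heights_cm, 172))   # 5 of 10 heights are below 172 -> 50.0
print(percentile_rank(heights_cm, 150))   # nothing is below the minimum -> 0.0
```

Because this version counts data points rather than scaling between the minimum and maximum, it also behaves sensibly when the data is bunched up in one region.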

### _Use the functions below to answer the following questions regarding percentiles for an arbitrary dataset._

NOTE: Since the values in our data range from 0 to 10, you may notice that most of the answers are pretty obvious.

The point here is to test that you can see how and why you can obtain the percentile value for a given data point across a dataset!

In [51]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# NOTE: Use this completed percentile-from-value function to calculate the percentile given a data value.
# NOTE: Remember, the percentile indicates how much data lies to the LEFT of the associated value!
# NOTE: This function expects SORTED data and linearly scales the value between the dataset's min and max,
#       so it only approximates the true percentile when the data is spread fairly evenly.
def compute_percentile_from_value(dataset, value):
    """ Main function that estimates the percentile of a single value within a sorted dataset. """
    abs_min, abs_max = dataset[0], dataset[-1]
    return 100 * (float(value - abs_min) / (abs_max - abs_min))

# TODO: Uncomment this line (swapping in the value you want to look up) to answer the questions below!
# compute_percentile_from_value(sorted(data), 3.5)

- What percentile is the value 3.5 at? 

`Write your response here!`

- What percentile is at the value 9.8?

`Write your response here!`

- What percentage of data is contained between the values 6.0 and 7.5?

`Write your response here!`

- How much data is below the value at 2.7?

`Write your response here!`

- How much data is above the value at 4.5?

`Write your response here!`

Now let's do the same thing, but in the other direction.

Let's figure out data values given percentile indices across our data!

In [50]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# NOTE: Use this completed value-from-percentile function to calculate the value given a set percentile.
# NOTE: The percentile is passed as a fraction between 0 and 1 (e.g. 0.40 for the 40th percentile).
# NOTE: You may notice some unexpected values returned. Why do you think that is? 
def compute_value_from_percentile(dataset, percentile, key=lambda x: x):
    """ Main function that determines the data value given a percentile across a sorted dataset. """
    rel_pos = percentile * (len(dataset) - 1)
    floor, ceil = math.floor(rel_pos), math.ceil(rel_pos)
    if floor == ceil:
        return key(dataset[int(rel_pos)])
    data0 = key(dataset[int(floor)]) * (ceil - rel_pos)
    data1 = key(dataset[int(ceil)]) * (rel_pos - floor)
    return data0 + data1

# TODO: Uncomment this line (swapping in the percentile you want) to answer the questions below!
# compute_value_from_percentile(sorted(data), 0.40)

- What value is at the 40th percentile?

`Write your response here!`

- What value is at the 77th percentile?

`Write your response here!`

- How much data is contained between the 50th and 60th percentiles?

`Write your response here!`

Wow!

We can see that this can get real complex, real fast! 

Percentiles can tell us a lot about our data's distribution and how we should handle values scattered across our data!

In future classes, we'll work further with using percentiles to extract even more insights from our data!

Now, let's quickly look at **quartiles**. 

### The quartile is one of three values that partition your dataset into four equal parts.

It's a special type of percentile value that marks the 25%, 50%, and 75% points across your data (with the maximum sitting at the 100% point). 

Like percentiles, each quartile marks the value below which a set share of the data lies. 

For example, at the 25% quartile mark, we know that roughly 25% of our data falls below that value. 

Let's take a look at an illustrated example!

<img src="https://www.mathematics-monster.com/images3/quartiles_cut_offs.jpg" />

In this example, we can clearly see that the lower quartile `Q1` marks the value below which 25% of our data lies, while the middle quartile `Q2` marks the value below which half of our data lies and the upper quartile `Q3` marks the value below which 75% of our data lies.

We also can see that the middle quartile actually represents the median value of our data.

That is, where the median lies, we can assume that 50% of our data lies below and 50% of our data falls above. 

(But we already knew that!)

We can also intuit that the minimum value of our data marks the 0th percentile and quartile (since no data falls below it) and the maximum value of our data marks the 100th percentile and quartile (since all of our data falls below it)!
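These relationships can be checked on a small made-up dataset. The sketch below assumes a linear-interpolation rule for locating a position in sorted data; the helper name `value_at_fraction` and the toy values are hypothetical, and other quartile conventions exist that can give slightly different answers.

```python
import math
import statistics

def value_at_fraction(sorted_values, fraction):
    """Interpolate the value sitting `fraction` (0 to 1) of the way through sorted data."""
    if len(sorted_values) == 0:
        raise ValueError("dataset is empty")
    position = fraction * (len(sorted_values) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return sorted_values[lower]               # landed exactly on a data point
    weight = position - lower                     # how far we are toward `upper`
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

toy_data = sorted([15, 1, 9, 4, 13, 3, 11, 6, 8])  # hypothetical toy data, 9 values
q1, q2, q3 = (value_at_fraction(toy_data, f) for f in (0.25, 0.50, 0.75))

print(q1, q2, q3)                                  # 4 8 11
print(q2 == statistics.median(toy_data))           # True: Q2 is the median
print(value_at_fraction(toy_data, 0.0), value_at_fraction(toy_data, 1.0))  # 1 15
```

With nine values the quartile positions land exactly on data points; with other sizes the helper blends the two neighbouring values, which is why a quartile is not always a value that appears in the data.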

### _Use the two functions above to ascertain the values at the first, second, and third quartiles (the 25th, 50th, and 75th percentiles)._

Since you've already gotten plenty of practice identifying percentile values across a dataset, there's not much point to writing another function for quartile identification. 

Let's use the tools we already have! 

- First Quartile Value (Lower, Q1, 25th percentile):

`Write your response here!`

- Second Quartile Value (Median, Q2, 50th percentile):

`Write your response here!`

- Third Quartile Value (Upper, Q3, 75th percentile):

`Write your response here!`

Once you're done, let's move on to studying some more advanced measures of how our data is distributed! 

---

## Measures of Spread and Variance

Like our friends in the central tendency community, measures of spread and variance do their best to describe patterns across our data as a whole.

However, unlike measures of central tendency, which focus on the distribution of our data towards an arithmetic centroid, measures of spread and variance talk about the shape and layout of our data all across the board!

In this course, there are **two key measures of spread and variance** to help describe the shape of our data:
- **Range**
- **Standard Deviation**

### The range is the coordinate pair describing the smallest and largest values our data contains.

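In other words, determining the range is as simple as quantifying the absolute minimum and maximum values across our data and shoving them into a coordinate object! (Many textbooks report the range as the single number _maximum minus minimum_; here we keep both endpoints.)

You're probably intimately familiar with the `range()` object in Python, which is defined by a start and a stop: we're finding the analogous endpoints for any dataset! 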

Keep in mind that we want the **absolute** min and max: not the _local_ min and max. 

In other words, we don't want just any small and large value – we want the absolutely smallest and largest values that occur in our data.
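A minimal sketch of the idea, using Python's built-in `min()` and `max()` on a made-up list (the name `min_max_range` and the temperature values are assumptions for illustration):

```python
def min_max_range(values):
    """Return the (smallest, largest) pair of a non-empty sequence."""
    if len(values) == 0:
        raise ValueError("the range of an empty dataset is undefined")
    # min() and max() scan the whole dataset, so these are the absolute extremes.
    return min(values), max(values)

temperatures = [18, 25, 21, 30, 16]  # hypothetical toy data
low, high = min_max_range(temperatures)
print((low, high))    # (16, 30)
print(high - low)     # 14, the width of the range as a single number
```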

### _Write a function to compute the range from an arbitrary dataset._

In [49]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return the effective range of values across our data.
def compute_range(dataset):
    """ Main function that determines the range of values across our data. """
    return ()

compute_range(data)

()

### The standard deviation is the square root of the variance, which measures the dispersion of our data around the mean.

Calculating the standard deviation is slightly more complex than our other descriptive statistics.

To calculate the standard deviation, we must first _calculate the **variance** of our data and then take its square root_. 

And to calculate the variance, we must _find the difference between every data point and the mean, square each difference, sum the squared differences, and divide that sum by the number of data points_.

The standard deviation is slightly more complicated because it captures how spread out our data points are around the mean, not just where the center of the data lies. 

Therefore, the standard deviation allows us to _interpret individual data values and judge whether they lie an unusually large distance from the mean compared with the rest of our data_!

This becomes incredibly important when we dive into outlier detection, hypothesis testing, and analyzing data for values that may affect it abnormally.
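The steps can be traced on a small made-up list whose numbers work out cleanly. This is a sketch of the population form (dividing by the number of data points, as described above); the sample form divides by one less than that. The name `population_std` and the toy values are assumptions for illustration.

```python
import math

def population_std(values):
    """Return the population standard deviation of a non-empty sequence."""
    n = len(values)
    if n == 0:
        raise ValueError("the standard deviation of an empty dataset is undefined")
    mean = sum(values) / n
    squared_diffs = [(v - mean) ** 2 for v in values]   # squared distance from the mean
    variance = sum(squared_diffs) / n                   # average squared distance
    return math.sqrt(variance)

toy_data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data, mean is 5
print(population_std(toy_data))      # squared diffs sum to 32, 32 / 8 = 4, sqrt -> 2.0
```

Dividing by the number of data points also matches the default behaviour of NumPy's `np.std`, which uses `ddof=0`.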

### _Write a function to compute the standard deviation from an arbitrary dataset._

In [70]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return the standard deviation of our data.
# NOTE: Since we need the true mean across our data, let's use our previously written function!
def compute_standard_deviation(dataset):
    """ Main function that approximates the standard deviation of our data. """
    true_mean = compute_mean(dataset)
    
    # TODO: Complete these calculations step-by-step to complete the standard deviation calculation.
    sum_diffs_squared = int()
    variance = int()
    
    return math.sqrt(variance)

compute_standard_deviation(data)

0.0

Excellent!

We'll work on more examples of applying these descriptive statistics over the next several weeks, allowing you to see more of how and why these measures can allow us to extract more insights from our data.

---

...by the way, there is one little thing worth mentioning.

### Python has everything you need for descriptive statistics!

That's right!

All the functions we built are useful to understand the structure of these descriptors, but we certainly don't need to write them from scratch. 

Python's ecosystem already has all of these methods built in!

To be more precise, we can use the libraries **NumPy** and **SciPy** to get all these descriptive statistics much more easily!
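As a side note that goes beyond the libraries named above, the standard library's `statistics` module offers similar calculations without any extra installs. The sketch below is one hedged way to cross-check hand-written functions; the name `summarize` and the toy values are assumptions, and the `"inclusive"` quartile method is chosen because it should correspond to linear interpolation across the sorted data.

```python
import statistics

def summarize(values):
    """Collect common descriptive statistics using only the standard library."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
        "std_dev": statistics.pstdev(values),   # population standard deviation
    }

toy_data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data
summary = summarize(toy_data)
for name, value in summary.items():
    print(name, value)
```

`statistics.quantiles` requires Python 3.8 or newer, and `statistics.mode` returns only the first mode it encounters when several values tie.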

We have all the tools we need right here:

### _Write a function to compute all descriptive statistics for an arbitrary dataset._

In [61]:
data = np.array([1, 3, 5, 2, 3, 7, 8, 4, 10, 0, 6, 7, 3, 0, 3, 0, 5, 7, 10, 1, 4, 9, 3])

# TODO: Complete this function by having the function return all descriptive statistics needed using NumPy and SciPy.
# NOTE: Range is super simple to calculate with vanilla Python and is thus not required here. 
def compute_descriptive_statistics(dataset):
    """ Main function that calculates descriptive statistics across our dataset. """
    # TODO: Calculate the mean of our data using NumPy.
    mean = None
    
    # TODO: Determine the median of our data using NumPy.
    median = None
    
    # TODO: Determine the mode of our data using SciPy's Stats module.
    mode = None
    
    # TODO: Calculate the quartile values using NumPy.
    q1, q2, q3 = None, None, None
    
    # TODO: Determine the standard deviation of our data using NumPy.
    std_dev = None
    
    return mean, median, mode, q1, q2, q3, std_dev

mean, median, mode, q1, q2, q3, std_dev = compute_descriptive_statistics(data)
print("\nThe mean of our data is {}.".format(mean))
print("\nThe median of our data is {}.".format(median))
print("\nThe mode of our data is {}.".format(mode))
print("\nThe quartiles of our data are at the data points {}, {}, and {}, respectively.".format(q1, q2, q3))
print("\nThe standard deviation of our data is {}.".format(std_dev))


The mean of our data is None.

The median of our data is None.

The mode of our data is None.

The quartiles of our data are at the data points None, None, and None, respectively.

The standard deviation of our data is None.
