[View in Colaboratory](https://colab.research.google.com/github/mirman-school/project-impact/blob/master/Distribution_Part_1.ipynb)

# Distribution, Part I

Let's take a step back from pandas for a second and talk a little bit about the kinds of numbers we're pulling from DataFrames, and the mathematical tools we'll use to draw conclusions.

In particular, we'd like to talk about **distribution** and **central tendency**.

Let's start by loading up some Python modules we'll need to work with.

## Central Tendency

Numerical values exist in a range. There's a highest, a lowest, a midpoint, and everything in between. Learning how our values are arranged can help us understand the story of the data.

Let's start by making a dataset. We're going to make a set of test scores. We'll use `numpy` to create a random (but not-so-random) set of scores to start with.

In [0]:
import numpy as np # Easy access to common mathematical functions
import matplotlib.pyplot as plt # Graphing
import random # Making random stuff
import scipy.stats as stats # Extra statistical functions
import math # ...math

In [0]:
min_test_score = 50 # The low end of our test score range
max_test_score = 100 # The high end
total_test_scores = 20 # How many tests to generate

np.random.seed(42) # Guarantee the same output for everyone
test_scores = np.random.randint(min_test_score, max_test_score + 1, total_test_scores) # + 1 on max because randint is top-end-exclusive

Now we need to sort our dataset. Why?

Two of our three basic analysis tools (mean, median, and mode) are easiest to work out by hand when the data is _sorted_. **Median**, or the midpoint of the data, only makes sense if the values are in order. Similarly, **mode** is easiest to discern when like values are grouped, making it easy to count how many occurrences of each value there are.

In [0]:
sorted_scores = sorted(test_scores)
sorted_scores

### Mean $$\bar{x}$$

 
 $$\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}$$
 
**Mean**, as you can _clearly_ see above, is defined as the sum of the values in a set divided by the number of values in that set. It is also called the arithmetic mean, or simply the average.
 
We could write a loop and do some simple Python math to calculate this value. But we're hackers now, so here's a fun hack to use: `np.mean()` will take a list or numpy array and return the arithmetic mean.
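
For the curious, here is a rough sketch of what that loop might look like, next to `np.mean()`. The `sample_scores` list is made up for this illustration; it is not the notebook's `test_scores`.

```python
import numpy as np

# Hypothetical sample scores, made up for this sketch (not test_scores)
sample_scores = [70, 85, 90, 55]

# Add up every value, then divide by how many values there are
total = 0
for score in sample_scores:
    total += score
loop_mean = total / len(sample_scores)

print(loop_mean)               # mean from the hand-written loop
print(np.mean(sample_scores))  # mean from numpy, which should match
```

Both lines should print the same number, which is the whole point: `np.mean()` does the sum-and-divide for us.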

Use `np.mean()` below to get the average from `test_scores`.

In [0]:
# Use np.mean() to get the average from test scores
tests_mean = None
# Print it out


### Median $$\tilde{x}$$

A dataset's **median** is the midpoint of the data. When sorted, it's the position where there are as many values above as below. Now you see why we might need `sorted_scores`!

Haha, j/k, there's `np.median()` so you don't need to worry about that too much.

Note that if the median falls between two values in the set (which happens when the set has an even number of values), the median is defined as the **mean** (see above) of those two values.
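
Here's a small sketch of the odd and even cases. The two lists are made up for this illustration and are not part of the notebook's data.

```python
import numpy as np

# Hypothetical samples, made up for this sketch
odd_scores = [88, 52, 71]        # sorted: 52, 71, 88
even_scores = [88, 52, 71, 95]   # sorted: 52, 71, 88, 95

# With 3 values, the median is the middle one once sorted
odd_median = np.median(odd_scores)

# With 4 values, the median is the mean of the two middle ones (71 and 88)
even_median = np.median(even_scores)

print(odd_median, even_median)
```

Notice that we never sorted the lists ourselves; `np.median()` handles the ordering internally.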

Use `np.median()` to get the median from `test_scores`.

In [0]:
# Use np.median() on test_scores
tests_median = None

# Print it out


### Mode

**Mode** is defined as the most commonly occurring value in a set. While we could easily count up occurrences of values in a small set, we want a solution that scales to big datasets. Luckily, `scipy.stats` has us covered with a built-in `mode()` function, used like:

```python
stats.mode(my_dataset)
```

But what it returns might look kinda weird. Let's check it out.

In [0]:
tests_mode = stats.mode(test_scores)
tests_mode

What in the jam is a `ModeResult`?  It's a custom data type from `scipy` that includes both the (smallest) mode from a dataset, and the count of occurrences of that mode.

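Why are they still inside arrays? Because `stats.mode()` can handle numpy arrays of multiple dimensions. Ours is 1-d, but we could imagine an array of 2 or more dimensions. `stats.mode()` would then find a mode for each slice along one axis (each column, by default). Depending on your SciPy version, a 1-d dataset may show plain numbers here instead of arrays.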

`ModeResult`s have 2 properties: a `mode` that contains an array of all the modes found in the dataset, and a corresponding `count` array that indicates the number of occurrences of the modes, in order. So the `ModeResult` above shows that there was 1 mode, 73, and that it occurred 3 times in the dataset. 
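
If you'd like to cross-check a mode and its count by hand, one possible approach (a different technique from `stats.mode()`, shown only as a sketch) is to count values with `np.unique()`. The `sample_scores` array is made up for this illustration.

```python
import numpy as np

# Hypothetical sample, made up for this sketch (not test_scores)
sample_scores = np.array([73, 60, 73, 91, 60, 73])

# np.unique returns the distinct values in sorted order,
# plus how many times each one occurs
values, counts = np.unique(sample_scores, return_counts=True)

# argmax returns the first position of the highest count,
# so a tie would resolve to the smallest value
most_common_index = np.argmax(counts)
sample_mode = values[most_common_index]
sample_mode_count = counts[most_common_index]

print(sample_mode, sample_mode_count)
```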

To just access the `mode` array, use [**dot notation**](http://reeborg.ca/docs/oop_py_en/oop.html). 

We'll worry about that later. For now, let's get the mode from `test_scores`.

In [0]:
# Use dot notation to get just the mode array from tests_mode


## Distribution

To understand how our data plays out across min, max, and everything in between, it'll be helpful to create a distribution plot, or **histogram**. 

A histogram is a special kind of bar chart. On the x-axis is the range of values. On the y-axis, the count of how many of each value occurs in the dataset.

Here's how we make one.

In [0]:
plt.hist(test_scores)
plt.title("Test score distribution (10 bins)")
plt.xlabel("Test scores")
plt.ylabel("Count")
plt.show()

### Bins

Our `test_scores` dataset is unusual in that the values fall within a small and predictable range of integers. A histogram with one bar for each possible value would still be easy to read. In larger datasets, it makes more sense to group values together into **bins**. Using `pyplot`, the default number of bins is 10. Changing the number of bins can change the story the histogram tells. You can change the number of bins by passing the optional `bins` parameter to `plt.hist()`.

In [0]:
plt.hist(test_scores, bins=20) # Change this to see how the graph changes
plt.title("Test score distribution (20 bins)") # Change the title too!
plt.xlabel("Test scores")
plt.ylabel("Count")
plt.show()
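
To see the effect of `bins` as plain numbers rather than bars, here is a sketch using `np.histogram()`, which does the same kind of counting without drawing anything. The `sample_scores` array is made up for this illustration.

```python
import numpy as np

# Hypothetical sample, made up for this sketch (not test_scores)
sample_scores = np.array([50, 55, 61, 64, 70, 72, 75, 88, 93, 100])

# np.histogram returns the count in each bin, plus the edges of the bins
counts_5, edges_5 = np.histogram(sample_scores, bins=5)
counts_2, edges_2 = np.histogram(sample_scores, bins=2)

print(edges_5, counts_5)  # 5 narrow bins
print(edges_2, counts_2)  # 2 wide bins
```

Same data, different grouping: the counts always add up to the number of scores, but how they're split across bins depends entirely on the bin count you pick.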

### Comparing Datasets

We have our tiny dataset of 20 test scores. What would happen with 200 test scores? 2000? 20000? Our random number generator, by default, should be completely fair, meaning that there's equal probability of retrieving any number in the given range.

Sooooo, what _should_ happen as our datasets get bigger, is that the histogram should get flatter and flatter.

Let's find out.

We'll start by generating 4 new datasets using `np.random.randint()`. We need 4 sets, of **50**, **500**,  **5000**, and **50000**. Make them as above with `np.random.randint()`.

In [0]:
np.random.seed(42)

# Each call is np.random.randint(low, high, size); high is 101 because the top end is exclusive
small = np.random.randint(50,101, 50)
medium = np.random.randint(50,101, 500)
large = np.random.randint(50,101, 5000)
xlarge = np.random.randint(50,101, 50000)
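
Before plotting, we can put a rough number on "flatter". The sketch below is one possible measure, not the only one: it counts how often each score occurs and compares the spread of those counts to their average. It uses a local generator from `np.random.default_rng()` (a different numpy API from `np.random.randint()`, chosen here so the global seed above isn't disturbed). The `relative_spread` helper and the seed value are made up for this illustration.

```python
import numpy as np

# A local random generator; the seed 0 is arbitrary, just for repeatability
rng = np.random.default_rng(0)

def relative_spread(n):
    """Generate n scores from 50 to 100 and measure how uneven the counts are."""
    scores = rng.integers(50, 101, n)  # top end is exclusive, like randint
    # Count how many times each of the 51 possible scores occurs
    counts = np.bincount(scores - 50, minlength=51)
    # Standard deviation of the counts, relative to the average count
    return counts.std() / counts.mean()

spreads = {n: relative_spread(n) for n in (50, 500, 5000, 50000)}
for n, spread in spreads.items():
    print(n, round(spread, 3))
```

A smaller number means the counts are closer to each other, which is what a flatter histogram looks like.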


#### Comparing by overlay

One way to compare our datasets is by overlaying the plots on the same grid. Obviously the large dataset is going to have much higher counts than the small, but we should still be able to see the shape of the distribution.

All we need to do to overlay our histograms is repeatedly call `plt.hist()` with each dataset in turn.

However, to make all of the hists visible, we need to make all but the first histogram a little see-through. We should also change their color to make them clearly different, like so:

```python
# Makes a histogram with 20 bins, red, and 40% opacity
plt.hist(dataset, bins=20, color="r", alpha=0.4)
```

In [0]:
plt.figure(figsize=[20,10])
plt.hist(small, bins=51)
plt.hist(large, bins=51, color="r", alpha=0.4)
plt.hist(medium, bins=51, color="g", alpha=0.3)
plt.hist(xlarge, bins=51, color="purple", alpha=0.3)
plt.title("Test score distribution")
plt.xlabel("Score")
plt.ylabel("Count")
plt.show()

#### PDF

Right, so that's kinda hard to read. With 50k in the `xlarge` dataset,  everything else is going to look a little squished. We can normalize the sizes by using a [probability density function](https://en.wikipedia.org/wiki/Probability_density_function).

A what?

A probability density function takes a dataset and creates a distribution that shows not raw counts, but the relative likelihood of values falling in each part of the range. With `density=True`, the bar heights are scaled so that the total area of all the bars is 1, regardless of the size of the dataset. What that means is that we can look at the PDF for each dataset without squishing any.

It's pretty easy to use PDFs with `plt.hist()`. It's just a parameter you pass, like `color` or `alpha`. Set `density` to `True` for each of your histograms. Otherwise, reuse the code from above.

In [0]:
# Use the code from above to remake your histograms, but use density set to True for each hist
plt.figure(figsize=[20,10])
plt.hist(small, bins=51, density=True)
plt.hist(large, bins=51, color="r", alpha=0.4, density=True)
plt.hist(medium, bins=51, color="g", alpha=0.3, density=True)
plt.hist(xlarge, bins=51, color="purple", alpha=0.3, density=True)
plt.title("Test score distribution")
plt.xlabel("Score")
plt.ylabel("Probability density") # With density=True, the y-axis is no longer a raw count
plt.show()
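
To check what `density=True` does to the numbers, here is a sketch with `np.histogram()`, which accepts the same `density` option. The `sample_scores` array is made up for this illustration.

```python
import numpy as np

# Hypothetical sample, made up for this sketch
sample_scores = np.array([50, 55, 61, 64, 70, 72, 75, 88, 93, 100])

# With density=True we get bar heights instead of raw counts
densities, edges = np.histogram(sample_scores, bins=5, density=True)

# Area of each bar is height times width; the areas should add up to 1
bin_widths = np.diff(edges)
total_area = np.sum(densities * bin_widths)

print(densities)
print(total_area)
```

The total area comes out to 1 (give or take floating-point rounding) no matter how many scores are in the dataset, which is why datasets of very different sizes can share one plot.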

#### Subplot

Yeah, so that works, except that now it's hard to tell one plot from another. So instead, we're going to split our plot into 4 pieces and look at the plots side-by-side.

`plt.subplot()` takes 3 arguments: the number of rows in the grid, the number of columns in the grid, and the index of the cell to draw in (counting from 1, left to right, then top to bottom). Once you call `plt.subplot()`, further `plt` functions will apply to that subdivision of the whole figure. So for example, on a grid of 2 rows and 2 columns:

```python
plt.subplot(2,2,1) # 2 rows, 2 columns, cell 1 (top left)
... # some stuff here
plt.subplot(2,2,2) # 2 rows, 2 columns, cell 2 (top right)
... # some stuff here
plt.subplot(2,2,3) # 2 rows, 2 columns, cell 3 (bottom left)
... # some stuff here
plt.subplot(2,2,4) # 2 rows, 2 columns, cell 4 (bottom right)
```

So what we want to do is call `plt.subplot()`,  then do our work for one single histogram (including title and labels), then call `plt.subplot()` again and repeat the process.
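
As a warm-up, here is a sketch of that call-then-draw pattern on a smaller grid of 1 row and 2 columns. In `plt.subplot(1, 2, 1)`, the first two numbers give the grid size and the last picks the cell. The two score lists and the class names are made up for this illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data, made up for this sketch
morning_scores = [60, 65, 65, 70, 80, 80, 80, 95]
afternoon_scores = [55, 70, 70, 75, 75, 75, 90, 100]

fig = plt.figure(figsize=[10, 4])

plt.subplot(1, 2, 1)  # grid of 1 row and 2 columns, draw in cell 1 (left)
plt.hist(morning_scores, bins=5)
plt.title("Morning class")
plt.xlabel("Score")
plt.ylabel("Count")

plt.subplot(1, 2, 2)  # same grid, now draw in cell 2 (right)
plt.hist(afternoon_scores, bins=5, color="g")
plt.title("Afternoon class")
plt.xlabel("Score")
plt.ylabel("Count")

plt.show()
```

The exercise below follows the same pattern, just with a bigger grid and the four datasets from earlier.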

Using `plt.subplot()`, create 4 plots—one for each dataset.

In [0]:
# Reproduce the code above, but 4 times, using plt.subplot() between to create 4 individual plots
# One for each dataset: small, medium, large, and xlarge

plt.figure(figsize=[20,10])

# use subplot to select the right part of the graph

# create the histogram

# add title, xlabel, and ylabel

plt.show()

If everything's gone to plan, you'll see that the `xlarge` distribution gets pretty even. This is what's known as a _uniform_ distribution.

#### Adding Mean/Median/Mode to the plot

Remember the central tendency tools we used earlier? It can be handy to add them as vertical lines to our histograms. `plt.axvline()` can help us out here. We just give it the x-value, and the color we want, and bam! A line on the plot. Let's use the `xlarge` dataset as an example.

In [0]:
plt.figure(figsize=[20,10])
plt.hist(xlarge, bins=51, density=True)
plt.axvline(np.mean(xlarge), color="r", linewidth=3)
plt.axvline(np.median(xlarge), color="g", linewidth=3)
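# np.atleast_1d copes with both SciPy behaviors: .mode as an array or as a single number
plt.axvline(np.atleast_1d(stats.mode(xlarge).mode)[0], color="purple", linewidth=3)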
plt.show()

## END OF PART 1

Congratulations! You made it! In the next notebook, we'll cover variance in data, and how to use it.