# Demo: Let's Make Some Data!

In [2]:
# range from 20 to 59
ages = range(20, 60)

In [4]:
import random
random_ages = []
for num in range(100):
    random_ages.append(random.choice(ages))

In [5]:
print(random_ages) # using print() eschews pretty-printing

[55, 32, 36, 25, 42, 31, 22, 36, 59, 54, 38, 30, 25, 43, 59, 35, 47, 53, 28, 44, 45, 31, 33, 29, 41, 56, 23, 48, 39, 37, 47, 23, 54, 46, 26, 55, 41, 21, 50, 41, 59, 50, 28, 42, 32, 40, 53, 43, 29, 34, 47, 55, 39, 38, 33, 47, 25, 53, 38, 40, 58, 26, 53, 41, 58, 25, 42, 24, 59, 54, 46, 46, 24, 21, 39, 52, 56, 39, 30, 32, 40, 24, 43, 52, 56, 52, 43, 38, 41, 29, 33, 51, 48, 23, 33, 35, 59, 55, 53, 30]


## Question: What is our oldest age? Youngest age? Average age?

In [6]:
max(random_ages)

59

In [7]:
min(random_ages)

21

# Range: How Wide is the Dispersion of the Data?

In [8]:
def my_range(x):
    '''Python would let us call this function range,
       but if we did that, we would lose access to
       the built-in function range'''
    # pretty simple really..
    return max(x) - min(x)

In [9]:
my_range(random_ages)

38

In [10]:
nums = [10, 10, 100, 100]

In [11]:
my_range(nums)

90

In [12]:
nums = [10, 50, 50, 50, 50, 100]

In [13]:
my_range(nums)

90

## Question: How well does range explain the dataset? How useful is it?

In [14]:
# NumPy's equivalent of my_range is ptp
import numpy as np
np.ptp(random_ages) # "peak to peak"

38

# Mean: Average Value
* The mean, unlike the range, is affected by every value in the set.
* A way to measure the center of a dataset.

In [15]:
def mean(x):
    return sum(x) / len(x)

In [16]:
mean(random_ages)

40.68

In [17]:
np.mean(random_ages)

40.68

## Question: What do outliers do to the average?
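
One way to explore this question is to add a single extreme value to a small list and recompute the mean. The sketch below uses made-up ages (not drawn from `random_ages`), and the value 400 stands in for a hypothetical data-entry error.

```python
def mean(x):
    return sum(x) / len(x)

# hypothetical ages for illustration, not taken from random_ages
typical_ages = [30, 32, 35, 38, 40]
# the same list plus one implausible value, e.g. a typo for 40
with_outlier = typical_ages + [400]

mean_typical = mean(typical_ages)
mean_outlier = mean(with_outlier)

print(mean_typical)            # 35.0
print(round(mean_outlier, 2))  # 95.83
```

A single bad value drags the mean from 35 to roughly 96, which is larger than every legitimate age in the list.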

# Median: Mid-Point of Values
* 50% of values fall above the median; 50% fall below it
* Another way to measure the center of a dataset.

In [18]:
def median(x):
    n = len(x)
    sorted_x = sorted(x)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_x[mid - 1] + sorted_x[mid]) / 2
    else:
        return sorted_x[mid]

In [19]:
median(random_ages)

41.0

In [20]:
np.median(random_ages)

41.0

## Question: What do outliers do to the median?
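
A quick way to probe this is to add one extreme value to a small list and see how far the median moves. The ages here are made up for illustration, and the sketch uses the standard library's `statistics.median` rather than the `median` function defined above.

```python
from statistics import median

# hypothetical ages for illustration, not taken from random_ages
typical_ages = [30, 32, 35, 38, 40]
# the same list plus one implausible value
with_outlier = typical_ages + [400]

median_typical = median(typical_ages)
median_outlier = median(with_outlier)

print(median_typical)  # 35
print(median_outlier)  # 36.5
```

The median shifts only from 35 to 36.5, because it depends on the order of the values and not on how extreme the largest one is.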

## Percentile: How Much Data Falls Below?
* The median is simply the 50th percentile
* You can pick an arbitrary percentile

In [21]:
np.percentile(random_ages, 50)

41.0

In [22]:
np.percentile(random_ages, 75)

51.25

In [23]:
np.percentile(random_ages, 25)

31.75

## Interquartile Range (IQR)
* $ IQR = Q_3 - Q_1 $
* The 75th percentile minus the 25th percentile: the width of the "middle" 50% of the dataset
* A quick way to measure spread of the data and ignore the outliers

In [24]:
from scipy import stats
stats.iqr(random_ages)

19.5
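
The IQR can also be computed directly from the two quartiles. The sketch below uses only the standard library; it assumes that `statistics.quantiles` with `method="inclusive"` interpolates the same way as `np.percentile` does by default. The helper name `iqr` and the sample lists are illustrative.

```python
from statistics import quantiles

def iqr(x):
    # method="inclusive" interpolates linearly between data points,
    # assumed here to match np.percentile's default behaviour
    q1, _q2, q3 = quantiles(x, n=4, method="inclusive")
    return q3 - q1

sample = [1, 2, 3, 4, 5, 6, 7, 8]
# same list with the largest value replaced by an outlier
sample_with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]

iqr_sample = iqr(sample)
iqr_outlier = iqr(sample_with_outlier)

print(iqr_sample)   # 3.5
print(iqr_outlier)  # 3.5
print(max(sample) - min(sample))                            # 7
print(max(sample_with_outlier) - min(sample_with_outlier))  # 799
```

The range jumps from 7 to 799 when the outlier is introduced, while the IQR stays at 3.5.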

## Mode
* The most common (frequently occurring) value in a set 
* Another way of measuring center

In [25]:
stats.mode(random_ages)

ModeResult(mode=array([41]), count=array([5]))
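
For comparison, the mode and its count can be found with the standard library alone. This is a minimal sketch; the helper name `mode_and_count` and the sample lists are made up for illustration. A dataset can have more than one mode, and `statistics.multimode` returns all of them.

```python
from collections import Counter
from statistics import multimode

def mode_and_count(x):
    # most_common(1) returns a list holding one (value, count) pair
    value, count = Counter(x).most_common(1)[0]
    return value, count

sample = [23, 41, 35, 41, 23, 41]
result = mode_and_count(sample)
print(result)  # (41, 3)

# two values tie for most frequent, so there are two modes
ties = multimode([21, 21, 59, 59, 40])
print(ties)    # [21, 59]
```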

## Consider the spread of data in two hypothetical datasets

<img src="images/skew-2.png" width=400 height=400>

* How can we identify/quantify different spreads?
* Normal distribution or not? 
* Focus on the <span style="color:blue;font-weight:bold;">mean</span>, <span style="color:green;font-weight:bold;">median</span>, and <span style="color:red;font-weight:bold;">mode</span>

## Variance: How much spread is there in the dataset?
* $Var(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$
* Why is it squared?
* What are the units of variance, assuming our dataset from above?

In [26]:
np.var(random_ages)

124.07760000000002
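
The formula above translates almost line for line into Python. This sketch divides by $n$ (the population variance), which is also what `np.var` does by default; the sample list is made up so that the arithmetic is easy to check by hand.

```python
def variance(x):
    n = len(x)
    x_bar = sum(x) / n
    # average of the squared distances from the mean
    return sum((x_i - x_bar) ** 2 for x_i in x) / n

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5
# squared deviations: 9, 1, 1, 1, 0, 0, 4, 16 -> sum 32 -> 32 / 8
sample_var = variance(sample)
print(sample_var)  # 4.0
```

Because each deviation is squared, the result is in squared units: for ages in years, the variance is in years squared.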

## Standard Deviation
* $\sqrt {Var(X)}$
* Puts the units back into something we are more familiar with
* Roughly, the "standard" (typical) distance of a value from the mean
* Another measure of dispersion

In [27]:
np.std(random_ages)

11.13901252355881
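
As a small check, the standard deviation is the square root of the variance. The sketch below uses the standard library's `statistics` module on a made-up list. It also shows the sample version, which divides by $n - 1$ instead of $n$; `np.std` uses the divide-by-$n$ version unless `ddof=1` is passed.

```python
import math
from statistics import pstdev, pvariance, stdev

sample = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(sample)             # divides by n
from_variance = math.sqrt(pvariance(sample))
sample_sd = stdev(sample)                  # divides by n - 1

print(population_sd)        # 2.0
print(from_variance)        # 2.0
print(round(sample_sd, 3))  # 2.138
```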

<img style="height: 350px;" src="images/ss-01.png">

## Question: How do we use this?

## In a normal distribution...
* 68% of the data will fall within 1 standard deviation of the mean
* 95% of the data will fall within 2 standard deviations of the mean
* 99.7% of the data will fall within 3 standard deviations of the mean
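
These percentages (roughly 68%, 95%, and 99.7%) can be checked empirically by drawing simulated values from a normal distribution and counting how many land within each band. The mean of 40, standard deviation of 10, sample size, and seed below are arbitrary choices for illustration.

```python
import random

random.seed(0)       # arbitrary seed so the run is repeatable
mu, sigma = 40, 10   # hypothetical mean and standard deviation
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

def fraction_within(values, center, spread, k):
    # share of values no more than k standard deviations from the center
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)

fractions = {k: fraction_within(samples, mu, sigma, k) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(k, round(frac, 3))
```

With this many draws, the three printed fractions should come out close to 0.68, 0.95, and 0.997.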

## Skewness
* We often assume a sample follows a normal distribution and then generalize from it to the population at large. If the sample is skewed, those generalizations will be wrong.
* e.g., polling only people who have home phones (landlines). What's wrong with that?

<img style="height: 200px;" src="images/skew-1.png">

In [28]:
stats.skew(random_ages)

-0.025550022737476627

In [32]:
data = [1, 2, 3, 75, 75, 75, 77, 78, 78, 91, 78, 81, 93, 94, 95, 96, 105, 106]

In [34]:
stats.skew(data)

-1.4291927105469837

In [33]:
np.mean(data)

72.38888888888889
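
To see where the skewness number comes from, it can be computed by hand. The sketch below assumes that `stats.skew` with default arguments returns the third central moment divided by the second central moment raised to the power 1.5 (both moments dividing by $n$). The function name `skewness` is illustrative.

```python
from statistics import mean, median

def skewness(x):
    n = len(x)
    x_bar = sum(x) / n
    m2 = sum((v - x_bar) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - x_bar) ** 3 for v in x) / n  # third central moment
    return m3 / m2 ** 1.5

data = [1, 2, 3, 75, 75, 75, 77, 78, 78, 91, 78, 81, 93, 94, 95, 96, 105, 106]

data_skew = skewness(data)
data_mean = mean(data)
data_median = median(data)

print(round(data_skew, 4))  # -1.4292
print(round(data_mean, 2))  # 72.39
print(data_median)          # 78.0
```

The three small values (1, 2, 3) pull the mean below the median, which is the pattern a negative skew describes: a long tail on the low side.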