# Demo: Let's Make Some Data!

In [1]:
# range from 20 to 59
ages = range(20, 60)

In [2]:
import random
random_ages = [random.choice(ages) for _ in range(100)]

In [3]:
print(random_ages) # using print eschews pretty-printing

[57, 29, 46, 58, 57, 21, 48, 29, 41, 51, 50, 36, 25, 33, 56, 26, 22, 28, 27, 29, 20, 47, 40, 58, 21, 55, 22, 46, 46, 34, 39, 54, 43, 44, 49, 21, 46, 46, 55, 42, 52, 59, 54, 37, 55, 56, 30, 40, 20, 43, 57, 45, 32, 48, 24, 57, 42, 49, 54, 43, 32, 41, 21, 40, 49, 56, 39, 54, 27, 48, 46, 48, 20, 44, 45, 21, 41, 21, 36, 28, 59, 58, 46, 45, 43, 34, 29, 28, 52, 32, 58, 33, 43, 54, 34, 30, 21, 57, 48, 27]
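
Because the ages are drawn at random, rerunning the cells above gives a different list each time. Below is a minimal sketch of one way to make the sample reproducible. The helper name `make_ages` and the seed value 42 are assumptions for illustration and are not part of the original demo; a seeded run will not reproduce the exact list printed above.

```python
import random

def make_ages(seed, n=100):
    """Draw n random ages from 20..59 with a seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    ages = range(20, 60)
    return [rng.choice(ages) for _ in range(n)]

sample_a = make_ages(42)
sample_b = make_ages(42)
print(sample_a == sample_b)  # the same seed yields the same sample
```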


In [4]:
max(random_ages)

59

In [5]:
min(random_ages)

20

## Range: How Wide is the Dispersion of the Data?

In [6]:
def my_range(x):
    '''Python would let us call this function range,
       but if we did that, we would lose access to
       the built-in function range'''
    return max(x) - min(x)

In [7]:
my_range(random_ages)

39

In [8]:
nums = [10, 10, 100, 100]

In [9]:
my_range(nums)

90

In [10]:
nums = [10, 50, 50, 50, 50, 100]

In [11]:
my_range(nums)

90

In [12]:
# numpy's equivalent of my_range is called ptp
import numpy as np
np.ptp(random_ages) # "peak to peak"

39

## Mean: Average Value

In [13]:
def mean(x):
    return sum(x) / len(x)

In [14]:
mean(random_ages)

40.82

In [15]:
np.mean(random_ages)

40.82

## Median: Mid-Point of Values

In [16]:
def median(x):
    n = len(x)
    sorted_x = sorted(x)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_x[mid - 1] + sorted_x[mid]) / 2
    else:
        return sorted_x[mid]

In [17]:
median(random_ages)

43.0

In [18]:
np.median(random_ages)

43.0

## Percentile: How Much Data Falls Below?

In [19]:
np.percentile(random_ages, 50)

43.0

In [20]:
np.percentile(random_ages, 75)

50.25

In [21]:
np.percentile(random_ages, 25)

29.75
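
The fractional results above (50.25, 29.75) come from interpolating between neighboring sorted values. Here is a rough from-scratch sketch of that idea, assuming the linear interpolation scheme that `np.percentile` uses by default. The function name `percentile` is our own, and the sketch does not reproduce NumPy's other interpolation methods.

```python
import math

def percentile(x, q):
    """Approximate the q-th percentile (0 <= q <= 100) by linear interpolation."""
    sorted_x = sorted(x)
    # fractional position of the percentile within the sorted data
    pos = (len(sorted_x) - 1) * q / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # move frac of the way from the lower neighbor to the upper neighbor
    return sorted_x[lower] + (sorted_x[upper] - sorted_x[lower]) * frac

print(percentile([10, 20, 30, 40], 50))  # halfway between 20 and 30
print(percentile([10, 20, 30, 40], 75))  # a quarter of the way from 30 to 40
```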

## Interquartile Range (IQR)
* $IQR = Q_3 - Q_1$
* 75th percentile minus 25th percentile

In [22]:
from scipy import stats
stats.iqr(random_ages)

20.5
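
As a cross-check on the definition, the IQR can be sketched with only the standard library. This assumes `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between sorted values and so should agree with NumPy's default percentiles. The function name `iqr` is our own.

```python
import statistics

def iqr(x):
    """Interquartile range: Q3 - Q1, using inclusive (linear) quartiles."""
    q1, _q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1

print(iqr([10, 20, 30, 40]))  # Q1 = 17.5, Q3 = 32.5
```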

## Mode

In [23]:
stats.mode(random_ages)

ModeResult(mode=array([21]), count=array([7]))
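
The mode is simply the most frequent value, so it can also be sketched with `collections.Counter`. The function name `mode` and the tie-breaking rule (return the smallest of the tied values) are assumptions made for this illustration.

```python
from collections import Counter

def mode(x):
    """Return (most common value, its count); ties go to the smallest value."""
    counts = Counter(x)
    highest = max(counts.values())
    # several values may share the highest count; pick the smallest one
    value = min(v for v, c in counts.items() if c == highest)
    return value, highest

print(mode([21, 21, 30, 30, 45]))  # 21 and 30 tie with a count of 2
```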

## Consider the spread of data in two hypothetical datasets

<img src="../images/skew-2.png" width=400 height=400>

* how can we identify/quantify different spreads?
* focus on the <span style="color:blue;font-weight:bold;">mean</span>, <span style="color:green;font-weight:bold;">median</span>, and <span style="color:red;font-weight:bold;">mode</span>

## Variance: How much spread is there in the dataset?
* $Var(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$
* Why is it squared?
* What are the units of variance, assuming our dataset from above?

In [24]:
np.var(random_ages)

144.7076
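
To make the formula concrete, here is a minimal from-scratch sketch of the population variance (dividing by $n$, which is also what `np.var` does by default). The function name `variance` is our own. It reuses the two `nums` lists from the range demo: both have a range of 90, but their variances differ.

```python
def variance(x):
    """Population variance: the mean of the squared deviations from the mean."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** 2 for x_i in x) / n

extremes = [10, 10, 100, 100]
clustered = [10, 50, 50, 50, 50, 100]

# raw deviations cancel out, which is one reason they are squared
print(sum(x_i - 55 for x_i in extremes))  # the mean of extremes is 55

print(variance(extremes))
print(variance(clustered))  # smaller: most values sit near the mean
```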

## Standard Deviation
* $\sqrt {Var(X)}$
* puts the units back into something we like
* "standard variation" from the mean

In [None]:
np.std(random_ages)
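
Since the standard deviation is just the square root of the variance, it can be sketched from scratch in a few lines. The function name `std` is our own, and the sketch assumes the population form (dividing by $n$), matching the variance formula above.

```python
import math

def std(x):
    """Population standard deviation: square root of the population variance."""
    n = len(x)
    x_bar = sum(x) / n
    var = sum((x_i - x_bar) ** 2 for x_i in x) / n
    return math.sqrt(var)

print(std([10, 10, 100, 100]))  # every value is 45 away from the mean of 55

# applied to the variance reported for random_ages above (units: years, not years squared)
print(math.sqrt(144.7076))
```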

<img style="height: 350px;" src="images/ss-01.png">

## In a normal distribution...
* about 68% of the data will fall within 1 standard deviation of the mean
* about 95% of the data will fall within 2 standard deviations of the mean
* about 99.7% of the data will fall within 3 standard deviations of the mean
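
These fractions can be computed directly rather than memorized. The sketch below assumes the standard identity $P(|Z| < k) = \operatorname{erf}(k / \sqrt{2})$ for a normally distributed variable; the function name `fraction_within` is our own.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))  # roughly 68%, 95%, and 99.7%
```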

## Skewness
* if we expect our sample to reflect a normal distribution and then generalize from it to the population at large, our conclusions will be wrong if the sample is skewed
* e.g., polling people who have land lines

<img style="height: 200px;" src="images/skew-1.png">

In [None]:
stats.skew(random_ages)
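
For intuition about what that number measures, here is a rough from-scratch sketch of the Fisher-Pearson skewness coefficient (third central moment divided by the second central moment raised to the power 1.5), which is what `stats.skew` computes with its default settings. The function name `skewness` and the example lists are assumptions for illustration.

```python
def skewness(x):
    """Fisher-Pearson coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(x)
    x_bar = sum(x) / n
    m2 = sum((v - x_bar) ** 2 for v in x) / n  # second central moment (variance)
    m3 = sum((v - x_bar) ** 3 for v in x) / n  # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))  # symmetric data: no skew
print(skewness([1, 1, 1, 10]))    # long right tail: positive skew
print(skewness([1, 10, 10, 10]))  # long left tail: negative skew
```

A positive value means the tail stretches to the right, which tends to pull the mean above the median; a negative value means the opposite.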