https://github.com/mica5/statistics/blob/master/statistics.ipynb

# Introduction

Unless otherwise stated, the data here comes from the book Probability and Statistics for Engineering and the Sciences, 9th Edition, by Jay Devore (CENGAGE Learning): https://www.amazon.com/Probability-Statistics-Engineering-Sciences-Devore/dp/1305251806 . The concepts also draw on MATH161A at SJSU (San Jose State University in San Jose, California), taken in Spring 2018.


# Stratified Sampling

In [9]:
from sklearn.model_selection import train_test_split


i = 100

# generate the population. it will look like [0, 0, ..., 0, 1, 1, ..., 1], with twice as many 1s as 0s
population = [0]*i + [1]*i*2
# the category labels define the strata. stratified sampling first splits the
# data by category, then takes a random sample within each category. the number
# of items drawn from each category is roughly proportional to that category's
# share of the population, so every category is represented in proportion to
# its size, and no category ends up over- or underrepresented by chance.
category = [0]*i + [1]*i*2
# like the population, the labels have twice as many 1s as 0s

# take the stratified sample. as of sklearn version 0.19.1, test_size defaults to 25% and train is the complement of test, so the sample (the train part) will be 75% of the population
sample, _ = train_test_split(population, stratify=category)

# count the number of 0s and 1s
c0 = sample.count(0)
c1 = sample.count(1)
total = c0 + c1
# here are the proportions.
print('proportion of 0s:', c0 / total)
print('proportion of 1s:', c1 / total)
print('all added together to make sure it adds to 1:', (c0+c1)/total)
print('number of 1s compared to 0s, which should be 2:', c1/c0)

proportion of 0s: 0.3333333333333333
proportion of 1s: 0.6666666666666666
all added together to make sure it adds to 1: 1.0
number of 1s compared to 0s, which should be 2: 2.0
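
To make the proportional allocation explicit, here is a minimal sketch of the same idea using only the standard library. The helper name `stratified_sample` and its `fraction` and `seed` parameters are made up for this illustration; this is not how sklearn implements `stratify`, only a hand-rolled version of the concept described in the comments above.

```python
import random
from collections import defaultdict


def stratified_sample(items, labels, fraction, seed=None):
    """Draw about `fraction` of the items from each stratum (label).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # group the items by their stratum label
    strata = defaultdict(list)
    for item, label in zip(items, labels):
        strata[label].append(item)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # proportional allocation: take the same fraction from every stratum
        k = round(len(members) * fraction)
        # sample without replacement within the stratum
        sample.extend(rng.sample(members, k))
    return sample


population = [0] * 100 + [1] * 200
category = list(population)
sample = stratified_sample(population, category, fraction=0.75, seed=0)
print(len(sample), sample.count(0), sample.count(1))
# prints: 225 75 150
```

Because every stratum contributes the same fraction, the 1:2 ratio of 0s to 1s in the population carries over to the sample exactly.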


# Stem and leaf display

In [10]:
# Presidential Commission on the Space Shuttle Challenger Accident, Vol. 1, 1986: 129-131
o_ring_temperatures = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58, 68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67, 53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31]

def stem_and_leaf_display(data):
    """Print a stem-and-leaf display to standard output.

    This is a very primitive algorithm: it only works for
    two-digit numbers. It is for demonstration purposes only.
    """
    # map each stem (tens digit) to the list of its leaves (ones digits)
    sld = dict()
    for n in data:
        stem = str(n)[0]
        if stem not in sld:
            sld[stem] = list()
        sld[stem].append(str(n)[-1])
    # add empty stems between the lowest and highest so gaps stay visible
    lowkey = min(sld.keys())
    highkey = max(sld.keys())
    for i in range(int(lowkey), int(highkey)+1, 1):
        i = '{}'.format(i)
        if i not in sld:
            sld[i] = list()
    # print one row per stem, with its leaves in ascending order
    for stem in sorted(sld.keys()):
        print(stem+'|', end='')
        for value in sorted(sld[stem]):
            print(value, end='')
        print()
stem_and_leaf_display(o_ring_temperatures)

3|1
4|059
5|23788
6|01136777789
7|000023556689
8|0134
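
The function above is limited to two-digit numbers because it takes the stem from the first character of the string. As a hedged sketch of one way around that, the version below splits each number with `divmod`, so the stem is everything except the last digit. The name `stem_and_leaf_lines` is invented for this example, and it assumes non-negative integers with a leaf unit of 1.

```python
def stem_and_leaf_lines(data):
    """Return the rows of a stem-and-leaf display as a list of strings.

    Hypothetical helper; assumes non-negative integers.
    """
    stems = {}
    for n in data:
        # stem is everything but the last digit, leaf is the last digit
        stem, leaf = divmod(n, 10)
        stems.setdefault(stem, []).append(leaf)
    lines = []
    # include empty stems so gaps in the data stay visible
    for stem in range(min(stems), max(stems) + 1):
        leaves = ''.join(str(leaf) for leaf in sorted(stems.get(stem, [])))
        lines.append('{}|{}'.format(stem, leaves))
    return lines


# one-, two-, and three-digit values in the same display
for line in stem_and_leaf_lines([7, 12, 15, 31, 104, 108]):
    print(line)
```

Keeping the stems as integers also avoids comparing them as strings, where `'10'` would sort before `'9'`.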
