* DSC530: Week 5
* 5.2 Exercise
* Marty Hoehler
* 4-14-24

# Exercise 5-1

If members of the Blue Man Group must be between 5'10" and 6'1", that's between 70 and 73 inches.  (I could apply!  I'm 6' even.)
To work with the normal distribution in the text, we'll need to convert that to cm, using 2.54 as the conversion factor.

In [1]:
lower_ht= 70*2.54
lower_ht

177.8

In [2]:
upper_ht= 73*2.54
upper_ht

185.42000000000002

Next, we import scipy.stats and numpy.

In [3]:
import scipy.stats
import numpy as np

We need to find the probability of being between 177.8cm and 185.42cm.  First we determine the probability of being under 177.8cm.

In [4]:
mean = 178
std = 7.7

cdf_low =scipy.stats.norm.cdf(lower_ht, loc=mean, scale=std)
cdf_low

0.48963902786483265

Next, we need the probability of being under 185.42cm.

In [5]:
cdf_high =scipy.stats.norm.cdf(upper_ht, loc=mean, scale=std)
cdf_high

0.8323858654963072

The probability of being between 177.8cm and 185.42cm is the upper cdf value minus the lower cdf value.

In [6]:
prob_blue_man = cdf_high - cdf_low
prob_blue_man

0.34274683763147457

So about 34% of the male population is the right height to be eligible for the Blue Man Group.
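
As a cross-check on the scipy result, the same calculation can be sketched with only the Python standard library. This assumes the same model as above (mean 178cm, standard deviation 7.7cm); the variable names are illustrative and not part of the assignment.

```python
from statistics import NormalDist

# Height model used above: mean 178 cm, standard deviation 7.7 cm
height_model = NormalDist(mu=178, sigma=7.7)

lower_cm = 70 * 2.54  # 5'10" converted to cm
upper_cm = 73 * 2.54  # 6'1" converted to cm

# P(lower < X < upper) = CDF(upper) - CDF(lower)
fraction_eligible = height_model.cdf(upper_cm) - height_model.cdf(lower_cm)
print(round(fraction_eligible, 4))  # 0.3427
```

The agreement with the scipy value suggests the 34% figure does not depend on the library used.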

# Exercise 5-2
If we try to fit our height data to the Pareto distribution, we will get some unusual results.  First, we will build the distribution and confirm that the median matches the value given in the text.

In [7]:

param_x = 1
param_a = 1.7

pareto_dist = scipy.stats.pareto(b=param_a, scale=param_x)
pareto_dist.median()

1.5034066538560549

The book said the median worked out to 1.5, so this matches.  Next the assignment asks for the mean.

In [8]:
pareto_mean = pareto_dist.mean()
pareto_mean

2.428571428571429

We are already seeing why the Pareto is an inappropriate model for height.  2.43 meters is almost 8 feet tall.
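
The median and mean can also be checked against the closed-form Pareto formulas. The sketch below assumes the standard parameterization with minimum x_m and shape alpha, set to the same values used above; the names are chosen for illustration only.

```python
x_m = 1.0    # scale: minimum height in meters, as in the exercise
alpha = 1.7  # shape parameter, as in the exercise

# Median: solve 1 - (x / x_m) ** (-alpha) = 0.5 for x
closed_form_median = x_m * 2 ** (1 / alpha)

# Mean: alpha * x_m / (alpha - 1), which is finite only when alpha > 1
closed_form_mean = alpha * x_m / (alpha - 1)

print(closed_form_median, closed_form_mean)
```

Both values should agree with the scipy results above, up to floating-point rounding.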

In [9]:
pareto_dist.cdf(pareto_mean)

0.778739697565288

About 78% of the population would fall below the mean of about 8 feet, which leaves about 22% above it.  Next, to find the number of people taller than 1 km, we find the probability of exceeding that height (1 minus the probability of being under 1 km, which is 1,000 m in this model's units) and multiply it by 7 billion.

In [10]:
(1 - pareto_dist.cdf(1000))*7000000000

55602.976430479954

We would have over 55,000 people taller than 1 km.  To find the height of the tallest person, we take the cumulative probability that corresponds to the tallest of 7 billion people, 1 - 1/7,000,000,000, and use the percent point function (the inverse of the cdf) to find the height at that probability.

In [11]:
pareto_dist.ppf((1-1/7000000000))

618349.6106759505

In "Pareto World", as the book calls it, we would expect the tallest person to be about 618 km tall.
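
The two tail calculations can be sketched directly from the Pareto survival function, S(x) = (x / x_m)^(-alpha). The helper names `pareto_sf` and `pareto_isf` below are hypothetical, written for this illustration; they are not scipy functions. Working with the survival function avoids subtracting a number very close to 1 from 1.

```python
x_m = 1.0                   # minimum height in meters
alpha = 1.7                 # shape parameter
population = 7_000_000_000  # population size assumed in the exercise

def pareto_sf(x):
    """Survival function: probability of being taller than x meters."""
    return (x / x_m) ** (-alpha) if x >= x_m else 1.0

def pareto_isf(q):
    """Inverse survival function: height exceeded with probability q."""
    return x_m * q ** (-1 / alpha)

# Expected number of people taller than 1 km (1000 m)
taller_than_km = population * pareto_sf(1000)

# Height exceeded by about one person in the whole population
tallest_m = pareto_isf(1 / population)

print(taller_than_km, tallest_m / 1000)  # count, and height in km
```

These should land close to the scipy values above; tiny differences in the last digits are expected because scipy's `ppf` starts from 1 - 1/7,000,000,000, which loses some precision.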


# Exercise 6-1

We will begin by importing the packages and data needed for the exercise, using the recommended code.

In [12]:
# Import data (from previous assignments)
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc06.csv")

import hinc
import thinkstats2
import hinc2
import numpy as np
import thinkplot


income_df = hinc.ReadData()


This "income_df" dataframe is a summary table of the census data that counts respondents within income ranges.  With data like this, we can't calculate a true sample mean, because we don't know where within each range the individual records sit.  So we use the provided function InterpolateSample() to create a sample that fits the summarized data.  The instructions have us set the upper bound for the last bin to $1,000,000, or 10^6.

In [13]:
# Function provided by text in hinc2.py

def InterpolateSample(df, log_upper=6.0):
    """Makes a sample of log10 household income.

    Assumes that log10 income is uniform in each range.

    df: DataFrame with columns income and freq
    log_upper: log10 of the assumed upper bound for the highest range

    returns: NumPy array of log10 household income
    """
    # compute the log10 of the upper bound for each range
    df['log_upper'] = np.log10(df.income)

    # get the lower bounds by shifting the upper bound and filling in
    # the first element
    df['log_lower'] = df.log_upper.shift(1)
    df.loc[0, 'log_lower'] = 3.0

    # plug in a value for the unknown upper bound of the highest range
    df.loc[41, 'log_upper'] = log_upper
    
    # use the freq column to generate the right number of values in
    # each range
    arrays = []
    for _, row in df.iterrows():
        vals = np.linspace(row.log_lower, row.log_upper, int(row.freq))
        arrays.append(vals)

    # collect the arrays into a single sample
    log_sample = np.concatenate(arrays)
    return log_sample

In [14]:
log_sample = InterpolateSample(income_df, log_upper=6.0)

The array needs to be converted back out of log10.  The text shows us how to do this with numpy.

In [15]:
sample = np.power(10, log_sample)
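
To make the log-uniform assumption concrete, here is a toy version of the same idea in plain Python. The bins and counts are made up for illustration, and `log_uniform_bin` is a hypothetical helper, not part of hinc2.py. It spaces values evenly in log10 between the bin edges and then converts back to dollars, as the two steps above do.

```python
import math

def log_uniform_bin(lower, upper, count):
    """Return `count` incomes spaced evenly in log10 between lower and upper."""
    log_lo, log_hi = math.log10(lower), math.log10(upper)
    if count == 1:
        return [10 ** log_lo]
    step = (log_hi - log_lo) / (count - 1)
    return [10 ** (log_lo + i * step) for i in range(count)]

# Hypothetical bins: (lower bound, upper bound, number of households)
toy_bins = [(1_000, 10_000, 3), (10_000, 100_000, 5)]

toy_sample = []
for lower, upper, count in toy_bins:
    toy_sample.extend(log_uniform_bin(lower, upper, count))

print([round(x) for x in toy_sample])
```

The middle value of the first bin comes out near $3,162 (10^3.5) rather than $5,500, which shows how spacing values evenly in log10 places more of them toward the low end of each range.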

We can now calculate the median, mean, skewness and Pearson's median skewness, using the functions for moments provided in the text (pp. 72-74).

In [16]:
def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)

def Mean(xs):
    return RawMoment(xs, 1)

def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)

def Var(xs):
    return CentralMoment(xs, 2)

def StandardizedMoment(xs, k):
    var = CentralMoment(xs, 2)
    std = np.sqrt(var)
    return CentralMoment(xs, k) / std**k

def Skewness(xs):
    return StandardizedMoment(xs, 3)

def Median(xs):
    cdf = thinkstats2.Cdf(xs)
    return cdf.Value(0.5)

def PearsonMedianSkewness(xs):
    median = Median(xs)
    mean = RawMoment(xs, 1)
    var = CentralMoment(xs, 2)
    std = np.sqrt(var)
    gp = 3 * (mean - median) / std
    return gp

In [17]:
mean = Mean(sample)
median = Median(sample)
skewness = Skewness(sample)
Pearsons = PearsonMedianSkewness(sample)

print('Mean: ',  mean)
print('Median: ',  median)
print('Skewness: ',  skewness)
print('Pearson Median Skewness: ',  Pearsons)

Mean:  74278.70753118733
Median:  51226.45447894046
Skewness:  4.949920244429583
Pearson Median Skewness:  0.7361258019141782
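
To see how the two skewness measures behave, here is a small sketch on a made-up right-skewed list, using only the standard library. The data are hypothetical. Note that `statistics.median` averages the two middle values for an even-length list, which can differ slightly from the Cdf-based Median() above.

```python
import statistics

toy = [1, 2, 2, 3, 3, 3, 4, 20]  # hypothetical right-skewed data

toy_mean = statistics.fmean(toy)
toy_median = statistics.median(toy)
toy_std = statistics.pstdev(toy)  # divides by n, like CentralMoment above

# Sample skewness: third central moment divided by std cubed
toy_skewness = sum((x - toy_mean) ** 3 for x in toy) / len(toy) / toy_std ** 3

# Pearson's median skewness: 3 * (mean - median) / std
toy_pearson = 3 * (toy_mean - toy_median) / toy_std

print(toy_mean, toy_median, toy_skewness, toy_pearson)
```

Both measures come out positive here, as they do for the income sample. The moment-based skewness is larger because cubing the deviations gives the single large value a lot of weight.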


Now, we'll use thinkstats2 to calculate the probability of falling below the mean.

In [18]:
cdf = thinkstats2.Cdf(sample)

cdf.Prob(mean)

0.660005879566872

This result is influenced by the choice we made in creating the interpolated sample.  Because we capped the top income at $1 million, the mean is lower than it would be if, say, some billionaire outliers were pulling it up.  To demonstrate this, I'll run a quick version where the upper limit for income is set to $10 billion instead of $1 million.

In [19]:
log_sample2 = InterpolateSample(income_df, log_upper=10.0)
sample2 = np.power(10, log_sample2)
mean2 = Mean(sample2)
cdf2 = thinkstats2.Cdf(sample2)

print('New mean: ' , mean2, 'Probability beneath the mean: ' , cdf2.Prob(mean2))


New mean:  22526983.866709236 Probability beneath the mean:  0.986330007022816


Allowing for some very high incomes pulls the mean up to about $22.5 million, so nearly all of the interpolated sample (about 98.6%) falls below it.
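
The same effect shows up on a tiny made-up example. In the sketch below, the incomes are hypothetical and `fraction_at_or_below_mean` is an illustrative helper. It counts values at or below the mean, which I believe matches how cdf.Prob() evaluates the CDF.

```python
import statistics

def fraction_at_or_below_mean(xs):
    """Fraction of values that are less than or equal to the mean."""
    m = statistics.fmean(xs)
    return sum(x <= m for x in xs) / len(xs)

incomes = [20_000, 35_000, 50_000, 65_000, 80_000]  # hypothetical incomes
with_outlier = incomes + [10_000_000]               # add one very high income

for label, xs in [("original", incomes), ("with outlier", with_outlier)]:
    print(label, statistics.fmean(xs), statistics.median(xs),
          fraction_at_or_below_mean(xs))
```

One extreme value moves the mean from $50,000 to about $1.7 million, while the median only shifts from $50,000 to $57,500. That is why the median is the more stable summary for income data like this.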