# Exercise 5.2
**Skylar Peven**
**DSC 530**
**01/12/2025**

**5.1**
In the BRFSS (see Section 5.4), the distribution of heights is
roughly normal with parameters µ = 178 cm and σ = 7.7 cm for men, and
µ = 163 cm and σ = 7.3 cm for women.
In order to join Blue Man Group, you have to be male between 5’10” and
6’1” (see http://bluemancasting.com).

What percentage of the U.S. male
population is in this range? Hint: use scipy.stats.norm.cdf.

In [5]:
import scipy.stats

# Creates a normal distribution with mu of 178cm and sigma of 7.7cm.
dist = scipy.stats.norm(178, 7.7)

# Evaluate CDF for 5'10"
low = dist.cdf(177.8)  

# Evaluate CDF for 6'1"
high = dist.cdf(185.4)  

print("Fraction of U.S. Male Pop. that could join the Blue Man Group: ", high-low)

Fraction of U.S. Male Pop. that could join the Blue Man Group:  0.3420946829459531
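
As a cross-check on the unit conversion and the CDF arithmetic, here is a minimal sketch that uses only the standard library. The helper names `to_cm` and `normal_cdf` are made up for this illustration; the normal CDF is written with `math.erf` instead of `scipy.stats.norm.cdf`.

```python
import math

def to_cm(feet, inches):
    """Convert a height in feet and inches to centimeters (1 in = 2.54 cm)."""
    return (feet * 12 + inches) * 2.54

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, expressed with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 178, 7.7       # parameters for men from the exercise, in cm
low_cm = to_cm(5, 10)      # 5'10"
high_cm = to_cm(6, 1)      # 6'1"

# Probability mass between the two heights.
fraction = normal_cdf(high_cm, mu, sigma) - normal_cdf(low_cm, mu, sigma)
print(f"{low_cm:.2f} cm to {high_cm:.2f} cm: {fraction:.1%} of men")
```

The result should land at roughly 34 percent, in line with the scipy calculation; the small difference comes from using 185.42 cm rather than the rounded 185.4 cm for 6'1".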


**5.2**
To get a feel for the Pareto distribution, let’s see how different
the world would be if the distribution of human height were Pareto. With the
parameters xm = 1 m and α = 1.7, we get a distribution with a reasonable
minimum, 1 m, and median, 1.5 m.

Plot this distribution.

What is the mean human height in Pareto world?

What fraction of the population is shorter than the mean?

If there are 7 billion people in Pareto world, how many do we expect to be taller than 1 km?

How tall do we expect the tallest person to be?

In [2]:
# Create a Pareto distribution with shape alpha = 1.7 and minimum height xm = 1 m.
# scipy.stats.pareto takes (b, loc, scale); xm is the scale, so pass it by keyword.
dist = scipy.stats.pareto(b=1.7, scale=1)

# Average height
print("Mean Height in Pareto World: ", dist.mean(), "Meters")

# Fraction of population shorter than the mean.
print("Fraction of the Pop. Shorter than Average: ", dist.cdf(dist.mean()), "Short Kings")

# Fraction of people taller than 1000 m (1 km), multiplied by 7 billion people.
print("# of People Taller than 1km: ", (1 - dist.cdf(1000)) * 7e9, "Pareto People")

# Tallest person: the height exceeded by about 1 person in 7 billion.
print("Tallest Person in Pareto World: ", dist.ppf(1 - 1 / 7e9), "Meters")

Mean Height in Pareto World:  2.428571428571429 Meters
Fraction of the Pop. Shorter than Average:  0.778739697565288 Short Kings
# of People Taller than 1km:  55602.976430479954 Pareto People
Tallest Person in Pareto World:  618349.6106759505 Meters
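
The Pareto quantities also have closed forms, so they can be checked without scipy. The sketch below assumes the standard parameterization from the exercise (minimum xm = 1 m, shape α = 1.7); the function names are hypothetical helpers for this check. The median it produces, about 1.5 m, matches the value quoted in the exercise, which is a quick way to confirm the parameters are set up as intended.

```python
ALPHA = 1.7        # shape parameter (dimensionless)
XM = 1.0           # minimum height in meters
POPULATION = 7e9   # people in Pareto world

def pareto_cdf(x, xm=XM, alpha=ALPHA):
    """P(X <= x) = 1 - (x / xm) ** -alpha for x >= xm, else 0."""
    return 1 - (x / xm) ** -alpha if x >= xm else 0.0

def pareto_sf(x, xm=XM, alpha=ALPHA):
    """Survival function P(X > x) = (x / xm) ** -alpha for x >= xm, else 1."""
    return (x / xm) ** -alpha if x >= xm else 1.0

def pareto_isf(q, xm=XM, alpha=ALPHA):
    """Height exceeded with probability q: xm * q ** (-1 / alpha)."""
    return xm * q ** (-1 / alpha)

mean = ALPHA * XM / (ALPHA - 1)          # finite only when alpha > 1
median = pareto_isf(0.5)
below_mean = pareto_cdf(mean)
taller_than_1km = pareto_sf(1000) * POPULATION
tallest = pareto_isf(1 / POPULATION)     # exceeded by about 1 person in 7 billion

print(f"mean = {mean:.3f} m, median = {median:.3f} m")
print(f"fraction below mean = {below_mean:.3f}")
print(f"expected people taller than 1 km = {taller_than_1km:.0f}")
print(f"expected tallest person = {tallest / 1000:.0f} km")
```

Using the survival function directly avoids the small rounding error that `1 - cdf(x)` picks up when the CDF is very close to 1.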


**6.1**
The distribution of income is famously skewed to the right. In
this exercise, we’ll measure how strong that skew is.
The Current Population Survey (CPS) is a joint effort of the Bureau of Labor
Statistics and the Census Bureau to study income and related variables.
Data collected in 2013 is available from http://www.census.gov/hhes/www/
cpstables/032013/hhinc/toc.htm. I downloaded hinc06.xls, which is an
Excel spreadsheet with information about household income, and converted
it to hinc06.csv, a CSV file you will find in the repository for this book.
You will also find hinc2.py, which reads this file and transforms the data.
The dataset is in the form of a series of income ranges and the number of
respondents who fell in each range. The lowest range includes respondents
who reported annual household income “Under $5000.” The highest range
includes respondents who made “$250,000 or more.”

To estimate mean and other statistics from these data, we have to make some
assumptions about the lower and upper bounds, and how the values are distributed in each range. hinc2.py provides InterpolateSample, which shows
one way to model this data. It takes a DataFrame with a column, income,
that contains the upper bound of each range, and freq, which contains the
number of respondents in each frame.
It also takes log_upper, which is an assumed upper bound on the highest
range, expressed in log10 dollars. The default value, log_upper=6.0 represents the assumption that the largest income among the respondents is 10^6,
or one million dollars.
InterpolateSample generates a pseudo-sample; that is, a sample of household incomes that yields the same number of respondents in each range as
the actual data. It assumes that incomes in each range are equally spaced
on a log10 scale.
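
To make "equally spaced on a log10 scale" concrete, here is a small standard-library sketch for a single range. The range ($10,000 to $100,000) and the count of 5 are hypothetical values chosen for readability, and `log_spaced_values` is an illustrative helper, not part of hinc2.py (which does the same thing with `np.linspace`).

```python
import math

def log_spaced_values(lower, upper, count):
    """Return `count` dollar values equally spaced in log10 between
    lower and upper, endpoints included."""
    log_lo, log_hi = math.log10(lower), math.log10(upper)
    if count == 1:
        return [float(lower)]
    step = (log_hi - log_lo) / (count - 1)
    return [10 ** (log_lo + i * step) for i in range(count)]

# Hypothetical range and respondent count.
values = log_spaced_values(10_000, 100_000, 5)
print([round(v) for v in values])
# → [10000, 17783, 31623, 56234, 100000]
```

Each value is a constant multiple of the previous one, so most of the pseudo-sample sits toward the low end of the range in dollar terms.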

Compute the median, mean, skewness and Pearson’s skewness of the resulting
sample. What fraction of households reports a taxable income below the
mean?

In [2]:
# Download the necessary data.
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)

# Download the Python modules needed for the course.
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")
# Download the Python modules and data needed for this exercise.
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/hinc06.csv")

import hinc
import thinkplot
import thinkstats2
import numpy as np

# From hinc2.py
def InterpolateSample(df, log_upper=6.0):
    """Makes a sample of log10 household income.

    Assumes that log10 income is uniform in each range.

    df: DataFrame with columns income and freq
    log_upper: log10 of the assumed upper bound for the highest range

    returns: NumPy array of log10 household income
    """
    # compute the log10 of the upper bound for each range
    df['log_upper'] = np.log10(df.income)

    # get the lower bounds by shifting the upper bound and filling in
    # the first element
    df['log_lower'] = df.log_upper.shift(1)
    df.loc[0, 'log_lower'] = 3.0

    # plug in a value for the unknown upper bound of the highest range
    df.loc[41, 'log_upper'] = log_upper
    
    # use the freq column to generate the right number of values in
    # each range
    arrays = []
    for _, row in df.iterrows():
        vals = np.linspace(row.log_lower, row.log_upper, int(row.freq))
        arrays.append(vals)

    # collect the arrays into a single sample
    log_sample = np.concatenate(arrays)
    return log_sample


In [3]:
# Read the income ranges and respondent counts into a DataFrame.
income_df = hinc.ReadData()
# Generate a pseudo-sample of log10 household incomes, assuming the highest income is 10^6 dollars.
log_sample = InterpolateSample(income_df, log_upper=6.0)

In [4]:
# Convert the log10 incomes back to dollars.
sample = np.power(10, log_sample)
# Create a CDF of the sample data.
cdf = thinkstats2.Cdf(sample)

print("Mean: ", thinkstats2.Mean(sample))
print("Median: ", thinkstats2.Median(sample))
print("Skewness: ", thinkstats2.Skewness(sample))
print("Pearson Median Skewness: ", thinkstats2.PearsonMedianSkewness(sample))
print("Fraction of Households Below the Mean: ", cdf.Prob(thinkstats2.Mean(sample)))


Mean:  74278.7075311872
Median:  51226.45447894046
Skewness:  4.949920244429583
Pearson Median Skewness:  0.7361258019141782
Fraction of Households Below the Mean:  0.660005879566872


**How do the results depend on the assumed upper bound?**

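An upper bound of one million dollars isn't realistic, since some households earn more than that. Raising the assumed bound (for example, log_upper=7.0) changes only the values in the highest range, so the median stays the same, while the mean, the moment-based skewness, and the fraction of households below the mean all increase. Pearson's median skewness does not necessarily increase with them, because the standard deviation in its denominator grows as well.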
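
The effect of the assumed bound can be illustrated without the CPS file. The sketch below uses a handful of hypothetical income ranges (the bin edges and counts in `BINS` are invented for illustration, not taken from hinc06.csv) and builds a log-spaced pseudo-sample in the same spirit as InterpolateSample, then compares two choices of the upper bound.

```python
import statistics

# Hypothetical (log10 lower, log10 upper, respondent count) ranges.
# The last range is open-ended, so its upper bound is an assumption.
BINS = [(3.0, 4.0, 200), (4.0, 4.7, 500), (4.7, 5.0, 250), (5.0, None, 50)]

def pseudo_sample(bins, log_upper):
    """Fill each range with values equally spaced on a log10 scale."""
    sample = []
    for lo, hi, count in bins:
        hi = log_upper if hi is None else hi
        step = (hi - lo) / (count - 1)
        sample.extend(10 ** (lo + i * step) for i in range(count))
    return sample

def summarize(sample):
    """Mean, median, Pearson's median skewness, and fraction below the mean."""
    mean = statistics.fmean(sample)
    median = statistics.median(sample)
    std = statistics.pstdev(sample)
    below = sum(x < mean for x in sample) / len(sample)
    return {"mean": mean, "median": median,
            "pearson": 3 * (mean - median) / std, "below_mean": below}

results = {cap: summarize(pseudo_sample(BINS, cap)) for cap in (6.0, 7.0)}
for cap, stats in results.items():
    print(f"log_upper={cap}:", {k: round(v, 3) for k, v in stats.items()})
```

In this toy setup the median is identical for both bounds because it falls in a lower range, while the mean is pulled upward by the stretched top range.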