# Random Variables and Defined Distributions

## Package Imports

Run the cell provided below to import packages needed for this assignment.

You may also need to import additional packages below.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import geom
from scipy.stats import norm
from scipy.stats import expon

<hr>

## <u>Case Study 1</u>: Slot Machine Gambling 

Suppose a gambler has a strategy to keep playing a slot machine until they win a round. After the gambler **wins** a round, they stop playing for the day. The probability of **winning** any given slot machine round is 0.02.

**a)**  On average, how many rounds of the slot machine will the gambler play before stopping for the day? 

Let X denote the number of rounds the gambler plays until they win, including the winning round.

X ~ Geom(0.02)

E(X) = 1/p

In [2]:
p = 0.02
print(1/p)

# The gambler will play 50 rounds on average (including the winning round) before stopping for the day.

50.0
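
As a cross-check on the formula E(X) = 1/p, the sketch below compares `geom.mean` from `scipy.stats` with a simulated average. The seed and the number of simulated days are arbitrary assumptions, not part of the assignment.

```python
import numpy as np
from scipy.stats import geom

p = 0.02  # probability of winning a single round

# Theoretical mean of the geometric distribution
# (number of rounds up to and including the first win)
theoretical_mean = geom.mean(p)

# Simulated check: draw many days of play and average the rounds played
rng = np.random.default_rng(0)  # seed chosen arbitrarily
simulated_mean = rng.geometric(p, size=200_000).mean()

print(theoretical_mean, simulated_mean)
```

The simulated value should land close to 50, with small run-to-run variation if the seed is changed.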


**b)** Suppose that it costs $\$1$ to play each round of the slot machine. The gambler always brings $\$100$ with them to play the slot machine each day. If they run out of this $\$100$ (i.e. they play the game more than 100 times), then they borrow money from their friend and keep playing the slot machine until they win. On what percent of days does the gambler need to borrow money from this friend?

In [3]:
# P(X>100)
prob = 1 - geom.cdf(100, p)
prob

0.13261955589475316

The gambler needs to borrow money from their friend on about 13.3% of days.
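
The same tail probability can be checked two other ways: with the survival function `geom.sf`, and with the closed form (1 - p)^100, since needing more than 100 rounds means losing the first 100 rounds. A minimal sketch:

```python
from scipy.stats import geom

p = 0.02  # probability of winning a single round

# P(X > 100) computed three equivalent ways
prob_cdf = 1 - geom.cdf(100, p)   # complement of the CDF
prob_sf = geom.sf(100, p)         # survival function, P(X > 100) directly
prob_closed = (1 - p) ** 100      # the first 100 rounds are all losses

print(prob_cdf, prob_sf, prob_closed)
```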

<hr>

## <u>Case Study 2</u>: SAT Score Scholarship Analysis 

Suppose the distribution of SAT scores for the seniors at a local high school (just math and verbal) has an average of 1000, a standard deviation of 100, and follows a normal distribution. 

**a)**  What is the probability that a randomly selected senior from the high school scored more than 1250?

In [4]:
1 - norm.cdf(1250, 1000, 100)
# About 0.6% of seniors at the high school scored more than 1250, so the probability is about 0.006.

0.006209665325776159

**b)** What percent of seniors at the school score between a 900 and 950?

In [5]:
norm.cdf(950, 1000, 100) - norm.cdf(900, 1000, 100)
# about 15% of seniors at the school score between a 900 and 950

0.1498822847945298

**c)** Suppose that all seniors at the school who score higher than a 1200 get a scholarship from the county. Suppose that we randomly select a student that we KNOW got the scholarship. GIVEN that we know that the student has the scholarship, what is the probability that this student's SAT score was less than 1300?

**Hint:** You might want to consider some of our basic probability rules.

In [6]:
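# P(A): the student got the scholarship, i.e. score > 1200
p_A = 1 - norm.cdf(1200, 1000, 100)
# P(B and A): the score is above 1200 and below 1300
p_B = norm.cdf(1300, 1000, 100) - norm.cdf(1200, 1000, 100)
# Conditional probability P(score < 1300 | score > 1200)
p = p_B / p_A
p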

0.9406641669285728
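
The calculation uses the conditional probability rule P(B | A) = P(A and B) / P(A). As a sanity check, the sketch below also computes the same quantity through the complement, 1 - P(X ≥ 1300) / P(X > 1200). The variable names are illustrative only.

```python
from scipy.stats import norm

mu, sigma = 1000, 100  # SAT mean and standard deviation from the problem

# P(A): score above the scholarship cutoff
p_scholarship = norm.sf(1200, mu, sigma)

# P(A and B): score between 1200 and 1300
p_both = norm.cdf(1300, mu, sigma) - norm.cdf(1200, mu, sigma)

# Conditional probability, computed two equivalent ways
p_conditional = p_both / p_scholarship
p_via_complement = 1 - norm.sf(1300, mu, sigma) / p_scholarship

print(p_conditional, p_via_complement)
```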

<hr>

## <u>Case Study 3</u>: Farm Workers Salaries 

Suppose the standard deviation of the hourly wage of a farm worker in Illinois is 4 dollars per hour and the distribution of hourly farm worker wages follows the normal distribution. Suppose that we know that 25% of farm workers in Illinois make at least 15 dollars per hour.

What is the average hourly wage of farm workers in Illinois?

In [7]:
# P(X >= 15) = 0.25, so 15 is the 75th percentile: 15 = mean + z_0.75 * std
std = 4
15 - norm.ppf(1 - 0.25) * std

12.302040999215674
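
One way to verify this answer is to plug the computed mean back into the normal distribution and confirm that 25% of wages fall at or above 15 dollars per hour. A minimal sketch:

```python
from scipy.stats import norm

sigma = 4  # standard deviation of hourly wages, from the problem

# Solve 15 = mu + z_0.75 * sigma for mu
mu = 15 - norm.ppf(0.75) * sigma

# Check: with this mean, P(X >= 15) should be 0.25
check = norm.sf(15, loc=mu, scale=sigma)

print(mu, check)
```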

<hr>

## <u>Case Study 4</u>: Wheelchair Basketball 
Suppose that you are a manager for a wheelchair basketball team with 12 players on the roster.  The length of time that an athletic wheelchair can function before needing to be serviced follows an exponential distribution (accessible within Python through the `expon` set of methods from `scipy.stats`).  For this exponential distribution, you should set the scale parameter to 7 weeks.

**a)** Using Python, generate a set of 12 randomly selected draws from the exponential distribution defined above.  You can think of each draw as the amount of time before a wheelchair needs to be serviced for a single player on the roster.

*Hint*: I recommend using a random_state here, so that you can recreate your draws, if needed.

In [9]:
# X: length of time (in weeks) that a wheelchair functions before needing service
# X ~ Exponential(scale = 7 weeks)
rv = expon.rvs(scale = 7, size = 12, random_state = 123)
rv

array([ 8.345905  ,  2.35947238,  1.80098856,  5.61003577,  8.89749655,
        3.85068275, 27.65687448,  8.08249592,  4.59004135,  3.48441592,
        2.94239575,  9.14073928])

**b)**  Calculate the minimum, mean, median, standard deviation, and estimated probability of a wheelchair not needing to be serviced for the 16 weeks of the semester based on this particular set of draws from the exponential distribution, i.e. the time until a wheelchair needs to be serviced for each of the 12 players on the roster.

*Hint*: You may want to use the tools demonstrated in the tutorial below to help perform these calculations.

**Tutorial for working with arrays.**

We don't always have objects in Python that allow us to use our typical methods for calculating summary values, including means, medians, and standard deviations.  If you find that your first attempt at a calculation doesn't work, try adjusting your code to the following format.

In [10]:
x = np.array([1, 2, 3, 4, 5])
# x.median() # doesn't work -- try it out by removing the first "#" to see the error message.
np.median(x)

3.0

In [13]:
minimum = np.min(rv)
mean = np.mean(rv)
median = np.median(rv)
standard_deviation = np.std(rv)
print ('minimum:', minimum, ', mean:', mean, ', median:', median, ', standard deviation:', standard_deviation)

minimum: 1.8009885608193554 , mean: 7.230128642799365 , median: 5.100038563289445 , standard deviation: 6.660100531989132
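
Note that `np.std` divides by n by default (`ddof=0`). If the 12 draws are treated as a sample rather than a full population, the sample standard deviation with `ddof=1` may be the more appropriate summary. The sketch below shows both, assuming the same draws as in part a.

```python
import numpy as np
from scipy.stats import expon

# Same draws as in part a (assumes the same scale, size, and random_state)
rv = expon.rvs(scale=7, size=12, random_state=123)

pop_std = np.std(rv)             # divides by n (ddof=0, NumPy's default)
sample_std = np.std(rv, ddof=1)  # divides by n - 1 (sample standard deviation)

print('ddof=0:', pop_std, ', ddof=1:', sample_std)
```

With only 12 observations the two versions differ by a factor of sqrt(12/11), which is roughly 4%.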


In [27]:
# estimated probability of a wheelchair not needing to be serviced for the 16 weeks of the semester
probability = np.mean(rv > 16)
probability

0.08333333333333333

**c)** Based on your random sample, what is the first time that a player would need their wheelchair serviced?

How many players would need their wheelchair serviced during the semester?

In [19]:
print(minimum)
print(sum(rv < 16))

1.8009885608193554
11


Based on my random sample, the first time that a player would need their wheelchair serviced is about 1.8 weeks into the semester.

11 of the 12 sampled players would need their wheelchair serviced during the semester.

**d)** For the theoretical exponential distribution with a scale of 7, calculate and report the mean, median, standard deviation, and probability of a wheelchair not needing to be serviced for the 16 weeks of the semester.

*Hint*: The exponential distribution is a continuous random variable and has many of the same functions as other continuous random variables we have discussed in class.  You can also find documentation for this distribution online.

In [21]:
th_mean = expon.mean(scale = 7)
th_median = expon.median(scale = 7)
th_std = expon.std(scale = 7)
print ('mean : ', th_mean, ', median : ', th_median, ', standard deviation : ', th_std)

mean :  7.0 , median :  4.852030263919617 , standard deviation :  7.0


In [22]:
prob_not_serviced = 1 - expon.cdf(x = 16, scale = 7)
prob_not_serviced

0.10170139230422681
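
These theoretical values also have closed forms for an exponential distribution with scale θ: the mean and standard deviation both equal θ, the median is θ·ln(2), and P(X > t) = exp(-t/θ). The sketch below compares the closed forms against `scipy.stats.expon`.

```python
import numpy as np
from scipy.stats import expon

scale = 7  # theta, in weeks

# Closed-form values for the exponential distribution
median_closed = scale * np.log(2)      # median = theta * ln(2)
survival_closed = np.exp(-16 / scale)  # P(X > 16) = exp(-16 / theta)

# Library values for comparison
median_scipy = expon.median(scale=scale)
survival_scipy = expon.sf(16, scale=scale)

print(median_closed, median_scipy)
print(survival_closed, survival_scipy)
```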

**e)** How close are the mean, median, standard deviation, and probability values between **4b** and **4d**?

The mean of my sample (7.23) is slightly higher than the theoretical mean (7.0).

The median of my sample (5.10) is slightly higher than the theoretical median (4.852).

The standard deviation of my sample (6.66) is slightly lower than the theoretical standard deviation (7.0).

The estimated probability from my sample (0.083) is slightly lower than the theoretical probability (0.1017).

<hr>

## <u>Case Study 5</u>: Central Limit Theorem 

Consider the population of counties provided in the county.csv file.  We will examine the **per_capita_income** variable from this data.

**a)** Read in the data.

**Note that you will need to clean the data before you can perform your calculations.  The phrase 'data unavailable' represents missing data in this csv file.  You may assume that any counties with missing data are not included in our population for this calculation.**

In [32]:
county = pd.read_csv('county.csv')
county.dtypes       # pop2000, pop2017, pop_change, poverty, unemployment_rate, per_capita_income, median_hh_value should be numerical values

name                  object
state                 object
pop2000               object
pop2010                int64
pop2017               object
pop_change            object
poverty               object
homeownership        float64
multi_unit           float64
unemployment_rate     object
metro                 object
median_edu            object
per_capita_income     object
median_hh_income      object
smoking_ban           object
dtype: object

In [38]:
county = pd.read_csv('county.csv', na_values= 'data unavailable')
county.shape[0]     # 3,142 rows before cleaning
county = county.dropna()     # drops rows with a missing value in any column
county.shape[0]     # 2,559 rows after cleaning

2559

**b)**  From the information about the population of all counties in the US, calculate the *theoretical* standard error of the sampling distribution for the sample mean (of 50 counties) per capita income (**per_capita_income**).

In [41]:
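# Note: pandas .std() defaults to ddof=1; with 2,559 counties this is very close to the population value (ddof=0)
county_std = county['per_capita_income'].std()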
county_std

6208.763069260853

In [42]:
import math
theo_se = county_std / math.sqrt(50)
theo_se

878.0516938109902
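
The formula σ/√n can be checked by simulation. The sketch below does not use county.csv: it builds a hypothetical, synthetic population (a lognormal stand-in for incomes, with arbitrary parameters and seed) and compares the theoretical standard error with the standard deviation of many simulated sample means.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily

# Hypothetical stand-in for per_capita_income (NOT the real county data)
population = rng.lognormal(mean=10.0, sigma=0.25, size=2559)

n = 50
# Theoretical standard error: population SD (ddof=0) divided by sqrt(n)
theoretical_se = population.std() / np.sqrt(n)

# Draw 20,000 samples of size 50 with replacement and record each sample mean
sample_means = rng.choice(population, size=(20_000, n), replace=True).mean(axis=1)
empirical_se = sample_means.std()

print(theoretical_se, empirical_se)
```

Sampling with replacement is used here so that σ/√n applies exactly; sampling without replacement from a finite population would give a slightly smaller standard error.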

**c)** Are the conditions for the Central Limit Theorem met?  That is, will the Central Limit Theorem apply to the *theoretical* standard error of the sampling distribution for the sample mean (of 50 counties) per capita income?

*Note:* You may make reasonable assumptions when needed.

The sample size is 50, which exceeds the common rule-of-thumb threshold of 30, so the sample-size condition for the Central Limit Theorem is met.

Also, assuming the counties are sampled at random, the observations can be treated as independent because the sample size (50) is less than 10% of the population size (10% of 2,559 is 255.9).
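
To illustrate what these conditions buy us, the sketch below again uses a hypothetical right-skewed synthetic population (not the real county data; the distribution, parameters, and seed are assumptions). It checks the 10% condition and compares the skewness of the population with the skewness of simulated sample means for n = 50.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)  # seed chosen arbitrarily

# Hypothetical right-skewed population (NOT the real county data)
population = rng.lognormal(mean=10.0, sigma=0.5, size=2559)

n = 50
# 10% condition: the sample should be less than 10% of the population
sampling_fraction = n / population.size

# Sample means from 5,000 random samples drawn without replacement
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(5_000)
])

pop_skew = skew(population)
means_skew = skew(sample_means)

print(sampling_fraction, pop_skew, means_skew)
```

If the Central Limit Theorem applies, the sample means should be much closer to symmetric than the population itself, which shows up as a skewness nearer to zero.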