#  Group Lab 6 

## Random Variables & Confidence Interval Properties

In [57]:
import pandas as pd 
import numpy as np 
import math
from scipy.stats import t


<hr>

## <u>Case Study 1</u>: Properties of the Binomial Random Variable



### 1. Calculating $E(X)$

Let $X$ represent the Binomial random variable with $4$ trials and probability of success $p$. $X$ is the sum of 4 independent Bernoulli random variables: $Y_1$, $Y_2$, $Y_3$, and $Y_4$.

Using the properties of expected values from Day 12's lecture material, derive the $E(X)$.  



By linearity of expectation, and because each Bernoulli trial has $E(Y_i) = p$:

$E(X) = E(Y_1 + Y_2 + Y_3 + Y_4) = E(Y_1) + E(Y_2) + E(Y_3) + E(Y_4) = p + p + p + p = 4p$

### 2. Calculating $Var(X)$

Derive the $Var(X)$.  


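Because the four trials are independent, the variance of the sum is the sum of the variances, and each Bernoulli trial has $Var(Y_i) = p(1-p)$:

$Var(X) = Var(Y_1 + Y_2 + Y_3 + Y_4) = Var(Y_1) + Var(Y_2) + Var(Y_3) + Var(Y_4) = 4p(1-p)$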
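As a quick numerical check of the two derivations, the sketch below enumerates the Binomial(4, p) probability mass function directly and compares the resulting mean and variance with $4p$ and $4p(1-p)$. The helper name `binomial_moments` and the trial values of `p` are illustrative assumptions, not part of the lab.

```python
import math

def binomial_moments(n, p):
    """Return (mean, variance) of Binomial(n, p) computed from its pmf."""
    pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * prob for k, prob in enumerate(pmf))
    variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))
    return mean, variance

# Hypothetical values of p chosen only for illustration.
for p in (0.1, 0.5, 0.59):
    mean, variance = binomial_moments(4, p)
    # Both comparisons should hold up to floating-point rounding.
    assert math.isclose(mean, 4 * p)
    assert math.isclose(variance, 4 * p * (1 - p))
```

If the assertions pass silently, the pmf-based moments agree with the closed-form results for those values of `p`.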

### 3.  Distribution Properties for a Specific $X$

Now, let's apply what you've demonstrated in **1** and **2** above to a situation.

In a previous year (2020), the acceptance rate at UIUC was 59%.  Suppose we conduct an experiment where we randomly select 4 UIUC applicants from the previous year with replacement and observe whether each applicant was accepted into UIUC or not.

Let $X$ represent the number of applicants who were accepted into UIUC from our random sample of size 4.

What distribution does $X$ follow?  Be sure to define this distribution completely. 

Based on the formulas calculated in **1** and **2**, calculate and report the theoretical mean, variance, and standard deviation of $X$.

**We're looking for actual numbers here and not a proof or algebraic statements.**

$X$ follows a binomial distribution with $n = 4$ trials and success probability $p = 0.59$:

$X \sim Binomial(4, 0.59)$

In [11]:
n = 4
p = 0.59
Mean = n * p 
Variance = n * p * (1-p)
Standard_deviation = math.sqrt(Variance)
print('Mean:' , Mean, 'Variance:', Variance, 'Standard_deviation:', Standard_deviation)

Mean: 2.36 Variance: 0.9676 Standard_deviation: 0.983666610188635
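The theoretical values can also be approximated by simulation. The sketch below repeats the "sample 4 applicants with replacement" experiment many times using only the standard library; the seed and the number of repetitions are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

n, p = 4, 0.59
num_experiments = 100_000  # hypothetical number of repetitions

# Each experiment counts how many of the 4 sampled applicants were accepted.
counts = [sum(random.random() < p for _ in range(n)) for _ in range(num_experiments)]

sim_mean = statistics.fmean(counts)
sim_variance = statistics.pvariance(counts)
print(sim_mean, sim_variance)  # should land near 2.36 and 0.9676
```

The simulated values will not match the theoretical ones exactly, but they should get closer as `num_experiments` grows.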


<hr>

## <u>Case Study 2</u>: Death Rates from Firearms

The death rates for US states (calculated per 100,000 individuals in the population) from firearms are recorded in the firearms.csv dataset.

### 4. Population Information

For the population of all US state-years, calculate the mean death rate from firearms.

In [14]:
firearms = pd.read_csv('firearms.csv')
firearms.head()
firearms['RATE'].mean()

12.58257142857143

In [54]:
firearms.shape

(350, 5)

### 5. Creating a Confidence Interval

While we technically have the population of all US states (and have therefore already calculated the population mean death rate from firearms), we would still like to calculate confidence intervals for the population mean to learn more about how confidence intervals behave.  By having the population, we can then "check" our confidence intervals to see how well they perform.

Perform the following tasks:

- Generate a random sample of size 40 without replacement from the population of all US states
- Calculate the sample mean and sample standard deviation for this sample
- Calculate an 85% confidence interval around the sample mean

For this process, assume that we do not have any of the corresponding population characteristics (mean or standard deviation) available.

Print the resulting confidence interval.

In [16]:
sample = firearms.sample(40, replace= False, random_state=123)
sample

| Unnamed: 0 | YEAR | STATE | RATE | DEATHS | URL |
|---|---|---|---|---|---|
| 142 | 2017 | TX | 12.4 | 3513 | /nchs/pressroom/states/texas/texas.htm |
| 75 | 2018 | MT | 17.3 | 186 | /nchs/pressroom/states/montana/mt.htm |
| 48 | 2019 | WI | 10.0 | 604 | /nchs/pressroom/states/wisconsin/wi.htm |
| 31 | 2019 | NY | 3.9 | 804 | /nchs/pressroom/states/newyork/ny.htm |
| 201 | 2015 | AK | 23.4 | 177 | /nchs/pressroom/states/alaska/alaska.htm |
| 152 | 2016 | AZ | 15.2 | 1094 | /nchs/pressroom/states/arizona/arizona.htm |
| 330 | 2005 | NM | 13.9 | 267 | /nchs/pressroom/states/newmexico/newmexico.htm |
| 157 | 2016 | DE | 11.0 | 111 | /nchs/pressroom/states/delaware/delaware.htm |
| 239 | 2015 | SC | 17.3 | 850 | /nchs/pressroom/states/southcarolina/southcaro... |
| 300 | 2005 | AL | 16.0 | 736 | /nchs/pressroom/states/alabama/alabama.htm |


In [24]:
sample_mean = sample['RATE'].mean()
sample_std = sample['RATE'].std()
print(sample_mean) 
print(sample_std)

12.004999999999999
4.813812924470048


In [58]:
# t multiplier for 85% confidence: the 92.5th percentile of t with n - 1 = 39 degrees of freedom
multiplier = t.ppf(1 - (1 - 0.85) / 2, df=39)
multiplier

1.4684578003511024

In [59]:
standard_error = sample_std / math.sqrt(40)
lower_bound = sample_mean - multiplier * standard_error
upper_bound = sample_mean + multiplier * standard_error
print("The 85% confidence interval is (", lower_bound, ",", upper_bound, ")")

The 85% confidence interval is ( 10.887311754687504 , 13.122688245312494 )
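The interval arithmetic can be reproduced from the summary numbers alone. The sketch below wraps the formula (sample mean plus or minus multiplier times $s / \sqrt{n}$) in a hypothetical helper named `t_interval`; the inputs are copied from the printed outputs above, and the helper itself is an illustration rather than part of the lab.

```python
import math

def t_interval(sample_mean, sample_std, n, multiplier):
    """Return (lower, upper) for sample_mean +/- multiplier * sample_std / sqrt(n)."""
    standard_error = sample_std / math.sqrt(n)
    margin = multiplier * standard_error
    return sample_mean - margin, sample_mean + margin

# Summary values copied from the notebook outputs above.
lower, upper = t_interval(12.004999999999999, 4.813812924470048, 40, 1.4684578003511024)
print(round(lower, 3), round(upper, 3))  # 10.887 13.123
```

Separating the formula from the data makes it easier to reuse the same calculation for every sample in a simulation.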


### 6. Create a Function to Record our Parameter

Write a function that does the following:

- <u>Input</u> Your function should take as inputs
    - a lower bound of a confidence interval
    - an upper bound of a confidence interval
    - a population parameter value
- <u>Performance</u> Your function should 
    - Check whether $lower \: bound \le \mu \le upper \: bound$
    - Return True if the parameter is contained by the confidence interval
    - Return False if the parameter is not contained by the confidence interval

Test your function using your confidence interval and parameter value from above.

In [37]:
def confidence(lower_bound, upper_bound, population):
    # Inclusive check, as the prompt requires: lower_bound <= population <= upper_bound
    return lower_bound <= population <= upper_bound
    

In [40]:
population = firearms['RATE'].mean()
print(population)
confidence(lower_bound, upper_bound, population)

12.58257142857143


True

The population mean (12.583) lies within my 85% confidence interval (10.887, 13.123), so the function returns True. 
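Since the prompt asks for an inclusive check ($lower \: bound \le \mu \le upper \: bound$), the endpoints deserve a test of their own. The sketch below uses a hypothetical helper named `contains` and made-up bounds purely to exercise those edge cases.

```python
def contains(lower_bound, upper_bound, parameter):
    """Return True when lower_bound <= parameter <= upper_bound."""
    return lower_bound <= parameter <= upper_bound

# Made-up bounds used only to exercise the edge cases.
print(contains(10.0, 13.0, 12.5))  # True: interior point
print(contains(10.0, 13.0, 10.0))  # True: exactly on the lower bound
print(contains(10.0, 13.0, 13.5))  # False: above the upper bound
```

With continuous data an exact tie at an endpoint is very unlikely, so the choice between strict and inclusive comparisons rarely changes a simulation result, but the inclusive form matches the stated definition.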

### 7. Creating Repeated Confidence Intervals


- Generate 1000 random samples (each of size 40 without replacement) from the population of all US states
- Calculate the 85% confidence interval for each random sample
- Record whether the confidence interval contains the population parameter

In [60]:
list = []

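# The multiplier depends only on the confidence level and n - 1 = 39 df, so compute it once.
multiplier = t.ppf((1 - ((1- 0.85 ) / 2)), 39)

for i in range(1000):
    sample = firearms.sample(40, replace= False)   # new random sample each iteration
    x = sample['RATE'].mean()                      # sample mean
    sigma = sample['RATE'].std()                   # sample standard deviation
    lbd = x - multiplier * (sigma / math.sqrt(40))
    ubd = x + multiplier * (sigma / math.sqrt(40))
    record = confidence(lbd,ubd,population)        # True if the interval contains the population mean
    list.append(record)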

list

[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
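The same repeated-sampling procedure can be sketched without pandas or the data file. The version below is only an illustration under stated assumptions: the population is a synthetic stand-in of 350 normally distributed values (not the firearms data), the seed is arbitrary, and the multiplier is the 85% value with 39 degrees of freedom printed earlier.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

# Synthetic stand-in population of 350 values (NOT the firearms data).
population = [random.gauss(12.5, 5.0) for _ in range(350)]
true_mean = statistics.fmean(population)

MULTIPLIER = 1.4684578003511024  # t multiplier for 85% confidence, 39 df
SAMPLE_SIZE = 40
NUM_SAMPLES = 1000

hits = 0
for _ in range(NUM_SAMPLES):
    sample = random.sample(population, SAMPLE_SIZE)  # sampling without replacement
    center = statistics.fmean(sample)
    margin = MULTIPLIER * statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    if center - margin <= true_mean <= center + margin:
        hits += 1

coverage = hits / NUM_SAMPLES
print(coverage)
```

Counting hits directly avoids storing and printing 1000 booleans; only the final proportion is needed for the next question.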

### 8. Calculating the Proportion of Confidence Intervals that Contain Our Parameter

From the output of the 1000 random samples, calculate the proportion of confidence intervals that contain our parameter.

In [61]:
output = pd.DataFrame({'record':list})
print(output['record'].mean())

0.885


The proportion of confidence intervals that contain our parameter (12.583) is 88.5%. 
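One way to judge whether 88.5% is meaningfully different from the nominal 85% is to look at the simulation's own sampling error. The sketch below treats the 1000 intervals as independent trials and computes the standard error of the observed proportion; this is a rough check under that assumption, not a formal test.

```python
import math

observed = 0.885      # proportion reported above
nominal = 0.85        # stated confidence level
num_intervals = 1000  # number of simulated intervals

# Standard error of a proportion estimated from 1000 independent intervals.
simulation_se = math.sqrt(observed * (1 - observed) / num_intervals)

# How many standard errors the observed coverage sits above the nominal level.
gap_in_se = (observed - nominal) / simulation_se
print(round(simulation_se, 4), round(gap_in_se, 2))
```

A gap of several standard errors suggests the difference is unlikely to be simulation noise alone.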

### 9.  Confidence Interval Assumptions

The assumptions for generating a confidence interval are not met for this situation.  Which condition(s) is/are not met?  Explain.

What are the implications of the assumptions not being met?

First, our sample meets the normality condition, which is needed for the multiplier to be valid: the sample size is 40, which is larger than 30.

However, this situation does not meet the independence condition, which is needed for the standard error to be valid. 

The sample is random, but it was drawn without replacement, and the sample size of 40 is larger than 10% of the population size (350 * 10% = 35).

When we sample without replacement, the sample should be small relative to the population so that the selection probabilities stay close to those of sampling with replacement. Because that is not the case here, $s / \sqrt{n}$ tends to overstate the true standard error, the intervals come out wider than necessary, and the actual coverage need not match the nominal 85% (in our simulation it was 88.5%).
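A common adjustment for sampling without replacement from a small population is the finite population correction factor $\sqrt{(N - n)/(N - 1)}$, which shrinks the standard error. The sketch below applies it to the numbers from this lab; the lab does not ask for this correction, so treat it as an illustration of how large the effect is rather than as the intended method.

```python
import math

N = 350  # population size (rows in firearms.csv)
n = 40   # sample size

# Finite population correction factor for sampling without replacement.
fpc = math.sqrt((N - n) / (N - 1))

# Standard error from Question 5, without and with the correction.
sample_std = 4.813812924470048
naive_se = sample_std / math.sqrt(n)
corrected_se = naive_se * fpc
print(round(fpc, 3), round(naive_se, 3), round(corrected_se, 3))
```

Because the factor is below 1, the corrected standard error is smaller, which is consistent with the uncorrected intervals covering the parameter more often than 85% of the time.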