In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import scipy.stats as stats

## inference for discrete data p. 54

1000 randomly selected adults were asked how many dogs they own. Q: what is the 95% confidence interval for the average number of dogs per person in the population?

In [2]:
df = pd.DataFrame(data={'n_dogs':[0,1,2,3,4], 'n_ppl':[600,300,50,30,20]})

In [3]:
df

Unnamed: 0,n_dogs,n_ppl
0,0,600
1,1,300
2,2,50
3,3,30
4,4,20


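Use formula $\overline{y} = \frac{1}{n}\sum_{i=1}^{n}y_{i}$ where $n=1000$ and $y_i$ is the number of dogs owned by respondent $i$. With the data grouped by count, the sum over respondents equals the sum of $\textrm{n_dogs}\times\textrm{n_ppl}$ over the rows of the table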

In [4]:
sum(df['n_dogs']*df['n_ppl'])/1000

0.57

In [5]:
mean = 0.57

This is the average number of dogs per person, which agrees with the R output

Use formula $s_{y} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_{i}-\overline{y})^2}$

In [6]:
std = np.sqrt(sum(np.power((df['n_dogs']-0.57),2)*df['n_ppl'])/999)

In [7]:
std

0.8751376268141291

Also matches with the R output

Standard error is $se = \frac{s_{y}}{\sqrt{n}}$, using the sample standard deviation $s_{y}$ computed above

In [8]:
se = std/np.sqrt(1000)

In [9]:
se

0.027674281668470926

95% confidence interval is based on a t-distribution with n-1 degrees of freedom (dof)

`stats.t.ppf` seems to be the equivalent of `qt` in R

In [10]:
stats.t.ppf?

Signature: stats.t.ppf(q, *args, **kwds)
Docstring:
Percent point function (inverse of `cdf`) at q of the given RV.

Parameters
----------
q : array_like
    lower tail probability
arg1, arg2, arg3,... : array_like
    The shape parameter(s) for the distribution (see docstring of the
    instance object for more information)
loc : array_like, optional
    location parameter (default=0)
scale : array_like, optional
    scale parameter (default=1)

Returns
-------
x : array_like
    quantile corresponding to the lower tail probability q.
File:      ~/miniconda3/envs/ros/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py
Type:      method


In [11]:
int_95 = mean+stats.t.ppf([0.025, 0.975], 1000-1)*se

In [12]:
int_95

array([0.51569361, 0.62430639])

This looks correct
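As a cross-check, the same interval can be obtained by expanding the frequency table into one value per respondent and letting SciPy do the arithmetic. This is only a sketch and not part of the original calculation; the names `counts`, `y`, `lo` and `hi` are made up for illustration.

```python
import numpy as np
import scipy.stats as stats

# Frequency table from the survey: number of dogs -> number of respondents
counts = {0: 600, 1: 300, 2: 50, 3: 30, 4: 20}

# Expand to one observation per respondent (1000 values in total)
y = np.repeat(list(counts.keys()), list(counts.values()))

n = len(y)
mean = y.mean()
se = stats.sem(y)  # sample std (ddof=1) divided by sqrt(n)

# 95% interval from a t distribution with n-1 degrees of freedom
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
print(mean, se, lo, hi)
```

The printed values should agree with `mean`, `se` and `int_95` from the cells above up to floating-point rounding.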

# Exercises

## 4.1 Comparison of proportions

In [13]:
#average treatment effect
estimate = 0.5-0.4
estimate

0.09999999999999998

In [14]:
#standard error of the treatment effect, see standard error for a comparison
se_ctrl = np.sqrt(0.4*0.6/500)
se_treat = np.sqrt(0.5*0.5/500)
se = np.sqrt(np.power(se_ctrl, 2)+np.power(se_treat,2))

se

0.03130495168499706

Answer: the estimated treatment effect is 0.1$\pm$0.03. This answer is correct according to the [answer key](https://statmodeling.stat.columbia.edu/2019/06/02/question-2-of-our-applied-regression-final-exam-and-solution-to-question-1/)
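The two-step calculation (one standard error per group, then combine in quadrature) can be wrapped in a small helper. A minimal sketch, assuming two independent samples; the function name `se_diff_proportions` is hypothetical.

```python
import numpy as np

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of p1 - p2 for two independent samples."""
    se1 = np.sqrt(p1 * (1 - p1) / n1)
    se2 = np.sqrt(p2 * (1 - p2) / n2)
    return np.sqrt(se1**2 + se2**2)

# Treatment: 50% of 500, control: 40% of 500
estimate = 0.5 - 0.4
se = se_diff_proportions(0.5, 500, 0.4, 500)
print(round(estimate, 2), round(se, 3))
```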

## 4.2 Choosing sample size

Note that $se_{\textrm{tot}} = \sqrt{se_{1}^2+se_{2}^2}$, and $se = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, so $se_{\textrm{tot}} = \sqrt{\frac{\hat{p}_1\left(1-\hat{p}_1\right)}{N/2}+\frac{\hat{p}_2\left(1-\hat{p}_2\right)}{N/2}}$, if we assume that men and women each make up half of the sample of size $N$

Since we are asking about support for a candidate, the response is either 0 or 1, and we can assume $\hat{p} = 0.5$ for both men and women. This value maximizes $\hat{p}(1-\hat{p})$, so the resulting sample size is conservative. Then we have

$se_{\textrm{tot}}^2 = \frac{4*0.5^2}{N} \leq 0.05^2 \implies \frac{4*0.5^2}{0.05^2} \leq N \implies 400 \leq N$

In [15]:
4*0.5*0.5/0.05/0.05

400.0

Alternatively, write $se = \frac{\sigma}{\sqrt{N}}$. With $\hat{p}=0.5$ in both groups the effective $\sigma$ for the difference is 1, so $1/\sqrt{N} \leq 0.05 \implies N \geq 0.05^{-2} = 400$

In [16]:
np.power((1/0.05),2)

400.0
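The algebra above can be sketched as a function of the target standard error, which makes it easy to try other targets or other assumed proportions. This assumes an even split between men and women, as in the derivation; `required_sample_size` is a made-up name.

```python
def required_sample_size(target_se, p_men=0.5, p_women=0.5):
    """Total N so that the se of (p_men - p_women) equals target_se.

    Assumes each group contributes N/2 respondents.
    """
    # se_tot^2 = [p_m(1-p_m) + p_w(1-p_w)] / (N/2), solved for N
    total_var = p_men * (1 - p_men) + p_women * (1 - p_women)
    return 2 * total_var / target_se**2

n_required = required_sample_size(0.05)
print(n_required)
```

Halving the target standard error quadruples the required sample size, since $N$ scales with $1/se^2$.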

## 4.3 Comparison of Proportions

The question asks for the probability that the better shooter (40% accuracy) makes more shots than the 30% shooter when each takes 20 shots

average (mean) treatment effect is 0.1

standard error is

In [17]:
se_ctrl = np.sqrt(0.4*0.6/20)
se_treat = np.sqrt(0.3*0.7/20)
se = np.sqrt(np.power(se_ctrl, 2)+np.power(se_treat,2))

se

0.15

Under a normal approximation the difference in shooting proportions has mean 0.1 and standard error 0.15, so the z-score of an observed difference $x$ is $z=\frac{x-0.1}{0.15}$. At $x=0$, $\textrm{cdf}(z)$ is the probability of seeing a difference $x\leq0$, so the probability that the better shooter makes more shots is $1-\textrm{cdf}(z)$

In [18]:
z = (0-0.1)/(0.15)
z

-0.6666666666666667

In [19]:
1-stats.norm.cdf(z)

0.7475074624530771

So under the normal approximation there is a 75% chance of seeing a difference > 0

But the book talked about t tests, and our N is small for a normal approximation, so what does the t look like?

In [20]:
1-stats.t.cdf(-0.1/0.15, df=40)

0.7455937742565715

With a t distribution the probability is 0.746, again about 75% and very close to the normal result

In [21]:
import pymc3 as pm

In [22]:
with pm.Model() as model:
    n1 = pm.Binomial('n1',p=0.4,n=20)
    n2 = pm.Binomial('n2',p=0.3,n=20)
    diff = pm.Deterministic('diff', n1-n2)
    
    trace = pm.sample(300000)

Multiprocess sampling (3 chains in 3 jobs)
CompoundStep
>Metropolis: [n2]
>Metropolis: [n1]


Sampling 3 chains for 1_000 tune and 300_000 draw iterations (3_000 + 900_000 draws total) took 92 seconds.
The number of effective samples is smaller than 25% for some parameters.


In [23]:
burned_trace = trace[10000:]

In [24]:
burned_trace['n1']

array([ 9, 11, 11, ...,  6,  6,  6])

In [25]:
burned_trace['n2']

array([2, 9, 1, ..., 6, 6, 6])

In [26]:
burned_trace['diff'].mean()

1.9959022988505748

This is correct: we should see $0.1*20 = 2$ as the mean

In [27]:
burned_trace['diff'].std()

3.0005980997080774

In [28]:
sum(np.where(burned_trace['diff']>0,1,0))/len(burned_trace['diff'])

0.6924747126436782

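So the simulation gives 69%, somewhat below the 75% from the normal approximation. The gap comes mostly from ties: `diff > 0` excludes the draws where both players make the same number of shots, which the continuous approximation does not account for.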
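Because both shot counts are binomial, the probability can also be computed exactly without sampling. The following is a sketch using `scipy.stats.binom`, assuming the two players shoot independently; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

shots = np.arange(21)  # possible numbers of made shots: 0..20
pmf_better = stats.binom.pmf(shots, 20, 0.4)
pmf_worse = stats.binom.pmf(shots, 20, 0.3)

# joint[i, j] = P(better shooter makes i shots and worse shooter makes j shots)
joint = np.outer(pmf_better, pmf_worse)

p_more = np.tril(joint, k=-1).sum()  # entries with i > j
p_tie = np.trace(joint)              # entries with i == j
p_fewer = np.triu(joint, k=1).sum()  # entries with i < j

print(p_more, p_tie, p_fewer)
```

The three probabilities sum to 1, and `p_tie` shows how much mass sits exactly at a difference of zero.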

## 4.4 Design an Experiment

This is the inverse of the previous question. I interpret it as asking for the number of shots $N$ at which the better shooter comes out ahead with 95% probability

In [29]:
z = -1.65 #the z score for 95% one sided Normal distribution

In [30]:
sigma = np.sqrt(0.6*0.4+0.3*0.7)

N= np.power(sigma*z/(-0.1),2)

N

122.5125

We need roughly 123 shots per player
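The same calculation can be written as a function of the target probability, taking the z value from `stats.norm.ppf` instead of the rounded 1.65. A sketch under the normal approximation used above; `shots_needed` is a hypothetical name.

```python
import numpy as np
from scipy import stats

def shots_needed(p_better, p_worse, prob=0.95):
    """Shots per player so that P(observed difference > 0) is about prob."""
    z = stats.norm.ppf(prob)  # one-sided z value, about 1.645 for 0.95
    # Standard deviation of the difference for a single pair of shots
    sigma = np.sqrt(p_better * (1 - p_better) + p_worse * (1 - p_worse))
    # Solve (p_better - p_worse) / (sigma / sqrt(N)) = z for N
    return (z * sigma / (p_better - p_worse)) ** 2

n_shots = shots_needed(0.4, 0.3)
print(n_shots)
```

The result is slightly below the 122.5 above because the exact z value is a little smaller than 1.65.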

## 4.6 Hypothesis testing

In [31]:
with open('/home/jfyu/projects/ROS-Examples/Girls/girls.dat', 'r') as f:
    data = f.readlines()

In [32]:
data

['Proportion of girl births in 24 successive months in Vienna, 1908-1909, out of an average of 3900 births per month.  Data from Mises (1953). See Chapter 4 in Regression and Other Stories.\n',
 ' .4777\n',
 ' .4875\n',
 ' .4859\n',
 ' .4754\n',
 ' .4874\n',
 ' .4864\n',
 ' .4813\n',
 ' .4787\n',
 ' .4895\n',
 ' .4797\n',
 ' .4876\n',
 ' .4859\n',
 ' .4857\n',
 ' .4907\n',
 ' .5010\n',
 ' .4903\n',
 ' .4860\n',
 ' .4911\n',
 ' .4871\n',
 ' .4725\n',
 ' .4822\n',
 ' .4870\n',
 ' .4823\n',
 ' .4973']

In [33]:
data = data[1:]

In [34]:
data

[' .4777\n',
 ' .4875\n',
 ' .4859\n',
 ' .4754\n',
 ' .4874\n',
 ' .4864\n',
 ' .4813\n',
 ' .4787\n',
 ' .4895\n',
 ' .4797\n',
 ' .4876\n',
 ' .4859\n',
 ' .4857\n',
 ' .4907\n',
 ' .5010\n',
 ' .4903\n',
 ' .4860\n',
 ' .4911\n',
 ' .4871\n',
 ' .4725\n',
 ' .4822\n',
 ' .4870\n',
 ' .4823\n',
 ' .4973']
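The manual steps in this section (read the lines, drop the header, convert each entry to a float) can also be done in one call with `np.loadtxt`. The sketch below uses an in-memory stand-in for the file so that it runs anywhere; `sample_text` is hypothetical and holds only the first three values.

```python
import io
import numpy as np

# Hypothetical stand-in for girls.dat: one header line, then one value per line
sample_text = "header line describing the data\n .4777\n .4875\n .4859\n"

# skiprows=1 drops the header; surrounding whitespace is ignored
rates = np.loadtxt(io.StringIO(sample_text), skiprows=1)
print(rates)
```

With the real file, the path would be passed in place of the `io.StringIO` object.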

In [35]:
girl_birth = []
for i in data:
    girl_birth.append(float(i.strip()))  # np.float was removed in NumPy 1.24; the builtin float does the same job

In [36]:
df = pd.DataFrame({'girl_birth_rate':girl_birth})

In [37]:
df

Unnamed: 0,girl_birth_rate
0,0.4777
1,0.4875
2,0.4859
3,0.4754
4,0.4874
5,0.4864
6,0.4813
7,0.4787
8,0.4895
9,0.4797


In [38]:
std_observed = df['girl_birth_rate'].std()
std_observed

0.006409724269997214

What standard deviation would be expected if the birth proportion were constant? See p. 64

In [39]:
p_hat = sum(df['girl_birth_rate']*3900)/(3900*len(df)) #empirical rate
theoretical_std = np.sqrt(p_hat*(1-p_hat)/3900)

In [40]:
theoretical_std

0.008003121095900088

In [41]:
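n = 24
# chi-squared statistic: squared deviations scaled by the variance expected under a constant birth rate
chi_stats = sum(np.power(df['girl_birth_rate']-p_hat,2))/np.power(theoretical_std,2)

In [42]:
round(chi_stats, 1)

14.8

In [43]:
# lower-tail probability under a chi-squared distribution with n-1 = 23 degrees of freedom
round(stats.chi2.cdf(chi_stats, df=n-1), 2)

0.1

The observed variation is somewhat smaller than expected under a constant birth rate, but a lower-tail probability of about 0.1 is not enough evidence to reject the hypothesis that the month-to-month variation is due to chance alone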
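The comparison between observed and expected variation can also be made by simulation: generate many fake datasets of 24 months under a constant birth rate and see where the observed standard deviation falls. This is a sketch, not the book's code; the seed, the number of simulations, and the rounded inputs `p_hat = 0.4857` and `std_observed = 0.00641` are choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_months, births_per_month = 24, 3900
p_hat = 0.4857          # rounded mean of the 24 monthly proportions
std_observed = 0.00641  # rounded observed standard deviation

# 10,000 fake datasets of 24 monthly proportions under a constant rate
sims = rng.binomial(births_per_month, p_hat, size=(10_000, n_months)) / births_per_month
sim_stds = sims.std(axis=1, ddof=1)

# Share of simulated datasets with even less variation than observed
frac_below = np.mean(sim_stds < std_observed)
print(sim_stds.mean(), frac_below)
```

If a non-negligible share of the simulated standard deviations falls below the observed one, the data are consistent with a constant proportion.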

## 4.7 Inference from a proportion with y=0

According to p. 52, if y = 0 we use a quick correction where $\hat{p} = \frac{y+2}{n+4}$, with $n+4$ also replacing $n$ in the standard error

In [44]:
p_hat = 2/(50+4)

se = np.sqrt(p_hat*(1-p_hat)/(50+4))

In [45]:
se

0.025699580240322626

In [46]:
p_hat

0.037037037037037035

In [47]:
p_hat-2*se

-0.014362123443608217

In [48]:
p_hat+2*se

0.08843619751768228

The proportion cannot be negative, so we truncate the lower bound at 0 and get \[0, 0.088\]

This answer is correct according to the [answer key](https://statmodeling.stat.columbia.edu/2019/06/09/question-9-of-our-applied-regression-final-exam-and-solution-to-question-8/)
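The correction and the truncation at 0 can be combined in one helper. A minimal sketch of the steps above; the name `adjusted_proportion_interval` and the default `z=2` are assumptions made here.

```python
import numpy as np

def adjusted_proportion_interval(y, n, z=2):
    """Interval for a proportion using the (y+2)/(n+4) correction."""
    p_hat = (y + 2) / (n + 4)
    se = np.sqrt(p_hat * (1 - p_hat) / (n + 4))
    # A proportion must lie in [0, 1], so clip both ends
    lo = max(0.0, p_hat - z * se)
    hi = min(1.0, p_hat + z * se)
    return p_hat, se, lo, hi

p_hat, se, lo, hi = adjusted_proportion_interval(0, 50)
print(p_hat, se, lo, hi)
```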

## 4.8 Transformation of confidence or uncertainty intervals

The multiplicative effect is $y = 1.42x$, so taking logs gives $\ln{y} = \ln{1.42} + \ln{x} \implies \ln{y} - \ln{x} = \ln{1.42}$

In [50]:
np.log(1.42)

0.35065687161316933

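On the original scale the lower bound sits 1.42 - 1.02 = 0.40 below the estimate, so a rough standard error there is 0.40/2 = 0.20. The interval is symmetric on the log scale rather than the original scale, so the log-scale standard error should come from the logged bounds, $(\ln{1.42} - \ln{1.02})/2 \approx 0.17$, rather than from the log of a rescaled half-width such as $\ln{1.05}$, which is computed below only for reference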

In [51]:
np.log(1.05)

0.04879016416943205
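A direct way to check symmetry on the log scale is to log the estimate and both ends of the interval and compare the two distances. The lower bound 1.02 and the estimate 1.42 appear above; the upper bound 1.98 is an assumption taken to be the other end of the reported interval, so treat this as a sketch.

```python
import numpy as np

estimate, lower = 1.42, 1.02
upper = 1.98  # ASSUMPTION: upper end of the reported interval, not quoted in these notes

log_est, log_lo, log_hi = np.log([estimate, lower, upper])

below = log_est - log_lo  # distance from estimate down to the lower bound
above = log_hi - log_est  # distance from estimate up to the upper bound

# A 95% interval spans roughly +/- 2 standard errors
se_log = (log_hi - log_lo) / 4
print(below, above, se_log)
```

If the two distances are nearly equal, the interval is symmetric in logs, and exponentiating `log_est ± 2*se_log` recovers the original bounds.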

## 4.9 Inference for a probability

Let p be the proportion of students in the population who would get the question correct. p has an estimate of 0.6 and a standard error of sqrt(0.5^2/100) = 0.05.

Let theta be the proportion of students in the population who actually know the answer. Based on the description above, we can write:
p = theta + 0.25*(1 – theta) = 0.25 + 0.75*theta,
thus theta = (p – 0.25)/0.75.
This gives us an estimate of theta of (0.6 – 0.25)/0.75 = 0.47 and a standard error of 0.05/0.75 = 0.07, so the 95% confidence interval is [0.47 +/- 2*0.07] = [0.33, 0.61]

I took this from the [answer key](https://statmodeling.stat.columbia.edu/2019/06/03/question-3-of-our-applied-regression-final-exam-and-solution-to-question-2/)

This was a strange question for me because I wasn't sure what it wanted me to do, as if there were some kind of shortcut that I was not seeing. Had I not had the context of the chapter, I probably would have started by writing down Bayes' rule and tried to work it out as a probability, which is shown in [the first comment](https://statmodeling.stat.columbia.edu/2019/06/02/question-2-of-our-applied-regression-final-exam-and-solution-to-question-1/). But I didn't, which kinda makes me feel stupid

In [53]:
p = 0.6
se_p = np.sqrt(0.5*0.5/100) #using p = 0.5 here is standard practice when the proportion is near 50%
se_p

0.05

In [54]:
#p = theta + 0.25*(1-theta) where theta are people who knew the answer, 0.25 since pure guess
theta = (p-0.25)/0.75

In [55]:
theta

0.4666666666666666

In [57]:
0.05/0.75 # theta is a linear function of p, so the SE is scaled by the same factor 1/0.75

0.06666666666666667
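The rule used here is general: for a linear transformation $a + bx$, the estimate maps to $a + b\hat{x}$ and the standard error is multiplied by $|b|$. A small sketch, with the made-up helper name `linear_transform_interval` and an interval of plus or minus 2 standard errors:

```python
def linear_transform_interval(est, se, a, b, z=2):
    """Estimate, se and interval for a + b*x, given the estimate and se of x."""
    new_est = a + b * est
    new_se = abs(b) * se  # the shift a does not affect the spread
    return new_est, new_se, (new_est - z * new_se, new_est + z * new_se)

# theta = (p - 0.25)/0.75 = -0.25/0.75 + p/0.75
theta_hat, se_theta, (lo, hi) = linear_transform_interval(
    0.6, 0.05, a=-0.25 / 0.75, b=1 / 0.75
)
print(theta_hat, se_theta, lo, hi)
```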

## 4.10 Survey weighting

Survey A is a random sample of 1000 Americans. Survey B oversamples Latinos, with 300 randomly sampled Latinos and 700 randomly sampled non-Latinos. At a glance, survey A looks like the better representation of the overall population, and survey B looks better for accurate comparisons between the two groups.

But we were told to check for the standard errors, so let's see. Assumption is that the national population is 15% Latino, and the questions are yes/no with approximately equal proportions of each response and there's no non-response

In [58]:
# for Survey A, SE = sqrt(se1^2 + se2^2)
se1 = np.sqrt(0.5*0.5/(0.15*1000)) # about 150 Latinos expected in a random sample
se2 = np.sqrt(0.5*0.5/(0.85*1000)) # the remaining 85% are non-Latino

se_a = np.sqrt(np.power(se1,2)+np.power(se2,2))

round(se_a, 4)

0.0443

In [59]:
#for Survey B

se1 = np.sqrt(0.5*0.5/(0.3*1000))
se2 = np.sqrt(0.5*0.5/(0.7*1000))

se_b = np.sqrt(np.power(se1,2)+np.power(se2,2))

se_b

0.03450327796711771

Survey B has the smaller standard error for the difference, and hence the narrower 95% confidence interval, which supports using survey B for comparing Latino and non-Latino responses
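Both halves of the comparison can be sketched in a few lines: the standard error of the Latino vs non-Latino difference, and the standard error of the overall population estimate. The second part assumes survey B is reweighted to the 15%/85% population shares, which goes slightly beyond the calculation above; all names are illustrative.

```python
import numpy as np

def se_group_difference(n_latino, n_other, p=0.5):
    """se of the difference in yes-proportions between the two groups."""
    return np.sqrt(p * (1 - p) / n_latino + p * (1 - p) / n_other)

# Survey A: about 150 Latinos and 850 non-Latinos in a random sample of 1000
se_a = se_group_difference(150, 850)
# Survey B: 300 Latinos and 700 non-Latinos by design
se_b = se_group_difference(300, 700)

# Overall population estimate, with p = 0.5 so each response has variance 0.25
se_overall_a = np.sqrt(0.25 / 1000)
# ASSUMPTION: survey B is weighted back to 15% Latino / 85% non-Latino
se_overall_b = np.sqrt(0.15**2 * 0.25 / 300 + 0.85**2 * 0.25 / 700)

print(se_a, se_b, se_overall_a, se_overall_b)
```

Under these assumptions survey B is better for the group comparison, while survey A is slightly better for the overall estimate.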