## Python statistics essential training - 04_02_confidenceintervals

Standard imports

In [1]:
import math
import io

In [2]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as pp

%matplotlib inline

In [3]:
import scipy.stats
import scipy.optimize
import scipy.spatial

Example scenario: we want to estimate the popular vote in an election between two candidates, Brown and Green. To do so, you call 1,000 voters and ask each for their voting intention. The file `poll.csv` holds your findings.

In [4]:
poll = pd.read_csv('poll.csv')
poll.info()
poll.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   vote    1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


Unnamed: 0,vote
0,Brown
1,Green
2,Brown
3,Brown
4,Brown


In [5]:
poll.vote.value_counts(normalize=True)

Brown    0.511
Green    0.489
Name: vote, dtype: float64

Because the sample is limited, the proportions you observe depend on the specific people that you happen to draw. This is known as ***sampling variability***.

So given this poll, what can you really say about the underlying population of voters? To answer this, we need to study the sampling distribution of the proportion: the range of proportions that different samples drawn from the same population may produce. We'll approximate it by simulation on a computer.

In [6]:
# Simulate a vector of 5 random numbers drawn uniformly from [0, 1)
np.random.rand(5)

array([0.24412751, 0.35220184, 0.3121508 , 0.68405954, 0.68099432])

In [7]:
np.random.rand(5) < 0.51

array([ True, False,  True,  True, False])
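
The comparison above works because a uniform draw on [0, 1) falls below 0.51 with probability 0.51, so each `True` stands for one Brown voter. A minimal sketch of an empirical check follows; the seed, the sample size of one million, and the names `draws`, `is_brown`, and `frac` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Seeded generator so the check is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(42)

# One million uniform draws on [0, 1)
draws = rng.random(1_000_000)

# True where the draw falls below the assumed Brown fraction of 0.51
is_brown = draws < 0.51

# The mean of a boolean array is the fraction of True values
frac = is_brown.mean()
print(frac)
```

With this many draws, the printed fraction should land very close to 0.51, while a vector of only 5 draws can easily be far from it.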

We can combine these steps and use the NumPy function `np.where` to convert each boolean into a candidate name.

In [8]:
np.where(np.random.rand(5) < 0.51,'Brown','Green')

array(['Green', 'Green', 'Green', 'Green', 'Green'], dtype='<U5')
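
Because the draws are random, each run of the cell above gives a different result (here, five Greens in a row). If repeatable output is wanted, one option is a seeded generator. This is a sketch only: `draw_votes` is a hypothetical helper, the seed is arbitrary, and it uses `np.random.default_rng` in place of the notebook's `np.random.rand`.

```python
import numpy as np

def draw_votes(seed, n=5, brown=0.51):
    """Hypothetical helper: n simulated votes from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    # Label a draw 'Brown' when it falls below the Brown fraction, else 'Green'
    return np.where(rng.random(n) < brown, 'Brown', 'Green')

first = draw_votes(seed=123)
second = draw_votes(seed=123)

print(first)
# Same seed, same sequence of votes
print(np.array_equal(first, second))
```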

We can wrap everything into a DataFrame and make a function out of it.

In [12]:
def sample(brown, n=1000):
    # Each of the n voters picks Brown with probability `brown`, otherwise Green
    return pd.DataFrame({'vote': np.where(np.random.rand(n) < brown, 'Brown', 'Green')})

In [13]:
s = sample(0.51)

In [14]:
s

Unnamed: 0,vote
0,Brown
1,Brown
2,Green
3,Brown
4,Brown
...,...
995,Green
996,Brown
997,Brown
998,Brown


In [16]:
# Get the vote shares for the two candidates; normalize=True returns fractions of the total rather than raw counts
s.vote.value_counts(normalize=True)

Brown    0.518
Green    0.482
Name: vote, dtype: float64
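
This simulated poll gives Brown 0.518 even though the underlying fraction was set to 0.51, which is sampling variability at work. To approximate the sampling distribution, the simulation can be repeated many times, recording Brown's share each time. The sketch below rests on a few assumptions: the `rng` argument added to `sample` is a hypothetical variation for repeatability, and the 1,000 repetitions, the seed, and the name `brown_share` are illustrative choices.

```python
import numpy as np
import pandas as pd

def sample(brown, n=1000, rng=None):
    """Simulate one poll of n voters where a fraction `brown` favors Brown."""
    rng = np.random.default_rng() if rng is None else rng
    return pd.DataFrame({'vote': np.where(rng.random(n) < brown, 'Brown', 'Green')})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Repeat the simulated poll 1,000 times and keep Brown's share from each one
brown_share = pd.Series(
    [(sample(0.51, rng=rng).vote == 'Brown').mean() for _ in range(1000)],
    name='brown_share',
)

# Summary of how much the share varies from one simulated poll to the next
print(brown_share.describe())
```

The spread of `brown_share` shows how far a single 1,000-voter poll can plausibly land from the true fraction.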