# Computer Book B, Activity 18

Calculating large-sample confidence intervals for a **difference between proportions**.

## Package import

In [1]:
import pandas as pd
from scipy.stats import norm, bernoulli
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt

## Data import

In [2]:
df = pd.read_csv("../data/snoring.csv")

# preview the data
df

|   | Column_1-T | Never | Occasionally | Often | Always |
|---|---|---|---|---|---|
| 0 | Heart disease | 24 | 35 | 21 | 30 |
| 1 | No heart disease | 1355 | 603 | 192 | 224 |


In [3]:
df.dtypes

Column_1-T      object
Never            int64
Occasionally     int64
Often            int64
Always           int64
dtype: object

## Data transformation

The `DataFrame` is not particularly useful in its current format, where each row holds the counts for one sample group.
Let us split it into one `DataFrame` per group.

In [4]:
# declare new DataFrames, one for each sample group
disease = df.query("`Column_1-T` == 'Heart disease'").copy(deep=False)
no_disease = df.query("`Column_1-T` == 'No heart disease'").copy(deep=False)

# drop label columns
disease.drop(columns="Column_1-T", inplace=True)
no_disease.drop(columns="Column_1-T", inplace=True)

In [5]:
disease

|   | Never | Occasionally | Often | Always |
|---|---|---|---|---|
| 0 | 24 | 35 | 21 | 30 |


In [6]:
no_disease

|   | Never | Occasionally | Often | Always |
|---|---|---|---|---|
| 1 | 1355 | 603 | 192 | 224 |


## Estimating the difference between proportions

> Estimate the difference between the proportion of people without heart disease who never snore and the proportion with heart disease who never snore.

Note that we cast each proportion to a plain Python number using `float()`, because the division returns a one-element `Series`.

In [7]:
# sample size n1
n1 = disease.sum(axis=1).at[0]  # sum returns a Series, so select by index label

# proportion with heart disease who never snore
# (.iloc[0] extracts the single value before casting to float)
p1 = float((disease["Never"]/disease.sum(axis=1)).iloc[0])

In [8]:
p1

0.21818181818181817

In [9]:
# sample size n2
n2 = no_disease.sum(axis=1).at[1]  # returns Series, so get at idx

# proportion who never snore without heart disease
p2 = float((no_disease["Never"]/no_disease.sum(axis=1)).iloc[0])

In [10]:
p2

0.5707666385846673

In [11]:
# the difference between the two proportions d
p2 - p1

0.3525848204028491
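
As a cross-check, the same estimate can be reproduced by hand from the counts in the data preview, without `pandas`. This is only a sketch: the variable names are illustrative and not part of the notebook.

```python
# counts copied from the data preview above
never_disease = 24
n_disease = 24 + 35 + 21 + 30            # everyone with heart disease
never_no_disease = 1355
n_no_disease = 1355 + 603 + 192 + 224    # everyone without heart disease

# sample proportions who never snore in each group
p_disease = never_disease / n_disease
p_no_disease = never_no_disease / n_no_disease

# point estimate of the difference (no heart disease minus heart disease)
d = p_no_disease - p_disease
print(n_disease, n_no_disease, round(d, 4))  # 110 2374 0.3526
```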

## Calculating a CI for the difference

Note that the underlying observations are just **Bernoulli variables**: each person either never snores or snores at least occasionally.
We declare `bernoulli` objects to make this link clearer.

In [12]:
# declare two bernoulli variables
b1 = bernoulli(p=p1)  # with heart disease
b2 = bernoulli(p=p2)  # with no heart disease

In [13]:
# calculate p(1-p)/n for each object
ese_b1 = b1.var()/n1
ese_b2 = b2.var()/n2

In [14]:
# estimated standard error
ese = sqrt(ese_b1 + ese_b2)
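
Written out, the quantity computed in the cells above is the estimated standard error of the difference. Here $\hat{p}_1, \hat{p}_2$ correspond to `p1`, `p2` and $n_1, n_2$ to `n1`, `n2`; the hat notation is our own shorthand, not taken from the notebook.

$$
\widehat{\mathrm{SE}}(\hat{p}_2 - \hat{p}_1) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
$$

Each term under the root is the variance of a Bernoulli variable, $p(1-p)$, divided by the sample size, which is what `b1.var()/n1` and `b2.var()/n2` compute.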

In [15]:
# approximate 95% CI using Normal(d, ese^2)
# (confidence level passed positionally; newer SciPy names it `confidence`, not `alpha`)
norm(loc=(p2-p1), scale=ese).interval(0.95)

(0.27287638968801975, 0.43229325111767847)
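
The whole calculation can also be wrapped in one small function. The sketch below is not the notebook's method: it swaps `scipy.stats.norm` for the standard library's `statistics.NormalDist`, and the function name `diff_proportions_ci` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Large-sample CI for p2 - p1, given x successes out of n in each group."""
    p1, p2 = x1 / n1, x2 / n2
    d = p2 - p1
    # estimated standard error: sqrt of the summed p(1-p)/n terms
    ese = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # two-sided standard normal quantile, roughly 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * ese, d + z * ese


# counts of people who never snore, with and without heart disease
lower, upper = diff_proportions_ci(24, 110, 1355, 2374)
print(round(lower, 3), round(upper, 3))  # 0.273 0.432
```

Passing counts rather than proportions keeps the sample sizes and the estimates tied together, so the two cannot drift out of sync.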