In [2]:
# import libraries
import numpy as np
import pandas as pd

In [3]:
lifespans = pd.read_csv('familiar_lifespan.csv')
iron = pd.read_csv('familiar_iron.csv')
print(lifespans.head())
print(iron.head())

     pack   lifespan
0    vein  76.255090
1  artery  76.404504
2  artery  75.952442
3  artery  76.923082
4  artery  73.771212
     pack    iron
0    vein     low
1  artery  normal
2  artery  normal
3  artery  normal
4  artery    high


Extract the lifespan of subscribers to the `vein` pack and store it in the variable `vein_pack_subscribers`.

In [4]:
vein_pack_subscribers = lifespans.lifespan[lifespans.pack == 'vein']

In [5]:
np.mean(vein_pack_subscribers) # print the mean of the `vein` pack subscribers. 

76.16901335636044

## Check if the subscribers' mean lifespan is 73 years

**Null hypothesis:** the mean is 73 years.

**Alternative hypothesis:** the mean is not 73 years.

In this case, we conduct a one-sample t-test, using a significance threshold of 0.05.

In [6]:
from scipy.stats import ttest_1samp # import the function

In [11]:
t_score, p_value = ttest_1samp(vein_pack_subscribers, popmean=73)
print(p_value)

5.972157921433201e-07


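The p-value is lower than the significance threshold. As a consequence, we reject the null hypothesis: the mean lifespan of `vein` subscribers differs significantly from 73 years. Note that with a threshold of 0.05, there is a 5% probability of a false positive when the null hypothesis is actually true (concluding that the mean differs from 73 when it does not).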
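
To make the test less of a black box, the sketch below recomputes the one-sample t statistic and its two-sided p-value by hand and compares them with `ttest_1samp`. The `sample` array is synthetic and only stands in for `vein_pack_subscribers`; the names `t_manual` and `p_manual` are illustrative, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for `vein_pack_subscribers` (not the real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=76, scale=3, size=20)

popmean = 73
n = len(sample)

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = np.std(sample, ddof=1) / np.sqrt(n)
t_manual = (np.mean(sample) - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=popmean)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The further the sample mean is from 73 relative to its standard error, the larger the absolute t statistic and the smaller the p-value.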

## Doing the same for the `artery` subscribers

In [13]:
artery_pack_subscribers = lifespans.lifespan[lifespans.pack == 'artery']

In [14]:
np.mean(artery_pack_subscribers) # print the mean of the `artery` pack subscribers.

74.8736622351704

## Compare `vein` and `artery` lifespan

**Null hypothesis:** The mean lifespans of `vein` and `artery` subscribers are equal.

**Alternative hypothesis:** The mean lifespans of `vein` and `artery` subscribers are not equal.

To evaluate the association between a quantitative variable (lifespan) and a binary categorical variable (pack), we conduct a two-sample t-test.

In [15]:
from scipy.stats import ttest_ind

In [19]:
tstat, pval = ttest_ind(vein_pack_subscribers, artery_pack_subscribers)
print(pval)

0.055888830790708194


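The p-value is higher than 0.05, so we fail to reject the null hypothesis: the data do not show a significant difference between the mean lifespans of the two packs. This is not the same as proving that the means are equal.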
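
By default, `ttest_ind` assumes that both groups have equal population variances. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs both variants on synthetic data; `group_a` and `group_b` are hypothetical stand-ins for the two packs, and the notebook itself only uses the default test.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples standing in for the two packs (not the real data).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=76, scale=3, size=20)
group_b = rng.normal(loc=75, scale=4, size=25)

# Default: Student's t-test, which assumes equal population variances.
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test: drops the equal-variance assumption.
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error.
se_unpooled = np.sqrt(
    group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b)
)
t_welch_manual = (group_a.mean() - group_b.mean()) / se_unpooled

print(p_student, p_welch)
print(t_welch, t_welch_manual)
```

With a p-value as close to the threshold as the one above, it can be worth checking whether the conclusion changes between the two variants.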

# Side effects: a familiar problem

The Familiar team has provided us with another dataset containing survey data about iron counts for our subscribers. This data has been pre-processed to categorize iron counts as “low”, “normal”, and “high” for each subscriber. Familiar wants to be able to advise potential subscribers about possible side effects of these packs and whether they differ for the Vein vs. the Artery pack.

In [20]:
print(iron.head())

     pack    iron
0    vein     low
1  artery  normal
2  artery  normal
3  artery  normal
4  artery    high


Is there an association between the pack that a subscriber gets (Vein vs. Artery) and their iron level?

In [21]:
# first, we create a contingency table
Xtab = pd.crosstab(iron.pack, iron.iron)
Xtab

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40
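
Because the two packs have different numbers of subscribers, raw counts are hard to compare directly. As a rough sketch, the counts from the contingency table can be converted to row proportions. The `counts` DataFrame below is rebuilt by hand from the table above, and the names `counts` and `row_proportions` are illustrative.

```python
import pandas as pd

# Counts copied from the contingency table above.
counts = pd.DataFrame(
    {"high": [87, 20], "low": [29, 140], "normal": [29, 40]},
    index=pd.Index(["artery", "vein"], name="pack"),
)

# Divide each row by its total so that every row sums to 1.
row_proportions = counts.div(counts.sum(axis=1), axis=0)
print(row_proportions.round(2))
```

About 60% of `artery` subscribers have high iron, while about 70% of `vein` subscribers have low iron. On the raw data, `pd.crosstab(iron.pack, iron.iron, normalize='index')` should give the same proportions.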


Find out if there is a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

Test the following null and alternative hypotheses:

**Null:** There is NOT an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

**Alternative:** There is an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

Use a significance threshold of 0.05.

To test for an association between two categorical variables, we can use a Chi-Square test, which compares the observed counts in the contingency table with the counts we would expect if the two variables were independent.

In [23]:
from scipy.stats import chi2_contingency

In [25]:
chi2, pval, dof, expected = chi2_contingency(Xtab)
pval 

9.359749337433008e-25

`pval` is lower than the significance threshold. As a consequence, we reject the null hypothesis and conclude that there is a significant association between the pack a subscriber gets and their iron level.
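
To see where the Chi-Square statistic comes from, the sketch below recomputes the expected counts under independence and the statistic itself, then compares them with `chi2_contingency`. The `observed` array is copied by hand from the contingency table above; `expected_manual` and `chi2_manual` are illustrative names.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: artery, vein; columns: high, low, normal).
observed = np.array([[87, 29, 29],
                     [20, 140, 40]])

# Expected count for each cell under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-Square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, pval, dof, expected = chi2_contingency(observed)
print(chi2_manual, chi2)
print(dof, pval)
```

For a 2x3 table the test has (2 - 1) * (3 - 1) = 2 degrees of freedom. The `normal` column contributes almost nothing to the statistic, so the association is driven by the `high` and `low` columns.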