Given the data table below, determine if there is a relationship between fitness level and smoking habits:

In [2]:
import scipy.stats
import numpy as np
import pandas as pd

In [24]:
# Observed counts: rows are smoking habits, columns are fitness levels
df = pd.DataFrame(
    data=[
        [113, 113, 110, 159],
        [119, 135, 172, 190],
        [77, 91, 86, 65],
        [181, 152, 124, 73],
    ],
    columns=["Low fitness level", "Medium-low fitness level", "Medium-high fitness level", "High fitness level"],
    index=["Never smoked", "Former smokers", "1 to 9 cigarettes daily", ">=10 cigarettes daily"],
)
df

Smoking habits,Low fitness level,Medium-low fitness level,Medium-high fitness level,High fitness level
Never smoked,113,113,110,159
Former smokers,119,135,172,190
1 to 9 cigarettes daily,77,91,86,65
>=10 cigarettes daily,181,152,124,73


You don't have to fully solve for the number here (that would be pretty time-intensive for an interview setting), but lay out the steps you would take to solve such a problem. 

In [34]:
# Calculate the grand total and the row and column marginal totals
observed = df.to_numpy().astype(int)
n = observed.sum()
row_marginals = observed.sum(axis=1)  # one total per smoking habit
col_marginals = observed.sum(axis=0)  # one total per fitness level

In [45]:
# Calculate expected counts under independence: E[i, j] = row_marginals[i] * col_marginals[j] / n
E = (np.expand_dims(row_marginals, 1) * col_marginals) / n
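
The expected counts can also be written as an outer product of the marginals. The sketch below is self-contained (it re-enters the table as a NumPy array, and the names `observed` and `expected` are illustrative, not from the notebook). It also checks two things worth stating in an interview: the expected table reproduces the observed marginals, and every expected count clears the common rule-of-thumb minimum of 5 for the chi-square approximation.

```python
import numpy as np

# Observed counts (rows: smoking habits, columns: fitness levels)
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count for cell (i, j) under independence
expected = np.outer(row_totals, col_totals) / n

# The expected table has the same marginals as the observed table
margins_match = (
    np.allclose(expected.sum(axis=1), row_totals)
    and np.allclose(expected.sum(axis=0), col_totals)
)

# Rule of thumb (an assumption, not a hard requirement): all expected counts >= 5
large_enough = bool((expected >= 5).all())

print(margins_match, large_enough)
```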

In [47]:
# Compute the Pearson chi-square statistic: X2 = sum over cells of (observed - expected)**2 / expected
x2 = ((df.to_numpy().astype(int) - E)**2 / E).sum()
x2

87.27274636300587

In [40]:
# Get the p-value: upper-tail probability of a chi-square with (rows - 1) * (columns - 1) = 9 df
p = scipy.stats.chi2.sf(x2, df=(len(row_marginals) - 1) * (len(col_marginals) - 1))
p

5.7306646048374425e-15
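
As a cross-check on the manual steps, SciPy's `scipy.stats.chi2_contingency` runs the same test of independence in one call. This is a minimal, self-contained sketch; the variable names are illustrative. Yates' continuity correction only applies when there is 1 degree of freedom, so it does not affect a 4x4 table.

```python
import numpy as np
import scipy.stats

# Observed counts (rows: smoking habits, columns: fitness levels)
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

# Returns the statistic, p-value, degrees of freedom and expected counts
chi2_stat, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

print(chi2_stat, p_value, dof)
```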

In [49]:
df

Smoking habits,Low fitness level,Medium-low fitness level,Medium-high fitness level,High fitness level
Never smoked,113,113,110,159
Former smokers,119,135,172,190
1 to 9 cigarettes daily,77,91,86,65
>=10 cigarettes daily,181,152,124,73


A correlational approach

In [59]:
# Assign numeric scores to the categories: smoking habits get 0 to 3 in row order
# (never smoked to >=10 cigarettes daily) and fitness levels get 0 to 3 from low to high.
# Expand the table to one (smoking, fitness) score pair per person and correlate the two.
a = df.to_numpy().astype(int)
v1 = []
v2 = []

for row in range(len(a)):
  for col in range(len(a[0])):
    v1.append([row] * a[row, col])
    v2.append([col] * a[row, col])

r, p = scipy.stats.pearsonr(np.hstack(v1), np.hstack(v2))
(r, p)

(-0.1760369960623594, 4.163857658301945e-15)
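
The expansion done by the nested loop can also be sketched with `np.repeat`, which repeats each cell's row and column score by that cell's count. This is an alternative formulation under the same equally spaced scoring assumption; the names `smoking_scores` and `fitness_scores` are illustrative.

```python
import numpy as np
import scipy.stats

# Observed counts (rows: smoking habits, columns: fitness levels)
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

# Row and column index of every cell, flattened in the same order as the counts
row_idx, col_idx = np.indices(observed.shape)
counts = observed.ravel()

# One score pair per person: each cell's indices repeated by its count
smoking_scores = np.repeat(row_idx.ravel(), counts)
fitness_scores = np.repeat(col_idx.ravel(), counts)

r, p_value = scipy.stats.pearsonr(smoking_scores, fitness_scores)
print(r, p_value)
```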

In [60]:
# We follow Agresti (1996)'s approach: M^2 = (N - 1) * r^2, where M^2 is a chi-square statistic on 1 degree of freedom,
# r is the correlation between the two variables, and N is the sample size.
M2 = (n - 1)*r**2
M2, scipy.stats.chi2.sf(M2, df=1)

(60.70749798202925, 6.621744795444811e-15)

In [62]:
# We can go one step further before leaving this approach. We know that the 
# overall Pearson chi-square on 9 df = 87.2727. We also know that we have just 
# calculated a chi-square = 60.7075 on 1 df associated with the linear relationship 
# between the two variables. That linear relationship is part of the total 
# chi-square, and if we subtract the linear component from the overall 
# chi-square we obtain:
dl_df = (len(row_marginals) - 1) * (len(col_marginals) - 1) - 1 # Deviation from linear df
dl_x2 = x2 - M2 # Deviation from linear chi-square
dl_df, dl_x2

(8, 26.565248380976612)

In [63]:
# The departure from linearity is itself a chi-square = 26.5652 on 8 df,
scipy.stats.chi2.sf(dl_x2, df=dl_df)

0.000840110607100272

In [None]:
# which has a probability under the null of 0.0008. Thus we have evidence that
# a linear trend alone does not account for the association in these data.
# (In other words, the relationship between smoking habits and fitness departs
# from linearity, for example by being curvilinear.)
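
The decomposition above can be collected into a single function. The sketch below is one possible arrangement, not the notebook's own code: the function name and the returned keys are hypothetical, and it assumes equally spaced scores 0, 1, 2, ... for both variables. It computes the correlation directly from the cell proportions instead of expanding the table, which gives the same r.

```python
import numpy as np
import scipy.stats


def linear_trend_decomposition(observed):
    """Split the Pearson chi-square into linear and deviation-from-linear parts.

    Hypothetical helper; assumes equally spaced integer scores for rows and columns.
    """
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    n_rows, n_cols = observed.shape

    # Overall Pearson chi-square on (rows - 1) * (columns - 1) df
    total_x2, _, total_df, _ = scipy.stats.chi2_contingency(observed, correction=False)

    # Pearson correlation of the scores, weighting each cell by its proportion
    p = observed / n
    p_row, p_col = p.sum(axis=1), p.sum(axis=0)
    u = np.arange(n_rows) - (p_row * np.arange(n_rows)).sum()  # centered row scores
    v = np.arange(n_cols) - (p_col * np.arange(n_cols)).sum()  # centered column scores
    cov = (p * np.outer(u, v)).sum()
    r = cov / np.sqrt((p_row * u**2).sum() * (p_col * v**2).sum())

    # Linear component on 1 df, remainder on (total_df - 1) df
    linear_x2 = (n - 1) * r**2
    deviation_x2 = total_x2 - linear_x2
    deviation_df = total_df - 1
    return {
        "r": r,
        "total_x2": total_x2,
        "linear_x2": linear_x2,
        "linear_p": scipy.stats.chi2.sf(linear_x2, df=1),
        "deviation_x2": deviation_x2,
        "deviation_df": deviation_df,
        "deviation_p": scipy.stats.chi2.sf(deviation_x2, df=deviation_df),
    }


table = [
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
]
result = linear_trend_decomposition(table)
print(result)
```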

In [65]:
# The per-cell contributions (observed - expected)**2 / expected show which cells drive X2:
# the largest come from heavy smokers (>=10 cigarettes daily) at low fitness (17.75) and at high fitness (26.16)
((df.to_numpy().astype(int) - E)**2 / E)

array([[ 0.93383838,  0.97623902,  1.63540918, 10.54172159],
       [ 7.95454545,  2.41741476,  1.95155739,  8.91676578],
       [ 0.09482759,  1.53826507,  0.43833101,  2.56614465],
       [17.75283019,  2.78508749,  0.61437055, 26.15539827]])
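
The squared contributions above hide the direction of each deviation. A common follow-up, sketched here as an assumption rather than as part of the original analysis, is to look at the signed Pearson residuals (observed - expected) / sqrt(expected). Their squares are exactly the contributions shown above, and the sign says whether a cell holds more or fewer people than independence would predict.

```python
import numpy as np
import pandas as pd

smoking = ["Never smoked", "Former smokers", "1 to 9 cigarettes daily", ">=10 cigarettes daily"]
fitness = ["Low fitness level", "Medium-low fitness level", "Medium-high fitness level", "High fitness level"]

observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

# Expected counts under independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Signed Pearson residuals; squaring them gives each cell's chi-square contribution
residuals = pd.DataFrame(
    (observed - expected) / np.sqrt(expected),
    index=smoking,
    columns=fitness,
)
print(residuals.round(2))
```

A positive residual marks a cell with more people than expected, and a negative one marks a cell with fewer.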