# Problem 6.15

## part a.

We set up a dataset with two factors: Z = penicillin level with five categories (1/8, 1/4, 1/2, 1, 4) and X = delay with two categories (None, 1/2 hour). All but one category of each factor is represented by a column in the design matrix `X`. The response variable is the proportion cured: cured / (cured + died).

In [69]:
resp = [[0, 6],
         [0, 5],
         [3, 3],
         [0, 6],
         [6, 0],
         [2, 4],
         [5, 1],
         [6, 0],
         [2, 0],
         [5, 0]]
flags = [[0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [1, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [1, 0, 1, 0, 0],
         [0, 0, 0, 1, 0],
         [1, 0, 0, 1, 0],
         [0, 0, 0, 0, 1],
         [1, 0, 0, 0, 1]]
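
The coding rule described above (one 0/1 column per non-reference category) can be sketched programmatically. This is only an illustration: the names `doses`, `delays`, `dummy_row`, and `design` are hypothetical and do not appear in the notebook, and the reference levels (dose 1/8, no delay) are inferred from the all-zero first row of `flags`.

```python
from itertools import product

# Category labels as given in the problem statement; the first entry of
# each list is treated as the reference level (an assumption read off flags).
doses = ["1/8", "1/4", "1/2", "1", "4"]
delays = ["None", "1/2 hour"]

def dummy_row(dose, delay):
    """Return [delay flag, dose flags for 1/4, 1/2, 1, 4]."""
    delay_flag = int(delay != delays[0])
    dose_flags = [int(dose == level) for level in doses[1:]]
    return [delay_flag] + dose_flags

# Dose varies slowest and delay fastest, matching the row order of resp.
design = [dummy_row(dose, delay) for dose, delay in product(doses, delays)]

for row in design:
    print(row)
```

Under this coding, column x1 is the delay flag and columns x2 through x5 flag doses 1/4 through 4, so the 1/8 dose has no column of its own.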

In [70]:
import numpy as np
r = np.array(resp)
y = r[:,0]/np.sum(r, axis = 1)

In [71]:
X = np.array(flags)

In [72]:
import statsmodels.api as sm
# Pass a link instance; newer statsmodels versions reject a link class here.
result = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Logit())).fit()
print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                   10
Model:                            GLM   Df Residuals:                        5
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -3.1578
Date:                Tue, 14 Nov 2017   Deviance:                       13.678
Time:                        22:45:57   Pearson chi2:                     4.08
No. Iterations:                    20                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -2.9961      2.688     -1.115      0.265      -8.264       2.272
x2            -0.1633      1.901     -0.086      0.9

The data imply that a penicillin level of 4 will _always_ lead to a cure, while a penicillin level of 1/8 will _always_ lead to death. If there is no constant term (or intercept), the expected survival rate under the logit link depends on the parameters as $$\pi(\boldsymbol{x}) = \frac{\exp\left(\sum \beta_i x_i\right)}{1+\exp\left(\sum \beta_i x_i\right)}.$$ For the survival rate ($\pi$) to be zero, $\exp\left(\sum \beta_i x_i\right)$ must equal zero, which happens only as $\sum \beta_i x_i \to -\infty$. Similarly, $\pi$ reaches unity only as $\sum \beta_i x_i \to \infty$.

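Therefore, if the probability of cure at the 1/8 penicillin dose is actually zero, the linear predictor for that dose should be $-\infty$. In the coding used here, however, 1/8 is the reference level with no column of its own (x1 is the delay flag and x2 through x5 flag doses 1/4 through 4), so without an intercept no parameter can take that value. Likewise, if the probability of cure at the penicillin dose of 4 is actually one, the parameter corresponding to the x5 variable should be $\infty$.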

From the model, we can see that the x5 parameter is large and positive with a huge standard error, and the x2 parameter is negative with a standard error that is large relative to the estimate. The p-values for both are very large, much worse than those of the other parameters.
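
The large estimate and huge standard error for x5 are what one would expect when the likelihood has no finite maximizer. The sketch below is a simplified one-parameter view, not the fitted model: it ignores the delay term and treats the pooled dose-4 counts (7 cured, 0 died, from the `[2, 0]` and `[5, 0]` rows of `resp`) as a single binomial group. The function name `loglik` is illustrative.

```python
import math

cured, died = 7, 0  # pooled dose-4 counts: [2, 0] + [5, 0]

def loglik(beta):
    """Binomial log-likelihood (up to a constant) with logit(pi) = beta.

    Uses log(pi) = -log(1 + exp(-beta)) and log(1 - pi) = -log(1 + exp(beta))
    so that pi rounding to exactly 1 does not cause a domain error.
    """
    return (-cured * math.log1p(math.exp(-beta))
            - died * math.log1p(math.exp(beta)))

betas = [0, 2, 5, 10, 20]
values = [loglik(b) for b in betas]
for b, v in zip(betas, values):
    print(f"beta = {b:>2}: log-likelihood = {v:.3g}")
```

The log-likelihood keeps rising toward zero as `beta` grows and never turns over, so an iterative fit only stops when its convergence tolerance is met, and the curvature at that point is nearly flat.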

In [73]:
result.predict(X)

array([ 0.5       ,  0.04760147,  0.45927655,  0.04072345,  0.93090946,
        0.40242387,  0.99074878,  0.84258455,  1.        ,  1.        ])

The predicted value of 0.5 shows the weakness of fitting this logistic regression with no intercept. For the condition where the penicillin level is 1/8 and there is no delay, none of the flags are set, so every parameter is multiplied by zero. The inverse logit then gives $\pi = \frac{e^0}{1+e^0} = 0.5$, as seen. This affects the accuracy of the x2 coefficient as noted above.
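
As a check on this reading, the first four predicted values can be reproduced by hand from the two coefficients visible in the summary (x1 = -2.9961, x2 = -0.1633), because the first four rows of `flags` involve only those two columns. This is a minimal sketch; `expit`, `coef`, `rows`, and `preds` are illustrative names, and the coefficients are the rounded values from the printed table.

```python
import math

def expit(eta):
    """Inverse logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

coef = {"x1": -2.9961, "x2": -0.1633}  # rounded values from the summary

# (x1, x2) entries of the first four rows of flags; x3..x5 are zero there.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]

preds = [expit(x1 * coef["x1"] + x2 * coef["x2"]) for x1, x2 in rows]
for (x1, x2), p in zip(rows, preds):
    print(f"x1={x1}, x2={x2}: pi = {p:.4f}")
```

The first row has a linear predictor of exactly zero, which is why its fitted probability is pinned at 0.5 regardless of the data.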

## part b.

Presence of delay is the binary predictor ($X$), while survival is the binary response ($Y$). Since the conditional odds ratio between the levels of the qualitative covariate ($Z$, penicillin dose) is not constant, we must use the reduced model in eqn 6.5, $\text{logit}(\pi_{ik}) = \alpha + \beta^{Z}_{k}$.

To run this model, we must remove the first column of our previously entered flags data, which represented the $X$ variable (delay). This leaves duplicate rows, so we combine each pair of duplicates and sum their response counts.

We can multiply the predicted probabilities from this model by the total count $n$ at each level of Z to get predicted counts, and stack them with the observed counts into a series of 2x2 tables, one for each level of Z.

In [84]:
resp2 = np.array(resp[::2]) + np.array(resp[1::2])
y2 = resp2[:,0]/np.sum(resp2, axis = 1)
X2 = X[::2, 1:]

In [85]:
y2, X2

(array([ 0.        ,  0.25      ,  0.66666667,  0.91666667,  1.        ]),
 array([[0, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]]))

In [83]:
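# Pass a link instance; newer statsmodels versions reject a link class here.
result = sm.GLM(y2, X2, family=sm.families.Binomial(link=sm.families.links.Logit())).fit()
pred = result.predict(X2)
# Fitted [cured, died] counts: predicted probabilities times the group size at each dose
cont = np.vstack([pred, 1 - pred]) * np.sum(resp2, axis=1)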

In [82]:
np.set_printoptions(suppress=True)
np.stack([resp2, cont.T], axis = 1)

array([[[  0.        ,  11.        ],
        [  5.5       ,   5.5       ]],

       [[  3.        ,   9.        ],
        [  3.        ,   9.        ]],

       [[  8.        ,   4.        ],
        [  8.        ,   4.        ]],

       [[ 11.        ,   1.        ],
        [ 11.        ,   1.        ]],

       [[  7.        ,   0.        ],
        [  6.99999999,   0.00000001]]])

We can now test these tables for conditional independence. In each 2x2 table, the first row holds the observed counts and the second row the fitted counts. The test is basically unnecessary, since the two rows of most tables are identical. The last table (dose 4) is slightly off since the model cannot represent an infinite parameter value. The first table (dose 1/8) is wrong due to the previously noted problem that a logistic model without an intercept must predict a $\pi$ value of 0.5 when all binary predictor flags are zero.

Since the observed and fitted counts otherwise agree, we can assume that the response variable and delay time are conditionally independent; that is, rabbit survival is not affected by whether the injection of penicillin is immediate or delayed.
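
A more direct check of conditional independence compares delay against outcome within each dose, rather than observed against fitted counts. One standard choice, which this notebook does not use, is the Cochran-Mantel-Haenszel statistic. The sketch below is a hand-rolled version without continuity correction, referred to a chi-squared distribution with 1 degree of freedom; `cmh_statistic`, `tables`, `stat`, and `p_value` are illustrative names, and `resp` is copied from part a.

```python
import math

# [cured, died] counts from part a; rows alternate (no delay, delay) per dose.
resp = [[0, 6], [0, 5], [3, 3], [0, 6], [6, 0],
        [2, 4], [5, 1], [6, 0], [2, 0], [5, 0]]

# One 2x2 table (delay x outcome) per penicillin dose.
tables = [(resp[i], resp[i + 1]) for i in range(0, len(resp), 2)]

def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel statistic, no continuity correction."""
    diff = 0.0  # sum over strata of observed minus expected top-left count
    var = 0.0   # sum over strata of hypergeometric variances
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff ** 2 / var

stat = cmh_statistic(tables)
# Upper tail of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(stat / 2))
print(f"CMH statistic = {stat:.3f}, p-value = {p_value:.4f}")
```

Strata in which every rabbit had the same outcome (doses 1/8 and 4) contribute nothing to either sum, so the statistic is driven by the three middle doses. The result should be weighed against the conclusion drawn from the pooled tables above.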