# Problem 4.2

## part a.

We initialize two pandas DataFrames: df_u holds the ungrouped data and df_g holds the grouped data.

In [2]:
import pandas as pd, numpy as np, statsmodels.api as sm

In [4]:
data_u = np.array([1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1])
index_u = ([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
df_u = pd.DataFrame(data = data_u, index = index_u, columns=['Sucesses'])
df_u # ungrouped data

Unnamed: 0,Sucesses
0,1
0,0
0,0
0,0
1,1
1,1
1,0
1,0
2,1
2,1


In [17]:
data_g = np.array([[4,1], [4,2], [4,4]])
index_g = [0,1,2]
df_g = pd.DataFrame(data = data_g, index = index_g, columns=['Trials', 'Sucesses'])
df_g # grouped data

Unnamed: 0,Trials,Sucesses
0,4,1
1,4,2
2,4,4


For model $M_0$, $\text{logit}\left[P(Y=1)\right]=\alpha$, there is no slope parameter for $x$. The design matrix is therefore a single column of ones representing the intercept $\alpha$, and no $x$ column is needed. First we fit the ungrouped data:

In [55]:
y_u0 = df_u['Sucesses'] 
X_u0 = np.ones(len(df_u['Sucesses']))
result_u0 = sm.GLM(y_u0, X_u0, family=sm.families.Binomial(sm.families.links.logit)).fit()
print(result_u0.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               Sucesses   No. Observations:                   12
Model:                            GLM   Df Residuals:                       11
Model Family:                Binomial   Df Model:                            0
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -8.1503
Date:                Wed, 15 Nov 2017   Deviance:                       16.301
Time:                        16:05:26   Pearson chi2:                     12.0
No. Iterations:                     4                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3365      0.586      0.575      0.566      -0.811       1.484


In [56]:
y_g0 = np.array(df_g['Sucesses'] / df_g['Trials'])
X_g0 = np.ones(len(df_g['Sucesses']))
result_g0 = sm.GLM(y_g0, X_g0, family=sm.families.Binomial(sm.families.links.logit)).fit()
print(result_g0.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                    3
Model:                            GLM   Df Residuals:                        2
Model Family:                Binomial   Df Model:                            0
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2.0376
Date:                Wed, 15 Nov 2017   Deviance:                       4.5799
Time:                        16:05:44   Pearson chi2:                     1.20
No. Iterations:                     4                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3365      1.171      0.287      0.774      -1.959       2.632
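
The intercept-only fit has a closed form, which gives a quick check on the two summaries above. The sketch below uses only the standard library and assumes the usual results for an intercept-only logistic model: the estimate is the logit of the overall success proportion, and its standard error is $1/\sqrt{n\hat\pi(1-\hat\pi)}$. The variable names are illustrative.

```python
import math

successes, trials = 7, 12            # totals over all 12 binary observations
p_hat = successes / trials           # overall sample proportion

# MLE of alpha is the logit of the sample proportion
alpha_hat = math.log(p_hat / (1 - p_hat))

# Bernoulli log-likelihood evaluated at the MLE
log_lik = successes * math.log(p_hat) + (trials - successes) * math.log(1 - p_hat)

# Standard error when all 12 trials are counted
se_12_trials = 1 / math.sqrt(trials * p_hat * (1 - p_hat))

# Standard error when each of the 3 grouped rows is counted as a single trial
se_3_rows = 1 / math.sqrt(3 * p_hat * (1 - p_hat))

print(round(alpha_hat, 4), round(log_lik, 4))
print(round(se_12_trials, 3), round(se_3_rows, 3))
```

The first standard error agrees with the ungrouped summary (0.586). The second agrees with the grouped summary (1.171), which is consistent with the grouped fit treating each row of proportions as one trial rather than four.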


In [65]:
y_u1 = np.array(df_u['Sucesses'])
X_u1 = index_u
X_u1 = sm.add_constant(X_u1) # adds intercept (\alpha)
result_u1 = sm.GLM(y_u1, X_u1, family=sm.families.Binomial(sm.families.links.logit)).fit()
print(result_u1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                   12
Model:                            GLM   Df Residuals:                       10
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -5.5141
Date:                Wed, 15 Nov 2017   Deviance:                       11.028
Time:                        16:08:31   Pearson chi2:                     10.1
No. Iterations:                     5                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5027      1.181     -1.272      0.203      -3.818       0.813
x1             2.0605      1.130      1.823      0.0

In [66]:
y_g1 = np.array(df_g['Sucesses'] / df_g['Trials'])
X_g1 = index_g
X_g1 = sm.add_constant(X_g1) # adds intercept (\alpha)
result_g1 = sm.GLM(y_g1, X_g1, family=sm.families.Binomial(sm.families.links.logit)).fit()
print(result_g1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                    3
Model:                            GLM   Df Residuals:                        1
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -1.3785
Date:                Wed, 15 Nov 2017   Deviance:                       2.5635
Time:                        16:08:32   Pearson chi2:                    0.184
No. Iterations:                     6                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5027      2.363     -0.636      0.525      -6.133       3.128
x1             2.0605      2.260      0.912      0.3

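The maximum likelihood estimates of the coefficients are the same for either data entry method for each model: $\hat\alpha = 0.3365$ for $M_0$, and $\hat\alpha = -1.5027$, $\hat\beta = 2.0605$ for $M_1$. The maximized log-likelihoods for the ungrouped data are $L_0 = -8.1503$ and $L_1 = -5.5141$. The values reported for the grouped fits ($-2.0376$ and $-1.3785$) are one quarter of these, and the grouped standard errors are doubled, because the grouped fits were given proportions without the number of trials $n_i = 4$.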
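
The reason the estimates cannot depend on the data format can be sketched as follows. The notation is introduced here for illustration: $y_i$ is the number of successes out of $n_i$ trials at $x_i$, and $\text{logit}(\pi_i) = \alpha + \beta x_i$.

$$
L_{\text{grouped}}(\alpha,\beta)
= \sum_{i=1}^{3}\left[\log\binom{n_i}{y_i} + y_i\log\pi_i + (n_i-y_i)\log(1-\pi_i)\right]
= L_{\text{ungrouped}}(\alpha,\beta) + \sum_{i=1}^{3}\log\binom{n_i}{y_i}
$$

The binomial coefficients do not involve $\alpha$ or $\beta$, so both forms are maximized at the same values. For these data the constant is $\log 4 + \log 6 + \log 1 = \log 24$.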

## part b.

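The deviances below are calculated using $2\sum_i y_i\log\frac{y_i}{\hat\mu_i}$, with $y_i$ the observed proportion and $\hat\mu_i$ the fitted probability. This is only the success term of the binomial deviance $2\sum_i n_i\left[y_i\log\frac{y_i}{\hat\mu_i}+(1-y_i)\log\frac{1-y_i}{1-\hat\mu_i}\right]$; the failure term and the trials weight $n_i$ are left out. The deviances that statsmodels reports for the ungrouped fits equal $-2$ times the log-likelihood (for example, $16.301 = -2 \times -8.1503$), because the saturated log-likelihood is zero for binary data. I am not using the statsmodels deviances here.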

In [72]:
y_dev = [i if i != 0 else 1 for i in y_u0] # mathematical trick to avoid taking logs of zero
                                        # terms where observed (y) are zero still zero out due to multiplication by 
                                        # observed value outside of the log
2*np.sum(y_u0 * np.log(y_dev/result_u0.predict(X_u0)))
# Ungrouped, Model 0

7.5459510102576166

In [73]:
y_dev = [i if i != 0 else 1 for i in y_g0]
2*np.sum(y_g0 * np.log(y_dev/result_g0.predict(X_g0)))
# Grouped, Model 0

0.50019339144451425

In [74]:
y_dev = [i if i != 0 else 1 for i in y_u1]
2*np.sum(y_u1 * np.log(y_dev/result_u1.predict(X_u1)))
# Ungrouped, Model 1

5.7809068798934549

In [75]:
y_dev = [i if i != 0 else 1 for i in y_g1]
2*np.sum(y_g1 * np.log(y_dev/result_g1.predict(X_g1)))
# Grouped, Model 1

0.058932358853473782

The deviances are substantially higher for the ungrouped data than for the grouped data. This follows from the saturated model against which each fit is compared: it has one parameter per observation, so 12 for the ungrouped set but only 3 for the grouped set. Since the unscaled deviance is a sum of squared deviance residuals, one per data point, the ungrouped data yield the higher deviance.
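
For comparison, the full binomial deviance, with both the success and failure terms and the trials weight, can be sketched in plain Python. This is a sketch under stated assumptions rather than the calculation used above: the fitted probabilities are rebuilt from the estimates reported in part a (7/12 for $M_0$; $\hat\alpha=-1.5027$, $\hat\beta=2.0605$ for $M_1$), and `xlogy`, `expit`, and `binomial_deviance` are helper names introduced here.

```python
import math

def xlogy(a, b):
    """Return a * log(a / b), with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def binomial_deviance(successes, trials, probs):
    """2 * sum[y log(y/mu) + (n - y) log((n - y)/(n - mu))] with mu = n * p."""
    total = 0.0
    for y, n, p in zip(successes, trials, probs):
        mu = n * p                                   # fitted number of successes
        total += xlogy(y, mu) + xlogy(n - y, n - mu)
    return 2 * total

def expit(eta):
    """Inverse of the logit link."""
    return 1 / (1 + math.exp(-eta))

# Grouped data: successes out of 4 trials at x = 0, 1, 2
x_g, y_g, n_g = [0, 1, 2], [1, 2, 4], [4, 4, 4]
# Ungrouped data: the same 12 binary outcomes, one trial each
x_u = [0] * 4 + [1] * 4 + [2] * 4
y_u = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
n_u = [1] * 12

alpha1, beta1 = -1.5027, 2.0605          # M1 estimates reported in part a (assumed)
fits = {
    "M0": lambda x: 7 / 12,              # intercept-only fit: overall proportion
    "M1": lambda x: expit(alpha1 + beta1 * x),
}

dev = {}
for name, fit in fits.items():
    dev[name, "ungrouped"] = binomial_deviance(y_u, n_u, [fit(x) for x in x_u])
    dev[name, "grouped"] = binomial_deviance(y_g, n_g, [fit(x) for x in x_g])

for key, value in dev.items():
    print(key, round(value, 3))
```

Under these assumptions the deviances come out at about 16.30 and 11.03 for the ungrouped data and about 6.26 and 0.98 for the grouped data. The ungrouped values agree with the statsmodels summaries in part a, and the $M_0$ minus $M_1$ difference is about 5.27 in both formats.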

## part c.

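With the values computed in part b, the differences between the $M_0$ and $M_1$ deviances are $7.546 - 5.781 = 1.765$ for the ungrouped data and $0.500 - 0.059 = 0.441$ for the grouped data, so I did not observe that the difference is the same for each form of data entry. These numbers come from the success-only formula in part b, which leaves out the failure term and the trials weight $n_i$ for the grouped data, so they are not full deviances. The deviances that statsmodels reports for the ungrouped fits differ by $16.301 - 11.028 = 5.273$.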
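
The expected result can be sketched as follows, writing $D_0$ and $D_1$ for the two deviances and $L_s$ for the maximized log-likelihood of the saturated model (symbols introduced here for illustration). The saturated term cancels in the difference:

$$
D_0 - D_1 = 2(L_s - L_0) - 2(L_s - L_1) = 2(L_1 - L_0)
$$

With the ungrouped log-likelihoods from part a, this gives $2(-5.5141 + 8.1503) = 5.2724$. Grouping shifts $L_0$ and $L_1$ by the same constant, which cancels in $L_1 - L_0$, so the same value applies to the grouped data when the full binomial likelihood is used.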

# Problem 4.12

## part a.

There are 63 parameters in the saturated model, from 3 H times 3 S times 7 T categories. There are 11 parameters in the fitted model, including 1 constant, 2 H, 2 S, and 6 T. The difference is the degrees of freedom: 52.
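
The parameter counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the dictionary `levels` is an illustrative name, and the counts assume a main-effects model with one reference category per factor, as described above.

```python
import math

levels = {"H": 3, "S": 3, "T": 7}                    # categories per factor

n_saturated = math.prod(levels.values())             # one parameter per H x S x T cell
n_fitted = 1 + sum(k - 1 for k in levels.values())   # intercept plus main effects
residual_df = n_saturated - n_fitted

print(n_saturated, n_fitted, residual_df)
```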

Comparing the deviance with a chi-squared distribution whose degrees of freedom equal the model's residual degrees of freedom gives a p-value that tells us whether the fit is adequate.

In [1]:
from scipy.stats import chi2
1-chi2.cdf(43.9, 52)

0.78035694250244336

Because the p-value is high, there is no evidence of lack of fit, so the model appears to fit adequately.

## part b.

In both cases, the standard error is less than the magnitude of the estimated coefficient. The 95 percent confidence intervals for these effects are:

In [91]:
s2 = .470; s2se = .174; s3 = 1.324; s3se = .152
from scipy.stats import norm

In [92]:
(s2 + norm.ppf(0.025)*s2se, s2 - norm.ppf(0.025)*s2se)

(0.12896626669003053, 0.81103373330996942)

In [93]:
(s3 + norm.ppf(0.025)*s3se, s3 - norm.ppf(0.025)*s3se)

(1.0260854743499117, 1.6219145256500884)

Neither interval contains zero, so both effects differ significantly from zero at the 5 percent level, and the evidence is stronger for S = 3 than for S = 2.
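
The relative strength of evidence for the two effects can be read off their Wald statistics, the estimate divided by its standard error. The sketch below uses only the standard library; the dictionary name `estimates` is an illustrative choice, and the values are the ones entered above.

```python
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.975)     # two-sided 95 percent critical value

# (estimate, standard error) for each S effect, as entered above
estimates = {"S=2": (0.470, 0.174), "S=3": (1.324, 0.152)}

wald = {}
intervals = {}
for label, (beta, se) in estimates.items():
    wald[label] = beta / se                                    # Wald z statistic
    intervals[label] = (beta - z_crit * se, beta + z_crit * se)

for label in estimates:
    print(label, round(wald[label], 2), intervals[label])
```

A larger Wald statistic corresponds to an interval that sits further from zero relative to its width.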

## part c.

Adding this term decreased the deviance by 2.4 at the cost of 4 degrees of freedom, which implies that there was not much improvement in the model. This is borne out by checking the chi-square p-value again:

In [96]:
1-chi2.cdf(41.5, 48)

0.73468876717479104

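This p-value is slightly lower than before (0.735 versus 0.780) and still shows no evidence of lack of fit, so the added term does not improve on the simpler model.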
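
The contribution of the added term can also be tested directly by referring the drop in deviance, 2.4, to a chi-squared distribution on 4 degrees of freedom. The sketch below assumes that this chi-squared approximation applies. `chi2_sf_even_df` is a hypothetical helper written here so the example needs only the standard library; it relies on the closed form of the chi-squared upper tail for even degrees of freedom. SciPy's `chi2.sf` computes the same quantity.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability for even df (hypothetical helper).

    Uses the identity P(X > x) = P(N <= df/2 - 1) for N ~ Poisson(x/2).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0               # j = 0 term of the Poisson sum
    for j in range(1, df // 2):
        term *= half / j                 # builds half**j / j! incrementally
        total += term
    return math.exp(-half) * total

p_main = chi2_sf_even_df(43.9, 52)               # goodness of fit, original model
p_extra = chi2_sf_even_df(41.5, 48)              # goodness of fit with the added term
p_lr = chi2_sf_even_df(43.9 - 41.5, 52 - 48)     # test of the added term itself

print(round(p_main, 4), round(p_extra, 4), round(p_lr, 4))
```

The first two values reproduce the goodness-of-fit p-values computed above. The third is about 0.66, so the drop in deviance is well within what chance alone would produce for 4 extra parameters.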