# 09 - Negative binomial regression - Coding of categorical variables

## Data

Source of data: DebTrivedi in R package "MixAll"

Data set debtrivedi.csv

In [38]:
library(readr)
debtrivedi <- read_csv("data/debtrivedi.csv",
                 show_col_types = FALSE)
debtrivedi$poorhlth <- ifelse(debtrivedi$health == "poor", 1, 0)
debtrivedi$exclhlth <- ifelse(debtrivedi$health == "excellent", 1, 0)
debtrivedi$male <- ifelse(debtrivedi$gender == "male", 1, 0)
debtrivedi$privins_n <- ifelse(debtrivedi$privins == "yes", 1, 0)

head(debtrivedi)


ofp,ofnp,opp,opnp,emer,hosp,health,numchron,adldiff,region,⋯,married,school,faminc,employed,privins,medicaid,poorhlth,exclhlth,male,privins_n
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,⋯,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
5,0,0,0,0,1,average,2,no,other,⋯,yes,6,2.881,yes,yes,no,0,0,1,1
1,0,2,0,2,0,average,2,no,other,⋯,yes,10,2.7478,no,yes,no,0,0,0,1
13,0,0,0,3,3,poor,4,yes,other,⋯,no,10,0.6532,no,no,yes,1,0,0,0
16,0,5,0,1,1,poor,2,yes,other,⋯,yes,3,0.6588,no,yes,no,1,0,1,1
3,0,0,0,0,0,average,2,yes,other,⋯,yes,6,0.6588,no,yes,no,0,0,0,1
17,0,0,0,0,0,poor,5,yes,other,⋯,no,7,0.3301,no,no,yes,1,0,0,0


## SAS program snippet

The categorical variables in this model are coded manually.



The following SAS code will be executed.

Differences to the default encoding and the reference encoding can be found below.

### R chunk
Packages are loaded in the chunk where they are first needed.

A similar R program might look like this. It uses the glm.nb() function from the MASS package.

The tidy() function from the broom package formats the output into a tibble for easier processing.

In [39]:
library(MASS)
library(broom)
my_glm <- glm.nb(hosp ~ exclhlth + poorhlth + numchron + age + male + school + privins_n, data = debtrivedi)
tidy(my_glm)


term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),-3.752639734,0.44190797,-8.4919032,2.032803e-17
exclhlth,-0.697874635,0.1931509,-3.6131058,0.0003025512
poorhlth,0.613926301,0.09521661,6.4476805,1.135749e-10
numchron,0.289418299,0.02541425,11.3880328,4.796974e-30
age,0.238444463,0.05483476,4.3484178,1.371232e-05
male,0.153862333,0.07264766,2.1179256,0.03418137
school,-0.002271456,0.01019223,-0.2228615,0.8236433
privins_n,0.093922247,0.09042523,1.038673,0.2989568


## Results

The output is divided into blocks to explain it and to reproduce it afterwards in the different languages.

### Block 1
![Block 1](img_screenshots/block_1.png)


This block provides the name of the dataset, the response distribution, the link function and the response variable.

### R chunk for reproduction

In [40]:
summary(my_glm)


Call:
glm.nb(formula = hosp ~ exclhlth + poorhlth + numchron + age + 
    male + school + privins_n, data = debtrivedi, init.theta = 0.5660185253, 
    link = log)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.752640   0.441908  -8.492  < 2e-16 ***
exclhlth    -0.697875   0.193151  -3.613 0.000303 ***
poorhlth     0.613926   0.095217   6.448 1.14e-10 ***
numchron     0.289418   0.025414  11.388  < 2e-16 ***
age          0.238444   0.054835   4.348 1.37e-05 ***
male         0.153862   0.072648   2.118 0.034181 *  
school      -0.002271   0.010192  -0.223 0.823643    
privins_n    0.093922   0.090425   1.039 0.298957    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.566) family taken to be 1)

    Null deviance: 2879.7  on 4405  degrees of freedom
Residual deviance: 2547.9  on 4398  degrees of freedom
AIC: 5731.1

Number of Fisher Scoring iterations: 1


              Theta:  0.5660

### Block 2
![Block 2](img_screenshots/block_2.png)

This block provides the number of observations read from the dataset and the number of observations used in the model.

### R chunk for reproduction

In [41]:
nrow(debtrivedi)
nobs(my_glm)

### Block 3
![Block 3](img_screenshots/block_3.png)

This block displays some criteria for assessing the goodness of fit.

### R chunk for reproduction

In [42]:
glance(my_glm)

null.deviance,df.null,logLik,AIC,BIC,deviance,df.residual,nobs
<dbl>,<int>,<logLik>,<dbl>,<dbl>,<dbl>,<int>,<int>
2879.693,4405,-2856.562,5731.125,5788.641,2547.901,4398,4406


In [43]:
my_glm_intercept <- glm.nb(hosp ~ 1, data = debtrivedi)
glance(my_glm_intercept)


null.deviance,df.null,logLik,AIC,BIC,deviance,df.residual,nobs
<dbl>,<int>,<logLik>,<dbl>,<dbl>,<dbl>,<int>,<int>
2490.834,4405,-3009.625,6023.249,6036.031,2490.834,4405,4406
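
Goodness-of-fit tables of this kind often also list the Pearson chi-square and each criterion divided by its degrees of freedom. Whether block 3 contains exactly these rows is an assumption, because the screenshot is not reproduced here. The following minimal sketch computes them on simulated data; `fit`, `gof` and the simulation settings are illustrative names and values, and in this notebook `my_glm` would take the place of `fit`.

```r
library(MASS)

# Simulated negative binomial counts (illustrative values, not the debtrivedi data)
set.seed(2)
n <- 400
x <- rnorm(n)
y <- rnegbin(n, mu = exp(0.3 + 0.5 * x), theta = 1.5)
fit <- glm.nb(y ~ x)

# Pearson chi-square: sum of squared Pearson residuals
pearson_chisq <- sum(residuals(fit, type = "pearson")^2)

# Collect the criteria together with their residual degrees of freedom
gof <- data.frame(
  criterion = c("Deviance", "Pearson chi-square"),
  value     = c(deviance(fit), pearson_chisq),
  df        = df.residual(fit)
)
gof$value_per_df <- gof$value / gof$df
gof

# Log-likelihood of the fitted model (theta is counted as a parameter)
logLik(fit)
```

Values of `value_per_df` far from 1 are usually read as a hint of remaining over- or underdispersion.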


### Block 4
![Block 4](img_screenshots/block_4.png)

This block contains the results from fitting a generalized linear model to the data.

### R chunk for reproduction

In [44]:
tidy(my_glm)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),-3.752639734,0.44190797,-8.4919032,2.032803e-17
exclhlth,-0.697874635,0.1931509,-3.6131058,0.0003025512
poorhlth,0.613926301,0.09521661,6.4476805,1.135749e-10
numchron,0.289418299,0.02541425,11.3880328,4.796974e-30
age,0.238444463,0.05483476,4.3484178,1.371232e-05
male,0.153862333,0.07264766,2.1179256,0.03418137
school,-0.002271456,0.01019223,-0.2228615,0.8236433
privins_n,0.093922247,0.09042523,1.038673,0.2989568
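
Parameter tables from other software frequently add Wald confidence limits and a dispersion parameter. Both can be derived from a glm.nb() fit, as the following sketch on simulated data suggests. It assumes that the other software parameterizes the variance as mu + k * mu^2, so that k corresponds to 1 / theta in MASS. This is an assumption about block 4, which is only available as a screenshot. `fit` and `wald` are illustrative names.

```r
library(MASS)

# Simulated negative binomial counts (illustrative values, not the debtrivedi data)
set.seed(3)
n <- 400
x <- rnorm(n)
y <- rnegbin(n, mu = exp(0.2 + 0.4 * x), theta = 1.5)
fit <- glm.nb(y ~ x)

# Wald 95% limits: estimate +/- normal quantile * standard error
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(0.975)
wald <- cbind(estimate = est, lower = est - z * se, upper = est + z * se)
wald

# confint.default() computes the same Wald limits
all.equal(unname(wald[, c("lower", "upper")]), unname(confint.default(fit)))

# MASS reports theta; the dispersion k in the variance mu + k * mu^2 is 1 / theta
c(theta = fit$theta, dispersion = 1 / fit$theta)
```

Note that confint() on such a fit uses profile likelihood instead, so its limits can differ slightly from the Wald limits.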


### Block 5

![Block 5](img_screenshots/block_5.png)

This block contains one row for each effect in the model with the name of the effect, the likelihood ratio statistic for testing the significance of the effect, the degrees of freedom for the effect, and the p-value computed from the chi-square distribution.

### R chunk for reproduction

TODO:

In [45]:
library(lmtest)  # provides lrtest() for the manual comparison (commented out)
library(car)     # provides Anova()
# Manual likelihood-ratio test for a single effect (here: age), left disabled:
# fit the model without age and compare it with the full model my_glm.
#my_glm_age <- glm.nb(hosp ~ exclhlth + poorhlth + numchron + male + school + privins_n, data = debtrivedi)
#lrtest(my_glm_age, my_glm)
# Type III Wald chi-square tests
car::Anova(my_glm, type = 3, test.statistic = "Wald")

,Df,Chisq,Pr(>Chisq)
,<dbl>,<dbl>,<dbl>
(Intercept),1,72.11241965,2.032803e-17
exclhlth,1,13.05453345,0.0003025512
poorhlth,1,41.57258349,1.135749e-10
numchron,1,129.68729106,4.796974e-30
age,1,18.90873703,1.371232e-05
male,1,4.48560891,0.03418137
school,1,0.04966727,0.8236433
privins_n,1,1.07884162,0.2989568


In [46]:
car::Anova(my_glm, type = 3, test.statistic = "LR")

,LR Chisq,Df,Pr(>Chisq)
,<dbl>,<dbl>,<dbl>
exclhlth,14.8593794,1,0.0001158306
poorhlth,41.1621381,1,1.401105e-10
numchron,124.1017411,1,8.003294e-29
age,18.5851057,1,1.624851e-05
male,4.4396288,1,0.03511388
school,0.0495539,1,0.8238413
privins_n,1.0817987,1,0.2982955
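
A likelihood-ratio statistic for a single effect can also be computed by hand from two nested fits: twice the difference of the log-likelihoods, compared with a chi-square distribution. The sketch below does this on simulated data. All names (`sim`, `fit_full`, `fit_reduced`) are illustrative, and theta is re-estimated in both fits, so the numbers need not agree exactly with what a packaged routine reports.

```r
library(MASS)

# Simulated negative binomial counts; x2 has no true effect (illustrative values)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rnegbin(n, mu = exp(0.5 + 0.4 * x1), theta = 2)
sim <- data.frame(y, x1, x2)

fit_full    <- glm.nb(y ~ x1 + x2, data = sim)
fit_reduced <- glm.nb(y ~ x2, data = sim)   # drops the effect under test (x1)

# LR statistic: twice the difference in log-likelihood
lr_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_reduced)))
# Degrees of freedom: difference in the number of estimated parameters
lr_df   <- attr(logLik(fit_full), "df") - attr(logLik(fit_reduced), "df")
# p-value from the chi-square distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(LR = lr_stat, Df = lr_df, p = p_value)
```

Applied to this notebook, the reduced model would omit one term of `my_glm` at a time, once per row of the table.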



==========================


### Block 6
![Block 6](img_screenshots/block_6.png)

This block provides the class level information for the reference coding of the categorical variables.
See the following code block.

### R chunk for reproduction

In [47]:
my_glm <- glm.nb(hosp ~ health + numchron + age + gender + school + privins_n, data = debtrivedi)
tidy(my_glm)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),-3.752639734,0.44190797,-8.4919032,2.032803e-17
healthexcellent,-0.697874635,0.1931509,-3.6131058,0.0003025512
healthpoor,0.613926301,0.09521661,6.4476805,1.135749e-10
numchron,0.289418299,0.02541425,11.3880328,4.796974e-30
age,0.238444463,0.05483476,4.3484178,1.371232e-05
gendermale,0.153862333,0.07264766,2.1179256,0.03418137
school,-0.002271456,0.01019223,-0.2228615,0.8236433
privins_n,0.093922247,0.09042523,1.038673,0.2989568
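
The term names healthexcellent, healthpoor and gendermale in the output above reflect how R codes a categorical predictor by default: the levels are sorted, the first one becomes the reference, and each remaining level gets its own indicator column. The following sketch shows this on a small made-up vector and, purely for illustration, how relevel() would select a different reference level. The object names are hypothetical.

```r
# Toy vector; the level names mirror the health column, the values are made up
health <- factor(c("poor", "average", "excellent", "average"))

levels(health)      # sorted alphabetically, so "average" comes first
contrasts(health)   # treatment contrasts: "average" is the all-zero row

# Choosing another reference level, here "poor"
health_poor_ref <- relevel(health, ref = "poor")
levels(health_poor_ref)
contrasts(health_poor_ref)
```

With `health_poor_ref` in a model formula, the coefficients would express differences from "poor" instead of "average".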


### Block 7
![Block 7](img_screenshots/block_7.png)

The block above gives the results for the estimates if the reference coding is used.

It is similar to the provided R code and its output.

==========================

### Block 8
![Block 8](img_screenshots/block_8.png)

This block provides the class level information for GLM coding, the default coding of categorical variables. See the following code block.

### R chunk for reproduction

GLM coding is the default. It generates one indicator column for each level of a categorical variable.

In [48]:
df1 <- debtrivedi
# One 0/1 indicator column per level of health
df1$averagehlth <- ifelse(df1$health == "average", 1, 0)
df1$excellenthlth <- ifelse(df1$health == "excellent", 1, 0)
df1$poorhlth <- ifelse(df1$health == "poor", 1, 0)

# One 0/1 indicator column per level of gender
df1$male <- ifelse(df1$gender == "male", 1, 0)
df1$female <- ifelse(df1$gender == "female", 1, 0)

# Note: privins_n now marks privins == "no"; in the earlier chunks it marked "yes"
df1$privins_n <- ifelse(df1$privins == "no", 1, 0)
df1$privins_y <- ifelse(df1$privins == "yes", 1, 0)

# The last indicator of each group (poorhlth, male, privins_y) is linearly dependent
# on the intercept and the other indicators, so no estimate is reported for it
my_glm <- glm.nb(hosp ~ averagehlth + excellenthlth + poorhlth + numchron + age + female + male + school + privins_n + privins_y, 
                 data = df1)
tidy(my_glm)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),-2.890928852,0.44877953,-6.4417573,1.18098e-10
averagehlth,-0.613926301,0.09521661,-6.4476805,1.135749e-10
excellenthlth,-1.311800936,0.21086082,-6.2211696,4.934625e-10
numchron,0.289418299,0.02541425,11.3880328,4.796974e-30
age,0.238444463,0.05483476,4.3484178,1.371232e-05
female,-0.153862333,0.07264766,-2.1179256,0.03418137
school,-0.002271456,0.01019223,-0.2228615,0.8236433
privins_n,-0.093922247,0.09042523,-1.038673,0.2989568
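
The output above has no rows for poorhlth, male and privins_y. With one indicator column per level plus an intercept, the design matrix is rank deficient, so one column per categorical variable cannot be estimated. The following sketch illustrates this on made-up data by comparing the rank of the design matrix under both codings. `X_glm` and `X_ref` are illustrative names.

```r
# Toy data; the level names mirror the health column, the values are made up
health <- factor(c("average", "poor", "excellent", "average", "poor", "excellent"))

# GLM-style coding: intercept plus one indicator column per level
X_glm <- cbind(`(Intercept)` = 1, model.matrix(~ health - 1))

# Reference coding: the first level ("average") is absorbed into the intercept
X_ref <- model.matrix(~ health)

ncol(X_glm)      # 4 columns ...
qr(X_glm)$rank   # ... but rank 3: one column is redundant
ncol(X_ref)      # 3 columns
qr(X_ref)$rank   # rank 3: full column rank
```

Because the redundant column is the last one entered for each variable, the last level effectively acts as the reference in the fit above, which explains the sign changes relative to the reference-coded model.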


### Block 9
![Block 9](img_screenshots/block_9.png)

The block above gives the results for the estimates if the GLM coding is used. It is similar to the provided R code and its output.