
# 01ZLMA - Exercise 10
Exercise 10 of the course 01ZLMA. 

## Contents

* Log-linear models with Poisson distributed data
 ---


Dataset and example from Chapter 10 of:

Peter K. Dunn and Gordon K. Smyth, *Generalized Linear Models With Examples in R*, Springer, 2018.

https://link.springer.com/content/pdf/10.1007%2F978-1-4419-0118-7.pdf

In [None]:
install.packages("GLMsData")
library(GLMsData)
library(tidyverse)

Example 10.1: As a numerical example, consider the number of incidents of lung cancer from 1968 to 1971 in four Danish cities. The number of cases of lung cancer in each age group is remarkably similar for Fredericia. However, the raw number of cases does not accurately reflect the information in the data, because five times as many people are in the 40–54 age group as in the over-75 age group. Understanding the data is enhanced by considering the rate of lung cancer, such as the number of lung cancer cases per unit of population.

In [None]:
data(danishlc)
danishlc$Rate <- danishlc$Cases / danishlc$Pop * 1000 # Rate per 1000
danishlc$Age <- ordered(danishlc$Age, # Ensure age-order is preserved 
   levels=c("40-54", "55-59", "60-64", "65-69", "70-74", ">74") )

# The R function ordered() informs R that the levels of the factor Age have a
# particular order; without declaring Age as an ordered factor, Age is plotted
# with ">74" as the first level.

In [None]:
head(danishlc)
str(danishlc)

In [None]:
#danishlc$City <- abbreviate(danishlc$City, 1) # Abbreviate city names
# Rate by age group, one line per city
matplot( xtabs( Rate ~ Age+City, data=danishlc), pch=1:4, lty=1:4, type="b",
         lwd=2, col="black", axes=FALSE, ylim=c(0, 25),
         xlab="Age group", ylab="Cases/1000")
axis(side=1, at=1:6, labels=levels(danishlc$Age))
axis(side=2, las=1); box()
legend("topleft", col="black", pch=1:4, lwd=2, lty=1:4, merge=FALSE,
       legend=c("Fredericia", "Horsens", "Kolding", "Vejle") )

We have groups: we see an increasing trend and an outlier for Horsens, which causes a sharp drop in the trend.

In [None]:
# Same plot with ggplot2
ggplot(danishlc, aes(x=Age, y=Rate, group=City, col=City)) +
  geom_line() +
  geom_point(aes(shape = City)) +
  xlab("Age group") + ylab("Cases/1000")

The plots show no clear pattern by city, but the
lung cancer rate appears to grow steadily for older age groups for each city,
then falls away for the `>74` age group. The lung cancer rate for Horsens in
the `>74` age group seems very low.

### Poisson regression recap:

We assume:

$Y_i \sim Po(\lambda_i s_i)$, where $s_i$ is the exposure (here the population size `Pop`) and $\lambda_i$ is the rate per unit of exposure,

$\log(\lambda_i) = x_i^T \beta$

$E[Y_i] = \lambda_i s_i = s_i \exp(x_i^T \beta) = \exp(\log(s_i) + x_i^T \beta)$

$\lambda_i = \frac{E[Y_i]}{s_i}$

We estimate the coefficients $\beta_j$, but no parameter is estimated for the offset term $\log(s_i)$: its coefficient is fixed at 1.
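
The role of the offset can be illustrated on simulated data. The sketch below is not part of the exercise: the exposure `s`, the predictor `x` and the true coefficients are hypothetical values chosen only for illustration. It shows that `offset(log(s))` in the formula and the `offset =` argument of `glm()` give the same fit, and that only the intercept and the slope are estimated.

```r
set.seed(1)

# Hypothetical data: exposure s, one predictor x, true rate exp(-3 + 0.5 * x)
n <- 200
s <- runif(n, 100, 1000)
x <- rnorm(n)
y <- rpois(n, exp(-3 + 0.5 * x) * s)

# Offset inside the formula ...
fit_formula <- glm(y ~ x + offset(log(s)), family = poisson)
# ... or through the offset argument; both fix the coefficient of log(s) at 1
fit_argument <- glm(y ~ x, offset = log(s), family = poisson)

# Only two coefficients are estimated; nothing is estimated for log(s)
coef(fit_formula)
all.equal(coef(fit_formula), coef(fit_argument))
```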



In [None]:
danishlc

`Age` was misbehaving as an ordered factor (R uses polynomial contrasts for ordered factors), so we want it as a plain factor:

In [None]:
danishlc$age = as.factor(as.numeric(danishlc$Age))
danishlc

In [None]:
# Model 1 with an offset term and factors City, age and their interaction.
dlc_m1 <- glm( Cases ~ offset( log(Pop) ) + City * age, family=poisson, data=danishlc)
summary(dlc_m1)

Age is the most important predictor we have; among the City:age interaction terms, the only notable one is Horsens with age6.



Question: Compare the previous model with the saturated one.

In [None]:
# Test predictors significance:
anova(dlc_m1, test="Chisq")
# Model without interaction
dlc_m1wo <- glm( Cases ~ offset( log(Pop) ) + City + age, family=poisson, data=danishlc)
anova(dlc_m1wo, test="Chisq")


city: 3.34e-01 -> we have no evidence of a City effect

city:age -> no evidence either, so we drop it

Only the model depending on age remains.

In [None]:
# More tests (the smaller, nested model goes first):
anova(dlc_m1wo, dlc_m1, test = "LRT")
anova(dlc_m1wo, dlc_m1, test = "Rao")
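
What the likelihood ratio test computes can be checked by hand. The following is a minimal sketch on simulated data, not on `danishlc`; the variables `x1`, `x2` and the models `fit_small` and `fit_big` are hypothetical. For nested Poisson models the test statistic is the drop in residual deviance, compared with a chi-square distribution whose degrees of freedom equal the number of extra parameters.

```r
set.seed(42)

# Hypothetical data: counts depend on x1 only; x2 is pure noise
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, exp(1 + 0.3 * x1))

fit_small <- glm(y ~ x1,      family = poisson)
fit_big   <- glm(y ~ x1 + x2, family = poisson)

# LRT statistic = drop in residual deviance; df = number of extra parameters
lrt_stat <- deviance(fit_small) - deviance(fit_big)
lrt_df   <- df.residual(fit_small) - df.residual(fit_big)
p_hand   <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

# The same test through anova(); the nested model is listed first
p_anova <- anova(fit_small, fit_big, test = "LRT")[2, "Pr(>Chi)"]
c(by_hand = p_hand, anova = p_anova)
```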


Keep only `Age`

In [None]:
# Drop City
dlc_m2 <- update(dlc_m1, . ~ offset(log(Pop)) + age )
summary(dlc_m2)


AIC 136, deviance D 28.

Compared with the previous model with interaction (deviance almost 0) and the previous model without interaction: the AIC actually improved, and the deviance did not get much worse.

Mutate the dataset to have age as a quantitative variable. We use the lower boundary of each class, since every class has a lower boundary (the last class, `>74`, has no upper one).

In [None]:
# Add numerical variable: AgeNum.
danishlc <- danishlc %>%
 add_column(AgeNum = rep( c(40, 55, 60, 65, 70, 75), 4))


Question: Discuss the application of different boundaries: lower, midpoint, upper.

In [None]:
# Build model 3 with Age as numerical variable.
dlc_m3 <- update(dlc_m1, . ~ offset( log(Pop) ) + AgeNum)
summary(dlc_m3)
anova(dlc_m3, test="Chisq")

How does the Poisson rate grow with age?

The variance estimate is the same.

In [None]:
# With a numerical variable, we can fit a quadratic relationship
dlc_m4 <- update( dlc_m3, . ~ offset( log(Pop) ) + poly(AgeNum, 2) )
summary(dlc_m4)
anova(dlc_m4, test="Chisq")

In [None]:
# Compare linear and quadratic models.
anova(dlc_m3,dlc_m4, test="Chisq")

The quadratic model is a significant improvement over the linear one.

Just for academic purposes: check the deviance by hand, using $D = 2 \sum_i \left[ y_i \log(y_i / \hat{\mu}_i) - (y_i - \hat{\mu}_i) \right]$.

In [None]:
y <- danishlc$Cases
mu_hat4 <- fitted(dlc_m4)               # fitted values of model 4
# Poisson deviance; as written this assumes all y > 0
# (for y = 0 the term y*log(y/mu) is taken as 0)
dev_stat_m4 <- 2*sum(y*log(y/mu_hat4) - (y - mu_hat4))
data.frame(computed_by = c("hand","glm in R"), deviance = c(dev_stat_m4,deviance(dlc_m4)))
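
The hand computation above would return `NaN` if any count were zero, because `0 * log(0)` is not defined in floating point. A minimal sketch of a version that handles zeros, on simulated data with hypothetical names (`x`, `fit`, `dev_hand`), could look as follows.

```r
set.seed(7)

# Hypothetical data with low means, so some counts are likely to be 0
x   <- runif(50)
y   <- rpois(50, exp(0.2 + 1.0 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# y * log(y / mu) is replaced by 0 when y == 0 (its limit as y -> 0)
log_term <- ifelse(y == 0, 0, y * log(y / mu))
dev_hand <- 2 * sum(log_term - (y - mu))

c(hand = dev_hand, glm = deviance(fit))
```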


### *Results*

Compare the models with numerical `AgeNum` and with categorical `Age` by AIC, and create a summary table of the outputs from all models.

In [None]:
x       <- list(m1=dlc_m1,m2=dlc_m2,m3=dlc_m3,m4=dlc_m4)
results <- data.frame(model_name = c("dlc_m1","dlc_m2","dlc_m3","dlc_m4"),
       age_type = c("categorical","categorical","numerical","numerical"),
       model_type = c("with interaction (sat_model)","without interaction","AgeNum Linear", "AgeNum quadratic"))

results <- tibble::rownames_to_column(results, var = "model_number") %>%
 add_column(AIC = as.numeric(lapply(x,AIC)),
            deviance = lapply(x,deviance) %>% as.numeric() %>% round(2),
            df = lapply(x,df.residual) %>% as.numeric()) %>%
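  # c_val: 5% critical value of the chi-square distribution with df degrees of freedom
  # P_val: goodness-of-fit p-value of the residual deviance
  mutate(c_val = ifelse(df>0,qchisq(0.05, df, ncp=0, lower.tail = FALSE),NA),
         P_val = ifelse(df>0,pchisq(deviance, df, lower.tail = FALSE),NA)  )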
results


Both models, with the factor variable `Age` and with the quadratic `AgeNum`, are reasonably adequate.

Purely by AIC: model 4, i.e. a quadratic dependence on age.

p-value ->

Plot deviance residuals against fitted values

In [None]:
par(mfrow=c(2,2))
scatter.smooth(predict(dlc_m2, type='response'), rstandard(dlc_m2, type='deviance'))
scatter.smooth(sqrt(fitted(dlc_m2)), rstandard(dlc_m2, type='deviance'))

scatter.smooth(predict(dlc_m4, type='response'), rstandard(dlc_m4, type='deviance'))
scatter.smooth(sqrt(fitted(dlc_m4)), rstandard(dlc_m4, type='deviance'))

Which kinds of residual plots did we use for binomial regression? None!

But now we can.



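Question: Why plot the square root of the fitted values instead of the fitted values themselves?

It compensates for the dependence of the variance on the mean: for Poisson data $Var(Y) = \mu$, and the square root is the variance-stabilizing transformation.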
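
The effect of the square root can be seen in a small simulation. This is only an illustrative sketch; the means in `mus` are arbitrary, hypothetical values. The variance of a Poisson count grows with its mean, while the variance of its square root stays close to 1/4.

```r
set.seed(10)

# Arbitrary Poisson means, chosen only for illustration
mus <- c(10, 50, 200, 1000)

res <- sapply(mus, function(mu) {
  y <- rpois(1e5, mu)
  c(mean = mu, var_y = var(y), var_sqrt_y = var(sqrt(y)))
})

# One row per mean: var_y tracks the mean, var_sqrt_y stays near 0.25
round(t(res), 3)
```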

Plot residuals against predictors



In [None]:
par(mfrow=c(2,3))
plot(danishlc$Age,  rstandard(dlc_m2, type='deviance'), col='gray')
plot(as.numeric(danishlc$Age),  rstandard(dlc_m2, type='deviance'), col='gray')
scatter.smooth(danishlc$AgeNum,  rstandard(dlc_m2, type='deviance'), col='gray')

scatter.smooth(danishlc$Age, rstandard(dlc_m4, type='deviance'), col='gray')
scatter.smooth(danishlc$AgeNum, rstandard(dlc_m4, type='deviance'), col='gray')
scatter.smooth(danishlc$AgeNum^2, rstandard(dlc_m4, type='deviance'), col='gray')


Checking the link function

In [None]:
par(mfrow=c(1,2))
scatter.smooth(predict(dlc_m2, type='response'), resid(dlc_m2, type='working'), col='gray')
scatter.smooth(predict(dlc_m4, type='response'), resid(dlc_m4, type='working'), col='gray')

Residuals: working type.

Checking if Poisson regression is appropriate

Quantile residuals: 

Dunn and Smyth (2018) describe quantile residuals for discrete response variables. Their primary benefit is that they do not show the spurious patterns that the discreteness of the response produces in other residuals.
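
The idea behind randomized quantile residuals for a Poisson model can be sketched by hand. The code below is an illustration on simulated data and is not the `statmod` implementation; the names `lower`, `upper`, `u` and `qr` are hypothetical. For each observation, a uniform number is drawn between the fitted cumulative probabilities just below and at the observed count, and then mapped to the normal scale.

```r
set.seed(123)

# Hypothetical data from a correctly specified Poisson model
n   <- 500
x   <- runif(n)
y   <- rpois(n, exp(0.5 + x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Fitted CDF just below and at the observed count; ppois(-1, mu) is 0
lower <- ppois(y - 1, mu)
upper <- ppois(y, mu)

# Randomize within the jump of the discrete CDF, then map to the normal scale
u  <- runif(n, min = lower, max = upper)
qr <- qnorm(u)

# For a well-fitting model these behave approximately like N(0, 1) draws
c(mean = mean(qr), sd = sd(qr))
```

Because of the randomization, repeated runs without a fixed seed give slightly different residuals.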

In [None]:
install.packages("statmod")
library(statmod) # For quantile residuals
install.packages("surveillance")
library(surveillance) # For Anscombe residuals (these were covered in the lecture)

# Q-Q plot with a reference line drawn for the SAME residuals
qq_plot <- function(r, main) { qqnorm(r, main = main); qqline(r) }

par(mfrow=c(2,4))
# qresid() is randomized, so it is called once per plot
qq_plot(qresid(dlc_m2), "dlc_m2: quantile")
qq_plot(rstandard(dlc_m2, type="pearson"), "dlc_m2: Pearson")
qq_plot(rstandard(dlc_m2, type="deviance"), "dlc_m2: deviance")
qq_plot(anscombe.residuals(dlc_m2, 1), "dlc_m2: Anscombe")

qq_plot(qresid(dlc_m4), "dlc_m4: quantile")
qq_plot(rstandard(dlc_m4, type="pearson"), "dlc_m4: Pearson")
qq_plot(rstandard(dlc_m4, type="deviance"), "dlc_m4: deviance")
qq_plot(anscombe.residuals(dlc_m4, 1), "dlc_m4: Anscombe")



Outliers and influential observations



In [None]:
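n <- nrow(danishlc) # 24 observations
# Critical value for Cook's distance: 8/(n-2*p)
# Critical value for hat values: 2*p/n
# p = number of estimated coefficients = n - df.residual(model)

par(mfrow=c(2,2)) # Four plots: one row per model
plot(cooks.distance(dlc_m2), type='h',las=1, main="Cook's D, dlc_m2",ylab="Cook's distance, D")
plot(hatvalues(dlc_m2),ylim=c(0,1), main="Hat values, dlc_m2")
abline(h=2*(n-df.residual(dlc_m2))/n) # Horizontal line at 2*p/n

plot(cooks.distance(dlc_m4), type='h',las=1, main="Cook's D, dlc_m4",ylab="Cook's distance, D")
plot(hatvalues(dlc_m4),ylim=c(0,1), main="Hat values, dlc_m4")
abline(h=2*(n-df.residual(dlc_m4))/n) # Horizontal line at 2*p/n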


In [None]:
influence.measures(dlc_m2)
influence.measures(dlc_m4)


In [None]:
# Observation 12: 0.542327 0.293   * (the main influential point)

In [None]:
 which(influence.measures(dlc_m4)$is.inf[,'cook.d'] )
 which(influence.measures(dlc_m4)$is.inf[,'hat'] )

Task:
* Can you model the previous problem with a binomial distribution?
* If so, how? Run the experiment and model the probability of lung cancer. What is the odds ratio between people living in different locations? What is the odds ratio for people ten years older?

In [None]:
head(danishlc)
str(danishlc)

In [None]:
# Binomial response: cbind(successes, failures) = cbind(cases, non-cases)
response=cbind(danishlc$Cases,danishlc$Pop-danishlc$Cases)
# Pass the data explicitly instead of using attach()
model_b=glm(response~ AgeNum, family= binomial(), data=danishlc)

In [None]:
# Same model with age measured in units of ten years
model_bin=glm(response~ I(AgeNum/10), family= binomial(), data=danishlc)

In [None]:
summary(model_bin)

In [None]:
# Odds ratio for a one-year increase in age
exp(coef(model_b)[2])

In [None]:
# Odds ratio for people ten years older (AgeNum is scaled by 10 in model_bin)
exp(coef(model_bin)[2])
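
The link between the two age scalings can be verified on simulated data. The sketch below does not use `danishlc`; the group sizes `pop`, the ages and the true coefficients are hypothetical. The odds ratio for ten years from the model with `I(age / 10)` equals the one-year odds ratio raised to the tenth power.

```r
set.seed(2024)

# Hypothetical grouped data: 24 groups with 1000 people each
age   <- rep(c(40, 55, 60, 65, 70, 75), each = 4)
pop   <- rep(1000, length(age))
cases <- rbinom(length(age), size = pop, prob = plogis(-8 + 0.06 * age))

# Binomial GLM with cbind(successes, failures) as the response
fit_1  <- glm(cbind(cases, pop - cases) ~ age,         family = binomial())
fit_10 <- glm(cbind(cases, pop - cases) ~ I(age / 10), family = binomial())

or_1yr  <- unname(exp(coef(fit_1)[2]))   # odds ratio per one year of age
or_10yr <- unname(exp(coef(fit_10)[2]))  # odds ratio per ten years of age

c(or_1yr = or_1yr, or_10yr = or_10yr, or_1yr_to_10 = or_1yr^10)
```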

HW:

 * Plot the predictions and the observed numbers of cases from the previous model.
 * When is the saddlepoint approximation adequate for Poisson models? If the condition is violated, select similar groups and merge them.
 * Re-run the analysis with the newly grouped dataset.


 * Transform the data frame into the long format by `pivot_longer`, with a new variable `cancer` with levels `yes` and `no`.
 * Run the analysis with the new contingency table, where the columns are `cancer`, `age`, `city`, `number` (number of people in the group).
 

In [None]:
head(danishlc)