
01ZLMA - Exercise 08

Exercise 08 of the course 01ZLMA. 

# GLM for Discrete response - Binary Data Analysis

Alternative and Binomial responses

**Bernoulli (Alternative) Model**

$$Y_{i,j} \sim Be(\pi_i), \quad i = 1,\ldots,K, \quad j = 1,\ldots, n_i.$$
$K$ is the number of groups, $n_i$ is the number of observations in group $i$, and $\sum_{i=1}^{K} n_i = N$.
$$ E[Y_{i,j}] = \pi_i \quad \text{and} \quad g(\pi_i) = \eta_i = x_i^T \beta $$


**Binomial Model**
$$Y_i = \sum_{j=1}^{n_i} Y_{i,j} \sim Bi(n_i, \pi_i)$$
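
The Bernoulli and binomial formulations carry the same information about $\beta$. A minimal sketch on hypothetical toy counts (the data frame `toy`, its group labels, and all counts are made up for illustration and are not Titanic data) fits the model once on individual 0/1 responses and once on per-group counts. The coefficient estimates should coincide, while the residual deviances differ because the two fits are compared with different saturated models.

```r
# Hypothetical toy data (not from the notebook): K = 3 groups with n_i = 20, 30, 50
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), times = c(20, 30, 50))),
  y = c(rep(1:0, c(4, 16)), rep(1:0, c(15, 15)), rep(1:0, c(35, 15)))
)

# Bernoulli form: one row per observation Y_ij
fit_bern <- glm(y ~ g, family = binomial, data = toy)

# Binomial form: one row per group, response = cbind(successes, failures)
succ <- tapply(toy$y, toy$g, sum)
n_i  <- tapply(toy$y, toy$g, length)
grouped <- data.frame(g    = factor(names(succ)),
                      succ = as.vector(succ),
                      fail = as.vector(n_i - succ))
fit_bin <- glm(cbind(succ, fail) ~ g, family = binomial, data = grouped)

# Same estimates of beta, different residual deviances
cbind(bernoulli = coef(fit_bern), binomial = coef(fit_bin))
c(bernoulli = deviance(fit_bern), binomial = deviance(fit_bin))
```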

**Without a continuous covariate (only factor variables)**

$K$ is fixed and $n_i \rightarrow \infty$

**With at least one continuous covariate**

$n_i \approx 1$ (each $n_i$ is small) and $K \rightarrow \infty$



## Link functions for binary data

**Logistic function:**

The logit is the canonical link function for binary responses. Its inverse, the logistic function, is the CDF of the standard logistic distribution.

$$\pi_i = \frac{1}{1+e^{-x_i^T \beta}} $$ 


**Probit function:**

The inverse of the probit link is the CDF of the standard normal distribution.
$$\pi_i = \Phi({x_i^T \beta}) $$ 


**Cauchit function:**

The inverse of the cauchit link is the CDF of the standard Cauchy distribution.

$$\pi_i = \frac{1}{\pi}\text{arctan}(x_i^T \beta) + \frac{1}{2} $$ 


**Complementary log-log (cloglog) function:**

The inverse of the complementary log-log link (the CDF of the Gumbel minimum extreme-value distribution)

$$\pi_i = 1 - e^{-e^{x_i^T \beta}}$$

The counterpart of the cloglog link is the log-log link.
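
The four inverse links above can be checked numerically against the distribution functions they correspond to. The sketch below is only an illustration: the grid `eta` is an arbitrary choice, and the comparison relies on `stats::make.link()`, which is the same helper used in the plotting cell.

```r
eta <- seq(-3, 3, by = 0.5)  # grid of linear-predictor values (arbitrary choice)

# Inverse links written as distribution functions
by_hand <- cbind(
  logit   = plogis(eta),          # standard logistic CDF
  probit  = pnorm(eta),           # standard normal CDF
  cauchit = pcauchy(eta),         # standard Cauchy CDF: atan(eta) / pi + 1/2
  cloglog = 1 - exp(-exp(eta))    # Gumbel (minimum) CDF
)

# The same inverse links as provided by stats::make.link()
from_r <- sapply(colnames(by_hand), function(l) make.link(l)$linkinv(eta))

# Compare the two matrices column by column
all.equal(by_hand, from_r)
```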

In [None]:
library(tidyverse)
#library(Matrix)
#library(MASS)

In [None]:
? make.link

In [None]:
map(c("logit", "probit", "cauchit", "cloglog"),  make.link) %>%
map_df(
  function(link) {
    tibble(x = seq(-5, 5, length.out = 101),
           y = link$linkinv(x),
           link_name = link$name)
  }
  ) %>%
  ggplot(aes(x = x, y = y, colour = link_name)) +
  geom_line()

## Logistic regression with Titanic dataset

https://www.kaggle.com/c/titanic/data

| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
install.packages("titanic")
library(titanic)
knitr::kable(head(titanic_train))
summary(titanic_train)
summary(titanic_test)

In [None]:
# Number of NA's
colSums(is.na(titanic_train))
colSums(is.na(titanic_test))

We will modify `titanic_train` for our teaching purposes :-)

### Model where all covariates are discrete

In [None]:
data_dis <- titanic_train %>%
  dplyr::select(Survived,Pclass,Sex,Embarked)

data_dis %>% mutate_if(is.character,as.factor) %>% summary()

data_dis <- data_dis %>%
  filter(Embarked %in% c("C","Q","S")) %>%
  transmute(survived = as.factor(Survived),
            #survived = Survived,
            class = as.factor(Pclass),
            sex = as.factor(Sex),
            embarked = as.factor(Embarked))

summary(data_dis)           
str(data_dis)


In [None]:
install.packages("GGally")
library(GGally)
ggpairs(data_dis)

In [None]:
table_data_dis <- table(data_dis)
table_data_dis

In [None]:
#prop.table(table_data_dis)
#prop.table(table_data_dis,margin=2)*100
table(data_dis$survived,data_dis$class)
prop.table(table(data_dis$survived,data_dis$class),margin=1)*100
prop.table(table(data_dis$survived,data_dis$class),margin=2)*100

table(data_dis$survived,data_dis$sex)
prop.table(table(data_dis$survived,data_dis$sex),margin=2)*100

table(data_dis$survived,data_dis$embarked)
prop.table(table(data_dis$survived,data_dis$embarked),margin=2)*100



In [None]:
# Empirical odds ratio
OR        <- function(tab){tab[1,1]/tab[1,2]/(tab[2,1]/tab[2,2])}
table_sex <- table(data_dis$survived,data_dis$sex)
table_sex
OR(table_sex)


In [None]:
#install.packages("mosaic")
# Call via the namespace so the function is found without attaching mosaic.
mosaic::oddsRatio(table_sex, verbose = TRUE)

In [None]:
#install.packages("epitools")
library(epitools)
oddsratio.wald(table_sex, conf.level = 0.95)

In [None]:
chisq.test(table_sex)

### Null model

* Compute the null model (assume that the probability of survival was the same for all passengers).

* How do we interpret the estimated parameter?
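
For an intercept-only logit model, the maximum-likelihood fit reproduces the sample proportion, $\hat\pi = \bar y$, so $\hat\beta_0 = \log\{\bar y/(1-\bar y)\}$ is the log-odds of the overall proportion. A small check on a hypothetical 0/1 vector (made up for illustration, not Titanic data):

```r
y   <- c(rep(1, 3), rep(0, 7))        # hypothetical responses: 3 successes out of 10
fit <- glm(y ~ 1, family = binomial)

b0 <- unname(coef(fit))
c(intercept = b0, log_odds = qlogis(mean(y)))   # intercept vs. log-odds of mean(y)
c(odds = exp(b0), probability = plogis(b0))     # back-transformed scales
```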

In [None]:
mod0=glm(survived~1,family=binomial(link = "logit"),data_dis) #
summary(mod0)

In [None]:
# The odds of survival according to the training data.
exp(coef(mod0))

# The probability of survival.
exp(coef(mod0))/(1+exp(coef(mod0)))


### Model with variable: sex

* Compute the model with the single covariate `sex`.

* How can we interpret the estimated coefficients?

* Did survival depend on gender (`sex`)?

* Perform an appropriate test.

* Did women have higher odds of survival?


In [None]:
mod_sex=glm(survived~sex,family=binomial(link = "logit"),data_dis) #
summary(mod_sex)

Use deviance to test submodels `anova(model_1,model_2,test="Chisq")`.
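
The deviance test compares the drop in residual deviance between nested models with a $\chi^2$ distribution whose degrees of freedom equal the number of added parameters. A sketch on hypothetical grouped counts (the data frame `d` and its numbers are made up for illustration) computes the statistic by hand and compares it with `anova`:

```r
# Hypothetical grouped counts (made up for illustration)
d <- data.frame(group = factor(c("A", "B")),
                succ  = c(10, 40),
                fail  = c(30, 20))

fit0 <- glm(cbind(succ, fail) ~ 1,     family = binomial, data = d)  # submodel
fit1 <- glm(cbind(succ, fail) ~ group, family = binomial, data = d)  # larger model

# Likelihood-ratio (deviance) statistic and its chi-squared p-value
lr   <- deviance(fit0) - deviance(fit1)
df   <- df.residual(fit0) - df.residual(fit1)
pval <- pchisq(lr, df = df, lower.tail = FALSE)

c(LR = lr, df = df, p = pval)
anova(fit0, fit1, test = "Chisq")   # reports the same statistic and p-value
```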

In [None]:
# exp(Intercept): odds of survival for the reference level (female);
# exp(sexmale): odds ratio of survival, male vs. female.
exp(coef(mod_sex))
#sexmale:    0.081668331668578
anova(mod0,mod_sex,test="Chisq")


In [None]:
#Function to estimate OR with lower and upper limit of 95% CI for OR
OR_coef <- function(variable,model,CI){
  param <- coef(model)
  where <- grep(variable,names(param))[1]
  beta  <- param[where]
  se <- summary(model)$coef[where,2]
  or <- exp(beta)
  ci <- exp(beta+c(-1,1)*qnorm(CI/2+0.5)*se)
  out <- data.frame(or,ci[1],ci[2])
  names(out) <- c("OR","LCL","UCL")
  out
}
OR_coef("sex",mod_sex,0.95)

Compare with results obtained from contingency table.
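
In a 2x2 table with cells $a, b, c, d$ the two routes agree: the empirical odds ratio $ad/(bc)$ equals $\exp(\hat\beta_1)$ from the logistic model with one binary covariate, and the Wald standard error of $\hat\beta_1$ equals $\sqrt{1/a + 1/b + 1/c + 1/d}$. A sketch on hypothetical counts (the table `tab` is made up for illustration and is not the Titanic table):

```r
# Hypothetical 2x2 counts (rows: outcome 0/1, columns: group A/B); not Titanic data
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(outcome = c("0", "1"), group = c("A", "B")))

# Contingency-table route: odds ratio and Wald interval on the log scale
or_tab <- (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])
se_log <- sqrt(sum(1 / tab))
ci_tab <- exp(log(or_tab) + c(-1, 1) * qnorm(0.975) * se_log)

# Logistic-regression route on the same counts
grp <- data.frame(group = factor(c("A", "B")),
                  succ  = tab["1", ],   # outcome = 1
                  fail  = tab["0", ])   # outcome = 0
fit    <- glm(cbind(succ, fail) ~ group, family = binomial, data = grp)
est    <- summary(fit)$coefficients["groupB", ]
ci_glm <- exp(est["Estimate"] + c(-1, 1) * qnorm(0.975) * est["Std. Error"])

rbind(table = c(OR = or_tab, LCL = ci_tab[1], UCL = ci_tab[2]),
      glm   = c(exp(est["Estimate"]), ci_glm))
```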

### Your turn:

Estimate a model with the single covariate `class` and answer:

* Did survival depend on `class`?

* Perform an appropriate test.

* Compute the odds ratios between classes.

* Did passengers in second class have higher odds of survival than those in third class?


### Model with all discrete covariates without interactions

In [None]:
# Simple Logistic Regression model with all discrete covariates without interactions
mod1=glm(survived~.,family=binomial(link = "logit"),data_dis) #
summary(mod1)

Deviance tests to add/drop independent variables.

`drop1(model,test="Chisq")`

`add1(model,terms.to.add,test="Chisq")`

In [None]:
drop1(mod1,test="Chisq")


In [None]:
add1(mod0,survived~sex+class+embarked, test="Chisq")


In [None]:
data_dis2 <- mutate(data_dis, embarked = fct_recode(embarked, "Q" = "C"))
str(data_dis2)

mod1=glm(survived~.,family=binomial(link = "logit"),data_dis2) #
summary(mod1)

In [None]:
#mod1=glm(survived~relevel(factor(sex),ref="male")+class+embarked,family=binomial(link = "logit"),data_dis2) 
#summary(mod1)



In [None]:
OR_coef("sex",mod1,0.95)

Interpret the previous result:

* By what percentage are the odds of survival lower for men?

* Interpret the confidence interval and its significance.


Let's try a model with second-order interactions.


In [None]:
add1(mod1,~.^2,test="Chisq")

In [None]:
mod2_all <- glm(survived~(.)^2,family=binomial(link = "logit"),data_dis) #
summary(mod2_all)


In [None]:
step(mod2_all)

In [None]:
mod2 <- glm(survived~ class + sex + embarked + class:sex + sex:embarked,family=binomial(link = "logit"),data_dis) #
summary(mod2)

In [None]:
anova(mod2_all,mod2,test="Chisq")

Interpretation by odds ratios in models with interactions is more complicated; see the lecture notes.

Let's try a model with merged factor levels.




In [None]:
data_dis3 <- mutate(data_dis2, class = fct_recode(class, "2" = "1"))
str(data_dis3)

In [None]:
mod2 <- glm(survived~ class + sex + embarked + class:sex + sex:embarked,family=binomial(link = "logit"),data_dis3) #
summary(mod2)

In [None]:
mod3 <- glm(survived~ (.)^2,family=binomial(link = "logit"),data_dis3) #
anova(mod2,mod3,test="Chisq")


## Model with continuous independent variable.


In [None]:
str(titanic_train)

In [None]:
data_con <- titanic_train %>%
  dplyr::select(Survived,Pclass,Sex,Embarked,Age,Fare)

data_con %>% mutate_if(is.character,as.factor) %>% summary()

data_con <- data_con %>%
  filter(Embarked %in% c("C","Q","S")) %>%
  transmute(survived = as.factor(Survived),
            #survived = Survived,
            class = as.factor(Pclass),
            sex = as.factor(Sex),
            embarked = as.factor(Embarked),
            age = Age,
            fare = Fare) %>%
  drop_na()          

summary(data_con)           
str(data_con)

In [None]:
ggpairs(data_con  %>% dplyr::select(survived,age,fare,class))

In [None]:
ggplot(data_con, aes(x=sex, y=age, fill = survived)) + 
  geom_boxplot()+
  labs(title="Gender boxplot",x="Gender", y = "Age")+
  #geom_jitter(shape=16, position=position_jitter(0.2)) +
  stat_summary(fun=mean, geom="point", shape=23, size=3) +
  theme_classic()

In [None]:
ggplot(data_con, aes(x=class, y=fare, fill = survived)) + 
  geom_boxplot()+
  labs(title="Class x Fare",x="Class", y = "Fare")+
  #geom_jitter(shape=16, position=position_jitter(0.2)) +
  stat_summary(fun=mean, geom="point", shape=23, size=3) +
  theme_classic()

Continuous variable as factor

In [None]:
data_con_fac <- data_con %>%
  mutate(age = cut(age,
                    breaks=c(-Inf, 15, 50, Inf), 
                    labels=c("child","adult","senior")))
ggpairs(data_con_fac)

In [None]:
mod_0 <- glm(survived ~ 1, family = binomial,data = data_con_fac )

In [None]:
mod_age_fac <- glm(survived ~ age, family = binomial,data = data_con_fac )
summary(mod_age_fac)
exp(coef(mod_age_fac))

Are the odds of survival decreasing with increasing age?

In [None]:
anova(mod_age_fac,mod_0,test="Chisq")

In [None]:
mod_age <- glm(survived ~ I(age/10), family = binomial,data = data_con )
summary(mod_age)
exp(coef(mod_age))

Question:

* With each additional 10 years of age, the odds of survival decrease by 11%.

* What do you think about causality in this result?
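
Rescaling a covariate only rescales its coefficient: the coefficient of `I(age/10)` is ten times the coefficient of `age`, so the odds ratio per 10 years equals the per-year odds ratio raised to the tenth power, and the percentage change in odds is $100\,(\mathrm{OR} - 1)$. The sketch below uses simulated data; the variables `age` and `y` and the true coefficients are assumptions for illustration, not Titanic values.

```r
set.seed(1)
n   <- 500
age <- runif(n, 1, 80)                                  # simulated ages
y   <- rbinom(n, size = 1, prob = plogis(0.5 - 0.02 * age))

fit_year   <- glm(y ~ age,       family = binomial)     # effect per 1 year
fit_decade <- glm(y ~ I(age/10), family = binomial)     # effect per 10 years

or_year   <- exp(coef(fit_year)[["age"]])
or_decade <- exp(coef(fit_decade)[["I(age/10)"]])

c(per_year_to_10th = or_year^10, per_decade = or_decade)  # same quantity, two routes
100 * (or_decade - 1)                                     # % change in odds per 10 years
```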

In [None]:
anova(mod_age,mod_0,test="Chisq")

Question:

* Can we compare the models `mod_age` and `mod_age_fac` with a deviance test?
* Which model do you prefer and why?
* For which approach (factorized or continuous) is the saturated model useful and why?


In [None]:
#mod_sat_fac <- glm(survived ~ sex*age*embarked*class, family = binomial,data = (data_con %>% mutate(age= as.factor(age), fare = as.factor(fare)) ))
#summary(mod_sat_fac)

Your turn:

Consider a model with the continuous variables `age`, `fare`, and any factor variable.

* Create a factor `child`, which takes the values 1 (child) and 0 (adult).
* Create a factor from the variable `fare`, with level breaks every 10 pounds.
* Estimate a model where the odds of survival depend on the factorized `fare`, `sex`, and `child`.
* By what percentage are the odds of survival lower for an adult compared to a child?
* Does the probability of survival depend on fare? Test it.
* Assume that the odds of survival increase exponentially with fare. By how much do the odds of survival increase if a person spent an extra 10 pounds on a ticket?
* Build a model where the probability of survival depends on both `age` and `fare`. Are both covariates significant?

Next lessons (9,10):

* Logistic regression and binary classification (ROC, accuracy, ...)
* Residual analysis
* Prediction and confidence intervals
* Logistic regression and ML approach



In [None]:
#install.packages("epiDisplay")
# Call via the namespace so the function is found without attaching epiDisplay.
epiDisplay::lroc(mod1)

In [None]:
ggplot(data, aes(x=x, y=y, color = group, shape = group)) + 
  geom_point()+
  labs(title="Achievement score scatterplot",x="Aptitude scores", y = "Achievement score")+
  theme_classic()

The task is to determine whether the individual methods differ from one another. We first carry out the analysis using the derived formulas and then use the functions implemented in `R`.

## ANCOVA - using the derived formulas (see Lecture 07)

###  Saturated model

We estimate the parameters and compute the deviance statistic of the general model:

In [None]:
one <- rep(1,7)
zero <- rep(0,7)

Z <- matrix(c(one,zero,zero,zero,one,zero,zero,zero,one),ncol=3)
G <- diag(c(1/7,1/7,1/7))
P <- diag(rep(1,21)) - Z %*% G %*% t(Z)
A <- x %*% P %*% x
A1 <- solve(A)

beta <- as.numeric(A1 %*% x %*% P %*% y)
beta
u1 <- G %*% t(Z) %*% y
u2 <- G %*% t(Z) %*% x %*% beta
u <- u1 - u2
u

y.hat <- beta*x + Z%*%u
D <- crossprod(y-y.hat)
D
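
The closed-form estimates can be cross-checked against `lm`. The sketch below uses simulated data (the vectors `x`, `y`, `group` and all numeric values are assumptions for illustration; the exercise data are not reproduced here). It computes $\hat\beta = (x^T P x)^{-1} x^T P y$ with $P = I - Z (Z^T Z)^{-1} Z^T$, the group effects $\hat u$, and the residual sum of squares $D$, and compares them with the fit of `lm(y ~ x + group - 1)`.

```r
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 7))        # 3 methods, 7 subjects each
x <- rnorm(21, mean = 50, sd = 10)                      # simulated covariate
y <- 0.5 * x + c(2, 5, 7)[as.integer(group)] + rnorm(21)

Z <- model.matrix(~ group - 1)                          # 21 x 3 group indicators
P <- diag(21) - Z %*% solve(crossprod(Z)) %*% t(Z)      # projects out group means

beta <- as.numeric(crossprod(x, P %*% y) / crossprod(x, P %*% x))
u    <- as.numeric(solve(crossprod(Z)) %*% t(Z) %*% (y - beta * x))
D    <- sum((y - beta * x - Z %*% u)^2)                 # residual sum of squares

fit <- lm(y ~ x + group - 1)
rbind(by_hand = c(beta, u, D), lm = c(coef(fit), deviance(fit)))
```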


###  Reduced model


We estimate the parameters and compute the deviance statistic of the model under $H_0$:

In [None]:
Z0 <- rep(1,21)
G0 <- 1/21
P0 <- diag(rep(1,21)) - 1/21 * matrix(1,21,21)
A <- x %*% P0 %*% x
A1 <- solve(A)

beta0 <- as.numeric(A1 %*% x %*% P0 %*% y)
beta0

u0 <- mean(y) - mean(x)*beta0
u0

y.hat0 <- beta0*x + Z0*u0
D0 <- crossprod(y-y.hat0)
D0


### Comparing the models using the F statistic

In [None]:
F <- (D0-D)/2/(D/(21-3-1))
F
CV <- qf(0.95,2,17)
CV
p.val <- 1 - pf(F,2,17)
p.val

We reject the hypothesis that there is no difference between the methods.

### Significance test for the explanatory variable $x$


We test the hypothesis $H_0: \beta = 0$ by comparing the general model with the model under $H_0$,

In [None]:
mod.x <- lm(y~group-1)
summary(mod.x)

whose deviance is

In [None]:
Dx <- crossprod(y-fitted(mod.x))
Dx

The $F$ statistic comparing the two models is

In [None]:
F <- (Dx-D)/1/(D/(21-3-1)); F
CV <- qf(0.95,1,17); CV
p.val <- 1 - pf(F,1,17); p.val

so we reject the hypothesis, i.e. the variable $x$ is significant in the model. For illustration, we repeat the comparison using the `anova` function.

In [None]:
modAOC <- lm(y~x+group-1)  # general model
anova(mod.x, modAOC, test = "F")

### Multiple comparisons (Bonferroni)

Let us return to the general model. Because the hypothesis of equal effects of the individual methods was rejected, we carry out multiple comparisons to find out which pairs differ significantly.

The estimated fixed effects for the individual methods are

In [None]:
u.1 <- u[1]; u.2 <- u[2]; u.3 <- u[3]

and the table of their differences is

In [None]:
difu1<-c(u.1-u.2,u.1-u.3); difu1<-abs(difu1)
difu2<-c(0,u.2-u.3); difu2<-abs(difu2)
meanabs<-rbind(difu1,difu2)
c.names<-c("mean.g.B","mean.g.C")
r.names<-c("mean.g.A","mean.g.B")
dimnames(meanabs)<-list(r.names,c.names)
meanabs

We compute the critical values for the Bonferroni method of multiple comparisons

In [None]:
sigma.hat <- D/(21-3-1); sigma.hat
t.val <- qt(1-0.05/6,17); t.val

x.m1 <- mean(x[group=="A"])
x.m2 <- mean(x[group=="B"])
x.m3 <- mean(x[group=="C"])

n<-tapply(y, group, length)   # number of observations in each level of "group"

x.m <- c(rep(x.m1,n[1]),rep(x.m2,n[2]),rep(x.m3,n[3]))

Exx <- crossprod(x-x.m); Exx

BF.12 <- sqrt(sigma.hat)*t.val*sqrt(1/n[1] + 1/n[2] + 1/Exx*(x.m1-x.m2)^2)
BF.13 <- sqrt(sigma.hat)*t.val*sqrt(1/n[1] + 1/n[3] + 1/Exx*(x.m1-x.m3)^2)
BF.23 <- sqrt(sigma.hat)*t.val*sqrt(1/n[2] + 1/n[3] + 1/Exx*(x.m2-x.m3)^2)

# table of BF values
BF1<-c(BF.12,BF.13)
BF2<-c(0,BF.23)
BF<-rbind(BF1,BF2)
dimnames(BF)<-list(r.names,c.names)
BF

Comparing the values in the two tables shows which pairs differ significantly

In [None]:
SIGNIF<-meanabs>BF; SIGNIF


Method $A$ differs significantly from $B$, and method $A$ from $C$.

Plot of the data with the fitted model

In [None]:
plot(x, y, pch = c(15:17)[group], col = c("red","blue","black")[group], 
     xlab = "Before training", ylab = "After training")
legend("topleft",inset = .01, bty="n", legend = c("method A", "method B", "method C"), 
       pch = c(15:17), col = c("red","blue","black"), cex=0.9)
abline(coef = c(u[1],beta),col = "red")
abline(coef = c(u[2],beta),col = "blue")
abline(coef = c(u[3],beta),col = "black")

## ANOVA - using `R` functions

For illustration, we also carry out an analysis of variance (ANOVA), i.e. we do not consider the variable $x$.

In [None]:
is.factor(group)  # check that this is a factor variable

In [None]:
group = as.factor(group)

Two ways to obtain the analysis-of-variance table are

In [None]:
aov_m1 <- aov(y~group)
summary(aov_m1)

In [None]:
lm_m1 <- lm(y~group)
anova(lm_m1)

In [None]:
opar <- par(mfrow=c(2,2))
plot(aov(y~group))
#plot(lm_m1)
par(opar)

Conclusion: the variable `group` is significant, i.e. the training methods differ from one another even without accounting for the effect of the variable $x$.

We compute the means for the individual training groups

In [None]:
model.tables(aov(y~group), type="means")

or directly

In [None]:
tapply(y,group,mean)

and carry out multiple comparisons, this time with Tukey's HSD.

In [None]:
Tukey_CI <- TukeyHSD(aov_m1, c("group"), ordered = FALSE, conf.level = 0.95)
Tukey_CI


The output contains 95% confidence intervals for the differences in means and p-values for testing the hypothesis that a given difference is zero. Again we see a difference between groups $A$ and $B$ and between groups $A$ and $C$. Graphically, it looks as follows

In [None]:
plot(Tukey_CI)

If a displayed confidence interval does not contain 0, the corresponding difference is statistically significant.

Another option is the `multcomp` package, which also works for the ANCOVA model.

In [None]:
install.packages("multcomp")
library(multcomp)

In [None]:
amod<-aov(y~group) # fit the model
# Tukey multiple comparisons
Tukey <- glht(amod, linfct = mcp(group = "Tukey"))
summary(Tukey)

Alternatively, we can display box plots for the multiple comparisons

In [None]:
install.packages("multcompView")
library(multcompView)
multcompBoxplot(y~group, data=data,compFn="TukeyHSD",sortFn="mean", decreasing=TRUE)

Again we see that method $A$ differs from $B$ and $C$.

For illustration, let us also compare the Bonferroni method with the classical two-sample t-test

In [None]:
pairwise.t.test(y, group, p.adjust.method="bonferroni")
pairwise.t.test(y, group, "none")


We see a fairly substantial difference in the p-values.
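
The Bonferroni adjustment multiplies each raw p-value by the number of comparisons $m$ and caps the result at 1. A sketch with hypothetical raw p-values (the numbers and the names `AB`, `AC`, `BC` are made up for illustration):

```r
p_raw <- c(AB = 0.004, AC = 0.020, BC = 0.400)   # hypothetical unadjusted p-values
m     <- length(p_raw)                           # number of pairwise comparisons

by_hand  <- pmin(1, m * p_raw)                   # multiply by m, cap at 1
built_in <- p.adjust(p_raw, method = "bonferroni")

rbind(raw = p_raw, by_hand = by_hand, p.adjust = built_in)
```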

### Fisher LSD

In [None]:
#install.packages("agricolae")
library(agricolae)
#LSD_out <- LSD.test(aov_m1,"group", p.adj="bonferroni")
#LSD_out
LSD_out <- LSD.test(aov_m1,"group",18,1.5)
LSD_out



## ANCOVA - using `R` functions


In [None]:
modAOC <- lm(y~x+group-1)
summary(modAOC)
anova(modAOC)

Conclusion: both variables are significant.

Estimated coefficients

In [None]:
coef<-summary(modAOC)$coefficients
coef

Model under $H_0$

In [None]:
modAOC.0 <- lm(y~x)
summary(modAOC.0)
anova(modAOC.0)

Model comparison:

In [None]:
anova(modAOC.0, modAOC, test = "F")

The comparison can also be done directly

In [None]:
modAOC1 <- lm(y~x+group)
#summary(modAOC1)
anova(modAOC1)

Multiple comparisons (Tukey HSD)

In [None]:
amod<-aov(y~x+group)
posthoc <- glht(amod, linfct = mcp(group = "Tukey"))
summary(posthoc)
confint(posthoc)

In [None]:
plot(posthoc)

Caution: the following function works only for the ANOVA model!

In [None]:
CI<-TukeyHSD(aov(y~x+group), which="group")
CI

We also need to analyse the residuals and influential observations. The ANCOVA model assumes normality of the residuals and a common value of the parameter $\sigma^2$.
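
Leverages are the diagonal elements $h_{ii}$ of the hat matrix. They sum to the number of parameters $p$, so their average is $p/n$, and a common rule of thumb flags observations with $h_{ii} > 2p/n$. A sketch on R's built-in `cars` data (chosen only because it ships with R; it is unrelated to the training-method example), using the same cut-offs as in the cells that follow:

```r
fit <- lm(dist ~ speed, data = cars)   # built-in dataset, for illustration only
X   <- model.matrix(fit)
n   <- nrow(X); p <- ncol(X)

h <- hatvalues(fit)                    # leverages h_ii
sum(h)                                 # trace of the hat matrix, equal to p

which(h > 2 * p / n)                              # flagged as high leverage
which(cooks.distance(fit) > 8 / (n - 2 * p))      # flagged by Cook's distance
```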

In [None]:
X<-model.matrix(modAOC)
n<-nrow(X); p<-ncol(X)
fit <- predict(modAOC, type = "response")

In [None]:
# leverage points
hii <- hatvalues(modAOC)
# criterion for high-leverage observations
Infl<-hii>2*p/n; Infl
# Cook's distance
c.d <- cooks.distance(modAOC)


The graphical display

In [None]:
par(mfrow=c(1,2))

plot(hii,col="red", cex=1.5, lwd=2, ylim = c(0,0.4)) 
abline(2*p/(n),0)

plot(c.d,col="red", cex=1.5, lwd=2, ylim = c(0,0.7))
abline(8/(n-2*p),0)

shows no suspicious points.

We also compute the studentized residuals

In [None]:
res <- rstudent(modAOC)

and test their normality with the Shapiro-Wilk test

In [None]:
shapiro.test(res)

The hypothesis of normality of the residuals was not rejected. We also carry out a graphical analysis of the residuals.

In [None]:
par(mfrow=c(2,2))
# QQplot
qqnorm(res)
qqline(res)
# residuals vs. fitted values
plot(fitted(modAOC),res, col="red", xlab="Fitted values", ylab="Residuals", cex=1.5, lwd=2)
abline(0,0)
# residuals vs. x
plot(x,res,col="red", xlab="Variable x", ylab="Residuals", cex=1.5, lwd=2)
abline(0,0)
# residuals vs. group
plot(group,res,col="red", xlab="Training method", ylab="Residuals", cex=1.5, lwd=2)
abline(0,0)

Here too there is no obvious problem, so the model assumptions can be considered satisfied.

Another way to display the residuals is, for example,

In [None]:
plot(modAOC, which = 1)