<a href="https://colab.research.google.com/github/jazkre/01ZLMA/blob/main/R/01ZLMA_ex07_Binary_Data_1_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercise 07 of the course 01ZLMA. 

# GLM for Discrete response - Binary Data Analysis

Alternative and Binomial responses

**Bernoulli (Alternative) Model**

$$Y_{i,j} \sim Be(\pi_i), \quad i = 1,\ldots,K, \quad j = 1,\ldots, n_i.$$
$K$ is the number of groups, $n_i$ is the number of observations in group $i$, and $\sum_{i=1}^{K} n_i = N$.
$$ E[Y_{i,j}] = \pi_i \ \text{and} \ g(\pi_i) = \eta_i = x_i^T \beta $$


**Binomial Model**
$$Y_i = \sum_{j=1}^{n_i} Y_{i,j} \sim Bi(n_i, \pi_i)$$

**Without a continuous covariate (only factor variables)**

$K$ is constant and $n_i \rightarrow \infty $

**With at least one continuous covariate**

$n_i \approx 1$ (each $n_i$ is small) and $K \rightarrow \infty$
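
The relation between the Bernoulli and the binomial formulation can be checked numerically. The following is a minimal sketch on hypothetical toy data (the group labels, counts, and object names are made up for illustration): fitting individual 0/1 responses and fitting grouped counts should give the same coefficient estimates, while the residual deviances differ because the two saturated models differ.

```r
# Hypothetical toy data: K = 3 groups with n_i = 20, 30, 50 observations
g <- factor(rep(c("a", "b", "c"), times = c(20, 30, 50)))
y <- c(rep(1, 5),  rep(0, 15),   # group a:  5 successes out of 20
       rep(1, 15), rep(0, 15),   # group b: 15 successes out of 30
       rep(1, 40), rep(0, 10))   # group c: 40 successes out of 50

# Bernoulli model: one row per observation Y_{i,j}
fit_bern <- glm(y ~ g, family = binomial)

# Binomial model: one row per group, response given as (successes, failures)
grouped <- data.frame(g = factor(c("a", "b", "c")),
                      s = c(5, 15, 40),
                      n = c(20, 30, 50))
fit_binom <- glm(cbind(s, n - s) ~ g, family = binomial, data = grouped)

# The coefficient estimates agree (up to numerical tolerance)
same_coef <- all.equal(unname(coef(fit_bern)), unname(coef(fit_binom)),
                       tolerance = 1e-4)
same_coef

# The residual deviances differ: the grouped fit is saturated here
c(bernoulli = deviance(fit_bern), binomial = deviance(fit_binom))
```

This is one reason why the residual deviance is only usable as a goodness-of-fit measure in the grouped setting with large $n_i$.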



## Link functions for binary data

**Logistic function:**

The logit is the canonical link function for binary responses. Its inverse, the logistic function, is the CDF of the standard logistic distribution.

$$g(\pi_i)=\log\left(\frac{\pi_i}{1-\pi_i}\right) $$

$$\pi_i = \frac{1}{1+e^{-x_i^T \beta}} = \frac{e^{x_i^T \beta}}{1+e^{x_i^T \beta}} $$ 




**Probit function:**

The inverse link is the CDF of the standard normal distribution.
$$\pi_i = \Phi({x_i^T \beta}) $$ 


**Cauchit function:**

The inverse link is the CDF of the standard Cauchy distribution.

$$\pi_i = \frac{1}{\pi}\text{arctan}(x_i^T \beta) + \frac{1}{2} $$ 


**Complementary log-log (cloglog) function:**

The inverse of the complementary log-log link is the CDF of the Gumbel distribution for minima.

$$\pi_i = 1 - e^{-e^{x_i^T \beta}}$$

The counterpart of the cloglog link is the log-log link function.
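
The correspondence between inverse links and CDFs can be verified directly in base R. This is a small sketch; the grid `eta` and the object `checks` are illustrative names, not part of the exercise.

```r
# Grid of linear predictor values (illustrative)
eta <- seq(-3, 3, by = 0.5)

# Compare each inverse link from make.link() with the matching CDF
checks <- c(
  logit   = isTRUE(all.equal(make.link("logit")$linkinv(eta),   plogis(eta))),
  probit  = isTRUE(all.equal(make.link("probit")$linkinv(eta),  pnorm(eta))),
  cauchit = isTRUE(all.equal(make.link("cauchit")$linkinv(eta), pcauchy(eta))),
  cloglog = isTRUE(all.equal(make.link("cloglog")$linkinv(eta),
                             1 - exp(-exp(eta))))
)
checks
```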

In [None]:
library(tidyverse)
#library(Matrix)
#library(MASS)

In [None]:
? make.link

In [None]:
map(c("logit", "probit", "cauchit", "cloglog"),  make.link) %>%
map_df(
  function(link) {
    tibble(x = seq(-5, 5, length.out = 101),
           y = link$linkinv(x),
           link_name = link$name)
  }
  ) %>%
  ggplot(aes(x = x, y = y, colour = link_name)) +
  geom_line()

## Logistic regression with Titanic dataset

https://www.kaggle.com/c/titanic/data

| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
install.packages("titanic")
library(titanic)
knitr::kable(head(titanic_train))

In [None]:

summary(titanic_train)
summary(titanic_test)

In [None]:
# Number of NA's
colSums(is.na(titanic_train))
colSums(is.na(titanic_test))

We will modify `titanic_train` for our academic purposes :-)

### Model where all covariates are discrete

In [None]:
data_dis <- titanic_train %>%
  dplyr::select(Survived,Pclass,Sex,Embarked)

data_dis %>% mutate_if(is.character,as.factor) %>% summary()

data_dis <- data_dis %>%
  filter(Embarked %in% c("C","Q","S")) %>%
  transmute(survived = as.factor(Survived),
            #survived = Survived,
            class = as.factor(Pclass),
            sex = as.factor(Sex),
            embarked = as.factor(Embarked))

summary(data_dis)           
str(data_dis)


In [None]:
install.packages("GGally")
library(GGally)
ggpairs(data_dis)

In [None]:
table_data_dis <- table(data_dis)
table_data_dis

In [None]:
#prop.table(table_data_dis)
#prop.table(table_data_dis,margin=2)*100
table(data_dis$survived,data_dis$class)
prop.table(table(data_dis$survived,data_dis$class),margin=1)*100
prop.table(table(data_dis$survived,data_dis$class),margin=2)*100


In [None]:
# Count observations
table(data_dis$survived,data_dis$sex)
# Conditional proportions given columns
prop.table(table(data_dis$survived,data_dis$sex),margin=2)*100
# Conditional proportions given rows
prop.table(table(data_dis$survived,data_dis$sex),margin=1)*100

In [None]:

table(data_dis$survived,data_dis$embarked)
prop.table(table(data_dis$survived,data_dis$embarked),margin=2)*100
? prop.table


In [None]:
#table_sex[1,2]

In [None]:
# Odds ratio (empirical odds ratio)
OR        <- function(tab){tab[1,1]/tab[1,2]/(tab[2,1]/tab[2,2])}
table_sex <- table(data_dis$survived,data_dis$sex)
table_sex
OR(table_sex)
# Men's odds of survival are OR times the women's odds (OR < 1 means lower odds for men)

In [None]:
#install.packages("mosaic")
#library(mosaic)
#oddsRatio(table_sex, verbose = TRUE)

In [None]:
install.packages("epitools")
library(epitools)
oddsratio.wald(table_sex, conf.level = 0.95)
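
The Wald interval for an odds ratio can also be computed by hand. The sketch below uses a hypothetical 2x2 table with made-up counts; it shows the usual normal approximation on the log scale, which is presumably what `oddsratio.wald` refers to, although the orientation of rows and columns used by that function may differ.

```r
# Hypothetical 2x2 table (rows: outcome, columns: group); counts are made up
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(outcome = c("0", "1"), group = c("A", "B")))

# Cross-product ratio; identical to tab[1,1]/tab[1,2]/(tab[2,1]/tab[2,2])
or_hat <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Wald standard error of log(OR) and the resulting 95% interval
se_log_or <- sqrt(sum(1 / tab))
ci <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)

c(OR = or_hat, LCL = ci[1], UCL = ci[2])
```

An interval that does not contain 1 corresponds to rejecting the hypothesis of no association at the matching level.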

In [None]:
chisq.test(table_sex)

### Null model

* Compute the null model (assume that the probability of survival was the same for all passengers).

* How do we interpret the estimated parameter?

In [None]:
mod0=glm(survived~1,family=binomial(link = "logit"),data_dis) #
summary(mod0)

In [None]:
# The odds of survival in the training data.
exp(coef(mod0))

# The probability of survival.
exp(coef(mod0))/(1+exp(coef(mod0)))
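
For the interpretation, it may help to see that the intercept of the null logit model is just the logit of the observed proportion. A minimal sketch with hypothetical responses (30 successes among 100 observations; the names `y`, `fit0`, and `b0` are illustrative):

```r
# Hypothetical responses: 30 successes among 100 observations
y <- c(rep(1, 30), rep(0, 70))
fit0 <- glm(y ~ 1, family = binomial(link = "logit"))

b0 <- unname(coef(fit0))
c(intercept    = b0,
  sample_logit = qlogis(mean(y)),   # log(0.3 / 0.7)
  probability  = plogis(b0),        # back-transformed intercept
  sample_mean  = mean(y))
```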


### Model with variable: sex

* Compute the model with one covariate, `sex`.

* How can we interpret the estimated coefficients?

* Did survival depend on gender (`sex`)?

* Perform an appropriate test.

* Did women have better odds of survival?


In [None]:
mod_sex=glm(survived~sex,family=binomial(link = "logit"),data_dis) #
summary(mod_sex)

Use deviance to test submodels `anova(model_1,model_2,test="Chisq")`.
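
What `anova(..., test="Chisq")` computes can be reproduced by hand: the test statistic is the difference of residual deviances, compared with a chi-squared distribution whose degrees of freedom equal the difference in residual degrees of freedom. A minimal sketch on hypothetical toy data (all names and counts are made up):

```r
# Hypothetical toy data: binary response and a two-level factor
toy <- data.frame(
  y = c(rep(1, 30), rep(0, 10), rep(1, 15), rep(0, 25)),
  x = factor(rep(c("A", "B"), each = 40))
)
fit_small <- glm(y ~ 1, family = binomial, data = toy)  # submodel
fit_big   <- glm(y ~ x, family = binomial, data = toy)  # larger model

# Likelihood-ratio (deviance) test by hand
lr_stat <- deviance(fit_small) - deviance(fit_big)
df_diff <- df.residual(fit_small) - df.residual(fit_big)
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# The same numbers as reported by anova()
tab <- anova(fit_small, fit_big, test = "Chisq")
c(by_hand = p_value, anova = tab$`Pr(>Chi)`[2])
```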

In [None]:
# Baseline odds (intercept) and odds ratio (sexmale) of survival in the training data.
exp(coef(mod_sex))
#sexmale:    0.081668331668578
anova(mod_sex,mod0,test="Chisq")


In [None]:
#Function to estimate OR with lower and upper limit of 95% CI for OR
OR_coef <- function(variable,model,CI){
  param <- coef(model)
  where <- grep(variable,names(param))[1]
  beta  <- param[where]
  se <- summary(model)$coef[where,2]
  or <- exp(beta)
  ci <- exp(beta+c(-1,1)*qnorm(CI/2+0.5)*se)
  out <- data.frame(or,ci[1],ci[2])
  names(out) <- c("OR","LCL","UCL")
  out
}
OR_coef("sex",mod_sex,0.95)

Compare with results obtained from contingency table.
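
With a single binary covariate the two approaches coincide: the exponentiated coefficient equals the cross-product ratio of the 2x2 table. The sketch below illustrates this on a hypothetical data frame `toy` with made-up counts, not on the Titanic data.

```r
# Hypothetical data: 40 women (30 survived) and 40 men (8 survived)
toy <- data.frame(
  survived = factor(c(rep(0, 10), rep(1, 30), rep(0, 32), rep(1, 8))),
  sex      = factor(rep(c("female", "male"), each = 40))
)

# Odds ratio from the contingency table (same formula as OR() above)
tab <- table(toy$survived, toy$sex)
or_table <- tab[1, 1] / tab[1, 2] / (tab[2, 1] / tab[2, 2])

# Odds ratio from the logistic regression coefficient
fit <- glm(survived ~ sex, family = binomial, data = toy)
or_model <- unname(exp(coef(fit)["sexmale"]))

c(table = or_table, model = or_model)
```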

### Your turn:

Estimate model with one covariate `class` and compute: 

1. Did survival depend on `class`?

2. Perform an appropriate test.

3. Compute odds ratios between classes.

4. Did passengers in second class have better odds of survival than those in third class?


In [None]:
#1. 
mod_class <- glm(survived~class,family=binomial(link = "logit"),data_dis)
summary(mod_class)

In [None]:
#2. 
anova(mod_class,mod0,test = "Chisq")

In [None]:
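#3.
OR_coef("class",mod_class,0.95)  # returns only the first match (2nd vs. 1st class)

# Coefficients of mod_class (the original run gave 0.5158, -0.6246, -1.6556)
beta0 <- coef(mod_class)[["(Intercept)"]]
beta1 <- coef(mod_class)[["class2"]]
beta2 <- coef(mod_class)[["class3"]]

exp(beta1) # odds ratio of survival, 2nd vs. 1st class
exp(beta2) # odds ratio of survival, 3rd vs. 1st class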



In [None]:
#4.
exp(beta2-beta1) # odds ratio of survival, 3rd vs. 2nd class (a value below 1 favours 2nd class)

mod_class_2 <- glm(survived~relevel(factor(class),ref = "2"),family=binomial(link = "logit"),data_dis)
summary(mod_class_2)
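
Releveling is one way to get the odds ratio between two non-reference levels together with its standard error. An alternative is to treat the difference of coefficients as a linear contrast and use the estimated covariance matrix. The sketch below uses hypothetical toy data (column `cls` and all counts are made up) and checks that both routes agree.

```r
# Hypothetical data: three classes with 40 passengers each
toy <- data.frame(
  y   = c(rep(1, 30), rep(0, 10), rep(1, 20), rep(0, 20), rep(1, 10), rep(0, 30)),
  cls = factor(rep(c("1", "2", "3"), each = 40))
)
fit <- glm(y ~ cls, family = binomial, data = toy)

# Contrast beta_cls3 - beta_cls2 and its standard error from vcov()
contrast <- c(0, -1, 1)
est <- sum(contrast * coef(fit))
se  <- sqrt(drop(t(contrast) %*% vcov(fit) %*% contrast))

or_3_vs_2 <- exp(est)
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Same quantity from a model with class 2 as the reference level
fit_ref2 <- glm(y ~ relevel(cls, ref = "2"), family = binomial, data = toy)
or_relevel <- unname(exp(coef(fit_ref2)[3]))
se_relevel <- summary(fit_ref2)$coef[3, 2]

c(OR = or_3_vs_2, LCL = ci[1], UCL = ci[2], OR_relevel = or_relevel)
```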

### Model with all discrete covariates without interactions

In [None]:
# Simple Logistic Regression model with all discrete covariates without interactions
mod1=glm(survived~.,family=binomial(link = "logit"),data_dis) #
summary(mod1)

Deviance tests to add/drop independent variables.

`drop1(model,test="Chisq")`

`add1(model,terms.to.add,test="Chisq")`

In [None]:
drop1(mod1,test="Chisq")


In [None]:
add1(mod0,survived~sex+class+embarked, test="Chisq")


In [None]:
data_dis2 <- mutate(data_dis, embarked = fct_recode(embarked, "Q" = "C"))
str(data_dis2)

mod1=glm(survived~.,family=binomial(link = "logit"),data_dis2) #
summary(mod1)

In [None]:
#mod1=glm(survived~relevel(factor(sex),ref="male")+class+embarked,family=binomial(link = "logit"),data_dis2) 
#summary(mod1)



In [None]:
OR_coef("sex",mod1,0.95)

Interpret the previous result:

* By what percentage are the odds of survival lower for men?

* Interpret the confidence interval and its significance.


Let's try a model with second-order interactions.


In [None]:
add1(mod1,~.^2,test="Chisq")

In [None]:
mod2_all <- glm(survived~(.)^2,family=binomial(link = "logit"),data_dis) #
summary(mod2_all)


In [None]:
step(mod2_all)

In [None]:
mod2 <- glm(survived~ class + sex + embarked + class:sex + sex:embarked,family=binomial(link = "logit"),data_dis) #
summary(mod2)

In [None]:
anova(mod2_all,mod2,test="Chisq")

Interpreting odds ratios in models with interactions is more complicated, see the lecture notes.
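
As a brief illustration of why, consider a logit model with two binary covariates $s$ and $c$ (for example an indicator for `sex` and an indicator for one level of another factor) and their interaction. The notation is introduced only for this sketch and is not taken from the lecture notes.

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_s s + \beta_c c + \beta_{sc}\, s\, c$$

$$OR_{s \mid c=0} = e^{\beta_s}, \qquad OR_{s \mid c=1} = e^{\beta_s + \beta_{sc}}$$

The odds ratio for $s$ is no longer a single number: it depends on the level of $c$, and $e^{\beta_{sc}}$ is the ratio of the two odds ratios.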

Let's try a model with merged factor levels.




In [None]:
data_dis3 <- mutate(data_dis2, class = fct_recode(class, "2" = "1"))
str(data_dis3)

In [None]:
mod2 <- glm(survived~ class + sex + embarked + class:sex + sex:embarked,family=binomial(link = "logit"),data_dis3) #
summary(mod2)

In [None]:
mod3 <- glm(survived~ (.)^2,family=binomial(link = "logit"),data_dis3) #
anova(mod2,mod3,test="Chisq")


## Model with continuous independent variable.


Discuss (again) how this differs from models without a continuous variable.

In [None]:
str(titanic_train)

In [None]:
data_con <- titanic_train %>%
  dplyr::select(Survived,Pclass,Sex,Embarked,Age,Fare)

data_con %>% mutate_if(is.character,as.factor) %>% summary()

data_con <- data_con %>%
  filter(Embarked %in% c("C","Q","S")) %>%
  transmute(survived = as.factor(Survived),
            #survived = Survived,
            class = as.factor(Pclass),
            sex = as.factor(Sex),
            embarked = as.factor(Embarked),
            age = Age,
            fare = Fare) %>%
  drop_na()          

data_con <- na.omit(data_con)
summary(data_con)           
str(data_con)

In [None]:
ggpairs(data_con  %>% dplyr::select(survived,age,fare,class))

In [None]:
ggplot(data_con, aes(x=sex, y=age, fill = survived)) + 
  geom_boxplot()+
  labs(title="Gender boxplot",x="Gender", y = "Age")+
  #geom_jitter(shape=16, position=position_jitter(0.2)) +
  stat_summary(fun=mean, geom="point", shape=23, size=3) +
  theme_classic()

In [None]:
ggplot(data_con, aes(x=class, y=fare, fill = survived)) + 
  geom_boxplot()+
  labs(title="Class x Fare",x="Class", y = "Fare")+
  #geom_jitter(shape=16, position=position_jitter(0.2)) +
  stat_summary(fun=mean, geom="point", shape=23, size=3) +
  theme_classic()

Continuous variable as factor

In [None]:
data_con_fac <- data_con %>%
  mutate(age = cut(age,
                    breaks=c(-Inf, 15, 50, Inf), 
                    labels=c("child","adult","senior")))
ggpairs(data_con_fac)

In [None]:
mod_0 <- glm(survived ~ 1, family = binomial,data = data_con_fac )

In [None]:
mod_age_fac <- glm(survived ~ age, family = binomial,data = data_con_fac )
summary(mod_age_fac)
exp(coef(mod_age_fac))

Do the odds of survival decrease with increasing age?

In [None]:
anova(mod_age_fac,mod_0,test="Chisq")

In [None]:
mod_age <- glm(survived ~ I(age/10), family = binomial,data = data_con ) # effect of a 10-year increase in age
summary(mod_age)
exp(coef(mod_age))

Question:

* With each 10-year increase in age, the odds of survival decreased by about 11%.

* What do you think about causality in this result?
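
Rescaling the covariate with `I(age/10)` does not change the model, only the unit in which the odds ratio is reported. A minimal sketch on hypothetical toy data (the ages and success counts are made up) shows that the odds ratio per decade equals the per-year odds ratio raised to the tenth power:

```r
# Hypothetical data: 8 age groups with 10 people each and made-up survivor counts
age  <- rep(seq(5, 75, by = 10), each = 10)
succ <- c(7, 6, 6, 5, 5, 4, 3, 3)
y    <- unlist(lapply(succ, function(s) c(rep(1, s), rep(0, 10 - s))))

fit_year   <- glm(y ~ age,       family = binomial)  # odds ratio per 1 year
fit_decade <- glm(y ~ I(age/10), family = binomial)  # odds ratio per 10 years

or_from_year   <- unname(exp(10 * coef(fit_year)[2]))
or_from_decade <- unname(exp(coef(fit_decade)[2]))

# Percentage change in the odds for a 10-year increase in age
pct_change <- 100 * (or_from_decade - 1)
c(or_from_year = or_from_year, or_from_decade = or_from_decade,
  pct_change = pct_change)
```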

In [None]:
anova(mod_age,mod_0,test="Chisq")

Question:

* Can we use the deviance test to compare the models `mod_age` and `mod_age_fac`?
* Which model do you prefer and why?
* For which approach (factorized or continuous) is the saturated model useful, and why?


In [None]:
#mod_sat_fac <- glm(survived ~ sex*age*embarked*class, family = binomial,data = (data_con %>% mutate(age= as.factor(age), fare = as.factor(fare)) ))
#summary(mod_sat_fac)

In [None]:
# The saturated model is useful for the factorized approach, but the continuous approach lets us answer questions
# that the factorized one cannot (e.g. the difference between a 25-year-old and a 35-year-old).

Your turn:

Consider a model with the continuous variables `age`, `fare`, and any factor variable.

* Create a factor `child`, which takes the values 1 (child) and 0 (adult).
* Create a factor from the variable `fare`, with level breaks every 10 pounds.
* Estimate a model where the odds of survival depend on the factorized `fare`, `sex`, and `child`.
* By what percentage are the odds of survival lower for an adult compared to a child?
* Does the probability of survival depend on fare? Test it.
* Assume that the odds of survival increase exponentially with fare. By how much do the odds of survival increase if a person spends an extra 10 pounds on a ticket?
* Build a model where the probability of survival depends on both `age` and `fare`. Are both covariates significant?

*   We use *data_con*, where the variables *fare* and *age* are continuous and *sex* is a factor.



In [None]:
str(data_con)

model1 <- glm(survived ~ age + fare + sex,family = binomial, data = data_con)
summary(model1)

*   We create the factor variables *child* and *fare*:



In [None]:
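# child: 1 if younger than 16, otherwise 0
data_con$child <- factor(ifelse(data_con$age < 16 , 1, 0))
# fare in bins of width 20 (use by = 10 for the 10-pound bins asked for above);
# seq() stops at the last multiple of 20 not exceeding max(fare), so larger fares become NA
data_con$fare_factor <- cut(data_con$fare, breaks = seq(0, max(data_con$fare), by=20), include.lowest=TRUE)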

data_con <- na.omit(data_con)
summary(data_con)
str(data_con)

*   Model depending on the factorized variable *fare*, on *sex*, and on *child*:

In [None]:
model2 <- glm(survived ~ sex + child + fare_factor, family = binomial, data = data_con)
summary(model2)

*   To express by what percentage an adult's odds of survival are lower, we first compute the odds ratio of an adult versus a child.



In [None]:
OR_child <- exp(coef(model2)[3])    # odds ratio of survival, child vs. adult
OR_child

OR_adult <- 1/OR_child              # odds ratio of survival, adult vs. child
OR_adult

percentage_change <- (OR_adult -1)*100   # percentage change in the odds for an adult
percentage_change

> From the odds ratio we get that a child has 2.18 times higher odds of survival, so the odds of survival for an adult are 54% lower.





*   We fit a model without the variable *fare* and use the anova test to compare it with model 2, which includes *fare*.



In [None]:
model_without_fare <- glm(survived ~ sex + child, family = binomial, data = data_con)
summary(model_without_fare)

anova(model2,model_without_fare,test = "Chisq")

> According to the anova test, the model with the variable *fare* differs significantly from the model without it, so the probability of survival does depend on *fare*.



*   By what percentage do the odds of survival increase when a passenger pays an extra 10 pounds?




In [None]:
model3 <- glm(survived ~ I(fare/10), family = binomial, data = data_con)
summary(model3)

> If a passenger spends 10 pounds more, the odds of survival increase by 15.7%.



* Model depending on the variables *age* and *fare*:



In [None]:
model_age_fare <- glm(survived ~ age + fare, family = binomial(link = logit), data = data_con)
summary(model_age_fare)

> Both variables come out statistically significant.



Next Exercises (8 and 9):

* Logistic regression and binary classification (ROC, accuracy, ...)
* Residual analysis
* Prediction and confidence intervals
* Logistic regression with  ML approach

