
# **Statistics 1**

## ***Preparations***

In [47]:
# Install necessary packages
install.packages("stargazer")

# Loading necessary packages
library(ggplot2)
library(broom)
library(stargazer)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



## **1. Identifying the data type of each variable in the dataset**

1. Chronic Illness: Binary categorical (0 or 1)
2. Age: Continuous numerical
3. Gender: Binary categorical (0 for male, 1 for female)
4. Current Smoker: Binary categorical (0 or 1)
5. Vigorous Exercise: Binary categorical (0 or 1)
6. Lonely: Binary categorical (0 or 1)

In [48]:

# Placeholder vectors with two values per variable (not the full dataset)
chronic_illness <- c(0, 1)
age <- c(65, 84)  # Assuming a continuous variable
gender <- c(0, 1)
current_smoker <- c(0, 1)
vigorous_exercise <- c(0, 1)
lonely <- c(0, 1)
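
With real data, the types listed above could be made explicit by storing the binary predictors as factors. The following is a minimal sketch, assuming a hypothetical data frame `df` whose rows are invented for illustration and are not the study data.

```r
# Hypothetical example rows; all values are invented for illustration only
df <- data.frame(
  chronic_illness   = c(0, 1, 1, 0),
  age               = c(65, 84, 72, 69),
  gender            = c(0, 1, 1, 0),
  current_smoker    = c(0, 1, 0, 0),
  vigorous_exercise = c(1, 0, 0, 1),
  lonely            = c(0, 1, 0, 1)
)

# Encode the binary predictors as two-level factors with readable labels
df$gender <- factor(df$gender, levels = c(0, 1), labels = c("male", "female"))
binary_vars <- c("current_smoker", "vigorous_exercise", "lonely")
df[binary_vars] <- lapply(df[binary_vars], factor,
                          levels = c(0, 1), labels = c("no", "yes"))

# Inspect the types: age stays numeric, the predictors above become factors
str(df)
```

The outcome `chronic_illness` is left as numeric 0/1 here, which `glm()` with `family = binomial` accepts directly.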


## **2. Additional information regarding the definition of "Vigorous_Exercise" and "Lonely" variables:**

To aid the interpretation of these variables, we would need to know:

1. How "Vigorous_Exercise" is defined or measured. What constitutes vigorous exercise? Is it based on frequency, duration, or intensity?

2. How "Lonely" is defined. What are the criteria for classifying someone as lonely? Is it self-reported, and what does the scale or measure of loneliness cover?

## **3. Univariate logistic regression to estimate the association between chronic illness and being lonely:**



### ***Code***

In [49]:
# R code for univariate logistic regression

model <- glm(chronic_illness ~ lonely, family = binomial)
summary(model)



Call:
glm(formula = chronic_illness ~ lonely, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    -23.57   79462.00       0        1
lonely          47.13  112376.26       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.7726e+00  on 1  degrees of freedom
Residual deviance: 2.3305e-10  on 0  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 22


### ***Results***

In [50]:
stargazer(
  model,
  type = "text",
  title = "Logistic Regression Model for Chronic Illness and Loneliness",
  # Note: column.labels names model columns, not coefficients, so only the first label is used
  column.labels = c("Constant", "Lonely (Coefficient)", "Lonely (Odds Ratio)", "Lonely (p-value)"),
  dep.var.labels.include = FALSE
)



Logistic Regression Model for Chronic Illness and Loneliness
                      Dependent variable:    
                  ---------------------------
                           Constant          
---------------------------------------------
lonely                      47.132           
                         (112,376.300)       
                                             
Constant                    -23.566          
                         (79,462.010)        
                                             
---------------------------------------------
Observations                   2             
Log Likelihood              -0.000           
Akaike Inf. Crit.            4.000           
Note:             *p<0.1; **p<0.05; ***p<0.01


### ***Interpretation of Odds Ratio***

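The constant is the log-odds of having a chronic illness when all independent variables are 0; exponentiating it gives the baseline odds.

The odds ratio for "Lonely" (the exponentiated coefficient) compares the odds of having a chronic illness among lonely individuals with the odds among those who are not lonely.
> If it is less than 1, the odds are lower: lonely people are less likely to have a chronic illness.

> If it is greater than 1, the odds are higher: lonely people are more likely to have a chronic illness.

In the output above, the model was fitted to only two observations, so the outcome is perfectly separated. The very large standard errors (112,376 for "Lonely") and p-values of 1 mean that these estimates cannot be interpreted.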
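
Turning a fitted coefficient into an odds ratio is a matter of exponentiating it. The sketch below illustrates this on simulated data, because the two-observation fit above is degenerate. The variable names `lonely_sim` and `illness_sim`, the sample size, and the assumed true coefficients are all illustrative choices, not values from the study.

```r
set.seed(1)
n <- 500
lonely_sim <- rbinom(n, 1, 0.3)
# Assumed data-generating model: log-odds = -1 + 0.7 * lonely
illness_sim <- rbinom(n, 1, plogis(-1 + 0.7 * lonely_sim))

fit <- glm(illness_sim ~ lonely_sim, family = binomial)

# Exponentiate: intercept -> baseline odds, slope -> odds ratio.
# confint.default() gives Wald intervals on the log-odds scale.
or_table <- exp(cbind(estimate = coef(fit), confint.default(fit)))
or_table
```

An interval for the odds ratio that excludes 1 would point to an association, whereas one that includes 1 would be compatible with no association.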


### ***Need to include other variables***

Adding further variables to the regression model is advisable because it lets us control for potential confounders.

In this univariate model, we only examine the relationship between "Lonely" and "Chronic_Illness" in isolation.
However, in real-world scenarios, other factors like Age, Gender, Current Smoker, and Vigorous Exercise
can also impact an individual's likelihood of having a chronic illness.

By including these additional variables in the model, we can control for their effects, which helps us to more accurately isolate and assess the specific effect of "Lonely" on "Chronic_Illness."

This approach allows us to distinguish whether loneliness has an independent influence on chronic illness
after accounting for the potential confounding effects of these other variables.

## **4. Multivariable logistic regression using all five independent variables:**

### ***Regression Equation***

The logistic regression equation would be:


> logit(P(Chronic_Illness)) = β0 + β1 * Age + β2 * Gender + β3 * Current_smoker + β4 * Vigorous_Exercise + β5 * Lonely
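
For reference, the same relationship can be written out in terms of odds and probability. The symbol $\eta$ for the linear predictor is notation introduced here, not part of the original exercise.

$$
\eta = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Gender} + \beta_3\,\mathrm{Current\_smoker} + \beta_4\,\mathrm{Vigorous\_Exercise} + \beta_5\,\mathrm{Lonely}
$$

$$
\mathrm{odds} = \frac{P}{1-P} = e^{\eta},
\qquad
P = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
$$

Each $e^{\beta_j}$ is then the odds ratio for a one-unit increase in the corresponding variable, holding the others fixed.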





### ***Code***

In [51]:
model2 <- glm(chronic_illness ~ age + gender + current_smoker + vigorous_exercise + lonely, family = binomial)
summary(model2)



Call:
glm(formula = chronic_illness ~ age + gender + current_smoker + 
    vigorous_exercise + lonely, family = binomial)

Coefficients: (4 not defined because of singularities)
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         -184.808 444201.220       0        1
age                    2.481   5914.540       0        1
gender                    NA         NA      NA       NA
current_smoker            NA         NA      NA       NA
vigorous_exercise         NA         NA      NA       NA
lonely                    NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.7726e+00  on 1  degrees of freedom
Residual deviance: 2.3305e-10  on 0  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 22
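
The message "4 not defined because of singularities" can be checked by looking at the rank of the design matrix. The sketch below rebuilds the same two-observation placeholder vectors used earlier in this notebook; `X` is a name introduced here for illustration.

```r
# Same two-observation placeholder vectors as defined earlier
age <- c(65, 84)
gender <- c(0, 1)
current_smoker <- c(0, 1)
vigorous_exercise <- c(0, 1)
lonely <- c(0, 1)

# Design matrix: intercept plus the five predictors
X <- model.matrix(~ age + gender + current_smoker + vigorous_exercise + lonely)

ncol(X)      # 6 columns
qr(X)$rank   # 2 linearly independent columns
```

With two rows the matrix cannot have rank above 2, so only two of the six coefficients (here the intercept and `age`) can be estimated, and the remaining four are reported as `NA`.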


### ***Results***

In [52]:
stargazer(
  model2,
  type = "text",
  title = "Logistic Regression Model for Chronic Illness and All other Variables",
  # Note: column.labels names model columns, not coefficients, so only the first label is used
  column.labels = c("Constant", "Lonely (Coefficient)", "Lonely (Odds Ratio)", "Lonely (p-value)"),
  dep.var.labels.include = FALSE
)


Logistic Regression Model for Chronic Illness and All other Variables
                      Dependent variable:    
                  ---------------------------
                           Constant          
---------------------------------------------
age                          2.481           
                          (5,914.540)        
                                             
gender                                       
                                             
                                             
current_smoker                               
                                             
                                             
vigorous_exercise                            
                                             
                                             
lonely                                       
                                             
                                             
Constant                   -184.808          
         

### ***Interpretation***

The odds ratio for Age indicates how a one-year increase in age changes the odds of having a chronic illness, holding the other variables constant.
> If the odds ratio is greater than 1, it suggests that older individuals are more likely to have a chronic illness.

The odds ratio for Gender compares the odds of having a chronic illness for females with the odds for males.
> If it's greater than 1, females have higher odds.

The odds ratio for Current Smoker compares the odds of having a chronic illness for current smokers with the odds for non-smokers.
> If it's greater than 1, smokers have higher odds.

Note that in the output above only the intercept and Age could be estimated. The other four coefficients are NA because of singularities, so their odds ratios cannot be read from this fit.

### ***Odds for Loneliness***

The odds ratio for "Lonely" may change in the multivariable model compared to the univariate model because it accounts for the effects of other variables in the model.

It represents the change in the odds of having a chronic illness associated with loneliness while controlling for age, gender, smoking status, and exercise.

The univariate model does not account for these potential confounders.
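
The shift between the unadjusted and adjusted odds ratio can be illustrated on simulated data in which age drives both loneliness and illness. This is only a sketch: the names `age_z`, `lonely_sim` and `illness_sim`, the sample size, and the assumed coefficients are hypothetical and unrelated to the study data.

```r
set.seed(42)
n <- 5000
age_z <- rnorm(n)                               # standardized age (simulated)
lonely_sim <- rbinom(n, 1, plogis(-1 + age_z))  # older people are lonelier here
# Assumed true model: loneliness log-odds ratio 0.5, age effect 1 per SD
illness_sim <- rbinom(n, 1, plogis(-1 + 0.5 * lonely_sim + age_z))

crude    <- glm(illness_sim ~ lonely_sim, family = binomial)
adjusted <- glm(illness_sim ~ lonely_sim + age_z, family = binomial)

# Odds ratio for loneliness before and after adjusting for age
c(crude    = exp(coef(crude)[["lonely_sim"]]),
  adjusted = exp(coef(adjusted)[["lonely_sim"]]))
```

In this setup the crude odds ratio is inflated because it also absorbs the effect of age, while the adjusted estimate should sit closer to the assumed value of exp(0.5).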

### ***For a 30-year-old?***

The model might not be appropriate for calculating the odds of having a chronic illness in someone aged 30 because the dataset only includes individuals aged 65 to 84. Extrapolating to an age outside this range could lead to unreliable results.
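
A simple guard against extrapolation is to compare the requested age with the observed range before predicting. A minimal sketch, using the 65 to 84 range stated above; `observed_age`, `new_age` and `in_range` are names introduced here.

```r
observed_age <- c(65, 84)  # age range stated in the text
new_age <- 30

# TRUE only if the new value lies inside the observed range
in_range <- new_age >= min(observed_age) && new_age <= max(observed_age)
in_range  # FALSE: age 30 is outside the data, so a prediction would be an extrapolation
```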

## ***5. Calculate the predicted odds and probability of having a chronic illness:***

### *Predicted Odds*

Calculate the predicted odds of having a chronic illness if you are lonely, aged 65, female, smoke and do not vigorously exercise.

In [53]:
# Illustrative coefficient values (assumed, not estimates from the fitted model); no intercept is included
coef_age <- 0.2
coef_gender <- 0.5
coef_smoker <- 0.3
coef_exercise <- -0.4
coef_lonely <- 0.6

# Values for an individual
age_value <- 65
gender_value <- 1
smoker_value <- 1
exercise_value <- 0
lonely_value <- 1

# Calculate the log odds
log_odds <- coef_age * age_value + coef_gender * gender_value + coef_smoker * smoker_value + coef_exercise * exercise_value + coef_lonely * lonely_value

# Calculate odds (odds = exp(log-odds))
odds <- exp(log_odds)

odds

### *Predicted Probability*

Using the predicted odds, estimate the predicted probability of having a chronic illness in this group.

In [54]:
# Calculate probability from the odds: p = odds / (1 + odds)
probability <- odds / (1 + odds)
probability
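
With a model fitted to real data, the same quantities could be obtained from `predict()` rather than from hand-entered coefficients. The sketch below uses simulated data and a reduced model with only age and loneliness. The data frame `sim`, the assumed coefficients, and `new_person` are hypothetical.

```r
set.seed(7)
n <- 400
sim <- data.frame(
  age    = round(runif(n, 65, 84)),
  lonely = rbinom(n, 1, 0.3)
)
# Assumed data-generating model, for illustration only
sim$chronic_illness <- rbinom(n, 1, plogis(-6 + 0.08 * sim$age + 0.6 * sim$lonely))

fit <- glm(chronic_illness ~ age + lonely, data = sim, family = binomial)

# Predicted log-odds, odds and probability for a lonely 65-year-old
new_person   <- data.frame(age = 65, lonely = 1)
log_odds_hat <- predict(fit, newdata = new_person, type = "link")
odds_hat     <- exp(log_odds_hat)
prob_hat     <- predict(fit, newdata = new_person, type = "response")

# Both routes agree: probability = odds / (1 + odds)
all.equal(unname(prob_hat), unname(odds_hat / (1 + odds_hat)))
```

Using `type = "link"` returns the linear predictor (log-odds), while `type = "response"` applies the inverse logit and returns the probability directly.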