# Logistic Regression

These notes build on the previous set, which looked for associations between categorical attributes using the chi-square test. This final section of the course discusses logistic regression. Logistic regression is largely a generalization of linear regression: instead of a continuous outcome, the outcome in logistic regression is dichotomous.

We are going to use the [General Social Survey](https://gss.norc.org/) again to explore this model. 

In [None]:
library(tidyverse)

head(gss_cat)

The general framework of the model is as follows:

$$
\log\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_{0} + \beta_{1} X_{1} + \cdots + \beta_{k} X_{k}
$$

The left-hand side of the equation above is read as: take the natural log of the odds, that is, the probability of the outcome being equal to 1 divided by the probability of the outcome being equal to 0. The left-hand side is typically referred to as a logit.
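
As a small worked example (the value 0.75 is chosen purely for illustration and does not come from the GSS data), a probability of 0.75 corresponds to odds of 3 and a logit of about 1.10:

$$
\frac{P(Y = 1)}{1 - P(Y = 1)} = \frac{0.75}{0.25} = 3, \qquad \log(3) \approx 1.10
$$

A probability of 0.5 gives odds of 1 and a logit of 0, so negative logits correspond to probabilities below 0.5 and positive logits to probabilities above 0.5.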

The right-hand side is similar to linear regression, representing the attributes that are thought to be associated with the likelihood of the outcome being equal to 1. By default, the coefficients are on the logit metric and are interpreted just like linear regression coefficients, except in logit units. The non-linearity comes from the logit transformation: converting a logit back to a probability always yields a value strictly between 0 and 1.
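
To see the transformation in action, here is a minimal sketch using base R's `plogis()` (inverse logit) and `qlogis()` (logit). The logit values are made up for illustration and are not estimates from the GSS data.

```r
# Logits spanning a wide range (illustrative values only).
logits <- c(-10, -1, 0, 1, 10)

# plogis() is the inverse logit: 1 / (1 + exp(-x)).
probs <- plogis(logits)
probs

# qlogis() is the logit: log(p / (1 - p)); it recovers the original logits.
qlogis(probs)
```

However extreme the logit, the resulting probability stays between 0 and 1, which is why the linear right-hand side can take any value without producing an impossible probability.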

The model also does not predict a specific value; instead, it predicts the probability or likelihood of being a 1 based on the values of the attributes.

In [None]:
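# Collapse the party identification categories into four groups,
# then flag independents with a 0/1 indicator to use as the outcome.
gss_cat <- gss_cat %>%
  mutate(partyid_collapse = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  ),
  ind_binary = ifelse(partyid_collapse == 'ind', 1, 0)
  ) |>
  # Drop respondents without a usable party identification.
  filter(partyid_collapse != 'other')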

count(gss_cat, partyid_collapse) %>%
  mutate(prop = n / sum(n))

## Continuous Attribute

Let's explore whether an individual's age helps predict whether they identify as a political independent.

In [None]:
tv_ind <- glm(ind_binary ~ I(age - 30), data = gss_cat, family = "binomial")

broom::tidy(tv_ind)
broom::glance(tv_ind)

The interpretations of the intercept and slope are similar to before. The intercept says that when age is 30 (notice how I centered the term at age 30 above), the model-implied logit is -0.148. The slope says that for every one-year increase in age, the logit decreases by 0.0149 units. By default, these are difficult to interpret, as we typically don't think in the logit metric.
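
One common way to make the slope easier to read is to exponentiate it, which turns a change in logits into a multiplicative change in odds. The sketch below types in the slope quoted above rather than pulling it from the model object, so treat the number as an assumption copied from the text.

```r
# Slope on the logit scale, copied from the text above (assumed value).
slope_logit <- -0.0149

# Exponentiating gives the odds ratio for a one-year increase in age.
odds_ratio <- exp(slope_logit)
odds_ratio

# Percent change in the odds of being an independent per additional year.
(odds_ratio - 1) * 100
```

With this slope the odds ratio is just under 1, so each additional year of age lowers the odds of identifying as an independent by roughly 1.5 percent.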

For continuous predictors, interpreting these on a probability scale is often helpful. The easiest way to do this is to ask the fitted model for predicted probabilities. Before doing that, however, it is possible to compute the probability for the intercept (age 30) by hand.

In [None]:
# Inverse logit of the intercept (-0.148): p = 1 / (1 + exp(-logit))
1 / (1 + exp(0.148))
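
The same hand calculation extends to other ages by adding the slope times the distance from age 30 before converting back to a probability. This is a sketch using the coefficients quoted in the text; `prob_at_age()` is a hypothetical helper written for this example, not part of any package.

```r
# Coefficients copied from the text above (assumed values).
intercept <- -0.148   # logit at age 30
slope <- -0.0149      # change in logit per year of age

# Hypothetical helper: model-implied probability at a given age.
prob_at_age <- function(age) {
  logit <- intercept + slope * (age - 30)
  1 / (1 + exp(-logit))
}

prob_at_age(c(30, 50, 70))
```

The first value matches the intercept calculation above, and the probabilities fall as age increases because the slope is negative.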

### Different Age Values

In [None]:
new_age <- data.frame(
  age = 18:89
)

new_age <- new_age |> 
  mutate(prob = predict(tv_ind, newdata = new_age, type = 'response'))

head(new_age)
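
The `type = 'response'` argument is what puts the predictions on the probability scale; without it, `predict()` returns logits. The sketch below checks this on simulated data so that it runs on its own. The data frame `sim`, the model `fit`, and the coefficients used to simulate are all assumptions made up for this example, not the GSS results.

```r
set.seed(1)

# Simulate ages and a binary outcome from a known logistic model (illustrative only).
sim <- data.frame(age = sample(18:89, 500, replace = TRUE))
sim$y <- rbinom(500, size = 1, prob = plogis(-0.148 - 0.0149 * (sim$age - 30)))

fit <- glm(y ~ I(age - 30), data = sim, family = "binomial")

grid <- data.frame(age = c(18, 30, 50, 89))

# Predictions on the logit scale and on the probability scale.
link <- predict(fit, newdata = grid, type = "link")
resp <- predict(fit, newdata = grid, type = "response")

# Applying the inverse logit to the link predictions reproduces the probabilities.
all.equal(plogis(link), resp)
```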

In [None]:
library(ggformula)

theme_set(theme_bw(base_size = 18))

gf_line(prob ~ age, data = new_age, linewidth = 2) |> 
  gf_labs(x = "Age",
         y = 'Probability')

## Categorical Predictor