 ## Linear Regression - Categorical Predictor


 ## Description of the Data
 These data contain information on the health of mothers and their babies for 1,174 pregnant women.

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)

theme_set(theme_bw(base_size = 18))

baby <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/baby.csv")
# Create a 0/1 version of smoking status: 1 if the mother smoked, 0 otherwise
baby <- baby |>
  mutate(smoker = ifelse(maternal_smoker, 1, 0))
head(baby)

 ## Categorical Predictor
 Previously, linear regression was run with a continuous predictor. In both of those models, the baby's birth weight was the outcome of interest; the predictor was the number of gestational days in one model and the mother's age at the time of birth in the other. What happens when a categorical predictor is used instead of a continuous one? This section introduces that idea with a categorical predictor that has two levels.

 ### Mother's smoking
 It is known that a mother smoking while pregnant can hamper the development of the fetus. Will this translate into lower birth weights for babies born to mothers who smoked during pregnancy? First, let's explore the distributions and calculate descriptive statistics for birth weight across the two groups.

In [None]:
gf_density(~ birth_weight, color = ~ maternal_smoker, size = 1.25,
           fill = 'gray80', data = baby) |>
  gf_labs(x = 'Birth Weight (in oz)',
          color = 'Smoked?')

 What are the general take-aways from the distributions above? To give some additional information, a violin plot may be helpful.

In [None]:
gf_violin(birth_weight ~ maternal_smoker, data = baby, draw_quantiles = c(0.1, 0.5, 0.9),
          fill = 'gray85', size = 1) |>
  gf_refine(coord_flip()) |>
  gf_labs(y = "Birth Weight (in oz)",
          x = "Smoker?")

 Does the violin plot reveal any differences between the groups that the density plot did not? To finish the descriptive exploration, let's compute some descriptive statistics.

In [None]:
baby |>
  df_stats(birth_weight ~ maternal_smoker, mean, sd, median, quantile(c(0.25, 0.75)), length)

 ## Linear Regression - Categorical Predictor
 Now it is time to fit a model to the data to explore whether there is indeed a difference in the population. Descriptively, the two group means and medians differ, but is this difference large enough to be of practical importance? The model is fit with the `lm()` function using the same formula structure as before: the outcome (birth weight) goes to the left of the `~` and the predictor (maternal smoking status) goes to the right.

In [None]:
smoker_reg <- lm(birth_weight ~ maternal_smoker, data = baby)
coef(smoker_reg)
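
 Written as an equation, a regression with one predictor has the form sketched below. The symbols are notation introduced here for illustration and are not part of the R output: $y_i$ is the birth weight of baby $i$, $x_i$ is the predictor value for that baby's mother, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_i$ is the error term.

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$

 If $x_i$ can only take the values 0 or 1, the predicted value is $\beta_0$ for the group coded 0 and $\beta_0 + \beta_1$ for the group coded 1.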

 To explore what these coefficients mean, let's take another look at the first few rows of the data.

In [None]:
head(baby)

 Instead of using the `maternal_smoker` attribute, let's fit the model again with the `smoker` attribute, the 0/1 version created when the data were read in.

In [None]:
smoker_reg_new <- lm(birth_weight ~ smoker, data = baby)
coef(smoker_reg_new)

 Notice that the coefficient estimates are the same no matter which attribute is entered into the model; only the name of the slope term differs. When a categorical attribute is entered into a regression in R, it is automatically converted into an indicator (or dummy) variable: one of the two values is represented with a 1 and the other with a 0. For a character attribute, the value represented with a 0 is by default the first category in alphabetical order. For `maternal_smoker`, the non-smoking value (`FALSE`) is represented with a 0 and the smoking value (`TRUE`) with a 1, which matches the hand-coded `smoker` attribute.
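
 One way to see this conversion directly is to look at the design matrix that `lm()` builds behind the scenes. The sketch below uses base R's `model.matrix()` on a small, made-up data frame (the `toy` data and its six birth weights are hypothetical and are not taken from the baby data) so that it runs on its own.

```r
# Hypothetical toy data, NOT the baby data set: six made-up birth weights (in oz).
toy <- data.frame(
  birth_weight    = c(120, 118, 126, 110, 112, 108),
  maternal_smoker = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Hand-coded 0/1 version, created the same way as `smoker` in the baby data.
toy$smoker <- ifelse(toy$maternal_smoker, 1, 0)

# model.matrix() returns the columns used to fit the model for a given formula.
# The TRUE/FALSE attribute shows up as a 0/1 column named maternal_smokerTRUE.
design <- model.matrix(birth_weight ~ maternal_smoker, data = toy)
print(design)

# Fit the model both ways and compare the coefficient values
# (unname() drops the names, which are the only thing that differs).
fit_logical <- lm(birth_weight ~ maternal_smoker, data = toy)
fit_numeric <- lm(birth_weight ~ smoker, data = toy)
all.equal(unname(coef(fit_logical)), unname(coef(fit_numeric)))
```

 The same check could be run on the `baby` data by swapping `toy` for `baby` in the `model.matrix()` call.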

 To get a better grasp of this, the descriptive statistics and the regression coefficients are shown together below.

In [None]:
baby |>
  df_stats(birth_weight ~ maternal_smoker, mean, sd, median, quantile(c(0.25, 0.75)), length)

coef(smoker_reg)
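
 When comparing the two outputs, it can help to check how the coefficients relate to the group means. With a single 0/1 predictor, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The sketch below checks this on a small, made-up data frame (the `toy` data and the name `toy_reg` are hypothetical, used only to keep the example self-contained).

```r
# Hypothetical toy data, NOT the baby data set: six made-up birth weights (in oz).
toy <- data.frame(
  birth_weight    = c(120, 118, 126, 110, 112, 108),
  maternal_smoker = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Mean birth weight in each group; the result is named "FALSE" and "TRUE".
group_means <- tapply(toy$birth_weight, toy$maternal_smoker, mean)

# Regression with the two-level predictor.
toy_reg <- lm(birth_weight ~ maternal_smoker, data = toy)
coefs <- coef(toy_reg)

# Intercept compared with the mean of the group coded 0 (non-smokers).
intercept_matches <- all.equal(
  unname(coefs["(Intercept)"]),
  unname(group_means["FALSE"])
)

# Slope compared with the smoker mean minus the non-smoker mean.
slope_matches <- all.equal(
  unname(coefs["maternal_smokerTRUE"]),
  unname(group_means["TRUE"] - group_means["FALSE"])
)

print(group_means)
print(coefs)
print(c(intercept = intercept_matches, slope = slope_matches))
```

 The same comparison can be made by eye for the baby data using the `mean` column of the `df_stats()` output and the values returned by `coef(smoker_reg)`.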