# NB 8: Modelling Premature Infants

*Modified from Data Science in a Box*




In [None]:
# Install (needed only on the first run) and load the R packages used in this notebook
install.packages(c("csucistats", "openintro"),
                 repos = c("https://inqs909.r-universe.dev", "https://cloud.r-project.org"))
library(csucistats)
library(tidyverse)
library(openintro)


# Uncomment and run for categorical plots
# csucistats::install_plots()
# library(ggtricks)
# library(ggmosaic)
# library(waffle)

# Uncomment and run for themes
# csucistats::install_themes()
# library(ThemePark)
# library(ggthemes)

In 2004, the state of North Carolina released a large data set containing information on births recorded in the state.
This data set is useful to researchers studying the relationship between the habits and practices of expectant mothers and the birth of their children.
We will work with a random sample of observations from this data set.



## Data

The data can be found in the **openintro** package, and it's called `ncbirths`.
Since the dataset is distributed with the package, we don't need to load it separately; it becomes available to us when we load the package.
You can find out more about the dataset by inspecting its documentation, which you can access by running `?ncbirths` in the Console or using the Help menu in RStudio to search for `ncbirths`.
You can also find this information [here](https://www.openintro.org/data/index.php?data=ncbirths).





## Part 1: Premature vs. smoking

Consider the possible relationship between a mother's smoking habit and whether her infant is premature. A good first step is to build a cross-tabulation of the two variables.




1.  Create a cross-tabulation of `habit` and `premie`.



2. Among smoking mothers, what proportion of infants are premature?



3. Among nonsmoking mothers, what proportion of infants are premature?



4. What is the difference in the proportion of premature infants between smoking and nonsmoking mothers?
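
The steps in this part can be sketched with base R on a small, made-up data set. The data frame `toy` and its counts are hypothetical and are not taken from `ncbirths`; only the variable names `habit` and `premie` follow the worksheet. This is one possible approach under those assumptions, and the same calls should carry over to `ncbirths`.

```r
# Hypothetical toy data (NOT the ncbirths sample), one row per birth:
# 100 nonsmokers (80 full term, 20 premie) and 50 smokers (30 full term, 20 premie)
toy <- data.frame(
  habit  = factor(rep(c("nonsmoker", "smoker"), times = c(100, 50))),
  premie = factor(c(rep(c("full term", "premie"), times = c(80, 20)),
                    rep(c("full term", "premie"), times = c(30, 20))))
)

# Cross-tabulation: rows are habit, columns are premie
tab <- table(toy$habit, toy$premie)
tab

# Row proportions: within each habit group, the proportions sum to 1
row_props <- prop.table(tab, margin = 1)
row_props

# Difference in the proportion of premature infants (smoker minus nonsmoker)
diff_props <- row_props["smoker", "premie"] - row_props["nonsmoker", "premie"]
diff_props
```

Using `margin = 1` conditions on the row variable, which is what "among smoking mothers" asks for; `margin = 2` would instead condition on the outcome.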

## Part 2: Using Logistic Regression

1. Use logistic regression to model the outcome `premie` with `habit` as the predictor.

2. What is the probability of having a premature baby for a smoking mother?

3. What is the probability of having a premature baby for a nonsmoking mother?

4. What is the difference in the probabilities?

5. Compare these results with those from **Part 1**. Comment on what you found.
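
A minimal sketch of this workflow, assuming base R's `glm()` is an acceptable tool (the worksheet does not say which function to use). The data frame `toy` is hypothetical and is not the `ncbirths` sample.

```r
# Hypothetical toy data (NOT the ncbirths sample), one row per birth
toy <- data.frame(
  habit  = factor(rep(c("nonsmoker", "smoker"), times = c(100, 50))),
  premie = factor(c(rep(c("full term", "premie"), times = c(80, 20)),
                    rep(c("full term", "premie"), times = c(30, 20))))
)

# With a factor response, glm() treats the first level as "failure" and
# every other level as "success", so here the model describes P(premie)
levels(toy$premie)
fit <- glm(premie ~ habit, data = toy, family = binomial)

# Predicted probabilities for one mother of each habit
new_mothers <- data.frame(
  habit = factor(c("nonsmoker", "smoker"), levels = levels(toy$habit))
)
probs <- predict(fit, newdata = new_mothers, type = "response")
probs

# Difference in predicted probabilities (smoker minus nonsmoker)
diff(probs)
```

With `type = "response"`, `predict()` returns probabilities rather than log-odds, which makes the results directly comparable with proportions read off a cross-tabulation.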

## Part 3: Adjusting for Mother's Maturity Status

Now consider how the mother's maturity status (`mature`) relates to the likelihood of a premature birth.

1. Use a logistic regression model to characterize the association between the outcome `premie` and predictors `habit` and `mature`.

2. Interpret the odds ratio for `habit`.

3. Interpret the odds ratio for `mature`.

4. What is the probability of a premature infant for a younger mother who smokes?

5. What is the probability of a premature infant for a mature mother who smokes?
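
As a rough illustration of the two-predictor model, the sketch below again uses base R's `glm()` on made-up counts. The data frame `toy`, its counts, and the factor level names are assumptions for the example; check the real level names with `levels()` before reusing the code on `ncbirths`.

```r
# Hypothetical counts (NOT the ncbirths sample) for each combination of levels
counts <- expand.grid(
  habit  = c("nonsmoker", "smoker"),
  mature = c("mature mom", "younger mom"),
  premie = c("full term", "premie"),
  stringsAsFactors = TRUE
)
counts$n <- c(40, 15, 80, 30, 10, 10, 15, 15)

# Expand to one row per birth so the data resemble an individual-level sample
toy <- counts[rep(seq_len(nrow(counts)), counts$n), c("habit", "mature", "premie")]

# Logistic regression for P(premie), adjusting for both predictors
fit2 <- glm(premie ~ habit + mature, data = toy, family = binomial)

# Exponentiated coefficients: the habit and mature entries are odds ratios
# relative to each predictor's reference (first) level
odds_ratios <- exp(coef(fit2))
odds_ratios

# Predicted probabilities for a smoking younger mom and a smoking mature mom
new_mothers <- data.frame(
  habit  = factor(c("smoker", "smoker"), levels = levels(toy$habit)),
  mature = factor(c("younger mom", "mature mom"), levels = levels(toy$mature))
)
probs2 <- predict(fit2, newdata = new_mothers, type = "response")
probs2
```

Each odds ratio compares the odds of a premature birth against the reference level of that predictor while holding the other predictor fixed. The exponentiated intercept is the odds for the reference group, not an odds ratio.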

## Part 4: Releveling the Premature Variable

In this part, we model the probability of a full-term birth instead of a premature one.

1. Use the `levels` function on `ncbirths$premie` to determine which is the reference level (first category).

2. Run `relevel(ncbirths$premie, ref = "premie")`, and explain in words what happened.


3. Refit the model from **Part 3** with a "full term" infant as the modelled outcome instead of a "premie".

4. Interpret the odds ratio for `mature`.

5. What is the probability of a full-term infant for a younger mother who smokes?

6. Compare your result with Part 3, Problem 4. How are the two related?
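
The effect of releveling can be sketched on a small made-up data set. The data frame `toy` and the new column `full_term` are hypothetical names introduced for illustration. Note that `relevel()` returns a new factor and leaves the original unchanged unless the result is assigned.

```r
# Hypothetical toy data (NOT the ncbirths sample), one row per birth
toy <- data.frame(
  habit  = factor(rep(c("nonsmoker", "smoker"), times = c(100, 50))),
  premie = factor(c(rep(c("full term", "premie"), times = c(80, 20)),
                    rep(c("full term", "premie"), times = c(30, 20))))
)

# The first level is the reference; glm() models the probability of the other level
levels(toy$premie)

# Make "premie" the reference, so a model on this column describes P(full term)
toy$full_term <- relevel(toy$premie, ref = "premie")
levels(toy$full_term)

fit_premie <- glm(premie ~ habit,    data = toy, family = binomial)
fit_full   <- glm(full_term ~ habit, data = toy, family = binomial)

# Predicted probabilities for a smoking mother under each coding of the outcome
smoker   <- data.frame(habit = factor("smoker", levels = levels(toy$habit)))
p_premie <- predict(fit_premie, newdata = smoker, type = "response")
p_full   <- predict(fit_full,   newdata = smoker, type = "response")
c(p_premie = unname(p_premie), p_full = unname(p_full))
```

Because the outcome has only two levels, swapping the reference flips the sign of every coefficient, so each odds ratio becomes its reciprocal and the two predicted probabilities are complements.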