# Maternal Smoking and Birth Weight

**Note**: 

> This exercise has been written out in a Jupyter Notebook. We'll discuss Jupyter Notebooks in more detail later in this specialization (they are a very powerful tool for data science communication!), but for the time being, the notebook is just a convenient way for us to write out the exercise. You don't need to *do* anything with the notebook except read its contents; just write your Python code in a regular `.py` file.

**WARNING:**

> When asked to round your answers to a certain number of decimals, do *not* round any results until you've finished your computations and have your final answer! For example, if you were to calculate the average hourly wage for workers, and you did so by first calculating the average weekly salary of workers and the average hours worked per week, then divided the first number by the second, you should NOT round the average weekly salary of workers or the average hours worked per week. Rounding intermediate results can lead to compounding errors that cause problems for the autograder.
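
To make the warning concrete, here is a small sketch with made-up numbers (the salary and hours figures are hypothetical and do not come from any dataset in this course). It compares rounding only the final answer with rounding the intermediate averages first.

```python
# Hypothetical figures, chosen only to illustrate the point.
avg_weekly_salary = 1234.56  # average weekly salary
avg_weekly_hours = 37.4      # average hours worked per week

# Recommended: keep full precision and round only the final answer.
hourly_wage = round(avg_weekly_salary / avg_weekly_hours, 2)

# Not recommended: round the intermediate values, then divide.
hourly_wage_early = round(round(avg_weekly_salary) / round(avg_weekly_hours), 2)

print(hourly_wage, hourly_wage_early)
```

With these particular numbers the two results differ by more than 30 cents, which is far more than an autograder would tolerate.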


These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems. This was not common knowledge fifty years ago. One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA. The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518). The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at [the book’s website](https://www.stat.berkeley.edu/users/statlabs/).

There were about 15,000 families in the study. We will only analyze a subset of the data, in particular 869 male single births where the baby lived at least 28 days. The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an indicator of whether the mother smoked during pregnancy. The variables in the dataset are described in the `birthweight_codebook.txt` code book. In this exercise, we will use this data to better understand whether mothers who smoke tend to give birth to babies with lower weights than mothers who do not smoke.


### Exercise 1

Load the data `smoking.csv` in the `data` folder. This data includes information on both newborns and mothers (variables prefixed with the letter `m` are attributes of the mother).

In [6]:
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

smoking_and_bw = pd.read_csv("data/smoking.csv")

In [7]:
smoking_and_bw.head()

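| Unnamed: 0 | id | date | gestation | bwt_oz | parity | mrace | mage | med | mht | mpregwt | inc | smoke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8038 | 1665 | 256 | 108 | 0 | 1 | 26 | 5 | 67 | 130 | 3 | 0 |
| 1 | 6251 | 1408 | 256 | 78 | 0 | 8 | 29 | 5 | 65 | 123 | 7 | 0 |
| 2 | 6611 | 1453 | 257 | 102 | 1 | 4 | 25 | 1 | 66 | 135 | 1 | 0 |
| 3 | 6177 | 1416 | 257 | 138 | 1 | 0 | 38 | 2 | 67 | 138 | 1 | 0 |
| 4 | 6017 | 1428 | 258 | 102 | 0 | 7 | 22 | 4 | 65 | 135 | 0 | 0 |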


### Exercise 2

Our interest in this exercise is in whether the babies of mothers who smoked during pregnancy had lower birthweights than the babies of mothers who did not smoke.

Let's evaluate this relationship using `statsmodels`. 

Using `ols` from `statsmodels.formula.api`, regress birthweight on whether the infant's mother smoked. What is the average difference in the weight of newborns for mothers who did not smoke as compared to mothers who smoke?

When interpreting the coefficient of your model, remember that the reported coefficient is equal to the average value of birthweight *when the indicator variable is equal to 1* minus the average value of birthweight *when the indicator variable is equal to 0*.
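
The interpretation above can be checked numerically. The sketch below uses synthetic data (the values are generated at random and are not the study data) and fits the regression with NumPy's least-squares solver rather than `statsmodels`, purely to keep the example dependency-light; the fitted slope on a 0/1 indicator should match the difference in group means.

```python
import numpy as np

# Synthetic data for illustration only (not the study data).
rng = np.random.default_rng(0)
smoke = rng.integers(0, 2, size=200)            # 0/1 indicator
bwt = 120 - 8 * smoke + rng.normal(0, 15, 200)  # outcome in ounces

# Least-squares fit of bwt on an intercept and the indicator.
X = np.column_stack([np.ones_like(smoke, dtype=float), smoke])
intercept, slope = np.linalg.lstsq(X, bwt, rcond=None)[0]

# Mean when the indicator is 1 minus mean when the indicator is 0.
diff_in_means = bwt[smoke == 1].mean() - bwt[smoke == 0].mean()

print(np.isclose(slope, diff_in_means))               # slope vs. difference in means
print(np.isclose(intercept, bwt[smoke == 0].mean()))  # intercept vs. mean of the 0 group
```

The same identity holds for a model fit with `smf.ols` when the only regressor is a single indicator variable.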

**Please round your answer to 2 decimal places.**

In [9]:
import statsmodels.formula.api as smf

s_and_bw_model = smf.ols("bwt_oz ~ smoke", smoking_and_bw).fit()
s_and_bw_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bwt_oz | R-squared: | 0.055 |
| Model: | OLS | Adj. R-squared: | 0.054 |
| Method: | Least Squares | F-statistic: | 47.62 |
| Date: | Tue, 06 Aug 2024 | Prob (F-statistic): | 1.04e-11 |
| Time: | 11:23:49 | Log-Likelihood: | -3438.4 |
| No. Observations: | 814 | AIC: | 6881.0 |
| Df Residuals: | 812 | BIC: | 6890.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 123.3824 | 0.787 | 156.732 | 0.000 | 121.837 | 124.928 |
| smoke | -8.0356 | 1.164 | -6.901 | 0.000 | -10.321 | -5.750 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.836 | Durbin-Watson: | 1.757 |
| Prob(Omnibus): | 0.012 | Jarque-Bera (JB): | 11.208 |
| Skew: | 0.127 | Prob(JB): | 0.00368 |
| Kurtosis: | 3.516 | Cond. No. | 2.53 |


> Answer: 8.04 ounces. 

### Exercise 3

The longer a pregnancy, the heavier a newborn will tend to be. Suppose we are interested in whether the newborns of mothers who don't smoke are heavier than newborns of mothers who do smoke *when their pregnancies are the same duration.* To answer this question, please add `gestation` as a second variable in our model. 

Now what is the average difference in the weight of newborns for mothers who did not smoke as compared to mothers who smoke *for pregnancies of the same length*? 

**Please round your answer to two decimal places.**

In [10]:
s_and_bw_model = smf.ols("bwt_oz ~ smoke + gestation", smoking_and_bw).fit()
s_and_bw_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bwt_oz | R-squared: | 0.145 |
| Model: | OLS | Adj. R-squared: | 0.143 |
| Method: | Least Squares | F-statistic: | 68.74 |
| Date: | Tue, 06 Aug 2024 | Prob (F-statistic): | 2.64e-28 |
| Time: | 11:28:58 | Log-Likelihood: | -3397.9 |
| No. Observations: | 814 | AIC: | 6802.0 |
| Df Residuals: | 811 | BIC: | 6816.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.3847 | 13.449 | -0.029 | 0.977 | -26.785 | 26.015 |
| smoke | -7.4319 | 1.111 | -6.692 | 0.000 | -9.612 | -5.252 |
| gestation | 0.4393 | 0.048 | 9.217 | 0.000 | 0.346 | 0.533 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.849 | Durbin-Watson: | 1.956 |
| Prob(Omnibus): | 0.012 | Jarque-Bera (JB): | 10.993 |
| Skew: | 0.136 | Prob(JB): | 0.0041 |
| Kurtosis: | 3.501 | Cond. No. | 6850.0 |


> Answer: 7.43 ounces.

### Exercise 4

Now fit the same model you just fit (with `gestation` and `smoke`) but do so using `patsy` and the `OLS` method from `statsmodels.api` (*not* the `ols` method from `statsmodels.formula.api`).
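
For readers who have not seen `patsy` before, the sketch below shows what `dmatrices` produces: an outcome matrix built from the left side of the formula and a design matrix built from the right side, with an intercept column added. The tiny DataFrame is made up for illustration, and `return_type="dataframe"` is an optional argument used here only because pandas objects are easier to inspect.

```python
import pandas as pd
import patsy

# Made-up rows for illustration only (not the study data).
toy = pd.DataFrame(
    {
        "bwt_oz": [108, 78, 102, 138],
        "smoke": [0, 1, 0, 1],
        "gestation": [256, 256, 257, 257],
    }
)

# The left side of the formula becomes y; the right side becomes X.
y, X = patsy.dmatrices("bwt_oz ~ smoke + gestation", toy, return_type="dataframe")

print(list(y.columns))  # the outcome column
print(list(X.columns))  # includes an "Intercept" column of ones added by patsy
```

Because the intercept column is already in `X`, there is no need to add a constant separately before passing `y` and `X` to `sm.OLS`.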

In [13]:
import statsmodels.api as sm
import patsy

y, X = patsy.dmatrices("bwt_oz ~ smoke + gestation", smoking_and_bw)

model = sm.OLS(y, X).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bwt_oz | R-squared: | 0.145 |
| Model: | OLS | Adj. R-squared: | 0.143 |
| Method: | Least Squares | F-statistic: | 68.74 |
| Date: | Tue, 06 Aug 2024 | Prob (F-statistic): | 2.64e-28 |
| Time: | 11:40:36 | Log-Likelihood: | -3397.9 |
| No. Observations: | 814 | AIC: | 6802.0 |
| Df Residuals: | 811 | BIC: | 6816.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.3847 | 13.449 | -0.029 | 0.977 | -26.785 | 26.015 |
| smoke | -7.4319 | 1.111 | -6.692 | 0.000 | -9.612 | -5.252 |
| gestation | 0.4393 | 0.048 | 9.217 | 0.000 | 0.346 | 0.533 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.849 | Durbin-Watson: | 1.956 |
| Prob(Omnibus): | 0.012 | Jarque-Bera (JB): | 10.993 |
| Skew: | 0.136 | Prob(JB): | 0.0041 |
| Kurtosis: | 3.501 | Cond. No. | 6850.0 |


### Exercise 5

Now let's test for whether there is an interaction between the mother's race and the effect of smoking by adding an interaction term to this regression. 

`mrace` is coded as follows:

```
mrace    Mother's race or ethnicity
         0-5 = White
         6   = Mexican
         7   = Black
         8   = Asian
         9   = Mix
         99  = Unknown
```

As most variation in this data is between "White" and other categories, create a new variable that takes on a value of 1 when the mother is White and 0 otherwise.

(As you can tell, people in the 1960s were not as thoughtful about collecting detailed data on race and ethnicity as most modern researchers, nor did they go out of their way to ensure their data included enough observations of non-White groups to allow detailed sub-population analyses.)

What share (a value between 0 and 1) of mothers in the data are identified as White? 

**Please round your answer to two decimal places.**

In [15]:
smoking_and_bw["mrace"].value_counts(normalize=True)

mrace
0    0.460688
7    0.181818
5    0.111794
3    0.052826
4    0.049140
1    0.040541
8    0.039312
6    0.027027
2    0.019656
9    0.017199
Name: proportion, dtype: float64

In [17]:
smoking_and_bw["white"] = (smoking_and_bw["mrace"] >= 0) & (smoking_and_bw["mrace"] < 6)
smoking_and_bw["white"].value_counts(normalize=True)

white
True     0.734644
False    0.265356
Name: proportion, dtype: float64

> 0.73
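
As a side note, the share of ones in a 0/1 indicator is simply its mean, so the same kind of answer can be read off without `value_counts`. Here is a minimal sketch on a made-up `mrace` column (the codes below are hypothetical), using `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical codes for illustration only.
toy = pd.DataFrame({"mrace": [0, 3, 5, 6, 7, 8, 0, 9]})

# Codes 0 through 5 are White; between() includes both endpoints.
toy["white"] = toy["mrace"].between(0, 5).astype(int)

# The mean of a 0/1 indicator is the share of ones.
white_share = toy["white"].mean()
print(white_share)
```

Note that storing the indicator as an integer rather than a boolean changes the labels that `C(white)` produces in a regression (for example `C(white)[T.1]` rather than `C(white)[T.True]`), so coefficient names used later would need to be adjusted to match.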

### Exercise 6

Now regress birthweight on length of pregnancy, whether the mother smoked during pregnancy, whether the mother was White, and the interaction of whether the mother was White and whether the mother smoked.

(Note: depending on how you write your formula, you may not have to enter all those into the regression explicitly.)
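
The note above refers to formula shorthand: in the formula syntax, `a * b` expands to `a + b + a:b`, where `a:b` is the interaction term on its own. The sketch below checks this on a made-up DataFrame (the values are hypothetical) by comparing the design-matrix column names that `patsy` builds for each spelling.

```python
import pandas as pd
import patsy

# Made-up rows for illustration only (not the study data).
toy = pd.DataFrame(
    {
        "white": [True, False, True, False, True, False],
        "smoke": [0, 0, 1, 1, 0, 1],
    }
)

# Shorthand: main effects plus interaction in one term.
short = patsy.dmatrix("C(white) * smoke", toy, return_type="dataframe")

# Longhand: each piece written out explicitly.
long = patsy.dmatrix("C(white) + smoke + C(white):smoke", toy, return_type="dataframe")

print(list(short.columns))
print(list(short.columns) == list(long.columns))
```

The column names printed here are also the labels under which the fitted coefficients appear in a model's `params`.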

What is the coefficient on `smoke`? 

This time, please extract the coefficient on `smoke` from the model using Python, then round it with `np.round()` (so you can get some practice accessing model coefficients).

**Please round your answer to two decimal places.**

In [21]:
s_and_bw_model3 = smf.ols("bwt_oz ~ gestation + C(white)*smoke", smoking_and_bw).fit()
np.round(s_and_bw_model3.params["smoke"], 2)

-7.59

## Bonus Exercises

What follows are BONUS EXERCISES. You do not have to get these right to pass the quiz for this module as they get into interpretation of interaction terms (which are quite tricky if you haven't taken a linear regression class before).

### Exercise 7

Based on the regression you ran in Exercise 6, is the impact of smoking greater for White mothers than non-White mothers? (We are ignoring the question of whether this difference is statistically significant for the moment, if you know what that means — just focus on the coefficients in the model.)

> The impact is greater for White mothers.
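
To see how the coefficients answer this question, it can help to write the Exercise 6 model out explicitly. The symbols $\beta_0, \dots, \beta_4$ below are notation introduced here for the fitted coefficients; they are not names used by `statsmodels`.

```latex
\begin{aligned}
\widehat{\text{bwt\_oz}} &= \beta_0 + \beta_1\,\text{gestation} + \beta_2\,\text{white}
  + \beta_3\,\text{smoke} + \beta_4\,(\text{white} \times \text{smoke}) \\[4pt]
\text{smoking difference when white} = 0 &: \quad \beta_3 \\
\text{smoking difference when white} = 1 &: \quad \beta_3 + \beta_4
\end{aligned}
```

The coefficient on `smoke` alone therefore describes non-White mothers, and the interaction coefficient $\beta_4$ is the additional difference for White mothers. When $\beta_3$ is negative, a negative $\beta_4$ means smoking is associated with a larger drop in birthweight among White mothers, and a positive $\beta_4$ means a smaller one.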

### Exercise 8

Again, using the regression you fit above, answer the following question.

For White mothers, what is the average difference in the weight of newborns between mothers who do NOT smoke and those who DO smoke, assuming their pregnancies lasted the same amount of time?

Please do not do any math by hand — extract the coefficients from the model using Python.

**Please round your answer to two decimal places.**

In [20]:
s_and_bw_model3.params["smoke"] + s_and_bw_model3.params["C(white)[T.True]:smoke"]

-7.926048814780051

> 7.93