# Regression

In [None]:
import re
from functools import partial
from typing import List, Tuple

In [None]:
import sys
sys.path.append('lib')

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import patsy

In [None]:
import nsfg
import fwf

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from IPython.core.pylabtools import figsize
sns.set_theme()
figsize(11, 5)

In [None]:
r4 = partial(np.round, decimals=4)

## StatsModels

Let's load up the NSFG data again.

In [None]:
live = nsfg.read_live_fem_preg()

In [None]:
live.loc[:, ['totalwgt_lb', 'agepreg']].apply(lambda col: col.isna().sum())

Here's birth weight as a function of mother's age (which we saw in the previous chapter).

In [None]:
model = smf.ols('totalwgt_lb ~ agepreg', data=live)
results = model.fit()
results.summary()

We can extract the parameters, and the p-value of the slope estimate.

In [None]:
def summarize_results(results):
    """Prints the most important parts of linear regression results:

    results: RegressionResults object
    """
    for name, param in results.params.items():
        pvalue = results.pvalues[name]
        print(f'{name:26}: {param:0.4f}: {pvalue:0.4f}')
    try:
        print(f'R^2      : {results.rsquared:0.4f}')
        print(f'Std(ys)  : {results.model.endog.std():0.4f}')
        print(f'Std(res) : {results.resid.std():0.4f}')
    except AttributeError:
        print(f'R^2      : {results.prsquared:0.4f}')

In [None]:
summarize_results(results)

In [None]:
inter = results.params['Intercept']
slope = results.params['agepreg']
slope_pvalue = results.pvalues['agepreg']
r4((inter, slope, slope_pvalue))

And the coefficient of determination.

In [None]:
r4(results.rsquared)

Here are the standard deviations of the dependent variable and of the residuals:

In [None]:
r4(live.totalwgt_lb.std())

In [None]:
r4(results.resid.std())

`std(ys)` is the standard deviation of the dependent variable, which is the RMSE if you have to guess birth weights without the benefit of any explanatory variables. `std(res)` is the standard deviation of the residuals, which is the RMSE if your guesses are informed by the mother’s age. As we have already seen, knowing the mother’s age provides no substantial improvement to the predictions.
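
The link between these two standard deviations and $R^2$ can be checked on synthetic data. The following is a minimal sketch, not part of the NSFG analysis: the arrays `xs` and `ys` and the generating coefficients are made up, and `np.polyfit` stands in for `smf.ols`. It illustrates that $R^2 = 1 - \mathrm{Var(res)} / \mathrm{Var(ys)}$, so a small $R^2$ means `Std(res)` is only slightly smaller than `Std(ys)`.

```python
import numpy as np

rng = np.random.default_rng(17)

# Made-up data: a weak linear signal buried in noise
xs = rng.uniform(15, 45, size=1000)
ys = 6.8 + 0.02 * xs + rng.normal(0, 1.4, size=1000)

# Least-squares line (for deg=1, np.polyfit returns the slope first)
slope, inter = np.polyfit(xs, ys, deg=1)
res = ys - (inter + slope * xs)

std_ys = ys.std()    # RMSE when guessing the mean for everyone
std_res = res.std()  # RMSE when guessing from the fitted line
r2 = 1 - res.var() / ys.var()

print(f'Std(ys)  : {std_ys:0.4f}')
print(f'Std(res) : {std_res:0.4f}')
print(f'R^2      : {r2:0.4f}')
```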

## Multiple regression

In [Chapter 4](04_Cumulative_Distribution_Functions.ipynb) we saw that first babies tend to be lighter than others, and this effect is statistically significant. But it is a strange result because there is no obvious mechanism that would cause first babies to be lighter. So we might wonder whether this relationship is spurious.

In fact, there is a possible explanation for this effect. We have seen that birth weight depends on mother’s age, and we might expect that mothers of first babies are younger than others.

With a few calculations we can check whether this explanation is plausible. Then we’ll use multiple regression to investigate more carefully. First, let’s see how big the difference in weight is:

In [None]:
results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
slope = results.params['agepreg']

In [None]:
live.loc[:, ['birthcat', 'totalwgt_lb', 'agepreg']].groupby('birthcat').mean()

In [None]:
diff_weight = np.diff(live.groupby('birthcat')['totalwgt_lb'].mean()).item()
r4(diff_weight)

First babies are 0.125 lbs lighter, or 2 ounces. And the difference in ages:

In [None]:
diff_age = np.diff(live.groupby('birthcat')['agepreg'].mean()).item()
r4(diff_age)

The mothers of first babies are 3.59 years younger. The linear model we fit above gives the change in birth weight as a function of age: the slope is 0.0175 pounds per year. If we multiply the slope by the difference in ages, we get the expected difference in birth weight for first babies and others, due to mother’s age:

In [None]:
r4(slope * diff_age)

The result is 0.063, just about half of the observed difference. So we conclude, tentatively, that the observed difference in birth weight can be partly explained by the difference in mother’s age.
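
As a quick check of the arithmetic, here is the same calculation with the rounded values quoted in the text. The three constants are copied from the prose above rather than recomputed from the data, so treat this as a sketch of the reasoning only.

```python
slope = 0.0175       # pounds per year, from the linear model
diff_age = 3.59      # years, difference in mean mother's age
diff_weight = 0.125  # pounds, observed difference in mean birth weight

ounces = diff_weight * 16        # 16 ounces per pound
expected = slope * diff_age      # difference in weight attributable to age
fraction = expected / diff_weight

print(f'observed: {ounces:0.1f} oz')
print(f'expected from age: {expected:0.3f} lbs ({fraction:0.0%} of observed)')
```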

Using multiple regression, we can explore these relationships more systematically.

In [None]:
live['isfirst'] = live.birthcat == 'firsts' 
results = smf.ols('totalwgt_lb ~ isfirst', data=live).fit()
summarize_results(results)

Because `isfirst` is a boolean, ols treats it as a categorical variable, which means that the values fall into categories, like True and False, and should not be treated as numbers. The estimated parameter is the effect on birth weight when isfirst is true, so the result, -0.125 lbs, is the difference in birth weight between first babies and others.

The slope and the intercept are statistically significant, which means that they were unlikely to occur by chance, but the $R^2$ value for this model is small, which means that `isfirst` doesn’t account for a substantial part of the variation in birth weight.
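
To see why the parameter of a boolean variable is a difference in means, here is a small sketch on made-up data. The array `is_first` plays the role of `isfirst`, the group means 7.2 and 7.3 are invented, and `np.linalg.lstsq` stands in for `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up weights for two groups
is_first = rng.random(500) < 0.45
weights = np.where(is_first, 7.2, 7.3) + rng.normal(0, 1.4, size=500)

# Design matrix: a column of ones (intercept) and the 0/1 indicator
X = np.column_stack([np.ones(500), is_first.astype(float)])
(inter, coef), *_ = np.linalg.lstsq(X, weights, rcond=None)

diff_means = weights[is_first].mean() - weights[~is_first].mean()
print(f'coefficient: {coef:0.4f}, difference in means: {diff_means:0.4f}')
```

The intercept is the mean of the reference group (here, the rows where the indicator is False), and the coefficient is the offset for the other group.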

The results are similar with agepreg:

In [None]:
results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
summarize_results(results)

Again, the parameters are statistically significant, but $R^2$ is low.

These models confirm results we have already seen. But now we can fit a single model that includes both variables:

In [None]:
results = smf.ols('totalwgt_lb ~ isfirst + agepreg', data=live).fit()
summarize_results(results)

As expected, when we control for mother's age, the apparent difference due to `isfirst` is cut in half.

## Nonlinear relationships

Remembering that the contribution of agepreg might be nonlinear, we might consider adding a variable to capture more of this relationship. One option is to create a column, `agepreg2`, that contains the squares of the ages:

In [None]:
live['agepreg2'] = live.agepreg**2
results = smf.ols('totalwgt_lb ~ isfirst + agepreg + agepreg2', data=live).fit()
summarize_results(results)

Now by estimating parameters for `agepreg` and `agepreg2`, we are effectively fitting a parabola.

The parameter of `agepreg2` is negative, so the parabola curves downward, which is consistent with the shape of the lines in Chapter 10. The quadratic model of `agepreg` accounts for more of the variability in birth weight; the parameter for `isfirst` is smaller in this model, and no longer statistically significant.

Using computed variables like `agepreg2` is a common way to fit polynomials and other functions to data. This process is still considered linear regression, because the dependent variable is a linear function of the explanatory variables, regardless of whether some variables are nonlinear functions of others.
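
Here is a sketch of that idea on made-up data, using NumPy instead of StatsModels. The arrays `age` and `weight` and the generating coefficients are invented for illustration. The squared column is just another column in the design matrix, so ordinary least squares recovers the same coefficients as a direct polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data that follow a downward-curving parabola plus noise
age = rng.uniform(15, 45, size=400)
weight = 5.5 + 0.13 * age - 0.002 * age**2 + rng.normal(0, 0.5, size=400)

# Columns: intercept, age, age squared. The model is linear in the
# coefficients even though the third column is a nonlinear function of age.
X = np.column_stack([np.ones_like(age), age, age**2])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

# np.polyfit solves the same problem; it lists the highest power first
poly = np.polyfit(age, weight, deg=2)

print(beta)
print(poly[::-1])
```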

With the quadratic term in the model, the apparent effect of `isfirst` gets even smaller and is no longer statistically significant.

These results suggest that the apparent difference in weight between first babies and others might be explained by difference in mothers' ages, at least in part.

In this example, mother’s age acts as a control variable; including agepreg in the model “controls for” the difference in age between first-time mothers and others, making it possible to isolate the effect (if any) of isfirst.
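
The effect of a control variable can be illustrated with a simulation. This is a sketch under made-up assumptions, not a model of the NSFG data: in the simulated population, weight depends on age only, and first-time mothers are younger on average. The helper `fit` is hypothetical and uses `np.linalg.lstsq` in place of `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# Made-up population: the true effect of being a first baby is zero
is_first = rng.random(n) < 0.45
age = np.where(is_first, 23.0, 26.5) + rng.normal(0, 5.0, size=n)
weight = 6.0 + 0.05 * age + rng.normal(0, 1.0, size=n)

def fit(*columns):
    """Least-squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *columns])
    beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
    return beta

first = is_first.astype(float)
alone = fit(first)[1]            # coefficient of is_first with no control
controlled = fit(first, age)[1]  # coefficient of is_first with age included

print(f'without control: {alone:0.3f}')
print(f'with control:    {controlled:0.3f}')
```

Without the control, the indicator picks up the age difference and looks like a real effect; with age in the model, its coefficient falls back toward zero.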

## Data Mining

Now suppose that you really want to win the pool. What could you do to improve your chances? Well, the NSFG dataset includes 244 variables about each pregnancy and another 3087 variables about each respondent. Maybe some of those variables have predictive power. To find out which ones are most useful, why not try them all?

Testing the variables in the pregnancy table is easy, but in order to use the variables in the respondent table, we have to match up each pregnancy with a respondent. In theory we could iterate through the rows of the pregnancy table, use the `caseid` to find the corresponding respondent, and copy the values from the respondent table into the pregnancy table. But that would be slow.

We can use `join` to combine variables from the pregnancy and respondent tables.

In [None]:
live.query('prglngth > 30', inplace=True)

In [None]:
resp = nsfg.read_fem_resp().set_index('caseid')

In [None]:
# suffix appended to overlapping columns in the right table
join = live.join(resp, on='caseid', rsuffix='_r')
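
To make the behavior of `join` concrete, here is a toy version with made-up tables. The frames `preg` and `resp_toy` and the column `source` are hypothetical; only `caseid`, `agepreg`, and `educat` echo names from the NSFG data, and the values are invented.

```python
import pandas as pd

# One row per pregnancy; a respondent can appear more than once
preg = pd.DataFrame({
    'caseid': [1, 1, 2],
    'agepreg': [22.0, 25.5, 31.0],
    'source': ['preg'] * 3,
})

# One row per respondent, indexed by caseid
resp_toy = pd.DataFrame({
    'caseid': [1, 2, 3],
    'educat': [12, 16, 14],
    'source': ['resp'] * 3,
}).set_index('caseid')

# For each row of preg, look up its caseid in the index of resp_toy
joined = preg.join(resp_toy, on='caseid', rsuffix='_r')
print(joined)
```

The result has one row per pregnancy. The column that appears in both tables keeps its name for the left table and gets the `_r` suffix for the right one.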

And we can search for variables with explanatory power.

Because we don't clean most of the variables, we are probably missing some good ones.

In [None]:
def go_mining(df: pd.DataFrame):
    """Searches for variables that predict birth weight.

    df: DataFrame of pregnancy records

    returns: list of (rsquared, variable name) pairs
    """
    variables = []
    for name in df.columns:
        try:
            # check that the explanatory variable has some variability
            if df[name].var() < 1e-7:
                continue
            formula = 'totalwgt_lb ~ agepreg + ' + name
            model = smf.ols(formula, data=df)
            # reject models that use less than half of the data
            if model.nobs < len(df)/2:
                continue
            results = model.fit()
            variables.append((round(results.rsquared, 4), name))
        except (ValueError, TypeError, patsy.PatsyError):
            continue
    return variables

For each variable we construct a model, compute $R^2$ , and append the results to a list. The models all include agepreg, since we already know that it has some predictive power.

I check that each explanatory variable has some variability; otherwise the results of the regression are unreliable. I also check the number of observations for each model. Variables that contain a large number of NaNs are not good candidates for prediction.

For most of these variables, we haven’t done any cleaning. Some of them are encoded in ways that don’t work very well for linear regression. As a result, we might overlook some variables that would be useful if they were cleaned properly. But maybe we will find some good candidates.
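
The two screening checks can be tried in isolation on a made-up table. This sketch only approximates what `go_mining` does: it counts the non-missing values in each column directly, whereas `go_mining` counts the rows that remain after the model drops missing data. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'constant': [1.0] * 6,
    'mostly_missing': [1.0, np.nan, np.nan, np.nan, np.nan, 2.0],
    'usable': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

kept = []
for name in df.columns:
    col = df[name]
    # skip columns with (almost) no variability
    if col.var() < 1e-7:
        continue
    # skip columns where fewer than half of the rows have a value
    if col.notna().sum() < len(df) / 2:
        continue
    kept.append(name)

print(kept)
```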

In [None]:
variables = go_mining(join)

In [None]:
variables.sort(reverse=True)

In [None]:
variables[:30]

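The following function reads the Stata dictionary files for the pregnancy and respondent tables, which describe the variables and so help interpret the ones with the highest values of $R^2$.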

In [None]:
def read_variables():
    vars = fwf.read_stata_dictionary('data/2002FemPreg.dct')
    vars.extend(fwf.read_stata_dictionary('data/2002FemResp.dct'))
    return vars

Some of the variables that do well are not useful for prediction because they are not known ahead of time.

Here is a model that combines the variables that seem to have the most explanatory power:

In [None]:
# try adding lbw1
formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '
           'nbrnaliv>1 + paydu==1 + totincr')
results = smf.ols(formula, data=join).fit()
summarize_results(results)

## Logistic regression

As an example of logistic regression, suppose a friend of yours is pregnant and you want to predict whether the baby is a boy or a girl. You could use data from the NSFG to find factors that affect the “sex ratio”, which is conventionally defined to be the probability of having a boy.

Example: suppose we are trying to predict `y` using explanatory variables `x1` and `x2`.

In [None]:
y = np.array([0, 1, 0, 1])
# think of these as feature column vectors
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])

According to the logit model the log odds for the $i$th element of $y$ is

$\log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $

So let's start with an arbitrary guess about the elements of $\beta$:



In [None]:
beta = [-1.5, 2.8, 1.1]

Plugging these values into the model, we get log odds.

In [None]:
log_o = beta[0] + beta[1] * x1 + beta[2] * x2
log_o

Which we can convert to odds.

In [None]:
o = np.exp(log_o)
o

And then convert to probabilities.

In [None]:
p = o / (o+1)
p
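
Combining the last two steps gives the probability directly from the log odds. This is the standard logistic function, written with the same symbols as the model above:

$p = \frac{o}{o + 1} = \frac{e^{\log o}}{e^{\log o} + 1} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$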

The likelihoods of the actual outcomes are $p$ where $y$ is 1 and $1-p$ where $y$ is 0. 

In [None]:
likes = np.where(y, p, 1-p)
likes

The likelihood of $y$ given $\beta$ is the product of `likes`:

In [None]:
like = np.prod(likes)
like

Logistic regression works by searching for the values in $\beta$ that maximize `like`.
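
To give a feel for that search, here is a minimal sketch that wraps the steps above in a function and then improves the arbitrary guess with plain gradient ascent on the log-likelihood. The function `likelihood`, the step size, and the number of iterations are choices made for this illustration; StatsModels uses its own optimizer, not this loop.

```python
import numpy as np

y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])
X = np.column_stack([np.ones(len(y)), x1, x2])  # intercept, x1, x2

def likelihood(beta):
    """Likelihood of y given beta under the logit model."""
    log_o = X @ beta
    p = 1 / (1 + np.exp(-log_o))
    return np.prod(np.where(y, p, 1 - p))

beta = np.array([-1.5, 2.8, 1.1])  # the arbitrary starting guess
start = likelihood(beta)

# Gradient ascent: the gradient of the log-likelihood is X.T @ (y - p)
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ beta)))
    beta = beta + 0.1 * X.T @ (y - p)

print(f'{start:0.4f} -> {likelihood(beta):0.4f}')
```

With only four rows, and `x1` equal to 1 only where `y` is 1, the likelihood keeps creeping upward as the parameter of `x1` grows, so the loop simply stops after a fixed number of steps. Real datasets like the NSFG usually have a well-defined maximum.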

Here's an example using variables in the NSFG respondent file to predict whether a baby will be a boy or a girl.

In [None]:
live['boy'] = (live.babysex==1).astype(int)

The mother's age seems to have a small effect.

In [None]:
model = smf.logit('boy ~ agepreg', data=live)
results = model.fit()
summarize_results(results)

The parameter of `agepreg` is positive, which suggests that older mothers are more likely to have boys, but the p-value is 0.798, which means that the apparent effect could easily be due to chance.

Here’s a model that includes several factors believed to be associated with sex ratio:

In [None]:
formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'
model = smf.logit(formula, data=live)
results = model.fit()
summarize_results(results)

Along with mother’s age, this model includes father’s age at birth (`hpagelb`), birth order (`birthord`), and `race` as a categorical variable.

None of the estimated parameters are statistically significant. The pseudo-$R^2$ value is a little higher, but that could be due to chance.


## Accuracy

To make a prediction, we have to extract the exogenous and endogenous variables.

In [None]:
# dependent variable, or response variable
model.endog_names

In [None]:
# predictors
model.exog_names

The baseline prediction strategy is to guess "boy".  In that case, we're right almost 51% of the time.

In [None]:
actual = model.endog
baseline = actual.mean()
r4(baseline)

Since actual is encoded in binary integers, the mean is the fraction of boys, which is 0.507.

If we use the previous model, we can compute the number of predictions we get right.

In [None]:
predict = (results.predict() >= 0.5)
# multiply by actual yields 1 if we predict a boy and get it right, otherwise 0
true_pos = predict * actual
true_neg = (1 - predict) * (1 - actual)
sum(true_pos), sum(true_neg)

And the accuracy, which is slightly higher than the baseline.

In [None]:
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
r4(acc)
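
The bookkeeping is easier to follow on a handful of made-up values. In this sketch the arrays are invented (1 means boy, 0 means girl); it shows that the sum of true positives and true negatives, divided by the number of cases, is the fraction of predictions that match the outcomes.

```python
import numpy as np

# Made-up predictions and outcomes
predict = np.array([1, 1, 0, 1, 0, 1])
actual = np.array([1, 0, 0, 1, 1, 1])

true_pos = predict * actual              # 1 where we predict 1 and are right
true_neg = (1 - predict) * (1 - actual)  # 1 where we predict 0 and are right

acc = (true_pos.sum() + true_neg.sum()) / len(actual)
same = (predict == actual).mean()        # equivalent shortcut

print(true_pos.sum(), true_neg.sum(), acc)
```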

The result is 0.513, slightly better than the baseline, 0.507. But you should not take this result too seriously. We used the same data to build and test the model, so the model may not have predictive power on new data.

To make a prediction for an individual, we have to get their information into a `DataFrame`.

In [None]:
columns = ['agepreg', 'hpagelb', 'birthord', 'race']
new = pd.DataFrame([[35, 39, 3, 2]], columns=columns)
y = results.predict(new)
y

This person has a 51% chance of having a boy (according to the model).

## Exercises

**Exercise:** Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

The following are the only variables I found that have a statistically significant effect on pregnancy length.

In [None]:
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=live)
results = model.fit()
summarize_results(results)

**Exercise:** The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis

Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.

As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?

In [None]:
join['boy'] = (join.babysex==1).astype(int)

In [None]:
def go_mining(df):
    """Searches for variables that predict the sex of the baby.

    df: DataFrame of pregnancy records

    returns: list of (pseudo-rsquared, variable name) pairs
    """
    variables = []
    for name in df.columns:
        try:
            # check that the explanatory variable has some variability
            if df[name].var() < 1e-7:
                continue

            formula = 'boy ~ agepreg + ' + name
            model = smf.logit(formula, data=df)
            # reject models that use less than half of the data
            nobs = len(model.endog)
            if nobs < len(df)/2:
                continue
            results = model.fit()
            variables.append((results.prsquared, name))
        except Exception:
            # skip variables the model cannot handle
            continue
    return variables

In [None]:
# Solution

#Here are the 30 variables that yield the highest pseudo-R^2 values.

variables = go_mining(join)

In [None]:
variables.sort(reverse=True)
for rsq, name in variables[:30]:
    print(f'{name:20}: {rsq:0.5f}')

In [None]:
# Solution

# Eliminating variables that are not known during pregnancy and 
# others that are fishy for various reasons, here's the best model I could find:

formula='boy ~ agepreg + fmarout5==5 + infever==1'
model = smf.logit(formula, data=join)
results = model.fit()
summarize_results(results)

**Exercise:** If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called `numbabes`.

Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?

In [None]:
# Solution

# I used a nonlinear model of age.

# assign the result back; an in-place replace on a selected column may not update the DataFrame
join['numbabes'] = join.numbabes.replace(97, np.nan)
join['age2'] = join.age_r**2

In [None]:
# Solution

formula = 'numbabes ~ age_r + age2 + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
summarize_results(results)

Now we can predict the number of children for a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000:

In [None]:
# Solution

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 35**2, 1, 14, 16]], columns=columns)
results.predict(new)

**Exercise:** If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc.?

In [None]:
# Solution

# Here's the best model I could find.

formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary()

Make a prediction for a woman who is 25 years old, white, and a high
school graduate whose annual household income is about $45,000.

In [None]:
# Solution

# This person has a 75% chance of being currently married, 
# a 13% chance of being "not married but living with opposite 
# sex partner", etc.

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)