# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np
import pandas as pd

import random

import thinkstats2
import thinkplot

## Exercises

**Exercise 11.1:** Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [2]:
import first
live, firsts, others = first.MakeFrames()
live = live[live.prglngth>30]

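The following are the only variables I found that are associated with pregnancy length. In the fit below, `race==1` and `nbrnaliv>1` are statistically significant at the 5% level, while `birthord>2` is not (p = 0.155).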

In [3]:
import statsmodels.formula.api as smf
model_prglngth = smf.ols('prglngth ~ birthord>2 + race==1 + nbrnaliv>1', data=live)
results_prglngth = model_prglngth.fit()
results_prglngth.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | prglngth | R-squared: | 0.011 |
| Model: | OLS | Adj. R-squared: | 0.01 |
| Method: | Least Squares | F-statistic: | 31.55 |
| Date: | Sun, 01 Nov 2020 | Prob (F-statistic): | 2.82e-20 |
| Time: | 12:33:12 | Log-Likelihood: | -18252.0 |
| No. Observations: | 8884 | AIC: | 36510.0 |
| Df Residuals: | 8880 | BIC: | 36540.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 38.9473 | 0.025 | 1549.158 | 0.000 | 38.898 | 38.997 |
| birthord > 2[T.True] | -0.0711 | 0.050 | -1.423 | 0.155 | -0.169 | 0.027 |
| race == 1[T.True] | -0.1229 | 0.046 | -2.689 | 0.007 | -0.212 | -0.033 |
| nbrnaliv > 1[T.True] | -1.4962 | 0.165 | -9.092 | 0.000 | -1.819 | -1.174 |

| | | | |
|---|---|---|---|
| Omnibus: | 1569.474 | Durbin-Watson: | 1.621 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6083.936 |
| Skew: | -0.842 | Prob(JB): | 0.0 |
| Kurtosis: | 6.688 | Cond. No. | 8.75 |


In [4]:
# Assume this is the colleague's second birth (birthord=2), race is coded 1 (black), and a single baby is expected (nbrnaliv=1)

columns = ['birthord', 'race', 'nbrnaliv']
new_prglngth = pd.DataFrame([[2, 1, 1]], columns=columns)
pred_prglngth = results_prglngth.predict(new_prglngth)

print(f"Predicted pregnancy length for the suggested office colleague: {pred_prglngth[0]} weeks")

Predicted pregnancy length for the suggested office colleague: 38.824476674243634 weeks
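As a sanity check, the prediction can be reproduced by hand from the coefficient table above. The sketch below is not part of the original notebook: the helper name `predict_prglngth` is hypothetical, and the coefficients are the rounded values from the summary, so the result matches the model output only to about three decimal places.

```python
# Coefficients copied from the OLS summary table (rounded to 4 decimals).
coef = {
    "Intercept": 38.9473,
    "birthord > 2": -0.0711,
    "race == 1": -0.1229,
    "nbrnaliv > 1": -1.4962,
}

def predict_prglngth(birthord, race, nbrnaliv):
    """Hypothetical helper: start from the intercept and add each
    coefficient whose boolean condition is true for this case."""
    weeks = coef["Intercept"]
    weeks += coef["birthord > 2"] * (birthord > 2)
    weeks += coef["race == 1"] * (race == 1)
    weeks += coef["nbrnaliv > 1"] * (nbrnaliv > 1)
    return weeks

# Same inputs as the prediction cell: second birth, race coded 1, single baby.
manual = predict_prglngth(birthord=2, race=1, nbrnaliv=1)
print(round(manual, 2))  # 38.82
```

Only the `race == 1` term is active for these inputs, so the prediction is simply the intercept minus 0.1229 weeks.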


**Based on these results, if the 30th week is completed on Halloween (10/31/2020), the baby would be expected about 8.82 weeks (roughly 62 days) later, i.e. on New Year's Day (01/01/2021).**
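The date arithmetic can be checked with the standard library. This is a minimal sketch that takes 10/31/2020 as the end of week 30, as assumed in the text; the variable names are illustrative, and the remaining time is rounded to whole days because `date` arithmetic ignores fractions of a day.

```python
from datetime import date, timedelta

week30_end = date(2020, 10, 31)        # assumed end of the 30th week
predicted_weeks = 38.824476674243634   # model output from the cell above

# Weeks still to go, converted to the nearest whole number of days.
remaining_weeks = predicted_weeks - 30
remaining_days = round(remaining_weeks * 7)

due = week30_end + timedelta(days=remaining_days)
print(remaining_days, due)  # 62 2021-01-01
```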

**Exercise 11.3:** If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let’s use it to predict how many children a woman has born; in the NSFG dataset, this variable is called `numbabes`.

Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has born?

In [5]:
import nsfg

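# `live` is already restricted to prglngth > 30 in cell 2, so no further filtering is needed here.
resp = nsfg.ReadFemResp()
resp.index = resp.caseid  # index respondents by caseid so the join can match on it
join = live.join(resp, on='caseid', rsuffix='_r')  # clashing respondent columns get the suffix '_r'

# Keep only rows where the response variable is known
numbabes_df = join.dropna(subset=["numbabes"])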

In [6]:
# Variables used for prediction: age, race, total income and education
formula = 'numbabes ~ age_r + C(race) + totincr + educat'

# Fit on the frame with missing `numbabes` values dropped
model_babes = smf.poisson(formula, data=numbabes_df)
results_babes = model_babes.fit()
results_babes.summary()

Optimization terminated successfully.
         Current function value: 1.687055
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | numbabes | No. Observations: | 8884.0 |
| Model: | Poisson | Df Residuals: | 8878.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Sun, 01 Nov 2020 | Pseudo R-squ.: | 0.03109 |
| Time: | 12:33:32 | Log-Likelihood: | -14988.0 |
| converged: | True | LL-Null: | -15469.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.106e-205 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.0842 | 0.045 | 23.995 | 0.000 | 0.996 | 1.173 |
| C(race)[T.2] | -0.1398 | 0.015 | -9.464 | 0.000 | -0.169 | -0.111 |
| C(race)[T.3] | -0.0914 | 0.025 | -3.717 | 0.000 | -0.140 | -0.043 |
| age_r | 0.0208 | 0.001 | 20.474 | 0.000 | 0.019 | 0.023 |
| totincr | -0.0179 | 0.002 | -9.442 | 0.000 | -0.022 | -0.014 |
| educat | -0.0443 | 0.003 | -15.139 | 0.000 | -0.050 | -0.039 |


Now we can predict the number of children for a woman who is 35 years old, black, and a college
graduate whose annual household income exceeds $75,000.

In [7]:
# age 35, race coded 1 (black), totincr=14 (income above $75,000), educat=16 (college graduate)
columns = ['age_r', 'race', 'totincr', 'educat']
pred_df = pd.DataFrame([[35, 1, 14, 16]], columns=columns)
pred_babes = results_babes.predict(pred_df)
print(f"Predicted number of children for the woman described above: {pred_babes[0]}")

Predicted number of children for the woman described above: 2.3421823755364537


**Based on these results, the model predicts about 2.34 children on average for a woman with these attributes, i.e. roughly 2 children.**
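Poisson regression models the log of the expected count as a linear function of the predictors, so the prediction is the exponential of the linear score. The sketch below recomputes it by hand from the rounded coefficients in the summary table. The helper name `predict_numbabes` is hypothetical, and because the coefficients are rounded, the result agrees with the model output only to about two decimal places.

```python
import math

# Coefficients copied from the Poisson summary table (rounded).
coef = {
    "Intercept": 1.0842,
    "C(race)[T.2]": -0.1398,
    "C(race)[T.3]": -0.0914,
    "age_r": 0.0208,
    "totincr": -0.0179,
    "educat": -0.0443,
}

def predict_numbabes(age_r, race, totincr, educat):
    """Hypothetical helper: exp of the linear score. Race 1 is the
    reference category, so it contributes no extra term."""
    log_rate = coef["Intercept"]
    log_rate += coef["age_r"] * age_r
    log_rate += coef["totincr"] * totincr
    log_rate += coef["educat"] * educat
    if race == 2:
        log_rate += coef["C(race)[T.2]"]
    elif race == 3:
        log_rate += coef["C(race)[T.3]"]
    return math.exp(log_rate)

# Same inputs as the prediction cell above.
manual_babes = predict_numbabes(age_r=35, race=1, totincr=14, educat=16)
print(manual_babes)
```

The hand-computed value is within about 0.01 of the 2.342 reported by `results_babes.predict`; the small gap comes from coefficient rounding.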

**Exercise 11.4:** If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc?

In [8]:
# Variables used for predictions -> age, race, total income and education

formula='rmarital ~ age_r + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.087603
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | rmarital | No. Observations: | 8884.0 |
| Model: | MNLogit | Df Residuals: | 8854.0 |
| Method: | MLE | Df Model: | 25.0 |
| Date: | Sun, 01 Nov 2020 | Pseudo R-squ.: | 0.1655 |
| Time: | 12:33:33 | Log-Likelihood: | -9662.3 |
| converged: | True | LL-Null: | -11579.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| rmarital=2 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.4532 | 0.279 | 15.977 | 0.000 | 3.907 | 5.000 |
| C(race)[T.2] | -0.9219 | 0.089 | -10.409 | 0.000 | -1.095 | -0.748 |
| C(race)[T.3] | -0.6334 | 0.136 | -4.674 | 0.000 | -0.899 | -0.368 |
| age_r | -0.0570 | 0.006 | -9.754 | 0.000 | -0.068 | -0.046 |
| totincr | -0.1302 | 0.012 | -11.298 | 0.000 | -0.153 | -0.108 |
| educat | -0.2051 | 0.019 | -11.017 | 0.000 | -0.242 | -0.169 |

| rmarital=3 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -4.5432 | 0.916 | -4.960 | 0.000 | -6.338 | -2.748 |
| C(race)[T.2] | -0.4405 | 0.236 | -1.865 | 0.062 | -0.904 | 0.023 |
| C(race)[T.3] | 0.0329 | 0.335 | 0.098 | 0.922 | -0.623 | 0.689 |
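In a multinomial logit fit, each block of coefficients gives the log odds of that category relative to the reference category, which here is `rmarital=1` (married). The sketch below uses the rounded `rmarital=2` coefficients to compute the odds of cohabiting versus being married for the woman in the exercise. It assumes the same encoding as the prediction cell that follows (age 25, race 2, `totincr` 11, `educat` 12); the helper name is hypothetical.

```python
import math

# Coefficients for rmarital=2, copied from the table above (rounded).
coef_cohab = {
    "Intercept": 4.4532,
    "C(race)[T.2]": -0.9219,
    "C(race)[T.3]": -0.6334,
    "age_r": -0.0570,
    "totincr": -0.1302,
    "educat": -0.2051,
}

def cohab_vs_married_odds(age_r, race, totincr, educat):
    """Hypothetical helper: exp of the linear score for rmarital=2,
    i.e. P(rmarital=2) / P(rmarital=1)."""
    score = coef_cohab["Intercept"]
    score += coef_cohab["age_r"] * age_r
    score += coef_cohab["totincr"] * totincr
    score += coef_cohab["educat"] * educat
    if race == 2:
        score += coef_cohab["C(race)[T.2]"]
    elif race == 3:
        score += coef_cohab["C(race)[T.3]"]
    return math.exp(score)

odds = cohab_vs_married_odds(age_r=25, race=2, totincr=11, educat=12)
print(odds)
```

The result is roughly 0.167, which is consistent with the ratio of the first two predicted probabilities reported by `results.predict` (0.125474 / 0.748384 ≈ 0.168).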


Make a prediction for a woman who is 25 years old, white, and a high
school graduate whose annual household income is about $45,000.

In [9]:
columns = ['age_r', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 2, 11, 12]], columns=columns)
results.predict(new)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.748384 | 0.125474 | 0.001103 | 0.035295 | 0.023813 | 0.065931 |


**Based on these results, the predicted probabilities for a woman with these attributes are approximately: 74.84% married, 12.55% unmarried but living with an opposite-sex partner, 0.11% widowed, 3.53% divorced, 2.38% separated from her spouse because they are not getting along, and 6.59% never married. So at age 25 she is most likely married or living with a partner and unlikely to be separated, as expected.**
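The columns of the prediction output are numbered 0 to 5, which makes them easy to misread. A small sketch like the following attaches labels to them, assuming the columns follow the `rmarital` codes 1 to 6 in the order listed in the exercise; the label strings are shorthand, not NSFG's official wording.

```python
# Assumed order of rmarital codes 1..6, taken from the exercise text.
labels = ["married", "cohabiting", "widowed",
          "divorced", "separated", "never married"]

# Probabilities copied from the prediction output above.
probs = [0.748384, 0.125474, 0.001103, 0.035295, 0.023813, 0.065931]

for label, p in zip(labels, probs):
    print(f"{label:>13}: {p:6.2%}")

total = sum(probs)                               # should be very close to 1
most_likely = labels[probs.index(max(probs))]    # category with the largest probability
print(most_likely)  # married
```

Checking that the probabilities sum to 1 is a quick way to confirm that no category was dropped when copying the output.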