DSC530: Week 9
9.2 Exercise
Marty Hoehler
5-12-24

# Exercise 11-1

First, we'll download data and import libraries as in weeks prior.

In [1]:
import numpy as np
import pandas as pd

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/scatter.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py")

import nsfg
import thinkstats2
import thinkplot
import first
import statsmodels.formula.api as smf

live, firsts, others = first.MakeFrames()

I opened the 2002FemPreg.dct file to review all the fields it contains and settled on the following list to check.  I'm pulling in anything a co-worker would have a reasonable chance of knowing.  How much you actually know will depend on how close you are to this person.
- multbrth:  (I figure that it might be common knowledge at the office if they're expecting twins or more.)
- hpageend:  (Father's age... I imagine I might be able to at least guess the father's age.  But this is not always populated, so we'll exclude it.)
- postsmks:  (You'd probably know if she was going out for smoke breaks...)
- workpreg:  (You'd obviously know if she's going to work...)
- cmlastlb:  (You'd probably know when her last child was born.  I plan to subtract this from "CM date of conception" to get the number of months since the last birth.)
- cmfstprg:  (You'd probably know when her first child was born.)
- birthord:  (You'd probably know which number child this was.)
- agepreg:  (You could probably guess, at least, if you don't know.)
- datecon:  (Using this to calculate months since the previous and first pregnancies.)
- fmarout5:  (Marital status - this is likely known.)
- educat:  (Years of schooling.)
- hieduc:  (Highest degree.)
- race
- insuranc:  (You likely know what your company's plan is.  It's a guess whether they're on it, I suppose.)
- religion:  (You might know this.)
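
The month differences planned in the list above rely on the NSFG "century month" (CM) coding. Here is a minimal sketch, assuming the usual convention that a CM code is (year - 1900) * 12 + month, so that code 1201 is January 2000. The function names and the sample dates are hypothetical and only illustrate the arithmetic:

```python
def cm_to_year_month(cm):
    """Convert a century-month code to (year, month).

    Assumes CM = (year - 1900) * 12 + month, with month in 1..12.
    """
    year = 1900 + (cm - 1) // 12
    month = (cm - 1) % 12 + 1
    return year, month


def months_between(cm_later, cm_earlier):
    """CM codes are plain month counts, so a difference is a number of months."""
    return cm_later - cm_earlier


# Hypothetical example: last birth in March 1998, conception in January 2000.
cm_last_birth = (1998 - 1900) * 12 + 3
cm_conception = (2000 - 1900) * 12 + 1

print(cm_to_year_month(cm_conception))
print(months_between(cm_conception, cm_last_birth))
```

Because the codes are already month counts, subtracting two CM columns gives an interval in months directly, with no date parsing needed.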



In [2]:
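# Months between this conception and the first pregnancy / the last birth.
# These are not columns of `live`; patsy looks the names up in the calling scope.
mthsfirst = live.datecon - live.cmfstprg
mthslast = live.datecon - live.cmlastlb

formula = 'prglngth ~ multbrth + postsmks + workpreg + mthsfirst + mthslast + birthord + agepreg + fmarout5 + educat + hieduc + race + insuranc + religion'
model1 = smf.ols(formula, data=live)
results = model1.fit()

print(results.summary())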

                            OLS Regression Results                            
Dep. Variable:               prglngth   R-squared:                       0.451
Model:                            OLS   Adj. R-squared:                  0.285
Method:                 Least Squares   F-statistic:                     2.717
Date:                Sat, 11 May 2024   Prob (F-statistic):            0.00692
Time:                        18:01:31   Log-Likelihood:                -143.92
No. Observations:                  57   AIC:                             315.8
Df Residuals:                      43   BIC:                             344.4
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.1084      5.495      2.568      0.0

Based on this, several variables come through with a low p-value.  However, the model carries so many variables that the adjusted R<sup>2</sup> (0.285) falls well below the raw R<sup>2</sup> (0.451).  I'll remove some of the variables with high p-values to see if the adjusted R<sup>2</sup> improves.

In [3]:
formula = 'prglngth ~   multbrth + postsmks + mthsfirst + mthslast + agepreg + fmarout5 + race'
model1 =smf.ols(formula, data=live)
results = model1.fit()
    
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               prglngth   R-squared:                       0.443
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                     5.566
Date:                Sat, 11 May 2024   Prob (F-statistic):           9.33e-05
Time:                        18:01:31   Log-Likelihood:                -144.33
No. Observations:                  57   AIC:                             304.7
Df Residuals:                      49   BIC:                             321.0
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.1613      4.382      3.003      0.0

I think this list of variables makes sense.  
- Multiple birth situations tend to happen early.  
- Smoking has proven to have an effect on pregnancies.  
- Months since last birth and months since first birth ended up being helpful, to the point that I was able to remove "birth order" from the equation.  (I imagine that whether mthslast is null is basically a replacement for birth order.)
- Agepreg had a minimal effect on length of pregnancy in our studies, but added in with the other variables, it does seem to contribute.
- Marital status seems to contribute as well.
- I knew from the provided resources that race would be on the list.
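
The comparison between the two models rests on adjusted R<sup>2</sup>, which penalizes each extra explanatory variable. As a quick check, the sketch below recomputes it from the R<sup>2</sup>, observation count, and model degrees of freedom reported in the two summaries, using the standard formula 1 - (1 - R<sup>2</sup>)(n - 1)/(n - k - 1). The helper name is hypothetical:

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors slopes."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values copied from the two OLS summaries (57 observations in both).
full_model = adjusted_r_squared(0.451, n_obs=57, n_predictors=13)
reduced_model = adjusted_r_squared(0.443, n_obs=57, n_predictors=7)

print(round(full_model, 3))
print(round(reduced_model, 3))
```

Dropping six variables costs less than 0.01 of raw R<sup>2</sup>, but with only 57 observations the smaller penalty raises the adjusted value noticeably.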


The provided resources, though, include the author's code for analyzing the joined data sets.  I'll adapt it to check whether we missed any variables.

In [4]:
# This imports the respondent data and joins it to our pregnancy data.
import patsy

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz")

live = live[live.prglngth>30]
resp = nsfg.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')
join.shape


(8884, 3331)

In [5]:
import patsy

# This function from the text has been modified to mine for variables that predict pregnancy length.
def GoMining(df):
    """Searches for variables that predict pregnancy length.

    df: DataFrame of pregnancy records

    returns: list of (rsquared, variable name) pairs
    """
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue
            # Changing the formula here to look for variables that predict pregnancy length.
            formula = 'prglngth ~ ' + name
            model = smf.ols(formula, data=df)
            if model.nobs < len(df)/2:
                continue

            results = model.fit()
        except (ValueError, TypeError, patsy.PatsyError) as e:
            continue
        
        variables.append((results.rsquared, name))

    return variables
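
For intuition about what GoMining collects: with a single explanatory variable, the R<sup>2</sup> of a least-squares fit equals the squared Pearson correlation between that variable and the response. The sketch below checks this on a tiny made-up data set using only the standard library. The function name and the numbers are hypothetical and are not NSFG values:

```python
def simple_r_squared(xs, ys):
    """Return (R^2 from the fitted line, squared correlation) for y ~ x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Least-squares line and its residual sum of squares.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    r2_from_fit = 1 - ss_res / syy
    r2_from_corr = sxy ** 2 / (sxx * syy)
    return r2_from_fit, r2_from_corr


# Made-up data: weeks of gestation against some numeric field.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [38.0, 39.0, 39.0, 40.0, 38.0, 41.0]

r2_fit, r2_corr = simple_r_squared(xs, ys)
print(r2_fit, r2_corr)
```

The division by `sxx` also shows why GoMining skips columns with near-zero variance: a constant column has no slope to estimate.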

In [6]:
variables = GoMining(join)

In [7]:
# The author provides this code to bring in descriptions of all fields and return values with highest R-squared

import re

def ReadVariables():
    """Reads Stata dictionary files for NSFG data.

    returns: DataFrame that maps variables names to descriptions
    """
    vars1 = thinkstats2.ReadStataDct('2002FemPreg.dct').variables
    vars2 = thinkstats2.ReadStataDct('2002FemResp.dct').variables

    all_vars = pd.concat([vars1, vars2])
    all_vars.index = all_vars.name
    return all_vars

def MiningReport(variables, n=30):
    """Prints variables with the highest R^2.

    variables: list of (R^2, variable name) pairs
    n: number of pairs to print
    """
    all_vars = ReadVariables()

    variables.sort(reverse=True)
    for r2, name in variables[:n]:
        key = re.sub('_r$', '', name)
        try:
            desc = all_vars.loc[key].desc
            if isinstance(desc, pd.Series):
                desc = desc.iloc[0]
            print(name, r2, desc)
        except (KeyError, IndexError):
            print(name, r2)

In [8]:

MiningReport(variables)

prglngth 1.0 DURATION OF COMPLETED PREGNANCY IN WEEKS
wksgest 0.8062434116139234 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN WEEKS)
totalwgt_lb 0.12445743148120247
birthwgt_lb 0.11977307804917126 BD-3 BIRTHWEIGHT IN POUNDS - 1ST BABY FROM THIS PREGNANCY
lbw1 0.10372542204583435 LOW BIRTHWEIGHT - BABY 1
mosgest 0.09562431989592779 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN MONTHS)
prglngth_i 0.022053775796470054 PRGLNGTH IMPUTATION FLAG
canhaver 0.0060504952681986746 DF-1 PHYSICALLY DIFFICULT FOR R TO HAVE A BABY
datcon01_i 0.005817755299875937 DATCON01 IMPUTATION FLAG
con1mar1_i 0.005546376136239095 CON1MAR1 IMPUTATION FLAG
nbrnaliv 0.004577565785532922 BC-2 NUMBER OF BABIES BORN ALIVE FROM THIS PREGNANCY
mar1con1_i 0.0031508022538594416 MAR1CON1 IMPUTATION FLAG
anynurse 0.002452024883708881 BH-1 WHETHER R BREASTFED THIS CHILD AT ALL - 1ST FROM THIS PREG
bfeedwks 0.0023691839446681184 DURATION OF BREASTFEEDING IN WEEKS
pregend1 0.002249389433801041 BC-1 HOW PREGNANCY ENDED - 1ST M

Looking through these fields, I'm not seeing any that I would imagine knowing as a co-worker.  So the variables I selected above are the best that I could find.


# Exercise 11-2

For this exercise, the variable studied is a Boolean, so we will use a logistic regression model.  We will edit the "GoMining" function above to use a logistic regression. 
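
As a reminder of what the logistic model estimates: it treats the log-odds of the outcome as a linear function of the explanatory variables and maps that value to a probability with the logistic function. A minimal sketch follows. The intercept and slope are hypothetical and are not fitted to the NSFG data:

```python
import math


def logistic(log_odds):
    """Map log-odds to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))


def predict_probability(intercept, slope, x):
    """Probability implied by a one-variable logistic model."""
    return logistic(intercept + slope * x)


# Hypothetical coefficients, for illustration only.
intercept, slope = 0.0, 0.005

print(logistic(0.0))                               # log-odds of 0 is an even chance
print(predict_probability(intercept, slope, 25))   # slightly above 0.5
print(math.exp(slope))                             # odds ratio per unit of x
```

Exponentiating a coefficient gives the odds ratio for a one-unit change in that variable, which is often easier to read than the raw coefficient.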


In [9]:
def GoMiningBoy(df):
    """Searches for variables that predict the sex of the baby.

    df: DataFrame of pregnancy records

    returns: list of (pseudo R^2, variable name) pairs
    """
    
    # This code adds a column to the dataframe to capture if the sex of the child is male.
    df['boy'] = (df.babysex==1).astype(int)
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue

            formula='boy ~ agepreg + ' + name
            model = smf.logit(formula, data=df)
            nobs = len(model.endog)
            if nobs < len(df)/2:
                continue

            results = model.fit()
        except Exception:
            continue

        variables.append((results.prsquared, name))

    return variables

variablesBoy = GoMiningBoy(join)

Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
(The same message repeats for each candidate variable.  A few fits instead report "Iterations: 35" with no success message, meaning the optimizer stopped at its iteration limit.)

In [10]:
MiningReport(variablesBoy)

totalwgt_lb 0.009696855253233383
birthwgt_lb 0.009274460080281988 BD-3 BIRTHWEIGHT IN POUNDS - 1ST BABY FROM THIS PREGNANCY
constat3 0.0010985419170438382 3RD PRIORITY CODE FOR CURRENT CONTRACEPTIVE STATUS
lbw1 0.0010519527860075595 LOW BIRTHWEIGHT - BABY 1
nplaced 0.001010368752280555 # OF R'S BIO CHILDREN SHE PLACED FOR ADOPTION (BASED ON BPA)
fmarout5 0.0009096579032891183 FORMAL MARITAL STATUS AT PREGNANCY OUTCOME
rmarout6 0.000818252143711895 INFORMAL MARITAL STATUS AT PREGNANCY OUTCOME - 6 CATEGORIES
infever 0.0008115919859909004 EVER USED INFERTILITY SERVICES OF ANY KIND
frsteatd 0.0007675331422082321 AGE (IN MOS) WHEN 1ST SUPPLEMENTED - 1ST FROM THIS PREG
splstwk1 0.0007334122339932581 IF-1 H/P DOING WHAT LAST WEEK (EMPLOYMENT STATUS) 1ST MENTION
pmarpreg 0.0007245809157658822 WHETHER PREGNANCY ENDED BEFORE R'S 1ST MARRIAGE (PREMARITALLY)
usefstp 0.0007122387685902787 EF-3 USE METHOD AT FIRST SEX WITH 1ST PARTNER IN PAST 12 MONTHS?
outcom02 0.0007015744602576479 OUTCOME OF PREG

Of these fields, the ones that stand out as possible contenders for a model are:
- Marital status ('fmarout5') and infertility services ('infever'), which are both high on the list.
- "Born outside of US" ('brnout'), which could theoretically be a contender.  If the theory we're testing is that the mother's experience can affect the sex of the child, perhaps the experience of immigration would fit that theory.
- One of the age flags is also coming through.  Since we've worked with 'agepreg' often, we'll check that one too.

We'll build a formula as in Exercise 11-1 to get a summary of these variables.

In [11]:

formula = 'boy ~ agepreg + fmarout5==5 + infever==1 + brnout'
model = smf.logit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 0.691321
         Iterations 4


Dep. Variable:                    boy   No. Observations:               8884.0
Model:                          Logit   Df Residuals:                   8879.0
Method:                           MLE   Df Model:                          4.0
Date:                Sat, 11 May 2024   Pseudo R-squ.:                0.002451
Time:                        18:03:04   Log-Likelihood:                -6141.7
converged:                       True   LL-Null:                       -6156.8
Covariance Type:            nonrobust   LLR p-value:                 4.503e-06

                            coef    std err          z      P>|z|     [0.025     0.975]
---------------------------------------------------------------------------------------
Intercept                -0.0066      0.130     -0.051      0.960     -0.262      0.249
fmarout5 == 5[T.True]     0.1651      0.049      3.351      0.001      0.069      0.262
infever == 1[T.True]      0.2320      0.065      3.559      0.000      0.104      0.360
agepreg                   0.0048      0.004      1.115      0.265     -0.004      0.013
brnout                   -0.0410      0.013     -3.130      0.002     -0.067     -0.015


The pseudo R-squared for this model is a very low 0.002451, so it has almost no predictive power.  In the provided solution, the author does not use 'brnout' (his pseudo R-squared was 0.001653), so this model is very slightly better with 'brnout' included.  However, adjusting for the number of variables may erase that difference.
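
To see where these summary numbers come from, the sketch below recomputes them from the two log-likelihoods printed above. It assumes the reported value is McFadden's pseudo R-squared, 1 - LL/LL-Null, and that the LLR p-value is a chi-square test on twice the log-likelihood difference. Because the inputs are the rounded values from the summary, the results match only approximately. The helper name is hypothetical:

```python
import math

# Values copied from the logit summary.
n_obs = 8884
ll_model = -6141.7
ll_null = -6156.8
df_model = 4


def chi2_sf_even_df(x, df):
    """Chi-square survival function, valid for even df only.

    For df = 2m: exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


pseudo_r2 = 1 - ll_model / ll_null
llr_statistic = 2 * (ll_model - ll_null)
llr_p_value = chi2_sf_even_df(llr_statistic, df_model)

# Average negative log-likelihood per observation.
model_loss = -ll_model / n_obs
null_loss = -ll_null / n_obs

print(pseudo_r2, llr_p_value)
print(model_loss, null_loss, math.log(2))
```

The per-observation loss is the "Current function value" printed by the optimizer. A coin flip scores ln 2, about 0.6931, so a value of 0.6913 shows that the model is statistically distinguishable from the null but barely better than guessing.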


# Exercise 11-3

For this exercise, we're looking to predict the number of children ('numbabes').  We're given age, race, education, and household income.
- Age = 35
- Race = 1
- Education 'educat' = 16
    - "College Graduate" would be 16 years of school.
- Total Income = 14.  
    - The total income goes from 1-14 with ranges of about 5k.  So over $75,000 would be the highest number, '14'


In [12]:
formula = 'numbabes ~ age_r + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.687055
         Iterations 5


Dep. Variable:               numbabes   No. Observations:               8884.0
Model:                        Poisson   Df Residuals:                   8878.0
Method:                           MLE   Df Model:                          5.0
Date:                Sat, 11 May 2024   Pseudo R-squ.:                 0.03109
Time:                        18:03:04   Log-Likelihood:               -14988.0
converged:                       True   LL-Null:                      -15469.0
Covariance Type:            nonrobust   LLR p-value:                1.106e-205

                    coef    std err          z      P>|z|     [0.025     0.975]
-------------------------------------------------------------------------------
Intercept         1.0842      0.045     23.995      0.000      0.996      1.173
C(race)[T.2]     -0.1398      0.015     -9.464      0.000     -0.169     -0.111
C(race)[T.3]     -0.0914      0.025     -3.717      0.000     -0.140     -0.043
age_r             0.0208      0.001     20.474      0.000      0.019      0.023
totincr          -0.0179      0.002     -9.442      0.000     -0.022     -0.014
educat           -0.0443      0.003    -15.139      0.000     -0.050     -0.039


In [13]:
columns = ['age_r', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 1, 14, 16]], columns=columns)
results.predict(new)

0    2.342182
dtype: float64

Based on this, the model predicts that the woman has about 2.34 children.  (One thing I don't know is whether this data set includes women who have never been pregnant.  That would be worth further study.)
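
The prediction can be reproduced by hand, because a Poisson regression models the log of the expected count as a linear function of the inputs. The sketch below plugs the rounded coefficients from the summary table into that formula. Race 1 is the reference category, so it adds no term. Because the coefficients are rounded to four decimals, the result agrees with `results.predict` only to about two decimal places:

```python
import math

# Rounded coefficients copied from the Poisson summary table.
coef = {
    "Intercept": 1.0842,
    "age_r": 0.0208,
    "totincr": -0.0179,
    "educat": -0.0443,
}

# The woman described in the exercise (race = 1 is the baseline level).
age_r, totincr, educat = 35, 14, 16

log_expected = (
    coef["Intercept"]
    + coef["age_r"] * age_r
    + coef["totincr"] * totincr
    + coef["educat"] * educat
)
expected_children = math.exp(log_expected)

print(expected_children)
```

Each coefficient acts multiplicatively on the expected count: one more year of education multiplies it by exp(-0.0443), roughly a 4% reduction.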

Note:  Having reviewed the provided solution, I see that the author created a polynomial model for age.  I'll try that to see how it changes my answer.

In [14]:
join['age2'] = join.age_r**2
#join['age3'] = join.age_r**3

formula = 'numbabes ~ age_r + age2 +  C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 35**2,  1, 14, 16]], columns=columns)
results.predict(new)

Optimization terminated successfully.
         Current function value: 1.677002
         Iterations 7


0    2.496802
dtype: float64

In [15]:
results.summary()

Dep. Variable:               numbabes   No. Observations:               8884.0
Model:                        Poisson   Df Residuals:                   8877.0
Method:                           MLE   Df Model:                          6.0
Date:                Sat, 11 May 2024   Pseudo R-squ.:                 0.03686
Time:                        18:03:04   Log-Likelihood:               -14898.0
converged:                       True   LL-Null:                      -15469.0
Covariance Type:            nonrobust   LLR p-value:                3.681e-243

                    coef    std err          z      P>|z|     [0.025     0.975]
-------------------------------------------------------------------------------
Intercept        -1.0324      0.169     -6.098      0.000     -1.364     -0.701
C(race)[T.2]     -0.1401      0.015     -9.479      0.000     -0.169     -0.111
C(race)[T.3]     -0.0991      0.025     -4.029      0.000     -0.147     -0.051
age_r             0.1556      0.010     15.006      0.000      0.135      0.176
age2             -0.0020      0.000    -13.102      0.000     -0.002     -0.002
totincr          -0.0187      0.002     -9.830      0.000     -0.022     -0.015
educat           -0.0471      0.003    -16.076      0.000     -0.053     -0.041


This changes the projection to about 2.5 children.  The pseudo R-squared is also slightly better (0.03686 versus 0.03109), so the quadratic age term likely improves the model.