Exercise from Think Stats, 2nd Edition (thinkstats2.com)<br>
Allen Downey

In [81]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import nsfg
import chap01soln
import statsmodels.formula.api as smf

Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool. 

First, I want to load in the data and just restrict the database right off the bat. I'll also use this time to drop my nan values. 

In [82]:
births = nsfg.ReadFemPreg()
live = births[(births.outcome == 1) & (births.prglngth > 30)]
resp = chap01soln.ReadFemResp()


In [83]:
join = live.join(resp, on='caseid', rsuffix='_r')


In [84]:
t = []
for name in join.columns:
    try:
        if join[name].var() < 1e-7:
            continue
        formula = ('prglngth ~ ' + name).encode('ascii')
        model = smf.ols(formula, data=join)
        if model.nobs < len(join)/2:
            continue
        results = model.fit()
    except (ValueError, TypeError):
        continue
    t.append((results.rsquared, name))

In [85]:
t.sort(reverse=True)
for mse, name in t[:30]:
    print(name, mse)

(u'prglngth', 1.0)
(u'wksgest', 0.80624341161392343)
('totalwgt_lb', 0.12445743148120236)
(u'birthwgt_lb', 0.11977307804917292)
(u'lbw1', 0.10372542204583624)
(u'mosgest', 0.095624319895927679)
(u'prglngth_i', 0.0220537757964685)
(u'nbrnaliv', 0.0045775657855365859)
(u'mardat02_i', 0.003084214232312199)
(u'wantrp07_i', 0.0028805878918768402)
(u'oldwr07_i', 0.0028805878918768402)
(u'oldwp07_i', 0.0028805878918768402)
(u'parts12', 0.0027898760463058725)
(u'rmarout07_i', 0.0027500058046551201)
(u'condomr_i', 0.0024693444168197853)
(u'anynurse', 0.0024520248837134329)
(u'bfeedwks', 0.0023691839446664531)
(u'mon12prt', 0.0023356623983744607)
(u'pregend1', 0.0022493894338005971)
(u'intr_ec3', 0.0020549332842115797)
(u'cmlastlb', 0.0020431424422021616)
(u'warm', 0.0020189345316214968)
(u'fmarcon5_i', 0.0019681593242552031)
(u'evuseint', 0.0018917527758629538)
(u'agecon07_i', 0.0018117332152514098)
(u'marcon05_i', 0.0017807675795284972)
(u'knowfp', 0.0017743566721165616)
(u'sexever_i', 0.00175

Okay, so this predicts the weeks of gestation for all the variables. Now, let's only use the variables that we'd have access to.

In [86]:
joinFiltered = join[['prglngth','pregordr','agepreg','marstat','hisp',
                     'roscnt','goschol','higrade','havedip','havedeg',
                     'numpregs','cntry00','brnout','evwrk6mo','wrk12mos',
                     'educat','race','numfmhh']]

In [87]:
tFiltered = []
for name in joinFiltered.columns:
    try:
        if join[name].var() < 1e-7:
            continue
        formula = ('prglngth ~ ' + name).encode('ascii')
        model = smf.ols(formula, data=joinFiltered)
        if model.nobs < len(join)/2:
            continue
        results = model.fit()
    except (ValueError, TypeError):
        continue
    tFiltered.append((results.rsquared, name))

In [88]:
tFiltered.sort(reverse=True)
for mse, name in tFiltered[:30]:
    print(name, mse)

(u'prglngth', 1.0)
(u'pregordr', 0.00062224148606715435)
(u'educat', 0.00059068244033599893)
(u'marstat', 0.00055259939593688134)
(u'numpregs', 0.00039871360360510533)
(u'havedip', 0.00035383982253922586)
(u'race', 0.00031440218809630771)
(u'numfmhh', 0.00027737775283898092)
(u'hisp', 0.00012628508855627718)
(u'roscnt', 0.0001200107015281171)
(u'higrade', 4.7422830185106513e-05)
(u'wrk12mos', 3.6221957431581409e-05)
(u'brnout', 2.7050223118885164e-05)
(u'evwrk6mo', 2.2877588194858411e-05)
(u'agepreg', 1.7603962211398816e-05)
(u'goschol', 1.984423825263093e-07)


So, it doesn't seem like there is anything that's super related to the birth day in the columns that I looked through. Other than, of course, the length of the pregnancy. It seems like the sorts of things that the people at an office would know don't really predict when the baby will be born. 

There might well be different things that I could use to predict the date of birth, but I spent about half an hour looking through the codebook for columns that coworkers might now, and these are the ones I found. Perhaps I missed a different method of selecting column names that might have been faster. 

## Clarifying Questions

Use this space to ask questions regarding the content covered in the reading. These questions should be restricted to helping you better understand the material. For questions that push beyond what is in the reading, use the next answer field. If you don't have a fully formed question, but are generally having a difficult time with a topic, you can indicate that here as well.

It seems like there are a lot of variables that Allen defined earlier on in think stats that he used in this chapter. It wasn't immediately ovious to me (maybe I missed something) in this chapter or the previous one how these variables would be defined, and it makes reading the chapters/doing the exercises out of order more confusing/time consuming. 

One of the questions that I had when trying to answer a question like this is whether I use only data of women 30 weeks pregnant to answer this question, or is it acceptable to use all of the data? Or only data for women pregnant less than 30 weeks? More than 30 weeks? How do I know what the "right" portion of the data to answer a question like this? 

## Enrichment Questions

Use this space to ask any questions that go beyond (but are related to) the material presented in this reading. Perhaps there is a particular topic you'd like to see covered in more depth. Perhaps you'd like to know how to use a library in a way that wasn't show in the reading. One way to think about this is what additional topics would you want covered in the next class (or addressed in a followup e-mail to the class). I'm a little fuzzy on what stuff will likely go here, so we'll see how things evolve.

When you have a dataset that's as large as the one we have, how do you go about managing it/selecting what columns to use for analysis? There are more than 1000 columns and looking through them all seems a little impratical.

## Additional Resources / Explorations

If you found any useful resources, or tried some useful exercises that you'd like to report please do so here. Let us know what you did, what you learned, and how others can replicate it.

Nothing here!