Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey

In [1]:
%matplotlib inline



Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool. 

In [2]:
import nsfg

data = nsfg.ReadFemPreg()

In [3]:
# Keep prgoutcome == 1 pregnancies longer than 30 weeks (bets are placed in week 30)
data_filtered = data[(data.prglngth > 30) & (data.prgoutcome == 1)]

In [4]:
import thinkstats2

def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    return df

respondents = ReadFemResp()
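
One thing to watch when this frame is joined to the pregnancy data: `DataFrame.join(other, on='caseid')` looks up each `caseid` in the index of `other`, not in its `caseid` column. As far as I can tell, `ReadFixedWidth` leaves the default 0, 1, 2, ... row index in place, so the respondent frame probably needs to be indexed by `caseid` first (for example with `respondents.index = respondents.caseid`). Here is a minimal sketch with made-up frames; the values and the `age_r` column are hypothetical:

```python
import pandas as pd

# Made-up stand-ins for the pregnancy and respondent files (hypothetical values)
preg = pd.DataFrame({'caseid': [10, 10, 30], 'prglngth': [39, 40, 38]})
resp = pd.DataFrame({'caseid': [10, 20, 30], 'age_r': [25, 31, 28]})

# Default 0, 1, 2 index: caseid 10 and 30 are not index labels, so nothing matches
unaligned = preg.join(resp, on='caseid', rsuffix='_r')

# Indexed by caseid: each pregnancy row picks up its own respondent's data
aligned = preg.join(resp.set_index('caseid'), on='caseid')

print(unaligned.age_r.tolist())
print(aligned.age_r.tolist())
```

With real data some `caseid` values coincide with row positions, so an unindexed join could silently attach the wrong respondent instead of producing missing values.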

In [5]:
import pandas as pd

joined = data_filtered.join(respondents, on='caseid', rsuffix='_r')
print("Number of columns joined", len(joined.columns))
print("Number of columns original", len(data_filtered.columns))

# Screen every column: fit prglngth on that column alone and record R^2
data_mining = {}
for column_name, column_data in joined.items():
    if column_name == 'prglngth':
        continue
    # Only consider columns with fewer than 10% missing values
    if column_data.isnull().mean() < 0.10:
        relevant_columns = pd.concat((column_data, joined.prglngth), axis=1).dropna()
        try:
            intercept, slope = thinkstats2.LeastSquares(relevant_columns[column_name],
                                                        relevant_columns.prglngth)
            # Residuals of the single-variable fit (their sign does not affect R^2)
            res = slope*relevant_columns[column_name]+intercept - relevant_columns.prglngth
            data_mining[column_name] = thinkstats2.CoefDetermination(relevant_columns.prglngth,
                                                                     res)
        except Exception as ex:
            print(ex)

# Show 20 columns ordered by R^2, highest first (NaN scores disturb the ordering)
for w in sorted(data_mining, key=data_mining.get, reverse=True)[0:20]:
    print(w, data_mining[w])

Number of columns joined 3331
Number of columns original 244
rcurpreg_i nan
pregnum_i nan
birthwgt_lb 0.119773078049
mosgest 0.0956243198959
race 0.000314402188101
birthord_i nan
prgoutcome nan
nbrnaliv 0.00457756578553
cmlastlb 0.00204314244221
fmarcon5_i 0.00196815932428
birthord 0.00123727367368
gestasun_w 0.00105137990873
outcome_i nan
poverty 0.00112341537571
cmintstr 0.00087396929583
educat 0.000590682440346
wantresp 0.000499185401084
ager_i nan
fmarital_i nan
fmarital 0.000572478064062
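
The `nan` rows scattered through this list are a side effect of the sort. Every comparison against NaN is false, so `sorted` cannot place those entries consistently, and the numeric scores around them end up out of order too (`race` is printed ahead of `nbrnaliv` despite a smaller R²). A column that is constant after filtering, such as `prgoutcome`, has zero variance, so its slope presumably works out to 0/0. Here is a minimal sketch of a ranking that skips NaN scores; `rank_scores` is a hypothetical helper, and the dictionary holds a few values copied from the output above:

```python
import math

def rank_scores(scores, top=20):
    """Return (name, score) pairs sorted by score, highest first, skipping NaN."""
    finite = {name: s for name, s in scores.items() if not math.isnan(s)}
    return sorted(finite.items(), key=lambda kv: kv[1], reverse=True)[:top]

# A few R^2 values copied from the output above
scores = {
    'prgoutcome': float('nan'),
    'race': 0.000314402188101,
    'birthwgt_lb': 0.119773078049,
    'nbrnaliv': 0.00457756578553,
    'mosgest': 0.0956243198959,
}

for name, r2 in rank_scores(scores):
    print(name, r2)
```

Dropping the NaN entries before sorting keeps the comparison well defined, so the top of the list really is the set of best single-variable fits.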


Of these, only a few would plausibly be known to the people in the pool before the birth; birthwgt_lb, for example, is only known afterwards. I'd choose race, nbrnaliv, birthord, and fmarcon5.

In [6]:
from sklearn.linear_model import LinearRegression
import numpy as np

model = LinearRegression()
# Drop rows missing any predictor or the target
filtered = joined.dropna(subset=['birthord', 'race', 'nbrnaliv', 'fmarcon5', 'prglngth'])
X = filtered[['birthord', 'race', 'nbrnaliv', 'fmarcon5']]
y = filtered.prglngth
model.fit(X, y)
# In-sample RMS error of the model vs. always predicting the mean length
print("RMS model", np.sqrt(((model.predict(X) - y)**2).mean()))
print("RMS baseline", np.sqrt(((y - y.mean())**2).mean()))

RMS model 1.89264008596
RMS baseline 1.89815725743
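
The two numbers are nearly identical, so the four variables barely beat always guessing the mean. To put that in perspective, the pair of RMS errors can be turned into an R² for the combined model, which comes out at roughly 0.006. This is a small sketch that only reuses the two values printed above; the identity in the comment holds for a least-squares fit with an intercept evaluated on its own training data, which is what `LinearRegression` does by default here:

```python
# RMS errors copied from the output above
rms_model = 1.89264008596
rms_baseline = 1.89815725743

# For an in-sample least-squares fit with an intercept:
# R^2 = 1 - Var(residuals) / Var(y) = 1 - (rms_model / rms_baseline)**2
r_squared = 1 - (rms_model / rms_baseline) ** 2

# Improvement over the baseline, in weeks
improvement = rms_baseline - rms_model

print("R^2 of the four-variable model:", r_squared)
print("Reduction in RMS error (weeks):", improvement)
```

Because both errors are measured on the data used for fitting, even this small gain is probably optimistic for a new pregnancy.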
