Exercise from Think Stats, 2nd Edition (thinkstats2.com)<br>
Allen Downey

In [3]:
%matplotlib inline
import nsfg
import thinkstats2
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool. 

In [4]:
data = nsfg.ReadFemPreg()
data_filtered = data[(data.prglngth > 30) & (data.prgoutcome == 1)]

In [6]:
# From solution set
def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    return df

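respondents = ReadFemResp()
# Index by caseid so that join(..., on='caseid') matches on case ID
# rather than on row position.
respondents.index = respondents.caseid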
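
`DataFrame.join(other, on=key)` matches the `key` column of the left frame against the *index* of `other`, not against a column of the same name. The sketch below illustrates that behaviour on two tiny made-up tables; the frames `preg` and `resp` and the `age` column are hypothetical stand-ins, not the NSFG data.

```python
import pandas as pd

# Toy stand-ins for the pregnancy and respondent tables (values are made up).
preg = pd.DataFrame({'caseid': [2, 2, 5], 'prglngth': [39, 40, 38]})
resp = pd.DataFrame({'caseid': [1, 2, 5], 'age': [30, 25, 41]})

# With the default RangeIndex (0, 1, 2), caseid 2 is matched to row label 2,
# which is the respondent whose caseid is 5; caseid 5 matches no row label.
by_position = preg.join(resp, on='caseid', rsuffix='_r')

# Indexing the right-hand frame by caseid makes the join match on case IDs.
by_caseid = preg.join(resp.set_index('caseid'), on='caseid')

print(by_position)
print(by_caseid)
```

In the first result the `age` values belong to the wrong respondent or are missing; in the second each pregnancy row carries the age of its own respondent.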

In [7]:
# From solutions
joined = data_filtered.join(respondents, on='caseid', rsuffix='_r')
print("Number of columns joined", len(joined.columns))
print("Number of columns original", len(data_filtered.columns))

# For each sufficiently complete column, fit a one-variable least-squares
# model of prglngth and record its coefficient of determination (R^2).
data_mining = {}
for column_name, column_data in joined.items():
    if column_name == 'prglngth':
        continue
    # Skip columns with 10% or more missing values.
    if column_data.isnull().mean() < 0.10:
        relevant_columns = pd.concat((column_data, joined.prglngth), axis=1).dropna()
        try:
            intercept, slope = thinkstats2.LeastSquares(relevant_columns[column_name],
                                                        relevant_columns.prglngth)
            # Residuals: fitted minus observed (the sign does not affect the variance).
            res = slope*relevant_columns[column_name]+intercept - relevant_columns.prglngth
            data_mining[column_name] = thinkstats2.CoefDetermination(relevant_columns.prglngth,
                                                                     res)
        except Exception as ex:
            print(ex)

# List 20 columns sorted by R^2 in descending order; NaN scores are not
# comparable and can disturb the ordering.
for w in sorted(data_mining, key=data_mining.get, reverse=True)[0:20]:
    print(w, data_mining[w])

Number of columns joined 3331
Number of columns original 244
rcurpreg_i nan
pregnum_i nan
birthwgt_lb 0.119773078049
mosgest 0.0956243198959
race 0.000314402188093
birthord_i nan
prgoutcome nan
nbrnaliv 0.00457756578553
cmlastlb 0.0020431424422
fmarcon5_i 0.00196815932425
birthord 0.00123727367366
gestasun_w 0.00105137990872
outcome_i nan
poverty 0.0011234153757
cmintstr 0.000873969295819
educat 0.000590682440337
wantresp 0.00049918540108
ager_i nan
fmarital_i nan
fmarital 0.000572478064053
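
The listing above is not in strictly descending order because some scores are `nan`. Comparisons with NaN are always false, so `sorted` cannot place those entries reliably. A NaN score most likely means the column is constant among the filtered rows (`prgoutcome`, for example, was fixed to 1 by the filter), which makes the least-squares slope undefined. A minimal sketch of one way to rank the scores while ignoring NaN, using a made-up `scores` dictionary in place of `data_mining`:

```python
import pandas as pd

# Made-up scores standing in for the data_mining dictionary.
scores = {'a': 0.12, 'b': float('nan'), 'c': 0.005, 'd': 0.09}

# Drop undefined scores, then sort the rest from highest to lowest R^2.
ranked = pd.Series(scores).dropna().sort_values(ascending=False)
print(ranked.head(20))
```

Using a `Series` here is a design choice for brevity; filtering the dictionary with `math.isnan` before calling `sorted` would work equally well.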


In [8]:
model = LinearRegression()
# Keep only rows where every predictor and the target are present.
filtered = joined.dropna(subset=['birthord', 'race', 'nbrnaliv', 'fmarcon5', 'prglngth'])
X = filtered[['birthord', 'race', 'nbrnaliv', 'fmarcon5']]
y = filtered.prglngth
model.fit(X, y)
# In-sample RMS error of the model, and of always predicting the mean.
print("RMS model", np.sqrt(((model.predict(X) - y)**2).mean()))
print("RMS baseline", np.sqrt(((y - y.mean())**2).mean()))

RMS model 1.89264008596
RMS baseline 1.89815725743
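
The baseline RMS error is the standard deviation of `y` around its mean, so the two figures above imply an in-sample coefficient of determination of R² = 1 − (RMS model / RMS baseline)². A short sketch of that arithmetic, with the two values copied from the output above:

```python
# RMS errors copied from the output above.
rms_model = 1.89264008596
rms_baseline = 1.89815725743

# R^2 = 1 - SS_res / SS_tot, which equals 1 - (RMS_model / RMS_baseline)^2
# when both errors are computed on the same rows.
r_squared = 1 - (rms_model / rms_baseline) ** 2
print(round(r_squared, 4))
```

The result is roughly 0.006, meaning these four variables together explain well under 1% of the variance in pregnancy length on the data they were fitted to.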


## Clarifying Questions

Use this space to ask questions regarding the content covered in the reading. These questions should be restricted to helping you better understand the material. For questions that push beyond what is in the reading, use the next answer field. If you don't have a fully formed question, but are generally having a difficult time with a topic, you can indicate that here as well.

## Enrichment Questions

Use this space to ask any questions that go beyond (but are related to) the material presented in this reading. Perhaps there is a particular topic you'd like to see covered in more depth. Perhaps you'd like to know how to use a library in a way that wasn't shown in the reading. One way to think about this is to ask what additional topics you would want covered in the next class (or addressed in a follow-up e-mail to the class). I'm a little fuzzy on what will likely go here, so we'll see how things evolve.

## Additional Resources / Explorations

If you found any useful resources, or tried some useful exercises that you'd like to report please do so here. Let us know what you did, what you learned, and how others can replicate it.