# Chapter 1: Exercise 1

In [14]:
from __future__ import print_function, division

import nsfg

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [15]:
preg = nsfg.ReadFemPreg() 

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [16]:
preg.birthord.value_counts().sort_index()
    # birthord: column in the dataset
    # .value_counts(): returns a Series containing counts of unique values
    # .sort_index(): sorts the result by its index labels

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64
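
To make the role of `sort_index()` in the cell above concrete, here is a minimal sketch on toy data (the values are made up and are not NSFG data). `value_counts()` alone orders the result by frequency, and `sort_index()` reorders it by the value being counted.

```python
import pandas as pd

# Toy data standing in for a survey column (not NSFG values).
birth_order = pd.Series([3, 3, 3, 1, 2, 2])

# value_counts() orders by frequency, most common first.
by_frequency = birth_order.value_counts()

# sort_index() reorders the counts by the value itself.
counts = by_frequency.sort_index()
print(counts)
```

Sorting by index makes the output line up with codebook tables, which list values in ascending order.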

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [17]:
# prglngth: column in the dataset
# .value_counts(): returns a Series containing counts of unique values
# .sort_index(): sorts the result by its index labels
preg.prglngth.value_counts().sort_index()


0       15
1        9
2       78
3      151
4      412
5      181
6      543
7      175
8      409
9      594
10     137
11     202
12     170
13     446
14      29
15      39
16      44
17     253
18      17
19      34
20      18
21      37
22     147
23      12
24      31
25      15
26     117
27       8
28      38
29      23
30     198
31      29
32     122
33      50
34      60
35     357
36     329
37     457
38     609
39    4744
40    1120
41     591
42     328
43     148
44      46
45      10
46       1
47       1
48       7
50       2
Name: prglngth, dtype: int64

Create a new column named <tt>totalwgt_kg</tt> that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [18]:
preg['totalwgt_kg'] = preg.totalwgt_lb / 2.2 
    # Stores totalwgt_lb / 2.2 in a new column named 'totalwgt_kg'
    # Dividing by 2.2 converts pounds to kilograms (approximately)
preg.totalwgt_kg.mean()
    # Mean of the newly added 'totalwgt_kg' column


3.302558389828807
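
The reminder about dictionary syntax can be illustrated with a small sketch on a toy DataFrame (the column names `weight_lb`, `weight_kg`, and `weight_g` are hypothetical). It also uses the exact conversion factor, 1 lb = 0.45359237 kg, instead of the approximate division by 2.2.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"weight_lb": [7.5, 6.0, 8.25]})  # toy values, not NSFG data

# Bracket (dictionary-style) assignment creates a real column.
df["weight_kg"] = df.weight_lb * 0.45359237  # exact lb-to-kg factor

# Dot assignment only sets a Python attribute; pandas warns and adds no column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.weight_g = df.weight_kg * 1000

print(list(df.columns))
```

Only `weight_lb` and `weight_kg` appear in the column list, which is why new columns have to be created with brackets.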

Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [19]:
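resp = nsfg.ReadFemResp()  # read the respondent file; resp is not defined in an earlier cell
resp.age_r.value_counts().sort_index()

# The youngest respondents are 15 and the oldest are 44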

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

How old is the respondent with `caseid` 1?

In [20]:
resp[resp.caseid==1].age_r
    # Returns the age of the respondent whose caseid is 1
    # The respondent with caseid 1 is 44 years old

1069    44
Name: age_r, dtype: int64

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [21]:
preg[preg.caseid==2298].prglngth # finds the pregnancy records where caseid is 2298

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64
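
The selections above rely on boolean indexing. A minimal sketch with a hypothetical three-row frame shows the two steps separately: the comparison builds a boolean Series, and indexing with it keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical pregnancy records: two respondents, three rows (not NSFG data).
toy = pd.DataFrame({"caseid": [10, 10, 11], "prglngth": [39, 40, 38]})

mask = toy.caseid == 10        # boolean Series, one entry per row
lengths = toy[mask].prglngth   # keep the rows where the mask is True
print(lengths.tolist())
```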

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [22]:
preg[preg.caseid==5012].birthwgt_lb
    # There is only one record for caseid 5012, so returning it is enough; there is no need to find the first birth

5515    6.0
Name: birthwgt_lb, dtype: float64
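
When a respondent has more than one record, the first baby has to be picked out explicitly. The sketch below uses made-up records and assumes, for illustration only, that `birthord` is 1 for the first live birth and missing for non-live outcomes.

```python
import pandas as pd

# Hypothetical records for one respondent (not NSFG data).
# Assumption: a missing birthord marks a pregnancy that did not end in a live birth.
toy = pd.DataFrame({
    "caseid": [7, 7, 7],
    "birthord": [2.0, float("nan"), 1.0],
    "birthwgt_lb": [8.0, float("nan"), 6.0],
})

records = toy[toy.caseid == 7]                # all records for this respondent
first_baby = records[records.birthord == 1]   # keep only the first live birth
print(first_baby.birthwgt_lb.tolist())
```

Filtering on `birthord` instead of taking the first row avoids relying on the order of the records in the file.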

# Chapter 1 Exercise 2

In [23]:
"""This file contains code for use with "Think Stats",
by Allen B. Downey, available from greenteapress.com
Copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function, division

import numpy as np
import sys

import nsfg
import thinkstats2


# Define a function named ReadFemResp.
# It takes three keyword arguments: dct_file, dat_file, and nrows.
## dct_file: name of the Stata dictionary file (default '2002FemResp.dct')
## dat_file: name of the gzipped data file (default '2002FemResp.dat.gz')
## nrows: number of rows to read. Useful for reading pieces of large files.*
### *nrows=None means no limit, so all rows are read


def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.
    dct_file: string file name
    dat_file: string file name
    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file) 
        # Calls ReadStataDct from the thinkstats2 module, which reads dct_file, and assigns the result to dct
        # From ThinkStats2: ReadStataDct takes the name of the dictionary file and returns dct...
        # ... dct is a FixedWidthVariables object that contains the information from the dictionary file. 
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
        # From ThinkStats2: dct provides ReadFixedWidth, which reads the data file.

    CleanFemResp(df)
        # Passes df to the CleanFemResp function and runs it
    return df
        # This function returns df, which is a DataFrame


def CleanFemResp(df):
    """Recodes variables from the respondent frame.
    df: DataFrame
    """
    # This function does nothing... is a placeholder 
    pass


def ValidatePregnum(resp):
    """Validate pregnum in the respondent file.
    resp: respondent DataFrame
    """
    # read the pregnancy frame
    preg = nsfg.ReadFemPreg()

    # make the map from caseid to list of pregnancy indices
    # MakePregMap is defined in the nsfg.py file saved in the working directory. Its code is:
        # d = defaultdict(list)
        # for index, caseid in df.caseid.iteritems():
            # d[caseid].append(index)
        # return d
    preg_map = nsfg.MakePregMap(preg)
        
    # iterate through the respondent pregnum series
    ## for each index and its pregnum value
    for index, pregnum in resp.pregnum.items():
        caseid = resp.caseid[index]
        indices = preg_map[caseid]

        # check that pregnum from the respondent file equals
        # the number of records in the pregnancy file
        if len(indices) != pregnum:
            print(caseid, len(indices), pregnum) # on a mismatch, print the caseid, the number of pregnancy records, and pregnum
            return False

    return True


def main(script):
    """Tests the functions in this module.
    script: string script name
    """
    resp = ReadFemResp()
    # checks that the resp dataset has 7643 rows; raises AssertionError otherwise
    assert(len(resp) == 7643) 
    
    # checks that exactly 1267 respondents have pregnum == 1; raises AssertionError otherwise
    assert(resp.pregnum.value_counts()[1] == 1267) 
    
    # checks that ValidatePregnum returns True; raises AssertionError otherwise
    assert(ValidatePregnum(resp))

    # prints the script name and 'All tests passed.' if no assert statement raised an error
    print('%s: All tests passed.' % script)

if __name__ == '__main__':
    main(sys.argv[0]) # Changed from main(*sys.argv): unpacking sys.argv kept raising "too many arguments" errors

C:\Anaconda\lib\site-packages\ipykernel_launcher.py: All tests passed.
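
The check performed by `ValidatePregnum` can be sketched on toy data without the NSFG files. The lists and dictionary below are made-up stand-ins: `preg_caseids` plays the role of the `caseid` column in the pregnancy file, and `resp_pregnum` plays the role of the `pregnum` column in the respondent file.

```python
from collections import defaultdict

# Toy stand-ins for the two files (not NSFG data).
preg_caseids = [1, 1, 2, 4, 4, 4]          # one entry per pregnancy record
resp_pregnum = {1: 2, 2: 1, 3: 0, 4: 3}    # caseid -> reported number of pregnancies

# Map each caseid to the positions of its pregnancy records.
preg_map = defaultdict(list)
for index, caseid in enumerate(preg_caseids):
    preg_map[caseid].append(index)

# A respondent is consistent when the record count equals pregnum.
mismatches = [cid for cid, n in resp_pregnum.items()
              if len(preg_map[cid]) != n]
print(mismatches)
```

Using a `defaultdict` means a respondent with no pregnancy records (caseid 3 here) maps to an empty list, so a `pregnum` of 0 still validates.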


# Chapter 2 Exercise 1

Question 1: Which summary statistic would you use if you wanted to get a story on the evening news? Which one would you use if you wanted to reassure an anxious patient?

    The news would probably be interested in the highest and lowest ages for births, as well as the highest and lowest birth weights. I would accompany these with the average to show how large the variation is.

    To reassure an anxious patient, I would share the probabilities of a long pregnancy or an underweight or overweight baby, and show how unlikely those outcomes are.

"Do First Babies Arrive Late?" Write a paragraph that uses the results in this chapter to answer the question clearly, precisely, & honestly.

       The pregnancy term for a mother's first birth is not typically longer than for subsequent births. This can be shown by comparing mean pregnancy lengths. The mean pregnancy length for live births is 38.6 weeks, and a deviation of 2-3 weeks is normal. For first births, the mean pregnancy length is 38.601 weeks. First births differ from other births by only about 13 hours on average, roughly 0.2% of a typical pregnancy. Measured as an effect size, the difference between first births and other births is less than 0.03 standard deviations, so whether or not a birth is the mother's first has no practically meaningful effect on the length of the pregnancy.
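
As a quick arithmetic check, a 13-hour difference can be expressed as a fraction of the 38.6-week mean quoted in the paragraph above. This is a minimal sketch that uses only those two figures and works out to about 0.2%.

```python
mean_weeks = 38.6      # mean pregnancy length for live births, from the text
diff_hours = 13        # approximate difference quoted in the text

hours_per_week = 7 * 24
diff_weeks = diff_hours / hours_per_week       # difference expressed in weeks
percent = 100 * diff_weeks / mean_weeks        # difference as a percentage of the mean
print(round(diff_weeks, 3), round(percent, 2))
```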

# Chapter 2 Exercise 4

In [24]:
# Using the variable totalwgt_lb,
# investigate whether first babies are lighter or heavier than others
# Compute Cohen's d to quantify the difference between the groups
# How does it compare to the difference in pregnancy lengths?

In [25]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np

import nsfg
import first

In [26]:
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]

In [27]:
firsts = live[live.birthord == 1] # live births where birthord is 1
others = live[live.birthord != 1] # live births where birthord is not 1


In [28]:
def CohenEffectSize(group1, group2):
    """Computes Cohen's effect size for two groups.
    
    group1: Series or DataFrame
    group2: Series or DataFrame
    
    returns: float if the arguments are Series;
             Series if the arguments are DataFrames
    """
    diff = group1.mean() - group2.mean() # difference between the group means

    var1 = group1.var() # variance of group1
    var2 = group2.var() # variance of group2
    n1, n2 = len(group1), len(group2) # sizes of the two groups

    # pooled variance: weight each group's variance by its size,
    # sum the two products, and divide by the combined size
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    # divide the difference in means by the square root of the pooled variance
    d = diff / np.sqrt(pooled_var)
    return d
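
To check the formula by hand, here is a minimal sketch that applies the same pooled-variance calculation to two tiny made-up groups. The helper name `cohen_d` is hypothetical and simply restates the function above so the block runs on its own.

```python
import numpy as np
import pandas as pd

def cohen_d(group1, group2):
    """Pooled-variance effect size, same formula as CohenEffectSize above."""
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    pooled_var = (n1 * group1.var() + n2 * group2.var()) / (n1 + n2)
    return diff / np.sqrt(pooled_var)

a = pd.Series([2, 4, 6])   # mean 4, sample variance 4
b = pd.Series([1, 3, 5])   # mean 3, sample variance 4
d = cohen_d(a, b)
print(d)
```

The means differ by 1 and the pooled standard deviation is 2, so the effect size is 0.5. Swapping the arguments flips the sign, which is why the order of the groups matters when reading the results.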

In [29]:
# pass in first birth dataset and other birth data set...
# into coheneffectsize function defined above
CohenEffectSize(firsts.prglngth, others.prglngth)

0.028879044654449883

In [30]:
CohenEffectSize(firsts.totalwgt_lb, others.totalwgt_lb)

-0.088672927072602

In [31]:
resp = nsfg.ReadFemResp()

In [32]:
# select respondents with the highest income (level 14)
rich = resp[resp.totincr == 14]
# select respondents whose totincr value is below 14 and assign them to not_rich
not_rich = resp[resp.totincr < 14]
CohenEffectSize(rich.parity, not_rich.parity)

-0.1251185531466061

Synopsis:

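Cohen's d for first babies vs. others is about 0.029 for pregnancy length and about -0.089 for total birth weight. First babies are therefore slightly lighter on average, and the weight difference is roughly three times the pregnancy-length difference when both are measured in standard deviations, though both effects are small.
Cohen's d for parity, rich vs. not rich, is about -0.125, which is larger in magnitude than either birth-order effect.
This means that income has a larger effect on parity than birth order (first or other) has on either the length of the pregnancy or the weight of the baby.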