# Getting started with data

In [None]:
import numpy as np
import pandas as pd

In [None]:
import sys
sys.path.append('lib')

In [None]:
import nsfg

## The National Survey of Family Growth

Since 1973 the U.S. Centers for Disease Control and Prevention (CDC) has conducted the National Survey of Family Growth (NSFG), which is intended to gather “information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. The survey results are used to plan health services and health education programs, and to do statistical studies of families, fertility, and health.” See the [NSFG home page](https://www.cdc.gov/nchs/nsfg/index.htm).

The NSFG is a cross-sectional study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a longitudinal study, which observes a group repeatedly over a period of time.


The NSFG has been conducted seven times; each deployment is called a cycle. We will use data from Cycle 6, which was conducted from January 2002 to March 2003.

The NSFG is not representative of the U.S. population; instead, it deliberately oversamples some groups. The designers of the study recruited three groups (Hispanics, African-Americans, and teenagers) at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

When working with this kind of data, it is important to be familiar with the codebook, which documents the design of the study, the survey questions, and the encoding of the responses. The codebook and user’s guide for the NSFG data are available on the [Cycle 6 page](https://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm).

## Dataframes

In [None]:
preg = nsfg.read_fem_preg()
preg.head()

Print the column names.

In [None]:
preg.columns

Select a single column name.

In [None]:
preg.columns[1]

Select a column and check what type it is.

In [None]:
pregordr = preg['pregordr']
type(pregordr)

A Series is like a Python list with some additional features. When you print a Series, you get the indices and the corresponding values:

In [None]:
pregordr
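
To make the comparison with a list concrete, here is a minimal sketch using a small hand-made Series. The values are made up for illustration and are not taken from the NSFG data. Unlike a list, a Series carries an index and supports element-wise arithmetic and built-in summary methods:

```python
import pandas as pd

# Toy values for illustration only; not NSFG data.
values = [1, 2, 3, 1, 2]
s = pd.Series(values, name='pregordr')

# A Series has an index as well as values.
index_labels = list(s.index)

# Arithmetic applies element-wise, which a plain list does not support.
doubled = s * 2

# Summary methods are built in.
largest = s.max()
```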

You can access the elements of a Series using integer indices and slices:

In [None]:
pregordr[0]

Select a slice from a column.

In [None]:
pregordr[2:5]
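
When the distinction between positions and index labels matters, pandas provides the explicit accessors `iloc` (by position, end excluded) and `loc` (by label, end included). The following sketch uses a toy Series, not the NSFG data, to show the difference:

```python
import pandas as pd

# Toy Series with the default integer index 0..5.
s = pd.Series([10, 20, 30, 40, 50, 60])

# iloc slices by position and excludes the end, like a Python list.
by_position = s.iloc[2:5]

# loc slices by label and includes the end label.
by_label = s.loc[2:5]
```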

Select a column using dot notation.

In [None]:
pregordr = preg.pregordr

## Variables

- `caseid` is the integer ID of the respondent.
- `prglngth` is the integer duration of the pregnancy in weeks.
- `outcome` is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
- `pregordr` is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
- `birthord` is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.
- `birthwgt_lb` and `birthwgt_oz` contain the pounds and ounces parts of the birth weight of the baby.
- `agepreg` is the mother’s age at the end of the pregnancy.
- `finalwgt` is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.
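
Because of oversampling, a statistic computed with `finalwgt` can differ from the unweighted one. The sketch below uses a tiny hypothetical table (the numbers are invented, not NSFG values) and `np.average` to compare a plain mean with a weighted mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
df = pd.DataFrame({
    'prglngth': [39, 40, 38],
    'finalwgt': [1000.0, 3000.0, 1000.0],
})

# Every row counts equally.
unweighted = df.prglngth.mean()

# Each row counts in proportion to the number of people it represents.
weighted = np.average(df.prglngth, weights=df.finalwgt)
```

In this toy example the second row represents three times as many people as each of the others, so the weighted mean is pulled toward 40.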

If you read the codebook carefully, you will see that many of the variables are recodes, which means that they are not part of the raw data collected by the survey; they are calculated using the raw data.


For example, `prglngth` for live births is equal to the raw variable `wksgest` (weeks of gestation) if it is available; otherwise it is estimated using `mosgest * 4.33` (months of gestation times the average number of weeks in a month).
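
The fallback rule described above can be sketched with `fillna`. This is only an illustration on invented values; the actual recode in the codebook may apply additional rules (for example rounding or consistency checks) that are not reproduced here:

```python
import numpy as np
import pandas as pd

# Invented raw values; NaN marks a missing answer.
raw = pd.DataFrame({
    'wksgest': [39.0, np.nan, 41.0],
    'mosgest': [9.0, 9.0, np.nan],
})

# Estimate weeks from months using the average number of weeks in a month.
estimated = raw.mosgest * 4.33

# Use wksgest when it is available; otherwise fall back to the estimate.
prglngth_sketch = raw.wksgest.fillna(estimated)
```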

Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.

Count the number of times each value occurs.

In [None]:
# sort the counts by outcome code (1-6) rather than by frequency
preg.outcome.value_counts().sort_index()

Check the values of another variable.

In [None]:
preg.birthwgt_oz.value_counts().sort_index()

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [None]:
caseid = 10229
preg_map = nsfg.make_preg_map(preg)
indices = preg_map[caseid]

preg.outcome[indices].values

Using this list as an index into `preg.outcome` selects the indicated rows and yields a Series. Instead of printing the whole Series, I selected the `values` attribute, which is a NumPy array.
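
The source of `nsfg.make_preg_map` is not shown here, so the following is only a sketch of one way such a map could be built, using a `defaultdict` and a toy DataFrame. The function name `make_preg_map_sketch` and the data are hypothetical:

```python
from collections import defaultdict

import pandas as pd


def make_preg_map_sketch(df):
    """Map each caseid to the list of row indices where it appears.

    Hypothetical stand-in for nsfg.make_preg_map; the real function may differ.
    """
    d = defaultdict(list)
    for index, caseid in df.caseid.items():
        d[caseid].append(index)
    return d


# Toy data: respondent 3 has three pregnancies.
toy = pd.DataFrame({
    'caseid': [1, 1, 2, 3, 3, 3],
    'outcome': [1, 4, 1, 4, 4, 1],
})

toy_map = make_preg_map_sketch(toy)
toy_outcomes = toy.outcome[toy_map[3]].values
```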

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf).

In [None]:
# Solution

preg.birthord.value_counts().sort_index()

We can also use `isnull` to count the number of missing values (NaNs).

In [None]:
preg.birthord.isnull().sum()
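
By default `value_counts` leaves NaNs out, which is why they are counted separately above. As a small sketch on a toy Series (not the NSFG data), passing `dropna=False` includes them in the same table:

```python
import numpy as np
import pandas as pd

# Toy Series with two missing values.
s = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

# Count missing values directly.
n_missing = s.isnull().sum()

# count() returns the number of non-missing values.
n_valid = s.count()

# Include NaN as its own row in the frequency table.
counts = s.value_counts(dropna=False).sort_index()
```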

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf).

In [None]:
# Solution

preg.prglngth.value_counts().sort_index()

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [None]:
preg.totalwgt_lb.mean()

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use bracket (dictionary-style) syntax, not dot notation.

In [None]:
# Solution
preg['totalwgt_kg'] = preg.totalwgt_lb / 2.2  # approximate: 1 kg is about 2.2046 lb
preg.totalwgt_kg.mean()
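
The reason for the bracket syntax can be seen in a small sketch on a toy DataFrame (invented weights, not NSFG data). Assigning with dot notation sets a plain Python attribute and pandas warns about it, but no column is created:

```python
import warnings

import pandas as pd

# Toy birth weights in pounds.
df = pd.DataFrame({'totalwgt_lb': [7.5, 8.0]})

# Bracket syntax creates a real column (1 lb is exactly 0.45359237 kg).
df['totalwgt_kg'] = df.totalwgt_lb * 0.45359237

# Dot notation does not create a column; the warning is silenced here
# only to keep the example output clean.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.other = df.totalwgt_lb * 2

has_kg = 'totalwgt_kg' in df.columns
has_other = 'other' in df.columns
```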

`nsfg.py` also provides `read_fem_resp`, which reads the female respondents file and returns a `DataFrame`:

In [None]:
resp = nsfg.read_fem_resp()

`DataFrame` provides a method `head` that displays the first five rows by default:

In [None]:
resp.head()

Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [None]:
# Solution

resp.age_r.value_counts().sort_index()

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [None]:
resp[resp.caseid==2298]

And we can get the corresponding rows from `preg` like this:

In [None]:
preg[preg.caseid==2298]

In [None]:
preg.query('caseid==2298')
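
The two cells above are alternative ways to select the same rows. A minimal sketch on a toy DataFrame (the values are invented) shows that a boolean mask and `query` agree, and that `query` can refer to a Python variable with `@`:

```python
import pandas as pd

# Toy data: respondent 2 has two pregnancies.
toy = pd.DataFrame({
    'caseid': [1, 2, 2, 3],
    'prglngth': [39, 40, 38, 41],
})

# Boolean mask: True for the rows to keep.
mask = toy.caseid == 2
by_mask = toy[mask]

# The same selection written as a query string.
by_query = toy.query('caseid == 2')

# Inside a query string, @ refers to a local Python variable.
target = 2
by_variable = toy.query('caseid == @target')
```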

How old is the respondent with `caseid` 1?

In [None]:
# Solution

resp[resp.caseid==1].age_r

In [None]:
resp.query('caseid == 1').age_r

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [None]:
# Solution

preg[preg.caseid==2298].prglngth

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [None]:
# Solution

preg[preg.caseid==5012].birthwgt_lb