# Chapter 1 Exploratory Data Analysis

In [60]:
import nsfg
import numpy as np

from IPython.core import page
page.page = print

If you know how to program, you have the skills to turn data into knowledge using tools of probability and statistics. This concise introduction shows you how to perform statistical analysis computationally, rather than mathematically, with programs written in Python.

By working with a single case study throughout this course, you’ll learn the entire process of exploratory data analysis—from collecting data and generating statistics to identifying patterns and testing hypotheses.

You’ll explore distributions, rules of probability, visualization, and many other tools and concepts. Chapters on time series analysis, survival analysis, and analytic methods will enrich your discoveries.

* Develop an understanding of probability and statistics by writing and testing code
* Run experiments to test statistical behavior, such as generating samples from several distributions
* Use simulations to understand concepts that are hard to grasp mathematically 
* Import data from most sources with Python, rather than rely on data that’s cleaned and formatted for statistics tools
* Use statistical inference to answer questions about real-world data

This course is an introduction to the practical tools of exploratory data analysis. The organization of the course follows the process we use when we start working with a dataset:

* **Importing and cleaning:** Whatever format the data is in, it usually takes some time and effort to read the data, clean and transform it, and check that everything made it through the translation process intact.<br><br>
* **Single variable explorations:** We usually start by examining one variable at a time, finding out what the variables mean, looking at distributions of the values, and choosing appropriate summary statistics.<br><br>
* **Pair-wise explorations:** To identify possible relationships between variables, we look at tables and scatter plots, and compute correlations and linear fits.<br><br>
* **Multivariate analysis:** If there are apparent relationships between variables, we use multiple regression to add control variables and investigate more complex relationships.<br><br>
* **Estimation and hypothesis testing:** When reporting statistical results, it is important to answer three questions: 
    * How big is the effect? 
    * How much variability should we expect if we run the same measurement again? 
    * Is it possible that the apparent effect is due to chance?<br><br>
* **Visualization:** During exploration, visualization is an important tool for finding possible relationships and effects. Then if an apparent effect holds up to scrutiny, visualization is an effective way to communicate results.

This course takes a computational approach, which has several advantages over mathematical approaches:

* I present most ideas using Python code, rather than mathematical notation. In general, Python code is more readable; also, because it is executable, readers can download it, run it, and modify it.
* Each chapter includes exercises readers can do to develop and solidify their learning. When you write programs, you express your understanding in code; while you are debugging the program, you are also correcting your understanding.
* Some exercises involve experiments to test statistical behavior. For example, you can explore the Central Limit Theorem (CLT) by generating random samples and computing their sums. The resulting visualizations demonstrate why the CLT works and when it doesn’t.
* Some ideas that are hard to grasp mathematically are easy to understand by simulation. For example, we approximate p-values by running random simulations, which reinforces the meaning of the p-value.
* Because the book is based on a general-purpose programming language (Python), readers can import data from almost any source. They are not limited to datasets that have been cleaned and formatted for a particular statistics tool.
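
The Central Limit Theorem experiment mentioned in the list above can be sketched in a few lines. This is a minimal illustration rather than code from the course: the exponential distribution, the sample sizes, the number of iterations, and the seed are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(17)  # arbitrary fixed seed for reproducibility

def sample_sums(n, iters=10_000):
    """Draw `iters` samples of size n from an exponential distribution
    (mean 1, variance 1) and return the sum of each sample."""
    return rng.exponential(scale=1.0, size=(iters, n)).sum(axis=1)

def skewness(xs):
    """Sample skewness: close to 0 for a symmetric distribution."""
    z = (xs - xs.mean()) / xs.std()
    return (z ** 3).mean()

for n in [1, 10, 100]:
    sums = sample_sums(n)
    print(n, round(sums.mean(), 2), round(skewness(sums), 2))
```

As `n` grows, the mean of the sums tracks `n` and their skewness shrinks toward 0, which is the behavior the CLT predicts for a distribution with finite variance.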

The course lends itself to a project-based approach. In class, students work on a semester-long project that requires them to pose a statistical question, find a dataset that can address it, and apply each of the techniques they learn to their own data.

To demonstrate our approach to statistical analysis, the course presents a case study that runs through all of the chapters. It uses data from two sources:

* The National Survey of Family Growth (NSFG), conducted by the U.S. Centers for Disease Control and Prevention (CDC) to gather “information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health.” (See http://cdc.gov/nchs/nsfg.htm.)
* The Behavioral Risk Factor Surveillance System (BRFSS), conducted by the National Center for Chronic Disease Prevention and Health Promotion to “track health conditions and risk behaviors in the United States.” (See http://cdc.gov/BRFSS/.)

Other examples use data from the IRS, the U.S. Census, and the Boston Marathon.

If you have never studied statistics, I think this course is a good place to start. And if you have taken a traditional statistics class, I hope this course will help repair the damage.

The thesis of this book is that data combined with practical methods can answer questions and guide decisions under uncertainty.

As an example, I present a case study motivated by a question I heard when my wife and I were expecting our first child: *do first babies tend to arrive late?*

If you Google this question, you will find plenty of discussion. Some people claim it’s true, others say it’s a myth, and some people say it’s the other way around: first babies come early.

In many of these discussions, people provide data to support their claims. I found many examples like these:

> “My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.”
> “My first one came 2 weeks late and now I think the second one is going to come out two weeks early!!”
> “I don’t think that can be true because my sister was my mother’s first and she was early, as with many of my cousins.”

Reports like these are called **anecdotal evidence** because they are based on data that is unpublished and usually personal. In casual conversation, there is nothing wrong with anecdotes, so I don’t mean to pick on the people I quoted.

But we might want evidence that is more persuasive and an answer that is more reliable. By those standards, anecdotal evidence usually fails, because:

* **Small number of observations:**
If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.

* **Selection bias:**
People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.

* **Confirmation bias:**
People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.

* **Inaccuracy:**
Anecdotes are often personal stories, and are often misremembered, misrepresented, repeated inaccurately, etc.

So how can we do better?

# A Statistical Approach
To address the limitations of anecdotes, we will use the tools of statistics, which include:
* **Data collection**
We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
* **Descriptive statistics**
We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
* **Exploratory data analysis**
We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
* **Estimation**
We will use data from a sample to estimate characteristics of the general population.
* **Hypothesis testing**
Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

By performing these steps with care to avoid pitfalls, we can reach conclusions that are more justifiable and more likely to be correct.

# The National Survey of Family Growth
Since 1973, the US Centers for Disease Control and Prevention (CDC) have conducted the *National Survey of Family Growth (NSFG)*, which is intended to gather “information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. The survey results are used...to plan health services and health education programs, and to do statistical studies of families, fertility, and health.”

We will use data collected by this survey to investigate whether first babies tend to come late, as well as answer other questions. In order to use this data effectively, we have to understand the design of the study.

The NSFG is a **cross-sectional study**, which means that it captures a snapshot of a group at a point in time. The most common alternative is a **longitudinal study**, which observes a group repeatedly over a period of time.

The NSFG has been conducted seven times; each deployment is called a **cycle**. We will use data from Cycle 6, which was conducted from January 2002 to March 2003.

The goal of the survey is to draw conclusions about a **population**; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that’s seldom possible. Instead we collect data from a subset of the population called a **sample**. The people who participate in a survey are called **respondents**.

In general, cross-sectional studies are meant to be **representative**, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.

The NSFG is not representative; instead it is deliberately **oversampled**. The designers of the study recruited three groups—Hispanics, African Americans and teenagers—at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
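
To see why oversampling complicates conclusions about the general population, here is a small sketch that compares an unweighted mean with a mean weighted by the number of people each respondent represents. The values and weights are invented for illustration; they are not NSFG data.

```python
import numpy as np

# Hypothetical toy sample in which group B is deliberately oversampled.
# In the population, A outnumbers B 9 to 1; in the sample they are equal.
values = np.array([10.0, 10.0, 20.0, 20.0])   # measured quantity (A, A, B, B)
weights = np.array([4500, 4500, 500, 500])    # people each respondent represents

unweighted = values.mean()
weighted = np.average(values, weights=weights)

print(unweighted)  # 15.0
print(weighted)    # 11.0
```

The unweighted mean treats the two groups as equally common, while the weighted mean reflects their shares of the population.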

When working with this kind of data, it is important to be familiar with the *codebook*, which documents the design of the study, the survey questions, and the encoding of the responses. The codebook and user’s guide for the NSFG data are available from the [CDC’s website](http://1.usa.gov/1pi2BP2).

## Examples from Chapter 1

Pregnancy data from Cycle 6 of the NSFG is in a file called `2002FemPreg.dat.gz`; it is a gzip-compressed data file in plain text (ASCII), with fixed width columns. Each line in the file is a record that contains data about one pregnancy.

The format of the file is documented in 2002FemPreg.dct, which is a Stata dictionary file. Stata is a statistical software system; a “dictionary” in this context is a list of variable names, types, and indices that identify where in each line to find each variable.

For example, here are a few lines from 2002FemPreg.dct:

```
infile dictionary
{
    _column(1) str12 caseid %12s "RESPONDENT ID NUMBER"
    _column(13) byte pregordr %2f "PREGNANCY ORDER (NUMBER)"
}
```

This dictionary describes two variables: `caseid` is a 12-character string that represents the respondent ID; `pregordr` is a one-byte integer that indicates which pregnancy this record describes for this respondent.
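
To make the idea of a fixed-width layout concrete, the following sketch reads two columns with `pandas.read_fwf`. It does not reproduce how `thinkstats2` works internally; the three records are made up, and the column positions are taken from the dictionary excerpt above.

```python
import io
import pandas as pd

# Made-up records: caseid right-aligned in 12 characters, pregordr in 2.
records = [(1, 1), (1, 2), (2, 1)]
raw = "".join(f"{caseid:>12}{pregordr:>2}\n" for caseid, pregordr in records)

# Stata's _column(n) is 1-based; pandas colspecs are 0-based, half-open intervals.
colspecs = [(0, 12), (12, 14)]
names = ["caseid", "pregordr"]

df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, header=None)
print(df)
```

Note that pandas infers an integer type for `caseid` here, whereas the Stata dictionary declares it as a string.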

The code you downloaded includes `thinkstats2.py`, a Python module that contains many classes and functions used in this book, including functions that read the Stata dictionary and the NSFG data file. Here’s how they are used in `nsfg.py`:

In [27]:
%psource nsfg.ReadFemPreg

def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG pregnancy data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df



`ReadStataDct` takes the name of the dictionary file and returns `dct`, a `FixedWidthVariables` object that contains the information from the dictionary file.

`dct` provides `ReadFixedWidth`, which reads the data file.

Read NSFG data into a Pandas DataFrame.

In [28]:
preg = nsfg.ReadFemPreg()
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.875
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,9.125
3,2,2,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,7.0
4,2,3,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,6.1875


# DataFrames
The result of `ReadFixedWidth` is a DataFrame, the fundamental data structure provided by pandas, a Python data and statistics package. A DataFrame contains a row for each record, in this case one row per pregnancy, and a column for each variable.

In addition to the data, a DataFrame also contains the variable names and their types, and it provides methods for accessing and modifying the data.

The DataFrame has 13593 rows (records) and 244 columns (variables), as the `info` method reports:

In [29]:
preg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13593 entries, 0 to 13592
Columns: 244 entries, caseid to totalwgt_lb
dtypes: float64(171), int64(73)
memory usage: 25.3 MB
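
The same facts can be read programmatically. The sketch below uses a made-up miniature of the pregnancy table, not the NSFG data, to show `shape`, `dtypes`, and a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the pregnancy table: one row per pregnancy.
toy = pd.DataFrame({
    "caseid": [1, 1, 2],
    "pregordr": [1, 2, 1],
    "totalwgt_lb": [8.8125, 7.875, np.nan],
})

n_rows, n_cols = toy.shape       # (rows, columns) tuple
print(n_rows, n_cols)            # 3 3
print(toy.dtypes)                # type of each column
print(toy.isnull().sum())        # missing values per column
```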


The attribute columns returns a sequence of column names as Unicode strings:

In [30]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

The result is an Index, which is another pandas data structure. We’ll learn more about Index later, but for now we’ll treat it like a list:

In [31]:
# Select a single column name
preg.columns[1]

'pregordr'

To access a column from a DataFrame, you can use the column name as a key:

In [56]:
# Select a column and check what type it is
pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

The result is a Series, yet another pandas data structure. A Series is like a Python list with some additional features. When you print a Series, you get the indices and the corresponding values:

In [57]:
# Print a column
pregordr

0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

In this example the indices are integers from 0 to 13592, but in general they can be any sortable type. The elements are also integers, but they can be any type.

The last line includes the variable name, Series length, and data type; int64 is one of the types provided by NumPy.
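
As a small illustration of an index that is not made of integers, here is a Series indexed by strings. The names and values are hypothetical.

```python
import pandas as pd

# Hypothetical Series indexed by strings instead of integers.
weights = pd.Series([7.5, 6.25, 8.0], index=["cho", "ann", "bob"], name="weight_lb")

print(weights["ann"])          # label-based lookup -> 6.25
print(weights.sort_index())    # works because the labels are sortable
print(weights.dtype)           # float64
```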

You can access the elements of a Series using integer indices and slices:

In [58]:
# Select a single element from a column
pregordr[0]

1

Select a slice from a column.

In [35]:
pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64

The result of the index operator is an int64; the result of the slice is another Series.
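
With the default index, which runs from 0 to n-1, positions and labels coincide, so `pregordr[0]` is unambiguous. When the labels differ from the positions, it is safer to say which one you mean. The sketch below uses a made-up Series to contrast `iloc` (position) with `loc` (label).

```python
import pandas as pd

# Made-up Series whose index labels do not start at 0.
s = pd.Series([1, 2, 3], index=[10, 11, 12], name="pregordr")

first_by_position = s.iloc[0]    # position 0 -> 1
by_label = s.loc[10]             # label 10   -> 1
tail = s.iloc[1:3]               # positions 1 and 2, returned as a Series

print(first_by_position, by_label)
print(type(tail).__name__)       # Series
```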

You can also access the columns of a DataFrame using dot notation, although this is not recommended. This notation only works if the column name is a valid Python identifier, so it has to begin with a letter, can’t contain spaces, etc.

In [59]:
#Select a column using dot notation
pregordr = preg.pregordr
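
One reason to prefer bracket syntax is that dot notation can silently pick up something other than a column. The frame below is hypothetical: one column name collides with a DataFrame method, and the other is not a valid identifier.

```python
import pandas as pd

# Hypothetical frame with one column that collides with a DataFrame method
# and one whose name is not a valid Python identifier.
df = pd.DataFrame({"count": [3, 1, 2], "birth weight": [7.0, 8.5, 6.25]})

print(callable(df.count))            # True: this is the DataFrame.count method
print(df["count"].tolist())          # [3, 1, 2]: bracket syntax gets the column
print(df["birth weight"].mean())     # 7.25; dot notation cannot express this name
```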

# Variables
We have already seen two variables in the NSFG dataset, `caseid` and `pregordr`, and we have seen that there are 244 variables in total. For the explorations in this course, we use the following variables:
* `caseid` is the integer ID of the respondent.
* `prglngth` is the integer duration of the pregnancy in weeks.
* `outcome` is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
* `pregordr` is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
* `birthord` is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live births, this field is blank.
* `birthwgt_lb` and `birthwgt_oz` contain the pounds and ounces parts of the birth weight of the baby.
* `agepreg` is the mother’s age at the end of the pregnancy.
* `finalwgt` is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

If you read the codebook carefully, you will see that many of the variables are recodes, which means that they are not part of the raw data collected by the survey; they are calculated using the raw data.

For example, `prglngth` for live births is equal to the raw variable `wksgest` (weeks of gestation) if it is available; otherwise it is estimated using `mosgest` * 4.33 (months of gestation times the average number of weeks in a month).
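
The fallback logic described above can be sketched with `fillna`. This is an illustration on made-up values only; the actual recode may round or apply further consistency checks, so the codebook remains the authority.

```python
import numpy as np
import pandas as pd

# Made-up raw values; NaN marks a missing answer.
raw = pd.DataFrame({
    "wksgest": [39.0, np.nan, 41.0],
    "mosgest": [9.0, 9.0, np.nan],
})

# Prefer weeks of gestation; fall back to months * 4.33 when weeks are missing.
prglngth = raw["wksgest"].fillna(raw["mosgest"] * 4.33)
print(prglngth.tolist())
```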

Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.

# Transformation
When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called data cleaning.

`nsfg.py` includes `CleanFemPreg`, a function that cleans the variables we are planning to use.

In [37]:
%psource nsfg.CleanFemPreg

def CleanFemPreg(df):
    """Recodes variables from the pregnancy frame.

    df: DataFrame
    """
    # mother's age is encoded in centiyears; convert to years
    df.agepreg /= 100.0

    # birthwgt_lb contains at least one bogus value (51 lbs)
    # replace with NaN
    df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

    # replace 'not ascertained', 'refused', 'don't know' with NaN
    na_vals = [97, 98, 99]
    df.birthw

`agepreg` contains the mother’s age at the end of the pregnancy. In the data file, `agepreg` is encoded as an integer number of centiyears. So the first line divides each element of agepreg by 100, yielding a floating-point value in years.

`birthwgt_lb` and `birthwgt_oz` contain the weight of the baby, in pounds and ounces, for pregnancies that end in live births.

In addition they use several special codes:

97 NOT ASCERTAINED<br>
98 REFUSED<br>
99 DON'T KNOW<br>

Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. 

The replace method replaces these values with `np.nan`, a special floating-point value that represents “not a number.” The inplace flag tells replace to modify the existing Series rather than create a new one.
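
Here is a minimal sketch of the same replacement on a made-up ounces column. It returns a new Series instead of using `inplace`; assigning the result back to the column is an alternative to modifying a selected column in place.

```python
import numpy as np
import pandas as pd

# Made-up ounces column containing the special codes 97, 98 and 99.
oz = pd.Series([3, 98, 12, 99, 97, 0], name="birthwgt_oz")

na_vals = [97, 98, 99]
cleaned = oz.replace(na_vals, np.nan)   # returns a new Series; oz is unchanged

print(cleaned.tolist())
print(oz.max(), cleaned.max())          # the special codes no longer dominate
```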

As part of the IEEE floating-point standard, all mathematical operations return nan if either argument is nan:

In [61]:
np.nan / 100.0

nan

So computations with `nan` tend to do the right thing, and most pandas functions handle nan appropriately. But dealing with missing data will be a recurring issue.
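
The following sketch shows, on made-up values, how `nan` propagates through arithmetic and how pandas and NumPy differ in their default handling of it.

```python
import numpy as np
import pandas as pd

values = pd.Series([7.0, np.nan, 9.0])

print(np.nan + 1.0)                    # nan: arithmetic propagates NaN
print(np.nan == np.nan)                # False: NaN is not equal to itself
print(values.mean())                   # 8.0: pandas skips NaN by default
print(values.mean(skipna=False))       # nan
print(np.mean(values.to_numpy()))      # nan: plain NumPy does not skip
print(np.nanmean(values.to_numpy()))   # 8.0
```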

The last line of `CleanFemPreg` creates a new column `totalwgt_lb` that combines pounds and ounces into a single quantity, in pounds.

One important note: when you add a new column to a DataFrame, you must use dictionary syntax, like this:

**CORRECT**
`df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0`

Not dot notation, like this:

**WRONG!**
`df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0`

The version with dot notation adds an attribute to the DataFrame object, but that attribute is not treated as a new column.
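
The difference is easy to check on a small made-up frame. In the sketch below, the column name `totalwgt_oz` is invented for the demonstration, and the warning that pandas typically emits for the dot-notation assignment is silenced.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"birthwgt_lb": [7.0, 8.0], "birthwgt_oz": [8.0, 4.0]})

# Dictionary syntax creates a real column.
df["totalwgt_lb"] = df.birthwgt_lb + df.birthwgt_oz / 16.0

# Dot notation on a name that is not yet a column only sets an attribute;
# pandas typically emits a UserWarning here, silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.totalwgt_oz = df.birthwgt_lb * 16.0 + df.birthwgt_oz

print(list(df.columns))
print("totalwgt_oz" in df.columns)   # False
```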

# Validation
When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors.

One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable.

Here is the table for outcome, which encodes the outcome of each pregnancy:

| value | label | Total |
|------|------|------|
| 1 | LIVE BIRTH | 9148 |
| 2 | INDUCED ABORTION | 1862 |
| 3 | STILLBIRTH | 120 |
| 4 | MISCARRIAGE | 1921 |
| 5 | ECTOPIC PREGNANCY | 190 |
| 6 | CURRENT PREGNANCY | 352 |

The Series class provides a method, `value_counts`, that counts the number of times each value appears. If we select the outcome Series from the DataFrame, we can use value_counts to compare with the published data:

In [38]:
# Count the number of times each value occurs
preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64

The result of value_counts is a Series; `sort_index` sorts the Series by index, so the values appear in order.

Comparing the results with the published table, it looks like the values in outcome are correct.
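
The comparison can also be automated. The helper below, `matches_published`, is a hypothetical name introduced for this sketch; the published totals are copied from the codebook table above, and the toy column only demonstrates that a mismatch is detected.

```python
import pandas as pd

# Published totals for `outcome`, copied from the codebook table above.
published = pd.Series({1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})

def matches_published(series, published):
    """Return True if the value counts of `series` equal the published totals."""
    counts = series.value_counts().sort_index()
    return counts.to_dict() == published.to_dict()

# Tiny made-up column, used only to show that a mismatch is detected.
toy_outcome = pd.Series([1, 1, 2, 4])
print(matches_published(toy_outcome, published))   # False
print(published.sum())                             # 13593, the number of records
```

With the real data loaded, the call would be `matches_published(preg.outcome, published)`.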

Similarly, here is the published table for `birthwgt_lb`:

| value | label | Total |
|------|------|------|
| . | INAPPLICABLE | 4449 |
| 0-5 | UNDER 6 POUNDS | 1125 |
| 6 | 6 POUNDS | 2223 |
| 7 | 7 POUNDS | 3049 |
| 8 | 8 POUNDS | 1889 |
| 9-95 | 9 POUNDS OR MORE | 799 |
In [39]:
# Check the values of another variable
preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

The counts for 6, 7, and 8 pounds check out, and the counts for 0 through 5 pounds add up to the published 1125. The counts for 9 pounds and more add up to 798, one short of the published 799.

The missing record is a value that has to be an error, a 51 pound baby! It does not appear in the output above because `CleanFemPreg` has already replaced it.

To deal with this error, we added a line to `CleanFemPreg`:

`df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan`

This statement replaces invalid values with `np.nan`. The expression `df.birthwgt_lb > 20` yields a Series of type bool, where True indicates that the condition is true. When a Boolean Series is used as an index, it selects only the elements that satisfy the condition.
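
Here is the same Boolean selection on a small made-up column, with the mask stored in a variable so it can be inspected.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"birthwgt_lb": [7.0, 51.0, 8.0, 6.0]})  # made-up values

mask = df.birthwgt_lb > 20          # Boolean Series: True where the value is implausible
print(mask.tolist())                # [False, True, False, False]

df.loc[mask, "birthwgt_lb"] = np.nan   # assign NaN only to the selected rows
print(df.birthwgt_lb.tolist())         # [7.0, nan, 8.0, 6.0]
```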

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

# Interpretation
To work with data effectively, you have to think on two levels at the same time: 
* the level of statistics and 
* the level of context.

As an example, let’s look at the sequence of outcomes for a few respondents. Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent. 

Here’s a function that does that:

In [40]:
%psource nsfg.MakePregMap

def MakePregMap(df):
    """Make a map from caseid to list of preg indices.

    df: DataFrame

    returns: dict that maps from caseid to list of indices into `preg`
    """
    d = defaultdict(list)
    for index, caseid in df.caseid.iteritems():
        d[caseid].append(index)
    return d



`df` is the DataFrame with pregnancy data. The `iteritems` method enumerates the index (row number) and `caseid` for each pregnancy.

`d` is a dictionary that maps from each case ID to a list of indices. If you are not familiar with `defaultdict`, it is in the Python collections module. 
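
The mapping described above can be tried on a made-up frame. This sketch is a variant, not the code in `nsfg.py`: it uses `Series.items`, since `iteritems` is no longer available in recent pandas releases, and the function name `make_preg_map` is introduced here for illustration. It also shows that `groupby` builds an equivalent mapping.

```python
from collections import defaultdict
import pandas as pd

def make_preg_map(df):
    """Map each caseid to the list of row labels of that respondent's pregnancies."""
    d = defaultdict(list)
    for index, caseid in df["caseid"].items():   # items() yields (label, value) pairs
        d[caseid].append(index)
    return d

# Made-up frame with three respondents.
toy = pd.DataFrame({"caseid": [1, 1, 2, 3, 3, 3], "outcome": [1, 1, 4, 4, 4, 1]})

preg_map = make_preg_map(toy)
print(dict(preg_map))                              # {1: [0, 1], 2: [2], 3: [3, 4, 5]}
print(toy.loc[preg_map[3], "outcome"].tolist())    # [4, 4, 1]

# groupby builds an equivalent mapping without an explicit loop.
groups = toy.groupby("caseid").groups
print({k: list(v) for k, v in groups.items()})
```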

Using `d`, we can look up a respondent and get the indices of that respondent’s pregnancies. This example looks up one respondent and prints a list of outcomes for her pregnancies:

In [41]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1], dtype=int64)

`indices` is the list of indices for pregnancies corresponding to respondent 10229.

Using this list as an index into `preg.outcome` selects the indicated rows and yields a Series. Instead of printing the whole Series, we selected the values attribute, which is a NumPy array.

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [42]:
# Solution

preg.birthord.value_counts().sort_index()

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
Name: birthord, dtype: int64

We can also use `isnull` to count the number of nans.

In [43]:
preg.birthord.isnull().sum()

4445
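
This count can be cross-checked against other figures in this chapter. Because `birthord` is blank for outcomes other than live births, the number of missing values should equal the number of records that are not live births. The sketch below only does arithmetic on numbers already shown above.

```python
# Figures quoted in this chapter.
total_records = 13593     # rows in the pregnancy DataFrame
live_births = 9148        # outcome == 1 in the codebook table
missing_birthord = 4445   # result of preg.birthord.isnull().sum()

# Value counts of birthord, in order from 1 to 10.
birthord_counts = [4413, 2874, 1234, 421, 126, 50, 20, 7, 2, 1]

print(total_records - live_births)                       # 4445
print(total_records - live_births == missing_birthord)   # True
print(sum(birthord_counts) == live_births)               # True
```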

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [44]:
# Solution

preg.prglngth.value_counts().sort_index()

0       15
1        9
2       78
3      151
4      412
5      181
6      543
7      175
8      409
9      594
10     137
11     202
12     170
13     446
14      29
15      39
16      44
17     253
18      17
19      34
20      18
21      37
22     147
23      12
24      31
25      15
26     117
27       8
28      38
29      23
30     198
31      29
32     122
33      50
34      60
35     357
36     329
37     457
38     609
39    4744
40    1120
41     591
42     328
43     148
44      46
45      10
46       1
47       1
48       7
50       2
Name: prglngth, dtype: int64

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [45]:
preg.totalwgt_lb.mean()

7.265628457623368

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [46]:
# Solution

preg['totalwgt_kg'] = preg.totalwgt_lb / 2.2
preg.totalwgt_kg.mean()

3.302558389828807
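
The solution divides by the rounded factor 2.2. As a side check, the sketch below applies the exact definition of the international pound (0.45359237 kg) to the mean reported above; the difference is small but visible in the third significant digit.

```python
KG_PER_LB = 0.45359237          # exact, by definition of the international pound

mean_lb = 7.265628457623368     # mean of totalwgt_lb reported above
mean_kg_exact = mean_lb * KG_PER_LB
mean_kg_approx = mean_lb / 2.2  # the rounded factor used in the solution

print(round(mean_kg_exact, 4))
print(round(mean_kg_approx, 4))
```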

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [47]:
%psource nsfg.ReadFemResp

def ReadFemResp(dct_file='2002FemResp.dct',
                dat_file='2002FemResp.dat.gz',
                nrows=None):
    """Reads the NSFG respondent data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip', nrows=nrows)
    CleanFemResp(df)
    return df



In [48]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [49]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [50]:
# Solution

resp.age_r.value_counts().sort_index()

15    217
16    223
17    234
18    235
19    241
20    258
21    267
22    287
23    282
24    269
25    267
26    260
27    255
28    252
29    262
30    292
31    278
32    273
33    257
34    255
35    262
36    266
37    271
38    256
39    215
40    256
41    250
42    215
43    253
44    235
Name: age_r, dtype: int64

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [51]:
resp[resp.caseid==2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from `preg` like this:

In [52]:
preg[preg.caseid==2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb,totalwgt_kg
2610,2298,1,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875,3.125
2611,2298,2,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,5.5,2.5
2612,2298,3,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,4.1875,1.903409
2613,2298,4,,,,,6.0,,1.0,,...,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875,3.125
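
Selecting by `caseid` works one respondent at a time. To attach respondent columns to every pregnancy at once, one option is a merge on `caseid`. The sketch below uses made-up miniatures of the two tables: the ages match the `resp` rows shown above, but the pregnancy rows are abbreviated and the `prglngth` value for `caseid` 5012 is invented.

```python
import pandas as pd

# Made-up miniatures of the two tables, sharing the key column caseid.
resp = pd.DataFrame({"caseid": [2298, 5012], "age_r": [27, 42]})
preg = pd.DataFrame({
    "caseid": [2298, 2298, 5012],
    "pregordr": [1, 2, 1],
    "prglngth": [40, 36, 39],   # the value for 5012 is invented
})

# One row per pregnancy, with the respondent's columns attached.
merged = preg.merge(resp, on="caseid", how="left")
print(merged)
```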


How old is the respondent with `caseid` 1?

In [53]:
# Solution

resp[resp.caseid==1].age_r

1069    44
Name: age_r, dtype: int64

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [54]:
# Solution

preg[preg.caseid==2298].prglngth

2610    40
2611    36
2612    30
2613    40
Name: prglngth, dtype: int64

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [55]:
# Solution

preg[preg.caseid==5012].birthwgt_lb

5515    6.0
Name: birthwgt_lb, dtype: float64