- [Chapter 1: Exploratory data analysis](#chapter1)
    - [1.1 A statistical approach](#subchapter1.1)
    - [1.2 The National Survey of Family Growth](#subchapter1.2)
    - [1.3 Importing the data](#subchapter1.3)
    - [1.4 DataFrames](#subchapter1.4)
    - [1.5 Variables](#subchapter1.5)
    - [1.6 Transformation](#subchapter1.6)
    - [1.7 Validation](#subchapter1.7)
    - [1.8 Interpretation](#subchapter1.8)
    - [1.9 Exercises](#subchapter1.9)
    - [1.10 Glossary](#subchapter1.10)

<a id='chapter1'></a>
# Chapter 1: Exploratory data analysis

The thesis of the book is that data combined with practical methods can answer questions and guide decisions under uncertainty.

The book opens with a real-life example: the claim that first babies tend to arrive late. The personal stories people tell to support this claim are <i>anecdotal evidence</i>, since they are based on data that is unpublished and usually personal.

Anecdotal evidence usually fails to give a persuasive and reliable answer to questions like this one, because of:
- <b>Small number of observations</b>: A large sample of data is needed in order to be sure the difference exists.
- <b>Selection bias</b>: People who join the discussion might be interested <b>because</b> their baby was late.
- <b>Confirmation bias</b>: People who believe the claim may be more likely to contribute examples to confirm it.
- <b>Inaccuracy</b>: Anecdotes are often personal stories, and are often misremembered, misrepresented or repeated inaccurately.
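
The first point, small number of observations, can be illustrated with a short simulation. This is only a sketch: the mean of 39 weeks and standard deviation of 2.7 weeks are assumed values for illustration, not figures from the NSFG, and `simulated_differences` is a hypothetical helper. Both groups are drawn from the same distribution, so any difference between their means is due to chance alone.

```python
import numpy as np

def simulated_differences(sample_size, trials=1000, seed=0):
    """Differences in mean pregnancy length between two groups drawn
    from the SAME distribution, so every difference is due to chance."""
    rng = np.random.default_rng(seed)
    # Assumed values for illustration only: mean 39 weeks, std 2.7 weeks.
    first = rng.normal(39, 2.7, size=(trials, sample_size))
    other = rng.normal(39, 2.7, size=(trials, sample_size))
    return first.mean(axis=1) - other.mean(axis=1)

small = simulated_differences(sample_size=10)
large = simulated_differences(sample_size=1000)

# The spread of chance differences shrinks as the sample grows.
print("typical chance difference, n=10:  ", small.std())
print("typical chance difference, n=1000:", large.std())
```

With only ten observations per group, chance differences are much larger than with a thousand, which is why a handful of stories cannot settle the question.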

So, how can we do better?

<a id='subchapter1.1'></a>
## 1.1 A statistical approach

We will use the tools of statistics to address the limitations of anecdotes. These tools include:

- <b>Data collection</b>: Specifically we'll use the National Survey of Family Growth.
- <b>Descriptive statistics</b>: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize the data.
- <b>Exploratory data analysis</b>: We will look for patterns, differences, and other features that address the questions we're interested in.
- <b>Estimation</b>: We will use data from a sample to estimate characteristics of the general population.
- <b>Hypothesis testing</b>: Where we see apparent effects, we will evaluate whether or not the effect might have happened by chance.


<a id='subchapter1.2'></a>
## 1.2 The National Survey of Family Growth

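[The National Survey of Family Growth](http://www.cdc.gov/nchs/nsfg/about_nsfg.htm) (NSFG) is a survey that the US Centers for Disease Control and Prevention (CDC) has conducted since 1973 to collect information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health.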

About the design of the study: 
- <b>Cross-sectional</b>: Captures a snapshot of a group at a point in time (As opposed to a <b>longitudinal</b> study which observes a group repeatedly over a period of time).
- <b>Conducted in cycles</b>: The survey has been conducted several times; each deployment is called a <b>cycle</b>. In this book we'll use data from Cycle 6.
- <b>Data was collected from a subset of the population</b>: From the sample data, we aim to draw conclusions about the entire US population. The people who participate in a survey are called <b>respondents</b>.
- <b>Oversampled</b>: In general, cross-sectional studies are meant to be <b>representative</b>, which means that every member of the target population has an equal chance of participating. The NSFG is not representative. Instead it is deliberately <b>oversampled</b>, which means the designers of the study recruited three groups at higher rates than their representation in the US population: Hispanics, African Americans, and teenagers. The reason was to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.
- <b>Codebook</b>: When working with this kind of data, it is important to be familiar with the [codebook](http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm) which documents the design of the study, the survey questions, etc.
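
The effect of oversampling on a summary statistic can be sketched with made-up numbers. Everything below is hypothetical (the two groups, their population shares, and their means); it only shows why an unweighted mean from an oversampled design is biased and how per-respondent weights correct it.

```python
import numpy as np

# Hypothetical population: group A is 90% of people, group B is 10%.
# Assume the quantity of interest averages 10 in A and 20 in B.
pop_share = {"A": 0.9, "B": 0.1}
group_mean = {"A": 10.0, "B": 20.0}
true_mean = sum(pop_share[g] * group_mean[g] for g in pop_share)

# Oversampled design: recruit 500 respondents from each group,
# so group B is far more common in the sample than in the population.
n_sampled = {"A": 500, "B": 500}
values = np.concatenate([np.full(n_sampled[g], group_mean[g]) for g in "AB"])

# Each respondent's weight is proportional to the number of people
# they represent: population share divided by sample count.
weights = np.concatenate(
    [np.full(n_sampled[g], pop_share[g] / n_sampled[g]) for g in "AB"]
)

unweighted = values.mean()                      # pulled toward group B
weighted = np.average(values, weights=weights)  # recovers the population mean

print("population mean:", true_mean)
print("unweighted sample mean:", unweighted)
print("weighted sample mean:", round(weighted, 6))
```

In the NSFG, the variable that plays the role of `weights` is the statistical weight attached to each respondent.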

<a id='subchapter1.3'></a>
## 1.3 Importing the data

The original code and data used in this book are available [here](https://github.com/AllenDowney/ThinkStats2). Running <b>nsfg.py</b> should read the data file, run some tests, and print a message like "All tests passed."

In [1]:
%run nsfg.py

(13593, 244)
All tests passed.


At the top of the file we import libraries and packages we intend to use in the file:

In [2]:
"""This file contains code for use with "Think Stats",
by Allen B. Downey, available from greenteapress.com

Copyright 2010 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""

from __future__ import print_function

from collections import defaultdict
import numpy as np
import sys

import thinkstats2

Pregnancy data from Cycle 6 of the NSFG is in a file called <b>2002FemPreg.dat.gz</b>. This is a gzip-compressed data file in plain text (ASCII), with fixed width columns. Each line in this file is a <b>record</b> that contains data about one pregnancy.

This is what the first 3 lines of data in this file look like:

In [6]:
import gzip

with gzip.open('2002FemPreg.dat.gz', 'r') as dat_example:
    head = [next(dat_example) for x in range(3)]
print(head)

[b'           1 1     6 1     11093 1084     9 039 9   0  1 813             1093 13837                        1                5                                                                        116610931166 9201093                111             3    1   12  11         5391 110933316108432411211     2995 1212 69544441116122222 2 2224693215    000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231\n', b'           1 2     6 1     11166 1157     9 039 9   0  2 714             1166  6542112  2  05 1 4  5       51               1   41  4  201 20                                                        1166109311661093116611661231        111             3    1   14  11         5391 211663925115738501211 2 432 8701414 69544441116122222 2 2224693215    000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231\n', b'           2 1     5 35    11156 1147     03939 9   0  1 9 2 2 2 0 1 1 4 1156  7524      

The format of this file is documented in <b>2002FemPreg.dct</b>, which is a Stata dictionary file. Stata is a statistical software system. A dictionary in this context is a list of variable names, types and indices that identify where in each line to find each variable. 

For example, here are the first few lines of this .dct file:

In [8]:
with open('2002FemPreg.dct', 'r') as dct_example:
    head = [next(dct_example) for x in range(3)]
print(head)

['infile dictionary {\n', '    _column(1)      str12                             caseid  %12s  "RESPONDENT ID NUMBER"\n', '   _column(13)       byte                           pregordr   %2f  "PREGNANCY ORDER (NUMBER)"\n']
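
To see how column positions from a dictionary translate into fields, here is a minimal sketch that uses `pandas.read_fwf` on a tiny in-memory file. It is not how <i>thinkstats2</i> reads the data; the three records are made up, and the width of two characters for pregordr is an assumption based on its `%2f` format.

```python
import io
import pandas as pd

# Hypothetical three-record file laid out like the start of the pregnancy
# file: caseid right-aligned in 12 characters, pregordr in the next 2.
rows = [(1, 1), (1, 2), (2, 1)]
text = "\n".join(f"{caseid:>12}{pregordr:>2}" for caseid, pregordr in rows)

# _column(1) str12 -> characters 0..11; _column(13) -> characters 12..13
# (assumed width). colspecs uses 0-based, half-open [start, end) intervals,
# while the Stata dictionary counts columns from 1.
colspecs = [(0, 12), (12, 14)]
names = ["caseid", "pregordr"]

sample = pd.read_fwf(io.StringIO(text), colspecs=colspecs, names=names, header=None)
print(sample)
```

Note the off-by-one shift between the dictionary's 1-based column numbers and the 0-based offsets that pandas expects.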


By using the module <i>thinkstats2</i> we can preview the data in a formatted and clean way:

In [9]:
# The files sit in the working directory, as in the cells above.
dct_file = '2002FemPreg.dct'
dat_file = '2002FemPreg.dat.gz'

# Parse the Stata dictionary, then use it to read the fixed-width data file.
dct = thinkstats2.ReadStataDct(dct_file)
df = dct.ReadFixedWidth(dat_file, compression='gzip')

# CleanFemPreg is defined in nsfg.py (see section 1.6).
CleanFemPreg(df)
print(df.head())

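The function <i>ReadStataDct</i> takes the name of the dictionary file and returns dct, a <i>FixedWidthVariables</i> object that contains the information from the dictionary file. dct provides <i>ReadFixedWidth</i>, which reads the data file. The function <i>ReadFemPreg</i> wraps these steps, cleans the result, and returns a DataFrame: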

In [None]:
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG Cycle 6 (2002) pregnancy data.
    dct_file: string file name
    dat_file: string file name
    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file, encoding='iso-8859-1')
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df

<a id='subchapter1.4'></a>
## 1.4 DataFrames

The result of <i>ReadFixedWidth</i> is a DataFrame, the fundamental data structure provided by pandas, a Python data and statistics package. A DataFrame contains a row for each record (here, one row per pregnancy) and a column for each variable. In addition to the data, it contains the variable names and their types, and it provides methods for accessing and modifying the data.

Here is a list of useful commands to get yourself used to working with DataFrames:

Because df is large, displaying it shows only the first and last rows and reports its dimensions at the bottom:

In [4]:
df


The <i>columns</i> attribute returns the column names as Unicode strings. The result is another pandas data structure, an Index:

In [9]:
df.columns

Index([         u'caseid',        u'pregordr',       u'howpreg_n',
             u'howpreg_p',        u'moscurrp',        u'nowprgdk',
              u'pregend1',        u'pregend2',        u'nbrnaliv',
              u'multbrth',
       ...
            u'laborfor_i',      u'religion_i',         u'metro_i',
               u'basewgt', u'adj_mod_basewgt',        u'finalwgt',
                u'secu_p',            u'sest',         u'cmintvw',
           u'totalwgt_lb'],
      dtype='object', length=244)

We can look up a column name by its position in the Index:

In [10]:
df.columns[1]

u'pregordr'

To access a column itself, we can use its name as a key:

In [11]:
pregordr = df['pregordr']
type(pregordr)

pandas.core.series.Series

The result of accessing the column is a Series - another pandas data structure. A Series is like a Python list with some additional features. When you print a Series, you get the indices and the corresponding values. We will only print the first values by using .head():

In [12]:
pregordr.head()

0    1
1    2
2    1
3    2
4    3
Name: pregordr, dtype: int64

In the example above the indices are ints, but they can be any sortable type. The elements can be of any type.
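
As a small illustration of non-integer indices, the sketch below builds two Series from made-up values (the names and numbers are hypothetical, not NSFG data) and looks elements up by label and by position.

```python
import pandas as pd

# Hypothetical data: a Series indexed by strings instead of integers.
weights = pd.Series([7.5, 6.25, 8.0], index=["ann", "bob", "cy"])

by_label = weights["bob"]            # look up by label
by_position = weights.iloc[0]        # look up by position, whatever the labels are
label_slice = weights["ann":"bob"]   # label slices include both endpoints

# Elements need not be numbers; this Series holds strings.
outcomes = pd.Series(["live birth", "miscarriage"], index=[1, 4])

print(by_label, by_position)
print(label_slice)
print(outcomes.loc[4])
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup unambiguous when the labels themselves are integers.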

You can access the elements of a Series using integer indices and slices:

In [13]:
pregordr[4]

3

In [14]:
pregordr[7:16]

7     3
8     1
9     2
10    1
11    1
12    2
13    3
14    1
15    2
Name: pregordr, dtype: int64

You can also access the columns of a DataFrame using dot notation, but the column name has to be a valid Python identifier.

In [15]:
pregordr = df.pregordr
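
The sketch below shows two cases where dot notation does not reach a column, using a made-up DataFrame: a name containing a space, and a name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one name is a valid identifier, one contains a space,
# and one ("count") collides with the DataFrame.count method.
frame = pd.DataFrame(
    {"pregordr": [1, 2], "birth weight": [7.5, 6.0], "count": [3, 4]}
)

print(frame.pregordr.tolist())         # dot notation works
print(frame["birth weight"].tolist())  # a space forces bracket notation
print(callable(frame.count))           # the method shadows the column
print(frame["count"].tolist())         # brackets always reach the column
```

Bracket notation works for every column name, so it is the safer choice in code that has to handle arbitrary datasets.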

<a id='subchapter1.5'></a>
## 1.5 Variables

There are 244 variables in the NSFG dataset. The book uses the following:
- <b>caseid</b>: int; ID of the respondent.
- <b>prglngth</b>: int; duration of the pregnancy in weeks.
- <b>outcome</b>: int; code for the outcome of the pregnancy, where 1 indicates a live birth.
- <b>pregordr</b>: int; pregnancy serial number (1 for a respondent's first pregnancy, 2 for the second, and so on).
- <b>birthord</b>: int; serial number for live births (1 for a respondent's first child, and so on). For outcomes other than live birth, this field is blank.
- <b>birthwgt_lb</b>, <b>birthwgt_oz</b>: int; the pounds and ounces parts of the baby's birth weight.
- <b>agepreg</b>: mother's age at the end of the pregnancy, stored in hundredths of a year in the raw file and converted to years during cleaning.
- <b>finalwgt</b>: float; statistical weight of the respondent, indicating the number of people in the US population that this respondent represents.

Reading the codebook reveals that many of the variables are <b>recodes</b>, which means they are not part of the <b>raw data</b> collected by the survey but are calculated from the raw data. It is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.

<a id='subchapter1.6'></a>
## 1.6 Transformation

When you import data, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called <b>data cleaning</b>. 

Things to remember: 
- <b>Special values encoded as numbers are dangerous</b> because if they are not handled properly, they can generate bogus results (like a 99 pound baby).
- <b>Dealing with missing data will be a recurring issue</b>.
- <b>Creating a new column in the DataFrame requires dictionary syntax</b> (like df['totalwgt_lb']) and *not* attribute assignment (like df.totalwgt_lb).
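
The first and third points can be demonstrated on a toy DataFrame. This is a sketch with made-up values, assuming 99 is the code for "don't know"; it is separate from the cleaning function used for the real data.

```python
import warnings
import numpy as np
import pandas as pd

# Hypothetical weights in pounds and ounces; 99 stands for "don't know".
frame = pd.DataFrame({"lb": [7, 8, 99, 6], "oz": [8, 0, 99, 4]})

naive_mean = frame["lb"].mean()  # the special value inflates the result

# Replace the special value with NaN; mean() then skips the missing entry.
cleaned = frame.replace(99, np.nan)
clean_mean = cleaned["lb"].mean()

# Attribute assignment sets a Python attribute, not a column.
# pandas emits a UserWarning here, silenced for the demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    cleaned.total_lb = cleaned["lb"] + cleaned["oz"] / 16.0
created_by_attribute = "total_lb" in cleaned.columns

# Dictionary syntax creates the column.
cleaned["total_lb"] = cleaned["lb"] + cleaned["oz"] / 16.0
created_by_key = "total_lb" in cleaned.columns

print(naive_mean, clean_mean)
print(created_by_attribute, created_by_key)
```

The naive mean is 30 pounds, while the cleaned mean is 7 pounds, which shows how a single unhandled code can distort a summary statistic.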

<b>CleanFemPreg</b> is a function that cleans the variables we're going to use:

In [16]:
def CleanFemPreg(df):
    """Recodes variables from the pregnancy frame.
    df: DataFrame
    """
    # mother's age is encoded in centiyears; convert to years
    df.agepreg /= 100.0

    # birthwgt_lb contains at least one bogus value (51 lbs); replace with NaN.
    # The boolean Series df.birthwgt_lb > 20 selects the rows to overwrite;
    # df.loc avoids chained assignment, which may silently modify a copy.
    df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

    # replace 'not ascertained', 'refused', 'don't know' with NaN
    na_vals = [97, 98, 99]
    df['birthwgt_lb'] = df.birthwgt_lb.replace(na_vals, np.nan)
    df['birthwgt_oz'] = df.birthwgt_oz.replace(na_vals, np.nan)

    # birthweight is stored in two columns, lbs and oz.
    # convert to a single column in lb
    # NOTE: creating a new column requires dictionary syntax,
    # not attribute assignment (like df.totalwgt_lb)
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0

    # due to a bug in ReadStataDct, the last variable gets clipped;
    # so for now set it to NaN
    df.cmintvw = np.nan


<a id='subchapter1.7'></a>
## 1.7 Validation

When data is exported from one software environment and imported into another, errors might be introduced. Taking the time to validate the data when you're getting familiar with the new dataset may save you time and help avoid errors. 

One way to validate data is to compute basic statistics and compare them with published results.

Here is the codebook table for <i>outcome</i>, which encodes the outcome of each pregnancy:

<img src="figs/chap01outcome.png" align="left">


In [17]:
# The Series class provides the method value_counts, which counts
# how many times each value appears; sort_index orders the result by value
df.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64

In [18]:
type(df.outcome.value_counts(sort=False))

pandas.core.series.Series
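
The comparison with published results can also be automated. The helper below, `check_counts`, is a hypothetical function written for this note; the outcome codes and "published" counts are made up for illustration and are not the NSFG figures.

```python
import pandas as pd

def check_counts(series, expected):
    """Return {value: (observed, expected)} for every value whose
    observed count differs from the expected count."""
    observed = series.value_counts().to_dict()
    mismatches = {}
    for value in set(observed) | set(expected):
        pair = (observed.get(value, 0), expected.get(value, 0))
        if pair[0] != pair[1]:
            mismatches[value] = pair
    return mismatches

# Hypothetical outcome codes and "published" counts, for illustration only.
outcome = pd.Series([1, 1, 1, 2, 4, 4])
published = {1: 3, 2: 1, 4: 2}

agree = check_counts(outcome, published)
disagree = check_counts(outcome, {1: 3, 2: 2, 4: 2})

print(agree)     # an empty dict means every count matches
print(disagree)  # maps each mismatching value to (observed, expected)
```

An empty result is the signal that the imported data matches the reference table for that variable.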

<a id='subchapter1.8'></a>
## 1.8 Interpretation

To work with data effectively, you have to think on two levels at the same time: The level of statistics and the level of context. As an example, let's look at the sequence of outcomes for a few respondents.  Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent.

Here's a function that does that:

In [19]:
def MakePregMap(df):
    """Make a map from caseid to list of preg indices.

    df: DataFrame

    returns: dict that maps from caseid to list of indices into preg df
    """
    d = defaultdict(list)
    # Series.iteritems was removed in pandas 2.0; items is the replacement
    for index, caseid in df.caseid.items():
        d[caseid].append(index)
    return d

<i>defaultdict</i>, a class from the <i>collections</i> module, is a subclass of the built-in dict. When a missing key is accessed, it calls the factory function given at construction (here <i>list</i>) to create a default value, so each new caseid starts with an empty list. In all other respects it behaves like a regular dict.

The returned <i>d</i> is a dictionary that maps from each case ID to a list of indices. Using d, we can look up a respondent and get the indices of this respondent's pregnancies.

In [20]:
d = defaultdict(list)
for index, caseid in df.caseid.items():
    d[caseid].append(index)

# ID of the respondent
caseid = 10229

# list of indices for pregnancies associated with the respondent
indices = d[caseid]

df.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1])
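
For comparison, pandas can build the same kind of mapping with `groupby`. The sketch below uses a made-up three-row frame rather than the NSFG data, and is an alternative to the loop in MakePregMap, not the method used in nsfg.py.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical frame: respondent 1 has two pregnancies, respondent 2 has one.
preg = pd.DataFrame({"caseid": [1, 1, 2], "outcome": [4, 1, 1]})

# Loop-based map, following the same logic as MakePregMap.
by_loop = defaultdict(list)
for index, caseid in preg.caseid.items():
    by_loop[caseid].append(index)

# groupby builds the same mapping; .groups maps each key to an Index of row labels.
groups = preg.groupby("caseid").groups
by_groupby = {int(k): [int(i) for i in v] for k, v in groups.items()}

print(dict(by_loop))
print(by_groupby)
print(preg.outcome[by_loop[1]].values)  # outcomes for respondent 1
```

Both approaches give a dictionary from caseid to row indices; the loop is easier to read, while `groupby` avoids an explicit Python loop on large frames.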

The last part of <b>nsfg.py</b> is the <i>main</i> function, which checks that the script runs correctly and that the required packages are installed on your system:

In [21]:
def main(script):
    """Tests the functions in this module.

    script: string script name
    """
    df = ReadFemPreg()
    print(df.shape)

    assert len(df) == 13593

    assert df.caseid[13592] == 12571
    assert df.pregordr.value_counts()[1] == 5033
    assert df.nbrnaliv.value_counts()[1] == 8981
    assert df.babysex.value_counts()[1] == 4641
    assert df.birthwgt_lb.value_counts()[7] == 3049
    assert df.birthwgt_oz.value_counts()[0] == 1037
    assert df.prglngth.value_counts()[39] == 4744
    assert df.outcome.value_counts()[1] == 9148
    assert df.birthord.value_counts()[1] == 4413
    assert df.agepreg.value_counts()[22.75] == 100
    assert df.totalwgt_lb.value_counts()[7.5] == 302

    weights = df.finalwgt.value_counts()
    key = max(weights.keys())
    assert df.finalwgt.value_counts()[key] == 6

    print('%s: All tests passed.' % script)

<a id='subchapter1.9'></a>
## 1.9 Exercises

<b>Exercise 1.1</b>

Open the exercises in the links below. Some cells are already filled in, and you should execute them. Other cells give you instructions for exercises you should try.

[Exercise 1](chap01ex01.ipynb)

[Exercise 1 solution](chap01ex01soln.ipynb)

<b>Exercise 1.2</b>

[Exercise 2](chap01ex02.ipynb)

[Exercise 2 solution](chap01ex02soln.ipynb)

<b>Exercise 1.3</b>

The best way to learn about statistics is to work on a project you are interested in.  Is there a question like, "Do first babies arrive late", that you want to investigate?

Think about questions you find personally interesting, or items of conventional wisdom, or controversial topics, or questions that have political consequences, and see if you can formulate a question that lends itself to statistical inquiry.

Look for data to help you address the question.  Governments are good sources because data from public research is often freely available.  Good places to start include http://www.data.gov/, and http://www.science.gov/, and in the United Kingdom, http://data.gov.uk/.

Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, and the European Social Survey at http://www.europeansocialsurvey.org/.

If it seems like someone has already answered your question, look closely to see whether the answer is justified.  There might be flaws in the data or the analysis that make the conclusion unreliable.  In that case you could perform a different analysis of the same data, or look for a better source of data.

If you find a published paper that addresses your question, you should be able to get the raw data.  Many authors make their data available on the web, but for sensitive data you might have to write to the authors, provide information about how you plan to use the data, or agree to certain terms of use.  

<b>Be persistent!</b>

<a id='subchapter1.10'></a>
## 1.10 Glossary

<b>Anecdotal evidence</b>: Evidence, often personal, that is collected casually rather than by a well-designed study.

<b>Population</b>: A group we are interested in studying. "Population" often refers to a group of people, but the term is used for other subjects, too.

<b>Cross-sectional study</b>: A study that collects data about a population at a particular point in time.

<b>Cycle</b>: In a repeated cross-sectional study, each repetition of the study is called a cycle.

<b>Longitudinal study</b>: A study that follows a population over time, collecting data from the same group repeatedly.

<b>Record</b>: In a dataset, a collection of information about a single person or other subject.

<b>Respondent</b>: A person who responds to a survey.

<b>Sample</b>: The subset of a population used to collect data.

<b>Representative</b>: A sample is representative if every member of the population has the same chance of being in the sample.

<b>Oversampling</b>: The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.

<b>Raw data</b>: Values collected and recorded with little or no checking, calculation or interpretation.

<b>Recode</b>: A value that is generated by calculation and other logic applied to raw data.

<b>Data cleaning</b>: Processes that include validating data, identifying errors, translating between data types and representations, etc.

Next up: [Chapter 2: Distributions](chap02.ipynb)