<img src="https://user-images.strikinglycdn.com/res/hrscywv4p/image/upload/c_limit,fl_lossy,h_300,w_300,f_auto,q_auto/1266110/Logo_wzxi0f.png" style="float: left; margin: 20px; height: 55px">

**To be a rightist, like to be a leftist, always means expelling from the soul half of what there is to feel - [José Antonio Primo de Rivera](https://en.wikipedia.org/wiki/Jos%C3%A9_Antonio_Primo_de_Rivera)**

# Chapter 1: Exploratory Data Analysis

The question:

Do first-time babies tend to arrive late?

Most of the evidence people offer on this question is anecdotal: it is based on data that is unpublished and usually personal. Anecdotal evidence usually fails because of:

- **Small number of observations**: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
- **Selection bias**: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
- **Confirmation bias**: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- **Inaccuracy**: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.

### Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:

- **Data collection**: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
- **Descriptive statistics**: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
- **Exploratory data analysis**: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
- **Estimation**: We will use data from a sample to estimate characteristics of the general population.
- **Hypothesis testing**: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

### The Data Source

We will be using the National Survey of Family Growth (NSFG).

See the [NSFG website](http://cdc.gov/nchs/nsfg.htm) and explore the different data sets and information.

The NSFG is a **cross-sectional** study, which means that it captures a snapshot of a group at a point in time. The most common alternative is a **longitudinal study**, which observes a group repeatedly over a period of time.

The goal of the survey is to draw conclusions about a **population**; the target population of the NSFG is people in the United States aged 15-44. Ideally surveys would collect data from every member of the population, but that's seldom possible. Instead we collect data from a subset of the population called a **sample**. The people who participate in a survey are called **respondents**.

In general, cross-sectional studies are meant to be **representative**, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.

The NSFG is not representative; instead it is deliberately **oversampled**. The designers of the study recruited three groups (Hispanics, African-Americans and teenagers) at rates higher than their representation in the U.S. population, in order to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.

Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey. We will come back to this point later.
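
To see why oversampling complicates conclusions about the general population, here is a minimal sketch with made-up numbers (the values and weights are hypothetical, not NSFG data). It compares a plain mean with a mean weighted by how many people each respondent is assumed to represent.

```python
import numpy as np

# Hypothetical survey: group B (last four respondents) is oversampled.
# Each weight is the assumed number of people that respondent represents.
values = np.array([10.0, 10.0, 20.0, 20.0, 20.0, 20.0])
weights = np.array([400.0, 400.0, 50.0, 50.0, 50.0, 50.0])

unweighted_mean = values.mean()                       # dominated by the oversampled group
weighted_mean = np.average(values, weights=weights)   # accounts for representation

print(unweighted_mean)
print(weighted_mean)
```

The unweighted mean is pulled toward the oversampled group, while the weighted mean reflects each group's share of the population.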

The codebook and user's guide for the NSFG data are available from [link](http://www.cdc.gov/nchs/nsfg/nsfgcycle6.htm)

### Importing the data

Now explore the data in the folder. What does 2002FemPreg.dct look like?

It is a Stata dictionary file.

thinkstats2.py is a module that includes code to read Stata dictionaries.

A **module** is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.
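
As a small illustration of these ideas, the sketch below uses the standard-library `math` module (chosen only as an example; it is not part of the course materials) to show importing a module, referencing its attributes, and importing a single function by name.

```python
import math              # bind the whole module to the name `math`
from math import sqrt    # bind one function from the module directly

print(type(math))        # the module itself is an object
print(math.pi)           # a variable defined in the module
print(sqrt(16))          # a function defined in the module

# dir() lists the names a module defines, which helps when exploring it
public_names = [name for name in dir(math) if not name.startswith('_')]
print('sqrt' in public_names)
```

The same pattern applies to any module: `dir()` or a text editor shows what it defines, and `from module import name` brings in one piece.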

Explore the module nsfg, find the function ReadFemPreg(), and then import it.

You might have to copy the module to the correct directory.

In [106]:
import pandas as pd
from Resources.Think_Stats.Thinkstats2 import nsfg
import numpy as np

In [107]:
preg = nsfg.ReadFemPreg(dct_file='Resources/Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Resources/Think_Stats/Thinkstats2/2002FemPreg.dat.gz')

In [108]:
preg

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,cmotpreg,prgoutcome,cmprgend,flgdkmo1,cmprgbeg,ageatend,hpageend,gestasun_m,gestasun_w,wksgest,mosgest,dk1gest,dk2gest,dk3gest,bpa_bdscheck1,bpa_bdscheck2,bpa_bdscheck3,babysex,birthwgt_lb,birthwgt_oz,lobthwgt,babysex2,birthwgt_lb2,birthwgt_oz2,lobthwgt2,babysex3,birthwgt_lb3,birthwgt_oz3,lobthwgt3,cmbabdob,kidage,hpagelb,birthplc,paybirth1,paybirth2,paybirth3,knewpreg,trimestr,ltrimest,priorsmk,postsmks,npostsmk,getprena,bgnprena,pnctrim,lpnctri,workpreg,workborn,didwork,matweeks,weeksdk,matleave,matchfound,livehere,alivenow,cmkidied,cmkidlft,lastage,wherenow,legagree,parenend,anynurse,fedsolid,frsteatd_n,frsteatd_p,frsteatd,quitnurs,ageqtnur_n,ageqtnur_p,ageqtnur,matchfound2,livehere2,alivenow2,cmkidied2,cmkidlft2,lastage2,wherenow2,legagree2,parenend2,anynurse2,fedsolid2,frsteatd_n2,frsteatd_p2,frsteatd2,quitnurs2,ageqtnur_n2,ageqtnur_p2,ageqtnur2,matchfound3,livehere3,alivenow3,cmkidied3,cmkidlft3,lastage3,wherenow3,legagree3,parenend3,anynurse3,fedsolid3,frsteatd_n3,frsteatd_p3,frsteatd3,quitnurs3,ageqtnur_n3,ageqtnur_p3,ageqtnur3,cmlastlb,cmfstprg,cmlstprg,cmintstr,cmintfin,cmintstrop,cmintfinop,cmintstrcr,cmintfincr,evuseint,stopduse,whystopd,whatmeth01,whatmeth02,whatmeth03,whatmeth04,resnouse,wantbold,probbabe,cnfrmno,wantbld2,timingok,toosoon_n,toosoon_p,wthpart1,wthpart2,feelinpg,hpwnold,timokhp,cohpbeg,cohpend,tellfath,whentell,tryscale,wantscal,whyprg1,whyprg2,whynouse1,whynouse2,whynouse3,anyusint,prglngth,outcome,birthord,datend,agepreg,datecon,agecon,fmarout5,pmarpreg,rmarout6,fmarcon5,learnprg,pncarewk,paydeliv,lbw1,bfeedwks,maternlv,oldwantr,oldwantp,wantresp,wantpart,cmbirth,ager,agescrn,fmarital,rmarital,educat,hieduc,race,hispanic,hisprace,rcurpreg,pregnum,parity,insuranc,pubassis,poverty,laborfor,religion,metro,brnout,yrstrus,prglngth_i,outcome_i,birthord_i,datend_i,agepreg_i,datecon_i,agecon_i,fmarout5_i,pmarpreg_i,rmarout6_i,fmarcon5_i,learnpr
g_i,pncarewk_i,paydeliv_i,lbw1_i,bfeedwks_i,maternlv_i,oldwantr_i,oldwantp_i,wantresp_i,wantpart_i,ager_i,fmarital_i,rmarital_i,educat_i,hieduc_i,race_i,hispanic_i,hisprace_i,rcurpreg_i,pregnum_i,parity_i,insuranc_i,pubassis_i,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,,1.0,1093.0,,1084.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,8.0,13.0,,,,,,,,,,1093.0,138.0,37.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,920.0,1093.0,,,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,2.0,,,1.0,1.0,,,,,,,,5,39,1,1.0,1093.0,33.16,1084,3241,1.0,2.0,1.0,1,,,,2.0,995.0,,1,2,1,2,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,,1.0,1166.0,,1157.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,7.0,14.0,,,,,,,,,,1166.0,65.0,42.0,1.0,1.0,2.0,,2.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,1.0,,,,,,,,1.0,,4.0,1.0,4.0,,20.0,1.0,20.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,1093.0,1166.0,1166.0,1231.0,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,4.0,,,1.0,1.0,,,,,,,,5,39,1,2.0,1166.0,39.25,1157,3850,1.0,2.0,1.0,1,2.0,4.0,3.0,2.0,87.0,0.0,1,4,1,4,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.8750
2,2,1,,,,,5.0,,3.0,5.0,,1.0,1156.0,,1147.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,1.0,9.0,2.0,,2.0,2.0,0.0,,1.0,1.0,4.0,,1156.0,75.0,24.0,,,,,,,,,,,,,,,,,,,,,5.0,1.0,,,,,,,,5.0,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,1204.0,1156.0,1204.0,1153.0,1156.0,,,,,5.0,,,,,,,5.0,5.0,,,,,,,,4.0,,5,,5.0,5.0,1.0,1.0,,,,,,,,5,39,1,1.0,1156.0,14.33,1147,1358,5.0,1.0,6.0,5,,,,2.0,995.0,,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,9.1250
3,2,2,,,,,6.0,,1.0,,,1.0,1198.0,,1189.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,7.0,0.0,,,,,,,,,,1198.0,33.0,25.0,1.0,3.0,,,3.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,5.0,1.0,,1205.0,7.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1156.0,1198.0,,,,,,,,4.0,,,,,5.0,,,,,,,,4.0,3.0,1,1.0,5.0,5.0,1.0,1.0,2.0,3.0,2.0,,,,,1,39,1,2.0,1198.0,17.83,1189,1708,5.0,1.0,6.0,5,3.0,4.0,4.0,2.0,995.0,0.0,5,3,5,3,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,7.0000
4,2,3,,,,,6.0,,1.0,,,1.0,1204.0,,1195.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,6.0,3.0,,,,,,,,,,1204.0,27.0,25.0,1.0,3.0,,,2.0,,,0.0,5.0,,1.0,4.0,,,1.0,5.0,2.0,,,,5.0,5.0,1.0,,1221.0,17.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1198.0,1204.0,1204.0,1231.0,,,,,,4.0,,,,,5.0,,,,,,,,4.0,5.0,5,,1.0,1.0,,,4.0,4.0,2.0,,,,,1,39,1,3.0,1204.0,18.33,1195,1758,5.0,1.0,6.0,5,2.0,4.0,4.0,2.0,995.0,3.0,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,6.1875
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13588,12571,1,,,,,6.0,,1.0,,,1.0,993.0,,984.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,6.0,3.0,,,,,,,,,,993.0,234.0,19.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1037.0,993.0,1037.0,955.0,993.0,,,,,5.0,,,,,,,1.0,,,,,1.0,4.0,2.0,,1.0,,1,1.0,1.0,1.0,,,,,,,,,,5,39,1,1.0,993.0,17.91,984,1716,1.0,2.0,1.0,5,,,,2.0,,,3,3,3,3,778,37,37,1,1,13,10,2,1,1,2,5,3,2,2,213,6,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,6.1875
13589,12571,2,,,,,3.0,,,,1000.0,2.0,1000.0,1.0,999.0,19.0,21.0,0.0,6.0,6.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1037.0,993.0,1037.0,993.0,1000.0,,,,,1.0,5.0,,3.0,,,,,5.0,,,,,,,,3.0,,5,,,,1.0,1.0,,,,,,,,5,6,2,,1000.0,18.50,999,1841,1.0,2.0,1.0,1,,,,,,,5,5,5,5,778,37,37,1,1,13,10,2,1,1,2,5,3,2,2,213,6,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13590,12571,3,,,,,3.0,,,,1015.0,2.0,1015.0,1.0,1014.0,20.0,23.0,0.0,5.0,5.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1037.0,993.0,1037.0,1000.0,1015.0,,,,,1.0,5.0,,3.0,,,,,5.0,,,,,,,,3.0,,5,,,,1.0,1.0,,,,,,,,5,5,2,,1015.0,19.75,1014,1966,1.0,2.0,1.0,1,,,,,,,5,5,5,5,778,37,37,1,1,13,10,2,1,1,2,5,3,2,2,213,6,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13591,12571,4,,,,,6.0,,1.0,,,1.0,1037.0,,1028.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,7.0,8.0,,,,,,,,,,1037.0,190.0,24.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,1.0,,1.0,1.0,1.0,,3.0,2.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1037.0,993.0,1037.0,1015.0,1037.0,,,,,1.0,5.0,,3.0,,,,,1.0,,,,2.0,,,1.0,,,1,2.0,,,1.0,1.0,,,,,,,,5,39,1,2.0,1037.0,21.58,1028,2083,1.0,2.0,1.0,1,,,,2.0,3.0,,2,2,2,2,778,37,37,1,1,13,10,2,1,1,2,5,3,2,2,213,6,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,7.5000


What are the columns in the DataFrame?

In [109]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

How many columns does it have? Use 2 methods to calculate it.

In [110]:
preg.shape

(13593, 244)
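
The cell above shows one way to get the answer. As a sketch on a small stand-in DataFrame (the frame `df` below is hypothetical, not the NSFG data), here are two equivalent ways to count columns.

```python
import pandas as pd

# Hypothetical three-column frame used only for illustration
df = pd.DataFrame({'caseid': [1, 1, 2], 'pregordr': [1, 2, 1], 'prglngth': [39, 39, 39]})

n_cols_from_shape = df.shape[1]         # shape is a (rows, columns) tuple
n_cols_from_columns = len(df.columns)   # columns is an Index, so len() works

print(n_cols_from_shape, n_cols_from_columns)
```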

Remember that columns is not a **method**; it is an **attribute**, so it is accessed without parentheses.

What is the first column?

In [111]:
preg.columns[0]

'caseid'

Access the pregordr column. Use 2 different methods.

In [112]:
preg.loc[:,'pregordr']

0        1
1        2
2        1
3        2
4        3
        ..
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64
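
The cell above uses `.loc`. As a sketch on a hypothetical toy frame, the following shows that bracket access and attribute access return the same Series.

```python
import pandas as pd

# Hypothetical toy frame standing in for the pregnancy data
df = pd.DataFrame({'caseid': [1, 1, 2], 'pregordr': [1, 2, 1]})

by_brackets = df['pregordr']      # dictionary-style access
by_attribute = df.pregordr        # attribute access (needs a valid Python identifier)
by_loc = df.loc[:, 'pregordr']    # label-based indexer

all_equal = by_brackets.equals(by_attribute) and by_brackets.equals(by_loc)
print(all_equal)
```

Bracket access works for any column name, which is one reason it is often preferred over attribute access.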

What type is the column object? And what type are the values stored in it?

In [113]:
type(preg.loc[:,'pregordr'])

pandas.core.series.Series

In [114]:
preg.loc[:,'pregordr'].dtypes

dtype('int64')

Get rows 2 to 4 of the column (index labels 1 to 3, since the index starts at 0).

In [115]:
i = preg.loc[1:3,'pregordr']
i

1    2
2    1
3    2
Name: pregordr, dtype: int64
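
Note that a `.loc` slice includes both endpoints, unlike ordinary Python slicing. The sketch below, on a hypothetical Series, contrasts label-based and position-based slicing.

```python
import pandas as pd

# Hypothetical Series with the default 0..4 index
s = pd.Series([1, 2, 1, 2, 3], name='pregordr')

by_label = s.loc[1:3]       # label slice: includes labels 1, 2 and 3
by_position = s.iloc[1:4]   # positional slice: stop is excluded, like Python lists

print(by_label.tolist())
print(by_position.tolist())
```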

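Select the first pregnancies (pregordr equal to 1) whose finalwgt is between 3000 and 4000. Note that finalwgt is the respondent's statistical weight, not the weight of the baby.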

In [116]:
boolean = (preg['pregordr']==1) & (preg['finalwgt'] < 4000) & (preg['finalwgt'] > 3000)
preg.loc[boolean,:]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,cmotpreg,prgoutcome,cmprgend,flgdkmo1,cmprgbeg,ageatend,hpageend,gestasun_m,gestasun_w,wksgest,mosgest,dk1gest,dk2gest,dk3gest,bpa_bdscheck1,bpa_bdscheck2,bpa_bdscheck3,babysex,birthwgt_lb,birthwgt_oz,lobthwgt,babysex2,birthwgt_lb2,birthwgt_oz2,lobthwgt2,babysex3,birthwgt_lb3,birthwgt_oz3,lobthwgt3,cmbabdob,kidage,hpagelb,birthplc,paybirth1,paybirth2,paybirth3,knewpreg,trimestr,ltrimest,priorsmk,postsmks,npostsmk,getprena,bgnprena,pnctrim,lpnctri,workpreg,workborn,didwork,matweeks,weeksdk,matleave,matchfound,livehere,alivenow,cmkidied,cmkidlft,lastage,wherenow,legagree,parenend,anynurse,fedsolid,frsteatd_n,frsteatd_p,frsteatd,quitnurs,ageqtnur_n,ageqtnur_p,ageqtnur,matchfound2,livehere2,alivenow2,cmkidied2,cmkidlft2,lastage2,wherenow2,legagree2,parenend2,anynurse2,fedsolid2,frsteatd_n2,frsteatd_p2,frsteatd2,quitnurs2,ageqtnur_n2,ageqtnur_p2,ageqtnur2,matchfound3,livehere3,alivenow3,cmkidied3,cmkidlft3,lastage3,wherenow3,legagree3,parenend3,anynurse3,fedsolid3,frsteatd_n3,frsteatd_p3,frsteatd3,quitnurs3,ageqtnur_n3,ageqtnur_p3,ageqtnur3,cmlastlb,cmfstprg,cmlstprg,cmintstr,cmintfin,cmintstrop,cmintfinop,cmintstrcr,cmintfincr,evuseint,stopduse,whystopd,whatmeth01,whatmeth02,whatmeth03,whatmeth04,resnouse,wantbold,probbabe,cnfrmno,wantbld2,timingok,toosoon_n,toosoon_p,wthpart1,wthpart2,feelinpg,hpwnold,timokhp,cohpbeg,cohpend,tellfath,whentell,tryscale,wantscal,whyprg1,whyprg2,whynouse1,whynouse2,whynouse3,anyusint,prglngth,outcome,birthord,datend,agepreg,datecon,agecon,fmarout5,pmarpreg,rmarout6,fmarcon5,learnprg,pncarewk,paydeliv,lbw1,bfeedwks,maternlv,oldwantr,oldwantp,wantresp,wantpart,cmbirth,ager,agescrn,fmarital,rmarital,educat,hieduc,race,hispanic,hisprace,rcurpreg,pregnum,parity,insuranc,pubassis,poverty,laborfor,religion,metro,brnout,yrstrus,prglngth_i,outcome_i,birthord_i,datend_i,agepreg_i,datecon_i,agecon_i,fmarout5_i,pmarpreg_i,rmarout6_i,fmarcon5_i,learnpr
g_i,pncarewk_i,paydeliv_i,lbw1_i,bfeedwks_i,maternlv_i,oldwantr_i,oldwantp_i,wantresp_i,wantpart_i,ager_i,fmarital_i,rmarital_i,educat_i,hieduc_i,race_i,hispanic_i,hisprace_i,rcurpreg_i,pregnum_i,parity_i,insuranc_i,pubassis_i,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
11,14,1,,,,,6.0,,1.0,,,1.0,1065.0,,1056.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,7.0,0.0,,,,,,,,,,1065.0,167.0,24.0,,,,,,,,,,,,,,,,,,,,,5.0,1.0,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1084.0,1065.0,1147.0,1046.0,1065.0,,,,,1.0,1.0,5.0,,,,,,1.0,,,,1.0,1.0,2.0,,1.0,,1,1.0,,,1.0,1.0,,,,,,,,5,39,1,1.0,1065.0,23.00,1056,2225,1.0,2.0,1.0,1,,,,2.0,995.0,,3,3,3,3,789,36,36,3,4,13,10,3,1,1,2,3,2,2,2,164,1,3,1,1,1980.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2418.069494,2810.302771,3039.904507,2,56,,7.0000
30,36,1,,,,,1.0,,,,1149.0,2.0,1149.0,0.0,1146.0,,28.0,2.0,3.0,12.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1178.0,1149.0,1178.0,1029.0,1149.0,,,,,1.0,5.0,,12.0,,,,,1.0,,,,1.0,1.0,2.0,,1.0,,1,1.0,1.0,1.0,1.0,1.0,,,,,,,,5,12,4,,1149.0,30.66,1146,3041,5.0,1.0,5.0,5,,,,,,,3,3,3,3,781,37,37,1,1,16,12,2,2,2,2,2,1,2,2,400,1,3,2,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1668.415087,1969.717778,3594.537973,1,69,,
66,72,1,,,,,3.0,,,,1189.0,2.0,1189.0,0.0,1185.0,,25.0,2.0,8.0,17.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1211.0,1189.0,1211.0,1166.0,1189.0,,,,,1.0,1.0,5.0,,,,,,1.0,,,,1.0,4.0,2.0,,3.0,2.0,6,,5.0,5.0,1.0,1.0,3.0,3.0,3.0,,,,,5,17,2,,1189.0,16.00,1185,1566,5.0,1.0,6.0,5,,,,,,,3,6,3,6,997,19,19,5,6,9,5,1,2,3,2,2,1,3,1,84,7,3,2,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2335.596182,2547.349832,3391.286832,1,44,,
90,92,1,,,,,6.0,,1.0,,,1.0,913.0,,904.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,6.0,5.0,,,,,,,,,,913.0,322.0,20.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1069.0,913.0,1069.0,901.0,913.0,,,,,5.0,,,,,,,5.0,5.0,,,,,,,,2.0,,1,1.0,5.0,5.0,1.0,1.0,,,,,,,,5,39,1,1.0,913.0,17.58,904,1683,5.0,1.0,6.0,5,,,,2.0,,,5,3,5,3,702,44,44,1,1,14,11,1,2,3,2,3,3,2,2,491,1,3,2,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2335.117781,2747.818893,3352.339049,1,25,,6.3125
114,135,1,,,,,6.0,,1.0,,,1.0,1175.0,,1166.0,,,0.0,40.0,40.0,9.0,,,,0.0,,,1.0,6.0,10.0,,,,,,,,,,1175.0,54.0,21.0,1.0,3.0,,,9.0,,,0.0,5.0,,1.0,9.0,,,5.0,,,,,,1.0,,,,,,,,,1.0,,3.0,2.0,1.0,,3.0,2.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1175.0,1175.0,1175.0,1168.0,1175.0,1175.0,1229.0,,,1.0,1.0,1.0,,,,,,,,,,1.0,3.0,2.0,,1.0,,6,,5.0,5.0,1.0,1.0,,,,,,,,5,40,1,1.0,1175.0,16.25,1166,1550,5.0,1.0,6.0,5,9.0,9.0,4.0,2.0,3.0,0.0,3,6,3,6,980,20,20,5,6,12,9,1,2,3,2,1,1,3,1,200,1,1,2,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1914.347440,2055.233337,3118.405543,2,8,,6.6250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13485,12477,1,,,,,5.0,,1.0,,,1.0,1066.0,,1057.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,7.0,12.0,,,,,,,,,,1066.0,165.0,17.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1212.0,1066.0,1212.0,1063.0,1066.0,,,,,1.0,1.0,5.0,,,,,,1.0,,,,1.0,2.0,2.0,,1.0,,1,3.0,1.0,1.0,,,,,,,,,,5,39,1,1.0,1066.0,17.25,1057,1650,5.0,1.0,5.0,5,,,,2.0,995.0,,3,1,3,1,859,31,31,5,2,11,7,3,1,1,2,9,9,4,1,50,7,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1914.471020,2425.981741,3078.813428,2,69,,7.7500
13520,12508,1,,,,,6.0,,1.0,,,1.0,1137.0,,1128.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,5.0,8.0,,,,,,,,,,1137.0,101.0,20.0,,,,,,,,,,,,,,,,,,,,,5.0,1.0,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1137.0,1137.0,1137.0,1125.0,1137.0,1137.0,1238.0,,,1.0,1.0,1.0,,,,,,,,,,2.0,,,1.0,,,1,2.0,5.0,5.0,1.0,1.0,,,,,,,,5,39,1,1.0,1137.0,18.08,1128,1733,5.0,1.0,6.0,5,,,,2.0,995.0,,2,2,2,2,920,26,26,5,6,11,7,1,2,3,2,1,1,2,1,14,1,3,3,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1806.511457,2225.544850,3000.702121,2,17,,5.5000
13532,12520,1,,,,,3.0,,,,1193.0,2.0,1193.0,0.0,1191.0,,21.0,2.0,2.0,11.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1193.0,1203.0,1156.0,1193.0,,,,,,1.0,5.0,,,,,,1.0,,,,1.0,1.0,2.0,,1.0,8.0,6,,1.0,1.0,1.0,1.0,5.0,5.0,2.0,,,,,1,11,2,,1193.0,19.75,1191,1958,5.0,1.0,5.0,5,,,,,,,3,6,3,6,956,23,23,5,6,12,9,3,1,1,2,2,0,1,2,469,1,2,1,1,1980.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2418.518453,2811.028826,3156.785372,1,77,,
13566,12551,1,,,,,5.0,,1.0,,,1.0,1163.0,,1154.0,,,0.0,40.0,40.0,9.0,,,,0.0,,,1.0,7.0,8.0,,,,,,,,,,1163.0,72.0,36.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,1.0,,2.0,1.0,2.0,,2.0,1.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1163.0,1163.0,1163.0,991.0,1163.0,1163.0,1235.0,,,1.0,1.0,1.0,,,,,,,,,,2.0,,,1.0,,,1,2.0,1.0,1.0,,,,,,,,,,5,40,1,1.0,1163.0,32.66,1154,3191,5.0,1.0,5.0,5,,,,2.0,9.0,,2,2,2,2,771,38,38,5,6,12,9,2,1,1,2,1,1,2,2,156,1,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2418.538866,3653.453268,3951.940400,2,75,,7.5000
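
The filter above combines several conditions into one boolean mask. As a sketch on hypothetical data, the following shows how each comparison produces a boolean Series and why the parentheses matter.

```python
import pandas as pd

# Hypothetical values, chosen so that only rows 0 and 3 satisfy every condition
df = pd.DataFrame({
    'pregordr': [1, 2, 1, 1],
    'finalwgt': [3500.0, 3200.0, 6448.3, 3999.9],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# Parentheses are required because & binds more tightly than == and <.
mask = (df['pregordr'] == 1) & (df['finalwgt'] > 3000) & (df['finalwgt'] < 4000)
selected = df.loc[mask, :]

print(selected.index.tolist())
```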


### Variables

Out of the 244 variables, we are only going to use:

- *prglngth* is the integer duration of the pregnancy in weeks.
- *outcome* is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.
- *pregordr* is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.
- *birthord* is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.
- *birthwgt_lb* and *birthwgt_oz* contain the pounds and ounces parts of the birth weight of the baby.
- *agepreg* is the mother’s age at the end of the pregnancy.
- *finalwgt* is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

If you read the codebook carefully, you will see that many of the variables are **recodes**, which means that they are not part of the raw data collected by the survey; they are calculated using the **raw data**.

For example, prglngth for live births is equal to the raw variable wksgest (weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33 (months of gestation times the average number of weeks in a month).
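
The fallback rule described above can be sketched in pandas as follows. The values are hypothetical, and this is only an illustration of the rule as stated here, not the procedure the NSFG actually uses to build the recode (for instance, it ignores rounding to whole weeks).

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: row 1 is missing wksgest, row 2 is missing mosgest
raw = pd.DataFrame({
    'wksgest': [39.0, np.nan, 40.0],   # weeks of gestation
    'mosgest': [9.0, 9.0, np.nan],     # months of gestation
})

# Use wksgest where it is available; otherwise fall back to mosgest * 4.33
prglngth_sketch = raw['wksgest'].fillna(raw['mosgest'] * 4.33)

print(prglngth_sketch.tolist())
```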

### Transformation

When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called **data cleaning**.

First of all, ReadFemPreg() calls a cleaning function (CleanFemPreg() in the original module) on the data it reads. Open the module again in a text editor and edit ReadFemPreg() so that it takes an argument that decides whether to clean the data or not, such as:
>ReadFemPreg(clean=True)
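
One way to add such a flag, as a minimal sketch: the functions below are not the module's actual code. `load_raw` is a hypothetical stand-in for the step that reads the .dct and .dat.gz files, and `clean_fem_preg` is a simplified stand-in for the module's cleaning function.

```python
import pandas as pd

def clean_fem_preg(df):
    """Simplified, hypothetical cleaning step: convert centiyears to years."""
    df['agepreg'] = df['agepreg'] / 100.0

def read_fem_preg(load_raw, clean=True):
    """Load the data and clean it only when `clean` is True.

    `load_raw` is a hypothetical zero-argument callable returning the raw DataFrame.
    """
    df = load_raw()
    if clean:
        clean_fem_preg(df)
    return df

def load_raw():
    # Stand-in for reading the fixed-width file
    return pd.DataFrame({'agepreg': [3316.0, 3925.0]})

cleaned = read_fem_preg(load_raw)
dirty = read_fem_preg(load_raw, clean=False)
print(cleaned['agepreg'].tolist(), dirty['agepreg'].tolist())
```

Defaulting the flag to `True` keeps existing calls working unchanged, while `clean=False` returns the raw values.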

After editing the module, load the uncleaned file again (you might have to restart the kernel for the new function to be imported). Then code the following data cleaning transformations:

In [117]:
preg_dirty = nsfg.ReadFemPreg(dct_file='Resources/Think_Stats/Thinkstats2/2002FemPreg.dct',
                      dat_file='Resources/Think_Stats/Thinkstats2/2002FemPreg.dat.gz', clean = False)

In [118]:
pd.set_option('display.max_columns', None)
preg_dirty.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,cmotpreg,prgoutcome,cmprgend,flgdkmo1,cmprgbeg,ageatend,hpageend,gestasun_m,gestasun_w,wksgest,mosgest,dk1gest,dk2gest,dk3gest,bpa_bdscheck1,bpa_bdscheck2,bpa_bdscheck3,babysex,birthwgt_lb,birthwgt_oz,lobthwgt,babysex2,birthwgt_lb2,birthwgt_oz2,lobthwgt2,babysex3,birthwgt_lb3,birthwgt_oz3,lobthwgt3,cmbabdob,kidage,hpagelb,birthplc,paybirth1,paybirth2,paybirth3,knewpreg,trimestr,ltrimest,priorsmk,postsmks,npostsmk,getprena,bgnprena,pnctrim,lpnctri,workpreg,workborn,didwork,matweeks,weeksdk,matleave,matchfound,livehere,alivenow,cmkidied,cmkidlft,lastage,wherenow,legagree,parenend,anynurse,fedsolid,frsteatd_n,frsteatd_p,frsteatd,quitnurs,ageqtnur_n,ageqtnur_p,ageqtnur,matchfound2,livehere2,alivenow2,cmkidied2,cmkidlft2,lastage2,wherenow2,legagree2,parenend2,anynurse2,fedsolid2,frsteatd_n2,frsteatd_p2,frsteatd2,quitnurs2,ageqtnur_n2,ageqtnur_p2,ageqtnur2,matchfound3,livehere3,alivenow3,cmkidied3,cmkidlft3,lastage3,wherenow3,legagree3,parenend3,anynurse3,fedsolid3,frsteatd_n3,frsteatd_p3,frsteatd3,quitnurs3,ageqtnur_n3,ageqtnur_p3,ageqtnur3,cmlastlb,cmfstprg,cmlstprg,cmintstr,cmintfin,cmintstrop,cmintfinop,cmintstrcr,cmintfincr,evuseint,stopduse,whystopd,whatmeth01,whatmeth02,whatmeth03,whatmeth04,resnouse,wantbold,probbabe,cnfrmno,wantbld2,timingok,toosoon_n,toosoon_p,wthpart1,wthpart2,feelinpg,hpwnold,timokhp,cohpbeg,cohpend,tellfath,whentell,tryscale,wantscal,whyprg1,whyprg2,whynouse1,whynouse2,whynouse3,anyusint,prglngth,outcome,birthord,datend,agepreg,datecon,agecon,fmarout5,pmarpreg,rmarout6,fmarcon5,learnprg,pncarewk,paydeliv,lbw1,bfeedwks,maternlv,oldwantr,oldwantp,wantresp,wantpart,cmbirth,ager,agescrn,fmarital,rmarital,educat,hieduc,race,hispanic,hisprace,rcurpreg,pregnum,parity,insuranc,pubassis,poverty,laborfor,religion,metro,brnout,yrstrus,prglngth_i,outcome_i,birthord_i,datend_i,agepreg_i,datecon_i,agecon_i,fmarout5_i,pmarpreg_i,rmarout6_i,fmarcon5_i,learnpr
g_i,pncarewk_i,paydeliv_i,lbw1_i,bfeedwks_i,maternlv_i,oldwantr_i,oldwantp_i,wantresp_i,wantpart_i,ager_i,fmarital_i,rmarital_i,educat_i,hieduc_i,race_i,hispanic_i,hisprace_i,rcurpreg_i,pregnum_i,parity_i,insuranc_i,pubassis_i,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw
0,1,1,,,,,6.0,,1.0,,,1.0,1093.0,,1084.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,8.0,13.0,,,,,,,,,,1093.0,138.0,37.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,920.0,1093.0,,,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,2.0,,,1.0,1.0,,,,,,,,5,39,1,1.0,1093.0,3316.0,1084,3241,1.0,2.0,1.0,1,,,,2.0,995.0,,1,2,1,2,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
1,1,2,,,,,6.0,,1.0,,,1.0,1166.0,,1157.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,7.0,14.0,,,,,,,,,,1166.0,65.0,42.0,1.0,1.0,2.0,,2.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,1.0,,,,,,,,1.0,,4.0,1.0,4.0,,20.0,1.0,20.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,1093.0,1166.0,1166.0,1231.0,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,4.0,,,1.0,1.0,,,,,,,,5,39,1,2.0,1166.0,3925.0,1157,3850,1.0,2.0,1.0,1,2.0,4.0,3.0,2.0,87.0,0.0,1,4,1,4,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
2,2,1,,,,,5.0,,3.0,5.0,,1.0,1156.0,,1147.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,1.0,9.0,2.0,,2.0,2.0,0.0,,1.0,1.0,4.0,,1156.0,75.0,24.0,,,,,,,,,,,,,,,,,,,,,5.0,1.0,,,,,,,,5.0,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,1204.0,1156.0,1204.0,1153.0,1156.0,,,,,5.0,,,,,,,5.0,5.0,,,,,,,,4.0,,5,,5.0,5.0,1.0,1.0,,,,,,,,5,39,1,1.0,1156.0,1433.0,1147,1358,5.0,1.0,6.0,5,,,,2.0,995.0,,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
3,2,2,,,,,6.0,,1.0,,,1.0,1198.0,,1189.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,7.0,0.0,,,,,,,,,,1198.0,33.0,25.0,1.0,3.0,,,3.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,5.0,1.0,,1205.0,7.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1156.0,1198.0,,,,,,,,4.0,,,,,5.0,,,,,,,,4.0,3.0,1,1.0,5.0,5.0,1.0,1.0,2.0,3.0,2.0,,,,,1,39,1,2.0,1198.0,1783.0,1189,1708,5.0,1.0,6.0,5,3.0,4.0,4.0,2.0,995.0,0.0,5,3,5,3,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
4,2,3,,,,,6.0,,1.0,,,1.0,1204.0,,1195.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,6.0,3.0,,,,,,,,,,1204.0,27.0,25.0,1.0,3.0,,,2.0,,,0.0,5.0,,1.0,4.0,,,1.0,5.0,2.0,,,,5.0,5.0,1.0,,1221.0,17.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1198.0,1204.0,1204.0,1231.0,,,,,,4.0,,,,,5.0,,,,,,,,4.0,5.0,5,,1.0,1.0,,,4.0,4.0,2.0,,,,,1,39,1,3.0,1204.0,1833.0,1195,1758,5.0,1.0,6.0,5,2.0,4.0,4.0,2.0,995.0,3.0,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231



agepreg contains the mother’s age at the end of the pregnancy. In the data file, agepreg is encoded as an integer number of centiyears. So first divide each element of agepreg by 100, yielding a floating-point value in
years.

In [119]:
preg_dirty['agepreg'] = preg_dirty['agepreg']/100
preg_dirty['agepreg']

0        33.16
1        39.25
2        14.33
3        17.83
4        18.33
         ...  
13588    17.91
13589    18.50
13590    19.75
13591    21.58
13592    21.58
Name: agepreg, Length: 13593, dtype: float64

birthwgt_lb and birthwgt_oz contain the weight of the baby, in pounds and ounces, for pregnancies that end in live birth. In addition it uses several special codes:<br/>
97 NOT ASCERTAINED<br/>
98 REFUSED<br/>
99 DON'T KNOW<br/>

1. Replace those values with nan. 

Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The replace method replaces these values with np.nan, a special floating-point value that represents “not a number.” The inplace flag tells replace to modify the existing Series rather than create a new one. The cells below achieve the same result with a boolean mask.
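
As a sketch of the replace approach on hypothetical values (not the survey data), the special codes can be mapped to np.nan in one call. Here the result is assigned back to the column, which is an alternative to using the inplace flag.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in pounds, including the three special codes
df = pd.DataFrame({'birthwgt_lb': [8.0, 7.0, 99.0, 97.0, 6.0, 98.0]})

na_vals = [97, 98, 99]  # NOT ASCERTAINED, REFUSED, DON'T KNOW
df['birthwgt_lb'] = df['birthwgt_lb'].replace(na_vals, np.nan)

n_missing = int(df['birthwgt_lb'].isna().sum())
mean_lb = df['birthwgt_lb'].mean()   # missing values are skipped by default

print(n_missing, mean_lb)
```

Without the replacement, the mean would be inflated by the codes 97, 98 and 99 being treated as real weights.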

In [120]:
preg_dirty['birthwgt_lb'].unique()

array([ 8.,  7.,  9.,  6.,  4., nan,  5., 10., 12., 14., 99.,  3.,  1.,
       11.,  2., 13.,  0., 97., 51., 15., 98.])

In [121]:
# 97, 98 and 99 are special codes, not real weights
mask = preg_dirty['birthwgt_lb'].isin([97, 98, 99])
preg_dirty.loc[mask, 'birthwgt_lb'] = np.nan
preg_dirty['birthwgt_lb']

0        8.0
1        7.0
2        9.0
3        7.0
4        6.0
        ... 
13588    6.0
13589    NaN
13590    NaN
13591    7.0
13592    7.0
Name: birthwgt_lb, Length: 13593, dtype: float64

In [122]:
preg_dirty['birthwgt_lb'].unique()

array([ 8.,  7.,  9.,  6.,  4., nan,  5., 10., 12., 14.,  3.,  1., 11.,
        2., 13.,  0., 51., 15.])

In [123]:
preg_dirty['birthwgt_oz'].unique()

array([13., 14.,  2.,  0.,  3.,  9.,  6., 10., nan, 11.,  8.,  5., 12.,
        1.,  7.,  4., 15., 99., 97., 98.])

In [124]:
# Same cleanup for the ounces column
mask = preg_dirty['birthwgt_oz'].isin([97, 98, 99])
preg_dirty.loc[mask,'birthwgt_oz'] = None
preg_dirty['birthwgt_oz']

0        13.0
1        14.0
2         2.0
3         0.0
4         3.0
         ... 
13588     3.0
13589     NaN
13590     NaN
13591     8.0
13592     8.0
Name: birthwgt_oz, Length: 13593, dtype: float64

In [125]:
preg_dirty['birthwgt_oz'].unique()

array([13., 14.,  2.,  0.,  3.,  9.,  6., 10., nan, 11.,  8.,  5., 12.,
        1.,  7.,  4., 15.])

Be careful: nan is not a string. It is a special floating-point value, available in NumPy as np.nan.

As part of the IEEE floating-point standard, all mathematical operations
return nan if either argument is nan:<br/>
```
>>> import numpy as np
>>> np.nan / 100.0
nan
```
So computations with nan tend to do the right thing, and most pandas functions handle nan appropriately. But dealing with missing data will be a recurring issue.
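
A minimal sketch of what "doing the right thing" looks like in practice, using made-up values: arithmetic propagates nan, summary methods skip it by default, and equality tests cannot be used to find it.

```python
import numpy as np
import pandas as pd

# Made-up weights with one missing value.
weights = pd.Series([6.0, np.nan, 8.0])

# Arithmetic propagates nan element by element.
halved = weights / 2

# Summary methods such as mean skip nan by default (skipna=True).
mean_weight = weights.mean()

# nan never compares equal to itself, so test for it with isna instead.
nan_equals_itself = np.nan == np.nan
missing_count = weights.isna().sum()

print(mean_weight, nan_equals_itself, missing_count)
```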

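Create a new column totalwgt_lb that combines pounds and ounces into a single quantity, in pounds.<br>
One important note: when you add a new column to a DataFrame, you must use dictionary syntax (df['name'] = ...), not attribute syntax (df.name = ...). Attribute syntax adds an attribute to the DataFrame object, but that attribute is not treated as a new column.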
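
The difference between the two syntaxes can be checked on a small, made-up DataFrame. The column values here are illustrative, and the warning filter is only there to keep the output quiet:

```python
import warnings
import pandas as pd

# Made-up pounds and ounces for three births.
df = pd.DataFrame({"birthwgt_lb": [8.0, 7.0, 9.0],
                   "birthwgt_oz": [13.0, 14.0, 2.0]})

# Dictionary syntax creates a real column.
df["totalwgt_lb"] = df["birthwgt_lb"] + df["birthwgt_oz"] / 16

# Attribute syntax only attaches an attribute to the object;
# pandas warns about it and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.totalwgt_kg = df["totalwgt_lb"] / 2.2

print(list(df.columns))
```

Only totalwgt_lb shows up in the column list; totalwgt_kg exists as a plain attribute and would be lost by most DataFrame operations.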

In [126]:
preg_dirty['totalwgt_lb'] = preg_dirty['birthwgt_lb'] + (preg_dirty['birthwgt_oz']/16)
preg_dirty[['birthwgt_lb','birthwgt_oz','totalwgt_lb']]

Unnamed: 0,birthwgt_lb,birthwgt_oz,totalwgt_lb
0,8.0,13.0,8.8125
1,7.0,14.0,7.8750
2,9.0,2.0,9.1250
3,7.0,0.0,7.0000
4,6.0,3.0,6.1875
...,...,...,...
13588,6.0,3.0,6.1875
13589,,,
13590,,,
13591,7.0,8.0,7.5000


In [127]:
# Approximate conversion: 1 kg is about 2.2 lb
preg_dirty['totalwgt_kg'] = preg_dirty['totalwgt_lb'] / 2.2
preg_dirty['totalwgt_kg']

0        4.005682
1        3.579545
2        4.147727
3        3.181818
4        2.812500
           ...   
13588    2.812500
13589         NaN
13590         NaN
13591    3.409091
13592    3.409091
Name: totalwgt_kg, Length: 13593, dtype: float64

Compare these values with the results of the function that cleans the data:

In [132]:
preg_dirty['totalwgt_lb']

0        8.8125
1        7.8750
2        9.1250
3        7.0000
4        6.1875
          ...  
13588    6.1875
13589       NaN
13590       NaN
13591    7.5000
13592    7.5000
Name: totalwgt_lb, Length: 13593, dtype: float64

In [128]:
preg['totalwgt_lb']

0        8.8125
1        7.8750
2        9.1250
3        7.0000
4        6.1875
          ...  
13588    6.1875
13589       NaN
13590       NaN
13591    7.5000
13592    7.5000
Name: totalwgt_lb, Length: 13593, dtype: float64

### Validation

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other
misunderstandings. If you take time to validate the data, you can save time later and avoid errors. <br>
One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable. Here is the table for outcome, which encodes the outcome of each pregnancy:<br>

![alt text](Resources/Think_Stats/notebookpics/number_rows_table.png "Title")

The Series class provides a method, value_counts, that counts the number of times each value appears. Select the outcome Series from the DataFrame and use value_counts to compare with the published data:

In [129]:
# Code here
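
One possible shape for this check, sketched on a made-up Series of outcome codes rather than the real column (the values below are illustrative only). Sorting by index puts the codes in the same order as a codebook table:

```python
import pandas as pd

# Hypothetical outcome codes; with the real data this would be the outcome column.
outcome = pd.Series([1, 1, 4, 1, 2, 4, 1])

# Count each code and sort by code so the result lines up with a codebook table.
counts = outcome.value_counts().sort_index()
print(counts)
```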

Similarly, here is the published table for birthwgt_lb. Is there anything weird? If so, fix it.<br>
![alt text](Resources/Think_Stats/notebookpics/number_rows_table2.png "Title")

In [130]:
# Code here
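
The unique values of birthwgt_lb printed earlier include 51, which is not a plausible birth weight in pounds. A minimal sketch of one way to handle such a value, assuming anything above 20 lb is an error (the cutoff and the sample data are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Made-up weights in pounds, including one implausible value.
df = pd.DataFrame({"birthwgt_lb": [8.0, 7.0, 51.0, 6.0]})

# Assumed cutoff: treat anything above 20 lb as invalid.
MAX_PLAUSIBLE_LB = 20
df.loc[df["birthwgt_lb"] > MAX_PLAUSIBLE_LB, "birthwgt_lb"] = np.nan

print(df["birthwgt_lb"].tolist())
```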

### Interpretation

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.<br>
As an example, let’s look at the sequence of outcomes for a few respondents.
Because of the way the data files are organized, we have to do some processing to collect the pregnancy data for each respondent.

Create a dictionary that maps each caseid to the list of indices of that respondent's pregnancies.

The output should look like: {1:[1,1,1,4],2:[1,1,1],...}

Do not use pandas DataFrame functions.

In [131]:
# Code here
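
A minimal sketch of such a mapping, built with plain Python on made-up caseid and outcome lists (the names caseids, outcomes, and preg_map are hypothetical, and the data is not from the NSFG file):

```python
from collections import defaultdict

# Made-up caseid and outcome values, in row order.
caseids = [1, 1, 2, 1, 2, 3]
outcomes = [1, 4, 1, 1, 1, 4]

# Map each caseid to the list of row indices of its pregnancies.
preg_map = defaultdict(list)
for index, caseid in enumerate(caseids):
    preg_map[caseid].append(index)

# Look up the outcomes for one respondent through the stored indices.
outcomes_for_1 = [outcomes[i] for i in preg_map[1]]
print(dict(preg_map), outcomes_for_1)
```

Storing indices rather than outcomes keeps the dictionary reusable: the same indices can be used to look up any other column for that respondent.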

What are all the outcomes observed for caseid = 10229? Use the dictionary you just built.

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.

But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy,
it is natural to be moved by the story it tells.

Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family life, reproduction, and health. At
the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.<br>