# Lab 3.
So far we have covered study design and sampling methods, and we have seen how bias can creep in. Now imagine we have a good dataset that is representative of the population we are interested in.

![free samples](images/free-samples.jpeg)

## Goals
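1. Distinguish a **sample** from the **population** it is drawn from
2. Compute an estimate from a sample and see how that estimate varies from sample to sample (its **sampling distribution**)
3. See on a histogram that, as the sample size $n$ increases, the sampling distribution becomes less variable and concentrates around the true value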
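Goal 3 can be previewed with a small simulation. The sketch below does not use the lab data: it builds a hypothetical population in which about 8% of births are under 5 lbs (the 8% figure, the population size, and the helper `sample_props` are all assumptions made for illustration), then compares the spread of sample proportions for $n = 10$ and $n = 100$.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 births, about 8% of them under 5 lbs.
population <- rbinom(100000, size = 1, prob = 0.08) == 1
true_p <- mean(population)

# Draw `reps` samples of size n and return the proportion in each one.
sample_props <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, size = n)))
}

props_10  <- sample_props(10)
props_100 <- sample_props(100)

# Spread of each sampling distribution; the n = 100 spread should be smaller.
sd(props_10)
sd(props_100)

# Histograms on a common x-axis make the narrowing visible.
par(mfrow = c(1, 2))
hist(props_10,  xlim = c(0, 0.5), main = "n = 10",  xlab = "sample proportion")
hist(props_100, xlim = c(0, 0.5), main = "n = 100", xlab = "sample proportion")
```

Both histograms should be centered near `true_p`, with the $n = 100$ histogram noticeably narrower.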

In [2]:
library(readr)
library(dplyr)

In [3]:
# The first column of the CSV is an unnamed row index, so drop it with [, -1]
birth_data <- read_csv("data/us-territory-births.csv")[, -1]
head(birth_data)

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_integer(),
  babyID = col_integer(),
  dbwt = col_integer(),
  combgest = col_integer(),
  sex = col_character(),
  dob_mm = col_integer(),
  cig_rec = col_character()
)


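| babyID | dbwt | combgest | sex | dob_mm | cig_rec |
|--------|------|----------|-----|--------|---------|
| 1      | 2977 | 37       | M   | 1      | N       |
| 2      | 3191 | 41       | M   | 1      | Y       |
| 3      | 1786 | 32       | F   | 1      | N       |
| 4      | 4489 | 39       | M   | 1      | N       |
| 5      | 3203 | 38       | M   | 1      | N       |
| 6      | 3203 | 39       | F   | 1      | N       |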


## Data dictionary
The variable definitions below come from the [data documentation](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2016.pdf).

| Variable | Description |
|----------|-------------|
| babyID   | Unique identifier: row number |
| dbwt     | Birth weight in grams (coded range 0227-8165) |
| combgest | Combined gestation in weeks: 17th to 47th week of gestation |
| sex      | Sex of infant: M (Male) or F (Female) |
| dob_mm   | Birth month of the infant |
| cig_rec  | Whether the mother is classified as a smoker, meaning she reports smoking in any of the three trimesters of pregnancy: Y (Yes), N (No), or U (Unknown) |

For now, the variable we care about is `dbwt`, the birth weight in grams. In the US, babies' birth weights are usually reported in pounds rather than grams, so we first convert `dbwt` to pounds.
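The conversion itself is a single division, since one pound is defined as exactly 453.59237 grams. As a rough sanity check that does not depend on any package, the sketch below converts the first three birth weights from the data by hand (the variable name `grams_per_pound` is just an illustrative choice).

```r
# One international pound is exactly 453.59237 grams.
grams_per_pound <- 453.59237

# First three birth weights (in grams) from the data shown above.
dbwt <- c(2977, 3191, 1786)

# Grams divided by grams-per-pound gives pounds.
bw_lbs <- dbwt / grams_per_pound
round(bw_lbs, 2)
# → 6.56 7.03 3.94
```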

In [4]:
install.packages("measurements")
library(measurements)
birth_data <- birth_data %>% mutate(bw_lbs = conv_unit(dbwt,"g","lbs"))
head(birth_data)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


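| babyID | dbwt | combgest | sex | dob_mm | cig_rec | bw_lbs   |
|--------|------|----------|-----|--------|---------|----------|
| 1      | 2977 | 37       | M   | 1      | N       | 6.563161 |
| 2      | 3191 | 41       | M   | 1      | Y       | 7.03495  |
| 3      | 1786 | 32       | F   | 1      | N       | 3.937455 |
| 4      | 4489 | 39       | M   | 1      | N       | 9.89655  |
| 5      | 3203 | 38       | M   | 1      | N       | 7.061405 |
| 6      | 3203 | 39       | F   | 1      | N       | 7.061405 |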


In [5]:
birth_data <- birth_data %>% mutate(under_5lbs = bw_lbs < 5)
head(birth_data) # have a look at the added variable

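| babyID | dbwt | combgest | sex | dob_mm | cig_rec | bw_lbs   | under_5lbs |
|--------|------|----------|-----|--------|---------|----------|------------|
| 1      | 2977 | 37       | M   | 1      | N       | 6.563161 | FALSE      |
| 2      | 3191 | 41       | M   | 1      | Y       | 7.03495  | FALSE      |
| 3      | 1786 | 32       | F   | 1      | N       | 3.937455 | TRUE       |
| 4      | 4489 | 39       | M   | 1      | N       | 9.89655  | FALSE      |
| 5      | 3203 | 38       | M   | 1      | N       | 7.061405 | FALSE      |
| 6      | 3203 | 39       | F   | 1      | N       | 7.061405 | FALSE      |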


In [6]:
sample_10 <- birth_data %>% sample_n(size = 10) %>% mutate(sample_size = n())

In [8]:
sample_10 %>% summarize(proportion=mean(under_5lbs))

proportion
0
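Taking `mean()` of `under_5lbs` gives a proportion because R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so the mean of a logical vector is the fraction of `TRUE` values. A minimal illustration, using the six `under_5lbs` values from the first rows of the data rather than a random sample:

```r
# under_5lbs for the first six babies in the data: only baby 3 is under 5 lbs.
under_5lbs <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

sum(under_5lbs)   # counts the TRUE values: 1
mean(under_5lbs)  # fraction of TRUE values: 1/6
```

With only 10 babies in a sample, a proportion of exactly 0 is quite possible when low birth weights are uncommon.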


In [10]:
sample_100 <- birth_data %>% sample_n(size = 100) %>% mutate(sample_size = n())

In [11]:
sample_100 %>% summarize(proportion=mean(under_5lbs))

proportion
0.05
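Because the samples are drawn at random, rerunning the cells above will generally give different proportions than the ones shown. If reproducible results are wanted, one common approach is to call `set.seed()` before sampling. The sketch below shows the idea with base R's `sample()` on a made-up vector of weights (the values and the seed 2024 are arbitrary assumptions); `sample_n()` should behave the same way, as it draws from R's random number generator.

```r
# Made-up birth weights in pounds, for illustration only.
weights <- c(6.6, 7.0, 3.9, 9.9, 7.1, 7.1, 5.2, 8.0, 4.8, 6.3)

# Setting the same seed before each draw reproduces the same sample.
set.seed(2024)
first_draw <- sample(weights, size = 5)

set.seed(2024)
second_draw <- sample(weights, size = 5)

identical(first_draw, second_draw)
# → TRUE
```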
