In [1]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats/raw/v3/nb/thinkstats.py")

Downloaded thinkstats.py


In [2]:
try:
    import empiricaldist
except ImportError:
    %pip install empiricaldist

Collecting empiricaldist
  Downloading empiricaldist-0.9.0.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: empiricaldist
  Building wheel for empiricaldist (pyproject.toml) ... done
  Created wheel for empiricaldist: filename=empiricaldist-0.9.0-py3-none-any.whl size=14296 sha256=bde34ac3c76cb6414fd55a8cc9663cae6694889111e5a65b4d2b248a92e4bef7
  Stored in directory: /Users/thienhuongvu/Library/Caches/pip/wheels/1a/32/45/308a55ccffc79208a70c80ebbc916d6d8dbd905650fbb354c5
Successfully built empiricaldist
Installing collected packages: empiricaldist
Successfully installed empiricaldist-0.9.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import HTML
from thinkstats import decorate


In [4]:
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats/raw/v3/data/2002FemPreg.dat.gz")

Downloaded 2002FemPreg.dct
Downloaded 2002FemPreg.dat.gz


In [5]:
try:
    import statadict
except ImportError:
    %pip install statadict

Collecting statadict
  Downloading statadict-1.1.0-py3-none-any.whl (9.4 kB)
Installing collected packages: statadict
Successfully installed statadict-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
dct_file = "2002FemPreg.dct"
dat_file = "2002FemPreg.dat.gz"

In [7]:
from statadict import parse_stata_dict


def read_stata(dct_file, dat_file):
    stata_dict = parse_stata_dict(dct_file)
    resp = pd.read_fwf(
        dat_file,
        names=stata_dict.names,
        colspecs=stata_dict.colspecs,
        compression="gzip",
    )
    return resp

In [8]:
preg = read_stata(dct_file, dat_file)

In [9]:
preg.head()

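|   | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | poverty_i | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw |
| - | ------ | -------- | --------- | --------- | -------- | -------- | -------- | -------- | -------- | -------- | --- | --------- | ---------- | ---------- | ------- | ------- | --------------- | -------- | ------ | ---- | ------- |
| 0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |
| 4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | 1231 |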


### Exercise 1.1
Select the birthord column from preg, print the value counts, and compare to results published in the codebook at https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf.

In [12]:
preg.birthord.value_counts(dropna=False).sort_index()

1.0     4413
2.0     2874
3.0     1234
4.0      421
5.0      126
6.0       50
7.0       20
8.0        7
9.0        2
10.0       1
NaN     4445
Name: birthord, dtype: int64

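The totals from the codebook (https://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm) match the value counts above; the 4445 `NaN` entries correspond to the "Inapplicable" row: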

| Value | Label        | Total |
| ----- | ------------ | ----- |
| .     | Inapplicable | 4445  |
| 01    | 1st Birth    | 4413  |
| 02    | 2nd Birth    | 2874  |
| 03    | 3rd Birth    | 1234  |
| 04    | 4th Birth    | 421   |
| 05    | 5th Birth    | 126   |
| 06    | 6th Birth    | 50    |
| 07    | 7th Birth    | 20    |
| 08    | 8th Birth    | 7     |
| 09    | 9th Birth    | 2     |
| 10    | 10th Birth   | 1     |
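
The comparison can also be done programmatically rather than by eye. The following is a minimal sketch on toy data, not the NSFG file itself: `birthord` stands in for `preg["birthord"]`, and the `codebook` Series holds hypothetical totals typed in by hand.

```python
import pandas as pd

# Toy stand-in for preg["birthord"]; None becomes NaN ("Inapplicable").
birthord = pd.Series([1.0, 1.0, 2.0, None, 1.0, 2.0, 3.0, None])

# Hypothetical codebook totals for the toy data, keyed by birth order.
codebook = pd.Series({1.0: 3, 2.0: 2, 3.0: 1})

# value_counts drops NaN by default, so count the missing values separately.
observed = birthord.value_counts().sort_index()
n_missing = int(birthord.isna().sum())

# Align the observed counts to the codebook index before comparing.
matches = bool((observed.reindex(codebook.index) == codebook).all())

print("counts match codebook:", matches)
print("inapplicable (NaN):", n_missing)
```

With the real data, the same check would use the eleven totals from the table above in place of the toy values.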

### Exercise 1.2
Create a new column named `totalwgt_kg` that contains birth weight in kilograms (there are approximately 2.2 pounds per kilogram). Compute the mean and standard deviation of the new column.

In [13]:
preg["totalwgt_lb"] = preg["birthwgt_lb"] + preg["birthwgt_oz"] / 16.0

In [17]:
preg["totalwgt_kg"] = preg["totalwgt_lb"] / 2.2

In [18]:
preg["totalwgt_kg"].mean()

3.327127539842133

In [19]:
preg["totalwgt_kg"].std()

0.9527280814371938
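
One caveat worth checking: survey columns such as `birthwgt_lb` and `birthwgt_oz` often encode non-responses as special numeric codes. The sketch below assumes the codes 97, 98, and 99 (not ascertained, refused, don't know), which should be confirmed against the codebook. If such codes are present in the columns, they would inflate both the mean and the standard deviation. The toy DataFrame is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two NSFG columns; 98 and 99 play the role of sentinel codes.
toy = pd.DataFrame({
    "birthwgt_lb": [7.0, 8.0, 99.0, 6.0],
    "birthwgt_oz": [4.0, 98.0, 0.0, 8.0],
})

# Assumed sentinel codes: not ascertained, refused, don't know.
SENTINELS = [97, 98, 99]

# Replace the sentinel codes with NaN so they are skipped by mean() and std().
clean = toy.replace(SENTINELS, np.nan)
clean["totalwgt_lb"] = clean["birthwgt_lb"] + clean["birthwgt_oz"] / 16.0
clean["totalwgt_kg"] = clean["totalwgt_lb"] / 2.2

raw_mean_lb = (toy["birthwgt_lb"] + toy["birthwgt_oz"] / 16.0).mean()
clean_mean_lb = clean["totalwgt_lb"].mean()

print(f"mean with sentinel codes: {raw_mean_lb:.3f} lb")
print(f"mean after cleaning:      {clean_mean_lb:.3f} lb")
```

In the toy data only two rows survive cleaning, and the mean drops from an implausible value to one in the expected range for birth weights.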

### Exercise 1.3
What are the pregnancy lengths for the respondent with `caseid` 2298?

What was the birth weight of the first baby born to the respondent with `caseid` 5013? 

In [27]:
# pregnancy lengths for the respondent with `caseid` 2298
preg[preg.caseid == 2298][["pregordr", "prglngth", "outcome"]]

|      | pregordr | prglngth | outcome |
| ---- | -------- | -------- | ------- |
| 2610 | 1        | 40       | 1       |
| 2611 | 2        | 36       | 1       |
| 2612 | 3        | 30       | 1       |
| 2613 | 4        | 40       | 1       |


In [35]:
# the birth weight of the first baby born to the respondent with `caseid` 5013
# (birthord == 1 selects the first live birth; for this respondent it is also the first pregnancy)
preg.query("caseid == 5013 and birthord == 1")[["totalwgt_kg", "totalwgt_lb", "prglngth"]]

|      | totalwgt_kg | totalwgt_lb | prglngth |
| ---- | ----------- | ----------- | -------- |
| 5516 | 3.352273    | 7.375       | 29       |


# Daily Assessment Questions

Q1: Compare and contrast the mean, median, and mode as measures of central tendency. In what specific scenarios (e.g., in a dataset of patient ages, or gene expression levels) would you prefer to use the median over the mean, and why?
---
The mean is the arithmetic average: the sum of the values divided by their count. The median is the 50th percentile (half of the data is less than this value, half is greater). The mode is the most frequently occurring value.

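In a skewed dataset, such as patient ages or gene expression levels, the median is preferred over the mean because it is less affected by outliers: a few extreme values pull the mean toward the tail, while the median depends only on the order of the values.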
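
A small illustration of how a single outlier affects each measure, using Python's standard `statistics` module. The `ages` list is hypothetical data made up for this example.

```python
from statistics import mean, median, mode

# Hypothetical patient ages with one extreme value (95).
ages = [34, 36, 36, 38, 40, 41, 95]

age_mean = mean(ages)      # pulled upward by the outlier
age_median = median(ages)  # middle value of the sorted list
age_mode = mode(ages)      # most frequently occurring value

print(f"mean:   {age_mean:.2f}")
print(f"median: {age_median}")
print(f"mode:   {age_mode}")
```

Here the mean lands above every age except the outlier, while the median stays with the bulk of the data.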

---
Q2: Explain the concepts of variance and standard deviation. What information do they convey about a dataset, and why is standard deviation often preferred over variance for interpretation?
---
Both variance and standard deviation are quantitative measures of the spread of the data.

The variance is the mean of the squared differences (or deviations) of each data point from the mean of the dataset.

The standard deviation is the square root of the variance.

Standard deviation is preferred for interpretation because it is expressed in the same units as the original data, which makes it much easier to comprehend and relate to the actual values in the dataset.
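
The definitions above can be checked by hand on a few numbers. This sketch uses the standard `statistics` module on hypothetical weights. It shows both the population form (divide by n), which matches the definition given here, and the sample form (divide by n - 1), which is what pandas `Series.std()` computes by default.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical birth weights in kilograms.
weights_kg = [2.9, 3.1, 3.3, 3.5, 3.7]

pop_var = pvariance(weights_kg)    # mean squared deviation, in kg^2
pop_sd = pstdev(weights_kg)        # square root of pop_var, in kg
sample_var = variance(weights_kg)  # divides by n - 1 instead of n
sample_sd = stdev(weights_kg)

print(f"population variance: {pop_var:.4f} kg^2, std: {pop_sd:.4f} kg")
print(f"sample variance:     {sample_var:.4f} kg^2, std: {sample_sd:.4f} kg")
```

The variance is in squared kilograms, which has no direct physical reading, while the standard deviation is in kilograms and can be compared with the weights themselves.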

---
Q3: Imagine you have a dataset of immune cell counts that you suspect is positively skewed (has a long tail to the right). How would you visually confirm this skewness using a common plot, and what does this skewness imply about the distribution of cell counts?
---
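The skewness can be confirmed using a histogram or a kernel density plot: a positively skewed distribution shows most of its mass on the left and a long tail extending to the right.

Positive skew means that most cell counts cluster at lower values, while a small number are very high compared to the typical count. These abnormally high cell counts can indicate a disease.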
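
A rough sketch of such a check on simulated data, using only the standard library. The counts are drawn from a lognormal distribution as a hypothetical stand-in for real cell counts, and `sample_skewness` is a helper written for this example. In a notebook, `plt.hist(counts)` would draw the histogram graphically; a text version is printed here instead.

```python
import random
from statistics import mean, median

# Simulated, hypothetical cell counts drawn from a right-skewed distribution.
rng = random.Random(0)
counts = [rng.lognormvariate(6.0, 0.8) for _ in range(1000)]


def sample_skewness(xs):
    """Third standardized moment; positive values indicate a right tail."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Bin the data into 10 equal-width bins for a text histogram.
lo, hi = min(counts), max(counts)
n_bins = 10
width = (hi - lo) / n_bins
bins = [0] * n_bins
for x in counts:
    i = min(int((x - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
    bins[i] += 1

for i, c in enumerate(bins):
    print(f"{lo + i * width:8.0f} | {'#' * (c // 10)}")

print(f"mean = {mean(counts):.1f}, median = {median(counts):.1f}")
print(f"skewness = {sample_skewness(counts):.2f}")
```

For data like this, the bars are tallest on the left and thin out to the right, the mean exceeds the median, and the skewness statistic is positive.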

---