## Exploratory Data Analysis

*Coding along with the third edition of the online version of __[Think Stats](https://allendowney.github.io/ThinkStats/chap01.html)__ by Allen Downey.*

In [12]:
from statadict import parse_stata_dict
import pandas as pd

#### __The `statadict` Python Package__

The `statadict` package is a small third-party Python package, published on PyPI, that parses Stata dictionary (`.dct`) files. It is a dependency of the book's code rather than part of the book's own materials. Here it was installed with `poetry add statadict`.

In "Think Stats", `statadict` is used to read the National Survey of Family Growth (NSFG) dataset, which is a key dataset used throughout the book for teaching statistical concepts. The package itself is not specific to the NSFG; it reads the dictionary file that describes the data layout.

Usage of `statadict` in the book:

- Used in the early chapters to load the NSFG pregnancy data
- Supplies the variable names and column positions that `pd.read_fwf` needs
- The resulting table is the basis for many statistical examples in the book

Working with NSFG Data:

- The NSFG is a national survey conducted by the CDC
- It contains data about family life, marriage, divorce, and pregnancy
- The data files are in fixed-width format with accompanying `.dct` files
- The package reads the `.dct` file so that the data can be loaded into usable Python data structures

#### __The .dct File Format__

A `.dct` (dictionary) file is the format Stata uses to describe the structure of a fixed-width data file. The points below cover its structure, its advantages, and how fixed-width data compares with CSV:

1. Dictionary (.dct) File Structure:
```stata
dictionary using "data.raw" {
    _column(1)   str20  name    %20s   "Person's name"
    _column(22)  int    age     %2f    "Age in years"
    _column(26)  float  income  %8.2f  "Annual income"
}
```
This tells us:
- Variable names (`name`, `age`, `income`)
- Storage types (`str20`, `int`, `float`)
- Input formats (`%20s`, `%2f`, `%8.2f`), which also give each field's width
- Variable descriptions (in quotes)
- Starting column positions (`_column`)
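
To make these pieces concrete, here is a minimal sketch that extracts them from dictionary lines with a regular expression. It assumes every variable line has the layout `_column(N) type name %format "label"`, which is the layout found in the NSFG dictionary files. `DctField` and `parse_dct_line` are hypothetical names used only for illustration; they are not part of `statadict`, which handles more of the format than this sketch does.

```python
import re
from typing import NamedTuple, Optional


class DctField(NamedTuple):
    start: int   # 1-based starting column from _column(N)
    dtype: str   # Stata storage type, e.g. "str20" or "int"
    name: str    # variable name
    width: int   # field width taken from the %format
    label: str   # variable description


# Assumed layout: _column(N) type name %<width>[.<decimals>]<letter> "label"
LINE_RE = re.compile(
    r'_column\((\d+)\)\s+(\w+)\s+(\w+)\s+%(\d+)(?:\.\d+)?[a-z]\s+"([^"]*)"'
)


def parse_dct_line(line: str) -> Optional[DctField]:
    """Return the parsed field, or None for lines that do not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    start, dtype, name, width, label = match.groups()
    return DctField(int(start), dtype, name, int(width), label)


sample = '''
    _column(1)   str20  name    %20s   "Person's name"
    _column(22)  int    age     %2f    "Age in years"
    _column(26)  float  income  %8.2f  "Annual income"
'''
fields = [f for f in map(parse_dct_line, sample.splitlines()) if f]
for field in fields:
    # Convert to 0-based, half-open (start, end) character ranges
    print(field.name, (field.start - 1, field.start - 1 + field.width))
```

Subtracting 1 from `start` converts Stata's 1-based column numbers into the 0-based, half-open ranges that `pd.read_fwf` accepts as `colspecs`.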

2. Advantages over CSV:
   - Describes fixed-width files, where each field occupies precise character positions
   - Handles legacy data formats that predate CSV
   - Maintains exact field widths, which can be crucial for codes and identifiers
   - Leaves no ambiguity about blank fields, because a missing value still occupies its columns
   - Includes metadata such as variable descriptions
   - Declares each variable's storage type explicitly instead of leaving it to be inferred

3. Fixed-width vs CSV:
```
# Fixed-width format:
John Smith           45  50000.00
Jane Doe             32  65000.00

# CSV format:
name,age,income
"John Smith",45,50000.00
"Jane Doe",32,65000.00
```
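
The difference shows up in how each format is read. In this sketch a fixed-width record is cut by character position, while a CSV record is split on its delimiter. The positions in `COLSPECS` and the helper `parse_fixed` are assumptions made for this example, chosen to match the two sample rows built in the code.

```python
import csv
import io

# Build two fixed-width records: name padded to 21 columns, age to 4, income to 8
fixed_rows = [
    f"{'John Smith':<21}{45:<4}{50000.00:>8.2f}",
    f"{'Jane Doe':<21}{32:<4}{65000.00:>8.2f}",
]

# 0-based, half-open (start, end) positions for each field (assumed layout)
COLSPECS = {"name": (0, 21), "age": (21, 25), "income": (25, 33)}


def parse_fixed(line):
    """Slice one fixed-width record by position and strip the padding."""
    raw = {key: line[start:end].strip() for key, (start, end) in COLSPECS.items()}
    return {"name": raw["name"], "age": int(raw["age"]), "income": float(raw["income"])}


fixed_records = [parse_fixed(line) for line in fixed_rows]

# The same data as CSV: fields are located by delimiter, not by position
csv_text = 'name,age,income\n"John Smith",45,50000.00\n"Jane Doe",32,65000.00\n'
csv_records = [
    {"name": row["name"], "age": int(row["age"]), "income": float(row["income"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Both routes should yield the same records
print(fixed_records == csv_records)
```

The fixed-width reader cannot work without the positions, which is exactly the information a dictionary file provides.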

The fixed-width format was common in older systems and is still used by some government agencies and research institutions, particularly for large datasets like census data or survey results. ***The NSFG data used in "Think Stats" is distributed in this format, which is why the `statadict` package is needed to parse its dictionary file.***

In [13]:
dct_file = "../assets/data/2002FemPreg.dct"
dat_file = "../assets/data/2002FemPreg.dat.gz"

In [14]:
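def read_stata(dct_file, dat_file):
    """Read a gzipped fixed-width data file described by a Stata dictionary."""
    # The dictionary supplies the variable names and their column positions
    stata_dict = parse_stata_dict(dct_file)
    resp = pd.read_fwf(
        dat_file,
        names=stata_dict.names,
        colspecs=stata_dict.colspecs,
        compression="gzip",
    )
    return resp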
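
To see what `pd.read_fwf` does with `names` and `colspecs` without needing the NSFG files, the following sketch reads a small in-memory table. The rows, column names, and positions are invented for illustration; in `read_stata` they come from the parsed dictionary instead.

```python
import io

import pandas as pd

# Invented records, padded so that each field starts at a fixed column
rows = [("John Smith", 45, 50000.00), ("Jane Doe", 32, 65000.00)]
text = "".join(f"{name:<21}{age:<4}{income:>8.2f}\n" for name, age, income in rows)

toy = pd.read_fwf(
    io.StringIO(text),
    names=["name", "age", "income"],
    # Each pair is a 0-based, half-open [start, end) character range
    colspecs=[(0, 21), (21, 25), (25, 33)],
)

print(toy.dtypes)  # pandas infers a type for each column
print(toy)
```

Because `names` is passed explicitly, the first line of the data is treated as a record rather than as a header row.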

In [15]:
read_stata(dct_file, dat_file)

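| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | poverty_i | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 1 | 1 | 2 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | 1231 |
| 2 | 2 | 1 | | | | | 5.0 | | 3.0 | 5.0 | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
| 3 | 2 | 2 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
| 4 | 2 | 3 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | 1231 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13588 | 12571 | 1 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | 1227 |
| 13589 | 12571 | 2 | | | | | 3.0 | | | | ... | 0 | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | 1227 |
| 13590 | 12571 | 3 | | | | | 3.0 | | | | ... | 0 | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | 1227 |
| 13591 | 12571 | 4 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | 1227 |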
