# Reading fixed-width data files

In [None]:
import pandas as pd

In [None]:
import sys
sys.path.append('../lib')

In [None]:
import fwf

## Reading Babyboom data

On December 18, 1997, 44 babies were born in a hospital in Brisbane, Australia.

The time of birth for all 44 babies was reported in the local paper; the complete dataset is in a file called `babyboom.dat`.

This is another fixed-width data file, but this time we write the schema by hand instead of parsing it from a `.dct` file.

In [None]:
var_info = fwf.read_schema([
    ('time', 1, 8, int),
    ('sex', 9, 16, int),
    ('weight_g', 17, 24, int),
    ('minutes', 25, 32, int)
])

In [None]:
var_info

We can now extract the widths, names, and types from the schema.

In [None]:
[c.width for c in var_info]
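
The widths follow directly from the start and end positions in the schema. As a minimal sketch, assuming the positions are 1-based and inclusive (the internals of `fwf.read_schema` are not shown here, so the `Column` tuple below is a hypothetical stand-in, not the library's actual type):

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by fwf.read_schema
Column = namedtuple('Column', ['name', 'start', 'end', 'vtype'])

schema = [
    Column('time', 1, 8, int),
    Column('sex', 9, 16, int),
    Column('weight_g', 17, 24, int),
    Column('minutes', 25, 32, int),
]

# With 1-based inclusive positions, a field spans end - start + 1 characters
widths = [c.end - c.start + 1 for c in schema]

# Equivalent 0-based, half-open intervals, the form accepted by the
# `colspecs` argument of pandas.read_fwf
colspecs = [(c.start - 1, c.end) for c in schema]

print(widths)
print(colspecs)
```

Passing `colspecs` rather than `widths` is the more convenient form when the fields of interest are not contiguous.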

In [None]:
df = pd.read_fwf(
    '../data/babyboom.dat',
    widths=[c.width for c in var_info],  # field widths in characters
    names=[c.name for c in var_info],
    dtype={c.name: c.vtype for c in var_info},
    skiprows=59,  # skip the first 59 lines of the file
)
df.head()
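
To see what `pd.read_fwf` does with these arguments without needing the data file, here is a small sketch that parses two made-up rows (illustrative values, not taken from `babyboom.dat`) laid out in four 8-character fields:

```python
import io

import pandas as pd

# Two illustrative rows (made-up values), each field right-aligned in 8 characters
sample = (
    "     130       2    3500      90\n"
    "    1415       1    3200     855\n"
)

demo = pd.read_fwf(
    io.StringIO(sample),
    widths=[8, 8, 8, 8],
    names=['time', 'sex', 'weight_g', 'minutes'],
    header=None,  # the sample has no header row
)
print(demo)
```

Surrounding whitespace is stripped from each field before the values are converted, so the padding does not end up in the resulting columns.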

Alternatively, the `fwf.read_fixed_width` function does this for us.

In [None]:
df = fwf.read_fixed_width(
    '../data/babyboom.dat',
    var_info,
    include_dtypes=True,
    skiprows=59
)

In [None]:
df.head()

The columns are `time`, `sex`, `weight_g`, and `minutes`, where `minutes` is time of birth converted to minutes since midnight.
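
The relationship between `time` and `minutes` can be checked by hand. A minimal sketch, assuming `time` is stored as an integer in 24-hour `HHMM` form (an assumption about the file's encoding; `hhmm_to_minutes` is a hypothetical helper, not part of `fwf`):

```python
def hhmm_to_minutes(hhmm):
    """Convert an integer clock time in HHMM form to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes

print(hhmm_to_minutes(5))     # 00:05
print(hhmm_to_minutes(1415))  # 14:15
```

If the assumption holds, `(df.time.apply(hhmm_to_minutes) == df.minutes).all()` should evaluate to `True`.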

## BRFSS

The National Center for Chronic Disease Prevention and Health Promotion conducts an annual survey as part of the Behavioral Risk Factor Surveillance System (BRFSS).

In 2008, they interviewed 414,509 respondents and asked about their demographics, health, and health risks. Among the data they collected are the weights in kilograms of 398,484 respondents.

In [None]:
var_info = fwf.read_schema([
    ('age', 101, 102, pd.Int64Dtype()),
    ('sex', 143, 143, int),
    ('wtyrago', 127, 130, float),
    ('finalwt', 799, 808, int),
    ('wtkg2', 1254, 1258, float),
    ('htm3', 1251, 1253, pd.Int64Dtype()),
])

In [None]:
var_info

In [None]:
df = fwf.read_fixed_width(
    '../data/brfss.dat.gz',
    var_info,
    include_dtypes=True
)

In [None]:
df.head()

In [None]:
df.dtypes

Clean height: treat the code 999 as missing.

In [None]:
float('NaN')

In [None]:
df['htm3'] = df['htm3'].replace([999], pd.NA)

Clean weight: treat the code 99999 as missing.

In [None]:
df['wtkg2'] = df['wtkg2'].replace([99999], float('NaN'))

Clean weight one year ago: treat the codes 7777 and 9999 as missing.

In [None]:
df['wtyrago'] = df['wtyrago'].replace([7777, 9999], float('NaN'))

Clean age: treat the codes 7 and 9 as missing.

In [None]:
df['age'] = df['age'].replace([7, 9], pd.NA)

Convert both weight columns to kilograms.

In [None]:
df.wtkg2.value_counts()

In [None]:
df.wtkg2 /= 100

In [None]:
df['wtyrago'] = df.wtyrago.apply(lambda x: x/2.2 if x < 9000 else x-9000)
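
The lambda above encodes two cases: values below 9000 are divided by 2.2 (pounds to kilograms), and values of 9000 or more have 9000 subtracted. The same logic as a named function, with a few hand-picked inputs (`wtyrago_to_kg` is a hypothetical name used only for illustration):

```python
import math

def wtyrago_to_kg(x):
    """Mirror of the lambda: pounds -> kg below 9000, otherwise strip the 9000 offset."""
    return x / 2.2 if x < 9000 else x - 9000

print(wtyrago_to_kg(220.0))         # a value in pounds
print(wtyrago_to_kg(9070.0))        # a value carrying the 9000 offset
print(wtyrago_to_kg(float('nan')))  # a missing value
```

Missing values pass through unchanged: any comparison with NaN is false, so the `else` branch runs, and NaN minus 9000 is still NaN.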

In [None]:
df.dtypes

In [None]:
df = df.astype({
    'age': pd.Int64Dtype(),
    'htm3': pd.Int64Dtype()
}).rename(columns={
    'htm3': 'height',
    'wtkg2': 'weight'
})

In [None]:
df.to_feather('../data/brfss.feather')