# Reading fixed-width data files

In [1]:
import pandas as pd

In [2]:
import sys
sys.path.append('../lib')

In [3]:
import fwf

## Reading Babyboom data

On December 18, 1997, 44 babies were born in a hospital in Brisbane, Australia.

The time of birth for all 44 babies was reported in the local paper; the complete dataset is in a file called `babyboom.dat`.

This is another fixed-width data file, but this time we don't have to parse the schema from a `.dct` file; we can declare the columns directly.

In [4]:
var_info = fwf.read_schema([
    ('time', 1, 8, int),
    ('sex', 9, 16, int),
    ('weight_g', 17, 24, int),
    ('minutes', 25, 32, int)
])

In [5]:
var_info

[Column(start=1, vtype=<class 'int'>, name='time', end=9),
 Column(start=9, vtype=<class 'int'>, name='sex', end=17),
 Column(start=17, vtype=<class 'int'>, name='weight_g', end=25),
 Column(start=25, vtype=<class 'int'>, name='minutes', end=33)]

We can now pull the widths, names, and types out of the schema:

In [6]:
[c.width for c in var_info]

[8, 8, 8, 8]
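The `fwf` module itself isn't shown in this notebook, so the following is only a minimal sketch of how such a schema could be represented. It assumes, based on the output above, that positions are 1-based, that the stored `end` is one past the last character, and that `width` is `end - start`. The names `ColumnSketch` and `read_schema_sketch` are hypothetical and are not part of `fwf`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSketch:
    """Hypothetical stand-in for the Column objects printed above."""
    name: str
    start: int      # 1-based position of the first character
    end: int        # one past the 1-based position of the last character
    vtype: object   # a Python type or a pandas dtype

    @property
    def width(self) -> int:
        # Number of characters the field occupies.
        return self.end - self.start


def read_schema_sketch(fields):
    """Turn (name, first, last, type) tuples into ColumnSketch objects."""
    return [
        ColumnSketch(name, first, last + 1, vtype)
        for name, first, last, vtype in fields
    ]


schema = read_schema_sketch([
    ('time', 1, 8, int),
    ('sex', 9, 16, int),
    ('weight_g', 17, 24, int),
    ('minutes', 25, 32, int),
])
widths = [c.width for c in schema]
```

Storing a half-open `end` makes the width a simple subtraction and matches the `(start, end)` convention that `pd.read_fwf` uses for `colspecs` once the start is shifted to 0-based.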

In [7]:
df = pd.read_fwf(
    '../data/babyboom.dat',
    widths=[c.width for c in var_info],         # field widths in characters
    names=[c.name for c in var_info],
    dtype={c.name: c.vtype for c in var_info},  # column name -> type
    skiprows=59                                 # skip the lines before the data begins
)
df.head()

   time  sex  weight_g  minutes
0     5    1      3837        5
1   104    1      3334       64
2   118    2      3554       78
3   155    2      3838      115
4   257    2      3625      177
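To try `pd.read_fwf` with a list of widths without needing the data file, here is a small self-contained sketch. It reuses the first three records shown above; the in-memory layout (four right-aligned fields of 8 characters each) is an assumption made for illustration, not a copy of `babyboom.dat`.

```python
import io

import pandas as pd

# First three records from the output above, laid out as four
# right-aligned fields of 8 characters each (assumed layout).
rows = [(5, 1, 3837, 5), (104, 1, 3334, 64), (118, 2, 3554, 78)]
text = "\n".join("".join(f"{value:>8}" for value in row) for row in rows)

sample = pd.read_fwf(
    io.StringIO(text),
    widths=[8, 8, 8, 8],   # consecutive field widths, starting at column 0
    names=["time", "sex", "weight_g", "minutes"],
    header=None,           # the text has no header row
)
```

Because `widths` describes consecutive fields, it only works when the columns are contiguous and listed in file order; otherwise explicit `colspecs` are needed.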


Or we can use a function in `fwf` that does it for us:

In [8]:
df = fwf.read_fixed_width(
    '../data/babyboom.dat',
    var_info,
    include_dtypes=True,
    skiprows=59
)

In [9]:
df.head()

   time  sex  weight_g  minutes
0     5    1      3837        5
1   104    1      3334       64
2   118    2      3554       78
3   155    2      3838      115
4   257    2      3625      177


The columns are `time`, `sex`, `weight_g`, and `minutes`, where `minutes` is time of birth converted to minutes since midnight.

## BRFSS

The National Center for Chronic Disease Prevention and Health Promotion conducts an annual survey as part of the Behavioral Risk Factor Surveillance System (BRFSS).

In 2008, they interviewed 414,509 respondents and asked about their demographics, health, and health risks. Among the data they collected are the weights in kilograms of 398,484 respondents.

In [10]:
var_info = fwf.read_schema([
    ('age', 101, 102, pd.Int64Dtype()),
    ('sex', 143, 143, int),
    ('wtyrago', 127, 130, float),
    ('finalwt', 799, 808, int),
    ('wtkg2', 1254, 1258, float),
    ('htm3', 1251, 1253, pd.Int64Dtype()),
])

In [11]:
var_info

[Column(start=101, vtype=Int64Dtype(), name='age', end=103),
 Column(start=143, vtype=<class 'int'>, name='sex', end=144),
 Column(start=127, vtype=<class 'float'>, name='wtyrago', end=131),
 Column(start=799, vtype=<class 'int'>, name='finalwt', end=809),
 Column(start=1254, vtype=<class 'float'>, name='wtkg2', end=1259),
 Column(start=1251, vtype=Int64Dtype(), name='htm3', end=1254)]

In [12]:
df = fwf.read_fixed_width(
    '../data/brfss.dat.gz',
    var_info,
    include_dtypes=True
)
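The BRFSS columns are neither contiguous nor listed in file order, so a reader like this has to work from explicit column positions rather than widths. The sketch below shows one way that could be done with `pd.read_fwf` and `colspecs`. The function `read_fixed_width_sketch` is hypothetical, it is not the implementation of `fwf.read_fixed_width`, and the two sample lines are made up for illustration.

```python
import io

import pandas as pd


def read_fixed_width_sketch(source, fields, include_dtypes=False, **kwargs):
    """Read fields given as (name, first, last, type), 1-based inclusive."""
    # pd.read_fwf expects 0-based, half-open (start, end) pairs.
    colspecs = [(first - 1, last) for _, first, last, _ in fields]
    names = [name for name, _, _, _ in fields]
    dtype = {name: vtype for name, _, _, vtype in fields} if include_dtypes else None
    return pd.read_fwf(
        source, colspecs=colspecs, names=names, dtype=dtype, header=None, **kwargs
    )


# Made-up lines: sex in column 1, age in columns 3-4, weight in columns 6-9.
text = "2 82 70.5\n1 65 81.0\n"
fields = [
    ('age', 3, 4, int),     # listed out of file order on purpose
    ('sex', 1, 1, int),
    ('wt', 6, 9, float),
]
toy = read_fixed_width_sketch(io.StringIO(text), fields, include_dtypes=True)
```

The resulting columns follow the order of `fields`, not their order in the file, which is the same behaviour seen in the BRFSS output.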

In [13]:
df.head()

   age  sex  wtyrago  finalwt    wtkg2  htm3
0   82    2    168.0      185   7091.0   157
1   65    2    160.0      126   7273.0   163
2   48    2      NaN      181  99999.0   165
3   61    1    162.0      517   7364.0   170
4   26    1    195.0     1252   8864.0   185


In [14]:
df.dtypes

age          Int64
sex          int64
wtyrago    float64
finalwt      int64
wtkg2      float64
htm3         Int64
dtype: object

Clean height (the code 999 marks a missing value)

In [15]:
float('NaN')

nan

In [16]:
df.htm3.replace([999], pd.NA, inplace=True)

Clean weight

In [17]:
df.wtkg2.replace([99999], float('NaN'), inplace=True)

Clean weight one year ago

In [18]:
df.wtyrago.replace([7777, 9999], float('NaN'), inplace=True)

Clean age

In [19]:
df.age.replace([7, 9], pd.NA, inplace=True)
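The four cleaning steps above follow one pattern: replace sentinel codes with a missing value. A minimal sketch of the same idea on a toy frame follows, using `Series.mask` with column assignment instead of `inplace=True` on an attribute. The toy values are illustrative only, and the `sentinels` mapping is a hypothetical helper, not something defined by `fwf`.

```python
import pandas as pd

# Toy data with the same sentinel codes as the cells above.
toy = pd.DataFrame({
    "htm3": pd.array([157, 999, 165], dtype="Int64"),
    "wtkg2": [7091.0, 99999.0, 7364.0],
})

# Hypothetical mapping: column name -> codes that mean "missing".
sentinels = {"htm3": [999], "wtkg2": [99999]}

for col, codes in sentinels.items():
    # mask() blanks out the entries where the condition is True.
    toy[col] = toy[col].mask(toy[col].isin(codes))
```

Assigning the result back to `toy[col]` avoids relying on in-place modification through a chained attribute access.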

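Convert weight to kilograms. The raw `wtkg2` values are stored in hundredths of a kilogram, as the counts below suggest (6818 corresponds to 68.18 kg).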

In [20]:
df.wtkg2.value_counts()

6818.0     19634
9091.0     19517
7273.0     19197
8182.0     19084
7727.0     16085
           ...  
8800.0         1
11100.0        1
18455.0        1
4600.0         1
24300.0        1
Name: wtkg2, Length: 473, dtype: int64

In [21]:
df.wtkg2 /= 100

In [22]:
df['wtyrago'] = df.wtyrago.apply(lambda x: x/2.2 if x < 9000 else x-9000)
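Reading the lambda above, values below 9000 appear to be pounds (divided by 2.2 to get kilograms), while values of 9000 or more appear to encode kilograms as 9000 plus the weight. That interpretation is inferred from the code, not stated in the notebook. Under that assumption, a vectorized sketch of the same conversion with `np.where` could look like this; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up values: one in pounds, one metric-coded (9000 + kg), one missing.
wtyrago = pd.Series([168.0, 9075.0, float("nan")])

kg = pd.Series(
    np.where(wtyrago < 9000, wtyrago / 2.2, wtyrago - 9000),
    index=wtyrago.index,
)
```

Missing values stay missing: `NaN < 9000` is false, so the second branch computes `NaN - 9000`, which is still `NaN`, matching what the lambda does row by row.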

In [23]:
df.dtypes

age         object
sex          int64
wtyrago    float64
finalwt      int64
wtkg2      float64
htm3        object
dtype: object

Replacing values with `pd.NA` turned `age` and `htm3` into `object` columns, so we cast them back to `Int64` and give the height and weight columns clearer names.

In [24]:
df = df.astype({
    'age': pd.Int64Dtype(),
    'htm3': pd.Int64Dtype()
}).rename(columns={
    'htm3': 'height',
    'wtkg2': 'weight'
})

In [25]:
df.to_feather('../data/brfss.feather')