This notebook reads the Behavioral Risk Factor Surveillance System (BRFSS) 2023 dataset. The BRFSS data are distributed in ASCII fixed-width format, meaning that each variable occupies the same character positions in every row. To extract the variables of interest, we use the official variable layout guide to define the column ranges and names.

Data: https://www.cdc.gov/brfss/annual_data/annual_2023.html (Data Files --> 2023 BRFSS Data (ASCII) [ZIP – 41.5 MB])

Codebook: https://www.cdc.gov/brfss/annual_data/annual_2023.html (2023 Survey Data Information --> 2023 BRFSS Codebook CDC [ZIP – 3 MB])

Variable layout: https://www.cdc.gov/brfss/annual_data/2023/llcp_varlayout_23_onecolumn.html
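
To make the fixed-width idea concrete, here is a minimal sketch on made-up data. The field names (`STATE`, `SEX`, `AGE`), their positions, and the helper `layout_to_colspecs` are hypothetical and are not taken from the BRFSS layout; only `pd.read_fwf` with its `colspecs` and `names` parameters is actual pandas API. The helper converts the 1-based (start, width) pairs used in a layout guide into the 0-based, half-open intervals that pandas expects.

```python
import io

import pandas as pd

# Hypothetical layout: (name, 1-based starting column, field width).
layout = [
    ("STATE", 1, 2),
    ("SEX", 3, 1),
    ("AGE", 6, 2),
]


def layout_to_colspecs(layout):
    """Convert 1-based (start, width) entries to 0-based half-open (start, end) pairs."""
    return [(start - 1, start - 1 + width) for _, start, width in layout]


colspecs_demo = layout_to_colspecs(layout)
names_demo = [name for name, _, _ in layout]

# Two made-up records; columns 4-5 are filler that the layout skips.
raw = "011xx34\n122xx07\n"

demo = pd.read_fwf(io.StringIO(raw), colspecs=colspecs_demo, names=names_demo)
print(colspecs_demo)
print(demo)
```

Deriving the intervals from (start, width) pairs, rather than typing them by hand, is one way to reduce off-by-one mistakes when many variables are selected.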

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [2]:
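# Start positions are offset by one compared to the variable layout documentation,
# because Python uses 0-based indexing (i.e., starting column 1 in the variable
# layout is position 0 in Python). Each (start, end) pair is a half-open interval,
# so the end position is the 0-based start plus the field width.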

colspecs = [
    (0, 2),       # _STATE (1, width=2)
    (87, 88),     # SEXVAR (88, width=1)
    (2064, 2066), # _AGEG5YR (2065, width=2)
    (186, 187),   # EDUCA (187, width=1)
    (203, 205),   # INCOME3 (204, width=2)
    (2059, 2060), # _RACE (2060, width=1)
    (132, 133),   # BPHIGH6
    (148, 149),   # DIABETE4
    (137, 138),   # CVDINFR4
    (138, 139),   # CVDCRHD4
    (139, 140),   # CVDSTRK3
    (224, 225),   # SMOKE100
    (225, 226),   # SMOKDAY2
    (112, 113),   # EXERANY2
    (2081, 2085), # _BMI5 (2082, width=4)
    (2085, 2086), # _BMI5CAT
    (2086, 2087), # _RFBMI5
    (228, 231),   # ALCDAY4 (229, width=3)
    (100, 101),   # GENHLTH
    (101, 103),   # PHYSHLTH (102, width=2)
    (103, 105),   # MENTHLTH (104, width=2)
    (134, 135),   # CHOLCHK3
    (111, 112),   # CHECKUP1
]

column_names = [
    '_STATE', 'SEXVAR', '_AGEG5YR', 'EDUCA', 'INCOME3', '_RACE',
    'BPHIGH6', 'DIABETE4', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3',
    'SMOKE100', 'SMOKDAY2', 'EXERANY2', '_BMI5', '_BMI5CAT', '_RFBMI5',
    'ALCDAY4', 'GENHLTH', 'PHYSHLTH', 'MENTHLTH', 'CHOLCHK3', 'CHECKUP1'
]

# The ASCII file is named LLCP2023.ASC when downloaded from the CDC website;
# it is renamed here to BRFSS2023.ASC to make its contents more evident.
df = pd.read_fwf("data/BRFSS2023.ASC", colspecs=colspecs, names=column_names)

df.sample(50)

Unnamed: 0,_STATE,SEXVAR,_AGEG5YR,EDUCA,INCOME3,_RACE,BPHIGH6,DIABETE4,CVDINFR4,CVDCRHD4,CVDSTRK3,SMOKE100,SMOKDAY2,EXERANY2,_BMI5,_BMI5CAT,_RFBMI5,ALCDAY4,GENHLTH,PHYSHLTH,MENTHLTH,CHOLCHK3,CHECKUP1
157264,24,2,13,5.0,8.0,1.0,1.0,3.0,2.0,2.0,2.0,1.0,3.0,2.0,2654.0,3.0,2,230.0,3.0,88.0,88.0,7.0,1.0
69473,12,1,11,4.0,5.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,3.0,2.0,5007.0,4.0,2,888.0,3.0,88.0,88.0,2.0,1.0
4979,2,2,12,6.0,99.0,9.0,1.0,3.0,2.0,2.0,2.0,2.0,,1.0,2173.0,2.0,1,888.0,1.0,88.0,88.0,2.0,1.0
217693,29,1,11,4.0,3.0,1.0,4.0,1.0,2.0,2.0,2.0,1.0,3.0,1.0,3068.0,4.0,2,888.0,3.0,88.0,88.0,2.0,1.0
124490,19,1,11,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,,,2.0,,,9,,4.0,88.0,88.0,2.0,1.0
354571,49,2,13,6.0,6.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,,1.0,2369.0,2.0,1,888.0,2.0,88.0,2.0,5.0,1.0
72491,12,2,11,5.0,9.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,,2.0,4115.0,4.0,2,107.0,3.0,88.0,88.0,2.0,1.0
190310,26,1,10,5.0,1.0,2.0,1.0,3.0,1.0,2.0,1.0,1.0,3.0,1.0,2929.0,3.0,2,888.0,4.0,15.0,20.0,2.0,1.0
9050,2,1,5,6.0,6.0,1.0,3.0,3.0,2.0,2.0,2.0,2.0,,1.0,2585.0,3.0,2,888.0,3.0,88.0,30.0,8.0,4.0
28943,6,2,12,5.0,99.0,8.0,1.0,1.0,2.0,2.0,2.0,2.0,,1.0,2657.0,3.0,2,888.0,3.0,1.0,2.0,2.0,1.0


Check that the file was parsed correctly by confirming that the value range of each variable corresponds to the codebook.

In [3]:
for col in df.columns:
    print(f"--- Value counts for: {col} ---")
    print(df[col].value_counts(dropna=False).sort_index())
    print("\n")


--- Value counts for: _STATE ---
_STATE
1      4362
2      5525
4     12036
5      5351
6     11976
8      8783
9      9501
10     4282
11     3207
12    13255
13     8227
15     7832
16     6895
17     5279
18    10993
19     8876
20     9884
22     5388
23    12255
24    17255
25     9528
26     9978
27    16170
28     4069
29     7219
30     7143
31    12886
32     2650
33     6960
34     9328
35     3220
36    17349
37     4088
38     5745
39    13384
40     6727
41     6234
44     5781
45    10038
46     5886
47     5645
48    10059
49    11154
50     7636
51     6981
53    26444
54     4339
55    12819
56     4484
66     1559
72     4594
78     2064
Name: count, dtype: int64


--- Value counts for: SEXVAR ---
SEXVAR
1    203782
2    229541
Name: count, dtype: int64


--- Value counts for: _AGEG5YR ---
_AGEG5YR
1     26280
2     21247
3     24803
4     27153
5     28463
6     27070
7     31291
8     34219
9     41974
10    46099
11    43533
12    34543
13    38869
14     7779
Name: count, dtype: int64
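
The visual check above can be complemented by a programmatic one. The sketch below assumes a hypothetical `allowed_values` dictionary that would be filled in by hand from the codebook; the two sets shown simply mirror the values visible in the output above for `SEXVAR` and `_AGEG5YR` and are not a substitute for the codebook. The helper name `find_unexpected_values` and the toy frame are also made up for illustration.

```python
import pandas as pd

# Hypothetical allowed codes, to be filled in from the codebook.
allowed_values = {
    "SEXVAR": {1, 2},
    "_AGEG5YR": set(range(1, 15)),
}


def find_unexpected_values(frame, allowed):
    """Return {column: sorted unexpected non-missing values} for columns that fail."""
    problems = {}
    for col, valid in allowed.items():
        # tolist() converts NumPy scalars to plain Python numbers.
        observed = set(frame[col].dropna().unique().tolist())
        unexpected = observed - valid
        if unexpected:
            problems[col] = sorted(unexpected)
    return problems


# Toy frame standing in for df; the 15 in _AGEG5YR is a deliberate error.
toy = pd.DataFrame({"SEXVAR": [1, 2, 2], "_AGEG5YR": [3, 15, 14]})
print(find_unexpected_values(toy, allowed_values))
```

Applied to the real data, the call would be `find_unexpected_values(df, allowed_values)`; an empty dictionary would mean that no listed column contains a value outside the range entered for it. Out-of-range values are a common symptom of a column specification that is shifted by one position.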

Save to CSV

In [4]:
df.to_csv("data/BRFSS2023.csv")

Ensure the saved file parses correctly by reloading it and comparing its shape with the original DataFrame.

In [5]:
saved_data = pd.read_csv("data/BRFSS2023.csv", index_col=0)

In [6]:
print(df.shape)
print(saved_data.shape)

(433323, 23)
(433323, 23)
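
Matching shapes only show that no rows or columns were lost. A stricter check, sketched here on a small made-up frame rather than the real data, is to compare the contents with `pd.testing.assert_frame_equal`, which raises an `AssertionError` if values, dtypes, or labels differ. The in-memory buffer stands in for the CSV file on disk, and the values are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Small stand-in for df, with an integer column and float columns containing NaN.
original = pd.DataFrame({
    "_STATE": [1, 2, 78],
    "SMOKDAY2": [3.0, np.nan, 1.0],
    "_BMI5": [3047.0, 2856.0, np.nan],
})

# Write to an in-memory CSV (index included, as in the notebook) and read it back.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# Raises AssertionError on any mismatch; silent when the frames agree.
pd.testing.assert_frame_equal(original, reloaded)
print("Round trip preserved values, dtypes, and labels.")
```

For the notebook's data, the equivalent call would be `pd.testing.assert_frame_equal(df, saved_data)`.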


In [7]:
saved_data

Unnamed: 0,_STATE,SEXVAR,_AGEG5YR,EDUCA,INCOME3,_RACE,BPHIGH6,DIABETE4,CVDINFR4,CVDCRHD4,CVDSTRK3,SMOKE100,SMOKDAY2,EXERANY2,_BMI5,_BMI5CAT,_RFBMI5,ALCDAY4,GENHLTH,PHYSHLTH,MENTHLTH,CHOLCHK3,CHECKUP1
0,1,2,13,5.0,99.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,3047.0,4.0,2,888.0,2.0,88.0,88.0,3.0,2.0
1,1,2,13,5.0,99.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,,1.0,2856.0,3.0,2,888.0,2.0,88.0,88.0,2.0,2.0
2,1,2,13,4.0,2.0,2.0,1.0,3.0,2.0,2.0,2.0,1.0,3.0,1.0,2231.0,2.0,1,888.0,4.0,6.0,2.0,2.0,1.0
3,1,2,12,5.0,99.0,1.0,3.0,3.0,2.0,2.0,2.0,2.0,,1.0,2744.0,3.0,2,888.0,2.0,2.0,88.0,3.0,3.0
4,1,2,12,5.0,7.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,1.0,2585.0,3.0,2,202.0,4.0,88.0,88.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433318,78,1,10,5.0,5.0,2.0,1.0,3.0,2.0,2.0,7.0,2.0,,1.0,2921.0,3.0,2,105.0,3.0,12.0,30.0,2.0,1.0
433319,78,2,3,6.0,6.0,2.0,3.0,3.0,2.0,2.0,2.0,2.0,,2.0,2496.0,2.0,1,888.0,2.0,88.0,88.0,2.0,1.0
433320,78,2,7,6.0,10.0,8.0,3.0,3.0,2.0,2.0,2.0,2.0,,1.0,3438.0,4.0,2,201.0,2.0,10.0,88.0,2.0,1.0
433321,78,2,10,6.0,3.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,,1.0,2386.0,2.0,1,888.0,3.0,88.0,88.0,2.0,1.0
