## BRFSS Data Cleaning

### 2022 Files:
- [BRFSS Files](https://www.cdc.gov/brfss/annual_data/annual_2022.html)
- [2022 BRFSS Overview CDC](https://www.cdc.gov/brfss/annual_data/2022/pdf/Overview_2022-508.pdf)
- [2022 BRFSS Codebook CDC](https://www.cdc.gov/brfss/annual_data/2022/zip/codebook22_llcp-v2-508.zip)
- [Calculated Variables in Data Files CDC](https://www.cdc.gov/brfss/annual_data/2022/pdf/2022-calculated-variables-version4-508.pdf)
- [Summary Matrix of Calculated Variables (CV) in the 2022 Data File](https://www.cdc.gov/brfss/annual_data/2022/summary_matrix_22.html)
- [Variable Layout](https://www.cdc.gov/brfss/annual_data/2022/llcp_varlayout_22_onecolumn.html)
- [Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques](https://www.cdc.gov/pcd/issues/2019/19_0109.htm)

In [1]:
# package imports go here
import pandas as pd
import numpy as np
import fastparquet as fp
import os
import sys
import zipfile
import requests
import io
import pickle

sys.path.insert(1, '../pkgs')
import ml_functions as mlfuncs

In [2]:
# URL for 2022 Codebook that describes all the fields in the 2022 dataset
codebook_url = "https://www.cdc.gov/brfss/annual_data/2022/zip/codebook22_llcp-v2-508.zip"

In [3]:
# Step 1: Download the zip file from the URL
response = requests.get(codebook_url)
response.raise_for_status()  # Check that the request was successful

# Step 2: Read the zip file from the downloaded content
with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
    # List all contents of the zip file
    zip_contents = zip_ref.namelist()
    print("Contents of the zip file:", zip_contents)
    
    # # Extract all files to a directory
    # zip_ref.extractall('path/to/extract/to')

    # Read a specific file without extracting
    specific_file = 'USCODE22_LLCP_102523.HTML'
    with zip_ref.open(specific_file) as file:
        content = file.read()
#        print("\nContent of the specific file:")
        print(content)

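# Parse every HTML table in the codebook into a list of DataFrames.
# read_html documents str, path, or file-like input, so wrap the raw bytes in BytesIO.
codebook_data = pd.read_html(io.BytesIO(content))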

Contents of the zip file: ['USCODE22_LLCP_102523.HTML']
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\r\n<html>\r\n<head>\r\n<meta name="Generator" content="SAS Software Version 9.4, see www.sas.com">\r\n<meta http-equiv="Content-type" content="text/html; charset=windows-1252">\r\n<title>LLCP 2022: Codebook Report</title>\r\n<style type="text/css">\r\n<!--\r\n.batch\r\n{\r\n  border: 1px solid #000000;\r\n  border-collapse: separate;\r\n  border-spacing: 1px;\r\n  font-family: \'courier new\';\r\n}\r\n.bycontentfolder\r\n{\r\n  margin-left: 6pt;\r\n}\r\n.bylinecontainer\r\n{\r\n  width: 100%;\r\n}\r\n.contentfolder\r\n{\r\n  margin-left: 6pt;\r\n}\r\n.contentitem\r\n{\r\n  margin-left: 6pt;\r\n}\r\n.contents\r\n{\r\n}\r\n.folderaction\r\n{\r\n  margin-left: 6pt;\r\n}\r\n.graph\r\n{\r\n  border: 1px solid #000000;\r\n  border-collapse: separate;\r\n  border-spacing: 1px;\r\n}\r\n.indexaction\r\n{\r\n  margin-left: 6pt;\r\n}\r\n.indexitem\r\n{\r\n  margin-left: 6pt;\r
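
The in-memory zip handling above can be tried without a network connection. The following is a minimal sketch, assuming a tiny hand-built archive as a stand-in for `response.content`; the member name `SAMPLE.HTML` and the HTML text are made up for illustration. It also decodes the bytes with `windows-1252`, the charset declared in the codebook's `<meta>` tag, which is one way to get readable text instead of a raw bytes dump.

```python
import io
import zipfile

# Hypothetical stand-in for the downloaded codebook: a one-file zip built in memory.
html_text = "<html><body><p>Codebook \u00b4sample\u00b4</p></body></html>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("SAMPLE.HTML", html_text.encode("windows-1252"))
zip_bytes = buffer.getvalue()  # plays the role of response.content

# Open the archive from bytes, list its members, and read one without extracting.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_ref:
    member_names = zip_ref.namelist()
    with zip_ref.open(member_names[0]) as member:
        raw = member.read()

# Decode with the charset the document declares, then preview only the start.
decoded = raw.decode("windows-1252")
print(member_names)
print(decoded[:60])
```

Printing a short decoded preview rather than the full `content` keeps the cell output manageable for a file as large as the real codebook.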

In [4]:
len(codebook_data)

650

In [5]:
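i = 128
df = codebook_data[i]
# The top header level carries the variable's metadata fields, separated by ordinary
# spaces (words inside a field are joined by non-breaking spaces, \xa0)
labels = df.columns[1][0].replace(': ',':').split(' ')
print(labels)
label_dict = mlfuncs.get_label_dict(i, labels)  # named to avoid shadowing the built-in dict

for idx, key in enumerate(label_dict):
    print(f"{idx} {key}:  {label_dict[key]}")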

['Label:\xa0Income\xa0Level', 'Section\xa0Name:\xa0Demographics', 'Core\xa0Section\xa0Number:\xa08', 'Question\xa0Number:\xa016', 'Column:\xa0186-187', 'Type\xa0of\xa0Variable:\xa0Num', 'SAS\xa0Variable\xa0Name:\xa0INCOME3', 'Question\xa0Prologue:Question:Is\xa0your\xa0annual\xa0household\xa0income\xa0from\xa0all\xa0sources:(If\xa0respondent\xa0refuses\xa0at\xa0any\xa0income\xa0level,\xa0code\xa0´Refused.´)']
0 Label:  Income Level
1 Section Name:  Demographics
2 Core Section Number:  8
3 Question Number:  16
4 Column:  186-187
5 Type of Variable:  Num
6 SAS Variable Name:  INCOME3
7 Question Prologue:  
8 Question:  Is your annual household income from all sources: (If respondent refuses at any income level, code ´Refused.´)
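
The helper `mlfuncs.get_label_dict` lives in a local package that is not shown here. As a rough illustration of the kind of parsing involved, the sketch below turns a list of `Field:\xa0Value` strings into a dictionary. The function name `parse_label_header` is hypothetical, and this is not the author's implementation; it may differ in details such as spacing inside the question text.

```python
def parse_label_header(labels):
    """Hypothetical parser: split 'Key:Value' strings whose words use \\xa0."""
    parsed = {}
    for item in labels:
        text = item.replace("\xa0", " ")       # non-breaking spaces to normal spaces
        key, _, value = text.partition(":")    # split on the first colon only
        if key == "Question Prologue" and "Question:" in value:
            # The prologue and the question arrive fused in a single string.
            prologue, _, question = value.partition("Question:")
            parsed["Question Prologue"] = prologue.strip()
            parsed["Question"] = question.strip()
        else:
            parsed[key.strip()] = value.strip()
    return parsed


# Sample input copied from the printed labels for INCOME3 (question text shortened).
sample_labels = [
    "Label:\xa0Income\xa0Level",
    "Column:\xa0186-187",
    "SAS\xa0Variable\xa0Name:\xa0INCOME3",
    "Question\xa0Prologue:Question:Is\xa0your\xa0annual\xa0household\xa0income",
]
parsed = parse_label_header(sample_labels)
for key, value in parsed.items():
    print(f"{key}: {value}")
```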


In [6]:
#break evaluation here

In [7]:
codebook_list = []
# for i in range(2, 8, 2):  # short range for debugging
# Process every second table, starting at index 2
for i in range(2, len(codebook_data), 2):
    df = codebook_data[i]
    labels = df.columns[1][0].replace(': ',':').split(' ')
    # Flatten the two-level header, keeping only the second level (the column names)
    df.columns = [col[1] for col in df.columns.values]
    entry = mlfuncs.get_label_dict(i, labels)  # named to avoid shadowing the built-in dict
    entry['table'] = df
    codebook_list.append(entry)

print(f"Codebook List:  Len:{len(codebook_list)}")


Codebook List:  Len:324
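
The count follows from the loop bounds: 650 parsed tables, taking every second one from index 2, gives 324 entries. Because each entry is a plain dictionary, a lookup keyed by SAS variable name is often handy. The following is a minimal sketch with a toy two-entry list standing in for `codebook_list`; the name `codebook_by_name` is an assumption, and `None` replaces the real `table` DataFrames.

```python
# Toy stand-in for codebook_list; field values are taken from outputs in this notebook.
sample_codebook = [
    {"SAS Variable Name": "_STATE", "Label": "State FIPS Code", "table": None},
    {"SAS Variable Name": "INCOME3", "Label": "Income Level", "table": None},
]

# Sanity check on the loop bounds: 650 tables, step 2 from index 2, gives 324 entries.
entry_count = len(range(2, 650, 2))
assert entry_count == 324

# Hypothetical index: SAS variable name -> codebook entry.
codebook_by_name = {entry["SAS Variable Name"]: entry for entry in sample_codebook}
print(codebook_by_name["INCOME3"]["Label"])
```

With the real list, `codebook_by_name["INCOME3"]["table"]` would give the value/label frequency table for that variable.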


In [9]:
# Save list/dict/dataframe structure to file
with open('../data/codebook2022.pkl', 'wb') as file:
    pickle.dump(codebook_list, file)
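
A save like this can be checked with a quick round trip. The sketch below is a minimal illustration that writes a small sample list to a temporary directory and reads it back; the file name `codebook_sample.pkl` and the sample entries are assumptions, with field values copied from outputs in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Small sample shaped like the codebook entries, without the DataFrame tables.
sample_codebook = [
    {"SAS Variable Name": "_STATE", "Label": "State FIPS Code", "Column": "1-2"},
    {"SAS Variable Name": "INCOME3", "Label": "Income Level", "Column": "186-187"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "codebook_sample.pkl"
    # Write the structure to disk in binary mode.
    with open(pickle_path, "wb") as file:
        pickle.dump(sample_codebook, file)
    # Read it back; only unpickle files from sources you trust.
    with open(pickle_path, "rb") as file:
        restored = pickle.load(file)

print(restored == sample_codebook)
```

For the real list, entries hold DataFrames, so a direct `==` comparison of the whole structure would not work as-is; comparing each `table` with `DataFrame.equals` is one option.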

In [10]:
# Load the list of codebook dicts (each with a 'table' DataFrame) from file
with open('../data/codebook2022.pkl', 'rb') as file:
    loaded_codebook = pickle.load(file)

In [11]:
loaded_codebook


[{'Label': 'State FIPS Code',
  'Section Name': 'Record Identification',
  'Section Number': '0',
  'Question Number': '1',
  'Column': '1-2',
  'Type of Variable': 'Num',
  'SAS Variable Name': '_STATE',
  'Question Prologue': '',
  'Question': 'State FIPS Code',
  'table':     Value           Value Label  Frequency  Percentage  Weighted Percentage
  0       1               Alabama       4506        1.01                 1.50
  1       2                Alaska       5865        1.32                 0.21
  2       4               Arizona      10185        2.29                 2.18
  3       5              Arkansas       5309        1.19                 0.89
  4       6            California      10952        2.46                11.63
  5       8              Colorado       9365        2.10                 1.76
  6       9           Connecticut       9784        2.20                 1.09
  7      10              Delaware       3987        0.90                 0.30
  8      11  District of