## Get variable names and descriptions from BRFSS codebooks

The first step in getting the necessary data from the BRFSS CSV files is to figure out which variables I want to extract from each year's CSV. Unfortunately, the variable names occasionally change between years, and the codebooks that describe the question each variable corresponds to are PDFs that run to several hundred pages. As far as I can tell, there is no CSV or spreadsheet with the variable names in one column and the questions in another.

Therefore, the purpose of this notebook is to extract the variable names from the codebooks, and map them to the question that they correspond to.

In [1]:
import pickle
import re
from functools import reduce

import pandas as pd
import PyPDF2

from progress_bar import log_progress

First, let's load one of the CSVs and take a look at the data structure.

In [4]:
brfss_2014 = pd.read_csv("../data/brfss/csv/brfss2014.csv", encoding = "cp1252", nrows=100)

In [5]:
brfss_2014.head()

Unnamed: 0.1,Unnamed: 0,x.state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,x.psu,...,x.fobtfs,x.crcrec,x.aidtst3,x.impeduc,x.impmrtl,x.imphome,rcsbrac1,rcsrace1,rchisla1,rcsbirth
0,1,1,1,1172014,1,17,2014,1100,2014000001,2014000001,...,2.0,1.0,2.0,5,1,1,,,,
1,2,1,1,1072014,1,7,2014,1100,2014000002,2014000002,...,2.0,2.0,2.0,4,1,1,,,,
2,3,1,1,1092014,1,9,2014,1100,2014000003,2014000003,...,2.0,2.0,2.0,6,1,1,,,,
3,4,1,1,1072014,1,7,2014,1100,2014000004,2014000004,...,2.0,1.0,2.0,6,3,1,,,,
4,5,1,1,1162014,1,16,2014,1100,2014000005,2014000005,...,2.0,1.0,2.0,5,1,1,,,,


To map each variable name to its question, we first read in each codebook PDF as a string, and then figure out how to parse that string to get each variable name and the question that corresponds to it.

In [5]:
def extract_pdf_string(pdf_path):
    """Return the text of every page in the PDF as one whitespace-normalized string."""
    # Note: this uses the legacy PyPDF2 (< 3.0) API names.
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        num_pages = pdf_reader.numPages

        pdf_string = ""

        for n in range(num_pages):
            page_string = pdf_reader.getPage(n).extractText()
            # Collapse runs of whitespace (including newlines) into single spaces.
            clean_page_string = re.sub(r"\s+", " ", page_string).strip()
            pdf_string += clean_page_string

    return pdf_string
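
To make the cleaning step concrete, here is a minimal sketch on made-up text (the sample strings are invented, not taken from a real codebook). It shows why collapsing whitespace matters: a phrase such as 'SAS Variable' that was broken across lines or padded with extra spaces becomes a single predictable marker that can be split on later. Unlike the function above, this sketch joins pages with a single space, which is an assumption of the example.

```python
import re

# Invented stand-ins for the text of two PDF pages (not real codebook text).
raw_pages = [
    "Label: General Health\nSAS Variable   Name: GENHLTH\nQuestion: Would you say...",
    "Label: Have Diabetes\nSAS\nVariable Name: DIABETE3\nQuestion: (Ever told)...",
]

def clean_page(page_text):
    # Collapse every run of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", page_text).strip()

cleaned = " ".join(clean_page(page) for page in raw_pages)

# After cleaning, both occurrences of the marker read exactly 'SAS Variable'.
sections = cleaned.split("SAS Variable")
for section in sections:
    print(repr(section))
```

Without the cleaning step, the second marker ('SAS', newline, 'Variable') would not match the split string and its section would be merged into the previous one.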

In [5]:
# For each section, we assume that the first word that we come across that starts with an underscore is the
# variable name in that section. Otherwise, the first word in which more than 60% of the characters are
# upper-case letters is assumed to be the variable name. We can't require every character to be a capital
# letter, because a lot of the variable names have numbers in them as well.

def find_var(words):
    for word in words:
        # Skip known non-variable tokens and very short words.
        if word in ('BRFSS', 'SAS') or len(word) < 3:
            continue
        if word[0] == '_':
            return word
        ratio = sum(letter.isupper() for letter in word) / len(word)
        if ratio > 0.6:
            return word
    # No candidate was found in this section.
    return None
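
As a quick illustration of the upper-case ratio heuristic, the sketch below scores a few tokens. The helper name `uppercase_ratio` and the token list are made up for this example; the variable names themselves appear elsewhere in this notebook.

```python
def uppercase_ratio(word):
    # Fraction of the characters in the word that are upper-case letters.
    return sum(ch.isupper() for ch in word) / len(word)

# Tokens of the kind that might appear at the start of a codebook section.
tokens = ["Name:", "Question:", "DIABETE3", "_LLCPWT"]

for token in tokens:
    every_char_upper = all(ch.isupper() for ch in token)
    print(token, round(uppercase_ratio(token), 3), every_char_upper)
```

`DIABETE3` scores 7/8 because the trailing digit is not an upper-case letter, so a test that requires every character to be upper-case rejects it, while the 0.6 threshold accepts it and still rejects ordinary capitalized words such as `Name:`.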

In [227]:
# This function takes a section, finds the variable name, and then finds the text of the question that
# was asked for that variable. The question text is located after 'Description:' in some codebooks and after
# 'Question:' in others, so we check which one is present and split on it. We then search the second
# subsection produced by that split. The end of the question and the start of the table of values is usually
# marked by the word 'Value' or 'Weighted', so we grab everything after 'Question:'/'Description:' and up to
# 'Value'/'Weighted', and that's the text of our question for this variable.

def extract_variable_name_and_description(section):
    words = section.split(" ")
    var_name = find_var(words)
    
    if 'Description:' in section:
        subsections = section.split("Description:")
    elif 'Question:' in section:
        subsections = section.split("Question:")
    else:
        description = None
        return var_name, description
        
    description = subsections[1]

    value_limit = description.find("Value")
    weighted_limit = description.find("Weighted")

    # str.find returns -1 when a marker is absent, so only slice on markers that were found.
    # If neither marker is present, the whole remaining text is kept.
    limits = [limit for limit in (value_limit, weighted_limit) if limit != -1]
    if limits:
        description = description[:min(limits)]

    description = description.strip()

    if description == '':
        description = None
    
    return var_name, description
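
The slicing logic can also be expressed as one regular expression. The sketch below is a regex-based variant for illustration, not the function above: it takes whichever of 'Description:' or 'Question:' comes first in the text, whereas the function above prefers 'Description:' whenever it is present. The section text is invented.

```python
import re

# Invented section text laid out like the codebook sections described above.
section = (
    " Name: DIABETE3 Question: (Ever told) you have diabetes "
    "Value Value Label Frequency Percentage Weighted Percentage 1 Yes ..."
)

# Lazily capture everything after the marker up to the first 'Value' or
# 'Weighted', or up to the end of the string if neither appears.
pattern = re.compile(r"(?:Description:|Question:)(.*?)(?:Value|Weighted|$)", re.DOTALL)

match = pattern.search(section)
description = match.group(1).strip() if match else None
print(description)
```

Both approaches share one limitation: a question whose own wording contains 'Value' or 'Weighted' will be cut short at that word.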

In [505]:
years = range(1999, 2018)

In [228]:
# This cell iterates through each year's codebook and reads in the PDF contents as a string.
# We split the string on the phrase 'SAS Variable', since this marks each section that describes a variable
# and its question. We then iterate through each section, extract the variable name and description, and
# put them into a dataframe. Finally, we store the variable name/question dataframe for each codebook in a
# dictionary, where the year of that codebook is the key for that codebook's dataframe.

codebook_dfs_dict = {}

for year in log_progress(years):
    pdf_string = extract_pdf_string(f"../data/brfss/codebooks/{year}_codebook.pdf")
    pdf_sections = pdf_string.split("SAS Variable")

    var_desc_array = []
    for section in pdf_sections:
        row = extract_variable_name_and_description(section)
        var_desc_array.append(row)

    var_desc_df = pd.DataFrame(var_desc_array, columns=['var_name', 'description'])
    
    # We'll drop the rows where the variable is none, and then also get rid of duplicate rows where the var_names
    # are the same.
    var_desc_df = var_desc_df.dropna(subset=['var_name']).drop_duplicates(subset=['var_name'])

    codebook_dfs_dict[year] = var_desc_df


In [229]:
with open("../data/pickles/codebook_dfs_dict.pkl", "wb") as f:
    pickle.dump(codebook_dfs_dict, f)

We now have a dictionary of codebook dataframes, one per year; each codebook dataframe has a column 'var_name' and a column 'description', and each row is a different variable. Now, let's try joining these dataframes together. The codebooks use different variable names for the same information in different years, which unfortunately means that, in order to see how these names have evolved, we have to use an outer join to get a master dataframe of all of the variable names and their descriptions over the years.

In [232]:
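# rename() returns a copy here, so the dataframes stored in codebook_dfs_dict are left unchanged.
master_df = codebook_dfs_dict[1999].rename(columns={'description': 1999})
for year, df in codebook_dfs_dict.items():
    if year == 1999:
        continue
    # Label each year's description column with its year before joining,
    # so the merged frame never holds two 'description' columns.
    master_df = pd.merge(master_df, df.rename(columns={'description': year}), how='outer', on='var_name')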
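
To see what the outer join does, here is a small sketch on two toy frames. The frames and their abbreviated descriptions are placeholders invented for this example, and `functools.reduce` is used as one possible way to fold the per-year frames together.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for two entries of codebook_dfs_dict (placeholder descriptions).
toy_codebooks = {
    2010: pd.DataFrame({"var_name": ["_FINALWT", "DIABETE2"],
                        "description": ["Final weight", "Ever told diabetes"]}),
    2011: pd.DataFrame({"var_name": ["_LLCPWT", "DIABETE3"],
                        "description": ["Final weight (landline and cell)", "Ever told diabetes"]}),
}

# Rename each 'description' column to its year so the merged columns never collide.
per_year = [df.rename(columns={"description": year}) for year, df in toy_codebooks.items()]

# Fold the list into one wide frame with successive outer joins on var_name.
toy_master = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="var_name"),
    per_year,
)
print(toy_master)
```

Every variable name from either year gets a row, and a missing value marks the years in which that name does not appear. This is what makes renames such as `DIABETE2` to `DIABETE3` visible.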

In [516]:
with open("../data/pickles/master_codebook_all_years.pkl", "wb") as f:
    pickle.dump(master_df, f)

`_LLCPWT` and `_FINALWT` are the final weights that determine how much each respondent's answers should count when calculating any higher-level summary statistics. Certain demographics are more likely to be sampled than others, so it's necessary to multiply each respondent's answers by their final weight when calculating group-level statistics.
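
As a rough illustration of why the weights matter, the sketch below compares an unweighted and a weighted proportion using the standard weighted-mean formula. The answers and weights are invented toy numbers, not BRFSS data, and the example covers only a point estimate.

```python
# Invented toy data: 1 means "yes", 0 means "no".
answers = [1, 0, 0, 1]
# Invented weights standing in for each respondent's final weight.
weights = [100.0, 300.0, 500.0, 100.0]

# Plain proportion: every respondent counts equally.
unweighted = sum(answers) / len(answers)

# Weighted proportion: each answer is multiplied by the respondent's weight,
# and the total is divided by the sum of the weights.
weighted = sum(a * w for a, w in zip(answers, weights)) / sum(weights)

print(unweighted)
print(weighted)
```

Here the two "yes" respondents carry small weights, so the weighted proportion (about 0.2) is much lower than the unweighted one (0.5).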

In [508]:
master_df[master_df.var_name=='_LLCPWT']

Unnamed: 0,var_name,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1189,_LLCPWT,,,,,,,,,,,,,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...,Final weight assigned to each respondent: Land...


In [304]:
master_df[master_df.var_name=='_FINALWT']

Unnamed: 0,var_name,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
243,_FINALWT,,,Final weight assigned to each respondent.,Final weight assigned to each respondent,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,Final weight assigned to each respondent (Post...,,,,,,,


We can see that descriptions for `_FINALWT` were extracted from the 2001 through 2010 codebooks (none was extracted for 1999 or 2000), and that in 2011 the BRFSS started using `_LLCPWT`.

We want to translate the master dataframe of all the variable names and their descriptions (with each year as a different column) into a dictionary. What we're interested in is the variable name as a key, and the most common description as the value. The description/question wording sometimes changes between years, so we'll pick the wording that has been used in the most years.

In [330]:
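def make_consensus_var_desc_dict(df):
    # Transpose so that each column is one variable and each row is one year.
    descriptions_by_var = df.set_index('var_name').transpose()
    clean_dict = {}
    for var_name, descriptions in descriptions_by_var.items():
        # mode() ignores missing values and returns the most frequent description(s).
        modes = descriptions.mode()
        clean_dict[var_name] = None if modes.empty else modes[0]
    return clean_dict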
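
One detail of the mode-based consensus is worth seeing on toy data: when every year's wording is different, all values tie, and `Series.mode()` returns them in sorted order, so taking element `[0]` picks the alphabetically first description rather than a genuinely common one. The strings below are shortened placeholders invented for this sketch.

```python
import pandas as pd

# One wording used in two years, another in one year, and two missing years.
clear_winner = pd.Series(
    ["Ever told diabetes?", "Ever told diabetes?", "Told diabetes?", None, None]
)

# Every year has a different wording, so all three values tie.
all_different = pd.Series(
    ["6.1. How long ...", "5.1. People may ...", "Have you ever ..."]
)

print(clear_winner.mode()[0])
print(list(all_different.mode()))
```

In the tie case the first element is the string starting with '5.1.', simply because it sorts first.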

In [512]:
consensus_var_desc_dict = make_consensus_var_desc_dict(master_df)

In [514]:
with open("../data/pickles/consensus_var_desc_dict.pkl", "wb") as f:
    pickle.dump(consensus_var_desc_dict, f)

We can then iterate through the consensus variable/description dictionary and check which variable names contain the characters 'DIABE'; these are likely variables related to diabetes, so we can see which questions correspond to them.

In [344]:
for key, value in consensus_var_desc_dict.items():
    if 'DIABE' in key:
        print(key)
        print(value)

DIABETES
5.1. [People may] provide regular care or assistance to [someone] who is elderly or has a long -term illness or di sability. During the past month, did you provide any such care or assistance to a family member or friend who is 60+ years of age? Column: 86
DIABEYE
Has a doctor ever told you that diabetes has affected your eyes or that you had retinopathy?
DIABEDU
Have you ever taken a course or class in how to manage your diabetes yourself?
DIABETE2
Have you ever been told by a doctor that you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre -diabetes or borderline diabetes, use response code 4.)
DIABETE3
(Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre -diabetes or borderline diabetes, use response code 4.)


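Clearly, `DIABETES`, `DIABETE2`, and `DIABETE3` are all asking about a diabetes diagnosis (the consensus description printed for `DIABETES` appears to be a mis-parsed question picked up from the 2000 codebook), so we can look at the master dataframe to see which years each variable name was used.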

In [345]:
master_df[master_df.var_name.str.contains("DIABET") == True]

Unnamed: 0,var_name,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
40,DIABETES,6.1. How long has it been since you last visit...,5.1. [People may] provide regular care or assi...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,,,,,,,,,,,,,,
614,DIABETE2,,,,,,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,Have you ever been told by a doctor that you h...,,,,,,,
1088,DIABETE3,,,,,,,,,,,,,"(Ever told) you have diabetes (If ""Yes"" and re...","(Ever told) you have diabetes (If ""Yes"" and re...","(Ever told) you have diabetes (If ""Yes"" and re...","(Ever told) you have diabetes (If ""Yes"" and re...","(Ever told) you have diabetes (If ""Yes"" and re...",(Ever told) you have diabetes (If ´Yes´ and re...,(Ever told) you have diabetes (If ´Yes´ and re...
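
Once the year ranges are known, one way to use them is a small lookup from year to variable name. The sketch below is hypothetical: the dictionary and helper names are introduced only for illustration, and the year ranges are read off the output above.

```python
# Year ranges read off the master dataframe output above. The dict and helper
# names are hypothetical, introduced only for this illustration.
DIABETES_VAR_BY_YEARS = {
    range(1999, 2004): "DIABETES",
    range(2004, 2011): "DIABETE2",
    range(2011, 2018): "DIABETE3",
}

def diabetes_var_for(year):
    # Return the codebook variable name that holds the diabetes question in a given year.
    for years, var_name in DIABETES_VAR_BY_YEARS.items():
        if year in years:
            return var_name
    raise KeyError(f"no diabetes variable recorded for {year}")

print(diabetes_var_for(2003), diabetes_var_for(2004), diabetes_var_for(2017))
```

Two caveats: the 1999 and 2000 descriptions for `DIABETES` were mis-parsed, so this output alone does not confirm the question wording for those years, and the 2014 CSV preview earlier in the notebook shows lower-case column names, so the returned name may need `.lower()` before it is used to select a column.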
